"""
The module represents data from Hunspell's ``*.aff`` file.
This text file has the following format:
.. code-block:: text
# pseudo-comment
DIRECTIVE_NAME value1 value2 value 3
# directives with large array of values
DIRECTIVE_NAME <num_of_values>
DIRECTIVE_NAME value1_1 value1_2 value1_3
DIRECTIVE_NAME value2_1 value2_2 value2_3
# ...
The number of values that follow ``DIRECTIVE_NAME`` is defined by the directive itself. Values are separated
by any number of spaces (so values that need to include a literal " " encode it as "_").
*Note:* We say "pseudo-comment" above because it is just a convention. In fact, Hunspell has
no code explicitly interpreting anything starting with ``#`` as a comment -- rather, it ignores everything
that is not a known directive name, and everything after the expected number of directive values. But it is
important NOT to drop ``#`` and the content after it before interpreting, as it might be meaningful!
Some dictionaries define ``#`` to be a flag, or a ``BREAK`` character. For example ``en_GB`` in
LibreOffice does this:
.. code-block:: text
# in .aff file:
COMPOUNDRULE #*0{
# reads: rule of producing compound words:
# any words with flag "#", 0 or more times (*),
# then any word with flag "0",
# then any word with flag "{"
# in .dic file:
1/#0
# reads: "1" is a word, having flags "#" and "0"
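A minimal sketch of why ``#`` has to survive tokenizing. ``split_directive`` is a hypothetical helper
written for this illustration, not the actual Spylls reader: it only splits a raw line on whitespace and
leaves it to the directive to decide how many of the values are meaningful.

```python
def split_directive(line):
    # Hypothetical helper: split a raw .aff line into (name, values).
    # Nothing is stripped as a "comment", because "#" can be a real value.
    parts = line.split()
    if not parts:
        return None
    return parts[0], parts[1:]


print(split_directive('COMPOUNDRULE #*0{'))
print(split_directive('REP   ^alot$   a_lot'))
```

Dropping everything after ``#`` before splitting would turn the first line into a ``COMPOUNDRULE`` with no
value at all.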
The :class:`Aff` class stores all data from the file; read the class docs to better understand the
conventions and usage of directives.
``Aff``
-------
.. autoclass:: Aff
``Prefix`` and ``Suffix``
-------------------------
.. autoclass:: Affix
:members:
.. autoclass:: Prefix
.. autoclass:: Suffix
Helper pattern-alike classes
----------------------------
These classes wrap several kinds of pattern-like objects that can appear in an ``*.aff`` file,
"compiling" them into something applicable, much like Python's ``re`` module compiles regexps.
.. autoclass:: BreakPattern
.. autoclass:: Ignore
.. autoclass:: RepPattern
.. autoclass:: ConvTable
.. autoclass:: CompoundPattern
.. autoclass:: CompoundRule
.. autoclass:: PhonetTable
"""
import re
import functools
import itertools
from operator import itemgetter
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Set, Dict, Tuple, Optional
from spylls.hunspell.algo.capitalization import Casing, GermanCasing, TurkicCasing
from spylls.hunspell.algo.trie import Trie
@dataclass
class BreakPattern:
"""
Contents of the :attr:`Aff.BREAK` directive, pattern for splitting the word, compiled to regexp.
Directives are stored this way:
.. code-block:: text
BREAK 3
BREAK -
BREAK ^-
BREAK -$
(That is, by the way, the default value of ``BREAK``.) It means that Hunspell, while checking a word
like "left-right", will check "left" and "right" separately; it will also ignore "-" at the beginning
and end of the word (second and third lines). Note that ``BREAK -`` without any special chars
will NOT ignore "-" at the beginning/end.
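A standalone sketch of how such patterns behave once compiled (``compile_break`` is a name used only in
this example; it follows the same idea as ``__post_init__`` below, but is not the class itself):

```python
import re


def compile_break(pattern):
    # Escape everything, then turn the escaped "^" and "$" back into anchors.
    escaped = re.escape(pattern)
    escaped = escaped.replace(re.escape('^'), '^').replace(re.escape('$'), '$')
    if escaped.startswith('^') or escaped.endswith('$'):
        return re.compile('(' + escaped + ')')
    # A plain pattern needs at least one character on both sides.
    return re.compile('.(' + escaped + ').')


middle = compile_break('-')
print(bool(middle.search('left-right')))
print(bool(middle.search('-left')))
print(bool(compile_break('^-').search('-left')))
```

The plain ``-`` pattern finds the dash inside "left-right" but not the leading dash of "-left"; only the
anchored ``^-`` pattern covers that case.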
"""
pattern: str
def __post_init__(self):
# special chars like #, -, * etc should be escaped, but ^ and $ should be treated as in regexps
pattern = re.escape(self.pattern).replace('\\^', '^').replace('\\$', '$')
if pattern.startswith('^') or pattern.endswith('$'):
self.regexp = re.compile(f"({pattern})")
else:
self.regexp = re.compile(f".({pattern}).")
@dataclass
class Ignore:
"""
Contents of the :attr:`Aff.IGNORE` directive, chars to ignore on lookup/suggest, compiled with
``str.maketrans``.
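A small illustration of the ``str.maketrans`` deletion form used here. The ignored characters below are an
arbitrary value picked for the example, not taken from a real dictionary:

```python
# Hypothetical IGNORE value: characters that should be dropped from words.
ignored = 'aeiou'

# With two empty strings, the third argument lists characters to delete.
table = str.maketrans('', '', ignored)

print('dictionary'.translate(table))
```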
"""
chars: str
def __post_init__(self):
self.tr = str.maketrans('', '', self.chars)
@dataclass
class RepPattern:
"""
Contents of the :attr:`Aff.REP` directive: a pair of ``(frequent typo, its replacement)``. The typo pattern
is compiled to a regexp.
Example from Hunspell's docs, showing all the features:
.. code-block:: text
REP 5
REP f ph
REP ph f
REP tion$ shun
REP ^cooccurr co-occurr
REP ^alot$ a_lot
This means:
* table of 5 replacements (first line):
* try to see if "f -> ph" produces good word,
* try "ph -> f",
* at the end of the word try "tion -> shun",
* at the beginning of the word try "cooccurr -> co-occurr",
* and try to replace the whole word "alot" with "a lot" (``_`` stands for space).
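The list above can be sketched roughly like this. ``rep_candidates`` is a hypothetical helper, and turning
``_`` into a space at this point is an assumption of the sketch (the real code may do it elsewhere):

```python
import re


def rep_candidates(word, table):
    # table: list of (pattern, replacement) pairs, as in the REP rows.
    for pattern, replacement in table:
        # Assumption for this sketch: "_" in the replacement stands for a space.
        replacement = replacement.replace('_', ' ')
        for m in re.finditer(pattern, word):
            yield word[:m.start()] + replacement + word[m.end():]


table = [('f', 'ph'), ('ph', 'f'), ('tion$', 'shun'), ('^alot$', 'a_lot')]
print(list(rep_candidates('fone', table)))
print(list(rep_candidates('alot', table)))
```

Each candidate would then still have to be checked against the dictionary before being suggested.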
"""
pattern: str
replacement: str
def __post_init__(self):
self.regexp = re.compile(self.pattern)
@dataclass
class Affix:
"""
Common base for :class:`Prefix` and :class:`Suffix`.
Affixes are stored in table looking this way:
.. code-block:: text
SFX X Y 1
SFX X 0 able/CD . ds:able
Meaning of the first line (table header):
* Suffix (can be ``PFX`` for prefix)
* ...designated by flag ``X``
* ...supports cross-product (Y or N, "cross-product" means form with this suffix also allowed to
have prefixes)
* ...and there is 1 of them below
Meaning of the table row:
* Suffix X (should be same as table header)
* ...when applies, doesn't change the stem (0 = "", but it can be "...removes some part at the end of the stem")
* ...when applies, adds "able" to the stem
* ...and the whole form will also have flags "C", "D"
* ...condition of application is "any stem" (``.`` -- read it as regexp's "any char")
* ...and the whole form would have data tags (morphology) ``ds:able``
Then, if in the dictionary we have ``drink/X`` (can have the suffix marked by ``X``), the whole
thing means "'drinkable' is a valid word form, has additional flags 'C', 'D' and some morphological info".
Another example (from ``en_US.aff``):
.. code-block:: text
SFX N Y 3
SFX N e ion e
SFX N y ication y
SFX N 0 en [^ey]
This defines the suffix designated by flag ``N``, with cross-product allowed (``Y``), with 3 forms:
* removes "e" and adds "ion" for words ending with "e" (animate => animation)
* removes "y" and adds "ication" for words ending with "y" (amplify => amplification)
* removes nothing and adds "en" for words ending with neither (befall => befallen)
*(TBH, I don't have the slightest idea why the third option is grouped with the two previous ones... Probably
because dictionary building is a semi-automated process of "packing" word lists into dic+aff, and
the "affixes" don't actually need to make any grammatical sense.)*
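A rough sketch of how one row of the ``en_US`` table above turns a stem into a form. ``apply_suffix`` is a
hypothetical function for this example only; it treats the condition as a regexp anchored at the end of the
stem and expects ``0`` to be already converted to an empty string:

```python
import re


def apply_suffix(stem, strip, add, condition):
    # The condition is checked against the end of the unmodified stem.
    if not re.search(condition + '$', stem):
        return None
    # Remove the stripped part (if any), then append the suffix text.
    base = stem[:-len(strip)] if strip else stem
    return base + add


print(apply_suffix('animate', 'e', 'ion', 'e'))
print(apply_suffix('amplify', 'y', 'ication', 'y'))
print(apply_suffix('befall', '', 'en', '[^ey]'))
```

When the condition does not hold (for example the third row applied to "animate"), the sketch returns
``None`` and the next row with the same flag would be tried.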
"""
#: Flag this affix is marked with. Note that several affixes can have the same flag (in this case,
#: which of them is relevant for the word is decided by its :attr:`condition`)
flag: str
#: Whether this affix is compatible with opposite affix (e.g. if the word has both suffix and prefix,
#: both of them should have ``crossproduct=True``)
crossproduct: bool
#: What is stripped from the stem when the affix is applied
strip: str
#: What is added when the affix is applied
add: str
#: Condition against which stem should be checked to understand whether this affix is relevant
condition: str
#: Flags this affix has
flags: Set[str] = field(default_factory=set)
@dataclass
class Prefix(Affix):
"""
:class:`Affix` at the beginning of the word, stored in :attr:`Aff.PFX` directive.
"""
def __post_init__(self):
# "-" has no special meaning in .aff conditions, but it can occur as a regular word char (e.g. hu_HU), so it is escaped for the regexp
condition = self.condition.replace('-', '\\-')
self.cond_regexp = re.compile('^' + condition)
cond_parts = re.findall(r'(\[.+\]|[^\[])', condition)
cond_parts = cond_parts[len(self.strip):]
if cond_parts and cond_parts != ['.']:
cond = '(?=' + ''.join(cond_parts) + ')'
else:
cond = ''
self.lookup_regexp = re.compile('^' + self.add + cond)
self.replace_regexp = re.compile('^' + self.add)
def __repr__(self):
return (
f"Prefix({self.add}: {self.flag}{'×' if self.crossproduct else ''}" +
(f"/{','.join(self.flags)}" if self.flags else '') +
f", on ^{self.strip}[{self.condition}])"
)
@dataclass
class Suffix(Affix):
"""
:class:`Affix` at the end of the word, stored in :attr:`Aff.SFX` directive.
"""
def __post_init__(self):
# "-" has no special meaning in .aff conditions, but it can occur as a regular word char (e.g. hu_HU), so it is escaped for the regexp
self.cond_regexp = re.compile(self.condition.replace('-', '\\-') + '$')
cond_parts = re.findall(r'(\[.+\]|[^\[])', self.condition)
if self.strip:
cond_parts = cond_parts[:-len(self.strip)]
if cond_parts and cond_parts != ['.']:
cond = '(' + ''.join(cond_parts) + ')'
else:
cond = ''
cond = cond.replace('-', '\\-')
self.lookup_regexp = re.compile(cond + self.add + '$')
self.replace_regexp = re.compile(self.add + '$')
def __repr__(self):
return (
f"Suffix({self.add}: {self.flag}{'×' if self.crossproduct else ''}" +
(f"/{','.join(self.flags)}" if self.flags else '') +
f", on [{self.condition}]{self.strip}$)"
)
@dataclass
class CompoundRule:
"""
Regexp-alike rule for generating compound words, content of :attr:`Aff.COMPOUNDRULE` directive.
It is a way of specifying compounding that is an alternative to (and unrelated to) :attr:`Aff.COMPOUNDFLAG` and
similar directives. Rules look this way:
.. code-block:: text
COMPOUNDRULE A*B?CD
...reading: compound word might consist of any number of words with flag ``A``, then 0 or 1 words
with flag ``B``, then words with flags ``C`` and ``D``.
``en_US.aff`` uses this feature to specify spelling of numerals. In .aff-file, it has
.. code-block:: text
COMPOUNDRULE 2
COMPOUNDRULE n*1t
COMPOUNDRULE n*mp
And, in .dic-file:
.. code-block:: text
0/nm
0th/pt
1/n1
1st/p
1th/tc
2/nm
2nd/p
2th/tc
# ...and so on...
This makes "111th" valid (one hundred eleventh): "1" with flag "n", "1" with flag "1" and "1th" with flag "t"
match the rule ``n*1t``. But "121th" is not valid (it should be "121st").
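The numeral example can be replayed with a short sketch (``rule_matches`` is a hypothetical standalone
function; it mirrors the idea of trying every combination of relevant flags, and assumes single-character
flags):

```python
import itertools
import re

RULE = 'n*1t'
rule_re = re.compile(RULE)
# Flags mentioned in the rule, without the quantifier characters.
rule_flags = set(RULE) - set('*?')


def rule_matches(flag_sets):
    # Keep only flags the rule knows about, one set per compound part.
    relevant = [rule_flags & flags for flags in flag_sets]
    # Try each way of picking one flag per part.
    return any(
        rule_re.fullmatch(''.join(combo))
        for combo in itertools.product(*relevant)
    )


# "1" + "1" + "1th" and "1" + "2" + "1th", with flags from the .dic excerpt
print(rule_matches([{'n', '1'}, {'n', '1'}, {'t', 'c'}]))
print(rule_matches([{'n', '1'}, {'n', 'm'}, {'t', 'c'}]))
```

The first call succeeds through the combination ``n``, ``1``, ``t``; in the second, "2" has no flag ``1``,
so no combination fits the rule.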
"""
text: str
def __post_init__(self):
# TODO: proper flag parsing! Long is (aa)(bb)*(cc), numeric is (1001)(1002)*(1003)
# This works but is super ad-hoc!
if '(' in self.text:
self.flags = set(re.findall(r'\((.+?)\)', self.text))
parts = re.findall(r'\([^*?]+?\)[*?]?', self.text)
else:
self.flags = set(re.sub(r'[\*\?]', '', self.text))
# There are ) flags used in real-life sv_* dictionaries
# Obviously it is quite ad-hoc (other chars that have special meaning in regexp might be
# used eventually)
parts = [part.replace(')', '\\)') for part in re.findall(r'[^*?][*?]?', self.text)]
self.re = re.compile(''.join(parts))
self.partial_re = re.compile(
functools.reduce(lambda res, part: f"{part}({res})?", parts[::-1])
)
def fullmatch(self, flag_sets):
relevant_flags = [self.flags.intersection(f) for f in flag_sets]
return any(
self.re.fullmatch(''.join(fc))
for fc in itertools.product(*relevant_flags)
)
def partial_match(self, flag_sets):
relevant_flags = [self.flags.intersection(f) for f in flag_sets]
return any(
self.partial_re.fullmatch(''.join(fc))
for fc in itertools.product(*relevant_flags)
)
@dataclass
class CompoundPattern:
"""
Pattern to check whether compound word is correct, stored in :attr:`Aff.CHECKCOMPOUNDPATTERN` directive.
Format of the pattern:
.. code-block:: text
endchars[/flag] beginchars[/flag] [replacement]
The pattern matches (telling that this compound is not allowed) if some pair of the words inside
the compound matches these conditions:
* the first word ends with ``endchars`` (and has the ``flag`` from the first element, if it is specified)
* the second word starts with ``beginchars`` (and has the ``flag`` from the second element, if it is
specified)
``endchars`` can be 0, specifying "word has zero affixes".
``replacement`` complicates things: it allows specifying "...but this string at the border of the
words should be unpacked into this ``endchars`` and that ``beginchars``, while keeping the compound
allowed"... It complicates the algorithm significantly, and **no known dictionary** uses this feature,
so ``replacement`` is just ignored by Spylls.
"""
left: str
right: str
replacement: Optional[str] = None
def __post_init__(self):
self.left_stem, _, self.left_flag = self.left.partition('/')
self.right_stem, _, self.right_flag = self.right.partition('/')
if self.left_stem == '0':
self.left_stem = ''
self.left_no_affix = True
else:
self.left_no_affix = False
# FIXME: Hunspell docs say 0 is only allowed for the first pattern
if self.right_stem == '0':
self.right_stem = ''
self.right_no_affix = True
else:
self.right_no_affix = False
def match(self, left, right):
return (left.stem.endswith(self.left_stem)) and (right.stem.startswith(self.right_stem)) and \
(not self.left_no_affix or not left.is_base()) and \
(not self.right_no_affix or not right.is_base()) and \
(not self.left_flag or self.left_flag in left.flags()) and \
(not self.right_flag or self.right_flag in right.flags())
@dataclass
class ConvTable:
"""
Table of conversions that should be applied on pre- or post-processing, stored in :attr:`Aff.ICONV` and
:attr:`Aff.OCONV`. Format is as follows (as far as I can guess from code and tests, documentation
is very sparse):
.. code-block:: text
ICONV <number of entries>
ICONV <pattern> <replacement>
Typically, ``pattern`` and ``replacement`` are just simple strings, used mostly for replacing
typographics (like trigraphs and "nice" apostrophes) before/after processing.
But if there is a ``_`` in ``pattern``, it is treated as: regexp ``^`` if at the beginning of
the pattern, regexp ``$`` if at the end, and just ignored otherwise. This seems to be a "hidden"
feature, demonstrated by the ``nepali.*`` set of tests in the Hunspell distribution.
Conversion rules are applied as follows:
* for each position in the word
* ...find all matching rules
* ...choose the one with the longest pattern
* ...apply it, and move to the position right after the matched part (so there can't be recursive application
of several rules on top of each other).
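The steps above, as a simplified sketch. ``convert`` is a hypothetical function that handles only plain
string patterns (no ``_`` anchors) and skips empty patterns, so it is not a replacement for the class:

```python
def convert(word, pairs):
    pos, res = 0, ''
    while pos < len(word):
        # All non-empty patterns that match at the current position.
        found = [(p, r) for p, r in pairs if p and word.startswith(p, pos)]
        if found:
            # The longest pattern wins; skip past what it consumed.
            pattern, replacement = max(found, key=lambda pr: len(pr[0]))
            res += replacement
            pos += len(pattern)
        else:
            res += word[pos]
            pos += 1
    return res


print(convert('cab', [('a', '1'), ('ab', '2')]))
print(convert('ab', [('a', 'b'), ('b', 'c')]))
```

In the second call, the "b" produced by the first rule is not converted again, because the position has
already moved past it.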
"""
pairs: List[Tuple[str, str]]
def __post_init__(self):
def compile_row(pat1, pat2):
pat1clean = pat1.replace('_', '')
pat1re = pat1clean
if pat1.startswith('_'):
pat1re = '^' + pat1re
if pat1.endswith('_'):
pat1re = pat1re + '$'
return (pat1clean, re.compile(pat1re), pat2.replace('_', ' '))
# TODO: don't need key=?.. (default behavior)
self.table = sorted([compile_row(*row) for row in self.pairs], key=itemgetter(0))
def __call__(self, word):
pos = 0
res = ''
while pos < len(word):
matches = sorted(
[(search, pattern, replacement)
for search, pattern, replacement in self.table
if pattern.match(word, pos)],
key=lambda r: len(r[0]),
reverse=True
)
if matches:
search, pattern, replacement = matches[0]
res += replacement
pos += len(search)
else:
res += word[pos]
pos += 1
return res
@dataclass
class PhonetTable:
"""
Represents table of metaphone transformations stored in :attr:`Aff.PHONE`. Format is borrowed
from aspell and described `in its docs <http://aspell.net/man-html/Phonetic-Code.html>`_.
Basically, each line of the table specifies a "pattern"/"replacement" pair. The replacement is
a literal string (with "_" meaning "empty string"), and the pattern is ... complicated. Spylls, as
of now, parses rules fully (see the ``parse_rule`` method in the source), but doesn't implement all
of the algorithm's details (like rule prioritizing, the concept of a "follow-up rule" etc.)
It is enough to pass Hunspell's (small) test for the PHONE implementation, but it is definitely more naive
than expected. But as it is a marginal feature (and there are enough metaphone implementations in
Python), we aren't (yet?) bothered by this fact.
"""
table: List[Tuple[str, str]]
RULE_PATTERN = re.compile(
r'(?P<letters>\w+)(\((?P<optional>\w+)\))?(?P<lookahead>[-]+)?(?P<flags>[\^$<]*)(?P<priority>\d)?'
)
@dataclass
class Rule: # pylint: disable=missing-class-docstring
search: re.Pattern
replacement: str
start: bool = False
end: bool = False
priority: int = 5
followup: bool = True
def match(self, word, pos):
if self.start and pos > 0:
return False
if self.end:
return self.search.fullmatch(word, pos)
return self.search.match(word, pos)
def __post_init__(self):
self.rules = defaultdict(list)
for search, replacement in self.table:
self.rules[search[0]].append(self.parse_rule(search, replacement))
def parse_rule(self, search: str, replacement: str) -> Rule:
m = self.RULE_PATTERN.fullmatch(search)
if not m:
raise ValueError(f'Not a proper rule: {search!r}')
text = [*m.group('letters')]
if m.group('optional'):
text.append('[' + m.group('optional') + ']')
if m.group('lookahead'):
la = len(m.group('lookahead'))
regex = ''.join(text[:-la]) + '(?=' + ''.join(text[-la:]) + ')'
else:
regex = ''.join(text)
return PhonetTable.Rule(
search=re.compile(regex),
replacement=replacement,
start=('^' in m.group('flags')),
end=('$' in m.group('flags')),
followup=(m.group('lookahead') is not None)
)
@dataclass
class Aff:
"""
The class contains all directives from the .aff file in its attributes.
Attribute **names** are exactly the same as the directives they are read from
(they are upper-case, which is un-Pythonic, but makes it possible to unambiguously relate directives to attrs and
grep them in code).
Attribute **values** are either appropriate primitive data types (strings, numbers, arrays etc),
or simple objects wrapping this data to make it easily usable in algorithms (mostly these are
pattern-like objects, similar to the result of Python's standard ``re.compile``, but specific to the
Hunspell domain).
Attribute **docs** include explanations derived from
`Hunspell's man page <https://www.manpagez.com/man/5/hunspell/>`_ (sometimes rephrased/abbreviated),
plus links to relevant chunks of ``spylls`` code which uses the directive.
Note that **all** directives are optional; an empty .aff file is a valid one.
**General**
.. autoattribute:: SET
.. autoattribute:: FLAG
.. autoattribute:: LANG
.. autoattribute:: WORDCHARS
.. autoattribute:: IGNORE
.. autoattribute:: CHECKSHARPS
.. autoattribute:: FORBIDDENWORD
**Suggestions**
.. autoattribute:: KEY
.. autoattribute:: TRY
.. autoattribute:: NOSUGGEST
.. autoattribute:: KEEPCASE
.. autoattribute:: REP
.. autoattribute:: MAP
.. autoattribute:: NOSPLITSUGS
.. autoattribute:: PHONE
.. autoattribute:: MAXCPDSUGS
**N-gram suggestions**
.. autoattribute:: MAXNGRAMSUGS
.. autoattribute:: MAXDIFF
.. autoattribute:: ONLYMAXDIFF
**Stemming**
.. autoattribute:: PFX
.. autoattribute:: SFX
.. autoattribute:: NEEDAFFIX
.. autoattribute:: CIRCUMFIX
.. autoattribute:: COMPLEXPREFIXES
.. autoattribute:: FULLSTRIP
**Compounding**
.. autoattribute:: BREAK
.. autoattribute:: COMPOUNDRULE
.. autoattribute:: COMPOUNDMIN
.. autoattribute:: COMPOUNDWORDMAX
.. autoattribute:: COMPOUNDFLAG
.. autoattribute:: COMPOUNDBEGIN
.. autoattribute:: COMPOUNDMIDDLE
.. autoattribute:: COMPOUNDEND
.. autoattribute:: ONLYINCOMPOUND
.. autoattribute:: COMPOUNDPERMITFLAG
.. autoattribute:: COMPOUNDFORBIDFLAG
.. autoattribute:: FORCEUCASE
.. autoattribute:: CHECKCOMPOUNDCASE
.. autoattribute:: CHECKCOMPOUNDDUP
.. autoattribute:: CHECKCOMPOUNDREP
.. autoattribute:: CHECKCOMPOUNDTRIPLE
.. autoattribute:: CHECKCOMPOUNDPATTERN
.. autoattribute:: SIMPLIFIEDTRIPLE
.. autoattribute:: COMPOUNDSYLLABLE
.. autoattribute:: COMPOUNDMORESUFFIXES
.. autoattribute:: COMPOUNDROOT
**Pre/post-processing**
.. autoattribute:: ICONV
.. autoattribute:: OCONV
**Aliasing**
.. autoattribute:: AF
.. autoattribute:: AM
**Other/Ignored**
.. autoattribute:: WARN
.. autoattribute:: FORBIDWARN
.. autoattribute:: SYLLABLENUM
.. autoattribute:: SUBSTANDARD
Some other directives that are in docs, but are deprecated/not used (and never implemented by Spylls):
* ``LEMMA_PRESENT``
**Derived attributes**
These attributes are calculated after Aff reading and initialization
.. py:attribute:: casing
:type: spylls.hunspell.algo.capitalization.Casing
"Casing" class (defining how the words in this language are lowercased/uppercased). See
:class:`Casing <spylls.hunspell.algo.capitalization.Casing>` for details. In ``Aff``, basically, it is
* :class:`GermanCasing <spylls.hunspell.algo.capitalization.GermanCasing>` if :attr:`CHECKSHARPS`
is ``True``,
* :class:`TurkicCasing <spylls.hunspell.algo.capitalization.TurkicCasing>` if :attr:`LANG` is
one of Turkic languages (Turkish, Azerbaijani, Crimean Tatar),
* regular ``Casing`` otherwise.
.. py:attribute:: suffixes_index
:type: spylls.hunspell.algo.trie.Trie
`Trie <https://en.wikipedia.org/wiki/Trie>`_ structure for fast selecting of all possible suffixes
for some word, created from :attr:`SFX`
.. py:attribute:: prefixes_index
:type: spylls.hunspell.algo.trie.Trie
`Trie <https://en.wikipedia.org/wiki/Trie>`_ structure for fast selecting all possible prefixes
for some word, created from :attr:`PFX`
"""
#: .aff and .dic encoding.
#:
#: *Usage*: Stored in :class:`readers.aff.Context <spylls.hunspell.readers.aff.Context>` and used
#: for reopening .aff file (after the directive was read) in
#: :meth:`reader_aff <spylls.hunspell.readers.aff.read_aff>`, and for opening .dic file
#: in :meth:`reader_dic <spylls.hunspell.readers.dic.read_dic>`
SET: str = 'Windows-1252'
#: .aff file declares one of the possible flag formats:
#:
#: * ``short`` (default) -- each flag is one ASCII character
#: * ``long`` -- each flag is two ASCII characters
#: * ``numeric`` -- each flag is number, set of flags separates them with ``,``
#: * ``UTF-8`` -- each flag is one UTF-8 character
#:
#: Flag format defines how flag sets attached to stems and affixes are parsed. For example,
#: .dic file entry ``cat/ABCD`` can be considered having flags ``{"A", "B", "C", "D"}``
#: (default flag format, "short"), or ``{"AB", "CD"}`` (flag format "long")
#:
#: *Usage*: Stored in :class:`readers.aff.Context <spylls.hunspell.readers.aff.Context>` and used
#: in :meth:`reader_aff <spylls.hunspell.readers.aff.read_aff>`, and
#: in :meth:`reader_dic <spylls.hunspell.readers.dic.read_dic>`
FLAG: str = 'short' # TODO: Enum of possible values, in fact
#: ISO language code. The only codes that change behavior are the codes of Turkic languages, which
#: have different I/i capitalization logic.
#:
#: *Usage*: Abstracted into :attr:`casing` which is used in both lookup and suggest.
LANG: Optional[str] = None
#: Extends tokenizer of Hunspell command line interface with additional word characters, for example,
#: dot, dash, n-dash, numbers.
#:
#: *Usage*: Not used in Spylls at all, as it doesn't do tokenization.
WORDCHARS: Optional[str] = None
#: Sets characters to ignore in dictionary words, affixes and input words. Useful for optional characters,
#: such as Arabic (harakat) or Hebrew (niqqud) diacritical marks.
#:
#: *Usage*: in :meth:`Lookup.__call__ <spylls.hunspell.algo.lookup.Lookup.__call__>` for preparing
#: input word, and in :meth:`reader_aff <spylls.hunspell.readers.aff.read_aff>`, and
#: in :meth:`reader_dic <spylls.hunspell.readers.dic.read_dic>`.
IGNORE: Optional[Ignore] = None
#: Specifies that this language has the German "sharp S" (ß), so this language is probably German
#: :)
#:
#: The effect of this declaration is that an uppercase word with the letter "ß" is considered correct (the uppercase
#: form of "ß" is "SS", but it is allowed to leave a lowercase "ß"). The effect can be prohibited
#: for some words by applying the :attr:`KEEPCASE` flag to the word (which in other situations has
#: a different meaning).
#:
#: *Usage:* To define whether to use
#: :class:`GermanCasing <spylls.hunspell.algo.capitalization.GermanCasing>` in :attr:`casing`
#: (which changes word lower/upper-casing slightly), and in
#: :meth:`Lookup.good_forms <spylls.hunspell.algo.lookup.Lookup.good_forms>` to drop forms where
#: lowercase "ß" is prohibited.
CHECKSHARPS: bool = False
#: Flag that marks word as forbidden. The main usage of this flag is to specify that some form
#: that is logically possible (by affixing/suffixing or compounding) is in fact non-existent.
#:
#: Imaginary example (not from actual English dictionary!): let's say word "create" can have suffixes
#: "-d", "-s", "-ion", and prefixes: "un-", "re-", "de-", but of all possible forms (created,
#: creates, creation, uncreates, uncreation, ....) we decide "decreated" is not an existing word.
#: Then we mark (in the .dic file) the word "create" with flags for all those suffixes and prefixes,
#: but also add a separate word "decreated" to the dictionary, marked with the flag that is specified
#: in .aff's FORBIDDENWORD directive. Now, this word wouldn't be considered correct, but all other
#: combinations would.
#:
#: *Usage:* multiple times in both :class:`Lookup <spylls.hunspell.algo.lookup.Lookup>` and
#: :class:`Suggest <spylls.hunspell.algo.suggest.Suggest>`
FORBIDDENWORD: Optional[str] = None
#: Flag to mark words which shouldn't be considered correct unless their casing is exactly like in
#: the dictionary.
#:
#: Note: With :attr:`CHECKSHARPS` declaration, words with sharp s (ß) and ``KEEPCASE`` flag
#: may be capitalized and uppercased, but uppercased forms of these words may not contain "ß",
#: only "SS".
#:
#: *Usage:* :meth:`Suggest.suggestions <spylls.hunspell.algo.suggest.Suggest.suggestions>`
#: to produce suggestions in proper case,
#: :meth:`Lookup.is_good_form <spylls.hunspell.algo.lookup.Lookup.is_good_form>`.
KEEPCASE: Optional[str] = None
# **Suggestions**
#: Flag to mark word/affix as "shouldn't be suggested" (but considered correct on lookup), like
#: obscenities.
#:
#: *Usage:* on :class:`Suggest <spylls.hunspell.algo.suggest.Suggest>` creation (to make list of
#: dictionary words for ngram-check), and in
#: :meth:`Lookup.is_good_form <spylls.hunspell.algo.lookup.Lookup.is_good_form>` (if the lookup is
#: called from suggest, with ``allow_nosuggest=False``)
NOSUGGEST: Optional[str] = None
#: String that specifies sets of adjacent characters on keyboard (so suggest could understand
#: that "kitteb" is most probable misspelling of "kitten"). Format is "abc|def|xyz". For QWERTY
#: English keyboard might be ``qwertyuiop|asdfghjkl|zxcvbnm``
#:
#: *Usage:*
#: :meth:`Suggest.edits <spylls.hunspell.algo.suggest.Suggest.edits>`
#: to pass to :meth:`permutations.badcharkey <spylls.hunspell.algo.permutations.badcharkey>`.
KEY: str = 'qwertyuiop|asdfghjkl|zxcvbnm'
#: List of all characters that can be used in words, *in order of probability* (most probable first),
#: used on edits for suggestions (trying to add missing, or replace erroneous character).
#:
#: *Usage:*
#: :meth:`Suggest.edits <spylls.hunspell.algo.suggest.Suggest.edits>`
#: to pass to :meth:`permutations.badchar <spylls.hunspell.algo.permutations.badchar>` and
#: :meth:`permutations.forgotchar <spylls.hunspell.algo.permutations.forgotchar>`. Note that,
#: obscurely enough, Suggest checks this option to
#: decide whether a dash should be used when suggesting two words (e.g. for misspelled "foobar",
#: when it is decided that it is two words erroneously joined, suggest either returns only
#: "foo bar", or also "foo-bar"). Whether a dash is suggested is decided by the presence of ``"-"`` in ``TRY``,
#: or by the presence of Latin ``"a"`` (= "the language uses Latin script, and all such languages allow dashes
#: between words")... That's how it is in Hunspell!
TRY: str = ''
#: *Table* of replacements for typical typos (like "shun"->"tion") to try on suggest. See :class:`RepPattern`
#: for details of format.
#:
#: *Usage:* :meth:`Suggest.edits <spylls.hunspell.algo.suggest.Suggest.edits>` to pass to
#: :meth:`permutations.replchars <spylls.hunspell.algo.permutations.replchars>`.
#: Note that the table is populated from the aff's ``REP`` directive, *and* from the dic file's ``ph:``
#: tags (see :class:`Word <spylls.hunspell.data.dic.Word>` and
#: :meth:`read_dic <spylls.hunspell.readers.dic.read_dic>` for detailed explanations).
REP: List[RepPattern] = field(default_factory=list)
#: Sets of "similar" chars to try in suggestion (like ``aáã`` -- if they all exist in the language,
#: replacing one in another would be a frequent typo). Several chars as a single entry should be
#: grouped by parentheses: ``MAP ß(ss)`` (German "sharp s" and "ss" sequence are more or less the same).
#:
#: *Usage:*
#: :meth:`Suggest.edits <spylls.hunspell.algo.suggest.Suggest.edits>`
#: to pass to :meth:`permutations.mapchars <spylls.hunspell.algo.permutations.mapchars>`.
MAP: List[Set[str]] = field(default_factory=list)
#: Never try to suggest "this word should be split in two". LibreOffice sv_SE dictionary says
#: "it is a must for Swedish". (Interestingly enough, Hunspell's tests don't check this flag at
#: all).
#:
#: *Usage:*
#: :meth:`Suggest.edits <spylls.hunspell.algo.suggest.Suggest.edits>`
NOSPLITSUGS: bool = False
#: Table for metaphone transformations. Format is borrowed from aspell and described
#: `in its docs <http://aspell.net/man-html/Phonetic-Code.html>`_.
#:
#: Note that dictionaries with a ``PHONE`` table are extremely rare: of all LibreOffice/Firefox
#: dictionaries, only en_ZA (South Africa) contains it -- though these are generic English metaphone
#: rules, and it is quite odd that they are not used more frequently.
#:
#: Showcase (with LibreOffice dictionaries)::
#:
#:     >>> misspelled = 'excersized'
#:
#:     >>> nometaphone = Dictionary.from_files('en/en_US')
#:     >>> [*nometaphone.suggest(misspelled)]
#:     ['supersized']
#:
#:     >>> withmetaphone = Dictionary.from_files('en/en_ZA')
#:     >>> [*withmetaphone.suggest(misspelled)]
#:     ['excerpted', 'exercised', 'excessive']
#:
#: *Usage:* :mod:`phonet_suggest <spylls.hunspell.algo.phonet_suggest>`
PHONE: Optional[PhonetTable] = None
#: Limits the number of compound suggestions.
#:
#: *Usage:* :meth:`Suggest.suggestions <spylls.hunspell.algo.suggest.Suggest.suggestions>`
#: to limit the number of edit-based suggestions which are compound words.
MAXCPDSUGS: int = 3
# *NGram suggestions*:
#: Set max. number of n-gram suggestions. Value 0 switches off the n-gram suggestions (see also
#: :attr:`MAXDIFF`).
#:
#: *Usage:* :meth:`Suggest.ngram_suggestions <spylls.hunspell.algo.suggest.Suggest.ngram_suggestions>`
#: (to decide whether ``ngram_suggest`` should be called at all) and
#: :meth:`Suggest.suggestions <spylls.hunspell.algo.suggest.Suggest.suggestions>` (to limit
#: the number of ngram-based suggestions).
MAXNGRAMSUGS: int = 4
#: Set the similarity factor for the n-gram based suggestions:
#:
#: * 5 = default value;
#: * 0 = fewer n-gram suggestions, but at least one;
#: * 10 (max) = :attr:`MAXNGRAMSUGS` n-gram suggestions.
#:
#: *Usage:* :meth:`Suggest.ngram_suggestions <spylls.hunspell.algo.suggest.Suggest.ngram_suggestions>` where
#: it is passed to :mod:`ngram_suggest <spylls.hunspell.algo.ngram_suggest>` module, and used in
#: :meth:`detailed_affix_score <spylls.hunspell.algo.ngram_suggest.detailed_affix_score>`.
MAXDIFF: int = -1
#: Remove all bad n-gram suggestions (default mode keeps one, see :attr:`MAXDIFF`).
#:
#: *Usage:* :meth:`Suggest.ngram_suggestions <spylls.hunspell.algo.suggest.Suggest.ngram_suggestions>` where
#: it is passed to :mod:`ngram_suggest <spylls.hunspell.algo.ngram_suggest>` module, and used in
#: :meth:`filter_guesses <spylls.hunspell.algo.ngram_suggest.filter_guesses>`.
ONLYMAXDIFF: bool = False
# **Stemming**
#: Dictionary of ``flag => prefixes with this flag``. See :class:`Affix` for detailed format and
#: meaning description.
#:
#: Usage:
#:
#: * in :meth:`Suggest.ngram_suggestions <spylls.hunspell.algo.suggest.Suggest.ngram_suggestions>`
#: to pass to :mod:`ngram_suggest <spylls.hunspell.algo.ngram_suggest>`
#: (and there to construct all possible forms).
#: * also parsed into :attr:`prefixes_index` Trie, which is then used in
#: :meth:`Lookup.deprefix <spylls.hunspell.algo.lookup.Lookup.deprefix>`
PFX: Dict[str, List[Prefix]] = field(default_factory=dict)
#: Dictionary of ``flag => suffixes with this flag``. See :class:`Affix` for detailed format and
#: meaning description.
#:
#: Usage:
#:
#: * in :meth:`Suggest.ngram_suggestions <spylls.hunspell.algo.suggest.Suggest.ngram_suggestions>`
#: to pass to :mod:`ngram_suggest <spylls.hunspell.algo.ngram_suggest>`
#: (and there to construct all possible forms).
#: * also parsed into :attr:`suffixes_index` Trie, which is then used in
#: :meth:`Lookup.desuffix <spylls.hunspell.algo.lookup.Lookup.desuffix>`
SFX: Dict[str, List[Suffix]] = field(default_factory=dict)
#: Flag saying "this stem can't be used without affixes". Can also be assigned to a suffix/prefix,
#: meaning "there should be other affixes besides this one".
#:
#: *Usage:* :meth:`Lookup.is_good_form <spylls.hunspell.algo.lookup.Lookup.is_good_form>`
NEEDAFFIX: Optional[str] = None
#: Suffixes marked with this flag may be on a word only when this word also has a prefix with
#: this flag, and vice versa.
#:
#: *Usage:* :meth:`Lookup.is_good_form <spylls.hunspell.algo.lookup.Lookup.is_good_form>`
CIRCUMFIX: Optional[str] = None
#: Whether stripping two prefixes is allowed (only one prefix by default). Random fun fact:
#: of all currently available LibreOffice and Firefox dictionaries, only Firefox's Zulu has this
#: flag.
#:
#: *Usage:* :meth:`Lookup.deprefix <spylls.hunspell.algo.lookup.Lookup.deprefix>`
COMPLEXPREFIXES: bool = False
#: Whether affixes are allowed to remove the entire stem.
#:
#: Not used in Spylls (i.e. Spylls doesn't fail when this option is False and the entire word is removed,
#: so Hunspell's tests ``fullstrip.*`` are passing).
FULLSTRIP: bool = False
# **Compounding**
#: Defines break points for breaking words and checking word parts separately. See :class:`BreakPattern`
#: for format definition.
#:
#: *Usage:* :meth:`Lookup.break_word <spylls.hunspell.algo.lookup.Lookup.break_word>`
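#:
#: A rough sketch of what the three default patterns mean (plain Python for illustration, not
#: the actual ``break_word`` logic): ``-`` splits a word at an inner dash, while ``^-`` and ``-$``
#: drop a dash at the very start or end of the word::
#:
#:     >>> import re
#:     >>> 'well-known'.split('-')
#:     ['well', 'known']
#:     >>> re.sub(r'^-', '', '-known')
#:     'known'
#:     >>> re.sub(r'-$', '', 'well-')
#:     'well'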
BREAK: List[BreakPattern] = \
field(default_factory=lambda: [BreakPattern('-'), BreakPattern('^-'), BreakPattern('-$')])
#: Rules for producing compound words, with regexp-like syntax. See :class:`CompoundRule` for
#: format definition.
#:
#: *Usage:* :meth:`Lookup.compounds_by_rules <spylls.hunspell.algo.lookup.Lookup.compounds_by_rules>`
COMPOUNDRULE: List[CompoundRule] = field(default_factory=list)
#: Minimum length of words used for compounding.
#:
#: *Usage:* :meth:`Lookup.compounds_by_rules <spylls.hunspell.algo.lookup.Lookup.compounds_by_rules>` &
#: :meth:`Lookup.compounds_by_flags <spylls.hunspell.algo.lookup.Lookup.compounds_by_flags>`
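#:
#: A minimal sketch of the length check (plain Python for illustration; ``parts_long_enough``
#: is a hypothetical helper, not part of Spylls)::
#:
#:     >>> def parts_long_enough(parts, compoundmin=3):
#:     ...     return all(len(part) >= compoundmin for part in parts)
#:     >>> parts_long_enough(['foot', 'ball'])
#:     True
#:     >>> parts_long_enough(['a', 'board'])
#:     False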
COMPOUNDMIN: int = 3
#: Maximum number of words in a compound word.
#:
#: *Usage:* :meth:`Lookup.compounds_by_rules <spylls.hunspell.algo.lookup.Lookup.compounds_by_rules>` &
#: :meth:`Lookup.compounds_by_flags <spylls.hunspell.algo.lookup.Lookup.compounds_by_flags>`
COMPOUNDWORDMAX: Optional[int] = None
#: Forms with this flag (marking either stem, or one of affixes) can be part of the compound.
#: Note that the triple of flags :attr:`COMPOUNDBEGIN`, :attr:`COMPOUNDMIDDLE`, :attr:`COMPOUNDEND`
#: is a more precise way of marking ("this word can be at the beginning of compound").
#:
#: *Usage:* :meth:`Lookup.is_good_form <spylls.hunspell.algo.lookup.Lookup.is_good_form>` to compare
#: form's compound position (or lack thereof) with the presence of the flag.
COMPOUNDFLAG: Optional[str] = None
#: Forms with this flag (marking either stem, or one of affixes) can be at the beginning of the
#: compound.
#: Part of the triple of flags :attr:`COMPOUNDBEGIN`, :attr:`COMPOUNDMIDDLE`, :attr:`COMPOUNDEND`;
#: alternative to the triple is just :attr:`COMPOUNDFLAG` ("this form can be at any place in compound").
#:
#: *Usage:* :meth:`Lookup.is_good_form <spylls.hunspell.algo.lookup.Lookup.is_good_form>`
#: to compare form's compound position (or lack thereof) with the presence of the flag.
COMPOUNDBEGIN: Optional[str] = None
#: Forms with this flag (marking either stem, or one of affixes) can be in the middle of the
#: compound (not the last part, and not the first).
#: Part of the triple of flags :attr:`COMPOUNDBEGIN`, :attr:`COMPOUNDMIDDLE`, :attr:`COMPOUNDEND`;
#: alternative to the triple is just :attr:`COMPOUNDFLAG` ("this form can be at any place in compound").
#:
#: *Usage:* :meth:`Lookup.is_good_form <spylls.hunspell.algo.lookup.Lookup.is_good_form>`
#: to compare form's compound position (or lack thereof) with the presence of the flag.
COMPOUNDMIDDLE: Optional[str] = None
#: Forms with this flag (marking either stem, or one of affixes) can be at the end of the
#: compound.
#: Part of the triple of flags :attr:`COMPOUNDBEGIN`, :attr:`COMPOUNDMIDDLE`, :attr:`COMPOUNDEND`;
#: alternative to the triple is just :attr:`COMPOUNDFLAG` ("this form can be at any place in compound").
#:
#: *Usage:* :meth:`Lookup.is_good_form <spylls.hunspell.algo.lookup.Lookup.is_good_form>`
#: to compare form's compound position (or lack thereof) with the presence of the flag.
COMPOUNDEND: Optional[str] = None