Dependencies: Upgrade spaCy to 3.7.5; Utils: Add spaCy's Faroese and Norwegian (Nynorsk) word tokenizers
BLKSerene committed Jun 18, 2024
1 parent bff0c56 commit c4074d2
Showing 10 changed files with 20 additions and 14 deletions.
2 changes: 1 addition & 1 deletion ACKS.md
@@ -47,7 +47,7 @@ As Wordless stands on the shoulders of giants, I hereby extend my sincere gratit
23|[Sacremoses](https://github.com/hplt-project/sacremoses)|0.1.1|Liling Tan, Jelmer van der Linde|[MIT](https://github.com/hplt-project/sacremoses/blob/master/LICENSE)
24|[SciPy](https://scipy.org/scipylib/)|1.11.3|SciPy Developers|[BSD-3-Clause](https://github.com/scipy/scipy/blob/main/LICENSE.txt)
25|[simplemma](https://github.com/adbar/simplemma)|1.0.0|Adrien Barbaresi|[MIT](https://github.com/adbar/simplemma/blob/main/LICENSE)
26|[spaCy](https://spacy.io/)|3.7.2|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
26|[spaCy](https://spacy.io/)|3.7.5|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
27|[spacy-pkuseg](https://github.com/explosion/spacy-pkuseg)|0.0.33|Ruixuan Luo (罗睿轩), Jingjing Xu (许晶晶),<br>Xuancheng Ren (任宣丞), Yi Zhang (张艺),<br>Zhiyuan Zhang (张之远), Bingzhen Wei (位冰镇),<br>Xu Sun (孙栩)<br>Adriane Boyd, Ines Montani|[MIT](https://github.com/explosion/spacy-pkuseg/blob/master/LICENSE)
28|[Stanza](https://github.com/stanfordnlp/stanza)|1.7.0|Peng Qi (齐鹏), Yuhao Zhang (张宇浩),<br>Yuhui Zhang (张钰晖), Jason Bolton,<br>Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
29|[SudachiPy](https://github.com/WorksApplications/sudachi.rs)|0.6.8|Works Applications Co., Ltd.|[Apache-2.0](https://github.com/WorksApplications/sudachi.rs/blob/develop/LICENSE)
5 changes: 3 additions & 2 deletions CHANGELOG.md
@@ -24,10 +24,11 @@
- Measures: Add lexical density/diversity - Brunét's Index / Honoré's statistic / Lexical Density
- Settings: Add Settings - Stop Word Lists - Stop Word List Settings - Case-sensitive
- Settings: Add Settings - Tables - Dependency Parser
- Utils: Add encoding detection - UTF-8 with BOM
- Utils: Add Pyphen's Basque syllable tokenizer
- Utils: Add PyThaiNLP's Han-solo
- Utils: Add spaCy's Faroese and Norwegian (Nynorsk) word tokenizers
- Utils: Add Stanza's Sindhi part-of-speech tagger
- Utils: Add encoding detection - UTF-8 with BOM
- Utils: Add VADER's sentiment analyzers
- Work Area: Add Colligation Extractor - Filter results - Node/Colligation length
- Work Area: Add Collocation Extractor - Filter results - Node/Collocation length
@@ -71,7 +72,7 @@
- Dependencies: Upgrade Requests to 2.32.2
- Dependencies: Upgrade Sacremoses to 0.1.1
- Dependencies: Upgrade simplemma to 1.0.0
- Dependencies: Upgrade spaCy to 3.7.2
- Dependencies: Upgrade spaCy to 3.7.5
- Dependencies: Upgrade spacy-pkuseg to 0.0.33
- Dependencies: Upgrade Stanza to 1.7.0
- Dependencies: Upgrade SudachiPy to 0.6.8
4 changes: 2 additions & 2 deletions doc/doc.md
@@ -638,7 +638,7 @@ You can generate line charts or word clouds for keywords using any statistics. Y
### [4.1 Supported Languages](#doc)

Language|Sentence Token-ization|Word Token-ization|Syllable Token-ization|Part-of-speech Tagging|Lemma-tization|Stop Word List|Depen-dency Parsing|Senti-ment Analysis
:--------------------------------:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:
:-----------------------:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:
Afrikaans |✔|✔|✔|✔|✔|✖️|✔|✔
Albanian |⭕️ |✔|✔|✖️|✔|✖️|✖️|✔
Amharic |⭕️ |✔|✖️|✖️|✖️|✖️|✖️|✔
@@ -648,7 +648,7 @@ Armenian (Western) |✔|✔|✖️|✔|✔|✖️|✔|✔
Assamese |⭕️ |✔|✖️|✖️|✖️|✖️|✖️|✔
Asturian |⭕️ |⭕️ |✖️|✖️|✔|✖️|✖️|✖️
Azerbaijani |⭕️ |✔|✖️|✖️|✖️|✔|✖️|✔
Basque |✔|✔|✖️|✔|✔|✔|✔|✔
Basque |✔|✔||✔|✔|✔|✔|✔
Belarusian |✔|✔|✔|✔|✔|✖️|✔|✔
Bengali |⭕️ |✔|✖️|✖️|✔|✔|✖️|✔
Bulgarian |✔|✔|✔|✔|✔|✖️|✔|✔
2 changes: 1 addition & 1 deletion doc/trs/zho_cn/ACKS.md
@@ -47,7 +47,7 @@
23|[Sacremoses](https://github.com/hplt-project/sacremoses)|0.1.1|Liling Tan, Jelmer van der Linde|[MIT](https://github.com/hplt-project/sacremoses/blob/master/LICENSE)
24|[SciPy](https://scipy.org/scipylib/)|1.11.3|SciPy 开发人员|[BSD-3-Clause](https://github.com/scipy/scipy/blob/main/LICENSE.txt)
25|[simplemma](https://github.com/adbar/simplemma)|1.0.0|Adrien Barbaresi|[MIT](https://github.com/adbar/simplemma/blob/main/LICENSE)
26|[spaCy](https://spacy.io/)|3.7.2|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
26|[spaCy](https://spacy.io/)|3.7.5|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
27|[spacy-pkuseg](https://github.com/explosion/spacy-pkuseg)|0.0.33|罗睿轩, 许晶晶, 任宣丞, 张艺, 张之远, 位冰镇, 孙栩<br>Adriane Boyd, Ines Montani|[MIT](https://github.com/explosion/spacy-pkuseg/blob/master/LICENSE)
28|[Stanza](https://github.com/stanfordnlp/stanza)|1.7.0|齐鹏, 张宇浩, 张钰晖,<br>Jason Bolton, Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
29|[SudachiPy](https://github.com/WorksApplications/sudachi.rs)|0.6.8|Works Applications Co., Ltd.|[Apache-2.0](https://github.com/WorksApplications/sudachi.rs/blob/develop/LICENSE)
2 changes: 1 addition & 1 deletion doc/trs/zho_tw/ACKS.md
@@ -47,7 +47,7 @@
23|[Sacremoses](https://github.com/hplt-project/sacremoses)|0.1.1|Liling Tan, Jelmer van der Linde|[MIT](https://github.com/hplt-project/sacremoses/blob/master/LICENSE)
24|[SciPy](https://scipy.org/scipylib/)|1.11.3|SciPy 开发人员|[BSD-3-Clause](https://github.com/scipy/scipy/blob/main/LICENSE.txt)
25|[simplemma](https://github.com/adbar/simplemma)|1.0.0|Adrien Barbaresi|[MIT](https://github.com/adbar/simplemma/blob/main/LICENSE)
26|[spaCy](https://spacy.io/)|3.7.2|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
26|[spaCy](https://spacy.io/)|3.7.5|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
27|[spacy-pkuseg](https://github.com/explosion/spacy-pkuseg)|0.0.33|罗睿轩, 许晶晶, 任宣丞, 张艺, 张之远, 位冰镇, 孙栩<br>Adriane Boyd, Ines Montani|[MIT](https://github.com/explosion/spacy-pkuseg/blob/master/LICENSE)
28|[Stanza](https://github.com/stanfordnlp/stanza)|1.7.0|齐鹏, 张宇浩, 张钰晖,<br>Jason Bolton, Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
29|[SudachiPy](https://github.com/WorksApplications/sudachi.rs)|0.6.8|Works Applications Co., Ltd.|[Apache-2.0](https://github.com/WorksApplications/sudachi.rs/blob/develop/LICENSE)
2 changes: 1 addition & 1 deletion requirements/requirements_tests.txt
@@ -42,7 +42,7 @@ pymorphy3-dicts-ru == 2.4.417150.4580142
pymorphy3-dicts-uk == 2.4.1.1.1663094765

## spaCy
spacy == 3.7.2
spacy == 3.7.5
spacy-lookups-data == 1.0.5
spacy-pkuseg == 0.0.33

4 changes: 4 additions & 0 deletions tests/tests_nlp/test_word_tokenization.py
@@ -145,6 +145,8 @@ def test_word_tokenize(lang, word_tokenizer):
tests_lang_util_skipped = True
case 'est':
assert tokens == ['Eesti', 'keelel', 'on', 'kaks', 'suuremat', 'murderühma', '(', 'põhjaeesti', 'ja', 'lõunaeesti', ')', ',', 'mõnes', 'käsitluses', 'eristatakse', 'ka', 'kirderanniku', 'murdeid', 'eraldi', 'murderühmana', '.']
case 'fao':
assert tokens == ['Føroyskt', 'er', 'høvuðsmálið', 'í', 'Føroyum', '.']
case 'fin':
assert tokens == ['Suomen', 'kieli', 'eli', 'suomi', 'on', 'uralilaisten', 'kielten', 'itämerensuomalaiseen', 'ryhmään', 'kuuluva', 'kieli', ',', 'jota', 'puhuvat', 'pääosin', 'suomalaiset', '.']
case 'fra':
@@ -247,6 +249,8 @@ def test_word_tokenize(lang, word_tokenizer):
assert tokens == ['नेपाली', 'भाषा', '(', 'अन्तर्राष्ट्रिय', 'ध्वन्यात्मक', 'वर्णमाला', '[', 'neˈpali', 'bʱaʂa', ']', ')', 'नेपालको', 'सम्पर्क', 'भाषा', 'तथा', 'भारत', ',', 'भुटान', 'र', 'म्यानमारको', 'केही', 'भागमा', 'मातृभाषाको', 'रूपमा', 'बोलिने', 'भाषा', 'हो', '।']
case 'nob':
assert tokens == ['Bokmål', 'er', 'en', 'varietet', 'av', 'norsk', 'skriftspråk', '.']
case 'nno':
assert tokens == ['Nynorsk', ',', 'før', '1929', 'offisielt', 'kalla', 'landsmål', ',', 'er', 'sidan', 'jamstillingsvedtaket', 'av', '12.', 'mai', '1885', 'ei', 'av', 'dei', 'to', 'offisielle', 'målformene', 'av', 'norsk', ';', 'den', 'andre', 'forma', 'er', 'bokmål', '.']
case 'ori':
assert tokens == ['ଓଡ଼ିଆ', '(', 'ଇଂରାଜୀ', 'ଭାଷାରେ', 'Odia', '/', 'əˈdiːə', '/', 'or', 'Oriya', '/', 'ɒˈriːə', '/', ',', ')', 'ଏକ', 'ଭାରତୀୟ', 'ଭାଷା', 'ଯାହା', 'ଏକ', 'ଇଣ୍ଡୋ-ଇଉରୋପୀୟ', 'ଭାଷାଗୋଷ୍ଠୀ', 'ଅନ୍ତର୍ଗତ', 'ଇଣ୍ଡୋ-ଆର୍ଯ୍ୟ', 'ଭାଷା', '।']
case 'fas':
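
The expected token lists above show why a language-aware tokenizer matters. The following is a minimal sketch using a plain regular expression as a stand-in, not spaCy's tokenizer; the input strings are reconstructed from the expected tokens in the test and are therefore assumptions, and `naive_tokenize` is a hypothetical helper.

```python
import re

def naive_tokenize(text):
    # Runs of word characters, or single non-space punctuation marks.
    return re.findall(r'\w+|[^\w\s]', text)

# Sentence reconstructed from the expected 'fao' tokens (an assumption).
fao_tokens = naive_tokenize('Føroyskt er høvuðsmálið í Føroyum.')

# Fragment reconstructed from the expected 'nno' tokens (an assumption).
nno_tokens = naive_tokenize('av 12. mai 1885')

print(fao_tokens)
print(nno_tokens)
```

On the Faroese sentence the naive split happens to agree with the asserted tokens. On the Nynorsk fragment it separates `12` from `.`, whereas the `nno` assertion expects the ordinal `12.` as one token.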
2 changes: 1 addition & 1 deletion utils/wl_generate_acks.py
@@ -138,7 +138,7 @@
'MIT', 'https://github.com/adbar/simplemma/blob/main/LICENSE'
], [
'spaCy', 'https://spacy.io/',
'3.7.2', "Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann",
'3.7.5', "Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann",
'MIT', 'https://github.com/explosion/spaCy/blob/master/LICENSE'
], [
'spacy-pkuseg', 'https://github.com/explosion/spacy-pkuseg',
7 changes: 2 additions & 5 deletions wordless/wl_nlp/wl_nlp_utils.py
@@ -328,11 +328,6 @@ def run(self):
]

def init_model_spacy(main, lang, sentencizer_only = False):
if lang == 'nno':
lang = 'nob'
else:
lang = wl_conversion.remove_lang_code_suffixes(main, lang)

sentencizer_config = {'punct_chars': list(wl_sentence_tokenization.SENTENCE_TERMINATORS)}

# Sentencizer
@@ -341,6 +336,8 @@ def init_model_spacy(main, lang, sentencizer_only = False):
main.__dict__['spacy_nlp_sentencizer'] = spacy.blank('en')
main.__dict__['spacy_nlp_sentencizer'].add_pipe('sentencizer', config = sentencizer_config)
else:
lang = wl_conversion.remove_lang_code_suffixes(main, lang)

if f'spacy_nlp_{lang}' not in main.__dict__:
# Languages with models
if lang in LANGS_SPACY:
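
The effect of dropping the `nno` special case can be sketched in isolation. This is an illustrative reduction, not the project's code: `remove_lang_code_suffixes` below is a hypothetical stand-in for `wl_conversion.remove_lang_code_suffixes` that assumes variant codes of the form `eng_us`, and `cache_key_before` / `cache_key_after` are names invented for the comparison.

```python
def remove_lang_code_suffixes(lang):
    # Hypothetical stand-in: assumes variant codes look like 'eng_us'
    # and keeps only the part before the underscore.
    return lang.split('_')[0]

def cache_key_before(lang):
    # Removed by this commit: Nynorsk was redirected to the Bokmål pipeline.
    if lang == 'nno':
        lang = 'nob'
    else:
        lang = remove_lang_code_suffixes(lang)
    return f'spacy_nlp_{lang}'

def cache_key_after(lang):
    # After this commit: every language, 'nno' included, keeps its own key.
    lang = remove_lang_code_suffixes(lang)
    return f'spacy_nlp_{lang}'

print(cache_key_before('nno'), cache_key_after('nno'))
```

Under these assumptions, `nno` previously resolved to the `spacy_nlp_nob` entry and now resolves to `spacy_nlp_nno`, while other codes are unaffected.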
4 changes: 4 additions & 0 deletions wordless/wl_settings/wl_settings_global.py
@@ -573,6 +573,7 @@
_tr('wl_settings_global', 'spaCy - Dutch word tokenizer'): 'spacy_nld',
_tr('wl_settings_global', 'spaCy - English word tokenizer'): 'spacy_eng',
_tr('wl_settings_global', 'spaCy - Estonian word tokenizer'): 'spacy_est',
_tr('wl_settings_global', 'spaCy - Faroese word tokenizer'): 'spacy_fao',
_tr('wl_settings_global', 'spaCy - Finnish word tokenizer'): 'spacy_fin',
_tr('wl_settings_global', 'spaCy - French word tokenizer'): 'spacy_fra',
_tr('wl_settings_global', 'spaCy - German word tokenizer'): 'spacy_deu',
@@ -602,6 +603,7 @@
_tr('wl_settings_global', 'spaCy - Marathi word tokenizer'): 'spacy_mar',
_tr('wl_settings_global', 'spaCy - Nepali word tokenizer'): 'spacy_nep',
_tr('wl_settings_global', 'spaCy - Norwegian (Bokmål) word tokenizer'): 'spacy_nob',
_tr('wl_settings_global', 'spaCy - Norwegian (Nynorsk) word tokenizer'): 'spacy_nno',
_tr('wl_settings_global', 'spaCy - Persian word tokenizer'): 'spacy_fas',
_tr('wl_settings_global', 'spaCy - Polish word tokenizer'): 'spacy_pol',
_tr('wl_settings_global', 'spaCy - Portuguese word tokenizer'): 'spacy_por',
@@ -1977,6 +1979,7 @@

'fao': [
'nltk_nist', 'nltk_nltk', 'nltk_regex', 'nltk_twitter',
'spacy_fao',
'stanza_fao'
],

@@ -2227,6 +2230,7 @@
],
'nno': [
'nltk_nist', 'nltk_nltk', 'nltk_regex', 'nltk_twitter',
'spacy_nno',
'stanza_nno'
],
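
Each new tokenizer is registered twice in this file: once in the display-name mapping and once in the per-language list. A minimal sketch of a consistency check between the two follows; the dictionaries are trimmed excerpts of the entries shown in the hunks, and the variable names and the helper `unnamed_spacy_tokenizers` are hypothetical, not taken from Wordless.

```python
# Trimmed excerpts of the two structures touched above (names are assumptions).
TOKENIZER_NAMES = {
    'spaCy - Faroese word tokenizer': 'spacy_fao',
    'spaCy - Norwegian (Nynorsk) word tokenizer': 'spacy_nno',
}
TOKENIZERS_BY_LANG = {
    'fao': ['nltk_nist', 'nltk_nltk', 'nltk_regex', 'nltk_twitter', 'spacy_fao', 'stanza_fao'],
    'nno': ['nltk_nist', 'nltk_nltk', 'nltk_regex', 'nltk_twitter', 'spacy_nno', 'stanza_nno'],
}

def unnamed_spacy_tokenizers(names, by_lang):
    # Collect spaCy tokenizer ids offered for some language
    # that have no entry in the display-name mapping.
    known = set(names.values())
    return sorted({
        tok
        for toks in by_lang.values()
        for tok in toks
        if tok.startswith('spacy_') and tok not in known
    })

missing = unnamed_spacy_tokenizers(TOKENIZER_NAMES, TOKENIZERS_BY_LANG)
print(missing)
```

The check is limited to `spacy_` ids because only the spaCy names appear in the excerpt; with both new entries present it reports nothing missing.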
