Dependencies: Upgrade spaCy to 3.7.5; Utils: Add spaCy's Faroese and Norwegian (Nynorsk) word tokenizers
BLKSerene · Jun 18, 2024 · c4074d2
1 parent bff0c56
Showing 10 changed files with 20 additions and 14 deletions.
diff --git a/ACKS.md b/ACKS.md
@@ -47,7 +47,7 @@ As Wordless stands on the shoulders of giants, I hereby extend my sincere gratit
 23|[Sacremoses](https://github.com/hplt-project/sacremoses)|0.1.1|Liling Tan, Jelmer van der Linde|[MIT](https://github.com/hplt-project/sacremoses/blob/master/LICENSE)
 24|[SciPy](https://scipy.org/scipylib/)|1.11.3|SciPy Developers|[BSD-3-Clause](https://github.com/scipy/scipy/blob/main/LICENSE.txt)
 25|[simplemma](https://github.com/adbar/simplemma)|1.0.0|Adrien Barbaresi|[MIT](https://github.com/adbar/simplemma/blob/main/LICENSE)
-26|[spaCy](https://spacy.io/)|3.7.2|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
+26|[spaCy](https://spacy.io/)|3.7.5|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
 27|[spacy-pkuseg](https://github.com/explosion/spacy-pkuseg)|0.0.33|Ruixuan Luo (罗睿轩), Jingjing Xu (许晶晶),<br>Xuancheng Ren (任宣丞), Yi Zhang (张艺),<br>Zhiyuan Zhang (张之远), Bingzhen Wei (位冰镇),<br>Xu Sun (孙栩)<br>Adriane Boyd, Ines Montani|[MIT](https://github.com/explosion/spacy-pkuseg/blob/master/LICENSE)
 28|[Stanza](https://github.com/stanfordnlp/stanza)|1.7.0|Peng Qi (齐鹏), Yuhao Zhang (张宇浩),<br>Yuhui Zhang (张钰晖), Jason Bolton,<br>Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
 29|[SudachiPy](https://github.com/WorksApplications/sudachi.rs)|0.6.8|Works Applications Co., Ltd.|[Apache-2.0](https://github.com/WorksApplications/sudachi.rs/blob/develop/LICENSE)

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -24,10 +24,11 @@
 - Measures: Add lexical density/diversity - Brunét's Index / Honoré's statistic / Lexical Density
 - Settings: Add Settings - Stop Word Lists - Stop Word List Settings - Case-sensitive
 - Settings: Add Settings - Tables - Dependency Parser
+- Utils: Add encoding detection - UTF-8 with BOM
 - Utils: Add Pyphen's Basque syllable tokenizer
 - Utils: Add PyThaiNLP's Han-solo
+- Utils: Add spaCy's Faroese and Norwegian (Nynorsk) word tokenizers
 - Utils: Add Stanza's Sindhi part-of-speech tagger
-- Utils: Add encoding detection - UTF-8 with BOM 
 - Utils: Add VADER's sentiment analyzers
 - Work Area: Add Colligation Extractor - Filter results - Node/Colligation length
 - Work Area: Add Collocation Extractor - Filter results - Node/Collocation length
@@ -71,7 +72,7 @@
 - Dependencies: Upgrade Requests to 2.32.2
 - Dependencies: Upgrade Sacremoses to 0.1.1
 - Dependencies: Upgrade simplemma to 1.0.0
-- Dependencies: Upgrade spaCy to 3.7.2
+- Dependencies: Upgrade spaCy to 3.7.5
 - Dependencies: Upgrade spacy-pkuseg to 0.0.33
 - Dependencies: Upgrade Stanza to 1.7.0
 - Dependencies: Upgrade SudachiPy to 0.6.8

diff --git a/doc/doc.md b/doc/doc.md
@@ -638,7 +638,7 @@ You can generate line charts or word clouds for keywords using any statistics. Y
 ### [4.1 Supported Languages](#doc)
 
 Language|Sentence Token-ization|Word Token-ization|Syllable Token-ization|Part-of-speech Tagging|Lemma-tization|Stop Word List|Depen-dency Parsing|Senti-ment Analysis
-:--------------------------------:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:
+:-----------------------:|:-:|:-:|:-:|:-:|:-:|:-:|:-:|:-:
 Afrikaans                |✔|✔|✔|✔|✔|✖️|✔|✔
 Albanian                 |⭕️ |✔|✔|✖️|✔|✖️|✖️|✔
 Amharic                  |⭕️ |✔|✖️|✖️|✖️|✖️|✖️|✔
@@ -648,7 +648,7 @@ Armenian (Western)       |✔|✔|✖️|✔|✔|✖️|✔|✔
 Assamese                 |⭕️ |✔|✖️|✖️|✖️|✖️|✖️|✔
 Asturian                 |⭕️ |⭕️ |✖️|✖️|✔|✖️|✖️|✖️
 Azerbaijani              |⭕️ |✔|✖️|✖️|✖️|✔|✖️|✔
-Basque                   |✔|✔|✖️|✔|✔|✔|✔|✔
+Basque                   |✔|✔|✔|✔|✔|✔|✔|✔
 Belarusian               |✔|✔|✔|✔|✔|✖️|✔|✔
 Bengali                  |⭕️ |✔|✖️|✖️|✔|✔|✖️|✔
 Bulgarian                |✔|✔|✔|✔|✔|✖️|✔|✔

diff --git a/doc/trs/zho_cn/ACKS.md b/doc/trs/zho_cn/ACKS.md
@@ -47,7 +47,7 @@
 23|[Sacremoses](https://github.com/hplt-project/sacremoses)|0.1.1|Liling Tan, Jelmer van der Linde|[MIT](https://github.com/hplt-project/sacremoses/blob/master/LICENSE)
 24|[SciPy](https://scipy.org/scipylib/)|1.11.3|SciPy 开发人员|[BSD-3-Clause](https://github.com/scipy/scipy/blob/main/LICENSE.txt)
 25|[simplemma](https://github.com/adbar/simplemma)|1.0.0|Adrien Barbaresi|[MIT](https://github.com/adbar/simplemma/blob/main/LICENSE)
-26|[spaCy](https://spacy.io/)|3.7.2|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
+26|[spaCy](https://spacy.io/)|3.7.5|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
 27|[spacy-pkuseg](https://github.com/explosion/spacy-pkuseg)|0.0.33|罗睿轩, 许晶晶, 任宣丞, 张艺, 张之远, 位冰镇, 孙栩<br>Adriane Boyd, Ines Montani|[MIT](https://github.com/explosion/spacy-pkuseg/blob/master/LICENSE)
 28|[Stanza](https://github.com/stanfordnlp/stanza)|1.7.0|齐鹏, 张宇浩, 张钰晖,<br>Jason Bolton, Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
 29|[SudachiPy](https://github.com/WorksApplications/sudachi.rs)|0.6.8|Works Applications Co., Ltd.|[Apache-2.0](https://github.com/WorksApplications/sudachi.rs/blob/develop/LICENSE)

diff --git a/doc/trs/zho_tw/ACKS.md b/doc/trs/zho_tw/ACKS.md
@@ -47,7 +47,7 @@
 23|[Sacremoses](https://github.com/hplt-project/sacremoses)|0.1.1|Liling Tan, Jelmer van der Linde|[MIT](https://github.com/hplt-project/sacremoses/blob/master/LICENSE)
 24|[SciPy](https://scipy.org/scipylib/)|1.11.3|SciPy 开发人员|[BSD-3-Clause](https://github.com/scipy/scipy/blob/main/LICENSE.txt)
 25|[simplemma](https://github.com/adbar/simplemma)|1.0.0|Adrien Barbaresi|[MIT](https://github.com/adbar/simplemma/blob/main/LICENSE)
-26|[spaCy](https://spacy.io/)|3.7.2|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
+26|[spaCy](https://spacy.io/)|3.7.5|Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann|[MIT](https://github.com/explosion/spaCy/blob/master/LICENSE)
 27|[spacy-pkuseg](https://github.com/explosion/spacy-pkuseg)|0.0.33|罗睿轩, 许晶晶, 任宣丞, 张艺, 张之远, 位冰镇, 孙栩<br>Adriane Boyd, Ines Montani|[MIT](https://github.com/explosion/spacy-pkuseg/blob/master/LICENSE)
 28|[Stanza](https://github.com/stanfordnlp/stanza)|1.7.0|齐鹏, 张宇浩, 张钰晖,<br>Jason Bolton, Tim Dozat, John Bauer|[Apache-2.0](https://github.com/stanfordnlp/stanza/blob/main/LICENSE)
 29|[SudachiPy](https://github.com/WorksApplications/sudachi.rs)|0.6.8|Works Applications Co., Ltd.|[Apache-2.0](https://github.com/WorksApplications/sudachi.rs/blob/develop/LICENSE)

diff --git a/requirements/requirements_tests.txt b/requirements/requirements_tests.txt
@@ -42,7 +42,7 @@ pymorphy3-dicts-ru == 2.4.417150.4580142
 pymorphy3-dicts-uk == 2.4.1.1.1663094765
 
 ## spaCy
-spacy == 3.7.2
+spacy == 3.7.5
 spacy-lookups-data == 1.0.5
 spacy-pkuseg == 0.0.33
 

diff --git a/tests/tests_nlp/test_word_tokenization.py b/tests/tests_nlp/test_word_tokenization.py
@@ -145,6 +145,8 @@ def test_word_tokenize(lang, word_tokenizer):
                 tests_lang_util_skipped = True
         case 'est':
             assert tokens == ['Eesti', 'keelel', 'on', 'kaks', 'suuremat', 'murderühma', '(', 'põhjaeesti', 'ja', 'lõunaeesti', ')', ',', 'mõnes', 'käsitluses', 'eristatakse', 'ka', 'kirderanniku', 'murdeid', 'eraldi', 'murderühmana', '.']
+        case 'fao':
+            assert tokens == ['Føroyskt', 'er', 'høvuðsmálið', 'í', 'Føroyum', '.']
         case 'fin':
             assert tokens == ['Suomen', 'kieli', 'eli', 'suomi', 'on', 'uralilaisten', 'kielten', 'itämerensuomalaiseen', 'ryhmään', 'kuuluva', 'kieli', ',', 'jota', 'puhuvat', 'pääosin', 'suomalaiset', '.']
         case 'fra':
@@ -247,6 +249,8 @@ def test_word_tokenize(lang, word_tokenizer):
             assert tokens == ['नेपाली', 'भाषा', '(', 'अन्तर्राष्ट्रिय', 'ध्वन्यात्मक', 'वर्णमाला', '[', 'neˈpali', 'bʱaʂa', ']', ')', 'नेपालको', 'सम्पर्क', 'भाषा', 'तथा', 'भारत', ',', 'भुटान', 'र', 'म्यानमारको', 'केही', 'भागमा', 'मातृभाषाको', 'रूपमा', 'बोलिने', 'भाषा', 'हो', '।']
         case 'nob':
             assert tokens == ['Bokmål', 'er', 'en', 'varietet', 'av', 'norsk', 'skriftspråk', '.']
+        case 'nno':
+            assert tokens == ['Nynorsk', ',', 'før', '1929', 'offisielt', 'kalla', 'landsmål', ',', 'er', 'sidan', 'jamstillingsvedtaket', 'av', '12.', 'mai', '1885', 'ei', 'av', 'dei', 'to', 'offisielle', 'målformene', 'av', 'norsk', ';', 'den', 'andre', 'forma', 'er', 'bokmål', '.']
         case 'ori':
             assert tokens == ['ଓଡ଼ିଆ', '(', 'ଇଂରାଜୀ', 'ଭାଷାରେ', 'Odia', '/', 'əˈdiːə', '/', 'or', 'Oriya', '/', 'ɒˈriːə', '/', ',', ')', 'ଏକ', 'ଭାରତୀୟ', 'ଭାଷା', 'ଯାହା', 'ଏକ', 'ଇଣ୍ଡୋ-ଇଉରୋପୀୟ', 'ଭାଷାଗୋଷ୍ଠୀ', 'ଅନ୍ତର୍ଗତ', 'ଇଣ୍ଡୋ-ଆର୍ଯ୍ୟ', 'ଭାଷା', '।']
         case 'fas':

diff --git a/utils/wl_generate_acks.py b/utils/wl_generate_acks.py
@@ -138,7 +138,7 @@
         'MIT', 'https://github.com/adbar/simplemma/blob/main/LICENSE'
     ], [
         'spaCy', 'https://spacy.io/',
-        '3.7.2', "Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann",
+        '3.7.5', "Matthew Honnibal, Ines Montani, Sofie Van Landeghem,<br>Adriane Boyd, Paul O'Leary McCann",
         'MIT', 'https://github.com/explosion/spaCy/blob/master/LICENSE'
     ], [
         'spacy-pkuseg', 'https://github.com/explosion/spacy-pkuseg',

diff --git a/wordless/wl_nlp/wl_nlp_utils.py b/wordless/wl_nlp/wl_nlp_utils.py
@@ -328,11 +328,6 @@ def run(self):
 ]
 
 def init_model_spacy(main, lang, sentencizer_only = False):
-    if lang == 'nno':
-        lang = 'nob'
-    else:
-        lang = wl_conversion.remove_lang_code_suffixes(main, lang)
-
     sentencizer_config = {'punct_chars': list(wl_sentence_tokenization.SENTENCE_TERMINATORS)}
 
     # Sentencizer
@@ -341,6 +336,8 @@ def init_model_spacy(main, lang, sentencizer_only = False):
             main.__dict__['spacy_nlp_sentencizer'] = spacy.blank('en')
             main.__dict__['spacy_nlp_sentencizer'].add_pipe('sentencizer', config = sentencizer_config)
     else:
+        lang = wl_conversion.remove_lang_code_suffixes(main, lang)
+
         if f'spacy_nlp_{lang}' not in main.__dict__:
             # Languages with models
             if lang in LANGS_SPACY:
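
The hunks above drop the special case that redirected Norwegian (Nynorsk) to the Norwegian (Bokmål) pipeline, and move the suffix stripping into the non-sentencizer branch. A minimal sketch of the before/after language resolution follows. It assumes that `wl_conversion.remove_lang_code_suffixes` strips a regional suffix such as `_us`; the stand-in below, along with `resolve_lang_old` and `resolve_lang_new`, is hypothetical and not part of Wordless.

```python
def remove_lang_code_suffixes(lang):
    # Hypothetical stand-in for wl_conversion.remove_lang_code_suffixes:
    # assumed to drop a regional suffix, e.g. 'eng_us' becomes 'eng'
    return lang.split('_')[0]

def resolve_lang_old(lang):
    # Before this commit: Nynorsk was redirected to the Bokmål pipeline
    if lang == 'nno':
        return 'nob'

    return remove_lang_code_suffixes(lang)

def resolve_lang_new(lang):
    # After this commit: every language, including Nynorsk, keeps its own code
    return remove_lang_code_suffixes(lang)

for code in ('nno', 'nob', 'eng_us'):
    print(code, resolve_lang_old(code), resolve_lang_new(code))
```

Under these assumptions only `nno` resolves differently between the two versions, which is consistent with the new `spacy_nno` tokenizer entry added elsewhere in this commit.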

diff --git a/wordless/wl_settings/wl_settings_global.py b/wordless/wl_settings/wl_settings_global.py
@@ -573,6 +573,7 @@
             _tr('wl_settings_global', 'spaCy - Dutch word tokenizer'): 'spacy_nld',
             _tr('wl_settings_global', 'spaCy - English word tokenizer'): 'spacy_eng',
             _tr('wl_settings_global', 'spaCy - Estonian word tokenizer'): 'spacy_est',
+            _tr('wl_settings_global', 'spaCy - Faroese word tokenizer'): 'spacy_fao',
             _tr('wl_settings_global', 'spaCy - Finnish word tokenizer'): 'spacy_fin',
             _tr('wl_settings_global', 'spaCy - French word tokenizer'): 'spacy_fra',
             _tr('wl_settings_global', 'spaCy - German word tokenizer'): 'spacy_deu',
@@ -602,6 +603,7 @@
             _tr('wl_settings_global', 'spaCy - Marathi word tokenizer'): 'spacy_mar',
             _tr('wl_settings_global', 'spaCy - Nepali word tokenizer'): 'spacy_nep',
             _tr('wl_settings_global', 'spaCy - Norwegian (Bokmål) word tokenizer'): 'spacy_nob',
+            _tr('wl_settings_global', 'spaCy - Norwegian (Nynorsk) word tokenizer'): 'spacy_nno',
             _tr('wl_settings_global', 'spaCy - Persian word tokenizer'): 'spacy_fas',
             _tr('wl_settings_global', 'spaCy - Polish word tokenizer'): 'spacy_pol',
             _tr('wl_settings_global', 'spaCy - Portuguese word tokenizer'): 'spacy_por',
@@ -1977,6 +1979,7 @@
 
         'fao': [
             'nltk_nist', 'nltk_nltk', 'nltk_regex', 'nltk_twitter',
+            'spacy_fao',
             'stanza_fao'
         ],
 
@@ -2227,6 +2230,7 @@
         ],
         'nno': [
             'nltk_nist', 'nltk_nltk', 'nltk_regex', 'nltk_twitter',
+            'spacy_nno',
             'stanza_nno'
         ],
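
Registering a tokenizer in `wl_settings_global.py` touches two tables in this diff: the display-name mapping and the per-language tokenizer list. A rough sketch of a consistency check between the two is shown below. The table names, the trimmed-down contents, and `find_unnamed_tokenizers` are all hypothetical; only the display names and codes are taken from the hunks above.

```python
# Hypothetical, trimmed-down stand-ins for the two tables edited in this commit
TOKENIZER_NAMES = {
    'spaCy - Faroese word tokenizer': 'spacy_fao',
    'spaCy - Norwegian (Nynorsk) word tokenizer': 'spacy_nno',
}
TOKENIZERS_BY_LANG = {
    'fao': ['spacy_fao'],
    'nno': ['spacy_nno'],
}

def find_unnamed_tokenizers(names, by_lang):
    # Collect tokenizer codes listed for a language but missing a display name
    known = set(names.values())

    return sorted({
        code
        for codes in by_lang.values()
        for code in codes
        if code not in known
    })

print(find_unnamed_tokenizers(TOKENIZER_NAMES, TOKENIZERS_BY_LANG))
```

An empty result means every listed code has a display name; a non-empty result would point at an entry added to one table but not the other.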