- Meadow Mari monolingual corpus: 1.4M texts, over 20M word occurrences, 19 genres.
- Meadow Mari corpora (web-corpora.net): web corpora with 5.53M word occurrences in the main monolingual corpus, and 3.59M Mari and 15.11M Russian word occurrences in the social media corpus (comments on VK).
- Mari-language Korp (Vienna): web corpora with 57.38M tokens in the Meadow Mari corpus and 6.25M tokens in the Hill Mari corpus.
- Hill Mari Corpus (tilda): Hill Mari corpus with 63,522 word occurrences in Latin transcription.
- Tatoeba: corpus of parallel sentences, with 3,869 pairs for Meadow Mari and Russian and 72 sentences for Hill Mari and Russian.
- Wiki-dumps hours of Meadow and Hill Mari audio with transcriptions.
- UralicNLP: pretrained morphological analysers/generators and lemmatisation for Uralic languages, including Meadow Mari and Hill Mari.
- TartuNLP “Smugri”: multilingual neural machine translation model for low-resource Finno-Ugric languages. Includes Meadow Mari (average to-lang scores: 8.51 BLEU, 43.42 chrF, 38.76 chrF++) and Hill Mari (average to-lang scores: 7.30 BLEU, 40.81 chrF, 36.40 chrF++).
- Adapters trained on the Wikipedia corpus for Meadow Mari: BERT, XLM-R.
- GiellaLT: tools for building morphological analysers, proofing tools and dictionaries. Meadow Mari repository. Hill Mari repository.
- Apertium: morphological analysis and generation, PoS tagging.
- Mari-lab community