DrozdovDan/mari-nlp

Mari NLP

Datasets

Corpora

  • Meadow Mari monolingual corpus: 1.4M texts, over 20M word occurrences, 19 genres.

  • Meadow Mari corpora (web-corpora.net): web corpora with 5.53M word occurrences in the main monolingual corpus, plus 3.59M Mari and 15.11M Russian word occurrences in the social media corpus (comments on VK).

  • Mari-language Korp (Vienna): web corpora with 57.38M tokens in the Meadow Mari corpus and 6.25M tokens in the Hill Mari corpus.

  • Hill Mari Corpus (tilda): Hill Mari corpus with 63522 word occurrences in Latin transcription.

  • Tatoeba: corpus of parallel sentences, with 3869 pairs for Meadow Mari and Russian and 72 sentences for Hill Mari and Russian.

  • Wiki-dumps: hours of Meadow and Hill Mari audio with transcriptions.
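
The corpus sizes above are given in word occurrences (tokens). As a rough illustration of how such a figure can be obtained for a local copy of a corpus, here is a minimal sketch. The function name `count_word_occurrences` and the two sample strings are assumptions made for this example, and the regex tokenizer is a simplification, not necessarily the tokenization used by any of the corpora listed.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of Unicode word characters.
# Python's \w covers Cyrillic letters, including Mari-specific ones (ӧ, ӱ, ҥ, ӓ, ӹ).
WORD_RE = re.compile(r"\w+")


def count_word_occurrences(texts):
    """Return a Counter mapping each lowercased word form to its frequency."""
    counts = Counter()
    for text in texts:
        counts.update(token.lower() for token in WORD_RE.findall(text))
    return counts


# Illustrative input only; a real corpus would be read from files.
sample_texts = ["Марий йылме", "марий мланде"]
counts = count_word_occurrences(sample_texts)
total_occurrences = sum(counts.values())  # size in word occurrences
distinct_forms = len(counts)              # number of distinct word forms
```

The sum of the counter's values corresponds to the "word occurrences" figure, while its length gives the number of distinct word forms.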

Pretrained models

  • UralicNLP: pretrained morphological analysers/generators and lemmatisation for Uralic languages. Includes Meadow Mari and Hill Mari.

  • TartuNLP “Smugri”: multilingual neural machine translation model for low-resource Finno-Ugric languages. Includes Meadow Mari (average to-lang scores: 8.51 BLEU, 43.42 chrF, 38.76 chrF++) and Hill Mari (average to-lang scores: 7.30 BLEU, 40.81 chrF, 36.40 chrF++).

  • Adapters trained on a Wikipedia corpus for Meadow Mari: BERT, XLM-R.
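
The translation scores above include chrF, a character n-gram F-score. To give a feel for what the metric measures, the following is a simplified sketch, not the reference implementation: the name `simple_chrf` is hypothetical, whitespace is dropped before n-grams are extracted, and n-gram orders longer than either string are skipped. Scores from a standard evaluation tool may differ from this sketch, so it should not be used to reproduce the reported numbers.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams of order n, ignoring whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF on a 0-100 scale.

    Precision and recall are averaged over n-gram orders 1..max_n and then
    combined into an F-beta score (beta=2 weights recall more than precision).
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one of the strings is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return 100.0 * (1 + b2) * p * r / (b2 * p + r)
```

Because the comparison is made on characters rather than whole words, a hypothesis that gets a stem right but an inflection wrong still earns partial credit, which is one reason character-level metrics are often reported alongside BLEU for morphologically rich languages.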

Word Similarity

Methods/Software

Miscellaneous
