
Massive vs. Curated Word Embeddings for Low-Resourced Languages. The Case of Yorùbá and Twi

In the paper, we compare pre-trained word embeddings (FastText and BERT) with embeddings obtained from curated corpora for Yorùbá and Twi. For this purpose, we gather and select corpora and study the most appropriate techniques for the languages. We also create test sets for evaluating the word embeddings within a word similarity task (wordsim353) and the contextual embeddings within a NER task. The corpora and the embeddings are available here.

For the comparison, we define 3 datasets according to the quality and quantity of textual data used for training:

  1. Curated Small Dataset (clean), C1, about 1.6 million tokens for Yorùbá and over 735k tokens for Twi. The clean text for Twi is the Bible and for Yorùbá all texts marked under the C1 column in the figure below.
  2. In Curated Small Dataset (clean + noisy), C2, we add noise to the clean corpus (Wikipedia articles for Twi, and BBC Yorùbá news articles for Yorùbá). This increases the number of training tokens for Twi to 742k tokens and Yorùbá to about 2 million tokens.
  3. Curated Large Dataset, C3, consists of all the available texts we were able to crawl and source, either clean or noisy (Twi, Yorùbá).
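
For illustration, here is a minimal sketch of how static embeddings could be trained on one of these corpora with gensim's FastText implementation. The corpus file name (yoruba_c1.txt), the hyperparameters, and the choice of gensim itself are assumptions for this sketch, not the exact settings used in the paper.

```python
# Minimal sketch: train FastText embeddings on a curated corpus.
# "yoruba_c1.txt" (one sentence per line) and all hyperparameters below
# are illustrative assumptions, not the paper's reported settings.
from gensim.models import FastText

with open("yoruba_c1.txt", encoding="utf-8") as f:
    sentences = [line.split() for line in f]  # whitespace tokenisation

model = FastText(
    sentences=sentences,
    vector_size=300,  # embedding dimensionality
    window=5,         # context window size
    min_count=3,      # drop very rare tokens
    sg=1,             # skip-gram
    epochs=10,
)
model.wv.save("yoruba_c1_fasttext.kv")  # keyed vectors for later evaluation
```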

The evaluation datasets are the following:

  1. Translated WordSim-353 for Twi
  2. Translated WordSim-353 for Yorùbá
  3. Yorùbá NER dataset
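
As a rough illustration of how the word-similarity evaluation works, the sketch below scores a set of trained vectors against a translated WordSim-353 file with Spearman correlation. The file names and the tab-separated word1/word2/score format are assumptions for this sketch; gensim's built-in evaluate_word_pairs could be used instead.

```python
# Minimal sketch: evaluate embeddings on a translated WordSim-353 file.
# "wordsim_yo.tsv" (word1<TAB>word2<TAB>human score) and the vector file
# name are illustrative assumptions.
from gensim.models import KeyedVectors
from scipy.stats import spearmanr

wv = KeyedVectors.load("yoruba_c1_fasttext.kv")

model_scores, human_scores = [], []
with open("wordsim_yo.tsv", encoding="utf-8") as f:
    for line in f:
        parts = line.strip().split("\t")
        if len(parts) != 3:
            continue
        w1, w2, gold = parts
        try:
            gold = float(gold)
        except ValueError:
            continue  # skip a header row, if present
        if w1 in wv and w2 in wv:
            model_scores.append(wv.similarity(w1, w2))
            human_scores.append(gold)

rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation on {len(model_scores)} pairs: {rho:.3f}")
```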

The dataset from the Niger-Volta Language Technology Institute can be obtained here.

To reproduce the NER results using our fine-tuned BERT models, please use the modified BERT-NER code. First, copy all the BERT embeddings from the Google Drive, together with Google's uncased-multilingual-bert-base model, into the bert_models directory of the repository. Then, run the following bash scripts in BERT-NER:

  1. sh run_ner_yorubaPM.sh for the baseline model, i.e. the uncased Multilingual-BERT
  2. sh run_ner_yorubaFMB.sh for the uncased Multilingual-BERT fine-tuned on the Yorùbá corpus with the default multilingual vocab.txt
  3. sh run_ner_yorubaFM.sh for the uncased Multilingual-BERT fine-tuned on the Yorùbá corpus but with a mostly Yorùbá vocabulary (i.e. in vocab.txt). We found that fine-tuning with the Yorùbá vocabulary gave better performance than using the multilingual vocab.txt.

If you use any of the resources on this page, please cite the paper:

Jesujoba O. Alabi, Kwabena Amponsah-Kaakyire, David I. Adelani, and Cristina España-Bonet. Massive vs. Curated Word Embeddings for Low-Resourced Languages. The Case of Yorùbá and Twi. In LREC, 2020.
