Using AI to restore Diacritics on Yoruba language (which is a low resource language)

Crinmatic/Diacritic-Restoration

Diacritic-Restoration

A project on diacritic restoration of languages...

Contributors

| Contributor | Social Media post | Article |
| --- | --- | --- |
| Alao David | LinkedIn | |
| Alagbe Seun | LinkedIn | Medium |

The Problem:

Yoruba, or more specifically "Èdè Yorùbá", is a major language spoken by the indigenous people of the Yoruba tribe in Nigeria (more than 45 million people globally).

The analysis below shows why solving this problem matters:
“Yoruba is a tonal language and Diacritics are used to mark the tone of a vowel. The dot accent or alternatively the vertical line below, are used to mark open variants of vowels, namely ẹ and ọ for [ɛ] and [ɔ]; or below the s which transcribes the [ʃ], a post-alveolar consonant like 'sh' in English.”

The Yoruba vowels, and the consonant s, take the following forms:

a => (á à ā)  
e => (é è ē ẹ/e̩ ẹ́/é̩ ẹ̀/è̩ ẹ̄/ē̩)  
i => (í ì ī)  
o => (ó ò ō ọ/o̩ ọ́/ó̩ ọ̀/ò̩ ọ̄/ō̩)  
s => (ṣ s)  
u => (ú ù ū)
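Every marked form above is a base letter plus one or more Unicode combining marks, so the plain form can be recovered by decomposing the text and dropping those marks. The snippet below is a minimal sketch of that idea using only Python's standard `unicodedata` module. The function name `strip_diacritics` is hypothetical and is not taken from the project's notebooks.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with all combining marks (tone marks, under-dots) removed."""
    # NFD splits each marked letter into a base letter followed by its marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks only.
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", plain)


print(strip_diacritics("Èdè Yorùbá"))  # Ede Yoruba
```

Stripping is the easy direction. Restoring the marks is the hard one, because several marked words can share one plain form.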

Swapping one of these letters for another changes how a word is pronounced, and therefore what it means.
For example:

A) oro = oró  
As in “oh-row” (oró is “sting” when translated to English)  

B) oro = ọ̀rọ̀  
As in “aw-raw” (ọ̀rọ̀ is “text” or "word" when translated to English)  

C) ojo = òjò  
As in “oh-joe” (òjò is “rain” when translated to English)  

D) ojo = ọjọ́  
As in “or-jaw” (ọjọ́ is “day” when translated to English)  

It's awesome, right?

Unfortunately, there's a problem. The words might be easy to pronounce (maybe not so much for beginners), but when writing or typing Yoruba text we often disregard the tone marks, and one of two things happens:

  1. We fumble the tone marks on the words we intend to type, even when we know how to pronounce them well.
  2. Some people skip the tone marks altogether, because typing plain words and then adding the marks takes quite a while.
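To illustrate what is lost when tone marks are dropped, the sketch below groups a few marked words by their unmarked spelling. It uses only Python's standard library. The helper names `strip_diacritics` and `group_by_plain_form` are hypothetical, and the word list is simply the four words discussed in this README.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text: str) -> str:
    """Drop every combining mark after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def group_by_plain_form(words):
    """Map each unmarked spelling to the marked words that collapse onto it."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_diacritics(word)].append(unicodedata.normalize("NFC", word))
    return dict(groups)


examples = ["oró", "ọ̀rọ̀", "òjò", "ọjọ́"]
for plain, marked in group_by_plain_form(examples).items():
    print(plain, "->", marked)
```

Both "oro" and "ojo" end up with two candidate words each, so the unmarked text alone cannot tell the reader which word was meant. A restoration model has to choose among such candidates from context.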

The Solution:

[Excerpt from Alao David's post]

Being Yorùbá, working on this project was a thing of delight for me and Oluwaseun Alagbe, and definitely an avenue to uplift our culture. 🌻

What we've built here is an A.I. (Natural Language Processing [NLP]) model that uses Artificial Neural Networks 🤖 to learn the patterns of diacritical (tonal) marks, or "Àmì ohùn", across thousands of Yorùbá words. This is possible thanks to the abundant Yorùbá text data that we mined from open-source Yorùbá literary texts, i.e. books, news articles, etc.

This model, when deployed, would go a long way in aiding proper communication in our mother tongue. ✍🏾
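The excerpt does not say how the network's inputs and targets are encoded. One common way to frame diacritic restoration is per-character labeling: the input is the sequence of base letters, and the target for each letter is the string of combining marks attached to it. The sketch below shows that encoding and its inverse under this assumption. It is not the project's confirmed method, and the names `to_char_labels` and `from_char_labels` are hypothetical.

```python
import unicodedata


def to_char_labels(text: str):
    """Split text into (base_letter, combining_marks) pairs."""
    pairs = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch) and pairs:
            # Attach the mark to the most recent base letter.
            base, marks = pairs[-1]
            pairs[-1] = (base, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def from_char_labels(pairs) -> str:
    """Rebuild marked text from (base_letter, combining_marks) pairs."""
    return unicodedata.normalize("NFC", "".join(base + marks for base, marks in pairs))


pairs = to_char_labels("ọjọ́")
print([base for base, _ in pairs])  # ['o', 'j', 'o']
print(from_char_labels(pairs) == unicodedata.normalize("NFC", "ọjọ́"))  # True
```

With this encoding, a model would read the base letters and predict one label per letter, and `from_char_labels` would turn its predictions back into marked text.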

Please note

Notebook 1 - Data Mining
Notebook 2 - Text Processing
Notebook 3 - Training and Evaluation
