ChmHsm/latinAr

latinAr

An extensive dataset of dialectal Arabic written in both Latin and Arabic script.

Quick JSON example

One-to-one mapping:
{"latin_ar_arabic": "سير", "american_english": "go"}
{"latin_ar_french": "sir", "american_english": "go"}

One-to-many mapping:
{"latin_ar_arabic": "سير", "classical_arabic": "إذهب", "american_english": "go"}
{"latin_ar_french": "sir", "classical_arabic": "إذهب", "american_english": "go"}

Many-to-many mapping:
{"latin_ar_arabic": ["سير", "مشي"], "classical_arabic": "إذهب", "american_english": "go"}
{"latin_ar_french": ["sir", "mchi"], "classical_arabic": "إذهب", "american_english": "go"}
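A minimal sketch of how records shaped like the ones above could be consumed. It assumes the records are stored as strict JSON (quoted keys) and that a field may hold either one string or a list of strings. The helper names `as_list` and `build_index` are hypothetical and not part of this repo.

```python
import json

# Sample records mirroring the examples above (assumption: strict JSON).
RAW = """
[
  {"latin_ar_french": "sir", "american_english": "go"},
  {"latin_ar_french": ["sir", "mchi"], "classical_arabic": "إذهب", "american_english": "go"}
]
"""

def as_list(value):
    """Return a list whether the field holds one value or several."""
    return value if isinstance(value, list) else [value]

def build_index(records, source_key="latin_ar_french"):
    """Map each source spelling to the translations seen for it, per target field."""
    index = {}
    for record in records:
        for spelling in as_list(record[source_key]):
            entry = index.setdefault(spelling, {})
            for key, value in record.items():
                if key != source_key:
                    entry.setdefault(key, set()).update(as_list(value))
    return index

index = build_index(json.loads(RAW))
print(sorted(index))  # the distinct Latin spellings found in the sample
```

Collecting translations into sets keeps the lookup uniform across the one-to-one, one-to-many and many-to-many cases.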

What is latin-written Arabic?

Latin-written dialectal Arabic is any text written in the Latin alphabet, usually following English or French spelling conventions, that represents dialectal Arabic words and expressions. Here are some examples:

Examples

The word "إذهب" (not dialectal, used here for clarity) is Arabic for "go". Because many Arabic speakers do not use Arabic keyboards for text messaging, whether short or long (e.g. chatting), it is often written "Idhab" (French-style pronunciation) or "Edheb" (English-style pronunciation).
The ultimate goal of this repo is to provide a latinAr-to-classical-Arabic mapping. For the word "إذهب", we would have the mapping {"latin_ar": "Idhab", "classical_arabic": "إذهب"} for regions that follow French pronunciation and {"latin_ar": "Edheb", "classical_arabic": "إذهب"} for regions that follow English pronunciation.

A real dialectal example: "sir" (French-style) or "seer" (English-style). This is not the English word "sir" but the Maghrebi dialectal Arabic equivalent of "إذهب". The mapping would therefore be {"latin_ar": "sir", "classical_arabic": "إذهب", "american_english": "go"}.

The Goal

With this repository we aim to provide as extensive a dataset as possible (in practice, CSV/JSON files) for Latin-written Arabic, hence the name "latinAr". We focus on regions such as Morocco, Algeria and Tunisia, but also Egypt, Mauritania, Libya, the Middle East and eventually more, depending on adoption and need.

This repository is organized by region (a group of countries that share roughly the same language features), then by country (e.g. Morocco, Algeria, Tunisia), then by country region (e.g. northern, eastern, depending on language differences within the same country). The data is therefore laid out as follows: data/Regions/Countries/Country-regions/data-types. The data-type level splits the data into words, phrases and sentences, paragraphs, and long texts.
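The layout described above can be sketched as a small path helper. The directory names used here ("Maghreb", "Morocco", "Northern", "words") are illustrative assumptions, and `region_dir` is a hypothetical helper rather than something the repo ships. Check the actual folder names before relying on it.

```python
from pathlib import PurePosixPath

def region_dir(region, country, country_region, data_type, root="data"):
    """Build root/region/country/country-region/data-type as a relative path."""
    return PurePosixPath(root, region, country, country_region, data_type)

# Hypothetical example: word-level data for northern Morocco.
path = region_dir("Maghreb", "Morocco", "Northern", "words")
print(path)        # data/Maghreb/Morocco/Northern/words
print(path.parts)  # the individual levels, from root to data type
```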

Why launch a latinAr dataset?!

We are a team of individuals driven by deep learning. Disappointed by the severe lack of structured latinAr data, we realized that we could not do anything deep learning-related for dialectal Arabic. So we decided to create a dataset ourselves. For the sake of AI.

Want to contribute (we need you!)? Interested in any way? Have suggestions? Please let us know.

What's this repo going to be used for anyway?

Mainly, this repo is intended to contain training, validation and evaluation datasets for RNN- and CNN-based dialectal Arabic models:
Language modeling & generation
Word representation (word embeddings, Word2Vec,...)
Sentiment analysis
Machine translation
Text-to-speech (soon)
Speech-to-text (soon)
And others...

Types of data you'll find in this repo

You'll find 3 sets of data in every region-directory (an example of a "region-directory": data\regions\Maghreb\Morocco\Northern -chamali):

  • Raw data: data that has not gone through any processing, copied and pasted directly from the original source.
  • Pre-processed data: data that has been cleansed of irregularities. Basically, it's the raw data cleaned up with regular expressions.
  • Plug-and-play data: JSON- or CSV-structured data, ready for use.
    In the plug-and-play data you'll always find these "columns":
      • "latin_ar", containing the original word/phrase.
      • The translations of the latin_ar word/phrase into the target language(s) (as of 01/08/2018, we're thinking classical Arabic and English).
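As a rough illustration of how a plug-and-play CSV file might be read, the sketch below uses Python's standard `csv` module. Only the "latin_ar" column name comes from the description above. The translation column names, the sample row and the `load_rows` helper are assumptions made for the example.

```python
import csv
import io

# Hypothetical file content: "latin_ar" plus two assumed translation columns.
SAMPLE_CSV = """latin_ar,classical_arabic,american_english
sir,إذهب,go
"""

def load_rows(handle):
    """Read rows as dicts and check that the expected 'latin_ar' column exists."""
    reader = csv.DictReader(handle)
    if "latin_ar" not in (reader.fieldnames or []):
        raise ValueError("expected a 'latin_ar' column")
    return list(reader)

rows = load_rows(io.StringIO(SAMPLE_CSV))
for row in rows:
    print(row["latin_ar"], "->", row["american_english"])
```

For a real file, replace the `io.StringIO` object with `open(path, encoding="utf-8", newline="")` so the Arabic text is decoded correctly.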