# Arabic Transliteration
This notebook demonstrates a simple use of the Arabic Transliterator module compiled by the MTG. Please download and install the latest version of the module before running it. The phrases used here are very short and serve only as a demonstration. For a more complex use case involving the transliteration of structured TSV files, please refer to the sanaa transliteration notebooks at https://github.com/MTG/arab-andalusian-music/tree/master/sanaa_lyrics/transliteration_guide

## Description
This module reads phrases from a column of a CSV file and creates another file with the transliterations.
The original and transliterated versions of the phrases are displayed side by side.
Phrases must NOT contain numbers.
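
Because phrases must not contain numbers, one option is to strip digits before transliterating. The helper below is a minimal sketch and is not part of the module: `strip_digits` is a hypothetical name, and it assumes that simply dropping digits (rather than spelling them out) is acceptable for your data.

```python
import re

def strip_digits(text):
    """Remove digits from a phrase (hypothetical helper, not part of the module)."""
    # In Python 3, \d matches any Unicode decimal digit, so this covers
    # both ASCII digits (0-9) and Arabic-Indic digits.
    without_digits = re.sub(r"\d+", "", text)
    # Collapse the extra whitespace left behind by the removed digits.
    return " ".join(without_digits.split())
```

The cleaned string can then be passed to the transliteration function in place of the raw phrase.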

In [1]:
from arabic_transliterator import ALA_LC_Transliterator
from mishkal.tashkeel import TashkeelClass
import pandas as pd
from IPython.display import display, HTML

In [41]:
inputfile1 = 'phrases.csv'
inputfile2 = 'nouns.csv'

In [2]:
transliterator = ALA_LC_Transliterator()

In [4]:
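def transliterate(text, vocalize=True):
    """Return the ALA-LC romanization of `text`, optionally vocalizing it first."""
    voc = text
    if vocalize:
        # Add diacritics (tashkeel) with Mishkal before transliterating.
        vocalizer = TashkeelClass()
        voc = vocalizer.tashkeel(text)
    return transliterator.do(voc.strip())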

def transliterate_df(inputdf):
    """Transliterate every cell of `inputdf`, keeping the original row index."""
    # Collect results in lists and build the frame once:
    # DataFrame.append was removed in pandas 2.0.
    index, values = [], []
    for i, row in inputdf.iterrows():
        for elem in row:
            index.append(i)
            values.append(transliterate(elem.strip(), vocalize=True))
    return pd.DataFrame({'transliterations': values}, index=index)

In [44]:
inputdf = pd.read_csv(inputfile1)
outputdf = transliterate_df(inputdf)
d = pd.concat([inputdf, outputdf], axis=1)
display(d)

Unnamed: 0,phrase,transliterations
0,إذا كنت ذا عشق ووجد ورقة,idhā kannat dhā ‘ishqun wawujid wariqah
1,فنغمته تحيي النفوس وتشتفي,fanaghmatuh tuḥayyī al-nufūs watashtaffī
2,قُلْتُ أَهَذَا وَلَيْسَ يَكْفِي,qult ahadhā walays yakfī
3,يا من على طيف الخيال أحالني,yā man ‘alá ṭayf al-khayāl aḥālanī
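
The description mentions creating another file with the transliterations, which the cells here do not show. The following is a minimal sketch of that step using only the standard library. `transliterate_column` is a hypothetical helper, the transliteration function is passed in as an argument, and the output file name in the usage comment is an assumption.

```python
import csv

def transliterate_column(infile, outfile, column, transliterate):
    """Copy a CSV and append a 'transliterations' column (illustrative sketch).

    `infile` and `outfile` are open text file objects; `transliterate` is any
    function that maps a string to its transliteration.
    """
    reader = csv.DictReader(infile)
    fieldnames = list(reader.fieldnames) + ["transliterations"]
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row["transliterations"] = transliterate(row[column].strip())
        writer.writerow(row)

# Possible usage (file names are assumptions):
# with open('phrases.csv', newline='', encoding='utf-8') as src, \
#      open('phrases_transliterated.csv', 'w', newline='', encoding='utf-8') as dst:
#     transliterate_column(src, dst, 'phrase', transliterate)
```

Passing the transliteration function as a parameter keeps the sketch independent of the module, so it can be tried with any string function first.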


In [46]:
inputdf = pd.read_csv(inputfile2)
outputdf = transliterate_df(inputdf)
d = pd.concat([inputdf, outputdf], axis=1)
display(d[:20])

Unnamed: 0,الاستهلال,transliterations
0,الحجاز الكبير,al-ḥjāz al-kabīr
1,إنشاد,inshād
2,احمد أقريش,iḥmad aquraysh
3,احمد البردعي,iḥmad al-barda‘ī
4,احمد الشنتوف,iḥmad al-shntūf
5,احمد المرابط,iḥmad al-marābiṭ
6,احمد الورديغي,iḥmad al-ūrdīghī
7,احمد بنعياد,iḥmad bn‘yād
8,احمد حرازم,iḥmad ḥrāzm
9,استوديو خاص,stūdīū khāṣṣ


In [11]:
mixed_text = "Quran: رحمن ވަންތަ رحيم ވަންތަ اللَّه ގެ اسم ފުޅުން ފަށައިގަންނަމެވެ."

In [13]:
# Calling transliterate() directly does not work on mixed-script text:
# every non-Arabic character comes out as [UNK].
transliterate(mixed_text)

'[UNK][UNK][UNK][UNK][UNK][UNK]raḥimn [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] raḥīm [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] al-lah [UNK] [UNK] āism [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK] [UNK].'

In [9]:
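import re

# Runs of characters from the Arabic, Arabic Supplement and Arabic Extended-A blocks.
ARABIC_RUN = re.compile("[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF]+")

def transliterate_mixed(text):
    # Replace each Arabic run with its transliteration and leave other text untouched.
    # A callback avoids treating the match as a regex pattern or the result as a template.
    return ARABIC_RUN.sub(lambda m: transliterate(m.group(0)), text)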

In [12]:
transliterate_mixed(mixed_text)

'Quran: raḥimn ވަންތަ raḥīmun ވަންތަ al-lah ގެ ismun ފުޅުން ފަށައިގަންނަމެވެ.'
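
Because the character class used for matching does not include whitespace, each Arabic word is matched, and therefore vocalized and transliterated, on its own. The sketch below makes this visible and shows one possible variant that keeps whitespace-separated Arabic words together as a single phrase. `arabic_words` and `arabic_phrases` are hypothetical helper names, and whether phrase-level matching actually improves the vocalization is an assumption that has not been checked against the module.

```python
import re

# Same Unicode ranges as in the mixed-text function above.
ARABIC = "[\u0600-\u06FF\u0750-\u077F\u08A0-\u08FF]"
WORD_PATTERN = re.compile(ARABIC + "+")
# One or more Arabic words separated only by whitespace.
PHRASE_PATTERN = re.compile(ARABIC + r"+(?:\s+" + ARABIC + "+)*")

def arabic_words(text):
    """Return each Arabic word as a separate match."""
    return WORD_PATTERN.findall(text)

def arabic_phrases(text):
    """Return consecutive Arabic words grouped into one match."""
    return PHRASE_PATTERN.findall(text)
```

With the word pattern, two adjacent Arabic words yield two matches; with the phrase pattern they yield one, so the vocalizer would see them in context.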