# Transliteration

In our case, this involves transliterating the text and ensuring that each row carries the appropriate features, such as the source and target languages.

In [None]:
sentences = ['وحُلم المزارع هذا هوَ كلّ ما لدى بعض الناس',
 'ذلك الشيطان الماكر بلغارد شجع بورثوس وهذا ما حدث.',
 'هذهِ اخر محطة لي قبل ان اتقاعد',
 'اثداء كبيره لن تخسرها ابدا',
 'حسنا، الجميع يكون منتبه لهذه الخطوات.',
 'شخصـاً يضحك عندمـا يـرى الدم الذي يجعـل الأشخـاص الضعفـاء يتقيـأون',
 'أخبرته بشأن والدك صحيح؟',
 'آه, ماذا تريدين للعشاء؟',
 'أعلم, لكن ما الذي فعلتيه في الحقيقة؟',
 '. إن تسكعتِ معي فقط سوف يقبلونكِ في أسرع وقت']

## Camel Tools

In [5]:
from camel_tools.utils.charmap import CharMapper

sentence = "ذهبت إلى المكتبة."
print(sentence)

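# Build the Arabic <-> HSB (Habash-Soudi-Buckwalter) character mappers
ar2hsb = CharMapper.builtin_mapper('ar2hsb')
hsb2ar = CharMapper.builtin_mapper('hsb2ar')

# Map to HSB, then back to Arabic, and print both for comparison
sent_hsb = ar2hsb(sentence)
sent_ar = hsb2ar(sent_hsb)
print(sent_hsb)
print(sent_ar)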


ذهبت إلى المكتبة.
ðhbt Ălý Almktbħ.
ذهبت إلى المكتبةْ



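The mapping, as shown in this example, is reversible for the Arabic letters: going from Arabic to HSB and then back to Arabic recovers the original words, which is one of the main requirements behind our choice. The one difference visible in the output above is the final period, which comes back as a sukun (ْ), so punctuation may need to be handled separately.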
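The round-trip requirement can be illustrated without any library. The sketch below is only an assumption-laden toy: the table `AR_TO_LATIN` and the helpers `map_chars` and `round_trip_report` are hypothetical names, and the table contains only the letter pairs visible in the output above, not the actual CAMeL Tools mapping. It shows the property we care about: when every character maps to a distinct symbol, inverting the table restores the input.

```python
# Toy character table built only from the letter pairs visible in the output above.
# It is NOT the CAMeL Tools table, just enough to illustrate the round-trip check.
AR_TO_LATIN = {
    "ذ": "ð", "ه": "h", "ب": "b", "ت": "t",
    "إ": "Ă", "ل": "l", "ى": "ý", "ا": "A",
    "م": "m", "ك": "k", "ة": "ħ",
}

# The table is one-to-one, so it can be inverted without losing information.
LATIN_TO_AR = {latin: ar for ar, latin in AR_TO_LATIN.items()}


def map_chars(text, table):
    """Map each character through the table, leaving unknown characters unchanged."""
    return "".join(table.get(ch, ch) for ch in text)


def round_trip_report(text):
    """Return the romanized text, the restored text, and whether the round trip is exact."""
    romanized = map_chars(text, AR_TO_LATIN)
    restored = map_chars(romanized, LATIN_TO_AR)
    return romanized, restored, restored == text


romanized, restored, ok = round_trip_report("ذهبت إلى المكتبة")
print(romanized)
print(restored)
print(ok)
```

The same `restored == text` comparison could be run over a whole corpus with the real mappers to measure how often the round trip is exact.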

## Arabic Transliterator

In [4]:
import sys

# Add the project root to the module search path so that `src` can be imported
sys.path.append('../..')
from src.utils.transliterator import ArabicTransliterator

# Test Latin to Arabic transliteration
print("\n=== Testing Latin to Arabic Transliteration ===")

test_words = [
    "mar7aba",      # Hello
    "salaam",       # Peace
    "shukran",      # Thank you
    "kaifa 7alak",  # How are you?
    "sabah al5air", # Good morning
    "madrasa",      # School
    "kitaab",       # Book
    "bab",          # Door
    "bayt",         # House
    "sadiiq"        # Friend
]

transliterator = ArabicTransliterator()

for word in test_words:
    results = transliterator.transliterate(word, limit=3)
    print(f"\nInput: {word}")
    for i, result in enumerate(results, 1):
        print(f"  {i}. {result}")


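    # Round-trip check on the last candidate only (`result` is left over from the loop above)
    phonetic = transliterator.pronunciate(result, limit=1)
    arabic_from_phonetic = transliterator.to_arb(phonetic[0], limit=1)
    print(f"latin->Phonetic → Arabic: {arabic_from_phonetic[0]}")

    # Note: this compares the Latin input with an Arabic string, so it prints "No" for every word
    print(f"Results match: {'Yes' if word == arabic_from_phonetic[0] else 'No'}")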




=== Testing Latin to Arabic Transliteration ===

Input: mar7aba
  1. مَرحَبة
  2. مَرحَبا
  3. مارحَبى
latin->Phonetic → Arabic: مارحَبى
Results match: No

Input: salaam
  1. سَلام
  2. صَلام
latin->Phonetic → Arabic: صَلام
Results match: No

Input: shukran
  1. شوكراً
  2. سهوكراً
latin->Phonetic → Arabic: سهوكراً
Results match: No

Input: kaifa 7alak
  1. كَِفَ حَلَك
  2. قَِفَ حَلَك
latin->Phonetic → Arabic: قَِفَ حَلَك
Results match: No

Input: sabah al5air
  1. سَبَه َلخَِر
  2. صَبَه َلخَِر
latin->Phonetic → Arabic: صَبَه َلخَِر
Results match: No

Input: madrasa
  1. مَدرَسة
  2. مَدرَسا
  3. مادرَسى
latin->Phonetic → Arabic: مادرَسى
Results match: No

Input: kitaab
  1. كِتاب
  2. قِتاب
latin->Phonetic → Arabic: قِتاب
Results match: No

Input: bab
  1. بَب
  2. باب
latin->Phonetic → Arabic: باب
Results match: No

Input: bayt
  1. بَيت
  2. بايت
latin->Phonetic → Arabic: بايت
Results match: No

Input: sadiiq
  1. سَديق
  2. صَديق
latin->Phonetic → Arabic: صَديق
Results match: No

This library is a Python adaptation of a JavaScript library used to transliterate Levantine Arabic. Although the library works in both directions, the reverse mapping (Latin to Arabic) is not exact, so the library returns multiple alternatives for each word. Upon inspecting the output, we found the reverse mapping to be of low quality: in most cases it introduces spelling or grammatical errors and/or loses the meaning of the word.

This is the reason we did not choose this transliteration system.

## Uroman

In [1]:
import uroman 

trans = uroman.Uroman()

In [None]:
# Romanize the same example sentence used in the Camel Tools section
sentence = "ذهبت إلى المكتبة."
roman = trans.romanize_string(sentence)
print(roman)

thhbt ila almktba.


The problem here is that, although the romanization is good and legible, Uroman does not provide a way to reverse it.

Because we want to avoid lossy transliterations, we chose not to use Uroman.
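To make the concern about lossy romanization concrete, here is a minimal sketch. It assumes a toy table, `TOY_ROMANIZE`, that contains only a few letter pairs consistent with the output above (for example ذ rendered as `th`); it is not Uroman's actual rule set, and the second input string is an artificial sequence of letters rather than a real word. Because one letter maps to a two-character sequence that other letters can also produce, two different Arabic strings collapse to the same Latin string, so no inverse function can recover both.

```python
# Toy many-to-one romanization table (an assumption for illustration, not Uroman's rules).
TOY_ROMANIZE = {"ذ": "th", "ت": "t", "ه": "h", "ب": "b"}


def romanize(text):
    """Romanize character by character, leaving unknown characters unchanged."""
    return "".join(TOY_ROMANIZE.get(ch, ch) for ch in text)


original = "ذهبت"     # the first word of the example sentence
artificial = "تههبت"  # artificial letter sequence chosen to collide with it

print(romanize(original))
print(romanize(artificial))

# Distinct inputs, identical outputs: the mapping cannot be inverted unambiguously.
print(original != artificial and romanize(original) == romanize(artificial))
```

A reversible scheme such as the one-symbol-per-letter mapping used in the Camel Tools section avoids this kind of collision by construction.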