# This Colab was created from the CAMeL_Tools_Guided_Tour notebook shared in the repo https://github.com/CAMeL-Lab/camel_tools

The repo accompanies the paper "CAMeL Tools: An Open Source Python Toolkit for Arabic Natural Language Processing":
<br>
https://aclanthology.org/2020.lrec-1.868/

# Installation and Setup

The following steps are needed if you want to run the examples in this notebook on Google Colaboratory. If you want to run this notebook on your own machine, please follow the [installation instructions](https://camel-tools.readthedocs.io/en/latest/getting_started.html#installation) instead.

First, we install the CAMeL Tools Python package.

In [None]:
%pip install camel-tools

In order to use all the components provided in CAMeL Tools, we need to install all the datasets required by these components.
To do this in Colab, we need to first mount a Google Drive and create a directory where the data will be installed.

Run the code below and follow the instructions in the output.

In [3]:
from google.colab import drive
import os

drive.mount('/gdrive')

%mkdir /gdrive/MyDrive/camel_tools

Mounted at /gdrive


Next, we need to tell CAMeL Tools to install the data in the newly created directory. This will take a couple of minutes to complete.

**NOTE:** You will need at least 2.3GB of available space on your Google Drive to install all the CAMeL Tools data.

In [4]:
os.environ['CAMELTOOLS_DATA'] = '/gdrive/MyDrive/camel_tools'

!export | camel_data full

We also provide a lightweight dataset for the Morphology and Disambiguation components **only** that can be installed by calling `camel_data light` instead of `camel_data full`.
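
Since `%mkdir` reports an error when the directory already exists, re-running the setup cell can be noisy. The following is a minimal sketch of an idempotent alternative, assuming the same Drive path used above. `prepare_data_dir` is a hypothetical helper name and is not part of CAMeL Tools.

```python
import os


def prepare_data_dir(path):
    """Create the data directory if needed and point CAMeL Tools at it."""
    # exist_ok=True makes this safe to call on every notebook run.
    os.makedirs(path, exist_ok=True)
    # CAMELTOOLS_DATA is the environment variable set earlier in this notebook.
    os.environ['CAMELTOOLS_DATA'] = path
    return path


# In Colab, after drive.mount('/gdrive'), one might call:
# prepare_data_dir('/gdrive/MyDrive/camel_tools')
```

The call only prepares the location; the data itself still has to be installed with `camel_data` as described above.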


**Once the data has been installed on your Google Drive, you only need to run the following the next time you want to run this notebook.**

In [None]:
%pip install camel-tools

from google.colab import drive
import os

drive.mount('/gdrive')
os.environ['CAMELTOOLS_DATA'] = '/gdrive/MyDrive/camel_tools'

In [6]:
!pip install unicodecsv

Collecting unicodecsv
  Downloading unicodecsv-0.14.1.tar.gz (10 kB)
Building wheels for collected packages: unicodecsv
  Building wheel for unicodecsv (setup.py) ... done
  Created wheel for unicodecsv: filename=unicodecsv-0.14.1-py3-none-any.whl size=10765 sha256=cdaa292d986284074bd70ae9b14817d3e1e407f73b6b0a1e321c9081327eb2e3
  Stored in directory: /root/.cache/pip/wheels/1a/f4/8a/a5024fb77b32ed369e5c409081e5f00fbe3b92fdad653f6e69
Successfully built unicodecsv
Installing collected packages: unicodecsv
Successfully installed unicodecsv-0.14.1


In [11]:
from camel_tools.utils.normalize import normalize_alef_maksura_ar
from camel_tools.utils.normalize import normalize_alef_ar
from camel_tools.utils.normalize import normalize_teh_marbuta_ar
from camel_tools.utils.dediac import dediac_ar
from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.dialectid import DialectIdentifier

import numpy as np
import re


In [16]:
def preprocessArabicSentences(sentence):
  """Dediacritize and orthographically normalize an Arabic sentence."""
  # Example input: "هَلْ ذَهَبْتَ إِلَى المَكْتَبَةِ؟"

  # Remove diacritics
  sent_dediac = dediac_ar(sentence)

  # Normalize alef variants to 'ا'
  sent_norm = normalize_alef_ar(sent_dediac)

  # Normalize alef maksura 'ى' to yeh 'ي'
  sent_norm = normalize_alef_maksura_ar(sent_norm)

  # Normalize teh marbuta 'ة' to heh 'ه'
  normalizedSentence = normalize_teh_marbuta_ar(sent_norm)

  return normalizedSentence


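def checkIfSentenceHasKeyWord(sentence, keyWordsArray):
  """Return the keywords from keyWordsArray that occur as tokens in the sentence."""
  # Tokenize the sentence into a list of words
  sent_split = simple_word_tokenize(sentence)
  # Keep only the tokens that also appear in the keyword list
  intersectionResult = np.intersect1d(sent_split, keyWordsArray)
  print(intersectionResult)
  return intersectionResult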
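
To make the normalization steps concrete, here is a dependency-free sketch of roughly what they do to a string. The character sets below are assumptions chosen to match the comments in the cell above; the CAMeL Tools functions may cover more or different characters, so this is an illustration and not a replacement for them. `normalize_sketch` is a hypothetical name.

```python
import re

# Assumed sets: tanween, short vowels, shadda, sukun and dagger alef.
DIACRITICS_RE = re.compile('[\u064b-\u0652\u0670]')
# Assumed alef variants: alef with madda, hamza above, hamza below, wasla.
ALEF_VARIANTS_RE = re.compile('[\u0622\u0623\u0625\u0671]')


def normalize_sketch(text):
    """Approximate the dediacritization and normalization steps."""
    text = DIACRITICS_RE.sub('', text)            # drop diacritics
    text = ALEF_VARIANTS_RE.sub('\u0627', text)   # alef variants -> bare alef
    text = text.replace('\u0649', '\u064a')       # alef maksura -> yeh
    text = text.replace('\u0629', '\u0647')       # teh marbuta -> heh
    return text


print(normalize_sketch('هَلْ ذَهَبْتَ إِلَى المَكْتَبَةِ؟'))
```

Note that these substitutions deliberately discard spelling distinctions, so the result is meant for matching and classification, not for display.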



# Dialect Identification

We provide a pretrained dialect identification system that can distinguish between 25 city dialects as well as Modern Standard Arabic. The model can be accessed using the [`DialectIdentifier`](https://camel-tools.readthedocs.io/en/latest/api/dialectid.html#camel_tools.dialectid.DialectIdentifier) class. In addition to city dialects, we can provide results aggregated by region and by country. While these aggregated results are less fine-grained, they tend to be more accurate.

The example below illustrates how `DialectIdentifier` can be used to predict the dialects of given sentences by city, country and region. 

**NOTE:** You may get some warnings when running the example below in Colab. These can be safely ignored.
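
One intuition for why aggregated predictions can be more accurate is that confusion between cities of the same country stops mattering once their scores are pooled. The sketch below illustrates this with made-up numbers. It assumes aggregation by summing city scores per country, which is an assumption for illustration and not necessarily how `DialectIdentifier` computes its country-level output. The scores and the city-to-country mapping are hypothetical.

```python
from collections import defaultdict

# Hypothetical city-level scores for one sentence (illustrative numbers only).
city_scores = {
    'Beirut': 0.30,
    'Baghdad': 0.25,
    'Basra': 0.25,
    'Mosul': 0.10,
    'Rabat': 0.10,
}

# Hypothetical mapping covering only the cities listed above.
CITY_TO_COUNTRY = {
    'Beirut': 'Lebanon',
    'Baghdad': 'Iraq',
    'Basra': 'Iraq',
    'Mosul': 'Iraq',
    'Rabat': 'Morocco',
}


def aggregate_scores(scores, mapping):
    """Sum the city scores that map to the same country."""
    totals = defaultdict(float)
    for city, score in scores.items():
        totals[mapping[city]] += score
    return dict(totals)


country_scores = aggregate_scores(city_scores, CITY_TO_COUNTRY)
top_city = max(city_scores, key=city_scores.get)
top_country = max(country_scores, key=country_scores.get)
print(top_city, top_country)
```

In this toy case the single best city is Beirut, yet the three Iraqi cities together outweigh it, so the country-level answer is Iraq.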

In [None]:
#define the dialect identifier object
did = DialectIdentifier.pretrained()


In [36]:
def checkIfIraqDialect(sentence):
  """Return True if the top country-level prediction for the sentence is 'Iraq'."""
  # did.predict expects a list of sentences, so wrap the single sentence
  predictions = did.predict([sentence], 'country')
  return predictions[0].top == 'Iraq'
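
Since `did.predict` accepts a list of sentences, classifying in batches may reduce per-call overhead compared with one call per sentence. The following is a sketch under that assumption. `filter_by_country` and its `predict_fn` and `batch_size` parameters are hypothetical names introduced here; `predict_fn` is expected to return one object with a `.top` attribute per input sentence, as the predictions in this notebook do.

```python
def filter_by_country(sentences, predict_fn, country='Iraq', batch_size=256):
    """Return the sentences whose top predicted country equals `country`."""
    kept = []
    for start in range(0, len(sentences), batch_size):
        batch = sentences[start:start + batch_size]
        # One prediction call per batch instead of one per sentence.
        for sentence, pred in zip(batch, predict_fn(batch)):
            if pred.top == country:
                kept.append(sentence)
    return kept


# Possible usage with the identifier defined above:
# iraqi = filter_by_country(all_sentences, lambda b: did.predict(b, 'country'))
```

Passing the predictor in as a function keeps the helper independent of the model, which also makes it easy to try on a small stub before running the full dataset.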


In [38]:
import sys

try:
    import unicodecsv as csv
except ImportError:
    sys.stderr.write(
        "`pip install unicodecsv` for unicode csv support\n")
    sys.exit(1)

onlyIraqiSentences = []
with open('twitter_and_reddit.csv', "rb") as csvfile:
    reader = csv.reader(csvfile, encoding="utf-8-sig")
    next(reader, None)  # skip the header row
    for row in reader:
        # Keep only runs of characters from the Unicode Arabic block
        arabicOnly = " ".join(re.findall(r'[\u0600-\u06FF]+', row[0]))
        preprocessedPreparedSentence = preprocessArabicSentences(arabicOnly)
        # Keep the row only if its top country-level prediction is Iraq
        if checkIfIraqDialect(preprocessedPreparedSentence):
            print(preprocessedPreparedSentence)
            onlyIraqiSentences.append(
                [preprocessedPreparedSentence, row[1], row[2], row[3]])

ضرب الحبيب مثل ضرب ال ار بي جي
الملبغ ١٥٠ كيف صار ٩ الاف الواضح ان ١٥٠ مبلغ الصرف اليومي نضرب بعدد ايام الشهر وهو ٣٠ ؟ مثال ١٥٠ ضرب ٣٠ ٤٥٠٠ نقول اربعه الالاف وخمس مئه ريال طلعي الحاسبه وجربي
ضرب مخك بالطائغيه والافتراء بالمذب والتبلي علي محبي الحسين وعلي الميسحيه انسان جاهل ما تفكر بالرغم من كبر سنك ما تعلمت ولا تحرم الاهرين بالمذب الي تنشره صانع الطائغيه
اقسم بالله انا بعد اللقاح احس مخي ضرب
اقای باغگلی یه ضرب المثل قدیمی داریم که می گه تو اول برادریت را ثابت کن،بعد ادعای ارث و میراث کن شما هم اول ثابت کن که از جنس ا پ هستی بعد برای ادارۀ امور معلمان برنامه بده
وهل الظلم و ضرب الظهور واخذ الاموال وعدم اتباع سنه الرسول صلي الله عليه وسلم من الحكم بما انزل الله ام من الحكم بغير ما انزل الله؟؟
مجھے اج تک کربلا میں موجود وہ مسلمان سمجھ نہیں ایا جس نے امام حسین علیہ السلام پر اخری ضرب لگا کر اپنے ساتھیوں سے کہا کہ جلدی کرو نماز عصر کا وقت ہوگیا ہے ۔۔۔
هذا لعب توم وجيري نبي ضرب قلب ايران
انا اقرب لان كل شي جمع مادخلت ضرب فانا صح والاوله ٤ ١ ٥ والناتج الي قبله صفر لان نقطه بدايه فهمت؟ فلمسال
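
The pattern `[\u0600-\u06FF]` keeps every character in the Unicode Arabic block, which also contains letters used to write Persian and Urdu, so the output above includes a Persian and an Urdu sentence that the model still labels with an Arabic country. A heuristic sketch for flagging such rows before classification follows. The letter list is an assumption and is not exhaustive, and `looks_persian_or_urdu` and `min_hits` are hypothetical names. The letters گ, چ and پ are deliberately left out because they also appear in written Iraqi Arabic.

```python
# Letters used in Persian and/or Urdu spelling that are not part of the
# standard Arabic alphabet (heuristic list, assumed for illustration).
NON_ARABIC_LETTERS = frozenset(
    '\u06cc'  # Farsi yeh
    '\u06a9'  # keheh
    '\u06d2'  # yeh barree
    '\u06ba'  # noon ghunna
    '\u06be'  # heh doachashmee
    '\u06c1'  # heh goal
    '\u0679'  # tteh
    '\u0688'  # ddal
    '\u0691'  # rreh
)


def looks_persian_or_urdu(text, min_hits=2):
    """Return True if the text has at least `min_hits` such letters."""
    hits = sum(1 for ch in text if ch in NON_ARABIC_LETTERS)
    return hits >= min_hits
```

Requiring more than one hit is a guard against Arabic text typed on a keyboard that emits one of these code points for a visually similar Arabic letter; the threshold would need tuning on real data.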

In [41]:
# Field names
fields = ['text', 'tags', 'user', 'source']

# Data rows of the CSV file
rows = onlyIraqiSentences

with open('twitter_and_reddit_iraqi_dialect.csv', "wb") as arabicCsvfile:
    writer = csv.writer(arabicCsvfile, encoding="utf-8-sig")
    writer.writerow(fields)
    writer.writerows(rows)
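
On Python 3 the standard library `csv` module handles Unicode text directly, so the same file could also be written without `unicodecsv`. The sketch below assumes the same four-column layout. It imports the module as `std_csv` so that it does not shadow the `csv` alias used in the cells above, and `write_rows` and `read_rows` are hypothetical helper names.

```python
import csv as std_csv


def write_rows(path, fields, rows):
    """Write a header and data rows as UTF-8 (with BOM) CSV."""
    # newline='' lets the csv module control line endings;
    # utf-8-sig adds a BOM, which helps some spreadsheet tools detect UTF-8.
    with open(path, 'w', encoding='utf-8-sig', newline='') as f:
        writer = std_csv.writer(f)
        writer.writerow(fields)
        writer.writerows(rows)


def read_rows(path):
    """Read the file back, returning (header, data_rows)."""
    with open(path, 'r', encoding='utf-8-sig', newline='') as f:
        reader = std_csv.reader(f)
        header = next(reader, None)
        return header, list(reader)
```

Note the files are opened in text mode here, whereas `unicodecsv` expects binary mode (`"rb"` and `"wb"`).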

In [43]:
len(onlyIraqiSentences)  # number of sentences predicted as Iraqi

3103

Testing the model by predicting the dialect of Moroccan, Lebanese, and Iraqi sentences.<br>
The predicted countries match the expected ones.

In [47]:
moroccanAndlebaneseDialect = [
    'مال الهوى و مالي شكون اللي جابني ليك  ما كنت انايا ف حالي بلاو قلبي يانا بيك',
    'بدي دوب قلي قلي بجنون بحبك انا مجنون ما بنسى حبك يوم'
]


predictions_lebanese_dialect = did.predict(moroccanAndlebaneseDialect, 'country')
print([p.top for p in predictions_lebanese_dialect])

['Morocco', 'Lebanon']


In [46]:
IraqiDialect = [
    'انت صدك تحجي عمي السالفة خربانة',
    'اليوم شفت الاستاذ مالتي بالشارع ورجعتلي ذاكرتي مال ايام قبل بالمدرسة والله حلوة'
]


predictions_iraqi_dialect = did.predict(IraqiDialect, 'country')
print([p.top for p in predictions_iraqi_dialect])

['Iraq', 'Iraq']
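
Instead of comparing the printed lists by eye, the check can be expressed as a small accuracy computation. This is a minimal sketch: `top_label_accuracy` is a hypothetical helper, and the `predicted` list is copied from the two outputs above, not recomputed.

```python
def top_label_accuracy(predicted, expected):
    """Fraction of positions where the predicted label equals the expected one."""
    if len(predicted) != len(expected):
        raise ValueError('predicted and expected must have the same length')
    if not expected:
        return 0.0
    matches = sum(p == e for p, e in zip(predicted, expected))
    return matches / len(expected)


# Labels copied from the outputs above, paired with the intended countries.
predicted = ['Morocco', 'Lebanon', 'Iraq', 'Iraq']
expected = ['Morocco', 'Lebanon', 'Iraq', 'Iraq']
print(top_label_accuracy(predicted, expected))
```

Four hand-picked sentences are only a sanity check; a meaningful accuracy estimate would need a larger labeled sample.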
