# Installation **for Google Colab**

In [32]:
#First-time CAMeL Tools setup: run the installation cells at the end of this notebook first
#For CAMel tools
%pip install camel-tools
from google.colab import drive
import os
drive.mount('/gdrive')
os.environ['CAMELTOOLS_DATA'] = '/gdrive/MyDrive/camel_tools'

In [33]:
#for Wikipedia API
!pip3 install wikipedia-api

In [34]:
#for nltk
!pip install nltk

# Task

In [4]:
import wikipediaapi #to extract articles
import re #regular expression
#for POS tagging
from camel_tools.disambig.mle import MLEDisambiguator
from camel_tools.tagger.default import DefaultTagger
#for NER recognition
from camel_tools.ner import NERecognizer
#for normalization
from camel_tools.utils.normalize import normalize_alef_maksura_ar
from camel_tools.utils.normalize import normalize_alef_ar
from camel_tools.utils.normalize import normalize_teh_marbuta_ar
#for stemmer
from nltk.stem.isri import ISRIStemmer


## 1. Review two Python NLP libraries
## NLTK vs. CAMeL
### Natural Language Processing Libraries in Python

I used to work with [NLTK](https://link-url-here.org) (Natural Language Toolkit), a free, open-source, widely used Python toolkit that provides a suite of text-processing libraries for classification, tokenization, stemming, tagging, and other NLP tasks. Some of its functions support Arabic, but others do not.

When I started this task, I ran into problems with some of NLTK's functions on Arabic text, so I looked for another toolkit better suited to the task.

[CAMeL tools](https://camel-tools.readthedocs.io/en/latest/) is a suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi. It provides utilities for pre-processing, POS tagging, dialect identification, named entity recognition and sentiment analysis. 

Unfortunately, I couldn't run POS tagging: Google Drive appears to block access to the very large data files it needs (I'm using Google Colab, and the data files are stored on Google Drive).








## 2. Tokenise, 3. Preprocessing, 4. Tagging

### Extract data

In [5]:
#Wikipedia objects to extract articles
wiki1=wikipediaapi.Wikipedia("ar")  # ar refers to the Modern Standard Arabic version of Wikipedia
wiki2=wikipediaapi.Wikipedia("arz") # arz refers to the Egyptian Arabic version of Wikipedia

In [6]:
#Get pages based on their titles
try:
    page1 = wiki1.page('لهجة مصرية')
    page2 = wiki2.page('اللغه المصريه الحديثه')
    
except Exception as e:  #Catch errors explicitly instead of using a bare except
    print("An exception occurred:", e)

In [7]:
#Print the first 200 characters of the two articles
print('Article 1:\n',page1.summary[0:200])
print('Article 2:\n',page2.summary[0:200])

Article 1:
 اللهجة العربية المصرية، المعروفة محليًا باسم العامية المصرية أو المصري، يتحدث بها معظم المصريين المعاصرين.
المصرية هي لهجة شمال إفريقية للغة العربية وهي فرع سامي من عائلة اللغات الإفروآسيوية. نشأت في 
Article 2:
 اللغه المصريه الحديثه او المصرى هى اللغه اللى المصريين بيتكلّموها فى مصر, ابتدا تاريخها فى دلتا النيل حوالين بلادها الحضريه زى القاهره واسكندريه.
النهارده اللغه المصرى هى اللغه السايده فى مصر وبتتكون 


In [8]:
#Extract text from the two pages
article1=page1.text
article2=page2.text
text=[article1,article2]

### Removing special characters, punctuation, English letters, and digits

In [9]:
cleaned_text=[]
for article in text:
    article=re.sub(r'[^\w\s]','',article)   #Remove special characters & punctuation (this also strips Arabic diacritics)
    article=re.sub(r'[a-zA-Z\d]','',article) #Remove English letters & digits
    cleaned_text.append(article)

In [10]:
#Display the first 300 characters of the two articles
print('Article 1:\n',cleaned_text[0][0:300])
print('Article 2:\n',cleaned_text[1][0:300])

Article 1:
 اللهجة العربية المصرية المعروفة محليا باسم العامية المصرية أو المصري يتحدث بها معظم المصريين المعاصرين
المصرية هي لهجة شمال إفريقية للغة العربية وهي فرع سامي من عائلة اللغات الإفروآسيوية نشأت في دلتا النيل في مصر السفلى حول العاصمة القاهرة تطورت اللهجة المصرية من اللغة العربية التي تم نقلها إلى مصر 
Article 2:
 اللغه المصريه الحديثه او المصرى هى اللغه اللى المصريين بيتكلموها فى مصر ابتدا تاريخها فى دلتا النيل حوالين بلادها الحضريه زى القاهره واسكندريه
النهارده اللغه المصرى هى اللغه السايده فى مصر وبتتكون من لهجات زى مصرى قاهراوى  بحيرى او مصرى دلتاوى فى جنوب مصر الناس بتتكلم مصري صعيدى و هى لغه بنفسيها معت
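
To make the effect of the two substitutions concrete, here is a small sketch on a made-up sample string. The sample and the helper name `clean` are illustrative and not part of the notebook. In Python 3, `\w` does not match Arabic diacritics, so they are removed together with the punctuation, and `\d` also matches Arabic-Indic digits.

```python
import re

def clean(text):
    # Hypothetical helper wrapping the two substitutions used in the cell above
    text = re.sub(r'[^\w\s]', '', text)     # drops punctuation and Arabic diacritics
    text = re.sub(r'[a-zA-Z\d]', '', text)  # drops Latin letters and digits (any script)
    return text

# Made-up sample: an Arabic word with a tanwin mark (U+064B), an Arabic comma (U+060C),
# an English word, Western digits, and Arabic-Indic digits
sample = 'محلي' + '\u064B' + 'ا' + '\u060C' + ' Egypt 2021 ' + '\u0661\u0662\u0663' + '!'

# Only the bare Arabic word survives once the text is split on whitespace
print(clean(sample).split())
```

Because the substitutions delete characters without replacing them, runs of spaces can remain; the later `split()` call absorbs them.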


### Tokenizing & Stopword Removal



In [11]:
#Download Stopwords text file from: https://github.com/mohataher/arabic-stop-words
#Read the file (explicit UTF-8, since it contains Arabic text)
with open('/gdrive/MyDrive/ColabNotebooks/list.txt', 'r', encoding='utf-8') as file:
    #Strip the trailing newline from each line
    stopwords = [line.rstrip('\n') for line in file]
#Print some of the stopwords
print(stopwords[100:110])

['إليكم', 'إليكما', 'إليكنّ', 'إليكَ', 'إلَيْكَ', 'إلّا', 'إمّا', 'إن', 'إنَّ', 'إى']


In [12]:
#Each article will be tokenized and stopwords will be removed
filtered_tokens=[]
filtered_articles=[]
for article in cleaned_text:
    #Tokenize
    tokens=article.split()
    #Check if any word is a stopword, to be neglected
    for word in tokens:
        if word not in stopwords:
            filtered_tokens.append(word)
    #Add tokenized article to the article list
    filtered_articles.append(filtered_tokens)
    filtered_tokens=[]

In [13]:
#Print part of the Egyptian article after tokenization and stopword removal
print(filtered_articles[1][0:20])

['اللغه', 'المصريه', 'الحديثه', 'المصرى', 'هى', 'اللغه', 'اللى', 'المصريين', 'بيتكلموها', 'مصر', 'ابتدا', 'تاريخها', 'دلتا', 'النيل', 'حوالين', 'بلادها', 'الحضريه', 'زى', 'القاهره', 'واسكندريه']
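
Several entries in the stopword list carry diacritics, whereas the cleaning step has already stripped diacritics from the articles, so those entries can never match a token. A minimal sketch of one way to align the two, assuming the common Arabic diacritic range U+064B to U+0652. The helper name `strip_diacritics` and the tiny word lists are illustrative.

```python
import re

# Assumption: short vowels, tanwin, shadda and sukun occupy U+064B..U+0652
DIACRITICS = re.compile('[\u064B-\u0652]')

def strip_diacritics(word):
    # Hypothetical helper: remove diacritic marks so the word matches the cleaned text
    return DIACRITICS.sub('', word)

# Made-up miniature stopword list; the second entry ends with a fatha (U+064E)
raw_stopwords = ['إن', 'إليك' + '\u064E', 'من']
stopword_set = {strip_diacritics(w) for w in raw_stopwords}  # a set gives fast lookups

tokens = ['إليك', 'اللغه', 'من', 'مصر']
filtered = [t for t in tokens if t not in stopword_set]
print(filtered)
```

Storing the stopwords in a set instead of a list also avoids scanning the whole list for every token.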


### Part-of-Speech and Named entity recognition tagging using CAMeL tools

#### POS Tagging

In [None]:
#ALERT: THIS CELL DOES NOT WORK ON COLAB
#The issue only appears on Colab: Google Drive seems to block the download because the file is too large to be scanned for viruses, which leads to:
#FileNotFoundError: [Errno 2] No such file or directory: '/gdrive/MyDrive/camel_tools/data/morphology_db/calima-egy-r13/morphology.db'

#Pretrained disambiguator for MSA
mle = MLEDisambiguator.pretrained('calima-msa-r13')
#Pretrained disambiguator for Egyptian Arabic
mle2 = MLEDisambiguator.pretrained('calima-egy-r13')

tagger = DefaultTagger(mle, 'pos')
tagger2 = DefaultTagger(mle2, 'pos')

pos_tags_article1 = tagger.tag(filtered_articles[0])
pos_tags_article2 = tagger2.tag(filtered_articles[1])

#Print part of POS tags for MSA article
print(pos_tags_article1[0:20])

#Print part of POS tags for Egyptian article
print(pos_tags_article2[0:20])
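
Because the failure is a `FileNotFoundError` for a morphology database file, a quick check before loading can show which databases are actually present. This is only a sketch: the directory layout (`<data dir>/data/morphology_db/<db name>/morphology.db`) is inferred from the error message in the cell above, and the helper name `missing_databases` is hypothetical.

```python
import os

def missing_databases(data_dir, db_names):
    # Hypothetical helper: the layout below is inferred from the error message,
    # not taken from the CAMeL Tools documentation
    missing = []
    for name in db_names:
        path = os.path.join(data_dir, 'data', 'morphology_db', name, 'morphology.db')
        if not os.path.isfile(path):
            missing.append(name)
    return missing

# Fall back to the Drive folder used in this notebook if the variable is not set
data_dir = os.environ.get('CAMELTOOLS_DATA', '/gdrive/MyDrive/camel_tools')
print(missing_databases(data_dir, ['calima-msa-r13', 'calima-egy-r13']))
```

An empty list would mean both database files are in place; any name printed points to a download that did not complete.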

#### NER Tagging

In [14]:
#Named Entity Recognition tags
#Pretrained model
ner = NERecognizer.pretrained()
#Tags
ner_tags=[]
for article in filtered_articles:
  labels = ner.predict_sentence(article)
  ner_tags.append(list(zip(article, labels)))

In [15]:
#Print part of token-NER label pairs for article 1
print('Article 1 NER:\n',ner_tags[0][0:20])
#Print part of token-NER label pairs for article 2
print('Article 2 NER:\n',ner_tags[1][0:20])

Article 1 NER:
 [('اللهجة', 'O'), ('العربية', 'O'), ('المصرية', 'O'), ('المعروفة', 'O'), ('محليا', 'O'), ('العامية', 'O'), ('المصرية', 'O'), ('المصري', 'O'), ('يتحدث', 'O'), ('معظم', 'O'), ('المصريين', 'O'), ('المعاصرين', 'O'), ('المصرية', 'O'), ('لهجة', 'O'), ('إفريقية', 'O'), ('للغة', 'O'), ('العربية', 'O'), ('فرع', 'O'), ('سامي', 'O'), ('عائلة', 'O')]
Article 2 NER:
 [('اللغه', 'O'), ('المصريه', 'O'), ('الحديثه', 'O'), ('المصرى', 'O'), ('هى', 'O'), ('اللغه', 'O'), ('اللى', 'O'), ('المصريين', 'O'), ('بيتكلموها', 'O'), ('مصر', 'B-LOC'), ('ابتدا', 'O'), ('تاريخها', 'O'), ('دلتا', 'O'), ('النيل', 'I-LOC'), ('حوالين', 'O'), ('بلادها', 'O'), ('الحضريه', 'O'), ('زى', 'O'), ('القاهره', 'B-LOC'), ('واسكندريه', 'B-LOC')]
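
The labels follow the BIO scheme visible in the output (`B-LOC`, `I-LOC`, `O`). A small sketch of how token-label pairs could be merged into entity spans. The function name `group_entities` is hypothetical, and treating an `I-` tag with no preceding `B-` (as with 'النيل' in the output) as the start of a new entity is a choice made for this example.

```python
def group_entities(pairs):
    # pairs: list of (token, label) tuples with BIO labels such as 'B-LOC', 'I-LOC', 'O'
    entities = []
    current_tokens, current_type = [], None
    for token, label in pairs:
        prefix, _, ent_type = label.partition('-')
        # An I- tag extends the open entity only if the entity type matches
        if prefix == 'I' and ent_type == current_type:
            current_tokens.append(token)
            continue
        # Anything else closes the open entity, if there is one
        if current_tokens:
            entities.append((' '.join(current_tokens), current_type))
        if prefix in ('B', 'I'):
            current_tokens, current_type = [token], ent_type
        else:
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((' '.join(current_tokens), current_type))
    return entities

# A few pairs taken from the Article 2 output
sample = [('مصر', 'B-LOC'), ('ابتدا', 'O'), ('دلتا', 'O'), ('النيل', 'I-LOC'),
          ('زى', 'O'), ('القاهره', 'B-LOC'), ('واسكندريه', 'B-LOC')]
print(group_entities(sample))
```

On this sample the function yields four separate location entities, since no `I-LOC` directly follows a `B-LOC`.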


## 5. Find related open class words

### Normalizing alef, alef maksura, and teh marbuta

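MSA spelling writes the hamza on alef and uses teh marbuta (ة), whereas the Egyptian Arabic article writes a bare alef and uses heh (ه) in place of teh marbuta; it also writes alef maksura (ى) where MSA has a final yeh (ي). Normalization brings the two spellings closer together so that words can be compared.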
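
As an illustration of what these normalizations amount to, here is a pure-Python sketch using `str.translate`. It is not the CAMeL Tools implementation: the character mapping reflects my understanding of what the three functions do, and the name `normalize_word` is hypothetical.

```python
# Assumed character mappings (illustrative; not taken from the CAMeL Tools source)
NORMALIZE_TABLE = str.maketrans({
    '\u0623': '\u0627',  # alef with hamza above -> bare alef
    '\u0625': '\u0627',  # alef with hamza below -> bare alef
    '\u0622': '\u0627',  # alef with madda -> bare alef
    '\u0629': '\u0647',  # teh marbuta -> heh
    '\u0649': '\u064A',  # alef maksura -> yeh
})

def normalize_word(word):
    # Hypothetical helper applying all three normalizations in one pass
    return word.translate(NORMALIZE_TABLE)

msa_word = 'المصرية'  # MSA spelling, ends with teh marbuta
egy_word = 'المصريه'  # Egyptian Wikipedia spelling, ends with heh
print(normalize_word(msa_word) == normalize_word(egy_word))
```

Applying one mapping to both articles is a simplification relative to the notebook, which normalizes alef and teh marbuta only in the MSA article and alef maksura only in the Egyptian one.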

In [16]:
#Normalize for Standard Arabic
norm_article=[]
norm_articles=[]

for word in filtered_articles[0]:
  #Normalize alef variants to 'ا'
  word_norm = normalize_alef_ar(word)
  #Normalize teh marbuta 'ة' to heh 'ه'
  word_norm = normalize_teh_marbuta_ar(word_norm)
  norm_article.append(word_norm)

#Add the article after normalization to the list of normalized articles
norm_articles.append(norm_article)
norm_article=[]

#Normalize for Egyptian Arabic
for word in filtered_articles[1]:
  #Normalize alef maksura 'ى' to yeh 'ي'
  word_norm = normalize_alef_maksura_ar(word)
  norm_article.append(word_norm)

#Add the article after normalization to the list of normalized articles
norm_articles.append(norm_article)

In [17]:
#Print the first 10 words in each article before and after normalization to show the difference
print('Article 1:\n',filtered_articles[0][0:10])
print(norm_articles[0][0:10])
print('*'*60)
print('Article 2:\n',filtered_articles[1][0:10])
print(norm_articles[1][0:10])

Article 1:
 ['اللهجة', 'العربية', 'المصرية', 'المعروفة', 'محليا', 'العامية', 'المصرية', 'المصري', 'يتحدث', 'معظم']
['اللهجه', 'العربيه', 'المصريه', 'المعروفه', 'محليا', 'العاميه', 'المصريه', 'المصري', 'يتحدث', 'معظم']
************************************************************
Article 2:
 ['اللغه', 'المصريه', 'الحديثه', 'المصرى', 'هى', 'اللغه', 'اللى', 'المصريين', 'بيتكلموها', 'مصر']
['اللغه', 'المصريه', 'الحديثه', 'المصري', 'هي', 'اللغه', 'اللي', 'المصريين', 'بيتكلموها', 'مصر']


### What are similar words?
I can answer this in two ways:

**First: Synonyms**. To my knowledge, the existing Arabic NLP libraries cannot find synonyms. The task can instead be approached with a word-embedding model, either pre-trained or trained from scratch: the Arabic words are mapped to embeddings (vectors of numbers), and the distance between vectors is used to find the most similar words. I ran an experiment with a pre-trained word2vec model that was trained on Arabic Wikipedia articles. Unfortunately, all the similar words the model found were different inflections of the same word, not synonyms.
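
The distance between word vectors is commonly measured with cosine similarity. A toy sketch with made-up 3-dimensional vectors follows; real embeddings have hundreds of dimensions and come from a trained model, so the vectors and the helper `most_similar` are purely illustrative.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, not the output of any real model
vectors = {
    'لغه': [0.9, 0.1, 0.0],
    'لهجه': [0.8, 0.2, 0.1],
    'مدرسه': [0.0, 0.9, 0.4],
}

def most_similar(word, vectors):
    # Hypothetical helper: rank every other word by cosine similarity to `word`
    others = [(w, cosine_similarity(vectors[word], v))
              for w, v in vectors.items() if w != word]
    return sorted(others, key=lambda pair: pair[1], reverse=True)

# Prints the word whose vector points in the direction closest to that of 'لغه'
print(most_similar('لغه', vectors)[0][0])
```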

**Second: Inflections**. Inflection is a process of word formation. We can reduce words to their root or lemma using lemmatization or stemming, then look for roots shared between the two articles. This is faster than using embedding models.


**Note: Since POS tagging didn't work on Colab, I didn't use POS or NER tags to find similar words. I used the Arabic stemmer in the NLTK library, since CAMeL Tools does not include a stemmer.**

### Stemming to find similarities


In [18]:
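#Initialize the stemmer
stemmer = ISRIStemmer()
similar_pairs=[]
#Keep only the unique words of each article
article1=list(set(norm_articles[0]))
article2=list(set(norm_articles[1]))
#Stem each unique word once instead of re-stemming inside the nested loop
stems1=[stemmer.stem(word) for word in article1]
stems2=[stemmer.stem(word) for word in article2]
#Store the index pair of every two words that share the same stem
for i in range(len(article1)):
  for w in range(len(article2)):
    if stems1[i] == stems2[w]:
      similar_pairs.append([i,w])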
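
The nested loop compares every word in one vocabulary against every word in the other. A sketch of an alternative indexes one side by stem in a dictionary, so that each word is stemmed once and matched by lookup. To keep the example self-contained it uses a toy stand-in stemmer (`toy_stem`, which only strips the definite article 'ال'); in the notebook, `ISRIStemmer().stem` would take its place. The names `toy_stem` and `match_by_stem` are hypothetical.

```python
from collections import defaultdict

def toy_stem(word):
    # Toy stand-in for a real stemmer: only strips a leading definite article
    return word[2:] if word.startswith('ال') and len(word) > 4 else word

def match_by_stem(words1, words2, stem):
    # Index the second vocabulary by stem, then look up each word of the first
    index = defaultdict(list)
    for w in words2:
        index[stem(w)].append(w)
    return [(w1, w2) for w1 in words1 for w2 in index.get(stem(w1), [])]

# Made-up miniature vocabularies
pairs = match_by_stem(['المدارس', 'الشام'], ['مدارس', 'الشام', 'لغه'], toy_stem)
print(pairs)
```

This version returns word pairs directly instead of index pairs, which removes the need to look the words up again when printing.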

In [19]:
#Print no. of similar pairs
print(len(similar_pairs))

3748


In [20]:
#Print part of the similar pairs
print(similar_pairs[0:10])

[[2, 1306], [3, 1061], [4, 3], [4, 205], [4, 308], [4, 566], [4, 742], [4, 896], [4, 1119], [4, 1244]]


In [28]:
#Print part of the similar words between the two articles
print("Article 1 _______ Article 2")
for i in range(130,150):
  print(article1[similar_pairs[i][0]],' _______ ',article2[similar_pairs[i][1]])
  

Article 1 _______ Article 2
اللغات  _______  لغتهم
اللغات  _______  لغته
صناعه  _______  مصنوعة
الشام  _______  الشام
رئيسي  _______  رئيس
كاللغه  _______  باللغه
كاللغه  _______  لغه
كاللغه  _______  لللغه
كاللغه  _______  اللغه
الراجل  _______  الرجاله
مدارس  _______  دراسه
مدارس  _______  المدرسه
مدارس  _______  بتدرس
مدارس  _______  درس
مدارس  _______  الدرس
مدارس  _______  مدرسين
مدارس  _______  ادريس
مدارس  _______  المدارس
انظروا  _______  النظر
انظروا  _______  نظرية


# Run these 3 cells only the first time you use this notebook on Colab:

In [None]:
!pip install camel-tools -f https://download.pytorch.org/whl/torch_stable.html

In [None]:
!pip install camel-tools -f https://download.pytorch.org/whl/torch_stable.html
import os
from google.colab import drive
drive.mount('/gdrive')

%mkdir /gdrive/MyDrive/camel_tools

In [None]:
#install camel tools data
os.environ['CAMELTOOLS_DATA'] = '/gdrive/MyDrive/camel_tools'

#!export | camel_data -i all
!export | camel_data -i defaults