# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">Setup</p>

## 1. Imports

In [1]:
import re
import os
import typing
import random
import emoji

import torch
import numpy as np
import pandas as pd

from collections import Counter
from textblob import TextBlob

import warnings

warnings.simplefilter("ignore")

## 2. Helper Methods

In [2]:
def extract_sentences(file_path: str) -> list:
    '''
    Read a YouTube/podcast transcript file and split it into sentences.

    @param file_path: path to the transcript text file.
    @return: a list of non-empty sentences.
    '''

    with open(file_path, 'r', errors="ignore", encoding="utf-8") as f:
        text = f.read()

    # Remove numbers followed by ':'
    text = re.sub(r'\d+.*\d*\s*:', '', text)

    # Define sentence delimiters for Arabic
    sentence_endings = r'(?<=[.!؟؛،])\s+'

    # Split sentences while preserving dependencies
    sentences = re.split(sentence_endings, text)

    return [s.strip() for s in sentences if len(s.strip()) > 1] # Remove empty strings
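
To see what the punctuation-based split does, here is a minimal standalone sketch that applies the same delimiter pattern to a short string. The helper name `split_on_arabic_punctuation` and the sample text are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

# Same delimiter pattern as in extract_sentences: split on whitespace that
# follows a period, exclamation mark, Arabic question mark, semicolon, or comma.
SENTENCE_ENDINGS = re.compile(r'(?<=[.!؟؛،])\s+')

def split_on_arabic_punctuation(text: str) -> list:
    """Hypothetical helper: split text and drop fragments of length <= 1."""
    parts = SENTENCE_ENDINGS.split(text)
    return [p.strip() for p in parts if len(p.strip()) > 1]

sample = "لكل شخص في آخر الشهر، ما معاهوش غير 50 جنيه. ومحتاج ياكل؟"
parts = split_on_arabic_punctuation(sample)
for part in parts:
    print(part)
```

Because the Arabic comma (،) is one of the delimiters, the split tends to yield clause-sized pieces rather than full sentences.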
    

In [3]:
def sentences_to_df(sentences: list) -> pd.DataFrame:
    '''
    convert a list of sentences to a dataframe

    @param sentences: a list of sentences
    '''

    return pd.DataFrame(sentences)

## 3. Load Dataset

In [4]:
def load_dataset(data_folder: str = "./data") -> pd.DataFrame:
    '''
    Load every .txt file in a directory into a single dataframe of sentences.

    @param data_folder: path to the directory that holds the .txt files.
    '''
    frames = []

    for file_name in os.listdir(data_folder):
        if file_name.endswith(".txt"):
            file_path = os.path.join(data_folder, file_name)
            frames.append(sentences_to_df(extract_sentences(file_path)))

    return pd.concat(frames, ignore_index=True)

In [5]:
# load_dataset reads every .txt file in the folder, not a single file
df = load_dataset("./data")
df.head

<bound method NDFrame.head of                                      0
0                لكل شخص في آخر الشهر،
1               ما معاهوش غير 50 جنيه،
2           ومحتاج ياكل أكلة تشبّعه...
3                    لكل واحد "فورمة"،
4              محتاج أكلة سريعة الهضم،
..                                 ...
733          لازم السندوتش معاه بطاطس.
734              وأقوم مطلّع بطاطساية،
735          وأقوم حاططها في السندوتش!
736  دا أنا أحيانًا بجيب سندوتش بطاطس،
737    من "الحرمين" اللي في "الحُصري".

[738 rows x 1 columns]>

---

# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">Libraries & Models</p>

## 1. Ghalatawi

**Ghalatawi:** an Arabic autocorrect library.

Source: https://github.com/linuxscout/ghalatawi

In [6]:
from ghalatawi.autocorrector import AutoCorrector

autoco = AutoCorrector()

autoco.show_config()

{'regex': True, 'wordlist': True, 'punct': True, 'typo': True}

> <span style="color: yellow">**_Note:_**</span> The library can fix spelling mistakes, typos, and punctuation.

## 2. PyArabic

**PyArabic:** an Arabic language library for Python that provides basic functions for manipulating Arabic letters and text, such as detecting Arabic letters, letter groups, and their characteristics, and removing diacritics.

Source: https://github.com/linuxscout/pyarabic

In [7]:
import pyarabic.araby as araby
import pyarabic.number as number

## 3. CAMeL Bert

**CAMeL Bert:** a collection of BERT models pre-trained on Arabic texts with different sizes and variants.

Source: https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa

In [8]:
# Load Arabic SBERT Model
from sentence_transformers import SentenceTransformer

sbert_model = SentenceTransformer("CAMeL-Lab/bert-base-arabic-camelbert-msa")

No sentence-transformers model found with name CAMeL-Lab/bert-base-arabic-camelbert-msa. Creating a new one with mean pooling.
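
The message above indicates that the checkpoint is a plain BERT model rather than a trained sentence-embedding model, so `SentenceTransformer` averages the token vectors to get one vector per sentence. The following is a minimal sketch of that averaging in pure Python with made-up numbers; `mean_pool` is an illustrative name, not a library function.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens; padding (mask == 0) is ignored."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

# Three token vectors; the last one is padding and must not affect the mean.
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = mean_pool(tokens, mask)
print(pooled)
```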


## 4. DSAraby

**DSAraby:** a library for transliterating text, that is, writing a word using the closest corresponding letters of a different alphabet or language.

Source: https://github.com/saobou/DSAraby/tree/master

In [9]:
from dsaraby import DSAraby

ds = DSAraby()

## 5. Tashaphyne

**Tashaphyne:** an Arabic light stemmer and segmenter. It mainly supports light stemming (removing prefixes and suffixes) and returns all possible segmentations. It uses a modified finite-state automaton, which allows it to generate all segmentations.

Source: https://github.com/linuxscout/tashaphyne

In [10]:
from tashaphyne.stemming import ArabicLightStemmer
from tashaphyne.arabicstopwords import STOPWORDS as TASHAPHYNE_STOPWORDS

tashaphyne_stemmer = ArabicLightStemmer()

## 6. CaMeL Tools

**CaMeL Tools:** a suite of Arabic natural language processing tools developed by the CAMeL Lab at New York University Abu Dhabi.

Source: https://github.com/CAMeL-Lab/camel_tools

In [11]:
import camel_tools

from camel_tools.ner import NERecognizer
from camel_tools.utils.dediac import dediac_ar
from camel_tools.disambig.mle import MLEDisambiguator
# from camel_tools.dialectid import DIDPred
from camel_tools.morphology.analyzer import Analyzer
from camel_tools.tagger.default import DefaultTagger
from camel_tools.morphology.database import MorphologyDB
from camel_tools.utils.normalize import normalize_unicode
from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.tokenizers.morphological import MorphologicalTokenizer
from camel_tools.morphology.reinflector import Reinflector
from camel_tools.morphology.generator import Generator

camel_data_path = os.path.join(os.path.dirname(camel_tools.__file__), 'cli', 'camel_data.py')
print(camel_data_path)

# The built-in morphology database must already be installed (for example
# with the camel_data command-line tool); nothing is downloaded here.
# flags='r' loads the database in reinflection mode.
morph_db = MorphologyDB.builtin_db(flags = 'r')
analyzer = Analyzer(morph_db)

c:\Users\mazen\AppData\Local\Programs\Python\Python312\Lib\site-packages\camel_tools\cli\camel_data.py


## 7. Farasa

**Farasa:** a toolkit for Arabic language processing developed by the Arabic Language Technologies Group at Qatar Computing Research Institute (QCRI). The `farasapy` package linked below is a Python wrapper for it.

Source: https://github.com/MagedSaeed/farasapy

In [12]:
from farasa.stemmer import FarasaStemmer

farasa_stemmer = FarasaStemmer()

## 8. Tnkeeh

**Tnkeeh:** an Arabic preprocessing library for Python. It was designed using `re` to create quick replacement expressions for tasks such as quick cleaning, segmentation, normalization, and data splitting.

Source: https://github.com/ARBML/tnkeeh

In [13]:
import tnkeeh as tn

## 9. NLTK

**NLTK:** a leading platform for building Python programs to work with human language data.

Source: https://www.nltk.org/

In [102]:
import nltk

from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer

nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('punkt_tab')

NLTK_STOPWORDS = set(stopwords.words('arabic'))
nltk_stemmer = ISRIStemmer()

[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\mazen\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\mazen\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\mazen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


## 10. SinaTools

**SinaTools:** an Open-Source Toolkit for Arabic NLP and NLU developed by SinaLab at Birzeit University.

Models:
- morph: https://sina.birzeit.edu/lemmas_dic.pickle
- ner: https://sina.birzeit.edu/Wj27012000.tar.gz
- wsd_model: https://sina.birzeit.edu/bert-base-arabertv02_22_May_2021_00h_allglosses_unused01.zip
- wsd_tokenizer: https://sina.birzeit.edu/bert-base-arabertv02.zip
- one_gram: https://sina.birzeit.edu/one_gram.pickle
- five_grams: https://sina.birzeit.edu/five_grams.pickle
- four_grams: https://sina.birzeit.edu/four_grams.pickle
- three_grams: https://sina.birzeit.edu/three_grams.pickle
- two_grams: https://sina.birzeit.edu/two_grams.pickle
- graph_l2: https://sina.birzeit.edu/graph_l2.pkl
- graph_l3: https://sina.birzeit.edu/graph_l3.pkl
- relation: https://sina.birzeit.edu/relation_model.zip


Source: https://github.com/SinaLab/SinaTools

In [15]:
from sinatools.morphology import morph_analyzer
from sinatools.utils import text_transliteration
from sinatools.synonyms.synonyms_generator import evaluate_synonyms

In [16]:
# Try to use different python versions (other than 3.12)
# from sinatools.ner.entity_extractor import extract

## 11. Transformers

**🤗 Transformers:** a library that provides thousands of pretrained models to perform tasks on different modalities such as text, vision, and audio.

Source: https://github.com/huggingface/transformers

In [17]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

## 12. Scikit-Learn

**Scikit-Learn**: a Python module for machine learning built on top of SciPy and distributed under the 3-Clause BSD license.

Source: https://github.com/scikit-learn/scikit-learn

In [18]:
from sklearn.metrics.pairwise import cosine_similarity

## 13. Arabert

**AraBERT**: an Arabic pretrained language model based on Google's BERT architecture.

Source: https://huggingface.co/aubmindlab/bert-base-arabertv2

In [116]:
from arabert.preprocess import ArabertPreprocessor

-----

# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">Cleaning</p>

## 1. Tidying Up Text

### 1.1 Orthographic mistakes

### 1.2 Spelling inconsistencies (Text Correction)

In [19]:
def auto_correct(text: str) -> None:
    '''
    Print the input sentence after fixing typos, punctuation, and spelling mistakes.

    @param text: a sentence to correct.
    '''
    autoco = AutoCorrector()

    output = autoco.spell(text)

    print(output)

Example

In [20]:
text = "إذا أردت إستعارة كتاب، اذهب إلى المكتبة أو الادارة في الضهيرة."
auto_correct(text)

إذا أردت استعارة كتاب، اذهب إلى المكتبة أو الادارة في الظهيرة.


In [21]:
text = 'سنقر لا تسرك'
auto_correct(text)

سنقر لا تسرك


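> <span style="color: green">**_OBSERVATION:_**</span> The library does not always fix typos: in the first example, "الادارة" is left uncorrected, and the second sentence is returned unchanged.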

### 1.3 Unknown characters

### 1.4 Repeated letters and spaces within words



### 1.5 Reshape Text

https://pypi.org/project/arabic-reshaper/

## 2. Text Processing

### 2.1 Sentence Segmentation

> <span style="color: yellow">**_NOTE:_**</span> Text collected from YouTube videos and podcasts has no reliable sentence structure to split on, so sentence boundaries must be inferred.

In [22]:
def arabic_sentence_segmentation(sentences: list) -> list:
    '''
    Merge consecutive short sentences into semantically coherent segments.

    @param sentences: a list of sentences (for example, the output of a punctuation-based split).
    @return: a list of segments, each made of one or more consecutive sentences.
    '''

    # Compute sentence embeddings
    embeddings = sbert_model.encode(sentences)

    # Compute pairwise cosine similarity between all sentences
    sim_matrix = cosine_similarity(embeddings)

    # Find semantic breakpoints: adjacent sentences with low similarity
    threshold = 0.5  # Adjust this based on experimentation
    split_points = [i for i in range(len(sentences) - 1) if sim_matrix[i, i+1] < threshold]

    # Generate semantic splits
    segments = []
    start = 0
    for split in split_points:
        segments.append(" ".join(sentences[start:split+1]))
        start = split + 1

    segments.append(" ".join(sentences[start:]))

    # Remove empty strings
    segments = [seg for seg in segments if seg.strip()]

    return segments
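
The breakpoint rule can be checked without loading a model. The sketch below is a dependency-free illustration that replaces the `sbert_model.encode` output with hand-made 2D vectors; the vectors and the helper names `cosine` and `group_by_adjacent_similarity` are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_by_adjacent_similarity(sentences, vectors, threshold=0.5):
    """Start a new segment wherever two neighbouring vectors are dissimilar."""
    if not sentences:
        return []
    segments, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            segments.append(" ".join(current))
            current = []
        current.append(sentences[i])
    segments.append(" ".join(current))
    return segments

toy_sentences = ["لكل شخص في آخر الشهر،", "ما معاهوش غير 50 جنيه،", "القاهره هي عاصمة مصر،"]
# Made-up vectors: the first two point the same way, the third does not.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
segments = group_by_adjacent_similarity(toy_sentences, toy_vectors)
print(segments)
```

Only neighbouring sentences are compared, so the result depends on the order of the input and on the chosen threshold.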

Example

In [23]:
paragraph = """
لكل شخص في آخر الشهر،
ما معاهوش غير 50 جنيه،
لكل واحد مش عارف ياكل،
ونفسه في أكلة جانبية مع الأكل،
القاهره هي عاصمة مصر،
"""

# Define sentence delimiters for Arabic
sentence_endings = r'(?<=[.!؟؛،])\s+'

# Split sentences while preserving dependencies
sentences = re.split(sentence_endings, paragraph)
print(sentences)

arabic_sentence_segmentation(sentences)

['\nلكل شخص في آخر الشهر،', 'ما معاهوش غير 50 جنيه،', 'لكل واحد مش عارف ياكل،', 'ونفسه في أكلة جانبية مع الأكل،', 'القاهره هي عاصمة مصر،', '']


['\nلكل شخص في آخر الشهر، ما معاهوش غير 50 جنيه، لكل واحد مش عارف ياكل، ونفسه في أكلة جانبية مع الأكل،',
 'القاهره هي عاصمة مصر،']

> <span style="color: red">**_TODO:_**</span> Handle text that contains both English and Arabic.

### 2.2  Arabizi to Arabic

> <span style="color: green">**_Arabizi:_**</span> is Arabic written with Latin letters and digits (e.g., "3ami").

In [24]:
def arabizi_to_arabic(text: str) -> tuple:
    '''
    A method that gives the possible words in Arabic based on a given word in Latin by mapping
    the Latin letters to Arabic ones, then takes the most frequent word existing in a corpus.
    
    :param text: A sentence containing Arabizi (e.g., "3ami") that needs to be converted to Arabic.

    :return: A tuple of two values: 
        - The transliterated text based on the given schema. 
        - A boolean flag indicating whether all characters in the input text were successfully transliterated or not.
    '''

    transliterate_text = text_transliteration.perform_transliteration(text, "bw2ar")[0]

    conversion_dict = {
        'a': 'ا', 'b': 'ب', 't': 'ت', 'th': 'ث', 'g': 'ج', '7': 'ح', 'kh': 'خ',
        'd': 'د', 'dh': 'ذ', 'r': 'ر', 'z': 'ز', 's': 'س', 'sh': 'ش', '9': 'ص',
        '6': 'ط', '3': 'ع', 'gh': 'غ', 'f': 'ف', 'q': 'ق', 'k': 'ك', 'l': 'ل',
        'm': 'م', 'n': 'ن', 'h': 'ه', 'w': 'و', 'y': 'ي', '?': "؟"
    }
    
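    # Replace longer keys first so that digraphs such as 'sh' are matched before 's' and 'h'
    for latin in sorted(conversion_dict, key=len, reverse=True):
        transliterate_text = transliterate_text.replace(latin, conversion_dict[latin])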

    transliterate_text = ds.transliterate(transliterate_text)

    # Check if all characters in the result are Arabic
    valid_arabic_regex = re.compile(r'^[\u0600-\u06FF\s.,،؟!؛]+$')
    transliterate_successed = all(valid_arabic_regex.match(char) for char in transliterate_text)

    return transliterate_text, transliterate_successed
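
When a mapping contains both single letters and digraphs, the order of replacement matters. The sketch below contrasts a loop that applies single letters first with a longest-match-first regex. The reduced `ARABIZI_MAP` and both helper names are illustrative assumptions; this is not how SinaTools or DSAraby work internally.

```python
import re

# Reduced mapping in the spirit of conversion_dict (hypothetical selection)
ARABIZI_MAP = {
    'sh': 'ش', 'kh': 'خ', 'th': 'ث', 'gh': 'غ', 'dh': 'ذ',
    's': 'س', 'h': 'ه', 'k': 'ك', 't': 'ت', 'g': 'ج', 'd': 'د',
    'm': 'م', 'b': 'ب', 'r': 'ر', '7': 'ح', '3': 'ع',
}

def naive_map(text: str) -> str:
    """Apply single-letter rules before digraphs, as a plain loop might."""
    for latin in sorted(ARABIZI_MAP, key=len):
        text = text.replace(latin, ARABIZI_MAP[latin])
    return text

# Alternation with longer keys first, so 'sh' wins over 's' followed by 'h'
_PATTERN = re.compile("|".join(
    re.escape(key) for key in sorted(ARABIZI_MAP, key=len, reverse=True)
))

def longest_match_map(text: str) -> str:
    """Replace each match with its Arabic letter in a single pass."""
    return _PATTERN.sub(lambda m: ARABIZI_MAP[m.group(0)], text)

naive_result = naive_map("shms")            # 's' and 'h' are mapped separately
longest_result = longest_match_map("shms")  # 'sh' is mapped as one unit
print(naive_result, longest_result)
```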

Example

In [25]:
# Example usage
arabizi_text = "mar7aba, kayf 7alak?"
arabic_text = arabizi_to_arabic(arabizi_text)

print("Arabizi:", arabizi_text)
print("Arabic:", arabic_text)

Arabizi: mar7aba, kayf 7alak?
Arabic: ('مَرحَبَۥ كَيف حَلَك؟', True)


### 2.3 Stemming

> <span style="color: green">**_Stemming:_**</span> is the process of reducing a word to its root.

In [26]:
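def arabic_stemming(text: str, tool: str) -> str:
    '''
    A method that performs Arabic text stemming

    @param text: A sentence that requires stemming
    @param tool: which stemmer to use: "camel", "farasa", or "light" (Tashaphyne);
                 any other value falls back to the NLTK ISRI stemmer
    '''
    zen = TextBlob(text) # check for alternatives
    words = zen.words
    
    if tool == 'camel':
        stems = []
        for word in words:
            analyses = analyzer.analyze(word)
            # Keep the word unchanged when the analyzer returns no analysis
            stems.append(analyses[0]['stem'] if analyses else word)
        return ' '.join(stems)
    elif tool == 'farasa':
        return farasa_stemmer.stem(text)
    elif tool == "light":
        return ' '.join([tashaphyne_stemmer.light_stem(word) for word in words])
    else:
        return ' '.join([nltk_stemmer.stem(word) for word in words])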
        

> <span style="color: yellow">**_Note:_**</span> The ISRI stemmer is a rule-based algorithm that stems Arabic words without a root dictionary. It is the fallback when `tool` is none of `camel`, `farasa`, or `light`.
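
To make the idea of light stemming concrete, here is a toy sketch that strips at most one prefix and one suffix from a word. The affix lists, the minimum stem length, and the name `toy_light_stem` are assumptions chosen for illustration; none of the libraries above is claimed to work this way internally.

```python
# Tiny, hypothetical affix lists; real stemmers use far larger ones.
PREFIXES = ("وال", "بال", "ال", "و")
SUFFIXES = ("ون", "ات", "ين", "ة")

def toy_light_stem(word: str, min_len: int = 3) -> str:
    """Strip one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word

stems = [toy_light_stem(w) for w in ["الطلاب", "المدرسة", "ويعودون"]]
print(stems)
```

A rule set this small over-stems and under-stems easily, which is one reason the outputs of different stemmers differ.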

Example

In [27]:
text = "يذهب الطلاب إلى المدرسة صباحًا ويعودون في المساء."
print(arabic_stemming(text, "camel"))
print(arabic_stemming(text, "farasa"))
print(arabic_stemming(text, "light"))
print(arabic_stemming(text, "nltk"))

ذَهِّب طُلّاب آل مُدَرِّس صُباح عُود فِي مَساء
ذهب طالب إلى مدرسة صباح عاد في مساء .
ذهب طلاب إلى مدرس صباحا عود في مساء
ذهب طلب الى درس صبح يعد في ساء


> <span style="color: red">**_Question:_**</span> How to determine which of the models is better?

### 2.4 Lemmatization

> <span style="color: green">**_Lemmatization:_**</span> is the process of reducing the different forms of a word to a single dictionary form (the lemma).

In [28]:
def arabic_lemmatization(text: str) -> str:
    '''
    A method that performs Arabic text lemmatization

    @param text: A sentence that requires lemmatization
    @return: the lemmas joined by spaces (diacritics are kept)
    '''
    words = simple_word_tokenize(text) # check for alternatives

    lemmatized_words = []
    
    for word in words:
        lemma = morph_analyzer.analyze(word)[0]["lemma"].split("|")[0]

        # Remove any character that is not in the Arabic Unicode range
        clean_lemma = re.sub(r'[^\u0600-\u06FF]', '', lemma)
        if clean_lemma:
            lemmatized_words.append(clean_lemma)
    
    lemmatized_text = ' '.join(lemmatized_words)

    return lemmatized_text

Example

In [29]:
text = "يذهب الطلاب إلى المدرسة صباحًا ويعودون في المساء."
print("example 1: ", arabic_lemmatization(text))

text = "الرجال يحبون الأطفال والنساء يقرأن الكتب."
print("example 2:", arabic_lemmatization(text))

example 1:  ذَهَبَ طَالِبٌ إِلَى مَدْرَسَةٌ صَباحٌ عَادَ فِي مَسَاءٌ
example 2: رَجُلٌ أَحَبَّ طِفْلٌ نِسَاءٌ قَرَأَ كِتَابٌ


> <span style="color: green">**_Observation:_**</span> The `arabic_lemmatization` method produces lemmatized text with diacritics (tashkeel).
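
If diacritics are not wanted downstream, they can be stripped from the lemmas. This is a minimal sketch using only `re`, assuming the eight common marks in the range U+064B to U+0652 are the only ones present; `strip_diacritics` is an illustrative name. PyArabic's `araby.strip_tashkeel`, imported earlier, serves the same purpose.

```python
import re

# Fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun
TASHKEEL = re.compile(r'[\u064B-\u0652]')

def strip_diacritics(text: str) -> str:
    """Remove the common Arabic diacritic marks and keep everything else."""
    return TASHKEEL.sub('', text)

plain = strip_diacritics("ذَهَبَ طَالِبٌ")
print(plain)
```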

### 2.5 Stopwords

> <span style="color: green">**_Stopwords:_**</span> are the most common words in a language, such as prepositions in Arabic (حروف الجر).

In [30]:
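def remove_arabic_stopwords(text: str, custom_stopwords: typing.Optional[set] = None, use_nltk: bool=True, use_tashaphyne: bool=True) -> str:
    '''
    A method that removes stopwords from text

    @param text: a sentence that requires removing stopwords.
    @param custom_stopwords: optional extra stopwords to remove.
    @param use_nltk: whether to include the NLTK Arabic stopword list.
    @param use_tashaphyne: whether to include the Tashaphyne stopword list.
    '''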

    # Get Arabic stopwords
    stopwords = set()

    if use_nltk:
        stopwords.update(NLTK_STOPWORDS)
    if use_tashaphyne:
        stopwords.update(TASHAPHYNE_STOPWORDS)
    if custom_stopwords:
        stopwords.update(custom_stopwords)

    stopwords_comp = {"،","آض","آمينَ","آه","آهاً","آي","أ","أب","أجل","أجمع","أخ",
                    "أخذ","أصبح","أضحى","أقبل","أقل","أكثر","ألا","أم","أما",
                    "أمامك","أمامكَ","أمسى","أمّا","أن","أنا","أنت","أنتم",
                    "أنتما","أنتن","أنتِ","أنشأ","أنّى","أو","أوشك","أولئك",
                    "أولئكم","أولاء","أولالك","أوّهْ","أي","أيا","أين","أينما",
                    "أيّ","أَنَّ","أََيُّ","أُفٍّ","إذ","إذا","إذاً","إذما","إذن","إلى",
                    "إليكم","إليكما","إليكنّ","إليكَ","إلَيْكَ","إلّا","إمّا","إن",
                    "إنّما","إي","إياك","إياكم","إياكما","إياكن","إيانا","إياه",
                    "إياها","إياهم","إياهما","إياهن","إياي","إيهٍ","إِنَّ","ا",
                    "ابتدأ","اثر","اجل","احد","اخرى","اخلولق","اذا","اربعة",
                    "ارتدّ","استحال","اطار","اعادة","اعلنت","اف","اكثر","اكد",
                    "الألاء","الألى","الا","الاخيرة","الان","الاول","الاولى","التى",
                    "التي","الثاني","الثانية","الذاتي","الذى","الذي","الذين",
                    "السابق","الف","اللائي","اللاتي","اللتان","اللتيا","اللتين",
                    "اللذان","اللذين","اللواتي","الماضي","المقبل","الوقت",
                    "الى","اليوم","اما","امام","امس","ان","انبرى","انقلب",
                    "انه","انها","او","اول","اي","ايار","ايام","ايضا","ب",
                    "بات","باسم","بان","بخٍ","برس","بسبب","بسّ","بشكل","بضع",
                    "بطآن","بعد","بعض","بك","بكم","بكما","بكن","بل","بلى",
                    "بما","بماذا","بمن","بن","بنا","به","بها","بي","بيد",
                    "بين","بَسْ","بَلْهَ","بِئْسَ","تانِ","تانِك","تبدّل","تجاه","تحوّل",
                    "تلقاء","تلك","تلكم","تلكما","تم","تينك","تَيْنِ","تِه","تِي",
                    "ثلاثة","ثم","ثمّ","ثمّة","ثُمَّ","جعل","جلل","جميع","جير","حار",
                    "حاشا","حاليا","حاي","حتى","حرى","حسب","حم","حوالى","حول",
                    "حيث","حيثما","حين","حيَّ","حَبَّذَا","حَتَّى","حَذارِ","خلا","خلال",
                    "دون","دونك","ذا","ذات","ذاك","ذانك","ذانِ","ذلك","ذلكم",
                    "ذلكما","ذلكن","ذو","ذوا","ذواتا","ذواتي","ذيت","ذينك",
                    "ذَيْنِ","ذِه","ذِي","راح","رجع","رويدك","ريث","رُبَّ","زيارة",
                    "سبحان","سرعان","سنة","سنوات","سوف","سوى","سَاءَ","سَاءَمَا",
                    "شبه","شخصا","شرع","شَتَّانَ","صار","صباح","صفر","صهٍ","صهْ",
                    "ضد","ضمن","طاق","طالما","طفق","طَق","ظلّ","عاد","عام",
                    "عاما","عامة","عدا","عدة","عدد","عدم","عسى","عشر","عشرة",
                    "علق","على","عليك","عليه","عليها","علًّ","عن","عند","عندما",
                    "عوض","عين","عَدَسْ","عَمَّا","غدا","غير","ـ","ف","فان","فلان",
                    "فو","فى","في","فيم","فيما","فيه","فيها","قال","قام","قبل",
                    "قد","قطّ","قلما","قوة","كأنّما","كأين","كأيّ","كأيّن","كاد",
                    "كان","كانت","كذا","كذلك","كرب","كل","كلا","كلاهما","كلتا",
                    "كلم","كليكما","كليهما","كلّما","كلَّا","كم","كما","كي","كيت",
                    "كيف","كيفما","كَأَنَّ","كِخ","لئن","لا","لات","لاسيما","لدن","لدى",
                    "لعمر","لقاء","لك","لكم","لكما","لكن","لكنَّما","لكي","لكيلا",
                    "للامم","لم","لما","لمّا","لن","لنا","له","لها","لو","لوكالة",
                    "لولا","لوما","لي","لَسْتَ","لَسْتُ","لَسْتُم","لَسْتُمَا","لَسْتُنَّ","لَسْتِ",
                    "لَسْنَ","لَعَلَّ","لَكِنَّ","لَيْتَ","لَيْسَ","لَيْسَا","لَيْسَتَا","لَيْسَتْ","لَيْسُوا",
                    "لَِسْنَا","ما","ماانفك","مابرح","مادام","ماذا","مازال","مافتئ",
                    "مايو","متى","مثل","مذ","مساء","مع","معاذ","مقابل","مكانكم",
                    "مكانكما","مكانكنّ","مكانَك","مليار","مليون","مما","ممن","من",
                    "منذ","منها","مه","مهما","مَنْ","مِن","نحن","نحو","نعم","نفس",
                    "نفسه","نهاية","نَخْ","نِعِمّا","نِعْمَ","ها","هاؤم","هاكَ","هاهنا",
                    "هبّ","هذا","هذه","هكذا","هل","هلمَّ","هلّا","هم","هما","هن",
                    "هنا","هناك","هنالك","هو","هي","هيا","هيت","هيّا","هَؤلاء",
                    "هَاتانِ","هَاتَيْنِ","هَاتِه","هَاتِي","هَجْ","هَذا","هَذانِ","هَذَيْنِ",
                    "هَذِه","هَذِي","هَيْهَاتَ","و","وا","واحد","واضاف","واضافت","واكد",
                    "وان","واهاً","واوضح","وراءَك","وفي","وقال","وقالت","وقد",
                    "وقف","وكان","وكانت","ولا","ولم","ومن","مَن","وهو","وهي",
                    "ويكأنّ","وَيْ","وُشْكَانََ","يكون","يمكن","يوم","ّأيّان"}

    words = simple_word_tokenize(text)

    return " ".join([w for w in words if w not in stopwords and w not in stopwords_comp and len(w) >= 2])

Example

In [31]:
# Example 1: Simple sentence with a few common stopwords.
text1 = "أنا أحب التفاح في الصباح"
# Expected behavior: Words like "أنا" and "في" (if included in our stopword sets) should be removed.
print("Example 1")
print("Original:", text1)
print("Filtered:", remove_arabic_stopwords(text1))
print()

# Example 2: Sentence with additional stopwords from stopwords_comp.
text2 = "الطالب المجتهد يدرس في الجامعة"
# Expected behavior: Words that are common stopwords (e.g., "في") should be removed.
print("Example 2")
print("Original:", text2)
print("Filtered:", remove_arabic_stopwords(text2))
print()

# Example 3: Using custom stopwords to remove an additional word.
custom_stops = {"المجتهد"}
text3 = "الطالب المجتهد يدرس في الجامعة"
# Expected behavior: In addition to the default stopwords, "المجتهد" should be removed.
print("Example 3")
print("Original:", text3)
print("Filtered with custom stopwords:", remove_arabic_stopwords(text3, custom_stopwords=custom_stops))

Example 1
Original: أنا أحب التفاح في الصباح
Filtered: أحب التفاح الصباح

Example 2
Original: الطالب المجتهد يدرس في الجامعة
Filtered: الطالب المجتهد يدرس الجامعة

Example 3
Original: الطالب المجتهد يدرس في الجامعة
Filtered with custom stopwords: الطالب يدرس الجامعة


### 2.6 Handling Hashtags

*Purpose:* Arabic text sometimes contains hashtags, for example "#مبارك_عليكم_الشهر ربي اجعل شهر رمضان فاتحة خير لنا وبداية أجمل أقدارنا وحقق لنا ما نتمنى يا كريم#", which should be converted to "مبارك عليكم الشهر ربي اجعل شهر رمضان فاتحة خير لنا وبداية أجمل أقدارنا وحقق لنا ما نتمنى يا كريم".

In [32]:
def has_hashtag(word: str) -> bool:
    '''
    Checks whether a word starts or ends with a hashtag.
    @param word: a single word
    @return: True if the word starts or ends with "#", False otherwise.
    '''
    return word.startswith("#") or word.endswith("#")

def split_hashtag_to_words(tag: str) -> list:
    '''
    Converts a hashtag to a list of words.
    If the hashtag uses underscores, they are used as delimiters;
    otherwise, it applies a camel-case splitting pattern and falls back
    to the tag itself when the pattern finds nothing (e.g., Arabic text).
    
    @param tag: a hashtag (e.g., "#مبارك_عليكم_الشهر")
    @return: a list of words extracted from the hashtag.
    '''
    tag = tag.replace('#', '')
    tags = tag.split('_')
    if len(tags) > 1:
        return tags
    pattern = re.compile(r"[A-Z][a-z]+|\d+|[A-Z]+(?![a-z])")
    words = pattern.findall(tag)
    if words:
        return words
    # Arabic has no letter case, so keep a single-word tag as it is
    return [tag] if tag else []

def extract_hashtag(text: str) -> list:
    '''
    Extracts words from hashtags found in the input text.
    It removes trailing punctuation and then splits the hashtag.
    
    @param text: a sentence that contains one or more hashtags.
    @return: a list of words extracted from hashtags.
    '''
    hash_list = [re.sub(r"(\W+)$", "", i) for i in text.split() if i.startswith("#") or i.endswith("#")]
    word_list = []
    for word in hash_list:
        word_list.extend(split_hashtag_to_words(word))
    return word_list

def clean_arabic_hashtag(text: str) -> str:
    '''
    Replaces each hashtag in the text with its space-separated equivalent.
    Note: Words that start or end with "#" are processed.
    
    @param text: a sentence that contains hashtags.
    @return: the text with cleaned hashtags.
    '''
    words = text.split()
    output = []
    for word in words:
        if has_hashtag(word):
            output.extend(extract_hashtag(word))
        else:
            output.append(word)
    return " ".join(output)

Example

In [33]:
print("Example:")
# Hashtags at both beginning and end.
test3 = "#مبارك_عليكم_الشهر ربي اجعل شهر رمضان فاتحة خير لنا وبداية أجمل أقدارنا وحقق لنا ما نتمنى يا كريم#"
print("Original:", test3)
print("Cleaned: ", clean_arabic_hashtag(test3))
# Expected output:
# the first token starts with '#' so it gets converted to "مبارك عليكم الشهر",
# and the last token ends with '#' so it is kept as "كريم".
# Thus, output will be: "مبارك عليكم الشهر ربي اجعل شهر رمضان فاتحة خير لنا وبداية أجمل أقدارنا وحقق لنا ما نتمنى يا كريم"

Example:
Original: #مبارك_عليكم_الشهر ربي اجعل شهر رمضان فاتحة خير لنا وبداية أجمل أقدارنا وحقق لنا ما نتمنى يا كريم#
Cleaned:  مبارك عليكم الشهر ربي اجعل شهر رمضان فاتحة خير لنا وبداية أجمل أقدارنا وحقق لنا ما نتمنى يا كريم


### 2.7 Handling Emojis 🤪

In [34]:
def handle_emojis(text: str, mode: str = 'remove') -> str:
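    '''
    A method that handles emojis.

    @param text: a sentence that may contain emojis.
    @param mode: 'remove' deletes emojis, 'description' replaces each emoji with its Arabic name;
                 any other value returns the text unchanged.
    '''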
    if mode == 'remove':
        return emoji.replace_emoji(text, '')
    elif mode == 'description':
        return emoji.demojize(text, language='ar')
    return text

Example

In [35]:
text_with_emoji = "أنا أحب القراءة 📚 وأستمتع بها كثيراً 😊"
text_without_emoji = handle_emojis(text_with_emoji, 'remove')
text_with_descriptions = handle_emojis(text_with_emoji, 'description')

print("Original:", text_with_emoji)
print("Without emojis:", text_without_emoji)
print("With emoji descriptions:", text_with_descriptions)

Original: أنا أحب القراءة 📚 وأستمتع بها كثيراً 😊
Without emojis: أنا أحب القراءة  وأستمتع بها كثيراً 
With emoji descriptions: أنا أحب القراءة :كتب: وأستمتع بها كثيراً :وجه_باسم_بعينين_باسمتين:


> <span style="color: red">**_TODO:_**</span> Search on how to extract meaning from emoji

### 2.8 Normalization

> <span style="color: green">**_Normalization:_**</span> maps characters that look alike but have different encodings (for example, the different forms of alif) to a single form, and removes diacritics.

In [36]:
def normalize_arabic(text: str, tool: str) -> str:
    '''
    A method that maps characters that look alike but have different encodings to one form

    @param text: a sentence that requires normalizing its text.
    @param tool: which library to use: "tnkeeh" or "camel"; any other value uses the regex rules below.
    '''
    
    if tool == "tnkeeh":
        normalizer = tn.Tnkeeh(normalize=True)
        output = normalizer.clean_raw_text(text)
        return output[0]
    elif tool == "camel":
        return normalize_unicode(text)
    else:
        text = text.strip()
        text = re.sub("[إأٱآا]", "ا", text)
        text = re.sub("ى", "ي", text)
        text = re.sub("ؤ", "ء", text)
        text = re.sub("ئ", "ء", text)
        text = re.sub("ة", "ه", text)
        text = re.sub("گ", "ك", text)
        text = re.sub("ڤ", "ف", text)
        text = re.sub("چ", "ج", text)
        text = re.sub("پ", "ب", text)
        text = re.sub("ڜ", "ش", text)
        text = re.sub("ڪ", "ك", text)
        text = re.sub("ڧ", "ق", text)
        text = re.sub("ٱ", "ا", text)
        noise = re.compile(""" ّ    | # Tashdid
                                َ    | # Fatha
                                ً    | # Tanwin Fath
                                ُ    | # Damma
                                ٌ    | # Tanwin Damm
                                ِ    | # Kasra
                                ٍ    | # Tanwin Kasr
                                ْ    | # Sukun
                                ـ     # Tatwil/Kashida
                            """, re.VERBOSE)
        text = re.sub(noise, '', text)
        text = re.sub(r'(.)\1+', r"\1\1", text) # Collapse runs of a repeated character to two occurrences
        return araby.strip_tashkeel(text)
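
The chain of single-character `re.sub` calls can also be expressed as one translation table. The sketch below is an illustrative alternative, not the notebook's implementation: `CHAR_MAP` copies some of the rules above, and the Farsi yeh rule (U+06CC) is an assumed addition that `normalize_arabic` does not have.

```python
import re

# One translation table instead of a chain of re.sub calls.
CHAR_MAP = str.maketrans({
    "\u0625": "\u0627",  # alif with hamza below -> bare alif
    "\u0623": "\u0627",  # alif with hamza above -> bare alif
    "\u0622": "\u0627",  # alif with madda       -> bare alif
    "\u0671": "\u0627",  # alif wasla            -> bare alif
    "\u0649": "\u064A",  # alif maksura          -> yeh
    "\u0629": "\u0647",  # teh marbuta           -> heh
    "\u06AF": "\u0643",  # gaf                   -> kaf
    "\u0686": "\u062C",  # tcheh                 -> jeem
    "\u067E": "\u0628",  # peh                   -> beh
    "\u06CC": "\u064A",  # Farsi yeh -> Arabic yeh (assumption, not in the original rules)
})

def normalize_chars(text: str) -> str:
    """Map look-alike letters to one form, then shorten long character runs."""
    text = text.strip().translate(CHAR_MAP)
    # Runs of the same character are reduced to two occurrences
    return re.sub(r'(.)\1+', r'\1\1', text)

print(normalize_chars("\u0625\u0633\u0644\u0627\u0645"))
print(normalize_chars("\u0686\u0627\u06CC"))
print(normalize_chars("\u0645" * 8))
```

`str.translate` handles all single-character rules in one pass, which keeps the rule list in one place and makes a missing rule easier to spot.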

Example

In [37]:
print("Example 1: Replace various forms of alif with a bare alif.")
input_text = "إسلام"
expected = "اسلام"  # "إ" replaced with "ا"
result = normalize_arabic(input_text, tool="other")
print("Input: ", input_text)
print("Excepted: ", expected)
print("Output: ", result)
print("Pass" if result == expected else f"Fail (got '{result}', expected '{expected}')")

print("\nExample 2: Remove diacritics and perform character replacements.")
input_text = "مُدَرِّسة"  # Contains diacritics and ends with ة
expected = "مدرسه"  # Expected: diacritics removed and ة -> ه
result = normalize_arabic(input_text, tool="other")
print("Input: ", input_text)
print("Excepted: ", expected)
print("Output: ", result)
print("Pass" if result == expected else f"Fail (got '{result}', expected '{expected}')")

print("\nExample 3: Collapse repeated characters.")
input_text = "مممممممم"  # Many repeated م's
expected = "مم"  # Reduced to two occurrences
result = normalize_arabic(input_text, tool="other")
print("Input: ", input_text)
print("Excepted: ", expected)
print("Output: ", result)
print("Pass" if result == expected else f"Fail (got '{result}', expected '{expected}')")

print("\nExample 4: Multiple replacement rules in one sentence.")
input_text = "گلاب چای پيت ڤيديو ڜهر ڪتاب ڧكر ٱمان"
# Expected replacements:
# گ -> ك  => "گلاب" -> "كلاب"
# چ -> ج  => "چای" -> "جاي"
# پ -> ب  => "پيت" -> "بيت"
# ڤ -> ف  => "ڤيديو" -> "فيديو"
# ڜ -> ش  => "ڜهر" -> "شهر"
# ڪ -> ك  => "ڪتاب" -> "كتاب"
# ڧ -> ق  => "ڧكر" -> "قكر"
# ٱ -> ا  => "ٱمان" -> "امان"
expected = "كلاب جاي بيت فيديو شهر كتاب قكر امان"
result = normalize_arabic(input_text, tool="other")
print("Input: ", input_text)
print("Excepted: ", expected)
print("Output: ", result)
print("Pass" if result == expected else f"Fail (got '{result}', expected '{expected}')")

print("\nExample 5: Multiple replacement rules in one sentence with Tnkeeh library.")
input_text = "كلاب جاي بيت فيديو شهر كتاب قكر امان"
result = normalize_arabic(input_text, tool="tnkeeh")
print("TNKEEH branch output:", result)

print("\nExample 6: Multiple replacement rules in one sentence with CAMEL library.")
input_text = "كلاب جاي بيت فيديو شهر كتاب قكر امان"
result = normalize_arabic(input_text, tool="camel")
print("CAMEL branch output:", result)

Example 1: Replace various forms of alif with a bare alif.
Input:  إسلام
Expected:  اسلام
Output:  اسلام
Pass

Example 2: Remove diacritics and perform character replacements.
Input:  مُدَرِّسة
Expected:  مدرسه
Output:  مدرسه
Pass

Example 3: Collapse repeated characters.
Input:  مممممممم
Expected:  مم
Output:  مم
Pass

Example 4: Multiple replacement rules in one sentence.
Input:  گلاب چای پيت ڤيديو ڜهر ڪتاب ڧكر ٱمان
Expected:  كلاب جاي بيت فيديو شهر كتاب قكر امان
Output:  كلاب جای بيت فيديو شهر كتاب قكر امان
Fail (got 'كلاب جای بيت فيديو شهر كتاب قكر امان', expected 'كلاب جاي بيت فيديو شهر كتاب قكر امان')

Example 5: Multiple replacement rules in one sentence with Tnkeeh library.
TNKEEH branch output: كلاب جاي بيت فيديو شهر كتاب قكر امان

Example 6: Multiple replacement rules in one sentence with CAMEL library.
CAMEL branch output: كلاب جاي بيت فيديو شهر كتاب قكر امان


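> <span style="color: green">**_Observation:_**</span> The normalization process works for most cases, but Example 4 fails: the final letter of "چای" (Farsi yeh ی, U+06CC) is not mapped to the Arabic yeh (ي, U+064A), so the output is "جای" instead of "جاي". Examples 5 and 6 start from already-normalized text, so they do not exercise this case.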
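
The mismatch in Example 4 could be closed with an extra replacement rule. The sketch below is a minimal illustration, assuming `normalize_arabic` has no rule for these code points; `EXTRA_RULES` and `apply_extra_rules` are hypothetical names, not part of the notebook.

```python
# Hypothetical extra replacement rules; not part of normalize_arabic as shown.
EXTRA_RULES = str.maketrans({
    "\u06CC": "\u064A",  # Farsi yeh (ی) -> Arabic yeh (ي)
    "\u06A9": "\u0643",  # Farsi keheh (ک) -> Arabic kaf (ك)
})

def apply_extra_rules(text: str) -> str:
    """Map Farsi letter variants to their Arabic counterparts."""
    return text.translate(EXTRA_RULES)

print(apply_extra_rules("\u062C\u0627\u06CC"))  # جای becomes جاي
```

Such a step could run before or after the existing replacements, since the two rule sets do not touch the same characters.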

### 2.9 Specific Noise Removal

> <span style="color: green">**_Noise Removal:_**</span> extends noise removal to handle more cases, such as tatweel, HTML tags, URLs, and non-Arabic characters.

In [38]:
def remove_arabic_noise(text: str) -> str:
    '''
    A method that removes specific noise in text such as tatweel, HTML tags, URLs, etc.

    :param text: A sentence to be processed.
    :return: Cleaned text containing only characters from the Arabic Unicode block (U+0600-U+06FF) separated by single spaces.
    '''
    # Remove tatweel (ـ)
    text = re.sub(r'\u0640', '', text)
    
    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)
    
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    
    # Remove non-Arabic characters (keep Arabic Unicode block and whitespace)
    text = re.sub(r'[^\u0600-\u06FF\s]', '', text)
    
    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

Example

In [39]:
# Example 1: Remove tatweel and HTML tags.
example1 = "مــــــــرحبا بك في موقعنا <b>الرائع</b>!"
# Expected: "مرحبا بك في موقعنا الرائع"
result1 = remove_arabic_noise(example1)
print("Example 1")
print("Input:    ", example1)
print("Expected: ", "مرحبا بك في موقعنا الرائع")
print("Got:      ", result1)
print("Pass:" , result1 == "مرحبا بك في موقعنا الرائع")
print()

# Example 2: Remove URLs.
example2 = "تفضل بزيارة http://example.com للحصول على المزيد من المعلومات."
# Expected: "تفضل بزيارة للحصول على المزيد من المعلومات"
result2 = remove_arabic_noise(example2)
print("Example 2")
print("Input:    ", example2)
print("Expected: ", "تفضل بزيارة للحصول على المزيد من المعلومات")
print("Got:      ", result2)
print("Pass:" , result2 == "تفضل بزيارة للحصول على المزيد من المعلومات")
print()
    
# Example 3: Remove non-Arabic noise (Latin letters, numbers, punctuation)
example3 = "هذا نص تجريبي مع أحرف لاتينية مثل ABC وأرقام 123 ورموز @#!."
# Expected: "هذا نص تجريبي مع أحرف لاتينية مثل ورموز"
result3 = remove_arabic_noise(example3)
print("Example 3")
print("Input:    ", example3)
print("Expected: ", "هذا نص تجريبي مع أحرف لاتينية مثل ورموز")
print("Got:      ", result3)
print("Pass:" , result3 == "هذا نص تجريبي مع أحرف لاتينية مثل ورموز")
print()
    
# Example 4: Remove extra spaces and HTML tags with tatweel.
example4 = "   <div>ـــــــــسلام</div>   "
# Expected: "سلام"
result4 = remove_arabic_noise(example4)
print("Example 4")
print("Input:    ", example4)
print("Expected: ", "سلام")
print("Got:      ", result4)
print("Pass:" , result4 == "سلام")
print()

Example 1
Input:     مــــــــرحبا بك في موقعنا <b>الرائع</b>!
Expected:  مرحبا بك في موقعنا الرائع
Got:       مرحبا بك في موقعنا الرائع
Pass: True

Example 2
Input:     تفضل بزيارة http://example.com للحصول على المزيد من المعلومات.
Expected:  تفضل بزيارة للحصول على المزيد من المعلومات
Got:       تفضل بزيارة للحصول على المزيد من المعلومات
Pass: True

Example 3
Input:     هذا نص تجريبي مع أحرف لاتينية مثل ABC وأرقام 123 ورموز @#!.
Expected:  هذا نص تجريبي مع أحرف لاتينية مثل ورموز
Got:       هذا نص تجريبي مع أحرف لاتينية مثل وأرقام ورموز
Pass: False

Example 4
Input:        <div>ـــــــــسلام</div>   
Expected:  سلام
Got:       سلام
Pass: True



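> <span style="color: green">**_Observation:_**</span> The overall removal of noise in text is successful. Example 3 reports `Pass: False` only because its expected string omits "وأرقام" ("and numbers"), which is an Arabic word and is correctly kept; the Latin letters, the digits, and the symbols are removed as intended.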

> <span style="color: red">**_TODO:_**</span> Check whether other types of noise need to be removed.
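
One case worth checking: the range `\u0600-\u06FF` kept by `remove_arabic_noise` also contains Arabic punctuation (، ؛ ؟), Arabic-Indic digits (٠-٩), and diacritics, so these pass through untouched. A stricter filter is sketched below. `keep_arabic_letters` and its character ranges are assumptions for illustration, not part of the notebook's pipeline.

```python
import re

# Basic Arabic letters only: hamza..ghain (U+0621-U+063A) and feh..yeh (U+0641-U+064A).
# This deliberately excludes tatweel (U+0640), diacritics, digits, and punctuation.
_NON_LETTER = re.compile(r'[^\u0621-\u063A\u0641-\u064A\s]')

def keep_arabic_letters(text: str) -> str:
    """Drop everything except basic Arabic letters and whitespace."""
    text = _NON_LETTER.sub('', text)
    return re.sub(r'\s+', ' ', text).strip()

print(keep_arabic_letters("مرحبا، ١٢٣ كيف؟"))
```

Because extended letters such as پ and گ fall outside these ranges, a filter like this would be safest after normalization has mapped them to basic letters.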

### 2.10 Tokenization

> <span style="color: green">**_Tokenization:_**</span> is the process of breaking a sequence of text into smaller units called tokens, such as words, phrases, symbols, and other elements. For Arabic, tokenization is complicated by the language's rich morphology: clitics such as the definite article, prepositions, and conjunctions attach directly to words, so whitespace alone does not separate every meaningful unit.

In [40]:
def tokenize_arabic(text: str, method='simple', model='msa', scheme='bwtok'):
    '''
    Tokenizes an Arabic sentence using either a simple or morphological approach.

    :param text: The Arabic sentence to be tokenized.
    :param method: Tokenization method to use. Options:
                   - 'simple': Uses a basic whitespace-based tokenizer.
                   - 'morphological': Uses a morphological analyzer for tokenization.
    :param model: Specifies the morphological model to use (only applicable if `method='morphological'`).
                  Options:
                   - 'msa': Modern Standard Arabic (default).
                   - 'egy': Egyptian Arabic (any value other than 'msa' selects this model).
    :param scheme: Tokenization scheme for the morphological method. Only honored when
                   `model='msa'`; the Egyptian branch always uses 'bwtok'. Options:
                   - 'bwtok': Buckwalter tokenization (default).
                   - 'd3tok': D3 tokenization.
                   - 'atbtok': ATB tokenization.

    :return: A list of tokenized words.
    '''

    if method == 'simple':
        return simple_word_tokenize(text)
    elif method == 'morphological':
        words = simple_word_tokenize(text)

        if model == 'msa':
            # Load the pre-trained MSA disambiguator and tokenize with the requested scheme
            mle_msa = MLEDisambiguator.pretrained('calima-msa-r13')
            msa_tokenizer = MorphologicalTokenizer(disambiguator=mle_msa, scheme=scheme)
            return msa_tokenizer.tokenize(words)
        else:
            # Load the pre-trained Egyptian disambiguator; the scheme is fixed to 'bwtok'
            mle_egy = MLEDisambiguator.pretrained('calima-egy-r13')
            egy_tokenizer = MorphologicalTokenizer(disambiguator=mle_egy, scheme='bwtok')
            return egy_tokenizer.tokenize(words)
    else:
        raise ValueError(f"Unknown tokenization method: {method!r}")

Example

In [41]:
text = "هذا مثال على تقطيع النص العربي بطريقة متقدمة."
simple_tokens = tokenize_arabic(text, 'simple')
morphological_tokens = tokenize_arabic(text, 'morphological')

print("Simple tokenization:", simple_tokens)
print("Morphological tokenization:", morphological_tokens)

Simple tokenization: ['هذا', 'مثال', 'على', 'تقطيع', 'النص', 'العربي', 'بطريقة', 'متقدمة', '.']
Morphological tokenization: ['هذا', 'مثال', 'على', 'تقطيع', 'ال+_نص', 'ال+_عربي', 'ب+_طريق_+ة', 'متقدم_+ة', '.']
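
For quick experiments without CAMeL Tools, the `'simple'` branch can be approximated with a regular expression that separates runs of word characters from individual punctuation marks. This is only an approximation of `simple_word_tokenize`, not its implementation, and `regex_word_tokenize` is a hypothetical helper.

```python
import re

def regex_word_tokenize(text: str) -> list:
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r'\w+|[^\w\s]', text)

tokens = regex_word_tokenize("هذا مثال على تقطيع النص العربي بطريقة متقدمة.")
print(tokens)
```

Python's `\w` does not match combining diacritics, so this sketch would fragment diacritized words; it is best applied after dediacritization.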


### 2.11 Dediacritization

> <span style="color: green">**_Dediacritization:_**</span> is the process of removing Arabic diacritical marks. Diacritics increase data sparsity, so most Arabic NLP techniques ignore them.

In [42]:
def arabic_dediacrition(text: str, method='remove', tool='pyarabic') -> str:
    '''
    Removes or normalizes Arabic diacritical marks (Tashkeel).

    :param text: An Arabic sentence that requires dediacritization.
    :param method: The dediacritization method to apply. Options:
                   - 'remove': Removes all diacritics (default).
                   - 'normalize': Strips Shadda and normalizes Hamza forms; other diacritics are kept.
                   - 'keep': Keeps the diacritics as they are.
    :param tool: The library to use for dediacritization. Options:
                 - 'pyarabic': Uses `pyarabic` for diacritic removal (default).
                 - 'camel': Uses `camel_tools` for diacritic removal.

    :return: A string with the processed text.

    **Example Usage:**
    >>> text_with_diacritics = "اللُّغَةُ العَرَبِيَّةُ جَمِيلَةٌ"
    >>> arabic_dediacrition(text_with_diacritics, 'remove')
    'اللغة العربية جميلة'
    
    >>> arabic_dediacrition(text_with_diacritics, 'normalize')
    'اللُغَةُ العَرَبِيَةُ جَمِيلَةٌ'

    >>> arabic_dediacrition(text_with_diacritics, 'keep')
    'اللُّغَةُ العَرَبِيَّةُ جَمِيلَةٌ'
    '''

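    if method == 'remove':
        if tool == 'pyarabic':
            return araby.strip_diacritics(text)
        elif tool == 'camel':
            return dediac_ar(text)
        else:
            raise ValueError(f"Unknown dediacritization tool: {tool!r}")
    elif method == 'normalize':
        # Strip shadda and normalize hamza forms; other diacritics are left in place
        return araby.normalize_hamza(araby.strip_shadda(text))
    else:
        # 'keep' (or any other value): return the text unchanged
        return text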

Example

In [43]:
text_with_diacritics = "اللُّغَةُ العَرَبِيَّةُ جَمِيلَةٌ"
removed_diacritics_1 = arabic_dediacrition(text_with_diacritics, 'remove', "pyarabic")
removed_diacritics_2 = arabic_dediacrition(text_with_diacritics, 'remove', "camel")
normalized_diacritics = arabic_dediacrition(text_with_diacritics, 'normalize')

print("Original:", text_with_diacritics)
print("Removed diacritics using PyArabic:", removed_diacritics_1)
print("Removed diacritics using CAMeL:", removed_diacritics_2)
print("Normalized diacritics:", normalized_diacritics)

Original: اللُّغَةُ العَرَبِيَّةُ جَمِيلَةٌ
Removed diacritics using PyArabic: اللغة العربية جميلة
Removed diacritics using CAMeL: اللغة العربية جميلة
Normalized diacritics: اللُغَةُ العَرَبِيَةُ جَمِيلَةٌ
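
Both libraries agree on the `'remove'` result, and the same effect can be approximated with a single regular expression. The sketch below is an assumption-laden stand-in, not how `pyarabic` or CAMeL Tools implement it: `strip_diacritics_regex` is a hypothetical helper and the chosen range covers only the most common marks.

```python
import re

# Tanween, short vowels, shadda, sukun (U+064B-U+0652) and superscript alef (U+0670).
_DIACRITICS = re.compile(r'[\u064B-\u0652\u0670]')

def strip_diacritics_regex(text: str) -> str:
    """Remove common Arabic diacritical marks using a regular expression."""
    return _DIACRITICS.sub('', text)

print(strip_diacritics_regex("اللُّغَةُ العَرَبِيَّةُ جَمِيلَةٌ"))
```

Less common marks (for example Quranic annotation signs) fall outside this range and would need to be added explicitly.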


### 2.12 Dialect Identification

> <span style="color: green">**_Dialect Identification:_**</span> is the task of determining which dialect a text belongs to, down to the city level.

> <span style="color: yellow">**_Note:_**</span> CAMeL Tools' `DialectIdentifier` is not supported on Windows.

In [44]:
# from camel_tools.dialectid import DialectIdentifier

def identify_dialect(text: str, target: str) -> list:
    """
    Identifies the dialect of a given Arabic text at various levels of granularity.

    This function uses a pretrained dialect identification model from Camel Tools
    to determine the dialect of an input text. The model can distinguish among 25
    city-level dialects as well as Modern Standard Arabic (MSA). In addition to
    city-level identification, the model can provide aggregated predictions at the
    regional and country levels. 

    **Note:** The Camel Tools dialect identification module is not available on Windows.

    :param text: A string containing Arabic text.
    :param target: A string indicating the level of dialect granularity.
                   Options include:
                     - "city": for fine-grained, city-level dialect identification.
                     - "country": for aggregated country-level predictions.
                     - "region": for aggregated region-level predictions.
                     - Any other value defaults to the full prediction (typically a list of labels).
    :return: A list of predicted dialect labels corresponding to the specified granularity.

    **Example:**
    >>> text = "هذا نص عربي باللهجة المصرية."
    >>> identify_dialect(text, "city")
    ['القاهرة']  # (Example output; actual predictions depend on the pretrained model.)
    """
    did = DialectIdentifier.pretrained()  # Pretrained dialect identification model.

    # predict() expects a list of sentences, so wrap the single input text
    if target in ("city", "country", "region"):
        return did.predict([text], target)
    else:
        return did.predict([text])


def normalize_dialect(text: str, target_dialect: str = 'MSA') -> str:
    """
    Normalizes an Arabic text to a specified dialect variant.

    This is a placeholder function. In practice, dialect normalization may involve
    complex transformations to convert text from one dialect to another. For now,
    the function simply returns the original text.

    :param text: A string containing Arabic text.
    :param target_dialect: A string representing the target dialect for normalization.
                           Default is 'MSA' (Modern Standard Arabic).
    :return: The input text unmodified (placeholder implementation).

    **Example:**
    >>> normalize_dialect("هذا نص باللهجة المصرية", target_dialect="MSA")
    "هذا نص باللهجة المصرية"
    """
    return text


Example

In [45]:
# text = "شلونك حبيبي؟ شخبارك اليوم؟"
# dialect = identify_dialect(text)
# normalized_text = normalize_dialect(text)

# print("Original text:", text)
# print("Identified dialect:", dialect)
# print("Normalized to MSA:", normalized_text)

In [46]:
# An Arabic Dialect Identifier
# model_name = "lafifi-24/arabicBert_arabic_dialect_identification"

# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForSequenceClassification.from_pretrained(model_name)

# dialect_classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

# text = "هذا نص عربي باللهجة المصرية."

# result = dialect_classifier(text)

# print(result)

In [47]:
# Example Arabic text (dialect identification)
# text = "عامل اه يا صاحبي."

# Predictions
# result = dialect_classifier(text)

# print(result)

> <span style="color: red">**_TODO:_**</span> look for other tools that perform dialect identification.

### 2.13 Punctuation Removal

> <span style="color: green">**_Punctuation Removal:_**</span> is the elimination of punctuation characters, covering both standard English punctuation and common Arabic punctuation marks.

In [48]:
def remove_arabic_punctuations(text: str) -> str:
    """
    Remove punctuation characters from an Arabic text.

    This function replaces any punctuation character—covering both standard English punctuation 
    and common Arabic punctuation marks (e.g., the Arabic comma "،" and question mark "؟")—with a space.
    After replacement, it normalizes the whitespace by collapsing multiple spaces into one and
    trimming leading and trailing whitespace.

    Parameters:
        text (str): The input Arabic text to be processed.

    Returns:
        str: The text after removing punctuation and normalizing whitespace.

    Example:
        >>> remove_arabic_punctuations("مرحباً، كيف حالك؟")
        'مرحباً كيف حالك'
    """

    punctuations = r"""!"#$%&'()*+,،-./:;<=>؟?@[\]^_`{|}~؛"""  # raw string avoids an invalid-escape warning

    text = re.sub('[%s]' % re.escape(punctuations), ' ', text)

    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

Example

In [49]:
test_cases = [
    # (Input text, Expected output)
    ("مرحباً، كيف حالك؟", "مرحباً كيف حالك"),
    ("هذا نص! يحتوي على: علامات، ترقيم؟ وأخرى.", "هذا نص يحتوي على علامات ترقيم وأخرى"),
    ("", ""),
    ("!@#$%^&*()", ""),
    ("سلام - كيف حالك؟", "سلام كيف حالك"),
    ("تجربة... مع نقاط ثلاثية!!!", "تجربة مع نقاط ثلاثية"),
    ("هذا، ذلك؛ وهذا؟", "هذا ذلك وهذا")
]

for i, (input_text, expected) in enumerate(test_cases, 1):
    result = remove_arabic_punctuations(input_text)
    status = "Pass" if result == expected else "Fail"
    print(f"Test {i}: {status}")
    print("Input:    ", input_text)
    print("Expected: ", expected)
    print("Got:      ", result)
    print()

Test 1: Pass
Input:     مرحباً، كيف حالك؟
Expected:  مرحباً كيف حالك
Got:       مرحباً كيف حالك

Test 2: Pass
Input:     هذا نص! يحتوي على: علامات، ترقيم؟ وأخرى.
Expected:  هذا نص يحتوي على علامات ترقيم وأخرى
Got:       هذا نص يحتوي على علامات ترقيم وأخرى

Test 3: Pass
Input:     
Expected:  
Got:       

Test 4: Pass
Input:     !@#$%^&*()
Expected:  
Got:       

Test 5: Pass
Input:     سلام - كيف حالك؟
Expected:  سلام كيف حالك
Got:       سلام كيف حالك

Test 6: Pass
Input:     تجربة... مع نقاط ثلاثية!!!
Expected:  تجربة مع نقاط ثلاثية
Got:       تجربة مع نقاط ثلاثية

Test 7: Pass
Input:     هذا، ذلك؛ وهذا؟
Expected:  هذا ذلك وهذا
Got:       هذا ذلك وهذا
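
The fixed character list misses marks it does not name, such as guillemets («») or dashes other than the ASCII hyphen. A possible alternative, sketched here as an assumption rather than the notebook's method, classifies characters by Unicode category; `remove_punctuation_by_category` is a hypothetical helper.

```python
import re
import unicodedata

def remove_punctuation_by_category(text: str) -> str:
    """Replace every Unicode punctuation (P*) or symbol (S*) character with a space."""
    cleaned = ''.join(
        ' ' if unicodedata.category(ch)[0] in ('P', 'S') else ch
        for ch in text
    )
    # Collapse the spaces left behind by the replacements
    return re.sub(r'\s+', ' ', cleaned).strip()

print(remove_punctuation_by_category("«مرحباً»، كيف حالك؟"))
```

Diacritics belong to the mark categories (M*), so a tanween such as the one in "مرحباً" is preserved, matching the behavior of `remove_arabic_punctuations`.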



### 2.14 Named entity recognition (NER)

> <span style="color: green">**_Named entity recognition:_**</span> find and label named entities like proper nouns, organisations, places, etc.

For each token in an input sentence, `NERecognizer` outputs a label that indicates the type of named entity. The system outputs one of the following labels for each token: `'B-LOC'`, `'B-ORG'`, `'B-PERS'`, `'B-MISC'`, `'I-LOC'`, `'I-ORG'`, `'I-PERS'`, `'I-MISC'`, `'O'`.
Named entities can be a `LOC` (location), `ORG` (organization), `PERS` (person), or `MISC` (miscellaneous).

Labels beginning with `B` indicate that the corresponding token is the beginning of a multi-word named entity or is a single-token named entity. Those beginning with `I` indicate that the corresponding token continues a multi-word named entity. Words that aren't named entities are given the `'O'` label.

The example below illustrates how `NERecognizer` can be used to label named-entities in a given sentence.

In [114]:
def recognize_arabic_entities(text: str, tool: str = "nltk") -> list:
    """
    Recognize named entities in an Arabic sentence using a pretrained NER model.

    For each token in the input sentence, the model outputs a label indicating
    its named-entity type. The possible labels are:
        - 'B-LOC', 'B-ORG', 'B-PERS', 'B-MISC': The beginning of a location,
          organization, person, or miscellaneous entity (or a single-token entity).
        - 'I-LOC', 'I-ORG', 'I-PERS', 'I-MISC': Continuation tokens for multi-word entities.
        - 'O': A token that does not belong to any named entity.

    The function processes the input text, obtains NER labels, and then aggregates
    contiguous tokens with the same entity type into a single named entity.

    Parameters:
        text (str): An Arabic sentence for which named-entity recognition is performed.
        tool (str): The NER backend to use. Only 'camel' is currently implemented, so pass it
                    explicitly; the default ('nltk') has no implementation yet.

    Returns:
        list of tuples: Each tuple contains a recognized entity (a string) and its type (e.g., 'LOC', 'ORG').
                        If no entities are found, an empty list is returned.

    Example:
        >>> text = "يعيش محمد في القاهرة ويعمل في شركة جوجل."
        >>> recognize_arabic_entities(text, tool="camel")
        [('محمد', 'PERS'), ('القاهرة', 'LOC'), ('جوجل', 'ORG')]
    """

    if tool == 'camel':
        ner = NERecognizer.pretrained()
        words = simple_word_tokenize(text)  # Tokenize once; reused for labeling and aggregation
        labels = ner.predict_sentence(words)

        print("Raw labels: ", labels)

        entities = []
        current_entity = []
        current_label = None

        for word, label in zip(words, labels):
            if label.startswith('B-'):
                # A new entity starts: close the one in progress, if any
                if current_entity:
                    entities.append((' '.join(current_entity), current_label))
                    current_entity = []
                current_entity.append(word)
                current_label = label[2:]
            elif label.startswith('I-') and current_entity:
                # Continuation of the entity in progress
                current_entity.append(word)
            else:
                # 'O' label (or a stray 'I-' tag): close the entity in progress, if any
                if current_entity:
                    entities.append((' '.join(current_entity), current_label))
                    current_entity = []
                    current_label = None

        # Flush an entity that runs to the end of the sentence
        if current_entity:
            entities.append((' '.join(current_entity), current_label))

        return entities
    else:
        raise NotImplementedError(f"NER tool {tool!r} is not implemented; use tool='camel'.")



Example

In [115]:
text = "يعيش محمد في القاهرة ويعمل في شركة جوجل."

entities = recognize_arabic_entities(text, tool="camel")

print("Text:", text)
print("Recognized Entities:")
for entity, entity_type in entities:
    print(f"{entity} -> {entity_type}")


Raw labels:  ['O', 'B-PERS', 'O', 'B-LOC', 'O', 'O', 'O', 'B-ORG', 'O']
Text: يعيش محمد في القاهرة ويعمل في شركة جوجل.
Recognized Entities:
محمد -> PERS
القاهرة -> LOC
جوجل -> ORG
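
The aggregation step can be exercised without loading the model by feeding it tokens and labels directly. The sketch below reuses the raw labels printed above; `group_bio_entities` is a hypothetical standalone helper, and unlike the loop in `recognize_arabic_entities` it also closes the current entity when an `I-` tag has a different type.

```python
def group_bio_entities(tokens: list, labels: list) -> list:
    """Merge B-/I- labeled tokens into (entity_text, entity_type) tuples."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Store the entity in progress (if any) and reset the state
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append((' '.join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, label in zip(tokens, labels):
        if label.startswith('B-'):
            flush()
            current_tokens, current_type = [token], label[2:]
        elif label.startswith('I-') and current_type == label[2:]:
            current_tokens.append(token)
        else:
            # 'O', or an I- tag that does not continue the open entity
            flush()
    flush()
    return entities

tokens = ['يعيش', 'محمد', 'في', 'القاهرة', 'ويعمل', 'في', 'شركة', 'جوجل', '.']
labels = ['O', 'B-PERS', 'O', 'B-LOC', 'O', 'O', 'O', 'B-ORG', 'O']
print(group_bio_entities(tokens, labels))
```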


### 2.15 Morphological Analysis

> <span style="color: green">**_Morphological Analysis:_**</span> is the process of generating all possible readings (analyses) of a given word out of context. All analyses are generated from the undiacritized form of the input word. Each of these analyses is defined by a set of lexical and morphological features.

In [52]:
def arabic_morph_analysis(text: str):
    """
    Perform morphological analysis on an Arabic word or phrase.

    This function generates all possible morphological readings (analyses) for the input text 
    out of context. It loads the built-in morphological database (designed primarily for Modern 
    Standard Arabic) and uses it to analyze the given word or phrase. Each analysis typically 
    includes the diacritized form ('diac'), the lemma ('lex'), the root, the part-of-speech ('pos'),
    and other morphological features.

    Parameters:
        text (str): An Arabic word or phrase to be analyzed. Note that the analysis is performed 
                    out-of-context, so the output represents all possible morphological interpretations.

    Returns:
        list: A list of analysis results. Each element in the list is a dictionary 
              containing morphological details (e.g., 'diac', 'lex', 'root', 'pos', 'gloss').

    Examples:
        >>> # Analyze the verb "يذهب"
        >>> analyses = arabic_morph_analysis("يذهب")
        >>> analyses[0]['pos']
        'verb'
        >>> analyses[0]['root']
        'ذ.ه.ب'
        >>> analyses[0]['gloss']
        'he;it+gild'
    """

    db = MorphologyDB.builtin_db()

    analyzer = Analyzer(db)

    analyses = analyzer.analyze(text)
    
    return analyses

Example

In [53]:
example_words = [
    "يذهب",
    "كتبت",
    "مكتوب"
]

for word in example_words:
    print(f"\nAnalyzing the word: {word}")
    results = arabic_morph_analysis(word)
    if results:
        for idx, analysis in enumerate(results, 1):
            print(f"Analysis {idx}: {analysis}")
    else:
        print("No analyses found.")


Analyzing the word: يذهب
Analysis 1: {'diac': 'يُذَهِّب', 'lex': 'ذَهَّب', 'bw': 'يُ/IV3MS+ذَهِّب/IV', 'gloss': 'he;it+gild', 'pos': 'verb', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': '3', 'asp': 'i', 'vox': 'a', 'mod': 'u', 'stt': 'na', 'cas': 'na', 'enc0': '0', 'rat': 'n', 'source': 'lex', 'form_gen': 'm', 'form_num': 's', 'd3seg': 'يُذَهِّب', 'caphi': 'y_u_dh_a_h_h_i_b', 'd1tok': 'يُذَهِّب', 'd2tok': 'يُذَهِّب', 'pos_logprob': -1.023208, 'd3tok': 'يُذَهِّب', 'd2seg': 'يُذَهِّب', 'pos_lex_logprob': -99.0, 'num': 's', 'ud': 'VERB', 'gen': 'm', 'catib6': 'VRB', 'root': 'ذ.ه.ب', 'bwtok': 'يُ+_ذَهِّب', 'pattern': 'يُ1َ2ِّ3', 'lex_logprob': -99.0, 'atbtok': 'يُذَهِّب', 'atbseg': 'يُذَهِّب', 'd1seg': 'يُذَهِّب', 'stem': 'ذَهِّب', 'stemgloss': 'gild', 'stemcat': 'IV_yu'}
Analysis 2: {'diac': 'يُذْهِب', 'lex': 'أَذْهَب', 'bw': 'يُ/IV3MS+ذْهِب/IV', 'gloss': 'he;it+remove;eliminate', 'pos': 'verb', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': '0', 'per': '3', 'asp': 'i', 'vo

### 2.16 Word Segmentation

> <span style="color: green">**_Word Segmentation:_**</span> is the process of segmenting concatenated Arabic text into a properly spaced sentence.

In [54]:
def arabic_word_segmentation(text: str) -> str:
    """
    Segment concatenated Arabic text into a properly spaced sentence.

    This function is a placeholder and is not implemented yet; it currently returns None.
    The intended approach is a dictionary-based, greedy maximum-matching algorithm that
    matches the longest possible valid words from the beginning of the string and inserts
    spaces at the resulting word boundaries.

    Parameters:
        text (str): A concatenated Arabic string that requires segmentation.

    Returns:
        str: The input text segmented into individual words separated by a single space.

    Example:
        >>> text = "وقالمصدرإنهناكتحسنافيالوضع"
        >>> arabic_word_segmentation(text)
        "وق المصدر إنه نا كت حس نا في الوضع"
        # (Actual segmentation may vary based on the dictionary and algorithm.)
    """
    pass

Example

In [55]:
text = "وقالمصدرإنهناكتحسنافيالوضع"
segmented_text = tokenize_arabic(text)

print("Original:", text)
print("Segmented:", segmented_text)

Original: وقالمصدرإنهناكتحسنافيالوضع
Segmented: ['وقالمصدرإنهناكتحسنافيالوضع']


> <span style="color: red">**_TODO:_**</span> Look into how to implement this method (possibly with a greedy maximum-matching approach). The example above calls `tokenize_arabic`, which leaves the concatenated string unsplit.
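
A greedy maximum-matching segmenter could look roughly like the sketch below. It is a minimal illustration under stated assumptions: `max_match_segment` and `toy_vocabulary` are hypothetical, and a real implementation would need a large Arabic word list and a strategy for ambiguous splits.

```python
def max_match_segment(text: str, vocabulary: set) -> list:
    """Greedy forward maximum matching against a word list."""
    max_len = max((len(w) for w in vocabulary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a known word is found
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocabulary:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No dictionary word starts here: emit one character and move on
            tokens.append(text[i])
            i += 1
    return tokens

toy_vocabulary = {"في", "ال", "بيت", "البيت"}  # hypothetical word list
print(' '.join(max_match_segment("فيالبيت", toy_vocabulary)))
```

Greedy matching commits to the longest word at each position, so an early wrong match cannot be undone; the docstring's sample output ("وق المصدر ...") shows the kind of error this produces.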

### 2.17 Part-of-speech tagging (POS tagging)

> <span style="color: green">**_Part-of-speech tagging:_**</span> is the process of labeling each token in a sentence with its grammatical category (noun, verb, preposition, etc.).

In [56]:
def arabic_pos_tagging(text: str) -> list:
    """
    Perform part-of-speech (POS) tagging on an Arabic sentence.

    This function uses a pre-trained Maximum Likelihood Estimation (MLE) disambiguator along with 
    a default POS tagger to assign part-of-speech tags to each token in the input Arabic text. 
    The process involves:
      1. Tokenizing the input sentence using a simple word tokenizer.
      2. Using the MLE disambiguator to resolve morphological ambiguities.
      3. Tagging each token with its corresponding POS tag according to the model's tagging scheme.

    Parameters:
        text (str): An Arabic sentence that needs POS tagging. The sentence should be in standard 
                    Arabic script and is expected to be a complete sentence.

    Returns:
        list: A list of POS tags, one per token of the input sentence, in token order. 
              For example:
              ['noun', 'verb', 'prep', 'noun']

    Example:
        >>> text = "الطلاب يذهبون إلى المدرسة"
        >>> arabic_pos_tagging(text)
        ['noun', 'verb', 'prep', 'noun']
    """
    mle = MLEDisambiguator.pretrained()
    tagger = DefaultTagger(mle, 'pos')
    
    sentence = simple_word_tokenize(text)
    
    pos_tags = tagger.tag(sentence)
    
    return pos_tags

Example

In [57]:
example_sentences = [
    ("الطلاب يذهبون إلى المدرسة", "Basic sentence with a noun, verb, preposition, and noun."),
    ("الرئيس يتحدث بوضوح", "Sentence with a noun, verb, and adverb."),
    ("", "Empty string should return an empty list.")
]

for idx, (input_text, description) in enumerate(example_sentences, 1):
    print(f"Example {idx}: {description}")
    print("Input:", input_text)
    result = arabic_pos_tagging(input_text)
    print("Output:", result)
    print("-" * 50)

Example 1: Basic sentence with a noun, verb, preposition, and noun.
Input: الطلاب يذهبون إلى المدرسة
Output: ['noun', 'verb', 'prep', 'noun']
--------------------------------------------------
Example 2: Sentence with a noun, verb, and adverb.
Input: الرئيس يتحدث بوضوح
Output: ['noun', 'verb', 'noun']
--------------------------------------------------
Example 3: Empty string should return an empty list.
Input: 
Output: []
--------------------------------------------------
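
Since the tagger returns only the tags, pairing them with their tokens takes one extra step. The sketch below assumes the tag list aligns one-to-one with the tokens; `pair_tokens_with_tags` is a hypothetical helper, and `str.split` stands in for `simple_word_tokenize` so the snippet runs without CAMeL Tools.

```python
def pair_tokens_with_tags(tokens: list, tags: list) -> list:
    """Zip tokens with their POS tags, rejecting misaligned input."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return list(zip(tokens, tags))

tokens = "الطلاب يذهبون إلى المدرسة".split()
tags = ['noun', 'verb', 'prep', 'noun']  # tags copied from the recorded output of Example 1
print(pair_tokens_with_tags(tokens, tags))
```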


### 2.18 Disambiguation

> <span style="color: green">**_Disambiguation:_**</span> is the process of determining what is the most likely analysis of a word in a given context. Disambiguation is the backbone for many Arabic NLP tasks such as diacritization, POS tagging and morphological tokenization.

In [58]:
def arabic_disambiguation(text: str) -> tuple:
    """
    Perform morphological disambiguation on an Arabic sentence.

    This function determines the most likely morphological analysis for each word
    in the input text. It uses a pretrained Maximum Likelihood Estimation (MLE)
    disambiguator.
    
    For each word, the disambiguator produces a list of possible analyses sorted from 
    most likely to least likely. The function extracts the following from the top analysis:
        - The diacritized form ('diac')
        - The part-of-speech tag ('pos')
        - The lemma or lexical form ('lex')
    
    In cases where a word does not receive any analysis (i.e. the analyses list is empty),
    a default value is returned for that token (an empty string for 'diac' and 'lex', and "O"
    for the POS tag).

    Parameters:
        text (str): An Arabic sentence to be disambiguated.

    Returns:
        tuple: Three lists containing the diacritized forms, part-of-speech tags, 
               and lemmas for each word in the sentence.
    
    Example:
        >>> text = "ذهب الرجل إلى البنك"
        >>> diacritized, pos_tags, lemmas = arabic_disambiguation(text)
        >>> print(lemmas)
        ['ذَهَب', 'رَجُل', 'إِلَى', 'بَنْك']
    
    Note:
        Some words may not receive any analysis. In such cases, this function returns default
        values ("" for diacritized/lemma and "O" for POS) for those words.
    """
    mle = MLEDisambiguator.pretrained()
    disambig = mle.disambiguate(text.split())
    
    diacritized = [d.analyses[0].analysis['diac'] if d.analyses else "" for d in disambig]
    pos_tags    = [d.analyses[0].analysis['pos']  if d.analyses else "O" for d in disambig]
    lemmas      = [d.analyses[0].analysis['lex']  if d.analyses else "" for d in disambig]

    return diacritized, pos_tags, lemmas

Example

In [59]:
example_sentences = [
    ("ذهب الرجل إلى البنك", "A sentence with a clear verb, noun, preposition, and noun."),
    ("كتب الطالب الدرس", "A sentence with a verb, noun, and noun."),
    ("", "Empty string should return empty lists.")
]

for idx, (sentence, description) in enumerate(example_sentences, 1):
    print(f"\nExample {idx}: {description}")
    print("Input:", sentence)
    try:
        result = arabic_disambiguation(sentence)
        if isinstance(result, tuple):
            diacritized, pos_tags, lemmas = result
            print("Diacritized:", diacritized)
            print("POS tags:   ", pos_tags)
            print("Lemmas:     ", lemmas)
    except Exception as e:
        print("Error during disambiguation:", e)


Example 1: A sentence with a clear verb, noun, preposition, and noun.
Input: ذهب الرجل إلى البنك
Diacritized: ['ذَهَبَ', 'الرَجُلَ', 'إِلَى', 'البَنْكِ']
POS tags:    ['verb', 'noun', 'prep', 'noun']
Lemmas:      ['ذَهَب', 'رَجُل', 'إِلَى', 'بَنْك']

Example 2: A sentence with a verb, noun, and noun.
Input: كتب الطالب الدرس
Diacritized: ['كَتَبَ', 'الطالِبُ', 'الدَرْسِ']
POS tags:    ['verb', 'noun', 'noun']
Lemmas:      ['كَتَب', 'طالِب', 'دَرْس']

Example 3: Empty string should return empty lists.
Input: 
Diacritized: []
POS tags:    []
Lemmas:      []
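
`arabic_disambiguation`, `arabic_pos_tagging`, and `tokenize_arabic` each call `MLEDisambiguator.pretrained(...)` on every invocation, which reloads the model each time. One way to avoid that is to cache the loader, as sketched below. `get_disambiguator` is a hypothetical helper, and a placeholder object replaces the real model so the snippet runs without CAMeL Tools.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def get_disambiguator(model_name: str = "calima-msa-r13"):
    """Load a disambiguator once per model name and reuse it afterwards."""
    # In the notebook this line would be MLEDisambiguator.pretrained(model_name);
    # a stand-in object keeps the sketch runnable without CAMeL Tools installed.
    return {"model": model_name}  # placeholder for the loaded model

first = get_disambiguator()
second = get_disambiguator()
print(first is second)  # True: the loader body ran only once
```

Each function would then call `get_disambiguator()` instead of loading the model itself.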


### 2.19 Elongated Words

> <span style="color: green">**_Elongated Words:_**</span> are words whose letters are repeated for emphasis; normalization reduces each run of repeated characters to two occurrences.

In [60]:
def normalize_elongated_words(text: str) -> str:
    """
    Normalize elongated words in Arabic text by reducing sequences of repeated characters.

    This function replaces any sequence of a character repeated more than once with exactly two 
    consecutive occurrences of that character. This helps in standardizing words that are often 
    elongated in informal text (e.g., social media or SMS) to express emphasis. For instance, 
    "جميللللل" would be normalized to "جميلل".

    Parameters:
        text (str): The input sentence or word that may contain elongated characters.

    Returns:
        str: The normalized text where any sequence of repeated characters is reduced to two occurrences.

    Examples:
        >>> normalize_elongated_words("هههههههه")
        'هه'
        >>> normalize_elongated_words("كبيييير")
        'كبيير'
        >>> normalize_elongated_words("مرررحباا")
        'مررحباا'
    """

    text = re.sub(r'(.)\1+', r'\1\1', text)
    return text

Example

In [61]:
elongated_text = "يااااا سلاااام على هذا البرنااامج الراااائع"
normalized_text = normalize_elongated_words(elongated_text)
print("Elongated:", elongated_text)
print("Normalized:", normalized_text)

Elongated: يااااا سلاااام على هذا البرنااامج الراااائع
Normalized: ياا سلاام على هذا البرناامج الراائع
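
The function above always keeps two copies of a repeated character. Whether one or two copies is the better target depends on the downstream step, so the sketch below makes the limit a parameter. `collapse_repeats` and `max_repeat` are illustrative names and not part of the notebook.

```python
import re

def collapse_repeats(text: str, max_repeat: int = 2) -> str:
    """Reduce any run of the same character to at most `max_repeat` copies."""
    if max_repeat < 1:
        raise ValueError("max_repeat must be at least 1")
    # (.)\1{n,} matches a character followed by n or more copies of itself,
    # i.e. a run that is strictly longer than max_repeat.
    pattern = re.compile(r'(.)\1{%d,}' % max_repeat)
    return pattern.sub(lambda m: m.group(1) * max_repeat, text)

print(collapse_repeats("يااااا سلاااام", max_repeat=2))
print(collapse_repeats("يااااا سلاااام", max_repeat=1))
# Collapsing to a single copy also shortens legitimate double letters:
print(collapse_repeats("الله", max_repeat=1))
```

Keeping two copies preserves words that legitimately contain a doubled letter, at the cost of leaving some elongation in place. Note that `.` also matches spaces, so runs of spaces are collapsed as well.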


### 2.20 Data Translation

> <span style="color: green">**_Data Translation:_**</span> the process of replacing each Arabic word in the text with one of its English translations.

In [None]:
def translate_arabic_word(text: str) -> list:
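    '''
    Look up each Arabic word in the text and pair it with one of its English glosses.

    A random analysis is chosen for each word, so repeated calls can return different glosses.
    Words with no analysis are skipped.

    :param text: a string to be processed
    
    :return: a list of (word, gloss) tuples
    '''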
    db = MorphologyDB.builtin_db()
    analyzer = Analyzer(db)

    words = text.split()
    translated_words = []
    
    for word in words:
        analysis = analyzer.analyze(word)
        if analysis:
            # Pick a random translation
            print(len(analysis))
            print(analysis)
            word_translation = random.choice(analysis)['stemgloss'].split(";")[0]
            translated_words.append((word, word_translation))

    return translated_words

Example

In [None]:
# Example usage
original_text = "الكتاب مفيد للقراءة"
translated_text = translate_arabic_word(original_text)

print("Original:", original_text)
print("Translated data:")
for i, (original, translated) in enumerate(translated_text, 1):
    print(f"{i}. original: {original}, translated: {translated}")

12
[{'diac': 'الكُتّاب', 'lex': 'كُتّاب', 'bw': 'ال/DET+كُتّاب/NOUN', 'gloss': 'the+kuttab_(village_school);Quran_school', 'pos': 'noun', 'prc3': '0', 'prc2': '0', 'prc1': '0', 'prc0': 'Al_det', 'per': 'na', 'asp': 'na', 'vox': 'na', 'mod': 'na', 'stt': 'd', 'cas': 'u', 'enc0': '0', 'rat': 'i', 'source': 'lex', 'form_gen': 'm', 'form_num': 's', 'd3seg': 'ال+_كُتّاب', 'caphi': '2_a_l_k_u_t_t_aa_b', 'd1tok': 'الكُتّاب', 'd2tok': 'الكُتّاب', 'pos_logprob': -0.4344233, 'd3tok': 'ال+_كُتّاب', 'd2seg': 'الكُتّاب', 'pos_lex_logprob': -99.0, 'num': 's', 'ud': 'NOUN', 'gen': 'm', 'catib6': 'NOM', 'root': 'ك.ت.ب', 'bwtok': 'ال+_كُتّاب', 'pattern': 'ال1ُ2ّا3', 'lex_logprob': -99.0, 'atbtok': 'الكُتّاب', 'atbseg': 'الكُتّاب', 'd1seg': 'الكُتّاب', 'stem': 'كُتّاب', 'stemgloss': 'kuttab_(village_school);Quran_school', 'stemcat': 'N'}, {'diac': 'الكُتّابَ', 'lex': 'كُتّاب', 'bw': 'ال/DET+كُتّاب/NOUN+َ/CASE_DEF_ACC', 'gloss': 'the+kuttab_(village_school);Quran_school+[def.acc.]', 'pos': 'noun', 'prc3'

> <span style="color: yellow">**_NOTE:_**</span> The meaning of the text could be altered depending on the tashkeel added by the model.
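
Because `translate_arabic_word` picks an analysis with `random.choice`, the gloss can change from run to run. One possible alternative, sketched here, is to pick the analysis with the highest score. The sketch works on plain dictionaries shaped like the analyzer output above; `pick_gloss` is a hypothetical helper, and using `pos_lex_logprob` as the ranking signal is an assumption that has not been checked against real data.

```python
def pick_gloss(analyses: list, score_key: str = "pos_lex_logprob") -> str:
    """Return the first gloss of the highest-scoring analysis, or '' if none."""
    if not analyses:
        return ""
    # Analyses that lack the score are ranked last.
    best = max(analyses, key=lambda a: a.get(score_key, float("-inf")))
    return best.get("stemgloss", "").split(";")[0]

# Toy analyses: the first entry copies two fields from the output above,
# the second entry is invented for illustration only.
toy_analyses = [
    {"stemgloss": "kuttab_(village_school);Quran_school", "pos_lex_logprob": -99.0},
    {"stemgloss": "book", "pos_lex_logprob": -3.0},
]
print(pick_gloss(toy_analyses))
```

This makes the result reproducible, but it does not remove the ambiguity described in the note above: the analyzer still sees each word in isolation.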

### 2.21 Generation

> <span style="color: green">**_Generation:_**</span> is the process of inflecting a lemma for a set of morphological features.

In [None]:
def arabic_word_generation(word: str, pos: str = 'noun', gen: str = 'm', num: str = 'p'):
    """
    Inflect an Arabic lemma into its fully diacritized form(s) based on specified morphological features.

    This function generates all possible inflected (diacritized) forms for a given Arabic lemma
    by applying a set of morphological features. It leverages a built-in morphological database (with
    generation flags enabled) and a morphological generator to produce analyses that include details 
    such as the diacritized form ('diac'), part-of-speech, and other features.

    Parameters:
        word (str): The lemma (base form) of the Arabic word to be inflected.
        pos (str, optional): The part-of-speech tag for the word. For example, 'noun' or 'verb'. 
                             Default is 'noun'.
        gen (str, optional): The grammatical gender to be applied. For example, 'm' for masculine or 
                             'f' for feminine. Default is 'm' (masculine).
        num (str, optional): The number specification, such as 's' for singular or 'p' for plural. 
                             Default is 'p' (plural).

    Returns:
        set: A set of unique diacritized forms (strings) produced by the morphological generator.
             Each element represents a possible inflection for the input word given the features.

    Example:
        >>> # Inflect the noun "كتاب" (book) as a masculine plural.
        >>> forms = arabic_word_generation("كتاب", pos="noun", gen="m", num="p")
        >>> print(forms)
        {'كُتُب', 'كُتُبٌ'}  # (Intended output; the generator currently returns an empty set.)
        
        >>> # Inflect the adjective "جديد" (new) as a masculine singular.
        >>> forms = arabic_word_generation("جديد", pos="adj", gen="m", num="s")
        >>> print(forms)
        {'جَدِيد', 'جَدِيدٌ'}  # (Intended output; the generator currently returns an empty set.)

    Note:
        The actual output depends on the underlying morphological database and generator.
        Ensure that the necessary classes (e.g., MorphologyDB and Generator) are imported and available.
    """
    db = MorphologyDB.builtin_db(flags='g')
    
    generator = Generator(db)
    
    lemma = arabic_lemmatization(word)
    
    features = {
        'pos': pos,
        'gen': gen,
        'num': num
    }
    
    analyses = generator.generate(lemma, features)
    
    return set([a['diac'] for a in analyses])


> <span style="color: yellow">**_Note:_**</span> `'pos'` is the only *required* feature that needs to be specified.

Example

In [None]:
# Example 1: Inflecting a noun (book) as masculine plural.
word1 = "كتاب"
print("Example 1 - Noun (كتاب) as masculine plural:")
forms1 = arabic_word_generation(word1, pos="noun", gen="m", num="p")
print("Input word:", word1)
print("Generated forms:", forms1)
print("-" * 50)

# Example 2: Inflecting an adjective (new) as masculine singular.
word2 = "جديد"
print("Example 2 - Adjective (جديد) as masculine singular:")
forms2 = arabic_word_generation(word2, pos="adj", gen="m", num="s")
print("Input word:", word2)
print("Generated forms:", forms2)
print("-" * 50)

# Example 3: Inflecting a noun with feminine features.
word3 = "معلمة"  # base form might be provided without diacritics
print("Example 3 - Noun (معلمة) as feminine singular:")
forms3 = arabic_word_generation(word3, pos="noun", gen="f", num="s")
print("Input word:", word3)
print("Generated forms:", forms3)
print("-" * 50)

Example 1 - Noun (كتاب) as masculine plural:
Input word: كتاب
Generated forms: set()
--------------------------------------------------
Example 2 - Adjective (جديد) as masculine singular:
Input word: جديد
Generated forms: set()
--------------------------------------------------
Example 3 - Noun (معلمة) as feminine singular:
Input word: معلمة
Generated forms: set()
--------------------------------------------------


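> <span style="color: red">**_TODO:_**</span> determine why the Generator returns an empty set for every example above. One thing to check (an unverified assumption) is whether `arabic_lemmatization` returns a single diacritized lemma string that matches a `lex` entry in the database, since the generator looks the lemma up as given.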

### 2.22 Reinflection

> <span style="color: green">**_Reinflection:_**</span> is the process of converting a given word in any form to a different form (i.e. tense, gender, etc). The CAMeL Tools reinflector works similarly to the generator, except that the word doesn't have to be a lemma and doesn't have to be restricted to a specific `'pos'`.

In [None]:
def arabic_reinflection(word: str, num: str = 'd', prc1: str = 'bi_prep') -> set:
    """
    Generate reinflected forms of an Arabic word based on specified morphological features.

    This function takes an input word in any form (it does not have to be a lemma) and applies
    reinflection to produce alternative forms. Reinflection is the process of converting a word into
    different forms (e.g., adjusting tense, gender, number, or attaching prefixes) as dictated by the
    desired morphological features.

    The function uses a built-in morphological database loaded with the 'r' (reinflection)
    flag, along with a reinflector object, to produce analyses of the word. It then extracts the
    diacritized form ('diac') from each analysis and returns a set of unique reinflected forms.

    Parameters:
        word (str): The Arabic word to be reinflected, in any form.
        num (str, optional): The grammatical number to apply: 's' for singular, 'd' for dual,
                             or 'p' for plural. Default is 'd' (dual).
        prc1 (str, optional): The proclitic (prefix) to attach to the word. The default value
                              'bi_prep' attaches the preposition "بـ". Adjust these
                              features based on your reinflection requirements.

    Returns:
        set: A set of unique diacritized forms (strings) that represent the reinflected variants of the word,
             according to the specified morphological features.

    Examples:
        >>> # Reinflect the word "كتب" with default features (dual number, "بـ" proclitic).
        >>> forms = arabic_reinflection("كتب")
        >>> print(forms)
        {'بِكِتابَيْنِ', 'بِكِتابَيْ'}  # (Set order may vary.)
        
        >>> # Reinflect the word "درس" specifying singular number and a proclitic for "بـ"
        >>> forms = arabic_reinflection("درس", num="s", prc1="bi_prep")
        >>> print(forms)
        {'بِدَرْس', 'بِدَرْسِ', 'بِدَرْسٍ'}  # (Set order may vary.)

    Note:
        The actual output is contingent on the underlying morphological database and reinflection rules
        provided by the library. Make sure the classes MorphologyDB and Reinflector are correctly imported and
        available in your environment.
    """
    db = MorphologyDB.builtin_db(flags='r')

    reinflector = Reinflector(db)

    features = {
        'num': num,
        'prc1': prc1
    }

    analyses = reinflector.reinflect(word, features)

    return set(a['diac'] for a in analyses)

Example

In [None]:
examples = [
    ("كتب", "d", "bi_prep", "Default reinflection for the word 'كتب'."),
    ("درس", "s", "bi_prep", "Reinflection for 'درس' with singular number and 'بـ' prefix."),
    ("شرب", "p", "bi_prep", "Reinflection for 'شرب' with plural number and 'بـ' prefix."),
]

for idx, (word, num, prc1, desc) in enumerate(examples, 1):
    print(f"Test Case {idx}: {desc}")
    forms = arabic_reinflection(word, num, prc1)
    print("Input word:", word)
    print("Reinflected forms:", forms)
    print("-" * 50)


Test Case 1: Default reinflection for the word 'كتب'.
Input word: كتب
Reinflected forms: {'بِكِتابَيْنِ', 'بِكِتابَيْ'}
--------------------------------------------------
Test Case 2: Reinflection for 'درس' with singular number and 'بـ' prefix.
Input word: درس
Reinflected forms: {'بِدَرْس', 'بِدَرْسِ', 'بِدَرْسٍ'}
--------------------------------------------------
Test Case 3: Reinflection for 'شرب' with plural number and 'بـ' prefix.
Input word: شرب
Reinflected forms: set()
--------------------------------------------------


### 2.23 Morphological Tokenization

> <span style="color: green">**_Morphological Tokenization:_**</span> is a type of tokenization whereby Arabic words are split into component prefixes, stems, and suffixes.

The `MorphologicalTokenizer` class is used to tokenize words in different schemes. It behaves very much like the `DefaultTagger` (used previously) in that it first uses a disambiguator to disambiguate words and then extracts a particular tokenization feature, but it differs in two ways:

- While the `DefaultTagger` produces exactly one output for each input word, the `MorphologicalTokenizer` might produce multiple output tokens.
- The `MorphologicalTokenizer` can be configured to produce either diacritized or undiacritized output.

In [None]:
def arabic_morphological_tokenization(text: str) -> list:
    """
    Perform morphological tokenization on an Arabic sentence.

    This function first tokenizes the input sentence into words using a simple Arabic 
    word tokenizer. Then, it loads a pretrained morphological disambiguator (using the 
    'calima-msa-r13' model) and applies a morphological tokenizer to generate detailed 
    morphological tokens. The tokenizer is configured with:
      - scheme='d3tok': specifying a particular morphological tokenization scheme.
      - split=True: to output each morphological token as a separate string.
      - diac=True: to output the tokens with diacritics.
    
    The result is a list of tokens that represent the morphological breakdown of the 
    input text. Note that the exact output depends on the model and its configuration.

    Parameters:
        text (str): An Arabic sentence to be morphologically tokenized.
    
    Returns:
        list: A list of morphological tokens (as strings). Each token represents a segment 
              of the input word based on its morphological structure, potentially including diacritics.
    
    Example:
        >>> text = "الطلابُ يدرسونَ في الجامعةِ"
        >>> tokens = arabic_morphological_tokenization(text)
        >>> print(tokens)
        ['ال+', 'طُلّابِ', 'يَدْرُسُونَ', 'فِي', 'ال+', 'جامِعَةِ']
        # (Note: The actual segmentation may vary depending on the disambiguator and tokenizer configuration.)
    """
    # Tokenize the sentence into words using a simple tokenizer.
    words = simple_word_tokenize(text)

    # Load a pretrained morphological disambiguator (using the calima-msa-r13 model).
    mle = MLEDisambiguator.pretrained('calima-msa-r13')

    # Initialize the morphological tokenizer with specified configuration:
    # - scheme: 'd3tok' to determine the tokenization scheme.
    # - split: True to split the output into individual tokens.
    # - diac: True to include diacritized forms in the output.
    tokenizer = MorphologicalTokenizer(mle, scheme='d3tok', split=True, diac=True)
    
    # Perform morphological tokenization on the pre-tokenized words.
    tokens = tokenizer.tokenize(words)

    return tokens

Example

In [None]:
examples = [
    # (input text, description)
    ("الطلابُ يدرسونَ في الجامعةِ", "A basic sentence with common morphological structure."),
    ("كتب الطالب الدرس بسرعة", "A sentence with a verb, noun, and adverb."),
    ("", "Empty string should return an empty list.")
]

for idx, (input_text, description) in enumerate(examples, 1):
    print(f"Test Case {idx}: {description}")
    print("Input:", input_text)
    tokens = arabic_morphological_tokenization(input_text)
    print("Output tokens:", tokens)
    print("-" * 50)

Test Case 1: A basic sentence with common morphological structure.
Input: الطلابُ يدرسونَ في الجامعةِ
Output tokens: ['ال+', 'طُلّابِ', 'يَدْرُسُونَ', 'فِي', 'ال+', 'جامِعَةِ']
--------------------------------------------------
Test Case 2: A sentence with a verb, noun, and adverb.
Input: كتب الطالب الدرس بسرعة
Output tokens: ['كَتَبَ', 'ال+', 'طالِبُ', 'ال+', 'دَرْسِ', 'بِ+', 'سُرْعَةٍ']
--------------------------------------------------
Test Case 3: Empty string should return an empty list.
Input: 
Output tokens: []
--------------------------------------------------
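
With `split=True` the tokenizer returns one flat list, so the link between tokens and the original words is lost. The sketch below regroups the tokens by word. It assumes, based on the output above, that proclitic tokens end with `+`; the rule that enclitic tokens start with `+` is an additional assumption. `group_tokens_by_word` is a hypothetical helper.

```python
def group_tokens_by_word(tokens: list) -> list:
    """Group a flat list of morphological tokens into one list per word."""
    words, current = [], []
    for tok in tokens:
        if tok.startswith('+') and words and not current:
            # Enclitic: attach it to the word that was just completed.
            words[-1].append(tok)
            continue
        current.append(tok)
        if not tok.endswith('+'):
            # A token without a trailing '+' is the stem, which completes the word.
            words.append(current)
            current = []
    if current:
        # Dangling proclitics with no stem are kept as their own group.
        words.append(current)
    return words

tokens = ['كَتَبَ', 'ال+', 'طالِبُ', 'ال+', 'دَرْسِ', 'بِ+', 'سُرْعَةٍ']
for group in group_tokens_by_word(tokens):
    print(group)
```

For the second test case above this yields four groups, one per word of the four-word input sentence.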


### 2.24 Finding Synonyms (not working)

In [None]:
def finding_synonyms(text: str, num_augmentations: int = 1, tool: str = 'sina') -> list:
    '''
    Replaces each word in the text with one of its synonyms using the specified tool.
    
    Parameters:
      text (str): The input text to process.
      num_augmentations (int): Number of augmented outputs to generate.
      tool (str): The augmentation tool to use ('sina' uses the evaluate_synonyms method).
    
    Returns:
      list: A list of augmented text strings.
    '''
    # If using another tool, e.g. morphology analyzer, initialize it accordingly.
    # db = MorphologyDB.builtin_db()
    # analyzer = Analyzer(db)
    
    words = text.split()
    augmented_texts = []

    for _ in range(num_augmentations):
        new_words = []
        for word in words:
            synonyms_result = evaluate_synonyms(word, 3)
            if synonyms_result:
                synonym_candidates = [syn[0] for syn in synonyms_result if syn and syn[0] != word]
                if synonym_candidates:
                    new_word = random.choice(synonym_candidates)
                else:
                    new_word = word
            else:
                new_word = word
            new_words.append(new_word)
        augmented_texts.append(' '.join(new_words))
    
    return augmented_texts

> <span style="color: red">**_TODO:_**</span> find a working way to replace words with synonyms; the function above does not work yet.
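
Until a working synonym source is found, the replacement logic can be tested on its own by passing the lookup in as a function. The sketch below does that with a toy dictionary. `replace_with_synonyms`, `toy_lookup`, and the dictionary entries are all hypothetical and only stand in for a real tool.

```python
import random

def replace_with_synonyms(text: str, lookup, rng=None) -> str:
    """Replace each word with a random synonym from `lookup(word)`, if any."""
    rng = rng or random.Random()
    new_words = []
    for word in text.split():
        # Drop empty candidates and the word itself.
        candidates = [s for s in lookup(word) if s and s != word]
        new_words.append(rng.choice(candidates) if candidates else word)
    return " ".join(new_words)

# Toy synonym table, for illustration only.
toy_synonyms = {"كبير": ["ضخم", "عظيم"]}

def toy_lookup(word: str) -> list:
    return toy_synonyms.get(word, [])

print(replace_with_synonyms("بيت كبير", toy_lookup, random.Random(0)))
```

Words without candidates are left unchanged, which matches the fallback behaviour of `finding_synonyms`.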

### 2.25 Data Augmentation

> <span style="color: red">**_TODO:_**</span> find a way to do data augmentation.

Links:
1. https://medium.com/@Mustafa77/data-augmentation-using-transformers-and-similarity-measures-2812c4853ed3

## 3. Handling Outliers

> <span style="color: red">**_TODO:_**</span> Translate Arabic to English and perform natural language processing.

### 3.1 Handling Very Common Word Removal

In [None]:
def handling_common_words(df: pd.DataFrame, mode='remove'):
    pass

Example

### 3.2 Handling Very Rare Word Removal

In [None]:
def handling_rare_words(df: pd.DataFrame, mode='remove'):
    pass

Example
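
The stubs in 3.1 and 3.2 are still empty. One possible approach to both is a frequency filter, sketched here on a plain list of sentences rather than a DataFrame. `filter_by_frequency`, `min_count`, and `max_ratio` are assumed names, and whitespace splitting is a simplification.

```python
from collections import Counter

def filter_by_frequency(sentences: list, min_count: int = 1, max_ratio: float = 1.0) -> list:
    """Drop tokens seen fewer than `min_count` times or making up more than
    `max_ratio` of all tokens in the corpus."""
    tokenized = [s.split() for s in sentences]
    counts = Counter(tok for toks in tokenized for tok in toks)
    total = sum(counts.values())

    def keep(tok: str) -> bool:
        return counts[tok] >= min_count and counts[tok] / total <= max_ratio

    return [" ".join(t for t in toks if keep(t)) for toks in tokenized]

corpus = ["في البيت", "في المدرسة", "في العمل"]
# "في" makes up half of all tokens, so a 0.4 ceiling removes it.
print(filter_by_frequency(corpus, max_ratio=0.4))
# Every other token occurs only once, so a floor of 2 removes them instead.
print(filter_by_frequency(corpus, min_count=2))
```

The thresholds are corpus dependent and would need tuning on the real data.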

### 3.3 Handling Numbers and Special Characters in Arabic Text

In [None]:
def handle_numbers_and_special_chars(text, mode='remove'):
    """
    Process Arabic text by either removing or normalizing numbers and special characters.

    This function handles numbers and special characters in an Arabic text in one of two ways:
      - 'remove': Eliminates every character that is neither whitespace nor inside the Arabic Unicode
                  block (\u0600-\u06FF). The block also contains Arabic-Indic digits and Arabic
                  punctuation (e.g. '،'), so those are currently kept.
      - 'normalize': Converts Arabic-Indic digits (٠١٢٣٤٥٦٧٨٩) to their corresponding Western digits (0-9)
                     while leaving other characters unchanged.

    Parameters:
        text (str): The input Arabic text, which may include numbers and special characters.
        mode (str, optional): The mode of processing, either 'remove' or 'normalize'. 
                              Default is 'remove'.

    Returns:
        str: The processed text after applying the specified operation.

    Examples:
        >>> handle_numbers_and_special_chars("اللغة ١٢٣ جميلة!", mode='remove')
        'اللغة ١٢٣ جميلة'
        >>> handle_numbers_and_special_chars("اللغة ١٢٣ جميلة!", mode='normalize')
        'اللغة 123 جميلة!'
    """
    if mode == 'remove':
        # Remove any character outside the Arabic Unicode block (whitespace is kept)
        return re.sub(r'[^\u0600-\u06FF\s]', '', text)
    elif mode == 'normalize':
        # Convert Arabic-Indic digits to Western (ASCII) digits
        number_map = {
            '٠': '0', '١': '1', '٢': '2', '٣': '3', '٤': '4',
            '٥': '5', '٦': '6', '٧': '7', '٨': '8', '٩': '9'
        }
        for arabic, western in number_map.items():
            text = text.replace(arabic, western)
        return text
    else:
        # Fail loudly instead of silently returning None for an unknown mode
        raise ValueError(f"Unknown mode: {mode!r} (expected 'remove' or 'normalize')")

Example

In [None]:
examples = [
    {
        "input": "اللغة ١٢٣ جميلة!",
        "mode": "remove",
        "expected": "اللغة جميلة"
    },
    {
        "input": "اللغة ١٢٣ جميلة!",
        "mode": "normalize",
        "expected": "اللغة 123 جميلة!"
    },
    {
        "input": "هذا نص مع رموز مثل #، و ٤٥٦!",
        "mode": "remove",
        "expected": "هذا نص مع رموز مثل  و "
    },
    {
        "input": "هذا نص مع رموز مثل #، و ٤٥٦!",
        "mode": "normalize",
        "expected": "هذا نص مع رموز مثل #، و 456!"
    },
    {
        "input": "",
        "mode": "remove",
        "expected": ""
    },
]

for idx, case in enumerate(examples, 1):
    result = handle_numbers_and_special_chars(case["input"], mode=case["mode"])
    print(f"Test Case {idx}: Mode = {case['mode']}")
    print("Input:    ", case["input"])
    print("Expected: ", repr(case["expected"]))
    print("Result:   ", repr(result))
    print("Pass:", result == case["expected"])
    print("-" * 40)

Test Case 1: Mode = remove
Input:     اللغة ١٢٣ جميلة!
Expected:  'اللغة جميلة'
Result:    'اللغة ١٢٣ جميلة'
Pass: False
----------------------------------------
Test Case 2: Mode = normalize
Input:     اللغة ١٢٣ جميلة!
Expected:  'اللغة 123 جميلة!'
Result:    'اللغة 123 جميلة!'
Pass: True
----------------------------------------
Test Case 3: Mode = remove
Input:     هذا نص مع رموز مثل #، و ٤٥٦!
Expected:  'هذا نص مع رموز مثل  و '
Result:    'هذا نص مع رموز مثل ، و ٤٥٦'
Pass: False
----------------------------------------
Test Case 4: Mode = normalize
Input:     هذا نص مع رموز مثل #، و ٤٥٦!
Expected:  'هذا نص مع رموز مثل #، و 456!'
Result:    'هذا نص مع رموز مثل #، و 456!'
Pass: True
----------------------------------------
Test Case 5: Mode = remove
Input:     
Expected:  ''
Result:    ''
Pass: True
----------------------------------------


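> <span style="color: red">**_TODO:_**</span> fix the `remove` mode of `handle_numbers_and_special_chars`. The range `\u0600-\u06FF` covers the whole Arabic Unicode block, which includes the Arabic-Indic digits (U+0660 to U+0669) and punctuation such as the Arabic comma `،` (U+060C), so these characters are kept (test cases 1 and 3). Test case 1 also expects the doubled space left by the removed digits to be collapsed, which the function does not do.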
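
One possible fix for the `remove` mode is to whitelist only the Arabic letter ranges and then collapse the leftover whitespace. This is a sketch under stated assumptions, not a drop-in replacement: `remove_non_letters` is a hypothetical name, the chosen ranges also drop diacritics, tatweel, and extended Arabic letters, and collapsing whitespace is a design choice.

```python
import re

# Basic Arabic letters only: U+0621-U+063A and U+0641-U+064A.
# This skips tatweel (U+0640), diacritics, Arabic-Indic digits
# (U+0660-U+0669) and Arabic punctuation such as the comma (U+060C).
_NON_ARABIC_LETTER = re.compile(r'[^\u0621-\u063A\u0641-\u064A\s]')

def remove_non_letters(text: str) -> str:
    """Keep only basic Arabic letters and single spaces between words."""
    stripped = _NON_ARABIC_LETTER.sub('', text)
    # Removing characters can leave doubled spaces behind; collapse them.
    return re.sub(r'\s+', ' ', stripped).strip()

print(repr(remove_non_letters("اللغة ١٢٣ جميلة!")))
print(repr(remove_non_letters("هذا نص مع رموز مثل #، و ٤٥٦!")))
```

With whitespace collapsed, the expected value of test case 3 (which contains a doubled and a trailing space) would need to be updated to match.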

---

# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">Preprocessing</p>

## 1. Text Classification

> <span style="color: green">**_Text Classification:_**</span> is the task of assigning a text to one of a set of predefined categories.

In [None]:
def classify_arabic_text(text: str, model_name: str):
    """
    Classify Arabic text using the specified pretrained model from Hugging Face.
    
    Parameters:
      text (str): The Arabic text to classify.
      model_name (str): The Hugging Face model name for sequence classification.
      
    Returns:
      list: A list of probabilities corresponding to each class.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()  # Set model to evaluation mode
    
    arabert_prep = ArabertPreprocessor(model_name=model_name)
    processed_text = arabert_prep.preprocess(text)

    inputs = tokenizer(processed_text, return_tensors="pt", truncation=True, padding=True)
    
    with torch.no_grad():  # Disable gradient calculation for inference
        outputs = model(**inputs)
    
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    return predictions.tolist()[0]

Example

In [90]:
# Example usage
text = "هذا النص رائع ومفيد جداً"
classification = classify_arabic_text(text, model_name="aubmindlab/bert-base-arabertv2")
print(f"Text: {text}")
print(f"Classification probabilities: {classification}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at aubmindlab/bert-base-arabertv2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Text: هذا النص رائع ومفيد جداً
Classification probabilities: [0.8355841636657715, 0.1644158959388733]


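> <span style="color: red">**_TODO:_**</span> fine-tune the model to perform text classification across multiple classes. The classification head is newly initialized (see the warning above), so the probabilities shown here are not meaningful yet.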
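
The last step of `classify_arabic_text` turns the model's logits into probabilities with a softmax. The pure-Python version below is only meant to show what that step computes; the notebook itself uses `torch.nn.functional.softmax`, and the example logits are made up.

```python
import math

def softmax(logits: list) -> list:
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large scores.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 0.5])  # made-up logits for a two-class model
print(probs, sum(probs))
```

Softmax only rescales whatever logits it receives, so an untrained classification head still produces confident-looking probabilities.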

## 2. Sentiment Analysis

> <span style="color: green">**_Sentiment Analysis:_**</span> is identifying whether a text expresses a positive or negative sentiment.

In [None]:
def analyze_arabic_sentiment(text: str) -> tuple:
    """
    Analyze the sentiment of an Arabic text using a pretrained sentiment analysis model.

    This function uses the Hugging Face Transformers sentiment analysis pipeline with the
    model "CAMeL-Lab/bert-base-arabic-camelbert-msa-sentiment" to classify the sentiment
    of the input Arabic text. It returns a tuple containing the sentiment label (e.g., "positive"
    or "negative") and the associated confidence score.

    Parameters:
        text (str): The Arabic text to be analyzed for sentiment.

    Returns:
        tuple: A tuple (label, score) where:
            - label (str): The predicted sentiment label.
            - score (float): The confidence score (between 0 and 1) for the predicted sentiment.

    Example:
        >>> label, score = analyze_arabic_sentiment("هذا النص رائع ومفيد جداً")
        >>> print(f"Sentiment: {label} (confidence: {score:.2f})")

    """
    sentiment_pipeline = pipeline("sentiment-analysis", model="CAMeL-Lab/bert-base-arabic-camelbert-msa-sentiment")
    result = sentiment_pipeline(text)[0]
    return result['label'], result['score']


Example

In [92]:
# Example usage
text = "أنا سعيد جداً بهذا المنتج!"
sentiment, score = analyze_arabic_sentiment(text)
print(f"Text: {text}")
print(f"Sentiment: {sentiment}, Score: {score}")

Text: أنا سعيد جداً بهذا المنتج!
Sentiment: positive, Score: 0.9928115010261536


In [None]:
# Example usage
text = "أنت مالك ياض !"
sentiment, score = analyze_arabic_sentiment(text)
print(f"Text: {text}")
print(f"Sentiment: {sentiment}, Score: {score}")

Text: أنت مالك ياض !
Sentiment: negative, Score: 0.9415218234062195


## 3. Word Embedding

> <span style="color: green">**_Word Embedding:_**</span> is a way of representing words as dense, continuous vectors in a high-dimensional space. These vectors capture semantic relationships between words so that words with similar meanings are mapped to nearby points in the vector space. 

In [5]:
import gensim
import re

In [None]:
def arabic_word_embedding(text: str) -> tuple:
    """
    Compute the word embedding for an Arabic word and retrieve its most similar terms.

    This function loads a pretrained Word2Vec model, cleans and normalizes the input Arabic text,
    and then retrieves the word embedding vector for the cleaned word. Additionally, it finds and 
    prints the most similar words based on cosine similarity within the embedding space.

    The cleaning process includes:
      - Replacing various Arabic characters with normalized forms.
      - Removing diacritics (tashkeel) and repeated character elongations.
      - Trimming whitespace and handling specific punctuation or symbols.

    Parameters:
        text (str): The Arabic word or phrase to process.

    Returns:
        tuple: A tuple containing:
            - word_vector: The vector representation of the cleaned word (typically a NumPy array).
            - most_similar: A list of tuples, where each tuple consists of a similar word (str) 
              and its similarity score (float).

    Example:
        >>> vector, similar = arabic_word_embedding("القاهرة")
        Most similar words (and their similarity scores) are printed, and 'vector' holds the embedding for the cleaned word.
    """

    # Load the pretrained Word2Vec model.
    model = gensim.models.Word2Vec.load('./models/tweet_cbow_300/tweets_cbow_300')

    # Clean/Normalize Arabic Text
    def clean_str(text):
        # search and replace are paired by position and must have the same length.
        # The ' ' paired with '&quot;' is an assumed mapping; without it the lists are misaligned.
        search = ["أ", "إ", "آ", "ة", "_", "-", "/", ".", "،", " و ", " يا ", '"', "ـ", "'", "ى", "\\", '\n', '\t', '&quot;', '?', '؟', '!']
        replace = ["ا", "ا", "ا", "ه", " ", " ", "", "", "", " و", " يا", "", "", "", "ي", "", " ", " ", " ", " ? ", " ؟ ", " ! "]
        
        # Remove tashkeel (diacritics)
        p_tashkeel = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
        text = re.sub(p_tashkeel, "", text)
        
        # Remove elongation (repeated characters)
        p_longation = re.compile(r'(.)\1+')
        subst = r"\1\1"
        text = re.sub(p_longation, subst, text)
        
        text = text.replace('وو', 'و')
        text = text.replace('يي', 'ي')
        text = text.replace('اا', 'ا')
        
        for old, new in zip(search, replace):
            text = text.replace(old, new)
        
        return text.strip()

    # Clean the input text
    word = clean_str(text)

    # Retrieve and print the most similar terms to the cleaned word
    most_similar = model.wv.most_similar(word)
    print("Most similar terms to '{}':".format(word))
    for term, score in most_similar:
        print(term, score)
        
    # Retrieve the word vector
    word_vector = model.wv[word]

    return word_vector, most_similar
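
The ranking returned by `most_similar` is based on cosine similarity between embedding vectors. The sketch below shows that computation on toy two-dimensional vectors. `cosine_similarity`, `rank_by_similarity`, and the vectors are illustrative and unrelated to the pretrained model.

```python
import math

def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_similarity(query: list, vectors: dict, topn: int = 3) -> list:
    """Return the `topn` (word, score) pairs most similar to `query`."""
    scored = [(word, cosine_similarity(query, vec)) for word, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

# Toy vectors, for illustration only.
toy_vectors = {"w1": [1.0, 0.0], "w2": [0.9, 0.1], "w3": [0.0, 1.0]}
print(rank_by_similarity([1.0, 0.0], toy_vectors, topn=2))
```

Cosine similarity depends only on direction, not magnitude, so a vector and any positive multiple of it score 1.0 against each other.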

In [97]:
# fasttext is not working on Windows
# import fasttext

def word2vec_fasttext(text: str):
    # create word embedding model
    model = fasttext.train_unsupervised('xxx.txt', epoch=25)

    # get word embeddings for words in text
    word_embeddings = model.get_word_vector(text)

    return word_embeddings

Example

In [9]:
vector, similar = arabic_word_embedding("القاهرة")
print("Word vector for 'القاهرة':", vector)

الاسكندريه 0.822142481803894
اسوان 0.7448597550392151
الجيزه 0.7406017184257507
المنصوره 0.7375915050506592
الاسماعيليه 0.7310688495635986
بورسعيد 0.7242382168769836
الاقصر 0.7210926413536072
حلوان 0.7209565043449402
دمياط 0.7096551060676575
طنطا 0.7094075679779053
Word vector for 'القاهرة': [-0.36143455  0.32451856  0.14601593  0.32183495 -2.4407325   2.3771033
 -0.02987524 -0.36511382 -1.9445266   2.0458498  -0.4462689   0.8169745
  0.57143867 -0.14586152 -2.7012775   1.5832865   1.6561981   2.3886893
 -0.7477331  -2.2364702  -0.3022785  -0.44031492  1.0667934  -1.2664819
  0.5260765   0.87624025  0.58786726 -0.59008116 -2.0730557   1.4947067
  0.61162     4.520309    0.03703779 -0.02008293 -1.1760161  -0.907512
 -0.6775007  -1.4298267   0.43027702 -0.3751945   2.1304162   1.6183015
 -0.7221879   0.3284036  -1.151335    1.0218043  -0.19037294 -0.25370607
  0.55373436  0.2848552  -2.660286    1.0729909   0.5107816  -2.057838
 -1.3162624  -0.8008683  -2.131203    1.3305361  -0.08949289

## 4. Multi-Label Labelling

## 5. Topic Modeling

> <span style="color: green">**_Topic Modeling:_**</span> is an unsupervised machine learning technique for finding abstract topics in a large collection of documents. It helps in organizing, understanding and summarizing large collections of textual information and discovering the latent topics that vary among documents in a given corpus.

Latent Dirichlet Allocation (LDA) and Non-Negative Matrix Factorization (NMF) are two of the most popular topic modeling techniques. LDA takes a probabilistic approach, whereas NMF uses matrix factorization. Newer techniques based on BERT, such as BERTopic, also exist.

Source: https://colab.research.google.com/drive/1OT_wcYKpKS73uR6y7IVYjJVxaP-C1H3k?usp=sharing

In [None]:
import pandas as pd
from bertopic import BERTopic
from flair.embeddings import TransformerDocumentEmbeddings
from gensim.models.coherencemodel import CoherenceModel
import gensim.corpora as corpora
from gensim.models import LdaMulticore
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF

## 6. Translate to English

> <span style="color: red">**_TODO:_**</span> Translate Arabic to English and perform natural language processing.

## 7. Detecting Sarcasm

Source: https://medium.com/@rehabreda/unraveling-sarcasm-in-arabic-with-arabert-a-comprehensive-guide-from-data-preprocessing-to-a4dc7e30b39d

---

# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">Pipeline Execution</p>

## 1. Cleaning

In [None]:
def clean():
    pass

## 2. Feature Engineering

In [None]:
def feature_eng():
    pass

## 3. Encoding

In [None]:
def encode():
    pass

## 4. Normalization

In [None]:
def normalize():
    pass

---

# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">Visualize Data</p>

---

# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">Resources</p>

1. BERT for Arabic Topic Modeling: An Experimental Study on BERTopic Technique (https://github.com/iwan-rg/Arabic-Topic-Modeling?tab=readme-ov-file)
2. NYU ABU DHABI (https://nyuad.nyu.edu/en/research/faculty-labs-and-projects/computational-approaches-to-modeling-language-lab/research/arabic-natural-language-processing.html)
3. CAMeL Tools (https://camel-tools.readthedocs.io/en/latest/api.html)
4. Comprehensive Arabic NLP Data Processing and Cleaning Guide (https://github.com/h9-tect/Arabic_nlp_preprocessing)
5. PyArabic (https://pyarabic.readthedocs.io/ar/latest/)
6. AUB MIND LAB (https://huggingface.co/aubmindlab)
7. AraBERT (https://github.com/aub-mind/arabert/tree/master)
8. Awesome Resources for Arabic NLP Repo (https://github.com/Curated-Awesome-Lists/awesome-arabic-nlp?tab=readme-ov-file)
9. Arabic Dialect Identification Models (https://github.com/Lafifi-24/arabic-dialect-identification?tab=readme-ov-file)
10. AraVec (https://github.com/bakrianoo/aravec/blob/master/AraVec%202.0/README.md)
11. Text Classifier (https://github.com/mustaphakamil/Arabic-text-classification/blob/master/Text%20Classifier%20NLP.ipynb)
12. BERT for Arabic Topic Modeling (https://colab.research.google.com/drive/1OT_wcYKpKS73uR6y7IVYjJVxaP-C1H3k?usp=sharing#scrollTo=SNa-KtKDRnus)
13. Dialect Identification (https://medium.com/@kmelad43/arabic-dialect-identification-774de9315140)