# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">Setup</p>

In [1]:
# Imports
import re
import typing
from collections import Counter

In [2]:
def extract_sentences_yt(file_path: str) -> list:
    '''
    Read a YouTube transcript file and split it into sentences.

    @param file_path: path to the transcript text file
    @return: list of non-empty sentences
    '''

    with open(file_path, 'r', errors="ignore", encoding="utf-8") as f:
        text = f.read()

    # Remove numbers followed by ':' (no greedy '.*', which would also delete
    # any text sitting between a digit and a later colon)
    text = re.sub(r'\d+\s*:', '', text)

    # Define sentence delimiters for Arabic
    sentence_endings = r'(?<=[.!؟؛،])\s+'

    # Split on whitespace that follows a delimiter; the lookbehind keeps the delimiter attached to its sentence
    sentences = re.split(sentence_endings, text)

    return [s.strip() for s in sentences if len(s.strip()) > 1] # Remove empty strings
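
The lookbehind split can be tried on an in-memory string, without reading a file. This is a minimal sketch: `split_sentences` and the sample text are illustrative names that are not part of the notebook, and only the delimiter pattern is taken from `extract_sentences_yt`.

```python
import re

# Same delimiter pattern as in extract_sentences_yt: split on whitespace
# that follows an Arabic or Latin sentence delimiter.
SENTENCE_ENDINGS = r'(?<=[.!؟؛،])\s+'

def split_sentences(text: str) -> list:
    """Hypothetical helper: split a string instead of a file."""
    parts = re.split(SENTENCE_ENDINGS, text)
    # Keep only fragments longer than one character, as the original does
    return [p.strip() for p in parts if len(p.strip()) > 1]

sample = "مرحبا بكم، في حلقة جديدة! من البرنامج."
for sentence in split_sentences(sample):
    print(sentence)
```

Because the delimiter sits inside a lookbehind, it is not consumed by the split and stays at the end of its sentence.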
    

In [3]:
# Example
dataset = extract_sentences_yt("البطاطس  الدحيح.txt")
dataset

['لكل شخص في آخر الشهر،',
 'ما معاهوش غير 50 جنيه،',
 'ومحتاج ياكل أكلة تشبّعه...',
 'لكل واحد "فورمة"،',
 'محتاج أكلة سريعة الهضم،',
 'تدّيله طاقة،',
 'تخلّيه يكمّل التمرينة...',
 'لكل واحد مش عارف ياكل،',
 'ونفسه في أكلة جانبية مع الأكل،',
 'عشان تفتح نِفسه...',
 'العالم محتاج بطل خارق...',
 'بطل حقيقي يقدر ينقذهم.',
 'العالم محتاج...',
 'بطاطس.',
 'اللي ما معاهوش فلوس،',
 'ياكل سندوتش بطاطس سوري.',
 'اللي عايز طاقة في الـGym...',
 'ياكل بطاطس مهروسة.',
 'اللي عايز تتفتح نِفسه على الأكل...',
 'ياكل "شيبسي".',
 'العالم كان محتاج معجزة من زمان،',
 'والمعجزة اتجسدت على صورة إنسان.',
 '!Potatoman\n !Potatoman\n أعزائي المشاهدين،',
 'السلام عليكم ورحمة الله وبركاته،',
 'أهلًا بكم في حلقة جديدة،',
 'من برنامج "الدحّيح"!',
 'عزيزي،',
 'لو ربنا كرمك،',
 'وذهبت إلى "ألمانيا"،',
 'وذهبت إلى حديقة "سانسوسي"\nفي مدينة "بوتسدام"،',
 'وزُرت كنيسة "السلام" هناك،',
 'هتلاقي قبور مجموعة من ملوك "ألمانيا"،',
 'منهم قبر واحد\nمن أهم ملوك "بروسيا" القرن الـ18،',
 'الملك "فريدريك التاني"،',
 'المعروف بـ"

# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">EDA</p>

## Discover Dataset

## Questions

# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">Cleaning</p>

## Tidying Up Text

### Orthographic mistakes

### Spelling inconsistencies (Text Correction)

**Ghalatawi**: Arabic Autocorrect library مكتبة للتصحيح التلقائي للغة العربية

Source: https://github.com/linuxscout/ghalatawi

In [4]:
from ghalatawi.autocorrector import AutoCorrector

autoco = AutoCorrector()

autoco.show_config()

{'regex': True, 'wordlist': True, 'punct': True, 'typo': True}

> <span style="color: yellow">**_Note:_**</span> The library can fix spelling mistakes and typos and adjust punctuation.


In [5]:
def auto_correct(dataset: list) -> list:
    '''
    Fix typos, punctuation and spelling mistakes in each text of the dataset.
    '''

    autoco = AutoCorrector()

    output = []

    for text in dataset:
        # Collect every corrected text instead of overwriting the result
        output.append(autoco.spell(text))

    return output

### Unknown characters

### Repeated letters and with spaces in the words



## Text Processing

### Semantic Segmentation

One problem with text collected from YouTube videos and podcasts is that it has no reliable sentence structure to split on, so the text is segmented by meaning instead.

In [6]:
import pyarabic.araby as araby
import pyarabic.number as number

**PyArabic**: A specific Arabic language library for Python, provides basic functions to manipulate Arabic letters and text, like detecting Arabic letters, Arabic letters groups and characteristics, remove diacritics etc.

Source: https://github.com/linuxscout/pyarabic

In [7]:
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity



**CAMeLBERT**: a collection of BERT models pre-trained on Arabic text, available in different sizes and variants.

Source: https://huggingface.co/CAMeL-Lab/bert-base-arabic-camelbert-msa

In [8]:
# Load Arabic SBERT Model
model = SentenceTransformer("CAMeL-Lab/bert-base-arabic-camelbert-msa")

No sentence-transformers model found with name CAMeL-Lab/bert-base-arabic-camelbert-msa. Creating a new one with mean pooling.


In [9]:
# Tokenized sentences (initial splitting based on commas or manual segmentation)
sentences = dataset

# Compute sentence embeddings
embeddings = model.encode(sentences)

# Compute cosine similarity
sim_matrix = cosine_similarity(embeddings)

# Find semantic breakpoints (low similarity)
threshold = 0.5  # Adjust this based on experimentation
split_points = [i for i in range(len(sentences) - 1) if sim_matrix[i, i+1] < threshold]

# Generate semantic splits
segments = []
start = 0
for split in split_points:
    segments.append(" ".join(sentences[start:split+1]))
    start = split + 1

segments.append(" ".join(sentences[start:]))

# Print results
for segment in segments:
    print(segment)
    print("-----")

لكل شخص في آخر الشهر، ما معاهوش غير 50 جنيه، ومحتاج ياكل أكلة تشبّعه... لكل واحد "فورمة"، محتاج أكلة سريعة الهضم، تدّيله طاقة، تخلّيه يكمّل التمرينة...
-----
لكل واحد مش عارف ياكل، ونفسه في أكلة جانبية مع الأكل، عشان تفتح نِفسه... العالم محتاج بطل خارق... بطل حقيقي يقدر ينقذهم. العالم محتاج...
-----
بطاطس.
-----
اللي ما معاهوش فلوس،
-----
ياكل سندوتش بطاطس سوري.
-----
اللي عايز طاقة في الـGym...
-----
ياكل بطاطس مهروسة. اللي عايز تتفتح نِفسه على الأكل... ياكل "شيبسي". العالم كان محتاج معجزة من زمان، والمعجزة اتجسدت على صورة إنسان.
-----
!Potatoman
 !Potatoman
 أعزائي المشاهدين، السلام عليكم ورحمة الله وبركاته، أهلًا بكم في حلقة جديدة، من برنامج "الدحّيح"! عزيزي، لو ربنا كرمك، وذهبت إلى "ألمانيا"، وذهبت إلى حديقة "سانسوسي"
في مدينة "بوتسدام"، وزُرت كنيسة "السلام" هناك، هتلاقي قبور مجموعة من ملوك "ألمانيا"، منهم قبر واحد
من أهم ملوك "بروسيا" القرن الـ18، الملك "فريدريك التاني"، المعروف بـ"فريدريك العظيم". "يا (أبو حميد) ما تخش بقى في الموضوع، عايزين نعرف!"
 يا عزيزي، انتظر، هقولّك أهو. د

> <span style="color: red">**_TODO:_**</span> Handle text that contains both English and Arabic.
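
The segmentation step only compares each sentence with the one that follows it, so the full similarity matrix is not required. The sketch below shows the same idea with plain Python lists. The two-dimensional vectors stand in for real sentence embeddings, and `cosine` and `segment_by_similarity` are hypothetical helper names introduced here for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def segment_by_similarity(sentences, vectors, threshold=0.5):
    """Start a new segment wherever adjacent sentences are less similar than the threshold."""
    if not sentences:
        return []
    segments, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            segments.append(" ".join(current))
            current = []
        current.append(sentences[i])
    segments.append(" ".join(current))
    return segments

# Toy data: the first two vectors point one way, the last two another way
toy_sentences = ["s1", "s2", "s3", "s4"]
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(segment_by_similarity(toy_sentences, toy_vectors))
```

With real embeddings the threshold still has to be tuned by inspection: a high value produces many short segments and a low value produces a few long ones.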

In [10]:
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
import os
import warnings

warnings.simplefilter("ignore")

s = pd.DataFrame(segments)


In [None]:
from nltk.corpus import stopwords
from textblob import TextBlob
import re
from dsaraby import DSAraby
from tashaphyne.stemming import ArabicLightStemmer
from nltk.stem.isri import ISRIStemmer

stops = set(stopwords.words("arabic"))
stop_word_comp = {"،","آض","آمينَ","آه","آهاً","آي","أ","أب","أجل","أجمع","أخ","أخذ","أصبح","أضحى","أقبل","أقل","أكثر","ألا","أم","أما","أمامك","أمامكَ","أمسى","أمّا","أن","أنا","أنت","أنتم","أنتما","أنتن","أنتِ","أنشأ","أنّى","أو","أوشك","أولئك","أولئكم","أولاء","أولالك","أوّهْ","أي","أيا","أين","أينما","أيّ","أَنَّ","أََيُّ","أُفٍّ","إذ","إذا","إذاً","إذما","إذن","إلى","إليكم","إليكما","إليكنّ","إليكَ","إلَيْكَ","إلّا","إمّا","إن","إنّما","إي","إياك","إياكم","إياكما","إياكن","إيانا","إياه","إياها","إياهم","إياهما","إياهن","إياي","إيهٍ","إِنَّ","ا","ابتدأ","اثر","اجل","احد","اخرى","اخلولق","اذا","اربعة","ارتدّ","استحال","اطار","اعادة","اعلنت","اف","اكثر","اكد","الألاء","الألى","الا","الاخيرة","الان","الاول","الاولى","التى","التي","الثاني","الثانية","الذاتي","الذى","الذي","الذين","السابق","الف","اللائي","اللاتي","اللتان","اللتيا","اللتين","اللذان","اللذين","اللواتي","الماضي","المقبل","الوقت","الى","اليوم","اما","امام","امس","ان","انبرى","انقلب","انه","انها","او","اول","اي","ايار","ايام","ايضا","ب","بات","باسم","بان","بخٍ","برس","بسبب","بسّ","بشكل","بضع","بطآن","بعد","بعض","بك","بكم","بكما","بكن","بل","بلى","بما","بماذا","بمن","بن","بنا","به","بها","بي","بيد","بين","بَسْ","بَلْهَ","بِئْسَ","تانِ","تانِك","تبدّل","تجاه","تحوّل","تلقاء","تلك","تلكم","تلكما","تم","تينك","تَيْنِ","تِه","تِي","ثلاثة","ثم","ثمّ","ثمّة","ثُمَّ","جعل","جلل","جميع","جير","حار","حاشا","حاليا","حاي","حتى","حرى","حسب","حم","حوالى","حول","حيث","حيثما","حين","حيَّ","حَبَّذَا","حَتَّى","حَذارِ","خلا","خلال","دون","دونك","ذا","ذات","ذاك","ذانك","ذانِ","ذلك","ذلكم","ذلكما","ذلكن","ذو","ذوا","ذواتا","ذواتي","ذيت","ذينك","ذَيْنِ","ذِه","ذِي","راح","رجع","رويدك","ريث","رُبَّ","زيارة","سبحان","سرعان","سنة","سنوات","سوف","سوى","سَاءَ","سَاءَمَا","شبه","شخصا","شرع","شَتَّانَ","صار","صباح","صفر","صهٍ","صهْ","ضد","ضمن","طاق","طالما","طفق","طَق","ظلّ","عاد","عام","عاما","عامة","عدا","عدة","عدد","عدم","عسى","عشر","عشرة","علق","على","عليك","عليه","عليها","علًّ","عن","عند","عندما","عوض","عين","عَدَسْ","عَمَّا","
غدا","غير","ـ","ف","فان","فلان","فو","فى","في","فيم","فيما","فيه","فيها","قال","قام","قبل","قد","قطّ","قلما","قوة","كأنّما","كأين","كأيّ","كأيّن","كاد","كان","كانت","كذا","كذلك","كرب","كل","كلا","كلاهما","كلتا","كلم","كليكما","كليهما","كلّما","كلَّا","كم","كما","كي","كيت","كيف","كيفما","كَأَنَّ","كِخ","لئن","لا","لات","لاسيما","لدن","لدى","لعمر","لقاء","لك","لكم","لكما","لكن","لكنَّما","لكي","لكيلا","للامم","لم","لما","لمّا","لن","لنا","له","لها","لو","لوكالة","لولا","لوما","لي","لَسْتَ","لَسْتُ","لَسْتُم","لَسْتُمَا","لَسْتُنَّ","لَسْتِ","لَسْنَ","لَعَلَّ","لَكِنَّ","لَيْتَ","لَيْسَ","لَيْسَا","لَيْسَتَا","لَيْسَتْ","لَيْسُوا","لَِسْنَا","ما","ماانفك","مابرح","مادام","ماذا","مازال","مافتئ","مايو","متى","مثل","مذ","مساء","مع","معاذ","مقابل","مكانكم","مكانكما","مكانكنّ","مكانَك","مليار","مليون","مما","ممن","من","منذ","منها","مه","مهما","مَنْ","مِن","نحن","نحو","نعم","نفس","نفسه","نهاية","نَخْ","نِعِمّا","نِعْمَ","ها","هاؤم","هاكَ","هاهنا","هبّ","هذا","هذه","هكذا","هل","هلمَّ","هلّا","هم","هما","هن","هنا","هناك","هنالك","هو","هي","هيا","هيت","هيّا","هَؤلاء","هَاتانِ","هَاتَيْنِ","هَاتِه","هَاتِي","هَجْ","هَذا","هَذانِ","هَذَيْنِ","هَذِه","هَذِي","هَيْهَاتَ","و","و6","وا","واحد","واضاف","واضافت","واكد","وان","واهاً","واوضح","وراءَك","وفي","وقال","وقالت","وقد","وقف","وكان","وكانت","ولا","ولم","ومن","مَن","وهو","وهي","ويكأنّ","وَيْ","وُشْكَانََ","يكون","يمكن","يوم","ّأيّان"}

ArListem = ArabicLightStemmer()
ds = DSAraby()

FileNotFoundError: [Errno 2] No such file or directory: 'D:\\MET\\Semester 10\\[CSEN1076] Natural Language Processing and Information Retrieval\\Project\\Project\\dsaraby/assets/mapping.manual.json'

In [None]:
# Helper Methods

def to_arabic(text: str) -> str:
    '''
    Transliterate Latin-script text into Arabic script.
    The algorithm maps Latin letters to Arabic ones to generate the possible Arabic
    words for a given Latin word, then keeps the candidate that is most frequent
    in a corpus.
    '''

    return ds.transliterate(text)

def stem(text):
    '''
    Replace every word in the text with the root extracted by the Tashaphyne light stemmer.
    '''

    zen = TextBlob(text)
    words = zen.words
    cleaned = list()

    for w in words:
        # light_stem() must run first; get_root() then returns the root of that word
        ArListem.light_stem(w)
        cleaned.append(ArListem.get_root())
    return " ".join(cleaned)

import pyarabic.araby as araby


def remove_stop_words(text: str):
    '''
    Drop stop words (the NLTK list plus the complementary list above) and single-character tokens.
    '''
    zen = TextBlob(text)
    words = zen.words
    return " ".join([w for w in words if w not in stops and w not in stop_word_comp and len(w) >= 2])

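def split_hashtag_to_words(tag):
    '''
    Split a hashtag into words: on underscores if present, otherwise on
    CamelCase and digit boundaries.
    '''
    tag = tag.replace('#', '')
    tags = tag.split('_')
    if len(tags) > 1:
        return tags

    pattern = re.compile(r"[A-Z][a-z]+|\d+|[A-Z]+(?![a-z])")
    parts = pattern.findall(tag)
    # Fall back to the whole tag when nothing matches (e.g. Arabic or lowercase tags)
    if not parts and tag:
        parts = [tag]
    return parts

def clean_hashtag(text):
    '''
    Replace every hashtag in the text with the words it is made of.
    '''
    words = text.split()
    text = list()

    for word in words:
        if is_hashtag(word):
            text.extend(extract_hashtag(word))
        else:
            text.append(word)

    return " ".join(text)

def is_hashtag(word):
    '''
    Return True if the word starts with '#'.
    '''
    return word.startswith("#")

def extract_hashtag(text):
    '''
    Return the words contained in all hashtags found in the text.
    '''
    hash_list = [re.sub(r"(\W+)$", "", i) for i in text.split() if i.startswith("#")]
    word_list = []
    for word in hash_list:
        word_list.extend(split_hashtag_to_words(word))
    return word_list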

def remove_emoji(text):
    '''
    Remove emoji and pictographic symbols from the text.
    '''
    emoji_pattern = re.compile("["
                               "\U0001F600-\U0001F64F"  # emoticons
                               "\U0001F300-\U0001F5FF"  # symbols & pictographs
                               "\U0001F680-\U0001F6FF"  # transport & map symbols
                               "\U0001F1E0-\U0001F1FF"  # flags (iOS)
                               "\U00002702-\U000027B0"
                               "\U000024C2-\U0001F251"
                               "]+", flags=re.UNICODE)
    return emoji_pattern.sub('', text)

import unicodedata
from unidecode import unidecode

def emoji_native_translation(text):
    '''
    Replace ASCII emoticons such as :) or :-( with an Arabic sentiment word.
    '''
    text = text.lower()
    loves = {"<3", "♥", "❤"}
    smilefaces, sadfaces, neutralfaces = set(), set(), set()

    eyes = ["8", ":", "=", ";"]
    nose = ["'", "`", "-", "\\"]
    for e in eyes:
        for n in nose:
            # Tokens are compared literally below, so the mouths are plain
            # characters rather than regex-escaped ones
            for s in [")", "d", "]", "}", "p"]:
                smilefaces.update({e + n + s, e + s})
            for s in ["(", "[", "{"]:
                sadfaces.update({e + n + s, e + s})
            for s in ["|", "/", "\\"]:
                neutralfaces.update({e + n + s, e + s})
            # Reversed emoticons, e.g. (: or ):
            for s in ["(", "[", "{"]:
                smilefaces.update({s + n + e, s + e})
            for s in [")", "]", "}"]:
                sadfaces.update({s + n + e, s + e})
            for s in ["|", "/", "\\"]:
                neutralfaces.update({s + n + e, s + e})

    t = []
    for w in text.split():
        if w in loves:
            t.append("حب")
        elif w in smilefaces:
            t.append("مضحك")
        elif w in neutralfaces:
            t.append("عادي")
        elif w in sadfaces:
            t.append("محزن")
        else:
            t.append(w)
    return " ".join(t)

import emoji

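def is_emoji(word):
    # `emojis_ar` (a mapping from an emoji to its Arabic description) is not defined
    # in this notebook and has to be loaded before this helper is called
    return word in emojis_ar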

from aiogoogletrans import Translator
translator = Translator()
import asyncio
loop = asyncio.get_event_loop()
def translate_emojis(words):
    word_list = list()
    words_to_translate = list()
    for word in words :
        t = emojis_ar.get(word.get('emoji'),None)
        if t is None:
            word.update({'translation':'عادي','translated':True})
            #words_to_translate.append('normal')
        else:
            word.update({'translated':False,'translation':t})
            words_to_translate.append(t.replace(':','').replace('_',' '))
        word_list.append(word)
    return word_list

def emoji_unicode_translation(text):
    text = add_space(text)
    words = text.split()
    text_list = list()
    emojis_list = list()
    c = 0
    for word in words:
        if is_emoji(word):
            emojis_list.append({'emoji':word,'emplacement':c})
        else:
            text_list.append(word)
        c+=1
    emojis_translated = translate_emojis(emojis_list)
    for em in emojis_translated:
        text_list.insert(em.get('emplacement'),em.get('translation'))
    text = " ".join(text_list)
    return text
    
def clean_emoji(text):
    text = emoji_native_translation(text)
    text = emoji_unicode_translation(text)
    return text

def clean_text(text):
    ## Remove punctuation
    text = re.sub('[%s]' % re.escape(r"""!"#$%&'()*+,،-./:;<=>؟?@[\]^_`{|}~"""), ' ', text)

    ## Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)

    ## Remove emojis
    text = remove_emoji(text)

    ## Convert text to lowercase
    text = text.lower()

    ## Arabize the text (transliterate Latin-script words)
    text = to_arabic(text)

    ## Remove stop words
    text = remove_stop_words(text)

    ## Remove numbers
    text = re.sub(r"\d+", " ", text)

    ## Normalize letters and remove tashkeel
    text = normalize_arabic(text)

    #text = re.sub(r'\W+', ' ', text)
    ## Remove remaining Latin words and escaped unicode sequences
    text = re.sub('[A-Za-z]+', ' ', text)
    text = re.sub(r'\\u[A-Za-z0-9\\]+', ' ', text)

    ## Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)

    ## Stemming (disabled)
    #text = stem(text)
    return text



### Normalization (Check)

> <span style="color: green">**_Normalization:_**</span> match digits that have the same writing but different encodings.

In [None]:
import tnkeeh as tn


def normalize_arabic(text):
    '''
    Match digits that have the same written form but different encodings, using Tnkeeh.
    '''
    
    normalizer = tn.Tnkeeh(normalize=True)
    output = normalizer.clean_raw_text(text)
    
    return output

In [None]:
def normalize_arabic(text: str) -> str:
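    '''
    Normalize Arabic text: unify alef, yaa, hamza and taa marbuta variants, map
    non-Arabic letter forms (e.g. گ, ڤ, چ, پ) to Arabic letters, strip diacritics
    and tatweel, and shorten elongated characters.
    '''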
    text = text.strip()
    text = re.sub("[إأٱآا]", "ا", text)
    text = re.sub("ى", "ي", text)
    text = re.sub("ؤ", "ء", text)
    text = re.sub("ئ", "ء", text)
    text = re.sub("ة", "ه", text)
    text = re.sub("گ", "ك", text)
    text = re.sub("ڤ", "ف", text)
    text = re.sub("چ", "ج", text)
    text = re.sub("پ", "ب", text)
    text = re.sub("ڜ", "ش", text)
    text = re.sub("ڪ", "ك", text)
    text = re.sub("ڧ", "ق", text)
    text = re.sub("ٱ", "ا", text)
    noise = re.compile(""" ّ    | # Tashdid
                             َ    | # Fatha
                             ً    | # Tanwin Fath
                             ُ    | # Damma
                             ٌ    | # Tanwin Damm
                             ِ    | # Kasra
                             ٍ    | # Tanwin Kasr
                             ْ    | # Sukun
                             ـ     # Tatwil/Kashida
                         """, re.VERBOSE)
    text = re.sub(noise, '', text)
    text = re.sub(r'(.)\1+', r"\1\1", text) # Collapse runs of a repeated character to two occurrences
    return araby.strip_tashkeel(text)


Example

In [25]:
# Example usage
raw_text = "هذا نص تجريبي يحتوي على أحرف مختلفة مثل إ و أ و آ و ى و ڤ و چ"
normalized_text = normalize_arabic(raw_text)
print(normalized_text)

هذا نص تجريبي يحتوي علي احرف مختلفه مثل ا و ا و ا و ي و ف و ج
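
The normalization note mentions digits that read the same but are encoded differently. Arabic-Indic digits (U+0660 to U+0669), Extended Arabic-Indic digits (U+06F0 to U+06F9) and ASCII digits are one such case. The sketch below folds them all to ASCII with the standard library only. `fold_digits` is a hypothetical helper name, and this is one possible approach rather than what Tnkeeh does internally.

```python
import unicodedata

def fold_digits(text: str) -> str:
    """Hypothetical helper: map every Unicode decimal digit to ASCII 0-9."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None when ch is not a decimal digit
        out.append(str(value) if value is not None else ch)
    return "".join(out)

# Arabic-Indic three, Extended Arabic-Indic three, ASCII three
mixed = "\u0663 \u06F3 3"
print(fold_digits(mixed))
```

Folding digits before any "remove numbers" step means that a single `\d+` or `[0-9]+` pattern then covers all three encodings.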


Usage

### Specific Noise Removal

> <span style="color: green">**_Noise Removal:_**</span> extend noise removal to handle more cases.

In [27]:
def remove_arabic_noise(text: str) -> str:
    # Remove HTML tags first, while the angle brackets are still in the text
    text = re.sub(r'<.*?>', '', text)

    # Remove diacritics
    text = re.sub(r'[\u0617-\u061A\u064B-\u0652]', '', text)

    # Remove tatweel
    text = re.sub(r'\u0640', '', text)

    # Remove non-Arabic characters
    text = re.sub(r'[^\u0600-\u06FF\s]', '', text)

    # Remove extra whitespaces
    text = re.sub(r'\s+', ' ', text).strip()

    return text

Example

In [28]:
# Example usage
noisy_text = "هَـــذا نَـــصّ <b>تَجْــرِيــبـِـيّ</b> مع   مسافات  زائدة"
cleaned_text = remove_arabic_noise(noisy_text)  # named so that it does not shadow the clean_text() function
print(cleaned_text)

هذا نص تجريبي مع مسافات زائدة


Usage

### Tokenization

> <span style="color: green">**_Tokenization:_**</span> is the process of breaking a sequence of text into smaller units called tokens, such as words, phrases, symbols, and other elements. For Arabic, tokenization is complicated by the language's rich morphology: conjunctions, prepositions and pronouns attach to words as clitics, so a single whitespace-separated word can hold several tokens.
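
A rule-based tokenizer is enough to separate words from punctuation, but it cannot split clitics off a word, which is where a morphological tokenizer is needed. The sketch below uses only the standard library. `regex_tokenize` and its pattern are illustrative assumptions and do not reproduce how CAMeL Tools tokenizes.

```python
import re

TOKEN_PATTERN = re.compile(
    r'[\u0621-\u065F\u0670]+'  # Arabic letters together with their diacritics
    r'|[A-Za-z]+'              # Latin words
    r'|\d+'                    # digit runs (ASCII or Arabic-Indic)
    r'|\S'                     # any other non-space character on its own
)

def regex_tokenize(text: str) -> list:
    """Hypothetical helper: split text into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(regex_tokenize("هذا مثال، بسيط."))
# A word with attached clitics stays in one piece
print(regex_tokenize("وبالكتاب"))
```

The second call shows the limitation: the conjunction, preposition and article attached to the noun are not separated, because nothing in the surface string marks their boundaries.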

In [40]:
import camel_tools
import os

camel_data_path = os.path.join(os.path.dirname(camel_tools.__file__), 'cli', 'camel_data.py')
print(camel_data_path)


c:\Users\mazen\AppData\Local\Programs\Python\Python312\Lib\site-packages\camel_tools\cli\camel_data.py


In [None]:
from camel_tools.tokenizers.word import simple_word_tokenize
from camel_tools.tokenizers.morphological import MorphologicalTokenizer
from camel_tools.disambig.mle import MLEDisambiguator
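# The morphological tokenizer needs the CAMeL Tools data packages. They are
# installed from the shell with the camel_data command
# (e.g. `camel_data -i light` in recent releases), not from this cell.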

def tokenize_arabic(text: str, method='simple'):
    if method == 'simple':
        return simple_word_tokenize(text)
    elif method == 'morphological':
        disambiguator = MLEDisambiguator.pretrained() # Load a pre-trained disambiguator
        # The tokenizer requires a tokenization scheme; 'd3tok' is used here
        tokenizer = MorphologicalTokenizer(disambiguator, scheme='d3tok', split=True)
        words = tokenizer.tokenize(simple_word_tokenize(text)) # Expects a list of words
        return words
    raise ValueError(f"Unknown tokenization method: {method}")

Example

In [45]:
text = "هذا مثال على تقطيع النص العربي بطريقة متقدمة."
simple_tokens = tokenize_arabic(text, 'simple')
morphological_tokens = tokenize_arabic(text, 'morphological')

print("Simple tokenization:", simple_tokens)
print("Morphological tokenization:", morphological_tokens)

FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\mazen\\AppData\\Roaming\\camel_tools\\data\\morphology_db\\calima-msa-r13\\morphology.db'

Usage

### Stemming & Lemmatization

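> <span style="color: green">**_Stemming:_**</span> reduces a word to its stem by stripping prefixes and suffixes; the result does not have to be a valid word.

> <span style="color: green">**_Lemmatization:_**</span> maps a word to its dictionary form (lemma) using morphological analysis.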
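
Light stemming can be illustrated with a toy affix stripper. This is a minimal sketch: the prefix and suffix lists are short illustrative assumptions, `toy_light_stem` is a hypothetical name, and the result is far cruder than Tashaphyne or Farasa.

```python
# Illustrative affix lists, longest prefixes first so that "وال" wins over "و"
PREFIXES = ("وال", "بال", "كال", "فال", "لل", "ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ان", "ها", "ية", "ة", "ه", "ي")

def toy_light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word

for w in "الكتب المدرسية مفيدة للطلاب".split():
    print(w, "->", toy_light_stem(w))
```

The `min_len` guard keeps short words from being stripped down to one or two letters, which is the most common failure of naive affix removal.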

In [None]:
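# NOTE: `camel_tools.stem.CAMeLStemeer` is kept as written in this notebook; the
# import has not been checked against the CAMeL Tools API and may not exist.
from camel_tools.stem import CAMeLStemeer
from farasa.stemmer import FarasaStemmer
from tashaphyne.stemming import ArabicLightStemmer

camel_stemmer = CAMeLStemeer.pretrained('')
farasa_stemmer = FarasaStemmer()
light_stemmer = ArabicLightStemmer()

def process_arabic(text, method='stem', tool='camel'):
    # Dispatch on the method first, then on the tool, so that `tool` is honoured
    if method == 'lemmatize':
        return camel_stemmer.lemmatize(text)
    if method == 'stem':
        if tool == 'camel':
            return camel_stemmer.stem(text)
        elif tool == 'farasa':
            return farasa_stemmer.stem(text)
        elif tool == 'light':
            return ' '.join([light_stemmer.light_stem(word) for word in text.split()])
    raise ValueError(f"Unsupported method/tool: {method}/{tool}")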

Example

In [None]:
# Example usage
text = "الكتب المدرسية مفيدة للطلاب"
camel_stemmed = process_arabic(text, 'stem', 'camel')
farasa_stemmed = process_arabic(text, 'stem', 'farasa')
light_stemmed = process_arabic(text, 'stem', 'light')
lemmatized = process_arabic(text, 'lemmatize')

print("CAMeL stemmed:", camel_stemmed)
print("Farasa stemmed:", farasa_stemmed)
print("Light stemmed:", light_stemmed)
print("Lemmatized:", lemmatized)

Usage

### Stop Words Removal

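> <span style="color: green">**_Stop Words Removal:_**</span> drops very frequent function words (prepositions, pronouns, conjunctions) that carry little meaning on their own.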

In [47]:
from nltk.corpus import stopwords

NLTK_STOPWORDS = set(stopwords.words('arabic'))

def remove_arabic_stopwords(tokens, custom_stopwords=None, use_nltk=True):
    # camel_tools.ner does not export a STOPWORDS list, so the set is built
    # from the NLTK list plus any custom words
    stopword_set = set()

    if use_nltk:
        stopword_set.update(NLTK_STOPWORDS)
    if custom_stopwords:
        stopword_set.update(custom_stopwords)

    return [token for token in tokens if token not in stopword_set]

Example

In [None]:
# Example usage
tokens = ["هذا", "مثال", "على", "إزالة", "كلمات", "التوقف", "بشكل", "متقدم"]
custom_stopwords = ["متقدم"]
filtered_tokens = remove_arabic_stopwords(tokens, custom_stopwords=custom_stopwords)
print(filtered_tokens)

Usage

### Diacritics

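> <span style="color: green">**_Diacritics:_**</span> the optional short-vowel and gemination marks (tashkeel); they are usually stripped so that one word is not counted under several spellings.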

In [None]:
import pyarabic.araby as araby

def handle_diacritics(text, method='remove'):
    if method == 'remove':
        return araby.strip_diacritics(text)
    elif method == 'keep':
        return text
    elif method == 'normalize':
        return araby.normalize_hamza(araby.strip_shadda(text))

Example

In [None]:
# Example usage
text_with_diacritics = "اللُّغَةُ العَرَبِيَّةُ جَمِيلَةٌ"
removed_diacritics = handle_diacritics(text_with_diacritics, 'remove')
normalized_diacritics = handle_diacritics(text_with_diacritics, 'normalize')

print("Original:", text_with_diacritics)
print("Removed diacritics:", removed_diacritics)
print("Normalized diacritics:", normalized_diacritics)

Usage

### Dialects

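> <span style="color: green">**_Dialects:_**</span> identify which regional variety of Arabic a text is written in, since the dialects differ from Modern Standard Arabic in vocabulary and spelling.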

In [None]:
from camel_tools.dialectid import DialectIdentifier

def identify_dialect(text):
    did = DialectIdentifier.pretrained()
    # predict() takes a list of sentences and returns one prediction per sentence
    prediction = did.predict([text])[0]
    return prediction.top

def normalize_dialect(text, target_dialect='MSA'):
    # This is a placeholder function. In practice, you would use more sophisticated
    # methods to normalize dialects, which is an active area of research.
    return text

Example

In [None]:
# Example usage
text = "شلونك حبيبي؟ شخبارك اليوم؟"
dialect = identify_dialect(text)
normalized_text = normalize_dialect(text)

print("Original text:", text)
print("Identified dialect:", dialect)
print("Normalized to MSA:", normalized_text)

Usage

### Punctuation

> <span style="color: green">**_Punctuation:_**</span> 

### Part-of-speech tagging

> <span style="color: green">**_Part-of-speech tagging:_**</span> 

### Named entity recognition (NER)

> <span style="color: green">**_Named entity recognition:_**</span> find and label named entities like proper nouns, organisations, places, etc.

In [None]:
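from camel_tools.ner import NERecognizer
from camel_tools.tokenizers.word import simple_word_tokenize

def recognize_entities(text):
    ner = NERecognizer.pretrained()
    # predict_sentence() expects a list of tokens and returns one label per token
    tokens = simple_word_tokenize(text)
    labels = ner.predict_sentence(tokens)
    entities = []
    current_entity = []
    current_label = None

    for word, label in zip(tokens, labels):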
        if label.startswith('B-'):
            if current_entity:
                entities.append((' '.join(current_entity), current_label))
                current_entity = []
            current_entity.append(word)
            current_label = label[2:]
        elif label.startswith('I-') and current_entity:
            current_entity.append(word)
        else:
            if current_entity:
                entities.append((' '.join(current_entity), current_label))
                current_entity = []
                current_label = None
    
    if current_entity:
        entities.append((' '.join(current_entity), current_label))
    
    return entities

Example

In [None]:
# Example usage
text = "يعيش محمد في القاهرة ويعمل في شركة جوجل."
entities = recognize_entities(text)
print("Text:", text)
print("Recognized entities:", entities)

Usage

### Corpus and lexical resources

> <span style="color: green">**_Corpus and lexical resources:_**</span> annotated corpora and lexical databases that can be used for tasks like language modelling and information retrieval.

### Segmentation

> <span style="color: green">**_Segmentation:_**</span> splits text into its units, here recovering word boundaries in text written without spaces.

In [None]:
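# NOTE: this import is kept as written in this notebook; `camel_tools.segmenters`
# has not been checked against the CAMeL Tools API and may not exist.
from camel_tools.segmenters.word import MaxLikelihoodProbabilityModel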

def segment_arabic_text(text):
    mlp_model = MaxLikelihoodProbabilityModel.pretrained()
    segmented = mlp_model.segment(text)
    return ' '.join(segmented)

Example

In [None]:
# Example usage
text = "وقالمصدرإنهناكتحسنافيالوضع"
segmented_text = segment_arabic_text(text)
print("Original:", text)
print("Segmented:", segmented_text)

NameError: name 'segment_arabic_text' is not defined

### Arabizi

> <span style="color: green">**_Arabizi:_**</span> Arabic written in Latin letters and digits, for example 7 for ح and 3 for ع.

In [None]:
def arabizi_to_arabic(text):
    # This is a simplified conversion. A complete solution would be more complex.
    # 'dh' for ذ is an assumed spelling: one key 'th' cannot map to both ث and ذ.
    conversion_dict = {
        'a': 'ا', 'b': 'ب', 't': 'ت', 'th': 'ث', 'g': 'ج', '7': 'ح', 'kh': 'خ',
        'd': 'د', 'dh': 'ذ', 'r': 'ر', 'z': 'ز', 's': 'س', 'sh': 'ش', '9': 'ص',
        '6': 'ط', '3': 'ع', 'gh': 'غ', 'f': 'ف', 'q': 'ق', 'k': 'ك', 'l': 'ل',
        'm': 'م', 'n': 'ن', 'h': 'ه', 'w': 'و', 'y': 'ي'
    }

    # Replace the longest keys first so digraphs such as 'sh' are matched before 's' and 'h'
    for latin in sorted(conversion_dict, key=len, reverse=True):
        text = text.replace(latin, conversion_dict[latin])

    return text

Example

In [None]:
# Example usage
arabizi_text = "mar7aba, kayf 7alak?"
arabic_text = arabizi_to_arabic(arabizi_text)
print("Arabizi:", arabizi_text)
print("Arabic:", arabic_text)

Usage

### Disambiguation

> <span style="color: green">**_Disambiguation:_**</span> chooses, for each word, the most likely morphological analysis in its context.

In [None]:
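from camel_tools.disambig.mle import MLEDisambiguator

def disambiguate_arabic(text):
    # Same disambiguator class as the one imported in the tokenization section
    disambiguator = MLEDisambiguator.pretrained('calima-msa-r13')
    disambiguated = disambiguator.disambiguate(text.split())
    # Keep the lemma of the top analysis; fall back to the surface word if there is none
    return [d.analyses[0].analysis['lex'] if d.analyses else d.word for d in disambiguated]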

Example

In [None]:
# Example usage
text = "ذهب الرجل إلى البنك"
disambiguated = disambiguate_arabic(text)
print("Original:", text)
print("Disambiguated:", ' '.join(disambiguated))

Usage

### Elongated Words

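> <span style="color: green">**_Elongated Words:_**</span> words whose letters are repeated for emphasis; the repeated runs are shortened so that variants of one word look alike.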

In [None]:
import re

def normalize_elongated_words(text):
    # Remove elongation
    text = re.sub(r'(.)\1+', r'\1\1', text)
    return text

Example

In [None]:
# Example usage
elongated_text = "يااااا سلاااام على هذا البرنااامج الراااائع"
normalized_text = normalize_elongated_words(elongated_text)
print("Elongated:", elongated_text)
print("Normalized:", normalized_text)

Usage

### Emojis

In [None]:
import emoji

def handle_emojis(text, mode='remove'):
    if mode == 'remove':
        return emoji.replace_emoji(text, '')
    elif mode == 'description':
        return emoji.demojize(text, language='ar')
    return text

Example

In [None]:
# Example usage
text_with_emoji = "أنا أحب القراءة 📚 وأستمتع بها كثيراً 😊"
text_without_emoji = handle_emojis(text_with_emoji, 'remove')
text_with_descriptions = handle_emojis(text_with_emoji, 'description')

print("Original:", text_with_emoji)
print("Without emojis:", text_without_emoji)
print("With emoji descriptions:", text_with_descriptions)

Usage

### Data Augmentation

In [None]:
import random
from camel_tools.morphology.database import MorphologyDB
from camel_tools.morphology.analyzer import Analyzer

def augment_arabic_data(text, num_augmentations=1):
    # Build an analyzer on top of the built-in morphology database
    morph = Analyzer(MorphologyDB.builtin_db())
    words = text.split()
    augmented_texts = []

    for _ in range(num_augmentations):
        new_words = []
        for word in words:
            analysis = morph.analyze(word)
            if analysis:
                # Each analysis is a dict; use the diacritized form of a randomly chosen one
                new_word = random.choice(analysis)['diac']
                new_words.append(new_word)
            else:
                new_words.append(word)
        augmented_texts.append(' '.join(new_words))

    return augmented_texts

Example

In [None]:
# Example usage
original_text = "الكتاب مفيد للقراءة"
augmented_data = augment_arabic_data(original_text, num_augmentations=3)

print("Original:", original_text)
print("Augmented data:")
for i, text in enumerate(augmented_data, 1):
    print(f"{i}. {text}")

Usage

## Handling Outliers

### Handling Very Common Word Removal

### Handling Very Rare Word Removal
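
Very common and very rare words can both be filtered with document frequencies. The sketch below is one possible approach using `collections.Counter`. The function name `filter_by_frequency` and the thresholds `min_count` and `max_doc_ratio` are assumptions chosen for illustration and would need tuning on the real corpus.

```python
from collections import Counter

def filter_by_frequency(docs, min_count=2, max_doc_ratio=0.8):
    """Drop tokens found in fewer than min_count documents or in more than max_doc_ratio of them."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    n_docs = len(docs)
    keep = {
        token for token, count in doc_freq.items()
        if count >= min_count and count / n_docs <= max_doc_ratio
    }
    return [[t for t in tokens if t in keep] for tokens in docs]

# Toy corpus: "a" is in every document, "c" and "d" appear only once
toy_docs = [["a", "b", "c"], ["a", "b"], ["a", "d"], ["a", "b"], ["a"]]
print(filter_by_frequency(toy_docs))
```

Counting documents rather than raw occurrences keeps one long, repetitive transcript from deciding on its own which words count as common.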

### Handling Numbers and Special Characters in Arabic Text

In [None]:
import re

def handle_numbers_and_special_chars(text, mode='remove'):
    if mode == 'remove':
        # Remove everything outside the Arabic block, plus the Arabic-Indic digits
        # (U+0660-U+0669), which sit inside that block
        return re.sub(r'[^\u0600-\u06FF\s]|[\u0660-\u0669]', '', text)
    elif mode == 'normalize':
        # Convert Arabic-Indic digits (٠-٩) to Western digits (0-9)
        number_map = {
            '٠': '0', '١': '1', '٢': '2', '٣': '3', '٤': '4',
            '٥': '5', '٦': '6', '٧': '7', '٨': '8', '٩': '9'
        }
        for arabic_indic, western in number_map.items():
            text = text.replace(arabic_indic, western)

        return text
    return text

Example

In [None]:
# Example usage
text = "يوجد ٣ تفاحات و٥ برتقالات في السلة!"
removed_numbers = handle_numbers_and_special_chars(text, 'remove')
normalized_numbers = handle_numbers_and_special_chars(text, 'normalize')

print("Original:", text)
print("Removed numbers and special chars:", removed_numbers)
print("Normalized numbers:", normalized_numbers)

Usage

# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">Feature Engineering</p>

# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">Preprocessing</p>

### Text Classification

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

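def classify_arabic_text(text, model_name="aubmindlab/bert-base-arabertv2"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    # The default checkpoint is a base model without a trained classification head,
    # so the probabilities are only meaningful after fine-tuning
    model = AutoModelForSequenceClassification.from_pretrained(model_name)
    model.eval()

    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

    return predictions.tolist()[0]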

Example

In [None]:
# Example usage
text = "هذا النص رائع ومفيد جداً"
classification = classify_arabic_text(text)
print(f"Text: {text}")
print(f"Classification probabilities: {classification}")

Usage

### Sentiment Analysis

In [None]:
from transformers import pipeline

def analyze_arabic_sentiment(text):
    sentiment_pipeline = pipeline("sentiment-analysis", model="CAMeL-Lab/bert-base-arabic-camelbert-msa-sentiment")
    result = sentiment_pipeline(text)[0]
    return result['label'], result['score']

Example

In [49]:
# Example usage
text = "أنا سعيد جداً بهذا المنتج!"
sentiment, score = analyze_arabic_sentiment(text)
print(f"Text: {text}")
print(f"Sentiment: {sentiment}, Score: {score}")

NameError: name 'analyze_arabic_sentiment' is not defined

Usage

### Word Embedding

### Multi-Label Labelling

### Topic Modeling

# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">Visualize Data</p>

# <p style="padding:50px;background-color:#DA8359;margin:0;color:#fafefe;font-family:newtimeroman;font-size:100%;text-align:center;border-radius: 15px 50px;overflow:hidden;font-weight:100">Resources</p>