## Preprocessing Arabic Data with CAMeL

### Objective

**Preprocess the Arabic text data into three versions, each building on the previous one, to identify which preprocessing method yields the best results.**

In [29]:
import pandas as pd

df1 = pd.read_csv('/Users/najlaalhomaid/Downloads/smsData.csv')

### Version 1

**Apply:**
1. Normalization (Unicode, Alef, Alef Maksura, Teh Marbuta, and diacritic removal)
2. Remove Links
3. Replace Punctuation
4. Remove Extra Spaces

In [2]:
pip install --upgrade camel-tools

Note: you may need to restart the kernel to use updated packages.


In [30]:
from camel_tools.utils.normalize import (
    normalize_unicode,
    normalize_alef_maksura_ar,
    normalize_alef_ar,
    normalize_teh_marbuta_ar,
)
import re

def preprocess_text(text):
    # Normalize Arabic
    text = normalize_unicode(text)  # Normalize Unicode
    text = normalize_alef_maksura_ar(text)  # Convert ى to ي
    text = normalize_alef_ar(text)  # Convert Alef variants (أ, إ, آ) to bare Alef ا
    text = normalize_teh_marbuta_ar(text) # Convert ة to ه
    text = re.sub(r'[\u064B-\u065F]', '', text)   # Remove diacritics
    text = re.sub(r'http\S+|www\.\S+', '', text)  # Remove links (dot escaped to match a literal '.')
    text = re.sub(r'[^\w\s]', ' ', text)  # Replace punctuation with a space
    
    # Remove extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text
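
The link, punctuation, and whitespace steps can be checked on their own with the standard library. The following is a minimal sketch: the helper name `strip_links_and_punct` and the sample strings are hypothetical and not taken from the dataset, and the dot after `www` is escaped so that it matches a literal period, which is an assumption about the intended pattern.

```python
import re

def strip_links_and_punct(text):
    # Hypothetical helper covering only the non-CAMeL steps (links, punctuation, spaces)
    text = re.sub(r'http\S+|www\.\S+', '', text)  # Drop links
    text = re.sub(r'[^\w\s]', ' ', text)          # Replace punctuation with a space
    return re.sub(r'\s+', ' ', text).strip()      # Collapse repeated whitespace

# Made-up sample message containing a link and punctuation
sample = "زوروا موقعنا: https://example.com/offer الآن!"
cleaned = strip_links_and_punct(sample)
print(cleaned)
```

Because punctuation is replaced with a space rather than deleted, words separated only by punctuation stay separate, and the final whitespace pass removes the extra spaces this creates.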

In [31]:
df1["Message Content"] = df1["Message Content"].apply(preprocess_text)

In [32]:
df1.to_csv("Data_cleaned_v1.csv", index=False)


### Version 2

**Apply (on top of Version 1):**
1. Remove Digits
2. Remove stop words chosen through frequency analysis, so that the remaining text emphasizes meaningful, distinctive terms.

In [33]:
def preprocess_text2(text):
    
    # Remove digits; \d also matches Arabic-Indic digits (٠-٩), not only 0-9
    text = re.sub(r'\d+', '', text)
    return text
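
In Python 3, `\d` in a `str` pattern matches any Unicode decimal digit, so Arabic-Indic digits are removed along with ASCII ones. A small sketch with a made-up sample string (not from the dataset); the whitespace cleanup at the end is an illustrative addition, not part of `preprocess_text2`:

```python
import re

# Made-up sample mixing Arabic-Indic digits and ASCII digits
sample = "رصيدك ١٢٣ ريال ورقم الطلب 456"

no_digits = re.sub(r'\d+', '', sample)              # Removes both digit styles
no_digits = re.sub(r'\s+', ' ', no_digits).strip()  # Tidy the gaps left behind
print(no_digits)
```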

In [34]:
df2 = df1.copy()  # Copy so that Version 2 edits do not modify df1

In [35]:
df2["Message Content"] = df2["Message Content"].apply(preprocess_text2)

In [36]:
df2["Message Content"][1]

'عميلنا العزيز تم اكتشاف عطل فني علي هاتفكم رقم وتمت معالجته ونعتذر عن اي خلل قد تسبب فيه سابقا'

1. **Combine All Text**

    Combine all messages into a single string for analysis.

In [37]:
all_text = " ".join(df2["Message Content"])

2. **Tokenize Words**

    Split the combined text into individual words for frequency analysis.

tokens = all_text.split()

3. **Count Word Frequencies**

    Use Python's `collections.Counter` to calculate word frequencies.

In [None]:
from collections import Counter

word_counts = Counter(tokens)
most_common_words = word_counts.most_common(20)  # Top 20 most frequent words

4. **Visualize Word Frequencies**

    Use Plotly to create a bar chart of the most frequent words.

In [40]:
import plotly.graph_objects as go

# Assuming `most_common_words` is a list of tuples (word, count)
words, counts = zip(*most_common_words)  # Extract words and their counts

# Create a bar chart using Plotly
fig = go.Figure(data=[
    go.Bar(x=words, y=counts, marker_color='lightskyblue')
])

# Customize layout
fig.update_layout(
    title='Most Frequent Words in Messages',
    xaxis_title='Words',
    yaxis_title='Frequency',
    template='plotly_white',
    xaxis=dict(tickangle=-45),  # Rotate x-axis labels
    height=500,
    width=800
)

# Display the figure
fig.show()

5. **Define Stop Words**

    Manually select the stop words to remove from among the most frequent words.

In [41]:
stop_words = {"علي", "من", "في", "الي", "تم", "عن", "الان", "مع"}

6. **Remove Stop Words from Messages**

    Filter out stop words from each message.

In [42]:
# Remove stop words from messages
df2["Message Content"] = df2["Message Content"].apply(lambda x: " ".join(
    [word for word in x.split() if word not in stop_words]
))
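
The same filter can be tried outside the DataFrame. This sketch wraps it in a hypothetical helper, `remove_stop_words`, and applies it to the sample message displayed above using the stop-word set from step 5:

```python
# Stop-word set as defined in step 5
stop_words = {"علي", "من", "في", "الي", "تم", "عن", "الان", "مع"}

def remove_stop_words(text, stop_words):
    # Hypothetical helper: keep only whitespace-separated tokens not in stop_words
    return " ".join(word for word in text.split() if word not in stop_words)

message = ("عميلنا العزيز تم اكتشاف عطل فني علي هاتفكم رقم "
           "وتمت معالجته ونعتذر عن اي خلل قد تسبب فيه سابقا")
filtered = remove_stop_words(message, stop_words)
print(filtered)
```

Matching is on whole tokens only, so a word such as "فيه" is kept even though "في" is in the set.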

In [43]:
df2["Message Content"][1]

'عميلنا العزيز اكتشاف عطل فني هاتفكم رقم وتمت معالجته ونعتذر اي خلل قد تسبب فيه سابقا'

In [27]:
df2.to_csv("Data_cleaned_v2.csv", index=False)

### Version 3

Remove English (Latin-script) words from messages that are primarily Arabic; other messages are left unchanged.

In [64]:
df3 = df2.copy()  # Copy so that Version 3 edits do not modify df2

In [66]:
def is_arabic_text(text):
    """
    Checks whether the text is primarily Arabic, i.e. more than half of its
    characters (whitespace included) fall in the Arabic Unicode block.
    """
    if not text:
        return False
    arabic_chars = re.findall(r'[\u0600-\u06FF]', text)
    return len(arabic_chars) / len(text) > 0.5

def remove_english_words(text):
    """
    Removes English words from Arabic text.
    """
    # If the text is primarily Arabic, remove English words
    if is_arabic_text(text):
        text = re.sub(r'\b[A-Za-z]+\b', '', text)  # Remove English words
        text = re.sub(r'\s+', ' ', text).strip()   # Remove extra spaces
    return text
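
To illustrate the 50% threshold, the sketch below restates the two functions compactly so that it runs on its own, then applies them to two made-up messages (not from the dataset): one mostly Arabic, one mostly English.

```python
import re

def is_arabic_text(text):
    # Share of characters in the Arabic block; spaces count toward the total
    if not text:
        return False
    return len(re.findall(r'[\u0600-\u06FF]', text)) / len(text) > 0.5

def remove_english_words(text):
    # Strip Latin-script words only when the message is primarily Arabic
    if is_arabic_text(text):
        text = re.sub(r'\b[A-Za-z]+\b', '', text)
        text = re.sub(r'\s+', ' ', text).strip()
    return text

mostly_arabic = "تم تفعيل خدمة WiFi بنجاح"    # 16 of 24 characters are Arabic
mostly_english = "Your order is ready شكرا"  # 4 of 24 characters are Arabic

print(remove_english_words(mostly_arabic))   # English word removed
print(remove_english_words(mostly_english))  # Left unchanged
```

Since spaces count toward the total length, a message needs more than half of all its characters, not just its letters, to be Arabic before any English words are removed.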

In [67]:
df3["Message Content"] = df3["Message Content"].apply(remove_english_words)

In [71]:
df3 = df3.drop_duplicates()

In [73]:
df3.to_csv("Data_cleaned_v3.csv", index=False)