# **Arabic Text Processing**
---
## **This notebook gives an overview of several stages of Arabic text processing:**
*   Diacritization
*   Morphological Analysis (Stemming)
*   Dialect Handling
*   Language Detection and Abbreviation Handling
*   Multilingual Text Handling
------------------------------------------------------------------------

# 1. **Diacritization:**
>Adding diacritics to Arabic text.
>
>Diacritization restores the diacritics (short-vowel marks, shadda, sukun) that everyday written Arabic usually omits.

In [27]:
# !pip install farasapy



In [28]:
# farasapy spells this module "diacratizer"; the import below is correct as written.
from farasa.diacratizer import FarasaDiacritizer

# Text without diacritics
text = "كتب الولد الدرس"

# interactive=True keeps one Farasa (Java) process alive across calls
diacritizer = FarasaDiacritizer(interactive=True)

# Diacritize text
diacritized_text = diacritizer.diacritize(text)
print("Diacritized Text:", diacritized_text)

# Shut down the background process when finished
diacritizer.terminate()







Diacritized Text: كُتُبَ الوَلَدُ الدَّرْسَ
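
The marks added in the output above are Unicode combining characters that sit on top of the base letters. As a minimal sketch of that idea (not part of Farasa), the inverse operation, stripping diacritics, can be written with the standard library alone. The function name `strip_diacritics` is a hypothetical helper introduced here for illustration.

```python
import unicodedata


def strip_diacritics(text):
    """Remove combining marks (fatha, damma, kasra, shadda, sukun, tanwin, ...)."""
    # Arabic diacritics have a non-zero canonical combining class;
    # base letters, digits, and spaces have class 0 and are kept.
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)


# "kataba" written with a fatha (U+064E) after each of the three letters
diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
print("With diacritics:   ", diacritized)
print("Without diacritics:", strip_diacritics(diacritized))
```

Stripping is a common normalization step before search or comparison, since the same word may appear with or without diacritics.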


# 2. **Morphological Analysis**
>This process analyzes the structure of words to extract lemmas and stems.

## **Stemming & Lemmatization**
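>Stemming strips affixes to reduce a word to a stem or root, while lemmatization maps a word to its dictionary form (lemma).

The cell below demonstrates stemming only, using NLTK's ISRI stemmer, a root-extraction stemmer for Arabic: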

In [12]:
from nltk.stem.isri import ISRIStemmer

stemmer = ISRIStemmer()
word = "سيشاهدون"
root = stemmer.stem(word)
print("Stem:", root)

Stem: شهد
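
ISRI extracts a root, which can be aggressive for some applications. For contrast, here is a minimal sketch of *light* stemming, which only removes a few common prefixes and suffixes. The affix lists and the `light_stem` helper are illustrative assumptions, far smaller than what a real light stemmer uses.

```python
# Illustrative affix lists (assumed for this sketch, not exhaustive)
PREFIXES = ("وال", "بال", "كال", "فال", "لل", "ال")
SUFFIXES = ("ون", "ين", "ات", "ان", "ها", "ية", "ة", "ه", "ي")


def light_stem(word, min_len=3):
    """Strip at most one prefix and one suffix, keeping at least min_len letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


for w in ["المعلمون", "والكتاب", "كتب"]:
    print(w, "->", light_stem(w))
```

Unlike root extraction, this keeps the word pattern intact, so related words with different patterns are not conflated.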


# 3. Dialect Handling
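> Arabic has many regional dialects (for example Egyptian, Levantine, and Gulf) that differ from Modern Standard Arabic. To handle dialectal variation, use models fine-tuned for dialect identification, such as AraBERT-based classifiers hosted on the Hugging Face Hub, or other dialect-specific tools.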

In [31]:
# from transformers import AutoTokenizer, AutoModelForSequenceClassification

# # Load a model for dialect identification
# # (verify that this checkpoint name exists on the Hub before running)
# model_name = "aubmindlab/bert-base-arabertv2-dialect"
# tokenizer = AutoTokenizer.from_pretrained(model_name)
# model = AutoModelForSequenceClassification.from_pretrained(model_name)

# # Sample dialectal text
# dialect_text = "شو بتعمل اليوم؟"

# # Tokenize and predict dialect
# inputs = tokenizer(dialect_text, return_tensors="pt")
# outputs = model(**inputs)
# print("Dialect Prediction:", outputs)
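
A sequence-classification model returns raw logits rather than a dialect name. The sketch below shows, in plain Python, how logits could be turned into a label and a probability. The label list and the logit values are made up for illustration; a real model defines its own labels in its configuration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, labels):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


# Hypothetical labels and logits, for illustration only
labels = ["Egyptian", "Levantine", "Gulf"]
logits = [0.1, 2.0, -1.0]
label, prob = top_label(logits, labels)
print(f"Predicted dialect: {label} (p={prob:.2f})")
```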

# 4. Handling Abbreviations & Multilingual Text
>Abbreviations and multilingual content often appear in user data. We can detect the language with langdetect and expand abbreviations with a lookup table; a translation service such as Google Translate can be applied afterwards if needed.

## Language Detection

In [19]:
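# !pip install langdetect
from langdetect import DetectorFactory, detect

# langdetect is non-deterministic by default; fix the seed for repeatable results
DetectorFactory.seed = 0

text = "مرحبا"
detected_language = detect(text)
print("Detected Language:", detected_language)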

Detected Language: ar
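
Statistical detectors can be unreliable on very short or mixed-language strings. A simple complement, sketched below under that assumption, is to label each token by its Unicode script. The helper names `script_of` and `split_by_script` are hypothetical, and only the basic Arabic block (U+0600 to U+06FF) is checked.

```python
def script_of(token):
    """Return 'arabic', 'latin', or 'other' for a single token."""
    arabic = sum(1 for ch in token if "\u0600" <= ch <= "\u06FF")
    latin = sum(1 for ch in token if ch.isascii() and ch.isalpha())
    if arabic == 0 and latin == 0:
        return "other"
    return "arabic" if arabic >= latin else "latin"


def split_by_script(text):
    """Pair every whitespace-separated token with its script label."""
    return [(tok, script_of(tok)) for tok in text.split()]


for token, script in split_by_script("مرحبا Bonjour Hello 123"):
    print(token, "->", script)
```

This does not identify the language (French and English are both Latin script), but it is enough to route Arabic and non-Arabic tokens to different processing steps.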


## Handling Abbreviations

In [16]:
# Example function to replace abbreviations
abbreviations = {
    "brb": "be right back",
    "lol": "laughing out loud"
}

def expand_abbreviation(text):
    # Look up each whitespace-separated token case-insensitively
    words = text.split()
    expanded = [abbreviations.get(w.lower(), w) for w in words]
    return " ".join(expanded)

text = "brb I will be back soon lol"
print("Expanded Text:", expand_abbreviation(text))


Expanded Text: be right back I will be back soon laughing out loud
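
Splitting on whitespace misses tokens with attached punctuation, such as `lol!`. A possible variant, sketched here with the standard `re` module, matches whole words and leaves punctuation in place. The name `expand_abbreviations_regex` and the local dictionary are introduced only for this example.

```python
import re

# Same kind of lookup table as above, repeated so the sketch is self-contained
ABBREVIATIONS = {
    "brb": "be right back",
    "lol": "laughing out loud",
}


def expand_abbreviations_regex(text, table=ABBREVIATIONS):
    """Expand known abbreviations, ignoring case and keeping punctuation."""
    def replace(match):
        word = match.group(0)
        return table.get(word.lower(), word)

    # \b\w+\b matches each word without the surrounding punctuation
    return re.sub(r"\b\w+\b", replace, text)


print(expand_abbreviations_regex("BRB, see you. lol!"))
```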


# 5. Multilingual Text Handling with Hugging Face
>Handle text in different languages with a multilingual model such as XLM-RoBERTa (XLM-R), which uses one shared vocabulary and encoder for about 100 languages.

In [17]:
from transformers import XLMRobertaTokenizer, XLMRobertaModel

# Load multilingual model
tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaModel.from_pretrained("xlm-roberta-base")

# Multilingual sentence
sentence = "مرحبا, Bonjour, Hello!"

# Tokenize and process
inputs = tokenizer(sentence, return_tensors="pt")
outputs = model(**inputs)
print("Processed Multilingual Text:", outputs)


Processed Multilingual Text: BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 8.6125e-01,  5.2782e-01,  2.5359e-01,  ..., -6.2637e-01,
           2.2165e-01,  3.6150e-02],
         [ 1.0961e-01, -3.7961e-02, -5.9908e-02,  ..., -6.0765e-02,
           1.0051e-01,  3.6059e-01],
         [ 1.3396e-01,  8.0575e-02,  5.4852e-04,  ..., -2.5156e-01,
           3.1475e-02,  4.3080e-01],
         ...,
         [ 4.3944e-02,  1.6940e-01, -2.9844e-02,  ...,  4.9374e-03,
          -2.3512e-02,  1.7850e-01],
         [ 5.8563e-02,  1.2944e-01, -3.9060e-03,  ..., -1.7369e-01,
           8.1362e-02,  2.7882e-01],
         [ 1.0449e+00,  6.0891e-01, -1.5704e-01,  ..., -1.5381e+00,
          -3.3758e-01,  4.0193e-01]]], grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[ 1.5855e-02, -1.1286e-01, -2.2694e-01,  7.3860e-01,  2.9003e-01,
          5.0825e-01,  2.3286e-01,  1.7255e-01,  4.8712e-01,  1.4403e-02,
         -5.7251e-01,  4.8023e-01, -8.1987e-02,  5.6057e-01, -7
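
The `last_hidden_state` printed above holds one vector per token. One common way to obtain a single sentence vector is to average the token vectors while skipping padding positions. The sketch below illustrates that idea on plain Python lists with made-up numbers; it is an assumption about how the output might be used, not something the cell above does.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens (mask == 1), ignoring padding (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three token vectors of size 2, the last one is padding
vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(vectors, mask))
```

With real model output, the same averaging would be applied to the tensor along the token dimension.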

# Conclusion
## With these tools, you can:

* Add diacritics to Arabic text.
* Reduce words to their roots by stemming.
* Handle dialects and multilingual text.
* Detect the language of a text and expand abbreviations.