
# **1. Install Required Libraries**
This section installs the Python libraries used for Arabic and English text processing:

- **nltk**: English NLP utilities such as stemming and lemmatization.
- **spacy**: General-purpose NLP pipelines (tokenization, tagging, parsing).
- **camel-tools**: Tailored for Arabic NLP tasks.
- **pyarabic**: Utilities for handling Arabic text.
- **textblob**: Sentiment analysis and text preprocessing.
- **transformers**: Loading and running transformer-based language models such as MARBERT.
- **farasapy**: Python wrapper for the Farasa Arabic toolkit (segmentation, stemming, diacritization), installed in a separate cell.


In [None]:
!pip install nltk spacy camel-tools pyarabic textblob transformers



In [None]:
!pip install farasapy
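
Before moving on, it can help to confirm that the installs succeeded. The following is a minimal sketch that uses only the standard library. The helper name `missing_modules` is hypothetical, and the import names (for example `camel_tools` for camel-tools and `farasa` for farasapy) are assumptions about how the installed distributions expose themselves.

```python
from importlib.util import find_spec

# Import names assumed for the packages installed above
REQUIRED_MODULES = ["nltk", "spacy", "camel_tools", "pyarabic",
                    "textblob", "transformers", "farasa"]

def missing_modules(module_names):
    """Return the top-level modules that cannot be located on this system."""
    return [name for name in module_names if find_spec(name) is None]

missing = missing_modules(REQUIRED_MODULES)
print("Missing modules:", missing if missing else "none")
```

If the list is not empty, rerunning the matching install cell is usually the first thing to try.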




# **2. Import Libraries**
This section includes the necessary imports for tasks such as:

- **Arabic-specific processing**: Includes morphological analysis, diacritization, and dialect classification.
- **English-specific processing**: Includes stemming, lemmatization, and text normalization.

Each library targets a specific linguistic challenge of its language.


In [None]:
# Arabic-specific processing
# Note: farasapy spells this module name "diacratizer"
from farasa.diacratizer import FarasaDiacritizer
from farasa.segmenter import FarasaSegmenter
from farasa.stemmer import FarasaStemmer

# English-specific processing
import nltk
nltk.download('wordnet')
from nltk.stem import PorterStemmer, WordNetLemmatizer
from textblob import TextBlob
import re

# Multilingual & Advanced processing
from transformers import pipeline



[nltk_data] Downloading package wordnet to /root/nltk_data...



# **3. Arabic-Specific Text Processing**
This section focuses on tasks unique to processing Arabic text, such as:

- **Diacritization**: Adding diacritics (vowel markers) to Arabic text for better readability and pronunciation.
- **Morphological Analysis**: Splitting Arabic words into their segments (prefixes, stems, suffixes) and reducing them to stems.
- **Dialect Handling**: Classifying the dialect of Arabic text with a transformer-based model (MARBERTv2).

These tasks address Arabic's rich morphology and dialectal variation, both of which are critical for accurate text processing.


**Diacritization**

In [None]:
# Input text (largely undiacritized)
arabic_text = "مرحبًا كيف حالك"

diacritizer = FarasaDiacritizer(interactive=True)

# Diacritize text
diacritized_text = diacritizer.diacritize(arabic_text)
print(f"Original: {arabic_text}\nDiacritized: {diacritized_text}")



Original: مرحبًا كيف حالك
Diacritized: مرحبًا كَيْفَ حالُكَ
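
To see what the diacritizer added, one option is to strip the marks again and compare the result with the input. The sketch below is not part of Farasa. `strip_diacritics` and `count_diacritics` are hypothetical helpers, and the character range is an assumption that covers only the eight core marks (U+064B to U+0652).

```python
import re

# Arabic short-vowel and related marks (tashkeel), U+064B through U+0652
TASHKEEL = re.compile("[\u064B-\u0652]")

def strip_diacritics(text):
    """Remove tashkeel marks, leaving the base letters untouched."""
    return TASHKEEL.sub("", text)

def count_diacritics(text):
    """Count how many tashkeel marks a string contains."""
    return len(TASHKEEL.findall(text))

# The word "kayfa" written with its three marks, using escapes for clarity
diacritized = "\u0643\u064E\u064A\u0652\u0641\u064E"
print(strip_diacritics(diacritized))  # base letters only
print(count_diacritics(diacritized))  # 3
```

Stripping is lossy by design, so it suits comparison and normalization rather than display.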


**Morphological Analysis**

In [None]:
arabic_text = "مرحبًا كيف حالك"

# Initialize segmenter and stemmer
segmenter = FarasaSegmenter(interactive=True)
stemmer = FarasaStemmer(interactive=True)

# Segment the text
segmented_text = segmenter.segment(arabic_text)

# Stem the segmented text (the stemmer can also be given the raw arabic_text)
stemmed_text = stemmer.stem(segmented_text)

# Print the results
print(f"Original Text: {arabic_text}")
print(f"Segmented Text: {segmented_text}")
print(f"Stemmed Text: {stemmed_text}")



Original Text: مرحبًا كيف حالك
Segmented Text: مرحب+ا كيف حال+ك
Stemmed Text: مرحب  ا كيف حال  ك
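
The segmenter output above marks morpheme boundaries with `+` and word boundaries with spaces. A minimal sketch of turning that string into a nested list follows. `split_morphemes` is a hypothetical helper, and the `+` separator is assumed from the output shown rather than taken from Farasa's documentation.

```python
def split_morphemes(segmented_text, separator="+"):
    """Turn 'w1a+w1b w2' into [['w1a', 'w1b'], ['w2']]."""
    return [word.split(separator) for word in segmented_text.split()]

segmented = "مرحب+ا كيف حال+ك"
for morphemes in split_morphemes(segmented):
    print(morphemes)
```

Keeping morphemes grouped per word makes it easy to pick out, say, the first segment of each word later on.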


**Dialect Handling**

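This cell loads MARBERTv2 through a text-classification pipeline. As the warning in the output notes, the checkpoint has no trained classification head, so the predicted label and score are not meaningful until the model is fine-tuned on a dialect-identification dataset.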

In [None]:
dialect_pipeline = pipeline("text-classification", model="UBC-NLP/MARBERTv2")
dialect_result = dialect_pipeline(arabic_text)
print(f"Dialect Analysis: {dialect_result}")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at UBC-NLP/MARBERTv2 and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Dialect Analysis: [{'label': 'LABEL_1', 'score': 0.5825506448745728}]
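
A score of about 0.58 is close to what random guessing would give if the untrained head has two labels. The sketch below shows one hedged way to flag such predictions. `is_above_chance` and its `margin` threshold are hypothetical, and the two-label count is an assumption based on the `LABEL_1` output rather than on the model card.

```python
def is_above_chance(predictions, num_labels, margin=0.2):
    """Return True if the top score beats uniform guessing by at least `margin`."""
    top = max(predictions, key=lambda p: p["score"])
    chance = 1.0 / num_labels  # score expected from a uniform guess
    return top["score"] - chance >= margin

# Pipeline-style output copied from the cell above
result = [{"label": "LABEL_1", "score": 0.5825506448745728}]
print(is_above_chance(result, num_labels=2))  # False
```

A check like this does not make the prediction trustworthy. It only helps avoid reading meaning into a near-uniform score.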



# **4. English-Specific Text Processing**
This section handles tasks specific to English text, including:

- **Stemming and Lemmatization**: Reducing words to their base or root forms for normalization.
- **Handling Abbreviations**: Expanding common abbreviations and contractions so that later steps see full words.

These techniques standardize English text, making it easier to analyze while preserving meaning.


**Stemming and Lemmatization**

In [None]:
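# Initialize stemmer and lemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

text = "running runners are running quickly"
# Stemming strips suffixes by rule, so results need not be dictionary words
stems = [stemmer.stem(word) for word in text.split()]
# lemmatize() treats every word as a noun unless pos is given (e.g. pos='v'),
# which is why "running" is left unchanged
lemmas = [lemmatizer.lemmatize(word) for word in text.split()]
print(f"Stems: {stems}\nLemmas: {lemmas}")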

Stems: ['run', 'runner', 'are', 'run', 'quickli']
Lemmas: ['running', 'runner', 'are', 'running', 'quickly']
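
The stem `quickli` above looks odd, but it follows from how rule-based stemming works. Below is a simplified sketch of a single rule from Porter's algorithm (step 1c in the original description), which turns a final `y` into `i` when the rest of the word contains a vowel. The function name is hypothetical, the vowel test is deliberately simplified, and NLTK's implementation differs in its details.

```python
VOWELS = set("aeiou")

def porter_step_1c(word):
    """Replace a final 'y' with 'i' when the rest of the word contains a vowel."""
    stem = word[:-1]
    if word.endswith("y") and any(ch in VOWELS for ch in stem):
        return stem + "i"
    return word

for w in ["quickly", "happy", "sky"]:
    print(w, "->", porter_step_1c(w))
```

Because the rule only rewrites suffixes, the result is a normalized key for matching rather than a readable word.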


**Handling Abbreviations**

In [None]:
# Maps abbreviations and contractions to their expanded forms
abbreviations = {"I'm": "I am", "etc.": "et cetera", "can't": "cannot"}

def expand_abbreviations(text, abbreviations):
    # Split on whitespace and replace tokens that exactly match a key
    words = text.split()
    return ' '.join(abbreviations.get(word, word) for word in words)

text_with_abbrev = "I'm running fast and people can't cope with me, etc."
expanded_text = expand_abbreviations(text_with_abbrev, abbreviations)
print(f"Original: {text_with_abbrev}\nExpanded: {expanded_text}")

Original: I'm running fast and people can't cope with me, etc.
Expanded: I am running fast and people cannot cope with me, et cetera
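
Whitespace splitting only matches a token when it equals a key exactly, so `can't,` with a trailing comma or a lowercase `i'm` would be missed. One possible alternative is a regex-based variant, sketched here under stated assumptions: `expand_abbreviations_regex` is a hypothetical name, keys are stored in lowercase, and only straight apostrophes are handled.

```python
import re

# Lowercase keys so lookups can ignore the case of the matched text
ABBREVIATIONS = {"i'm": "I am", "etc.": "et cetera", "can't": "cannot"}

def expand_abbreviations_regex(text, abbreviations=ABBREVIATIONS):
    """Expand known abbreviations even when punctuation is attached."""
    # Longest keys first so overlapping keys resolve predictably
    keys = sorted(abbreviations, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)",
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: abbreviations[m.group(1).lower()], text)

print(expand_abbreviations_regex("People can't cope, I'm told, etc."))
```

The lookarounds keep the pattern from firing inside longer words, while still allowing commas or sentence ends right after a match.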



# **5. Advanced Text Handling**
This section covers two more involved tasks:

- **Multilingual Processing**: Translating English to Arabic with a transformer model (Helsinki-NLP/opus-mt-en-ar).
- **Sentiment Analysis Preprocessing**: Cleaning text before sentiment analysis by removing noise such as mentions, hashtag symbols, and URLs.

These tasks show how pretrained NLP models can be applied across languages and tasks.


**Multilingual Processing**

In [None]:
!pip install sacremoses
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ar")
multilingual_text = "Hello, how are you?"
translated_text = translator(multilingual_text)
print(f"Translated to Arabic: {translated_text}")

Translated to Arabic: [{'translation_text': 'مرحباً، كيف حالك؟'}]


**Sentiment Analysis Preprocessing**

In [None]:
def clean_text(text):
    text = re.sub(r"@[A-Za-z0-9]+", "", text)  # Remove @mentions (letters and digits only)
    text = re.sub(r"#", "", text)              # Remove hashtag symbol, keep the tag word
    text = re.sub(r"http\S+", "", text)        # Remove URLs
    text = re.sub(r"[^a-zA-Z\s]", "", text)    # Keep only ASCII letters and whitespace (drops digits, punctuation, emoji)
    return text.strip().lower()

# Define a mapping for the model's output labels
label_mapping = {0: 'NEGATIVE', 1: 'NEUTRAL', 2: 'POSITIVE'}

sentiment_analyzer = pipeline("sentiment-analysis", model="cardiffnlp/twitter-roberta-base-sentiment")
sample_text = "I'm so happy with this product! 😊"
cleaned_text = clean_text(sample_text)
sentiment = sentiment_analyzer(cleaned_text)

# Map the label to a human-readable sentiment
sentiment_label = label_mapping[int(sentiment[0]['label'].split('_')[1])]

print(f"Sentiment: {sentiment_label} (Confidence: {sentiment[0]['score']})")

Sentiment: POSITIVE (Confidence: 0.9902803301811218)
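
The `clean_text` function above leaves part of a mention behind when the handle contains an underscore, and it can leave runs of spaces where a URL was removed. A hedged variant that addresses both is sketched here. `clean_tweet` is a hypothetical name, and the choice to remove URLs before mentions is an assumption, not something the original prescribes.

```python
import re

def clean_tweet(text):
    """Strip URLs, mentions, hashtag symbols, and non-letters, then normalize spacing."""
    text = re.sub(r"http\S+", " ", text)     # URLs first, so their contents are not re-parsed
    text = re.sub(r"@\w+", " ", text)        # mentions, including underscores
    text = text.replace("#", "")             # keep the hashtag word itself
    text = re.sub(r"[^a-zA-Z\s]", "", text)  # keep ASCII letters and whitespace only
    text = re.sub(r"\s+", " ", text)         # collapse repeated whitespace
    return text.strip().lower()

print(clean_tweet("@user_1 loving the #NLP course https://t.co/abc 100%!"))
print(clean_tweet("I'm so happy with this product! 😊"))
```

Note that dropping emoji and punctuation also removes signals a sentiment model may rely on, so how aggressively to clean depends on the model being used.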
