# **Language-Specific & Advanced Text Processing Techniques**

### Objective:
In this notebook, we'll implement language-specific and advanced text preprocessing techniques.
We'll focus on:
- **Arabic-specific text processing**: Normalization and diacritic removal.
- **English-specific text processing**: Lemmatization and contraction expansion.
- **Advanced processing**: Language detection on mixed-language text and sentiment analysis.


## **1. Setup**

In [1]:
import spacy
from pyarabic.araby import strip_diacritics, strip_tatweel, normalize_ligature
from langdetect import detect
import re
from transformers import pipeline
import warnings
warnings.filterwarnings('ignore')

# pyarabic.araby does not export normalize_arabic, so define a small stand-in
# from helpers it does provide: drop tatweel (kashida) and expand lam-alef ligatures.
def normalize_arabic(text):
    return normalize_ligature(strip_tatweel(text))

## **2. Load and Explore Data**

In [None]:
# Sample Arabic and English text for preprocessing
arabic_text = "الْكِتَابُ جَمِيلٌ وَمُفِيدٌ."
english_text = "I'm learning NLP, and it's amazing!"
multilingual_text = "I love البرمجة because it’s creative!"


## **3. Arabic-Specific Text Processing**

In [None]:
# Normalize Arabic text
normalized_arabic = normalize_arabic(arabic_text)
# Remove diacritics
undiacritized_arabic = strip_diacritics(normalized_arabic)
print("Original Arabic Text:", arabic_text)
print("Normalized Arabic Text:", normalized_arabic)
print("Undiacritized Arabic Text:", undiacritized_arabic)
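Normalization for Arabic often goes further than removing diacritics and folds letter variants that writers use interchangeably. The sketch below illustrates that idea using only the standard library. The function name `fold_arabic` and the specific folding rules are assumptions chosen for illustration, not part of pyarabic. Folding teh marbuta and alef maksura is lossy, so it may not suit every task.

```python
import re

# Hypothetical helper, not a pyarabic API: a few common orthographic folding rules.
DIACRITICS = re.compile("[\u064B-\u0652]")            # fathatan through sukun
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")    # alef with madda / hamza above / hamza below
TATWEEL = "\u0640"                                    # elongation character (kashida)


def fold_arabic(text: str) -> str:
    """Strip diacritics and tatweel, then fold common letter variants."""
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = ALEF_VARIANTS.sub("\u0627", text)   # any alef variant -> bare alef
    text = text.replace("\u0649", "\u064A")    # alef maksura -> yeh
    text = text.replace("\u0629", "\u0647")    # teh marbuta -> heh
    return text


if __name__ == "__main__":
    print(fold_arabic("الْكِتَابُ جَمِيلٌ وَمُفِيدٌ."))
```

Whether to apply the lossy rules depends on the downstream task: they tend to help search and matching, but they can merge distinct words.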

## **4. English-Specific Text Processing**

In [None]:
# Load the spaCy English model
# (install it first with: python -m spacy download en_core_web_sm)
nlp = spacy.load('en_core_web_sm')

# Lemmatize English text
doc = nlp(english_text)
lemmatized_text = " ".join([token.lemma_ for token in doc])
print("Original English Text:", english_text)
print("Lemmatized Text:", lemmatized_text)

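# Expand contractions such as "I'm" (simple regex-based example)
contraction_mapping = {"I'm": "I am", "it's": "it is"}
# Build the pattern from the mapping keys so the regex and the dictionary stay in sync
contraction_pattern = r"\b(?:" + "|".join(map(re.escape, contraction_mapping)) + r")\b"
expanded_text = re.sub(contraction_pattern, lambda match: contraction_mapping[match.group(0)], english_text)
print("Expanded Contractions:", expanded_text)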
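The regex example above matches only the exact spellings listed in its mapping. A slightly more tolerant variant might ignore case and accept the typographic apostrophe (U+2019), as sketched below. The names `CONTRACTIONS` and `expand_contractions` and the extra dictionary entries are illustrative assumptions. Note that "it's" can also mean "it has", which a plain lookup table cannot disambiguate.

```python
import re

# Hypothetical lookup table; keys are lowercase with a straight apostrophe.
CONTRACTIONS = {
    "i'm": "I am",
    "it's": "it is",
    "don't": "do not",
    "can't": "cannot",
}

# Build one case-insensitive pattern from the table keys.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Expand known contractions, keeping a leading capital letter."""
    text = text.replace("\u2019", "'")  # normalize the curly apostrophe first

    def _replace(match: re.Match) -> str:
        found = match.group(0)
        expansion = CONTRACTIONS[found.lower()]
        if found[0].isupper():
            expansion = expansion[0].upper() + expansion[1:]
        return expansion

    return _PATTERN.sub(_replace, text)


if __name__ == "__main__":
    print(expand_contractions("I'm learning NLP, and it's amazing!"))
```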

## **5. Advanced Text Processing**

In [None]:
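# Detect the dominant language of the mixed-language text.
# detect() returns one language code for the whole string; seed it for reproducible results.
from langdetect import DetectorFactory
DetectorFactory.seed = 0
detected_language = detect(multilingual_text)
print("Detected Language:", detected_language)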

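# Run sentiment analysis with Hugging Face.
# With no model specified, pipeline() falls back to a default English sentiment model,
# so the score for the Arabic portion of the text should be treated with caution.
sentiment_analyzer = pipeline('sentiment-analysis')
sentiment_result = sentiment_analyzer(multilingual_text)
print("Sentiment Analysis Result:", sentiment_result)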
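Because a language detector labels the whole string at once, one way to handle mixed text is to split it into script runs first and process each run separately. The following is a minimal sketch under the assumption that checking the basic Arabic Unicode block (U+0600 to U+06FF) is enough to separate Arabic from everything else. The function name `split_by_script` and the labels `"ar"` and `"other"` are hypothetical.

```python
import re

ARABIC = "\u0600-\u06FF"  # basic Arabic Unicode block (assumed sufficient here)

# Either a run of Arabic words separated by whitespace, or a run of non-Arabic characters.
_RUNS = re.compile(rf"[{ARABIC}]+(?:\s+[{ARABIC}]+)*|[^{ARABIC}]+")


def split_by_script(text: str) -> list[tuple[str, str]]:
    """Return (label, segment) pairs, where label is 'ar' or 'other'."""
    segments = []
    for match in _RUNS.finditer(text):
        chunk = match.group(0).strip()
        if not chunk:
            continue  # skip runs that are only whitespace
        label = "ar" if re.match(rf"[{ARABIC}]", chunk) else "other"
        segments.append((label, chunk))
    return segments


if __name__ == "__main__":
    for label, segment in split_by_script("I love البرمجة because it's creative!"):
        print(label, "->", segment)
```

Each segment could then be routed to a language-appropriate step, for example diacritic removal for the `"ar"` runs and lemmatization for the rest. Punctuation and digits fall into `"other"` under this simple rule.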

## **6. Conclusion**

### Insights:
- For Arabic, we applied normalization and diacritic removal.
- For English, we performed lemmatization with spaCy and expanded contractions with a regex.
- Advanced processing covered language detection on mixed-language text and sentiment analysis with Hugging Face.

### Challenges:
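- Mixed-language text is hard to handle with whole-string tools: a language detector returns a single label for the entire input, so each language segment may need to be identified and processed separately.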
- Some Arabic preprocessing steps, like morphological analysis, require specialized libraries.