# **Language-Specific & Advanced Text Processing Techniques with NLTK**

### Objective:
In this notebook, we'll use the NLTK library to implement language-specific and advanced text preprocessing techniques.
We'll focus on:
- **Arabic-specific text processing**: Normalization and diacritics handling.
- **English-specific text processing**: Stemming and abbreviation handling.
- **Advanced processing**: Tokenization and stopword removal for multilingual text.

## **1. Setup**

In [33]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
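# Tokenizer models and stopword lists used below.
# Note: NLTK 3.9 and later load the tokenizer from 'punkt_tab'; if
# word_tokenize raises a LookupError, run nltk.download('punkt_tab') as well.
nltk.download('punkt')
nltk.download('stopwords')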

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abdulrahman_1114\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abdulrahman_1114\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## **2. Load and Explore Data**

In [34]:
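# Sample Arabic, English, and mixed-language text for preprocessing.
# Each sample is a single multi-line string, so the quotation marks and commas
# around the sentences are part of the text that gets processed.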
arabic_text = '''
    "التكنولوجيا الحديثة تُسهم في تطوير المجتمع.",
    "اللغة العربية غنية ومليئة بالتحديات.",
    "قِراءةُ الكُتُبِ تُعزّزُ الثّقافةَ والفهمَ.",
    "هَلْ يُمكنُكَ أنْ تُساعدني في هذا المشروع؟",
    "التعليمُ هو المفتاحُ لتحقيقِ الأحلام."
'''

english_text = '''
    "Artificial Intelligence is fascinating and powerful.",
    "He's working on a project related to NLP and data science.",
    "Can you believe how quickly technology evolves?",
    "I'm excited to explore the world of programming and coding.",
    "Learning is a never-ending journey; stay curious!"
'''

multilingual_text = '''
    "I enjoy studying الرياضيات because it's challenging.",
    "Learning البرمجة is fun and rewarding!",
    "القراءة helps to expand your knowledge and vocabulary.",
    "The world of الذكاء الاصطناعي is growing rapidly.",
    "I’m building a مشروع using Python and Arabic text.",
    "التكنولوجيا الحديثة تساعد في تحسين حياتنا اليومية بشكل كبير.",
    "Programming in Python is ممتع وسهل التعلم.",
    "أنا أحب تعلم لغات جديدة مثل الفرنسية والإسبانية alongside Arabic.",
    "The concept of الوقت in physics is fascinating and intriguing.",
    "تعلم المهارات الجديدة يساعدك في تحسين مستقبلك الشخصي والمهني.",
    "AI systems like ChatGPT تساعد في الإجابة عن الأسئلة وتقديم الدعم للمستخدمين."
'''

print("Arabic Text:", arabic_text)
print("English Text:", english_text)
print("Multilingual Text:", multilingual_text)

Arabic Text: 
    "التكنولوجيا الحديثة تُسهم في تطوير المجتمع.",
    "اللغة العربية غنية ومليئة بالتحديات.",
    "قِراءةُ الكُتُبِ تُعزّزُ الثّقافةَ والفهمَ.",
    "هَلْ يُمكنُكَ أنْ تُساعدني في هذا المشروع؟",
    "التعليمُ هو المفتاحُ لتحقيقِ الأحلام."

English Text: 
    "Artificial Intelligence is fascinating and powerful.",
    "He's working on a project related to NLP and data science.",
    "Can you believe how quickly technology evolves?",
    "I'm excited to explore the world of programming and coding.",
    "Learning is a never-ending journey; stay curious!"

Multilingual Text: 
    "I enjoy studying الرياضيات because it's challenging.",
    "Learning البرمجة is fun and rewarding!",
    "القراءة helps to expand your knowledge and vocabulary.",
    "The world of الذكاء الاصطناعي is growing rapidly.",
    "I’m building a مشروع using Python and Arabic text.",
    "التكنولوجيا الحديثة تساعد في تحسين حياتنا اليومية بشكل كبير.",
    "Programming in Python is ممتع وسهل التعلم.",
    

## **3. Arabic-Specific Text Processing**

In [35]:
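# Normalize Arabic text
def normalize_arabic(text):
    # Keep only characters in the Arabic Unicode block (U+0600-U+06FF) and spaces.
    # This also strips newlines, Latin letters, digits, and ASCII punctuation.
    text = re.sub(r'[^\u0600-\u06FF ]', '', text)
    # Remove tanween, short vowels, and sukun (shadda, U+0651, is not in this set)
    text = re.sub(r'[\u064B-\u0650\u0652]', '', text)
    return text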

normalized_arabic = normalize_arabic(arabic_text)
print("Normalized Arabic Text:", normalized_arabic)

Normalized Arabic Text:     التكنولوجيا الحديثة تسهم في تطوير المجتمع    اللغة العربية غنية ومليئة بالتحديات    قراءة الكتب تعزّز الثّقافة والفهم    هل يمكنك أن تساعدني في هذا المشروع؟    التعليم هو المفتاح لتحقيق الأحلام
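
The output above still contains the shadda, keeps the Arabic question mark, and merges all sentences onto one line. A stricter variant could look like the following minimal sketch. The function name `normalize_arabic_strict` and the choice of which marks to drop are assumptions made for illustration; they are not part of NLTK or of the original cell.

```python
import re

# Assumption: we want to drop every mark from fathatan (U+064B) to sukun (U+0652),
# which includes the shadda (U+0651) that the original cell leaves in place.
ARABIC_DIACRITICS = re.compile(r'[\u064B-\u0652]')
# Tatweel (U+0640) is a stretching character that carries no meaning.
TATWEEL = re.compile(r'\u0640')
# Keep Arabic letters (U+0621-U+064A) and whitespace, including newlines.
NON_ARABIC = re.compile(r'[^\u0621-\u064A\s]')


def normalize_arabic_strict(text):
    """Hypothetical stricter variant of normalize_arabic."""
    text = ARABIC_DIACRITICS.sub('', text)
    text = TATWEEL.sub('', text)  # must run before NON_ARABIC, whose range includes U+0640
    text = NON_ARABIC.sub('', text)  # drops Latin letters, digits, and punctuation
    # Collapse runs of spaces and tabs, but keep one sentence per line
    lines = [re.sub(r'[ \t]+', ' ', line).strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)
```

Calling `normalize_arabic_strict(arabic_text)` should return one cleaned sentence per line, which makes the result easier to tokenize sentence by sentence.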


## **4. English-Specific Text Processing**

In [36]:
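# Expand abbreviations, then tokenize, remove stopwords, and stem the English text

# Only "I'm" and "it's" are mapped; other contractions such as "He's" pass through unchanged
abbreviation_mapping = {"I'm": "I am", "it's": "it is"}
expanded_text = re.sub(r"\b(?:I'm|it's)\b", lambda match: abbreviation_mapping[match.group(0)], english_text)
tokens = word_tokenize(expanded_text)
# Drop English stopwords (case-insensitive comparison)
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens if token.lower() not in stop_words]
# Reduce each remaining token to its Porter stem
ps = PorterStemmer()
stemmed_tokens = [ps.stem(token) for token in filtered_tokens]
# Keep only alphanumeric tokens, which removes punctuation such as '``' and '.'
cleaned_tokens = [token for token in stemmed_tokens if token.isalnum()]
processed_text = ' '.join(cleaned_tokens)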


print("Original Text:", english_text)
print("""
      
      """)
print("Expanded Abbreviations:", expanded_text)
print("""
      
      """)
print("Tokenized Text:", tokens)
print("""
      
      """)
print("Filtered Tokens (no stopwords):", filtered_tokens)
print("""
      
      """)
print("Stemmed Tokens:", stemmed_tokens)
print("""
      
      """)
print("Cleaned Tokens (no punctuation):", cleaned_tokens)
print("""
      
      """)
print("Processed Text:", processed_text)



Original Text: 
    "Artificial Intelligence is fascinating and powerful.",
    "He's working on a project related to NLP and data science.",
    "Can you believe how quickly technology evolves?",
    "I'm excited to explore the world of programming and coding.",
    "Learning is a never-ending journey; stay curious!"


      
      
Expanded Abbreviations: 
    "Artificial Intelligence is fascinating and powerful.",
    "He's working on a project related to NLP and data science.",
    "Can you believe how quickly technology evolves?",
    "I am excited to explore the world of programming and coding.",
    "Learning is a never-ending journey; stay curious!"


      
      
Tokenized Text: ['``', 'Artificial', 'Intelligence', 'is', 'fascinating', 'and', 'powerful', '.', '``', ',', '``', 'He', "'s", 'working', 'on', 'a', 'project', 'related', 'to', 'NLP', 'and', 'data', 'science', '.', '``', ',', '``', 'Can', 'you', 'believe', 'how', 'quickly', 'technology', 'evolves', '?', '``', ',', '`
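
The expanded text above still contains "He's", because the mapping covers only two contractions and matches them case-sensitively. The sketch below shows one way this could be generalized. The mapping contents and the name `expand_contractions` are assumptions for illustration, and ambiguous forms (for example "he's" as "he has") are resolved naively.

```python
import re

# Assumption: a small hand-picked mapping; extend it for your own data.
CONTRACTIONS = {
    "i'm": "I am",
    "it's": "it is",
    "he's": "he is",   # could also mean "he has"; this sketch ignores that
    "she's": "she is",
    "can't": "cannot",
    "don't": "do not",
}

# One alternation built from the keys, matched case-insensitively
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(key) for key in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Hypothetical helper: expand the contractions listed in CONTRACTIONS."""
    # Map the typographic apostrophe (U+2019) to ASCII so both spellings match
    text = text.replace("\u2019", "'")

    def _replace(match):
        original = match.group(0)
        expanded = CONTRACTIONS[original.lower()]
        # Preserve a leading capital, e.g. "He's" -> "He is"
        if original[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return _PATTERN.sub(_replace, text)
```

Normalizing the apostrophe first also helps the tokenizer, which otherwise treats the typographic apostrophe as a separate token.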

## **5. Advanced Text Processing**

In [37]:
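# Tokenize multilingual text, then remove English stopwords and stem.
# Arabic stopwords are not removed here, and the Porter stemmer targets English,
# so Arabic tokens are not expected to change.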

tokens_multi = word_tokenize(multilingual_text)
stop_words = set(stopwords.words('english'))
filtered_tokens = [token for token in tokens_multi if token.lower() not in stop_words]
ps = PorterStemmer()
stemmed_tokens = [ps.stem(token) for token in filtered_tokens]
cleaned_tokens = [token for token in stemmed_tokens if token.isalnum()]
processed_text = ' '.join(cleaned_tokens)

print("Original Tokens:", tokens_multi)
print("""
      
      """)
print("Filtered Tokens (no stopwords):", filtered_tokens)
print("""
      
      """)
print("Stemmed Tokens:", stemmed_tokens)
print("""
      
      """)
print("Cleaned Tokens (no punctuation):", cleaned_tokens)
print("""
      
      """)
print("Processed Text:", processed_text)


Original Tokens: ['``', 'I', 'enjoy', 'studying', 'الرياضيات', 'because', 'it', "'s", 'challenging', '.', '``', ',', '``', 'Learning', 'البرمجة', 'is', 'fun', 'and', 'rewarding', '!', '``', ',', '``', 'القراءة', 'helps', 'to', 'expand', 'your', 'knowledge', 'and', 'vocabulary', '.', '``', ',', '``', 'The', 'world', 'of', 'الذكاء', 'الاصطناعي', 'is', 'growing', 'rapidly', '.', '``', ',', '``', 'I', '’', 'm', 'building', 'a', 'مشروع', 'using', 'Python', 'and', 'Arabic', 'text', '.', '``', ',', '``', 'التكنولوجيا', 'الحديثة', 'تساعد', 'في', 'تحسين', 'حياتنا', 'اليومية', 'بشكل', 'كبير', '.', '``', ',', '``', 'Programming', 'in', 'Python', 'is', 'ممتع', 'وسهل', 'التعلم', '.', '``', ',', '``', 'أنا', 'أحب', 'تعلم', 'لغات', 'جديدة', 'مثل', 'الفرنسية', 'والإسبانية', 'alongside', 'Arabic', '.', '``', ',', '``', 'The', 'concept', 'of', 'الوقت', 'in', 'physics', 'is', 'fascinating', 'and', 'intriguing', '.', '``', ',', '``', 'تعلم', 'المهارات', 'الجديدة', 'يساعدك', 'في', 'تحسين', 'مستقبلك', 'الشخ
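
The cell above applies only the English stopword list to a token stream that is partly Arabic. One possible refinement is to detect each token's script and filter it against the matching list, as in the sketch below. The function names and the tiny stopword sets are illustrative assumptions; in the notebook the sets could instead come from `stopwords.words('english')` and `stopwords.words('arabic')`.

```python
import re

# Any character in the Arabic Unicode block marks the token as Arabic
ARABIC_CHARS = re.compile(r'[\u0600-\u06FF]')

# Assumption: tiny illustrative stopword sets, not the full NLTK lists
EN_STOPWORDS = {"i", "is", "a", "and", "of", "the", "to", "in", "it"}
AR_STOPWORDS = {"في", "هو", "هذا", "عن", "مثل", "أنا"}


def detect_script(token):
    """Return 'ar' if the token contains Arabic-block characters, else 'en'."""
    return 'ar' if ARABIC_CHARS.search(token) else 'en'


def filter_mixed_tokens(tokens):
    """Hypothetical helper: drop punctuation and per-language stopwords."""
    kept = []
    for token in tokens:
        # Skip punctuation tokens such as '``' or '.'; note that tokens carrying
        # Arabic diacritics are also non-alphanumeric, so normalize them first
        if not token.isalnum():
            continue
        script = detect_script(token)
        stop_words = AR_STOPWORDS if script == 'ar' else EN_STOPWORDS
        if token.lower() not in stop_words:
            kept.append((token, script))
    return kept
```

Keeping the script label next to each token makes it straightforward to apply the Porter stemmer to the English tokens only.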

## **6. Conclusion**

### Insights:
- We used NLTK for tokenization, stemming, and stopword removal.
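- Arabic text was normalized by stripping characters outside the Arabic Unicode block and removing tanween, short-vowel, and sukun diacritics; the shadda was left in place, as the output shows.
- English preprocessing covered abbreviation expansion (limited to "I'm" and "it's"), stopword removal, and Porter stemming.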

### Challenges:
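- NLTK's Arabic support is limited (an Arabic stopword list and stemmers such as `ISRIStemmer`), so custom functions were used for normalization.
- Handling mixed-language text requires careful tokenization: in the output above, the typographic apostrophe in "I’m" is split into three tokens ('I', '’', 'm'), and only English stopwords are removed.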
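
Part of the noise in the token lists comes from the sample data itself: each sample is one string, so the quotation marks and separating commas are tokenized as '``' and ','. A minimal sketch of one way to avoid this is to pull the quoted sentences out first and process them one at a time. The helper name `extract_sentences` is an assumption for illustration.

```python
import re


def extract_sentences(block):
    """Hypothetical helper: return the double-quoted sentences in a sample string."""
    # Capture everything between each pair of double quotes
    return [sentence.strip() for sentence in re.findall(r'"([^"]*)"', block)]
```

Each returned sentence can then be passed to `word_tokenize` on its own, so the wrapper quotes and commas never reach the tokenizer.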