## Text Preprocessing Tutorial Using NLTK for Arabic and English

### Objective:
#### Learn to preprocess text (Arabic and English) using NLTK. The steps include:
##### Tokenization
##### Stopword removal
##### Noise removal
##### Normalization
##### Stemming

## Step 1: Install and Import NLTK
#### If you haven't installed NLTK, do so first.

In [53]:
# Install NLTK if not already installed
!pip install nltk

# Import required NLTK libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer
from nltk.stem.porter import PorterStemmer
import re





## Step 2: Download NLTK Resources
#### Download the necessary corpora and models.

In [54]:
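# Tokenizer models. Recent NLTK releases look for 'punkt_tab' instead; if
# word_tokenize raises a LookupError, run nltk.download('punkt_tab') as well.
nltk.download('punkt')
nltk.download('stopwords')  # Stopword lists, including Arabic and English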


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abdulrahman_1114\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abdulrahman_1114\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Step 3: Define Preprocessing Steps
## 1. Tokenization
#### Tokenization splits the text into words.

In [55]:
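def tokenize_text(text, language='english'):
    # Punkt appears to ship no Arabic sentence model, so for Arabic skip sentence splitting.
    is_arabic = language == 'arabic'
    return word_tokenize(text, language='english', preserve_line=is_arabic)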
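
To see what tokenization produces without downloading any NLTK data, a rough regex-based tokenizer can stand in for `word_tokenize`. This is a simplified sketch, not NLTK's algorithm, and the name `simple_tokenize` is a hypothetical helper introduced only for illustration.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens.

    Hypothetical helper for illustration; NLTK's word_tokenize applies more
    elaborate rules (for example, for English contractions).
    """
    # \w+ matches runs of letters, digits, and underscores (Unicode-aware in
    # Python 3); [^\w\s] matches any single character that is neither a word
    # character nor whitespace, so each punctuation mark becomes its own token.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("Wow! Loved this place."))
print(simple_tokenize("مرحبا بالعالم!"))
```

Arabic diacritics are combining marks rather than word characters, so this sketch would split them off as separate tokens. That is one reason the pipeline in this tutorial normalizes the text before tokenizing it.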


## 2. Stopword Removal
#### Stopwords are common words that don't contribute much meaning.

In [56]:
def remove_stopwords(words, language='english'):
    stop_words = set(stopwords.words(language))
    return [word for word in words if word.lower() not in stop_words]


## 3. Noise Removal
#### Remove punctuation, numbers, and unnecessary characters.

In [57]:
def remove_noise(text):
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = re.sub(r'\d+', '', text)      # Remove numbers
    return text
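
Two details of these patterns are easy to miss. In Python 3, `\d` also matches Arabic-Indic digits such as ١٢٣, and deleting punctuation outright can fuse neighbouring words. The sketch below repeats `remove_noise` so that it runs on its own and compares it with `remove_noise_spaced`, a hypothetical variant (not part of the tutorial's pipeline) that substitutes a space instead.

```python
import re

def remove_noise(text):
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and symbols
    text = re.sub(r'\d+', '', text)      # Remove digits (any Unicode decimal digit)
    return text

def remove_noise_spaced(text):
    """Hypothetical variant: replace noise with a space, then collapse whitespace."""
    text = re.sub(r'[^\w\s]|\d+', ' ', text)
    return re.sub(r'\s+', ' ', text).strip()

print(repr(remove_noise("well-known, 24/7")))         # 'wellknown '
print(repr(remove_noise_spaced("well-known, 24/7")))  # 'well known'
print(repr(remove_noise("أرقام مثل ١٢٣٤٥!")))
```

Because `\w` includes the underscore, neither version removes `_`. Whether hyphenated words should be split or fused depends on the downstream task.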


## 4. Normalization
#### For Arabic, normalize letters and remove diacritics.

In [58]:
def normalize_text(text):
    # Remove diacritics
    text = re.sub(r'[\u064B-\u065F]', '', text)
    # Normalize Arabic letters
    text = re.sub(r'[إأآا]', 'ا', text)
    text = re.sub(r'ى', 'ي', text)
    text = re.sub(r'ة', 'ه', text)
    text = re.sub(r'[ء]', '', text)
    return text.lower()
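
A quick way to check these rules is to run them on a few words. The sketch below repeats `normalize_text` so that it is self-contained; the sample words are chosen only for illustration.

```python
import re

def normalize_text(text):
    text = re.sub(r'[\u064B-\u065F]', '', text)  # Strip diacritics (tashkeel)
    text = re.sub(r'[إأآا]', 'ا', text)          # Unify alef variants
    text = re.sub(r'ى', 'ي', text)               # Alef maqsura -> ya
    text = re.sub(r'ة', 'ه', text)               # Ta marbuta -> ha
    text = re.sub(r'[ء]', '', text)              # Drop standalone hamza
    return text.lower()                          # Only affects cased scripts such as Latin

for sample in ["مرحباً", "إلى", "جملة", "ضوضاء", "Hello"]:
    print(sample, "->", normalize_text(sample))
```

These rules are lossy by design: mapping ة to ه and dropping ء can make distinct words identical. That is usually acceptable for search or classification, but not when the original spelling matters.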


## 5. Stemming
#### Apply stemming to reduce words to their root forms.

In [59]:
def stem_words(words, language='english'):
    if language == 'arabic':
        stemmer = ISRIStemmer()
    else:
        stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in words]
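
Stemming is rule-based suffix stripping, so it helps to look at what it does and does not change. The sketch below assumes only that the `nltk` package is installed (the Porter stemmer needs no downloaded data); the word list is chosen for illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Regular inflections are stripped; irregular forms such as "ran" are not
# mapped back to "run", because the stemmer has no dictionary of word forms.
for word in ["loved", "running", "runs", "runner", "ran"]:
    print(word, "->", stemmer.stem(word))
```

`ISRIStemmer` is used the same way for Arabic. It strips common prefixes and suffixes and matches word patterns to approximate the root, so its results can likewise differ from a dictionary root.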


## Step 4: Combine All Steps into a Single Function
#### Integrate all the preprocessing steps into one reusable function.

In [60]:
def preprocess_text(text, language='english'):
    # Normalize text
    text = normalize_text(text)
    
    # Remove noise
    text = remove_noise(text)
    
    # Tokenize text
    tokens = tokenize_text(text, language)
    
    # Remove stopwords
    tokens = remove_stopwords(tokens, language)
    
    # Stem words
    tokens = stem_words(tokens, language)
    
    # Combine tokens back into a string
    return ' '.join(tokens)


## Step 5: Apply Preprocessing to Example Texts
#### Example Texts:

In [61]:
english_text = "Wow! Loved this place. Bit far but worth it."
arabic_text = "مرحباً بالعالم! هذه جملة اختبار لمعالجة النصوص. أرقام مثل ١٢٣٤٥ أو رموز مثل $٪& تعتبر ضوضاء."


## Apply Preprocessing:

In [62]:
# Process English Text
processed_english = preprocess_text(english_text, language='english')
print("Processed English Text:", processed_english)




Processed English Text: wow love place bit far worth


## Arabic

In [63]:
# Importing Required Libraries
import re
from nltk.corpus import stopwords
from nltk.stem.isri import ISRIStemmer
import nltk

# Download required resources
nltk.download('stopwords')

# Function to preprocess Arabic text
def preprocess_text_ar(text):
    """
    Preprocess Arabic text by normalizing, removing stopwords, stemming, and cleaning.
    
    Parameters:
    text (str): Input Arabic text
    
    Returns:
    str: Preprocessed Arabic text
    """
    # 1. Normalize Arabic text (remove diacritics and unify letters)
    text = re.sub(r'[\u064B-\u065F]', '', text)  # Remove diacritics
    text = re.sub(r'[إأآا]', 'ا', text)          # Normalize alef
    text = re.sub(r'ة', 'ه', text)              # Normalize ta marbouta
    text = re.sub(r'ى', 'ي', text)              # Normalize ya
    text = re.sub(r'ؤ', 'و', text)              # Normalize waw
    text = re.sub(r'[ء]', '', text)             # Remove hamza
    
    # 2. Convert to lowercase
    text = text.lower()
    
    # 3. Remove punctuation and numbers
    text = re.sub(r'[^\w\s]', '', text)         # Remove punctuation
    text = re.sub(r'\d+', '', text)             # Remove numbers
    
    # 4. Tokenize words
    words = text.split()  # Basic tokenization for Arabic
    
    # 5. Remove stopwords
    if 'arabic' in stopwords.fileids():
        arabic_stopwords = set(stopwords.words('arabic'))
    else:  # Small fallback list in case the corpus has no Arabic file
        arabic_stopwords = {'في', 'من', 'على', 'إلى', 'عن', 'و', 'يا', 'لكن', 'هذا', 'ما'}
    filtered_words = [word for word in words if word not in arabic_stopwords]
    
    # 6. Apply stemming
    stemmer = ISRIStemmer()
    stemmed_words = [stemmer.stem(word) for word in filtered_words]
    
    # 7. Join words back into a string
    processed_text = ' '.join(stemmed_words)
    
    return processed_text


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abdulrahman_1114\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [64]:
# Example Arabic text
arabic_text = "مرحباً بالعالم! هذه جملة اختبار لمعالجة النصوص. أرقام مثل ١٢٣٤٥ أو رموز مثل $٪& تعتبر ضوضاء."

# Preprocessing the Arabic text
processed_text = preprocess_text_ar(arabic_text)

# Display the result
print("Original Text:", arabic_text)
print("Processed Text:", processed_text)


Original Text: مرحباً بالعالم! هذه جملة اختبار لمعالجة النصوص. أرقام مثل ١٢٣٤٥ أو رموز مثل $٪& تعتبر ضوضاء.
Processed Text: رحب علم جمل خبر علج نصص رقم او رمز عبر ضوض
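
The token او in the processed text is the normalized form of the conjunction أو ("or"). One likely explanation is that the text is normalized before stopword removal while the stopword list is not, so a stopword spelled with a hamza no longer matches. The sketch below reproduces the effect with a tiny hand-written stopword set (an assumption for illustration, not NLTK's actual list). It also shows one possible remedy: normalizing the stopwords with the same rules. The helper name `normalize_ar` is hypothetical.

```python
import re

def normalize_ar(text):
    """The normalization rules from preprocess_text_ar, gathered in one helper."""
    text = re.sub(r'[\u064B-\u065F]', '', text)  # Remove diacritics
    text = re.sub(r'[إأآا]', 'ا', text)          # Normalize alef
    text = re.sub(r'ة', 'ه', text)               # Normalize ta marbuta
    text = re.sub(r'ى', 'ي', text)               # Normalize ya
    text = re.sub(r'ؤ', 'و', text)               # Normalize waw
    text = re.sub(r'[ء]', '', text)              # Remove hamza
    return text

# Tiny illustrative stopword set (hypothetical; not the NLTK corpus).
raw_stopwords = {"أو", "إلى", "في"}
tokens = normalize_ar("أرقام أو رموز").split()

# Unnormalized stopwords: the normalized token "او" slips through.
kept_raw = [t for t in tokens if t not in raw_stopwords]

# Stopwords normalized with the same rules: the token is filtered out.
normalized_stopwords = {normalize_ar(w) for w in raw_stopwords}
kept_norm = [t for t in tokens if t not in normalized_stopwords]

print(kept_raw)
print(kept_norm)
```

An alternative is to remove stopwords before normalizing. That only works if the input already uses the same spellings as the stopword list.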


## Step 6: Expected Outputs
#### Original Texts:
#### English: "Wow! Loved this place. Bit far but worth it."
#### Arabic: "مرحباً بالعالم! هذه جملة اختبار لمعالجة النصوص. أرقام مثل ١٢٣٤٥ أو رموز مثل $٪& تعتبر ضوضاء."
#### Processed Texts:
#### English: wow love place bit far worth
#### Arabic: رحب علم جمل خبر علج نصص رقم او رمز عبر ضوض

#### Summary of Steps
#### Normalize text for consistency.
#### Remove noise like punctuation and numbers.
#### Tokenize text into words.
#### Remove stopwords to focus on meaningful words.
#### Apply stemming to reduce words to their root forms.

### **Text Preprocessing** Terms Explained Simply

---

### 1. **Tokenization (Arabic: التقطيع النصي)**  
Tokenization breaks a text into smaller units, such as sentences or words, so that each unit can be handled independently. It is similar to dividing a book into chapters or pages to make it easier to follow.  

Arabic example:  
The sentence "أنا بحب القهوة" ("I love coffee") becomes ["أنا", "بحب", "القهوة"].  

English example:  
The sentence "I love coffee" becomes ["I", "love", "coffee"].  

---

### 2. **Stopword Removal (Arabic: إزالة الكلمات الشائعة)**  
Stopword removal drops words that occur very often but carry little meaning on their own (for example في, عن, إلى in Arabic, or "in," "on," "the," "is" in English), so that later steps focus on the words that matter.  

Example (Egyptian Arabic):  
"أنا رايح على الكافيه" ("I'm going to the café") --> ["رايح", "الكافيه"]. The definite article "ال" stays attached at this stage, because only whole stopwords are dropped.  

---

### 3. **Noise Removal (Arabic: إزالة التشويش)**  
Noise removal cleans the text of elements that are not needed, such as numbers, symbols like "#", and stray punctuation marks. It is like tidying a room and throwing away the paper you no longer need.  

Example:  
"السلام عليكم 123 !!!" --> "السلام عليكم".  

---

### 4. **Normalization (Arabic: التطبيع)**  
Normalization converts text into one consistent form, for example by writing letter variants the same way (أ, إ, آ --> ا) or by removing accents, so that different spellings of the same word match.  

Arabic example:  
With the rules used in this tutorial, "إلى" and "الي" both end up as "الي".  

English example:  
Normalizing "Éléphant" to "elephant."
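
The accent-removal example can be sketched with Python's standard `unicodedata` module. Note that the tutorial's `normalize_text` only lowercases Latin text and does not strip accents, so `strip_accents` here is a hypothetical helper shown only to illustrate the idea.

```python
import unicodedata

def strip_accents(text):
    """Remove combining accent marks after canonical decomposition (NFD)."""
    # NFD splits "É" into "E" plus a combining acute accent; filtering out
    # combining characters then leaves only the base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("Éléphant").lower())  # elephant
```

This approach removes combining marks in any script, so it also strips Arabic diacritics. Apply it only where that is the intended effect.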

---

### 5. **Stemming (Arabic: التجذير)**  
Stemming reduces words to their root or base form, so that related word forms collapse into one shared token.  

Arabic examples:  
"تعلّموا" is reduced to "علم", and "الكتابة" is reduced to "كتب".  

English example:  
The words "running" and "runs" are reduced to "run." A rule-based stemmer such as Porter only strips suffixes, so it leaves "runner" and the irregular form "ran" unchanged. Mapping "ran" to "run" requires lemmatization.  

---

### **Illustrative Examples:**

#### **Input Text:**
- English:  
  *"I am going to the market. Shopping is fun!"*  
- Arabic:  
  *"أنا ذاهب إلى السوق. التسوق ممتع!"*

#### **Steps:**
1. **Tokenization:**  
   - English: ["I", "am", "going", "to", "the", "market", ".", "Shopping", "is", "fun", "!"]  
   - Arabic: ["أنا", "ذاهب", "إلى", "السوق", ".", "التسوق", "ممتع", "!"]

2. **Stopword Removal:**  
   - English: ["going", "market", ".", "Shopping", "fun", "!"]  
   - Arabic: ["ذاهب", "السوق", ".", "التسوق", "ممتع", "!"]

3. **Noise Removal:**  
   - English: ["going", "market", "Shopping", "fun"]  
   - Arabic: ["ذاهب", "السوق", "التسوق", "ممتع"]

4. **Normalization:**  
   - English: ["going", "market", "shopping", "fun"]  
   - Arabic: ["ذاهب", "السوق", "التسوق", "ممتع"] (unchanged, since no diacritics or letter variants remain)

5. **Stemming:**  
   - English: ["go", "market", "shop", "fun"]  
   - Arabic: ["ذهب", "سوق", "تسوق", "متع"] (illustrative; the exact ISRI stemmer output may differ)

#### **Final Preprocessed Text:**
- English: "go market shop fun"  
- Arabic: "ذهب سوق تسوق متع"  