# Advanced Arabic Preprocessing Pipeline - Google Colab

This notebook applies advanced preprocessing to Arabic text using CAMeL Tools with morphological segmentation.

**Steps:**
1. Install dependencies
2. Upload preprocessing modules and dataset
3. Process the dataset
4. Download results

## ⚠️ Important Note About Dependency Warnings

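When installing CAMeL Tools, you may see **numpy dependency conflict warnings**. For this notebook they can usually be ignored, but it helps to know what they mean:

1. CAMeL Tools requires `numpy<2.0`, while some packages preinstalled in Colab need `numpy>=2.0`
2. A Python environment holds only one numpy version at a time, so pip installs a 1.x release for CAMeL Tools and warns about the packages that wanted 2.x
3. The cells below use only pandas, CAMeL Tools, and the uploaded modules, so the warnings should not affect them
4. If an import fails after installation, restart the runtime from the Runtime menu and re-run the cells

**TL;DR:** The numpy warnings are expected. Proceed with the cells below, and restart the runtime only if an import fails.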
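
To see which NumPy release is actually active after installation, a small check along these lines can help. It is only a sketch: `installed_numpy_version` and `satisfies_camel_pin` are hypothetical helper names, and the `numpy<2` pin is the one shown in the pip log of the install cell.

```python
from importlib import metadata


def installed_numpy_version():
    """Return the installed NumPy version string, or None if NumPy is absent."""
    try:
        return metadata.version("numpy")
    except metadata.PackageNotFoundError:
        return None


def satisfies_camel_pin(version):
    """True if the major version is below 2, i.e. the numpy<2 pin is met."""
    major = int(version.split(".")[0])
    return major < 2


version = installed_numpy_version()
if version is None:
    print("numpy is not installed")
else:
    status = "OK for CAMeL Tools" if satisfies_camel_pin(version) else "too new for the numpy<2 pin"
    print(f"numpy {version}: {status}")
```

If the check reports a 2.x release after the install cell has run, restarting the runtime and re-running the install is the first thing to try.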

## 1. Setup - Install Dependencies

## Alternative: Clean Install (Run if you get errors)

If the standard install cell further below (`pip install -q camel-tools`) fails, run this cell instead:

In [1]:
# ALTERNATIVE INSTALLATION (use if standard install has issues)
# --no-warn-conflicts only hides pip's dependency-conflict warnings;
# pip still installs the numpy<2 release that CAMeL Tools requires.

import os
os.environ['PIP_NO_WARN_CONFLICTS'] = '1'

print("Installing CAMeL Tools (ignoring safe dependency warnings)...")
!pip install --no-warn-conflicts camel-tools

print("\n✓ Installation complete")
print("  Note: Numpy dependency warnings are expected and safe to ignore.")
print("  If an import fails afterwards, restart the runtime and re-run the cells.")

Collecting camel-tools
  Downloading camel_tools-1.5.7-py3-none-any.whl.metadata (10 kB)
Collecting docopt (from camel-tools)
  Downloading docopt-0.6.2.tar.gz (25 kB)
  Preparing metadata (setup.py) ... done
Collecting numpy<2 (from camel-tools)
  Downloading numpy-1.26.4-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (61 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.0/61.0 kB 981.0 kB/s eta 0:00:00
Collecting transformers<4.44.0,>=4.0 (from camel-tools)
  Downloading transformers-4.43.4-py3-none-any.whl.metadata (43 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 43.7/43.7 kB 1.8 MB/s eta 0:00:00
Collecting emoji (from camel-tools)
  Downloading emoji-2.15.0-py3-none-any.whl.metadata (5.7 kB)
Collecting pyrsistent (from camel-tools)
  Downloading pyrsistent-0.20.0-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting muddler (from c


✓ Installation complete
  If an import fails afterwards, restart the runtime and re-run the cells.


In [3]:
# Standard install of CAMeL Tools
print("Installing CAMeL Tools...")
print("Note: Fixing numpy dependency conflicts...\n")

# camel-tools requires numpy<2.0, so pip replaces a newer numpy if one is present
!pip install -q camel-tools

# pip may warn about numpy conflicts here; restart the runtime only if a later import fails
print("✓ CAMeL Tools installed successfully")
print("  (Numpy version conflicts are expected and won't affect functionality)\n")

# Import basic libraries
import pandas as pd
import os
import shutil
from google.colab import files

print("✓ Libraries imported")

Installing CAMeL Tools...
Note: Fixing numpy dependency conflicts...

✓ CAMeL Tools installed successfully
  (Numpy version conflicts are expected and won't affect functionality)

✓ Libraries imported


## 2. Upload Preprocessing Modules

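Upload your `ArbPreBasic.py` and `ArbPreAdv.py` files when prompted. If a file with the same name already exists in the session, Colab saves the new upload under a suffixed name such as `ArbPreBasic (1).py` (as in the output below). The verification check then fails and a later import picks up the older copy, so delete the old file or rename the new one before continuing.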

In [5]:
print("Please upload ArbPreBasic.py and ArbPreAdv.py:")
print("You should see a 'Choose Files' button below.\n")

uploaded = files.upload()

# Verify files
if 'ArbPreBasic.py' in uploaded and 'ArbPreAdv.py' in uploaded:
    print("\n✓ Both preprocessing modules uploaded successfully")
    print(f"  - ArbPreBasic.py ({len(uploaded['ArbPreBasic.py'])} bytes)")
    print(f"  - ArbPreAdv.py ({len(uploaded['ArbPreAdv.py'])} bytes)")
else:
    print("\n⚠ Warning: Make sure both files are uploaded")
    print(f"Files received: {list(uploaded.keys())}")

Please upload ArbPreBasic.py and ArbPreAdv.py:
You should see a 'Choose Files' button below.



Saving ArbPreBasic.py to ArbPreBasic (1).py
Saving ArbPreAdv.py to ArbPreAdv.py

Files received: ['ArbPreBasic (1).py', 'ArbPreAdv.py']
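
The output above shows Colab saving the upload as `ArbPreBasic (1).py`, which the name check does not recognise. One way to cope is to write each upload back under its canonical name. This is a minimal sketch, not part of the original modules: `canonical_name` and `save_with_canonical_names` are hypothetical helpers, and the only assumption about `files.upload()` is that it returns a dict mapping file names to bytes, as the cell above already relies on.

```python
import re
from pathlib import Path

# Matches the " (1)", " (2)", ... suffix placed before the extension of a re-uploaded file.
_DUPLICATE_SUFFIX = re.compile(r" \(\d+\)(?=\.[^.]+$)")


def canonical_name(filename):
    """Strip a ' (n)' duplicate suffix: 'ArbPreBasic (1).py' -> 'ArbPreBasic.py'."""
    return _DUPLICATE_SUFFIX.sub("", filename)


def save_with_canonical_names(uploaded, target_dir="."):
    """Write each uploaded file under its canonical name and return the names written.

    `uploaded` maps filename -> bytes. Later entries overwrite earlier ones
    that share a canonical name.
    """
    written = []
    for name, content in uploaded.items():
        clean = canonical_name(name)
        Path(target_dir, clean).write_bytes(content)
        written.append(clean)
    return written
```

In the notebook this would be called as `save_with_canonical_names(uploaded)` right after `files.upload()`. If the old module was already imported in the session, a runtime restart is still needed before the new copy takes effect.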


## 3. Upload Dataset

Upload your `arb.csv` file containing the Arabic text data.

In [6]:
print("Please upload your arb.csv file:")
print("You should see a 'Choose Files' button below.\n")

uploaded = files.upload()

# Create train directory (no error if it already exists) and organize files
os.makedirs('train', exist_ok=True)

if 'arb.csv' in uploaded:
    shutil.move('arb.csv', 'train/arb.csv')
    print("\n✓ Dataset uploaded and moved to train/arb.csv")

    # Show dataset info
    df_preview = pd.read_csv('train/arb.csv', nrows=5)
    print(f"\nDataset preview:")
    print(f"  Columns: {list(df_preview.columns)}")
    print(f"  Shape: {df_preview.shape}")
else:
    print("\n⚠ Error: arb.csv not found in uploaded files")

Please upload your arb.csv file:
You should see a 'Choose Files' button below.



Saving arb.csv to arb.csv

✓ Dataset uploaded and moved to train/arb.csv

Dataset preview:
  Columns: ['id', 'text', 'polarization']
  Shape: (5, 3)


## 4. Import Preprocessing Modules

In [7]:
from ArbPreAdv import ArabicAdvancedPreprocessor

print("✓ Preprocessing modules imported successfully")

✓ Preprocessing modules imported successfully


## 5. Load Dataset

In [8]:
# Load the Arabic training dataset
train_path = 'train/arb.csv'
df = pd.read_csv(train_path)

print(f"✓ Dataset loaded from: {train_path}")
print(f"  Shape: {df.shape}")
print(f"  Columns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()

✓ Dataset loaded from: train/arb.csv
  Shape: (3380, 3)
  Columns: ['id', 'text', 'polarization']

First few rows:


| | id | text | polarization |
|---|---|---|---|
| 0 | arb_a2a60c8b4af3389e842d8ec31afb0eea | احلام انتي ونعالي ومنو انتي حتى تقيمين الفناني... | 1 |
| 1 | arb_6723e56a672674a6c1d9b28b213c4a05 | وره الكواليس تنيجج من وره بعير صطناعي على فكرة... | 1 |
| 2 | arb_b0365d606edeee38ae6c025b1ca33e96 | .خخخخ الملكه احلام فيها شذوذ شنو هل بوس والدلع... | 1 |
| 3 | arb_858c0ee684049ba6f416a6cecb0b0761 | الله يخزي احلام هي والبرنامج الخايس الي كله مصخره | 1 |
| 4 | arb_bdafc73afd0bc2cd2badae2a089446b9 | كس ام احلام الي ماربتها وش ملكه هههه متستاهل م... | 1 |


## 6. Dataset Information

In [9]:
# Check dataset info
print("Dataset Information:")
print(f"  Total records: {len(df)}")
print(f"  Null values:\n{df.isnull().sum()}")
print(f"\nPolarization distribution:")
print(df['polarization'].value_counts())

Dataset Information:
  Total records: 3380
  Null values:
id              0
text            0
polarization    0
dtype: int64

Polarization distribution:
polarization
0    1868
1    1512
Name: count, dtype: int64


In [16]:
# Download CAMeL Tools morphological database
print("Downloading CAMeL Tools database...")
print("This is a one-time download (~50MB) and may take 1-2 minutes.\n")

!camel_data -i morphology-db-msa-r13

print("\n✓ Database downloaded successfully!")
print("  Location: ~/.camel_tools/data/")
print("\nNow ready to initialize preprocessor...")

Downloading CAMeL Tools database...
This is a one-time download (~50MB) and may take 1-2 minutes.

The following packages will be installed: 'morphology-db-msa-r13'
Downloading package 'morphology-db-msa-r13': 100% 40.5M/40.5M [00:00<00:00, 290MB/s]
Extracting package 'morphology-db-msa-r13': 100% 40.5M/40.5M [00:00<00:00, 532MB/s]

✓ Database downloaded successfully!
  Location: ~/.camel_tools/data/

Now ready to initialize preprocessor...


## 7. Initialize Advanced Preprocessor

This will load CAMeL Tools models (may take 1-2 minutes).

In [17]:
# Initialize the advanced preprocessor
print("Initializing advanced preprocessor...")
print("This may take 1-2 minutes to load CAMeL Tools models...\n")

preprocessor = ArabicAdvancedPreprocessor(
    split_proclitics=None,              # None = module default (the sample output below still shows و+ and ب+ split off)
    split_enclitics={'PRON'},           # Split pronominal enclitics
    keep_definite_article=True,         # Keep 'Al' attached
    keep_particles=True,                # Keep particles attached
    use_light_stemming=False,           # Don't apply stemming
    use_lemmatization=False,            # Don't apply lemmatization
    use_basic_preprocessing=True        # Apply basic preprocessing first
)

print("✓ Advanced preprocessor initialized")
print("\nPreprocessor features:")
print("  • Basic preprocessing (normalization, diacritics removal)")
print("  • Morphological segmentation")
print("  • Pronominal enclitic splitting (e.g., كتابهم → كتاب + هم)")
print("  • Definite article preserved")
print("  • Particles preserved")

# Test on a sample
sample = df['text'].iloc[0]
print(f"\n{'='*80}")
print("Sample preprocessing:")
print(f"{'='*80}")
print(f"Original:\n{sample[:100]}...")
print(f"\nPreprocessed:\n{preprocessor.preprocess(sample)[:100]}...")

Initializing advanced preprocessor...
This may take 1-2 minutes to load CAMeL Tools models...

✓ Advanced preprocessor initialized

Preprocessor features:
  • Basic preprocessing (normalization, diacritics removal)
  • Morphological segmentation
  • Pronominal enclitic splitting (e.g., كتابهم → كتاب + هم)
  • Definite article preserved
  • Particles preserved

Sample preprocessing:
Original:
احلام انتي ونعالي ومنو انتي حتى تقيمين الفنانين الملكه احلام هههههههه البقره احلام بابا عوفي الفن لا...

Preprocessed:
أحلام أنتي و+ نعالي ومنو أنتي حتى تقيمين الفنانين الملكة أحلام هههههههه البقرة أحلام ب+ أبا عوفي الف...
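
The `+` markers in the preprocessed sample show where a clitic was split from its word, for example `و+` before `نعالي`. As a rough way to inspect how much segmentation took place, the sketch below counts tokens that carry such a marker. `count_clitic_markers` is a hypothetical helper, and treating a leading `+` as an enclitic marker is an assumption, since the sample above only shows the trailing form.

```python
def count_clitic_markers(text):
    """Count whitespace-separated tokens that carry a '+' segmentation marker."""
    proclitics = 0  # tokens such as 'و+' (marker at the end)
    enclitics = 0   # tokens such as '+هم' (marker at the start; assumed format)
    for token in text.split():
        if len(token) > 1 and token.endswith("+"):
            proclitics += 1
        elif len(token) > 1 and token.startswith("+"):
            enclitics += 1
    return {"proclitics": proclitics, "enclitics": enclitics}


# Start of the preprocessed sample shown above
sample_output = "أحلام أنتي و+ نعالي ومنو أنتي حتى تقيمين الفنانين الملكة أحلام هههههههه البقرة أحلام ب+ أبا عوفي"
print(count_clitic_markers(sample_output))
```

Applied to a whole column, for instance with `df['text_advanced'].apply(count_clitic_markers)`, this gives a quick sanity check that the chosen configuration splits what it is expected to split.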


## 8. Process All Texts

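⚠️ **Note:** Runtime depends on dataset size and hardware. Rough upper-bound estimates:
- Small dataset (<1000 texts): ~2-5 minutes
- Medium dataset (1000-5000): ~10-20 minutes
- Large dataset (>5000): 30+ minutes

In the run recorded below, 3,380 texts finished in 0.57 minutes (about 0.010 seconds per text), so actual times can be far shorter.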
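
A quick way to get an estimate for your own data is to time the preprocessor on a small sample and extrapolate. The sketch below assumes roughly linear scaling with the number of texts; `estimate_total_seconds` is a hypothetical helper, not part of the preprocessing modules.

```python
import time


def estimate_total_seconds(process, sample_texts, total_count):
    """Time `process` on a small sample and extrapolate linearly to `total_count` texts."""
    if not sample_texts:
        raise ValueError("sample_texts must not be empty")
    start = time.perf_counter()
    for text in sample_texts:
        process(text)
    per_text = (time.perf_counter() - start) / len(sample_texts)
    return per_text * total_count
```

With the objects defined in this notebook, a call might look like `estimate_total_seconds(preprocessor.preprocess, df['text'].dropna().head(50).tolist(), len(df))`. Texts vary in length, so treat the result as an order-of-magnitude figure.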

In [18]:
# Apply preprocessing to all texts
import time

print(f"Preprocessing {len(df)} texts...")
print("This may take several minutes depending on dataset size...\n")

start_time = time.time()

# Handle null values
df['text_advanced'] = df['text'].apply(
    lambda x: preprocessor.preprocess(x) if pd.notna(x) else x
)

elapsed_time = time.time() - start_time

print(f"✓ Preprocessing complete!")
print(f"  Time taken: {elapsed_time/60:.2f} minutes")
print(f"  Average: {elapsed_time/len(df):.3f} seconds per text")
print(f"\nDataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

Preprocessing 3380 texts...
This may take several minutes depending on dataset size...

✓ Preprocessing complete!
  Time taken: 0.57 minutes
  Average: 0.010 seconds per text

Dataset shape: (3380, 4)
Columns: ['id', 'text', 'polarization', 'text_advanced']


## 9. Compare Results

In [19]:
# Compare original vs preprocessed
print("Comparison: Original vs Advanced Preprocessing\n")
print("="*100)

for i in range(3):
    print(f"\nExample {i+1}:")
    print(f"  Original:  {df['text'].iloc[i][:80]}...")
    print(f"  Advanced:  {df['text_advanced'].iloc[i][:80]}...")
    print(f"  Label:     {df['polarization'].iloc[i]}")
    print("-"*100)

# Statistics
orig_len = df['text'].str.len().mean()
adv_len = df['text_advanced'].str.len().mean()
orig_words = df['text'].str.split().str.len().mean()
adv_words = df['text_advanced'].str.split().str.len().mean()

print(f"\nText Statistics:")
print(f"  Original average length:     {orig_len:.2f} chars")
print(f"  Preprocessed average length: {adv_len:.2f} chars")
print(f"  Character reduction:         {(orig_len - adv_len) / orig_len * 100:.2f}%")
print(f"\n  Original average words:      {orig_words:.2f}")
print(f"  Preprocessed average tokens: {adv_words:.2f}")
print(f"  Token change:                {(adv_words - orig_words) / orig_words * 100:+.2f}%")

Comparison: Original vs Advanced Preprocessing


Example 1:
  Original:  احلام انتي ونعالي ومنو انتي حتى تقيمين الفنانين الملكه احلام هههههههه البقره احل...
  Advanced:  أحلام أنتي و+ نعالي ومنو أنتي حتى تقيمين الفنانين الملكة أحلام هههههههه البقرة أ...
  Label:     1
----------------------------------------------------------------------------------------------------

Example 2:
  Original:  وره الكواليس تنيجج من وره بعير صطناعي على فكرة احﻻم رجل مو مره لهن تخيل على البن...
  Advanced:  وره الكواليس تنيجج من وره بعير صطناعي على ف+ كرة احﻻم رجل مو مرة لهن تخيل على ال...
  Label:     1
----------------------------------------------------------------------------------------------------

Example 3:
  Original:  .خخخخ الملكه احلام فيها شذوذ شنو هل بوس والدلع مع شذا والله عيب اطلعت بويه احلام...
  Advanced:  .خخخخ الملكة أحلام فيها شذوذ شنو هل بوس و+ الدلع مع شذا و+ الله عيب أطلعت بوية أ...
  Label:     1
---------------------------------------------------------------------------------------
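
The statistics in the cell above are plain relative changes. Note that the cell reports character *reduction* as `(orig_len - adv_len) / orig_len`, so segmentation that adds `+` markers and spaces shows up as a negative reduction. A small worked example, using the first three words of Example 1 and a hypothetical helper `percent_change`:

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before * 100


original = "احلام انتي ونعالي"
processed = "أحلام أنتي و+ نعالي"

orig_tokens = len(original.split())   # 3 whitespace-separated words
proc_tokens = len(processed.split())  # 4 tokens: 'و+' is now a token of its own

print(f"Token change:     {percent_change(orig_tokens, proc_tokens):+.2f}%")
print(f"Character change: {percent_change(len(original), len(processed)):+.2f}%")
```

Splitting one conjunction turns three words into four tokens, a token change of +33.33%, which is why the token count is expected to rise after morphological segmentation.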

## 10. Save Results

In [21]:
# Save the preprocessed dataset
output_path = 'train/arb_with_advanced.csv'
df.to_csv(output_path, index=False)

print(f"✓ Full dataset saved!")
print(f"  Location: {output_path}")
print(f"  Columns: {list(df.columns)}")

# Also create a clean version with only the preprocessed text
df_clean = df[['id', 'text_advanced', 'polarization']].copy()
df_clean.rename(columns={'text_advanced': 'text'}, inplace=True)

clean_output_path = 'train/arb_clean_advanced.csv'
df_clean.to_csv(clean_output_path, index=False)

print(f"\n✓ Clean version saved!")
print(f"  Location: {clean_output_path}")
print(f"  Shape: {df_clean.shape}")
print(f"\n{'='*80}")
print("SUMMARY")
print(f"{'='*80}")
print(f"Processed {len(df)} records with advanced morphological preprocessing")
print(f"\nOutput files:")
print(f"  1. {output_path}")
print(f"     (Contains: id, text, polarization, text_advanced)")
print(f"  2. {clean_output_path}")
print(f"     (Contains: id, text, polarization)")

✓ Full dataset saved!
  Location: train/arb_with_advanced.csv
  Columns: ['id', 'text', 'polarization', 'text_advanced']

✓ Clean version saved!
  Location: train/arb_clean_advanced.csv
  Shape: (3380, 3)

SUMMARY
Processed 3380 records with advanced morphological preprocessing

Output files:
  1. train/arb_with_advanced.csv
     (Contains: id, text, polarization, text_advanced)
  2. train/arb_clean_advanced.csv
     (Contains: id, text, polarization)


## 11. Download Results

Download the processed files to your computer.

In [22]:
from google.colab import files

print("Downloading files...\n")

# Download full dataset with both original and processed text
print("Downloading arb_with_advanced.csv...")
files.download('train/arb_with_advanced.csv')

# Download clean version with only processed text
print("Downloading arb_clean_advanced.csv...")
files.download('train/arb_clean_advanced.csv')

print("\n✓ Downloads complete!")
print("Check your browser's download folder.")

Downloading files...

Downloading arb_with_advanced.csv...



Downloading arb_clean_advanced.csv...




✓ Downloads complete!
Check your browser's download folder.


---

## Optional: Alternative Preprocessing Configurations

Try different preprocessing strategies:

In [None]:
# Aggressive segmentation with lemmatization
# Uncomment to use:

# print("Initializing aggressive preprocessor...")
# preprocessor_aggressive = ArabicAdvancedPreprocessor(
#     split_proclitics={'CONJ', 'PREP'},
#     split_enclitics={'PRON'},
#     keep_definite_article=False,
#     use_lemmatization=True,
#     use_basic_preprocessing=True
# )

# print("Processing with aggressive configuration...")
# df['text_aggressive'] = df['text'].apply(
#     lambda x: preprocessor_aggressive.preprocess(x) if pd.notna(x) else x
# )

# # Save aggressive version
# df.to_csv('train/arb_aggressive.csv', index=False)
# files.download('train/arb_aggressive.csv')
# print("✓ Aggressive preprocessing complete!")

## Optional: Batch Processing for Large Datasets

For very large datasets, process in batches to see progress:

In [None]:
# Process in batches with progress tracking
# Uncomment to use:

# import numpy as np

# batch_size = 100
# n_batches = int(np.ceil(len(df) / batch_size))

# print(f"Processing {len(df)} texts in {n_batches} batches of {batch_size}...\n")

# results = []
# for i in range(n_batches):
#     start_idx = i * batch_size
#     end_idx = min((i + 1) * batch_size, len(df))
#
#     batch_texts = df['text'].iloc[start_idx:end_idx].tolist()
#     batch_processed = preprocessor.preprocess_batch(batch_texts)
#     results.extend(batch_processed)
#
#     print(f"Batch {i+1}/{n_batches} complete ({end_idx}/{len(df)} texts)")

# df['text_advanced'] = results
# print("\n✓ Batch processing complete!")
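
The commented cell above calls `preprocessor.preprocess_batch`, which is assumed to exist in `ArbPreAdv.py`, and it passes null values straight to the preprocessor. As an alternative sketch that relies only on a per-text function such as `preprocessor.preprocess`, the hypothetical helpers below batch a list, report progress, and pass non-string entries (such as NaN or None) through unchanged.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def process_in_batches(texts, process, batch_size=100):
    """Apply `process` to every string in `texts`, printing progress after each batch.

    Entries that are not strings (None, NaN) are returned unchanged.
    """
    results = []
    for batch in iter_batches(texts, batch_size):
        results.extend(process(t) if isinstance(t, str) else t for t in batch)
        print(f"{len(results)}/{len(texts)} texts processed")
    return results
```

In the notebook this could be used as `df['text_advanced'] = process_in_batches(df['text'].tolist(), preprocessor.preprocess)`.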