# Advanced Preprocessing for eng2arb Dataset - Google Colab

This notebook applies advanced preprocessing to the Arabic eng2arb dataset using CAMeL Tools with morphological segmentation.

**Input**: `arb_eng2arb_clean_basic.csv` (basic preprocessed)
**Output**: `arb_eng2arb_clean_advanced.csv` (advanced morphological preprocessing)

**Processing includes:**
- Morphological segmentation
- Pronominal enclitic splitting
- Definite article preservation
- Particle preservation

**Steps:**
1. Install CAMeL Tools
2. Upload preprocessing modules and dataset
3. Download CAMeL Tools database
4. Process the dataset
5. Download results

## ⚠️ Important Note About Dependency Warnings

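When installing CAMeL Tools, you may see **numpy dependency conflict warnings** from pip. For this notebook they are usually **safe to ignore** because:

1. CAMeL Tools requires `numpy<2.0`, but some Colab packages need `numpy>=2.0`
2. Only one numpy version can be installed in the runtime at a time, so pip reports the packages whose requirement is no longer met
3. The warnings come from pip's dependency check; they do not mean the CAMeL Tools installation failed
4. If a later cell fails with a numpy-related import error, restart the runtime and re-run the notebook from the top

**TL;DR:** The messages about numpy are expected. Proceed with the cells below, and restart the runtime only if an import fails.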
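
To see which numpy version is actually active after the install cell, a check along these lines may help. It is a minimal sketch: the helper names `numpy_major` and `satisfies_camel_pin` are hypothetical, and the `numpy<2.0` bound is taken from the note above rather than read from the package metadata.

```python
from importlib import metadata


def numpy_major(version: str) -> int:
    """Return the major component of a version string such as '1.26.4'."""
    return int(version.split(".")[0])


def satisfies_camel_pin(version: str) -> bool:
    """True when the version meets the numpy<2.0 requirement quoted above."""
    return numpy_major(version) < 2


if __name__ == "__main__":
    try:
        installed = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        print("numpy is not installed in this environment")
    else:
        status = "OK" if satisfies_camel_pin(installed) else "newer than the pin"
        print(f"numpy {installed}: {status}")
```

If the reported version does not match what pip just installed, the runtime is probably still holding an older import and needs a restart.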

## 1. Setup - Install Dependencies

In [None]:
# Install CAMeL Tools (pip may print numpy dependency warnings)
print("Installing CAMeL Tools...")
print("Note: numpy dependency warnings from pip are expected.\n")

# camel-tools requires numpy<2.0
!pip install -q camel-tools

# Numpy conflict warnings are usually safe to ignore in Colab
print("✓ CAMeL Tools install step finished")
print("  (Numpy version conflicts are expected and usually don't affect functionality)\n")

# Import basic libraries
import pandas as pd
import os
import shutil
from google.colab import files

print("✓ Libraries imported")

## 2. Upload Preprocessing Modules

Upload your `ArbPreBasic.py` and `ArbPreAdv.py` files when prompted.

In [None]:
print("Please upload ArbPreBasic.py and ArbPreAdv.py:")
print("You should see a 'Choose Files' button below.\n")

uploaded = files.upload()

# Verify files
if 'ArbPreBasic.py' in uploaded and 'ArbPreAdv.py' in uploaded:
    print("\n✓ Both preprocessing modules uploaded successfully")
    print(f"  - ArbPreBasic.py ({len(uploaded['ArbPreBasic.py'])} bytes)")
    print(f"  - ArbPreAdv.py ({len(uploaded['ArbPreAdv.py'])} bytes)")
else:
    print("\n⚠ Warning: Make sure both files are uploaded")
    print(f"Files received: {list(uploaded.keys())}")
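
The verification above only reports that something is missing. A small helper can name exactly which module was not received. This is a sketch, and `missing_files` is a hypothetical name, not part of `google.colab`.

```python
def missing_files(uploaded_names, required=("ArbPreBasic.py", "ArbPreAdv.py")):
    """Return the required file names that are absent from the upload."""
    received = set(uploaded_names)
    return [name for name in required if name not in received]


if __name__ == "__main__":
    # Example with only one of the two modules uploaded
    print(missing_files(["ArbPreBasic.py"]))
```

In the notebook this could be called as `missing_files(uploaded.keys())`. Note that Colab may save a re-uploaded file under a suffixed name such as `ArbPreAdv (1).py`, which would then show up as missing here.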

## 3. Upload Dataset

Upload your `arb_eng2arb_clean_basic.csv` file.

In [None]:
print("Please upload your arb_eng2arb_clean_basic.csv file:")
print("You should see a 'Choose Files' button below.\n")

uploaded = files.upload()

if 'arb_eng2arb_clean_basic.csv' in uploaded:
    print("\n✓ Dataset uploaded successfully")
    
    # Show dataset info
    df_preview = pd.read_csv('arb_eng2arb_clean_basic.csv', nrows=5)
    print(f"\nDataset preview:")
    print(f"  Columns: {list(df_preview.columns)}")
    print(f"  First few rows:")
    print(df_preview)
else:
    print("\n⚠ Error: arb_eng2arb_clean_basic.csv not found in uploaded files")
    print(f"Files received: {list(uploaded.keys())}")
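
Later cells read the `id`, `text`, and `polarization` columns, so it can be worth checking for them right after upload. The sketch below assumes those three names are the full requirement; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names.

```python
# Columns referenced by later cells in this notebook
REQUIRED_COLUMNS = ("id", "text", "polarization")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return required column names not present (ignoring stray whitespace)."""
    present = {str(column).strip() for column in columns}
    return [name for name in required if name not in present]


if __name__ == "__main__":
    # Example header that lacks the label column
    print(missing_columns(["id", "text"]))
```

With the preview loaded above, `missing_columns(df_preview.columns)` should return an empty list before processing continues.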

## 4. Download CAMeL Tools Database

CAMeL Tools needs to download a morphological database (first time only, ~50MB).

In [None]:
# Download CAMeL Tools morphological database
print("Downloading CAMeL Tools database...")
print("This is a one-time download (~50MB) and may take 1-2 minutes.\n")

!camel_data -i morphology-db-msa-r13

print("\n✓ Database downloaded successfully!")
print("  Location: ~/.camel_tools/data/")
print("\nNow ready to process the dataset...")
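
The cell above prints a success message whether or not the download worked. One way to confirm the database is usable is to try loading it. This sketch assumes that `MorphologyDB.builtin_db()` with no arguments loads the database installed by the command above; `morphology_db_available` is a hypothetical helper name.

```python
def morphology_db_available() -> bool:
    """Try to load the default built-in morphology database."""
    try:
        from camel_tools.morphology.database import MorphologyDB

        MorphologyDB.builtin_db()
    except Exception as exc:  # missing package, missing data, or a load error
        print(f"Morphology database not usable: {exc!r}")
        return False
    return True


if __name__ == "__main__":
    print("Database ready:", morphology_db_available())
```

If this reports `False`, re-running the `camel_data` cell is a reasonable first step.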

## 5. Import Preprocessing Modules

In [None]:
from ArbPreAdv import ArabicAdvancedPreprocessor

print("✓ Preprocessing modules imported successfully")

## 6. Load Dataset

In [None]:
# Load the eng2arb dataset
data_path = 'arb_eng2arb_clean_basic.csv'
df = pd.read_csv(data_path)

print(f"✓ Dataset loaded from: {data_path}")
print(f"  Shape: {df.shape}")
print(f"  Columns: {list(df.columns)}")
print(f"\nFirst few rows:")
print(df.head())

# Dataset info
print(f"\nDataset Information:")
print(f"  Total records: {len(df)}")
print(f"  Null values:\n{df.isnull().sum()}")
print(f"\nPolarization distribution:")
print(df['polarization'].value_counts())
print(f"\nClass balance:")
print(df['polarization'].value_counts(normalize=True))

## 7. Initialize Advanced Preprocessor

This will load CAMeL Tools models (may take 1-2 minutes).

In [None]:
# Initialize the advanced preprocessor
print("Initializing advanced preprocessor...")
print("This may take 1-2 minutes to load CAMeL Tools models...\n")

preprocessor = ArabicAdvancedPreprocessor(
    split_proclitics=None,              # Don't split proclitics by default
    split_enclitics={'PRON'},           # Split pronominal enclitics
    keep_definite_article=True,         # Keep 'Al' attached
    keep_particles=True,                # Keep particles attached
    use_light_stemming=False,           # Don't apply stemming
    use_lemmatization=False,            # Don't apply lemmatization
    use_basic_preprocessing=False       # Data is already basically preprocessed
)

print("✓ Advanced preprocessor initialized")
print("\nPreprocessor features:")
print("  • Morphological segmentation")
print("  • Pronominal enclitic splitting (e.g., كتابهم → كتاب + هم)")
print("  • Definite article preserved")
print("  • Particles preserved")
print("  • Basic preprocessing: SKIPPED (already done)")

# Test on a sample
sample = df['text'].iloc[0]
print(f"\n{'='*80}")
print("Sample preprocessing:")
print(f"{'='*80}")
print(f"Original:\n{sample[:100]}...")
print(f"\nPreprocessed:\n{preprocessor.preprocess(sample)[:100]}...")
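
To make the idea of pronominal enclitic splitting concrete, here is a deliberately naive string-suffix illustration. It is not how `ArabicAdvancedPreprocessor` or CAMeL Tools work: the suffix list is a hypothetical, incomplete sample, and no morphological analysis is performed.

```python
# Hypothetical, incomplete list of pronominal suffixes, longest first
PRONOMINAL_SUFFIXES = ("هما", "كما", "هم", "هن", "كم", "كن", "نا", "ها", "ه", "ك", "ي")


def naive_split_enclitic(word, min_stem_len=3):
    """Split off the first matching suffix if enough of the word remains."""
    for suffix in PRONOMINAL_SUFFIXES:
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= min_stem_len:
            return [stem, suffix]
    return [word]


if __name__ == "__main__":
    print(naive_split_enclitic("كتابهم"))  # the example from the cell above
    print(naive_split_enclitic("كرسي"))    # "chair": split incorrectly
```

The second call shows the limitation: the final letter of كرسي ("chair") is part of the word, not a pronoun, yet suffix matching splits it anyway. Ambiguities of this kind are why the notebook relies on morphological analysis rather than string rules.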

## 8. Process All Texts

⚠️ **Note:** Processing time depends on dataset size.
- Dataset size: ~7000 texts
- Estimated time: 20-30 minutes

In [None]:
import time

# Apply preprocessing to all texts
print(f"Preprocessing {len(df)} texts...")
print("This will take approximately 20-30 minutes...\n")

start_time = time.time()

# Preprocess each text; null values are passed through unchanged
df['text_advanced'] = df['text'].apply(
    lambda x: preprocessor.preprocess(x) if pd.notna(x) else x
)

elapsed_time = time.time() - start_time

print(f"✓ Preprocessing complete!")
print(f"  Time taken: {elapsed_time/60:.2f} minutes")
print(f"  Average: {elapsed_time/len(df):.3f} seconds per text")
print(f"\nDataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
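
For a run of this length, progress output and reuse of results for duplicate texts can be helpful. The following is a sketch of one possible approach, not the notebook's method: `preprocess_all` and `is_missing` are hypothetical helpers, and the cache assumes the preprocessing function is deterministic.

```python
import math


def is_missing(value) -> bool:
    """True for None and float NaN, the usual null markers in a text column."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def preprocess_all(texts, preprocess_fn, report_every=1000):
    """Apply preprocess_fn to each text, skipping nulls and caching repeats."""
    cache = {}
    results = []
    for index, text in enumerate(texts, start=1):
        if is_missing(text):
            results.append(text)  # leave null values untouched
        else:
            if text not in cache:
                cache[text] = preprocess_fn(text)
            results.append(cache[text])
        if report_every and index % report_every == 0:
            print(f"  processed {index} texts")
    return results


if __name__ == "__main__":
    # str.upper stands in for the real preprocessing function
    print(preprocess_all(["a", None, "a"], str.upper, report_every=2))
```

In the notebook this could be used as `df['text_advanced'] = preprocess_all(df['text'].tolist(), preprocessor.preprocess)`.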

## 9. Compare Results

In [None]:
# Compare original vs preprocessed
print("Comparison: Original vs Advanced Preprocessing\n")
print("="*100)

for i in range(min(5, len(df))):
    print(f"\nExample {i+1}:")
    # str() keeps the slice from failing on null (NaN) entries
    print(f"  Original:  {str(df['text'].iloc[i])[:80]}...")
    print(f"  Advanced:  {str(df['text_advanced'].iloc[i])[:80]}...")
    print(f"  Label:     {df['polarization'].iloc[i]}")
    print("-"*100)

# Statistics
orig_len = df['text'].str.len().mean()
adv_len = df['text_advanced'].str.len().mean()
orig_words = df['text'].str.split().str.len().mean()
adv_words = df['text_advanced'].str.split().str.len().mean()

print(f"\nText Statistics:")
print(f"  Original average length:     {orig_len:.2f} chars")
print(f"  Preprocessed average length: {adv_len:.2f} chars")
print(f"  Character change:            {(adv_len - orig_len) / orig_len * 100:+.2f}%")
print(f"\n  Original average words:      {orig_words:.2f}")
print(f"  Preprocessed average tokens: {adv_words:.2f}")
print(f"  Token change:                {(adv_words - orig_words) / orig_words * 100:+.2f}%")
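
The two "change" figures above are relative changes, `(after - before) / before * 100`. As a worked example with made-up numbers, going from an average of 10.0 words to 12.5 tokens is a change of +25.00%. The helper name `percent_change` is hypothetical.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    if before == 0:
        raise ValueError("relative change is undefined when 'before' is zero")
    return (after - before) / before * 100


if __name__ == "__main__":
    # Illustrative numbers only, not results from the dataset
    print(f"{percent_change(10.0, 12.5):+.2f}%")
```

A positive token change is the expected direction here, since splitting enclitics turns one word into two tokens.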

## 10. Save Results

In [None]:
# Save the preprocessed dataset
output_path_full = 'arb_eng2arb_with_advanced.csv'
df.to_csv(output_path_full, index=False)

print(f"✓ Full dataset saved!")
print(f"  Location: {output_path_full}")
print(f"  Columns: {list(df.columns)}")

# Also create a clean version with only the preprocessed text
df_clean = df[['id', 'text_advanced', 'polarization']].copy()
df_clean.rename(columns={'text_advanced': 'text'}, inplace=True)

clean_output_path = 'arb_eng2arb_clean_advanced.csv'
df_clean.to_csv(clean_output_path, index=False)

print(f"\n✓ Clean version saved!")
print(f"  Location: {clean_output_path}")
print(f"  Shape: {df_clean.shape}")
print(f"\n{'='*80}")
print("SUMMARY")
print(f"{'='*80}")
print(f"Processed {len(df)} records with advanced morphological preprocessing")
print(f"\nOutput files:")
print(f"  1. {output_path_full}")
print(f"     (Contains: id, text, polarization, text_advanced)")
print(f"  2. {clean_output_path}")
print(f"     (Contains: id, text, polarization)")
print(f"\n⭐ Use {clean_output_path} for training!")
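
Before downloading, it may be worth reading the saved file back to confirm its columns and row count, and to see how many rows ended up with empty text. This sketch uses only the standard library; `summarize_csv` is a hypothetical helper and assumes the file is UTF-8 with a `text` column.

```python
import csv


def summarize_csv(path):
    """Return the column names, row count, and number of empty 'text' cells."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        columns = reader.fieldnames or []
    empty_text = sum(1 for row in rows if not (row.get("text") or "").strip())
    return {"columns": list(columns), "rows": len(rows), "empty_text": empty_text}


if __name__ == "__main__":
    try:
        print(summarize_csv("arb_eng2arb_clean_advanced.csv"))
    except FileNotFoundError:
        print("Output file not found; run the save cell first")
```

The row count should match `len(df)`, and a non-zero `empty_text` points to rows that were null in the input or became empty during preprocessing.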

## 11. Download Results

Download the processed files to your computer.

In [None]:
from google.colab import files

print("Downloading files...\n")

# Download full dataset with both original and processed text
print("Downloading arb_eng2arb_with_advanced.csv...")
files.download('arb_eng2arb_with_advanced.csv')

# Download clean version with only processed text
print("Downloading arb_eng2arb_clean_advanced.csv...")
files.download('arb_eng2arb_clean_advanced.csv')

print("\n✓ Downloads complete!")
print("Check your browser's download folder.")
print("\n⭐ Use arb_eng2arb_clean_advanced.csv for training!")