<a href="https://colab.research.google.com/github/Arv-ind-s/content-moderation-system/blob/main/notebooks/02_data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧹 Content Moderation System - Data Preprocessing

## Objective
Transform raw Wikipedia comments into clean, model-ready text data for toxic comment classification.

## Preprocessing Strategy

Based on EDA findings:
- **Severe class imbalance (8.8:1)**: Will use stratified splitting to maintain distribution
- **Multi-label problem (6.18% of comments carry more than one label)**: Each label is treated as an independent binary classification
- **Text characteristics**: Average 394 chars, highly variable length (10-5000+ chars)
- **Rare categories**: threat (0.3%), severe_toxic (1%), identity_hate (0.9%) need special handling

## Key Decisions

### What to Keep:
- **Capitalization**: Toxic comments often use EXCESSIVE CAPS - keep as signal
- **Punctuation**: Multiple !!! or ??? often indicate emotion/toxicity
- **Contractions**: Keep natural language patterns (won't, can't, etc.)

### What to Remove:
- URLs and links (not relevant to toxicity)
- Newline characters and extra whitespace
- Non-ASCII characters that add noise

### What NOT to do:
- ❌ No stemming/lemmatization (transformers handle word variations)
- ❌ No stopword removal (context matters: "you are stupid" needs "are")
- ❌ No aggressive normalization (preserve linguistic patterns)

## Preprocessing Pipeline

### 1. Text Cleaning
   - Remove URLs, IP addresses, email addresses
   - Normalize whitespace (tabs, newlines, multiple spaces)
   - Remove non-printable characters
   - Handle special Wikipedia markup ([[ ]], {{ }})
   - Keep punctuation and capitalization

### 2. Data Splitting
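   - 80% train (127,657 samples planned; 127,647 after empty comments are dropped)
   - 10% validation (15,957 samples planned; 15,956 after empty comments are dropped)
   - 10% test (15,957 samples planned; 15,956 after empty comments are dropped)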
   - **Stratified by toxic label** to maintain 90-10 distribution
   - Random seed for reproducibility

### 3. Validation Checks
   - Verify class distribution in all splits
   - Check for data leakage
   - Validate text quality after cleaning
   - Ensure no empty strings after preprocessing

### 4. Save Processed Data
   - Export train/val/test CSVs
   - Document all preprocessing parameters
   - Save to Google Drive for persistence

---

## Dataset Statistics

- **Total samples**: 159,571
- **Label columns**: toxic, severe_toxic, obscene, threat, insult, identity_hate
- **Target splits**: 80-10-10 (train-val-test)
- **Imbalance ratio**: 8.8:1 (clean:toxic)

---

**Author**: Aravind S  
**Date**: December 7, 2025  
**Preprocessing Version**: 1.0  
**GitHub**: https://github.com/Arv-ind-s/content-moderation-system

---

## 1. Setup and Load Data

In [1]:
# Import libraries
import pandas as pd
import numpy as np
import re
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


In [3]:
# Load data using Kaggle API (same as EDA)
!pip install -q kaggle

from google.colab import files
uploaded = files.upload()  # Upload kaggle.json

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json

!kaggle competitions download -c jigsaw-toxic-comment-classification-challenge
!unzip -o -q jigsaw-toxic-comment-classification-challenge.zip  # -o overwrites without prompting on re-runs
!unzip -o -q train.csv.zip  # Unzip train.csv from its archive

# Load the data
train = pd.read_csv('train.csv')
print(f"✅ Loaded {len(train):,} training samples")
print(f"Columns: {train.columns.tolist()}")
train.head()

✅ Loaded 159,571 training samples
Columns: ['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']


| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |


## 2. Text Cleaning Function

Our cleaning strategy balances removing noise while preserving signals:
- Keep CAPS (toxicity signal)
- Keep punctuation patterns (!!!, ???)
- Remove URLs, special markup, non-printable chars
- Normalize whitespace
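
The cleaning function in the next cell does not strip non-printable characters or `{{ }}` templates, although both appear in the pipeline overview. A minimal sketch of how those two steps might look is shown here. The helper name `strip_markup_and_control_chars` and its pattern are assumptions for illustration and are not part of the original notebook.

```python
import re
import unicodedata

# Hypothetical helper, not part of the original notebook.
# Matches non-nested, single-line {{ ... }} templates.
TEMPLATE_PATTERN = re.compile(r'\{\{.*?\}\}')

def strip_markup_and_control_chars(text: str) -> str:
    """Drop {{ }} templates and control characters, keeping case and punctuation."""
    text = TEMPLATE_PATTERN.sub(' ', text)
    chars = []
    for ch in text:
        if ch in '\n\t\r':
            # Line breaks and tabs become spaces rather than being deleted
            chars.append(' ')
        elif unicodedata.category(ch) != 'Cc':
            # Unicode category "Cc" covers control characters such as \x00 or \x07
            chars.append(ch)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r'\s+', ' ', ''.join(chars)).strip()

print(strip_markup_and_control_chars("STOP{{unsigned}} vandalising!!!\x07\nNow"))
```

Capitalization and repeated punctuation pass through untouched, which matches the "what to keep" decisions above.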

In [6]:
def clean_text(text):
    """
    Clean text while preserving toxicity signals.

    Args:
        text (str): Raw comment text

    Returns:
        str: Cleaned text
    """
    if pd.isna(text):
        return ""

    # Convert to string
    text = str(text)

    # Remove newlines and tabs
    text = text.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')

    # Remove URLs
    url_pattern = r'(https?://\S+|www\.\S+|\S+\.(com|org|net|in|info|io)\S*)'
    text = re.sub(url_pattern, '', text)

    # Remove IP addresses
    ip_pattern = r'\b(?:\d{1,3}\.){3}\d{1,3}\b'
    text = re.sub(ip_pattern, '', text)

    # Remove email addresses
    email_pattern = r'\S+@\S+'
    text = re.sub(email_pattern, '', text)

    # Remove Wikipedia markup
    wiki_pattern = r'\[\[.*?\]\]'
    text = re.sub(wiki_pattern, '', text)

    # Normalize whitespace (multiple spaces to single space)
    text = re.sub(r'\s+', ' ', text)

    # Strip leading/trailing spaces
    text = text.strip()

    return text
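
One caveat about `url_pattern` in `clean_text` above: its bare-domain branch has no boundary after the top-level domain, so it also matches ordinary words that follow a period without a space, such as `fine.interesting`. The sketch below compares the original pattern with a stricter variant. `strict_url_pattern` is a hypothetical alternative, not the notebook's pattern, and swapping it in would change the counts reported in later cells, so it is shown only for comparison.

```python
import re

# Pattern used in clean_text above
url_pattern = r'(https?://\S+|www\.\S+|\S+\.(com|org|net|in|info|io)\S*)'

# Hypothetical stricter variant: the suffix must end at a word boundary
strict_url_pattern = r'(https?://\S+|www\.\S+|\S+\.(?:com|org|net|in|info|io)\b\S*)'

sample = "That edit was fine.interesting point though, see example.com/page"

# The original pattern removes "fine.interesting" along with the real link
print(repr(re.sub(url_pattern, '', sample)))

# The stricter variant removes only "example.com/page"
print(repr(re.sub(strict_url_pattern, '', sample)))
```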

In [7]:
# Test on actual data samples
print("Testing on real data:\n")
test_indices = [10, 50, 100, 500, 1000]

for idx in test_indices:
    original = train.iloc[idx]['comment_text']
    cleaned = clean_text(original)

    print(f"Sample {idx}:")
    print(f"Original ({len(original)} chars): {original[:100]}...")
    print(f"Cleaned  ({len(cleaned)} chars): {cleaned[:100]}...")
    print("-" * 80)

Testing on real data:

Sample 10:
Original (2875 chars): "
Fair use rationale for Image:Wonju.jpg

Thanks for uploading Image:Wonju.jpg. I notice the image p...
Cleaned  (2866 chars): " Fair use rationale for Image:Wonju.jpg Thanks for uploading Image:Wonju.jpg. I notice the image pa...
--------------------------------------------------------------------------------
Sample 50:
Original (3150 chars): "

BI, you said you wanted to talk

At the bottom of the lead section you have written:

""Its promo...
Cleaned  (3129 chars): " BI, you said you wanted to talk At the bottom of the lead section you have written: ""Its promoter...
--------------------------------------------------------------------------------
Sample 100:
Original (96 chars): However, the Moonlite edit noted by golden daph was me (on optus ...)  Wake up wikkis.  So funny...
Cleaned  (94 chars): However, the Moonlite edit noted by golden daph was me (on optus ...) Wake up wikkis. So funny...
---------------------------------

## 3. Apply Cleaning to Dataset

In [8]:
# Apply cleaning function to all comments
print("Cleaning all comments...")
train['comment_text_clean'] = train['comment_text'].apply(clean_text)

# Check for empty strings after cleaning
empty_count = (train['comment_text_clean'] == '').sum()
print(f"\n⚠️ Empty comments after cleaning: {empty_count}")

# Show before/after statistics
print("\nText length comparison:")
print(f"Original - Mean: {train['comment_text'].str.len().mean():.1f} chars")
print(f"Cleaned  - Mean: {train['comment_text_clean'].str.len().mean():.1f} chars")

# Show some examples
print("\n" + "="*80)
print("Sample Before/After:")
print("="*80)
for i in range(3):
    print(f"\nSample {i+1}:")
    print(f"BEFORE: {train.iloc[i]['comment_text'][:200]}")
    print(f"AFTER:  {train.iloc[i]['comment_text_clean'][:200]}")

Cleaning all comments...

⚠️ Empty comments after cleaning: 12

Text length comparison:
Original - Mean: 394.1 chars
Cleaned  - Mean: 386.2 chars

Sample Before/After:

Sample 1:
BEFORE: Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove th
AFTER:  Explanation Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove th

Sample 2:
BEFORE: D'aww! He matches this background colour I'm seemingly stuck with. Thanks.  (talk) 21:51, January 11, 2016 (UTC)
AFTER:  D'aww! He matches this background colour I'm seemingly stuck with. Thanks. (talk) 21:51, January 11, 2016 (UTC)

Sample 3:
BEFORE: Hey man, I'm really not trying to edit war. It's just that this guy is constantly removing relevant information and talking to m

In [9]:
# Remove empty comments after cleaning
size_before = len(train)
print(f"Dataset size before: {size_before}")
train = train[train['comment_text_clean'] != ''].reset_index(drop=True)
print(f"Dataset size after: {len(train)}")
removed = size_before - len(train)  # computed rather than hard-coded
print(f"Removed: {removed} empty comments ({removed / size_before * 100:.3f}%)")

# Verify no empty strings remain
assert (train['comment_text_clean'] == '').sum() == 0, "Empty strings still present!"
print("✅ No empty comments remaining")

Dataset size before: 159571
Dataset size after: 159559
Removed: 12 empty comments (0.008%)
✅ No empty comments remaining


## 4. Stratified Train-Validation-Test Split

### Strategy:
- 80% Train / 10% Validation / 10% Test
- **Stratified by 'toxic' label** to maintain 90-10 distribution
- Random seed = 42 for reproducibility

### Why Stratified?
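With roughly 90% clean and 10% toxic comments, a purely random split can drift from that ratio by chance, for example:
- Train set with 92% clean (too easy)
- Test set with 8% toxic (not representative)

Stratification ensures all splits have ~90% clean, ~10% toxic. Only the `toxic` column is stratified here, so the rarer labels can still vary between splits.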
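
To make the idea concrete, here is a minimal pure-Python sketch of what stratification does: it splits each class separately and then merges the pieces, so every split inherits the overall class ratio. The function name `stratified_split` and the toy labels are assumptions for illustration. The notebook itself relies on scikit-learn's `train_test_split` with the `stratify` argument.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction, seed=42):
    """Return (train_idx, test_idx) so each class keeps roughly the same share in both."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle and cut each class independently, then merge
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

def toxic_rate(labels, indices):
    """Share of positive labels among the given indices."""
    return sum(labels[i] for i in indices) / len(indices)

# Toy data: 900 clean (0) and 100 toxic (1) comments
labels = [0] * 900 + [1] * 100
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)

print(f"train toxic rate: {toxic_rate(labels, train_idx):.2%}")
print(f"test toxic rate:  {toxic_rate(labels, test_idx):.2%}")
```

Both splits of the toy data end up with the same 10% toxic share as the full set.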

In [10]:
from sklearn.model_selection import train_test_split

# Define label columns
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Prepare features and labels
X = train['comment_text_clean']  # Cleaned text
y = train[label_cols]  # All 6 labels

print("Original dataset:")
print(f"Total samples: {len(train):,}")
print(f"Toxic samples: {train['toxic'].sum():,} ({train['toxic'].mean()*100:.2f}%)")
print(f"Clean samples: {(~train['toxic'].astype(bool)).sum():,} ({(1-train['toxic'].mean())*100:.2f}%)")

Original dataset:
Total samples: 159,559
Toxic samples: 15,294 (9.59%)
Clean samples: 144,265 (90.41%)
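
As a quick arithmetic check, the counts printed above give the clean-to-toxic ratio for the `toxic` column alone. The sketch below only uses the numbers from that output. The 8.8:1 figure quoted from the EDA may rest on a different definition of "toxic" (for example, any of the six labels set); that is an assumption and is not confirmed in this notebook.

```python
# Counts copied from the output above
toxic_count = 15_294
clean_count = 144_265

ratio = clean_count / toxic_count
print(f"clean:toxic = {ratio:.2f}:1")  # prints "clean:toxic = 9.43:1"
```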


In [11]:
# First split: 80% train, 20% temp (will become val + test)
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y,
    test_size=0.2,  # 20% for validation + test
    random_state=RANDOM_SEED,
    stratify=train['toxic']  # Stratify by toxic label
)

# Second split: Split temp into 50-50 (val and test)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp,
    test_size=0.5,  # 50% of 20% = 10% each
    random_state=RANDOM_SEED,
    stratify=y_temp['toxic']  # Stratify by toxic label
)

print("✅ Data split complete!")
print(f"\nTrain set: {len(X_train):,} samples ({len(X_train)/len(train)*100:.1f}%)")
print(f"Val set:   {len(X_val):,} samples ({len(X_val)/len(train)*100:.1f}%)")
print(f"Test set:  {len(X_test):,} samples ({len(X_test)/len(train)*100:.1f}%)")

✅ Data split complete!

Train set: 127,647 samples (80.0%)
Val set:   15,956 samples (10.0%)
Test set:  15,956 samples (10.0%)


In [12]:
# Verify class distribution is maintained across splits
print("="*80)
print("CLASS DISTRIBUTION VERIFICATION")
print("="*80)

splits = {
    'Original': train['toxic'],
    'Train': y_train['toxic'],
    'Validation': y_val['toxic'],
    'Test': y_test['toxic']
}

for split_name, split_data in splits.items():
    toxic_pct = split_data.mean() * 100
    clean_pct = (1 - split_data.mean()) * 100
    print(f"\n{split_name}:")
    print(f"  Toxic: {split_data.sum():,} ({toxic_pct:.2f}%)")
    print(f"  Clean: {(~split_data.astype(bool)).sum():,} ({clean_pct:.2f}%)")

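# Verify all splits have a similar toxic percentage (should be ~9.5-9.6%)
toxic_rates = [s.mean() * 100 for s in splits.values()]
assert max(toxic_rates) - min(toxic_rates) < 0.1, "Toxic rate differs across splits!"
print("\n✅ Stratification successful - all splits maintain similar distribution!")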

CLASS DISTRIBUTION VERIFICATION

Original:
  Toxic: 15,294 (9.59%)
  Clean: 144,265 (90.41%)

Train:
  Toxic: 12,235 (9.59%)
  Clean: 115,412 (90.41%)

Validation:
  Toxic: 1,530 (9.59%)
  Clean: 14,426 (90.41%)

Test:
  Toxic: 1,529 (9.58%)
  Clean: 14,427 (90.42%)

✅ Stratification successful - all splits maintain similar distribution!


In [13]:
# Combine features and labels back into DataFrames for easy handling
train_df = pd.DataFrame({
    'comment_text': X_train.values,
    **{col: y_train[col].values for col in label_cols}
})

val_df = pd.DataFrame({
    'comment_text': X_val.values,
    **{col: y_val[col].values for col in label_cols}
})

test_df = pd.DataFrame({
    'comment_text': X_test.values,
    **{col: y_test[col].values for col in label_cols}
})

print("✅ DataFrames created!")
print(f"\nTrain: {train_df.shape}")
print(f"Val:   {val_df.shape}")
print(f"Test:  {test_df.shape}")

# Show samples
print("\nTrain set sample:")
display(train_df.head(3))

✅ DataFrames created!

Train: (127647, 7)
Val:   (15956, 7)
Test:  (15956, 7)

Train set sample:


| | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | you are a chicken shit cock sucking pussy bast... | 1 | 1 | 1 | 0 | 1 | 0 |
| 1 | Why was the link for Cell/Mobile phone ebooks ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | " Simply stating it isn't good enough, please ... | 0 | 0 | 0 | 0 | 0 | 0 |
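
The pipeline overview lists a data-leakage check. One simple form is to confirm that no cleaned comment appears in more than one split. A minimal sketch follows, assuming a hypothetical helper `shared_texts` and toy lists standing in for the real columns:

```python
def shared_texts(split_a, split_b):
    """Return the set of texts that occur in both splits."""
    return set(split_a) & set(split_b)

# Toy stand-ins; in the notebook these would be train_df['comment_text'], etc.
train_texts = ["you are wrong", "thanks for the edit", "see talk page"]
val_texts = ["please stop", "thanks for the edit"]
test_texts = ["nice article"]

pairs = [
    ("train/val", train_texts, val_texts),
    ("train/test", train_texts, test_texts),
    ("val/test", val_texts, test_texts),
]
for name, a, b in pairs:
    overlap = shared_texts(a, b)
    print(f"{name}: {len(overlap)} shared comment(s)")
```

Because cleaning can make distinct raw comments identical, a non-zero count on the real data would not necessarily point to a bug in the split, only to duplicates that may be worth dropping.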


## 5. Save Processed Datasets

Save cleaned and split data for model training phase.

In [14]:
# Save to CSV files
train_df.to_csv('train_processed.csv', index=False)
val_df.to_csv('val_processed.csv', index=False)
test_df.to_csv('test_processed.csv', index=False)

print("✅ Saved processed datasets:")
print("  - train_processed.csv")
print("  - val_processed.csv")
print("  - test_processed.csv")

# Optional: Download to your computer
from google.colab import files
files.download('train_processed.csv')
files.download('val_processed.csv')
files.download('test_processed.csv')

✅ Saved processed datasets:
  - train_processed.csv
  - val_processed.csv
  - test_processed.csv
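
The pipeline overview also calls for documenting the preprocessing parameters. One way to do that is a small JSON file saved next to the CSVs. The sketch below is an assumption about what such a record might contain: the file name `preprocessing_params.json` and the key names are hypothetical, while the values are taken from this notebook.

```python
import json

# Hypothetical parameter record; key names are illustrative only
params = {
    "preprocessing_version": "1.0",
    "random_seed": 42,
    "split": {"train": 0.8, "val": 0.1, "test": 0.1},
    "stratify_by": "toxic",
    "removed": ["urls", "ip_addresses", "emails", "wiki_links", "extra_whitespace"],
    "kept": ["capitalization", "punctuation", "contractions"],
    "empty_comments_dropped": 12,
}

with open("preprocessing_params.json", "w") as f:
    json.dump(params, f, indent=2)

# Read the file back to confirm it round-trips
with open("preprocessing_params.json") as f:
    loaded = json.load(f)
print(loaded["split"])
```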

