# Notebook 0: EDA and Data Preprocessing

This notebook performs exploratory data analysis (EDA) of the Jigsaw Toxic Comment Classification dataset and applies basic text cleaning before computing length statistics.

## Contents
1. Load and inspect data
2. Label distribution analysis
3. Text length distribution (after basic cleaning)
4. Label co-occurrence and correlation
5. Sample comments for each label


In [None]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.config import DATA_DIR, LABELS
from src.data_utils import load_raw_jigsaw, get_label_stats, basic_text_clean

# Set plot style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)


## 1. Load Dataset


In [None]:
df = load_raw_jigsaw(DATA_DIR / "jigsaw_train.csv")
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
df.head()
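
Before moving on to the plots, it can help to check the loaded frame for obvious integrity problems. The helper below is a minimal sketch, not part of `src.data_utils`: the function name `quick_integrity_report` and the toy frame are hypothetical, and it only assumes a text column plus 0/1 label columns, as used later in this notebook.

```python
import pandas as pd

def quick_integrity_report(frame, text_col, labels):
    """Count common data problems in a text + binary-label frame (illustrative helper)."""
    text = frame[text_col]
    return {
        # Rows where the text is missing (NaN/None)
        "missing_text": int(text.isna().sum()),
        # Rows where the text is missing or whitespace-only
        "blank_text": int((text.fillna("").astype(str).str.strip() == "").sum()),
        # Rows whose text repeats an earlier row exactly
        "duplicate_text": int(text.duplicated().sum()),
        # Rows with any label value other than 0 or 1
        "non_binary_labels": int((~frame[labels].isin([0, 1])).any(axis=1).sum()),
    }

# Toy data standing in for the real dataset
toy = pd.DataFrame({
    "comment_text": ["hello", "hello", None, "  "],
    "toxic": [0, 1, 0, 2],
})
report = quick_integrity_report(toy, "comment_text", ["toxic"])
print(report)
```

On the real data, the equivalent call would be `quick_integrity_report(df, "comment_text", LABELS)`.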


## 2. Label Distribution Analysis

Understanding class imbalance is critical for toxic comment classification.


In [None]:
label_stats = get_label_stats(df)
print("Label Statistics:")
print(label_stats)
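
The implementation of `get_label_stats` lives in `src.data_utils` and is not shown here. Judging only from how its result is used in this notebook (columns `label`, `count`, and `ratio`), an equivalent computation might look like the sketch below. The function name `label_stats_sketch` and the toy data are assumptions for illustration.

```python
import pandas as pd

def label_stats_sketch(frame, labels):
    """Return one row per label with its positive count and positive ratio."""
    # Summing 0/1 columns gives the number of positive examples per label
    counts = frame[labels].sum(axis=0)
    return pd.DataFrame({
        "label": labels,
        "count": counts.to_numpy().astype(int),
        "ratio": (counts / len(frame)).to_numpy(),
    })

# Toy data with hypothetical label columns
toy = pd.DataFrame({"toxic": [1, 0, 1, 0], "threat": [0, 0, 1, 0]})
toy_stats = label_stats_sketch(toy, ["toxic", "threat"])
print(toy_stats)
```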


In [None]:
# Visualize label distribution
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Bar plot of counts
ax1.bar(label_stats["label"], label_stats["count"], color='steelblue')
ax1.set_xlabel("Label")
ax1.set_ylabel("Positive Count")
ax1.set_title("Label Distribution (Absolute Counts)")
ax1.tick_params(axis='x', rotation=45)

# Bar plot of ratios
ax2.bar(label_stats["label"], label_stats["ratio"], color='coral')
ax2.set_xlabel("Label")
ax2.set_ylabel("Positive Ratio")
ax2.set_title("Label Distribution (Ratios)")
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

# Key observations
print("\nKey Observations:")
rarest = label_stats.loc[label_stats["ratio"].idxmin()]
most_common = label_stats.loc[label_stats["ratio"].idxmax()]
print(f"- Rarest label: {rarest['label']} ({rarest['ratio']:.4f})")
print(f"- Most common label: {most_common['label']} ({most_common['ratio']:.4f})")
print(f"- Share of comments with at least one label: {df[LABELS].any(axis=1).mean():.4f}")


## 3. Text Length Distribution


In [None]:
# Clean the raw text, then count whitespace-separated tokens per comment
df["clean_text"] = df["comment_text"].astype(str).apply(basic_text_clean)
df["length"] = df["clean_text"].str.split().str.len()

print("Text Length Statistics:")
print(df["length"].describe())

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram
axes[0].hist(df["length"], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
axes[0].set_xlabel("Number of Tokens")
axes[0].set_ylabel("Frequency")
axes[0].set_title("Comment Length Distribution")
axes[0].axvline(df["length"].median(), color='red', linestyle='--', label=f'Median: {df["length"].median():.0f}')
axes[0].legend()

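# Box plot comparing toxic vs non-toxic
df["any_toxic"] = df[LABELS].any(axis=1)
axes[1].boxplot([df.loc[~df["any_toxic"], "length"], df.loc[df["any_toxic"], "length"]])
# Set tick labels explicitly; boxplot's `labels` keyword is deprecated in recent Matplotlib releases
axes[1].set_xticklabels(["Non-toxic", "Toxic"])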
axes[1].set_ylabel("Number of Tokens")
axes[1].set_title("Length: Toxic vs Non-toxic Comments")

plt.tight_layout()
plt.show()
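
One way to turn the length distribution into a candidate maximum sequence length is to read off upper quantiles. The sketch below uses a hypothetical helper `suggest_max_length` and toy lengths. It assumes whitespace token counts like the `length` column above; a subword tokenizer will usually produce more tokens per comment, so treat the result as a rough starting point only.

```python
import numpy as np
import pandas as pd

def suggest_max_length(lengths, coverage=0.95):
    """Token count at the given quantile of the length distribution, rounded up."""
    return int(np.ceil(lengths.quantile(coverage)))

# Toy lengths with a long right tail, standing in for df["length"]
toy_lengths = pd.Series([3, 5, 8, 12, 20, 40, 45, 60, 150, 900])
for q in (0.50, 0.90, 0.95, 0.99):
    print(f"{q:.0%} quantile -> {suggest_max_length(toy_lengths, q)} tokens")
```

Comparing several quantiles shows how much the long tail inflates the limit: a handful of very long comments can push the 99% value far above the 95% value.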


## 4. Label Co-occurrence Matrix

Multi-label datasets often have correlated labels. Knowing which labels tend to appear together helps inform model design.


In [None]:
# Pearson correlation between the binary label columns (the phi coefficient for 0/1 data)
label_corr = df[LABELS].corr()

# Visualize
plt.figure(figsize=(8, 6))
sns.heatmap(label_corr, annot=True, fmt='.2f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title("Label Correlation Matrix")
plt.tight_layout()
plt.show()

# Count multi-label examples
num_labels_per_example = df[LABELS].sum(axis=1)
print("\nNumber of labels per example:")
print(num_labels_per_example.value_counts().sort_index())
print(f"\nExamples with multiple labels: {(num_labels_per_example > 1).sum()} ({(num_labels_per_example > 1).mean():.2%})")
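
Correlation summarizes how labels move together, but raw co-occurrence counts and conditional rates are often easier to interpret for rare labels. The following is a minimal sketch with a hypothetical helper `cooccurrence_tables` and toy data. It assumes the label columns contain only 0 and 1.

```python
import numpy as np
import pandas as pd

def cooccurrence_tables(frame, labels):
    """Return (counts, conditional) for binary label columns.

    counts.loc[a, b]      = number of rows where both a and b are 1
    conditional.loc[a, b] = P(b = 1 | a = 1), NaN if a never occurs
    """
    y = frame[labels].astype(int)
    # For 0/1 data, Y^T Y counts the rows in which each pair of labels is jointly positive
    counts = y.T.dot(y)
    # The diagonal holds each label's own positive count
    positives = pd.Series(np.diag(counts.to_numpy()), index=labels).replace(0, np.nan)
    # Divide each row a by the number of positives of a
    conditional = counts.div(positives, axis=0)
    return counts, conditional

# Toy data with hypothetical label columns
toy = pd.DataFrame({
    "toxic":   [1, 1, 1, 0],
    "obscene": [1, 1, 0, 0],
    "threat":  [0, 0, 1, 0],
})
counts, conditional = cooccurrence_tables(toy, ["toxic", "obscene", "threat"])
print(counts)
print(conditional.round(2))
```

Unlike the correlation matrix, the conditional table is not symmetric: in the toy data every `obscene` row is also `toxic`, but not every `toxic` row is `obscene`.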


## 5. Sample Comments

Let's examine some examples for each label.


In [None]:
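for label in LABELS:
    print(f"\n{'='*60}")
    print(f"Examples for label: {label.upper()}")
    print('='*60)
    # First three comments that carry this label
    subset = df[df[label] == 1].head(3)
    for _, row in subset.iterrows():
        text = str(row['comment_text'])
        # Truncate long comments and mark the truncation only when it happens
        preview = text[:200] + ("..." if len(text) > 200 else "")
        print(f"\n{preview}")
        print(f"Labels: {[name for name in LABELS if row[name] == 1]}")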


## Summary

**Key Findings:**
1. **Class Imbalance**: Rare labels like `threat` and `identity_hate` present significant challenges
2. **Text Length**: Toxic comments tend to be slightly shorter than non-toxic ones
3. **Multi-label Nature**: Strong correlations exist between certain labels (e.g., `toxic` and `obscene`)
4. **Data Quality**: Comment length and style vary widely, so cleaning and truncation choices affect what the model sees

**Implications for Modeling:**
- Need techniques to handle class imbalance (focal loss, resampling, class weights)
- Rare labels may benefit from specialized approaches (lexicon features, transfer learning)
- Multi-label modeling should capture label dependencies
- Consider maximum sequence length based on length distribution
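
As one illustration of the class-weight option listed above, per-label positive weights can be derived from the label counts. This is a sketch under stated assumptions, not the project's chosen method: the helper `positive_class_weights`, the cap value, and the toy data are hypothetical. The negative-to-positive ratio used here follows the convention commonly used for the `pos_weight` argument of binary cross-entropy losses.

```python
import numpy as np
import pandas as pd

def positive_class_weights(frame, labels, max_weight=None):
    """Per-label weight = (#negatives / #positives), optionally capped at max_weight."""
    pos = frame[labels].sum(axis=0).astype(float)
    neg = len(frame) - pos
    # Labels with no positives get NaN instead of an infinite weight
    weights = neg / pos.replace(0, np.nan)
    if max_weight is not None:
        # Capping avoids extreme weights for very rare labels
        weights = weights.clip(upper=max_weight)
    return weights

# Toy data with hypothetical label columns
toy = pd.DataFrame({
    "toxic":  [1, 1, 0, 0, 0, 0, 0, 0],
    "threat": [1, 0, 0, 0, 0, 0, 0, 0],
})
weights = positive_class_weights(toy, ["toxic", "threat"])
capped = positive_class_weights(toy, ["toxic", "threat"], max_weight=5.0)
print(weights)
print(capped)
```

Whether to cap the weights, and at what value, is a tuning decision; very large weights on rare labels can destabilize training.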
