# Phase 1: Model Experimentation & Selection

## Objective
In this notebook, we will experiment with different approaches to solve the Toxicity Detection problem. We will start with a simple baseline and then move to a Deep Learning solution. The goal is to decide the final architecture and preprocessing steps for our production code.

**Experiments:**
1. **Baseline**: TF-IDF Vectorization + Logistic Regression.
2. **Deep Learning**: LSTM (Long Short-Term Memory) with Keras.

**Outcome:**
At the end of this notebook, we will clearly state the configurations (Max Length, Vocab Size, Model Arch) that will be moved to `src/`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score, f1_score

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional

import os

%matplotlib inline

## 1. Data Preparation

In [None]:
# Load raw data
DATA_PATH = '../data/raw/train.csv'

if not os.path.exists(DATA_PATH):
    # Fail fast: every later cell depends on `df`
    raise FileNotFoundError(f"Data not found at {DATA_PATH}")

df = pd.read_csv(DATA_PATH)
# Minimal cleaning for experiment
df['comment_text'] = df['comment_text'].fillna('')
print(f"Loaded {len(df)} rows.")

# Target variable
target_col = 'toxic'

# Sample for faster experimentation (Optional - comment out for full run)
# df = df.sample(20000, random_state=42)

X = df['comment_text'].astype(str)
y = df[target_col]

# Split: 80% Train, 20% Test (stratified so both splits keep the same class ratio)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train shape: {X_train.shape}")
print(f"Test shape: {X_test.shape}")

## 2. Baseline Model: TF-IDF + Logistic Regression
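Always start simple. If a simple model already reaches about 95% accuracy, a complex deep learning model might be overkill. When one class dominates the data, also check the per-class F1-score, not just accuracy.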
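
One caveat when judging the baseline by accuracy: on imbalanced labels, a trivial model can score high accuracy while never finding the minority class. The sketch below uses made-up labels (not the project data) and two small helper functions written only for this illustration.

```python
# Synthetic labels for illustration only: 95 non-toxic (0), 5 toxic (1)
y_true = [0] * 95 + [1] * 5
# A trivial "model" that always predicts non-toxic
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_positive(y_true, y_pred):
    """F1 for the positive class (label 1); returns 0.0 when there are no true positives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


print("Accuracy:", accuracy(y_true, y_pred))        # 0.95
print("F1 (toxic):", f1_positive(y_true, y_pred))   # 0.0
```

This is why the class `1` row of the classification report is a better yardstick than the headline accuracy.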

In [None]:
# Vectorize
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# Train
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_tfidf, y_train)

# Evaluate
y_pred_lr = lr_model.predict(X_test_tfidf)
print("Baseline Accuracy:", accuracy_score(y_test, y_pred_lr))
print("Baseline F1 (toxic):", f1_score(y_test, y_pred_lr))  # F1 for class 1
print("\nClassification Report (Baseline):\n")
print(classification_report(y_test, y_pred_lr))

## 3. Deep Learning Experiment: LSTM
Now we try a specialized sequence model. Unlike TF-IDF, which ignores word order, an LSTM reads tokens in sequence and can use the surrounding context of each word.

In [None]:
# Configuration (These are the Hyperparameters we define for production)
MAX_VOCAB_SIZE = 20000  # Max unique words to keep
MAX_LEN = 150           # Max length of a comment (based on EDA)
EMBEDDING_DIM = 64      # Size of word vectors

# 1. Tokenization
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE)
tokenizer.fit_on_texts(X_train)

# 2. Convert text to sequences
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# 3. Padding (make all sequences same length; by default Keras pads and truncates at the front)
X_train_pad = pad_sequences(X_train_seq, maxlen=MAX_LEN)
X_test_pad = pad_sequences(X_test_seq, maxlen=MAX_LEN)

print(f"Data prepared. X_train_pad shape: {X_train_pad.shape}")
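
To make the three preprocessing steps concrete, here is a simplified pure-Python imitation of what they do. It is not the Keras implementation (the real `Tokenizer` also strips punctuation, for example), and the function names and toy texts are assumptions made for this sketch.

```python
from collections import Counter


def fit_word_index(texts, num_words):
    """Rank words by frequency; index 1 is the most frequent word.

    Only indices below `num_words` are kept. Index 0 is reserved for padding.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}


def to_sequence(text, word_index):
    """Map words to integer ids, dropping out-of-vocabulary words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last `maxlen` tokens."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq


texts = ["you are great", "you are awful", "you"]
word_index = fit_word_index(texts, num_words=3)
print(word_index)  # {'you': 1, 'are': 2}

short = pad_pre(to_sequence("you are awful", word_index), maxlen=5)
long = pad_pre(to_sequence("are you you are you are", word_index), maxlen=5)
print(short)  # [0, 0, 0, 1, 2]
print(long)   # [1, 1, 2, 1, 2]
```

Note that `num_words=3` keeps only two words, because index 0 is never assigned to a word. The same holds for `MAX_VOCAB_SIZE`.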

In [None]:
# Define the Model Architecture
model = Sequential([
    tf.keras.Input(shape=(MAX_LEN,)),               # Padded token ids; replaces the deprecated input_length argument
    Embedding(MAX_VOCAB_SIZE, EMBEDDING_DIM),
    Bidirectional(LSTM(64, return_sequences=True)), # Bidirectional learns context from both directions
    tf.keras.layers.GlobalMaxPool1D(),              # Max over time steps: one fixed-size vector per comment
    Dense(64, activation='relu'),
    Dropout(0.3),                                   # Prevents overfitting
    Dense(1, activation='sigmoid')                  # Binary classification output
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()
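
As a sanity check on `model.summary()`, the trainable parameter count can be worked out by hand. The sketch below assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and one bias vector). The helper names are hypothetical, and the result should be compared against the summary output rather than taken as authoritative.

```python
MAX_VOCAB_SIZE = 20000
EMBEDDING_DIM = 64
LSTM_UNITS = 64
DENSE_UNITS = 64


def embedding_params(vocab_size, dim):
    """One vector of size `dim` per vocabulary index."""
    return vocab_size * dim


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units


counts = {
    "embedding": embedding_params(MAX_VOCAB_SIZE, EMBEDDING_DIM),  # 1,280,000
    "bi_lstm": 2 * lstm_params(EMBEDDING_DIM, LSTM_UNITS),         # 66,048 (two directions)
    "dense": dense_params(2 * LSTM_UNITS, DENSE_UNITS),            # 8,256 (input is 128 wide)
    "output": dense_params(DENSE_UNITS, 1),                        # 65
}
total = sum(counts.values())
print(counts)
print("Total:", total)  # 1354369
```

`GlobalMaxPool1D` and `Dropout` have no trainable parameters, so they do not appear in the count. Almost all of the parameters sit in the embedding table, which is why the vocabulary size is the main lever on model size.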

In [None]:
# Train (Using fewer epochs for experiment)
# In production, we might use EarlyStopping and more epochs
history = model.fit(
    X_train_pad, y_train,
    batch_size=32,
    epochs=2,  # Keep it short for notebook experiment
    validation_split=0.1
)
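
The comment above mentions early stopping. Keras provides this as the `tf.keras.callbacks.EarlyStopping` callback (with parameters such as `monitor`, `patience`, and `restore_best_weights`). The sketch below is only a simplified pure-Python illustration of the patience idea, not Keras's implementation. The function name and the validation losses are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch where the validation loss has failed to
    improve by more than `min_delta` for `patience` consecutive epochs. If that
    never happens, stop_epoch is the last epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Made-up validation losses: improving until epoch 2, then getting worse
val_losses = [0.30, 0.22, 0.20, 0.21, 0.23, 0.25]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 4 2
```

In this toy run training would halt at epoch 4, and the weights from epoch 2 are the ones worth keeping.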

In [None]:
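# Evaluate LSTM
y_pred_probs = model.predict(X_test_pad).ravel()  # Flatten (n, 1) probabilities to shape (n,)
y_pred_lstm = (y_pred_probs > 0.5).astype(int)    # 0.5 decision threshold

print("LSTM Accuracy:", accuracy_score(y_test, y_pred_lstm))
print("LSTM F1 (toxic):", f1_score(y_test, y_pred_lstm))  # F1 for class 1
print("\nClassification Report (LSTM):\n")
print(classification_report(y_test, y_pred_lstm))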

## 4. Final Decision & Conclusion

**Comparison:**
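- Compare the F1-scores of class `1` (Toxic) for both models. Accuracy alone is not a reliable guide if the classes are imbalanced.
- Deep learning often outperforms on larger datasets or more complex language structures, but confirm this from the two classification reports rather than assuming it.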

**Production Plan:**
Based on these experiments, we will adopt the **LSTM approach** for our production codebase.

**Parameters for `src/`:**
- `MAX_WORDS` (vocabulary size, called `MAX_VOCAB_SIZE` in this notebook) = 20,000
- `MAX_LEN` = 150
- `EMBEDDING_DIM` = 64
- Architecture: Embedding -> Bi-LSTM -> GlobalMaxPool -> Dense -> Dropout -> Output
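
One way to carry these values into `src/` is a single immutable config object. The sketch below is only a suggestion: the class name, the field names, and where the file would live are all hypothetical, and only the values come from this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Hypothetical container for the hyperparameters chosen in this notebook."""

    max_words: int = 20_000      # Vocabulary size (MAX_VOCAB_SIZE above)
    max_len: int = 150           # Tokens per comment after padding
    embedding_dim: int = 64      # Size of word vectors
    lstm_units: int = 64         # Units per LSTM direction
    dense_units: int = 64        # Hidden Dense layer width
    dropout_rate: float = 0.3    # Dropout before the output layer


CONFIG = ModelConfig()
print(CONFIG)
```

Freezing the dataclass means training and inference code cannot silently change a value, so the tokenizer and the model stay in sync.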