**Toxic Comment Classification Assessment**

The **Jigsaw Toxic Comment Classification** dataset is a collection of Wikipedia talk page comments, labeled for various types of toxicity (toxic, severe_toxic, obscene, threat, insult, identity_hate). The goal of this task is to build a safety model capable of flagging abusive content.

The original labels are multi-label (a comment can be both an insult and a threat). For this project, we only need to know whether a comment is abusive or safe, so we aggregate the labels into a binary classification task: Safe (0) vs. Abusive (1). This focuses the model on detecting general malicious intent, which is critical for content moderation systems.
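
As a small illustration of this aggregation, the sketch below applies a row-wise maximum to a made-up three-row frame (the `toy` values are invented for the example, not taken from the dataset). A row becomes 1 as soon as any of the six labels is 1.

```python
import pandas as pd

label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Hypothetical toy rows: one clean comment, one with several labels, one threat only
toy = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 1, 0],
        [0, 0, 0, 1, 0, 0],
    ],
    columns=label_cols,
)

# Row-wise max over binary columns acts as a logical OR
toy['is_abusive'] = toy[label_cols].max(axis=1)
print(toy['is_abusive'].tolist())  # [0, 1, 1]
```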

Link to dataset: https://www.kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge/data

In [1]:
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
import re
import string
import torch
from sklearn.model_selection import train_test_split

# Load Data
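try:
    df = pd.read_csv('data/train.csv', nrows=50000)  # Using 50k subset for dev
    print(f"Data Loaded: {len(df)} rows")
except FileNotFoundError:
    # Re-raise so later cells do not run against an undefined df
    print("Error loading data: 'data/train.csv' not found")
    raise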

# Pivot to Binary Target
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
df['is_abusive'] = df[label_cols].max(axis=1)

x = df['comment_text']
y = df['is_abusive']

print(f"Sample Text: {x[0]}")
print(f"Label: {y[0]}")

Data Loaded: 50000 rows
Sample Text: Explanation
Why the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27
Label: 0


Section 1 - Preprocessing
User-generated text such as Wikipedia talk page comments is inherently noisy, containing non-standard grammar, punctuation, and mixed casing. Preprocessing is required to standardize the input space before representation learning.

For this pipeline, we define a prep function that performs:

Lowercasing: Maps "Hello" and "hello" to the same token.

Artifact Removal: Strips bracketed wiki-style markup (e.g., [USER]).

Noise Removal: Removes punctuation and special characters using a regular expression.

Digit Removal: Drops any token that contains a digit (e.g., "2nd" or IP address fragments).

In [2]:
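def prep(text_series):
    """Return a list of cleaned strings, one per input comment."""
    cleaned_text = []
    for text in text_series:
        # Lowercase (str() guards against non-string values such as NaN)
        text = str(text).lower()
        # Drop bracketed spans such as [USER]
        text = re.sub(r'\[.*?\]', '', text)
        # Drop all ASCII punctuation characters
        text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
        # Drop any token that contains a digit
        text = re.sub(r'\w*\d\w*', '', text)
        cleaned_text.append(text)
    return cleaned_text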

# Apply Preprocessing
prep_x = prep(x)

# Split Data
X_train, X_test, y_train, y_test = train_test_split(prep_x, y, test_size=0.2, random_state=42)
print("Preprocessing Complete.")

Preprocessing Complete.
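
To make the effect of these steps concrete, here is a minimal sketch that applies the same four regex steps to a single made-up comment. The helper name `clean_one` and the sample string are hypothetical and only serve the illustration. Note that removed spans leave extra whitespace behind, so the result is split into tokens for inspection.

```python
import re
import string

def clean_one(text):
    """Apply the same four steps as the preprocessing cell to one string."""
    text = str(text).lower()                                           # lowercasing
    text = re.sub(r'\[.*?\]', '', text)                                # bracketed markup
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)   # punctuation
    text = re.sub(r'\w*\d\w*', '', text)                               # tokens with digits
    return text

sample = "Hello [USER], you're the 2nd WORST editor!!! abc123"
cleaned = clean_one(sample)
print(cleaned.split())  # ['hello', 'youre', 'the', 'worst', 'editor']
```

The apostrophe in "you're" is removed along with other punctuation, which merges the contraction into "youre". This is a side effect worth keeping in mind when reading model features.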


Section 2 - Representation Learning
Machine learning algorithms cannot process raw text; they require numerical input. This section compares two distinct approaches to representation learning:

1. Statistical: TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) creates a sparse vector representation. It operates under the "Bag-of-Words" assumption, which ignores word order and treats each word as an independent feature. It weighs terms by their rarity across the corpus, assigning higher weight to rare words (like "murder") and lower weight to common words (like "the").
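
The rarity weighting can be checked on a tiny corpus. The sketch below fits scikit-learn's `TfidfVectorizer` with default settings on three invented sentences (the `corpus` is an assumption for illustration) and compares the learned inverse document frequency of "the", which appears in every document, with that of "murder", which appears in one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical three-document corpus
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the murder was reported",
]

vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# vocabulary_ maps each term to its column index; idf_ holds the learned IDF weights
idf_the = vectorizer.idf_[vectorizer.vocabulary_["the"]]
idf_murder = vectorizer.idf_[vectorizer.vocabulary_["murder"]]

print(f"idf('the')    = {idf_the:.3f}")
print(f"idf('murder') = {idf_murder:.3f}")
```

With the default smoothed formula, a term present in every document gets the minimum IDF of 1, and rarer terms get larger values.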

2. Contextual: Transformer Embeddings
Unlike statistical methods, deep learning models such as DistilBERT and RoBERTa learn dense vector representations. They use self-attention to compute each word's representation from the words around it. This allows the model to distinguish between different senses of the same word (e.g., "kill time" vs. "kill you") and can help it pick up semantic nuances such as sarcasm.
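
The idea that a word's vector depends on its neighbors can be sketched with a toy single-head self-attention in NumPy. This is not DistilBERT: the one-hot `vocab` vectors are invented, and the query, key, and value projections are assumed to be the identity to keep the example short. It only shows that the same input token ends up with different output vectors in different contexts.

```python
import numpy as np

def self_attention(X):
    """Scaled dot-product self-attention with identity Q/K/V projections (toy assumption)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ X

# Hypothetical static (context-free) embeddings, one fixed vector per word
vocab = {
    "kill": np.array([1.0, 0.0, 0.0, 0.0]),
    "time": np.array([0.0, 1.0, 0.0, 0.0]),
    "you":  np.array([0.0, 0.0, 1.0, 0.0]),
}

def contextualize(tokens):
    X = np.stack([vocab[t] for t in tokens])
    return self_attention(X)

kill_in_time = contextualize(["kill", "time"])[0]
kill_in_you = contextualize(["kill", "you"])[0]

print(kill_in_time.round(3))
print(kill_in_you.round(3))
```

Both rows start from the same static vector for "kill", but each output mixes in a share of the neighboring word, so the two contextual vectors differ.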

In [3]:
# We define the representation pipelines here
from sklearn.feature_extraction.text import TfidfVectorizer
from transformers import DistilBertTokenizer

# 1. Statistical Vectorizer Setup
tfidf_vectorizer = TfidfVectorizer(max_features=10000)

# 2. Deep Learning Tokenizer Setup
bert_tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

def get_bert_encodings(texts):
    return bert_tokenizer(texts, padding="max_length", truncation=True, max_length=128)

print("Representation pipelines initialized.")

Representation pipelines initialized.


Section 3 - Algorithms
Logistic Regression (Baseline)
Logistic Regression is a linear classifier that estimates the probability of a binary outcome by passing a weighted sum of the input features through a sigmoid function. It is chosen as a baseline for its computational efficiency and interpretability. While effective for simple keyword-based detection, a linear model over bag-of-words features cannot capture interactions between words, so it typically struggles when meaning depends on context or word order (e.g., "I am not happy" vs. "I am happy").
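
As a worked example of the prediction rule, the sketch below computes the probability for one input by hand. The weights, bias, and feature values are made up for illustration and are not taken from the trained model.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters and one feature vector
w = np.array([2.0, -1.0])   # per-feature weights
b = -0.5                    # bias term
x = np.array([1.0, 0.5])    # e.g., TF-IDF values of two terms

z = w @ x + b               # weighted sum: 2.0*1.0 + (-1.0)*0.5 - 0.5 = 1.0
p_abusive = sigmoid(z)      # probability of the positive class

print(f"score = {z:.1f}, P(abusive) = {p_abusive:.4f}")
```

Because each weight multiplies exactly one feature, the sign and size of a weight can be read directly as that term's influence on the prediction, which is the interpretability advantage mentioned above.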

DistilBERT (Deep Learning)
DistilBERT is a distilled version of the BERT (Bidirectional Encoder Representations from Transformers) architecture. It is a language model pre-trained on English Wikipedia and the BookCorpus. By fine-tuning this model on our toxic comment dataset, we leverage its general knowledge of language structure (Transfer Learning) to detect subtle forms of abuse that do not rely on explicit keywords.

In [4]:
# --- Algorithm 1: Logistic Regression ---
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Create Representations
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

# Train
lr_model = LogisticRegression(max_iter=1000)
lr_model.fit(X_train_tfidf, y_train)

# Predict
lr_preds = lr_model.predict(X_test_tfidf)
print("Logistic Regression Trained.")

# --- Algorithm 2: DistilBERT ---
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

# Prepare Data
train_ds = Dataset.from_dict({'text': X_train, 'label': y_train})
test_ds = Dataset.from_dict({'text': X_test, 'label': y_test})
tokenized_train = train_ds.map(lambda x: get_bert_encodings(x['text']), batched=True)
tokenized_test = test_ds.map(lambda x: get_bert_encodings(x['text']), batched=True)

# Train
bert_model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
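# Mixed precision (fp16) needs a CUDA GPU, so enable it only when one is available
training_args = TrainingArguments(output_dir='./results', num_train_epochs=1, fp16=torch.cuda.is_available())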

trainer = Trainer(model=bert_model, args=training_args, train_dataset=tokenized_train)
trainer.train()
print("DistilBERT Trained.")

Logistic Regression Trained.


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.2147
1000,0.1592
1500,0.1623
2000,0.1455
2500,0.1407
3000,0.1452
3500,0.1427
4000,0.1192
4500,0.1213
5000,0.1214


DistilBERT Trained.
