# Toxic Comment Classification Challenge

This notebook analyzes and builds machine learning models for the Jigsaw Toxic Comment Classification Challenge. The goal is to identify and classify toxic online comments into different categories of toxicity.

## Dataset Overview
The dataset contains comments with the following toxicity labels:
- `toxic`: General toxicity
- `severe_toxic`: Severely toxic comments
- `obscene`: Obscene language
- `threat`: Threatening comments
- `insult`: Insulting comments
- `identity_hate`: Identity-based hate speech

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## 1. Import Required Libraries

Setting up all necessary libraries for data processing, text cleaning, and machine learning.

In [None]:
import pandas as pd
import numpy as np
import os
import re
import warnings

# Machine Learning Libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

# Text Processing Libraries
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# Configuration
warnings.filterwarnings("ignore")
pd.options.display.max_colwidth = 300
pd.options.display.max_columns = 100

## 2. Data Loading and Initial Exploration

Loading the training and test datasets and performing initial data exploration to understand the structure and characteristics of the data.

In [None]:
# Load datasets
df_train = pd.read_csv("/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv")
df_test = pd.read_csv("/kaggle/input/jigsaw-toxic-comment-classification-challenge/test.csv")

print(f"Training data shape: {df_train.shape}")
print(f"Test data shape: {df_test.shape}")

In [None]:
# Combine datasets for unified preprocessing
df_train['is_train'] = 1
df_test['is_train'] = 0

df = pd.concat([df_train, df_test], ignore_index=True)

In [None]:
# Display combined dataset
df

## 3. Data Quality Assessment

Checking for missing values, data types, and overall data quality to identify any preprocessing needs.

In [None]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())

In [None]:
# Dataset information
df.info()

In [None]:
# Statistical summary
df.describe()

In [None]:
# Check for duplicates
print(f"Number of duplicate rows: {df.duplicated().sum()}")

## 4. Exploratory Data Analysis

Examining the distribution of toxicity labels and exploring sample comments for each category to better understand the data characteristics.

In [None]:
# Examine comment text column
print("Sample comment texts:")
df["comment_text"].head()

In [None]:
# Explore toxicity categories with sample comments
toxicity_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
for col in toxicity_columns:
    print(f'****** {col.upper()} EXAMPLES *******')
    display(df.loc[df[col]==1,['comment_text',col]].sample(5))

In [None]:
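# Create a 'clean' label for non-toxic comments.
# Only training rows have labels, so test rows are left as NaN.
train_mask = df['is_train'] == 1
df.loc[train_mask, 'clean'] = (df.loc[train_mask, toxicity_columns].sum(axis=1) == 0).astype(int)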
print("Dataset with clean label:")
df.head()
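
The label distribution mentioned above can be summarized with column means, since the mean of a 0/1 column is its positive rate. The following is a minimal sketch on a made-up toy frame (the `toy` values are illustrative, not taken from the competition data):

```python
import pandas as pd

# Toy stand-in for the labelled training rows (values are made up).
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
toy = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0],
        [1, 0, 1, 0, 1, 0],
        [0, 0, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
    ],
    columns=label_cols,
)

# Share of comments carrying each label (mean of a 0/1 column = positive rate).
prevalence = toy[label_cols].mean().sort_values(ascending=False)

# Number of labels per comment; zero means the comment is clean.
labels_per_comment = toy[label_cols].sum(axis=1)
clean_rate = (labels_per_comment == 0).mean()

print(prevalence)
print(f"Clean comments: {clean_rate:.0%}")
```

On the real data the same two lines would be applied to the training rows only, because the test rows carry no labels.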

## 5. Text Preprocessing and Cleaning

Implementing comprehensive text cleaning including URL removal, HTML tag removal, stopword removal, and stemming to prepare the text data for machine learning models.

In [None]:
# Download NLTK resources
nltk.download('stopwords')

# Initialize text processing tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_text(text):
    """
    Comprehensive text cleaning function
    
    Steps:
    1. Convert to lowercase
    2. Remove newlines, URLs, and HTML tags
    3. Keep only alphabetic characters
    4. Remove extra spaces
    5. Remove stopwords and apply stemming
    """
    text = str(text).lower()
    text = re.sub(r'\n', ' ', text)                    # Remove newlines
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # Remove URLs
    text = re.sub(r'<.*?>', '', text)                  # Remove HTML tags
    text = re.sub(r'[^a-z\s]', '', text)               # Keep only letters and spaces
    text = re.sub(r'\s+', ' ', text).strip()           # Remove extra spaces

    # Tokenize, remove stopwords, and stem
    words = text.split()
    cleaned_words = []
    for w in words:
        if w and w not in stop_words:
            try:
                stemmed = stemmer.stem(w)
                cleaned_words.append(stemmed)
            except RecursionError:
                pass  # Skip words causing stemmer errors

    return ' '.join(cleaned_words)

# Apply text cleaning to all comments
print("Applying text cleaning... This may take a few minutes.")
df['comment_text_clean'] = df['comment_text'].apply(clean_text)
print("Text cleaning completed!")

In [None]:
# Compare original and cleaned text
print("Original vs Cleaned Text Comparison:")
df[['comment_text', 'comment_text_clean']].head()
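
To see what each regex step does, here is a small standalone trace of the same cleaning sequence on one string. It is a sketch under two assumptions: `DEMO_STOPWORDS` is a hypothetical four-word list standing in for NLTK's English list, and stemming is left out so the example runs without NLTK.

```python
import re

# Hypothetical mini stopword list for illustration only;
# the notebook itself uses NLTK's English stopword list.
DEMO_STOPWORDS = {"this", "the", "a", "don't"}

def clean_text_demo(text):
    """Apply the same regex steps as clean_text, without stemming."""
    text = str(text).lower()
    text = re.sub(r'\n', ' ', text)                    # newlines become spaces
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'<.*?>', '', text)                  # drop HTML tags
    text = re.sub(r'[^a-z\s]', '', text)               # drop digits and punctuation
    text = re.sub(r'\s+', ' ', text).strip()           # collapse whitespace
    return ' '.join(w for w in text.split() if w not in DEMO_STOPWORDS)

sample = "Don't read <b>THIS</b>:\nhttp://example.com/x 100% spam!!!"
print(clean_text_demo(sample))  # dont read spam
```

Note that the apostrophe is stripped before the stopword check, so `don't` becomes `dont` and no longer matches a list entry spelled with the apostrophe. Filtering stopwords before stripping punctuation would avoid this if it matters for the model.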

## 6. Data Preparation for Machine Learning

Splitting the combined data back into training and test sets now that the text has been cleaned. TF-IDF features are built in the next section.

In [None]:
# Split back into train and test sets
train_df = df[df['is_train'] == 1].copy()
test_df = df[df['is_train'] == 0].copy()

# Clean up the datasets
train_df.drop("is_train", inplace=True, axis=1)
test_df.drop("is_train", inplace=True, axis=1)

print(f"Final training set shape: {train_df.shape}")
print(f"Final test set shape: {test_df.shape}")

In [None]:
# Display training data structure
train_df.head()

In [None]:
# Display test data structure
test_df.head()
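
One side effect of the combine-then-split pattern is worth illustrating: the test rows have no label columns, so after `pd.concat` the labels hold NaN for those rows and are stored as floats. The sketch below reproduces this on a made-up two-column frame; the cast back to `int` and the dropped label column are suggestions, not steps the notebook performs.

```python
import pandas as pd

# Toy frames: the test frame has no label column, as in the competition files.
train = pd.DataFrame({"comment_text": ["a", "b"], "toxic": [0, 1]})
test = pd.DataFrame({"comment_text": ["c"]})

train["is_train"] = 1
test["is_train"] = 0
combined = pd.concat([train, test], ignore_index=True)

# The missing test label becomes NaN, which forces the column to float64.
print(combined["toxic"].dtype)

# Split back on the marker column and restore integer labels for training rows.
train_back = combined[combined["is_train"] == 1].drop(columns="is_train").copy()
train_back["toxic"] = train_back["toxic"].astype(int)

# The all-NaN label column carries no information for the test rows.
test_back = combined[combined["is_train"] == 0].drop(columns=["is_train", "toxic"])
```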

## 7. Feature Engineering with TF-IDF

Creating numerical features from text using TF-IDF (Term Frequency-Inverse Document Frequency) vectorization. The parameters below are hand-picked starting values rather than the result of a search.

In [None]:
# Initialize TF-IDF Vectorizer with hand-picked parameters
tfidf = TfidfVectorizer(
    max_features=10000,        # Keep the 10,000 most frequent terms in the corpus
    ngram_range=(1, 2),        # Use unigrams and bigrams
    min_df=3,                  # Ignore rare terms (appear in fewer than 3 documents)
    max_df=0.9,                # Ignore very common terms (appear in more than 90% of documents)
    strip_accents='unicode',   # Handle accented characters
    sublinear_tf=True          # Use sublinear term frequency scaling
)

print("TF-IDF Vectorizer configured with parameters:")
print(f"- Max features: {tfidf.max_features}")
print(f"- N-gram range: {tfidf.ngram_range}")
print(f"- Min document frequency: {tfidf.min_df}")
print(f"- Max document frequency: {tfidf.max_df}")

In [None]:
# Transform text to numerical features
print("Transforming text data to TF-IDF features...")
X_train = tfidf.fit_transform(train_df['comment_text_clean'])

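# Prepare target labels (all toxicity categories + clean label).
# The labels became floats when train and test were concatenated, so cast back to int.
y_train = train_df[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate', 'clean']].values.astype(int)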

print(f"Feature matrix shape: {X_train.shape}")
print(f"Label matrix shape: {y_train.shape}")
print("Feature engineering completed!")

In [None]:
# Verify the prepared data
print("Training features and labels ready:")
print(f"X_train type: {type(X_train)}")
print(f"y_train type: {type(y_train)}")
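
For intuition about what the vectorizer computes, the weighting can be reproduced by hand on a tiny corpus. This sketch assumes the formulas scikit-learn documents for the settings used here: `sublinear_tf=True` replaces the raw count with `1 + ln(count)`, the default smoothed IDF is `ln((1 + n) / (1 + df)) + 1`, and each document vector is L2-normalized. The three toy documents are made up.

```python
import math
from collections import Counter

# Toy tokenized corpus (made up for illustration).
docs = [["you", "are", "great"], ["you", "you", "idiot"], ["great", "idiot"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df_counts = Counter(term for doc in docs for term in set(doc))

def tfidf_vector(doc):
    """Return {term: weight} with sublinear tf, smoothed idf, and L2 normalization."""
    weights = {}
    for term, count in Counter(doc).items():
        tf = 1.0 + math.log(count)                                  # sublinear tf
        idf = math.log((1 + n_docs) / (1 + df_counts[term])) + 1.0  # smoothed idf
        weights[term] = tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

for doc in docs:
    print({term: round(w, 3) for term, w in tfidf_vector(doc).items()})
```

In the first document the rarer term `are` outweighs `you`, and in the second the repeated `you` outweighs `idiot`, but only by a factor of `1 + ln(2)` rather than 2. That damping is the effect of sublinear scaling.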

## 8. Model Training and Evaluation

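Training multiple machine learning models and comparing their macro-averaged F1-score. We use OneVsRestClassifier to handle the multi-label classification problem by fitting one binary classifier per label. The scores in this section are computed on the training data itself, so they are optimistic and mainly indicate how closely each model fits the training set.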
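
As a minimal illustration of the one-vs-rest idea, the sketch below fits a two-label problem on six made-up comments. The texts, the two label columns, and the default `TfidfVectorizer()` settings are assumptions chosen to keep the example small; they are not the notebook's data or configuration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up comments with two label columns: [toxic, insult].
texts = [
    "you are an idiot",
    "thanks for the helpful edit",
    "shut up you moron",
    "great article, well sourced",
    "this page is garbage",
    "please add a citation",
]
Y = np.array([[1, 1], [0, 0], [1, 1], [0, 0], [1, 0], [0, 0]])

X = TfidfVectorizer().fit_transform(texts)

# One independent binary classifier is fitted per label column.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(len(ovr.estimators_))

# Per-label probabilities; rows need not sum to 1 in the multi-label case.
proba = ovr.predict_proba(X)
print(proba.shape)
```

Because each label gets its own classifier, a comment can receive several labels at once or none at all.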

In [None]:
# Define models for comparison
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, n_jobs=-1),
    'XGBoost': XGBClassifier(use_label_encoder=False, eval_metric='logloss', n_jobs=-1)
}

print("Training models and evaluating performance...")
print("=" * 50)

results = {}

# Train and evaluate each model
for name, base_model in models.items():
    print(f"\nTraining {name}...")
    
    # Use OneVsRestClassifier for multi-label classification
    clf = OneVsRestClassifier(base_model)
    clf.fit(X_train, y_train)
    
    # Predict on the training data and calculate the macro F1-score (an optimistic, in-sample estimate)
    y_pred = clf.predict(X_train)
    f1 = f1_score(y_train, y_pred, average='macro')
    results[name] = f1
    
    print(f"{name}: F1-score (train) = {f1:.4f}")

print("\n" + "=" * 50)
print("Training completed!")
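
The macro-averaged F1 used above gives every label equal weight, regardless of how rare it is. A hand-rolled version makes this concrete. This is a sketch on made-up predictions, and it assumes a label with no true and no predicted positives scores 0.

```python
def f1_binary(y_true, y_pred):
    """F1 for one label column: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(Y_true, Y_pred):
    """Unweighted mean of the per-label F1 scores."""
    n_labels = len(Y_true[0])
    per_label = [
        f1_binary([row[j] for row in Y_true], [row[j] for row in Y_pred])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels, per_label

# Two labels: a common one (column 0) and a rare one (column 1).
Y_true = [[1, 0], [1, 0], [0, 1], [0, 0]]
Y_pred = [[1, 0], [0, 0], [0, 0], [0, 0]]

score, per_label = macro_f1(Y_true, Y_pred)
print(per_label, score)
```

Here the rare label is missed entirely and contributes an F1 of 0, which pulls the macro average down to one third even though the common label scores two thirds.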

## 9. Cross-Validation Evaluation

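Performing cross-validation to estimate how well each model generalizes to unseen comments. Unlike the training-set scores above, each fold is scored on data the model did not see during fitting, which exposes overfitting rather than hiding it.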

In [None]:
print("Performing 3-fold cross-validation...")
print("=" * 50)

cv_results = {}

# Perform cross-validation for each model
for name, base_model in models.items():
    print(f"\nCross-validating {name}...")
    
    clf = OneVsRestClassifier(base_model)
    scores = cross_val_score(clf, X_train, y_train, scoring='f1_macro', cv=3)
    
    cv_results[name] = {
        'mean': scores.mean(),
        'std': scores.std(),
        'scores': scores
    }
    
    print(f"{name}: CV F1-score = {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")

print("\n" + "=" * 50)
print("Cross-validation completed!")
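
To make the fold structure concrete, the sketch below splits sample indices into contiguous, unshuffled folds. As far as scikit-learn's documentation describes it, passing an integer `cv` with a multi-label target falls back to this kind of plain `KFold` rather than a stratified split. The helper `kfold_indices` is a hypothetical illustration, not a library function.

```python
def kfold_indices(n_samples, n_splits=3):
    """Yield (train_idx, val_idx) for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds get one extra sample.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

for train_idx, val_idx in kfold_indices(10, n_splits=3):
    print("validate on", val_idx, "train on", train_idx)
```

Since the folds are not stratified, a very rare label such as `threat` may be unevenly spread across folds, which is one reason the per-fold scores can vary.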

## 10. Results Summary

Summary of model performance and recommendations for next steps.

In [None]:
# Display final results summary
print("FINAL RESULTS SUMMARY")
print("=" * 60)
print()

print("Training F1-Scores:")
print("-" * 30)
for name, score in results.items():
    print(f"{name:<20}: {score:.4f}")

print()
print("Cross-Validation F1-Scores:")
print("-" * 35)
for name, result in cv_results.items():
    print(f"{name:<20}: {result['mean']:.4f} (+/- {result['std'] * 2:.4f})")

print()
print("RECOMMENDATIONS:")
print("-" * 20)
best_model = max(cv_results.keys(), key=lambda x: cv_results[x]['mean'])
print(f"• Best performing model: {best_model}")
print(f"• Best CV F1-score: {cv_results[best_model]['mean']:.4f}")
print()
print("NEXT STEPS:")
print("• Hyperparameter tuning for the best model")
print("• Feature engineering improvements")
print("• Ensemble methods combining multiple models")
print("• Generate predictions for test set")

## Conclusion

This notebook successfully implemented a multi-label text classification pipeline for toxic comment detection. The approach included:

1. **Data Preprocessing**: Comprehensive text cleaning with URL removal, HTML stripping, and linguistic preprocessing
2. **Feature Engineering**: TF-IDF vectorization over unigrams and bigrams for text representation
3. **Model Comparison**: Evaluation of multiple machine learning algorithms using OneVsRestClassifier
4. **Performance Evaluation**: Both training and cross-validation metrics to assess model reliability

The results provide a solid baseline for toxic comment classification, with opportunities for further improvement through hyperparameter tuning and advanced feature engineering techniques.