Toxic Comment Classification Using NLP + TF-IDF + Logistic Regression

This project tackles the problem of detecting multiple forms of toxicity in user-generated text comments. Each comment can be labeled with one or more categories such as toxic, severe toxic, obscene, threat, insult, or identity hate. The goal is to build a lightweight and interpretable multi-label classification model using traditional Natural Language Processing (NLP) techniques and machine learning.

The pipeline begins with text cleaning and preprocessing using NLTK, including stopword removal, stemming, and a guard against the RecursionError that the stemmer can raise on some tokens. Cleaned comments are transformed into numerical features using TF-IDF vectorization over unigrams and bigrams. The model uses a One-vs-Rest logistic regression strategy, where a separate binary classifier is trained for each label. After training, the model is evaluated using ROC AUC scores per class, along with a macro-average ROC AUC to summarize overall performance. This approach provides a fast and interpretable baseline for multi-label text classification tasks in real-world applications such as content moderation and sentiment analysis.

⸻


In [1]:
import numpy as np
import pandas as pd

In [2]:
TC_train = pd.read_csv('train.csv')
#TC_test = pd.read_csv('test.csv')

In [3]:
# Number of rows (comments) in the training set
print(len(TC_train))

159571


In [4]:
print(TC_train.isnull().sum())

id               0
comment_text     0
toxic            0
severe_toxic     0
obscene          0
threat           0
insult           0
identity_hate    0
dtype: int64


This section performs text preprocessing for the toxic comment dataset using NLTK. It begins by downloading and customizing the stopword list, explicitly keeping the word “not” to preserve sentiment cues. Each comment is cleaned by removing non-alphabetic characters, converting to lowercase, tokenizing, and applying stemming using the Porter Stemmer. A custom safe_stem() function is used to handle potential RecursionErrors during stemming. Stopwords and non-alphabetic tokens are filtered out, and the cleaned words are joined back into strings to form a processed text corpus. This cleaned corpus will be used for feature extraction in the next steps.

In [5]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Keep "not" so negation survives stopword removal; a set gives fast lookups.
all_stopwords = set(stopwords.words('english'))
all_stopwords.discard('not')

# Create the stemmer once instead of once per comment.
ps = PorterStemmer()

def safe_stem(word):
  # Drop any token on which the stemmer raises RecursionError.
  try:
    return ps.stem(word)
  except RecursionError:
    return ''

corpus = []
for comment in TC_train['comment_text']:
  if not isinstance(comment, str):
    # Keep corpus aligned row-for-row with the label columns.
    corpus.append('')
    continue
  # Replace non-letters with spaces, lowercase, and split on whitespace.
  review = re.sub('[^a-zA-Z]', ' ', comment).lower().split()
  review = [safe_stem(w) for w in review if w not in all_stopwords]
  corpus.append(' '.join(w for w in review if w))

print(len(corpus))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


159571
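
To make the cleaning steps concrete, here is a minimal sketch that applies the same sequence (letters only, lowercase, stopword removal, Porter stemming) to a single made-up comment. The sample sentence, the `clean_comment` helper, and the tiny `MINI_STOPWORDS` set are assumptions for illustration; the notebook itself uses NLTK's full English stopword list with "not" removed.

```python
import re
from nltk.stem.porter import PorterStemmer

# Hypothetical mini stopword list, used only to avoid a corpus download here;
# the notebook uses NLTK's English list minus "not".
MINI_STOPWORDS = {"you", "are", "the", "here", "a", "is"}

stemmer = PorterStemmer()

def clean_comment(text):
    """Keep letters only, lowercase, drop stopwords, stem what is left."""
    tokens = re.sub(r"[^a-zA-Z]", " ", text).lower().split()
    return " ".join(stemmer.stem(t) for t in tokens if t not in MINI_STOPWORDS)

sample = "You are NOT welcome here!!! Stop editing the pages."
cleaned = clean_comment(sample)
print(cleaned)
```

Because "not" is absent from the stopword set, the negation survives cleaning, while punctuation and casing are discarded.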


This block converts the cleaned text corpus into numerical features using TfidfVectorizer from scikit-learn. It extracts up to 20,000 features based on unigrams and bigrams (1–2 word combinations) to capture both individual words and contextual phrases. The result is a sparse matrix X representing TF-IDF weighted word features for each comment. The corresponding multi-label targets (toxic, severe_toxic, obscene, threat, insult, identity_hate) are extracted into array Y from the training DataFrame for use in multilabel classification.

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
CV = TfidfVectorizer(max_features=20000, ngram_range=(1,2))
X = CV.fit_transform(corpus)
Y = TC_train.iloc[:, 2:8].values

In [7]:
print(X.shape, Y.shape)

(159571, 20000) (159571, 6)


This block splits the TF-IDF feature matrix X and the multi-label target array Y into training and test sets using an 80/20 ratio. The train_test_split function from scikit-learn partitions the rows at random, with a fixed random_state for reproducibility. Training on one portion and evaluating on the other gives an estimate of performance on unseen comments. One caveat: the vectorizer was fit on the full corpus before the split, so the vocabulary and IDF weights already reflect the test comments, and the reported scores may be slightly optimistic.

In [8]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X , Y, test_size=0.2, random_state=42)
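
The split can also be done on the cleaned text rather than on X, so that the vectorizer only ever sees training comments. The following is a minimal sketch of that alternative, not the notebook's method: it assumes a toy corpus and two hypothetical labels, and wraps TfidfVectorizer and the classifier in a scikit-learn Pipeline.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

# Toy data, assumed for illustration: columns are (insult, threat).
texts = ["you idiot", "nice edit thanks", "stupid idiot fool", "thanks for help",
         "i will hurt you", "great article", "hurt you idiot", "good work thanks"] * 5
labels = np.array([[1, 0], [0, 0], [1, 0], [0, 0],
                   [0, 1], [0, 0], [1, 1], [0, 0]] * 5)

# Split the text first, before any feature statistics are computed.
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("clf", OneVsRestClassifier(LogisticRegression(solver="liblinear"))),
])

# fit() learns the vocabulary and IDF weights from the training text only.
pipeline.fit(text_train, y_train)
proba = pipeline.predict_proba(text_test)
print(proba.shape)
```

With this arrangement the same pipeline object can be passed to cross-validation utilities, which refit the vectorizer inside each fold.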

This block initializes and trains a multi-label classification model using logistic regression. Since each comment can belong to multiple toxic categories simultaneously, the OneVsRestClassifier strategy is used to train a separate binary logistic regression model for each label. The liblinear solver is specified; it accepts sparse input and suits the binary problem that each per-label classifier solves. The model is then trained on the TF-IDF features (X_train) and corresponding label matrix (Y_train), enabling it to predict multiple toxic attributes per comment.

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
# One binary logistic regression per label, each fit independently.
model = OneVsRestClassifier(LogisticRegression(solver='liblinear'))
model.fit(X_train, Y_train)


In [12]:
Y_pred = model.predict_proba(X_test)
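
`predict_proba` returns one probability per label for every comment, which is what ROC AUC needs. For an actual moderation decision those probabilities have to be turned into labels. A minimal sketch, assuming made-up probabilities and an illustrative cutoff of 0.5 (in practice the cutoff would be tuned per label):

```python
import numpy as np

label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Hypothetical probabilities for two comments, one column per label.
proba = np.array([
    [0.91, 0.10, 0.75, 0.02, 0.60, 0.05],
    [0.03, 0.00, 0.01, 0.00, 0.02, 0.00],
])

threshold = 0.5  # assumed cutoff, not taken from the notebook
pred = (proba >= threshold).astype(int)

# Names of the labels that fire for each comment.
flagged = [[name for name, hit in zip(label_cols, row) if hit] for row in pred]
print(flagged)
```

Each column is thresholded independently, so a comment can receive several labels or none.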

This block evaluates the model using ROC AUC (area under the Receiver Operating Characteristic curve) for each label. For each of the six categories it computes and prints the score, which measures how well the predicted probabilities rank comments carrying that label above comments without it: 0.5 corresponds to random ranking and 1.0 to perfect separation. It then prints the macro-averaged ROC AUC, the unweighted mean of the six per-label scores, as a single summary of multi-label performance.

In [14]:
from sklearn.metrics import roc_auc_score

label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

for i, label in enumerate(label_cols):
    score = roc_auc_score(Y_test[:, i], Y_pred[:, i])
    print(f"{label}: ROC AUC = {score:.4f}")

mean_roc = roc_auc_score(Y_test, Y_pred, average='macro')
print(f"\nMean ROC AUC: {mean_roc:.4f}")


toxic: ROC AUC = 0.9700
severe_toxic: ROC AUC = 0.9825
obscene: ROC AUC = 0.9848
threat: ROC AUC = 0.9883
insult: ROC AUC = 0.9753
identity_hate: ROC AUC = 0.9720

Mean ROC AUC: 0.9788
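
The macro average reported above is the unweighted mean of the per-label scores. The sketch below checks that relationship on synthetic labels and scores (random data generated here, unrelated to the notebook's results):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic multi-label ground truth and noisy scores for 3 labels.
y_true = rng.integers(0, 2, size=(200, 3))
y_score = y_true * 0.3 + rng.random((200, 3)) * 0.7

# ROC AUC computed label by label, then averaged by hand.
per_label = [roc_auc_score(y_true[:, j], y_score[:, j]) for j in range(3)]

# The same quantity as computed by scikit-learn's macro average.
macro = roc_auc_score(y_true, y_score, average="macro")

print(np.mean(per_label), macro)
```

Because every label counts equally in a macro average, a rare label such as threat influences the summary as much as a common one such as toxic.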
