# **Project Title: Deep Learning for Comment Toxicity Detection with Streamlit**

#### **By Shubham Pandey**

#### **Domain: NLP + Deep Learning**
This project lies in the domain of Natural Language Processing (NLP) and Deep Learning. NLP enables machines to understand and process human language, such as detecting patterns in text and classifying meaning. Deep learning techniques, especially neural networks like LSTMs, CNNs, and transformers (e.g., BERT), enhance this by automatically learning complex linguistic features. Together, they allow us to build robust models for tasks like sentiment analysis, spam filtering, and in this case, toxic comment detection.

#### **Problem Statement:**
Build a deep learning model to detect and classify toxic comments in real-time for safer online communication.

#### **GitHub Link:** https://github.com/Shubhampandey1git/Deep-Learning-for-Comment-Toxicity-Detection.git

## **Imports**

In [30]:
import pandas as pd

# Cleaning and preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import string
from sklearn.feature_extraction.text import TfidfVectorizer

# Model Training
from sklearn.linear_model import LogisticRegression

# Saving
import joblib

## **Data Loading**

In [31]:
# loading the test and train datasets
test = pd.read_csv('data/test.csv')
train = pd.read_csv('data/train.csv')

*First Look*

In [32]:
display(train)

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 159566 | ffe987279560d7ff | ":::::And for the second time of asking, when ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 159567 | ffea4adeee384e90 | You should be ashamed of yourself \n\nThat is ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 159568 | ffee36eab5c267c9 | Spitzer \n\nUmm, theres no actual article for ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 159569 | fff125370e4aaaf3 | And it looks like it was actually you who put ... | 0 | 0 | 0 | 0 | 0 | 0 |


## **Data Cleaning and Preprocessing**

1. Lowercasing the text

In [33]:
train['comment_text'] = train['comment_text'].str.lower()
test['comment_text'] = test['comment_text'].str.lower()

2. Removing punctuation, special characters, and digits

In [34]:
train['comment_text'] = train['comment_text'].str.replace(r'[^a-z\s]', '', regex=True)
test['comment_text'] = test['comment_text'].str.replace(r'[^a-z\s]', '', regex=True)
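To see what the pattern in the cell above does to a single comment, here is a minimal standalone sketch. The helper name `strip_non_letters` and the sample sentence are illustrative and not part of the notebook.

```python
import re

# Same pattern as the cell above: keep only lowercase ASCII letters and whitespace.
NON_LETTER = re.compile(r"[^a-z\s]")

def strip_non_letters(text: str) -> str:
    """Lowercase the text, then delete every character that is not a-z or whitespace."""
    return NON_LETTER.sub("", text.lower())

sample = "Don't edit-war, you'll be blocked in 24h!"
cleaned = strip_non_letters(sample)
print(cleaned)  # dont editwar youll be blocked in h
```

Because matches are deleted rather than replaced with a space, hyphenated words fuse (`editwar`) and contractions lose their apostrophe (`dont`, `youll`), so they will likely no longer match apostrophe-bearing entries such as `don't` in a stop-word list.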

3. Tokenization and stop-word removal

In [35]:
# Downloading once
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\ADMIN\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [36]:
stop_words = set(stopwords.words('english'))

# Function to remove the stopwords
def remove_stopwords(sentence):
    if pd.isnull(sentence):  # Handles missing values
        return ""
    words = sentence.split()  # Simple tokenization
    filtered_words = [word for word in words if word not in stop_words]
    return ' '.join(filtered_words)

# Applying the function
train['comment_text'] = train['comment_text'].apply(remove_stopwords)
test['comment_text'] = test['comment_text'].apply(remove_stopwords)

4. Lemmatization

In [37]:
lemmatizer = WordNetLemmatizer()

def lemmatization(sentence):
    if pd.isnull(sentence):
        return ""
    words = sentence.split()
    lemmatized_words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(lemmatized_words)

train['comment_text'] = train['comment_text'].apply(lemmatization)
test['comment_text'] = test['comment_text'].apply(lemmatization)
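At prediction time, new comments need the same four steps in the same order. One way to keep that consistent is a single helper, sketched below. `preprocess` and `DEMO_STOP_WORDS` are hypothetical names; the tiny stop-word set and the identity lemmatizer are stand-ins so the example runs without NLTK data.

```python
import re

# Hypothetical stand-in for NLTK's English stop-word list so the sketch runs offline.
DEMO_STOP_WORDS = {"the", "is", "a", "you", "are", "and", "of"}

def preprocess(text, stop_words=DEMO_STOP_WORDS, lemmatize=lambda w: w):
    """Apply steps 1-4 in order: lowercase, strip non-letters, drop stop-words, lemmatize."""
    # None and NaN (the only value that differs from itself) become an empty string.
    if text is None or text != text:
        return ""
    text = re.sub(r"[^a-z\s]", "", str(text).lower())
    tokens = [w for w in text.split() if w not in stop_words]
    return " ".join(lemmatize(w) for w in tokens)

print(preprocess("You ARE the hero of 2 articles!"))  # hero articles
```

With the objects defined in this notebook, the equivalent call would be `preprocess(text, stop_words, lemmatizer.lemmatize)`.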

5. Converting to numeric (TF-IDF)

In [38]:
# Initialize TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

# Fit on training data and transform
X_train_tfidf = tfidf_vectorizer.fit_transform(train['comment_text'])

# Transform test data (do NOT fit again!)
X_test_tfidf = tfidf_vectorizer.transform(test['comment_text'])


In [39]:
print(X_train_tfidf.shape)
print(X_test_tfidf.shape)

(159571, 5000)
(153164, 5000)
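For intuition about what each of the 5,000 columns holds, the sketch below recomputes TF-IDF by hand on a three-document toy corpus. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 row normalisation) and uses plain whitespace splitting, so it ignores the vectorizer's token pattern and the `max_features` cap. The function name `tfidf_rows` and the corpus are illustrative.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Compute TF-IDF with smoothed IDF and L2-normalised rows for whitespace-split docs."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears at least once.
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1.0 for w in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["edit article", "edit war", "stupid edit"])
for row in rows:
    print(dict(zip(vocab, (round(v, 3) for v in row))))
```

The word `edit` occurs in every document, so it receives the lowest weight in each row, while words unique to one document are weighted higher.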


## **Training the Single-Label Classifier**

In [40]:
# Combining all 6 columns into a single binary target
train['target'] = train[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].max(axis=1)

# Checking distribution
print(train['target'].value_counts())


target
0    143346
1     16225
Name: count, dtype: int64
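The counts above show a strong class imbalance. The sketch below works out the toxic share and the per-class weights given by the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is also what scikit-learn applies for `class_weight='balanced'`. The notebook itself trains with the default (unweighted) setting, so this is only one option to consider.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 143346, 1: 16225}

n_samples = sum(counts.values())
toxic_share = counts[1] / n_samples

# "Balanced" heuristic: rarer classes get proportionally larger weights.
balanced_weights = {
    label: n_samples / (len(counts) * count) for label, count in counts.items()
}

print(f"toxic share: {toxic_share:.3f}")
print({label: round(weight, 3) for label, weight in balanced_weights.items()})
```

Roughly one comment in ten is toxic, so under this heuristic a toxic example would weigh about nine times as much as a non-toxic one.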


- Training a binary classifier

In [41]:
X_train = X_train_tfidf
y_train = train['target']

clf = LogisticRegression(solver='liblinear')
clf.fit(X_train, y_train)

| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'liblinear' |
| max_iter | 100 |
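The classifier is fit on the full training set without a held-out check. If part of the labelled data were set aside, precision and recall on the toxic class could be computed as in the sketch below. The helper `binary_report` is hypothetical and the labels are made up purely for illustration; they are not results from this notebook.

```python
def binary_report(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for the positive (toxic = 1) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels: 10 comments, 2 of them toxic.
y_true_demo = [0] * 8 + [1] * 2
always_clean = [0] * 10               # a model that never flags anything
one_hit = [0] * 7 + [1, 1, 0]         # one false alarm, one hit, one miss

report_a = binary_report(y_true_demo, always_clean)
report_b = binary_report(y_true_demo, one_hit)
print(report_a)
print(report_b)
```

Both toy predictors reach the same accuracy, yet the first never detects a toxic comment, which is why accuracy alone can mislead on imbalanced labels.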


- Prediction on the test set

In [42]:
X_test = X_test_tfidf

y_pred = clf.predict(X_test)

# Getting probabilities
y_pred_proba = clf.predict_proba(X_test)[:,1]

- Storing the predictions in a DataFrame

In [43]:
predictions = pd.DataFrame({
    'comment_text': test['comment_text'],
    'predicted_toxic': y_pred,
    'toxic_probability': y_pred_proba
})

print(predictions.head())

                                        comment_text  predicted_toxic  \
0  yo bitch ja rule succesful youll ever whats ha...                1   
1                                 rfc title fine imo                0   
2                         source zawe ashton lapland                0   
3  look back source information updated correct f...                0   
4                      dont anonymously edit article                0   

   toxic_probability  
0           0.999549  
1           0.009806  
2           0.008458  
3           0.005115  
4           0.038682  
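For a binary logistic regression, the `predicted_toxic` column corresponds to thresholding `toxic_probability` at 0.5. The sketch below reproduces that on the five probabilities printed above and shows how a lower cut-off would flag more comments. The helper name `to_label` and the alternative threshold of 0.03 are illustrative choices, not values used in the notebook.

```python
def to_label(probability, threshold=0.5):
    """Return 1 (toxic) when the probability exceeds the threshold, else 0."""
    return int(probability > threshold)

# Probabilities copied from the first five rows of the output above.
probabilities = [0.999549, 0.009806, 0.008458, 0.005115, 0.038682]

default_labels = [to_label(p) for p in probabilities]
strict_labels = [to_label(p, threshold=0.03) for p in probabilities]

print(default_labels)  # [1, 0, 0, 0, 0]
print(strict_labels)   # [1, 0, 0, 0, 1]
```

Lowering the threshold trades more false alarms for fewer missed toxic comments; which side to favour depends on the moderation setting.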


- Saving the predictions as csv and saving the model

In [44]:
predictions.to_csv('toxicity_predictions.csv', index=False)

joblib.dump(clf, 'models/toxicity_model.pkl')

['models/toxicity_model.pkl']

## **Experimenting with multiple deep learning architectures**

### 1. LSTM (Recurrent Neural Network)

In [45]:
# Imports
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

# Tokenize
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(train['comment_text'])
X_train_seq = tokenizer.texts_to_sequences(train['comment_text'])
X_test_seq = tokenizer.texts_to_sequences(test['comment_text'])

# Pad sequences
max_len = 100
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)

# Target
y_train = train['target']

# LSTM model
model = Sequential([
    Embedding(input_dim=10000, output_dim=128, input_length=max_len),
    LSTM(64, dropout=0.2, recurrent_dropout=0.2),
    Dense(1, activation='sigmoid')  # single-target
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train_pad, y_train, epochs=5, batch_size=64, validation_split=0.1)


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x1f3fa6bdc00>
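The model above expects every comment as exactly `max_len` token ids. With its default arguments, `pad_sequences` pads at the front with zeros and, for longer inputs, keeps the last `maxlen` ids. The pure-Python sketch below mimics that default behaviour for a single sequence; `pad_pre` is an illustrative name, and the sketch assumes `maxlen >= 1`.

```python
def pad_pre(sequence, maxlen, value=0):
    """Keep the last `maxlen` ids and left-pad with `value` (assumes maxlen >= 1)."""
    kept = list(sequence)[-maxlen:]
    return [value] * (maxlen - len(kept)) + kept

short = pad_pre([5, 8, 2], maxlen=5)
long = pad_pre([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # [0, 0, 5, 8, 2]
print(long)   # [3, 4, 5, 6, 7]
```

One consequence is that for comments longer than 100 tokens, only the final 100 reach the model.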

In [46]:
import pickle

model.save('models/lstm_model.h5')

with open("models/tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(tfidf_vectorizer, f)


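The cell above saves the LSTM model and the TF-IDF vectorizer, but the LSTM needs the same word-to-id mapping at prediction time. One possible approach, sketched here under assumptions, is to store the tokenizer's `word_index` dictionary as JSON and re-encode text from it. The file name `word_index.json`, the helper names, and the toy mapping are all hypothetical; the encoding rule (skip unknown words and ids at or above `num_words`) is meant to mirror how the Keras tokenizer behaves on already-cleaned text when no out-of-vocabulary token is configured.

```python
import json
import os
import tempfile

NUM_WORDS = 10000  # same cap as Tokenizer(num_words=10000) in this notebook

def save_word_index(word_index, path):
    """Write the word -> id mapping to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(word_index, f)

def load_word_index(path):
    """Read the word -> id mapping back from JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def texts_to_ids(texts, word_index, num_words=NUM_WORDS):
    """Encode cleaned, whitespace-separated text, dropping unknown or out-of-cap words."""
    return [
        [word_index[w] for w in text.split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

# Toy mapping standing in for tokenizer.word_index.
demo_index = {"edit": 1, "article": 2, "source": 3}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "word_index.json")
    save_word_index(demo_index, path)
    restored = load_word_index(path)

encoded = texts_to_ids(["dont anonymously edit article"], restored)
print(encoded)  # [[1, 2]]
```

The resulting id lists would still need padding to `max_len` before being passed to the model.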


### 2. CNN

In [47]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, GlobalMaxPooling1D, Dense
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# Tokenize
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(train['comment_text'])
X_train_seq = tokenizer.texts_to_sequences(train['comment_text'])
X_test_seq = tokenizer.texts_to_sequences(test['comment_text'])

max_len = 100
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)

# Target
y_train = train['target']

model = Sequential([
    Embedding(input_dim=10000, output_dim=128, input_length=max_len),
    Conv1D(128, 5, activation='relu'),
    GlobalMaxPooling1D(),
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train_pad, y_train, epochs=5, batch_size=64, validation_split=0.1)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.src.callbacks.History at 0x1f3fbff34c0>
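As a sanity check on the CNN architecture above, the following sketch works out the shape after the convolution and the number of trainable parameters by hand. It assumes the Keras defaults for `Conv1D` (no padding, stride 1, bias enabled); the variable names are illustrative.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1D convolution without padding ('valid')."""
    return (input_length - kernel_size) // stride + 1

# Hyperparameters taken from the model definition above.
vocab_size, embed_dim, max_len = 10000, 128, 100
filters, kernel_size = 128, 5

conv_length = conv1d_output_length(max_len, kernel_size)

embedding_params = vocab_size * embed_dim                   # one vector per token id
conv_params = kernel_size * embed_dim * filters + filters   # kernel weights + biases
dense_params = filters * 1 + 1                              # pooled features -> 1 unit
total_params = embedding_params + conv_params + dense_params

print(f"after Conv1D: ({conv_length}, {filters})")
print(f"after GlobalMaxPooling1D: ({filters},)")
print(f"parameters: {total_params:,}")
```

Nearly all of the parameters sit in the embedding layer, so the vocabulary cap and embedding size dominate the model's footprint.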

In [48]:
model.save("models/cnn_model.h5")

## **Conclusion**

In this notebook, we developed a **Comment Toxicity Detection System** using multiple approaches: a TF-IDF logistic regression baseline and two deep learning architectures (CNN and LSTM). Key takeaways:

1. **Data Preprocessing:** We lowercased the text, removed punctuation, special characters, and digits, dropped stop-words, and lemmatized the remaining tokens. For the sequence models, we also tokenized the comments and padded them to 100 tokens.

2. **Feature Extraction:** TF-IDF vectorization, capped at 5,000 features, converted the text into a numerical format for the logistic regression model.

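3. **Target Handling:** The dataset provides six label columns (`toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`). For simplicity and consistency, we collapsed them into a single binary target that is 1 when any of the six labels is set.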

4. **Model Building:**

   * Trained a logistic regression baseline on the TF-IDF features.
   * Implemented CNN and LSTM models for toxicity prediction.
   * Explored saving and loading models, as well as tokenizer reconstruction.

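5. **Evaluation:** The CNN and LSTM models were each trained for 5 epochs with a 10% validation split and accuracy as the metric, showing reasonable accuracy in predicting toxic vs non-toxic comments. Because about 90% of the training comments are non-toxic (143,346 of 159,571), accuracy is best read alongside precision and recall for the toxic class.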

6. **Deployment Preparation:** Prepared code for real-time predictions using Streamlit, allowing both single-comment and bulk predictions via CSV upload.
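The bulk-prediction path mentioned in the last point can be sketched independently of any web framework. The example below is a rough outline under assumptions: `score_csv` and `keyword_scorer` are hypothetical names, the scorer is a toy stand-in for a trained model, and in a Streamlit app the CSV text would presumably come from the decoded contents of an uploaded file.

```python
import csv
import io

def score_csv(csv_text, predict_proba, text_column="comment_text", threshold=0.5):
    """Score every row of a CSV string and return one result dict per comment."""
    reader = csv.DictReader(io.StringIO(csv_text))
    results = []
    for row in reader:
        probability = predict_proba(row[text_column])
        results.append({
            "comment_text": row[text_column],
            "toxic_probability": probability,
            "predicted_toxic": int(probability > threshold),
        })
    return results

def keyword_scorer(text):
    """Toy stand-in for a trained model: NOT the notebook's classifier."""
    return 0.9 if "idiot" in text.lower() else 0.1

demo_csv = "comment_text\nthanks for the fix\nyou are an idiot\n"
scored = score_csv(demo_csv, keyword_scorer)
for item in scored:
    print(item)
```

Passing the scoring function in as an argument keeps the same helper usable for the logistic regression, LSTM, or CNN model, each wrapped with its own preprocessing.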

**Overall**, this notebook demonstrates a full pipeline for text-based toxicity detection, from preprocessing and feature extraction to model training, evaluation, and deployment-ready implementation.
