**<h1 align = "center">Toxic Comment Detection - Machine Learning</h1>**

## **INITIAL DATA ANALYSIS**<a id="8"></a>

### **2.1 Loading Our Datasets**<a id="3"></a>

In [1]:
import pandas as pd
toxic_df = pd.read_csv("/kaggle/input/jigsaw-toxic-comment-classification-challenge/train.csv.zip")

### **2.2 Initial Analysis On Our Datasets**<a id="4"></a>

In [None]:
toxic_df.head()

In [None]:
toxic_df.shape

### **2.3 Selecting The Required Columns**<a id="5"></a>

In [None]:
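#Combining the six label columns into a single "Toxic" flag, then keeping just "comment_text" and "Toxic"
toxic_df['Toxic'] = toxic_df.iloc[:, 2:].any(axis = 1)
#.copy() gives an independent DataFrame, so later in-place edits do not trigger SettingWithCopyWarning
selected_toxic_columns = toxic_df[['comment_text', 'Toxic']].copy()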
selected_toxic_columns

In [None]:
selected_toxic_columns.describe()

In [None]:
selected_toxic_columns.isnull().sum()

### **2.5 Handling Duplicates**<a id="7"></a>

In [None]:
#Checking duplicates
selected_toxic_columns.duplicated(subset = ['comment_text'], keep = False).sum()

In [None]:
#Printing the duplicated rows
duplicates = selected_toxic_columns[selected_toxic_columns.duplicated(subset = ['comment_text'], keep = False)]
duplicates

In [None]:
#Dropping Duplicates
selected_toxic_columns.drop_duplicates(subset = ['comment_text'], keep = 'first', inplace = True)

In [None]:
#Confirm Drops
selected_toxic_columns.duplicated(subset = ['comment_text'], keep = False).sum()

In [None]:
selected_toxic_columns['Toxic'].value_counts()
#We can see from the counts above that the data is imbalanced.

## **VISUALIZATION**<a id="8"></a>

### **3.1 Toxic vs Non-Toxic Comments Plot**<a id="9"></a>

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
#Graphical representation of the Toxic column values (Toxic vs Non-Toxic Comments) distribution
plt.figure(figsize = (6, 4))
toxic_counts = selected_toxic_columns['Toxic'].value_counts()
toxic_counts.plot(kind = 'bar', color = ['green', 'red'])
plt.title('Toxic vs Non-Toxic Comments')
plt.xlabel('Toxic')
plt.ylabel('Count')
plt.xticks(rotation = 0)
plt.show()

### **3.2 Wordcloud for Toxic Comments**<a id="10"></a>

In [None]:
#"Wordcloud" is for creating word cloud visualization.
from wordcloud import WordCloud
#Creating Word Cloud of Toxic Comments
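#Joining with a space so the last word of one comment does not fuse with the first word of the next
toxic_comments = ' '.join(selected_toxic_columns[selected_toxic_columns['Toxic']]['comment_text'])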
toxic_words = WordCloud(width = 900, height = 450, background_color = "white").generate(toxic_comments)
plt.imshow(toxic_words, interpolation = 'bilinear')
plt.axis("off")
plt.title("Word Cloud For Toxic Comments")
plt.show()

### **3.3 Wordcloud for Non-Toxic Comments**<a id="11"></a>

In [None]:
#Creating Word Cloud of Non-Toxic Comments
non_toxic_comments = ' '.join(selected_toxic_columns[~selected_toxic_columns['Toxic']]['comment_text'])
non_toxic_words = WordCloud(width = 900, height = 450, background_color = "white").generate(non_toxic_comments)
plt.imshow(non_toxic_words, interpolation = 'bilinear')
plt.axis("off")
plt.title("Word Cloud For Non-Toxic Comments")
plt.show()

## **EXPLORATORY DATA ANALYSIS (EDA)**<a id="12"></a>

### **4.1 Replacing True and False Values**<a id="13"></a>

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
#Converting the boolean labels to integers (True -> 1, False -> 0)
selected_toxic_columns['Toxic'] = selected_toxic_columns['Toxic'].astype(int)

### **4.2 Text Preprocessing**<a id="14"></a>

In [None]:
#"re" is for regular expressions and text processing.
import re
#Cleaning the comment texts
def clean_text(text):
    text = text.lower()
    #Expanding contractions; specific patterns come before the general ones they overlap with
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'scuse", " excuse ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"can't", "cannot ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    #Replacing non-word characters with spaces, then collapsing repeated whitespace
    text = re.sub(r"\W", " ", text)
    text = re.sub(r"\s+", " ", text)
    text = text.strip(" ")
    
    return text

selected_toxic_columns['comment_text'] = selected_toxic_columns['comment_text'].map(clean_text)

In [None]:
selected_toxic_columns.head()
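The order of the substitution rules matters: a specific pattern such as `can't` has to run before the general `n't` rule, otherwise it never gets a chance to match. The sketch below illustrates this with a reduced rule set. `CONTRACTION_RULES`, `expand_contractions`, and the sample sentences are illustrative names and data, not part of the notebook.

```python
import re

# Ordered (pattern, replacement) pairs; specific forms come before general ones.
# This is a reduced, illustrative subset of the rules used in clean_text.
CONTRACTION_RULES = [
    (r"can't", "cannot "),
    (r"n't", " not "),
    (r"'re", " are "),
    (r"'ll", " will "),
]

def expand_contractions(text):
    """Lower-case, expand contractions, strip punctuation, collapse whitespace."""
    text = text.lower()
    for pattern, replacement in CONTRACTION_RULES:
        text = re.sub(pattern, replacement, text)
    text = re.sub(r"\W", " ", text)            # non-word characters become spaces
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace

samples = ["You CAN'T do that!", "They're late; we'll wait.", "It isn't   fair..."]
cleaned = [expand_contractions(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

If the `can't` rule were listed after `n't`, the first sample would come out as "you ca not do that".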

### **4.3 Text Processing Using TF-IDF**<a id="15"></a>

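In [None]:
"""TF-IDF (Term Frequency-Inverse Document Frequency) converts text into numerical features:
each term is weighted up by how often it appears in a comment and weighted down by how many comments contain it.
Here max_features keeps the 5000 most frequent terms, and English stop words are dropped."""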

from sklearn.feature_extraction.text import TfidfVectorizer

vector = TfidfVectorizer(max_features = 5000, stop_words = 'english')
X = vector.fit_transform(selected_toxic_columns['comment_text'])
Y = selected_toxic_columns['Toxic']
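To see what the vectorizer computes, the weighting can be reproduced by hand on a tiny corpus. This is a minimal sketch, assuming scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation). The three documents are made up, and tokens are split on whitespace, which is simpler than the vectorizer's own tokenizer.

```python
import math
from collections import Counter

# Made-up corpus for illustration only.
docs = [
    "you are rude",
    "you are kind",
    "thanks for the kind edit",
]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(t for doc in tokenized for t in set(doc))

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    """Term count times IDF, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(w * w for w in raw))
    return [w / norm for w in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
weights = dict(zip(vocab, vectors[0]))
print({t: round(w, 3) for t, w in weights.items() if w > 0})
```

In the first document, "rude" occurs in only one document and so receives a larger weight than "you", which occurs in two.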

### **4.4 Over-Sampling Using SMOTE**<a id="16"></a>

In [None]:
selected_toxic_columns['Toxic'].value_counts()

In [None]:
#Recall that the data is imbalanced, so we have to balance it using SMOTE
from imblearn.over_sampling import SMOTE

#Initialize SMOTE
#Fixing random_state so the synthetic samples are reproducible between runs
smote = SMOTE(random_state = 42)

#Using SMOTE for oversampling
X_resampled, y_resampled = smote.fit_resample(X, Y)

#Keeping only the resampled labels in a DataFrame; the feature matrix stays sparse,
#because a dense copy of every row x 5000 features would need several GB of memory
resampled_df = pd.DataFrame({'Toxic': y_resampled})

In [None]:
resampled_df['Toxic'].value_counts()

In [None]:
#Plotting the new distribution sample
plt.figure(figsize = (6, 4))
toxic_counts = resampled_df['Toxic'].value_counts()
toxic_counts.plot(kind = 'bar', color = ['green', 'red'])
plt.title('Toxic vs Non-Toxic Comments')
plt.xlabel('Toxic')
plt.ylabel('Count')
plt.xticks(rotation = 0)
plt.show()
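The idea behind SMOTE is that each synthetic minority sample lies on the line segment between a real minority sample and one of its nearest minority-class neighbours. The sketch below is a simplified pure-Python illustration of that interpolation step, not imblearn's implementation. `nearest_neighbors`, `smote_like_sample`, and the toy points are hypothetical.

```python
import random

def nearest_neighbors(point, candidates, k):
    """Return the k candidates closest to `point` (squared Euclidean distance)."""
    others = [c for c in candidates if c is not point]
    return sorted(
        others,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)),
    )[:k]

def smote_like_sample(point, neighbors, rng):
    """Return a synthetic point on the segment between `point` and a random neighbor."""
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (q - p) for p, q in zip(point, neighbor)]

# Toy minority-class points (made up for illustration).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [3.0, 3.0]]
rng = random.Random(0)

synthetic = []
for point in minority:
    neighbors = nearest_neighbors(point, minority, k=2)
    synthetic.append(smote_like_sample(point, neighbors, rng))

print(synthetic)
```

Because every synthetic point is a convex combination of two real points, it always stays inside the range spanned by the existing minority samples.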

## **MODELLING**<a id="17"></a>

### **5.1 Splitting Our Dataset**<a id="18"></a>

In [None]:
from sklearn.model_selection import train_test_split
#Splitting the New Dataset into Training and Testing
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size = 0.2, random_state = 42)
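One caveat with resampling before splitting: synthetic rows interpolated from comments that later land in the test set can end up in the training set, which tends to inflate test scores. A common alternative is to split first and oversample only the training portion. The pure-Python sketch below shows that ordering on toy data. The helper names are hypothetical, and midpoint interpolation is a plain stand-in for SMOTE, not imblearn's algorithm.

```python
# Toy data: label 1 (toxic) is the minority class. Values are made up.
features = [[float(i), float(i % 7)] for i in range(50)]
labels = [1 if i % 5 == 0 else 0 for i in range(50)]  # 10 toxic, 40 non-toxic

def stratified_split(features, labels, every=5):
    """Send every `every`-th row of each class to the test set."""
    train, test, seen = [], [], {}
    for x, y in zip(features, labels):
        seen[y] = seen.get(y, 0) + 1
        (test if seen[y] % every == 0 else train).append((x, y))
    return train, test

def oversample_minority(train, minority_label=1):
    """Add midpoints of real minority rows until both classes are equal in size."""
    minority = [x for x, y in train if y == minority_label]
    n_needed = len(train) - 2 * len(minority)  # majority count minus minority count
    synthetic = []
    for k in range(n_needed):
        a = minority[k % len(minority)]
        b = minority[(k + 1) % len(minority)]
        midpoint = [(p + q) / 2 for p, q in zip(a, b)]
        synthetic.append((midpoint, minority_label))
    return train + synthetic

# 1) Split first, so the test set holds only real, untouched rows.
train, test = stratified_split(features, labels)

# 2) Balance the training split only.
balanced_train = oversample_minority(train)

train_counts = {c: sum(1 for _, y in balanced_train if y == c) for c in (0, 1)}
test_counts = {c: sum(1 for _, y in test if y == c) for c in (0, 1)}
print("train:", train_counts, "test:", test_counts)
```

The test split keeps the original class ratio, so metrics computed on it reflect the imbalance the model would meet in practice.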

### **5.2 Building Model**<a id="19"></a>

#### **5.2.1 Building Logistic Regression Model (Baseline 1)**<a id="22"></a>

In [None]:
from sklearn.linear_model import LogisticRegression

logreg_model = LogisticRegression(
    max_iter=1000,
    solver='liblinear'
)

#### **5.2.2 Building Feedforward Neural Network (FNN) Model (Baseline 2)**<a id="22"></a>

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
# FNN
FNN_model = Sequential([
    Dense(64, activation = 'relu'),
    Dropout(0.5),
    Dense(1, activation = 'sigmoid')
])

FNN_model.compile(optimizer = Adam(learning_rate = 0.001), loss = 'binary_crossentropy', metrics = ['accuracy'])

#### **5.2.3 Building Bidirectional GRU (BI-GRU) Model**<a id="22"></a>

In [None]:
from tensorflow.keras.layers import Reshape, Bidirectional, GRU
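# BI-GRU: each TF-IDF vector is reshaped into a sequence of length 1,
# so the GRU sees a single time step per comment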
BI_GRU_model = Sequential([
    Reshape((1, X_train.shape[1]), input_shape=(X_train.shape[1],)),
    Bidirectional(GRU(64, return_sequences=False)),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

BI_GRU_model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

### **5.3 Training Model**<a id="20"></a>

#### **5.3.1 Training Logistic Regression Model**<a id="22"></a>

In [None]:
from sklearn.metrics import accuracy_score

train_logres_model = logreg_model.fit(X_train, y_train)

y_pred_logreg = logreg_model.predict(X_test)
logreg_accuracy = accuracy_score(y_test, y_pred_logreg)
logreg_accuracy

#### **5.3.2 Training Feedforward Neural Network (FNN) Model**<a id="22"></a>

In [None]:
train_FNN_model = FNN_model.fit(
    X_train.toarray(),
    y_train,
    epochs = 10,
    batch_size = 32,
    validation_split = 0.2
)

y_pred_FNN = FNN_model.predict(X_test.toarray()).round().astype(int)
fnn_accuracy = accuracy_score(y_test, y_pred_FNN)
fnn_accuracy

#### **5.3.3 Training Bidirectional GRU (BI-GRU) Model**<a id="22"></a>

In [None]:
train_BI_GRU_model = BI_GRU_model.fit(
    X_train.toarray(),
    y_train,
    epochs=10,
    batch_size=32,
    validation_split=0.2,
    verbose=1
)

y_pred_BIGRU = BI_GRU_model.predict(X_test.toarray()).round().astype(int)
bigru_accuracy = accuracy_score(y_test, y_pred_BIGRU)
bigru_accuracy

### **5.4 Visualizing Our Model**<a id="21"></a>

#### **5.4.1 Model Accuracy**<a id="22"></a>

##### **5.4.1.1 Model Accuracy for Logistic Regression**<a id="22"></a>

##### **5.4.1.2 Model Accuracy for FNN**<a id="22"></a>

##### **5.4.1.3 Model Accuracy for BI-GRU**<a id="22"></a>

In [None]:
#Training vs Validation Accuracy (history returned by BI_GRU_model.fit)
plt.figure(figsize = (6, 4))
plt.plot(train_BI_GRU_model.history['accuracy'])
plt.plot(train_BI_GRU_model.history['val_accuracy'])
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(['Train', 'Validation'], loc = 'upper left')
plt.show()

#### **5.4.2 Model Loss**<a id="23"></a>

##### **5.4.2.1 Model Loss for Logistic Regression**<a id="22"></a>

##### **5.4.2.2 Model Loss for FNN**<a id="22"></a>

##### **5.4.2.3 Model Loss for BI-GRU**<a id="22"></a>

In [None]:
#Training vs Validation Loss (history returned by BI_GRU_model.fit)
plt.figure(figsize = (6, 4))
plt.plot(train_BI_GRU_model.history['loss'])
plt.plot(train_BI_GRU_model.history['val_loss'])
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(['Train', 'Validation'], loc = 'upper left')
plt.show()

### **5.5 Model Accuracy Evaluation**<a id="24"></a>

In [None]:
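#Evaluating the BI-GRU Model's Accuracy On Test Data
"""Let's ensure that the model is not overfitting."""

#Keras expects a dense array, so the sparse TF-IDF matrix is converted first
loss, accuracy = BI_GRU_model.evaluate(X_test.toarray(), y_test)
print(f"The Test Accuracy is: {accuracy}")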

In [None]:
#Loss
print(f"The Model Loss is: {loss}")

In [None]:
from sklearn.metrics import classification_report

#Predictions on Test Data (probabilities above 0.5 are labelled toxic)
y_pred_prob = BI_GRU_model.predict(X_test.toarray())
y_pred = (y_pred_prob > 0.5).astype(int)

#Classification Report
class_report = classification_report(y_test, y_pred)
print(class_report)
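For reference, the precision, recall, and F1-score in the report can be derived from the counts of true positives, false positives, and false negatives. The sketch below computes them by hand for the positive class. `binary_metrics` and the toy labels and probabilities are illustrative, not taken from the notebook's results.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Toy labels and predicted probabilities (made up for illustration).
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 1]
y_prob_demo = [0.9, 0.4, 0.8, 0.2, 0.6, 0.1, 0.3, 0.7]

# Same 0.5 threshold as the notebook uses for the sigmoid output.
y_pred_demo = [int(p > 0.5) for p in y_prob_demo]

precision, recall, f1 = binary_metrics(y_true_demo, y_pred_demo)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here there are 3 true positives, 1 false positive, and 1 false negative, so all three metrics come out to 0.75.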

In [None]:
import seaborn as sns

#Predictions on Test Data
#y_pred_prob = model.predict(X_test)
#y_pred = (y_pred_prob > 0.5).astype(int)

#Classification Report
class_report = classification_report(y_test, y_pred, output_dict = True)
class_report_df = pd.DataFrame(class_report).transpose()

#Dropping irrelevant metrics for Visualization
class_metrics = class_report_df.drop(['accuracy', 'macro avg', 'weighted avg'])

#Classification Metrics Using Heatmap
plt.figure(figsize = (8, 6))
sns.heatmap(class_metrics[['precision', 'recall', 'f1-score']], annot = True, cmap = 'Reds', fmt = '.2f')
plt.title("Classification Report Metrics")
plt.xlabel("Metrics")
plt.ylabel("Class")
plt.yticks(rotation = 0)
plt.show()

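The heatmap above shows per-class precision, recall, and F1-score on the held-out test set. Read together with the training and validation curves in section 5.4, these indicate that the model is not overfitting.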

### **5.6 Saving Our Model and Vectorizer**<a id="25"></a>

In [None]:
#Saving the TF-IDF Vectorizer and the Keras Model
import pickle

with open('tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(vector, f)

#Saving the BI-GRU model trained above
BI_GRU_model.save('toxic_comment_prediction_model.h5')

### **5.7 Testing Our Saved Model**<a id="26"></a>

In [None]:
#Reusing The Saved Model
import pickle
from tensorflow.keras.models import load_model
#Import TF-IDF Vectorizer for text handling
from sklearn.feature_extraction.text import TfidfVectorizer

#Loading TF-IDF Vectorizer
with open('/kaggle/working/tfidf_vectorizer.pkl', 'rb') as f:
    loaded_vectorizer = pickle.load(f)
    
    
#Loading The Trained Model
loaded_model = load_model('/kaggle/working/toxic_comment_prediction_model.h5')
new_comments = [
    "You're quite a bad person at keeping to time.",
    "This is a very bad service.",
    "You’ve achieved so much!",
    "You are very stupid and mad.",
]

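#Cleaning the new comments the same way as the training data, then applying the loaded TF-IDF Vectorizer
processed_comment = loaded_vectorizer.transform([clean_text(comment) for comment in new_comments])

#Predicting using the Loaded Model (Keras expects a dense array)
predictions = (loaded_model.predict(processed_comment.toarray()) > 0.5).astype(int).ravel()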

#Prediction Result
for comment, prediction in zip(new_comments, predictions):
    print(f"Comment: {comment} | Is Toxic: {bool(prediction)}")