# Toxic Comment Analysis with NLP and RNNs - Final Project

Reese Thompson, Feb 16th, 2025

## Problem and Data Analysis

This notebook will explore and analyze a set of toxic comments from a Wikipedia data set to train a model capable of identifying the type of toxicity in a comment, beyond just whether the comment is toxic.  The target value will ultimately be a probability, for each toxicity label, that a comment belongs to that category.

The initial data set includes a large set of comments that have been categorized accordingly.  We will use this data to build an NLP model using both LSTM RNN and GRU RNN.

Let's start by evaluating the data.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

train_raw = pd.read_csv('data/train.csv')

train = train_raw.copy()

print("\nTrain Data Head:")
print(train.head())

## EDA and Data Cleanup

We should check for missing text and verify the distribution of positive and negative cases across the labels.

In [None]:

missing_text = train['comment_text'].isnull().sum()
print(f"Missing text: {missing_text}")

In [None]:
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

label_counts = train[labels].sum().sort_values(ascending=False)

plt.figure(figsize=(8, 5))
sns.barplot(x=label_counts.index, y=label_counts.values, palette='viridis')
plt.title("Distribution of Toxicity Labels")
plt.ylabel("Count")
plt.show()
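
The bar chart shows raw counts, but the class imbalance is easier to judge as rates.  The block below is a minimal sketch on made-up rows (the `rows` data, `positive_rates`, and `clean_fraction` are illustrative names, not part of the notebook) showing the two numbers worth checking: the share of positives per label and the share of comments with no label at all.

```python
# Toy rows standing in for the six label columns (hypothetical data, not the real set).
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
rows = [
    [0, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 0, 1, 0],
]

def positive_rates(rows, labels):
    """Return the fraction of rows marked positive for each label."""
    n = len(rows)
    return {name: sum(r[i] for r in rows) / n for i, name in enumerate(labels)}

def clean_fraction(rows):
    """Return the fraction of rows that carry no positive label at all."""
    return sum(1 for r in rows if not any(r)) / len(rows)

rates = positive_rates(rows, labels)
print(rates)
print(clean_fraction(rows))
```

On the real frame the same quantities should be available as `train[labels].mean()` and `(train[labels].sum(axis=1) == 0).mean()`.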

In [None]:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

def simple_clean(text):
    # Lowercase, then drop everything except letters, digits, and whitespace
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    # Flatten remaining newlines, carriage returns, and tabs to spaces
    text = text.replace("\n", " ").replace("\r", " ").replace("\t", " ")
    return text

train['text'] = train['comment_text'].apply(simple_clean)
train = train.drop(columns=["comment_text"])
train.head()

In [None]:
train = train.drop(columns=['id'], axis=1)

train.head()

In [None]:
train['text_length'] = train['text'].apply(lambda x: len(str(x)))
plt.figure(figsize=(10, 5))
sns.histplot(train['text_length'], bins=50, kde=True)
plt.title("Distribution of Comment Text Length")
plt.xlabel("Text Length")
plt.ylabel("Frequency")
plt.show()

In [None]:
import pandas as pd

mean_length = train['text_length'].mean()
median_length = train['text_length'].median()
mode_length = train['text_length'].mode()[0]

print(f"Mean Text Length: {mean_length}")
print(f"Median Text Length: {median_length}")
print(f"Mode Text Length: {mode_length}")

In [None]:
short = train[train['text'].str.len() < 10]

print("Number of comments under 10 characters:", short.shape[0])

short.head()

In [None]:
long = train[train['text'].str.len() > 3000]

print("Number of comments over 3000 characters:", long.shape[0])

long.head()

In [None]:
length_95 = train['text_length'].quantile(0.95)
print(f"95th Percentile of Text Length: {length_95}")

In [None]:
# Drop the longest comments; 1306 is a hard-coded cutoff (compare with length_95 above)
train = train[train['text'].str.len() < 1306]

In [None]:
import random
import nltk
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('punkt_tab')
from nltk.corpus import wordnet

def get_synonyms(word):
    synonyms = set()
    for syn in wordnet.synsets(word):
        for lemma in syn.lemmas():
            synonym = lemma.name().replace('_', ' ').lower()
            if synonym != word:
                synonyms.add(synonym)
    return list(synonyms)

def synonym_replacement(comment, n=2):
    words = nltk.word_tokenize(comment)
    random_word_list = list(set(words))
    random.shuffle(random_word_list)
    
    num_replaced = 0
    for random_word in random_word_list:
        synonyms = get_synonyms(random_word)
        if len(synonyms) >= 1:
            synonym = random.choice(synonyms)
            words = [synonym if word == random_word else word for word in words]
            num_replaced += 1
        if num_replaced >= n:
            break
    
    return ' '.join(words)

In [None]:
augmented_rows = []
for idx, row in train[train['identity_hate'] > 0].iterrows():
    text = row['text']
    for _ in range(15):
        new_text = synonym_replacement(text, n=2)
        new_row = row.copy()
        new_row['text'] = new_text
        augmented_rows.append(new_row)

augmented_df = pd.DataFrame(augmented_rows)
train = pd.concat([train, augmented_df], axis=0).reset_index(drop=True)


In [None]:
augmented_rows = []
for idx, row in train[train['severe_toxic'] > 0].iterrows():
    text = row['text']
    for _ in range(15):
        new_text = synonym_replacement(text, n=2)
        new_row = row.copy()
        new_row['text'] = new_text
        augmented_rows.append(new_row)

augmented_df = pd.DataFrame(augmented_rows)
train = pd.concat([train, augmented_df], axis=0).reset_index(drop=True)

In [None]:
augmented_rows = []
for idx, row in train[train['threat'] > 0].iterrows():
    text = row['text']
    for _ in range(25):
        new_text = synonym_replacement(text, n=2)
        new_row = row.copy()
        new_row['text'] = new_text
        augmented_rows.append(new_row)

augmented_df = pd.DataFrame(augmented_rows)
train = pd.concat([train, augmented_df], axis=0).reset_index(drop=True)

In [None]:
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

label_counts = train[labels].sum().sort_values(ascending=False)

plt.figure(figsize=(8, 5))
sns.barplot(x=label_counts.index, y=label_counts.values, palette='viridis')
plt.title("Distribution of Toxicity Labels")
plt.ylabel("Count")
plt.show()

In [None]:
co_occurrence = train[labels].corr()
plt.figure(figsize=(8, 6))
sns.heatmap(co_occurrence, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Among Toxicity Labels")
plt.show()

## Model setup

We will use both LSTM and GRU to test which approach yields the best result.  We will first need to tokenize the text data, using the Keras tokenizer.  In the code below, we set up a temp tokenizer to find the max vocabulary size we should use.

In [None]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, GRU
from tensorflow.keras.optimizers import Adam
from collections import Counter

X = train['text'].values

tokenizer_temp = Tokenizer(oov_token='<OOV>')
tokenizer_temp.fit_on_texts(X)

word_counts = tokenizer_temp.word_counts
sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
total_count = sum(word_counts.values())
print(f"Total word count: {total_count}")

coverage_threshold = 0.97
running_count = 0
vocab_size_threshold = 0

for i, (word, count) in enumerate(sorted_word_counts):
    running_count += count
    if running_count / total_count >= coverage_threshold:
        vocab_size_threshold = i + 1
        break

print(f"Coverage threshold of {coverage_threshold} reached at vocab size: {vocab_size_threshold}")
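
To make the coverage idea concrete, here is a small self-contained sketch of the same calculation on three made-up sentences.  It uses `collections.Counter` and whitespace splitting instead of the Keras tokenizer, and the function name `vocab_size_for_coverage` is hypothetical.

```python
from collections import Counter

def vocab_size_for_coverage(texts, coverage):
    """Smallest number of most-frequent words whose counts reach `coverage` of all tokens."""
    counts = Counter(word for text in texts for word in text.split())
    total = sum(counts.values())
    running = 0
    for size, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running / total >= coverage:
            return size
    return len(counts)

# Toy corpus: "you" appears 3 times, "are" and "great" twice, "not" and "rock" once.
toy_texts = ["you are great", "you are not great", "you rock"]
print(vocab_size_for_coverage(toy_texts, 0.75))
print(vocab_size_for_coverage(toy_texts, 1.0))
```

Three words cover 7 of the 9 tokens, so 75% coverage needs a vocabulary of 3, while full coverage needs all 5 words.  The long tail of rare words is why a 97% threshold can cut the vocabulary sharply.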


Now that we know the maximum vocabulary size, we can set up our data for training and validation.  First, we define a helper to plot the training curves.

In [None]:
import matplotlib.pyplot as plt

# Function to plot training curves
def plot_training_curves(history, title):
    plt.figure(figsize=(12, 4))
    
    plt.subplot(1, 2, 1)
    plt.plot(history.history['accuracy'], label='Train Accuracy')
    plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
    plt.title(f'{title} - Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(history.history['loss'], label='Train Loss')
    plt.plot(history.history['val_loss'], label='Validation Loss')
    plt.title(f'{title} - Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()

    plt.show()

Our model architecture for the project will be a straightforward LSTM implementation: an embedding layer feeds into the LSTM layer, followed by a dropout layer to reduce overfitting.  This output is then fed to a dense layer that interprets the LSTM results, with a final dropout layer added.  The output layer uses sigmoid activations, one unit per label, since each label is its own binary classification problem.

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Text data as a list of strings
train_texts = train['text'].astype(str).tolist()

# Label data as a numpy array of shape (num_samples, 6)
train_labels = train[labels].values

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    train_texts, 
    train_labels, 
    test_size=0.2, 
    random_state=42
)

max_words = 10500   # Maximum vocabulary size
max_len = 1350       # Maximum sequence length for padding
embedding_dim = 100 # Dimension of embedding vectors

# 1. Fit the tokenizer on the training set only
tokenizer = Tokenizer(num_words=max_words, lower=True, oov_token="<OOV>")
tokenizer.fit_on_texts(X_train)

# 2. Convert text to sequences
X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_val_sequences = tokenizer.texts_to_sequences(X_val)

# 3. Pad sequences
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_len, padding='post', truncating='post')
X_val_padded = pad_sequences(X_val_sequences, maxlen=max_len, padding='post', truncating='post')

# 4. Build the model
model = Sequential()

# Embedding layer
model.add(Embedding(input_dim=max_words, output_dim=embedding_dim, input_length=max_len))

# LSTM layer
model.add(LSTM(128, return_sequences=False))

# Optional Dropout to prevent overfitting
model.add(Dropout(0.3))

# Dense layer
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.3))

# Output layer: 6 units for 6 labels, sigmoid activation for multi-label
model.add(Dense(6, activation='sigmoid'))

# 5. Compile the model
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)

model.summary()
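
The pairing of six sigmoid outputs with binary cross-entropy can be illustrated without a deep learning library.  The sketch below is plain Python with made-up logits and targets.  It shows that each label gets its own independent probability (they do not sum to 1, unlike softmax) and that the loss is the average of six separate binary losses.

```python
import math

def sigmoid(z):
    """Squash a raw score into a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over all labels of one comment."""
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical raw outputs for one comment, ordered like `labels`.
logits = [2.0, -3.0, 1.5, -4.0, 0.5, -2.0]
y_true = [1, 0, 1, 0, 1, 0]

probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(y_true, probs)
print([round(p, 3) for p in probs])
print(round(loss, 4))
```

Because most labels are 0 for most comments, a model can score well on accuracy by predicting 0 everywhere, so the accuracy curves are best read alongside the loss.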

In [None]:
num_epochs = 10
batch_size = 256

history = model.fit(
    X_train_padded, 
    y_train, 
    validation_data=(X_val_padded, y_val),
    epochs=num_epochs, 
    batch_size=batch_size,
    verbose=1
)

val_loss, val_acc = model.evaluate(X_val_padded, y_val, verbose=0)

plot_training_curves(history, "Initial Model Accuracy")


## Analysis and Hyperparameters

The initial model shows a significant gap between training and validation performance, suggesting our model is overfitting.  We will now tune the hyperparameters to find the best model parameters for this problem set.  We can also compare the LSTM and GRU models to see if there is any significant difference between the two.
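
One common response to this kind of overfitting is early stopping: halt training once the validation loss stops improving and keep the best epoch.  The sketch below shows the logic in plain Python on made-up validation losses; `best_epoch_with_patience` is a hypothetical helper, not something the notebook uses.

```python
def best_epoch_with_patience(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 0-indexed.

    Training stops after `patience` consecutive epochs without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improving for three epochs, then degrading.
val_losses = [0.30, 0.22, 0.20, 0.21, 0.24, 0.29]
print(best_epoch_with_patience(val_losses, patience=2))
```

In Keras the same behavior is provided by the `EarlyStopping` callback (for example `monitor='val_loss'`, `patience=2`, `restore_best_weights=True`) passed to `model.fit` through `callbacks`.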

In [None]:
def test_gru_model(embedding_dim, dropout, learning_rate, epochs, batch_size, first_layer_size, second_layer_size):
    # Same layout as the baseline model, with a GRU as the recurrent layer
    model = Sequential()
    model.add(Embedding(input_dim=max_words,
                        output_dim=embedding_dim,
                        input_length=max_len))
    model.add(GRU(first_layer_size, return_sequences=False))
    model.add(Dropout(dropout))
    model.add(Dense(second_layer_size, activation='relu'))
    model.add(Dropout(dropout))
    # Six sigmoid units, one per toxicity label, to match the shape of y_train
    model.add(Dense(6, activation='sigmoid'))

    optimizer = Adam(learning_rate=learning_rate)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

    history = model.fit(
        X_train_padded,
        y_train,
        validation_data=(X_val_padded, y_val),
        epochs=epochs,
        batch_size=batch_size,
        verbose=1
    )

    val_loss, val_acc = model.evaluate(X_val_padded, y_val, verbose=0)

    return val_acc

def test_lstm_model(embedding_dim, dropout, learning_rate, epochs, batch_size, first_layer_size, second_layer_size):
    # Identical to test_gru_model except for the LSTM recurrent layer
    model = Sequential()
    model.add(Embedding(input_dim=max_words,
                        output_dim=embedding_dim,
                        input_length=max_len))
    model.add(LSTM(first_layer_size, return_sequences=False))
    model.add(Dropout(dropout))
    model.add(Dense(second_layer_size, activation='relu'))
    model.add(Dropout(dropout))
    # Six sigmoid units, one per toxicity label, to match the shape of y_train
    model.add(Dense(6, activation='sigmoid'))

    optimizer = Adam(learning_rate=learning_rate)
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])

    history = model.fit(
        X_train_padded,
        y_train,
        validation_data=(X_val_padded, y_val),
        epochs=epochs,
        batch_size=batch_size,
        verbose=1
    )

    val_loss, val_acc = model.evaluate(X_val_padded, y_val, verbose=0)

    return val_acc


In [None]:
gru_acc = test_gru_model(128, .3,.001, 25, 64, 100, 50)
lstm_acc = test_lstm_model(128, .3,.001, 25, 64, 100, 50)

print(f"LSTM - {lstm_acc}")
print(f"GRU - {gru_acc}")

There are too many hyperparameters to test in a full grid search, so instead we search one parameter at a time, holding the others fixed and recording the best value found for each.
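
The saving is easy to quantify.  The sketch below counts the training runs each strategy needs for the candidate lists used in this notebook, then shows a generic one-at-a-time search driven by a toy scoring function.  `one_at_a_time_search` and `toy_score` are hypothetical stand-ins for the real training calls.  This variant carries each winner forward into later rounds, which is one reasonable choice and not necessarily what the cells below do.

```python
from math import prod

search_space = {
    "embedding_dim": [64, 128, 256],
    "dropout": [0.2, 0.3, 0.5],
    "learning_rate": [0.001, 0.0001, 0.00001],
    "epochs": [20, 30, 50],
    "batch_size": [32, 64, 128],
    "first_layer_size": [64, 128],
    "second_layer_size": [32, 64],
}

# A full grid multiplies the list lengths; one-at-a-time search adds them.
full_grid_runs = prod(len(v) for v in search_space.values())
one_at_a_time_runs = sum(len(v) for v in search_space.values())
print(full_grid_runs, one_at_a_time_runs)

def one_at_a_time_search(space, defaults, score_fn):
    """Tune each parameter in turn, keeping earlier winners fixed."""
    best = dict(defaults)
    for name, candidates in space.items():
        scores = {value: score_fn({**best, name: value}) for value in candidates}
        best[name] = max(scores, key=scores.get)
    return best

def toy_score(params):
    """Made-up score that peaks at dropout=0.3 and batch_size=64."""
    return -abs(params["dropout"] - 0.3) - abs(params["batch_size"] - 64) / 100

defaults = {name: values[0] for name, values in search_space.items()}
best = one_at_a_time_search(search_space, defaults, toy_score)
print(best["dropout"], best["batch_size"])
```

The trade-off is that this approach can miss interactions between parameters, for example a learning rate that only works well with a particular batch size.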

In [None]:

from itertools import product

best_accuracy = 0
best_hypers = {}

embedding_dim_list = [64, 128, 256]
dropout_list = [0.2, 0.3, 0.5]
learning_rate_list = [0.001, 0.0001, 0.00001]
epochs_list = [20, 30, 50]
batch_size_list = [32, 64, 128]
first_layer_size_list = [64, 128]
second_layer_size_list = [32, 64]

for emb in embedding_dim_list:
    accuracy = test_lstm_model(
            embedding_dim=emb,
            dropout=.2,
            learning_rate=.001,
            epochs=10,
            batch_size=32,
            first_layer_size=64,
            second_layer_size=32
        )
        
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_hypers['embedding_dim'] = emb

print(best_hypers)


In [None]:
best_accuracy = 0
for drop in dropout_list:
    accuracy = test_lstm_model(
            embedding_dim=128,
            dropout=drop,
            learning_rate=.001,
            epochs=10,
            batch_size=32,
            first_layer_size=64,
            second_layer_size=32
        )
        
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_hypers['dropout'] = drop

print(best_hypers)

In [None]:
best_accuracy = 0
for learn in learning_rate_list:
    accuracy = test_lstm_model(
            embedding_dim=128,
            dropout=.2,
            learning_rate=learn,
            epochs=10,
            batch_size=32,
            first_layer_size=64,
            second_layer_size=32
        )
        
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_hypers['learn'] = learn

print(best_hypers)

In [None]:
best_accuracy = 0
for e in epochs_list:
    accuracy = test_lstm_model(
            embedding_dim=128,
            dropout=.2,
            learning_rate=.0001,
            epochs=e,
            batch_size=32,
            first_layer_size=64,
            second_layer_size=32
        )
        
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_hypers['epochs'] = e

print(best_hypers)

In [None]:
best_accuracy = 0
for batch in batch_size_list:
    accuracy = test_lstm_model(
            embedding_dim=128,
            dropout=.2,
            learning_rate=.0001,
            epochs=20,
            batch_size=batch,
            first_layer_size=64,
            second_layer_size=32
        )
        
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_hypers['batch'] = batch

print(best_hypers)

In [None]:

best_accuracy = 0
for fls in first_layer_size_list:
    accuracy = test_lstm_model(
            embedding_dim=128,
            dropout=.2,
            learning_rate=.0001,
            epochs=20,
            batch_size=128,
            first_layer_size=fls,
            second_layer_size=32
        )
        
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_hypers['fls'] = fls

print(best_hypers)


In [None]:
best_accuracy = 0
for sls in second_layer_size_list:
    accuracy = test_lstm_model(
            embedding_dim=128,
            dropout=.2,
            learning_rate=.0001,
            epochs=20,
            batch_size=128,
            first_layer_size=128,
            second_layer_size=sls
        )
        
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        best_hypers['sls'] = sls

print(best_hypers)


## Review and Conclusion

We now have a set of tuned parameters for running our LSTM RNN on the Wikipedia comment data.  We can now retrain the model with them and verify its accuracy on the validation set.

In [None]:
embedding_dim = 256
model = Sequential()
model.add(Embedding(input_dim=max_words,
                    output_dim=embedding_dim,
                    input_length=max_len))
model.add(LSTM(128, return_sequences=False))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
# Six sigmoid units, one per toxicity label, to match the shape of y_train
model.add(Dense(6, activation='sigmoid'))

optimizer = Adam(learning_rate=0.0001)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
num_epochs = 20
batch_size = 128

history = model.fit(
    X_train_padded, 
    y_train, 
    validation_data=(X_val_padded, y_val),
    epochs=num_epochs, 
    batch_size=batch_size,
    verbose=1
)

val_loss, val_acc = model.evaluate(X_val_padded, y_val, verbose=0)

plot_training_curves(history, "Final Model Accuracy")
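
Since the labels are heavily imbalanced, overall accuracy can look strong even when rare labels are never predicted.  A per-label F1 score gives a clearer picture.  The sketch below computes it in plain Python on made-up predictions; `per_label_f1` is a hypothetical helper, and the two-label toy data is only for illustration.

```python
def per_label_f1(y_true, y_pred, labels):
    """Return the F1 score for each label from 0/1 truth and prediction rows."""
    results = {}
    for i, name in enumerate(labels):
        pairs = [(t[i], p[i]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        results[name] = 2 * precision * recall / denom if denom else 0.0
    return results

# Toy data: the model finds one of two toxic comments and misses every threat.
toy_labels = ["toxic", "threat"]
toy_true = [[1, 0], [1, 1], [0, 0], [0, 1]]
toy_pred = [[1, 0], [0, 0], [0, 0], [0, 0]]

f1 = per_label_f1(toy_true, toy_pred, toy_labels)
print(f1)
```

Here 5 of the 8 individual label decisions are correct, yet the F1 for `threat` is 0.  On the real validation set, scikit-learn's `f1_score` with `average=None` should report the same per-label values once the predicted probabilities are thresholded.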

In [None]:

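# Assumption: the test set sits next to train.csv with 'id' and 'comment_text' columns
test = pd.read_csv('data/test.csv')
test_texts = test['comment_text'].astype(str).apply(simple_clean).tolist()
X_test_padded = pad_sequences(tokenizer.texts_to_sequences(test_texts),
                              maxlen=max_len, padding='post', truncating='post')

# Threshold each of the six label probabilities at 0.5
predictions = model.predict(X_test_padded)
predictions = (predictions >= 0.5).astype(int)

# One column per label, with the comment id first
submission = pd.DataFrame(predictions, columns=labels)
submission.insert(0, 'id', test['id'].values)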

submission.to_csv('./submission.csv', index=False)