# CSCK507 Mid Module - Toxic Comment Classification Challenge

## Table of Contents
[Section 1. Introduction](#introduction)
- [Import Dependencies](#import-dependencies)
- [Initialise SpaCy Model ](#import-dependencies)

[Section 2. Data Exploration & Analysis](#data-exploration-&-analysis)
  - [Dataset Alignment](#data-preprocessing)
  - [Data Preprocessing](#data-preprocessing)
  - [Tokenisation & Lemmatisation](#data-preprocessing)
  - [Combining Tokenised Data and Labels for Training and Test Dataset](#data-preprocessing)
  
    [2.1 Requirements](#data-exploration-&-analysis)
      - [Number of Sentences & Tokens Per Class](#data-preprocessing)
      - [Understanding the Most Common Words](#data-preprocessing)
      - [Data Imbalance](#data-imbalance)
  
[Section 3. Feature Extraction Methods](#data-exploration-&-analysis)

[Section 4. Machine Learning Models](#data-exploration-&-analysis)

[Section 5. Model Evaluation](#data-exploration-&-analysis)


---
## 1. Introduction

Originating in 2018, this challenge revolves around classifying different levels of toxicity in online comments. The dataset from the inaugural competition is utilized to analyze and evaluate the performance of various machine learning algorithms in categorizing six types of toxicity. The primary goal is not only to find an optimal solution but to understand the process of evaluating machine learning algorithms' performance in a classification task. This individual assessment involves data analysis, algorithm selection, and the exploration of feature extraction methods to uncover insights into the nuances of toxic comment classification.

The Toxic Comment Classification Challenge and dataset can be obtained from Kaggle, here: 

### Importing Dependencies

In [13]:
# General
import spacy
import pandas as pd 
import numpy as np

# For Data Preprocessing
from imblearn.over_sampling import RandomOverSampler
from nltk.corpus import stopwords
import nltk
import re  

# For Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

#For Feature Extraction  
from sklearn.feature_extraction.text import TfidfVectorizer
import string 
from nltk import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer   

### Loading Kaggle dataset into DataFrame

In [14]:
df = pd.read_csv('./train.csv')
df_test_labels = pd.read_csv('./test_labels.csv')
df_test_comment = pd.read_csv('./test.csv')

### Initialise SpaCy Model 

In [15]:
try:
    spacy.prefer_gpu()
    nlp = spacy.load('en_core_web_sm')
except (LookupError, OSError):
    print('Run: python -m spacy download en_core_web_sm')

try:
    nltk_stop = stopwords.words('english')
except LookupError:
    nltk.download('stopwords')
    nltk_stop = stopwords.words('english')

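This code asks spaCy to use a GPU when one is available and loads the small English model 'en_core_web_sm'. If the model is not installed, it prints the command needed to download it rather than downloading it automatically. It also loads the NLTK English stopword list, downloading it first if it is not already available.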

---
## 2. Data Exploration & Analysis

In [16]:
df.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
5,00025465d4725e87,"""\n\nCongratulations from me as well, use the ...",0,0,0,0,0,0
6,0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,1,1,0,1,0
7,00031b1e95af7921,Your vandalism to the Matt Shirvington article...,0,0,0,0,0,0
8,00037261f536c51d,Sorry if the word 'nonsense' was offensive to ...,0,0,0,0,0,0
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0


In [17]:
df.info()
print("The table dimensions are:",df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             159571 non-null  object
 1   comment_text   159571 non-null  object
 2   toxic          159571 non-null  int64 
 3   severe_toxic   159571 non-null  int64 
 4   obscene        159571 non-null  int64 
 5   threat         159571 non-null  int64 
 6   insult         159571 non-null  int64 
 7   identity_hate  159571 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 9.7+ MB
The table dimensions are: (159571, 8)


### Aligning Train and Test Datasets

In [18]:
try:
    # Obtain class labels of the dataset
    class_labels = list(df.columns[2:])
    print("Class labels extracted successfully.")

    # Remove rows with -1 from df_test as they are not used for scoring
    print(f'df_test before removing -1: {df_test_labels.shape}')
    for class_label in class_labels:
        df_test_labels = df_test_labels[df_test_labels[class_label] != -1]

    print(f'df_test after removing -1: {df_test_labels.shape}')

    # Left join 'df_test_labels' and 'df_test_comment' on the 'id' column
    df_test = pd.merge(df_test_labels, df_test_comment, on='id', how='left')
    print(f"Dataframes merged successfully.")

    # Create a new DataFrame called df_test and match the column structure of 'df'
    df_test = df_test[['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]
    print("New DataFrame 'df_test' created successfully.")

except KeyError as ke:
    print(f"Error: {ke} not found.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Class labels extracted successfully.
df_test before removing -1: (153164, 7)
df_test after removing -1: (63978, 7)
Dataframes merged successfully.
New DataFrame 'df_test' created successfully.
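
The per-column loop that drops the -1 rows can also be expressed as a single boolean mask. The following is a minimal sketch on a toy frame, not the notebook's actual data: the names `toy_labels`, `toy_comments` and `label_cols` and all values are invented for illustration, and it assumes -1 marks rows that are not used for scoring.

```python
import pandas as pd

# Toy stand-ins for test_labels.csv and test.csv (values are illustrative only)
toy_labels = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'toxic': [0, -1, 1],
    'insult': [0, -1, 0],
})
toy_comments = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'comment_text': ['first', 'second', 'third'],
})
label_cols = ['toxic', 'insult']

# Keep only rows where every label column differs from -1
scored = toy_labels[(toy_labels[label_cols] != -1).all(axis=1)]

# Left join keeps every scored row and attaches its comment text
toy_test = scored.merge(toy_comments, on='id', how='left')

# Reorder the columns to match the training frame layout
toy_test = toy_test[['id', 'comment_text'] + label_cols]
print(toy_test)
```

Filtering with one mask avoids reassigning the frame once per class and gives the same rows as the loop, since a row survives the loop only if none of its labels is -1.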


### Data Preprocessing

In [26]:
def preprocess_text(text, nltk_stop=None):
    # Fall back to NLTK's English stopwords when no list is supplied
    if nltk_stop is None:
        try:
            nltk_stop = stopwords.words('english')
        except LookupError:
            print("NLTK stopwords not available. Consider downloading with nltk.download('stopwords').")
            nltk_stop = []
    # Convert to a set so both lists and sets are accepted and lookups are fast
    nltk_stop = set(nltk_stop)

    try:
        # Replace URLs and non-ASCII characters with a space
        text = re.sub(r"(http\S+|www\S+|https\S+)|[^\x00-\x7F]+", " ", text)
        # Remove start and end white spaces
        text = text.strip()
        # Remove single characters
        text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)
        # Remove punctuations and convert to lowercase
        text = re.sub(r"[^a-zA-Z0-9]+", " ", text).lower()
        # Remove stopwords while preserving word order and repeated words
        text = ' '.join(word for word in text.split() if word not in nltk_stop)

        return text

    except Exception as e:
        print(f"An unexpected error occurred: {e}")
        return None

The 'preprocess_text' function cleans and standardises the comment text. It replaces URLs and non-ASCII characters with spaces, strips leading and trailing whitespace, removes isolated single letters, replaces punctuation and other non-alphanumeric characters (including newlines) with single spaces, and converts the text to lowercase. Digits are kept, because the pattern [^a-zA-Z0-9] matches only non-alphanumeric characters. The final step removes common English stopwords so that the remaining words carry most of the meaning.
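
To make the individual cleaning steps concrete, here is a small worked example that applies the same regular expressions to one sample string. It is a sketch only: the function name `clean`, the sample sentence and the tiny `toy_stopwords` set are hypothetical (the notebook uses NLTK's full English list), and stopwords are filtered with an order-preserving comprehension.

```python
import re

# Hypothetical, deliberately tiny stopword set (the notebook uses NLTK's English list)
toy_stopwords = {'the', 'is', 'at', 'this'}

def clean(text, stop):
    # Replace URLs and non-ASCII characters with a space
    text = re.sub(r"(http\S+|www\S+|https\S+)|[^\x00-\x7F]+", " ", text)
    # Strip leading and trailing whitespace
    text = text.strip()
    # Drop single letters that stand alone between whitespace
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)
    # Replace runs of non-alphanumeric characters with a space, then lowercase
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text).lower()
    # Filter stopwords while keeping word order and repeated words
    return ' '.join(w for w in text.split() if w not in stop)

sample = "This site is GREAT!!! See http://example.com at 10 pm"
cleaned = clean(sample, toy_stopwords)
print(cleaned)  # site great see 10 pm
```

The URL, the punctuation and the three stopwords disappear, while the number 10 survives because digits are part of the allowed character class.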

### Tokenisation and Lemmatisation

In [25]:
def tokenize_lemma_text(documents):
    """
    Tokenize a list of documents and perform the following:
    1. Break text into individual words or subword tokens.
    2. Reduce words to their base or root form using lemmatization.
    3. Remove stop words and non-alphabetic characters.

    Utilises spaCy's nlp.pipe for efficient batch processing.

    :param documents: List of strings representing documents.
    :return: List of lists of strings, where each list corresponds to the lemmatized tokens of a document.
    """
    lemmatized_tokens_list = []
    
    # Process documents using spaCy's nlp.pipe with "ner" and "parser" components disabled utilising 4 core parallel processing:
    for doc in nlp.pipe(documents, disable=["ner", "parser"], batch_size=5000, n_process=4):
        # Generate lemmatised tokens, remove stop words, and non-alphabetic characters
        lemmatized_tokens = [token.lemma_ for token in doc if token.is_alpha and token.lemma_ not in nlp.Defaults.stop_words]
        lemmatized_tokens_list.append(lemmatized_tokens)

    return lemmatized_tokens_list

In many tokenization tasks, especially when you're primarily interested in lemmatization and removing stop words, you may not need the additional information provided by the "ner" and "parser" components.

Disabling the "ner" and "parser" components when processing documents with nlp.pipe reduces the computational load and can noticeably improve speed, especially when dealing with a large amount of text data.

It's a trade-off between computational resources and the specific linguistic information your task requires. If named entities and syntactic parsing are not critical for your task, disabling these components is a pragmatic approach to enhance processing speed.

In [21]:
# Preprocess the train dataset
df['comment_text'] = df['comment_text'].apply(preprocess_text)
print("Preprocessed train dataset:")
df.head()

Preprocessed train dataset:


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,explanation edits made username hardcore metal...,0,0,0,0,0,0
1,000103f0d9cfb60f,aww matches background colour seemingly stuck ...,0,0,0,0,0,0
2,000113f07ec002fd,hey man really trying edit war guy constantly ...,0,0,0,0,0,0
3,0001b41b1c6bb37e,make real suggestions improvement wondered sec...,0,0,0,0,0,0
4,0001d958c54c6e35,sir hero chance remember page,0,0,0,0,0,0


In [22]:
# Preprocess the test dataset
df_test['comment_text'] = df_test['comment_text'].apply(preprocess_text)
print("\nPreprocessed test dataset:")
df_test.head()


Preprocessed test dataset:


Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0001ea8717f6de06,thank understanding think highly would revert ...,0,0,0,0,0,0
1,000247e83dcc1211,dear god site horrible,0,0,0,0,0,0
2,0002f87b16116a7f,somebody invariably try add religion really me...,0,0,0,0,0,0
3,0003e1cccfd5a40a,says right type type institution needed case t...,0,0,0,0,0,0
4,00059ace3e3e9a53,adding new product list make sure relevant add...,0,0,0,0,0,0


In [23]:
# Tokenizing the train and test datasets
tokenized_comment_train = tokenize_lemma_text(df['comment_text'].tolist())
tokenized_comment_test = tokenize_lemma_text(df_test['comment_text'].tolist())

# Get labels for train and test data
y = df[class_labels]
y_test = df_test[class_labels]

### Combining Tokenised Text and Labels for Training and Test Dataset

In [34]:
# For training data
df_train = pd.DataFrame({
    'comment': tokenized_comment_train,         # Tokenized comment text
    'toxic': y['toxic'],                        # Toxicity label
    'severe_toxic': y['severe_toxic'],          # Severe toxicity label
    'obscene': y['obscene'],                    # Obscenity label
    'threat': y['threat'],                      # Threatening language label
    'insult': y['insult'],                      # Insult label
    'identity_hate': y['identity_hate']         # Identity hate label
})

print("Head of df_train (Training Data):")
display(df_train.head())

# For test data
df_test = pd.DataFrame({
    'comment': tokenized_comment_test,          # Tokenized comment text for testing
    'toxic': y_test['toxic'],                   # Toxicity label for testing
    'severe_toxic': y_test['severe_toxic'],     # Severe toxicity label for testing
    'obscene': y_test['obscene'],               # Obscenity label for testing
    'threat': y_test['threat'],                 # Threatening language label for testing
    'insult': y_test['insult'],                 # Insult label for testing
    'identity_hate': y_test['identity_hate']    # Identity hate label for testing
})

print("\nHead of df_test (Test Data):")
display(df_test.head())

Head of df_train (Training Data):


Unnamed: 0,comment,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,"[explanation, edit, username, hardcore, metall...",0,0,0,0,0,0
1,"[aww, match, background, colour, seemingly, st...",0,0,0,0,0,0
2,"[hey, man, try, edit, war, guy, constantly, re...",0,0,0,0,0,0
3,"[real, suggestion, improvement, wonder, sectio...",0,0,0,0,0,0
4,"[sir, hero, chance, remember, page]",0,0,0,0,0,0



Head of df_test (Test Data):


Unnamed: 0,comment,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,"[thank, understanding, think, highly, revert, ...",0,0,0,0,0,0
1,"[dear, god, site, horrible]",0,0,0,0,0,0
2,"[somebody, invariably, try, add, religion, mea...",0,0,0,0,0,0
3,"[right, type, type, institution, need, case, l...",0,0,0,0,0,0
4,"[add, new, product, list, sure, relevant, add,...",0,0,0,0,0,0


### 2.1 Requirements

This section performs a detailed analysis of the dataset provided by the competition, covering:

- The number of sentences and tokens per class, and whether the dataset is balanced.
- The most common words for each class, to understand which terms are used most at each level of toxicity.

### Counting Number of Sentences & Tokens Per Class

In [None]:
# Create a dictionary to store counts
class_counts = {'class_label': [], 'num_sentences': [], 'num_tokens': []}

# Iterate through each class
for class_label in class_labels:
    # Select comments labelled positive (1) for the current class
    class_comments = df[df[class_label] == 1]['comment_text'].tolist()

    # Initialize counters
    total_sentences = 0
    total_tokens = 0

    # Iterate through comments in the current class
    for comment in class_comments:
        # Process the comment with spaCy
        doc = nlp(comment)

        # Count sentences and tokens
        total_sentences += len(list(doc.sents))
        total_tokens += len(doc)

    # Update the counts in the dictionary
    class_counts['class_label'].append(class_label)
    class_counts['num_sentences'].append(total_sentences)
    class_counts['num_tokens'].append(total_tokens)

# Create a DataFrame from the dictionary
class_counts_df = pd.DataFrame(class_counts)

# Display the result
print(class_counts_df)

### Understanding the Most Common Words
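
One way to find the most used terms per class is to count tokens over the comments labelled 1 for that class. The sketch below uses `collections.Counter` on toy rows that mimic the layout of df_train (a token list plus binary labels). The rows, the two labels and the helper `most_common_words` are hypothetical and only illustrate the idea.

```python
from collections import Counter

# Toy rows mimicking df_train: a token list plus binary labels (values invented)
toy_rows = [
    {'comment': ['stupid', 'page', 'stupid'], 'toxic': 1, 'insult': 1},
    {'comment': ['thank', 'edit', 'page'], 'toxic': 0, 'insult': 0},
    {'comment': ['stupid', 'idiot'], 'toxic': 1, 'insult': 0},
]
toy_labels = ['toxic', 'insult']

def most_common_words(rows, label, top_n=3):
    """Count tokens over the rows where `label` equals 1."""
    counter = Counter()
    for row in rows:
        if row[label] == 1:
            counter.update(row['comment'])
    return counter.most_common(top_n)

for label in toy_labels:
    print(label, most_common_words(toy_rows, label))
```

With the real data, the rows for one class could be selected with df_train.loc[df_train[label] == 1, 'comment'] and fed to the same counter.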

### Exploring Class Distribution and Imbalance 

In [None]:
class_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Loop through class columns and print class counts
for column in class_columns:
    class_counts = df[column].value_counts()
    
    print(f"{column.capitalize()} Counts:")
    for index, count in class_counts.items():
        class_label = "Non-" + column if index == 0 else column
        print(f"{class_label}: {count}")
    
    print()
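
The raw counts can be condensed into one imbalance figure per class. The following sketch, on an invented toy label matrix (`toy_df` is a hypothetical stand-in for the label columns of df), computes the share of positive rows and the number of negatives per positive.

```python
import pandas as pd

# Toy label matrix (values invented); the notebook's df has one such column per class
toy_df = pd.DataFrame({
    'toxic':  [1, 0, 0, 0, 1, 0, 0, 0],
    'threat': [0, 0, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of positive rows
positive_rate = toy_df.mean()

# Negatives per positive: larger values mean stronger imbalance
imbalance_ratio = (1 - positive_rate) / positive_rate

summary = pd.DataFrame({
    'positive_rate': positive_rate,
    'negatives_per_positive': imbalance_ratio,
})
print(summary)
```

A class with many negatives per positive is a candidate for resampling or for metrics that are less sensitive to imbalance than accuracy.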

In [None]:
# Set a stylish seaborn theme
sns.set_theme()

# Visualize the results with a dark palette
plt.figure(figsize=(12, 8))

plt.subplot(2, 1, 1)
sns.barplot(x='class_label', y='num_sentences', data=class_counts_df, palette='dark')
plt.title('Number of Sentences per Class')

plt.subplot(2, 1, 2)
sns.barplot(x='class_label', y='num_tokens', data=class_counts_df, palette='dark')
plt.title('Number of Tokens per Class')

plt.tight_layout()
plt.show()

---
## 3. Feature Extraction
    
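### Create a TF-IDF Vectoriser

In [None]:
# Assumption: the truncated fit_transform call was meant to take the tokenised
# training comments; tokens are re-joined because TfidfVectorizer expects strings
vectoriser = TfidfVectorizer()
transformed_output = vectoriser.fit_transform(df_train['comment'].str.join(' '))
print(vectoriser.vocabulary_)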
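
By default, TfidfVectorizer expects raw strings and tokenises them itself, whereas the 'comment' column built earlier holds lists of tokens. One possible way to feed pre-tokenised data, shown here as a sketch and not necessarily the approach intended for this notebook, is to pass a callable analyzer that returns each token list unchanged. The toy documents and the names `toy_tokens`, `toy_vectoriser` and `toy_matrix` are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy pre-tokenised comments (invented for illustration)
toy_tokens = [
    ['dear', 'god', 'site', 'horrible'],
    ['sir', 'hero', 'chance', 'remember', 'page'],
    ['add', 'new', 'product', 'list', 'add'],
]

# A callable analyzer receives each document as-is, so token lists pass through
# unchanged and the built-in tokenisation and lowercasing are skipped
toy_vectoriser = TfidfVectorizer(analyzer=lambda tokens: tokens)
toy_matrix = toy_vectoriser.fit_transform(toy_tokens)

print(toy_matrix.shape)  # (number of documents, number of distinct tokens)
print(sorted(toy_vectoriser.vocabulary_))

# With the default norm='l2', each row has unit Euclidean length
row_norms = np.sqrt(toy_matrix.multiply(toy_matrix).sum(axis=1))
print(row_norms)
```

The result is a sparse matrix with one row per comment and one column per vocabulary term, which can be passed directly to most scikit-learn classifiers.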