# CSCK507 Mid Module - Toxic Comment Classification Challenge

## Table of Contents
[Section 1. Introduction](#introduction)
- [Import Dependencies](#import-dependencies)
- [Initialise SpaCy Model ](#import-dependencies)

[Section 2. Data Exploration & Analysis](#data-exploration-&-analysis)
  - [Dataset Alignment](#data-preprocessing)
  - [Data Preprocessing](#data-preprocessing)
  - [Tokenisation & Lemmatisation](#data-preprocessing)
  
    [2.1 Requirements](#data-exploration-&-analysis)
      - [Number of Sentences & Tokens Per Class](#data-preprocessing)
      - [Understanding the Most Common Words](#data-preprocessing)
      - [Data Imbalance](#data-imbalance)
  
[Section 3. Feature Extraction Methods](#data-exploration-&-analysis)

[Section 4. Machine Learning Models](#data-exploration-&-analysis)

[Section 5. Model Evaluation](#data-exploration-&-analysis)


---
## 1. Introduction

Originating in 2018, this challenge revolves around classifying different levels of toxicity in online comments. The dataset from the inaugural competition is utilized to analyze and evaluate the performance of various machine learning algorithms in categorizing six types of toxicity. The primary goal is not only to find an optimal solution but to understand the process of evaluating machine learning algorithms' performance in a classification task. This individual assessment involves data analysis, algorithm selection, and the exploration of feature extraction methods to uncover insights into the nuances of toxic comment classification.

The Toxic Comment Classification Challenge and dataset can be obtained from Kaggle, here: 

### Importing Dependencies

In [1]:
# General
import spacy
import pandas as pd 
import numpy as np

# For Data Preprocessing
from imblearn.over_sampling import RandomOverSampler
from nltk.corpus import stopwords
import nltk
import re  

# For Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

#For Feature Extraction  
from sklearn.feature_extraction.text import TfidfVectorizer
import string 
from nltk import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer   

### Loading Kaggle dataset into DataFrame

In [2]:
df = pd.read_csv('./train.csv')
df_test_labels = pd.read_csv('./test_labels.csv')
df_test_comment = pd.read_csv('./test.csv')

### Initialise SpaCy Model 

In [3]:
try:
    spacy.prefer_gpu()
    spacy.load('en_core_web_sm')
except LookupError:
    print('Run: python -m spacy download en_core_web_sm')

try:
    nltk_stop = stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

In [4]:
spacy.prefer_gpu()
nlp = spacy.load('en_core_web_sm')

---
## 2. Data Exploration & Analysis

In [5]:
df.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0
5,00025465d4725e87,"""\n\nCongratulations from me as well, use the ...",0,0,0,0,0,0
6,0002bcb3da6cb337,COCKSUCKER BEFORE YOU PISS AROUND ON MY WORK,1,1,1,0,1,0
7,00031b1e95af7921,Your vandalism to the Matt Shirvington article...,0,0,0,0,0,0
8,00037261f536c51d,Sorry if the word 'nonsense' was offensive to ...,0,0,0,0,0,0
9,00040093b2687caa,alignment on this subject and which are contra...,0,0,0,0,0,0


In [6]:
df.info()
print("The table dimensions are:",df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             159571 non-null  object
 1   comment_text   159571 non-null  object
 2   toxic          159571 non-null  int64 
 3   severe_toxic   159571 non-null  int64 
 4   obscene        159571 non-null  int64 
 5   threat         159571 non-null  int64 
 6   insult         159571 non-null  int64 
 7   identity_hate  159571 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 9.7+ MB
The table dimensions are: (159571, 8)


### Aligning Train and Test Datasets

In [7]:
# Obtain class labels of the dataset
class_labels = list(df.columns[2:])
class_labels

# Remove rows with -1 from df_test as they are not used for scoring
print(f'df_test before removing -1: {df_test_labels.shape}')
for class_label in class_labels:
    df_test = df_test_labels[df_test_labels[class_label] != -1]
print(f'df_test after removing -1: {df_test_labels.shape}')

# Left join 'df_test' and 'df_test_comment' on the 'id' column
df_test = pd.merge(df_test_labels, df_test_comment, on='id', how='left')

# Rearrange and align columns to match the structure of 'df'
df_test = df_test[['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]

df_test before removing -1: (153164, 7)
df_test after removing -1: (153164, 7)


### Data Preprocessing

In [8]:
def preprocess_text(text):
    """
    Clean and preprocess a text string.

    Operations performed:
    - Replace special characters, URLs, and numbers with spaces.
    - Remove extra spaces and replace "\n" with a space.
    - Remove Non-English characters.
    - Remove start and end white spaces.
    - Remove single characters.
    - Remove punctuations.
    - Convert the text to lowercase.
    - Remove common stopwords.

    :param text: Input text (string).
    :return: Cleaned text (string).

    Example:
    >>> input_text = "An example text with special characters: $100 and URLs like https://example.com."
    >>> preprocess_text(input_text)
    'example text special characters URLs like'
    """
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)
    # Remove extra spaces and replace "\n" with a space
    text = re.sub("\s\s+", " ", text).replace("\n", " ")
    # Remove Non-English characters
    text = re.sub(r'[^\x00-\x7F]+', "", text)
    # Remove start and end white spaces
    text = text.strip()
    # Remove single characters
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)
    # Remove punctuations
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)
    # Lowercase the text
    text = text.lower()
    # Stopword Removal
    text = ' '.join([word for word in text.split() if word not in nltk_stop])

    return text

The 'preprocess_text' function uses RegEx (Regular Expression) to cleanse and standardize textual data for enhanced manageability. The reason, being: 

### Tokenisation and Lemmatisation

In [9]:
def tokenize_text(documents):
    """
    Tokenize a list of documents and perform the following:
    1. Break text into individual words or subword tokens.
    2. Reduce words to their base or root form using lemmatization.
    3. Remove stop words and non-alphabetic characters.

    Utilises spaCy's nlp.pipe for efficient batch processing.

    :param documents: List of strings representing documents.
    :return: List of lists of strings, where each list corresponds to the lemmatized tokens of a document.
    """
    lemmatized_tokens_list = []
    
    # Process documents using spaCy's nlp.pipe with "ner" and "parser" components disabled
    for doc in nlp.pipe(documents, disable=["ner", "parser"], batch_size=5000):
        # Generate lemmatised tokens, remove stop words, and non-alphabetic characters
        lemmatized_tokens = [token.lemma_ for token in doc if token.is_alpha and token.lemma_ not in nlp.Defaults.stop_words]
        lemmatized_tokens_list.append(lemmatized_tokens)

    return lemmatized_tokens_list

In many tokenization tasks, especially when you're primarily interested in lemmatization and removing stop words, you may not need the additional information provided by the "ner" and "parser" components.

Disabling the "ner" and "parser" components during the processing of documents with nlp.pipe will reduce computational laod and can significantly improve efficiency and speed, especially when dealing with a large amount of text data.

It's a trade-off between computational resources and the specific linguistic information your task requires. If named entities and syntactic parsing are not critical for your task, disabling these components is a pragmatic approach to enhance processing speed.

In [10]:
# Apply preprocessing to train data
df['comment_text'] = df['comment_text'].apply(preprocess_text)
df.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,explanation edits made username hardcore metal...,0,0,0,0,0,0
1,000103f0d9cfb60f,aww matches background colour seemingly stuck ...,0,0,0,0,0,0
2,000113f07ec002fd,hey man really trying edit war guy constantly ...,0,0,0,0,0,0
3,0001b41b1c6bb37e,make real suggestions improvement wondered sec...,0,0,0,0,0,0
4,0001d958c54c6e35,sir hero chance remember page,0,0,0,0,0,0
5,00025465d4725e87,congratulations well use tools well talk,0,0,0,0,0,0
6,0002bcb3da6cb337,cocksucker piss around work,1,1,1,0,1,0
7,00031b1e95af7921,vandalism matt shirvington article reverted pl...,0,0,0,0,0,0
8,00037261f536c51d,sorry word nonsense offensive anyway intending...,0,0,0,0,0,0
9,00040093b2687caa,alignment subject contrary dulithgow,0,0,0,0,0,0


In [11]:
# Apply preprocessing to new test data
df_test['comment_text'] = df_test['comment_text'].apply(preprocess_text)
df_test.head(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,00001cee341fdb12,yo bitch ja rule succesful ever whats hating s...,-1,-1,-1,-1,-1,-1
1,0000247867823ef7,rfc title fine imo,-1,-1,-1,-1,-1,-1
2,00013b17ad220c46,sources zawe ashton lapland,-1,-1,-1,-1,-1,-1
3,00017563c3f7919a,look back source information updated correct f...,-1,-1,-1,-1,-1,-1
4,00017695ad8997eb,anonymously edit articles,-1,-1,-1,-1,-1,-1
5,0001ea8717f6de06,thank understanding think highly would revert ...,0,0,0,0,0,0
6,00024115d4cbde0f,please add nonsense wikipedia edits considered...,-1,-1,-1,-1,-1,-1
7,000247e83dcc1211,dear god site horrible,0,0,0,0,0,0
8,00025358d4737918,fool believe numbers correct number lies 10 00...,-1,-1,-1,-1,-1,-1
9,00026d1092fe71cc,double redirects fixing double redirects blank...,-1,-1,-1,-1,-1,-1


### Count number of sentences and tokens per class using SpaCy.

In [None]:
# Create a dictionary to store counts
class_counts = {'class_label': [], 'num_sentences': [], 'num_tokens': []}

# Iterate through each class
for class_label in class_labels:
    # Select comments for the current class
    class_comments = df[df[class_label] != -1]['comment_text'].tolist()

    # Initialize counters
    total_sentences = 0
    total_tokens = 0

    # Iterate through comments in the current class
    for comment in class_comments:
        # Process the comment with spaCy
        doc = nlp(comment)

        # Count sentences and tokens
        total_sentences += len(list(doc.sents))
        total_tokens += len(doc)

    # Update the counts in the dictionary
    class_counts['class_label'].append(class_label)
    class_counts['num_sentences'].append(total_sentences)
    class_counts['num_tokens'].append(total_tokens)

# Create a DataFrame from the dictionary
class_counts_df = pd.DataFrame(class_counts)

# Display the result
print(class_counts_df)

In [None]:
# EXPLORING CLASS DISTRIBUTION
class_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# Loop through class columns and print class counts
for column in class_columns:
    class_counts = df[column].value_counts()
    
    print(f"{column.capitalize()} Counts:")
    for index, count in class_counts.items():
        class_label = "Non-" + column if index == 0 else column
        print(f"{class_label}: {count}")
    
    print()

---
## Feature Extraction
    
### Create a TF-IDF 
vectoriser = TfTfidfVectorizer()
transformed_output = v.fit_transform(
print(v.vocabulary_)

### 2.2 Requirements

We're going to review the data and it's Perform detailed data analysis of the dataset provided by the competition, observing:

Number of sentences and tokens per class (and check if the dataset is unbalanced or not).

Analyse the most common words for each class and, therefore, understand the most used terms for each level of toxicity.

In [None]:
# Set a stylish seaborn theme
sns.set_theme()

# Visualize the results with a dark palette
plt.figure(figsize=(12, 8))

plt.subplot(2, 1, 1)
sns.barplot(x='class_label', y='num_sentences', data=class_counts_df, palette='dark')
plt.title('Number of Sentences per Class')

plt.subplot(2, 1, 2)
sns.barplot(x='class_label', y='num_tokens', data=class_counts_df, palette='dark')
plt.title('Number of Tokens per Class')

plt.tight_layout()
plt.show()