# CSCK507 Mid-Module Assignment
### Toxic comment classification challenge

## Table of Contents
[Section 1. Introduction](#introduction)
  - [Importing Dependencies](#importing-dependencies)

[Section 2. Data Exploration & Analysis](#data-exploration-&-analysis)
  - [2.1 Data Preprocessing](#data-preprocessing)
  - [2.2 Requirements](#requirements)

## 1. Introduction 

### Importing Dependencies

In [57]:
# General
import spacy
import pandas as pd 
import numpy as np

# For Data Preprocessing
from imblearn.over_sampling import RandomOverSampler
from nltk.corpus import stopwords
import nltk
import re  

# For Visualisation
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec 
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS
from PIL import Image

# For Feature Extraction
from sklearn.feature_extraction.text import TfidfVectorizer
import string 
from nltk import pos_tag
from nltk.stem.wordnet import WordNetLemmatizer 
from nltk.tokenize import word_tokenize
from nltk.tokenize import TweetTokenizer   

In [58]:
# Load the dataset into pandas DataFrame with relative path

df = pd.read_csv('./train.csv')
df_test = pd.read_csv('./test_labels.csv')
df_testcomments = pd.read_csv('./test.csv')

In [59]:
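try:
    spacy.prefer_gpu()
    spacy.load('en_core_web_sm')
except OSError:
    # spacy.load raises OSError when the model package is not installed
    print('Run: python -m spacy download en_core_web_sm')

try:
    nltk_stop = stopwords.words('english')
except LookupError:
    # Download the corpus on first use, then load it
    nltk.download('stopwords')
    nltk_stop = stopwords.words('english')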

In [60]:
# Initialise SpaCy Model 
spacy.prefer_gpu()
nlp = spacy.load('en_core_web_sm')

## 2. Data Exploration & Analysis

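This section inspects the label distribution and structure of the training data, prepares the labelled test set, and preprocesses the comment text.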

In [61]:
class_columns = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

for column in class_columns:
    class_counts = df[column].value_counts()
    print(class_counts)
    print()

toxic
0    144277
1     15294
Name: count, dtype: int64

severe_toxic
0    157976
1      1595
Name: count, dtype: int64

obscene
0    151122
1      8449
Name: count, dtype: int64

threat
0    159093
1       478
Name: count, dtype: int64

insult
0    151694
1      7877
Name: count, dtype: int64

identity_hate
0    158166
1      1405
Name: count, dtype: int64
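
To put these counts on a common scale, the positive rate of each label can be computed directly from the figures printed above. The snippet below is a minimal sketch that hard-codes those counts instead of reading train.csv; the names positive_counts and positive_rate are illustrative and do not appear elsewhere in the notebook.

```python
# Positive counts copied from the value_counts() output above.
n_rows = 159571
positive_counts = {
    'toxic': 15294,
    'severe_toxic': 1595,
    'obscene': 8449,
    'threat': 478,
    'insult': 7877,
    'identity_hate': 1405,
}

# Share of comments that carry each label.
positive_rate = {label: count / n_rows for label, count in positive_counts.items()}

# Print the labels from most to least frequent.
for label, rate in sorted(positive_rate.items(), key=lambda item: item[1], reverse=True):
    print(f'{label:<14}{rate:.2%}')
```

Every label applies to fewer than 10% of the comments, and threat is the rarest, so all six classes are heavily imbalanced towards the negative class.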



In [62]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159571 entries, 0 to 159570
Data columns (total 8 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   id             159571 non-null  object
 1   comment_text   159571 non-null  object
 2   toxic          159571 non-null  int64 
 3   severe_toxic   159571 non-null  int64 
 4   obscene        159571 non-null  int64 
 5   threat         159571 non-null  int64 
 6   insult         159571 non-null  int64 
 7   identity_hate  159571 non-null  int64 
dtypes: int64(6), object(2)
memory usage: 9.7+ MB
None


In [70]:
# Obtain class labels of the dataset
class_labels = list(df.columns[2:])
class_labels

# Remove rows with -1 from df_test as they are not used for scoring
print(f'df_test before removing -1: {df_test.shape}')
for class_label in class_labels:
    df_test = df_test[df_test[class_label] != -1]
print(f'df_test after removing -1: {df_test.shape}')

# Left join 'df_test' and 'df_testcomments' on the 'id' column
df_test = pd.merge(df_test, df_testcomments, on='id', how='left')

# Rearrange columns to match the structure of df
df_test = df_test[['id', 'comment_text', 'toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']]


df_test before removing -1: (63978, 9)
df_test after removing -1: (63978, 9)
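
The filter-then-merge step can be illustrated on a toy example. This is a sketch under stated assumptions: the two small DataFrames stand in for test_labels.csv and test.csv with made-up values, and the names labels, comments, label_columns, scored and merged are hypothetical. It uses a single boolean mask in place of the loop, and asks pandas to validate that each id occurs once on both sides.

```python
import pandas as pd

# Toy stand-ins for test_labels.csv and test.csv (values are made up).
labels = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'toxic': [0, -1, 1],
    'insult': [0, -1, 0],
})
comments = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'comment_text': ['first comment', 'second comment', 'third comment'],
})
label_columns = ['toxic', 'insult']

# Keep only rows where no label column is -1 (the rows not used for scoring).
scored = labels[(labels[label_columns] != -1).all(axis=1)]

# Attach the comment text; validate= raises if an id is duplicated on either side.
merged = pd.merge(scored, comments, on='id', how='left', validate='one_to_one')

# Reorder the columns so the text comes before the labels.
merged = merged[['id', 'comment_text'] + label_columns]
print(merged)
```

Because the notebook cell reassigns df_test, running it a second time would merge comment_text again, producing comment_text_x and comment_text_y columns and a KeyError in the column selection. Writing the result to a new variable, as in the sketch, avoids this.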



### 2.1 Data Preprocessing

This part consists of... 

In [65]:
def preprocess_text(text):
    r"""
    Clean and preprocess a text string.

    Operations performed, in order:
    - Remove URLs.
    - Collapse repeated whitespace and replace "\n" with a space.
    - Remove non-ASCII characters.
    - Strip leading and trailing whitespace.
    - Remove single letters surrounded by whitespace.
    - Replace punctuation and other special characters with spaces (digits are kept).
    - Convert the text to lowercase.
    - Remove NLTK English stopwords.

    :param text: Input text (string).
    :return: Cleaned text (string).

    Example:
    >>> input_text = "An example text with special characters: $100 and URLs like https://example.com."
    >>> preprocess_text(input_text)
    'example text special characters 100 urls like'
    """
    # Remove URLs
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)
    # Collapse repeated whitespace and replace "\n" with a space
    text = re.sub(r"\s\s+", " ", text).replace("\n", " ")
    # Remove non-ASCII characters
    text = re.sub(r'[^\x00-\x7F]+', "", text)
    # Strip leading and trailing whitespace
    text = text.strip()
    # Remove single letters surrounded by whitespace
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)
    # Replace punctuation and special characters with a space (digits are kept)
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)
    # Lowercase the text
    text = text.lower()
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in nltk_stop])

    return text
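
The cleaning steps can be checked on a single string without downloading the NLTK corpus. The sketch below mirrors the steps of preprocess_text, but it is an illustration only: preprocess_demo and demo_stopwords are hypothetical names, and the tiny stopword set is an assumed stand-in for NLTK's English list.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list, so the sketch runs offline.
demo_stopwords = {'a', 'an', 'and', 'the', 'with', 'is'}

def preprocess_demo(text, stop_words=demo_stopwords):
    """Apply the same cleaning steps as preprocess_text, using a given stopword set."""
    text = re.sub(r"http\S+|www\S+", "", text)              # remove URLs
    text = re.sub(r"\s\s+", " ", text).replace("\n", " ")   # collapse whitespace
    text = re.sub(r"[^\x00-\x7F]+", "", text)               # remove non-ASCII characters
    text = text.strip()                                     # trim both ends
    text = re.sub(r"\s+[a-zA-Z]\s+", " ", text)             # remove isolated single letters
    text = re.sub(r"[^a-zA-Z0-9]+", " ", text)              # punctuation -> space, digits kept
    text = text.lower()
    return ' '.join(word for word in text.split() if word not in stop_words)

sample = "An example text with special characters: $100 and URLs like https://example.com."
print(preprocess_demo(sample))
```

Digits survive the cleaning because the punctuation pattern keeps 0-9, which is consistent with entries such as "one 1897" in the preprocessed test data.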

### Tokenisation and Lemmatisation

This code...

In [71]:
def tokenize_text(documents):
    """
    Tokenize a list of documents and perform the following:
    1. Break text into individual words or subword tokens.
    2. Reduce words to their base or root form using lemmatization.
    3. Remove stop words and non-alphabetic characters.

    Utilises spaCy's nlp.pipe for efficient batch processing.

    :param documents: List of strings representing documents.
    :return: List of lists of strings, where each list corresponds to the lemmatized tokens of a document.
    """
    lemmatized_tokens_list = []
    
    # Process documents using spaCy's nlp.pipe with "ner" and "parser" components disabled
    for doc in nlp.pipe(documents, disable=["ner", "parser"], batch_size=5000):
        # Generate lemmatised tokens, remove stop words, and non-alphabetic characters
        lemmatized_tokens = [token.lemma_ for token in doc if token.is_alpha and token.lemma_ not in nlp.Defaults.stop_words]
        lemmatized_tokens_list.append(lemmatized_tokens)

    return lemmatized_tokens_list



Tokenisation, lemmatisation and stop-word removal do not use the information produced by the "ner" and "parser" components, so both can be switched off for this task.

Disabling them in nlp.pipe reduces the computational load and speeds up processing, which matters when handling a large volume of text.

The trade-off is between computational cost and the linguistic information the task requires: if named entities and syntactic parses are not needed, disabling these components is a pragmatic way to process the data faster.

In [67]:
# Apply preprocessing to train data
df['comment_text'] = df['comment_text'].apply(preprocess_text)
df.head(10)

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0000997932d777bf | explanation edits made username hardcore metal... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000103f0d9cfb60f | aww matches background colour seemingly stuck ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 000113f07ec002fd | hey man really trying edit war guy constantly ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0001b41b1c6bb37e | make real suggestions improvement wondered sec... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0001d958c54c6e35 | sir hero chance remember page | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 00025465d4725e87 | congratulations well use tools well talk | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 0002bcb3da6cb337 | cocksucker piss around work | 1 | 1 | 1 | 0 | 1 | 0 |
| 7 | 00031b1e95af7921 | vandalism matt shirvington article reverted pl... | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 00037261f536c51d | sorry word nonsense offensive anyway intending... | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 00040093b2687caa | alignment subject contrary dulithgow | 0 | 0 | 0 | 0 | 0 | 0 |


In [68]:
# Apply preprocessing to test data
df_test['comment_text'] = df_test['comment_text'].apply(preprocess_text)
df_test.head(10)

| | id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 0001ea8717f6de06 | thank understanding think highly would revert ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 000247e83dcc1211 | dear god site horrible | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0002f87b16116a7f | somebody invariably try add religion really me... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0003e1cccfd5a40a | says right type type institution needed case t... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 00059ace3e3e9a53 | adding new product list make sure relevant add... | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | 000663aff0fffc80 | one 1897 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 000689dd34e20979 | reason banning throwing article needs section ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 000844b52dee5f3f | blocked editing wikipedia | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | 00091c35fa9d0465 | arabs committing genocide iraq protests europe... | 1 | 0 | 0 | 0 | 0 | 0 |
| 9 | 000968ce11f5ee34 | please stop continue vandalize wikipedia homos... | 0 | 0 | 0 | 0 | 0 | 0 |


### 2.2 Requirements

This section performs a detailed analysis of the dataset provided by the competition, covering:

- The number of sentences and tokens per class, including whether the dataset is unbalanced.
- The most common words for each class, to understand which terms are used most at each level of toxicity.
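
One way to approach both points is sketched below on a toy DataFrame. The comments and labels are made up, the names toy, toy_labels and summary are hypothetical, and whole comments are counted as the unit because the preprocessed text no longer carries sentence boundaries. On the real data, df and class_columns would take the place of the toy inputs.

```python
from collections import Counter

import pandas as pd

# Toy data with made-up, already-preprocessed comments.
toy = pd.DataFrame({
    'comment_text': ['thanks helpful edit', 'stupid idiot page', 'stupid edit revert', 'nice page'],
    'toxic': [0, 1, 1, 0],
    'insult': [0, 1, 0, 0],
})
toy_labels = ['toxic', 'insult']

# Number of whitespace-separated tokens in each comment.
toy['n_tokens'] = toy['comment_text'].str.split().str.len()

summary = {}
for label in toy_labels:
    # Comments that carry this label.
    subset = toy[toy[label] == 1]
    # Word frequencies across all comments of the class.
    word_counts = Counter(' '.join(subset['comment_text']).split())
    summary[label] = {
        'comments': len(subset),
        'tokens': int(subset['n_tokens'].sum()),
        'top_words': word_counts.most_common(2),
    }

for label, stats in summary.items():
    print(label, stats)
```

Since a comment can carry several labels at once, the per-class counts overlap and do not sum to the size of the dataset.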

## Feature Extraction
    
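### Create a TF-IDF Representation

In [ ]:
# Assumption: the vectoriser is fitted on the preprocessed training comments
vectoriser = TfidfVectorizer()
transformed_output = vectoriser.fit_transform(df['comment_text'])
print(vectoriser.vocabulary_)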
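
To see what the vectoriser produces before running it on the full training set, it can be tried on a few short strings. This is a minimal sketch with a made-up corpus; corpus and tfidf_matrix are illustrative names, and the default TfidfVectorizer settings are assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already-preprocessed comments standing in for the training text.
corpus = [
    'thanks helpful edit',
    'stupid page',
    'stupid edit revert',
]

vectoriser = TfidfVectorizer()
# Learn the vocabulary and return a sparse matrix of shape (documents, terms).
tfidf_matrix = vectoriser.fit_transform(corpus)

print(tfidf_matrix.shape)
# vocabulary_ maps each learned term to its column index in the matrix.
print(sorted(vectoriser.vocabulary_.items(), key=lambda item: item[1]))
```

On the full dataset the vocabulary dictionary is large, so printing its length or a small slice is usually more practical than printing it whole.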