<a href="https://colab.research.google.com/github/Advanced-Data-Science-TU-Berlin/Data-Science-Training-Python-Part-2/blob/main/interactive_notebooks/1_1_text_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Classification
Text classification, also known as text tagging or text categorization, is the process of categorizing text into organized groups. Using Natural Language Processing (NLP), text classifiers can automatically analyze text and assign a set of pre-defined tags or categories based on its content.

In this exercise we use the data from the Kaggle competition [“Toxic Comment Classification Challenge”](https://www.kaggle.com/datasets/julian3833/jigsaw-toxic-comment-classification-challenge). In this competition, we’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate. The dataset contains comments from Wikipedia’s talk page edits. So, along with text classification, we will also learn how to implement multi-output (multi-label) classification.

> The dataset for this competition contains text that may be considered profane, vulgar, or offensive. We do not encourage such words and this is only for experiment purposes.

Let's start by loading the data and looking at its structure:



## Downloading datasets from Kaggle

### Check for Kaggle API token

Do you know how to download datasets from Kaggle using the Kaggle API?
Do you have your Kaggle API token set up?

In [6]:
# opendatasets is a Python library for downloading datasets from online sources
# like Kaggle and Google Drive using a simple Python command.
# Install the opendatasets Python library
!pip install opendatasets



In [3]:
import opendatasets as od
od.download("https://www.kaggle.com/datasets/julian3833/jigsaw-toxic-comment-classification-challenge", force=True)

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: mahnaznmz
Your Kaggle Key: ··········
Downloading jigsaw-toxic-comment-classification-challenge.zip to ./jigsaw-toxic-comment-classification-challenge


100%|██████████| 53.4M/53.4M [00:00<00:00, 78.0MB/s]





In [4]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
%matplotlib inline


In [7]:
# Let's read data related to train, test and test labels into Pandas Dataframes
# Hint: Use the 'pd.read_csv()' function to read the data from the file
# The files are stored in /content/jigsaw-toxic-comment-classification-challenge/
train_df = <your code here>
test_df = <your code here>
test_labels = <your code here>

# Display a few lines of the training data
<write your code>

## Target Variables
From the data above, what are the target variables?

In [13]:
# Create a list with the name of target variables:
# target_variables = [<target-variable-names-here>]

Let's take a look at some statistics from the data:

In [None]:
# Display the size of the train and test sets
# Hint: Use the 'len()' function to get the size of the DataFrames
print("Train Size:\t", <your code here>)
print("Test Size:\t", <your code here>)

In [None]:
display(train_df)
# Display the probability distribution of target variables in the training set
# Note: apply pd.Series.value_counts function
# value_counts args: [normalize, sort, ascending, bins, dropna]
# Consider: normalize=True, sort=True, ascending=False, bins=None, dropna=False
display(
    train_df[<target-variables-here>]
        .apply(<function-here>, args = (<comma-separated-arg-values-here>))
        )

As we can see, there is a class imbalance in almost all of the training target variables.

Note that we also have multiple dependent (target) variables.

## Split the dataset for model evaluation
Let's split train dataset into training and validation sets:

In [None]:
from sklearn.model_selection import train_test_split

# Define the independent variable (the input feature)
# Hint: Use the column 'comment_text' as the independent variable
X = train_df[<independent-column-name>]

# Define the dependent variables (the targets)
# Hint: Use the target value columns
y = train_df[<target-variables-here>]

# Split the dataset into training and validation sets
# Hint: Use 'train_test_split()' with shuffle=True and a fixed random_state (e.g. 42)
X_train, X_val, y_train, y_val = <your-code-here>

## Preprocessing
Preprocessing is a vital step in NLP, as in any other ML task. It helps to get rid of unhelpful parts of the data, or noise, for example by converting all characters to lowercase and by removing punctuation marks, stop words, and typos. In this case, we remove `punctuations` and `numbers` along with `stopwords` such as "in", "the", and "of", because they don't help in determining the classes (whether a sentence is toxic or not).

> In this exercise we are also using NLTK, which is a leading platform for building Python programs to work with human language data. To read more, check [here](https://www.nltk.org/).


Let's create a preprocessing function that we can then pass to our CountVectorizer model.

In [None]:
import nltk
import string
nltk.download('stopwords') # Download the stop-words
from nltk.corpus import stopwords

# Get the English stopwords
en_stopwords = set(stopwords.words('english'))
print("EN Stopwords:", en_stopwords)

# Function for basic cleaning/preprocessing of texts
# Hint: The 'clean' function should take a document ('doc') and a set of stopwords ('stop_words') as parameters.
# Give 'stop_words' a default value (e.g. en_stopwords) so that CountVectorizer can call clean(doc) with one argument.
# It should convert the text to lowercase and remove punctuation marks, numbers, and stopwords
def clean(<set-inputs-here>):
    # Convert the text into lowercase using doc.lower()
    # (done first because the NLTK stopword list is all lowercase)
    doc = <your-code-here>
    # Removal of punctuation marks (.,/\][{} etc) and numbers
    doc = "".join([char for char in doc if char not in string.punctuation and not char.isdigit()])
    # Removal of stopwords
    doc = " ".join([token for token in doc.split() if token not in stop_words])
    return doc

## Bag Of Words
The Bag of Words (BoW) model is a common technique used in natural language processing and text mining to represent text data as numerical features. The basic idea is to convert a collection of text documents into a matrix of word occurrences or frequencies.

Here's a brief explanation of the Bag of Words model:

1) Tokenization:
The first step is to break down each document or sentence into individual words, known as tokens.

2) Vocabulary Building:
 Create a vocabulary, which is a unique set of all the words present in the entire collection of documents. Each word in the vocabulary is assigned a unique index.

3) Vectorization:
Represent each document as a vector, where each element of the vector corresponds to the count or frequency of a word from the vocabulary. The length of the vector is equal to the size of the vocabulary.

4) Sparse Representation:
Since most documents use only a small subset of the entire vocabulary, the resulting vectors are often sparse (mostly filled with zeros), making the representation efficient.

The Bag of Words model ignores the order and structure of words in a document but captures the occurrence or frequency of each word. It's a fundamental method used in various natural language processing tasks, such as text classification, sentiment analysis, and document clustering. However, it doesn't consider the semantic relationships between words. Advanced models like TF-IDF and word embeddings address some limitations of the basic Bag of Words approach.
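The four steps above can be sketched in plain Python on a toy corpus. The three sentences are made up for illustration and the code is only a minimal sketch; in the exercise itself, scikit-learn's CountVectorizer does this work.

```python
# Toy corpus (hypothetical sentences, not taken from the dataset)
docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]

# 1) Tokenization: split each document into lowercase tokens
tokenized = [doc.lower().split() for doc in docs]

# 2) Vocabulary building: unique words, each with its own index
vocab = sorted({token for tokens in tokenized for token in tokens})
word_index = {word: i for i, word in enumerate(vocab)}

# 3) Vectorization: one count vector per document, as long as the vocabulary
vectors = []
for tokens in tokenized:
    vec = [0] * len(vocab)
    for token in tokens:
        vec[word_index[token]] += 1
    vectors.append(vec)

# 4) Sparsity: fraction of entries that are zero
n_zeros = sum(value == 0 for vec in vectors for value in vec)
sparsity = n_zeros / (len(vectors) * len(vocab))

print(vocab)
for vec in vectors:
    print(vec)
print(f"Share of zero entries: {sparsity:.2f}")
```

Even with only seven distinct words, about half of the entries are zero; with a vocabulary of thousands of words the share of zeros becomes much larger.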


Now let's create a bag-of-words model limited to the 5000 most frequent words (including every word would make the matrix even sparser and mostly add noise). We also clean the text while building the bag of words by passing the `clean` function as the preprocessor.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
# Create a CountVectorizer instance
# - 'max_features': Limits the number of features (words) to the top 5000 most frequent ones.
# - 'preprocessor': A function to clean and preprocess the text data before vectorization.
vectorizer = <your-code-here>

# Transform the training data into a document-term matrix (DTM)
# - 'fit_transform': Fits the vectorizer to the training data and transforms it into a DTM.
# Note: use fit_transform on vectorizer and pass the X_train as input
X_train_dtm = <your-code-here>

# Transform the validation data into a document-term matrix (DTM)
# - 'transform': Uses the same vectorizer to transform the validation data into a DTM.
# Note: use transform on vectorizer and pass the X_val as input
X_val_dtm = <your-code-here>

# Display the shape of the resulting document-term matrices
print(X_train_dtm.shape, X_val_dtm.shape)

As we can see, both matrices have one row per comment and the same 5000 columns. Each column holds the number of occurrences of one of the 5000 most frequent words in each comment.
Let's look at the bag-of-words vectors of the first 5 samples:

In [None]:
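# Convert only the first 5 rows to a dense array; get_feature_names() was removed in newer scikit-learn versions
pd.DataFrame(X_train_dtm[:5].toarray(), columns=vectorizer.get_feature_names_out())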



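| | abc | abide | ability | able | abortion | about | absence | absolute | absolutely | absurd | ... | yourselfgo | youth | youtube | youve | ytmndin | yugoslavia | zealand | zero | zionist | zone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |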


As we can see, each entry is the number of times the word occurs in the comment: at least 1 when the word is present and 0 otherwise.

The bag-of-words matrix is very sparse (we can reduce max_features further if required). It will be the input for a machine learning classifier.

## Multi-Output Classification
Since we need to classify each sentence as toxic or not, severe_toxic or not, obscene or not, threat or not, insult or not, and identity_hate or not, we need to classify the sentence against 6 output variables. This is called multi-label classification. It is different from multi-class classification, where a single target variable has more than 2 options (e.g. a sentence can be positive, negative, or neutral).

We will be using MultiOutputClassifier from sklearn, which is a wrapper around a base classifier. This strategy consists of fitting one classifier per target.
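To make the "one classifier per target" idea concrete, here is a minimal sketch on a tiny made-up dataset with two features and two binary targets (the numbers are hypothetical and unrelated to the toxic comment data). It also shows the shape of what `predict_proba` returns for a multi-output model, which matters later when probabilities are extracted per label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Hypothetical toy data: 6 samples, 2 count features, 2 binary targets
X_toy = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [2, 1], [0, 2]])
Y_toy = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [1, 1], [0, 1]])

toy_model = MultiOutputClassifier(LogisticRegression()).fit(X_toy, Y_toy)

# One fitted LogisticRegression per target column
print(len(toy_model.estimators_))

# predict_proba returns a list with one (n_samples, n_classes) array per target
toy_probas = toy_model.predict_proba(X_toy)
print(len(toy_probas), toy_probas[0].shape)

# Keep only the probability of class 1 and arrange it as (n_samples, n_targets)
toy_pos = np.transpose(np.array(toy_probas)[:, :, 1])
print(toy_pos.shape)
```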


Now we are going to use scikit-learn to train two different multi-output classification models:

1) a Naive Bayes model and

2) a Logistic Regression model.

### Naive Bayes Model
Naive Bayes is a family of probabilistic classification algorithms based on Bayes' theorem with the "naive" assumption of independence between features. Despite its simplicity, it's a powerful and efficient method, especially for text classification tasks.

<img src="https://miro.medium.com/v2/resize:fit:600/1*aFhOj7TdBIZir4keHMgHOw.png">
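As a rough illustration of the idea, the sketch below implements a tiny multinomial Naive Bayes classifier by hand, with add-one (Laplace) smoothing. The four training sentences and the helper names `log_score` and `predict` are made up for this example; the exercise itself relies on scikit-learn's MultinomialNB.

```python
import math
from collections import Counter

# Hypothetical toy training data: (text, label)
train = [
    ("you idiot", "toxic"),
    ("stupid idiot", "toxic"),
    ("nice article", "clean"),
    ("you wrote a nice article", "clean"),
]

# Count documents per class and word occurrences per class
doc_counts = Counter()
word_counts = {"toxic": Counter(), "clean": Counter()}
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}

def log_score(text, label):
    """Unnormalized log posterior: log P(label) + sum of log P(word | label)."""
    total_words = sum(word_counts[label].values())
    score = math.log(doc_counts[label] / sum(doc_counts.values()))
    for word in text.split():
        # Add-one smoothing keeps unseen words from giving probability zero
        score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
    return score

def predict(text):
    # Pick the class with the highest score
    return max(word_counts, key=lambda label: log_score(text, label))

print(predict("idiot"))
print(predict("nice article"))
```

The "naive" part is the sum over words: each word contributes independently to the score, regardless of the other words in the sentence.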

### Logistic Regression (logistic/Logit Model)
Logistic Regression, often referred to as the logit model, is a statistical method for binary classification. It models the probability of a certain class or event, such as pass/fail, win/lose, alive/dead, or healthy/sick.

This can be extended to model several classes of events, such as determining whether an image contains a cat, dog, lion, etc. Each class would be assigned a probability between 0 and 1, and the probabilities sum to one.

<img src="https://www.saedsayad.com/images/LogReg_1.png">
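The two cases described above can be sketched with the functions usually associated with them: the sigmoid for a single binary probability and the softmax for several classes whose probabilities sum to one. This is a minimal illustration with made-up scores, not how scikit-learn implements LogisticRegression internally.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Map a list of scores to probabilities that sum to one."""
    largest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Binary case: a score of 0 sits exactly on the decision boundary
print(sigmoid(0.0))

# Multi-class case: hypothetical scores for cat, dog, lion
class_probs = softmax([2.0, 1.0, 0.1])
print(class_probs, sum(class_probs))
```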

Here's a breakdown of the code:

In [None]:
# Import necessary libraries
from sklearn.multioutput import MultiOutputClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

In [None]:
# Naive Bayes Model
# - MultiOutputClassifier: Extends a classifier to handle multi-output problems.
# - MultinomialNB: Naive Bayes classifier suitable for classification with discrete features.
nb_model = MultiOutputClassifier(MultinomialNB())
# Fit the Naive Bayes model on the training data
# Hint: use fit function on nb_model and pass X_train_dtm and y_train as inputs
<your-code-here>

In [None]:
# Logistic Regression Model
# - MultiOutputClassifier: Extends a classifier to handle multi-output problems.
# - LogisticRegression: Logistic Regression classifier.
#   - 'class_weight': 'balanced' weights each class inversely proportional to its frequency, so mistakes on the rare class are penalized more.
#   - 'max_iter': Maximum number of iterations for optimization.
lr_model = MultiOutputClassifier(LogisticRegression(class_weight='balanced', max_iter=3000))
# Fit the Logistic Regression model on the training data
# Hint: use fit function on lr_model and pass X_train_dtm and y_train as inputs
<your-code-here>
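As a rough illustration of what `class_weight='balanced'` does: according to the scikit-learn documentation, each class gets the weight `n_samples / (n_classes * count_of_that_class)`. The sketch below reproduces that formula in plain Python on a made-up label vector with 9 negatives and 1 positive; the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical imbalanced labels: 9 non-toxic (0) and 1 toxic (1)
y_toy = [0] * 9 + [1]

counts = Counter(y_toy)
n_samples = len(y_toy)
n_classes = len(counts)

# Balanced weight per class: n_samples / (n_classes * class count)
balanced_weights = {
    label: n_samples / (n_classes * count) for label, count in counts.items()
}
print(balanced_weights)
```

The rare class ends up with a weight nine times larger than the frequent one, so a mistake on a toxic comment costs the model correspondingly more during training.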

## Measuring Performance
Let's see what metric we should use here to measure the performance of our models.

### ROC - AUC
The ROC (Receiver Operating Characteristic) curve shows how well a model can distinguish between two classes (e.g. whether a comment is toxic or not) across all decision thresholds. Better models separate the two classes more accurately, whereas poor models have difficulty distinguishing between them.

The area under the ROC curve (AUC) summarizes the curve in a single number: 1.0 means perfect separation and 0.5 corresponds to random guessing. ROC-AUC is a valuable metric for evaluating classification models, especially in scenarios where there is an imbalance in the class distribution.

<img src="https://www.statology.org/wp-content/uploads/2021/08/read_roc2.png" width="500">
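One way to read the area under the ROC curve (AUC): it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes the AUC that way in plain Python on made-up labels and scores; scikit-learn's `roc_auc_score` should give the same value on this input.

```python
# Hypothetical true labels and predicted scores for 4 comments
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

pos_scores = [s for s, y in zip(y_score, y_true) if y == 1]
neg_scores = [s for s, y in zip(y_score, y_true) if y == 0]

# Count positive/negative pairs where the positive is ranked higher (ties count half)
wins = sum(
    1.0 if p > n else 0.5 if p == n else 0.0
    for p in pos_scores
    for n in neg_scores
)
pairwise_auc = wins / (len(pos_scores) * len(neg_scores))
print(pairwise_auc)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, hence 0.75. Because only the ranking of the scores matters, the AUC does not depend on any particular decision threshold.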

Let's see what the ROC curves look like for the `toxic` label for both of our trained models:

In [None]:
from sklearn.metrics import roc_curve, auc

# A helper function for plotting ROC curve for a model and a given label
def plot_roc_auc(model, model_name, label_name, ax):
    # Get the index of the label in the columns
    label_id = list(y_val.columns).index(label_name)

    # Get true labels and predicted probabilities for the positive class
    y_vals = y_val[label_name].to_numpy()
    y_pred_proba = model.predict_proba(X_val_dtm)[label_id][:, 1]

    # Calculate the ROC curve
    # Note: use roc_curve function with y_vals and y_pred_proba as input
    fpr, tpr, thresholds = <your-code-here>

    # Calculate AUC
    # Note: use auc function and pass fpr and tpr
    auc_value = <your-code-here>

    # Plot ROC curve
    ax.plot([0, 1], [0, 1], 'k--')  # Diagonal line for reference
    ax.plot(fpr, tpr, label=f'{model_name} (AUC = {auc_value:.3f})')

    # Set plot labels and title
    ax.set_xlabel('False Positive Rate (fpr)')
    ax.set_ylabel('True Positive Rate (tpr)')
    ax.set_title(f'{model_name} ROC curve for `{label_name}` Label')
    ax.legend()

# Create subplots for two models
fig, ax = plt.subplots(1, 2, figsize=(20, 6))

# Plot ROC curve for Naive Bayes model on the 'toxic' label
plot_roc_auc(<nb-model>, 'Naive Bayes', 'toxic', ax[0])

# Plot ROC curve for Logistic Regression model on the 'toxic' label
plot_roc_auc(<lr-model>, 'Logistic Regression', 'toxic', ax[1])


Let's compare the mean ROC-AUC of the two models we have trained as an aggregate measure of their performance. We use the models' predict_proba function instead of predict because roc_auc_score needs probability scores rather than labels obtained by thresholding at 0.5.

Let's write a function for calculating the roc_auc:

In [28]:
from sklearn.metrics import roc_auc_score

# Function for calculating ROC-AUC scores
def calculate_roc_auc(y_test, y_pred):
    aucs = []

    # Calculate the ROC-AUC for each of the target columns
    for col in range(y_test.shape[1]):
        aucs.append(roc_auc_score(y_test[:, col], y_pred[:, col]))

    return aucs

Given the performance metric, let’s run the models on the validation dataset:

In [None]:
from statistics import mean

# Creating an empty list of results
results = []

# Making predictions from all the trained models and measuring performance for each
for model in [nb_model, lr_model]:
    # Extracting the name of the model
    est = type(model.estimator).__name__

    # Actual output variables
    y_vals = y_val.to_numpy()

    # Model Probabilities for class 1 of each of the target variables
    y_preds = np.transpose(np.array(model.predict_proba(X_val_dtm))[:,:,1])

    # Calculate Mean of the ROC-AUC
    # Note: use mean function on calculate_roc_auc between y_vals and y_preds
    mean_auc = <your-code-here>

    # Append the name of the model and the mean_roc_auc into the results list
    results.append([est, mean_auc])

# Output the results as a table
result_df = pd.DataFrame(results, columns=["Model", "Mean AUC"])
result_df

As we can see, both models perform really well, with LR performing slightly better. So, we will use it as the final model for predicting on the test data. These simple models give pretty good results without much hassle or technical know-how, which is why they are still widely used.

## Predicting Target Values for Test Data


In [None]:
# Transform the test dataset using Count Vectorizer
# Note: use transform on vectorizer and pass 'comment_text' column from test_df
X_test_dtm = <your-code-here>

# Use the Logistic Regression model to output probabilities and take the probability for class 1
y_preds = np.transpose(np.array(lr_model.predict_proba(X_test_dtm))[:,:,1])

# Add predicted labels to the test data
test_pred_labels = test_df.assign(**pd.DataFrame(y_preds, columns=target_variables))

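# Calculate average ROC-AUC on the test data
# Note: test_labels uses -1 for comments that were not used for scoring in the competition; here they are mapped to 0
test_auc_mean = mean(calculate_roc_auc(test_labels[target_variables].replace(-1, 0).to_numpy(), y_preds))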
print("Mean AUC on Test:", test_auc_mean)

As we can see, performance on unseen data is still rather good (mean AUC of about 0.90).

## Model Interpretation/ Word Importance
This is the most exciting part. Since we are using a simple Logistic Regression model, we can directly use the coefficient values of the model to understand the predictions it makes. The coefficients tell us which features are important, that is, which words make a sentence toxic. If we used a more complex model, we could go for SHAP or LIME. Also, since we have 6 output variables, we will have 6 sets of feature importances, which will be interesting to compare.

In [None]:
# Create a DataFrame with the feature names (words)
# Note: use get_feature_names_out() function on vectorizer to get the list of words
feat_impts_df = pd.DataFrame({'word': <your-code-here>})

# For each target, save the feature importances as a new column.
# 'estimators_' gives the internal models fitted by the MultiOutputClassifier
for target_name, clf in zip(target_variables, lr_model.estimators_):
    # Note: use coef_ on clf to get the coefficients and then use flatten() to get the list
    feat_impts_df[target_name] = <your-code-here>

# Displaying the DataFrame
feat_impts_df.head()

### Top Words Determining Toxicity
We will look at the top 5 words that, according to the model, most strongly indicate each toxicity type.

In [None]:
# Function to create individual feature importance tables and plots
def plot_top_words(df, category, ax, topn=5):
    top_words = df[["word", category]].sort_values(by=category, ascending=False).head(topn)
    sns.barplot(x=category, y="word", ax=ax, data=top_words)
    ax.set_title(f"Top {topn} Words for {category.capitalize()}")

# Creating subplots for each toxic-type category
# Note: use subplots on plt with 2 rows and 3 columns as we have 6 target values
# use figsize=(15, 5)
fig, axes = <your-code-here>

# Plotting top 5 words for each toxic-type category
for category, ax in zip(target_variables, axes.flatten()):
    # Call plot_top_words function and pass feat_impts_df, category and ax
    <your-code-here>

plt.text(0.5, -0.1, "Note: The following plots may contain words that are not suitable for all audiences. Viewer discretion is advised.",
         horizontalalignment='center', verticalalignment='center', transform=plt.gcf().transFigure,
         bbox=dict(facecolor='red', alpha=0.5))
# Setting a common title for the entire subplot
plt.suptitle("Feature Importance")

# Adjusting layout for better appearance
fig.tight_layout(rect=[0, 0.03, 1, 0.95])

plt.show()

We can see that the model picks out sensible features for each label.

For example, for threats, words like kill, shoot, and destroy are the most important.

for identity hate - words like nigger, nigga, homosexual, faggot

The most important words for toxic are less extreme than the most important words for severe toxic.

## TODOs:
1. Try TF-IDF instead of CountVectorizer.
   TF-IDF tends to perform better than CountVectorizer in some cases.
2. Try ensemble models instead of vanilla ML models.
   Bagging and boosting models often give better results than classic ML techniques.
3. Better text preprocessing.
   Typo correction, lemmatization, etc. can be done to further improve the model.

Useful Links:
- https://medium.com/analytics-vidhya/text-classification-from-bag-of-words-to-bert-1e628a2dd4c9
- https://towardsdatascience.com/text-classification-using-naive-bayes-theory-a-working-example-2ef4b7eb7d5a#:~:text=1.-,Introduction,is%20independent%20of%20each%20other