# NLP Multilabel Classification: Toxic Comment Classification

** **
### DISCLAIMER 
#### THE DATASET FOR THIS COMPETITION CONTAINS TEXT THAT MAY BE CONSIDERED PROFANE, VULGAR AND/OR OFFENSIVE

** **

#### Description and background

Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.

The Conversation AI team, a research initiative founded by Jigsaw and Google (both a part of Alphabet), is working on tools to help improve online conversation. One area of focus is the study of negative online behaviors, like toxic comments (i.e. comments that are rude, disrespectful or otherwise likely to make someone leave a discussion). So far they’ve built a range of publicly available models served through the Perspective API, including toxicity. But the current models still make errors, and they don’t allow users to select which types of toxicity they’re interested in finding (e.g. some platforms may be fine with profanity, but not with other types of toxic content).

In this hands-on activity (adapted from and based on a Kaggle competition), you’re challenged to build a classification model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate. You’ll be using a dataset of comments from Wikipedia’s talk page edits. Improvements to the current model will hopefully help online discussion become more productive and respectful.


#### Labels (classes) 

You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:

- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate

You must create a model which predicts a probability of each type of toxicity for each comment.



#### NLP pipeline

The NLP pipeline can be represented as in the image below:

![image.png](attachment:9e63a49b-cea6-48e4-bcb5-564da27382e2.png)

image source: [Natural Language Processing Pipeline](https://towardsdatascience.com/natural-language-processing-pipeline-93df02ecd03f)

### Import the libraries 

In [None]:
# Installing the required libraries (in case they are not found in the system) 

# get the latest version of matplotlib 
!pip install --upgrade matplotlib
!pip install venn
!pip install contractions
!pip install scikit-multilearn
!pip install spacy
# wordcloud is imported further below for the word-cloud plots
!pip install wordcloud

# !python -m spacy download en_core_web_lg 
# The following is used when the previous command does not successfully import en_core_web_lg 
import spacy.cli
spacy.cli.download("en_core_web_lg")

In [None]:
import pandas as pd
import numpy as np
import string
import re # RegEx library

# Plotting libraries
import venn
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

# NLP visualization techniques
from wordcloud import WordCloud, STOPWORDS 

# NLTK and text pre-processing libraries (stopwords, stemming, lemmatizers, etc.)
import nltk
import contractions
from nltk.tree import Tree
from nltk.corpus import stopwords
from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer

# Classifiers, vectorizers, train/test split, metrics and TSNE
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, average_precision_score, recall_score, hamming_loss

# Extras 
from scipy import sparse as sp_sparse
from itertools import combinations 
from tqdm.notebook import tqdm  # to create progress bars

sns.set() # set the theme to seaborn 

In [None]:
# Supporting/essential downloads for the NLTK library
# to handle chunking/stemming/stopwords

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
nltk.download('stopwords')
nltk.download('wordnet')


### Load the data

#### Data Overview

For this study, we are using Kaggle data from the Toxic Comment Classification Challenge (Source: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data). The dataset comes from a corpus of Wikipedia talk-page comments that were rated by human raters for toxicity. The corpus contains 63M comments from discussions relating to user pages and articles dating from 2004-2015. Different platforms/sites can have different standards for their toxic screening process. Hence the comments are tagged in the following six categories:

- toxic
- severe_toxic
- obscene
- threat
- insult
- identity_hate

NOTE: The tagging was done via crowdsourcing, which means that the dataset was rated by different people and the labels may not be 100% accurate. The same concern is discussed [here](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/discussion/46131). The [source paper](https://arxiv.org/pdf/1610.08914) also contains more details about how the dataset was created.

#### Data Inspection

Let's load and inspect the data. This is a **multilabel (multioutput) classification problem** where each comment can be assigned any number of toxicity types. The data is provided in the following files: 
- train.csv - the training set, contains comments with their binary labels
- test.csv - the test set, you must predict the toxicity probabilities for these comments. To deter hand labeling, the test set contains some comments which are not included in scoring.
- sample_submission.csv - a sample submission file in the correct format
- test_labels.csv - labels for the test data; value of -1 indicates it was not used for scoring; (Note: file added after competition close!)
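
As a rough sketch of how the `-1` convention in `test_labels.csv` could be handled, the snippet below builds a tiny, made-up stand-in for that file (the ids and values are invented for illustration) and keeps only the rows that were used for scoring. It assumes that an unscored row carries `-1` in its label columns, as described above; the real file may need additional checks.

```python
import io
import pandas as pd

# Hypothetical miniature version of test_labels.csv (values invented for illustration)
csv_text = """id,toxic,severe_toxic,obscene,threat,insult,identity_hate
a1,-1,-1,-1,-1,-1,-1
b2,0,0,0,0,0,0
c3,1,0,1,0,1,0
"""
toy_test_labels = pd.read_csv(io.StringIO(csv_text))

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Keep only the rows in which no label equals -1, i.e. the rows used for scoring
scored = toy_test_labels[(toy_test_labels[label_cols] != -1).all(axis=1)]
print(scored)
```

The same boolean mask could be reused to select the matching rows of the test comments before evaluating a model on them.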


In [None]:
# Load the 'train.csv' data into a new variable named 'train_df'
# Print the dimensionality of the train_df data and preview the first few rows 
# The train_df data contain a row per comment, with an id, the text of the comment, and 6 different labels that we'll try to predict.



As observed, each comment in the training data can carry zero, one or several of the six labels: toxic, severe toxic, obscene, threat, insult and identity hate. This is essentially a multi-label classification problem.
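
To make the multi-label structure concrete, here is a minimal sketch on a made-up four-row label table (the values are invented and are not taken from `train.csv`). It counts how many labels each comment carries and what share of comments carry none.

```python
import pandas as pd

# Hypothetical label table: one row per comment, one 0/1 column per label
toy = pd.DataFrame({
    "toxic":         [0, 1, 1, 0],
    "severe_toxic":  [0, 0, 1, 0],
    "obscene":       [0, 1, 1, 0],
    "threat":        [0, 0, 0, 0],
    "insult":        [0, 0, 1, 0],
    "identity_hate": [0, 0, 0, 0],
})

# Number of positive labels per comment (row-wise sum of the 0/1 columns)
labels_per_comment = toy.sum(axis=1)

# Share of comments that carry no label at all
share_without_label = (labels_per_comment == 0).mean()

print(labels_per_comment.tolist())
print(share_without_label)
```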


In [None]:
# Retrieve 5 random samples from train_df. Can you remember which Python function to use? 
# Re-run the code to retrieve new random samples and observe the class values



In [None]:
# Create a new dataframe named 'unlabelled_data' that contains all the samples of train_df that carry no label: 
# this means that all classes (toxic, severe_toxic, obscene, threat, insult and identity_hate) should be 0 
# (or, alternatively, that none of them should be 1). Use filtering on the various columns (multiple solutions available). 
# Calculate and print the percentage of unlabelled data contained in the original train_df 



In [None]:
# Similarly, load the 'test.csv' data into a new variable named 'test_df'
# Print the dimensionality of the test_df data and preview the first few rows 
# The test_df data contain a row per comment, with an id and the text of the comment (but not any classes, which we will try to predict)



As you can see, no labels are available for test_df, as this is the dataset whose labels we would like to predict once we have trained and tuned our model. 

In [None]:
# Even though no duplicates are present in this dataset, it is always a good practice to check and drop any duplicates in your input data
# Check for duplicates in train_df. Can you find the solution? 



In [None]:
# Check for duplicates in test_df. Can you find the solution? 



In [None]:
## Check for missing values in the train_df. Multiple solutions available. 



In [None]:
## Check for missing values in the test_df. Multiple solutions available. 



In [None]:
# Get the information of train_df set. Check the dtypes of each column



In [None]:
# Get the information of test_df set. Check the dtypes of each column



In [None]:
# Drop the column 'id' from your train_df either in place or with assignment. 
# Preview once more the first few rows of your train_df. Did the changes go through? 



In [None]:
# Drop the column 'id' from your test_df either in place or with assignment. 
# Preview once more the first few rows of your test_df. Did the changes go through? 




### Named-Entity Recognition

NER (Named-entity recognition) is the process of tagging named entities mentioned in unstructured text with pre-defined categories such as person names, organizations, locations, time expressions, quantities, etc. 

- Note: training a NER model is time-consuming because it requires a large, richly annotated dataset. Luckily, pre-trained models are available. One of the best-known open-source NER tools is SpaCy. It provides different NLP models that are able to recognize several categories of entities.

Let's try using the SpaCy model `en_core_web_lg` (the large model for English trained on web data) on a subset of our data. Other SpaCy models for English (such as the small or medium ones) can be found at https://spacy.io/models/en. 

In [None]:
# Import the library spacy 
# Load the large English model (en_core_web_lg) from spacy to perform NER and store it in a new variable 'ner'



In [None]:
# Get a random text: e.g. use .iloc[15] on the train_df and retrieve the 'comment_text' column. Store in a new variable 'random_txt'. 
# Pass random_txt through the ner object created above and store in a new variable 'doc'




In [None]:
# Use SpaCy's displacy render() function to display the NER result using style="ent" (for entities) 



In [None]:
# Can you repeat the process for sample 1000 ? 




In [None]:
# Instead of running the samples one-by-one, you can tag text from a dataframe and extract the tags into a list (and store it in your dataframe) 
# For speed, let's use a subset of our train_df (the first 20 samples) and pass it through the NER model to get back the entities for each sample 

tags = train_df["comment_text"].iloc[0:20].apply(lambda x: [(tag.text, tag.label_) for tag in ner(x).ents] )
tags

### EDA 

In [None]:
labels = ['obscene','insult','toxic','severe_toxic','identity_hate','threat']

In [None]:
# Create a new dataframe (used for plotting purposes) named 'data' that contains only the columns from train_df with the (class) labels 
# that were defined in the previous cell. (Note: 'data' should only contain the label columns and NOT the text!!) 
# Sanity check: preview the data 



#### Wordclouds

In [None]:
all_comments =  pd.Series(train_df["comment_text"]).str.cat(sep=' ')

In [None]:
# Generate and display the word cloud (most common words) across all comments in our text. 
# Use the 'all_comments' created above. Pass all_comments to the .generate() function from WordCloud 
# Arguments to consider for WordCloud are max_words (set to 200), background_color, width (Set to 1500), 
# height (set to 800), max_font_size (set to 500), collocations (set to False). You can also optionally set the plt's figsize to 15 x 8




### Top toxic words per label

Disclaimer: The dataset for this case-study contains text that may be considered profane, vulgar, or offensive.

In [None]:
# Plot one WordCloud (most frequent words) for each class (where the class is equal to 1) 

plt.figure(figsize=(20,10))
count=1

for col in train_df[labels].columns:
    toxic_class_1 = train_df[train_df[col]==1]['comment_text'].str.lower().values
    wordcloud = WordCloud(width=2000, height=2000, background_color ='black', margin=1, stopwords = STOPWORDS,
                          ).generate(" ".join(toxic_class_1))

    plt.subplot(2,3,count)
    plt.axis("off")
    plt.title("Word-cloud for "+col+"_class-1",fontsize=15)
    plt.tight_layout(pad=3)
    plt.imshow(wordcloud,interpolation='bilinear')
    count=count+1
    
plt.show()

In [None]:
toxic_wordclouds = []

for i in range(len(labels)):
    toxic_comments_i = train_df.loc[train_df[labels[i]] == 1, :]["comment_text"] # comments where this label is positive
    toxic_text_i = pd.Series(toxic_comments_i).str.cat(sep=' ')
        
    toxic_wordcloud_i = WordCloud(max_words=200)
    # Generate the word cloud
    toxic_wordcloud_i.generate(toxic_text_i)
    toxic_wordclouds.append(toxic_wordcloud_i.words_)

toxic_wordclouds_df = pd.DataFrame(toxic_wordclouds).T.round(2).fillna(0)
toxic_wordclouds_df.columns = labels
toxic_wordclouds_df

In [None]:
# For better visibility, we will select only 3 labels:

selected_labels = ['toxic','threat','identity_hate']
toxic_wordclouds_selected = toxic_wordclouds_df[selected_labels]

fig, ax = plt.subplots(1, 3, figsize=(20,8), sharex=False)
plt.subplots_adjust(wspace=0.8,hspace=0.8)
fig.suptitle('Top toxic words frequency', fontsize=20, weight = 'bold')

axes=ax.ravel()
for i in range(3):
    label = selected_labels[i]
    top_words_i = pd.DataFrame(toxic_wordclouds_selected[label].sort_values(ascending=False)[:10])
    sns.heatmap(top_words_i, fmt='.01f', annot=True,cmap="Blues",ax=axes[i],annot_kws={"size": 12})

    axes[i].title.set_text(selected_labels[i])
    axes[i].title.set_size(26)
    axes[i].tick_params(axis='y', labelsize=14)

### Per class & class combination analysis

In [None]:
# Can you get the number of samples per each class label in your data dataframe? 
# Hint: use the function .sum() (this will perform a sum on the 1s found within each column). 
# Store in a new variable named 'label_count'



In [None]:
# Plot the counts for each class label you found above using the label_count that you defined in the previous step 



What do you observe? 

- The graph above shows that there is an imbalance between the 6 categories.
- Comments with the threat label are the least common. 


In [None]:
# Code to draw bar graph for visualising distribution of classes within each label.

barWidth = 0.25

bars1 = [sum(train_df['toxic'] == 1), sum(train_df['obscene'] == 1), sum(train_df['insult'] == 1), sum(train_df['severe_toxic'] == 1),
         sum(train_df['identity_hate'] == 1), sum(train_df['threat'] == 1)]
bars2 = [sum(train_df['toxic'] == 0), sum(train_df['obscene'] == 0), sum(train_df['insult'] == 0), sum(train_df['severe_toxic'] == 0),
         sum(train_df['identity_hate'] == 0), sum(train_df['threat'] == 0)]

r1 = np.arange(len(bars1))
r2 = [x + barWidth for x in r1]

plt.bar(r1, bars1, color='steelblue', width=barWidth,  label='labeled = 1')
plt.bar(r2, bars2, color='lightsteelblue', width=barWidth, label='labeled = 0')

plt.xlabel('group', fontweight='bold')
# Centre each tick between the two bars of its group
plt.xticks([r + barWidth / 2 for r in range(len(bars1))], ['Toxic', 'Obscene', 'Insult', 'Severe Toxic', 'Identity Hate', 'Threat'])
plt.legend()
plt.show()

- The plot above shows, for each label separately, the number of comments in class 1 (the comment carries that label) and in class 0 (it does not).
- It shows that the dataset is highly imbalanced: for every label, class 0 far outnumbers class 1.

#### Venn diagrams

In [None]:
no_of_labels= np.arange(2,6)
rows_col=[(5,3),(5,4),(5,3),(2,3)]

for i,rc in zip(no_of_labels,rows_col):
    comb = combinations(data.columns.values, i)
    fig, top_axs = plt.subplots(ncols=rc[1], nrows=rc[0],figsize=(20, 20))
    fig.suptitle("Venn diagram - considering "+str(i)+" Target Labels",fontsize=24)
    fig.subplots_adjust(top=0.88)
    fig.tight_layout()
    top_axs=top_axs.flatten()
    for j,ax in zip(list(comb),top_axs):
        data_set=dict()
        for k in j:
            data_set[k]=set(train_df[(train_df[k]==1)].index)
        venn_dgrm=venn.venn(data_set,legend_loc="best",alpha=0.4,fontsize=10,ax=ax)

What do you notice? 

- All severe_toxic comments are also toxic.
- In most of the combinations, the toxic, insult and obscene classes dominate.
- In the higher-order combinations, some of the intersections are empty.
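
Observations like these can also be checked directly with set operations, in the same way the Venn code builds one set of row indices per label. The following is only a sketch on invented indices (the `positives` dictionary is hypothetical and not derived from the real data):

```python
# Hypothetical row indices of the positive comments per label (invented for illustration)
positives = {
    "toxic": {0, 1, 2, 5, 7},
    "severe_toxic": {1, 5},
    "obscene": {1, 2, 8},
}

# severe_toxic is contained in toxic if every severe_toxic index is also a toxic index
severe_within_toxic = positives["severe_toxic"] <= positives["toxic"]

# A pairwise intersection corresponds to the overlapping region of two circles
toxic_and_obscene = positives["toxic"] & positives["obscene"]

print(severe_within_toxic, sorted(toxic_and_obscene))
```

On the real data, the sets would be built as in the cell above, e.g. from `train_df[train_df[col] == 1].index`.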

In [None]:
# For 6 sets, venn.venn() draws the Venn diagram with triangles by default
# pseudovenn draws circles by considering only a few of the intersections

fig, ax = plt.subplots(figsize=(8,8))
dataset_dict = {
    col: set(data[(data[col]==1)].index)
    for col in data.columns
}
ax.set_title("Venn Diagram on Positive Target Labels",fontsize=20)
fig.tight_layout()
venn_dgrm=venn.pseudovenn(dataset_dict, hint_hidden=False, ax=ax, legend_loc="best",alpha=0.4,fontsize=11)

#### Correlation matrix

In [None]:
# Calculate and plot a heatmap of the correlation coefficients from 'data'
# Optional: use linewidths=0.1, vmax=1.0, square=True, cmap=plt.cm.plasma, linecolor='white',annot=True



Indeed, it looks like some of the labels are correlated, e.g. insult-obscene has the highest value at 0.74, followed by toxic-obscene and toxic-insult.
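
For intuition on what such a coefficient measures for 0/1 columns, here is a small sketch on an invented label table (not the real data). For binary columns the Pearson correlation returned by `DataFrame.corr()` coincides with the phi coefficient: it is positive when two labels tend to occur together and negative when they tend to exclude each other.

```python
import pandas as pd

# Hypothetical 0/1 label columns (values invented for illustration)
toy_labels = pd.DataFrame({
    "toxic":   [1, 1, 0, 0, 1, 0],
    "obscene": [1, 1, 0, 0, 0, 0],
    "threat":  [0, 0, 0, 1, 0, 0],
})

# Pairwise Pearson correlation between the label columns
corr = toy_labels.corr()
print(corr.round(2))
```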

What about the character length & distribution of the comment text in the data?

In [None]:
# We can conduct some feature generation / engineering and evaluate once more the correlation coefficients 

corr_df=train_df.drop(columns=["comment_text"])
corr_df['length']=train_df['comment_text'].str.len()
# Split on the newline character "\n" to count the lines of each comment
corr_df['no_of_sentences']=train_df['comment_text'].str.split("\n").apply(len)
corr_df['new_line'] = train_df['comment_text'].str.count('\n')
# str.count() takes a regular expression, so special characters are escaped inside raw strings
corr_df['question_mark'] = train_df['comment_text'].str.count(r'\?')
corr_df['exclamation_mark'] = train_df['comment_text'].str.count('!')
corr_df['at_the_rate_mark'] = train_df['comment_text'].str.count('@')
corr_df['hash'] = train_df['comment_text'].str.count('#')
corr_df['ampercent'] = train_df['comment_text'].str.count('&')
corr_df['star']= train_df['comment_text'].str.count(r'\*')
corr_df['dot'] = train_df['comment_text'].str.count(r'\.')
corr_df['uppercase_words'] = train_df['comment_text'].str.split().apply(lambda x: sum(map(str.isupper, x)))

In [None]:
# Repeat the process of calculating the correlation coefficients on the corr_df this time and plot in a heatmap 



In [None]:
# Can you calculate the character length for each 'comment_text' in the train_df data? 
# Store in a new column 'char_length' within train_df



In [None]:
# Create a histogram plot using the 'char_length' column of train_df



In [None]:
# Similarly, do the same for the test_df



#### Extra/bonus activities: can you get and/or plot the character length for each category (label)?

In [None]:
# Boxplots 

class_0=[]
class_1=[]

plt.figure(figsize=(20,10))
count=1

for col in train_df[labels].columns:
    toxic_class_0 = train_df[train_df[col]==0]['comment_text'].str.split().apply(len)
    toxic_class_0_count = toxic_class_0.values
    class_0.append(toxic_class_0_count)
    
    toxic_class_1 = train_df[train_df[col]==1]['comment_text'].str.split().apply(len)
    toxic_class_1_count = toxic_class_1.values
    class_1.append(toxic_class_1_count)

    plt.subplot(2,3,count)
    plt.boxplot([class_0[count-1], class_1[count-1]])
    plt.title('Box plot of no:of words in comments_text by '+str(col)+' class')
    plt.xticks([1,2],(str(col)+'_class_0',str(col)+'_class_1'))
    plt.ylabel('No:of words')
    plt.tight_layout()
    plt.grid()
    count=count+1
    
plt.show()

- The box heights (number of words in comment_text) are similar for class 0 and class 1 for every target label except identity_hate and toxic.
- For every target label, the class-0 plot looks thicker than the corresponding class-1 plot, because its many more points overlap.
- The 50th and 75th percentiles of class 0 and class 1 overlap heavily for most target labels, so the number of words in comment_text may not be a good feature in this case.

Let us plot the distribution of the number of words per class for each target label.

In [None]:
# Density (KDE) plots

plt.figure(figsize=(20, 10))
count = 1

for col in train_df[labels].columns:
    plt.subplot(2, 3, count)
    sns.kdeplot(class_0[count-1], label=str(col)+'_class_0', fill=True, common_norm=False, alpha=0.5)
    sns.kdeplot(class_1[count-1], label=str(col)+'_class_1', fill=True, common_norm=False, alpha=0.5)
    plt.title('PDF of number of words in comment_text for '+str(col)+' per each class')
    plt.xlabel('Number of words in comments')
    plt.legend()
    plt.grid()
    count += 1

plt.tight_layout()
plt.show()

- The distributions of the number of words in comment_text overlap between the two classes for all target labels.
- There is therefore no clear separation between class 0 and class 1 for any of the target labels.
- Class 0 dominates in most of the distribution plots.

Overall, the number of words in comment_text does not appear to play a major role in classification, since the density plots for toxic and non-toxic comments overlap for every label.

### Text pre-processing: clean up and normalize the comment text


In this section, let's pre-process the data before passing it to the vectorizer (or embeddings) and the model. Since our input is text, let's investigate it and retain only the useful information.
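
As a flavour of what such cleaning steps can look like, the sketch below applies three regular-expression substitutions to one invented sample string. The patterns are only one possible choice and are assumptions for illustration; the functions you write in the exercises below may (and probably should) differ.

```python
import re

# Invented sample comment containing a link, an HTML tag and irregular spacing
sample = "Check   this out: https://example.com/page   <b>NOW</b>!!!"

# Drop anything that looks like a URL
no_links = re.sub(r"https?://\S+|www\.\S+", " ", sample)

# Drop HTML tags (this simple pattern keeps the text between the tags)
no_tags = re.sub(r"<[^>]+>", " ", no_links)

# Collapse runs of whitespace into a single space
collapsed = re.sub(r"\s+", " ", no_tags).strip()

print(collapsed)
```

Note that the order of the steps matters: removing links before punctuation, for instance, keeps the URL pattern intact long enough to be matched.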

In [None]:
# Feel free to add any words of your choice in the following custom stopwords. 
# These custom stopwords can be used for filtering directly or they can be used to extend the list of stopwords provided by one or more libraries. 
# In this activity, we'll do the latter. 

custom_stopwords = ["mr","mrs","miss", "one","two","three","four","five","six","seven","eight","nine","ten",
                    "us","also","dont","cant","any","can","along","among","during","anyone",
                    "a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","s","t","u","v","w","x","y","z",
                    "hi","hello","hey","ok","okay","lol","rofl","hola","let","may","etc"]

In [None]:
# Creating your own custom stopwords (with a combination of two sources in this case plus some additional words)  
# 1. Get the English stopwords from nltk (they have been imported in the libraries section). Assign them to a new variable 'nltk_stop_words'
# 2. Use the .union() function between the STOPWORDS from the WordCloud library (they have also been imported in the libraries section) 
# and the nltk_stop_words. Store into a new variable 'final_stop_words'
# 3. Convert the final_stop_words into a list format using the function list()
# 4. Use the .extend() function on final_stop_words to extend final_stop_words with custom_stopwords, which was defined in the previous cell




In [None]:
# Instantiate the WordNetLemmatizer() and save it in a new variable named 'lemmatiser' 



In [None]:
# Define the following functions that will clean and pre-process a raw input text. 
# If unsure on the solution, return text in its raw format in every function


def convert_to_lower_case(text):
    """function to convert the input text to lower case"""

    # Fill in your solution here # 
    return # Fill in your solution here # 

def remove_escape_char(text):
    r"""function to remove newline (\n),
    tab (\t) and slashes (/ , \) from the input text"""

    # Fill in your solution here # 
    
    return # Fill in your solution here # 
    
    
def remove_html_tags(text):
    """function to remove html tags (< >) and its content 
    from the input text"""

    # Fill in your solution here # 
    
    return # Fill in your solution here # 
    
    
def remove_links(text):
    """function to remove any kind of links with no html tags"""

    # Fill in your solution here # 

    return # Fill in your solution here # 


def remove_digits(text):
    """function to remove digits from the input text"""

    # Fill in your solution here # 
    
    return # Fill in your solution here # 


def remove_punctuation(text):
    """function to remove punctuation marks from the input text"""

    # Fill in your solution here # 

    return # Fill in your solution here #       


def remove_extra_spaces_if_any(text):
    """function to remove extra spaces if any after all the pre-processing"""

    # Fill in your solution here # 
    
    return # Fill in your solution here # 


def remove_repeated_characters(text):
    """function to remove repeated characters if any from the input text

    for example CAAAAASSSSSSEEEEE SSSSTTTTTUUUUUUDDDDYYYYYY gives CASE STUDY"""

    # On close observation of the toxic comments, characters are sometimes
    # repeated inside bad words. For example, the word "shit" is written as
    # SSSSHHHHHHHHIIIIIIIIIIITTTTTT although the base word is "shit".
    # This step is added in order to increase model performance.

    # Fill in your solution here # 
    
    return # Fill in your solution here # 


def remove_words_lesth2(text):
    """function to remove words with length less than 2"""

    # Fill in your solution here # 
    
    return # Fill in your solution here # 


def decontraction(text):
    """function to handle contractions"""
    
    # Fill in your solution here # 
    
    return # Fill in your solution here # 


def remove_special_characters(text):
    """
        Remove special characters, including symbols, emojis, and other graphic characters
    """
    
    # Fill in your solution here # 
    
    return # Fill in your solution here # 


# Build the all-in-one pre-processing function

def preprocess(text):

    preprocessed_text = []

    for each_text in tqdm(text):

        result=remove_links(each_text)
        result=remove_html_tags(result)
        result=remove_escape_char(result)        
        result=remove_digits(result)
        result=decontraction(result)
        result=remove_punctuation(result)
        result=convert_to_lower_case(result)
        result = ' '.join(non_stop_word for non_stop_word in result.split() if non_stop_word not in final_stop_words)
        result=remove_extra_spaces_if_any(result)
        result=remove_repeated_characters(result)
        result=remove_special_characters(result)
        result=remove_words_lesth2(result)
        result=' '.join(lemmatiser.lemmatize(word,pos="v") for word in result.split())
        preprocessed_text.append(result.strip())
        
    return preprocessed_text

In [None]:
# Sample check - if you haven't found the solution yet, feel free to comment out this cell 

remove_repeated_characters("CAAAAASSSSSSEEEEE SSSSTTTTTUUUUUUDDDDYYYYYY")

In [None]:
# Performing the pre-processing on all the comments in the data-set: Use the preprocess() function on the train_df['comment_text'].values
# and store into a new column 'clean_text' within train_df 




In [None]:
# Optional: create a backup of your pre-processed data if needed (especially for your group project, this is a step worth considering)

# train_df.to_csv("./pre_processed_data.csv", header=True, index=False)


In [None]:
# Print the 'comment_text' column of train_df for sample 143 


In [None]:
# Print the 'clean_text' column of train_df for sample 143. Did any changes go through?  



In [None]:
# Print the 'comment_text' column of train_df for sample 189 



In [None]:
# Print the 'clean_text' column of train_df for sample 189. Did any changes go through?



### Split data into train and test

Now that the comments are cleaned, we can proceed to convert the text into a vector representation. One more point needs care: data leakage. To build a robust model, the test data must not influence anything that is learned during training, so we split the data before fitting the vectorizer.
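
A minimal sketch of what avoiding leakage means for the vectorization step, using invented toy sentences (the variable names `train_docs`, `test_docs` and `vect` are hypothetical): the vectorizer is fitted on the training texts only, and the test texts are merely transformed with the vocabulary learned from them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy documents standing in for the cleaned comments
train_docs = ["you are great", "you are rude", "great article"]
test_docs = ["rude and unseen words"]

vect = TfidfVectorizer()

# Vocabulary and IDF weights are learned from the training texts only
X_tr = vect.fit_transform(train_docs)

# The test texts are only projected onto that vocabulary; unseen words are ignored
X_te = vect.transform(test_docs)

print(sorted(vect.vocabulary_))
print(X_tr.shape, X_te.shape)
```

Calling `fit_transform()` on the full dataset before splitting would let words and document frequencies from the test set shape the features, which is exactly the kind of leakage to avoid.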


In [None]:
# Create a new variable X with the VALUES from the 'clean_text' column of train_df (use .values on the column to convert to array) 
# Create a class vector y with the VALUES using ONLY the labels columns from train_df (The columns that contain the classes) 
# (once more, use .values to convert to array on the columns that contain the labels in train_df) 



In [None]:
# Use the train_test_split function from sklearn to split the data into X_train, X_test, y_train, y_test 
# Use a 70/30 split (30% in the test set), set the random_state to 42 and shuffle equal to True
# Sanity check: print the dimensionality of the train/test data




In [None]:
# *****************************************************************************************************
# Note: It is good practice to split the dataset into train & test sets with stratification. 
# However, the stratify option of sklearn's `train_test_split()` is NOT designed for multi-label targets. 
# We could use the "scikit-multilearn" library instead (pip install scikit-multilearn), which provides the 
# iterative_train_test_split() function to split a multi-label dataset into train & test sets 
# with stratification, as follows. 
# In our example, we will keep things simple and proceed with the plain train_test_split() from sklearn.
# *****************************************************************************************************

# from skmultilearn.model_selection import iterative_train_test_split

# X_train, y_train, X_test,y_test = iterative_train_test_split(X.reshape(-1,1), y, test_size = 0.2)

### Feature extraction: transform text to a vector
For machine learning models, textual data must be converted to numeric data. This can be done in various ways, such as BoW, tf-idf, word embeddings, etc. For simplicity, this activity focuses on tf-idf.

### TF-IDF

TF-IDF is easy to compute, but its disadvantage is that it does not capture word position in the text, semantics, co-occurrences across documents, etc.
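
To show what is being computed, here is a hand-rolled sketch on three invented, already tokenized documents. It assumes the smoothed IDF variant `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which to the best of our knowledge corresponds to the default settings of scikit-learn's `TfidfVectorizer`; other TF-IDF definitions exist and give different numbers.

```python
import math
from collections import Counter

# Invented toy corpus: each document is a list of tokens
docs = [["good", "article"], ["bad", "article"], ["bad", "bad", "edit"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term occur?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized tf-idf weights for one tokenized document."""
    tf = Counter(doc)
    # Raw term count multiplied by the smoothed inverse document frequency
    weights = {t: tf[t] * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in docs:
    print({t: round(w, 3) for t, w in tfidf(doc).items()})
```

In the first document, "good" receives a higher weight than "article" because it occurs in fewer documents. Note also that the order of the tokens plays no role in the result, which illustrates the limitation mentioned above.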

In [None]:
# Instantiate a TfidfVectorizer and store into a new variable 'tfidf_vect'. 
# Try / pass different arguments such as the number of max_features (e.g. set to 5000), ngram_range (allow unigrams and bigrams!),
# stop_words (set to final_stop_words as defined earlier in the activity, alternatively 'english'), max_df=0.9 and min_df=5



In [None]:
# Learn the vocabulary in the training data: fit_transform() on X_train and store in X_train_tfidf
# run in-line X_train_tfidf in order to examine the document-term matrix created from X_train



In [None]:
# Transform the X_test data using the tfidf_vect and store in X_test_tfidf
# run in-line X_test_tfidf in order to examine the document-term matrix created from X_test



In [None]:
# Can you get the first 20 words (tokens) from tfidf_vect using the get_feature_names_out() function?


In [None]:
# tfidf_vect.vocabulary_

In [None]:
# tfidf_reversed_vocab = {i:word for word,i in tfidf_vect.vocabulary_.items()}
# tfidf_reversed_vocab

### Bonus activity

In [None]:
## Optional: Feel free to experiment with and create embeddings from other models (even pre-trained ones) to compare the results



## Multi-Class Vs Multi-Label Classification
#### Multi-Class Classification: “pick only one”
As the name suggests, this is a classification task with more than two classes. The basic assumption in multi-class classification is that every data point in the dataset has one and only one class label among all the class labels available. In probability theory, such events are called “mutually exclusive events”, which means that the probability of two or more of them occurring at the same time is zero.

![image.png](attachment:9b0d9ce3-cfa9-488a-bf33-70d47498d248.png)

#### Multi-Label Classification: “pick all applicable”
Multi-label classification is a generalization of multi-class classification. The basic assumption in multi-label classification is that every data point in the dataset can have none, one or many of the class labels available. In other words, there is no constraint on the number of class labels a data point may belong to.

In multi-label classification, the target labels are not mutually exclusive, and there is often some relation between them.

![image.png](attachment:60669b48-c060-449f-b2ab-732b55486e39.png)
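
The difference between the two settings shows up directly in the shape of the target matrix. The sketch below uses two small invented arrays: in the multi-class case every row contains exactly one 1, while in the multi-label case a row may contain any number of 1s, including none.

```python
import numpy as np

# Multi-class ("pick only one"): one-hot rows, exactly one class per sample
y_multiclass = np.array([[1, 0, 0],
                         [0, 0, 1],
                         [0, 1, 0]])

# Multi-label ("pick all applicable"): any number of labels per sample
y_multilabel = np.array([[1, 0, 1],
                         [0, 0, 0],
                         [1, 1, 1]])

# Row sums: always 1 for multi-class, anything from 0 to n_labels for multi-label
print(y_multiclass.sum(axis=1))
print(y_multilabel.sum(axis=1))
```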

### Solving a multi-label classification problem

We have prepared the dataset using tf-idf, so we can now run a classifier on it. Since this is a multi-label classification problem, we will use a simple `MultiOutputClassifier` wrapped around `LogisticRegression`.


![image.png](attachment:0c41aac9-87e3-41b3-a061-60a4a2add6be.png)

After all the preparation, we can start training our multilabel classifier. In Scikit-Learn, we use the `MultiOutputClassifier` object to train the multilabel model. The strategy behind this model is to train one classifier per label, so each label gets its own independent classifier.

We use `LogisticRegression` as the base estimator in this example, and `MultiOutputClassifier` extends it to all labels.

- _The estimators provided in the MultiOutput class are meta-estimators, meaning that they require a base estimator (e.g. linear/logistic regression, SVM, decision trees) for their construction. These meta-estimators then extend the base estimator to become MultiOutput estimators._
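
To make the "one classifier per label" strategy concrete, here is a minimal sketch on synthetic data. The arrays `X_toy` and `y_toy` are made up for illustration and stand in for the TF-IDF features and the six toxicity labels; this is not the solution on the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in data: 200 samples, 10 features, 6 binary labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 10))
y_toy = (X_toy[:, :6] > 0).astype(int)  # label k is 1 when feature k is positive

toy_clf = MultiOutputClassifier(LogisticRegression(max_iter=1000))
toy_clf.fit(X_toy, y_toy)

# One fitted LogisticRegression per label column.
print(len(toy_clf.estimators_))

# predict returns one binary column per label.
toy_pred = toy_clf.predict(X_toy[:3])
print(toy_pred.shape)

# predict_proba returns a list with one (n_samples, 2) array per label.
toy_probs = toy_clf.predict_proba(X_toy[:3])
print(len(toy_probs), toy_probs[0].shape)
```

Because the per-label classifiers are fitted independently, this approach does not use any correlation between labels.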

In [None]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

You can experiment with different regularization techniques (`L1` and `L2`) and different values of `C`, the inverse of the regularization strength (e.g. `C` equal to 0.1, 1, 10, 100), until you are happy with the result. Note that scikit-learn's default `lbfgs` solver supports only the `L2` penalty; `L1` requires a solver such as `liblinear` or `saga`. Remember that this is the hyperparameter tuning step, which is OMITTED HERE. It can be done with cross-validated grid search, random search, or Bayesian optimization. We are not covering this topic in this article. If you would like to learn more about this, please refer to this post.
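
Although tuning is omitted in this activity, a cross-validated grid search over `C` could look roughly like the sketch below. It runs on made-up data (`X_toy`, `y_toy`), and the choices of `scoring="f1_micro"` and `cv=3` are assumptions made for illustration, not recommendations from the original material. The `estimator__` prefix routes the parameter to the `LogisticRegression` wrapped inside `MultiOutputClassifier`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in data: 300 samples, 10 features, 3 noisy binary labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 10))
y_toy = (X_toy[:, :3] + 0.5 * rng.normal(size=(300, 3)) > 0).astype(int)

search = GridSearchCV(
    MultiOutputClassifier(LogisticRegression(max_iter=1000)),
    param_grid={"estimator__C": [0.1, 1, 10, 100]},  # values suggested in the text
    scoring="f1_micro",  # assumed metric for this sketch
    cv=3,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

On the real TF-IDF matrix this search fits every candidate once per fold and per label, so it can take considerably longer than a single fit.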

In [None]:
# Instantiate a LogisticRegression model named 'lr' with penalty='l2', C = 4, max_iter=10000 
# Instantiate a MultiOutputClassifier() and pass as argument the lr model. Save as 'multilabel_clf'
# Fit the multilabel_clf to X_train_tfidf and y_train



In [None]:
# Use the multilabel_clf to predict X_test_tfidf. Save the results in 'y_test_pred' 



In [None]:
mo_probs = multilabel_clf.predict_proba(X_test_tfidf)

n_classes = y_test.shape[1]
n_test_samples = X_test_tfidf.shape[0]
mo_probs_pos = np.zeros((n_test_samples, n_classes))

for c in range(n_classes):
    c_probs = mo_probs[c]
    mo_probs_pos[:, c] = c_probs[:, 1]

pd.DataFrame(mo_probs_pos, columns=labels)


Lastly, we need to evaluate our multilabel classifier.

## Evaluation

We can use metrics such as the accuracy score, Hamming loss, and F1 score for evaluation.

- Accuracy Score (Exact Match Ratio or Subset Accuracy): In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in `y_true`.
    - It is the strictest metric, indicating the percentage of samples that have all their labels classified correctly.
    - The disadvantage of this measure is that multi-label predictions can be partially correct, but subset accuracy ignores those partial matches and counts them as errors.
    - scikit-learn implements subset accuracy in the `accuracy_score` function.


- Hamming Loss (example-based measure): In the simplest terms, Hamming loss is the fraction of labels that are incorrectly predicted, i.e. the number of wrong labels divided by the total number of labels (samples × labels).
  
- Micro-averaging & Macro-averaging (label-based measures): To evaluate a classifier over several classes or labels we have to average over them somehow. There are two common ways of doing this, called micro-averaging and macro-averaging.
    - In micro-averaging, the individual true positives, false positives, and false negatives of all labels are summed up first, and precision and recall are then computed from these pooled counts.
    - The micro-averaged F1 score is simply the harmonic mean of the micro-averaged precision and the micro-averaged recall.
    - Macro-averaging is straightforward: we compute precision and recall for each label separately and then take their unweighted mean.
    - Macro-averaging treats every label equally, so it shows how the system performs across labels regardless of how frequent each one is; avoid basing a specific decision on this average alone. Micro-averaging, on the other hand, lets frequent labels weigh more, so it can be a useful measure when the labels vary in size.

- F1 score: The F1 score is the harmonic mean of precision and recall, reaching its best value at 1 and its worst at 0. Precision and recall contribute equally to the F1 score. F1 score = 2 * (precision * recall) / (precision + recall)
    - 'F1 score micro': Calculate metrics globally by counting the total true positives, false negatives, and false positives.
    - 'F1 score macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
    - 'F1 score weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.
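
The metrics above can be checked by hand on a tiny example. The sketch below uses made-up arrays (`y_true_toy`, `y_pred_toy`) with 4 samples and 3 labels; `zero_division=0` is passed because the third label is never predicted, which would otherwise leave its precision undefined.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Hypothetical ground truth and predictions: 4 comments, 3 labels.
y_true_toy = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0], [0, 1, 1]])
y_pred_toy = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 0], [0, 1, 0]])

# Subset accuracy: 2 of the 4 rows match exactly -> 0.5
subset_acc = accuracy_score(y_true_toy, y_pred_toy)

# Hamming loss: 2 wrong cells out of 4 * 3 = 12 -> about 0.167
ham = hamming_loss(y_true_toy, y_pred_toy)

# Micro F1: pooled TP=3, FP=0, FN=2 -> precision 1.0, recall 0.6 -> 0.75
f1_micro = f1_score(y_true_toy, y_pred_toy, average="micro", zero_division=0)

# Macro F1: unweighted mean of the per-label F1 scores (1.0, 0.667, 0.0) -> about 0.556
f1_macro = f1_score(y_true_toy, y_pred_toy, average="macro", zero_division=0)

# Weighted F1: same per-label scores weighted by support (2, 2, 1) -> about 0.667
f1_weighted = f1_score(y_true_toy, y_pred_toy, average="weighted", zero_division=0)

print(subset_acc, round(ham, 3), f1_micro, round(f1_macro, 3), round(f1_weighted, 3))
```

Note how the rare third label, which is missed entirely, pulls the macro average down much more than the micro average.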

More on metrics can be found at https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff

In [None]:
# Print the overall accuracy using the accuracy_score() function by passing y_test and y_test_pred 



In [None]:
# Print the Hamming loss using the hamming_loss() function by passing y_test and y_test_pred 



In [None]:
# Print the f1_score using 1) average='macro', 2) average='micro' , 3) average='weighted'
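# Print the average precision (a summary of the precision-recall curve) using average_precision_score() with 1) average='macro', 2) average='micro' , 3) average='weighted'
# Note: average_precision_score() expects scores such as the probabilities in mo_probs_pos rather than hard 0/1 predictions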


In [None]:
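# Caveat: argmax collapses each multi-label row to a single class, and rows with no label map to class 0
confusion_mat = confusion_matrix(np.asarray(y_test).argmax(axis=1), np.asarray(y_test_pred).argmax(axis=1))
confusion_mat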

In [None]:
plt.subplots(figsize=(10,6))
sns.heatmap(confusion_mat, annot=True, fmt='.5g')
plt.xlabel('Predicted')
plt.ylabel('Actual');
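
Taking the `argmax` of each row reduces every comment to a single class, which hides the multi-label structure. As a possible alternative (a swapped-in technique, not part of the original cell), scikit-learn's `multilabel_confusion_matrix` returns one 2×2 matrix per label. The sketch below uses made-up arrays and hypothetical label names.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Hypothetical ground truth and predictions: 4 comments, 3 labels.
toy_labels = ["toxic", "obscene", "insult"]
y_true_toy = np.array([[1, 0, 0], [1, 1, 0], [0, 0, 0], [0, 1, 1]])
y_pred_toy = np.array([[1, 0, 0], [1, 0, 0], [0, 0, 0], [0, 1, 0]])

# Shape (n_labels, 2, 2); each matrix is laid out as [[TN, FP], [FN, TP]].
per_label_cm = multilabel_confusion_matrix(y_true_toy, y_pred_toy)
print(per_label_cm.shape)

for name, cm in zip(toy_labels, per_label_cm):
    tn, fp, fn, tp = cm.ravel()
    print(f"{name}: TN={tn} FP={fp} FN={fn} TP={tp}")
```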

In [None]:
# You can apply the same with any other model but it may take a long time to run. Give it a go with a classifier of your choice



### Extras - Other techniques 



#### Classifier Chains

Just like the MultiOutput models, the chain models also extend a base estimator to become a MultiOutput estimator. But in the case of the chain models, each model in the chain makes its prediction using all of the features provided to the model, i.e. X, plus the results of the models earlier in the chain.

The chain models have a parameter `order` that determines the order of the chain, i.e. the order in which the columns of the target matrix Y are predicted. If it is not specified, the column order of Y is simply followed.

Using our earlier example, we can implement scikit-learn's `sklearn.multioutput.ClassifierChain` model with logistic regression as the base estimator.


- A chain of binary classifiers C0, C1, . . . , Cn is constructed, where a classifier Ci uses the predictions of all the classifiers Cj with j < i. This way the method, also called classifier chains (CC), can take label correlations into account.
- The total number of classifiers needed for this approach is equal to the number of classes, but the training of the classifiers is more involved.
- Following is an illustrated example with a classification problem of three categories {C1, C2, C3} chained in that order.
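
The chaining idea can be sketched on synthetic data with three labels. The arrays `X_toy` and `y_toy` are made up for illustration, and the base estimator is passed positionally; this is a minimal sketch rather than the solution for the toxic comment data. Each estimator in the chain sees the original features plus one extra column per label earlier in the chain.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Synthetic stand-in data: 200 samples, 5 features, 3 binary labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = np.column_stack([
    X_toy[:, 0] > 0,
    X_toy[:, 0] + X_toy[:, 1] > 0,  # correlated with the first label
    X_toy[:, 2] > 0,
]).astype(int)

# order=[0, 1, 2] predicts the columns of y_toy in their original order.
chain = ClassifierChain(LogisticRegression(max_iter=1000), order=[0, 1, 2])
chain.fit(X_toy, y_toy)

# Number of input columns seen by each link: 5 features + earlier labels.
n_inputs = [est.coef_.shape[1] for est in chain.estimators_]
print(n_inputs)

chain_pred = chain.predict(X_toy[:3])
print(chain_pred.shape)
```

Changing `order` changes which labels can inform which, so different orders may give different results on the same data.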

In [None]:
# using classifier chains
from sklearn.multioutput import ClassifierChain
# from skmultilearn.problem_transform import ClassifierChain
from sklearn.linear_model import LogisticRegression

## Fill in your solution here 