<a href="https://colab.research.google.com/github/LeonardoGoncRibeiro/05_AppliedMachineLearning/blob/main/02_Multilabel_Classification_NLP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text multilabel classification: Multiple contexts in NLP

In this course, we will learn what multilabel classification is and how it differs from multiclass classification. Then, we will apply multilabel classification to predict the tags of posts on Stack Overflow. We will also see how to evaluate our classification.

*In this course, we will use a different version of scikit-learn. This is necessary to run the MLkNN algorithm:*

In [1]:
!pip uninstall scikit-learn -y
!pip install scikit-learn==0.24.1

Found existing installation: scikit-learn 0.24.1
Uninstalling scikit-learn-0.24.1:
  Successfully uninstalled scikit-learn-0.24.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scikit-learn==0.24.1
  Using cached scikit_learn-0.24.1-cp37-cp37m-manylinux2010_x86_64.whl (22.3 MB)
Installing collected packages: scikit-learn
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
yellowbrick 1.4 requires scikit-learn>=1.0.0, but you have scikit-learn 0.24.1 which is incompatible.
Successfully installed scikit-learn-0.24.1


## Types of classification

*   Binary: When we want to decide whether an entry belongs to a class (1) or not (0). For instance, given a movie, we want to classify it as a comedy (1) or not (0).
*   Multiclass: When there are more than two classes and each entry belongs to exactly one of them. For instance, given a movie, we want to classify it as Drama, Comedy, Action...
*   Multilabel: When an entry can belong to multiple classes at the same time. For instance, we can classify Star Wars as both Action and Adventure.
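
As a rough illustration of the three cases above, the targets could be stored as follows. This is only a sketch: the movies, the genre list, and the helper `decode` are made up for the example.

```python
# Binary: a single 0/1 answer per movie (is it a comedy?).
y_binary = [1, 0, 0]

# Multiclass: exactly one class per movie.
y_multiclass = ['Comedy', 'Drama', 'Action']

# Multilabel: one 0/1 flag per genre; several flags may be 1 at once.
genres = ['Action', 'Adventure', 'Comedy']  # hypothetical genre list
y_multilabel = [
    [0, 0, 1],  # a pure comedy
    [0, 0, 0],  # none of the listed genres
    [1, 1, 0],  # Action and Adventure, like the Star Wars example
]

def decode(flags):
    """Turn a row of 0/1 flags back into the list of genre names."""
    return [genre for genre, flag in zip(genres, flags) if flag == 1]

for flags in y_multilabel:
    print(decode(flags))
```

Note that only the multilabel target needs one column per class, which is the representation used in the rest of this course.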

In this course, we will learn more about multilabel classification. To that end, we will work with multilabel classification of posts from Stack Overflow. Basically, depending on the post contents, we will assign one or more tags to the post (such as html, jquery, angular, or node.js).



## Understanding and manipulating our dataset

So, our dataset will be made of different posts, each with their own tags.

In [2]:
import pandas as pd

dataset = pd.read_csv('https://raw.githubusercontent.com/alura-cursos/alura_classificacao_multilabel/master/dataset/stackoverflow_perguntas.csv')

In [3]:
dataset

Unnamed: 0,Perguntas,Tags
0,Possuo um projeto Node.js porém preciso criar ...,node.js
1,"Gostaria de fazer testes unitários no Node.js,...",node.js
2,Como inverter a ordem com que o jQuery itera u...,jquery
3,Eu tenho uma página onde pretendo utilizar um ...,html
4,Como exibir os dados retornados do FireStore e...,html angular
...,...,...
5403,Queria saber como pegar o total de cores de um...,jquery html
5404,"Boa noite, estou usando phonegap para fazer um...",html
5405,"Estou construindo um mini fórum, e nele, os us...",jquery html
5406,"Boa tarde, Estou para desenvolver um site na ...",html


Nice! Our dataset has 5408 entries (indexed from 0 to 5407), each with the post text and the tags used in the post.

In [4]:
dataset.info( )

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5408 entries, 0 to 5407
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Perguntas  5408 non-null   object
 1   Tags       5408 non-null   object
dtypes: object(2)
memory usage: 84.6+ KB


Also, note that there are no missing values in our dataset.

Some entries have two tags (or even three tags). For instance:

In [5]:
dataset.loc[4]

Perguntas    Como exibir os dados retornados do FireStore e...
Tags                                             html angular 
Name: 4, dtype: object

This entry has tags html and angular. So, in our multilabel algorithm, we want to find multiple labels that can fit a given post. Let's see the unique tag combinations from our dataframe:

In [6]:
dataset.Tags.unique( )

array(['node.js', 'jquery', 'html', 'html angular ', 'html ', 'angular',
       'angular ', 'jquery html  ', 'jquery ', 'jquery html',
       'jquery html ', 'html angular', 'angular node.js ', 'html  ',
       'jquery html angular', 'node.js ', 'html jquery', 'html jquery ',
       'jquery angular  ', 'html node.js', 'jquery  ', 'angular node.js',
       'jquery angular', 'html node.js ', 'jquery node.js ', 'angular  ',
       'jquery angular ', 'jquery html angular ', 'node.js html ',
       ' node.js', 'node.js html', 'html angular  ', 'jquery node.js',
       'angular html', 'html angular  node.js', 'jquery html node.js',
       'html angular node.js'], dtype=object)

In [7]:
len(dataset.Tags.unique( ))

37

So, we have 37 unique tag combinations. 

Note that some of our tags look strange. For example, 'html' and 'html ' are treated as different tags because of a single trailing space. The same happens with other tags in our dataset. Also, the 'html angular' combination is treated as different from 'angular html', although both contain the same tags.

To handle the tag combinations, we can use a better solution. Since we have a multilabel problem, we can use a binary encoding: our Tags column will be split into multiple columns, one for each possible single tag. Then, if the tag is present in the post, the column receives 1, and, if not, it receives 0.

For instance, if a post has the tag 'html', it will receive a 1 in column 'html', and 0 in all others. If it has the tags jquery and node.js, it will receive a 1 in columns jquery and node.js, and 0 in the others.

So, we can try to make this feature transformation. Then, it becomes much easier to determine the tags present in each entry.
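
The encoding described above can be sketched in plain Python. This is only an illustration: the helper name `encode_tags` and the sample strings are hypothetical, and the label order is fixed by hand.

```python
# Hypothetical label order, chosen by hand for this illustration.
LABELS = ['node.js', 'jquery', 'html', 'angular']

def encode_tags(tag_string, labels=LABELS):
    """Return one 0/1 flag per label for a whitespace-separated tag string."""
    # split() with no arguments ignores leading, trailing and repeated
    # spaces, so 'html ' and 'html' produce the same tokens.
    present = set(tag_string.split())
    return [1 if label in present else 0 for label in labels]

print(encode_tags('html'))             # [0, 0, 1, 0]
print(encode_tags('html '))            # [0, 0, 1, 0]
print(encode_tags('html angular'))     # [0, 0, 1, 1]
print(encode_tags('angular html'))     # [0, 0, 1, 1]
print(encode_tags('jquery node.js '))  # [1, 1, 0, 0]
```

With this representation, stray spaces and the order of the tags no longer create distinct combinations.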

First, let's get the single labels in our dataset:

In [8]:
single_labels = []

for tags in dataset.Tags.unique( ):
  for tag in tags.split( ):
    if tag not in single_labels:
      single_labels.append(tag)

single_labels

['node.js', 'jquery', 'html', 'angular']

Nice! So, in fact, we only have four different single labels. Now, let's update our initial dataframe with the new features:

In [9]:
for label in single_labels:
  dataset[label] = dataset.Tags.apply(lambda x : 1 if (label in x.split( )) else 0)

In [10]:
dataset

Unnamed: 0,Perguntas,Tags,node.js,jquery,html,angular
0,Possuo um projeto Node.js porém preciso criar ...,node.js,1,0,0,0
1,"Gostaria de fazer testes unitários no Node.js,...",node.js,1,0,0,0
2,Como inverter a ordem com que o jQuery itera u...,jquery,0,1,0,0
3,Eu tenho uma página onde pretendo utilizar um ...,html,0,0,1,0
4,Como exibir os dados retornados do FireStore e...,html angular,0,0,1,1
...,...,...,...,...,...,...
5403,Queria saber como pegar o total de cores de um...,jquery html,0,1,1,0
5404,"Boa noite, estou usando phonegap para fazer um...",html,0,0,1,0
5405,"Estou construindo um mini fórum, e nele, os us...",jquery html,0,1,1,0
5406,"Boa tarde, Estou para desenvolver um site na ...",html,0,0,1,0


Nice! Everything seems to be working. A quick visual check of the dataset also confirms that the columns were defined correctly.



Let's get a sum of these columns, just to understand how many entries have each label:

In [11]:
dataset[single_labels].sum( )

node.js     641
jquery     2444
html       2345
angular     929
dtype: int64

So, let's try to understand a little more about our problem. We receive the post in column 'Perguntas', and we have to assign labels to it. Since a post can have several tags at once, our model should be able to predict more than one target.

# Train-test splits

To create our train-test splits, we have to define:

* Train dataset, which will be used to train our model.
* Test dataset, which will be used to evaluate (test) our model.

Also, we have to keep in mind that we need to separate our features into:

* Independent features ($X$)
* Dependent features, or targets ($Y$)

We can perform our split using the train_test_split function:


In [12]:
from sklearn.model_selection import train_test_split

X = dataset.Perguntas
Y = dataset[single_labels]

X_train, X_test, Y_train, Y_test = train_test_split( X, Y )

Note that, here, the independent feature is the post text, and the dependent features are the binary columns, one per tag:

In [13]:
X

0       Possuo um projeto Node.js porém preciso criar ...
1       Gostaria de fazer testes unitários no Node.js,...
2       Como inverter a ordem com que o jQuery itera u...
3       Eu tenho uma página onde pretendo utilizar um ...
4       Como exibir os dados retornados do FireStore e...
                              ...                        
5403    Queria saber como pegar o total de cores de um...
5404    Boa noite, estou usando phonegap para fazer um...
5405    Estou construindo um mini fórum, e nele, os us...
5406    Boa tarde,  Estou para desenvolver um site na ...
5407    Estou fazendo um hotsite, ele é one page, e é ...
Name: Perguntas, Length: 5408, dtype: object

In [14]:
Y

Unnamed: 0,node.js,jquery,html,angular
0,1,0,0,0
1,1,0,0,0
2,0,1,0,0
3,0,0,1,0
4,0,0,1,1
...,...,...,...,...
5403,0,1,1,0
5404,0,0,1,0
5405,0,1,1,0
5406,0,0,1,0


## Aggregating target features

Note that, when we do a train-test split, our dataset is split randomly, so each run gives a different split unless we set a random_state. Also, the target is usually given by a single column. Let's aggregate our four tag columns into one:

In [15]:
dataset['Agg_Target'] = list(zip(dataset['node.js'], dataset['jquery'], dataset['html'], dataset['angular']))

In [16]:
dataset

Unnamed: 0,Perguntas,Tags,node.js,jquery,html,angular,Agg_Target
0,Possuo um projeto Node.js porém preciso criar ...,node.js,1,0,0,0,"(1, 0, 0, 0)"
1,"Gostaria de fazer testes unitários no Node.js,...",node.js,1,0,0,0,"(1, 0, 0, 0)"
2,Como inverter a ordem com que o jQuery itera u...,jquery,0,1,0,0,"(0, 1, 0, 0)"
3,Eu tenho uma página onde pretendo utilizar um ...,html,0,0,1,0,"(0, 0, 1, 0)"
4,Como exibir os dados retornados do FireStore e...,html angular,0,0,1,1,"(0, 0, 1, 1)"
...,...,...,...,...,...,...,...
5403,Queria saber como pegar o total de cores de um...,jquery html,0,1,1,0,"(0, 1, 1, 0)"
5404,"Boa noite, estou usando phonegap para fazer um...",html,0,0,1,0,"(0, 0, 1, 0)"
5405,"Estou construindo um mini fórum, e nele, os us...",jquery html,0,1,1,0,"(0, 1, 1, 0)"
5406,"Boa tarde, Estou para desenvolver um site na ...",html,0,0,1,0,"(0, 0, 1, 0)"


Nice! Now, we can use the Agg_Target column as our target!

Now, let's try to perform our train test split:

In [17]:
from sklearn.model_selection import train_test_split

X = dataset.Perguntas
Y = dataset['Agg_Target']

X_train, X_test, Y_train, Y_test = train_test_split( X, Y, test_size = 0.2, random_state = 123 )

Nice! Now, let's look at our training set:

In [18]:
X_train

1577       array1 = [1,2,3];   array2 = ["um","dois","...
1927    Não sei se fui claro no título, mas quem é da ...
3409    Alguém sabe me dizer qual a melhor forma de re...
4606    Estou com problemas ao tentar validar campos d...
5237    Preciso copiar um valor de dentro de um CODE  ...
                              ...                        
5218    Tenho um sisteminha, para mudar o layout da pá...
4060    Como fazer alto scoll ao carregar a página?  E...
1346    Explicação:  Tenho uma CODE  pai que contém du...
3454    Estou querendo fazer um sistema onde eu iria t...
3582    Galera eu to com um problemão, ja pesquisei ba...
Name: Perguntas, Length: 4326, dtype: object

Note that our training set is defined by a column of text data. How can the ML algorithm understand and make predictions based on a text? First, we should try to manipulate this data so that it becomes easier for the algorithm to make predictions using the training and test sets.



## Vectorization

To that end, we will **vectorize** our text. Basically, we will translate each post into a vector of numbers that represents it. Each number is the weight of one word: it grows with how often the word appears in the post and shrinks with how many posts in the corpus contain that word. We can do this using TF-IDF:

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [20]:
# Defining our vectorizer. We will keep, at maximum, 5000 features (the most frequent terms),
# and we will ignore words which appear in more than 85% of the posts.

vectorize = TfidfVectorizer(max_features = 5000, max_df = 0.85)  
vectorize

TfidfVectorizer(max_df=0.85, max_features=5000)

In [21]:
# Fitting our vectorizer

vectorize.fit(dataset.Perguntas)

TfidfVectorizer(max_df=0.85, max_features=5000)

In [22]:
# Vectorizing our train and test sets

X_train_v = vectorize.transform(X_train)
X_test_v  = vectorize.transform(X_test)

Let's check the shape of our data:

In [23]:
X_train_v.shape

(4326, 5000)

In [24]:
X_test_v.shape

(1082, 5000)

Note that, now, we have 5000 columns (or features) in each data set, which correspond to the vectorized form of our Stack Overflow posts.

Nice! Now, our sets are in a good format to be applied to our ML algorithm.
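
To make the TF-IDF idea more concrete, here is a minimal sketch on a toy corpus. The documents and the names `toy_vectorizer`, `train_docs` and `test_docs` are invented for the illustration. The sketch fits the vectorizer on the training documents only, which is a stricter choice than fitting on the full dataset, since no statistics from the test posts reach the features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, invented for this illustration.
train_docs = [
    'jquery click event',
    'jquery ajax request',
    'angular component event',
]
test_docs = ['angular ajax directive']  # 'directive' never appears in train_docs

toy_vectorizer = TfidfVectorizer()
train_matrix = toy_vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF
test_matrix = toy_vectorizer.transform(test_docs)        # reuses them unchanged

vocab = toy_vectorizer.vocabulary_  # maps each word to its column index
idf = toy_vectorizer.idf_           # one IDF weight per column

print(train_matrix.shape)  # (3, 7): 3 documents, 7 distinct words
print(test_matrix.shape)   # (1, 7): same columns, the unseen word is dropped

# A word found in two documents gets a lower IDF than a word found in one.
print(idf[vocab['jquery']] < idf[vocab['click']])  # True
```

The same pattern (fit on the training set, then transform both sets) applies to the real posts.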

# Creating our first multilabel model

So, how will we create our multilabel model? A very simple approach is binary relevance, which trains one independent binary classifier for each tag column. In other words, the same binary classification algorithm is trained multiple times, once per tag.

Let's try to implement this classifier. In sklearn, binary relevance is available through OneVsRestClassifier:

In [25]:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression( solver = 'lbfgs' )
bin_rel = OneVsRestClassifier( estimator = log_reg )
bin_rel

OneVsRestClassifier(estimator=LogisticRegression())

Note that, here, each of our binary classification algorithms will be given by a Logistic Regression model. Now, let's fit the binary relevance algorithm. First, we just have to transform our targets into arrays:

In [26]:
import numpy as np

Y_train_array = np.asarray(list(Y_train))
Y_test_array  = np.asarray(list(Y_test))

bin_rel.fit(X_train_v, Y_train_array)

OneVsRestClassifier(estimator=LogisticRegression())
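
To see what binary relevance does under the hood, the idea can be written as an explicit loop. This is a sketch on invented toy data (`X_toy`, `Y_toy`), not the internals of OneVsRestClassifier: it simply trains one Logistic Regression per label column and stacks the predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, invented for this illustration: 6 samples, 2 features, 2 labels.
X_toy = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0],
                  [0.9, 0.1], [1.0, 1.0], [0.0, 0.0]])
Y_toy = np.array([[0, 1], [0, 1], [1, 0],
                  [1, 0], [1, 1], [0, 0]])

# One independent binary classifier per label column.
models = []
for col in range(Y_toy.shape[1]):
    clf = LogisticRegression(solver='lbfgs')
    clf.fit(X_toy, Y_toy[:, col])
    models.append(clf)

# Stack the per-label predictions back into one multilabel matrix.
pred_toy = np.column_stack([clf.predict(X_toy) for clf in models])
print(pred_toy.shape)  # (6, 2)
```

Each classifier only ever sees its own column, so no information is shared between labels.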

Nice! Now, how good is our fitted model? We can see how well our trained model is predicting our training set using the score( ) method:

In [27]:
bin_rel.score(X_train_v, Y_train_array)

0.5917706888580675

And how well does our trained model predict the test set?

In [28]:
bin_rel.score(X_test_v, Y_test_array)

0.4168207024029575

Ok, so, on the test set, our model gets 41.68% of the posts exactly right. But is every single label classified with this accuracy? Let's understand what accuracy is telling us.

## Testing our model

So, we have evaluated the accuracy of our model. But how does accuracy work in a multilabel classification model?

Accuracy tests whether the prediction of our model is the same as the true value. In a multilabel classification, for instance, if our model predicts [0, 0, 1, 1] for an entry, the entry counts as correct (100%) only if the true value is also [0, 0, 1, 1]. However, what if the true value is [0, 0, 1, 0]? In this case, the accuracy for this entry is 0%, even though we actually hit 3 out of 4 labels.

Thus, accuracy (exact match) is a very strict metric for multilabel algorithms. In multilabel classification, another very common metric is the Hamming loss.

### Hamming loss

The Hamming distance is the sum of the absolute differences between the prediction and the true value, that is, the number of positions where they disagree. For instance:

* The Hamming distance between [0, 0, 1, 1] and [0, 0, 1, 1] is 0.
* The Hamming distance between [0, 0, 1, 0] and [0, 0, 1, 1] is 1.

The farther our prediction is from the true value, the higher the Hamming distance.

The Hamming loss metric is given by the sum of the Hamming distances over all data points (prediction vs true) divided by the total number of label values (number of entries times number of labels). Thus, it gives us the fraction of wrong label assignments in our multilabel set.
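
The difference between the two metrics can be checked by hand with a small sketch. The helpers `exact_match_ratio` and `hamming_loss_manual` are written only for this illustration and reuse the two example vectors above.

```python
def exact_match_ratio(y_true, y_pred):
    """Fraction of rows whose whole label vector is predicted correctly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return hits / len(y_true)

def hamming_loss_manual(y_true, y_pred):
    """Fraction of individual 0/1 label values that are predicted wrongly."""
    wrong = sum(ti != pi
                for t, p in zip(y_true, y_pred)
                for ti, pi in zip(t, p))
    total = sum(len(t) for t in y_true)
    return wrong / total

y_true = [[0, 0, 1, 1], [0, 0, 1, 0]]
y_pred = [[0, 0, 1, 1], [0, 0, 1, 1]]

print(exact_match_ratio(y_true, y_pred))    # 0.5   (one of two rows matches)
print(hamming_loss_manual(y_true, y_pred))  # 0.125 (one of eight values is wrong)
```

The second row costs a full miss in the exact-match ratio, but only one wrong value out of eight in the Hamming loss.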

Let's evaluate the Hamming loss for our model:

In [29]:
from sklearn.metrics import hamming_loss

pred_onevsrest = bin_rel.predict(X_test_v)
hamming_loss_onevsrest = hamming_loss(Y_test_array, pred_onevsrest)

In [30]:
hamming_loss_onevsrest

0.1883086876155268

Nice! So, we have misclassified only 18.83% of the individual tag assignments!

Note that, in our binary relevance model, we simply trained four different binary classifiers separately. Here, each classifier does not influence the others.

However, is this really true? Shouldn't a specific tag influence the existence of another related tag? After all, these tags actually have some correlation:

In [31]:
dataset[single_labels].corr( )

Unnamed: 0,node.js,jquery,html,angular
node.js,1.0,-0.321485,-0.273523,-0.101787
jquery,-0.321485,1.0,-0.253977,-0.366269
html,-0.273523,-0.253977,1.0,-0.286706
angular,-0.101787,-0.366269,-0.286706,1.0


Note that all correlations are negative, which means that a post that has one tag is less likely to have another. How can we create a model that actually takes this into consideration when evaluating our multilabel classification?

# Capturing the correlation between the different labels in our model

To capture this, we need to perform a **chain classification**. Basically, we will still use four models, but now each model receives the labels handled by the previous models as additional input features.

## Chain classification

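For instance, we will create a model that predicts node.js from the text, then a second that predicts jquery from the text plus the node.js label, then a third that predicts html from the text plus node.js and jquery, and finally one that predicts angular from the text plus the other three labels. Thus, by feeding the already classified columns to the next model, we end up considering such labels in our final model.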
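
The chain idea can be sketched by hand as follows. The toy data and the function `predict_chain` are invented for this illustration, and library implementations may differ in their details. During training, each model sees the true values of the earlier labels; during prediction, it sees the values predicted by the earlier models.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data, invented for this illustration: 6 samples, 2 features, 3 labels.
X_toy = np.array([[0.0, 1.0], [0.1, 0.9], [1.0, 0.0],
                  [0.9, 0.1], [1.0, 1.0], [0.0, 0.0]])
Y_toy = np.array([[0, 1, 1], [0, 1, 1], [1, 0, 1],
                  [1, 0, 0], [1, 1, 0], [0, 0, 0]])

# Training: the model for label j sees the original features plus the
# true values of labels 0..j-1.
chain = []
for j in range(Y_toy.shape[1]):
    X_aug = np.hstack([X_toy, Y_toy[:, :j]])
    chain.append(LogisticRegression(solver='lbfgs').fit(X_aug, Y_toy[:, j]))

def predict_chain(X_new):
    """Predict label by label, feeding earlier predictions to later models."""
    preds = np.zeros((X_new.shape[0], 0), dtype=int)
    for clf in chain:
        X_aug = np.hstack([X_new, preds])
        preds = np.hstack([preds, clf.predict(X_aug).reshape(-1, 1)])
    return preds

print(predict_chain(X_toy).shape)  # (6, 3)
```

Because later models depend on earlier predictions, the order of the labels in the chain can affect the result.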

To implement these kinds of models, we will import scikit-multilearn:

http://scikit.ml/api/skmultilearn.problem_transform.cc.html

In [32]:
!pip install scikit-multilearn

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [33]:
from skmultilearn.problem_transform import ClassifierChain

Nice! Now, let's instantiate our object and fit it:

In [34]:
chain_class = ClassifierChain(log_reg)

chain_class.fit(X_train_v, Y_train_array)

ClassifierChain(classifier=LogisticRegression(), require_dense=[True, True])

Ok! Now, let's evaluate the accuracy of our new model:

In [35]:
chain_class.score(X_test_v, Y_test_array)

0.49815157116451014

Again, accuracy is not really a good metric in multi-label classification. Still, our accuracy was actually higher now. 

Now, let's evaluate the hamming loss:

In [36]:
pred_chain = chain_class.predict(X_test_v)
hamming_loss_chain = hamming_loss(Y_test_array, pred_chain)

In [37]:
hamming_loss_chain

0.21095194085027727

So, actually, our Hamming loss became worse (0.211 against 0.188), but our accuracy became higher (49.8% against 41.7%). That means that our model is better at finding exact matches (which is expected, since we now consider the correlation between targets), but is a little worse at classifying our single columns.

## Binary relevance using scikit-multilearn

Note that binary relevance is also available in scikit-multilearn. Let's try to apply this algorithm here:

In [38]:
from skmultilearn.problem_transform import BinaryRelevance

bin_rel = BinaryRelevance(log_reg)

bin_rel.fit(X_train_v, Y_train_array)

BinaryRelevance(classifier=LogisticRegression(), require_dense=[True, True])

Now, testing our model:

In [39]:
bin_rel.score(X_test_v, Y_test_array)

0.4168207024029575

In [40]:
pred_bin_rel = bin_rel.predict(X_test_v)
hamming_loss_bin_rel = hamming_loss(Y_test_array, pred_bin_rel)

In [41]:
hamming_loss_bin_rel

0.1883086876155268

Note that the accuracy and the hamming loss are the same as we evaluated previously, using the sklearn binary relevance algorithm.

# Multi-label K-NN

Finally, let's use a truly multi-label algorithm, which does not need to build multiple models and algorithms to handle one problem. This is the Multi-label $K$-Nearest Neighbors, or ML-KNN.

In KNN, we basically find the $K$ nearest neighbors of an entry and, then, we use a predefined function (for instance, a majority vote) to get a prediction based on the targets of these neighbors.

In ML-KNN, each label is decided separately: the model counts how many of the $K$ neighbors carry that label and uses these counts to decide whether the label applies. Thus, a predicted entry can receive more than one tag (or even no tag at all).
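
The neighbor counting behind this idea can be sketched with NumPy. This is a simplification on invented toy data: it stops at a plain majority vote per label, whereas ML-KNN turns the neighbor counts into a maximum a posteriori decision using probabilities estimated from the training set. The helper `neighbor_label_counts` is hypothetical.

```python
import numpy as np

# Toy training set, invented for this illustration: 6 points, 2 labels.
X_train_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                        [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
Y_train_toy = np.array([[1, 0], [1, 0], [1, 1],
                        [0, 1], [0, 1], [1, 1]])

def neighbor_label_counts(x, k=3):
    """Count, for each label, how many of the k nearest neighbors carry it."""
    dists = np.linalg.norm(X_train_toy - x, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of the k closest points
    return Y_train_toy[nearest].sum(axis=0)

counts = neighbor_label_counts(np.array([0.05, 0.05]))
print(counts)  # [3 1]

# Simplified decision: assign a label when more than half of the neighbors carry it.
simple_vote = (counts > 3 / 2).astype(int)
print(simple_vote)  # [1 0]
```

Each label gets its own count and its own decision, which is why a single entry can end up with zero, one, or several tags.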



In [42]:
from skmultilearn.adapt import MLkNN

classifier_mlknn = MLkNN( )

classifier_mlknn.fit(X_train_v, Y_train_array)



MLkNN()

Now, let's test our algorithm:

In [43]:
classifier_mlknn.score(X_test_v, Y_test_array)

0.32532347504621073

In [44]:
pred_mlknn = classifier_mlknn.predict(X_test_v)
hamming_loss_mlknn = hamming_loss(Y_test_array, pred_mlknn)

In [45]:
hamming_loss_mlknn

0.25231053604436227

So, actually, we found a lower accuracy and a higher hamming loss using the MLkNN algorithm! Note that using a more complex model does not always mean that our results will improve!

## Trying to analyze the prediction of our algorithms

So, let's try to understand the prediction and the results from each of our algorithms. Let's create a DataFrame to store this:

In [60]:
results_class = pd.DataFrame( )
results_class['posts']   = X_test.values
results_class['real']    = Y_test.values
results_class['bin_rel'] = list(bin_rel.predict(X_test_v).toarray( ))
results_class['chain_class'] = list(chain_class.predict(X_test_v).toarray( ))
results_class['mlknn'] = list(classifier_mlknn.predict(X_test_v).toarray( ))

In [61]:
results_class

Unnamed: 0,posts,real,bin_rel,chain_class,mlknn
0,estou com conflito entre o CODE e os CODE ...,"(0, 1, 0, 0)","[0, 1, 0, 0]","[0.0, 1.0, 0.0, 0.0]","[0, 0, 0, 0]"
1,Estou fazendo um site que eu sou obrigado a us...,"(0, 0, 1, 0)","[0, 0, 1, 0]","[0.0, 0.0, 1.0, 0.0]","[0, 1, 1, 0]"
2,Recentemente fiz um refactor do meu código par...,"(1, 0, 0, 0)","[1, 0, 0, 0]","[1.0, 0.0, 0.0, 0.0]","[1, 0, 0, 0]"
3,Eu tenho esse código em CODE que passo valore...,"(0, 1, 1, 0)","[0, 1, 0, 0]","[0.0, 1.0, 0.0, 0.0]","[0, 1, 1, 0]"
4,"Olá, em minha função tem o evento CODE que de...","(0, 1, 1, 0)","[0, 1, 0, 0]","[0.0, 1.0, 0.0, 0.0]","[0, 1, 1, 0]"
...,...,...,...,...,...
1077,Estou a desenvolver um website em jQuery. E at...,"(0, 1, 0, 0)","[0, 1, 1, 0]","[0.0, 1.0, 1.0, 0.0]","[0, 0, 1, 0]"
1078,Estou usando este plugin - jquery autocomplete...,"(0, 1, 0, 0)","[0, 1, 0, 0]","[0.0, 1.0, 0.0, 0.0]","[0, 1, 0, 0]"
1079,"Tenho o seguinte jQuery: CODE Nisto, quanti...","(0, 1, 0, 0)","[0, 1, 0, 0]","[0.0, 1.0, 0.0, 0.0]","[0, 1, 0, 0]"
1080,Estou usando o SimpleModal Contact Form de Eri...,"(0, 1, 0, 0)","[0, 1, 0, 0]","[0.0, 1.0, 0.0, 0.0]","[0, 1, 0, 0]"


Nice! We can see, for instance, that the first two entries are right for the binary relevance and chain classification models, but wrong for MLkNN. The third entry is correct for all algorithms. The fourth entry is correct only for MLkNN. And the list goes on.