## Multi-Label Classification

Multilabel classification is a machine-learning task in which each input can be assigned any subset of the possible labels, from none of them to all of them. It differs from binary or multiclass classification, where the labels are mutually exclusive and each input receives exactly one.
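As a small illustration, the difference shows up in the shape of the target. The arrays below are toy values invented for this sketch, not taken from the dataset used later.

```python
import numpy as np

# Multiclass: one label id per sample (classes are mutually exclusive).
y_multiclass = np.array([0, 2, 1])

# Multilabel: one 0/1 indicator per sample and label; a row may contain
# no ones, a single one, or several ones.
y_multilabel = np.array([
    [1, 0, 1],  # labels 0 and 2
    [0, 0, 0],  # no label at all
    [1, 1, 1],  # every label
])

# Number of labels attached to each sample.
labels_per_sample = y_multilabel.sum(axis=1)
print(labels_per_sample)
```

The rest of the notebook builds exactly this kind of indicator matrix from the question tags.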

In [3]:

import re
import spacy
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from tqdm.notebook import tqdm
tqdm.pandas()


In [4]:
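##loading the small English pipeline; the parser and NER components are not needed here
##note: in spaCy v3 the rule-based lemmatizer relies on POS tags, so disabling "tagger" may leave most tokens unlemmatized
nlp=spacy.load('en_core_web_sm',disable=["tagger", "parser","ner"])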

In [5]:
#loading the questions dataset
df1=pd.read_csv("/content/Questions.csv",encoding="latin-1")
print(df1.shape)
df1.head()

(85085, 6)


| | Id | OwnerUserId | CreationDate | Score | Title | Body |
|---|---|---|---|---|---|---|
| 0 | 6 | 5.0 | 2010-07-19T19:14:44Z | 272 | The Two Cultures: statistics vs. machine learn... | `<p>Last year, I read a blog post from <a href=...` |
| 1 | 21 | 59.0 | 2010-07-19T19:24:36Z | 4 | Forecasting demographic census | `<p>What are some of the ways to forecast demog...` |
| 2 | 22 | 66.0 | 2010-07-19T19:25:39Z | 208 | Bayesian and frequentist reasoning in plain En... | `<p>How would you describe in plain English the...` |
| 3 | 31 | 13.0 | 2010-07-19T19:28:44Z | 138 | What is the meaning of p values and t values i... | `<p>After taking a statistics course and then t...` |
| 4 | 36 | 8.0 | 2010-07-19T19:31:47Z | 58 | Examples for teaching: Correlation does not me... | `<p>There is an old saying: "Correlation does n...` |


In [6]:
##loading the tags data
df2=pd.read_csv("/content/Tags.csv")
print(df2.shape)
df2.head()

(244228, 2)


| | Id | Tag |
|---|---|---|
| 0 | 1 | bayesian |
| 1 | 1 | prior |
| 2 | 1 | elicitation |
| 3 | 2 | distributions |
| 4 | 2 | normality |


In [7]:
##cleaning the text by stripping the HTML tags (naming the parser avoids bs4's parser-guessing warning)
BeautifulSoup(df1["Body"][0],"html.parser").get_text()

'Last year, I read a blog post from Brendan O\'Connor entitled "Statistics vs. Machine Learning, fight!" that discussed some of the differences between the two fields.  Andrew Gelman responded favorably to this:\nSimon Blomberg: \n\nFrom R\'s fortunes\n  package: To paraphrase provocatively,\n  \'machine learning is statistics minus\n  any checking of models and\n  assumptions\'.\n  -- Brian D. Ripley (about the difference between machine learning\n  and statistics) useR! 2004, Vienna\n  (May 2004) :-) Season\'s Greetings!\n\nAndrew Gelman:\n\nIn that case, maybe we should get rid\n  of checking of models and assumptions\n  more often. Then maybe we\'d be able to\n  solve some of the problems that the\n  machine learning people can solve but\n  we can\'t!\n\nThere was also the "Statistical Modeling: The Two Cultures" paper by Leo Breiman in 2001 which argued that statisticians rely too heavily on data modeling, and that machine learning techniques are making progress by instead relying

In [8]:
##applying text preprocessing
def clean_text(text):

  ##removing html tags (naming the parser avoids bs4's parser-guessing warning)
  text=BeautifulSoup(text,"html.parser").get_text()

  ##keeping only letters
  text=re.sub(r"[^A-Za-z]"," ",text)

  ##converting text to lowercase
  text=text.lower()

  ##collapsing runs of whitespace into a single space (raw string avoids an invalid-escape warning)
  text=re.sub(r"\s+"," ",text)

  ##creating the doc object
  doc=nlp(text)

  ##taking the lemma of every token that is not a stop word
  tokens=[token.lemma_ for token in doc if not token.is_stop]

  return " ".join(tokens)


In [9]:
##applying on Body column
df1["Body"]=df1["Body"].progress_apply(clean_text)
df1.head()



| | Id | OwnerUserId | CreationDate | Score | Title | Body |
|---|---|---|---|---|---|---|
| 0 | 6 | 5.0 | 2010-07-19T19:14:44Z | 272 | The Two Cultures: statistics vs. machine learn... | year read blog post brendan o connor entitled ... |
| 1 | 21 | 59.0 | 2010-07-19T19:24:36Z | 4 | Forecasting demographic census | ways forecast demographic census validation ca... |
| 2 | 22 | 66.0 | 2010-07-19T19:25:39Z | 208 | Bayesian and frequentist reasoning in plain En... | describe plain english characteristics disting... |
| 3 | 31 | 13.0 | 2010-07-19T19:28:44Z | 138 | What is the meaning of p values and t values i... | taking statistics course trying help fellow st... |
| 4 | 36 | 8.0 | 2010-07-19T19:31:47Z | 58 | Examples for teaching: Correlation does not me... | old saying correlation mean causation teach te... |


In [10]:
## getting unique count of tags
df2["Tag"].value_counts()

Tag
r                       13236
regression              10959
machine-learning         6089
time-series              5559
probability              4217
                        ...  
fmincon                     1
doc2vec                     1
sympy                       1
adversarial-boosting        1
corpus-linguistics          1
Name: count, Length: 1315, dtype: int64

In [11]:
##removing "-" from tags
df2["Tag"]=df2["Tag"].progress_apply(lambda x:re.sub("-"," ",x))
df2["Tag"]

0                        bayesian
1                           prior
2                     elicitation
3                   distributions
4                       normality
                   ...           
244223           machine learning
244224    mathematical statistics
244225        normal distribution
244226                 estimation
244227                     median
Name: Tag, Length: 244228, dtype: object

In [12]:
df2=df2.groupby("Id").progress_apply(lambda x:x["Tag"].values).reset_index()
df2.head()

| | Id | 0 |
|---|---|---|
| 0 | 1 | [bayesian, prior, elicitation] |
| 1 | 2 | [distributions, normality] |
| 2 | 3 | [software, open source] |
| 3 | 4 | [distributions, statistical significance] |
| 4 | 6 | [machine learning] |


In [13]:
df2.columns=["Id","Tags"]
df2.head()

| | Id | Tags |
|---|---|---|
| 0 | 1 | [bayesian, prior, elicitation] |
| 1 | 2 | [distributions, normality] |
| 2 | 3 | [software, open source] |
| 3 | 4 | [distributions, statistical significance] |
| 4 | 6 | [machine learning] |


In [14]:
###merging two dataframes
df=pd.merge(df1,df2,how="inner",on="Id")
print(df.shape)
df.head()

(85085, 7)


| | Id | OwnerUserId | CreationDate | Score | Title | Body | Tags |
|---|---|---|---|---|---|---|---|
| 0 | 6 | 5.0 | 2010-07-19T19:14:44Z | 272 | The Two Cultures: statistics vs. machine learn... | year read blog post brendan o connor entitled ... | [machine learning] |
| 1 | 21 | 59.0 | 2010-07-19T19:24:36Z | 4 | Forecasting demographic census | ways forecast demographic census validation ca... | [forecasting, population, census] |
| 2 | 22 | 66.0 | 2010-07-19T19:25:39Z | 208 | Bayesian and frequentist reasoning in plain En... | describe plain english characteristics disting... | [bayesian, frequentist] |
| 3 | 31 | 13.0 | 2010-07-19T19:28:44Z | 138 | What is the meaning of p values and t values i... | taking statistics course trying help fellow st... | [hypothesis testing, t test, p value, interpre... |
| 4 | 36 | 8.0 | 2010-07-19T19:31:47Z | 58 | Examples for teaching: Correlation does not me... | old saying correlation mean causation teach te... | [correlation, teaching] |


In [15]:
##checking the frequency of occurrence of each tag
freq={}
for i in df["Tags"]:
  for j in i:
    if j in freq.keys():
      freq[j]+=1
    else:
      freq[j]=1



In [16]:
# sort the dictionary in descending order
freq = dict(sorted(freq.items(), key=lambda x:x[1],reverse=True))


In [17]:
top_tags=list(freq.keys())[:10]
top_tags

['r',
 'regression',
 'machine learning',
 'time series',
 'probability',
 'hypothesis testing',
 'self study',
 'distributions',
 'logistic',
 'classification']
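The counting and sorting steps above can also be sketched with `collections.Counter` from the standard library. This is an alternative formulation, not the notebook's own code, and `tag_lists` is a hypothetical stand-in for `df["Tags"]`.

```python
from collections import Counter

# Hypothetical stand-in for df["Tags"]: one list of tags per question.
tag_lists = [
    ["r", "regression"],
    ["r", "time series"],
    ["regression", "r", "logistic"],
]

# Count every tag occurrence across all questions.
freq = Counter(tag for tags in tag_lists for tag in tags)

# most_common(n) returns the n most frequent (tag, count) pairs,
# ordered from most to least frequent.
top_tags = [tag for tag, _ in freq.most_common(2)]
print(top_tags)
```

On the real data, `freq.most_common(10)` would play the role of the sorted dictionary and the slice of its first ten keys.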

We will keep only the documents that carry at least two of the top tags listed above, and for each of them we keep only those top tags as labels.

In [18]:
x=[]
y=[]

for i in range(len(df['Tags'])):

  temp=[]
  for j in df['Tags'][i]:
    if j in top_tags:
      temp.append(j)

  if(len(temp)>1):
    x.append(df['Body'][i])
    y.append(temp)

In [19]:
x[:10]

['recently started working tuberculosis clinic meet periodically discuss number tb cases currently treating number tests administered etc d like start modeling counts guessing unusual unfortunately ve little training time series exposure models continuous data stock prices large numbers counts influenza deal cases month mean median var distributed like image lost mists time image eaten grue ve found articles address models like d greatly appreciate hearing suggestions approaches r packages use implement approaches edit mbq s answer forced think carefully m asking got hung monthly counts lost actual focus question d like know fairly visible decline onward reflect downward trend overall number cases looks like number cases monthly reflects stable process maybe seasonality overall stable present looks like process changing overall number cases declining monthly counts wobble randomness seasonality test s real change process identify decline use trend seasonality estimate number cases upco

In [20]:
y[:10]

[['r', 'time series'],
 ['regression', 'distributions'],
 ['distributions', 'probability', 'hypothesis testing'],
 ['hypothesis testing', 'self study'],
 ['r', 'regression', 'time series'],
 ['r', 'time series', 'self study'],
 ['probability', 'hypothesis testing'],
 ['r', 'regression'],
 ['r', 'regression'],
 ['regression', 'logistic']]

In [21]:
print(len(x),len(y))

11106 11106


Using MultiLabelBinarizer to convert the tag lists into a binary indicator (multi-hot) matrix with one column per tag

In [22]:
###converting tags into binary form using MultiLabelBinarizer from sklearn
from sklearn.preprocessing import MultiLabelBinarizer
mlb=MultiLabelBinarizer()

##fit transforming the y i.e the target column
y=mlb.fit_transform(y)

In [23]:
print(y.shape)
print(y[0])
print(mlb.classes_)

(11106, 10)
[0 0 0 0 0 0 1 0 0 1]
['classification' 'distributions' 'hypothesis testing' 'logistic'
 'machine learning' 'probability' 'r' 'regression' 'self study'
 'time series']
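To read an indicator row such as the one printed above, the column order in `mlb.classes_` is what matters. The following minimal sketch uses hypothetical toy tag lists and shows how `inverse_transform` maps the rows back to tag names.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy tag lists (hypothetical), in the same format as y before binarizing.
tags = [["r", "time series"], ["regression", "logistic"], ["r", "regression"]]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(tags)

# Columns follow the sorted order stored in mlb.classes_.
print(mlb.classes_)
print(encoded)

# inverse_transform maps each indicator row back to a tuple of tag names.
decoded = mlb.inverse_transform(encoded)
print(decoded)
```

The same call on the notebook's fitted `mlb` would turn predicted rows into readable tags.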


In [24]:
##splitting the data into train and test
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.25,random_state=42)
##printing only the sizes; printing the full arrays exceeds the notebook's IOPub output rate limit
print(len(x_train),y_train.shape)
print(len(x_test),y_test.shape)



In [25]:
#using TfidfVectorizer to turn the text into numeric features
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer=TfidfVectorizer(max_features=5000,max_df=0.9)

##fit and transform x_train
x_train=vectorizer.fit_transform(x_train)


### Building the Model Using Logistic Regression

- Here the MultiOutputClassifier from the sklearn.multioutput module is used to build the model.
- The base estimator (here LogisticRegression) is passed to the MultiOutputClassifier object, which fits one copy of it per label on the training data.
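The wrapper described in the list above can be inspected on a small example. This is a minimal sketch on synthetic data: the feature matrix `X` and targets `Y` are invented for illustration and have nothing to do with the question dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Synthetic data (an assumption for illustration): 200 samples, 5 features,
# and 3 binary labels derived from the sign of the first three features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = (X[:, :3] > 0).astype(int)

clf = MultiOutputClassifier(LogisticRegression()).fit(X, Y)

# One fitted LogisticRegression per label column.
print(len(clf.estimators_))

# predict returns one 0/1 column per label.
pred = clf.predict(X[:5])
print(pred.shape)
```

Because each label gets its own independent binary classifier, correlations between labels are not modelled by this approach.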

In [27]:
##building model
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
clf=MultiOutputClassifier(LogisticRegression()).fit(x_train,y_train)

In [28]:
##transforming test data
x_test=vectorizer.transform(x_test)

In [29]:
#making predictions
y_pred=clf.predict(x_test)

In [31]:
from sklearn.metrics import accuracy_score,f1_score
print('Accuracy Score: ', accuracy_score(y_test, y_pred))
print("F1 score : ",f1_score(y_test,y_pred,average="weighted"))

Accuracy Score:  0.3161685271876125
F1 score :  0.6889058415515329


### Hamming Loss as a Metric
The accuracy score has a weakness when it is used to evaluate multilabel predictions. For multilabel targets it is a subset accuracy: a sample counts as correct only if every one of its labels is predicted exactly.
A prediction that gets nine of the ten tags right is therefore scored the same as one that gets none right, because the label combination differs. That is why our model has such a low accuracy score.

To mitigate this problem, we must evaluate the individual label predictions rather than the full label combination. For this we can rely on the Hamming Loss evaluation metric.
Hamming Loss is the fraction of wrongly predicted labels out of the total number of labels (samples × labels). Because Hamming Loss is a loss function, the lower the score, the better (0 means no label is predicted wrongly and 1 means every label is predicted wrongly).
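The gap between the two metrics is easy to see on a toy case. The sketch below uses hand-made `y_true` and `y_pred` arrays, which are assumptions for illustration and not results from the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, hamming_loss

y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],   # one of four labels wrong
                   [0, 1, 0, 0],   # exact match
                   [1, 0, 0, 1]])  # one of four labels wrong

# Subset accuracy: only the second row matches exactly, so 1 of 3 rows.
subset_acc = accuracy_score(y_true, y_pred)

# Hamming loss: 2 wrong labels out of 3 * 4 = 12 label slots.
h_loss = hamming_loss(y_true, y_pred)

print(subset_acc, h_loss)
```

Two thirds of the rows count as failures for accuracy even though only one label in six is actually wrong.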




In [33]:
from sklearn.metrics import hamming_loss
print('Hamming Loss: ', round(hamming_loss(y_test,y_pred),2))

Hamming Loss:  0.11


Our multilabel classifier has a Hamming Loss of 0.11, which means that about 11% of the individual label predictions on the test set are wrong. Note that this is an average over all samples and tags, so the error rate of any single tag may be higher or lower than 11%.
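Since the Hamming Loss is an average over labels, it can help to look at the error rate of each label separately. The following sketch does this on hand-made toy arrays (not the notebook's predictions) and checks that the mean of the per-label error rates equals the Hamming Loss.

```python
import numpy as np
from sklearn.metrics import hamming_loss

# Toy indicator matrices, invented for illustration.
y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [1, 0, 0, 1]])

# Fraction of samples on which each label column is predicted wrongly.
per_label_error = (y_true != y_pred).mean(axis=0)
print(per_label_error)

# Averaging the per-label error rates gives the Hamming Loss.
print(per_label_error.mean(), hamming_loss(y_true, y_pred))
```

Applied to `y_test` and `y_pred`, pairing `per_label_error` with `mlb.classes_` would show which tags the model struggles with most.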

### Building the Model Using Multinomial Naive Bayes

In [35]:
from sklearn.naive_bayes import MultinomialNB
clf=MultiOutputClassifier(MultinomialNB()).fit(x_train,y_train)

##making prediction
y_pred=clf.predict(x_test)

##hamming loss
print('Hamming Loss: ', round(hamming_loss(y_test,y_pred),2))

Hamming Loss:  0.14
