## Multi-Label Text Classification on Stack Overflow Tag Prediction

### Link : 
   https://kgptalkie.com/multi-label-text-classification-on-stack-overflow-tag-prediction/

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split

from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

from sklearn.multiclass import OneVsRestClassifier

In [2]:
df = pd.read_csv('https://raw.githubusercontent.com/laxmimerit/All-CSV-ML-Data-Files-Download/master/stackoverflow.csv', index_col=0)

In [3]:
df.head()

Unnamed: 0,Text,Tags
2,aspnet site maps has anyone got experience cre...,"['sql', 'asp.net']"
4,adding scripting functionality to net applicat...,"['c#', '.net']"
5,should i use nested classes in this case i am ...,['c++']
6,homegrown consumption of web services i have b...,['.net']
8,automatically update version number i would li...,['c#']


In [4]:
type(df['Tags'].iloc[0])

str

In [5]:
df['Tags'].iloc[0]

"['sql', 'asp.net']"

The ast module helps Python applications process trees of the Python abstract syntax grammar and find out programmatically what the current grammar looks like.

ast.literal_eval safely evaluates an expression node or a string containing a Python literal. The string or node may only consist of the following Python literal structures: strings, bytes, numbers, tuples, lists, dicts, sets, booleans, and None.

Here each entry in Tags is a string holding a list of string literals, so ast.literal_eval turns it back into a real Python list of labels for the target variable.

In [6]:
import ast
ast.literal_eval(df['Tags'].iloc[0])

['sql', 'asp.net']
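The "safely" part can be illustrated with a minimal sketch: a literal string is parsed, while a string containing a function call is rejected rather than executed. The variable names raw, tags and rejected are made up for this illustration.

```python
import ast

# A stringified list, in the same shape as the entries of the Tags column
raw = "['sql', 'asp.net']"
tags = ast.literal_eval(raw)
print(type(tags).__name__, tags)

# Anything that is not a plain literal raises an error instead of being run
try:
    ast.literal_eval("__import__('os').getcwd()")
    rejected = False
except (ValueError, SyntaxError):
    rejected = True
print(rejected)
```

This is why ast.literal_eval is generally preferred over eval for data read from a file.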

In [7]:
df['Tags'] = df['Tags'].apply(ast.literal_eval)
df.head()

Unnamed: 0,Text,Tags
2,aspnet site maps has anyone got experience cre...,"[sql, asp.net]"
4,adding scripting functionality to net applicat...,"[c#, .net]"
5,should i use nested classes in this case i am ...,[c++]
6,homegrown consumption of web services i have b...,[.net]
8,automatically update version number i would li...,[c#]


### MultiLabel Binarize

MultiLabelBinarizer encodes multiple labels per instance. It is used when a column holds several labels per row. The input to this transformer should be an iterable of iterables, one collection of labels per sample (here, one list of tags per question). The output is a binary indicator matrix with one column per distinct label.
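A toy example, using three made-up tag lists rather than the real dataset, sketches what the encoding looks like and how it can be reversed. The names toy_tags, mlb, encoded and decoded are illustrative only.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Three hypothetical questions, each with its own list of tags
toy_tags = [['sql', 'asp.net'], ['c#'], ['c#', 'sql']]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(toy_tags)

print(mlb.classes_)   # distinct labels, sorted: one column per label
print(encoded)        # one row per question, 1 where the label applies

# inverse_transform maps indicator rows back to tuples of label names
decoded = mlb.inverse_transform(encoded)
print(decoded)
```

Note that the decoded tuples follow the sorted class order, not the order in which the tags were originally listed.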

In [8]:
y = df['Tags']
y

2          [sql, asp.net]
4              [c#, .net]
5                   [c++]
6                  [.net]
8                    [c#]
                ...      
1262668             [c++]
1262834             [c++]
1262915          [python]
1263065          [python]
1263454             [c++]
Name: Tags, Length: 48976, dtype: object

In [9]:
multilabel = MultiLabelBinarizer()
y = multilabel.fit_transform(df['Tags'])
y

array([[0, 0, 1, ..., 0, 0, 1],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

#### List down the classes

In [10]:
multilabel.classes_

array(['.net', 'android', 'asp.net', 'c', 'c#', 'c++', 'css', 'html',
       'ios', 'iphone', 'java', 'javascript', 'jquery', 'mysql',
       'objective-c', 'php', 'python', 'ruby', 'ruby-on-rails', 'sql'],
      dtype=object)

In [11]:
pd.DataFrame(y, columns=multilabel.classes_)

Unnamed: 0,.net,android,asp.net,c,c#,c++,css,html,ios,iphone,java,javascript,jquery,mysql,objective-c,php,python,ruby,ruby-on-rails,sql
0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1
1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48971,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
48972,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
48973,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0
48974,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0


### Text Vectorization

Word embedding, or word vectorization, is a methodology in NLP that maps words or phrases from a vocabulary to corresponding vectors of real numbers, which can then be used for word prediction and word similarity/semantics. The process of converting words into numbers is called vectorization.

Machine learning algorithms cannot work with raw text directly. We need to transform that text into numbers. This process is called text vectorization.

Here, text vectorization uses the bag-of-words model to represent text data as vectors when modeling text with machine learning algorithms.

We can do it in three ways using the scikit-learn library:

- Word counts with CountVectorizer: converts a collection of text documents to a matrix of token counts.
- Word frequencies with TfidfVectorizer: converts a collection of raw documents to a matrix of TF-IDF features.
- Hashing with HashingVectorizer: the main difference is that HashingVectorizer applies a hashing function to term frequency counts in each document, whereas TfidfVectorizer scales those term frequency counts by penalising terms that appear more widely across the corpus.

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents.

This is done by multiplying two metrics: the term frequency, which counts how many times a word appears in a document, and the inverse document frequency of the word across a set of documents. Multiplying these two metrics gives the TF-IDF score of a word in a document.

The higher the score, the more relevant that word is in that particular document.
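The effect of the inverse document frequency can be seen on a tiny made-up corpus. This is a sketch under the assumption of scikit-learn's default settings, where the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 is used, so a term that occurs in every document gets the minimum value of 1. The names docs, vec, idf and row are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# 'python' occurs in every document, 'pandas' only in the last one
docs = ['python error', 'python list', 'python pandas pandas']

vec = TfidfVectorizer()
weights = vec.fit_transform(docs)

# Map each term to its learned inverse document frequency
idf = {term: vec.idf_[col] for term, col in vec.vocabulary_.items()}
print(idf['python'], idf['pandas'])

# TF-IDF weights of the third document, one entry per vocabulary term
row = weights[2].toarray().ravel()
pandas_weight = row[vec.vocabulary_['pandas']]
python_weight = row[vec.vocabulary_['python']]
print(pandas_weight > python_weight)
```

In the third document the rare, repeated word ends up with the larger weight, which is the behaviour described above.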

In [12]:
tfidf = TfidfVectorizer(analyzer='word', max_features=10000, ngram_range=(1,3), stop_words='english')
X = tfidf.fit_transform(df['Text'])
X.shape, y.shape

((48976, 10000), (48976, 20))
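For comparison, the three vectorizers described earlier can be run side by side on a toy corpus. This is only a sketch: the documents are made up, and n_features=16 is an arbitrarily small value chosen to make the fixed output width of HashingVectorizer visible.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

docs = ['python list sort', 'java list sort', 'python pandas dataframe']

# Both of these learn a vocabulary, so the column count equals its size
counts = CountVectorizer().fit_transform(docs)
tfidf_toy = TfidfVectorizer().fit_transform(docs)

# HashingVectorizer is stateless: no vocabulary, fixed number of columns
hashed = HashingVectorizer(n_features=16).transform(docs)

print(counts.shape, tfidf_toy.shape, hashed.shape)
```

Because the hashing variant stores no vocabulary, columns cannot be mapped back to words, and distinct words may collide in the same column when n_features is small.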

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

## Build Model

In [14]:
sgd = SGDClassifier()
lr = LogisticRegression(solver='lbfgs')
svc = LinearSVC()

#### Metrics for Multi-label classification

Multi-label classification problems must be assessed using different performance measures than single-label classification problems. Two of the most common performance metrics are hamming loss and Jaccard similarity.

Hamming loss is the average fraction of incorrect labels. Note that hamming loss is a loss function and that the perfect score is 0.

Jaccard similarity, or the Jaccard index, is the size of the intersection of the predicted labels and the true labels divided by the size of the union of the predicted and true labels. It ranges from 0 to 1, and 1 is the perfect score.

Both Hamming loss and Jaccard similarity can be expressed in terms of true/false positive/negative counts: per sample, Jaccard similarity is TP / (TP + FP + FN), and Hamming loss is (FP + FN) divided by the number of labels.
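A small worked example may make both metrics concrete. The two indicator arrays below are made up for illustration, and the scores are computed with scikit-learn's hamming_loss and jaccard_score (using average='samples' to average the index over rows).

```python
import numpy as np
from sklearn.metrics import hamming_loss, jaccard_score

# Two samples, four possible labels
y_true_toy = np.array([[1, 0, 1, 0],
                       [0, 1, 0, 0]])
y_pred_toy = np.array([[1, 0, 0, 0],
                       [0, 1, 1, 0]])

# One wrong label per row: 2 mismatches out of 8 label slots
h = hamming_loss(y_true_toy, y_pred_toy)

# Each row has intersection 1 and union 2, so the per-row index is 0.5
j = jaccard_score(y_true_toy, y_pred_toy, average='samples')

print(h, j)
```

Here the Hamming loss is 0.25 and the sample-averaged Jaccard score is 0.5, even though both rows contain exactly one mistake: Hamming loss counts all label slots, while Jaccard ignores labels that are absent from both sets.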

In [15]:
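def j_score(y_true, y_pred):
    # Per-sample Jaccard index: |intersection| / |union| of true and predicted label sets
    jaccard = np.minimum(y_true, y_pred).sum(axis=1) / np.maximum(y_true, y_pred).sum(axis=1)
    # Average over samples, expressed as a percentage
    return jaccard.mean() * 100


def print_score(y_pred, clf):
    # Relies on the global y_test created by train_test_split above
    print("Clf: ", clf.__class__.__name__)
    print('Jacard score: {}'.format(j_score(y_test, y_pred)))
    print('----'*4)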
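One edge case in j_score is a row with neither true nor predicted labels: the union is empty and the division is 0/0. A possible variant, sketched here under the assumption that such rows should count as 0 (a choice, not the only option), avoids the resulting NaN. The name j_score_safe is hypothetical.

```python
import numpy as np

def j_score_safe(y_true, y_pred):
    """Mean per-sample Jaccard index in percent; rows with an empty union count as 0."""
    intersection = np.minimum(y_true, y_pred).sum(axis=1)
    union = np.maximum(y_true, y_pred).sum(axis=1)
    # Divide only where the union is non-empty; other rows keep the 0 from `out`
    per_row = np.divide(intersection, union,
                        out=np.zeros(len(union), dtype=float),
                        where=union > 0)
    return per_row.mean() * 100

# The third row has no true and no predicted labels
y_true_toy = np.array([[1, 0, 1], [0, 1, 0], [0, 0, 0]])
y_pred_toy = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 0]])
score = j_score_safe(y_true_toy, y_pred_toy)
print(score)
```

On rows where the union is non-empty, the result is the same as in j_score.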

### OneVsRest Classifier

The one-vs-rest strategy is used for multiclass or multilabel classification and consists of fitting one classifier per class. For each classifier, the class is fitted against all the other classes, so the multiclass/multilabel problem is broken down into multiple binary classification problems.

In the multiclass case, a binary classifier is trained on each binary problem and the prediction comes from the model that is most confident. In the multilabel case used here, each binary classifier decides independently whether its label applies, so a sample can receive several labels or none.

The scikit-learn library provides a separate OneVsRestClassifier class that allows the one-vs-rest strategy to be used with any classifier. It can wrap a binary classifier such as LinearSVC, SGDClassifier or LogisticRegression for multi-label classification, or even classifiers that natively support multi-label classification.
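The "one classifier per class" idea can be checked on synthetic data. In this sketch the features and the three label columns are randomly generated stand-ins, not the Stack Overflow data, and the names X_toy, Y_toy and ovr are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
X_toy = rng.rand(40, 5)
# Three binary label columns, each derived from one feature
Y_toy = (X_toy[:, :3] > 0.5).astype(int)

ovr = OneVsRestClassifier(LogisticRegression())
ovr.fit(X_toy, Y_toy)

# One fitted binary estimator per label column
print(len(ovr.estimators_))

# Predictions come back as an indicator matrix: rows may hold 0, 1 or several ones
pred = ovr.predict(X_toy[:2])
print(pred.shape)
```

Given a 2-D indicator matrix as the target, OneVsRestClassifier treats the task as multilabel, which is the situation in this notebook.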

In [None]:
for classifier in [LinearSVC(C=1.5, penalty = 'l1', dual=False)]:
    clf = OneVsRestClassifier(classifier)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print_score(y_pred, classifier)

In [16]:
for classifier in [sgd, lr, svc]:
    clf = OneVsRestClassifier(classifier)
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print_score(y_pred, classifier)

Clf:  SGDClassifier
Jacard score: 52.515822784810126
----------------
Clf:  LogisticRegression
Jacard score: 51.1014699877501
----------------
Clf:  LinearSVC
Jacard score: 62.42105621342044
----------------


### Model Test with Real Data

In [17]:
x = [ 'how to write ml code in python and java i have data but do not know how to do it']

In [18]:
xt = tfidf.transform(x)
clf.predict(xt)

array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0]])

In [19]:
multilabel.inverse_transform(clf.predict(xt))

[('java', 'python')]

In [23]:
xx = ['TensorFlow not found using pip']
xt = tfidf.transform(xx)
clf.predict(xt)
multilabel.inverse_transform(clf.predict(xt))

[('python',)]
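The transform, predict and inverse_transform steps used in the cells above can be wrapped in a small helper. This is a sketch: predict_tags is a hypothetical name, and the toy corpus and the toy_ objects below only stand in for the fitted tfidf, clf and multilabel objects so that the block runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

def predict_tags(texts, vectorizer, model, binarizer):
    """Return one tuple of tag names per input text."""
    features = vectorizer.transform(texts)
    return binarizer.inverse_transform(model.predict(features))

# Toy stand-ins for the fitted objects from the notebook
texts = ['python pandas dataframe', 'java spring bean',
         'python list sort', 'java jvm heap']
tags = [['python'], ['java'], ['python'], ['java']]

toy_tfidf = TfidfVectorizer()
toy_mlb = MultiLabelBinarizer()
toy_clf = OneVsRestClassifier(LinearSVC())
toy_clf.fit(toy_tfidf.fit_transform(texts), toy_mlb.fit_transform(tags))

result = predict_tags(['python dataframe sort'], toy_tfidf, toy_clf, toy_mlb)
print(result)
```

In the notebook itself the call would be predict_tags(x, tfidf, clf, multilabel). After the training loop, clf refers to the last model fitted, which is the LinearSVC wrapper.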

## Conclusion:

- First, we loaded the pre-processed text dataset into a pandas DataFrame, evaluated the string tags using the ast module, and encoded the tags using MultiLabelBinarizer.

- Thereafter, we vectorized the question text using TfidfVectorizer. Finally, we fit classifiers such as LinearSVC, SGDClassifier and LogisticRegression for multi-label classification and predicted the output on real data.

- For better accuracy and prediction, multi-label text classification may also be modeled using RNNs, LSTMs, bidirectional LSTMs, etc.

In [21]:
df.to_csv('stackoverflow_questions_and_tags.csv', index = False)
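One thing to keep in mind when saving: CSV has no list type, so the Tags lists are written back as strings, just as in the original file. A possible way to restore them on load, sketched here with a two-row made-up frame and an in-memory buffer instead of a real file, is to pass ast.literal_eval as a converter to pd.read_csv.

```python
import ast
import io

import pandas as pd

# Made-up frame with the same column layout as df
toy = pd.DataFrame({
    'Text': ['aspnet site maps', 'nested classes'],
    'Tags': [['sql', 'asp.net'], ['c++']],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# The converter parses each Tags cell back into a Python list while reading
restored = pd.read_csv(buffer, converters={'Tags': ast.literal_eval})
print(restored['Tags'].tolist())
```

With the real file, the buffer would be replaced by the path 'stackoverflow_questions_and_tags.csv'.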