## Solutions for the Multi-Label Problem
Methods for solving multi-label classification problems:
- Problem Transformation
- Adapted Algorithm
- Ensemble Approaches

## Problem Transformation
Problem transformation converts the multi-label problem into one or more single-label problems, using one of the following:
- Binary Relevance: treats each label as a separate binary classification problem and trains one classifier per label.
- Classifier Chains: the first classifier is trained on the input data alone; each subsequent classifier is trained on the input data plus the outputs of all previous classifiers in the chain.
- Label Powerset: transforms the problem into a single multi-class problem, in which one multi-class classifier is trained on all unique label combinations found in the training data.
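
To make the three transformations concrete, here is a minimal sketch in plain Python on a made-up label matrix. The helper names (`binary_relevance_targets`, `classifier_chain_inputs`, `label_powerset_targets`) are hypothetical and only show how the data is reshaped; they are not how skmultilearn implements these methods internally.

```python
# Toy data: 4 samples, 2 features, 3 labels (values are made up for illustration).
X = [[0.1, 0.9], [0.8, 0.2], [0.2, 0.7], [0.5, 0.5]]
Y = [[1, 0, 1], [0, 0, 0], [1, 0, 1], [0, 1, 0]]


def binary_relevance_targets(Y):
    """One binary target vector per label (one column of Y each)."""
    return [list(column) for column in zip(*Y)]


def classifier_chain_inputs(X, Y, position):
    """Inputs for the classifier at `position` in the chain: the original
    features plus the labels of all earlier positions (true labels are
    assumed here, as would be the case at training time)."""
    return [list(x) + list(y[:position]) for x, y in zip(X, Y)]


def label_powerset_targets(Y):
    """Map every distinct label combination to one multi-class id."""
    combo_to_class = {}
    targets = []
    for row in Y:
        combo = tuple(row)
        if combo not in combo_to_class:
            combo_to_class[combo] = len(combo_to_class)
        targets.append(combo_to_class[combo])
    return targets, combo_to_class


br_targets = binary_relevance_targets(Y)
chain_inputs = classifier_chain_inputs(X, Y, position=2)
lp_targets, combo_to_class = label_powerset_targets(Y)

print(br_targets)       # three binary problems, one per label
print(chain_inputs[0])  # features of sample 0 plus its first two labels
print(lp_targets)       # one multi-class target per sample
```

Binary Relevance ignores label correlations, Classifier Chains pass them along the chain, and Label Powerset captures them fully at the cost of one class per observed combination.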

## Adapted Algorithm
This approach adapts the algorithm itself to perform multi-label classification directly, rather than transforming the problem into a set of single-label problems.
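
As an illustration of the idea, the sketch below adapts k-nearest neighbours so that a single model predicts every label at once by majority vote among the neighbours. The function `knn_multilabel_predict` and the toy data are hypothetical; this is a simplified variant for illustration, not the MLkNN algorithm that scikit-multilearn provides in `skmultilearn.adapt`.

```python
def knn_multilabel_predict(X_train, Y_train, x, k=3):
    """Predict all labels of x at once by majority vote among its k nearest neighbours."""
    # Squared Euclidean distance from x to every training sample.
    distances = [
        (sum((a - b) ** 2 for a, b in zip(row, x)), index)
        for index, row in enumerate(X_train)
    ]
    nearest = [index for _, index in sorted(distances)[:k]]
    n_labels = len(Y_train[0])
    # A label is switched on when more than half of the neighbours carry it.
    return [
        1 if sum(Y_train[i][j] for i in nearest) * 2 > k else 0
        for j in range(n_labels)
    ]


# Toy data (made up): 5 samples, 2 features, 3 labels.
X_train = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]
Y_train = [[1, 0, 1], [1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 1, 0]]

prediction = knn_multilabel_predict(X_train, Y_train, [0.05, 0.1], k=3)
print(prediction)
```

No label-wise or combination-wise copies of the problem are created: the same neighbour search serves all labels.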

In [3]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB,MultinomialNB
from sklearn.metrics import accuracy_score,hamming_loss,classification_report

In [36]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import word_tokenize
from sklearn import model_selection
from sklearn import metrics

from skmultilearn.problem_transform import BinaryRelevance
from skmultilearn.problem_transform import ClassifierChain
from skmultilearn.problem_transform import LabelPowerset

In [7]:
data = pd.read_csv('csv/train_02_balanced.csv')
data.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0002bcb3da6cb337,cocksuck piss around work,1,1,1,0,1,0
1,00054a5e18b50dd4,bbq man let discuss maybe phone?,0,0,0,0,0,0
2,0005c987bdfc9d4b,hey @ talk exclusive group wp taliban good des...,1,0,0,0,0,0
3,0007e25b2121310b,"bye! look, come think com back! tosser",1,0,0,0,0,0
4,001735f961a23fc4,""" sure, lead must briefly summarize armenia hi...",0,0,0,0,0,0


In [9]:
data.shape

(31503, 8)

In [13]:
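# Raw comment text as a unicode array (not used by the split below)
data_text = data['comment_text'].values.astype('U')
# X still contains the label columns; only X['comment_text'] is vectorized below
X = data.drop(labels=['id'], axis=1)
# y holds the six binary label columns
y = data.drop(labels=['id', 'comment_text'], axis=1)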

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.33, random_state=42)
vectorizer = TfidfVectorizer(strip_accents='unicode', tokenizer=word_tokenize, analyzer='word', ngram_range=(1,3), norm='l2', max_features = 10000)

X_train_vec = vectorizer.fit_transform(X_train['comment_text'])
X_test_vec = vectorizer.transform(X_test['comment_text'])

In [25]:
from skmultilearn.ensemble import LabelSpacePartitioningClassifier
from sklearn.ensemble import RandomForestClassifier

In [28]:
binary_rel_clf = BinaryRelevance(MultinomialNB())

In [29]:
binary_rel_clf.fit(X_train_vec,y_train)


BinaryRelevance(classifier=MultinomialNB(), require_dense=[True, True])

In [43]:
br_prediction = binary_rel_clf.predict(X_test_vec)

In [44]:
#br_prediction.toarray()

In [45]:
accuracy_score(y_test,br_prediction)

0.6060023085802232

In [46]:
hamming_loss(y_test,br_prediction)

0.09447543927151468
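
The two numbers above measure different things. For multi-label input, `accuracy_score` computes subset accuracy, where a sample counts as correct only if every label matches, while `hamming_loss` is the fraction of individual label slots that are wrong. The sketch below recomputes both by hand on made-up predictions; the helper names `subset_accuracy` and `hamming` are hypothetical.

```python
def subset_accuracy(y_true, y_pred):
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


def hamming(y_true, y_pred):
    """Fraction of individual label slots that are predicted wrongly."""
    wrong = sum(ti != pi for t, p in zip(y_true, y_pred) for ti, pi in zip(t, p))
    total = len(y_true) * len(y_true[0])
    return wrong / total


# Made-up example: 4 samples, 3 labels; samples 0 and 3 each miss one label.
y_true = [[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 0]]

toy_accuracy = subset_accuracy(y_true, y_pred)  # 2 of 4 samples fully correct
toy_hamming = hamming(y_true, y_pred)           # 2 of 12 label slots wrong
print(toy_accuracy, toy_hamming)
```

This is why a model can show a modest subset accuracy and a low Hamming loss at the same time: one wrong label fails the whole sample for accuracy but costs only one slot for Hamming loss.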

In [54]:
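def build_model(model, mlb_estimator, xtrain, ytrain, xtest, ytest):
    """Wrap `model` in a multi-label estimator, fit it, and score it on the test set."""
    # Create an instance of the problem-transformation wrapper around the base model
    clf = mlb_estimator(model)
    clf.fit(xtrain, ytrain)
    # Predict
    clf_predictions = clf.predict(xtest)
    # Subset accuracy: a sample counts as correct only if every label matches
    acc = accuracy_score(ytest, clf_predictions)
    # Hamming loss (lower is better), stored under the key "hamming_score"
    ham = hamming_loss(ytest, clf_predictions)
    result = {"accuracy:": acc, "hamming_score": ham}
    return result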

In [55]:
clf_chain_model = build_model(MultinomialNB(),ClassifierChain,X_train_vec,y_train,X_test_vec,y_test)


In [56]:
clf_chain_model

{'accuracy:': 0.5986918045402078, 'hamming_score': 0.09981403103757856}

In [60]:
clf_labelP_model = build_model(MultinomialNB(),LabelPowerset,X_train_vec,y_train,X_test_vec,y_test)


In [58]:
clf_labelP_model

{'accuracy:': 0.6079261254328588, 'hamming_score': 0.10159356162626651}

In [64]:
lp_classifier = LabelPowerset(LogisticRegression(max_iter=10000))
lp_classifier.fit(X_train_vec, y_train)
lp_predictions = lp_classifier.predict(X_test_vec)
print("Accuracy = ",accuracy_score(y_test,lp_predictions))
print("F1 score = ",metrics.f1_score(y_test,lp_predictions, average="micro"))
print("Hamming loss = ",hamming_loss(y_test,lp_predictions))

Accuracy =  0.6293766833397461
F1 score =  0.7070272485146485
Hamming loss =  0.09170193664229832
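
With `average="micro"`, the F1 score pools true positives, false positives, and false negatives over all labels and samples before computing a single score. The sketch below shows that calculation by hand on made-up predictions; `micro_f1` is a hypothetical helper, not a replacement for `metrics.f1_score`.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool the counts over every label slot, then score once."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    # Return 0.0 when there are no positive labels or predictions at all.
    return 2 * tp / denominator if denominator else 0.0


# Made-up example: 3 true positives, 1 false positive, 2 false negatives.
y_true = [[1, 0, 1], [0, 0, 0], [1, 1, 0], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 0]]

toy_f1 = micro_f1(y_true, y_pred)  # 2*3 / (2*3 + 1 + 2)
print(toy_f1)
```

Because the counts are pooled, frequent labels influence the micro-averaged score more than rare ones.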
