### Multi-label Classification problem

##### Requirements:
scikit-learn==0.18.2<br>
pandas==0.23.3<br>
numpy==1.13.1<br>
nltk==3.3<br>
matplotlib==2.0.2<br>

For this multi-label classification problem, I have concatenated the intent and class labels and trained the model. 

The maximum accuracy achieved is 0.664221218962. 

Also in the last cell of this notebook, you can see the Classification report with Precision and recall per class and intent for the best model.

I have documented below the accuracies and the corresponding models and methods used.

I have noticed that pre-processing and MultiLabelBinarizer improved the Classifier performance.

### Models trained

In [None]:
'''
- BinaryRelevance(RandomForestClassifier()) 
Accuracy =  0.348758465011
Hamming Loss =  0.021231739000809234

- LabelPowerset(RandomForestClassifier())
Accuracy =  0.654063205418
Hamming Loss =  0.02340389284041058

- ClassifierChain(RandomForestClassifier())
Accuracy =  0.437358916479
Hamming Loss =  0.02069934835384812

**********************************************************************************************************************************
- Best Accuracy achieved:

- SVM + MultiLabelBinarizer + Preprocessing + no stop words removal

model: model_20180803_153406.pkl

Grid search chosen parameters are:  {'tfidf__use_idf': True, 'vect__ngram_range': (1, 2), 'clf-svm__estimator__C': 10}
Accuracy with Grid search parameters:  0.5988323603
Accuracy (test):  0.664221218962
Hamming Loss =  0.012021380808381959

'''

In [1]:
import csv
import pandas as pd
import numpy as np


intent = []
label = []
ques = []
label_intent = []

def load_data(train_data_path):
    
    '''Import data'''

    '''Number of samples = 15452'''
    
    global intent
    global label
    global ques
    global label_intent

    data_array = []

    with open(data_path, 'r') as csvfile:
        data_reader = csv.reader(csvfile)
        for row in data_reader:
#             print('1:', row)
            data_array.append(row[0])

    '''Structure the imported data'''
    
    for i,d in enumerate(data_array):
        temp = d.split(':')
        intent.append(temp[0])
        lis = temp[1].split(' ')
        label.append(lis[0])
        ques.append(' '.join(lis[1:]))
        label_intent.append([temp[0], lis[0]])
        
data_path = 'E:/aaaML Projects/data/aisera_dataset/training.data'

load_data(data_path)


In [4]:
'''Remove redundant data'''

questions = []
intent_set = []
label_set = []
label_intent_set = []

for i in range(len(ques)):
    if ques[i] not in questions:
        questions.append(ques[i])
        intent_set.append(intent[i])
        label_set.append(label[i])
        label_intent_set.append(label_intent[i])
    

### Analyse data and labels

In [5]:
df = pd.DataFrame({'intent':intent,'label':label})
print(df.head())

  intent   label
0   DESC  manner
1   ENTY  cremat
2   DESC  manner
3   ENTY  animal
4   ABBR     exp


In [6]:
p = pd.crosstab(df['intent'],df['label'])
p

label,abb,animal,body,city,code,color,count,country,cremat,currency,...,substance,symbol,techmeth,temp,termeq,title,veh,volsize,weight,word
intent,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
ABBR,46,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
DESC,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
ENTY,0,365,54,0,0,119,0,0,595,6,...,124,31,111,0,271,0,68,0,0,71
HUM,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,67,0,0,0,0
LOC,0,0,0,376,0,0,0,425,0,0,...,0,0,0,0,0,0,0,0,0,0
NUM,0,0,0,0,22,0,985,0,0,0,...,0,0,0,15,0,0,0,32,23,0


### Encode class labels

In [6]:
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder

'''To encode the target label and intent: Method 1'''

encoder = LabelEncoder()

temp = np.array(label_set)
label_encode = to_categorical(encoder.fit_transform(temp.astype(str)))

temp = np.array(intent_set)
intent_encode = to_categorical(encoder.fit_transform(temp.astype(str)))

y = np.concatenate((intent_encode, label_encode), axis=1) # shape: samples x labels

Using Theano backend.


In [32]:
from sklearn.preprocessing import MultiLabelBinarizer

'''To encode the target label and intent: Method 2 is better than Method 1'''

multilabel_binarizer = MultiLabelBinarizer(sparse_output = True)
y = multilabel_binarizer.fit_transform(label_intent_set)
classes_list = multilabel_binarizer.classes_

### Pre-process the input text features

In [9]:
import string
from datetime import datetime
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk import word_tokenize
from skmultilearn.problem_transform import BinaryRelevance
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC, SVC
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt
from sklearn.externals import joblib

%matplotlib inline
plt.style.use('ggplot')

def pre_process_data(words):
    ''' Pre process text data: Does Lemmatization, strips case, punctuation, converts number to words and stopwords'''
    
    word_tokens = [word.lower() for word in word_tokenize(words)]
    
    # To lower case
    word_lower = [word.lower() for word in word_tokens] 
    
    # Remove punctuations
    table = str.maketrans('','', string.punctuation)
    word_nopunct = [w.translate(table) for w in word_lower]
    
    # Remove non-alphabetic tokens
    word_list = [word for word in word_nopunct if word.isalpha()] 
    
    # Remove stopwords
#     stopword_list = set(stopwords.words('english'))
#     word_stopw = [w for w in word_list if not w in stopword_list]
    
    # Lemmatization
    wnl = WordNetLemmatizer()
    word_final = [wnl.lemmatize(w) for w in word_list]
    
    return word_final

'''Pre-process training dataset for X''' 

X = [] 
for q in questions:
    X.append(' '.join(pre_process_data(q)))
    
cv = CountVectorizer()
X_cv = cv.fit_transform(X)

tf_idf = TfidfTransformer()
X_tf = tf_idf.fit_transform(X_cv)


### Train the model

In [33]:
def Grid_search(svm_pipeline, svm_parameters, X_train, y_train):
    
    ''' Performs Grid search to find optimal values for model parameters'''
    
    print ('Doing Grid search...')
    gs_svm = GridSearchCV(svm_pipeline, param_grid=svm_parameters, n_jobs=-1)
    gs_svm = gs_svm.fit(X_train, y_train)
    print ("Grid search chosen parameters are: ", gs_svm.best_params_)
    print ("Accuracy with Grid search parameters: ", gs_svm.best_score_)
    return gs_svm

def train_model(X,y):
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

    text_svmpipeline = Pipeline([('vect', CountVectorizer()),
                                 ('tfidf', TfidfTransformer()),
                                 ('clf-svm', OneVsRestClassifier(LinearSVC()))])

    svm_parameters = [{'vect__ngram_range': [(1, 1), (1, 2), (1,3)], 'tfidf__use_idf': (True, False),'clf-svm__estimator__C': [0.001, 0.01, 0.1, 1, 10]}]

    svm_model = Grid_search(text_svmpipeline, svm_parameters, X_train, y_train)

    # train
    svm_model.fit(X_train, y_train)

    # predict
    result = svm_model.score(X_test, y_test)
    print('Accuracy (test): ', result)
    return X_test, y_test, svm_model 

def save_model(model_object):
    ''' Saves the trained model with timestamp'''
    
    try:
        joblib.dump(model_object, 'model_{}.pkl'.format(datetime.now().strftime("%Y%m%d_%H%M%S")))
        print ("Model saved as Pickle file successfully:", 'model_{}.pkl'.format(datetime.now().strftime("%Y%m%d_%H%M%S")))
    except Exception as error:
        print ("Error saving model")
        print (error)

def load_model(model_path):    
    ''' Loads the trained and saved model'''
    
    loaded_model = joblib.load(model_path)
    return loaded_model

def predict_intent(question):
    '''To predict the classes of a question'''
    
#     svm_model = load_model(model_path)
    
    p = svm_model.predict([question]) # Predict the classes
    _, classes = p.nonzero() # unpack the sparse matrix
    
    return classes_list[classes[0]], classes_list[classes[1]]
 
X_test, y_test,svm_model = train_model(X,y)
save_model(svm_model)

t, l = predict_intent('Who is Lakers?')
print('Intent: ', t, '\nLabel: ', l)


Doing Grid search...
Grid search chosen parameters are:  {'tfidf__use_idf': True, 'vect__ngram_range': (1, 2), 'clf-svm__estimator__C': 10}
Accuracy with Grid search parameters:  0.5988323603
Accuracy (test):  0.664221218962
Model saved as Pickle file successfully: model_20180803_153406.pkl
Intent:  HUM 
Label:  desc


### Evaluate the model

In [36]:
from sklearn import metrics

y_pred = svm_model.predict(X_test)

print("Accuracy = ", metrics.accuracy_score(y_test,y_pred))

print("Hamming Loss = ", metrics.hamming_loss(y_test, y_pred))

print('Classification Report: ')
print(classification_report(y_test,y_pred, target_names = classes_list))

Accuracy =  0.664221218962
Hamming Loss =  0.012021380808381959
Classification Report: 
             precision    recall  f1-score   support

       ABBR       0.89      0.71      0.79        35
       DESC       0.87      0.87      0.87       387
       ENTY       0.83      0.72      0.77       400
        HUM       0.93      0.82      0.87       390
        LOC       0.94      0.83      0.88       281
        NUM       0.97      0.90      0.93       279
        abb       1.00      0.43      0.60         7
     animal       0.89      0.46      0.60        35
       body       0.00      0.00      0.00         4
       city       0.82      0.80      0.81        40
       code       0.00      0.00      0.00         3
      color       1.00      0.87      0.93        23
      count       0.98      0.96      0.97       114
    country       0.89      0.98      0.93        48
     cremat       0.98      0.61      0.75        76
   currency       0.00      0.00      0.00         0
       dat

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [37]:
from sklearn.metrics import coverage_error
from sklearn.metrics import label_ranking_average_precision_score
from sklearn.metrics import label_ranking_loss
import pickle 

# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

def load_model(model_path):    
    ''' Loads the trained and saved model'''
    
    pkl_file = open('class_list.pkl', 'rb')
    classes_list = pickle.load(pkl_file)
    pkl_file.close()
    
    loaded_model = joblib.load(model_path)
    return classes_list, loaded_model

# classes_list, svm_model = load_model('model_20180801_222822.pkl')
y_pred = svm_model.predict(X_test)

print('Coverage error', coverage_error(y_test.todense(), y_pred.todense()))
print('Label ranking loss', label_ranking_loss(y_test, y_pred.todense()))
print('Label ranking average precision score', label_ranking_average_precision_score(y_test.todense(), y_pred.todense()))

Coverage error 17.1015801354
Label ranking loss 0.228571902802
Label ranking average precision score 0.756267125232


##### Other models used

In [29]:
from skmultilearn.problem_transform import ClassifierChain
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

X_train, X_test, y_train, y_test = train_test_split(X_tf, y, test_size=0.33)

classifier = ClassifierChain(RandomForestClassifier())

classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)

print("Accuracy = ", metrics.accuracy_score(y_test,predictions))

print("Hamming Loss = ", metrics.hamming_loss(y_test, predictions))

print('Classification Report: ')
print(classification_report(y_test,predictions, target_names = classes_list))

Accuracy =  0.437358916479
Hamming Loss =  0.02069934835384812
Classification Report: 
             precision    recall  f1-score   support

       ABBR       0.79      0.41      0.54        27
       DESC       0.88      0.54      0.67       388
       ENTY       0.78      0.48      0.60       401
        HUM       0.92      0.64      0.76       402
        LOC       0.85      0.76      0.80       259
        NUM       0.90      0.77      0.83       295
        abb       0.00      0.00      0.00         7
     animal       0.89      0.23      0.36        35
       body       0.00      0.00      0.00         7
       city       0.92      0.27      0.42        44
       code       0.00      0.00      0.00         0
      color       1.00      0.08      0.14        13
      count       0.99      0.87      0.92       120
    country       0.88      0.68      0.77        44
     cremat       0.74      0.20      0.32        69
   currency       0.00      0.00      0.00         2
       date

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [30]:
from skmultilearn.problem_transform import LabelPowerset

# X_train, X_test, y_train, y_test = train_test_split(X_tf, y, test_size=0.33)

classifier = LabelPowerset(RandomForestClassifier())

classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)

print("Accuracy = ", metrics.accuracy_score(y_test,predictions))

print("Hamming Loss = ", metrics.hamming_loss(y_test, predictions))

print('Classification Report: ')
print(classification_report(y_test,predictions, target_names = classes_list))

Accuracy =  0.654063205418
Hamming Loss =  0.02340389284041058
Classification Report: 
             precision    recall  f1-score   support

       ABBR       0.71      0.44      0.55        27
       DESC       0.71      0.85      0.77       388
       ENTY       0.69      0.48      0.56       401
        HUM       0.65      0.86      0.74       402
        LOC       0.73      0.80      0.77       259
        NUM       0.94      0.65      0.77       295
        abb       1.00      0.14      0.25         7
     animal       0.59      0.37      0.46        35
       body       1.00      0.14      0.25         7
       city       0.90      0.84      0.87        44
       code       0.00      0.00      0.00         0
      color       1.00      1.00      1.00        13
      count       0.90      0.92      0.91       120
    country       0.88      0.95      0.91        44
     cremat       0.55      0.59      0.57        69
   currency       1.00      0.50      0.67         2
       date

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [31]:
from skmultilearn.problem_transform import BinaryRelevance

# X_train, X_test, y_train, y_test = train_test_split(X_tf, y, test_size=0.33)

classifier = BinaryRelevance(RandomForestClassifier())

classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)

print("Accuracy = ", metrics.accuracy_score(y_test,predictions))

print("Hamming Loss = ", metrics.hamming_loss(y_test, predictions))

print('Classification Report: ')
print(classification_report(y_test,predictions, target_names = classes_list))

Accuracy =  0.348758465011
Hamming Loss =  0.021231739000809234
Classification Report: 
             precision    recall  f1-score   support

       ABBR       0.75      0.44      0.56        27
       DESC       0.89      0.60      0.72       388
       ENTY       0.90      0.41      0.57       401
        HUM       0.93      0.58      0.71       402
        LOC       0.91      0.69      0.78       259
        NUM       0.98      0.68      0.81       295
        abb       1.00      0.14      0.25         7
     animal       0.83      0.14      0.24        35
       body       0.00      0.00      0.00         7
       city       1.00      0.25      0.40        44
       code       0.00      0.00      0.00         0
      color       0.00      0.00      0.00        13
      count       0.98      0.88      0.93       120
    country       0.88      0.52      0.66        44
     cremat       0.80      0.17      0.29        69
   currency       0.00      0.00      0.00         2
       dat

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


#### Fast Text supervised training

In [34]:
open('d_train_fasttext.txt', "w", encoding='utf-8').close()

output = open('d_train_fasttext.txt', "w", encoding='utf-8')

for i in range(len(questions[:3000])):

    s = '__label__' + intent_set[i] + ',' + label_set[i]  + ' ' + questions[i] + "\n"
    
    output.write(s)
    
output.flush()
output.close()

In [35]:
open('d_test_fasttext.txt', "w", encoding='utf-8').close()

output = open('d_test_fasttext.txt', "w", encoding='utf-8')

for i in range(len(questions[3000:])):

    s = '__label__' + intent_set[i] + ',' + label_set[i]  + ' ' + questions[i] + "\n"
    
    output.write(s)
    
output.flush()
output.close()

In [None]:
import fastText as ft

model = ft.train_supervised('d_train_fasttext.txt', epoch=50)

result = model.test('d_test_fasttext.txt')

print (result)

# output:  Precision: 0.9662304769945125, Recall: 0.9662304769945125

In [None]:
label = model.predict('Who are the Lakers team ?')
print (label)

# output: ('__label__HUM,gr',)