## CSML1010 Group3 Course_Project - Milestone 2 - Baseline Machine Learning Implementation
#### Authors (Group3): Paul Doucet, Jerry Khidaroo
#### Project Repository: https://github.com/CSML1010-3-2020/NLPCourseProject

#### Dataset:
The dataset used in this project is the __Taskmaster-1__ dataset from Google.
[Taskmaster-1](https://research.google/tools/datasets/taskmaster-1/)

The dataset can be obtained from: https://github.com/google-research-datasets/Taskmaster

---

## Workbook Setup and Data Preparation

#### Import Libraries

In [1]:
# import pandas, numpy
import pandas as pd
import numpy as np
import re
import nltk


#### Set Some Defaults

In [2]:
# adjust pandas display
pd.options.display.max_columns = 30
pd.options.display.max_rows = 100
pd.options.display.float_format = '{:.7f}'.format
pd.options.display.precision = 7
pd.options.display.max_colwidth = None

# Import matplotlib and seaborn and adjust some defaults
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

from matplotlib import pyplot as plt
plt.rcParams['figure.dpi'] = 100

import seaborn as sns
sns.set_style("whitegrid")

import warnings
warnings.filterwarnings('ignore')

#### Load Data

In [3]:
df_all = pd.read_csv('./data/dialog_norm.csv')
df_all.columns

Index(['Instruction_id', 'category', 'selfdialog_norm'], dtype='object')

In [4]:
df_all.head(3)

Unnamed: 0,Instruction_id,category,selfdialog_norm
0,restaurant-table,0,restauranttable
1,movie-tickets-1,1,hi would like see movie men want playing yes showing would like purchase ticket yes friend two tickets please okay time moving playing today movie showing pm okay anymore movies showing around pm yes showing pm green book two men dealing racisim oh recommend anything else like well like movies funny like comedies well like action well okay train dragon playing pm okay get two tickets want cancel tickets men want yes please okay problem much cost said two adult tickets yes okay okay anything else help yes bring food theater sorry purchase food lobby okay fine thank enjoy movie
2,movie-tickets-3,2,want watch avengers endgame want watch bangkok close hotel currently staying sounds good time want watch movie oclock many tickets two use account already movie theater yes seems movie time lets watch another movie movie want watch lets watch train dragon newest one yes one dont think movie playing time either neither choices playing time want watch afraid longer interested watching movie well great day sir thank welcome


#### Remove NaN rows

In [5]:
print(df_all.shape)
df_all = df_all.dropna()
# Drop empty dialogs first, then rebuild a contiguous index
df_all = df_all[df_all.selfdialog_norm != '']
df_all = df_all.reset_index(drop=True)
print(df_all.shape)

(7706, 3)
(7706, 3)


In [6]:
print (df_all.groupby('Instruction_id').size())

Instruction_id
auto-repair-appt-1    1160
coffee-ordering       1376
movie-finder            54
movie-tickets-1        678
movie-tickets-2        377
movie-tickets-3        195
pizza-ordering        1468
restaurant-table      1198
restaurant-table-3     102
uber-lyft             1098
dtype: int64


In [7]:
#weight_higher = ['restaurant-table-2', 'movie-tickets-1', 'movie-tickets-3','uber-lift-2','coffee-ordering-1','coffee-ordering-2','pizza-ordering-2','movie-finder']
class_sample_size_dict = {
    "auto-repair-appt-1": 230,
    "coffee-ordering": 230,
    "movie-finder": 54,
    "movie-tickets-1": 250,
    "movie-tickets-2": 250,
    "movie-tickets-3": 195,
    "pizza-ordering": 230,
    "restaurant-table": 230,
    "restaurant-table-3": 101,
    "uber-lyft": 230
}
sum(class_sample_size_dict.values())

2000
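Before sampling, it can help to confirm that no requested size exceeds the rows available for that class, since `DataFrame.sample` without replacement raises an error when asked for more rows than exist. The sketch below is a plain-Python check using the counts printed above; the helper name `find_oversized_requests` is hypothetical and not part of the notebook.

```python
# Row counts per class, copied from the groupby output above.
available = {
    "auto-repair-appt-1": 1160, "coffee-ordering": 1376, "movie-finder": 54,
    "movie-tickets-1": 678, "movie-tickets-2": 377, "movie-tickets-3": 195,
    "pizza-ordering": 1468, "restaurant-table": 1198,
    "restaurant-table-3": 102, "uber-lyft": 1098,
}

# Requested sample sizes, copied from class_sample_size_dict.
requested = {
    "auto-repair-appt-1": 230, "coffee-ordering": 230, "movie-finder": 54,
    "movie-tickets-1": 250, "movie-tickets-2": 250, "movie-tickets-3": 195,
    "pizza-ordering": 230, "restaurant-table": 230,
    "restaurant-table-3": 101, "uber-lyft": 230,
}


def find_oversized_requests(available, requested):
    """Return {class: (requested, available)} for classes that cannot be sampled."""
    problems = {}
    for name, k in requested.items():
        n = available.get(name, 0)  # a class missing from the data has 0 rows
        if k > n:
            problems[name] = (k, n)
    return problems


print(find_oversized_requests(available, requested))  # {} means every request fits
```

With these numbers, `movie-finder` and `movie-tickets-3` use every available row, so those two classes are not subsampled at all.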

#### Get a Sample of records.

In [8]:
# Draw a fixed-size random sample from each class; sizes come from class_sample_size_dict
def sampling_k_elements(group):
    name = group['Instruction_id'].iloc[0]
    k = class_sample_size_dict[name]
    return group.sample(k, random_state=5)

#Get balanced samples
corpus_df = df_all.groupby('Instruction_id').apply(sampling_k_elements).reset_index(drop=True)
print (corpus_df.groupby('Instruction_id').size(), corpus_df.shape)

Instruction_id
auto-repair-appt-1    230
coffee-ordering       230
movie-finder           54
movie-tickets-1       250
movie-tickets-2       250
movie-tickets-3       195
pizza-ordering        230
restaurant-table      230
restaurant-table-3    101
uber-lyft             230
dtype: int64 (2000, 3)


In [9]:
corpus_df.groupby('Instruction_id')

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001B8A9724788>

#### Generate Corpus List

In [10]:
# Collect the normalized dialogs as a plain list of strings
doc_lst = corpus_df['selfdialog_norm'].tolist()

print(len(doc_lst))
doc_lst[1:5]

2000


['hi im issue car help sure whats problem light came saying headlight ok want get fixed right away today would ideal already know want take yes intelligent auto solutions ok let pull website online scheduler see today ok im looks like two appointments open today could minutes im least minutes away ok time would pm tonight tell able fix spot call confirm makemodel car kia soul ok said parts done appointment thats great news please book yes booked online thanks give info yes text youll phone thank big help',
 'hi schedule appointment car okay auto repair shop would like check check intelligent auto solutions car bringing lexus im driving put name cell phone number yes put jeff green cell phone number seems problem car makes sound step brakes anything else would like check like oil change maintenance yes think im due oil change well got let check online see available check bring mins able make appointment bring car time pm great thanks initial cost brake checkup oil change okay accept cre

#### Split Data into Train and Test Sets

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(doc_lst, corpus_df['category'], test_size=0.25, random_state = 0)
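The split above is purely random, so a small class such as `movie-finder` (54 rows) may end up unevenly represented in the test set. `train_test_split` accepts a `stratify` argument for this; the plain-Python sketch below only illustrates what a stratified split does, and `stratified_split` is a hypothetical helper, not the notebook's method.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=0):
    """Return (train_idx, test_idx) so each class contributes ~test_size of its rows to the test set."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        idx = by_class[label][:]
        rng.shuffle(idx)                       # shuffle within the class only
        n_test = round(len(idx) * test_size)   # per-class test quota
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


labels = ["a"] * 8 + ["b"] * 4
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

In the notebook itself, the equivalent would presumably be passing `stratify=corpus_df['category']` to `train_test_split`.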

In [12]:
import lime
import numpy as np
import sklearn
import sklearn.ensemble
import sklearn.feature_extraction.text
import sklearn.metrics

In [13]:
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

In [14]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB(alpha=.01)
nb.fit(train_vectors, y_train)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [15]:
pred = nb.predict(test_vectors)
sklearn.metrics.f1_score(y_test, pred, average='weighted')

0.8996975604838171
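The weighted F1 reported above is the per-class F1 averaged with each class weighted by its number of true instances. As a minimal sketch of that computation (a plain-Python illustration on toy labels, not a replacement for `sklearn.metrics.f1_score`):

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    support = Counter(y_true)  # number of true instances per class
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
        total += n * f1
    return total / len(y_true)


y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
print(weighted_f1(y_true, y_pred))
```

Because large classes dominate this average, a per-class breakdown is often worth checking alongside it when class sizes differ as much as they do here.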

In [16]:
from lime import lime_text
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, nb)

In [17]:
cats = set(corpus_df['category'])

In [18]:
class_names = list(cats)

In [19]:
from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=class_names)
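LIME labels its output with `class_names[i]` for class index `i`, so the list needs to follow the classifier's class order, and a `set` does not guarantee any order. The sketch below builds a name list ordered by category code from `(Instruction_id, category)` pairs; `ordered_class_names` is a hypothetical helper, and it assumes the codes run contiguously from 0 so that list position equals code.

```python
def ordered_class_names(pairs):
    """Map (name, code) pairs to a list of names ordered by code."""
    code_to_name = {}
    for name, code in pairs:
        existing = code_to_name.setdefault(code, name)
        if existing != name:
            raise ValueError("code %r maps to both %r and %r" % (code, existing, name))
    return [code_to_name[c] for c in sorted(code_to_name)]


# The three pairs visible in the df_all.head(3) output above.
pairs = [
    ("movie-tickets-3", 2),
    ("restaurant-table", 0),
    ("movie-tickets-1", 1),
    ("restaurant-table", 0),
]
print(ordered_class_names(pairs))
```

With the notebook's frame, the pairs could come from `zip(corpus_df['Instruction_id'], corpus_df['category'])`, which would give readable names in the explanations instead of integer codes.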

In [20]:
idx = 3
exp = explainer.explain_instance(X_test[idx], c.predict_proba, num_features=6, labels=[0, 9])
print('Document id: %d' % idx)
print('Predicted class =', class_names[nb.predict(test_vectors[idx]).reshape(1,-1)[0,0]])
#print('True class: %s' % class_names[y_test[idx]])

TypeError: object of type 'NoneType' has no len()

In [21]:
print('Explanation for class %s' % class_names[0])
print('\n'.join(map(str, exp.as_list(label=0))))
print()
# Second requested label (labels=[0, 9] above)
print('Explanation for class %s' % class_names[9])
print('\n'.join(map(str, exp.as_list(label=9))))

Explanation for class 0


NameError: name 'exp' is not defined

In [22]:
exp = explainer.explain_instance(X_test[idx], c.predict_proba, num_features=6, top_labels=2)
print(exp.available_labels())

TypeError: object of type 'NoneType' has no len()

In [23]:
exp.show_in_notebook(text=False)

NameError: name 'exp' is not defined

In [24]:
exp.show_in_notebook(text=X_test[idx])

NameError: name 'exp' is not defined

#### Build Vocabulary

In [25]:
from keras.preprocessing import text
from keras.utils import np_utils
from keras.preprocessing import sequence

tokenizer = text.Tokenizer(lower=False)
tokenizer.fit_on_texts(X_train)
word2id = tokenizer.word_index

word2id['PAD'] = 0
id2word = {v:k for k, v in word2id.items()}
wids = [[word2id[w] for w in text.text_to_word_sequence(doc)] for doc in X_train]

vocab_size = len(word2id)
embed_size = 100
window_size = 2

print('Vocabulary Size:', vocab_size)
print('Vocabulary Sample:', list(word2id.items())[:10])

Using TensorFlow backend.
Vocabulary Size: 4822
Vocabulary Sample: [('tickets', 1), ('pm', 2), ('like', 3), ('would', 4), ('ok', 5), ('okay', 6), ('movie', 7), ('yes', 8), ('see', 9), ('want', 10)]
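The vocabulary sample above is ordered by frequency: the Keras `Tokenizer` assigns id 1 to the most frequent word, and the notebook then reserves id 0 for `PAD`. A minimal plain-Python sketch of the same idea follows; it splits on whitespace only and does not reproduce the tokenizer's punctuation filtering, and `build_vocab` is a hypothetical name.

```python
from collections import Counter


def build_vocab(docs):
    """Assign ids by descending word frequency, reserving 0 for padding."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    # most_common() sorts by count; ties keep first-seen order
    word2id = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}
    word2id["PAD"] = 0
    return word2id


docs = ["tickets pm tickets", "pm tickets like"]
word2id = build_vocab(docs)
id2word = {v: k for k, v in word2id.items()}
print(word2id)
```

Note that `vocab_size = len(word2id)` counts the `PAD` entry as well, so the reported size is the number of distinct words plus one.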


### Bag of Words Feature Extraction

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(min_df=0., max_df=1., vocabulary=word2id)
cv_matrix = cv.fit_transform(X_train, y_train)
cv_matrix = cv_matrix.toarray()
cv_matrix

array([[0, 3, 2, ..., 0, 0, 0],
       [0, 4, 1, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 3, 3, ..., 1, 1, 1],
       [0, 2, 3, ..., 0, 0, 0]], dtype=int64)

In [27]:
# get all unique words in the corpus
vocab = cv.get_feature_names()
# show document feature vectors
pd.DataFrame(cv_matrix, columns=vocab)

Unnamed: 0,PAD,tickets,pm,like,would,ok,okay,movie,yes,see,want,time,showing,need,thank,...,customers,reald,caption,promotions,kingstown,allows,glowing,blackstone,marinos,pizzaeria,oky,indianapolis,riveria,lke,muh
0,0,3,2,5,4,0,3,3,1,1,1,4,0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,0,4,1,4,4,0,0,1,0,0,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,0,4,3,1,1,0,7,1,1,0,1,1,0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,0,5,3,1,1,7,0,2,0,2,2,1,4,2,3,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1495,0,1,1,4,4,1,7,1,3,1,0,3,3,0,2,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1496,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1497,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1498,0,3,3,3,4,5,0,6,2,1,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1


In [28]:
# Get BOW features
X_train_bow = cv_matrix #cv.fit_transform(X_train).toarray()
X_test_bow = cv.transform(X_test).toarray()
y_train = np.array(y_train)
y_test = np.array(y_test)
print (X_train_bow.shape) 
print (X_test_bow.shape) 
print (y_test.shape)

(1500, 4822)
(500, 4822)
(500,)
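Passing `vocabulary=word2id` fixes the column layout of the count matrix: column `j` counts the word whose id is `j`, and words outside the vocabulary are ignored. The sketch below shows that mapping in plain Python on a toy vocabulary; `bow_matrix` is a hypothetical helper and skips `CountVectorizer`'s own tokenization and lowercasing.

```python
def bow_matrix(docs, word2id):
    """Count occurrences of each vocabulary word per document."""
    n_cols = max(word2id.values()) + 1
    rows = []
    for doc in docs:
        row = [0] * n_cols
        for tok in doc.split():
            col = word2id.get(tok)
            if col is not None:      # out-of-vocabulary tokens are dropped
                row[col] += 1
        rows.append(row)
    return rows


toy_vocab = {"PAD": 0, "tickets": 1, "pm": 2, "like": 3}
toy_docs = ["tickets pm tickets", "like pizza"]
print(bow_matrix(toy_docs, toy_vocab))
```

This also suggests why the `PAD` column is all zeros in the matrix above: no document contains that token, so column 0 carries no information for the classifiers.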


# Interpretability - Feature Importances

In [29]:
import random
import pandas as pd
import IPython
import xgboost
import keras

import eli5
from eli5.lime import TextExplainer
from lime.lime_text import LimeTextExplainer
print('ELI5 Version:', eli5.__version__)
print('XGBoost Version:', xgboost.__version__)
print('Keras Version:', keras.__version__)

ELI5 Version: 0.10.1
XGBoost Version: 1.0.2
Keras Version: 2.3.1


In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

In [31]:
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin

from keras.models import Model, Input
from keras.layers import Dense, LSTM, Dropout, Embedding, SpatialDropout1D, Bidirectional, concatenate
from keras.layers import GlobalAveragePooling1D, GlobalMaxPooling1D
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

class KerasTextClassifier:
    __author__ = "Edward Ma"
    __copyright__ = "Copyright 2018, Edward Ma"
    __credits__ = ["Edward Ma"]
    __license__ = "Apache"
    __version__ = "2.0"
    __maintainer__ = "Edward Ma"
    __email__ = "makcedward@gmail.com"
    
    OOV_TOKEN = "UnknownUnknown"
    
    def __init__(self, 
                 max_word_input, word_cnt, word_embedding_dimension, labels, 
                 batch_size, epoch, validation_split,
                 verbose=0):
        self.verbose = verbose
        self.max_word_input = max_word_input
        self.word_cnt = word_cnt
        self.word_embedding_dimension = word_embedding_dimension
        self.labels = labels
        self.batch_size = batch_size
        self.epoch = epoch
        self.validation_split = validation_split
        
        self.label_encoder = None
        self.classes_ = None
        self.tokenizer = None
        
        self.model = self._init_model()
        self._init_label_encoder(y=labels)
        self._init_tokenizer()
        
    def _init_model(self):
        input_layer = Input((self.max_word_input,))
        text_embedding = Embedding(
            input_dim=self.word_cnt+2, output_dim=self.word_embedding_dimension,
            input_length=self.max_word_input, mask_zero=False)(input_layer)
        
        text_embedding = SpatialDropout1D(0.5)(text_embedding)
        
        bilstm = Bidirectional(LSTM(units=256, return_sequences=True, recurrent_dropout=0.5))(text_embedding)
        x = concatenate([GlobalAveragePooling1D()(bilstm), GlobalMaxPooling1D()(bilstm)])
        x = Dropout(0.5)(x)
        x = Dense(128, activation="relu")(x)
        x = Dropout(0.5)(x)
        
        output_layer = Dense(units=len(self.labels), activation="softmax")(x)
        model = Model(input_layer, output_layer)
        model.compile(
            optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
        return model
    
    def _init_tokenizer(self):
        self.tokenizer = Tokenizer(
            num_words=self.word_cnt+1, split=' ', oov_token=self.OOV_TOKEN)
    
    def _init_label_encoder(self, y):
        self.label_encoder = LabelEncoder()
        self.label_encoder.fit(y)
        self.classes_ = self.label_encoder.classes_
        
    def _encode_label(self, y):
        return self.label_encoder.transform(y)
        
    def _decode_label(self, y):
        return self.label_encoder.inverse_transform(y)
    
    def _get_sequences(self, texts):
        seqs = self.tokenizer.texts_to_sequences(texts)
        return pad_sequences(seqs, maxlen=self.max_word_input, value=0)
    
    def _preprocess(self, texts):
        # Placeholder only.
        return [text for text in texts]
        
    def _encode_feature(self, x):
        self.tokenizer.fit_on_texts(self._preprocess(x))
        self.tokenizer.word_index = {e: i for e,i in self.tokenizer.word_index.items() if i <= self.word_cnt}
        self.tokenizer.word_index[self.tokenizer.oov_token] = self.word_cnt + 1
        return self._get_sequences(self._preprocess(x))
        
    def fit(self, X, y):
        """
            Train the model on X (features) and y (labels).
        
            :param X: list of sentences
            :param y: list of labels
        """
        
        encoded_x = self._encode_feature(X)
        encoded_y = self._encode_label(y)
        
        self.model.fit(encoded_x, encoded_y, 
                       batch_size=self.batch_size, epochs=self.epoch, 
                       validation_split=self.validation_split)
        
    def predict_proba(self, X, y=None):
        encoded_x = self._get_sequences(self._preprocess(X))
        return self.model.predict(encoded_x)
    
    def predict(self, X, y=None):
        y_pred = np.argmax(self.predict_proba(X), axis=1)
        return self._decode_label(y_pred)
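`_get_sequences` relies on `pad_sequences`, whose defaults pad and truncate at the start of each sequence. With `max_word_input=100`, a dialog longer than 100 tokens therefore keeps only its last 100 ids. The sketch below imitates that default behaviour in plain Python for a single sequence; `pad_pre` is a hypothetical helper, not a Keras function.

```python
def pad_pre(seq, maxlen, value=0):
    """Left-pad or left-truncate one sequence to exactly maxlen items."""
    if maxlen <= 0:
        raise ValueError("maxlen must be positive")
    seq = list(seq)[-maxlen:]                    # keep the last maxlen ids
    return [value] * (maxlen - len(seq)) + seq   # pad on the left


print(pad_pre([1, 2, 3], 5))            # short sequence: zeros are prepended
print(pad_pre([1, 2, 3, 4, 5, 6], 4))   # long sequence: the start is cut off
```

If the opening turns of a dialog carry the intent, passing `truncating='post'` to `pad_sequences` might be worth trying; that is a suggestion, not something the notebook evaluates.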

In [32]:
names = ['Logistic Regression', 'Random Forest', 'XGBoost Classifier', 'Keras']

In [33]:
def build_model(names, x, y):
    pipelines = []
    vec = TfidfVectorizer()
    vec.fit(x)

    for name in names:
        print('train %s' % name)
        
        if name == 'Logistic Regression':
            estimator = LogisticRegression(solver='newton-cg', n_jobs=-1)
            pipeline = make_pipeline(vec, estimator)
        elif name == 'Random Forest':
            estimator = RandomForestClassifier(n_jobs=-1)
            pipeline = make_pipeline(vec, estimator)
        elif name == 'XGBoost Classifier':
            estimator = XGBClassifier()
            pipeline = make_pipeline(vec, estimator)
        elif name == 'Keras':
            pipeline = KerasTextClassifier(
                max_word_input=100, word_cnt=30000, word_embedding_dimension=100, 
                labels=list(set(y.tolist())), batch_size=128, epoch=1, validation_split=0.1)
        
        
        pipeline.fit(x, y)
        pipelines.append({
            'name': name,
            'pipeline': pipeline
        })
        
    return pipelines, vec

In [34]:
pipelines, vec = build_model(names, X_train, y_train)

train Logistic Regression
train Random Forest
train XGBoost Classifier
train Keras
Train on 1350 samples, validate on 150 samples
Epoch 1/1


## ELI5 - Global Interpretation

In [35]:
for pipeline in pipelines:
    print('Estimator: %s' % (pipeline['name']))
    labels = pipeline['pipeline'].classes_.tolist()
    
    if pipeline['name'] in ['Logistic Regression', 'Random Forest']:
        estimator = pipeline['pipeline']
    elif pipeline['name'] == 'XGBoost Classifier':
        estimator = pipeline['pipeline'].steps[1][1].get_booster()
#     Keras is not supported
#     elif pipeline['name'] == 'keras':
#         estimator = pipeline['pipeline']
    else:
        continue
    
    IPython.display.display(
        eli5.show_weights(estimator=estimator, top=10, target_names=labels, vec=vec))

Estimator: Logistic Regression


Weight?,Feature
+5.988,restauranttable
… 4794 more negative …,… 4794 more negative …
-0.301,pizzaordering
-0.319,would
-0.337,movie
-0.350,like
-0.356,ok
-0.384,pm
-0.385,okay
-0.428,tickets

Weight?,Feature
+3.912,tickets
+1.511,glass
+1.345,two
+1.314,total
+1.199,purchase
… 1475 more positive …,… 1475 more positive …
… 3319 more negative …,… 3319 more negative …
-1.212,sold
-1.346,dumbo
-1.362,shazam

Weight?,Feature
+2.627,sorry
+2.401,shazam
+2.212,movie
+2.038,pet
+1.772,hellboy
+1.616,pm
+1.559,showing
+1.432,dumbo
+1.429,sold
+1.317,see

Weight?,Feature
+6.015,pizzaordering
… 4794 more negative …,… 4794 more negative …
-0.303,uberlyft
-0.320,would
-0.339,movie
-0.352,like
-0.357,ok
-0.386,pm
-0.387,okay
-0.430,tickets

Weight?,Feature
+6.007,coffeeordering
… 4794 more negative …,… 4794 more negative …
-0.303,pizzaordering
-0.320,would
-0.338,movie
-0.351,like
-0.357,ok
-0.385,pm
-0.386,okay
-0.429,tickets

Weight?,Feature
+3.467,car
+3.306,appointment
+1.907,auto
+1.607,intelligent
+1.547,solutions
+1.490,tomorrow
+1.349,number
+1.214,name
… 1526 more positive …,… 1526 more positive …
… 3268 more negative …,… 3268 more negative …

Weight?,Feature
+6.015,uberlyft
… 4794 more negative …,… 4794 more negative …
-0.303,pizzaordering
-0.320,would
-0.339,movie
-0.352,like
-0.357,ok
-0.386,pm
-0.387,okay
-0.430,tickets

Weight?,Feature
+2.913,tickets
+1.917,us
+1.636,captain
+1.531,marvel
+1.415,showing
+1.243,two
+1.230,pm
+1.182,sent
+1.168,text
+1.026,show

Weight?,Feature
+1.820,watch
+1.654,movie
+1.453,seen
+1.339,movies
+1.285,comedy
+1.267,action
+1.176,something
… 809 more positive …,… 809 more positive …
… 3985 more negative …,… 3985 more negative …
-0.803,pizzaordering

Weight?,Feature
+2.742,restaurant
+2.725,table
+2.544,reservation
+1.746,people
+1.094,try
+1.053,reservations
+1.026,seating
+0.971,dinner
… 916 more positive …,… 916 more positive …
… 3878 more negative …,… 3878 more negative …
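Each logistic regression table above lists a class's largest coefficients by magnitude and summarizes the rest as "more positive" or "more negative". As a rough sketch of that selection, assuming ranking by absolute weight (an assumption about how `top=10` is applied, not taken from the ELI5 source), using four weights from the first table:

```python
def top_features(weights, feature_names, k=3):
    """Return the k (weight, feature) pairs with the largest absolute weight."""
    ranked = sorted(zip(weights, feature_names),
                    key=lambda wf: abs(wf[0]), reverse=True)
    return ranked[:k]


# Values copied from the first logistic regression table.
weights = [-0.384, 5.988, -0.428, -0.385]
names = ["pm", "restauranttable", "tickets", "okay"]
print(top_features(weights, names))
```

Read this way, the first table says one token, `restauranttable`, carries nearly all of the positive evidence for that class.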


Estimator: Random Forest


Weight,Feature
0.1029  ± 0.1036,pizzaordering
0.0966  ± 0.1120,coffeeordering
0.0928  ± 0.1163,uberlyft
0.0912  ± 0.1115,restauranttable
0.0171  ± 0.0498,tickets
0.0114  ± 0.0443,like
0.0093  ± 0.0442,appointment
0.0091  ± 0.0313,see
0.0088  ± 0.0384,intelligent
0.0087  ± 0.0289,movie


Estimator: XGBoost Classifier


Weight,Feature
0.0948,pizzaordering
0.0948,uberlyft
0.0939,coffeeordering
0.0919,restauranttable
0.0565,car
0.0493,seen
0.0383,appointment
0.0264,restaurant
0.0200,comedy
0.0174,table


Estimator: Keras


## ELI5 - Local Interpretation

In [36]:
number_of_sample = 1
sample_ids = [random.randint(0, len(X_test) -1 ) for p in range(0, number_of_sample)]

for idx in sample_ids:
    print('Index: %d' % (idx))
#     print('Index: %d, Feature: %s' % (idx, x_test[idx]))
    for pipeline in pipelines:
        print('-' * 50)
        print('Estimator: %s' % (pipeline['name']))
        
        print('True Label: %s, Predicted Label: %s' % (y_test[idx], pipeline['pipeline'].predict([X_test[idx]])[0]))
        labels = pipeline['pipeline'].classes_.tolist()
  
        if pipeline['name'] in ['Logistic Regression', 'Random Forest']:
            estimator = pipeline['pipeline'].steps[1][1]
        elif pipeline['name'] == 'XGBoost Classifier':
            estimator = pipeline['pipeline'].steps[1][1].get_booster()
        # Keras is not supported
#         elif pipeline['name'] == 'Keras':
#             estimator = pipeline['pipeline'].model
        else:
            continue

        IPython.display.display(
            eli5.show_prediction(estimator, X_test[idx], top=10, vec=vec, target_names=labels))

Index: 140
--------------------------------------------------
Estimator: Logistic Regression
True Label: 0, Predicted Label: 0


Contribution?,Feature
5.988,restauranttable
-0.951,<BIAS>

Contribution?,Feature
0.924,<BIAS>
-0.985,restauranttable

Contribution?,Feature
0.251,<BIAS>
-0.681,restauranttable

Contribution?,Feature
-0.301,restauranttable
-0.944,<BIAS>

Contribution?,Feature
-0.301,restauranttable
-0.946,<BIAS>

Contribution?,Feature
0.799,<BIAS>
-0.923,restauranttable

Contribution?,Feature
-0.301,restauranttable
-0.944,<BIAS>

Contribution?,Feature
0.642,<BIAS>
-0.85,restauranttable

Contribution?,Feature
0.531,<BIAS>
-0.8,restauranttable

Contribution?,Feature
0.636,<BIAS>
-0.847,restauranttable
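For a linear model, each contribution in these tables is the feature's weight multiplied by its value in the document, and the class score is the sum of the contributions plus the bias. The sketch below reproduces the first table under one assumption: that test document 140 consists of the single token `restauranttable`, so its L2-normalized TF-IDF value is 1.0. `linear_contributions` is a hypothetical helper.

```python
def linear_contributions(weights, values, bias):
    """Per-feature contributions (weight * value) plus the bias term, and their sum."""
    contribs = {name: weights[name] * v
                for name, v in values.items() if name in weights}
    contribs["<BIAS>"] = bias
    return contribs, sum(contribs.values())


weights = {"restauranttable": 5.988}   # global weight from the ELI5 table
values = {"restauranttable": 1.0}      # assumed TF-IDF value for this document
contribs, score = linear_contributions(weights, values, bias=-0.951)
print(contribs, round(score, 3))
```

Under that assumption the contribution equals the global weight, which matches the `+5.988` shown in both the global and the local table for this class.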


--------------------------------------------------
Estimator: Random Forest
True Label: 0, Predicted Label: 0


Contribution?,Feature
+0.487,restauranttable
+0.111,<BIAS>
+0.093,coffeeordering
+0.084,pizzaordering
+0.083,uberlyft
+0.010,like
+0.005,time
+0.005,pm
+0.005,see
+0.004,want

Contribution?,Feature
+0.126,<BIAS>
… 237 more positive …,… 237 more positive …
… 181 more negative …,… 181 more negative …
-0.004,time
-0.005,want
-0.005,pm
-0.005,would
-0.006,two
-0.007,see
-0.007,tickets

Contribution?,Feature
+0.096,<BIAS>
… 232 more positive …,… 232 more positive …
… 162 more negative …,… 162 more negative …
-0.003,two
-0.003,want
-0.004,would
-0.004,pm
-0.004,time
-0.005,see
-0.006,tickets

Contribution?,Feature
+0.116,<BIAS>
+0.025,coffeeordering
+0.023,uberlyft
+0.008,like
+0.005,time
+0.004,see
+0.004,would
+0.004,want
… 415 more positive …,… 415 more positive …
-0.123,restauranttable

Contribution?,Feature
+0.115,<BIAS>
+0.036,pizzaordering
+0.030,uberlyft
+0.008,like
+0.005,time
+0.005,see
+0.005,want
+0.005,would
… 446 more positive …,… 446 more positive …
-0.155,restauranttable

Contribution?,Feature
+0.111,<BIAS>
… 211 more positive …,… 211 more positive …
… 180 more negative …,… 180 more negative …
-0.003,time
-0.004,solutions
-0.004,car
-0.005,number
-0.005,need
-0.005,like
-0.005,appointment

Contribution?,Feature
+0.116,<BIAS>
+0.035,coffeeordering
+0.027,pizzaordering
+0.009,like
+0.005,see
+0.005,would
+0.005,pm
+0.005,time
… 444 more positive …,… 444 more positive …
-0.142,uberlyft

Contribution?,Feature
+0.128,<BIAS>
… 220 more positive …,… 220 more positive …
… 185 more negative …,… 185 more negative …
-0.004,movie
-0.005,pm
-0.005,want
-0.005,time
-0.006,see
-0.006,two
-0.007,tickets

Contribution?,Feature
+0.026,<BIAS>
+0.001,pizzaordering
… 253 more positive …,… 253 more positive …
… 159 more negative …,… 159 more negative …
-0.001,sure
-0.001,yes
-0.001,good
-0.001,would
-0.002,want
-0.002,movie

Contribution?,Feature
+0.054,<BIAS>
… 239 more positive …,… 239 more positive …
… 167 more negative …,… 167 more negative …
-0.002,book
-0.002,available
-0.002,yes
-0.002,reservation
-0.003,many
-0.003,like
-0.003,pm


--------------------------------------------------
Estimator: XGBoost Classifier
True Label: 0, Predicted Label: 0


Contribution?,Feature
5.515,restauranttable
0.126,like
-1.507,<BIAS>
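
A quick way to read the table above: assuming these are additive contributions of the kind ELI5 reports (each row is one feature's share of the class score, with `<BIAS>` as the intercept), the rows sum to the raw score for that class. The sketch below only redoes that arithmetic with the three values shown; the variable names are illustrative and not part of the notebook.

```python
# Contributions copied from the three-row table above (feature -> contribution).
contributions = {
    "restauranttable": 5.515,
    "like": 0.126,
    "<BIAS>": -1.507,
}

# Under the additivity assumption, the class score is the sum of all rows.
score = round(sum(contributions.values()), 3)
print(score)  # 4.134
```

For the longer tables, the rows hidden behind "… more positive …" and "… more negative …" would also have to be included, so the visible rows alone do not add up to the score.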

Contribution?,Feature
+0.711,<BIAS>
… 21 more positive …,… 21 more positive …
… 48 more negative …,… 48 more negative …
-0.157,buy
-0.158,showing
-0.179,theater
-0.227,two
-0.287,would
-0.388,see
-0.404,movie

Contribution?,Feature
+0.511,<BIAS>
… 10 more positive …,… 10 more positive …
… 43 more negative …,… 43 more negative …
-0.196,theater
-0.229,pet
-0.233,time
-0.277,shazam
-0.405,sorry
-0.461,showing
-0.474,pm

Contribution?,Feature
0.127,like
-1.417,<BIAS>
-2.783,pizzaordering

Contribution?,Feature
0.127,like
-1.442,<BIAS>
-2.762,coffeeordering

Contribution?,Feature
+0.155,tickets
… 1 more negative …,… 1 more negative …
-0.002,schedule
-0.005,four
-0.036,see
-0.040,help
-0.216,need
-0.618,auto
-0.896,car
-0.959,<BIAS>

Contribution?,Feature
0.127,like
-1.417,<BIAS>
-2.783,uberlyft

Contribution?,Feature
+0.935,<BIAS>
… 15 more positive …,… 15 more positive …
… 77 more negative …,… 77 more negative …
-0.118,available
-0.124,text
-0.125,like
-0.142,amc
-0.191,thats
-0.205,movie
-0.263,want

Contribution?,Feature
+0.396,pm
+0.038,tickets
… 2 more positive …,… 2 more positive …
… 9 more negative …,… 9 more negative …
-0.027,able
-0.038,action
-0.046,romantic
-0.114,yes
-0.223,new
-0.730,watch

Contribution?,Feature
+0.311,tickets
+0.103,movie
… 15 more negative …,… 15 more negative …
-0.083,sorry
-0.111,pm
-0.135,available
-0.440,table
-0.445,restaurant
-0.691,time
-0.740,reservation


--------------------------------------------------
Estimator: Keras
True Label: 0, Predicted Label: 6


# LIME

## LIME - Local Interpretation

In [37]:
import lime
import numpy as np
import sklearn
import sklearn.ensemble
import sklearn.feature_extraction.text  # needed for TfidfVectorizer below
import sklearn.metrics


In [38]:
# Fit the TF-IDF vocabulary on the training split only, then reuse it for the test split
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer(lowercase=False)
train_vectors = vectorizer.fit_transform(X_train)
test_vectors = vectorizer.transform(X_test)

In [39]:
from sklearn.naive_bayes import MultinomialNB
nb = MultinomialNB(alpha=.01)
nb.fit(train_vectors, y_train)

MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True)

In [40]:
pred = nb.predict(test_vectors)
sklearn.metrics.f1_score(y_test, pred, average='weighted')

0.8996975604838171
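
For reference, `average='weighted'` means the per-class F1 scores are averaged with each class weighted by its number of true instances. A minimal pure-Python sketch of that calculation follows; the helper `weighted_f1` and the toy labels are hypothetical and are not the notebook's data.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 scores, weighted by each class's support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        # True positives and number of predictions for this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score

# Hypothetical toy labels for illustration only
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 2]
print(weighted_f1(toy_true, toy_pred))
```

Because the weights follow class frequency, a weighted F1 near 0.90 can still hide weak performance on small classes; `average='macro'` would expose that.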

In [41]:
from lime import lime_text
from sklearn.pipeline import make_pipeline
c = make_pipeline(vectorizer, nb)

In [42]:
cats = set(corpus_df['category'])

# A set has no guaranteed order; sort so class_names[i] matches column i of predict_proba
class_names = sorted(cats)

from lime.lime_text import LimeTextExplainer
explainer = LimeTextExplainer(class_names=class_names)

In [43]:
idx = 34
exp = explainer.explain_instance(X_test[idx], c.predict_proba, num_features=6, labels=[0, 9])
print('Document id: %d' % idx)
print('Predicted class =', class_names[nb.predict(test_vectors[idx]).reshape(1,-1)[0,0]])
print('True class: %s' % class_names[y_test[idx]])



TypeError: object of type 'NoneType' has no len()

In [44]:
# Print the explanation for each of the two labels requested above (0 and 9)
print('Explanation for class %s' % class_names[0])
print('\n'.join(map(str, exp.as_list(label=0))))
print()
print('Explanation for class %s' % class_names[9])
print('\n'.join(map(str, exp.as_list(label=9))))

Explanation for class 0


NameError: name 'exp' is not defined

In [45]:
exp = explainer.explain_instance(X_test[idx], c.predict_proba, num_features=6, top_labels=2)
print(exp.available_labels())

TypeError: object of type 'NoneType' has no len()

In [46]:

exp.show_in_notebook(text=False)

NameError: name 'exp' is not defined

In [47]:

exp.show_in_notebook(text=X_test[idx])

NameError: name 'exp' is not defined

# Skater

## Skater - Global Interpretation

In [75]:
# Very slow when there are many features (one per word here), so train on two samples only
pipelines, vec = build_model(names, X_train[:2], y_train[:2])

train Logistic Regression
train Random Forest
train XGBoost Classifier
train Keras
Train on 1 samples, validate on 1 samples
Epoch 1/1


In [76]:
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt

from skater.model import InMemoryModel
from skater.core.explanations import Interpretation

# Dense feature matrix for the first two test documents
transformed_x_test = vec.transform(X_test[:2]).toarray()

interpreter = Interpretation(transformed_x_test, feature_names=vec.get_feature_names())

for pipeline in pipelines:
    print('-' * 50)
    print('Estimator: %s' % (pipeline['name']))

    # Only Logistic Regression and Random Forest are plotted; other estimators are skipped
    if pipeline['name'] not in ['Logistic Regression', 'Random Forest']:
        continue

    # The classifier is the second step of each pipeline
    estimator = pipeline['pipeline'].steps[1][1]
    print(estimator)

    pyint_model = InMemoryModel(estimator.predict_proba, examples=transformed_x_test)

    f, ax = plt.subplots(1, 1, figsize=(26, 18))
    interpreter.feature_importance.plot_feature_importance(pyint_model, ascending=False, ax=ax)
    plt.show()

ModuleNotFoundError: No module named 'skater'