# Smart Home Commands Classification

I noticed that it's hard to find a decent dataset covering all sorts of commands, so I built my own tiny dataset to start with, just for fun. The upside of a small dataset is the short training time, which makes it possible to try a lot of different machine learning methods.


The following machine learning classifiers will be tested:
- random forests
- support vector machines
- xgboost
- multi-layer perceptrons
- catboost
- lightgbm
- TPOT

What we are going to test in this notebook is the following:
- averaged sentence representation vs TFIDF (term frequency inverse document frequency) representation
- random forests vs support vector machines vs xgboost vs neural networks vs catboost vs lightgbm vs TPOT
- radial basis function kernel vs linear kernel in SVMs

## <center style="background-color: #6dc8b5; width:30%;">Contents</center>
* [Data Statistics](#data_statistics)
* [Data Preparation](#data_preperation)
* [Training Base Model](#training_base_model)
* [Improved Models](#improved_models)
* [Further Improvements](#further_improvements)
* [Testing](#testing)

<a class="anchor" id="data_statistics"></a>
# Data Statistics

Some statistics about our small dataset of commands.

In [None]:
import numpy as np
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
df = pd.read_csv("/kaggle/input/smart-home-commands-dataset/dataset.csv")
del df["Number"]
df.sample(frac=1).head(25)

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.columns

In [None]:
plt.figure(figsize=(18,10))
sns.countplot(x="Category", palette="rocket", data=df)

In [None]:
plt.figure(figsize=(18,10))
sns.countplot(x="Subcategory", palette="rocket", data=df)

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x="Question", palette="rocket", data=df)

In [None]:
plt.figure(figsize=(10,5))
sns.countplot(x="Action_needed", palette="rocket", data=df)

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x="Time", palette="rocket", data=df)

In [None]:
plt.figure(figsize=(15,10))
sns.countplot(x="Action", palette="rocket", data=df)

<a class="anchor" id="data_preperation"></a>
# Data Preparation

> A one-hot word representation is simple: take a vector initialized with all zeros and put a one at the position of that word in the vocabulary.


> Distributed word representations are a bit trickier; a neural network has to help us with that. To obtain these distributed word embeddings, word2vec can be used, here with the skip-gram model. The network is trained to predict the closest neighbours of each word, and after training the weight matrix between the input layer and the hidden layer is taken as the set of word vectors. The relationships between words that the network has learned are thereby encoded in the vector representation of each word.
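
To make "training each word with its closest neighbours" a bit more concrete, the sketch below lists the (center, context) pairs that a skip-gram model would be trained on. The function name `skipgram_pairs` and the window size are illustrative assumptions, not part of any word2vec API; a real implementation adds the neural network and sampling tricks on top of this.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for each token and its neighbours within `window`."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs

# made-up example sentence, split on whitespace for simplicity
tokens = "turn off the lights".split()
for pair in skipgram_pairs(tokens, window=1):
    print(pair)
```

With only a handful of short sentences there are very few such pairs to learn from, which fits the observation that the dataset is too small for distributed vectors.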

> In order to obtain a trainable vector for the machine learning methods to learn from, a sentence has to be transformed into a vector. Each sentence consists of several words that each have their own word representation. The word representations can be combined into a sentence representation in several ways: either by simply averaging the word vectors, or by first multiplying them with the TFIDF (Term Frequency Inverse Document Frequency) score and then averaging. This score is obtained by multiplying the term frequency with the inverse document frequency. The term frequency is how often a word occurs in a sentence, and the inverse document frequency indicates how rare a word is across all sentences. This prevents words that occur in nearly every sentence from dominating the representation.
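
Written out, the weighting described above could look as follows. The symbols are introduced here for illustration: $\mathrm{tf}(w, s)$ counts the occurrences of word $w$ in sentence $s$, $\mathrm{df}(w)$ is the number of sentences containing $w$, $N$ is the total number of sentences, $w_1, \dots, w_{|s|}$ are the tokens of $s$, and $e_w$ is the one-hot vector of $w$. The $+1$ in the denominator is one common smoothing choice to avoid division by zero, not the only possible one.

```latex
\mathrm{idf}(w) = \log\frac{N}{\mathrm{df}(w) + 1},
\qquad
\mathrm{tfidf}(w, s) = \mathrm{tf}(w, s)\cdot\mathrm{idf}(w),
\qquad
v(s) = \frac{1}{|s|}\sum_{i=1}^{|s|} \mathrm{tfidf}(w_i, s)\, e_{w_i}
```

Setting every weight $\mathrm{tfidf}(w_i, s)$ to 1 gives the plain averaged representation.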

> Experience has shown that the dataset has way too little data to use distributed word vectors, so we will continue with the one-hot encoded words and sentences.
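
As a minimal sketch of the one-hot word vectors and the averaged sentence representation described above, here is a toy example. The vocabulary, the helper names `one_hot` and `averaged_sentence_vector`, and the use of `str.split` instead of a real tokenizer are all assumptions made for illustration.

```python
def one_hot(word, vocab):
    """Vector of zeros with a single 1 at the word's position in `vocab`."""
    return [1 if v == word else 0 for v in vocab]

def averaged_sentence_vector(tokens, vocab):
    """Sum the one-hot vectors of all known tokens and divide by their count."""
    known = [t for t in tokens if t in vocab]
    if not known:
        # no known words: fall back to the all-zero vector
        return [0.0] * len(vocab)
    summed = [sum(column) for column in zip(*(one_hot(t, vocab) for t in known))]
    return [value / len(known) for value in summed]

vocab = ["lights", "off", "on", "the", "turn"]  # toy vocabulary, not the dataset's
print(one_hot("off", vocab))
print(averaged_sentence_vector("turn the lights off".split(), vocab))
```

The averaged vector is simply a length-normalized bag of words: every word of the sentence contributes equally, and word order is lost.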

In [None]:
from nltk import word_tokenize
from sklearn.model_selection import train_test_split
import itertools
import math

In [None]:
sentences = df['Sentence']
categories = df['Category']
subcategories = df['Subcategory']
actions = df['Action']

uniquecategories = list(set(categories))
uniquesubcategories = list(set(subcategories))
uniqueactions = list(set(actions))

mergesentences = list(itertools.chain.from_iterable([word_tokenize(sentence.lower()) for sentence in sentences]))
vocabulary = list(set(mergesentences))
print(vocabulary)

In [None]:
# tokenize every sentence once; each sentence is treated as one 'document'
tokenizedsentences = [word_tokenize(sentence.lower()) for sentence in sentences]

# calculates how often the word appears in the sentence
# (tokenized the same way as the vocabulary, so case and punctuation match)
def term_frequency(word, sentence):
    return word_tokenize(sentence.lower()).count(word)

In [None]:
# calculates in how many sentences the word appears
def document_frequency(word):
    return sum(1 for tokens in tokenizedsentences if word in tokens)

In [None]:
# will make sure that unimportant words such as "and" that occur often will have lower weights
# log taken to dampen the IDF of rare words; +1 avoids division by zero
def inverse_document_frequency(word):
    return math.log(len(tokenizedsentences) / (document_frequency(word) + 1))

In [None]:
# get term frequency inverse document frequency value
def calculate_tfidf(word, sentence):
    return term_frequency(word, sentence) * inverse_document_frequency(word)

In [None]:
# get one-hot encoded vectors for the targets
def one_hot_class_vector(uniqueclasses, w):
    emptyvector = [0 for i in range(len(uniqueclasses))]
    emptyvector[uniqueclasses.index(w)] = 1
    return emptyvector

In [None]:
# get one-hot encoded vectors for the words
def one_hot_vector(w):
    emptyvector = [0 for i in range(len(vocabulary))]
    emptyvector[vocabulary.index(w)] = 1
    return emptyvector

In [None]:
# get the averaged one-hot encoded sentence vector (optionally TFIDF-weighted)
def sentence_vector(sentence, tfidf=False):
    tokenizedlist = word_tokenize(sentence.lower())
    sentencevector = [0 for i in range(len(vocabulary))]
    count = 0

    for word in tokenizedlist:
        if word in vocabulary:
            count = count + 1
            if tfidf:
                sentencevector = [x + y for x, y in zip(sentencevector, [e * calculate_tfidf(word, sentence) for e in one_hot_vector(word)])] 
            else:
                sentencevector = [x + y for x, y in zip(sentencevector, one_hot_vector(word))]

    if count == 0:
        return sentencevector
    else:
        return [(el / count) for el in sentencevector]

Let's construct the sentence vectors now; they are needed to start training.

In [None]:
# wordvectors = [one_hot_vector(w) for w in vocabulary] # not needed
categoryvectors = [cv.index(1) for cv in [one_hot_class_vector(uniquecategories, w) for w in categories]]
subcategoryvectors = [cv.index(1) for cv in [one_hot_class_vector(uniquesubcategories, w) for w in subcategories]]
actionvectors = [cv.index(1) for cv in [one_hot_class_vector(uniqueactions, w) for w in actions]]
sentencevectors = [sentence_vector(sentence) for sentence in sentences]
sentencevectorstfidf = [sentence_vector(sentence, True) for sentence in sentences]

In [None]:
X_train_cat, X_test_cat, y_train_cat, y_test_cat = train_test_split(sentencevectors, categoryvectors, test_size=0.25, random_state=42)
X_train_cat_tfidf, X_test_cat_tfidf, y_train_cat_tfidf, y_test_cat_tfidf = train_test_split(sentencevectorstfidf, categoryvectors, test_size=0.25, random_state=42)
X_train_subcat, X_test_subcat, y_train_subcat, y_test_subcat = train_test_split(sentencevectors, subcategoryvectors, test_size=0.25, random_state=42)
X_train_action, X_test_action, y_train_action, y_test_action = train_test_split(sentencevectors, actionvectors, test_size=0.25, random_state=42)

<a class="anchor" id="training_base_model"></a>
# Training Base Model

Train a random forest baseline model to start from.

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from catboost import CatBoostClassifier
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import xgboost as xgb
from tpot import TPOTClassifier
from sklearn.metrics import accuracy_score
from numpy import random

random.seed(2020)

In [None]:
def train_fit(model_name, model, X, y, X_test, y_test):
    model.fit(X, y)
    y_preds = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_preds)
    print(f"{model_name}: {accuracy}")
    return model

In [None]:
random_forest_model = RandomForestClassifier()
random_forest_model = train_fit("RandomForestClassifier", random_forest_model, X_train_cat, y_train_cat, X_test_cat, y_test_cat)

<a class="anchor" id="improved_models"></a>
# Improved Models

Next up are SVC (linear and RBF kernel), XGBClassifier, CatBoostClassifier, MLPClassifier and LGBMClassifier, all with default settings; TPOT follows further below. After that, it is time to make some improvements and explain why we see what we see.

In [None]:
svc_model_linear = SVC(kernel='linear', decision_function_shape='ovo')
svc_model_linear = train_fit("SVC (linear)", svc_model_linear, X_train_cat, y_train_cat, X_test_cat, y_test_cat)

svc_model_rbf = SVC(kernel='rbf', decision_function_shape='ovo')
svc_model_rbf = train_fit("SVC (rbf)", svc_model_rbf, X_train_cat, y_train_cat, X_test_cat, y_test_cat)

xgb_model = xgb.XGBClassifier()
xgb_model = train_fit("XGBClassifier", xgb_model, np.array(X_train_cat), np.array(y_train_cat), np.array(X_test_cat), y_test_cat)

catboost_model = CatBoostClassifier(verbose=False)
catboost_model = train_fit("CatBoostClassifier", catboost_model, X_train_cat, y_train_cat, X_test_cat, y_test_cat)

mlp_model = MLPClassifier()
mlp_model = train_fit("MLPClassifier", mlp_model, X_train_cat, y_train_cat, X_test_cat, y_test_cat)

lgbm_model = lgb.LGBMClassifier()
lgbm_model = train_fit("LGBMClassifier", lgbm_model, X_train_cat, y_train_cat, X_test_cat, y_test_cat)

<a class="anchor" id="further_improvements"></a>
# Further Improvements

## Linear kernel vs RBF kernel SVM

The linear support vector classifier didn't score so well, but what if we tweak its parameters, such as the cost parameter `C`? With a higher value it even outperforms the radial basis function kernel.

In [None]:
svc_linear_cost_model = SVC(kernel='linear', decision_function_shape='ovo', C=52)
svc_linear_cost_model = train_fit("SVC (linear) + cost", svc_linear_cost_model, X_train_cat, y_train_cat, X_test_cat, y_test_cat)

The advantages of a linear kernel outweigh those of a radial basis function kernel, mainly because of computational cost: a linear kernel is faster to train. The linear kernel also has one parameter fewer to tune, namely gamma. Text classification can often be treated as a linearly separable problem because text already resides in a high-dimensional space, so transforming the data into an even higher-dimensional space doesn't really help performance.
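
For reference, the two kernels compared above can be sketched in a few lines. This is an illustration of the standard definitions, not scikit-learn's implementation; the function names and the two example vectors are made up.

```python
import math

def linear_kernel(x, y):
    """Plain dot product; there is no kernel hyperparameter to tune."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma):
    """exp(-gamma * ||x - y||^2); gamma controls how fast similarity decays."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# two toy averaged sentence vectors that share one word
x = [0.5, 0.5, 0.0, 0.0]
y = [0.5, 0.0, 0.5, 0.0]

print("linear:", linear_kernel(x, y))
for gamma in (0.1, 1.0, 10.0):
    print("rbf, gamma =", gamma, ":", rbf_kernel(x, y, gamma))
```

The RBF similarity changes with `gamma`, which is why that kernel needs an extra tuning step that the linear kernel avoids.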

## Tweaking MLPClassifier

The MLPClassifier gave a convergence warning: `ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.` Let's see what happens when we increase the maximum number of iterations.

In [None]:
mlp_max_iter_model = MLPClassifier(max_iter=10000)
mlp_max_iter_model = train_fit("MLPClassifier", mlp_max_iter_model, X_train_cat, y_train_cat, X_test_cat, y_test_cat)

## AutoML with TPOT

Last but not least, let's try TPOT (https://github.com/epistasislab/tpot/), an AutoML tool that searches for a well-performing pipeline automatically. To avoid really long training times, we set `generations`, `population_size` and the number of cross-validation folds (`cv`) to 5.
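
For a rough sense of the search budget: the documentation of the classic `TPOTClassifier` states that it evaluates `population_size + generations * offspring_size` pipelines, with `offspring_size` defaulting to `population_size`. Assuming that formula applies to the installed version, the helper below (a hypothetical name, not part of TPOT) estimates an upper bound on the work involved.

```python
def tpot_budget(generations, population_size, cv, offspring_size=None):
    """Estimate (pipelines evaluated, model fits) for a TPOT run.

    Assumes the documented formula population_size + generations * offspring_size
    and one fit per cross-validation fold; caching or timeouts can lower this.
    """
    if offspring_size is None:
        offspring_size = population_size
    pipelines = population_size + generations * offspring_size
    return pipelines, pipelines * cv

pipelines, fits = tpot_budget(generations=5, population_size=5, cv=5)
print(f"{pipelines} pipelines, about {fits} model fits")
```

This is a very small search, so the result should be read as a quick probe rather than an exhaustive optimization.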

In [None]:
tpot_model = TPOTClassifier(generations=5, population_size=5, cv=5, verbosity=3)
tpot_model = train_fit("TPOTClassifier", tpot_model, np.array(X_train_cat), np.array(y_train_cat), np.array(X_test_cat), y_test_cat)
tpot_model.export("tpot_pipeline.py")

In [None]:
! cat tpot_pipeline.py

## Averaged vs TFIDF+averaged

The MLPClassifier got the highest accuracy in most test runs and was one of the fastest to train. This was with the averaged sentence approach; let's see how it does with the TFIDF approach.



In [None]:
mlp_max_iter_model_tfidf = MLPClassifier(max_iter=10000)
mlp_max_iter_model_tfidf = train_fit("MLPClassifier", mlp_max_iter_model_tfidf, X_train_cat_tfidf, y_train_cat_tfidf, X_test_cat_tfidf, y_test_cat_tfidf)

Results did not improve with TFIDF. The reason is the length of the 'documents', which in our case are short sentences. The TFIDF approach is less useful for short texts because the chance of a word occurring multiple times is slim. It hurts more than it helps in this case.
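
The point about short texts can be checked with a small count. The sentences below are made-up examples, not taken from the dataset, and `repeated_words` is a hypothetical helper that splits on whitespace only.

```python
from collections import Counter

def repeated_words(sentence):
    """Return the words that occur more than once in a sentence, with their counts."""
    counts = Counter(sentence.lower().split())
    return {word: n for word, n in counts.items() if n > 1}

examples = [
    "turn off the lights",
    "turn on the heating in the kitchen",
    "what is the temperature",
]

for sentence in examples:
    print(f"{sentence!r}: repeated words = {repeated_words(sentence)}")
```

When nearly every term frequency equals 1, the TF part carries almost no information and the weight of a word reduces to its IDF value alone.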

<a class="anchor" id="testing"></a>
# Testing

The multi-layer perceptron classifier has the best score with a high max_iter value. Let's use this for the predictions.

In [None]:
# one classifier per target; the labels make the three printed accuracies distinguishable
mlp_max_iter_model_cat = MLPClassifier(max_iter=10000)
mlp_max_iter_model_cat = train_fit("MLPClassifier (category)", mlp_max_iter_model_cat, X_train_cat, y_train_cat, X_test_cat, y_test_cat)
mlp_max_iter_model_subcat = MLPClassifier(max_iter=10000)
mlp_max_iter_model_subcat = train_fit("MLPClassifier (subcategory)", mlp_max_iter_model_subcat, X_train_subcat, y_train_subcat, X_test_subcat, y_test_subcat)
mlp_max_iter_model_action = MLPClassifier(max_iter=10000)
mlp_max_iter_model_action = train_fit("MLPClassifier (action)", mlp_max_iter_model_action, X_train_action, y_train_action, X_test_action, y_test_action)

In [None]:
def predict(model, classes, sentence):
    y_preds = model.predict([sentence_vector(sentence)])
    return classes[y_preds[0]]

In [None]:
sentence = "Hi Google, please turn off the lights."
print(predict(mlp_max_iter_model_cat, uniquecategories, sentence))
print(predict(mlp_max_iter_model_subcat, uniquesubcategories, sentence))
print(predict(mlp_max_iter_model_action, uniqueactions, sentence))

In [None]:
sentence = "Turn the lights off in the kitchen."
print(predict(mlp_max_iter_model_cat, uniquecategories, sentence))
print(predict(mlp_max_iter_model_subcat, uniquesubcategories, sentence))
print(predict(mlp_max_iter_model_action, uniqueactions, sentence))

In [None]:
sentence = "Random sentence."
print(predict(mlp_max_iter_model_cat, uniquecategories, sentence))
print(predict(mlp_max_iter_model_subcat, uniquesubcategories, sentence))
print(predict(mlp_max_iter_model_action, uniqueactions, sentence))

In [None]:
sentence = "Find the furthest bus stop."
print(predict(mlp_max_iter_model_cat, uniquecategories, sentence))
print(predict(mlp_max_iter_model_subcat, uniquesubcategories, sentence))
print(predict(mlp_max_iter_model_action, uniqueactions, sentence))

In [None]:
sentence = "Lower the door."
print(predict(mlp_max_iter_model_cat, uniquecategories, sentence))
print(predict(mlp_max_iter_model_subcat, uniquesubcategories, sentence))
print(predict(mlp_max_iter_model_action, uniqueactions, sentence))