# Model Training and Export

- In this notebook, the model that classifies the activities from the online.data file is selected, trained, and exported.
- We will train and compare several models, evaluating their performance to identify the best one. This model will be exported to a .dat.gz file and used by our program to classify whether the user is walking or running.
- For efficiency reasons, we will not tune the hyperparameters of each model.

In [2]:
import joblib
import gzip
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

- First, read the data from the file.
- We are also separating the activity column from the others, since the activity column is the label.
- The date and time columns are dropped since they have no use in the classification.
- There seem to be no missing values in the dataset. Still, we use an imputer so that, if missing values were to appear, they would be replaced with the mean of their respective columns.
- We use a scaler to bring all the values onto the same scale. The idea is to keep any single column from artificially dominating the classification just because its values are much higher or lower than the rest.

In [9]:
data = pd.read_csv('data/output_training.data', delimiter=';')

imputer = SimpleImputer(strategy='mean')
scaler = MinMaxScaler()

X_imputed = imputer.fit_transform(data.drop(['activity', 'date', 'time'], axis=1))
X = scaler.fit_transform(X_imputed)
y = data['activity']
X_imputed

array([[ 0.265 , -0.7814, -0.0076, -0.059 ,  0.0325, -2.9296],
       [ 0.6722, -1.1233, -0.2344, -0.1757,  0.0208,  0.1269],
       [ 0.4399, -1.4817,  0.0722, -0.9105,  0.1063, -2.4367],
       ...,
       [-0.3874, -1.2696, -0.2641, -0.8963,  0.2506,  0.58  ],
       [-0.3047, -1.1782, -0.1363, -1.1958,  0.5059,  1.3119],
       [-0.1926, -0.7112, -0.0832, -0.4192, -0.2741,  0.9312]])
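To make the two preprocessing steps concrete, here is a minimal sketch on a tiny, made-up array (the values are hypothetical and not taken from the training file). It shows mean imputation filling a missing entry and `MinMaxScaler` mapping each column to the range [0, 1].

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two features on very different scales,
# with one missing entry in the second column.
toy = np.array([[1.0, 100.0],
                [2.0, np.nan],
                [3.0, 300.0]])

# The missing entry is replaced by the mean of its column: (100 + 300) / 2 = 200.
toy_imputed = SimpleImputer(strategy='mean').fit_transform(toy)

# Each column is rescaled independently so its minimum maps to 0 and its maximum to 1.
toy_scaled = MinMaxScaler().fit_transform(toy_imputed)

print(toy_imputed)
print(toy_scaled)
```

After scaling, both columns span the same range, so distance-based models such as KNN no longer weight the second column more heavily simply because of its units.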

In this phase, we select the models that we are going to train and evaluate:

In [4]:
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class="auto")))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('GNB', GaussianNB()))
models.append(('SVC', SVC(C = 1.0, kernel = 'rbf')))

We evaluate each model using k-fold cross-validation. Without it, we would rely on a single train/test split which, depending on how the data happens to be distributed across the split, could distort our measurements. Cross-validation gives a more reliable estimate while still letting us use the whole dataset for training.
The metric we use to evaluate each model's performance is the F1-score. It is more robust to imbalanced datasets than plain accuracy and often gives a more reliable and balanced picture.
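As a rough illustration of why F1 can be more informative than accuracy on imbalanced data, the sketch below computes both by hand for a hypothetical dataset of 90 "walking" samples (label 0) and 10 "running" samples (label 1), scored against a classifier that always predicts "walking". The label encoding and class proportions are assumptions made for the example, not properties of the real dataset.

```python
def f1_score_from_labels(y_true, y_pred, positive=1):
    """Compute the F1-score for the given positive class from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and/or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical imbalanced labels: 90 walking (0), 10 running (1).
y_true = [0] * 90 + [1] * 10
# A degenerate classifier that always predicts "walking".
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1 = f1_score_from_labels(y_true, y_pred)

print(f"accuracy={accuracy:.2f}  f1={f1:.2f}")  # accuracy=0.90  f1=0.00
```

The classifier looks good by accuracy even though it never detects the minority class, while the F1-score exposes the failure.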

In [5]:
# evaluate each model in turn
print("Model F1-Score:")
f1_scores = []
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True)
    cv_results = cross_val_score(model, X, y, cv=kfold, scoring='f1')
    
    f1_scores.append(cv_results.mean())
    msg = "%s: F1-Score=%f std=(%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

Model F1-Score:
LR: F1-Score=0.837860 std=(0.004695)
LDA: F1-Score=0.792139 std=(0.007012)
DT: F1-Score=0.987359 std=(0.001238)
KNN: F1-Score=0.988835 std=(0.001512)
GNB: F1-Score=0.951092 std=(0.002205)
SVC: F1-Score=0.985651 std=(0.000941)
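One possible refinement, shown here only as a sketch on synthetic data: in the cells above the imputer and scaler are fitted on the full dataset before cross-validation, so each validation fold has already influenced the preprocessing. Wrapping the steps in a scikit-learn `Pipeline` refits them inside every fold, and a `StratifiedKFold` with a fixed `random_state` makes the splits reproducible and identical for every model. The synthetic data, the seed, and the choice of 5 folds are assumptions for illustration, and the scores it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the six sensor features (not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# Preprocessing and classifier chained together: during cross-validation the
# imputer and scaler are fitted on the training part of each fold only.
pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),
    MinMaxScaler(),
    KNeighborsClassifier(),
)

# Fixed seed so that the folds are reproducible across runs and models.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=cv, scoring='f1')

print("F1-Score=%f std=(%f)" % (scores.mean(), scores.std()))
```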


Now, select the model that performed the best:

In [6]:
max_f1_score = max(f1_scores)
index_max_f1_score = f1_scores.index(max_f1_score)

chosen_model = models[index_max_f1_score]
print(chosen_model[0])

KNN


The selected model is now fitted on the whole dataset and exported. This is the model that our application will use to classify whether the user is running or walking.

In [8]:
model_to_export = chosen_model[1]

model_to_export.fit(X, y)
print(model_to_export)
joblib.dump(model_to_export, 'model/classifier.dat.gz')

KNeighborsClassifier()


['model/classifier.dat.gz']
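Note that the exported classifier was fitted on scaled features, so whatever loads it has to apply the same scaling before calling `predict`. One way to avoid keeping the two in sync by hand, sketched below on synthetic data, is to export a single pipeline that bundles the scaler with the classifier and to reload it with `joblib.load`. The data, the temporary directory, and the variable names are hypothetical. The `.gz` extension is what tells joblib to compress the file.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training data (not the real sensor readings).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Scaler and classifier are fitted and saved together as one object.
bundle = make_pipeline(MinMaxScaler(), KNeighborsClassifier()).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'classifier.dat.gz')
    joblib.dump(bundle, path)      # compression is inferred from the .gz extension
    restored = joblib.load(path)   # what the consuming application would do

# The restored pipeline accepts raw (unscaled) features directly.
same_predictions = bool((restored.predict(X_demo) == bundle.predict(X_demo)).all())
print(same_predictions)
```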