# Handy
## Train model

Now let's train the model. Handy fits a variety of classification models on the same data, and the one that works best is then kept. I deliberately tried many different models to learn how each of them works:
* [Random Forest (decision trees)](https://en.wikipedia.org/wiki/Random_forest)
* [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting)
* [Stochastic Gradient Descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
* [Decision Tree](https://en.wikipedia.org/wiki/Decision_tree)
* [Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
* [K-nearest Neighbors Algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
* [Support Vector Classifier (SVC)](https://en.wikipedia.org/wiki/Support_vector_machine)
* [Linear Support Vector Classifier (LinearSVC)](https://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM)
* [Linear Discriminant Analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis)
* [Quadratic Discriminant Analysis](https://en.wikipedia.org/wiki/Quadratic_classifier)
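The compare-then-choose idea can be sketched on its own before touching the real data. The snippet below is only an illustration: the dataset is synthetic (`make_classification`, 4 features and 3 classes are assumptions), the names `X_demo`, `y_demo`, `candidates` and `cv_scores` are hypothetical, and it scores candidates with 5-fold cross-validation, which is a swapped-in technique and not what the cells below do (they use a single train/test split).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: 4 features, 3 classes)
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=4,
    n_informative=4,
    n_redundant=0,
    n_classes=3,
    n_clusters_per_class=1,
    random_state=0,
)

# A few candidate pipelines, each with the same scaling step
candidates = {
    "RandomForestClassifier": make_pipeline(
        StandardScaler(), RandomForestClassifier(n_estimators=30, random_state=0)
    ),
    "KNeighborsClassifier": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "GaussianNB": make_pipeline(StandardScaler(), GaussianNB()),
}

# Mean accuracy over 5 folds for every candidate
cv_scores = {
    name: cross_val_score(pipeline, X_demo, y_demo, cv=5).mean()
    for name, pipeline in candidates.items()
}

# Keep the name of the candidate with the highest mean accuracy
best_name = max(cv_scores, key=cv_scores.get)

for name, score in cv_scores.items():
    print(f"[{name}]: mean CV accuracy = {score:.3f}")
print(f"Best candidate: {best_name}")
```

Cross-validation averages over several splits, so it is less sensitive to one lucky or unlucky test set than a single hold-out score.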

In [1]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
from sklearn.pipeline import make_pipeline 
from sklearn.preprocessing import StandardScaler 

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

In [3]:
# Load the data

from os import path


if not path.exists("data.csv"):
    # Raise instead of calling exit(), which would shut down the notebook kernel
    raise FileNotFoundError("The data.csv file doesn't exist! Please run the 2_Process_Data.ipynb notebook first.")

df = pd.read_csv("data.csv")

X = df.drop("class_name", axis=1)
y = df["class_name"]

# Number of samples per class
df.groupby("class_name").size()

class_name
0    200
1    200
2    200
3    199
4    176
5    196
6    158
7    197
8    199
9    196
dtype: int64

In [4]:
# Split the data into train and test sets

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=2)

y_test = y_test.to_numpy()

print(f"Train: {len(X_train)}")
print(f"Test: {len(X_test)}")

Train: 1728
Test: 193
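The class sizes above are not perfectly even (158 to 200 samples per class), so a purely random 10% split can shift the class balance of the small test set. One possible refinement, shown here as a sketch and not as something the notebook does, is the `stratify` argument of `train_test_split`. The toy labels and the names `X_demo`, `y_demo`, `train_share` and `test_share` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with uneven class sizes (made-up data, three classes only)
y_demo = np.array([0] * 200 + [1] * 158 + [2] * 176)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions (almost) equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=2, stratify=y_demo
)

# Share of each class in the train and test parts
train_share = np.bincount(y_tr) / len(y_tr)
test_share = np.bincount(y_te) / len(y_te)

print("train share per class:", np.round(train_share, 3))
print("test share per class: ", np.round(test_share, 3))
```

Changing the split changes which rows land in the test set, so the accuracy figures further down would not be reproduced exactly with this variant.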


In [5]:
print(X_test)

         angle_0     angle_1     angle_2     angle_3
844   179.873997  169.216289    7.005674  177.978082
1828  167.104811  179.071756   80.086742   93.292244
1369  163.199372  179.062881   72.913115    5.932791
4     177.297062  176.462795    8.410298    8.085928
426   178.072259   82.373054    9.622006   71.537795
...          ...         ...         ...         ...
1282  162.356705  164.553184  158.741131  157.443774
1339  164.967706  178.128011   71.234268    5.698393
1055  174.122265  176.673479   11.737375  155.368424
1783  179.124347  173.213055  115.996727  117.650164
445   177.478903   79.389113   10.371008   71.939602

[193 rows x 4 columns]


In [6]:
pipelines = {
    "RandomForestClassifier": make_pipeline(StandardScaler(), RandomForestClassifier(criterion="log_loss", max_depth=8, n_estimators=30, random_state=7)),
    "GradientBoostingClassifier": make_pipeline(StandardScaler(), GradientBoostingClassifier()),
    "SGDClassifier": make_pipeline(StandardScaler(), SGDClassifier()),
    "DecisionTreeClassifier": make_pipeline(StandardScaler(), DecisionTreeClassifier()),
    "GaussianNB": make_pipeline(StandardScaler(), GaussianNB()),
    "KNeighborsClassifier": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "LinearSVC": make_pipeline(StandardScaler(), LinearSVC()),
    "LinearDiscriminantAnalysis": make_pipeline(StandardScaler(), LinearDiscriminantAnalysis()),
    "QuadraticDiscriminantAnalysis": make_pipeline(StandardScaler(), QuadraticDiscriminantAnalysis()),
}

In [7]:
from sklearn.metrics import accuracy_score, log_loss

models = {}

for name, algorithm in pipelines.items():
    model = algorithm.fit(X_train.values, y_train.values)

    # Calculate the metrics (accuracy)
    y_predicted = model.predict(X_test.values)
    accuracy = accuracy_score(y_test, y_predicted)
    print(f"[{name}]: accuracy = {accuracy}")

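    # Log loss needs class probabilities, which not every classifier provides
    try:
        test_probabilities = model.predict_proba(X_test.values)
        loss = log_loss(y_test, test_probabilities)
        print(f"[{name}]: log loss = {loss}")
    except AttributeError:
        # No predict_proba for this model (e.g. LinearSVC), so skip log loss
        pass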
    
    models[name] = model

    # incorrect_indices = [i for i in range(len(y_test)) if y_test[i] != y_predicted[i]]
    # incorrect_predictions = [(y_test[i], y_predicted[i]) for i in incorrect_indices]

    # # Print the actual incorrect predictions
    # print("Incorrect predictions:")
    # for true_label, predicted_label in incorrect_predictions:
    #     print(f"True: {true_label}, Predicted: {predicted_label}")
    

[RandomForestClassifier]: accuracy = 1.0
[RandomForestClassifier]: log loss = 0.009605406241126959
[GradientBoostingClassifier]: accuracy = 0.9948186528497409
[GradientBoostingClassifier]: log loss = 0.0634269639671492
[SGDClassifier]: accuracy = 0.9326424870466321
[DecisionTreeClassifier]: accuracy = 1.0
[DecisionTreeClassifier]: log loss = 1.998401444325284e-15
[GaussianNB]: accuracy = 0.9792746113989638
[GaussianNB]: log loss = 0.2230335127098297
[KNeighborsClassifier]: accuracy = 1.0
[KNeighborsClassifier]: log loss = 1.998401444325284e-15
[SVC]: accuracy = 0.9896373056994818
[LinearSVC]: accuracy = 0.917098445595855
[LinearDiscriminantAnalysis]: accuracy = 0.9119170984455959
[LinearDiscriminantAnalysis]: log loss = 0.3275876472915633
[QuadraticDiscriminantAnalysis]: accuracy = 0.9948186528497409
[QuadraticDiscriminantAnalysis]: log loss = 0.021457710420767537




In [8]:
import pickle

selected_model = models["RandomForestClassifier"]

with open("handy_classifier.pkl", "wb") as f:
    pickle.dump(selected_model, f)
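A saved pipeline can later be restored with `pickle.load` and used for prediction. The round trip below is a self-contained sketch: it trains a tiny stand-in model on made-up angle-like features and writes it to a temporary file, so the names `demo_model`, `demo_classifier.pkl` and `sample` are hypothetical and the real `handy_classifier.pkl` is not touched.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tiny stand-in model trained on made-up data (4 angle-like features)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 180, size=(60, 4))
y_demo = (X_demo[:, 0] > 90).astype(int)
demo_model = make_pipeline(
    StandardScaler(), RandomForestClassifier(n_estimators=10, random_state=0)
).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_classifier.pkl")

    # Save the fitted pipeline (scaler and classifier together)
    with open(model_path, "wb") as f:
        pickle.dump(demo_model, f)

    # Load it back, as a separate script would do
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline should give the same predictions as the original
same = np.array_equal(demo_model.predict(X_demo), restored.predict(X_demo))
print("Predictions identical after reload:", same)

# Predict the class for one new sample with four features
sample = np.array([[170.0, 90.0, 10.0, 45.0]])
print("Predicted class:", restored.predict(sample))
```

Because the scaler is part of the pipeline, new samples are passed in unscaled. Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.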