# Handy
## Train model

Now let's train the model. Handy trains a variety of classification models, and the one that performs best is then selected. I deliberately included many different models to learn how each of them works.
* [Random Forest (decision trees)](https://en.wikipedia.org/wiki/Random_forest)
* [Gradient Boosting](https://en.wikipedia.org/wiki/Gradient_boosting)
* [Stochastic Gradient Descent](https://en.wikipedia.org/wiki/Stochastic_gradient_descent)
* [Decision Tree](https://en.wikipedia.org/wiki/Decision_tree)
* [Naive Bayes Classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier)
* [K-nearest Neighbors Algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
* [Support Vector Classifier (SVC)](https://en.wikipedia.org/wiki/Support_vector_machine)
* [Linear Support Vector Classifier (LinearSVC)](https://en.wikipedia.org/wiki/Support_vector_machine#Linear_SVM)
* [Linear Discriminant Analysis](https://en.wikipedia.org/wiki/Linear_discriminant_analysis)
* [Quadratic Discriminant Analysis](https://en.wikipedia.org/wiki/Quadratic_classifier)

In [56]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [57]:
import pandas as pd
from sklearn.pipeline import make_pipeline 
from sklearn.preprocessing import StandardScaler 

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis, QuadraticDiscriminantAnalysis

In [58]:
# Load the data

from os import path


if not path.exists("data.csv"):
    # Raise instead of calling exit(), so the notebook stops with a clear error
    raise FileNotFoundError("data.csv doesn't exist! Please run the 2_Process_Data.ipynb notebook first.")

df = pd.read_csv("data.csv")

X = df.drop("class_name", axis=1)
y = df["class_name"]

# Number of samples per class
df.groupby("class_name").size()

class_name
0    200
1    200
2    200
3    200
4    200
5    200
6    200
7    200
8    200
9    200
dtype: int64

In [59]:
# Split data into train and test sets

from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

y_test = y_test.to_numpy()

print(f"Train: {len(X_train)}")
print(f"Test: {len(X_test)}")

Train: 1600
Test: 400
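
The split above is purely random, so the 400 test rows are not guaranteed to contain exactly 40 examples of each class. `train_test_split` accepts a `stratify=y` argument that keeps the class proportions the same in both sets. The sketch below illustrates what stratification does using only the standard library; `stratified_split` is a hypothetical helper written for illustration, not part of Handy or scikit-learn.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=2):
    """Return (train_indices, test_indices) with the same class mix in both."""
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_indices, test_indices = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_size)
        test_indices.extend(indices[:n_test])
        train_indices.extend(indices[n_test:])
    return train_indices, test_indices


# Same shape as the dataset above: 10 classes with 200 rows each.
labels = [class_id for class_id in range(10) for _ in range(200)]
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(Counter(labels[i] for i in test_idx))
```

With a balanced dataset like this one, a random split is usually close to stratified anyway, so this matters more when some classes are rare.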


In [60]:
pipelines = {
    "RandomForestClassifier": make_pipeline(StandardScaler(), RandomForestClassifier()),
    "GradientBoostingClassifier": make_pipeline(StandardScaler(), GradientBoostingClassifier()),
    "SGDClassifier": make_pipeline(StandardScaler(), SGDClassifier()),
    "DecisionTreeClassifier": make_pipeline(StandardScaler(), DecisionTreeClassifier()),
    "GaussianNB": make_pipeline(StandardScaler(), GaussianNB()),
    "KNeighborsClassifier": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "LinearSVC": make_pipeline(StandardScaler(), LinearSVC()),
    "LinearDiscriminantAnalysis": make_pipeline(StandardScaler(), LinearDiscriminantAnalysis()),
    "QuadraticDiscriminantAnalysis": make_pipeline(StandardScaler(), QuadraticDiscriminantAnalysis()),
}
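
Every pipeline starts with `StandardScaler`, which rescales each feature column to zero mean and unit variance before the classifier sees it. This matters most for distance- and margin-based models such as k-nearest neighbors and the SVCs; tree-based models are largely insensitive to feature scale. A minimal sketch of that transformation for a single column, using only the standard library (`standardize` is a hypothetical helper, and the sample values are made up for illustration):

```python
from statistics import fmean, pstdev


def standardize(column):
    """Rescale values to zero mean and unit variance (population std)."""
    mean = fmean(column)
    std = pstdev(column)
    if std == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]


# Illustrative values only, not taken from data.csv.
column = [150.0, 160.0, 170.0, 180.0, 190.0]
scaled = standardize(column)
print([round(value, 3) for value in scaled])
```

Wrapping the scaler in the pipeline means the mean and standard deviation are learned from the training data only and then reused unchanged at prediction time.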

In [61]:
from sklearn.metrics import accuracy_score, log_loss

models = {}

for name, algorithm in pipelines.items():
    model = algorithm.fit(X_train, y_train)

    # Calculate the metrics (accuracy, plus log loss where available)
    y_predicted = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_predicted)
    print(f"[{name}]: accuracy = {accuracy}")

    try:
        # Not every classifier exposes predict_proba (e.g. SVC by default)
        test_probabilities = model.predict_proba(X_test)
        loss = log_loss(y_test, test_probabilities)
        print(f"[{name}]: log loss = {loss}")
    except AttributeError:
        pass
    
    models[name] = model

    # incorrect_indices = [i for i in range(len(y_test)) if y_test[i] != y_predicted[i]]
    # incorrect_predictions = [(y_test[i], y_predicted[i]) for i in incorrect_indices]

    # # Print the actual incorrect predictions
    # print("Incorrect predictions:")
    # for true_label, predicted_label in incorrect_predictions:
    #     print(f"True: {true_label}, Predicted: {predicted_label}")
    

[RandomForestClassifier]: accuracy = 0.9775
[RandomForestClassifier]: log loss = 0.17702533328706957
[GradientBoostingClassifier]: accuracy = 0.96
[GradientBoostingClassifier]: log loss = 0.14423208834160978
[SGDClassifier]: accuracy = 0.6525
[DecisionTreeClassifier]: accuracy = 0.945
[DecisionTreeClassifier]: log loss = 1.9824009364014454
[GaussianNB]: accuracy = 0.855
[GaussianNB]: log loss = 0.5994193279360206
[KNeighborsClassifier]: accuracy = 0.9775
[KNeighborsClassifier]: log loss = 0.3921929029873924
[SVC]: accuracy = 0.92
[LinearSVC]: accuracy = 0.8
[LinearDiscriminantAnalysis]: accuracy = 0.6925
[LinearDiscriminantAnalysis]: log loss = 0.9549862885554117
[QuadraticDiscriminantAnalysis]: accuracy = 0.935
[QuadraticDiscriminantAnalysis]: log loss = 0.3728535240093255
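
Accuracy only counts whether the top prediction is right, while log loss also penalizes confident wrong probabilities. That is consistent with the decision tree scoring far worse on log loss than on accuracy, since an unpruned tree tends to output probabilities of exactly 0 or 1. Models without `predict_proba` (here `SGDClassifier`, `SVC`, and `LinearSVC` with their default settings) report no log loss at all. A minimal sketch of both metrics in plain Python, with made-up labels and probabilities; the helper names are hypothetical and the clipping constant `eps` is an assumption, not necessarily what scikit-learn uses:

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def multiclass_log_loss(y_true, probabilities, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    total = 0.0
    for true_class, row in zip(y_true, probabilities):
        # Clip so that a probability of exactly 0 does not give log(0).
        p = min(max(row[true_class], eps), 1.0)
        total -= math.log(p)
    return total / len(y_true)


# Illustrative values only: three samples, three classes.
y_true = [0, 1, 2]
probabilities = [
    [0.8, 0.1, 0.1],    # confident and correct
    [0.3, 0.4, 0.3],    # correct, but unsure
    [0.9, 0.05, 0.05],  # confident and wrong
]
y_pred = [row.index(max(row)) for row in probabilities]

print(accuracy(y_true, y_pred))
print(multiclass_log_loss(y_true, probabilities))
```

The third sample dominates the log loss even though it is a single mistake, which is why two models with the same accuracy can have very different log loss values.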




In [62]:
import pickle

selected_model = models["RandomForestClassifier"]

with open("handy_classifier.pkl", "wb") as f:
    pickle.dump(selected_model, f)
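
The model above is picked by hand from the printed scores. One way to make that choice reproducible is to rank by accuracy and break ties with log loss. This is a sketch of one possible rule, not what Handy itself does; the scores are copied from the output above, and `pick_best` is a hypothetical helper.

```python
import math

# (accuracy, log loss) copied from the run above; None where the model
# has no predict_proba and therefore no log loss was reported.
scores = {
    "RandomForestClassifier": (0.9775, 0.17702533328706957),
    "GradientBoostingClassifier": (0.96, 0.14423208834160978),
    "SGDClassifier": (0.6525, None),
    "DecisionTreeClassifier": (0.945, 1.9824009364014454),
    "GaussianNB": (0.855, 0.5994193279360206),
    "KNeighborsClassifier": (0.9775, 0.3921929029873924),
    "SVC": (0.92, None),
    "LinearSVC": (0.8, None),
    "LinearDiscriminantAnalysis": (0.6925, 0.9549862885554117),
    "QuadraticDiscriminantAnalysis": (0.935, 0.3728535240093255),
}


def pick_best(scores):
    """Highest accuracy wins; ties go to the lowest log loss."""
    def sort_key(item):
        _, (accuracy, loss) = item
        # A missing log loss sorts last among models with equal accuracy.
        return (-accuracy, math.inf if loss is None else loss)

    return min(scores.items(), key=sort_key)[0]


print(pick_best(scores))  # RandomForestClassifier
```

`RandomForestClassifier` and `KNeighborsClassifier` tie on accuracy in this run, and the lower log loss of the random forest decides the tie. Because the models are trained without a fixed `random_state`, the ranking may differ between runs.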

In [63]:
# Test
# Should predict 7
selected_model.predict([[166.76, 179.76, 79.35, 6.82]])[0]



6
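
To reuse the saved classifier outside this notebook, the pickle file can be loaded back and called the same way. A minimal sketch, assuming `handy_classifier.pkl` sits in the working directory and that scikit-learn is installed in a version compatible with the one used for training; `load_classifier` is a hypothetical helper. Only unpickle files you trust, since `pickle.load` can execute arbitrary code.

```python
import pickle
from pathlib import Path


def load_classifier(model_path="handy_classifier.pkl"):
    """Load a pickled model, failing with a clear message if it is missing."""
    model_path = Path(model_path)
    if not model_path.exists():
        raise FileNotFoundError(f"{model_path} not found; run the training notebook first.")
    with model_path.open("rb") as f:
        return pickle.load(f)


# Only attempt a prediction when the trained model is actually present.
if Path("handy_classifier.pkl").exists():
    classifier = load_classifier()
    # Same four feature values as the test cell above.
    print(classifier.predict([[166.76, 179.76, 79.35, 6.82]])[0])
```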