# Selecting the Best Model

## Objective

The objective of this notebook is to test the best models found during training.

## Loading libraries and data

In [1]:
# model library
from LibrasModel import LibrasModel, weighted_accuracy_score, weighted_accuracy_scorer

# models
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# loading data
import pickle
import joblib

# other modules
import numpy as np
from sklearn.metrics import accuracy_score

In [2]:
# get dataset
with open("TrainTestData/train_data.pickle", "rb") as f:
    train_data = pickle.load(f)
with open("TrainTestData/test_data.pickle", "rb") as f:
    test_data = pickle.load(f)

## Defining the models

See the individual notebooks (.ipynb) for how each of these models was selected.

In [3]:
models = {
    "knn1": KNeighborsClassifier(n_neighbors=3, weights="distance", p=12, n_jobs=-1),
    "knn2": KNeighborsClassifier(n_neighbors=3, weights="distance", p=11, n_jobs=-1),
    "knn3": KNeighborsClassifier(n_neighbors=3, weights="distance", p=13, n_jobs=-1),
    "rfc1": RandomForestClassifier(max_depth=20, n_estimators=200),
    "rfc2": RandomForestClassifier(max_depth=15, n_estimators=200),
    "rfc3": RandomForestClassifier(max_depth=20, n_estimators=500),
    "svm1": SVC(C=20, gamma=5, kernel="rbf"),
    "svm2": SVC(C=50, gamma=5, kernel="rbf"),
    "svm3": SVC(C=40, gamma=5, kernel="rbf")
}

## Testing all models

In [15]:
from time import time
import os

In [20]:
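def print_metrics(base_model, train_data, test_data, has_z=False):
    # Fit base_model on the training set, then print test-set accuracy,
    # prediction time per sample, and the serialized model size.
    model = LibrasModel(base_model, has_z=has_z)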

    X_train = np.array(train_data["features"])
    y_train = np.array(train_data["labels"])
    X_test = np.array(test_data["features"])
    y_test = np.array(test_data["labels"])

    model.fit(X_train, y_train)

    t = time()
    y_pred = model.predict(X_test)
    t = time() - t

    acc_w = weighted_accuracy_score(y_test, y_pred)
    acc = accuracy_score(y_test, y_pred)

    # Dump the wrapped estimator to disk to measure its serialized size
    joblib.dump(model.model, "model.pkl")
    size = os.path.getsize("model.pkl")
    if size < 1024:
        size_str = f"{size}B"
    elif size < 1024 ** 2:
        size_str = f"{round(size / 1024, 2)}KB"
    elif size < 1024 ** 3:
        size_str = f"{round(size / (1024 ** 2), 2)}MB"
    else:
        size_str = f"{round(size / (1024 ** 3), 2)}GB"

    print(f"Weighted Accuracy: {round(100 * np.mean(acc_w), 2)}%")
    print(f"Accuracy: {round(100 * np.mean(acc), 2)}%")
    print(f"Time per prediction: {1000 * t / len(y_test)} ms")
    print(f"Size (bytes): {size_str}")
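
A single `time()` difference around one `predict` call can be noisy, especially for the fastest models. The following is a minimal sketch of a more stable measurement, assuming any callable that accepts the feature array; the helper name `time_per_prediction_ms` and its `repeats` parameter are hypothetical and not part of `LibrasModel`.

```python
from time import perf_counter


def time_per_prediction_ms(predict, X, repeats=5):
    """Return the best-of-`repeats` prediction time per sample, in milliseconds.

    `predict` is any callable that takes the feature array, e.g. `model.predict`.
    This helper is an illustrative assumption, not part of the original notebook.
    """
    if len(X) == 0:
        raise ValueError("X must contain at least one sample")
    best = float("inf")
    for _ in range(repeats):
        start = perf_counter()  # monotonic, high-resolution clock
        predict(X)
        best = min(best, perf_counter() - start)
    return 1000 * best / len(X)
```

Inside `print_metrics`, this could be called as `time_per_prediction_ms(model.predict, X_test)`. Because it runs `predict` several times, `y_pred` would still need to come from its own `predict` call.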

In [21]:
for name, model in models.items():
    print(name)
    print_metrics(model, train_data, test_data)
    print("-"*50)

knn1
Weighted Accuracy: 94.17%
Accuracy: 93.97%
Time per prediction: 0.7903448576356294 ms
Size (bytes): 1.45MB
--------------------------------------------------
knn2
Weighted Accuracy: 94.05%
Accuracy: 93.88%
Time per prediction: 0.8272102020815669 ms
Size (bytes): 1.45MB
--------------------------------------------------
knn3
Weighted Accuracy: 94.19%
Accuracy: 94.06%
Time per prediction: 0.8235238524575772 ms
Size (bytes): 1.45MB
--------------------------------------------------
rfc1
Weighted Accuracy: 93.49%
Accuracy: 93.54%
Time per prediction: 0.03611661564369432 ms
Size (bytes): 42.64MB
--------------------------------------------------
rfc2
Weighted Accuracy: 93.55%
Accuracy: 93.88%
Time per prediction: 0.0357818849943097 ms
Size (bytes): 40.73MB
--------------------------------------------------
rfc3
Weighted Accuracy: 93.1%
Accuracy: 93.45%
Time per prediction: 0.09048796233999432 ms
Size (bytes): 107.59MB
--------------------------------------------------
svm1
Weighted Acc

## Conclusion

In terms of performance, the SVM 1 model is by far the best, with a score of 97.94%. It also has very good accuracy and time per prediction, predicting a label in less than 0.1 millisecond. The SVM models also require less than 1 MB of disk space to store, so they can easily be downloaded in most applications.

The KNN models showed no advantage over the SVM models, so they are probably not interesting choices. The random forest 1 and 2 models, although they have lower accuracy scores than the other models, make predictions extremely fast, so they can be a good choice depending on the application. However, they require more disk space to store than the other models.
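
The trade-off between accuracy, prediction time, and disk size can be expressed as a simple filter. The sketch below assumes the figures printed in the test output (rounded), and leaves out the SVM rows because their output is truncated in this notebook; the function `pick_model` and the dictionary keys are hypothetical names chosen for illustration.

```python
# Figures copied from the printed test output (time rounded to 3 decimals).
# SVM rows are omitted because their output is truncated in this notebook.
results = {
    "knn1": {"weighted_acc": 94.17, "ms": 0.790, "mb": 1.45},
    "knn2": {"weighted_acc": 94.05, "ms": 0.827, "mb": 1.45},
    "knn3": {"weighted_acc": 94.19, "ms": 0.824, "mb": 1.45},
    "rfc1": {"weighted_acc": 93.49, "ms": 0.036, "mb": 42.64},
    "rfc2": {"weighted_acc": 93.55, "ms": 0.036, "mb": 40.73},
    "rfc3": {"weighted_acc": 93.10, "ms": 0.090, "mb": 107.59},
}


def pick_model(results, max_ms=None, max_mb=None):
    """Return the name with the highest weighted accuracy among the models
    that satisfy the optional time and size limits, or None if none qualify."""
    candidates = {
        name: r
        for name, r in results.items()
        if (max_ms is None or r["ms"] <= max_ms)
        and (max_mb is None or r["mb"] <= max_mb)
    }
    if not candidates:
        return None
    return max(candidates, key=lambda name: candidates[name]["weighted_acc"])


print(pick_model(results))              # knn3
print(pick_model(results, max_ms=0.1))  # rfc2
```

With no limits the most accurate of the listed models wins; adding a 0.1 ms time limit leaves only the random forests, which matches the observation that they suit latency-sensitive applications.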