# Selecting the Best Model

## Objective

Now that we trained a lot of different models, it's time to select the best one. In this notebook, we will try 4 different models:
- Random Forest: minimum transformation, max_depth = 30 and n_estimators = 100 (performance = 93.91%)
- KNN: geometric transformation, n_neighbors = 3, p = 19 and weights = "distance" (performance = 94.20%)
- SVM: geometric transformation, kernel = "rbf", C = 30 and gamma = 5 (performance = 97.02%)
- Essemble of the best KNN, the best SVM and the best Random Forest that uses geometric transformation (max_depth = 30 and n_estimators = 200, performance = 91.94%)

## Loading libraries and data

In [5]:
# importing important libraries

# transformations library
from transformations import minimum, geometric

# models
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

# loading data
import pickle

# other modules
from sklearn.model_selection import cross_val_score
from sklearn.metrics import recall_score, make_scorer
from sklearn.model_selection import GridSearchCV
import numpy as np

In [3]:
# get dataset
data_path = "TrainTestData/train_data.pickle"
data = pickle.load(open(data_path, "rb"))

In [4]:
# function that calculates weighted_accuracy
# weights are basead on the frequency of the letters in the portuguese alphabet 
# source: https://pt.wikipedia.org/wiki/Alfabeto_portugu%C3%AAs#Frequ%C3%AAncia_da_ocorr%C3%AAncia_de_letras
# H, K, J, X and Z are not present
LETTERS_FREQUENCY = [
    14.63,
    1.04,
    3.88,
    5.01,
    12.57,
    1.02,
    1.30,
    6.18,
    2.78,
    4.74,
    5.05,
    10.73,
    2.52,
    1.20,
    6.53,
    7.81,
    4.34,
    4.63,
    1.67,
    0.01,
    0.01,
]
def weighted_accuracy(y_true, y_pred):
    recall_array = recall_score(y_true, y_pred, average=None)
    weights_total = 0
    result = 0
    for recall, weight in zip(recall_array, LETTERS_FREQUENCY):
        weights_total += weight
        result += recall * weight
    return result / weights_total
weighted_accuracy_score = make_scorer(weighted_accuracy)

In [9]:
# Minumum transformation
minimum_X = []
for observation in data["features"]:
    minimum_X.append(minimum(observation))

# Geometric transformation
geometric_X = []
for observation in data["features"]:
    geometric_X.append(geometric(observation))

## Training Essemble

In [7]:
# Creating the classifier
forest = RandomForestClassifier(max_depth=30, n_estimators=200)
knn = KNeighborsClassifier(n_neighbors=3, p=19, weights="distance")
svm = SVC(kernel="rbf", C=30, gamma=5)

voting = VotingClassifier(
    estimators=[("rf", forest), ("knn", knn), ("svm", svm)],
    voting="hard"
)

In [10]:
np.mean(cross_val_score(voting, geometric_X, data["labels"], cv=5, n_jobs=-1, scoring=weighted_accuracy_score))

0.9576984089009951

Seems like this model does not perform as well as the svm model alone.

## Test and Time Performance

This is the final analysis, where we will compare the best 4 models by their test results and average prediction time

In [11]:
# Creating the classifiers
forest = RandomForestClassifier(max_depth=30, n_estimators=200,  n_jobs=-1)
knn = KNeighborsClassifier(n_neighbors=3, p=19, weights="distance", n_jobs=-1)
svm = SVC(kernel="rbf", C=30, gamma=5)

voting = VotingClassifier(
    estimators=[("rf", forest), ("knn", knn), ("svm", svm)],
    voting="hard",
    n_jobs=-1
)

best_forest = RandomForestClassifier(max_depth=30, n_estimators=100, n_jobs=-1)
best_knn = knn
best_svm = svm

In [18]:
# importing test data
# get dataset
data_path = "TrainTestData/test_data.pickle"
test_data = pickle.load(open(data_path, "rb"))

# Minumum transformation
test_minimum_X = []
for observation in test_data["features"]:
    test_minimum_X.append(minimum(observation))

# Geometric transformation
test_geometric_X = []
for observation in test_data["features"]:
    test_geometric_X.append(geometric(observation))

In [22]:
from time import time
models_info = [
    {"model": voting, "data": "geometric"},
    {"model": best_forest, "data": "minimum"},
    {"model": best_knn, "data": "geometric"},
    {"model": best_svm, "data": "geometric"}
]

for model_info in models_info:
    if model_info["data"] == "geometric":
        train_x = geometric_X
        test_x = test_geometric_X
    else:
        train_x = minimum_X
        test_x = test_minimum_X
    model = model_info["model"]
    model.fit(train_x, data["labels"])
    start = time()
    y_pred = model.predict(test_x)
    end = time()
    avg_time = (end - start) / len(test_data)
    score = weighted_accuracy(test_data["labels"], y_pred)
    print(model.__class__.__name__)
    print(f"\t Score: {round(100 * score, 2)}%")
    print(f"\t Time: {round(avg_time, 5)} seconds")
    print()

VotingClassifier
	 Score: 95.61%
	 Time: 1.2446 seconds

RandomForestClassifier
	 Score: 93.61%
	 Time: 0.02499 seconds

KNeighborsClassifier
	 Score: 94.1%
	 Time: 0.99616 seconds

SVC
	 Score: 97.22%
	 Time: 0.18441 seconds



## Conclusion

In terms of performance, the SVM model is by far the best, with a score of about 97.22%. However, the other models also have good performances.
<br><br>
In terms of prediction time, RandomFrest has a time much lower than the others, while SVM time is also fine.
<br><br>
The choice of best model depends on the equilibrium of these two variables. SVM seems to have a good balance, since it has the best score and the second best time. However, the RandomForest model can be good if sacrifing some performance for time is essential. The KNN and the Voting models don't seem to be good choices, since they are slower and have a worst performance compared to SVM. 
<br><br>
Best Model: SVM with geometric transformation, kernel = "rbf", C = 30 and gamma = 5