# Selecting the Best Model

## Objective

Now that we trained a lot of different models, it's time to select the best one. In this notebook, we will try 4 different models:
- Random Forest: minimum2D transformation, max_depth = 30 and n_estimators = 100 (performance = 94.69%)
- KNN: geometric transformation, n_neighbors = 3, p = 9 and weights = "distance" (performance = 94.22%) 
- SVM: minimum3D transformation, kernel = "rbf", C = 40 and gamma = 5 (performance = 97.33%)
- SVM: minimum2D transformation, kernel = "rbf", C = 40 and gamma = 5 (performance = 97.20%)
- Essemble of the best KNN, the best SVM and the best Random Forest that uses geometric transformation (max_depth = 30 and n_estimators = 200, performance = 91.94%)

## Loading libraries and data

In [1]:
# importing important libraries

# transformations library
from transformations import minimum, geometric, minimum2D, geometric2D

# models
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

# loading data
import pickle

# other modules
from sklearn.model_selection import cross_val_score
from sklearn.metrics import recall_score, make_scorer
from sklearn.model_selection import GridSearchCV
import numpy as np

In [2]:
# get dataset
data_path = "TrainTestData/train_data.pickle"
data = pickle.load(open(data_path, "rb"))

In [3]:
# function that calculates weighted_accuracy
# weights are basead on the frequency of the letters in the portuguese alphabet 
# source: https://pt.wikipedia.org/wiki/Alfabeto_portugu%C3%AAs#Frequ%C3%AAncia_da_ocorr%C3%AAncia_de_letras
# H, K, J, X and Z are not present
LETTERS_FREQUENCY = [
    14.63,
    1.04,
    3.88,
    5.01,
    12.57,
    1.02,
    1.30,
    6.18,
    2.78,
    4.74,
    5.05,
    10.73,
    2.52,
    1.20,
    6.53,
    7.81,
    4.34,
    4.63,
    1.67,
    0.01,
    0.01,
]
def weighted_accuracy(y_true, y_pred):
    recall_array = recall_score(y_true, y_pred, average=None)
    weights_total = 0
    result = 0
    for recall, weight in zip(recall_array, LETTERS_FREQUENCY):
        weights_total += weight
        result += recall * weight
    return result / weights_total
weighted_accuracy_score = make_scorer(weighted_accuracy)

In [4]:
# Minumum transformation
minimum_X = []
for observation in data["features"]:
    minimum_X.append(minimum(observation))

# Geometric transformation
geometric_X = []
for observation in data["features"]:
    geometric_X.append(geometric(observation))

# Minumum 2D transformation
minimum2D_X = []
for observation in data["features"]:
    minimum2D_X.append(minimum2D(observation))

# Geometric 2D transformation
geometric2D_X = []
for observation in data["features"]:
    geometric2D_X.append(geometric2D(observation))

## Training Essemble

In [5]:
# Creating the classifier
forest = RandomForestClassifier(max_depth=30, n_estimators=100)
knn = KNeighborsClassifier(n_neighbors=3, p=9, weights="distance")
svm = SVC(kernel="rbf", C=40, gamma=5)

voting = VotingClassifier(
    estimators=[("rf", forest), ("svm", svm)],
    voting="hard"
)

In [6]:
# Minimum 3D
np.mean(cross_val_score(voting, minimum_X, data["labels"], cv=5, n_jobs=-1, scoring=weighted_accuracy_score))

0.9587557538048017

In [7]:
# Minimum 2D
np.mean(cross_val_score(voting, minimum2D_X, data["labels"], cv=5, n_jobs=-1, scoring=weighted_accuracy_score))

0.9579171213128987

Seems like this model does not perform as well as the svm model alone.

## Test and Time Performance

This is the final analysis, where we will compare the best models by their test results and average prediction time.
<br><br>
We will consider both types of transformations too.

In [8]:
# Creating the classifiers
forest = RandomForestClassifier(max_depth=30, n_estimators=100,  n_jobs=-1)
knn = KNeighborsClassifier(n_neighbors=3, p=9, weights="distance", n_jobs=-1)
svm = SVC(kernel="rbf", C=40, gamma=5)

voting = VotingClassifier(
    estimators=[("rf", forest), ("svm", svm)],
    voting="hard",
    n_jobs=-1
)

best_forest = forest
best_knn = knn
best_svm = svm

In [9]:
# importing test data
# get dataset
data_path = "TrainTestData/test_data.pickle"
test_data = pickle.load(open(data_path, "rb"))

# Minumum transformation
test_minimum_X = []
for observation in test_data["features"]:
    test_minimum_X.append(minimum(observation))

# Geometric transformation
test_geometric_X = []
for observation in test_data["features"]:
    test_geometric_X.append(geometric(observation))

# Minumum2D transformation
test_minimum2D_X = []
for observation in test_data["features"]:
    test_minimum2D_X.append(minimum2D(observation))

# Geometric2D transformation
test_geometric2D_X = []
for observation in test_data["features"]:
    test_geometric2D_X.append(geometric2D(observation))

In [12]:
from time import time
models_info = [
    {"model": voting, "data": "minimum2D"},
    {"model": voting, "data": "geometric2D"},
    {"model": best_forest, "data": "minimum2D"},
    {"model": best_forest, "data": "geometric2D"},
    {"model": best_knn, "data": "geometric"},
    {"model": best_svm, "data": "minimum"},
    {"model": best_svm, "data": "minimum2D"},
    {"model": best_svm, "data": "geometric"},
    {"model": best_svm, "data": "geometric2D"}
]

models = [voting, best_forest, best_knn, best_svm]
data_type = [
    {"name": "minimum3D", "train": minimum_X, "test": test_minimum_X},
    {"name": "minimum2D", "train": minimum2D_X, "test": test_minimum2D_X},
    {"name": "geometric3D", "train": geometric_X, "test": test_geometric_X},
    {"name": "geometric2D", "train": geometric2D_X, "test": test_geometric2D_X}
]

for model in models:
    for dataset in data_type:
        train_x = dataset["train"]
        test_x = dataset["test"]
        model.fit(train_x, data["labels"])
        start = time()
        y_pred = model.predict(test_x)
        end = time()
        avg_time = (end - start) / len(test_data)
        score = weighted_accuracy(test_data["labels"], y_pred)
        print(model.__class__.__name__, dataset["name"])
        print(f"\t Score: {round(100 * score, 2)}%")
        print(f"\t Time: {round(avg_time, 5)} seconds")
        print()

VotingClassifier minimum3D
	 Score: 95.27%
	 Time: 0.23847 seconds

VotingClassifier minimum2D
	 Score: 96.26%
	 Time: 0.22359 seconds

VotingClassifier geometric3D
	 Score: 94.31%
	 Time: 0.24105 seconds

VotingClassifier geometric2D
	 Score: 93.78%
	 Time: 0.20625 seconds

RandomForestClassifier minimum3D
	 Score: 93.25%
	 Time: 0.06825 seconds

RandomForestClassifier minimum2D
	 Score: 94.43%
	 Time: 0.02497 seconds

RandomForestClassifier geometric3D
	 Score: 90.18%
	 Time: 0.05226 seconds

RandomForestClassifier geometric2D
	 Score: 91.46%
	 Time: 0.0213 seconds

KNeighborsClassifier minimum3D
	 Score: 89.21%
	 Time: 1.99604 seconds

KNeighborsClassifier minimum2D
	 Score: 89.03%
	 Time: 1.33978 seconds

KNeighborsClassifier geometric3D
	 Score: 93.27%
	 Time: 1.71854 seconds

KNeighborsClassifier geometric2D
	 Score: 93.95%
	 Time: 1.17566 seconds

SVC minimum3D
	 Score: 97.23%
	 Time: 0.19882 seconds

SVC minimum2D
	 Score: 97.12%
	 Time: 0.19548 seconds

SVC geometric3D
	 Score

## Conclusion

In terms of performance, the SVM model is by far the best, with a score of about 97.22% in both minimum3D and geometric2D. Since geometric2D uses less parameters, it will be the choice. However, the other models also have good performances.
<br><br>
In terms of prediction time, RandomFrest has a time much lower than the others, while SVM time is also fine.
<br><br>
The choice of best model depends on the equilibrium of these two variables. SVM seems to have a good balance, since it has the best score and the second best time. However, the RandomForest model can be good if sacrifing some performance for time is essential. The KNN and the Voting models don't seem to be good choices, since they are slower and have a worst performance compared to SVM. 
<br><br>
Best Model: SVM with geometric2D transformation, kernel = "rbf", C = 40 and gamma = 5