# SVM Models

## Objective

The objective of this notebook is to train and test different SVM models, varying their hyperparameters, in order to find the best SVM model.
<br><br>
As discussed in "basic_models.ipynb", the transformations that will be used are Minimum and Geometric.

## Loading libraries and data

In [1]:
# importing important libraries

# transformations library
from transformations import minimum, geometric

# models
from sklearn.svm import SVC

# loading data
import pickle

# other modules
from sklearn.model_selection import cross_val_score
from sklearn.metrics import recall_score, make_scorer
from sklearn.model_selection import GridSearchCV
import numpy as np

In [2]:
# get base_dataset
data_path = "TrainTestData/train_data.pickle"
# open the file in a context manager so it is closed after loading
with open(data_path, "rb") as data_file:
    data = pickle.load(data_file)

In [3]:
# function that calculates weighted_accuracy
# weights are based on the frequency of the letters in the Portuguese alphabet
# source: https://pt.wikipedia.org/wiki/Alfabeto_portugu%C3%AAs#Frequ%C3%AAncia_da_ocorr%C3%AAncia_de_letras
# H, K, J, X and Z are not present
LETTERS_FREQUENCY = [
    14.63,
    1.04,
    3.88,
    5.01,
    12.57,
    1.02,
    1.30,
    6.18,
    2.78,
    4.74,
    5.05,
    10.73,
    2.52,
    1.20,
    6.53,
    7.81,
    4.34,
    4.63,
    1.67,
    0.01,
    0.01,
]
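def weighted_accuracy(y_true, y_pred):
    # per-class recall, returned in sorted label order
    recall_array = recall_score(y_true, y_pred, average=None)
    weights_total = 0
    result = 0
    # average the recalls, weighting each class by its letter frequency
    for recall, weight in zip(recall_array, LETTERS_FREQUENCY):
        weights_total += weight
        result += recall * weight
    return result / weights_total

# scorer object that GridSearchCV can use
weighted_accuracy_score = make_scorer(weighted_accuracy)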
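
To make the metric concrete, here is a small pure-Python sketch of the same idea: compute the recall of each class, then average the recalls using the class weights. The three classes, their predictions, and the helper names `per_class_recall` and `weighted_recall` are made up for illustration; the weights are simply the first three entries of `LETTERS_FREQUENCY`.

```python
def per_class_recall(y_true, y_pred, labels):
    """Recall of each label: correct predictions / true occurrences."""
    recalls = []
    for label in labels:
        total = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(hits / total if total else 0.0)
    return recalls

def weighted_recall(y_true, y_pred, labels, weights):
    """Average of the per-class recalls, weighted by the class weights."""
    recalls = per_class_recall(y_true, y_pred, labels)
    return sum(r * w for r, w in zip(recalls, weights)) / sum(weights)

# made-up example with three classes (illustration only)
y_true = ["A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "B", "C", "C", "A"]
labels = ["A", "B", "C"]
weights = [14.63, 1.04, 3.88]

recalls = per_class_recall(y_true, y_pred, labels)
score = weighted_recall(y_true, y_pred, labels, weights)
print(recalls, score)
```

Because the frequent class "A" is predicted perfectly, the weighted score ends up well above the plain average of the three recalls. Note that pairing recalls with weights by position only works when both follow the same class order.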

## Choosing hyperparameters and transformations

In [4]:
# Minimum transformation
minimum_X = []
for observation in data["features"]:
    minimum_X.append(minimum(observation))

# Geometric transformation
geometric_X = []
for observation in data["features"]:
    geometric_X.append(geometric(observation))

In [5]:
# hyperparameters for first Grid Search
param_grid = {
    "C": [1, 10, 20],
    "kernel": ["poly", "rbf"],
    "gamma": ["scale", 0.1, 5]
}
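
As a rough check on the cost of this search, the grid can be expanded by hand. The sketch below uses only the standard library and assumes the 5-fold setting (`cv=5`) used in the grid search cells; `candidates` and `n_folds` are illustrative names.

```python
from itertools import product

param_grid = {
    "C": [1, 10, 20],
    "kernel": ["poly", "rbf"],
    "gamma": ["scale", 0.1, 5],
}

# every combination of the listed values is one candidate model
keys = list(param_grid)
candidates = [dict(zip(keys, values)) for values in product(*param_grid.values())]

n_folds = 5  # assumed: matches cv=5 in the grid search calls
print(len(candidates), "candidates,", len(candidates) * n_folds, "cross-validation fits")
```

Each extra value added to one hyperparameter multiplies the number of fits, which is why the later searches narrow the grid instead of widening it.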

## First Grid Search

In [6]:
# Minimum transformation
svm = SVC()
grid_search_minimum = GridSearchCV(svm, param_grid, cv=5, scoring=weighted_accuracy_score, return_train_score=True, n_jobs=-1)

grid_search_minimum.fit(minimum_X, data["labels"])

In [18]:
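cvres = grid_search_minimum.cv_results_
# map each mean test score to its parameters
# note: tied scores share one dict key, so tied rows all print the same params
results = dict(zip(cvres["mean_test_score"], cvres["params"]))
# print the candidates from best to worst
scores = sorted(cvres["mean_test_score"], reverse=True)
for mean_score in scores:
    print(mean_score, results[mean_score])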

0.9607775268561177 {'C': 20, 'gamma': 5, 'kernel': 'poly'}
0.9607775268561177 {'C': 20, 'gamma': 5, 'kernel': 'poly'}
0.9607775268561177 {'C': 20, 'gamma': 5, 'kernel': 'poly'}
0.9475073009645938 {'C': 20, 'gamma': 'scale', 'kernel': 'poly'}
0.9380149122348957 {'C': 10, 'gamma': 'scale', 'kernel': 'poly'}
0.9259973785448009 {'C': 20, 'gamma': 5, 'kernel': 'rbf'}
0.9229718648311153 {'C': 10, 'gamma': 5, 'kernel': 'rbf'}
0.9223222950999894 {'C': 20, 'gamma': 'scale', 'kernel': 'rbf'}
0.8990502274438257 {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
0.8958739754683739 {'C': 20, 'gamma': 0.1, 'kernel': 'rbf'}
0.8923283067726322 {'C': 20, 'gamma': 0.1, 'kernel': 'poly'}
0.8734125597296389 {'C': 1, 'gamma': 5, 'kernel': 'rbf'}
0.8646992601926333 {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
0.8577876520253541 {'C': 10, 'gamma': 0.1, 'kernel': 'poly'}
0.8462172801153972 {'C': 1, 'gamma': 'scale', 'kernel': 'poly'}
0.7614922320217931 {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
0.6848391863649963 {'C'
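
Because the listing is built from a dictionary keyed by score, candidates with identical scores collapse into a single entry, which is why some rows repeat. A possible alternative, sketched here on made-up values rather than the real `cv_results_`, sorts the (score, params) pairs directly so that every candidate is printed exactly once.

```python
# made-up stand-in for grid_search.cv_results_ (illustration only)
cvres = {
    "mean_test_score": [0.91, 0.96, 0.96],
    "params": [{"C": 1}, {"C": 10}, {"C": 20}],
}

# sort the pairs by score only; tied candidates keep their original order
ranked = sorted(
    zip(cvres["mean_test_score"], cvres["params"]),
    key=lambda pair: pair[0],
    reverse=True,
)
for mean_score, params in ranked:
    print(mean_score, params)
```

Sorting on the score alone (through `key`) also avoids comparing the parameter dictionaries, which Python cannot order.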

In [19]:
# Geometric transformation
svm = SVC()
grid_search_geometric = GridSearchCV(svm, param_grid, cv=5, scoring=weighted_accuracy_score, return_train_score=True, n_jobs=-1)

grid_search_geometric.fit(geometric_X, data["labels"])

In [20]:
cvres = grid_search_geometric.cv_results_ 
results = dict(zip(cvres["mean_test_score"], cvres["params"]))
scores = sorted(cvres["mean_test_score"], reverse=True)
for mean_score in scores:
    print(mean_score, results[mean_score])

0.9700429965939914 {'C': 20, 'gamma': 5, 'kernel': 'rbf'}
0.9683711625700694 {'C': 10, 'gamma': 5, 'kernel': 'rbf'}
0.9653497603425798 {'C': 1, 'gamma': 5, 'kernel': 'poly'}
0.963743866736972 {'C': 20, 'gamma': 'scale', 'kernel': 'rbf'}
0.9634501562929383 {'C': 10, 'gamma': 5, 'kernel': 'poly'}
0.9626830907891886 {'C': 20, 'gamma': 5, 'kernel': 'poly'}
0.9487641249280229 {'C': 10, 'gamma': 'scale', 'kernel': 'rbf'}
0.9395911038503113 {'C': 20, 'gamma': 'scale', 'kernel': 'poly'}
0.9365061731659807 {'C': 1, 'gamma': 5, 'kernel': 'rbf'}
0.9219483363591532 {'C': 10, 'gamma': 'scale', 'kernel': 'poly'}
0.8540261277440571 {'C': 20, 'gamma': 0.1, 'kernel': 'rbf'}
0.8209098545164704 {'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}
0.7964820251651245 {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
0.7507057401676055 {'C': 1, 'gamma': 'scale', 'kernel': 'poly'}
0.5347261931610161 {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'}
0.3829733655155009 {'C': 20, 'gamma': 0.1, 'kernel': 'poly'}
0.28819588952072067 {'C': 1

The best model uses the Geometric transformation. It sits at the highest values of C and gamma in the grid, so it may be worth trying higher values of both.
<br><br>
The best kernel was rbf, but since poly also appeared among the best models, we will keep it in the next search.

## Second Grid Search

In [21]:
# hyperparameters for second Grid Search
param_grid = {
    "C": [20, 40, 60],
    "kernel": ["poly", "rbf"],
    "gamma": [5, 10, 20]
}

In [22]:
# Geometric transformation
svm = SVC()
grid_search_geometric = GridSearchCV(svm, param_grid, cv=5, scoring=weighted_accuracy_score, return_train_score=True, n_jobs=-1)

grid_search_geometric.fit(geometric_X, data["labels"])

In [23]:
cvres = grid_search_geometric.cv_results_ 
results = dict(zip(cvres["mean_test_score"], cvres["params"]))
scores = sorted(cvres["mean_test_score"], reverse=True)
for mean_score in scores:
    print(mean_score, results[mean_score])

0.9700429965939914 {'C': 20, 'gamma': 5, 'kernel': 'rbf'}
0.970024842223749 {'C': 40, 'gamma': 5, 'kernel': 'rbf'}
0.969252608761343 {'C': 60, 'gamma': 5, 'kernel': 'rbf'}
0.9686671437322671 {'C': 20, 'gamma': 10, 'kernel': 'rbf'}
0.9672389194323248 {'C': 40, 'gamma': 10, 'kernel': 'rbf'}
0.9662086267461181 {'C': 60, 'gamma': 10, 'kernel': 'rbf'}
0.9626830907891886 {'C': 20, 'gamma': 5, 'kernel': 'poly'}
0.9614257769417272 {'C': 40, 'gamma': 5, 'kernel': 'poly'}
0.9612560510085604 {'C': 60, 'gamma': 5, 'kernel': 'poly'}
0.9607948751085248 {'C': 60, 'gamma': 20, 'kernel': 'poly'}
0.9607948751085248 {'C': 60, 'gamma': 20, 'kernel': 'poly'}
0.9607948751085248 {'C': 60, 'gamma': 20, 'kernel': 'poly'}
0.9607948751085248 {'C': 60, 'gamma': 20, 'kernel': 'poly'}
0.9607948751085248 {'C': 60, 'gamma': 20, 'kernel': 'poly'}
0.9607948751085248 {'C': 60, 'gamma': 20, 'kernel': 'poly'}
0.9600218944999426 {'C': 60, 'gamma': 20, 'kernel': 'rbf'}
0.9600218944999426 {'C': 60, 'gamma': 20, 'kernel': 'rb

Higher values of C and gamma only decreased performance slightly, so we can try values of C between 20 and 40 and gamma between 5 and 10 to check for better models.
<br><br>
Since the poly kernel wasn't among the best models, it will not be used from now on.

## Third Grid Search

In [25]:
# hyperparameters for third Grid Search
param_grid = {
    "C": [20, 25, 30, 35],
    "kernel": ["rbf"],
    "gamma": [5, 7, 9]
}

In [26]:
# Geometric transformation
svm = SVC()
grid_search_geometric = GridSearchCV(svm, param_grid, cv=5, scoring=weighted_accuracy_score, return_train_score=True, n_jobs=-1)

grid_search_geometric.fit(geometric_X, data["labels"])

In [27]:
cvres = grid_search_geometric.cv_results_ 
results = dict(zip(cvres["mean_test_score"], cvres["params"]))
scores = sorted(cvres["mean_test_score"], reverse=True)
for mean_score in scores:
    print(mean_score, results[mean_score])

0.970152995474183 {'C': 30, 'gamma': 5, 'kernel': 'rbf'}
0.9700429965939914 {'C': 20, 'gamma': 5, 'kernel': 'rbf'}
0.9699395190651575 {'C': 35, 'gamma': 5, 'kernel': 'rbf'}
0.9695167096591023 {'C': 25, 'gamma': 5, 'kernel': 'rbf'}
0.9690424222560828 {'C': 30, 'gamma': 7, 'kernel': 'rbf'}
0.9688738810827904 {'C': 20, 'gamma': 7, 'kernel': 'rbf'}
0.9685552408166632 {'C': 25, 'gamma': 7, 'kernel': 'rbf'}
0.9684765174321134 {'C': 35, 'gamma': 7, 'kernel': 'rbf'}
0.9684230487979996 {'C': 30, 'gamma': 9, 'kernel': 'rbf'}
0.9682398959871898 {'C': 20, 'gamma': 9, 'kernel': 'rbf'}
0.9678321268972503 {'C': 25, 'gamma': 9, 'kernel': 'rbf'}
0.9673133240363618 {'C': 35, 'gamma': 9, 'kernel': 'rbf'}


Since the best score improved only marginally (from 0.97004 to 0.97015), we will stop the search here.
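
How small the change is can be read off the scores printed above. The following is just arithmetic on the reported values, with `best_second` and `best_third` as illustrative names:

```python
best_second = 0.9700429965939914  # best score of the second grid search
best_third = 0.970152995474183    # best score of the third grid search

gain = best_third - best_second
print(f"absolute gain: {gain:.6f}")
print(f"in percentage points: {gain * 100:.4f}")
```

A gain of this size is far below the differences seen between kernels or gamma values in the earlier searches.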

## Analysing Time Performance

In [28]:
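# average time per prediction
# note: this times the best model found for the Minimum transformation, predicting on the training data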
from time import time
best_svm = grid_search_minimum.best_estimator_
best_svm.fit(minimum_X, data["labels"])

start = time()
best_svm.predict(minimum_X)
end = time()
print((end - start) / len(minimum_X))

8.932372619365824e-05
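
For very short operations, `time.perf_counter` is usually a better clock than `time.time`, and repeating the measurement reduces noise. A minimal sketch, where `average_prediction_time` is a hypothetical helper and `dummy_predict` stands in for a fitted model's `predict` method:

```python
from time import perf_counter

def average_prediction_time(predict, X, repeats=5):
    """Return the best average seconds per observation over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = perf_counter()
        predict(X)
        elapsed = perf_counter() - start
        best = min(best, elapsed / len(X))
    return best

# stand-in for best_svm.predict (illustration only)
def dummy_predict(X):
    return [sum(row) for row in X]

X_demo = [[0.1, 0.2, 0.3]] * 1000
seconds_per_prediction = average_prediction_time(dummy_predict, X_demo)
print(seconds_per_prediction)
```

Timing on held-out data instead of the training set would likely give a figure closer to real use.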


## Conclusion

The best SVM model uses the Geometric transformation, with kernel = "rbf", C = 30 and gamma = 5, and reaches a cross-validated weighted accuracy of 97.02%.
<br><br>
The average time per prediction, measured above with the best Minimum-transformation model on the training data, is about 0.00009 seconds.