# Basic Models

## Objective

The objective of this notebook is to get a general idea of how different transformations and models perform, without changing their hyperparameters.
<br><br>
Models and transformations that lead to very poor predictions will not be used in future analysis.

## Loading libraries and data

In [1]:
# importing important libraries

# transformations library
from transformations import minimum, geometric, minimum2D, geometric2D

# models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier 

# loading data
import pickle
import joblib

# other modules
from sklearn.model_selection import cross_val_score
from sklearn.metrics import recall_score, make_scorer
from sklearn.preprocessing import MinMaxScaler
import numpy as np

In [4]:
# get base_dataset
data_path = "TrainTestData/train_data.pickle"
with open(data_path, "rb") as data_file:
    data = pickle.load(data_file)

## Training Models

In [5]:
# list of models and their names (names only to make prints easier)
models = [
    {"name": "RandomForest", "classifier": RandomForestClassifier(n_jobs=-1), "scale": False},
    {"name": "LogisticRegression", "classifier": SGDClassifier(loss="log_loss"), "scale": True},
    {"name": "SVM", "classifier": SVC(), "scale": True},
    {"name": "KNN", "classifier": KNeighborsClassifier(), "scale": True}
]

# list of transformed datasets and their names (names only to make prints easier)
changed_datasets = [
    {"name": "Minimum 3D", "transformation": minimum, "dataset": {}},
    {"name": "Geometric 3D", "transformation": geometric, "dataset": {}},
    {"name": "Minimum 2D", "transformation": minimum2D, "dataset": {}},
    {"name": "Geometric 2D", "transformation": geometric2D, "dataset": {}},
]

In [6]:
# function that calculates weighted_accuracy (a frequency-weighted mean of per-class recall)
# weights are based on the frequency of the letters in the Portuguese alphabet
# source: https://pt.wikipedia.org/wiki/Alfabeto_portugu%C3%AAs#Frequ%C3%AAncia_da_ocorr%C3%AAncia_de_letras
# H, K, J, X and Z are not present
LETTERS_FREQUENCY = [
    14.63,
    1.04,
    3.88,
    5.01,
    12.57,
    1.02,
    1.30,
    6.18,
    2.78,
    4.74,
    5.05,
    10.73,
    2.52,
    1.20,
    6.53,
    7.81,
    4.34,
    4.63,
    1.67,
    0.01,
    0.01,
]
def weighted_accuracy(y_true, y_pred):
    recall_array = recall_score(y_true, y_pred, average=None)
    weights_total = 0
    result = 0
    for recall, weight in zip(recall_array, LETTERS_FREQUENCY):
        weights_total += weight
        result += recall * weight
    return result / weights_total
weighted_accuracy_score = make_scorer(weighted_accuracy)
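
To make the metric concrete, here is a small worked example of a frequency-weighted mean of per-class recall, computed by hand on toy data. The helper names, the three classes and their weights are hypothetical and only illustrate the formula; they are not the notebook's letters or frequencies. The sketch fixes the class order explicitly so that each weight is always paired with the intended class. This matters because `recall_score(..., average=None)` returns one value per label found in the data, in sorted order, so the pairing with a fixed weight list relies on every class being present.

```python
def per_class_recall(y_true, y_pred, labels):
    """Recall for each class in `labels`, in that order.

    A class that never occurs in y_true gets recall 0.0.
    """
    recalls = []
    for label in labels:
        true_idx = [i for i, t in enumerate(y_true) if t == label]
        if not true_idx:
            recalls.append(0.0)
            continue
        hits = sum(1 for i in true_idx if y_pred[i] == label)
        recalls.append(hits / len(true_idx))
    return recalls


def weighted_mean(values, weights):
    """Weighted average: sum(v * w) / sum(w)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Toy data (hypothetical): three classes with made-up weights.
toy_labels = ["A", "B", "C"]
toy_weights = [6.0, 3.0, 1.0]
y_true = ["A", "A", "B", "B", "C"]
y_pred = ["A", "A", "B", "A", "B"]

recalls = per_class_recall(y_true, y_pred, toy_labels)
score = weighted_mean(recalls, toy_weights)
print(recalls, score)  # [1.0, 0.5, 0.0] 0.75
```

A mistake on a frequent class costs more than a mistake on a rare one: here class "C" is never predicted correctly, yet the score only drops by its 10% share of the total weight.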

In [7]:
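# function that evaluates a model with cross-validation
# returns the model and its mean cross-validated score
# (cross_val_score fits clones, so the returned model itself is left unfitted)
# if feature scaling is necessary, applies min-max scaling (fit on the full dataset, before the CV split)
def train_model(model, data, scaling=False, cv=5):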
    if scaling:
        scaler = MinMaxScaler()
        features = scaler.fit_transform(data["features"])
    else:
        features = data["features"]
    X = features
    y = data["labels"]

    # train and test model with cross val score
    score = np.mean(cross_val_score(model, X, y, cv=cv, n_jobs=-1, scoring=weighted_accuracy_score))

    return model, score
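
In `train_model` the scaler is fit on every row before cross-validation splits the data, so the minimum and maximum of the validation folds influence the scaled training features. A common alternative, sketched below, is to wrap the scaler and the classifier in a scikit-learn pipeline so that the scaler is refit on the training part of each fold. This is not the method used to produce the scores in this notebook; the function name `cross_validate_scaled` and the synthetic data are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def cross_validate_scaled(classifier, X, y, cv=5, scoring=None):
    """Mean cross-validated score with the scaler refit on each training fold."""
    pipeline = make_pipeline(MinMaxScaler(), classifier)
    return float(np.mean(cross_val_score(pipeline, X, y, cv=cv, scoring=scoring)))


# Synthetic stand-in data (hypothetical): two well-separated classes.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])
y_demo = np.array([0] * 50 + [1] * 50)

demo_score = cross_validate_scaled(KNeighborsClassifier(), X_demo, y_demo)
print(demo_score)
```

With min-max scaling the effect of fitting on the full dataset is often small, but the pipeline form keeps the validation folds fully unseen and accepts the same `scoring` argument as `cross_val_score`.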

In [8]:
# Making the necessary transformations for training
for option in changed_datasets:
    new_features = []
    for observation in data["features"]:
        new_features.append(option["transformation"](observation))
    option["dataset"]["features"] = new_features
    option["dataset"]["labels"] = data["labels"]

In [9]:
# trains all models and prints their results
for model in models:
    for dataset in changed_datasets:
        _, accuracy = train_model(model["classifier"], dataset["dataset"], scaling=model["scale"])
        print(model["name"], dataset["name"], accuracy)

RandomForest Minimum 3D 0.9407256167096127
RandomForest Geometric 3D 0.9031747908460657
RandomForest Minimum 2D 0.9442522269748361
RandomForest Geometric 2D 0.9178894399955378
LogisticRegression Minimum 3D 0.8372268668761362
LogisticRegression Geometric 3D 0.6775464842641636
LogisticRegression Minimum 2D 0.7727500652391159
LogisticRegression Geometric 2D 0.5784093080234257
SVM Minimum 3D 0.922297394557531
SVM Geometric 3D 0.8739441226691611
SVM Minimum 2D 0.9177141723759805
SVM Geometric 2D 0.8689468596792571
KNN Minimum 3D 0.9114483307833072
KNN Geometric 3D 0.8422670308200975
KNN Minimum 2D 0.9223405590970337
KNN Geometric 2D 0.876452695802962
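
The printed scores are easier to compare once they are aggregated. The sketch below copies the values from the output above (rounded to four decimals) into a dictionary, then computes the mean score per model and, for each transformation, the gap in percentage points between Logistic Regression and the next-worst model. The variable names are illustrative.

```python
from collections import defaultdict

# Scores copied from the cross-validation output, rounded to 4 decimals.
scores = {
    ("RandomForest", "Minimum 3D"): 0.9407,
    ("RandomForest", "Geometric 3D"): 0.9032,
    ("RandomForest", "Minimum 2D"): 0.9443,
    ("RandomForest", "Geometric 2D"): 0.9179,
    ("LogisticRegression", "Minimum 3D"): 0.8372,
    ("LogisticRegression", "Geometric 3D"): 0.6775,
    ("LogisticRegression", "Minimum 2D"): 0.7728,
    ("LogisticRegression", "Geometric 2D"): 0.5784,
    ("SVM", "Minimum 3D"): 0.9223,
    ("SVM", "Geometric 3D"): 0.8739,
    ("SVM", "Minimum 2D"): 0.9177,
    ("SVM", "Geometric 2D"): 0.8689,
    ("KNN", "Minimum 3D"): 0.9114,
    ("KNN", "Geometric 3D"): 0.8423,
    ("KNN", "Minimum 2D"): 0.9223,
    ("KNN", "Geometric 2D"): 0.8765,
}

# Mean score per model across the four transformations.
by_model = defaultdict(list)
for (model_name, _), value in scores.items():
    by_model[model_name].append(value)
mean_by_model = {name: sum(vals) / len(vals) for name, vals in by_model.items()}

# Gap (in percentage points) between Logistic Regression and the
# next-worst model, for each transformation.
gap_pp = {}
for transformation in sorted({t for _, t in scores}):
    others = [
        value
        for (model_name, t), value in scores.items()
        if t == transformation and model_name != "LogisticRegression"
    ]
    baseline = scores[("LogisticRegression", transformation)]
    gap_pp[transformation] = 100 * (min(others) - baseline)

for name, mean in mean_by_model.items():
    print(f"{name}: mean score {mean:.3f}")
for transformation, gap in gap_pp.items():
    print(f"{transformation}: Logistic Regression gap {gap:.1f} p.p.")
```

On these numbers Logistic Regression averages about 0.72, against roughly 0.89 to 0.93 for the other three models.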


## Conclusion

Logistic Regression had the worst performance for every transformation. Since it trails the next-worst model by roughly 7 to 29 p.p., depending on the transformation, it will not be used in further analysis.
<br><br>
The other models and transformations performed well overall.
<br><br>
Therefore, we will further explore the RandomForest, SVM and KNN models, as well as the Minimum and Geometric transformations.
<br><br>
For Random Forest and KNN, the 2D transformations performed better than the 3D ones, while the opposite held for SVM. Since the differences are quite small, we will continue to test all of them.