# Basic Models

## Objective

The objective of this notebook is to get a general idea of how different transformations and models perform, without changing their hyperparameters.
<br><br>
Models and transformations that lead to very poor predictions will not be used in future analysis.

## Loading libraries and data

In [41]:
# model library
from LibrasModel import LibrasModel, weighted_accuracy_score, weighted_accuracy_scorer

# models
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# loading data
import pickle
import joblib

# other modules
import numpy as np

In [42]:
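# load the base training dataset
data_path = "TrainTestData/train_data.pickle"
# use a context manager so the file handle is always closed
with open(data_path, "rb") as data_file:
    train_data = pickle.load(data_file)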

## Training Models

In [43]:
# list of models and their names (names only to make prints easier)
models = [
    {"name": "RandomForest", "classifier": RandomForestClassifier(), "scale": False},
    {"name": "LogisticRegression", "classifier": SGDClassifier(loss="log_loss"), "scale": True},
    {"name": "SVM", "classifier": SVC(), "scale": True},
    {"name": "KNN", "classifier": KNeighborsClassifier(), "scale": True}
]

In [44]:
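# cross-validates one classifier on the training data
# returns the mean cross-validated weighted accuracy (not the fitted model)
# has_z: True keeps the z coordinate (3D), False drops it (2D)
# min-max scaling, when needed, is not applied in this function; it is left to LibrasModel
def train_model(type_model, data, has_z=True, cv=5):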
    X = np.array(data["features"])
    y = np.array(data["labels"])

    model = LibrasModel(type_model, has_z)
    return model.cross_val(X, y, weighted_accuracy_scorer, cv=cv, mean=True)
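The internals of `LibrasModel.cross_val` are not shown in this notebook. As a rough illustration of the same idea (cross-validation with optional min-max scaling), here is a minimal sketch using scikit-learn only. The function name `cross_val_sketch`, the synthetic data, and the use of `balanced_accuracy_score` as a stand-in for `weighted_accuracy_scorer` are all assumptions for illustration, not the project's actual implementation.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler


def cross_val_sketch(classifier, X, y, scale=True, cv=5):
    """Hypothetical helper: mean cross-validated score with optional scaling."""
    # Putting the scaler inside the pipeline means it is fit on the training
    # folds only, so no information leaks from the validation fold.
    steps = [MinMaxScaler()] if scale else []
    pipeline = make_pipeline(*steps, classifier)
    # Assumption: balanced accuracy stands in for the project's weighted accuracy.
    scorer = make_scorer(balanced_accuracy_score)
    return cross_val_score(pipeline, X, y, scoring=scorer, cv=cv).mean()


# Synthetic data, used only so the sketch runs on its own.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)
demo_score = cross_val_sketch(KNeighborsClassifier(), X_demo, y_demo)
print(demo_score)
```

With this layout, the `"scale"` flag in the `models` list could be passed straight through as the `scale` argument, if the real class does not already handle it.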

In [45]:
# trains all models and prints their results
for model in models:
    for option in [True, False]:
        accuracy = train_model(model["classifier"], train_data, has_z=option)
        print(model["name"], option, accuracy)

RandomForest True 0.9232748592219172
RandomForest False 0.9312236462881265
LogisticRegression True 0.8353883984635957
LogisticRegression False 0.8618296295024763
SVM True 0.9325286331202809
SVM False 0.9312993195388793
KNN True 0.9086503022374846
KNN False 0.9130389054007816
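To make the comparison easier to read, the gaps can be computed in percentage points directly from the printed scores. The snippet below is a small sketch that hard-codes the values from the output above; the variable names are illustrative only.

```python
# Scores copied from the printed output above: (model, has_z) -> weighted accuracy
scores = {
    ("RandomForest", True): 0.9232748592219172,
    ("RandomForest", False): 0.9312236462881265,
    ("LogisticRegression", True): 0.8353883984635957,
    ("LogisticRegression", False): 0.8618296295024763,
    ("SVM", True): 0.9325286331202809,
    ("SVM", False): 0.9312993195388793,
    ("KNN", True): 0.9086503022374846,
    ("KNN", False): 0.9130389054007816,
}


def pp(delta):
    """Convert a difference between two accuracies into percentage points."""
    return 100 * delta


model_names = sorted({name for name, _ in scores})

# Positive value: 3D (has_z=True) scored higher; negative: 2D scored higher.
diff_3d_vs_2d = {
    name: pp(scores[(name, True)] - scores[(name, False)]) for name in model_names
}

# Range of the gap between Logistic Regression and every other configuration.
others = [v for (name, _), v in scores.items() if name != "LogisticRegression"]
logreg = [v for (name, _), v in scores.items() if name == "LogisticRegression"]
smallest_gap = pp(min(others) - max(logreg))
largest_gap = pp(max(others) - min(logreg))

for name, diff in diff_3d_vs_2d.items():
    print(f"{name}: 3D minus 2D = {diff:+.2f} p.p.")
print(f"LogisticRegression trails by {smallest_gap:.1f} to {largest_gap:.1f} p.p.")
```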


## Conclusion

The Logistic Regression models had the worst performance: their weighted accuracy is roughly 5 to 10 p.p. below that of the other models, so they will not be used in further analysis.
<br><br>
The other models and transformations had good performances overall.
<br><br>
Therefore, we will explore the RandomForest, SVM and KNN models further, as well as the Minimum and Geometric transformations.
<br><br>
In Random Forest and KNN, the 2D transformations performed better than the 3D ones, while the opposite happened in SVM. Since the differences are quite small (under 1 p.p.), we will continue to test all of them.