# Random Forest Models

## Objective

The objective of this notebook is to train and evaluate Random Forest models with different hyperparameters in order to find the best-performing one.
<br><br>
As discussed in "basic_models.ipynb", the transformations that will be used are Minimum and Geometric.

## Loading libraries and data

In [2]:
# importing important libraries

# transformations library
from transformations import minimum, geometric

# models
from sklearn.ensemble import RandomForestClassifier

# loading data
import pickle

# other modules
from sklearn.model_selection import cross_val_score
from sklearn.metrics import recall_score, make_scorer
from sklearn.model_selection import GridSearchCV
import numpy as np

In [3]:
# get base_dataset
data_path = "TrainTestData/train_data.pickle"
# use a context manager so the file handle is closed after loading
with open(data_path, "rb") as data_file:
    data = pickle.load(data_file)

In [4]:
# function that calculates weighted_accuracy (a weighted average of the per-letter recall)
# weights are based on the frequency of the letters in the Portuguese alphabet
# source: https://pt.wikipedia.org/wiki/Alfabeto_portugu%C3%AAs#Frequ%C3%AAncia_da_ocorr%C3%AAncia_de_letras
# H, K, J, X and Z are not present
LETTERS_FREQUENCY = [
    14.63,
    1.04,
    3.88,
    5.01,
    12.57,
    1.02,
    1.30,
    6.18,
    2.78,
    4.74,
    5.05,
    10.73,
    2.52,
    1.20,
    6.53,
    7.81,
    4.34,
    4.63,
    1.67,
    0.01,
    0.01,
]
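def weighted_accuracy(y_true, y_pred):
    # recall of each class, returned in sorted label order
    recall_array = recall_score(y_true, y_pred, average=None)
    weights_total = 0
    result = 0
    # pair each recall with its letter's weight and accumulate the weighted sum
    for recall, weight in zip(recall_array, LETTERS_FREQUENCY):
        weights_total += weight
        result += recall * weight
    # normalize by the total weight so the score stays between 0 and 1
    return result / weights_total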
weighted_accuracy_score = make_scorer(weighted_accuracy)
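
To make the metric concrete, here is a minimal pure-Python sketch of the same calculation on a toy three-class problem. The helper names `per_class_recall` and `weighted_recall`, the labels, and the weights are illustrative assumptions and not part of this notebook, which relies on `recall_score(..., average=None)` instead.

```python
def per_class_recall(y_true, y_pred, classes):
    """Recall of each class: correct predictions / actual occurrences."""
    recalls = []
    for c in classes:
        actual = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        # a class that never occurs gets recall 0.0 in this sketch
        recalls.append(hits / actual if actual else 0.0)
    return recalls


def weighted_recall(recalls, weights):
    """Average of the per-class recalls, weighted by class importance."""
    return sum(r * w for r, w in zip(recalls, weights)) / sum(weights)


# toy data (hypothetical): class "A" is frequent, "B" is rare
y_true = ["A", "A", "A", "A", "B", "B", "C", "C"]
y_pred = ["A", "A", "A", "B", "B", "B", "A", "C"]
toy_weights = [6.0, 1.0, 3.0]  # made-up weights for A, B, C

recalls = per_class_recall(y_true, y_pred, ["A", "B", "C"])
score = weighted_recall(recalls, toy_weights)
print(recalls)  # [0.75, 1.0, 0.5]
print(score)    # 0.7
```

Plain accuracy on this toy data is 6/8 = 0.75, while the weighted score is 0.7 because the mistakes fall on the heavily weighted classes. One caveat worth keeping in mind: the weights are matched to the recalls by position, so this approach assumes that every class is present in each evaluation fold and that the weights are listed in the same order as the sorted class labels.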

## Choosing hyperparameters and transformations

In [5]:
# Minimum transformation
minimum_X = []
for observation in data["features"]:
    minimum_X.append(minimum(observation))

# Geometric transformation
geometric_X = []
for observation in data["features"]:
    geometric_X.append(geometric(observation))

In [6]:
# fit a forest without a depth limit, just to get an idea of how deep the trees grow (upper bound for max_depth)
forest = RandomForestClassifier(n_jobs=-1)
forest.fit(data["features"], data["labels"])
print(max([estimator.tree_.max_depth for estimator in forest.estimators_]))

27


In [7]:
# hyperparameters for first Grid Search
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 15, 30]
}
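
As a rough guide to the cost of the search, the grid above can be expanded by hand. The sketch below is only an illustration using the standard library; the names `cv_folds`, `combinations` and `total_fits` are assumptions introduced here, with `cv_folds` set to the `cv=5` value passed to `GridSearchCV` in this notebook.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 15, 30],
}
cv_folds = 5  # same value as cv=5 in the grid searches

# every combination of the listed values, as one dict per candidate
combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
total_fits = len(combinations) * cv_folds

print(len(combinations))  # 9
print(total_fits)         # 45
```

Each transformation therefore needs 45 cross-validation fits, plus one final refit of the best candidate on the full training data, since `GridSearchCV` uses `refit=True` by default.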

## First Grid Search

In [8]:
# Minimum transformation
forest = RandomForestClassifier(n_jobs=-1)
grid_search_minimum = GridSearchCV(forest, param_grid, cv=5, scoring=weighted_accuracy_score, return_train_score=True)

grid_search_minimum.fit(minimum_X, data["labels"])

In [10]:
cvres = grid_search_minimum.cv_results_ 
results = sorted(zip(cvres["mean_test_score"], cvres["params"]), reverse=True)
for mean_score, params in results:
    print(mean_score, params)

0.9372456345722519 {'max_depth': 30, 'n_estimators': 100}
0.936199417063776 {'max_depth': 30, 'n_estimators': 200}
0.9356382301099908 {'max_depth': 15, 'n_estimators': 100}
0.935400760208605 {'max_depth': 15, 'n_estimators': 200}
0.9325745079161288 {'max_depth': 30, 'n_estimators': 50}
0.9310475155035581 {'max_depth': 15, 'n_estimators': 50}
0.8084066087641265 {'max_depth': 5, 'n_estimators': 200}
0.8012313536708919 {'max_depth': 5, 'n_estimators': 100}
0.7947801584148527 {'max_depth': 5, 'n_estimators': 50}


In [11]:
# Geometric transformation
forest = RandomForestClassifier(n_jobs=-1)
grid_search_geometric = GridSearchCV(forest, param_grid, cv=5, scoring=weighted_accuracy_score, return_train_score=True)

grid_search_geometric.fit(geometric_X, data["labels"])

In [12]:
cvres = grid_search_geometric.cv_results_ 
results = sorted(zip(cvres["mean_test_score"], cvres["params"]), reverse=True)
for mean_score, params in results:
    print(mean_score, params)

0.9207871633956726 {'max_depth': 30, 'n_estimators': 100}
0.9183211294131166 {'max_depth': 30, 'n_estimators': 200}
0.915166328412997 {'max_depth': 15, 'n_estimators': 200}
0.9115189615448249 {'max_depth': 30, 'n_estimators': 50}
0.909830294548831 {'max_depth': 15, 'n_estimators': 100}
0.9082336195262997 {'max_depth': 15, 'n_estimators': 50}
0.6836240507938431 {'max_depth': 5, 'n_estimators': 50}
0.6802557967190553 {'max_depth': 5, 'n_estimators': 100}
0.6743538854891692 {'max_depth': 5, 'n_estimators': 200}


As we can see, the results are about the same as those of the basic models. The basic RandomForest with the Minimum transformation scored 93.5%, and the best one here scored 93.72%. Such a small difference suggests that testing more hyperparameters would probably lead to similar results. The same can be said about the Geometric transformation: 92.01% for the basic model, 92.08% for the best one here.
<br><br>
Therefore, we conclude that there is no need to test further hyperparameters.
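
Whether a gap of a few tenths of a percentage point is meaningful depends on how much the score varies between folds, which `GridSearchCV` stores in `cv_results_["std_test_score"]`. The sketch below is a hedged illustration of one possible check: the helper `candidates_within_one_std` is hypothetical, and the values in `toy_cvres` are made-up placeholders, not results from this notebook.

```python
def candidates_within_one_std(cvres):
    """Return the params whose mean score is within one std of the best mean."""
    means = cvres["mean_test_score"]
    stds = cvres["std_test_score"]
    best = max(range(len(means)), key=lambda i: means[i])
    threshold = means[best] - stds[best]
    return [cvres["params"][i] for i in range(len(means)) if means[i] >= threshold]


# placeholder values that only mimic the layout of cv_results_
toy_cvres = {
    "mean_test_score": [0.80, 0.93, 0.935],
    "std_test_score": [0.02, 0.01, 0.01],
    "params": [{"max_depth": 5}, {"max_depth": 15}, {"max_depth": 30}],
}

close = candidates_within_one_std(toy_cvres)
print(close)  # [{'max_depth': 15}, {'max_depth': 30}]
```

The same helper could be called as `candidates_within_one_std(grid_search_minimum.cv_results_)`. If several candidates fall within one standard deviation of the best, the differences between them are plausibly noise, which would support stopping the search here.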

## Conclusion

The best RandomForest model uses the Minimum transformation, with max_depth = 30 and n_estimators = 100, reaching a cross-validated weighted accuracy of 93.72%.