# Random Forest Models

## Objective

The objective of this notebook is to train and evaluate Random Forest models with different hyperparameters in order to find the best-performing one.
<br><br>
As discussed in "basic_models.ipynb", the transformations that will be used are Minimum and Geometric.

## Loading libraries and data

In [8]:
# importing important libraries

# transformations library
from transformations import minimum, geometric

# models
from sklearn.ensemble import RandomForestClassifier

# loading data
import pickle

# other modules
from sklearn.model_selection import cross_val_score
from sklearn.metrics import recall_score, make_scorer
from sklearn.model_selection import GridSearchCV
import numpy as np

In [9]:
# load the base dataset (the context manager closes the file after reading)
data_path = "TrainTestData/train_data.pickle"
with open(data_path, "rb") as data_file:
    data = pickle.load(data_file)

In [10]:
# function that calculates weighted_accuracy
# weights are based on the frequency of the letters in the Portuguese alphabet
# source: https://pt.wikipedia.org/wiki/Alfabeto_portugu%C3%AAs#Frequ%C3%AAncia_da_ocorr%C3%AAncia_de_letras
# H, K, J, X and Z are not present
LETTERS_FREQUENCY = [
    14.63,
    1.04,
    3.88,
    5.01,
    12.57,
    1.02,
    1.30,
    6.18,
    2.78,
    4.74,
    5.05,
    10.73,
    2.52,
    1.20,
    6.53,
    7.81,
    4.34,
    4.63,
    1.67,
    0.01,
    0.01,
]
def weighted_accuracy(y_true, y_pred):
    recall_array = recall_score(y_true, y_pred, average=None)
    weights_total = 0
    result = 0
    for recall, weight in zip(recall_array, LETTERS_FREQUENCY):
        weights_total += weight
        result += recall * weight
    return result / weights_total
weighted_accuracy_score = make_scorer(weighted_accuracy)
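
To make the metric concrete, here is a minimal sketch on a hypothetical three-class toy problem. The labels and the `toy_weights` values are illustrative assumptions, not the letter frequencies above. It computes the per-class recalls and then their weighted mean, which is what the loop in `weighted_accuracy` does.

```python
import numpy as np
from sklearn.metrics import recall_score

# Hypothetical toy data: three classes with illustrative weights
toy_weights = [6.0, 3.0, 1.0]
toy_true = [0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 0, 1, 1, 1, 2, 0]

# Per-class recall, returned in sorted label order:
# class 0 -> 3/4, class 1 -> 2/2, class 2 -> 1/2
toy_recalls = recall_score(toy_true, toy_pred, average=None)

# Weighted mean of the recalls: (0.75*6 + 1.0*3 + 0.5*1) / 10 = 0.8
toy_score = np.average(toy_recalls, weights=toy_weights)
print(toy_recalls, toy_score)
```

Plain accuracy on the same toy data would be 6/8 = 0.75, so frequent classes pull the weighted score toward their own recall. Note that `recall_score(..., average=None)` only returns entries for labels that appear in `y_true` or `y_pred`, in sorted order. Pairing those entries with `LETTERS_FREQUENCY` through `zip` therefore assumes that every class is present in each evaluation fold and that the sorted label order matches the order of the weights.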

## Choosing hyperparameters and transformations

In [11]:
# Minimum transformation
minimum_X = [minimum(observation) for observation in data["features"]]

# Geometric transformation
geometric_X = [geometric(observation) for observation in data["features"]]

In [12]:
# just to have an idea of the maximum max_depth
forest = RandomForestClassifier(n_jobs=-1)
forest.fit(data["features"], data["labels"])
print(max([estimator.tree_.max_depth for estimator in forest.estimators_]))

26


In [13]:
# hyperparameters for first Grid Search
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 15, 30]
}
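
As a rough guide to the cost of this grid, the sketch below counts the candidate combinations and the number of model fits a 5-fold search would run. It assumes the same grid values and `cv=5`. `ParameterGrid` is scikit-learn's helper for enumerating the combinations, and `n_folds` is a name introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same grid values as in the cell above
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 15, 30],
}
n_folds = 5  # illustrative name; matches cv=5

# 3 values of n_estimators x 3 values of max_depth
n_candidates = len(ParameterGrid(param_grid))

# One fit per candidate per fold
n_fits = n_candidates * n_folds
print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` also trains the best candidate once more on the full data, so each search performs one additional fit.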

## First Grid Search

In [14]:
# Minimum transformation
forest = RandomForestClassifier(n_jobs=-1)
grid_search_minimum = GridSearchCV(forest, param_grid, cv=5, scoring=weighted_accuracy_score, return_train_score=True, n_jobs=-1)

grid_search_minimum.fit(minimum_X, data["labels"])

In [15]:
cvres = grid_search_minimum.cv_results_ 
results = sorted(zip(cvres["mean_test_score"], cvres["params"]), reverse=True)
for mean_score, params in results:
    print(mean_score, params)

0.9391427445262333 {'max_depth': 30, 'n_estimators': 200}
0.9378851970946421 {'max_depth': 15, 'n_estimators': 100}
0.9347732507578896 {'max_depth': 15, 'n_estimators': 200}
0.9345646805434609 {'max_depth': 30, 'n_estimators': 100}
0.9323158115166116 {'max_depth': 15, 'n_estimators': 50}
0.9322619634500462 {'max_depth': 30, 'n_estimators': 50}
0.8082512067926277 {'max_depth': 5, 'n_estimators': 200}
0.8014615957638143 {'max_depth': 5, 'n_estimators': 100}
0.793141935727868 {'max_depth': 5, 'n_estimators': 50}


In [16]:
# Geometric transformation
forest = RandomForestClassifier(n_jobs=-1)
grid_search_geometric = GridSearchCV(forest, param_grid, cv=5, scoring=weighted_accuracy_score, return_train_score=True, n_jobs=-1)

grid_search_geometric.fit(geometric_X, data["labels"])

In [17]:
cvres = grid_search_geometric.cv_results_ 
results = sorted(zip(cvres["mean_test_score"], cvres["params"]), reverse=True)
for mean_score, params in results:
    print(mean_score, params)

0.9194287461097733 {'max_depth': 30, 'n_estimators': 200}
0.9185657550235836 {'max_depth': 30, 'n_estimators': 100}
0.9146639459974935 {'max_depth': 15, 'n_estimators': 100}
0.9144259555234442 {'max_depth': 15, 'n_estimators': 200}
0.912543853409927 {'max_depth': 30, 'n_estimators': 50}
0.9090264733434097 {'max_depth': 15, 'n_estimators': 50}
0.6838162498433995 {'max_depth': 5, 'n_estimators': 200}
0.6723993603018357 {'max_depth': 5, 'n_estimators': 50}
0.6715484876166474 {'max_depth': 5, 'n_estimators': 100}


As we can see, the results are about the same as those of the basic models. The basic Random Forest with the Minimum transformation scored 93.5%, and the best one here scored 93.91%. Such a small difference suggests that testing more hyperparameters would most likely lead to similar results. The same can be said of the Geometric transformation: 92.01% for the basic model versus 91.94% for the best one here.
<br><br>
Therefore, we conclude that there is no need to test further hyperparameters.
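
One way to check whether such small differences are meaningful is to compare them with the fold-to-fold spread and with the training scores, both of which `GridSearchCV` stores in `cv_results_` when `return_train_score=True`. The sketch below is only an illustration under stated assumptions. It uses synthetic data from `make_classification`, a tiny grid, and the default accuracy scorer instead of `weighted_accuracy_score`. All `demo_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): not the notebook's real features
demo_X, demo_y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)

demo_grid = {"n_estimators": [10, 20], "max_depth": [2, 8]}
demo_search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    demo_grid,
    cv=3,
    return_train_score=True,  # required for the mean_train_score column
)
demo_search.fit(demo_X, demo_y)

demo_cv = demo_search.cv_results_
rows = zip(
    demo_cv["params"],
    demo_cv["mean_train_score"],
    demo_cv["mean_test_score"],
    demo_cv["std_test_score"],
)
for params, train, test, std in rows:
    # A large train/test gap hints at overfitting; std shows fold-to-fold noise
    print(params, f"train={train:.3f}", f"test={test:.3f} +/- {std:.3f}", f"gap={train - test:.3f}")
```

If the difference between two candidates is smaller than the standard deviation across folds, it is hard to argue that one of them is really better.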

## Analysing Time Performance

In [19]:
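# average time per prediction
from time import time

# best_estimator_ was already refit on the full data by GridSearchCV (refit=True by default),
# so this call only retrains the same configuration
best_forest = grid_search_minimum.best_estimator_
best_forest.fit(minimum_X, data["labels"])

# time a single batch prediction over all observations, then average per observation
start = time()
best_forest.predict(minimum_X)
end = time()
print((end - start) / len(minimum_X))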

2.860396072782319e-05
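
The figure above is an average over one batch call, so the fixed overhead of `predict` is shared across all observations. If the model will be queried one observation at a time, the latency per call can be noticeably higher. The sketch below contrasts the two measurements on synthetic data. It is an illustration under assumptions, with hypothetical `demo_*` names and a small forest, not a benchmark of the model trained here.

```python
from time import perf_counter

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and a small forest (assumptions for illustration)
demo_X, demo_y = make_classification(n_samples=200, n_features=10, random_state=0)
demo_forest = RandomForestClassifier(n_estimators=20, random_state=0)
demo_forest.fit(demo_X, demo_y)

# Batch: one predict call over all rows, averaged per row
start = perf_counter()
demo_forest.predict(demo_X)
batch_per_sample = (perf_counter() - start) / len(demo_X)

# Single: one predict call per row, each timed separately
single_times = []
for row in demo_X[:20]:
    start = perf_counter()
    demo_forest.predict(row.reshape(1, -1))  # predict expects a 2D array
    single_times.append(perf_counter() - start)
single_latency = float(np.median(single_times))

print(f"batch average: {batch_per_sample:.2e} s, single-call median: {single_latency:.2e} s")
```

`perf_counter` is used here because it is a monotonic, high-resolution clock intended for measuring short durations.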


## Conclusion

The best Random Forest model uses the Minimum transformation with max_depth = 30 and n_estimators = 200, reaching a weighted accuracy of 93.91%.
<br><br>
The average time per prediction, measured over a batch prediction on the training data, is approximately 0.00003 seconds.