# Random Forest Models

## Objective

The objective of this notebook is to train and test several Random Forest models with different hyperparameters, in order to find the best-performing one.
<br><br>
As discussed in "basic_models.ipynb", the transformations used are Minimum and Geometric, each in its original (3D) form and in a 2D variant.

## Loading libraries and data

In [1]:
# importing important libraries

# transformations library
from transformations import minimum, geometric, minimum2D, geometric2D

# models
from sklearn.ensemble import RandomForestClassifier

# loading data
import pickle

# other modules
from sklearn.model_selection import cross_val_score
from sklearn.metrics import recall_score, make_scorer
from sklearn.model_selection import GridSearchCV
import numpy as np

In [2]:
# load the base dataset
data_path = "TrainTestData/train_data.pickle"
with open(data_path, "rb") as f:
    data = pickle.load(f)

In [3]:
# function that calculates weighted_accuracy
# weights are based on the frequency of the letters in the Portuguese alphabet
# source: https://pt.wikipedia.org/wiki/Alfabeto_portugu%C3%AAs#Frequ%C3%AAncia_da_ocorr%C3%AAncia_de_letras
# H, K, J, X and Z are not present
LETTERS_FREQUENCY = [
    14.63,
    1.04,
    3.88,
    5.01,
    12.57,
    1.02,
    1.30,
    6.18,
    2.78,
    4.74,
    5.05,
    10.73,
    2.52,
    1.20,
    6.53,
    7.81,
    4.34,
    4.63,
    1.67,
    0.01,
    0.01,
]
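def weighted_accuracy(y_true, y_pred):
    # recall of each class, in sorted label order
    recall_array = recall_score(y_true, y_pred, average=None)
    weights_total = 0
    result = 0
    # zip pairs the i-th class with the i-th weight
    for recall, weight in zip(recall_array, LETTERS_FREQUENCY):
        weights_total += weight
        result += recall * weight
    # weighted mean of the per-class recalls
    return result / weights_total


# wrap the metric so it can be passed as `scoring` to GridSearchCV
weighted_accuracy_score = make_scorer(weighted_accuracy)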
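
One detail worth checking: with `average=None`, `recall_score` returns one recall per label found in `y_true` or `y_pred`, in sorted order, so the weights line up with the classes only when every class appears. The sketch below is a minimal variant, not the notebook's scorer, that passes `labels` explicitly to keep the pairing fixed. The name `weighted_recall` and the three-class toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import recall_score


def weighted_recall(y_true, y_pred, labels, weights):
    """Weighted mean of per-class recall (hypothetical helper, not from the notebook)."""
    # labels fixes the order of the returned recalls, so recalls[i] pairs with weights[i]
    recalls = recall_score(y_true, y_pred, labels=labels, average=None)
    return float(np.average(recalls, weights=weights))


# toy data: three classes weighted like the first three entries of LETTERS_FREQUENCY
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 2, 2]
score = weighted_recall(y_true, y_pred, labels=[0, 1, 2], weights=[14.63, 1.04, 3.88])
print(score)
```

In this toy case only class 1, the lowest-weighted class, is misclassified once, so the score stays close to 1; the same error on class 0 would cost far more.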

## Choosing hyperparameters and transformations

In [4]:
# Minimum transformation
minimum_X = [minimum(observation) for observation in data["features"]]

# Geometric transformation
geometric_X = [geometric(observation) for observation in data["features"]]

# Minimum 2D transformation
minimum2D_X = [minimum2D(observation) for observation in data["features"]]

# Geometric 2D transformation
geometric2D_X = [geometric2D(observation) for observation in data["features"]]

In [5]:
# fit a forest with no depth limit to get an idea of the largest useful max_depth
forest = RandomForestClassifier(n_jobs=-1)
forest.fit(data["features"], data["labels"])
print(max([estimator.tree_.max_depth for estimator in forest.estimators_]))

26
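
The same probe can report the spread of depths rather than only the maximum, which helps when picking candidate `max_depth` values. The sketch below uses a synthetic dataset from `make_classification` as a stand-in for the notebook's data; the dataset sizes and `random_state` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data (assumption: not the notebook's dataset)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# no max_depth set, so trees grow until leaves are pure or too small to split
forest = RandomForestClassifier(n_estimators=20, random_state=0)
forest.fit(X, y)

# depth of every individual tree in the ensemble
depths = np.array([estimator.tree_.max_depth for estimator in forest.estimators_])
print("min:", depths.min(), "mean:", depths.mean(), "max:", depths.max())
```

If the deepest tree in a fit reaches depth 26, a larger `max_depth` such as 30 places no effective limit on that fit.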


In [6]:
# hyperparameters for first Grid Search
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 15, 30]
}
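
To gauge the cost of a search before launching it, the grid can be expanded by hand. The sketch below enumerates the combinations with `itertools.product` and counts the model fits that 5-fold cross-validation implies. The value `cv_folds = 5` matches the `cv=5` used in this notebook's searches, and the extra fit assumes `GridSearchCV`'s default `refit=True`.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 15, 30],
}
cv_folds = 5

# every combination of one value per hyperparameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

# one fit per combination per fold, plus one final refit on the full training set
total_fits = len(combinations) * cv_folds + 1
print(len(combinations), "combinations,", total_fits, "fits")
```

Each of the four searches below therefore trains the forest dozens of times, which is worth keeping in mind before widening the grid.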

## First Grid Search

In [7]:
# Minimum transformation
forest = RandomForestClassifier(n_jobs=-1)
grid_search_minimum = GridSearchCV(forest, param_grid, cv=5, scoring=weighted_accuracy_score, return_train_score=True, n_jobs=-1)

grid_search_minimum.fit(minimum_X, data["labels"])

In [8]:
cvres = grid_search_minimum.cv_results_ 
results = sorted(zip(cvres["mean_test_score"], cvres["params"]), key=lambda pair: pair[0], reverse=True)
for mean_score, params in results:
    print(mean_score, params)

0.94077799773588 {'max_depth': 30, 'n_estimators': 100}
0.9403130366182557 {'max_depth': 15, 'n_estimators': 100}
0.9400356989542857 {'max_depth': 30, 'n_estimators': 200}
0.9386231213544469 {'max_depth': 15, 'n_estimators': 200}
0.9369719094722487 {'max_depth': 30, 'n_estimators': 50}
0.9334466351762817 {'max_depth': 15, 'n_estimators': 50}
0.8310149689899626 {'max_depth': 5, 'n_estimators': 200}
0.8247659856625666 {'max_depth': 5, 'n_estimators': 100}
0.8207536296910067 {'max_depth': 5, 'n_estimators': 50}
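
Because the searches are run with `return_train_score=True`, `cv_results_` also holds the training scores, which the listing above does not show. The sketch below prints the train score and the fold-to-fold standard deviation next to each mean test score. It runs on synthetic stand-in data with a smaller grid, 3 folds and plain accuracy as the metric; all of these are assumptions made to keep it quick.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# synthetic stand-in data (assumption: not the notebook's dataset)
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

small_grid = {"n_estimators": [10, 20], "max_depth": [2, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid,
                      cv=3, scoring="accuracy", return_train_score=True)
search.fit(X, y)

cvres = search.cv_results_
# sort by mean test score only, so the parameter dicts are never compared
rows = sorted(
    zip(cvres["mean_test_score"], cvres["std_test_score"],
        cvres["mean_train_score"], cvres["params"]),
    key=lambda row: row[0], reverse=True,
)
for test_mean, test_std, train_mean, params in rows:
    print(f"test={test_mean:.3f} (+/-{test_std:.3f}) train={train_mean:.3f} {params}")
```

A large gap between train and test scores points to overfitting, and a test-score difference smaller than the standard deviation is weak evidence that one configuration is better than another.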


In [9]:
# Geometric transformation
forest = RandomForestClassifier(n_jobs=-1)
grid_search_geometric = GridSearchCV(forest, param_grid, cv=5, scoring=weighted_accuracy_score, return_train_score=True, n_jobs=-1)

grid_search_geometric.fit(geometric_X, data["labels"])

In [10]:
cvres = grid_search_geometric.cv_results_ 
results = sorted(zip(cvres["mean_test_score"], cvres["params"]), key=lambda pair: pair[0], reverse=True)
for mean_score, params in results:
    print(mean_score, params)

0.9063251418915333 {'max_depth': 30, 'n_estimators': 200}
0.9061280747933838 {'max_depth': 15, 'n_estimators': 200}
0.9044469441417249 {'max_depth': 30, 'n_estimators': 100}
0.9034445213722175 {'max_depth': 30, 'n_estimators': 50}
0.8987143810621105 {'max_depth': 15, 'n_estimators': 50}
0.8981692541598946 {'max_depth': 15, 'n_estimators': 100}
0.660642292435955 {'max_depth': 5, 'n_estimators': 100}
0.6533516501505641 {'max_depth': 5, 'n_estimators': 200}
0.6398529089772614 {'max_depth': 5, 'n_estimators': 50}


In [13]:
# Geometric2D transformation
forest = RandomForestClassifier(n_jobs=-1)
grid_search_geometric2D = GridSearchCV(forest, param_grid, cv=5, scoring=weighted_accuracy_score, return_train_score=True, n_jobs=-1)

grid_search_geometric2D.fit(geometric2D_X, data["labels"])

In [14]:
cvres = grid_search_geometric2D.cv_results_ 
results = sorted(zip(cvres["mean_test_score"], cvres["params"]), key=lambda pair: pair[0], reverse=True)
for mean_score, params in results:
    print(mean_score, params)

0.9187873192787992 {'max_depth': 30, 'n_estimators': 100}
0.9161316082473514 {'max_depth': 15, 'n_estimators': 50}
0.9151214458943533 {'max_depth': 30, 'n_estimators': 200}
0.9134868104437638 {'max_depth': 15, 'n_estimators': 100}
0.9128979546817575 {'max_depth': 15, 'n_estimators': 200}
0.9114348803994785 {'max_depth': 30, 'n_estimators': 50}
0.6929183235821322 {'max_depth': 5, 'n_estimators': 200}
0.6599298068178144 {'max_depth': 5, 'n_estimators': 50}
0.6502096050020887 {'max_depth': 5, 'n_estimators': 100}


In [17]:
# Minimum 2D transformation
forest = RandomForestClassifier(n_jobs=-1)
grid_search_minimum2D = GridSearchCV(forest, param_grid, cv=5, scoring=weighted_accuracy_score, return_train_score=True, n_jobs=-1)

grid_search_minimum2D.fit(minimum2D_X, data["labels"])

In [18]:
cvres = grid_search_minimum2D.cv_results_ 
results = sorted(zip(cvres["mean_test_score"], cvres["params"]), key=lambda pair: pair[0], reverse=True)
for mean_score, params in results:
    print(mean_score, params)

0.946897431737856 {'max_depth': 30, 'n_estimators': 100}
0.9448825594359835 {'max_depth': 30, 'n_estimators': 50}
0.944578848828197 {'max_depth': 15, 'n_estimators': 200}
0.9442354031440587 {'max_depth': 30, 'n_estimators': 200}
0.9402123894884943 {'max_depth': 15, 'n_estimators': 100}
0.9376997182711468 {'max_depth': 15, 'n_estimators': 50}
0.8080951935572017 {'max_depth': 5, 'n_estimators': 200}
0.8076494877515847 {'max_depth': 5, 'n_estimators': 100}
0.8065247249076062 {'max_depth': 5, 'n_estimators': 50}


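As we can see, the Minimum transformation outperformed the Geometric transformation (best scores of 0.941 vs. 0.906), and for both transformations the 2D version beat the 3D version (0.947 vs. 0.941 for Minimum, 0.919 vs. 0.906 for Geometric). In every search, max_depth = 5 scored far below 15 and 30, while the deeper settings all fall within one percentage point of each other.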
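
The comparison can be made explicit by collecting the best mean score of each search. The values below are copied from the printed results above; with the fitted objects in memory, `best_score_` on each `GridSearchCV` would give the same numbers.

```python
# best mean weighted accuracy per transformation, copied from the outputs above
best_scores = {
    "Minimum": 0.94077799773588,
    "Geometric": 0.9063251418915333,
    "Geometric 2D": 0.9187873192787992,
    "Minimum 2D": 0.946897431737856,
}

# rank the transformations from best to worst
ranking = sorted(best_scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:<13}{score:.4f}")

# gain from using the 2D variant, in percentage points
gain_minimum = 100 * (best_scores["Minimum 2D"] - best_scores["Minimum"])
gain_geometric = 100 * (best_scores["Geometric 2D"] - best_scores["Geometric"])
print(f"2D gain: Minimum {gain_minimum:.2f} pp, Geometric {gain_geometric:.2f} pp")
```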

## Analysing Time Performance

In [20]:
# average time per prediction
from time import perf_counter

# the best estimator was tuned on the Minimum 2D features, so fit and time it on those
best_forest = grid_search_minimum2D.best_estimator_
best_forest.fit(minimum2D_X, data["labels"])

start = perf_counter()
best_forest.predict(minimum2D_X)
end = perf_counter()
print((end - start) / len(minimum2D_X))

1.6950687457775248e-05
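
A single timed call can be noisy. The sketch below shows one way to make the measurement steadier: time several repetitions with `time.perf_counter` and keep the fastest. The helper name `time_per_item` and the stand-in `predict` function are hypothetical; in the notebook, `best_forest.predict` and the transformed features would take their place.

```python
from time import perf_counter


def time_per_item(predict, items, repeats=5):
    """Return the best-of-`repeats` average time per item for one batch call."""
    best = float("inf")
    for _ in range(repeats):
        start = perf_counter()
        predict(items)
        elapsed = perf_counter() - start
        best = min(best, elapsed)
    return best / len(items)


# stand-in for a model's predict method (assumption, for illustration only)
def predict(items):
    return [sum(item) for item in items]


items = [[float(i), float(i + 1), float(i + 2)] for i in range(10_000)]
seconds_per_item = time_per_item(predict, items)
print(seconds_per_item)
```

Note that a per-item average taken from one batch call is not the latency of predicting a single observation; calling `predict` on one row at a time usually costs much more per row because of the fixed overhead of each call.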


## Conclusion

The best Random Forest model uses the Minimum 2D transformation with max_depth = 30 and n_estimators = 100, reaching a mean weighted accuracy of 94.69% in 5-fold cross-validation.
<br><br>
The average time per prediction is about 0.000017 seconds (17 microseconds), measured by predicting the whole training set in a single batch.