# Random Forest Models

## Objective

The objective of this notebook is to train and test different Random Forest models, varying their hyperparameters, in order to find the best one.

## Loading libraries and data

In [30]:
# model library
from LibrasModel import LibrasModel, weighted_accuracy_score, weighted_accuracy_scorer

# model
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# loading data
import pickle
import joblib

# other modules
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

In [31]:
# get base_dataset
data_path = "TrainTestData/train_data.pickle"
with open(data_path, "rb") as f:
    data = pickle.load(f)

## Choosing hyperparameters

In [32]:
# just to have an idea of the maximum max_depth
forest = RandomForestClassifier(n_jobs=-1)
model = LibrasModel(forest, has_z=True)
X = model.transform_data(np.array(data["features"]))
y = np.array(data["labels"])
forest.fit(X, y)
print(max([estimator.tree_.max_depth for estimator in forest.estimators_]))

23


In [33]:
# hyperparameters for first Grid Search
param_grid = {
    "n_estimators": [10, 50, 100, 200],
    "max_depth": [5, 15, 20, 30]
}

In [34]:
def apply_gd(base_model, has_z, data, param_grid):
    """Run a 5-fold grid search and print mean CV scores, best first."""
    model = LibrasModel(base_model, has_z=has_z)
    X = np.array(data["features"])
    y = np.array(data["labels"])
    X = model.transform_data(X)
    gd = GridSearchCV(
        model.model,
        param_grid,
        scoring=weighted_accuracy_scorer,
        return_train_score=True,
        cv=5,
        n_jobs=-1,
    )
    gd.fit(X, y)

    # rank parameter combinations by mean cross-validated score
    cvres = gd.cv_results_
    results = sorted(zip(cvres["mean_test_score"], cvres["params"]), reverse=True, key=lambda x: x[0])
    for mean_score, params in results:
        print(mean_score, params)

## Training

In [35]:
# with z
apply_gd(RandomForestClassifier(n_jobs=-1), True, data, param_grid)

0.9289823652229966 {'max_depth': 30, 'n_estimators': 200}
0.9266420189188022 {'max_depth': 20, 'n_estimators': 100}
0.9253166662104609 {'max_depth': 30, 'n_estimators': 100}
0.9252308745089796 {'max_depth': 20, 'n_estimators': 200}
0.9249231907372154 {'max_depth': 15, 'n_estimators': 200}
0.9222831156150197 {'max_depth': 30, 'n_estimators': 50}
0.9220849192920781 {'max_depth': 15, 'n_estimators': 100}
0.9197196459837832 {'max_depth': 20, 'n_estimators': 50}
0.9176015767839438 {'max_depth': 15, 'n_estimators': 50}
0.8934837375941935 {'max_depth': 20, 'n_estimators': 10}
0.892161945351314 {'max_depth': 30, 'n_estimators': 10}
0.8789505419026369 {'max_depth': 15, 'n_estimators': 10}
0.7036813482900719 {'max_depth': 5, 'n_estimators': 200}
0.7030504776953531 {'max_depth': 5, 'n_estimators': 100}
0.6854882146290712 {'max_depth': 5, 'n_estimators': 50}
0.6318401742605649 {'max_depth': 5, 'n_estimators': 10}


In [36]:
# without z
apply_gd(RandomForestClassifier(n_jobs=-1), False, data, param_grid)

0.9324179417698509 {'max_depth': 15, 'n_estimators': 200}
0.9316945067023299 {'max_depth': 20, 'n_estimators': 200}
0.9307709963634657 {'max_depth': 30, 'n_estimators': 100}
0.9297356270971816 {'max_depth': 30, 'n_estimators': 200}
0.929677795694371 {'max_depth': 20, 'n_estimators': 50}
0.9296225134068165 {'max_depth': 15, 'n_estimators': 100}
0.9284201681218077 {'max_depth': 20, 'n_estimators': 100}
0.9276494415639915 {'max_depth': 30, 'n_estimators': 50}
0.9252849909378323 {'max_depth': 15, 'n_estimators': 50}
0.9032109347339583 {'max_depth': 30, 'n_estimators': 10}
0.9021576185635769 {'max_depth': 20, 'n_estimators': 10}
0.9015027594938999 {'max_depth': 15, 'n_estimators': 10}
0.7013342979804594 {'max_depth': 5, 'n_estimators': 100}
0.6983504874019644 {'max_depth': 5, 'n_estimators': 200}
0.6974061777314733 {'max_depth': 5, 'n_estimators': 50}
0.6536476802839821 {'max_depth': 5, 'n_estimators': 10}


Since the best models used the highest value of n_estimators, we will test some larger values. The best max_depth also varied a lot among the top models, so we will keep testing the three larger options (15, 20 and 30).

The models without z slightly outperformed the ones with z, so from here on we will only use models without z.
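
The gaps between the top configurations are well under one percentage point, so it can help to look at the fold-to-fold spread before reading too much into the ranking. The following is a minimal sketch, not the notebook's own code: it assumes synthetic data from `make_classification` as a stand-in for the Libras features, plain `"accuracy"` as a stand-in for `weighted_accuracy_scorer`, and a reduced grid (`demo_grid`) to keep it fast. It prints `std_test_score` next to `mean_test_score` from `cv_results_`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Libras features and labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# Reduced, hypothetical grid so the sketch runs quickly.
demo_grid = {"n_estimators": [10, 50], "max_depth": [5, 15]}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    demo_grid,
    scoring="accuracy",  # stand-in for weighted_accuracy_scorer
    cv=5,
)
search.fit(X_demo, y_demo)

# Collect (mean, std, params) ordered from best to worst rank.
cv = search.cv_results_
order = np.argsort(cv["rank_test_score"])
ranked = [
    (cv["mean_test_score"][i], cv["std_test_score"][i], cv["params"][i])
    for i in order
]
for mean, std, params in ranked:
    print(f"{mean:.4f} +/- {std:.4f} {params}")
```

If two configurations differ by less than their standard deviations across folds, the ranking between them is probably not meaningful.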

## Fine-tuning

In [39]:
param_grid = {
    "n_estimators": [100, 200, 350, 500],
    "max_depth": [15, 20, 30]
}

In [40]:
# without z
apply_gd(RandomForestClassifier(n_jobs=-1), False, data, param_grid)

0.9359604317694554 {'max_depth': 20, 'n_estimators': 200}
0.934211978748827 {'max_depth': 15, 'n_estimators': 200}
0.9337513054137924 {'max_depth': 20, 'n_estimators': 500}
0.9337483010551992 {'max_depth': 20, 'n_estimators': 100}
0.9331444932627369 {'max_depth': 15, 'n_estimators': 500}
0.9326830857030547 {'max_depth': 30, 'n_estimators': 200}
0.9311045501360221 {'max_depth': 15, 'n_estimators': 350}
0.9308738662024126 {'max_depth': 30, 'n_estimators': 350}
0.9298276395182763 {'max_depth': 20, 'n_estimators': 350}
0.9294303578986891 {'max_depth': 30, 'n_estimators': 500}
0.9287505027497168 {'max_depth': 15, 'n_estimators': 100}
0.9274604243378578 {'max_depth': 30, 'n_estimators': 100}


## Analysing performance in all metrics

In [41]:
from time import time

In [42]:
def print_metrics(base_model, data, has_z=False):
    model = LibrasModel(base_model, has_z=has_z)

    X = np.array(data["features"])
    X_transformed = model.transform_data(X)
    y = np.array(data["labels"])

    # 5-fold cross-validated scores on the transformed features
    model.fit(X, y)
    acc_w = cross_val_score(model.model, X_transformed, y, scoring=weighted_accuracy_scorer, cv=5)
    acc = cross_val_score(model.model, X_transformed, y, scoring="accuracy", cv=5)

    # time a single batched prediction over the whole data set
    t = time()
    model.predict(X)
    t = time() - t

    print(f"Weighted Accuracy: {round(100 * np.mean(acc_w), 2)}%")
    print(f"Accuracy: {round(100 * np.mean(acc), 2)}%")
    print(f"Time per prediction: {1000 * t / len(y)} ms")

In [43]:
model1 = RandomForestClassifier(max_depth=20, n_estimators=200)
model2 = RandomForestClassifier(max_depth=15, n_estimators=200)
model3 = RandomForestClassifier(max_depth=20, n_estimators=500)

In [44]:
print_metrics(model1, data)

Weighted Accuracy: 93.11%
Accuracy: 92.8%
Time per prediction: 0.031893469136336755 ms


In [45]:
print_metrics(model2, data)

Weighted Accuracy: 93.36%
Accuracy: 92.91%
Time per prediction: 0.031031514036244358 ms


In [46]:
print_metrics(model3, data)

Weighted Accuracy: 93.28%
Accuracy: 92.82%
Time per prediction: 0.07406416638144131 ms
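
The timings above divide one batched `predict` call by the number of samples. If predictions are later made one sample at a time, per-call overhead may change the picture. The sketch below compares the two on a plain `RandomForestClassifier`, assuming synthetic data in place of the Libras features and a smaller forest than the ones tested here.

```python
from time import perf_counter

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Libras features and labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)
forest = RandomForestClassifier(max_depth=15, n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Batched: one call over all rows, divided by the row count.
start = perf_counter()
forest.predict(X_demo)
batch_ms = 1000 * (perf_counter() - start) / len(X_demo)

# Single-sample: one call per row, so per-call overhead is included.
n_calls = 20
start = perf_counter()
for row in X_demo[:n_calls]:
    forest.predict(row.reshape(1, -1))
single_ms = 1000 * (perf_counter() - start) / n_calls

print(f"Batched: {batch_ms:.4f} ms per prediction")
print(f"Single-sample: {single_ms:.4f} ms per prediction")
```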


## Conclusion

The random forest models achieve good results, especially in prediction time.

Since the top two models have essentially the same performance, and which of them comes out ahead seems to depend on the random number generator, we will consider both of them in testing. The third model has a much higher prediction time (about 0.074 ms versus 0.031 ms per prediction) without a clear accuracy gain, so it will not be considered.
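
One way to check how much the comparison depends on the random number generator is to repeat the cross-validation with several fixed `random_state` values and compare the spread across seeds with the gap between configurations. This is a rough sketch under stated assumptions: synthetic data replaces the Libras features, plain accuracy replaces the weighted score, `n_estimators` is reduced to keep it quick, and `seed_spread` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Libras features and labels (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)


def seed_spread(max_depth, n_estimators, seeds=(0, 1, 2)):
    """Return the mean 5-fold CV accuracy of one configuration for each seed."""
    scores = []
    for seed in seeds:
        forest = RandomForestClassifier(
            max_depth=max_depth, n_estimators=n_estimators, random_state=seed
        )
        fold_scores = cross_val_score(forest, X_demo, y_demo, scoring="accuracy", cv=5)
        scores.append(fold_scores.mean())
    return np.array(scores)


# n_estimators is lower than in the notebook to keep the sketch fast.
spread_a = seed_spread(max_depth=20, n_estimators=50)
spread_b = seed_spread(max_depth=15, n_estimators=50)

print(f"max_depth=20: {spread_a.mean():.4f} +/- {spread_a.std():.4f}")
print(f"max_depth=15: {spread_b.mean():.4f} +/- {spread_b.std():.4f}")
```

If the difference between the two means is smaller than the spread across seeds, keeping both candidates for testing is a reasonable choice.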