## Random Forest

As with the Decision Tree example, we again compare the performance of our own implementation to the one from the sklearn library. To avoid redundancy, this notebook focuses only on the examples where the single Decision Tree struggled and skips the simple datasets where it already performed well: the digits dataset for classification and the diabetes dataset for regression.
Finally, we also use the large diamonds dataset to compare the relative performance of the two implementations.

In [1]:
# Load modules
from models.random_forest import RandomForestClassifier as OwnRandomForestClassifier, RandomForestRegressor as OwnRandomForestRegressor
from sklearn.ensemble import RandomForestClassifier as SklearnRandomForestClassifier, RandomForestRegressor as SklearnRandomForestRegressor

from utils.reports import evaluate_classification, evaluate_regression
from sklearn.model_selection import train_test_split

from sklearn.datasets import load_digits
from sklearn.datasets import load_diabetes
from datasets.diamonds import load_diamonds

ds_c_hard = load_digits()
X, Y = ds_c_hard.data, ds_c_hard.target
X_c_hard_train, X_c_hard_test, Y_c_hard_train, Y_c_hard_test = train_test_split(X, Y, test_size=0.2, random_state=42)

ds_r_medium = load_diabetes()
X, Y = ds_r_medium.data, ds_r_medium.target
X_r_medium_train, X_r_medium_test, Y_r_medium_train, Y_r_medium_test = train_test_split(X, Y, test_size=0.2, random_state=42)

ds_r_hard = load_diamonds()
X, Y = ds_r_hard.data, ds_r_hard.target
X_r_hard_train, X_r_hard_test, Y_r_hard_train, Y_r_hard_test = train_test_split(X, Y, test_size=0.2, random_state=42)

Starting with the hard classification example, we can immediately see that the Random Forest achieves a much higher accuracy than the individual Decision Tree (the Decision Trees achieved an accuracy of around 0.85), because the ensemble can capture more complex structure such as that present in the digits dataset. To avoid excessive compute times when running the notebook, we do not perform a grid search over hyperparameters and instead rely on the default values. Note that the results could potentially be improved with a more thorough search.

In [2]:
rf_classifier = OwnRandomForestClassifier()

rf_classifier.fit(X_c_hard_train, Y_c_hard_train)
Y_c_hard_pred = rf_classifier.predict(X_c_hard_test)

evaluate_classification(Y_c_hard_test, Y_c_hard_pred)

Precision: 0.96, Recall: 0.96, F1-Score: 0.96
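
The gain over a single tree can be reproduced in miniature. The following is a minimal sketch of the bagging idea behind a random forest, not our actual implementation: it assumes sklearn's `DecisionTreeClassifier` as the base learner, and the tree count and seeds are illustrative choices. Each tree is trained on a bootstrap sample with a random feature subset per split, and the forest predicts by majority vote.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same data and split as in the cells above.
X, Y = load_digits(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

rng = np.random.default_rng(42)
n_trees = 25  # illustrative value, not a tuned setting

trees = []
for i in range(n_trees):
    # Bootstrap sample: draw training rows with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # max_features="sqrt" considers a random feature subset at every split.
    tree = DecisionTreeClassifier(max_features="sqrt", random_state=i)
    tree.fit(X_train[idx], Y_train[idx])
    trees.append(tree)

# Majority vote over the per-tree predictions, one column per test sample.
votes = np.stack([tree.predict(X_test) for tree in trees])
forest_pred = np.array([np.bincount(col, minlength=10).argmax() for col in votes.T])

single_acc = np.mean(trees[0].predict(X_test) == Y_test)
forest_acc = np.mean(forest_pred == Y_test)
print(f"single tree: {single_acc:.2f}, forest of {n_trees}: {forest_acc:.2f}")
```

Averaging many decorrelated trees reduces the variance of the individual, overfitted trees, which is why the vote tends to be more accurate than any single member.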


In [3]:
rf_classifier = SklearnRandomForestClassifier()

rf_classifier.fit(X_c_hard_train, Y_c_hard_train)
Y_c_hard_pred = rf_classifier.predict(X_c_hard_test)

evaluate_classification(Y_c_hard_test, Y_c_hard_pred)

Precision: 0.97, Recall: 0.97, F1-Score: 0.97
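
For readers who want to try the hyperparameter search that was skipped above, the following is a minimal sketch using sklearn's `GridSearchCV` with the sklearn classifier. The grid values are deliberately tiny and purely illustrative. Running the same search with our own classifier would assume that it implements the sklearn estimator interface (`get_params` / `set_params`), which is not verified here.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Same data and split as in the cells above.
X, Y = load_digits(return_X_y=True)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Small illustrative grid; these values are assumptions, not recommendations.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training split only
    scoring="f1_macro",  # macro-averaged F1 across the ten digit classes
)
search.fit(X_train, Y_train)

best_params = search.best_params_
# With `scoring` set, `score` reports macro F1 of the refit best estimator.
test_f1 = search.score(X_test, Y_test)
print(best_params, f"test macro F1: {test_f1:.2f}")
```

The test split is only touched once, after the search has finished, so the reported score is not biased by the parameter selection.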


On the diabetes dataset we observe only marginal improvements over standard decision trees. Given that the sklearn implementation performs equally poorly, we can assume that the dataset is simply not well suited to this type of model and that there are no inherent flaws in our implementation.

In [4]:
rf_regressor = OwnRandomForestRegressor()

rf_regressor.fit(X_r_medium_train, Y_r_medium_train)
Y_r_medium_pred = rf_regressor.predict(X_r_medium_test)

evaluate_regression(Y_r_medium_test, Y_r_medium_pred)

MAE: 44.63, MSE: 2928.43, R²: 0.45


In [5]:
rf_regressor = SklearnRandomForestRegressor()

rf_regressor.fit(X_r_medium_train, Y_r_medium_train)
Y_r_medium_pred = rf_regressor.predict(X_r_medium_test)

evaluate_regression(Y_r_medium_test, Y_r_medium_pred)

MAE: 45.00, MSE: 3051.44, R²: 0.42
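
To put the R² values of roughly 0.4 into context, here is a minimal sketch of how the three reported metrics can be computed. The function `regression_report` is a hypothetical stand-in; the internals of `evaluate_regression` are not shown in this notebook and may differ. The toy values are made up for illustration.

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Hypothetical stand-in for evaluate_regression: returns (MAE, MSE, R²)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    mae = np.mean(np.abs(residuals))  # mean absolute error
    mse = np.mean(residuals ** 2)     # mean squared error

    # R² compares the model's squared error with that of always
    # predicting the mean of the targets.
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2

# Toy values, purely illustrative.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
mae, mse, r2 = regression_report(y_true, y_pred)
print(f"MAE: {mae:.2f}, MSE: {mse:.2f}, R²: {r2:.2f}")  # MAE: 0.50, MSE: 0.50, R²: 0.90
```

Read this way, an R² of about 0.45 means the forest removes a little under half of the squared error that a constant mean prediction would make on the test set.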


Finally, the diamonds dataset shows that our implementation is efficient enough to handle large datasets. In terms of predictive performance, it is not surprising that we achieve a high R², as this has already been achieved by individual decision trees.

In [6]:
rf_regressor = OwnRandomForestRegressor()

rf_regressor.fit(X_r_hard_train, Y_r_hard_train)
Y_r_hard_pred = rf_regressor.predict(X_r_hard_test)

evaluate_regression(Y_r_hard_test, Y_r_hard_pred)

MAE: 294.72, MSE: 332031.56, R²: 0.98


In [7]:
rf_regressor = SklearnRandomForestRegressor()

rf_regressor.fit(X_r_hard_train, Y_r_hard_train)
Y_r_hard_pred = rf_regressor.predict(X_r_hard_test)

evaluate_regression(Y_r_hard_test, Y_r_hard_pred)

MAE: 266.15, MSE: 290207.92, R²: 0.98
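
The efficiency claim can be made measurable by timing the `fit` call. The following is a minimal sketch with a hypothetical helper `time_fit`. It uses synthetic data from `make_regression` and the sklearn regressor so that it runs on its own; the sample count and number of trees are arbitrary assumptions, and no timings for the diamonds dataset are implied.

```python
import time

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

def time_fit(model, X, Y):
    """Hypothetical helper: fit the model and return the wall-clock seconds taken."""
    start = time.perf_counter()
    model.fit(X, Y)
    return time.perf_counter() - start

# Synthetic stand-in data; sizes are arbitrary and much smaller than diamonds.
X, Y = make_regression(n_samples=2000, n_features=9, noise=10.0, random_state=42)

elapsed = time_fit(RandomForestRegressor(n_estimators=20, random_state=42), X, Y)
print(f"fit took {elapsed:.2f} s")
```

In the notebook itself, the same helper could be called once with `OwnRandomForestRegressor()` and once with `SklearnRandomForestRegressor()` on `X_r_hard_train`, `Y_r_hard_train` to compare the two implementations directly.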
