# Random Forest Classifier

As the final model, we look at the random forest classifier.

In [13]:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
import time

In [14]:
# Load the datasets
X_train = np.load("X_train.npy")
y_train = np.load("y_train.npy")
X_test = np.load("X_test.npy")
y_test = np.load("y_test.npy")
X_train_reduced = np.load("X_train_reduced.npy")
X_test_reduced = np.load("X_test_reduced.npy")

# Flatten each sample into a 1D feature vector
shape1 = X_train.shape
shape2 = X_test.shape

X_train = X_train.reshape(shape1[0], shape1[1]*shape1[2])

X_test = X_test.reshape(shape2[0], shape2[1]*shape2[2])

In [15]:
# Fit a random forest with n estimators, then print its training time and test score
def rf(n, X, y, X_test, y_test, red, maxdepth=None):
    classifier = RandomForestClassifier(n_estimators=n, max_depth=maxdepth)
    start_time = time.time()
    classifier.fit(X, y)
    duration = time.time() - start_time
    print("For the " + red + " dataset:")
    print("the random forest took " + str(duration) + " seconds for " + str(n) + " estimators.")
    # score() returns the mean accuracy on the test set
    sc = classifier.score(X_test, y_test)
    print("the score of random forest with " + str(n) + " esitmators is: " + str(sc))
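
The function above prints its results, which makes it awkward to compare several settings side by side. A minimal sketch of a variant that returns the training time and score instead, so they can be collected in a loop. The name `rf_result`, the fixed `random_state`, and the synthetic data from `make_classification` are assumptions for illustration and are not part of the original notebook:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def rf_result(n, X, y, X_test, y_test, max_depth=None, random_state=0):
    """Fit a random forest and return (fit time in seconds, test accuracy)."""
    clf = RandomForestClassifier(
        n_estimators=n, max_depth=max_depth, random_state=random_state
    )
    start = time.perf_counter()
    clf.fit(X, y)
    duration = time.perf_counter() - start
    return duration, clf.score(X_test, y_test)


# Synthetic stand-in data (an assumption); swap in X_train, y_train, etc.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Collect time and score for several forest sizes in one pass
results = {n: rf_result(n, X_tr, y_tr, X_te, y_te) for n in (20, 50, 100)}
for n, (seconds, accuracy) in results.items():
    print(f"{n:>3} estimators: {seconds:.3f} s, accuracy {accuracy:.4f}")
```

Fixing `random_state` makes repeated runs comparable; the timings still depend on the machine.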

First, let's try a random forest with 100 estimators:

In [24]:
rf(100, X_train, y_train, X_test, y_test, "non reduced", None)

For the non reduced dataset:
the random forest took 9.950697898864746 seconds for 100 estimators.
the score of random forest with 100 esitmators is: 0.9662839248434238


Let's now have a look at the dimensionally reduced dataset:

In [25]:
rf(100, X_train_reduced, y_train, X_test_reduced, y_test, "reduced")

For the reduced dataset:
the random forest took 6.881294250488281 seconds for 100 estimators.
the score of random forest with 100 esitmators is: 0.9875782881002088


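On the reduced dataset, both the score and the computational cost are better: accuracy rises from about 0.966 to 0.988, and training time drops from about 10.0 s to 6.9 s. The reduced dataset therefore gives the best overall performance. Next, we halve the number of estimators to 50: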

In [27]:
rf(50, X_train, y_train, X_test, y_test, "non reduced")

For the non reduced dataset:
the random forest took 5.112277030944824 seconds for 50 estimators.
the score of random forest with 50 esitmators is: 0.9663883089770355


In [26]:
rf(50, X_train_reduced, y_train, X_test_reduced, y_test, "reduced")

For the reduced dataset:
the random forest took 3.6399359703063965 seconds for 50 estimators.
the score of random forest with 50 esitmators is: 0.9867432150313152
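
The scores for 50 and 100 estimators differ only in the third decimal place. Because a random forest is randomized and the runs above do not fix a seed, differences of that size may simply be run-to-run noise. One way to check is to repeat the fit with several seeds and look at the spread. This is a rough sketch on synthetic data; the helper `score_spread` and the demo dataset are assumptions, not part of the original notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def score_spread(n, X, y, X_test, y_test, seeds=range(5)):
    """Return (mean, standard deviation) of test accuracy over several seeds."""
    scores = [
        RandomForestClassifier(n_estimators=n, random_state=seed)
        .fit(X, y)
        .score(X_test, y_test)
        for seed in seeds
    ]
    return float(np.mean(scores)), float(np.std(scores))


# Synthetic stand-in data (an assumption); swap in the real arrays.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

spread = {n: score_spread(n, X_tr, y_tr, X_te, y_te) for n in (20, 50)}
for n, (mean, std) in spread.items():
    print(f"{n:>3} estimators: accuracy {mean:.4f} +/- {std:.4f}")
```

If the gap between two settings is smaller than the standard deviation across seeds, the ranking between them should not be over-interpreted.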


Going from 50 to 100 estimators does not noticeably improve the score; it mainly increases the training time. Let's see whether reducing the number of estimators further decreases the score significantly:

In [29]:
rf(20, X_train, y_train, X_test, y_test, "non reduced")

For the non reduced dataset:
the random forest took 2.1383020877838135 seconds for 20 estimators.
the score of random forest with 20 esitmators is: 0.966910229645094


In [30]:
rf(20, X_train_reduced, y_train, X_test_reduced, y_test, "reduced")

For the reduced dataset:
the random forest took 1.5109620094299316 seconds for 20 estimators.
the score of random forest with 20 esitmators is: 0.9856993736951983


Decreasing the number of estimators from 50 to 20, a factor of 2.5, decreases the training time by roughly the same factor. The score barely changes: on the non-reduced dataset it rises slightly (from 0.9664 to 0.9669), and on the reduced dataset it drops slightly (from 0.9867 to 0.9857). This leads us to believe that 20 estimators is the best choice here. Again, the reduced dataset has a clearly better score than the non-reduced one. One more way to decrease the computational cost is to limit the maximum depth of the trees. Let's see how a maximum depth influences time and score:

In [38]:
rf(20, X_train, y_train, X_test, y_test, "non reduced", 10)

For the non reduced dataset:
the random forest took 2.1836540699005127 seconds for 20 estimators.
the score of random forest with 20 esitmators is: 0.9665970772442589


In [36]:
rf(20, X_train_reduced, y_train, X_test_reduced, y_test, "reduced", 10)

For the reduced dataset:
the random forest took 0.9571869373321533 seconds for 20 estimators.
the score of random forest with 20 esitmators is: 0.9824634655532359
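
Whether a depth limit saves time depends on how deep the trees grow without it. One way to check is to read the depth of each tree in a fitted forest through `estimators_` and `get_depth()`. This is a small sketch on synthetic data; the demo dataset and the helper `tree_depths` are assumptions, not part of the original notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption); swap in X_train and y_train.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)

unlimited = RandomForestClassifier(n_estimators=20, random_state=0)
unlimited.fit(X_demo, y_demo)
capped = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=0)
capped.fit(X_demo, y_demo)


def tree_depths(forest):
    """Depth of every decision tree in a fitted forest."""
    return np.array([tree.get_depth() for tree in forest.estimators_])


depths_unlimited = tree_depths(unlimited)
depths_capped = tree_depths(capped)
print("unlimited: min", depths_unlimited.min(), "max", depths_unlimited.max())
print("capped:    min", depths_capped.min(), "max", depths_capped.max())
```

If most unrestricted trees are already close to the limit, capping the depth removes little work, so the training time changes little.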


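A maximum depth of 10 shortens the training time on the reduced dataset (from about 1.5 s to 1.0 s) but not on the non-reduced one. It lowers the score on both datasets, most visibly on the reduced one (from about 0.986 to 0.982). One final thing to try is to standardize the features with StandardScaler: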

In [33]:
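# Preparing the data: fit each scaler on the training set only, then apply it to the test set
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
sc_reduced = StandardScaler()
X_train_reduced = sc_reduced.fit_transform(X_train_reduced)
X_test_reduced = sc_reduced.transform(X_test_reduced)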

In [34]:
rf(20, X_train, y_train, X_test, y_test, "non reduced", None)

For the non reduced dataset:
the random forest took 2.1309092044830322 seconds for 20 estimators.
the score of random forest with 20 esitmators is: 0.9670146137787057


In [35]:
rf(20, X_train_reduced, y_train, X_test_reduced, y_test, "reduced")

For the reduced dataset:
the random forest took 1.298051118850708 seconds for 20 estimators.
the score of random forest with 20 esitmators is: 0.9852818371607516


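Again, this did not improve the performance. That is expected: decision trees split on feature thresholds, so rescaling individual features leaves the splits essentially unchanged.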
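
The effect of standardization on a forest can be checked directly by training two forests with the same seed, one on raw and one on standardized features, and comparing their predictions. This is a hedged sketch on synthetic data; the demo dataset and variable names are assumptions, not part of the original notebook:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption); swap in the real arrays.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=20, n_informative=8, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit the scaler on the training data only, then reuse it for the test data
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

raw = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)
scaled = RandomForestClassifier(n_estimators=20, random_state=0).fit(
    X_tr_scaled, y_tr
)

# Fraction of test samples on which both forests predict the same class
agreement = float(np.mean(raw.predict(X_te) == scaled.predict(X_te_scaled)))
print(f"prediction agreement: {agreement:.3f}")
```

With identical seeds the two forests should agree on nearly all test samples; any remaining differences would come from floating-point rounding rather than from the scaling itself.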

## To conclude:
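The best-performing classifier is the random forest with 20 estimators, all other settings left at their defaults, on the reduced dataset. It trains in about 1.5 s, and its score of roughly 0.986 is within about 0.002 of the 100-estimator forest.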