# Machine Learning Using Random Forests
*Curtis Miller*

A **random forest** is a collection of decision trees that each make their own prediction for an observation. Each tree is trained on a random sample of the training set (in scikit-learn, by default a bootstrap sample drawn with replacement), and each split considers only a random subset of the features. The majority decision among the trees is then the predicted value for the observation. Random forests are an example of **ensemble methods**, which combine the predictions of many individual classifiers into a single decision.
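
To make the voting idea concrete, here is a minimal pure-Python sketch of the two ingredients: drawing a random sample of the training rows and taking a majority vote. The helper names `bootstrap_sample` and `majority_vote` and the toy votes are assumptions made up for illustration; they are not part of scikit-learn.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (hypothetical helper)."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the most common label among the individual tree predictions."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)            # Local RNG so the sketch is repeatable
rows = list(range(10))            # Stand-in for ten training observations
sample = bootstrap_sample(rows, rng)

tree_votes = [1, 0, 1, 1, 0]      # Made-up predictions from five trees
forest_prediction = majority_vote(tree_votes)    # 1, since three of five trees vote 1
```

Note that scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities rather than counting hard votes, so this sketch illustrates the idea and not the library's internals.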

The **scikit-learn** class `RandomForestClassifier` can be used for training random forests. In addition to tree depth, random forests have a second hyperparameter to consider: the number of trees to train. Each tree should individually be shallow, and combining more trees should lead to less overfitting.
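
As a quick check of what these two hyperparameters control, the sketch below fits a small forest and inspects the fitted trees. It uses synthetic data from `make_classification` as a stand-in (an assumption, so that the block runs without `titanic.csv`).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; not the Titanic dataset
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=10,    # Number of trees to grow
                                max_depth=5,        # Maximum depth of a tree
                                random_state=0)     # Fixed seed for repeatability
forest.fit(X, y)

n_trees = len(forest.estimators_)                                  # One fitted tree per estimator
deepest = max(tree.get_depth() for tree in forest.estimators_)     # Never exceeds max_depth
print(n_trees, deepest)
```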

We will still be using the *Titanic* dataset.

In [None]:
import pandas as pd
from pandas import DataFrame
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import classification_report
from random import seed    # Seeds Python's built-in RNG; scikit-learn draws from NumPy's instead

In [None]:
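seed(110717)    # Seeds Python's built-in RNG only; scikit-learn does not use it
titanic = pd.read_csv("titanic.csv")
titanic_train, titanic_test = train_test_split(titanic, random_state=110717)    # Fix the split so it is reproducible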

## Growing a Random Forest

Let's generate a random forest where I cap the depth for each tree at $m = 5$ and grow 10 trees.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
forest1 = RandomForestClassifier(n_estimators=10,    # Number of trees to grow
                                 max_depth=5)        # Maximum depth of a tree
forest1.fit(X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}}    # Replace strings with numbers
                                   ).drop(["Survived", "Name"], axis=1),
            y=titanic_train.Survived)

# Example prediction
forest1.predict([[2, 0, 26, 0, 0, 30]])

In [None]:
pred1 = forest1.predict(titanic_train.replace({'Sex': {'male': 0, 'female': 1}}
                                             ).drop(["Survived", "Name"], axis=1))
print(classification_report(titanic_train.Survived, pred1))

The random forest does not perform as well on the training data as a full-grown decision tree, but such a tree overfits. The random forest, in comparison, seems to do about as well as a better-tuned decision tree so far.
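
The gap between training accuracy and cross-validated accuracy is one way to see this overfitting. The following is a sketch on synthetic data (an assumption; the Titanic numbers will differ) comparing a full-grown tree with a small, shallow forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise
X, y = make_classification(n_samples=300, n_features=6, flip_y=0.1, random_state=0)

models = {
    "full tree": DecisionTreeClassifier(random_state=0),    # No depth limit
    "forest": RandomForestClassifier(n_estimators=10, max_depth=5, random_state=0),
}

scores = {}
for name, model in models.items():
    # Cross-validated accuracy estimates performance on unseen data
    cv_acc = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    # Training accuracy is measured on the same data the model was fit to
    train_acc = model.fit(X, y).score(X, y)
    scores[name] = (train_acc, cv_acc)
    print(f"{name}: train={train_acc:.3f}, cv={cv_acc:.3f}")
```

A full-grown tree fits its training data perfectly here, so its training accuracy says little about how it will do on new observations.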

## Optimizing Multiple Hyperparameters

We now have two hyperparameters to optimize: tree depth and the number of trees to grow. We have a few ways to proceed:

1. We could use cross-validation to see which combination of hyperparameters performs the best. Beware that there could be many combinations to check!
2. We could use cross-validation to optimize one hyperparameter first, then the next, and so on. While this does not necessarily produce a globally optimal solution, it is less work and likely yields a "good enough" result.
3. We could randomly pick combinations of hyperparameters and use the results to guess a good combination. This is like option 1 but less work.
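
For reference, options 1 and 3 correspond to scikit-learn's `GridSearchCV` and `RandomizedSearchCV`. The sketch below runs both on synthetic data with a deliberately tiny grid; the data and the candidate values are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data; not the Titanic dataset
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
grid = {"n_estimators": [10, 20], "max_depth": [2, 4, 6]}

# Option 1: try every combination (2 * 3 = 6 candidates)
exhaustive = GridSearchCV(RandomForestClassifier(random_state=0),
                          grid, cv=3, scoring="accuracy")
exhaustive.fit(X, y)

# Option 3: try only a random subset of the combinations (3 candidates)
randomized = RandomizedSearchCV(RandomForestClassifier(random_state=0),
                                grid, n_iter=3, cv=3, scoring="accuracy",
                                random_state=0)
randomized.fit(X, y)

print(exhaustive.best_params_, randomized.best_params_)
```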

Here I will go with option 2. I will optimize the number of trees to use first, then maximum tree depth.

In [None]:
n_candidate = [10, 20, 30, 40, 60, 80, 100]    # Candidate forest sizes
res1 = dict()

for n in n_candidate:
    forest = RandomForestClassifier(n_estimators=n, max_depth=5)    # Untrained forest with n trees
    res1[n] = cross_validate(forest,
                            X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}}    # Replace strings with numbers
                                         ).drop(["Survived", "Name"], axis=1),
                            y=titanic_train.Survived,
                            cv=10,
                            return_train_score=False,
                            scoring='accuracy')

res1df = DataFrame({(i, j): res1[i][j]
                             for i in res1.keys()
                             for j in res1[i].keys()}).T

res1df.loc[(slice(None), 'test_score'), :]

In [None]:
res1df.loc[(slice(None), 'test_score'), :].mean(axis=1)
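
The dictionary comprehension above builds a `DataFrame` whose rows are indexed by `(forest size, result name)` pairs, which is why `slice(None)` is needed to select every forest size. Here is a small sketch of the same pattern on made-up numbers (the scores in `fake` are invented purely for illustration).

```python
import numpy as np
from pandas import DataFrame

# Made-up stand-in for cross_validate output: two forest sizes, two folds
fake = {
    10: {"fit_time": np.array([0.1, 0.2]), "test_score": np.array([0.80, 0.90])},
    20: {"fit_time": np.array([0.2, 0.3]), "test_score": np.array([0.70, 0.90])},
}

# Tuple keys become a two-level column index; .T moves it to the rows
df = DataFrame({(i, j): fake[i][j] for i in fake for j in fake[i]}).T

# slice(None) means "every value of the first level"
mean_scores = df.loc[(slice(None), "test_score"), :].mean(axis=1)
best_n = mean_scores.idxmax()[0]    # First element of the (size, name) pair
print(mean_scores)
print(best_n)
```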

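$n = 100$ seems to do well here, but because the forests are not seeded the ranking can change between runs. The cells below continue with $n = 40$. Now let's pick the optimal tree depth.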

In [None]:
m_candidate = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # Candidate depths

In [None]:
res2 = dict()

for m in m_candidate:
    forest = RandomForestClassifier(max_depth=m, n_estimators=40)    # Untrained forest with depth cap m
    res2[m] = cross_validate(forest,
                             X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}}    # Replace strings with numbers
                                          ).drop(["Survived", "Name"], axis=1),
                             y=titanic_train.Survived,
                             cv=10,
                             return_train_score=False,
                             scoring='accuracy')

res2df = DataFrame({(i, j): res2[i][j]
                             for i in res2.keys()
                             for j in res2[i].keys()}).T

res2df.loc[(slice(None), 'test_score'), :]

In [None]:
res2df.loc[(slice(None), 'test_score'), :].mean(axis=1)

A maximum tree depth of $m = 7$ seems to work well. One way to combat the path dependence of this approach would be to repeat the search for the optimal forest size using the new tree depth, then repeat the depth search, and so on, but I will not do so here.
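
The repeated search described above could look roughly like the following sketch, which alternates between the two hyperparameters until neither choice changes. The synthetic data, the short candidate lists, the cap of three rounds, and the helper `cv_accuracy` are all assumptions made to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data; not the Titanic dataset
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
n_candidate = [10, 20, 40]    # Candidate forest sizes
m_candidate = [2, 4, 6]       # Candidate depths


def cv_accuracy(n, m):
    """Mean cross-validated accuracy for one combination (hypothetical helper)."""
    model = RandomForestClassifier(n_estimators=n, max_depth=m, random_state=0)
    return cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()


n_best, m_best = n_candidate[0], 5    # Starting guess
for _ in range(3):                    # Cap the number of rounds
    previous = (n_best, m_best)
    # Optimize forest size with depth held fixed, then depth with size held fixed
    n_best = max(n_candidate, key=lambda n: cv_accuracy(n, m_best))
    m_best = max(m_candidate, key=lambda m: cv_accuracy(n_best, m))
    if (n_best, m_best) == previous:
        break                         # Neither choice changed, so stop

print(n_best, m_best)
```

Like the one-pass version, this can still settle on a combination that is not the global optimum; it only removes some of the dependence on the starting values.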

Let's now see how the new random forest performs on the test set.

In [None]:
forest2 = RandomForestClassifier(max_depth=7, n_estimators=40)    # Depth chosen by cross-validation above
forest2.fit(X=titanic_train.replace({'Sex': {'male': 0, 'female': 1}}    # Replace strings with numbers
                                   ).drop(["Survived", "Name"], axis=1),
            y=titanic_train.Survived)

survived_test_predict = forest2.predict(X=titanic_test.replace(
    {'Sex': {'male': 0, 'female': 1}}
).drop(["Survived", "Name"], axis=1))

In [None]:
print(classification_report(titanic_test.Survived, survived_test_predict))
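
`classification_report` lists precision, recall, and F1 score for each class. As a reminder of what those numbers mean, here is a sketch that computes them by hand for the positive class on made-up labels (the two lists are invented for illustration and are not Titanic results).

```python
# Made-up true labels and predictions (1 = survived, 0 = did not survive)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)    # Predicted 1, truly 1
fp = sum(t == 0 and p == 1 for t, p in pairs)    # Predicted 1, truly 0
fn = sum(t == 1 and p == 0 for t, p in pairs)    # Predicted 0, truly 1

precision = tp / (tp + fp)    # Share of predicted 1s that are correct
recall = tp / (tp + fn)       # Share of true 1s that were found
f1 = 2 * precision * recall / (precision + recall)    # Harmonic mean of the two

print(precision, recall, f1)
```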

The random forest does reasonably well, though it does not appear to be much of an improvement over the decision tree. Given the added complexity of the random forest, the simpler decision tree would be preferred here.