For the final lesson in this guide, we'll learn about random forest models. As we saw last time, decision trees are a conceptually simple predictive modeling technique, but deep trees become complicated and are likely to overfit the training data. In addition, decision trees are built greedily: each branch split is made on the variable that appears most significant at that point, even if that split does not lead to the best outcome as the tree grows. Random forests extend decision trees to address these shortcomings.

# Random Forest Basics

A random forest model is a collection of decision tree models whose predictions are combined. When you make a random forest, you specify the number of decision trees to use. The random forest algorithm then takes random samples of observations from your training data and builds a decision tree model for each sample. The random samples are typically drawn with replacement, meaning the same observation can be drawn multiple times. The end result is a set of decision trees, each built from a different group of data records drawn from the original training data.
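
To make sampling with replacement concrete, here is a minimal sketch using only the standard library. The helper `bootstrap_sample` and the stand-in list of 1,000 row indices are illustrative assumptions, not part of sklearn; the real forest does this internally for each tree.

```python
import random

def bootstrap_sample(records, rng):
    """Draw len(records) items from records, with replacement."""
    return [rng.choice(records) for _ in records]

rng = random.Random(12)          # seeded so the sketch is repeatable
records = list(range(1000))      # stand-in for training row indices
sample = bootstrap_sample(records, rng)

# Because draws are with replacement, some rows repeat and others are left out.
unique_fraction = len(set(sample)) / len(records)
print(f"Sample size: {len(sample)}")
print(f"Fraction of distinct original rows in the sample: {unique_fraction:.3f}")
```

The sample is as large as the original data, but it typically contains only around 63% of the distinct rows, so each tree sees a somewhat different view of the training set.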

The decision trees in a random forest model are a little different from the standard decision trees we made last time. Instead of growing trees where every explanatory variable can potentially be used to make a branch at any level, random forests restrict each split to a random subset of the explanatory variables. Limiting the splits in this fashion avoids the pitfall of always splitting on the same variables and helps random forests create a wider variety of trees, which reduces overfitting.
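
The feature restriction can be sketched in a few lines. The helper `candidate_features` is hypothetical and only illustrates the idea; the feature names are the Titanic columns used later in this lesson, and the subset size of 2 is an arbitrary choice for the example.

```python
import random

def candidate_features(all_features, max_features, rng):
    """Pick the random subset of features that one split may consider."""
    return rng.sample(all_features, max_features)

rng = random.Random(12)
features = ["Sex", "Pclass", "SibSp", "Age"]

# Each split draws a fresh subset, so a dominant variable is not always available.
for split in range(3):
    subset = candidate_features(features, 2, rng)
    print(f"Split {split}: best split is chosen among {subset}")
```

A tree built this way sometimes has to split on a variable other than the strongest one, which is what produces the wider variety of trees.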

Random forests are an example of an ensemble model: a model composed of some combination of several different underlying models. Ensemble models often yield better results than single models because different models may detect different patterns in the data, and combining models tends to dull the tendency of complex single models to overfit.
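
A small calculation shows why combining models can help. The sketch below assumes an idealized ensemble: every model is right with the same probability and the models make errors independently. Real trees in a forest are correlated, so the actual gain is smaller; the function name and the 60% accuracy figure are illustrative assumptions.

```python
from math import comb

def majority_vote_accuracy(n_models, p_correct):
    """Probability that more than half of n independent models are right (n odd)."""
    k_min = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(k_min, n_models + 1)
    )

# Accuracy of a majority vote as the ensemble grows, each model 60% accurate.
for n in (1, 3, 11, 101):
    print(n, round(majority_vote_accuracy(n, 0.6), 3))
```

Under these assumptions the vote becomes more accurate as models are added, which is the intuition behind averaging many varied trees.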

# Random Forests on the Titanic

Python's sklearn package offers a random forest model that works much like the decision tree model we used last time. Let's use it to train a random forest model on the Titanic training set:

In [1]:
import numpy as np
import pandas as pd
import os

In [2]:
# Load and prepare Titanic data
titanic_train = pd.read_csv("train.csv")    # Read the data

# Impute the median Age (28 for this training set) for missing Age values
median_age = titanic_train["Age"].median()
titanic_train["Age"] = titanic_train["Age"].fillna(median_age)

In [3]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing

In [4]:
help(RandomForestClassifier)

Help on class RandomForestClassifier in module sklearn.ensemble._forest:

class RandomForestClassifier(ForestClassifier)
 |  RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
 |  
 |  A random forest classifier.
 |  
 |  A random forest is a meta estimator that fits a number of decision tree
 |  classifiers on various sub-samples of the dataset and uses averaging to
 |  improve the predictive accuracy and control over-fitting.
 |  The sub-sample size is controlled with the `max_samples` parameter if
 |  `bootstrap=True` (default), otherwise the whole dataset is used to build
 |  each tree.
 |  
 |  Read more in the :ref:`User Guide <forest>`.
 |  
 |  P

In [5]:


# Set the seed
np.random.seed(12)

# Initialize label encoder
label_encoder = preprocessing.LabelEncoder()

# Convert some variables to numeric
titanic_train["Sex"] = label_encoder.fit_transform(titanic_train["Sex"])

# Initialize the model
rf_model = RandomForestClassifier(n_estimators=1000, # Number of trees
                                  max_features=2,    # Features considered at each split
                                  oob_score=True)    # Score with out-of-bag records

features = ["Sex","Pclass","SibSp","Age"]

# Train the model
rf_model.fit(X=titanic_train[features],
             y=titanic_train["Survived"])

print("OOB accuracy: ")
print(rf_model.oob_score_)

OOB accuracy: 
0.8035914702581369


In [11]:
from sklearn.model_selection import GridSearchCV

In [16]:
%%time
parameters = {
    "n_estimators": [20, 50, 70],
    "max_depth": [3, 5, 7, 9, 2, 4],
    "min_samples_split": [10, 20, 30, 40, 50],
}

model_random_forest = RandomForestClassifier(
    random_state=5,
    class_weight='balanced',
    oob_score=True,
)

model_random_forest = GridSearchCV(
    model_random_forest, 
    parameters, 
    cv=5,
    scoring='accuracy',
)

model_random_forest.fit(titanic_train[features], titanic_train["Survived"])

print('-----')
print(f'Best parameters {model_random_forest.best_params_}')
print(f'Mean cross-validated accuracy score of the best_estimator: {model_random_forest.best_score_:.3f}')
print('-----')

-----
Best parameters {'max_depth': 7, 'min_samples_split': 10, 'n_estimators': 20}
Mean cross-validated accuracy score of the best_estimator: 0.820
-----
Wall time: 1min 25s


In [17]:
from sklearn.model_selection import RandomizedSearchCV

In [18]:
%%time
parameters = {
    "n_estimators": [20, 50, 70],
    "max_depth": [3, 5, 7, 9, 2, 4],
    "min_samples_split": [10, 20, 30, 40, 50],
}

model_random_forest = RandomForestClassifier(
    random_state=5,
    class_weight='balanced',
    oob_score=True,
)

model_random_forest = RandomizedSearchCV(
    model_random_forest,
    parameters,
    cv=5,
    scoring='accuracy',
    n_iter=10,
)

model_random_forest.fit(titanic_train[features], titanic_train["Survived"])

print('-----')
print(f'Best parameters {model_random_forest.best_params_}')
print(f'Mean cross-validated accuracy score of the best_estimator: {model_random_forest.best_score_:.3f}')
print('-----')

-----
Best parameters {'n_estimators': 50, 'min_samples_split': 40, 'max_depth': 7}
Mean cross-validated accuracy score of the best_estimator: 0.808
-----
Wall time: 8.64 s
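
The gap between the two wall times is roughly what the number of model fits would suggest. The sketch below counts the fits for each search using the same parameter grid, 5 folds, and `n_iter=10` as above. It ignores the final refit of the best model and any per-fit differences in tree count or depth, so treat it as a back-of-the-envelope estimate rather than a timing model.

```python
from itertools import product

parameters = {
    "n_estimators": [20, 50, 70],
    "max_depth": [3, 5, 7, 9, 2, 4],
    "min_samples_split": [10, 20, 30, 40, 50],
}
cv_folds = 5   # cv=5 in both searches
n_iter = 10    # candidates sampled by the randomized search

# The grid search tries every combination of the listed values.
grid_candidates = len(list(product(*parameters.values())))
grid_fits = grid_candidates * cv_folds
random_fits = n_iter * cv_folds

print(f"Grid search: {grid_candidates} candidates, {grid_fits} fits")    # 90 candidates, 450 fits
print(f"Randomized search: {n_iter} candidates, {random_fits} fits")     # 10 candidates, 50 fits
```

The randomized search does about one ninth of the work, which lines up with it finishing in seconds instead of over a minute, at the cost of a slightly lower best score in this run (0.808 versus 0.820).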
