# 05. Random Forest Classification
***

In this chapter, we use Random Forest classification to predict Titanic survivors. As we did previously with the Logistic Regression and kNN classifiers, we test different hypotheses to find the best classifier, judged by its test score.

In [30]:
%matplotlib inline

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

import seaborn as sns
sns.set_style('whitegrid')
sns.set_context('poster')

from matplotlib import rcParams

In [72]:
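# sklearn.cross_validation and sklearn.grid_search were merged into
# sklearn.model_selection (scikit-learn 0.18+)
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV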
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier

***
## Import data

In [36]:
data_train = pd.read_csv('./data/training_dataset.csv')
data_train.head(3)

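| Unnamed: 0 | SibSp | Parch | Fare | Gender | Boarding_C | Boarding_Q | Boarding_S | Age | Pclass_1 | Pclass_2 | Pclass_3 | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 13.0 | 0 | 0 | 0 | 1 | 24.0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 16.1 | 0 | 0 | 0 | 1 | 21.75 | 0 | 0 | 1 | 1 |
| 2 | 0 | 0 | 30.6958 | 1 | 1 | 0 | 0 | 56.0 | 1 | 0 | 0 | 0 |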


In [37]:
data_test = pd.read_csv('./data/testing_dataset.csv')
data_test.head(3)

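| Unnamed: 0 | SibSp | Parch | Fare | Gender | Boarding_C | Boarding_Q | Boarding_S | Age | Pclass_1 | Pclass_2 | Pclass_3 | Survived |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 9.5 | 1 | 0 | 0 | 1 | 24.0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 30.5 | 1 | 0 | 0 | 1 | 41.281386 | 1 | 0 | 0 | 1 |
| 2 | 0 | 2 | 26.25 | 0 | 0 | 0 | 1 | 8.0 | 0 | 1 | 0 | 1 |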


In [38]:
X = data_train.drop('Survived', axis = 1)
Y = data_train['Survived']

X_test = data_test.drop('Survived', axis = 1)
Y_test = data_test['Survived']

***
## Random Forest on the whole training dataset

In this first attempt, we apply the Random Forest (RF) learning algorithm to the whole training dataset, leaving the RF hyper-parameters at their default values. It serves as a baseline for measuring how the scores evolve when we make smarter choices in later RF models.

In [39]:
rfc = RandomForestClassifier()
rfc.fit(X, Y)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [40]:
# Accuracy on the training set, then on the test set
print(rfc.score(X, Y))
print(rfc.score(X_test, Y_test))

0.966244725738
0.724719101124


As we can see, with the default hyper-parameters the model scores about 0.97 on the training set but only about 0.72 on the test set: it overfits the training data and generalizes poorly. We now try a smarter approach and tune the model hyper-parameters.

***
## Searching for the best hyper-parameters

We use `GridSearchCV` to search for the best hyper-parameters of the learning model. The hyper-parameter we optimize is `n_estimators`, the number of trees in the forest (10 by default in the scikit-learn version used here). Each candidate value from 60 to 100 is evaluated with 10-fold cross-validation on the training set.

In [71]:
rfc = RandomForestClassifier()
parameters = { 'n_estimators': range(60, 101) }

gs = GridSearchCV(rfc, param_grid = parameters, cv = 10)
gs.fit(X, Y)

GridSearchCV(cv=10, error_score='raise',
       estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'n_estimators': [60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100]},
       pre_dispatch='2*n_jobs', refit=True, scoring=None, verbose=0)

In [73]:
# Best hyper-parameter value and its mean cross-validated accuracy
print(gs.best_params_)
print(gs.best_score_)

{'n_estimators': 79}
0.841068917018
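
To make explicit what the grid search does, here is a minimal sketch of a roughly equivalent loop: each candidate value of `n_estimators` is scored by cross-validation and the value with the highest mean accuracy wins. It runs on synthetic data from `make_classification`, not on the Titanic dataset, and the names `X_demo`, `y_demo`, `candidates` and `mean_scores` are illustrative. The seed is fixed with `random_state=0` only to make the sketch repeatable; the cells in this chapter leave `random_state=None`, so their results can vary from run to run.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (assumption: not the chapter's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately small grid to keep the sketch fast
candidates = [10, 20, 40]

mean_scores = {}
for n in candidates:
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    # Accuracy on each of the 5 held-out folds
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_scores[n] = scores.mean()

# Candidate with the highest mean cross-validated accuracy
best_n = max(mean_scores, key=mean_scores.get)
print('best n_estimators: %d (mean CV accuracy %.3f)' % (best_n, mean_scores[best_n]))
```

`GridSearchCV` additionally refits the winning model on the whole training set when `refit=True`, which is what `best_estimator_` returns.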


In [78]:
# Best estimator found by the search, refit on the whole training set (refit=True)
rfc_optimal = gs.best_estimator_

In [75]:
# Accuracy of the best estimator on the training set, then on the test set
print(gs.score(X, Y))
print(gs.score(X_test, Y_test))

0.98452883263
0.741573033708


In [76]:
rfc_optimal.feature_importances_

array([ 0.0495754 ,  0.04293062,  0.25870171,  0.27087136,  0.01299723,
        0.00731293,  0.01394788,  0.24671798,  0.03317335,  0.01182045,
        0.0519511 ])
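
The raw array is hard to read without feature names. The sketch below pairs each importance with a column name and sorts them. It assumes that the 11 values follow the column order shown by `data_train.head()`, with `Unnamed: 0` being the row index rather than a feature and `Survived` being the target; this ordering is an assumption, and the values are copied from the output above.

```python
# Feature names in the order of the columns shown by data_train.head()
# (assumption: 'Unnamed: 0' is the row index and 'Survived' is the target)
feature_names = ['SibSp', 'Parch', 'Fare', 'Gender', 'Boarding_C', 'Boarding_Q',
                 'Boarding_S', 'Age', 'Pclass_1', 'Pclass_2', 'Pclass_3']

# Importances copied from the output above
importances = [0.0495754, 0.04293062, 0.25870171, 0.27087136, 0.01299723,
               0.00731293, 0.01394788, 0.24671798, 0.03317335, 0.01182045,
               0.0519511]

# Sort (name, importance) pairs from most to least important
ranked = sorted(zip(feature_names, importances), key=lambda pair: pair[1], reverse=True)
for name, value in ranked:
    print('%-10s %.3f' % (name, value))
```

Under that assumption, `Gender`, `Fare` and `Age` account for roughly three quarters of the total importance. In the notebook itself, `X.columns` would give the names directly instead of a hard-coded list.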

In [79]:
# Confusion matrix on the test set (rows: true class, columns: predicted class),
# used below to compute the F1 score
cm = confusion_matrix(Y_test, rfc_optimal.predict(X_test))
cm

array([[80, 29],
       [17, 52]])

In [83]:
# F1 = 2*TP / (2*TP + FP + FN), here with class 0 (did not survive) as the positive class
f1_score = 2 * cm[0][0] / float(2 * cm[0][0] + cm[0][1] + cm[1][0])
f1_score

0.77669902912621358
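
The F1 value above treats class 0 (did not survive) as the positive class, since it uses `cm[0][0]` as the true positives. As a complementary sketch, the same confusion matrix also gives the precision, recall and F1 score for the survivor class (class 1), assuming scikit-learn's layout where rows are true classes and columns are predicted classes. The matrix is copied from the output above and the variable names are illustrative.

```python
# Confusion matrix copied from the output above:
# rows = true class, columns = predicted class
cm = [[80, 29],
      [17, 52]]

tp = cm[1][1]  # survivors predicted as survivors
fp = cm[0][1]  # non-survivors predicted as survivors
fn = cm[1][0]  # survivors predicted as non-survivors

precision = tp / float(tp + fp)
recall = tp / float(tp + fn)
f1_survived = 2 * tp / float(2 * tp + fp + fn)

print('precision:    %.3f' % precision)    # 0.642
print('recall:       %.3f' % recall)       # 0.754
print('F1 (class 1): %.3f' % f1_survived)  # 0.693
```

For the survivor class the F1 score is about 0.69, lower than the 0.78 obtained for class 0, so the model identifies non-survivors better than survivors.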

Grid search cross-validation on the number of trees improved the test accuracy only slightly, from about 0.72 to about 0.74, while the training accuracy stays near 0.98. The model still overfits, and its test score remains low compared to the other learning models.