## Random Forest: Fit and evaluate a model

Using the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, we will fit and evaluate a simple Random Forest model.

### Read in Data

In [1]:
import joblib
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

tr_features = pd.read_csv('../data/train_features.csv')
tr_labels = pd.read_csv('../data/train_labels.csv')



### Hyperparameter tuning
This image provides a quick reminder of the two hyperparameters I'll be tuning.

`n_estimators` is simply the number of individual decision trees to build.

`max_depth` dictates how deep each of those decision trees can go.
![RF](../img/rf.png)
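To make the two hyperparameters concrete, here is a minimal sketch that fits a small forest and inspects it. It assumes synthetic data from `make_classification` rather than the Titanic features, and `demo_rf` is a name made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the Titanic features).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Ask for 5 trees, each limited to a depth of 2.
demo_rf = RandomForestClassifier(n_estimators=5, max_depth=2, random_state=0)
demo_rf.fit(X, y)

# estimators_ holds the fitted trees; get_depth() reports how deep each one grew.
n_trees = len(demo_rf.estimators_)
tree_depths = [tree.get_depth() for tree in demo_rf.estimators_]
print(n_trees)
print(tree_depths)
```

The forest contains exactly `n_estimators` trees, and no tree grows past `max_depth`.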

In [3]:
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

In [4]:
# Define the hyperparameter grid:
# for n_estimators, I will test building 5, 50, and 250 decision trees.
# for max_depth, I will test 2, 4, 8, 16, 32, and None. This controls how deep each tree in the forest can go.

# With max_depth=None, each tree keeps growing until every leaf is pure
# or holds fewer samples than min_samples_split.
rf = RandomForestClassifier()
parameters = {
    'n_estimators': [5, 50, 250],
    'max_depth': [2, 4, 8, 16, 32, None]
}

cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)

BEST PARAMS: {'max_depth': 4, 'n_estimators': 50}

0.781 (+/-0.109) for {'max_depth': 2, 'n_estimators': 5}
0.792 (+/-0.121) for {'max_depth': 2, 'n_estimators': 50}
0.802 (+/-0.103) for {'max_depth': 2, 'n_estimators': 250}
0.817 (+/-0.122) for {'max_depth': 4, 'n_estimators': 5}
0.83 (+/-0.1) for {'max_depth': 4, 'n_estimators': 50}
0.824 (+/-0.108) for {'max_depth': 4, 'n_estimators': 250}
0.82 (+/-0.08) for {'max_depth': 8, 'n_estimators': 5}
0.824 (+/-0.067) for {'max_depth': 8, 'n_estimators': 50}
0.822 (+/-0.072) for {'max_depth': 8, 'n_estimators': 250}
0.8 (+/-0.066) for {'max_depth': 16, 'n_estimators': 5}
0.811 (+/-0.018) for {'max_depth': 16, 'n_estimators': 50}
0.811 (+/-0.029) for {'max_depth': 16, 'n_estimators': 250}
0.803 (+/-0.039) for {'max_depth': 32, 'n_estimators': 5}
0.815 (+/-0.036) for {'max_depth': 32, 'n_estimators': 50}
0.811 (+/-0.029) for {'max_depth': 32, 'n_estimators': 250}
0.787 (+/-0.022) for {'max_depth': None, 'n_estimators': 5}
0.811 (+/-0.029) for

### Results
Before I dig into the results, I want to call out that with Random Forest, even if I ran this exact cell again on the exact same training set, I would get slightly different results, because no `random_state` is set.

That's because each time you fit a Random Forest, it randomly samples rows and columns internally to build each individual decision tree.
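If repeatable results are needed, the randomness can be pinned with the `random_state` parameter. The sketch below assumes synthetic data and a hypothetical helper, `fit_and_predict_proba`, and checks that two fits with the same seed agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (an assumption; not the Titanic features).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)


def fit_and_predict_proba(seed):
    """Fit a small forest with a fixed seed and return its class probabilities."""
    model = RandomForestClassifier(n_estimators=20, max_depth=4, random_state=seed)
    model.fit(X, y)
    return model.predict_proba(X)


# Two independent fits with the same seed should produce identical output.
same_seed_match = np.array_equal(fit_and_predict_proba(42), fit_and_predict_proba(42))
print(same_seed_match)
```

In this notebook, the equivalent would be passing a `random_state` to `RandomForestClassifier()` before running the grid search.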

Looking at these results, the best accuracy is produced by a model with 50 estimators and a max_depth of 4. This generates a mean cross-validated accuracy of 83%.

This is the best cross-validation performance that I have seen so far.

Two things that I want to call out quickly. First, it's clear that 5 estimators is not quite enough: if I look through every combination, the worst one is usually the one with only 5 estimators.

I can also see that, throughout the combinations, the ones that have a max_depth of 4 do quite well. Same with a depth of 8.

Accuracy starts to fall off after a max_depth of 8, which indicates that the deeper trees might be overfitting just a little bit.
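One way to probe the overfitting suspicion is to compare training and validation scores side by side. This is a sketch under assumptions: it uses synthetic data and a reduced grid, not the Titanic run above, and relies on `GridSearchCV`'s `return_train_score=True` option.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the Titanic features).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    {'max_depth': [2, 8, None]},
    cv=5,
    return_train_score=True,  # also record accuracy on the training folds
)
search.fit(X, y)

results = search.cv_results_
gaps = []
for params, train, test in zip(
    results['params'], results['mean_train_score'], results['mean_test_score']
):
    # A large train-minus-validation gap suggests the trees are memorizing the training folds.
    gap = train - test
    gaps.append(gap)
    print('{}: train={:.3f}, validation={:.3f}, gap={:.3f}'.format(params, train, test, gap))
```

A gap that widens as `max_depth` increases while the validation score stays flat or drops would be consistent with overfitting.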

### Write out pickled model

In [5]:
joblib.dump(cv.best_estimator_, '../data/RF_model.pkl')

['../data/RF_model.pkl']
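The saved model can later be read back with `joblib.load`. The round trip below is a minimal sketch that assumes synthetic data and a temporary directory instead of the `../data/` path used above.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and model (assumptions; not the tuned Titanic model).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
model = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'RF_model.pkl')
    joblib.dump(model, model_path)      # write the fitted model to disk
    restored = joblib.load(model_path)  # read it back into memory

# The restored model should predict exactly like the original.
predictions_match = (restored.predict(X) == model.predict(X)).all()
print(predictions_match)
```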