## Pipeline: Tune hyperparameters

This section uses the Titanic dataset from [this Kaggle competition](https://www.kaggle.com/c/titanic/overview).

In this section, we will tune the hyperparameters for the basic model we fit in the last section.

### Read in data

![Tune Hyperparameters](../../img/tune_hyperparameters.png)

In [10]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
# Run GridSearchCV to find the best hyperparameter settings for our model.
# GridSearchCV wraps an estimator (here a random forest) and evaluates each
# hyperparameter combination with cross-validation.
tr_features = pd.read_csv('/Users/AndrewMcLeod_1/Desktop/Ex_Files_Applied_Machine_Learning/Exercise Files/train_features.csv')
tr_labels = pd.read_csv('/Users/AndrewMcLeod_1/Desktop/Ex_Files_Applied_Machine_Learning/Exercise Files/train_labels.csv', header=None)
tr_labels.drop(index=tr_labels.index[1],axis=0,inplace=True)

### Hyperparameter tuning

![Hyperparameters](../../img/hyperparameters.png)

In [11]:
# This function prints the mean accuracy across the five folds, and two standard
# deviations of that accuracy, for every hyperparameter combination.
# This gives us the information we need to choose the best hyperparameter settings.
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))
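
The same `cv_results_` dictionary can also be loaded into a pandas DataFrame, which makes the combinations easier to sort and compare. The sketch below is illustrative only: it assumes synthetic data from `make_classification` and a smaller, hypothetical grid rather than the Titanic features, and the names `search` and `summary` are not part of the original notebook.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption; not the Titanic training features)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A small hypothetical grid so the example runs quickly
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {'n_estimators': [5, 20], 'max_depth': [2, None]},
    cv=5,
)
search.fit(X, y)

# One row per hyperparameter combination, best-ranked first
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = (
    pd.DataFrame(search.cv_results_)[columns]
    .sort_values('rank_test_score')
    .reset_index(drop=True)
)
print(summary)
```

Sorting by `rank_test_score` puts the combination with the highest mean accuracy in the first row, which is the same combination family that `best_params_` reports.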

In [12]:
# Now we can run GridSearchCV.
# A random forest is essentially a collection of decision trees.
rf = RandomForestClassifier()
# There are two hyperparameters we want to tune: the number of estimators
# (i.e., individual decision trees) and the max depth (how deep each tree can grow).
parameters = {
    'n_estimators': [5, 50, 100],
    'max_depth': [2, 10, 20, None]
}
# Call GridSearchCV to evaluate every possible combination of parameters.
# With 12 parameter combinations and 5 folds of CV, this fits 60 models
# (plus one final refit of the best combination on all the training data).
cv = GridSearchCV(rf, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)



BEST PARAMS: {'max_depth': 10, 'n_estimators': 50}

0.764 (+/-0.118) for {'max_depth': 2, 'n_estimators': 5}
0.811 (+/-0.085) for {'max_depth': 2, 'n_estimators': 50}
0.802 (+/-0.106) for {'max_depth': 2, 'n_estimators': 100}
0.82 (+/-0.037) for {'max_depth': 10, 'n_estimators': 5}
0.824 (+/-0.058) for {'max_depth': 10, 'n_estimators': 50}
0.818 (+/-0.063) for {'max_depth': 10, 'n_estimators': 100}
0.792 (+/-0.053) for {'max_depth': 20, 'n_estimators': 5}
0.805 (+/-0.021) for {'max_depth': 20, 'n_estimators': 50}
0.811 (+/-0.022) for {'max_depth': 20, 'n_estimators': 100}
0.8 (+/-0.009) for {'max_depth': None, 'n_estimators': 5}
0.809 (+/-0.054) for {'max_depth': None, 'n_estimators': 50}
0.809 (+/-0.025) for {'max_depth': None, 'n_estimators': 100}
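
To see where the 12 combinations and 60 cross-validation fits come from, the grid can be expanded by hand. This is a minimal sketch using only the standard library; the names `combos`, `n_folds`, and `n_cv_fits` are illustrative, and it mirrors the idea of a grid search rather than reproducing scikit-learn's internals.

```python
from itertools import product

# Same grid and fold count as in the search above
parameters = {
    'n_estimators': [5, 50, 100],
    'max_depth': [2, 10, 20, None],
}
n_folds = 5

# Cartesian product of all value lists: one dict per hyperparameter combination
names = list(parameters)
combos = [dict(zip(names, values)) for values in product(*parameters.values())]

# Each combination is fit once per cross-validation fold
n_cv_fits = len(combos) * n_folds
print(len(combos), 'combinations,', n_cv_fits, 'cross-validation fits')
```

Three values for `n_estimators` times four values for `max_depth` gives 12 combinations, and five folds each gives 60 fits, so adding more values to either list multiplies the cost of the search.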


In [None]:
# Thus our best model has a max depth of 10 and 50 trees (n_estimators=50).
# Note that the accuracy is only a little better than our default run in the last exercise, with the number of estimators at 10 and max depth at None.
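
One rough way to judge how much the winning combination really stands out is to compare its lead over the runners-up with the fold-to-fold spread. The sketch below copies the top three rows from the printed results; the variable names and the simple "gap smaller than spread" check are illustrative assumptions, not a formal statistical test.

```python
# (mean accuracy, 2 * std across folds, params), copied from the printed results
results = [
    (0.824, 0.058, {'max_depth': 10, 'n_estimators': 50}),
    (0.820, 0.037, {'max_depth': 10, 'n_estimators': 5}),
    (0.818, 0.063, {'max_depth': 10, 'n_estimators': 100}),
]

best_mean, best_spread, best_params = results[0]

gaps = []
for mean, spread, params in results[1:]:
    # Lead of the best combination over this one, rounded to the printed precision
    gap = round(best_mean - mean, 3)
    # A gap smaller than the spread is weak evidence of a real difference
    within_spread = gap < best_spread
    gaps.append((gap, within_spread))
    print(params, 'gap:', gap, 'within spread:', within_spread)
```

Here both gaps are far smaller than the spread of the best combination, which suggests the top few settings perform about equally well on this data.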