## Boosting: Fit and evaluate a model

This section uses the Titanic dataset from [this](https://www.kaggle.com/c/titanic/overview) Kaggle competition.

In this section, we will fit and evaluate a simple Gradient Boosting model.

### Read in Data

In [2]:
import joblib
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

tr_features = pd.read_csv('../data/train_features.csv')
tr_labels = pd.read_csv('../data/train_labels.csv')

### Hyperparameter tuning:
n_estimators is the number of individual decision trees to build.

max_depth is how deep each of those trees is allowed to grow.

learning_rate scales the contribution of each tree, which controls how quickly the algorithm moves toward the optimal solution.
If I set it too large, the algorithm may overshoot and never find the optimum; if I set it too small, it may move so slowly that it does not reach the optimum with the trees it has.
![GB](../img/gb.png)
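
To make the learning_rate trade-off concrete, here is a minimal sketch of an idealized case, not the scikit-learn implementation. It assumes squared-error loss and that every stage fits the current residual exactly, so each stage multiplies the residual by (1 - learning_rate). The function name residual_after is hypothetical.

```python
def residual_after(n_stages, learning_rate, initial_residual=1.0):
    """Residual left after n_stages in an idealized boosting run.

    Assumption: each stage fits the current residual exactly, so the update
    F <- F + learning_rate * (y - F) shrinks (or grows) the residual by a
    factor of (1 - learning_rate) per stage.
    """
    residual = initial_residual
    for _ in range(n_stages):
        residual -= learning_rate * residual
    return residual


# Same learning rates as the grid used in this notebook.
for lr in [0.01, 0.1, 1, 10, 100]:
    sizes = [abs(residual_after(n, lr)) for n in (5, 50)]
    print(lr, sizes)
```

Under these assumptions, a learning rate of 0.01 still leaves about 95% of the residual after 5 stages, a rate of 1 removes it in a single stage, and any rate above 2 makes the residual grow at every stage. Real trees never fit the residual exactly, so this is only a rough picture of why very large and very small rates both struggle.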

In [3]:
def print_results(results):
    print('BEST PARAMS: {}\n'.format(results.best_params_))

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        print('{} (+/-{}) for {}'.format(round(mean, 3), round(std * 2, 3), params))

In [4]:
# As I mentioned before, I am testing more individual decision trees, but shallower
# trees, than I did with random forest.

# Both algorithms try to balance the bias-variance trade-off to find the best model.

# Gradient boosting does better with a lot of shallow trees, while random forest goes deeper on fewer trees.
gb = GradientBoostingClassifier()
parameters = {
    'n_estimators': [5, 50, 250, 500],
    'max_depth': [1, 3, 5, 7, 9],
    'learning_rate': [0.01, 0.1, 1, 10, 100]
}

cv = GridSearchCV(gb, parameters, cv=5)
cv.fit(tr_features, tr_labels.values.ravel())

print_results(cv)

BEST PARAMS: {'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50}

0.624 (+/-0.007) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 5}
0.796 (+/-0.115) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 50}
0.796 (+/-0.115) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 250}
0.811 (+/-0.117) for {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 500}
0.624 (+/-0.007) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 5}
0.811 (+/-0.069) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 50}
0.83 (+/-0.074) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 250}
0.839 (+/-0.084) for {'learning_rate': 0.01, 'max_depth': 3, 'n_estimators': 500}
0.624 (+/-0.007) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 5}
0.822 (+/-0.052) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 50}
0.82 (+/-0.037) for {'learning_rate': 0.01, 'max_depth': 5, 'n_estimators': 250}
0.826 (+/-0.047) for {'learning_rate

### Results:
The same caveat that I called out with random forest also applies here.

There is a random component within each boosting model because I am sampling, so the exact scores can vary from run to run.

There are a lot of hyperparameter combinations here. Remember I tested 4 levels of n_estimators, 5 levels of max_depth, and 5 levels of learning_rate. That's 100 hyperparameter combinations, and each one is fit once per fold under 5-fold cross-validation.
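
The size of the search can be checked with a short sketch that uses only the standard library. The grid is copied from the cell above; n_folds, combinations, and cv_fits are names introduced here for illustration.

```python
from itertools import product

# Same grid that is passed to GridSearchCV in this notebook.
parameters = {
    'n_estimators': [5, 50, 250, 500],
    'max_depth': [1, 3, 5, 7, 9],
    'learning_rate': [0.01, 0.1, 1, 10, 100],
}
n_folds = 5  # matches cv=5 in the GridSearchCV call

# Every combination of one value per hyperparameter.
combinations = [dict(zip(parameters, values))
                for values in product(*parameters.values())]
cv_fits = len(combinations) * n_folds

print(len(combinations), 'combinations')
print(cv_fits, 'cross-validation fits')
```

With refit left at its default of True, GridSearchCV also fits the best combination one more time on the full training set, which is the model exposed as best_estimator_.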

I can see from the BEST PARAMS line that the best model has a learning rate of 0.1 with a max_depth of 3 and 50 total estimators.

This model is using drastically more trees than I saw with the random forest, but each individual tree is shallower than what random forest settled on as its best model.

This optimal hyperparameter combination generates an overall accuracy of 84.1%, which is better than any of the other algorithms that I have tested.

I want to highlight that this does not mean this is the best model. It just means it's the best under cross-validation.

Evaluating it on the test set, as I'll do after this, will be the true test.
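
As a rough sketch of what that held-out check computes, the helper below measures accuracy for any object with a predict method, which is the interface scikit-learn estimators expose. The AlwaysZero class and the four rows of data are made up for illustration only; they are not Titanic data and not the fitted model.

```python
def accuracy(model, features, labels):
    """Fraction of rows where model.predict agrees with the true label."""
    predictions = model.predict(features)
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


class AlwaysZero:
    """Hypothetical stand-in for a fitted classifier: predicts class 0 for every row."""

    def predict(self, features):
        return [0 for _ in features]


# Made-up held-out rows and labels, for illustration only.
held_out_features = [[3, 22.0], [1, 38.0], [3, 26.0], [1, 35.0]]
held_out_labels = [0, 1, 1, 1]

print(accuracy(AlwaysZero(), held_out_features, held_out_labels))
```

With the real model, cv.best_estimator_ would take the place of AlwaysZero, and the features and labels would come from data that played no part in the grid search.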

There are two things to take into account in these results. First, high learning rates generate really poor results across the board. That indicates that the algorithm is jumping across the loss curve too quickly and not finding the optimal model.

Second, the models that use only 5 decision trees pretty consistently generate the worst results.
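
One way to check a pattern like this is to average the mean cross-validation score over every row that shares a hyperparameter value. The sketch below does that for the eleven complete rows printed above, all of which use learning_rate=0.01, so it cannot say anything about the higher learning rates. The helper name mean_score_by is hypothetical.

```python
from collections import defaultdict


def mean_score_by(rows, key):
    """Average the mean CV score over all rows sharing the same value of `key`."""
    grouped = defaultdict(list)
    for score, params in rows:
        grouped[params[key]].append(score)
    return {value: sum(scores) / len(scores) for value, scores in grouped.items()}


# (mean_test_score, max_depth, n_estimators) copied from the complete rows
# in the printed output; every one of them uses learning_rate=0.01.
printed = [
    (0.624, 1, 5), (0.796, 1, 50), (0.796, 1, 250), (0.811, 1, 500),
    (0.624, 3, 5), (0.811, 3, 50), (0.83, 3, 250), (0.839, 3, 500),
    (0.624, 5, 5), (0.822, 5, 50), (0.82, 5, 250),
]
rows = [(score, {'learning_rate': 0.01, 'max_depth': depth, 'n_estimators': n})
        for score, depth, n in printed]

by_trees = mean_score_by(rows, 'n_estimators')
for n_trees in sorted(by_trees):
    print(n_trees, round(by_trees[n_trees], 3))
```

On the full grid, the same helper could be fed zip(cv.cv_results_['mean_test_score'], cv.cv_results_['params']) and grouped by 'learning_rate' as well.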

### Write out pickled model

In [5]:
joblib.dump(cv.best_estimator_, '../data/GB_model.pkl')

['../data/GB_model.pkl']