## Gradient Boosting Regression

Regression trees are most commonly paired with boosting. Several additional hyperparameters need to be set, including:
- number of estimators
- learning rate
- subsample
- max depth

We will deal with each of these when appropriate. Our goal is to predict the amount of weight loss in cancer patients based on a set of independent variables. This is the process we will follow to achieve this:
- Data Preparation
- Baseline Decision tree model
- Hyperparameter tuning
- Gradient Boosting model development

In [1]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV, cross_val_score, KFold
from pydataset import data
import pandas as pd
import numpy as np

### Data Preparation
The data preparation is not that difficult in this situation. We simply need to load the dataset into an object and remove any rows with missing values. Then we separate the independent and dependent variables into separate datasets.

In [2]:
df = data("cancer").dropna()

In [3]:
X = df[['time', 'sex', 'ph.karno', 'pat.karno', 'meal.cal', 'status']]
y = df['wt.loss']

### Baseline Model
The purpose of the baseline model is to have something to compare our gradient boosting model against. Therefore, all we will do here is fit several regression trees that differ only in their max depth. The max depth is the maximum number of levels of splits the tree is allowed to grow; a deeper tree can partition the data more finely in order to reduce the error within each leaf. We will then decide which tree is best based on the mean squared error.

The first thing we need to do is set the arguments for the cross-validation. Cross-validating the results gives an estimate of how well the model performs on data it was not trained on.

In [4]:
crossvalidation = KFold(n_splits = 10, shuffle = True, random_state = 1)
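
To make the splitting behaviour concrete, here is a minimal sketch of what `KFold` does on a tiny made-up array. The array `X_demo` and the choice of 5 splits are assumptions for illustration only; the tutorial itself uses 10 splits on the cancer data.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)

# Same idea as the tutorial's splitter, but with 5 folds to keep the output short
kfold = KFold(n_splits=5, shuffle=True, random_state=1)

held_out = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(X_demo), start=1):
    # Each fold trains on 8 rows and holds out the remaining 2
    print(f"fold {fold}: train on {len(train_idx)} rows, test on rows {test_idx.tolist()}")
    held_out.extend(test_idx.tolist())
```

Every row is held out exactly once across the folds, which is why averaging the fold scores gives an estimate based on the whole dataset.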

In [5]:
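# Fit one regression tree per candidate depth and report its cross-validated score
for depth in range(1,10):
    tree_regressor = tree.DecisionTreeRegressor(max_depth = depth,
                                               random_state = 1)
    # Stop once the fitted tree no longer reaches the requested depth
    if tree_regressor.fit(X, y).tree_.max_depth < depth:
        break
    # Scores are negated MSE, so values closer to zero are better
    score = np.mean(cross_val_score(tree_regressor, X, y,
                                   scoring = 'neg_mean_squared_error',
                                   cv = crossvalidation, n_jobs = 1))
    print(depth, score)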

1 -193.55304528235052
2 -189.2634427676794
3 -209.2846723461564
4 -218.80238479654003
5 -236.7481695179989
6 -249.27095314925208
7 -294.80522693721264
8 -293.8231882876493
9 -286.32692707086045


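You can see that a max depth of 2 had the lowest amount of error. Note that scikit-learn reports the negated mean squared error, so the score closest to zero is the best. Therefore, our baseline model has a mean squared error of about 189. We need to improve on this in order to say that our gradient boosting model is superior.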
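
Because the `neg_mean_squared_error` scores are negated, it can help to convert them back before interpreting them. The sketch below copies the scores from the output above into a dictionary (the name `scores` is only for illustration), picks the best depth, and converts the result to a mean squared error and a root mean squared error.

```python
import math

# Cross-validated scores copied from the printed output (negated MSE per depth)
scores = {
    1: -193.55304528235052,
    2: -189.2634427676794,
    3: -209.2846723461564,
    4: -218.80238479654003,
    5: -236.7481695179989,
    6: -249.27095314925208,
    7: -294.80522693721264,
    8: -293.8231882876493,
    9: -286.32692707086045,
}

# The best score is the largest one, i.e. the one closest to zero
best_depth = max(scores, key=scores.get)
baseline_mse = -scores[best_depth]      # flip the sign to get the MSE
baseline_rmse = math.sqrt(baseline_mse) # same units as wt.loss

print(best_depth, round(baseline_mse, 2), round(baseline_rmse, 2))
```

The root mean squared error is often easier to read because it is expressed in the same units as the target variable.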

### Hyperparameter Tuning

Hyperparameter tuning has to do with setting the values of parameters that the algorithm cannot learn on its own. As such, these are constants that you set as the researcher. The problem is that you usually do not know in advance which values will work best. Therefore, the common approach is to have the algorithm try several combinations of values and keep the combination that gives the best cross-validated score. Having said that, there are several hyperparameters we need to tune, and they are:
- number of estimators
- learning rate
- subsample
- max depth

The number of estimators is how many trees to create. The more trees, the more likely the model is to overfit. The learning rate is the weight that each tree has on the final prediction. Subsample is the proportion of the sample used to fit each tree. Max depth, as in the baseline model, limits how deep each tree can grow.
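
To show where each hyperparameter enters, here is a rough sketch of gradient boosting for squared error, written by hand. It is a simplified illustration and not scikit-learn's actual implementation; the helper `boost_sketch` and the synthetic sine data are assumptions made for this example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boost_sketch(X, y, n_estimators=100, learning_rate=0.1,
                 subsample=1.0, max_depth=1, seed=1):
    """Toy gradient boosting for squared error (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    base = y.mean()                 # start from the mean of the target
    pred = np.full(n, base)
    trees = []
    for _ in range(n_estimators):   # n_estimators: how many trees to add
        residual = y - pred         # what the current ensemble still gets wrong
        # subsample: each tree sees only a random fraction of the rows
        idx = rng.choice(n, size=max(1, int(subsample * n)), replace=False)
        # max_depth: how deep each individual tree may grow
        t = DecisionTreeRegressor(max_depth=max_depth, random_state=seed)
        t.fit(X[idx], residual[idx])
        # learning_rate: shrink each tree's contribution to the prediction
        pred += learning_rate * t.predict(X)
        trees.append(t)
    return base, trees, pred

# Synthetic data, used only to exercise the sketch
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

base, trees, fitted = boost_sketch(X_demo, y_demo, n_estimators=200,
                                   learning_rate=0.1, subsample=0.5)
train_mse = np.mean((y_demo - fitted) ** 2)
mean_only_mse = np.mean((y_demo - y_demo.mean()) ** 2)
print(train_mse, mean_only_mse)
```

A smaller learning rate generally needs more trees to reach the same fit, which is why the two are usually tuned together.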

In [6]:
GRB = GradientBoostingRegressor()

In [7]:
search_grid = {
    'n_estimators': [500,1000,2000],
    'learning_rate': [0.001,0.01,0.1],
    'max_depth': [1,2,4],
    'subsample': [0.5, 0.75, 1],
    'random_state': [1]
}

In [8]:
search = GridSearchCV(estimator = GRB, param_grid = search_grid,
                     scoring = 'neg_mean_squared_error', n_jobs = 1,
                     cv = crossvalidation)
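
Before running the search it is worth estimating how much work it involves. The sketch below counts the combinations in the grid with `itertools.product` and multiplies by the number of folds; it assumes the same grid and the 10-fold setup defined earlier.

```python
from itertools import product

# Same grid as above, repeated here so the snippet runs on its own
search_grid = {
    'n_estimators': [500, 1000, 2000],
    'learning_rate': [0.001, 0.01, 0.1],
    'max_depth': [1, 2, 4],
    'subsample': [0.5, 0.75, 1],
    'random_state': [1],
}
n_splits = 10  # folds used by the KFold splitter

# Every combination of one value per hyperparameter
combos = list(product(*search_grid.values()))
n_combos = len(combos)
n_fits = n_combos * n_splits

print(n_combos, n_fits)  # 81 810
```

With up to 2000 trees per model, 810 fits can take a while, and `GridSearchCV` by default refits the best combination once more on the full data. Setting `n_jobs = -1` instead of `1` is one way to spread the work across CPU cores.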

In [13]:
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)

{'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 500, 'random_state': 1, 'subsample': 0.5}
-160.77818130839782


The hyperparameter results speak for themselves. With this tuning, the mean squared error (about 161) is lower than that of the baseline model.

### Gradient Boosting Model Development

In [10]:
GRB2 = GradientBoostingRegressor(n_estimators = 500, learning_rate = 0.01,
                                subsample = 0.5, max_depth = 1, 
                                random_state = 1)

In [11]:
score = np.mean(cross_val_score(GRB2, X, y,
                                scoring = 'neg_mean_squared_error',
                               cv = crossvalidation, n_jobs = 1))

In [12]:
score

-160.77818130839782

These results were to be expected. The gradient boosting model performs better than the baseline regression tree model.
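
To put a number on the improvement, the two cross-validated scores printed above can be compared directly. This is only a small arithmetic sketch; the variable names are made up for illustration.

```python
# Negated MSE values copied from the outputs above
baseline_score = -189.2634427676794   # best regression tree (max depth 2)
boosting_score = -160.77818130839782  # tuned gradient boosting model

baseline_mse = -baseline_score
boosting_mse = -boosting_score

# Fraction by which boosting reduces the cross-validated MSE
improvement = (baseline_mse - boosting_mse) / baseline_mse
print(f"MSE reduced by {improvement:.1%}")
```

This works out to a reduction of roughly 15% in mean squared error relative to the baseline tree.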