## Gradient Boosting Classification

Gradient boosting is an alternative form of boosting to AdaBoost. Many consider gradient boosting to be a better performer than AdaBoost. One difference between the two algorithms is that gradient boosting uses optimization (gradient descent on a loss function) to decide how each new estimator is fitted and weighted. Like AdaBoost, gradient boosting can be used with most algorithms but is commonly associated with decision trees.
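
To make the optimization idea concrete, here is a minimal sketch of gradient boosting for a binary target, written in plain Python. It is a simplified illustration, not scikit-learn's implementation: the helper names (`fit_stump`, `boost`, `log_loss`) and the toy data are made up for this example, each estimator is a one-split stump on a single feature, and each stump is fitted to the negative gradient of the log loss (the residual `y - p`).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(xs, residuals):
    """Least-squares regression stump: one threshold, one constant per side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # assumes at least two distinct x values
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_estimators=50, learning_rate=0.5):
    # Start from the log-odds of the positive class (a constant model).
    positive_rate = sum(ys) / len(ys)
    init = math.log(positive_rate / (1 - positive_rate))
    stumps = []
    scores = [init] * len(xs)
    for _ in range(n_estimators):
        # Negative gradient of the log loss with respect to the score: y - p.
        residuals = [y - sigmoid(s) for y, s in zip(ys, scores)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Take a small step in the direction suggested by the new stump.
        scores = [s + learning_rate * stump(x) for s, x in zip(scores, xs)]

    def predict_proba(x):
        return sigmoid(init + learning_rate * sum(stump(x) for stump in stumps))
    return predict_proba

def log_loss(ys, probs):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(ys, probs)) / len(ys)

# Hypothetical toy data: one feature, with one "noisy" label at x = 4 and x = 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [0, 0, 0, 1, 0, 1, 1, 1]

predict_proba = boost(xs, ys)
initial_loss = log_loss(ys, [sum(ys) / len(ys)] * len(ys))
boosted_loss = log_loss(ys, [predict_proba(x) for x in xs])
print(initial_loss, boosted_loss)
```

Each round lowers the training log loss a little, which is the sense in which boosting here is an optimization procedure.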

In addition, gradient boosting has several additional hyperparameters, such as max depth and subsample. Max depth limits how many levels of splits each tree may have. The higher the number, the purer the resulting leaves become. The downside to this is the risk of overfitting.
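
The trade-off around max depth can be sketched with a single decision tree on synthetic data. The dataset from `make_classification` and the variable names are assumptions made for this illustration, not part of the cancer example; the point is only to compare training and test accuracy as the depth limit is relaxed.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data (flip_y randomly flips some labels).
X_demo, y_demo = make_classification(n_samples=300, n_features=6,
                                     n_informative=3, flip_y=0.2,
                                     random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1)

results = {}
for depth in (1, 3, None):  # None lets the tree grow until the leaves are pure
    model = DecisionTreeClassifier(max_depth=depth, random_state=1)
    model.fit(X_train, y_train)
    results[depth] = (model.get_depth(),
                      model.score(X_train, y_train),
                      model.score(X_test, y_test))
    print(depth, results[depth])
```

A training accuracy that keeps rising while the test accuracy stalls or drops is the usual sign of overfitting.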

Subsample is the proportion of the training sample that is used to fit each estimator. It can range from a small decimal value up to 1. If the value is set below 1, the method becomes stochastic gradient boosting; at 1, every estimator sees the whole sample.
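
As a small sketch of the subsample setting, the two models below differ only in that value. The synthetic data and variable names are assumptions for illustration. In scikit-learn, the `oob_improvement_` attribute is only available when `subsample` is below 1, because only then are some rows left out of each estimator's fit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     random_state=1)

# Every estimator is fitted on the whole sample.
full = GradientBoostingClassifier(n_estimators=30, subsample=1.0,
                                  random_state=1).fit(X_demo, y_demo)

# Every estimator is fitted on a random half of the sample.
stochastic = GradientBoostingClassifier(n_estimators=30, subsample=0.5,
                                        random_state=1).fit(X_demo, y_demo)

print(hasattr(full, "oob_improvement_"))
print(stochastic.oob_improvement_.shape)
```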

To do this gradient boosting classification, we will use the cancer dataset from the pydataset library. Our goal will be to predict the status of patients (alive or dead) using the available independent variables. The steps we will use are as follows:

- Data preparation
- Baseline decision tree model
- Hyperparameter tuning
- Gradient boosting model

In [1]:
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from pydataset import data
from sklearn.model_selection import GridSearchCV

### Data Preparation

The data preparation is simple in this situation. All we need to do is load the dataset, drop missing values, and create our X dataset and y dataset.

In [2]:
df = data("cancer").dropna()

In [3]:
X = df[['time', 'sex', 'ph.karno', 'pat.karno', 'meal.cal', 'wt.loss']]

In [4]:
y = df['status']

### Baseline Model

The purpose of the baseline model is to have something to compare our gradient boosting model to. The strength of a model is always relative to some other model, so we need to make at least two in order to say that one is better than the other.

The criterion for better in this situation is accuracy. Therefore, we will make a decision tree model, but we will manipulate the max depth of the tree to create 9 different baseline models. The model with the best accuracy will be the baseline model.

To achieve this, we need to use a for loop to make Python fit several decision trees. We also need to set the parameters for the cross-validation by calling KFold(). Once this is done, we print the results for the 9 trees.
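
What KFold() does with these settings can be seen on a toy array. The 20-element array is an assumption made for illustration; the settings mirror the ones used in this walkthrough (10 splits, shuffling, a fixed random state).

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(20)  # stand-in for 20 rows of data
folds = KFold(n_splits=10, shuffle=True, random_state=1)

# Each split yields (train indices, test indices); keep only the test part.
test_folds = [test_idx for _, test_idx in folds.split(toy)]
for i, test_idx in enumerate(test_folds, start=1):
    print(i, test_idx)

# Across the 10 folds, every row is used for testing exactly once.
all_test = np.sort(np.concatenate(test_folds))
```

Because the random state is fixed, the same folds are reused every time the object is passed to a cross-validation function, which keeps model comparisons on an equal footing.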

In [5]:
crossvalidation = KFold(n_splits = 10, shuffle = True, random_state = 1)

In [6]:
for depth in range(1, 10):
    tree_classifier = tree.DecisionTreeClassifier(max_depth = depth,
                                                 random_state = 1)
    # Stop once the fitted tree no longer grows to the requested depth
    if tree_classifier.fit(X, y).tree_.max_depth < depth:
        break
    # Mean accuracy across the 10 cross-validation folds
    score = np.mean(cross_val_score(tree_classifier, X, y,
                    scoring = 'accuracy', cv = crossvalidation, n_jobs = 1))
    print(depth, score)

1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941


It appears that when the max depth is limited to 1, we get the best accuracy at almost 72%. This will be our baseline for comparison. We will now tune the hyperparameters for the gradient boosting algorithm.

### Hyperparameter Tuning

There are several hyperparameters we need to tune. The ones we will tune are:

- number of estimators
- learning rate
- subsample
- max depth

First, we will create an instance of the gradient boosting classifier. Second, we will create our search grid. It is inside this grid that we set several values for each hyperparameter. Then we will call GridSearchCV and pass it the instance of the gradient boosting classifier, the grid, the cross-validation object we made earlier, and n_jobs all together in one place.

In [7]:
GBC = GradientBoostingClassifier()

In [8]:
search_grid = {'n_estimators': [500, 1000, 2000],
              'learning_rate': [0.001, 0.01, 0.1],
              'max_depth': [1,3,5],
              'subsample': [0.5,0.75, 1],
              'random_state': [1]}
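
To get a feel for how much work the grid above implies, the number of candidate combinations can be counted with `ParameterGrid`. This is a small sketch; the multiplication by 10 assumes the 10-fold cross-validation used in this walkthrough.

```python
from sklearn.model_selection import ParameterGrid

search_grid = {'n_estimators': [500, 1000, 2000],
               'learning_rate': [0.001, 0.01, 0.1],
               'max_depth': [1, 3, 5],
               'subsample': [0.5, 0.75, 1],
               'random_state': [1]}

n_candidates = len(ParameterGrid(search_grid))  # 3 * 3 * 3 * 3 * 1
n_fits = n_candidates * 10                      # one fit per fold per candidate
print(n_candidates, n_fits)  # 81 810
```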

In [9]:
search = GridSearchCV(estimator = GBC, param_grid = search_grid,
                     scoring = 'accuracy', n_jobs = 1,
                     cv = crossvalidation)

In [10]:
search

You can now run your model by calling fit(). Keep in mind that there are several hyperparameters, which means it might take some time to run the calculations. It is common to find values for max_depth, subsample, and number of estimators first. Then a second run-through is done to find the learning rate. In our example, we are doing everything at once, which is why it takes longer.
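
The two-stage approach mentioned above might look something like the following sketch. The synthetic data, the small grids, and the 3-fold setup are assumptions chosen so the example runs quickly; they are not the values used for the cancer data. A staged search is greedy, so it can miss a combination that a full grid would find.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, KFold

X_demo, y_demo = make_classification(n_samples=150, n_features=6,
                                     random_state=1)
folds = KFold(n_splits=3, shuffle=True, random_state=1)

# Stage 1: tune tree and sampling settings with the learning rate held fixed.
stage_one = GridSearchCV(
    GradientBoostingClassifier(learning_rate=0.1, random_state=1),
    param_grid={'n_estimators': [20, 40],
                'max_depth': [1, 3],
                'subsample': [0.5, 1.0]},
    scoring='accuracy', cv=folds)
stage_one.fit(X_demo, y_demo)

# Stage 2: keep the stage 1 winners and tune only the learning rate.
stage_two = GridSearchCV(
    GradientBoostingClassifier(random_state=1, **stage_one.best_params_),
    param_grid={'learning_rate': [0.01, 0.1, 0.3]},
    scoring='accuracy', cv=folds)
stage_two.fit(X_demo, y_demo)

best_params = {**stage_one.best_params_, **stage_two.best_params_}
print(best_params, stage_two.best_score_)
```

Here the two stages fit 8 + 3 = 11 candidates instead of the 24 a joint grid over the same values would need.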

In [14]:
search.fit(X, y)
print(search.best_params_)
print(search.best_score_)

{'learning_rate': 0.001, 'max_depth': 5, 'n_estimators': 1000, 'random_state': 1, 'subsample': 0.75}
0.7360294117647059


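You can see what the best hyperparameters are. In addition, we see that when these parameters were set, we got an accuracy of about 73.6%. This is superior to our baseline model. We will now see if we can get similar numbers when we fit a Gradient Boosting model directly. Note that the model below is built with n_estimators = 2000 and learning_rate = 0.01 rather than the best values reported above, so its score is close to, but not the same as, the grid search score.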

### Gradient Boosting Model


In [11]:
GBC2 = GradientBoostingClassifier(n_estimators = 2000, learning_rate = 0.01,
                                 subsample = 0.75, max_depth = 5,
                                 random_state = 1)

In [12]:
score = np.mean(cross_val_score(GBC2, X, y, scoring = 'accuracy',
                               cv = crossvalidation, n_jobs = 1))

In [13]:
score

0.7301470588235295
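
Since the gaps between the baseline and the boosted models are only a point or two of accuracy, it can help to look at how much the score varies from fold to fold before drawing conclusions. The sketch below does this on synthetic data; the dataset and the model settings are assumptions for illustration and do not reproduce the numbers above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     flip_y=0.2, random_state=1)
folds = KFold(n_splits=10, shuffle=True, random_state=1)

models = {
    'stump': DecisionTreeClassifier(max_depth=1, random_state=1),
    'boosting': GradientBoostingClassifier(n_estimators=50, random_state=1),
}

# Keep the per-fold scores instead of averaging them right away.
fold_scores = {name: cross_val_score(model, X_demo, y_demo,
                                     scoring='accuracy', cv=folds)
               for name, model in models.items()}

for name, scores in fold_scores.items():
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the standard deviation across folds is larger than the difference between the means, the ranking of the two models should be treated with caution.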