# Tuning Stochastic Gradient Boosting
A simple technique for ensembling decision trees is to train each tree on a subsample of the training dataset. Training individual trees on random subsets of the rows is called bagging. When random subsets of the columns (features) are also taken at each split point, the result is called random forest. The same ideas can be applied to gradient tree boosting, in a technique called stochastic gradient boosting.

## Stochastic Gradient Boosting
Gradient boosting is a greedy procedure. New decision trees are added to the model to correct the residual error of the existing model. Each decision tree is built with a greedy search that selects the split points that best minimize an objective function. This can result in trees that use the same attributes, and even the same split points, again and again.

Bagging is a technique in which a collection of decision trees is created, each from a different random subset of rows of the training data. The ensemble performs better because the randomness in each sample produces slightly different trees, which adds diversity among the trees whose predictions are combined.
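
The idea can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how XGBoost is implemented: the "trees" are one-split stumps on a single feature, the loss is squared error, and the names `fit_stump`, `predict`, `stochastic_boost`, and `mse` are hypothetical helpers introduced here. Each round fits a stump to the current residuals using only a random subset of rows drawn without replacement.

```python
import random


def fit_stump(xs, residuals):
    """Greedy search for the single threshold with the lowest squared error."""
    best = None
    # Skip the largest value: using it as a threshold would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all x values identical: fall back to a constant prediction
        mean = sum(residuals) / len(residuals)
        return (xs[0], mean, mean)
    return best[1:]


def predict(base, stumps, learning_rate, x):
    """Sum the base prediction and the shrunken contribution of every stump."""
    total = base
    for threshold, left_value, right_value in stumps:
        total += learning_rate * (left_value if x <= threshold else right_value)
    return total


def stochastic_boost(xs, ys, n_trees=50, learning_rate=0.3, subsample=0.5, seed=7):
    """Fit each new stump to the residuals of a random subset of rows."""
    rng = random.Random(seed)
    base = sum(ys) / len(ys)
    stumps = []
    n_sample = max(2, int(subsample * len(xs)))
    for _ in range(n_trees):
        rows = rng.sample(range(len(xs)), n_sample)  # without replacement
        sub_x = [xs[i] for i in rows]
        sub_res = [ys[i] - predict(base, stumps, learning_rate, xs[i]) for i in rows]
        stumps.append(fit_stump(sub_x, sub_res))
    return base, stumps


def mse(xs, ys, base, stumps, learning_rate):
    """Mean squared error of the boosted model on the given rows."""
    errors = [(y - predict(base, stumps, learning_rate, x)) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)


# Toy data: y = x^2 on 40 evenly spaced points.
xs = [i / 10 for i in range(40)]
ys = [x ** 2 for x in xs]
base, stumps = stochastic_boost(xs, ys)
print("baseline MSE:", mse(xs, ys, base, [], 0.3))
print("boosted MSE: ", mse(xs, ys, base, stumps, 0.3))
```

Because every stump sees a different half of the rows, successive stumps tend to pick different thresholds, which is the effect the subsampling is meant to produce.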

## Tuning Row Subsampling

Row subsampling involves selecting a random sample of the training dataset without replacement. In the scikit-learn wrapper for XGBoost, it is set with the subsample parameter. The default is 1.0, which means no subsampling.
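
To make "without replacement" concrete, the short sketch below contrasts it with the bootstrap-style sampling (with replacement) often used in classical bagging. It uses only Python's standard `random` module; the variable names are illustrative, and XGBoost's internal sampling may differ in detail.

```python
import random

rng = random.Random(7)
n_rows = 10
ratio = 0.5                      # plays the role of the subsample ratio
n_keep = int(ratio * n_rows)

# Without replacement: every selected row index appears at most once.
without_replacement = rng.sample(range(n_rows), n_keep)

# With replacement (bootstrap style): the same row index may be drawn repeatedly.
with_replacement = rng.choices(range(n_rows), k=n_keep)

print(sorted(without_replacement))
print(sorted(with_replacement))
```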

In [None]:
#Load Libraries
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot as pt

#Load Data, split, encode target
data = read_csv("train.csv")
X = data.values[:,0:94]
Y = data.values[:,94]
encoded_y = LabelEncoder().fit_transform(Y)

#Grid Search
model = XGBClassifier(nthread=-1)
subsample = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1.0]
param_grid = dict(subsample=subsample)
kfold = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 7)
grid_search = GridSearchCV(model, param_grid, scoring = 'neg_log_loss', n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, encoded_y)

#Summarize Results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) using %r" %(mean, stdev, param))

In [None]:
#plot results
pt.errorbar(subsample, means, yerr=stds)
pt.title("XGBoost subsample vs Log Loss")
pt.xlabel("Subsample")
pt.ylabel("Log Loss")
pt.savefig('subsample.png')

## Tuning Column Subsampling by Tree

We can also draw a random sample of the features (columns) before creating each decision tree in the boosted model. In the XGBoost wrapper for scikit-learn, this is controlled by the colsample_bytree parameter. The default value is 1.0, meaning that all columns are used in each decision tree. We can evaluate values for colsample_bytree between 0.1 and 1.0.
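
As a rough illustration of per-tree column sampling, the sketch below draws one set of column indices for each tree. The function name `columns_for_trees` and its arguments are hypothetical, and the rounding rule is an assumption rather than XGBoost's exact behavior.

```python
import random


def columns_for_trees(n_columns, colsample_bytree, n_trees, seed=7):
    """Draw one random subset of column indices per tree, without replacement."""
    rng = random.Random(seed)
    n_keep = max(1, int(colsample_bytree * n_columns))  # assumed rounding rule
    return [sorted(rng.sample(range(n_columns), n_keep)) for _ in range(n_trees)]


# Three trees, each restricted to 3 of 10 columns for all of its splits.
for tree_index, columns in enumerate(columns_for_trees(10, 0.3, 3)):
    print(tree_index, columns)
```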

In [None]:
#Load Library
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot as pt

#Load data, split, encode
data = read_csv('train.csv')
X = data.values[:,0:94]
Y = data.values[:,94]
encoded_y = LabelEncoder().fit_transform(Y)

#Grid Search
model = XGBClassifier(nthread=-1)
colsample_bytree = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1]
param_grid = dict(colsample_bytree = colsample_bytree)
kfold = StratifiedKFold(n_splits = 10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring='neg_log_loss', n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, encoded_y)

#Summarize Results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) using %r" %(mean, stdev, param))
    
#Plot
pt.errorbar(colsample_bytree, means, yerr=stds)
pt.title("XGBoost colsample_bytree vs Log Loss")
pt.xlabel('colsample_bytree')
pt.ylabel('Log Loss')
pt.savefig('colsample_bytree.png')

## Tuning Column Subsampling by Split
Rather than subsampling the columns once per tree, we can subsample them at each split in the decision tree. In principle, this is the approach used in random forest. In the XGBoost wrapper classes for scikit-learn, the colsample_bylevel parameter sets the size of the column sample. Note that current XGBoost releases document colsample_bylevel as resampling once per depth level of the tree, with a separate colsample_bynode parameter that resamples at every split. As before, we will vary the ratio from 10% to the default of 100%. The full code listing is provided below.
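
The difference between drawing columns once per tree and drawing them afresh at every split can be sketched as follows. The helper names `draw_columns` and `candidate_columns` are hypothetical, and the sketch only models which columns a split is allowed to consider, not the split search itself.

```python
import random


def draw_columns(n_columns, ratio, rng):
    """Draw a random subset of column indices without replacement."""
    n_keep = max(1, int(ratio * n_columns))
    return sorted(rng.sample(range(n_columns), n_keep))


def candidate_columns(n_columns, ratio, n_splits, per_split, seed=7):
    """Return the columns each split of one tree may consider."""
    rng = random.Random(seed)
    shared = draw_columns(n_columns, ratio, rng)  # drawn once for the whole tree
    candidates = []
    for _ in range(n_splits):
        if per_split:
            candidates.append(draw_columns(n_columns, ratio, rng))  # fresh draw
        else:
            candidates.append(shared)
    return candidates


per_tree = candidate_columns(20, 0.25, 6, per_split=False)
per_split = candidate_columns(20, 0.25, 6, per_split=True)
print("per tree: ", per_tree)
print("per split:", per_split)
```

With per-tree sampling every split sees the same five columns, while per-split sampling gives each split its own five, so more of the feature set gets a chance to be used within a single tree.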

In [None]:
#Load Library
from pandas import read_csv
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import LabelEncoder
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot as pt

#Load data, split, encode
data = read_csv('train.csv')
X = data.values[:,0:94]
Y = data.values[:,94]
encoded_y = LabelEncoder().fit_transform(Y)

#Grid Search
model = XGBClassifier(nthread=-1)
colsample_bylevel = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 1]
param_grid = dict(colsample_bylevel = colsample_bylevel)
kfold = StratifiedKFold(n_splits = 10, shuffle=True, random_state=7)
grid_search = GridSearchCV(model, param_grid, scoring='neg_log_loss', n_jobs=-1, cv=kfold)
grid_result = grid_search.fit(X, encoded_y)

#Summarize Results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) using %r" %(mean, stdev, param))
    
#Plot
pt.errorbar(colsample_bylevel, means, yerr=stds)
pt.title("XGBoost colsample_bylevel vs Log Loss")
pt.xlabel('colsample_bylevel')
pt.ylabel('Log Loss')
pt.savefig('colsample_bylevel.png')