---
layout: post
title:  "Boosting"
date:   2023-05-17 10:14:54 +0700
categories: MachineLearning
---

# TOC

- [Introduction](#introduction)
- [Ensemble learning](#ensemble-learning)
- [AdaBoost](#adaboost)
- [Gradient boosting](#gradient-boosting)
- [Stochastic gradient boosting](#stochastic-gradient-boosting)

# Introduction

# Ensemble learning

Ensemble learning combines several models, each of which may be weak on its own. A weak learner is a model that predicts only slightly better than random guessing. By combining many of them we aim for better predictions (lower bias, lower variance, or both). The two most popular ensemble learning methods are bagging and boosting.

- Bagging (bootstrap aggregating) trains the models independently, so they can be trained in parallel. Each model gets its own dataset, created by sampling the original data with replacement; this sampling method is called bootstrapping. Random Forest is a bagging method: each tree is trained on a bootstrap sample of the data, and the resulting predictions are combined by majority vote for classification or by averaging for regression.

- Boosting trains the models sequentially. Each model is fitted to the errors of the models before it, so each new learner improves on the mistakes of the previous ones.
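
To make the bootstrapping step concrete, here is a minimal sketch using NumPy only. The sample size and the seed are arbitrary choices for illustration. It draws one bootstrap sample of row indices and checks how many distinct rows it contains; in expectation this is about 63% of the data (1 - 1/e), and the rows that were never drawn are the "out-of-bag" rows for that model.

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed, for reproducibility only
n_samples = 1000                 # arbitrary dataset size for illustration
indices = np.arange(n_samples)

# Bootstrap: draw n_samples row indices *with replacement*
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Some rows are drawn several times, others not at all
unique_fraction = np.unique(bootstrap_idx).size / n_samples

# Rows never drawn are "out-of-bag" for the model trained on this sample
oob_idx = np.setdiff1d(indices, bootstrap_idx)

print(f"distinct rows in the bootstrap sample: {unique_fraction:.1%}")
print(f"out-of-bag rows: {oob_idx.size}")
```

In bagging, each model would be trained on the rows selected by its own `bootstrap_idx`.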

# AdaBoost

AdaBoost is short for adaptive boosting. The base models are very simple trees; a tree with a single split and two leaves is called a decision stump. In the initial step, every data point is given an equal weight. The data is then fed to the first model and the wrongly classified points are identified. The weights of those points are increased for the next round, so that they are more likely to be chosen during sampling. Each new model therefore pays more attention to the difficult observations, and with the feedback carried by these updated weights the sequence of models gradually learns the tricky cases. Hence the word adaptive.
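
The weight update can be sketched from scratch with NumPy. This is a simplified illustration of discrete AdaBoost with labels in {-1, +1}, not a reference implementation. The helper names (`fit_stump`, `adaboost_fit`, `adaboost_predict`) and the toy dataset are made up for this example, and the weights are used directly in the weighted error instead of being used to resample the data.

```python
import numpy as np


def stump_predict(X, feature, threshold, polarity):
    """Predict -polarity when the feature is <= threshold, +polarity otherwise."""
    return np.where(X[:, feature] <= threshold, -polarity, polarity)


def fit_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = None, np.inf
    for feature in range(X.shape[1]):
        for threshold in np.unique(X[:, feature]):
            for polarity in (1, -1):
                pred = stump_predict(X, feature, threshold, polarity)
                err = w[pred != y].sum()
                if err < best_err:
                    best, best_err = (feature, threshold, polarity), err
    return best, best_err


def adaboost_fit(X, y, n_rounds=50):
    w = np.full(len(y), 1.0 / len(y))            # equal initial weights
    ensemble = []
    for _ in range(n_rounds):
        stump, err = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)     # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)    # weight of this stump in the vote
        pred = stump_predict(X, *stump)
        w = w * np.exp(-alpha * y * pred)        # raise weights of mistakes
        w = w / w.sum()                          # renormalize to sum to 1
        ensemble.append((alpha, stump))
    return ensemble


def adaboost_predict(ensemble, X):
    score = sum(alpha * stump_predict(X, *stump) for alpha, stump in ensemble)
    return np.where(score >= 0, 1, -1)


# Toy data (assumption): a diagonal boundary that no single stump can fit
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

ensemble = adaboost_fit(X, y)
stump_acc = np.mean(adaboost_predict(ensemble[:1], X) == y)
boosted_acc = np.mean(adaboost_predict(ensemble, X) == y)
print(f"first stump alone: {stump_acc:.2f}, full ensemble: {boosted_acc:.2f}")
```

The line `w = w * np.exp(-alpha * y * pred)` is the adaptive part: misclassified points have `y * pred == -1`, so their weight grows and the next stump is pushed toward them.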

# Gradient Boosting

In gradient boosting, the base models are short decision trees. It differs from AdaBoost in that, instead of reweighting the data points, each new tree is fitted to the error of the previous model as measured by the loss function (more precisely, to the negative gradient of the loss, which for squared error is the ordinary residual). Because each tree makes only a small correction, learning is slow, but the result is better: the final ensemble is more robust and much stronger than any individual tree. We can use mean squared error for a regression task and logarithmic loss for a classification task.

The learning rate controls how much of each new tree's correction is added to the ensemble. The lower the rate, the slower the ensemble learns and the more trees it needs. The number of trees is also a hyperparameter: the more trees we add, the higher the risk of overfitting.
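
One way to see the interplay between the learning rate and the number of trees is to track validation accuracy as trees are added. The sketch below uses scikit-learn's `staged_predict`, which yields the ensemble's prediction after each added tree. The synthetic dataset and the grid of learning rates are assumptions made for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)

results = {}
for learning_rate in (0.01, 0.1, 0.5):
    gbc = GradientBoostingClassifier(n_estimators=200,
                                     learning_rate=learning_rate,
                                     random_state=0)
    gbc.fit(X_train, y_train)
    # Validation accuracy after 1, 2, ..., 200 trees
    val_acc = [accuracy_score(y_val, pred) for pred in gbc.staged_predict(X_val)]
    best_n = int(np.argmax(val_acc)) + 1
    results[learning_rate] = (best_n, val_acc[best_n - 1])
    print(f"learning_rate={learning_rate}: best validation accuracy "
          f"{val_acc[best_n - 1]:.3f} at {best_n} trees")
```

A small learning rate typically needs more trees to reach its best validation score, which is why the two hyperparameters are usually tuned together.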

Let's consider the first model for input x and output y:

$$ y = A_1 + (B_1 * x) + e_1 $$

where $$ e_1 $$ is the residual. This $$ e_1 $$ becomes the target that the second tree is fitted to (the linear form is only a schematic stand-in for a tree):

$$ e_1 = A_2 + (B_2 * x) + e_2 $$

and $$ e_2 = A_3 + (B_3 * x) + e_3 $$

and so on. We can have hundreds of trees in a model. When we combine them, the combined model is:

$$ y = A_1 + A_2 + A_3 + B_1 * x + B_2 * x + B_3 * x + e_3 $$
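
The residual-fitting chain above can be sketched in a few lines. This is a minimal illustration for regression with squared error, not how scikit-learn implements it internally. The sine-shaped toy data, the tree depth, the learning rate and the number of trees are all assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (assumption): a noisy sine curve
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_trees = 100

# Stage 0: start from a constant prediction (the mean of y)
prediction = np.full_like(y, y.mean())
trees = []
mse_history = []

for _ in range(n_trees):
    residual = y - prediction                      # e_m: what is still unexplained
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += learning_rate * tree.predict(X)  # add a shrunken correction
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE after 1 tree: {mse_history[0]:.4f}")
print(f"training MSE after {n_trees} trees: {mse_history[-1]:.4f}")
```

The final model is the initial constant plus the sum of all shrunken tree outputs, which mirrors the combined equation above.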



```python
# Import all relevant libraries
import warnings

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

warnings.filterwarnings("ignore")
```

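```python
# Load the dataset
pima = pd.read_csv('diabetes.csv')
pima.head()
```

| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |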


```python
# Split dataset into test and train data
X_train, X_test, y_train, y_test = train_test_split(
    pima.drop('Outcome', axis=1), pima['Outcome'], test_size=0.2)

# Scale the data (the scaler is fitted on the training split only)
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)
```

```python
# Define Gradient Boosting Classifier with hyperparameters
gbc = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05,
                                 random_state=100, max_features=5)
# Fit train data to GBC
gbc.fit(X_train_transformed, y_train)
```

```python
# Confusion matrix will give number of correct and incorrect classifications
print(confusion_matrix(y_test, gbc.predict(X_test_transformed)))
```

```
[[83 15]
 [24 32]]
```


```python
# Accuracy of model
print("GBC accuracy is %2.2f" % accuracy_score(
    y_test, gbc.predict(X_test_transformed)))
```

```
GBC accuracy is 0.75
```


# Stochastic Gradient Boosting

Stochastic gradient boosting (SGB) is a hybrid of the boosting and bagging approaches. At each iteration, a random subsample of the dataset is chosen to fit a tree. SGB is also based on a steepest-descent (gradient) algorithm, which puts the emphasis on misclassified points that are close to being classified correctly. Finally, each iteration uses a small tree, and the trees' predictions are aggregated as a weighted sum (for classification, effectively a weighted vote). SGB is robust to missing data and outliers.

Consider the training sample $$ \{y_i, x_i\}_1^N $$. We need to find a function $$ H^*(x) $$ that maps x to y while minimizing the loss function $$ L(y, H(x)) $$. SGB approximates $$ H^*(x) $$ with an additive expansion:

$$ H(x) = \sum_{m=0}^M \beta_m h(x; a_m) $$

where $$ h(x;a_m) $$ is the base model (which can be a tree), $$ a_m $$ are its parameters and $$ \beta_m $$ is an expansion coefficient. $$ \beta_m $$ and $$ a_m $$ are fitted stage by stage to minimize the loss function:

$$ (\beta_m, a_m) = \arg\min_{\beta,a} \sum_{i=1}^N L(y_i, H_{m-1} (x_i) + \beta h(x_i; a)) $$

and $$ H_m(x) = H_{m-1} (x) + \beta_m h(x;a_m) $$

To solve for $$ (\beta_m, a_m) $$, SGB first fits $$ h(x;a) $$ by least squares:

$$ a_m = \arg\min_{a,\rho} \sum_{i=1}^N {[y_{im} - \rho h(x_i;a)]}^2 $$

where $$ y_{im} = - {\left[ \frac{\partial L(y_i, H(x_i))}{\partial H(x_i)} \right]}_{H(x) = H_{m-1}(x)} $$ is the negative gradient of the loss at the current model (the pseudo-residual).

Then we can estimate $$ \beta_m = \arg\min_{\beta} \sum_{i=1}^N L(y_i, H_{m-1}(x_i) + \beta h(x_i;a_m)) $$
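
As a worked case (an illustration added here, assuming the squared-error loss $$ L(y, H) = \frac{1}{2}(y - H)^2 $$), the negative gradient $$ y_{im} $$ reduces to the ordinary residual, which is why fitting residuals, as in the gradient boosting section, is a special case of this scheme:

$$ y_{im} = - {\left[ \frac{\partial}{\partial H(x_i)} \frac{1}{2} {(y_i - H(x_i))}^2 \right]}_{H(x) = H_{m-1}(x)} = y_i - H_{m-1}(x_i) $$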

To improve performance, at each iteration SGB introduces randomness: a random subsample of the training data, drawn without replacement, is used to fit the regression tree. Tuning involves M, the total number of trees (for example, from 50 up to 700); the learning rate, which scales the contribution of each new tree (for example, 0.01, 0.05, 0.1, 0.5); and L, the depth of each tree (for example, 3, 5, 7, 9).
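
In scikit-learn, the stochastic variant can be sketched by setting `subsample` below 1.0 on `GradientBoostingClassifier`, so that each tree is fitted on a random fraction of the training rows. The synthetic dataset and the particular hyperparameter values below are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

sgb = GradientBoostingClassifier(
    n_estimators=200,    # M: total number of trees
    learning_rate=0.05,  # shrinkage applied to each tree
    max_depth=3,         # L: depth of each tree
    subsample=0.5,       # each tree sees a random half of the training rows
    random_state=0,
)
sgb.fit(X_train, y_train)

test_acc = sgb.score(X_test, y_test)
print(f"test accuracy: {test_acc:.3f}")

# With subsample < 1.0, the rows left out of each tree give an
# out-of-bag estimate of how much each stage improved the loss
print(f"OOB improvement of the first stage: {sgb.oob_improvement_[0]:.4f}")
```

With `subsample=1.0` (the default) the same class performs ordinary, non-stochastic gradient boosting.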

To search for the best combination of parameters, we can use 10-fold cross-validation: the training set is divided into 10 groups, and in each round nine are used for fitting and the remaining one for testing, so that every group serves as the test set once.
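
A minimal sketch of such a search with scikit-learn's `GridSearchCV` is shown below. The grid is deliberately much smaller than the ranges mentioned above so that it runs quickly, and the synthetic dataset is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# A deliberately small grid; widen it for a real search
param_grid = {
    "n_estimators": [50, 100],     # M: number of trees
    "learning_rate": [0.05, 0.1],  # shrinkage
    "max_depth": [3, 5],           # L: depth of each tree
}

search = GridSearchCV(
    GradientBoostingClassifier(subsample=0.5, random_state=0),
    param_grid,
    cv=10,               # 10-fold cross-validation
    scoring="accuracy",
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best mean CV accuracy: {search.best_score_:.3f}")
```

Each of the 8 parameter combinations is fitted 10 times, once per fold, and the combination with the highest mean validation accuracy is refitted on all the data.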