# (State-of-the-Art) Learning Algorithms Based on Gradient Boosting

In recent years, for numerical (i.e., not multimedia-based) datasets, supervised learning algorithms based on **gradient boosting**, ***with careful hyperparameter tuning***, have frequently won open data-classification contests hosted on kaggle.com and other forums when classification accuracy or ROC AUC is the metric. These algorithms are widely regarded as the state-of-the-art learning algorithms for traditional supervised learning.

Gradient boosting, as we will study below, is a boosting technique that can be applied to any weak learning algorithm (why weak?). In practice, however, it is typically applied to decision trees, and the result is called **Gradient Boosted Decision Trees (GBDT)**.
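
To make the idea concrete, here is a minimal sketch of gradient boosting for regression with squared-error loss, where the weak learner is a one-split decision stump. The helper names (`fit_stump`, `gradient_boost`, `mse`), the toy data, and the learning rates are illustrative assumptions, not part of any library. Real GBDT packages add regularization, deeper trees, and many optimizations.

```python
# Minimal gradient boosting sketch (squared-error loss, one-feature stumps).
# All names and data here are illustrative assumptions.

def fit_stump(x, residuals):
    """Find the single threshold on x that best fits the residuals."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # candidate split points
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def gradient_boost(x, y, n_estimators=50, learning_rate=0.1):
    """Fit an additive model: each stump is trained on the current residuals."""
    base = sum(y) / len(y)  # initial constant prediction
    stumps = []
    pred = [base] * len(y)
    for _ in range(n_estimators):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict


def mse(model, x, y):
    """Mean squared error of model on the points (x, y)."""
    return sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(y)


x = [i / 10 for i in range(40)]
y = [(xi - 2) ** 2 for xi in x]  # a curve no single stump can fit

one_stump = gradient_boost(x, y, n_estimators=1, learning_rate=1.0)
boosted = gradient_boost(x, y, n_estimators=200, learning_rate=0.1)
print(f"MSE with 1 stump:    {mse(one_stump, x, y):.4f}")
print(f"MSE with 200 stumps: {mse(boosted, x, y):.4f}")
```

Each stump on its own is a poor model, which is what "weak" refers to; the ensemble becomes accurate because every new stump corrects the errors left by the ones before it.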

In this lecture, we will try the following popular GBDT implementations that winners at open contests frequently use:
+ **XGBoost**, the go-to choice over the last decade, until it was overtaken by the next one
+ **LightGBM**, similar in performance to XGBoost but much faster on large data
+ **HistGradientBoostingClassifier**, scikit-learn's adaptation of LightGBM

## XGBoost

XGBoost stands for 'Extreme Gradient Boosting':
+ Extreme: it mainly means ***fast*** through a variety of computational and resource-management tricks (not required)

(Not required, but fun to watch in your spare time) XGBoost uses a special type of decision tree called an XGBoost tree -- see (https://www.youtube.com/watch?v=OtD8wVaFm6E&t=11s) for an excellent tutorial on this tree structure.

See (https://xgboost.readthedocs.io/en/latest/) for documentation of XGBoost. We will discuss a few of the key parameters of XGBoost when we try the code.

In [1]:
# load necessary Python packages
import numpy as np
import pandas as pd
# use the full option name; the short form 'max_columns' is ambiguous in recent pandas
pd.set_option('display.max_columns', 50)

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

from sklearn.model_selection import GridSearchCV

In [2]:
# Load the balanced LendingClub dataset

df = pd.read_csv('LendingClub_balanced.csv')

X = df.drop(columns=['not_fully_paid'])
y = df['not_fully_paid']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=365)
X_train = X_train.copy()
X_test = X_test.copy()

In [4]:
# The xgboost package is separate from sklearn. That said, it provides
# xgb.XGBClassifier as a sklearn-compatible interface.

# Note: you might need to first install the xgboost package in anaconda

import xgboost as xgb

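# use_label_encoder=False silences a warning in older xgboost releases;
# the argument is deprecated in newer releases and can be omitted there.
xgb_sklearn = xgb.XGBClassifier(verbosity=0, use_label_encoder=False, random_state=1)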

In [5]:
# Hyperparameter tuning of xgboost

param_grid = {
    'n_estimators': np.arange(50,250,50),
    'max_depth': np.arange(2,7,2),
}    

grid = GridSearchCV(estimator = xgb_sklearn, param_grid = param_grid, cv=3)
grid.fit(X_train, y_train)
print(f"Best parameters are: {grid.best_params_}")
print(f"The cross-validation accuracy is: {grid.best_score_:.4}")

# evaluation
y_predict = grid.best_estimator_.predict(X_test)
print(f"The testing accuracy is: {accuracy_score(y_test, y_predict):.4}")
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)

Best parameters are: {'max_depth': 2, 'n_estimators': 50}
The cross-validation accuracy is: 0.6146
The testing accuracy is: 0.6156
The confusion matrix is:
[[189 109]
 [127 189]]
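
To see how much work the grid search above does, the sketch below enumerates the same grid with plain Python ranges and counts the model fits. The variable names (`candidates`, `n_fits`) are illustrative, and the enumeration only mirrors what `GridSearchCV` does conceptually.

```python
from itertools import product

# The same grid as above, written with plain Python ranges.
param_grid = {
    "n_estimators": list(range(50, 250, 50)),  # [50, 100, 150, 200]
    "max_depth": list(range(2, 7, 2)),         # [2, 4, 6]
}
cv = 3

# Every combination of hyperparameter values is a candidate model.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_fits = len(candidates) * cv  # one fit per candidate per fold
print(f"{len(candidates)} candidates x {cv} folds = {n_fits} fits")
print("First candidate:", candidates[0])
```

With the default `refit=True`, `GridSearchCV` then fits the best candidate once more on the full training set, which is the model returned by `grid.best_estimator_`. Because the count grows multiplicatively with every added hyperparameter, grids are usually kept coarse.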


## LightGBM

[Researchers at Microsoft](https://www.microsoft.com/en-us/research/project/lightgbm/) created LightGBM in 2016. It quickly gained popularity in Kaggle.com competitions because it delivers performance comparable to XGBoost's while often training faster, sometimes by an order of magnitude or more. (This is remarkable given that XGBoost was already considered fast!)

The key reason behind this speed improvement is LightGBM's use of histogram-based algorithms: *for each continuous feature, LightGBM buckets its values into discrete bins*. This greatly speeds up tree growing because there are far fewer candidate split points to consider. LightGBM also uses many [other tricks](https://lightgbm.readthedocs.io/en/latest/Features.html#) to speed up training.
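
The effect of binning on the number of candidate split points can be sketched in a few lines. This is only an illustration under simplifying assumptions: it uses equal-width bins, a made-up Gaussian feature, and a hypothetical helper `to_bin`. LightGBM's actual bin construction is more sophisticated, and its number of bins is controlled by the `max_bin` parameter.

```python
import random

random.seed(0)
feature = [random.gauss(0.0, 1.0) for _ in range(10_000)]  # toy continuous feature

# Exact split search: one candidate threshold between each pair of
# adjacent distinct values.
exact_candidates = len(set(feature)) - 1

# Histogram-based search: map each value to one of n_bins buckets, then
# only consider thresholds at bin boundaries.
n_bins = 16  # illustrative choice, an assumption of this sketch
lo, hi = min(feature), max(feature)
width = (hi - lo) / n_bins


def to_bin(value):
    """Return the index of the equal-width bin that contains value."""
    return min(int((value - lo) / width), n_bins - 1)


binned = [to_bin(v) for v in feature]
hist_candidates = n_bins - 1

print(f"Exact candidates:       {exact_candidates}")
print(f"Histogram candidates:   {hist_candidates}")
print(f"Distinct binned values: {len(set(binned))}")
```

The split search now scans a handful of bin boundaries per feature instead of thousands of raw values, at the cost of a slightly coarser choice of thresholds.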

See (https://lightgbm.readthedocs.io/en/latest/) for documentation of LightGBM.

Note: scikit-learn offers a [histogram-based gradient boosting algorithm](https://scikit-learn.org/stable/modules/ensemble.html#histogram-based-gradient-boosting) called `HistGradientBoostingClassifier` that is inspired by LightGBM. It was introduced as experimental and has been a stable part of the library since scikit-learn 1.0. However, I don't yet see much adoption of it in practice. `HistGradientBoostingClassifier` also appeared to perform worse than XGBoost and LightGBM when I tried it.

Let's try LightGBM below.

In [9]:
# You might need to first install lightgbm package in anaconda

import lightgbm as lgbm

In [10]:
lgbm_sklearn = lgbm.LGBMClassifier(random_state=1)

param_grid = {
    'n_estimators': np.arange(50,250,50),
    'max_depth': np.arange(2,7,2),
}    

grid = GridSearchCV(estimator = lgbm_sklearn, param_grid = param_grid, cv=3)
grid.fit(X_train, y_train)
print(f"Best parameters are: {grid.best_params_}")
print(f"The cross-validation accuracy is: {grid.best_score_:.4}")

# evaluation
y_predict = grid.best_estimator_.predict(X_test)
print(f"The testing accuracy is: {accuracy_score(y_test, y_predict):.4}")
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)

Best parameters are: {'max_depth': 2, 'n_estimators': 200}
The cross-validation accuracy is: 0.6183
The testing accuracy is: 0.6091
The confusion matrix is:
[[186 112]
 [128 188]]


### Hyperparameter tuning of LightGBM using ROC AUC as performance metric

If we don't explicitly specify a performance metric, `GridSearchCV()` falls back on the estimator's own `score` method, which is accuracy for classifiers. Now let's try an alternative metric, `roc_auc`.
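
Unlike accuracy, ROC AUC is computed from predicted scores rather than hard labels: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses hypothetical helpers (`pairwise_auc`, `accuracy_at`) and made-up scores to show that the two metrics can rank models differently.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy_at(y_true, scores, threshold=0.5):
    """Accuracy after converting scores to labels at the given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


y_true = [0, 0, 0, 1, 1, 1]
scores_a = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]  # perfect ranking, all below 0.5
scores_b = [0.10, 0.20, 0.60, 0.40, 0.70, 0.80]  # one negative outranks a positive

for name, scores in [("A", scores_a), ("B", scores_b)]:
    print(f"Model {name}: AUC = {pairwise_auc(y_true, scores):.4f}, "
          f"accuracy = {accuracy_at(y_true, scores):.4f}")
```

Model A ranks every positive above every negative yet misclassifies half the examples at the 0.5 threshold, while model B has the better accuracy and the worse ranking. This is also why the test ROC AUC is computed from `predict_proba` scores rather than from `predict` labels.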

In [11]:
from sklearn.metrics import roc_auc_score

In [12]:
param_grid = {
    'n_estimators': np.arange(50,250,50),
    'max_depth': np.arange(2,7,2),
}    

grid = GridSearchCV(estimator = lgbm_sklearn, param_grid = param_grid, scoring='roc_auc', cv=3)
grid.fit(X_train, y_train)
print(f"Best parameters are: {grid.best_params_}")
print(f"The cross-validation ROC AUC is: {grid.best_score_:.4}")

# evaluation
y_predict = grid.best_estimator_.predict(X_test)
y_predict_proba = grid.best_estimator_.predict_proba(X_test)[:,1]
print(f"The testing ROC AUC is: {roc_auc_score(y_test, y_predict_proba):.4}")
print("The confusion matrix is:")
cm = confusion_matrix(y_test, y_predict)
print(cm)

Best parameters are: {'max_depth': 2, 'n_estimators': 50}
The cross-validation ROC AUC is: 0.6705
The testing ROC AUC is: 0.6492
The confusion matrix is:
[[186 112]
 [128 188]]


## Summary

Automated hyperparameter tuning, such as the cross-validated grid search used above, relieves data scientists of the repetitive work of fine-tuning a model by manually experimenting with hyperparameter values. It is widely adopted in today's analytical practice.