# Lab 6: XGBoost and LightGBM

In this lab, we look at two widely used gradient boosting libraries: XGBoost ([docs](https://xgboost.readthedocs.io/en/stable/index.html), [license](https://github.com/dmlc/xgboost/blob/master/LICENSE)) and LightGBM ([docs](https://lightgbm.readthedocs.io/en/stable/), [license](https://github.com/microsoft/LightGBM/blob/master/LICENSE)). Before running the code, we will discuss some basic theory behind XGBoost and LightGBM (covered in the docs linked above). We then return to this notebook and work through some of the relevant tutorials from their documentation.
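
As a warm-up for the theory, here is a minimal sketch of gradient boosting for squared-error regression, written in plain Python so that it runs without either library. It is only an illustration of the core idea (fit each new weak learner to the current residuals and add a damped version of it to the ensemble); XGBoost and LightGBM add regularization, second-order gradient information, efficient split finding, and much more. The helper names (`fit_stump`, `fit_boosting`) and the toy data are made up for this example.

```python
# Minimal gradient boosting for squared-error regression with depth-1 trees (stumps).
# Illustrative sketch only; not how XGBoost or LightGBM are implemented.

def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # candidate split points
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict_stump(stump, xi):
    threshold, left_value, right_value = stump
    return left_value if xi <= threshold else right_value


def fit_boosting(x, y, n_estimators=20, learning_rate=0.5):
    base = sum(y) / len(y)  # initial prediction: the mean target
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # For the loss 0.5 * (y - pred)^2, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [pi + learning_rate * predict_stump(stump, xi)
                 for xi, pi in zip(x, preds)]
    return base, stumps, preds


def mse(y, preds):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, preds)) / len(y)


# Toy one-dimensional data (hypothetical values)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3]

base, stumps, preds = fit_boosting(x, y)
print(f"MSE of the constant model: {mse(y, [base] * len(y)):.4f}")
print(f"MSE after boosting:        {mse(y, preds):.4f}")
```

The parameters `n_estimators` and `learning_rate` play the same role here as the identically named arguments passed to the estimators later in this notebook.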

## XGBoost

We follow the "quick start tutorial" for XGBoost ([link](https://xgboost.readthedocs.io/en/stable/get_started.html)). It may help to first read this [article](https://xgboost.readthedocs.io/en/stable/python/sklearn_estimator.html) on "Using the Scikit-Learn Estimator Interface" and to check out the [XGBClassifier docs](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBClassifier).

In [None]:
from xgboost import XGBClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

In [None]:
# load data and hold out 20% for testing (random_state fixed for reproducibility)
data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'], test_size=.2, random_state=42)

In [None]:
# create model instance (iris has three classes, so we request a multiclass objective)
bst = XGBClassifier(n_estimators=2, max_depth=2, learning_rate=1, objective='multi:softprob')

In [None]:
# fit model
bst.fit(X_train, y_train)

In [None]:
# make predictions
preds = bst.predict(X_test)
preds
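
The cell above only displays the predicted labels. To judge them, we compare them with `y_test`; `sklearn.metrics.accuracy_score(y_test, preds)` does this for us. The plain-Python sketch below shows the same calculation on made-up labels (the values are hypothetical, not the output of the model above), so that it runs on its own.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for five test samples (classes 0, 1, 2 as in iris)
y_true_demo = [0, 1, 2, 2, 1]
y_pred_demo = [0, 1, 2, 1, 1]

print(f"Accuracy: {accuracy(y_true_demo, y_pred_demo):.2f}")
```

In the notebook itself, calling `accuracy_score(y_test, preds)` after the prediction cell should give the corresponding number for the iris split.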

## LightGBM

First, we follow `sklearn_example.py` from the LightGBM repository ([link](https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/sklearn_example.py)). To run it, the files `regression.train` and `regression.test` must be in the current directory. Both can be downloaded [here](https://github.com/microsoft/LightGBM/tree/master/examples/regression).

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt
import lightgbm as lgb

In [None]:
# load or create your dataset
df_train = pd.read_csv("regression.train", header=None, sep="\t")
df_test = pd.read_csv("regression.test", header=None, sep="\t")

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

In [None]:
# train
gbm = lgb.LGBMRegressor(num_leaves=31, learning_rate=0.05, n_estimators=20)
gbm.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric="l1", callbacks=[lgb.early_stopping(5)])
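
The `lgb.early_stopping(5)` callback stops training once the validation metric has not improved for 5 consecutive rounds, and the best round is then available as `gbm.best_iteration_`. The following plain-Python sketch mimics that rule on a made-up list of per-round validation scores. The function name and the scores are hypothetical, and the real callback handles more cases (several metrics and datasets, higher-is-better metrics, and so on).

```python
def run_with_early_stopping(scores, stopping_rounds=5):
    """Simulate early stopping for a lower-is-better metric.

    Returns (best_iteration, rounds_run), both counted from 1.
    """
    best_score = float("inf")
    best_iter = 0
    for i, score in enumerate(scores, start=1):
        if score < best_score:
            best_score, best_iter = score, i
        elif i - best_iter >= stopping_rounds:
            # No improvement for `stopping_rounds` rounds: stop here.
            return best_iter, i
    return best_iter, len(scores)


# Hypothetical validation scores, one per boosting round
val_scores = [0.90, 0.80, 0.75, 0.76, 0.77, 0.78, 0.79, 0.80, 0.60]

best_iter, rounds_run = run_with_early_stopping(val_scores)
print(f"Best iteration: {best_iter}, stopped after round {rounds_run}")
```

Note that the ninth score is never seen: training stops at round 8, five rounds after the best score at round 3. This is why the prediction cell passes `num_iteration=gbm.best_iteration_`.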

In [None]:
# predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)

In [None]:
# evaluate
rmse_test = mean_squared_error(y_test, y_pred) ** 0.5
print(f"The RMSE of prediction is: {rmse_test}")

In [None]:
# feature importances
print(f"Feature importances: {list(gbm.feature_importances_)}")

In [None]:
# self-defined eval metric, which here is root mean squared logarithmic error (RMSLE)
# f(y_true: array, y_pred: array) -> name: str, eval_result: float, is_higher_better: bool
def rmsle(y_true, y_pred):
    return "RMSLE", np.sqrt(np.mean(np.power(np.log1p(y_pred) - np.log1p(y_true), 2))), False
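
To check our understanding of the metric, recall that RMSLE is $\sqrt{\frac{1}{n}\sum_{i}\left(\log(1+\hat{y}_i) - \log(1+y_i)\right)^2}$. Below is a small plain-Python version (using `math` instead of NumPy, with a hypothetical function name) applied to values chosen so that the result is easy to verify by hand. Keep in mind that `log1p` is only defined for inputs greater than -1, so the metric assumes targets and predictions in that range.

```python
import math


def rmsle_plain(y_true, y_pred):
    """Root mean squared logarithmic error for two equal-length sequences."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# Values chosen so that log1p gives round numbers:
# log1p(e - 1) = 1 and log1p(e^2 - 1) = 2
y_true_demo = [math.expm1(1.0), math.expm1(2.0)]
y_pred_demo = [math.expm1(2.0), math.expm1(2.0)]

# Log errors are [1, 0], so RMSLE = sqrt((1 + 0) / 2) = sqrt(0.5)
print(f"RMSLE: {rmsle_plain(y_true_demo, y_pred_demo):.4f}")
```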

In [None]:
# train with custom eval function
gbm.fit(X_train, y_train, eval_set=[(X_test, y_test)], eval_metric=rmsle, callbacks=[lgb.early_stopping(5)])

# predict
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration_)

# evaluate
rmsle_test = rmsle(y_test, y_pred)[1]
print(f"The RMSLE of prediction is: {rmsle_test}")

In [None]:
# find optimal parameters using GridSearchCV

estimator = lgb.LGBMRegressor(num_leaves=31)

param_grid = {"learning_rate": [0.01, 0.1, 1], "n_estimators": [20, 40]}

gbm = GridSearchCV(estimator, param_grid, cv=3)
gbm.fit(X_train, y_train)

print(f"Best parameters found by grid search are: {gbm.best_params_}")
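
It helps to know how much work the grid search above does. `GridSearchCV` tries every combination in `param_grid` and evaluates each one with `cv`-fold cross-validation. The sketch below enumerates the combinations with `itertools.product` and counts the resulting model fits; it uses no LightGBM code and is only meant to illustrate the bookkeeping.

```python
from itertools import product

param_grid = {"learning_rate": [0.01, 0.1, 1], "n_estimators": [20, 40]}
cv = 3

# Every combination of one value per parameter
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

for candidate in candidates:
    print(candidate)

n_cv_fits = len(candidates) * cv
print(f"{len(candidates)} candidates x {cv} folds = {n_cv_fits} cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` then trains one more model on the full training set using the best combination, which is what `gbm.predict` would use afterwards.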

Next, we follow `plot_example.py` from the same repository ([link](https://github.com/microsoft/LightGBM/blob/master/examples/python-guide/plot_example.py)). You may also wish to look at the [Python Quick Start](https://lightgbm.readthedocs.io/en/latest/Python-Intro.html) in the LightGBM docs.

In [None]:
# load or create your dataset
df_train = pd.read_csv("regression.train", header=None, sep="\t")
df_test = pd.read_csv("regression.test", header=None, sep="\t")

y_train = df_train[0]
y_test = df_test[0]
X_train = df_train.drop(0, axis=1)
X_test = df_test.drop(0, axis=1)

In [None]:
# create dataset for lightgbm
lgb_train = lgb.Dataset(X_train, y_train)
lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train)

In [None]:
# specify your configurations as a dict
params = {"num_leaves": 5, "metric": ("l1", "l2"), "verbose": 0}

evals_result = {}  # to record eval results for plotting
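
In the `params` dict above, the metric names `"l1"` and `"l2"` refer to the mean absolute error and the mean squared error, respectively. The following plain-Python sketch computes both on made-up values (the function names and data are hypothetical) to make the difference concrete.

```python
def l1_metric(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def l2_metric(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical targets and predictions
y_true_demo = [1.0, 0.0, 1.0, 0.0]
y_pred_demo = [0.5, 0.5, 1.0, 0.0]

print(f"l1 (MAE): {l1_metric(y_true_demo, y_pred_demo)}")
print(f"l2 (MSE): {l2_metric(y_true_demo, y_pred_demo)}")
```

Because errors are squared, `l2` penalizes large errors more heavily than `l1`, which is why the two curves recorded during training can behave differently.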

In [None]:
# train
gbm = lgb.train(
    params,
    lgb_train,
    num_boost_round=100,
    valid_sets=[lgb_train, lgb_test],
    feature_name=[f"f{i + 1}" for i in range(X_train.shape[-1])],
    categorical_feature=[21],
    callbacks=[lgb.log_evaluation(10), lgb.record_evaluation(evals_result)],
)

In [None]:
# predict on the test set
y_test_pred = gbm.predict(X_test)

In [None]:
# Plot metrics recorded during training
ax = lgb.plot_metric(evals_result, metric="l1")
plt.show()

In [None]:
# Plot feature importances
ax = lgb.plot_importance(gbm, max_num_features=10)
plt.show()

In [None]:
# Plot split value histogram
ax = lgb.plot_split_value_histogram(gbm, feature="f26", bins="auto")
plt.show()

In [None]:
# Plot the 54th tree (tree_index is zero-based), which uses the categorical feature to split
ax = lgb.plot_tree(gbm, tree_index=53, figsize=(15, 15), show_info=["split_gain"])
plt.show()