# LightGBM deep dive

## Why LightGBM?

- Tree boosting works well:
    - Trees usually work well on tabular data
      - [Why do tree-based models still outperform deep learning on tabular data?](https://arxiv.org/abs/2207.08815)
      - [When Do Neural Nets Outperform Boosted Trees on Tabular Data?](https://arxiv.org/abs/2305.02997)
      - [Machine Learning Challenge Winning Solutions](https://github.com/microsoft/LightGBM/tree/master/examples#machine-learning-challenge-winning-solutions)
    - Not that many parameters to tune:
      - Tree height
      - Number of trees
      - Learning rate
- Added benefits of LightGBM:
    - Fast
    - Supports categorical variables
    - Handles missing data
    - Can output prediction intervals
    - No need to do feature selection
    - Provides feature importance
    - Works for regression, classification, ranking

## Objective recap

- https://www.kaggle.com/code/prashant111/lightgbm-classifier-in-python

Let's load the taxi trip duration dataset we produced in the previous notebook.

In [118]:
import pandas as pd

dataset = pd.read_pickle('../../data/taxi_trip_dataset.pkl')
dataset['hour'] = pd.Categorical(dataset['hour'])
dataset['weekday'] = pd.Categorical(dataset['weekday'])
dataset['pickup_cell_id'] = pd.Categorical(dataset['pickup_cell_id'])
dataset['dropoff_cell_id'] = pd.Categorical(dataset['dropoff_cell_id'])
dataset.head()


| | id | pickup_datetime | trip_duration | l1_distance | l2_distance | hour | weekday | avg_duration_per_hour_recent | avg_duration_per_weekday_recent | avg_duration_recent | pickup_cell_id | dropoff_cell_id | cell_pair_count | avg_duration_per_cell_pair | avg_duration_per_hour_per_cell_pair | avg_duration_per_weekday_per_cell_pair |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id0190469 | 2016-01-01 00:00:17 | 849 | 0.152939 | 0.118097 | 0 | 5 | | | | 2-7 | 6-18 | 18 | | | |
| 1 | id1665586 | 2016-01-01 00:00:53 | 1294 | 0.056721 | 0.040151 | 0 | 5 | 849.0 | 849.0 | 849.0 | 2-10 | 4-7 | 211 | | | |
| 2 | id1210365 | 2016-01-01 00:01:01 | 408 | 0.031929 | 0.022726 | 0 | 5 | 1071.5 | 1071.5 | 1071.5 | 4-15 | 5-17 | 233 | | | |
| 3 | id3888279 | 2016-01-01 00:01:14 | 280 | 0.01004 | 0.009103 | 0 | 5 | 850.333333 | 850.333333 | 850.333333 | 2-10 | 2-10 | 7392 | | | |
| 4 | id0924227 | 2016-01-01 00:01:20 | 736 | 0.03606 | 0.025557 | 0 | 5 | 707.75 | 707.75 | 707.75 | 3-11 | 2-10 | 11054 | | | |


Our goal is to predict the trip duration of a taxi ride in New York City given the pickup and dropoff locations. We will use LightGBM for this task. As with any supervised task, the goal is to learn a model on a training set and then make predictions on a test set.

In [119]:
is_test = dataset['pickup_datetime'].dt.month == 6
X_train = dataset.loc[~is_test].drop(columns=['id', 'trip_duration', 'pickup_datetime'])
y_train = dataset.loc[~is_test, 'trip_duration']
X_test = dataset.loc[is_test].drop(columns=['id', 'trip_duration', 'pickup_datetime'])
y_test = dataset.loc[is_test, 'trip_duration']


In [120]:
f"{len(X_train)=:,d}, {len(X_test)=:,d}"


'len(X_train)=1,219,608, len(X_test)=233,460'

Our goal is to fit a model on `(X_train, y_train)`, and then make predictions `y_pred` on `X_test`. We want to minimize the mean absolute error (MAE) between `y_pred` and `y_test`.

## Decision trees

https://mlu-explain.github.io/decision-tree/

## Gradient boosting (for trees)

Gradient boosting is a general purpose algorithm for regression and classification. It can be used with any differentiable loss function. It is an ensemble method that combines weak learners to create a strong learner. It works well with decision trees as weak learners, but it can also work with other types of models.

The idea is to iteratively fit a weak learner to the residuals of the current model, that is, the differences between the true values and the current predictions. The predictions of the weak learner are then added to the predictions of the current model, which nudges them towards the true values.

Let's assume we're doing regression. The algorithm is as follows:

*Initialize the model with a constant value*

$$\hat{y}^0 = \frac{1}{n} \sum_{i=1}^n y_i$$

$\hat{y}^0$ is a vector of length $n$ with all values equal to the mean of $y$.

*Calculate the residuals*

$$r^0 = y - \hat{y}^0$$

$r^0$ is a vector of length $n$, where each value $r^0_i$ is the difference between the true value $y_i$ and $\hat{y}^0_i$. In other words, for the squared loss, it's the negative gradient of the loss function with respect to the predictions of the previous model, up to a constant factor.
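
To make the link between residuals and gradients explicit, here is the one-line derivation for the squared loss, written for a single sample. The constant factor 2 is not important: it can be absorbed into the learning rate.

$$L(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2 \quad \Longrightarrow \quad -\frac{\partial L}{\partial \hat{y}_i} = 2 \, (y_i - \hat{y}_i) = 2 \, r_i$$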

*Fit a weak learner to the residuals*

$$\hat{f}^1 = \text{argmin}_f \sum_{i=1}^n L(y_i, \hat{y}^0_i + f(x_i))$$

$\hat{f}^1$ is a vector of length $n$ with the predictions of the weak learner. The weak learner is fitted to the residuals $r^0$. The weak learner learns to predict the residuals $r^0$ from the training features $x$. It therefore learns to correct the errors of the previous model.

*Update the model*

$$\hat{y}^1 = \hat{y}^0 + \gamma \times \hat{f}^1$$

$\hat{y}^1$ is a vector of length $n$ with the predictions of the model after the first iteration. The predictions of the weak learner are added to the predictions of the previous model. The learning rate $\gamma$ controls how much the predictions of the weak learner are added to the predictions of the previous model.

*Calculate the residuals*

$$r^1 = y - \hat{y}^1$$

$r^1$ is a vector of length $n$, where each value $r^1_i$ is the difference between the true value $y_i$ and $\hat{y}^1_i$.

*Fit a weak learner to the residuals*

$$\hat{f}^2 = \text{argmin}_f \sum_{i=1}^n L(y_i, \hat{y}^1_i + f(x_i))$$

$\hat{f}^2$ is a vector of length $n$ with the predictions of the weak learner. The weak learner is fitted to the residuals $r^1$. The weak learner learns to predict the residuals $r^1$ from the training features $x$.

*Update the model*

$$\hat{y}^2 = \hat{y}^1 + \gamma \times \hat{f}^2$$

$$\hat{y}^2 = \hat{y}^0 + \gamma \times \hat{f}^1 + \gamma \times \hat{f}^2$$

The idea is to keep doing this for a certain number of iterations. The more iterations, the more $\hat{y}$ will approach $y$. This is good because it means the model is learning. However, if we do too many iterations, the model will overfit the training data. It will learn the noise in the training data, and it will not generalize well to the test data. This is why we need to tune the number of iterations. We can do this with early stopping, which involves monitoring the performance of the model on a validation set. If the performance on the validation set does not improve for a certain number of iterations, we stop the training.

Let's see how this works with some code. We'll use a toy dataset.

In [121]:
import numpy as np
from sklearn import datasets
from sklearn import model_selection

# Generate the original dataset
X, y = datasets.make_regression(n_samples=1000, n_features=10, noise=0.2, random_state=42)

# Split the dataset into training and validation sets
X_fit, X_val, y_fit, y_val = model_selection.train_test_split(X, y, test_size=0.2, random_state=42)

# Introduce a different distribution in the validation set by adding noise
noise_factor = 1  # Adjust the noise factor as needed
X_val += np.random.normal(0, noise_factor, X_val.shape)


Here's the training loop:

In [122]:
from sklearn import tree

# Hyperparameters
tree_max_depth = 10
n_iterations = 30
learning_rate = 0.1

# Initialize model parameters
y_fit_pred = np.full(shape=len(y_fit), fill_value=y_fit.mean())

# Monitor loss values
fit_loss_values = []
val_loss_values = []

def l2_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def l2_loss_gradient(y_true, y_pred):
    return -2 * (y_true - y_pred)

def predict(X):
    y_pred = np.full(shape=len(X), fill_value=y_fit.mean())

    for weak_learner in weak_learners:
        y_pred += learning_rate * weak_learner.predict(X)

    return y_pred

weak_learners = []
for m in range(n_iterations):

    # We take the current predictions and measure the gap with the true values
    negative_gradients = -l2_loss_gradient(y_fit, y_fit_pred)

    # We fit a tree to predict the gap given the features
    weak_learner = tree.DecisionTreeRegressor(max_depth=tree_max_depth)
    weak_learner.fit(X_fit, y=negative_gradients)
    y_fit_pred += learning_rate * weak_learner.predict(X_fit)

    # We add the weak learner to the ensemble
    weak_learners.append(weak_learner)

    # We want to monitor the loss values
    fit_loss_values.append(l2_loss(y_fit, predict(X_fit)))  # predict(X_fit) = y_fit_pred
    val_loss_values.append(l2_loss(y_val, predict(X_val)))


In [123]:
import altair as alt

data = {
    'Iteration': np.arange(1, len(fit_loss_values) + 1),
    'Fit Loss': fit_loss_values,
    'Val Loss': val_loss_values
}

chart = alt.Chart(pd.DataFrame(data)).transform_fold(
    ['Fit Loss', 'Val Loss'],
    as_=['Loss Type', 'Loss Value']
).mark_circle().encode(
    x='Iteration:O',
    y='Loss Value:Q',
    color='Loss Type:N'
).properties(
    title='Fit and Validation Loss Over Iterations',
    width=600,
    height=400
)

chart.interactive()


That's it, we've implemented gradient boosting from scratch! Our implementation is quite basic. Here are some things LightGBM does that we haven't implemented:

- Predictive performance
  - The weak learners use the hessian as well as the gradient of the loss function.
  - The decision tree split finding algorithm is specifically designed for gradient boosting.
- Throughput
  - Gradient-based one-side sampling (GOSS): samples with small gradients can be randomly subsampled when training the weak learners.
  - The split finding algorithm uses a histogram approximation to speed up the computation.
  - Exclusive feature bundling: mutually exclusive sparse features are bundled together to speed up the computation.
  - Row and/or column subsampling can be used to fit each weak learner on a subset of the data.
- Scalability
  - The data can be distributed across multiple machines, and the learning happens in a distributed fashion.
  - The data can be stored on disk, and the learning happens out-of-core.
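
To give a feel for the histogram approximation mentioned in the list above, here is a minimal sketch for the squared loss. It is not LightGBM's actual algorithm: the function name `best_histogram_split` is made up, the bins are equal-width (a simplification), and the gain formula ignores the hessian and any regularization.

```python
import numpy as np

def best_histogram_split(feature, gradients, n_bins=16):
    """Find the bin edge that maximizes the gain of a single split.

    Sketch for squared loss only: the gain of a split is
    G_left^2 / n_left + G_right^2 / n_right - G^2 / n,
    where G is a sum of gradients (here, residuals) and n a sample count.
    """
    # Bucket the feature values into equal-width bins (a simplification).
    edges = np.linspace(feature.min(), feature.max(), n_bins + 1)
    bin_ids = np.clip(np.digitize(feature, edges[1:-1]), 0, n_bins - 1)

    # One pass over the data: gradient sums and sample counts per bin.
    grad_sums = np.bincount(bin_ids, weights=gradients, minlength=n_bins)
    counts = np.bincount(bin_ids, minlength=n_bins)

    total_grad, total_count = grad_sums.sum(), counts.sum()
    best_gain, best_threshold = 0.0, None
    left_grad, left_count = 0.0, 0

    # Only n_bins - 1 candidate thresholds are scanned, not one per sample.
    for b in range(n_bins - 1):
        left_grad += grad_sums[b]
        left_count += counts[b]
        right_count = total_count - left_count
        if left_count == 0 or right_count == 0:
            continue
        right_grad = total_grad - left_grad
        gain = (left_grad ** 2 / left_count
                + right_grad ** 2 / right_count
                - total_grad ** 2 / total_count)
        if gain > best_gain:
            best_gain, best_threshold = gain, edges[b + 1]

    return best_threshold, best_gain

# Synthetic data: the residuals jump from -1 to +1 around x = 4.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=1_000)
residuals = np.where(x < 4, -1.0, 1.0) + rng.normal(0, 0.1, size=1_000)
threshold, gain = best_histogram_split(x, residuals)
print(threshold, gain)
```

The returned threshold is a bin edge close to 4 rather than exactly 4. That's the trade-off: the scan costs one pass to build the histogram plus a loop over the bins, at the price of a coarser set of candidate splits.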

LightGBM's C++ code is too convoluted to understand. There is however [TinyGBT](https://github.com/lancifollia/tinygbt), which is implemented in Python and is much easier to understand. It's a good starting point if you want to understand how tree-based gradient boosting works. It is 200 lines of code and reaches the same predictive performance as LightGBM. It is of course much slower than LightGBM, but that's the price to pay for readability.

## How to use LightGBM

LightGBM provides a [scikit-learn API](https://lightgbm.readthedocs.io/en/stable/pythonapi/lightgbm.LGBMRegressor.html#lightgbm.LGBMRegressor).

In [141]:
import lightgbm as lgb

model = lgb.LGBMRegressor()

model = model.fit(X_train, y_train)


In [142]:
import datetime as dt
from sklearn import metrics

def evaluate(model):
    y_pred = model.predict(X_test)
    mae = metrics.mean_absolute_error(y_test, y_pred)
    print(dt.timedelta(seconds=mae))

evaluate(model)


0:03:27.731214


## Hyperparameters

Hyperparameters are described in the [documentation](https://lightgbm.readthedocs.io/en/stable/Parameters.html). The most important parameters are:

- `objective`: the loss function to optimize.
- `max_depth`: the maximum depth of each tree.
- `num_leaves`: the maximum number of leaves in each tree.
- `n_estimators`: the number of trees.
- `learning_rate`: the learning rate.
- `min_child_samples`: the minimum number of samples in each leaf.

### `objective`

The objective is the loss function to optimize. It's important to distinguish the loss function the model optimizes, from the metric we want to minimize. Indeed, a model can optimize a differentiable loss function, but we can still use a non-differentiable metric to evaluate the model.

In our case, we're looking to minimize the mean absolute error (MAE). We're in luck, because LightGBM can minimize the L1 loss function. The L1 loss function is equivalent to the mean absolute error. It's the sum of the absolute values of the residuals divided by the number of samples.
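
To see why the choice of loss matters, here is a small sketch with made-up durations, one of which is an outlier. It searches for the best constant prediction under each loss: the L1 loss is minimized by the median, whereas the L2 loss is minimized by the mean, which the outlier drags upwards.

```python
import numpy as np

# Made-up trip durations in seconds; the last one is an outlier.
durations = np.array([300.0, 420.0, 480.0, 600.0, 3600.0])

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Brute-force search over constant predictions, one per second.
candidates = np.arange(0.0, 4000.0, 1.0)
best_for_mae = candidates[np.argmin([mae(durations, c) for c in candidates])]
best_for_mse = candidates[np.argmin([mse(durations, c) for c in candidates])]

print(best_for_mae, np.median(durations))  # both are the median
print(best_for_mse, np.mean(durations))    # both are the mean
```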

In [143]:
model = lgb.LGBMRegressor(
    objective='regression_l1'
)

model = model.fit(X_train, y_train)
evaluate(model)


0:03:25.826597


### `max_depth`

The maximum depth of each tree is controlled by the `max_depth` parameter. The default value is `-1`, which means there is no maximum depth. In that case, a tree keeps growing until it reaches `num_leaves` leaves (31 by default), or until no leaf can be split without creating a child with fewer than `min_child_samples` samples.

In [156]:
model = lgb.LGBMRegressor(
    objective='regression_l1',
    max_depth=7
)

model = model.fit(X_train, y_train)
evaluate(model)


0:03:27.044919


In this case, the tree depth doesn't play a big role: the test MAE barely moves. This is likely because the train and test sets share the same distribution, so patterns learned on the training data carry over to the test data. In general, the tree depth is a good way to control the complexity of the model. The deeper the tree, the more complex the model, and the more likely it is to overfit the training data.

Note that controlling the tree depth is also a good way to control the training time. The deeper the tree, the longer it takes to train the model.

### `num_leaves`

LightGBM grows trees in a leaf-wise fashion: at each step, it splits the leaf that brings the largest loss reduction. This is different from many other tree-based implementations, which grow the tree level by level. LightGBM thus provides a `num_leaves` parameter, which caps the number of leaves in each tree. This is slightly different from the `max_depth` parameter, which caps the depth of each tree. Both can be used in combination to control the model's complexity.
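
The relationship between the two parameters comes down to simple arithmetic on binary trees. The sketch below uses hypothetical helper names and illustrative values: a depth limit of `d` allows at most `2 ** d` leaves, while a leaf-wise tree with a given number of leaves can be anywhere between perfectly balanced and a single long chain.

```python
import math

def max_leaves_at_depth(max_depth: int) -> int:
    """A binary tree of depth d has at most 2 ** d leaves."""
    return 2 ** max_depth

def depth_range_for_leaves(num_leaves: int) -> tuple[int, int]:
    """Shallowest (balanced) and deepest (chain) depth of a binary tree."""
    shallowest = math.ceil(math.log2(num_leaves))
    deepest = num_leaves - 1
    return shallowest, deepest

# Illustrative values only
print(max_leaves_at_depth(7))        # 128
print(depth_range_for_leaves(128))   # (7, 127)
print(depth_range_for_leaves(31))    # (5, 30)
```

This is why setting `num_leaves` without `max_depth` can produce very deep, unbalanced trees, and why `num_leaves` is usually kept below `2 ** max_depth` when both are set.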

In [154]:
model = lgb.LGBMRegressor(
    objective='regression_l1',
    num_leaves=2 ** 7
)

model = model.fit(X_train, y_train)
evaluate(model)


0:03:17.490448


### `n_estimators`

The number of trees is controlled by the `n_estimators` parameter. The default value is `100`:

In [157]:
model.n_estimators


100

The more trees, the more likely it is to overfit the training data. The more trees, the longer it takes to train the model.

In [158]:
model = lgb.LGBMRegressor(
    objective='regression_l1',
    num_leaves=2 ** 7,
    n_estimators=500
)

model = model.fit(X_train, y_train)
evaluate(model)


0:03:15.170267


### `learning_rate`

The learning rate controls how much of each weak learner's predictions is added to the predictions of the previous model. A higher learning rate lets the model fit the training data in fewer iterations, which can speed up the training. The flip side is that each step is coarser, so the model may overfit sooner.
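
Here is a back-of-the-envelope sketch of that effect. It assumes an idealized weak learner that predicts the current residual exactly, which real trees don't do, so the numbers only illustrate the trend. The function name is made up for the example.

```python
def remaining_residual(initial_residual, learning_rate, n_iterations):
    """Residual left after boosting with a perfect (idealized) weak learner."""
    residual = initial_residual
    for _ in range(n_iterations):
        # Only a fraction of the weak learner's prediction is added.
        residual -= learning_rate * residual
    return residual

slow = remaining_residual(100.0, learning_rate=0.1, n_iterations=10)
fast = remaining_residual(100.0, learning_rate=0.2, n_iterations=10)
print(slow)  # about 34.9
print(fast)  # about 10.7
```

Under this assumption the residual shrinks by a factor `(1 - learning_rate)` at each iteration, so doubling the learning rate gets much closer to the targets within the same number of trees.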

In [159]:
model.learning_rate


0.1

In [160]:
model = lgb.LGBMRegressor(
    objective='regression_l1',
    num_leaves=2 ** 7,
    n_estimators=500,
    learning_rate=0.2
)

model = model.fit(X_train, y_train)
evaluate(model)


0:03:16.098418


## Early stopping

Early stopping is a technique to prevent overfitting. It involves monitoring the performance of the model on a validation set. If the performance on the validation set does not improve for a certain number of iterations, we stop the training.

In [163]:
X_fit, X_val, y_fit, y_val = model_selection.train_test_split(
    X_train, y_train,
    test_size=0.2, random_state=42
)

model = lgb.LGBMRegressor(
    objective='regression_l1',
    num_leaves=2 ** 7,
    n_estimators=500,
    learning_rate=0.2
)

model = model.fit(
    X_fit, y_fit,
    eval_set=[
        (X_fit, y_fit),
        (X_val, y_val)
    ],
    eval_metric='l1',
    callbacks=[
        lgb.early_stopping(10),
        lgb.log_evaluation(period=50)
    ]
)




Training until validation scores don't improve for 10 rounds
[50]	training's l1: 175.537	valid_1's l1: 180.005
[100]	training's l1: 171.544	valid_1's l1: 178.254
[150]	training's l1: 169.717	valid_1's l1: 177.682
Early stopping, best iteration is:
[155]	training's l1: 169.623	valid_1's l1: 177.676


In [164]:
evaluate(model)


0:03:16.868874


As we can see, the test MAE is slightly worse than it was without early stopping. Early stopping doesn't guarantee that the model will generalize well to the test data. It only picks the number of iterations that works best on the validation data. The validation data is a sample of the training data. It's not the test data.

In practice, the main benefit of early stopping is to speed up the training. It allows us to stop the training early if the model is not improving. It's a good way to control the training time and thus iterate faster.
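
The stopping rule itself is simple. Below is a minimal sketch in plain Python, with made-up loss values and a hypothetical function name. It is not LightGBM's implementation, only the patience logic described in this section.

```python
def early_stopping_round(val_losses, patience):
    """Return (stop_round, best_round), both 1-indexed.

    Training stops once `patience` consecutive rounds fail to improve on
    the best validation loss seen so far.
    """
    best_loss = float("inf")
    best_round = 0
    for round_idx, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            return round_idx, best_round
    # The patience was never exhausted: training ran to the end.
    return len(val_losses), best_round

losses = [180.0, 178.3, 177.7, 177.9, 177.8, 178.0, 178.4]  # made-up values
stop_round, best_round = early_stopping_round(losses, patience=3)
print(stop_round, best_round)  # 6 3
```

The model kept for predictions is the one from `best_round`, not the one from `stop_round`: the rounds in between were only needed to confirm that the validation loss had stopped improving.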

## Cross-validated predictions

This is a technique that developed on Kaggle. The idea is to train the model on different slices of the training data. For each slice, we train the model on a subset of the training data, and we make predictions on the test set. We then average the predictions of all the models to get the final predictions. This is a form of model averaging, akin to bootstrap aggregation, or bagging for short. It reduces the variance of the predictions, and it can improve the predictive performance.

In [166]:
from sklearn import model_selection

cv = model_selection.KFold(n_splits=5, shuffle=True, random_state=42)

model = lgb.LGBMRegressor(
    objective='regression_l1',
    num_leaves=2 ** 7,
    n_estimators=500,
    learning_rate=0.2
)

y_pred = np.zeros(len(X_test))

for i, (fit_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    print(f'Fold {i + 1}')
    X_fit = X_train.iloc[fit_idx]
    y_fit = y_train.iloc[fit_idx]
    X_val = X_train.iloc[val_idx]
    y_val = y_train.iloc[val_idx]

    model = model.fit(
        X_fit, y_fit,
        eval_set=[
            (X_fit, y_fit),
            (X_val, y_val)
        ],
        eval_metric='l1',
        callbacks=[
            lgb.early_stopping(10),
            lgb.log_evaluation(period=50)
        ]
    )
    print('-' * 30)

    y_pred += model.predict(X_test)

y_pred /= cv.get_n_splits()


Fold 1




Training until validation scores don't improve for 10 rounds
[50]	training's l1: 175.465	valid_1's l1: 180.014
[100]	training's l1: 171.776	valid_1's l1: 178.54
[150]	training's l1: 170.165	valid_1's l1: 178.107
[200]	training's l1: 169.017	valid_1's l1: 177.866
Early stopping, best iteration is:
[220]	training's l1: 168.735	valid_1's l1: 177.841
Fold 2




Training until validation scores don't improve for 10 rounds
[50]	training's l1: 175.374	valid_1's l1: 180.956
[100]	training's l1: 171.25	valid_1's l1: 179.148
Early stopping, best iteration is:
[135]	training's l1: 170.174	valid_1's l1: 178.904
Fold 3




Training until validation scores don't improve for 10 rounds
[50]	training's l1: 175.415	valid_1's l1: 180.266
[100]	training's l1: 171.33	valid_1's l1: 178.538
[150]	training's l1: 169.938	valid_1's l1: 178.306
[200]	training's l1: 169.162	valid_1's l1: 178.228
Early stopping, best iteration is:
[191]	training's l1: 169.259	valid_1's l1: 178.226
Fold 4




Training until validation scores don't improve for 10 rounds
[50]	training's l1: 175.391	valid_1's l1: 180.426
[100]	training's l1: 171.153	valid_1's l1: 178.649
[150]	training's l1: 169.458	valid_1's l1: 178.174
Early stopping, best iteration is:
[154]	training's l1: 169.396	valid_1's l1: 178.171
Fold 5




Training until validation scores don't improve for 10 rounds
[50]	training's l1: 175.369	valid_1's l1: 180.708
[100]	training's l1: 171.036	valid_1's l1: 178.897
[150]	training's l1: 169.762	valid_1's l1: 178.71
Early stopping, best iteration is:
[177]	training's l1: 169.094	valid_1's l1: 178.578


In [167]:
mae = metrics.mean_absolute_error(y_test, y_pred)
print(dt.timedelta(seconds=mae))


0:03:14.785986


The downside of this technique is that it increases the training time because we have to train the model multiple times.
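
To build some intuition for why averaging helps, here is a toy simulation with synthetic numbers. It assumes five hypothetical models whose errors are independent, which is a best case: models trained on overlapping folds make correlated errors, so the real gain is much smaller, as the results above show.

```python
import numpy as np

rng = np.random.default_rng(42)
y_true = rng.uniform(200, 2000, size=10_000)  # synthetic durations in seconds

# Five hypothetical models: the truth plus independent noise.
fold_preds = [y_true + rng.normal(0, 100, size=y_true.shape) for _ in range(5)]

single_maes = [np.mean(np.abs(y_true - pred)) for pred in fold_preds]
averaged_mae = np.mean(np.abs(y_true - np.mean(fold_preds, axis=0)))

print(np.round(single_maes, 1))
print(round(averaged_mae, 1))
```

With independent errors, the averaged predictions have an MAE that is lower than that of every individual model.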

We can see that tuning the hyperparameters and averaging across folds reduces the MAE on the test set by roughly 13 seconds, from 3:27.7 with the default settings to 3:14.8. The feature engineering part managed to reduce the MAE on the test set by about 45 seconds. That's almost always the case in practice: feature engineering reaps more benefits than hyperparameter tuning.

We could do some hyperparameter tuning to tune the number of leaves, the learning rate, the number of estimators, etc. However, it's unlikely to improve the predictive performance by much. Also, this would be extremely costly if we want to do it with bagging. My advice is to spend a bit of time searching for "good enough" hyperparameters. Then, spend most of your time on feature engineering. I know this sounds a bit more like art than science, but that's the reality of machine learning in practice: we don't have unlimited time and resources to tune the hyperparameters.

## Feature importance

In [175]:
model = lgb.LGBMRegressor(
    objective='regression_l1',
    num_leaves=2 ** 5,
    n_estimators=100
)
model = model.fit(X_train, y_train)


In [176]:
pd.Series(model.feature_importances_, index=X_train.columns).sort_values(ascending=False)


dropoff_cell_id                           941
pickup_cell_id                            631
hour                                      423
l2_distance                               332
weekday                                   214
l1_distance                               131
cell_pair_count                           113
avg_duration_per_cell_pair                 96
avg_duration_recent                        95
avg_duration_per_weekday_recent            54
avg_duration_per_hour_recent               35
avg_duration_per_hour_per_cell_pair        23
avg_duration_per_weekday_per_cell_pair     12
dtype: int32

The default feature importance is based on the number of times a feature is used to split in the trees. It's not a great metric because it says nothing about how much each split reduces the loss. It also tends to favor features that offer many possible split points, such as categorical features with many levels.

LightGBM provides an alternative metric called gain. It's the total gain, meaning the total loss reduction, of all the splits that use the feature. It's usually more informative because a feature that is split on often, but only brings small improvements, no longer ranks first. Note that it's a raw total: it isn't divided by the number of times the feature is used in the trees.

In [184]:
model = lgb.LGBMRegressor(
    objective='regression_l1',
    num_leaves=2 ** 5,
    n_estimators=100,
    importance_type='gain'
)
model = model.fit(X_train, y_train)
(
    pd.Series(model.feature_importances_, index=X_train.columns)
    .sort_values(ascending=False)
    .map('{:.2f}'.format)
)


l2_distance                               4767436.50
dropoff_cell_id                            631979.84
hour                                       545120.92
avg_duration_per_cell_pair                 318701.27
pickup_cell_id                             313141.82
weekday                                    214134.00
avg_duration_recent                        159161.57
l1_distance                                 48904.12
avg_duration_per_weekday_recent             48715.68
cell_pair_count                             45376.78
avg_duration_per_hour_recent                32077.81
avg_duration_per_hour_per_cell_pair         30626.98
avg_duration_per_weekday_per_cell_pair       8136.40
dtype: object
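
The difference between the two importance types can be shown with a toy record of splits. The splits and gains below are invented for the example. Counting splits and summing gains can rank features very differently, which is what we observe above with `l2_distance`.

```python
from collections import Counter, defaultdict

# Hypothetical splits recorded across all trees: (feature used, loss reduction)
splits = [
    ("pickup_cell_id", 5.0),
    ("pickup_cell_id", 4.0),
    ("pickup_cell_id", 3.0),
    ("l2_distance", 120.0),
    ("hour", 10.0),
    ("hour", 8.0),
]

# "split" importance: how many times each feature is used.
split_importance = Counter(feature for feature, _ in splits)

# "gain" importance: total loss reduction brought by each feature.
gain_importance = defaultdict(float)
for feature, gain in splits:
    gain_importance[feature] += gain

top_by_split = split_importance.most_common(1)[0][0]
top_by_gain = max(gain_importance, key=gain_importance.get)
print(top_by_split)  # pickup_cell_id
print(top_by_gain)   # l2_distance
```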

## Quantiles

LightGBM can optimize different loss functions. In particular, it can optimize quantile loss functions. This allows us to output prediction intervals. The idea is to train a model for each quantile. For example, we can train a model for the 0.05 quantile, the 0.5 quantile, and the 0.95 quantile. We can then use these models to output prediction intervals.
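
For reference, the quantile loss is often called the pinball loss. Here is a minimal sketch on synthetic data, with a hypothetical `pinball_loss` helper, which checks that the constant minimizing the loss for a given `alpha` is close to the empirical quantile of the data.

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Under-predictions cost alpha per unit, over-predictions cost 1 - alpha."""
    diff = y_true - y_pred
    return np.mean(np.maximum(alpha * diff, (alpha - 1) * diff))

rng = np.random.default_rng(0)
y = rng.exponential(scale=600.0, size=20_000)  # synthetic, skewed "durations"

# Brute-force search over constant predictions, on a grid with a step of 5.
candidates = np.linspace(0.0, 3000.0, 601)
best = {
    alpha: candidates[np.argmin([pinball_loss(y, c, alpha) for c in candidates])]
    for alpha in (0.05, 0.5, 0.95)
}

for alpha, value in best.items():
    print(alpha, value, np.quantile(y, alpha))
```

With `alpha=0.5`, the pinball loss is half the absolute error, so the median model is equivalent to the `regression_l1` objective used earlier.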

In [185]:
quantiles = [0.05, 0.5, 0.95]
y_pred = pd.DataFrame(index=X_test.index, columns=quantiles)

for q in quantiles:
    model = lgb.LGBMRegressor(
        objective='quantile',
        alpha=q,
        num_leaves=2 ** 5,
        n_estimators=100
    )
    model = model.fit(X_train, y_train)
    y_pred[q] = model.predict(X_test)


In [186]:
y_pred.head()


| | 0.05 | 0.50 | 0.95 |
|---|---|---|---|
| 1219332 | 251.932303 | 399.497378 | 727.883352 |
| 1219333 | 647.351284 | 860.229473 | 1270.26767 |
| 1219334 | 596.019179 | 804.426819 | 1247.829629 |
| 1219335 | 916.289646 | 1163.106212 | 1984.173113 |
| 1219336 | 394.629415 | 578.263293 | 985.190387 |


We can verify how often the prediction intervals contain the true values. The interval between the 0.05 and 0.95 quantiles should contain the true value about 90% of the time.

In [190]:
in_ci = ((y_pred[0.05] <= y_test) & (y_test <= y_pred[0.95])).mean()
print(f'{in_ci:.2%}')


88.92%


Not bad!