<a href="https://colab.research.google.com/github/ArslanAmanov/AI-ML-DL/blob/default-branch/ML_models%20research%20notebooks/LGBM_research%20%26%20review.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LGBM Model Research

LightGBM (Light Gradient Boosting Machine) is a popular gradient boosting framework for machine learning tasks, especially on tabular data.
It was developed by Microsoft and is known for its efficiency, speed, and ability to handle large datasets.



# Gradient Boosting overview:

Gradient boosting is an ensemble learning technique used for both classification and regression tasks. It builds a series of decision trees sequentially, where each tree corrects the errors left by the trees before it.

# Step by Step process of gradient boosting:

1. Start with an initial prediction (usually the mean of the target values for regression, or the log-odds of the positive class rate for binary classification).
2. Calculate the residuals (the difference between the actual and predicted values) for each data point. For a general loss function, these are the negative gradients of the loss with respect to the current predictions.
3. Fit a decision tree to the residuals. This tree is often referred to as a "weak learner" because it is a simple model.
4. Add the predictions from the new tree, usually scaled by a learning rate, to the previous predictions. This updates the model's predictions.
5. Repeat steps 2-4 for a specified number of iterations or until a predefined stopping criterion is met.
6. The final ensemble model is the initial prediction plus the sum of the (scaled) predictions from the individual trees.
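
The steps above can be sketched in a few lines of NumPy. This is only an illustration for squared-error regression on a single feature, using depth-1 trees ("stumps") as weak learners. The helper names `fit_stump` and `gradient_boost`, the learning rate of 0.1, and the synthetic sine data are assumptions made for the example; LightGBM's actual implementation is considerably more elaborate.

```python
import numpy as np


def fit_stump(x, residuals):
    """Fit a depth-1 regression tree (a stump) on a single feature."""
    best = None
    # Try every split point that leaves at least one sample on each side
    for threshold in np.unique(x)[:-1]:
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        # Squared error when each side predicts its own mean residual
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda z: np.where(z <= threshold, left_value, right_value)


def gradient_boost(x, y, n_rounds=50, learning_rate=0.1):
    base = y.mean()                            # step 1: initial prediction
    pred = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):                  # step 5: repeat
        residuals = y - pred                   # step 2: residuals
        stump = fit_stump(x, residuals)        # step 3: fit a weak learner
        pred += learning_rate * stump(x)       # step 4: update the predictions
        stumps.append(stump)
    # step 6: base value plus the sum of the scaled tree outputs
    return lambda z: base + learning_rate * sum(s(z) for s in stumps)


# Synthetic data, assumed purely for illustration
rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

model = gradient_boost(x, y)
mse_baseline = np.mean((y - y.mean()) ** 2)    # error of the initial prediction alone
mse_boosted = np.mean((y - model(x)) ** 2)     # training error after boosting
print(mse_baseline, mse_boosted)
```

The training error of the boosted model should be lower than that of the constant baseline, because each round fits the remaining residuals.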

# LGBM Features:
LightGBM offers several features and optimizations that make it stand out:

1. Gradient Boosting with Histogram-Based Learning: LGBM uses histogram-based learning, which groups continuous feature values into discrete bins. This speeds up training and reduces memory usage compared with algorithms that evaluate every distinct feature value as a split candidate.
2. Leaf-Wise Growth: LGBM uses a leaf-wise tree growth strategy rather than a level-wise strategy, which can reach a lower loss with the same number of leaves. Leaf-wise growth can overfit on small datasets, so tree complexity is usually limited with parameters such as num_leaves and max_depth.
3. Gradient-Based One-Side Sampling (GOSS): GOSS is a technique used to subsample the data points during training. It keeps the data points with large gradients (which contribute more to the information gain) and randomly samples those with small gradients. This speeds up training, usually with little loss of accuracy.
4. Exclusive Feature Bundling (EFB): LGBM bundles mutually exclusive features, that is, features that rarely take non-zero values at the same time, into a single feature. This reduces the effective number of features on sparse data and speeds up training.
5. Regularization: LGBM provides options for L1 and L2 regularization to prevent overfitting.
6. Categorical Feature Support: LGBM can handle categorical features directly without one-hot encoding, making it more memory-efficient.
7. Parallel and GPU Learning: LGBM is highly optimized for parallel processing and can leverage GPUs for even faster training.
8. Early Stopping: You can use early stopping to halt training when the model's performance on a validation dataset plateaus or worsens, preventing overfitting.
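
To make the sampling idea in item 3 more concrete, here is a rough NumPy sketch of a GOSS-style subsample. It keeps the rows with the largest absolute gradients, randomly samples some of the remaining rows, and up-weights the sampled rows to compensate for those that were dropped. The function `goss_sample`, its arguments, the chosen rates, and the random gradients are assumptions for illustration only and are not LightGBM's internal code.

```python
import numpy as np


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (indices, weights) of the rows kept by a GOSS-style subsample."""
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    # Sort rows by absolute gradient, largest first
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]      # large-gradient rows are always kept
    rest_idx = order[n_top:]     # small-gradient rows are candidates for sampling

    # Randomly sample from the small-gradient rows, without replacement
    rng = np.random.default_rng(seed)
    other_idx = rng.choice(rest_idx, size=n_other, replace=False)

    # Up-weight the sampled small-gradient rows to compensate for the dropped ones
    weights = np.ones(n_top + n_other)
    weights[n_top:] = (1.0 - top_rate) / other_rate

    return np.concatenate([top_idx, other_idx]), weights


# Random gradients, assumed purely for illustration
gradients = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(gradients)
print(len(idx), w.sum())
```

With these rates, 300 of the 1000 rows are kept, and the weights sum to roughly the original row count, so the sampled small-gradient rows stand in for the ones that were left out.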


# Install

The preferred way to install LightGBM is via pip:

In [1]:
! pip install lightgbm



In [2]:
# To verify your installation, try importing the package
import lightgbm as lgb

# Data Interface
The LightGBM Python module can load data from:

- LibSVM (zero-based) / TSV / CSV format text file
- NumPy 2D array(s), pandas DataFrame, H2O DataTable’s Frame, SciPy sparse matrix
- LightGBM binary file
- LightGBM Sequence object(s)

The data is stored in a Dataset object.


Many of the examples on this page use functionality from NumPy. To run the examples, be sure to import numpy in your session.

In [3]:
import numpy as np

To load a LibSVM (zero-based) text file or a LightGBM binary file into a Dataset:

In [4]:
train_data = lgb.Dataset('train.svm.bin')

# To load a numpy array into Dataset:

In [5]:
data = np.random.rand(500, 10)  # 500 entities, each contains 10 features
label = np.random.randint(2, size=500)  # binary target
train_data = lgb.Dataset(data, label=label)

# To load a scipy.sparse.csr_matrix array into Dataset:

In [8]:
import scipy.sparse
# Wrap the dense array from above in CSR format (real data would usually already be sparse)
csr = scipy.sparse.csr_matrix(data)
train_data = lgb.Dataset(csr, label=label)

# Load from Sequence objects:

We can implement the Sequence interface to read binary files. The following example shows how to read an HDF5 file with h5py.

In [None]:
import h5py

class HDFSequence(lgb.Sequence):
    def __init__(self, hdf_dataset, batch_size):
        self.data = hdf_dataset
        self.batch_size = batch_size

    def __getitem__(self, idx):
        return self.data[idx]

    def __len__(self):
        return len(self.data)

f = h5py.File('train.hdf5', 'r')
train_data = lgb.Dataset(HDFSequence(f['X'], 8192), label=f['Y'][:])

# Saving Dataset into a LightGBM binary file will make loading faster:

In [None]:
train_data = lgb.Dataset('train.svm.txt')
train_data.save_binary('train.bin')

# Create validation data:

In [None]:
validation_data = train_data.create_valid('validation.svm')

In [None]:
# or
validation_data = lgb.Dataset('validation.svm', reference=train_data)
# In LightGBM, the validation data should be aligned with the training data.

# Specifying feature names and categorical features:

In [None]:
# feature_name needs one entry per column of data (10 columns here)
train_data = lgb.Dataset(data, label=label, feature_name=[f'c{i}' for i in range(1, 11)], categorical_feature=['c3'])

LightGBM can use categorical features as input directly. They do not need to be one-hot encoded, and training on them directly is much faster than training on one-hot encoded features (about an 8x speed-up).

Note: You should convert your categorical features to int type before you construct Dataset.
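
As one possible way to do this conversion, the sketch below maps a string-valued column to integer codes with NumPy. The `colors` array is a made-up example column; with pandas data, an equivalent integer encoding could be produced in other ways.

```python
import numpy as np

# Hypothetical string-valued categorical column
colors = np.array(['red', 'green', 'blue', 'green', 'red'])

# np.unique returns the sorted categories and, for each row, its index into them
categories, codes = np.unique(colors, return_inverse=True)
codes = codes.astype(np.int32)

print(categories)  # ['blue' 'green' 'red']
print(codes)       # [2 1 0 1 2]
```

The `codes` array can then be used as a column of the feature matrix, and `categories` kept to map the codes back to the original labels.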

# Weights can be set when needed:

In [None]:
w = np.random.rand(500, )
train_data = lgb.Dataset(data, label=label, weight=w)

In [None]:
# or
train_data = lgb.Dataset(data, label=label)
w = np.random.rand(500, )
train_data.set_weight(w)

# Setting parameters

Booster parameters:

In [None]:
param = {'num_leaves': 31, 'objective': 'binary'}
param['metric'] = 'auc'

In [None]:
# You can also specify multiple eval metrics:
param['metric']=['auc', 'binary_logloss']

# Training
Training a model requires a parameter dictionary and a dataset:

In [None]:
num_round = 10
bst = lgb.train(param, train_data, num_round, valid_sets=[validation_data])

After training, the model can be saved:

In [None]:
bst.save_model('model.txt')

The trained model can also be dumped to JSON format:

In [None]:
json_model = bst.dump_model()

A saved model can be loaded:

In [None]:
bst = lgb.Booster(model_file='model.txt')  # init model

# Early Stopping
If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in valid_sets. If there is more than one, it will use all of them except the training data:

In [None]:
bst = lgb.train(param, train_data, num_round, valid_sets=[validation_data], callbacks=[lgb.early_stopping(stopping_rounds=5)])
bst.save_model('model.txt', num_iteration=bst.best_iteration)

The model will train until the validation score stops improving. The validation score needs to improve at least once every stopping_rounds rounds for training to continue.

The index of the best-performing iteration is saved in the best_iteration field when early stopping is enabled through the early_stopping callback. Note that train() will return a model from the best iteration.

This works with both metrics to minimize (L2, log loss, etc.) and to maximize (NDCG, AUC, etc.). Note that if you specify more than one evaluation metric, all of them will be used for early stopping. However, you can change this behavior and make LightGBM check only the first metric for early stopping by passing first_metric_only=True in early_stopping callback constructor.
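
The stopping rule described above can be illustrated with a small pure-Python sketch for a single metric. The function `best_iteration_with_patience` and the score lists are hypothetical and only mimic the idea of "stop once there has been no improvement for stopping_rounds rounds"; they do not reproduce the callback's actual code.

```python
def best_iteration_with_patience(scores, stopping_rounds, higher_is_better=False):
    """Return (best_iteration, stopped_at) for a sequence of validation scores.

    Iterations are 1-based. stopped_at is the iteration at which training halts.
    """
    best_score = None
    best_iter = 0
    for i, score in enumerate(scores, start=1):
        improved = (
            best_score is None
            or (score > best_score if higher_is_better else score < best_score)
        )
        if improved:
            best_score, best_iter = score, i
        elif i - best_iter >= stopping_rounds:
            # No improvement during the last stopping_rounds iterations
            return best_iter, i
    return best_iter, len(scores)


# Made-up validation log loss per boosting round (lower is better)
log_loss = [0.69, 0.60, 0.55, 0.53, 0.54, 0.535, 0.54, 0.55, 0.56, 0.52]
best, stopped = best_iteration_with_patience(log_loss, stopping_rounds=5)
print(best, stopped)  # 4 9
```

In this made-up run the best score occurs at iteration 4, and training stops at iteration 9 after five rounds without improvement, so the better score at iteration 10 is never reached. For a metric such as AUC, pass `higher_is_better=True`.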

In [None]:
# If early stopping is enabled during training, you can get predictions from the best iteration with bst.best_iteration:
ypred = bst.predict(data, num_iteration=bst.best_iteration)