# Kaggler's Guide to LightGBM Hyperparameter Tuning with Optuna in 2021
## Maximize LGBM's performance
![](images/pixabay.jpg)
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@pixabay?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pixabay</a>
        on 
        <a href='https://www.pexels.com/photo/silhouette-of-person-holding-sparkler-digital-wallpaepr-266429/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels.</a> All images are by the author unless specified otherwise.
    </strong>
</figcaption>

# Setup

```python
import logging
import time

import catboost as cb
import joblib
import lightgbm as lgbm
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
import seaborn as sns
import xgboost as xgb
from optuna.samplers import TPESampler
from sklearn.compose import (
    ColumnTransformer,
    make_column_selector,
    make_column_transformer,
)
from sklearn.impute import SimpleImputer
from sklearn.metrics import log_loss, mean_squared_error
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    cross_validate,
    train_test_split,
)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%d-%b-%y %H:%M:%S", level=logging.INFO
)
optuna.logging.set_verbosity(optuna.logging.WARNING)
```

# Introduction

In the previous article, we talked about the basics of LightGBM and creating LGBM models that beat XGBoost in almost every aspect. This article focuses on the last stage of any machine learning project - hyperparameter tuning (if we don't include model ensembling). 

First, we will take a look at the most important LGBM hyperparameters, grouped by their impact level and area. Then, we will see a hands-on example of tuning LGBM parameters using Optuna - the next-generation Bayesian hyperparameter tuning framework.

Most importantly, we will do this in a way similar to how top Kagglers tune their LGBM models to achieve impressive results.

> I highly suggest reading the first part of the article if you are new to LGBM. Although I will briefly explain how Optuna works, I also recommend reading my separate post on it to get the most out of this article.

https://towardsdatascience.com/why-is-everyone-at-kaggle-obsessed-with-optuna-for-hyperparameter-tuning-7608fdca337c?source=your_stories_page-------------------------------------


# Overview of the most important parameters

Generally, hyperparameters of most tree-based models can be grouped into 4 categories:
1. Parameters that affect the structure and learning of the decision trees
2. Parameters that affect the training speed
3. Parameters for better accuracy
4. Parameters to combat overfitting

Most of the time, these categories overlap heavily, and a gain in one may come at the cost of a loss in another. That's why tuning them manually is tedious and error-prone, and is best avoided.

Frameworks like Optuna can find the "sweet medium" between these categories automatically if given a good enough parameter grid (and yes, we will develop this grid for LGBM today).

# Hyperparameters that control the tree structure

> If you are not familiar with decision trees, check out [this legendary video](https://www.youtube.com/watch?v=_L39rN6gz7Y) by StatQuest.

In LGBM, the most important parameter to control the tree structure is `num_leaves`. As the name suggests, it controls the maximum number of leaves in a single tree. A leaf is a terminal node of the tree, where the 'actual decision' (the prediction) happens.

Next is `max_depth`, which controls the tree depth. The higher `max_depth`, the more levels the tree has, which makes it more complex and prone to overfitting. Set it too low and you will underfit. Even though it sounds complex, it is the easiest parameter to tune - just choose a value between 3 and 12 (this range tends to work well on Kaggle for most datasets).

Tuning `num_leaves` can also be easy once you determine `max_depth`. There is a simple rule given in the LGBM documentation - `num_leaves` should not exceed `2^(max_depth)`, the number of leaves in a full binary tree of that depth. With `max_depth` between 3 and 12, this upper limit ranges from $2^3$ to $2^{12}$, so the optimal value for `num_leaves` lies somewhere within (8, 4096).
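To make the bound concrete, here is a small sketch of the arithmetic. The helper name `max_num_leaves` is my own and not part of LightGBM; it only evaluates `2^(max_depth)` for the depth range suggested above.

```python
def max_num_leaves(max_depth: int) -> int:
    """Leaf count of a full binary tree of the given depth (hypothetical helper)."""
    if max_depth < 1:
        raise ValueError("max_depth must be a positive integer")
    return 2 ** max_depth


# Depth range suggested in the text: 3 to 12 inclusive.
bounds = {depth: max_num_leaves(depth) for depth in range(3, 13)}

for depth, limit in bounds.items():
    print(f"max_depth={depth:>2} -> num_leaves at most {limit}")
```

The smallest and largest entries reproduce the (8, 4096) range quoted above.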

However, `num_leaves` impacts the learning in LGBM more than `max_depth`, because LGBM grows trees leaf-wise rather than level by level. This means you need to specify a more conservative search range like (20, 3000) - that's what I mostly do.

Another important structural parameter for a tree is `min_data_in_leaf`. Its magnitude also affects whether you overfit or not. In simple terms, `min_data_in_leaf` specifies the minimum number of observations that must end up in a leaf.

For example, if a split checks whether one feature is greater than, let's say, 13 - setting `min_data_in_leaf` to 100 means we accept this split only if at least 100 training observations land on each side of it. This is the gist in my lay terms.
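A rough sketch of the idea in plain Python: a split is kept only when each resulting leaf holds at least `min_data_in_leaf` rows. This is not how LightGBM implements it internally (its split finding is histogram-based and more involved); the function `split_allowed` and the toy data are made up for illustration.

```python
from typing import Sequence


def split_allowed(
    values: Sequence[float], threshold: float, min_data_in_leaf: int
) -> bool:
    """Return True if splitting on `value > threshold` leaves enough rows on both sides."""
    n_right = sum(1 for v in values if v > threshold)
    n_left = len(values) - n_right
    return n_left >= min_data_in_leaf and n_right >= min_data_in_leaf


# Toy feature: 1000 rows with values 0..999 (illustrative data only).
feature = list(range(1000))

# Rejected: only 14 rows (values 0..13) fall on the left side.
print(split_allowed(feature, threshold=13, min_data_in_leaf=100))

# Accepted: 501 rows on the left, 499 on the right.
print(split_allowed(feature, threshold=500, min_data_in_leaf=100))
```

Raising `min_data_in_leaf` rejects more of these lopsided splits, which is why larger values act as a brake on overfitting.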

The optimal value for `min_data_in_leaf` depends on the number of training samples and `num_leaves`. For large datasets, set a value in the hundreds or thousands.
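Putting the three structural parameters together, a search space for Optuna might look like the sketch below. The ranges for `max_depth` and `num_leaves` come from the text; the range for `min_data_in_leaf` and both step sizes are my assumptions and should be scaled to your dataset size. The function name `suggest_tree_params` is hypothetical. It relies only on `trial.suggest_int`, which is part of Optuna's `Trial` API.

```python
def suggest_tree_params(trial) -> dict:
    """Sample tree-structure parameters from an Optuna trial (illustrative ranges)."""
    return {
        # Range from the text: depths between 3 and 12.
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        # Conservative range from the text, rather than the full (8, 4096).
        # The step of 20 is an assumption to keep the space coarse.
        "num_leaves": trial.suggest_int("num_leaves", 20, 3000, step=20),
        # Assumed range: hundreds to thousands, suitable for a large dataset.
        "min_data_in_leaf": trial.suggest_int(
            "min_data_in_leaf", 100, 10000, step=100
        ),
    }
```

Inside an objective function, you would call `suggest_tree_params(trial)` and pass the resulting dictionary on to the LightGBM model before scoring it with cross-validation.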

# Optuna, creating the grid

# Creating an Optuna study and running trials

# Visualize the results