# 03 Model Training and Tuning

In this notebook, we fine-tune our LightGBM model with Optuna. The steps are:
1. Loading preprocessed data.
2. Loading configuration (the list of features to use) from our YAML file.
3. Optionally performing an out-of-time split.
4. Running hyperparameter optimization via Optuna.
5. Saving the best model to the `artifacts/` folder.
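
Step 5 can be sketched as follows. This is a minimal illustration using the standard-library `pickle` module, chosen only because the output file used in this notebook ends in `.pkl`; how `finetune_model` actually persists the model is not shown here. The `save_artifact` helper is hypothetical, and a plain dictionary stands in for a fitted model so the snippet runs without LightGBM.

```python
import os
import pickle
import tempfile


def save_artifact(obj, artifacts_dir, filename):
    """Pickle `obj` into `artifacts_dir`, creating the folder if needed."""
    os.makedirs(artifacts_dir, exist_ok=True)
    path = os.path.join(artifacts_dir, filename)
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)
    return path


# Stand-in for a fitted model (assumption: any picklable object is handled the same way).
stand_in_model = {"max_depth": 5, "num_leaves": 19}

# A temporary directory keeps the example away from the real artifacts/ folder.
with tempfile.TemporaryDirectory() as tmp:
    saved_path = save_artifact(
        stand_in_model, os.path.join(tmp, "artifacts"), "best_model_1.pkl"
    )
    with open(saved_path, "rb") as fh:
        restored = pickle.load(fh)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.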

In [1]:
import os
import sys

import pandas as pd
import yaml

In [2]:
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', 'src')))

In [3]:
from models.finetune_model import finetune_model
from data.split_data import split_data_by_time

In [4]:
data_path = os.path.join(os.getcwd(), "..", "data", "processed", "processed_data.csv")
df = pd.read_csv(data_path)
print("Original data shape:", df.shape)

Original data shape: (150000, 46)


In [5]:
constants_path = os.path.join(os.getcwd(), "..", "src", "utils", "constants.yaml")
with open(constants_path, "r") as file:
    constants = yaml.safe_load(file)
FEATURES_TO_USE_AFTER_SELECTION = constants["FEATURES_TO_USE_AFTER_SELECTION"]
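
Before using the configured feature list to index the DataFrame, it can help to confirm that every listed feature is actually present, since selecting columns with a missing label raises a `KeyError`. The sketch below is one way to do that check. `missing_features` is a hypothetical helper, and the column and feature names are made up for illustration.

```python
def missing_features(available_columns, selected_features):
    """Return the selected features absent from the data, in config order."""
    available = set(available_columns)
    return [name for name in selected_features if name not in available]


# Hypothetical column and feature names, for illustration only.
columns = ["age", "income", "tenure_months", "target"]
selected = ["age", "tenure_months", "credit_score"]

missing = missing_features(columns, selected)
print(missing)  # ['credit_score']
```

In this notebook the equivalent call would be `missing_features(df.columns, FEATURES_TO_USE_AFTER_SELECTION)`, which should return an empty list.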

In [6]:
X_train, X_test, y_train, y_test = split_data_by_time(
    df
)
# Filter training set to keep only the selected features.
X_train = X_train[FEATURES_TO_USE_AFTER_SELECTION]
print("Training set shape:", X_train.shape)

Training set shape: (90063, 16)
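
The split above leaves 90,063 of the 150,000 rows (about 60%) in the training set. The implementation of `split_data_by_time` is not shown in this notebook, so the following is only a rough sketch of what an out-of-time split typically looks like: rows before a cutoff date form the training set, and later rows form the test set. The function name `out_of_time_split`, the column names `event_date` and `target`, and the cutoff date are all assumptions made for illustration.

```python
import pandas as pd


def out_of_time_split(df, time_col, target_col, cutoff):
    """Rows before `cutoff` go to train; rows at or after it go to test."""
    is_train = df[time_col] < pd.Timestamp(cutoff)
    train, test = df[is_train], df[~is_train]
    X_train = train.drop(columns=[target_col])
    X_test = test.drop(columns=[target_col])
    return X_train, X_test, train[target_col], test[target_col]


# Toy data with hypothetical column names.
toy = pd.DataFrame(
    {
        "event_date": pd.to_datetime(
            ["2024-03-01", "2024-01-01", "2024-05-01", "2024-02-01", "2024-04-01"]
        ),
        "feature_a": [3, 1, 5, 2, 4],
        "target": [0, 1, 0, 1, 0],
    }
)

X_tr, X_te, y_tr, y_te = out_of_time_split(toy, "event_date", "target", "2024-04-01")
print(X_tr.shape, X_te.shape)  # (3, 2) (2, 2)
```

Unlike a random split, this keeps every test row strictly later than every training row, which mimics how the model will be used on future data.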


In [7]:
best_model, best_params = finetune_model(
    X_train, y_train, model_output_name="best_model_1.pkl", n_trials=50
)
print("Best hyperparameters found:")
print(best_params)

[I 2025-02-17 22:06:13,100] A new study created in memory with name: lgbm_finetuning

[I 2025-02-17 22:06:19,081] Trial 3 finished with value: 0.8490295742804053 and parameters: {'n_estimators': 242, 'max_depth': 3, 'learning_rate': 0.29181389848419775, 'subsample': 0.9838636166923819, 'colsample_bytree': 0.636832786230453, 'num_leaves': 33}. Best is trial 3 with value: 0.8490295742804053.
[I 2025-02-17 22:06:22,102] Trial 1 finished with value: 0.8490804925958058 and parameters: {'n_estimators': 395, 'max_depth': 4, 'learning_rate': 0.13482210171397874, 'subsample': 0.6020050401047186, 'colsample_bytree': 0.7634442774076408, 'num_leaves': 32}. Best is trial 1 with value: 0.8490804925958058.
[I 2025-02-17 22:06:31,246] Trial 5 finished with value: 0.8517812005266145 and parameters: {'n_estimators': 430, 'max_depth': 5, 'learning_rate': 0.09546021720973098, 'subsample': 0.5757371977858275, 'colsample_bytree': 0.9062438718039114, 'num_leaves': 19}. Best is trial 5 with value: 0.8517812005266145.
[I 2025-02-17 22:06:32,620] Trial 0 finished with value: 0.8298860003615667 a