# 03 Model Training and Tuning

In this notebook, we fine-tune the hyperparameters of our LightGBM model using Optuna. The steps are:
1. Loading preprocessed data.
2. Loading configuration (the list of features to use) from our YAML file.
3. Optionally performing an out-of-time split.
4. Running hyperparameter optimization via Optuna.
5. Saving the best model to the `artifacts/` folder.

In [1]:
import os
import sys

import pandas as pd
import yaml

In [2]:
# Make the project's src/ directory importable from this notebook.
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..', 'src')))

In [3]:
from models.finetune_model import finetune_model
from data.split_data import split_data_by_time

In [4]:
data_path = os.path.join(os.getcwd(), "../data", "processed", "processed_data.csv")
df = pd.read_csv(data_path)
print("Original data shape:", df.shape)

Original data shape: (150000, 45)


In [5]:
constants_path = os.path.join(os.getcwd(), "../src/utils", "constants.yaml")
with open(constants_path, "r") as file:
    constants = yaml.safe_load(file)
FEATURES_TO_USE_AFTER_SELECTION = constants["FEATURES_TO_USE_AFTER_SELECTION"]
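
Selecting columns with a list read from a config file raises a bare `KeyError` if the YAML and the processed data have drifted apart. A small guard like the one below can make that failure easier to read. The helper names `missing_features` and `require_features` are hypothetical and are not part of this project's `src/` package.

```python
def missing_features(available_columns, selected_features):
    """Return the selected features absent from the available columns, in order."""
    available = set(available_columns)
    return [name for name in selected_features if name not in available]


def require_features(available_columns, selected_features):
    """Raise a descriptive KeyError if any selected feature is missing."""
    missing = missing_features(available_columns, selected_features)
    if missing:
        raise KeyError(
            f"Features listed in the config but missing from the data: {missing}"
        )
    # Return a plain list so the result can be used directly for column selection.
    return list(selected_features)
```

In this notebook it could be called as `require_features(df.columns, FEATURES_TO_USE_AFTER_SELECTION)` before the feature filter is applied.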

In [6]:
X_train, X_test, y_train, y_test = split_data_by_time(df)
# Keep only the selected features, in both splits so their columns stay aligned.
X_train = X_train[FEATURES_TO_USE_AFTER_SELECTION]
X_test = X_test[FEATURES_TO_USE_AFTER_SELECTION]
print("Training set shape:", X_train.shape)

Training set shape: (90063, 16)
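
The split above leaves 90,063 of the 150,000 rows in the training set. For readers unfamiliar with out-of-time splits, here is a minimal sketch of the idea: every row dated before a cutoff goes to training and every later row goes to testing, so the model is evaluated on data newer than anything it has seen. This is not the implementation of `split_data_by_time`; the function name `split_by_time_sketch`, the column names `event_date` and `target`, and the cutoff argument are assumptions chosen for illustration.

```python
import pandas as pd


def split_by_time_sketch(df, cutoff, date_col="event_date", target_col="target"):
    """Hypothetical out-of-time split: rows before `cutoff` train, the rest test."""
    cutoff = pd.Timestamp(cutoff)
    is_train = df[date_col] < cutoff
    train, test = df.loc[is_train], df.loc[~is_train]
    # Features are every column except the target; the date column is kept here
    # and can be dropped later by a feature-selection step.
    feature_cols = [c for c in df.columns if c != target_col]
    return (
        train[feature_cols],
        test[feature_cols],
        train[target_col],
        test[target_col],
    )
```

Unlike a random split, the share of rows on each side depends on how the records are distributed over time, so the training fraction is generally not a round number.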


In [8]:
best_model, best_params = finetune_model(X_train, y_train, model_output_name="best_model_1.pkl", n_trials=50)
print("Best hyperparameters found:")
print(best_params)

[I 2025-02-17 18:27:10,756] A new study created in memory with name: lgbm_finetuning


[I 2025-02-17 18:27:16,060] Trial 2 finished with value: 0.8547131066163445 and parameters: {'n_estimators': 169, 'max_depth': 3, 'learning_rate': 0.1556713178329501, 'subsample': 0.8912397826298294, 'colsample_bytree': 0.5533291927766556, 'num_leaves': 105}. Best is trial 2 with value: 0.8547131066163445.
[I 2025-02-17 18:27:20,652] Trial 3 finished with value: 0.8536465324916056 and parameters: {'n_estimators': 352, 'max_depth': 4, 'learning_rate': 0.09236466651467053, 'subsample': 0.8988675164866756, 'colsample_bytree': 0.6916400651274675, 'num_leaves': 139}. Best is trial 2 with value: 0.8547131066163445.
[I 2025-02-17 18:27:26,887] Trial 0 finished with value: 0.8574348568686772 and parameters: {'n_estimators': 403, 'max_depth': 5, 'learning_rate': 0.019983215558284913, 'subsample': 0.8407504471555454, 'colsample_bytree': 0.6075532506380598, 'num_leaves': 114}. Best is trial 0 with value: 0.8574348568686772.
[I 2025-02-17 18:27:28,920] Trial 1 finished with value: 0.83554166292392