# Optuna hyperparameter optimization for LightGBM and XGBoost models in Goggle Colab

**Hyperparameter Optimization with Optuna**

This notebook details the hyperparameter optimization process for our machine learning models, utilizing Optuna, a powerful open-source framework. Optuna's key advantage lies in its ability to efficiently navigate the vast hyperparameter search space using state-of-the-art sampling algorithms, such as Tree-structured Parzen Estimator (TPE), to intelligently select new parameter combinations based on the performance of previous trials. This approach is significantly more efficient than traditional grid or random search methods, which do not learn from past results.

The optimization process is structured as follows:

    Objective Function: We define an objective function that takes a trial object as input. This function trains a model (in this case, LightGBM or XGBoost) with a specific set of hyperparameters proposed by Optuna and returns a performance metric (ROC-AUC).

    Study: An Optuna study object manages the optimization process. It stores all trials, their parameters, and their corresponding results.

    Optimization: The study.optimize() method iteratively calls the objective function, with Optuna's samplers proposing new hyperparameters for each trial. This process continues for a specified number of trials or a set time limit, automatically saving the best-performing set of hyperparameters.

By automating this complex search, Optuna ensures we identify the most effective hyperparameter configurations to maximize our model's performance on the given task.

## 1. Installing libraries, Data loading

In [2]:
!pip install optuna category_encoders xgboost lightgbm

Collecting optuna
  Downloading optuna-4.5.0-py3-none-any.whl.metadata (17 kB)
Collecting category_encoders
  Downloading category_encoders-2.8.1-py3-none-any.whl.metadata (7.9 kB)
Collecting colorlog (from optuna)
  Downloading colorlog-6.9.0-py3-none-any.whl.metadata (10 kB)
Downloading optuna-4.5.0-py3-none-any.whl (400 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m400.9/400.9 kB[0m [31m4.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading category_encoders-2.8.1-py3-none-any.whl (85 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m85.7/85.7 kB[0m [31m5.8 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading colorlog-6.9.0-py3-none-any.whl (11 kB)
Installing collected packages: colorlog, optuna, category_encoders
Successfully installed category_encoders-2.8.1 colorlog-6.9.0 optuna-4.5.0


In [3]:
import optuna
from lightgbm import LGBMClassifier
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split, cross_val_score, train_test_split
from sklearn.metrics import roc_auc_score
import pandas as pd
import time
import lightgbm as lgb

**Loading dataset**

In [4]:
from google.colab import drive
drive.mount('/content/drive')


Mounted at /content/drive


In [5]:
train_df_enriched = pd.read_parquet("/content/drive/MyDrive/train_enriched_288.parquet")
importances = pd.read_csv("/content/drive/MyDrive/lgbm_importances.csv", index_col=0).squeeze("columns")

In [6]:
train_df_enriched.head()

Unnamed: 0,SK_ID_CURR,TARGET,NAME_CONTRACT_TYPE,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,AMT_CREDIT,AMT_ANNUITY,...,POS_SK_DPD_DEF_max,POS_INSTALMENTS_COMPLETED_RATIO_mean,CNT_INSTALMENT_max,CNT_INSTALMENT_FUTURE_max,MONTHS_BALANCE_max,POS_DPD_CHANGE_max,POS_SK_DPD_last,POS_IS_DELINQUENT_max,POS_IS_SERIOUSLY_DELINQUENT_max,POS_IS_COMPLETED_max
0,100002,1,Cash loans,M,N,Y,0,202500.0,406597.5,24700.5,...,0.0,0.4,24.0,24.0,-1.0,0.0,0.0,0.0,0.0,0.0
1,100003,0,Cash loans,F,N,N,0,270000.0,1293502.5,35698.5,...,0.0,0.505495,12.0,12.0,-18.0,0.0,0.0,0.0,0.0,1.0
2,100004,0,Revolving loans,M,Y,Y,0,67500.0,135000.0,6750.0,...,0.0,0.55,4.0,4.0,-24.0,0.0,0.0,0.0,0.0,1.0
3,100006,0,Cash loans,F,N,Y,0,135000.0,312682.5,29686.5,...,0.0,0.470251,48.0,48.0,-1.0,0.0,0.0,0.0,0.0,1.0
4,100007,0,Cash loans,M,N,Y,0,121500.0,513000.0,21865.5,...,0.0,0.478161,24.0,24.0,-1.0,0.0,0.0,0.0,0.0,1.0


**Splitting the data**

In [8]:
X = train_df_enriched.drop(columns=["TARGET", "SK_ID_CURR"], errors="ignore")
y = train_df_enriched["TARGET"]


X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Train: {X_train.shape[0]} samples, Test: {X_test.shape[0]} samples")

Train: 246008 samples, Test: 61503 samples


**LGBM feature importances**

In [11]:
top_n = 170
selected_features_top170 = importances.head(top_n).index.tolist()
X_train_170 = X_train[selected_features_top170]

## 2. Optuna optimization for LightGBM model

In [15]:
X_opt = X_train_170.copy()
y_opt = y_train.copy()

for col in X_opt.select_dtypes(include="object").columns:
    X_opt[col] = X_opt[col].astype("category")

scale_pos_weight = (y_opt == 0).sum() / (y_opt == 1).sum()

best_params = {
    "learning_rate": 0.03725239085209563,
    "max_depth": 8,
    #"num_leaves": 203,
    "min_child_samples": 606,
    "subsample": 0.73963761930167,
    "colsample_bytree": 0.6198891468899977
}

def objective(trial):

    param = {
        "objective": "binary",
        "boosting_type": "gbdt",
        "metric": "auc",
        "device": "cpu",
        #"gpu_platform_id": 0,  # optional
        #"gpu_device_id": 0,    # optional
        "random_state": 42,
        "n_estimators": 500,
        "verbose": -1,

        "learning_rate": trial.suggest_float(
            "learning_rate",
            max(0.001, best_params["learning_rate"] * 0.5),
            best_params["learning_rate"] * 1.5
        ),
        "max_depth": trial.suggest_int(
            "max_depth",
            max(3, best_params["max_depth"] - 2),
            best_params["max_depth"] + 2
        ),
        "min_child_samples": trial.suggest_int(
            "min_child_samples",
            max(20, best_params["min_child_samples"] - 100),
            best_params["min_child_samples"] + 100
        ),
        "subsample": trial.suggest_float(
            "subsample",
            max(0.6, best_params["subsample"] - 0.1),
            min(1.0, best_params["subsample"] + 0.1)
        ),
        "colsample_bytree": trial.suggest_float(
            "colsample_bytree",
            max(0.6, best_params["colsample_bytree"] - 0.1),
            min(1.0, best_params["colsample_bytree"] + 0.1)
        ),
        "scale_pos_weight": scale_pos_weight
    }

    train_set = lgb.Dataset(X_opt, label=y_opt)

    cv_result = lgb.cv(
        params=param,
        train_set=train_set,
        nfold=5,
        metrics="auc",
        stratified=True,
        seed=42,
        #verbose_eval=False
    )

    auc_key = [k for k in cv_result.keys() if "auc" in k][0]
    return max(cv_result[auc_key])

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(n_startup_trials=0)
)

import time
start_time = time.time()

study.optimize(objective, n_trials=100, show_progress_bar=True)

elapsed_time = time.time() - start_time

print(f"Best ROC-AUC: {study.best_value:.4f}")
print("Best params:", study.best_params)
print(f"Total time: {elapsed_time:.2f} seconds")


[I 2025-09-13 06:34:49,533] A new study created in memory with name: no-name-b6bf9e3a-4f25-4a45-8dca-f724a868bff6


  0%|          | 0/100 [00:00<?, ?it/s]

[I 2025-09-13 06:40:33,052] Trial 0 finished with value: 0.7868651674044445 and parameters: {'learning_rate': 0.04921553892560256, 'max_depth': 8, 'min_child_samples': 675, 'subsample': 0.6635555340323006, 'colsample_bytree': 0.6305611001585433}. Best is trial 0 with value: 0.7868651674044445.
[I 2025-09-13 06:46:07,939] Trial 1 finished with value: 0.7867272748491232 and parameters: {'learning_rate': 0.05077899759579435, 'max_depth': 8, 'min_child_samples': 684, 'subsample': 0.6513766184548517, 'colsample_bytree': 0.6259195657859995}. Best is trial 0 with value: 0.7868651674044445.
[I 2025-09-13 06:51:51,965] Trial 2 finished with value: 0.7866443082956204 and parameters: {'learning_rate': 0.04067157559438531, 'max_depth': 7, 'min_child_samples': 642, 'subsample': 0.7004063702690521, 'colsample_bytree': 0.6822646612780654}. Best is trial 0 with value: 0.7868651674044445.
[I 2025-09-13 06:58:28,675] Trial 3 finished with value: 0.7855305253568273 and parameters: {'learning_rate': 0.019

**Saving best params**

In [16]:
import json
import os

save_path = "/content/drive/MyDrive/optuna_results"
os.makedirs(save_path, exist_ok=True)

best_data_lgbm = {
    "best_value": study.best_value,
    "best_params": study.best_params
}

file_best = os.path.join(save_path, "lgbm_best_result.json")
with open(file_best, "w") as f:
    json.dump(best_data_lgbm, f, indent=4)

print(f"✅ Best params & value saved to: {file_best}")

✅ Best params & value saved to: /content/drive/MyDrive/optuna_results/lgbm_best_result.json


## 3. Optuna optimization for XGBoost model

In [None]:
import xgboost as xgb
from sklearn.model_selection import train_test_split
from category_encoders import TargetEncoder

categorical_cols = X_train170.select_dtypes(exclude=["number"]).columns
numeric_cols = X_train170.select_dtypes(include=["number"]).columns

encoder = TargetEncoder(cols=categorical_cols)
X_train_encoded = encoder.fit_transform(X_train170, y_train)

X_opt = X_train_encoded.copy()
y_opt = y_train.copy()

X_train_opt, X_valid_opt, y_train_opt, y_valid_opt = train_test_split(
    X_opt, y_opt, test_size=0.2, stratify=y_opt, random_state=42
)

scale_pos_weight = (y_opt == 0).sum() / (y_opt == 1).sum()

def objective(trial):
    param = {
        "n_estimators": 500,
        "tree_method": "hist",
        "learning_rate": trial.suggest_float("learning_rate", 0.03, 0.06),
        "max_depth": trial.suggest_int("max_depth", 3, 8),
        "min_child_weight": trial.suggest_int("min_child_weight", 3, 20),
        "subsample": trial.suggest_float("subsample", 0.6, 0.8),
        "colsample_bytree": trial.suggest_float("colsample_bytree", 0.6, 0.8),
        "scale_pos_weight": scale_pos_weight,
        "random_state": 42,
        "eval_metric": "auc",
        "use_label_encoder": False,
        "verbosity": 0,
    }

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = []

    for train_idx, valid_idx in cv.split(X_opt, y_opt):

        X_train_cv, X_valid_cv = X_opt.iloc[train_idx], X_opt.iloc[valid_idx]
        y_train_cv, y_valid_cv = y_opt.iloc[train_idx], y_opt.iloc[valid_idx]

        preprocessor170.fit(X_train_cv, y_train_cv)

        X_train_cv_transformed = preprocessor170.transform(X_train_cv)
        X_valid_cv_transformed = preprocessor170.transform(X_valid_cv)

        model = XGBClassifier(**param)
        model.fit(X_train_cv_transformed, y_train_cv)

        preds = model.predict_proba(X_valid_cv_transformed)[:, 1]
        score = roc_auc_score(y_valid_cv, preds)
        scores.append(score)

    return np.mean(scores)

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(n_startup_trials=0, seed=42),
)

start_time = time.time()
study.optimize(objective, n_trials=150, show_progress_bar=True)
elapsed_time = time.time() - start_time

print("✅ Best ROC-AUC:", study.best_value)
print("✅ Best params:", study.best_params)
print(f"⏱ Total time: {elapsed_time:.2f} seconds")


[I 2025-09-07 03:58:16,746] A new study created in memory with name: no-name-4bfc6cff-fad2-4699-8a84-9fffdc65e77a


  0%|          | 0/150 [00:00<?, ?it/s]

[I 2025-09-07 03:58:31,934] Trial 0 finished with value: 0.7770369844048735 and parameters: {'learning_rate': 0.04370861069626263, 'max_depth': 10, 'min_child_weight': 15, 'subsample': 0.8394633936788146, 'colsample_bytree': 0.6624074561769746, 'gamma': 1.5599452033620265, 'reg_alpha': 0.5808361216819946, 'reg_lambda': 8.661761457749352}. Best is trial 0 with value: 0.7770369844048735.
[I 2025-09-07 03:58:39,962] Trial 1 finished with value: 0.7744450764014918 and parameters: {'learning_rate': 0.0641003510568888, 'max_depth': 8, 'min_child_weight': 1, 'subsample': 0.9879639408647978, 'colsample_bytree': 0.9329770563201687, 'gamma': 2.1233911067827616, 'reg_alpha': 1.8182496720710062, 'reg_lambda': 1.8340450985343382}. Best is trial 0 with value: 0.7770369844048735.
[I 2025-09-07 03:58:49,401] Trial 2 finished with value: 0.7819456597464587 and parameters: {'learning_rate': 0.0373818018663584, 'max_depth': 7, 'min_child_weight': 9, 'subsample': 0.7164916560792167, 'colsample_bytree': 0.

**Saving best_params for XGB**

In [13]:
save_path = "/content/drive/MyDrive/optuna_results"
os.makedirs(save_path, exist_ok=True)

best_data_xgb = {
    "best_value": study.best_value,
    "best_params": study.best_params
}

file_best = os.path.join(save_path, "xgb_best_result.json")
with open(file_best, "w") as f:
    json.dump(best_data_xgb, f, indent=4)

print(f"✅ Best params & value saved to: {file_best}")


✅ Best params & value saved to: /content/drive/MyDrive/optuna_results/xgb_best_result.json
