### **LightGBM Model with Hyperparameter Tuning**

LightGBM is a gradient boosting framework that uses tree-based algorithms and is designed for efficiency and low memory usage.

In [1]:
import os
import time
import warnings

import joblib
import lightgbm as lgb
import matplotlib.pyplot as plt
import numpy as np
import optuna
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold, train_test_split
from sklearn.multioutput import MultiOutputRegressor

# Silence library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')



In [2]:
X = pd.read_csv('X_processed.csv')
X_test = pd.read_csv('X_test_processed.csv')
y = pd.read_csv('y_processed.csv')

# Drop identifier columns if they are still present in the data
X = X.drop(columns=['PID', 'site'], errors='ignore')
X_test = X_test.drop(columns=['PID', 'site'], errors='ignore')


# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("y_train shape:", y_train.shape)

X_train shape: (6195, 31)
y_train shape: (6195, 11)


In [3]:
# Helper to evaluate every model consistently

def evaluate_model(model, X_train, y_train, X_val, y_val, model_name):
    # Track training time
    start_time = time.time()

    # Fit the model
    model.fit(X_train, y_train)

    # Elapsed training time in seconds
    train_time = time.time() - start_time

    # Predict on the training and validation sets
    y_pred_train = model.predict(X_train)
    y_pred_val = model.predict(X_val)

    # Mean absolute error, averaged uniformly over all target columns
    train_mae = mean_absolute_error(y_train, y_pred_train)
    val_mae = mean_absolute_error(y_val, y_pred_val)

    # Root mean squared error, computed over all target columns
    train_rmse = np.sqrt(mean_squared_error(y_train, y_pred_train))
    val_rmse = np.sqrt(mean_squared_error(y_val, y_pred_val))

    # Print results
    print(f"\n{model_name} Results:")
    print(f"Training Time: {train_time:.2f} seconds")
    print(f"Training MAE: {train_mae:.4f}, RMSE: {train_rmse:.4f}")
    print(f"Validation MAE: {val_mae:.4f}, RMSE: {val_rmse:.4f}")

    # Return the results
    return {
        'model': model,
        'name': model_name,
        'train_mae': train_mae,
        'val_mae': val_mae,
        'train_rmse': train_rmse,
        'val_rmse': val_rmse,
        'train_time': train_time
    }
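
Because `evaluate_model` reports a single MAE averaged over all 11 target columns, targets on a large scale can dominate the score (the training logs show starting scores of roughly 1659 and 15.5 for different targets). The sketch below shows one way to inspect the error per target using the `multioutput='raw_values'` option of `mean_absolute_error`. The helper name `per_target_mae`, the toy arrays, and the target labels are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error


def per_target_mae(y_true, y_pred, target_names=None):
    """Return a dict mapping each target column to its own MAE (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # multioutput='raw_values' returns one error per column
    # instead of the uniform average across columns
    errors = mean_absolute_error(y_true, y_pred, multioutput='raw_values')
    if target_names is None:
        target_names = [f"target_{i}" for i in range(errors.shape[0])]
    return dict(zip(target_names, errors.tolist()))


# Toy data (assumed): two targets on very different scales
y_true_demo = np.array([[1000.0, 10.0], [2000.0, 20.0]])
y_pred_demo = np.array([[1100.0, 11.0], [1900.0, 18.0]])

demo_scores = per_target_mae(y_true_demo, y_pred_demo, ["large_scale", "small_scale"])
demo_average = mean_absolute_error(y_true_demo, y_pred_demo)

print(demo_scores)
print("Uniform average MAE:", demo_average)
```

In the notebook itself, a call such as `per_target_mae(y_val, model.predict(X_val), list(y_val.columns))` would give the same breakdown for the real targets, assuming `y_val` is a DataFrame.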





In [5]:
# Define objective function for LightGBM hyperparameter tuning
def objective_lgb(trial):
    # Define hyperparameters to tune
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', -1, 15),  # LightGBM treats any value <= 0 as no limit
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'num_leaves': trial.suggest_int('num_leaves', 20, 150),
        # Note: in LightGBM, subsample only takes effect when subsample_freq > 0 (default is 0)
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 10.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 10.0),
        'random_state': 42
    }
    
    # Create LightGBM MultiOutputRegressor
    lgb_model = MultiOutputRegressor(lgb.LGBMRegressor(**params))
    
    # Train the model
    lgb_model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = lgb_model.predict(X_val)
    
    # Calculate MAE
    mae = mean_absolute_error(y_val, y_pred)
    
    return mae

# Run the hyperparameter optimization
print("Tuning LightGBM hyperparameters...")
study_lgb = optuna.create_study(direction='minimize')
study_lgb.optimize(objective_lgb, n_trials=10)  # Adjust n_trials as needed

print("Best LightGBM Parameters:", study_lgb.best_params)
print("Best LightGBM MAE:", study_lgb.best_value)

# Create the optimized LightGBM model
best_lgb_model = MultiOutputRegressor(lgb.LGBMRegressor(**study_lgb.best_params, random_state=42))

# Evaluate LightGBM model
lgb_results = evaluate_model(best_lgb_model, X_train, y_train, X_val, y_val, "LightGBM")

[I 2025-07-08 13:10:43,430] A new study created in memory with name: no-name-2f7584f9-2d29-4973-85f1-6033aab5d768


Tuning LightGBM hyperparameters...
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003254 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 6693
[LightGBM] [Info] Number of data points in the train set: 6195, number of used features: 31
[LightGBM] [Info] Start training from score 1659.143341
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001581 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 6693
[LightGBM] [Info] Number of data points in the train set: 6195, number of used features: 31
[LightGBM] [Info] Start training from score 15.498505
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001670 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 6693
[LightGBM] [Info] Number of data points in the train set: 6195, number of used feat

[I 2025-07-08 13:10:55,338] Trial 0 finished with value: 158.88737745415503 and parameters: {'n_estimators': 350, 'max_depth': 7, 'learning_rate': 0.036416904527630044, 'num_leaves': 61, 'subsample': 0.8822120144631846, 'colsample_bytree': 0.8629110833950804, 'min_child_samples': 94, 'reg_alpha': 3.755936946982208, 'reg_lambda': 9.346690734304563}. Best is trial 0 with value: 158.88737745415503.



[I 2025-07-08 13:11:04,917] Trial 1 finished with value: 167.3864792746605 and parameters: {'n_estimators': 139, 'max_depth': 13, 'learning_rate': 0.02078735762814366, 'num_leaves': 98, 'subsample': 0.7122381342352144, 'colsample_bytree': 0.9695920744233436, 'min_child_samples': 81, 'reg_alpha': 0.27277175851544766, 'reg_lambda': 8.950985282019433}. Best is trial 0 with value: 158.88737745415503.



[I 2025-07-08 13:11:17,665] Trial 2 finished with value: 160.40152807342602 and parameters: {'n_estimators': 441, 'max_depth': 7, 'learning_rate': 0.10635817146960835, 'num_leaves': 68, 'subsample': 0.8226765332526321, 'colsample_bytree': 0.6113144948069162, 'min_child_samples': 66, 'reg_alpha': 7.281166885462072, 'reg_lambda': 5.4817185458714786}. Best is trial 0 with value: 158.88737745415503.



[I 2025-07-08 13:11:19,640] Trial 3 finished with value: 184.91386393594058 and parameters: {'n_estimators': 222, 'max_depth': 2, 'learning_rate': 0.04676874453161505, 'num_leaves': 122, 'subsample': 0.7411816120426142, 'colsample_bytree': 0.6556566527166058, 'min_child_samples': 37, 'reg_alpha': 5.072854917428987, 'reg_lambda': 4.380668752076793}. Best is trial 0 with value: 158.88737745415503.



[I 2025-07-08 13:11:27,281] Trial 4 finished with value: 162.2289275059804 and parameters: {'n_estimators': 206, 'max_depth': 11, 'learning_rate': 0.1314482030667287, 'num_leaves': 144, 'subsample': 0.8034989229381286, 'colsample_bytree': 0.6141906659987186, 'min_child_samples': 65, 'reg_alpha': 9.842942506404725, 'reg_lambda': 6.799358287836505}. Best is trial 0 with value: 158.88737745415503.



[I 2025-07-08 13:11:28,720] Trial 5 finished with value: 195.20318350529726 and parameters: {'n_estimators': 102, 'max_depth': 2, 'learning_rate': 0.05266901598871288, 'num_leaves': 47, 'subsample': 0.9742294111883109, 'colsample_bytree': 0.924319598217713, 'min_child_samples': 38, 'reg_alpha': 1.2121441128832022, 'reg_lambda': 8.146269293132967}. Best is trial 0 with value: 158.88737745415503.



[I 2025-07-08 13:11:33,992] Trial 6 finished with value: 167.78392129522283 and parameters: {'n_estimators': 385, 'max_depth': 3, 'learning_rate': 0.26665930217409933, 'num_leaves': 85, 'subsample': 0.8667785933284822, 'colsample_bytree': 0.6745972803456536, 'min_child_samples': 50, 'reg_alpha': 9.693016319464316, 'reg_lambda': 9.266796470145577}. Best is trial 0 with value: 158.88737745415503.



[I 2025-07-08 13:11:44,429] Trial 7 finished with value: 158.40474126092667 and parameters: {'n_estimators': 201, 'max_depth': 12, 'learning_rate': 0.03151313338836413, 'num_leaves': 74, 'subsample': 0.6360118710635863, 'colsample_bytree': 0.7336842284708387, 'min_child_samples': 93, 'reg_alpha': 5.575268915464707, 'reg_lambda': 7.343429582645299}. Best is trial 7 with value: 158.40474126092667.



[I 2025-07-08 13:12:05,334] Trial 8 finished with value: 160.2731122916023 and parameters: {'n_estimators': 368, 'max_depth': 14, 'learning_rate': 0.09548828225364875, 'num_leaves': 132, 'subsample': 0.7160317334554605, 'colsample_bytree': 0.8116332209419621, 'min_child_samples': 53, 'reg_alpha': 7.593686122935324, 'reg_lambda': 3.52830406323672}. Best is trial 7 with value: 158.40474126092667.



[I 2025-07-08 13:12:19,655] Trial 9 finished with value: 157.20082619207147 and parameters: {'n_estimators': 310, 'max_depth': 12, 'learning_rate': 0.03942220740141231, 'num_leaves': 113, 'subsample': 0.610863265593157, 'colsample_bytree': 0.7370934380357586, 'min_child_samples': 86, 'reg_alpha': 9.862157309953282, 'reg_lambda': 0.22627056616229435}. Best is trial 9 with value: 157.20082619207147.


Best LightGBM Parameters: {'n_estimators': 310, 'max_depth': 12, 'learning_rate': 0.03942220740141231, 'num_leaves': 113, 'subsample': 0.610863265593157, 'colsample_bytree': 0.7370934380357586, 'min_child_samples': 86, 'reg_alpha': 9.862157309953282, 'reg_lambda': 0.22627056616229435}
Best LightGBM MAE: 157.20082619207147
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002359 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 6693
[LightGBM] [Info] Number of data points in the train set: 6195, number of used features: 31
[LightGBM] [Info] Start training from score 1659.143341
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001998 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 6693
[LightGBM] [Info] Number of data points in the train set: 6195, number of used features: 31
[LightGBM] [Info] Start training from score 