# Hyperparameter Optimization

## Rationale:
Hyperparameter optimization is a critical step in machine learning model development. Properly tuned hyperparameters can significantly improve a model's performance and generalization. The choice of hyperparameters depends on the specific machine learning algorithm and dataset. In this study, we aim to optimize the hyperparameters of a LightGBM model, a popular gradient boosting framework, to enhance its predictive power.

## Methodology:
We conducted hyperparameter optimization using the Optuna library, which provides a powerful and efficient platform for automated hyperparameter tuning. The study involved the following steps:

1. **Baseline Model**: We started by training a baseline LightGBM model with default hyperparameters. This served as a reference point for evaluating the impact of hyperparameter optimization.

2. **Optuna Integration**: We used Optuna to search for hyperparameters with its TPE sampler (`TPESampler(seed=0)`). The objective function trains a model on 80% of the training data and returns ROC-AUC on the remaining 20% hold-out split. K-fold cross-validation is used only later, when the selected configurations are refit through the training pipeline.

3. **Parameter Search Space**: We specified a search space covering the boosting type (`gbdt` or `dart`), number of estimators, learning rate, number of leaves, bagging frequency, minimum child samples, subsample ratio, column subsample ratio (feature fraction), and the L1 and L2 regularization terms. The learning rate and regularization terms were sampled on a log scale.

4. **Pruning**: To expedite the optimization process, we configured the study with Optuna's `HyperbandPruner`. Pruning terminates poorly performing trials early, saving computational resources. A pruner can only act on trials that report intermediate values; the objective in this notebook returns a single final score, so in practice no trial is stopped early.

5. **Integration with LightGBM**: In addition to the Optuna study, we used Optuna's LightGBM integration (`optuna.integration.lightgbm`). Its `train` function is a drop-in replacement for `lightgbm.train` that tunes feature fraction, number of leaves, bagging, regularization terms, and minimum data in leaf in a stepwise manner. A second Optuna study then tuned the number of estimators and the learning rate with those parameters held fixed.

6. **Comparison**: We compared three models: the baseline (`boruta vanilla`), the model from the first Optuna study (`boruta + Optuna base`), and the model built from the LightGBM integration (`boruta + Optuna integration`). Each was trained through the same pipeline with 3-fold cross-validation and evaluated by ROC-AUC on out-of-fold predictions and on a separate validation set.
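The pruning idea in step 4 can be illustrated without any library. Hyperband builds on successive halving: every candidate gets a small budget, the weaker share is dropped, and the survivors continue with a larger budget. The sketch below is a toy version under stated assumptions: `toy_score`, the budgets, and the reduction factor are all made up for illustration and are not Optuna's internals.

```python
import math


def toy_score(config, budget):
    """Hypothetical validation score that approaches config['quality'] as budget grows."""
    return config["quality"] * (1.0 - math.exp(-budget / 20.0))


def successive_halving(configs, min_budget=10, reduction_factor=2):
    """Keep the best 1/reduction_factor of the candidates at each rung."""
    survivors = list(configs)
    budget = min_budget
    history = []  # (budget, number of candidates evaluated at that budget)
    while len(survivors) > 1:
        history.append((budget, len(survivors)))
        ranked = sorted(survivors, key=lambda c: toy_score(c, budget), reverse=True)
        keep = max(1, len(ranked) // reduction_factor)
        survivors = ranked[:keep]
        budget *= reduction_factor
    return survivors[0], history


# Eight made-up candidates with increasing "true" quality.
candidates = [{"name": f"trial_{i}", "quality": 0.70 + 0.01 * i} for i in range(8)]
best, history = successive_halving(candidates)
print(best["name"], history)
```

Only two of the eight candidates ever receive the largest budget, which is where the savings come from. In Optuna the same effect requires the objective to report intermediate scores to the trial.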

## Conclusions:
The results of the study are summarized in the performance comparison at the end of this notebook. The key takeaways are:

- The baseline model (`boruta vanilla`) reached a ROC-AUC of 0.7925 out of fold and 0.7999 on the validation set, which turned out to be a strong benchmark.
- The first Optuna study did not improve on the baseline: the resulting model (`boruta + Optuna base`) scored 0.7895 out of fold and 0.7953 on validation. That model was trained with `boosting_type='gbdt'`, while the study's best trial used `dart`.
- The model built from Optuna's LightGBM integration (`boruta + Optuna integration`) scored highest, with 0.7957 out of fold and 0.8009 on validation, a gain of about 0.001 validation AUC over the baseline.
- The stepwise tuner took about 1 h 16 min of wall time in this run, so the gain came at a notable computational cost.

Overall, automated tuning produced only a marginal improvement over the baseline on this dataset, and only through the LightGBM integration. Tuned configurations should therefore always be checked against the baseline on out-of-fold and validation data before they are adopted.


In [5]:
%load_ext autoreload
%autoreload 2

import warnings
from copy import deepcopy

import numpy as np
import pandas as pd
import lightgbm as lgbm
import optuna
from optuna import visualization as optunaviz
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

warnings.filterwarnings("ignore")

import sys
sys.path.append("../")

# local imports
from src.learner_params import target_column, space_column, boruta_learner_params, test_params
from utils.functions__utils import find_constraint

from utils.feature_selection_lists import fw_features, boruta_features, optuna_features, ensemble_features

from utils.functions__training import model_pipeline

In [6]:
train_df = pd.read_pickle("../data/train_df.pkl")
validation_df = pd.read_pickle("../data/validation_df.pkl")

In [18]:
import optuna
from optuna.pruners import HyperbandPruner
from optuna.samplers import TPESampler
from optuna.logging import set_verbosity


def objective(trial):
    """Train LightGBM on an 80/20 split of train_df and return hold-out ROC-AUC.

    No intermediate values are reported to the trial, so the study's pruner
    never stops a trial early.
    """
    train, test = train_test_split(train_df, random_state=42, test_size=.2)
    dtrain = lgbm.Dataset(train[boruta_features], label=train[target_column])

    params = {
        'verbose':-1,
        'objective':"binary",
        'metric':"binary_logloss",
        "boosting_type":trial.suggest_categorical("boosting_type", ["gbdt", "dart"]),
        "n_estimators":trial.suggest_int("n_estimators", 2000, 3032),
        # suggest_loguniform is deprecated; suggest_float(..., log=True) is the equivalent call
        "learning_rate":trial.suggest_float("learning_rate", 1e-3, 1e-1, log=True),
        'num_leaves': trial.suggest_int('num_leaves', 32, 264),
        "bagging_freq":trial.suggest_int("bagging_freq", 2,7),
        'min_child_samples': trial.suggest_int('min_child_samples', 50, 1024),
        'subsample': trial.suggest_float('subsample', .7, 1),
        'colsample_bytree': trial.suggest_float('colsample_bytree', .6, 1),
        'lambda_l1':trial.suggest_float('lambda_l1',1e-2,10, log = True),
        'lambda_l2':trial.suggest_float('lambda_l2',1e-2,10, log = True),
        'n_jobs': -1,
        'random_state': 42
      }
    bst = lgbm.train(params, dtrain)
    preds = bst.predict(test[boruta_features])

    score = roc_auc_score(test[target_column], preds)

    return score

set_verbosity(optuna.logging.ERROR)
study = optuna.create_study(direction="maximize",
                            pruner=HyperbandPruner(),
                            sampler=TPESampler(seed=0)
                           )
study.optimize(objective, n_trials=25,show_progress_bar=True)


In [19]:
study.best_params

{'boosting_type': 'dart',
 'n_estimators': 2349,
 'learning_rate': 0.021424861729373922,
 'num_leaves': 59,
 'bagging_freq': 6,
 'min_child_samples': 335,
 'subsample': 0.7537835105674658,
 'colsample_bytree': 0.8226472292027862,
 'lambda_l1': 0.010994877348284958,
 'lambda_l2': 2.028432500906238}
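The learning rate and the regularization terms above were searched on a log scale. As a rough illustration of what that means, the sketch below draws values whose logarithm is uniform over the range, so each decade receives about the same share of samples. `sample_log_uniform` is a hypothetical helper written for this example; Optuna's TPE sampler is adaptive and does not sample this naively.

```python
import math
import random


def sample_log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)
draws = [sample_log_uniform(rng, 1e-3, 1e-1) for _ in range(10_000)]

# Share of draws in the lower decade [1e-3, 1e-2); close to one half on a log scale,
# whereas plain uniform sampling would put only about 9% of the draws there.
lower_decade_share = sum(d < 1e-2 for d in draws) / len(draws)
print(f"share below 1e-2: {lower_decade_share:.3f}")
```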

In [20]:
study.best_value

0.7899181764633438
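The value above is a ROC-AUC on the hold-out split. One way to read it: AUC is the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way on made-up labels and scores; `roc_auc_pairwise` is an illustrative helper, quadratic in the number of rows, and not how `sklearn.metrics.roc_auc_score` is implemented.

```python
def roc_auc_pairwise(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up example: three of the four positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_pairwise(labels, scores)
print(auc)
```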

In [16]:
best_params_one = {
    'learner_params': {
        'learning_rate': 0.021424861729373922,
        'n_estimators': 2349,
        'extra_params': {
            'objective': 'binary',
            'metric': 'binary_logloss',
            # NOTE: study.best_params above reports 'dart'; this configuration uses 'gbdt'.
            'boosting_type': 'gbdt',
            'num_leaves': 59,
            'bagging_freq': 6,
            'min_child_samples': 335,
            'subsample': 0.7537835105674658,
            'colsample_bytree': 0.8226472292027862,
            'lambda_l1': 0.010994877348284958,
            'lambda_l2': 2.028432500906238,
            'n_jobs': -1,
            'random_state': 42,
            'monotone_constraints': None,
            'verbose': -1
        }
    }
}

In [17]:
challenger_one_logs = model_pipeline(train_df = train_df,
                            validation_df = validation_df,
                            params = best_params_one,
                            target_column = target_column,
                            features = boruta_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False
                          )

2023-09-24T22:49:48 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-09-24T22:49:51 | INFO | Training for fold 1
2023-09-24T22:52:48 | INFO | Training for fold 2
2023-09-24T22:55:55 | INFO | Training for fold 3
2023-09-24T22:59:13 | INFO | CV training finished!
2023-09-24T22:59:13 | INFO | Training the model in the full dataset...
2023-09-24T23:03:21 | INFO | Training process finished!
2023-09-24T23:03:21 | INFO | Calculating metrics...
2023-09-24T23:03:21 | INFO | Full process finished in 13.61 minutes.
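The log above shows three fold trainings followed by a fit on the full dataset. The source of `model_pipeline` is not shown, so the following is only a sketch of what a standard K-fold out-of-fold routine looks like, assuming the pipeline follows that pattern. The mean predictor is a placeholder for the real LightGBM model, and all names are hypothetical.

```python
import random


def kfold_indices(n_samples, n_splits, seed):
    """Shuffle row indices and deal them into n_splits nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]


def fit_mean_model(targets):
    """Placeholder 'model' that always predicts the training mean."""
    mean = sum(targets) / len(targets)
    return lambda: mean


y = [0, 1, 0, 1, 1, 0, 1, 0, 1]  # made-up binary targets
folds = kfold_indices(len(y), n_splits=3, seed=42)

oof_pred = [None] * len(y)
for held_out in folds:
    held_out_set = set(held_out)
    train_targets = [y[i] for i in range(len(y)) if i not in held_out_set]
    predict = fit_mean_model(train_targets)
    for i in held_out:
        oof_pred[i] = predict()  # each row is predicted by a model that never saw it

final_model = fit_mean_model(y)  # refit on all rows, as the last log lines suggest
print(oof_pred, final_model())
```

Scoring `oof_pred` against `y` gives an out-of-fold metric, which is presumably what the `out_of_fold` column in the comparison reports.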


In [50]:
%%time

import optuna.integration.lightgbm as lgb

from lightgbm import early_stopping
from lightgbm import log_evaluation

train, test = train_test_split(train_df, random_state=42, test_size=.2)
dtrain = lgbm.Dataset(train[boruta_features], label=train[target_column])
dval = lgbm.Dataset(test[boruta_features], label=test[target_column])

params = {
    "objective": "binary",
    "metric": "auc",
    "verbosity": -1,
    "boosting_type": "gbdt",
    "n_estimators":2349,
    "learning_rate": 0.021424861729373922
}

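# lgb.train is Optuna's drop-in wrapper around lightgbm.train: it runs a
# stepwise hyperparameter search and returns the best booster.
bst = lgb.train(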
    params,
    dtrain,
    valid_sets=[dtrain, dval],
    callbacks=[early_stopping(450)],
)

preds = bst.predict(test[boruta_features],num_iteration=bst.best_iteration)
score = roc_auc_score(test[target_column], preds)

best_params = bst.params
print("Best params:", best_params)



[Stepwise tuning log condensed. Every trial trained until the validation score failed to improve for 450 rounds. Running best validation AUC (val_score) at the end of each stage:]

feature_fraction           7 trials   val_score: 0.788985
num_leaves                20 trials   val_score: 0.788985
bagging                   10 trials   val_score: 0.789356
feature_fraction_stage2    3 trials   val_score: 0.789356
regularization_factors    20 trials   val_score: 0.791158
min_data_in_leaf           5 trials   val_score: 0.791158

Best params: {'objective': 'binary', 'metric': 'auc', 'verbosity': -1, 'boosting_type': 'gbdt', 'learning_rate': 0.021424861729373922, 'feature_pre_filter': False, 'lambda_l1': 0.00021744689137046032, 'lambda_l2': 6.07402119317552, 'num_leaves': 31, 'feature_fraction': 0.4, 'bagging_fraction': 0.7762748139696756, 'bagging_freq': 7, 'min_child_samples': 20, 'num_iterations': 2349}
CPU times: user 5h 29min 45s, sys: 23min 3s, total: 5h 52min 49s
Wall time: 1h 16min 20s
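The stage names in the tuner output (`feature_fraction`, `num_leaves`, `bagging`, and so on) reflect a stepwise strategy: one group of parameters is tuned at a time while the best values found so far are carried forward. The sketch below shows that idea on a toy score surface. The function `toy_validation_score` and the grids are invented for illustration, and the real tuner uses samplers within each stage instead of the plain grid loop used here.

```python
def toy_validation_score(params):
    """Hypothetical score surface with its peak at feature_fraction=0.4, num_leaves=31."""
    return (
        0.79
        - 0.05 * (params["feature_fraction"] - 0.4) ** 2
        - 0.00001 * (params["num_leaves"] - 31) ** 2
    )


def stepwise_search(start, stages):
    """Tune one parameter per stage while holding the others at their current best."""
    best = dict(start)
    best_score = toy_validation_score(best)
    for name, grid in stages:
        for value in grid:
            candidate = {**best, name: value}
            score = toy_validation_score(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


stages = [
    ("feature_fraction", [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]),
    ("num_leaves", [15, 31, 63, 127, 255]),
]
best, best_score = stepwise_search({"feature_fraction": 1.0, "num_leaves": 255}, stages)
print(best, round(best_score, 6))
```

A stepwise search evaluates 7 + 5 candidates here instead of the 35 a joint grid would need, at the price of possibly missing interactions between parameters.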


In [51]:
# new
best_params

{'objective': 'binary',
 'metric': 'auc',
 'verbosity': -1,
 'boosting_type': 'gbdt',
 'learning_rate': 0.021424861729373922,
 'feature_pre_filter': False,
 'lambda_l1': 0.00021744689137046032,
 'lambda_l2': 6.07402119317552,
 'num_leaves': 31,
 'feature_fraction': 0.4,
 'bagging_fraction': 0.7762748139696756,
 'bagging_freq': 7,
 'min_child_samples': 20,
 'num_iterations': 2349}

In [52]:
# new
bst.best_score

defaultdict(collections.OrderedDict,
            {'valid_0': OrderedDict([('auc', 0.9045610250337482)]),
             'valid_1': OrderedDict([('auc', 0.7911576257095011)])})
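The scores above belong to the best iteration selected by early stopping. As a rough sketch of the patience rule behind a message such as "Training until validation scores don't improve for 450 rounds", the function below tracks the best validation score and stops once a given number of rounds pass without a new best. The curve is made up and the helper is hypothetical; it is not LightGBM's callback code.

```python
def best_iteration_with_patience(validation_scores, patience):
    """Return (best_iteration, best_score), stopping after `patience` rounds without a new best."""
    best_score = float("-inf")
    best_iteration = 0
    for iteration, score in enumerate(validation_scores, start=1):
        if score > best_score:
            best_score, best_iteration = score, iteration
        elif iteration - best_iteration >= patience:
            break  # no improvement for `patience` consecutive rounds
    return best_iteration, best_score


# Made-up validation AUC curve: improves, peaks at round 4, then degrades.
curve = [0.70, 0.74, 0.77, 0.79, 0.788, 0.787, 0.786, 0.785, 0.80]
best_iteration, best_score = best_iteration_with_patience(curve, patience=3)
print(best_iteration, best_score)
```

Training stops at round 7, so the higher score at round 9 is never seen. A larger patience trades extra training time for a lower chance of stopping too soon.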

In [8]:
from optuna.logging import set_verbosity
import optuna
from optuna.pruners import HyperbandPruner
from optuna.samplers import TPESampler

def objective(trial):
    """Tune only n_estimators and learning_rate.

    All other parameters are fixed to the values found by the stepwise tuner.
    Returns ROC-AUC on a 20% hold-out split of train_df.
    """
    train, test = train_test_split(train_df, random_state=42, test_size=.2)
    dtrain = lgbm.Dataset(train[boruta_features], label=train[target_column])

    params = {
        'verbose': -1,
        'objective': "binary",
        'metric': "binary_logloss",
        'boosting_type': 'gbdt',
        'lambda_l1': 0.00021744689137046032,
        'lambda_l2': 6.07402119317552,
        'num_leaves': 31,
        'feature_fraction': 0.4,
        'bagging_fraction': 0.7762748139696756,
        'bagging_freq': 7,
        'min_child_samples': 20,
        "n_estimators": trial.suggest_int("n_estimators", 2000, 10000),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 1e-1, log=True),
        'n_jobs': -1,
        'random_state': 42
    }
    bst = lgbm.train(params, dtrain)
    preds = bst.predict(test[boruta_features])

    score = roc_auc_score(test[target_column], preds)

    return score

set_verbosity(optuna.logging.ERROR)
study = optuna.create_study(direction="maximize",
                            pruner=HyperbandPruner(),
                            sampler=TPESampler(seed=0)
                           )
study.optimize(objective, n_trials=50,show_progress_bar=True)


In [9]:
study.best_params

{'n_estimators': 5926, 'learning_rate': 0.005603627873630697}

In [61]:
# Variant that keeps learning_rate and n_estimators from the first study.
# The next cell redefines best_params_two with the values from the second study.
best_params_two = {
    'learner_params': {
        'learning_rate': 0.021424861729373922,
        'n_estimators': 2349,
        'extra_params': {
            'objective': 'binary',
            'metric': 'binary_logloss',
            'boosting_type': 'gbdt',
            'lambda_l1': 0.00021744689137046032,
            'lambda_l2': 6.07402119317552,
            'num_leaves': 31,
            'feature_fraction': 0.4,
            'bagging_fraction': 0.7762748139696756,
            'bagging_freq': 7,
            'min_child_samples': 20,
            'n_jobs': -1,
            'random_state': 42,
            'monotone_constraints': None,
            'verbose': -1
        }
    }
}

In [10]:
best_params_two = {
    'learner_params': {
        'n_estimators': 5926,
        'learning_rate': 0.005603627873630697,
        'extra_params': {
            'objective': 'binary',
            'metric': 'binary_logloss',
            'boosting_type': 'gbdt',
            'lambda_l1': 0.00021744689137046032,
            'lambda_l2': 6.07402119317552,
            'num_leaves': 31,
            'feature_fraction': 0.4,
            'bagging_fraction': 0.7762748139696756,
            'bagging_freq': 7,
            'min_child_samples': 20,
            'n_jobs': -1,
            'random_state': 42,
            'monotone_constraints': None,
            'verbose': -1
        }
    }
}

In [11]:
challenger_two_logs = model_pipeline(train_df = train_df,
                            validation_df = validation_df,
                            params = best_params_two,
                            target_column = target_column,
                            features = boruta_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False
                          )

2023-09-24T22:19:17 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-09-24T22:19:19 | INFO | Training for fold 1
2023-09-24T22:22:43 | INFO | Training for fold 2
2023-09-24T22:26:19 | INFO | Training for fold 3
2023-09-24T22:29:58 | INFO | CV training finished!
2023-09-24T22:29:58 | INFO | Training the model in the full dataset...
2023-09-24T22:34:42 | INFO | Training process finished!
2023-09-24T22:34:42 | INFO | Calculating metrics...
2023-09-24T22:34:43 | INFO | Full process finished in 15.46 minutes.


In [14]:
boruta_logs = model_pipeline(train_df = train_df,
                            validation_df = validation_df,
                            params = test_params,
                            target_column = target_column,
                            features = boruta_features,
                            cv = 3,
                            random_state = 42,
                            apply_shap = False
                          )

2023-09-24T22:35:34 | INFO | Starting pipeline: Generating 3 k-fold training...
2023-09-24T22:35:35 | INFO | Training for fold 1
2023-09-24T22:38:12 | INFO | Training for fold 2
2023-09-24T22:40:53 | INFO | Training for fold 3
2023-09-24T22:43:46 | INFO | CV training finished!
2023-09-24T22:43:46 | INFO | Training the model in the full dataset...
2023-09-24T22:47:57 | INFO | Training process finished!
2023-09-24T22:47:57 | INFO | Calculating metrics...
2023-09-24T22:47:57 | INFO | Full process finished in 12.44 minutes.


## Performance comparison

In [26]:
model_metrics = {}
models = [boruta_logs, challenger_one_logs, challenger_two_logs]
names = ["boruta vanilla", "boruta + Optuna base", "boruta + Optuna integration"]

# Collect the ROC-AUC entries of each pipeline run, one row per model.
for model, name in zip(models, names):
    model_metrics[name] = model["metrics"]["roc_auc"]
pd.DataFrame(model_metrics).T.sort_values(by="validation", ascending=False)

| model | out_of_fold | validation |
|---|---|---|
| boruta + Optuna integration | 0.79568 | 0.800874 |
| boruta vanilla | 0.79251 | 0.799865 |
| boruta + Optuna base | 0.789507 | 0.795329 |
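To put the comparison in perspective, the short calculation below takes the ROC-AUC values from the table above and expresses each tuned model as a difference from the baseline. The dictionary layout is just a convenience for this example.

```python
# Out-of-fold and validation ROC-AUC copied from the comparison table above.
results = {
    "boruta vanilla": {"out_of_fold": 0.79251, "validation": 0.799865},
    "boruta + Optuna base": {"out_of_fold": 0.789507, "validation": 0.795329},
    "boruta + Optuna integration": {"out_of_fold": 0.79568, "validation": 0.800874},
}

baseline = results["boruta vanilla"]
deltas = {
    name: {split: round(scores[split] - baseline[split], 6) for split in scores}
    for name, scores in results.items()
    if name != "boruta vanilla"
}
for name, delta in deltas.items():
    print(f"{name}: {delta}")
```

The integration model gains roughly 0.003 out of fold and 0.001 on validation, while the first Optuna study loses roughly 0.003 and 0.005 respectively. Differences this small may be within the noise of a single split, so they should be read with caution.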

