# 6.0.0 Hyperparameter Optimization for Classifier Model

### Methodology

The goal is to tune the following model parameters:

- `eta` (learning rate): extended to explore more conservative and slightly more aggressive learning rates.
- `gamma`: now ranges from 0 to 1 to explore the impact of making trees more conservative.
- `max_depth`: upper limit increased to allow deeper trees, which might capture more complex patterns.
- `min_child_weight`: broader range to better control overfitting by requiring nodes to justify splits with more samples.
- `subsample` and `colsample_bytree`: allowed to vary more widely to assess different levels of data and feature subsampling.
- `scale_pos_weight`: adjusted to better balance the classes given the known imbalance.
- `lambda` and `alpha`: wider range for the regularization parameters to further control overfitting.
- `max_delta_step`: broader range to help stabilize updates under high class imbalance.
- `n_estimators`: maximum increased so that more trees can be evaluated, which matters when all other parameters are tuned at the same time.
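As a reference point for the `scale_pos_weight` range, the XGBoost documentation suggests the ratio of negative to positive instances as a typical starting value. The sketch below computes that ratio for a list of 0/1 labels. The function name and the example labels are hypothetical and are not taken from this project's data.

```python
def suggested_scale_pos_weight(labels):
    """Return negatives / positives for a sequence of 0/1 labels."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive examples; the ratio is undefined")
    return negatives / positives


# Hypothetical labels: 8 negatives and 2 positives
example_labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
ratio = suggested_scale_pos_weight(example_labels)
print(ratio)  # 4.0
```

A search range that brackets this ratio lets the optimizer test both weaker and stronger reweighting of the positive class.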


### Conclusion
Before optimization, the model parameters were:

- `max_depth`: 3
- `subsample`: 0.8
- `colsample_bytree`: 0.8

We then used the Optuna framework for hyperparameter optimization. The resulting settings suggest a more complex model capable of capturing subtle patterns:

- `lambda`: 1.2345483862873696
- `alpha`: 7.720320468362867
- `colsample_bytree`: 0.7
- `subsample`: 0.6
- `learning_rate`: 0.02
- `max_depth`: 9
- `min_child_weight`: 9


In [1]:
from pathlib import Path
import numpy as np
import pandas as pd
import yaml
import optuna 
import xgboost as xgb
from sklearn.metrics import roc_auc_score
from sklearn import metrics
from xgboost import XGBClassifier

from src.utils import calculate_metrics



In [2]:
def objective(trial):
    """Train one XGBoost model with sampled hyperparameters and return its validation AUC."""
    param = {
        'verbosity': 0,
        'objective': 'binary:logistic',  # For binary classification
        'booster': 'gbtree',             # Tree-based learning algorithms
        'eval_metric': 'auc',            # Evaluation metric for the validation data
        'eta': trial.suggest_float('eta', 0.005, 0.05),  # Learning rate
        'gamma': trial.suggest_float('gamma', 0, 1),  # Minimum loss reduction required to make a further partition
        'max_depth': trial.suggest_int('max_depth', 3, 10),  # Maximum depth of each tree
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 6),  # Minimum sum of instance weight needed in a child
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),  # Subsample ratio of the training instances
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.3, 1.0),  # Subsample ratio of features per tree
        'scale_pos_weight': trial.suggest_float('scale_pos_weight', 1, 10),  # Balancing of positive and negative weights
        # suggest_loguniform is deprecated in Optuna; suggest_float(..., log=True) is its replacement
        'lambda': trial.suggest_float('lambda', 0.1, 5, log=True),  # L2 regularization
        'alpha': trial.suggest_float('alpha', 0.01, 1, log=True),  # L1 regularization
        'max_delta_step': trial.suggest_int('max_delta_step', 0, 10),  # Can help logistic regression when classes are extremely imbalanced
        'n_estimators': trial.suggest_int('n_estimators', 50, 300)  # Maximum number of trees
    }

    # early_stopping_rounds is set on the estimator; passing it to fit() has been deprecated since XGBoost 1.6
    clf = XGBClassifier(**param, early_stopping_rounds=10)
    clf.fit(X_train, Y_train, eval_set=[(X_valid, Y_valid)], verbose=False)

    # Score the validation set with the probability of the positive class
    preds = clf.predict_proba(X_valid)[:, 1]
    auc = roc_auc_score(Y_valid, preds)
    return auc
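
The objective samples `lambda` and `alpha` on a log scale, so small values receive as much attention as large ones. The sketch below illustrates that idea with the standard library only. It assumes plain random sampling and does not reproduce Optuna's internals, whose default sampler adapts its proposals as trials accumulate. The helper name `sample_log_uniform` and the seed are hypothetical.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    if not (0 < low <= high):
        raise ValueError("bounds must satisfy 0 < low <= high")
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
draws = [sample_log_uniform(0.01, 1.0, rng) for _ in range(10_000)]

# Roughly half of the draws fall below the geometric midpoint sqrt(0.01 * 1.0) = 0.1,
# whereas a plain uniform draw on [0.01, 1.0] would put only about 9% there.
share_below_midpoint = sum(d < 0.1 for d in draws) / len(draws)
print(round(share_below_midpoint, 2))
```

This is why a log scale suits regularization strengths, where the difference between 0.01 and 0.1 often matters as much as the difference between 0.1 and 1.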

## 1. Data Preparation

In [3]:
with open("config.yaml", "r") as f:
    config = yaml.safe_load(f)
    
model_parameters = config["model_parameters"]["xgbm"]
numeric_features = config["filter_features"]["numerical"]
features = numeric_features
target = config["main"]["target"]
data_train_path = Path.cwd().parent / config["main"]["data_train_path"]
train_validation_path = Path.cwd().parent / config["main"]["data_validation_path"]

train_df = pd.read_pickle(data_train_path)
validation_df = pd.read_pickle(train_validation_path)

X_train, Y_train = train_df[features], train_df[target]
X_valid, Y_valid = validation_df[features], validation_df[target]

split_seed = config["main"]["random_seed"]

X_train.shape

(9479, 14)

## 2. Hyperparameter Search

In [4]:
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print('Best trial:', study.best_trial.params)
trial = study.best_trial


[I 2024-05-01 01:33:31,612] A new study created in memory with name: no-name-c23779e0-1187-4b12-83ef-73b6c48be06a
[I 2024-05-01 01:33:31,862] Trial 0 finished with value: 0.6003530475060723 and parameters: {'eta': 0.0434174554635591, 'gamma': 0.018745235155249507, 'max_depth': 3, 'min_child_weight': 2, 'subsample': 0.7773119867092021, 'colsample_bytree': 0.3847004740408239, 'scale_pos_weight': 1.1860207157103346, 'lambda': 0.18292385478742423, 'alpha': 0.05503596805665763, 'max_delta_step': 5, 'n_estimators': 96}. Best is trial 0 with value: 0.6003530475060723.
[I 2024-05-01 01:33:31,905] Trial 1 finished with value: 0.603129413093826 and parameters: {'eta': 0.03633449325650671, 'gamma

Best trial: {'eta': 0.04362327410869968, 'gamma': 0.34533271146139294, 'max_depth': 4, 'min_child_weight': 2, 'subsample': 0.8375200371599197, 'colsample_bytree': 0.9327683046084467, 'scale_pos_weight': 5.1731540381466905, 'lambda': 0.5912105480011364, 'alpha': 0.05937093656433391, 'max_delta_step': 10, 'n_estimators': 68}


## 3. Results

In [5]:
study.best_params

{'eta': 0.04362327410869968,
 'gamma': 0.34533271146139294,
 'max_depth': 4,
 'min_child_weight': 2,
 'subsample': 0.8375200371599197,
 'colsample_bytree': 0.9327683046084467,
 'scale_pos_weight': 5.1731540381466905,
 'lambda': 0.5912105480011364,
 'alpha': 0.05937093656433391,
 'max_delta_step': 10,
 'n_estimators': 68}

In [6]:
model_parameters

{'objective"': 'binary:logistic',
 'booster"': 'gbtree',
 'eval_metric"': 'auc',
 'eta': 0.01,
 'gamma': 0.1,
 'max_depth': 6,
 'min_child_weight': 3,
 'subsample': 0.8,
 'colsample_bytree': 0.8,
 'scale_pos_weight': 4,
 'lambda': 1,
 'alpha': 0.1,
 'max_delta_step': 1,
 'n_estimators': 100}
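
The keys `objective"`, `booster"`, and `eval_metric"` in the dictionary above end with a stray double quote, so XGBoost does not recognise them and reports them as unused when the baseline model is fitted. Correcting the keys in `config.yaml` is the cleaner remedy. As a stopgap, the keys could be cleaned after loading, as in the sketch below. The helper `strip_stray_quotes` is hypothetical and not part of the project code.

```python
def strip_stray_quotes(params):
    """Return a copy of params with quote characters removed from both ends of each key."""
    cleaned = {}
    for key, value in params.items():
        new_key = key.strip("\"'")
        if new_key in cleaned:
            raise ValueError(f"duplicate key after cleaning: {new_key!r}")
        cleaned[new_key] = value
    return cleaned


# A subset of the dictionary printed above
raw = {'objective"': 'binary:logistic', 'booster"': 'gbtree', 'eval_metric"': 'auc', 'eta': 0.01}
fixed = strip_stray_quotes(raw)
print(sorted(fixed))  # ['booster', 'eta', 'eval_metric', 'objective']
```

The values are left untouched. Only the key names change.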

In [7]:
xgbm_model = XGBClassifier(missing=np.nan, **model_parameters, random_state=split_seed)

xgbm_model.fit(X_train, Y_train)
xgbm_preds = xgbm_model.predict_proba(X_valid)[:, 1]

model_results = calculate_metrics(Y_valid, xgbm_preds)
model_results

Parameters: { "booster"", "eval_metric"", "objective"" } are not used.



{'roc_auc_score': 0.6110941648308197,
 'pr_auc': 0.28320533857646,
 'ks': 0.17798678190137265}
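
The implementation of `calculate_metrics` is not shown here. Assuming `ks` denotes the two-sample Kolmogorov-Smirnov statistic between the score distributions of the positive and negative classes, it could be computed along the lines of the sketch below. The function name and the example data are hypothetical.

```python
def ks_statistic(y_true, y_score):
    """Maximum gap between the empirical score CDFs of positives and negatives."""
    pos = sorted(s for y, s in zip(y_true, y_score) if y == 1)
    neg = sorted(s for y, s in zip(y_true, y_score) if y == 0)
    if not pos or not neg:
        raise ValueError("both classes must be present")
    best = 0.0
    i = j = 0
    # Walk through every distinct score and compare the two CDFs at that point
    for threshold in sorted(set(pos + neg)):
        while i < len(pos) and pos[i] <= threshold:
            i += 1
        while j < len(neg) and neg[j] <= threshold:
            j += 1
        best = max(best, abs(i / len(pos) - j / len(neg)))
    return best


# Hypothetical labels and scores
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.8, 0.9]
print(ks_statistic(labels, scores))  # approximately 0.667
```

Under this definition, a value of 0 means the two score distributions coincide and a value of 1 means they are perfectly separated.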

In [8]:
model = XGBClassifier(missing=np.nan, **study.best_params, random_state=split_seed)
model.fit(X_train, Y_train)
preds = model.predict_proba(X_valid)[:, 1]
roc_auc = metrics.roc_auc_score(y_true=Y_valid, y_score=preds)

print(roc_auc)

0.5935491159690448
