# Collinearity and Leakage Test

__Split Distribution__  
Step 1 - Perform distribution tests between the train and test splits: Kolmogorov-Smirnov for continuous variables, Chi-square for categorical variables  

__Collinearity (Moved to later part)__  
Step 2 - Run VIF -> drop the variable with the highest VIF -> repeat until every VIF < 5  
Note:
* Unlike dropping by p-value, where the choice of which variable to drop is arbitrary, VIF checks each variable against all remaining variables, which gives a clear value to rank on
* Modern ML models can handle multicollinearity, but GLMs struggle with it
* VIF will be run after variable selection, and only for the GLM, not for the tree models
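
The drop-and-repeat loop in step 2 can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's own cell: `vif_table` and `drop_high_vif` are hypothetical helper names, and VIF is computed directly as 1 / (1 - R²) with NumPy rather than through the statsmodels `variance_inflation_factor` function that the notebook imports, purely to keep the sketch dependency-light.

```python
import numpy as np
import pandas as pd


def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF per column: 1 / (1 - R^2) from regressing it on all other columns."""
    vifs = {}
    for col in X.columns:
        y = X[col].to_numpy(dtype=float)
        others = X.drop(columns=col).to_numpy(dtype=float)
        A = np.column_stack([np.ones(len(X)), others])  # intercept + other columns
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
        vifs[col] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return pd.Series(vifs).sort_values(ascending=False)


def drop_high_vif(X: pd.DataFrame, threshold: float = 5.0):
    """Repeatedly drop the highest-VIF column until every VIF is below threshold."""
    X = X.copy()
    dropped = []
    while X.shape[1] > 1:
        vifs = vif_table(X)
        if vifs.iloc[0] < threshold:
            break
        dropped.append(vifs.index[0])
        X = X.drop(columns=vifs.index[0])
    return X, dropped


# Synthetic demo (assumption): a_plus_b is almost an exact sum of a and b.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
demo = pd.DataFrame({
    "a": a,
    "b": b,
    "a_plus_b": a + b + rng.normal(scale=0.05, size=500),
})

reduced, dropped = drop_high_vif(demo)
print("dropped:", dropped)
print(vif_table(reduced))
```

Recomputing the table after every drop matters: removing one variable changes the VIF of all the others, so dropping several at once from a single table can discard more than necessary.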

__Leakage Test__  
Step 3 - Check the p-value of each predictor against the target (for regression data). Check whether any predictor is suspiciously significant, i.e. has an implausibly strong association with the target.  
Step 4 - Run a random forest on the data (default parameters are fine). Review the top 10 feature importances manually for potential leakage.  
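
One possible way to run the step 3 screen is sketched below on synthetic data. The helper name `association_screen` and the 0.9 correlation cutoff are assumptions made for illustration, not values taken from this notebook. It uses `scipy.stats.pearsonr`, which the notebook imports, and ranks predictors by absolute correlation, because with thousands of rows many harmless predictors also have tiny p-values.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr


def association_screen(X: pd.DataFrame, y: pd.Series, r_threshold: float = 0.9) -> pd.DataFrame:
    """Pearson r and p-value of each numeric predictor against the target."""
    rows = []
    for col in X.columns:
        mask = X[col].notna() & y.notna()  # pearsonr does not accept NaN
        r, p = pearsonr(X.loc[mask, col], y[mask])
        rows.append({"column": col, "r": r, "p_value": p})
    out = pd.DataFrame(rows).set_index("column")
    out["suspect"] = out["r"].abs() >= r_threshold  # hypothetical cutoff
    return out.sort_values("r", key=np.abs, ascending=False)


# Synthetic demo (assumption): "leaky" is the target plus a little noise.
rng = np.random.default_rng(0)
y = pd.Series(rng.integers(0, 2, size=1000), name="target").astype(float)
X = pd.DataFrame({
    "legit": 0.5 * y + rng.normal(size=1000),
    "noise": rng.normal(size=1000),
    "leaky": y + rng.normal(scale=0.05, size=1000),
})

screen = association_screen(X, y)
print(screen)
```

A flagged column is only a candidate: whether it is real leakage still needs the manual review described in step 4.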

In [4]:
import pandas as pd
import numpy as np
import json
import os

from sklearn.ensemble import RandomForestRegressor

# stats
from scipy.stats import ks_2samp, chi2_contingency, pearsonr
from statsmodels.stats.outliers_influence import variance_inflation_factor

__Split Distribution__  
Step 1 - Perform distribution tests between the train and test splits: Kolmogorov-Smirnov for continuous variables, Chi-square for categorical variables  

> KS test: 2 variables (1.60%) have p < 0.05
> Chi-square test: 4 variables (3.05%) have p < 0.05
>
> Conclusion: The train-test split preserved the distribution well
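
The cell that produced these counts is not shown, so the following is only a sketch of how such a comparison could be run, using the `ks_2samp` and `chi2_contingency` functions imported above. The helper name `split_distribution_report` and the synthetic frames are assumptions. Missing categorical values are treated as their own level here, which is a design choice rather than something the notebook states.

```python
import pandas as pd
import numpy as np
from scipy.stats import ks_2samp, chi2_contingency


def split_distribution_report(train, test, cat_cols, alpha=0.05):
    """Per-column train-vs-test test: chi-square for categorical, KS otherwise."""
    rows = []
    for col in train.columns:
        if col in cat_cols:
            # contingency table: split (train/test) x category level
            values = pd.concat([train[col], test[col]], ignore_index=True).astype(str)
            split = pd.Series(["train"] * len(train) + ["test"] * len(test), name="split")
            _, p, _, _ = chi2_contingency(pd.crosstab(split, values))
            name = "chi2"
        else:
            p = ks_2samp(train[col].dropna(), test[col].dropna()).pvalue
            name = "ks"
        rows.append({"column": col, "test": name, "p_value": p})
    report = pd.DataFrame(rows)
    report["flag"] = report["p_value"] < alpha
    return report


# Synthetic demo (assumption): only x_shifted differs between the splits.
rng = np.random.default_rng(0)
train = pd.DataFrame({
    "x_num": rng.normal(size=1000),
    "x_shifted": rng.normal(size=1000),
    "x_cat": rng.choice(["a", "b", "c"], size=1000),
})
test = pd.DataFrame({
    "x_num": rng.normal(size=400),
    "x_shifted": rng.normal(loc=1.0, size=400),
    "x_cat": rng.choice(["a", "b", "c"], size=400),
})

report = split_distribution_report(train, test, cat_cols=["x_cat"])
print(report)
print(f"{report['flag'].mean():.2%} of columns have p < 0.05")
```

With many columns, roughly 5% are expected to fall below 0.05 by chance alone, which is why flag rates of a few percent are read as a well-preserved split.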

In [6]:
with open("PROCESSED/DATA/merged_and_dropped.cat_cols.json") as f:
    cat_cols = json.load(f)

X_train = pd.read_parquet("INPUTS/TRAIN/X_train.parquet")
X_test = pd.read_parquet("INPUTS/TEST/X_test.parquet")
y_train = pd.read_parquet("INPUTS/TRAIN/y_train.parquet")
y_test = pd.read_parquet("INPUTS/TEST/y_test.parquet")

X_train[cat_cols] = X_train[cat_cols].astype("category")
X_test[cat_cols] = X_test[cat_cols].astype("category")

num_cols = [c for c in X_train.columns if c not in cat_cols]

In [8]:
# one-hot encode categorical variables
X_encoded = pd.get_dummies(X_train, drop_first=True)

rf = RandomForestRegressor(random_state=42, n_jobs=-1)
rf.fit(X_encoded, y_train.iloc[:, 0])
importances = pd.Series(rf.feature_importances_, index=X_encoded.columns)
top10 = importances.sort_values(ascending=False).head(10)
print(top10)
# top10.to_csv("LOG/rf_leakage_test.csv")

P_DEMO__RIDAGEYR_Age_in_years_at_screening                  0.107458
P_LUX__LUXCAPM_Median_CAP_decibels_per_meter_dB_m           0.055981
P_ALB_CR__URDACT_Albumin_creatinine_ratio_mg_g              0.049917
P_MCQ__MCQ366D                                              0.021577
P_MCQ__MCQ300C                                              0.019748
P_TCHOL__LBDTCSI_Total_Cholesterol_mmol_L                   0.019056
P_ALB_CR__URXUMS_Albumin_urine_mg_L                         0.017563
P_BPQ__BPQ020_Ever_told_you_had_high_blood_pressure_2.0     0.017517
P_BIOPRO__LBDSCHSI_Cholesterol_refrigerated_serum_mmol_L    0.016390
P_BIOPRO__LBXSOSSI_Osmolality_mmol_Kg                       0.015303
dtype: float64


#### Initial test

In [10]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, classification_report, make_scorer, roc_auc_score
import numpy as np
import pandas as pd

# encode
X_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)

# ensure target is categorical
y_train_cat = y_train.iloc[:, 0].astype("category")
y_test_cat  = y_test.iloc[:, 0].astype("category")

# parameter grid
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}

# model
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

scorers = {
    'auc': 'roc_auc_ovr',
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score, average='macro')
}

grid = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    scoring=scorers,
    refit='f1',
    cv=3,
    n_jobs=-1,
    verbose=1
)

# fit
grid.fit(X_encoded, y_train_cat)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)
# NOTE: refit='f1', so best_index_ is the best-F1 candidate, not the best-AUC one
print("CV AUC at best-AUC params:", grid.cv_results_['mean_test_auc'][grid.best_index_])
print("CV F1 at best-AUC params:", grid.cv_results_['mean_test_f1'][grid.best_index_])
print("CV Accuracy at best-AUC params:", grid.cv_results_['mean_test_accuracy'][grid.best_index_])

# evaluate on train/test
best_rf = grid.best_estimator_

y_pred_train = best_rf.predict(X_encoded)
y_pred_test  = best_rf.predict(X_test_encoded)

acc_train = accuracy_score(y_train_cat, y_pred_train)
acc_test  = accuracy_score(y_test_cat, y_pred_test)

f1_train = f1_score(y_train_cat, y_pred_train, average='macro')
f1_test  = f1_score(y_test_cat, y_pred_test, average='macro')

auc_train = roc_auc_score(y_train_cat, best_rf.predict_proba(X_encoded)[:, 1])
auc_test  = roc_auc_score(y_test_cat, best_rf.predict_proba(X_test_encoded)[:, 1])

print(f"Train accuracy: {acc_train:.3f},  F1: {f1_train:.3f}, AUC: {auc_train:.3f}")
print(f"Test  accuracy: {acc_test:.3f},  F1: {f1_test:.3f}, AUC: {auc_test:.3f}")

print("\nClassification report:\n")
print(classification_report(y_test_cat, y_pred_test))

Fitting 3 folds for each of 8 candidates, totalling 24 fits
Best parameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}
Best CV score: 0.6015451407695722
CV AUC at best-AUC params: 0.8752691103239566
CV F1 at best-AUC params: 0.6015451407695722
CV Accuracy at best-AUC params: 0.8501732669693746
Train accuracy: 0.934,  F1: 0.862, AUC: 0.993
Test  accuracy: 0.870,  F1: 0.643, AUC: 0.898

Classification report:

              precision    recall  f1-score   support

         0.0       0.87      0.99      0.93      1642
         1.0       0.79      0.23      0.36       306

    accuracy                           0.87      1948
   macro avg       0.83      0.61      0.64      1948
weighted avg       0.86      0.87      0.84      1948



In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, classification_report, roc_auc_score
import pandas as pd

# encode
X_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)

# target as categorical
y_train_cat = y_train.iloc[:, 0].astype("category")
y_test_cat  = y_test.iloc[:, 0].astype("category")

# unregularized logistic regression (GLM)
model = LogisticRegression(
    penalty=None,        # no regularization (plain GLM)
    solver="lbfgs",
    max_iter=500,
    n_jobs=-1
)

# fit
model.fit(X_encoded, y_train_cat)

# predictions
y_pred_train = model.predict(X_encoded)
y_pred_test  = model.predict(X_test_encoded)

# metrics
acc_train = accuracy_score(y_train_cat, y_pred_train)
acc_test  = accuracy_score(y_test_cat, y_pred_test)

f1_train = f1_score(y_train_cat, y_pred_train, average="macro")
f1_test  = f1_score(y_test_cat, y_pred_test, average="macro")

print(f"Train accuracy: {acc_train:.3f},  F1: {f1_train:.3f}")
print(f"Test  accuracy: {acc_test:.3f},  F1: {f1_test:.3f}")
print(f"Train AUC: {roc_auc_score(y_train_cat, model.predict_proba(X_encoded)[:, 1]):.3f}")
print(f"Test  AUC: {roc_auc_score(y_test_cat, model.predict_proba(X_test_encoded)[:, 1]):.3f}")

print("\nClassification report:\n")
print(classification_report(y_test_cat, y_pred_test))


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Train accuracy: 0.857,  F1: 0.688
Test  accuracy: 0.867,  F1: 0.693
Train AUC: 0.866
Test  AUC: 0.881

Classification report:

              precision    recall  f1-score   support

         0.0       0.89      0.96      0.92      1642
         1.0       0.63      0.36      0.46       306

    accuracy                           0.87      1948
   macro avg       0.76      0.66      0.69      1948
weighted avg       0.85      0.87      0.85      1948



In [14]:
from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, classification_report, make_scorer
import pandas as pd

# encode
X_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_test_encoded = X_test_encoded.reindex(columns=X_encoded.columns, fill_value=0)

# target
y_train_cat = y_train.iloc[:, 0].astype(int)
y_test_cat  = y_test.iloc[:, 0].astype(int)

# parameter grid (XGB version)
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# model
xgb = XGBClassifier(
    objective='binary:logistic',
    eval_metric='logloss',
    tree_method='hist',
    random_state=42,
    n_jobs=-1
)

# scorers
scorers = {
    'auc': 'roc_auc',
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score, average='macro')
}

# grid search
grid = GridSearchCV(
    estimator=xgb,
    param_grid=param_grid,
    scoring=scorers,
    refit='f1',
    cv=3,
    verbose=1,
    n_jobs=-1
)

# fit
grid.fit(X_encoded, y_train_cat)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)
# NOTE: refit='f1', so best_index_ is the best-F1 candidate, not the best-AUC one
print("CV AUC at best-AUC params:", grid.cv_results_['mean_test_auc'][grid.best_index_])
print("CV F1 at best-AUC params:", grid.cv_results_['mean_test_f1'][grid.best_index_])
print("CV Accuracy at best-AUC params:", grid.cv_results_['mean_test_accuracy'][grid.best_index_])

# evaluate
best_xgb = grid.best_estimator_

y_pred_train = best_xgb.predict(X_encoded)
y_pred_test  = best_xgb.predict(X_test_encoded)

acc_train = accuracy_score(y_train_cat, y_pred_train)
acc_test  = accuracy_score(y_test_cat, y_pred_test)

f1_train = f1_score(y_train_cat, y_pred_train, average='macro')
f1_test  = f1_score(y_test_cat, y_pred_test, average='macro')

auc_train = roc_auc_score(y_train_cat, best_xgb.predict_proba(X_encoded)[:, 1])
auc_test  = roc_auc_score(y_test_cat, best_xgb.predict_proba(X_test_encoded)[:, 1])

print(f"Train accuracy: {acc_train:.3f},  F1: {f1_train:.3f}, AUC: {auc_train:.3f}")
print(f"Test  accuracy: {acc_test:.3f},  F1: {f1_test:.3f}, AUC: {auc_test:.3f}")

print("\nClassification report:\n")
print(classification_report(y_test_cat, y_pred_test))


Fitting 3 folds for each of 48 candidates, totalling 144 fits
Best parameters: {'colsample_bytree': 0.8, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}
Best CV score: 0.7787033779911746
CV AUC at best-AUC params: 0.9155415564742245
CV F1 at best-AUC params: 0.7787033779911746
CV Accuracy at best-AUC params: 0.8885601773924675
Train accuracy: 0.942,  F1: 0.888, AUC: 0.981
Test  accuracy: 0.914,  F1: 0.817, AUC: 0.938

Classification report:

              precision    recall  f1-score   support

           0       0.93      0.97      0.95      1642
           1       0.81      0.59      0.68       306

    accuracy                           0.91      1948
   macro avg       0.87      0.78      0.82      1948
weighted avg       0.91      0.91      0.91      1948



In [18]:
from sklearn.ensemble import ExtraTreesClassifier

# --- ExtraTreesClassifier ---

# binary/int version of the target (0/1)
y_train_bin = y_train.iloc[:, 0].astype(int)
y_test_bin  = y_test.iloc[:, 0].astype(int)

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}

et = ExtraTreesClassifier(random_state=42, n_jobs=-1)

scorers = {
    'auc': 'roc_auc',
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score, average='macro')
}

grid = GridSearchCV(
    estimator=et,
    param_grid=param_grid,
    scoring=scorers,
    refit='f1',
    cv=3,
    n_jobs=-1,
    verbose=1
)

grid.fit(X_encoded, y_train_bin)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)
# NOTE: refit='f1', so best_index_ is the best-F1 candidate, not the best-AUC one
print("CV AUC at best-AUC params:", grid.cv_results_['mean_test_auc'][grid.best_index_])
print("CV F1 at best-AUC params:", grid.cv_results_['mean_test_f1'][grid.best_index_])
print("CV Accuracy at best-AUC params:", grid.cv_results_['mean_test_accuracy'][grid.best_index_])

best_et = grid.best_estimator_

y_pred_train = best_et.predict(X_encoded)
y_pred_test  = best_et.predict(X_test_encoded)

acc_train = accuracy_score(y_train_bin, y_pred_train)
acc_test  = accuracy_score(y_test_bin, y_pred_test)

f1_train = f1_score(y_train_bin, y_pred_train, average='macro')
f1_test  = f1_score(y_test_bin, y_pred_test, average='macro')

auc_train = roc_auc_score(y_train_bin, best_et.predict_proba(X_encoded)[:, 1])
auc_test  = roc_auc_score(y_test_bin, best_et.predict_proba(X_test_encoded)[:, 1])

print(f"Train accuracy: {acc_train:.3f},  F1: {f1_train:.3f}, AUC: {auc_train:.3f}")
print(f"Test  accuracy: {acc_test:.3f},  F1: {f1_test:.3f}, AUC: {auc_test:.3f}")

print("\nClassification report:\n")
print(classification_report(y_test_bin, y_pred_test))

Fitting 3 folds for each of 8 candidates, totalling 24 fits
Best parameters: {'max_depth': 10, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 100}
Best CV score: 0.5067559409414645
CV AUC at best-AUC params: 0.8535214610423522
CV F1 at best-AUC params: 0.5067559409414645
CV Accuracy at best-AUC params: 0.8348953268547586
Train accuracy: 0.898,  F1: 0.760, AUC: 0.983
Test  accuracy: 0.849,  F1: 0.508, AUC: 0.860

Classification report:

              precision    recall  f1-score   support

           0       0.85      1.00      0.92      1642
           1       0.80      0.05      0.10       306

    accuracy                           0.85      1948
   macro avg       0.82      0.52      0.51      1948
weighted avg       0.84      0.85      0.79      1948



In [22]:
!pip install catboost lightgbm

Collecting catboost
  Downloading catboost-1.2.8-cp312-cp312-macosx_11_0_universal2.whl.metadata (1.4 kB)
Collecting lightgbm
  Downloading lightgbm-4.6.0-py3-none-macosx_12_0_arm64.whl.metadata (17 kB)
Collecting graphviz (from catboost)
  Downloading graphviz-0.21-py3-none-any.whl.metadata (12 kB)
Downloading catboost-1.2.8-cp312-cp312-macosx_11_0_universal2.whl (27.8 MB)
Downloading lightgbm-4.6.0-py3-none-macosx_12_0_arm64.whl (1.6 MB)
Downloading graphviz-0.21-py3-none-any.whl (47 kB)
Installing collected packages: graphviz, lightgbm, catboost
Successfully installed catboost-1.2.8 graphviz

In [24]:
from catboost import CatBoostClassifier

# --- CatBoost ---

param_grid = {
    'n_estimators': [200, 400],
    'depth': [4, 6, 8],
    'learning_rate': [0.05, 0.1],
}

cat = CatBoostClassifier(
    loss_function='Logloss',
    eval_metric='AUC',
    random_state=42,
    thread_count=-1,
    verbose=0  # silence per-iteration output
)

scorers = {
    'auc': 'roc_auc',
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score, average='macro')
}

grid = GridSearchCV(
    estimator=cat,
    param_grid=param_grid,
    scoring=scorers,
    refit='f1',
    cv=3,
    n_jobs=-1,
    verbose=1
)

grid.fit(X_encoded, y_train_bin)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)
# NOTE: refit='f1', so best_index_ is the best-F1 candidate, not the best-AUC one
print("CV AUC at best-AUC params:", grid.cv_results_['mean_test_auc'][grid.best_index_])
print("CV F1 at best-AUC params:", grid.cv_results_['mean_test_f1'][grid.best_index_])
print("CV Accuracy at best-AUC params:", grid.cv_results_['mean_test_accuracy'][grid.best_index_])

best_cat = grid.best_estimator_

y_pred_train = best_cat.predict(X_encoded)
y_pred_test  = best_cat.predict(X_test_encoded)

acc_train = accuracy_score(y_train_bin, y_pred_train)
acc_test  = accuracy_score(y_test_bin, y_pred_test)

f1_train = f1_score(y_train_bin, y_pred_train, average='macro')
f1_test  = f1_score(y_test_bin, y_pred_test, average='macro')

auc_train = roc_auc_score(y_train_bin, best_cat.predict_proba(X_encoded)[:, 1])
auc_test  = roc_auc_score(y_test_bin, best_cat.predict_proba(X_test_encoded)[:, 1])

print(f"Train accuracy: {acc_train:.3f},  F1: {f1_train:.3f}, AUC: {auc_train:.3f}")
print(f"Test  accuracy: {acc_test:.3f},  F1: {f1_test:.3f}, AUC: {auc_test:.3f}")

print("\nClassification report:\n")
print(classification_report(y_test_bin, y_pred_test))

Fitting 3 folds for each of 12 candidates, totalling 36 fits
Best parameters: {'depth': 4, 'learning_rate': 0.1, 'n_estimators': 400}
Best CV score: 0.7742526759671201
CV AUC at best-AUC params: 0.9144010650081252
CV F1 at best-AUC params: 0.7742526759671201
CV Accuracy at best-AUC params: 0.8867629355431448
Train accuracy: 0.965,  F1: 0.934, AUC: 0.991
Test  accuracy: 0.912,  F1: 0.816, AUC: 0.936

Classification report:

              precision    recall  f1-score   support

           0       0.93      0.97      0.95      1642
           1       0.78      0.60      0.68       306

    accuracy                           0.91      1948
   macro avg       0.86      0.79      0.82      1948
weighted avg       0.91      0.91      0.91      1948



In [26]:
from sklearn.ensemble import HistGradientBoostingClassifier

# --- HistGradientBoostingClassifier ---

param_grid = {
    'max_depth': [3, 5, 7],
    'learning_rate': [0.05, 0.1],
    'max_iter': [100, 200],
    'min_samples_leaf': [20, 50],
}

hgb = HistGradientBoostingClassifier(
    random_state=42
)

scorers = {
    'auc': 'roc_auc',
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score, average='macro')
}

grid = GridSearchCV(
    estimator=hgb,
    param_grid=param_grid,
    scoring=scorers,
    refit='f1',
    cv=3,
    n_jobs=-1,
    verbose=1
)

grid.fit(X_encoded, y_train_bin)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)
# NOTE: refit='f1', so best_index_ is the best-F1 candidate, not the best-AUC one
print("CV AUC at best-AUC params:", grid.cv_results_['mean_test_auc'][grid.best_index_])
print("CV F1 at best-AUC params:", grid.cv_results_['mean_test_f1'][grid.best_index_])
print("CV Accuracy at best-AUC params:", grid.cv_results_['mean_test_accuracy'][grid.best_index_])

best_hgb = grid.best_estimator_

y_pred_train = best_hgb.predict(X_encoded)
y_pred_test  = best_hgb.predict(X_test_encoded)

acc_train = accuracy_score(y_train_bin, y_pred_train)
acc_test  = accuracy_score(y_test_bin, y_pred_test)

f1_train = f1_score(y_train_bin, y_pred_train, average='macro')
f1_test  = f1_score(y_test_bin, y_pred_test, average='macro')

auc_train = roc_auc_score(y_train_bin, best_hgb.predict_proba(X_encoded)[:, 1])
auc_test  = roc_auc_score(y_test_bin, best_hgb.predict_proba(X_test_encoded)[:, 1])

print(f"Train accuracy: {acc_train:.3f},  F1: {f1_train:.3f}, AUC: {auc_train:.3f}")
print(f"Test  accuracy: {acc_test:.3f},  F1: {f1_test:.3f}, AUC: {auc_test:.3f}")

print("\nClassification report:\n")
print(classification_report(y_test_bin, y_pred_test))

Fitting 3 folds for each of 24 candidates, totalling 72 fits
Best parameters: {'learning_rate': 0.1, 'max_depth': 3, 'max_iter': 200, 'min_samples_leaf': 50}
Best CV score: 0.7722170427499749
CV AUC at best-AUC params: 0.915831299609526
CV F1 at best-AUC params: 0.7722170427499749
CV Accuracy at best-AUC params: 0.8859933106015614
Train accuracy: 0.942,  F1: 0.888, AUC: 0.980
Test  accuracy: 0.911,  F1: 0.813, AUC: 0.933

Classification report:

              precision    recall  f1-score   support

           0       0.93      0.97      0.95      1642
           1       0.79      0.59      0.68       306

    accuracy                           0.91      1948
   macro avg       0.86      0.78      0.81      1948
weighted avg       0.91      0.91      0.91      1948



In [30]:
from lightgbm import LGBMClassifier

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module="lightgbm")

# --- LightGBM ---

param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [-1, 5, 10],
    'learning_rate': [0.05, 0.1],
    'num_leaves': [31, 63],
}

lgbm = LGBMClassifier(
    objective='binary',
    random_state=42,
    n_jobs=-1,
    verbose=-1  # suppress LightGBM info messages (alias: verbosity)
)

scorers = {
    'auc': 'roc_auc',
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score, average='macro')
}

grid = GridSearchCV(
    estimator=lgbm,
    param_grid=param_grid,
    scoring=scorers,
    refit='f1',
    cv=3,
    n_jobs=-1,
    verbose=1
)

grid.fit(X_encoded, y_train_bin)

print("Best parameters:", grid.best_params_)
print("Best CV score:", grid.best_score_)
# NOTE: refit='f1', so best_index_ is the best-F1 candidate, not the best-AUC one
print("CV AUC at best-AUC params:", grid.cv_results_['mean_test_auc'][grid.best_index_])
print("CV F1 at best-AUC params:", grid.cv_results_['mean_test_f1'][grid.best_index_])
print("CV Accuracy at best-AUC params:", grid.cv_results_['mean_test_accuracy'][grid.best_index_])

best_lgbm = grid.best_estimator_

y_pred_train = best_lgbm.predict(X_encoded)
y_pred_test  = best_lgbm.predict(X_test_encoded)

acc_train = accuracy_score(y_train_bin, y_pred_train)
acc_test  = accuracy_score(y_test_bin, y_pred_test)

f1_train = f1_score(y_train_bin, y_pred_train, average='macro')
f1_test  = f1_score(y_test_bin, y_pred_test, average='macro')

auc_train = roc_auc_score(y_train_bin, best_lgbm.predict_proba(X_encoded)[:, 1])
auc_test  = roc_auc_score(y_test_bin, best_lgbm.predict_proba(X_test_encoded)[:, 1])

print(f"Train accuracy: {acc_train:.3f},  F1: {f1_train:.3f}, AUC: {auc_train:.3f}")
print(f"Test  accuracy: {acc_test:.3f},  F1: {f1_test:.3f}, AUC: {auc_test:.3f}")

print("\nClassification report:\n")
print(classification_report(y_test_bin, y_pred_test))

Fitting 3 folds for each of 24 candidates, totalling 72 fits




Best parameters: {'learning_rate': 0.1, 'max_depth': 10, 'n_estimators': 200, 'num_leaves': 31}
Best CV score: 0.7704216027713929
CV AUC at best-AUC params: 0.9157653959098013
CV F1 at best-AUC params: 0.7704216027713929
CV Accuracy at best-AUC params: 0.887918460694741
Train accuracy: 1.000,  F1: 1.000, AUC: 1.000
Test  accuracy: 0.911,  F1: 0.811, AUC: 0.937

Classification report:

              precision    recall  f1-score   support

           0       0.93      0.97      0.95      1642
           1       0.79      0.59      0.67       306

    accuracy                           0.91      1948
   macro avg       0.86      0.78      0.81      1948
weighted avg       0.91      0.91      0.91      1948

