# 🎯 Emotion-Domain Classification with PyCaret AutoML (`domain_EI`, 4 classes)

This notebook builds ML classifiers to predict the **emotion-intelligence domain** (`domain_EI`) with four labels  
(HAHV · HALV · LALV · LAHV  →  encoded as 0-3) using **PyCaret 3**.

---

## 🧩 Objectives

| Item | Details |
|------|---------|
| **Feature sets** | 4 configurations (score_EI / group_EI included or removed) |
| **Scaling options** | `none`, `standard`, `minmax` |
| **Model family** | CatBoost · XGBoost · Random Forest · Extra Trees · Logistic Regression · Naive Bayes · SVM · K-NN · QDA |
| **Total runs** | **12** (4 feature configs × 3 scalers) |
| **Validation** | Stratified **3-fold CV** inside PyCaret + **20 % hold-out** |
| **Artifacts** | PNG plots (AUC, confusion matrix, feature importance) + CSV summaries |

---

## 🛠️ Pipeline Overview

| Step | Description |
|------|-------------|
| **1 — Split** | Apply feature-exclusion rules, then stratified 80 / 20 train-test split |
| **2 — Scaling** | Transform numeric features with the chosen scaler (`none` / `StandardScaler` / `MinMaxScaler`) |
| **3 — AutoML** | `compare_models()` evaluates the candidate models via CV and returns the best estimator |
| **4 — Hold-out Evaluation** | Predict on the reserved test set; save confusion-matrix & AUC plots |
| **5 — Result Logging** | Collect CV AUC / Accuracy and hold-out Accuracy for each run into a master CSV |

---

## 📂 Output Artifacts

| File / Folder | Contents |
|---------------|----------|
| **`overall_results.csv`** | One-row summary per run (feature cfg · scaler · chosen model · metrics) |
| **`leaderboard_cfg*_*.csv`** | Full PyCaret leaderboard (all models, CV statistics) |
| **`CM_/`, `AUC_/`, `FI_/`** | Confusion-matrix, AUC curve and feature-importance PNGs named by run tag |
| *(optional)* `best_models.pkl` | Serialized best estimators via `joblib.dump()` |

---

## 🔧 Requirements

* **PyCaret 3.x** (classification module)  
* `scikit-learn`, `pandas`, `numpy`, `matplotlib`, `tqdm`  
* CatBoost & XGBoost are pulled automatically with the full PyCaret install

> **Note**  
> * All stimulus-related (`Arousal`, `Valence`, etc.) and subjective self-report variables are **permanently excluded** from the feature space.  
> * The four feature configurations are designed to measure the impact of **EI score** (`score_EI`) and **EI group label** (`group_EI`) on model performance.  
> * GPU acceleration, SHAP explanations and 5-fold CV are **not** used in the current pipeline; they can be added later if needed.


## Import Libraries

In [16]:
# ---------------------------------------------------------------
#  ❖  CONFIG & PREP
# ---------------------------------------------------------------
import os, sys, types, re, joblib, warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.ensemble import ExtraTreesClassifier          # used for FS
from pycaret.classification import *
warnings.filterwarnings("ignore", category=UserWarning)

from scipy.stats import zscore
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

## Step 0: Define Save Paths

We define clean, relative paths for all outputs such as feature importance plots, evaluation plots, and CSV results.

In [17]:
# Base directory: project root (current notebook folder)
ROOT_PATH = os.getcwd()
RES_PATH = os.path.join(ROOT_PATH, '../res')

# DATA
DATA_PATH = os.path.join(os.getcwd(), '../data/')
EXCEL_PATH = "../data/ETRI_cardiac.xlsx"
CSV_PATH = "../data/ETRI_cardiac.csv"
DF_PATH = "../data/updated_data_multi.csv"
ARSL_PATH = "../data/updated_data_ArslBC.csv"
VLNC_PATH = "../data/updated_data_VlncBC.csv"

# Define subfolders for saving results
RES_MULTI_PATH = os.path.join(RES_PATH, 'multi')
FI_MULTI_PATH   = os.path.join(RES_MULTI_PATH, 'FI_plots')      # Feature Importance
CM_MULTI_PATH   = os.path.join(RES_MULTI_PATH, 'CM_plots')      # Confusion Matrix
AUC_MULTI_PATH  = os.path.join(RES_MULTI_PATH, 'AUC_plots')     # AUC Curves
SHAP_MULTI_PATH = os.path.join(RES_MULTI_PATH, 'SHAP_plots')    # SHAP Plots
RES_CSV_MULTI_PATH  = os.path.join(RES_MULTI_PATH, 'results_csv')   # CSV files

RES_ARSL_PATH = os.path.join(RES_PATH, 'ArslBC')
FI_ARSL_PATH   = os.path.join(RES_ARSL_PATH, 'FI_plots')      # Feature Importance
CM_ARSL_PATH   = os.path.join(RES_ARSL_PATH, 'CM_plots')      # Confusion Matrix
AUC_ARSL_PATH  = os.path.join(RES_ARSL_PATH, 'AUC_plots')     # AUC Curves
SHAP_ARSL_PATH = os.path.join(RES_ARSL_PATH, 'SHAP_plots')    # SHAP Plots
RES_CSV_ARSL_PATH  = os.path.join(RES_ARSL_PATH, 'results_csv')   # CSV files

RES_VLNC_PATH = os.path.join(RES_PATH, 'VlncBC')
FI_VLNC_PATH   = os.path.join(RES_VLNC_PATH, 'FI_plots')      # Feature Importance
CM_VLNC_PATH   = os.path.join(RES_VLNC_PATH, 'CM_plots')      # Confusion Matrix
AUC_VLNC_PATH  = os.path.join(RES_VLNC_PATH, 'AUC_plots')     # AUC Curves
SHAP_VLNC_PATH = os.path.join(RES_VLNC_PATH, 'SHAP_plots')    # SHAP Plots
RES_CSV_VLNC_PATH  = os.path.join(RES_VLNC_PATH, 'results_csv')   # CSV files


# Create directories if they don't exist
for path in [RES_PATH, RES_MULTI_PATH, FI_MULTI_PATH, CM_MULTI_PATH, AUC_MULTI_PATH, SHAP_MULTI_PATH, RES_CSV_MULTI_PATH,
             RES_ARSL_PATH, FI_ARSL_PATH, CM_ARSL_PATH, AUC_ARSL_PATH, SHAP_ARSL_PATH, RES_CSV_ARSL_PATH,
             RES_VLNC_PATH, FI_VLNC_PATH, CM_VLNC_PATH, AUC_VLNC_PATH, SHAP_VLNC_PATH, RES_CSV_VLNC_PATH]:
    os.makedirs(path, exist_ok=True)

In [18]:
df_multi = pd.read_csv(DF_PATH)
df_multi.head()

Unnamed: 0,name,year,score_EI,domain_EI,trainnig,group_EI,자극_Arousal,자극_Valence,Arousal,Valence,...,VLF/HF_autocorr,LF/HF_autocorr,tPow_autocorr,dPow_autocorr,dHz_autocorr,pPow_autocorr,pHz_autocorr,CohRatio_autocorr,RSA_autocorr,dHz_diff_autocorr
0,subj_01_00,2021,400,0,0,0,1,1,4,5,...,-0.567119,-0.991761,4.761754,-0.106286,-0.597611,-0.177651,-1.020465,-1.087973,1.085212,-0.595462
1,subj_01_01,2021,400,1,0,0,1,0,6,4,...,-2.854562,-1.532805,-4.013737,13.465573,-1.486156,0.136173,-0.504705,-0.481968,0.855739,-1.486156
2,subj_01_02,2021,400,3,0,0,0,1,3,6,...,-1.375726,-1.030548,0.05676,-1.111399,-1.126756,-0.961825,-1.62631,-1.110674,-0.670813,-1.127433
3,subj_01_03,2021,400,2,0,0,0,0,4,4,...,-0.198272,0.921968,-2.812514,7.058668,-0.880433,-0.866042,-0.602454,-1.700471,0.168826,-0.880433
4,subj_02_00,2021,450,0,0,0,1,1,6,7,...,0.469754,0.216407,0.900575,0.121397,0.023057,-0.702703,-0.874824,0.579896,-0.289846,0.023057


In [19]:
df_multi.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 100 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               180 non-null    object 
 1   year               180 non-null    int64  
 2   score_EI           180 non-null    int64  
 3   domain_EI          180 non-null    int64  
 4   trainnig           180 non-null    int64  
 5   group_EI           180 non-null    int64  
 6   자극_Arousal         180 non-null    int64  
 7   자극_Valence         180 non-null    int64  
 8   Arousal            180 non-null    int64  
 9   Valence            180 non-null    int64  
 10  Subj_AR            180 non-null    int64  
 11  Subj_PN            180 non-null    int64  
 12  BPM                180 non-null    float64
 13  SDNN               180 non-null    float64
 14  rMSSD              180 non-null    float64
 15  VLF                180 non-null    float64
 16  LF                 180 no

In [20]:
df_arsl = pd.read_csv(ARSL_PATH)
df_arsl.head()

Unnamed: 0,name,year,score_EI,domain_EI,trainnig,group_EI,자극_Arousal,자극_Valence,Arousal,Valence,...,VLF/HF_autocorr,LF/HF_autocorr,tPow_autocorr,dPow_autocorr,dHz_autocorr,pPow_autocorr,pHz_autocorr,CohRatio_autocorr,RSA_autocorr,dHz_diff_autocorr
0,subj_01_00,2021,400,1,0,0,1,1,4,5,...,-0.567119,-0.991761,4.761754,-0.106286,-0.597611,-0.177651,-1.020465,-1.087973,1.085212,-0.595462
1,subj_01_01,2021,400,1,0,0,1,0,6,4,...,-2.854562,-1.532805,-4.013737,13.465573,-1.486156,0.136173,-0.504705,-0.481968,0.855739,-1.486156
2,subj_01_02,2021,400,0,0,0,0,1,3,6,...,-1.375726,-1.030548,0.05676,-1.111399,-1.126756,-0.961825,-1.62631,-1.110674,-0.670813,-1.127433
3,subj_01_03,2021,400,0,0,0,0,0,4,4,...,-0.198272,0.921968,-2.812514,7.058668,-0.880433,-0.866042,-0.602454,-1.700471,0.168826,-0.880433
4,subj_02_00,2021,450,1,0,0,1,1,6,7,...,0.469754,0.216407,0.900575,0.121397,0.023057,-0.702703,-0.874824,0.579896,-0.289846,0.023057


In [21]:
df_arsl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 100 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               180 non-null    object 
 1   year               180 non-null    int64  
 2   score_EI           180 non-null    int64  
 3   domain_EI          180 non-null    int64  
 4   trainnig           180 non-null    int64  
 5   group_EI           180 non-null    int64  
 6   자극_Arousal         180 non-null    int64  
 7   자극_Valence         180 non-null    int64  
 8   Arousal            180 non-null    int64  
 9   Valence            180 non-null    int64  
 10  Subj_AR            180 non-null    int64  
 11  Subj_PN            180 non-null    int64  
 12  BPM                180 non-null    float64
 13  SDNN               180 non-null    float64
 14  rMSSD              180 non-null    float64
 15  VLF                180 non-null    float64
 16  LF                 180 no

In [22]:
df_vlnc = pd.read_csv(VLNC_PATH)
df_vlnc.head()

Unnamed: 0,name,year,score_EI,domain_EI,trainnig,group_EI,자극_Arousal,자극_Valence,Arousal,Valence,...,VLF/HF_autocorr,LF/HF_autocorr,tPow_autocorr,dPow_autocorr,dHz_autocorr,pPow_autocorr,pHz_autocorr,CohRatio_autocorr,RSA_autocorr,dHz_diff_autocorr
0,subj_01_00,2021,400,1,0,0,1,1,4,5,...,-0.567119,-0.991761,4.761754,-0.106286,-0.597611,-0.177651,-1.020465,-1.087973,1.085212,-0.595462
1,subj_01_01,2021,400,0,0,0,1,0,6,4,...,-2.854562,-1.532805,-4.013737,13.465573,-1.486156,0.136173,-0.504705,-0.481968,0.855739,-1.486156
2,subj_01_02,2021,400,1,0,0,0,1,3,6,...,-1.375726,-1.030548,0.05676,-1.111399,-1.126756,-0.961825,-1.62631,-1.110674,-0.670813,-1.127433
3,subj_01_03,2021,400,0,0,0,0,0,4,4,...,-0.198272,0.921968,-2.812514,7.058668,-0.880433,-0.866042,-0.602454,-1.700471,0.168826,-0.880433
4,subj_02_00,2021,450,1,0,0,1,1,6,7,...,0.469754,0.216407,0.900575,0.121397,0.023057,-0.702703,-0.874824,0.579896,-0.289846,0.023057


In [23]:
df_vlnc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 100 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               180 non-null    object 
 1   year               180 non-null    int64  
 2   score_EI           180 non-null    int64  
 3   domain_EI          180 non-null    int64  
 4   trainnig           180 non-null    int64  
 5   group_EI           180 non-null    int64  
 6   자극_Arousal         180 non-null    int64  
 7   자극_Valence         180 non-null    int64  
 8   Arousal            180 non-null    int64  
 9   Valence            180 non-null    int64  
 10  Subj_AR            180 non-null    int64  
 11  Subj_PN            180 non-null    int64  
 12  BPM                180 non-null    float64
 13  SDNN               180 non-null    float64
 14  rMSSD              180 non-null    float64
 15  VLF                180 non-null    float64
 16  LF                 180 no

## Step 1: Experiment Settings
Define Scaling and Setup for PyCaret 

We define three types of scaling: none, standard, and min-max, for every feature configuration.

In [24]:
TARGET = "domain_EI"

# Columns that are never used as features
MUST_EXCLUDE = [
    "name", "domain_EI", "year", "trainnig",
    "Arousal", "Valence", "자극_Arousal", "자극_Valence",
    "Subj_AR", "Subj_PN"
]

# 4 feature-set flavours: keep / drop EI score / drop EI group / drop both
FEATURE_EXCLUDE_SETS = {
    1: [],
    2: ["group_EI"],
    3: ["score_EI"],
    4: ["score_EI", "group_EI"],
}

# Three scaler options
SCALERS = {
    "none":     None,
    "standard": StandardScaler(),
    "minmax":   MinMaxScaler(),
}

# Models we allow PyCaret to try
INCLUDE_MODELS = [
    "catboost", "xgboost", "rf", "et",
    "lr", "nb", "svm", "knn", "qda",
]

####  LightGBM stub

In [25]:
lgb_stub = types.ModuleType("lightgbm")

class _DummyLGBM:
    """Minimal fake LightGBM estimator so that PyCaret can import it."""
    def __init__(self, *args, **kwargs): pass
    def fit(self, *args, **kwargs): return self
    def predict(self, X, *args, **kwargs):          # label output
        return np.zeros(len(X), dtype=int)
    def predict_proba(self, X, *args, **kwargs):    # proba output
        # return 2-col dummy probs for binary; PyCaret only checks shape
        return np.zeros((len(X), 2), dtype=float)

# expose expected symbols at top level
lgb_stub.LGBMClassifier = _DummyLGBM
lgb_stub.LGBMRegressor  = _DummyLGBM
lgb_stub.Dataset        = object          # rarely used by PyCaret

basic_stub = types.ModuleType("lightgbm.basic")
basic_stub.LightGBMError = RuntimeError    # any Exception subclass works

# make `from lightgbm.basic import LightGBMError` succeed
sys.modules["lightgbm.basic"] = basic_stub

# attach sub-module to the parent package stub
lgb_stub.basic = basic_stub

sys.modules["lightgbm"] = lgb_stub

## Step 2. Result Containers

In [26]:
results_multi:     list[dict]        = []     # one row per config+scaler
best_models_multi: dict[str, object] = {}     # tag ➜ fitted model

results_arsl:     list[dict]        = []     # one row per config+scaler
best_models_arsl: dict[str, object] = {}     # tag ➜ fitted model

results_vlnc:     list[dict]        = []     # one row per config+scaler
best_models_vlnc: dict[str, object] = {}     # tag ➜ fitted model

## Step 3: AutoML Training using PyCaret + tqdm Progress Tracking
WITHOUT OUTLIER vs WITH OUTLIER !! (With Outlier first)

We train models using PyCaret with GPU acceleration, over 8 × 3 configuration combinations.
'''
for config_id, exclude_cols in tqdm(feature_exclude_sets.items(), desc="Feature Configs", position=0):
    for scaler_name, scaler in tqdm(scalers.items(), desc=f"Scalers for Config {config_id}", leave=False, position=1):
        df_exp = df.copy()
        all_excluded = must_exclude + exclude_cols
        
        remaining_cols = [col for col in df_exp.columns if col not in all_excluded]
        if len(remaining_cols) == 0:
            print(f"⚠️ Skipping config {config_id} + scaler {scaler_name} – no features left.")
            continue

        X = df_exp.drop(columns=all_excluded)
        y = df_exp[target]

        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

        # ✅ Feature Check
        print(f"\n📌 Config {config_id} | Scaler: {scaler_name}")
        print(f"🔸 X_train shape: {X_train.shape}")
        print(f"🔸 Feature columns: {list(X_train.columns)}")
        print(f"🔸 Null values:\n{X_train.isnull().sum()}")
        print(f"🔸 Class distribution:\n{y_train.value_counts()}")
        print("-" * 60)
'''

### Step 2.3. Experiment loop

### Multi

In [None]:
for cfg_id, excl_cols in tqdm(FEATURE_EXCLUDE_SETS.items(),
                              desc="Feature Configs"):

    for scaler_name, scaler in tqdm(SCALERS.items(),
                                    desc=f"Scalers for {cfg_id}",
                                    leave=False):

        # 3-1 ▶ split -------------------------------------------------------
        X = df_multi.drop(columns=MUST_EXCLUDE + excl_cols)
        y = df_multi[TARGET]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, stratify=y, test_size=0.10, random_state=42
        )

        if X_tr.shape[1] == 0:             # empty feature guard
            print(f"🚫 cfg{cfg_id}|{scaler_name}: no features"); continue

        # 3-2 ▶ optional scaling -------------------------------------------
        if scaler:
            X_tr = pd.DataFrame(scaler.fit_transform(X_tr),
                                index=X_tr.index, columns=X_tr.columns)
            X_te = pd.DataFrame(scaler.transform(X_te),
                                index=X_te.index, columns=X_te.columns)

        # 3-3 ▶ PyCaret setup  --------------------------------------------
        train_df = pd.concat([X_tr.reset_index(drop=True),
                              y_tr.reset_index(drop=True)], axis=1)
        test_df  = pd.concat([X_te.reset_index(drop=True),
                              y_te.reset_index(drop=True)], axis=1)

        setup(
            data=train_df,
            test_data=test_df,     # 👈 pass *your* hold-out here
            target=TARGET,
            session_id=42,
            fold=3,
            use_gpu=False,
            html=True,
            verbose=True,
            feature_selection=False,
            index=False,
        )

        # 3-4 ▶ model search  ---------------------------------------------
        best = compare_models(include=INCLUDE_MODELS,
                              sort="AUC",   fold=3,
                              turbo=False, verbose=True)

        tag = f"cfg{cfg_id}_{scaler_name}"
        best_models_multi[tag] = best

        # leaderboard CSV ---------------------------------------------------
        pull().assign(config=cfg_id, scaler=scaler_name) \
              .to_csv(os.path.join(RES_MULTI_PATH, f"leaderboard_{tag}.csv"),
                      index=False)

        # 3-5 ▶ predict on hold-out & plots -------------------------------
        pred = predict_model(best)     # uses test_data provided in setup()

        plot_model(best, plot="confusion_matrix",   save=True)
        os.replace("Confusion Matrix.png",
                   os.path.join(CM_MULTI_PATH,  f"CM_{tag}.png"))

        # AUC — only if model supports probability estimates
        if hasattr(best, "predict_proba"):
            plot_model(best, plot="auc", save=True)
            if os.path.exists("AUC.png"):
                os.replace("AUC.png", os.path.join(AUC_MULTI_PATH, f"AUC_{tag}.png"))
        else:
            print(f"[SKIP] AUC not available for {tag}")

        # 3-6 ▶ feature importance ----------------------------------------
        fi_path = os.path.join(FI_MULTI_PATH, f"FI_{tag}.png")

        def save_fi_tree():
            plot_model(best, plot="feature", save=True)
            os.replace("Feature Importance.png", fi_path)

        if hasattr(best, "feature_importances_"):
            save_fi_tree()

        elif best.__class__.__name__.startswith("CatBoost"):
            fi = best.get_feature_importance()
            top = np.argsort(fi)[::-1][:20]
            plt.figure(figsize=(6, 4))
            plt.barh(range(len(top)), fi[top][::-1])
            plt.yticks(range(len(top)), X_tr.columns[top][::-1])
            plt.tight_layout(); plt.savefig(fi_path); plt.close()

        # 3-7 ▶ collect numeric summary -----------------------------------
        cv_metrics = pull().iloc[0]        # row 0 = best model’s CV stats
        results_multi.append({
            "config":      cfg_id,
            "scaler":      scaler_name,
            "model":       best.__class__.__name__,
            "AUC_CV":      cv_metrics["AUC"],
            "Acc_CV":      cv_metrics["Accuracy"],
            "Acc_holdout": (pred[TARGET] == pred["prediction_label"]).mean(),
        })


Feature Configs:   0%|          | 0/4 [00:00<?, ?it/s]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 91)"
4,Transformed data shape,"(180, 91)"
5,Transformed train set shape,"(162, 91)"
6,Transformed test set shape,"(18, 91)"
7,Numeric features,90
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
nb,Naive Bayes,0.2469,0.4792,0.2469,0.2486,0.2335,-0.0066,-0.0051,0.19
xgboost,Extreme Gradient Boosting,0.2531,0.4775,0.2531,0.251,0.25,0.0046,0.0046,0.34
et,Extra Trees Classifier,0.2346,0.4637,0.2346,0.233,0.2307,-0.0211,-0.0212,0.22
catboost,CatBoost Classifier,0.216,0.4305,0.216,0.2143,0.2094,-0.0443,-0.0451,2.98
rf,Random Forest Classifier,0.2099,0.4207,0.2099,0.2074,0.2064,-0.0533,-0.0537,0.23
knn,K Neighbors Classifier,0.2037,0.4021,0.2037,0.2006,0.1974,-0.0608,-0.0614,0.19
lr,Logistic Regression,0.2346,0.0,0.2346,0.2514,0.2332,-0.02,-0.0197,0.22
svm,SVM - Linear Kernel,0.2346,0.0,0.2346,0.2067,0.1953,-0.0201,-0.021,0.1967
qda,Quadratic Discriminant Analysis,0.2099,0.0,0.2099,0.1613,0.1162,-0.0495,-0.0829,0.1867


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Naive Bayes,0.2222,0.4551,0.2222,0.1188,0.1538,-0.0328,-0.0385




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 91)"
4,Transformed data shape,"(180, 91)"
5,Transformed train set shape,"(162, 91)"
6,Transformed test set shape,"(18, 91)"
7,Numeric features,90
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
knn,K Neighbors Classifier,0.2531,0.4985,0.2531,0.238,0.2315,0.0042,0.0037,0.0133
nb,Naive Bayes,0.2593,0.4776,0.2593,0.2695,0.2503,0.0103,0.0124,0.0067
xgboost,Extreme Gradient Boosting,0.2531,0.4775,0.2531,0.251,0.25,0.0046,0.0046,0.3
et,Extra Trees Classifier,0.2469,0.465,0.2469,0.2435,0.2416,-0.0046,-0.0047,0.0433
catboost,CatBoost Classifier,0.1975,0.4322,0.1975,0.197,0.1914,-0.069,-0.07,2.8967
rf,Random Forest Classifier,0.2099,0.421,0.2099,0.209,0.2074,-0.0532,-0.0534,0.05
lr,Logistic Regression,0.2531,0.0,0.2531,0.2598,0.2478,0.005,0.0052,0.0067
svm,SVM - Linear Kernel,0.2469,0.0,0.2469,0.2319,0.234,-0.003,-0.0031,0.0133
qda,Quadratic Discriminant Analysis,0.3272,0.0,0.3272,0.3015,0.2984,0.1023,0.1072,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,K Neighbors Classifier,0.3333,0.4847,0.3333,0.5111,0.3175,0.122,0.1385




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 91)"
4,Transformed data shape,"(180, 91)"
5,Transformed train set shape,"(162, 91)"
6,Transformed test set shape,"(18, 91)"
7,Numeric features,90
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
xgboost,Extreme Gradient Boosting,0.2531,0.4775,0.2531,0.251,0.25,0.0046,0.0046,0.14
nb,Naive Bayes,0.2593,0.477,0.2593,0.2695,0.2503,0.0103,0.0124,0.0067
knn,K Neighbors Classifier,0.2346,0.4699,0.2346,0.2352,0.2214,-0.0216,-0.0219,0.0133
et,Extra Trees Classifier,0.2407,0.4622,0.2407,0.2421,0.2357,-0.0132,-0.0136,0.04
catboost,CatBoost Classifier,0.2099,0.4283,0.2099,0.2072,0.2026,-0.0527,-0.0535,2.67
rf,Random Forest Classifier,0.1914,0.4263,0.1914,0.1943,0.19,-0.0772,-0.0777,0.0467
lr,Logistic Regression,0.2654,0.0,0.2654,0.2601,0.2526,0.0219,0.0223,0.01
svm,SVM - Linear Kernel,0.2593,0.0,0.2593,0.2842,0.2093,0.015,0.0188,0.0133
qda,Quadratic Discriminant Analysis,0.2531,0.0,0.2531,0.2482,0.246,0.0034,0.0034,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extreme Gradient Boosting,0.2778,0.5446,0.2778,0.3481,0.2846,0.0526,0.0561


Feature Configs:  25%|██▌       | 1/4 [00:39<01:57, 39.26s/it]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(162, 90)"
6,Transformed test set shape,"(18, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
nb,Naive Bayes,0.2469,0.4795,0.2469,0.2486,0.2335,-0.0066,-0.0051,0.0067
xgboost,Extreme Gradient Boosting,0.2531,0.4775,0.2531,0.251,0.25,0.0046,0.0046,0.1533
et,Extra Trees Classifier,0.2284,0.4553,0.2284,0.2332,0.2261,-0.0284,-0.0284,0.0433
rf,Random Forest Classifier,0.1914,0.4496,0.1914,0.1914,0.1858,-0.0762,-0.0775,0.05
catboost,CatBoost Classifier,0.2284,0.4267,0.2284,0.2326,0.2269,-0.0275,-0.0278,2.7433
knn,K Neighbors Classifier,0.2037,0.4021,0.2037,0.2006,0.1974,-0.0608,-0.0614,0.0133
lr,Logistic Regression,0.2346,0.0,0.2346,0.2549,0.2338,-0.0199,-0.0201,0.0367
svm,SVM - Linear Kernel,0.2346,0.0,0.2346,0.2067,0.1953,-0.0201,-0.021,0.0133
qda,Quadratic Discriminant Analysis,0.216,0.0,0.216,0.1435,0.1206,-0.0405,-0.0601,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Naive Bayes,0.2222,0.4551,0.2222,0.1188,0.1538,-0.0328,-0.0385




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(162, 90)"
6,Transformed test set shape,"(18, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
nb,Naive Bayes,0.2593,0.4782,0.2593,0.2695,0.2503,0.0103,0.0124,0.0067
xgboost,Extreme Gradient Boosting,0.2531,0.4775,0.2531,0.251,0.25,0.0046,0.0046,0.1833
knn,K Neighbors Classifier,0.2222,0.4722,0.2222,0.2199,0.202,-0.0373,-0.0395,0.01
et,Extra Trees Classifier,0.216,0.4539,0.216,0.2231,0.2154,-0.0447,-0.0446,0.0533
rf,Random Forest Classifier,0.1852,0.4478,0.1852,0.181,0.1776,-0.0843,-0.0859,0.0633
catboost,CatBoost Classifier,0.216,0.4469,0.216,0.2205,0.215,-0.0438,-0.0444,2.7167
lr,Logistic Regression,0.2654,0.0,0.2654,0.2717,0.2596,0.0211,0.0214,0.0233
svm,SVM - Linear Kernel,0.284,0.0,0.284,0.2635,0.2658,0.0464,0.0475,0.0133
qda,Quadratic Discriminant Analysis,0.2654,0.0,0.2654,0.2361,0.2365,0.0206,0.0218,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Naive Bayes,0.2222,0.4667,0.2222,0.1188,0.1538,-0.0328,-0.0385




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(162, 90)"
6,Transformed test set shape,"(18, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
nb,Naive Bayes,0.2593,0.4776,0.2593,0.2695,0.2503,0.0103,0.0124,0.0067
xgboost,Extreme Gradient Boosting,0.2531,0.4775,0.2531,0.251,0.25,0.0046,0.0046,0.21
et,Extra Trees Classifier,0.2222,0.4591,0.2222,0.22,0.2148,-0.0366,-0.037,0.04
knn,K Neighbors Classifier,0.2407,0.4523,0.2407,0.2297,0.2272,-0.0141,-0.0151,0.0133
rf,Random Forest Classifier,0.1914,0.4493,0.1914,0.1815,0.1797,-0.0761,-0.0778,0.0633
catboost,CatBoost Classifier,0.2284,0.4265,0.2284,0.232,0.2266,-0.0277,-0.028,2.6833
lr,Logistic Regression,0.2654,0.0,0.2654,0.2469,0.2503,0.0218,0.022,0.02
svm,SVM - Linear Kernel,0.2654,0.0,0.2654,0.2352,0.2366,0.0195,0.0216,0.0133
qda,Quadratic Discriminant Analysis,0.2469,0.0,0.2469,0.2609,0.2471,-0.0033,-0.003,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Naive Bayes,0.2222,0.4667,0.2222,0.1188,0.1538,-0.0328,-0.0385


Feature Configs:  50%|█████     | 2/4 [01:12<01:11, 35.72s/it]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(162, 90)"
6,Transformed test set shape,"(18, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
nb,Naive Bayes,0.2531,0.4798,0.2531,0.2534,0.24,0.0019,0.0044,0.0067
xgboost,Extreme Gradient Boosting,0.2346,0.4704,0.2346,0.224,0.2269,-0.02,-0.02,0.1933
rf,Random Forest Classifier,0.1975,0.4494,0.1975,0.1981,0.1933,-0.0686,-0.0701,0.0633
catboost,CatBoost Classifier,0.2099,0.4474,0.2099,0.2232,0.2085,-0.0528,-0.0537,2.6033
et,Extra Trees Classifier,0.1975,0.4394,0.1975,0.2027,0.1952,-0.0692,-0.0698,0.04
knn,K Neighbors Classifier,0.2222,0.432,0.2222,0.2143,0.2108,-0.0358,-0.0378,0.0133
lr,Logistic Regression,0.2469,0.0,0.2469,0.2514,0.2428,-0.0025,-0.0027,0.0367
svm,SVM - Linear Kernel,0.2407,0.0,0.2407,0.2244,0.2048,-0.0114,-0.0153,0.0133
qda,Quadratic Discriminant Analysis,0.2346,0.0,0.2346,0.1394,0.1452,-0.0135,-0.0435,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Naive Bayes,0.2222,0.4762,0.2222,0.1188,0.1538,-0.0328,-0.0385




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(162, 90)"
6,Transformed test set shape,"(18, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
nb,Naive Bayes,0.2593,0.4783,0.2593,0.2695,0.2503,0.0103,0.0124,0.0067
knn,K Neighbors Classifier,0.2531,0.4745,0.2531,0.2397,0.2316,0.0044,0.0043,0.01
xgboost,Extreme Gradient Boosting,0.2346,0.4704,0.2346,0.224,0.2269,-0.02,-0.02,0.1433
catboost,CatBoost Classifier,0.2222,0.4653,0.2222,0.2322,0.2192,-0.0361,-0.0366,2.5933
rf,Random Forest Classifier,0.1852,0.4455,0.1852,0.1851,0.18,-0.0846,-0.0865,0.0667
et,Extra Trees Classifier,0.2037,0.4413,0.2037,0.2062,0.201,-0.0613,-0.0618,0.0533
lr,Logistic Regression,0.2654,0.0,0.2654,0.2689,0.257,0.0217,0.0221,0.0067
svm,SVM - Linear Kernel,0.2654,0.0,0.2654,0.26,0.2531,0.0221,0.0227,0.0133
qda,Quadratic Discriminant Analysis,0.3272,0.0,0.3272,0.2996,0.2947,0.1028,0.108,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Naive Bayes,0.2222,0.471,0.2222,0.1188,0.1538,-0.0328,-0.0385




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(162, 90)"
6,Transformed test set shape,"(18, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
nb,Naive Bayes,0.2593,0.4778,0.2593,0.2695,0.2503,0.0103,0.0124,0.0067
knn,K Neighbors Classifier,0.2654,0.4777,0.2654,0.2508,0.2436,0.0202,0.0223,0.0133
xgboost,Extreme Gradient Boosting,0.2346,0.4704,0.2346,0.224,0.2269,-0.02,-0.02,0.1067
catboost,CatBoost Classifier,0.2037,0.4495,0.2037,0.2163,0.2021,-0.0611,-0.0619,2.6367
et,Extra Trees Classifier,0.2099,0.4471,0.2099,0.2127,0.2045,-0.0519,-0.0524,0.04
rf,Random Forest Classifier,0.179,0.4468,0.179,0.1751,0.172,-0.093,-0.0952,0.0633
lr,Logistic Regression,0.2531,0.0,0.2531,0.2331,0.236,0.0057,0.0055,0.01
svm,SVM - Linear Kernel,0.2531,0.0,0.2531,0.2696,0.1999,0.0037,0.0008,0.0133
qda,Quadratic Discriminant Analysis,0.3025,0.0,0.3025,0.2999,0.2948,0.0695,0.0704,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Naive Bayes,0.2222,0.471,0.2222,0.1188,0.1538,-0.0328,-0.0385


Feature Configs:  75%|███████▌  | 3/4 [01:44<00:33, 33.99s/it]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 89)"
4,Transformed data shape,"(180, 89)"
5,Transformed train set shape,"(162, 89)"
6,Transformed test set shape,"(18, 89)"
7,Numeric features,88
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.2469,0.4976,0.2469,0.2515,0.238,-0.0043,-0.0043,0.04
nb,Naive Bayes,0.2531,0.4803,0.2531,0.2534,0.24,0.0019,0.0044,0.0067
xgboost,Extreme Gradient Boosting,0.2654,0.4747,0.2654,0.2543,0.2578,0.0216,0.0218,0.1933
catboost,CatBoost Classifier,0.2346,0.4706,0.2346,0.2242,0.2222,-0.0191,-0.02,2.72
rf,Random Forest Classifier,0.2284,0.4542,0.2284,0.2226,0.2216,-0.0283,-0.0285,0.05
knn,K Neighbors Classifier,0.2222,0.432,0.2222,0.2143,0.2108,-0.0358,-0.0378,0.0133
lr,Logistic Regression,0.2469,0.0,0.2469,0.2538,0.2423,-0.0024,-0.0024,0.0333
svm,SVM - Linear Kernel,0.2407,0.0,0.2407,0.2244,0.2048,-0.0114,-0.0153,0.0133
qda,Quadratic Discriminant Analysis,0.2346,0.0,0.2346,0.1394,0.1452,-0.0135,-0.0435,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.2222,0.4634,0.2222,0.1926,0.2059,-0.037,-0.0375




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 89)"
4,Transformed data shape,"(180, 89)"
5,Transformed train set shape,"(162, 89)"
6,Transformed test set shape,"(18, 89)"
7,Numeric features,88
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.2469,0.4982,0.2469,0.2515,0.238,-0.0043,-0.0043,0.0433
nb,Naive Bayes,0.2531,0.4789,0.2531,0.2639,0.2441,0.002,0.0034,0.0067
xgboost,Extreme Gradient Boosting,0.2654,0.4747,0.2654,0.2543,0.2578,0.0216,0.0218,0.1433
knn,K Neighbors Classifier,0.2346,0.4747,0.2346,0.2284,0.2163,-0.0208,-0.0218,0.0067
rf,Random Forest Classifier,0.2346,0.4539,0.2346,0.226,0.2261,-0.0202,-0.0203,0.05
catboost,CatBoost Classifier,0.2407,0.4519,0.2407,0.2354,0.2304,-0.0114,-0.012,2.5633
lr,Logistic Regression,0.2716,0.0,0.2716,0.2758,0.2655,0.03,0.0302,0.0067
svm,SVM - Linear Kernel,0.2284,0.0,0.2284,0.2176,0.2159,-0.0284,-0.029,0.0133
qda,Quadratic Discriminant Analysis,0.2778,0.0,0.2778,0.2823,0.251,0.0367,0.0401,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.1667,0.4492,0.1667,0.1481,0.1556,-0.1066,-0.1083




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 89)"
4,Transformed data shape,"(180, 89)"
5,Transformed train set shape,"(162, 89)"
6,Transformed test set shape,"(18, 89)"
7,Numeric features,88
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.2593,0.4934,0.2593,0.2591,0.2492,0.0125,0.0129,0.0367
nb,Naive Bayes,0.2531,0.4784,0.2531,0.2639,0.2441,0.002,0.0034,0.0067
xgboost,Extreme Gradient Boosting,0.2654,0.4747,0.2654,0.2543,0.2578,0.0216,0.0218,0.15
catboost,CatBoost Classifier,0.2284,0.4657,0.2284,0.2198,0.2169,-0.027,-0.028,2.6033
rf,Random Forest Classifier,0.2469,0.4523,0.2469,0.2349,0.2342,-0.0034,-0.003,0.0467
knn,K Neighbors Classifier,0.2222,0.4484,0.2222,0.2115,0.2094,-0.0384,-0.0397,0.0133
lr,Logistic Regression,0.2654,0.0,0.2654,0.2534,0.2539,0.0217,0.022,0.01
svm,SVM - Linear Kernel,0.2654,0.0,0.2654,0.2187,0.1863,0.0292,0.041,0.0133
qda,Quadratic Discriminant Analysis,0.2469,0.0,0.2469,0.2416,0.2365,-0.0042,-0.0041,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.2222,0.4361,0.2222,0.1889,0.2037,-0.0328,-0.0332


Feature Configs: 100%|██████████| 4/4 [02:17<00:00, 34.43s/it]


### Arsl

In [None]:
for cfg_id, excl_cols in tqdm(FEATURE_EXCLUDE_SETS.items(),
                              desc="Feature Configs"):

    for scaler_name, scaler in tqdm(SCALERS.items(),
                                    desc=f"Scalers for {cfg_id}",
                                    leave=False):

        # 3-1 ▶ split -------------------------------------------------------
        X = df_arsl.drop(columns=MUST_EXCLUDE + excl_cols)
        y = df_arsl[TARGET]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, stratify=y, test_size=0.10, random_state=42
        )

        if X_tr.shape[1] == 0:             # empty feature guard
            print(f"🚫 cfg{cfg_id}|{scaler_name}: no features"); continue

        # 3-2 ▶ optional scaling -------------------------------------------
        if scaler:
            X_tr = pd.DataFrame(scaler.fit_transform(X_tr),
                                index=X_tr.index, columns=X_tr.columns)
            X_te = pd.DataFrame(scaler.transform(X_te),
                                index=X_te.index, columns=X_te.columns)

        # 3-3 ▶ PyCaret setup  --------------------------------------------
        train_df = pd.concat([X_tr.reset_index(drop=True),
                              y_tr.reset_index(drop=True)], axis=1)
        test_df  = pd.concat([X_te.reset_index(drop=True),
                              y_te.reset_index(drop=True)], axis=1)

        setup(
            data=train_df,
            test_data=test_df,     # 👈 pass *your* hold-out here
            target=TARGET,
            session_id=42,
            fold=3,
            use_gpu=False,
            html=True,
            verbose=True,
            feature_selection=False,
            index=False,
        )

        # 3-4 ▶ model search  ---------------------------------------------
        best = compare_models(include=INCLUDE_MODELS,
                              sort="AUC",   fold=3,
                              turbo=False, verbose=True)

        tag = f"cfg{cfg_id}_{scaler_name}"
        best_models_arsl[tag] = best

        # leaderboard CSV ---------------------------------------------------
        pull().assign(config=cfg_id, scaler=scaler_name) \
              .to_csv(os.path.join(RES_ARSL_PATH, f"leaderboard_{tag}.csv"),
                      index=False)

        # 3-5 ▶ predict on hold-out & plots -------------------------------
        pred = predict_model(best)     # uses test_data provided in setup()

        plot_model(best, plot="confusion_matrix",   save=True)
        os.replace("Confusion Matrix.png",
                   os.path.join(CM_ARSL_PATH,  f"CM_{tag}.png"))
        
        # AUC — only if model supports probability estimates
        if hasattr(best, "predict_proba"):
            plot_model(best, plot="auc", save=True)
            if os.path.exists("AUC.png"):
                os.replace("AUC.png", os.path.join(AUC_ARSL_PATH, f"AUC_{tag}.png"))
        else:
            print(f"[SKIP] AUC not available for {tag}")

        # 3-6 ▶ feature importance ----------------------------------------
        fi_path = os.path.join(FI_ARSL_PATH, f"FI_{tag}.png")

        def save_fi_tree():
            plot_model(best, plot="feature", save=True)
            os.replace("Feature Importance.png", fi_path)

        if hasattr(best, "feature_importances_"):
            save_fi_tree()

        elif best.__class__.__name__.startswith("CatBoost"):
            fi = best.get_feature_importance()
            top = np.argsort(fi)[::-1][:20]
            plt.figure(figsize=(6, 4))
            plt.barh(range(len(top)), fi[top][::-1])
            plt.yticks(range(len(top)), X_tr.columns[top][::-1])
            plt.tight_layout(); plt.savefig(fi_path); plt.close()

        # 3-7 ▶ collect numeric summary -----------------------------------
        cv_metrics = pull().iloc[0]        # row 0 = best model’s CV stats
        results_arsl.append({
            "config":      cfg_id,
            "scaler":      scaler_name,
            "model":       best.__class__.__name__,
            "AUC_CV":      cv_metrics["AUC"],
            "Acc_CV":      cv_metrics["Accuracy"],
            "Acc_holdout": (pred[TARGET] == pred["prediction_label"]).mean(),
        })


Feature Configs:   0%|          | 0/4 [00:00<?, ?it/s]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 91)"
4,Transformed data shape,"(180, 91)"
5,Transformed train set shape,"(162, 91)"
6,Transformed test set shape,"(18, 91)"
7,Numeric features,90
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
qda,Quadratic Discriminant Analysis,0.5123,0.5578,0.6914,0.4994,0.5495,0.0247,0.0439,0.01
rf,Random Forest Classifier,0.5309,0.519,0.5309,0.542,0.5303,0.0617,0.0626,0.05
xgboost,Extreme Gradient Boosting,0.4815,0.5057,0.4938,0.4879,0.4879,-0.037,-0.0372,0.0333
et,Extra Trees Classifier,0.5062,0.4991,0.4938,0.5087,0.4983,0.0123,0.0125,0.0367
lr,Logistic Regression,0.4938,0.4847,0.5309,0.4946,0.5114,-0.0123,-0.0125,0.02
svm,SVM - Linear Kernel,0.4815,0.4733,0.1975,0.4545,0.255,-0.037,-0.046,0.0067
catboost,CatBoost Classifier,0.4938,0.4632,0.4198,0.5053,0.4479,-0.0123,-0.0089,1.0
knn,K Neighbors Classifier,0.4938,0.4582,0.4444,0.4943,0.4654,-0.0123,-0.0124,0.0133
nb,Naive Bayes,0.4938,0.3809,0.2469,0.486,0.324,-0.0123,-0.0149,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Quadratic Discriminant Analysis,0.5556,0.5556,1.0,0.5294,0.6923,0.1111,0.2425




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 91)"
4,Transformed data shape,"(180, 91)"
5,Transformed train set shape,"(162, 91)"
6,Transformed test set shape,"(18, 91)"
7,Numeric features,90
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.5185,0.5183,0.5185,0.5284,0.5161,0.037,0.0379,0.05
xgboost,Extreme Gradient Boosting,0.4815,0.5057,0.4938,0.4879,0.4879,-0.037,-0.0372,0.08
lr,Logistic Regression,0.5123,0.5007,0.4938,0.5113,0.5023,0.0247,0.0246,0.0067
et,Extra Trees Classifier,0.5062,0.4995,0.4938,0.5087,0.4983,0.0123,0.0125,0.04
svm,SVM - Linear Kernel,0.4691,0.4787,0.5062,0.4733,0.4875,-0.0617,-0.0635,0.0067
knn,K Neighbors Classifier,0.5123,0.4765,0.4815,0.5213,0.4965,0.0247,0.0263,0.0133
qda,Quadratic Discriminant Analysis,0.4815,0.4668,0.6543,0.4848,0.5547,-0.037,-0.0393,0.0067
catboost,CatBoost Classifier,0.4938,0.454,0.4074,0.5048,0.4415,-0.0123,-0.0089,0.97
nb,Naive Bayes,0.4877,0.38,0.2593,0.4709,0.3329,-0.0247,-0.0302,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.2778,0.2469,0.4444,0.3333,0.381,-0.4444,-0.4714




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 91)"
4,Transformed data shape,"(180, 91)"
5,Transformed train set shape,"(162, 91)"
6,Transformed test set shape,"(18, 91)"
7,Numeric features,90
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.5123,0.5233,0.5062,0.5231,0.5079,0.0247,0.0255,0.05
xgboost,Extreme Gradient Boosting,0.4815,0.5057,0.4938,0.4879,0.4879,-0.037,-0.0372,0.0233
qda,Quadratic Discriminant Analysis,0.5309,0.5034,0.4568,0.5421,0.4863,0.0617,0.0661,0.0067
et,Extra Trees Classifier,0.5,0.5005,0.4815,0.5008,0.4874,-0.0,-0.0001,0.04
svm,SVM - Linear Kernel,0.4815,0.4819,0.4938,0.5723,0.4446,-0.037,-0.0161,0.0067
lr,Logistic Regression,0.5,0.4701,0.4691,0.4974,0.4787,0.0,-0.0006,0.0067
catboost,CatBoost Classifier,0.4938,0.4668,0.4198,0.5053,0.4479,-0.0123,-0.0089,1.0767
knn,K Neighbors Classifier,0.4815,0.4412,0.3086,0.4753,0.3704,-0.037,-0.0386,0.01
nb,Naive Bayes,0.4877,0.3791,0.2593,0.4709,0.3329,-0.0247,-0.0302,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.3333,0.2222,0.4444,0.3636,0.4,-0.3333,-0.3419


Feature Configs:  25%|██▌       | 1/4 [00:17<00:51, 17.31s/it]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(162, 90)"
6,Transformed test set shape,"(18, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
qda,Quadratic Discriminant Analysis,0.4877,0.524,0.6543,0.5081,0.5102,-0.0247,-0.064,0.0067
rf,Random Forest Classifier,0.5123,0.5165,0.4691,0.5167,0.4892,0.0247,0.0256,0.05
xgboost,Extreme Gradient Boosting,0.4815,0.5057,0.4938,0.4879,0.4879,-0.037,-0.0372,0.02
et,Extra Trees Classifier,0.5,0.4954,0.4691,0.507,0.4811,0.0,0.0014,0.0367
lr,Logistic Regression,0.5,0.4801,0.5556,0.5008,0.526,0.0,-0.0001,0.02
svm,SVM - Linear Kernel,0.4815,0.4733,0.1975,0.4545,0.255,-0.037,-0.046,0.0067
catboost,CatBoost Classifier,0.4753,0.4682,0.4074,0.4753,0.4327,-0.0494,-0.0494,0.9233
knn,K Neighbors Classifier,0.4938,0.4582,0.4444,0.4943,0.4654,-0.0123,-0.0124,0.0133
nb,Naive Bayes,0.4938,0.3809,0.2469,0.486,0.324,-0.0123,-0.0149,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Quadratic Discriminant Analysis,0.5,0.5,1.0,0.5,0.6667,0.0,0.0




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(162, 90)"
6,Transformed test set shape,"(18, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.5,0.5183,0.4568,0.5009,0.4755,-0.0,0.0,0.05
xgboost,Extreme Gradient Boosting,0.4815,0.5057,0.4938,0.4879,0.4879,-0.037,-0.0372,0.0233
lr,Logistic Regression,0.5185,0.5007,0.5062,0.5185,0.5122,0.037,0.037,0.0067
et,Extra Trees Classifier,0.5,0.495,0.4691,0.507,0.4811,0.0,0.0014,0.04
svm,SVM - Linear Kernel,0.4938,0.4783,0.5556,0.4917,0.5207,-0.0123,-0.0115,0.0067
catboost,CatBoost Classifier,0.4938,0.4769,0.4321,0.5095,0.458,-0.0123,-0.0081,0.88
qda,Quadratic Discriminant Analysis,0.4938,0.471,0.6667,0.4896,0.5599,-0.0123,-0.0082,0.0067
knn,K Neighbors Classifier,0.463,0.4303,0.4568,0.4688,0.4553,-0.0741,-0.0753,0.0067
nb,Naive Bayes,0.4877,0.3809,0.2593,0.4709,0.3329,-0.0247,-0.0302,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.3333,0.3951,0.3333,0.3333,0.3333,-0.3333,-0.3333




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(162, 90)"
6,Transformed test set shape,"(18, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.5062,0.521,0.4321,0.5067,0.4635,0.0123,0.0124,0.05
svm,SVM - Linear Kernel,0.537,0.5103,0.358,0.5652,0.4362,0.0741,0.0822,0.01
xgboost,Extreme Gradient Boosting,0.4815,0.5057,0.4938,0.4879,0.4879,-0.037,-0.0372,0.0233
et,Extra Trees Classifier,0.5062,0.4963,0.4815,0.5123,0.4894,0.0123,0.0138,0.04
lr,Logistic Regression,0.5062,0.4915,0.4691,0.5062,0.4844,0.0123,0.0123,0.0067
qda,Quadratic Discriminant Analysis,0.4938,0.4856,0.4198,0.4964,0.4477,-0.0123,-0.0104,0.0067
catboost,CatBoost Classifier,0.4753,0.4742,0.4074,0.4753,0.4327,-0.0494,-0.0494,0.9567
knn,K Neighbors Classifier,0.4691,0.4607,0.4568,0.4655,0.4564,-0.0617,-0.0629,0.0067
nb,Naive Bayes,0.4877,0.38,0.2593,0.4709,0.3329,-0.0247,-0.0302,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.3333,0.4383,0.3333,0.3333,0.3333,-0.3333,-0.3333


Feature Configs:  50%|█████     | 2/4 [00:33<00:33, 16.75s/it]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(162, 90)"
6,Transformed test set shape,"(18, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.5247,0.5242,0.4938,0.5296,0.505,0.0494,0.051,0.05
qda,Quadratic Discriminant Analysis,0.4753,0.5075,0.6914,0.4579,0.5255,-0.0494,-0.0371,0.0067
svm,SVM - Linear Kernel,0.5123,0.4947,0.5556,0.5128,0.5328,0.0247,0.0247,0.0067
et,Extra Trees Classifier,0.5123,0.4874,0.5185,0.5184,0.5145,0.0247,0.0255,0.0367
xgboost,Extreme Gradient Boosting,0.4815,0.4865,0.5062,0.487,0.4914,-0.037,-0.0377,0.02
knn,K Neighbors Classifier,0.463,0.4801,0.3704,0.4577,0.4077,-0.0741,-0.075,0.0133
lr,Logistic Regression,0.4815,0.4728,0.5556,0.4848,0.517,-0.037,-0.038,0.02
catboost,CatBoost Classifier,0.463,0.4536,0.3951,0.4544,0.417,-0.0741,-0.0765,1.09
nb,Naive Bayes,0.4938,0.3809,0.2469,0.486,0.324,-0.0123,-0.0149,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.3333,0.4012,0.3333,0.3333,0.3333,-0.3333,-0.3333




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(162, 90)"
6,Transformed test set shape,"(18, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.5062,0.5226,0.4815,0.5105,0.4922,0.0123,0.0132,0.0433
lr,Logistic Regression,0.5185,0.5016,0.5062,0.5185,0.5122,0.037,0.037,0.0067
et,Extra Trees Classifier,0.5123,0.4867,0.5185,0.5184,0.5145,0.0247,0.0255,0.04
xgboost,Extreme Gradient Boosting,0.4815,0.4865,0.5062,0.487,0.4914,-0.037,-0.0377,0.02
svm,SVM - Linear Kernel,0.5,0.4765,0.4815,0.5015,0.4904,0.0,0.0001,0.0067
qda,Quadratic Discriminant Analysis,0.4815,0.4696,0.642,0.4855,0.5485,-0.037,-0.0406,0.0067
knn,K Neighbors Classifier,0.4877,0.4602,0.4198,0.4835,0.4404,-0.0247,-0.0263,0.0067
catboost,CatBoost Classifier,0.4691,0.4508,0.3827,0.4681,0.4124,-0.0617,-0.0618,1.0233
nb,Naive Bayes,0.4877,0.3804,0.2593,0.4709,0.3329,-0.0247,-0.0302,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.3333,0.4012,0.3333,0.3333,0.3333,-0.3333,-0.3333




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(162, 90)"
6,Transformed test set shape,"(18, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.5185,0.5229,0.4691,0.5254,0.4918,0.037,0.0388,0.05
svm,SVM - Linear Kernel,0.4753,0.4989,0.4321,0.6261,0.3958,-0.0494,-0.0158,0.01
qda,Quadratic Discriminant Analysis,0.5247,0.4893,0.4321,0.5337,0.4682,0.0494,0.0527,0.01
xgboost,Extreme Gradient Boosting,0.4815,0.4865,0.5062,0.487,0.4914,-0.037,-0.0377,0.0233
et,Extra Trees Classifier,0.5123,0.484,0.5062,0.5209,0.5073,0.0247,0.0263,0.0367
lr,Logistic Regression,0.5,0.4723,0.4691,0.4974,0.4787,0.0,-0.0006,0.0067
catboost,CatBoost Classifier,0.4568,0.4449,0.3951,0.4491,0.4142,-0.0864,-0.0889,1.0433
knn,K Neighbors Classifier,0.4691,0.4426,0.284,0.4545,0.3447,-0.0617,-0.0656,0.0067
nb,Naive Bayes,0.4877,0.3795,0.2593,0.4709,0.3329,-0.0247,-0.0302,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.4444,0.4444,0.5556,0.4545,0.5,-0.1111,-0.114


Feature Configs:  75%|███████▌  | 3/4 [00:51<00:17, 17.38s/it]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 89)"
4,Transformed data shape,"(180, 89)"
5,Transformed train set shape,"(162, 89)"
6,Transformed test set shape,"(18, 89)"
7,Numeric features,88
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.537,0.5283,0.5309,0.5459,0.5335,0.0741,0.0756,0.0467
qda,Quadratic Discriminant Analysis,0.4691,0.5039,0.6914,0.4545,0.5226,-0.0617,-0.0526,0.0067
et,Extra Trees Classifier,0.5617,0.4993,0.5432,0.5658,0.5487,0.1235,0.1256,0.04
svm,SVM - Linear Kernel,0.5123,0.4947,0.5556,0.5128,0.5328,0.0247,0.0247,0.0067
xgboost,Extreme Gradient Boosting,0.4815,0.4911,0.5062,0.487,0.4914,-0.037,-0.0377,0.02
knn,K Neighbors Classifier,0.463,0.4801,0.3704,0.4577,0.4077,-0.0741,-0.075,0.0067
lr,Logistic Regression,0.4877,0.4705,0.5432,0.4904,0.5147,-0.0247,-0.0253,0.02
catboost,CatBoost Classifier,0.4815,0.4609,0.4198,0.4868,0.4418,-0.037,-0.0356,0.9567
nb,Naive Bayes,0.4938,0.3813,0.2469,0.486,0.324,-0.0123,-0.0149,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.3889,0.4383,0.4444,0.4,0.4211,-0.2222,-0.2236




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 89)"
4,Transformed data shape,"(180, 89)"
5,Transformed train set shape,"(162, 89)"
6,Transformed test set shape,"(18, 89)"
7,Numeric features,88
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.5309,0.5332,0.5185,0.5374,0.522,0.0617,0.0636,0.0467
et,Extra Trees Classifier,0.5556,0.5057,0.5309,0.5612,0.5405,0.1111,0.113,0.04
svm,SVM - Linear Kernel,0.5062,0.5002,0.5432,0.5042,0.5226,0.0123,0.0127,0.0067
lr,Logistic Regression,0.5309,0.4993,0.5185,0.5322,0.5244,0.0617,0.0621,0.0067
xgboost,Extreme Gradient Boosting,0.4815,0.4911,0.5062,0.487,0.4914,-0.037,-0.0377,0.03
qda,Quadratic Discriminant Analysis,0.463,0.4701,0.6049,0.4701,0.5247,-0.0741,-0.0798,0.0067
catboost,CatBoost Classifier,0.4753,0.465,0.4074,0.4842,0.4373,-0.0494,-0.0467,1.0067
knn,K Neighbors Classifier,0.4815,0.4502,0.4074,0.465,0.4228,-0.037,-0.0436,0.0067
nb,Naive Bayes,0.4877,0.3804,0.2593,0.4709,0.3329,-0.0247,-0.0302,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.3889,0.4506,0.4444,0.4,0.4211,-0.2222,-0.2236




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 89)"
4,Transformed data shape,"(180, 89)"
5,Transformed train set shape,"(162, 89)"
6,Transformed test set shape,"(18, 89)"
7,Numeric features,88
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.5309,0.5338,0.5309,0.5399,0.5285,0.0617,0.0627,0.0467
svm,SVM - Linear Kernel,0.4691,0.5112,0.716,0.4886,0.5546,-0.0617,-0.1393,0.0067
et,Extra Trees Classifier,0.5679,0.5103,0.5432,0.5731,0.5526,0.1358,0.1377,0.04
lr,Logistic Regression,0.5123,0.5002,0.4691,0.5128,0.4876,0.0247,0.0247,0.01
xgboost,Extreme Gradient Boosting,0.4815,0.4911,0.5062,0.487,0.4914,-0.037,-0.0377,0.03
qda,Quadratic Discriminant Analysis,0.5062,0.4769,0.4321,0.5139,0.4593,0.0123,0.0165,0.0067
knn,K Neighbors Classifier,0.4691,0.4611,0.4321,0.4665,0.4444,-0.0617,-0.0629,0.0067
catboost,CatBoost Classifier,0.463,0.4604,0.3951,0.4688,0.4206,-0.0741,-0.0727,0.99
nb,Naive Bayes,0.4877,0.3795,0.2593,0.4709,0.3329,-0.0247,-0.0302,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.3889,0.4444,0.4444,0.4,0.4211,-0.2222,-0.2236


Feature Configs: 100%|██████████| 4/4 [01:09<00:00, 17.40s/it]


### Vlnc

In [None]:
for cfg_id, excl_cols in tqdm(FEATURE_EXCLUDE_SETS.items(),
                              desc="Feature Configs"):

    for scaler_name, scaler in tqdm(SCALERS.items(),
                                    desc=f"Scalers for {cfg_id}",
                                    leave=False):

        # 3-1 ▶ split -------------------------------------------------------
        X = df_vlnc.drop(columns=MUST_EXCLUDE + excl_cols)
        y = df_vlnc[TARGET]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, stratify=y, test_size=0.10, random_state=42
        )

        if X_tr.shape[1] == 0:             # empty feature guard
            print(f"🚫 cfg{cfg_id}|{scaler_name}: no features"); continue

        # 3-2 ▶ optional scaling -------------------------------------------
        if scaler:
            X_tr = pd.DataFrame(scaler.fit_transform(X_tr),
                                index=X_tr.index, columns=X_tr.columns)
            X_te = pd.DataFrame(scaler.transform(X_te),
                                index=X_te.index, columns=X_te.columns)

        # 3-3 ▶ PyCaret setup  --------------------------------------------
        train_df = pd.concat([X_tr.reset_index(drop=True),
                              y_tr.reset_index(drop=True)], axis=1)
        test_df  = pd.concat([X_te.reset_index(drop=True),
                              y_te.reset_index(drop=True)], axis=1)

        setup(
            data=train_df,
            test_data=test_df,     # 👈 pass *your* hold-out here
            target=TARGET,
            session_id=42,
            fold=3,
            use_gpu=False,
            html=True,
            verbose=True,
            feature_selection=False,
            index=False,
        )

        # 3-4 ▶ model search  ---------------------------------------------
        best = compare_models(include=INCLUDE_MODELS,
                              sort="AUC",   fold=3,
                              turbo=False, verbose=True)

        tag = f"cfg{cfg_id}_{scaler_name}"
        best_models_vlnc[tag] = best

        # leaderboard CSV ---------------------------------------------------
        pull().assign(config=cfg_id, scaler=scaler_name) \
              .to_csv(os.path.join(RES_VLNC_PATH, f"leaderboard_{tag}.csv"),
                      index=False)

        # 3-5 ▶ predict on hold-out & plots -------------------------------
        pred = predict_model(best)     # uses test_data provided in setup()

        plot_model(best, plot="confusion_matrix",   save=True)
        os.replace("Confusion Matrix.png",
                   os.path.join(CM_VLNC_PATH,  f"CM_{tag}.png"))

        # AUC — only if model supports probability estimates
        if hasattr(best, "predict_proba"):
            plot_model(best, plot="auc", save=True)
            if os.path.exists("AUC.png"):
                os.replace("AUC.png", os.path.join(AUC_VLNC_PATH, f"AUC_{tag}.png"))
        else:
            print(f"[SKIP] AUC not available for {tag}")

        # 3-6 ▶ feature importance ----------------------------------------
        fi_path = os.path.join(FI_VLNC_PATH, f"FI_{tag}.png")

        def save_fi_tree():
            plot_model(best, plot="feature", save=True)
            os.replace("Feature Importance.png", fi_path)

        if hasattr(best, "feature_importances_"):
            save_fi_tree()

        elif best.__class__.__name__.startswith("CatBoost"):
            fi = best.get_feature_importance()
            top = np.argsort(fi)[::-1][:20]
            plt.figure(figsize=(6, 4))
            plt.barh(range(len(top)), fi[top][::-1])
            plt.yticks(range(len(top)), X_tr.columns[top][::-1])
            plt.tight_layout(); plt.savefig(fi_path); plt.close()

        # 3-7 ▶ collect numeric summary -----------------------------------
        cv_metrics = pull().iloc[0]        # row 0 = best model’s CV stats
        results_vlnc.append({
            "config":      cfg_id,
            "scaler":      scaler_name,
            "model":       best.__class__.__name__,
            "AUC_CV":      cv_metrics["AUC"],
            "Acc_CV":      cv_metrics["Accuracy"],
            "Acc_holdout": (pred[TARGET] == pred["prediction_label"]).mean(),
        })


Feature Configs:   0%|          | 0/4 [00:00<?, ?it/s]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 91)"
4,Transformed data shape,"(180, 91)"
5,Transformed train set shape,"(162, 91)"
6,Transformed test set shape,"(18, 91)"
7,Numeric features,90
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
nb,Naive Bayes,0.5556,0.5942,0.4321,0.585,0.4928,0.1111,0.1184,0.0067
lr,Logistic Regression,0.5864,0.5693,0.6049,0.6037,0.5897,0.1728,0.1819,0.0167
et,Extra Trees Classifier,0.5802,0.567,0.5926,0.5835,0.5843,0.1605,0.1628,0.04
rf,Random Forest Classifier,0.5432,0.521,0.4815,0.5496,0.5086,0.0864,0.0882,0.0467
svm,SVM - Linear Kernel,0.5062,0.508,0.3951,0.506,0.4404,0.0123,0.0122,0.01
qda,Quadratic Discriminant Analysis,0.5062,0.4961,0.7284,0.4928,0.5635,0.0123,-0.0025,0.0067
catboost,CatBoost Classifier,0.4753,0.4888,0.4444,0.4667,0.4529,-0.0494,-0.0511,0.9067
knn,K Neighbors Classifier,0.4877,0.4689,0.3951,0.4827,0.4285,-0.0247,-0.0261,0.01
xgboost,Extreme Gradient Boosting,0.4383,0.4618,0.4074,0.4283,0.4158,-0.1235,-0.1253,0.0233


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Naive Bayes,0.5556,0.5741,0.4444,0.5714,0.5,0.1111,0.114




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Binary
3,Original data shape,"(180, 91)"
4,Transformed data shape,"(180, 91)"
5,Transformed train set shape,"(162, 91)"
6,Transformed test set shape,"(18, 91)"
7,Numeric features,90
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
svm,SVM - Linear Kernel,0.5741,0.61,0.5679,0.5834,0.5703,0.1481,0.1513,0.0067
nb,Naive Bayes,0.5494,0.5997,0.4568,0.5677,0.5041,0.0988,0.1026,0.0067
lr,Logistic Regression,0.5741,0.5926,0.5802,0.5875,0.5736,0.1481,0.1536,0.0067
knn,K Neighbors Classifier,0.5802,0.5741,0.4815,0.6014,0.5343,0.1605,0.1642,0.0067
qda,Quadratic Discriminant Analysis,0.5432,0.5734,0.5062,0.5481,0.5257,0.0864,0.0871,0.0067
et,Extra Trees Classifier,0.5802,0.5674,0.5926,0.5835,0.5843,0.1605,0.1628,0.04
rf,Random Forest Classifier,0.537,0.5197,0.4691,0.543,0.4986,0.0741,0.0758,0.0467
catboost,CatBoost Classifier,0.4753,0.5034,0.4321,0.4704,0.4485,-0.0494,-0.0504,1.0367
xgboost,Extreme Gradient Boosting,0.4383,0.4618,0.4074,0.4283,0.4158,-0.1235,-0.1253,0.0233


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,SVM - Linear Kernel,0.6667,0.6667,0.6667,0.6667,0.6667,0.3333,0.3333


Feature Configs:   0%|          | 0/4 [00:10<?, ?it/s]


TypeError: AUC plot not available for estimators with no predict_proba attribute.

## Step 4. Save overall summary

In [None]:
results_df = pd.DataFrame(results_multi)
results_df.to_csv(os.path.join(RES_PATH,"overall_results_multi.csv"), index=False)

results_df = pd.DataFrame(results_arsl)
results_df.to_csv(os.path.join(RES_PATH,"overall_results_arsl.csv"), index=False)

results_df = pd.DataFrame(results_vlnc)
results_df.to_csv(os.path.join(RES_PATH,"overall_results_vlnc.csv"), index=False)

## 🎯 Selective Outlier Removal Based on XGBoost Feature Importance

Instead of removing all outliers across 88 features, we focus only on the top 10 features identified as most influential for classifying `domain_EI` using XGBoost.

### Method

- Extract top 10 features with highest importance scores.
- Apply Z-score based outlier detection (`|Z| > 4`) on those features only.
- Remove rows where extreme values are detected in these features.
- Visualize before/after distributions to confirm effective filtering.

This approach ensures:
- High-impact errors are removed.
- Valuable data is preserved from non-influential features.


In [None]:
'''# Step1: Important Features
important_features = [
    'HF', 'lnLF', 'RSA_var', 'LFp_var', 'lnVLF_mssd',
    'HF_var', 'VLFp_mssd', 'rMSSD_autocorr', 'dHz_mssd', 'BPM_autocorr'
]
'''

"# Step1: Important Features\nimportant_features = [\n    'HF', 'lnLF', 'RSA_var', 'LFp_var', 'lnVLF_mssd',\n    'HF_var', 'VLFp_mssd', 'rMSSD_autocorr', 'dHz_mssd', 'BPM_autocorr'\n]\n"

In [None]:
'''# Step 2. Z-score calculation (Top-10 Imporatant features only) and Z > 4 row detection
z_scores = df[important_features].apply(zscore)

extreme_mask = (np.abs(z_scores) > 4).any(axis=1)
df_outliers = df[extreme_mask]
df_cleaned = df[~extreme_mask].copy()

# Step 3. Remove outliers from the original DataFrame
print(f"⚠️ Rows with Z > 4 in important features: {extreme_mask.sum()} / {len(df)}")
print("Exclude row's name + index:")
display(df_outliers[['name']].reset_index())

print(f"✅ Cleaned data shape: {df_cleaned.shape}")
'''

'# Step 2. Z-score calculation (Top-10 Imporatant features only) and Z > 4 row detection\nz_scores = df[important_features].apply(zscore)\n\nextreme_mask = (np.abs(z_scores) > 4).any(axis=1)\ndf_outliers = df[extreme_mask]\ndf_cleaned = df[~extreme_mask].copy()\n\n# Step 3. Remove outliers from the original DataFrame\nprint(f"⚠️ Rows with Z > 4 in important features: {extreme_mask.sum()} / {len(df)}")\nprint("Exclude row\'s name + index:")\ndisplay(df_outliers[[\'name\']].reset_index())\n\nprint(f"✅ Cleaned data shape: {df_cleaned.shape}")\n'

In [None]:
'''# Step 4. Plotting the distribution of important features before and after removing outliers
df_before = df[important_features].copy()
df_before['source'] = 'Before'

df_after = df_cleaned[important_features].copy()
df_after['source'] = 'After'

df_plot = pd.concat([df_before, df_after])
df_plot = df_plot.melt(id_vars='source', var_name='feature', value_name='value')

plt.figure(figsize=(20,15))
sns.boxplot(data=df_plot, x='feature', y='value', hue='source')
plt.title("Top Features Distribution: Before vs After Z > 4 Removal")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()'''

'# Step 4. Plotting the distribution of important features before and after removing outliers\ndf_before = df[important_features].copy()\ndf_before[\'source\'] = \'Before\'\n\ndf_after = df_cleaned[important_features].copy()\ndf_after[\'source\'] = \'After\'\n\ndf_plot = pd.concat([df_before, df_after])\ndf_plot = df_plot.melt(id_vars=\'source\', var_name=\'feature\', value_name=\'value\')\n\nplt.figure(figsize=(20,15))\nsns.boxplot(data=df_plot, x=\'feature\', y=\'value\', hue=\'source\')\nplt.title("Top Features Distribution: Before vs After Z > 4 Removal")\nplt.xticks(rotation=45)\nplt.tight_layout()\nplt.show()'