# 🎯 Emotion-Domain Classification with PyCaret AutoML (`domain_EI`, 4 classes)

This notebook builds ML classifiers to predict the **emotion-intelligence domain** (`domain_EI`) with four labels  
(HAHV · HALV · LALV · LAHV  →  encoded as 0-3) using **PyCaret 3**.

---

## 🧩 Objectives

| Item | Details |
|------|---------|
| **Feature sets** | 4 configurations (score_EI / group_EI included or removed) |
| **Scaling options** | `none`, `standard`, `minmax` |
| **Model family** | CatBoost · XGBoost · Random Forest · Extra Trees · Logistic Regression · Naive Bayes · SVM · K-NN · QDA |
| **Total runs** | **12** (4 feature configs × 3 scalers) |
| **Validation** | Stratified **3-fold CV** inside PyCaret + **20 % hold-out** |
| **Artifacts** | PNG plots (AUC, confusion matrix, feature importance) + CSV summaries |
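
The 12 runs are the Cartesian product of the feature configurations and the scaler names. A minimal sketch of how the run tags can be enumerated is shown here; the `cfg{id}_{scaler}` tag format follows the one used later in the experiment loop, while the variable names `config_ids`, `scaler_names` and `run_tags` are illustrative only.

```python
from itertools import product

# Feature configs 1-4 and the three scaler names from the Objectives table.
config_ids = [1, 2, 3, 4]
scaler_names = ["none", "standard", "minmax"]

# One tag per (config, scaler) pair, e.g. "cfg1_none".
run_tags = [f"cfg{cfg}_{scaler}" for cfg, scaler in product(config_ids, scaler_names)]

print(len(run_tags))   # 12
print(run_tags[:3])    # ['cfg1_none', 'cfg1_standard', 'cfg1_minmax']
```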

---

## 🛠️ Pipeline Overview

| Step | Description |
|------|-------------|
| **1 — Split** | Apply feature-exclusion rules, then stratified 80 / 20 train-test split |
| **2 — Scaling** | Transform numeric features with the chosen scaler (`none` / `StandardScaler` / `MinMaxScaler`) |
| **3 — AutoML** | `compare_models()` evaluates the candidate models via CV and returns the best estimator |
| **4 — Hold-out Evaluation** | Predict on the reserved test set; save confusion-matrix & AUC plots |
| **5 — Result Logging** | Collect CV AUC / Accuracy and hold-out Accuracy for each run into a master CSV |
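
Step 1 can be illustrated on synthetic data. The sketch below assumes 180 rows with four perfectly balanced classes (an assumption; the real class counts are not shown in this notebook) and uses the same `train_test_split` arguments as the experiment loop. `X_demo` and `y_demo` are hypothetical stand-ins for the real feature matrix and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 180 rows, 5 random features, 4 balanced classes.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(180, 5))
y_demo = np.repeat([0, 1, 2, 3], 45)

# Stratified 80 / 20 split: class proportions are preserved in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.20, random_state=42
)

print(X_tr.shape, X_te.shape)   # (144, 5) (36, 5)
print(np.bincount(y_te))        # per-class counts in the hold-out set
```

With balanced classes, stratification gives every class the same number of hold-out rows, so hold-out accuracy is not skewed by an over-represented label.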

---

## 📂 Output Artifacts

| File / Folder | Contents |
|---------------|----------|
| **`overall_results.csv`** | One-row summary per run (feature cfg · scaler · chosen model · metrics) |
| **`leaderboard_cfg*_*.csv`** | Full PyCaret leaderboard (all models, CV statistics) |
| **`CM_plots/`, `AUC_plots/`, `FI_plots/`** | Confusion-matrix, AUC curve and feature-importance PNGs named by run tag |
| *(optional)* `best_models.pkl` | Serialized best estimators via `joblib.dump()` |

---

## 🔧 Requirements

* **PyCaret 3.x** (classification module)  
* `scikit-learn`, `pandas`, `numpy`, `matplotlib`, `tqdm`  
* CatBoost & XGBoost are pulled automatically with the full PyCaret install

> **Note**  
> * All stimulus-related (`Arousal`, `Valence`, etc.) and subjective self-report variables are **permanently excluded** from the feature space.  
> * The four feature configurations are designed to measure the impact of **EI score** (`score_EI`) and **EI group label** (`group_EI`) on model performance.  
> * GPU acceleration, SHAP explanations and 5-fold CV are **not** used in the current pipeline; they can be added later if needed.


## Import Libraries

In [9]:
# ---------------------------------------------------------------
#  ❖  CONFIG & PREP
# ---------------------------------------------------------------
import os, sys, types, re, joblib, warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.ensemble import ExtraTreesClassifier          # used for FS
from pycaret.classification import *
warnings.filterwarnings("ignore", category=UserWarning)

from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler, MinMaxScaler

## Step 0: Define Save Paths

We define clean, relative paths for all outputs such as feature importance plots, evaluation plots, and CSV results.

In [10]:
# Base directory: project root (current notebook folder)
ROOT_PATH = os.getcwd()
RES_PATH = os.path.join(ROOT_PATH, '../res')

# DATA
DATA_PATH = os.path.join(os.getcwd(), '../data/')
EXCEL_PATH = "../data/ETRI_cardiac.xlsx"
CSV_PATH = "../data/ETRI_cardiac.csv"
DF_PATH = "../data/updated_data.csv"

# Define subfolders for saving results
FI_PATH   = os.path.join(RES_PATH, 'FI_plots')      # Feature Importance
CM_PATH   = os.path.join(RES_PATH, 'CM_plots')      # Confusion Matrix
AUC_PATH  = os.path.join(RES_PATH, 'AUC_plots')     # AUC Curves
SHAP_PATH = os.path.join(RES_PATH, 'SHAP_plots')    # SHAP Plots
RES_CSV_PATH  = os.path.join(RES_PATH, 'results_csv')   # CSV files

# Create directories if they don't exist
for path in [FI_PATH, CM_PATH, AUC_PATH, SHAP_PATH, RES_CSV_PATH]:
    os.makedirs(path, exist_ok=True)

In [11]:
df = pd.read_csv(DF_PATH)
df.head()

Unnamed: 0,name,year,score_EI,domain_EI,trainnig,group_EI,자극_Arousal,자극_Valence,Arousal,Valence,...,VLF/HF_autocorr,LF/HF_autocorr,tPow_autocorr,dPow_autocorr,dHz_autocorr,pPow_autocorr,pHz_autocorr,CohRatio_autocorr,RSA_autocorr,dHz_diff_autocorr
0,subj_01_00,2021,400,0,0,0,1,1,4,5,...,-0.567119,-0.991761,4.761754,-0.106286,-0.597611,-0.177651,-1.020465,-1.087973,1.085212,-0.595462
1,subj_01_01,2021,400,1,0,0,1,0,6,4,...,-2.854562,-1.532805,-4.013737,13.465573,-1.486156,0.136173,-0.504705,-0.481968,0.855739,-1.486156
2,subj_01_02,2021,400,3,0,0,0,1,3,6,...,-1.375726,-1.030548,0.05676,-1.111399,-1.126756,-0.961825,-1.62631,-1.110674,-0.670813,-1.127433
3,subj_01_03,2021,400,2,0,0,0,0,4,4,...,-0.198272,0.921968,-2.812514,7.058668,-0.880433,-0.866042,-0.602454,-1.700471,0.168826,-0.880433
4,subj_02_00,2021,450,0,0,0,1,1,6,7,...,0.469754,0.216407,0.900575,0.121397,0.023057,-0.702703,-0.874824,0.579896,-0.289846,0.023057


In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 100 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               180 non-null    object 
 1   year               180 non-null    int64  
 2   score_EI           180 non-null    int64  
 3   domain_EI          180 non-null    int64  
 4   trainnig           180 non-null    int64  
 5   group_EI           180 non-null    int64  
 6   자극_Arousal         180 non-null    int64  
 7   자극_Valence         180 non-null    int64  
 8   Arousal            180 non-null    int64  
 9   Valence            180 non-null    int64  
 10  Subj_AR            180 non-null    int64  
 11  Subj_PN            180 non-null    int64  
 12  BPM                180 non-null    float64
 13  SDNN               180 non-null    float64
 14  rMSSD              180 non-null    float64
 15  VLF                180 non-null    float64
 16  LF                 180 no

## Step 1: Experiment Settings
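This step defines the constants shared by all PyCaret runs: the target column, the columns that are always excluded, the feature configurations, the scalers, and the candidate models.

Three scaling options (`none`, `standard`, `minmax`) are applied to every feature configuration.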
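
For reference, the two non-trivial options correspond to the per-feature transforms below. The notation is introduced here for illustration: $x_{ij}$ is the value of feature $j$ in row $i$, and $\mu_j$, $\sigma_j$, $\min_j$, $\max_j$ are the mean, population standard deviation, minimum and maximum of feature $j$, estimated on the training rows only and then reused for the hold-out rows.

$$
x'_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j} \quad \text{(StandardScaler)}
\qquad
x'_{ij} = \frac{x_{ij} - \min_j}{\max_j - \min_j} \quad \text{(MinMaxScaler, default range } [0, 1]\text{)}
$$

Because the statistics come from the training rows, hold-out values can fall outside $[0, 1]$ after min-max scaling.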

In [13]:
TARGET = "domain_EI"

# Columns that are never used as features ("자극" is Korean for "stimulus")
MUST_EXCLUDE = [
    "name", "domain_EI", "year", "trainnig",
    "Arousal", "Valence", "자극_Arousal", "자극_Valence",
    "Subj_AR", "Subj_PN"
]

# 4 feature-set flavours: keep both / drop EI group / drop EI score / drop both
FEATURE_EXCLUDE_SETS = {
    1: [],
    2: ["group_EI"],
    3: ["score_EI"],
    4: ["score_EI", "group_EI"],
}

# Three scaler options
SCALERS = {
    "none":     None,
    "standard": StandardScaler(),
    "minmax":   MinMaxScaler(),
}

# Models we allow PyCaret to try
INCLUDE_MODELS = [
    "catboost", "xgboost", "rf", "et",
    "lr", "nb", "svm", "knn", "qda",
]

####  LightGBM stub

In [18]:
lgb_stub = types.ModuleType("lightgbm")

class _DummyLGBM:
    """Minimal fake LightGBM estimator so that PyCaret can import it."""
    def __init__(self, *args, **kwargs): pass
    def fit(self, *args, **kwargs): return self
    def predict(self, X, *args, **kwargs):          # label output
        return np.zeros(len(X), dtype=int)
    def predict_proba(self, X, *args, **kwargs):    # proba output
        # return 2-col dummy probs for binary; PyCaret only checks shape
        return np.zeros((len(X), 2), dtype=float)

# expose expected symbols at top level
lgb_stub.LGBMClassifier = _DummyLGBM
lgb_stub.LGBMRegressor  = _DummyLGBM
lgb_stub.Dataset        = object          # rarely used by PyCaret

basic_stub = types.ModuleType("lightgbm.basic")
basic_stub.LightGBMError = RuntimeError    # any Exception subclass works

# make `from lightgbm.basic import LightGBMError` succeed
sys.modules["lightgbm.basic"] = basic_stub

# attach sub-module to the parent package stub
lgb_stub.basic = basic_stub

sys.modules["lightgbm"] = lgb_stub
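
The stub works because Python's import system looks in `sys.modules` before it searches the file system. A minimal, self-contained sketch of that mechanism follows; the package name `demo_missing_pkg` and the `answer` attribute are hypothetical and exist only for this demonstration.

```python
import sys
import types

# Hypothetical package name; nothing with this name is installed.
stub = types.ModuleType("demo_missing_pkg")
stub.answer = lambda: 42

# Registering the stub in sys.modules makes the import statement below
# resolve to it instead of raising ModuleNotFoundError.
sys.modules["demo_missing_pkg"] = stub

import demo_missing_pkg as mod

print(mod is stub)     # True
print(mod.answer())    # 42
```

A stub like this only satisfies the import; a fake estimator would produce meaningless predictions if it were actually trained, which is consistent with `lightgbm` being absent from `INCLUDE_MODELS`.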

## Step 2. Result Containers

In [19]:
results:     list[dict]        = []     # one row per config+scaler
best_models: dict[str, object] = {}     # tag ➜ fitted model

## Step 3: AutoML Training using PyCaret + tqdm Progress Tracking
Two variants are planned: with outliers and without outliers. The run below uses the data with outliers included.

We train models using PyCaret on CPU (`use_gpu=False`), over 4 × 3 = 12 configuration combinations.

In [20]:
'''
# Disabled sanity check: prints feature columns and class balance for each run.
for config_id, exclude_cols in tqdm(FEATURE_EXCLUDE_SETS.items(), desc="Feature Configs", position=0):
    for scaler_name, scaler in tqdm(SCALERS.items(), desc=f"Scalers for Config {config_id}", leave=False, position=1):
        df_exp = df.copy()
        all_excluded = MUST_EXCLUDE + exclude_cols
        
        remaining_cols = [col for col in df_exp.columns if col not in all_excluded]
        if len(remaining_cols) == 0:
            print(f"⚠️ Skipping config {config_id} + scaler {scaler_name} – no features left.")
            continue

        X = df_exp.drop(columns=all_excluded)
        y = df_exp[TARGET]

        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.20, random_state=42)

        # ✅ Feature Check
        print(f"\n📌 Config {config_id} | Scaler: {scaler_name}")
        print(f"🔸 X_train shape: {X_train.shape}")
        print(f"🔸 Feature columns: {list(X_train.columns)}")
        print(f"🔸 Null values:\n{X_train.isnull().sum()}")
        print(f"🔸 Class distribution:\n{y_train.value_counts()}")
        print("-" * 60)
'''


### Step 3.1. Experiment loop

In [23]:
for cfg_id, excl_cols in tqdm(FEATURE_EXCLUDE_SETS.items(),
                              desc="Feature Configs"):

    for scaler_name, scaler in tqdm(SCALERS.items(),
                                    desc=f"Scalers for {cfg_id}",
                                    leave=False):

        # 3-1 ▶ split -------------------------------------------------------
        X = df.drop(columns=MUST_EXCLUDE + excl_cols)
        y = df[TARGET]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, stratify=y, test_size=0.20, random_state=42
        )

        if X_tr.shape[1] == 0:             # empty feature guard
            print(f"🚫 cfg{cfg_id}|{scaler_name}: no features"); continue

        # 3-2 ▶ optional scaling -------------------------------------------
        if scaler:
            X_tr = pd.DataFrame(scaler.fit_transform(X_tr),
                                index=X_tr.index, columns=X_tr.columns)
            X_te = pd.DataFrame(scaler.transform(X_te),
                                index=X_te.index, columns=X_te.columns)

        # 3-3 ▶ PyCaret setup  --------------------------------------------
        train_df = pd.concat([X_tr.reset_index(drop=True),
                              y_tr.reset_index(drop=True)], axis=1)
        test_df  = pd.concat([X_te.reset_index(drop=True),
                              y_te.reset_index(drop=True)], axis=1)

        setup(
            data=train_df,
            test_data=test_df,     # 👈 pass *your* hold-out here
            target=TARGET,
            session_id=42,
            fold=3,
            use_gpu=False,
            html=True,
            verbose=True,
            feature_selection=False,
            index=False,
        )

        # 3-4 ▶ model search  ---------------------------------------------
        best = compare_models(include=INCLUDE_MODELS,
                              sort="AUC",   fold=3,
                              turbo=False, verbose=True)

        tag = f"cfg{cfg_id}_{scaler_name}"
        best_models[tag] = best

        # leaderboard CSV ---------------------------------------------------
        pull().assign(config=cfg_id, scaler=scaler_name) \
              .to_csv(os.path.join(RES_CSV_PATH, f"leaderboard_{tag}.csv"),
                      index=False)

        # 3-5 ▶ predict on hold-out & plots -------------------------------
        pred = predict_model(best)     # uses test_data provided in setup()

        plot_model(best, plot="confusion_matrix",   save=True)
        os.replace("Confusion Matrix.png",
                   os.path.join(CM_PATH,  f"CM_{tag}.png"))

        plot_model(best, plot="auc",                 save=True)
        os.replace("AUC.png",
                   os.path.join(AUC_PATH, f"AUC_{tag}.png"))

        # 3-6 ▶ feature importance ----------------------------------------
        fi_path = os.path.join(FI_PATH, f"FI_{tag}.png")

        def save_fi_tree():
            plot_model(best, plot="feature", save=True)
            os.replace("Feature Importance.png", fi_path)

        if hasattr(best, "feature_importances_"):
            save_fi_tree()

        elif best.__class__.__name__.startswith("CatBoost"):
            fi = best.get_feature_importance()
            top = np.argsort(fi)[::-1][:20]
            plt.figure(figsize=(6, 4))
            plt.barh(range(len(top)), fi[top][::-1])
            plt.yticks(range(len(top)), X_tr.columns[top][::-1])
            plt.tight_layout(); plt.savefig(fi_path); plt.close()

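        # 3-7 ▶ collect numeric summary -----------------------------------
        # pull() now returns the hold-out grid from predict_model(), so the CV
        # stats are read back from the leaderboard CSV saved above
        # (row 0 = best model, because the leaderboard is sorted by AUC).
        cv_metrics = pd.read_csv(
            os.path.join(RES_CSV_PATH, f"leaderboard_{tag}.csv")
        ).iloc[0]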
        results.append({
            "config":      cfg_id,
            "scaler":      scaler_name,
            "model":       best.__class__.__name__,
            "AUC_CV":      cv_metrics["AUC"],
            "Acc_CV":      cv_metrics["Accuracy"],
            "Acc_holdout": (pred[TARGET] == pred["prediction_label"]).mean(),
        })


Feature Configs:   0%|          | 0/4 [00:00<?, ?it/s]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 91)"
4,Transformed data shape,"(180, 91)"
5,Transformed train set shape,"(144, 91)"
6,Transformed test set shape,"(36, 91)"
7,Numeric features,90
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.2917,0.5221,0.2917,0.2868,0.2826,0.0556,0.0562,0.04
nb,Naive Bayes,0.2569,0.4947,0.2569,0.2322,0.2401,0.0093,0.0088,0.0067
xgboost,Extreme Gradient Boosting,0.2014,0.45,0.2014,0.1848,0.1887,-0.0648,-0.0667,0.2933
rf,Random Forest Classifier,0.1944,0.4409,0.1944,0.1936,0.1928,-0.0741,-0.0746,0.05
knn,K Neighbors Classifier,0.1875,0.4272,0.1875,0.1708,0.1748,-0.0833,-0.0844,0.01
catboost,CatBoost Classifier,0.1875,0.4267,0.1875,0.1826,0.1804,-0.0833,-0.0835,2.62
lr,Logistic Regression,0.3125,0.0,0.3125,0.3192,0.3096,0.0833,0.0843,0.0367
svm,SVM - Linear Kernel,0.2569,0.0,0.2569,0.2621,0.1969,0.0093,0.0239,0.0133
qda,Quadratic Discriminant Analysis,0.2014,0.0,0.2014,0.1482,0.1508,-0.0648,-0.0773,0.01


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.1944,0.5015,0.1944,0.2182,0.2009,-0.0741,-0.0752




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 91)"
4,Transformed data shape,"(180, 91)"
5,Transformed train set shape,"(144, 91)"
6,Transformed test set shape,"(36, 91)"
7,Numeric features,90
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.2986,0.5181,0.2986,0.2949,0.2897,0.0648,0.0656,0.04
knn,K Neighbors Classifier,0.25,0.5133,0.25,0.2391,0.227,0.0,-0.0005,0.0133
nb,Naive Bayes,0.2639,0.4914,0.2639,0.2697,0.2566,0.0185,0.0187,0.0067
xgboost,Extreme Gradient Boosting,0.2014,0.45,0.2014,0.1848,0.1887,-0.0648,-0.0667,0.1533
rf,Random Forest Classifier,0.1944,0.4422,0.1944,0.1975,0.1939,-0.0741,-0.0749,0.0467
catboost,CatBoost Classifier,0.1944,0.4228,0.1944,0.1887,0.1888,-0.0741,-0.0745,2.3233
lr,Logistic Regression,0.3611,0.0,0.3611,0.363,0.3574,0.1481,0.1491,0.0067
svm,SVM - Linear Kernel,0.3333,0.0,0.3333,0.3377,0.3323,0.1111,0.1117,0.0133
qda,Quadratic Discriminant Analysis,0.2778,0.0,0.2778,0.2821,0.2728,0.037,0.0371,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.2222,0.4907,0.2222,0.2419,0.2268,-0.037,-0.0375




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 91)"
4,Transformed data shape,"(180, 91)"
5,Transformed train set shape,"(144, 91)"
6,Transformed test set shape,"(36, 91)"
7,Numeric features,90
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.2708,0.5239,0.2708,0.2629,0.2593,0.0278,0.0285,0.0433
knn,K Neighbors Classifier,0.2431,0.516,0.2431,0.2593,0.2446,-0.0093,-0.0106,0.0133
nb,Naive Bayes,0.2639,0.4914,0.2639,0.2697,0.2566,0.0185,0.0187,0.0067
xgboost,Extreme Gradient Boosting,0.2222,0.4481,0.2222,0.2085,0.2098,-0.037,-0.0389,0.15
rf,Random Forest Classifier,0.2014,0.447,0.2014,0.2024,0.1999,-0.0648,-0.0656,0.05
catboost,CatBoost Classifier,0.2222,0.4385,0.2222,0.219,0.2158,-0.037,-0.0369,2.34
lr,Logistic Regression,0.3056,0.0,0.3056,0.3134,0.3043,0.0741,0.075,0.01
svm,SVM - Linear Kernel,0.2986,0.0,0.2986,0.3197,0.2755,0.0648,0.0692,0.0133
qda,Quadratic Discriminant Analysis,0.2153,0.0,0.2153,0.2117,0.211,-0.0463,-0.0467,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.2222,0.4856,0.2222,0.2506,0.2288,-0.037,-0.0378


Feature Configs:  25%|██▌       | 1/4 [00:32<01:36, 32.24s/it]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(144, 90)"
6,Transformed test set shape,"(36, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.2917,0.5347,0.2917,0.285,0.2819,0.0556,0.0563,0.0567
nb,Naive Bayes,0.2569,0.4951,0.2569,0.2322,0.2401,0.0093,0.0088,0.0067
rf,Random Forest Classifier,0.25,0.4535,0.25,0.249,0.2408,0.0,0.0002,0.05
xgboost,Extreme Gradient Boosting,0.2014,0.45,0.2014,0.1848,0.1887,-0.0648,-0.0667,0.1567
catboost,CatBoost Classifier,0.2014,0.4333,0.2014,0.2067,0.1978,-0.0648,-0.0654,2.3367
knn,K Neighbors Classifier,0.1875,0.4272,0.1875,0.1708,0.1748,-0.0833,-0.0844,0.0133
lr,Logistic Regression,0.3125,0.0,0.3125,0.3188,0.3099,0.0833,0.0842,0.05
svm,SVM - Linear Kernel,0.2569,0.0,0.2569,0.2621,0.1969,0.0093,0.0239,0.0133
qda,Quadratic Discriminant Analysis,0.2153,0.0,0.2153,0.1611,0.1602,-0.0463,-0.0534,0.01


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.25,0.4666,0.25,0.2487,0.2482,0.0,0.0




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(144, 90)"
6,Transformed test set shape,"(36, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.2917,0.538,0.2917,0.2861,0.2823,0.0556,0.0564,0.0633
knn,K Neighbors Classifier,0.25,0.5189,0.25,0.2651,0.2424,0.0,-0.0014,0.0133
nb,Naive Bayes,0.2639,0.4922,0.2639,0.2697,0.2566,0.0185,0.0187,0.0067
rf,Random Forest Classifier,0.2361,0.4578,0.2361,0.2229,0.2239,-0.0185,-0.0188,0.0667
xgboost,Extreme Gradient Boosting,0.2014,0.45,0.2014,0.1848,0.1887,-0.0648,-0.0667,0.17
catboost,CatBoost Classifier,0.1806,0.4358,0.1806,0.1772,0.1749,-0.0926,-0.0936,2.32
lr,Logistic Regression,0.3611,0.0,0.3611,0.3616,0.3551,0.1481,0.1495,0.02
svm,SVM - Linear Kernel,0.3542,0.0,0.3542,0.3524,0.3498,0.1389,0.1396,0.0133
qda,Quadratic Discriminant Analysis,0.2639,0.0,0.2639,0.2723,0.2603,0.0185,0.0186,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.25,0.4738,0.25,0.25,0.2492,0.0,0.0




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(144, 90)"
6,Transformed test set shape,"(36, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.2708,0.5418,0.2708,0.2646,0.2618,0.0278,0.0283,0.04
knn,K Neighbors Classifier,0.2708,0.5043,0.2708,0.2805,0.2679,0.0278,0.0282,0.0133
nb,Naive Bayes,0.2639,0.4922,0.2639,0.2697,0.2566,0.0185,0.0187,0.0067
rf,Random Forest Classifier,0.2431,0.4654,0.2431,0.2303,0.2305,-0.0093,-0.0093,0.0667
xgboost,Extreme Gradient Boosting,0.2222,0.4481,0.2222,0.2085,0.2098,-0.037,-0.0389,0.1733
catboost,CatBoost Classifier,0.2014,0.4317,0.2014,0.2033,0.1968,-0.0648,-0.0655,2.4
lr,Logistic Regression,0.3333,0.0,0.3333,0.3462,0.3351,0.1111,0.112,0.0067
svm,SVM - Linear Kernel,0.3472,0.0,0.3472,0.3727,0.2819,0.1296,0.1523,0.0133
qda,Quadratic Discriminant Analysis,0.2292,0.0,0.2292,0.2188,0.2186,-0.0278,-0.0282,0.01


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.2778,0.4959,0.2778,0.2771,0.2753,0.037,0.0372


Feature Configs:  50%|█████     | 2/4 [01:04<01:04, 32.05s/it]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(144, 90)"
6,Transformed test set shape,"(36, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.2639,0.5424,0.2639,0.2463,0.2515,0.0185,0.0189,0.0533
knn,K Neighbors Classifier,0.2569,0.497,0.2569,0.222,0.2371,0.0093,0.0093,0.0133
nb,Naive Bayes,0.2778,0.496,0.2778,0.2475,0.2573,0.037,0.0374,0.0067
rf,Random Forest Classifier,0.2153,0.452,0.2153,0.2127,0.206,-0.0463,-0.0474,0.0567
xgboost,Extreme Gradient Boosting,0.2014,0.4344,0.2014,0.1899,0.1916,-0.0648,-0.0672,0.0567
catboost,CatBoost Classifier,0.1875,0.4252,0.1875,0.1839,0.1823,-0.0833,-0.0841,2.2033
lr,Logistic Regression,0.3125,0.0,0.3125,0.3121,0.3108,0.0833,0.0836,0.0367
svm,SVM - Linear Kernel,0.2569,0.0,0.2569,0.2263,0.2218,0.0093,0.0057,0.0133
qda,Quadratic Discriminant Analysis,0.3056,0.0,0.3056,0.3335,0.2499,0.0741,0.1066,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.2778,0.4738,0.2778,0.2804,0.2754,0.037,0.0374




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(144, 90)"
6,Transformed test set shape,"(36, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.2708,0.5448,0.2708,0.256,0.2604,0.0278,0.0281,0.0433
knn,K Neighbors Classifier,0.2778,0.5343,0.2778,0.2669,0.251,0.037,0.0401,0.01
nb,Naive Bayes,0.2778,0.4928,0.2778,0.2799,0.2689,0.037,0.0373,0.0067
rf,Random Forest Classifier,0.2361,0.4591,0.2361,0.2305,0.2256,-0.0185,-0.0192,0.06
xgboost,Extreme Gradient Boosting,0.2014,0.4344,0.2014,0.1899,0.1916,-0.0648,-0.0672,0.1867
catboost,CatBoost Classifier,0.2153,0.4336,0.2153,0.2147,0.2096,-0.0463,-0.047,2.3333
lr,Logistic Regression,0.3681,0.0,0.3681,0.3698,0.3647,0.1574,0.1583,0.0067
svm,SVM - Linear Kernel,0.3333,0.0,0.3333,0.3338,0.3282,0.1111,0.1126,0.0133
qda,Quadratic Discriminant Analysis,0.25,0.0,0.25,0.25,0.2407,0.0,-0.0002,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.2778,0.4712,0.2778,0.2804,0.2754,0.037,0.0374




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 90)"
4,Transformed data shape,"(180, 90)"
5,Transformed train set shape,"(144, 90)"
6,Transformed test set shape,"(36, 90)"
7,Numeric features,89
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.2639,0.5489,0.2639,0.2456,0.2516,0.0185,0.0188,0.04
knn,K Neighbors Classifier,0.2639,0.5222,0.2639,0.2815,0.2653,0.0185,0.0186,0.0133
nb,Naive Bayes,0.2778,0.4928,0.2778,0.2799,0.2689,0.037,0.0373,0.0067
rf,Random Forest Classifier,0.2639,0.4689,0.2639,0.2691,0.257,0.0185,0.0197,0.0467
xgboost,Extreme Gradient Boosting,0.1875,0.4381,0.1875,0.1756,0.1773,-0.0833,-0.0859,0.1133
catboost,CatBoost Classifier,0.2014,0.4282,0.2014,0.1985,0.1955,-0.0648,-0.0657,2.3567
lr,Logistic Regression,0.2986,0.0,0.2986,0.3066,0.2984,0.0648,0.0655,0.0067
svm,SVM - Linear Kernel,0.2847,0.0,0.2847,0.3101,0.2447,0.0463,0.0687,0.0133
qda,Quadratic Discriminant Analysis,0.2569,0.0,0.2569,0.2548,0.2501,0.0093,0.0095,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.2778,0.4938,0.2778,0.2854,0.2756,0.037,0.0375


Feature Configs:  75%|███████▌  | 3/4 [01:35<00:31, 31.54s/it]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 89)"
4,Transformed data shape,"(180, 89)"
5,Transformed train set shape,"(144, 89)"
6,Transformed test set shape,"(36, 89)"
7,Numeric features,88
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
knn,K Neighbors Classifier,0.2569,0.497,0.2569,0.222,0.2371,0.0093,0.0093,0.0133
nb,Naive Bayes,0.2778,0.496,0.2778,0.2475,0.2573,0.037,0.0374,0.0067
et,Extra Trees Classifier,0.1944,0.4935,0.1944,0.1923,0.1902,-0.0741,-0.0751,0.04
rf,Random Forest Classifier,0.2361,0.4597,0.2361,0.2357,0.2327,-0.0185,-0.0187,0.05
catboost,CatBoost Classifier,0.1944,0.4439,0.1944,0.188,0.1861,-0.0741,-0.0752,2.41
xgboost,Extreme Gradient Boosting,0.2014,0.4346,0.2014,0.197,0.1947,-0.0648,-0.0665,0.1567
lr,Logistic Regression,0.3264,0.0,0.3264,0.3278,0.3247,0.1019,0.1022,0.0467
svm,SVM - Linear Kernel,0.2569,0.0,0.2569,0.2263,0.2218,0.0093,0.0057,0.0133
qda,Quadratic Discriminant Analysis,0.3056,0.0,0.3056,0.3345,0.2502,0.0741,0.1068,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,K Neighbors Classifier,0.3056,0.4594,0.3056,0.2926,0.2938,0.0741,0.0749




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 89)"
4,Transformed data shape,"(180, 89)"
5,Transformed train set shape,"(144, 89)"
6,Transformed test set shape,"(36, 89)"
7,Numeric features,88
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
knn,K Neighbors Classifier,0.2778,0.5174,0.2778,0.2784,0.2623,0.037,0.0386,0.0167
nb,Naive Bayes,0.2708,0.493,0.2708,0.2744,0.2635,0.0278,0.0276,0.0067
et,Extra Trees Classifier,0.1944,0.4912,0.1944,0.1942,0.1919,-0.0741,-0.0747,0.04
rf,Random Forest Classifier,0.2361,0.4619,0.2361,0.2351,0.2329,-0.0185,-0.0189,0.05
catboost,CatBoost Classifier,0.2153,0.4354,0.2153,0.2166,0.2118,-0.0463,-0.0467,2.2867
xgboost,Extreme Gradient Boosting,0.2014,0.4346,0.2014,0.197,0.1947,-0.0648,-0.0665,0.13
lr,Logistic Regression,0.3681,0.0,0.3681,0.3686,0.3611,0.1574,0.159,0.0067
svm,SVM - Linear Kernel,0.3403,0.0,0.3403,0.3412,0.3337,0.1204,0.1217,0.0133
qda,Quadratic Discriminant Analysis,0.2917,0.0,0.2917,0.2954,0.2838,0.0556,0.0561,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,K Neighbors Classifier,0.2222,0.5442,0.2222,0.2143,0.1875,-0.037,-0.0416




Unnamed: 0,Description,Value
0,Session id,42
1,Target,domain_EI
2,Target type,Multiclass
3,Original data shape,"(180, 89)"
4,Transformed data shape,"(180, 89)"
5,Transformed train set shape,"(144, 89)"
6,Transformed test set shape,"(36, 89)"
7,Numeric features,88
8,Preprocess,True
9,Imputation type,simple


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
knn,K Neighbors Classifier,0.2778,0.5268,0.2778,0.2841,0.2598,0.037,0.039,0.0133
nb,Naive Bayes,0.2708,0.493,0.2708,0.2744,0.2635,0.0278,0.0276,0.0067
et,Extra Trees Classifier,0.2014,0.4848,0.2014,0.1937,0.1943,-0.0648,-0.0656,0.04
rf,Random Forest Classifier,0.2431,0.4629,0.2431,0.2412,0.2397,-0.0093,-0.0097,0.05
xgboost,Extreme Gradient Boosting,0.2292,0.4394,0.2292,0.2305,0.2263,-0.0278,-0.0287,0.12
catboost,CatBoost Classifier,0.2153,0.4361,0.2153,0.2077,0.2066,-0.0463,-0.0471,2.2833
lr,Logistic Regression,0.3264,0.0,0.3264,0.3468,0.3285,0.1019,0.1033,0.01
svm,SVM - Linear Kernel,0.2986,0.0,0.2986,0.3139,0.2576,0.0648,0.0714,0.0133
qda,Quadratic Discriminant Analysis,0.3125,0.0,0.3125,0.3088,0.3022,0.0833,0.0841,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,K Neighbors Classifier,0.2222,0.5118,0.2222,0.3345,0.2006,-0.037,-0.0424


Feature Configs: 100%|██████████| 4/4 [02:05<00:00, 31.33s/it]


## Step 4. Save overall summary

In [24]:
results_df = pd.DataFrame(results)
results_df.to_csv("overall_results.csv", index=False)
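
Once the summary is saved, it can be reduced to the best run per feature configuration. The sketch below uses placeholder rows with made-up numbers (they are not results from this notebook) that mimic the columns collected in the experiment loop; `demo_results`, `demo_df` and the model names are hypothetical.

```python
import pandas as pd

# Placeholder rows with made-up numbers, only to show the column layout
# produced by the experiment loop; they are not results from this notebook.
demo_results = [
    {"config": 1, "scaler": "none",     "model": "ModelA", "AUC_CV": 0.50, "Acc_CV": 0.30, "Acc_holdout": 0.20},
    {"config": 1, "scaler": "standard", "model": "ModelA", "AUC_CV": 0.55, "Acc_CV": 0.28, "Acc_holdout": 0.25},
    {"config": 2, "scaler": "none",     "model": "ModelB", "AUC_CV": 0.52, "Acc_CV": 0.27, "Acc_holdout": 0.30},
]
demo_df = pd.DataFrame(demo_results)

# Row label of the highest cross-validated AUC within each feature config.
best_idx = demo_df.groupby("config")["AUC_CV"].idxmax()
best_per_config = demo_df.loc[best_idx].reset_index(drop=True)

print(best_per_config[["config", "scaler", "AUC_CV"]])
```

Selecting on the cross-validated metric rather than on hold-out accuracy keeps the 20 % hold-out set untouched as a final check.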

## 🎯 Selective Outlier Removal Based on XGBoost Feature Importance

Instead of removing all outliers across 88 features, we focus only on the top 10 features identified as most influential for classifying `domain_EI` using XGBoost.

### Method

- Extract top 10 features with highest importance scores.
- Apply Z-score based outlier detection (`|Z| > 4`) on those features only.
- Remove rows where extreme values are detected in these features.
- Visualize before/after distributions to confirm effective filtering.

This approach ensures that:
- Extreme values in the features that matter most to the model are removed.
- Rows that are extreme only in low-importance features are kept, so less data is lost.
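
The filtering rule described above can be sketched on toy data with plain NumPy. The arrays below are invented for illustration and are not taken from the dataset; the z-score uses the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

# Toy data (not from the notebook): 20 rows, 2 features.
# Column 0 has one extreme value in the last row; column 1 alternates +1 / -1.
col_a = np.array([0.0] * 19 + [100.0])
col_b = np.array([1.0, -1.0] * 10)
X_demo = np.column_stack([col_a, col_b])

# Population z-score per column (ddof=0).
z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Flag a row if ANY selected feature exceeds the threshold in absolute value.
extreme_mask = (np.abs(z) > 4).any(axis=1)
X_clean = X_demo[~extreme_mask]

print(extreme_mask.sum())   # 1
print(X_clean.shape)        # (19, 2)
```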


In [None]:
'''# Step1: Important Features
important_features = [
    'HF', 'lnLF', 'RSA_var', 'LFp_var', 'lnVLF_mssd',
    'HF_var', 'VLFp_mssd', 'rMSSD_autocorr', 'dHz_mssd', 'BPM_autocorr'
]
'''

In [None]:
'''# Step 2. Z-score calculation (top-10 important features only) and |Z| > 4 row detection
z_scores = df[important_features].apply(zscore)

extreme_mask = (np.abs(z_scores) > 4).any(axis=1)
df_outliers = df[extreme_mask]
df_cleaned = df[~extreme_mask].copy()

# Step 3. Report the rows removed from the original DataFrame
print(f"⚠️ Rows with |Z| > 4 in important features: {extreme_mask.sum()} / {len(df)}")
print("Excluded rows (name + index):")
display(df_outliers[['name']].reset_index())

print(f"✅ Cleaned data shape: {df_cleaned.shape}")
'''

In [None]:
'''# Step 4. Plotting the distribution of important features before and after removing outliers
df_before = df[important_features].copy()
df_before['source'] = 'Before'

df_after = df_cleaned[important_features].copy()
df_after['source'] = 'After'

df_plot = pd.concat([df_before, df_after])
df_plot = df_plot.melt(id_vars='source', var_name='feature', value_name='value')

plt.figure(figsize=(20,15))
sns.boxplot(data=df_plot, x='feature', y='value', hue='source')
plt.title("Top Features Distribution: Before vs After |Z| > 4 Removal")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()'''