# 🎯 Emotion-Domain Classification with PyCaret AutoML (`domain_EI`, 4 classes)

This notebook builds ML classifiers to predict the **emotion-intelligence domain** (`domain_EI`) with four labels  
(HAHV · HALV · LALV · LAHV  →  encoded as 0-3) using **PyCaret 3**.

---

## 🧩 Objectives

| Item | Details |
|------|---------|
| **Feature sets** | 4 configurations (score_EI / group_EI included or removed) |
| **Scaling options** | `none`, `standard`, `minmax` |
| **Model family** | CatBoost · XGBoost · Random Forest · Extra Trees · Logistic Regression · Naive Bayes · SVM · K-NN · QDA |
| **Total runs** | **12** (4 feature configs × 3 scalers) |
| **Validation** | Stratified **3-fold CV** inside PyCaret + **10 % hold-out** (`test_size=0.10`) |
| **Artifacts** | PNG plots (AUC, confusion matrix, feature importance) + CSV summaries |
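
The run count in the table can be checked with a short sketch. It assumes only the four feature-config IDs and three scaler names used later in the notebook, and it reuses the `cfg{id}_{scaler}` tag format from the experiment loop; the variable names here are illustrative.

```python
from itertools import product

# Four feature configurations and three scaler options, as listed in the table.
feature_cfg_ids = [1, 2, 3, 4]
scaler_names = ["none", "standard", "minmax"]

# One tag per run, in the same "cfg{id}_{scaler}" form used for output file names.
run_tags = [f"cfg{cfg}_{sc}" for cfg, sc in product(feature_cfg_ids, scaler_names)]

print(len(run_tags))
print(run_tags)
```

Each tag corresponds to one leaderboard CSV and one set of plots.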

---

## 🛠️ Pipeline Overview

| Step | Description |
|------|-------------|
| **1 — Split** | Apply feature-exclusion rules, then stratified 90 / 10 train-test split (`test_size=0.10`) |
| **2 — Scaling** | Transform numeric features with the chosen scaler (`none` / `StandardScaler` / `MinMaxScaler`) |
| **3 — AutoML** | `compare_models()` evaluates the candidate models via CV and returns the best estimator |
| **4 — Hold-out Evaluation** | Predict on the reserved test set; save confusion-matrix & AUC plots |
| **5 — Result Logging** | Collect CV AUC / Accuracy and hold-out Accuracy for each run into a master CSV |
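
As a rough illustration of the split step, the sketch below applies a stratified split with the `test_size=0.10` value used in the experiment loop. The data is synthetic: 180 rows with four perfectly balanced classes is an assumption made for the demo, and the real class balance may differ.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in for the real table: 180 rows, 4 balanced classes (assumed).
y = pd.Series(np.repeat([0, 1, 2, 3], 45), name="domain_EI")
X = pd.DataFrame(rng.normal(size=(180, 5)), columns=[f"f{i}" for i in range(5)])

# stratify=y keeps the class proportions (approximately) equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.10, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().sort_index().to_dict())
```

With only 18 hold-out rows, each class contributes just a handful of test samples, so hold-out metrics move in coarse steps.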

---

## 📂 Output Artifacts

| File / Folder | Contents |
|---------------|----------|
| **`overall_results.csv`** | One-row summary per run (feature cfg · scaler · chosen model · metrics) |
| **`leaderboard_cfg*_*.csv`** | Full PyCaret leaderboard (all models, CV statistics) |
| **`CM_/`, `AUC_/`, `FI_/`** | Confusion-matrix, AUC curve and feature-importance PNGs named by run tag |
| *(optional)* `best_models.pkl` | Serialized best estimators via `joblib.dump()` |
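
The optional `best_models.pkl` artifact could be written roughly as follows. This is a minimal sketch: the `DummyClassifier` stands in for a real fitted estimator, and the temporary output folder is hypothetical. Whether every PyCaret-returned estimator pickles cleanly is not verified here.

```python
import os
import tempfile

import joblib
from sklearn.dummy import DummyClassifier

# Stand-in for the notebook's `best_models` dict (run tag -> fitted estimator).
toy_model = DummyClassifier(strategy="most_frequent").fit([[0], [1], [2]], [0, 1, 1])
best_models = {"cfg1_none": toy_model}

out_dir = tempfile.mkdtemp()                      # hypothetical output folder
pkl_path = os.path.join(out_dir, "best_models.pkl")

joblib.dump(best_models, pkl_path)                # serialize the whole dict
restored = joblib.load(pkl_path)                  # load it back

print(sorted(restored))
print(restored["cfg1_none"].predict([[5]]))
```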

---

## 🔧 Requirements

* **PyCaret 3.x** (classification module)  
* `scikit-learn`, `pandas`, `numpy`, `scipy`, `matplotlib`, `seaborn`, `joblib`, `tqdm`  
* CatBoost & XGBoost are pulled automatically with the full PyCaret install

> **Note**  
> * All stimulus-related (`Arousal`, `Valence`, etc.) and subjective self-report variables are **permanently excluded** from the feature space.  
> * The four feature configurations are designed to measure the impact of **EI score** (`score_EI`) and **EI group label** (`group_EI`) on model performance.  
> * GPU acceleration, SHAP explanations and 5-fold CV are **not** used in the current pipeline; they can be added later if needed.


## Import Libraries

In [1]:
# ---------------------------------------------------------------
#  ❖  CONFIG & PREP
# ---------------------------------------------------------------
import os, sys, types, re, joblib, warnings

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.ensemble import ExtraTreesClassifier          # used for FS
from pycaret.classification import *
warnings.filterwarnings("ignore", category=UserWarning)

from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler, MinMaxScaler

## Step 0: Define Save Paths

We define clean, relative paths for all outputs such as feature importance plots, evaluation plots, and CSV results.

In [2]:
# Base directory: project root (current notebook folder)
ROOT_PATH = os.getcwd()
RES_PATH = os.path.join(ROOT_PATH, '../res')

# DATA
DATA_PATH = os.path.join(os.getcwd(), '../data/')
EXCEL_PATH = "../data/ETRI_cardiac.xlsx"
CSV_PATH = "../data/ETRI_cardiac.csv"
DF_PATH = "../data/updated_data.csv"

# Define subfolders for saving results
FI_PATH   = os.path.join(RES_PATH, 'FI_plots')      # Feature Importance
CM_PATH   = os.path.join(RES_PATH, 'CM_plots')      # Confusion Matrix
AUC_PATH  = os.path.join(RES_PATH, 'AUC_plots')     # AUC Curves
SHAP_PATH = os.path.join(RES_PATH, 'SHAP_plots')    # SHAP Plots
RES_CSV_PATH  = os.path.join(RES_PATH, 'results_csv')   # CSV files

# Create directories if they don't exist
for path in [FI_PATH, CM_PATH, AUC_PATH, SHAP_PATH, RES_CSV_PATH]:
    os.makedirs(path, exist_ok=True)

In [3]:
df = pd.read_csv(DF_PATH)
df.head()

Unnamed: 0,name,year,score_EI,domain_EI,trainnig,group_EI,자극_Arousal,자극_Valence,Arousal,Valence,...,VLF/HF_autocorr,LF/HF_autocorr,tPow_autocorr,dPow_autocorr,dHz_autocorr,pPow_autocorr,pHz_autocorr,CohRatio_autocorr,RSA_autocorr,dHz_diff_autocorr
0,subj_01_00,2021,400,0,0,0,1,1,4,5,...,-0.567119,-0.991761,4.761754,-0.106286,-0.597611,-0.177651,-1.020465,-1.087973,1.085212,-0.595462
1,subj_01_01,2021,400,1,0,0,1,0,6,4,...,-2.854562,-1.532805,-4.013737,13.465573,-1.486156,0.136173,-0.504705,-0.481968,0.855739,-1.486156
2,subj_01_02,2021,400,3,0,0,0,1,3,6,...,-1.375726,-1.030548,0.05676,-1.111399,-1.126756,-0.961825,-1.62631,-1.110674,-0.670813,-1.127433
3,subj_01_03,2021,400,2,0,0,0,0,4,4,...,-0.198272,0.921968,-2.812514,7.058668,-0.880433,-0.866042,-0.602454,-1.700471,0.168826,-0.880433
4,subj_02_00,2021,450,0,0,0,1,1,6,7,...,0.469754,0.216407,0.900575,0.121397,0.023057,-0.702703,-0.874824,0.579896,-0.289846,0.023057


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 100 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   name               180 non-null    object 
 1   year               180 non-null    int64  
 2   score_EI           180 non-null    int64  
 3   domain_EI          180 non-null    int64  
 4   trainnig           180 non-null    int64  
 5   group_EI           180 non-null    int64  
 6   자극_Arousal         180 non-null    int64  
 7   자극_Valence         180 non-null    int64  
 8   Arousal            180 non-null    int64  
 9   Valence            180 non-null    int64  
 10  Subj_AR            180 non-null    int64  
 11  Subj_PN            180 non-null    int64  
 12  BPM                180 non-null    float64
 13  SDNN               180 non-null    float64
 14  rMSSD              180 non-null    float64
 15  VLF                180 non-null    float64
 16  LF                 180 no

## Step 1: Experiment Settings
Define the target, the excluded columns, the scalers, and the candidate models for PyCaret.

Each feature configuration is run with three scaling options: none, standard, and min-max.
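
The sketch below shows how the three scaling options behave on a tiny made-up array. The scaler is fitted on the training rows only and then applied to the test rows, which is the same order of operations the experiment loop uses; the array values are illustrative.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy data: three training rows and one test row (values are made up).
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

scalers = {"none": None, "standard": StandardScaler(), "minmax": MinMaxScaler()}

scaled = {}
for name, scaler in scalers.items():
    if scaler is None:
        # "none": pass the raw values through unchanged
        scaled[name] = (X_train, X_test)
    else:
        # fit on train only, then reuse the fitted statistics on test
        scaled[name] = (scaler.fit_transform(X_train), scaler.transform(X_test))

for name, (tr, te) in scaled.items():
    print(name, te.round(3).tolist())
```

Because the test row lies outside the training range, its min-max value exceeds 1; this is expected when the scaler never sees the test data.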

In [None]:
TARGET = "domain_EI"

# Columns that are never used as features
MUST_EXCLUDE = [
    "name", "domain_EI", "year", "trainnig",
    "Arousal", "Valence", "자극_Arousal", "자극_Valence",
    "Subj_AR", "Subj_PN"
]

# 4 feature-set flavours: keep / drop EI score / drop EI group / drop both
FEATURE_EXCLUDE_SETS = {
    1: [],
    2: ["group_EI"],
    3: ["score_EI"],
    4: ["score_EI", "group_EI"],
}

# Three scaler options
SCALERS = {
    "none":     None,
    "standard": StandardScaler(),
    "minmax":   MinMaxScaler(),
}

# Models we allow PyCaret to try
INCLUDE_MODELS = [
    "catboost", "xgboost", "rf", "et",
    "lr", "nb", "svm", "knn", "qda",
]


####  LightGBM stub

In [7]:
lgb_stub = types.ModuleType("lightgbm")

class _DummyLGBM:
    """Minimal fake LightGBM estimator so that PyCaret can import it."""
    def __init__(self, *args, **kwargs): pass
    def fit(self, *args, **kwargs): return self
    def predict(self, X, *args, **kwargs):          # label output
        return np.zeros(len(X), dtype=int)
    def predict_proba(self, X, *args, **kwargs):    # proba output
        # return 2-col dummy probs for binary; PyCaret only checks shape
        return np.zeros((len(X), 2), dtype=float)

# expose expected symbols at top level
lgb_stub.LGBMClassifier = _DummyLGBM
lgb_stub.LGBMRegressor  = _DummyLGBM
lgb_stub.Dataset        = object          # rarely used by PyCaret

basic_stub = types.ModuleType("lightgbm.basic")
basic_stub.LightGBMError = RuntimeError    # any Exception subclass works

# make `from lightgbm.basic import LightGBMError` succeed
sys.modules["lightgbm.basic"] = basic_stub

# attach sub-module to the parent package stub
lgb_stub.basic = basic_stub

sys.modules["lightgbm"] = lgb_stub
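
The stubbing technique itself can be tried in isolation. The sketch below registers a fake package under the hypothetical name `fake_booster` (chosen so it cannot clash with a real install) and then imports it the normal way; it only demonstrates the `sys.modules` mechanism and says nothing about what PyCaret actually imports from LightGBM.

```python
import sys
import types

# Hypothetical package and sub-module, built entirely in memory.
stub = types.ModuleType("fake_booster")
sub = types.ModuleType("fake_booster.basic")
sub.FakeBoosterError = RuntimeError      # any Exception subclass works
stub.basic = sub                         # attach sub-module to the parent

# Register both names so the import system finds them without touching disk.
sys.modules["fake_booster"] = stub
sys.modules["fake_booster.basic"] = sub

# Ordinary import statements now resolve to the stubs.
import fake_booster
from fake_booster.basic import FakeBoosterError

print(fake_booster is stub)
print(issubclass(FakeBoosterError, Exception))
```

A stub only helps if it is registered before the first import of the package it replaces.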

## Step 2. Result Containers

In [8]:
results:     list[dict]        = []     # one row per config+scaler
best_models: dict[str, object] = {}     # tag ➜ fitted model

## Step 3: AutoML Training using PyCaret + tqdm Progress Tracking
Two variants are compared: with outliers and without outliers. The with-outlier data is run first.

We train models using PyCaret on the CPU (`use_gpu=False`) over 4 × 3 = 12 configuration combinations.

In [9]:
'''
# Disabled sanity check: prints the feature set and class balance for each run.
for config_id, exclude_cols in tqdm(FEATURE_EXCLUDE_SETS.items(), desc="Feature Configs", position=0):
    for scaler_name, scaler in tqdm(SCALERS.items(), desc=f"Scalers for Config {config_id}", leave=False, position=1):
        df_exp = df.copy()
        all_excluded = MUST_EXCLUDE + exclude_cols
        
        remaining_cols = [col for col in df_exp.columns if col not in all_excluded]
        if len(remaining_cols) == 0:
            print(f"⚠️ Skipping config {config_id} + scaler {scaler_name} – no features left.")
            continue

        X = df_exp.drop(columns=all_excluded)
        y = df_exp[TARGET]

        # same split settings as the experiment loop
        X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.10, random_state=42)

        # ✅ Feature Check
        print(f"\n📌 Config {config_id} | Scaler: {scaler_name}")
        print(f"🔸 X_train shape: {X_train.shape}")
        print(f"🔸 Feature columns: {list(X_train.columns)}")
        print(f"🔸 Null values:\n{X_train.isnull().sum()}")
        print(f"🔸 Class distribution:\n{y_train.value_counts()}")
        print("-" * 60)
'''


### Step 3.1. Experiment loop

In [None]:
for cfg_id, excl_cols in tqdm(FEATURE_EXCLUDE_SETS.items(),
                              desc="Feature Configs"):

    for scaler_name, scaler in tqdm(SCALERS.items(),
                                    desc=f"Scalers for {cfg_id}",
                                    leave=False):

        # 3-1 ▶ split -------------------------------------------------------
        X = df.drop(columns=MUST_EXCLUDE + excl_cols)
        y = df[TARGET]
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, stratify=y, test_size=0.10, random_state=42
        )

        if X_tr.shape[1] == 0:             # empty feature guard
            print(f"🚫 cfg{cfg_id}|{scaler_name}: no features"); continue

        # 3-2 ▶ optional scaling -------------------------------------------
        if scaler:
            X_tr = pd.DataFrame(scaler.fit_transform(X_tr),
                                index=X_tr.index, columns=X_tr.columns)
            X_te = pd.DataFrame(scaler.transform(X_te),
                                index=X_te.index, columns=X_te.columns)

        # 3-3 ▶ PyCaret setup  --------------------------------------------
        train_df = pd.concat([X_tr.reset_index(drop=True),
                              y_tr.reset_index(drop=True)], axis=1)
        test_df  = pd.concat([X_te.reset_index(drop=True),
                              y_te.reset_index(drop=True)], axis=1)

        setup(
            data=train_df,
            test_data=test_df,     # 👈 pass *your* hold-out here
            target=TARGET,
            session_id=42,
            fold=3,
            use_gpu=False,
            html=True,
            verbose=True,
            feature_selection=False,
            index=False,
        )

        # 3-4 ▶ model search  ---------------------------------------------
        best = compare_models(include=INCLUDE_MODELS,
                              sort="AUC",   fold=3,
                              turbo=False, verbose=True)

        tag = f"cfg{cfg_id}_{scaler_name}"
        best_models[tag] = best

        # leaderboard CSV ---------------------------------------------------
        pull().assign(config=cfg_id, scaler=scaler_name) \
              .to_csv(os.path.join(RES_CSV_PATH, f"leaderboard_{tag}.csv"),
                      index=False)

        # 3-5 ▶ predict on hold-out & plots -------------------------------
        pred = predict_model(best)     # uses test_data provided in setup()

        plot_model(best, plot="confusion_matrix",   save=True)
        os.replace("Confusion Matrix.png",
                   os.path.join(CM_PATH,  f"CM_{tag}.png"))

        plot_model(best, plot="auc",                 save=True)
        os.replace("AUC.png",
                   os.path.join(AUC_PATH, f"AUC_{tag}.png"))

        # 3-6 ▶ feature importance ----------------------------------------
        fi_path = os.path.join(FI_PATH, f"FI_{tag}.png")

        def save_fi_tree():
            plot_model(best, plot="feature", save=True)
            os.replace("Feature Importance.png", fi_path)

        if hasattr(best, "feature_importances_"):
            save_fi_tree()

        elif best.__class__.__name__.startswith("CatBoost"):
            fi = best.get_feature_importance()
            top = np.argsort(fi)[::-1][:20]
            plt.figure(figsize=(6, 4))
            plt.barh(range(len(top)), fi[top][::-1])
            plt.yticks(range(len(top)), X_tr.columns[top][::-1])
            plt.tight_layout(); plt.savefig(fi_path); plt.close()

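        # 3-7 ▶ collect numeric summary -----------------------------------
        # pull() returns the most recent score grid, which at this point is
        # the hold-out grid from predict_model(); re-read the CV leaderboard
        # saved after compare_models() instead.
        cv_metrics = pd.read_csv(
            os.path.join(RES_CSV_PATH, f"leaderboard_{tag}.csv")
        ).iloc[0]                          # row 0 = best model's CV stats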
        results.append({
            "config":      cfg_id,
            "scaler":      scaler_name,
            "model":       best.__class__.__name__,
            "AUC_CV":      cv_metrics["AUC"],
            "Acc_CV":      cv_metrics["Accuracy"],
            "Acc_holdout": (pred[TARGET] == pred["prediction_label"]).mean(),
        })


Feature Configs:   0%|          | 0/4 [00:00<?, ?it/s]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,Arousal
2,Target type,Multiclass
3,Target mapping,"1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6"
4,Original data shape,"(180, 91)"
5,Transformed data shape,"(180, 91)"
6,Transformed train set shape,"(162, 91)"
7,Transformed test set shape,"(18, 91)"
8,Numeric features,90
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.3272,0.5576,0.3272,0.2578,0.2747,0.1371,0.1458,0.23
et,Extra Trees Classifier,0.2654,0.552,0.2654,0.2207,0.2381,0.0717,0.0732,0.2167
xgboost,Extreme Gradient Boosting,0.2346,0.5457,0.2346,0.2223,0.2237,0.0518,0.0529,0.3367
catboost,CatBoost Classifier,0.2593,0.5374,0.2593,0.2005,0.2216,0.0578,0.0598,5.6
nb,Naive Bayes,0.1049,0.4933,0.1049,0.131,0.1097,-0.0621,-0.0636,0.1933
knn,K Neighbors Classifier,0.1543,0.4606,0.1543,0.1214,0.1294,-0.0436,-0.0455,0.1967
lr,Logistic Regression,0.1975,0.0,0.1975,0.1998,0.1949,0.0134,0.0134,0.2267
svm,SVM - Linear Kernel,0.2037,0.0,0.2037,0.1758,0.1525,0.0192,0.0325,0.1967
qda,Quadratic Discriminant Analysis,0.2037,0.0,0.2037,0.0937,0.116,-0.0212,-0.0179,0.1833


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.1111,0.5281,0.1111,0.0494,0.0684,-0.125,-0.1369




Unnamed: 0,Description,Value
0,Session id,42
1,Target,Arousal
2,Target type,Multiclass
3,Target mapping,"1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6"
4,Original data shape,"(180, 91)"
5,Transformed data shape,"(180, 91)"
6,Transformed train set shape,"(162, 91)"
7,Transformed test set shape,"(18, 91)"
8,Numeric features,90
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
catboost,CatBoost Classifier,0.2716,0.5588,0.2716,0.2025,0.226,0.0674,0.0706,5.4633
rf,Random Forest Classifier,0.3148,0.5558,0.3148,0.2499,0.2656,0.1225,0.1296,0.05
et,Extra Trees Classifier,0.2654,0.5506,0.2654,0.2187,0.2365,0.0705,0.072,0.04
xgboost,Extreme Gradient Boosting,0.2346,0.5457,0.2346,0.2223,0.2237,0.0518,0.0529,0.2933
knn,K Neighbors Classifier,0.1914,0.5188,0.1914,0.1556,0.1671,-0.0023,-0.0022,0.0133
nb,Naive Bayes,0.142,0.4897,0.142,0.158,0.1422,-0.0345,-0.035,0.01
lr,Logistic Regression,0.1914,0.0,0.1914,0.1889,0.1848,-0.0024,-0.0025,0.01
svm,SVM - Linear Kernel,0.1852,0.0,0.1852,0.1765,0.1759,-0.0021,-0.0025,0.0167
qda,Quadratic Discriminant Analysis,0.1049,0.0,0.1049,0.1279,0.1071,-0.052,-0.0524,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,CatBoost Classifier,0.2222,0.547,0.2222,0.1111,0.1481,0.0079,0.0088




Unnamed: 0,Description,Value
0,Session id,42
1,Target,Arousal
2,Target type,Multiclass
3,Target mapping,"1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6"
4,Original data shape,"(180, 91)"
5,Transformed data shape,"(180, 91)"
6,Transformed train set shape,"(162, 91)"
7,Transformed test set shape,"(18, 91)"
8,Numeric features,90
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.3025,0.5563,0.3025,0.2237,0.2514,0.1085,0.1137,0.05
et,Extra Trees Classifier,0.2654,0.5516,0.2654,0.2205,0.2372,0.0718,0.0733,0.04
xgboost,Extreme Gradient Boosting,0.2346,0.5457,0.2346,0.2223,0.2237,0.0518,0.0529,0.1767
catboost,CatBoost Classifier,0.2593,0.5409,0.2593,0.2005,0.2202,0.0567,0.0592,5.5
knn,K Neighbors Classifier,0.2469,0.5372,0.2469,0.2296,0.2312,0.0738,0.0752,0.0133
nb,Naive Bayes,0.1358,0.4901,0.1358,0.1547,0.1378,-0.0403,-0.0408,0.0067
lr,Logistic Regression,0.2469,0.0,0.2469,0.1709,0.1975,0.0345,0.0362,0.01
svm,SVM - Linear Kernel,0.2407,0.0,0.2407,0.1771,0.1766,0.0253,0.0354,0.0133
qda,Quadratic Discriminant Analysis,0.142,0.0,0.142,0.1737,0.1481,-0.0185,-0.0178,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.1111,0.5288,0.1111,0.0494,0.0684,-0.125,-0.1369


Feature Configs:  25%|██▌       | 1/4 [01:10<03:31, 70.54s/it]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,Arousal
2,Target type,Multiclass
3,Target mapping,"1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6"
4,Original data shape,"(180, 90)"
5,Transformed data shape,"(180, 90)"
6,Transformed train set shape,"(162, 90)"
7,Transformed test set shape,"(18, 90)"
8,Numeric features,89
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.2469,0.5572,0.2469,0.1987,0.2177,0.0471,0.0483,0.06
et,Extra Trees Classifier,0.2654,0.5557,0.2654,0.1923,0.2198,0.0649,0.0675,0.04
catboost,CatBoost Classifier,0.2654,0.5539,0.2654,0.215,0.2274,0.0634,0.0655,5.33
xgboost,Extreme Gradient Boosting,0.2346,0.5457,0.2346,0.2223,0.2237,0.0518,0.0529,0.18
nb,Naive Bayes,0.1049,0.4931,0.1049,0.131,0.1097,-0.0621,-0.0636,0.01
knn,K Neighbors Classifier,0.1543,0.4606,0.1543,0.1214,0.1294,-0.0436,-0.0455,0.0133
lr,Logistic Regression,0.1852,0.0,0.1852,0.1809,0.179,-0.0009,-0.0011,0.04
svm,SVM - Linear Kernel,0.2037,0.0,0.2037,0.1758,0.1525,0.0192,0.0325,0.0133
qda,Quadratic Discriminant Analysis,0.179,0.0,0.179,0.1675,0.1242,-0.0407,-0.0309,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.2222,0.6425,0.2222,0.2222,0.2074,0.0233,0.0244




Unnamed: 0,Description,Value
0,Session id,42
1,Target,Arousal
2,Target type,Multiclass
3,Target mapping,"1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6"
4,Original data shape,"(180, 90)"
5,Transformed data shape,"(180, 90)"
6,Transformed train set shape,"(162, 90)"
7,Transformed test set shape,"(18, 90)"
8,Numeric features,89
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.284,0.5584,0.284,0.2392,0.2414,0.089,0.0923,0.05
rf,Random Forest Classifier,0.2284,0.5558,0.2284,0.1833,0.2006,0.0221,0.0227,0.06
catboost,CatBoost Classifier,0.284,0.5514,0.284,0.2149,0.2405,0.0871,0.0903,5.23
xgboost,Extreme Gradient Boosting,0.2346,0.5457,0.2346,0.2223,0.2237,0.0518,0.0529,0.21
knn,K Neighbors Classifier,0.216,0.5374,0.216,0.1906,0.1957,0.0295,0.0306,0.0133
nb,Naive Bayes,0.142,0.4895,0.142,0.158,0.1422,-0.0345,-0.035,0.01
lr,Logistic Regression,0.1914,0.0,0.1914,0.1812,0.1806,-0.0091,-0.0093,0.01
svm,SVM - Linear Kernel,0.1543,0.0,0.1543,0.1492,0.1481,-0.0359,-0.036,0.0133
qda,Quadratic Discriminant Analysis,0.1481,0.0,0.1481,0.1624,0.1452,-0.0136,-0.0134,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.2222,0.4665,0.2222,0.1852,0.2,0.0418,0.0423




Unnamed: 0,Description,Value
0,Session id,42
1,Target,Arousal
2,Target type,Multiclass
3,Target mapping,"1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6"
4,Original data shape,"(180, 90)"
5,Transformed data shape,"(180, 90)"
6,Transformed train set shape,"(162, 90)"
7,Transformed test set shape,"(18, 90)"
8,Numeric features,89
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
rf,Random Forest Classifier,0.2346,0.5592,0.2346,0.1819,0.202,0.0291,0.0299,0.07
et,Extra Trees Classifier,0.2593,0.5584,0.2593,0.1917,0.2171,0.0572,0.0593,0.06
catboost,CatBoost Classifier,0.2716,0.5561,0.2716,0.2169,0.2318,0.0708,0.0735,5.2767
xgboost,Extreme Gradient Boosting,0.2346,0.5457,0.2346,0.2223,0.2237,0.0518,0.0529,0.17
knn,K Neighbors Classifier,0.2407,0.5247,0.2407,0.222,0.2235,0.0732,0.0746,0.0133
nb,Naive Bayes,0.1358,0.4889,0.1358,0.1547,0.1378,-0.0403,-0.0408,0.01
lr,Logistic Regression,0.2654,0.0,0.2654,0.2009,0.2186,0.0564,0.0599,0.01
svm,SVM - Linear Kernel,0.179,0.0,0.179,0.1585,0.1245,0.0002,0.0023,0.0133
qda,Quadratic Discriminant Analysis,0.1543,0.0,0.1543,0.1795,0.1592,-0.0107,-0.0099,0.0067


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Random Forest Classifier,0.1111,0.6208,0.1111,0.0494,0.0684,-0.125,-0.1369


Feature Configs:  50%|█████     | 2/4 [02:08<02:05, 62.93s/it]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,Arousal
2,Target type,Multiclass
3,Target mapping,"1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6"
4,Original data shape,"(180, 90)"
5,Transformed data shape,"(180, 90)"
6,Transformed train set shape,"(162, 90)"
7,Transformed test set shape,"(18, 90)"
8,Numeric features,89
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.284,0.5744,0.284,0.2129,0.2405,0.0886,0.0911,0.0433
rf,Random Forest Classifier,0.2531,0.5571,0.2531,0.1935,0.2153,0.0494,0.0509,0.07
xgboost,Extreme Gradient Boosting,0.2593,0.5539,0.2593,0.2465,0.2471,0.0793,0.0806,0.1867
catboost,CatBoost Classifier,0.2654,0.5409,0.2654,0.192,0.2186,0.0628,0.0654,5.2667
nb,Naive Bayes,0.1049,0.4929,0.1049,0.1334,0.1108,-0.0605,-0.062,0.0233
knn,K Neighbors Classifier,0.1975,0.4851,0.1975,0.1658,0.1635,0.0153,0.0174,0.0133
lr,Logistic Regression,0.1667,0.0,0.1667,0.171,0.1658,-0.0194,-0.0196,0.0433
svm,SVM - Linear Kernel,0.142,0.0,0.142,0.1687,0.1481,-0.0305,-0.031,0.0167
qda,Quadratic Discriminant Analysis,0.1975,0.0,0.1975,0.1393,0.1369,0.0251,0.0365,0.01


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.2222,0.5267,0.2222,0.1397,0.1706,0.0382,0.0398




Unnamed: 0,Description,Value
0,Session id,42
1,Target,Arousal
2,Target type,Multiclass
3,Target mapping,"1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6"
4,Original data shape,"(180, 90)"
5,Transformed data shape,"(180, 90)"
6,Transformed train set shape,"(162, 90)"
7,Transformed test set shape,"(18, 90)"
8,Numeric features,89
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.3025,0.5801,0.3025,0.2404,0.2606,0.1115,0.115,0.0567
rf,Random Forest Classifier,0.2469,0.5547,0.2469,0.1891,0.2099,0.0409,0.0425,0.06
xgboost,Extreme Gradient Boosting,0.2593,0.554,0.2593,0.2465,0.2471,0.0793,0.0806,0.1267
catboost,CatBoost Classifier,0.2778,0.5472,0.2778,0.2086,0.2319,0.0792,0.083,5.19
knn,K Neighbors Classifier,0.2222,0.5234,0.2222,0.1835,0.1971,0.0339,0.035,0.0133
nb,Naive Bayes,0.142,0.4889,0.142,0.1598,0.1428,-0.0338,-0.0343,0.01
lr,Logistic Regression,0.1975,0.0,0.1975,0.1939,0.1908,0.0051,0.0051,0.01
svm,SVM - Linear Kernel,0.142,0.0,0.142,0.1309,0.1315,-0.0576,-0.0584,0.0133
qda,Quadratic Discriminant Analysis,0.0988,0.0,0.0988,0.1113,0.0951,-0.0576,-0.0589,0.01


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.2222,0.5133,0.2222,0.1376,0.1697,0.0345,0.0363




Unnamed: 0,Description,Value
0,Session id,42
1,Target,Arousal
2,Target type,Multiclass
3,Target mapping,"1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6"
4,Original data shape,"(180, 90)"
5,Transformed data shape,"(180, 90)"
6,Transformed train set shape,"(162, 90)"
7,Transformed test set shape,"(18, 90)"
8,Numeric features,89
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
et,Extra Trees Classifier,0.2778,0.5795,0.2778,0.2119,0.2351,0.0793,0.082,0.04
rf,Random Forest Classifier,0.2346,0.5599,0.2346,0.1736,0.1962,0.0258,0.0272,0.05
xgboost,Extreme Gradient Boosting,0.2593,0.5539,0.2593,0.2465,0.2471,0.0793,0.0806,0.1267
catboost,CatBoost Classifier,0.2654,0.5445,0.2654,0.1997,0.2234,0.0634,0.0656,5.2633
knn,K Neighbors Classifier,0.2654,0.5403,0.2654,0.253,0.2527,0.0969,0.0985,0.0133
nb,Naive Bayes,0.1358,0.4895,0.1358,0.1556,0.138,-0.0402,-0.0408,0.01
lr,Logistic Regression,0.284,0.0,0.284,0.2112,0.2341,0.0834,0.0877,0.01
svm,SVM - Linear Kernel,0.2037,0.0,0.2037,0.2702,0.1725,0.0536,0.0615,0.0133
qda,Quadratic Discriminant Analysis,0.1605,0.0,0.1605,0.1643,0.1568,-0.0053,-0.0051,0.01


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Extra Trees Classifier,0.2222,0.5094,0.2222,0.1376,0.1697,0.0345,0.0363


Feature Configs:  75%|███████▌  | 3/4 [03:05<01:00, 60.26s/it]

Unnamed: 0,Description,Value
0,Session id,42
1,Target,Arousal
2,Target type,Multiclass
3,Target mapping,"1: 0, 2: 1, 3: 2, 4: 3, 5: 4, 6: 5, 7: 6"
4,Original data shape,"(180, 89)"
5,Transformed data shape,"(180, 89)"
6,Transformed train set shape,"(162, 89)"
7,Transformed test set shape,"(18, 89)"
8,Numeric features,88
9,Preprocess,True


Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)


Processing:   0%|          | 0/41 [00:00<?, ?it/s]

## Step 4. Save overall summary

In [None]:
results_df = pd.DataFrame(results)
results_df.to_csv(os.path.join(RES_PATH,"overall_results.csv"), index=False)
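
Once the summary exists, it can be reshaped for a quick comparison across runs. The sketch below uses the same column names as the `results` rows collected in the loop, but the numbers and model names are made-up placeholders, not results from this notebook.

```python
import pandas as pd

# Placeholder rows: the values are invented purely to show the reshaping.
results = [
    {"config": 1, "scaler": "none",     "model": "ModelA", "AUC_CV": 0.50, "Acc_CV": 0.30, "Acc_holdout": 0.20},
    {"config": 1, "scaler": "standard", "model": "ModelB", "AUC_CV": 0.60, "Acc_CV": 0.35, "Acc_holdout": 0.25},
    {"config": 2, "scaler": "none",     "model": "ModelA", "AUC_CV": 0.55, "Acc_CV": 0.32, "Acc_holdout": 0.22},
    {"config": 2, "scaler": "standard", "model": "ModelC", "AUC_CV": 0.52, "Acc_CV": 0.31, "Acc_holdout": 0.21},
]
results_df = pd.DataFrame(results)

# One row per feature config, one column per scaler.
auc_grid = results_df.pivot(index="config", columns="scaler", values="AUC_CV")

# Run with the highest cross-validated AUC.
best_row = results_df.loc[results_df["AUC_CV"].idxmax()]

print(auc_grid)
print(best_row["config"], best_row["scaler"], best_row["model"])
```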

## 🎯 Selective Outlier Removal Based on XGBoost Feature Importance

Instead of removing all outliers across 88 features, we focus only on the top 10 features identified as most influential for classifying `domain_EI` using XGBoost.

### Method

- Extract top 10 features with highest importance scores.
- Apply Z-score based outlier detection (`|Z| > 4`) on those features only.
- Remove rows where extreme values are detected in these features.
- Visualize before/after distributions to confirm effective filtering.

This approach ensures:
- High-impact errors are removed.
- Valuable data is preserved from non-influential features.
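
The method above can be sketched on synthetic data. The column names `feat_a`, `feat_b` and `feat_c` are hypothetical, and the extreme values are planted by hand; only the two "important" columns are checked, so an extreme value in the third column does not remove its row.

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore

rng = np.random.default_rng(0)

# Synthetic table: 200 rows of standard-normal noise in three columns.
df_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["feat_a", "feat_b", "feat_c"])

# Plant extreme values by hand.
df_demo.loc[0, "feat_a"] = 50.0
df_demo.loc[1, "feat_b"] = -50.0
df_demo.loc[2, "feat_c"] = 50.0      # extreme, but feat_c is not checked

# Only the selected columns take part in outlier detection.
checked = ["feat_a", "feat_b"]
z = df_demo[checked].apply(zscore)

# Flag a row if any checked feature has |Z| > 4, then drop the flagged rows.
extreme_mask = (z.abs() > 4).any(axis=1)
df_clean = df_demo[~extreme_mask].copy()

print(int(extreme_mask.sum()), "rows removed;", len(df_clean), "rows kept")
```

Restricting the check to a short feature list keeps more rows than screening all columns, at the cost of leaving extreme values in the unchecked features.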


In [None]:
'''# Step 1. Important features (top 10 by XGBoost importance)
important_features = [
    'HF', 'lnLF', 'RSA_var', 'LFp_var', 'lnVLF_mssd',
    'HF_var', 'VLFp_mssd', 'rMSSD_autocorr', 'dHz_mssd', 'BPM_autocorr'
]
'''

In [None]:
'''# Step 2. Z-score calculation (top-10 important features only) and |Z| > 4 row detection
z_scores = df[important_features].apply(zscore)

extreme_mask = (np.abs(z_scores) > 4).any(axis=1)
df_outliers = df[extreme_mask]
df_cleaned = df[~extreme_mask].copy()      # outlier rows removed

# Step 3. Report which rows were removed
print(f"⚠️ Rows with |Z| > 4 in important features: {extreme_mask.sum()} / {len(df)}")
print("Excluded rows (index + name):")
display(df_outliers[['name']].reset_index())

print(f"✅ Cleaned data shape: {df_cleaned.shape}")
'''

In [None]:
'''# Step 4. Plotting the distribution of important features before and after removing outliers
df_before = df[important_features].copy()
df_before['source'] = 'Before'

df_after = df_cleaned[important_features].copy()
df_after['source'] = 'After'

df_plot = pd.concat([df_before, df_after])
df_plot = df_plot.melt(id_vars='source', var_name='feature', value_name='value')

plt.figure(figsize=(20,15))
sns.boxplot(data=df_plot, x='feature', y='value', hue='source')
plt.title("Top Features Distribution: Before vs After |Z| > 4 Removal")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()'''