# Summary of Configurations and Models

### Preprocessing & Sampling Overview
- **Preprocessing types:** 2 scalers (MinMax normalization and robust scaling), plus feature selection (Mutual Information, Boruta) and transformation options (LDA, Autoencoder, PowerTransformer, Polynomial Features)  
- **Sampling methods:** 6 (SVMSMOTE, BorderlineSMOTE, RandomOverSampler, SMOTE, SMOTE+Tomek, SMOTE+ENN)  
- **Outliers:** Can be removed or kept (optional in this code)  

### Configurations Summary
**Total configurations:** 53  

| Type | # Configurations |
|------|----------------|
| Original | 1 |
| Normalized | 7 |
| Robust | 7 |
| Mutual Information (MI) | 7 |
| Boruta | 7 |
| LDA | 7 |
| Autoencoder | 7 |
| PowerTransformer | 4 |
| Polynomial Features | 4 |
| AdaBoost | 2 |

#### Configuration Details

1. **Original**  
   - 1 configuration: `Original Data`

2. **Normalized (MinMaxScaler)**  
   - 7 configurations: `Normalized Data`, `Normalized+SVMSMOTE`, `Normalized+BorderlineSMOTE`, `Normalized+RandomOverSampler`, `Normalized+SMOTE`, `Normalized+SMOTE+Tomek`, `Normalized+SMOTE+ENN`

3. **Robust Scaler**  
   - 7 configurations: `Robust Data`, `Robust+SVMSMOTE`, `Robust+BorderlineSMOTE`, `Robust+RandomOverSampler`, `Robust+SMOTE`, `Robust+SMOTE+Tomek`, `Robust+SMOTE+ENN`

4. **Feature Selection – Mutual Information (MI)**  
   - 7 configurations: `MI`, `MI+SVMSMOTE`, `MI+BorderlineSMOTE`, `MI+RandomOverSampler`, `MI+SMOTE`, `MI+SMOTE+Tomek`, `MI+SMOTE+ENN`

5. **Feature Selection – Boruta**  
   - 7 configurations: `Boruta`, `Boruta+SVMSMOTE`, `Boruta+BorderlineSMOTE`, `Boruta+RandomOverSampler`, `Boruta+SMOTE`, `Boruta+SMOTE+Tomek`, `Boruta+SMOTE+ENN`

6. **LDA**  
   - 7 configurations: `LDA`, `LDA+SVMSMOTE`, `LDA+BorderlineSMOTE`, `LDA+RandomOverSampler`, `LDA+SMOTE`, `LDA+SMOTE+Tomek`, `LDA+SMOTE+ENN`

7. **Autoencoder**  
   - 7 configurations: `Autoencoder`, `Autoencoder+SVMSMOTE`, `Autoencoder+BorderlineSMOTE`, `Autoencoder+RandomOverSampler`, `Autoencoder+SMOTE`, `Autoencoder+SMOTE+Tomek`, `Autoencoder+SMOTE+ENN`

8. **PowerTransformer**  
   - 4 configurations: `PowerTransformer`, `PowerTransformer+SMOTE`, `PowerTransformer+SMOTE+Tomek`, `PowerTransformer+SMOTE+ENN`

9. **Polynomial Features**  
   - 4 configurations: `PolynomialFeatures`, `PolynomialFeatures+SMOTE`, `PolynomialFeatures+SMOTE+Tomek`, `PolynomialFeatures+SMOTE+ENN`

10. **AdaBoost**  
    - 2 configurations: `AdaBoost Normalized+SMOTE Balanced`, `AdaBoost Normalized+SMOTE Balanced+Calibrated`  
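
As a quick cross-check of the total, the list above can be regenerated programmatically. This is only a sketch: the base labels are shortened (for example `Normalized` instead of `Normalized Data`), and the helper `expand` is a hypothetical name, not something defined in this notebook.

```python
# Six samplers applied to the scaler / feature-selection / LDA / autoencoder bases
SAMPLERS = ["SVMSMOTE", "BorderlineSMOTE", "RandomOverSampler",
            "SMOTE", "SMOTE+Tomek", "SMOTE+ENN"]
# PowerTransformer and PolynomialFeatures only use the SMOTE family
SMOTE_ONLY = ["SMOTE", "SMOTE+Tomek", "SMOTE+ENN"]


def expand(base, samplers):
    """Return the unsampled configuration plus one configuration per sampler."""
    return [base] + [f"{base}+{s}" for s in samplers]


config_names = ["Original Data"]
for base in ["Normalized", "Robust", "MI", "Boruta", "LDA", "Autoencoder"]:
    config_names += expand(base, SAMPLERS)
for base in ["PowerTransformer", "PolynomialFeatures"]:
    config_names += expand(base, SMOTE_ONLY)
config_names += ["AdaBoost Normalized+SMOTE Balanced",
                 "AdaBoost Normalized+SMOTE Balanced+Calibrated"]

print(len(config_names))  # 1 + 6*7 + 2*4 + 2 = 53
```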

---

## Models Used
| Model | # Variants / Notes |
|-------|------------------|
| Logistic Regression | 2 |
| AdaBoost | 3 (1 default + 2 calibrated) |
| Gradient Boosting | 2 |
| CatBoost | 7 (Default + Calibrated + 5 variants) |
| LightGBM | 2 |
| XGBoost | 1 |
| Extra Trees | 1 |
| KNN | 1 |
| MLP | 1 |
| Random Forest | 1 |


---

## Future Plan

1️⃣ **Initial Model Evaluation:**  
- Train all models across all configurations to check accuracy.  
- Select the **top 10 configurations** and evaluate all metrics for these.

2️⃣ **Ensembling & Stacking:**  
- Train selected models using stacking, ensembling, and voting on the top 10 configurations.  
- Measure all required metrics.  

✅ **Final Step:**  
- Sort results based on **accuracy and other metrics** to determine the best performing models.
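
The ensembling step could look roughly like the sketch below. It runs on synthetic data from `make_classification`, not on the sleep dataset, and the choice of base learners, the soft-voting setting, and the logistic-regression meta-learner are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic 3-class stand-in for one of the top configurations
X_demo, y_demo = make_classification(n_samples=300, n_features=8, n_informative=5,
                                     n_classes=3, n_clusters_per_class=1,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          random_state=42, stratify=y_demo)

base_learners = [
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=42)),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ("lr", LogisticRegression(max_iter=1000)),
]

# Soft voting averages predicted probabilities; stacking trains a meta-learner
# on out-of-fold predictions of the base learners.
voting = VotingClassifier(estimators=base_learners, voting="soft")
stacking = StackingClassifier(estimators=base_learners,
                              final_estimator=LogisticRegression(max_iter=1000),
                              cv=5)

results = {}
for name, model in {"voting": voting, "stacking": stacking}.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = {
        "accuracy": accuracy_score(y_te, pred),
        "macro_f1": f1_score(y_te, pred, average="macro"),
    }

print(results)
```

Reporting macro F1 next to accuracy is one way to cover the "other metrics" mentioned above, since accuracy alone can hide weak performance on the smaller classes.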


## Top 10 Model Configurations by Test Accuracy

| Rank | Model                 | Configuration             | Test Accuracy |
|------|----------------------|--------------------------|---------------|
| 1    | Gradient Boosting     | Boruta + SMOTE + Tomek   | 0.9733        |
| 2    | Gradient Boosting     | PowerTransformer + SMOTE | 0.9733        |
| 3    | XGBoost               | MI + SMOTE + Tomek       | 0.9733        |
| 4    | XGBoost               | LDA                      | 0.9733        |
| 5    | Extra Trees           | MI                       | 0.9733        |
| 6    | Extra Trees           | MI + SMOTE + Tomek       | 0.9733        |
| 7    | Random Forest         | MI + SMOTE + Tomek       | 0.9733        |
| 8    | Random Forest         | Boruta + SMOTE + Tomek   | 0.9733        |
| 9    | Logistic Regression   | Robust + SVMSMOTE        | 0.9600        |
| 10   | Logistic Regression   | Robust + RandomOverSampler | 0.9600      |
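
Only two distinct accuracy values appear in the table, so it helps to know how coarse the accuracy scale is. The sketch below searches for the smallest test-set size at which both reported values are exact fractions (to four decimals). The function name and tolerance are illustrative, and the result is an inference from the rounded numbers, not a figure stated in this notebook.

```python
def smallest_test_size(reported, max_n=1000, tol=5e-5):
    """Smallest n such that every reported accuracy equals k/n to 4 decimals."""
    for n in range(1, max_n + 1):
        if all(any(abs(k / n - acc) < tol for k in range(n + 1))
               for acc in reported):
            return n
    return None


n_test = smallest_test_size([0.9733, 0.9600])
print(n_test)  # 75, i.e. 73/75 = 0.9733 and 72/75 = 0.9600
```

If the test split is that small, ranks 1 to 8 are an exact tie and ranks 9 and 10 trail them by a single misclassified sample, so the ordering within the table should not be over-interpreted.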

---

## Top 5 Configurations Only

1. Boruta + SMOTE + Tomek  
2. PowerTransformer + SMOTE  
3. MI + SMOTE + Tomek  
4. LDA  
5. MI


In [None]:


!pip install boruta category_encoders xgboost catboost

In [None]:
!pip uninstall -y scikit-learn imbalanced-learn

!pip install scikit-learn==1.4.2 imbalanced-learn==0.12.0


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, RobustScaler, StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.feature_selection import mutual_info_classif, SelectKBest
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler, SMOTENC
from imblearn.combine import SMOTETomek, SMOTEENN
from imblearn.under_sampling import CondensedNearestNeighbour, TomekLinks, RandomUnderSampler
from boruta import BorutaPy
from keras.models import Model, Sequential
from keras.layers import Input, Dense
from keras.optimizers import Adam

import warnings
warnings.filterwarnings('ignore')

In [None]:
from imblearn.over_sampling import RandomOverSampler, SMOTE, BorderlineSMOTE, SVMSMOTE, ADASYN
from imblearn.under_sampling import RandomUnderSampler, ClusterCentroids, NearMiss, TomekLinks
from imblearn.combine import SMOTETomek, SMOTEENN

# configs

# Preprocessing 1

In [None]:
'''

df = pd.read_csv("/kaggle/input/sleep-health-and-lifestyle-dataset/Sleep_health_and_lifestyle_dataset.csv")
df.fillna("None", inplace=True)

# Dividing Blood Pressure into Systolic and Diastolic BP
df[['Systolic BP', 'Diastolic BP']] = df['Blood Pressure'].str.split('/', expand=True).astype(int)
df.drop(['Person ID', 'Blood Pressure'], axis=1, inplace=True)

# Group the less common occupations under a single "Other" label
df['Occupation'] = df['Occupation'].replace(['Manager', 'Sales Representative', 'Scientist', 'Software Engineer'], 'Other')

# Map each BMI category to a representative numeric BMI value
df['BMI Category'] = df['BMI Category'].replace({'Normal':22, 'Normal Weight':22, 'Overweight':27, 'Obese':30})

# Creating Interaction features
df['Stress_sleep_interaction'] = df['Stress Level'] / df['Quality of Sleep']
df['BMI_Activity'] = df['BMI Category'] * df['Physical Activity Level']
df['Sleep_Heart_ratio'] = df['Sleep Duration'] / df['Heart Rate']
df['Sleep_Steps_ratio'] = df['Sleep Duration'] / df['Daily Steps']
df['Sleep_Stress_ratio'] = df['Sleep Duration'] / df['Stress Level']

df = pd.get_dummies(df, columns=['Occupation'], drop_first=False)

label_encoder = LabelEncoder()
columns = ['Gender', 'Sleep Disorder']
for col in columns:
  df[col] = label_encoder.fit_transform(df[col])

num_col = ['Age', 'Sleep Duration', 'Quality of Sleep', 'Physical Activity Level', 'Stress Level', 'Stress_sleep_interaction',
          'Sleep_Heart_ratio', 'Sleep_Steps_ratio', 'Sleep_Stress_ratio', 'Heart Rate', 'Daily Steps',
           'Systolic BP', 'Diastolic BP']


X = df.drop('Sleep Disorder', axis=1)
y = df['Sleep Disorder']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)


'''

# Preprocessing 2

In [None]:
# ==============================
# Combined Feature Engineering Pipeline
# ==============================
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# ------------------------------
# Step 1: Load dataset
# ------------------------------
df = pd.read_csv("/kaggle/input/sleep-health-and-lifestyle-dataset/Sleep_health_and_lifestyle_dataset.csv")
df.fillna("None", inplace=True)

# ------------------------------
# Step 2: Basic preprocessing
# ------------------------------
# Split Blood Pressure
df[['Systolic BP', 'Diastolic BP']] = df['Blood Pressure'].str.split('/', expand=True).astype(int)
df.drop(['Person ID', 'Blood Pressure'], axis=1, inplace=True)

# Group rare occupations
df['Occupation'] = df['Occupation'].replace(['Manager', 'Sales Representative', 'Scientist', 'Software Engineer'], 'Other')

# Map BMI categories to numeric
df['BMI Category'] = df['BMI Category'].replace({'Normal': 22, 'Normal Weight': 22, 'Overweight': 27, 'Obese': 30})

# Label encode categorical columns
label_encoder = LabelEncoder()
for col in ['Gender', 'Sleep Disorder']:
    df[col] = label_encoder.fit_transform(df[col])

# One-hot encode Occupation
df = pd.get_dummies(df, columns=['Occupation'], drop_first=False)

# ------------------------------
# Step 3: Feature Engineering
# ------------------------------
epsilon = 1e-6

# ---- Ratio & difference features ----
df['Sleep_Heart_ratio'] = df['Sleep Duration'] / (df['Heart Rate'] + epsilon)
df['Sleep_Steps_ratio'] = df['Sleep Duration'] / (df['Daily Steps'] + epsilon)
df['Stress_Activity_ratio'] = df['Stress Level'] / (df['Physical Activity Level'] + epsilon)
df['Heart_Sleep_diff'] = df['Heart Rate'] - df['Sleep Duration']



# ---- Aggregation features ----
df['Total_Activity'] = df['Physical Activity Level'] + df['Daily Steps']
df['Stress_per_Activity'] = df['Stress Level'] / (df['Total_Activity'] + epsilon)

# ---- Interaction features ----
df['Sleep_Stress_interaction'] = df['Sleep Duration'] * df['Stress Level']
df['BMI_Activity_interaction'] = df['BMI Category'] * df['Physical Activity Level']

# ---- Categorical interaction features ----
df['Gender_Occupation'] = df['Gender'].astype(str) + "_" + df['Occupation_Other'].astype(str)
df = pd.get_dummies(df, columns=['Gender_Occupation'], drop_first=False)

# ------------------------------
# Step 4: Prepare features & target
# ------------------------------
X = df.drop('Sleep Disorder', axis=1)
y = df['Sleep Disorder']

# ------------------------------
# Step 5: Train-test split
# ------------------------------
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# ------------------------------
# Step 6: Summary
# ------------------------------
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train distribution:\n", y_train.value_counts())
print("y_test distribution:\n", y_test.value_counts())
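
To make two of the steps above concrete, here is a small sketch on a two-row toy frame. The values are invented for illustration and the `toy` frame is not part of the pipeline. It shows the blood-pressure split and what the `epsilon` guard does when a denominator is zero.

```python
import pandas as pd

# Hypothetical rows; the second one has zero steps on purpose
toy = pd.DataFrame({
    "Blood Pressure": ["126/83", "140/90"],
    "Sleep Duration": [6.1, 5.9],
    "Daily Steps": [4200, 0],
})

# "126/83" -> two integer columns
toy[["Systolic BP", "Diastolic BP"]] = (
    toy["Blood Pressure"].str.split("/", expand=True).astype(int)
)

# epsilon prevents division by zero, but a zero denominator still yields
# a very large (finite) ratio rather than a missing value
epsilon = 1e-6
toy["Sleep_Steps_ratio"] = toy["Sleep Duration"] / (toy["Daily Steps"] + epsilon)

print(toy[["Systolic BP", "Diastolic BP", "Sleep_Steps_ratio"]])
```

The guard keeps the column finite, but a row with a zero denominator ends up several orders of magnitude above the rest, which matters for MinMax scaling in particular.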


In [None]:
print(y.isna().sum())

# Outlier removal (optional)

In [None]:
# Calculate Q1, Q3, and IQR for the numerical columns in X_train.
# Disabled by default; `num_col` is defined only in the commented-out "Preprocessing 1" cell.
'''
Q1 = X_train[num_col].quantile(0.25)
Q3 = X_train[num_col].quantile(0.75)
IQR = Q3 - Q1

# Remove training rows with any value outside [Q1 - 2*IQR, Q3 + 2*IQR]
X_train = X_train[~((X_train[num_col] < (Q1 - 2 * IQR)) | (X_train[num_col] > (Q3 + 2 * IQR))).any(axis=1)]
y_train = y_train.loc[X_train.index]

'''
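
A self-contained version of the same filter is sketched below on invented data. The function name `drop_iqr_outliers` and the demo columns are assumptions for illustration. The multiplier defaults to 2, matching the disabled cell.

```python
import pandas as pd


def drop_iqr_outliers(X, y, cols, k=2.0):
    """Drop rows where any column in `cols` lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = X[cols].quantile(0.25)
    q3 = X[cols].quantile(0.75)
    iqr = q3 - q1
    outside = (X[cols] < q1 - k * iqr) | (X[cols] > q3 + k * iqr)
    keep = ~outside.any(axis=1)
    # Index X and y with the same boolean mask so they stay aligned
    return X.loc[keep], y.loc[keep]


# Hypothetical training data with one implausible heart rate
X_demo = pd.DataFrame({"Heart Rate": [68, 70, 72, 71, 69, 140],
                       "Age": [30, 35, 40, 45, 50, 55]})
y_demo = pd.Series([0, 0, 1, 1, 0, 1])

X_clean, y_clean = drop_iqr_outliers(X_demo, y_demo, ["Heart Rate", "Age"])
print(X_clean.shape)  # (5, 2)
```

As in the disabled cell, the quantiles come from the training split only, and the test split is left untouched.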

# Original (robust/normalized) + sampling

In [None]:
# ============================================
# MinMaxScaler Configurations
# ============================================
scaler = MinMaxScaler()
X_train_normalized = scaler.fit_transform(X_train)
X_test_normalized = scaler.transform(X_test)

# MinMaxScaler + SVMSMOTE
svmsmote_normalized = SVMSMOTE(random_state=42, k_neighbors=5)
X_train_normalized_svmsmote, y_train_normalized_svmsmote = svmsmote_normalized.fit_resample(X_train_normalized, y_train)
X_test_normalized_svmsmote = X_test_normalized.copy()

# MinMaxScaler + BorderlineSMOTE
bordersmote_normalized = BorderlineSMOTE(random_state=42, k_neighbors=5)
X_train_normalized_bordersmote, y_train_normalized_bordersmote = bordersmote_normalized.fit_resample(X_train_normalized, y_train)
X_test_normalized_bordersmote = X_test_normalized.copy()

# MinMaxScaler + RandomOverSampler
ros_normalized = RandomOverSampler(random_state=42)
X_train_normalized_ros, y_train_normalized_ros = ros_normalized.fit_resample(X_train_normalized, y_train)
X_test_normalized_ros = X_test_normalized.copy()

# MinMaxScaler + SMOTE
smote_normalized = SMOTE(random_state=42, k_neighbors=5)
X_train_normalized_smote, y_train_normalized_smote = smote_normalized.fit_resample(X_train_normalized, y_train)
X_test_normalized_smote = X_test_normalized.copy()

# MinMaxScaler + SMOTE+Tomek
smotetomek_normalized = SMOTETomek(random_state=42)
X_train_normalized_smotetomek, y_train_normalized_smotetomek = smotetomek_normalized.fit_resample(X_train_normalized, y_train)
X_test_normalized_smotetomek = X_test_normalized.copy()

# MinMaxScaler + SMOTE+ENN
smoteenn_normalized = SMOTEENN(random_state=42)
X_train_normalized_smoteenn, y_train_normalized_smoteenn = smoteenn_normalized.fit_resample(X_train_normalized, y_train)
X_test_normalized_smoteenn = X_test_normalized.copy()

# ============================================
# RobustScaler Configurations
# ============================================
scaler = RobustScaler()
X_train_robust = scaler.fit_transform(X_train)
X_test_robust = scaler.transform(X_test)

# RobustScaler + SVMSMOTE
svmsmote_robust = SVMSMOTE(random_state=42, k_neighbors=5)
X_train_robust_svmsmote, y_train_robust_svmsmote = svmsmote_robust.fit_resample(X_train_robust, y_train)
X_test_robust_svmsmote = X_test_robust.copy()

# RobustScaler + BorderlineSMOTE
bordersmote_robust = BorderlineSMOTE(random_state=42, k_neighbors=5)
X_train_robust_bordersmote, y_train_robust_bordersmote = bordersmote_robust.fit_resample(X_train_robust, y_train)
X_test_robust_bordersmote = X_test_robust.copy()

# RobustScaler + RandomOverSampler
ros_robust = RandomOverSampler(random_state=42)
X_train_robust_ros, y_train_robust_ros = ros_robust.fit_resample(X_train_robust, y_train)
X_test_robust_ros = X_test_robust.copy()

# RobustScaler + SMOTE
smote_robust = SMOTE(random_state=42, k_neighbors=5)
X_train_robust_smote, y_train_robust_smote = smote_robust.fit_resample(X_train_robust, y_train)
X_test_robust_smote = X_test_robust.copy()

# RobustScaler + SMOTE+Tomek
smotetomek_robust = SMOTETomek(random_state=42)
X_train_robust_smotetomek, y_train_robust_smotetomek = smotetomek_robust.fit_resample(X_train_robust, y_train)
X_test_robust_smotetomek = X_test_robust.copy()

# RobustScaler + SMOTE+ENN
smoteenn_robust = SMOTEENN(random_state=42)
X_train_robust_smoteenn, y_train_robust_smoteenn = smoteenn_robust.fit_resample(X_train_robust, y_train)
X_test_robust_smoteenn = X_test_robust.copy()
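
The pattern repeated in the cell above is: fit the scaler on the training rows, resample only the training rows, and leave the test rows as they are. The sketch below illustrates that pattern without imbalanced-learn. It uses `sklearn.utils.resample` as a simple stand-in for `RandomOverSampler`, which is not what the notebook itself calls, and all names and data are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils import resample


def scale_then_oversample(X_train, y_train, X_test, random_state=42):
    """Fit the scaler on training rows only, then balance the training set."""
    y_train = np.asarray(y_train)
    scaler = MinMaxScaler().fit(X_train)
    X_tr, X_te = scaler.transform(X_train), scaler.transform(X_test)

    classes, counts = np.unique(y_train, return_counts=True)
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        X_cls = X_tr[y_train == cls]
        extra = counts.max() - count
        if extra > 0:
            # Duplicate randomly chosen rows until this class matches the majority
            dupes = resample(X_cls, replace=True, n_samples=int(extra),
                             random_state=random_state)
            X_cls = np.vstack([X_cls, dupes])
        X_parts.append(X_cls)
        y_parts.append(np.full(len(X_cls), cls))
    # The test set is scaled but never resampled
    return np.vstack(X_parts), np.concatenate(y_parts), X_te


rng = np.random.default_rng(0)
X_tr_demo = rng.normal(size=(60, 3))
y_tr_demo = np.array([0] * 40 + [1] * 15 + [2] * 5)
X_te_demo = rng.normal(size=(10, 3))

X_bal, y_bal, X_te_scaled = scale_then_oversample(X_tr_demo, y_tr_demo, X_te_demo)
print(np.bincount(y_bal), X_te_scaled.shape)
```

Because the test matrix is identical for every sampler, one shared scaled copy per scaler would serve all six sampled configurations.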



# Original - RobustScaler → Mutual Information → LDA



## basic

In [None]:
scaler = RobustScaler()
X_train_robust = scaler.fit_transform(X_train)
X_test_robust = scaler.transform(X_test)

# Feature selection using Mutual Information
mi = SelectKBest(score_func=mutual_info_classif, k=5)
X_train_mi = mi.fit_transform(X_train_robust, y_train)
X_test_mi = mi.transform(X_test_robust)

# Applying Linear Discriminant Analysis (LDA)
lda = LinearDiscriminantAnalysis(n_components=2)
X_train_lda = lda.fit_transform(X_train_mi, y_train)
X_test_lda = lda.transform(X_test_mi)
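
The same three steps can also be chained in a scikit-learn `Pipeline`, which keeps the fit-on-train, transform-on-test discipline in one object. The sketch below runs on synthetic three-class data, not the sleep dataset. Passing `random_state` to `mutual_info_classif` through `functools.partial` is an addition here: the estimator adds a small amount of random noise internally, so an unseeded run may not select the same five features every time.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in: 3 classes, so LDA can produce at most 2 components
X_demo, y_demo = make_classification(n_samples=200, n_features=12, n_informative=6,
                                     n_classes=3, n_clusters_per_class=1,
                                     random_state=42)

reducer = Pipeline([
    ("scale", RobustScaler()),
    ("select", SelectKBest(score_func=partial(mutual_info_classif, random_state=42),
                           k=5)),
    ("lda", LinearDiscriminantAnalysis(n_components=2)),
])

X_reduced = reducer.fit_transform(X_demo, y_demo)
print(X_reduced.shape)  # (200, 2)
```

Note that `n_components=2` only works because there are three classes; LDA allows at most `n_classes - 1` components.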

## basic+sampling

In [None]:
from imblearn.over_sampling import SVMSMOTE, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.combine import SMOTETomek, SMOTEENN
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# ============================================
# Configuration 1: SVMSMOTE
# ============================================
svmsmote = SVMSMOTE(random_state=42, k_neighbors=5)
X_train_svmsmote_temp, y_train_svmsmote = svmsmote.fit_resample(X_train, y_train)

scaler_svmsmote = RobustScaler()
X_train_svmsmote_scaled = scaler_svmsmote.fit_transform(X_train_svmsmote_temp)
X_test_svmsmote_scaled = scaler_svmsmote.transform(X_test)

mi_svmsmote = SelectKBest(score_func=mutual_info_classif, k=5)
X_train_svmsmote_mi = mi_svmsmote.fit_transform(X_train_svmsmote_scaled, y_train_svmsmote)
X_test_svmsmote_mi = mi_svmsmote.transform(X_test_svmsmote_scaled)

lda_svmsmote = LinearDiscriminantAnalysis(n_components=2)
X_train_lda_svmsmote = lda_svmsmote.fit_transform(X_train_svmsmote_mi, y_train_svmsmote)
X_test_lda_svmsmote = lda_svmsmote.transform(X_test_svmsmote_mi)

# ============================================
# Configuration 2: BorderlineSMOTE
# ============================================
bordersmote = BorderlineSMOTE(random_state=42, k_neighbors=5)
X_train_bordersmote_temp, y_train_bordersmote = bordersmote.fit_resample(X_train, y_train)

scaler_bordersmote = RobustScaler()
X_train_bordersmote_scaled = scaler_bordersmote.fit_transform(X_train_bordersmote_temp)
X_test_bordersmote_scaled = scaler_bordersmote.transform(X_test)

mi_bordersmote = SelectKBest(score_func=mutual_info_classif, k=5)
X_train_bordersmote_mi = mi_bordersmote.fit_transform(X_train_bordersmote_scaled, y_train_bordersmote)
X_test_bordersmote_mi = mi_bordersmote.transform(X_test_bordersmote_scaled)

lda_bordersmote = LinearDiscriminantAnalysis(n_components=2)
X_train_lda_bordersmote = lda_bordersmote.fit_transform(X_train_bordersmote_mi, y_train_bordersmote)
X_test_lda_bordersmote = lda_bordersmote.transform(X_test_bordersmote_mi)

# ============================================
# Configuration 3: RandomOverSampler
# ============================================
ros = RandomOverSampler(random_state=42)
X_train_ros_temp, y_train_ros = ros.fit_resample(X_train, y_train)

scaler_ros = RobustScaler()
X_train_ros_scaled = scaler_ros.fit_transform(X_train_ros_temp)
X_test_ros_scaled = scaler_ros.transform(X_test)

mi_ros = SelectKBest(score_func=mutual_info_classif, k=5)
X_train_ros_mi = mi_ros.fit_transform(X_train_ros_scaled, y_train_ros)
X_test_ros_mi = mi_ros.transform(X_test_ros_scaled)

lda_ros = LinearDiscriminantAnalysis(n_components=2)
X_train_lda_ros = lda_ros.fit_transform(X_train_ros_mi, y_train_ros)
X_test_lda_ros = lda_ros.transform(X_test_ros_mi)

# ============================================
# Configuration 4: SMOTE
# ============================================
smote = SMOTE(random_state=42, k_neighbors=5)
X_train_smote_temp, y_train_smote = smote.fit_resample(X_train, y_train)

scaler_smote = RobustScaler()
X_train_smote_scaled = scaler_smote.fit_transform(X_train_smote_temp)
X_test_smote_scaled = scaler_smote.transform(X_test)

mi_smote = SelectKBest(score_func=mutual_info_classif, k=5)
X_train_smote_mi = mi_smote.fit_transform(X_train_smote_scaled, y_train_smote)
X_test_smote_mi = mi_smote.transform(X_test_smote_scaled)

lda_smote = LinearDiscriminantAnalysis(n_components=2)
X_train_lda_smote = lda_smote.fit_transform(X_train_smote_mi, y_train_smote)
X_test_lda_smote = lda_smote.transform(X_test_smote_mi)

# ============================================
# Configuration 5: SMOTE + Tomek
# ============================================
smote_tomek = SMOTETomek(random_state=42)
X_train_smotetomek_temp, y_train_smotetomek = smote_tomek.fit_resample(X_train, y_train)

scaler_smotetomek = RobustScaler()
X_train_smotetomek_scaled = scaler_smotetomek.fit_transform(X_train_smotetomek_temp)
X_test_smotetomek_scaled = scaler_smotetomek.transform(X_test)

mi_smotetomek = SelectKBest(score_func=mutual_info_classif, k=5)
X_train_smotetomek_mi = mi_smotetomek.fit_transform(X_train_smotetomek_scaled, y_train_smotetomek)
X_test_smotetomek_mi = mi_smotetomek.transform(X_test_smotetomek_scaled)

lda_smotetomek = LinearDiscriminantAnalysis(n_components=2)
X_train_lda_smotetomek = lda_smotetomek.fit_transform(X_train_smotetomek_mi, y_train_smotetomek)
X_test_lda_smotetomek = lda_smotetomek.transform(X_test_smotetomek_mi)

# ============================================
# Configuration 6: SMOTE + ENN
# ============================================
smote_enn = SMOTEENN(random_state=42)
X_train_smoteenn_temp, y_train_smoteenn = smote_enn.fit_resample(X_train, y_train)

scaler_smoteenn = RobustScaler()
X_train_smoteenn_scaled = scaler_smoteenn.fit_transform(X_train_smoteenn_temp)
X_test_smoteenn_scaled = scaler_smoteenn.transform(X_test)

mi_smoteenn = SelectKBest(score_func=mutual_info_classif, k=5)
X_train_smoteenn_mi = mi_smoteenn.fit_transform(X_train_smoteenn_scaled, y_train_smoteenn)
X_test_smoteenn_mi = mi_smoteenn.transform(X_test_smoteenn_scaled)

lda_smoteenn = LinearDiscriminantAnalysis(n_components=2)
X_train_lda_smoteenn = lda_smoteenn.fit_transform(X_train_smoteenn_mi, y_train_smoteenn)
X_test_lda_smoteenn = lda_smoteenn.transform(X_test_smoteenn_mi)

# ============================================
# Summary
# ============================================
print("Configuration shapes:")
print(f"Original LDA - Train: {X_train_lda.shape}, Test: {X_test_lda.shape}")
print(f"SVMSMOTE - Train: {X_train_lda_svmsmote.shape}, Test: {X_test_lda_svmsmote.shape}")
print(f"BorderlineSMOTE - Train: {X_train_lda_bordersmote.shape}, Test: {X_test_lda_bordersmote.shape}")
print(f"RandomOverSampler - Train: {X_train_lda_ros.shape}, Test: {X_test_lda_ros.shape}")
print(f"SMOTE - Train: {X_train_lda_smote.shape}, Test: {X_test_lda_smote.shape}")
print(f"SMOTE+Tomek - Train: {X_train_lda_smotetomek.shape}, Test: {X_test_lda_smotetomek.shape}")
print(f"SMOTE+ENN - Train: {X_train_lda_smoteenn.shape}, Test: {X_test_lda_smoteenn.shape}")

configs = {
    'LDA_Original': {
        'X_train': X_train_lda,
        'y_train': y_train,
        'X_test': X_test_lda
    },
    'LDA_SVMSMOTE': {
        'X_train': X_train_lda_svmsmote,
        'y_train': y_train_svmsmote,
        'X_test': X_test_lda_svmsmote
    },
    'LDA_BorderlineSMOTE': {
        'X_train': X_train_lda_bordersmote,
        'y_train': y_train_bordersmote,
        'X_test': X_test_lda_bordersmote
    },
    'LDA_RandomOverSampler': {
        'X_train': X_train_lda_ros,
        'y_train': y_train_ros,
        'X_test': X_test_lda_ros
    },
    'LDA_SMOTE': {
        'X_train': X_train_lda_smote,
        'y_train': y_train_smote,
        'X_test': X_test_lda_smote
    },
    'LDA_SMOTE+Tomek': {
        'X_train': X_train_lda_smotetomek,
        'y_train': y_train_smotetomek,
        'X_test': X_test_lda_smotetomek
    },
    'LDA_SMOTE+ENN': {
        'X_train': X_train_lda_smoteenn,
        'y_train': y_train_smoteenn,
        'X_test': X_test_lda_smoteenn
    }
}
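
A dictionary in this shape can be consumed by a single evaluation loop. The sketch below is one possible way to do that. The helper `evaluate_configs` is hypothetical, the demo data is random, and `y_test` is passed separately because the entries above do not store it.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


def evaluate_configs(configs, y_test, model):
    """Fit a fresh clone of `model` per configuration; return accuracies, best first."""
    scores = {}
    for name, data in configs.items():
        fitted = clone(model).fit(data["X_train"], data["y_train"])
        scores[name] = accuracy_score(y_test, fitted.predict(data["X_test"]))
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Tiny random stand-in with the same keys as the dictionary above
rng = np.random.default_rng(0)
X_tr = rng.normal(size=(80, 2))
y_tr = (X_tr[:, 0] > 0).astype(int)
X_te = rng.normal(size=(20, 2))
y_te = (X_te[:, 0] > 0).astype(int)

demo_configs = {
    "raw": {"X_train": X_tr, "y_train": y_tr, "X_test": X_te},
    "first_feature_only": {"X_train": X_tr[:, :1], "y_train": y_tr,
                           "X_test": X_te[:, :1]},
}

scores = evaluate_configs(demo_configs, y_te, LogisticRegression(max_iter=1000))
print(scores)
```

Cloning the estimator for each configuration avoids carrying fitted state from one configuration into the next.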

# Original - MinMaxScaler → Boruta → Autoencoder

## basic

In [None]:
# Imports
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# -----------------------------
# RandomForest classifier (for Boruta feature selection)
# -----------------------------
scaler = MinMaxScaler()
rfc = RandomForestClassifier(n_estimators=100, random_state=42)

# Normalize data
X_train_normalized = scaler.fit_transform(X_train)
X_test_normalized = scaler.transform(X_test)

# -----------------------------
# Boruta Feature Selection
# -----------------------------
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=0, random_state=42)

X_train_boruta = boruta_selector.fit_transform(X_train_normalized, y_train)
X_test_boruta = boruta_selector.transform(X_test_normalized)

# -----------------------------
# Autoencoder Architecture
# -----------------------------
n_features = X_train_boruta.shape[1]
input_layer = Input(shape=(n_features,))

# Encoder
encoded = Dense(32, activation='relu')(input_layer)
bottleneck = Dense(16, activation='relu')(encoded)

# Decoder
decoded = Dense(32, activation='relu')(bottleneck)
output_layer = Dense(n_features, activation='sigmoid')(decoded)

# Full autoencoder model
autoencoder = Model(inputs=input_layer, outputs=output_layer)
autoencoder.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')

# Train the autoencoder
autoencoder.fit(X_train_boruta, X_train_boruta, epochs=10, batch_size=32, verbose=0)

# Encoder model to extract bottleneck features
encoder = Model(inputs=input_layer, outputs=bottleneck)

# Transform data
X_train_encoded = encoder.predict(X_train_boruta)
X_test_encoded = encoder.predict(X_test_boruta)
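
For a sense of scale, the layer sizes above translate into the parameter counts computed below. This is plain arithmetic (weights plus biases per `Dense` layer). The input width of 10 is a hypothetical value, since the real width depends on how many features Boruta keeps.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


def autoencoder_params(n_features, hidden=32, bottleneck=16):
    """Total trainable parameters of the n -> 32 -> 16 -> 32 -> n autoencoder."""
    sizes = [n_features, hidden, bottleneck, hidden, n_features]
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))


print(autoencoder_params(10))  # 352 + 528 + 544 + 330 = 1754
```

One consequence worth keeping in mind: if Boruta keeps fewer than 16 features, the 16-unit bottleneck is wider than the input, so the encoder expands the representation instead of compressing it.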


## basic+sampling

In [None]:
from imblearn.over_sampling import SVMSMOTE, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.combine import SMOTETomek, SMOTEENN
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.optimizers import Adam

# ============================================
# Configuration 1: SVMSMOTE
# ============================================
svmsmote = SVMSMOTE(random_state=42, k_neighbors=5)
X_train_svmsmote_temp, y_train_svmsmote = svmsmote.fit_resample(X_train, y_train)

scaler_svmsmote = MinMaxScaler()
X_train_svmsmote_normalized = scaler_svmsmote.fit_transform(X_train_svmsmote_temp)
X_test_svmsmote_normalized = scaler_svmsmote.transform(X_test)

rfc_svmsmote = RandomForestClassifier(n_estimators=100, random_state=42)
boruta_svmsmote = BorutaPy(rfc_svmsmote, n_estimators='auto', verbose=0, random_state=42)
X_train_svmsmote_boruta = boruta_svmsmote.fit_transform(X_train_svmsmote_normalized, y_train_svmsmote)
X_test_svmsmote_boruta = boruta_svmsmote.transform(X_test_svmsmote_normalized)

n_features_svmsmote = X_train_svmsmote_boruta.shape[1]
input_layer_svmsmote = Input(shape=(n_features_svmsmote,))
encoded_svmsmote = Dense(32, activation='relu')(input_layer_svmsmote)
bottleneck_svmsmote = Dense(16, activation='relu')(encoded_svmsmote)
decoded_svmsmote = Dense(32, activation='relu')(bottleneck_svmsmote)
output_layer_svmsmote = Dense(n_features_svmsmote, activation='sigmoid')(decoded_svmsmote)

autoencoder_svmsmote = Model(inputs=input_layer_svmsmote, outputs=output_layer_svmsmote)
autoencoder_svmsmote.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
autoencoder_svmsmote.fit(X_train_svmsmote_boruta, X_train_svmsmote_boruta, epochs=10, batch_size=32, verbose=0)

encoder_svmsmote = Model(inputs=input_layer_svmsmote, outputs=bottleneck_svmsmote)
X_train_encoded_svmsmote = encoder_svmsmote.predict(X_train_svmsmote_boruta)
X_test_encoded_svmsmote = encoder_svmsmote.predict(X_test_svmsmote_boruta)

# ============================================
# Configuration 2: BorderlineSMOTE
# ============================================
bordersmote = BorderlineSMOTE(random_state=42, k_neighbors=5)
X_train_bordersmote_temp, y_train_bordersmote = bordersmote.fit_resample(X_train, y_train)

scaler_bordersmote = MinMaxScaler()
X_train_bordersmote_normalized = scaler_bordersmote.fit_transform(X_train_bordersmote_temp)
X_test_bordersmote_normalized = scaler_bordersmote.transform(X_test)

rfc_bordersmote = RandomForestClassifier(n_estimators=100, random_state=42)
boruta_bordersmote = BorutaPy(rfc_bordersmote, n_estimators='auto', verbose=0, random_state=42)
X_train_bordersmote_boruta = boruta_bordersmote.fit_transform(X_train_bordersmote_normalized, y_train_bordersmote)
X_test_bordersmote_boruta = boruta_bordersmote.transform(X_test_bordersmote_normalized)

n_features_bordersmote = X_train_bordersmote_boruta.shape[1]
input_layer_bordersmote = Input(shape=(n_features_bordersmote,))
encoded_bordersmote = Dense(32, activation='relu')(input_layer_bordersmote)
bottleneck_bordersmote = Dense(16, activation='relu')(encoded_bordersmote)
decoded_bordersmote = Dense(32, activation='relu')(bottleneck_bordersmote)
output_layer_bordersmote = Dense(n_features_bordersmote, activation='sigmoid')(decoded_bordersmote)

autoencoder_bordersmote = Model(inputs=input_layer_bordersmote, outputs=output_layer_bordersmote)
autoencoder_bordersmote.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
autoencoder_bordersmote.fit(X_train_bordersmote_boruta, X_train_bordersmote_boruta, epochs=10, batch_size=32, verbose=0)

encoder_bordersmote = Model(inputs=input_layer_bordersmote, outputs=bottleneck_bordersmote)
X_train_encoded_bordersmote = encoder_bordersmote.predict(X_train_bordersmote_boruta)
X_test_encoded_bordersmote = encoder_bordersmote.predict(X_test_bordersmote_boruta)

# ============================================
# Configuration 3: RandomOverSampler
# ============================================
ros = RandomOverSampler(random_state=42)
X_train_ros_temp, y_train_ros = ros.fit_resample(X_train, y_train)

scaler_ros = MinMaxScaler()
X_train_ros_normalized = scaler_ros.fit_transform(X_train_ros_temp)
X_test_ros_normalized = scaler_ros.transform(X_test)

rfc_ros = RandomForestClassifier(n_estimators=100, random_state=42)
boruta_ros = BorutaPy(rfc_ros, n_estimators='auto', verbose=0, random_state=42)
X_train_ros_boruta = boruta_ros.fit_transform(X_train_ros_normalized, y_train_ros)
X_test_ros_boruta = boruta_ros.transform(X_test_ros_normalized)

n_features_ros = X_train_ros_boruta.shape[1]
input_layer_ros = Input(shape=(n_features_ros,))
encoded_ros = Dense(32, activation='relu')(input_layer_ros)
bottleneck_ros = Dense(16, activation='relu')(encoded_ros)
decoded_ros = Dense(32, activation='relu')(bottleneck_ros)
output_layer_ros = Dense(n_features_ros, activation='sigmoid')(decoded_ros)

autoencoder_ros = Model(inputs=input_layer_ros, outputs=output_layer_ros)
autoencoder_ros.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
autoencoder_ros.fit(X_train_ros_boruta, X_train_ros_boruta, epochs=10, batch_size=32, verbose=0)

encoder_ros = Model(inputs=input_layer_ros, outputs=bottleneck_ros)
X_train_encoded_ros = encoder_ros.predict(X_train_ros_boruta)
X_test_encoded_ros = encoder_ros.predict(X_test_ros_boruta)

# ============================================
# Configuration 4: SMOTE
# ============================================
smote = SMOTE(random_state=42, k_neighbors=5)
X_train_smote_temp, y_train_smote = smote.fit_resample(X_train, y_train)

scaler_smote = MinMaxScaler()
X_train_smote_normalized = scaler_smote.fit_transform(X_train_smote_temp)
X_test_smote_normalized = scaler_smote.transform(X_test)

rfc_smote = RandomForestClassifier(n_estimators=100, random_state=42)
boruta_smote = BorutaPy(rfc_smote, n_estimators='auto', verbose=0, random_state=42)
X_train_smote_boruta = boruta_smote.fit_transform(X_train_smote_normalized, y_train_smote)
X_test_smote_boruta = boruta_smote.transform(X_test_smote_normalized)

n_features_smote = X_train_smote_boruta.shape[1]
input_layer_smote = Input(shape=(n_features_smote,))
encoded_smote = Dense(32, activation='relu')(input_layer_smote)
bottleneck_smote = Dense(16, activation='relu')(encoded_smote)
decoded_smote = Dense(32, activation='relu')(bottleneck_smote)
output_layer_smote = Dense(n_features_smote, activation='sigmoid')(decoded_smote)

autoencoder_smote = Model(inputs=input_layer_smote, outputs=output_layer_smote)
autoencoder_smote.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
autoencoder_smote.fit(X_train_smote_boruta, X_train_smote_boruta, epochs=10, batch_size=32, verbose=0)

encoder_smote = Model(inputs=input_layer_smote, outputs=bottleneck_smote)
X_train_encoded_smote = encoder_smote.predict(X_train_smote_boruta)
X_test_encoded_smote = encoder_smote.predict(X_test_smote_boruta)

# ============================================
# Configuration 5: SMOTE + Tomek
# ============================================
smote_tomek = SMOTETomek(random_state=42)
X_train_smotetomek_temp, y_train_smotetomek = smote_tomek.fit_resample(X_train, y_train)

scaler_smotetomek = MinMaxScaler()
X_train_smotetomek_normalized = scaler_smotetomek.fit_transform(X_train_smotetomek_temp)
X_test_smotetomek_normalized = scaler_smotetomek.transform(X_test)

rfc_smotetomek = RandomForestClassifier(n_estimators=100, random_state=42)
boruta_smotetomek = BorutaPy(rfc_smotetomek, n_estimators='auto', verbose=0, random_state=42)
X_train_smotetomek_boruta = boruta_smotetomek.fit_transform(X_train_smotetomek_normalized, y_train_smotetomek)
X_test_smotetomek_boruta = boruta_smotetomek.transform(X_test_smotetomek_normalized)

n_features_smotetomek = X_train_smotetomek_boruta.shape[1]
input_layer_smotetomek = Input(shape=(n_features_smotetomek,))
encoded_smotetomek = Dense(32, activation='relu')(input_layer_smotetomek)
bottleneck_smotetomek = Dense(16, activation='relu')(encoded_smotetomek)
decoded_smotetomek = Dense(32, activation='relu')(bottleneck_smotetomek)
output_layer_smotetomek = Dense(n_features_smotetomek, activation='sigmoid')(decoded_smotetomek)

autoencoder_smotetomek = Model(inputs=input_layer_smotetomek, outputs=output_layer_smotetomek)
autoencoder_smotetomek.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
autoencoder_smotetomek.fit(X_train_smotetomek_boruta, X_train_smotetomek_boruta, epochs=10, batch_size=32, verbose=0)

encoder_smotetomek = Model(inputs=input_layer_smotetomek, outputs=bottleneck_smotetomek)
X_train_encoded_smotetomek = encoder_smotetomek.predict(X_train_smotetomek_boruta)
X_test_encoded_smotetomek = encoder_smotetomek.predict(X_test_smotetomek_boruta)

# ============================================
# Configuration 6: SMOTE + ENN
# ============================================
smote_enn = SMOTEENN(random_state=42)
X_train_smoteenn_temp, y_train_smoteenn = smote_enn.fit_resample(X_train, y_train)

scaler_smoteenn = MinMaxScaler()
X_train_smoteenn_normalized = scaler_smoteenn.fit_transform(X_train_smoteenn_temp)
X_test_smoteenn_normalized = scaler_smoteenn.transform(X_test)

rfc_smoteenn = RandomForestClassifier(n_estimators=100, random_state=42)
boruta_smoteenn = BorutaPy(rfc_smoteenn, n_estimators='auto', verbose=0, random_state=42)
X_train_smoteenn_boruta = boruta_smoteenn.fit_transform(X_train_smoteenn_normalized, y_train_smoteenn)
X_test_smoteenn_boruta = boruta_smoteenn.transform(X_test_smoteenn_normalized)

n_features_smoteenn = X_train_smoteenn_boruta.shape[1]
input_layer_smoteenn = Input(shape=(n_features_smoteenn,))
encoded_smoteenn = Dense(32, activation='relu')(input_layer_smoteenn)
bottleneck_smoteenn = Dense(16, activation='relu')(encoded_smoteenn)
decoded_smoteenn = Dense(32, activation='relu')(bottleneck_smoteenn)
output_layer_smoteenn = Dense(n_features_smoteenn, activation='sigmoid')(decoded_smoteenn)

autoencoder_smoteenn = Model(inputs=input_layer_smoteenn, outputs=output_layer_smoteenn)
autoencoder_smoteenn.compile(optimizer=Adam(learning_rate=0.001), loss='mean_squared_error')
autoencoder_smoteenn.fit(X_train_smoteenn_boruta, X_train_smoteenn_boruta, epochs=10, batch_size=32, verbose=0)

encoder_smoteenn = Model(inputs=input_layer_smoteenn, outputs=bottleneck_smoteenn)
X_train_encoded_smoteenn = encoder_smoteenn.predict(X_train_smoteenn_boruta)
X_test_encoded_smoteenn = encoder_smoteenn.predict(X_test_smoteenn_boruta)
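
Each configuration above fits `MinMaxScaler` on the resampled training data and reuses it on `X_test`. The following is a minimal sketch, using made-up toy values rather than the notebook's data, of what that implies: test features are not guaranteed to stay inside [0, 1], even though the decoder's sigmoid output can only produce values in that range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (hypothetical values): one feature whose training range is [0, 10].
X_train_toy = np.array([[0.0], [5.0], [10.0]])
X_test_toy = np.array([[-2.0], [4.0], [12.0]])

scaler_toy = MinMaxScaler()
X_train_scaled = scaler_toy.fit_transform(X_train_toy)  # min/max learned from train only
X_test_scaled = scaler_toy.transform(X_test_toy)        # reuses the training min/max

print("train range:", X_train_scaled.min(), X_train_scaled.max())
print("test range: ", X_test_scaled.min(), X_test_scaled.max())
```

Only the encoder half is applied to the test set in the cells above, so out-of-range inputs do not break anything; they simply fall outside the range the autoencoder was trained to reconstruct.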

# Boruta + sampling

In [None]:
from imblearn.over_sampling import SVMSMOTE, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.combine import SMOTETomek, SMOTEENN
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from boruta import BorutaPy

# ============================================
# Boruta Feature Selection Configurations
# ============================================
scaler = MinMaxScaler()
rfc = RandomForestClassifier(n_estimators=100, random_state=42)

# Normalize data
X_train_normalized = scaler.fit_transform(X_train)
X_test_normalized = scaler.transform(X_test)

# Boruta Feature Selection (Baseline)
boruta_selector = BorutaPy(rfc, n_estimators='auto', verbose=0, random_state=42)
X_train_boruta = boruta_selector.fit_transform(X_train_normalized, y_train)
X_test_boruta = boruta_selector.transform(X_test_normalized)

# Boruta + SVMSMOTE
svmsmote_boruta = SVMSMOTE(random_state=42, k_neighbors=5)
X_train_boruta_svmsmote, y_train_boruta_svmsmote = svmsmote_boruta.fit_resample(X_train_boruta, y_train)
X_test_boruta_svmsmote = X_test_boruta.copy()

# Boruta + BorderlineSMOTE
bordersmote_boruta = BorderlineSMOTE(random_state=42, k_neighbors=5)
X_train_boruta_bordersmote, y_train_boruta_bordersmote = bordersmote_boruta.fit_resample(X_train_boruta, y_train)
X_test_boruta_bordersmote = X_test_boruta.copy()

# Boruta + RandomOverSampler
ros_boruta = RandomOverSampler(random_state=42)
X_train_boruta_ros, y_train_boruta_ros = ros_boruta.fit_resample(X_train_boruta, y_train)
X_test_boruta_ros = X_test_boruta.copy()

# Boruta + SMOTE
smote_boruta = SMOTE(random_state=42, k_neighbors=5)
X_train_boruta_smote, y_train_boruta_smote = smote_boruta.fit_resample(X_train_boruta, y_train)
X_test_boruta_smote = X_test_boruta.copy()

# Boruta + SMOTE+Tomek
smotetomek_boruta = SMOTETomek(random_state=42)
X_train_boruta_smotetomek, y_train_boruta_smotetomek = smotetomek_boruta.fit_resample(X_train_boruta, y_train)
X_test_boruta_smotetomek = X_test_boruta.copy()

# Boruta + SMOTE+ENN
smoteenn_boruta = SMOTEENN(random_state=42)
X_train_boruta_smoteenn, y_train_boruta_smoteenn = smoteenn_boruta.fit_resample(X_train_boruta, y_train)
X_test_boruta_smoteenn = X_test_boruta.copy()
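
The six near-identical blocks above could also be written as a loop over a dictionary of samplers. The sketch below is one possible shape for that, not the notebook's method. `DuplicateMinoritySampler` and `resample_all` are hypothetical names, and the sampler is a toy stand-in so the example runs without imbalanced-learn.

```python
import numpy as np


class DuplicateMinoritySampler:
    """Hypothetical stand-in for an imblearn sampler (illustration only).

    It exposes fit_resample(X, y) and balances classes by repeating rows of
    the smaller classes. It is not part of imbalanced-learn.
    """

    def __init__(self, random_state=0):
        self.random_state = random_state

    def fit_resample(self, X, y):
        rng = np.random.default_rng(self.random_state)
        classes, counts = np.unique(y, return_counts=True)
        target = counts.max()
        idx_parts = []
        for cls, count in zip(classes, counts):
            idx = np.flatnonzero(y == cls)
            # Draw extra row indices (with replacement) until the class reaches `target`.
            extra = rng.choice(idx, size=target - count, replace=True)
            idx_parts.append(np.concatenate([idx, extra]))
        order = np.concatenate(idx_parts)
        return X[order], y[order]


def resample_all(samplers, X_train_sel, y_train, X_test_sel):
    """Apply each sampler to the training data only; the test data is just copied."""
    out = {}
    for name, sampler in samplers.items():
        X_res, y_res = sampler.fit_resample(X_train_sel, y_train)
        out[name] = (X_res, X_test_sel.copy(), y_res)
    return out


# Toy data: 8 samples of class 0 and 2 samples of class 1.
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([0] * 8 + [1] * 2)
X_test_demo = np.ones((3, 2))

resampled = resample_all(
    {"Duplicate": DuplicateMinoritySampler(random_state=42)},
    X_demo, y_demo, X_test_demo,
)
X_res, X_te, y_res = resampled["Duplicate"]
print(X_res.shape, np.bincount(y_res))
```

With imbalanced-learn installed, the dictionary would instead hold the real samplers used above, for example `{"SVMSMOTE": SVMSMOTE(random_state=42, k_neighbors=5), "SMOTE": SMOTE(random_state=42, k_neighbors=5)}`.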


# MI + sampling

In [None]:
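from sklearn.preprocessing import RobustScaler
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# ============================================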
# Mutual Information Feature Selection Configurations
# ============================================
scaler = RobustScaler()
X_train_robust = scaler.fit_transform(X_train)
X_test_robust = scaler.transform(X_test)

# Mutual Information Feature Selection (Baseline)
mi = SelectKBest(score_func=mutual_info_classif, k=5)
X_train_mi = mi.fit_transform(X_train_robust, y_train)
X_test_mi = mi.transform(X_test_robust)

# MI + SVMSMOTE
svmsmote_mi = SVMSMOTE(random_state=42, k_neighbors=5)
X_train_mi_svmsmote, y_train_mi_svmsmote = svmsmote_mi.fit_resample(X_train_mi, y_train)
X_test_mi_svmsmote = X_test_mi.copy()

# MI + BorderlineSMOTE
bordersmote_mi = BorderlineSMOTE(random_state=42, k_neighbors=5)
X_train_mi_bordersmote, y_train_mi_bordersmote = bordersmote_mi.fit_resample(X_train_mi, y_train)
X_test_mi_bordersmote = X_test_mi.copy()

# MI + RandomOverSampler
ros_mi = RandomOverSampler(random_state=42)
X_train_mi_ros, y_train_mi_ros = ros_mi.fit_resample(X_train_mi, y_train)
X_test_mi_ros = X_test_mi.copy()

# MI + SMOTE
smote_mi = SMOTE(random_state=42, k_neighbors=5)
X_train_mi_smote, y_train_mi_smote = smote_mi.fit_resample(X_train_mi, y_train)
X_test_mi_smote = X_test_mi.copy()

# MI + SMOTE+Tomek
smotetomek_mi = SMOTETomek(random_state=42)
X_train_mi_smotetomek, y_train_mi_smotetomek = smotetomek_mi.fit_resample(X_train_mi, y_train)
X_test_mi_smotetomek = X_test_mi.copy()

# MI + SMOTE+ENN
smoteenn_mi = SMOTEENN(random_state=42)
X_train_mi_smoteenn, y_train_mi_smoteenn = smoteenn_mi.fit_resample(X_train_mi, y_train)
X_test_mi_smoteenn = X_test_mi.copy()
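
`mutual_info_classif` adds a small amount of random noise to continuous features, so the selected features can change between runs unless a seed is passed. The sketch below shows one way to pin the seed with `functools.partial`. The toy dataset and the name `seeded_mi` are assumptions for illustration; the cell above uses the unseeded function.

```python
from functools import partial

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy data standing in for X_train_robust / y_train.
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Hypothetical helper: the same scoring function with a fixed random_state.
seeded_mi = partial(mutual_info_classif, random_state=42)

selector_a = SelectKBest(score_func=seeded_mi, k=5).fit(X_toy, y_toy)
selector_b = SelectKBest(score_func=seeded_mi, k=5).fit(X_toy, y_toy)

print("selected columns:", selector_a.get_support(indices=True))
print("same selection on refit:", np.array_equal(selector_a.get_support(), selector_b.get_support()))
```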



# Others

In [None]:
# ===============================
# Data manipulation
# ===============================
import numpy as np
import pandas as pd

# ===============================
# Preprocessing
# ===============================
from sklearn.preprocessing import MinMaxScaler, RobustScaler, PowerTransformer, PolynomialFeatures
from sklearn.impute import SimpleImputer

# ===============================
# Sampling / Imbalance handling
# ===============================
from imblearn.over_sampling import SMOTE, SVMSMOTE, BorderlineSMOTE, RandomOverSampler
from imblearn.combine import SMOTETomek, SMOTEENN

# ===============================
# Modeling
# ===============================
from sklearn.linear_model import LogisticRegression

# ===============================
# Calibration / Threshold tuning
# ===============================
from sklearn.calibration import CalibratedClassifierCV

# ===============================
# Evaluation / Metrics
# ===============================
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix


In [None]:
# PowerTransformer (baseline)
scaler = PowerTransformer(method='yeo-johnson')
X_train_power = scaler.fit_transform(X_train)
X_test_power = scaler.transform(X_test)

# PowerTransformer + SMOTE
smote_power = SMOTE(random_state=42, k_neighbors=5)
X_train_power_smote, y_train_power_smote = smote_power.fit_resample(X_train_power, y_train)
X_test_power_smote = X_test_power.copy()

# PowerTransformer + SMOTE+Tomek
smotetomek_power = SMOTETomek(random_state=42)
X_train_power_smotetomek, y_train_power_smotetomek = smotetomek_power.fit_resample(X_train_power, y_train)
X_test_power_smotetomek = X_test_power.copy()

# PowerTransformer + SMOTE+ENN
smoteenn_power = SMOTEENN(random_state=42)
X_train_power_smoteenn, y_train_power_smoteenn = smoteenn_power.fit_resample(X_train_power, y_train)
X_test_power_smoteenn = X_test_power.copy()


# PolynomialFeatures (baseline): original features plus pairwise interaction terms
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Polynomial + SMOTE
smote_poly = SMOTE(random_state=42, k_neighbors=5)
X_train_poly_smote, y_train_poly_smote = smote_poly.fit_resample(X_train_poly, y_train)
X_test_poly_smote = X_test_poly.copy()

# Polynomial + SMOTE+Tomek
smotetomek_poly = SMOTETomek(random_state=42)
X_train_poly_smotetomek, y_train_poly_smotetomek = smotetomek_poly.fit_resample(X_train_poly, y_train)
X_test_poly_smotetomek = X_test_poly.copy()

# Polynomial + SMOTE+ENN
smoteenn_poly = SMOTEENN(random_state=42)
X_train_poly_smoteenn, y_train_poly_smoteenn = smoteenn_poly.fit_resample(X_train_poly, y_train)
X_test_poly_smoteenn = X_test_poly.copy()


# Example: train LogisticRegression directly on any preprocessed data
from sklearn.linear_model import LogisticRegression

# Logistic Regression with balanced class weight
lr_balanced = LogisticRegression(class_weight='balanced', solver='saga', max_iter=5000)
lr_balanced.fit(X_train_normalized_smote, y_train_normalized_smote)

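# Logistic Regression with balanced class weight on SMOTE-resampled normalized data.
# X_train_lr and y_train_lr are reused by the last two entries of the `configurations` list.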
smote_lr = SMOTE(random_state=42)
X_train_lr, y_train_lr = smote_lr.fit_resample(X_train_normalized, y_train)
lr_balanced_smote = LogisticRegression(class_weight='balanced', solver='saga', max_iter=5000)
lr_balanced_smote.fit(X_train_lr, y_train_lr)
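
The cell above combines `class_weight='balanced'` with SMOTE. A small sketch of how the two interact, using made-up label counts rather than the notebook's data: scikit-learn computes balanced weights as `n_samples / (n_classes * count_per_class)`, so once the classes have equal counts every weight is 1.0.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vectors (not the notebook's data).
y_imbalanced = np.array([0] * 8 + [1] * 2)  # before oversampling
y_equalized = np.array([0] * 8 + [1] * 8)   # after oversampling to equal counts

classes = np.array([0, 1])
w_before = compute_class_weight(class_weight="balanced", classes=classes, y=y_imbalanced)
w_after = compute_class_weight(class_weight="balanced", classes=classes, y=y_equalized)

print("weights before oversampling:", w_before)  # 10 / (2 * 8) and 10 / (2 * 2)
print("weights after oversampling: ", w_after)
```

With the default `sampling_strategy`, SMOTE oversamples every class except the majority up to the majority count, so on its output `class_weight='balanced'` should behave like no weighting at all. Samplers that also remove points (SMOTE+Tomek, SMOTE+ENN) can leave unequal counts, in which case the weights still have some effect.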





# Configuration

In [None]:
ML_Model = []
ML_Config = []
accuracy = []
f1 = []
recall = []
precision = []
auc_roc = []  # Adding a holder for AUC-ROC

# Function to call for storing the results
def storeResults(model, config, a, b, c, d, e):
    """
    Store model performance results

    Parameters:
    model: Name of the ML model
    config: Configuration name (preprocessing steps applied)
    a: Accuracy score
    b: F1 score
    c: Recall score
    d: Precision score
    e: AUC-ROC score
    """
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))
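
An alternative to seven parallel lists is one list of records, which cannot drift out of sync and avoids the `f1` / `f1_score` naming difference between cells. This is only a sketch: `results_log` and `store_result` are hypothetical names, and the metric values are placeholders, not results from this notebook.

```python
results_log = []  # hypothetical replacement for the parallel result lists


def store_result(model, config, accuracy, f1, recall, precision, auc_roc):
    """Append one row of rounded test metrics for a (model, configuration) pair."""
    results_log.append({
        "model": model,
        "config": config,
        "accuracy": round(accuracy, 6),
        "f1": round(f1, 6),
        "recall": round(recall, 6),
        "precision": round(precision, 6),
        "auc_roc": round(auc_roc, 6),
    })


# Placeholder numbers for illustration only.
store_result("Logistic Regression", "Example config", 0.9, 0.88, 0.87, 0.89, 0.95)
store_result("AdaBoost", "Example config", 0.85, 0.84, 0.83, 0.86, 0.91)

for row in results_log:
    print(row["model"], row["config"], row["accuracy"])
```

A list like this can be passed straight to `pd.DataFrame(results_log)` to get one row per model and configuration.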

In [None]:
configurations = [
    ('Original Data', X_train, X_test, y_train),
    
    # MinMaxScaler configurations
    ('Normalized Data', X_train_normalized, X_test_normalized, y_train),
    ('Normalized+SVMSMOTE', X_train_normalized_svmsmote, X_test_normalized_svmsmote, y_train_normalized_svmsmote),
    ('Normalized+BorderlineSMOTE', X_train_normalized_bordersmote, X_test_normalized_bordersmote, y_train_normalized_bordersmote),
    ('Normalized+RandomOverSampler', X_train_normalized_ros, X_test_normalized_ros, y_train_normalized_ros),
    ('Normalized+SMOTE', X_train_normalized_smote, X_test_normalized_smote, y_train_normalized_smote),
    ('Normalized+SMOTE+Tomek', X_train_normalized_smotetomek, X_test_normalized_smotetomek, y_train_normalized_smotetomek),
    ('Normalized+SMOTE+ENN', X_train_normalized_smoteenn, X_test_normalized_smoteenn, y_train_normalized_smoteenn),
    
    # RobustScaler configurations
    ('Robust Data', X_train_robust, X_test_robust, y_train),
    ('Robust+SVMSMOTE', X_train_robust_svmsmote, X_test_robust_svmsmote, y_train_robust_svmsmote),
    ('Robust+BorderlineSMOTE', X_train_robust_bordersmote, X_test_robust_bordersmote, y_train_robust_bordersmote),
    ('Robust+RandomOverSampler', X_train_robust_ros, X_test_robust_ros, y_train_robust_ros),
    ('Robust+SMOTE', X_train_robust_smote, X_test_robust_smote, y_train_robust_smote),
    ('Robust+SMOTE+Tomek', X_train_robust_smotetomek, X_test_robust_smotetomek, y_train_robust_smotetomek),
    ('Robust+SMOTE+ENN', X_train_robust_smoteenn, X_test_robust_smoteenn, y_train_robust_smoteenn),
    
    # Mutual Information configurations
    ('MI', X_train_mi, X_test_mi, y_train),
    ('MI+SVMSMOTE', X_train_mi_svmsmote, X_test_mi_svmsmote, y_train_mi_svmsmote),
    ('MI+BorderlineSMOTE', X_train_mi_bordersmote, X_test_mi_bordersmote, y_train_mi_bordersmote),
    ('MI+RandomOverSampler', X_train_mi_ros, X_test_mi_ros, y_train_mi_ros),
    ('MI+SMOTE', X_train_mi_smote, X_test_mi_smote, y_train_mi_smote),
    ('MI+SMOTE+Tomek', X_train_mi_smotetomek, X_test_mi_smotetomek, y_train_mi_smotetomek),
    ('MI+SMOTE+ENN', X_train_mi_smoteenn, X_test_mi_smoteenn, y_train_mi_smoteenn),
    
    # Boruta configurations
    ('Boruta', X_train_boruta, X_test_boruta, y_train),
    ('Boruta+SVMSMOTE', X_train_boruta_svmsmote, X_test_boruta_svmsmote, y_train_boruta_svmsmote),
    ('Boruta+BorderlineSMOTE', X_train_boruta_bordersmote, X_test_boruta_bordersmote, y_train_boruta_bordersmote),
    ('Boruta+RandomOverSampler', X_train_boruta_ros, X_test_boruta_ros, y_train_boruta_ros),
    ('Boruta+SMOTE', X_train_boruta_smote, X_test_boruta_smote, y_train_boruta_smote),
    ('Boruta+SMOTE+Tomek', X_train_boruta_smotetomek, X_test_boruta_smotetomek, y_train_boruta_smotetomek),
    ('Boruta+SMOTE+ENN', X_train_boruta_smoteenn, X_test_boruta_smoteenn, y_train_boruta_smoteenn),
    
    # LDA configurations
    ('LDA', X_train_lda, X_test_lda, y_train),
    ('LDA+SVMSMOTE', X_train_lda_svmsmote, X_test_lda_svmsmote, y_train_svmsmote),
    ('LDA+BorderlineSMOTE', X_train_lda_bordersmote, X_test_lda_bordersmote, y_train_bordersmote),
    ('LDA+RandomOverSampler', X_train_lda_ros, X_test_lda_ros, y_train_ros),
    ('LDA+SMOTE', X_train_lda_smote, X_test_lda_smote, y_train_smote),
    ('LDA+SMOTE+Tomek', X_train_lda_smotetomek, X_test_lda_smotetomek, y_train_smotetomek),
    ('LDA+SMOTE+ENN', X_train_lda_smoteenn, X_test_lda_smoteenn, y_train_smoteenn),
    
    # Autoencoder configurations
    ('Autoencoder', X_train_encoded, X_test_encoded, y_train),
    ('Autoencoder+SVMSMOTE', X_train_encoded_svmsmote, X_test_encoded_svmsmote, y_train_svmsmote),
    ('Autoencoder+BorderlineSMOTE', X_train_encoded_bordersmote, X_test_encoded_bordersmote, y_train_bordersmote),
    ('Autoencoder+RandomOverSampler', X_train_encoded_ros, X_test_encoded_ros, y_train_ros),
    ('Autoencoder+SMOTE', X_train_encoded_smote, X_test_encoded_smote, y_train_smote),
    ('Autoencoder+SMOTE+Tomek', X_train_encoded_smotetomek, X_test_encoded_smotetomek, y_train_smotetomek),
    ('Autoencoder+SMOTE+ENN', X_train_encoded_smoteenn, X_test_encoded_smoteenn, y_train_smoteenn),

    # PowerTransformer configurations
    ('PowerTransformer', X_train_power, X_test_power, y_train),
    ('PowerTransformer+SMOTE', X_train_power_smote, X_test_power_smote, y_train_power_smote),
    ('PowerTransformer+SMOTE+Tomek', X_train_power_smotetomek, X_test_power_smotetomek, y_train_power_smotetomek),
    ('PowerTransformer+SMOTE+ENN', X_train_power_smoteenn, X_test_power_smoteenn, y_train_power_smoteenn),

    # PolynomialFeatures configurations
    ('PolynomialFeatures', X_train_poly, X_test_poly, y_train),
    ('PolynomialFeatures+SMOTE', X_train_poly_smote, X_test_poly_smote, y_train_poly_smote),
    ('PolynomialFeatures+SMOTE+Tomek', X_train_poly_smotetomek, X_test_poly_smotetomek, y_train_poly_smotetomek),
    ('PolynomialFeatures+SMOTE+ENN', X_train_poly_smoteenn, X_test_poly_smoteenn, y_train_poly_smoteenn),

    # ============================================
    # AdaBoost configurations
    # ============================================
    ('AdaBoost Normalized+SMOTE Balanced', X_train_lr, X_test_normalized, y_train_lr),
    ('AdaBoost Normalized+SMOTE Balanced+Calibrated', X_train_lr, X_test_normalized, y_train_lr)
]
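
Several entries above pair feature matrices with label vectors created in other cells (for example the LDA and Autoencoder rows use `y_train_svmsmote`, `y_train_smote`, and so on). A quick consistency check before training can catch mismatches early. The sketch below uses toy arrays; `check_configurations` is a hypothetical helper, not part of the notebook.

```python
import numpy as np


def check_configurations(configs):
    """Return a list of human-readable shape problems (empty if all entries are consistent)."""
    problems = []
    for name, X_tr, X_te, y_tr in configs:
        if X_tr.shape[0] != len(y_tr):
            problems.append(f"{name}: {X_tr.shape[0]} training rows but {len(y_tr)} labels")
        if X_tr.shape[1] != X_te.shape[1]:
            problems.append(f"{name}: {X_tr.shape[1]} train features but {X_te.shape[1]} test features")
    return problems


# Toy entries in the same (name, X_train, X_test, y_train) layout as `configurations`.
toy_configs = [
    ("ok", np.zeros((6, 3)), np.zeros((2, 3)), np.zeros(6)),
    ("bad labels", np.zeros((6, 3)), np.zeros((2, 3)), np.zeros(5)),
    ("bad features", np.zeros((6, 3)), np.zeros((2, 4)), np.zeros(6)),
]

problems = check_configurations(toy_configs)
for problem in problems:
    print(problem)
```

Calling `check_configurations(configurations)` on the real list would report any entry whose training rows and labels, or train and test feature counts, disagree.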


# Model training

## Logistic regression

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
import pandas as pd

from imblearn.over_sampling import SVMSMOTE, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.combine import SMOTETomek, SMOTEENN
from sklearn.preprocessing import MinMaxScaler, RobustScaler


# Print all configurations
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# Evaluate Logistic Regression for each configuration
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train Logistic Regression
    log_reg = LogisticRegression(max_iter=1000, random_state=42)
    log_reg.fit(X_train_cfg, y_train_cfg)

    # Predictions. The names avoid y_train_lr, which holds the SMOTE-resampled
    # training labels used by the last two entries of `configurations`.
    y_train_pred = log_reg.predict(X_train_cfg)
    y_test_pred = log_reg.predict(X_test_cfg)
    y_train_proba = log_reg.predict_proba(X_train_cfg)
    y_test_proba = log_reg.predict_proba(X_test_cfg)

    # Compute metrics
    metrics_dict = {
        "Dataset": ["Training", "Test"],
        "Accuracy": [
            metrics.accuracy_score(y_train_cfg, y_train_pred),
            metrics.accuracy_score(y_test, y_test_pred),
        ],
        "F1 Score": [
            metrics.f1_score(y_train_cfg, y_train_pred, average='macro'),
            metrics.f1_score(y_test, y_test_pred, average='macro'),
        ],
        "Recall": [
            metrics.recall_score(y_train_cfg, y_train_pred, average='macro'),
            metrics.recall_score(y_test, y_test_pred, average='macro'),
        ],
        "Precision": [
            metrics.precision_score(y_train_cfg, y_train_pred, average='macro'),
            metrics.precision_score(y_test, y_test_pred, average='macro'),
        ],
        "AUC-ROC": [
            metrics.roc_auc_score(pd.get_dummies(y_train_cfg), y_train_proba, multi_class='ovr', average='macro'),
            metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro'),
        ]
    }

    df_metrics = pd.DataFrame(metrics_dict)

    # Store test-set results
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    storeResults(
        'Logistic Regression',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# List to store results
results = []

# Evaluate Logistic Regression for each configuration
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
    log_reg = LogisticRegression(max_iter=1000, random_state=42)
    log_reg.fit(X_train_cfg, y_train_cfg)

    y_test_pred = log_reg.predict(X_test_cfg)

    test_accuracy = metrics.accuracy_score(y_test, y_test_pred)
    
    # Append the name and test accuracy
    results.append((name, test_accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")

1. Robust+SVMSMOTE: Test Accuracy = 0.9600
2. Robust+RandomOverSampler: Test Accuracy = 0.9600
3. PowerTransformer: Test Accuracy = 0.9600
4. PowerTransformer+SMOTE: Test Accuracy = 0.9600
5. PowerTransformer+SMOTE+Tomek: Test Accuracy = 0.9600
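
Several configurations tie on test accuracy in the output above. Python's `sorted` is stable, also with `reverse=True`, so tied entries keep the order they have in the `configurations` list. If a more meaningful tie-break is wanted, a second metric can be added to the sort key. The sketch below uses made-up names and scores, not results from this notebook.

```python
# (name, test accuracy, macro F1): placeholder values for illustration only.
demo_results = [
    ("Config A", 0.96, 0.95),
    ("Config B", 0.90, 0.91),
    ("Config C", 0.96, 0.97),
]

# Accuracy only: A and C tie, so they keep their original list order.
by_accuracy = sorted(demo_results, key=lambda r: r[1], reverse=True)

# Accuracy first, then macro F1 as a tie-breaker.
by_accuracy_then_f1 = sorted(demo_results, key=lambda r: (r[1], r[2]), reverse=True)

print([r[0] for r in by_accuracy])          # ['Config A', 'Config C', 'Config B']
print([r[0] for r in by_accuracy_then_f1])  # ['Config C', 'Config A', 'Config B']
```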

## AdaBoost

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn import metrics
import pandas as pd

from imblearn.over_sampling import SVMSMOTE, BorderlineSMOTE, RandomOverSampler, SMOTE
from imblearn.combine import SMOTETomek, SMOTEENN
from sklearn.preprocessing import MinMaxScaler, RobustScaler

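# Result holders. Note: re-creating these lists discards anything stored by
# earlier cells (e.g. the Logistic Regression results).
ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []  # Adding a holder for AUC-ROC

# Function to call for storing the results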
def storeResults(model, config, a, b, c, d, e):
    """
    Store model performance results

    Parameters:
    model: Name of the ML model
    config: Configuration name (preprocessing steps applied)
    a: Accuracy score
    b: F1 score
    c: Recall score
    d: Precision score
    e: AUC-ROC score
    """
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))


# Print all configurations
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# Evaluate AdaBoost for each configuration
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train AdaBoost
    adb = AdaBoostClassifier(random_state=42)
    adb.fit(X_train_cfg, y_train_cfg)

    # Predictions
    y_train_pred = adb.predict(X_train_cfg)
    y_test_pred = adb.predict(X_test_cfg)
    y_train_proba = adb.predict_proba(X_train_cfg)
    y_test_proba = adb.predict_proba(X_test_cfg)

    # Compute metrics
    metrics_dict = {
        "Dataset": ["Training", "Test"],
        "Accuracy": [
            metrics.accuracy_score(y_train_cfg, y_train_pred),
            metrics.accuracy_score(y_test, y_test_pred),
        ],
        "F1 Score": [
            metrics.f1_score(y_train_cfg, y_train_pred, average='macro'),
            metrics.f1_score(y_test, y_test_pred, average='macro'),
        ],
        "Recall": [
            metrics.recall_score(y_train_cfg, y_train_pred, average='macro'),
            metrics.recall_score(y_test, y_test_pred, average='macro'),
        ],
        "Precision": [
            metrics.precision_score(y_train_cfg, y_train_pred, average='macro'),
            metrics.precision_score(y_test, y_test_pred, average='macro'),
        ],
        "AUC-ROC": [
            metrics.roc_auc_score(pd.get_dummies(y_train_cfg), y_train_proba, multi_class='ovr', average='macro'),
            metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro'),
        ]
    }

    df_metrics = pd.DataFrame(metrics_dict)

    # Store test-set results
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    storeResults(
        'AdaBoost',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# Evaluate test accuracy and sort configurations
results = []
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
    adb = AdaBoostClassifier(random_state=42)
    adb.fit(X_train_cfg, y_train_cfg)
    y_test_pred = adb.predict(X_test_cfg)
    test_accuracy = metrics.accuracy_score(y_test, y_test_pred)
    results.append((name, test_accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")
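
The cell above fits AdaBoost twice per configuration: once for the full metrics and once more for the accuracy ranking. A sketch of fitting once and reusing the scores is shown below. It uses the Iris dataset purely as a stand-in, and `evaluate_once` is a hypothetical helper, not the notebook's code.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split


def evaluate_once(model, X_tr, y_tr, X_te, y_te):
    """Fit the model a single time and return its test-set metrics."""
    model.fit(X_tr, y_tr)
    y_pred = model.predict(X_te)
    return {
        "accuracy": metrics.accuracy_score(y_te, y_pred),
        "f1_macro": metrics.f1_score(y_te, y_pred, average="macro"),
    }


# Stand-in data; the notebook would loop over `configurations` instead.
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.5, stratify=y_iris, random_state=42
)

scores = {
    "Toy config": evaluate_once(AdaBoostClassifier(random_state=42), X_tr, y_tr, X_te, y_te),
}

# The ranking reuses the stored scores, so no model is refitted.
ranking = sorted(scores.items(), key=lambda kv: kv[1]["accuracy"], reverse=True)
for i, (name, s) in enumerate(ranking, 1):
    print(f"{i}. {name}: Test Accuracy = {s['accuracy']:.4f}")
```

Besides halving the training time, this guarantees that the ranked accuracy and the stored metrics come from the same fitted model.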


## AdaBoost calibrated

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# -----------------------------
# Function to store results
# -----------------------------
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# -----------------------------
# Print all configurations
# -----------------------------
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# -----------------------------
# Evaluate AdaBoost for each configuration (with calibration)
# -----------------------------
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train AdaBoost
    adb = AdaBoostClassifier(random_state=42)
    adb_cal = CalibratedClassifierCV(adb, cv=5, method='sigmoid')
    adb_cal.fit(X_train_cfg, y_train_cfg)

    # Predictions
    y_train_pred = adb_cal.predict(X_train_cfg)
    y_test_pred = adb_cal.predict(X_test_cfg)
    y_train_proba = adb_cal.predict_proba(X_train_cfg)
    y_test_proba = adb_cal.predict_proba(X_test_cfg)

    # Compute AUC-ROC
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')

    # Store results
    storeResults(
        'AdaBoost (Calibrated)',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

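# -----------------------------
# Evaluate test accuracy and sort configurations
# NOTE: this loop retrains a plain (uncalibrated) AdaBoost, so the ranking
# printed below describes that model rather than `adb_cal`.
# -----------------------------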
results = []
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
    adb = AdaBoostClassifier(random_state=42)
    adb.fit(X_train_cfg, y_train_cfg)
    y_test_pred = adb.predict(X_test_cfg)
    test_accuracy = metrics.accuracy_score(y_test, y_test_pred)
    results.append((name, test_accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print sorted results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")


1. PolynomialFeatures+SMOTE+ENN: Test Accuracy = 0.9467
2. AdaBoost Normalized+SMOTE Balanced: Test Accuracy = 0.9467
3. AdaBoost Normalized+SMOTE Balanced+Calibrated: Test Accuracy = 0.9467
4. LDA: Test Accuracy = 0.9333
5. LDA+SVMSMOTE: Test Accuracy = 0.9333
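
The calibrated AdaBoost cell above computes the same five test metrics for every configuration. A minimal sketch of that evaluation as a reusable helper follows. It assumes synthetic three-class data from `make_classification` in place of the notebook's `configurations`, and `evaluate_classifier` is a hypothetical name, not something defined in the notebook. It also passes the class labels straight to `roc_auc_score`, which accepts them with `multi_class='ovr'`, instead of one-hot encoding them first.

```python
from sklearn import metrics
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split


def evaluate_classifier(model, X_test, y_test):
    """Return accuracy, macro F1/recall/precision and macro one-vs-rest AUC-ROC."""
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)
    return {
        "accuracy": metrics.accuracy_score(y_test, y_pred),
        "f1": metrics.f1_score(y_test, y_pred, average="macro"),
        "recall": metrics.recall_score(y_test, y_pred, average="macro"),
        "precision": metrics.precision_score(
            y_test, y_pred, average="macro", zero_division=0
        ),
        "auc_roc": metrics.roc_auc_score(
            y_test, y_proba, multi_class="ovr", average="macro"
        ),
    }


# Synthetic stand-in data (an assumption; not the notebook's dataset).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Same model construction as the cell above: sigmoid calibration with 5-fold CV.
model = CalibratedClassifierCV(
    AdaBoostClassifier(random_state=42), cv=5, method="sigmoid"
)
model.fit(X_train, y_train)

scores = evaluate_classifier(model, X_test, y_test)
for metric_name, value in scores.items():
    print(f"{metric_name}: {value:.4f}")
```

Returning a dictionary keeps the metric names attached to their values, which avoids the positional `a, b, c, d, e` arguments of `storeResults`.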

## AdaBoost calibrated 2

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# -----------------------------
# Function to store results
# -----------------------------
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# -----------------------------
# Print all configurations
# -----------------------------
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# -----------------------------
# Evaluate AdaBoost for each configuration (with calibration)
# -----------------------------
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train AdaBoost
    adb = AdaBoostClassifier(algorithm='SAMME', n_estimators=200, learning_rate=0.5, random_state=42)
    adb_cal = CalibratedClassifierCV(adb, cv=5, method='sigmoid')
    adb_cal.fit(X_train_cfg, y_train_cfg)

    # Predictions
    y_train_pred = adb_cal.predict(X_train_cfg)
    y_test_pred = adb_cal.predict(X_test_cfg)
    y_train_proba = adb_cal.predict_proba(X_train_cfg)
    y_test_proba = adb_cal.predict_proba(X_test_cfg)

    # Compute AUC-ROC
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')

    # Store results
    storeResults(
        'AdaBoost (Calibrated)',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

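# -----------------------------
# Evaluate test accuracy and sort configurations
# NOTE: this loop retrains an uncalibrated AdaBoost with default parameters,
# so the ranking printed below does not reflect the tuned `adb_cal` above.
# -----------------------------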
results = []
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
    adb = AdaBoostClassifier(random_state=42)
    adb.fit(X_train_cfg, y_train_cfg)
    y_test_pred = adb.predict(X_test_cfg)
    test_accuracy = metrics.accuracy_score(y_test, y_test_pred)
    results.append((name, test_accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print sorted results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")


1. PolynomialFeatures+SMOTE+ENN: Test Accuracy = 0.9467
2. AdaBoost Normalized+SMOTE Balanced: Test Accuracy = 0.9467
3. AdaBoost Normalized+SMOTE Balanced+Calibrated: Test Accuracy = 0.9467
4. LDA: Test Accuracy = 0.9333
5. LDA+SVMSMOTE: Test Accuracy = 0.9333

## Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# Function to store results
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# ============================================
# Gradient Boosting configurations
# ============================================
# Use the same configurations list you already have
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# Evaluate Gradient Boosting for each configuration
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train Gradient Boosting
    gbc = GradientBoostingClassifier(random_state=42)
    gbc.fit(X_train_cfg, y_train_cfg)

    # Predictions
    y_train_pred = gbc.predict(X_train_cfg)
    y_test_pred = gbc.predict(X_test_cfg)
    y_train_proba = gbc.predict_proba(X_train_cfg)
    y_test_proba = gbc.predict_proba(X_test_cfg)

    # Store results
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    storeResults(
        'GradientBoosting',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# Evaluate test accuracy and sort configurations
results = []
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
    gbc = GradientBoostingClassifier(random_state=42)
    gbc.fit(X_train_cfg, y_train_cfg)
    y_test_pred = gbc.predict(X_test_cfg)
    test_accuracy = metrics.accuracy_score(y_test, y_test_pred)
    results.append((name, test_accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")


1. Boruta+SMOTE+Tomek: Test Accuracy = 0.9733
2. PowerTransformer+SMOTE: Test Accuracy = 0.9733
3. Original Data: Test Accuracy = 0.9600
4. Normalized Data: Test Accuracy = 0.9600
5. Normalized+SMOTE: Test Accuracy = 0.9600
6. Normalized+SMOTE+Tomek: Test Accuracy = 0.9600
7. Robust Data: Test Accuracy = 0.9600
8. Robust+SMOTE+Tomek: Test Accuracy = 0.9600
9. MI: Test Accuracy = 0.9600
10. MI+SMOTE: Test Accuracy = 0.9600
11. MI+SMOTE+Tomek: Test Accuracy = 0.9600
12. Boruta: Test Accuracy = 0.9600
13. Autoencoder+SMOTE+ENN: Test Accuracy = 0.9600
14. PowerTransformer: Test Accuracy = 0.9600

## Gradient Boosting calibrated

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# -----------------------------
# Function to store results
# -----------------------------
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# -----------------------------
# Print all configurations
# -----------------------------
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# -----------------------------
# Evaluate Gradient Boosting
# -----------------------------
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train Gradient Boosting
    gbc = GradientBoostingClassifier(random_state=42)
    gbc.fit(X_train_cfg, y_train_cfg)

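    # Calibrate probabilities
    # NOTE: with cv='prefit' the calibrator is fitted on the same rows the
    # model was trained on, which tends to give over-confident probabilities;
    # a held-out calibration split is the usual choice.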
    gbc_cal = CalibratedClassifierCV(gbc, cv='prefit', method='sigmoid')
    gbc_cal.fit(X_train_cfg, y_train_cfg)

    # Predictions
    y_train_pred = gbc_cal.predict(X_train_cfg)
    y_test_pred = gbc_cal.predict(X_test_cfg)
    y_train_proba = gbc_cal.predict_proba(X_train_cfg)
    y_test_proba = gbc_cal.predict_proba(X_test_cfg)

    # Compute AUC-ROC
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')

    # Store results
    storeResults(
        'GradientBoosting (Calibrated)',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

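# -----------------------------
# Evaluate test accuracy and sort configurations
# NOTE: this loop retrains an uncalibrated GradientBoostingClassifier, so the
# ranking printed below describes that model rather than `gbc_cal`.
# -----------------------------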
results = []
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
    gbc = GradientBoostingClassifier(random_state=42)
    gbc.fit(X_train_cfg, y_train_cfg)
    y_test_pred = gbc.predict(X_test_cfg)
    test_accuracy = metrics.accuracy_score(y_test, y_test_pred)
    results.append((name, test_accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print sorted results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")


1. Boruta+SMOTE+Tomek: Test Accuracy = 0.9733
2. PowerTransformer+SMOTE: Test Accuracy = 0.9733
3. Original Data: Test Accuracy = 0.9600
4. Normalized Data: Test Accuracy = 0.9600
5. Normalized+SMOTE: Test Accuracy = 0.9600
6. Normalized+SMOTE+Tomek: Test Accuracy = 0.9600
7. Robust Data: Test Accuracy = 0.9600
8. Robust+SMOTE+Tomek: Test Accuracy = 0.9600
9. MI: Test Accuracy = 0.9600
10. MI+SMOTE: Test Accuracy = 0.9600
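
The calibration cell above fits `CalibratedClassifierCV` with `cv='prefit'` on the same rows used to train the model. A minimal sketch of one alternative is shown here: cross-validated calibration, where each calibrator is fitted on rows its model did not see. It assumes synthetic three-class data from `make_classification`, not the notebook's dataset, and it is one common option rather than the notebook's method.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the notebook's dataset).
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# With cv=5, each fold trains a clone of the model on 4/5 of the training rows
# and fits the sigmoid calibrator on the remaining 1/5. The final prediction
# averages the calibrated probabilities of the five (model, calibrator) pairs.
gbc_cal = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=42), cv=5, method="sigmoid"
)
gbc_cal.fit(X_train, y_train)

test_proba = gbc_cal.predict_proba(X_test)
print(test_proba.shape)
print(np.allclose(test_proba.sum(axis=1), 1.0))
```

The unfitted estimator is passed in, so no separate `fit` call on the base model is needed before calibration.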

## CatBoost calibrated

In [None]:
from catboost import CatBoostClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# -----------------------------
# Function to store results
# -----------------------------
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# -----------------------------
# Print all configurations
# -----------------------------
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# -----------------------------
# Evaluate CatBoost for each configuration
# -----------------------------
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train CatBoost
    cbc = CatBoostClassifier(verbose=0, random_state=42)
    cbc.fit(X_train_cfg, y_train_cfg)

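    # Calibrate probabilities
    # NOTE: with cv='prefit' the calibrator is fitted on the same rows the
    # model was trained on, which tends to give over-confident probabilities;
    # a held-out calibration split is the usual choice.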
    cbc_cal = CalibratedClassifierCV(cbc, cv='prefit', method='sigmoid')
    cbc_cal.fit(X_train_cfg, y_train_cfg)

    # Predictions
    y_train_pred = cbc_cal.predict(X_train_cfg)
    y_test_pred = cbc_cal.predict(X_test_cfg)
    y_train_proba = cbc_cal.predict_proba(X_train_cfg)
    y_test_proba = cbc_cal.predict_proba(X_test_cfg)

    # Compute AUC-ROC
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')

    # Store results
    storeResults(
        'CatBoost (Calibrated)',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# -----------------------------
# Sort configurations by test accuracy
# -----------------------------
# Reuse the accuracies stored above so the ranking reflects the calibrated model.
results = list(zip(ML_Config, accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print sorted results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")


## CatBoost default


In [None]:
from catboost import CatBoostClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# -----------------------------
# Function to store results
# -----------------------------
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# -----------------------------
# Print all configurations
# -----------------------------
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# -----------------------------
# Evaluate CatBoost for each configuration
# -----------------------------
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train CatBoost with default parameters
    cbc = CatBoostClassifier(verbose=0, random_state=42)
    cbc.fit(X_train_cfg, y_train_cfg)
    
    # Predictions
    y_train_pred = cbc.predict(X_train_cfg)
    y_test_pred = cbc.predict(X_test_cfg)
    y_train_proba = cbc.predict_proba(X_train_cfg)
    y_test_proba = cbc.predict_proba(X_test_cfg)
    
    # Compute AUC-ROC
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    
    # Store results
    storeResults(
        'CatBoost',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# -----------------------------
# Evaluate test accuracy and sort configurations
# -----------------------------
results = []
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
    cbc = CatBoostClassifier(verbose=0, random_state=42)
    cbc.fit(X_train_cfg, y_train_cfg)
    y_test_pred = cbc.predict(X_test_cfg)
    test_accuracy = metrics.accuracy_score(y_test, y_test_pred)
    results.append((name, test_accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print sorted results
print("\nConfigurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")


## CatBoost 1

In [None]:
from catboost import CatBoostClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# -----------------------------
# Function to store results
# -----------------------------
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# -----------------------------
# Print all configurations
# -----------------------------
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# -----------------------------
# Evaluate CatBoost for each configuration
# -----------------------------
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train CatBoost with tuned parameters (set 1)
    cbc = CatBoostClassifier(
        iterations=100,
        depth=8,
        learning_rate=0.05,
        l2_leaf_reg=5,
        bagging_temperature=0.8,
        border_count=128,
        eval_metric='Accuracy',
        random_seed=42,
        verbose=200,
    )
    cbc.fit(X_train_cfg, y_train_cfg)
    
    # Predictions
    y_train_pred = cbc.predict(X_train_cfg)
    y_test_pred = cbc.predict(X_test_cfg)
    y_train_proba = cbc.predict_proba(X_train_cfg)
    y_test_proba = cbc.predict_proba(X_test_cfg)
    
    # Compute AUC-ROC
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    
    # Store results
    storeResults(
        'CatBoost',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# -----------------------------
# Sort configurations by test accuracy
# -----------------------------
# Reuse the accuracies stored above so the ranking reflects the tuned model.
results = list(zip(ML_Config, accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print sorted results
print("\nConfigurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")


## CatBoost 2

In [None]:
from catboost import CatBoostClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# -----------------------------
# Function to store results
# -----------------------------
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# -----------------------------
# Print all configurations
# -----------------------------
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# -----------------------------
# Evaluate CatBoost for each configuration
# -----------------------------
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train CatBoost with tuned parameters (set 2)
    cbc = CatBoostClassifier(
        iterations=150,
        depth=10,
        learning_rate=0.03,
        l2_leaf_reg=3,
        border_count=254,
        bootstrap_type='Bernoulli',
        subsample=0.9,
        eval_metric='Accuracy',
        random_seed=42,
        verbose=200,
    )

    cbc.fit(X_train_cfg, y_train_cfg)
    
    # Predictions
    y_train_pred = cbc.predict(X_train_cfg)
    y_test_pred = cbc.predict(X_test_cfg)
    y_train_proba = cbc.predict_proba(X_train_cfg)
    y_test_proba = cbc.predict_proba(X_test_cfg)
    
    # Compute AUC-ROC
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    
    # Store results
    storeResults(
        'CatBoost',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# -----------------------------
# Sort configurations by test accuracy
# -----------------------------
# Reuse the accuracies stored above so the ranking reflects the tuned model.
results = list(zip(ML_Config, accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print sorted results
print("\nConfigurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")

## CatBoost 3

In [None]:
from catboost import CatBoostClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# -----------------------------
# Function to store results
# -----------------------------
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# -----------------------------
# Print all configurations
# -----------------------------
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# -----------------------------
# Evaluate CatBoost for each configuration
# -----------------------------
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train CatBoost with tuned parameters (set 3)
    cbc = CatBoostClassifier(
        iterations=120,
        depth=6,
        learning_rate=0.07,
        l2_leaf_reg=8,
        bagging_temperature=1.0,
        border_count=64,
        random_strength=1.5,
        eval_metric='Accuracy',
        random_seed=42,
        verbose=200,
    )

    cbc.fit(X_train_cfg, y_train_cfg)
    
    # Predictions
    y_train_pred = cbc.predict(X_train_cfg)
    y_test_pred = cbc.predict(X_test_cfg)
    y_train_proba = cbc.predict_proba(X_train_cfg)
    y_test_proba = cbc.predict_proba(X_test_cfg)
    
    # Compute AUC-ROC
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    
    # Store results
    storeResults(
        'CatBoost',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# -----------------------------
# Sort configurations by test accuracy
# -----------------------------
# Reuse the accuracies stored above so the ranking reflects the tuned model.
results = list(zip(ML_Config, accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print sorted results
print("\nConfigurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")

## CatBoost 4

In [None]:
from catboost import CatBoostClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# -----------------------------
# Function to store results
# -----------------------------
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# -----------------------------
# Print all configurations
# -----------------------------
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# -----------------------------
# Evaluate CatBoost for each configuration
# -----------------------------
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
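    # Train CatBoost with tuned parameters (set 4)
    # NOTE: `subsample` is only used by the Bernoulli, Poisson and MVS
    # bootstraps; if CatBoost's default bootstrap for this objective is
    # Bayesian, set bootstrap_type explicitly (for example 'Bernoulli').
    cbc = CatBoostClassifier(
        iterations=100,
        depth=5,
        learning_rate=0.1,
        l2_leaf_reg=4,
        subsample=0.85,
        border_count=128,
        random_strength=0.8,
        eval_metric='Accuracy',
        random_seed=42,
        verbose=200,
    )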

    cbc.fit(X_train_cfg, y_train_cfg)
    
    # Predictions
    y_train_pred = cbc.predict(X_train_cfg)
    y_test_pred = cbc.predict(X_test_cfg)
    y_train_proba = cbc.predict_proba(X_train_cfg)
    y_test_proba = cbc.predict_proba(X_test_cfg)
    
    # Compute AUC-ROC
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    
    # Store results
    storeResults(
        'CatBoost',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# -----------------------------
# Sort configurations by test accuracy
# -----------------------------
# Reuse the accuracies stored above so the ranking reflects the tuned model.
results = list(zip(ML_Config, accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print sorted results
print("\nConfigurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")

## CatBoost 5

In [None]:
from catboost import CatBoostClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# -----------------------------
# Function to store results
# -----------------------------
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# -----------------------------
# Print all configurations
# -----------------------------
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# -----------------------------
# Evaluate CatBoost for each configuration
# -----------------------------
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
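    # Train CatBoost with tuned parameters (set 5)
    # NOTE: `bagging_temperature` applies to the Bayesian bootstrap, while
    # `subsample` applies to Bernoulli, Poisson and MVS; CatBoost may reject
    # the combination, in which case keep only one of the two.
    cbc = CatBoostClassifier(
        iterations=1500,
        depth=9,
        learning_rate=0.04,
        l2_leaf_reg=6,
        bagging_temperature=0.4,
        subsample=0.85,
        grow_policy='Lossguide',
        max_leaves=64,
        random_strength=1.0,
        eval_metric='Accuracy',
        random_seed=42,
        verbose=200,
    )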

    cbc.fit(X_train_cfg, y_train_cfg)
    
    # Predictions
    y_train_pred = cbc.predict(X_train_cfg)
    y_test_pred = cbc.predict(X_test_cfg)
    y_train_proba = cbc.predict_proba(X_train_cfg)
    y_test_proba = cbc.predict_proba(X_test_cfg)
    
    # Compute AUC-ROC
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    
    # Store results
    storeResults(
        'CatBoost',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# -----------------------------
# Sort configurations by test accuracy
# -----------------------------
# Reuse the accuracies stored above so the ranking reflects the tuned model.
results = list(zip(ML_Config, accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print sorted results
print("\nConfigurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")
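
Each cell above fills the parallel lists `ML_Model`, `ML_Config`, `accuracy`, `f1_score`, `recall`, `precision` and `auc_roc` through `storeResults`. A minimal sketch of turning those lists into one sortable table is given below. The values are placeholders for illustration only, not results from the notebook, and the column names are assumptions.

```python
import pandas as pd

# Placeholder values for illustration only; these are NOT notebook results.
ML_Model = ["ModelA", "ModelA", "ModelA"]
ML_Config = ["config_1", "config_2", "config_3"]
accuracy = [0.80, 0.90, 0.85]
f1_score = [0.79, 0.89, 0.84]
recall = [0.78, 0.90, 0.85]
precision = [0.81, 0.88, 0.84]
auc_roc = [0.91, 0.97, 0.94]

# One row per (model, configuration) pair, one column per metric.
results_df = pd.DataFrame(
    {
        "Model": ML_Model,
        "Configuration": ML_Config,
        "Accuracy": accuracy,
        "F1": f1_score,
        "Recall": recall,
        "Precision": precision,
        "AUC-ROC": auc_roc,
    }
)

# Rank by test accuracy, highest first, without retraining any model.
ranked = results_df.sort_values("Accuracy", ascending=False).reset_index(drop=True)
print(ranked.to_string(index=False))
```

Sorting the stored metrics keeps the ranking tied to the exact model that produced them, and the table can be sorted by any other column in the same way.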

## LightGBM default

In [None]:
from lightgbm import LGBMClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# Function to store results
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# ============================================
# LightGBM configurations
# ============================================
# Use the same configurations list you already have
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# Evaluate LightGBM for each configuration
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train LightGBM
    lgbm = LGBMClassifier(
        random_state=42,
        verbose=-1,
        objective='multiclass',
        boosting_type='gbdt',
        metric='multi_logloss',
    )
    lgbm.fit(X_train_cfg, y_train_cfg)
    
    # Predictions
    y_train_pred = lgbm.predict(X_train_cfg)
    y_test_pred = lgbm.predict(X_test_cfg)
    y_train_proba = lgbm.predict_proba(X_train_cfg)
    y_test_proba = lgbm.predict_proba(X_test_cfg)
    
    # Store results
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    storeResults(
        'LightGBM',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# Rank configurations by the test accuracy already stored above;
# reusing it avoids fitting every model a second time
results = list(zip(ML_Config, accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")

## LightGBM (tuned hyperparameters, variant 1)

In [None]:
from lightgbm import LGBMClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# Function to store results
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# ============================================
# LightGBM configurations
# ============================================
# Use the same configurations list you already have
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# Evaluate LightGBM for each configuration
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train LightGBM
    lgbm = LGBMClassifier(
        n_estimators=1200,
        learning_rate=0.07,
        objective='multiclass',
        num_leaves=25,
        max_depth=6,
        min_child_samples=40,
        subsample=0.7,
        colsample_bytree=0.7,
        reg_alpha=0.5,
        reg_lambda=1.0,
        boosting_type='gbdt',
        metric='multi_logloss',
        random_state=42,
    )
    lgbm.fit(X_train_cfg, y_train_cfg)
    
    # Predictions
    y_train_pred = lgbm.predict(X_train_cfg)
    y_test_pred = lgbm.predict(X_test_cfg)
    y_train_proba = lgbm.predict_proba(X_train_cfg)
    y_test_proba = lgbm.predict_proba(X_test_cfg)
    
    # Store results
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    storeResults(
        'LightGBM',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# Rank configurations by the test accuracy of the tuned model stored above.
# (Refitting a default LGBMClassifier here would report the wrong model's accuracy.)
results = list(zip(ML_Config, accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")

## XGBoost

In [None]:
from xgboost import XGBClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# Function to store results
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# ============================================
# XGBoost configurations
# ============================================
# Use the same configurations list you already have
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# Evaluate XGBoost for each configuration
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train XGBoost
    xgb = XGBClassifier(random_state=42, verbosity=0)
    xgb.fit(X_train_cfg, y_train_cfg)
    
    # Predictions
    y_train_pred = xgb.predict(X_train_cfg)
    y_test_pred = xgb.predict(X_test_cfg)
    y_train_proba = xgb.predict_proba(X_train_cfg)
    y_test_proba = xgb.predict_proba(X_test_cfg)
    
    # Store results
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    storeResults(
        'XGBoost',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# Rank configurations by the test accuracy already stored above;
# reusing it avoids fitting every model a second time
results = list(zip(ML_Config, accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")

## Extra Trees

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# Function to store results
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# ============================================
# Extra Trees configurations
# ============================================
# Use the same configurations list you already have
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# Evaluate Extra Trees for each configuration
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train Extra Trees
    etc = ExtraTreesClassifier(random_state=42)
    etc.fit(X_train_cfg, y_train_cfg)
    
    # Predictions
    y_train_pred = etc.predict(X_train_cfg)
    y_test_pred = etc.predict(X_test_cfg)
    y_train_proba = etc.predict_proba(X_train_cfg)
    y_test_proba = etc.predict_proba(X_test_cfg)
    
    # Store results
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    storeResults(
        'ExtraTrees',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# Rank configurations by the test accuracy already stored above;
# reusing it avoids fitting every model a second time
results = list(zip(ML_Config, accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")

## KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# Function to store results
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# ============================================
# KNN configurations
# ============================================
# Use the same configurations list you already have
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# Evaluate KNN for each configuration
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train KNN
    knn = KNeighborsClassifier()
    knn.fit(X_train_cfg, y_train_cfg)
    
    # Predictions
    y_train_pred = knn.predict(X_train_cfg)
    y_test_pred = knn.predict(X_test_cfg)
    y_train_proba = knn.predict_proba(X_train_cfg)
    y_test_proba = knn.predict_proba(X_test_cfg)
    
    # Store results
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    storeResults(
        'KNN',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# Rank configurations by the test accuracy already stored above;
# reusing it avoids fitting every model a second time
results = list(zip(ML_Config, accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")

## MLP

In [None]:
from sklearn.neural_network import MLPClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# Function to store results
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# ============================================
# MLP configurations
# ============================================
# Use the same configurations list you already have
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# Evaluate MLP for each configuration
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train MLP
    mlp = MLPClassifier(random_state=42, max_iter=500)
    mlp.fit(X_train_cfg, y_train_cfg)
    
    # Predictions
    y_train_pred = mlp.predict(X_train_cfg)
    y_test_pred = mlp.predict(X_test_cfg)
    y_train_proba = mlp.predict_proba(X_train_cfg)
    y_test_proba = mlp.predict_proba(X_test_cfg)
    
    # Store results
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    storeResults(
        'MLP',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# Rank configurations by the test accuracy already stored above;
# reusing it avoids fitting every model a second time
results = list(zip(ML_Config, accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")

## Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
import pandas as pd

ML_Model = []
ML_Config = []
accuracy = []
f1_score = []
recall = []
precision = []
auc_roc = []

# Function to store results
def storeResults(model, config, a, b, c, d, e):
    ML_Model.append(model)
    ML_Config.append(config)
    accuracy.append(round(a, 6))
    f1_score.append(round(b, 6))
    recall.append(round(c, 6))
    precision.append(round(d, 6))
    auc_roc.append(round(e, 6))

# ============================================
# Random Forest configurations
# ============================================
# Use the same configurations list you already have
print(f"Total configurations: {len(configurations)}")
for i, (name, X_tr, X_te, y_tr) in enumerate(configurations):
    print(f"{i+1}. {name}: Train shape={X_tr.shape}, Test shape={X_te.shape}, Train labels={y_tr.shape}")

# Evaluate Random Forest for each configuration
for name, X_train_cfg, X_test_cfg, y_train_cfg in configurations:
   
    # Train Random Forest
    rf = RandomForestClassifier(random_state=42)
    rf.fit(X_train_cfg, y_train_cfg)
    
    # Predictions
    y_train_pred = rf.predict(X_train_cfg)
    y_test_pred = rf.predict(X_test_cfg)
    y_train_proba = rf.predict_proba(X_train_cfg)
    y_test_proba = rf.predict_proba(X_test_cfg)
    
    # Store results
    auc_score = metrics.roc_auc_score(pd.get_dummies(y_test), y_test_proba, multi_class='ovr', average='macro')
    storeResults(
        'RandomForest',
        name,
        metrics.accuracy_score(y_test, y_test_pred),
        metrics.f1_score(y_test, y_test_pred, average='macro'),
        metrics.recall_score(y_test, y_test_pred, average='macro'),
        metrics.precision_score(y_test, y_test_pred, average='macro'),
        auc_score
    )

# Rank configurations by the test accuracy already stored above;
# reusing it avoids fitting every model a second time
results = list(zip(ML_Config, accuracy))

# Sort results by test accuracy descending
results_sorted = sorted(results, key=lambda x: x[1], reverse=True)

# Print results
print("Configurations sorted by Test Accuracy (High to Low):\n")
for i, (name, acc) in enumerate(results_sorted, 1):
    print(f"{i}. {name}: Test Accuracy = {acc:.4f}")

# Best results and configurations

## Top 10 Model Configurations by Test Accuracy

Ranks 1 to 8 are tied at 0.9733 and ranks 9 and 10 at 0.9600. Rows are ordered by test accuracy only, so positions within a tie carry no meaning.

| Rank | Model               | Configuration              | Test Accuracy |
|------|---------------------|----------------------------|---------------|
| 1    | Gradient Boosting   | Boruta + SMOTE + Tomek     | 0.9733        |
| 2    | Gradient Boosting   | PowerTransformer + SMOTE   | 0.9733        |
| 3    | XGBoost             | MI + SMOTE + Tomek         | 0.9733        |
| 4    | XGBoost             | LDA                        | 0.9733        |
| 5    | Extra Trees         | MI                         | 0.9733        |
| 6    | Extra Trees         | MI + SMOTE + Tomek         | 0.9733        |
| 7    | Random Forest       | MI + SMOTE + Tomek         | 0.9733        |
| 8    | Random Forest       | Boruta + SMOTE + Tomek     | 0.9733        |
| 9    | Logistic Regression | Robust + SVMSMOTE          | 0.9600        |
| 10   | Logistic Regression | Robust + RandomOverSampler | 0.9600        |
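
Several rows in this table share the same test accuracy, so a dense ranking (equal scores share one rank) may describe it more faithfully than positions 1 to 10. The following is a minimal sketch in plain Python, not part of the original notebook. The rows are copied from the table, and `ROWS` and `dense_rank` are hypothetical names introduced for illustration.

```python
# Rows copied from the Top 10 table: (model, configuration, test accuracy).
ROWS = [
    ("Gradient Boosting", "Boruta + SMOTE + Tomek", 0.9733),
    ("Gradient Boosting", "PowerTransformer + SMOTE", 0.9733),
    ("XGBoost", "MI + SMOTE + Tomek", 0.9733),
    ("XGBoost", "LDA", 0.9733),
    ("Extra Trees", "MI", 0.9733),
    ("Extra Trees", "MI + SMOTE + Tomek", 0.9733),
    ("Random Forest", "MI + SMOTE + Tomek", 0.9733),
    ("Random Forest", "Boruta + SMOTE + Tomek", 0.9733),
    ("Logistic Regression", "Robust + SVMSMOTE", 0.9600),
    ("Logistic Regression", "Robust + RandomOverSampler", 0.9600),
]


def dense_rank(rows):
    """Return (rank, model, config, accuracy); equal accuracies share a rank."""
    # Distinct accuracies, highest first, define the rank levels.
    distinct = sorted({acc for _, _, acc in rows}, reverse=True)
    rank_of = {acc: rank for rank, acc in enumerate(distinct, start=1)}
    return [(rank_of[acc], model, cfg, acc) for model, cfg, acc in rows]


ranked = dense_rank(ROWS)
for rank, model, cfg, acc in ranked:
    print(f"{rank}. {model} | {cfg} | {acc:.4f}")
```

With this scheme the eight configurations at 0.9733 all receive rank 1 and the two Logistic Regression rows receive rank 2.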

---

## Top 5 Configurations Only

Distinct configurations, in the order they first appear in the Top 10 table:

1. Boruta + SMOTE + Tomek  
2. PowerTransformer + SMOTE  
3. MI + SMOTE + Tomek  
4. LDA  
5. MI
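
The list above matches what one gets by walking the Configuration column of the Top 10 table and keeping each configuration the first time it appears. The following is a small sketch of that de-duplication in plain Python, assuming the table order is the intended order. `TABLE_CONFIGS` and `first_distinct` are hypothetical names, not part of the original notebook.

```python
# Configuration column of the Top 10 table, in table order.
TABLE_CONFIGS = [
    "Boruta + SMOTE + Tomek",
    "PowerTransformer + SMOTE",
    "MI + SMOTE + Tomek",
    "LDA",
    "MI",
    "MI + SMOTE + Tomek",
    "MI + SMOTE + Tomek",
    "Boruta + SMOTE + Tomek",
    "Robust + SVMSMOTE",
    "Robust + RandomOverSampler",
]


def first_distinct(items, n):
    """Return the first n distinct items, preserving first-appearance order."""
    # dict.fromkeys keeps insertion order and drops later duplicates.
    return list(dict.fromkeys(items))[:n]


top5 = first_distinct(TABLE_CONFIGS, 5)
for i, cfg in enumerate(top5, 1):
    print(f"{i}. {cfg}")
```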


# Final work