# CS565-DS522 IoT Data Science Mini Project for K-EmoPhone dataset
*This material is a joint work of TAs from IC Lab at KAIST, including Panyu Zhang, Soowon Kang, and Woohyeok Choi. This work is licensed under CC BY-SA 4.0.*

## Instruction
In this mini-project, we will build a model to predict users' self-reported stress using extracted features from K-EmoPhone dataset. This material mainly refers to the public [repository](https://github.com/SteinPanyu/IndependentReproducibility) conducting indepedent reproducibility experiments on K-EmoPhone dataset. In order to save time, we provide the extracted features from the raw data instead of starting from scratch. Besides, traditional machine learning model is used considering limited number of labels and multimodality issue in the in-the-wild K-EmoPhone dataset.



## Guidance

1. Before running the code, please first download the extracted features from the following [link](https://drive.google.com/file/d/1HcyFvzWEzO21osyP5E8VpVmHROX1ew7q/view?usp=sharing).

2. Please change your runtime type to T4-GPU or other runtime types with GPU available since later we may use GPU for
xgboost execution

Install latest version of xgboost > 2.0.0

In [1]:
!pip install xgboost

Collecting xgboost
  Downloading xgboost-3.0.1-py3-none-manylinux_2_28_x86_64.whl.metadata (2.1 kB)
Downloading xgboost-3.0.1-py3-none-manylinux_2_28_x86_64.whl (253.9 MB)
[2K   [38;2;114;156;31m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m253.9/253.9 MB[0m [31m4.9 MB/s[0m eta [36m0:00:00[0mm eta [36m0:00:01[0m[36m0:00:02[0m
[?25hInstalling collected packages: xgboost
Successfully installed xgboost-3.0.1


In [2]:
import pytz
import os
import pandas as pd
import numpy as np
import scipy.stats as st
import cloudpickle
from datetime import datetime
from contextlib import contextmanager
import warnings
import time
from typing import Optional
from contextlib import contextmanager

DEFAULT_TZ = pytz.FixedOffset(540)  # GMT+09:00; Asia/Seoul

RANDOM_STATE =42


def log(msg: any):
    print('[{}] {}'.format(datetime.now().strftime('%y-%m-%d %H:%M:%S'), msg))

## 1.Preparation

### 1.1. Mount to Your Google Drive

In [5]:
# not relevant for local execution
'''
from google.colab import drive

drive.mount('/content/drive')
'''

"\nfrom google.colab import drive\n\ndrive.mount('/content/drive')\n"

### 1.2. Load Extracted Features

In [7]:
import pickle
import numpy as np

#PATH = '/content/drive/MyDrive/IoT_Data_Science/Project/Datasets/features_stress_fixed_K-EmoPhone.pkl'
PATH = './Datasets/features_stress_fixed_K-EmoPhone.pkl'

X, y, groups, t, datetimes = pickle.load(open(PATH, mode='rb'))

X is the extracted features and the feature extraction process refers to the public [repository](https://github.com/SteinPanyu/IndependentReproducibility) and the immediate past time window is set as 15 minutes. y is the array of labels while groups is the user ids.

Please note that here y is binarized using theoretical threshold (if ESM stress > 0, binarize as 1, else 0, ESM label scale [-3, 3])

Since features are already extracted, we do not need to work on preprocessing and feature extraction again.

## 2.Feature Preparation


There exist multiple types of features. Please try different combinations of features to see if there is any model performance improvement.

In [None]:

#The following code is designed for reordering the data
#################################################
# Create a DataFrame with user_id and datetime

df = pd.DataFrame({'user_id': groups, 'datetime': datetimes, 'label': y})

# df_merged = pd.merge(df, X, left_index=True, right_index=True)
df_merged = pd.merge(df, X, left_index=True, right_index=True)

# Sort the DataFrame by datetime
df_merged = df_merged.sort_values(by=['user_id', 'datetime'])

# Update groups and datetimes
groups = df_merged['user_id'].to_numpy()
datetimes = df_merged['datetime'].to_numpy()
y = df_merged['label'].to_numpy()
X = df_merged.drop(columns=['user_id', 'datetime', 'label'])



#Divide the features into different categories
feat_current = X.loc[:,[('#VAL' in str(x)) or ('ESM#LastLabel' in str(x)) for x in X.keys()]]
feat_dsc = X.loc[:,[('#DSC' in str(x))  for x in X.keys()]]
feat_yesterday = X.loc[:,[('Yesterday' in str(x))  for x in X.keys()]]
feat_today = X.loc[:,[('Today' in str(x))  for x in X.keys()]]

feat_ImmediatePast = X.loc[:,[('ImmediatePast_15' in str(x))  for x in X.keys()]]

#################################################################################
#Below are the available features
#Divide the time window features into sensor/ESM self-report features
feat_current_sensor = X.loc[:,[('#VAL' in str(x))  for x in X.keys()]] #Current sensor features (value right before label)
feat_current_ESM = X.loc[:,[('ESM#LastLabel' in str(x)) for x in X.keys()]] #Current ESM features (value right before label)
feat_ImmediatePast_sensor = feat_ImmediatePast.loc[:,[('ESM' not in str(x)) for x in feat_ImmediatePast.keys()]] #Immediate past sensor features (in past 15 minutes before label)
feat_ImmediatePast_ESM = feat_ImmediatePast.loc[:,[('ESM'  in str(x)) for x in feat_ImmediatePast.keys()]]  #Immediate past ESM features
feat_today_sensor = feat_today.loc[:,[('ESM' not in str(x))  for x in feat_today.keys()]] #Today epoch sensor features
feat_today_ESM = feat_today.loc[:,[('ESM'  in str(x)) for x in feat_today.keys()]] #Today epoch ESM features
feat_yesterday_sensor = feat_yesterday.loc[:,[('ESM' not in str(x)) for x in feat_yesterday.keys()]] #Yesterday sensor features
feat_yesterday_ESM = feat_yesterday.loc[:,[('ESM'  in str(x)) for x in feat_yesterday.keys()]] #Yesterday ESM features

feat_sleep = X.loc[:,[('Sleep' in str(x))  for x in X.keys()]]
feat_time = X.loc[:,[('Time' in str(x))  for x in X.keys()]]
feat_pif = X.loc[:,[('PIF' in str(x))  for x in X.keys()]]
################################################################################

#Prepare the final feature set
feat_baseline = pd.concat([ feat_time,feat_dsc,feat_current_sensor, feat_ImmediatePast_sensor],axis=1)

feat_final = pd.concat([feat_baseline  ],axis=1)


################################################################################
X = feat_final
cats = X.columns[X.dtypes == bool]

NameError: name 'f' is not defined

In [8]:
feat_current_ESM

Unnamed: 0,ESM#LastLabel
0,0.0
1,1.0
2,1.0
3,0.0
4,0.0
...,...
2614,0.0
2615,0.0
2616,0.0
2617,1.0


## 3.Model Training & Evaluation


Here is the revised XGBoost Classifier. We will use random eval_size percent of training set data as evaluation set for early stoppping.

In [9]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier, DMatrix
from sklearn.base import BaseEstimator
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split
from typing import Union

#Function for revised xgboost classifier
class EvXGBClassifier(BaseEstimator):
    """
    Enhanced XGBClassifier with built-in validation set approach for early stopping.
    """
    def __init__(
        self,
        eval_size=None,
        eval_metric='logloss',
        early_stopping_rounds=10,
        random_state=None,
        **kwargs
        ):
        """
        Initializes the custom XGBoost Classifier.

        Args:
            eval_size (float): The proportion of the dataset to include in the evaluation split.
            eval_metric (str): The evaluation metric used for model training.
            early_stopping_rounds (int): The number of rounds to stop training if hold-out metric doesn't improve.
            random_state (int): Seed for the random number generator for reproducibility.
            **kwargs: Additional arguments to be passed to the underlying XGBClassifier.
        """
        self.random_state = random_state
        self.eval_size = eval_size
        self.eval_metric = eval_metric
        self.early_stopping_rounds = early_stopping_rounds
        # Initialize the XGBClassifier with specified arguments and GPU acceleration.
        self.model = XGBClassifier(
            random_state=self.random_state,
            eval_metric=self.eval_metric,
            early_stopping_rounds=self.early_stopping_rounds,
            tree_method = "hist", device = "cuda", #Use gpu for acceleration
            **kwargs
        )

    @property
    def feature_importances_(self):
        """ Returns the feature importances from the fitted model. """
        return self.model.feature_importances_

    @property
    def feature_names_in_(self):
        """ Returns the feature names from the input dataset used for fitting. """
        return self.model.feature_names_in_

    def fit(self, X: Union[pd.DataFrame, np.ndarray], y: np.ndarray):
        """
        Fit the XGBoost model with optional early stopping using a validation set.

        Args:
            X (Union[pd.DataFrame, np.ndarray]): Training features.
            y (np.ndarray): Target values.
        """
        if self.eval_size:
            # Split data for early stopping evaluation if eval_size is specified.
            X_train_sub, X_val, y_train_sub, y_val = train_test_split(
                X, y, test_size=self.eval_size, random_state=self.random_state)
            # Fit the model with early stopping.
            self.model.fit(
                X_train_sub, y_train_sub,
                eval_set=[(X_val, y_val)],
                verbose=False
            )
        else:
            # Fit the model without early stopping.
            self.model.fit(X, y, verbose=False)

        # Store the best iteration number for predictions.
        self.best_iteration_ = self.model.get_booster().best_iteration
        return self

    def predict(self, X: pd.DataFrame):
        """
        Predict the classes for the given features.

        Args:
            X (pd.DataFrame): Input features.
        """
        return self.model.predict(X, iteration_range=(0, self.best_iteration_ + 1))

    def predict_proba(self, X: pd.DataFrame):
        """
        Predict the class probabilities for the given features.

        Args:
            X (pd.DataFrame): Input features.
        """
        return self.model.predict_proba(X, iteration_range=(0, self.best_iteration_ + 1))

The following is defined functions for model training and model evaluation (cross-validation).

In [10]:
import os
import pandas as pd
import numpy as np
import time
import traceback
from sklearn.linear_model import LogisticRegression
from sklearn.base import clone
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold, LeaveOneGroupOut, StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE, SMOTENC
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from dataclasses import dataclass

@dataclass
class FoldResult:
    name: str
    metrics: dict
    duration: float

def log(message: str):
    print(message)  # Simple logging to stdout or enhance as needed

def train_fold(dir_result: str, fold_name: str, X_train, y_train, X_test, y_test, C_cat, C_num, estimator, normalize, select, oversample, random_state):
    """
    Function to train and evaluate the model for a single fold.
    Args:
        dir_result (str): Directory to store results.
        fold_name (str): Name of the fold for identification.
        X_train, y_train (DataFrame, Series): Training data.
        X_test, y_test (DataFrame, Series): Testing data.
        C_cat, C_num (array): Lists of categorical and numeric feature names.
        estimator (estimator instance): The model to be trained.
        normalize (bool): Flag to apply normalization.
        select (SelectFromModel instance): Feature selection method.
        oversample (bool): Flag to apply oversampling.
        random_state (int): Random state for reproducibility.
    Returns:
        FoldResult: Object containing metrics and duration of the training.
    """
    try:
        start_time = time.time()
        if normalize:
            X_train_N, X_test_N = X_train[C_num].values, X_test[C_num].values
            X_train_C, X_test_C = X_train[C_cat].values, X_test[C_cat].values
            # Standard scaler only applied to numeric data
            scaler = StandardScaler().fit(X_train_N)
            X_train_N = scaler.transform(X_train_N)
            X_test_N = scaler.transform(X_test_N)

            X_train = pd.DataFrame(
                np.concatenate((X_train_C, X_train_N), axis=1),
                columns=np.concatenate((C_cat, C_num))
            )
            X_test = pd.DataFrame(
                np.concatenate((X_test_C, X_test_N), axis=1),
                columns=np.concatenate((C_cat, C_num))
            )

        #Applying the LASSO feature selection method
        if select:

            if isinstance(select, SelectFromModel):
                select = [select]

            for i, s in enumerate(select):
                C = np.asarray(X_train.columns)
                M = s.fit(X=X_train.values, y=y_train).get_support()
                C_sel = C[M]
                C_cat = C_cat[np.isin(C_cat, C_sel)]
                C_num = C_num[np.isin(C_num, C_sel)]

                X_train_N, X_test_N = X_train[C_num].values, X_test[C_num].values
                X_train_C, X_test_C = X_train[C_cat].values, X_test[C_cat].values


                X_train = pd.DataFrame(
                    np.concatenate((X_train_C, X_train_N), axis=1),
                    columns=np.concatenate((C_cat, C_num))
                )
                X_test = pd.DataFrame(
                    np.concatenate((X_test_C, X_test_N), axis=1),
                    columns=np.concatenate((C_cat, C_num))
                )

        if oversample:
            #If there is any categorical data, apply SMOTE-NC, otherwise just SMOTE
            if len(C_cat) > 0:
                sampler = SMOTENC(categorical_features=[X_train.columns.get_loc(c) for c in C_cat], random_state=random_state)
            else:
                sampler = SMOTE(random_state=random_state)
            X_train, y_train = sampler.fit_resample(X_train, y_train)

        estimator = clone(estimator).fit(X_train, y_train)
        y_pred = estimator.predict_proba(X_test)[:, 1]
        #Deafult average method for roc_auc_score is macro
        auc_score = roc_auc_score(y_test, y_pred, average=None)

        result = FoldResult(
            name=fold_name,
            metrics={'AUC': auc_score},
            duration=time.time() - start_time
        )
        log(f'Training completed for {fold_name} with AUC: {auc_score}')
        return result

    except Exception as e:
        log(f'Error in {fold_name}: {traceback.format_exc()}')
        return None

def perform_cross_validation(X, y, groups, estimator, normalize=False, select=None, oversample=False, random_state=None):
    """
    Function to perform cross-validation using StratifiedGroupKFold.
    Args:
        X, y (DataFrame, Series): The entire dataset.
        groups (array): Array indicating the group for each instance in X.
        estimator (estimator instance): The model to be trained.
        normalize, select, oversample (bool): Preprocessing options.
        random_state (int): Seed for reproducibility.
    Returns:
        list: A list containing FoldResult for each fold.
    """
    futures = []
    # Group-k cross validation
    splitter = StratifiedGroupKFold(n_splits=5, shuffle =True, random_state = 42)
    # Loop over all the LOSO splits
    for idx, (train_idx, test_idx) in enumerate(splitter.split(X, y, groups)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        C_cat = np.asarray(sorted(cats))
        C_num = np.asarray(sorted(X.columns[~X.columns.isin(C_cat)]))

        job = train_fold('path_to_results', f'Fold_{idx}', X_train, y_train, X_test, y_test, C_cat, C_num, estimator, normalize, select, oversample, random_state)
        futures.append(job)

    return futures

Here, we define the feature selection method and classifier and execute the code. AUC-ROC is calculated as mean of macro AUC-ROC for all folds/users.

In [11]:
#Featur Selection, you may want to change the feature selection methods
SELECT_LASSO = SelectFromModel(
        estimator=LogisticRegression(
        penalty='l1'
        ,solver='liblinear'
        , C=1, random_state=RANDOM_STATE, max_iter=4000
    ),
    # This threshold may impact the model performance as well
    threshold = 0.005
)
#Classifier
#There could exist more parameters. Please search in your defined parameter
#space for model performance improvement
estimator = EvXGBClassifier(
    random_state=RANDOM_STATE,
    eval_metric='logloss',
    eval_size=0.2,
    early_stopping_rounds=10,
    objective='binary:logistic', #Prediction instead of regression
    verbosity=0,
    learning_rate=0.01,
)

#Perform cross validation including model training and evaluation
results = perform_cross_validation(X, y, groups, estimator, normalize=True, select=[SELECT_LASSO], oversample=True, random_state=42)
auc_values = [results[i].metrics['AUC'] for i in range(len(results))]
mean_auc = np.mean(auc_values)
print(mean_auc)

Potential solutions:
- Use a data structure that matches the device ordinal in the booster.
- Set the device for booster before call to inplace_predict.




Training completed for Fold_0 with AUC: 0.5760597892673365
Training completed for Fold_1 with AUC: 0.5194843617920542
Training completed for Fold_2 with AUC: 0.605396012438266
Training completed for Fold_3 with AUC: 0.5232254989908052
Training completed for Fold_4 with AUC: 0.5972818082858423
0.5642894941548608


# Assignment

## Assignment 1. Improve the model performance using different types of feature combinations. (20pts)

 Hint: Currently we are only using feat_baseline. You may want to try other feature combinations.

In [12]:
#######You may need to go back to the feature preparation code and check#########

feat_final = pd.concat([feat_baseline, feat_current_sensor,feat_ImmediatePast_sensor,feat_today_sensor,feat_yesterday_sensor,feat_sleep,feat_pif],axis=1)
X = feat_final
cats = X.columns[X.dtypes == bool]


## Assignment 2. Please try different feature selection methods (20pts)

Hint: Currently, we are using LASSO filter for feature selection. Please consider using embedded method as well(same model for both feature selection and model training). Besides, the threshold for LASSO filter may also affect the performance. **Sepcifically, there is a method called 'mean' which is using mean of feature importances of all features as threshold.** Please try both different feature selection methods and different thresholds for filtering features to improve model performance.

In [13]:
#######You may need to go back to the Model Training & Evaluation part and revise feature selection code########
#Featur Selection, you may want to change the feature selection methods or change feature selection threshold
SELECT_LASSO = SelectFromModel(
    #
        estimator=LogisticRegression(
        penalty='l1'
        ,solver='liblinear'
        , C=1, random_state=RANDOM_STATE, max_iter=4000
    ),
    # This threshold may impact the model performance as well
    threshold = 0.005 #Change to other thresholds or trying 'mean'
)

SELECT_LASSO_MEAN = SelectFromModel(
        estimator=LogisticRegression(
        penalty='l1'
        ,solver='liblinear'
        , C=1, random_state=RANDOM_STATE, max_iter=4000
    ),
    threshold = 'mean'
)

from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel

xgb_selector = SelectFromModel(
    estimator=XGBClassifier(random_state=RANDOM_STATE, use_label_encoder=False, eval_metric='logloss'),
    threshold='mean' # Or a different threshold
)

## Assignment 3. Please try using hyperopt for model hyperparameter tuning (20 pts)

Hint: Please be aware that for revised xgboost classifier EvXGBClassifier, there exist other parameters other than default XGBClassifier parameters such as eval_size.

For hyperparameter tuning, we will use 20% of training set as validation set to avoid data leakage.

If it is too timeconsuming to run the code in colab, please run the code locally and consider using [ray tune](https://docs.ray.io/en/latest/tune/index.html) if needed.

In [None]:
import numpy as np
import pandas as pd
from hyperopt import STATUS_OK, Trials, hp, fmin, tpe
from sklearn.model_selection import StratifiedGroupKFold, train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE, SMOTENC
from sklearn.feature_selection import SelectFromModel

# define your outer CV
OUTER_CV = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)

def objective(params):
    val_scores = []

    # outer loop: split into train_full / test (we will only use train_full for tuning)
    for train_full_idx, _ in OUTER_CV.split(X, y, groups):
        X_train_full = X.iloc[train_full_idx]
        y_train_full = y[train_full_idx]

        # split 20% of the *training fold* into a validation set
        X_train, X_val, y_train, y_val = train_test_split(
            X_train_full, y_train_full,
            test_size=0.20,
            stratify=y_train_full,
            random_state=42
        )

        # 1) Normalize
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_val_scaled   = scaler.transform(X_val)

        # 2) (Optional) Oversample on *training only*
        if np.any(X_train_scaled[:, -1] < 1):
            smote = SMOTENC(
                categorical_features=[X_train_scaled.shape[1]-1],
                random_state=int(params['random_state'])
            )
        else:
            smote = SMOTE(random_state=int(params['random_state']))
        X_train_os, y_train_os = smote.fit_resample(X_train_scaled, y_train)

        # 3) Feature selection on *training only*
        selector = SelectFromModel(
            LogisticRegression(penalty='l1', solver='liblinear',
                               random_state=int(params['random_state'])),
            threshold='mean'
        )
        X_train_sel = selector.fit_transform(X_train_os, y_train_os)
        X_val_sel   = selector.transform(X_val_scaled)

        # 4) Train & score on *validation only*
        clf = LogisticRegression(
            random_state=int(params['random_state']),
            max_iter=1000
        )
        clf.fit(X_train_sel, y_train_os)
        y_val_prob = clf.predict_proba(X_val_sel)[:, 1]
        val_scores.append(roc_auc_score(y_val, y_val_prob))

    # Hyperopt minimizes “loss”, so negate AUC
    return {'loss': -np.mean(val_scores), 'status': STATUS_OK}


# define your search space (fill in any missing parameters e.g. max_depth)
space = {
    'max_depth':          hp.quniform('max_depth', 3, 10, 1),  # Integer between 3 and 10
    'min_child_weight':   hp.quniform('min_child_weight', 1, 10, 1), # Integer between 1 and 10
    'subsample':          hp.uniform('subsample', 0.6, 1.0),  # Float between 0.6 and 1.0
    'colsample_bytree':   hp.uniform('colsample_bytree', 0.6, 1.0), # Float between 0.6 and 1.0
    'gamma':              hp.uniform('gamma', 0, 0.5),      # Float between 0 and 0.5
    'learning_rate':      hp.loguniform('learning_rate', -5, 0), # Float on a log scale (0.0067 to 1)
    'n_estimators':       hp.quniform('n_estimators', 100, 1000, 50), # Integer between 100 and 1000, steps of 50
    'reg_lambda':         hp.uniform('reg_lambda', 0, 1),     # Float between 0 and 1 (L2 regularization)
    'reg_alpha':          hp.uniform('reg_alpha', 0, 0.5),    # Float between 0 and 0.5 (L1 regularization)
    'random_state':       42 # Keeping random_state fixed
}

# run hyperopt
trials = Trials()
best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=100,
    trials=trials
)

print("Best hyperparameters:", best)

  2%|▏         | 2/100 [02:39<2:11:08, 80.29s/trial, best loss: -0.6406176718563223]

## Assignment 4. Please consider replacing the previous traditional machine learning model with deep learning models designed for **tabular data** to improve model performance. (20 pts)

Hint: Since features are already extracted manually, it is impossible to use end-to-end deep learning models. Instead, try replacing xgboost with deep learning models designed for **tabular data** and see if there is performance improvement.

You may need to change runtime to TPU first to use torch or other packages you may want to use.



Please compare it with your previous XGBoost model performance and think about why it is higher or lower than XGBoost.

In [None]:
# prompt: write the code for this: Assignment 4. Please consider replacing the previous traditional machine learning model with deep learning models designed for tabular data to improve model performance. (20 pts)
# Hint: Since features are already extracted manually, it is impossible to use end-to-end deep learning models. Instead, try replacing xgboost with deep learning models designed for tabular data and see if there is performance improvement.
# You may need to change runtime to TPU first to use torch or other packages you may want to use.
# Please compare it with your previous XGBoost model performance and think about why it is higher or lower than XGBoost.

import pandas as pd
import numpy as np
!pip install pytorch-tabnet

import torch
from pytorch_tabnet.tab_model import TabNetClassifier
from torch.utils.data import Dataset, DataLoader
from sklearn.preprocessing import LabelEncoder

# Convert pandas DataFrames to numpy arrays for TabNet
X_np = X.values
y_np = y

# Encode categorical features. TabNet handles categorical features automatically if specified.
# Identify categorical feature indices
categorical_feature_indices = [i for i, col in enumerate(X.columns) if col in cats]
# Since we already One-Hot Encoded the boolean columns earlier, let's treat them as numerical for now
# if you decide to use the boolean columns as categorical, you'll need to handle that differently
# For now, assuming all features in X are treated as numerical inputs for TabNet after potential scaling.


class TabularDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.tensor(X, dtype=torch.float32)
        self.y = torch.tensor(y, dtype=torch.long)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

def train_tabnet_fold(X_train, y_train, X_test, y_test, normalize=False, select=None, oversample=False, random_state=None):
    """
    Function to train and evaluate TabNet model for a single fold.
    Args:
        X_train, y_train (DataFrame, Series): Training data.
        X_test, y_test (DataFrame, Series): Testing data.
        normalize (bool): Flag to apply normalization.
        select (SelectFromModel instance): Feature selection method.
        oversample (bool): Flag to apply oversampling.
        random_state (int): Random state for reproducibility.
    Returns:
        FoldResult: Object containing metrics and duration of the training.
    """
    try:
        start_time = time.time()

        X_train_processed = X_train.copy()
        X_test_processed = X_test.copy()
        y_train_processed = y_train.copy()

        # Separate categorical and numerical columns for processing
        C_cat = np.asarray(sorted(cats))
        C_num = np.asarray(sorted(X_train.columns[~X_train.columns.isin(C_cat)]))

        # Normalize numerical features
        if normalize:
            scaler = StandardScaler().fit(X_train_processed[C_num])
            X_train_processed[C_num] = scaler.transform(X_train_processed[C_num])
            X_test_processed[C_num] = scaler.transform(X_test_processed[C_num])

        # Apply feature selection
        if select:
             if isinstance(select, SelectFromModel):
                select = [select]

             for i, s in enumerate(select):
                # Fit selector on training data
                s.fit(X_train_processed, y_train_processed)
                # Transform both train and test data
                X_train_processed = pd.DataFrame(s.transform(X_train_processed), columns=X_train_processed.columns[s.get_support()])
                X_test_processed = pd.DataFrame(s.transform(X_test_processed), columns=X_test_processed.columns[s.get_support()])

             # Update categorical and numerical column lists based on selected features
             C_cat = np.asarray([col for col in C_cat if col in X_train_processed.columns])
             C_num = np.asarray([col for col in C_num if col in X_train_processed.columns])


        # Apply oversampling (on training data only)
        if oversample:
            # If there is any categorical data, apply SMOTE-NC, otherwise just SMOTE
            if len(C_cat) > 0:
                 # Need to determine categorical feature indices in the *processed* dataframe
                 processed_cat_indices = [X_train_processed.columns.get_loc(c) for c in C_cat]
                 sampler = SMOTENC(categorical_features=processed_cat_indices, random_state=random_state)
            else:
                sampler = SMOTE(random_state=random_state)

            X_train_processed, y_train_processed = sampler.fit_resample(X_train_processed, y_train_processed)


        # Convert to numpy arrays before passing to TabNet
        X_train_np = X_train_processed.values
        y_train_np = y_train_processed.values
        X_test_np = X_test_processed.values
        y_test_np = y_test

        # Initialize TabNet model
        # Adjust parameters based on your data and hyperparameter tuning
        clf = TabNetClassifier(
            optimizer_fn=torch.optim.Adam,
            optimizer_params=dict(lr=2e-2),
            scheduler_params={"step_size":50, # how many steps before decaying the learning rate
                              "gamma":0.9},
            scheduler_fn=torch.optim.lr_scheduler.StepLR,
            mask_type='entmax', # "sparsemax", "entmax"
            seed=random_state,
            verbose=0 # Set to 1 for detailed training logs per epoch
        )

        # Train the model
        # Use evaluation set for early stopping (TabNet has its own early stopping mechanism)
        # Split processed training data for early stopping
        X_train_sub_np, X_val_np, y_train_sub_np, y_val_np = train_test_split(
            X_train_np, y_train_np, test_size=0.2, random_state=random_state, stratify=y_train_np
        )

        clf.fit(
            X_train=X_train_sub_np, y_train=y_train_sub_np,
            eval_set=[(X_val_np, y_val_np)],
            eval_name=['valid'],
            max_epochs=100, # You might need to tune this
            patience=10, # You might need to tune this
            batch_size=1024, # You might need to tune this
            virtual_batch_size=128, # You might need to tune this
            num_workers=0,
            weights=1, # Set to 1 for automatic class weighting
            drop_last=False,
        )


        # Predict probabilities on the test set
        y_pred_proba = clf.predict_proba(X_test_np)[:, 1]

        # Calculate AUC
        auc_score = roc_auc_score(y_test_np, y_pred_proba, average=None)

        result = FoldResult(
            name=fold_name,
            metrics={'AUC': auc_score},
            duration=time.time() - start_time
        )
        log(f'Training completed for {fold_name} with AUC: {auc_score}')
        return result

    except Exception as e:
        log(f'Error in {fold_name}: {traceback.format_exc()}')
        return None


def perform_tabnet_cross_validation(X, y, groups, normalize=False, select=None, oversample=False, random_state=None):
    """
    Function to perform cross-validation using StratifiedGroupKFold with TabNet.
    Args:
        X, y (DataFrame, Series): The entire dataset.
        groups (array): Array indicating the group for each instance in X.
        normalize, select, oversample (bool): Preprocessing options.
        random_state (int): Seed for reproducibility.
    Returns:
        list: A list containing FoldResult for each fold.
    """
    futures = []
    # Group-k cross validation
    splitter = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
    # Loop over all the LOSO splits
    for idx, (train_idx, test_idx) in enumerate(splitter.split(X, y, groups)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]

        job = train_tabnet_fold(X_train, y_train, X_test, y_test, normalize, select, oversample, random_state)
        futures.append(job)

    return futures

# Define feature selection method if needed for TabNet (Optional - TabNet can handle raw features)
# You can still use LASSO or XGBoost based selection before passing to TabNet if you want to reduce dimensionality.
# For this example, let's try without explicit feature selection first to see TabNet's inherent capabilities.
# If you want to use feature selection, uncomment and configure SELECT_LASSO or xgb_selector here.

# SELECT_LASSO = SelectFromModel(
#         estimator=LogisticRegression(
#         penalty='l1'
#         ,solver='liblinear'
#         , C=1, random_state=RANDOM_STATE, max_iter=4000
#     ),
#     threshold = 'mean'
# )

# Perform cross validation with TabNet
# Set select=None to use all features after normalization and oversampling
results_tabnet = perform_tabnet_cross_validation(X, y, groups, normalize=True, select=None, oversample=True, random_state=42)

# Calculate and print the mean AUC for TabNet
auc_values_tabnet = [result.metrics['AUC'] for result in results_tabnet if result is not None]
if auc_values_tabnet:
    mean_auc_tabnet = np.mean(auc_values_tabnet)
    print(f"Mean AUC for TabNet: {mean_auc_tabnet}")
else:
    print("TabNet training failed for all folds.")

# Compare with previous XGBoost result
# You would have run the XGBoost code previously to get `mean_auc`
print(f"Mean AUC for XGBoost (previous run): {mean_auc}")

# Reflection on why performance is higher or lower
# TabNet is a deep learning model designed specifically for tabular data. It uses a sequential attention mechanism
# to select relevant features at each decision step, which can capture complex interactions between features.
# XGBoost is a powerful gradient boosting model that is also very effective on tabular data, especially
# with well-engineered features.

# Possible reasons for performance difference:
# 1. Complex Feature Interactions: TabNet might be better at capturing non-linear relationships and interactions
#    between features than XGBoost, especially if these interactions are not explicitly engineered.
# 2. Handling of Categorical Features: TabNet has built-in mechanisms for handling categorical features,
#    which might be more effective than one-hot encoding or other methods used with traditional models.
#    (Note: In this specific code, we assumed the boolean features were handled as numerical, so this point
#    might be less relevant unless you modify the code to treat them as categorical).
# 3. Hyperparameter Tuning: The performance of both models is highly dependent on hyperparameters.
#    The provided TabNet hyperparameters are defaults or basic settings; further tuning for TabNet
#    (e.g., network architecture, learning rate scheduling, batch sizes, mask type) might lead to significant
#    improvement. The hyperopt code provided was for Logistic Regression, not XGBoost or TabNet. You would
#    need to adapt hyperparameter tuning for TabNet.
# 4. Data Size: Deep learning models often require more data than traditional models to perform well.
#    If the dataset is relatively small, XGBoost might have an advantage.
# 5. Regularization: TabNet has various regularization mechanisms (e.g., in the attention masks) that can help
#    prevent overfitting, which might be beneficial.
# 6. Feature Scaling: TabNet, like many neural networks, is sensitive to feature scaling. The `StandardScaler`
#    applied to numerical features is important.
# 7. Stochasticity: Deep learning models like TabNet have more stochasticity in training (due to initialization,
#    batching, etc.) compared to tree-based models. Running multiple times or with different seeds might
#    give slightly different results.

# To get a more definitive comparison, you would ideally:
# - Perform comprehensive hyperparameter tuning for BOTH XGBoost and TabNet.
# - Ensure fair comparison of preprocessing steps (normalization, feature selection, oversampling) applied to both models.
# - Evaluate on the same cross-validation folds.

# In summary, if TabNet performs better, it could be due to its ability to learn complex feature representations and interactions. If XGBoost performs better, it might indicate that the engineered features are already capturing most of the necessary information, or that the dataset size or the specific characteristics of the data favor gradient boosting.



In [None]:
#######Your code for deep learning model#########

## Assignment 5. Please try combining all the above methods to push the model performance. (20 pts)

Hint: Methods other than the above methods are also okay to use to improve model performance.

Please avoid data leakage when conducting hyperparameter tuning.


In [None]:
#######Your code for combing all above mentioned methods to push model performance########