# CS565-DS522 IoT Data Science Mini Project for K-EmoPhone dataset
*This material is a joint work of TAs from IC Lab at KAIST, including Panyu Zhang, Soowon Kang, and Woohyeok Choi. This work is licensed under CC BY-SA 4.0.*

## Instruction
In this mini-project, we will build a model to predict users' self-reported stress using extracted features from K-EmoPhone dataset. This material mainly refers to the public [repository](https://github.com/SteinPanyu/IndependentReproducibility) conducting indepedent reproducibility experiments on K-EmoPhone dataset. In order to save time, we provide the extracted features from the raw data instead of starting from scratch. Besides, traditional machine learning model is used considering limited number of labels and multimodality issue in the in-the-wild K-EmoPhone dataset.



## Guidance

1. Before running the code, please first download the extracted features from the following [link](https://drive.google.com/file/d/1HcyFvzWEzO21osyP5E8VpVmHROX1ew7q/view?usp=sharing).

2. Please change your runtime type to T4-GPU or other runtime types with GPU available since later we may use GPU for
xgboost execution

Install latest version of xgboost > 2.0.0

In [1]:
!pip install xgboost



In [2]:
import pytz
import os
import pandas as pd
import numpy as np
import scipy.stats as st
import cloudpickle
from datetime import datetime
from contextlib import contextmanager
import warnings
import time
from typing import Optional

DEFAULT_TZ = pytz.FixedOffset(540)  # GMT+09:00; Asia/Seoul

RANDOM_STATE =42


def log(msg: any):
    print('[{}] {}'.format(datetime.now().strftime('%y-%m-%d %H:%M:%S'), msg))

In [3]:
# ignore warnings
import warnings
warnings.filterwarnings("ignore")

## 1.Preparation

### 1.1. Mount to Your Google Drive

In [4]:
# not relevant for local execution
'''
from google.colab import drive

drive.mount('/content/drive')
'''

"\nfrom google.colab import drive\n\ndrive.mount('/content/drive')\n"

### 1.2. Load Extracted Features

In [5]:
import pickle
import numpy as np

#PATH = '/content/drive/MyDrive/IoT_Data_Science/Project/Datasets/features_stress_fixed_K-EmoPhone.pkl'
PATH = './Datasets/features_stress_fixed_K-EmoPhone.pkl'

X, y, groups, t, datetimes = pickle.load(open(PATH, mode='rb'))

X is the extracted features and the feature extraction process refers to the public [repository](https://github.com/SteinPanyu/IndependentReproducibility) and the immediate past time window is set as 15 minutes. y is the array of labels while groups is the user ids.

Please note that here y is binarized using theoretical threshold (if ESM stress > 0, binarize as 1, else 0, ESM label scale [-3, 3])

Since features are already extracted, we do not need to work on preprocessing and feature extraction again.

## 2.Feature Preparation


There exist multiple types of features. Please try different combinations of features to see if there is any model performance improvement.

In [6]:

#The following code is designed for reordering the data
#################################################
# Create a DataFrame with user_id and datetime

df = pd.DataFrame({'user_id': groups, 'datetime': datetimes, 'label': y})

# df_merged = pd.merge(df, X, left_index=True, right_index=True)
df_merged = pd.merge(df, X, left_index=True, right_index=True)

# Sort the DataFrame by datetime
df_merged = df_merged.sort_values(by=['user_id', 'datetime'])

# Update groups and datetimes
groups = df_merged['user_id'].to_numpy()
datetimes = df_merged['datetime'].to_numpy()
y = df_merged['label'].to_numpy()
X = df_merged.drop(columns=['user_id', 'datetime', 'label'])



#Divide the features into different categories
feat_current = X.loc[:,[('#VAL' in str(x)) or ('ESM#LastLabel' in str(x)) for x in X.keys()]]
feat_dsc = X.loc[:,[('#DSC' in str(x))  for x in X.keys()]]
feat_yesterday = X.loc[:,[('Yesterday' in str(x))  for x in X.keys()]]
feat_today = X.loc[:,[('Today' in str(x))  for x in X.keys()]]

feat_ImmediatePast = X.loc[:,[('ImmediatePast_15' in str(x))  for x in X.keys()]]

#################################################################################
#Below are the available features
#Divide the time window features into sensor/ESM self-report features
feat_current_sensor = X.loc[:,[('#VAL' in str(x))  for x in X.keys()]] #Current sensor features (value right before label)
feat_current_ESM = X.loc[:,[('ESM#LastLabel' in str(x)) for x in X.keys()]] #Current ESM features (value right before label)
feat_ImmediatePast_sensor = feat_ImmediatePast.loc[:,[('ESM' not in str(x)) for x in feat_ImmediatePast.keys()]] #Immediate past sensor features (in past 15 minutes before label)
feat_ImmediatePast_ESM = feat_ImmediatePast.loc[:,[('ESM'  in str(x)) for x in feat_ImmediatePast.keys()]]  #Immediate past ESM features
feat_today_sensor = feat_today.loc[:,[('ESM' not in str(x))  for x in feat_today.keys()]] #Today epoch sensor features
feat_today_ESM = feat_today.loc[:,[('ESM'  in str(x)) for x in feat_today.keys()]] #Today epoch ESM features
feat_yesterday_sensor = feat_yesterday.loc[:,[('ESM' not in str(x)) for x in feat_yesterday.keys()]] #Yesterday sensor features
feat_yesterday_ESM = feat_yesterday.loc[:,[('ESM'  in str(x)) for x in feat_yesterday.keys()]] #Yesterday ESM features

feat_sleep = X.loc[:,[('Sleep' in str(x))  for x in X.keys()]]
feat_time = X.loc[:,[('Time' in str(x))  for x in X.keys()]]
feat_pif = X.loc[:,[('PIF' in str(x))  for x in X.keys()]]
################################################################################

#Prepare the final feature set
feat_baseline = pd.concat([ feat_time,feat_dsc,feat_current_sensor, feat_ImmediatePast_sensor],axis=1)

feat_final = pd.concat([feat_baseline  ],axis=1)


################################################################################
X = feat_final
cats = X.columns[X.dtypes == bool]

In [7]:
feat_current_ESM

Unnamed: 0,ESM#LastLabel
0,0.0
1,1.0
2,1.0
3,0.0
4,0.0
...,...
2614,0.0
2615,0.0
2616,0.0
2617,1.0


## 3.Model Training & Evaluation


Here is the revised XGBoost Classifier. We will use random eval_size percent of training set data as evaluation set for early stoppping.

In [8]:
import pandas as pd
import numpy as np
from xgboost import XGBClassifier 
from sklearn.base import BaseEstimator
from sklearn.model_selection import  train_test_split
from typing import Union

#Function for revised xgboost classifier
class EvXGBClassifier(BaseEstimator):
    """
    Enhanced XGBClassifier with built-in validation set approach for early stopping.
    """
    def __init__(
        self,
        eval_size=None,
        eval_metric='logloss',
        early_stopping_rounds=10,
        random_state=None,
        **kwargs
        ):
        """
        Initializes the custom XGBoost Classifier.

        Args:
            eval_size (float): The proportion of the dataset to include in the evaluation split.
            eval_metric (str): The evaluation metric used for model training.
            early_stopping_rounds (int): The number of rounds to stop training if hold-out metric doesn't improve.
            random_state (int): Seed for the random number generator for reproducibility.
            **kwargs: Additional arguments to be passed to the underlying XGBClassifier.
        """
        self.random_state = random_state
        self.eval_size = eval_size
        self.eval_metric = eval_metric
        self.early_stopping_rounds = early_stopping_rounds
        # Initialize the XGBClassifier with specified arguments and GPU acceleration.
        self.model = XGBClassifier(
            random_state=self.random_state,
            eval_metric=self.eval_metric,
            early_stopping_rounds=self.early_stopping_rounds,
            tree_method = "hist", device = "cuda", #Use gpu for acceleration
            **kwargs
        )

    @property
    def feature_importances_(self):
        """ Returns the feature importances from the fitted model. """
        return self.model.feature_importances_

    @property
    def feature_names_in_(self):
        """ Returns the feature names from the input dataset used for fitting. """
        return self.model.feature_names_in_

    def fit(self, X: Union[pd.DataFrame, np.ndarray], y: np.ndarray):
        """
        Fit the XGBoost model with optional early stopping using a validation set.

        Args:
            X (Union[pd.DataFrame, np.ndarray]): Training features.
            y (np.ndarray): Target values.
        """
        if self.eval_size:
            # Split data for early stopping evaluation if eval_size is specified.
            X_train_sub, X_val, y_train_sub, y_val = train_test_split(
                X, y, test_size=self.eval_size, random_state=self.random_state)
            # Fit the model with early stopping.
            self.model.fit(
                X_train_sub, y_train_sub,
                eval_set=[(X_val, y_val)],
                verbose=False
            )
        else:
            # Fit the model without early stopping.
            self.model.fit(X, y, verbose=False)

        # Store the best iteration number for predictions.
        self.best_iteration_ = self.model.get_booster().best_iteration
        return self

    def predict(self, X: pd.DataFrame):
        """
        Predict the classes for the given features.

        Args:
            X (pd.DataFrame): Input features.
        """
        return self.model.predict(X, iteration_range=(0, self.best_iteration_ + 1))

    def predict_proba(self, X: pd.DataFrame):
        """
        Predict the class probabilities for the given features.

        Args:
            X (pd.DataFrame): Input features.
        """
        return self.model.predict_proba(X, iteration_range=(0, self.best_iteration_ + 1))

The following is defined functions for model training and model evaluation (cross-validation).

In [9]:
import os
import pandas as pd
import numpy as np
import time
import traceback
from sklearn.linear_model import LogisticRegression
from sklearn.base import clone
from sklearn.model_selection import StratifiedShuffleSplit, StratifiedKFold, LeaveOneGroupOut, StratifiedGroupKFold
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE, SMOTENC
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from dataclasses import dataclass
from sklearn.preprocessing import OrdinalEncoder
from imblearn.over_sampling import ADASYN

@dataclass
class FoldResult:
    name: str
    metrics: dict
    duration: float

def log(message: str):
    print(message)  # Simple logging to stdout or enhance as needed

def train_fold(dir_result: str, fold_name: str, X_train, y_train, X_test, y_test, C_cat, C_num, estimator, normalize, select, oversample, random_state):
    """
    Function to train and evaluate the model for a single fold.
    Args:
        dir_result (str): Directory to store results.
        fold_name (str): Name of the fold for identification.
        X_train, y_train (DataFrame, Series): Training data.
        X_test, y_test (DataFrame, Series): Testing data.
        C_cat, C_num (array): Lists of categorical and numeric feature names.
        estimator (estimator instance): The model to be trained.
        normalize (bool): Flag to apply normalization.
        select (SelectFromModel instance): Feature selection method.
        oversample (bool): Flag to apply oversampling.
        random_state (int): Random state for reproducibility.
    Returns:
        FoldResult: Object containing metrics and duration of the training.
    """
    try:
        start_time = time.time()
        if normalize:
            X_train_N, X_test_N = X_train[C_num].values, X_test[C_num].values
            X_train_C, X_test_C = X_train[C_cat].values, X_test[C_cat].values
            # Standard scaler only applied to numeric data
            scaler = StandardScaler().fit(X_train_N)
            X_train_N = scaler.transform(X_train_N)
            X_test_N = scaler.transform(X_test_N)

            X_train = pd.DataFrame(
                np.concatenate((X_train_C, X_train_N), axis=1),
                columns=np.concatenate((C_cat, C_num))
            )
            X_test = pd.DataFrame(
                np.concatenate((X_test_C, X_test_N), axis=1),
                columns=np.concatenate((C_cat, C_num))
            )

        if select:

            if isinstance(select, SelectFromModel):
                select = [select]

            for i, s in enumerate(select):
                C = np.asarray(X_train.columns)
                M = s.fit(X=X_train.values, y=y_train).get_support()
                C_sel = C[M]
                C_cat = C_cat[np.isin(C_cat, C_sel)]
                C_num = C_num[np.isin(C_num, C_sel)]

                X_train_N, X_test_N = X_train[C_num].values, X_test[C_num].values
                X_train_C, X_test_C = X_train[C_cat].values, X_test[C_cat].values


                X_train = pd.DataFrame(
                    np.concatenate((X_train_C, X_train_N), axis=1),
                    columns=np.concatenate((C_cat, C_num))
                )
                X_test = pd.DataFrame(
                    np.concatenate((X_test_C, X_test_N), axis=1),
                    columns=np.concatenate((C_cat, C_num))
                )

        if oversample:
            # Encode categorical features if any
            if len(C_cat) > 0:
                encoder = OrdinalEncoder()
                X_train[C_cat] = encoder.fit_transform(X_train[C_cat])

            # Changed smote to ADASYN for better handling of imbalanced datasets
            sampler = ADASYN(random_state=random_state)
            X_train, y_train = sampler.fit_resample(X_train, y_train)

        estimator = clone(estimator).fit(X_train, y_train)
        y_pred = estimator.predict_proba(X_test)[:, 1]
        #Deafult average method for roc_auc_score is macro
        auc_score = roc_auc_score(y_test, y_pred, average=None)

        result = FoldResult(
            name=fold_name,
            metrics={'AUC': auc_score},
            duration=time.time() - start_time
        )
        log(f'Training completed for {fold_name} with AUC: {auc_score}')
        return result

    except Exception as e:
        log(f'Error in {fold_name}: {traceback.format_exc()}')
        return None

def perform_cross_validation(X, y, groups, estimator, normalize=False, select=None, oversample=False, random_state=None):
    """
    Function to perform cross-validation using StratifiedGroupKFold.
    Args:
        X, y (DataFrame, Series): The entire dataset.
        groups (array): Array indicating the group for each instance in X.
        estimator (estimator instance): The model to be trained.
        normalize, select, oversample (bool): Preprocessing options.
        random_state (int): Seed for reproducibility.
    Returns:
        list: A list containing FoldResult for each fold.
    """
    futures = []
    # Group-k cross validation
    splitter = StratifiedGroupKFold(n_splits=5, shuffle =True, random_state = 42)
    # Loop over all the LOSO splits
    for idx, (train_idx, test_idx) in enumerate(splitter.split(X, y, groups)):
        X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
        y_train, y_test = y[train_idx], y[test_idx]
        C_cat = np.asarray(sorted(cats))
        C_num = np.asarray(sorted(X.columns[~X.columns.isin(C_cat)]))

        job = train_fold('path_to_results', f'Fold_{idx}', X_train, y_train, X_test, y_test, C_cat, C_num, estimator, normalize, select, oversample, random_state)
        futures.append(job)
    return futures


Here, we define the feature selection method and classifier and execute the code. AUC-ROC is calculated as mean of macro AUC-ROC for all folds/users.

In [10]:
#Featur Selection, you may want to change the feature selection methods
SELECT_LASSO = SelectFromModel(
        estimator=LogisticRegression(
        penalty='l1'
        ,solver='liblinear'
        , C=1, random_state=RANDOM_STATE, max_iter=4000
    ),
    # This threshold may impact the model performance as well
    threshold = 0.005
)
#Classifier
#There could exist more parameters. Please search in your defined parameter
#space for model performance improvement
estimator = EvXGBClassifier(
    random_state=RANDOM_STATE,
    eval_metric='logloss',
    eval_size=0.2,
    early_stopping_rounds=10,
    objective='binary:logistic', #Prediction instead of regression
    verbosity=0,
    learning_rate=0.01,
)

#Perform cross validation including model training and evaluation
results = perform_cross_validation(X, y, groups, estimator, normalize=True, select=[SELECT_LASSO], oversample=True, random_state=42)
auc_values = [results[i].metrics['AUC'] for i in range(len(results))]
mean_auc = np.mean(auc_values)
print(mean_auc)

BASELINE_SCORE = mean_auc
previous_mean_auc = mean_auc

Training completed for Fold_0 with AUC: 0.6085028179367802
Training completed for Fold_1 with AUC: 0.566392947711629
Training completed for Fold_2 with AUC: 0.5536583135174685
Training completed for Fold_3 with AUC: 0.5370038125140166
Training completed for Fold_4 with AUC: 0.6069667669846961
0.5745049317329182


# Assignment

## Assignment 1. Improve the model performance using different types of feature combinations. (20pts)

 Hint: Currently we are only using feat_baseline. You may want to try other feature combinations.

In [11]:
# Selecting features
feat_baseline = pd.concat([ feat_time,feat_dsc,feat_current_sensor, feat_ImmediatePast_sensor],axis=1)
feat_final = pd.concat([feat_baseline, feat_current_ESM,feat_today_ESM ], axis=1) 

X = feat_final
cats = X.columns[X.dtypes == bool]

In [12]:
# Run model training and evaluation again with the selected features
results = perform_cross_validation(X, y, groups, estimator, normalize=True, select=[SELECT_LASSO], oversample=True, random_state=42)
auc_values = [results[i].metrics['AUC'] for i in range(len(results))]
mean_auc = np.mean(auc_values)
print(mean_auc)

print(f"Difference from baseline: {mean_auc - BASELINE_SCORE:.4f}")

print(f"Difference from previous mean AUC: {mean_auc - previous_mean_auc:.4f}")
previous_mean_auc = mean_auc

Training completed for Fold_0 with AUC: 0.6515069835824552
Training completed for Fold_1 with AUC: 0.5798454292959787
Training completed for Fold_2 with AUC: 0.5782970550576185
Training completed for Fold_3 with AUC: 0.5772734918143081
Training completed for Fold_4 with AUC: 0.6347569955817378
0.6043359910664197
Difference from baseline: 0.0298
Difference from previous mean AUC: 0.0298


## Assignment 2. Please try different feature selection methods (20pts)

Hint: Currently, we are using LASSO filter for feature selection. Please consider using embedded method as well(same model for both feature selection and model training). Besides, the threshold for LASSO filter may also affect the performance. **Sepcifically, there is a method called 'mean' which is using mean of feature importances of all features as threshold.** Please try both different feature selection methods and different thresholds for filtering features to improve model performance.

In [13]:
# Trying xgboost feature selection
from xgboost import XGBClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectFromModel

xgb_selector = SelectFromModel(XGBClassifier(n_estimators=100, random_state=RANDOM_STATE))

results = perform_cross_validation(X, y, groups, estimator, normalize=True, select=[xgb_selector], oversample=True, random_state=42)
auc_values = [results[i].metrics['AUC'] for i in range(len(results))]
mean_auc = np.mean(auc_values)
print(mean_auc)

print(f"Difference from baseline: {mean_auc - BASELINE_SCORE:.4f}")
print(f"Difference from previous mean AUC: {mean_auc - previous_mean_auc:.4f}")
previous_mean_auc = mean_auc

Training completed for Fold_0 with AUC: 0.6511149228130361
Training completed for Fold_1 with AUC: 0.6187658495350803
Training completed for Fold_2 with AUC: 0.5821382842509603
Training completed for Fold_3 with AUC: 0.5701390446288406
Training completed for Fold_4 with AUC: 0.614706729845681
0.6073729662147196
Difference from baseline: 0.0329
Difference from previous mean AUC: 0.0030


## Assignment 3. Please try using hyperopt for model hyperparameter tuning (20 pts)

Hint: Please be aware that for revised xgboost classifier EvXGBClassifier, there exist other parameters other than default XGBClassifier parameters such as eval_size.

For hyperparameter tuning, we will use 20% of training set as validation set to avoid data leakage.

If it is too timeconsuming to run the code in colab, please run the code locally and consider using [ray tune](https://docs.ray.io/en/latest/tune/index.html) if needed.

In [14]:
# Hyperparameter tuning using Hyperopt
import numpy as np
import pandas as pd
from hyperopt import STATUS_OK, Trials, hp, fmin, tpe
from sklearn.model_selection import StratifiedGroupKFold, train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE, SMOTENC
from sklearn.feature_selection import SelectFromModel

# define your outer CV
OUTER_CV = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)

def objective(params):
    val_scores = []

    # outer loop: split into train_full / test (we will only use train_full for tuning)
    for train_full_idx, _ in OUTER_CV.split(X, y, groups):
        X_train_full = X.iloc[train_full_idx]
        y_train_full = y[train_full_idx]

        # split 20% of the *training fold* into a validation set
        X_train, X_val, y_train, y_val = train_test_split(
            X_train_full, y_train_full,
            test_size=0.20,
            stratify=y_train_full,
            random_state=42
        )

        # 1) Normalize
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_val_scaled   = scaler.transform(X_val)

        # 2) (Optional) Oversample on *training only*
        if np.any(X_train_scaled[:, -1] < 1):
            encoder = OrdinalEncoder()
            X_train_scaled[:, -1] = encoder.fit_transform(X_train_scaled[:, -1].reshape(-1, 1)).ravel()

        adasyn = ADASYN(random_state=int(params['random_state']))

        X_train_os, y_train_os = adasyn.fit_resample(X_train_scaled, y_train)

        # 3) Feature selection on *training only*
        # embedded feature selection
        estimator = EvXGBClassifier(
            max_depth=int(params['max_depth']),
            min_child_weight=int(params['min_child_weight']),
            subsample=params['subsample'],
            colsample_bytree=params['colsample_bytree'],
            gamma=params['gamma'],
            learning_rate=params['learning_rate'],
            n_estimators=int(params['n_estimators']),
            reg_lambda=params['reg_lambda'],
            reg_alpha=params['reg_alpha'],
            random_state=int(params['random_state']),
            eval_metric='logloss',
            eval_size=0.2,
            early_stopping_rounds=10
        )
        
        selector = xgb_selector

        X_train_sel = selector.fit_transform(X_train_os, y_train_os)
        X_val_sel   = selector.transform(X_val_scaled)

        # 4) Train & score on *validation only*
        clf = estimator
        clf.fit(X_train_sel, y_train_os)
        y_val_prob = clf.predict_proba(X_val_sel)[:, 1]
        val_scores.append(roc_auc_score(y_val, y_val_prob))

    # Hyperopt minimizes “loss”, so negate AUC
    return {'loss': -np.mean(val_scores), 'status': STATUS_OK}


# define your search space (fill in any missing parameters e.g. max_depth)
space = {
    'max_depth':          hp.quniform('max_depth', 3, 10, 1),  # Integer between 3 and 10
    'min_child_weight':   hp.quniform('min_child_weight', 1, 10, 1), # Integer between 1 and 10
    'subsample':          hp.uniform('subsample', 0.6, 1.0),  # Float between 0.6 and 1.0
    'colsample_bytree':   hp.uniform('colsample_bytree', 0.6, 1.0), # Float between 0.6 and 1.0
    'gamma':              hp.uniform('gamma', 0, 0.5),      # Float between 0 and 0.5
    'learning_rate':      hp.loguniform('learning_rate', -5, 0), # Float on a log scale (0.0067 to 1)
    'n_estimators':       hp.quniform('n_estimators', 100, 1000, 50), # Integer between 100 and 1000, steps of 50
    'reg_lambda':         hp.uniform('reg_lambda', 0, 1),     # Float between 0 and 1 (L2 regularization)
    'reg_alpha':          hp.uniform('reg_alpha', 0, 0.5),    # Float between 0 and 0.5 (L1 regularization)
    'random_state':       42 # Keeping random_state fixed
}

# run hyperopt
trials = Trials()
best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=50,
    trials=trials
)

print("Best hyperparameters:", best)

100%|██████████| 50/50 [22:55<00:00, 27.51s/trial, best loss: -0.6926022284511587]
Best hyperparameters: {'colsample_bytree': 0.7761731107644282, 'gamma': 0.38565216552801934, 'learning_rate': 0.007015335191520455, 'max_depth': 6.0, 'min_child_weight': 8.0, 'n_estimators': 450.0, 'reg_alpha': 0.06506803991327365, 'reg_lambda': 0.8834960461115632, 'subsample': 0.8805329501115369}


In [21]:
# saving the best hyperparameters so I dont have to run hyperopt again
best = {'colsample_bytree': 0.7761731107644282, 'gamma': 0.38565216552801934, 'learning_rate': 0.007015335191520455, 'max_depth': 6.0, 'min_child_weight': 8.0, 'n_estimators': 450.0, 'reg_alpha': 0.06506803991327365, 'reg_lambda': 0.8834960461115632, 'subsample': 0.8805329501115369}

In [23]:
# Run the final model with the best hyperparameters
best_params = {
    'max_depth': int(best['max_depth']),
    'min_child_weight': int(best['min_child_weight']),
    'subsample': best['subsample'],
    'colsample_bytree': best['colsample_bytree'],
    'gamma': best['gamma'],
    'learning_rate': best['learning_rate'],
    'n_estimators': int(best['n_estimators']),
    'reg_lambda': best['reg_lambda'],
    'reg_alpha': best['reg_alpha'],
    'random_state': 42,
}

# Final model training with the best hyperparameters
final_estimator = EvXGBClassifier(
    **best_params,
    eval_metric='logloss',
    eval_size=0.2,
    early_stopping_rounds=10
)

# Perform cross-validation with the best hyperparameters
final_results = perform_cross_validation(
    X, y, groups, final_estimator,
    normalize=True, select=[xgb_selector], oversample=True, random_state=42
)

auc_values = [final_results[i].metrics['AUC'] for i in range(len(final_results))]
mean_auc = np.mean(auc_values)
print("Final AUC after hyperparameter tuning:", mean_auc)
print(f"Difference from baseline: {mean_auc - BASELINE_SCORE:.4f}")
print(f"Improvement over previous mean AUC: {mean_auc - previous_mean_auc:.4f}")
previous_mean_auc = mean_auc


Training completed for Fold_0 with AUC: 0.6464714530752267
Training completed for Fold_1 with AUC: 0.618421688201908
Training completed for Fold_2 with AUC: 0.5362996158770805
Training completed for Fold_3 with AUC: 0.5950255102040817
Training completed for Fold_4 with AUC: 0.6094320291989498
Final AUC after hyperparameter tuning: 0.6011300593114494
Difference from baseline: 0.0266
Improvement over previous mean AUC: 0.0000


# Trying different combinations for xgb

In [24]:
# Feature selection
feat_final = pd.concat([ feat_current_ESM ,feat_today_ESM,feat_sleep,feat_time ],axis=1)
X = feat_final
cats = X.columns[X.dtypes == bool]

In [25]:

xgb_classifier = EvXGBClassifier(
    random_state=RANDOM_STATE,
    eval_metric='logloss',
    eval_size=0.2,
    early_stopping_rounds=10
)

xgb_selector = SelectFromModel(
    estimator=xgb_classifier,
    threshold='mean'  # Select features with importance above the mean
)

# Perform cross-validation with the best hyperparameters
final_results = perform_cross_validation(
    X, y, groups, xgb_classifier,
    normalize=True, select=[xgb_selector], oversample=True, random_state=42
)
auc_values = [final_results[i].metrics['AUC'] for i in range(len(final_results))]
mean_auc = np.mean(auc_values)
print("Final AUC:", mean_auc)
print(f"Difference from baseline: {mean_auc - BASELINE_SCORE:.4f}")
print(f"Improvement over previous mean AUC: {mean_auc - previous_mean_auc:.4f}")
previous_mean_auc = mean_auc



Training completed for Fold_0 with AUC: 0.6464714530752267
Training completed for Fold_1 with AUC: 0.618421688201908
Training completed for Fold_2 with AUC: 0.5362996158770805
Training completed for Fold_3 with AUC: 0.5950255102040817
Training completed for Fold_4 with AUC: 0.6094320291989498
Final AUC: 0.6011300593114494
Difference from baseline: 0.0266
Improvement over previous mean AUC: 0.0000


## Assignment 4. Please consider replacing the previous traditional machine learning model with deep learning models designed for **tabular data** to improve model performance. (20 pts)

Hint: Since features are already extracted manually, it is impossible to use end-to-end deep learning models. Instead, try replacing xgboost with deep learning models designed for **tabular data** and see if there is performance improvement.

In [26]:
# Defining a sklearn-compatible wrapper for TabNet
from sklearn.base import BaseEstimator, ClassifierMixin
from pytorch_tabnet.tab_model import TabNetClassifier
from sklearn.preprocessing import LabelEncoder
import numpy as np

class SklearnTabNet(BaseEstimator, ClassifierMixin):
    def __init__(self, **kwargs):
        self.model = TabNetClassifier(**kwargs)
        self.label_encoder = LabelEncoder()

    def fit(self, X, y, **fit_params):
        # Convert DataFrame to NumPy if needed
        if hasattr(X, "values"):
            X = X.values
        y = self.label_encoder.fit_transform(y)
        self.model.fit(X, y, **fit_params)
        return self

    def predict(self, X):
        if hasattr(X, "values"):
            X = X.values
        preds = self.model.predict(X)
        return self.label_encoder.inverse_transform(preds)

    def predict_proba(self, X):
        if hasattr(X, "values"):
            X = X.values
        return self.model.predict_proba(X)


You may need to change runtime to TPU first to use torch or other packages you may want to use.



Please compare it with your previous XGBoost model performance and think about why it is higher or lower than XGBoost.

In [27]:
# Running tabnet
from sklearn.feature_selection import SelectFromModel
import torch
# import randomforest
from sklearn.ensemble import RandomForestClassifier

tabnet_classifier = SklearnTabNet(
    n_d=8,
    n_a=8,
    n_steps=3,
    gamma=1.5,
    n_independent=2,
    n_shared=2,
    optimizer_fn=torch.optim.Adam,
    optimizer_params=dict(lr=2e-2),
    mask_type='sparsemax',
    verbose=0,
    seed=RANDOM_STATE
)

results = perform_cross_validation(X, y, groups, tabnet_classifier, normalize=True, select=[xgb_selector], oversample=True, random_state=42)
auc_values = [results[i].metrics['AUC'] for i in range(len(results))]
mean_auc = np.mean(auc_values)
print(mean_auc)

print(f"Difference from baseline: {mean_auc - BASELINE_SCORE:.4f}")
print(f"Difference from previous mean AUC: {mean_auc - previous_mean_auc:.4f}")
previous_mean_auc = mean_auc

epoch 0  | loss: 0.94564 |  0:00:00s
epoch 1  | loss: 0.75195 |  0:00:00s
epoch 2  | loss: 0.74165 |  0:00:00s
epoch 3  | loss: 0.70735 |  0:00:00s
epoch 4  | loss: 0.69181 |  0:00:00s
epoch 5  | loss: 0.69213 |  0:00:00s
epoch 6  | loss: 0.68294 |  0:00:00s
epoch 7  | loss: 0.68496 |  0:00:00s
epoch 8  | loss: 0.68329 |  0:00:00s
epoch 9  | loss: 0.67698 |  0:00:00s
epoch 10 | loss: 0.6781  |  0:00:00s
epoch 11 | loss: 0.67262 |  0:00:00s
epoch 12 | loss: 0.67893 |  0:00:00s
epoch 13 | loss: 0.67371 |  0:00:00s
epoch 14 | loss: 0.67202 |  0:00:00s
epoch 15 | loss: 0.6645  |  0:00:00s
epoch 16 | loss: 0.66733 |  0:00:00s
epoch 17 | loss: 0.67581 |  0:00:00s
epoch 18 | loss: 0.66678 |  0:00:00s
epoch 19 | loss: 0.67814 |  0:00:00s
epoch 20 | loss: 0.67374 |  0:00:00s
epoch 21 | loss: 0.67031 |  0:00:00s
epoch 22 | loss: 0.67321 |  0:00:01s
epoch 23 | loss: 0.67213 |  0:00:01s
epoch 24 | loss: 0.67372 |  0:00:01s
epoch 25 | loss: 0.66947 |  0:00:01s
epoch 26 | loss: 0.67225 |  0:00:01s
e

In [None]:
# hyperparameter optimization for TabNet
from hyperopt import hp
import torch
import numpy as np
import pandas as pd
from hyperopt import STATUS_OK, Trials, hp, fmin, tpe
from sklearn.model_selection import StratifiedGroupKFold, train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE, SMOTENC
from sklearn.feature_selection import SelectFromModel

# define your outer CV
OUTER_CV = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)

def objective(params):
    val_scores = []

    # outer loop: split into train_full / test (we will only use train_full for tuning)
    for train_full_idx, _ in OUTER_CV.split(X, y, groups):
        X_train_full = X.iloc[train_full_idx]
        y_train_full = y[train_full_idx]

        # split 20% of the *training fold* into a validation set
        X_train, X_val, y_train, y_val = train_test_split(
            X_train_full, y_train_full,
            test_size=0.20,
            stratify=y_train_full,
            random_state=42
        )

        # 1) Normalize
        scaler = StandardScaler()
        X_train_scaled = scaler.fit_transform(X_train)
        X_val_scaled   = scaler.transform(X_val)

        # 2) (Optional) Oversample on *training only*
        if np.any(X_train_scaled[:, -1] < 1):
            encoder = OrdinalEncoder()
            X_train_scaled[:, -1] = encoder.fit_transform(X_train_scaled[:, -1].reshape(-1, 1)).ravel()

        adasyn = ADASYN(random_state=int(params['random_state']))

        X_train_os, y_train_os = adasyn.fit_resample(X_train_scaled, y_train)

        # 3) Feature selection on *training only*
        # embedded feature selection
        estimator = SklearnTabNet(
            n_d=params['n_d'],
            n_a=params['n_a'],
            n_steps=params['n_steps'],
            gamma=params['gamma'],
            n_independent=params['n_independent'],
            n_shared=params['n_shared'],
            optimizer_fn=torch.optim.Adam,
            optimizer_params={'lr': params['optimizer_params']['lr']},
            mask_type=params['mask_type'],
            seed=int(params['seed']),
            verbose=0
        )
        
        selector =  xgb_selector

        X_train_sel = selector.fit_transform(X_train_os, y_train_os)
        X_val_sel   = selector.transform(X_val_scaled)

        # 4) Train & score on *validation only*
        clf = estimator
        clf.fit(X_train_sel, y_train_os)
        y_val_prob = clf.predict_proba(X_val_sel)[:, 1]
        val_scores.append(roc_auc_score(y_val, y_val_prob))

    # Hyperopt minimizes “loss”, so negate AUC
    return {'loss': -np.mean(val_scores), 'status': STATUS_OK}


tabnet_space = {
    'n_d': hp.choice('n_d', [8, 16, 24, 32, 64]),
    'n_a': hp.choice('n_a', [8, 16, 24, 32, 64]),
    'n_steps': hp.choice('n_steps', [3, 4, 5, 6, 7]),
    'gamma': hp.uniform('gamma', 1.0, 2.5),
    'n_shared': hp.choice('n_shared', [1, 2, 3]),
    'n_independent': hp.choice('n_independent', [1, 2, 3]),
    'optimizer_fn': torch.optim.Adam,  # Keep fixed for simplicity
    'optimizer_params': {
        'lr': hp.loguniform('lr', np.log(1e-4), np.log(2e-2))  # 0.0001 to 0.02
    },
    'mask_type': hp.choice('mask_type', ['entmax', 'sparsemax']),
    'seed': RANDOM_STATE,  # Fixed for reproducibility,
    'random_state': RANDOM_STATE,
    'max_epochs':10, # Optional: max epochs for training
    'early_stopping_rounds': 10  # Optional: early stopping rounds
}

# run hyperopt
trials = Trials()
best_tabnet_parameters = fmin(
    fn=objective,
    space=tabnet_space,
    algo=tpe.suggest,
    max_evals=20,
    trials=trials
)

print("Best hyperparameters:", best_tabnet_parameters)

In [29]:
# hard code the best parameters for TabNet so that I dont have to run hyperopt again
best_tabnet_parameters = {'gamma': 1.392443662732613, 'lr': 0.01992926577113052, 'mask_type': 1, 'n_a': 2, 'n_d': 4, 'n_independent': 0, 'n_shared': 1, 'n_steps': 1}

In [30]:
# Final model training with the best hyperparameters

final_tabnet_classifier = SklearnTabNet(
    n_d=8,
    n_a=32,
    n_steps=4,
    gamma=best_tabnet_parameters['gamma'],
    n_independent=1,
    n_shared=3,
    optimizer_fn=torch.optim.Adam,
    optimizer_params={'lr': best_tabnet_parameters['lr']},
    mask_type='entmax' if best_tabnet_parameters['mask_type'] == 0 else 'sparsemax',
    seed=RANDOM_STATE,
    verbose=0
)
results = perform_cross_validation(X, y, groups, final_tabnet_classifier, normalize=True, select=[xgb_selector], oversample=True, random_state=42)
auc_values = [results[i].metrics['AUC'] for i in range(len(results))]
mean_auc = np.mean(auc_values)
print(mean_auc)

print(f"Difference from baseline: {mean_auc - BASELINE_SCORE:.4f}")
print(f"Difference from previous mean AUC: {mean_auc - previous_mean_auc:.4f}")
previous_mean_auc = mean_auc

epoch 0  | loss: 0.94564 |  0:00:00s
epoch 1  | loss: 0.75195 |  0:00:00s
epoch 2  | loss: 0.74165 |  0:00:00s
epoch 3  | loss: 0.70735 |  0:00:00s
epoch 4  | loss: 0.69181 |  0:00:00s
epoch 5  | loss: 0.69213 |  0:00:00s
epoch 6  | loss: 0.68294 |  0:00:00s
epoch 7  | loss: 0.68496 |  0:00:00s
epoch 8  | loss: 0.68329 |  0:00:00s
epoch 9  | loss: 0.67698 |  0:00:00s
epoch 10 | loss: 0.6781  |  0:00:00s
epoch 11 | loss: 0.67262 |  0:00:00s
epoch 12 | loss: 0.67893 |  0:00:00s
epoch 13 | loss: 0.67371 |  0:00:00s
epoch 14 | loss: 0.67202 |  0:00:00s
epoch 15 | loss: 0.6645  |  0:00:00s
epoch 16 | loss: 0.66733 |  0:00:00s
epoch 17 | loss: 0.67581 |  0:00:00s
epoch 18 | loss: 0.66678 |  0:00:00s
epoch 19 | loss: 0.67814 |  0:00:00s
epoch 20 | loss: 0.67374 |  0:00:00s
epoch 21 | loss: 0.67031 |  0:00:00s
epoch 22 | loss: 0.67321 |  0:00:00s
epoch 23 | loss: 0.67213 |  0:00:01s
epoch 24 | loss: 0.67372 |  0:00:01s
epoch 25 | loss: 0.66947 |  0:00:01s
epoch 26 | loss: 0.67225 |  0:00:01s
e

## Assignment 5. Please try combining all the above methods to push the model performance. (20 pts)

Hint: Methods other than the above methods are also okay to use to improve model performance.

Please avoid data leakage when conducting hyperparameter tuning.


In [32]:
# Feature selection
feat_final = pd.concat([ feat_current_ESM ,feat_today_ESM,feat_sleep,feat_time ],axis=1)
X = feat_final
cats = X.columns[X.dtypes == bool]

In [33]:
# defining soft voting ensemble wrapper
from sklearn.base import BaseEstimator, ClassifierMixin

class SoftVotingEnsemble(BaseEstimator, ClassifierMixin):
    def __init__(self, models, weights=None):
        self.models = models
        self.weights = weights if weights else [1.0] * len(models)

    def fit(self, X, y):
        for model in self.models:
            model.fit(X, y)
        return self

    def predict_proba(self, X):
        probs = np.array([model.predict_proba(X) for model in self.models])
        weighted_probs = np.average(probs, axis=0, weights=self.weights)
        return weighted_probs

    def predict(self, X):
        return np.argmax(self.predict_proba(X), axis=1)

In [None]:
# Adding tabnet to the ensemble with soft voting
xgb_classifier_weigth = 0.05
soft_voting_ensemble = SoftVotingEnsemble(
    models=[xgb_classifier, final_tabnet_classifier],
    weights=[xgb_classifier_weigth, 1-xgb_classifier_weigth]  
)

results = perform_cross_validation(X, y, groups, soft_voting_ensemble, normalize=True, select=[xgb_selector], oversample=True, random_state=42)
auc_values = [results[i].metrics['AUC'] for i in range(len(results))]
mean_auc = np.mean(auc_values)
print(mean_auc)

print(f"Difference from baseline: {mean_auc - BASELINE_SCORE:.4f}")
print(f"Difference from previous mean AUC: {mean_auc - previous_mean_auc:.4f}")
previous_mean_auc = mean_auc

epoch 0  | loss: 0.94564 |  0:00:00s
epoch 1  | loss: 0.75195 |  0:00:00s
epoch 2  | loss: 0.74165 |  0:00:00s
epoch 3  | loss: 0.70735 |  0:00:00s
epoch 4  | loss: 0.69181 |  0:00:00s
epoch 5  | loss: 0.69213 |  0:00:00s
epoch 6  | loss: 0.68294 |  0:00:00s
epoch 7  | loss: 0.68496 |  0:00:00s
epoch 8  | loss: 0.68329 |  0:00:00s
epoch 9  | loss: 0.67698 |  0:00:00s
epoch 10 | loss: 0.6781  |  0:00:00s
epoch 11 | loss: 0.67262 |  0:00:00s
epoch 12 | loss: 0.67893 |  0:00:00s
epoch 13 | loss: 0.67371 |  0:00:00s
epoch 14 | loss: 0.67202 |  0:00:00s
epoch 15 | loss: 0.6645  |  0:00:00s
epoch 16 | loss: 0.66733 |  0:00:00s
epoch 17 | loss: 0.67581 |  0:00:00s
epoch 18 | loss: 0.66678 |  0:00:00s
epoch 19 | loss: 0.67814 |  0:00:00s
epoch 20 | loss: 0.67374 |  0:00:00s
epoch 21 | loss: 0.67031 |  0:00:00s
epoch 22 | loss: 0.67321 |  0:00:00s
epoch 23 | loss: 0.67213 |  0:00:01s
epoch 24 | loss: 0.67372 |  0:00:01s
epoch 25 | loss: 0.66947 |  0:00:01s
epoch 26 | loss: 0.67225 |  0:00:01s
e