# Model Selection and Training

#### 1. Data Loading and Preprocessing

We begin by importing the curated dataset, which contains the core features—exercise metrics, sleep metrics, and the smoothed recovery time—into a pandas DataFrame. Converting the `date` column into a datetime object allows us to leverage time-based indexing and operations throughout the modeling process. To ensure efficient grouping and memory usage, the `personId` field is cast to a categorical type.

Next, we arrange each user’s records in chronological order by sorting on `personId` and `date`. Setting the `date` column as the DataFrame index establishes a time series structure, which is crucial for later steps such as time-aware splitting and cross-validation. Finally, we perform basic sanity checks: reviewing the overall shape of the dataset, inspecting data types and counts, and examining summary statistics for the target variable. These checks confirm that our data is complete and correctly formatted, allowing us to proceed confidently into the modeling phase.


In [1]:
# 1 Load processed data for modeling
import pandas as pd

# read in the curated file and parse the date column
df = pd.read_csv('../data/curated/recovery_time_ready_for_model.csv', parse_dates=['date'])

# convert personId to a categorical type
df['personId'] = df['personId'].astype('category')

# ensure rows are ordered by user and time, then index by date for the chronological split
df.sort_values(['personId', 'date'], inplace=True)
df.set_index('date', inplace=True)

# quick sanity checks
print(f"Data shape: {df.shape}")
df.info()  # prints its summary directly; wrapping it in print() would also emit "None"
print(df['recovery_hours_smooth'].describe().round(2))

# preview the first few records
df.head(3)

Data shape: (1250, 15)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1250 entries, 2019-11-02 to 2020-03-29
Data columns (total 15 columns):
 #   Column                         Non-Null Count  Dtype   
---  ------                         --------------  -----   
 0   personId                       1250 non-null   category
 1   exercise_activeDuration_in_ms  1250 non-null   float64 
 2   exercise_steps                 1250 non-null   float64 
 3   exercise_calories              1250 non-null   float64 
 4   exercise_elevationGain         1250 non-null   float64 
 5   exercise_averageHeartRate      1250 non-null   float64 
 6   sleep_duration_in_ms           1250 non-null   float64 
 7   sleep_minutesAsleep            1250 non-null   float64 
 8   sleep_minutesAwake             1250 non-null   float64 
 9   sleep_efficiency               1250 non-null   float64 
 10  training_load                  1250 non-null   float64 
 11  sleep_score                    1250 non-null   float64

| date | personId | exercise_activeDuration_in_ms | exercise_steps | exercise_calories | exercise_elevationGain | exercise_averageHeartRate | sleep_duration_in_ms | sleep_minutesAsleep | sleep_minutesAwake | sleep_efficiency | training_load | sleep_score | recovery_score | recovery_hours | recovery_hours_smooth |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2019-11-02 | p01 | -0.307434 | -0.458858 | -0.443524 | -0.463559 | -0.368746 | 0.171737 | 0.113733 | 0.556285 | 0.121586 | -0.750958 | 0.23532 | 0.986278 | 18.787551 | 18.787551 |
| 2019-11-04 | p01 | -1.081904 | -1.064046 | -1.036455 | -0.463559 | -0.437021 | -0.171928 | -0.095485 | -0.594552 | -0.382128 | -2.118358 | -0.477613 | 1.640745 | 15.147275 | 16.967413 |
| 2019-11-05 | p01 | -0.320936 | 0.208542 | 0.018905 | -0.162531 | 1.030899 | -0.582109 | -0.52623 | -0.758957 | 1.129014 | -0.302031 | 0.602785 | 0.904816 | 19.240658 | 17.725161 |


#### 2. Chronological Train/Validation/Test Split per User

To evaluate model performance realistically, we split each user’s data chronologically into three disjoint sets:

1. **Training Set (70%)**  
   The earliest portion of each user’s timeline, used to fit model parameters.

2. **Validation Set (15%)**  
   The subsequent period, used to tune hyperparameters and select models without peeking at future data.

3. **Test Set (15%)**  
   The final segment, held out completely until the very end to assess generalization on unseen data.

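Because the split is applied within each `personId` group, a user’s validation and test rows always come later in time than that user’s training rows. The same users therefore appear in more than one set; what the split prevents is temporal leakage within a user (training on a user’s future to predict their past), not overlap of users. After concatenating the training and validation splits, we prepare a combined **train+validation** dataset for cross-validation.  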

For hyperparameter tuning, we employ **GroupKFold** with five folds, using `personId` as the grouping variable. This cross-validation strategy guarantees that all records for a given user remain together in either the training or validation fold, maintaining independence between folds. Finally, we extract our feature matrix (`X`) and target vector (`y`) for both the train+validation and test sets, ready for model fitting and evaluation.


In [2]:
# 2) Chronological train/val/test split per user
import pandas as pd
from sklearn.model_selection import GroupKFold

def split_user(group, train_frac=0.7, val_frac=0.15):
    n = len(group)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return group.iloc[:train_end], group.iloc[train_end:val_end], group.iloc[val_end:]

# apply the split
train_splits, val_splits, test_splits = [], [], []
for user_id, user_df in df.groupby('personId', observed=True):
    t, v, te = split_user(user_df)
    train_splits.append(t)
    val_splits.append(v)
    test_splits.append(te)

train_df = pd.concat(train_splits)
val_df = pd.concat(val_splits)
test_df = pd.concat(test_splits)

# combine train+validation for cross-validation
trainval_df = pd.concat([train_df, val_df]).sort_index()

# confirm sizes
print(f"Train set:      {train_df.shape[0]} rows ({train_df['personId'].nunique()} users)")
print(f"Validation set: {val_df.shape[0]} rows ({val_df['personId'].nunique()} users)")
print(f"Train+Val set:  {trainval_df.shape[0]} rows")
print(f"Test set:       {test_df.shape[0]} rows ({test_df['personId'].nunique()} users)")

# Define cross-validation strategy for hyperparameter tuning (avoiding user-based leakage)
cv = GroupKFold(n_splits=5)

# define features & target
feature_cols = [
    'exercise_activeDuration_in_ms',
    'exercise_steps',
    'exercise_calories',
    'exercise_elevationGain',
    'exercise_averageHeartRate',
    'sleep_duration_in_ms',
    'sleep_minutesAsleep',
    'sleep_minutesAwake',
    'sleep_efficiency'
]

X_trainval = trainval_df[feature_cols]
y_trainval = trainval_df['recovery_hours_smooth']
X_test = test_df[feature_cols]
y_test = test_df['recovery_hours_smooth']

# Groups for GroupKFold
groups = trainval_df['personId']


Train set:      869 rows (16 users)
Validation set: 179 rows (15 users)
Train+Val set:  1048 rows
Test set:       202 rows (16 users)
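
The output above reports 15 users in the validation set against 16 in the other two. One plausible explanation, sketched below, is that the `int()` truncation in `split_user` leaves a user with very few rows without any validation rows. The helper `split_sizes` and the row counts (6, 10, 100) are illustrative assumptions, not values taken from the dataset.

```python
def split_sizes(n, train_frac=0.7, val_frac=0.15):
    """Return (train, val, test) row counts using the same arithmetic as split_user."""
    train_end = int(n * train_frac)            # int() truncates toward zero
    val_end = train_end + int(n * val_frac)    # the validation length is truncated too
    return train_end, val_end - train_end, n - val_end

# hypothetical per-user row counts, chosen only to illustrate the truncation
for n in (6, 10, 100):
    train, val, test = split_sizes(n)
    print(f"n={n:3d} -> train={train}, val={val}, test={test}")
```

A user with six rows ends up with no validation rows because `int(6 * 0.15)` is 0. If that matters, rounding instead of truncating, or enforcing a minimum of one row per split, would be possible remedies.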


#### 3. Baseline Model: Constant “Mean” Predictor

Before introducing complex algorithms, we establish a simple reference point using the **mean** of the training target values. This baseline model:

1. **Computes the overall mean** of `recovery_hours_smooth` from the training set.
2. **Predicts this constant mean** for every observation in both the validation and test sets.
3. **Evaluates performance** using Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE).

This baseline helps quantify the intrinsic difficulty of our prediction task—any model we develop should outperform this naive predictor to demonstrate real predictive value.  


In [3]:
# 3 Baseline model: constant “mean” predictor

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# 3.1 Compute the baseline value (mean of the training target)
baseline_value = train_df['recovery_hours_smooth'].mean()
print(f"Baseline (train mean) recovery_hours_smooth: {baseline_value:.2f} hours")

# 3.2 Create constant predictions for validation and test sets
val_preds = np.full(len(val_df), baseline_value)
test_preds = np.full(len(test_df), baseline_value)

# 3.3 Evaluate on validation set
val_mae = mean_absolute_error(val_df['recovery_hours_smooth'], val_preds)
val_rmse = np.sqrt(mean_squared_error(val_df['recovery_hours_smooth'], val_preds))
print(f"Validation MAE : {val_mae:.2f} hours")
print(f"Validation RMSE: {val_rmse:.2f} hours")

# 3.4 Evaluate on test set
test_mae = mean_absolute_error(test_df['recovery_hours_smooth'], test_preds)
test_rmse = np.sqrt(mean_squared_error(test_df['recovery_hours_smooth'], test_preds))
print(f"Test MAE       : {test_mae:.2f} hours")
print(f"Test RMSE      : {test_rmse:.2f} hours")


Baseline (train mean) recovery_hours_smooth: 21.71 hours
Validation MAE : 3.66 hours
Validation RMSE: 4.58 hours
Test MAE       : 3.61 hours
Test RMSE      : 4.42 hours
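
The same constant predictor can also be expressed with scikit-learn's `DummyRegressor`, which makes the baseline usable anywhere an estimator object is expected. The sketch below uses small made-up arrays rather than the notebook's data, so the numbers are not comparable to the output above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# made-up targets, for illustration only
y_train = np.array([20.0, 22.0, 24.0])
y_val = np.array([21.0, 25.0])

# DummyRegressor ignores the features, so placeholder columns are enough
X_train = np.zeros((len(y_train), 1))
X_val = np.zeros((len(y_val), 1))

dummy = DummyRegressor(strategy="mean").fit(X_train, y_train)
dummy_preds = dummy.predict(X_val)   # every prediction equals mean(y_train)

dummy_mae = mean_absolute_error(y_val, dummy_preds)
print(dummy_preds, dummy_mae)
```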


#### 4. Linear Regression with Grouped Cross-Validation

In this section, we apply **Ordinary Least Squares (OLS) Linear Regression** as a straightforward, interpretable model. The main objectives are:

1. **Cross-Validation (GroupKFold):**  
   We use `GroupKFold` to perform 5-fold cross-validation on the combined train+validation set, grouping by `personId`. Each user’s data falls entirely within a single fold, so a model is never validated on a user it was trained on. `GroupKFold` does not order folds in time; it guards against leakage between users rather than leakage of future information, and it yields an estimate of performance on unseen users.

2. **Evaluation Metrics:**  
   We compute the **Mean Absolute Error (MAE)** and **Root Mean Squared Error (RMSE)** across all folds to summarize predictive accuracy and robustness to larger errors.

3. **Final Training and Testing:**  
   After validating, we retrain the linear model on the full train+validation set and evaluate it once on the held-out test set, providing a final unbiased performance estimate.

4. **Interpretability:**  
   We inspect the learned coefficients for each feature to understand the **direction and relative impact** of exercise and sleep metrics on predicted recovery time.

Overall, Linear Regression establishes a solid, interpretable baseline among more complex models, and its cross-validated results guide us in comparing advanced algorithms.
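
To make the grouping behaviour concrete, the following sketch runs `GroupKFold` on a toy dataset and records, for every fold, which users appear on both sides of the split. The user labels and array sizes are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# toy data: 5 hypothetical users with 4 rows each
user_ids = np.repeat(["u1", "u2", "u3", "u4", "u5"], 4)
X_toy = np.arange(len(user_ids)).reshape(-1, 1)
y_toy = np.zeros(len(user_ids))

gkf = GroupKFold(n_splits=5)
fold_overlaps = []
for fold, (train_idx, val_idx) in enumerate(gkf.split(X_toy, y_toy, groups=user_ids)):
    train_users = {str(u) for u in user_ids[train_idx]}
    val_users = {str(u) for u in user_ids[val_idx]}
    fold_overlaps.append(train_users & val_users)   # users on both sides of the split
    print(f"fold {fold}: validation users = {sorted(val_users)}")
```

Every entry of `fold_overlaps` should be an empty set. The splitter only looks at the group labels, so the row order (and therefore the date) plays no role in how folds are formed.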


In [4]:
# 4 Train and evaluate Linear Regression with cross-validation

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import cross_val_score, GroupKFold

# 4.1 Define feature columns
feature_cols = [
    'exercise_activeDuration_in_ms',
    'exercise_steps',
    'exercise_calories',
    'exercise_elevationGain',
    'exercise_averageHeartRate',
    'sleep_duration_in_ms',
    'sleep_minutesAsleep',
    'sleep_minutesAwake',
    'sleep_efficiency'
]

# 4.2 Prepare train+val data for cross-validation
X_trainval = trainval_df[feature_cols]
y_trainval = trainval_df['recovery_hours_smooth']
groups = trainval_df['personId']

# 4.3 Cross-validation setup (GroupKFold ensures no leakage between users)
cv = GroupKFold(n_splits=5)

# Evaluate using negative MAE and RMSE
cv_mae_scores = -cross_val_score(
    LinearRegression(),
    X_trainval,
    y_trainval,
    cv=cv,
    groups=groups,
    scoring='neg_mean_absolute_error'
)

cv_rmse_scores = np.sqrt(-cross_val_score(
    LinearRegression(),
    X_trainval,
    y_trainval,
    cv=cv,
    groups=groups,
    scoring='neg_mean_squared_error'
))

print(f"Linear Regression - CV Mean MAE: {cv_mae_scores.mean():.2f} hours")
print(f"Linear Regression - CV Mean RMSE: {cv_rmse_scores.mean():.2f} hours")

# 4.4 Fit Linear Regression on entire trainval data and evaluate on test set
lr_final = LinearRegression()
lr_final.fit(X_trainval, y_trainval)

# Make predictions on the test set
test_preds_lr = lr_final.predict(X_test)

# Evaluate on the test set
test_mae_lr = mean_absolute_error(y_test, test_preds_lr)
test_rmse_lr = np.sqrt(mean_squared_error(y_test, test_preds_lr))

# report the linear-regression metrics (not the baseline's test_mae / test_rmse)
print(f"\nLinear Regression - Test MAE : {test_mae_lr:.2f} hours")
print(f"Linear Regression - Test RMSE: {test_rmse_lr:.2f} hours")

# 4.5 Inspect coefficients for interpretability
coef_df = pd.Series(lr_final.coef_, index=feature_cols).sort_values()
print("\nLinear Regression Coefficients:")
print(coef_df.round(4))


Linear Regression - CV Mean MAE: 4.17 hours
Linear Regression - CV Mean RMSE: 5.18 hours

Linear Regression - Test MAE : 3.58 hours
Linear Regression - Test RMSE: 4.50 hours

Linear Regression Coefficients:
sleep_duration_in_ms            -11.4424
sleep_efficiency                 -0.7164
exercise_averageHeartRate        -0.2773
exercise_steps                   -0.1746
exercise_activeDuration_in_ms     0.0273
exercise_elevationGain            0.1672
exercise_calories                 1.6816
sleep_minutesAwake                2.2228
sleep_minutesAsleep               9.2774
dtype: float64


#### 5. K-Nearest Neighbors Regression with Hyperparameter Tuning

In this section, we implement a **K-Nearest Neighbors (KNN) Regressor**, which predicts a user’s recovery time as the average of the _k_ most similar historical observations. Key points:

1. **KNN Principle:**  
   Each test instance is assigned the average target value of its _k_ nearest neighbors in feature space. Smaller _k_ can capture fine-grained local patterns but may overfit; larger _k_ smooths noise but may underfit.

2. **Hyperparameter Tuning (Grid Search):**  
   We search over a range of `n_neighbors` (3, 5, 7, …, 15) to find the optimal balance between bias and variance.  

3. **Grouped Cross-Validation (GroupKFold):**  
   By grouping on `personId` during 5-fold cross-validation, we ensure that all records for each user reside entirely in either the training or validation fold. This prevents a candidate _k_ from being scored on users whose other records were used to fit the model.

4. **Final Evaluation:**  
   After selecting the best `n_neighbors`, the tuned model is retrained on the full train+validation set and then evaluated once on the held-out test set. We report the Test MAE and RMSE to measure real-world predictive performance.

This procedure systematically identifies the most appropriate neighborhood size and provides a robust estimate of KNN’s forecasting ability for recovery time.
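
The KNN principle described above can be checked by hand on a toy example: compute the distances, take the _k_ closest points, and average their targets. The data below is made up, and the comparison assumes scikit-learn's defaults (uniform weights, Euclidean distance).

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy one-dimensional data, for illustration only
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_toy = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
query = np.array([[1.2]])
k = 3

# manual prediction: Euclidean distances, k closest points, mean of their targets
distances = np.linalg.norm(X_toy - query, axis=1)
nearest = np.argsort(distances)[:k]
manual_pred = y_toy[nearest].mean()

# scikit-learn equivalent with default settings
knn = KNeighborsRegressor(n_neighbors=k).fit(X_toy, y_toy)
sk_pred = knn.predict(query)[0]

print(manual_pred, sk_pred)
```

Because the prediction depends directly on distances, features on very different scales would dominate the neighbour search; the standardized-looking feature values in the data preview suggest this has already been handled upstream.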


In [5]:
# 5 Train and evaluate KNN Regressor with hyperparameter tuning and cross-validation

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# 5.1 Define hyperparameter grid (number of neighbors to test)
param_grid = {'n_neighbors': [3, 5, 7, 9, 11, 13, 15]}

# 5.2 Set up GroupKFold cross-validation
cv = GroupKFold(n_splits=5)

# 5.3 Run GridSearchCV to find the optimal number of neighbors
grid_knn = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',
    cv=cv,
    n_jobs=-1
)

# Perform Grid Search on training + validation set
grid_knn.fit(X_trainval, y_trainval, groups=groups)

# Display best hyperparameters
print(f"Best n_neighbors: {grid_knn.best_params_['n_neighbors']}")
print(f"Best CV MAE: {-grid_knn.best_score_:.2f} hours")

# 5.4 Evaluate the best model on the test set
best_knn = grid_knn.best_estimator_
test_preds_knn = best_knn.predict(X_test)

test_mae_knn = mean_absolute_error(y_test, test_preds_knn)
test_rmse_knn = np.sqrt(mean_squared_error(y_test, test_preds_knn))

print(f"KNN - Test MAE : {test_mae_knn:.2f} hours")
print(f"KNN - Test RMSE: {test_rmse_knn:.2f} hours")


Best n_neighbors: 15
Best CV MAE: 4.20 hours
KNN - Test MAE : 3.47 hours
KNN - Test RMSE: 4.26 hours


#### 6. Decision Tree Regression with Hyperparameter Tuning

Here, we apply a **Decision Tree Regressor**, a non-parametric model that recursively splits the feature space into homogeneous regions. Key aspects:

1. **Decision Tree Mechanics:**  
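   The tree partitions the data by choosing, at each split, the feature threshold that most reduces an error criterion (squared error by default in scikit-learn’s `DecisionTreeRegressor`, which this notebook leaves unchanged). Each leaf then predicts the mean target value of the training samples in its region.

2. **Regularization via Hyperparameters:**  
   - `max_depth`: Limits tree height to prevent overfitting on noise (shallower trees are less prone to overfitting, but too shallow a tree underfits).  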
   - `min_samples_split`: Minimum number of samples required to split an internal node.  
   - `min_samples_leaf`: Minimum number of samples required to be at a leaf node, ensuring each prediction is based on enough data.

3. **Grid Search with GroupKFold:**  
   We systematically search over combinations of these hyperparameters using 5-fold **GroupKFold** cross-validation, grouping by `personId`. This ensures each user’s data remains intact within each fold, avoiding leakage.

4. **Final Evaluation:**  
   The tuned tree is retrained on the full train+validation set using the best hyperparameters, then tested on the held-out test set. We report the Test MAE and Test RMSE to assess generalization performance.

Decision trees provide easily interpretable decision rules, but require careful tuning to balance bias and variance.  
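
As a small illustration of the splitting mechanics, the sketch below fits a depth-1 tree (a "stump") to toy data with an obvious gap between two clusters and reads back the threshold chosen at the root. The data is made up; `tree_.threshold` is the fitted tree's array of split thresholds.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# two well-separated clusters, for illustration only
X_toy = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_toy = np.array([1.0, 1.0, 1.0, 5.0, 5.0, 5.0])

stump = DecisionTreeRegressor(max_depth=1, random_state=0).fit(X_toy, y_toy)

root_threshold = stump.tree_.threshold[0]                 # split point at the root node
stump_preds = stump.predict(np.array([[2.0], [11.0]]))    # one query per side of the split

print(root_threshold, stump_preds)
```

The root threshold falls midway between the two clusters, and each leaf predicts the mean target of the training points that landed in it.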


In [6]:
# 6 Train and evaluate Decision Tree Regressor with hyperparameter tuning and cross-validation

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# 6.1 Define hyperparameter grid to tune (common Decision Tree hyperparameters)
param_grid = {
    'max_depth': [3, 5, 7, 9, None],  # maximum depth of the tree
    'min_samples_split': [2, 5, 10],  # minimum samples required to split
    'min_samples_leaf': [1, 2, 4]     # minimum samples required at each leaf node
}

# 6.2 Setup GroupKFold cross-validation
cv = GroupKFold(n_splits=5)

# 6.3 Perform GridSearchCV to find optimal hyperparameters
grid_dt = GridSearchCV(
    estimator=DecisionTreeRegressor(random_state=42),
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',
    cv=cv,
    n_jobs=-1
)

# Fit GridSearchCV on training + validation data
grid_dt.fit(X_trainval, y_trainval, groups=groups)

# Display the best hyperparameters found
print(f"Best Hyperparameters: {grid_dt.best_params_}")
print(f"Best CV MAE: {-grid_dt.best_score_:.2f} hours")

# 6.4 Evaluate best model on test set
best_dt = grid_dt.best_estimator_
test_preds_dt = best_dt.predict(X_test)

test_mae_dt = mean_absolute_error(y_test, test_preds_dt)
test_rmse_dt = np.sqrt(mean_squared_error(y_test, test_preds_dt))

print(f"Decision Tree - Test MAE : {test_mae_dt:.2f} hours")
print(f"Decision Tree - Test RMSE: {test_rmse_dt:.2f} hours")


Best Hyperparameters: {'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 2}
Best CV MAE: 4.21 hours
Decision Tree - Test MAE : 3.50 hours
Decision Tree - Test RMSE: 4.39 hours


#### 7. Random Forest Regression with Hyperparameter Tuning

In this section, we leverage a **Random Forest Regressor**, an ensemble of decision trees that reduces variance and improves generalization. Key points:

1. **Random Forest Principles:**  
   - Builds multiple decision trees on bootstrap samples of the data.  
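   - At each split, a random subset of features can be considered, which decorrelates the trees and reduces overfitting. scikit-learn’s `RandomForestRegressor` considers all features by default, so with the default `max_features` used here the randomness comes from the bootstrap samples.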

2. **Hyperparameter Tuning:**  
   - `n_estimators`: Number of trees in the forest — more trees can improve stability at the cost of computation.  
   - `max_depth`, `min_samples_split`, `min_samples_leaf`: Control tree complexity and leaf sizes to balance bias and variance.

3. **Grid Search with GroupKFold:**  
   We perform a systematic grid search across these hyperparameters using 5-fold **GroupKFold**, grouping by `personId`. This ensures that each user’s observations remain entirely in the training or validation fold, preventing leakage between users.

4. **Final Model Evaluation:**  
   After selecting the best hyperparameters, the Random Forest is retrained on the combined train+validation set and evaluated on the held-out test set. We report Test MAE and RMSE to quantify its predictive performance.

Random Forests typically deliver robust, high-accuracy predictions and can handle complex non-linear relationships, making them a strong candidate for our recovery time estimator.  
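
The ensemble idea can be verified directly: for regression, the forest's prediction is the average of its individual trees' predictions. The sketch below checks this on synthetic data; the data-generating formula is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# synthetic data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = X_toy[:, 0] ** 2 + rng.normal(scale=0.1, size=60)

forest = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# average the predictions of the fitted trees stored in estimators_
per_tree = np.stack([tree.predict(X_toy) for tree in forest.estimators_])
manual_avg = per_tree.mean(axis=0)
forest_preds = forest.predict(X_toy)

print(np.allclose(manual_avg, forest_preds))
```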


In [7]:
# 7 Train and evaluate Random Forest Regressor with hyperparameter tuning and cross-validation

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# 7.1 Define hyperparameter grid to tune
param_grid = {
    'n_estimators': [50, 100, 150],
    'max_depth': [3, 5, 7, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# 7.2 Setup GroupKFold cross-validation
cv = GroupKFold(n_splits=5)

# 7.3 Perform GridSearchCV to identify optimal hyperparameters
grid_rf = GridSearchCV(
    estimator=RandomForestRegressor(random_state=42),
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',
    cv=cv,
    n_jobs=-1
)

# Fit GridSearchCV on train+val data
grid_rf.fit(X_trainval, y_trainval, groups=groups)

# Display best hyperparameters found
print(f"Best Hyperparameters: {grid_rf.best_params_}")
print(f"Best CV MAE: {-grid_rf.best_score_:.2f} hours")

# 7.4 Evaluate best model on test set
best_rf = grid_rf.best_estimator_
test_preds_rf = best_rf.predict(X_test)

test_mae_rf = mean_absolute_error(y_test, test_preds_rf)
test_rmse_rf = np.sqrt(mean_squared_error(y_test, test_preds_rf))

print(f"Random Forest - Test MAE : {test_mae_rf:.2f} hours")
print(f"Random Forest - Test RMSE: {test_rmse_rf:.2f} hours")


Best Hyperparameters: {'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 10, 'n_estimators': 150}
Best CV MAE: 4.11 hours
Random Forest - Test MAE : 3.38 hours
Random Forest - Test RMSE: 4.20 hours


#### 8. Gradient Boosting Regression with Hyperparameter Tuning

Gradient Boosting builds an ensemble model by sequentially fitting new trees to the **residual errors** of the combined previous learners. This approach often yields high predictive accuracy by reducing bias step by step through additive corrections, while the learning rate and tree-size limits keep variance in check. Key points:

1. **Boosting Concept:**  
   Each new tree focuses on the samples that the existing ensemble predicts poorly, iteratively reducing overall residual error.

2. **Hyperparameter Tuning:**  
   - `n_estimators`: Number of boosting stages (trees) — more stages can improve fit but risk overfitting.  
   - `learning_rate`: Shrinks the contribution of each tree, balancing the trade-off between model complexity and convergence speed.  
   - `max_depth`, `min_samples_split`, `min_samples_leaf`: Control individual tree complexity and leaf sizes to prevent overfitting.

3. **GroupKFold Cross-Validation:**  
   We search this hyperparameter grid via 5-fold **GroupKFold**, grouping by `personId`. This keeps each user’s data intact within folds, ensuring unbiased performance estimates.

4. **Final Evaluation:**  
   After identifying optimal hyperparameters, the model is retrained on the full train+validation set and evaluated once on the hold-out test set. We report Test MAE and RMSE to quantify real-world predictive performance.

Gradient Boosting often outperforms single-tree methods by combining many weak learners into a strong ensemble, making it a powerful choice for our recovery time prediction task.  
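
The residual-fitting loop can be written out by hand for the squared-error case, where the negative gradient is simply the residual. This is a minimal sketch on synthetic data and is not how `GradientBoostingRegressor` is implemented internally; the learning rate, tree depth, and number of stages are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# synthetic data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(80, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=80)

learning_rate = 0.1
prediction = np.full(len(y_toy), y_toy.mean())   # stage 0: constant mean
train_mse = [np.mean((y_toy - prediction) ** 2)]

for _ in range(50):
    residuals = y_toy - prediction                # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_toy, residuals)
    prediction += learning_rate * tree.predict(X_toy)   # shrunken additive correction
    train_mse.append(np.mean((y_toy - prediction) ** 2))

print(f"training MSE: {train_mse[0]:.3f} -> {train_mse[-1]:.3f}")
```

The training error shrinks with every stage, which is why `n_estimators` and `learning_rate` are tuned together: a smaller learning rate generally needs more stages to reach the same fit.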


In [8]:
# 8 Train and evaluate Gradient Boosting Regressor with hyperparameter tuning and cross-validation

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.metrics import mean_absolute_error, mean_squared_error

# 8.1 Define hyperparameter grid for tuning
param_grid = {
    'n_estimators': [50, 100, 150],
    'learning_rate': [0.01, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# 8.2 Setup GroupKFold cross-validation
cv = GroupKFold(n_splits=5)

# 8.3 Perform GridSearchCV
grid_gbr = GridSearchCV(
    estimator=GradientBoostingRegressor(random_state=42),
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',
    cv=cv,
    n_jobs=-1
)

# Fit GridSearchCV on train+val data
grid_gbr.fit(X_trainval, y_trainval, groups=groups)

# Display best hyperparameters found
print(f"Best Hyperparameters: {grid_gbr.best_params_}")
print(f"Best CV MAE: {-grid_gbr.best_score_:.2f} hours")

# 8.4 Evaluate best model on test set
best_gbr = grid_gbr.best_estimator_
test_preds_gbr = best_gbr.predict(X_test)

test_mae_gbr = mean_absolute_error(y_test, test_preds_gbr)
test_rmse_gbr = np.sqrt(mean_squared_error(y_test, test_preds_gbr))

print(f"Gradient Boosting - Test MAE : {test_mae_gbr:.2f} hours")
print(f"Gradient Boosting - Test RMSE: {test_rmse_gbr:.2f} hours")


Best Hyperparameters: {'learning_rate': 0.01, 'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 150}
Best CV MAE: 4.09 hours
Gradient Boosting - Test MAE : 3.37 hours
Gradient Boosting - Test RMSE: 4.13 hours


#### 9. Support Vector Regression with Hyperparameter Tuning

Support Vector Regression (SVR) extends the principles of Support Vector Machines (SVM) to regression tasks by finding a function that deviates from the true targets by no more than a specified margin (`epsilon`) while remaining as flat as possible. Key elements:

1. **SVR Fundamentals:**  
   - Seeks a regression function within an “epsilon-insensitive tube,” ignoring errors smaller than `epsilon`.  
   - Uses kernel functions (here, RBF) to model complex, non-linear relationships by implicitly mapping the data into a higher-dimensional feature space.

2. **Hyperparameters:**  
   - `C`: Regularization parameter controlling the trade-off between model flatness and tolerance for deviations larger than `epsilon`. Higher `C` permits less deviation but risks overfitting.  
   - `epsilon`: Width of the no-penalty tube around the regression function. Larger `epsilon` yields a simpler, more generalized model.  
   - `kernel`: Selected as `'rbf'` to capture non-linear patterns.

3. **GroupKFold Cross-Validation:**  
   We tune `C` and `epsilon` via 5-fold **GroupKFold**, grouping by `personId`. This ensures that all observations for each user remain together in either the training or validation fold, preventing data leakage.

4. **Final Model Assessment:**  
   After finding the optimal hyperparameters, the SVR model is retrained on the combined train+validation set and evaluated on the held-out test set. We report the Test MAE and RMSE to quantify its predictive accuracy.

SVR is particularly effective for medium-sized datasets with non-linear relationships, offering robust generalization through margin-based optimization.  
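
The "epsilon-insensitive tube" corresponds to a simple per-sample loss: zero when the absolute error is within `epsilon`, and linear beyond it. A minimal sketch, with a hypothetical helper name and made-up values:

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """Per-sample SVR loss: zero inside the tube, linear outside it."""
    return [max(0.0, abs(t - p) - epsilon) for t, p in zip(y_true, y_pred)]

# made-up targets and predictions, for illustration only
y_true = [20.0, 21.0, 22.0]
y_pred = [20.5, 21.05, 20.0]

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1)
print([round(v, 2) for v in losses])
```

The second prediction is off by 0.05 hours, which is inside the tube and therefore costs nothing. With `epsilon=1`, as selected by the grid search, errors of up to one hour go unpenalized during fitting.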


In [9]:
# 9 Train and evaluate SVR with hyperparameter tuning and cross-validation

from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.metrics import mean_absolute_error, mean_squared_error
import numpy as np

# 9.1 Define hyperparameter grid for SVR tuning
param_grid = {
    'C': [0.1, 1, 10, 100],
    'epsilon': [0.01, 0.1, 0.5, 1],
    'kernel': ['rbf']
}

# 9.2 Setup GroupKFold cross-validation
cv = GroupKFold(n_splits=5)

# 9.3 Perform GridSearchCV
grid_svr = GridSearchCV(
    estimator=SVR(),
    param_grid=param_grid,
    scoring='neg_mean_absolute_error',
    cv=cv,
    n_jobs=-1
)

# Fit GridSearchCV on train+val data
grid_svr.fit(X_trainval, y_trainval, groups=groups)

# Display best hyperparameters found
print(f"Best Hyperparameters: {grid_svr.best_params_}")
print(f"Best CV MAE: {-grid_svr.best_score_:.2f} hours")

# 9.4 Evaluate best SVR model on test set
best_svr = grid_svr.best_estimator_
test_preds_svr = best_svr.predict(X_test)

test_mae_svr = mean_absolute_error(y_test, test_preds_svr)
test_rmse_svr = np.sqrt(mean_squared_error(y_test, test_preds_svr))

print(f"SVR - Test MAE : {test_mae_svr:.2f} hours")
print(f"SVR - Test RMSE: {test_rmse_svr:.2f} hours")


Best Hyperparameters: {'C': 1, 'epsilon': 1, 'kernel': 'rbf'}
Best CV MAE: 4.14 hours
SVR - Test MAE : 3.36 hours
SVR - Test RMSE: 4.19 hours


#### 10. Model Comparison and Selection

After training and tuning all candidate models, we compile their performance metrics into a unified table for direct comparison. This summary includes:

- **Cross-Validation MAE** (where applicable) to assess each model’s average performance during hyperparameter tuning on the train+validation data.
- **Test MAE and RMSE** to evaluate real-world accuracy and error dispersion on the held-out test set.

By presenting these metrics side by side, we can easily identify which algorithm achieves the lowest prediction error. 


In [10]:
# 10 Summarize all model performances clearly
import numpy as np
import pandas as pd
import joblib

results = pd.DataFrame({
    'Model': [
        'Baseline',
        'Linear Regression',
        'KNN',
        'Decision Tree',
        'Random Forest',
        'Gradient Boosting',
        'SVR'
    ],
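    'CV MAE': [
        np.nan,                   # Baseline (no CV)
        cv_mae_scores.mean(),     # Linear Regression
        -grid_knn.best_score_,    # KNN
        -grid_dt.best_score_,     # Decision Tree
        -grid_rf.best_score_,     # Random Forest
        -grid_gbr.best_score_,    # Gradient Boosting
        -grid_svr.best_score_     # SVR
    ],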
    'Test MAE': [
        test_mae,
        test_mae_lr,
        test_mae_knn,
        test_mae_dt,
        test_mae_rf,
        test_mae_gbr,
        test_mae_svr
    ],
    'Test RMSE': [
        test_rmse,
        test_rmse_lr,
        test_rmse_knn,
        test_rmse_dt,
        test_rmse_rf,
        test_rmse_gbr,
        test_rmse_svr
    ]
})

# Round metrics for readability
results[['CV MAE', 'Test MAE', 'Test RMSE']] = results[['CV MAE', 'Test MAE', 'Test RMSE']].round(2)

print("\nModel Comparison Summary:")
print(results)




Model Comparison Summary:
               Model  CV MAE  Test MAE  Test RMSE
0           Baseline     NaN      3.61       4.42
1  Linear Regression    4.17      3.58       4.50
2                KNN    4.20      3.47       4.26
3      Decision Tree    4.21      3.50       4.39
4      Random Forest    4.11      3.38       4.20
5  Gradient Boosting    4.09      3.37       4.13
6                SVR    4.14      3.36       4.19
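
To put the differences in perspective, the sketch below expresses each model's Test MAE as a percentage improvement over the baseline. The numbers are copied from the summary table above; the variable names are illustrative.

```python
# Test MAE values copied from the comparison table above
test_mae_by_model = {
    "Linear Regression": 3.58,
    "KNN": 3.47,
    "Decision Tree": 3.50,
    "Random Forest": 3.38,
    "Gradient Boosting": 3.37,
    "SVR": 3.36,
}
baseline_mae = 3.61

# relative reduction in MAE compared with the constant-mean baseline
improvement_pct = {
    name: 100 * (baseline_mae - mae) / baseline_mae
    for name, mae in test_mae_by_model.items()
}

for name, pct in sorted(improvement_pct.items(), key=lambda kv: -kv[1]):
    print(f"{name:18s} {pct:5.1f}% lower MAE than baseline")
```

On these figures the best model lowers Test MAE by roughly 7% relative to the mean predictor, and the top three models are within 0.02 hours of each other.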


#### 11. Final Model Selection

With all models evaluated on the same test set, we programmatically:

1. **Identify the Best Model** by locating the lowest Test MAE in our comparison table.  
2. **Print the Selection** to confirm which algorithm achieved the lowest test error.  
3. **Map Model Names to Objects** so we can reference and persist the chosen estimator.  
4. **Retrieve the Best Model Instance** from our dictionary of tuned models for saving or deployment.

This step completes the model development pipeline by finalizing our choice of algorithm based on objective performance metrics.  


In [11]:
# 11 Select the best model based on Test MAE
best_idx = results['Test MAE'].idxmin()
best_model_name = results.loc[best_idx, 'Model']
best_test_mae = results.loc[best_idx, 'Test MAE']

print(f"\n Best model based on Test MAE: {best_model_name} (Test MAE = {best_test_mae:.2f} hours)")

# Map the final (hyperparameter-tuned) fitted models
model_map = {
    'Linear Regression': lr_final,
    'KNN': best_knn,
    'Decision Tree': best_dt,
    'Random Forest': best_rf,
    'Gradient Boosting': best_gbr,
    'SVR': best_svr
}

# Retrieve the best model instance
best_model = model_map[best_model_name]


 Best model based on Test MAE: SVR (Test MAE = 3.36 hours)


#### 12. Persisting the Selected Model

To enable future use—whether for batch inference, real-time serving, or further analysis—we serialize the best-performing model to disk using `joblib`. This approach:

1. **Exports the Model Object** into a `.pkl` file, capturing all learned parameters and hyperparameters.  
2. **Supports Reproducibility**, since the exact same model can be loaded later without retraining, provided a compatible scikit-learn version is installed.  
3. **Facilitates Deployment**, allowing downstream applications or scripts to quickly load the model and generate predictions on new data.

With this final step, the entire modeling pipeline is complete, from data ingestion and preprocessing through model training, evaluation, selection, and persistence.  


In [12]:
# 12 Save the best model to workspace using joblib
joblib.dump(best_model, f"{best_model_name.replace(' ', '_').lower()}_model.pkl")
print(f"\n {best_model_name} model saved successfully!")


 SVR model saved successfully!
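
For completeness, here is a sketch of the reverse step: loading a saved model with `joblib.load` and confirming it predicts identically to the original. It fits a small SVR on synthetic data and writes to a temporary directory, so it does not touch the file saved above; the hyperparameters mirror the grid-search result only for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.svm import SVR

# synthetic data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(30, 2))
y_toy = X_toy[:, 0] - X_toy[:, 1]

model = SVR(C=1, epsilon=1, kernel="rbf").fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "svr_model.pkl")
    joblib.dump(model, path)        # serialize the fitted estimator
    restored = joblib.load(path)    # read it back as a new object

same_predictions = np.allclose(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

New data passed to a loaded model needs the same feature columns, in the same order and with the same preprocessing, as the data used for training.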
