# 03 — Modeling

**Goal:** Train multiple regressors on the preprocessed data and compare them via 5-fold cross-validated R² and held-out test MAE/RMSE/R².

In this third notebook, we use the preprocessed data derived from our body fat dataset to train several machine learning regression models, measure their accuracy on a set of key metrics, and store the results for evaluation and visualisation in the next notebook.

**Checklist**
- Linear models (LinearRegression/Ridge/Lasso/ElasticNet).
- Tree ensembles (RandomForest, GradientBoosting, XGBoost).
- Cross-validation with consistent splits & scoring.
- Log metrics for each model.

In [1]:
# Imports
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
import joblib


## Loading preprocessed data

We load both the scaled-only and the scaled-and-feature-selected preprocessed datasets. This lets us check, for each model, whether it predicts better when trained on one dataset or the other.

In [4]:
# Scaled-only dataset
X_train_scaled = pd.read_csv("/kaggle/input/preprocessed-bodyfat/X_train_all_features.csv")
X_test_scaled = pd.read_csv("/kaggle/input/preprocessed-bodyfat/X_test_all_features.csv")
y_train = pd.read_csv("/kaggle/input/preprocessed-bodyfat/y_train.csv").values.ravel()
y_test = pd.read_csv("/kaggle/input/preprocessed-bodyfat/y_test.csv").values.ravel()

# Feature-selected dataset
X_train_fs = pd.read_csv('/kaggle/input/preprocessed-bodyfat/X_train_top_features.csv')
X_test_fs = pd.read_csv('/kaggle/input/preprocessed-bodyfat/X_test_top_features.csv')

array([19.2, 19.2, 28. , 20.5, 16.7, 12.1, 23.6, 18.6, 11.7, 11.9, 26.1,
       24.5, 14.8, 22.5,  6.3,  5.3, 22. , 20.9, 20.4, 14. , 14.9, 16.5,
       13.9, 13.8, 21.3, 30.4, 23.6, 15. ,  7.1, 13. , 24.9,  9.6, 17.5,
       18.4, 18.7,  3.7, 21.4, 16. , 16.6, 11.5, 13.8, 23.6, 31.2,  9.4,
       13.9, 22.5, 29. , 21.5, 23.3,  9.9, 35.2])

In [5]:
print("Scaled-only shape:", X_train_scaled.shape)
print("Feature-selected shape:", X_train_fs.shape)

Scaled-only shape: (201, 13)
Feature-selected shape: (201, 10)


## Define the model-fitting and evaluation function

This function is the common routine for every model we train. It takes a model instance along with the training and testing data, runs cross-validation, evaluates the test predictions on several metrics, and returns the results as a one-row DataFrame for further evaluation.

In [30]:
def evaluate_model(model, model_name, X_train, X_test, y_train, y_test, dataset_name):
    """Cross-validate and test one model; return its metrics as a one-row DataFrame."""
    # Cross-validation on the training set (5-fold, R² score)
    cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')

    # Fit on the full training set and predict on the held-out test set
    model.fit(X_train, y_train)
    y_preds = model.predict(X_test)

    # Evaluate the test predictions across different metrics
    mae = mean_absolute_error(y_test, y_preds)
    rmse = np.sqrt(mean_squared_error(y_test, y_preds))
    r2 = r2_score(y_test, y_preds)

    # Collect results
    results = {
        "Dataset": dataset_name,
        "Model": model_name,
        "CV_R2_Mean": np.mean(cv_scores),
        "CV_R2_Std": np.std(cv_scores),
        "Test_MAE": mae,
        "Test_RMSE": rmse,
        "Test_R2": r2
    }

    return pd.DataFrame([results])
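The function above cross-validates on R² only. If cross-validated MAE and RMSE are also wanted, one possible approach is scikit-learn's `cross_validate` with several scorers and a single shared splitter. The sketch below is illustrative rather than part of the original pipeline: the synthetic data, the shuffled `KFold` and the names `X_demo`, `y_demo` and `cv_summary` are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in for the preprocessed body fat data (assumption).
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=42
)

# One splitter object, reused for every model, keeps the folds identical.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# scikit-learn scorers follow "higher is better", so error metrics are negated.
scoring = {
    "mae": "neg_mean_absolute_error",
    "rmse": "neg_root_mean_squared_error",
    "r2": "r2",
}

cv_out = cross_validate(LinearRegression(), X_demo, y_demo, cv=cv, scoring=scoring)

# Flip the sign back so the errors read as positive numbers.
cv_summary = {
    "CV_MAE_Mean": -cv_out["test_mae"].mean(),
    "CV_RMSE_Mean": -cv_out["test_rmse"].mean(),
    "CV_R2_Mean": cv_out["test_r2"].mean(),
}
print(cv_summary)
```

Passing the same `cv` object to every model is what makes the per-fold scores comparable across models.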

In [31]:
all_results = pd.DataFrame()

## Training models on the data

To find which ML model best fits our purposes and the data at hand, we go through different models one by one, fit each on the training set, and use cross-validation and the held-out test set to measure and compare their accuracy.

### 1. Decision Trees 

In [32]:
# Decision Tree
all_results = pd.concat([
    all_results,
    evaluate_model(DecisionTreeRegressor(random_state=42), 'DecisionTree', X_train_scaled, X_test_scaled, y_train, y_test, 'Scaled-only'),
    evaluate_model(DecisionTreeRegressor(random_state=42), 'DecisionTree', X_train_fs, X_test_fs, y_train, y_test, 'Feature-Selected')
])

In [33]:
all_results = all_results.reset_index(drop=True)
display(all_results)

Unnamed: 0,Dataset,Model,CV_R2_Mean,CV_R2_Std,Test_MAE,Test_RMSE,Test_R2
0,Scaled-only,DecisionTree,0.425027,0.109527,4.270588,5.545463,0.33892
1,Feature-Selected,DecisionTree,0.29768,0.115825,4.360784,5.54981,0.337883


We have now generated a pandas DataFrame that stores all the metrics of our Decision Tree model on both the scaled-only and the scaled-and-feature-selected datasets. We will append to the same DataFrame as we train other models and store their results.

At a glance, the model did not perform well. Its test MAE is about 4.3, meaning its predictions are off by roughly 4 percentage points of body fat on average, which leaves considerable room for error in this context. Between the two datasets, the test metrics are nearly identical, although the cross-validated R² is noticeably lower on the feature-selected data (0.30 vs 0.43). The similar test scores are expected: tree models tend to be robust to weakly correlated features, since such features are rarely chosen for splits, so removing them changes the predictions very little.
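To make the error metrics concrete, here is a small worked example of how MAE and RMSE are computed. The four values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Made-up body fat values in percent (illustration only, not from the dataset).
y_true = np.array([12.0, 20.0, 25.0, 18.0])
y_pred = np.array([15.0, 19.0, 31.0, 18.0])

errors = y_pred - y_true               # [3, -1, 6, 0]

# MAE: average size of the error, in the same units as the target.
mae = np.mean(np.abs(errors))          # (3 + 1 + 6 + 0) / 4 = 2.5

# RMSE: squares the errors first, so the single 6-point miss weighs more.
rmse = np.sqrt(np.mean(errors ** 2))   # sqrt((9 + 1 + 36 + 0) / 4) = sqrt(11.5)

print(f"MAE  = {mae:.3f} percentage points")
print(f"RMSE = {rmse:.3f} percentage points")
```

RMSE is never smaller than MAE, and the gap between them widens when a few predictions are far off.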

### 2. Linear Regression

In [34]:
all_results = pd.concat([
    all_results,
    evaluate_model(LinearRegression(), 'LinearRegression', X_train_scaled, X_test_scaled, y_train, y_test, 'Scaled-only'),
    evaluate_model(LinearRegression(), 'LinearRegression', X_train_fs, X_test_fs, y_train, y_test, 'Feature-Selected')
])

In [35]:
display(all_results)

Unnamed: 0,Dataset,Model,CV_R2_Mean,CV_R2_Std,Test_MAE,Test_RMSE,Test_R2
0,Scaled-only,DecisionTree,0.425027,0.109527,4.270588,5.545463,0.33892
1,Feature-Selected,DecisionTree,0.29768,0.115825,4.360784,5.54981,0.337883
0,Scaled-only,LinearRegression,0.686319,0.031946,3.329254,4.240279,0.613484
0,Feature-Selected,LinearRegression,0.695245,0.035255,3.248772,4.0295,0.650956


Straight away, we can see a substantial improvement in our predictions when we switch to a Linear Regression model. The test R² has nearly doubled, from about 0.34 to 0.61–0.65, and the MAE and RMSE have dropped by roughly 1 and 1.3–1.5 percentage points of body fat respectively (which, although it may not sound like much, is still quite an improvement).

As for the results between the two datasets, the difference is a bit more noticeable this time, with the feature-selected data performing slightly better.

### 3. Random Forest Regressor

In [36]:
all_results = pd.concat([
    all_results,
    evaluate_model(RandomForestRegressor(n_estimators=100, random_state=42), 'RandomForest', X_train_scaled, X_test_scaled, y_train, y_test, 'Scaled-only'),
    evaluate_model(RandomForestRegressor(n_estimators=100, random_state=42), 'RandomForest', X_train_fs, X_test_fs, y_train, y_test, 'Feature-Selected')
])

In [37]:
display(all_results)

Unnamed: 0,Dataset,Model,CV_R2_Mean,CV_R2_Std,Test_MAE,Test_RMSE,Test_R2
0,Scaled-only,DecisionTree,0.425027,0.109527,4.270588,5.545463,0.33892
1,Feature-Selected,DecisionTree,0.29768,0.115825,4.360784,5.54981,0.337883
0,Scaled-only,LinearRegression,0.686319,0.031946,3.329254,4.240279,0.613484
0,Feature-Selected,LinearRegression,0.695245,0.035255,3.248772,4.0295,0.650956
0,Scaled-only,RandomForest,0.666268,0.054562,3.401824,4.09701,0.639162
0,Feature-Selected,RandomForest,0.655445,0.047001,3.328765,3.984024,0.65879


Now, comparing these models, the Random Forest and the Linear Regression models seem to be going neck-and-neck, as all their metric scores are quite close to each other, with some of them slightly edged by one or the other.

Regarding the two datasets, the Random Forest trained on the feature-selected data again performed slightly better on the test metrics than the one trained on the scaled-only data, although its cross-validated R² is marginally lower.

### 4. Gradient Boosting

In [38]:
all_results = pd.concat([
    all_results,
    evaluate_model(GradientBoostingRegressor(n_estimators=100, random_state=42), 'GradientBoosting', X_train_scaled, X_test_scaled, y_train, y_test, 'Scaled-only'),
    evaluate_model(GradientBoostingRegressor(n_estimators=100, random_state=42), 'GradientBoosting', X_train_fs, X_test_fs, y_train, y_test, 'Feature-Selected')
])

In [39]:
display(all_results)

Unnamed: 0,Dataset,Model,CV_R2_Mean,CV_R2_Std,Test_MAE,Test_RMSE,Test_R2
0,Scaled-only,DecisionTree,0.425027,0.109527,4.270588,5.545463,0.33892
1,Feature-Selected,DecisionTree,0.29768,0.115825,4.360784,5.54981,0.337883
0,Scaled-only,LinearRegression,0.686319,0.031946,3.329254,4.240279,0.613484
0,Feature-Selected,LinearRegression,0.695245,0.035255,3.248772,4.0295,0.650956
0,Scaled-only,RandomForest,0.666268,0.054562,3.401824,4.09701,0.639162
0,Feature-Selected,RandomForest,0.655445,0.047001,3.328765,3.984024,0.65879
0,Scaled-only,GradientBoosting,0.625801,0.083593,3.308267,4.061587,0.645375
0,Feature-Selected,GradientBoosting,0.600088,0.077933,3.501641,4.396792,0.584424


Though almost at the same level as the previous two models, Gradient Boosting is an interesting case. It performed noticeably better on the scaled-only data than on the feature-selected data. On the scaled-only data, it beat the other models trained on that dataset across all three test metrics, but it fell a bit short of the Linear Regression and Random Forest models trained on the feature-selected data.
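One way to see the dataset effect per model at a glance is to pivot the long results table so that each dataset becomes a column. The sketch below hard-codes the `Test_R2` values reported so far for the first four models. The names `r2_long`, `r2_wide` and `FS_minus_Scaled` are illustrative.

```python
import pandas as pd

# Test_R2 values copied from the results table above (first four models).
r2_long = pd.DataFrame(
    [
        ("Scaled-only", "DecisionTree", 0.338920),
        ("Feature-Selected", "DecisionTree", 0.337883),
        ("Scaled-only", "LinearRegression", 0.613484),
        ("Feature-Selected", "LinearRegression", 0.650956),
        ("Scaled-only", "RandomForest", 0.639162),
        ("Feature-Selected", "RandomForest", 0.658790),
        ("Scaled-only", "GradientBoosting", 0.645375),
        ("Feature-Selected", "GradientBoosting", 0.584424),
    ],
    columns=["Dataset", "Model", "Test_R2"],
)

# One row per model, one column per dataset.
r2_wide = r2_long.pivot(index="Model", columns="Dataset", values="Test_R2")

# Positive values mean feature selection helped; negative values mean it hurt.
r2_wide["FS_minus_Scaled"] = r2_wide["Feature-Selected"] - r2_wide["Scaled-only"]

print(r2_wide.sort_values("FS_minus_Scaled"))
```

In the notebook itself, the same pivot could be applied directly to `all_results` instead of a hand-copied table.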

### 5. XGBoost

In [41]:
all_results = pd.concat([
    all_results,
    evaluate_model(XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42, verbosity=0), 'XGBoost', X_train_scaled, X_test_scaled, y_train, y_test, 'Scaled-only'),
    evaluate_model(XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42, verbosity=0), 'XGBoost', X_train_fs, X_test_fs, y_train, y_test, 'Feature-Selected')
])

In [42]:
display(all_results)

Unnamed: 0,Dataset,Model,CV_R2_Mean,CV_R2_Std,Test_MAE,Test_RMSE,Test_R2
0,Scaled-only,DecisionTree,0.425027,0.109527,4.270588,5.545463,0.33892
1,Feature-Selected,DecisionTree,0.29768,0.115825,4.360784,5.54981,0.337883
0,Scaled-only,LinearRegression,0.686319,0.031946,3.329254,4.240279,0.613484
0,Feature-Selected,LinearRegression,0.695245,0.035255,3.248772,4.0295,0.650956
0,Scaled-only,RandomForest,0.666268,0.054562,3.401824,4.09701,0.639162
0,Feature-Selected,RandomForest,0.655445,0.047001,3.328765,3.984024,0.65879
0,Scaled-only,GradientBoosting,0.625801,0.083593,3.308267,4.061587,0.645375
0,Feature-Selected,GradientBoosting,0.600088,0.077933,3.501641,4.396792,0.584424
0,Scaled-only,XGBoost,0.630576,0.074766,3.329362,4.112081,0.636502
0,Feature-Selected,XGBoost,0.587623,0.085667,3.304916,4.203455,0.620168


The XGBoost model performed quite similarly to the other models, and we can count it among the better-performing models trained thus far, although it is not the top pick on any of the metrics.
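Which model leads on each metric can be checked programmatically rather than by eye. The sketch below copies the test metrics for the ten rows shown so far and looks up the best row per metric. The names `demo`, `lower_is_better` and `winners` are illustrative.

```python
import pandas as pd

# Test metrics copied from the results table above (ten rows so far).
demo = pd.DataFrame(
    [
        ("Scaled-only", "DecisionTree", 4.270588, 5.545463, 0.338920),
        ("Feature-Selected", "DecisionTree", 4.360784, 5.549810, 0.337883),
        ("Scaled-only", "LinearRegression", 3.329254, 4.240279, 0.613484),
        ("Feature-Selected", "LinearRegression", 3.248772, 4.029500, 0.650956),
        ("Scaled-only", "RandomForest", 3.401824, 4.097010, 0.639162),
        ("Feature-Selected", "RandomForest", 3.328765, 3.984024, 0.658790),
        ("Scaled-only", "GradientBoosting", 3.308267, 4.061587, 0.645375),
        ("Feature-Selected", "GradientBoosting", 3.501641, 4.396792, 0.584424),
        ("Scaled-only", "XGBoost", 3.329362, 4.112081, 0.636502),
        ("Feature-Selected", "XGBoost", 3.304916, 4.203455, 0.620168),
    ],
    columns=["Dataset", "Model", "Test_MAE", "Test_RMSE", "Test_R2"],
)

# Errors should be minimised, R² maximised.
lower_is_better = {"Test_MAE": True, "Test_RMSE": True, "Test_R2": False}

winners = {}
for metric, lower in lower_is_better.items():
    idx = demo[metric].idxmin() if lower else demo[metric].idxmax()
    winners[metric] = (demo.loc[idx, "Model"], demo.loc[idx, "Dataset"])

for metric, (model_name, dataset_name) in winners.items():
    print(f"{metric}: {model_name} ({dataset_name})")
```

On these ten rows, Linear Regression leads on MAE and Random Forest leads on RMSE and R², both on the feature-selected data.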

### 6. Ridge

In [43]:
all_results = pd.concat([
    all_results,
    evaluate_model(Ridge(alpha=1.0), 'Ridge', X_train_scaled, X_test_scaled, y_train, y_test, 'Scaled-only'),
    evaluate_model(Ridge(alpha=1.0), 'Ridge', X_train_fs, X_test_fs, y_train, y_test, 'Feature-Selected')
])

In [44]:
display(all_results)

Unnamed: 0,Dataset,Model,CV_R2_Mean,CV_R2_Std,Test_MAE,Test_RMSE,Test_R2
0,Scaled-only,DecisionTree,0.425027,0.109527,4.270588,5.545463,0.33892
1,Feature-Selected,DecisionTree,0.29768,0.115825,4.360784,5.54981,0.337883
0,Scaled-only,LinearRegression,0.686319,0.031946,3.329254,4.240279,0.613484
0,Feature-Selected,LinearRegression,0.695245,0.035255,3.248772,4.0295,0.650956
0,Scaled-only,RandomForest,0.666268,0.054562,3.401824,4.09701,0.639162
0,Feature-Selected,RandomForest,0.655445,0.047001,3.328765,3.984024,0.65879
0,Scaled-only,GradientBoosting,0.625801,0.083593,3.308267,4.061587,0.645375
0,Feature-Selected,GradientBoosting,0.600088,0.077933,3.501641,4.396792,0.584424
0,Scaled-only,XGBoost,0.630576,0.074766,3.329362,4.112081,0.636502
0,Feature-Selected,XGBoost,0.587623,0.085667,3.304916,4.203455,0.620168


Relative to our other models, Ridge looks like a good model at a glance. On the feature-selected data, its MAE comes close to that of the Linear Regression model (the best in this category among our collected results), and its R² comes close to those of the leading models, Linear Regression and Random Forest.

### 7. ElasticNet

In [45]:
all_results = pd.concat([
    all_results,
    evaluate_model(ElasticNet(alpha=1.0), 'ElasticNet', X_train_scaled, X_test_scaled, y_train, y_test, 'Scaled-only'),
    evaluate_model(ElasticNet(alpha=1.0), 'ElasticNet', X_train_fs, X_test_fs, y_train, y_test, 'Feature-Selected')
])

In [46]:
display(all_results)

Unnamed: 0,Dataset,Model,CV_R2_Mean,CV_R2_Std,Test_MAE,Test_RMSE,Test_R2
0,Scaled-only,DecisionTree,0.425027,0.109527,4.270588,5.545463,0.33892
1,Feature-Selected,DecisionTree,0.29768,0.115825,4.360784,5.54981,0.337883
0,Scaled-only,LinearRegression,0.686319,0.031946,3.329254,4.240279,0.613484
0,Feature-Selected,LinearRegression,0.695245,0.035255,3.248772,4.0295,0.650956
0,Scaled-only,RandomForest,0.666268,0.054562,3.401824,4.09701,0.639162
0,Feature-Selected,RandomForest,0.655445,0.047001,3.328765,3.984024,0.65879
0,Scaled-only,GradientBoosting,0.625801,0.083593,3.308267,4.061587,0.645375
0,Feature-Selected,GradientBoosting,0.600088,0.077933,3.501641,4.396792,0.584424
0,Scaled-only,XGBoost,0.630576,0.074766,3.329362,4.112081,0.636502
0,Feature-Selected,XGBoost,0.587623,0.085667,3.304916,4.203455,0.620168


On the other hand, ElasticNet is one of the least accurate models we have trained (though still better than the Decision Tree). Its MAE and RMSE are around 4 percentage points of body fat, which is among the highest error values in our dataframe.

### 8. Support Vector Regression

In [47]:
all_results = pd.concat([
    all_results,
    evaluate_model(SVR(), 'SVR', X_train_scaled, X_test_scaled, y_train, y_test, 'Scaled-only'),
    evaluate_model(SVR(), 'SVR', X_train_fs, X_test_fs, y_train, y_test, 'Feature-Selected')
])

In [48]:
display(all_results)

Unnamed: 0,Dataset,Model,CV_R2_Mean,CV_R2_Std,Test_MAE,Test_RMSE,Test_R2
0,Scaled-only,DecisionTree,0.425027,0.109527,4.270588,5.545463,0.33892
1,Feature-Selected,DecisionTree,0.29768,0.115825,4.360784,5.54981,0.337883
0,Scaled-only,LinearRegression,0.686319,0.031946,3.329254,4.240279,0.613484
0,Feature-Selected,LinearRegression,0.695245,0.035255,3.248772,4.0295,0.650956
0,Scaled-only,RandomForest,0.666268,0.054562,3.401824,4.09701,0.639162
0,Feature-Selected,RandomForest,0.655445,0.047001,3.328765,3.984024,0.65879
0,Scaled-only,GradientBoosting,0.625801,0.083593,3.308267,4.061587,0.645375
0,Feature-Selected,GradientBoosting,0.600088,0.077933,3.501641,4.396792,0.584424
0,Scaled-only,XGBoost,0.630576,0.074766,3.329362,4.112081,0.636502
0,Feature-Selected,XGBoost,0.587623,0.085667,3.304916,4.203455,0.620168


The Support Vector Regression model performs about the same as the ElasticNet model.

### 9. K-Nearest Neighbors Regressor

In [49]:
all_results = pd.concat([
    all_results,
    evaluate_model(KNeighborsRegressor(n_neighbors=5), 'KNRegressor', X_train_scaled, X_test_scaled, y_train, y_test, 'Scaled-only'),
    evaluate_model(KNeighborsRegressor(n_neighbors=5), 'KNRegressor', X_train_fs, X_test_fs, y_train, y_test, 'Feature-Selected')
])

In [50]:
display(all_results)

Unnamed: 0,Dataset,Model,CV_R2_Mean,CV_R2_Std,Test_MAE,Test_RMSE,Test_R2
0,Scaled-only,DecisionTree,0.425027,0.109527,4.270588,5.545463,0.33892
1,Feature-Selected,DecisionTree,0.29768,0.115825,4.360784,5.54981,0.337883
0,Scaled-only,LinearRegression,0.686319,0.031946,3.329254,4.240279,0.613484
0,Feature-Selected,LinearRegression,0.695245,0.035255,3.248772,4.0295,0.650956
0,Scaled-only,RandomForest,0.666268,0.054562,3.401824,4.09701,0.639162
0,Feature-Selected,RandomForest,0.655445,0.047001,3.328765,3.984024,0.65879
0,Scaled-only,GradientBoosting,0.625801,0.083593,3.308267,4.061587,0.645375
0,Feature-Selected,GradientBoosting,0.600088,0.077933,3.501641,4.396792,0.584424
0,Scaled-only,XGBoost,0.630576,0.074766,3.329362,4.112081,0.636502
0,Feature-Selected,XGBoost,0.587623,0.085667,3.304916,4.203455,0.620168


The KNeighborsRegressor model tells a similar story to the previous two: it is one of the better performers among the final three models we trained, yet it falls short of the models trained earlier.

In [51]:
# Reset index for cleanliness
all_results = all_results.reset_index(drop=True)
display(all_results)

Unnamed: 0,Dataset,Model,CV_R2_Mean,CV_R2_Std,Test_MAE,Test_RMSE,Test_R2
0,Scaled-only,DecisionTree,0.425027,0.109527,4.270588,5.545463,0.33892
1,Feature-Selected,DecisionTree,0.29768,0.115825,4.360784,5.54981,0.337883
2,Scaled-only,LinearRegression,0.686319,0.031946,3.329254,4.240279,0.613484
3,Feature-Selected,LinearRegression,0.695245,0.035255,3.248772,4.0295,0.650956
4,Scaled-only,RandomForest,0.666268,0.054562,3.401824,4.09701,0.639162
5,Feature-Selected,RandomForest,0.655445,0.047001,3.328765,3.984024,0.65879
6,Scaled-only,GradientBoosting,0.625801,0.083593,3.308267,4.061587,0.645375
7,Feature-Selected,GradientBoosting,0.600088,0.077933,3.501641,4.396792,0.584424
8,Scaled-only,XGBoost,0.630576,0.074766,3.329362,4.112081,0.636502
9,Feature-Selected,XGBoost,0.587623,0.085667,3.304916,4.203455,0.620168


In [53]:
# Save metrics
all_results.to_csv("/kaggle/working/model_comparison.csv", index=False)
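The imports include `joblib`, while the cells above persist only the metrics table. If the fitted models themselves are needed in the next notebook, one option is `joblib.dump` and `joblib.load`. This is a minimal sketch under stated assumptions: the synthetic data, the temporary directory and the file name `ridge_demo.joblib` are all hypothetical, and on Kaggle the output directory would more likely be the working directory used for the CSV.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic stand-in data (assumption), just to obtain a fitted model.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = Ridge(alpha=1.0).fit(X_demo, y_demo)

# Hypothetical output location; a temporary directory keeps the sketch portable.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "ridge_demo.joblib")

# Serialise the fitted estimator to disk, then load it back.
joblib.dump(model, model_path)
restored = joblib.load(model_path)

# The restored model should reproduce the original predictions.
same_predictions = np.allclose(model.predict(X_demo), restored.predict(X_demo))
print("Predictions match after reload:", same_predictions)
```

Models saved this way are generally best loaded with the same scikit-learn version that produced them.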