# Feature Engineering and Modelling

### <u>The Physics & Economics of Price: Why Temperature is Non-Linear</u>

`temp_c` is arguably the most important external driver of `price_actual`. However, our correlation matrix showed a low linear correlation ($\approx 0.11$). A low Pearson coefficient only rules out a strong *linear* trend; it is consistent with a strong **non-linear** relationship, and a linear model fitted on the raw feature would likely under-fit.
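
To illustrate why a low Pearson coefficient does not rule out a strong dependence, here is a minimal sketch on synthetic data. The temperature range, the 15 °C minimum, the curvature, and the noise level are all assumptions chosen for illustration, not values taken from our dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic temperatures (deg C) and a synthetic U-shaped "price" with its minimum at 15 deg C.
temp = rng.uniform(-5, 35, size=5_000)
price = 0.05 * (temp - 15.0) ** 2 + rng.normal(0.0, 2.0, size=temp.size)

# Pearson correlation with the raw temperature vs. with the squared distance from 15 deg C.
r_linear = np.corrcoef(temp, price)[0, 1]
r_squared_term = np.corrcoef((temp - 15.0) ** 2, price)[0, 1]

print(f"corr(temp, price)          = {r_linear:.2f}")
print(f"corr((temp - 15)^2, price) = {r_squared_term:.2f}")
```

In this toy setting the raw feature looks almost uncorrelated with the target, while the squared term tracks it closely, which is the pattern a polynomial feature is meant to expose.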

**<u>The Core Axiom: The Causal Chain</u>**

To understand why the data looks this way, we must look at the fundamental chain of causality in energy markets:

> **Weather influences Human Behavior $\rightarrow$ Human Behavior drives Demand $\rightarrow$ Demand drives Price.**

The grid does not care about temperature; it cares about how **humans react** to temperature. This reaction is not uniform: it depends on the season.

**<u>Analysis of the "U-Shaped" Relationship</u>**

We can categorize this human behavior into three distinct thermal regimes:

1.  **The Summer Regime (Cooling Load):**
    During hot weather ($> 25^\circ C$), human behavior shifts towards seeking comfort. People turn on Air Conditioning. This creates a massive surge in **Demand**, which instantly drives **Price** up.
    *   *Mathematical Behavior:* Positive Linear Relationship ($Slope > 0$).

2.  **The Winter Regime (Heating Load):**
    During freezing weather ($< 5^\circ C$), human behavior shifts towards survival. People maximize electric heating systems. This spike in **Demand** forces expensive peaker plants online, driving **Price** up.
    *   *Mathematical Behavior:* Negative Linear Relationship ($Slope < 0$): price rises as temperature falls.

3.  **The "Comfort Zone" (Shoulder Months):**
    During mild weather ($15^\circ C - 20^\circ C$), human behavior becomes passive. Windows are opened, and neither AC nor Heating is required. **Demand** collapses to its minimum baseload, resulting in the lowest **Prices** of the year.
    *   *Mathematical Behavior:* The Global Minimum of the curve.

**<u>The Solution: Polynomial Transformation</u>**

Since the relationship goes "Down" (Winter), "Flattens" (Spring), and then "Up" (Summer), it is approximately a **Parabola** that opens upward ($y = ax^2 + bx + c$ with $a > 0$), with its minimum in the comfort zone.
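
As a rough sketch of what fitting such a curve looks like, the snippet below fits a quadratic to synthetic data and recovers the temperature at which the fitted curve is lowest. The data-generating values (minimum near 17 °C, curvature 0.04, base price 20) are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: prices lowest near 17 deg C and rising on both sides (assumed shape).
temp = rng.uniform(-10, 35, size=2_000)
price = 20.0 + 0.04 * (temp - 17.0) ** 2 + rng.normal(0.0, 1.5, size=temp.size)

# Least-squares fit of price = a*temp^2 + b*temp + c (np.polyfit returns the highest degree first).
a, b, c = np.polyfit(temp, price, deg=2)

# For a > 0 the parabola opens upward and its minimum sits at temp = -b / (2a).
vertex_temp = -b / (2.0 * a)
print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}, minimum near {vertex_temp:.1f} deg C")
```

A positive `a` corresponds to the "U" opening upward, and `-b / (2a)` gives an estimate of the comfort-zone temperature under this simplified model.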

To capture this, we cannot simply use `temp_c`. We must apply a **Polynomial Transformation**. This achieves two goals:
1.  **Curvature:** By introducing `temp_squared`, the model can fit the "U-Shape" described above.
2.  **Interaction:** By using `PolynomialFeatures`, we also capture complex weather interactions (e.g., `temp_c * solar_radiation`), allowing the model to understand scenarios like *"It is hot (High Demand), but the Sun is shining (High Supply)."*

In [18]:
# Imports and Setup
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import joblib

# Preprocessing and Pipelines
from sklearn.preprocessing import RobustScaler, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Models 
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor, GradientBoostingRegressor

# Validation and Metrics
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, TimeSeriesSplit

In [19]:
df = pd.read_csv("../data/processed/trained_data.csv")
display(df.head())

Unnamed: 0,datetime_beginning_ept,price_actual,hour_of_day,day_of_week,month,price_1h_ago,price_24h_ago,avg_price_last_24h,temp_c,wind_kph,solar_radiation
0,2024-01-02 00:00:00,21.3249,0.0,2.0,1.0,23.3484,31.3827,29.3789,2.0,10.8,0.0
1,2024-01-02 01:00:00,19.6885,1.0,2.0,1.0,21.3249,20.0838,28.9598,1.2,8.0,0.0
2,2024-01-02 02:00:00,20.0916,2.0,2.0,1.0,19.6885,17.6052,28.9433,0.4,6.4,0.0
3,2024-01-02 03:00:00,18.6212,3.0,2.0,1.0,20.0916,19.7673,29.0469,0.5,10.7,0.0
4,2024-01-02 04:00:00,18.6391,4.0,2.0,1.0,18.6212,17.0687,28.9992,-0.3,12.0,0.0


In [20]:
# Converting the datetime column to datetime type
df["datetime_beginning_ept"] = pd.to_datetime(df["datetime_beginning_ept"])

# Setting back the datetime index
df = df.set_index("datetime_beginning_ept")
df.sort_index(inplace=True)
display(df.head())

datetime_beginning_ept,price_actual,hour_of_day,day_of_week,month,price_1h_ago,price_24h_ago,avg_price_last_24h,temp_c,wind_kph,solar_radiation
2024-01-02 00:00:00,21.3249,0.0,2.0,1.0,23.3484,31.3827,29.3789,2.0,10.8,0.0
2024-01-02 01:00:00,19.6885,1.0,2.0,1.0,21.3249,20.0838,28.9598,1.2,8.0,0.0
2024-01-02 02:00:00,20.0916,2.0,2.0,1.0,19.6885,17.6052,28.9433,0.4,6.4,0.0
2024-01-02 03:00:00,18.6212,3.0,2.0,1.0,20.0916,19.7673,29.0469,0.5,10.7,0.0
2024-01-02 04:00:00,18.6391,4.0,2.0,1.0,18.6212,17.0687,28.9992,-0.3,12.0,0.0


In [21]:
# Verifying the index
df.index

DatetimeIndex(['2024-01-02 00:00:00', '2024-01-02 01:00:00',
               '2024-01-02 02:00:00', '2024-01-02 03:00:00',
               '2024-01-02 04:00:00', '2024-01-02 05:00:00',
               '2024-01-02 06:00:00', '2024-01-02 07:00:00',
               '2024-01-02 08:00:00', '2024-01-02 09:00:00',
               ...
               '2024-12-30 15:00:00', '2024-12-30 16:00:00',
               '2024-12-30 17:00:00', '2024-12-30 18:00:00',
               '2024-12-30 19:00:00', '2024-12-30 20:00:00',
               '2024-12-30 21:00:00', '2024-12-30 22:00:00',
               '2024-12-30 23:00:00', '2024-12-31 00:00:00'],
              dtype='datetime64[ns]', name='datetime_beginning_ept', length=8736, freq=None)
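
A gap-free hourly range from `2024-01-02 00:00:00` to `2024-12-31 00:00:00` would hold 8737 timestamps, one more than the `length=8736` reported above, so a completeness check may be worthwhile. The sketch below shows one possible way to do it; `find_missing_hours` is a hypothetical helper name, and the toy index stands in for `df.index`.

```python
import pandas as pd

def find_missing_hours(index: pd.DatetimeIndex) -> pd.DatetimeIndex:
    """Return the hourly timestamps between index.min() and index.max() that are absent."""
    full_range = pd.date_range(index.min(), index.max(), freq=pd.Timedelta(hours=1))
    return full_range.difference(index)

# Toy index with 03:00 deliberately left out.
demo_index = pd.DatetimeIndex([
    "2024-01-02 00:00", "2024-01-02 01:00", "2024-01-02 02:00", "2024-01-02 04:00",
])

missing = find_missing_hours(demo_index)
n_duplicates = int(demo_index.duplicated().sum())

print("Missing timestamps:", list(missing))
print("Duplicated timestamps:", n_duplicates)
```

Calling `find_missing_hours(df.index)` on the real frame would show whether any hour is absent, which matters because lag features such as `price_24h_ago` assume evenly spaced rows.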

In [22]:
# Summary Function 
def summarize_df(df, df_name="df"):
    """
    Display key information about a DataFrame:
    - info()
    - describe()
    - duplicated rows
    - count of missing values
    """
    print(f"===== DataFrame ({df_name.upper()}) Summary =====")
    print("===== DataFrame Index =====")
    display(df.index)
    print("===== DataFrame Info =====")
    df.info()
    print("\n===== DataFrame Description =====")
    display(df.describe(include='all'))  # include='all' to describe non-numeric columns too
    print("\n===== Duplicate Rows =====")
    duplicates = df[df.duplicated(keep=False)]
    if not duplicates.empty:
        display(duplicates)
    else:
        print("No duplicate rows found.")
    print("\n===== Missing Values per Column =====")
    print(df.isna().sum())


summarize_df(df, "train_df")

===== DataFrame (TRAIN_DF) Summary =====
===== DataFrame Index =====


DatetimeIndex(['2024-01-02 00:00:00', '2024-01-02 01:00:00',
               '2024-01-02 02:00:00', '2024-01-02 03:00:00',
               '2024-01-02 04:00:00', '2024-01-02 05:00:00',
               '2024-01-02 06:00:00', '2024-01-02 07:00:00',
               '2024-01-02 08:00:00', '2024-01-02 09:00:00',
               ...
               '2024-12-30 15:00:00', '2024-12-30 16:00:00',
               '2024-12-30 17:00:00', '2024-12-30 18:00:00',
               '2024-12-30 19:00:00', '2024-12-30 20:00:00',
               '2024-12-30 21:00:00', '2024-12-30 22:00:00',
               '2024-12-30 23:00:00', '2024-12-31 00:00:00'],
              dtype='datetime64[ns]', name='datetime_beginning_ept', length=8736, freq=None)

===== DataFrame Info =====
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8736 entries, 2024-01-02 00:00:00 to 2024-12-31 00:00:00
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   price_actual        8736 non-null   float64
 1   hour_of_day         8736 non-null   float64
 2   day_of_week         8736 non-null   float64
 3   month               8736 non-null   float64
 4   price_1h_ago        8736 non-null   float64
 5   price_24h_ago       8736 non-null   float64
 6   avg_price_last_24h  8736 non-null   float64
 7   temp_c              8736 non-null   float64
 8   wind_kph            8736 non-null   float64
 9   solar_radiation     8736 non-null   float64
dtypes: float64(10)
memory usage: 750.8 KB

===== DataFrame Description =====


Unnamed: 0,price_actual,hour_of_day,day_of_week,month,price_1h_ago,price_24h_ago,avg_price_last_24h,temp_c,wind_kph,solar_radiation
count,8736.0,8736.0,8736.0,8736.0,8736.0,8736.0,8736.0,8736.0,8736.0,8736.0
mean,33.27859,11.499771,3.000229,6.514766,33.280508,33.313805,33.300629,13.217903,9.87065,173.569254
std,28.297715,6.92293,1.999886,3.437097,28.297486,28.294552,14.345797,9.908276,5.366993,250.16511
min,-34.7736,0.0,0.0,1.0,-34.7736,-34.7736,11.0984,-11.4,0.2,0.0
25%,19.222125,5.75,1.0,4.0,19.222125,19.270475,24.2659,5.2,6.1,0.0
50%,26.26685,11.5,3.0,7.0,26.26685,26.3145,30.3667,13.65,8.7,8.0
75%,37.270625,17.25,5.0,10.0,37.2837,37.3027,37.931675,21.0,12.6,307.0
max,492.5833,23.0,6.0,12.0,492.5833,492.5833,117.9057,37.0,36.8,980.0



===== Duplicate Rows =====
No duplicate rows found.

===== Missing Values per Column =====
price_actual          0
hour_of_day           0
day_of_week           0
month                 0
price_1h_ago          0
price_24h_ago         0
avg_price_last_24h    0
temp_c                0
wind_kph              0
solar_radiation       0
dtype: int64


## Training and Evaluating Multiple Regression Models
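
The training cell in this section holds out the final 20% of the rows with `train_test_split(..., shuffle=False)`. As a small sketch of what that does, the example below uses ten synthetic hourly rows (made-up values) and checks that every training timestamp precedes every test timestamp.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten synthetic hourly rows in chronological order.
idx = pd.date_range("2024-01-02", periods=10, freq=pd.Timedelta(hours=1))
X_demo = pd.DataFrame({"temp_c": np.linspace(0.0, 9.0, 10)}, index=idx)
y_demo = pd.Series(np.linspace(20.0, 29.0, 10), index=idx)

# shuffle=False keeps the order: the last 20% of rows become the test set.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, shuffle=False)

print("last training timestamp:", X_tr.index.max())
print("first test timestamp:   ", X_te.index.min())
```

Keeping the split chronological avoids training on observations that come after the ones being predicted, which would otherwise inflate the scores of models that rely on lagged prices.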

In [23]:
# Since we are using ColumnTransformer, we need to have the columns' names available
X = df.drop(columns=["price_actual"])
y = df["price_actual"]
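
The cross-validation object used later in this section is `TimeSeriesSplit(n_splits=30, test_size=24)`. The sketch below shows how such a splitter carves up a series, using a smaller assumed setup (240 synthetic hourly rows and 3 folds) so the fold boundaries are easy to read.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 240 synthetic hourly observations (10 days), standing in for the training set.
X_demo = np.arange(240).reshape(-1, 1)

# Three folds, each validating on the next 24 hours (the notebook uses 30 folds).
tscv_demo = TimeSeriesSplit(n_splits=3, test_size=24)

folds = []
for train_idx, test_idx in tscv_demo.split(X_demo):
    folds.append((train_idx, test_idx))
    print(f"train 0..{train_idx[-1]}  ->  validate {test_idx[0]}..{test_idx[-1]}")
```

Each fold trains on everything before its 24-hour validation block, and the validation blocks tile the end of the series, so `n_splits * test_size` rows at the tail are the only ones ever scored.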

In [24]:
display(df.head())

datetime_beginning_ept,price_actual,hour_of_day,day_of_week,month,price_1h_ago,price_24h_ago,avg_price_last_24h,temp_c,wind_kph,solar_radiation
2024-01-02 00:00:00,21.3249,0.0,2.0,1.0,23.3484,31.3827,29.3789,2.0,10.8,0.0
2024-01-02 01:00:00,19.6885,1.0,2.0,1.0,21.3249,20.0838,28.9598,1.2,8.0,0.0
2024-01-02 02:00:00,20.0916,2.0,2.0,1.0,19.6885,17.6052,28.9433,0.4,6.4,0.0
2024-01-02 03:00:00,18.6212,3.0,2.0,1.0,20.0916,19.7673,29.0469,0.5,10.7,0.0
2024-01-02 04:00:00,18.6391,4.0,2.0,1.0,18.6212,17.0687,28.9992,-0.3,12.0,0.0
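
The next cell tunes each pipeline with `GridSearchCV`, addressing hyperparameters as `<step name>__<parameter>` and scoring with `neg_root_mean_squared_error`. The following is a reduced sketch of that pattern on synthetic data; the data, the alpha grid, and the 4-fold splitter are assumptions made to keep the example small.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

pipe = Pipeline(steps=[("scaler", RobustScaler()), ("ridge", Ridge())])

# Keys follow "<step name>__<parameter>", so "ridge__alpha" targets Ridge(alpha=...).
grid = {"ridge__alpha": [0.01, 1.0, 100.0]}

search = GridSearchCV(
    pipe,
    param_grid=grid,
    cv=TimeSeriesSplit(n_splits=4, test_size=24),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_demo, y_demo)

# best_score_ is a negated RMSE (higher is better), so flip the sign to report RMSE.
best_rmse = -search.best_score_
print(search.best_params_, round(best_rmse, 4))
```

Because scikit-learn maximises scores, the `mean_test_score` columns in the result tables are negative RMSE values: the entry closest to zero is the best candidate.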


In [25]:
"""MODIFICATIONS TO MAKE: 
- Calculate a better n_splits for TimeSeriesSplit based on n_splits ≈ (N / H) - 1. The procedure still needs to be understood.
"""


weather_cols: list[str] = ["temp_c", "wind_kph", "solar_radiation"]
other_cols: list[str] = ["price_1h_ago", "price_24h_ago", "avg_price_last_24h", "hour_of_day", "day_of_week", "month"]



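# Chronological split: shuffle=False keeps time order, so the last 20% of rows form the test set
# (random_state has no effect when shuffle=False)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42, shuffle=False)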

# Time-series cross-validation: 30 expanding-window folds, each validating on the next 24 hours
tscv = TimeSeriesSplit(n_splits=30, test_size=24) # Covers the last 30 days (720 hours) of the training set

# Pipeline for weather columns: degree-2 polynomial expansion, then robust scaling
weather_poly_scaled = Pipeline(steps=[
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scaler", RobustScaler())
])

# Pipeline for the remaining columns: robust scaling only
other_scaled = Pipeline(steps=[
    ("scaler", RobustScaler())
])

# Processor for the linear and distance-based (KNN) models
processor_scaled = ColumnTransformer(
    transformers=[
        ("weather_columns", weather_poly_scaled, weather_cols),
        ("other_columns", other_scaled, other_cols)
    ],
    remainder="drop"
)

# Processor for the tree-based models: columns are passed through unscaled
processor_trees = ColumnTransformer(
    transformers=[
        ("all_columns", "passthrough", weather_cols + other_cols)
    ],
    remainder="drop"
)

# Dict of all pipelines
pipelines: dict = {
    "LinearRegression": Pipeline(steps=[
        ("process", processor_scaled),
        ("lr", LinearRegression())
    ]),

    "RidgeRegression": Pipeline(steps=[
        ("process", processor_scaled),
        ("ridge", Ridge())
    ]),

    "LassoRegression": Pipeline(steps=[
        ("process", processor_scaled),
        ("lasso", Lasso())
    ]),

    "KNN": Pipeline(steps=[
        ("process", processor_scaled),
        ("knn", KNeighborsRegressor())
    ]),

    "BaggingRegressor": Pipeline(steps=[
        ("process", processor_trees),
        ("br", BaggingRegressor(random_state=42, verbose=1))
    ]),

    "RandomForestRegressor": Pipeline(steps=[
        ("process", processor_trees),
        ("rfr", RandomForestRegressor(random_state=42, verbose=1))
    ]),

    "GradientBoostingRegressor": Pipeline(steps=[
        ("process", processor_trees),
        ("gbr", GradientBoostingRegressor(random_state=42, verbose=1))
    ])
}

# Hyperparameter grids for each model
param_grid: dict = {
    "Ridge": {"alpha": [0.01, 0.1, 1.0, 10.0]},
    "Lasso": {"alpha": [0.001, 0.01, 0.1, 1.0]},
    "KNN": {
        "n_neighbors": [3, 5, 7, 9],
        "weights": ["uniform", "distance"]
    },
    "RandomForestRegressor": {
        "n_estimators": [100, 200],
        "max_depth": [None, 10, 20],
        "max_features": ["sqrt", "log2"]
    },
    "BaggingRegressor": {
        "n_estimators": [10, 50, 100], # Uses Decision Tree as base estimator by default
        "max_samples": [0.6, 0.8, 1.0]
    },
    "GradientBoostingRegressor": {
        "n_estimators": [100, 200],
        "learning_rate": [0.05, 0.1],
        "max_depth": [2, 3, 4]
    },
}

# CROSS VALIDATION SETUP
results: list = []  #Stores the CV_Scores for each model
estimators: dict = {}
cv_score: float = 0.0
best_score: float = 0.0

print(f"Starting Training loop on {len(pipelines)} models...")
print("-" * 50)

for name, pipeline in pipelines.items():
    print(f"Training (Fitting) {name}")

    # GridSearchCV addresses pipeline parameters as
    # <step_name>__<parameter>: values
    search_params: dict = {} 

    step_name: str = "" # To prevent UnboundLocalError

    # Identify which algorithm it is 
    if "RidgeRegression" in name: grid_key, step_name = "Ridge", "ridge"
    elif "LassoRegression" in name: grid_key, step_name = "Lasso", "lasso"
    elif "KNN" in name: grid_key, step_name = "KNN", "knn"
    elif "RandomForestRegressor" in name: grid_key, step_name = "RandomForestRegressor", "rfr"
    elif "BaggingRegressor" in name: grid_key, step_name = "BaggingRegressor", "br"
    elif "GradientBoostingRegressor" in name: grid_key, step_name = "GradientBoostingRegressor", "gbr"
    else: grid_key = None  # No hyperparameter tuning for Linear Regression


    # Build the parameter grid for GridSearchCV
    if grid_key and grid_key in param_grid:
        for param, values in param_grid[grid_key].items():
            search_params[f"{step_name}__{param}"] = values

    # Run GridSearchCV if the model is in the param_grid
    if search_params: 
        model = GridSearchCV(
            pipeline, 
            param_grid=search_params, 
            cv=tscv, 
            scoring="neg_root_mean_squared_error", 
            n_jobs=-1, 
            verbose=1
        )

        # Fitting the model on the entire training data
        model.fit(X_train, y_train)
        final_model = model.best_estimator_ # Best pipeline with best hyperparameters
        best_score = round(-model.best_score_, 4) # Best CV score (lowest RMSE)
        best_params = model.best_params_ # Best hyperparameters

        # Displaying the CV results for analysis (ascending sort of negated RMSE lists the worst candidate first)
        cv_results = pd.DataFrame(model.cv_results_)
        display(cv_results.sort_values("mean_test_score"))
        display(cv_results[["params","mean_test_score","std_test_score"]].sort_values("mean_test_score"))


    else: # If the model was not in the param_grid, fit the pipeline directly
        cv_scores = cross_val_score(
            pipeline,
            X_train, 
            y_train, 
            cv=tscv, 
            scoring="neg_root_mean_squared_error"
         )

        cv_score = round(-cv_scores.mean(), 4) # AVG_RMSE 
        best_params = "Default"

        # Fitting the model on the entire training data
        pipeline.fit(X_train, y_train)
        final_model = pipeline

    
    # Store results
    results.append({
        "Model": name,
        "CV_RMSE": best_score if search_params else cv_score,
        "Best Params": best_params
    })

    # Saving the model object
    estimators[name] = final_model 
    # Single quotes inside the f-string keep this valid on Python versions before 3.12
    print(f"Completed training for {name}. CV_RMSE: {results[-1]['CV_RMSE']:.4f}")


# LEADERBOARD
print("-" * 50)
print("Printing the Model Leaderboard sorted by CV_RMSE:")
leaderb_df = pd.DataFrame(results).sort_values("CV_RMSE", ascending=True)
display(leaderb_df)

# SAVING MODELS 
print("Saving models to disk...")
joblib.dump(estimators, "../data/processed/estimators.pkl") 
print("Done.")

Starting Training loop on 7 models...
--------------------------------------------------
Training (Fitting) LinearRegression
Completed training for LinearRegression. CV_RMSE: 20.4526
Training (Fitting) RidgeRegression
Fitting 30 folds for each of 4 candidates, totalling 120 fits


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ridge__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,...,split23_test_score,split24_test_score,split25_test_score,split26_test_score,split27_test_score,split28_test_score,split29_test_score,mean_test_score,std_test_score,rank_test_score
3,0.05305,0.008755,0.015062,0.004526,10.0,{'ridge__alpha': 10.0},-15.552225,-15.178155,-6.926497,-12.664454,...,-5.489208,-9.041293,-30.818927,-37.359318,-40.283687,-25.794928,-36.462,-20.457261,11.239035,4
2,0.057139,0.013826,0.019931,0.016934,1.0,{'ridge__alpha': 1.0},-15.541202,-15.177931,-6.955158,-12.678994,...,-5.506847,-8.956096,-30.790264,-37.317193,-40.247895,-25.773216,-36.471731,-20.453117,11.243023,3
1,0.059651,0.018133,0.01761,0.007694,0.1,{'ridge__alpha': 0.1},-15.540114,-15.17802,-6.958593,-12.680744,...,-5.508878,-8.946384,-30.786942,-37.31241,-40.243898,-25.77069,-36.472903,-20.452657,11.243485,2
0,0.064066,0.017138,0.019677,0.009128,0.01,{'ridge__alpha': 0.01},-15.540005,-15.17803,-6.958943,-12.680923,...,-5.509084,-8.945399,-30.786604,-37.311925,-40.243493,-25.770434,-36.473022,-20.45261,11.243532,1


Unnamed: 0,params,mean_test_score,std_test_score
3,{'ridge__alpha': 10.0},-20.457261,11.239035
2,{'ridge__alpha': 1.0},-20.453117,11.243023
1,{'ridge__alpha': 0.1},-20.452657,11.243485
0,{'ridge__alpha': 0.01},-20.45261,11.243532


Completed training for RidgeRegression. CV_RMSE: 20.4526
Training (Fitting) LassoRegression
Fitting 30 folds for each of 4 candidates, totalling 120 fits


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_lasso__alpha,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,...,split23_test_score,split24_test_score,split25_test_score,split26_test_score,split27_test_score,split28_test_score,split29_test_score,mean_test_score,std_test_score,rank_test_score
3,0.083607,0.029638,0.021266,0.014522,1.0,{'lasso__alpha': 1.0},-16.672147,-15.872021,-7.224639,-12.888698,...,-5.366948,-10.03175,-30.951034,-38.121383,-40.577025,-25.64615,-36.308846,-20.552705,11.05526,4
0,0.171718,0.067189,0.030657,0.03336,0.001,{'lasso__alpha': 0.001},-15.542118,-15.17776,-6.954014,-12.67942,...,-5.505533,-8.946875,-30.789488,-37.315228,-40.248267,-25.770209,-36.471326,-20.452515,11.244617,3
1,0.135287,0.051807,0.031392,0.033482,0.01,{'lasso__alpha': 0.01},-15.561852,-15.175785,-6.910305,-12.665614,...,-5.474834,-8.964108,-30.816666,-37.346465,-40.291349,-25.77144,-36.45843,-20.450691,11.253308,2
2,0.147227,0.063047,0.032337,0.021619,0.1,{'lasso__alpha': 0.1},-15.680324,-15.225521,-6.756993,-12.681798,...,-5.260677,-9.196576,-30.856528,-37.499363,-40.385222,-25.781598,-36.421182,-20.412825,11.254827,1


Unnamed: 0,params,mean_test_score,std_test_score
3,{'lasso__alpha': 1.0},-20.552705,11.05526
0,{'lasso__alpha': 0.001},-20.452515,11.244617
1,{'lasso__alpha': 0.01},-20.450691,11.253308
2,{'lasso__alpha': 0.1},-20.412825,11.254827


Completed training for LassoRegression. CV_RMSE: 20.4128
Training (Fitting) KNN
Fitting 30 folds for each of 8 candidates, totalling 240 fits


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_knn__n_neighbors,param_knn__weights,params,split0_test_score,split1_test_score,split2_test_score,...,split23_test_score,split24_test_score,split25_test_score,split26_test_score,split27_test_score,split28_test_score,split29_test_score,mean_test_score,std_test_score,rank_test_score
0,0.231555,0.098836,0.035825,0.023468,3,uniform,"{'knn__n_neighbors': 3, 'knn__weights': 'unifo...",-20.805394,-18.831856,-12.273358,...,-8.282337,-9.786893,-27.554897,-36.524845,-34.774213,-25.073144,-33.587551,-22.176393,11.106375,8
1,0.215524,0.073585,0.037233,0.023056,3,distance,"{'knn__n_neighbors': 3, 'knn__weights': 'dista...",-20.763179,-18.62296,-12.366734,...,-8.586358,-9.877579,-27.347379,-36.684145,-34.643493,-24.970008,-33.729506,-22.16262,11.016155,7
2,0.19388,0.05144,0.030293,0.017729,5,uniform,"{'knn__n_neighbors': 5, 'knn__weights': 'unifo...",-17.266404,-18.944662,-10.768008,...,-7.587202,-10.488569,-27.700253,-37.034117,-37.635561,-24.937326,-35.209485,-21.439792,11.212828,6
3,0.155821,0.044471,0.033264,0.018396,5,distance,"{'knn__n_neighbors': 5, 'knn__weights': 'dista...",-17.499832,-18.771275,-10.811749,...,-7.839916,-10.435308,-27.392522,-37.043653,-37.616163,-24.587386,-34.673695,-21.405739,11.099116,5
5,0.159317,0.042426,0.02645,0.011149,7,distance,"{'knn__n_neighbors': 7, 'knn__weights': 'dista...",-14.785743,-18.382157,-10.08549,...,-7.636423,-10.287919,-28.249657,-36.897528,-35.912658,-23.465357,-35.847252,-20.727626,10.656848,4
4,0.149935,0.035206,0.025215,0.012561,7,uniform,"{'knn__n_neighbors': 7, 'knn__weights': 'unifo...",-14.41689,-18.543478,-10.069996,...,-7.421321,-10.293856,-28.536069,-36.928005,-35.915739,-23.75601,-36.414405,-20.719603,10.703381,3
7,0.178873,0.071085,0.033478,0.015778,9,distance,"{'knn__n_neighbors': 9, 'knn__weights': 'dista...",-14.363676,-18.786791,-9.061038,...,-6.772468,-10.605762,-28.070928,-36.711874,-36.37855,-24.651522,-37.220366,-20.487227,10.976335,2
6,0.136705,0.032526,0.021025,0.007044,9,uniform,"{'knn__n_neighbors': 9, 'knn__weights': 'unifo...",-13.999834,-18.97121,-8.84555,...,-6.504003,-10.674511,-28.204078,-36.682371,-36.486355,-25.018954,-38.002712,-20.482575,11.062444,1


Unnamed: 0,params,mean_test_score,std_test_score
0,"{'knn__n_neighbors': 3, 'knn__weights': 'unifo...",-22.176393,11.106375
1,"{'knn__n_neighbors': 3, 'knn__weights': 'dista...",-22.16262,11.016155
2,"{'knn__n_neighbors': 5, 'knn__weights': 'unifo...",-21.439792,11.212828
3,"{'knn__n_neighbors': 5, 'knn__weights': 'dista...",-21.405739,11.099116
5,"{'knn__n_neighbors': 7, 'knn__weights': 'dista...",-20.727626,10.656848
4,"{'knn__n_neighbors': 7, 'knn__weights': 'unifo...",-20.719603,10.703381
7,"{'knn__n_neighbors': 9, 'knn__weights': 'dista...",-20.487227,10.976335
6,"{'knn__n_neighbors': 9, 'knn__weights': 'unifo...",-20.482575,11.062444


Completed training for KNN. CV_RMSE: 20.4826
Training (Fitting) BaggingRegressor
Fitting 30 folds for each of 9 candidates, totalling 270 fits


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    7.1s finished


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_br__max_samples,param_br__n_estimators,params,split0_test_score,split1_test_score,split2_test_score,...,split23_test_score,split24_test_score,split25_test_score,split26_test_score,split27_test_score,split28_test_score,split29_test_score,mean_test_score,std_test_score,rank_test_score
6,1.457993,0.126073,0.015033,0.014241,1.0,10,"{'br__max_samples': 1.0, 'br__n_estimators': 10}",-17.314734,-19.897905,-18.600875,...,-5.8802,-11.170593,-43.394976,-36.146914,-37.150664,-28.007282,-38.435329,-22.119718,11.766802,9
3,1.317046,0.122342,0.015394,0.013621,0.8,10,"{'br__max_samples': 0.8, 'br__n_estimators': 10}",-13.632413,-19.352256,-15.677877,...,-7.054653,-10.81344,-27.89514,-36.798779,-38.439662,-26.611716,-36.365932,-21.329049,11.180398,8
0,1.201907,0.203557,0.015679,0.012735,0.6,10,"{'br__max_samples': 0.6, 'br__n_estimators': 10}",-17.442987,-15.41391,-13.142985,...,-6.777199,-8.316709,-27.596723,-40.085561,-36.488989,-31.087305,-35.639186,-20.874533,11.045991,7
7,7.322085,0.482517,0.035806,0.01817,1.0,50,"{'br__max_samples': 1.0, 'br__n_estimators': 50}",-12.42509,-18.92296,-14.290686,...,-6.195155,-10.192662,-32.441573,-35.221573,-35.695646,-24.057848,-38.50131,-20.584176,11.050794,6
4,6.646856,0.40986,0.032226,0.016997,0.8,50,"{'br__max_samples': 0.8, 'br__n_estimators': 50}",-13.512276,-18.114476,-13.510107,...,-6.205018,-8.511226,-28.524127,-36.645949,-36.695288,-25.459577,-37.523634,-20.495745,11.255824,5
8,15.378125,1.257044,0.064711,0.018983,1.0,100,"{'br__max_samples': 1.0, 'br__n_estimators': 100}",-12.89406,-17.065271,-13.19829,...,-6.085271,-10.322615,-30.90294,-35.754066,-33.82645,-24.347912,-37.73095,-20.333647,11.125332,4
5,13.087833,0.410785,0.064342,0.018801,0.8,100,"{'br__max_samples': 0.8, 'br__n_estimators': 100}",-13.643921,-16.881653,-12.33163,...,-5.926286,-8.662496,-28.473093,-36.448207,-36.397506,-25.141162,-36.965207,-20.259037,11.194252,3
1,5.742635,0.449209,0.037845,0.023169,0.6,50,"{'br__max_samples': 0.6, 'br__n_estimators': 50}",-15.020318,-15.387915,-13.313091,...,-5.982391,-8.776182,-26.302433,-36.244315,-35.343747,-25.089455,-35.320733,-20.156857,10.842018,2
2,10.993104,1.201732,0.066639,0.029799,0.6,100,"{'br__max_samples': 0.6, 'br__n_estimators': 100}",-14.563739,-16.051865,-11.908572,...,-5.375937,-8.893988,-26.334624,-36.578759,-35.80897,-24.234108,-35.842497,-20.046293,10.936459,1


Unnamed: 0,params,mean_test_score,std_test_score
6,"{'br__max_samples': 1.0, 'br__n_estimators': 10}",-22.119718,11.766802
3,"{'br__max_samples': 0.8, 'br__n_estimators': 10}",-21.329049,11.180398
0,"{'br__max_samples': 0.6, 'br__n_estimators': 10}",-20.874533,11.045991
7,"{'br__max_samples': 1.0, 'br__n_estimators': 50}",-20.584176,11.050794
4,"{'br__max_samples': 0.8, 'br__n_estimators': 50}",-20.495745,11.255824
8,"{'br__max_samples': 1.0, 'br__n_estimators': 100}",-20.333647,11.125332
5,"{'br__max_samples': 0.8, 'br__n_estimators': 100}",-20.259037,11.194252
1,"{'br__max_samples': 0.6, 'br__n_estimators': 50}",-20.156857,10.842018
2,"{'br__max_samples': 0.6, 'br__n_estimators': 100}",-20.046293,10.936459


Completed training for BaggingRegressor. CV_RMSE: 20.0463
Training (Fitting) RandomForestRegressor
Fitting 30 folds for each of 12 candidates, totalling 360 fits


[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    1.8s
[Parallel(n_jobs=1)]: Done 199 tasks      | elapsed:    7.9s
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    7.9s finished


Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_rfr__max_depth,param_rfr__max_features,param_rfr__n_estimators,params,split0_test_score,split1_test_score,...,split23_test_score,split24_test_score,split25_test_score,split26_test_score,split27_test_score,split28_test_score,split29_test_score,mean_test_score,std_test_score,rank_test_score
10,6.267532,0.672381,0.064963,0.028526,20.0,log2,100,"{'rfr__max_depth': 20, 'rfr__max_features': 'l...",-16.332244,-15.718329,...,-7.220334,-7.865275,-25.082863,-36.786753,-35.762957,-23.525052,-35.845716,-19.719707,10.724584,11
8,5.701843,0.426037,0.058255,0.017354,20.0,sqrt,100,"{'rfr__max_depth': 20, 'rfr__max_features': 's...",-16.332244,-15.718329,...,-7.220334,-7.865275,-25.082863,-36.786753,-35.762957,-23.525052,-35.845716,-19.719707,10.724584,11
11,11.85363,0.628804,0.109802,0.038105,20.0,log2,200,"{'rfr__max_depth': 20, 'rfr__max_features': 'l...",-16.33178,-14.43318,...,-6.71673,-8.470961,-25.298405,-36.477157,-35.366528,-23.547242,-35.351424,-19.62563,10.811151,9
9,11.643116,0.596656,0.112692,0.032893,20.0,sqrt,200,"{'rfr__max_depth': 20, 'rfr__max_features': 's...",-16.33178,-14.43318,...,-6.71673,-8.470961,-25.298405,-36.477157,-35.366528,-23.547242,-35.351424,-19.62563,10.811151,9
7,5.999041,0.470867,0.098171,0.027427,10.0,log2,200,"{'rfr__max_depth': 10, 'rfr__max_features': 'l...",-16.442393,-14.51917,...,-5.930138,-8.821306,-26.623778,-36.221152,-35.970647,-22.666053,-35.082031,-19.539442,10.99596,7
5,5.557705,0.348372,0.087426,0.023162,10.0,sqrt,200,"{'rfr__max_depth': 10, 'rfr__max_features': 's...",-16.442393,-14.51917,...,-5.930138,-8.821306,-26.623778,-36.221152,-35.970647,-22.666053,-35.082031,-19.539442,10.99596,7
6,2.697479,0.157678,0.046522,0.015569,10.0,log2,100,"{'rfr__max_depth': 10, 'rfr__max_features': 'l...",-16.347584,-13.85853,...,-6.231971,-8.558025,-26.730388,-36.66311,-35.457463,-22.039944,-35.662683,-19.534352,11.010125,5
4,2.901092,0.202347,0.049211,0.014724,10.0,sqrt,100,"{'rfr__max_depth': 10, 'rfr__max_features': 's...",-16.347584,-13.85853,...,-6.231971,-8.558025,-26.730388,-36.66311,-35.457463,-22.039944,-35.662683,-19.534352,11.010125,5
0,5.983513,0.556552,0.061574,0.034889,,sqrt,100,"{'rfr__max_depth': None, 'rfr__max_features': ...",-15.872219,-16.01716,...,-6.146,-8.515699,-25.128735,-35.198965,-36.339096,-23.230766,-33.936638,-19.508895,10.606856,3
2,6.04562,0.445254,0.060234,0.019985,,log2,100,"{'rfr__max_depth': None, 'rfr__max_features': ...",-15.872219,-16.01716,...,-6.146,-8.515699,-25.128735,-35.198965,-36.339096,-23.230766,-33.936638,-19.508895,10.606856,3


Unnamed: 0,params,mean_test_score,std_test_score
10,"{'rfr__max_depth': 20, 'rfr__max_features': 'l...",-19.719707,10.724584
8,"{'rfr__max_depth': 20, 'rfr__max_features': 's...",-19.719707,10.724584
11,"{'rfr__max_depth': 20, 'rfr__max_features': 'l...",-19.62563,10.811151
9,"{'rfr__max_depth': 20, 'rfr__max_features': 's...",-19.62563,10.811151
7,"{'rfr__max_depth': 10, 'rfr__max_features': 'l...",-19.539442,10.99596
5,"{'rfr__max_depth': 10, 'rfr__max_features': 's...",-19.539442,10.99596
6,"{'rfr__max_depth': 10, 'rfr__max_features': 'l...",-19.534352,11.010125
4,"{'rfr__max_depth': 10, 'rfr__max_features': 's...",-19.534352,11.010125
0,"{'rfr__max_depth': None, 'rfr__max_features': ...",-19.508895,10.606856
2,"{'rfr__max_depth': None, 'rfr__max_features': ...",-19.508895,10.606856


Completed training for RandomForestRegressor. CV_RMSE: 19.4867
Training (Fitting) GradientBoostingRegressor
Fitting 30 folds for each of 12 candidates, totalling 360 fits
      Iter       Train Loss   Remaining Time 
         1         806.9890            2.07s
         2         749.1731            1.80s
         3         699.4807            1.84s
         4         659.5834            1.84s
         5         625.0671            1.99s
         6         596.6742            1.91s
         7         570.8005            2.05s
         8         548.9132            1.96s
         9         530.8555            1.96s
        10         515.3447            1.91s
        20         428.2929            2.13s
        30         398.4818            1.66s
        40         385.5640            1.54s
        50         377.4291            1.22s
        60         371.4514            1.04s
        70         365.2936            0.76s
        80         361.3482            0.50s
        90        

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_gbr__learning_rate | param_gbr__max_depth | param_gbr__n_estimators | params | split0_test_score | split1_test_score | ... | split23_test_score | split24_test_score | split25_test_score | split26_test_score | split27_test_score | split28_test_score | split29_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11 | 10.514096 | 1.152641 | 0.008669 | 0.010264 | 0.1 | 4 | 200 | `{'gbr__learning_rate': 0.1, 'gbr__max_depth': ...` | -18.190295 | -15.850778 | ... | -7.032744 | -8.665663 | -24.926106 | -38.271667 | -34.805285 | -20.008026 | -35.560377 | -21.102517 | 11.277413 | 12 |
| 10 | 5.039531 | 0.343983 | 0.008454 | 0.007015 | 0.1 | 4 | 100 | `{'gbr__learning_rate': 0.1, 'gbr__max_depth': ...` | -16.95528 | -18.611598 | ... | -6.109535 | -8.257078 | -25.130933 | -38.379613 | -35.730191 | -22.945161 | -35.83097 | -20.836099 | 11.283366 | 11 |
| 9 | 7.564462 | 0.211912 | 0.012614 | 0.017838 | 0.1 | 3 | 200 | `{'gbr__learning_rate': 0.1, 'gbr__max_depth': ...` | -16.910117 | -14.986692 | ... | -6.485563 | -7.435284 | -25.710513 | -41.875279 | -36.975384 | -23.38157 | -36.842715 | -20.48962 | 11.92449 | 10 |
| 5 | 9.888479 | 0.402437 | 0.010758 | 0.012527 | 0.05 | 4 | 200 | `{'gbr__learning_rate': 0.05, 'gbr__max_depth':...` | -16.388577 | -16.737963 | ... | -6.331944 | -7.803013 | -25.320839 | -39.169866 | -36.002606 | -24.751289 | -37.142943 | -20.403642 | 11.469921 | 9 |
| 8 | 3.687694 | 0.146628 | 0.006838 | 0.006059 | 0.1 | 3 | 100 | `{'gbr__learning_rate': 0.1, 'gbr__max_depth': ...` | -17.063327 | -15.300741 | ... | -6.272809 | -7.461097 | -26.079325 | -40.448112 | -37.312771 | -24.926177 | -37.574109 | -20.327386 | 11.762638 | 8 |
| 4 | 5.454232 | 0.696053 | 0.009005 | 0.009971 | 0.05 | 4 | 100 | `{'gbr__learning_rate': 0.05, 'gbr__max_depth':...` | -16.450948 | -15.450606 | ... | -6.140852 | -8.129032 | -25.923066 | -37.670757 | -36.518643 | -25.259401 | -38.793438 | -20.274853 | 11.597142 | 7 |
| 3 | 7.534541 | 0.451082 | 0.006807 | 0.004031 | 0.05 | 3 | 200 | `{'gbr__learning_rate': 0.05, 'gbr__max_depth':...` | -17.138171 | -16.563671 | ... | -6.170956 | -7.51675 | -25.952124 | -38.433211 | -36.985932 | -24.584535 | -37.917377 | -20.240959 | 11.536216 | 6 |
| 0 | 2.55702 | 0.171104 | 0.008052 | 0.008761 | 0.05 | 2 | 100 | `{'gbr__learning_rate': 0.05, 'gbr__max_depth':...` | -18.183123 | -17.075518 | ... | -5.832822 | -7.953272 | -27.759677 | -36.086347 | -38.72413 | -25.419993 | -37.154577 | -20.122704 | 11.424695 | 5 |
| 2 | 3.798427 | 0.243311 | 0.009403 | 0.009856 | 0.05 | 3 | 100 | `{'gbr__learning_rate': 0.05, 'gbr__max_depth':...` | -17.09805 | -16.415146 | ... | -6.185923 | -7.766032 | -26.24975 | -35.95787 | -37.622104 | -25.825765 | -38.907321 | -20.105057 | 11.487641 | 4 |
| 7 | 5.213106 | 0.244131 | 0.009272 | 0.01089 | 0.1 | 2 | 200 | `{'gbr__learning_rate': 0.1, 'gbr__max_depth': ...` | -17.731757 | -17.011422 | ... | -6.371574 | -7.460525 | -26.60261 | -36.697482 | -37.57903 | -24.667135 | -37.245793 | -20.086105 | 11.656033 | 3 |


| | params | mean_test_score | std_test_score |
|---|---|---|---|
| 11 | `{'gbr__learning_rate': 0.1, 'gbr__max_depth': ...` | -21.102517 | 11.277413 |
| 10 | `{'gbr__learning_rate': 0.1, 'gbr__max_depth': ...` | -20.836099 | 11.283366 |
| 9 | `{'gbr__learning_rate': 0.1, 'gbr__max_depth': ...` | -20.48962 | 11.92449 |
| 5 | `{'gbr__learning_rate': 0.05, 'gbr__max_depth':...` | -20.403642 | 11.469921 |
| 8 | `{'gbr__learning_rate': 0.1, 'gbr__max_depth': ...` | -20.327386 | 11.762638 |
| 4 | `{'gbr__learning_rate': 0.05, 'gbr__max_depth':...` | -20.274853 | 11.597142 |
| 3 | `{'gbr__learning_rate': 0.05, 'gbr__max_depth':...` | -20.240959 | 11.536216 |
| 0 | `{'gbr__learning_rate': 0.05, 'gbr__max_depth':...` | -20.122704 | 11.424695 |
| 2 | `{'gbr__learning_rate': 0.05, 'gbr__max_depth':...` | -20.105057 | 11.487641 |
| 7 | `{'gbr__learning_rate': 0.1, 'gbr__max_depth': ...` | -20.086105 | 11.656033 |


Completed training for GradientBoostingRegressor. CV_RMSE: 19.8605
--------------------------------------------------
Printing the Model Leaderboard sorted by CV_RMSE:


| | Model | CV_RMSE | Best Params |
|---|---|---|---|
| 5 | RandomForestRegressor | 19.4867 | `{'rfr__max_depth': None, 'rfr__max_features': ...` |
| 6 | GradientBoostingRegressor | 19.8605 | `{'gbr__learning_rate': 0.1, 'gbr__max_depth': ...` |
| 4 | BaggingRegressor | 20.0463 | `{'br__max_samples': 0.6, 'br__n_estimators': 100}` |
| 2 | LassoRegression | 20.4128 | `{'lasso__alpha': 0.1}` |
| 0 | LinearRegression | 20.4526 | Default |
| 1 | RidgeRegression | 20.4526 | `{'ridge__alpha': 0.01}` |
| 3 | KNN | 20.4826 | `{'knn__n_neighbors': 9, 'knn__weights': 'unifo...` |


Saving models to disk...
Done.
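
The `mean_test_score` values in the grid-search tables above are negative, which is consistent with scikit-learn's convention of negating error metrics so that a higher score is always better. Assuming the search was configured with `scoring="neg_root_mean_squared_error"` (the scoring argument is not shown in this output, so this is an assumption), the reported `CV_RMSE` would be the negated best mean score. A minimal sketch using three of the displayed GradientBoostingRegressor rows:

```python
# mean_test_score values copied from three displayed rows of the GBR grid-search
# table (keys are the table's index labels). The displayed rows cover ranks 3 to 12
# only, so the two best candidates are not included here.
mean_test_scores = {11: -21.102517, 2: -20.105057, 7: -20.086105}

# Assumption: the scores are negated RMSE, so flipping the sign recovers RMSE.
cv_rmse = {idx: -score for idx, score in mean_test_scores.items()}

# With negated errors, the best candidate has the highest score (lowest RMSE).
best_idx = max(mean_test_scores, key=mean_test_scores.get)
print(best_idx, round(cv_rmse[best_idx], 4))
```

Under that assumption, the logged `CV_RMSE: 19.8605` being lower than every displayed row fits with the table showing ranks 3 to 12 only.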


## Predicting and Calculating Evaluation Metrics (RMSE and $R^2$)
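
As a reference for the two metrics used in this section, the sketch below computes RMSE and $R^2$ by hand in plain Python. The function names and the toy values are hypothetical and are not part of the notebook; the notebook itself relies on scikit-learn's `root_mean_squared_error` and `r2_score`.

```python
import math

def rmse_manual(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy values (hypothetical, not taken from the dataset)
y_true = [50.0, 60.0, 70.0, 80.0]
y_pred = [52.0, 58.0, 73.0, 77.0]
print(rmse_manual(y_true, y_pred), r2_manual(y_true, y_pred))
```

RMSE is expressed in the units of the target, while $R^2$ compares the model's squared error against that of always predicting the mean of `y_true`.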

In [28]:
import joblib
import pandas as pd
from sklearn.metrics import root_mean_squared_error as rmse, r2_score

# Retrieving estimators dictionary
print("Loading models from disk...")
estimators = joblib.load("../data/processed/estimators.pkl")

# Models
rf_model = estimators["RandomForestRegressor"]
gbr_model = estimators["GradientBoostingRegressor"]

# CV RMSE from leaderboard
rf_cv_rmse = leaderb_df.loc[
    leaderb_df["Model"] == "RandomForestRegressor", "CV_RMSE"
].values[0]

gbr_cv_rmse = leaderb_df.loc[
    leaderb_df["Model"] == "GradientBoostingRegressor", "CV_RMSE"
].values[0]

# Predictions
rf_pred = rf_model.predict(X_test)
gbr_pred = gbr_model.predict(X_test)

# Test metrics
rf_rmse = round(rmse(y_test, rf_pred), 4)
gbr_rmse = round(rmse(y_test, gbr_pred), 4)

rf_r2 = round(r2_score(y_test, rf_pred), 4)
gbr_r2 = round(r2_score(y_test, gbr_pred), 4)

# Generalization gap: absolute difference between test RMSE and CV RMSE (the sign is discarded)
rf_gap = round(abs(rf_rmse - rf_cv_rmse), 4)
gbr_gap = round(abs(gbr_rmse - gbr_cv_rmse), 4)

# Label the model with the lower test RMSE as "Best Model" (a tie goes to GradientBoostingRegressor)
if rf_rmse < gbr_rmse:
    rows = [
        ["RandomForestRegressor", "Best Model", rf_cv_rmse, rf_rmse, rf_r2, rf_gap],
        ["GradientBoostingRegressor", "Comparison Model", gbr_cv_rmse, gbr_rmse, gbr_r2, gbr_gap]
    ]
else:
    rows = [
        ["GradientBoostingRegressor", "Best Model", gbr_cv_rmse, gbr_rmse, gbr_r2, gbr_gap],
        ["RandomForestRegressor", "Comparison Model", rf_cv_rmse, rf_rmse, rf_r2, rf_gap]
    ]

# Final DataFrame
final_results = pd.DataFrame(
    rows,
    columns=["Model", "Type", "CV_RMSE", "Test_RMSE", "Test_R2", "Gap"]
)

display(leaderb_df)
display(final_results)

Loading models from disk...


[Parallel(n_jobs=1)]: Done  49 tasks      | elapsed:    0.0s
[Parallel(n_jobs=1)]: Done 199 tasks      | elapsed:    0.1s
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.1s finished


| | Model | CV_RMSE | Best Params |
|---|---|---|---|
| 5 | RandomForestRegressor | 19.4867 | `{'rfr__max_depth': None, 'rfr__max_features': ...` |
| 6 | GradientBoostingRegressor | 19.8605 | `{'gbr__learning_rate': 0.1, 'gbr__max_depth': ...` |
| 4 | BaggingRegressor | 20.0463 | `{'br__max_samples': 0.6, 'br__n_estimators': 100}` |
| 2 | LassoRegression | 20.4128 | `{'lasso__alpha': 0.1}` |
| 0 | LinearRegression | 20.4526 | Default |
| 1 | RidgeRegression | 20.4526 | `{'ridge__alpha': 0.01}` |
| 3 | KNN | 20.4826 | `{'knn__n_neighbors': 9, 'knn__weights': 'unifo...` |


| | Model | Type | CV_RMSE | Test_RMSE | Test_R2 | Gap |
|---|---|---|---|---|---|---|
| 0 | GradientBoostingRegressor | Best Model | 19.8605 | 17.1221 | 0.401 | 2.7384 |
| 1 | RandomForestRegressor | Comparison Model | 19.4867 | 17.855 | 0.3486 | 1.6317 |
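
The `Gap` column is an absolute difference, so it does not show in which direction each model moved between cross-validation and the test set. A small sketch that recomputes the gap with its sign, using only the values from the results table above (the variable names are illustrative):

```python
# (CV_RMSE, Test_RMSE) pairs copied from the results table
results = {
    "GradientBoostingRegressor": (19.8605, 17.1221),
    "RandomForestRegressor": (19.4867, 17.855),
}

# Signed gap: a negative value means the test RMSE is lower than the CV RMSE
signed_gap = {
    model: round(test_rmse - cv_rmse, 4)
    for model, (cv_rmse, test_rmse) in results.items()
}
print(signed_gap)
```

Both signed gaps come out negative: each model reached a lower RMSE on the test set than in cross-validation. The ranking also flips, with RandomForestRegressor ahead on `CV_RMSE` and GradientBoostingRegressor ahead on `Test_RMSE`.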


## CONCLUSIONS AND RECOMMENDATIONS