# Model Building

In this notebook, I will develop a predictive model to forecast future energy consumption. 

I will experiment with multiple regression algorithms, compare their performance using standard evaluation metrics (RMSE, MAE, R² Score), and select the best-performing model for deployment. This step is critical to ensure accurate and reliable energy demand forecasting.

### Key Steps:
- Load processed dataset
- Perform time-based train-test split
- Train multiple regression models
- Evaluate using RMSE, MAE, and R²
- Select and save the best model

In [29]:
## importing required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from sklearn.neighbors import KNeighborsRegressor

import warnings
warnings.filterwarnings('ignore')

In [39]:
## loading the dataset
df_cleaned = pd.read_parquet(r"C:\Users\himan\Desktop\Projects\Energy_Forecasting_System\data\processed-data\est_hourly_cleaned_with_features.parquet")
df_cleaned.head()

Datetime,AEP,COMED,DAYTON,DEOK,DOM,DUQ,EKPC,FE,NI,PJME,...,PJME_lag_1,PJME_rolling_mean_24,PJME_rolling_std_24,PJMW_lag_1,PJMW_rolling_mean_24,PJMW_rolling_std_24,PJM_Load_lag_1,PJM_Load_rolling_mean_24,PJM_Load_rolling_std_24,is_holiday
1999-01-01 00:00:00+00:00,13478.0,9970.0,1596.0,2945.0,9389.0,1458.0,1861.0,6222.0,9810.0,26498.0,...,26498.0,26498.0,0.0,5077.0,5077.0,0.0,31569.0,31569.0,0.0,1
1998-12-30 01:00:00+00:00,13478.0,9970.0,1596.0,2945.0,9389.0,1458.0,1861.0,6222.0,9810.0,26498.0,...,26498.0,26498.0,0.0,5077.0,5077.0,0.0,31569.0,31569.0,0.0,0
1998-12-30 02:00:00+00:00,13478.0,9970.0,1596.0,2945.0,9389.0,1458.0,1861.0,6222.0,9810.0,26498.0,...,26498.0,26498.0,0.0,5077.0,5077.0,0.0,31569.0,31569.0,0.0,0
1998-12-30 03:00:00+00:00,13478.0,9970.0,1596.0,2945.0,9389.0,1458.0,1861.0,6222.0,9810.0,26498.0,...,26498.0,26498.0,0.0,5077.0,5077.0,0.0,31569.0,31569.0,0.0,0
1998-12-30 04:00:00+00:00,13478.0,9970.0,1596.0,2945.0,9389.0,1458.0,1861.0,6222.0,9810.0,26498.0,...,26498.0,26498.0,0.0,5077.0,5077.0,0.0,31569.0,31569.0,0.0,0


Now it's time to split the data into a "train" set for fitting the models and a "test" set for evaluating them. Unlike other cases where we split the data randomly, a time-series split has to respect chronology: we train the model only on past data and test it on later data. This prevents data leakage from the future into training.

In [41]:
## splitting the data into train and test
train_size = 0.8  ## fraction of rows (the earliest 80%) used for training
split_range = int(train_size * len(df_cleaned))  ## 0.8 * n is a float, and iloc slicing needs an integer position, so we cast to int

In [42]:
## defining target variable
X = df_cleaned.drop("PJM_Load", axis = 1)
y = df_cleaned["PJM_Load"]

In [43]:
## Time-based Train-Test split
X_train, X_test = X.iloc[:split_range], X.iloc[split_range:]
y_train, y_test = y.iloc[:split_range], y.iloc[split_range:]
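
A positional split like the one above is chronological only if the rows are already sorted by time. The following is a minimal sketch, on synthetic data, of how one might sort first and then confirm that every training timestamp precedes every test timestamp. The helper name `chronological_split` and the toy frame are illustrative assumptions, not part of the original pipeline.

```python
import numpy as np
import pandas as pd


def chronological_split(frame, train_fraction=0.8):
    """Hypothetical helper: sort by the datetime index, then split by position."""
    ordered = frame.sort_index()
    cut = int(train_fraction * len(ordered))
    return ordered.iloc[:cut], ordered.iloc[cut:]


# Toy hourly data, deliberately shuffled so the index is out of order.
idx = pd.Timestamp("2020-01-01", tz="UTC") + pd.to_timedelta(np.arange(10), unit="h")
toy = pd.DataFrame({"load": np.arange(10.0)}, index=idx).sample(frac=1.0, random_state=0)

train, test = chronological_split(toy)

# The split is free of look-ahead only if train ends before test begins.
assert train.index.max() < test.index.min()
print(len(train), len(test))
```

On the real data, the same check could be applied to `X_train.index` and `X_test.index` after the split.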

In [48]:
print(X.columns)                # What features were used?
print(X.head(3))                # Let’s inspect values
print(y.head(3))                # Target values
X.corrwith(y).sort_values()     # Which feature has suspiciously high correlation?


Index(['AEP', 'COMED', 'DAYTON', 'DEOK', 'DOM', 'DUQ', 'EKPC', 'FE', 'NI',
       'PJME', 'PJMW', 'hour', 'day_of_week', 'month', 'day_of_year',
       'is_weekend', 'AEP_lag_1', 'AEP_rolling_mean_24', 'AEP_rolling_std_24',
       'COMED_lag_1', 'COMED_rolling_mean_24', 'COMED_rolling_std_24',
       'DAYTON_lag_1', 'DAYTON_rolling_mean_24', 'DAYTON_rolling_std_24',
       'DEOK_lag_1', 'DEOK_rolling_mean_24', 'DEOK_rolling_std_24',
       'DOM_lag_1', 'DOM_rolling_mean_24', 'DOM_rolling_std_24', 'DUQ_lag_1',
       'DUQ_rolling_mean_24', 'DUQ_rolling_std_24', 'EKPC_lag_1',
       'EKPC_rolling_mean_24', 'EKPC_rolling_std_24', 'FE_lag_1',
       'FE_rolling_mean_24', 'FE_rolling_std_24', 'NI_lag_1',
       'NI_rolling_mean_24', 'NI_rolling_std_24', 'PJME_lag_1',
       'PJME_rolling_mean_24', 'PJME_rolling_std_24', 'PJMW_lag_1',
       'PJMW_rolling_mean_24', 'PJMW_rolling_std_24', 'PJM_Load_lag_1',
       'PJM_Load_rolling_mean_24', 'PJM_Load_rolling_std_24', 'is_holiday'],
      dtyp

AEP                        NaN
COMED                      NaN
DAYTON                     NaN
DEOK                       NaN
DOM                        NaN
DUQ                        NaN
EKPC                       NaN
FE                         NaN
NI                         NaN
PJME                       NaN
PJMW                       NaN
hour                       NaN
day_of_week                NaN
month                      NaN
day_of_year                NaN
is_weekend                 NaN
AEP_lag_1                  NaN
AEP_rolling_mean_24        NaN
AEP_rolling_std_24         NaN
COMED_lag_1                NaN
COMED_rolling_mean_24      NaN
COMED_rolling_std_24       NaN
DAYTON_lag_1               NaN
DAYTON_rolling_mean_24     NaN
DAYTON_rolling_std_24      NaN
DEOK_lag_1                 NaN
DEOK_rolling_mean_24       NaN
DEOK_rolling_std_24        NaN
DOM_lag_1                  NaN
DOM_rolling_mean_24        NaN
DOM_rolling_std_24         NaN
DUQ_lag_1                  NaN
DUQ_roll
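
An all-NaN result from `corrwith` is worth pausing on. Pearson correlation divides by the standard deviations of both series, so it is undefined whenever one of them is constant. The sketch below reproduces that behaviour on made-up data and shows a quick variance check. Whether this is what happened with `df_cleaned` is an assumption that would need to be verified on the real data.

```python
import warnings

import numpy as np
import pandas as pd

# Made-up data: two varying features, one constant target, one varying target.
X_toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [4.0, 3.0, 5.0, 1.0]})
y_const = pd.Series([7.0, 7.0, 7.0, 7.0])
y_vary = pd.Series([2.0, 4.0, 6.0, 8.0])

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # NumPy may warn about the 0/0 division
    corr_const = X_toy.corrwith(y_const)  # correlation with a constant target
corr_vary = X_toy.corrwith(y_vary)        # correlation with a varying target

print(corr_const)
print(corr_vary)

# Quick diagnostics one could run on the real frames:
print(y_const.nunique())       # number of distinct target values
print(X_toy.nunique().eq(1))   # which features are constant
```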

In [44]:
## importing metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, root_mean_squared_error, r2_score

In [45]:
## creating function to evaluate model
import numpy as np
def evaluate_model(true, predicted):
    """Return (MAE, MSE, RMSE, R²) for the given true and predicted values."""
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  ## reuse mse instead of computing it a second time
    r2_square = r2_score(true, predicted)
    return mae, mse, rmse, r2_square
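
For reference, the four quantities returned above can be written out directly. This is a small worked sketch in plain NumPy with made-up numbers, meant only to make the definitions concrete. It is not a replacement for the scikit-learn functions.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.0, 5.0, 9.0])

errors = y_true - y_pred                         # [1, 0, -2]
mae = np.mean(np.abs(errors))                    # mean absolute error
mse = np.mean(errors ** 2)                       # mean squared error
rmse = np.sqrt(mse)                              # same units as the target
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # this ratio is undefined if y_true is constant

print(mae, mse, rmse, r2)
```

RMSE penalises large errors more heavily than MAE, which is why the two are usually reported together.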

In [46]:
## our models
models = {
    "Linear Regression": LinearRegression(),
    "Lasso": Lasso(),
    "Ridge": Ridge(),
    "K-Neighbors Regressor": KNeighborsRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest Regressor": RandomForestRegressor(),
}

In [47]:
## building the model
result = []

for model_name, model in models.items():
    model.fit(X_train, y_train)

    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)

    train_metrics = evaluate_model(y_train, y_train_pred)
    test_metrics = evaluate_model(y_test, y_test_pred)

    ## evaluate_model returns (mae, mse, rmse, r2): index 2 is RMSE, index 3 is R²
    result.append({
        'Model': model_name,
        'Train_RMSE': train_metrics[2],
        'Test_RMSE': test_metrics[2],
        'Train_R2': train_metrics[3],
        'Test_R2': test_metrics[3]
    })

In [None]:
import pandas as pd

results_df = pd.DataFrame(result)
results_df = results_df.sort_values(by='Test_RMSE')

display(results_df)

,Model,Train_RMSE,Test_RMSE,Train_R2,Test_R2
0,Linear Regression,0.0,0.0,1.0,1.0
1,Lasso,0.0,0.0,1.0,1.0
2,Ridge,0.0,0.0,1.0,1.0
3,K-Neighbors Regressor,0.0,0.0,1.0,1.0
4,Decision Tree,0.0,0.0,1.0,1.0
5,Random Forest Regressor,0.0,0.0,1.0,1.0
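
A test RMSE of exactly 0.0 and an R² of 1.0 for every model, including untuned ones, is unusual and is more often a sign of a data problem than of a perfect forecaster. Two common causes are a target that is constant over the evaluated rows and a feature that duplicates the target. The sketch below shows one way to check for both. `diagnose_perfect_scores` is a hypothetical helper, and the toy frame stands in for the real `X` and `y`.

```python
import numpy as np
import pandas as pd


def diagnose_perfect_scores(X, y):
    """Hypothetical helper: flag a constant target and features equal to it."""
    numeric = X.select_dtypes(include="number")
    target = y.to_numpy(dtype=float)
    duplicates = [
        col for col in numeric.columns
        if np.allclose(numeric[col].to_numpy(dtype=float), target, equal_nan=True)
    ]
    return {
        "target_is_constant": bool(y.nunique(dropna=True) <= 1),
        "features_equal_to_target": duplicates,
    }


# Toy stand-in data: "leak" is an exact copy of the target.
y_toy = pd.Series([10.0, 12.0, 11.0, 13.0], name="load")
X_toy = pd.DataFrame({
    "hour": [0, 1, 2, 3],
    "leak": y_toy.to_numpy(),
})

report = diagnose_perfect_scores(X_toy, y_toy)
print(report)
```

On the real data this could be called as `diagnose_perfect_scores(X, y)`. If the target turns out to be constant, or a feature reproduces it, the scores in the table say little about forecasting skill and the feature set should be revisited before a model is selected.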
