# Model Building

In this notebook, I will develop a predictive model to forecast future energy consumption. 

I will experiment with multiple regression algorithms, compare their performance using standard evaluation metrics (RMSE, MAE, R²), and select the best-performing model for deployment. This step is critical for accurate and reliable energy demand forecasting.

### Key Steps:
- Load processed dataset
- Perform time-based train-test split
- Train multiple regression models
- Evaluate using RMSE, MAE, and R²
- Select and save the best model

In [1]:
## installing lightgbm and catboost
!pip install lightgbm catboost



In [24]:
## installing prophet
!pip install prophet

Collecting prophet
  Downloading prophet-1.1.7-py3-none-win_amd64.whl.metadata (3.6 kB)
Collecting cmdstanpy>=1.0.4 (from prophet)
  Downloading cmdstanpy-1.2.5-py3-none-any.whl.metadata (4.0 kB)
Collecting holidays<1,>=0.25 (from prophet)
  Using cached holidays-0.77-py3-none-any.whl.metadata (46 kB)
Collecting importlib_resources (from prophet)
  Downloading importlib_resources-6.5.2-py3-none-any.whl.metadata (3.9 kB)
Collecting stanio<2.0.0,>=0.4.0 (from cmdstanpy>=1.0.4->prophet)
  Downloading stanio-0.5.1-py3-none-any.whl.metadata (1.6 kB)
Downloading prophet-1.1.7-py3-none-win_amd64.whl (13.3 MB)

In [26]:
## importing required libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import ElasticNet
from statsmodels.tsa.statespace.sarimax import SARIMAX
# for arima model
from statsmodels.tsa.arima.model import ARIMA
# for prophet
from prophet import Prophet

import warnings
warnings.filterwarnings('ignore')

In [27]:
## loading the dataset
df_final = pd.read_parquet(r"C:\Users\himan\Desktop\Projects\Energy_Forecasting_System\data\processed-data\est_hourly_cleaned_with_features.parquet")
df_final.head()

| Datetime | PJME_MW | PJMW_MW | hour | dayofweek | quarter | month | year | dayofyear | dayofmonth | weekofyear | is_holiday | PJME_PJMW_avg_Consumption |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2002-04-01 01:00:00 | 21734.0 | 4374.0 | 1 | 0 | 2 | 4 | 2002 | 91 | 1 | 14 | 0 | 13054.0 |
| 2002-04-01 02:00:00 | 20971.0 | 4306.0 | 2 | 0 | 2 | 4 | 2002 | 91 | 1 | 14 | 0 | 12638.5 |
| 2002-04-01 03:00:00 | 20721.0 | 4322.0 | 3 | 0 | 2 | 4 | 2002 | 91 | 1 | 14 | 0 | 12521.5 |
| 2002-04-01 04:00:00 | 20771.0 | 4359.0 | 4 | 0 | 2 | 4 | 2002 | 91 | 1 | 14 | 0 | 12565.0 |
| 2002-04-01 05:00:00 | 21334.0 | 4436.0 | 5 | 0 | 2 | 4 | 2002 | 91 | 1 | 14 | 0 | 12885.0 |


Now it's time to split the data into a training set for fitting the models and a test set for evaluating them. Unlike tasks where the data can be split randomly, a time-series split must respect chronology: the model is trained only on past observations and tested on later ones. This prevents data leakage from the future into training.
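As a minimal sketch of a chronological split, the example below uses a small made-up hourly frame rather than the real dataset. The helper name `chronological_split` and the `load` column are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series standing in for the real load data
idx = pd.date_range("2020-01-01", periods=10, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"load": np.arange(10, dtype=float)}, index=idx)

def chronological_split(frame, train_fraction=0.8):
    """Split a time-indexed frame into an earlier (train) and a later (test) part."""
    frame = frame.sort_index()               # make sure rows are in time order
    cut = int(train_fraction * len(frame))   # integer row position of the cut
    return frame.iloc[:cut], frame.iloc[cut:]

train, test = chronological_split(demo)
print(len(train), len(test))
# Every training timestamp precedes every test timestamp
print(train.index.max() < test.index.min())
```

Because the cut is a single row position, no test-period observation can end up in the training set, which is what a random shuffle would not guarantee.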

In [28]:
## splitting the data into train and test
train_size = 0.8  ## fraction of rows, in time order, used for training
split_range = int(train_size * len(df_final))  ## cast to int, because a float cannot be used as an iloc slice position

In [29]:
## defining target variable
"""X_pjme = df_final.drop(["PJME_MW"] , axis = 1)
X_pjmw = df_final.drop(["PJMW_MW"] , axis = 1)

y_pjme = df_final["PJME_MW"]
y_pjmw = df_final["PJMW_MW"]"""


In [30]:
## Time-based Train-Test split
"""X_train_pjme, X_test_pjme = X_pjme.iloc[:split_range], X_pjme.iloc[split_range:]
y_train_pjme, y_test_pjme = y_pjme.iloc[:split_range], y_pjme.iloc[split_range:]

X_train_pjmw, X_test_pjmw = X_pjmw.iloc[:split_range], X_pjmw.iloc[split_range:]
y_train_pjmw, y_test_pjmw = y_pjmw.iloc[:split_range], y_pjmw.iloc[split_range:]"""


In [31]:
## importing metrics
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [32]:
## creating a function to evaluate a model
import numpy as np
def evaluate_model(true, predicted):
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)  ## reuse the MSE instead of computing it twice
    r2_square = r2_score(true, predicted)
    return mae, mse, rmse, r2_square
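
To make the four returned values concrete, the sketch below recomputes them by hand with NumPy on a tiny made-up example. The helper name `metrics_by_hand` and the numbers are illustrative only; the notebook itself relies on the scikit-learn functions above.

```python
import numpy as np

def metrics_by_hand(true, predicted):
    """Compute MAE, MSE, RMSE and R² directly from their definitions."""
    true = np.asarray(true, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = predicted - true
    mae = np.mean(np.abs(errors))                # mean absolute error
    mse = np.mean(errors ** 2)                   # mean squared error
    rmse = np.sqrt(mse)                          # same units as the target
    ss_res = np.sum(errors ** 2)                 # residual sum of squares
    ss_tot = np.sum((true - true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, rmse, r2

mae, mse, rmse, r2 = metrics_by_hand([3, 5, 7], [2, 5, 9])
print(mae, mse, rmse, r2)
```

RMSE penalises large errors more strongly than MAE, so a model with a few large misses can rank differently under the two metrics.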

In [33]:
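## our models
## The scikit-learn style regressors share the fit(X, y) / predict(X) interface.
## SARIMAX and ARIMA are stored as classes because statsmodels needs the series at construction time.
## Prophet expects a DataFrame with 'ds' and 'y' columns, so it cannot be fitted with fit(X, y) either.
models = {
    "XGBoost": XGBRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Random Forest": RandomForestRegressor(),
    "LightGBM": LGBMRegressor(),
    "CatBoost": CatBoostRegressor(verbose=0),
    "Support Vector Machine": SVR(),
    "K-Neighbors": KNeighborsRegressor(),
    "Elastic Net": ElasticNet(),
    "SARIMA": SARIMAX,
    "ARIMA": ARIMA,
    "Prophet": Prophet()
}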

In [None]:
data_dict = {
    'PJME': {
        'X': df_final.drop(["PJME_MW", "PJME_PJMW_avg_Consumption"], axis=1),
        'y': df_final["PJME_MW"]
    },
    'PJMW': {
        'X': df_final.drop(["PJMW_MW", "PJME_PJMW_avg_Consumption"], axis=1),
        'y': df_final["PJMW_MW"]
    },
    'Average': {
        'X': df_final.drop(["PJME_MW", "PJMW_MW", "PJME_PJMW_avg_Consumption"], axis=1),
        'y': df_final["PJME_PJMW_avg_Consumption"]
    }
}

In [35]:
## Initialize result storage: one list of per-model metrics for each target
results = {}

## Fraction of rows (in time order) used for training
train_size = 0.8
split_range = int(train_size * len(df_final))

for target_name, data in data_dict.items():
    X = data['X']
    y = data['y']

    # Time-based train-test split
    X_train, X_test = X.iloc[:split_range], X.iloc[split_range:]
    y_train, y_test = y.iloc[:split_range], y.iloc[split_range:]

    result_target = []

    ## Iterate through the models
    for model_name, model in models.items():
        if model_name == "SARIMA":
            ## SARIMA model
            sarima_model = SARIMAX(y_train,
                                   order=(1, 1, 1),  # p, d, q values (adjust as necessary)
                                   seasonal_order=(1, 1, 1, 24),  # P, D, Q, S for daily seasonality
                                   enforce_stationarity=False,
                                   enforce_invertibility=False)
            ## Fit the SARIMA model
            sarima_fitted = sarima_model.fit(disp=False)

            ## In-sample predictions for the training period, out-of-sample forecast for the test period
            y_train_pred = sarima_fitted.fittedvalues
            y_test_pred = sarima_fitted.forecast(steps=len(y_test))

        elif model_name in ("ARIMA", "Prophet"):
            ## These do not follow the fit(X, y) / predict(X) interface, so they are skipped in this loop
            continue

        else:
            ## For other models (XGBoost, Random Forest, etc.)
            model.fit(X_train, y_train)

            ## Predict on the training and test sets
            y_train_pred = model.predict(X_train)
            y_test_pred = model.predict(X_test)

        ## Evaluating the model
        train_metrices = evaluate_model(y_train, y_train_pred)
        test_metrices = evaluate_model(y_test, y_test_pred)

        ## Store results for the current model
        result_target.append({
            'Model': model_name,
            'Train_RMSE': train_metrices[2],
            'Test_RMSE': test_metrices[2],
            'Train_R2': train_metrices[3],
            'Test_R2': test_metrices[3]
        })

    ## Add the results for the current target variable (PJME, PJMW, Average).
    ## `results` is a dict, so assign by key instead of calling append.
    results[target_name] = result_target

## Convert results into a DataFrame for better visualization
results_df = pd.DataFrame()


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000938 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 662
[LightGBM] [Info] Number of data points in the train set: 114571, number of used features: 10
[LightGBM] [Info] Start training from score 32348.251739


MemoryError: Unable to allocate 2.22 GiB for an array with shape (114571, 51, 51) and data type float64
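
The size in this error can be reproduced with simple arithmetic. The sketch below is a rough estimate, not statsmodels code: it assumes the state dimension of a SARIMA model with in-model differencing is `max(p + s*P, q + s*Q + 1) + d + s*D`, which gives the 51 seen in the array shape, and that one 51 x 51 float64 matrix is stored per observation. The function names are hypothetical.

```python
def sarima_state_dim(order, seasonal_order):
    """Assumed state-vector length for SARIMA(p, d, q)(P, D, Q, s)."""
    p, d, q = order
    P, D, Q, s = seasonal_order
    return max(p + s * P, q + s * Q + 1) + d + s * D

def dense_array_gib(n_obs, k_states, bytes_per_value=8):
    """Memory in GiB for an (n_obs, k_states, k_states) float64 array."""
    return n_obs * k_states * k_states * bytes_per_value / 2**30

k = sarima_state_dim((1, 1, 1), (1, 1, 1, 24))
print(k)                                    # state dimension
print(round(dense_array_gib(114571, k), 2)) # GiB for the training set size above
```

Under this estimate the memory grows linearly with the number of observations and quadratically with the seasonal period, so a shorter training window or a smaller seasonal period would both reduce it.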

In [23]:
# results of our models
results_df = pd.DataFrame()

for target_name, result in results.items():
    temp_df = pd.DataFrame(result)
    temp_df['Target'] = target_name
    results_df = pd.concat([results_df, temp_df], ignore_index=True)

results_df

| | Model | Train_RMSE | Test_RMSE | Train_R2 | Test_R2 | Target |
|---|---|---|---|---|---|---|
| 0 | XGBoost | 1010.47957 | 1815.091392 | 0.975558 | 0.922046 | PJME |
| 1 | Decision Tree | 2.422571 | 2259.956295 | 1.0 | 0.879151 | PJME |
| 2 | Random Forest | 365.52294 | 1805.646018 | 0.996802 | 0.922855 | PJME |
| 3 | XGBoost | 155.527193 | 289.999553 | 0.974545 | 0.914982 | PJMW |
| 4 | Decision Tree | 0.124083 | 361.878317 | 1.0 | 0.867613 | PJMW |
| 5 | Random Forest | 56.25267 | 273.107572 | 0.99667 | 0.924597 | PJMW |
| 6 | XGBoost | 972.206337 | 2536.107815 | 0.929623 | 0.530333 | Average |
| 7 | Decision Tree | 1.212873 | 2878.551652 | 1.0 | 0.394934 | Average |
| 8 | Random Forest | 267.839692 | 2606.987918 | 0.994659 | 0.503713 | Average |
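
One way to pick a winner per target from a table like this is to take the row with the lowest test RMSE in each group. The sketch below retypes the `Test_RMSE` values from the output above into a small frame; the names `scores` and `best` are illustrative, and lowest test RMSE is only one possible selection rule.

```python
import pandas as pd

# Test RMSE values copied from the results table above
scores = pd.DataFrame({
    "Model": ["XGBoost", "Decision Tree", "Random Forest"] * 3,
    "Test_RMSE": [1815.091392, 2259.956295, 1805.646018,
                  289.999553, 361.878317, 273.107572,
                  2536.107815, 2878.551652, 2606.987918],
    "Target": ["PJME"] * 3 + ["PJMW"] * 3 + ["Average"] * 3,
})

# idxmin returns, per target, the row label with the smallest test RMSE
best_rows = scores.loc[scores.groupby("Target")["Test_RMSE"].idxmin()]
best = dict(zip(best_rows["Target"], best_rows["Model"]))
print(best)
```

The Decision Tree rows also show why the test columns matter: a train R² of 1.0 alongside a much larger test RMSE points to overfitting rather than a good model.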


### Training on SARIMA - Seasonal ARIMA 

In [17]:
from statsmodels.tsa.statespace.sarimax import SARIMAX
# Fit the SARIMA model (adjust the parameters based on your data, p,d,q, seasonal_order)
sarima_model = SARIMAX(y_train,
                       order=(1, 1, 1),  # p, d, q values (adjust as necessary)
                       seasonal_order=(1, 1, 1, 24),  # P, D, Q, S for daily seasonality
                       enforce_stationarity=False,
                       enforce_invertibility=False)

sarima_fitted = sarima_model.fit(disp=False)

# Out-of-sample predictions for the test period
predictions_sarima = sarima_fitted.predict(start=len(y_train), end=len(y_train) + len(y_test) - 1, dynamic=False)

# Evaluate the SARIMA model on its own predictions:
# in-sample fitted values for the training period, the forecast above for the test period
train_metrices_sarima = evaluate_model(y_train, sarima_fitted.fittedvalues)
test_metrices_sarima = evaluate_model(y_test, predictions_sarima)

# `result` is assumed to be the list of per-model metric dicts built earlier
result.append({
    'Model': 'SARIMA',
    'Train_RMSE': train_metrices_sarima[2],
    'Test_RMSE': test_metrices_sarima[2],
    'Train_R2': train_metrices_sarima[3],
    'Test_R2': test_metrices_sarima[3]})



MemoryError: Unable to allocate 643. MiB for an array with shape (51, 51, 32378) and data type float64

### Training on SARIMAX - Seasonal ARIMA with exogenous variables

In [None]:
# Fit the SARIMAX model with exogenous variables (the feature columns from the train-test split)
sarimax_model = SARIMAX(y_train,
                        exog=X_train,
                        order=(1, 1, 1),  # p, d, q values (adjust as necessary)
                        seasonal_order=(1, 1, 1, 24),  # P, D, Q, S for daily seasonality
                        enforce_stationarity=False,
                        enforce_invertibility=False)

sarimax_fitted = sarimax_model.fit(disp=False)

# Out-of-sample predictions for the test period; the test features are passed as exog
predictions_sarimax = sarimax_fitted.predict(start=len(y_train), end=len(y_train) + len(y_test) - 1, exog=X_test, dynamic=False)

# Evaluate the SARIMAX model on its own predictions
train_metrices_sarimax = evaluate_model(y_train, sarimax_fitted.fittedvalues)
test_metrices_sarimax = evaluate_model(y_test, predictions_sarimax)

# `result` is assumed to be the list of per-model metric dicts built earlier
result.append({
    'Model': 'SARIMAX',
    'Train_RMSE': train_metrices_sarimax[2],
    'Test_RMSE': test_metrices_sarimax[2],
    'Train_R2': train_metrices_sarimax[3],
    'Test_R2': test_metrices_sarimax[3]})



### Prophet

In [None]:
from prophet import Prophet  # the package installed above is `prophet`, not `fbprophet`

# Prophet expects a two-column frame: 'ds' is the datetime column, 'y' is the target variable.
# Fit on the training period only, so the test period stays unseen.
prophet_train = y_train.reset_index()
prophet_train.columns = ['ds', 'y']

# Initialize the Prophet model
prophet_model = Prophet(daily_seasonality=True, yearly_seasonality=True, seasonality_mode='multiplicative')

# Fit the model
prophet_model.fit(prophet_train)

# Predict on the actual training and test timestamps
train_forecast = prophet_model.predict(prophet_train[['ds']])
test_forecast = prophet_model.predict(pd.DataFrame({'ds': y_test.index}))

# Evaluate the performance of the model on Prophet's own predictions
train_metrices_prophet = evaluate_model(y_train, train_forecast['yhat'].values)
test_metrices_prophet = evaluate_model(y_test, test_forecast['yhat'].values)

In [19]:
import pandas as pd

results_df = pd.DataFrame(result)
results_df = results_df.sort_values(by='Test_RMSE')
print(results_df)

                     Model  Train_RMSE  Test_RMSE  Train_R2  Test_R2
0            Decision Tree    0.000000        0.0  1.000000      1.0
1  Random Forest Regressor  146.288107        0.0  0.997468      1.0
