## Feature Engineering for Energy Consumption Forecasting

- In this notebook, I create new features to improve the predictive power of our energy consumption forecasting models.
- Feature engineering transforms raw data into inputs that help machine learning algorithms capture patterns, trends, and seasonality.
- I extract time-based features, generate a lagged variable and 24-hour rolling statistics for each load series, and add a US federal holiday indicator.

In [17]:
import pandas as pd

In [18]:
## accessing the data
df_cleaned = pd.read_parquet(r"C:\Users\himan\Desktop\Projects\Energy_Forecasting_System\data\processed-data\est_hourly_cleaned.parquet")

In [19]:
df_cleaned.head()

Datetime,AEP,COMED,DAYTON,DEOK,DOM,DUQ,EKPC,FE,NI,PJME,PJMW,PJM_Load
1998-12-31 01:00:00+00:00,13478.0,9970.0,1596.0,2945.0,9389.0,1458.0,1861.0,6222.0,9810.0,26498.0,5077.0,31569.0
1998-12-31 02:00:00+00:00,13478.0,9970.0,1596.0,2945.0,9389.0,1458.0,1861.0,6222.0,9810.0,26498.0,5077.0,31569.0
1998-12-31 03:00:00+00:00,13478.0,9970.0,1596.0,2945.0,9389.0,1458.0,1861.0,6222.0,9810.0,26498.0,5077.0,31569.0
1998-12-31 04:00:00+00:00,13478.0,9970.0,1596.0,2945.0,9389.0,1458.0,1861.0,6222.0,9810.0,26498.0,5077.0,31569.0
1998-12-31 05:00:00+00:00,13478.0,9970.0,1596.0,2945.0,9389.0,1458.0,1861.0,6222.0,9810.0,26498.0,5077.0,31569.0


In [20]:
df_cleaned.shape

(162080, 12)

In [21]:
df_cleaned.columns

Index(['AEP', 'COMED', 'DAYTON', 'DEOK', 'DOM', 'DUQ', 'EKPC', 'FE', 'NI',
       'PJME', 'PJMW', 'PJM_Load'],
      dtype='object')

##### Extracting Time-Based Features

In [22]:
## extracting time-based features
df_cleaned['hour'] = df_cleaned.index.hour
df_cleaned['day_of_week'] = df_cleaned.index.dayofweek
df_cleaned['month'] = df_cleaned.index.month
df_cleaned['day_of_year'] = df_cleaned.index.dayofyear
df_cleaned['is_weekend'] = (df_cleaned.index.dayofweek >= 5).astype(int)
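
As a quick sanity check on what these accessors return, the sketch below applies the same expressions to a tiny hand-picked index. The three timestamps and the `features` name are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Hypothetical timestamps chosen to cover a weekday, a weekend day, and a leap-year end
idx = pd.DatetimeIndex(["2020-01-03 23:00", "2020-01-04 00:00", "2020-12-31 12:00"])

features = pd.DataFrame(index=idx)
features["hour"] = idx.hour                              # 0-23
features["day_of_week"] = idx.dayofweek                  # Monday=0 ... Sunday=6
features["month"] = idx.month                            # 1-12
features["day_of_year"] = idx.dayofyear                  # 1-365 (366 in leap years)
features["is_weekend"] = (idx.dayofweek >= 5).astype(int)  # Saturday or Sunday

print(features)
```

Note that `dayofweek` counts from Monday as 0, so the `>= 5` test marks Saturday and Sunday.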

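##### Lag and Rolling Features  
A lag feature is a past value of a variable used as a predictor for future values.  
Think of it as teaching our model:  
“What happened 1 hour/day/week ago might help predict what happens now or next.”  
Rolling features summarize a recent window, here the mean and standard deviation over 24 hourly rows. Only a 1-hour lag is created below; the first rows of each new column are NaN until enough history exists.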

In [23]:
core_cols = ['AEP', 'COMED', 'DAYTON', 'DEOK', 'DOM', 'DUQ', 
             'EKPC', 'FE', 'NI', 'PJME', 'PJMW', 'PJM_Load']

for col in core_cols:
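    # Lag feature: the value one row (one hour) earlier; the first row becomes NaN
    df_cleaned[f'{col}_lag_1'] = df_cleaned[col].shift(1)
    # Rolling features over the last 24 rows, including the current row;
    # the first 23 rows become NaN because the window is not yet full
    df_cleaned[f'{col}_rolling_mean_24'] = df_cleaned[col].rolling(window=24).mean()
    df_cleaned[f'{col}_rolling_std_24'] = df_cleaned[col].rolling(window=24).std()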
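
To make the behavior of `shift` and `rolling` concrete, here is a minimal sketch on a toy hourly series. The `toy` frame, the `load` column, and the `_past` variant are assumptions for illustration only. The `_past` column shifts before rolling so the window covers only earlier hours, which may matter if the same column is later used as the prediction target.

```python
import numpy as np
import pandas as pd

# Toy hourly series with values 0, 1, ..., 47 (hypothetical data)
idx = pd.date_range("2020-01-01", periods=48, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"load": np.arange(48, dtype=float)}, index=idx)

# Same pattern as the notebook: 1-hour lag and 24-row rolling mean
toy["load_lag_1"] = toy["load"].shift(1)
toy["load_rolling_mean_24"] = toy["load"].rolling(window=24).mean()

# Illustrative variant: shift first so the window excludes the current hour
toy["load_rolling_mean_24_past"] = toy["load"].shift(1).rolling(window=24).mean()

nan_counts = toy.isna().sum()
print(nan_counts)
```

The lag column has one leading NaN, the rolling column has 23, and the shifted rolling column has 24.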

##### Holiday Indicators
This feature lets the model distinguish US federal holidays from regular days, which can help it capture the shifts in demand that tend to occur on those dates.

In [24]:
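from pandas.tseries.holiday import USFederalHolidayCalendar

calendar = USFederalHolidayCalendar()
# Build the holiday list from plain calendar dates so it is timezone-naive
holidays = calendar.holidays(start=df_cleaned.index.min().date(), end=df_cleaned.index.max().date())

# The calendar returns midnight timestamps, so compare on the calendar date;
# matching the hourly index directly would flag only the 00:00 row of each holiday
hourly_dates = df_cleaned.index.tz_localize(None).normalize()
df_cleaned['is_holiday'] = hourly_dates.isin(holidays).astype(int)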
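
A point to watch with hourly data: `USFederalHolidayCalendar.holidays` returns one midnight timestamp per holiday, so an exact `isin` match against an hourly index flags a single row per holiday. The sketch below contrasts that with matching on the normalized (midnight) date, using a hypothetical two-day index around 4 July 2019.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# 48 hourly timestamps covering 4 and 5 July 2019 (illustrative range)
idx = pd.date_range("2019-07-04", periods=48, freq=pd.Timedelta(hours=1))
holidays = USFederalHolidayCalendar().holidays(start="2019-07-04", end="2019-07-05")

exact_match = idx.isin(holidays)          # compares full timestamps
by_date = idx.normalize().isin(holidays)  # compares calendar dates only

print(int(exact_match.sum()), int(by_date.sum()))
```

Matching on the normalized date marks all 24 hours of the holiday instead of only the midnight row.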

In [25]:
df_cleaned.columns

Index(['AEP', 'COMED', 'DAYTON', 'DEOK', 'DOM', 'DUQ', 'EKPC', 'FE', 'NI',
       'PJME', 'PJMW', 'PJM_Load', 'hour', 'day_of_week', 'month',
       'day_of_year', 'is_weekend', 'AEP_lag_1', 'AEP_rolling_mean_24',
       'AEP_rolling_std_24', 'COMED_lag_1', 'COMED_rolling_mean_24',
       'COMED_rolling_std_24', 'DAYTON_lag_1', 'DAYTON_rolling_mean_24',
       'DAYTON_rolling_std_24', 'DEOK_lag_1', 'DEOK_rolling_mean_24',
       'DEOK_rolling_std_24', 'DOM_lag_1', 'DOM_rolling_mean_24',
       'DOM_rolling_std_24', 'DUQ_lag_1', 'DUQ_rolling_mean_24',
       'DUQ_rolling_std_24', 'EKPC_lag_1', 'EKPC_rolling_mean_24',
       'EKPC_rolling_std_24', 'FE_lag_1', 'FE_rolling_mean_24',
       'FE_rolling_std_24', 'NI_lag_1', 'NI_rolling_mean_24',
       'NI_rolling_std_24', 'PJME_lag_1', 'PJME_rolling_mean_24',
       'PJME_rolling_std_24', 'PJMW_lag_1', 'PJMW_rolling_mean_24',
       'PJMW_rolling_std_24', 'PJM_Load_lag_1', 'PJM_Load_rolling_mean_24',
       'PJM_Load_rolling_std_24', 'is_ho

In [29]:
## checking for null values
print(df_cleaned.isnull().sum().sum())

564
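
The count above is consistent with the NaNs introduced by the lag and rolling features alone. A short arithmetic check, assuming the cleaned input had no missing values before this notebook:

```python
n_series = 12   # load columns in core_cols
window = 24     # rolling window length in rows

# Per series: 1 NaN from shift(1), plus (window - 1) NaNs each for rolling mean and std
nans_per_series = 1 + 2 * (window - 1)
expected_nans = n_series * nans_per_series

# Under this assumption, every NaN falls in the first (window - 1) rows
rows_with_nans = window - 1

print(expected_nans, rows_with_nans)
```

Under that assumption, dropping rows with nulls removes only the first 23 hourly rows.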


In [30]:
## dropping the null values
df_cleaned.dropna(inplace=True)

In [None]:
## saving the cleaned data with new features
df_cleaned.to_parquet(r"C:\Users\himan\Desktop\Projects\Energy_Forecasting_System\data\processed-data\est_hourly_cleaned_with_features.parquet")
