## 02 – Feature engineering

In this notebook I:
- Add calendar features (day-of-week, month, weekend, hour-of-day encodings).
- Add lagged and rolling-window features for load, solar, and wind.
- Construct a supervised dataset for 1-hour-ahead load forecasting.
- Perform a quick correlation check between engineered features and the target.


In [5]:
import sys, os
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt

# Paths and options
PROJECT_ROOT = Path.cwd().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))



In [7]:
from src.data import load_opsd_germany
from src.features import (
    add_time_features,
    add_lagged_features,
    add_rolling_features,
    make_supervised,
    make_features,
)

RAW_PATH = (PROJECT_ROOT / "data" / "time_series_60min_singleindex.csv")
OUT_DIR = (PROJECT_ROOT / "data" / "processed")
OUT_DIR.mkdir(parents=True, exist_ok=True)

# Load raw hourly data
raw = load_opsd_germany(str(RAW_PATH))
print("Loaded:", raw.index.min(), "->", raw.index.max(), "n=", len(raw))
raw.head()


Loaded: 2014-12-31 23:00:00+00:00 -> 2020-09-30 23:00:00+00:00 n= 50401


utc_timestamp,load,solar,wind
2014-12-31 23:00:00+00:00,,,
2015-01-01 00:00:00+00:00,41151.0,,8852.0
2015-01-01 01:00:00+00:00,40135.0,,9054.0
2015-01-01 02:00:00+00:00,39106.0,,9070.0
2015-01-01 03:00:00+00:00,38765.0,,9163.0


In [8]:
# Add time-based features
df_time = add_time_features(raw)
df_time[["load", "hour_sin", "hour_cos"]].head()


utc_timestamp,load,hour_sin,hour_cos
2014-12-31 23:00:00+00:00,,-0.258819,0.965926
2015-01-01 00:00:00+00:00,41151.0,0.0,1.0
2015-01-01 01:00:00+00:00,40135.0,0.258819,0.965926
2015-01-01 02:00:00+00:00,39106.0,0.5,0.866025
2015-01-01 03:00:00+00:00,38765.0,0.707107,0.707107
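
The `hour_sin` / `hour_cos` values above are consistent with mapping the UTC hour onto a 24-hour circle (for example, 23:00 gives sin ≈ -0.2588 and cos ≈ 0.9659). The implementation of `add_time_features` in `src.features` is not shown in this notebook, so the function below is only a minimal sketch under that assumption; the name `add_time_features_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd


def add_time_features_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for add_time_features; expects a DatetimeIndex."""
    out = df.copy()
    dow = out.index.dayofweek.to_numpy()   # Monday=0 ... Sunday=6
    hour = out.index.hour.to_numpy()       # hour in the index's timezone (UTC here)

    out["day_of_week"] = dow
    out["month"] = out.index.month.to_numpy()
    out["is_weekend"] = (dow >= 5).astype(int)

    # Map the hour onto a circle so that 23:00 and 00:00 end up adjacent.
    angle = 2 * np.pi * hour / 24
    out["hour_sin"] = np.sin(angle)
    out["hour_cos"] = np.cos(angle)
    return out


# Small demo on the first five timestamps of the dataset (load values are dummies).
demo_index = pd.date_range(
    "2014-12-31 23:00", periods=5, freq=pd.Timedelta(hours=1), tz="UTC"
)
demo = add_time_features_sketch(pd.DataFrame({"load": range(5)}, index=demo_index))
print(demo[["hour_sin", "hour_cos"]].round(6))
```

Using both sine and cosine keeps every hour distinguishable; either one alone maps two different hours to the same value.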


In [9]:
# Add lag features
df_lag = add_lagged_features(df_time, lags=(1, 2, 24, 168))
df_lag.filter(regex="load|solar|wind").head()


utc_timestamp,load,solar,wind,load_lag_1,load_lag_2,load_lag_24,load_lag_168,solar_lag_1,solar_lag_2,solar_lag_24,solar_lag_168,wind_lag_1,wind_lag_2,wind_lag_24,wind_lag_168
2014-12-31 23:00:00+00:00,,,,,,,,,,,,,,,
2015-01-01 00:00:00+00:00,41151.0,,8852.0,,,,,,,,,,,,
2015-01-01 01:00:00+00:00,40135.0,,9054.0,41151.0,,,,,,,,8852.0,,,
2015-01-01 02:00:00+00:00,39106.0,,9070.0,40135.0,41151.0,,,,,,,9054.0,8852.0,,
2015-01-01 03:00:00+00:00,38765.0,,9163.0,39106.0,40135.0,,,,,,,9070.0,9054.0,,
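
In the output above, `load_lag_1` at 01:00 equals `load` at 00:00, which is what a row-wise `shift` produces on a gap-free hourly index. The sketch below assumes that is how the lags are built; `add_lagged_features_sketch` and its `cols` parameter are hypothetical and not taken from `src.features`.

```python
import pandas as pd


def add_lagged_features_sketch(df, cols=("load", "solar", "wind"), lags=(1, 2, 24, 168)):
    """Hypothetical stand-in for add_lagged_features.

    Assumes a regular hourly index: shift(lag) moves values down by `lag` rows,
    which only equals `lag` hours when no timestamps are missing.
    """
    out = df.copy()
    for col in cols:
        for lag in lags:
            out[f"{col}_lag_{lag}"] = out[col].shift(lag)
    return out


# Demo with the first four load values shown in the raw data.
demo_index = pd.date_range(
    "2015-01-01 00:00", periods=4, freq=pd.Timedelta(hours=1), tz="UTC"
)
demo = pd.DataFrame({"load": [41151.0, 40135.0, 39106.0, 38765.0]}, index=demo_index)
demo = add_lagged_features_sketch(demo, cols=("load",), lags=(1, 2))
print(demo)
```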


In [10]:
# Add rolling features
df_roll = add_rolling_features(df_lag)
df_roll.filter(regex="roll").head()


utc_timestamp,load_roll_mean_24,load_roll_std_24,load_roll_mean_168,load_roll_std_168,solar_roll_mean_24,solar_roll_std_24,wind_roll_mean_24,wind_roll_std_24
2014-12-31 23:00:00+00:00,,,,,,,,
2015-01-01 00:00:00+00:00,,,,,,,,
2015-01-01 01:00:00+00:00,,,,,,,,
2015-01-01 02:00:00+00:00,,,,,,,,
2015-01-01 03:00:00+00:00,,,,,,,,
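
The first rows are all NaN because a rolling window has no value until enough observations are available. The column names suggest 24-hour and 168-hour windows for load and 24-hour windows for solar and wind. Whether `add_rolling_features` shifts the series first or sets `min_periods` is not visible here, so the sketch below simply assumes a plain trailing window that requires a full window of data; the function name and the `windows_by_col` parameter are hypothetical.

```python
import pandas as pd


def add_rolling_features_sketch(df, windows_by_col=None):
    """Hypothetical stand-in for add_rolling_features (trailing mean and std)."""
    if windows_by_col is None:
        # Assumed from the column names in the output above.
        windows_by_col = {"load": (24, 168), "solar": (24,), "wind": (24,)}
    out = df.copy()
    for col, windows in windows_by_col.items():
        for w in windows:
            # The window covers the current row and the w-1 rows before it;
            # the result is NaN until w non-null observations are available.
            roller = out[col].rolling(window=w)
            out[f"{col}_roll_mean_{w}"] = roller.mean()
            out[f"{col}_roll_std_{w}"] = roller.std()
    return out


# Tiny demo with a 3-row window so the warm-up NaNs are easy to see.
demo = pd.DataFrame({"load": [1.0, 2.0, 3.0, 4.0, 5.0]})
demo = add_rolling_features_sketch(demo, windows_by_col={"load": (3,)})
print(demo)
```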


In [11]:
# Drop rows containing any NaN, then inspect the result
df_clean = df_roll.dropna()
df_clean.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 49931 entries, 2015-01-08 07:00:00+00:00 to 2020-09-30 23:00:00+00:00
Data columns (total 28 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   load                49931 non-null  float64
 1   solar               49931 non-null  float64
 2   wind                49931 non-null  float64
 3   day_of_week         49931 non-null  int32  
 4   month               49931 non-null  int32  
 5   is_weekend          49931 non-null  int64  
 6   hour_sin            49931 non-null  float64
 7   hour_cos            49931 non-null  float64
 8   load_lag_1          49931 non-null  float64
 9   load_lag_2          49931 non-null  float64
 10  load_lag_24         49931 non-null  float64
 11  load_lag_168        49931 non-null  float64
 12  solar_lag_1         49931 non-null  float64
 13  solar_lag_2         49931 non-null  float64
 14  solar_lag_24        49931 non-null  float64
 15  solar_

In [16]:
# Create supervised dataset for 1-hour-ahead load
X_1h, y_1h = make_supervised(df_clean, horizon=1, target_col="load")
assert not X_1h.isna().any().any(), "Features contain NaNs"
assert not y_1h.isna().any(), "Targets contain NaNs"
X_1h.shape, y_1h.shape

X_1h.head(), y_1h.head()


(                              load   solar     wind  day_of_week  month  \
 utc_timestamp                                                             
 2015-01-08 07:00:00+00:00  68569.0    57.0  18039.0            3      1   
 2015-01-08 08:00:00+00:00  68599.0   446.0  18177.0            3      1   
 2015-01-08 09:00:00+00:00  69484.0  1083.0  18094.0            3      1   
 2015-01-08 10:00:00+00:00  70635.0  1738.0  17924.0            3      1   
 2015-01-08 11:00:00+00:00  69962.0  2062.0  17249.0            3      1   
 
                            is_weekend  hour_sin  hour_cos  load_lag_1  \
 utc_timestamp                                                           
 2015-01-08 07:00:00+00:00           0  0.965926 -0.258819     65447.0   
 2015-01-08 08:00:00+00:00           0  0.866025 -0.500000     68569.0   
 2015-01-08 09:00:00+00:00           0  0.707107 -0.707107     68599.0   
 2015-01-08 10:00:00+00:00           0  0.500000 -0.866025     69484.0   
 2015-01-08 11:00:00+0

We now have a supervised learning table where each row corresponds to time *t*, with
features summarizing the recent history and calendar context, and a target equal to
the load at time *t + 1h*. This will let us train standard regression models for
1-hour-ahead forecasting.
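
One detail worth checking when building the target: `df_clean.info()` reports 49931 rows between 2015-01-08 07:00 and 2020-09-30 23:00, a span that by my count covers 50225 hours, so `dropna()` appears to have removed rows from the interior of the series as well as the warm-up period. A purely row-based `shift(-1)` would then pair some rows with a target more than one hour ahead. How `make_supervised` handles this is not shown, so the following is only a sketch of a timestamp-aligned alternative; `make_supervised_sketch` is a hypothetical name.

```python
import pandas as pd


def make_supervised_sketch(df, horizon=1, target_col="load"):
    """Pair each row at time t with the target at exactly t + horizon hours.

    Rows whose future timestamp is missing from the index are dropped.
    """
    step = pd.Timedelta(hours=horizon)
    future = df[target_col].copy()
    # Relabel each value with the timestamp `horizon` hours earlier, so the
    # value observed at t + horizon is looked up under timestamp t.
    future.index = future.index - step
    y = future.reindex(df.index).rename("target")
    keep = y.notna()
    return df.loc[keep], y.loc[keep]


# Demo: hourly data with the 03:00 row missing.
demo_index = pd.to_datetime(
    ["2015-01-08 00:00", "2015-01-08 01:00", "2015-01-08 02:00", "2015-01-08 04:00"],
    utc=True,
)
demo = pd.DataFrame({"load": [10.0, 20.0, 30.0, 50.0]}, index=demo_index)
X_demo, y_demo = make_supervised_sketch(demo, horizon=1)
print(X_demo.assign(target=y_demo))
```

In the demo, the 02:00 row is dropped because its 03:00 target does not exist, and the last row is dropped because there is no later observation.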

In [13]:
# Quick correlation with target
corr = X_1h.assign(target=y_1h).corr()["target"].sort_values(ascending=False)
corr.head(15)


target                1.000000
load                  0.966947
load_lag_168          0.881723
load_lag_1            0.881635
load_lag_2            0.763895
load_lag_24           0.727838
load_roll_mean_24     0.548246
load_roll_std_24      0.377397
load_roll_mean_168    0.346359
solar_lag_24          0.303160
solar_lag_168         0.302438
solar                 0.295879
solar_lag_1           0.266181
solar_lag_2           0.228637
hour_sin              0.141835
Name: target, dtype: float64

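The correlation table shows that the features most strongly correlated with
1-hour-ahead load are the current load (0.97), the load one week earlier
(`load_lag_168`, 0.88), and the previous hour's load (`load_lag_1`, 0.88), followed
by the remaining load lags and the 24-hour rolling mean (0.55). Solar features are
moderately correlated (roughly 0.23 to 0.30), consistent with daytime demand lining
up with solar availability, while `hour_sin` is only weakly correlated (0.14). Since
this is a linear (Pearson) measure, it can understate cyclical or non-linear
relationships. As a sanity check, it supports using these engineered features in our
forecasting models.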
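
Because the series is sorted in descending order, `head(15)` only surfaces positive correlations; any strongly negative ones would sit at the bottom of the list. A small variation, sketched below with the hypothetical helper `rank_by_abs_corr`, ranks features by absolute correlation while keeping the sign.

```python
import pandas as pd


def rank_by_abs_corr(X: pd.DataFrame, y: pd.Series, top: int = 15) -> pd.Series:
    """Return signed Pearson correlations, ordered by absolute value."""
    corr = X.corrwith(y)  # column-wise correlation with y, aligned on the index
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order).head(top)


# Toy demo: one positively, one negatively, and one uncorrelated feature.
y_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
X_demo = pd.DataFrame(
    {
        "weak": [1.0, -1.0, 1.0, -1.0, 1.0],
        "neg": [5.0, 4.0, 3.0, 2.0, 1.0],
        "pos": [2.0, 4.0, 6.0, 8.0, 10.0],
    }
)
ranked = rank_by_abs_corr(X_demo, y_demo)
print(ranked)
```

In the notebook this would be called as `rank_by_abs_corr(X_1h, y_1h)`.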


In [14]:
# Save processed features (optional)
X_1h_path = (OUT_DIR / "X_load_h1.parquet")
y_1h_path = (OUT_DIR / "y_load_h1.parquet")

X_1h.to_parquet(str(X_1h_path))
y_1h.to_frame("y").to_parquet(str(y_1h_path))

X_1h_path, y_1h_path


(PosixPath('/Users/test/Desktop/forecast/data/processed/X_load_h1.parquet'),
 PosixPath('/Users/test/Desktop/forecast/data/processed/y_load_h1.parquet'))