# Load Forecasting with Gradient Boosting

From the [Sisyphean Gridworks ML Playground](https://sgridworks.com/ml-playground/guides/02-load-forecasting.html)

## Setup

Clone the repository and install dependencies. Run this cell first.

In [None]:
!git clone https://github.com/SGridworks/Dynamic-Network-Model.git 2>/dev/null || echo 'Already cloned'
%cd Dynamic-Network-Model
!pip install -q pandas numpy matplotlib seaborn scikit-learn xgboost lightgbm pyarrow

## Load the Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error
from demo_data.load_demo_data import load_load_profiles, load_weather_data

# Load feeder-level 15-minute load profiles
load = load_load_profiles()

# Load hourly weather
weather = load_weather_data()

print(f"Load rows:    {len(load):,}")
print(f"Weather rows: {len(weather):,}")
print(f"Load columns: {list(load.columns)}")

## Pick a Feeder and Explore

The SP&L dataset contains 65 feeders. To keep things simple, pick one feeder and work with it throughout this guide. You can repeat the process for other feeders later.

In [None]:
# Pick Feeder 1
feeder = load[load["feeder_id"] == "FDR-0001"].copy()
feeder["timestamp"] = pd.to_datetime(feeder["timestamp"])
feeder = feeder.sort_values("timestamp").reset_index(drop=True)

# Plot one month of data to see the daily pattern
one_month = feeder[(feeder["timestamp"] >= "2024-07-01") &
                   (feeder["timestamp"] < "2024-08-01")]

plt.figure(figsize=(14, 4))
plt.plot(one_month["timestamp"], one_month["load_mw"], linewidth=0.8)
plt.title("Feeder FDR-0001 — July 2024 15-Minute Load")
plt.ylabel("Load (MW)")
plt.xlabel("Date")
plt.tight_layout()
plt.show()

## Build Time Features

The load pattern depends heavily on the time of day, day of week, and season. Let's extract those from the timestamp.

In [None]:
# Time-based features
feeder["hour"]        = feeder["timestamp"].dt.hour
feeder["day_of_week"] = feeder["timestamp"].dt.dayofweek
feeder["month"]       = feeder["timestamp"].dt.month
feeder["is_weekend"]  = (feeder["day_of_week"] >= 5).astype(int)

# Show the average load by hour of day
feeder.groupby("hour")["load_mw"].mean().plot(
    kind="bar", color="#5FCCDB", title="Average Load by Hour of Day"
)
plt.ylabel("Load (MW)")
plt.tight_layout()
plt.show()
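
Integer hour values put 23 and 0 far apart even though they are neighbors on the clock. Tree-based models such as gradient boosting can usually work around that, but if you later try a linear or neural model, a sine/cosine encoding is one common alternative. The following is a minimal sketch on a made-up timestamp range; the column names `hour_sin` and `hour_cos` are illustrative and not part of the SP&L dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical 15-minute timestamps for a single day (illustration only)
demo = pd.DataFrame(
    {"timestamp": pd.date_range("2024-07-01", periods=96, freq="15min")}
)

# Fractional hour of day, e.g. 13:30 -> 13.5
hour_frac = demo["timestamp"].dt.hour + demo["timestamp"].dt.minute / 60

# Map the 24-hour cycle onto a circle so 23:45 sits next to 00:00
demo["hour_sin"] = np.sin(2 * np.pi * hour_frac / 24)
demo["hour_cos"] = np.cos(2 * np.pi * hour_frac / 24)

print(demo.head())
```

Both columns are needed together: either one alone maps two different times of day to the same value.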

## Merge Weather Data

Temperature is typically the strongest weather driver of electricity demand. On hot days, air conditioning load climbs; on cold days, electric heating does. Let's join weather data to our load table.

In [None]:
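# Merge hourly weather onto the 15-minute load rows.
# Floor each load timestamp to its hour so that all four intervals in an
# hour pick up that hour's weather reading (assumes weather is stamped on the hour).
weather["timestamp"] = pd.to_datetime(weather["timestamp"])
feeder["weather_hour"] = feeder["timestamp"].dt.floor("h")
df = feeder.merge(
    weather[["timestamp", "temperature_f", "humidity_pct", "wind_speed_mph"]]
        .rename(columns={"timestamp": "weather_hour"}),
    on="weather_hour",
    how="left"
).drop(columns="weather_hour")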

# Drop rows with missing weather
df = df.dropna(subset=["temperature_f"])

print(f"Merged rows: {len(df):,}")
print(df[["timestamp", "load_mw", "temperature_f", "hour"]].head())
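
Because the load table is at 15-minute resolution and the weather table is hourly, an exact-timestamp join only matches the load rows that fall on the hour. The sketch below uses small synthetic tables (all values made up) to compare an exact join with a join on the timestamp floored to the hour; `weather_hour` is an illustrative helper column name.

```python
import pandas as pd

# Synthetic 15-minute load rows (values are made up)
load_demo = pd.DataFrame({
    "timestamp": pd.date_range("2024-07-01 00:00", periods=8, freq="15min"),
    "load_mw": [3.1, 3.0, 2.9, 2.9, 2.8, 2.8, 2.7, 2.7],
})

# Synthetic hourly weather rows (values are made up)
weather_demo = pd.DataFrame({
    "timestamp": pd.date_range("2024-07-01 00:00", periods=2, freq="60min"),
    "temperature_f": [78.0, 76.5],
})

# Exact-timestamp merge: only rows on the hour find a weather match
exact = load_demo.merge(weather_demo, on="timestamp", how="left")

# Floor each load timestamp to its hour, then merge on that key
load_demo["weather_hour"] = load_demo["timestamp"].dt.floor("h")
floored = load_demo.merge(
    weather_demo.rename(columns={"timestamp": "weather_hour"}),
    on="weather_hour",
    how="left",
)

print("Matched rows, exact merge:  ", exact["temperature_f"].notna().sum())
print("Matched rows, floored merge:", floored["temperature_f"].notna().sum())
```

Comparing the row count before and after any `dropna` is a quick way to spot this kind of silent loss.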

## Add Lag Features

What was the load 24 hours ago? That is often the best predictor of what load will be now. These "lag" features give the model a sense of recent history.

In [None]:
# Load from the same interval yesterday and one week ago (15-min data: 96 intervals/day)
df["load_lag_24h"]  = df["load_mw"].shift(96)
df["load_lag_168h"] = df["load_mw"].shift(672)  # 7 days * 96 intervals

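# Rolling average over the 24 hours that end one day before the target.
# Shifting first keeps the current load (the target) out of its own feature.
df["load_rolling_24h"] = df["load_mw"].shift(96).rolling(96).mean()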

# Drop rows where lags are not available (first 672 intervals)
df = df.dropna()

print(f"Rows after adding lags: {len(df):,}")
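
Note that `shift(96)` counts rows, not time, so it only equals a 24-hour lag when every 15-minute interval is present. The sketch below, on synthetic data with one interval deliberately removed, shows one way to check the spacing and to compute a lag keyed on the timestamp instead. The names `lag_by_rows` and `lag_by_time` are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic 15-minute series covering 3 days; the load value is just the row number
ts = pd.date_range("2024-07-01", periods=3 * 96, freq="15min")
demo = pd.DataFrame({"timestamp": ts, "load_mw": np.arange(len(ts), dtype=float)})

# Remove one interval to simulate a gap in the data
demo = demo.drop(index=100).reset_index(drop=True)

# Check: is every step exactly 15 minutes?
steps = demo["timestamp"].diff().dropna()
is_regular = bool((steps == pd.Timedelta(minutes=15)).all())
print("Regular 15-minute spacing:", is_regular)

# Row-based lag: misaligned for the rows that follow the gap
demo["lag_by_rows"] = demo["load_mw"].shift(96)

# Time-based lag: look up the value stamped exactly 24 hours earlier
by_time = demo.set_index("timestamp")["load_mw"]
lookup = demo["timestamp"] - pd.Timedelta(hours=24)
demo["lag_by_time"] = by_time.reindex(lookup).to_numpy()

# Count rows where both lags exist but disagree
both_present = demo["lag_by_rows"].notna() & demo["lag_by_time"].notna()
mismatch = both_present & (demo["lag_by_rows"] != demo["lag_by_time"])
print("Rows where the two lags disagree:", int(mismatch.sum()))
```

If the spacing check passes on your feeder, the row-based `shift` is fine and faster; if not, the time-based lookup is the safer choice.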

## Build a Baseline Forecast

Before training an ML model, build a simple baseline. A "persistence" forecast says: "Tomorrow's load at 2 PM will be the same as today's load at 2 PM." This gives you a bar to beat.

In [None]:
# Use 2024 as test, everything before as train
train = df[df["timestamp"] < "2024-01-01"]
test  = df[df["timestamp"] >= "2024-01-01"]

# Baseline: predict the load from 24 hours ago
baseline_mae = mean_absolute_error(test["load_mw"], test["load_lag_24h"])

print(f"Baseline (persistence) MAE: {baseline_mae:.4f} MW")
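
To make the persistence idea concrete, here is a small sketch on a synthetic series with a daily cycle plus noise. It compares the persistence forecast with an even simpler reference that always predicts the historical mean. The data, the random seed, and the variable names are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 15-minute load for 10 days: daily sine cycle plus small noise
intervals_per_day = 96
t = np.arange(10 * intervals_per_day)
load = 3.0 + np.sin(2 * np.pi * t / intervals_per_day) + rng.normal(0, 0.1, t.size)

# Persistence forecast: the value observed exactly one day earlier
actual = load[intervals_per_day:]
persistence = load[:-intervals_per_day]

# Simpler reference: always predict the mean of the first day
mean_forecast = np.full_like(actual, load[:intervals_per_day].mean())

# Mean absolute error of each forecast
mae_persistence = np.mean(np.abs(actual - persistence))
mae_mean = np.mean(np.abs(actual - mean_forecast))

print(f"Persistence MAE:   {mae_persistence:.3f} MW")
print(f"Mean-forecast MAE: {mae_mean:.3f} MW")
```

When load has a strong daily shape, persistence captures most of it for free, which is why it makes a demanding baseline.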

## Train the Gradient Boosting Model

In [None]:
# Define features
feature_cols = [
    "hour", "day_of_week", "month", "is_weekend",
    "temperature_f", "humidity_pct", "wind_speed_mph",
    "load_lag_24h", "load_lag_168h", "load_rolling_24h"
]

X_train = train[feature_cols]
y_train = train["load_mw"]
X_test  = test[feature_cols]
y_test  = test["load_mw"]

# Create and train the model
model = GradientBoostingRegressor(
    n_estimators=300,     # number of boosting stages
    max_depth=5,           # depth of each tree
    learning_rate=0.1,     # how much each tree contributes
    random_state=42
)

model.fit(X_train, y_train)
print("Model training complete.")

## Test and Compare

In [None]:
# Predict on the test set
y_pred = model.predict(X_test)

# Calculate error metrics
model_mae  = mean_absolute_error(y_test, y_pred)
model_rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Baseline MAE:         {baseline_mae:.4f} MW")
print(f"Gradient Boosting MAE: {model_mae:.4f} MW")
print(f"Gradient Boosting RMSE: {model_rmse:.4f} MW")
print(f"\nImprovement over baseline: {((baseline_mae - model_mae) / baseline_mae * 100):.1f}%")
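
MAE and RMSE answer slightly different questions: MAE weights every error equally, while RMSE penalizes large misses more heavily. A tiny worked example with made-up errors shows the difference, along with the same improvement formula used in the cell above. All numbers here are hypothetical.

```python
import numpy as np

# Two made-up error profiles with the same mean absolute error
errors_even = np.array([0.5, 0.5, 0.5, 0.5])   # consistently off by 0.5 MW
errors_spiky = np.array([0.0, 0.0, 0.0, 2.0])  # mostly exact, one large miss

def mae(errors):
    """Mean absolute error."""
    return np.mean(np.abs(errors))

def rmse(errors):
    """Root mean squared error."""
    return np.sqrt(np.mean(errors ** 2))

print(f"Even errors:  MAE={mae(errors_even):.2f}  RMSE={rmse(errors_even):.2f}")
print(f"Spiky errors: MAE={mae(errors_spiky):.2f}  RMSE={rmse(errors_spiky):.2f}")

# Percentage improvement of a model over a baseline (hypothetical MAE values)
baseline_mae_demo, model_mae_demo = 0.40, 0.25
improvement_pct = (baseline_mae_demo - model_mae_demo) / baseline_mae_demo * 100
print(f"Improvement: {improvement_pct:.1f}%")
```

A model whose RMSE is much larger than its MAE is likely missing badly on a few intervals, often peaks, which is worth checking in the forecast plot.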

## Visualize the Forecast

Let's plot one week of predictions against actual load to see how the model performs visually.

In [None]:
# Plot one week of actual vs. predicted
week = test.head(672).copy()  # 7 days * 96 intervals
week["predicted"] = y_pred[:672]

fig, ax = plt.subplots(figsize=(14, 5))
ax.plot(week["timestamp"], week["load_mw"],
        label="Actual", linewidth=1.5)
ax.plot(week["timestamp"], week["predicted"],
        label="Predicted", linewidth=1.5, linestyle="--")
ax.set_title("Load Forecast vs. Actual — First Week of Test Set")
ax.set_ylabel("Load (MW)")
ax.legend()
plt.tight_layout()
plt.show()

## Feature Importance

In [None]:
# Which features matter most?
importances = pd.Series(model.feature_importances_, index=feature_cols)
importances = importances.sort_values(ascending=True)

fig, ax = plt.subplots(figsize=(8, 5))
importances.plot(kind="barh", color="#5FCCDB", ax=ax)
ax.set_title("Feature Importance: What Drives Load?")
ax.set_xlabel("Importance Score")
plt.tight_layout()
plt.show()

## What You Built and Next Steps

You just built a day-ahead load forecasting model that beat a persistence baseline by over 50%. Here's what you did: