# 09a: ML Price Forecasting — NO_5 (Bergen)

**Phase 3** — Machine learning price forecasting using gradient boosted trees.

This notebook predicts day-ahead electricity prices (EUR/MWh → NOK/kWh) for NO_5 (Bergen)
using **fundamental features only**: weather, reservoir levels, gas prices, calendar patterns,
load, and generation data.

**Fundamentals-only approach:**
We deliberately exclude price lag features (price_lag_1h, price_lag_24h, rolling means, etc.).
These autoregressive features let the model learn "price ≈ yesterday's price" — a shortcut
that masks the actual supply/demand drivers. By removing them, the model must learn from
**production, consumption, reservoir levels, commodity prices, weather, and calendar** —
the physical factors that actually determine electricity price.

**Why ML over statistical methods?**
Statistical methods (ARIMA, SARIMA, ETS) rely mainly on price history, with at most one or
two exogenous regressors. Tree-based models (XGBoost, LightGBM, CatBoost) can use all
fundamental features at once and capture nonlinear relationships between weather, supply,
demand, and price.

**Methods:**
1. Naive baseline (same hour last week) — the bar to beat
2. Statistical baselines (best ARIMA/SARIMA — compressed summary)
3. XGBoost
4. LightGBM
5. CatBoost
6. Weighted ensemble (inverse-MAE)
7. Walk-forward validation (6-fold)
8. SHAP feature importance analysis
9. Yr weather forecast integration (forward-looking predictions)

**Data split:**
- Training: 2022-01-01 to 2024-12-31 (~26,304 hours; 2024 is a leap year)
- Validation: 2025-01-01 to 2025-06-30 (~4,344 hours)
- Test: 2025-07-01 to 2026-02-22

## 0. Setup & Data Loading

In [None]:
import sys
import logging
import warnings
import time
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy import stats

# Project imports
sys.path.insert(0, str(Path.cwd().parent))
from src.models.forecasters import NaiveForecaster
from src.models.train import (
    MLPriceForecaster,
    prepare_ml_features,
    walk_forward_validate,
    train_ensemble,
    forecast_with_yr,
)
from src.models.evaluate import compute_metrics, comparison_table, plot_forecast, plot_residuals

warnings.filterwarnings("ignore")
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(levelname)s %(message)s")

%matplotlib inline
plt.rcParams["figure.figsize"] = (16, 5)
plt.rcParams["figure.dpi"] = 100

ZONE = "NO_5"
print(f"Forecasting target: price_eur_mwh for {ZONE} (Bergen)")

In [None]:
# Load feature matrix
data_path = Path.cwd().parent / "data" / "processed" / "features_NO_5_2022-01-01_2026-01-01.parquet"
df = pd.read_parquet(data_path)
print(f"Loaded: {df.shape[0]:,} rows x {df.shape[1]} columns")
print(f"Date range: {df.index.min()} to {df.index.max()}")

# Extract target
target = df["price_eur_mwh"]

# Truncate at 2026-02-22 (end of available verified data)
cutoff = pd.Timestamp("2026-02-22", tz="Europe/Oslo")
df = df[df.index <= cutoff]
target = target[target.index <= cutoff]
print(f"After truncation: {len(target):,} hours ({target.index.min()} to {target.index.max()})")

In [None]:
# --- Data validation: check for gaps / forward-fill artifacts ---
# ENTSO-E had only Jan 2024 for NO_5 — the rest of 2024 was forward-filled
# as a constant (47.5 EUR/MWh). Nord Pool has the real data.

price_hourly_count = target.resample("D").count()
missing_days = price_hourly_count[price_hourly_count < 20]

# Detect forward-fill artifacts: days where price never changes
daily_unique = target.resample("D").apply(lambda x: x.nunique())
flat_days = daily_unique[daily_unique <= 1]

print("Data quality check for price_eur_mwh:")
print(f"  Total hours: {len(target):,}")
print(f"  NaN count: {target.isna().sum()}")
print(f"  Negative price hours: {(target < 0).sum()}")
print(f"  Days with <20 hours coverage: {len(missing_days)}")
print(f"  Days with constant price (ffill artifact): {len(flat_days)}")

if len(flat_days) > 5:
    first_flat = flat_days.index[0]
    last_flat = flat_days.index[-1]
    print(f"\n  Forward-fill artifact detected: {first_flat.date()} to {last_flat.date()}")
    print(f"  ({len(flat_days)} days of constant prices — likely ENTSO-E gap)")
    print(f"  -> Will patch from hvakosterstrommen.no (Nord Pool) below")

# Year-by-year summary
print("\nYearly summary:")
for year in sorted(target.index.year.unique()):  # include every year present, 2026 too
    mask = target.index.year == year
    yearly = target[mask]
    n_unique = yearly.nunique()
    print(f"  {year}: {len(yearly):>5,} hours, {n_unique:>5,} unique vals, "
          f"mean={yearly.mean():.1f}, std={yearly.std():.1f}, "
          f"min={yearly.min():.1f}, max={yearly.max():.1f}")

In [None]:
# --- Patch: fill ENTSO-E price gaps with Nord Pool data ---
# ENTSO-E only had Jan 2024 for NO_5 — Feb-Dec 2024 was forward-filled.
# hvakosterstrommen.no (Nord Pool) has the actual day-ahead prices.

from src.data.fetch_nordpool import fetch_prices as fetch_nordpool_prices

# Load Nord Pool prices for the full period
nordpool_all = fetch_nordpool_prices("2022-01-01", "2025-12-31", cache=True)

if ZONE in nordpool_all.columns:
    nordpool_prices = nordpool_all[ZONE].rename("price_eur_mwh")
    
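    # Detect which hours in our data are forward-fill artifacts:
    # runs where the price repeats the previous hour for 24+ hours straight.
    is_flat = target.diff().abs() < 1e-6
    # Label each run (a new run starts at every hour whose price changed), then
    # flag every repeated hour in runs of 24+ repeats. A rolling 24-hour sum
    # would leave the first 23 repeated hours of each run unflagged.
    run_id = (~is_flat).cumsum()
    run_length = is_flat.groupby(run_id).transform("sum")
    suspect_mask = is_flat & (run_length >= 24)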
    
    # How many hours need patching?
    n_suspect = suspect_mask.sum()
    
    if n_suspect > 100:
        # Align Nord Pool to our index
        nordpool_aligned = nordpool_prices.reindex(target.index)
        
        # Patch: replace suspect hours with Nord Pool prices where available
        patched = target.copy()
        can_patch = suspect_mask & nordpool_aligned.notna()
        patched[can_patch] = nordpool_aligned[can_patch]
        
        n_patched = can_patch.sum()
        print(f"Patched {n_patched:,} hours of forward-fill artifacts with Nord Pool data")
        print(f"  Before: {n_suspect:,} suspect hours (constant price runs)")
        
        # Update target and df
        target = patched
        df["price_eur_mwh"] = patched
        
        # Also recompute NOK prices (for reporting, not modeling)
        if "eur_nok" in df.columns:
            df["price_nok_mwh"] = target * df["eur_nok"]
            df["price_nok_kwh"] = df["price_nok_mwh"] / 1000
        
        # Verify the fix
        daily_unique_after = target.resample("D").apply(lambda x: x.nunique())
        flat_after = daily_unique_after[daily_unique_after <= 1]
        print(f"  After:  {len(flat_after)} days with constant prices")
        
        # Show 2024 stats after fix
        y_2024 = target[target.index.year == 2024]
        print(f"\n  2024 after patch: {y_2024.nunique():,} unique values, "
              f"mean={y_2024.mean():.1f}, std={y_2024.std():.1f}, "
              f"min={y_2024.min():.1f}, max={y_2024.max():.1f}")
    else:
        print(f"Only {n_suspect} suspect hours detected — no patching needed")
else:
    print(f"Warning: {ZONE} not found in Nord Pool data")

In [None]:
# Train / Validation / Test split
TRAIN_END = pd.Timestamp("2024-12-31 23:00", tz="Europe/Oslo")
VAL_END = pd.Timestamp("2025-06-30 23:00", tz="Europe/Oslo")

y_train = target[target.index <= TRAIN_END]
y_val = target[(target.index > TRAIN_END) & (target.index <= VAL_END)]
y_test = target[target.index > VAL_END]

# Also split the full DataFrame for features
df_train = df[df.index <= TRAIN_END]
df_val = df[(df.index > TRAIN_END) & (df.index <= VAL_END)]
df_test = df[df.index > VAL_END]

print(f"Training:   {len(y_train):>6,} hours  ({y_train.index.min().date()} to {y_train.index.max().date()})")
print(f"Validation: {len(y_val):>6,} hours  ({y_val.index.min().date()} to {y_val.index.max().date()})")
print(f"Test:       {len(y_test):>6,} hours  ({y_test.index.min().date()} to {y_test.index.max().date()})")
print(f"\nTrain mean: {y_train.mean():.1f} EUR/MWh, std: {y_train.std():.1f}")
print(f"Val mean:   {y_val.mean():.1f} EUR/MWh, std: {y_val.std():.1f}")
print(f"Test mean:  {y_test.mean():.1f} EUR/MWh, std: {y_test.std():.1f}")

In [None]:
# Visualize the full price series with split boundaries
fig, ax = plt.subplots(figsize=(16, 5))

ax.plot(y_train.index, y_train, color="steelblue", linewidth=0.5, label="Train")
ax.plot(y_val.index, y_val, color="darkorange", linewidth=0.5, label="Validation")
ax.plot(y_test.index, y_test, color="green", linewidth=0.5, label="Test")

ax.axvline(TRAIN_END, color="red", linestyle="--", alpha=0.7, label="Train/Val split")
ax.axvline(VAL_END, color="red", linestyle="--", alpha=0.7, label="Val/Test split")

ax.set_xlabel("Date")
ax.set_ylabel("EUR/MWh")
ax.set_title(f"Day-Ahead Electricity Price — {ZONE} (Bergen)")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 1. Feature Overview

Our feature matrix uses **fundamental features only** — no price lags. This is deliberate.

**Why no price lags?** Features like `price_lag_24h` and `price_rolling_168h_mean` let the model
learn "price ≈ yesterday's price" — a shortcut with r > 0.95 correlation to the target. While
this gives low MAE on paper, the model isn't learning *why* prices move. By removing these,
the model must learn from the **physical factors that actually drive electricity price**:
production levels, consumption patterns, reservoir levels, gas prices, weather, and calendar effects.

| Category | Features | Examples |
|----------|----------|----------|
| Calendar | 7 | hour_of_day, day_of_week, is_weekend, is_holiday |
| Weather | ~5 | temperature, wind_speed, precipitation |
| Commodities | 5 | ttf_gas_close, brent_oil_close, ng_fut_close |
| Reservoir | 5 | reservoir_filling_pct, reservoir_vs_median |
| FX | 1 | eur_nok |
| ENTSO-E (load/gen) | ~15 | actual_load, generation_hydro, net_export |
| Internal flows | ~6 | flow_from_no1, flow_from_no2, net_internal_flow |
| Statnett | ~4 | net_exchange_mwh, production_mwh, consumption_mwh |

**About 48 fundamental features** by the category counts above (the next cell prints the exact number). Each one represents a real physical or economic driver.
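
To make the "no price lags" rule concrete, here is a minimal sketch of how price-derived columns could be filtered out of a feature frame. The prefixes and the helper name `drop_price_derived` are assumptions for illustration; the project's `prepare_ml_features` may apply a different rule.

```python
import pandas as pd

# Hypothetical name patterns for price-derived columns (assumed, not taken
# from the project's prepare_ml_features).
PRICE_DERIVED_PREFIXES = ("price_lag_", "price_rolling_", "price_diff_", "price_nok_")
TARGET_COL = "price_eur_mwh"


def drop_price_derived(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.Series]:
    """Split a feature frame into fundamentals-only X and the target y."""
    leaky = [c for c in df.columns if c.startswith(PRICE_DERIVED_PREFIXES)]
    X = df.drop(columns=leaky + [TARGET_COL])
    y = df[TARGET_COL]
    return X, y


# Tiny made-up frame, only to show which columns survive the filter.
demo = pd.DataFrame({
    "price_eur_mwh": [50.0, 55.0, 60.0],
    "price_lag_24h": [48.0, 50.0, 55.0],
    "price_rolling_168h_mean": [49.0, 51.0, 53.0],
    "price_nok_kwh": [0.58, 0.64, 0.70],
    "temperature": [2.0, -1.0, -4.0],
    "reservoir_filling_pct": [71.0, 70.8, 70.5],
})
X_demo, y_demo = drop_price_derived(demo)
print(list(X_demo.columns))
```

Only `temperature` and `reservoir_filling_pct` remain in `X_demo`, so nothing derived from the target can reach the model.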

In [None]:
# Feature overview: grouped by category
X_train_full, y_train_full = prepare_ml_features(df_train)
X_val_full, y_val_full = prepare_ml_features(df_val)

print(f"Feature matrix: {X_train_full.shape[1]} features (after dropping NOK + price lag columns)")
print(f"Training: {len(X_train_full):,} samples")
print(f"Validation: {len(X_val_full):,} samples")

# Group features by category
categories = {
    "Calendar": [c for c in X_train_full.columns if c in [
        "hour_of_day", "day_of_week", "month", "week_of_year",
        "is_weekend", "is_holiday", "is_business_hour"]],
    "Weather": [c for c in X_train_full.columns if c in [
        "temperature", "wind_speed", "precipitation",
        "temperature_lag_24h", "temperature_rolling_24h_mean"]],
    "Commodities": [c for c in X_train_full.columns if any(
        c.startswith(p) for p in ["ttf_", "brent_", "coal_", "ng_"])],
    "Reservoir": [c for c in X_train_full.columns if "reservoir" in c],
    "FX": [c for c in X_train_full.columns if c == "eur_nok"],
    "ENTSO-E": [c for c in X_train_full.columns if any(
        c.startswith(p) for p in ["actual_", "load_", "generation_",
                                   "hydro_", "wind_share", "total_net",
                                   "n_cables"])],
    "Internal Flows": [c for c in X_train_full.columns if any(
        c.startswith(p) for p in ["flow_from_", "total_internal_",
                                   "net_internal_"])],
    "Statnett": [c for c in X_train_full.columns if any(
        c.startswith(p) for p in ["net_exchange", "production_", "consumption_", "net_balance"])],
}

# Also catch any uncategorized
all_categorized = set()
for cols in categories.values():
    all_categorized.update(cols)
uncategorized = [c for c in X_train_full.columns if c not in all_categorized]
if uncategorized:
    categories["Other"] = uncategorized

print("\nFeatures by category (fundamentals only — no price lags):")
for cat, cols in categories.items():
    if cols:
        print(f"  {cat} ({len(cols)}): {', '.join(cols[:5])}{'...' if len(cols) > 5 else ''}")

# Missing data summary
missing_pct = X_train_full.isna().mean() * 100
cols_with_missing = missing_pct[missing_pct > 0].sort_values(ascending=False)
if len(cols_with_missing) > 0:
    print(f"\nColumns with missing data ({len(cols_with_missing)}):")
    for col, pct in cols_with_missing.head(10).items():
        print(f"  {col}: {pct:.1f}%")
else:
    print("\nNo missing data in training features.")

In [None]:
# Correlation heatmap: top features vs price
corr_with_price = X_train_full.corrwith(y_train_full).abs().sort_values(ascending=False)
top_corr = corr_with_price.head(20)

fig, ax = plt.subplots(figsize=(10, 8))
top_corr.plot(kind="barh", ax=ax, color="steelblue", alpha=0.8)
ax.set_xlabel("Absolute Correlation with price_eur_mwh")
ax.set_title(f"Top 20 Features Correlated with Price — {ZONE}")
ax.invert_yaxis()
ax.grid(True, alpha=0.3, axis="x")
plt.tight_layout()
plt.show()

print("Top 10 correlated features:")
for feat, corr in top_corr.head(10).items():
    print(f"  {feat}: {corr:.3f}")

## 2. Naive Baseline

The simplest possible forecast: predict that each hour's price equals the same hour from last week. This captures both daily and weekly patterns with zero modeling effort.

**Every model below must beat this.** A model that can't beat `shift(168)` is adding complexity without value.
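
For intuition, the baseline and the comparison against it can be sketched in a few lines of pandas. This version looks up the actual price 168 hours earlier for every hour; whether `NaiveForecaster` does the same or repeats the last training week is not shown in this notebook. The skill score uses the common definition `1 - MAE_model / MAE_naive`, which is an assumption about what `compute_metrics` reports. The price series is synthetic.

```python
import numpy as np
import pandas as pd


def seasonal_naive(series: pd.Series, lag: int = 168) -> pd.Series:
    """Forecast each hour with the value observed `lag` hours earlier."""
    return series.shift(lag)


def mae(y_true: pd.Series, y_pred: pd.Series) -> float:
    """Mean absolute error over the hours where both series are defined."""
    return float((y_true - y_pred).dropna().abs().mean())


def skill_score(mae_model: float, mae_naive: float) -> float:
    """Positive when the model beats the naive baseline, 0 when they tie."""
    return 1.0 - mae_model / mae_naive


# Synthetic prices: daily cycle, weekly cycle, and a slow upward drift.
idx = pd.date_range("2025-01-01", periods=24 * 21, freq="h")
t = np.arange(len(idx))
prices = pd.Series(
    50 + 10 * np.sin(2 * np.pi * t / 24) + 5 * np.sin(2 * np.pi * t / 168) + 0.01 * t,
    index=idx,
)

naive_demo = seasonal_naive(prices)
naive_mae = mae(prices, naive_demo)
print(f"Naive MAE: {naive_mae:.2f}")
print(f"Skill of a model with half that MAE: {skill_score(naive_mae / 2, naive_mae):.2f}")
```

Because both cycles repeat exactly every 168 hours, the only error left in this toy series is the drift accumulated over one week.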

In [None]:
# We'll collect results for all methods here
all_results = []
all_forecasts = {}

# Naive baseline: same hour last week
naive = NaiveForecaster(name="Naive (same hour last week)", horizon=len(y_val), frequency="h", lag=168)
naive.fit(y_train)
naive_pred = naive.predict(steps=len(y_val))
naive_pred.index = y_val.index

naive_metrics = compute_metrics(y_val, naive_pred)
all_results.append({"name": "Naive (same hour last week)", "metrics": naive_metrics, "fit_time": naive.fit_time_seconds})
all_forecasts["Naive"] = naive_pred

print("Naive Baseline Results:")
for k, v in naive_metrics.items():
    print(f"  {k}: {v}")

# Convert to NOK/kWh for context
latest_eur_nok = df["eur_nok"].dropna().iloc[-1]
print(f"\nIn NOK/kWh (EUR/NOK = {latest_eur_nok:.2f}):")
print(f"  MAE: {naive_metrics['mae'] * latest_eur_nok / 1000:.3f} NOK/kWh")

## 3. Statistical Baselines (Summary)

For reference, here is a statistical baseline of the kind used in Phase 3.1. These methods are limited to price history plus at most one or two exogenous regressors (temperature or gas), so they can't use the full feature matrix.

**Key limitation:** The SARIMA model below uses only past prices. It sees no weather, reservoir, or load data, whereas the ML models use every feature in the matrix.

In [None]:
# Quick SARIMA baseline for reference (uses only price history)
# We run just one statistical method as a reference point
from src.models.forecasters import SARIMAXForecaster

sarima = SARIMAXForecaster(
    name="SARIMA (m=24)", horizon=len(y_val), frequency="h",
    seasonal_period=24, max_train_size=4000,
)
sarima.fit(y_train)
sarima_pred = sarima.predict(steps=len(y_val))
sarima_pred.index = y_val.index

sarima_metrics = compute_metrics(y_val, sarima_pred, naive_pred=naive_pred)
all_results.append({"name": "SARIMA (m=24)", "metrics": sarima_metrics, "fit_time": sarima.fit_time_seconds})
all_forecasts["SARIMA"] = sarima_pred

print("Statistical baseline — SARIMA:")
print(f"  MAE: {sarima_metrics['mae']:.3f} EUR/MWh")
print(f"  Skill score vs naive: {sarima_metrics.get('skill_score', 'N/A')}")
print(f"\n  Note: SARIMA uses ONLY price history — no weather, no gas prices, no reservoir data.")
print(f"  ML models below will use all {X_train_full.shape[1]} features.")

## 4. XGBoost

**XGBoost (eXtreme Gradient Boosting)** builds an ensemble of decision trees sequentially, where each new tree corrects the errors of the previous ones.

**How it works:**
1. Start with a simple prediction (e.g. mean price)
2. Calculate the residuals (errors)
3. Train a small decision tree to predict those residuals
4. Add the tree's predictions (scaled by learning rate) to the running total
5. Repeat for up to 1000 iterations, stopping early if the validation error stops improving

**Why it's good for electricity prices with fundamental features:**
- Handles nonlinear relationships (e.g. temperature below 0°C has outsized effect on price)
- Captures feature interactions (e.g. low wind + high gas price = spike)
- Learns from the real price drivers: load, generation, reservoir levels, gas prices, weather
- Built-in feature importance tells us which fundamental drivers matter most
- Robust to missing data and different feature scales
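
The five steps above can be sketched with plain NumPy. This is a toy version under simplifying assumptions: squared-error loss, a single feature, depth-1 trees (stumps), and no early stopping or regularization. It is not how XGBoost is implemented internally, and the data is synthetic.

```python
import numpy as np


def fit_stump(x: np.ndarray, residual: np.ndarray):
    """Find the threshold on one feature that minimizes squared error."""
    best = (np.inf, None, 0.0, 0.0)
    for thr in np.unique(x)[:-1]:          # skip the max so the right side is never empty
        left, right = residual[x <= thr], residual[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]                        # (threshold, left value, right value)


def predict_stump(stump, x: np.ndarray) -> np.ndarray:
    thr, left_val, right_val = stump
    return np.where(x <= thr, left_val, right_val)


def boost(x: np.ndarray, y: np.ndarray, n_rounds: int = 100, learning_rate: float = 0.1):
    base = y.mean()                                        # step 1: start from the mean
    pred = np.full(len(y), base, dtype=float)
    stumps = []
    for _ in range(n_rounds):                              # step 5: repeat
        residual = y - pred                                # step 2: current errors
        stump = fit_stump(x, residual)                     # step 3: small tree on residuals
        pred += learning_rate * predict_stump(stump, x)    # step 4: scaled update
        stumps.append(stump)
    return base, stumps, pred


# Synthetic data: price rises as temperature drops below 0 degrees C.
rng = np.random.default_rng(0)
temperature = rng.uniform(-15, 20, size=300)
price = 40 + np.where(temperature < 0, -2.0 * temperature, 0.0) + rng.normal(0, 1, 300)

base, stumps, fitted = boost(temperature, price)
mae_start = np.abs(price - base).mean()
mae_end = np.abs(price - fitted).mean()
print(f"Training MAE: {mae_start:.2f} (mean only) -> {mae_end:.2f} (after boosting)")
```

Each stump is a weak model on its own; the accuracy comes from summing many small corrections, which is also why the learning rate and the number of rounds trade off against each other.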

In [None]:
%%time

# Train XGBoost with early stopping on validation set
xgb_model = MLPriceForecaster("xgboost")
xgb_model.fit(X_train_full, y_train_full, X_val_full, y_val_full)
xgb_pred = xgb_model.predict(X_val_full)

xgb_metrics = compute_metrics(y_val_full, xgb_pred, naive_pred=naive_pred)
all_results.append({"name": "XGBoost", "metrics": xgb_metrics, "fit_time": xgb_model.fit_time_seconds})
all_forecasts["XGBoost"] = xgb_pred

print(f"XGBoost Results:")
for k, v in xgb_metrics.items():
    print(f"  {k}: {v}")
print(f"\n  Fit time: {xgb_model.fit_time_seconds:.1f}s")
print(f"  vs Naive: {'BETTER' if xgb_metrics['mae'] < naive_metrics['mae'] else 'WORSE'} "
      f"(skill_score: {xgb_metrics.get('skill_score', 'N/A')})")

In [None]:
# XGBoost feature importance (top 20)
xgb_importance = xgb_model.feature_importance()

fig, ax = plt.subplots(figsize=(10, 8))
xgb_importance.head(20).plot(kind="barh", ax=ax, color="steelblue", alpha=0.8)
ax.set_xlabel("Feature Importance (Gain)")
ax.set_title("XGBoost — Top 20 Features")
ax.invert_yaxis()
ax.grid(True, alpha=0.3, axis="x")
plt.tight_layout()
plt.show()

print("Top 10 features by importance:")
for i, (feat, imp) in enumerate(xgb_importance.head(10).items(), 1):
    print(f"  {i}. {feat}: {imp:.4f}")

## 5. LightGBM

**LightGBM** is Microsoft's gradient boosting framework. Key differences from XGBoost:

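- **Leaf-wise growth** (vs XGBoost's default depth-wise growth): always splits the leaf that reduces error most, which tends to reach a given training loss with fewer leaves but can overfit unless `num_leaves` or depth is limited
- **Histogram-based splitting**: bins continuous features into discrete buckets for faster computation
- **Often faster to train** than XGBoost, particularly against XGBoost's exact split finder; the gap narrows when XGBoost uses `tree_method="hist"`
- **Native categorical support** (though we use numerical features here)

In practice, LightGBM and XGBoost usually reach similar accuracy on data like this. The main difference is training speed, which the cell below measures.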
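
Histogram-based splitting is easier to see in code. The sketch below bins one feature into quantile buckets, accumulates residual sums per bucket in a single pass, and then evaluates candidate splits only at bucket edges. It is a simplified illustration for squared-error loss; LightGBM's actual implementation works on gradients and Hessians and differs in many details. The data is synthetic.

```python
import numpy as np


def best_histogram_split(x: np.ndarray, residual: np.ndarray, n_bins: int = 16):
    """Return (threshold, score) of the best split 'x <= threshold' from binned statistics."""
    # Interior bin edges from quantiles, so each bin holds roughly equal counts.
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, x, side="left")   # bin k holds edges[k-1] < x <= edges[k]

    # One pass over the rows builds the histogram of residual sums and counts.
    res_sum = np.bincount(bins, weights=residual, minlength=n_bins)
    count = np.bincount(bins, minlength=n_bins)

    # Candidate splits sit only at bin edges: n_bins - 1 candidates, not one per row.
    left_sum, left_cnt = np.cumsum(res_sum)[:-1], np.cumsum(count)[:-1]
    right_sum, right_cnt = res_sum.sum() - left_sum, count.sum() - left_cnt

    score = np.full(n_bins - 1, -np.inf)
    ok = (left_cnt > 0) & (right_cnt > 0)
    # For squared error, maximizing S_L^2/n_L + S_R^2/n_R minimizes the post-split SSE.
    score[ok] = left_sum[ok] ** 2 / left_cnt[ok] + right_sum[ok] ** 2 / right_cnt[ok]
    k = int(np.argmax(score))
    return float(edges[k]), float(score[k])


# Synthetic residuals that flip sign at 0 degrees C.
rng = np.random.default_rng(1)
temp = rng.uniform(-15, 20, 1000)
resid = np.where(temp < 0, 10.0, -10.0) + rng.normal(0, 2, 1000)

threshold, _ = best_histogram_split(temp, resid)
print(f"Best split: temperature <= {threshold:.2f}")
```

The chosen threshold can only be as precise as the bin edges allow, which is the price paid for scanning 15 candidates instead of 1,000.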

In [None]:
%%time

# Train LightGBM with early stopping
lgbm_model = MLPriceForecaster("lightgbm")
lgbm_model.fit(X_train_full, y_train_full, X_val_full, y_val_full)
lgbm_pred = lgbm_model.predict(X_val_full)

lgbm_metrics = compute_metrics(y_val_full, lgbm_pred, naive_pred=naive_pred)
all_results.append({"name": "LightGBM", "metrics": lgbm_metrics, "fit_time": lgbm_model.fit_time_seconds})
all_forecasts["LightGBM"] = lgbm_pred

print(f"LightGBM Results:")
for k, v in lgbm_metrics.items():
    print(f"  {k}: {v}")
print(f"\n  Fit time: {lgbm_model.fit_time_seconds:.1f}s")
print(f"  vs XGBoost: {'BETTER' if lgbm_metrics['mae'] < xgb_metrics['mae'] else 'WORSE or EQUAL'} "
      f"({lgbm_model.fit_time_seconds:.1f}s vs {xgb_model.fit_time_seconds:.1f}s)")

## 6. CatBoost

**CatBoost** (Yandex) is designed for minimal tuning and handles categorical features natively.

Key features:
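- **Ordered boosting**: computes each sample's residual with a model trained only on samples that precede it in a random permutation, which reduces target leakage and overfitting (especially on small datasets)
- **Symmetric (oblivious) trees**: every node at a given depth uses the same split, so a tree of depth d has exactly 2^d leaves
- **Ordered target statistics**: encodes categorical features using target statistics computed on preceding samples only (not exercised here, since our features are numerical)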
- **Typically needs less hyperparameter tuning** than XGBoost/LightGBM

For tabular data with mixed feature types, CatBoost often works well out of the box.
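
A symmetric tree is simple enough to evaluate by hand: because every level applies one shared comparison, the leaf index is just the bit pattern of the comparison results. The sketch below is an illustration of that idea, not CatBoost's code, and the thresholds and leaf values are made up.

```python
import numpy as np


def predict_oblivious(X: np.ndarray, feature_idx, thresholds, leaf_values) -> np.ndarray:
    """Evaluate one symmetric tree: level k contributes bit k of the leaf index."""
    leaf = np.zeros(len(X), dtype=int)
    for level, (f, thr) in enumerate(zip(feature_idx, thresholds)):
        leaf |= (X[:, f] > thr).astype(int) << level
    return np.asarray(leaf_values, dtype=float)[leaf]


# Depth-2 toy tree. Column 0 = temperature (deg C), column 1 = reservoir filling (%).
feature_idx = [0, 1]
thresholds = [0.0, 60.0]
# Leaf index = (temperature > 0) + 2 * (reservoir > 60); values are illustrative only.
leaf_values = [90.0, 60.0, 70.0, 40.0]

X_toy = np.array([
    [-5.0, 50.0],   # cold, low reservoir  -> leaf 0
    [10.0, 50.0],   # warm, low reservoir  -> leaf 1
    [-5.0, 80.0],   # cold, high reservoir -> leaf 2
    [10.0, 80.0],   # warm, high reservoir -> leaf 3
])
preds = predict_oblivious(X_toy, feature_idx, thresholds, leaf_values)
print(preds)
```

The shared split per level makes the tree a lookup table, which keeps prediction fast and acts as a mild regularizer.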

In [None]:
%%time

# Train CatBoost with early stopping
cat_model = MLPriceForecaster("catboost")
cat_model.fit(X_train_full, y_train_full, X_val_full, y_val_full)
cat_pred = cat_model.predict(X_val_full)

cat_metrics = compute_metrics(y_val_full, cat_pred, naive_pred=naive_pred)
all_results.append({"name": "CatBoost", "metrics": cat_metrics, "fit_time": cat_model.fit_time_seconds})
all_forecasts["CatBoost"] = cat_pred

print(f"CatBoost Results:")
for k, v in cat_metrics.items():
    print(f"  {k}: {v}")
print(f"\n  Fit time: {cat_model.fit_time_seconds:.1f}s")

# Quick comparison
print(f"\nML model comparison (MAE EUR/MWh):")
print(f"  XGBoost:  {xgb_metrics['mae']:.3f}")
print(f"  LightGBM: {lgbm_metrics['mae']:.3f}")
print(f"  CatBoost: {cat_metrics['mae']:.3f}")
print(f"  Naive:    {naive_metrics['mae']:.3f}")

## 7. Ensemble

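**Why ensembles work:** Different models make different mistakes. When their predictions are averaged, individual errors tend to cancel out, provided those errors are not strongly correlated.

We build two ensembles:
1. **Simple average** — equal weight to each model
2. **Inverse-MAE weighted** — better models get more weight

The MAE of a weighted average is never worse than the weighted average of the member MAEs, so it cannot fall below the worst member, and in practice it usually lands close to or better than the best one. The cell below checks this on the validation set. Note that the weights come from the same validation MAEs the ensemble is then scored on, so the weighted result is slightly optimistic.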
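
A small simulation shows when averaging helps. The numbers are synthetic and chosen only for illustration: two "models" with independent errors are averaged, and then a model is averaged with a copy that makes the same mistakes.

```python
import numpy as np


def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return float(np.mean(np.abs(y_true - y_pred)))


rng = np.random.default_rng(42)
truth = rng.normal(50, 15, 5000)          # synthetic "prices"
err_a = rng.normal(0, 8, 5000)            # model A's errors
err_b = rng.normal(0, 8, 5000)            # model B's errors, independent of A

pred_a = truth + err_a
pred_b = truth + err_b
pred_c = truth + err_a                    # model C makes the same mistakes as A

mae_a, mae_b = mae(truth, pred_a), mae(truth, pred_b)
mae_indep = mae(truth, (pred_a + pred_b) / 2)   # independent errors partly cancel
mae_same = mae(truth, (pred_a + pred_c) / 2)    # identical errors do not cancel

print(f"A: {mae_a:.2f}  B: {mae_b:.2f}")
print(f"Average of A and B (independent errors): {mae_indep:.2f}")
print(f"Average of A and C (identical errors):   {mae_same:.2f}")
```

Three boosted-tree models trained on the same features sit somewhere between these two extremes, so the gain from averaging them is typically modest.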

In [None]:
%%time

# Build ensemble using all 3 models (already trained above)
# Compute inverse-MAE weights
model_maes = {
    "xgboost": xgb_metrics["mae"],
    "lightgbm": lgbm_metrics["mae"],
    "catboost": cat_metrics["mae"],
}
inv_maes = {k: 1.0 / v for k, v in model_maes.items()}
total_inv = sum(inv_maes.values())
weights = {k: v / total_inv for k, v in inv_maes.items()}

print("Ensemble weights (inverse-MAE):")
for k, w in weights.items():
    print(f"  {k}: {w:.3f} (MAE: {model_maes[k]:.3f})")

# Weighted ensemble
ensemble_pred = (
    weights["xgboost"] * xgb_pred
    + weights["lightgbm"] * lgbm_pred
    + weights["catboost"] * cat_pred
)
ensemble_pred.name = "Ensemble (weighted)"

# Simple average
simple_avg_pred = (xgb_pred + lgbm_pred + cat_pred) / 3
simple_avg_pred.name = "Ensemble (simple avg)"

# Evaluate both
ens_w_metrics = compute_metrics(y_val_full, ensemble_pred, naive_pred=naive_pred)
ens_s_metrics = compute_metrics(y_val_full, simple_avg_pred, naive_pred=naive_pred)

all_results.append({"name": "Ensemble (weighted)", "metrics": ens_w_metrics, "fit_time": 0})
all_results.append({"name": "Ensemble (simple avg)", "metrics": ens_s_metrics, "fit_time": 0})
all_forecasts["Ensemble (weighted)"] = ensemble_pred
all_forecasts["Ensemble (simple avg)"] = simple_avg_pred

print(f"\nEnsemble Results:")
print(f"  Weighted:   MAE={ens_w_metrics['mae']:.3f}, skill_score={ens_w_metrics.get('skill_score', 'N/A')}")
print(f"  Simple avg: MAE={ens_s_metrics['mae']:.3f}, skill_score={ens_s_metrics.get('skill_score', 'N/A')}")
print(f"  Best individual: {min(model_maes, key=model_maes.get)} MAE={min(model_maes.values()):.3f}")
print(f"\n  Ensemble {'beats' if ens_w_metrics['mae'] <= min(model_maes.values()) else 'does not beat'} best individual model")

In [None]:
# Forecast overlay: first 2 weeks of validation
two_weeks = y_val_full.index[:336]

fig, ax = plt.subplots(figsize=(16, 6))
ax.plot(y_val_full.loc[two_weeks].index, y_val_full.loc[two_weeks],
        color="black", linewidth=1.2, label="Actual", zorder=5)
ax.plot(xgb_pred.loc[two_weeks].index, xgb_pred.loc[two_weeks],
        color="tab:blue", linewidth=0.8, alpha=0.7, label=f"XGBoost (MAE={xgb_metrics['mae']:.1f})")
ax.plot(lgbm_pred.loc[two_weeks].index, lgbm_pred.loc[two_weeks],
        color="tab:orange", linewidth=0.8, alpha=0.7, label=f"LightGBM (MAE={lgbm_metrics['mae']:.1f})")
ax.plot(cat_pred.loc[two_weeks].index, cat_pred.loc[two_weeks],
        color="tab:green", linewidth=0.8, alpha=0.7, label=f"CatBoost (MAE={cat_metrics['mae']:.1f})")
ax.plot(ensemble_pred.loc[two_weeks].index, ensemble_pred.loc[two_weeks],
        color="tab:red", linewidth=1.0, alpha=0.9, label=f"Ensemble (MAE={ens_w_metrics['mae']:.1f})")

ax.set_xlabel("Date")
ax.set_ylabel("EUR/MWh")
ax.set_title(f"ML Model Predictions — First 2 Weeks of Validation — {ZONE}")
ax.legend(loc="upper right", fontsize=9)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 8. Walk-Forward Validation

**Why walk-forward?** A single train/val split can be misleading — the model might perform well in January but poorly in July. Walk-forward validation simulates real deployment:

```
Fold 1: Train [2022-01 → 2025-01] | Val [2025-01 → 2025-02]
Fold 2: Train [2022-01 → 2025-02] | Val [2025-02 → 2025-03]
Fold 3: Train [2022-01 → 2025-03] | Val [2025-03 → 2025-04]
...expanding window, always predicting unseen future data
```

This tells us: **is the model consistently good, or did we get lucky with one split?**

**Important:** We never use random k-fold cross-validation for time series — it leaks future information into training.
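
The fold layout can be sketched as a small generator. This is an assumption about how the folds are placed (the last `n_splits * val_size_hours` hours are cut into consecutive validation blocks); the project's `walk_forward_validate` may position them differently. The name `expanding_window_folds` is hypothetical.

```python
import pandas as pd


def expanding_window_folds(index: pd.DatetimeIndex, n_splits: int = 6, val_size_hours: int = 720):
    """Yield (train_index, val_index) pairs; training always ends where validation starts."""
    first_val_start = len(index) - n_splits * val_size_hours
    if first_val_start <= 0:
        raise ValueError("Not enough data for the requested folds")
    for fold in range(n_splits):
        val_start = first_val_start + fold * val_size_hours
        val_end = val_start + val_size_hours
        yield index[:val_start], index[val_start:val_end]


# Hourly index covering the notebook's train + validation period.
idx = pd.date_range("2022-01-01", "2025-06-30 23:00", freq="h")
folds = list(expanding_window_folds(idx))

for i, (train_idx, val_idx) in enumerate(folds, 1):
    print(f"Fold {i}: train {len(train_idx):,} h up to {train_idx[-1].date()} | "
          f"val {val_idx[0].date()} to {val_idx[-1].date()}")
```

Every training window ends strictly before its validation block begins, which is the property random k-fold would break.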

In [None]:
%%time

# Walk-forward validation with expanding window
# Use the combined train+val data for walk-forward
df_walkforward = pd.concat([df_train, df_val])

# Use the best-performing individual model type
best_model_type = min(model_maes, key=model_maes.get)
print(f"Walk-forward validation using: {best_model_type}")
print(f"Data: {len(df_walkforward):,} hours")
print(f"Configuration: 6 folds, ~720 hours (~1 month) each\n")

wf_results = walk_forward_validate(
    df_walkforward,
    model_type=best_model_type,
    n_splits=6,
    val_size_hours=720,
    target_col="price_eur_mwh",
)

# Display per-fold metrics
wf_rows = []
for r in wf_results:
    row = {
        "Fold": r["fold"],
        "Train Size": f"{r['train_size']:,}",
        "Val Period": f"{r['val_start'].strftime('%Y-%m-%d')} → {r['val_end'].strftime('%Y-%m-%d')}",
        "MAE": r["metrics"].get("mae", np.nan),
        "RMSE": r["metrics"].get("rmse", np.nan),
        "Skill Score": r["metrics"].get("skill_score", np.nan),
        "Dir. Acc.": r["metrics"].get("directional_accuracy", np.nan),
        "Fit Time (s)": r["fit_time"],
    }
    wf_rows.append(row)

wf_df = pd.DataFrame(wf_rows)
display(wf_df)

# Summary statistics
mae_values = [r["metrics"]["mae"] for r in wf_results if "mae" in r["metrics"]]
print(f"\nWalk-forward MAE summary:")
print(f"  Mean: {np.mean(mae_values):.3f} EUR/MWh")
print(f"  Std:  {np.std(mae_values):.3f} EUR/MWh")
print(f"  Min:  {np.min(mae_values):.3f} (best fold)")
print(f"  Max:  {np.max(mae_values):.3f} (worst fold)")

In [None]:
# Walk-forward stability plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# MAE per fold
ax = axes[0]
folds = [r["fold"] for r in wf_results]
maes = [r["metrics"]["mae"] for r in wf_results]
skills = [r["metrics"].get("skill_score", 0) for r in wf_results]

ax.bar(folds, maes, color="steelblue", alpha=0.8)
ax.axhline(np.mean(maes), color="red", linestyle="--", label=f"Mean: {np.mean(maes):.1f}")
ax.set_xlabel("Fold")
ax.set_ylabel("MAE (EUR/MWh)")
ax.set_title("Walk-Forward MAE by Fold")
ax.legend()
ax.grid(True, alpha=0.3)

# Skill score per fold
ax = axes[1]
colors = ["green" if s > 0 else "red" for s in skills]
ax.bar(folds, skills, color=colors, alpha=0.8)
ax.axhline(0, color="black", linestyle="-", linewidth=0.5)
ax.set_xlabel("Fold")
ax.set_ylabel("Skill Score vs Naive")
ax.set_title("Walk-Forward Skill Score by Fold")
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 9. SHAP Analysis

**SHAP (SHapley Additive exPlanations)** goes beyond simple feature importance. For each prediction, SHAP shows **how much each feature pushed the prediction up or down** from the baseline.

This is critical for understanding:
- Does the model make **physical sense**? (e.g., high temperature in winter should decrease price via less heating demand)
- Are there **surprising relationships**? (features we didn't expect to matter)
- Is the model **trustworthy** for deployment?
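
The additive property behind SHAP can be checked by hand on a toy model. The sketch below computes exact Shapley values by enumerating feature subsets, replacing "missing" features with a single baseline row. That is a simplification: `shap.TreeExplainer` uses a tree-specific algorithm rather than brute-force enumeration. The price function and all numbers are made up.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a subset take their baseline value."""
    n = len(x)

    def value(subset):
        return predict([x[i] if i in subset else baseline[i] for i in range(n)])

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_price(z):
    """Made-up price model with a cold-weather x gas-price interaction."""
    temperature, gas, reservoir = z
    cold = max(0.0, -temperature)
    return 30.0 + 0.5 * gas + 2.0 * cold - 0.2 * reservoir + 0.05 * cold * gas


x = [-10.0, 40.0, 50.0]          # the hour to explain
baseline = [5.0, 30.0, 70.0]     # a "typical" hour
phi = shapley_values(toy_price, x, baseline)
base_value = toy_price(baseline)

for name, p in zip(["temperature", "gas", "reservoir"], phi):
    print(f"{name:>12}: {p:+.2f}")
print(f"base {base_value:.2f} + contributions {sum(phi):.2f} = prediction {toy_price(x):.2f}")
```

The contributions always sum to the gap between the prediction and the base value, and the interaction term is shared between temperature and gas rather than assigned to one of them.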

In [None]:
%%time

import shap

# Use the XGBoost model for SHAP (best native support)
print("Computing SHAP values for XGBoost model...")
print(f"Using {len(X_val_full)} validation samples\n")

# Use the first 2,000 validation rows for speed (a contiguous block, not a random sample)
shap_sample_size = min(2000, len(X_val_full))
X_shap = X_val_full.ffill().bfill().fillna(0).iloc[:shap_sample_size]

explainer = shap.TreeExplainer(xgb_model.model_)
shap_values = explainer.shap_values(X_shap)

print(f"SHAP values computed: {shap_values.shape}")
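# expected_value can be a scalar or a length-1 array depending on the shap version
print(f"Base value (expected prediction): {float(np.ravel(explainer.expected_value)[0]):.2f} EUR/MWh")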

In [None]:
# SHAP summary plot — global feature importance with direction
# Each dot is one prediction. Position shows SHAP value (impact on output).
# Color shows feature value (red=high, blue=low).
plt.figure(figsize=(10, 10))
shap.summary_plot(shap_values, X_shap, max_display=20, show=False)
plt.title(f"SHAP Feature Importance — XGBoost — {ZONE}", fontsize=13)
plt.tight_layout()
plt.show()

# Mean absolute SHAP values (global importance)
mean_abs_shap = pd.Series(
    np.abs(shap_values).mean(axis=0),
    index=X_shap.columns,
).sort_values(ascending=False)

print("\nTop 10 features by mean |SHAP value|:")
for i, (feat, val) in enumerate(mean_abs_shap.head(10).items(), 1):
    print(f"  {i}. {feat}: {val:.3f}")

In [None]:
# SHAP dependence plots for top 5 features
# Shows how each feature value affects the prediction
top_5_features = mean_abs_shap.head(5).index.tolist()

fig, axes = plt.subplots(1, 5, figsize=(20, 4))
for i, feat in enumerate(top_5_features):
    ax = axes[i]
    feat_idx = list(X_shap.columns).index(feat)
    ax.scatter(X_shap[feat].values, shap_values[:, feat_idx],
               alpha=0.3, s=5, color="steelblue")
    ax.set_xlabel(feat, fontsize=9)
    ax.set_ylabel("SHAP value" if i == 0 else "")
    ax.axhline(0, color="gray", linewidth=0.5)
    ax.grid(True, alpha=0.3)

fig.suptitle("SHAP Dependence — Top 5 Features", fontsize=13, fontweight="bold")
plt.tight_layout()
plt.show()

In [None]:
# SHAP waterfall for a single prediction (most expensive hour in the SHAP sample)
# Shows how each feature contributed to this specific prediction.
# X_shap holds only the first rows of validation, so search for the peak there;
# a nearest-match lookup on the full validation set could land on a different hour.
max_price_idx = y_val_full.loc[X_shap.index].idxmax()
sample_idx = X_shap.index.get_loc(max_price_idx)

print("Explaining the highest-price hour in the SHAP sample:")
print(f"  Timestamp: {max_price_idx}")
print(f"  Actual price: {y_val_full.loc[max_price_idx]:.1f} EUR/MWh")
print(f"  Predicted: {xgb_pred.loc[max_price_idx]:.1f} EUR/MWh\n")

shap.initjs()
explanation = shap.Explanation(
    values=shap_values[sample_idx],
    base_values=float(np.ravel(explainer.expected_value)[0]),
    data=X_shap.iloc[sample_idx].values,
    feature_names=list(X_shap.columns),
)
shap.plots.waterfall(explanation, max_display=15, show=True)

## 10. Yr Weather Forecast Integration

So far, all models used **historical weather observations** — actual temperature, wind, and precipitation that were measured at the time. For a real forecast, we need **future weather** data.

**Yr Locationforecast** (MET Norway) provides ~9 days of weather forecasts. By replacing historical weather columns with Yr forecast values, we can make **genuinely forward-looking** price predictions.
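
One way to picture the column swap is sketched below. Everything about it is an assumption for illustration: the mapping from `yr_*` columns to feature names, the helper name `build_future_features`, and especially the choice to hold all non-weather features at their last known value. How `forecast_with_yr` actually builds its inputs is not shown in this notebook.

```python
import numpy as np
import pandas as pd

# Assumed mapping from Yr columns to the model's weather feature names.
YR_TO_FEATURE = {
    "yr_temperature": "temperature",
    "yr_wind_speed": "wind_speed",
    "yr_precipitation_1h": "precipitation",
}


def build_future_features(last_features: pd.DataFrame, yr_forecast: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: repeat the last known row, then overwrite weather and calendar."""
    last_row = last_features.iloc[-1].to_numpy(dtype=float)
    future = pd.DataFrame(
        np.tile(last_row, (len(yr_forecast), 1)),
        index=yr_forecast.index,
        columns=last_features.columns,
    )
    for yr_col, feat_col in YR_TO_FEATURE.items():
        if yr_col in yr_forecast.columns and feat_col in future.columns:
            future[feat_col] = yr_forecast[yr_col]
    if "hour_of_day" in future.columns:
        future["hour_of_day"] = future.index.hour   # calendar features are known in advance
    return future


# Made-up history and forecast, only to show the mechanics.
history = pd.DataFrame(
    {"temperature": [3.0, 2.5, 2.0], "wind_speed": [4.0, 5.0, 6.0],
     "reservoir_filling_pct": [70.0, 70.0, 69.9], "hour_of_day": [21, 22, 23]},
    index=pd.date_range("2026-02-21 21:00", periods=3, freq="h"),
)
yr_demo = pd.DataFrame(
    {"yr_temperature": [1.5, 1.0, 0.5, 0.0], "yr_wind_speed": [7.0, 7.5, 8.0, 8.5]},
    index=pd.date_range("2026-02-22 00:00", periods=4, freq="h"),
)
future_demo = build_future_features(history, yr_demo)
print(future_demo)
```

Holding load, generation, and flow features constant is a strong simplification; the further ahead the forecast reaches, the more those frozen inputs limit its accuracy.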

In [None]:
from src.data.fetch_yr_forecast import fetch_yr_forecast

# Fetch Yr weather forecast for Bergen (NO_5)
yr_df = fetch_yr_forecast(ZONE, cache=True)

if not yr_df.empty:
    print(f"Yr forecast for {ZONE}:")
    print(f"  Hours: {len(yr_df)}")
    print(f"  Range: {yr_df.index.min()} to {yr_df.index.max()}")
    print(f"  Columns: {list(yr_df.columns)}")
    print(f"\nForecast summary:")
    for col in ['yr_temperature', 'yr_wind_speed', 'yr_precipitation_1h', 'yr_cloud_cover']:
        if col in yr_df.columns:
            print(f"  {col}: mean={yr_df[col].mean():.1f}, "
                  f"min={yr_df[col].min():.1f}, max={yr_df[col].max():.1f}")
    
    # Visualize the forecast
    fig, axes = plt.subplots(2, 2, figsize=(16, 8))
    
    for i, (col, title, unit) in enumerate([
        ('yr_temperature', 'Temperature Forecast', '°C'),
        ('yr_wind_speed', 'Wind Speed Forecast', 'm/s'),
        ('yr_precipitation_1h', 'Precipitation Forecast', 'mm/h'),
        ('yr_cloud_cover', 'Cloud Cover Forecast', '%'),
    ]):
        ax = axes[i // 2, i % 2]
        if col in yr_df.columns:
            ax.plot(yr_df.index, yr_df[col], color='steelblue', linewidth=1)
            ax.set_ylabel(unit)
            ax.set_title(title)
            ax.grid(True, alpha=0.3)
    
    fig.suptitle(f'Yr Weather Forecast — {ZONE} (Bergen)', fontsize=13, fontweight='bold')
    plt.tight_layout()
    plt.show()
else:
    print("Yr forecast fetch failed — check network connection.")
    print("Forward price forecast will be skipped.")

In [None]:
# Forward price forecast using Yr weather + best ML model
if not yr_df.empty:
    # Forecast with the best individual model (selected by best_model_type)
    # Pass the last 336 rows (2 weeks) so lagged/rolling features can be computed
    last_features = df.iloc[-336:]
    
    print(f"Forward forecast using {best_model_type} model + Yr weather")
    print(f"Forecast from: {yr_df.index.min()}")
    print(f"Forecast to:   {yr_df.index.max()}")
    print(f"EUR/NOK rate:  {latest_eur_nok:.4f}\n")
    
    # Pick the best individual model
    best_ml_model = {"xgboost": xgb_model, "lightgbm": lgbm_model, "catboost": cat_model}[best_model_type]
    
    forward_forecast = forecast_with_yr(
        model=best_ml_model,
        last_features=last_features,
        yr_forecast_df=yr_df,
        eur_nok=latest_eur_nok,
    )
    
    forward_daily = pd.DataFrame()  # stays empty if the forecast returns no rows
    if not forward_forecast.empty:
        print(f"Hourly forecast: {len(forward_forecast)} hours")
        print(f"  EUR/MWh: mean={forward_forecast['price_eur_mwh'].mean():.1f}, "
              f"min={forward_forecast['price_eur_mwh'].min():.1f}, "
              f"max={forward_forecast['price_eur_mwh'].max():.1f}")
        print(f"  NOK/kWh: mean={forward_forecast['price_nok_kwh'].mean():.3f}, "
              f"min={forward_forecast['price_nok_kwh'].min():.3f}, "
              f"max={forward_forecast['price_nok_kwh'].max():.3f}")
        
        # Aggregate hourly forecast to daily
        forward_daily = forward_forecast.resample("D").agg({
            "price_eur_mwh": ["mean", "min", "max"],
            "price_nok_kwh": ["mean", "min", "max"],
        })
        forward_daily.columns = [f"{col}_{agg}" for col, agg in forward_daily.columns]
        forward_daily = forward_daily[forward_daily["price_eur_mwh_mean"].notna()]
        
        print(f"\nDaily aggregation: {len(forward_daily)} days")
        display(forward_daily.round(3))
else:
    forward_forecast = pd.DataFrame()
    forward_daily = pd.DataFrame()
    print("Skipping forward forecast (no Yr data).")

## 11. Daily Price Forecast (NOK/kWh)

The forward forecast from the ML model + Yr weather, aggregated to **daily resolution** and displayed in **NOK/kWh** (the unit Norwegian consumers care about).
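
The unit conversion used throughout this notebook is simple arithmetic: multiply EUR/MWh by the EUR/NOK rate to get NOK/MWh, then divide by 1000 to get NOK/kWh. A small sketch, where the function name `eur_mwh_to_nok_kwh` and the exchange rate of 11.5 are illustrative assumptions:

```python
def eur_mwh_to_nok_kwh(price_eur_mwh: float, eur_nok: float) -> float:
    """Convert a price in EUR/MWh to NOK/kWh.

    EUR/MWh * NOK/EUR = NOK/MWh, and 1 MWh = 1000 kWh.
    """
    return price_eur_mwh * eur_nok / 1000

# 50 EUR/MWh at an assumed rate of 11.5 NOK per EUR
example_price = eur_mwh_to_nok_kwh(50.0, 11.5)
print(f"{example_price:.3f} NOK/kWh")
```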

Hourly predictions are aggregated per day showing:
- **Mean** — average price across all hours (bar height)
- **Min–Max** — cheapest and most expensive hour (shaded range)

Three horizons:
- **Day-ahead (1 day):** Most accurate — matches Nord Pool auction timeline
- **Week-ahead (7 days):** Planning horizon
- **Full Yr range (~9 days):** Maximum available forecast horizon
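
As a rough sketch of the aggregation and the horizon slicing, the snippet below uses synthetic hourly prices in place of the real forecast. The helper `summarize` and the flat column names are assumptions for illustration; the notebook's own cell builds `forward_daily` with prefixed column names.

```python
import numpy as np
import pandas as pd

# Synthetic hourly forecast covering three full days (illustrative values only).
idx = pd.date_range("2026-02-23", periods=72, freq="h")
hourly = pd.DataFrame({"price_nok_kwh": np.linspace(0.40, 1.11, 72)}, index=idx)

# One row per calendar day: mean (bar height) plus min/max (shaded range).
daily = hourly["price_nok_kwh"].resample("D").agg(["mean", "min", "max"])

def summarize(daily_table: pd.DataFrame, n_days: int) -> dict:
    """Summarize the first n_days of the daily table (fewer if not available)."""
    window = daily_table.iloc[:n_days]
    return {
        "mean": float(window["mean"].mean()),
        "min": float(window["min"].min()),
        "max": float(window["max"].max()),
    }

day_ahead = summarize(daily, 1)
week_ahead = summarize(daily, 7)  # only 3 days exist here, so this covers the full range
print(day_ahead)
print(week_ahead)
```

Note that the horizon mean is a mean of daily means, which equals the hourly mean only when every day contains the same number of hours; a partial first or last day would be weighted equally with full days.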

In [None]:
if not forward_daily.empty:
    # Daily bar chart with min–max range (NOK/kWh)
    fig = go.Figure()

    # Min–max range as shaded area
    fig.add_trace(go.Scatter(
        x=forward_daily.index,
        y=forward_daily["price_nok_kwh_max"],
        mode="lines",
        line=dict(width=0),
        showlegend=False,
        hoverinfo="skip",
    ))
    fig.add_trace(go.Scatter(
        x=forward_daily.index,
        y=forward_daily["price_nok_kwh_min"],
        mode="lines",
        line=dict(width=0),
        fill="tonexty",
        fillcolor="rgba(255, 165, 0, 0.2)",
        name="Hourly min–max range",
        hoverinfo="skip",
    ))

    # Day-ahead (first day) in red, rest in orange
    colors = ["#d62728"] + ["#ff7f0e"] * (len(forward_daily) - 1)

    fig.add_trace(go.Bar(
        x=forward_daily.index,
        y=forward_daily["price_nok_kwh_mean"],
        marker_color=colors,
        name="Daily mean price",
        text=[f"{v:.3f}" for v in forward_daily["price_nok_kwh_mean"]],
        textposition="outside",
        hovertemplate=(
            "%{x|%a %d %b}<br>"
            "Mean: %{y:.3f} NOK/kWh<br>"
            "Min: %{customdata[0]:.3f}<br>"
            "Max: %{customdata[1]:.3f}<extra></extra>"
        ),
        customdata=forward_daily[["price_nok_kwh_min", "price_nok_kwh_max"]].values,
    ))

    # Reference: last known price (use add_shape + add_annotation instead of
    # add_hline with annotation_text to avoid Plotly bug with tz-aware Timestamps)
    last_known_price_nok = target.iloc[-1] * latest_eur_nok / 1000
    fig.add_shape(
        type="line",
        x0=0, x1=1, xref="paper",
        y0=last_known_price_nok, y1=last_known_price_nok,
        line=dict(color="black", width=1, dash="dash"),
        opacity=0.5,
    )
    fig.add_annotation(
        x=0.0, xref="paper",
        y=last_known_price_nok,
        text=f"Last known: {last_known_price_nok:.3f} NOK/kWh ({target.index[-1].strftime('%Y-%m-%d')})",
        showarrow=False,
        font=dict(size=10),
        xanchor="left",
        yanchor="bottom",
    )

    fig.update_layout(
        title=f"Daily Price Forecast — {ZONE} (Bergen) — {best_model_type.upper()} + Yr Weather",
        xaxis_title="Date",
        yaxis_title="NOK/kWh",
        hovermode="x unified",
        height=500,
        showlegend=True,
        bargap=0.15,
    )
    fig.show()

    # Summary table by horizon (daily)
    n_days = len(forward_daily)
    horizons = {
        "Day-ahead (1 day)": forward_daily.iloc[:1],
        f"Week-ahead ({min(7, n_days)} days)": forward_daily.iloc[:7],
        f"Full range ({n_days} days)": forward_daily,
    }

    print(f"\nForecast summary (EUR/NOK = {latest_eur_nok:.2f}):")
    print(f"{'Horizon':<22} {'Mean NOK/kWh':>12} {'Min':>8} {'Max':>8} {'Mean EUR/MWh':>13}")
    print("-" * 67)
    for name, data in horizons.items():
        print(f"{name:<22} {data['price_nok_kwh_mean'].mean():>12.3f} "
              f"{data['price_nok_kwh_min'].min():>8.3f} {data['price_nok_kwh_max'].max():>8.3f} "
              f"{data['price_eur_mwh_mean'].mean():>13.1f}")
else:
    print("No forward forecast available (Yr data was not fetched).")

## 12. Model Comparison — Grand Table

All methods compared: Naive, SARIMA (statistical), XGBoost, LightGBM, CatBoost, and both ensembles. Metrics in EUR/MWh with NOK/kWh equivalents.
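
The ranking logic behind such a table can be sketched with plain NumPy. The helper `rank_against_naive` and the toy numbers are hypothetical and stand in for the project's `comparison_table`; the improvement figure uses the same formula as the notebook, `(1 - mae / naive_mae) * 100`.

```python
import numpy as np

def mae(y_true, y_pred) -> float:
    """Mean absolute error between two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def rank_against_naive(y_true, predictions: dict, naive_key: str = "naive") -> list:
    """Return (method, MAE, % improvement over naive) tuples, best MAE first."""
    naive_mae = mae(y_true, predictions[naive_key])
    rows = []
    for name, pred in predictions.items():
        method_mae = mae(y_true, pred)
        rows.append((name, method_mae, (1 - method_mae / naive_mae) * 100))
    return sorted(rows, key=lambda row: row[1])

# Toy data in EUR/MWh (not results from this notebook).
y_true = [40.0, 55.0, 60.0, 45.0]
predictions = {
    "naive": [50.0, 45.0, 70.0, 35.0],    # absolute errors: 10, 10, 10, 10
    "model_a": [42.0, 52.0, 63.0, 43.0],  # absolute errors: 2, 3, 3, 2
}
ranking = rank_against_naive(y_true, predictions)
for name, method_mae, improvement in ranking:
    print(f"{name:10s} MAE={method_mae:.2f}  improvement={improvement:.1f}%")
```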

In [None]:
# Build grand comparison table
comp = comparison_table(all_results)

print("=" * 90)
print(f"MODEL COMPARISON — {ZONE} (Bergen) — Validation Period")
print(f"Val period: {y_val.index.min().date()} to {y_val.index.max().date()}")
print("=" * 90)
display(comp)

# Print NOK/kWh equivalents (EUR/MWh * EUR/NOK / 1000)
print(f"\nNOK/kWh equivalent (EUR/NOK = {latest_eur_nok:.2f}):")
for _, row in comp.iterrows():
    mae_nok = row['mae'] * latest_eur_nok / 1000
    print(f"  {row['Method']:30s}: MAE = {mae_nok:.3f} NOK/kWh  ({row['mae']:.1f} EUR/MWh)")

# Highlight winners
naive_mae = naive_metrics["mae"]
print(f"\nNaive baseline MAE: {naive_mae:.1f} EUR/MWh")
beat_naive = comp[comp["mae"] < naive_mae]
if len(beat_naive) > 0:
    print(f"{len(beat_naive)} method(s) beat the naive baseline:")
    for _, row in beat_naive.iterrows():
        improvement = (1 - row['mae'] / naive_mae) * 100
        print(f"  {row['Method']}: {improvement:.1f}% better")

In [None]:
# Bar chart comparison
fig, ax = plt.subplots(figsize=(12, 5))

methods = comp["Method"].values
maes = comp["mae"].values
colors = ["tab:green" if m < naive_mae else "tab:red" for m in maes]

bars = ax.barh(range(len(methods)), maes, color=colors, alpha=0.8)
ax.set_yticks(range(len(methods)))
ax.set_yticklabels(methods)
ax.axvline(naive_mae, color="black", linestyle="--", linewidth=1.5,
           label=f"Naive baseline ({naive_mae:.1f})")
ax.set_xlabel("MAE (EUR/MWh) — lower is better")
ax.set_title(f"Forecast MAE Comparison — {ZONE}")
ax.legend()
ax.grid(True, alpha=0.3, axis="x")

for i, (method, mae) in enumerate(zip(methods, maes)):
    ax.text(mae + 0.3, i, f"{mae:.1f}", va="center", fontsize=9)

plt.tight_layout()
plt.show()

## 13. Residual Analysis (Ensemble)

A good model's residuals (errors) should look like white noise:
- **Normally distributed** — errors are random, not systematic
- **No autocorrelation** — the model captured all temporal patterns
- **No structure by hour or month** — works equally well at all times
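
One way to probe the autocorrelation criterion numerically is to compute the sample autocorrelation of the residuals at the lags where electricity prices usually show structure (24 h and 168 h). The sketch below uses synthetic residuals; the function `autocorr` is an illustrative helper, not part of the project code.

```python
import numpy as np

def autocorr(x, lag: int) -> float:
    """Sample autocorrelation of x at a positive integer lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

rng = np.random.default_rng(42)
n = 5000
hours = np.arange(n)

# Residuals that behave like white noise: no leftover daily pattern.
white = rng.normal(0.0, 1.0, n)

# Residuals with a daily cycle the model failed to capture.
cyclic = 2.0 * np.sin(2 * np.pi * hours / 24) + rng.normal(0.0, 1.0, n)

white_lag24 = autocorr(white, 24)
cyclic_lag24 = autocorr(cyclic, 24)
print(f"white noise lag-24 autocorrelation: {white_lag24:.3f}")
print(f"cyclic residuals lag-24 autocorrelation: {cyclic_lag24:.3f}")
```

A lag-24 value far from zero suggests the model is missing a daily pattern, which the hour-of-day breakdown should then help to locate.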

In [None]:
# Use ensemble predictions for residual analysis
best_ensemble_pred = ensemble_pred
residuals = (y_val_full - best_ensemble_pred).dropna()

# 3-panel diagnostic
fig = plot_residuals(y_val_full, best_ensemble_pred, method_name="Ensemble (weighted)")
plt.show()

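# Shapiro-Wilk test on a random subsample (at most 2000 points): with very
# large samples the test rejects normality for even trivial deviations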
sample_resid = residuals.sample(min(2000, len(residuals)), random_state=42)
sw_stat, sw_p = stats.shapiro(sample_resid)
print(f"\nShapiro-Wilk test: stat={sw_stat:.4f}, p={sw_p:.6f}")
print(f"Residuals {'appear' if sw_p > 0.05 else 'do NOT appear'} normally distributed")
print(f"\nResidual stats: mean={residuals.mean():.2f}, std={residuals.std():.2f}")
print(f"Skewness: {residuals.skew():.3f}, Kurtosis: {residuals.kurtosis():.3f}")

In [None]:
# Residuals by hour-of-day and by month
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# By hour
ax = axes[0]
hourly_mae = residuals.abs().groupby(residuals.index.hour).mean()
ax.bar(hourly_mae.index, hourly_mae.values, color="steelblue", alpha=0.8)
ax.axhline(residuals.abs().mean(), color="red", linestyle="--", label=f"Overall MAE: {residuals.abs().mean():.1f}")
ax.set_xlabel("Hour of Day")
ax.set_ylabel("MAE (EUR/MWh)")
ax.set_title("Residual MAE by Hour of Day")
ax.legend()
ax.grid(True, alpha=0.3)

# By month
ax = axes[1]
monthly_mae = residuals.abs().groupby(residuals.index.month).mean()
ax.bar(monthly_mae.index, monthly_mae.values, color="darkorange", alpha=0.8)
ax.axhline(residuals.abs().mean(), color="red", linestyle="--", label=f"Overall MAE: {residuals.abs().mean():.1f}")
ax.set_xlabel("Month")
ax.set_ylabel("MAE (EUR/MWh)")
ax.set_title("Residual MAE by Month")
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Rolling MAE
fig, ax = plt.subplots(figsize=(16, 4))
rolling_mae = residuals.abs().rolling(168).mean()  # 7-day window
ax.plot(rolling_mae.index, rolling_mae, color="darkorange", linewidth=1)
ax.axhline(residuals.abs().mean(), color="red", linestyle="--", linewidth=0.8)
ax.set_ylabel("7-Day Rolling MAE (EUR/MWh)")
ax.set_title("Error Stability Over Time — Ensemble")
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 14. Key Findings

In [None]:
# Final summary
print("=" * 80)
print(f"PHASE 3 SUMMARY — ML Price Forecasting for {ZONE} (Bergen)")
print("=" * 80)

print(f"\nValidation period: {y_val.index.min().date()} to {y_val.index.max().date()}")
print(f"Actual price range: {y_val.min():.1f} to {y_val.max():.1f} EUR/MWh")
print(f"Features used: {X_train_full.shape[1]} (fundamentals only — no price lags)")

print(f"\n--- Final Rankings ---")
display(comp)

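# Key numbers (sort explicitly so the result does not depend on comp's row order)
best_row = comp.sort_values("mae").iloc[0]
best_method = best_row['Method']
best_mae = best_row['mae']
best_mae_nok = best_mae * latest_eur_nok / 1000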

print(f"\n--- Key Results ---")
print(f"Best model: {best_method}")
print(f"  MAE: {best_mae:.1f} EUR/MWh ({best_mae_nok:.3f} NOK/kWh)")
print(f"  vs Naive ({naive_mae:.1f} EUR/MWh): "
      f"{(1 - best_mae/naive_mae)*100:.1f}% improvement")

print(f"\n--- Top Features (SHAP) ---")
for i, (feat, val) in enumerate(mean_abs_shap.head(5).items(), 1):
    print(f"  {i}. {feat} (SHAP: {val:.3f})")

if not forward_forecast.empty:
    print(f"\n--- Forward Forecast ---")
    print(f"  Horizon: {len(forward_forecast)} hours ({forward_forecast.index.min().date()} to {forward_forecast.index.max().date()})")
    print(f"  Mean: {forward_forecast['price_nok_kwh'].mean():.3f} NOK/kWh")
    print(f"  Range: {forward_forecast['price_nok_kwh'].min():.3f} – {forward_forecast['price_nok_kwh'].max():.3f} NOK/kWh")

print(f"\n--- Walk-Forward Validation ({best_model_type}) ---")
print(f"  {len(wf_results)} folds, MAE: {np.mean(mae_values):.1f} +/- {np.std(mae_values):.1f} EUR/MWh")

print("\n" + "-" * 80)
print("Observations:")
print("-" * 80)
print("1. Fundamentals-only approach: no price lag features. The model learns from")
print("   actual supply/demand drivers (load, generation, reservoirs, gas, weather)")
print("   instead of the autoregressive shortcut 'price ~ yesterday's price'.")
print("2. SHAP analysis reveals which physical/economic drivers matter most for NO_5.")
print("3. The ensemble combines XGBoost + LightGBM + CatBoost for more robust predictions.")
print("4. Yr weather forecasts enable genuinely forward-looking predictions.")
print("5. Walk-forward validation tests model stability across time periods (see the fold MAE spread above).")
print("")
print("Next steps:")
print("- Replicate for other zones (NO_1-NO_4)")
print("- Optuna hyperparameter tuning (Phase 4)")
print("- Multi-target forecasting: reservoir, demand, production (Phase 5)")
print("- Streamlit dashboard with live Yr integration (Phase 7)")