## The goal of this notebook is to build a model that can forecast day-ahead electricity prices for Bergen, Norway (NO5). The motivation is practical rather than academic. Day-ahead prices are published daily and strongly influence household behavior, industrial planning, and energy awareness.

## I rely on the publicly available and generously maintained API from https://www.hvakosterstrommen.no/ (which in turn collects the data from https://transparency.entsoe.eu/). It provides hourly spot prices by bidding zone. The availability and transparency of this data make it suitable for experimentation, validation, and later deployment in a live dashboard setting.

## The focus of this work is short-term forecasting, specifically predicting the next 24 hours of prices using information that would realistically be available at the time of prediction.
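
The API mentioned above serves one JSON file per day and bidding zone. As a minimal sketch (no network call is made here), the snippet below builds the file URLs for a date range and flattens one payload into rows. The URL pattern and the two field names are the ones used in the fetch cell of this notebook; the helper names `build_price_url`, `daterange`, and `parse_prices` and the sample payload are made up for illustration.

```python
from datetime import date, timedelta

# URL pattern as used in the fetch cell of this notebook.
BASE_URL = "https://www.hvakosterstrommen.no/api/v1/prices"


def build_price_url(day: date, zone: str) -> str:
    """Return the URL of the JSON file with one day of prices for one zone."""
    return f"{BASE_URL}/{day.year}/{day.month:02d}-{day.day:02d}_{zone}.json"


def daterange(start: date, end: date):
    """Yield every date from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


def parse_prices(payload: list, zone: str) -> list:
    """Flatten one day's JSON payload into one dict per hour."""
    return [
        {"datetime": e["time_start"], "price_nok": e["NOK_per_kWh"], "zone": zone}
        for e in payload
    ]


# Made-up payload containing only the two fields this notebook reads.
sample_payload = [
    {"NOK_per_kWh": 1.25, "time_start": "2023-01-01T00:00:00+01:00"},
    {"NOK_per_kWh": 1.10, "time_start": "2023-01-01T01:00:00+01:00"},
]

urls = [build_price_url(d, "NO5") for d in daterange(date(2023, 1, 1), date(2023, 1, 3))]
rows = parse_prices(sample_payload, "NO5")
print(urls[0])
print(rows[0])
```

Keeping URL construction and parsing separate from the HTTP call makes it easier to cache responses on disk and to avoid hitting the API more often than needed.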

In [0]:
import requests
import pandas as pd
from datetime import datetime, date, timedelta
import matplotlib.pyplot as plt



In [0]:

# --- Step 1: Define the date and zones ---
year = 2023
month = "01"
day = "01"
zones = ["NO1", "NO2", "NO3", "NO4", "NO5"]
start_date = date(2023, 1, 1)
end_date   = date(2026, 1, 3)



In [0]:
# --- Step 2: Fetch data for each zone ---
# price_data = []

# current_date = start_date

# while current_date <= end_date:
#     year = current_date.year
#     month = f"{current_date.month:02d}"
#     day = f"{current_date.day:02d}"

#     for zone in zones:
#         url = f"https://www.hvakosterstrommen.no/api/v1/prices/{year}/{month}-{day}_{zone}.json"
#         response = requests.get(url)

#         if response.status_code != 200:
#             # Some dates/zones may legitimately be missing
#             continue

#         data = response.json()

#         for entry in data:
#             price_data.append({
#                 "datetime": entry["time_start"],
#                 "price_nok": entry["NOK_per_kWh"],
#                 "zone": zone
#             })

#     current_date += timedelta(days=1)

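# I don't want to call the API frequently, so the fetch loop above is commented out
# and the data is read from a local CSV instead.
# The loop is also inefficient: it makes one request per day and zone and should have been done in batches.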

In [0]:
# # --- Step 3: Build a DataFrame ---
# df = pd.DataFrame(price_data)

# df["datetime"] = pd.to_datetime(df["datetime"])

# df = df.sort_values(by=["datetime", "zone"])

# df = df.set_index("datetime")

# df.head()

In [0]:
#df.to_csv("electricity_prices_all_zones.csv")

In [0]:
#df.to_csv("electricity_prices_all_zones_backup.csv")

In [0]:
df= pd.read_csv(
    "electricity_prices_all_zones.csv", 
    index_col=["datetime"],
    parse_dates=["datetime"]
)
df.head()

In [0]:
df.index = pd.to_datetime(df.index, utc=True).tz_convert(None)

df.head()

In [0]:
df = df.query('zone == "NO5"')

df.head()

In [0]:
df.plot()

In [0]:
df["price_nok"].iloc[:744].plot()

### I started by exploring the structure of the price series itself. Electricity prices are not a generic time series; they exhibit strong hourly patterns, weekday effects, and longer seasonal movements.

### Before jumping into modeling, I wanted to understand how much of the variation could be explained by regular temporal structure alone. I therefore performed basic exploratory analysis and time series decomposition to separate trend, seasonality, and residual components.

### This step was not meant to produce a final model, but to build intuition about what kind of signals are present and which modeling directions might be reasonable.
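
One rough way to put a number on "how much of the variation is regular temporal structure" is to compare the variance of the series with the variance left after subtracting the mean of each hour-of-week slot. The sketch below does this on a synthetic series with made-up numbers, not on the NO5 data, and it is a simpler calculation than `seasonal_decompose` (there is no trend component).

```python
import numpy as np
import pandas as pd

# Synthetic hourly series (made-up numbers): daily cycle + weekend dip + noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2023-01-01", periods=24 * 7 * 8, freq="h")
hours = idx.hour.to_numpy()
dow = idx.dayofweek.to_numpy()

daily = 0.3 * np.sin(2 * np.pi * hours / 24)
weekend = np.where(dow >= 5, -0.2, 0.0)
noise = rng.normal(0, 0.1, len(idx))
price = pd.Series(1.0 + daily + weekend + noise, index=idx)

# Mean price for each of the 168 hour-of-week slots, broadcast back to every row.
slot = dow * 24 + hours
seasonal = price.groupby(slot).transform("mean")

# Share of variance explained by the weekly profile alone.
residual = price - seasonal
explained = 1 - residual.var() / price.var()
print(f"Share of variance explained by hour-of-week means: {explained:.2f}")
```

On real prices this share will depend heavily on the window chosen, since level shifts between months are not captured by a weekly profile.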

In [0]:
from statsmodels.tsa.seasonal import seasonal_decompose

results = seasonal_decompose(df["price_nok"].iloc[:744], period=(24 * 7))

results.plot();

In [0]:
# # Decomposing the full series is too messy to read; use the one-month version above.
# from statsmodels.tsa.seasonal import seasonal_decompose

# results = seasonal_decompose(df["price_nok"], period=(24 * 7))

# results.plot();

## Linear Regression

In [0]:
df.head()

In [0]:
type(df)

In [0]:
#df.index = pd.to_datetime(df.index)

In [0]:
#df.index = pd.DatetimeIndex(df.index).tz_localize(None)

In [0]:
df.info()

In [0]:
type(df.index)

In [0]:
df.index

In [0]:
# 1. Fix the index type
#df.index = pd.DatetimeIndex(df.index)


In [0]:
#df.index = pd.to_datetime(df.index, utc=True).tz_convert(None)


In [0]:
df.head()

In [0]:
# lm_df = df.assign(
#     trend = range(len(df)).index,
#     hour = df.index.hour.astype("string"),
# #     day_of_week = electricity_df["Datetime"].dt.dayofweek.astype("string"),
# ).set_index("datetime")

# lm_df = pd.get_dummies(lm_df, drop_first=True)



In [0]:
#lm_df.info()

In [0]:
lm_df = df["price_nok"].reset_index()
lm_df.head()

In [0]:

# Feature engineering
lm_df = lm_df.assign(
    trend = lm_df.index,
    hour = lm_df["datetime"].dt.hour.astype("string"),
    day_of_week = lm_df["datetime"].dt.dayofweek.astype("string"),
    # month = lm_df["datetime"].dt.month.astype("string"),
    # Month dummies are a bad idea here: prices are not driven by what happened in a
    # given month of an earlier year. Seasonal production effects are at play, though,
    # so I will need to reconsider this.
).set_index("datetime")

lm_df.head()

In [0]:
# One-hot encode
lm_df = pd.get_dummies(lm_df, drop_first=True,dtype=int)

lm_df.head()

### My initial modeling approach was deliberately simple. I started with linear regression models using only calendar-based features such as hour of day and day of week. The intention was to establish a transparent baseline that captures recurring patterns before introducing more complex dependencies.

### At this stage, I treated the models as diagnostic tools rather than final solutions. If a simple specification already performs poorly, it is usually a sign that more complexity is needed. If it performs reasonably well, it provides a strong reference point for judging whether additional features are actually adding value.

In [0]:
import statsmodels.api as sm

# lm_df_train = lm_df.loc[:"2025-11-30"]
# lm_df_test = lm_df.loc["2025-12-01": "2025-12-31"]

lm_df_train = lm_df.loc[:"2024-12-31"]
lm_df_test  = lm_df.loc["2025-01-01":"2025-01-31"]

# lm_df_train = lm_df.loc[:"2025-07-31"]
# lm_df_test = lm_df.loc["2025-08-01": "2025-08-31"]

# lm_df_train = lm_df.loc[:"2023-11-30"]
# lm_df_test = lm_df.loc["2023-12-01": "2023-12-31"]

X_train = sm.add_constant(lm_df_train.drop("price_nok", axis=1))
y_train = lm_df_train["price_nok"]

X_test = sm.add_constant(lm_df_test.drop("price_nok", axis=1))
y_test = lm_df_test["price_nok"]

In [0]:
import statsmodels.api as sm

model = sm.OLS(y_train, X_train).fit()
model.summary()

In [0]:
from sklearn.metrics import mean_absolute_error as mae
from sklearn.metrics import mean_absolute_percentage_error as mape

#X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(f"MAE: {mae(y_test, model.predict(X_test))}")
print(f"MAPE: {mape(y_test, model.predict(X_test))}")

In [0]:
test_preds = pd.DataFrame({
    "actuals": y_test.values, 
    "predicted": model.predict(X_test)
})

test_preds.plot(ylim=0)

### **A linear model that uses only calendar features has no idea what prices were a few hours ago, and recent prices carry most of the information in electricity markets. As a result, it produces smooth, average-based predictions that do not track changes in the price level. This confirms strong temporal dependence in electricity prices and motivates the inclusion of lagged price features.**
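
To make the "average-based predictions" point concrete, here is a small sketch on made-up data. It fits least squares on a constant plus hour dummies (using `numpy.linalg.lstsq` as a stand-in for `sm.OLS`, and leaving out the trend and weekday terms for brevity) and checks that the predictions are simply the per-hour training means. When the price level shifts in the test period, the error is roughly the size of the shift.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
idx = pd.date_range("2024-01-01", periods=24 * 60, freq="h")
hours = idx.hour.to_numpy()

# Made-up prices: a daily shape, plus a level shift from 1.0 to 2.0 in the last 10 days.
level = np.where(np.arange(len(idx)) >= 24 * 50, 2.0, 1.0)
price = level + 0.2 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 0.05, len(idx))

train = slice(0, 24 * 50)
test = slice(24 * 50, None)

# Design matrix: constant + 23 hour dummies (hour 0 is the reference level).
X = np.column_stack(
    [np.ones(len(idx))] + [(hours == h).astype(float) for h in range(1, 24)]
)
beta, *_ = np.linalg.lstsq(X[train], price[train], rcond=None)
pred = X[test] @ beta

# The predictions for hours 0..23 equal the per-hour means of the training data.
hourly_means = pd.Series(price[train]).groupby(hours[train]).mean()
assert np.allclose(pred[:24], hourly_means.to_numpy())

mae_test = np.mean(np.abs(price[test] - pred))
print(f"Test MAE after the level shift: {mae_test:.2f}")
```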

In [0]:
# Yes, but I am still not convinced: the predictions should not be this far off.
# Let's try bringing the train and test dates closer together, since the general
# price level drifts over time (inflation and similar effects).

In [0]:

df.query('price_nok < 0')

There is actually no reason to worry about this, since Norwegian electricity prices really can be negative.

Check out this https://montel.energy/resources/blog/why-do-negative-prices-occur-in-nordic-energy-hours
or this https://www.reddit.com/r/Norway/comments/1f1ycjq/what_does_it_mean_the_electricity_price_is/
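
Negative and near-zero prices are valid data, but they do matter for one of the metrics used in this notebook: MAPE divides by the actual price, so hours with prices close to zero can dominate it. The sketch below uses made-up numbers and simplified pure-Python versions of the two metrics (not the scikit-learn functions) to show the same absolute error giving very different MAPE values.

```python
def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean absolute percentage error as a fraction: mean of |error| / |actual|."""
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Made-up prices in NOK per kWh.
normal_hours = [1.00, 1.20, 0.90, 1.10]
cheap_hours = [0.01, -0.02, 0.005, 0.02]

# The forecast is off by the same 0.10 NOK in every hour of both sets.
pred_normal = [a + 0.10 for a in normal_hours]
pred_cheap = [a + 0.10 for a in cheap_hours]

print(f"MAE  normal: {mae(normal_hours, pred_normal):.3f}  cheap: {mae(cheap_hours, pred_cheap):.3f}")
print(f"MAPE normal: {mape(normal_hours, pred_normal):.3f}  cheap: {mape(cheap_hours, pred_cheap):.3f}")
```

For periods with many near-zero prices, MAE is therefore likely the more informative of the two numbers.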

In [0]:
lm_df.head()

In [0]:
lm_df.info()



### After testing purely calendar-based models, it became clear that past prices contain substantial information about future prices. This aligns with findings in the electricity price forecasting literature, where autoregressive effects are consistently reported as strong predictors.

### I therefore experimented with lagged price features, focusing primarily on a 24-hour lag. The choice of t-24 is practical: it reflects daily repetition in consumption and production patterns while remaining compatible with a day-ahead forecasting setup.

### Shorter lags such as t-1 can improve accuracy, but they also blur the line between forecasting and "nowcasting". Since the aim here is a realistic day-ahead forecast, I avoided relying on information that would not be robustly available at prediction time.
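
One detail worth checking with a 24-hour lag: `shift(24)` moves values by 24 rows, which equals 24 hours only if no hours are missing from the index. The sketch below uses made-up data with a deliberate six-hour gap and compares the row-based lag with a time-based lag that moves each timestamp forward by 24 hours and then aligns on the index.

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=72, freq="h")
# Made-up series where each value equals the number of hours since the start.
price = pd.Series(range(72), index=idx, dtype=float)

# Drop six hours to imitate a gap in the source data.
gappy = price.drop(idx[30:36])

# Row-based lag: correct only if there are exactly 24 rows per day.
lag_rows = gappy.shift(24)

# Time-based lag: stamp every value 24 hours later, then align on the original index.
lag_time = pd.Series(
    gappy.to_numpy(), index=gappy.index + pd.Timedelta(hours=24)
).reindex(gappy.index)

t = pd.Timestamp("2024-01-03 00:00")  # hour 48, so the true 24-hour lag is 24.0
print("row-based lag: ", lag_rows[t])
print("time-based lag:", lag_time[t])
```

With the gap, the row-based lag at hour 48 picks up the value from hour 18, while the time-based lag returns the value from hour 24 and leaves `NaN` where the hour 24 hours earlier is missing.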

In [0]:
lm_lag = lm_df.assign(
   # price_lag_1 = lm_df["price_nok"].shift(1),
    price_lag_24 = lm_df["price_nok"].shift(24),
)

In [0]:
lm_lag.head()

In [0]:
lm_lag = lm_lag.dropna()

In [0]:
lm_lag.head()

In [0]:
# lm_lag_train = lm_lag.loc[:"2025-11-30"]
# lm_lag_test = lm_lag.loc["2025-12-01": "2025-12-31"]

#lm_lag_train = lm_lag.loc[:"2024-12-31"]   #main set
#lm_lag_test  = lm_lag.loc["2025-01-01":"2025-01-31"]

lm_lag_train = lm_lag.loc[:"2025-11-30"]
lm_lag_test  = lm_lag.loc["2025-12-01":"2026-01-03"]


X_train = sm.add_constant(lm_lag_train.drop("price_nok", axis=1))
y_train = lm_lag_train["price_nok"]

X_test = sm.add_constant(lm_lag_test.drop("price_nok", axis=1))
y_test = lm_lag_test["price_nok"]





In [0]:
model = sm.OLS(y_train, X_train).fit()
model.summary()

In [0]:
print(f"MAE: {mae(y_test, model.predict(X_test))}")
print(f"MAPE: {mape(y_test, model.predict(X_test))}")

In [0]:
test_preds = pd.DataFrame({
    "actuals": y_test.values, 
    "predicted": model.predict(X_test)
})

test_preds.plot(ylim=0)

### Now let's try XGBoost.
### Although linear regression performs surprisingly well, electricity prices are shaped by nonlinear interactions, threshold effects, and regime changes that a linear model cannot fully capture.

### To explore whether such nonlinearities could be learned from the data, I tested XGBoost as an alternative modeling approach. XGBoost is well suited for structured tabular data and is widely used in short-term electricity price forecasting.

### The goal was not to replace linear regression by default, but to evaluate whether a more flexible model could improve predictive accuracy without sacrificing stability or interpretability.

In [0]:
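# Validation split (chronological, without overlap)
val_cutoff = "2025-09-30"   # last day used for fitting
val_start = "2025-10-01"    # first day used for validation

X_tr = X_train.loc[:val_cutoff]
y_tr = y_train.loc[:val_cutoff]

# Date slicing with .loc is inclusive at both ends, so validation starts one day later
X_val = X_train.loc[val_start:]
y_val = y_train.loc[val_start:]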

In [0]:
#Train a conservative XGBoost model

from xgboost import XGBRegressor

xgb_model = XGBRegressor(
    objective="reg:squarederror",
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1
)

In [0]:
# #Fit with early stopping
# xgb_model.fit(
#     X_tr, y_tr,
#     eval_set=[(X_val, y_val)],
#     eval_metric="mae",
#     early_stopping_rounds=30,
#     verbose=True
# )


from xgboost import XGBRegressor
import xgboost as xgb

# xgb_model = XGBRegressor(
#     objective="reg:squarederror",
#     eval_metric="mae",          # ← MOVE IT HERE
#     n_estimators=500,
#     learning_rate=0.05,
#     max_depth=4,
#     subsample=0.8,
#     colsample_bytree=0.8,
#     random_state=42,
#     n_jobs=-1
# )


# xgb_model.fit(
#     X_tr, y_tr,
#     eval_set=[(X_val, y_val)],
#     callbacks=[xgb.callback.EarlyStopping(rounds=30)],
#     verbose=True
# )

In [0]:
import xgboost as xgb

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval   = xgb.DMatrix(X_val, label=y_val)
dtest  = xgb.DMatrix(X_test, label=y_test)


params = {
    "objective": "reg:squarederror",
    "eval_metric": "mae",
    "learning_rate": 0.05,
    "max_depth": 4,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "seed": 42
}


xgb_model = xgb.train(
    params=params,
    dtrain=dtrain,
    num_boost_round=500,
    evals=[(dval, "validation")],
    early_stopping_rounds=30,
    verbose_eval=True
)

In [0]:

# Evaluate on the test set
from sklearn.metrics import mean_absolute_error

xgb_preds = xgb_model.predict(dtest)

print("XGBoost MAE:", mean_absolute_error(y_test, xgb_preds))

In [0]:
#Plot predicted vs actual
test_preds_xgb = pd.DataFrame(
    {
        "actuals": y_test.values,
        "predicted": xgb_preds
    },
    index=y_test.index
)

test_preds_xgb.plot(ylim=0)

In [0]:
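# Feature importance
# xgb_model comes from xgb.train, so it is a Booster and has no feature_importances_
# attribute. get_score returns a dict mapping feature name to importance.
# importance = pd.Series(
#     xgb_model.get_score(importance_type="gain")
# ).sort_values(ascending=False)

# importance.head(10)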

In [0]:
lm_lag.info()

### Weather conditions influence electricity prices through both demand and supply channels. Temperature affects heating demand, while also interacting with hydropower availability and broader system conditions.

### I chose to start with air temperature as a first weather variable because it is widely available, easy to interpret, and commonly used in prior studies. The intention here is not to claim that temperature alone explains price movements, but to test whether it adds incremental explanatory power on top of temporal structure and lagged prices.

### Given Bergen’s volatile and rapidly changing weather, I focused on contemporaneous temperature rather than temperature forecasts far into the future.
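
Before joining temperature onto prices, it is worth checking that both indexes use the same clock. In this notebook the price index is converted to naive UTC, while the weather export is stamped in "norwegian mean time". The sketch below ASSUMES that this means a fixed UTC+1 offset without daylight saving; that is a guess about the export and should be verified against the data source. The data frames are made up.

```python
import pandas as pd

# Made-up price rows on a naive UTC index, as in this notebook after tz_convert(None).
prices = pd.DataFrame(
    {"price_nok": [1.0, 1.1, 1.2]},
    index=pd.to_datetime(["2024-07-01 10:00", "2024-07-01 11:00", "2024-07-01 12:00"]),
)

# Made-up weather rows for the same three hours, stamped in fixed UTC+1 (ASSUMPTION).
weather = pd.DataFrame(
    {"temperature": [15.0, 16.0, 17.0]},
    index=pd.to_datetime(["2024-07-01 11:00", "2024-07-01 12:00", "2024-07-01 13:00"]),
)

# Convert the weather index to naive UTC by removing the assumed one-hour offset.
weather_utc = weather.copy()
weather_utc.index = weather_utc.index - pd.Timedelta(hours=1)

naive_join = prices.join(weather, how="left")        # temperatures land one hour off
aligned_join = prices.join(weather_utc, how="left")  # temperatures match their hour

print(naive_join)
print(aligned_join)
```

In the naive join every temperature is attached to a price one hour away from it, and the first row has no match at all. If the export instead follows local time with daylight saving, a time-zone-aware conversion would be needed rather than a fixed offset.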

In [0]:
weather = pd.read_csv("weather_data.csv", sep=";")

weather = weather[[
    "Time(norwegian mean time)",
    "Air temperature"
]]

weather.head()

In [0]:
weather["datetime"] = pd.to_datetime(
    weather["Time(norwegian mean time)"],
    format="%d.%m.%Y %H:%M"
)
weather = weather.drop(columns="Time(norwegian mean time)")
weather.head()

In [0]:
weather = weather.set_index("datetime")
weather.head()

In [0]:
weather = weather.rename(columns={
    "Air temperature": "temperature"
})

weather["temperature"] = pd.to_numeric(weather["temperature"], errors="coerce")

weather.head()

In [0]:
weather.info()

In [0]:
# THE JOINING
lm_lag_weather = lm_lag.join(weather, how="left")

In [0]:
lm_lag_weather.head()

In [0]:
lm_lag_weather["temperature"].isna().sum()

In [0]:
lm_lag_weather = lm_lag_weather.dropna(subset=["temperature"])


In [0]:
#train / test split

# lm_lag_weather_train = lm_lag_weather.loc[:"2025-11-30"]
# lm_lag_weather_test  = lm_lag_weather.loc["2025-12-01":"2026-01-03"]

lm_lag_weather_train = lm_lag_weather.loc[:"2025-12-31"]
lm_lag_weather_test  = lm_lag_weather.loc["2026-01-01":"2026-01-03"]

X_train = sm.add_constant(lm_lag_weather_train.drop("price_nok", axis=1))
y_train = lm_lag_weather_train["price_nok"]

X_test = sm.add_constant(lm_lag_weather_test.drop("price_nok", axis=1))
y_test = lm_lag_weather_test["price_nok"]

In [0]:
model = sm.OLS(y_train, X_train).fit()
model.summary()

In [0]:
print(f"MAE: {mae(y_test, model.predict(X_test))}")
print(f"MAPE: {mape(y_test, model.predict(X_test))}")

In [0]:
test_preds = pd.DataFrame({
    "actuals": y_test.values, 
    "predicted": model.predict(X_test)
})

test_preds.plot(ylim=0)

## Let's try XGBoost on this.

In [0]:
lm_lag_weather.head()

In [0]:
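# Validation split (chronological, without overlap)
val_cutoff = "2025-09-30"   # last day used for fitting
val_start = "2025-10-01"    # first day used for validation

X_tr = X_train.loc[:val_cutoff]
y_tr = y_train.loc[:val_cutoff]

# Date slicing with .loc is inclusive at both ends, so validation starts one day later
X_val = X_train.loc[val_start:]
y_val = y_train.loc[val_start:]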

In [0]:
#Train a conservative XGBoost model

from xgboost import XGBRegressor

xgb_model = XGBRegressor(
    objective="reg:squarederror",
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4,
    subsample=0.8,
    colsample_bytree=0.8,
    random_state=42,
    n_jobs=-1
)

In [0]:
from xgboost import XGBRegressor
import xgboost as xgb

In [0]:
import xgboost as xgb

dtrain = xgb.DMatrix(X_tr, label=y_tr)
dval   = xgb.DMatrix(X_val, label=y_val)
dtest  = xgb.DMatrix(X_test, label=y_test)


params = {
    "objective": "reg:squarederror",
    "eval_metric": "mae",
    "learning_rate": 0.05,
    "max_depth": 4,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "seed": 42
}


xgb_model = xgb.train(
    params=params,
    dtrain=dtrain,
    num_boost_round=500,
    evals=[(dval, "validation")],
    early_stopping_rounds=30,
    verbose_eval=True
)

In [0]:
# Evaluate on the test set
from sklearn.metrics import mean_absolute_error

xgb_preds = xgb_model.predict(dtest)

print("XGBoost MAE:", mean_absolute_error(y_test, xgb_preds))

In [0]:
#Plot predicted vs actual
test_preds_xgb = pd.DataFrame(
    {
        "actuals": y_test.values,
        "predicted": xgb_preds
    },
    index=y_test.index
)

test_preds_xgb.plot(ylim=0)

### After comparing multiple specifications, including linear models with different lag structures and XGBoost-based approaches, I decided to proceed with the OLS model that includes hour indicators, weekday indicators, a 24-hour price lag, and temperature.

### While XGBoost can capture nonlinear effects, the gains in accuracy were limited relative to the added complexity and tuning effort. In contrast, the OLS model offers strong performance, stability across time, and clear interpretability of coefficients.

### Given the goal of deploying this model in a live dashboard and explaining its behavior to non-technical users, the simpler model provides a better trade-off between accuracy and transparency.

In [0]:
import pandas as pd
import statsmodels.api as sm




### The final step is to simulate a realistic day-ahead forecasting scenario. The model is trained on historical data and used to predict the next 24 hours of prices. These forecasts are then compared against actual realized prices once they become available.

### This setup allows continuous evaluation of model performance over time. Rather than reporting a single accuracy number, the model’s strengths and weaknesses remain visible, especially during price spikes or unusual market conditions.
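
The cell that follows uses a `tomorrow_prices` frame with a `price_forecast` column, which is not built anywhere in this notebook. Below is a minimal sketch of how such a frame might be produced. Everything in it is an assumption for illustration: the history is synthetic, `numpy.linalg.lstsq` stands in for `sm.OLS`, the helper `calendar_dummies` is hypothetical, and the temperature for tomorrow is a placeholder constant where a real forecast would go. The main point is the last `reindex` step: a 24-hour frame contains only one weekday, so its dummy columns must be aligned to the training columns before predicting.

```python
import numpy as np
import pandas as pd


def calendar_dummies(index: pd.DatetimeIndex) -> pd.DataFrame:
    """One-hot hour and weekday columns, keeping every level that occurs."""
    raw = pd.DataFrame(
        {"hour": index.hour.to_numpy(), "day_of_week": index.dayofweek.to_numpy()},
        index=index,
    )
    return pd.get_dummies(raw, columns=["hour", "day_of_week"], dtype=int)


# --- Made-up history: 4 weeks of hourly prices and temperatures ---
rng = np.random.default_rng(0)
hist_idx = pd.date_range("2025-12-01", periods=24 * 28, freq="h")
shape = 0.2 * np.sin(2 * np.pi * hist_idx.hour.to_numpy() / 24)
hist = pd.DataFrame(
    {
        "price_nok": 1.0 + shape + rng.normal(0, 0.05, len(hist_idx)),
        "temperature": rng.normal(2.0, 3.0, len(hist_idx)),
    },
    index=hist_idx,
)

# --- Training design matrix: const, trend, dummies, 24-hour lag, temperature ---
X_hist = calendar_dummies(hist.index).drop(columns=["hour_0", "day_of_week_0"])
X_hist.insert(0, "const", 1.0)
X_hist["trend"] = np.arange(len(hist))
X_hist["price_lag_24"] = hist["price_nok"].shift(24)
X_hist["temperature"] = hist["temperature"]
X_hist = X_hist.dropna()
y_hist = hist.loc[X_hist.index, "price_nok"]

# Plain least squares as a stand-in for sm.OLS(y, X).fit().
beta, *_ = np.linalg.lstsq(X_hist.to_numpy(dtype=float), y_hist.to_numpy(), rcond=None)

# --- Features for the next 24 hours ---
next_idx = pd.date_range(hist.index[-1] + pd.Timedelta(hours=1), periods=24, freq="h")
X_next = calendar_dummies(next_idx)
X_next["const"] = 1.0
X_next["trend"] = np.arange(len(hist), len(hist) + 24)
X_next["price_lag_24"] = hist["price_nok"].iloc[-24:].to_numpy()  # today's prices
X_next["temperature"] = 2.0  # ASSUMPTION: placeholder for a temperature forecast

# Align to the training columns: weekday dummies absent tomorrow become 0,
# and the reference levels (hour_0, day_of_week_0) are dropped.
X_next = X_next.reindex(columns=X_hist.columns, fill_value=0)

tomorrow_prices = pd.DataFrame(
    {"price_forecast": X_next.to_numpy(dtype=float) @ beta}, index=next_idx
)
print(tomorrow_prices.head())
```

Building the forecast-day dummies without `drop_first` and aligning afterwards avoids a subtle error: with `drop_first=True` on a single day, the only weekday level present would be dropped and silently treated as the reference day.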

In [0]:
test_preds = pd.DataFrame(
    {
        "actual": y_test,
        "predicted": model.predict(X_test)
    },
    index=lm_lag_weather_test.index
)


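# NOTE: tomorrow_prices is not defined anywhere in this notebook. It is expected to be
# a DataFrame indexed by datetime with a "price_forecast" column for the next 24 hours.
tomorrow_prices_plot = tomorrow_prices.rename(
    columns={"price_forecast": "predicted"}
)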

combined = pd.concat(
    [test_preds[["actual", "predicted"]], tomorrow_prices_plot],
    axis=0
)

In [0]:
import matplotlib.pyplot as plt

plt.figure(figsize=(12,5))

plt.plot(
    combined.index,
    combined["actual"],
    label="Actual",
    color="black"
)

plt.plot(
    combined.index,
    combined["predicted"],
    label="Predicted",
    linestyle="--",
    color="red"
)

plt.legend()
plt.ylim(0)
plt.title("Actual vs Predicted Electricity Prices")
plt.show()


In [0]:
tomorrow_prices_plot

In [0]:
import requests
import pandas as pd

year = 2026
month = "01"
day = "04"
zone = "NO5"

url = f"https://www.hvakosterstrommen.no/api/v1/prices/{year}/{month}-{day}_{zone}.json"

response = requests.get(url)
response.raise_for_status()

data = response.json()

df_no5_0401 = pd.DataFrame({
    "datetime": [entry["time_start"] for entry in data],
    "price_nok": [entry["NOK_per_kWh"] for entry in data]
})

df_no5_0401["datetime"] = pd.to_datetime(df_no5_0401["datetime"])
df_no5_0401 = df_no5_0401.set_index("datetime")

df_no5_0401
