<h1> The Open Power System Data (OPSD) Data Set  </h1>

In [None]:
from pathlib import Path
import sys

# If this notebook lives in <repo>/notebooks/, add the repo root to sys.path
PROJECT_ROOT = Path.cwd().parent  # .. from notebooks/ to repo root
sys.path.insert(0, str(PROJECT_ROOT))

# sanity check (optional)
print("Using PROJECT_ROOT:", PROJECT_ROOT)
print("Has src? ", (PROJECT_ROOT / "src").exists())


<h2> Download the Hourly Time Series </h2>

The open data platform [OPSD](https://open-power-system-data.org/) provides time series at different temporal resolutions. The following cell downloads the hourly time series, unless a local copy already exists.

In [None]:
from pathlib import Path
import sys, importlib
import pandas as pd

# Make repo importable and load config
PROJECT_ROOT = Path.cwd().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

import src.config as cfg
cfg = importlib.reload(cfg)

OPSD_60min_df = None
URL = "https://data.open-power-system-data.org/time_series/2020-10-06/time_series_60min_singleindex.csv"

if cfg.OPSD_60min_CSV.exists():
    OPSD_60min_df = pd.read_csv(cfg.OPSD_60min_CSV)
    print("(Re-)loaded hourly OPSD dataset into OPSD_60min_df.")
else:
    print('Not found: "OPSD_time_series_60min_singleindex.csv" in data/raw/.')
    print("Attempting to download from:", URL)
    try:
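        from urllib.request import urlopen
        # Make sure the target directory exists before writing the file
        cfg.OPSD_60min_CSV.parent.mkdir(parents=True, exist_ok=True)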
        with urlopen(URL) as r, open(cfg.OPSD_60min_CSV, "wb") as f:
            f.write(r.read())
        OPSD_60min_df = pd.read_csv(cfg.OPSD_60min_CSV)
        print("Successfully downloaded and loaded the hourly OPSD dataset into OPSD_60min_df.")
    except Exception as e:
        print("Unfortunately, the download failed.")
        print("Reason:", repr(e))
        print("Please download it manually and place it at:", cfg.OPSD_60min_CSV)


The `.info()` and `.shape` output: 

In [None]:
print("   .info():")
OPSD_60min_df.info()  # .info() prints its report itself and returns None
print("   .shape:")
print(OPSD_60min_df.shape)

The first ten rows:

In [None]:
OPSD_60min_df.head(10)

In [None]:
print(OPSD_60min_df[["cet_cest_timestamp", "utc_timestamp"]].head(3))
print(OPSD_60min_df[["cet_cest_timestamp", "utc_timestamp"]].tail(3))

In summary, a first look at the data set shows that the data ranges from **2015-01-01** to **2020-10-01** in steps of one hour. That adds up to $50\,401$ rows and $300$ columns.  
Two of the columns are **timestamp** columns, namely `utc_timestamp` and `cet_cest_timestamp`.  
Next, we take a closer look at the remaining $298$ features.
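
Before moving on, the time range and the hourly step width can be double-checked from the UTC column. The following is a minimal sketch on a toy frame; the three timestamp values are illustrative and `toy_df` is a hypothetical stand-in for `OPSD_60min_df`.

```python
import pandas as pd

# Toy stand-in for the UTC timestamp column (illustrative values only).
toy_df = pd.DataFrame(
    {
        "utc_timestamp": [
            "2019-12-31T22:00:00Z",
            "2019-12-31T23:00:00Z",
            "2020-01-01T00:00:00Z",
        ]
    }
)

# Parse the strings into timezone-aware datetimes.
utc = pd.to_datetime(toy_df["utc_timestamp"], utc=True)

# Differences between consecutive rows; a gap-free hourly series
# has exactly one distinct step of one hour.
steps = utc.diff().dropna()
step_counts = steps.value_counts()
is_hourly = bool((steps == pd.Timedelta(hours=1)).all())

print("First timestamp:", utc.min())
print("Last timestamp: ", utc.max())
print("Distinct step widths:", len(step_counts))
print("Strictly hourly:", is_hourly)
```

Applied to the real data, the same steps would use `OPSD_60min_df["utc_timestamp"]` in place of the toy column.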

In [4]:
OPSD_column_names_lst = OPSD_60min_df.columns.tolist()

In [None]:
OPSD_column_names_lst 

Generally, the electricity data is organized by country, by transmission system operator (TSO) control area, and by bidding zone. Column names start with `DE` when referring to Germany, with `LU` when referring to Luxembourg, etc. The common bidding zone Germany-Luxembourg has `DE_LU` as a prefix. Germany has many more columns than any other country. This is due to the territorial subdivision into four different TSO control areas. To complicate matters further, two of the TSOs have a control area that consists of two separate connected components.  
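
The prefix convention makes it easy to count columns per country. The sketch below uses a small illustrative subset of column names; `country_code` is a hypothetical helper that is not part of the project code.

```python
from collections import Counter

# Illustrative subset of column names used in this notebook.
example_columns = [
    "utc_timestamp",
    "cet_cest_timestamp",
    "DE_load_actual_entsoe_transparency",
    "DE_50hertz_solar_generation_actual",
    "DE_LU_price_day_ahead",
    "LU_load_actual_entsoe_transparency",
]


def country_code(column_name: str) -> str:
    """Return the token before the first underscore, e.g. 'DE'."""
    return column_name.split("_", 1)[0]


# Skip the timestamp columns, then count by leading country code.
data_columns = [c for c in example_columns if "timestamp" not in c]
columns_per_country = Counter(country_code(c) for c in data_columns)
print(columns_per_country.most_common())
```

Note that this simple rule counts `DE_LU` columns under `DE`, since only the first token is inspected.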

<h2> Data Completeness and NaN Analysis </h2> 

<h3> Getting the NaN Stats by Columns</h3> 
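
The helper `eda_utils.build_nan_stats` used in the next cell is project code that is not shown here. As a rough idea of what a per-column NaN summary can look like, here is a minimal sketch with plain pandas; `nan_stats_sketch` and the column names `nan_count` and `nan_share` are assumptions made for illustration and may differ from the project's actual output.

```python
import numpy as np
import pandas as pd


def nan_stats_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-column NaN summary, sorted by relative NaN share."""
    stats = pd.DataFrame(
        {
            "nan_count": df.isna().sum(),   # absolute number of NaNs per column
            "nan_share": df.isna().mean(),  # fraction of rows that are NaN
        }
    )
    return stats.sort_values("nan_share", ascending=False)


# Toy data: column "c" has the most missing values, "b" has none.
toy_df = pd.DataFrame(
    {
        "a": [1.0, np.nan, np.nan, 4.0],
        "b": [1.0, 2.0, 3.0, 4.0],
        "c": [np.nan, np.nan, np.nan, 4.0],
    }
)
toy_nan_df = nan_stats_sketch(toy_df)
print(toy_nan_df)
```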

In [None]:
# --- NaN stats → sort → tidy DataFrame (OPSD_60min_df) ---
from pathlib import Path
import sys, importlib
from IPython.display import display

# import module
PROJECT_ROOT = Path.cwd().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src import eda_utils
importlib.reload(eda_utils)  # pick up latest edits

df = globals().get("OPSD_60min_df")
if df is None:
    print("No DataFrame named OPSD_60min_df found. Please run the preceding cells to load it, then re-run this cell.")
else:
    # Call with ONLY the df (all other args are keyword-only and optional)
    opsd_60min_nan_desc_lst, opsd_60min_nan_df = eda_utils.build_nan_stats(df)

    print(f"Columns analyzed: {len(opsd_60min_nan_df)}")
    print("\nTop 20 by relative NaNs:\n")
    display(opsd_60min_nan_df.head(20))  # nice HTML table


<h3> Missingness Plots </h3>
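
The plots below rely on the project helpers `eda_utils.missingness_mask` and `plotting_utils.plot_dual_timeseries`, whose implementations are not shown here. The underlying idea, a boolean mask aggregated to a weekly share of missing hours, can be sketched with plain pandas. The toy series and the column name `price` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy hourly index covering two full weeks (Monday 2019-01-07 to Sunday 2019-01-20).
index = pd.date_range("2019-01-07", periods=14 * 24, freq=pd.Timedelta(hours=1))

values = np.ones(len(index))
values[:24] = np.nan  # the first day is missing entirely
toy_df = pd.DataFrame({"price": values}, index=index)

# Boolean mask: True where a value is missing.
mask = toy_df.isna()

# Weekly share of missing hours (weeks end on Sunday with the "W" rule).
weekly_missing_share = mask["price"].astype(float).resample("W").mean()
print(weekly_missing_share)
```

In the toy data, the first week has one missing day out of seven, while the second week is complete.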

In [None]:
# --- Missingness mask → dual-plot (weekly vs weekly) ---
from pathlib import Path
import sys, importlib
import matplotlib.pyplot as plt

# Ensure we can import from src/
PROJECT_ROOT = Path.cwd().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src import eda_utils, plotting_utils
importlib.reload(eda_utils)
importlib.reload(plotting_utils)

# 1) Build missingness mask from the OPSD DataFrame
mask_df = eda_utils.missingness_mask(OPSD_60min_df)

# 2) Plot: weekly missingness of DE_LU_price_day_ahead vs weekly missingness of DE_LU_load_forecast_entsoe_transparency
fig, ax1, ax2 = plotting_utils.plot_dual_timeseries(
    mask_df,
    start_date="2016-01-01",
    end_date="2019-12-31",
    column_name_one="DE_LU_price_day_ahead",                    # weekly missingness
    granularity_one="W",
    column_name_two="DE_LU_load_forecast_entsoe_transparency",  # weekly missingness
    granularity_two="W",
    coverage_threshold=0.5,   # coverage rule; can be adjusted if desired
    title="Missingness: DE_LU_price_day_ahead (weekly) vs DE_LU_load_forecast_entsoe_transparency (weekly)",
)

plt.show()



In [None]:
# --- Missingness mask → dual-plot (weekly vs weekly) ---
from pathlib import Path
import sys, importlib
import matplotlib.pyplot as plt

# Ensure we can import from src/
PROJECT_ROOT = Path.cwd().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src import eda_utils, plotting_utils
importlib.reload(eda_utils)
importlib.reload(plotting_utils)

# 1) Build missingness mask (keeps timestamp columns intact)
mask_df = eda_utils.missingness_mask(OPSD_60min_df)

# 2) Plot weekly missingness for both columns
fig, ax1, ax2 = plotting_utils.plot_dual_timeseries(
    mask_df,
    start_date="2016-01-01",
    end_date="2019-12-31",
    column_name_one="DE_LU_price_day_ahead",
    granularity_one="W",
    column_name_two="DE_LU_load_forecast_entsoe_transparency",
    granularity_two="W",
    coverage_threshold=0.5,
    title="Missingness: DE_LU_price_day_ahead (weekly) vs DE_LU_load_forecast_entsoe_transparency (weekly)",
)

# -- Color the second series & right-axis labels red
if ax2.get_lines():
    ax2.get_lines()[-1].set_color("red")
ax2.tick_params(axis="y", labelcolor="red")
ax2.yaxis.label.set_color("red")

plt.show()


In [None]:

from pathlib import Path
import sys, importlib
import matplotlib.pyplot as plt

# import module
PROJECT_ROOT = Path.cwd().parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

from src import plotting_utils
importlib.reload(plotting_utils)


fig, ax1, ax2 = plotting_utils.plot_dual_timeseries(
    OPSD_60min_df,
    start_date="2020-01-01",
    end_date="2020-01-02",
    column_name_one="DE_load_actual_entsoe_transparency",
    granularity_one="H",
    column_name_two="LU_load_actual_entsoe_transparency",
    granularity_two="H",
    coverage_threshold=0.5,
    title="OPSD (2020-01-01 to 2020-01-02): DE vs LU actual load (hourly)",
)
if ax2.get_lines():
    ax2.get_lines()[-1].set_color("red")
ax2.tick_params(axis="y", labelcolor="red")
ax2.yaxis.label.set_color("red")

plt.show()


<h2>Select Columns Relevant for Germany into Derived DataFrames</h2>

🇩🇪 We select the Germany-related columns and the two timestamp columns into a sub-frame: 

In [28]:
germany_cols = [col for col in OPSD_60min_df.columns if col.startswith("DE_") or "timestamp" in col]
OPSD_60min_de_df = OPSD_60min_df[germany_cols].copy()

In [None]:
OPSD_60min_de_df.head(5)

🇩🇪 $+$ 🇱🇺 Since Luxembourg and Germany form one bidding zone, it might be beneficial to include the (two) Luxembourg columns as well.

In [31]:
germany_lux_cols = [col for col in OPSD_60min_df.columns if col.startswith("DE_") or col.startswith("LU_") or "timestamp" in col]
OPSD_60min_de_lu_df = OPSD_60min_df[germany_lux_cols].copy()

We have $45$ columns now. 

In [32]:
OPSD_60min_de_lu_df.shape

(50401, 45)

In [None]:
OPSD_60min_de_lu_df.columns

Here is a view of the columns grouped by prefixes.  

`50hertz`, `amprion`, `tennet` and `transnetbw` refer to the transmission system operators (TSOs) currently operating in Germany. Only the last one, `transnetbw`, has a control area that matches one federal state: **Baden-Württemberg**. The control area of `50hertz` covers the territory of the former **GDR**, all of the city state of **Berlin** and, in addition, the city state of **Hamburg**. The control areas of `amprion` and `tennet` do not align well with the federal subdivision.  

| **Group**          | **Columns** |
|---------------------|-------------|
| **DE_LU** | `DE_LU_load_actual_entsoe_transparency`, `DE_LU_load_forecast_entsoe_transparency`, `DE_LU_price_day_ahead`, `DE_LU_solar_generation_actual`, `DE_LU_wind_generation_actual`, `DE_LU_wind_offshore_generation_actual`, `DE_LU_wind_onshore_generation_actual` |
| **DE** | `DE_load_actual_entsoe_transparency`, `DE_load_forecast_entsoe_transparency`, `DE_solar_capacity`, `DE_solar_generation_actual`, `DE_solar_profile`, `DE_wind_capacity`, `DE_wind_generation_actual`, `DE_wind_profile`, `DE_wind_offshore_capacity`, `DE_wind_offshore_generation_actual`, `DE_wind_offshore_profile`, `DE_wind_onshore_capacity`, `DE_wind_onshore_generation_actual`, `DE_wind_onshore_profile` |
| **LU** | `LU_load_actual_entsoe_transparency`, `LU_load_forecast_entsoe_transparency` |
| **DE_50hertz** | `DE_50hertz_load_actual_entsoe_transparency`, `DE_50hertz_load_forecast_entsoe_transparency`, `DE_50hertz_solar_generation_actual`, `DE_50hertz_wind_generation_actual`, `DE_50hertz_wind_offshore_generation_actual`, `DE_50hertz_wind_onshore_generation_actual` |
| **DE_amprion** | `DE_amprion_load_actual_entsoe_transparency`, `DE_amprion_load_forecast_entsoe_transparency`, `DE_amprion_solar_generation_actual`, `DE_amprion_wind_onshore_generation_actual` |
| **DE_tennet** | `DE_tennet_load_actual_entsoe_transparency`, `DE_tennet_load_forecast_entsoe_transparency`, `DE_tennet_solar_generation_actual`, `DE_tennet_wind_generation_actual`, `DE_tennet_wind_offshore_generation_actual`, `DE_tennet_wind_onshore_generation_actual` |
| **DE_transnetbw** | `DE_transnetbw_load_actual_entsoe_transparency`, `DE_transnetbw_load_forecast_entsoe_transparency`, `DE_transnetbw_solar_generation_actual`, `DE_transnetbw_wind_onshore_generation_actual` |
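
A grouping like the one in the table above can be derived from the column names by matching the longest known prefix first. This is a minimal sketch; `GROUP_PREFIXES` and `group_of` are hypothetical names introduced for illustration, and the column list is a small subset of the real one.

```python
from collections import defaultdict

# Known group prefixes, as they appear in the table above.
GROUP_PREFIXES = ["DE_LU", "DE", "LU", "DE_50hertz", "DE_amprion", "DE_tennet", "DE_transnetbw"]


def group_of(column_name: str) -> str:
    """Assign a column to its group; longer prefixes are tried first
    so that 'DE_LU_' and 'DE_50hertz_' win over plain 'DE_'."""
    for prefix in sorted(GROUP_PREFIXES, key=len, reverse=True):
        if column_name.startswith(prefix + "_"):
            return prefix
    return "other"


example_columns = [
    "utc_timestamp",
    "DE_LU_price_day_ahead",
    "DE_load_actual_entsoe_transparency",
    "LU_load_actual_entsoe_transparency",
    "DE_50hertz_solar_generation_actual",
    "DE_transnetbw_wind_onshore_generation_actual",
]

grouped = defaultdict(list)
for col in example_columns:
    grouped[group_of(col)].append(col)

for group, cols in grouped.items():
    print(f"{group}: {cols}")
```

For the real data, `example_columns` would be replaced by `OPSD_60min_de_lu_df.columns`; the timestamp columns then end up in the `other` group.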
