Title and Goals

Purpose: quick sanity checks on the raw, bronze, and silver data (shapes, value ranges, and plots).


Goals

Inspect raw ingestion results (bronze).

Verify datetime range, districts, and 5‑minute cadence.

Plot sample days to visually confirm patterns.

Check nulls and duplicates.

Code: imports and config
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from src.common.config import load_config
from src.common.io_utils import read_parquet

cfg = load_config()
bronze_path = os.path.join(cfg.data_paths["bronze"], f"bronze_{cfg.year}.parquet")
silver_path = os.path.join(cfg.data_paths["silver"], f"silver_{cfg.year}.parquet")

Markdown: load bronze data

Load bronze (ingested) data
Code:
bronze = read_parquet(bronze_path)
print("Bronze rows:", len(bronze))
bronze.head(5)

Markdown: basic structure checks

Basic structure and summary
Code:
print("Columns:", list(bronze.columns))
print("Districts:", bronze['district'].nunique(), sorted(bronze['district'].unique()))
print("Datetime min/max:", bronze['datetime'].min(), bronze['datetime'].max())
print("Nulls per column:\n", bronze.isna().sum())
print("Duplicates on (district, datetime):", bronze.duplicated(['district','datetime']).sum())
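When the duplicate count above is non-zero, it usually helps to look at the offending rows rather than only the total. The following is a minimal sketch: `duplicate_rows` is a hypothetical helper (not part of `src`), and the small synthetic frame stands in for `bronze`, assuming the same `district` and `datetime` columns.

```python
import pandas as pd

def duplicate_rows(df, keys=('district', 'datetime')):
    """Return every row whose key combination occurs more than once."""
    keys = list(keys)
    # keep=False marks all members of a duplicate group, not just the repeats
    mask = df.duplicated(keys, keep=False)
    return df[mask].sort_values(keys)

# Synthetic stand-in for bronze: district A has two rows at 08:00
demo = pd.DataFrame({
    'district': ['A', 'A', 'B'],
    'datetime': pd.to_datetime(['2024-01-01 08:00',
                                '2024-01-01 08:00',
                                '2024-01-01 08:00']),
    'generation_kw': [1.0, 1.2, 0.8],
})
dupes = duplicate_rows(demo)
print(dupes)
```

On real data, calling `duplicate_rows(bronze)` would show whether the duplicates carry identical or conflicting values, which suggests whether to drop or aggregate them.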

Markdown: cadence check (5-minute)

Cadence check (expect ~5-minute intervals in daytime window)
Code:
def cadence_report(df):
    # Sort so diff() compares consecutive readings within each district
    df = df.sort_values(['district','datetime'])
    gaps = []
    for d, g in df.groupby('district'):
        delta = g['datetime'].diff().dropna()
        # Flag any gap that is not one of the accepted intervals
        bad = delta[~delta.isin([pd.Timedelta(minutes=5),
                                 pd.Timedelta(minutes=3), # if your source uses 3-min style
                                 pd.Timedelta(minutes=6)])]
        gaps.append((d, len(bad)))
    return gaps

cadence = cadence_report(bronze)
cadence[:10]
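The report above only counts gaps outside the accepted set. To see which interval lengths actually occur, a per-district tally of observed gaps can be useful. This is a minimal sketch under assumptions: `gap_distribution` is a hypothetical helper name, and the synthetic frame stands in for `bronze` with the same `district` and `datetime` columns.

```python
import pandas as pd

def gap_distribution(df):
    """Count how often each gap length occurs between consecutive readings, per district."""
    df = df.sort_values(['district', 'datetime'])
    # diff() inside each district so a gap never spans two districts
    gaps = df.groupby('district')['datetime'].diff().dropna()
    return (
        pd.DataFrame({'district': df.loc[gaps.index, 'district'], 'gap': gaps})
        .groupby(['district', 'gap'])
        .size()
        .rename('count')
        .reset_index()
    )

# Synthetic stand-in for bronze: two 5-minute gaps, then one 15-minute gap
demo = pd.DataFrame({
    'district': ['A'] * 4,
    'datetime': pd.to_datetime([
        '2024-01-01 08:00', '2024-01-01 08:05',
        '2024-01-01 08:10', '2024-01-01 08:25',
    ]),
})
gap_counts = gap_distribution(demo)
print(gap_counts)
```

Large gaps (for example overnight, if the source only records a daytime window) would show up here as their own rows, which makes them easy to separate from genuine cadence problems.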

Markdown: load silver and compare

Load silver and compare to bronze
Code:
silver = read_parquet(silver_path)
print("Silver rows:", len(silver))
print("Silver columns:", len(silver.columns))
silver.head(5)
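To make the comparison with bronze explicit, one option is to line up row counts per district for both layers. The sketch below is illustrative only: `compare_row_counts` is a hypothetical helper, and the two small frames stand in for `bronze` and `silver`, assuming both have a `district` column.

```python
import pandas as pd

def compare_row_counts(bronze_df, silver_df, key='district'):
    """Return per-key row counts for both layers and how many rows were dropped."""
    counts = pd.concat(
        [bronze_df[key].value_counts().rename('bronze_rows'),
         silver_df[key].value_counts().rename('silver_rows')],
        axis=1,
    ).fillna(0).astype(int)  # a key missing from one layer counts as 0 rows
    counts['dropped'] = counts['bronze_rows'] - counts['silver_rows']
    return counts.sort_index()

# Synthetic stand-ins: one row for district A is lost between layers
demo_bronze = pd.DataFrame({'district': ['A', 'A', 'A', 'B', 'B']})
demo_silver = pd.DataFrame({'district': ['A', 'A', 'B', 'B']})
row_counts = compare_row_counts(demo_bronze, demo_silver)
print(row_counts)
```

A large or uneven `dropped` column on real data would be a prompt to check which cleaning step in the transform removes those rows.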

Markdown: plot one sample day

Plot a sample day
Code:
# Use the first row's district and date as the sample
sample_dist = silver['district'].iloc[0]
sample_date = silver['date'].dt.date.iloc[0]

day_df = silver[(silver['district'] == sample_dist) &
                (silver['date'].dt.date == sample_date)].sort_values('datetime')

plt.figure(figsize=(10,4))
plt.plot(day_df['datetime'], day_df['generation_kw'])
plt.title(f"{sample_dist} — {sample_date} generation (kW)")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

Markdown: takeaway

Notes
If cadence or duplicates are unexpected, revisit src/data/ingest.py and src/data/transform.py.

If many zeros appear around midday, check the source files or instrumentation logs.
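The midday-zero note can be turned into a quick numeric check. This is a rough sketch, not a definitive rule: `midday_zero_share` is a hypothetical helper, the 10:00 to 14:00 window is an assumed definition of midday, and the synthetic frame stands in for `silver` with `district`, `datetime`, and `generation_kw` columns.

```python
import pandas as pd

def midday_zero_share(df, start_hour=10, end_hour=14, value_col='generation_kw'):
    """Share of zero readings per district between start_hour (inclusive) and end_hour (exclusive)."""
    hours = df['datetime'].dt.hour
    midday = df[(hours >= start_hour) & (hours < end_hour)]
    # The mean of a boolean series is the fraction of True values
    return midday[value_col].eq(0).groupby(midday['district']).mean()

# Synthetic stand-in for silver; the 03:00 reading falls outside the window
demo = pd.DataFrame({
    'district': ['A', 'A', 'A', 'A', 'B'],
    'datetime': pd.to_datetime([
        '2024-06-01 03:00', '2024-06-01 11:00',
        '2024-06-01 12:00', '2024-06-01 13:00',
        '2024-06-01 12:00',
    ]),
    'generation_kw': [0.0, 0.0, 5.0, 0.0, 3.0],
})
zero_share = midday_zero_share(demo)
print(zero_share)
```

Districts with a noticeably higher share than the rest are the natural ones to check first against the source files.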

Notebook 2: 02_features.ipynb
Purpose: inspect Gold (feature‑rich) dataset, verify feature creation, feature importance quick view, and verify train/test split counts.

Markdown: title and goals