
# IND320 — Project Work, Part 2: Data Sources

## Links
- **Live app:** https://ind320-project-work-nonewthing.streamlit.app
- **Repo:** https://github.com/TaoM29/IND320-dashboard-basics

## Plan for Part 2
1. Fetch Elhub **PRODUCTION_PER_GROUP_MBA_HOUR** for **2021** (chunked by month).
2. Parse `productionPerGroupMbaHour` → tidy DataFrame (`priceArea, productionGroup, startTime, endTime, quantityKwh`).
3. Write DataFrame to **Cassandra** (`ind320.elhub_production_raw`) via **Spark**.
4. Use **Spark** to select: `priceArea, productionGroup, startTime, quantityKwh`.
5. Plots:
   - Pie: total 2021 by **production group** for a chosen **price area**.
   - Line: **January 2021** for a chosen **price area**, separate lines per **production group**.
6. Insert curated data into **MongoDB** (for Streamlit).
7. Log (300–500 words) + brief AI usage note.

# Elhub PRODUCTION_PER_GROUP_MBA_HOUR → tidy DataFrame for 2021
 - builds one window per calendar month; the cells below construct the boundaries directly in UTC, not in Europe/Oslo
 - calls the dataset month by month (safer given the API's period limits)
 - extracts only the `productionPerGroupMbaHour` list → tidy DataFrame
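
A DST-safe month window anchored in Europe/Oslo could be built along the following lines. This is a minimal sketch rather than the code used in the cells below: `month_window_utc` is a hypothetical helper name, and the sketch assumes Python 3.9+ with `zoneinfo` and an installed IANA time zone database.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

OSLO = ZoneInfo("Europe/Oslo")

def month_window_utc(year: int, month: int) -> tuple[datetime, datetime]:
    """Return (start, end) of an Oslo-local calendar month as UTC datetimes.

    The start is inclusive and the end is exclusive.
    """
    start_local = datetime(year, month, 1, tzinfo=OSLO)
    # After December, roll over to January of the following year.
    end_local = datetime(year + (month == 12), month % 12 + 1, 1, tzinfo=OSLO)
    # Convert to UTC before doing arithmetic, so DST shifts are counted.
    return start_local.astimezone(timezone.utc), end_local.astimezone(timezone.utc)

start, end = month_window_utc(2021, 3)
hours = int((end - start).total_seconds() // 3600)
print(start.isoformat(), end.isoformat(), hours)
# 2021-02-28T23:00:00+00:00 2021-03-31T22:00:00+00:00 743
```

March 2021 contains the spring DST change, so the local month is 743 hours long instead of 744.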

In [1]:
#  Setup + probes (Elhub)
import os, requests, pandas as pd
from datetime import datetime, timezone

BASE_V0 = "https://api.elhub.no/energy-data/v0"
DATASET = "PRODUCTION_PER_GROUP_MBA_HOUR"

ELHUB_API_TOKEN = os.getenv("ELHUB_API_TOKEN")

def headers_jsonapi():
    # v0 uses JSON:API; request its media type via the Accept header
    h = {"Accept": "application/vnd.api+json"}
    if ELHUB_API_TOKEN:
        h["Authorization"] = f"Bearer {ELHUB_API_TOKEN}"
    return h

def iso_utc_offset(dt: datetime) -> str:
    """
    Return ISO-8601 with offset like '+00:00' (v0 requires this format).
    """
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    dt = dt.astimezone(timezone.utc)
    off = dt.strftime("%z")            # e.g. '+0000'
    off = off[:-2] + ":" + off[-2:]    # insert the colon: '+00:00'
    return dt.strftime("%Y-%m-%dT%H:%M:%S") + off

print("Token present?", bool(ELHUB_API_TOKEN))

# production-groups (lists valid group ids) 
r = requests.get(f"{BASE_V0}/production-groups", headers=headers_jsonapi(), timeout=30)
print("production-groups → HTTP", r.status_code, "| Content-Type:", r.headers.get("Content-Type"))
r.raise_for_status()
pg_payload = r.json()

pg_rows = []
for item in pg_payload.get("data", []):
    attrs = item.get("attributes", {}) or {}
    pg_rows.append({"id": item.get("id"), "name": attrs.get("name"), "description": attrs.get("description")})
production_groups_df = pd.DataFrame(pg_rows)
display(production_groups_df)

# show the entity we’ll query (price-areas) just to confirm it responds 
r2 = requests.get(f"{BASE_V0}/price-areas", headers=headers_jsonapi(), timeout=30, params={"dataset": DATASET, "pageSize": 1})
print("price-areas (dataset ping) → HTTP", r2.status_code, "| Content-Type:", r2.headers.get("Content-Type"))




Token present? False
production-groups → HTTP 200 | Content-Type: application/json; charset=utf-8


,id,name,description
0,solar,Solar,Unit in which solar energy is converted to ele...
1,hydro,Hydro,Unit in which moving water energy is converted...
2,wind,Wind,Unit in which wind energy is converted to elec...
3,thermal,Thermal,Unit in which heat energy is converted to elec...
4,nuclear,Nuclear,Unit in which the heat source is a nuclear rea...
5,other,Other,Other unspecified technology.
6,*,*,*


price-areas (dataset ping) → HTTP 200 | Content-Type: application/json; charset=utf-8
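
The `iso_utc_offset` helper above assembles the colon-separated offset by hand. For comparison, the standard library's `datetime.isoformat(timespec="seconds")` produces the same `+00:00` form for a UTC-aware datetime. The sketch below uses a hypothetical name, `iso_utc_offset_alt`, and keeps the original assumption that naive datetimes are meant as UTC.

```python
from datetime import datetime, timedelta, timezone

def iso_utc_offset_alt(dt: datetime) -> str:
    """Format dt as ISO-8601 in UTC with a '+00:00' offset."""
    if dt.tzinfo is None:
        # Same assumption as the original helper: naive datetimes are UTC.
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat(timespec="seconds")

print(iso_utc_offset_alt(datetime(2021, 1, 1)))
# 2021-01-01T00:00:00+00:00

# An aware, non-UTC input is converted to UTC, not merely relabelled.
cet = timezone(timedelta(hours=1))
print(iso_utc_offset_alt(datetime(2021, 1, 1, tzinfo=cet)))
# 2020-12-31T23:00:00+00:00
```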


In [2]:
# Fetch one month (NO1 + hydro) with flexible key mapping 
import pandas as pd
import requests
from datetime import datetime, timezone

def fetch_month(price_area: str, production_group: str, year: int, month: int) -> pd.DataFrame:
    """First attempt: assumes each record's attributes hold the hourly fields directly."""
    # month window (inclusive start, exclusive end)
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    end = datetime(year + (month==12), (month % 12) + 1, 1, tzinfo=timezone.utc)

    params = {
        "dataset": DATASET,
        "priceArea": price_area,
        "productionGroup": production_group,
        "startDate": iso_utc_offset(start),
        "endDate":   iso_utc_offset(end),
        "pageSize":  10000
    }

    url = f"{BASE_V0}/price-areas"
    r = requests.get(url, headers=headers_jsonapi(), params=params, timeout=90)
    print("HTTP", r.status_code, "|", r.headers.get("Content-Type"))
    print("URL:", r.url)

    if r.status_code != 200:
        print("Body preview:", r.text[:600])
        return pd.DataFrame()

    j = r.json()
    data = [rec.get("attributes", {}) for rec in j.get("data", [])]
    if not data:
        print("No data returned.")
        return pd.DataFrame(columns=["priceArea","productionGroup","startTime","quantityKwh"])

    df_raw = pd.DataFrame(data)
    print("Raw columns from API:", list(df_raw.columns))

    # Flexible mapping: try multiple variants for each required field
    variants = {
        "priceArea":       ["priceArea", "PRISOMRÅDE", "price_area"],
        "productionGroup": ["productionGroup", "PRODUKSJONSGRUPPE", "production_group"],
        "startTime":       ["startTime", "START_TID", "start_time", "startDateTime", "start_date_time"],
        "quantityKwh":     ["quantityKwh", "VOLUM_KWH", "quantity_kwh", "quantityKWh", "kwh", "volumeKwh"],
    }

    colmap = {}
    for target, opts in variants.items():
        for c in opts:
            if c in df_raw.columns:
                colmap[c] = target
                break

    missing = [t for t in variants if t not in colmap.values()]
    if missing:
        print("Could not find required fields:", missing)
        print("Please show these API columns to me so I can add mappings.")
        display(df_raw.head())
        return pd.DataFrame(columns=["priceArea","productionGroup","startTime","quantityKwh"])

    df = df_raw.rename(columns=colmap)[["priceArea","productionGroup","startTime","quantityKwh"]]
    df["startTime"]   = pd.to_datetime(df["startTime"], utc=True, errors="coerce")
    df["quantityKwh"] = pd.to_numeric(df["quantityKwh"], errors="coerce")
    df = df.dropna(subset=["startTime","quantityKwh"]).reset_index(drop=True)

    print(f"Fetched rows: {len(df)}")
    return df

# test for one month: NO1 + hydro, Jan 2021 
df_test = fetch_month("NO1", "hydro", 2021, 1)
display(df_test.head(10))
print("Columns:", df_test.columns.tolist())
print("Date range:", (df_test["startTime"].min() if not df_test.empty else None),
      "→", (df_test["startTime"].max() if not df_test.empty else None))


HTTP 200 | application/json; charset=utf-8
URL: https://api.elhub.no/energy-data/v0/price-areas?dataset=PRODUCTION_PER_GROUP_MBA_HOUR&priceArea=NO1&productionGroup=hydro&startDate=2021-01-01T00%3A00%3A00%2B00%3A00&endDate=2021-02-01T00%3A00%3A00%2B00%3A00&pageSize=10000
Raw columns from API: ['country', 'eic', 'name', 'productionPerGroupMbaHour']
Could not find required fields: ['priceArea', 'productionGroup', 'startTime', 'quantityKwh']
Please show these API columns to me so I can add mappings.


,country,eic,name,productionPerGroupMbaHour
0,NO,*,*,[]
1,NO,10YNO-1--------2,NO1,"[{'endTime': '2021-01-01T01:00:00+01:00', 'las..."
2,NO,10YNO-2--------T,NO2,"[{'endTime': '2021-01-01T01:00:00+01:00', 'las..."
3,NO,10YNO-3--------J,NO3,"[{'endTime': '2021-01-01T01:00:00+01:00', 'las..."
4,NO,10YNO-4--------9,NO4,"[{'endTime': '2021-01-01T01:00:00+01:00', 'las..."


,priceArea,productionGroup,startTime,quantityKwh


Columns: ['priceArea', 'productionGroup', 'startTime', 'quantityKwh']
Date range: None → None
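
The empty result follows from the payload shape. The "Raw columns from API" line lists only `country`, `eic`, `name` and `productionPerGroupMbaHour`, so the hourly fields sit one level down, inside a list per price area. The sketch below walks a hand-written stand-in for `r.json()` that imitates this shape. The payload is an assumption for illustration: the `id` values are guesses, the real response may carry more fields, and `iter_hourly_rows` is a hypothetical helper name.

```python
from typing import Iterator

# Hand-written stand-in for r.json(); not a real API response.
sample_payload = {
    "data": [
        # The '*' record carries an empty list, as in the output above.
        {"id": "*", "attributes": {"name": "*", "productionPerGroupMbaHour": []}},
        {"id": "NO1", "attributes": {
            "name": "NO1",
            "productionPerGroupMbaHour": [
                {"productionGroup": "hydro",
                 "startTime": "2021-01-01T00:00:00+01:00",
                 "endTime": "2021-01-01T01:00:00+01:00",
                 "quantityKwh": 2507716.8},
                {"productionGroup": "hydro",
                 "startTime": "2021-01-01T01:00:00+01:00",
                 "endTime": "2021-01-01T02:00:00+01:00",
                 "quantityKwh": 2494728.0},
            ],
        }},
    ]
}

def iter_hourly_rows(payload: dict) -> Iterator[dict]:
    """Yield one flat row per hourly item nested under each price area."""
    for rec in payload.get("data", []):
        attrs = rec.get("attributes") or {}
        area = attrs.get("name") or rec.get("id")
        for item in attrs.get("productionPerGroupMbaHour") or []:
            yield {
                "priceArea": area,
                "productionGroup": item.get("productionGroup"),
                "startTime": item.get("startTime"),
                "quantityKwh": item.get("quantityKwh"),
            }

rows = list(iter_hourly_rows(sample_payload))
print(len(rows), rows[0]["priceArea"], rows[0]["quantityKwh"])
# 2 NO1 2507716.8
```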


In [3]:
# Expand nested productionPerGroupMbaHour 
def fetch_month(price_area: str, production_group: str, year: int, month: int) -> pd.DataFrame:
    """
    Fetch one month of hourly production data for a given price area and production group.
    Extracts the nested list in 'productionPerGroupMbaHour'.
    """
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    end = datetime(year + (month == 12), (month % 12) + 1, 1, tzinfo=timezone.utc)

    params = {
        "dataset": DATASET,
        "priceArea": price_area,
        "productionGroup": production_group,
        "startDate": iso_utc_offset(start),
        "endDate": iso_utc_offset(end),
        "pageSize": 10000,
    }

    url = f"{BASE_V0}/price-areas"
    r = requests.get(url, headers=headers_jsonapi(), params=params, timeout=90)
    print("HTTP", r.status_code, "|", r.headers.get("Content-Type"))
    print("URL:", r.url)

    if r.status_code != 200:
        print("Body preview:", r.text[:400])
        return pd.DataFrame()

    j = r.json()
    data = j.get("data", [])
    if not data:
        print("No data rows in 'data'.")
        return pd.DataFrame(columns=["priceArea", "productionGroup", "startTime", "quantityKwh"])

    rows = []
    for rec in data:
        attrs = rec.get("attributes", {})
        area = attrs.get("name") or rec.get("id")
        inner = attrs.get("productionPerGroupMbaHour", [])
        if not inner:
            continue
        for item in inner:
            rows.append({
                "priceArea": area,
                "productionGroup": item.get("productionGroup"),
                "startTime": item.get("startTime"),
                "quantityKwh": item.get("quantityKwh")
            })

    df = pd.DataFrame(rows)
    if df.empty:
        print("No inner rows found (productionPerGroupMbaHour empty for this month).")
        return pd.DataFrame(columns=["priceArea","productionGroup","startTime","quantityKwh"])

    df["startTime"] = pd.to_datetime(df["startTime"], utc=True, errors="coerce")
    df["quantityKwh"] = pd.to_numeric(df["quantityKwh"], errors="coerce")
    df = df.dropna(subset=["startTime", "quantityKwh"]).reset_index(drop=True)
    print(f"Fetched rows: {len(df)}")
    return df

# Test: NO1 + hydro, Jan 2021 
df_test = fetch_month("NO1", "hydro", 2021, 1)
display(df_test.head(10))
print("Columns:", df_test.columns.tolist())
print("Date range:", (df_test["startTime"].min() if not df_test.empty else None),
      "→", (df_test["startTime"].max() if not df_test.empty else None))


HTTP 200 | application/json; charset=utf-8
URL: https://api.elhub.no/energy-data/v0/price-areas?dataset=PRODUCTION_PER_GROUP_MBA_HOUR&priceArea=NO1&productionGroup=hydro&startDate=2021-01-01T00%3A00%3A00%2B00%3A00&endDate=2021-02-01T00%3A00%3A00%2B00%3A00&pageSize=10000
Fetched rows: 3720


,priceArea,productionGroup,startTime,quantityKwh
0,NO1,hydro,2020-12-31 23:00:00+00:00,2507716.8
1,NO1,hydro,2021-01-01 00:00:00+00:00,2494728.0
2,NO1,hydro,2021-01-01 01:00:00+00:00,2486777.5
3,NO1,hydro,2021-01-01 02:00:00+00:00,2461176.0
4,NO1,hydro,2021-01-01 03:00:00+00:00,2466969.2
5,NO1,hydro,2021-01-01 04:00:00+00:00,2467460.0
6,NO1,hydro,2021-01-01 05:00:00+00:00,2482320.8
7,NO1,hydro,2021-01-01 06:00:00+00:00,2509533.0
8,NO1,hydro,2021-01-01 07:00:00+00:00,2550758.2
9,NO1,hydro,2021-01-01 08:00:00+00:00,2693111.0


Columns: ['priceArea', 'productionGroup', 'startTime', 'quantityKwh']
Date range: 2020-12-31 23:00:00+00:00 → 2021-01-31 22:00:00+00:00
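
The row count deserves a sanity check. One price area and one production group should give one row per hour, about 744 for January, yet 3720 rows came back. The sketch below counts the hours in an Oslo-local calendar month and multiplies by the five price areas. That reproduces 3720 for January, and also the 3715 (March) and 3725 (October) that appear in the full-year log further down. This suggests, without proving it, that each response contains every price area regardless of the `priceArea` parameter, and that the API aligns the window to local months, which would also explain why the date range starts at 2020-12-31 23:00 UTC. `hours_in_oslo_month` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

OSLO = ZoneInfo("Europe/Oslo")
N_PRICE_AREAS = 5  # NO1..NO5

def hours_in_oslo_month(year: int, month: int) -> int:
    """Number of elapsed hours in an Oslo-local calendar month."""
    # Convert both boundaries to UTC first so DST changes are counted.
    start = datetime(year, month, 1, tzinfo=OSLO).astimezone(timezone.utc)
    end = datetime(year + (month == 12), month % 12 + 1, 1,
                   tzinfo=OSLO).astimezone(timezone.utc)
    return int((end - start).total_seconds() // 3600)

# Rows per call, copied from the notebook output (hydro, 2021).
observed = {1: 3720, 3: 3715, 10: 3725}

for month, rows in observed.items():
    hours = hours_in_oslo_month(2021, month)
    print(month, hours, hours * N_PRICE_AREAS == rows)
# 1 744 True
# 3 743 True
# 10 745 True
```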


In [4]:
# Build full-year (2021) DataFrame 
import pandas as pd
from datetime import datetime, timezone

PRICE_AREAS = ["NO1","NO2","NO3","NO4","NO5"]

# Use the production groups fetched in the setup cell
valid_groups = [g for g in production_groups_df["id"].tolist() if g != "*"]
print("Using production groups:", valid_groups)

def fetch_month(price_area: str, production_group: str, year: int, month: int) -> pd.DataFrame:
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    end = datetime(year + (month == 12), (month % 12) + 1, 1, tzinfo=timezone.utc)

    params = {
        "dataset": DATASET,
        "priceArea": price_area,
        "productionGroup": production_group,
        "startDate": iso_utc_offset(start),
        "endDate": iso_utc_offset(end),
        "pageSize": 10000,
    }
    url = f"{BASE_V0}/price-areas"
    r = requests.get(url, headers=headers_jsonapi(), params=params, timeout=90)
    if r.status_code != 200:
        print(f"[{price_area} {production_group} {year}-{month:02d}] HTTP {r.status_code}")
        return pd.DataFrame(columns=["priceArea","productionGroup","startTime","quantityKwh"])

    j = r.json()
    rows = []
    for rec in j.get("data", []):
        attrs = rec.get("attributes", {}) or {}
        area_name = attrs.get("name") or rec.get("id")  # 'NO1', etc.
        inner = attrs.get("productionPerGroupMbaHour", []) or []
        for item in inner:
            rows.append({
                "priceArea": area_name,
                "productionGroup": item.get("productionGroup"),
                "startTime": item.get("startTime"),
                "quantityKwh": item.get("quantityKwh"),
            })
    df = pd.DataFrame(rows)
    if df.empty:
        return df
    df["startTime"] = pd.to_datetime(df["startTime"], utc=True, errors="coerce")
    df["quantityKwh"] = pd.to_numeric(df["quantityKwh"], errors="coerce")
    df = df.dropna(subset=["startTime","quantityKwh"]).reset_index(drop=True)
    return df[["priceArea","productionGroup","startTime","quantityKwh"]]

# Build it month-by-month to respect the API's one-month window limit
parts = []
for area in PRICE_AREAS:
    for group in valid_groups:
        for m in range(1, 13):
            df_m = fetch_month(area, group, 2021, m)
            if not df_m.empty:
                parts.append(df_m)
            print(f"{area} {group} 2021-{m:02d}: {len(df_m)} rows")

df_2021 = pd.concat(parts, ignore_index=True) if parts else pd.DataFrame(columns=["priceArea","productionGroup","startTime","quantityKwh"])

print("\n=== FULL YEAR SUMMARY ===")
print("Total rows:", len(df_2021))
print("Areas:", sorted(df_2021['priceArea'].unique().tolist()) if not df_2021.empty else [])
print("Groups:", sorted(df_2021['productionGroup'].unique().tolist()) if not df_2021.empty else [])
print("Time span:", (df_2021['startTime'].min() if not df_2021.empty else None), "→",
      (df_2021['startTime'].max() if not df_2021.empty else None))

display(df_2021.head())


Using production groups: ['solar', 'hydro', 'wind', 'thermal', 'nuclear', 'other']
NO1 solar 2021-01: 3720 rows
NO1 solar 2021-02: 3360 rows
NO1 solar 2021-03: 3715 rows
NO1 solar 2021-04: 3600 rows
NO1 solar 2021-05: 3720 rows
NO1 solar 2021-06: 3600 rows
NO1 solar 2021-07: 3720 rows
NO1 solar 2021-08: 3720 rows
NO1 solar 2021-09: 3600 rows
NO1 solar 2021-10: 3725 rows
NO1 solar 2021-11: 3600 rows
NO1 solar 2021-12: 3720 rows
NO1 hydro 2021-01: 3720 rows
NO1 hydro 2021-02: 3360 rows
NO1 hydro 2021-03: 3715 rows
NO1 hydro 2021-04: 3600 rows
NO1 hydro 2021-05: 3720 rows
NO1 hydro 2021-06: 3600 rows
NO1 hydro 2021-07: 3720 rows
NO1 hydro 2021-08: 3720 rows
NO1 hydro 2021-09: 3600 rows
NO1 hydro 2021-10: 3725 rows
NO1 hydro 2021-11: 3600 rows
NO1 hydro 2021-12: 3720 rows
NO1 wind 2021-01: 2976 rows
NO1 wind 2021-02: 2688 rows
NO1 wind 2021-03: 2972 rows
NO1 wind 2021-04: 2880 rows
NO1 wind 2021-05: 2976 rows
NO1 wind 2021-06: 3576 rows
NO1 wind 2021-07: 3720 rows
NO1 wind 2021-08: 3720 ro

,priceArea,productionGroup,startTime,quantityKwh
0,NO1,solar,2020-12-31 23:00:00+00:00,6.106
1,NO1,solar,2021-01-01 00:00:00+00:00,4.03
2,NO1,solar,2021-01-01 01:00:00+00:00,3.982
3,NO1,solar,2021-01-01 02:00:00+00:00,8.146
4,NO1,solar,2021-01-01 03:00:00+00:00,8.616
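
The per-call counts in the log are worth a second look: 3720 rows for a 31-day month is five times 744 hours, which suggests that each response carries all five price areas. If that reading is right, looping over `PRICE_AREAS` collects every row five times. The Spark row count reported later (1076765 = 5 × 215353) is consistent with this. One possible remedy is to keep only the records whose `name` matches the requested area while parsing. The sketch below shows that filter on a hand-written payload; `parse_area` is a hypothetical helper, and the NO2 quantities are made-up placeholders. Alternatively, `df_2021.drop_duplicates(subset=["priceArea", "productionGroup", "startTime"])` after concatenation would remove the repeats.

```python
def parse_area(payload: dict, price_area: str) -> list[dict]:
    """Flatten the nested hourly items, keeping only the requested price area."""
    rows = []
    for rec in payload.get("data", []):
        attrs = rec.get("attributes") or {}
        area = attrs.get("name") or rec.get("id")
        if area != price_area:
            continue  # skip other areas bundled into the same response
        for item in attrs.get("productionPerGroupMbaHour") or []:
            rows.append({
                "priceArea": area,
                "productionGroup": item.get("productionGroup"),
                "startTime": item.get("startTime"),
                "quantityKwh": item.get("quantityKwh"),
            })
    return rows

# Hand-written stand-in for r.json(); NO2 values are placeholders.
sample_payload = {
    "data": [
        {"attributes": {"name": "NO1", "productionPerGroupMbaHour": [
            {"productionGroup": "solar",
             "startTime": "2021-01-01T00:00:00+01:00", "quantityKwh": 6.106},
        ]}},
        {"attributes": {"name": "NO2", "productionPerGroupMbaHour": [
            {"productionGroup": "solar",
             "startTime": "2021-01-01T00:00:00+01:00", "quantityKwh": 1.0},
        ]}},
    ]
}

no1_rows = parse_area(sample_payload, "NO1")
print(len(no1_rows), no1_rows[0]["priceArea"])
# 1 NO1
```

Filtering at parse time keeps the outer loop over price areas intact. Dropping that loop and making one call per group and month would be another option if the API does ignore the parameter.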


In [None]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
      .appName("elhub-cassandra")
      .config("spark.jars.packages",
              "com.datastax.spark:spark-cassandra-connector_2.12:3.5.1,"
              "com.datastax.spark:spark-cassandra-connector-driver_2.12:3.5.1")
      .config("spark.cassandra.connection.host", "127.0.0.1")
      .config("spark.cassandra.connection.port", "9042")
      .getOrCreate()
)

print("Spark:", spark.version)

# Quick check that the connector is on the classpath
try:
    spark._jvm.java.lang.Thread.currentThread()\
        .getContextClassLoader()\
        .loadClass("com.datastax.spark.connector.cql.CassandraConnector")
    print("✅ Connector loaded")
except Exception as e:
    print("❌ Connector not found:", e)



Spark: 3.5.1
✅ Connector loaded


In [6]:
from pyspark.sql import functions as F, types as T
import pandas as pd

if 'pdf' in globals():
    _pdf = pdf
elif 'df_2021' in globals():
    _pdf = df_2021.copy()
    _pdf["quantityKwh"] = pd.to_numeric(_pdf["quantityKwh"], errors="coerce")
else:
    raise RuntimeError("Need pdf or df_2021 in memory.")

sdf = spark.createDataFrame(_pdf)

sdf_tidy = (
    sdf
    .withColumnRenamed("priceArea", "price_area")
    .withColumnRenamed("productionGroup", "production_group")
    .withColumn("start_time_utc", F.col("startTime").cast(T.TimestampType()))
    .withColumn("quantity_kwh", F.col("quantityKwh").cast(T.DoubleType()))
    .select("price_area", "production_group", "start_time_utc", "quantity_kwh")
)

print("Rows in Spark DF:", sdf_tidy.count())


25/10/21 12:33:07 WARN TaskSetManager: Stage 0 contains a task of very large size (4250 KiB). The maximum recommended task size is 1000 KiB.


Rows in Spark DF: 1076765


                                                                                