# Silver: Space Weather (GFZ + NMDB)

This notebook builds the **Silver** table for the Space Weather domain by transforming **Bronze** extracts (created by `0_bronze.ipynb`) into a **normalized, windowed, long-form** dataset.

#### What “30-minute windows” means here
Each row represents a metric value computed over a fixed time window:
- `window_start_utc` = the **UTC start timestamp** of the window
- `window_duration_s` = the **window length in seconds** (constant `1800`, i.e., 30 minutes)
- The window covers `[window_start_utc, window_start_utc + 1800s)`

This is important because **multiple sources and metrics** share the same temporal grid, enabling consistent downstream aggregations (daily averages, joins, anomaly checks, etc.).
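To make the half-open interval concrete, here is a minimal plain-Python sketch of how a timestamp maps onto the 30-minute grid. The helper name `window_bounds` is hypothetical. The notebook itself does not floor timestamps: it assumes the Bronze sources already report values on this grid.

```python
from datetime import datetime, timedelta, timezone

WINDOW_DURATION_S = 1800  # 30 minutes, same value as window_duration_s


def window_bounds(ts: datetime, duration_s: int = WINDOW_DURATION_S):
    """Return (start, end) of the half-open window [start, end) containing ts."""
    epoch_s = int(ts.timestamp())
    # Floor to the nearest multiple of the window length (grid anchored at the Unix epoch)
    start_s = epoch_s - (epoch_s % duration_s)
    start = datetime.fromtimestamp(start_s, tz=timezone.utc)
    return start, start + timedelta(seconds=duration_s)


start, end = window_bounds(datetime(2024, 1, 1, 10, 47, 12, tzinfo=timezone.utc))
print(start.isoformat(), "->", end.isoformat())
```

A reading taken at 10:47:12 UTC belongs to the window starting at 10:30:00 UTC; a reading at exactly 11:00:00 UTC belongs to the next window, because the end bound is exclusive.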

#### Inputs
- **GFZ** Bronze files (geomagnetic indices; currently focused on `Hp30`)
- **NMDB** Bronze files (station readings provided as raw ASCII lines; parsed via regex)

#### Output (Silver schema)
| column | type | meaning |
|---|---|---|
| `window_start_utc` | timestamp | window start (UTC) |
| `source` | string | `GFZ` or `NMDB` |
| `metric` | string | index/station identifier (e.g., `Hp30`, `OULU`) |
| `value` | double | numeric value for that window |
| `window_duration_s` | int | window size in seconds (constant `1800`) |

#### Processing overview
1. Read Bronze files per configured task (`VAR["bronze_files"]`)
2. Parse + normalize into the common schema above
3. Apply data-quality rules (nulling invalid values, collapsing duplicates, avoiding re-inserts)
4. Append into the Silver Delta destination (`VAR["silver_output"]`)

#### How to run / configure
- Attach to a Databricks cluster (Spark + Delta available).
- Edit `global_variables()`:
  - set your storage/container paths (if needed)
  - provide one Bronze file path per task in `bronze_files` (missing/blank entries are skipped)
- Run all cells top-to-bottom.


## Libraries & runtime assumptions

This notebook intentionally keeps dependencies minimal:

- **Standard library**: lightweight parsing and helpers (e.g., regex).
- **Spark (PySpark)**: all transformations are executed as Spark DataFrame operations.
- **Delta Lake**: the Silver output is written as a **Delta** dataset.

**Assumes Databricks runtime** (i.e., an active `spark` session and Delta support available on the cluster).



In [0]:
import re
import json

from pyspark.sql import DataFrame
from delta.tables import DeltaTable

from pyspark.sql import types as T
from pyspark.sql import functions as F

## Pipeline building blocks

This section defines the core functions that the notebook wires together later. The cell at the end should read like a short recipe because the heavy lifting lives here.

#### What’s defined below

- **Persistence helpers**
  Ensure the Silver Delta destination exists and write/append in a re-runnable way.

- **Source readers (Bronze → Silver shape)**
  GFZ and NMDB parsing/shaping logic that produces the long-form, windowed table.

- **Schema normalization**
  Standardize timestamps and column naming so all sources converge to:
  `window_start_utc, source, metric, value, window_duration_s`.

- **Data-quality transforms**
  Small Spark transforms that enforce deterministic rules (e.g., null known-invalid values, collapse duplicates) without changing the schema.

#### Design principles
- **One job per function** (easy to test, easy to chain).
- **Spark-first** (DataFrame transforms, no hidden side effects).
- **Deterministic + rerunnable** (re-execution shouldn’t create inconsistent Silver output).

The flow later on is intentionally boring: configure → read → normalize → (DQ rules) → write.

### Global configuration

All notebook settings live in `VAR = global_variables()`:

- **Paths**: per-tier `container` roots (bronze/silver/gold) and the final `silver_output`
- **Inputs**: `bronze_files` mapping task → Bronze file path (missing/blank paths are skipped); in a Jobs run, `container` and `bronze_files` come from the upstream `Bronze` task values
- **Constants / rules**: `window_duration_s = 1800` and the allowed Hp30 value range (`gfz_hp30_range`)

Change this cell when you switch environments or adjust scope—functions should not hardcode paths or task lists.
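As an illustration of the "missing/blank paths are skipped" rule, the sketch below filters a `bronze_files`-style mapping in plain Python. The helper `non_blank_paths`, the task names, and the example path are hypothetical; the notebook applies the same check inline in its main loop.

```python
def non_blank_paths(bronze_files: dict) -> dict:
    """Keep only entries whose path is a non-empty, non-whitespace string."""
    return {
        task: path
        for task, path in bronze_files.items()
        if path and str(path).strip()
    }


# Hypothetical task names and paths, for illustration only
example = {
    "gfz": "bronze/gfz_index-Hp30_example.csv",
    "nmdb": "   ",   # blank: skipped
    "other": None,   # missing: skipped
}
print(non_blank_paths(example))
```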


In [0]:
def global_variables():
    """Build the global configuration dictionary used throughout the notebook.

    Returns:
        dict: The VAR configuration dictionary (paths, value ranges, window duration).
    """
    try:
        # This will work in a Jobs run (taskValues are only available there)
        bronze_output = dbutils.jobs.taskValues.get(taskKey="Bronze", key="bronze_output")
        # Both values are dictionaries downstream, so default to an empty dict
        container = bronze_output.get("container", {})
        bronze_files = bronze_output.get("output_bronze", {})
    except Exception:
        # Not in a Jobs run (e.g., interactive notebook) or task value not available, for testing
        tiers = ["bronze", "silver", "gold"]
        container = {tier: f"abfss://{tier}@alexccrv0dcn.dfs.core.windows.net" for tier in tiers}
        bronze_files = {
            
            }

    VAR = {
        "container": container,
        "bronze_files": bronze_files,
        "silver_output": "/".join([container["silver"], "space_weather"]),
        "gfz_hp30_range": (0.0, 9.0),
        "window_duration_s": 1800,
    }
    return VAR
VAR = global_variables()

### Helpers

Utility functions that make the run safe and repeatable:

- Ensure the **Silver Delta** destination exists with the expected schema

This keeps the main flow clean: configure → ensure destination → transform → write.
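The destination check boils down to a three-way decision. The following plain-Python sketch captures it with two booleans standing in for the Delta check and the path listing; the function name `destination_action` and its return labels are hypothetical.

```python
def destination_action(is_delta: bool, path_exists: bool) -> str:
    """Decide what to do with the Silver destination.

    is_delta    -- stands in for DeltaTable.isDeltaTable(spark, path)
    path_exists -- stands in for a successful dbutils.fs.ls(path)
    """
    if is_delta:
        return "reuse"   # already a Delta table: nothing to do
    if not path_exists:
        return "create"  # nothing there yet: initialize an empty Delta table
    # Something non-Delta lives at the path: refuse to overwrite it
    raise ValueError("Path exists but is not a Delta table")


print(destination_action(is_delta=True, path_exists=True))
print(destination_action(is_delta=False, path_exists=False))
```

Failing on the third case is a safety choice: writing over an unknown directory could destroy data that does not belong to this pipeline.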

In [0]:
def ensure_silver_delta():
    """Ensure the Silver Delta destination exists and is initialized.

    Creates the Delta location defined by VAR["silver_output"] if missing and initializes
    an empty Delta table with the expected schema.
    """
    silver_path = VAR["silver_output"]

    schema = T.StructType([
        T.StructField("window_start_utc", T.TimestampType(), True),
        T.StructField("source", T.StringType(), True),
        T.StructField("metric", T.StringType(), True),
        T.StructField("value", T.DoubleType(), True),
        T.StructField("window_duration_s", T.IntegerType(), True),
    ])

    # 1) If it's already a Delta table, we're done
    if DeltaTable.isDeltaTable(spark, silver_path):
        return

    # 2) If path exists but isn't Delta, fail (safety)
    try:
        dbutils.fs.ls(silver_path)
    except Exception:
        # Doesn't exist (or not listable) → create empty Delta table
        (
            spark.createDataFrame([], schema)
            .write.format("delta")
            .mode("overwrite")
            .save(silver_path)
        )
        return

    # Exists but not Delta
    raise ValueError(f"Path exists but is not a Delta table: {silver_path}")


### Read Bronze: GFZ

Loads the **GFZ** Bronze extract and converts it into the standard Silver-ready shape (windowed, long-form).

- Input: Bronze GFZ file(s) for the configured task(s)
- Output columns: `window_start_utc, source, metric, value, window_duration_s`
- Notes: focuses on the configured geomagnetic index (e.g., `Hp30`) and preserves the 30-minute window grid.
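The two string-level steps of the GFZ reader can be tried out without Spark. This is a minimal sketch, assuming the index name sits in an `_index-<name>_` segment of the file path and that timestamps look like `YYYY-MM-DDTHH:MM:SSZ`. The helper names and the example path are hypothetical.

```python
import re
from datetime import datetime

INDEX_PATTERN = r"_index-([^_]+)_"


def gfz_metric_from_path(path: str) -> str:
    """Extract the index name (e.g. 'Hp30') from a GFZ Bronze file path."""
    match = re.search(INDEX_PATTERN, path)
    if match is None:
        raise ValueError(f"Could not extract GFZ index from path: {path}")
    return match.group(1)


def parse_gfz_timestamp(text: str) -> datetime:
    """Parse 'YYYY-MM-DDTHH:MM:SSZ' into a naive datetime interpreted as UTC."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")


# Hypothetical file name, for illustration only
print(gfz_metric_from_path("bronze/gfz_index-Hp30_example.csv"))
print(parse_gfz_timestamp("2024-01-01T00:30:00Z"))
```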

In [0]:
def readbc_gfz(gfz_path: str) -> DataFrame:
    """Read one GFZ Bronze file and return canonical Silver-shaped rows.

    Args:
        gfz_path (str): Bronze input path for a GFZ export.

    Returns:
        DataFrame: Canonical long-form rows with columns:
            window_start_utc, source, metric, value, window_duration_s.

    Raises:
        ValueError: If the metric cannot be inferred from the file path or the
            input schema is not as expected.
    """
    index_pattern = r"_index-([^_]+)_"
    UTCZ_pattern  = r"^(\d{4}-\d{2}-\d{2})T(\d{2}:\d{2}:\d{2})Z$"
    UTCZ_replace  = r"$1 $2"

    # Infer the index name (metric) from the path; fail loudly if it is absent
    match = re.search(index_pattern, gfz_path)
    if not match:
        raise ValueError(f"Could not extract GFZ index from path: {gfz_path}")

    idx = match.group(1)
    
    df = spark.read.option("header", "true").csv(gfz_path)
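    # Validate the expected layout: a 'datetime' column plus exactly one value column
    if "datetime" not in df.columns:
        raise ValueError(f"Expected a 'datetime' column in GFZ CSV for {idx}.")
    value_cols = [c for c in df.columns if c != "datetime"]
    if len(value_cols) != 1:
        raise ValueError(
            f"Expected exactly one value column in GFZ CSV for {idx}, found: {value_cols}"
        )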

    df = (
        df.withColumnRenamed("datetime", "window_start_utc")
            .withColumn("window_start_utc",  F.regexp_replace("window_start_utc", UTCZ_pattern, UTCZ_replace))
            .withColumn("window_start_utc",  F.to_timestamp(F.col("window_start_utc"), "yyyy-MM-dd HH:mm:ss"))
            .withColumn("source",F.lit("GFZ"))
            .withColumn("metric",F.lit(idx))
            .withColumnRenamed(value_cols[0], "value")
            .withColumn("window_duration_s",F.lit(VAR["window_duration_s"]))
            .withColumn("value",F.col("value").cast("double"))
    )

    return df

### Read Bronze: NMDB

Loads the **NMDB** Bronze extract (raw ASCII lines) and parses it into the standard Silver-ready shape.

- Input: Bronze NMDB file(s) with raw text lines
- Parsing: regex-based extraction of timestamp + station values
- Output columns: `window_start_utc, source, metric, value, window_duration_s`
- Notes: `metric` is the **station code** (e.g., `OULU`), inferred from the `_stations-<code>_` segment of the file path; lines that do not match the timestamp/value pattern are dropped.
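The line pattern can be exercised in plain Python before running it through Spark. The sketch below uses the same regular expression as the reader; the helper `parse_nmdb_line` and the sample lines are hypothetical and only approximate what an NMDB ASCII export may contain.

```python
import re
from typing import Optional, Tuple

NMDB_LINE_PATTERN = re.compile(
    r"^\s*(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s*[;\s]+\s*(-?\d+(?:\.\d+)?)"
)


def parse_nmdb_line(line: str) -> Optional[Tuple[str, float]]:
    """Return (timestamp_text, value) for a data line, or None for anything else."""
    match = NMDB_LINE_PATTERN.match(line)
    if match is None:
        return None  # header, comment, or a line without a numeric value
    return match.group(1), float(match.group(2))


# Hypothetical sample lines, for illustration only
print(parse_nmdb_line("2024-01-01 00:00:00;105.3"))
print(parse_nmdb_line("# start_date_time   OULU"))
print(parse_nmdb_line("2024-01-01 00:30:00;   null"))
```

Lines that return `None` here correspond to rows for which `regexp_extract` yields an empty string in Spark, which the reader then filters out.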

In [0]:
def readbc_nmdb(nmdb_path: str) -> DataFrame:
    """Read one NMDB Bronze file and return canonical Silver-shaped rows.

    Args:
        nmdb_path (str): Bronze input path for an NMDB station export.

    Returns:
        DataFrame: Canonical long-form rows with columns:
            window_start_utc, source, metric, value, window_duration_s.

    Raises:
        ValueError: If the station/metric cannot be inferred from the file path or
            the ASCII payload cannot be parsed.
    """
    station_pattern = r"_stations-([^_]+)_"
    NMDB_LINE_PATTERN = r"^\s*(\d{4}-\d{2}-\d{2}\s+\d{2}:\d{2}:\d{2})\s*[;\s]+\s*(-?\d+(?:\.\d+)?)"

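    # Infer the station code (metric) from the path; fail loudly if it is absent
    match = re.search(station_pattern, nmdb_path)
    if not match:
        raise ValueError(f"Could not extract NMDB station from path: {nmdb_path}")
    station = match.group(1)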

    sourcefile = spark.read.text(nmdb_path).withColumnRenamed("value", "line")

    parsed = (
        sourcefile
        .withColumn("window_start_utc",  F.regexp_extract(F.col("line"), NMDB_LINE_PATTERN, 1))
        .withColumn("value",  F.regexp_extract(F.col("line"), NMDB_LINE_PATTERN, 2))
        .filter((F.col("window_start_utc").isNotNull()) & (F.col("window_start_utc") != ""))
        .filter((F.col("value").isNotNull()) & (F.col("value") != ""))
    )
   
    df = (
        parsed
            .drop("line")
            .withColumn("source",F.lit("NMDB"))
            .withColumn("metric",F.lit(station))
            .withColumn("window_duration_s",F.lit(VAR["window_duration_s"]))
            .withColumn("window_start_utc",  F.to_timestamp(F.col("window_start_utc"), "yyyy-MM-dd HH:mm:ss"))
            .withColumn("value",F.col("value").cast("double"))
            
    )
    
    return df


### Data quality rules

This section applies **lightweight, deterministic** cleanup rules before writing to Silver.

#### Goals
- Remove or neutralize values that are **known-invalid** for a given source/metric.
- Keep the dataset **safe to aggregate** (daily means, joins across sources, anomaly detection).
- Preserve raw coverage while preventing invalid points from contaminating downstream results.

#### Approach
- Rules are implemented as **small Spark transforms** that only modify `value` (typically by setting invalid values to `null`).
- Rules are **source/metric-specific** (e.g., NMDB vs GFZ Hp30).
- Duplicate windows are handled explicitly to ensure **one value per** `(window_start_utc, source, metric)`.

#### Output guarantee
After this section, the dataset should satisfy:
- stable schema: `window_start_utc, source, metric, value, window_duration_s`
- one row per window/source/metric (after duplicate collapse)
- invalid values are nulled rather than silently kept
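The semantics of these rules can be illustrated on plain Python tuples, independent of Spark. This is a sketch under assumptions: the names `null_invalid` and `clean` are hypothetical, the Hp30 range mirrors `VAR["gfz_hp30_range"]`, and the sample rows are made up. In this sketch invalid values are nulled before duplicates are collapsed, so a valid duplicate can still win.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Key = Tuple[str, str, str]          # (window_start_utc, source, metric)
Row = Tuple[str, str, str, Optional[float]]
HP30_RANGE = (0.0, 9.0)             # mirrors VAR["gfz_hp30_range"]


def null_invalid(source: str, metric: str, value: Optional[float]) -> Optional[float]:
    """Replace known-invalid values with None; leave everything else untouched."""
    if value is None:
        return None
    if source == "NMDB" and value < 0:
        return None
    if source == "GFZ" and metric == "Hp30":
        if not (HP30_RANGE[0] <= value <= HP30_RANGE[1]):
            return None
    return value


def clean(rows: Iterable[Row], existing_keys: Set[Key]) -> Dict[Key, Optional[float]]:
    """Null invalid values, keep the first non-null value per key, skip known keys."""
    collapsed: Dict[Key, Optional[float]] = {}
    for window_start, source, metric, value in rows:
        key = (window_start, source, metric)
        if key in existing_keys:
            continue  # already in Silver: do not re-insert
        value = null_invalid(source, metric, value)
        if collapsed.get(key) is None:
            collapsed[key] = value
    return collapsed


rows = [
    ("2024-01-01 00:00:00", "NMDB", "OULU", -1.0),   # invalid, nulled
    ("2024-01-01 00:00:00", "NMDB", "OULU", 105.3),  # valid duplicate, kept
    ("2024-01-01 00:00:00", "GFZ", "Hp30", 12.0),    # out of range, nulled
    ("2024-01-01 00:30:00", "GFZ", "Hp30", 2.333),   # key already in Silver
]
existing = {("2024-01-01 00:30:00", "GFZ", "Hp30")}
print(clean(rows, existing))
```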

In [0]:

def dq_null_negative_nmdb(df: DataFrame) -> DataFrame:
    """Null-out invalid negative NMDB values.

    Args:
        df (DataFrame): Canonical Silver-shaped DataFrame.

    Returns:
        DataFrame: Same schema as input with negative NMDB values replaced by null.
    """
    return df.withColumn(
        "value",
        F.when((F.col("source") == "NMDB") & (F.col("value") < 0), F.lit(None)).otherwise(F.col("value")),
    )

def dq_null_out_of_range_hp30(df: DataFrame) -> DataFrame:
    """Null-out GFZ Hp30 values outside the configured allowed range.

    Args:
        df (DataFrame): Canonical Silver-shaped DataFrame.

    Returns:
        DataFrame: Same schema as input with out-of-range Hp30 values replaced by null.
    """
    min_val, max_val = VAR["gfz_hp30_range"]
    cond = (
        (F.col("source") == "GFZ")
        & (F.col("metric") == "Hp30")
        & ((F.col("value") < min_val) | (F.col("value") > max_val))
    )
    return df.withColumn("value", F.when(cond, F.lit(None)).otherwise(F.col("value")))

def dq_collapse_duplicates(df: DataFrame) -> DataFrame:
    """Collapse duplicate keys to a single row per (window_start_utc, source, metric).

    This pipeline treats duplicate rows for the same (window_start_utc, source, metric)
    as equivalent because they represent the same aggregation window. When duplicates
    exist, retaining *any* non-null value is acceptable.

    Args:
        df (DataFrame): Canonical Silver-shaped DataFrame.

    Returns:
        DataFrame: Canonical DataFrame with duplicate keys collapsed to one value.
    """
    return (
        df.groupBy("window_start_utc", "source", "metric")
          .agg(
              F.first("value", ignorenulls=True).alias("value"),
              F.first("window_duration_s", ignorenulls=True).alias("window_duration_s"),
          )
    )

def dq_filter_existing_keys(df: DataFrame) -> DataFrame:
    """Drop rows whose keys already exist in the Silver Delta table.

    Args:
        df (DataFrame): Canonical Silver-shaped DataFrame.

    Returns:
        DataFrame: Input rows excluding keys already present in VAR["silver_output"].
    """
    silver_path = VAR["silver_output"]
    try:
        existing = (
            spark.read.format("delta")
                 .load(silver_path)
                 .select("window_start_utc", "source", "metric")
                 .dropDuplicates()
        )
    except Exception:
        return df

    return df.join(existing, on=["window_start_utc", "source", "metric"], how="left_anti")

# Null invalid values first so that collapsing duplicates can still pick a valid one
VAR["data_quality_funcs"] = [
    dq_null_negative_nmdb,
    dq_null_out_of_range_hp30,
    dq_collapse_duplicates,
    dq_filter_existing_keys,
]


## Show Time

This is the **main execution** section of the notebook.

It runs the pipeline end-to-end:

1. Load `VAR` configuration (`global_variables()`)
2. Ensure the Silver Delta destination exists (`ensure_silver_delta()`)
3. Read Bronze inputs (GFZ / NMDB)
4. Standardize into the common Silver schema:
   `window_start_utc, source, metric, value, window_duration_s`
5. Apply data-quality rules and de-duplication
6. Write/append the final result to `VAR["silver_output"]`

The cells below are the “orchestration layer”: they call the functions defined above in the intended order and show the resulting DataFrames / write outcome.
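Routing relies on a naming convention: the part of the file name before the first underscore selects the reader. A minimal plain-Python sketch of that rule follows; the helper `task_from_path` and the example paths are hypothetical.

```python
def task_from_path(path: str) -> str:
    """Return the lowercase task prefix ('gfz' or 'nmdb') of a Bronze file path."""
    filename = path.split("/")[-1]
    task = filename.split("_")[0].lower()
    if task not in ("gfz", "nmdb"):
        raise ValueError(
            f"Unknown task prefix '{task}' in file '{filename}'. "
            "Expected 'gfz_*' or 'nmdb_*'."
        )
    return task


# Hypothetical paths, for illustration only
print(task_from_path("bronze/space_weather/GFZ_index-Hp30_example.csv"))
print(task_from_path("bronze/space_weather/nmdb_stations-OULU_example.txt"))
```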

In [0]:
bronze_files = VAR["bronze_files"]
silver_path = VAR["silver_output"]

# Ensure destination exists as a Delta table (empty if new)
ensure_silver_delta()

dfs = []
for bcfpath in bronze_files.values():
    if not bcfpath or not str(bcfpath).strip():
        continue
    
    filename = bcfpath.split("/")[-1]
    task = filename.split("_")[0].lower()
    match task:
        case "gfz":
            df = readbc_gfz(bcfpath)
        case "nmdb":
            df = readbc_nmdb(bcfpath)
        case _:
            raise ValueError(f"Unknown task prefix '{task}' in file '{filename}'. Expected 'gfz_*' or 'nmdb_*'.")

    # Keep a consistent final schema/order
    df = df.select("window_start_utc", "source", "metric", "value", "window_duration_s")

    dfs.append(df)

if dfs:
    outdf = dfs[0]
    for d in dfs[1:]:
        outdf = outdf.unionByName(d, allowMissingColumns=True)

    for dqfunc in VAR["data_quality_funcs"]:
        outdf = dqfunc(outdf)

    (
        outdf.write.format("delta")
        .mode("append")
        .save(silver_path)
    )

### Content Display

In [0]:
display(spark.read.format("delta").load(VAR["silver_output"]).limit(100))

### Saving Information for Next Notebooks

This section publishes the Silver output location as a Databricks job task value so that downstream notebooks (e.g., Gold) can pick it up.
The payload contains the container roots and the Silver Delta path written during the current run.

In [0]:
# Define the output payload
output_data = {
    "container": VAR["container"],
    "output_silver": VAR["silver_output"],
}

# Publish the dictionary as a task value for downstream tasks
dbutils.jobs.taskValues.set(key="silver_output", value=output_data)
