# Bronze ingestion: Space Weather (GFZ + NMDB)

This notebook ingests:
- **GFZ Potsdam** geomagnetic index time series (e.g., Hp30)
- **NMDB NEST** neutron monitor station data (ASCII)

Data is written to the **Bronze** layer as **single files** (not Spark partition folders).

**Environment:** Databricks (Spark + `dbutils` required)  
**Timestamps:** `YYYY-MM-DD` or `YYYY-MM-DDTHH:MM:SSZ` (UTC)  
**Workflow:** if the destination folder is empty → bootstrap (2025-01-01 → yesterday UTC); else → incremental update based on the latest saved file.

**Filename convention:** includes `start-YYYY-MM-DD` and `end-YYYY-MM-DD` so `update_data()` can infer the next missing range.
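
As a rough illustration of how the window can be read back from a filename, the sketch below uses a regular expression. `WINDOW_RE` and `window_from_filename` are hypothetical names that do not appear in the notebook, and the example filename only mimics the pattern used by the save functions.

```python
import re
from datetime import date
from typing import Tuple

# Hypothetical pattern (assumption): matches the start-/end- pair embedded in a filename.
WINDOW_RE = re.compile(r"start-(\d{4}-\d{2}-\d{2})_end-(\d{4}-\d{2}-\d{2})")


def window_from_filename(name: str) -> Tuple[date, date]:
    """Return the (start, end) dates embedded in a Bronze filename."""
    match = WINDOW_RE.search(name)
    if match is None:
        raise ValueError(f"No start-/end- window found in {name!r}")
    start_text, end_text = match.groups()
    return date.fromisoformat(start_text), date.fromisoformat(end_text)


# Illustrative filename following the convention; not a real file.
example = "gfz_index-Hp30_start-2025-01-01_end-2025-03-31_tag-20250401T060000Z.csv"
start, end = window_from_filename(example)
print(start, end)  # 2025-01-01 2025-03-31
```

The next missing range would then begin on the day after `end`.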

## How to run

1. Attach this notebook to a Databricks cluster with Spark + `dbutils`.
2. Edit `global_variables()` for your environment:
   - Update the `container`/paths if your storage account differs.
   - Update `tasks` to add or change sources:
     - Format is `"<source>_<id>"`, e.g., `gfz_Hp30` or `nmdb_OULU`.
     - Use more GFZ indices by adding more `gfz_<index>` entries.
     - Use more NMDB stations by adding more `nmdb_<station>` entries.
3. Run all cells.


## Libraries

Standard library + Databricks/Spark utilities. No third-party Python packages required.

In [None]:
import json

from typing import Tuple
from datetime import datetime, timezone, timedelta

from urllib.parse import urlencode
from urllib.request import urlopen

---

## Start Functions

### Global configuration

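Shared configuration lives in `VAR`:
- storage container roots per tier (`container`)
- output directories per task (`outputdirs`)
- base URLs for the GFZ and NMDB endpoints

The NMDB station allowlist is defined inside `getNMDBnest(...)`, and the mapping of source → import function (`save_gfz` / `save_nmdb`) is added to `VAR["importfuncs"]` once those functions are defined.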

In [None]:
def global_variables():
    """
    This function defines and returns a dictionary of key variables used in the script.

    Returns:
        dict: A dictionary containing key variables for configuration and use across the script.
    """

    tiers = ["bronze", "silver", "gold"]
    container = {tier: f"abfss://{tier}@alexccrv0dcn.dfs.core.windows.net" for tier in tiers}

    tasks = ["gfz_Hp30", "nmdb_JUNG1", "nmdb_OULU", "nmdb_ROME"]
    outputdirs = {task: "/".join([container["bronze"], task]) for task in tasks}

    VAR = {
        "container": container,
        "outputdirs": outputdirs,
        "GFZ_BASE_URL": "https://kp.gfz-potsdam.de/app/json/",
        "NMDB_BASE_URL": "https://www.nmdb.eu/nest/draw_graph.php",
    }
    return VAR


VAR = global_variables()

### Helpers

Reusable utilities:
- `parse_utc(...)`: parse date-only or ISO-Z timestamps into UTC-aware datetimes
- `normalize_utc(...)`: normalize inputs to `YYYY-MM-DDTHH:MM:SSZ`

Date-only inputs expand to day boundaries:
- start → `00:00:00Z`
- end → `23:59:59Z` (or `end_seconds` if overridden)
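
The expansion rule above can be sketched in isolation. `expand_date_only` is a hypothetical stand-in used only for illustration; the notebook's own implementation is `parse_utc(...)` together with `normalize_utc(...)`.

```python
from datetime import datetime, time, timezone


def expand_date_only(day: str, is_start: bool, end_seconds: int = 59) -> str:
    """Expand a 'YYYY-MM-DD' string to a UTC day boundary in Zulu format."""
    parsed_day = datetime.strptime(day, "%Y-%m-%d").date()
    # Start of day is midnight; end of day is 23:59 plus the chosen seconds.
    boundary = time(0, 0, 0) if is_start else time(23, 59, end_seconds)
    stamp = datetime.combine(parsed_day, boundary, tzinfo=timezone.utc)
    return stamp.strftime("%Y-%m-%dT%H:%M:%SZ")


print(expand_date_only("2025-03-01", is_start=True))                  # 2025-03-01T00:00:00Z
print(expand_date_only("2025-03-01", is_start=False))                 # 2025-03-01T23:59:59Z
print(expand_date_only("2025-03-01", is_start=False, end_seconds=0))  # 2025-03-01T23:59:00Z
```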

In [None]:
def parse_utc(ts: str, is_start: bool = True, *, end_seconds: int = 59) -> datetime:
    """
    Parse a UTC timestamp string into a timezone-aware `datetime` (UTC).

    Accepts either a date-only string (`YYYY-MM-DD`) or a Zulu timestamp
    (`YYYY-MM-DDTHH:MM:SSZ`). For date-only inputs, expands the time to a window
    boundary determined by `is_start`.

    Args:
        ts (str): Timestamp in `YYYY-MM-DD` or `YYYY-MM-DDTHH:MM:SSZ` format (UTC).
        is_start (bool): If True and `ts` is date-only, use `00:00:00Z`. If False and
            `ts` is date-only, use `23:59:{end_seconds:02d}Z`.
        end_seconds (int): Second value (0-59) used when `is_start` is False and `ts`
            is date-only.

    Returns:
        datetime: Timezone-aware `datetime` in UTC.

    Raises:
        ValueError: If `ts` does not match an accepted format, contains invalid date/time
            components, or if `end_seconds` is outside 0-59.
    """

    # --- SETUP AND VALIDATION ---
    ts = ts.strip()

    if not (0 <= end_seconds <= 59):
        raise ValueError(f"end_seconds must be in 0..59. Got: {end_seconds!r}")

    # --- LOGIC ---
    if len(ts) == 10:
        base_dt = datetime.strptime(ts, "%Y-%m-%d").replace(tzinfo=timezone.utc)
        if is_start:
            return base_dt.replace(hour=0, minute=0, second=0)
        return base_dt.replace(hour=23, minute=59, second=end_seconds)

    if ts.endswith("Z"):
        return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)

    # --- NO ACCEPTED FORMAT MATCHED ---
    raise ValueError(
        "Invalid timestamp format. Use 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SSZ'. "
        f"Got: {ts!r}"
    )

def normalize_utc(ts: str, is_start: bool = True, *, end_seconds: int = 59) -> str:
    """
    Normalize a UTC timestamp string to ISO 8601 Zulu format.

    Args:
        ts (str): Timestamp in 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SSZ' (UTC).
        is_start (bool): Whether `ts` represents a start boundary (00:00:00Z) or
            an end boundary (23:59:end_secondsZ) when `ts` is date-only.
        end_seconds (int): Second value used for the end boundary when `ts` is
            date-only.

    Returns:
        str: Normalized UTC Zulu timestamp 'YYYY-MM-DDTHH:MM:SSZ'.
    """
    dt = parse_utc(ts, is_start=is_start, end_seconds=end_seconds)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")



## API fetchers

Low-level “download only” functions (no filesystem writes):
- validate inputs
- build request URLs
- return raw payloads / parsed tuples

GFZ returns parsed arrays (times, values, status).  
NMDB returns raw ASCII text.

### GFZ: geomagnetic disturbance indices

Fetches an index time series for a UTC window.

Notes:
- `index` controls which series is downloaded (e.g., Hp30)
- `status` may be supported only for some indices
- output is parsed into `(times, values, status)`
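
The request and parse steps can be sketched without touching the network. `build_gfz_url` and `parse_gfz_payload` are hypothetical helpers, and the sample payload is made up: it only mirrors the keys the fetcher reads (`datetime`, the index name, and optionally `status`), and its numbers are not real measurements.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

GFZ_BASE_URL = "https://kp.gfz-potsdam.de/app/json/"  # same value as VAR["GFZ_BASE_URL"]


def build_gfz_url(start_iso: str, end_iso: str, index: str, status: str = "all") -> str:
    """Build the query URL; `status` is only sent when it is 'def'."""
    params = {"start": start_iso, "end": end_iso, "index": index}
    if status == "def":
        params["status"] = "def"
    # safe=":" keeps the colons in the timestamps readable in the URL.
    return f"{GFZ_BASE_URL}?{urlencode(params, safe=':')}"


def parse_gfz_payload(payload: str, index: str):
    """Split a JSON payload into (times, values, status) tuples."""
    data = json.loads(payload)
    return (
        tuple(data.get("datetime", ())),
        tuple(data.get(index, ())),
        tuple(data.get("status", ())),  # empty when the payload has no status key
    )


url = build_gfz_url("2025-01-01T00:00:00Z", "2025-01-01T23:59:59Z", "Hp30", status="def")
print(parse_qs(urlsplit(url).query))

# Made-up payload for illustration only.
sample = json.dumps({
    "datetime": ["2025-01-01T00:00:00Z", "2025-01-01T00:30:00Z"],
    "Hp30": [1.0, 1.333],
})
times, values, flags = parse_gfz_payload(sample, "Hp30")
print(times, values, flags)
```

A missing `status` key yields an empty tuple rather than an error, so callers should not assume the three tuples have equal length.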

In [None]:
def getGFZindex(starttime: str, endtime: str,
                index: str, status: str = "all") -> Tuple[tuple, tuple, tuple]:
    """
    Fetch a GFZ Potsdam geomagnetic index time series for a UTC time window.

    Args:
        starttime (str): 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SSZ' (UTC).
        endtime (str): 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SSZ' (UTC).
        index (str): One of: Kp, ap, Ap, Cp, C9, Hp30, Hp60, ap30, ap60, SN, Fobs, Fadj.
        status (str): 'all' or 'def' (where supported). Default: 'all'.

    Returns:
        tuple[tuple[str, ...], tuple[object, ...], tuple[str, ...]]: (times, values, status).

    Raises:
        ValueError: Invalid inputs or endtime < starttime.
        RuntimeError: Request/parse failure.
    """

    # --- SETUP AND VALIDATION ---
    allowed_indices = {
        "Kp", "ap", "Ap", "Cp", "C9", "Hp30", "Hp60", "ap30", "ap60", "SN", "Fobs", "Fadj"
    }
    allowed_status = {"all", "def"}

    if index not in allowed_indices:
        raise ValueError(
            "Wrong index parameter. Allowed: "
            "'Kp','ap','Ap','Cp','C9','Hp30','Hp60','ap30','ap60','SN','Fobs','Fadj'. "
            f"Got: {index!r}"
        )

    if status not in allowed_status:
        raise ValueError(f"Wrong status parameter. Allowed: 'all', 'def'. Got: {status!r}")

    # Start: 00:00:00Z if date-only; End: 23:59:59Z if date-only (full-day coverage)
    d1 = parse_utc(starttime, is_start=True)
    d2 = parse_utc(endtime, is_start=False, end_seconds=59)

    if d1 > d2:
        raise ValueError(f"Start time must be <= end time. Got: {d1.isoformat()} > {d2.isoformat()}")

    # --- LOGIC ---
    time_string = (
        f"start={d1.strftime('%Y-%m-%dT%H:%M:%SZ')}"
        f"&end={d2.strftime('%Y-%m-%dT%H:%M:%SZ')}"
    )
    url = f"{VAR['GFZ_BASE_URL']}?{time_string}&index={index}"

    if status == "def":
        url += "&status=def"

    try:
        with urlopen(url, timeout=30) as resp:
            payload = resp.read().decode("utf-8")
        data = json.loads(payload)
    except Exception as e:
        raise RuntimeError(f"Failed to fetch/parse GFZ response. URL={url!r}. Error: {e}") from e

    datetime_values = tuple(data.get("datetime", ()))
    index_values = tuple(data.get(index, ()))

    status_values = tuple(data.get("status", ()))
    
    # --- RETURN ---
    return datetime_values, index_values, status_values


### NMDB: NEST neutron monitor stations

Downloads station data for a UTC window as raw ASCII.

Notes:
- `station` must be in the allowlist (prevents typos and unexpected calls)
- payload is returned **as-is** (no parsing), so the save step can preserve the source format in Bronze
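
A minimal sketch of the allowlist check and query-string construction follows. `ALLOWED` is a shortened, illustrative subset of the station list and `nmdb_query` is a hypothetical helper; only three of the request parameters are shown.

```python
from urllib.parse import parse_qs, urlencode

# Shortened subset for illustration (assumption); the notebook keeps the full list.
ALLOWED = {"JUNG1", "OULU", "ROME"}


def nmdb_query(station: str, resolution_minutes: int = 30) -> str:
    """Validate the station code, then encode a reduced parameter set."""
    if station not in ALLOWED:
        raise ValueError(f"NMDB station not allowed: {station!r}")
    params = [
        ("stations[]", station),             # brackets get percent-encoded
        ("tresolution", str(resolution_minutes)),
        ("output", "ascii"),
    ]
    return urlencode(params, doseq=True)


query = nmdb_query("OULU")
print(query)
print(parse_qs(query))  # decodes back to the original parameter names
```

Rejecting unknown station codes before building the URL means a typo fails locally instead of producing an unexpected remote request.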

In [None]:
def getNMDBnest(starttime: str, endtime: str, station: str) -> str:
    """
    Fetch NMDB NEST neutron monitor data (ASCII) for one station over a UTC window.

    Args:
        starttime (str): Start time in 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SSZ' (UTC).
        endtime (str): End time in 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SSZ' (UTC).
        station (str): NMDB station code (must be in the allowed station list).

    Returns:
        str: Raw ASCII response text returned by the NMDB NEST endpoint.

    Raises:
        ValueError: If timestamps are invalid, endtime < starttime, or station is not
            allowed.
        urllib.error.URLError: If the request fails (network/DNS/timeout).
        urllib.error.HTTPError: If NMDB returns a non-2xx HTTP status.
    """
    # --- SETUP AND VALIDATION ---
    start_dt = parse_utc(starttime, is_start=True)
    end_dt = parse_utc(endtime, is_start=False)
    if end_dt < start_dt:
        raise ValueError("endtime must be >= starttime.")

    NMDB_ALLOWED_STATIONS = [
        "AATA","AATB","AHMD","APTY","ARNM","ATHN","BKSN","BUDA","CALG",
        "CALM","CHAC","CLMX","DJON","DOMB","DOMC","DRBS","DRHM","ESOI",
        "FSMT","HRMS","HUAN","ICRB","ICRO","INVK","IRK2","IRK3","IRKT",
        "JBGO","JUNG","JUNG1","KERG","KGSN","KIEL","KIEL2","LMKS","MCMU",
        "MCRL","MGDN","MOSC","MRNY","MWSB","MWSN","MXCO","NAIN","NANM",
        "NEU3","NEWK","NRLK","NVBK","OULU","PSNM","PTFM","PWNK","ROME",
        "SANB","SNAE","SOPB","SOPO","TERA","THUL","TSMB","TXBY","UFSZ",
        "YKTK","ZUGS",
    ]
    
    if station not in NMDB_ALLOWED_STATIONS:
        raise ValueError(
            f"NMDB station not allowed: {station}. "
            f"Allowed: {NMDB_ALLOWED_STATIONS}"
        )
    
    params = [
        ("formchk", "1"),
        ("stations[]", station),
        ("tabchoice", "revori"),
        ("dtype", "corr_for_pressure"),
        ("tresolution", "30"),
        ("date_choice", "bydate"),

        ("start_year", f"{start_dt.year:04d}"),
        ("start_month", f"{start_dt.month:02d}"),
        ("start_day", f"{start_dt.day:02d}"),
        ("start_hour", f"{start_dt.hour:d}"),
        ("start_min", f"{start_dt.minute:d}"),

        ("end_year", f"{end_dt.year:04d}"),
        ("end_month", f"{end_dt.month:02d}"),
        ("end_day", f"{end_dt.day:02d}"),
        ("end_hour", f"{end_dt.hour:d}"),
        ("end_min", f"{end_dt.minute:d}"),

        ("output", "ascii"),
        ("yunits", "0"),
        ("anomalous", "1"),
        ("display_null", "1"),
    ]

    # --- LOGIC ---
    url = VAR["NMDB_BASE_URL"] + "?" + urlencode(params, doseq=True)
    
    with urlopen(url, timeout=30) as resp:
        text = resp.read().decode("utf-8", errors="replace")

    # --- RETURN ---
    return text

## Saving data (single-file Bronze writes)

This section contains the persistence utilities and “fetch + persist” entrypoints:

- `write_single_file(...)`: forces **one output file** (Spark `coalesce(1)` → rename `part-*`)
- `save_gfz(...)`: fetches via `getGFZindex(...)`, serializes to CSV, writes a single file
- `save_nmdb(...)`: fetches via `getNMDBnest(...)`, writes raw ASCII as a single TXT file

All saved filenames embed `start-YYYY-MM-DD` and `end-YYYY-MM-DD` so incremental updates
can infer the next missing window from existing Bronze files.
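
The temp-directory → move → cleanup pattern can be tried outside Databricks with a local-filesystem analogue. This is not the Spark implementation: `write_single_file_local` is a hypothetical helper, and the `part-00000` file merely stands in for the `part-*` file Spark would produce.

```python
import shutil
import tempfile
from pathlib import Path


def write_single_file_local(dir_path: str, filename: str, content: str) -> str:
    """Stage content in a temp directory, then move it to dir_path/filename."""
    dest_dir = Path(dir_path)
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / filename

    # Unique staging directory next to the target (same filesystem, so the move is a rename).
    tmp_dir = Path(tempfile.mkdtemp(prefix=f"_tmp_{filename}_", dir=dest_dir))
    try:
        part = tmp_dir / "part-00000"  # stand-in for Spark's part-* output
        part.write_text(content, encoding="utf-8")
        part.replace(target)           # overwrites an existing target
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)  # always remove the staging directory
    return str(target)


with tempfile.TemporaryDirectory() as bronze:
    name = "demo_start-2025-01-01_end-2025-01-02.csv"
    written = write_single_file_local(bronze, name, "datetime,Hp30\n")
    listing = sorted(p.name for p in Path(bronze).iterdir())
    text = Path(written).read_text(encoding="utf-8")
    print(listing)
```

After the call, the destination holds only the named file; the staging directory is gone even if the write fails.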


In [None]:
def write_single_file(dir_path: str, filename: str, content: str) -> str:
    """
    Write a single text file to DBFS/ABFSS via Spark.

    Spark text writes produce a directory of part-files, so this helper writes to a
    temporary directory with `coalesce(1)`, moves the single `part-*` file to the
    requested `{dir_path}/{filename}`, then deletes the temp directory.

    Args:
        dir_path (str): Destination directory path (DBFS/ABFSS).
        filename (str): Target filename to create in `dir_path`.
        content (str): Full file contents to write.

    Returns:
        str: Full path to the written file.

    Raises:
        RuntimeError: If no `part-*` file is produced in the temporary directory.
    """

    dbutils.fs.mkdirs(dir_path)
    target = f"{dir_path.rstrip('/')}/{filename}"

    # Best-effort remove existing target
    try:
        dbutils.fs.rm(target)
    except Exception:
        pass

    # Unique temp dir (avoid collisions); uses current UTC timestamp
    tmp_tag = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    tmp_dir = f"{dir_path.rstrip('/')}/_tmp_{filename}_{tmp_tag}"

    # Spark write (single partition -> single part-* file)
    df = spark.createDataFrame([(content,)], ["value"])
    df.coalesce(1).write.mode("overwrite").text(tmp_dir)

    # Find the part file and move it to the desired filename
    part_files = [x.path for x in dbutils.fs.ls(tmp_dir) if x.name.startswith("part-")]
    if not part_files:
        raise RuntimeError(f"No part-* file found in {tmp_dir}")

    dbutils.fs.mv(part_files[0], target, True)
    dbutils.fs.rm(tmp_dir, True)

    return target

def save_gfz(outputdir: str, idx: str, startdate: str, enddate: str) -> str:
    """
    Fetch a GFZ geomagnetic index for a UTC window and write it as a single CSV file.

    Args:
        outputdir (str): Destination directory path (DBFS/ABFSS).
        idx (str): GFZ index name (e.g., "Hp30").
        startdate (str): 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SSZ' (UTC).
        enddate (str): 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SSZ' (UTC).

    Returns:
        str: Full path to the written CSV file.

    Raises:
        ValueError: If timestamps are invalid or enddate < startdate.
        RuntimeError: If the output file cannot be materialized as a single file.
    """

    start_iso = normalize_utc(startdate, is_start=True)
    end_iso   = normalize_utc(enddate, is_start=False)

    start_day = start_iso[:10]
    end_day   = end_iso[:10]

    run_tag = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

    dt_vals, idx_vals, _ = getGFZindex(start_iso, end_iso, idx, "def")

    csv_lines = [f"datetime,{idx}"]
    for d, v in zip(dt_vals, idx_vals):
        csv_lines.append(f"{d},{v}")

    filename = f"gfz_index-{idx}_start-{start_day}_end-{end_day}_tag-{run_tag}.csv"
    return write_single_file(outputdir, filename, "\n".join(csv_lines))


def save_nmdb(outputdir: str, station: str, startdate: str, enddate: str) -> str:
    """
    Fetch NMDB NEST data for one station over a UTC window and write it as a single TXT file.

    Args:
        outputdir (str): Destination directory path (DBFS/ABFSS).
        station (str): NMDB station code (must be allowed by `getNMDBnest`).
        startdate (str): 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SSZ' (UTC).
        enddate (str): 'YYYY-MM-DD' or 'YYYY-MM-DDTHH:MM:SSZ' (UTC).

    Returns:
        str: Full path to the written TXT file.

    Raises:
        ValueError: If timestamps are invalid, enddate < startdate, or station is invalid.
        urllib.error.URLError: If the request fails (network/DNS/timeout).
        urllib.error.HTTPError: If NMDB returns a non-2xx HTTP status.
        RuntimeError: If the output file cannot be materialized as a single file.
    """
        
    start_iso = normalize_utc(startdate, is_start=True)
    end_iso   = normalize_utc(enddate, is_start=False)

    start_day = start_iso[:10]
    end_day   = end_iso[:10]

    run_tag = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")

    text = getNMDBnest(start_iso, end_iso,station)

    filename = f"nmdb_stations-{station}_start-{start_day}_end-{end_day}_tag-{run_tag}.txt"
    return write_single_file(outputdir, filename, text)

### Gathering Functions (registry)

To keep orchestration generic, we register the **source-specific save functions** in `VAR['importfuncs']`
so they can be called indirectly later (by source key), e.g.:

- `save_gfz`  -> GFZ index writer
- `save_nmdb` -> NMDB station writer

This keeps the "which source do I call?" decision out of the orchestration logic and
lets the run loop dispatch by `source` + `idx` (where `idx` is the GFZ index or NMDB station code).
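
The dispatch idea can be sketched with stub writers. `fake_save_gfz`, `fake_save_nmdb`, `IMPORT_FUNCS`, and `dispatch` are hypothetical names; the stubs only return the path they would write, so nothing is fetched or saved.

```python
from typing import Callable, Dict


def fake_save_gfz(outputdir: str, idx: str, start: str, end: str) -> str:
    """Stub: return the path a GFZ writer would produce."""
    return f"{outputdir}/gfz_index-{idx}_start-{start}_end-{end}.csv"


def fake_save_nmdb(outputdir: str, idx: str, start: str, end: str) -> str:
    """Stub: return the path an NMDB writer would produce."""
    return f"{outputdir}/nmdb_stations-{idx}_start-{start}_end-{end}.txt"


# Registry keyed by source name, mirroring the shape of VAR["importfuncs"].
IMPORT_FUNCS: Dict[str, Callable[[str, str, str, str], str]] = {
    "gfz": fake_save_gfz,
    "nmdb": fake_save_nmdb,
}


def dispatch(task: str, outputdir: str, start: str, end: str) -> str:
    """Split '<source>_<id>' and call the registered writer for that source."""
    source, idx = task.split("_", 1)
    if source not in IMPORT_FUNCS:
        raise ValueError(f"Unknown source {source!r} in task {task!r}")
    return IMPORT_FUNCS[source](outputdir, idx, start, end)


print(dispatch("nmdb_OULU", "/bronze/nmdb_OULU", "2025-01-01", "2025-01-02"))
```

Adding a new source then means registering one more function, with no change to the calling loop.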

In [None]:
VAR["importfuncs"] = {
    "gfz":save_gfz,
    "nmdb":save_nmdb,
}

### Run: bootstrap or update

For each configured task:
- if the destination folder is empty → bootstrap
- otherwise → incremental update

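Existing files are never rewritten, so Bronze stays append-only. Reruns are idempotent at the “file per window” level: once a window ending yesterday (UTC) has been saved, a rerun on the same day finds no missing range and writes nothing.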
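
The window selection can be sketched as a pure function over the end dates already present. `plan_window` and `BOOTSTRAP_START` are hypothetical names, and the current date is passed in explicitly so the logic can be checked without a clock.

```python
from datetime import date, timedelta
from typing import Iterable, Optional, Tuple

BOOTSTRAP_START = date(2025, 1, 1)  # first day of the bootstrap window


def plan_window(saved_ends: Iterable[date], today_utc: date) -> Optional[Tuple[date, date]]:
    """Return the next (start, end) window to fetch, or None if nothing is missing."""
    end = today_utc - timedelta(days=1)  # always stop at yesterday (UTC)
    ends = list(saved_ends)
    if ends:
        # Resume on the day after the latest saved end date.
        start = max(max(ends) + timedelta(days=1), BOOTSTRAP_START)
    else:
        start = BOOTSTRAP_START          # empty folder: bootstrap
    return (start, end) if start <= end else None


today = date(2025, 3, 10)
print(plan_window([], today))                                      # bootstrap window
print(plan_window([date(2025, 2, 1), date(2025, 3, 1)], today))    # incremental window
print(plan_window([date(2025, 3, 9)], today))                      # None: already up to date
```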

In [None]:
def bootstrap_2025tonow(importfunc,outputdir,idx):
    """
    Backfill a full 2025-to-yesterday (UTC) window into Bronze for one source/index.

    Computes the window `2025-01-01T00:00:00Z` to end-of-yesterday (UTC), normalizes
    timestamps with `normalize_utc()`, then calls `importfunc(outputdir, idx, start_iso, end_iso)`.

    Args:
        importfunc (Callable): Import/write function (e.g., `save_gfz` or `save_nmdb`).
        outputdir (str): Destination directory path (DBFS/ABFSS).
        idx (str): Index/station identifier passed through to `importfunc`.

    Returns:
        Any: Whatever `importfunc` returns (typically the written file path).

    Raises:
        ValueError: If the computed end timestamp is before 2025-01-01.
    """
    # --- window: 2025-01-01 .. yesterday (UTC) ---
    startdate = "2025-01-01"
    enddate = (datetime.now(timezone.utc) - timedelta(days=1)).date().isoformat()

    # Normalize once so both boundaries are validated before any request is made
    start_iso = normalize_utc(startdate, is_start=True)
    end_iso = normalize_utc(enddate, is_start=False)

    if end_iso < "2025-01-01T00:00:00Z":
        raise ValueError(f"Computed end={end_iso} is before 2025-01-01; refusing to bootstrap.")

    return importfunc(outputdir,idx,start_iso, end_iso)

def update_data(importfunc,outputdir,idx):
    """
    Incrementally fetch new data since the latest saved file and append it to Bronze.

    Scans `outputdir` for existing files, extracts each file's end date from the
    filename pattern containing `end-<YYYY-MM-DD>`, then sets the next start date
    to (latest_end + 1 day). Uses yesterday (UTC) as the update end date and calls
    `importfunc(outputdir, idx, start_iso, end_iso)` when updates are needed.

    Args:
        importfunc (Callable): Import/write function (e.g., `save_gfz` or `save_nmdb`).
        outputdir (str): Destination directory path (DBFS/ABFSS).
        idx (str): Index/station identifier passed through to `importfunc`.

    Returns:
        Any | None: `importfunc` result (typically a written file path), or None if
            no update is needed.

    Raises:
        ValueError: If timestamps or filename-derived dates cannot be parsed.
        Exception: If listing `outputdir` or `importfunc` execution fails.
    """

    startdate = parse_utc("2025-01-01",is_start=True)
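    for file in dbutils.fs.ls(outputdir):
        # Fail with a clear ValueError if a file does not follow the naming convention.
        if "end-" not in file.name:
            raise ValueError(f"Cannot infer end date from filename: {file.name!r}")
        file_enddate = parse_utc(file.name.split("end-")[1].split("_")[0], is_start=False)
        file_nextday = parse_utc(
            (file_enddate + timedelta(days=1)).date().isoformat(),
            is_start=True,
        )

        startdate = max(startdate, file_nextday)
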
        
    enddate = parse_utc(
        (datetime.now(timezone.utc) - timedelta(days=1)).date().isoformat(),
        is_start=False,
    )

    start_iso = startdate.strftime("%Y-%m-%dT%H:%M:%SZ")
    end_iso = enddate.strftime("%Y-%m-%dT%H:%M:%SZ")

    if enddate < startdate:
        print(f"{outputdir} is already up to date; the next window would start at {start_iso}")
        return None
    
    return importfunc(outputdir,idx,start_iso, end_iso)


## Show time

The code below is the **main driver** for this notebook.

It iterates over `VAR["outputdirs"]`, derives `source` + `idx` from the task name,
checks whether the destination folder is empty, and then dispatches to:
- `bootstrap_2025tonow(...)` for first-time backfills (2025 → yesterday UTC)
- `update_data(...)` for incremental ingestion since the latest saved end date

At the end it prints the written paths (bootstrap and/or update) per task.
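
The control flow of the driver can be sketched against an in-memory stand-in for storage. `run_tasks`, `fake_fs`, and `list_dir` are hypothetical; the bootstrap and update callables are stubs that return a label instead of writing anything.

```python
from typing import Callable, Dict, List, Tuple


def run_tasks(
    outputdirs: Dict[str, str],
    list_dir: Callable[[str], List[str]],
    bootstrap: Callable[[str, str], str],
    update: Callable[[str, str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run every task once and collect results per mode."""
    bootstrapped: Dict[str, str] = {}
    updated: Dict[str, str] = {}
    for task, outputdir in outputdirs.items():
        try:
            is_empty = len(list_dir(outputdir)) == 0
        except Exception:
            is_empty = True  # missing or unlistable folder counts as empty
        if is_empty:
            bootstrapped[task] = bootstrap(task, outputdir)
        else:
            updated[task] = update(task, outputdir)
    return bootstrapped, updated


# In-memory stand-in for storage: only the GFZ folder already holds a file.
fake_fs = {"/bronze/gfz_Hp30": ["gfz_index-Hp30_start-2025-01-01_end-2025-03-01.csv"]}


def list_dir(path: str) -> List[str]:
    if path not in fake_fs:
        raise FileNotFoundError(path)
    return fake_fs[path]


outputdirs = {"gfz_Hp30": "/bronze/gfz_Hp30", "nmdb_OULU": "/bronze/nmdb_OULU"}
bootstrapped, updated = run_tasks(
    outputdirs,
    list_dir,
    bootstrap=lambda task, path: f"{path}/bootstrap",
    update=lambda task, path: f"{path}/update",
)
print(bootstrapped)
print(updated)
```

Because both result dictionaries are created once before the loop, the summary covers every task rather than only the last one processed.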


In [None]:
# Collect results across all tasks (initialized once, before the loop,
# so earlier tasks are not discarded on each iteration).
bootstrap_message = {}
updated_message = {}

for task, outputdir in VAR["outputdirs"].items():

    # Split only on the first underscore so the id itself may contain "_".
    source, idx = task.split("_", 1)

    try:
        flag_empty = len(dbutils.fs.ls(outputdir)) == 0
    except Exception:
        flag_empty = True  # missing/unlistable -> treat as empty

    if flag_empty:
        bootstrap_message[task] = bootstrap_2025tonow(VAR["importfuncs"][source], outputdir, idx)
    else:
        updated_message[task] = update_data(VAR["importfuncs"][source], outputdir, idx)
    
print("Bootstrap complete. Written paths:")
for k, v in bootstrap_message.items():
    print(f"- {k}: {v}")

print("\nUpdate complete. Written paths:")
for k, v in updated_message.items():
    print(f"- {k}: {v}")