Migrate cd_fetch() to DestinE Earth Data Hub Zarr #36

@NewGraphEnvironment

Problem

cd_fetch() currently pulls all climate variables from the CDS API via the ecmwfr package. Today's experience (#33) and benchmark (#35) showed:

  • CDS has aggressive rate limits — sustained ~10 files/hour after the initial ~70-file allowance
  • Every request queues server-side and counts against quota, even when polling fails
  • The tmax/tmin backfill will take ~3 days of babysitting to complete via CDS
  • CDS returns GRIB per-month per-variable, requiring 912 requests per variable for the 1950-2025 backfill
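The 912 figure follows from one request per month over 76 calendar years (1950 through 2025 inclusive). A quick check, plus a rough duration estimate at the rate limits listed above. The helper names are hypothetical, and treating the initial ~70-file allowance as instantaneous is a simplifying assumption:

```python
def monthly_requests(first_year: int, last_year: int) -> int:
    """One request per calendar month, both end years inclusive."""
    return (last_year - first_year + 1) * 12


def cds_hours(n_requests: int, allowance: int = 70, per_hour: float = 10.0) -> float:
    """Hours spent at the sustained rate once the initial allowance is used up.

    Simplification: the first `allowance` files are treated as free.
    """
    return max(n_requests - allowance, 0) / per_hour


n = monthly_requests(1950, 2025)
print(n)                              # 912
print(round(cds_hours(n) / 24, 1))    # 3.5 (days, one full variable from scratch)
```

The real duration depends on how many months are still missing, so this is only an order-of-magnitude check on the multi-day estimate.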

Solution

Migrate cd_fetch() to DestinE Earth Data Hub (EDH) Zarr as the primary data source.

Validated in #35 (benchmark):

  • Same product (ERA5-Land, 9 km native, 1950-present)
  • Same license (CC-BY 4.0, commercial OK)
  • One Zarr store contains all 50 ERA5-Land variables
  • 15.9 s per month for the BC bounding box vs ~80 s per month via CDS
  • 500K requests/month quota — effectively unlimited for our use
  • No queueing, no polling, no rate-limit babysitting
  • Full backfill ~4 hours unattended vs ~3 days via CDS

Zarr URL: https://data.earthdatahub.destine.eu/era5/reanalysis-era5-land-no-antartica-v0.zarr
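The ~4 hour backfill figure can be reproduced from the per-month benchmark timing. This sketch assumes the 15.9 s figure holds for every month of the 1950-2025 range; the constant and function names are illustrative only:

```python
SECONDS_PER_MONTH_EDH = 15.9   # benchmark figure from #35, BC bbox
SECONDS_PER_MONTH_CDS = 80.0   # approximate CDS figure from #35


def backfill_hours(n_months: int, seconds_per_month: float) -> float:
    """Total wall-clock hours if every month takes the same time."""
    return n_months * seconds_per_month / 3600.0


n_months = (2025 - 1950 + 1) * 12   # 912 months
print(round(backfill_hours(n_months, SECONDS_PER_MONTH_EDH), 1))  # 4.0
print(round(backfill_hours(n_months, SECONDS_PER_MONTH_CDS), 1))  # 20.3
```

The CDS number here covers transfer time only. The ~3 day estimate comes from the rate limit, not from the per-request time.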

Scope

Phase 1: New EDH fetcher alongside CDS

  • Add cd_fetch_edh() or refactor cd_fetch() with a source parameter
  • Use xarray + zarr via reticulate, OR pure R via stars::read_mdim() (GDAL zarr driver) — evaluate both
  • Token via EDH_TOKEN env var (already in ~/.Renviron)
  • Variable mapping: EDH t2m → our tmean/tmax/tmin inputs, tp → prcp, d2m → dewpoint, swvl1-4 → soil_moisture, etc.
  • Maintain existing output format (monthly COG or intermediate NetCDF) so downstream stages are untouched
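A minimal sketch of the Phase 1 configuration pieces, written in Python on the assumption that the reticulate path is chosen. The variable names come from the mapping above; the dict layout, the function names, and the error message are hypothetical, not existing package API:

```python
import os

# EDH short names -> package variable names, as listed in this issue.
# The layout of this table is an illustrative assumption.
EDH_TO_INTERNAL = {
    "t2m": ["tmean", "tmax", "tmin"],   # one EDH variable feeds all three inputs
    "tp": ["prcp"],
    "d2m": ["dewpoint"],
    "swvl1": ["soil_moisture"],
    "swvl2": ["soil_moisture"],
    "swvl3": ["soil_moisture"],
    "swvl4": ["soil_moisture"],
}

VALID_SOURCES = ("edh", "cds")


def edh_variables_for(internal_name):
    """Return the EDH variables needed to build one internal variable."""
    return sorted(k for k, v in EDH_TO_INTERNAL.items() if internal_name in v)


def edh_token(env=None):
    """Read the token from EDH_TOKEN; fail early if it is missing."""
    env = os.environ if env is None else env
    token = env.get("EDH_TOKEN", "")
    if not token:
        raise RuntimeError("EDH_TOKEN is not set")
    return token


def resolve_source(source="edh"):
    """Validate the hypothetical `source` parameter of cd_fetch()."""
    if source not in VALID_SOURCES:
        raise ValueError(f"source must be one of {VALID_SOURCES}, got {source!r}")
    return source


print(edh_variables_for("soil_moisture"))  # ['swvl1', 'swvl2', 'swvl3', 'swvl4']
print(edh_variables_for("tmax"))           # ['t2m']
```

Keeping the mapping in one table means the fetcher can request only the EDH variables a given output needs, whichever of the two tooling options is selected.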

Phase 2: Finish the backfill via EDH

Phase 3: Decide on CDS role

  • Option A: drop CDS entirely
  • Option B: keep CDS as fallback for operational redundancy (both serve the same ERA5-Land data)
  • Option C: keep CDS only for near-real-time updates if EDH has more lag than CDS
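Option B could look roughly like the following: try EDH first and fall back to CDS only when EDH raises. This is a sketch under assumptions. `fetch_with_fallback` and the two stand-in fetchers are hypothetical, and a real implementation would likely catch narrower exception types:

```python
def fetch_with_fallback(fetchers, *args, **kwargs):
    """Try each (name, fetch_fn) pair in order.

    Returns (name, result) from the first fetcher that succeeds and
    raises RuntimeError listing every failure if none do.
    """
    errors = []
    for name, fetch in fetchers:
        try:
            return name, fetch(*args, **kwargs)
        except Exception as exc:  # broad on purpose for the sketch
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all sources failed: " + "; ".join(errors))


# Stand-in fetchers for demonstration only.
def fetch_edh(year, month):
    raise ConnectionError("EDH unavailable")


def fetch_cds(year, month):
    return f"cds-{year}-{month:02d}"


source, result = fetch_with_fallback(
    [("edh", fetch_edh), ("cds", fetch_cds)], 2024, 1
)
print(source, result)  # cds cds-2024-01
```

Returning the source name alongside the result lets the pipeline log which provider actually served each month.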

Phase 4: Update docs and pipeline

  • CLAUDE.md — update CDS API section, mention EDH primary
  • README + pkgdown — EDH auth setup instructions
  • Monthly GitHub Action — switch to EDH
  • Secrets — add EDH_TOKEN to repo secrets (rotate the one we've been using in chat)

Out of scope

  • Derived variables (vpd, rh) — still computed locally, no source change needed
  • Downstream (cd_derive, cd_aggregate, cd_cog_write, cd_stac_catalog, cd_s3_push) — unaffected if intermediate format stays the same

Risks

  • EDH reliability / uptime — we're adopting a single provider. Mitigated by keeping CDS as fallback (option B above).
  • Zarr chunking may not align with month boundaries, so a "one month pull" could fetch slightly more bytes than needed. In practice this is fine because the quota is generous.
  • R zarr tooling is less mature than Python. Reticulate + Python xarray is the pragmatic path; pure R via stars/GDAL is cleaner if it works.
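To make the chunk-alignment risk concrete, the sketch below estimates how many hourly time steps are read for one month when chunks do not line up with month boundaries. The time-axis origin (1950-01-01 00:00) and the chunk length of 1000 steps are assumptions for illustration; the store's actual chunking should be read from its metadata:

```python
from datetime import datetime

ORIGIN = datetime(1950, 1, 1)  # assumed first time step of the store


def month_hour_range(year, month):
    """Half-open [start, stop) range of hourly indices covering one month."""
    start = datetime(year, month, 1)
    stop = datetime(year + (month == 12), month % 12 + 1, 1)
    to_hours = lambda t: int((t - ORIGIN).total_seconds() // 3600)
    return to_hours(start), to_hours(stop)


def overfetch(year, month, chunk_len):
    """Return (steps_fetched, steps_needed) along the time axis."""
    start, stop = month_hour_range(year, month)
    first_chunk = start // chunk_len
    last_chunk = (stop - 1) // chunk_len
    fetched = (last_chunk - first_chunk + 1) * chunk_len
    return fetched, stop - start


print(overfetch(1950, 1, 1000))  # (1000, 744)
print(overfetch(1950, 2, 1000))  # (2000, 672)
```

With these assumed numbers, February 1950 straddles two chunks and reads about three times the steps it needs, which is still small relative to the quota described above.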

Tracking

Relates to #33 (tmax/tmin operational backfill)
Relates to #35 (alternative source evaluation — SUPERSEDED by this migration)
Relates to NewGraphEnvironment/sred-2025-2026#23
