
Migrate producer pipeline to DestinE Earth Data Hub (#36) #40

Merged: NewGraphEnvironment merged 12 commits into main from 36-edh-migration on Apr 14, 2026

@NewGraphEnvironment
Owner

Summary

  • Migrate the cd producer pipeline from CDS (ecmwfr) to DestinE Earth Data Hub (Zarr). ERA5-Land at the same 9 km native grid, no rate limiting (500K req/mo quota), ~5× faster than CDS.
  • Full 1950-2025 backfill regenerated across all 7 cd variables (tmax, tmin, tmean, prcp, vpd, rh, soil_moisture) on one internally-consistent EPSG:4326 120×260 BC grid — resolves a latent grid mismatch between prior CDS-era variables.
  • Stage 3 (COG aggregation + STAC catalog + S3 sync) already ran from this branch — 35 COGs live on s3://stac-era5-land, verified reading back via /vsicurl with expected 76-year time series and BC warming signal (annual mean 1950: -1.42 °C → 2024: +1.85 °C).
  • New GitHub Action (.github/workflows/climate-update.yml) takes over monthly updates from CDS using EDH. Checks S3, detects missing complete years, fetches via EDH, aggregates, pushes back.

Test plan

  • 1-month benchmark (scripts/test_edh_era5_land.py): 15.9s per month vs CDS ~80s, 5× speedup confirmed.
  • QA probe (scripts/probe_edh_vars.py): for each variable, compare the EDH monthly aggregate against the CDS reference for Jan 2000. tmean corr 0.9947; prcp matches the CDS monthly-means product when computed from the EDH daily product (a naive hourly sum is 8× too high because of tp accumulation semantics; documented).
  • Full backfill 1950-2025 × 7 variables ran to completion on EDH (~4 hours, two transient errors handled cleanly by retry-with-backoff + continue-on-failure).
  • Grid alignment QA (scripts/qa_monthly.R): all 7 variables share extent, res, CRS, ncell. tmin ≤ tmean ≤ tmax sanity passes with zero violations across 163,888 cell-checks.
  • Stage 3 ran locally; 35 COGs live on s3://stac-era5-land.
  • GHA dispatched on this branch, passed (runs/24412364974): detects S3 latest year = 2025, attempts 2026, EDH has 3/12 months, SKIP per variable, exits clean with "No new complete years available on EDH yet (latency is normal)".
  • Steps 4–5 of pipeline_update_edh.R (read existing COG from S3 via /vsicurl, append new year, push) have only been exercised implicitly; the full end-to-end path will first run when a complete new year lands on EDH (~Apr 2027 for 2026). The component functions are proven via the Stage 3 run on this branch.

Follow-ups (not blocking)

What is NOT in this PR

  • No changes to R/ function signatures — consumer API unchanged
  • No vignette text changes — vignette renders correctly against the new EDH-sourced data automatically
  • No secrets exposed in commits — EDH_TOKEN lives in ~/.Renviron locally and as a repo secret

Required secrets

  • EDH_TOKEN — DestinE personal access token (set ✓)
  • AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY — for aws s3 sync (set ✓)
  • CDS_API_KEY is no longer needed (CDS path remains in source as fallback but not wired into the GHA)

Fixes #36
Relates to #33
Relates to #35
Relates to NewGraphEnvironment/sred-2025-2026#23

🤖 Generated with Claude Code

NewGraphEnvironment and others added 12 commits April 12, 2026 02:03
Baseline planning files for the migration from CDS to DestinE Earth
Data Hub as the primary ERA5-Land source. Captures Phase 1 benchmark
findings from #35 and the phased migration plan.

Relates to #36
Relates to #35
Relates to NewGraphEnvironment/sred#23

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Portable PEP 723 Python script (runs via `uv run`, no venv setup)
that pulls one month of ERA5-Land hourly 2m_temperature for the BC
bounding box from DestinE Earth Data Hub and benchmarks vs CDS.

Result: 15.9s end-to-end vs CDS ~80s per month. Validated same product
(ERA5-Land 9km native), full temporal coverage (1950-01 to 2026-02),
and all 50 ERA5-Land variables accessible from one Zarr store.

Reads EDH_TOKEN from env or ~/.Renviron. Asserts subset is non-empty
to guard against silent lat-direction / lon-convention regressions.
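
A minimal sketch of the two guards described above, under stated assumptions: `read_token` and `select_lat_range` are hypothetical names, not the functions in scripts/test_edh_era5_land.py, and the .Renviron parsing only handles simple `KEY=value` lines.

```python
import os
from pathlib import Path


def read_token(name="EDH_TOKEN", renviron=None, environ=None):
    """Return the token from the environment, else from an .Renviron file."""
    environ = os.environ if environ is None else environ
    value = environ.get(name)
    if value:
        return value
    path = Path(renviron) if renviron is not None else Path.home() / ".Renviron"
    if path.is_file():
        for line in path.read_text().splitlines():
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, raw = line.partition("=")
            if key.strip() == name:
                return raw.strip().strip("'\"")
    raise RuntimeError(f"{name} not found in environment or {path}")


def select_lat_range(lats, south, north):
    """Indices of lats inside [south, north], whichever way the axis runs."""
    idx = [i for i, lat in enumerate(lats) if south <= lat <= north]
    # An empty subset usually means the bounds were applied in the wrong
    # direction (descending latitude axis) or with the wrong lon convention.
    assert idx, "empty latitude subset: check axis direction and bounds"
    return idx
```

Selecting by comparison rather than by an ordered slice means the same call works on ascending and descending latitude axes, and the assertion fails loudly instead of returning an empty array.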

Relates to #36
Relates to #35

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pragmatic Phase 2 of the EDH migration. Python script that pulls
hourly 2m temperature from DestinE Earth Data Hub (Zarr), computes
monthly mean of daily max/min, and writes yearly GeoTIFFs that are
a drop-in replacement for the existing R pipeline's Stage 2 output.

Verified on 1950:
- 114.3s/year end-to-end, full backfill ~2.5h vs CDS ~3 days
- 12-band EPSG:4326 GeoTIFFs, bands named Jan..Dec for cd_aggregate
- terra::rast() reads cleanly with expected dims, CRS, names
- Realistic BC values (Jan tmax -30 to 1C, Jul tmax -1 to 30C)

Documents the UTC-day aggregation limitation matching the existing R
pipeline behaviour. Addressing the timezone bias is a follow-up for
the cd package methodology, not blocking for #36.
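
The "monthly mean of daily max/min" aggregation with UTC-day boundaries can be sketched in plain Python. This is an illustration on (timestamp, value) pairs, not the xarray code in the script; `monthly_mean_of_daily_stat` is a hypothetical name.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from statistics import fmean


def monthly_mean_of_daily_stat(samples, stat=max):
    """samples: iterable of (UTC datetime, value).

    Returns {(year, month): mean over days of stat(values for that day)}.
    """
    by_day = defaultdict(list)
    for ts, value in samples:
        # Days are cut at UTC midnight, which is the limitation noted above:
        # a local BC day straddles two UTC days.
        by_day[ts.date()].append(value)
    by_month = defaultdict(list)
    for day, values in by_day.items():
        by_month[(day.year, day.month)].append(stat(values))
    return {month: fmean(daily) for month, daily in sorted(by_month.items())}
```

Passing `stat=min` gives the tmin analogue; the timezone follow-up would amount to shifting `ts` to local time before taking `.date()`.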

Portable via PEP 723 inline deps, no venv setup. Idempotent skipping
of already-written years and guards against partial current-year data.

Relates to #36
Relates to #33
Relates to NewGraphEnvironment/sred#23

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full 1950-2025 tmax/tmin backfill via EDH completed in ~1h 53min.
76 years x 2 variables written to data/backfill/monthly/. Values
validated via terra spot-check across 1950, 2000, 2024, 2025.

Regenerated the 2024 tmax/tmin files from EDH as well — they were
produced earlier via the abandoned CDS path and had methodology
drift from the rest of the record.

Next: R Stage 3 (COG + STAC + S3 push) against the EDH outputs.

Relates to #36

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two scripts produced during the EDH migration QA:

- scripts/qa_monthly.R — R/terra checks that span all monthly TIFs
  in data/backfill/monthly/: grid alignment across variables, time
  coverage, physical sanity (tmin<=tmean<=tmax), value ranges, and
  a CDS vs EDH comparison for a sample year.

- scripts/probe_edh_vars.py — Python/PEP-723 probe that opens EDH
  Zarr, pulls one month for each variable we need (t2m, d2m, tp,
  swvl1-4), computes the naive monthly aggregation, and compares to
  the existing CDS-era monthly TIF. Demonstrates precipitation's
  GRIB_stepType=accum behaviour so the decision to use EDH's daily
  product is self-documenting.

Findings:
- CDS-era tmean/prcp/vpd/rh are on a 121x261 grid with no CRS tag;
  EDH-era tmax/tmin are on a 120x260 EPSG:4326 grid. Grid mismatch
  blocks pixel-wise cross-variable analysis.
- EDH t2m, d2m, swvl1..4 all match CDS closely (0.99 correlation on
  tmean, soil moisture within rounding).
- EDH hourly tp cannot be naively summed; use EDH daily product instead.
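
Why a naive hourly sum overcounts an accumulated field can be shown on synthetic numbers. The sketch assumes each hourly value is a running total that resets once per day; that reset convention is an assumption made for illustration, and the factor it produces depends on when the rain falls, so it is not meant to reproduce the 8× figure.

```python
def accumulate(hourly_increments):
    """Build a running total, the way an accumulated field stores rainfall."""
    total, out = 0.0, []
    for inc in hourly_increments:
        total += inc
        out.append(total)
    return out


def daily_total(accumulated):
    """With one reset per day, the last step already holds the day total."""
    return accumulated[-1]


def naive_sum(accumulated):
    """Wrong for accumulated fields: counts early rainfall again at every step."""
    return sum(accumulated)


# Synthetic day: 0.5 mm per hour for 12 hours, dry otherwise (6 mm in total).
rain = [0.0] * 6 + [0.5] * 12 + [0.0] * 6
acc = accumulate(rain)
print("true daily total:", daily_total(acc))
print("naive hourly sum:", naive_sum(acc))
```

A daily product that already carries the day total sidesteps the question of where the reset falls, which is the reason given above for using it.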

Relates to #36

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Python PEP 723 script that produces all 7 cd-package monthly GeoTIFFs
on a single consistent EPSG:4326 / 120x260 BC grid:

  tmax, tmin   hourly t2m -> daily max/min -> monthly mean of daily stat
  tmean        hourly t2m -> monthly mean
  vpd, rh      Tetens on monthly mean t2m + d2m (matches R cd_derive)
  soil_moisture hourly swvl1..4 -> monthly mean -> 4-depth mean
  prcp         DAILY product tp -> monthly sum (handles accumulation
               reset correctly; naive hourly sum is 8x too high)

Uses two EDH Zarr stores: hourly for state variables, daily for
precipitation. Hourly lacks correct accumulation semantics; daily
pre-computes totals. Tested on 2000: all 7 outputs align on the
same grid, values match CDS closely (prcp Jan max 415.57 mm matches
CDS 415.60 mm).
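
For the vpd and rh rows, a Tetens-style derivation from monthly mean t2m and d2m might look like the sketch below. The coefficients are the commonly used FAO-56 form of the Tetens equation; whether cd_derive uses exactly these constants is an assumption, and `rh_and_vpd` is a hypothetical name.

```python
import math


def saturation_vapour_pressure_kpa(temp_c):
    """Tetens approximation over water, temperature in degrees Celsius."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def rh_and_vpd(t2m_k, d2m_k):
    """Relative humidity (%) and vapour pressure deficit (kPa) from Kelvin inputs."""
    t_c = t2m_k - 273.15
    td_c = d2m_k - 273.15
    es = saturation_vapour_pressure_kpa(t_c)   # saturation vapour pressure
    ea = saturation_vapour_pressure_kpa(td_c)  # actual vapour pressure
    rh = min(100.0, 100.0 * ea / es)
    vpd = max(0.0, es - ea)
    return rh, vpd
```

The clamping covers cells where the monthly mean dewpoint slightly exceeds the monthly mean temperature, which can happen because the two means are computed independently.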

Atomic writes via .tif.tmp + os.replace, so a killed run does not
leave a truncated file that fools the exists() idempotency check.

Logs explicit SKIP reasons when a variable's monthly count is not 12
(e.g. partial current year), instead of silently doing nothing.
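
The atomic-write and explicit-SKIP behaviour described above can be sketched as follows. `write_atomic` and `plan_year` are hypothetical names, and the message wording only loosely follows the "SKIP got N months, expected 12" log line quoted elsewhere in this PR.

```python
import os
from pathlib import Path

EXPECTED_MONTHS = 12


def write_atomic(path, data):
    """Write bytes to <path>.tmp, then rename over the final path."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")  # e.g. tmax_2000.tif.tmp
    with open(tmp, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())
    # os.replace is atomic on the same filesystem, so the final name
    # either does not exist or points at a complete file.
    os.replace(tmp, path)


def plan_year(var, year, n_months, out_path):
    """Return (should_write, log message) for one variable-year."""
    if Path(out_path).exists():
        return False, f"{var} {year}: SKIP already written"
    if n_months != EXPECTED_MONTHS:
        return False, (
            f"{var} {year}: SKIP got {n_months} months, "
            f"expected {EXPECTED_MONTHS}"
        )
    return True, f"{var} {year}: writing {out_path}"
```

Because a killed run leaves only the `.tmp` file behind, the `exists()` check on the final name stays a reliable idempotency signal.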

Resolves the grid mismatch found during QA between the CDS-era and
EDH-era monthly files. Running a full regeneration will produce a
fully internally-consistent dataset.

Relates to #36

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closes two of the three missing safeguards in #38:

- preflight_single_instance(): pgrep on script name refuses to start
  if another backfill_edh_all.py is already running. Prevents the
  zombie-process hammering that bit us on CDS (#33).
- with_retry(): exponential backoff on OSError/ConnectionError/
  TimeoutError around zarr-open and per-year processing. xarray and
  fsspec do not retry transient HTTP errors by default, so one
  network blip in a multi-hour run would kill the whole backfill.
  4 attempts, 10s -> 20s -> 40s -> 80s.

Per-year failures after retry exhaustion log and continue to next
year. The script stays idempotent so a failed year is picked up on
the next run.
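
A minimal sketch of retry-with-backoff plus continue-on-failure, assuming a hypothetical `run_years` wrapper. The `with_retry` signature here is illustrative and may differ from the script's; in particular, how the 80s step maps onto four attempts is not spelled out above, so this version simply sleeps 10s, 20s, 40s between four tries.

```python
import time


def with_retry(fn, *, attempts=4, base_delay=10.0,
               retry_on=(OSError, ConnectionError, TimeoutError),
               sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**k and try again."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # retries exhausted: let the caller decide
            sleep(base_delay * 2 ** (attempt - 1))


def run_years(years, process_year, **retry_kwargs):
    """Process each year; log and continue when a year ultimately fails."""
    failed = []
    for year in years:
        try:
            with_retry(lambda: process_year(year), **retry_kwargs)
        except Exception as exc:  # includes non-retryable exception classes
            print(f"{year}: FAILED, will be picked up on the next run: {exc!r}")
            failed.append(year)
    return failed
```

Injecting `sleep` keeps the backoff testable without waiting, and the outer `except Exception` is what lets a year with an unexpected exception class be skipped rather than ending the whole run.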

Third safeguard (backup-before-overwrite of CDS-era files) is a
shell concern handled outside the script: mv to backup dir, run
EDH backfill, QA, then delete backup.

Propagation to soul as a reusable "bulk-fetch-safeguards" convention
is tracked in #38 — leave this issue open until that PR lands.

Relates to #38
Relates to #36

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full 1950-2025 unified backfill completed in ~4 hours. 532 monthly TIFs
across 7 variables, all on consistent EPSG:4326 / 120x260 BC grid with
proper CRS tagging.

QA validates:
- Grid alignment across all variables
- Zero tmin>tmax or tmean inversion violations across 163888 cell-checks
  spanning 4 sample years and Jan+Jul months
- Value ranges physically plausible for BC climate
- (tmax+tmin)/2 vs tmean: mean bias 0.57C, which is normal for the classical
  climatology shortcut and confirms aggregation correctness
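
The ordering and bias checks listed above reduce to a few lines. This sketch works on flat lists of cell values rather than terra rasters as qa_monthly.R does; the function names are hypothetical, and the sign convention for the bias (midrange minus tmean) is an assumption.

```python
import math


def count_order_violations(tmin, tmean, tmax):
    """Count cells where tmin <= tmean <= tmax fails, skipping NaN cells."""
    checked = violations = 0
    for lo, mid, hi in zip(tmin, tmean, tmax):
        if any(math.isnan(v) for v in (lo, mid, hi)):
            continue  # masked cell (e.g. ocean)
        checked += 1
        if lo > mid or mid > hi:
            violations += 1
    return checked, violations


def midrange_bias(tmin, tmean, tmax):
    """Mean of (tmax + tmin) / 2 - tmean over cells with valid data."""
    diffs = [
        (lo + hi) / 2 - mid
        for lo, mid, hi in zip(tmin, tmean, tmax)
        if not any(math.isnan(v) for v in (lo, mid, hi))
    ]
    return sum(diffs) / len(diffs)
```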

Two transient errors during run were handled:
- 1989 TimeoutError caught by with_retry, completed on attempt 2
- 2008 ClientPayloadError was not caught by retry (wrong exception
  class) but outer handler skipped the year and continued; filled
  later via --year 2008 rerun. Broadening the retry catch is queued
  as a follow-up in #38.

qa_monthly.R updated: soil_moisture is a single composite file
(per cd_derive_soil mean of 4 depths) not per-depth files.

Relates to #36

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds scripts/pipeline_stage3_edh.R — a slim Stage-3-only script that
runs cd_aggregate -> cd_cog_write -> cd_stac_catalog -> cd_s3_push
against the unified EDH-produced monthly TIFs in data/backfill/monthly/.
Separate from the existing pipeline_backfill.R because that script
bundles CDS fetch + aggregate + push as one thing.

Result: 35 COGs (7 vars x 5 periods) live on s3://stac-era5-land,
overwriting the April 6-7 CDS-era data with internally-consistent
EDH-derived values. Verified via /vsicurl read — tmean_annual returns
76 years 1950-2025 on the expected EPSG:4326 120x260 BC grid.

STAC catalog.json is byte-identical between CDS-era and EDH-era runs
because filenames, extents, and metadata did not change. That is
legitimate sync behaviour — STAC catalogs point at data, they do not
checksum it — not a bug in cd_s3_push.

Relates to #36

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces CDS fetch in the monthly update pipeline with EDH Zarr.

scripts/pipeline_update_edh.R — R orchestrator that:
  1. Reads STAC catalog from S3 to find latest year already published
  2. If behind current year, iterates missing years calling
     `uv run scripts/backfill_edh_all.py --year YYYY` for each
  3. For each var × period: reads existing COG from S3 via /vsicurl,
     aggregates new years, stacks, writes locally
  4. Rebuilds catalog, pushes to S3

Distinguishes three exit states:
  - Already current: exit 0, "nothing to do"
  - Latest year not yet complete on EDH (normal latency): exit 0
  - Attempted fetch errored and nothing was written: exit 1 (visibly red)
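
The three exit states can be expressed as one small decision function. The orchestrator itself is R; this Python rendering is only an illustration, `decide_exit` is a hypothetical name, and treating "already current" as `latest_published >= current_year` is an assumption.

```python
def decide_exit(latest_published, current_year, years_written, years_errored):
    """Return (exit_code, message) for one update run."""
    if latest_published >= current_year:
        return 0, "Already current: nothing to do"
    if years_written:
        return 0, f"Published new years: {sorted(years_written)}"
    if years_errored:
        # Only this case should turn the workflow red.
        return 1, f"Fetch errored and nothing was written: {sorted(years_errored)}"
    return 0, "No new complete years available on EDH yet (latency is normal)"
```

Keeping the latency case at exit 0 means a scheduled run does not report failure every month while the newest year is still incomplete.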

Sanity-checks grid alignment between the S3-sourced COG and locally
aggregated new layers before stacking — catches any floating-point
extent drift before it produces a cryptic terra error.
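
A tolerance-based alignment check along these lines might look like the sketch below. The real check runs in R on terra objects; `Grid`, `grid_mismatches`, and the tolerance value are hypothetical, chosen only to show the idea of comparing extents approximately and everything else exactly.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Grid:
    xmin: float
    xmax: float
    ymin: float
    ymax: float
    nrows: int
    ncols: int
    crs: str


def grid_mismatches(a, b, abs_tol=1e-9):
    """Return a list of human-readable differences; empty means aligned."""
    problems = []
    for name in ("xmin", "xmax", "ymin", "ymax"):
        va, vb = getattr(a, name), getattr(b, name)
        # Extents are floats, so compare with a tolerance, not ==.
        if not math.isclose(va, vb, rel_tol=0.0, abs_tol=abs_tol):
            problems.append(f"{name}: {va} vs {vb}")
    for name in ("nrows", "ncols", "crs"):
        va, vb = getattr(a, name), getattr(b, name)
        if va != vb:
            problems.append(f"{name}: {va!r} vs {vb!r}")
    return problems
```

Reporting each differing field by name gives a clearer failure than the stacking error it is meant to pre-empt.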

.github/workflows/climate-update.yml updated:
  - EDH_TOKEN secret (was CDS_API_KEY)
  - AWS_* secrets validated up-front alongside EDH_TOKEN
  - uv installed via astral-sh/setup-uv@v5 for the Python backfill
  - Log commit step gated on github.ref == main, so workflow_dispatch
    from a feature branch won't try to rebase+push to main

Smoke-tested locally: script correctly identified 2025 as latest,
tried 2026, saw only 2 months (April 2026 — ERA5-Land ~3mo latency),
all 7 vars got "SKIP got 2 months, expected 12", exited 0 with
"No new complete years available on EDH yet (latency is normal)".

Required repo secrets: EDH_TOKEN, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY.
CDS_API_KEY no longer needed.

Relates to #36

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. pgrep self-detection under uv run: backfill_edh_all.py was aborting
   with "another backfill_edh_all is running" because uv is the parent
   process when invoked via `uv run scripts/backfill_edh_all.py`, and
   uv's command line also contains "backfill_edh_all". Exclude both
   our own pid AND our parent pid from the match.

2. tee masks Rscript exit code in the GHA step: `Rscript ... | tee log`
   returns tee's exit status. With bash -e alone, a non-zero R exit
   was silently swallowed and the step reported success. Added
   `set -o pipefail` and pinned `shell: bash` so pipeline errors
   propagate to the job status.

These surfaced on run 24378657502 — the pgrep false-positive made R
exit 1, which then went unnoticed because of the tee issue. Second
run should pass cleanly.

Relates to #36

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The pgrep check still found a false-positive match on ubuntu-24.04
even after filtering my_pid and my_ppid. Candidates include uv's
child-of-child processes, bash ancestors whose cmdline includes
the pattern, or pgrep's own pre-exec cmdline during fork.

Not worth chasing the exact cause — the check is only meaningful
for catching local zombie-process mistakes. CI runs in fresh
containers where another instance is not possible by construction.
Guard the check with GITHUB_ACTIONS env sentinel.
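
A sketch of the guarded single-instance check, separated from process listing so it can be tested. The `processes` argument (a list of `(pid, cmdline)` pairs) stands in for whatever pgrep returns; the function signatures are hypothetical. `GITHUB_ACTIONS` is set to `true` on GitHub-hosted runs.

```python
import os


def other_instances(processes, pattern, my_pid, my_ppid):
    """PIDs whose command line matches, excluding this process and its parent."""
    skip = {my_pid, my_ppid}
    return [pid for pid, cmdline in processes
            if pattern in cmdline and pid not in skip]


def preflight_single_instance(processes, pattern, environ=None,
                              my_pid=None, my_ppid=None):
    """Refuse to start if another matching process runs (skipped in CI)."""
    environ = os.environ if environ is None else environ
    if environ.get("GITHUB_ACTIONS") == "true":
        return  # fresh CI container: a second instance cannot exist
    my_pid = os.getpid() if my_pid is None else my_pid
    my_ppid = os.getppid() if my_ppid is None else my_ppid
    others = other_instances(processes, pattern, my_pid, my_ppid)
    if others:
        raise SystemExit(f"another {pattern} is running (pids {others})")
```

Excluding the parent PID covers the `uv run` case, where the launcher's own command line contains the script name.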

Relates to #36

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NewGraphEnvironment merged commit b976008 into main on Apr 14, 2026
2 of 3 checks passed
NewGraphEnvironment deleted the 36-edh-migration branch on April 14, 2026 19:56

Development

Successfully merging this pull request may close these issues.

Migrate cd_fetch() to DestinE Earth Data Hub Zarr

1 participant