Migrate producer pipeline to DestinE Earth Data Hub (#36) #40
Merged
Baseline planning files for the migration from CDS to DestinE Earth Data Hub as the primary ERA5-Land source. Captures Phase 1 benchmark findings from #35 and the phased migration plan.
Relates to #36
Relates to #35
Relates to NewGraphEnvironment/sred#23
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Portable PEP 723 Python script (runs via `uv run`, no venv setup) that pulls one month of ERA5-Land hourly 2m_temperature for the BC bounding box from DestinE Earth Data Hub and benchmarks it against CDS.
Result: 15.9s end-to-end vs CDS ~80s per month. Validated same product (ERA5-Land 9km native), full temporal coverage (1950-01 to 2026-02), and all 50 ERA5-Land variables accessible from one Zarr store.
Reads EDH_TOKEN from env or ~/.Renviron. Asserts subset is non-empty to guard against silent lat-direction / lon-convention regressions.
Relates to #36
Relates to #35
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
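The token lookup and the lat-direction guard mentioned above can be sketched as follows. This is a minimal illustration, not the script's actual code: the function names `read_edh_token` and `lat_bounds` are hypothetical, and it assumes the `.Renviron` file uses plain `KEY=value` lines.

```python
import os
import re
from pathlib import Path


def read_edh_token(renviron: Path = Path.home() / ".Renviron") -> str:
    """Return EDH_TOKEN from the environment, else from an .Renviron file."""
    token = os.environ.get("EDH_TOKEN", "").strip()
    if token:
        return token
    if renviron.is_file():
        for line in renviron.read_text().splitlines():
            # .Renviron lines look like KEY=value, optionally quoted.
            match = re.match(r"^\s*EDH_TOKEN\s*=\s*(.*?)\s*$", line)
            if match:
                return match.group(1).strip("'\"")
    raise RuntimeError("EDH_TOKEN not found in environment or " + str(renviron))


def lat_bounds(first_lat: float, last_lat: float, south: float, north: float):
    """Order slice bounds to match the axis direction (hypothetical helper).

    A label-based slice written south-to-north selects nothing on a
    north-to-south axis, which is the silent regression the assert guards.
    """
    return (north, south) if first_lat > last_lat else (south, north)
```

With a descending latitude axis, `lat_bounds(90.0, -90.0, 48.0, 60.0)` returns `(60.0, 48.0)`, the order a label-based slice would need on that axis.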
Pragmatic Phase 2 of the EDH migration. Python script that pulls hourly 2m temperature from DestinE Earth Data Hub (Zarr), computes monthly mean of daily max/min, and writes yearly GeoTIFFs that are a drop-in replacement for the existing R pipeline's Stage 2 output.
Verified on 1950:
- 114.3s/year end-to-end, full backfill ~2.5h vs CDS ~3 days
- 12-band EPSG:4326 GeoTIFFs, bands named Jan..Dec for cd_aggregate
- terra::rast() reads cleanly with expected dims, CRS, names
- Realistic BC values (Jan tmax -30 to 1C, Jul tmax -1 to 30C)
Documents the UTC-day aggregation limitation matching the existing R pipeline behaviour. Addressing the timezone bias is a follow-up for the cd package methodology, not blocking for #36.
Portable via PEP 723 inline deps, no venv setup. Idempotent skipping of already-written years and guards against partial current-year data.
Relates to #36
Relates to #33
Relates to NewGraphEnvironment/sred#23
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
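The "monthly mean of daily max/min" aggregation can be illustrated on a single pixel with the standard library alone. This is a sketch of the arithmetic only; the real script works on gridded Zarr data, and the function name here is hypothetical. Days are grouped by UTC calendar date, matching the UTC-day limitation noted above.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def monthly_mean_of_daily_stat(samples, stat=max):
    """samples: iterable of (UTC datetime, value).

    Returns {(year, month): mean over days of stat(hourly values that day)}.
    """
    daily = defaultdict(list)
    for ts, value in samples:
        daily[ts.date()].append(value)  # group hours by UTC calendar day
    monthly = defaultdict(list)
    for day, values in daily.items():
        monthly[(day.year, day.month)].append(stat(values))  # one stat per day
    return {ym: mean(vals) for ym, vals in sorted(monthly.items())}


# Two synthetic days: day 1 runs 0..23, day 2 runs 10..33.
start = datetime(1950, 1, 1)
samples = [(start + timedelta(hours=h), float(h % 24 + 10 * (h // 24))) for h in range(48)]
tmax = monthly_mean_of_daily_stat(samples, stat=max)  # mean of 23 and 33
tmin = monthly_mean_of_daily_stat(samples, stat=min)  # mean of 0 and 10
```

Passing `stat=max` or `stat=min` keeps the two products on one code path, so tmax and tmin cannot drift apart in how days are grouped.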
Full 1950-2025 tmax/tmin backfill via EDH completed in ~1h 53min. 76 years x 2 variables written to data/backfill/monthly/. Values validated via terra spot-check across 1950, 2000, 2024, 2025.
Regenerated the 2024 tmax/tmin files from EDH as well: they were produced earlier via the abandoned CDS path and had methodology drift from the rest of the record.
Next: R Stage 3 (COG + STAC + S3 push) against the EDH outputs.
Relates to #36
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two scripts produced during the EDH migration QA:
- scripts/qa_monthly.R: R/terra checks that span all monthly TIFs in data/backfill/monthly/: grid alignment across variables, time coverage, physical sanity (tmin<=tmean<=tmax), value ranges, and a CDS vs EDH comparison for a sample year.
- scripts/probe_edh_vars.py: Python/PEP-723 probe that opens EDH Zarr, pulls one month for each variable we need (t2m, d2m, tp, swvl1-4), computes the naive monthly aggregation, and compares to the existing CDS-era monthly TIF. Demonstrates precipitation's GRIB_stepType=accum behaviour so the decision to use EDH's daily product is self-documenting.
Findings:
- CDS-era tmean/prcp/vpd/rh are on a 121x261 grid with no CRS tag; EDH-era tmax/tmin are on a 120x260 EPSG:4326 grid. Grid mismatch blocks pixel-wise cross-variable analysis.
- EDH t2m, d2m, swvl1..4 all match CDS closely (0.99 correlation on tmean, soil moisture within rounding).
- EDH hourly tp cannot be naively summed; use EDH daily product instead.
Relates to #36
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
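Why an accumulated field cannot be naively summed is easiest to see on a toy series. The sketch below assumes a running total that resets once per day and detects the reset by a drop in value; it is an illustration of the failure mode, not what the pipeline does (the pipeline sidesteps the issue by using the daily product). The overcount factor in this toy depends entirely on when the rain falls, so it is not expected to reproduce the factor observed in the probe.

```python
def deaccumulate(accumulated):
    """Convert a running total that resets each day into per-step amounts.

    Assumption: a reset is visible as a drop in the running total. A reset
    that lands on a value >= the previous one would be missed by this sketch.
    """
    steps = []
    previous = 0.0
    for value in accumulated:
        # After a reset the stored value is itself the amount for that step.
        steps.append(value - previous if value >= previous else value)
        previous = value
    return steps


# Toy day: 1 mm falls every hour, stored as a running total 1, 2, ..., 24.
hourly_running_total = [float(h) for h in range(1, 25)]
naive_sum = sum(hourly_running_total)                 # counts early hours repeatedly
true_total = sum(deaccumulate(hourly_running_total))  # one count per hour
```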
Python PEP 723 script that produces all 7 cd-package monthly GeoTIFFs
on a single consistent EPSG:4326 / 120x260 BC grid:
  tmax, tmin     hourly t2m -> daily max/min -> monthly mean of daily stat
  tmean          hourly t2m -> monthly mean
  vpd, rh        Tetens on monthly mean t2m + d2m (matches R cd_derive)
  soil_moisture  hourly swvl1..4 -> monthly mean -> 4-depth mean
  prcp           DAILY product tp -> monthly sum (handles accumulation
                 reset correctly; naive hourly sum is 8x too high)
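For the vpd and rh rows, a Tetens-based derivation from air temperature and dewpoint can be sketched as below. The coefficients are the commonly used over-water form (0.6108 kPa, 17.27, 237.3 °C); whether cd_derive uses exactly these constants is an assumption, and the function names are hypothetical. Inputs are in degrees Celsius, so Kelvin fields would need 273.15 subtracted first.

```python
import math


def saturation_vapour_pressure_kpa(temp_c: float) -> float:
    """Tetens approximation over water, temperature in degrees Celsius."""
    return 0.6108 * math.exp(17.27 * temp_c / (temp_c + 237.3))


def vpd_and_rh(t2m_c: float, d2m_c: float):
    """Return (VPD in kPa, RH in percent) from air and dewpoint temperature."""
    es = saturation_vapour_pressure_kpa(t2m_c)  # saturation at air temperature
    ea = saturation_vapour_pressure_kpa(d2m_c)  # actual vapour pressure via dewpoint
    return es - ea, 100.0 * ea / es
```

When dewpoint equals air temperature the air is saturated, so the function returns a VPD of zero and an RH of 100 percent, which makes a convenient sanity check.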
Uses two EDH Zarr stores: hourly for state variables, daily for
precipitation. Hourly lacks correct accumulation semantics; daily
pre-computes totals. Tested on 2000: all 7 outputs align on the
same grid, values match CDS closely (prcp Jan max 415.57 mm matches
CDS 415.60 mm).
Atomic writes via .tif.tmp + os.replace, so a killed run does not
leave a truncated file that fools the exists() idempotency check.
Logs explicit SKIP reasons when a variable's monthly count is not 12
(e.g. partial current year), instead of silently doing nothing.
Resolves the grid mismatch found during QA between the CDS-era and
EDH-era monthly files. Running a full regeneration will produce a
fully internally-consistent dataset.
Relates to #36
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
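The atomic-write pattern described in this commit (.tif.tmp followed by os.replace) can be sketched as follows. The helper name `atomic_write` and the callback shape are hypothetical; the point is only that the final path never exists in a half-written state.

```python
import os
from pathlib import Path


def atomic_write(path: Path, write_fn) -> None:
    """Write to <path>.tmp, then rename over <path> once writing succeeds."""
    tmp = path.with_name(path.name + ".tmp")
    try:
        write_fn(tmp)          # e.g. a function that writes the GeoTIFF to tmp
        os.replace(tmp, path)  # atomic when tmp and path share a filesystem
    finally:
        if tmp.exists():
            tmp.unlink()       # clean up after a failed write
```

If the process is killed outright, the cleanup does not run and a stray `.tmp` file may remain, but the final path is still absent, so an `exists()` check on it continues to report the year as not yet written.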
Closes two of the three missing safeguards in #38:
- preflight_single_instance(): pgrep on script name refuses to start if another backfill_edh_all.py is already running. Prevents the zombie-process hammering that bit us on CDS (#33).
- with_retry(): exponential backoff on OSError/ConnectionError/TimeoutError around zarr-open and per-year processing. xarray and fsspec do not retry transient HTTP errors by default, so one network blip in a multi-hour run would kill the whole backfill. 4 attempts, 10s -> 20s -> 40s -> 80s.
Per-year failures after retry exhaustion log and continue to next year. The script stays idempotent so a failed year is picked up on the next run.
Third safeguard (backup-before-overwrite of CDS-era files) is a shell concern handled outside the script: mv to backup dir, run EDH backfill, QA, then delete backup.
Propagation to soul as a reusable "bulk-fetch-safeguards" convention is tracked in #38. Leave this issue open until that PR lands.
Relates to #38
Relates to #36
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
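A retry wrapper with exponential backoff along the lines of with_retry() might look like the sketch below. The signature, including the injectable `sleep` argument, is an assumption made for illustration. How the four listed delays map onto four attempts is not spelled out above; this version sleeps only between attempts, so four attempts use the 10s, 20s and 40s delays.

```python
import time

# Exception classes named in the commit message.
RETRYABLE = (OSError, ConnectionError, TimeoutError)


def with_retry(fn, attempts=4, base_delay=10.0, sleep=time.sleep):
    """Call fn(); on a retryable error wait base_delay * 2**n and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except RETRYABLE:
            if attempt == attempts - 1:
                raise  # retries exhausted, let the caller log and move on
            sleep(base_delay * 2 ** attempt)
```

A caller would wrap the per-year work, for example `with_retry(lambda: process_year(1989))`, where `process_year` is a hypothetical name. Exceptions outside `RETRYABLE` propagate immediately without a retry.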
Full 1950-2025 unified backfill completed in ~4 hours. 532 monthly TIFs across 7 variables, all on consistent EPSG:4326 / 120x260 BC grid with proper CRS tagging.
QA validates:
- Grid alignment across all variables
- Zero tmin>tmax or tmean inversion violations across 163888 cell-checks spanning 4 sample years and Jan+Jul months
- Value ranges physically plausible for BC climate
- (tmax+tmin)/2 vs tmean mean bias 0.57C (normal for the classical climatology shortcut; confirms aggregation correctness)
Two transient errors during the run were handled:
- 1989 TimeoutError caught by with_retry, completed on attempt 2
- 2008 ClientPayloadError was not caught by retry (wrong exception class) but outer handler skipped the year and continued; filled later via --year 2008 rerun. Broadening the retry catch is queued as a follow-up in #38.
qa_monthly.R updated: soil_moisture is a single composite file (per cd_derive_soil mean of 4 depths) not per-depth files.
Relates to #36
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds scripts/pipeline_stage3_edh.R, a slim Stage-3-only script that runs cd_aggregate -> cd_cog_write -> cd_stac_catalog -> cd_s3_push against the unified EDH-produced monthly TIFs in data/backfill/monthly/. Separate from the existing pipeline_backfill.R because that script bundles CDS fetch + aggregate + push as one thing.
Result: 35 COGs (7 vars x 5 periods) live on s3://stac-era5-land, overwriting the April 6-7 CDS-era data with internally-consistent EDH-derived values. Verified via /vsicurl read: tmean_annual returns 76 years 1950-2025 on the expected EPSG:4326 120x260 BC grid.
STAC catalog.json is byte-identical between CDS-era and EDH-era runs because filenames, extents, and metadata did not change. That is legitimate sync behaviour (STAC catalogs point at data, they do not checksum it), not a bug in cd_s3_push.
Relates to #36
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces CDS fetch in the monthly update pipeline with EDH Zarr.
scripts/pipeline_update_edh.R — R orchestrator that:
1. Reads STAC catalog from S3 to find latest year already published
2. If behind current year, iterates missing years calling
`uv run scripts/backfill_edh_all.py --year YYYY` for each
3. For each var × period: reads existing COG from S3 via /vsicurl,
aggregates new years, stacks, writes locally
4. Rebuilds catalog, pushes to S3
Distinguishes three exit states:
- Already current: exit 0, "nothing to do"
- Latest year not yet complete on EDH (normal latency): exit 0
- Attempted fetch errored and nothing was written: exit 1 (visibly red)
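The three exit states listed above, plus the ordinary success case, can be modelled as a small classification step. The orchestrator itself is R; this Python sketch only illustrates the decision logic, and every name in it (`Outcome`, `classify`, the arguments) is hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    ALREADY_CURRENT = "nothing to do"
    NOT_YET_COMPLETE = "latest year not complete upstream"
    FETCH_FAILED = "fetch errored and nothing was written"
    UPDATED = "new years written"


def classify(latest_published, current_year, years_written, fetch_errored):
    """Map one update run onto an outcome."""
    if latest_published >= current_year:
        return Outcome.ALREADY_CURRENT
    if years_written:
        return Outcome.UPDATED
    return Outcome.FETCH_FAILED if fetch_errored else Outcome.NOT_YET_COMPLETE


def exit_code(outcome):
    """Only a failed fetch with nothing written should turn the job red."""
    return 1 if outcome is Outcome.FETCH_FAILED else 0
```

Keeping "upstream is not ready yet" separate from "the fetch broke" is what lets a scheduled job stay green during normal data latency while still failing visibly on a real error.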
Sanity-checks grid alignment between the S3-sourced COG and locally
aggregated new layers before stacking — catches any floating-point
extent drift before it produces a cryptic terra error.
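A tolerance-based grid comparison of the kind described above could be sketched as follows. The real check runs in R on terra objects; this Python version uses plain dicts, and the extent values in the usage example are placeholders rather than the actual BC bounding box.

```python
import math


def grids_aligned(a, b, tol=1e-6):
    """Compare two grid descriptions within an absolute tolerance.

    a, b: dicts with 'extent' (xmin, xmax, ymin, ymax), 'res' (x, y)
    and 'shape' (rows, cols).
    """
    if a["shape"] != b["shape"]:
        return False
    pairs = list(zip(a["extent"], b["extent"])) + list(zip(a["res"], b["res"]))
    return all(math.isclose(x, y, abs_tol=tol, rel_tol=0.0) for x, y in pairs)


# Placeholder grids: same shape, extents differing only by float noise.
s3_cog = {"extent": (-139.0, -113.0, 48.0, 60.0), "res": (0.1, 0.1), "shape": (120, 260)}
local = {"extent": (-139.0 + 1e-9, -113.0, 48.0, 60.0), "res": (0.1, 0.1), "shape": (120, 260)}
ok = grids_aligned(s3_cog, local)
```

An exact equality test would reject the second grid because of the 1e-9 offset, while a shape mismatch such as 121x261 against 120x260 is rejected regardless of tolerance.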
.github/workflows/climate-update.yml updated:
- EDH_TOKEN secret (was CDS_API_KEY)
- AWS_* secrets validated up-front alongside EDH_TOKEN
- uv installed via astral-sh/setup-uv@v5 for the Python backfill
- Log commit step gated on github.ref == main, so workflow_dispatch
from a feature branch won't try to rebase+push to main
Smoke-tested locally: script correctly identified 2025 as latest,
tried 2026, saw only 2 months (April 2026 — ERA5-Land ~3mo latency),
all 7 vars got "SKIP got 2 months, expected 12", exited 0 with
"No new complete years available on EDH yet (latency is normal)".
Required repo secrets: EDH_TOKEN, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY.
CDS_API_KEY no longer needed.
Relates to #36
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. pgrep self-detection under uv run: backfill_edh_all.py was aborting with "another backfill_edh_all is running" because uv is the parent process when invoked via `uv run scripts/backfill_edh_all.py`, and uv's command line also contains "backfill_edh_all". Exclude both our own pid AND our parent pid from the match.
2. tee masks Rscript exit code in the GHA step: `Rscript ... | tee log` returns tee's exit status. With bash -e alone, a non-zero R exit was silently swallowed and the step reported success. Added `set -o pipefail` and pinned `shell: bash` so pipeline errors propagate to the job status.
These surfaced on run 24378657502: the pgrep false-positive made R exit 1, which then went unnoticed because of the tee issue. Second run should pass cleanly.
Relates to #36
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The pgrep check still found a false-positive match on ubuntu-24.04 even after filtering my_pid and my_ppid. Candidates include uv's child-of-child processes, bash ancestors whose cmdline includes the pattern, or pgrep's own pre-exec cmdline during fork.
Not worth chasing the exact cause: the check is only meaningful for catching local zombie-process mistakes. CI runs in fresh containers where another instance is not possible by construction. Guard the check with GITHUB_ACTIONS env sentinel.
Relates to #36
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
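Putting the two pgrep fixes together, a single-instance preflight with the CI guard might look like this sketch. It is not the script's actual implementation: the helper `other_instances` is hypothetical, and it assumes `pgrep` is installed and that `GITHUB_ACTIONS` is set to `true` on GitHub-hosted runs.

```python
import os
import subprocess


def other_instances(pids, own_pid, parent_pid):
    """Drop our own process and the launcher (e.g. uv) from a pgrep result."""
    return [pid for pid in pids if pid not in (own_pid, parent_pid)]


def preflight_single_instance(pattern="backfill_edh_all"):
    """Refuse to start if another process matching pattern is running."""
    if os.environ.get("GITHUB_ACTIONS") == "true":
        return  # fresh CI runner, a second instance is not possible
    # pgrep -f matches against the full command line of each process.
    result = subprocess.run(["pgrep", "-f", pattern], capture_output=True, text=True)
    pids = [int(p) for p in result.stdout.split()]
    others = other_instances(pids, os.getpid(), os.getppid())
    if others:
        raise SystemExit(f"another {pattern} is running: {others}")
```

Filtering only the process and its direct parent still leaves room for matches on more distant ancestors, which is consistent with the false positive seen on CI and is why the environment guard comes first.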
Summary
- [...] `ecmwfr`) to DestinE Earth Data Hub (Zarr). ERA5-Land at the same 9 km native grid, no rate limiting (500K req/mo quota), ~5× faster than CDS.
- [...] `s3://stac-era5-land`, verified reading back via `/vsicurl` with expected 76-year time series and BC warming signal (annual mean 1950: -1.42 °C → 2024: +1.85 °C).
- [...] `.github/workflows/climate-update.yml`) takes over monthly updates from CDS using EDH. Checks S3, detects missing complete years, fetches via EDH, aggregates, pushes back.
Test plan
- [...] `scripts/test_edh_era5_land.py`): 15.9s per month vs CDS ~80s, 5× speedup confirmed.
- [...] `scripts/probe_edh_vars.py`): for each variable, compare EDH monthly aggregate against CDS reference for Jan 2000. tmean corr 0.9947, prcp matches CDS monthly-means product exactly via EDH daily product (naive hourly sum is 8× wrong due to `tp` accumulation semantics, documented).
- [...] `scripts/qa_monthly.R`): all 7 variables share extent, res, CRS, ncell. tmin ≤ tmean ≤ tmax sanity passes with zero violations across 163,888 cell-checks.
- [...] `s3://stac-era5-land`.
- [...] `runs/24412364974`): detects S3 latest year = 2025, attempts 2026, EDH has 3/12 months, SKIP per variable, exits clean with "No new complete years available on EDH yet (latency is normal)".
- `pipeline_update_edh.R` (read existing COG from S3 via `/vsicurl`, append new year, push) has only been exercised implicitly: the full end-to-end will run first when a complete new year lands on EDH (~Apr 2027 for 2026). Component functions are proven via the Stage 3 run on this branch.
Follow-ups (not blocking)
What is NOT in this PR
- `R/` function signatures: consumer API unchanged
- `EDH_TOKEN` lives in `~/.Renviron` locally and as a repo secret
Required secrets
- `EDH_TOKEN`: DestinE personal access token (set ✓)
- `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`: for `aws s3 sync` (set ✓)
- `CDS_API_KEY` is no longer needed (CDS path remains in source as fallback but not wired into the GHA)
Fixes #36
Relates to #33
Relates to #35
Relates to NewGraphEnvironment/sred-2025-2026#23
🤖 Generated with Claude Code