The EDH migration surfaced a reusable pattern: "what safeguards does a bulk
data fetch script need to be polite AND safe against its own failure modes?"
Born from painful experience on CDS (#33: rate-limited, orphan jobs, zombie
processes) and the EDH migration QA (#36: atomic writes, partial writes,
silent skips).
Two layers to track:
Layer 1: fix the gaps in this project's script
scripts/backfill_edh_all.py already has:
- Idempotency per output file (skip if exists)
- Atomic writes (.tif.tmp + os.replace) so a killed run doesn't leave
  truncated files that fool the idempotency check
- Explicit SKIP logging when source data is incomplete (partial current year)
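The idempotency and atomic-write behaviour listed above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/backfill_edh_all.py: `write_atomic` and the `produce` callback are hypothetical names, and the sketch assumes the temporary file and the final file live in the same directory (and so on the same filesystem).

```python
import os
from pathlib import Path
from typing import Callable


def write_atomic(dest: Path, produce: Callable[[Path], None]) -> str:
    """Write dest via a temporary sibling file; return "skipped" or "wrote"."""
    if dest.exists():
        # Idempotency: the final name is only ever created by os.replace,
        # so its presence means an earlier run finished writing it.
        return "skipped"
    tmp = dest.with_name(dest.name + ".tmp")  # e.g. 2020-01.tif.tmp
    try:
        produce(tmp)           # hypothetical callback that writes the raster
        os.replace(tmp, dest)  # atomic rename within one filesystem
    finally:
        # Clean up a leftover temp file if produce() raised part-way through.
        if tmp.exists():
            tmp.unlink()
    return "wrote"
```

A run killed with SIGKILL skips the `finally` block, but the leftover file carries the `.tmp` suffix, so the skip-if-exists check on the next run is not fooled by it.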
Missing safeguards to add on this branch:
- Single-instance pgrep pre-flight check, so two instances don't run
  concurrently (this bit us on CDS when we thought we'd killed a run)
- Retry-with-backoff on transient HTTP errors from EDH (xarray/fsspec do
  not retry by default, so one network blip in a 3-hour run would crash it)
- Backup-before-delete when regenerating from a new source: move old
  files to <dir>/_backup/ instead of rm, and delete only after QA passes
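As one possible shape for the single-instance pre-flight check, the sketch below shells out to `pgrep -f` and ignores the current process. The function names are hypothetical, and it assumes a Unix-like host with `pgrep` on the PATH. Note that `pgrep -f` treats its argument as a regular expression matched against full command lines, so it can also match unrelated processes that mention the script name, such as an editor with the file open.

```python
import os
import subprocess
import sys


def parse_other_pids(pgrep_stdout: str, own_pid: int) -> list[int]:
    """Turn pgrep output (one PID per line) into a list, minus our own PID."""
    return [int(tok) for tok in pgrep_stdout.split() if int(tok) != own_pid]


def other_instances(script_name: str) -> list[int]:
    """Return PIDs of other processes whose command line matches script_name."""
    result = subprocess.run(
        ["pgrep", "-f", script_name], capture_output=True, text=True
    )
    # pgrep exits with 0 on a match and 1 on no match; anything else is an error.
    if result.returncode not in (0, 1):
        raise RuntimeError(f"pgrep failed: {result.stderr.strip()}")
    return parse_other_pids(result.stdout, os.getpid())


def require_single_instance(script_name: str) -> None:
    """Exit with a message if another instance appears to be running."""
    others = other_instances(script_name)
    if others:
        sys.exit(f"{script_name} already running (PIDs {others}); aborting")
```

Calling `require_single_instance("backfill_edh_all.py")` as the first statement under `if __name__ == "__main__":` would stop a second run before it makes any request.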
Layer 2: extract as a soul convention
Once the gaps above are closed here, propagate the pattern to the soul repo as a convention
that future projects can reference. Candidate structure:
Bulk data fetch safeguards (proposed soul convention)
Checklist for any script that pulls a lot of data from an external API
or cloud store (CDS, EDH, GEE, AWS S3, STAC, ACAT, CKAN, etc.):
Politeness (hammering prevention):
- Match the service's pacing model: queue-based APIs (CDS) need per-request
  sleeps; chunk-based stores (Zarr, STAC+COG, S3) do not
- Detect and STOP on rate-limit errors; never retry a 429 without a cooldown
- Abort after N consecutive failures so orphan requests don't pile up
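A rough sketch of how the three politeness rules could combine in one fetch loop. `RateLimited`, `polite_fetch_all`, and `fetch_one` are hypothetical names: the sketch assumes the fetch callback raises `RateLimited` when the service answers with a 429 or equivalent. The defaults (60 s pause, abort after 3 failures in a row) are only examples.

```python
import time
from typing import Callable, Iterable


class RateLimited(Exception):
    """Hypothetical: raised by fetch_one on an HTTP 429 or equivalent."""


def polite_fetch_all(
    units: Iterable[str],
    fetch_one: Callable[[str], None],
    pause_s: float = 60.0,
    max_consecutive_failures: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Fetch units one at a time; return the units that succeeded."""
    done: list[str] = []
    failures = 0
    for i, unit in enumerate(units):
        if i and pause_s > 0:
            sleep(pause_s)  # pacing for queue-based APIs; use 0 for chunk stores
        try:
            fetch_one(unit)
        except RateLimited:
            # Stop outright instead of retrying into the rate limit.
            print(f"STOP rate-limited at {unit}; {len(done)} units done")
            break
        except Exception as exc:
            failures += 1
            print(f"FAIL {unit}: {exc!r} ({failures} in a row)")
            if failures >= max_consecutive_failures:
                print("ABORT too many consecutive failures")
                break
        else:
            failures = 0
            done.append(unit)
    return done
```

For a chunk-based store, the same loop can be called with `pause_s=0`, which keeps the stop and abort rules but drops the per-request sleep.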
Self-safety (don't shoot your own foot):
- Pre-flight single-instance check (pgrep on the script's own name)
- Idempotency per output, not per job: skip already-written files
- Atomic writes: tmp suffix + rename, so a killed run doesn't produce
  "successful-looking" partial files
- Explicit skip-vs-success logging: a user scanning the log should be able
  to tell "wrote it" from "skipped, already exists" from "skipped, source
  data incomplete"
- Retry-with-backoff on transient network errors (separate from rate limits)
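For the retry-with-backoff item, a minimal sketch might look like the following. `with_retry` and the `TRANSIENT` tuple are hypothetical: which exception types count as transient depends on the client library in use, so the built-in `ConnectionError` and `TimeoutError` here are placeholders. Rate-limit errors are deliberately left out of the tuple, so they propagate immediately instead of being retried.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Placeholder set of "transient" errors; adjust to the client library in use.
TRANSIENT = (ConnectionError, TimeoutError)


def with_retry(
    fetch: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying transient errors with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except TRANSIENT as exc:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Delay doubles each time, plus up to 1 s of jitter.
            delay = base_delay * 2 ** (attempt - 1) + random.uniform(0, 1)
            print(f"RETRY {attempt}/{attempts - 1} in {delay:.1f}s: {exc!r}")
            sleep(delay)
    raise AssertionError("unreachable")
```

Wrapping only the network call for one unit (one month, one file) keeps a single blip from costing more than that unit's work.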
Data integrity:
- Grid alignment check when mixing sources: different APIs yield different
  pixel grids, extents, and CRS (discovered in #36, "Migrate cd_fetch() to
  DestinE Earth Data Hub Zarr", when comparing CDS-produced vs EDH-produced
  monthly TIFs)
- Backup-before-overwrite when regenerating from a different source: keep
  the old data until the new data is QA'd
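The backup-before-overwrite rule could be sketched as below, using a `_backup/` subdirectory inside the output directory. `backup_outputs` is a hypothetical helper, and the `*.tif` pattern is an assumption about the output naming. The sketch refuses to overwrite an existing backup, on the assumption that an earlier backup has not been QA'd and deleted yet.

```python
import shutil
from pathlib import Path


def backup_outputs(out_dir: Path, pattern: str = "*.tif") -> list[Path]:
    """Move matching files in out_dir to out_dir/_backup; return new paths."""
    backup_dir = out_dir / "_backup"
    backup_dir.mkdir(exist_ok=True)
    moved: list[Path] = []
    # glob() is non-recursive, so files already in _backup/ are not matched.
    for src in sorted(out_dir.glob(pattern)):
        dest = backup_dir / src.name
        if dest.exists():
            raise FileExistsError(
                f"{dest} already exists; refusing to overwrite a backup"
            )
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Deleting `_backup/` then becomes a separate, manual step taken only after the regenerated files pass QA.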
Performance sanity:
- Benchmark one unit (one month, one file) before committing to a full run
- Estimate total runtime from the benchmark and state it up front
- Log per-unit timing so a slowdown trend is visible
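The benchmark-then-estimate step is simple arithmetic, sketched here with hypothetical helper names. The numbers in the usage note are illustrative, not measurements from this project.

```python
import time
from typing import Callable


def benchmark_unit(run_one: Callable[[], None]) -> float:
    """Time one unit of work (one month, one file) in seconds."""
    start = time.perf_counter()
    run_one()
    return time.perf_counter() - start


def estimate_total(seconds_per_unit: float, n_units: int) -> str:
    """Format a runtime estimate to print before the full run starts."""
    total = seconds_per_unit * n_units
    hours, rem = divmod(int(round(total)), 3600)
    minutes = rem // 60
    return f"{n_units} units x {seconds_per_unit:.1f}s = ~{hours}h {minutes:02d}m"
```

For example, `estimate_total(30.0, 360)` returns `"360 units x 30.0s = ~3h 00m"`. Calling `benchmark_unit` around every unit during the full run, and logging the result, also gives the per-unit timing needed to spot a slowdown.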
Worked examples in this ecosystem
- CDS fetch script (#33): politeness (60s between requests, STOP on rate limit, 3-failure abort, pgrep guard)
- scripts/backfill_edh_all.py (idempotent, atomic writes, no per-request sleep needed)
Next steps
Write a bulk-fetch-safeguards.md convention file based on the checklist above
Relates to #36
Relates to #33
Relates to NewGraphEnvironment/sred-2025-2026#23