Bulk data fetch safeguards — fix gaps, extract as soul convention #38

@NewGraphEnvironment

Problem

The EDH migration surfaced a reusable pattern: "what safeguards does a bulk
data fetch script need to be polite AND safe against its own failure modes?"

The pattern comes from painful experience with CDS (#33: rate limiting,
orphan jobs, zombie processes) and from the EDH migration QA (#36: atomic
writes, partial writes, silent skips).

Two layers to track:

Layer 1: fix the gaps in this project's script

scripts/backfill_edh_all.py already has:

  • Idempotency per output file (skip if exists)
  • Atomic writes (.tif.tmp + os.replace) so a killed run doesn't leave
    truncated files that fool the idempotency check
  • Explicit SKIP logging when source data is incomplete (partial current year)
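
For reference, the first two safeguards above can be sketched as follows. This is a minimal illustration, not the code in scripts/backfill_edh_all.py: the function name `write_atomically` and the returned status strings are hypothetical, and the sketch assumes the temporary file lives in the same directory (and therefore the same filesystem) as the final output.

```python
import os
from pathlib import Path


def write_atomically(out_path: Path, data: bytes) -> str:
    """Write data to out_path via a temporary sibling file.

    Returns "skipped" if out_path already exists, otherwise "written".
    """
    if out_path.exists():
        # Idempotency: a finished file is never rewritten.
        return "skipped"

    # Temporary sibling, e.g. 2020_01.tif -> 2020_01.tif.tmp
    tmp_path = out_path.with_name(out_path.name + ".tmp")
    with open(tmp_path, "wb") as fh:
        fh.write(data)
        fh.flush()
        os.fsync(fh.fileno())  # push bytes to disk before the rename

    # os.replace is atomic when both paths are on the same filesystem,
    # so out_path is either absent or complete, never truncated.
    os.replace(tmp_path, out_path)
    return "written"
```

A run killed mid-write leaves only a `.tif.tmp` file behind, which the existence check ignores, so the next run redoes that unit.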

Missing safeguards to add on this branch:

  • Single-instance pgrep pre-flight check (so two instances don't run
    concurrently — bit us on CDS when we thought we'd killed a run)
  • Retry-with-backoff on transient HTTP errors from EDH
    (xarray/fsspec do not retry by default — one network blip in a
    3-hour run would crash it)
  • Backup-before-delete when regenerating from a new source (move old
    files to <dir>/_backup/ instead of rm; delete only after QA passes)
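
A rough sketch of the single-instance check and the backup step, under stated assumptions: the helper names (`other_pids`, `assert_single_instance`, `move_to_backup`) are hypothetical, `pgrep` is assumed to be on the PATH, and the backup directory follows the `_backup` naming proposed above. Retry-with-backoff is left out here to keep the block short.

```python
import os
import shutil
import subprocess
from pathlib import Path


def other_pids(pgrep_stdout: str, own_pid: int) -> list[int]:
    """Parse pgrep output (one PID per line) and drop this process's own PID."""
    pids = [int(token) for token in pgrep_stdout.split()]
    return [pid for pid in pids if pid != own_pid]


def assert_single_instance(script_name: str) -> None:
    """Exit if another process whose command line matches script_name is running."""
    # pgrep -f matches against the full command line and prints nothing
    # (exit status 1) when there is no match.
    result = subprocess.run(
        ["pgrep", "-f", script_name], capture_output=True, text=True
    )
    running = other_pids(result.stdout, os.getpid())
    if running:
        raise SystemExit(f"{script_name} already running (PIDs {running}); aborting")


def move_to_backup(files: list[Path], data_dir: Path) -> Path:
    """Move old outputs into data_dir/_backup instead of deleting them."""
    backup_dir = data_dir / "_backup"
    backup_dir.mkdir(exist_ok=True)
    for f in files:
        shutil.move(str(f), str(backup_dir / f.name))
    return backup_dir
```

One caveat with `pgrep -f`: it matches any command line containing the pattern, so an editor or a `tail` on a log file with the script's name in its path would also count as "running". A lock file is a common alternative if that becomes a problem.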

Layer 2: extract as a soul convention

Once the gaps above are closed here, propagate the pattern to the soul repo
as a convention that future projects can reference. Candidate structure:

Bulk data fetch safeguards (proposed soul convention)

Checklist for any script that pulls a lot of data from an external API
or cloud store (CDS, EDH, GEE, AWS S3, STAC, ACAT, CKAN, etc.):

Politeness (hammering prevention):

  • Match the service's pacing model — queue-based APIs (CDS) need per-request
    sleeps; chunk-based stores (Zarr, STAC+COG, S3) do not
  • Detect and STOP on rate-limit errors; never retry a 429 without cooldown
  • Abort after N consecutive failures — don't pile up orphan requests
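
The politeness rules could look roughly like this in a fetch loop. Everything named here is an assumption for illustration: `RateLimited` is a hypothetical exception that the caller's `fetch_one` would raise on an HTTP 429 (or the service's equivalent), and `pause_s` stands in for the per-request sleep that queue-based APIs need (leave it at 0 for chunk-based stores).

```python
import time
from typing import Callable, Iterable


class RateLimited(Exception):
    """Hypothetical marker raised by fetch_one on an HTTP 429 or equivalent."""


def run_politely(
    units: Iterable[str],
    fetch_one: Callable[[str], None],
    max_consecutive_failures: int = 3,
    pause_s: float = 0.0,
) -> list[str]:
    """Fetch units in order and return the ones that succeeded.

    Stops immediately on RateLimited and aborts after
    max_consecutive_failures failures in a row.
    """
    done: list[str] = []
    failures_in_a_row = 0
    for unit in units:
        try:
            fetch_one(unit)
        except RateLimited:
            # Never retry straight away: stop and let a person or a
            # scheduler decide when the cooldown is over.
            print(f"RATE-LIMITED on {unit}; stopping run")
            break
        except Exception as exc:
            failures_in_a_row += 1
            print(f"FAIL {unit}: {exc}")
            if failures_in_a_row >= max_consecutive_failures:
                print(f"ABORT after {failures_in_a_row} consecutive failures")
                break
        else:
            failures_in_a_row = 0
            done.append(unit)
            time.sleep(pause_s)  # per-request pacing for queue-based APIs
    return done
```

Because outputs are idempotent, a stopped run can simply be restarted later and will resume at the first missing unit.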

Self-safety (don't shoot your own foot):

  • Pre-flight single-instance check (pgrep on the script's own name)
  • Idempotency per output, not per job — skip already-written files
  • Atomic writes — tmp suffix + rename, so a killed run doesn't produce
    "successful-looking" partial files
  • Explicit skip-vs-success logging — a user scanning the log should be able
    to tell "wrote it" from "skipped, already exists" from "skipped, source
    data incomplete"
  • Retry-with-backoff on transient network errors (separate from rate limits)
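
A minimal sketch of retry-with-backoff, kept separate from rate-limit handling. The choice of `ConnectionError` and `TimeoutError` as the "transient" set is an assumption; a real script would list whatever exceptions its HTTP or storage layer actually raises. The `sleep` parameter is injectable only so the helper can be tested without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumption: these built-in exception types stand in for "transient".
# Rate-limit errors are deliberately NOT in this tuple.
TRANSIENT = (ConnectionError, TimeoutError)


def retry_with_backoff(
    call: Callable[[], T],
    attempts: int = 5,
    base_delay_s: float = 1.0,
    max_delay_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` until it succeeds, doubling the wait after each transient error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except TRANSIENT as exc:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            delay += random.uniform(0, delay / 2)  # jitter to avoid lockstep retries
            print(f"transient error ({exc!r}); retry {attempt + 1} in {delay:.1f}s")
            sleep(delay)
```

Usage would be along the lines of `retry_with_backoff(lambda: fetch_month("2020-01"))`, where `fetch_month` is a placeholder for the project's own fetch function. Non-transient exceptions propagate on the first failure.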

Data integrity:

  • Grid alignment check when mixing sources: different APIs yield different
    pixel grids, extents, and CRS (discovered in #36, "Migrate cd_fetch() to
    DestinE Earth Data Hub Zarr", when comparing CDS-produced and
    EDH-produced monthly TIFs)
  • Backup-before-overwrite when regenerating from a different source — keep
    the old data until the new data is QA'd
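
One way the grid alignment check might be expressed, independent of any raster library. `GridSpec` and `grid_mismatches` are hypothetical names; the fields are assumed to be filled from whatever the raster reader reports (for example the CRS, affine transform, width and height of each TIF).

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class GridSpec:
    crs: str                      # e.g. "EPSG:4326"
    transform: tuple[float, ...]  # six affine coefficients (a, b, c, d, e, f)
    width: int
    height: int


def grid_mismatches(a: GridSpec, b: GridSpec, tol: float = 1e-9) -> list[str]:
    """Return a human-readable list of differences; empty means aligned."""
    problems = []
    if a.crs != b.crs:
        problems.append(f"crs: {a.crs} != {b.crs}")
    if (a.width, a.height) != (b.width, b.height):
        problems.append(f"shape: {a.width}x{a.height} != {b.width}x{b.height}")
    same_transform = len(a.transform) == len(b.transform) and all(
        math.isclose(x, y, rel_tol=0.0, abs_tol=tol)
        for x, y in zip(a.transform, b.transform)
    )
    if not same_transform:
        problems.append(f"transform: {a.transform} != {b.transform}")
    return problems
```

Running this on one file from each source before a bulk regeneration would catch a half-pixel origin shift or a different extent early. Returning a list of all differences, rather than failing on the first one, keeps the QA log informative.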

Performance sanity:

  • Benchmark one unit (one month, one file) before committing to a full run
  • Estimate total runtime from the benchmark and state it up front
  • Log per-unit timing so a slowdown trend is visible
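
The performance checks are mostly arithmetic, sketched here with hypothetical helper names (`time_unit`, `estimate_total`, `format_duration`). The `clock` parameter exists only to make the timing testable.

```python
import time
from typing import Callable


def format_duration(seconds: float) -> str:
    """Render seconds as e.g. '3h00m00s'."""
    minutes, secs = divmod(int(round(seconds)), 60)
    hours, minutes = divmod(minutes, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


def time_unit(
    unit: str,
    process: Callable[[str], None],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Run process(unit), log how long it took, and return the elapsed seconds."""
    start = clock()
    process(unit)
    elapsed = clock() - start
    print(f"{unit}: {elapsed:.2f}s")  # per-unit timing makes slowdowns visible
    return elapsed


def estimate_total(unit_seconds: float, n_units: int) -> str:
    """Extrapolate one benchmarked unit to the full run."""
    return format_duration(unit_seconds * n_units)
```

With made-up numbers: if one month takes 45 s and there are 240 months to fetch, `estimate_total(45.0, 240)` gives `3h00m00s`, which is worth printing before the full run starts. The same `time_unit` wrapper can then be used for every unit in the real run.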

Worked examples in this ecosystem

Next steps

  1. Close the three missing safeguards on this branch
  2. Once merged, open a soul PR adding a bulk-fetch-safeguards.md
    convention file based on the checklist above

Relates to #36
Relates to #33
Relates to NewGraphEnvironment/sred-2025-2026#23
