Bulk data fetch safeguards — fix gaps, extract as soul convention #38

## Problem

The EDH migration surfaced a reusable pattern: "what safeguards does a bulk
data fetch script need to be polite AND safe against its own failure modes?"

The pattern was born from painful experience on CDS (#33: rate limiting,
orphan jobs, zombie processes) and the EDH migration QA (#36: atomic writes,
partial writes, silent skips).

Two layers to track:

## Layer 1: fix the gaps in this project's script

`scripts/backfill_edh_all.py` already has:

- [x] Idempotency per output file (skip if exists)
- [x] Atomic writes (.tif.tmp + os.replace) so a killed run doesn't leave
      truncated files that fool the idempotency check
- [x] Explicit SKIP logging when source data is incomplete (partial current year)
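
For reference, the idempotency and atomic-write items above can be sketched as follows. This is a minimal illustration, not the code in `scripts/backfill_edh_all.py`: the function name `write_atomic` and the byte payload are assumptions, and the real script presumably writes rasters through its own I/O layer.

```python
import os
from pathlib import Path


def write_atomic(out_path: Path, payload: bytes) -> str:
    """Write payload to out_path atomically; return "wrote" or "skipped".

    Hypothetical helper: the name and the bytes payload are illustrative only.
    """
    if out_path.exists():
        # Idempotency per output file: a finished file is never rewritten.
        return "skipped"
    # Write next to the target (e.g. 2020_01.tif.tmp) so both paths share a filesystem.
    tmp_path = out_path.with_name(out_path.name + ".tmp")
    with open(tmp_path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # push bytes to disk before the rename
    # os.replace is atomic when source and target are on the same filesystem,
    # so the final name only ever points at a complete file.
    os.replace(tmp_path, out_path)
    return "wrote"
```

A run killed mid-write leaves only the `.tmp` file behind, which the existence check ignores, so the next run redoes that unit.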

Missing safeguards to add on this branch:

- [ ] Single-instance pgrep pre-flight check (so two instances don't run
      concurrently — bit us on CDS when we thought we'd killed a run)
- [ ] Retry-with-backoff on transient HTTP errors from EDH
      (xarray/fsspec do not retry by default — one network blip in a
      3-hour run would crash it)
- [ ] Backup-before-delete when regenerating from a new source (move old
      files to `<dir>/_backup/` instead of rm; delete only after QA passes)
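
The first missing item, the single-instance pre-flight check, might look roughly like this. It is a sketch under assumptions: a Unix-like host with `pgrep` installed, and hypothetical helper names (`other_pids`, `ensure_single_instance`) that are not taken from the existing script.

```python
import os
import subprocess
import sys


def other_pids(pgrep_stdout: str, own_pid: int) -> list[int]:
    """Parse pgrep output (one PID per line) and drop our own PID."""
    pids = [int(tok) for tok in pgrep_stdout.split()]
    return [pid for pid in pids if pid != own_pid]


def ensure_single_instance(script_name: str) -> None:
    """Exit if another process with script_name in its command line is running.

    Assumes a Unix-like system where the pgrep binary is available.
    """
    # pgrep -f matches against the full command line and exits with status 1
    # when nothing matches, so check=False keeps that from raising.
    result = subprocess.run(
        ["pgrep", "-f", script_name],
        capture_output=True,
        text=True,
        check=False,
    )
    others = other_pids(result.stdout, os.getpid())
    if others:
        sys.exit(f"{script_name} already running (PIDs {others}); refusing to start")
```

Calling `ensure_single_instance("backfill_edh_all.py")` at the top of `main()` would be the intended use. One caveat of matching on the command line: a wrapper shell or an editor with the file open can also match, so the pattern may need tightening.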

## Layer 2: extract as a soul convention

Once the gaps above are closed here, propagate the pattern to the
[soul](https://github.com/NewGraphEnvironment/soul) repo as a convention
that future projects can reference. Candidate structure:

### Bulk data fetch safeguards (proposed soul convention)

Checklist for any script that pulls a lot of data from an external API
or cloud store (CDS, EDH, GEE, AWS S3, STAC, ACAT, CKAN, etc.):

**Politeness (hammering prevention):**
- Match the service's pacing model — queue-based APIs (CDS) need per-request
  sleeps; chunk-based stores (Zarr, STAC+COG, S3) do not
- Detect and STOP on rate-limit errors; never retry a 429 without cooldown
- Abort after N consecutive failures — don't pile up orphan requests
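
A minimal sketch of the politeness rules above for a queue-based API. The defaults mirror the values reported for #33 (60 s between requests, abort after 3 failures); `RateLimited`, `fetch_all`, and the `fetch_one` callable are hypothetical names, and how a real client signals a 429 depends on the library in use.

```python
import time


class RateLimited(Exception):
    """Hypothetical: raised by fetch_one when the service answers HTTP 429."""


def fetch_all(units, fetch_one, pause_s=60.0, max_consecutive_failures=3,
              sleep=time.sleep):
    """Fetch units one at a time; return (completed_units, status)."""
    done = []
    failures = 0
    for i, unit in enumerate(units):
        if i > 0:
            sleep(pause_s)  # per-request pacing for queue-based APIs
        try:
            fetch_one(unit)
        except RateLimited:
            # Stop the whole run: more requests would only deepen the limit.
            return done, f"stopped: rate limited at {unit}"
        except Exception:
            failures += 1
            if failures >= max_consecutive_failures:
                return done, f"aborted: {failures} consecutive failures"
            continue
        failures = 0  # any success resets the consecutive-failure counter
        done.append(unit)
    return done, "completed"
```

For chunk-based stores the same loop would be called with `pause_s=0`; the stop-on-rate-limit and abort rules still apply.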

**Self-safety (don't shoot yourself in the foot):**
- Pre-flight single-instance check (pgrep on the script's own name)
- Idempotency per output, not per job — skip already-written files
- Atomic writes — tmp suffix + rename, so a killed run doesn't produce
  "successful-looking" partial files
- Explicit skip-vs-success logging — a user scanning the log should be able
  to tell "wrote it" from "skipped, already exists" from "skipped, source
  data incomplete"
- Retry-with-backoff on transient network errors (separate from rate limits)
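
The retry-with-backoff item could be sketched like this. It is illustrative only: `retry_with_backoff` is a hypothetical helper, and the exception types treated as transient are an assumption, since the errors an xarray/fsspec read actually raises depend on the backend and should be checked before relying on this tuple.

```python
import random
import time


def retry_with_backoff(func, retries=4, base_delay_s=1.0, max_delay_s=60.0,
                       transient=(ConnectionError, TimeoutError),
                       sleep=time.sleep):
    """Call func(); on a transient error wait with exponential backoff and retry.

    The `transient` tuple is an assumption; adjust it to the errors your
    client library raises. Rate-limit errors should NOT be listed here.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except transient:
            if attempt == retries:
                raise  # out of retries: surface the original error
            delay = min(max_delay_s, base_delay_s * 2 ** attempt)
            # Jitter avoids retrying in lockstep with other clients.
            sleep(delay + random.uniform(0, delay / 2))
```

Anything outside the `transient` tuple propagates immediately, which keeps genuine bugs and rate-limit responses out of the retry path.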

**Data integrity:**
- Grid alignment check when mixing sources — different APIs yield different
  pixel grids, extents, CRS (discovered in #36 comparing CDS-produced vs
  EDH-produced monthly TIFs)
- Backup-before-overwrite when regenerating from a different source — keep
  the old data until the new data is QA'd
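
Backup-before-overwrite can be as small as a move into a `_backup/` subdirectory. The sketch below assumes flat output directories of `.tif` files; `backup_existing` is a hypothetical name, and deleting the backup after QA is deliberately left as a manual step.

```python
import shutil
from pathlib import Path


def backup_existing(out_dir: Path, pattern: str = "*.tif") -> list[Path]:
    """Move files matching pattern from out_dir into out_dir/_backup.

    Returns the new backup paths. Nothing is deleted here; remove the
    backup by hand once the regenerated data has passed QA.
    """
    backup_dir = out_dir / "_backup"
    backup_dir.mkdir(exist_ok=True)
    moved = []
    # glob is non-recursive, so files already in _backup are not matched;
    # sorted() materialises the listing before any file is moved.
    for src in sorted(out_dir.glob(pattern)):
        dest = backup_dir / src.name
        if dest.exists():
            # Never clobber an earlier backup silently.
            raise FileExistsError(f"backup already holds {dest.name}")
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Because the outputs are moved rather than copied, the idempotency check sees an empty directory and regenerates every file from the new source.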

**Performance sanity:**
- Benchmark one unit (one month, one file) before committing to a full run
- Estimate total runtime from the benchmark and state it up front
- Log per-unit timing so a slowdown trend is visible
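
The benchmark-then-estimate step is simple arithmetic, sketched here with hypothetical helpers `timed` and `estimate_total_seconds`. The estimate assumes every unit costs about the same as the benchmarked one, which per-unit timing logs can confirm or refute during the run.

```python
import time


def timed(func, *args, clock=time.perf_counter):
    """Run func(*args) and return (result, elapsed_seconds)."""
    start = clock()
    result = func(*args)
    return result, clock() - start


def estimate_total_seconds(unit_seconds: float, n_units: int,
                           pause_s: float = 0.0) -> float:
    """Projected runtime: n_units of work plus a pause between consecutive units."""
    return n_units * unit_seconds + max(n_units - 1, 0) * pause_s
```

For instance, if one month benchmarks at 30 s and there are 360 months with no pacing sleep, `estimate_total_seconds(30.0, 360)` gives 10800 s, i.e. three hours, which is worth printing before the loop starts.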

### Worked examples in this ecosystem

- #33 CDS polite-citizen rewrite — canonical example of queue-based API
  politeness (60s between requests, STOP on rate limit, 3-failure abort,
  pgrep guard)
- #36 EDH Zarr backfill — canonical example of chunk-based stores
  (idempotent, atomic writes, no per-request sleep needed)

## Next steps

1. Add the three missing Layer 1 safeguards on this branch
2. Once merged, open a soul PR adding a `bulk-fetch-safeguards.md`
   convention file based on the checklist above

Relates to #36
Relates to #33
Relates to NewGraphEnvironment/sred-2025-2026#23

