
Extract bulk-fetch safeguards into shared scripts/_lib.py #52

Merged
NewGraphEnvironment merged 2 commits into main from 38-bulk-fetch-safeguards
May 4, 2026

@NewGraphEnvironment
Owner

Summary

Closes Layer 1 of #38 — bulk-fetch safeguards. Extracts the existing pgrep guard, retry-with-backoff, and atomic write helpers from scripts/backfill_edh_all.py into a new shared scripts/_lib.py, then applies them to the sibling scripts/backfill_edh_tmax_tmin.py (which previously had none of the three safeguards). Adds a new backup_before_delete() helper that codifies the on-disk pattern from data/backfill/monthly/_cds_backup/.

Net result: -156/+277 LOC, mostly relocation. Both production bulk-fetch scripts now share one source of truth for the safeguards, and the snow-vars backfill script we'll write for #48 inherits them for free via from _lib import ….

Surprise finding during exploration

The issue checklist named three "missing" safeguards in backfill_edh_all.py. Plan-mode exploration found that two of them — pgrep guard and with_retry — were already in place via commits 5bf1b34 and 6f66a01. The issue was partially stale. The third (backup_before_delete) was de facto in operational use on disk (375 files in data/backfill/monthly/_cds_backup/ hand-moved during the EDH migration) but never codified as code. The sibling script backfill_edh_tmax_tmin.py was missing all three.

So the actual scope landed as: extract existing helpers, ship a backup_before_delete() codification, port everything to backfill_edh_tmax_tmin.py.
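The backup-before-delete pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code that landed in scripts/_lib.py: only the signature `backup_before_delete(files, backup_subdir="_backup")` and the "no overwrites" property come from the PR description. Reading "no overwrites" as "raise if a backup already exists", placing the backup directory beside each file, and returning the new paths are all assumptions.

```python
import shutil
from pathlib import Path


def backup_before_delete(files, backup_subdir="_backup"):
    """Move each file into a sibling backup directory instead of deleting it.

    Assumption: "no overwrites" is read here as "raise if a backup with the
    same name already exists"; the real helper may skip or rename instead.
    """
    plan = []
    for f in map(Path, files):
        dest = f.parent / backup_subdir / f.name
        if dest.exists():
            # Check everything up front so a clash leaves all sources untouched.
            raise FileExistsError(f"refusing to overwrite existing backup: {dest}")
        plan.append((f, dest))

    moved = []
    for src, dest in plan:
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Checking every destination before moving anything means a name clash aborts the whole batch rather than leaving it half-moved.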

Changes

New: scripts/_lib.py

  • preflight_single_instance(name) — parameterized pgrep guard (each script passes its own basename)
  • with_retry(fn, ...) — exponential backoff on OSError / ConnectionError / TimeoutError
  • write_geotiff(da, out_path, band_names=MONTH_NAMES) — atomic .tmp + os.replace
  • log(msg) — timestamped, flushed
  • get_token() — EDH token from env or ~/.Renviron
  • backup_before_delete(files, backup_subdir="_backup") — new helper, no overwrites
  • MONTH_NAMES constant
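The atomic `.tmp` + `os.replace` pattern behind write_geotiff can be sketched generically. The helper name `atomic_write` and the `write_fn` callback are hypothetical and exist only to keep the example free of raster dependencies; the real write_geotiff presumably performs the GeoTIFF write itself at that step.

```python
import os
from pathlib import Path


def atomic_write(out_path, write_fn):
    """Write to '<out_path>.tmp' via write_fn(tmp_path), then rename into place.

    Hypothetical generic form of the pattern; not the signature in _lib.py.
    """
    out_path = Path(out_path)
    tmp_path = out_path.with_name(out_path.name + ".tmp")
    try:
        write_fn(tmp_path)
        # os.replace is atomic when both paths are on the same filesystem,
        # so readers see either the old file or the complete new one.
        os.replace(tmp_path, out_path)
    finally:
        # A .tmp file only survives to this point if the write or rename failed.
        tmp_path.unlink(missing_ok=True)
    return out_path
```

A crash mid-write then leaves at most a stray `.tmp` file, so an idempotent-skip check that tests for the final path never mistakes a partial file for a finished one.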

Refactored: scripts/backfill_edh_all.py

  • from _lib import …, drop the now-redundant local copies
  • preflight_single_instance("backfill_edh_all") (parameterized)
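A parameterized pgrep guard of this kind might look like the sketch below. Only the name `preflight_single_instance(name)` comes from the PR; the `pgrep -f` invocation, the `list_pids` injection point, and the exact abort message are assumptions made so the example can be exercised without spawning real processes.

```python
import os
import subprocess
import sys


def pgrep_pids(name):
    """Return PIDs whose full command line matches name, using `pgrep -f`."""
    result = subprocess.run(["pgrep", "-f", name], capture_output=True, text=True)
    # pgrep prints one PID per line and prints nothing when there is no match.
    return [int(tok) for tok in result.stdout.split()]


def preflight_single_instance(name, list_pids=pgrep_pids):
    """Abort if another process matching `name` is already running.

    `list_pids` is a hypothetical parameter added for this sketch.
    """
    me = os.getpid()
    others = [pid for pid in list_pids(name) if pid != me]
    if others:
        # sys.exit with a string writes it to stderr and exits with status 1.
        sys.exit(f"ABORT: {name} already running (pids {others}); this pid is {me}")
```

Passing the script's own basename, as in `preflight_single_instance("backfill_edh_all")`, keeps the two backfill scripts from blocking each other while still rejecting a second copy of the same script.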

Ported: scripts/backfill_edh_tmax_tmin.py

  • Same imports, preflight_single_instance("backfill_edh_tmax_tmin") at top of main()
  • with_retry around xr.open_dataset and each .compute() call
  • write_geotiff replaces the inline non-atomic to_geotiff_raster
  • log(...) replaces print(f"[{time.strftime(...)}] ...") calls
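For readers unfamiliar with the retry wrapper, a minimal sketch of exponential backoff on the three exception types named in this PR follows. The parameters `attempts`, `base_delay`, and `sleep` are hypothetical; the PR only shows `with_retry(fn, ...)`, so the real defaults and signature may differ.

```python
import time


def with_retry(fn, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a transient error wait base_delay * 2**i, then retry.

    attempts, base_delay and sleep are assumptions made for this sketch.
    """
    for i in range(attempts):
        try:
            return fn()
        except (OSError, ConnectionError, TimeoutError):
            # In Python 3 the latter two are subclasses of OSError; they are
            # listed separately only to mirror the PR description.
            if i == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2**i)
```

A call site would then wrap the lazy open and each eager step, for example `ds = with_retry(lambda: xr.open_dataset(url, engine="zarr"))`, where `url` stands in for whichever Zarr store the script reads.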

Test plan

  • python3 -c "import ast; ast.parse(...)" clean on all three files
  • uv run scripts/backfill_edh_all.py --year 1950 — opens both Zarrs under with_retry, hits idempotent-skip path, exits clean
  • uv run scripts/backfill_edh_tmax_tmin.py --year 1950 — same, single hourly Zarr
  • pgrep guard fires: launched backfill_edh_tmax_tmin --year 2026 in background, second concurrent instance ABORTed with both pids reported in the message
  • No orphaned os. / subprocess. / sys. / rasterio. / rioxarray references remain in either bulk-fetch script after the imports were trimmed
  • Post-merge: file soul-repo issue scoping bulk-fetch-safeguards.md convention with scripts/_lib.py as the worked example
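The PR does not say how the orphaned-reference check was run. One way to automate it, sketched here under that caveat, is to walk the AST and flag any `module.attr` use whose module name is never imported in the file. The function name `orphaned_refs` and the `WATCHED` set are hypothetical.

```python
import ast

# Module names whose imports were trimmed from the bulk-fetch scripts.
WATCHED = {"os", "subprocess", "sys", "rasterio", "rioxarray"}


def orphaned_refs(source, watched=WATCHED):
    """Return sorted (lineno, module) pairs where `module.attr` is used but
    `module` is never imported anywhere in the source."""
    tree = ast.parse(source)

    # Collect every name bound by an import statement.
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)

    # Flag attribute access on a watched name that has no import.
    hits = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id in watched
            and node.value.id not in imported
        ):
            hits.append((node.lineno, node.value.id))
    return sorted(hits)
```

This catches the case a plain `ast.parse` cannot: a file that parses cleanly but would raise NameError at runtime because an import was removed while a call site remained.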

Out of scope

  • The soul convention extraction itself (Layer 2 of #38, "Bulk data fetch safeguards — fix gaps, extract as soul convention"). Filed as a follow-up so the convention text is informed by the actual landed code rather than co-evolving across two repos.
  • A --regen CLI flag on either backfill script that wires backup_before_delete into a real call site. Helper ships unused; first real call site lands with #48 ("Add snow-related variables (SWE, snowfall fraction, melt timing) for hydrology departure") if its aggregation method requires re-running existing year files.
  • Python lint config (ruff, black) and pytest setup. cd is R-first; Python scripts are utility-tier with PEP 723 inline deps. Adding lint/test infra is a separate decision.
  • probe_edh_vars.py and test_edh_era5_land.py. Both are one-shot validation scripts; the safeguards target production bulk-fetch.

Notes

#48 (snow variables) will be the first downstream caller of _lib.py. Methodology pinned in issue comment: daily resolution sourced from era5-land-daily-utc-v1.zarr, 7-day rolling sum of daily smlt for snowmelt_rate_peak, daily product preferred over hourly to dodge the stepType=accum trap that bit tp in #36.
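As a rough illustration of the 7-day rolling sum mentioned above, the sketch below computes the largest sum over any seven consecutive daily values. Taking the maximum of the rolling sum as the "peak" is an assumption; the note only says the rolling sum feeds snowmelt_rate_peak. The function name is hypothetical and plain lists stand in for the daily smlt series.

```python
def rolling_sum_peak(daily, window=7):
    """Return the largest sum over any `window` consecutive daily values.

    Assumption: the peak is the maximum of the rolling sum.
    """
    if len(daily) < window:
        raise ValueError(f"need at least {window} values, got {len(daily)}")
    current = sum(daily[:window])
    peak = current
    for i in range(window, len(daily)):
        # Slide the window one day: add the new day, drop the oldest.
        current += daily[i] - daily[i - window]
        peak = max(peak, current)
    return peak
```

On an xarray DataArray the same idea would likely be expressed as `da.rolling(time=7).sum().max("time")`, assuming the daily dimension is named `time`.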

Fixes #38
Relates to NewGraphEnvironment/sred-2025-2026#23

🤖 Generated with Claude Code

NewGraphEnvironment and others added 2 commits May 3, 2026 15:24
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ckfill scripts

Closes the gaps in #38. preflight_single_instance, with_retry, atomic
write_geotiff, log, get_token, MONTH_NAMES move from
backfill_edh_all.py into a new shared scripts/_lib.py. New
backup_before_delete() helper codifies the on-disk pattern from
data/backfill/monthly/_cds_backup/ — no call sites yet, ready for
the snow-vars script (#48) if a regen is needed.

backfill_edh_tmax_tmin.py was missing all three safeguards; now
imports the same helpers, with_retry wraps the zarr open and the
.compute() calls, write_geotiff replaces the inline non-atomic
to_geotiff_raster.

Smoke-tested both scripts with --year 1950 (idempotent-skip path) and
verified the pgrep guard rejects a concurrent second instance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@NewGraphEnvironment NewGraphEnvironment merged commit e8584c9 into main May 4, 2026
1 check passed
@NewGraphEnvironment NewGraphEnvironment deleted the 38-bulk-fetch-safeguards branch May 4, 2026 02:56
