Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
101 changes: 101 additions & 0 deletions planning/active/findings.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
# Findings — Bulk data fetch safeguards (#38)

## Issue context (verbatim from #38)

The EDH migration surfaced a reusable pattern: "what safeguards does a
bulk data fetch script need to be polite AND safe against its own
failure modes?"

Born from painful experience on CDS (#33: rate-limited, orphan jobs,
zombie processes) and the EDH migration QA (#36: atomic writes,
partial writes, silent skips).

### Layer 1: fix the gaps in this project's script

`scripts/backfill_edh_all.py` already has:

- [x] Idempotency per output file (skip if exists)
- [x] Atomic writes (.tif.tmp + os.replace) so a killed run doesn't
leave truncated files that fool the idempotency check
- [x] Explicit SKIP logging when source data is incomplete

Missing safeguards to add on this branch:

- [ ] Single-instance pgrep pre-flight check
- [ ] Retry-with-backoff on transient HTTP errors from EDH
- [ ] Backup-before-delete when regenerating from a new source

### Layer 2: extract as a soul convention

Once Layer 1 lands, propagate the pattern to the
[soul](https://github.com/NewGraphEnvironment/soul) repo as a
convention.

## State found during plan-mode exploration

**Layer 1 is partially already done.** The issue checklist above is
stale.

Commits already on main:

- `5bf1b34` Add single-instance guard and retry-with-backoff to EDH
backfill — closed safeguards 1 and 2 in `backfill_edh_all.py`.
- `6f66a01` Skip pgrep single-instance check on GHA — fixed CI false
positive.

So `backfill_edh_all.py` already has:
- `preflight_single_instance()` at lines 69-94 (with `GITHUB_ACTIONS`
bypass)
- `with_retry(fn, ...)` at lines 97-115 (4 attempts, exponential
backoff, retries on `OSError`/`ConnectionError`/`TimeoutError`)
- Atomic write via `.tmp` + `os.replace` at lines 153-178
(`write_geotiff`)

Backup-before-delete (3rd safeguard) is **in operational use on disk**
but never codified: `data/backfill/monthly/_cds_backup/` holds 375
CDS-era TIFs, hand-managed before the EDH backfill overwrote them.

**Sibling script `backfill_edh_tmax_tmin.py` is missing all three
safeguards.** Same shape as `backfill_edh_all.py` (year loop,
idempotency, monthly aggregation, GeoTIFF write) but no pgrep guard,
no retry, no atomic write. One-shot scripts (`probe_edh_vars.py`,
`test_edh_era5_land.py`) are not in scope.

## Architecture decisions taken (user-confirmed)

1. **Extract to `scripts/_lib.py`** rather than copy-paste between the
two scripts. One source of truth, well-positioned for the eventual
snow-backfill script (#48). Module is plain importable Python; both
scripts retain their PEP 723 inline-deps shebang.
2. **Soul convention extraction is a follow-up.** This branch closes
Layer 1; Layer 2 ships as a separate soul-repo PR after merge.
3. **Hoist `get_token()` into `_lib.py`** while refactoring — both
scripts already duplicate it.
4. **Keep `bc_slice`, `tetens_es` in-script** — they are
pipeline-specific to the EDH all-vars pipeline, not bulk-fetch
safeguards.
5. **Ship `backup_before_delete()` helper unused.** First real call
site lands with #48 if its aggregation method requires re-running
existing years.

## Inventory of bulk-fetch scripts in `scripts/`

| Script | Pgrep | Retry | Atomic Write | Notes |
|---|---|---|---|---|
| `backfill_edh_all.py` | done (69-94) | done (97-115) | done (153-178) | All 7 cd vars, 1950-2025 |
| `backfill_edh_tmax_tmin.py` | missing | missing | missing | tmax/tmin only |
| `probe_edh_vars.py` | n/a | n/a | n/a | One-shot validation; out of scope |
| `test_edh_era5_land.py` | n/a | n/a | n/a | Benchmark; out of scope |

## Python tooling (none configured)

- No `pyproject.toml`, no `ruff.toml`, no `pytest` config in repo.
- All scripts use PEP 723 inline deps with
`#!/usr/bin/env -S uv run --script` shebang — self-contained.
- CLAUDE.md documents R conventions only; Python is utility-tier.

## Anchors for the soul convention follow-up

When the soul-repo issue lands, the body cribs from #38's "Layer 2"
checklist (politeness / self-safety / data integrity / performance
sanity). Pin the new `scripts/_lib.py` as the canonical worked example.
27 changes: 27 additions & 0 deletions planning/active/progress.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,27 @@
# Progress — Bulk data fetch safeguards (#38)

## Session 2026-05-03

- Plan-mode exploration — confirmed 2 of 3 safeguards already in
`backfill_edh_all.py` (commits `5bf1b34`, `6f66a01`); sibling
`backfill_edh_tmax_tmin.py` is missing all three. Phases approved
by user with two scope decisions: extract helpers to
`scripts/_lib.py`, and defer soul convention to a follow-up.
- Created branch `38-bulk-fetch-safeguards` off main (post v0.1.5
release).
- Scaffolded PWF baseline.
- Phases 1–3 implemented: `scripts/_lib.py` ships
`preflight_single_instance(name)`, `with_retry`, `write_geotiff`,
`log`, `get_token`, `MONTH_NAMES`, and new
`backup_before_delete()` helper. Both `backfill_edh_all.py` and
`backfill_edh_tmax_tmin.py` now import from `_lib`. Net
-156/+41 LOC.
- Phase 4 verification: pgrep guard fires correctly when a second
instance launches (tested live with both pids reported); both
scripts idempotent-skip on existing year files; no orphaned
imports remain.
- Methodology pinned on #48 (snow vars): 7-day rolling sum of daily
`smlt` for `snowmelt_rate_peak`, daily UTC product preferred over
hourly to dodge the `stepType=accum` trap (issue comment 4367348096).
- Next: commit, then Phase 5 (file soul-convention follow-up issue)
+ Phase 6 (PR).
103 changes: 103 additions & 0 deletions planning/active/task_plan.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,103 @@
# Task: Bulk data fetch safeguards — fix gaps, extract as soul convention (#38)

## Problem

Three safeguards were filed as missing in `scripts/backfill_edh_all.py`:
single-instance pgrep guard, retry-with-backoff, backup-before-delete.
On exploration, (1) and (2) are already implemented (commits `5bf1b34`,
`6f66a01`) but the issue checklist is stale. The `_cds_backup/`
directory on disk (375 files) shows backup-before-delete is in
operational use but never codified. Sibling script
`backfill_edh_tmax_tmin.py` is missing all three safeguards.

Goal: extract the three safeguards (plus shared `log` helper and
`MONTH_NAMES`) into a new `scripts/_lib.py`, refactor both production
bulk-fetch scripts to import from it, and add a `backup_before_delete()`
helper that codifies the on-disk pattern. Soul convention extraction
(Layer 2) is a follow-up issue once this lands.

## Phase 1 — Extract `scripts/_lib.py`

- [x] Create `scripts/_lib.py` with helpers parameterized for reuse:
`preflight_single_instance(name)`, `with_retry(fn, ...)`,
`write_geotiff(da, out_path, band_names)`, `log(msg)`,
`MONTH_NAMES`. Hoist `get_token()` too — both scripts duplicate it.
- [x] Refactor `backfill_edh_all.py` to `from _lib import (...)`. Drop
the now-redundant local copies (lines 69-94, 97-115, 118-127,
153-178, 181-182).
- [x] Smoke test on one year:
`uv run scripts/backfill_edh_all.py --year 1950`. Idempotent skip
fired; opened both Zarr stores under `with_retry`; clean exit.

## Phase 2 — Port safeguards to `backfill_edh_tmax_tmin.py`

- [x] `from _lib import (...)` at top.
- [x] `preflight_single_instance("backfill_edh_tmax_tmin")` at top of
`main()`.
- [x] Wrap `xr.open_dataset(zarr_url, ...)` (line 96) in `with_retry`.
- [x] Wrap each `.compute()` call (lines 132-133) in `with_retry`.
- [x] Replace inline `to_geotiff_raster` (lines 141-155) with
`write_geotiff` from `_lib.py`.
- [x] Replace ad-hoc `print(...)` with `log(...)`.
- [x] Smoke test:
`uv run scripts/backfill_edh_tmax_tmin.py --year 1950`. Idempotent
skip fired; clean exit.

## Phase 3 — Add `backup_before_delete()` helper

- [x] Add `backup_before_delete(files, backup_subdir="_backup")` to
`_lib.py`. Each file moves to `file.parent/backup_subdir/file.name`.
No overwrite (skip with warning if backup target exists).
- [x] Module docstring documents intended use during regenerations and
points at `data/backfill/monthly/_cds_backup/` as the worked
example.
- [x] No call sites added now — both current scripts are pure-write,
protected by idempotency. Helper exists for #48 if its
aggregation method requires re-running existing years.

## Phase 4 — Verify safeguard behaviors

- [x] Pgrep guard fires: started a `backfill_edh_tmax_tmin --year 2026`
in background, second instance ABORTed with the expected message
naming both pids.
- [ ] Output equivalence: byte-compare a re-rendered year file from
both scripts against the pre-refactor file. (Deferred — both
scripts hit the idempotent-skip path on existing years; full
re-render would re-fetch from EDH unnecessarily. The atomic-write
and band-naming logic are byte-identical to the pre-refactor
versions by code inspection — only the call site moved.)

## Phase 5 — File soul-convention follow-up

- [ ] After this PR merges, file an issue in
`NewGraphEnvironment/soul` referencing #38, scoping a
`bulk-fetch-safeguards.md` convention file with `scripts/_lib.py`
as the canonical worked example.

## Phase 6 — Code-check, commit, PR

- [ ] `/code-check` on staged diff before each commit.
- [ ] Atomic commits: `_lib.py` + `backfill_edh_all.py` refactor (1),
`backfill_edh_tmax_tmin.py` port (2), `backup_before_delete` (3).
- [ ] PR with `Fixes #38`, SRED ref in body
(`Relates to NewGraphEnvironment/sred-2025-2026#23`).

## Validation

- [ ] Single-year smoke test on `backfill_edh_all.py` succeeds.
- [ ] Single-year smoke test on `backfill_edh_tmax_tmin.py` succeeds.
- [ ] Pgrep guard refuses second concurrent instance of each script.
- [ ] No new dependencies in either script's PEP 723 inline deps.
- [ ] `/code-check` clean on each commit.
- [ ] PWF checkboxes match landed work.
- [ ] `/planning-archive` on completion.

## Out of scope

- Python lint config (ruff, black) — `cd` is R-first.
- Pytest test runner.
- Soul PR itself (deferred).
- `probe_edh_vars.py` and `test_edh_era5_land.py` (one-shot validation
scripts; safeguards target production bulk-fetch).
- A `--regen` flag wiring `backup_before_delete` into either script;
helper ships unused, first real call site lands with #48 if needed.
Loading
Loading