2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -15,7 +15,7 @@ Infrastructure stack stabilized; methodology research is the active focus. Recen

- **v0.27.0** (#114, #45) — `cfg$pipeline$gradient_classes` config knob + per-species filter derivation in `prep_minimal`. Bit-identical bcfp parity by default; lets bundles override the break vector.
- **v0.28.0** (#119, #45-followup) — Orphan-class break source: classes below all species' access thresholds enter `gradient_barriers_minimal` as a shared `barriers_orphan` table (segmentation only, no access semantics). New `default_extrabreaks` bundle. Province-wide methodology delta (231 vs 225 WSGs): SK spawn +6.7%, BT/RB/ST/WCT/GR spawn +1–2%, CH/ST rear -11%, BT/CO/GR rear -3 to -7%. "Ceiling sub-segment" mechanism — flat parts of mixed reaches now pass spawn; steep pockets previously folded into rear via averaging exceed rear ceiling as standalone segments.
- **v0.29.0** (#120, #118) — DB hygiene: `cleanup_working = TRUE` in `compare_bcfishpass_wsg` drops working schemas after rollup; `keep_source = FALSE` in `consolidate_schema` drops source schemas after pg_restore. Prevents the disk-full incident that crashed cypher 2026-05-04.
- **v0.29.0** (#120, #118) — DB hygiene: `cleanup_working = TRUE` in `compare_bcfishpass_wsg` drops working schemas after rollup; `keep_source = FALSE` in `schema_consolidate` drops source schemas after pg_restore. Prevents the disk-full incident that crashed cypher 2026-05-04.

Cypher recovered (volumes nuked); needs fwapg reload before next provincial run. Methodology research is active: `default_extrabreaks` proved the mechanism; next variants include spawn-only / rear-only break vectors (each adds ~4 new classes; per-WSG cost ~1.3–1.4× vs default) and the channel-class breaks research (#52, blocked on fresh-side helper).

2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,6 +1,6 @@
Package: link
Title: Stream Network Habitat Interpretation (Experimental)
Version: 0.37.0
Version: 0.38.0
Date: 2026-05-14
Authors@R: c(
person("Allan", "Irvine", , "airvine@newgraphenvironment.com",
25 changes: 25 additions & 0 deletions NEWS.md
@@ -1,3 +1,28 @@
# link 0.38.0

Provincial-run autonomy CLI + 8 operational-script renames to noun_verb convention. Closes [#172](https://github.com/NewGraphEnvironment/link/issues/172). Builds on v0.37.0's #168 decouple — with PG-state resume in place, the autonomy surface stays thin and the renames stay mechanical.

- **Single-command autonomous run.** `wsgs_run_pipeline.sh` (was `province_run.sh`) accepts `--wsgs=A,B,C`, `--config=<name>`, `--schema=<name>`, `--no-cyphers`, `--force`, forwards to `wsgs_dispatch.sh` (was `trifecta_provincial.sh`) which intersects the WSG subset in its LPT split. M4+M1-only baseline validated end-to-end: 16-WSG default-bundle dispatch lands 16/16 in `fresh_default.streams` on M4, ~20 min wall, no operator prompts.
- **Step 0 pre-clean.** When `--schema=` is set, umbrella fires `state_clean.sh --schemas=<schema>` on every host before Step 1. Drops only the target schema (skips the canonical-fresh heuristic + snapshot reload). Eliminates a class of consolidate failures where stale leftover WSGs on a source host collided with destination data during pg_restore.
- **Scoped `state_clean.sh` (was `province_clean.sh`).** New `--schemas=A,B,C` mode drops only the listed schemas. Empty `--schemas=` rejected loud to prevent dynamic-arg silent fall-through to the destructive default mode.
- **Phantom-cy + error-surface fixes in `wsgs_dispatch.sh`.** R's `paste0("cy", integer(0))` returns `"cy"` length-1 (constant recycling) — would put a non-existent cypher in the host plan under `--no-cyphers`. Three-branched `cy_host_keys`. Empty `CY_WORKSPACES` init via explicit `CY_WS_ARR=()` (was `read -r -a` yielding single-empty-element). `SPLIT_OUT=$(Rscript ...)` wrapped with explicit `||` block so R-side `stop()` messages reach the operator (e.g. `--wsgs=BOGUS` surfaces the R error verbatim instead of silent abort).
- **8 rename mapping (`git mv` preserves `git log --follow`).** Names now describe scope honestly — these scripts work for any list of WSGs / any host count / any reference:

| Old | New |
|---|---|
| `data-raw/province_run.sh` | `data-raw/wsgs_run_pipeline.sh` |
| `data-raw/province_clean.sh` | `data-raw/state_clean.sh` |
| `data-raw/province_progress.sh` | `data-raw/progress_check.sh` |
| `data-raw/trifecta_provincial.sh` | `data-raw/wsgs_dispatch.sh` |
| `data-raw/run_provincial_parity.R` | `data-raw/wsgs_run_host.R` |
| `data-raw/consolidate_schema.R` | `data-raw/schema_consolidate.R` |
| `data-raw/archive_provincial_runs.sh` | `data-raw/runs_archive.sh` |
| `data-raw/balance_provincial_buckets.R` | `data-raw/buckets_balance.R` |

The `wsg_*` (singular, per-WSG functions from #168) vs `wsgs_*` (plural, collection-level orchestrators) distinction is now load-bearing in the naming. `compare_bcfishpass_wsg.R → wsg_compare.R` was renamed in #168.
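
As a rough illustration of the empty-`--schemas=` guard described in the bullets above, the sketch below shows one way a shell script can tell "flag absent" apart from "flag present but empty". It is not the `state_clean.sh` source: the function name `parse_schemas` and the printed `mode=` strings are hypothetical, chosen only for the example.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of strict --schemas= parsing (NOT the real state_clean.sh).
# Goal: an empty --schemas= must fail loudly instead of silently selecting
# the destructive default mode.
parse_schemas() {
  local arg schemas_set=0 schemas=""
  for arg in "$@"; do
    case "$arg" in
      --schemas=*)
        schemas_set=1                   # remember that the flag was passed at all
        schemas="${arg#--schemas=}"     # strip the flag prefix, keep the value
        ;;
    esac
  done

  # Flag present but empty: typically a dynamic argument that expanded to nothing.
  if [ "$schemas_set" -eq 1 ] && [ -z "$schemas" ]; then
    echo "error: --schemas= given with no schema names; refusing to continue" >&2
    return 2
  fi

  if [ "$schemas_set" -eq 1 ]; then
    echo "mode=scoped schemas=$schemas"
  else
    echo "mode=default"
  fi
}
```

Tracking "was the flag passed" separately from "what value did it carry" is what makes the guard possible; testing only for an empty value would treat `--schemas=` and no flag at all as the same case.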

Filed-but-not-closed follow-ups: cypher integration testing (issue #172 Phase 2 + 3 acceptance — defer until M4+M1 baseline lands repeatably); LPT-fallback empty-bucket edge case when N_WSGs ≤ N_hosts without timing CSVs (pre-existing, not a #172 regression).
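
The error-surface fix noted above for `wsgs_dispatch.sh` follows a common shell pattern, sketched here under stated assumptions: `compute_split` is a hypothetical stand-in for the `Rscript` call, its error text is invented for the example, and folding stderr into the capture with `2>&1` is one possible choice, not necessarily what the script does.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of surfacing a failed command substitution.
# compute_split simulates an R-side stop(): message on stderr, non-zero exit.
compute_split() {
  if [ "$1" = "BOGUS" ]; then
    echo "Error: unknown WSG code(s): BOGUS" >&2
    return 1
  fi
  echo "m4=$1"
}

run_split() {
  # Declare first, assign second: `local x=$(cmd)` would report the exit
  # status of `local`, hiding the failure of cmd.
  local split_out
  split_out=$(compute_split "$1" 2>&1) || {
    # Relay the captured message so the operator sees why the split failed.
    echo "split failed:" >&2
    echo "$split_out" >&2
    return 1
  }
  echo "$split_out"
}
```

With this shape, `run_split BOGUS` prints the underlying error and returns non-zero, while `run_split ADMS` prints the plan line as usual.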

# link 0.37.0

Decouple bcfp comparison from the modelling pipeline. Closes [#168](https://github.com/NewGraphEnvironment/link/issues/168). The link package's deliverable — the per-WSG model in `<persist_schema>.streams` + per-species habitat + barriers — now runs and is observable independently of any comparison framework. Comparison vs bcfishpass (or any future reference) is a diagnostic overlay that reads the persisted state and never gates whether the model itself ran.
2 changes: 1 addition & 1 deletion R/utils.R
@@ -193,7 +193,7 @@
#' Probes `<persist_schema>.streams` for `watershed_group_code = aoi`.
#' Returns `TRUE` when the table exists and has at least one row for
#' the WSG; `FALSE` when the table is absent OR has no rows for the
#' WSG. Used by the orchestrator loop (run_provincial_parity.R) as the
#' WSG. Used by the orchestrator loop (wsgs_run_host.R) as the
#' canonical resume check — PG state is authoritative, RDS files are
#' diagnostic side-artifacts.
#'
36 changes: 18 additions & 18 deletions data-raw/README.md
@@ -76,10 +76,10 @@ The dispatch hierarchy: trifecta → run_provincial → compare_wsg.

| Script | Calls | Purpose |
|--------|-------|---------|
| `trifecta_provincial.sh` | `run_provincial_parity.R` (×N hosts) | M4 + M1 + N-cypher orchestrator. Inline LPT bucket allocation (reads `_per_wsg_times.csv` from prior runs, computes balanced split using `--host-speeds=`), pre-flight version check across all hosts, parallel dispatch, RDS pull-back, post-pull `lnk_parity_annotate` against the divergence taxonomy. See "Provincial dispatch" section below for full flag reference + gotchas. |
| `wsgs_dispatch.sh` | `wsgs_run_host.R` (×N hosts) | M4 + M1 + N-cypher orchestrator. Inline LPT bucket allocation (reads `_per_wsg_times.csv` from prior runs, computes balanced split using `--host-speeds=`), pre-flight version check across all hosts, parallel dispatch, RDS pull-back, post-pull `lnk_parity_annotate` against the divergence taxonomy. See "Provincial dispatch" section below for full flag reference + gotchas. |
| `trifecta_15wsg.sh` | same | 15-WSG smoke variant (legacy 3-host, hardcoded WSG list). |
| `trifecta_smoke.sh` | `trifecta_provincial.sh` | N-host smoke shim: one small WSG per host, ~3 min wall. See `Provincial dispatch` section. |
| `run_provincial_parity.R` | `compare_bcfishpass_wsg.R` per WSG | Single-host provincial dispatcher. Loops every WSG in `wsg_species_presence`, saves per-WSG RDS, emits per-WSG times CSV. After the loop, optionally annotates the host's bucket against `research/bcfp_divergence_taxonomy.yml` (writes `<TS>_<host>_annotated.csv`). Accepts `--wsgs=`, `--config=`, `--schema=`, `--rds-dir=`, `--with-mapping-code`. |
| `trifecta_smoke.sh` | `wsgs_dispatch.sh` | N-host smoke shim: one small WSG per host, ~3 min wall. See `Provincial dispatch` section. |
| `wsgs_run_host.R` | `compare_bcfishpass_wsg.R` per WSG | Single-host provincial dispatcher. Loops every WSG in `wsg_species_presence`, saves per-WSG RDS, emits per-WSG times CSV. After the loop, optionally annotates the host's bucket against `research/bcfp_divergence_taxonomy.yml` (writes `<TS>_<host>_annotated.csv`). Accepts `--wsgs=`, `--config=`, `--schema=`, `--rds-dir=`, `--with-mapping-code`. |
| `compare_bcfishpass_wsg.R` | `lnk_pipeline_*` family | Single-WSG end-to-end runner. Sources both connections (local fwapg + bcfp tunnel), runs the 6-phase pipeline, persists, emits comparison rollup tibble (link vs bcfp). The atomic unit of work in every multi-WSG run above. |

## Pipeline support
@@ -88,14 +88,14 @@ Run-adjacent helpers (planning, consolidation across hosts).

| Script | Purpose |
|--------|---------|
| `balance_provincial_buckets.R` | Standalone LPT planner for the 3-host case. Reads per-host wall times from prior runs and prints buckets ready to paste into `trifecta_provincial.sh --m4-bucket=…`. **Superseded for the N-host orchestrator** — `trifecta_provincial.sh` now computes the LPT plan inline at dispatch time using the same algorithm. Kept here for one-off planning + cross-checks. Dedups `(wsg, host)` and across hosts before LPT so multi-run CSV accumulation doesn't double-assign WSGs. |
| `consolidate_schema.R` | pg_dump from M1 + cypher → scp to M4 → pg_restore --data-only. Bucket-aware destination cleanup (DELETEs each source host's WSG bucket from destination tables before restore — avoids duplicate-key violations on re-consolidation). `ok = TRUE` requires pg_restore rc=0 AND post-restore row count > 0; rc=0 with empty schema flags as failure. |
| `archive_provincial_runs.sh` | Moves the current top-level `_per_wsg_times.csv` + `*.rds` + `*_annotated.csv` artifacts in `provincial_<bundle>/` to `archive/<TS>/`. Operator cadence: run between provincial runs when you want the LPT planner to use the most recent run only. Skip to median-over multiple recent runs. |
| `trifecta_smoke.sh` | Thin shim over `trifecta_provincial.sh` — one small WSG per host (m4→DEAD, m1→ELKR, cyN→ADMS/BABL/BULL). ~3 min wall. Exercises every orchestrator code path (preflight, dispatch, tunnel, RDS pull-back, annotation) before committing to a 200-WSG run. All flags pass through (e.g. `--cy-workspaces=`, `--with-mapping-code`). |
| `buckets_balance.R` | Standalone LPT planner for the 3-host case. Reads per-host wall times from prior runs and prints buckets ready to paste into `wsgs_dispatch.sh --m4-bucket=…`. **Superseded for the N-host orchestrator** — `wsgs_dispatch.sh` now computes the LPT plan inline at dispatch time using the same algorithm. Kept here for one-off planning + cross-checks. Dedups `(wsg, host)` and across hosts before LPT so multi-run CSV accumulation doesn't double-assign WSGs. |
| `schema_consolidate.R` | pg_dump from M1 + cypher → scp to M4 → pg_restore --data-only. Bucket-aware destination cleanup (DELETEs each source host's WSG bucket from destination tables before restore — avoids duplicate-key violations on re-consolidation). `ok = TRUE` requires pg_restore rc=0 AND post-restore row count > 0; rc=0 with empty schema flags as failure. |
| `runs_archive.sh` | Moves the current top-level `_per_wsg_times.csv` + `*.rds` + `*_annotated.csv` artifacts in `provincial_<bundle>/` to `archive/<TS>/`. Operator cadence: run between provincial runs when you want the LPT planner to use the most recent run only. Skip to median-over multiple recent runs. |
| `trifecta_smoke.sh` | Thin shim over `wsgs_dispatch.sh` — one small WSG per host (m4→DEAD, m1→ELKR, cyN→ADMS/BABL/BULL). ~3 min wall. Exercises every orchestrator code path (preflight, dispatch, tunnel, RDS pull-back, annotation) before committing to a 200-WSG run. All flags pass through (e.g. `--cy-workspaces=`, `--with-mapping-code`). |
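
The LPT (longest-processing-time-first) allocation mentioned in the tables above can be illustrated with a small greedy sketch. This is not the planner used by `wsgs_dispatch.sh` or `buckets_balance.R` (the diff suggests the real split is computed in R); the function name `lpt_split`, the input format, and the reading of a host-speed factor as a multiplier on reference wall time are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical LPT sketch (NOT the wsgs_dispatch.sh implementation).
# stdin : lines of "<wsg> <reference_seconds>"
# $1    : host speeds as "name=factor,..." (lower = faster), shaped like --host-speeds=
# stdout: one line per host: "<host> <comma-separated bucket> <projected load>"
lpt_split() {
  local speeds="$1"
  # Longest jobs first; ties broken by WSG code for stable output.
  sort -k2,2nr -k1,1 | awk -v speeds="$speeds" '
    BEGIN {
      n = split(speeds, pairs, ",")
      for (i = 1; i <= n; i++) {
        split(pairs[i], kv, "=")
        host[i] = kv[1]; factor[i] = kv[2] + 0; load[i] = 0
      }
    }
    {
      # Assign the job to the host that would finish it earliest.
      best = 1
      for (i = 2; i <= n; i++)
        if (load[i] + $2 * factor[i] < load[best] + $2 * factor[best]) best = i
      load[best] += $2 * factor[best]
      bucket[best] = (bucket[best] == "" ? $1 : bucket[best] "," $1)
    }
    END {
      for (i = 1; i <= n; i++) printf "%s %s %.1f\n", host[i], bucket[i], load[i]
    }'
}
```

For example, `printf 'AAAA 100\nBBBB 60\nCCCC 50\nDDDD 10\n' | lpt_split "m4=1.0,m1=1.0"` places the largest job alone first and then fills the lighter host, ending with both hosts at a projected load of 110.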

## Provincial dispatch (`trifecta_provincial.sh`)
## Provincial dispatch (`wsgs_dispatch.sh`)

The flagship orchestrator. Dispatches `run_provincial_parity.R` across
The flagship orchestrator. Dispatches `wsgs_run_host.R` across
M4 + M1 + N cyphers in parallel, pulls RDS files back, and emits a
province-wide annotated CSV.

@@ -106,22 +106,22 @@ cd ~/Projects/repo/link/data-raw

# Optional: archive prior run's CSVs first if you want LPT to plan
# against this run only (not median-of-recent-runs):
./archive_provincial_runs.sh
./runs_archive.sh

# Smoke-test first (~3 min, one small WSG per host) — catches preflight,
# tunnel, dispatch, and annotation surprises before the full run:
./trifecta_smoke.sh # 3-host smoke
./trifecta_smoke.sh --cy-workspaces=job1,job2,job3 # 5-host smoke

# Full run:
./trifecta_provincial.sh # 3-host default
./trifecta_provincial.sh --cy-workspaces=job1,job2,job3 # 5-host
./wsgs_dispatch.sh # 3-host default
./wsgs_dispatch.sh --cy-workspaces=job1,job2,job3 # 5-host

# Add per-segment mapping_code lens (+50% cost):
./trifecta_provincial.sh --cy-workspaces=job1,job2,job3 --with-mapping-code
./wsgs_dispatch.sh --cy-workspaces=job1,job2,job3 --with-mapping-code

# Custom host-speed factors (lower = faster):
./trifecta_provincial.sh --host-speeds=m4=1.0,m1=0.83,cy=1.83
./wsgs_dispatch.sh --host-speeds=m4=1.0,m1=0.83,cy=1.83
```

**Recommended cadence:** archive → smoke → full run. The smoke catches
@@ -240,7 +240,7 @@ Research scratchpad and one-off verification scripts.

| Script | Purpose |
|--------|---------|
| `_targets.R` | The (legacy) targets pipeline that pre-dated `trifecta_provincial.sh`. Kept for the multi-WSG comparison harness. |
| `_targets.R` | The (legacy) targets pipeline that pre-dated `wsgs_dispatch.sh`. Kept for the multi-WSG comparison harness. |
| `exp_gradient_extra_breaks.R` | Experimental script that prototyped the orphan-class break source via in-line `frs_break_apply` before it was absorbed into `lnk_pipeline_prep_minimal()` (link v0.28.0). Kept as the smoke-test reference. |
| `rule_flexibility_demo.R` / `rule_flexibility_render.R` | Demonstration of rules.yaml format flexibility, rendered as RMarkdown. |
| `regress_dams_isolation.R` | One-off regression test for the dams-isolation work (link #109). |
@@ -266,9 +266,9 @@ discoverable without opening the file:

Run artifacts land in subdirectories of `data-raw/logs/` keyed by topic:

- `provincial_parity/`, `provincial_default/`, `provincial_default_extrabreaks/` — per-WSG RDS + per-WSG times CSV from `run_provincial_parity.R`.
- `provincial_parity/`, `provincial_default/`, `provincial_default_extrabreaks/` — per-WSG RDS + per-WSG times CSV from `wsgs_run_host.R`.
- `methodology_delta/` — schema-vs-schema delta RDS from `methodology_delta_query.R`.
- `dumps_<schema>/` — pg_dump custom-format files from `consolidate_schema.R`.
- `dumps_<schema>/` — pg_dump custom-format files from `schema_consolidate.R`.
- `<TS>_*.txt` — orchestrator + per-host run logs.

Reusable helper scripts read from these directories without hardcoded
@@ -290,7 +290,7 @@ single bundle's persistent schema. Rough footprints on a 232-WSG run:
The 2026-05-04 cypher disk-full incident filled a 96 GB droplet with
3 accumulated bundles + 60 working schemas at once. After the v0.29.0
hygiene fixes (`compare_bcfishpass_wsg(cleanup_working = TRUE)` drops
working schemas on completion; `consolidate_schema(keep_source = FALSE)`
working schemas on completion; `schema_consolidate(keep_source = FALSE)`
drops source persistent schema after successful pg_restore), a
single-bundle-in-flight worker holds ~60 GB total — comfortable on the
existing 96 GB cypher tier.
@@ -9,7 +9,7 @@
# cypher caught up. This script projects ~10-15 min savings per run.
#
# Usage:
# Rscript data-raw/balance_provincial_buckets.R
# Rscript data-raw/buckets_balance.R
#
# Hardcoded inputs (override above when re-running with new baseline):
# - Yesterday's host log paths (one per host)
@@ -23,16 +23,16 @@ suppressPackageStartupMessages({})

logs_dir <- "/Users/airvine/Projects/repo/link/data-raw/logs"

# Prefer per-WSG CSVs (emitted by run_provincial_parity.R after this script
# Prefer per-WSG CSVs (emitted by wsgs_run_host.R after this script
# was added). Fall back to text-log regex parsing for older runs.
csvs <- list.files(file.path(logs_dir, "provincial_parity"),
pattern = "_per_wsg_times\\.csv$", full.names = TRUE)
csvs <- c(csvs, list.files(file.path(logs_dir, "provincial_default"),
pattern = "_per_wsg_times\\.csv$", full.names = TRUE))
if (length(csvs) == 0) {
m4_log <- file.path(logs_dir, "202605031423_trifecta_provincial_m4.txt")
m1_log <- file.path(logs_dir, "202605031423_trifecta_provincial_m1.txt")
cy_log <- "/Users/airvine/Projects/repo/rtj/scripts/cypher/logs/202605031423_cypher-run_202605031423_trifecta_provincial_cypher.txt"
m4_log <- file.path(logs_dir, "202605031423_wsgs_dispatch_m4.txt")
m1_log <- file.path(logs_dir, "202605031423_wsgs_dispatch_m1.txt")
cy_log <- "/Users/airvine/Projects/repo/rtj/scripts/cypher/logs/202605031423_cypher-run_202605031423_wsgs_dispatch_cypher.txt"
}

# Host speed factors (M4 = reference). Use yesterday's per-host MEAN
12 changes: 6 additions & 6 deletions data-raw/province_progress.sh → data-raw/progress_check.sh
@@ -1,5 +1,5 @@
#!/usr/bin/env bash
# province_progress.sh — report live progress of an in-flight provincial dispatch.
# progress_check.sh — report live progress of an in-flight provincial dispatch.
#
# Reads each host's newest `_per_wsg_times.csv` (by mtime, NOT by date glob —
# cypher logs use UTC, M4/M1 use local TZ; date-globbing across hosts breaks
@@ -9,9 +9,9 @@
# plus a sample of the most recent completions to see what each host is on.
#
# Usage:
# bash data-raw/province_progress.sh [--cy-workspaces=job1,job2,job3] [--mtime-min=120]
# bash data-raw/progress_check.sh [--cy-workspaces=job1,job2,job3] [--mtime-min=120]
#
# Honors /tmp/cy_ips.env if present (set by trifecta_provincial.sh dispatch).
# Honors /tmp/cy_ips.env if present (set by wsgs_dispatch.sh dispatch).
# Otherwise derives cypher IPs from tofu state per workspace.
#
# --mtime-min=N : only consider CSVs modified in the last N minutes (default 240
@@ -85,11 +85,11 @@ done

echo
# Orchestrator-side state
if pgrep -f "trifecta_provincial.sh" >/dev/null 2>&1; then
PID=$(pgrep -f "trifecta_provincial.sh" | head -1)
if pgrep -f "wsgs_dispatch.sh" >/dev/null 2>&1; then
PID=$(pgrep -f "wsgs_dispatch.sh" | head -1)
echo "dispatch process: ✓ PID=$PID (running)"
# Find latest orchestrator log to extract bucket sizes if possible
LATEST_ORCH=$(find ~/Projects/repo/link/data-raw/logs -maxdepth 1 -name '*_trifecta_provincial_orchestrator.txt' -mmin -$MTIME_MIN | sort | tail -1)
LATEST_ORCH=$(find ~/Projects/repo/link/data-raw/logs -maxdepth 1 -name '*_wsgs_dispatch_orchestrator.txt' -mmin -$MTIME_MIN | sort | tail -1)
if [ -n "$LATEST_ORCH" ]; then
echo "orchestrator log: $LATEST_ORCH"
grep -E "total WSGs|projected finish" "$LATEST_ORCH" 2>/dev/null | head -2
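
The header comment of `progress_check.sh` explains why the newest `_per_wsg_times.csv` is picked by mtime and not by a date in the file name. A minimal sketch of that selection, assuming nothing about the script's actual code, is shown below; the helper name `newest_csv` and the example file names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT taken from progress_check.sh): print the most
# recently modified *_per_wsg_times.csv in a directory, ignoring any
# timestamp embedded in the file name.
newest_csv() {
  local dir="$1" newest="" f
  for f in "$dir"/*_per_wsg_times.csv; do
    [ -e "$f" ] || continue            # glob matched nothing
    # -nt compares modification times, so a name stamped in another
    # time zone cannot reorder the result.
    if [ -z "$newest" ] || [ "$f" -nt "$newest" ]; then
      newest="$f"
    fi
  done
  # Returns non-zero (and prints nothing) when no CSV exists.
  [ -n "$newest" ] && printf '%s\n' "$newest"
}
```

A file whose name sorts earlier but which was written later still wins, which is the behaviour the comment asks for when hosts stamp their logs in different time zones.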