_targets.R pipeline + lnk_pipeline_species + reproducibility bar (PR 2 of 3)#42
Merged
NewGraphEnvironment merged 1 commit intomainfrom Apr 23, 2026
Merged
Conversation
## What lands
- `data-raw/_targets.R` — `tar_map(wsg = c("ADMS", "BULK", "BABL", "ELKR"))` over the per-AOI target, synchronous execution, `dplyr::bind_rows` rollup.
- `data-raw/compare_bcfishpass_wsg(wsg, config)` — per-AOI target function. Runs the six `lnk_pipeline_*` phases, diffs against `bcfishpass.habitat_linear_*` on the tunnel DB, returns a ~10-row tibble.
- `R/lnk_pipeline_species.R` — promotes the former `.lnk_pipeline_classify_species` private helper to an exported member of the `lnk_pipeline_*` family. classify + connect now call `lnk_pipeline_species`; the inlined `.wsg_species_present` helper in data-raw/ is deleted.
## Correctness framing
Reframed from "within 5% of bcfishpass" to **bit-identical output from the same inputs**. The 5% comparison stays as parity diagnostics. Added a section to CLAUDE.md ("Correctness bar: exact reproduction") and a memory entry. Three consecutive `tar_make()` runs on the same DB state produced the exact same 34-row rollup tibble — reproducibility proven.
## Fixes caught by code-check
- `data-raw/compare_bcfishpass_wsg.R` — connection leak: second `dbConnect` could throw before `on.exit` registered the first. Now each `dbConnect` registers its own cleanup immediately after succeeding.
- Tunnel password fallback is now a hard stop with a pointer at `~/.Renviron` rather than a silent empty-string connect that fails with an opaque auth error later.
- Species list in the `ours` SQL now goes through `DBI::dbQuoteLiteral` (matching the already-quoted `wsg` and `sp` in the reference queries). Consistent with the rest of the function; forward-proof against a config that includes a species code with a quote.
## Dependencies
- `targets`, `tarchetypes`, `tibble`, `dplyr` added to DESCRIPTION Suggests.
- `crew` dropped — crew_controller_local hung on dispatched-never-complete behavior. Synchronous execution is correct anyway while `fresh.streams` is a shared schema (documented in findings.md).
## Verification
Three full `tar_make()` runs on 4 WSGs, each ~8.5 min wall clock:
- `data-raw/logs/20260422_10_tar_make_from_dataraw.txt`
- `data-raw/logs/20260422_11_tar_make_final.txt`
- `data-raw/logs/20260422_12_tar_make_after_fixes.txt`
All three produce bit-identical rollup tibbles. `diff_pct` vs bcfishpass is within 5% for all 34 rows (informational only).
Version bumped to 0.4.0.
Relates to #38
NewGraphEnvironment
added a commit
that referenced
this pull request
Apr 23, 2026
…) (#47) * Vignette: fwapg prerequisites, user-definite barriers break bullet Vignette additions: - New "Prerequisites" section names fwapg as the source of the stream-network tables (ltree-typed watershed codes) and the traversal SQL functions (fwa_upstream, fwa_downstream, fwa_watershedatmeasure) the pipeline reads. bcfishobs marked as optional-but-recommended for observation overrides. Comparison tunnel called out as a validation convenience, not a runtime requirement. - DAG first line changed from "FWA streams (raw)" to "FWA stream network (via fwapg, ltree-enriched)" to match the prerequisites note. - Observations bullet now explicitly says overrides apply to gradient barriers, falls, AND user-definite barriers — plus an inline link to fwa_upstream since that's the SQL function driving the count. - New "User-identified definite barriers" bullet in the break-positions list, with inline links to user_barriers_definite.csv in bcfishpass and the mirror in link. Treated the same as falls: always-blocking, always a break position, eligible for per-species override via lnk_barrier_overrides. * Archive #38 PWF, init #44 PWF for barriers_definite_control wiring - planning/archive/2026-04-23-targets-pipeline/ — three-PR arc closed (PRs #41/#42/#43 shipping 0.3.0 → 0.5.0). Bit-identical rollups across three consecutive tar_make runs. - planning/active/ — new PWF for #44. Approach per /Users/airvine/.claude/plans/stateful-hopping-feather.md: fix latent ctrl_filter bug in lnk_barrier_overrides (current filter treats any control row as blocking; docstring says only barrier_ind = TRUE blocks) + wire control through from .lnk_pipeline_prep_overrides via manifest-driven gating (cfg$overrides$barriers_definite_control, not information_schema probe). Follow-up issue scoped to migrate remaining probes to manifest-driven gating. Relates to #44 * Fix lnk_barrier_overrides control filter semantics The `control` parameter's documented contract is "barriers in this table with `barrier_ind = TRUE` cannot be overridden." Previous implementation used `LEFT JOIN control c + WHERE c.blue_line_key IS NULL`, which treated ANY control row as blocking — including `barrier_ind = FALSE` rows (which mark "not actually a barrier" positions, not "do not override" positions). Replaces the LEFT-JOIN/IS-NULL pattern with a `NOT EXISTS` subquery filtered to `barrier_ind::boolean = true`. Fix: - Correctly blocks override only when at least one matching control row has `barrier_ind = TRUE`. Mixed TRUE/FALSE within the 1 m position tolerance resolves to "blocked" (TRUE wins). - Removes a latent row-multiplication issue in the observation-count aggregation: the prior LEFT JOIN could multiply rows before `GROUP BY ... HAVING count(observation_key) >= threshold`, under-counting the threshold when multiple control rows matched one barrier. `NOT EXISTS` is a WHERE-clause subquery — no multiplication. Applies identically to the observation and habitat override paths (same `ctrl_where` / `ctrl_filter` pair used in both INSERTs). `.lnk_pipeline_prep_overrides` still calls `lnk_barrier_overrides()` with `control = NULL` — behavioural wiring lands in Phase 2 of this PR. New `tests/testthat/test-lnk_barrier_overrides.R` — 11 mocked SQL tests covering: - NOT EXISTS + `= true` clauses appear when `control` is non-NULL (observation path) - Same when habitat path is taken (observation path disabled) - Neither clause appears when `control = NULL` Full suite: 269 PASS. Relates to #44 * Wire barriers_definite_control through prep_overrides (manifest-gated) `.lnk_pipeline_prep_overrides` now passes `<schema>.barriers_definite_control` to `lnk_barrier_overrides` as the `control` argument whenever `cfg$overrides$barriers_definite_control` is declared by the config manifest. The manifest key is the direct contract — same pattern used for the other override roles in the pipeline. Phase 1's NOT EXISTS filter is what honours it at the SQL layer. Also fixes an asymmetric-gating bug caught in code review: `.lnk_pipeline_prep_load_aux` previously only wrote `<schema>.barriers_definite_control` when both the manifest declared the key AND the current AOI had matching rows. With Phase 2's manifest-driven gate, that meant AOIs where the manifest was declared but no rows matched would reference a table that was never created, and `lnk_barrier_overrides`'s NOT EXISTS subquery would raise "relation does not exist." Mirrored the `barriers_definite` pattern — whenever the manifest declares the key, write a schema-valid table (empty or populated). The NOT EXISTS against an empty table is always TRUE, so a zero-row AOI correctly blocks nothing. Two new tests in `test-lnk_pipeline_prepare.R` mock `lnk_barrier_overrides` and assert the resolved `control` arg: - manifest declares key → `control = "<schema>.barriers_definite_control"` - manifest omits key → `control = NULL` Full suite: 271 PASS. End-to-end verification via `tar_make()` follows in the next commits (rollup regeneration, research doc + NEWS updates, vignette artifact regen). Relates to #44 * Gate barriers_definite_control per-species via observation_control_apply Phase 2 applied the NOT EXISTS control filter across every species in the `params` loop. A post-Phase-2 `tar_make()` drifted 11–22pp *away* from bcfishpass on ADMS and BABL (BULK and ELKR unchanged — their TRUE control rows have no upstream observations). Root cause: bcfishpass applies the control filter only in `model_access_ch_cm_co_pk_sk.sql` and `model_access_st.sql`. The BT, WCT, and CT/DV/RB access models don't reference `user_barriers_definite_control` at all. Their observations are allowed to override anadromous-blocking falls because residents routinely inhabit reaches upstream of such falls — post-glacial headwater connectivity populated many upper basins before the present channel dropped below its late-Pleistocene profile, and residents don't require ocean return. Species scope as a parameter, not a hard-coded list: - New column `observation_control_apply` in `inst/extdata/configs/bcfishpass/parameters_fresh.csv`. Logical. TRUE for CH, CM, CO, PK, SK, ST; FALSE for BT, WCT; NA for CT, DV, RB (which have no `observation_threshold` either — the flag is inapplicable). - `lnk_barrier_overrides()` reads `params$observation_control_apply[i]` inside the per-species loop. `isTRUE(as.logical(...))` normalises NA, missing-column, character, and unexpected inputs to FALSE — the resident-safe default. When FALSE, the NOT EXISTS clause is omitted from both the observation and habitat override paths. Two concerns, two locations: `cfg$overrides$barriers_definite_control` remains the table-level contract (is the control CSV declared for this config?); the new column is the application-level contract (does this species honour it?). Rules YAML stays focused on habitat classification. Tests (test-lnk_barrier_overrides.R, +3 cases): - `observation_control_apply = FALSE` → no NOT EXISTS / no `c.barrier_ind::boolean` clause in the rendered SQL. - `observation_control_apply = NA` → same — resident-safe default. - Mixed params (BT = FALSE, CH = TRUE) → per-species gating confirmed by inspecting the two emitted INSERT statements. Full suite: 279 PASS. Relates to #44 * Ungate habitat override path from control (bcfishpass parity) Phase 2a gated the control filter per-species but left a second defect: `ctrl_filter` was applied to BOTH the observation-path INSERT and the habitat-path INSERT in `lnk_barrier_overrides()`. bcfishpass's `model_access_ch_cm_co_pk_sk.sql` has separate CTEs: obs_upstr — joins observations, LEFT JOINs control, filters `bc.barrier_ind IS NULL` (control-gated) hab_upstr — joins habitat only, no control join at all The biology: expert-confirmed habitat is higher-trust than observations. By the time a reviewer has confirmed upstream habitat for a species at a given position, they have already considered the barrier's passability and the control-table designation. Observations are noisier and may be misattributed, so the control table vetoes them; habitat is a direct assertion that a species uses the reach upstream, so it stands. The old behaviour under-overrode bcfishpass: on ADMS and BABL, TRUE control positions that bcfishpass overrode via habitat stayed as barriers in link, cutting 11-22pp of accessible spawning/rearing km on CH, CM, CO, PK, SK, ST. Removing the habitat-path gate brings all four WSGs back into parity. Changes: - `lnk_barrier_overrides()`: the habitat-path INSERT no longer interpolates `ctrl_where` / `ctrl_filter`. The observation-path INSERT is unchanged (still gated by `observation_control_apply` per species). - Flipped the existing test "control filter applies to habitat overrides too" to its corrected form: "habitat override path is NOT gated by control (bcfishpass parity)". Filters captured SQL to the habitat INSERT only and asserts absence of the NOT EXISTS clause. - Roxygen on `@param control` documents that habitat confirmations bypass the control table entirely. `devtools::test()`: 279 PASS. `tar_make()` across ADMS, BULK, BABL, ELKR — all 34 rollup rows within 5% of bcfishpass reference. Exact numeric match to pre-fix baseline (no behaviour change vs pre-fix on these 4 WSGs, because habitat classifications already covered the control-table positions). Filter is now correctly applied semantically, not relying on coincidence with the habitat path. Relates to #44 * comms(→link): M1 verified as a ready R-worker host; crew.cluster 0.4.0 API gap 7/7 checks pass on M1, 1.1s SSH+Rscript round-trip, NG packages load, .Renviron propagates. Includes a heads-up that crew.cluster 0.4.0 only exports HPC-scheduler controllers — no generic ssh one exists, so PR 3-of-3's launcher needs one of custom crew_class_launcher, clustermq, or bespoke mirai+ssh. Landing on 44-barriers-definite-control per soul 2026-04-23 branch-landing ruling (Policy A, commit on peer's current branch with --only flag). * Add DEAD as the end-to-end test WSG for barriers_definite_control Phase 2b rollup was numerically identical to pre-fix on all four parity WSGs (ADMS/BULK/BABL/ELKR) because none of their TRUE control rows actually exercise the filter. Every row is rescued by either the observation threshold or the habitat path. The filter semantics were verified at the unit level but not end-to-end on real data. Province-wide hunt for TRUE control rows where the filter actually fires (obs >= threshold upstream AND zero habitat coverage) produced four candidates: CAMB (11 obs), DEAD (6), LFRA (16 but too large), SALM (7). Picked DEAD (Deadman River): smallest runtime, single TRUE control row at FALLS (356361749, 45743), six CH/CM/CO/PK/SK obs upstream, zero habitat classification for those species upstream. The tight "just above threshold" condition means pre-fix link would have overridden the fall (fall skipped, upstream accessible); post-fix link correctly blocks the override on anadromous species. bcfishpass reference: the fall at (356361749, 45743) IS present in bcfishpass.barriers_ch_cm_co_pk_sk post-override. bcfishpass kept it as a barrier via the control filter. link now matches. Direct verification on working_dead.barrier_overrides: - At the control position, exactly one species row is emitted: BT. observation_control_apply = FALSE for BT, so the NOT EXISTS clause was skipped and the observation count (>= threshold) produced the override. Matches bcfishpass BT model (no control join). - CH, CM, CO, PK, SK, ST all blocked at that position by the filter. Matches bcfishpass ch_cm_co_pk_sk and st models (control join present). DEAD rollup: all six species within 3% of bcfishpass reference. Incremental tar_make: comparison_ADMS/BULK/BABL/ELKR cached from the Phase 2b run, only comparison_DEAD + rollup rebuilt (42s). Log files captured: - 20260423_01_tar_make_post_44_phase2.txt (Phase 2, pre per-species gate) - 20260423_02_tar_make_phase2a.txt (Phase 2a, species gate, still bad on CH/CO/SK/ST) - 20260423_03_tar_make_phase2b.txt (Phase 2b, habitat path ungated, baseline rollup) - 20260423_04_tar_make_repro.txt (Phase 2b reproducibility) - 20260423_05_tar_make_dead.txt (incremental with DEAD added) Relates to #44 * Add tar_make run logs for #44 phases 2, 2a, 2b, repro, DEAD Evidence trail for the barriers_definite_control investigation. Referenced from planning/active/task_plan.md. Relates to #44 * Bump to 0.6.0: NEWS, DESCRIPTION, research doc, vignette, artifacts Phase 4 of #44. Numerical artifacts regenerated from the 5-WSG post-DEAD pipeline; narrative updates document the control-filter wiring and the split between parity WSGs and the end-to-end test WSG. - DESCRIPTION: Version 0.5.0 -> 0.6.0 - NEWS.md: 0.6.0 entry covering the control-filter wiring, per-species gating via observation_control_apply, habitat-path bypass, manifest- gated pipeline wiring, asymmetric-gating fix in prep_load_aux, and DEAD added as the end-to-end validation WSG. - inst/extdata/configs/bcfishpass/README.md: 5 WSGs; note that DEAD is the control-filter end-to-end test, the other four are parity. - inst/extdata/vignette-data/*.rds: regenerated from data-raw/ vignette_reproducing_bcfishpass.R against the current tar_make rollup (46 rows, bit-identical across two full rebuilds; digest 210c3f8254c47ac88573a80d96a2701e). - research/bcfishpass_comparison.md: adds the DEAD parity table, a section explaining why DEAD is the filter test WSG, a row in the "Key fixes" table, and a subsection describing the three-part fix (observation-path NOT EXISTS, per-species gate, habitat-path bypass). DAG rollup node updated to 46 rows; tar_map includes DEAD. - vignettes/reproducing-bcfishpass.Rmd: 5 WSGs; names DEAD's role; pivot tables include DEAD column via intersect() on names(w). Reproducibility verified: two consecutive `targets::tar_destroy(ask = FALSE); targets::tar_make()` runs on the same DB state produce bit-identical rollups (same SHA256 via digest::digest()). Relates to #44 * PWF: update progress for #44 Phase 4 + follow-up issue #46 + PR #47 Relates to #44
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
PR 2 of 3 for issue #38. Layers a targets-driven pipeline on top of the six
lnk_pipeline_*phase helpers shipped in PR 1, runs the bcfishpass comparison across all four validated watershed groups, and proves the pipeline is deterministic.What lands
data-raw/_targets.R— pipeline definition.tar_map(wsg = c("ADMS", "BULK", "BABL", "ELKR"))over a per-AOI target function, synchronous execution,dplyr::bind_rowsrollup.data-raw/compare_bcfishpass_wsg(wsg, config)— per-AOI target function. Runs the sixlnk_pipeline_*phases in order, queriesbcfishpass.habitat_linear_*reference tables over the tunnel, returns a ~10-row tibble (wsg × species × habitat_type × link_km × bcfishpass_km × diff_pct). KB-scale return — safe to ship over SSH when we go distributed.lnk_pipeline_species(cfg, aoi)— promoted from the former private.lnk_pipeline_classify_species. Canonical public helper for "species this config classifies in this AOI." Internals oflnk_pipeline_classify+lnk_pipeline_connectnow use it, and the duplicate inlined.wsg_species_presenthelper indata-raw/is deleted.Correctness bar: exact reproduction
This PR reframes verification. The standard is bit-identical output from the same inputs, not "within 5% of bcfishpass." Parity to bcfishpass stays as a diagnostic column (
diff_pct) but it's informational, not pass/fail. Documented inCLAUDE.md("Correctness bar: exact reproduction") and saved to memory so future sessions keep the framing straight.Three successive
tar_make()runs on the same fwapg + bcfishobs state produced bit-identical 34-row rollup tibbles:data-raw/logs/20260422_10_tar_make_from_dataraw.txtdata-raw/logs/20260422_11_tar_make_final.txtdata-raw/logs/20260422_12_tar_make_after_fixes.txtEach full pipeline runs in ~8.5 min wall clock (serial, by design — see below).
Why synchronous, not crew
fresh.streamsis a shared schema — parallel workers on one host race their base-segments builds. We already knew we neededworkers = 1from PR 1's code-check. On top of that,crew_controller_localexhibited a "dispatched but never completed" behavior on this workload: the parent R process would exit after dispatching the first target to the crew worker, and the worker would run invisibly. Removed crew entirely for this PR. Synchronous execution is correct for the shared-schema constraint and straightforward to reason about. Distributed M4+M1 runs remain a follow-up once fresh supports a per-AOI output path (seeplanning/active/findings.md).Code-check findings, fixed
dbConnectto the tunnel could throw beforeon.exitregistered the local conn's cleanup. Now eachdbConnectregisters its own cleanup immediately after succeeding.PG_PASS_SHAREwould get an opaque auth error. Now stops with a clear "set it in ~/.Renviron" message.oursquery was hand-rolled single-quoting while thewsgparameter usedDBI::dbQuoteLiteral. Made consistent — species now goes throughdbQuoteLiteraltoo.Re-verified (run 12) after fixes: rollup identical to runs 10 + 11.
Test plan
lnk_pipeline_species(input validation, intersect semantics, fallback whenwsg_speciesmissing, fallback whencfg$speciesmissing, live bundle with real config)tar_make()end-to-end runs — bit-identical rollups across all three/code-checkon staged diff before commit — all findings addressedDependencies
targets,tarchetypes,tibble,dplyrcrew(not used)Version
Bumped to 0.4.0 (new exported function is a minor bump per semver).
What's in PR 3
tar_mermaid()so it stays in sync with_targets.Rdata-raw/compare_bcfishpass.R(legacy orchestration script) — PR 2 kept it alongside the new targets pipeline as a defensive reference. PR 3 retires it.Fixes part of #38 (PR 2 of 3)
Relates to NewGraphEnvironment/sred-2025-2026#24