Problem
Today we have three disjoint sources of "why does this WSG-species-metric diverge from bcfp":
research/provincial_parity_2026_05_01.md (Class A/B/C/D taxonomy, narrative)
research/bcfp_compare_mapping_code.md (Phase A per-segment, 4-WSG sample)
- ~12 PR bodies + issue threads + commit messages (closing each class)
Every time a divergence surfaces during a provincial run, we manually re-derive whether it's already known. The next-WSG investigation often duplicates the BBAR/THOM trace from #158 just to find the same answer.
There is no single artifact a future-us (or anyone) can look at and say: "this BBAR CH +12% rearing — what class is it, what's known, where's the next step." We rediscover.
Goal
One CSV per provincial run: every (wsg, species, metric) row annotated against a canonical taxonomy. Zero rows ≥2% divergence end up "unexplained" — either they map to a known class with a citation, or they're flagged "needs investigation" with a clear hand-off.
Output shape
data-raw/logs/provincial_parity/<TS>_annotated.csv:
wsg, species, metric, unit,
link_value, bcfp_value, diff_pct, # rollup lens
mc_match_pct, mc_n_diffs, mc_top_pattern, # mapping_code lens (NULL if not run)
class, # A | B | C | D | MEASUREMENT_ASYMMETRY |
# TOKEN2_RESIDUAL | INTENTIONAL | UNEXPLAINED
mechanism_ref, # link#158, fresh#158, research/<doc>.md, ...
status, # CLOSED | INTENTIONAL | OPEN | NEEDS_INVESTIGATION
notes
Single source of truth
research/bcfp_divergence_taxonomy.yml — a keyed lookup of every known divergence pattern. Each PR / issue that surfaces a new pattern updates THIS file. The annotation script joins provincial-run output to it.
Example entries:
SETN:
species: all-anadromous
metric: rearing_stream
pattern: link > bcfp by 50-200%
class: A
mechanism: bcfp barriers_subsurfaceflow stale
refs: [research/provincial_parity_2026_05_01.md#class-a]
status: INTENTIONAL # link correct; awaiting upstream refresh
HORS:
species: BT
metric: rearing_stream
pattern: link < bcfp by 5-9%
class: B
mechanism: fresh#158 stream_order_parent rear bypass not implemented
refs: [fresh#158, research/dimensions_audit.md]
status: INTENTIONAL # methodology choice
"*":
species: "*"
metric: lake_rearing | wetland_rearing
pattern: link == 0, bcfp > 0 (diff_pct == -100)
class: MEASUREMENT_ASYMMETRY
mechanism: link credits centerline km vs bcfp polygon ha
refs: [research/default_vs_bcfishpass.md]
status: INTENTIONAL
Scope — what gets built
-
R/lnk_compare_wsg.R — NEW exported function. Signature: lnk_compare_wsg(conn, aoi, cfg, loaded, reference = "bcfishpass", with_mapping_code = FALSE). Per-WSG convenience wrapper around the lnk_pipeline_* phases + reference queries. Returns a list with rollup (linear sums per species) AND optionally mapping_code (per-species segment match stats). The reference arg leaves room for future non-bcfp comparisons (default-bundle parity, regression detection, federal sources). Initial implementation handles reference = "bcfishpass" only.
-
data-raw/compare_bcfishpass_wsg.R — refactor. Becomes a thin per-WSG orchestrator that calls lnk_compare_wsg(reference = "bcfishpass") and persists the RDS. New optional arg --with-mapping-code.
-
data-raw/compare_bcfp_mapping_code.R — DELETE. Its logic moves into lnk_compare_wsg() (gated by with_mapping_code = TRUE).
-
research/bcfp_divergence_taxonomy.yml — NEW. Initial population captures every Class A/B/C/D row from research/provincial_parity_2026_05_01.md plus MEASUREMENT_ASYMMETRY + TOKEN2_RESIDUAL patterns from research/provincial_parity_2026_05_11.md.
-
data-raw/annotate_provincial_parity.R — NEW. Reads data-raw/logs/provincial_parity/<TS>*.rds + the taxonomy, emits the annotated CSV.
-
data-raw/trifecta_provincial.sh — extended. New --with-mapping-code flag (default off for fast runs). Pre-flight check that all 5 hosts (3 cyphers + M4 + M1) are ready. Support --workspace job1,job2,job3 style for the 3-cypher fan-out.
Cleanups bundled
data-raw/balance_provincial_buckets.R — dedup bug (yesterday I worked around by writing inline LPT; should land in the canonical script)
data-raw/consolidate_schema.R — ok=FALSE false-positive on cypher restores; also need bucket-aware destination cleanup so streams_habitat_<sp> pg_restore doesn't dup-key collide. See 2026-05-12 addendum in research/provincial_parity_2026_05_11.md.
- Stale
_per_wsg_times.csv discovery pattern — currently picks up old logs from 2026-05-08 unless you manually rename. Move to an archive convention so only the latest run feeds the LPT.
Out of scope (separate issues if they come up)
- fresh-side changes (fresh#158 stream-order bypass, fresh#190/191 SK new-geographies) — bake them into the taxonomy with
status: INTENTIONAL_FRESH_DEFERRED for now.
- Cypher infra (rtj#129 closed; if a different infra issue surfaces, file separately in rtj).
- Default bundle parity — same machinery should work but defer to a follow-up after bcfishpass bundle reaches "zero unexplained".
Acceptance
Problem
Today we have three disjoint sources of "why does this WSG-species-metric diverge from bcfp":
research/provincial_parity_2026_05_01.md(Class A/B/C/D taxonomy, narrative)research/bcfp_compare_mapping_code.md(Phase A per-segment, 4-WSG sample)Every time a divergence surfaces during a provincial run, we manually re-derive whether it's already known. The next-WSG investigation often duplicates the BBAR/THOM trace from #158 just to find the same answer.
There is no single artifact a future-us (or anyone) can look at and say: "this BBAR CH +12% rearing — what class is it, what's known, where's the next step." We rediscover.
Goal
One CSV per provincial run: every (wsg, species, metric) row annotated against a canonical taxonomy. Zero rows ≥2% divergence end up "unexplained" — either they map to a known class with a citation, or they're flagged "needs investigation" with a clear hand-off.
Output shape
data-raw/logs/provincial_parity/<TS>_annotated.csv:Single source of truth
research/bcfp_divergence_taxonomy.yml— a keyed lookup of every known divergence pattern. Each PR / issue that surfaces a new pattern updates THIS file. The annotation script joins provincial-run output to it.Example entries:
Scope — what gets built
R/lnk_compare_wsg.R— NEW exported function. Signature:lnk_compare_wsg(conn, aoi, cfg, loaded, reference = "bcfishpass", with_mapping_code = FALSE). Per-WSG convenience wrapper around thelnk_pipeline_*phases + reference queries. Returns a list withrollup(linear sums per species) AND optionallymapping_code(per-species segment match stats). Thereferencearg leaves room for future non-bcfp comparisons (default-bundle parity, regression detection, federal sources). Initial implementation handlesreference = "bcfishpass"only.data-raw/compare_bcfishpass_wsg.R— refactor. Becomes a thin per-WSG orchestrator that callslnk_compare_wsg(reference = "bcfishpass")and persists the RDS. New optional arg--with-mapping-code.data-raw/compare_bcfp_mapping_code.R— DELETE. Its logic moves intolnk_compare_wsg()(gated bywith_mapping_code = TRUE).research/bcfp_divergence_taxonomy.yml— NEW. Initial population captures every Class A/B/C/D row fromresearch/provincial_parity_2026_05_01.mdplus MEASUREMENT_ASYMMETRY + TOKEN2_RESIDUAL patterns fromresearch/provincial_parity_2026_05_11.md.data-raw/annotate_provincial_parity.R— NEW. Readsdata-raw/logs/provincial_parity/<TS>*.rds+ the taxonomy, emits the annotated CSV.data-raw/trifecta_provincial.sh— extended. New--with-mapping-codeflag (default off for fast runs). Pre-flight check that all 5 hosts (3 cyphers + M4 + M1) are ready. Support--workspace job1,job2,job3style for the 3-cypher fan-out.Cleanups bundled
data-raw/balance_provincial_buckets.R— dedup bug (yesterday I worked around by writing inline LPT; should land in the canonical script)data-raw/consolidate_schema.R—ok=FALSEfalse-positive on cypher restores; also need bucket-aware destination cleanup sostreams_habitat_<sp>pg_restore doesn't dup-key collide. See 2026-05-12 addendum inresearch/provincial_parity_2026_05_11.md._per_wsg_times.csvdiscovery pattern — currently picks up old logs from 2026-05-08 unless you manually rename. Move to an archive convention so only the latest run feeds the LPT.Out of scope (separate issues if they come up)
status: INTENTIONAL_FRESH_DEFERREDfor now.Acceptance
--with-mapping-codefinishes in ≤90 min on 5 hosts (3 cyphers + M4 + M1)abs(diff_pct) >= 2ANDclass == UNEXPLAINED(every ≥2% divergence has either a class label or a "needs_investigation" hand-off)research/bcfp_divergence_taxonomy.ymlis the only place we record new patterns going forwardresearch/provincial_run_runbook.md) updated: the CSV is the primary deliverable; existing rollup/mapping_code compare outputs become diagnostic detail