Problem
link's id_segment is a per-WSG row index, unique only within the persist primary key (id_segment, watershed_group_code) — not globally. In province-wide persist (fresh.streams): 1,542,427 rows but only 80,555 distinct id_segment — each value repeats ~19× across WSGs. So any join between persist tables on id_segment alone is a ~19–22× cartesian blow-up.
Active damage found 2026-05-24 (during #175):
Safe today (join the working schema = single WSG = id unique): lnk_compare_wsg rollup, lnk_pipeline_persist.
Root cause + proposed fix
bcfp's segmented_stream_id is position-derived — it encodes blue_line_key + downstream_route_measure (e.g. 359371372.3868298), so it is globally unique by construction. link should do the same: derive id_segment deterministically from (blue_line_key, downstream_route_measure) so it is globally unique.
Benefits:
- Kills the whole
id_segment-alone-join bug class (defense in depth, not just per-join PK discipline).
- Bonus: link
id_segment == bcfp segmented_stream_id → compares can join directly on the id (no blue_line_key + round(downstream_route_measure) position join), and it becomes a stable cross-build / cross-system key.
Blast radius (scope carefully)
id_segment is generated upstream (fresh segmentation / link persist — confirm during exploration) and is a PK component + join key across streams, streams_habitat_<sp>, streams_access, streams_mapping_code, barriers. Changing its derivation touches generation + every consumer. Decide: determinism, uniqueness guarantee, type (bcfp uses a float position-encoding — weigh numeric vs text vs bigint), and migration of existing persist data.
Interim discipline (until root fix)
All persist joins MUST use the full PK (id_segment, watershed_group_code). Periodic audit:
grep -rnE "id_segment *= *[a-z]+\.id_segment|USING \(id_segment\)" R/ data-raw/ | grep -v watershed_group_code
Relates to #175. SRED NewGraphEnvironment/sred-2025-2026#24.
Verified (2026-05-24): bcfp segmented_stream_id
- Globally unique —
fresh.streams_vw_bcfp: 4,230,118 rows = 4,230,118 distinct segmented_stream_id (no watershed_group_code needed in the key).
- Position-derived from
blue_line_key + downstream_route_measure, per bcfp's data dictionary (docs/05_data_dictionary.rst): "segmented_stream_id is based on FWA blue_line_key and downstream_route_measure for a given segment. When the segmentation by bcfishpass modelling changes, the segmented_stream_id will also change."
- Type: text (
character varying). Integer part = blue_line_key (verified for every sampled row); fractional part encodes the rounded measure — e.g. measure 120.287972 → …903.120288, 48.3061357 → …361.48306 (≈ blue_line_key || '.' || digits-of round(downstream_route_measure, 3)). It is not a meaningful float (a fixed blk + measure/scale does NOT fit — the fraction-digit count varies); treat it as an opaque deterministic position key.
- Segmentation-dependent — regenerated each bcfp build; only comparable within the same model run (already handled: we pin the build via
s3://fresh-bc/bcfishpass/log.json).
Implication for the fix: adopting bcfp's exact text construction for link's id_segment would make it globally unique AND id_segment == segmented_stream_id (direct id-join in compares). Pin the exact generating SQL from bcfp source (likely fwapg FWA_* / the bcfp segmentation step) during implementation — the AS segmented_stream_id expression wasn't located by code-search at @e12c1a5, so trace it from the segmentation/load step.
Problem
link's
id_segmentis a per-WSG row index, unique only within the persist primary key(id_segment, watershed_group_code)— not globally. In province-wide persist (fresh.streams): 1,542,427 rows but only 80,555 distinctid_segment— each value repeats ~19× across WSGs. So any join between persist tables onid_segmentalone is a ~19–22× cartesian blow-up.Active damage found 2026-05-24 (during #175):
lnk_compare_rollupjoinedstreams_habitat_<sp>onid_segmentalone against province-wide persist → PARS BTspawning_km36,819.8 vs correct 1,681.1 (~22×). The km parity numbers from that path are garbage whenever persist holds >1 WSG. Tactically fixed in Promote with_mapping_code flag to stand-alone lnk_compare_mapping_code() export #175 (full-PK joins).lnk_compare_mapping_codehad the same latent join — also fixed in Promote with_mapping_code flag to stand-alone lnk_compare_mapping_code() export #175.Safe today (join the working schema = single WSG = id unique):
lnk_compare_wsgrollup,lnk_pipeline_persist.Root cause + proposed fix
bcfp's
segmented_stream_idis position-derived — it encodesblue_line_key+downstream_route_measure(e.g.359371372.3868298), so it is globally unique by construction. link should do the same: deriveid_segmentdeterministically from(blue_line_key, downstream_route_measure)so it is globally unique.Benefits:
id_segment-alone-join bug class (defense in depth, not just per-join PK discipline).id_segment== bcfpsegmented_stream_id→ compares can join directly on the id (noblue_line_key + round(downstream_route_measure)position join), and it becomes a stable cross-build / cross-system key.Blast radius (scope carefully)
id_segmentis generated upstream (fresh segmentation / link persist — confirm during exploration) and is a PK component + join key acrossstreams,streams_habitat_<sp>,streams_access,streams_mapping_code,barriers. Changing its derivation touches generation + every consumer. Decide: determinism, uniqueness guarantee, type (bcfp uses a float position-encoding — weigh numeric vs text vs bigint), and migration of existing persist data.Interim discipline (until root fix)
All persist joins MUST use the full PK
(id_segment, watershed_group_code). Periodic audit:Relates to #175. SRED NewGraphEnvironment/sred-2025-2026#24.
Verified (2026-05-24): bcfp segmented_stream_id
fresh.streams_vw_bcfp: 4,230,118 rows = 4,230,118 distinctsegmented_stream_id(nowatershed_group_codeneeded in the key).blue_line_key+downstream_route_measure, per bcfp's data dictionary (docs/05_data_dictionary.rst): "segmented_stream_id is based on FWA blue_line_key and downstream_route_measure for a given segment. When the segmentation by bcfishpass modelling changes, the segmented_stream_id will also change."character varying). Integer part =blue_line_key(verified for every sampled row); fractional part encodes the rounded measure — e.g. measure120.287972→…903.120288,48.3061357→…361.48306(≈blue_line_key || '.' || digits-of round(downstream_route_measure, 3)). It is not a meaningful float (a fixedblk + measure/scaledoes NOT fit — the fraction-digit count varies); treat it as an opaque deterministic position key.s3://fresh-bc/bcfishpass/log.json).Implication for the fix: adopting bcfp's exact text construction for link's
id_segmentwould make it globally unique ANDid_segment == segmented_stream_id(direct id-join in compares). Pin the exact generating SQL from bcfp source (likely fwapgFWA_*/ the bcfp segmentation step) during implementation — theAS segmented_stream_idexpression wasn't located by code-search at@e12c1a5, so trace it from the segmentation/load step.