Skip to content

Make id_segment globally unique (position-derived, bcfp-style) — id_segment-alone persist joins cartesian across WSGs #203

@NewGraphEnvironment

Description

@NewGraphEnvironment

Problem

link's id_segment is a per-WSG row index, unique only within the persist primary key (id_segment, watershed_group_code) — not globally. In province-wide persist (fresh.streams): 1,542,427 rows but only 80,555 distinct id_segment — each value repeats ~19× across WSGs. So any join between persist tables on id_segment alone is a ~19–22× cartesian blow-up.

Active damage found 2026-05-24 (during #175):

Safe today (join the working schema = single WSG = id unique): lnk_compare_wsg rollup, lnk_pipeline_persist.

Root cause + proposed fix

bcfp's segmented_stream_id is position-derived — it encodes blue_line_key + downstream_route_measure (e.g. 359371372.3868298), so it is globally unique by construction. link should do the same: derive id_segment deterministically from (blue_line_key, downstream_route_measure) so it is globally unique.

Benefits:

  1. Kills the whole id_segment-alone-join bug class (defense in depth, not just per-join PK discipline).
  2. Bonus: link id_segment == bcfp segmented_stream_id → compares can join directly on the id (no blue_line_key + round(downstream_route_measure) position join), and it becomes a stable cross-build / cross-system key.

Blast radius (scope carefully)

id_segment is generated upstream (fresh segmentation / link persist — confirm during exploration) and is a PK component + join key across streams, streams_habitat_<sp>, streams_access, streams_mapping_code, barriers. Changing its derivation touches generation + every consumer. Decide: determinism, uniqueness guarantee, type (bcfp uses a float position-encoding — weigh numeric vs text vs bigint), and migration of existing persist data.

Interim discipline (until root fix)

All persist joins MUST use the full PK (id_segment, watershed_group_code). Periodic audit:

grep -rnE "id_segment *= *[a-z]+\.id_segment|USING \(id_segment\)" R/ data-raw/ | grep -v watershed_group_code

Relates to #175. SRED NewGraphEnvironment/sred-2025-2026#24.

Verified (2026-05-24): bcfp segmented_stream_id

  • Globally uniquefresh.streams_vw_bcfp: 4,230,118 rows = 4,230,118 distinct segmented_stream_id (no watershed_group_code needed in the key).
  • Position-derived from blue_line_key + downstream_route_measure, per bcfp's data dictionary (docs/05_data_dictionary.rst): "segmented_stream_id is based on FWA blue_line_key and downstream_route_measure for a given segment. When the segmentation by bcfishpass modelling changes, the segmented_stream_id will also change."
  • Type: text (character varying). Integer part = blue_line_key (verified for every sampled row); fractional part encodes the rounded measure — e.g. measure 120.287972…903.120288, 48.3061357…361.48306 (≈ blue_line_key || '.' || digits-of round(downstream_route_measure, 3)). It is not a meaningful float (a fixed blk + measure/scale does NOT fit — the fraction-digit count varies); treat it as an opaque deterministic position key.
  • Segmentation-dependent — regenerated each bcfp build; only comparable within the same model run (already handled: we pin the build via s3://fresh-bc/bcfishpass/log.json).

Implication for the fix: adopting bcfp's exact text construction for link's id_segment would make it globally unique AND id_segment == segmented_stream_id (direct id-join in compares). Pin the exact generating SQL from bcfp source (likely fwapg FWA_* / the bcfp segmentation step) during implementation — the AS segmented_stream_id expression wasn't located by code-search at @e12c1a5, so trace it from the segmentation/load step.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions