Skip to content

frs_point_match: match two FWA-snapped point datasets within instream distance (v0.30.0)#208

Merged
NewGraphEnvironment merged 9 commits into
mainfrom
206-frs-point-match-match-two-point-datasets
May 11, 2026
Merged

frs_point_match: match two FWA-snapped point datasets within instream distance (v0.30.0)#208
NewGraphEnvironment merged 9 commits into
mainfrom
206-frs-point-match-match-two-point-datasets

Conversation

@NewGraphEnvironment
Copy link
Copy Markdown
Owner

Summary

Closes #206. Ships frs_point_match() — third primitive in the point-handling family alongside frs_point_snap and frs_network_features. Matches two FWA-snapped point datasets along the network within an instream-distance threshold, with bidirectional dedup and a switchable tiebreak metric.

  • Generic over any FWA-snapped point-pair use case (PSCIS↔modelled, observations↔habitat-confirmations, field-assessed↔user-added dedup, etc.).
  • Algorithm mirrors 02_pscis_streams_150m.sql at smnorris/bcfishpass@v0.7.14-125-g6e9cf1c (tunnel bcfishpass.log.model_run_id=121, rebuilt 2026-05-05).
  • Parameter naming follows the table_<role> / col_<role> conventions: table_a, table_b, table_to, col_a_id, col_b_id, tiebreak, distance_max.

Live byte-identical parity vs bcfp

ADMS: 60 / 60 (stream_crossing_id, modelled_crossing_id) pairs byte-identical to bcfishpass.pscis.modelled_crossing_id. 0 in ours-not-ref, 0 in ref-not-ours.

BULK (xref-excluded snap-only subset, since BULK has 1467 manual xref overrides that the caller layers on top of this primitive): 77 / 78 ref pairs identical; 5 in ours-not-ref, 1 in ref-not-ours.

The 5 BULK divergences are bcfp's multi-stream candidate consideration at the SNAP layer (before this primitive runs) — bcfp picks among multiple FWA streams within 150m planar by name + width + distance scoring. frs_point_match accepts a single (PSCIS, stream) per row, so it can't reselect the stream. Documented in @details. Addressed by sibling primitive #207 (frs_candidates_pick): score + filter + dedup candidates per key. With #206 + #207 composed via frs_point_snap(num_features = N)frs_candidates_pickfrs_point_match, the bcfp PSCIS-to-stream + PSCIS-to-modelled algorithm reproduces byte-identically.

Test plan

  • 24 mocked tests covering validation (expect_error() per arg path) and SQL composition (DROP + CREATE order, DISTINCT ON, bidirectional dedup via PARTITION BY, tiebreak switch, identifier sanitization, reserved-column collision guards). [ FAIL 0 | WARN 0 | SKIP 0 | PASS 24 ] on filter=frs_point_match.
  • Full suite: 987 PASS / 2 FAIL / 3 WARN — both FAIL and the WARN are pre-existing on main (verified via git checkout main + re-run). No regressions caused by this change.
  • R CMD check --no-tests: 0 errors / 4 warnings / 4 notes — identical to main.
  • lintr::lint("R/frs_point_match.R"): zero lints.
  • Live integration test against bcfp tunnel (ADMS + BULK above).
  • roxygen2::roxygenise() regenerates man/frs_point_match.Rd + frs_network_features.Rd (@family network cross-reference).
  • DESCRIPTION 0.29.0 → 0.30.0 (minor bump — new exported function).
  • NEWS.md 0.30.0 entry covers semantics, parity numbers, tiebreak, persistence-contract rationale, follow-up reference.

Notable design decisions

  • Write-to-table contract (different from frs_point_snap which returns sf and frs_network_features which returns tibble). Rationale: the result is a derived dataset not a query result; bcfp's analog also writes table→table; output sizes can be large; DDL split into two DBI::dbExecute calls because RPostgres rejects multi-statement SQL in one call.
  • Network-position columns hard-coded (blue_line_key, downstream_route_measure) to the FWA convention. Per-side overrides can be added if a real divergence appears (precedent: fresh#204's per-side wscode/localcode overrides on frs_network_features).
  • tiebreak parameter (c("instream", "planar"), default "instream"). Default is geometry-free and self-consistent with the threshold filter; "planar" requires geom on both inputs and mirrors bcfp's tiebreak metric exactly.
  • Bidirectional dedup: a-side keeps each table_a row's nearest table_b; b-side keeps each table_b row's nearest table_a (others get NULL). Mirrors bcfp's two-pass algorithm in lines 172-218 + 226+ of 02_pscis_streams_150m.sql.

First consumer: link#154 (lnk_pipeline_crossings: missing PSCIS↔modelled 100m-instream auto-snap layer). That issue wires this primitive into link's per-WSG crossings build.

🤖 Generated with Claude Code

NewGraphEnvironment and others added 9 commits May 11, 2026 07:58
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
R/frs_point_match.R + man/frs_point_match.Rd + NAMESPACE export.
Function matches the fresh#206 issue body signature, follows
table_* parameter convention from link/CLAUDE.md, uses
.frs_validate_identifier() + .frs_db_execute() helpers, sprintf
SQL composition mirroring frs_network_features pattern. Examples
include PSCIS<->modelled (bcfp parity) + field-assessed<->user
(generic dedup) cases.

SQL implements bcfp's 02_pscis_streams_150m.sql algorithm
(verified unchanged from local v0.7.13 source through tunnel
v0.7.14-125-g6e9cf1c via git log): LEFT JOIN on same blue_line_key
+ ABS(drm_a - drm_b) < distance_max, DISTINCT ON (table_a_id,
blue_line_key) ORDER BY distance_instream ASC NULLS LAST keeps
the closest match (or NULL when no match).

Lint clean; load_all signature verified. Tests + live byte-identical
validation against bcfp follow in Phase 2+3.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…cross-ref

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
tests/testthat/test-frs_point_match.R mirrors frs_network_features
test structure:

Tier 1 -- validation (7 tests): required args, distance_max scalar
positive numeric, .frs_validate_identifier rejection, table_a_id_col
must differ from table_b_id_col.

Tier 2 -- SQL composition via local_mocked_bindings on
.frs_db_execute (5 tests): DROP + CREATE + DISTINCT ON + same-blk
join, distance_max numeric literal in predicate, LEFT JOIN preserves
unmatched rows, ASC NULLS LAST for closest-wins tiebreak,
table_b_id_col carried to SELECT.

Total: 20 PASS / 0 FAIL on filter=frs_point_match.

Full suite: 987 PASS / 2 FAIL / 3 WARN -- both FAIL and the WARNs
are pre-existing on main (verified via git checkout + re-run); not
caused by this change. Specifically test-frs_params.R:92 fails the
same way before and after.

Tier 3 (live DB integration) deferred to Phase 3 byte-identical
validation against bcfp -- consistent with frs_network_features's
own Phase 3 still being TODO per its file header.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bug surfaced by live integration test: RPostgres can't run multi-
statement SQL ("cannot insert multiple commands into a prepared
statement"). Split DROP TABLE IF EXISTS and CREATE TABLE AS into
two .frs_db_execute calls. Updated tests to capture both SQL calls
via a list-accumulator mock.

Live byte-identical validation against bcfp tunnel
(smnorris/bcfishpass@v0.7.14-125-g6e9cf1c via bcfishpass.log
model_run_id=121, 2026-05-05):

  ours: 60 matches | ref: 60 matches
  (stream_crossing_id, modelled_crossing_id) pairs identical: 60
  only in ours: 0
  only in ref:  0

Script captured at /tmp/fresh_206_live_validation.R for PR body
attachment. Tests: 21 PASS / 0 FAIL on filter=frs_point_match.

Reference is bcfishpass.pscis.modelled_crossing_id (the canonical
post-dedup linkage) not bcfishpass.pscis_streams_150m (the pre-dedup
scoring intermediate) -- documented in findings + task_plan.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
DESCRIPTION 0.29.0 -> 0.30.0 (minor bump per R-package conventions
for new exported function).

NEWS.md 0.30.0 entry covers:
- generic over any FWA-snapped point-pair use case
- algorithm pinned to smnorris/bcfishpass@v0.7.14-125-g6e9cf1c
- table_* parameter convention (per link/CLAUDE.md)
- write-to-table contract + RPostgres two-call split rationale
- live parity numbers (60/60 byte-identical ADMS)
- test coverage summary (21 tests)
- first consumer is link#154

R CMD check (--no-tests): 0 errors / 4 warnings / 4 notes -- identical
to main per side-by-side check. lintr clean on the new R file.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two algorithmic improvements after BULK validation surfaced gaps:

1) Bidirectional dedup. Initial impl only did a-side dedup (each PSCIS
keeps nearest modelled). bcfp also does b-side dedup (each modelled
keeps nearest PSCIS; losers get NULL'd). Added via ROW_NUMBER OVER
PARTITION BY b_id, CASE-NULLing non-winners. Mirrors bcfp's UPDATE
in lines 172-218 of 02_pscis_streams_150m.sql.

2) tiebreak parameter (c("instream", "planar"), default "instream").
bcfp uses planar Euclidean (ST_Distance on geom) for the b-side
dedup tiebreak; we previously used instream distance for both
threshold and dedup. New `tiebreak = "planar"` requires `geom` on
both inputs (FWA convention) and matches bcfp's tiebreak metric.

Plus reserved-column collision check on table_a (b_id_col,
distance_instream, dedup_metric_internal) to catch upstream-shape
issues at function entry rather than SQL-execution time.

Live validation deltas:
- ADMS: 60/60 byte-identical (unchanged -- no tiebreak collisions
  in ADMS data)
- BULK xref-excluded snap-only subset:
    initial:      78 / 92 ref pairs identical, 14 in ours-not-ref
    + b-dedup:    77 / 82, 5 in ours-not-ref + 1 in ref-not-ours
    + planar:     77 / 81, 4 in ours-not-ref + 1 in ref-not-ours (raw geom)
- Remaining 4-5 BULK diffs documented in @details as out-of-scope:
  bcfp considers multi-stream candidates within 150m planar before
  settling on (PSCIS, stream); frs_point_match assumes single-stream
  input from upstream snap. Caller can layer multi-stream selection
  on top if needed.

Tests: 24 PASS / 0 FAIL on filter=frs_point_match. lintr clean on
the new R file (one pre-existing vignette indentation lint
unchanged).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Renamed `table_a_id_col` -> `col_a_id`, `table_b_id_col` -> `col_b_id`
per the `col_<role>` parameter naming convention (link/CLAUDE.md
extension). Matches the `table_<role>` precedent (table_a, table_b,
table_to, table_in, table_out). Picked over `<role>_col` for the
same autocomplete-grouping reason.

Live re-validation post-rename: ADMS still 60/60 byte-identical.

NEWS.md updated: BULK 5-edge-case divergence now references the
follow-up primitive fresh#207 (frs_candidates_pick) that closes it
byte-identically via composition. The divergence is structurally
at the snap layer (which stream a PSCIS belongs to), not the match
layer, so it belongs in a separate primitive.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@NewGraphEnvironment NewGraphEnvironment merged commit f3995dc into main May 11, 2026
1 check passed
@NewGraphEnvironment NewGraphEnvironment deleted the 206-frs-point-match-match-two-point-datasets branch May 11, 2026 16:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

frs_point_match: match two point datasets along FWA network within instream distance

1 participant