Goal
When the same physical place exists in multiple sources (e.g. OSM node 123 and Wheelmap node 456 both describe the same café), we need to (a) detect that they're the same and (b) merge them into one places row with both IDs recorded in external_ids.
Tracking under #10.
Why this is hard
External sources do not share IDs. There is no authoritative cross-reference. We have to infer identity from observable features:
- Geographic proximity (within N meters)
- Name similarity (fuzzy match, language-aware)
- Category compatibility (a café and a parking lot can't be the same place even if collocated)
- Address agreement when available
False matches collapse distinct places into one row. False non-matches duplicate places. Both degrade the registry.
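To make the trade-off concrete, here is a minimal sketch of the "simple" end of the spectrum: a pair is a candidate match only if the categories are compatible, the points lie within a distance threshold, and the names are similar enough. Everything named here (`Place`, `categoryGroup`, `isCandidate`, the thresholds passed in `main`) is hypothetical and for illustration only. Name similarity is a plain normalized Levenshtein score, not the language-aware comparison the list above calls for, and address agreement is left out.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Place is a hypothetical, minimal view of one source record.
type Place struct {
	Name     string
	Category string
	Lat, Lon float64
}

// haversineMeters returns the great-circle distance in meters.
func haversineMeters(lat1, lon1, lat2, lon2 float64) float64 {
	const earthRadiusM = 6371000.0
	rad := func(deg float64) float64 { return deg * math.Pi / 180 }
	sinLat := math.Sin(rad(lat2-lat1) / 2)
	sinLon := math.Sin(rad(lon2-lon1) / 2)
	a := sinLat*sinLat + math.Cos(rad(lat1))*math.Cos(rad(lat2))*sinLon*sinLon
	return 2 * earthRadiusM * math.Asin(math.Sqrt(a))
}

// editDistance is the Levenshtein distance over runes (not bytes).
func editDistance(a, b []rune) int {
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			best := prev[j-1] // match, or substitution if the runes differ
			if a[i-1] != b[j-1] {
				best++
			}
			if prev[j]+1 < best { // deletion
				best = prev[j] + 1
			}
			if cur[j-1]+1 < best { // insertion
				best = cur[j-1] + 1
			}
			cur[j] = best
		}
		prev = cur
	}
	return prev[len(b)]
}

// nameSimilarity returns a score in [0, 1]; an empty name yields 0 (no evidence).
func nameSimilarity(a, b string) float64 {
	ra := []rune(strings.ToLower(strings.TrimSpace(a)))
	rb := []rune(strings.ToLower(strings.TrimSpace(b)))
	if len(ra) == 0 || len(rb) == 0 {
		return 0
	}
	longest := len(ra)
	if len(rb) > longest {
		longest = len(rb)
	}
	return 1 - float64(editDistance(ra, rb))/float64(longest)
}

// categoryGroup is an assumed, illustrative grouping; real values would come
// from the registry's own category vocabulary.
var categoryGroup = map[string]string{
	"cafe": "food", "restaurant": "food", "parking": "transport",
}

// categoriesCompatible accepts equal categories or two known categories
// that fall in the same group.
func categoriesCompatible(a, b string) bool {
	if a == b {
		return true
	}
	ga, okA := categoryGroup[a]
	gb, okB := categoryGroup[b]
	return okA && okB && ga == gb
}

// isCandidate applies all three checks; both thresholds are caller-supplied.
func isCandidate(a, b Place, maxMeters, minNameSim float64) bool {
	return categoriesCompatible(a.Category, b.Category) &&
		haversineMeters(a.Lat, a.Lon, b.Lat, b.Lon) <= maxMeters &&
		nameSimilarity(a.Name, b.Name) >= minNameSim
}

func main() {
	osm := Place{Name: "Café Anna", Category: "cafe", Lat: 52.5200, Lon: 13.4050}
	wheelmap := Place{Name: "Cafe Anna", Category: "cafe", Lat: 52.5201, Lon: 13.4050}
	lot := Place{Name: "Cafe Anna", Category: "parking", Lat: 52.5200, Lon: 13.4050}
	fmt.Println(isCandidate(osm, wheelmap, 25, 0.8)) // true
	fmt.Println(isCandidate(osm, lot, 25, 0.8))      // false
}
```

The 25 m and 0.8 values are placeholders, not proposals. Category is checked first because it is the cheapest test and a hard veto, which is what rules out the collocated café and parking lot.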
Open design questions
- Matching algorithm: simple distance + name similarity threshold, or something heavier (vector embeddings, learned model)? Start simple, document the threshold values.
- When does matching run: on every ingest of a new source, or as a separate periodic job over the whole places table?
- Match confidence: do we record a confidence score per external_ids entry so that low-confidence matches can be reviewed later?
- Human-in-the-loop: should low-confidence matches go to a review queue rather than auto-merge?
- Schema impact: external_ids is currently []string (commit 009d937). Does it need to become []{source, id, confidence, added_at}?
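As a starting point for the schema and review-queue questions, here is one possible shape for a structured external_ids entry, with a three-way routing rule on confidence. This is a sketch under stated assumptions, not a decision: the `ExternalID` type, the `Decision` values, the `decide` function, and both thresholds are hypothetical. Only the four field names come from the question above.

```go
package main

import (
	"fmt"
	"time"
)

// ExternalID is a hypothetical structured replacement for a bare string entry.
type ExternalID struct {
	Source     string    // e.g. "osm" or "wheelmap"
	ID         string    // identifier within that source
	Confidence float64   // match confidence in [0, 1]
	AddedAt    time.Time // when the link was recorded
}

// Decision is the assumed outcome of evaluating one candidate match.
type Decision string

const (
	AutoMerge Decision = "auto-merge"
	Review    Decision = "review"
	NoMatch   Decision = "no-match"
)

// decide routes a match by confidence. reviewAt and mergeAt are assumed
// thresholds with reviewAt <= mergeAt; neither value is settled in this issue.
func decide(confidence, reviewAt, mergeAt float64) Decision {
	switch {
	case confidence >= mergeAt:
		return AutoMerge
	case confidence >= reviewAt:
		return Review
	default:
		return NoMatch
	}
}

func main() {
	link := ExternalID{Source: "wheelmap", ID: "456", Confidence: 0.72, AddedAt: time.Now().UTC()}
	fmt.Println(link.Source, link.ID, decide(link.Confidence, 0.6, 0.9)) // wheelmap 456 review
}
```

Storing the confidence on each entry keeps the review question open: matches can be auto-merged now and re-queued later if the thresholds change, without re-running the matcher.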
Out of scope
- Implementing the matcher. This issue is the design.
- Splitting an existing merged row back apart (un-merge). Separate concern.
Acceptance
A short design doc (in-repo, under docs/) covering: matching algorithm choice, schema changes to external_ids if any, when matching runs, and how low-confidence cases are handled. Then implementation is broken into separate issues.