Design: identity resolution — matching the same place across sources #61

@koinsaari

Goal

When the same physical place exists in multiple sources (e.g. OSM node 123 and Wheelmap node 456 both describe the same café), we need to (a) detect that they're the same and (b) merge them into one places row with both IDs recorded in external_ids.

Tracking under #10.

Why this is hard

External sources do not share IDs. There is no authoritative cross-reference. We have to infer identity from observable features:

  • Geographic proximity (within N meters)
  • Name similarity (fuzzy match, language-aware)
  • Category compatibility (a café and a parking lot can't be the same place even if collocated)
  • Address agreement when available

False matches collapse distinct places into one row. False non-matches duplicate places. Both degrade the registry.
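To make the four signals concrete, here is a minimal Go sketch of a pairwise check that combines proximity, name similarity, and category compatibility. It is an illustration under stated assumptions, not a proposed design: the `Candidate` type, the `categoryGroup` table, and the threshold values passed in `main` are all hypothetical, name similarity is plain normalized Levenshtein distance (not language-aware), and address agreement is left out.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// Candidate is a hypothetical source-agnostic view of one place record.
type Candidate struct {
	Name, Category string
	Lat, Lon       float64
}

// categoryGroup is a hypothetical compatibility table: two categories are
// compatible only if they map to the same group.
var categoryGroup = map[string]string{
	"cafe": "food_drink", "coffee_shop": "food_drink", "parking": "parking",
}

// haversineM returns the great-circle distance in meters between two points.
func haversineM(lat1, lon1, lat2, lon2 float64) float64 {
	const earthRadiusM = 6371000.0
	rad := func(deg float64) float64 { return deg * math.Pi / 180 }
	sinLat := math.Sin(rad(lat2-lat1) / 2)
	sinLon := math.Sin(rad(lon2-lon1) / 2)
	a := sinLat*sinLat + math.Cos(rad(lat1))*math.Cos(rad(lat2))*sinLon*sinLon
	return 2 * earthRadiusM * math.Asin(math.Sqrt(math.Min(1, a)))
}

// levenshtein returns the edit distance between s and t, counted in runes.
func levenshtein(s, t string) int {
	a, b := []rune(s), []rune(t)
	prev := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		cur := make([]int, len(b)+1)
		cur[0] = i
		for j := 1; j <= len(b); j++ {
			best := prev[j-1] // match or substitution
			if a[i-1] != b[j-1] {
				best++
			}
			if prev[j]+1 < best { // deletion
				best = prev[j] + 1
			}
			if cur[j-1]+1 < best { // insertion
				best = cur[j-1] + 1
			}
			cur[j] = best
		}
		prev = cur
	}
	return prev[len(b)]
}

// nameSimilarity returns a score in [0, 1]; 1 means identical after
// lowercasing and trimming, 0 means nothing in common (or an empty name).
func nameSimilarity(x, y string) float64 {
	x = strings.ToLower(strings.TrimSpace(x))
	y = strings.ToLower(strings.TrimSpace(y))
	longest := len([]rune(x))
	if n := len([]rune(y)); n > longest {
		longest = n
	}
	if longest == 0 {
		return 0
	}
	return 1 - float64(levenshtein(x, y))/float64(longest)
}

// categoriesCompatible treats unknown categories as compatible only when equal.
func categoriesCompatible(a, b string) bool {
	if a == b {
		return true
	}
	ga, okA := categoryGroup[a]
	gb, okB := categoryGroup[b]
	return okA && okB && ga == gb
}

// likelySame requires all three signals to pass their thresholds.
func likelySame(a, b Candidate, maxMeters, minNameSim float64) bool {
	return categoriesCompatible(a.Category, b.Category) &&
		haversineM(a.Lat, a.Lon, b.Lat, b.Lon) <= maxMeters &&
		nameSimilarity(a.Name, b.Name) >= minNameSim
}

func main() {
	osm := Candidate{"Café Regatta", "cafe", 60.1699, 24.9384}
	wheelmap := Candidate{"Cafe Regatta", "coffee_shop", 60.16995, 24.93845}
	lot := Candidate{"Cafe Regatta", "parking", 60.1699, 24.9384}
	// 25 m and 0.8 are placeholder thresholds, not tuned values.
	fmt.Println(likelySame(osm, wheelmap, 25, 0.8), likelySame(osm, lot, 25, 0.8))
}
```

Requiring every signal to pass biases the check toward false non-matches (duplicates) rather than false matches (collapsed rows). Whether that is the right trade-off is one of the things the design doc would need to decide.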

Open design questions

  • Matching algorithm: simple distance + name similarity threshold, or something heavier (vector embeddings, learned model)? Start simple, document the threshold values.
  • When does matching run: on every ingest of a new source, or as a separate periodic job over the whole places table?
  • Match confidence: do we record a confidence score per external_ids entry so that low-confidence matches can be reviewed later?
  • Human-in-the-loop: should low-confidence matches go to a review queue rather than auto-merge?
  • Schema impact: external_ids is currently []string (commit 009d937) — does it need to become []{source, id, confidence, added_at}?

Out of scope

  • Implementing the matcher. This issue is the design.
  • Splitting an existing merged row back apart (un-merge). Separate concern.

Acceptance

A short design doc (in-repo, under docs/) covering: the matching algorithm choice, any schema changes to external_ids, when matching runs, and how low-confidence cases are handled. Implementation is then split into separate issues.

Labels: area:ingestion (OSM and other data source ingestion), enhancement (New feature or request), priority:should (Should-have, rough edges)
