Lexical Variant Dictionary infrastructure: dictionaries as OCL Sources + $lexical-variants operation #2511

@paynejd

Summary

Build the Lexical Variant Dictionary infrastructure: a generic, data-driven query expansion mechanism where dictionaries are modeled as OCL Sources, exposed through a $lexical-variants operation, and consumable by the Mapper, standard concept search, and external pipelines.

Full design: Lexical Variant Dictionaries (in ocl-online-docs/mapper/)

➡️ Immediate next step — OCL Mapper UI consumes the new feature

The backend is ready and OCL/lexical-variants-en is live on prod. Next: surface the new capability in the OCL Mapper UI so users can opt in.

Scope:

  • Add a checkbox to the OCL Mapper Project Config, near or within the algorithms section
  • Unchecked by default
  • Label suggestion: "Apply lexical variants" or "Expand US/UK spelling variants"
  • Optional helper text: "Match concepts even when the input uses British spelling and the target uses American (or vice versa)"
  • When checked, the Mapper UI must include "variants": true in the $match request body
  • When unchecked or absent, behavior matches today (no expansion)

Future-friendly UI consideration: the backend already accepts a dictionary URI string (not just true/false) so users can point at a custom dictionary later. For MVP a simple checkbox is fine; the field can graduate to a dictionary picker when more dictionaries exist.
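A minimal sketch of the opt-in logic described above, written in Python for illustration only (the Mapper UI itself would implement this in its own language). The function name `build_match_body`, the `rows` payload field, and the example dictionary URI are hypothetical; only the `variants` field and its `true` / URI-string values come from this ticket.

```python
import json
from typing import Any, Optional


def build_match_body(
    base_body: dict[str, Any],
    apply_variants: bool,
    dictionary_uri: Optional[str] = None,
) -> dict[str, Any]:
    """Return a copy of base_body with the opt-in "variants" field applied.

    Unchecked: the field is left out entirely, so behavior matches today.
    Checked: send True for the MVP checkbox, or a dictionary URI string
    once a dictionary picker exists.
    """
    body = dict(base_body)  # shallow copy so the caller's dict is untouched
    body.pop("variants", None)
    if apply_variants:
        body["variants"] = dictionary_uri if dictionary_uri else True
    return body


# "rows" is a placeholder payload, not the real $match request schema.
base = {"rows": [{"name": "haemoglobin"}]}
print(json.dumps(build_match_body(base, apply_variants=True)))
# {"rows": [{"name": "haemoglobin"}], "variants": true}
print(json.dumps(build_match_body(base, apply_variants=False)))
# {"rows": [{"name": "haemoglobin"}]}
```

Leaving the field out when the box is unchecked, rather than sending `false`, keeps the request identical to what clients send today.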

Action: file a frontend sub-ticket in the appropriate Mapper UI repo and link it back here.

Why now

Surfaced during review of oclapi2 PR #868 (en-US/en-GB spelling mismatch in $match). The original PR hardcoded ~70 regex pairs in core/common/utils.py. Three problems with that approach:

  1. False positives in the regex list (the hem/haem pattern also matched "themselves", "anthem", "hemisphere", "hemp", "hemlock", and "remember")
  2. Not reusable beyond ConceptFuzzySearch — standard /concepts/?q=… has the same gap, and upcoming abbreviation-expansion work needs the same primitive
  3. Buried in code — terminologists can't add a variant pair without a code change

PR #868 was redirected to the dictionary-as-Source approach and now ships the MVP for the Mapper. This ticket tracks the broader infrastructure rollout.

Rationale highlights

  • Eats our own dog food — we're a terminology platform; lexical variants are a terminology resource. Versioning, release management, locale, editability, and discoverability all come for free from OCL Sources.
  • Aligns with industry precedent — UMLS SPECIALIST (LRSPL/LRABR), SNOMED Language Reference Sets, OBO synonym types. Survey of CTS2, LexEVS, BioPortal, Ontoserver, FHIR, RxNorm, MeSH, LOINC found no existing operation in this space, so we get to define the convention.
  • Reuses existing OCL Name fields: name, locale, locale_preferred, name_type (Fully Specified, Short, Index Term, custom). No new schema fields needed for MVP.
  • Tokenization-first — operates on whole tokens against the dictionary's Names, eliminating regex-precondition pain.
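To illustrate the tokenization-first point, here is a small sketch contrasting a substring regex rewrite with whole-token dictionary lookup. The `VARIANTS` entries, the regex, and the function names are illustrative assumptions; they are not the actual regex list from PR #868 nor the API of `lexical_variants.py`.

```python
import re
import string

# Illustrative entries only; the real pairs live in the dictionary Source.
VARIANTS = {
    "hemoglobin": {"haemoglobin"},
    "haemoglobin": {"hemoglobin"},
}

# Map every ASCII punctuation character to a space before splitting.
_PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation})


def tokenize(text: str) -> list[str]:
    """Lowercase, replace ASCII punctuation with spaces, split on whitespace."""
    return text.lower().translate(_PUNCT_TO_SPACE).split()


def expand_tokens(text: str) -> list[set[str]]:
    """For each token, return the token plus any whole-token variants."""
    return [{tok} | VARIANTS.get(tok, set()) for tok in tokenize(text)]


def naive_regex_expand(text: str) -> str:
    """Substring rewrite of the kind that produces false positives."""
    return re.sub(r"hem", "haem", text.lower())


print(naive_regex_expand("national anthem"))  # national anthaem
print(expand_tokens("national anthem"))       # [{'national'}, {'anthem'}]
```

Because lookup is keyed on the whole token, "anthem" never reaches the hem/haem pair, so no regex preconditions (word boundaries, lookarounds) are needed.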

Tradeoff: caching pressure on oclapi2

Storing the dictionary in the DB means every variant-enabled query triggers dictionary lookups. The MVP scale (~40 entries) is trivial, but the architecture commits us to doing caching well as dictionary types and sizes grow:

  • Abbreviation dictionaries at UMLS LRABR scale are ~250k rows; naive per-token DB hits on the search hot path will not be acceptable.
  • Multi-dictionary composition (lexical + abbreviation + synonym in one request) multiplies lookups per query.
  • Source-version invalidation must flip atomically without thrashing or dropping in-flight requests.
  • Caches need warming at deploy time and after dictionary releases to avoid cold-load penalties.

Plan to revisit the caching strategy when (a) a dictionary exceeds ~5k entries, or (b) variant expansion is enabled by default on a high-QPS endpoint — whichever comes first. Likely path: Redis-backed token-indexed cache for >1k-entry dictionaries; ES-backed lookup for very large ones.
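One way to think about the invalidation and warming requirements is a cache keyed by (source_uri, version), where a new release is simply a new key. The sketch below is an in-process stand-in under that assumption; the class name, the loader callable, and the eviction method are hypothetical, and a Redis- or ES-backed version would differ in detail.

```python
from typing import Callable

# A loader fetches one dictionary version and returns a token -> variants index.
Loader = Callable[[str, str], dict[str, set[str]]]


class VariantCache:
    """Token-indexed cache keyed by (source_uri, version)."""

    def __init__(self, loader: Loader) -> None:
        self._loader = loader
        self._indexes: dict[tuple[str, str], dict[str, set[str]]] = {}

    def warm(self, source_uri: str, version: str) -> dict[str, set[str]]:
        """Load a dictionary version once; later calls reuse the index."""
        key = (source_uri, version)
        if key not in self._indexes:
            self._indexes[key] = self._loader(source_uri, version)
        return self._indexes[key]

    def lookup(self, source_uri: str, version: str, token: str) -> set[str]:
        """One dict lookup per token instead of one DB hit per token."""
        return self.warm(source_uri, version).get(token, set())

    def evict_old_versions(self, source_uri: str, keep_version: str) -> None:
        """Drop superseded versions after the new one has been warmed."""
        stale = [
            key for key in self._indexes
            if key[0] == source_uri and key[1] != keep_version
        ]
        for key in stale:
            del self._indexes[key]
```

Under this scheme a release would call `warm()` for the new version first and `evict_old_versions()` afterwards, so requests still pinned to the old version keep resolving until the switch is complete.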

MVP — Done in oclapi2 PR #868

  • OCL Source ocl/lexical-variants-en (43 vetted en-US/en-GB pairs, source_type: "Lexical Variants", extras.dictionary_kind: "lexical_variant", Source Version v1.0)
  • OCL bulk import file: oclapi2:core/common/data/lexical-variants-en.json
  • Helper: oclapi2:core/common/lexical_variants.py with get_lexical_variants() and get_variant_terms(); function-level boundary, per-(source_uri, version) cache
  • Wired into MetadataToConceptsListView ($match) via request body variants field
  • Default OFF — clients opt in with variants: true or variants: "<URI>"
  • Tests including false-positive regression suite

Deployment: after merge of PR #868, run ocl import core/common/data/lexical-variants-en.json once per environment to seed ocl/lexical-variants-en.

Phase 2 — OCL Mapper UI integration (immediate next step — see top of ticket)

  • Project Config checkbox to opt in to lexical variants
  • Sends variants: true in $match request body when checked
  • Unchecked by default; behavior unchanged when unchecked
  • Frontend sub-ticket to be filed and linked here

Phase 2 — Standard concept search wiring

  • ?variants=... query param on ConceptListView (/concepts/?q=…) — same shape as $match body field
  • Touches CustomESSearch.get_search_string() / get_raw_search_string() layer
  • Consumes the same get_lexical_variants() helper
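Since the query param is meant to have the same shape as the $match body field, both could be funneled through one resolver. The sketch below assumes a hypothetical `resolve_variants` function and assumes that the strings "true"/"1" and "false"/"0" are the accepted boolean spellings on the query string; neither is confirmed by the current implementation.

```python
from typing import Optional, Union


def resolve_variants(
    value: Union[bool, str, None],
    default_uri: str,
) -> Optional[str]:
    """Map a body field or query-param value to a dictionary URI.

    Returns None when expansion is off (the default), default_uri when the
    caller sent a bare "true", and the given string when the caller named
    a dictionary explicitly.
    """
    if value is None or value is False:
        return None
    if value is True:
        return default_uri
    text = str(value).strip()
    if text.lower() in ("", "false", "0"):
        return None
    if text.lower() in ("true", "1"):
        return default_uri
    return text  # treated as a dictionary URI


# DEFAULT is a placeholder, not the real URI of the English dictionary.
DEFAULT = "<default dictionary URI>"
print(resolve_variants(None, DEFAULT))      # None
print(resolve_variants("true", DEFAULT))    # <default dictionary URI>
print(resolve_variants("<URI>", DEFAULT))   # <URI>
```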

Phase 2 — Operation: $lexical-variants

  • POST /sources/<dict>/$lexical-variants returning structured variants with provenance
  • Same shape on org and user namespaces
  • Supports dictionaries list parameter for multi-dictionary composition
  • Reusable by external pipelines, AI assistant, oclweb3
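The response shape for the operation is not yet specified in this ticket. As a starting point for discussion, the sketch below shows one possible reading of "structured variants with provenance" across a list of dictionaries. Every field name (`term`, `variant`, `dictionary`, `version`) and the in-memory dictionary structure are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Any


@dataclass(frozen=True)
class VariantResult:
    term: str        # the input term as supplied by the caller
    variant: str     # one variant found for that term
    dictionary: str  # URI of the dictionary that supplied the variant
    version: str     # Source Version the variant was read from


def lexical_variants(
    terms: list[str],
    dictionaries: list[dict[str, Any]],
) -> list[dict[str, str]]:
    """Collect variants for each term from each dictionary, in order."""
    results = []
    for term in terms:
        token = term.lower()
        for dictionary in dictionaries:
            for variant in sorted(dictionary["index"].get(token, ())):
                result = VariantResult(
                    term, variant, dictionary["uri"], dictionary["version"]
                )
                results.append(asdict(result))
    return results


# Placeholder dictionary; the URI and entries are illustrative only.
sample = [{
    "uri": "<dictionary URI>",
    "version": "v1.0",
    "index": {"anemia": {"anaemia"}},
}]
print(lexical_variants(["Anemia"], sample))
```

Carrying the dictionary and version on every result lets a caller that composes several dictionaries see which one produced each variant.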

Naming chosen after surveying terminology-server precedent (UMLS LVG, FHIR, CTS2, LexEVS, BioPortal, Ontoserver, RxNorm). "Lexical variant" is the term of art (per UMLS SPECIALIST); no other terminology server has defined this operation.

Phase 2 — Abbreviation dictionary type

  • Mapping-based modeling (Concept per abbreviation; Mapping EXPANDS-TO to expansion target)
  • context attribute (lab, clinical, vitals, etc.) for ambiguity disambiguation
  • Initial dictionary seeded from current 3-column work (term, expanded-term, context)
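The role of the context attribute can be sketched with the 3-column shape (term, expanded-term, context) mentioned above. The rows below are invented examples, not entries from the actual seed data, and the function names are hypothetical; in OCL the same information would be carried by Concepts and EXPANDS-TO Mappings rather than tuples.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Expansion(NamedTuple):
    expanded_term: str
    context: str


def build_index(rows: list[tuple[str, str, str]]) -> dict[str, list[Expansion]]:
    """Group (term, expanded-term, context) rows by lowercased abbreviation."""
    index: dict[str, list[Expansion]] = defaultdict(list)
    for term, expanded_term, context in rows:
        index[term.lower()].append(Expansion(expanded_term, context))
    return dict(index)


def expand(
    index: dict[str, list[Expansion]],
    term: str,
    context: Optional[str] = None,
) -> list[str]:
    """Return expansions for term, filtered by context when one is given."""
    candidates = index.get(term.lower(), [])
    if context is not None:
        candidates = [c for c in candidates if c.context == context]
    return [c.expanded_term for c in candidates]


# Invented rows for illustration only.
rows = [
    ("BP", "blood pressure", "vitals"),
    ("BP", "bipolar disorder", "clinical"),
]
index = build_index(rows)
print(expand(index, "bp"))            # ['blood pressure', 'bipolar disorder']
print(expand(index, "BP", "vitals"))  # ['blood pressure']
```

Without a context the caller sees every candidate expansion and can decide how to rank them; with a context the ambiguity is resolved by a plain filter.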

Phase 2 — LVG-style normalization stack

  • Tokenization beyond MVP's lowercase + whitespace + ASCII punctuation
  • Optional inflection normalization (uninflect)
  • Composable flow components (LVG-inspired)
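As a rough picture of what "composable flow components" could mean, the sketch below chains small token-list transforms into a single flow. The step names and the `compose` helper are hypothetical, and `naive_uninflect` is a deliberately crude stand-in (it only strips a trailing "s"), not real inflection normalization.

```python
import string
from functools import reduce
from typing import Callable

Step = Callable[[list[str]], list[str]]


def lowercase(tokens: list[str]) -> list[str]:
    return [t.lower() for t in tokens]


def strip_punct(tokens: list[str]) -> list[str]:
    """Trim ASCII punctuation from token edges; drop tokens left empty."""
    stripped = (t.strip(string.punctuation) for t in tokens)
    return [t for t in stripped if t]


def naive_uninflect(tokens: list[str]) -> list[str]:
    """Toy plural stripping: drop one trailing 's' from longer tokens."""
    return [
        t[:-1] if len(t) > 3 and t.endswith("s") and not t.endswith("ss") else t
        for t in tokens
    ]


def compose(*steps: Step) -> Step:
    """Build a flow that applies the steps left to right."""
    return lambda tokens: reduce(lambda acc, step: step(acc), steps, tokens)


flow = compose(lowercase, strip_punct, naive_uninflect)
print(flow("Benign Tumors, (left)".split()))  # ['benign', 'tumor', 'left']
```

Keeping each step a plain function from token list to token list means a dictionary or caller could select and order steps without changes to the others.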

Phase 3 — Additional dictionary types

  • Synonym dictionaries (clinical/lay, with name_type carrying register)
  • Inflectional variant dictionaries (plural/tense)
  • User-defined dictionary types with custom dictionary_kind values
  • Per-org dictionary preferences and overrides

Phase 3 — AI assistant integration

  • Feed deterministic dictionary results into ocl-ai-assistant's keyword-expansion prompt as abbreviation_dictionary input
  • Promote high-confidence LLM outputs back into curated OCL dictionaries (learning loop)

Out of scope (separate work)

Acceptance criteria

  • All Phase 2 tickets filed and prioritized
  • Standard concept search wiring delivers same variants opt-in shape as $match
  • $lexical-variants operation documented and consumable by at least one non-Mapper caller
  • Caching strategy documented and reviewed before any dictionary exceeds 5k entries OR variants are enabled by default on a high-QPS endpoint

References
