Summary
Build the Lexical Variant Dictionary infrastructure: a generic, data-driven query expansion mechanism where dictionaries are modeled as OCL Sources, exposed through a $lexical-variants operation, and consumable by the Mapper, standard concept search, and external pipelines.
Full design: Lexical Variant Dictionaries (in ocl-online-docs/mapper/)
➡️ Immediate next step — OCL Mapper UI consumes the new feature
The backend is ready and OCL/lexical-variants-en is live on prod. Next: surface the new capability in the OCL Mapper UI so users can opt in.
Scope:
- Add a checkbox to the OCL Mapper Project Config, near or within the algorithms section
- Unchecked by default
- Label suggestion: "Apply lexical variants" or "Expand US/UK spelling variants"
- Optional helper text: "Match concepts even when the input uses British spelling and the target uses American (or vice versa)"
- When checked, the Mapper UI must include "variants": true in the $match request body
- When unchecked or absent, behavior matches today (no expansion)
Future-friendly UI consideration: the backend already accepts a dictionary URI string (not just true/false) so users can point at a custom dictionary later. For MVP a simple checkbox is fine; the field can graduate to a dictionary picker when more dictionaries exist.
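A minimal sketch of how a client might assemble the request body under the rules above. The helper name build_match_body, the "rows" key, and the example dictionary URI are hypothetical and only illustrate the shape; the variants field and its accepted values (true or a dictionary URI string) are the only parts taken from this ticket.

```python
from typing import Any, Optional, Union


def build_match_body(
    rows: list[dict[str, Any]],
    variants: Optional[Union[bool, str]] = None,
) -> dict[str, Any]:
    """Build a $match request body (hypothetical shape; only `variants` is from the ticket)."""
    body: dict[str, Any] = {"rows": rows}  # "rows" is an assumed key, for illustration only
    # Unchecked or absent: omit the field so behavior matches today (no expansion).
    if variants is True or (isinstance(variants, str) and variants):
        body["variants"] = variants
    return body


# Checkbox unchecked, checkbox checked, and a future dictionary picker (made-up URI).
unchecked = build_match_body([{"name": "anaemia"}])
checked = build_match_body([{"name": "anaemia"}], variants=True)
custom = build_match_body(
    [{"name": "anaemia"}], variants="/orgs/Example/sources/my-variants/"
)
```

Keeping the field absent rather than sending false means older clients and the unchecked state produce the same request.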
Action: file a frontend sub-ticket in the appropriate Mapper UI repo and link it back here.
Why now
Surfaced during review of oclapi2 PR #868 (en-US/en-GB spelling mismatch in $match). The original PR hardcoded ~70 regex pairs in core/common/utils.py. Three problems with that approach:
- False positives in the regex list (hem/haem matched themselves, anthem, hemisphere, hemp, hemlock, remember)
- Not reusable beyond ConceptFuzzySearch — standard /concepts/?q=… has the same gap, and upcoming abbreviation-expansion work needs the same primitive
- Buried in code — terminologists can't add a variant pair without a code change
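The false-positive problem can be reproduced with a small sketch. The pattern below is illustrative only and is not copied from the original PR; it assumes an unanchored hem to haem substitution.

```python
import re

# Illustrative pattern only; the actual pairs in the original PR are not reproduced here.
naive = re.compile(r"hem")

words = ["hemoglobin", "themselves", "anthem", "hemisphere"]
rewritten = [naive.sub("haem", w) for w in words]

for before, after in zip(words, rewritten):
    # "hemoglobin" gets the intended spelling; the other words are corrupted
    # because the pattern matches a substring rather than a whole term.
    print(f"{before} -> {after}")
```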
PR #868 was redirected to the dictionary-as-Source approach and now ships the MVP for the Mapper. This ticket tracks the broader infrastructure rollout.
Rationale highlights
- Eats our own dog food — we're a terminology platform; lexical variants are a terminology resource. Versioning, release management, locale, editability, and discoverability all come for free from OCL Sources.
- Aligns with industry precedent — UMLS SPECIALIST (LRSPL/LRABR), SNOMED Language Reference Sets, OBO synonym types. Survey of CTS2, LexEVS, BioPortal, Ontoserver, FHIR, RxNorm, MeSH, LOINC found no existing operation in this space, so we get to define the convention.
- Reuses existing OCL Name fields — name, locale, locale_preferred, name_type (Fully Specified, Short, Index Term, custom). No new schema fields needed for MVP.
- Tokenization-first — operates on whole tokens against the dictionary's Names, eliminating regex-precondition pain.
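A rough sketch of what whole-token expansion could look like. The function expand_query and the toy VARIANTS table are hypothetical and stand in for a dictionary Source; this is not the get_lexical_variants() implementation in oclapi2.

```python
import re

# Toy dictionary: each token maps to its spelling variants (entries are illustrative).
VARIANTS = {
    "anemia": ["anaemia"],
    "anaemia": ["anemia"],
    "hemoglobin": ["haemoglobin"],
    "haemoglobin": ["hemoglobin"],
}


def expand_query(query: str) -> list[str]:
    """Return the normalized query plus one copy per single-token variant substitution."""
    tokens = re.findall(r"\w+", query.lower())
    expanded = [" ".join(tokens)]
    for i, token in enumerate(tokens):
        # Only exact whole-token hits expand, so "themselves" never matches "hem".
        for variant in VARIANTS.get(token, []):
            expanded.append(" ".join(tokens[:i] + [variant] + tokens[i + 1:]))
    return expanded


with_variant = expand_query("Anemia in themselves")
without_variant = expand_query("anthem")
```

Because the lookup is keyed on whole tokens, adding a new pair is a data change in the dictionary rather than a new regex with its own preconditions.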
Tradeoff: caching pressure on oclapi2
Storing the dictionary in the DB means every variant-enabled query triggers dictionary lookups. The MVP scale (~40 entries) is trivial, but the architecture commits us to doing caching well as dictionary types and sizes grow:
- Abbreviation dictionaries at UMLS LRABR scale are ~250k rows; naive per-token DB hits on the search hot path will not be acceptable.
- Multi-dictionary composition (lexical + abbreviation + synonym in one request) multiplies lookups per query.
- Source-version invalidation must flip atomically without thrashing or dropping in-flight requests.
- Cache warming at deploy time and after dictionary releases to avoid cold-load penalties.
Plan to revisit the caching strategy when (a) a dictionary exceeds ~5k entries, or (b) variant expansion is enabled by default on a high-QPS endpoint — whichever comes first. Likely path: Redis-backed token-indexed cache for >1k-entry dictionaries; ES-backed lookup for very large ones.
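One possible shape for the token-indexed cache, sketched in-process for brevity. The class VariantCache, the loader signature, the source URI, and version v1.1 are assumptions made for illustration; a Redis-backed or ES-backed variant would replace the in-memory dict but keep the same (source_uri, version) keying.

```python
import threading
from typing import Callable

# Assumed loader signature: (source_uri, version) -> {token: [variants]}
Loader = Callable[[str, str], dict[str, list[str]]]


class VariantCache:
    """In-process token index keyed by (source_uri, version)."""

    def __init__(self, loader: Loader) -> None:
        self._loader = loader
        self._indexes: dict[tuple[str, str], dict[str, list[str]]] = {}
        self._lock = threading.Lock()

    def lookup(self, source_uri: str, version: str, token: str) -> list[str]:
        key = (source_uri, version)
        index = self._indexes.get(key)
        if index is None:
            with self._lock:
                index = self._indexes.get(key)  # re-check after acquiring the lock
                if index is None:
                    # One bulk load per dictionary version instead of one DB hit per token.
                    index = self._loader(source_uri, version)
                    self._indexes[key] = index
        return index.get(token, [])


load_calls: list[tuple[str, str]] = []


def fake_loader(source_uri: str, version: str) -> dict[str, list[str]]:
    """Stand-in for a database read; records each call so loads can be counted."""
    load_calls.append((source_uri, version))
    index = {"anemia": ["anaemia"]}
    if version == "v1.1":  # hypothetical later release with one extra entry
        index["edema"] = ["oedema"]
    return index


URI = "/orgs/OCL/sources/lexical-variants-en/"  # assumed path, for illustration
cache = VariantCache(fake_loader)
first = cache.lookup(URI, "v1.0", "anemia")
second = cache.lookup(URI, "v1.0", "edema")  # served from the cached v1.0 index
third = cache.lookup(URI, "v1.1", "edema")   # new version loads its own index
```

Keying on the version means a new release never mutates an index that in-flight requests are reading; old entries can be evicted later. Warming at deploy time would amount to calling lookup once per released dictionary version.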
MVP — Done in oclapi2 PR #868
- ocl/lexical-variants-en (43 vetted en-US/en-GB pairs, source_type: "Lexical Variants", extras.dictionary_kind: "lexical_variant", Source Version v1.0)
- oclapi2:core/common/data/lexical-variants-en.json
- oclapi2:core/common/lexical_variants.py — get_lexical_variants() and get_variant_terms(), function-level boundary, per-(source_uri, version) cache
- MetadataToConceptsListView ($match) via request body variants field
- variants: true or variants: "<URI>"
Deployment: after merge of PR #868, run ocl import core/common/data/lexical-variants-en.json once per environment to seed ocl/lexical-variants-en.
Phase 2 — OCL Mapper UI integration (immediate next step — see top of ticket)
- variants: true in $match request body when checked
Phase 2 — Standard concept search wiring
- ?variants=... query param on ConceptListView (/concepts/?q=…) — same shape as $match body field
- CustomESSearch.get_search_string() / get_raw_search_string() layer
- get_lexical_variants() helper
Phase 2 — Operation: $lexical-variants
- POST /sources/<dict>/$lexical-variants returning structured variants with provenance
- dictionaries list parameter for multi-dictionary composition
Naming chosen after surveying terminology-server precedent (UMLS LVG, FHIR, CTS2, LexEVS, BioPortal, Ontoserver, RxNorm). "Lexical variant" is the term of art (per UMLS SPECIALIST); no other terminology server has defined this operation.
Phase 2 — Abbreviation dictionary type
- … EXPANDS-TO to expansion target)
- context attribute (lab, clinical, vitals, etc.) for ambiguity disambiguation
Phase 2 — LVG-style normalization stack
Phase 3 — Additional dictionary types
- … name_type carrying register)
- dictionary_kind values
Phase 3 — AI assistant integration
- ocl-ai-assistant's keyword-expansion prompt as abbreviation_dictionary input
Out of scope (separate work)
- apply_score() highlight mechanism (lives in views.py:apply_score(), separate code path from variant expansion)
Acceptance criteria
- All Phase 2 tickets filed and prioritized
- Standard concept search wiring delivers same variants opt-in shape as $match
- $lexical-variants operation documented and consumable by at least one non-Mapper caller
- Caching strategy documented and reviewed before any dictionary exceeds 5k entries OR variants are enabled by default on a high-QPS endpoint
References
- ocl-ai-assistant keyword-expansion prompt (private)