## Introducing KUMLS: a lightweight Unified Medical Language System (UMLS) implemented in Kùzu

The [Unified Medical Language System (UMLS)](https://www.nlm.nih.gov/research/umls/index.html) integrates and distributes key terminology, classification and coding standards, and associated resources to promote the creation of more effective and interoperable biomedical information systems and services, including electronic health records. It brings together more than [210 health and biomedical vocabularies and standards](https://www.nlm.nih.gov/research/umls/sourcereleasedocs/) to enable interoperability between computer systems. The files are available for use and integration in knowledge graph databases, such as Neo4j, to support the implementation of APIs for mappings and terminology servers.

In many real-world settings, there is a need for a lightweight version of UMLS, which:

- Includes only the vocabularies that are needed or most widely used, such as SNOMED CT, ICD-10 and LOINC;
- Provides a serverless implementation, leveraging new technologies that are part of the [composable data stack](https://voltrondata.com/codex). Specifically, we seek to implement a lightweight, file-based UMLS database that can be easily distributed;
- Provides a query interface that is easy to use and in line with new standards;
- Lowers the threshold for adding custom vocabularies and mappings for specific use cases.

Taking these requirements to heart, and triggered by the work done at PharmAccess Foundation with the Momcare programme, we introduce KUMLS, a lightweight UMLS knowledge graph database implemented in [Kùzu](https://kuzudb.com/). The first implementation of KUMLS includes three core dictionaries:

- the [SNOMED IPS Terminology](https://confluence.ihtsdotools.org/display/DOCIPSTUG/2.+IPS+Terminology+Overview)
- [LOINC 2.79](https://loinc.org/downloads/)
- [ICD10 2019](https://icdcdn.who.int/icd10/index.html)

As part of the FAIR with FHIR paper, we demonstrate how KUMLS can be used to add case-specific vocabularies with associated mappings, in this case:

- the [WHO Antenatal Care (ANC) Guidelines](https://build.fhir.org/ig/dhes/smart-anc/)
- the vocabulary used by the [HE2AT Center](https://dsi-africa.org/project/5)

This version of KUMLS can be queried through the interfaces provided by Kùzu, most notably Cypher, one of the most widely used graph query languages, which is evolving towards an open standard through [openCypher](https://opencypher.org/).
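
As a minimal sketch of what such a Cypher query could look like once the database has been built, the helper below looks up the three-position parent of a four-position ICD-10 code. The table names `ICD10Category4`, `ICD10Category3` and `IsSubClassOf` are the ones created later in this notebook. The property names `code` and `rubric` are assumptions, since the actual table definitions live in the separate `ddl` module, and the helper name `find_parent_category` is hypothetical.

```python
# Parameterised Cypher query; the property names `code` and `rubric` are assumed.
PARENT_QUERY = """
MATCH (child:ICD10Category4)-[:IsSubClassOf]->(parent:ICD10Category3)
WHERE child.code = $code
RETURN parent.code, parent.rubric
"""


def find_parent_category(conn, code: str) -> list:
    """Return the parent rows for a four-position ICD-10 code.

    `conn` is expected to behave like a `kuzu.Connection`.
    """
    result = conn.execute(PARENT_QUERY, {"code": code})
    rows = []
    # iterate over the result set row by row
    while result.has_next():
        rows.append(result.get_next())
    return rows
```

With the connection created further down, `find_parent_category(conn, "C88.9")` would be expected to return the row for `C88`, provided the property names match the DDL.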

This notebook demonstrates the creation of KUMLS, starting from the downloaded files of each of the vocabularies in `data/external`.


### SNOMED IPS Terminology


In [26]:
from pathlib import Path
from shutil import rmtree

import altair as alt
import fsspec
import kuzu
import polars as pl

# data definitions of Kùzu tables are in separate file
import ddl


snomed_data = Path("./data/external/snomed-ips/Snapshot/Terminology/")
kuzu_path = Path("./data/kuzu-db/")
if kuzu_path.exists():
    rmtree(kuzu_path)

db = kuzu.Database(kuzu_path)
conn = kuzu.Connection(db)


def read_snomed(path: Path) -> pl.DataFrame:
    return pl.read_csv(path, separator="\t").with_columns(
        pl.col("effectiveTime").cast(pl.String).str.to_date("%Y%m%d"),
        pl.col("active").cast(pl.Boolean),
    )


#### Concepts and descriptions

SNOMED IPS has 19,699 Concepts, with two types of descriptions.

- 900000000000003001 | Fully specified name (stored as `fullQualifiedName`)
- 900000000000013009 | Synonym

These are loaded in a node table `SCT`.


In [27]:
Description = read_snomed(
    snomed_data / "sct2_Description_IPSSnapshot-en_IPST_20240701.txt"
)
Description.select(pl.col("typeId").value_counts()).unnest("typeId")

typeId,count
i64,u32
900000000000013009,40417
900000000000003001,19697


In [28]:
fullname = Description.filter(pl.col("typeId") == 900000000000003001).select(
    pl.col("conceptId"), pl.col("term").alias("fullQualifiedName")
)
synonyms = (
    Description.filter(pl.col("typeId") == 900000000000013009)
    .select(pl.col("conceptId"), pl.col("term").alias("synonyms"))
    .group_by(pl.col("conceptId"))
    .agg("synonyms")
)

join_concept = dict(how="left", left_on="id", right_on="conceptId")
Concept = (
    read_snomed(snomed_data / "sct2_Concept_IPSSnapshot_IPST_20240701.txt")
    .join(fullname, **join_concept)
    .join(synonyms, **join_concept)
)

conn.execute(ddl.sct.concept + "COPY SCT FROM Concept;")

[<kuzu.query_result.QueryResult at 0x1357f7510>,
 <kuzu.query_result.QueryResult at 0x10460df90>]

#### Relationships

The IPS Terminology contains 66,017 relationships of 70 relationship types. The top 10 relationship types cover 51,886 relationships, i.e. roughly 79%. Separate `REL TABLE`s are defined for these 10 relationship types. The remaining 60 relationship types are to be stored in a generic relationship table.
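
The coverage figure can be checked with a few lines of plain Python. This is only a sketch: `top_n_coverage` is a hypothetical helper, and the per-type counts passed to it in practice would come from the `value_counts` computed in the cells below.

```python
def top_n_coverage(counts: list[int], n: int) -> float:
    """Share of all relationships covered by the n most frequent types."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:n]) / sum(ordered)


# Totals quoted in the text: top 10 types versus all relationships
share = 51_886 / 66_017
print(f"{share:.1%}")  # 78.6%
```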


In [29]:
# load Relationship, note we need to change ordering of columns for loading in Kuzu
join_relationship = dict(how="left", left_on="typeId", right_on="conceptId")
Relationship = (
    read_snomed(snomed_data / "sct2_Relationship_IPSSnapshot_IPST_20240701.txt")
    .select(
        pl.col(
            [
                "sourceId",
                "destinationId",
                "id",
                "effectiveTime",
                "active",
                "moduleId",
                "relationshipGroup",
                "typeId",
                "characteristicTypeId",
                "modifierId",
            ]
        )
    )
    .join(fullname, **join_relationship)
    .join(synonyms, **join_relationship)
)

In [30]:
Relationship.select(pl.col("typeId").value_counts()).unnest("typeId").sort(
    "count", descending=True
).head(10).sum()

typeId,count
i64,u32
3652444046,51886


In [31]:
# inspect frequency of each type of relationship out of 66,017 relationships
print(Relationship.shape)

# 116680003 | Is A occurs 32,111 times i.e accounts for half
# 363698007 | Finding site 5,497
# 116676008 | Associated morphology 3,818
type_count = (
    Relationship.select(pl.col("typeId").value_counts())
    .unnest("typeId")
    .sort("count", descending=True)
)
type_count.plot.bar(
    alt.X("count:Q").scale(type="symlog"), y=alt.Y("typeId:O").sort("-x")
)

(66017, 12)


In [32]:
for name, type_id in ddl.sct.top10_relationships:
    # one REL TABLE per relationship type, loaded from the matching rows
    Relationship_ = Relationship.filter(pl.col("typeId") == type_id)
    conn.execute(
        f"DROP TABLE IF EXISTS {name};"
        + ddl.sct.relationship(name)
        + f"COPY {name} FROM Relationship_;"
    )

In [33]:
# TO DO: add generic relationship table with remaining 60 relationship types

### ICD-10 2019

The structure of ICD-10 is strictly hierarchical. From top to bottom it defines:

- Chapter
- Group
- Category
  - three-position code, e.g. C88
  - four-position code, e.g. C88.9

These four levels in the hierarchy are loaded as separate node tables. Relationships are defined in the `IsSubClassOf` table.
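
As an illustration of the hierarchy, and not the loading code used in this notebook (which joins on columns of the WHO files), the parent of a code can be derived from the code itself. The helper names below are hypothetical.

```python
def parent_of(code: str) -> str:
    """Three-position category of a four-position code such as 'C88.9'."""
    return code[:3]


def in_group(category: str, first: str, last: str) -> bool:
    """True if a three-position category lies within a group range.

    Relies on plain string comparison, which works because categories
    share the same letter-plus-two-digits layout.
    """
    return first <= category <= last


print(parent_of("C88.9"))  # C88
print(in_group("C88", "C81", "C96"))  # True
```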

Whilst developing this demonstrator, we found differences between the various ICD-10 versions. The HE2AT study uses ICD-10-CM, which contains codes that are not present in the ICD-10 2019 version used here. One such example is [`I16`](https://icd10cmtool.cdc.gov/?fy=FY2024&query=i16).


In [34]:
icd_data = Path("./data/external/icd10-2019/")

ICD10Chapter = pl.read_csv(
    icd_data / "icd102019syst_chapters.txt",
    has_header=False,
    separator=";",
).rename({"column_1": "number", "column_2": "rubric"})

ICD10Group = pl.read_csv(
    icd_data / "icd102019syst_groups.txt", has_header=False, separator=";"
).with_columns(
    pl.concat_str([pl.col("column_1"), pl.col("column_2")], separator="-").alias(
        "code"
    ),
    pl.col("column_3").alias("chapter"),
    pl.col("column_4").alias("rubric"),
)

ICD10Code = pl.read_csv(
    icd_data / "icd102019syst_codes.txt",
    has_header=False,
    separator=";",
    infer_schema_length=10000,
)

Group_to_Chapter = ICD10Group.select("code", "chapter")

Category3_to_group = ICD10Group.join(
    ICD10Code.filter(pl.col("column_1") == 3), left_on="column_1", right_on="column_5"
).select(pl.col("column_7").alias("category3"), pl.col("code").alias("group"))

ICD10Group = ICD10Group.select("code", "rubric")

expr_category = (
    pl.col("column_7").alias("code"),
    pl.col("column_9").alias("rubric"),
)

ICD10Category3 = ICD10Code.filter(pl.col("column_1") == 3).select(expr_category)
ICD10Category4 = ICD10Code.filter(pl.col("column_1") == 4).select(expr_category)

Category4_to_3 = ICD10Category4.select(
    pl.col("code"), pl.col("code").str.head(3).alias("superclass")
)


In [35]:
for name in ["Chapter", "Group", "Category3", "Category4"]:
    conn.execute(
        f"DROP TABLE IF EXISTS ICD10{name};"
        + getattr(ddl.icd, name)
        + f"COPY ICD10{name} FROM ICD10{name};"
    )

conn.execute(
    "DROP TABLE IF EXISTS IsSubClassOf;"
    + ddl.icd.IsSubClassOf
    + "COPY IsSubClassOf_ICD10Category4_ICD10Category3 FROM Category4_to_3;"
    + "COPY IsSubClassOf_ICD10Category3_ICD10Group FROM Category3_to_group;"
    + "COPY IsSubClassOf_ICD10Group_ICD10Chapter FROM Group_to_Chapter;"
)


[<kuzu.query_result.QueryResult at 0x168f32310>,
 <kuzu.query_result.QueryResult at 0x168f484d0>,
 <kuzu.query_result.QueryResult at 0x168f4aa10>,
 <kuzu.query_result.QueryResult at 0x168f4a8d0>,
 <kuzu.query_result.QueryResult at 0x168f49950>]

### LOINC

The LOINC Table Core provides the essential content in a format that is stable over the long run. The core table contains all of the LOINC terms that are in the complete table (i.e., the same number of rows), but only a subset of the fields (i.e., a different number of columns). The `MapTo.csv` file is included so that users can update their mappings for deprecated terms without having to download a separate artifact. These tables are loaded as `LOINC` and `LOINC_deprecated` nodes.
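
Because a deprecated term can map to more than one replacement, a lookup built from `MapTo.csv` needs to keep a list per deprecated code. The sketch below uses only the standard library. The column names `MAP_TO` and `COMMENT` are assumed from the LOINC file layout, and the codes are made-up placeholders rather than real LOINC terms.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the assumed layout of MapTo.csv; not real LOINC codes.
SAMPLE = """LOINC,MAP_TO,COMMENT
OLD-1,NEW-1,
OLD-2,NEW-2,
OLD-2,NEW-3,choose by specimen
"""


def build_replacements(text: str) -> dict[str, list[str]]:
    """Map each deprecated code to all of its suggested replacements."""
    replacements = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        replacements[row["LOINC"]].append(row["MAP_TO"])
    return dict(replacements)


replacements = build_replacements(SAMPLE)
print(replacements["OLD-2"])  # ['NEW-2', 'NEW-3']
```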

The LOINC `ComponentHierarchyBySystem` provides the relationships within LOINC. It is a multiaxial hierarchy with [enriched linkages between LOINC terms and LOINC Parts](https://loinc.org/kb/enriched-linkages-between-loinc-terms-and-loinc-parts/).

TO DO: add the [LOINC Ontology](https://loincsnomed.org/) when it is available, with the 20,000 most used terms mapped to SNOMED.


In [36]:
loinc_data = Path("./data/external/LOINC_2.79/")
loinc_core = pl.read_csv(
    loinc_data / "LoincTableCore/LoincTableCore.csv",
    schema_overrides={"VersionLastChanged": pl.String},
).drop("METHOD_TYP", "EXTERNAL_COPYRIGHT_NOTICE")
loinc_core


LOINC_NUM,COMPONENT,PROPERTY,TIME_ASPCT,SYSTEM,SCALE_TYP,CLASS,CLASSTYPE,LONG_COMMON_NAME,SHORTNAME,STATUS,VersionFirstReleased,VersionLastChanged
str,str,str,str,str,str,str,i64,str,str,str,str,str
"""100000-9""","""Health informatics pioneer and…","""Hx""","""Pt""","""^Patient""","""Nar""","""H&P.HX""",2,"""Health informatics pioneer and…","""Health Info Pioneer+Father of …","""ACTIVE""","""2.74""","""2.74"""
"""100001-7""","""Health informatics pioneer and…","""Hx""","""Pt""","""^Patient""","""Nar""","""H&P.HX""",2,"""Health informatics pioneer and…","""Health Info Pioneer+Cofound LO…","""ACTIVE""","""2.74""","""2.74"""
"""100002-5""","""Specimen care is maintained""","""Find""","""Pt""","""^Patient""","""Ord""","""SURVEY.PNDS""",4,"""Specimen care is maintained""","""""","""ACTIVE""","""2.72""","""2.72"""
"""100003-3""","""Team communication is maintain…","""Find""","""Pt""","""^Patient""","""Ord""","""SURVEY.PNDS""",4,"""Team communication is maintain…","""""","""ACTIVE""","""2.72""","""2.72"""
"""100004-1""","""Demonstrates knowledge of the …","""Find""","""Pt""","""^Patient""","""Ord""","""SURVEY.PNDS""",4,"""Demonstrates knowledge of the …","""""","""ACTIVE""","""2.72""","""2.72"""
…,…,…,…,…,…,…,…,…,…,…,…,…
"""99994-6""","""Fluid, electrolyte, and acid-b…","""Find""","""Pt""","""^Patient""","""Ord""","""SURVEY.PNDS""",4,"""Fluid, electrolyte, and acid-b…","""""","""ACTIVE""","""2.72""","""2.72"""
"""99995-3""","""Respiratory status is maintain…","""Find""","""Pt""","""^Patient""","""Ord""","""SURVEY.PNDS""",4,"""Respiratory status is maintain…","""""","""ACTIVE""","""2.72""","""2.72"""
"""99996-1""","""Cardiovascular status is maint…","""Find""","""Pt""","""^Patient""","""Ord""","""SURVEY.PNDS""",4,"""Cardiovascular status is maint…","""""","""ACTIVE""","""2.72""","""2.72"""
"""99997-9""","""Demonstrates &or reports adequ…","""Find""","""Pt""","""^Patient""","""Ord""","""SURVEY.PNDS""",4,"""Demonstrates AndOr reports ade…","""""","""ACTIVE""","""2.72""","""2.72"""


In [37]:
loinc_mapto = pl.read_csv(
    loinc_data / "LoincTableCore/MapTo.csv",
)

# deprecated codes can have multiple mappings!
loinc_deprecated = loinc_mapto.select("LOINC").unique()

conn.execute(
    "DROP TABLE IF EXISTS LOINC;"
    + ddl.loinc.LOINC
    + "COPY LOINC FROM loinc_core;"
    + "DROP TABLE IF EXISTS LOINC_deprecated;"
    + ddl.loinc.LOINC_deprecated
    + "COPY LOINC_deprecated FROM loinc_deprecated;"
    + "DROP TABLE IF EXISTS MapTo;"
    + ddl.loinc.LOINC_mapto
    + "COPY MapTo FROM loinc_mapto"
)

[<kuzu.query_result.QueryResult at 0x168f47390>,
 <kuzu.query_result.QueryResult at 0x168f57210>,
 <kuzu.query_result.QueryResult at 0x168f571d0>,
 <kuzu.query_result.QueryResult at 0x168f574d0>,
 <kuzu.query_result.QueryResult at 0x168f576d0>,
 <kuzu.query_result.QueryResult at 0x168f57750>,
 <kuzu.query_result.QueryResult at 0x168f57810>,
 <kuzu.query_result.QueryResult at 0x168f57890>,
 <kuzu.query_result.QueryResult at 0x168f57950>]

In [38]:
# TO DO: load LOINC parts and system hierarchy

### WHO ANC Profile

- Note we are using value sets with mapping to SNOMED IPS
- Also constraints are not relevant (too detailed for Momcare)
- We do use Measures (downstream)
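
The mappings are published as FHIR ConceptMap resources, which nest each target code inside an element, which in turn sits inside a group. As a plain-Python sketch of the flattening that the Polars code below performs (the function name is hypothetical, and the example resource is trimmed down to the fields used here), one row is produced per source and target pair:

```python
def flatten_conceptmap(conceptmap: dict) -> list[dict]:
    """Flatten group -> element -> target into one row per mapping."""
    rows = []
    for group in conceptmap.get("group", []):
        for element in group.get("element", []):
            for target in element.get("target", []):
                rows.append(
                    {
                        "who_anc_code": element.get("code"),
                        "who_anc_display": element.get("display"),
                        "code": target.get("code"),
                        "display": target.get("display"),
                        "equivalence": target.get("equivalence"),
                    }
                )
    return rows


# Trimmed example built from one mapping that appears in the outputs below
example = {
    "resourceType": "ConceptMap",
    "group": [
        {
            "element": [
                {
                    "code": "ANC.B9.DE111",
                    "display": "Syphilis positive",
                    "target": [
                        {
                            "code": "A53.9",
                            "display": "Syphilis, unspecified",
                            "equivalence": "equivalent",
                        }
                    ],
                }
            ]
        }
    ],
}

rows = flatten_conceptmap(example)
print(rows[0]["who_anc_code"], "->", rows[0]["code"])  # ANC.B9.DE111 -> A53.9
```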


In [39]:
systems = ["ICD-10", "ICD-11", "ICF", "ICHI", "LOINC", "SNOMED-CT"]


def parse_conceptmap(system: str) -> pl.DataFrame:
    """Generate a flattened mapping table from a WHO ANC ConceptMap."""

    if system not in systems:
        raise ValueError(f"Unknown code system: {system!r}")

    with fsspec.open(
        f"https://build.fhir.org/ig/dhes/smart-anc/ConceptMap-{system}.json"
    ) as f:
        df = pl.read_json(f)

    unnest_group = pl.col("group").list.explode().struct.unnest()
    unnest_element = pl.col("element").list.explode().struct.unnest().list.explode()

    return (
        df.select(unnest_group)
        .select(unnest_element)
        .select(
            pl.col(pl.String).name.prefix("who_anc_"),
            pl.lit(system.replace("-", "")).alias("target"),
            pl.col("target").struct.unnest(),
        )
    )


# there are errors in WHO ANC codes for SNOMED-CT mapping. These need fixing
# see who-anc-sct-errors.py
errors = {
    "1.56399E+16": 15639921000119107,
    "1.22475E+16": 12247531000119106,
    "1.07437E+16": 10743651000119105,
    "1.07612E+16": 10761341000119105,
    "4.41041E+14": 441041000124100,
}

who = pl.concat(
    [parse_conceptmap(system) for system in systems], how="diagonal"
).with_columns(pl.col("code").replace(errors))
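
The keys of the `errors` dictionary are SNOMED CT identifiers written in scientific notation, presumably because the source passed through a spreadsheet (an assumption; the cause is not stated). The sketch below shows one way to flag such values with a regular expression. It is not the `who-anc-sct-errors.py` script, and the names are hypothetical. Note that the original identifier cannot be recovered from the mangled string, since the trailing digits are lost, so the corrections still have to be looked up by hand.

```python
import re

# One leading digit, optional decimals, then an exponent such as E+16
SCI_NOTATION = re.compile(r"^\d(\.\d+)?E\+\d+$", re.IGNORECASE)


def looks_mangled(code: str) -> bool:
    """True if a code looks like a number in scientific notation."""
    return bool(SCI_NOTATION.match(code))


codes = ["1.56399E+16", "84229001", "4.41041E+14", "8517006"]
print([c for c in codes if looks_mangled(c)])
# ['1.56399E+16', '4.41041E+14']
```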

In [40]:
# 735 unique WHO ANC codes
who.select(pl.col("who_anc_code").n_unique())

who_anc_code
u32
735


In [41]:
# Coverage varies widely, SNOMED is the most complete
# Multiple WHO ANC codes can map to the same target code
who.group_by("target").agg(pl.n_unique("who_anc_code", "code")).sort(
    "who_anc_code", descending=True
)

target,who_anc_code,code
str,u32,u32
"""SNOMEDCT""",725,429
"""ICD11""",550,225
"""ICD10""",532,188
"""LOINC""",385,145
"""ICHI""",163,42
"""ICF""",100,32


In [42]:
# The same SNOMED or ICD10 concept can map to multiple WHO ANC codes!
many_to_one = (
    (
        who.group_by("target", "code")
        .agg(pl.count("who_anc_code").alias("count_"))
        .filter(pl.col("count_") > 1)
    )
    .join(who, on=["target", "code"])
    .sort(["target", "code"])
)
many_to_one

target,code,count_,who_anc_code,who_anc_display,equivalence,display
str,str,u32,str,str,str,str
"""ICD10""","""A53.9""",2,"""ANC.B9.DE111""","""Syphilis positive""","""equivalent""","""Syphilis, unspecified"""
"""ICD10""","""A53.9""",2,"""ANC.B9.DE108""","""Syphilis positive""","""equivalent""","""Syphilis, unspecified"""
"""ICD10""","""B18.1""",2,"""ANC.B9.DE72""","""Hepatitis B positive""","""equivalent""","""Chronic viral hepatitis B with…"
"""ICD10""","""B18.1""",2,"""ANC.B9.DE75""","""Hepatitis B positive""","""equivalent""","""Chronic viral hepatitis B with…"
"""ICD10""","""B18.2""",2,"""ANC.B9.DE93""","""Hepatitis C positive""","""equivalent""","""Chronic viral hepatitis C"""
…,…,…,…,…,…,…
"""SNOMEDCT""","""84229001""",3,"""ANC.B7.DE53""","""Gets tired easily""","""equivalent""","""Fatigue (finding)"""
"""SNOMEDCT""","""8517006""",2,"""ANC.B6.DE154""","""Recently quit tobacco products""","""equivalent""","""Ex-smoker (finding)"""
"""SNOMEDCT""","""8517006""",2,"""ANC.B7.DE12""","""Recently quit tobacco products""","""equivalent""","""Ex-smoker (finding)"""
"""SNOMEDCT""","""91175000""",2,"""ANC.B6.DE41""","""Convulsions""","""equivalent""","""Seizure (finding)"""


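The same many-to-one check can be sketched without Polars by grouping the mappings per target code. The function name is hypothetical, and unlike the `pl.count` above it counts distinct WHO ANC codes rather than rows. The input is a short excerpt of the rows shown in the output, so a code that has only one row in this excerpt is dropped here.

```python
from collections import defaultdict


def find_many_to_one(mappings):
    """Return target codes that more than one WHO ANC code maps to.

    `mappings` is an iterable of (target, code, who_anc_code) tuples.
    """
    sources = defaultdict(set)
    for target, code, who_anc_code in mappings:
        sources[(target, code)].add(who_anc_code)
    return {key: sorted(value) for key, value in sources.items() if len(value) > 1}


# Excerpt of the rows shown above
excerpt = [
    ("ICD10", "A53.9", "ANC.B9.DE111"),
    ("ICD10", "A53.9", "ANC.B9.DE108"),
    ("ICD10", "B18.2", "ANC.B9.DE93"),
]
print(find_many_to_one(excerpt))
# {('ICD10', 'A53.9'): ['ANC.B9.DE108', 'ANC.B9.DE111']}
```
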
In [43]:
# SNOMED CT, LOINC and ICF mappings have two types of equivalence relationships
many_to_one.group_by("target").agg(pl.col("equivalence").value_counts())

target,equivalence
str,list[struct[2]]
"""ICD11""","[{""equivalent"",428}]"
"""LOINC""","[{""relatedto"",95}, {""equivalent"",220}]"
"""ICF""","[{""relatedto"",13}, {""equivalent"",78}]"
"""SNOMEDCT""","[{""relatedto"",28}, {""equivalent"",390}]"
"""ICD10""","[{""equivalent"",438}]"
"""ICHI""","[{""equivalent"",141}]"


In [44]:
# Unique WHO ANC Code
who_unique = who.select(
    pl.col("who_anc_code").alias("code"),
    pl.col("who_anc_display").alias("rubric"),
).unique()

# WHO ANC mappings to SNOMED CT; the 70 mappings not in SNOMED IPS are excluded further below
who_sct = many_to_one.filter((pl.col("target") == "SNOMEDCT")).with_columns(
    pl.col("code").cast(pl.Int64)
)


In [45]:
# SCT codes in WHO ANC that are not in SNOMED IPS
# For example 720407008|Mother victim of domestic violence
sct_not_in_ips = (
    who_sct.join(Concept, left_on="code", right_on="id", how="left")
    .filter(pl.col("active").is_null())
    .select("code")
    .unique()
)

sct_not_in_ips

code
i64
4484000
733460004
165780000
71994000
386359008
…
47758006
12803000
165332000
736687002


In [46]:
who_icd10 = many_to_one.filter(pl.col("target") == "ICD10").select(
    "who_anc_code", "code"
)

# mappings to 3- and 4-position ICD10 codes
# four-position codes include the dot (e.g. "A53.9"), so they are 5 characters long
who_icd10_3 = who_icd10.filter(pl.col("code").str.len_chars() < 4)
who_icd10_4 = who_icd10.filter(pl.col("code").str.len_chars() == 5)

In [47]:
# SNOMED CT codes that are missing from SNOMED IPS, as a plain list for `is_in`
missing_codes = sct_not_in_ips["code"].to_list()

who_sct_equivalent = who_sct.filter(
    (pl.col("equivalence") == "equivalent") & (~pl.col("code").is_in(missing_codes))
).select("who_anc_code", "code")

who_sct_related = who_sct.filter(
    (pl.col("equivalence") == "relatedto") & (~pl.col("code").is_in(missing_codes))
).select("who_anc_code", "code")

In [48]:
# load WhoAncCode
conn.execute(
    "DROP TABLE IF EXISTS WhoAncCode;"
    + ddl.who_anc.WhoAncCode
    + "COPY WhoAncCode FROM who_unique;"
)

[<kuzu.query_result.QueryResult at 0x107ac9f10>,
 <kuzu.query_result.QueryResult at 0x168f33b90>,
 <kuzu.query_result.QueryResult at 0x168f33c50>]

In [49]:
conn.execute(
    "DROP TABLE IF EXISTS EquivalentTo;"
    + ddl.who_anc.EquivalentTo
    + "COPY EquivalentTo_WhoAncCode_SCT FROM who_sct_equivalent;"
    + "COPY EquivalentTo_WhoAncCode_ICD10Category3 FROM who_icd10_3;"
    + "COPY EquivalentTo_WhoAncCode_ICD10Category4 FROM who_icd10_4;"
)


[<kuzu.query_result.QueryResult at 0x168f45250>,
 <kuzu.query_result.QueryResult at 0x168f47510>,
 <kuzu.query_result.QueryResult at 0x168f45f10>,
 <kuzu.query_result.QueryResult at 0x168f45f50>,
 <kuzu.query_result.QueryResult at 0x168f46e10>]

In [50]:
conn.execute(
    "DROP TABLE IF EXISTS RelatedTo;"
    + ddl.who_anc.RelatedTo
    + "COPY RelatedTo FROM who_sct_related;"
)


[<kuzu.query_result.QueryResult at 0x168f45d10>,
 <kuzu.query_result.QueryResult at 0x168f47550>,
 <kuzu.query_result.QueryResult at 0x168f44f90>]

### HE2AT

- Mapping to ICD10CM
- Mappings are very different, for example:
  - a whole chapter, e.g. `congenital_abn` allows all codes in chapter 17
  - Level four or five in the ICD10 tree

The original mapping data is verbose, listing all the codes and subcodes, which makes it error-prone. We demonstrate a simpler logic:

- Mapping to relevant granularity, and using `IsSubClassOf` relationship we automatically include all underlying codes
- Explicitly excluding certain parts in a branch of the ICD10 tree with Cypher queries.

We demonstrate this using [./data/external/HE2AT/he2at-to-icd10.csv](./data/external/HE2AT/he2at-to-icd10.csv).
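
Until the Cypher version is written, the include-then-exclude logic can be sketched in plain Python over a toy set of child-to-parent edges in the spirit of `IsSubClassOf`. The edge list is a tiny excerpt for illustration, not the full ICD-10 tree, and the function names are hypothetical. In Kùzu the same idea would presumably be expressed with a variable-length `IsSubClassOf` pattern.

```python
# Toy child -> parent edges; a small excerpt, not the full ICD-10 tree.
IS_SUBCLASS_OF = {
    "C88.0": "C88",
    "C88.9": "C88",
    "C88": "C81-C96",
    "C81": "C81-C96",
}


def descendants(code: str, edges: dict = IS_SUBCLASS_OF) -> set:
    """All codes below `code`, found by walking the edges recursively."""
    children = [child for child, parent in edges.items() if parent == code]
    found = set(children)
    for child in children:
        found |= descendants(child, edges)
    return found


def expand(include: str, exclude=()) -> set:
    """Map at a coarse level, then carve out the excluded branches."""
    selected = {include} | descendants(include)
    for code in exclude:
        selected -= {code} | descendants(code)
    return selected


print(sorted(expand("C81-C96", exclude=["C88"])))
# ['C81', 'C81-C96']
```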


In [51]:
# TO DO: add HE2AT node with concepts from CSV
# Write Cypher query to show how you can make it more expressive