# Mapping Journey: ICD-11 to CIEL Concepts

## Introduction

`TL;DR Tip`: Just [run the dev container in VS Code](https://docs.github.com/en/codespaces/setting-up-your-project-for-codespaces/adding-a-dev-container-configuration/introduction-to-dev-containers#rebuilding-the-dev-container-in-the-vs-code-web-client-or-desktop-application) and jump to **1.2 Testing the Tools**.

Completing the mappings of diagnoses and clinical findings to SNOMED CT and ICD-10 was relatively easy. Andrew was able to leverage some professional connections to obtain curated data, which significantly improved the accuracy of mappings between CIEL concepts and these terminologies. In return, we committed to delivering mappings to ICD-11.

Since ICD-11 is relatively recent — not necessarily in its creation, but in its implementation — and is much more complex than ICD-10 due to its use of extension codes, pre-coordinated, and post-coordinated concepts, the mapping posed additional challenges. It is important to note that CIEL is based exclusively on pre-coordinated concepts. Thus, we began the ambitious task of mapping the remaining **27,796 CIEL diagnosis concepts** to ICD-11.

Previously, manual mapping efforts were extremely time-consuming. Before the initiation of this project, it could take up to **5 minutes** to manually map 3 CIEL concepts using a labor-intensive approach of copying, pasting, and searching for references across multiple tabs. Based on this method, a skilled individual could theoretically map the remaining concepts in approximately **193 days** (considering 1.67 minutes per mapping and 4 working hours per day).
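The 193-day figure can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative.

```python
# Back-of-the-envelope estimate of the manual mapping effort,
# using only the figures quoted in the text above.
REMAINING_CONCEPTS = 27_796
MINUTES_PER_MAPPING = 5 / 3      # 5 minutes for 3 concepts, about 1.67 minutes each
WORKING_HOURS_PER_DAY = 4

total_hours = REMAINING_CONCEPTS * MINUTES_PER_MAPPING / 60
working_days = total_hours / WORKING_HOURS_PER_DAY

print(f"{total_hours:.0f} hours, about {working_days:.0f} working days")
```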

## Phase 1: Leveraging Existing Resources and Developing Support Tools

At the time, the [Open Concept Lab (OCL)](https://app.openconceptlab.org/) community was running a parallel effort to develop a mapping support tool, known as the **Mapping Tool**. This tool aimed to assist the mapping process across various terminologies, combining large language models (LLMs) and Elasticsearch to suggest likely target concepts for a given source concept. Although the Mapping Tool was not production-grade, we were able to take advantage of its development to create an initial triage table with approximately **1,693 ICD-11 code suggestions**.

Since the Mapping Tool provided only **code suggestions** — and not the **mapping relationship type** (such as *NARROWER-THAN* or *BROADER-THAN*) — it operated under the assumption of *SAME-AS*, which was often not accurate. Therefore, it was necessary to curate these data points further.

We imported these suggestions into a web service that was already deployed to support CIEL improvement processes, known as **CIEL Lab**.

For the ICD-11 mapping, the **CIEL Lab** became a large working table where each row represented a CIEL concept. Concepts were either sourced from the Mapping Tool or added manually. The table allowed users to correct a suggested code, provide a missing code, and define the mapping type as *SAME-AS*, *NARROWER-THAN*, or *BROADER-THAN*.

| Concept ID | FSN | ICD-10 Code | ICD-10 Name | ICD-11 Code | ICD-11 Name | Map Type | Actions |
|------------|-----|-------------|-------------|-------------|-------------|----------|------|
| Integer ID (clickable link to OCL) | Fully Specified Name of CIEL Concept | Mapped WHO-ICD-10 code | ICD-10 FSN captured from secondary lookup table | Input field for suggested ICD-11 code (auto-filled if available) | Captured via ICD-11 API | Radio buttons to select mapping type: SAME-AS, NARROWER-THAN, BROADER-THAN | Final action buttons:<br>- **Send to Review** (stores in review table for later update to CIEL database)<br>- **Send to Manual Queue** (for concepts requiring two ICD-11 codes; removed from pending list and added to manual queue) |
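
As a rough illustration, one row of this working table could be modeled as below. This is a sketch, not the actual CIEL Lab schema: the class and field names are hypothetical, the concept ID is a placeholder, and the rule that a row needs both a code and a map type before review is an assumption. Only the three mapping types and the sample codes for Carrion's disease come from this document.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MapType(Enum):
    """Mapping relationship types selectable in the working table."""
    SAME_AS = "SAME-AS"
    NARROWER_THAN = "NARROWER-THAN"
    BROADER_THAN = "BROADER-THAN"


@dataclass
class MappingRow:
    """Hypothetical shape of one row; field names are illustrative."""
    concept_id: int
    fsn: str
    icd10_code: Optional[str] = None
    icd11_code: Optional[str] = None
    map_type: Optional[MapType] = None

    def ready_for_review(self) -> bool:
        # Assumed rule: a row needs both an ICD-11 code and a map type.
        return bool(self.icd11_code) and self.map_type is not None


row = MappingRow(concept_id=1234, fsn="Carrion's disease", icd10_code="A44.0")
print(row.ready_for_review())        # False: no ICD-11 code or map type yet

row.icd11_code = "1C11.0"
row.map_type = MapType("SAME-AS")    # look up the enum member by its label
print(row.ready_for_review())        # True
```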

Each ICD-11 code field is supported by three auxiliary buttons:

- **Run Linearization Search**: calls the ICD API `/icd/release/11/{releaseId}/{linearizationname}/search` endpoint
- **Run Autocode**: calls the ICD API `/icd/entity/autocode` endpoint
- **Run Cross Reference Mapping**: looks up the WHO ICD-10 to ICD-11 cross-reference mapping tables

Each button opens a modal with analysis options, and the best match found in the search can be automatically populated into the input field.


### 1.1 Developing the Environment for CIEL Lab Support Tools

#### 1.1.1 Installing the ICD API Using Docker

To support the **Run Linearization Search** and **Run Autocode** functionalities, it was necessary to install the **ICD-11 API** locally using Docker. The setup followed the official WHO instructions available at [ICD API Docker Container Documentation](https://icd.who.int/docs/icd-api/ICDAPI-DockerContainer/).

```sh
docker run -d \
  --name icd-api \
  -p 8887:80 \
  -e acceptLicense=true \
  -e saveAnalytics=true \
  whoicd/icd-api
```

In our setup, an **Nginx proxy** was configured to expose the API under a DNS name. For reproducibility and ease of use, we also provide a VS Code dev container configuration that deploys everything required to run the sandbox environment locally.

To use it, open the project in VS Code, press `Cmd + Shift + P` (`Ctrl + Shift + P` on Windows and Linux), then search for and select `Open Workspace in Container`.

With the local ICD API available, our tools could query ICD-11 services for term searches and auto-coding suggestions without the authentication that the hosted WHO API requires and without its speed and request limits.
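
A small helper can keep endpoint URLs consistent across tools. The sketch below is illustrative: the function names are hypothetical, and `http://localhost:8887` assumes the port mapping from the `docker run` command above with the client running on the Docker host. The release (`2025-01`) and linearization (`mms`) values mirror the ones used elsewhere in this document.

```python
def build_search_url(base_url: str, release_id: str, linearization: str) -> str:
    """Build the linearization search URL for a given ICD API base URL."""
    return f"{base_url.rstrip('/')}/icd/release/11/{release_id}/{linearization}/search"


def build_autocode_url(base_url: str) -> str:
    """Build the autocode URL for a given ICD API base URL."""
    return f"{base_url.rstrip('/')}/icd/entity/autocode"


# Assumed base URL: the container publishes its port 80 on host port 8887.
LOCAL_BASE = "http://localhost:8887"

print(build_search_url(LOCAL_BASE, "2025-01", "mms"))
print(build_autocode_url(LOCAL_BASE))
```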

#### 1.1.2 Creating Supporting Tables for ICD-10 to ICD-11 Cross Reference

To enable the **Run Cross Reference Mapping** functionality, we needed access to cross-reference data between ICD-10 and ICD-11.

We downloaded the mapping archive from the [official WHO site](https://icd.who.int/browse/2025-01/mms/en) under `Info > ICD-10 / ICD-11 Mapping Tables`, or directly from this link: [Download Mapping Tables (ZIP)](https://icdcdn.who.int/static/releasefiles/2025-01/mapping.zip).

The following commands download and extract the archive, then run a shell script that imports the mapping data into a MySQL database:

```sh
cd icd11_mapping_tool
wget https://icdcdn.who.int/static/releasefiles/2025-01/mapping.zip
unzip mapping.zip -d mapping
sudo chmod +x seed_icd_cross_reference_tables.sh
./seed_icd_cross_reference_tables.sh -u root -proot -d sandbox -h db-sandbox
```

##### Key Table for Cross-Reference Mapping

The most relevant table for searches is `icd11_10To11MapToMultipleCategories` as it contains the mappings of ICD-10 codes to multiple corresponding ICD-11 codes.

##### Additional View for ICD-10 Codes

To assist with future joins, especially when the ICD-10 FSN (Fully Specified Name) is needed, we created a simple view:

```sql
CREATE OR REPLACE VIEW `vw_icd10_codes` AS
select
    distinct `cr`.`icd10Code` AS `code`,
    `cr`.`icd10Title` AS `name`
from
    `icd11_10To11MapToMultipleCategories` `cr`;
```

### 1.2 Testing the Tools

If you've executed the commands above or simply launched the devcontainer in VS Code, your environment is already prepared. Let's install the necessary packages:

```sh
pip install requests pymysql sqlalchemy python-dotenv tqdm
```

Create the required environment variables in Python:

```python
import os
from dotenv import load_dotenv

# Load overrides from a local .env file, if one exists.
load_dotenv()

DB_USER = os.getenv("DB_USER", "root")
DB_PASS = os.getenv("DB_PASS", "root")
DB_HOST = os.getenv("DB_HOST", "db-sandbox")
DB_PORT = int(os.getenv("DB_PORT", "3306"))  # always an int, whether set or defaulted
DB_NAME = os.getenv("DB_NAME", "sandbox")
```

```python
# Sample CIEL concept used in the tests below.
CIEL_FSN = "Carrion's disease"
CIEL_ICD10 = "A44.0"
```

Establish the SQLAlchemy connection and perform a simple query:

```python
from sqlalchemy import create_engine, text

DATABASE_URL = f"mysql+pymysql://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
engine = create_engine(DATABASE_URL)

# Run a trivial query to confirm the database is reachable.
with engine.connect() as conn:
    result = conn.execute(text("SELECT 1")).scalar()
    print("Connection Test:", result)
```

```text
Connection Test: 1
```


#### 1.2.1 Linearization Search

This method has proven most effective so far. It leverages the ICD API to find matches between CIEL_FSN and ICD-11 entities:

```python
import requests

# Search endpoint of the local ICD API (release 2025-01, MMS linearization).
ICD_API_URL = "http://icdapi/icd/release/11/2025-01/mms/search"

params = {
    "q": CIEL_FSN,
    "subtreeFilterUsesFoundationDescendants": False,
    "includeKeywordResult": False,
    "useFlexisearch": False,
    "flatResults": True,
    "highlightingEnabled": True,
    "medicalCodingMode": True
}

headers = {
    "accept": "application/json",
    "API-Version": "v2",
    "Accept-Language": "en"
}

response = requests.get(ICD_API_URL, params=params, headers=headers, timeout=30)
response.raise_for_status()  # fail early on HTTP errors
data = response.json()

if data.get("destinationEntities"):
    # Rank the returned entities by score and keep the best one.
    results = sorted(data["destinationEntities"], key=lambda x: x["score"], reverse=True)
    best = results[0]
    print(f"✅ Matched concept: {best['theCode']} - {best.get('title', '')}")
else:
    print("⚠️ No match found.")
```

```text
✅ Matched concept: 1C11.0 - <em class='found'>Carrion</em> <em class='found'>disease</em>
```
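
Because `highlightingEnabled` is set, the returned title carries `<em class='found'>` markup. If a plain title is needed, for example to store it in a table, the tags can be stripped. A minimal sketch with a hypothetical helper name:

```python
import html
import re

_TAG_RE = re.compile(r"<[^>]+>")


def strip_highlight(title: str) -> str:
    """Remove highlight tags such as <em class='found'> and unescape HTML entities."""
    return html.unescape(_TAG_RE.sub("", title)).strip()


raw = "<em class='found'>Carrion</em> <em class='found'>disease</em>"
print(strip_highlight(raw))  # Carrion disease
```

Alternatively, setting `highlightingEnabled` to `False` in the request should avoid the markup at the source.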


#### 1.2.2 Autocode

A simplified approach has been designed to generate mapping suggestions, particularly effective when the `matchScore` is exactly **1**. In these cases, the system identifies a direct and highly reliable match between the source term and the ICD-11 entity without ambiguity.

Although the autocode method may not capture the full complexity of some clinical concepts, it can still provide valuable preliminary suggestions. These initial suggestions can be particularly useful for assembling the first batches of mappings that will undergo manual review and curation by subject matter experts, such as Andrew.

```python
from pprint import pprint

autocode_url = "http://icdapi/icd/entity/autocode"
autocode_params = {"searchText": CIEL_FSN}

# Ask the API for the single best-matching foundation entity.
autocode_resp = requests.get(autocode_url, params=autocode_params, headers=headers, timeout=30).json()
pprint(autocode_resp)

# The entity ID is the last path segment of the foundation URI.
entity_id = autocode_resp["foundationURI"].split("/")[-1]

# Resolve the entity within the MMS linearization to obtain its code and title.
entity_resp = requests.get(f"http://icdapi/icd/release/11/2025-01/mms/{entity_id}", headers=headers, timeout=30).json()
code = entity_resp.get("code") or entity_resp.get("codeRange", "[No code found]").split("-")[-1]
title = entity_resp.get("title", {}).get("@value", "[No name found]")

print(f"✅ Autocode result: {code} - {title}")
```

```text
{'foundationURI': 'http://id.who.int/icd/entity/1917297026',
 'matchLevel': 0,
 'matchScore': 1,
 'matchType': 0,
 'matchingText': 'Carrion disease',
 'searchText': "Carrion's disease"}
✅ Autocode result: 1C11.0 - Carrion disease
```
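
The acceptance rule described above (trust the suggestion when `matchScore` is exactly 1) can be expressed as a small predicate. This is a sketch: the function name and the choice to treat a missing `foundationURI` as "no match" are assumptions, not documented API behavior. The sample dictionary is the response printed above.

```python
from typing import Optional


def confident_entity_id(autocode_resp: dict) -> Optional[str]:
    """Return the foundation entity ID when matchScore is exactly 1, else None."""
    uri = autocode_resp.get("foundationURI")
    if not uri or autocode_resp.get("matchScore") != 1:
        return None
    return uri.rsplit("/", 1)[-1]


sample = {
    "foundationURI": "http://id.who.int/icd/entity/1917297026",
    "matchLevel": 0,
    "matchScore": 1,
    "matchType": 0,
    "matchingText": "Carrion disease",
    "searchText": "Carrion's disease",
}
print(confident_entity_id(sample))                         # 1917297026
print(confident_entity_id({**sample, "matchScore": 0.8}))  # None
```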


#### 1.2.3 WHO ICD-10 Cross Reference

Using WHO-provided tables to cross-reference ICD-10 and ICD-11:

```python
def icd10_to_icd11_crossref(icd10_code, db_engine):
    """Return the first ICD-11 row mapped to an ICD-10 code, or raise ValueError."""
    query = text("""
        SELECT icd11Code, icd11Title
        FROM icd11_10To11MapToOneCategory
        WHERE icd10Code = :icd10_code
        LIMIT 1
    """)
    with db_engine.connect() as conn:
        result = conn.execute(query, {"icd10_code": icd10_code}).fetchone()
    if result:
        return result
    else:
        raise ValueError("No cross-reference found.")

cross_ref = icd10_to_icd11_crossref(CIEL_ICD10, engine)
print(f"🔁 ICD-10 → ICD-11: {cross_ref.icd11Code} - {cross_ref.icd11Title}")
```

```text
🔁 ICD-10 → ICD-11: 1C11.00 - Oroya fever
```
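
The function above returns a single row. When an ICD-10 code maps to several ICD-11 codes, it can be useful to list every candidate instead. The sketch below uses an in-memory SQLite table as a stand-in for the MySQL table so that it runs anywhere. The table name `crossref` and the second sample row are placeholders, and it assumes the multiple-categories table exposes the same `icd10Code`, `icd11Code`, and `icd11Title` columns.

```python
import sqlite3

# In-memory stand-in for the cross-reference table (assumption: same column names).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crossref (icd10Code TEXT, icd11Code TEXT, icd11Title TEXT)")
conn.executemany(
    "INSERT INTO crossref VALUES (?, ?, ?)",
    [
        ("A44.0", "1C11.00", "Oroya fever"),            # row shown in the output above
        ("A44.0", "PLACEHOLDER", "Placeholder title"),  # made-up row for illustration
    ],
)


def all_crossrefs(icd10_code: str, db: sqlite3.Connection) -> list:
    """Return every (icd11Code, icd11Title) pair mapped to an ICD-10 code."""
    cur = db.execute(
        "SELECT icd11Code, icd11Title FROM crossref "
        "WHERE icd10Code = ? ORDER BY icd11Code",
        (icd10_code,),
    )
    return cur.fetchall()


for code, title in all_crossrefs("A44.0", conn):
    print(code, "-", title)
```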


#### 1.2.4 Quality Verification

Ensure that each ICD-11 code is either terminal (a leaf) or carries an extension (a part beginning with `&X`):

```python
import re
from tqdm import tqdm

errors = []
# Codes carrying an extension part ("&X...") are exempt from the terminal check.
extension_pattern = re.compile(r'&X[A-Z0-9]+')

with engine.connect() as conn:
    rows = conn.execute(text("""
        SELECT id, concept_id, icd11_code
        FROM analytics.lab_review_icd11_mapping
        WHERE review_status = 'PENDING'
    """)).mappings().all()

base_url = "https://icd.filipelopes.med.br/icd/release/11/2025-01/mms"

for row in tqdm(rows):
    code = row['icd11_code']
    if not code:
        errors.append(f"Row {row['id']}: Missing ICD-11 code.")
        continue
    if extension_pattern.search(code):
        continue

    # Resolve the code to its stem entity.
    codeinfo_resp = requests.get(f"{base_url}/codeinfo/{code}", headers=headers, timeout=30)
    if codeinfo_resp.status_code != 200:
        errors.append(f"Code {code}: API error {codeinfo_resp.status_code}")
        continue

    stem_id = codeinfo_resp.json().get("stemId", "").split("/mms/")[-1]
    linear_resp = requests.get(f"{base_url}/{stem_id}", headers=headers, timeout=30).json()

    # A terminal (leaf) entity lists no children.
    if linear_resp.get("child"):
        errors.append(f"Code {code}: Not terminal (has children).")

if errors:
    print(f"QA completed with {len(errors)} errors:")
    for e in errors:
        print(f"• {e}")
else:
    print("QA completed. No errors.")
```
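
The two checks in the loop can also be factored into pure functions, which makes them testable without network access. A sketch with hypothetical function names; the sample entity dictionaries are trimmed, made-up payloads that only mimic the `child` field used above, and `X0000` is a placeholder extension code.

```python
import re
from typing import Optional

EXTENSION_RE = re.compile(r"&X[A-Z0-9]+")


def has_extension(code: str) -> bool:
    """True when the code contains at least one '&X...' extension part."""
    return bool(EXTENSION_RE.search(code))


def is_terminal(entity: dict) -> bool:
    """True when the linearization entity lists no children."""
    return not entity.get("child")


def qa_issue(code: str, entity: dict) -> Optional[str]:
    """Return an error message for a failing code, or None when it passes."""
    if has_extension(code) or is_terminal(entity):
        return None
    return f"Code {code}: Not terminal (has children)."


with_children = {"child": ["placeholder-child-uri"]}  # made-up payload
print(qa_issue("1C11.0", with_children))        # reports a non-terminal code
print(qa_issue("1C11.0&X0000", with_children))  # None: extension present
print(qa_issue("1C11.00", {}))                  # None: no children listed
```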

## Phase 2: Semantic Approach

Because generic Elasticsearch queries yielded insufficient results, we are considering a semantic vectorization approach built on structured ICD data. This process might later be integrated into OCL as a specialized plugin (see [OCL API match](https://docs.openconceptlab.org/en/latest/oclapi/apireference/match.html)).

Semantic vectorization involves converting clinical terms, descriptions, and synonyms into dense numerical vectors that capture their underlying meaning and context, rather than relying solely on keyword-based similarity.
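
To make the idea concrete, similarity between such vectors is commonly measured with cosine similarity. The sketch below uses tiny hand-written vectors in place of real embeddings; the numbers are invented purely for illustration and do not come from any model.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (invented values, not model output).
toy_vectors = {
    "Carrion's disease": [0.9, 0.1, 0.2],
    "Carrion disease": [0.88, 0.12, 0.21],
    "Type 2 Diabetes Mellitus": [0.1, 0.9, 0.3],
}

query_term = "Carrion's disease"
query = toy_vectors[query_term]

# Rank every other term by similarity to the query, best first.
ranked = sorted(
    ((name, cosine_similarity(query, vec))
     for name, vec in toy_vectors.items() if name != query_term),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{score:.3f}  {name}")
```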

This process will leverage **ChromaDB**, a vector database optimized for high-performance similarity search. ChromaDB will store embeddings generated from multiple structured datasets, enabling efficient retrieval of conceptually similar terms across different coding systems. The main datasets to be vectorized are:

| Dataset                    | Vectorized Documents                             | Metadata                                                    |
|-----------------------------|--------------------------------------------------|-------------------------------------------------------------|
| **ciel_v20240726**, **ciel_v20250317** | concept.name, concept.description            | concept_id, concept_class, datatype, locale, retired, type ["fully specified", "synonym", "description"] |
| **icd10_102019**            | fsn, synonym                                     | code, name_type ["fsn", "synonym"]                          |
| **icd11_merge**             | fsn, synonyms, index_terms                       | code, name_type ["fsn", "synonym", "index"]                 |
| **icd11_linear_extensions** | extension code (leaf), title, hierarchical parents | extension metadata (to be defined)                          |
| **icd11_flat_hierarchy**    | concepts, code, related extension possibilities  | hierarchy metadata (to be defined)                          |
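
As an illustration of the first row of the table, the sketch below turns one CIEL concept into parallel lists of IDs, documents, and metadata. The function name, the ID scheme, and the sample values (concept ID, class, datatype, synonym) are placeholders; only the metadata keys follow the table.

```python
def ciel_records(concept_id: int, names: list, concept_class: str,
                 datatype: str, locale: str = "en", retired: bool = False):
    """Build parallel (ids, documents, metadatas) lists for one CIEL concept."""
    ids, documents, metadatas = [], [], []
    for position, (name_text, name_type) in enumerate(names):
        ids.append(f"ciel-{concept_id}-{position}")   # hypothetical ID scheme
        documents.append(name_text)
        metadatas.append({
            "concept_id": concept_id,
            "concept_class": concept_class,
            "datatype": datatype,
            "locale": locale,
            "retired": retired,
            "type": name_type,
        })
    return ids, documents, metadatas


ids, documents, metadatas = ciel_records(
    concept_id=1234,                                  # placeholder ID
    names=[
        ("Carrion's disease", "fully specified"),
        ("Placeholder synonym", "synonym"),           # made-up synonym
    ],
    concept_class="Diagnosis",                        # illustrative value
    datatype="N/A",                                   # illustrative value
)
print(ids)
print(metadatas[0]["type"])
```

The three lists line up with the `ids`, `documents`, and `metadatas` arguments of ChromaDB's `collection.add`.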

By incorporating semantic search into the existing mapping workflow, alongside **Linearization Search**, **Autocode**, **WHO ICD-10 Cross-reference**, and **Elasticsearch**, we can enrich the candidate pool of ICD-11 mappings. Each proposed mapping will then undergo a quality assurance (QA) process. Only mappings that pass QA validation will be scored to determine the most appropriate match.
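
A minimal sketch of this flow, assuming each method returns zero or more candidate codes. The QA predicate and the vote-count scoring rule are placeholders, since the actual scoring mechanism is not defined here; the candidate codes are the outputs shown in section 1.2, and the concept ID is a placeholder.

```python
import json
from collections import Counter
from typing import Callable, Optional


def pick_proposal(candidates: dict, passes_qa: Callable[[str], bool]) -> Optional[str]:
    """Keep QA-passing codes, then choose the one proposed by the most methods."""
    votes = Counter(
        code
        for codes in candidates.values()
        for code in set(codes)        # at most one vote per method
        if passes_qa(code)
    )
    if not votes:
        return None
    # Most votes first; ties broken alphabetically for determinism.
    return min(votes, key=lambda code: (-votes[code], code))


# Candidate codes per method, taken from the outputs in section 1.2.
candidates = {
    "linearization_search": ["1C11.0"],
    "autocode": ["1C11.0"],
    "icd10_crossref": ["1C11.00"],
}

# Placeholder QA rule that accepts everything; a real one would check terminality.
proposal = pick_proposal(candidates, passes_qa=lambda code: True)
print(json.dumps([{"ciel_concept_id": 1234, "icd11_proposal": proposal}], indent=2))
```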

The final validated mappings will be made available in a structured JSON format for downstream use, as shown below:

```json
[
  {
    "ciel_concept_id": 1234,
    "icd11_proposal": "ABC123"
  }
]
```

## Phase 3: Working with Extensions

ICD-11 uses extensions (prefixed with 'X' and connected by '&') to specify aspects such as severity, laterality, and body location, enhancing the precision of medical concepts.

In ICD-11, a stem code represents a pre-coordinated term, which refers to a clinical concept that has been fully assembled with its meaning encapsulated in a single code, such as "Pneumonia" or "Type 2 Diabetes Mellitus." Pre-coordination means that the complexity of the concept is built into the stem itself. However, when further granularity is needed — for example, indicating "Left lung pneumonia" instead of just "Pneumonia" — extensions are used.

Extensions are linked to stem codes through an ampersand ('&') and typically begin with the letter 'X'. They allow post-coordination, meaning that users can combine a stem code with one or more extensions to provide richer, more specific clinical descriptions. For instance, a code for "Pneumonia" could be combined with an extension for "Left lung" to denote precisely the affected site.
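
Splitting such a combined code into its stem and extensions is straightforward. The sketch below follows the description above (parts joined by '&', extensions starting with 'X'); the function name is hypothetical and `X0000` is a placeholder, not a real extension code.

```python
def split_cluster(code: str):
    """Split 'STEM&XEXT1&XEXT2' into ('STEM', ['XEXT1', 'XEXT2'])."""
    stem, *extensions = code.split("&")
    unexpected = [ext for ext in extensions if not ext.startswith("X")]
    if unexpected:
        raise ValueError(f"Unexpected extension part(s): {unexpected}")
    return stem, extensions


print(split_cluster("1C11.0&X0000"))  # ('1C11.0', ['X0000'])
print(split_cluster("1C11.0"))        # ('1C11.0', [])
```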

**Proposed Process:**
- Structure and join ICD-10/11 tables clearly.
- Vectorize data using ChromaDB.
- Integrate vectors into LangChain.
- Utilize a lightweight LLM (e.g., Mistral Small) to query semantic vectors.
- Deploy an interactive webpage for easy validation, linking suggestions directly to the ICD-11 Browser.

This structured process enhances efficiency and accuracy in ICD-11 code mapping.