
## The problem

We have a set of terms (sometimes multi-word), each of which represents a meaning. We want to map those terms to terms from an ontology or controlled vocabulary, matching on meaning rather than on text alone. Embeddings (also called encodings) transform words or sentences into vectors of numbers such that points near each other in N-dimensional space have similar meanings. We use that idea here to map

uncurated term ---> curated term

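We do this by encoding both sets of terms with the same model:

curated term ---> encodings

uncurated term ---> encodings

and then matching each uncurated term to the curated terms whose encodings lie nearest to its own, measured by cosine distance.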
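
The matching step, picking the curated term whose encoding is nearest to an uncurated term's encoding, can be sketched with toy vectors. This is a minimal illustration, not the notebook's pipeline: the 3-dimensional vectors are made up by hand, and `cosine_distance` and `nearest_curated` are hypothetical helper names. A real model produces vectors with hundreds of dimensions.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest_curated(vector, curated_vectors):
    """Return the curated term whose vector has the smallest cosine distance."""
    return min(
        curated_vectors,
        key=lambda term: cosine_distance(vector, curated_vectors[term]),
    )


# Hand-made toy "encodings" (assumption: illustrative values only).
curated = {
    "THIRD VENTRICLE": [0.9, 0.1, 0.0],
    "LUNG": [0.0, 0.2, 0.9],
}
uncurated = {"3RD VENTRICLE": [0.8, 0.2, 0.1]}

for term, vec in uncurated.items():
    print(term, "--->", nearest_curated(vec, curated))
```

Because the toy vector for "3RD VENTRICLE" points in nearly the same direction as the one for "THIRD VENTRICLE", that curated term is selected even though the two strings differ.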


## Sentence transformers

Sentence transformers are natural language processing (NLP) models designed to turn sentences or short text snippets into fixed-dimensional vectors that capture semantic similarity. They are deep learning models, typically Transformer encoders trained in a Siamese (twin-network) arrangement.

The primary objective of a sentence transformer is to produce embeddings such that the distance or similarity between two embeddings reflects how close the corresponding sentences are in meaning. This makes the embeddings useful for tasks such as sentence similarity, clustering, semantic search, document retrieval, and sentiment analysis.

Commonly used base models for sentence transformers include BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (Robustly optimized BERT approach), and DistilBERT, among others. Pre-trained transformer models can be fine-tuned on specific tasks or datasets to create sentence embeddings tailored to a particular application.







### Value-based curation
#### 1. Choose and load a sentence transformer for the job


In [1]:
# %pip install sentence-transformers pandas numpy scipy

from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cdist
import pandas as pd
import numpy as np

model = SentenceTransformer("kamalkraj/BioSimCSE-BioLinkBert-Base")

#### 2. Prepare the data
Load the `curated_bodysite.csv` file and create a set of "original" (uncurated) values and a set of "curated" values.

Our task is to use the "curated" set, which corresponds to our controlled vocabulary or ontology terms, as a dictionary of sorts. We want to build an index for that dictionary that lets us look up terms by **their meaning**. The embeddings represent the "meaning", so we use them as the index.

In [None]:
curated_df = pd.read_csv("curated_bodysite.csv")

# Later steps need curated_bodysite_ontology_term_id together with curated_bodysite,
# so we build a mapping between the two up front.

# Drop rows with missing values, then standardize the original_bodysite and curated_bodysite columns.
# original_bodysite is included in the null check because the string operations below fail on missing values.
df = curated_df.dropna(subset=["original_bodysite", "curated_bodysite", "curated_bodysite_ontology_term_id"]).copy()
df["curated_bodysite_clean"] = df["curated_bodysite"].str.strip().str.upper()
df["original_bodysite_clean"] = df["original_bodysite"].str.strip().str.upper()
# Exclude rows whose original value contains the literal text "LOCATION"
df = df[~df["original_bodysite_clean"].str.contains("LOCATION", regex=False)]

# A single "curated_bodysite" cell may hold several body sites separated by "<;>", so split each column into lists.
df["curated_bodysite_split"] = df["curated_bodysite_clean"].str.split("<;>")
df["ontology_term_id_split"] = df["curated_bodysite_ontology_term_id"].str.split("<;>")
df["original_bodysite_split"] = df["original_bodysite_clean"].str.split("<;>")

# Count the number of splits in each column for every row
df["bodysite_count"] = df["curated_bodysite_split"].apply(len)
df["ontology_id_count"] = df["ontology_term_id_split"].apply(len)
df["original_bodysite_count"] = df["original_bodysite_split"].apply(len)

# Drop rows with mismatched splits. We keep only rows where the numbers of curated bodysites,
# ontology term ids, and original bodysites all match, so the lists can be paired by position.
df = df[(df["bodysite_count"] == df["ontology_id_count"]) & (df["bodysite_count"] == df["original_bodysite_count"])]

# Map the bodysite to the ontology term id
exploded_df = df.explode(["curated_bodysite_split", "ontology_term_id_split"])
curated_to_id_map = dict(zip(exploded_df["curated_bodysite_split"], exploded_df["ontology_term_id_split"]))

for term, ontology_id in list(curated_to_id_map.items())[:5]:
    print(f"{term} -> {ontology_id}")

SALIVARY GLAND -> NCIT:C12426
LUNG -> NCIT:C12468
BUCCAL MUCOSA -> NCIT:C12505
BONE -> NCIT:C12366
ORBIT -> NCIT:C12347
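
To make the split-and-explode step easier to follow, here is a small sketch on a toy table. The rows and the `ID:n` identifiers are placeholders invented for illustration (they are not real ontology codes), and the sketch assumes pandas 1.3 or later, where `DataFrame.explode` accepts a list of columns.

```python
import pandas as pd

# Hypothetical rows; the third row has two terms but only one ID.
toy = pd.DataFrame({
    "curated_bodysite": ["LUNG<;>BONE", "ORBIT", "LUNG<;>BONE"],
    "curated_bodysite_ontology_term_id": ["ID:1<;>ID:2", "ID:3", "ID:1"],
})

# Split each cell into a list on the "<;>" separator.
toy["term_split"] = toy["curated_bodysite"].str.split("<;>")
toy["id_split"] = toy["curated_bodysite_ontology_term_id"].str.split("<;>")

# Keep only rows where every term has an ID at the same list position.
aligned = toy[toy["term_split"].str.len() == toy["id_split"].str.len()]

# Exploding both list columns together preserves the positional pairing.
exploded = aligned.explode(["term_split", "id_split"])
term_to_id = dict(zip(exploded["term_split"], exploded["id_split"]))
print(term_to_id)
```

The mismatched third row is dropped before exploding, because there is no way to tell which of its two terms the single ID belongs to.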


#### 3. Embed the curated and uncurated values and compute distances

In [3]:
curated_terms = list(curated_to_id_map.keys())
curated_embeddings = model.encode(curated_terms)

exploded_original_df = df.explode("original_bodysite_split")
uncurated_terms = exploded_original_df["original_bodysite_split"].unique()
uncurated_embeddings = model.encode(list(uncurated_terms))

# Compute cosine distances between each uncurated embedding and all curated embeddings
distance_matrix = cdist(uncurated_embeddings, curated_embeddings, metric="cosine")
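
# The cosine distance used above is 1 minus the cosine similarity of two vectors.
# The sketch below computes the same quantity with plain NumPy on hand-made
# 2-dimensional vectors (assumption: toy values, not model output), then picks
# the nearest curated columns for each uncurated row. `cosine_distance_matrix`
# is a hypothetical helper name used only for this illustration.

```python
import numpy as np


def cosine_distance_matrix(a, b):
    """Pairwise cosine distances between the rows of a and the rows of b."""
    a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_unit @ b_unit.T


# Toy embeddings: 2 uncurated rows, 3 curated rows.
uncurated_toy = np.array([[1.0, 0.0], [1.0, 2.0]])
curated_toy = np.array([[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])

toy_distances = cosine_distance_matrix(uncurated_toy, curated_toy)

# Row i, column j holds the distance from uncurated row i to curated row j.
print(toy_distances.shape)

# Indices of the two nearest curated rows for each uncurated row.
nearest = np.argsort(toy_distances, axis=1)[:, :2]
print(nearest)
```

# Vector length does not matter here: [1, 0] and [2, 0] point in the same
# direction, so their cosine distance is 0.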

#### 4. Map uncurated terms to curated terms

In [4]:
results = []
for i, uncurated_term in enumerate(uncurated_terms):
    sorted_indices = np.argsort(distance_matrix[i])
    for idx in sorted_indices[:5]:
        curated_term = curated_terms[idx]
        ontology_id = curated_to_id_map[curated_term]
        distance = distance_matrix[i][idx]
        results.append({
            "original_term": uncurated_term,
            "mapped_curated_term": curated_term,
            "ontology_term_id": ontology_id,
            "cosine_distance": distance
        })

results_df = pd.DataFrame(results)

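# results_df already holds only the 5 nearest curated terms for each original term.
# Sorting by cosine distance fixes their order so they can be ranked 1 to 5 below;
# .copy() avoids a pandas SettingWithCopyWarning when the rank column is added.
top5_per_term = results_df.sort_values("cosine_distance")\
    .groupby("original_term", as_index=False)\
    .head(5)\
    .copy()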

top5_per_term["rank"] = top5_per_term.groupby("original_term").cumcount() + 1
pivot_df = top5_per_term.pivot(index="original_term", columns="rank", values="mapped_curated_term")
pivot_df.columns = [f"mapped_curated_term{col}" for col in pivot_df.columns]
pivot_df = pivot_df.reset_index()

best_info = top5_per_term[top5_per_term["rank"] == 1][["original_term", "ontology_term_id", "cosine_distance"]].rename(
    columns={"ontology_term_id": "best_ontology_term_id", "cosine_distance": "best_cosine_distance"}
)

final_df = pd.merge(pivot_df, best_info, on="original_term")

final_df.to_csv("top5_matches.csv", index=False, encoding="utf-8")

print("\nFinal result:")
print(final_df)


Final result:
                     original_term mapped_curated_term1 mapped_curated_term2  \
0              "BRAIN, CEREBELLUM"                BRAIN           CEREBELLUM   
1                "BRAIN, PARIETAL"        PARIETAL LOBE             PARIETAL   
2                    3RD VENTRICLE      THIRD VENTRICLE     FOURTH VENTRICLE   
3                    4TH VENTRICLE     FOURTH VENTRICLE      THIRD VENTRICLE   
4                          ABDOMEN              ABDOMEN               PELVIS   
..                             ...                  ...                  ...   
390                         UTERUS               UTERUS         CERVIX UTERI   
391                         VAGINA               VAGINA              URETHRA   
392                     VENTRICLES         CERVIX UTERI              ABDOMEN   
393  VERY DISTAL RECTUM RECURRENCE               RECTUM               DISTAL   
394                          VULVA                VULVA       GASTRIC CARDIA   

          mapped_curated
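
The first-ranked match is not always right: in the output above, VENTRICLES is paired with CERVIX UTERI. One possible follow-up, sketched below, is to flag matches whose best cosine distance exceeds a cutoff so they can be reviewed by hand. The distances in this toy table and the `REVIEW_THRESHOLD` value are assumptions made for illustration; a sensible cutoff would have to be chosen by inspecting the real `best_cosine_distance` values.

```python
import pandas as pd

# Hypothetical distances for illustration only; real values come from final_df.
toy_final = pd.DataFrame({
    "original_term": ["ABDOMEN", "VENTRICLES"],
    "mapped_curated_term1": ["ABDOMEN", "CERVIX UTERI"],
    "best_cosine_distance": [0.0, 0.45],
})

# Arbitrary cutoff (assumption); tune it against the actual distance distribution.
REVIEW_THRESHOLD = 0.3

# Mark rows whose best match is farther away than the cutoff.
toy_final["needs_review"] = toy_final["best_cosine_distance"] > REVIEW_THRESHOLD
print(toy_final[toy_final["needs_review"]])
```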

---
**Work By Changchang Li**  
March 24th, 2025