# Evaluation of the exact matching pipeline

This notebook evaluates how well the pipeline finds exact matches between source terms and standard concepts in the OHDSI Vocabulary.
We use a gold standard dataset of source terms and their correct mappings to standard concepts; some source terms have no correct mapping.

Before running this notebook, make sure the environment is set up as described in the README.
This includes credentials for a database with the OHDSI Vocabulary loaded, and credentials for the LLM API.
The prompts, as specified in `config.yaml`, are tailored for **GPT o3**, so we recommend using that model for this evaluation.
This notebook currently assumes a local PGVector instance for vector search, set up with the code at [https://github.com/schuemie/OhdsiVocabVectorStore](https://github.com/schuemie/OhdsiVocabVectorStore). The public OHDSI Hecate vector search API can be used instead.

## Setup

All results will be stored in the `data/notebook_results` folder.
Most code blocks will load results from file if they already exist, to save time and costs.
Delete those files to rerun the corresponding steps.
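
The load-or-compute pattern used by the cells in this notebook can be sketched as a small helper. This is only an illustration: `load_or_compute` and its parameters are hypothetical names, not part of the notebook or of Ariadne, and the demo uses JSON in a temporary folder purely to stay self-contained. In the notebook the loader and saver would be `pd.read_csv` and `DataFrame.to_csv`.

```python
import json
import tempfile
from pathlib import Path


def load_or_compute(path, compute, load, save):
    """Return the cached result at `path`, or compute it and cache it."""
    if path.exists():
        return load(path)  # Cache hit: skip the expensive step.
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    save(result, path)
    return result


calls = []


def expensive_step():
    """Stand-in for an LLM or database call (hypothetical)."""
    calls.append(1)  # Record each real computation.
    return [{"source_term": "Unspecified fracture of skull"}]


def read_json(path):
    return json.loads(path.read_text())


def write_json(obj, path):
    path.write_text(json.dumps(obj))


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "notebook_results" / "demo.json"
    first = load_or_compute(cache, expensive_step, read_json, write_json)
    second = load_or_compute(cache, expensive_step, read_json, write_json)  # Served from file.
    cache.unlink()  # Deleting the file forces a rerun.
    third = load_or_compute(cache, expensive_step, read_json, write_json)
```

In this sketch the expensive step runs on the first call and again after the cached file is deleted, but not on the second call.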

In [2]:
from pathlib import Path
import pandas as pd

project_root = Path.cwd().parent

## Gold standard

We use the gold standard file in the `data/gold_standards` folder.

In [3]:
gold_standard_path = project_root / "data" / "gold_standards" / "exact_matching_gs.csv"

gold_standard = pd.read_csv(gold_standard_path)
gold_standard.head()

| | source_concept_id | source_term | target_concept_id | target_concept_name | predicate | target_concept_id_b | target_concept_name_b | predicate_b |
|---|---|---|---|---|---|---|---|---|
| 0 | 8690 | Unspecified enthesopathy, lower limb, excludin... | 4116324 | Enthesopathy of lower leg and ankle region | exactMatch | 4116324.0 | Enthesopathy of lower leg and ankle region | broadMatch |
| 1 | 9724 | Lead-induced chronic gout, unspecified elbow | 607432 | Chronic gout caused by lead | broadMatch | | | |
| 2 | 9770 | Chronic gout due to renal impairment, right elbow | 46270464 | Gout of elbow due to renal impairment | broadMatch | | | |
| 3 | 9946 | Pathological fracture, left ankle | 760649 | Pathological fracture of left ankle | exactMatch | | | |
| 4 | 10389 | Unspecified fracture of skull | 4324690 | Fracture of skull | exactMatch | | | |


This gold standard contains the following columns:
- `source_concept_id`: the concept ID of the source term (a non-standard concept)
- `source_term`: the source term text
- `target_concept_id`: the concept ID of the standard concept to which the source term maps
- `target_concept_name`: the name of the standard concept
- `predicate`: whether the mapping is an `exactMatch` or a `broadMatch`. Only `exactMatch` mappings are considered correct in this evaluation.
- `target_concept_id_b`, `target_concept_name_b`, and `predicate_b`: an optional second mapping for the source term that is equally valid, if available.
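
How these columns translate into a set of acceptable answers can be sketched as follows. The helper name `correct_concept_ids` is hypothetical, and counting the second mapping only when `predicate_b` is `exactMatch` is an assumption about how the evaluation applies the rule above. The rows are copied from the preview of the gold standard.

```python
import math


def _is_missing(value):
    """True for None or NaN, which is how pandas represents empty cells."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def correct_concept_ids(row):
    """Collect the target concept IDs that count as correct for one row.

    `row` can be a dict or a pandas Series; only `exactMatch` mappings count.
    """
    correct = set()
    for id_col, predicate_col in (
        ("target_concept_id", "predicate"),
        ("target_concept_id_b", "predicate_b"),
    ):
        concept_id = row.get(id_col)
        if not _is_missing(concept_id) and row.get(predicate_col) == "exactMatch":
            correct.add(int(concept_id))  # The _b column is read as float.
    return correct


nan = float("nan")
rows = [
    {"source_concept_id": 8690, "target_concept_id": 4116324, "predicate": "exactMatch",
     "target_concept_id_b": 4116324.0, "predicate_b": "broadMatch"},
    {"source_concept_id": 9724, "target_concept_id": 607432, "predicate": "broadMatch",
     "target_concept_id_b": nan, "predicate_b": nan},
    {"source_concept_id": 9946, "target_concept_id": 760649, "predicate": "exactMatch",
     "target_concept_id_b": nan, "predicate_b": nan},
]
gold = {row["source_concept_id"]: correct_concept_ids(row) for row in rows}
```

Under this reading, a source term whose set is empty has no correct exact match, so the expected result for it is no mapping.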

## Term cleanup

The first step is to clean up the source terms, removing any extraneous information that may interfere with matching.
This includes removing phrases like "not otherwise specified", as well as other uninformative parts of the source term.
This step uses the LLM to perform the cleanup.

In [4]:
from ariadne.term_cleanup.term_cleaner import TermCleaner

cleaned_terms_file = project_root / "data" / "notebook_results" / "exact_matching_cleaned_terms.csv"
if cleaned_terms_file.exists():
    cleaned_terms = pd.read_csv(cleaned_terms_file)
    print("Loaded cleaned terms from file.")
else:
    term_cleaner = TermCleaner()
    cleaned_terms = term_cleaner.clean_terms(gold_standard, "source_term")
    print(f"Total LLM cost: ${term_cleaner.get_total_cost():.6f}")
    cleaned_terms.to_csv(cleaned_terms_file, index=False)  # Cache only freshly computed results.


cleaned_terms[["source_term", "cleaned_term"]].head(10)

Loaded cleaned terms from file.


| | source_term | cleaned_term |
|---|---|---|
| 0 | Unspecified enthesopathy, lower limb, excludin... | enthesopathy, lower limb, excluding foot |
| 1 | Lead-induced chronic gout, unspecified elbow | Lead-induced chronic gout, elbow |
| 2 | Chronic gout due to renal impairment, right elbow | Chronic gout due to renal impairment, right elbow |
| 3 | Pathological fracture, left ankle | Pathological fracture, left ankle |
| 4 | Unspecified fracture of skull | fracture of skull |
| 5 | Ocular laceration without prolapse or loss of ... | Ocular laceration without prolapse or loss of ... |
| 6 | Penetrating wound of orbit with or without for... | Penetrating wound of orbit |
| 7 | Fracture of unspecified shoulder girdle, part ... | Fracture of shoulder girdle |
| 8 | Drowning and submersion due to falling or jump... | Drowning and submersion |
| 9 | Alcohol use, unspecified with withdrawal delirium | Alcohol use with withdrawal delirium |


## Verbatim matching

The next step is to perform verbatim matching, i.e. looking for exact matches of the cleaned source terms in the vocabulary.
We do allow for some minor variations, such as case differences and punctuation differences.

We use the `VocabVerbatimTermMapper` class for this, which first needs to create an index of the vocabulary terms.
For this it will connect to the database specified in the environment variables.
It will restrict to the vocabularies and domains specified in the `config.yaml` file.
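
One way to tolerate case and punctuation differences is to normalize both sides before an exact dictionary lookup. The sketch below illustrates that idea only; it is not the implementation of `VocabVerbatimTermMapper`, and `normalize`, `build_index`, and `lookup` are hypothetical names. The two vocabulary entries are taken from the outputs in this notebook.

```python
import re
import string

# Character class matching any single ASCII punctuation character.
_PUNCTUATION = re.compile(f"[{re.escape(string.punctuation)}]")


def normalize(term):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    without_punct = _PUNCTUATION.sub(" ", term.lower())
    return " ".join(without_punct.split())


def build_index(vocabulary):
    """Map each normalized concept name to its concept ID."""
    return {normalize(name): concept_id for concept_id, name in vocabulary.items()}


def lookup(index, term):
    """Return the matching concept ID, or -1 when there is no verbatim match."""
    return index.get(normalize(term), -1)


vocabulary = {
    4324690: "Fracture of skull",
    760649: "Pathological fracture of left ankle",
}
index = build_index(vocabulary)

# Differs from the concept name only in case, so it matches.
skull = lookup(index, "fracture of skull")
# The wording differs ("," versus "of"), so there is no verbatim match.
ankle = lookup(index, "Pathological fracture, left ankle")
```

A lookup of this kind is deliberately strict: terms that differ by even one word fall through to the next step.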



In [5]:
from ariadne.verbatim_mapping.term_downloader import download_terms
from ariadne.verbatim_mapping.vocab_verbatim_term_mapper import VocabVerbatimTermMapper

verbatim_match_file = project_root / "data" / "notebook_results" / "exact_matching_verbatim_matches.csv"
if verbatim_match_file.exists():
    verbatim_matches = pd.read_csv(verbatim_match_file)
    print("Loaded verbatim matches from file.")
else:
    download_terms() # Downloads the terms as Parquet files to the folder specified in config.yaml.
    verbatim_mapper = VocabVerbatimTermMapper() # Will construct the vocabulary index if needed
    verbatim_matches = verbatim_mapper.map_terms(
        cleaned_terms, "cleaned_term"
    )
    verbatim_matches.to_csv(verbatim_match_file, index=False)
verbatim_matches[["source_term", "cleaned_term", "matched_concept_id", "matched_concept_name"]].head(10)

Loaded verbatim matches from file.


| | source_term | cleaned_term | matched_concept_id | matched_concept_name |
|---|---|---|---|---|
| 0 | Unspecified enthesopathy, lower limb, excludin... | enthesopathy, lower limb, excluding foot | -1 | |
| 1 | Lead-induced chronic gout, unspecified elbow | Lead-induced chronic gout, elbow | -1 | |
| 2 | Chronic gout due to renal impairment, right elbow | Chronic gout due to renal impairment, right elbow | -1 | |
| 3 | Pathological fracture, left ankle | Pathological fracture, left ankle | -1 | |
| 4 | Unspecified fracture of skull | fracture of skull | 4324690 | Fracture of skull |
| 5 | Ocular laceration without prolapse or loss of ... | Ocular laceration without prolapse or loss of ... | -1 | |
| 6 | Penetrating wound of orbit with or without for... | Penetrating wound of orbit | 4334734 | Penetrating wound of orbit |
| 7 | Fracture of unspecified shoulder girdle, part ... | Fracture of shoulder girdle | -1 | |
| 8 | Drowning and submersion due to falling or jump... | Drowning and submersion | -1 | |
| 9 | Alcohol use, unspecified with withdrawal delirium | Alcohol use with withdrawal delirium | -1 | |


## Embedding vector search

Next, we perform embedding vector search for the source terms that were not matched by verbatim matching.
This uses the vector store specified in the `config.yaml` file, which can be either a local PGVector instance or the OHDSI Hecate API.
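
Conceptually, vector search embeds the query term and ranks concepts by how close their embeddings are to it. The sketch below shows that ranking with cosine distance on made-up three-dimensional vectors. The vectors, the `top_k` helper, and the choice of cosine distance are assumptions for illustration; they do not describe how `PgvectorConceptSearcher` or Hecate compute `match_score`.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query_vector, concept_vectors, k):
    """Return the k (concept_id, distance) pairs closest to the query."""
    scored = [
        (concept_id, cosine_distance(query_vector, vector))
        for concept_id, vector in concept_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy embeddings, invented for this example.
concept_vectors = {
    4324690: [0.9, 0.1, 0.0],  # Fracture of skull
    439384: [0.7, 0.6, 0.1],   # Open fracture of skull
    4302223: [0.3, 0.2, 0.9],  # Fracture of bone of head
}
query = [1.0, 0.1, 0.0]  # Stand-in for the embedding of "fracture of skull"

results = top_k(query, concept_vectors, k=2)
```

In this sketch a lower score means a closer candidate, so the best match comes first in `results`.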

In [7]:
from ariadne.vector_search.pgvector_concept_searcher import PgvectorConceptSearcher
# from ariadne.vector_search.hecate_concept_searcher import HecateConceptSearcher

vector_search_results_file = project_root / "data" / "notebook_results" / "exact_matching_vector_search_results.csv"
if vector_search_results_file.exists():
    vector_search_results = pd.read_csv(vector_search_results_file)
    print("Loaded vector search results from file.")
else:
    concept_searcher = PgvectorConceptSearcher()
    # concept_searcher = HecateConceptSearcher()
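    unmatched_terms = cleaned_terms.copy()
    # Keep only the terms that verbatim matching did not resolve:
    # drop every source concept that has a match (matched_concept_id != -1).
    unmatched_terms = unmatched_terms[~unmatched_terms["source_concept_id"].isin(
        verbatim_matches["source_concept_id"][verbatim_matches["matched_concept_id"] != -1]
    )]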
    vector_search_results = concept_searcher.search_terms(
        unmatched_terms, "cleaned_term"
    )
    vector_search_results.to_csv(vector_search_results_file, index=False)
vector_search_results[["source_term", "cleaned_term", "matched_concept_id", "matched_concept_name", "match_score"]].head(10)

Loaded vector search results from file.


| | source_term | cleaned_term | matched_concept_id | matched_concept_name | match_score |
|---|---|---|---|---|---|
| 0 | Unspecified fracture of skull | fracture of skull | 4324690 | Fracture of skull | 0.055798 |
| 1 | Unspecified fracture of skull | fracture of skull | 4011508 | Fracture of skull and facial bones | 0.160907 |
| 2 | Unspecified fracture of skull | fracture of skull | 439384 | Open fracture of skull | 0.180531 |
| 3 | Unspecified fracture of skull | fracture of skull | 4169757 | Multiple fractures of skull | 0.180899 |
| 4 | Unspecified fracture of skull | fracture of skull | 4159165 | Closed fracture of skull | 0.182617 |
| 5 | Unspecified fracture of skull | fracture of skull | 4308296 | Fracture of parietal bone | 0.204327 |
| 6 | Unspecified fracture of skull | fracture of skull | 4168152 | Fracture of vault of skull | 0.204327 |
| 7 | Unspecified fracture of skull | fracture of skull | 40483811 | Closed fracture of vault of skull | 0.204379 |
| 8 | Unspecified fracture of skull | fracture of skull | 4302223 | Fracture of bone of head | 0.218712 |
| 9 | Unspecified fracture of skull | fracture of skull | 4082029 | Fracture of base of skull | 0.219878 |


### Evaluating the vector search results

