# Medical Retrieval with `medretrieval`

This notebook demonstrates how to use the `medretrieval` library for medical document retrieval.

## Installation

First, install the `medretrieval` library from GitHub.

In [None]:
!pip install git+https://github.com/McDermottHealthAI/Medical-Retrieval-DB --quiet


In [None]:
from medretrieval import Corpus, Embedding
from datasets import load_dataset
import polars as pl
import time


## Small Dataset Example

### Loading and Embedding Medical Data

`metabolic.txt` and `respiratory.txt` are two short medical documents generated by GPT. We load both into a corpus and create the embedding model that will encode them.

In [None]:
directory = "https://github.com/McDermottHealthAI/Medical-Retrieval-DB/tree/main/examples/data"
dataset = Corpus.load_data([f"{directory}/metabolic.txt", f"{directory}/respiratory.txt"])
emb = Embedding("thomas-sounack/BioClinical-ModernBERT-base")


In [None]:
start = time.time()
dataset = emb.embed(dataset, build_faiss_index=True)
end = time.time()
print(f"Embedding {len(dataset)} documents took {round(end - start, 3)}s")
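
The start/stop timing pattern above recurs in several cells of this notebook. As an optional convenience, it could be wrapped in a small context manager. The sketch below is not part of `medretrieval`; the name `timed` is hypothetical, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label):
    """Print how long the enclosed block took, in seconds (hypothetical helper)."""
    start = time.perf_counter()  # monotonic clock, suited to measuring intervals
    try:
        yield
    finally:
        # Runs even if the enclosed block raises, so the timing is always reported.
        elapsed = time.perf_counter() - start
        print(f"{label} took {elapsed:.3f}s")


# Usage: the sleep stands in for an expensive call such as embedding a corpus.
with timed("Sleeping"):
    time.sleep(0.01)
```

In this notebook, the body of the `with` block would hold the `emb.embed(...)` or `emb.query(...)` call being measured.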


### Performing Queries

Query the embedded corpus, then print the score and the ID of the top-ranked document for each query.

In [None]:
queries = [
  "What are the main types of chronic respiratory diseases?",
  "What is the most effective way to prevent COPD and lung cancer?",
  "What conditions, including abdominal obesity and high triglycerides, define metabolic syndrome?",
  "How does the file differentiate between Hypothyroidism and Hyperthyroidism in terms of their effect on the body's metabolism?",
]
]

start = time.time()
scores, results = emb.query(dataset, queries, 1)
end = time.time()
print(f"Running {len(queries)} queries took {round(end - start, 3)}s")

for score, result in zip(scores, results):
  print(f"Score: {score[0]}, Neighbor: {result['document_id'][0]}")
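
To build intuition for what a nearest-neighbor query does, here is a minimal brute-force sketch in plain Python. It is an illustration only, not how `medretrieval` or FAISS is implemented: the toy vectors, the helper names `cosine_similarity` and `top_k`, and the choice of cosine similarity are all assumptions, and the notebook does not state which metric the library's scores use.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=1):
    """Return the k (score, document_id) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, reverse=True)[:k]


# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
toy_docs = {
    "metabolic": [0.9, 0.1, 0.0],
    "respiratory": [0.1, 0.9, 0.2],
}
toy_query = [0.2, 0.8, 0.1]  # points in roughly the same direction as "respiratory"

print(top_k(toy_query, toy_docs, k=1))
```

A brute-force scan like this compares the query against every document, which is fine for two files but slow for large corpora; an index such as FAISS exists to make that search fast.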


## Larger Hugging Face Dataset Example

### Loading and Transforming the Dataset

The `medretrieval` embedder expects the corpus dataset to have two columns: `document_id` and `content`. We therefore load the PubMedQA dataset and transform it into that format, keeping each record's question in an extra `query` column for later evaluation.

In [None]:
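medqa_ds = load_dataset("qiaojin/PubMedQA", "pqa_labeled")
# Build the columns medretrieval expects, plus the question to use as a query later.
dataset = medqa_ds["train"].map(
    lambda record: {
        "document_id": record["pubid"],
        # Join the abstract's context passages into a single document string.
        "content": "\n".join(record["context"]["contexts"]),
        "query": record["question"],
    }
# Drop the original PubMedQA columns so that only the new ones remain.
).remove_columns(column_names=medqa_ds["train"].column_names)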
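
Since the embedder depends on specific column names, a quick check before embedding can catch a mistyped or missing column early. The helper below is a hypothetical sketch in plain Python, not a `medretrieval` function; `missing_columns` and `REQUIRED_COLUMNS` are names introduced here for illustration.

```python
REQUIRED_COLUMNS = ("document_id", "content")


def missing_columns(column_names, required=REQUIRED_COLUMNS):
    """Return the required column names absent from column_names, in order."""
    present = set(column_names)
    return [name for name in required if name not in present]


# With a Hugging Face dataset, dataset.column_names would be passed in here.
print(missing_columns(["document_id", "content", "query"]))  # → []
print(missing_columns(["pubid", "question", "context"]))  # → ['document_id', 'content']
```

An empty list means the dataset has every required column; extra columns such as `query` are ignored by the check.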

### Embedding the Dataset

In [None]:
emb = Embedding("thomas-sounack/BioClinical-ModernBERT-base")


In [None]:
start = time.time()
dataset = emb.embed(dataset, build_faiss_index=True)
end = time.time()
print(f"Embedding {len(dataset)} documents took {round(end - start, 3)}s")


### Performing Queries

Use each PubMedQA question as a query against the embedded corpus and retrieve its top-ranked document.

In [None]:
queries = [str(query) for query in dataset["query"]]


start = time.time()
scores, results = emb.query(dataset, queries, 1)
end = time.time()
print(f"Running {len(queries)} queries took {round(end - start, 3)}s")


In [None]:
query_results = pl.DataFrame([{"score": score, "result": list(result['document_id'])} for score, result in zip(scores, results)])


### Evaluating Retrieval Performance

Each query in PubMedQA is generated from a corresponding document, so we know the source document for every query. An ideal similarity search returns that source document for its query. Because each query here retrieves a single result, the cell below computes the fraction of queries whose top-ranked document is the source document, that is, top-1 accuracy.

In [None]:
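# Pair each query's source document ID with the IDs retrieved for that query.
df = pl.DataFrame({
    "document_id": dataset["document_id"],
    "result": query_results["result"]
})
# Fraction of queries whose retrieved list contains the source document.
print(sum(df["result"].list.contains(df["document_id"])) / len(df))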
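
The same metric generalizes to the top k results, often called recall@k. The following is a minimal plain-Python sketch on made-up IDs, independent of `medretrieval` and polars; `recall_at_k` is a hypothetical helper. Applying it to this notebook for k greater than 1 would require retrieving more neighbors per query, presumably by raising the third argument to `emb.query`, which is an assumption about that argument's meaning.

```python
def recall_at_k(expected_ids, retrieved_ids, k):
    """Fraction of queries whose expected document is in the first k retrieved IDs."""
    if len(expected_ids) != len(retrieved_ids):
        raise ValueError("expected_ids and retrieved_ids must have the same length")
    if not expected_ids:
        return 0.0
    hits = sum(
        expected in retrieved[:k]
        for expected, retrieved in zip(expected_ids, retrieved_ids)
    )
    return hits / len(expected_ids)


# Toy data: three queries, each with a ranked list of retrieved document IDs.
expected = [101, 102, 103]
retrieved = [[101, 205], [300, 102], [400, 401]]

print(recall_at_k(expected, retrieved, k=1))  # 1 of 3 queries hit at rank 1
print(recall_at_k(expected, retrieved, k=2))  # 2 of 3 queries hit within rank 2
```

Recall@k can only stay the same or rise as k grows, so reporting it at a few values of k shows how close the near misses are.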
