## Medplexity
Medplexity is a framework for exploring the capabilities of LLMs in the medical domain. It does this by providing interfaces to, and collections of, common benchmarks, LLMs, and prompts. In this tutorial we go over the main features of medplexity by running OpenAI's GPT-4 model against the MedMCQA dataset.


## Setup
Let's start by installing the latest version of medplexity if you haven't already:


In [1]:
# !pip install medplexity

## Benchmarks
First, let's select the benchmark that we want to evaluate against. In this example we use the MedMCQA dataset, a collection of multiple-choice questions designed to address real-world medical entrance exam questions.

In [2]:
from medplexity.benchmarks.medmcqa.medmcqa_dataset_builder import MedMCQADatasetBuilder

In [3]:
medmcqa_dataset = MedMCQADatasetBuilder().build_dataset(split_type="validation")

In [4]:
print(medmcqa_dataset.description)

Multiple-choice questions designed to address real-world medical entrance exam questions like AIIMS & NEET PG.
    This dataset encompasses over 194k high-quality MCQs spanning 2.4k healthcare topics and 21 medical subjects. Questions are accompanied by an explanation of the correct answer.

    Original paper: MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering

    2022 · Ankit Pal, Logesh Kumar Umapathi, Malaikannan Sankarasubbu
    <https://arxiv.org/abs/2203.14371>

    Train/validation/test splits available.

    Dataset version used: <https://huggingface.co/datasets/medmcqa>
    


In [5]:
# Example data point
medmcqa_dataset[1]

{
    "input": {
        "question": "Which of the following is not true about glomerular capillaries')",
        "options": [
            "The oncotic pressure of the fluid leaving the capillaries is less than that of fluid entering it",
            "Glucose concentration in the capillaries is the same as that in glomerular filtrate",
            "Constriction of afferent aeriole decreases the blood flow to the glomerulas",
            "Hematocrit of the fluid leaving the capillaries is less than that of the fluid entering it"
        ]
    },
    "expected_output": 0,
    "metadata": {
        "explanation": "Ans-a. The oncotic pressure of the fluid leaving the capillaries is less than that of fluid entering it Guyton I LpJ1 4-.;anong 23/e p653-6_)Glomerular oncotic pressure (due to plasma protein content) is higher than that of filtrate oncotic pressure in Bowman's capsule\"Since glucose is freely filtered and the fluid in the Bowman's capsule is isotonic with plasma, the concentrat

## LLMs & Chains
Now let's create the LLM that we want to use for the evaluation. Medplexity provides interfaces to common APIs, such as the OpenAI API. If you don't have an OpenAI API key yet, you can get one on the [OpenAI website](https://openai.com/api).

In [6]:
from medplexity.llms.openai_caller import OpenAI

In [7]:
openai_llm = OpenAI(
    api_token="<YOUR TOKEN HERE>"
)
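If you would rather not paste the token into the notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the `load_api_token` helper and the `OPENAI_API_KEY` variable name are assumptions, not part of medplexity.

```python
import os


def load_api_token(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API token stored in the given environment variable.

    Both this helper and the default variable name are illustrative assumptions.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the evaluation."
        )
    return token


# Hypothetical usage with the class imported above:
# openai_llm = OpenAI(api_token=load_api_token())
```

Failing early with a clear message avoids discovering a missing token only after the evaluation has started.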

It's not enough to just have an LLM. We also need to find the right prompt to use, and then make sure that the output is in the correct format. For this reason, we use the _Chain_ abstraction. Chains are wrappers around complicated sequences of operations with an LLM at their core. If you are familiar with LangChain, this is meant to be the same concept as its Chain.

For this specific benchmark we already have a prompt template that uses chain-of-thought prompting and a few examples. With this prompt we also ask the model to output JSON in the following format:
```json
{
    "answer": "...",
    "explanation": "..."
}
```
so that we also get the LLM's reasoning behind the answer.

In [8]:
from medplexity.benchmarks.medmcqa.medmcqa_prompt_template import MedMCQAPromptTemplate

prompt_template = MedMCQAPromptTemplate()

In [9]:
print(prompt_template.PROMPT)

The following are multiple choice questions about medical knowledge. Solve them in a step-by-step fashion, starting by summarizing the available information. Output a JSON with the answer (give back only the letter) and an explanation for it.
{examples}

Question: {question}
{options}
Output: 


In [10]:
example_datapoint = medmcqa_dataset[2]

# Here is what the final prompt looks like
print(prompt_template.format(
    question=example_datapoint.input.question,
    options=example_datapoint.input.options
))

The following are multiple choice questions about medical knowledge. Solve them in a step-by-step fashion, starting by summarizing the available information. Output a JSON with the answer (give back only the letter) and an explanation for it.
Question: Maximum increase in prolactin level is caused by:
 (A) Risperidone (B) Clozapine (C) Olanzapine (D) Aripiprazole
Output: {"answer":"(A)","explanation":"Let’s solve this step-by-step, referring to authoritative sources as needed. Clozapine generally does not raise prolactin levels. Atypicals such as olanzapine and aripiprazole cause small if no elevation. Risperidone is known to result in a sustained elevated prolactin level. Therefore risperidone is likely to cause the maximum increase in prolactin level."}

Question: What is the age of routine screening mammography?
 (A) 20 years (B) 30 years (C) 40 years (D) 50 years
Output: {"answer":"(C)","explanation":"Let’s solve this step-by-step, referring to authoritative sources as needed. The 
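To make the printed prompt easier to read, here is a minimal sketch of how a question and its options could be turned into that layout. It is not medplexity's implementation: `PROMPT`, `format_options`, and `format_prompt` are hypothetical names, and the template is a shortened copy that leaves out the instruction line and the `{examples}` placeholder.

```python
from string import ascii_uppercase

# Shortened copy of the template printed above; {examples} is left out for brevity.
PROMPT = "Question: {question}\n{options}\nOutput: "


def format_options(options: list[str]) -> str:
    """Render options as ' (A) first (B) second ...', mirroring the printed prompt."""
    return " " + " ".join(
        f"({letter}) {option}" for letter, option in zip(ascii_uppercase, options)
    )


def format_prompt(question: str, options: list[str]) -> str:
    """Fill the shortened template with one question and its options."""
    return PROMPT.format(question=question, options=format_options(options))


print(format_prompt(
    "Maximum increase in prolactin level is caused by:",
    ["Risperidone", "Clozapine", "Olanzapine", "Aripiprazole"],
))
```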

To prepare the LLM for the evaluation, we need to transform the dataset inputs into the right prompt, parse the LLM's output into the expected format (a JSON object with an answer and an explanation), and compare that parsed answer with the expected answer in the dataset (which is a single number). Right now, the way to do this in medplexity is to define small adapter functions:

In [11]:
from medplexity.benchmarks.multiple_choice_utils import AnswerWithExplanation
from medplexity.benchmarks.medmcqa.medmcqa_dataset_builder import MedMCQAInput

def input_adapter(medmcqa_input: MedMCQAInput) -> str:
    """Transforms input into a single string that will be passed down to LLM"""
    prompt_template = MedMCQAPromptTemplate()

    return prompt_template.format(
        question=medmcqa_input.question,
        options=medmcqa_input.options
    )

def output_adapter(output_json: str) -> AnswerWithExplanation:
    """Parses the output string to the expected JSON format, for which we use a Pydantic model"""
    parsed_output = AnswerWithExplanation.model_validate_json(output_json)

    return parsed_output

def comparator(expected_output: int, predicted_output: AnswerWithExplanation) -> bool:
    """Compare the predicted answer with the expected output in the dataset.

    The dataset marks the right answer with a 0-based index, so we convert the
    predicted letter to the corresponding index before comparing.
    """
    letter_to_idx = {"(A)": 0, "(B)": 1, "(C)": 2, "(D)": 3}
    # Raises KeyError if the model returns anything other than "(A)" to "(D)".
    predicted_idx = letter_to_idx[predicted_output.answer]

    return expected_output == predicted_idx
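The adapters above assume the model always replies with well-formed JSON and an answer written exactly as `(A)` to `(D)`. As a hedged illustration of a more tolerant variant, the sketch below uses only the standard library. `ParsedAnswer`, `parse_output`, `answer_to_index`, and `tolerant_comparator` are hypothetical names, and `ParsedAnswer` is a stand-in for `AnswerWithExplanation`, not the real class.

```python
import json
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedAnswer:
    """Stand-in for AnswerWithExplanation so the sketch has no dependencies."""
    answer: str
    explanation: str


def parse_output(output_json: str) -> ParsedAnswer:
    """Parse the model's JSON reply; raises ValueError or KeyError if it is malformed."""
    data = json.loads(output_json)
    return ParsedAnswer(answer=data["answer"], explanation=data.get("explanation", ""))


def answer_to_index(answer: str) -> Optional[int]:
    """Map 'A', '(a)' or '(B)' to a 0-based index; return None for anything else."""
    match = re.fullmatch(r"\(?([A-D])\)?", answer.strip().upper())
    return ord(match.group(1)) - ord("A") if match else None


def tolerant_comparator(expected_output: int, predicted: ParsedAnswer) -> bool:
    """Count unrecognised answers as incorrect instead of raising KeyError."""
    return answer_to_index(predicted.answer) == expected_output
```

Whether an unparseable answer should count as incorrect or abort the run is a design choice; counting it as incorrect keeps a long evaluation from stopping on a single bad reply.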

Now we can define the final chain and go on to evaluation:

In [12]:
from medplexity.chains.evaluation_adapter_chain import EvaluationAdapterChain

chain = EvaluationAdapterChain(
    llm=openai_llm,
    input_adapter=input_adapter,
    output_adapter=output_adapter,
)
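Conceptually, a chain like this composes three steps: adapt the input, call the LLM, adapt the output. The following sketch shows that idea with a stubbed model so it runs without any API calls. `SketchAdapterChain` and `fake_llm` are hypothetical and are not how `EvaluationAdapterChain` is necessarily implemented.

```python
import json
from typing import Any, Callable


class SketchAdapterChain:
    """Illustrative stand-in, not medplexity's EvaluationAdapterChain."""

    def __init__(
        self,
        llm: Callable[[str], str],
        input_adapter: Callable[[Any], str],
        output_adapter: Callable[[str], Any],
    ) -> None:
        self.llm = llm
        self.input_adapter = input_adapter
        self.output_adapter = output_adapter

    def __call__(self, raw_input: Any) -> Any:
        prompt = self.input_adapter(raw_input)  # dataset input -> prompt string
        completion = self.llm(prompt)           # prompt -> raw model text
        return self.output_adapter(completion)  # raw text -> structured answer


def fake_llm(prompt: str) -> str:
    """A stub that always answers "(A)", so no credits are consumed."""
    return '{"answer": "(A)", "explanation": "stub"}'


sketch_chain = SketchAdapterChain(
    llm=fake_llm,
    input_adapter=lambda item: f"Question: {item['question']}\nOutput: ",
    output_adapter=json.loads,
)
print(sketch_chain({"question": "Example?"}))
```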

## Evaluation
Evaluators accept your chain and run it on a given dataset. They generate a report that you can later examine at the level of individual predictions.

In [13]:
from medplexity.evaluators.sequential_evaluator import SequentialEvaluator

evaluator = SequentialEvaluator(
    chain=chain,
    comparator=comparator
)
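As a rough mental model, a sequential evaluator can be thought of as a loop that runs the chain on each data point, applies the comparator, and records the result. The sketch below illustrates that loop on toy data. `SketchResult`, `evaluate_sequentially`, and `accuracy` are hypothetical names, and the real `SequentialEvaluator` may work differently.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass
class SketchResult:
    """One prediction together with its correctness flag."""
    input: Any
    expected_output: Any
    output: Any
    correct: bool


def evaluate_sequentially(
    chain: Callable[[Any], Any],
    comparator: Callable[[Any, Any], bool],
    datapoints: Iterable[dict],
) -> List[SketchResult]:
    """Run the chain on each data point in order and score it with the comparator."""
    results = []
    for point in datapoints:
        output = chain(point["input"])
        correct = comparator(point["expected_output"], output)
        results.append(SketchResult(point["input"], point["expected_output"], output, correct))
    return results


def accuracy(results: List[SketchResult]) -> float:
    """Fraction of correct predictions; 0.0 for an empty result list."""
    return sum(r.correct for r in results) / len(results) if results else 0.0


# Toy data and a toy chain that always predicts index 0, for illustration only.
toy_data = [
    {"input": "q1", "expected_output": 0},
    {"input": "q2", "expected_output": 1},
    {"input": "q3", "expected_output": 0},
    {"input": "q4", "expected_output": 0},
]
toy_results = evaluate_sequentially(lambda _: 0, lambda exp, pred: exp == pred, toy_data)
print(accuracy(toy_results))
```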

Now we can evaluate our model on a subset of the MedMCQA dataset. Beware that running the cell below will actually use your OpenAI credits to make predictions, which is why we run it on only a small subset of the dataset.

In [14]:
# BEWARE: calling this will actually consume your OpenAI credits, which is why we run on a very small subset
evaluation = evaluator.evaluate(medmcqa_dataset[5:10])

100%|██████████| 5/5 [00:15<00:00,  3.14s/it]


In [15]:
evaluation.accuracy()

0.6

In [16]:
correct, incorrect = evaluation.partition_by_correctness()

Now we can also examine incorrect results:

In [17]:
# Expected output was (B) and not (C)
incorrect[1]

{
    "input": {
        "question": "Which of the following are not a branch of external carotid Aery in Kiesselbach's plexus.",
        "options": [
            "Sphenopalatine aery",
            "Anterior ethmoidal aery",
            "Greater palatine aery",
            "Septal branch of superior labial aery"
        ]
    },
    "input_metadata": {
        "explanation": "*Kiesselbach's plexus: Antero superior pa is supplied by ANTERIOR & POSTERIOR ETHMOIDAL AERIES which are branches of ophthalmic aery, branch of INTERNAL CAROTID AERY. Antero inferior pa is supplied by SUPERIOR LABIAL AERY - branch of facial aery, which is branch of EXTERNAL CAROTID AERY. Postero superior pa is supplied by SPHENO-PALATINE AERY - branch of MAXILLARY aery, which is branch of ECA. POSTERO INFERIOR pa is supplied by branches of GREATER PALATINE AERY - branch of ECA Antero inferior pa/vestibule of septum contain anastomosis b/w septal ramus of superior labial branch of facial aery & branches of sphenopa

Finally, you can also save the evaluation to a file so that you can review or visualise the results later.

In [18]:
evaluation.save("medmcqa_validation_evaluation.json")

You can learn more about the available benchmarks in our [docs](https://medplexity.readthedocs.io/en/latest/).