## Medplexity
Medplexity is a framework for exploring the capabilities of LLMs in the medical domain. It provides interfaces to, and collections of, common benchmarks, LLMs, and prompts. In this tutorial we go over the main features of medplexity by running OpenAI's GPT-4 model against the MedMCQA dataset.


## Setup
Let's start by installing the latest version of medplexity if you haven't already:


In [1]:
# !pip install medplexity

## Medharness
The top-level abstraction in medplexity is the _Medharness_. It runs the evaluation of a given chain on a dataset. This may seem a bit overwhelming at first, but we will dive into every underlying component of Medharness shortly. Let's start by looking at the end result, which takes example cases from the multiple-choice benchmark MedMCQA and solves them with OpenAI's GPT-4:

In [2]:
from medplexity.medharness import Medharness
from medplexity.llms.openai_caller import OpenAI
from medplexity.chains.multiple_choice_question_chain import \
    MultipleChoiceEvaluationChain
from medplexity.benchmarks.dataset_factory import DatasetFactory
from medplexity.benchmarks.medmcqa import MedMCQADatasetBuilder
from medplexity.benchmarks.multiple_choice_utils import load_example_questions_from_json

harness = Medharness(
    dataset=DatasetFactory().build("medmcqa", "validation"),
    chain=MultipleChoiceEvaluationChain(
        llm=OpenAI(model="gpt-4", api_token="YOUR_TOKEN"),
        save_prompt=True,
        # Providing some additional examples for the prompt
        examples=load_example_questions_from_json(MedMCQADatasetBuilder.EXAMPLE_QUESTIONS_PATH)
    ),
)

In [3]:
# Let's now run evaluation on 1 item
evaluation = harness.run(k=1)
print("Accuracy: ", evaluation.accuracy())
evaluation

100%|██████████| 1/1 [00:11<00:00, 11.87s/it]

Accuracy:  1.0





EvaluationSummary(evaluation_results=[{
    "input": {
        "question": "Which of the following is not true for myelinated nerve fibers:",
        "options": [
            "Impulse through myelinated fibers is slower than non-myelinated fibers",
            "Membrane currents are generated at nodes of Ranvier",
            "Saltatory conduction of impulses is seen",
            "Local anesthesia is effective only when the nerve is not covered by myelin sheath"
        ],
        "context": null,
        "examples": null
    },
    "input_metadata": {
        "explanation": null,
        "subject_name": "Physiology"
    },
    "expected_output": "(A)",
    "output": "(A)",
    "output_metadata": {
        "explanation": "Let’s solve this step-by-step, referring to authoritative sources as needed. Myelinated nerve fibers are covered by a myelin sheath, which allows for faster transmission of nerve impulses compared to non-myelinated fibers. This is due to the fact that the nerve impul

These results can then be visualised and explored, as done on the [medplexity explorer](https://www.medplexityai.com/).

## Benchmarks
Let's start by selecting the benchmark that we want to evaluate against. In this example we use the MedMCQA dataset, a collection of multiple-choice questions designed to address real-world medical entrance exam questions. All available benchmarks are listed in the [docs](https://medplexity.readthedocs.io/en/latest/).

In [4]:
from medplexity.benchmarks.medmcqa import MedMCQADatasetBuilder

In [5]:
medmcqa_dataset = MedMCQADatasetBuilder().build_dataset(split_type="validation")

In [6]:
print(medmcqa_dataset.description)

Multiple-choice questions designed to address real-world medical entrance exam questions like AIIMS & NEET PG.
    This dataset encompasses over 194k high-quality MCQs spanning 2.4k healthcare topics and 21 medical subjects. Questions are accompanied by an explanation of the correct answer.

    Original paper: MedMCQA: A Large-scale Multi-Subject Multi-Choice Dataset for Medical domain Question Answering

    2022 · Ankit Pal, Logesh Kumar Umapathi, Malaikannan Sankarasubbu
    <https://arxiv.org/abs/2203.14371>

    Train/validation/test splits available.

    Dataset version used: <https://huggingface.co/datasets/medmcqa>
    


Note that for multiple-choice questions we follow a formatting convention: options and their answers are labelled (A)/(B)/(C)/(D).
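
As a rough illustration of this convention, the sketch below labels a list of options and builds an answer label. The helper names `format_options` and `answer_label` are hypothetical and not part of medplexity, and the library's own formatting may differ in details such as separators.

```python
from string import ascii_uppercase
from typing import List


def format_options(options: List[str]) -> str:
    """Label each option as (A), (B), ... and join them on one line."""
    labelled = [
        f"({ascii_uppercase[index]}) {text}" for index, text in enumerate(options)
    ]
    return " ".join(labelled)


def answer_label(index: int) -> str:
    """Return the label, e.g. "(A)", for a zero-based option index."""
    return f"({ascii_uppercase[index]})"


# Placeholder options, only to show the resulting layout.
print(format_options(["first", "second", "third", "fourth"]))
# (A) first (B) second (C) third (D) fourth
print(answer_label(0))
# (A)
```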

In [7]:
# Example data point
medmcqa_dataset[1]

{
    "input": {
        "question": "Which of the following is not true about glomerular capillaries')",
        "options": [
            "The oncotic pressure of the fluid leaving the capillaries is less than that of fluid entering it",
            "Glucose concentration in the capillaries is the same as that in glomerular filtrate",
            "Constriction of afferent aeriole decreases the blood flow to the glomerulas",
            "Hematocrit of the fluid leaving the capillaries is less than that of the fluid entering it"
        ],
        "context": null,
        "examples": null
    },
    "expected_output": "(A)",
    "metadata": {
        "explanation": "Ans-a. The oncotic pressure of the fluid leaving the capillaries is less than that of fluid entering it Guyton I LpJ1 4-.;anong 23/e p653-6_)Glomerular oncotic pressure (due to plasma protein content) is higher than that of filtrate oncotic pressure in Bowman's capsule\"Since glucose is freely filtered and the fluid in the B

## LLMs & Chains
Now let's create the LLM that we want to use for the evaluation. Medplexity provides interfaces to common APIs, such as the OpenAI API. If you don't have an OpenAI API key yet, you can get one on the [OpenAI website](https://openai.com/api).

In [8]:
from medplexity.llms.openai_caller import OpenAI

In [9]:
openai_llm = OpenAI(
    api_token="YOUR_TOKEN"
)
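
Rather than pasting the key into the notebook, one option is to read it from an environment variable. The sketch below is an assumption about how you might do that. The helper `read_api_token` is hypothetical, and the variable name `OPENAI_API_KEY` is only a common convention, not something medplexity is known to require.

```python
import os


def read_api_token(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the token stored in env_var, or raise a clear error."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return token


# Possible usage with the caller created above (left commented out so the
# sketch runs without a key being set):
# openai_llm = OpenAI(api_token=read_api_token())
```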

Having an LLM is not enough. We also need to find the right prompt and make sure that the output is in the correct format. For this we use the _Chain_ abstraction. Chains are wrappers around complicated sequences of operations with an LLM at their core. If you are familiar with LangChain, this is meant to be the same concept as its Chain.
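
To make the idea concrete, here is a minimal sketch of what a chain does conceptually: build a prompt, call the model, and parse the reply. `SketchChain` and `fake_llm` are hypothetical names used only for illustration. This is not medplexity's implementation, and the stub model lets the code run without any API access.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SketchChain:
    """Hypothetical chain: build a prompt, call the LLM, parse the reply."""

    llm: Callable[[str], str]    # any function mapping a prompt to a completion
    template: str                # prompt text with a {question} placeholder
    parse: Callable[[str], str]  # turns the raw completion into the final answer

    def __call__(self, question: str) -> str:
        prompt = self.template.format(question=question)
        completion = self.llm(prompt)
        return self.parse(completion)


def fake_llm(prompt: str) -> str:
    # Stand-in for a real API call so the sketch runs offline.
    return "Answer: (A)"


sketch_chain = SketchChain(
    llm=fake_llm,
    template="Question: {question}",
    parse=lambda text: text.split("Answer:")[-1].strip(),
)
print(sketch_chain("Which option is correct?"))  # (A)
```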

For this specific benchmark we already have a prompt template that uses chain-of-thought prompting. We combine it with a few examples that follow this answer format:
```
Explanation: ...
Answer: ...
```

In [10]:
from medplexity.prompts.multiple_choice_prompt import MultipleChoiceChainOfThoughtPrompt

prompt_template = MultipleChoiceChainOfThoughtPrompt()

In [11]:
print(prompt_template.PROMPT)

Instructions: The following are multiple choice questions about medical knowledge. Solve them in a step-by-step fashion,
starting by summarizing the available information. Output a single option from the given options as the final answer.
{examples}

{context}
Question: {question}
{options}



In [12]:
example_datapoint = medmcqa_dataset[2]

# If we fill in the blanks it would look as follows:
print(prompt_template.format(
    question=example_datapoint.input.question,
    options=example_datapoint.input.options
))

Instructions: The following are multiple choice questions about medical knowledge. Solve them in a step-by-step fashion,
starting by summarizing the available information. Output a single option from the given options as the final answer.



Question: A 29 yrs old woman with a pregnancy of 17 week has a 10 years old boy with down syndrome. She does not want another down syndrome kid; best advice to her is
(A) No test is required now as her age is below 35 years (B) Ultra sound at this point of time will definitely tell her that next baby will be down syndromic or not (C) Amniotic fluid samples plus chromosomal analysis will definitely tell her that next baby will be down syndromic or not (D) blood screening at this point of time will clear the exact picture



To prepare the LLM for the evaluation we need to transform the dataset inputs into the right prompt, and then transform the LLM output into the expected format for comparison, which is just a single option, e.g. "(A)".
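
The output transformation can be sketched as follows, assuming the completion follows the `Explanation: ... Answer: ...` format. The function `parse_completion` and its regular expression are hypothetical. The parsing that medplexity actually performs may be stricter or more lenient.

```python
import re
from typing import Optional, Tuple

# Matches "Answer: (A)", "Answer: A" or "answer: b", but not "Answer: Aspirin".
ANSWER_PATTERN = re.compile(r"Answer:\s*\(?([A-D])\)?(?![A-Za-z])", re.IGNORECASE)


def parse_completion(text: str) -> Tuple[str, Optional[str]]:
    """Split a completion into (explanation, answer label such as "(A)")."""
    match = ANSWER_PATTERN.search(text)
    if match is None:
        # No recognisable answer: keep the text, signal the missing label.
        return text.strip(), None
    explanation = text[: match.start()]
    explanation = re.sub(r"^\s*Explanation:\s*", "", explanation).strip()
    return explanation, f"({match.group(1).upper()})"


explanation, answer = parse_completion(
    "Explanation: Myelin speeds up conduction.\nAnswer: (A)"
)
print(answer)  # (A)
```

Returning `None` when no label is found lets the caller decide whether to count the item as incorrect or retry the request.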

In [13]:
from medplexity.chains.multiple_choice_question_chain import MultipleChoiceEvaluationChain

chain = MultipleChoiceEvaluationChain(
    llm=openai_llm,
    save_prompt=True,
    examples=load_example_questions_from_json(MedMCQADatasetBuilder.EXAMPLE_QUESTIONS_PATH)
)

## Evaluation
Evaluators accept your chain and run it on a given dataset. They generate a report that you can later examine at the level of individual predictions.
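
Conceptually, a sequential evaluation is a plain loop over the dataset followed by an aggregate metric. The sketch below shows that idea with hypothetical helpers (`evaluate_sequentially`, `accuracy`) and a stub predictor. It is a simplified stand-in and does not reproduce the library's evaluator or its report type.

```python
from typing import Callable, Iterable, List, Tuple


def evaluate_sequentially(
    items: Iterable[Tuple[str, str]],
    predict: Callable[[str], str],
) -> List[dict]:
    """Run predict on each (question, expected) pair, one at a time."""
    results = []
    for question, expected in items:
        output = predict(question)
        results.append(
            {"input": question, "expected_output": expected, "output": output}
        )
    return results


def accuracy(results: List[dict]) -> float:
    """Fraction of results whose output equals the expected output."""
    if not results:
        return 0.0
    hits = sum(r["output"] == r["expected_output"] for r in results)
    return hits / len(results)


# Stub predictor that always answers "(A)", so no API call is made.
items = [("first question", "(A)"), ("second question", "(B)")]
results = evaluate_sequentially(items, predict=lambda question: "(A)")
print(accuracy(results))  # 0.5
```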

In [14]:
from medplexity.evaluators.sequential_evaluator import SequentialEvaluator

# Sequential evaluator just goes over the items in the dataset one by one. In the future we plan to also support parallel evaluation.
evaluator = SequentialEvaluator()

Beware that running the cell below will actually use your OpenAI credits to make predictions, which is why we run it on only a small subset of the dataset.

In [15]:
# BEWARE!: calling this will actually consume your OpenAI credits, that's why we run on a very small subset
evaluation = evaluator.evaluate(medmcqa_dataset[:1], chain)

100%|██████████| 1/1 [00:03<00:00,  3.93s/it]


In [16]:
correct, incorrect = evaluation.partition_by_correctness()
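
For intuition, splitting results by correctness amounts to comparing each output with its expected output. The helper `split_by_correctness` below is a hypothetical sketch over plain dictionaries, not the library's method.

```python
from typing import List, Tuple


def split_by_correctness(results: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Return (correct, incorrect) lists based on exact label match."""
    correct_items, incorrect_items = [], []
    for result in results:
        if result["output"] == result["expected_output"]:
            correct_items.append(result)
        else:
            incorrect_items.append(result)
    return correct_items, incorrect_items


sample_results = [
    {"expected_output": "(A)", "output": "(A)"},
    {"expected_output": "(B)", "output": "(C)"},
]
correct_items, incorrect_items = split_by_correctness(sample_results)
print(len(correct_items), len(incorrect_items))  # 1 1
```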

Now we can also examine the results:

In [17]:
# The model answered (A), which matches the expected output
correct[0]

{
    "input": {
        "question": "Which of the following is not true for myelinated nerve fibers:",
        "options": [
            "Impulse through myelinated fibers is slower than non-myelinated fibers",
            "Membrane currents are generated at nodes of Ranvier",
            "Saltatory conduction of impulses is seen",
            "Local anesthesia is effective only when the nerve is not covered by myelin sheath"
        ],
        "context": null,
        "examples": null
    },
    "input_metadata": {
        "explanation": null,
        "subject_name": "Physiology"
    },
    "expected_output": "(A)",
    "output": "(A)",
    "output_metadata": {
        "explanation": "Let’s solve this step-by-step, referring to authoritative sources as needed. Myelinated nerve fibers have several characteristics. Impulses through myelinated fibers are actually faster than non-myelinated fibers because the myelin sheath acts as an insulator and allows for saltatory conduction. Membrane

Finally, you can also save this in a file to later have a look at the results or visualise them.

In [18]:
evaluation.save("medmcqa_validation_evaluation.json")

You can learn more about the available benchmarks in our [docs](https://medplexity.readthedocs.io/en/latest/).