# MedQDx Benchmark Creation

This script iterates over each patient case in a DataFrame to build a zero-shot diagnostic benchmark:
1. For each case, the “doctor” LLM (GPT-4.1) receives only the 50% version of the case and asks a first question.
2. The “patient” LLM (GPT-4o-mini) answers based on the 100% case.
3. After each answer, GPT-4.1 attempts a diagnosis, and we record its cosine similarity (computed on text-embedding-3-small embeddings) to the ground-truth disease.
4. The process repeats for a second and third round of questioning, each time recording the question, answer, diagnosis, and similarity.
5. Finally, the script saves a benchmark table with columns:
   prognosis, symptoms, 100% Case, 80% Case, 50% Case,
   Question_1, Answer_1, Diagnosis_1, Similarity_1,
   Question_2, Answer_2, Diagnosis_2, Similarity_2,
   Question_3, Answer_3, Diagnosis_3, Similarity_3

The resulting table is the MedQDx zero-shot benchmark: one row per case, showing how diagnostic similarity changes across the three rounds of questioning.
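
The column layout described above can be generated programmatically. The following is a minimal sketch rather than part of the original pipeline: the helper name `benchmark_columns` is hypothetical, and the exact spelling of the base columns (in particular `80% Case` and `symptoms`) is assumed from the description and from the `50% Case` / `100% Case` lookups used later in the notebook.

```python
# Base columns are assumed to already exist in the input CSV.
BASE_COLUMNS = ["prognosis", "symptoms", "100% Case", "80% Case", "50% Case"]
# Fields recorded once per questioning round.
ROUND_FIELDS = ["Question", "Answer", "Diagnosis", "Similarity"]


def benchmark_columns(n_rounds=3):
    """Return the expected output columns for a benchmark with n_rounds rounds."""
    cols = list(BASE_COLUMNS)
    for i in range(1, n_rounds + 1):
        cols.extend(f"{field}_{i}" for field in ROUND_FIELDS)
    return cols


print(benchmark_columns())
```

A list like this could be compared against `df.columns` before saving, to catch a misspelled column early.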


In [None]:
!pip install -q --upgrade openai
!pip install -q transformers accelerate bitsandbytes



In [None]:
import os
import re
import torch
import openai
import pandas as pd
from tqdm import tqdm
from openai import AzureOpenAI
from google.colab import files
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
from google.colab import drive
drive.mount('/content/drive')



# Doctor LLM - GPT 4.1

In [None]:
DoctorGPTclient = AzureOpenAI(
    api_version    = "2024-12-01-preview",
    azure_endpoint = "MedQDx_ENDPOINT",
    api_key        = "MedQDx__API_key",
)
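
The endpoint and key above are placeholders. One way to avoid pasting real credentials into a notebook is to read them from environment variables. The sketch below is an assumption, not the authors' setup: the helper `load_azure_settings` and the variable names `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, and `AZURE_OPENAI_API_VERSION` are chosen for illustration.

```python
import os


def load_azure_settings(env=os.environ):
    """Collect Azure OpenAI client settings from environment variables.

    The variable names are assumptions made for this sketch.
    """
    required = ("AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_API_KEY")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "azure_endpoint": env["AZURE_OPENAI_ENDPOINT"],
        "api_key": env["AZURE_OPENAI_API_KEY"],
        # Fall back to the API version used throughout this notebook.
        "api_version": env.get("AZURE_OPENAI_API_VERSION", "2024-12-01-preview"),
    }
```

The returned dictionary uses the same keyword names as the client constructor above, so it could be passed as `AzureOpenAI(**load_azure_settings())`.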

## Generating doctor question

In [None]:
def GPT_doctor(prompt, max_tokens=800):
    """Sends a prompt to the doctor model (GPT-4.1) and returns the text of its reply."""
    response = DoctorGPTclient.chat.completions.create(
        messages=[
            {"role": "system", "content": "You are a medical doctor."},
            {"role": "user", "content": prompt},
        ],
        # Cap on generated tokens; max_tokens is now actually passed through.
        max_completion_tokens=max_tokens,
        temperature=1.0,
        top_p=1.0,
        frequency_penalty=0.0,
        presence_penalty=0.0,
        model="gpt-4.1",
    )

    return response.choices[0].message.content
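
Because the benchmark loop makes several API calls per case, a single transient failure can abort a long run. The following is a minimal retry sketch with exponential backoff. It is not part of the original notebook; `call_with_retries` and its parameters are hypothetical names, and it retries on any exception, which may be broader than desired.

```python
import time


def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn(*args, **kwargs), retrying with exponential backoff on failure.

    Waits base_delay, then 2 * base_delay, and so on between attempts.
    The last exception is re-raised once max_attempts is reached.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

With this helper, a call such as `GPT_doctor(prompt)` could be written as `call_with_retries(GPT_doctor, prompt)`.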

## Building the prompts for the doctor's questions

In [None]:
def build_first_doctor_question_prompt(case_text):
    prompt = f"""
You are a medical doctor conducting a diagnostic interview with a patient.
You have received partial information about the case.
Your goal is to ask exactly **ONE** relevant, case-specific question that will enable a more precise diagnosis.

Rules:
- Output exactly one question as a single, complete sentence.
- The question must be a single line, ending in ‘?’
- The question must be specific to the case.
- Do not include any explanations, reasoning, or additional text.
- Do not provide a diagnosis or suggest treatments in the question.
- Base the question solely on the provided partial case information.

Partial case Information:
{case_text}

### User: What is the first best diagnostic question you want to ask the patient (one question)?
### Assistant:
"""
    return prompt.strip()

In [None]:
def build_doctor_question_prompt(case_text: str, history: str, previous_questions: list) -> str:
    """
    Constructs a prompt for the doctor to ask exactly one new question,
    ensuring it does NOT repeat any question in previous_questions.
    """

    # If there are previous questions, format them as bullet points
    # so the model can see what's already been asked and avoid repetition.
    if previous_questions:
        prev_section = "Previous questions asked (do NOT repeat any of these):\n"
        for q in previous_questions:
            prev_section += f"- {q}\n"
    else:
        prev_section = ""  # Leave empty if no prior questions

    prompt = f"""
You are a medical doctor conducting a diagnostic interview with a patient.
You have received partial information about the case and the past conversation with the patient.
Your goal is to ask exactly **ONE** new, relevant, case-specific question that will enable a more precise diagnosis.
Ensure that this new question is **word-for-word different** from all previous questions.

Rules:
- Output **ONLY ONE** question as a single, complete sentence ending with a question mark.
- Do NOT include any explanations, reasoning, or additional text—only the question itself.
- The question must be a single line, ending in ‘?’
- Do NOT provide a diagnosis or suggest treatments.
- Base the question on the partial case information and past conversation with the patient.
- Ask a new question to obtain additional information for a better diagnosis.
- If the patient responded "I'm not sure," ask a broader or differently phrased question to elicit new information.

Partial case Information:
{case_text}

Past Conversation with the patient:
{history}

{prev_section}

### User: Next, output one NEW question you would ask the patient:
### Assistant:
"""
    return prompt.strip()
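
The prompts ask for exactly one single-line question that ends in a question mark and does not repeat an earlier one, but nothing in the notebook verifies that the model complied. A small check along the following lines could flag malformed outputs. This is a sketch under assumptions: `is_valid_question` is a hypothetical helper, and treating "repeat" as a case-insensitive exact match is a simplification.

```python
def is_valid_question(text, previous_questions=()):
    """Check that text is one single-line question not asked before."""
    question = text.strip()
    # Must be non-empty and fit on a single line.
    if not question or "\n" in question:
        return False
    # Must end with a question mark and contain only one question.
    if not question.endswith("?") or question.count("?") != 1:
        return False
    # Must differ from every earlier question (case-insensitive comparison).
    seen = {q.strip().lower() for q in previous_questions}
    return question.lower() not in seen
```

A failed check could trigger a second call to the doctor model, or simply be logged for later inspection.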


# Patient LLM - GPT-4o Mini

In [None]:
!pip install azure-ai-inference



In [None]:
endpoint       = "MedQDx_ENDPOINT"
api_key        = "MedQDx__API_key"
api_version    = "2024-12-01-preview"
deployment     = "gpt-4o-mini"

client = AzureOpenAI(
    api_version    = api_version,
    azure_endpoint = endpoint,
    api_key        = api_key,
)

In [None]:
def build_patient_answer_prompt(Full_case, doctor_question):
    prompt = (
                f"""You are a patient who has provided a detailed case history.  Your task is to answer **only** the doctor’s question, using information from the case description below.  Do not add, remove, or invent any details.  Answer in first person, as a realistic patient would.

–––––––––––––––––––––––––––––––––––––––––––––––––––––
FULL PATIENT CASE:
{Full_case}
–––––––––––––––––––––––––––––––––––––––––––––––––––––

DOCTOR’S QUESTION:
{doctor_question}

PATIENT INSTRUCTIONS:
•  Read the “FULL PATIENT CASE” carefully.
•  When you answer, respond as “the patient” in first person (e.g., “I have been feeling…,” “Yes, I have noticed…”).
•  Use only information that appears in the case description.
•  If the doctor’s question refers to a symptom or detail that is **not** in the case, reply honestly: “No,” or “I have not noticed that.”
•  Keep your answer concise—just those facts from the case that directly address the question.
•  Do **not** volunteer any additional background, diagnosis, or speculation.

PATIENT RESPONSE:
"""
    )
    return prompt.strip()

## Generating patient answer to doctor question

In [None]:
def patient_answer(Full_case, doctor_question):
    prompt = build_patient_answer_prompt(Full_case, doctor_question)
    response = client.chat.completions.create(
        model=deployment,
        messages=[
            {"role": "system", "content": "You are a patient answering your doctor's questions based on the symptoms that your case presents."},
            {"role": "user", "content": prompt}
        ],
        max_tokens=300,
        temperature=0.8,
        top_p=0.95
    )
    return response.choices[0].message.content.strip()

# Doctor diagnosis (GPT 4.1)

## Creating an updated prompt for doctor diagnosis and diagnosing

In [None]:
def build_doctor_diagnosis_prompt(case_text, history):
    """
    Creates a prompt for the doctor to provide a diagnosis based on case and history.
    """
    prompt = f"""
***You are a medical doctor***. Your task is to provide a single most likely diagnosis based on the partial case information and the past conversation with the patient.

Rules:
- Analyze the case details and patient responses.
- Use clinical reasoning to determine the most probable diagnosis.
- Output ONLY the name of the disease or condition using correct medical term (e.g., Pneumonia, Hypoglycemia).
- Do not include any notes, explanations, disclaimers, or additional text.
- Do not output symbols like ### or other placeholders.
- Do not repeat the case symptoms.
- Do not repeat the patient's answers.

Case Information:
{case_text}

Conversation History:
{history}

### User: The patient diagnosis is:
### Assistant:
"""
    return prompt.strip()

## Checking the similarity between the doctor's diagnosis and the ground-truth disease


In [None]:
embeddings_client = openai.AzureOpenAI(
    api_key="MedQDx_API_key",
    azure_endpoint="MedQDx_ENDPOINT",
    api_version="2024-12-01-preview"
)

def get_embedding(text, embeddings_client):
    """
    Gets the embedding for a given text using Azure OpenAI.
    """
    try:
        response = embeddings_client.embeddings.create(
            input=text,
            model="text-embedding-3-small"
        )
        return response.data[0].embedding
    except Exception as e:
        print(f"Error generating embedding: {e}")
        return None

In [None]:
def compute_similarity(pred, label, embeddings_client):
    """
    Computes cosine similarity between predicted and ground-truth diagnosis.
    """
    if not pred or not label:
        return 0.0
    emb1 = get_embedding(pred, embeddings_client)
    emb2 = get_embedding(label, embeddings_client)
    if emb1 is None or emb2 is None:
        return 0.0
    return cosine_similarity([emb1], [emb2])[0][0]
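
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below reimplements it in plain Python on toy vectors to show what the score measures. It is illustrative only: the notebook itself uses scikit-learn's `cosine_similarity`, and the function name `cosine` is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


print(cosine([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))  # orthogonal
print(cosine([1.0, 1.0], [1.0, 0.0]))  # 45 degrees apart
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, and the 45-degree case scores about 0.707. Scores for real embeddings should be compared against each other rather than read as a probability of a correct diagnosis.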

# Loading the CSV and preparing the DataFrame

In [None]:
file_path = '/content/drive/MyDrive/NLP/patient_cases.csv'

In [None]:
df = pd.read_csv(file_path)

for i in range(1, 4):
    df[f"Question_{i}"] = ""
    df[f"Answer_{i}"] = ""
    df[f"Diagnosis_{i}"] = ""
    df[f"Similarity_{i}"] = None

# Benchmark creation

In [None]:
for idx, row in df.iterrows():
    case_text = row['50% Case']
    Full_case = row['100% Case']
    gt_prognosis = row["prognosis"]
    history = ""
    previous_questions = []

    print(f"\n====== Start Case {idx} ======")

    for round_num in range(1, 4):
        print(f"\n--- Round {round_num} ---")
        # Doctor asks question
        if round_num == 1:
            doctor_q_prompt = build_first_doctor_question_prompt(case_text)
        else:
            doctor_q_prompt = build_doctor_question_prompt(case_text, history, previous_questions)
        doctor_question = GPT_doctor(doctor_q_prompt)
        previous_questions.append(doctor_question)
        print(f"Doctor Question (round {round_num}): {doctor_question}")
        df.at[idx, f"Question_{round_num}"] = doctor_question

        # Patient answers
        patient_ans = patient_answer(Full_case, doctor_question)
        print(f"Patient Answer (round {round_num}): {patient_ans}")
        df.at[idx, f"Answer_{round_num}"] = patient_ans

        # Updating history
        history += f"Doctor: {doctor_question}\nPatient: {patient_ans}\n"

        # Doctor provides diagnosis
        prompt = build_doctor_diagnosis_prompt(case_text, history)
        doctor_diagnosis = GPT_doctor(prompt)
        print(f"Doctor Diagnosis (round {round_num}): {doctor_diagnosis}")
        df.at[idx, f"Diagnosis_{round_num}"] = doctor_diagnosis

        # Computing similarity
        similarity = compute_similarity(doctor_diagnosis, gt_prognosis, embeddings_client)
        print(f"Similarity (round {round_num}): {similarity}")
        df.at[idx, f"Similarity_{round_num}"] = similarity


# Final save
output_path = "MedQDx_benchmark.csv"
df.to_csv(output_path, index=False)
files.download(output_path)




--- Round 1 ---
Doctor Question (round 1): Have you experienced a sore throat or any difficulty swallowing recently?
Patient Answer (round 1): No, I have not noticed that.
Doctor Diagnosis (round 1): Infectious mononucleosis
Similarity (round 1): 0.3970650055518463

--- Round 2 ---
Doctor Question (round 2): Have you had any recent contact with individuals who were sick or had similar symptoms?
Patient Answer (round 2): I have not noticed that.
Doctor Diagnosis (round 2): Infectious mononucleosis
Similarity (round 2): 0.3970650055518463

--- Round 3 ---
Doctor Question (round 3): Have you recently traveled outside your usual area or been exposed to animals?
Patient Answer (round 3): No, I have not noticed that.
Doctor Diagnosis (round 3): Infectious mononucleosis
Similarity (round 3): 0.3970650055518463


--- Round 1 ---
Doctor Question (round 1): Have you noticed any recent changes in your blood pressure, such as episodes of high or low blood pressure?
Patient Answer (round 1): I ha
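
If only the printed log survives, as in the truncated output above, the per-round similarity can still be recovered from it. The sketch below parses `Similarity (round N): value` lines and averages them per round. It is an illustration, not part of the original notebook: `mean_similarity_by_round` is a hypothetical helper, and the sample text reuses the three similarity lines printed for the first case.

```python
import re
from collections import defaultdict
from statistics import mean

# Matches lines such as "Similarity (round 2): 0.3970650055518463".
SIM_LINE = re.compile(r"^Similarity \(round (\d+)\): ([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)$")


def mean_similarity_by_round(log_text):
    """Return {round_number: mean similarity} parsed from the printed log."""
    scores = defaultdict(list)
    for line in log_text.splitlines():
        match = SIM_LINE.match(line.strip())
        if match:
            scores[int(match.group(1))].append(float(match.group(2)))
    return {rnd: mean(vals) for rnd, vals in sorted(scores.items())}


sample_log = """\
Similarity (round 1): 0.3970650055518463
Similarity (round 2): 0.3970650055518463
Similarity (round 3): 0.3970650055518463
"""
print(mean_similarity_by_round(sample_log))
```

When the saved CSV is available, averaging the `Similarity_1` to `Similarity_3` columns directly would be simpler and more reliable than parsing the log.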
