# Models evaluation

This is notebook for evaluation our 3 models we adapted to medical domain using different methods. We choose to evaluate on the set of benchmarks from [Open Medical-LLM Leaderboard](https://huggingface.co/spaces/openlifescienceai/open_medical_llm_leaderboard) including:

* [MedMCQA](https://huggingface.co/datasets/openlifescienceai/medmcqa) - mcq, 1k from test split
* [MedQA](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options-hf) - mcq, 1k from test splits
* [MMLU](https://huggingface.co/datasets/cais/mmlu) - mcq, 600 from 6 medical subsets test split
* [PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA) - qa, 1k from pqa_labeled subset train split

In [1]:
%%capture
!pip install datasets vllm

In [2]:
SEED=4242

## Benchmarks loading and preparing for evaluation

### MedMCQA

In [3]:
from datasets import load_dataset

ds_medmcqa = load_dataset("openlifescienceai/medmcqa", split="validation")
ds_medmcqa = ds_medmcqa.shuffle(seed=SEED).select(range(100))  # 1000
ds_medmcqa

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/85.9M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/936k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.48M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/182822 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6150 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/4183 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'question', 'opa', 'opb', 'opc', 'opd', 'cop', 'choice_type', 'exp', 'subject_name', 'topic_name'],
    num_rows: 100
})

In [4]:
ds_medmcqa[0]

{'id': '4653fb7a-ddbf-493b-b4ef-92205582a27a',
 'question': 'Which of the following tooth is not having 5 cusps?',
 'opa': 'Mandibular 2nd Molar',
 'opb': 'Mandibular 1st Molar',
 'opc': 'Mandibular 3rd Molar',
 'opd': 'Maxillary 1st Molar',
 'cop': 0,
 'choice_type': 'single',
 'exp': None,
 'subject_name': 'Dental',
 'topic_name': None}

In [5]:
def format_prompt(example):
    """Formats a single example into a prompt for the LLM."""
    question = example['question']
    options = {
        "A": example['opa'],
        "B": example['opb'],
        "C": example['opc'],
        "D": example['opd']
    }
    
    prompt = f"""
        You are an expert in solving multiple-choice questions accurately and explaining your reasoning clearly.
        Given a question and a list of answer choices (A, B, C, D), your task is to:
        1. Reason shortly about the question and answer choices to find evidances to support your answer.
        2. Identify the correct answer. Please choose the single best answer from the options provided.
        3. Output the final answer in the format: Answer: [Option Letter]

        Question: {question}
        Options:
        A. {options['A']}
        B. {options['B']}
        C. {options['C']}
        D. {options['D']}

        Reasoning:
    """
    return prompt

In [6]:
def get_ground_truth(example):
    """Maps the correct option index (cop) to the corresponding letter."""
    mapping = {0: 'A', 1: 'B', 2: 'C', 3: 'D'}
    cop_index = example.get('cop')
    if cop_index is None or cop_index not in mapping:
        print(f"Warning: Invalid 'cop' value found: {cop_index} in example ID {example.get('id')}. Skipping ground truth.")
        return None
    return mapping[cop_index]

In [7]:
import re
from tqdm import tqdm

def extract_choice(generated_text):
    """Extracts the predicted choice (A, B, C, or D) from the LLM's output."""
    text = generated_text.strip()

    # 1. Check for direct answer at the beginning (e.g., "A", "A.", "A)")
    match = re.match(r"^\s*([A-D])(?:[.)\s]|$)", text, re.IGNORECASE)
    if match:
        return match.group(1).upper()

    # 2. Check for phrases like "The answer is A" or "Answer: A"
    match = re.search(r'(?:answer|choice|option) is\s*:?\s*([A-D])', text, re.IGNORECASE)
    if match:
        return match.group(1).upper()

    # 3. Look for the first standalone letter A, B, C, or D in the text
    match = re.search(r'\b([A-D])\b', text)
    if match:
        return match.group(1).upper()

    # Fallback: If no clear choice found, return None
    print(f"Warning: Could not extract choice from text: '{text[:1000]}...'")
    return None

In [8]:
print("\n--- Preparing Prompts and Ground Truths ---")
prompts = [format_prompt(ex) for ex in tqdm(ds_medmcqa, desc="Formatting prompts")]
ground_truths = [get_ground_truth(ex) for ex in tqdm(ds_medmcqa, desc="Extracting ground truths")]
valid_indices = [i for i, gt in enumerate(ground_truths) if gt is not None]

if len(valid_indices) < len(ground_truths):
     print(f"Warning: {len(ground_truths) - len(valid_indices)} examples had invalid ground truths and were excluded.")
     prompts = [prompts[i] for i in valid_indices]
     ground_truths = [ground_truths[i] for i in valid_indices]
     # If you need to keep track of original dataset items, adjust here
     original_indices = valid_indices # Store the indices from the original dataset

print(f"Prepared {len(prompts)} prompts for evaluation.")
if len(prompts) > 0:
    print("\nExample Prompt:")
    print(prompts[0])
    print(f"Corresponding Ground Truth: {ground_truths[0]}")
else:
    print("No valid prompts to evaluate.")
    exit()


--- Preparing Prompts and Ground Truths ---


Formatting prompts: 100%|██████████| 100/100 [00:00<00:00, 6413.51it/s]
Extracting ground truths: 100%|██████████| 100/100 [00:00<00:00, 8365.35it/s]

Prepared 100 prompts for evaluation.

Example Prompt:

        You are an expert in solving multiple-choice questions accurately and explaining your reasoning clearly.
        Given a question and a list of answer choices (A, B, C, D), your task is to:
        1. Reason shortly about the question and answer choices to find evidances to support your answer.
        2. Identify the correct answer. Please choose the single best answer from the options provided.
        3. Output the final answer in the format: Answer: [Option Letter]

        Question: Which of the following tooth is not having 5 cusps?
        Options:
        A. Mandibular 2nd Molar
        B. Mandibular 1st Molar
        C. Mandibular 3rd Molar
        D. Maxillary 1st Molar

        Reasoning:
    
Corresponding Ground Truth: A





In [9]:
from vllm import LLM, SamplingParams
import torch

# 3. Load Model using vLLM
print("\n--- Loading LLM with vLLM ---")
model_name = "MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged"
# model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
try:
    llm = LLM(
        model=model_name,
        # tensor_parallel_size=1,
        # trust_remote_code=True,
        # gpu_memory_utilization=0.9,
        dtype=torch.float16,
    )
    sampling_params = SamplingParams(
        max_tokens=4096,
        temperature=0.01,
        top_p=1.0,
        top_k=-1
    )
    print(f"LLM '{model_name}' loaded successfully.")
except Exception as e:
    print(f"Error loading LLM with vLLM: {e}")
    print("Please ensure the MODEL_ID is correct, vLLM is installed, and you have compatible hardware (GPU).")
    exit()

INFO 04-11 15:49:23 [__init__.py:239] Automatically detected platform cuda.


2025-04-11 15:49:25.371252: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1744386565.554603      31 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744386565.605265      31 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered



--- Loading LLM with vLLM ---


config.json:   0%|          | 0.00/820 [00:00<?, ?B/s]

INFO 04-11 15:49:52 [config.py:600] This model supports multiple tasks: {'embed', 'classify', 'reward', 'generate', 'score'}. Defaulting to 'generate'.
INFO 04-11 15:49:52 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-11 15:49:52 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.3) with config: model='MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged', speculative_config=None, tokenizer='MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), obse

tokenizer_config.json:   0%|          | 0.00/7.09k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/495 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/242 [00:00<?, ?B/s]

INFO 04-11 15:49:58 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 04-11 15:49:58 [cuda.py:289] Using XFormers backend.


[W411 15:50:09.794407292 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3


INFO 04-11 15:50:19 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-11 15:50:19 [model_runner.py:1110] Starting to load model MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged...


[W411 15:50:19.805167373 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3


INFO 04-11 15:50:19 [weight_utils.py:265] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

INFO 04-11 15:50:33 [weight_utils.py:281] Time spent downloading weights for MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged: 13.589614 seconds
INFO 04-11 15:50:33 [weight_utils.py:315] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 04-11 15:50:36 [loader.py:447] Loading weights took 3.21 seconds
INFO 04-11 15:50:37 [model_runner.py:1146] Model loading took 3.3461 GiB and 17.475646 seconds
INFO 04-11 15:50:39 [worker.py:267] Memory profiling takes 1.34 seconds
INFO 04-11 15:50:39 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.90) = 13.27GiB
INFO 04-11 15:50:39 [worker.py:267] model weights take 3.35GiB; non_torch_memory takes 0.05GiB; PyTorch activation peak memory takes 1.39GiB; the rest of the memory reserved for KV Cache is 8.48GiB.
INFO 04-11 15:50:39 [executor_base.py:112] # cuda blocks: 19855, # CPU blocks: 9362
INFO 04-11 15:50:39 [executor_base.py:117] Maximum concurrency for 131072 tokens per request: 2.42x
INFO 04-11 15:50:45 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:38<00:00,  1.09s/it]

INFO 04-11 15:51:23 [model_runner.py:1598] Graph capturing finished in 38 secs, took 0.19 GiB
INFO 04-11 15:51:23 [llm_engine.py:448] init engine (profile, create kv cache, warmup model) took 46.68 seconds
LLM 'MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged' loaded successfully.





In [10]:
import math

BATCH_SIZE=4

# 4. Run Inference
print("\n--- Running Inference ---")
all_outputs_text = []
num_batches = math.ceil(len(prompts) / BATCH_SIZE)

for i in tqdm(range(num_batches), desc="Generating Responses"):
    start_idx = i * BATCH_SIZE
    end_idx = min((i + 1) * BATCH_SIZE, len(prompts))
    batch_prompts = prompts[start_idx:end_idx]

    # Generate responses for the batch
    # Note: vLLM's generate method handles batching internally based on available memory,
    # but we iterate in logical batches here for progress tracking and potentially managing large datasets.
    # However, for efficiency, you could potentially pass the entire `prompts` list directly to `llm.generate`
    # if memory allows, letting vLLM handle the internal batching. We keep the loop for clarity.
    outputs = llm.generate(batch_prompts, sampling_params, use_tqdm=False) # Disable vllm's tqdm

    # Extract the generated text for this batch
    batch_outputs_text = [output.outputs[0].text.strip() for output in outputs]
    all_outputs_text.extend(batch_outputs_text)

print(f"Generated {len(all_outputs_text)} responses.")
if len(all_outputs_text) > 0:
    print("\nExample Generated Text (raw):")
    print(all_outputs_text[0])

# 5. Extract Predictions
print("\n--- Extracting Predictions ---")
predictions = [extract_choice(text) for text in tqdm(all_outputs_text, desc="Extracting choices")]
num_invalid_responces = predictions.count(None)
print(f"\n### Number of invalid responces: {num_invalid_responces}")

if len(predictions) > 0:
    print("\nExample Extracted Prediction:")
    print(predictions[0])


--- Running Inference ---


Generating Responses: 100%|██████████| 25/25 [12:52<00:00, 30.88s/it]


Generated 100 responses.

Example Generated Text (raw):
Your reasoning should be concise and directly address the question.
"

<think>
Alright, let's figure out which tooth doesn't have 5 cusps. First, I know that the number of cusps on a tooth is pretty important for its function. More cusps mean the tooth is more robust and can handle more force. So, I'm thinking about the different types of teeth.

Now, let's look at the options. We've got the mandibular 2nd, 1st, and 3rd molars, and the maxillary 1st molar. I remember that the mandibular molars, like the 2nd and 3rd, have 5 cusps. That's a pretty standard number for them.

But what about the maxillary 1st molar? Hmm, I'm not as familiar with its cusps. I think it might have fewer cusps, maybe 4 or 5? I should double-check that.

Oh, right! I've read somewhere that the maxillary 1st molar actually has 4 cusps. That's definitely different from the others. So, it seems like the maxillary 1st molar doesn't have 5 cusps.

Let's just mak

Extracting choices: 100%|██████████| 100/100 [00:00<00:00, 6125.58it/s]

     **Step 2:** What is the most common cause of a 26-year-old woman's symptoms?
     **Step 3:** What is the most common cause of a 26-year-old woman's symptoms?
     **Step 4:** What is the most common cause of a 26-year-old woman's symptoms?
     **Step 5:** What is the most common cause of a 26-year-old woman's symptoms?
     **Step 6:** What is the most common cause of a 26-year-old woman's symptoms?
     **Step 7:** What is the most common cause of a 26-year-old woman's symptoms?
     **Step 8:** What is the most common cause of a 26-year-old woman's symptoms?
     **Step 9:** What is the most common cause of a 26-year-old woman's symptoms?
     **Step 10:** What is the most common cause of a 26-year-old woman's symptoms?
     **Step 11:** What is the most common cause of a 26-year-old woman's symptoms?
     **Step 12:** What is the most common cause of a 26-year-old woman's symptoms?
     **Step 13:** ...'
     Your conclusion:
     Answer: 

</think>
To determine which medicat




In [11]:
# 6. Calculate Metrics
print("\n--- Calculating Metrics ---")
correct_count = 0
total_count = len(predictions)
results_by_subject = {} # For per-subject accuracy

if total_count != len(ground_truths):
     print(f"Warning: Mismatch between number of predictions ({total_count}) and ground truths ({len(ground_truths)}). This should not happen.")
     # Adjust counts if necessary, though this indicates an earlier error
     total_count = min(total_count, len(ground_truths))

detailed_results = []

for i in range(total_count):
    # Use original_indices if subset was selected due to invalid ground truths
    original_data_index = original_indices[i] if 'original_indices' in locals() else i
    data_item = ds_medmcqa[original_data_index]
    subject = data_item.get('subject_name', 'Unknown') # Handle missing subject

    pred = predictions[i]
    truth = ground_truths[i]

    is_correct = (pred == truth)

    if subject not in results_by_subject:
        results_by_subject[subject] = {'correct': 0, 'total': 0}

    if is_correct:
        correct_count += 1
        results_by_subject[subject]['correct'] += 1

    results_by_subject[subject]['total'] += 1

    # Store detailed results for potential analysis
    detailed_results.append({
        'id': data_item.get('id', f'index_{original_data_index}'),
        'prompt': prompts[i],
        'generated_text': all_outputs_text[i],
        'prediction': pred,
        'ground_truth': truth,
        'subject': subject,
        'is_correct': is_correct
    })

# Calculate overall accuracy
overall_accuracy = (correct_count / total_count) * 100 if total_count > 0 else 0


--- Calculating Metrics ---


In [12]:
import pandas as pd

dataset_name="medmcqa"

# 7. Print Final Results
print("\n--- Evaluation Results ---")
print(f"Model Evaluated: {model_name}")
print(f"Dataset Used: {dataset_name}")
print(f"Number of Questions Evaluated: {total_count}")
print(f"Number of Correct Answers: {correct_count}")
print(f"Overall Accuracy: {overall_accuracy:.2f}%")

print("\nAccuracy by Subject:")
# Sort subjects alphabetically for consistent output
sorted_subjects = sorted(results_by_subject.keys())
for subject in sorted_subjects:
    counts = results_by_subject[subject]
    sub_acc = (counts['correct'] / counts['total']) * 100 if counts['total'] > 0 else 0
    print(f"- {subject}: {sub_acc:.2f}% ({counts['correct']}/{counts['total']})")

# Optional: Save detailed results to a file
print("\n--- Saving Detailed Results (Optional) ---")
try:
    results_df = pd.DataFrame(detailed_results)
    output_filename = f"evaluation_results_{model_name.split('/')[-1]}_{dataset_name.split('/')[-1] if dataset_name else 'local'}.csv"
    results_df.to_csv(output_filename, index=False)
    print(f"Detailed results saved to '{output_filename}'")
except Exception as e:
    print(f"Could not save detailed results to CSV: {e}")


--- Evaluation Results ---
Model Evaluated: MilyaShams/DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged
Dataset Used: medmcqa
Number of Questions Evaluated: 100
Number of Correct Answers: 18
Overall Accuracy: 18.00%

Accuracy by Subject:
- Anaesthesia: 0.00% (0/1)
- Anatomy: 0.00% (0/2)
- Biochemistry: 0.00% (0/5)
- Dental: 38.24% (13/34)
- ENT: 50.00% (1/2)
- Forensic Medicine: 0.00% (0/1)
- Gynaecology & Obstetrics: 0.00% (0/12)
- Medicine: 0.00% (0/3)
- Microbiology: 0.00% (0/3)
- Ophthalmology: 0.00% (0/2)
- Pathology: 0.00% (0/3)
- Pediatrics: 14.29% (1/7)
- Pharmacology: 0.00% (0/7)
- Physiology: 25.00% (1/4)
- Radiology: 0.00% (0/2)
- Social & Preventive Medicine: 50.00% (1/2)
- Surgery: 10.00% (1/10)

--- Saving Detailed Results (Optional) ---
Detailed results saved to 'evaluation_results_DeepSeek-R1-Distill-Qwen-1.5B-medical-sft-merged_medmcqa.csv'
