## Evaluation: base model

This is notebook for **base model evaluation** to compare with adapted model. We choose to evaluate on the set of benchmarks from [Open Medical-LLM Leaderboard](https://huggingface.co/spaces/openlifescienceai/open_medical_llm_leaderboard) including:

* [MedMCQA](https://huggingface.co/datasets/openlifescienceai/medmcqa) - MCQ, 200 samples from validation split
* [MedQA](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options-hf) - MCQ, 200 samples from validation split
* [MMLU](https://huggingface.co/datasets/cais/mmlu) - MCQ, 200 samples from test splits of 6 medical subsets
* [PubMedQA](https://huggingface.co/datasets/qiaojin/PubMedQA) - QA, 200 samples from train split of pqa_labeled subset

*Base model:* [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)

### Setup

In [1]:
%%capture
!pip install datasets vllm

In [2]:
import re
from tqdm import tqdm
import math
import pandas as pd
from datasets import load_dataset, concatenate_datasets
from vllm import LLM, SamplingParams
import torch

INFO 04-13 13:26:29 [__init__.py:239] Automatically detected platform cuda.


2025-04-13 13:26:31.255869: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1744550791.484681      31 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1744550791.547479      31 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


### Base model loading

In [3]:
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
MAX_TOKENS = 4096

In [4]:
print("\n--- Loading LLM with vLLM ---")
try:
    llm = LLM(
        model=MODEL_NAME,
        tensor_parallel_size=2,
        dtype=torch.float16,
    )
    sampling_params = SamplingParams(
        max_tokens=MAX_TOKENS,
        temperature=0.01,
        top_p=1.0,
        top_k=-1
    )
    print(f"LLM '{MODEL_NAME}' loaded successfully.")
except Exception as e:
    print(f"Error loading LLM with vLLM: {e}")
    print("Please ensure the MODEL_NAME is correct, vLLM is installed, and you have compatible hardware (GPU).")
    exit()


--- Loading LLM with vLLM ---


config.json:   0%|          | 0.00/679 [00:00<?, ?B/s]

INFO 04-13 13:26:58 [config.py:600] This model supports multiple tasks: {'score', 'embed', 'generate', 'reward', 'classify'}. Defaulting to 'generate'.
INFO 04-13 13:26:58 [config.py:1600] Defaulting to use mp for distributed inference
INFO 04-13 13:26:58 [config.py:1780] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-13 13:26:58 [llm_engine.py:242] Initializing a V0 LLM engine (v0.8.3) with config: model='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_b

tokenizer_config.json:   0%|          | 0.00/3.07k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/7.03M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/181 [00:00<?, ?B/s]

[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:27:01 [multiproc_worker_utils.py:225] Worker ready; awaiting tasks
INFO 04-13 13:27:02 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 04-13 13:27:02 [cuda.py:289] Using XFormers backend.
[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:27:02 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:27:02 [cuda.py:289] Using XFormers backend.


[W413 13:27:13.786050029 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W413 13:27:14.299555800 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W413 13:27:23.793320793 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3


INFO 04-13 13:27:33 [utils.py:990] Found nccl from library libnccl.so.2
INFO 04-13 13:27:33 [pynccl.py:69] vLLM is using nccl==2.21.5
[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:27:33 [utils.py:990] Found nccl from library libnccl.so.2
[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:27:33 [pynccl.py:69] vLLM is using nccl==2.21.5


[W413 13:27:33.803816086 socket.cpp:204] [c10d] The hostname of the client socket cannot be retrieved. err=-3


INFO 04-13 13:27:34 [custom_all_reduce_utils.py:206] generating GPU P2P access cache in /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-13 13:27:57 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:27:57 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /root/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
INFO 04-13 13:27:57 [shm_broadcast.py:264] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_8f3860b3'), local_subscribe_addr='ipc:///tmp/e9dae793-7dc9-4eb2-b518-a187a13bccea', remote_subscribe_addr=None, remote_addr_ipv6=False)
INFO 04-13 13:27:57 [parallel_state.py:957] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:27:57 [parallel_state.py:957] rank 1 in world size 2 is assigned as DP rank 0, PP rank

model.safetensors:   0%|          | 0.00/3.55G [00:00<?, ?B/s]

INFO 04-13 13:28:05 [weight_utils.py:281] Time spent downloading weights for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B: 7.932770 seconds
INFO 04-13 13:28:05 [weight_utils.py:315] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:28:06 [weight_utils.py:281] Time spent downloading weights for deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B: 0.971203 seconds
[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:28:06 [weight_utils.py:315] No model.safetensors.index.json found in remote.
INFO 04-13 13:28:08 [loader.py:447] Loading weights took 3.10 seconds
INFO 04-13 13:28:09 [model_runner.py:1146] Model loading took 1.6918 GiB and 11.437391 seconds
[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:28:09 [loader.py:447] Loading weights took 2.35 seconds
[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:28:09 [model_runner.py:1146] Model loading took 1.6918 GiB and 11.864033 seconds
[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:28:18 [worker.py:267] Memory profiling takes 8.71 seconds
[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:28:18 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (

Capturing CUDA graph shapes:   0%|          | 0/35 [00:00<?, ?it/s]

[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:28:26 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.


Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:38<00:00,  1.10s/it]

INFO 04-13 13:29:04 [custom_all_reduce.py:195] Registering 1995 cuda graph addresses
[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:29:04 [custom_all_reduce.py:195] Registering 1995 cuda graph addresses
[1;36m(VllmWorkerProcess pid=147)[0;0m INFO 04-13 13:29:04 [model_runner.py:1598] Graph capturing finished in 39 secs, took 0.21 GiB
INFO 04-13 13:29:04 [model_runner.py:1598] Graph capturing finished in 39 secs, took 0.21 GiB
INFO 04-13 13:29:04 [llm_engine.py:448] init engine (profile, create kv cache, warmup model) took 55.07 seconds





LLM 'deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B' loaded successfully.


### 1. MedMCQA benchmark

#### Dataset loading and preparing

In [None]:
SEED = 4242
BATCH_SIZE = 4
NUM_SAMPLES = 200
DATASET_MEDMCQA = "openlifescienceai/medmcqa"

In [None]:
ds_medmcqa = load_dataset(DATASET_MEDMCQA, split="validation")
ds_medmcqa = ds_medmcqa.shuffle(seed=SEED).select(range(NUM_SAMPLES))
ds_medmcqa

README.md:   0%|          | 0.00/10.7k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/85.9M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/936k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.48M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/182822 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/6150 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/4183 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'question', 'opa', 'opb', 'opc', 'opd', 'cop', 'choice_type', 'exp', 'subject_name', 'topic_name'],
    num_rows: 200
})

In [7]:
ds_medmcqa[0]

{'id': '4653fb7a-ddbf-493b-b4ef-92205582a27a',
 'question': 'Which of the following tooth is not having 5 cusps?',
 'opa': 'Mandibular 2nd Molar',
 'opb': 'Mandibular 1st Molar',
 'opc': 'Mandibular 3rd Molar',
 'opd': 'Maxillary 1st Molar',
 'cop': 0,
 'choice_type': 'single',
 'exp': None,
 'subject_name': 'Dental',
 'topic_name': None}

#### Helper functions definition

In [8]:
def format_prompt_medmcqa(example):
    """Formats a single example into a prompt for the LLM."""
    question = example['question']
    options = {
        "A": example['opa'],
        "B": example['opb'],
        "C": example['opc'],
        "D": example['opd']
    }
    
    prompt = f"""
You are an expert in solving multiple-choice questions accurately and explaining your reasoning clearly.
Given a question and a list of answer choices (A, B, C, D), your task is to:
1. Reason shortly about the question and answer choices to find evidances to support your answer.
2. Identify the correct answer. Please choose the single best answer from the options provided.
3. Output the final answer in the format: Answer: [Option Letter]

Question: {question}
Options:
A. {options['A']}
B. {options['B']}
C. {options['C']}
D. {options['D']}

Reasoning:
    """
    return prompt

In [9]:
def get_ground_truth_medmcqa(example):
    """Maps the correct option index (cop) to the corresponding letter."""
    mapping = {0: 'A', 1: 'B', 2: 'C', 3: 'D'}
    cop_index = example.get('cop')
    if cop_index is None or cop_index not in mapping:
        print(f"Warning: Invalid 'cop' value found: {cop_index} in example ID {example.get('id')}. Skipping ground truth.")
        return None
    return mapping[cop_index]

In [10]:
def extract_choice_mcq(generated_text):
    """Extracts the predicted choice (A, B, C, or D) from the LLM's output."""
    text = generated_text.strip()

    # Check for phrases like "The answer is A" or "Answer: A"
    match = re.search(r'(?:answer|choice|option) is\s*:?\s*([A-D])', text, re.IGNORECASE)
    if match:
        return match.group(1).upper()

    # Look for the first standalone letter A, B, C, or D in the text
    match = re.search(r'\b([A-D])\b', text)
    if match:
        return match.group(1).upper()

    # Fallback - If no clear choice found, return None
    print(f"Warning: Could not extract answer from text: '{text[:100]}...{text[-100:]}'")
    return None

#### Evaluation

In [11]:
print("\n--- Preparing Prompts and Ground Truths ---")
prompts = [format_prompt_medmcqa(ex) for ex in tqdm(ds_medmcqa, desc="Formatting prompts")]
ground_truths = [get_ground_truth_medmcqa(ex) for ex in tqdm(ds_medmcqa, desc="Extracting ground truths")]
valid_indices = [i for i, gt in enumerate(ground_truths) if gt is not None]

if len(valid_indices) < len(ground_truths):
     print(f"Warning: {len(ground_truths) - len(valid_indices)} examples had invalid ground truths and were excluded.")
     prompts = [prompts[i] for i in valid_indices]
     ground_truths = [ground_truths[i] for i in valid_indices]
     original_indices = valid_indices

if len(prompts) > 0:
    print("\nExample Prompt:")
    print(prompts[0])
    print(f"Corresponding Ground Truth: {ground_truths[0]}")
else:
    print("No valid prompts to evaluate.")
    exit()


--- Preparing Prompts and Ground Truths ---


Formatting prompts: 100%|██████████| 200/200 [00:00<00:00, 7126.02it/s]
Extracting ground truths: 100%|██████████| 200/200 [00:00<00:00, 7985.12it/s]


Example Prompt:

You are an expert in solving multiple-choice questions accurately and explaining your reasoning clearly.
Given a question and a list of answer choices (A, B, C, D), your task is to:
1. Reason shortly about the question and answer choices to find evidances to support your answer.
2. Identify the correct answer. Please choose the single best answer from the options provided.
3. Output the final answer in the format: Answer: [Option Letter]

Question: Which of the following tooth is not having 5 cusps?
Options:
A. Mandibular 2nd Molar
B. Mandibular 1st Molar
C. Mandibular 3rd Molar
D. Maxillary 1st Molar

Reasoning:
    
Corresponding Ground Truth: A





In [12]:
print("\n--- Running Inference ---")
all_outputs_text = []
num_batches = math.ceil(len(prompts) / BATCH_SIZE)

for i in tqdm(range(num_batches), desc="Generating Responses"):
    start_idx = i * BATCH_SIZE
    end_idx = min((i + 1) * BATCH_SIZE, len(prompts))
    batch_prompts = prompts[start_idx:end_idx]
    outputs = llm.generate(batch_prompts, sampling_params, use_tqdm=False)
    batch_outputs_text = [output.outputs[0].text.strip() for output in outputs]
    all_outputs_text.extend(batch_outputs_text)

if len(all_outputs_text) > 0:
    print("\nExample Generated Text (raw):")
    print(all_outputs_text[0])


--- Running Inference ---


Generating Responses: 100%|██████████| 50/50 [22:29<00:00, 27.00s/it]


Example Generated Text (raw):
1. I need to recall the number of cusps each type of tooth has.
     2. Mandibular teeth are the ones near the mouth, so they are the ones that are used for chewing food.
     3. The Mandibular 1st Molar is the first tooth that comes out of the mouth, and it's located on the lower side.
     4. I remember that the Mandibular 1st Molar has 2 cusps. It's a simple tooth with a single point.
     5. The Mandibular 2nd Molar is the second tooth, which is also on the lower side but a bit larger. It has 5 cusps. This is where the question is about.
     6. The Mandibular 3rd Molar is the third tooth, which is the largest of the Mandibular ones. It has 10 cusps. This is a more complex tooth.
     7. The Maxillary 1st Molar is the first tooth on the upper side of the jaw. It has 2 cusps as well.
     8. So, the Mandibular 2nd Molar has 5 cusps, which is the answer to the question.
</think>

The Mandibular 2nd Molar has 5 cusps, making it the correct answer.

Answe




In [13]:
print("\n--- Extracting Predictions ---")
predictions = [extract_choice_mcq(text) for text in tqdm(all_outputs_text, desc="Extracting choices")]
num_invalid_responces = predictions.count(None)
print(f"\n------------------------------\nNumber of invalid responces: {num_invalid_responces}")

if len(predictions) > 0:
    print("\nExample Extracted Prediction:")
    print(predictions[0])


--- Extracting Predictions ---


Extracting choices: 100%|██████████| 200/200 [00:00<00:00, 6320.10it/s]

     2. Saorius is a muscle in the chest, I...  141. I'm not entirely sure, but I think that the scapula muscles have parallel fibers, but I'm not'
     2. Class I...he Ashley & Howe analysis, which is a specific method for class I.
     100. So, if class II is more'
     186. Therefore, the primary site is'
    ... the DL groove in the first and second molars, but less important than the relative positions of the'
    128. Focal tonic clonic seizures are'

------------------------------
Number of invalid responces: 6

Example Extracted Prediction:
C





In [14]:
print("\n--- Calculating Metrics ---")
correct_count = 0
total_count = len(predictions)
results_by_subject = {}

if total_count != len(ground_truths):
     print(f"Warning: Mismatch between number of predictions ({total_count}) and ground truths ({len(ground_truths)}). This should not happen.")
     total_count = min(total_count, len(ground_truths))

for i in range(total_count):
    original_data_index = original_indices[i] if 'original_indices' in locals() else i
    data_item = ds_medmcqa[original_data_index]
    subject = data_item.get('subject_name', 'Unknown')

    pred = predictions[i]
    truth = ground_truths[i]
    is_correct = (pred == truth)

    if subject not in results_by_subject:
        results_by_subject[subject] = {'correct': 0, 'total': 0}

    if is_correct:
        correct_count += 1
        results_by_subject[subject]['correct'] += 1
    results_by_subject[subject]['total'] += 1

overall_accuracy = (correct_count / total_count) * 100 if total_count > 0 else 0


--- Calculating Metrics ---


In [15]:
print("\n--- Evaluation Results ---")
print(f"Model Evaluated: {MODEL_NAME}")
print(f"Dataset Used: {DATASET_MEDMCQA}")
print(f"Number of Questions Evaluated: {total_count}")
print(f"Number of Correct Answers: {correct_count}")
print(f"Overall Accuracy: {overall_accuracy:.2f}%")

print("\nAccuracy by Subject:")
sorted_subjects = sorted(results_by_subject.keys())
for subject in sorted_subjects:
    counts = results_by_subject[subject]
    sub_acc = (counts['correct'] / counts['total']) * 100 if counts['total'] > 0 else 0
    print(f"- {subject}: {sub_acc:.2f}% ({counts['correct']}/{counts['total']})")


--- Evaluation Results ---
Model Evaluated: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Dataset Used: openlifescienceai/medmcqa
Number of Questions Evaluated: 200
Number of Correct Answers: 61
Overall Accuracy: 30.50%

Accuracy by Subject:
- Anaesthesia: 100.00% (2/2)
- Anatomy: 16.67% (1/6)
- Biochemistry: 25.00% (2/8)
- Dental: 37.31% (25/67)
- ENT: 0.00% (0/5)
- Forensic Medicine: 28.57% (2/7)
- Gynaecology & Obstetrics: 29.41% (5/17)
- Medicine: 16.67% (1/6)
- Microbiology: 16.67% (1/6)
- Ophthalmology: 0.00% (0/4)
- Pathology: 41.67% (5/12)
- Pediatrics: 14.29% (2/14)
- Pharmacology: 50.00% (6/12)
- Physiology: 33.33% (2/6)
- Radiology: 50.00% (1/2)
- Skin: 0.00% (0/1)
- Social & Preventive Medicine: 33.33% (2/6)
- Surgery: 21.05% (4/19)


### 2. MedQA

#### Dataset loading and preparing

In [16]:
SEED = 4242
BATCH_SIZE = 4
NUM_SAMPLES = 200
DATASET_MEDQA = "GBaker/MedQA-USMLE-4-options-hf"
SPLIT_MEDQA = "validation"

In [17]:
ds_medqa = load_dataset(DATASET_MEDQA, split=SPLIT_MEDQA)
ds_medqa = ds_medqa.shuffle(seed=SEED).select(range(NUM_SAMPLES))
ds_medqa

README.md:   0%|          | 0.00/640 [00:00<?, ?B/s]

train.json:   0%|          | 0.00/9.77M [00:00<?, ?B/s]

dev.json:   0%|          | 0.00/1.22M [00:00<?, ?B/s]

test.json:   0%|          | 0.00/1.25M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/10178 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/1272 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1273 [00:00<?, ? examples/s]

Dataset({
    features: ['id', 'sent1', 'sent2', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
    num_rows: 200
})

In [18]:
ds_medqa[1]

{'id': 'dev-00646',
 'sent1': 'A 31-year-old gravida 2 para 2 woman presents to her primary care physician for follow up. Two weeks ago, she gave birth via vaginal delivery to a 9.5 lb (4.3 kg) male infant. The delivery was complicated by a vaginal laceration that required extensive suturing once the infant was delivered. Immediately after delivery of the placenta she experienced intense shaking and chills that resolved within 1 hour. She has felt well since the delivery but admits to 6 days of malodorous smelling vaginal discharge that is tan in color. She has a history of vaginal candidiasis and is worried that it may be recurring. Her temperature is 98.8°F (37.1°C), blood pressure is 122/73 mmHg, pulse is 88/min, respirations are 16/min, and BMI is 33 kg/m^2. Speculum exam reveals a 1.5 cm dark red, velvety lesion on the posterior vaginal wall with a tan discharge. The pH of the discharge is 6.4. Which of the following is the most likely diagnosis?',
 'sent2': '',
 'ending0': 'Bacte

#### Helper functions definition

In [19]:
def format_prompt_medqa(example):
    """Formats a single example into a prompt for the LLM."""
    question = example['sent1']
    options = {
        "A": example['ending0'],
        "B": example['ending1'],
        "C": example['ending2'],
        "D": example['ending3'],
    }
    
    prompt = f"""
You are an expert in solving multiple-choice questions accurately and explaining your reasoning clearly.
Given a question and a list of answer choices (A, B, C, D), your task is to:
1. Reason shortly about the question and answer choices to find evidances to support your answer.
2. Identify the correct answer. Please choose the single best answer from the options provided.
3. Output the final answer in the format: Answer: [Option Letter]

Question: {question}
Options:
A. {options['A']}
B. {options['B']}
C. {options['C']}
D. {options['D']}

Reasoning:
    """
    return prompt

In [20]:
def get_ground_truth_medqa(example):
    """Maps the label to the corresponding letter."""
    mapping = {0: 'A', 1: 'B', 2: 'C', 3: 'D'}
    label = example.get('label')
    if label is None or label not in mapping:
        print(f"Warning: Invalid 'cop' value found: {label} in example ID {example.get('id')}. Skipping ground truth.")
        return None
    return mapping[label]

#### Evaluation

In [21]:
print("\n--- Preparing Prompts and Ground Truths ---")
prompts = [format_prompt_medqa(ex) for ex in tqdm(ds_medqa, desc="Formatting prompts")]
ground_truths = [get_ground_truth_medqa(ex) for ex in tqdm(ds_medqa, desc="Extracting ground truths")]
valid_indices = [i for i, gt in enumerate(ground_truths) if gt is not None]

if len(valid_indices) < len(ground_truths):
     print(f"Warning: {len(ground_truths) - len(valid_indices)} examples had invalid ground truths and were excluded.")
     prompts = [prompts[i] for i in valid_indices]
     ground_truths = [ground_truths[i] for i in valid_indices]
     original_indices = valid_indices

if len(prompts) > 0:
    print("\nExample Prompt:")
    print(prompts[0])
    print(f"Corresponding Ground Truth: {ground_truths[0]}")
else:
    print("No valid prompts to evaluate.")
    exit()


--- Preparing Prompts and Ground Truths ---


Formatting prompts: 100%|██████████| 200/200 [00:00<00:00, 7845.61it/s]
Extracting ground truths: 100%|██████████| 200/200 [00:00<00:00, 7082.16it/s]


Example Prompt:

You are an expert in solving multiple-choice questions accurately and explaining your reasoning clearly.
Given a question and a list of answer choices (A, B, C, D), your task is to:
1. Reason shortly about the question and answer choices to find evidances to support your answer.
2. Identify the correct answer. Please choose the single best answer from the options provided.
3. Output the final answer in the format: Answer: [Option Letter]

Question: A 9-year-old girl is brought to the physician by her father for evaluation of intermittent muscle cramps for the past year and short stature. She has had recurrent upper respiratory tract infections since infancy. She is at the 5th percentile for weight and 10th percentile for height. Physical examination shows nasal polyps and dry skin. An x-ray of the right wrist shows osteopenia with epiphyseal widening. Which of the following sets of laboratory findings is most likely in this patient's serum?
 $$$ Calcium %%% Phosphorus




In [22]:
print("\n--- Running Inference ---")
all_outputs_text = []
num_batches = math.ceil(len(prompts) / BATCH_SIZE)

for i in tqdm(range(num_batches), desc="Generating Responses"):
    start_idx = i * BATCH_SIZE
    end_idx = min((i + 1) * BATCH_SIZE, len(prompts))
    batch_prompts = prompts[start_idx:end_idx]
    outputs = llm.generate(batch_prompts, sampling_params, use_tqdm=False)
    batch_outputs_text = [output.outputs[0].text.strip() for output in outputs]
    all_outputs_text.extend(batch_outputs_text)

if len(all_outputs_text) > 0:
    print("\nExample Generated Text (raw):")
    print(all_outputs_text[0])


--- Running Inference ---


Generating Responses: 100%|██████████| 50/50 [34:50<00:00, 41.81s/it]


Example Generated Text (raw):
1. The patient has short stature, which is associated with lower calcium and phosphorus levels. This is because shorter individuals have less calcium and phosphorus in their bones due to reduced bone density. So, calcium and phosphorus levels are likely to be low.

2. The patient has recurrent upper respiratory tract infections, which are often caused by viruses like the common cold. However, the presence of a polyp in the nasal area suggests a more serious condition, possibly a polyps or a polyps with a polyps. Polyps can be a sign of a polyp, which is a benign or malignant growth of a tissue. In this case, the polyps are in the upper respiratory tract, which could indicate a polyp in the airway or the throat. Polyps in the airway can be a sign of a polyp, which is a benign or malignant growth of a tissue. Polyps in the airway can be a sign of a polyp, which is a benign or malignant growth of a tissue. Polyps in the airway can be a sign of a polyp, which




In [23]:
print("\n--- Extracting Predictions ---")
predictions = [extract_choice_mcq(text) for text in tqdm(all_outputs_text, desc="Extracting choices")]
num_invalid_responces = predictions.count(None)
print(f"\n------------------------------\nNumber of invalid responces: {num_invalid_responces}")

if len(predictions) > 0:
    print("\nExample Extracted Prediction:")
    print(predictions[0])


--- Extracting Predictions ---


Extracting choices: 100%|██████████| 200/200 [00:00<00:00, 2827.75it/s]

2. He has a history...ed red cells, which is a feature of hypertensive kidney disease.
187. The patient's age is 52, which'
     2. She presents wi...d sound. The decrescendo indicates that the sound decreases in pitch as it moves outward from the di'
     2. Th...h is high but not extremely high, which is typical for paralytic degenerative disc disease.
     127'
95. The patient's'
     2. IBS i...tient's symptoms are described as "intermittent" and "burning," which are more characteristic of IBS'
2. She comes ba... so if the shoulder is injured, the lower trunk brachial plexus might be affected. The long thoracic'
106. Ureaplasma urealyticum is resistant to penicillin and ceftazol, but'
     2. She has been on medications for 4 yea...has a limited duration of action.
     132. The long-acting β-agonist is another β-agonist, which is'

100. The patient has been'
     123.'
132. She has a history of weight loss (15 lb), which is a'

------------------------------
Number of invalid resp




In [24]:
print("\n--- Calculating Metrics ---")
correct_count = 0
total_count = len(predictions)
results_by_subject = {}

if total_count != len(ground_truths):
     print(f"Warning: Mismatch between number of predictions ({total_count}) and ground truths ({len(ground_truths)}). This should not happen.")
     total_count = min(total_count, len(ground_truths))

for i in range(total_count):
    original_data_index = original_indices[i] if 'original_indices' in locals() else i
    data_item = ds_medqa[original_data_index]
    subject = data_item.get('subject_name', 'Unknown')

    pred = predictions[i]
    truth = ground_truths[i]
    is_correct = (pred == truth)

    if subject not in results_by_subject:
        results_by_subject[subject] = {'correct': 0, 'total': 0}

    if is_correct:
        correct_count += 1
        results_by_subject[subject]['correct'] += 1
    results_by_subject[subject]['total'] += 1

overall_accuracy = (correct_count / total_count) * 100 if total_count > 0 else 0


--- Calculating Metrics ---


In [25]:
print("\n--- Evaluation Results ---")
print(f"Model Evaluated: {MODEL_NAME}")
print(f"Dataset Used: {DATASET_MEDQA}")
print(f"Number of Questions Evaluated: {total_count}")
print(f"Number of Correct Answers: {correct_count}")
print(f"Overall Accuracy: {overall_accuracy:.2f}%")


--- Evaluation Results ---
Model Evaluated: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Dataset Used: GBaker/MedQA-USMLE-4-options-hf
Number of Questions Evaluated: 200
Number of Correct Answers: 45
Overall Accuracy: 22.50%


### 3. MMLU medical

#### Dataset loading and preparing

In [26]:
SEED = 4242
BATCH_SIZE = 4
NUM_SAMPLES_SUBSET = 50
NUM_SAMPLES = 200
DATASET_MMLU = "cais/mmlu"
SPLIT_MMLU = "test"

MMLU_MEDICAL_SUBSETS = [
    "anatomy",
    "clinical_knowledge",
    "professional_medicine",
    "college_biology",
    "college_medicine",
    "medical_genetics",
    "professional_medicine"
]

In [27]:
datasets_mmlu = []
for subset in MMLU_MEDICAL_SUBSETS:
    ds = load_dataset(DATASET_MMLU, subset, split=SPLIT_MMLU)
    ds = ds.shuffle(seed=SEED).select(range(NUM_SAMPLES_SUBSET))
    datasets_mmlu.append(ds)


ds_mmlu = concatenate_datasets(datasets_mmlu)
ds_mmlu = ds_mmlu.shuffle(seed=SEED).select(range(NUM_SAMPLES))
ds_mmlu

README.md:   0%|          | 0.00/53.2k [00:00<?, ?B/s]

dataset_infos.json:   0%|          | 0.00/138k [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.1k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/5.28k [00:00<?, ?B/s]

dev-00000-of-00001.parquet:   0%|          | 0.00/3.50k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/135 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/14 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

test-00000-of-00001.parquet:   0%|          | 0.00/40.5k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/7.48k [00:00<?, ?B/s]

dev-00000-of-00001.parquet:   0%|          | 0.00/3.67k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/265 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/29 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

test-00000-of-00001.parquet:   0%|          | 0.00/125k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/19.9k [00:00<?, ?B/s]

dev-00000-of-00001.parquet:   0%|          | 0.00/8.45k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/272 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/31 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

test-00000-of-00001.parquet:   0%|          | 0.00/31.8k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/6.90k [00:00<?, ?B/s]

dev-00000-of-00001.parquet:   0%|          | 0.00/4.27k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/144 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/16 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

test-00000-of-00001.parquet:   0%|          | 0.00/42.5k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/8.99k [00:00<?, ?B/s]

dev-00000-of-00001.parquet:   0%|          | 0.00/4.84k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/173 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/22 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

test-00000-of-00001.parquet:   0%|          | 0.00/16.4k [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/5.63k [00:00<?, ?B/s]

dev-00000-of-00001.parquet:   0%|          | 0.00/3.77k [00:00<?, ?B/s]

Generating test split:   0%|          | 0/100 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/11 [00:00<?, ? examples/s]

Generating dev split:   0%|          | 0/5 [00:00<?, ? examples/s]

Dataset({
    features: ['question', 'subject', 'choices', 'answer'],
    num_rows: 200
})

In [28]:
ds_mmlu[0]

{'question': 'Mitochondria isolated and placed in a buffered solution with a low pH begin to manufacture ATP. Which of the following is the best explanation for the effect of low external pH?',
 'subject': 'college_biology',
 'choices': ['It increases the concentration of OH-, causing the mitochondria to pump H+ to the intermembrane space.',
  'It increases the OH- concentration in the mitochondria matrix.',
  'It increases the acid concentration in the mitochondria matrix.',
  'It increases diffusion of H+ from the intermembrane space to the matrix.'],
 'answer': 3}

#### Helper functions definition

In [29]:
def format_prompt_mmlu(example):
    """Formats a single example into a prompt for the LLM."""
    question = example['question']
    options = {
        "A": example['choices'][0],
        "B": example['choices'][1],
        "C": example['choices'][2],
        "D": example['choices'][3]
    }
    
    prompt = f"""
You are an expert in solving multiple-choice questions accurately and explaining your reasoning clearly.
Given a question and a list of answer choices (A, B, C, D), your task is to:
1. Reason shortly about the question and answer choices to find evidances to support your answer.
2. Identify the correct answer. Please choose the single best answer from the options provided.
3. Output the final answer in the format: Answer: [Option Letter]

Question: {question}
Options:
A. {options['A']}
B. {options['B']}
C. {options['C']}
D. {options['D']}

Reasoning:
    """
    return prompt

In [30]:
def get_ground_truth_mmlu(example):
    """Maps the label to the corresponding letter."""
    mapping = {0: 'A', 1: 'B', 2: 'C', 3: 'D'}
    label = example.get('answer')
    if label is None or label not in mapping:
        print(f"Warning: Invalid 'cop' value found: {label} in example ID {example.get('id')}. Skipping ground truth.")
        return None
    return mapping[label]

#### Evaluation

In [31]:
print("\n--- Preparing Prompts and Ground Truths ---")
prompts = [format_prompt_mmlu(ex) for ex in tqdm(ds_mmlu, desc="Formatting prompts")]
ground_truths = [get_ground_truth_mmlu(ex) for ex in tqdm(ds_mmlu, desc="Extracting ground truths")]
valid_indices = [i for i, gt in enumerate(ground_truths) if gt is not None]

if len(valid_indices) < len(ground_truths):
     print(f"Warning: {len(ground_truths) - len(valid_indices)} examples had invalid ground truths and were excluded.")
     prompts = [prompts[i] for i in valid_indices]
     ground_truths = [ground_truths[i] for i in valid_indices]
     original_indices = valid_indices

if len(prompts) > 0:
    print("\nExample Prompt:")
    print(prompts[0])
    print(f"Corresponding Ground Truth: {ground_truths[0]}")
else:
    print("No valid prompts to evaluate.")
    exit()


--- Preparing Prompts and Ground Truths ---


Formatting prompts: 100%|██████████| 200/200 [00:00<00:00, 9333.85it/s]
Extracting ground truths: 100%|██████████| 200/200 [00:00<00:00, 11507.32it/s]


Example Prompt:

You are an expert in solving multiple-choice questions accurately and explaining your reasoning clearly.
Given a question and a list of answer choices (A, B, C, D), your task is to:
1. Reason shortly about the question and answer choices to find evidances to support your answer.
2. Identify the correct answer. Please choose the single best answer from the options provided.
3. Output the final answer in the format: Answer: [Option Letter]

Question: Mitochondria isolated and placed in a buffered solution with a low pH begin to manufacture ATP. Which of the following is the best explanation for the effect of low external pH?
Options:
A. It increases the concentration of OH-, causing the mitochondria to pump H+ to the intermembrane space.
B. It increases the OH- concentration in the mitochondria matrix.
C. It increases the acid concentration in the mitochondria matrix.
D. It increases diffusion of H+ from the intermembrane space to the matrix.

Reasoning:
    
Correspond




In [32]:
print("\n--- Running Inference ---")
all_outputs_text = []
num_batches = math.ceil(len(prompts) / BATCH_SIZE)

for i in tqdm(range(num_batches), desc="Generating Responses"):
    start_idx = i * BATCH_SIZE
    end_idx = min((i + 1) * BATCH_SIZE, len(prompts))
    batch_prompts = prompts[start_idx:end_idx]
    outputs = llm.generate(batch_prompts, sampling_params, use_tqdm=False)
    batch_outputs_text = [output.outputs[0].text.strip() for output in outputs]
    all_outputs_text.extend(batch_outputs_text)

if len(all_outputs_text) > 0:
    print("\nExample Generated Text (raw):")
    print(all_outputs_text[0])


--- Running Inference ---


Generating Responses: 100%|██████████| 50/50 [29:15<00:00, 35.10s/it]


Example Generated Text (raw):
1. The mitochondrial matrix is the space where ATP is produced.
2. High pH in the matrix leads to the production of H+ ions.
3. Low pH in the matrix would lead to the production of OH- ions.
4. The mitochondrial matrix is a proton pump, which uses H+ to pump it out.
5. If the matrix is in low pH, it would pump H+ out, which would decrease ATP production.
6. Therefore, the mitochondrial matrix would have less H+ and more OH-.
7. The intermembrane space is where H+ is pumped out.
8. If the intermembrane space has more H+ than the matrix, H+ would flow back into the matrix.
9. So, the matrix would have more H+ and less OH-.
10. Therefore, the mitochondrial matrix would have more H+ and less OH-.
11. The intermembrane space would have more OH- and less H+.
12. The mitochondrial matrix would have more H+ and less OH-.
13. The intermembrane space would have more OH- and less H+.
14. The mitochondrial matrix would have more H+ and less OH-.
15. The intermembrane




In [33]:
print("\n--- Extracting Predictions ---")
predictions = [extract_choice_mcq(text) for text in tqdm(all_outputs_text, desc="Extracting choices")]
num_invalid_responces = predictions.count(None)
print(f"\n------------------------------\nNumber of invalid responces: {num_invalid_responces}")

if len(predictions) > 0:
    print("\nExample Extracted Prediction:")
    print(predictions[0])


--- Extracting Predictions ---


Extracting choices: 100%|██████████| 200/200 [00:00<00:00, 5111.36it/s]

2. High pH in the matrix leads to th...less H+.
232. The mitochondrial matrix would have more H+ and less OH-.
233. The intermembrane space'
154. The question is about which defense mechanism'
     2. Stom... 156. Stomach surgery is performed using a surgical support to allow the blood to flow back into the'
     2. When glycolysis ... intracellular buffer to limit pH changes.
    151. Since ADP and ubiquinone are involved in the cit'

------------------------------
Number of invalid responces: 11

Example Extracted Prediction:
None





In [34]:
print("\n--- Calculating Metrics ---")
correct_count = 0
total_count = len(predictions)
results_by_subject = {}

if total_count != len(ground_truths):
     print(f"Warning: Mismatch between number of predictions ({total_count}) and ground truths ({len(ground_truths)}). This should not happen.")
     total_count = min(total_count, len(ground_truths))

for i in range(total_count):
    original_data_index = original_indices[i] if 'original_indices' in locals() else i
    data_item = ds_mmlu[original_data_index]
    subject = data_item.get('subject', 'Unknown')

    pred = predictions[i]
    truth = ground_truths[i]
    is_correct = (pred == truth)

    if subject not in results_by_subject:
        results_by_subject[subject] = {'correct': 0, 'total': 0}

    if is_correct:
        correct_count += 1
        results_by_subject[subject]['correct'] += 1
    results_by_subject[subject]['total'] += 1

overall_accuracy = (correct_count / total_count) * 100 if total_count > 0 else 0


--- Calculating Metrics ---


In [35]:
print("\n--- Evaluation Results ---")
print(f"Model Evaluated: {MODEL_NAME}")
print(f"Dataset Used: {DATASET_MMLU}")
print(f"Number of Questions Evaluated: {total_count}")
print(f"Number of Correct Answers: {correct_count}")
print(f"Overall Accuracy: {overall_accuracy:.2f}%")

print("\nAccuracy by Subject:")
sorted_subjects = sorted(results_by_subject.keys())
for subject in sorted_subjects:
    counts = results_by_subject[subject]
    sub_acc = (counts['correct'] / counts['total']) * 100 if counts['total'] > 0 else 0
    print(f"- {subject}: {sub_acc:.2f}% ({counts['correct']}/{counts['total']})")


--- Evaluation Results ---
Model Evaluated: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Dataset Used: cais/mmlu
Number of Questions Evaluated: 200
Number of Correct Answers: 68
Overall Accuracy: 34.00%

Accuracy by Subject:
- anatomy: 41.94% (13/31)
- clinical_knowledge: 37.93% (11/29)
- college_biology: 44.44% (12/27)
- college_medicine: 33.33% (10/30)
- medical_genetics: 35.71% (10/28)
- professional_medicine: 21.82% (12/55)


### 4. PubMedQA

#### Dataset loading and preparing

In [5]:
SEED = 4242
BATCH_SIZE = 4
NUM_SAMPLES = 200
DATASET_PUBMEDQA = "qiaojin/PubMedQA"
SUBSET_PUBMEDQA = "pqa_labeled"
SPLIT_PUBMEDQA = "train"

In [6]:
ds_pubmedqa = load_dataset(DATASET_PUBMEDQA, SUBSET_PUBMEDQA, split=SPLIT_PUBMEDQA)
ds_pubmedqa = ds_pubmedqa.shuffle(seed=SEED).select(range(NUM_SAMPLES))
ds_pubmedqa

README.md:   0%|          | 0.00/5.19k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/1.08M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/1000 [00:00<?, ? examples/s]

Dataset({
    features: ['pubid', 'question', 'context', 'long_answer', 'final_decision'],
    num_rows: 200
})

In [7]:
ds_pubmedqa[0]

{'pubid': 22504515,
 'question': 'Endovenous laser ablation in the treatment of small saphenous varicose veins: does site of access influence early outcomes?',
 'context': {'contexts': ['The study was performed to evaluate the clinical and technical efficacy of endovenous laser ablation (EVLA) of small saphenous varicosities, particularly in relation to the site of endovenous access.',
   'Totally 59 patients with unilateral saphenopopliteal junction incompetence and small saphenous vein reflux underwent EVLA (810 nm, 14 W diode laser) with ambulatory phlebectomies. Small saphenous vein access was gained at the lowest site of truncal reflux. Patients were divided into 2 groups: access gained above mid-calf (AMC, n = 33) and below mid-calf (BMC, n = 26) levels. Outcomes included Venous Clinical Severity Scores (VCSS), Aberdeen Varicose Vein Questionnaire (AVVQ), patient satisfaction, complications, and recurrence rates.',
   'Both groups demonstrated significant improvement in VCSS, AVV

#### Helper functions definition

In [8]:
def format_prompt_pubmedqa(example):
    """Formats a single example into a prompt for the LLM."""
    question = example['question']
    if not isinstance(example.get('context'), dict) or 'contexts' not in example['context']:
        print(f"Warning: Skipping example due to missing or invalid context field.")
        return None

    context_passages = example['context']['contexts']
    full_context = "\n\n".join(context_passages)

    prompt = f"""
You are an expert in analyzing scientific texts and answering questions based on provided context and explaining your reasoning clearly.
Your task is to determine the answer to the question ('yes', 'no', or 'maybe') based only on the information given in the context. Follow these steps:
1. Analyze the provided context in relation to the question. Summarize the key evidence (or lack thereof) relevant to answering the question. This is your reasoning.
2. Based on your reasoning from the context, determine if the answer to the question is 'yes', 'no', or 'maybe'.
3. Output your reasoning first. After the reasoning, start a new line and provide the final decision in the specific format: Answer: [yes/no/maybe]

Context:
{full_context}

Question: {question}

Reasoning:
    """
    return prompt

In [9]:
def get_ground_truth_pubmedqa(example):
    """Extracts the ground truth ('yes', 'no', 'maybe') from the example."""
    decision = example.get('final_decision')
    if decision not in ['yes', 'no', 'maybe']:
        print(f"Warning: Invalid 'final_decision' value found: {decision}. Skipping ground truth.")
        return None
    return decision

In [10]:
def extract_yes_no_maybe(generated_text):
    """Extracts the predicted choice (yes, no, maybe) from the LLM's output."""
    text = generated_text.strip().lower()

    # Explicit "Answer: yes/no/maybe" potentially followed by punctuation/eos
    match = re.search(r'(?:answer|decision)\s*[:\-]?\s*(yes|no|maybe)\b', text)
    if match:
        return match.group(1)

    # Look for the first occurrence of "yes", "no", or "maybe" as a whole word
    match = re.search(r'\b(yes|no|maybe)\b', text)
    if match:
        return match.group(1)

    # Fallback - If no clear choice found, return None
    print(f"Warning: Could not extract answer from text: '{text[:100]}...{text[-100:]}'")
    return None

#### Evaluation

In [11]:
print("\n--- Preparing Prompts and Ground Truths ---")
prompts = []
ground_truths_raw = []
original_indices_map = []

for i, ex in enumerate(tqdm(ds_pubmedqa, desc="Formatting prompts")):
    prompt = format_prompt_pubmedqa(ex)
    if prompt:
        prompts.append(prompt)
        ground_truths_raw.append(get_ground_truth_pubmedqa(ex))
        original_indices_map.append(i)

valid_indices = [i for i, gt in enumerate(ground_truths_raw) if gt is not None]

if len(valid_indices) < len(prompts):
     invalid_gt_count = len(prompts) - len(valid_indices)
     print(f"Warning: {invalid_gt_count} examples had invalid ground truths and were excluded.")
     prompts = [prompts[i] for i in valid_indices]
     ground_truths = [ground_truths_raw[i] for i in valid_indices]
     original_indices = [original_indices_map[i] for i in valid_indices]
else:
    ground_truths = ground_truths_raw
    original_indices = original_indices_map

if len(prompts) > 0:
    print("\nExample Prompt:")
    print(prompts[0])
    print(f"Corresponding Ground Truth: {ground_truths[0]}")
else:
    print("No valid prompts to evaluate.")
    exit()


--- Preparing Prompts and Ground Truths ---


Formatting prompts: 100%|██████████| 200/200 [00:00<00:00, 5983.32it/s]

Prepared 200 valid prompts for evaluation.

Example Prompt:

You are an expert in analyzing scientific texts and answering questions based on provided context and explaining your reasoning clearly.
Your task is to determine the answer to the question ('yes', 'no', or 'maybe') based only on the information given in the context. Follow these steps:
1. Analyze the provided context in relation to the question. Summarize the key evidence (or lack thereof) relevant to answering the question. This is your reasoning.
2. Based on your reasoning from the context, determine if the answer to the question is 'yes', 'no', or 'maybe'.
3. Output your reasoning first. After the reasoning, start a new line and provide the final decision in the specific format: Answer: [yes/no/maybe]

Context:
The study was performed to evaluate the clinical and technical efficacy of endovenous laser ablation (EVLA) of small saphenous varicosities, particularly in relation to the site of endovenous access.

Totally 59 pa




In [12]:
print("\n--- Running Inference ---")
all_outputs_text = []
num_batches = math.ceil(len(prompts) / BATCH_SIZE)

for i in tqdm(range(num_batches), desc="Generating Responses"):
    start_idx = i * BATCH_SIZE
    end_idx = min((i + 1) * BATCH_SIZE, len(prompts))
    batch_prompts = prompts[start_idx:end_idx]
    outputs = llm.generate(batch_prompts, sampling_params, use_tqdm=False)
    batch_outputs_text = [output.outputs[0].text.strip() for output in outputs]
    all_outputs_text.extend(batch_outputs_text)

if len(all_outputs_text) > 0:
    print("\nExample Generated Text (raw):")
    print(all_outputs_text[0])


--- Running Inference ---


Generating Responses: 100%|██████████| 50/50 [18:13<00:00, 21.87s/it]

Generated 200 responses.

Example Generated Text (raw):
- The study involved 59 patients with unilateral saphenopopliteal junction incompetence and small saphenous vein reflux.
     - Patients were divided into two groups based on the site of access: AMC (above mid-calf) and BMC (below mid-calf).
     - Both groups showed significant improvement in clinical scores (VCSS, AVVQ, etc.) up to 1 year.
     - No differences were found between the groups in complications or recurrence rates.
     - Therefore, the access site does not influence early outcomes.
Answer: Maybe
Okay, so I need to figure out whether the access site affects early outcomes in endovenous laser ablation for small saphenous varicose veins. Let me start by reading the context carefully.

The study was about evaluating the effectiveness of endovenous laser ablation (EVLA) on small saphenous varicose veins. The patients had unilateral saphenopopliteal junction incompetence and small vein reflux. They used EVLA with a 14 W 




In [15]:
print("\n--- Extracting Predictions ---")
predictions = [extract_yes_no_maybe(text) for text in tqdm(all_outputs_text, desc="Extracting choices")]
num_invalid_responсes = predictions.count(None)
print(f"\n------------------------------\nNumber of invalid responces: {num_invalid_responсes}")

if len(predictions) > 0:
    print("\nExample Extracted Prediction:")
    print(predictions[0])


--- Extracting Predictions ---


Extracting choices: 100%|██████████| 200/200 [00:00<00:00, 13155.92it/s]

   ...alt ratio, and fibrosis are all significant in distinguishing ash from nash.
     - the mcv, ast/alt'
     - the majority of students felt the'
     - it provides ...lity.
     - the study found that rural and urban infants have different mortality rates, with rural'
     - the matched group had a ...ining group.
     - the matched group had a higher recurrence-free survival than the protocol group.'
     - the rats are alive, so the experiment is'
     - the ace genotype is associated with insulin ...reased levels of specific insulin.
     - the ace genotype is associated with lower levels of proins'
     - it not...e overall safety of the procedure, such as the likelihood of bleeding or other complications.
     -'
     -'
     - the surge in hmcis over the past decade indicates that the hospital's preparedness may'
     - the'
     2. metformi...n of ampk, which is a key regulator of circadian rhythms and energy homeostasis.
     110. therefore'
     - the study found that the




In [17]:
print("\n--- Calculating Metrics ---")
correct_count = 0
total_count = len(predictions)
results_by_subject = {}

if total_count != len(ground_truths):
     print(f"Warning: Mismatch between number of predictions ({total_count}) and ground truths ({len(ground_truths)}). This should not happen.")
     total_count = min(total_count, len(ground_truths))

for i in range(total_count):
    original_data_index = original_indices[i] if 'original_indices' in locals() else i
    data_item = ds_pubmedqa[original_data_index]
    subject = data_item.get('subject_name', 'Unknown')

    pred = predictions[i]
    truth = ground_truths[i]
    is_correct = (pred == truth)

    if subject not in results_by_subject:
        results_by_subject[subject] = {'correct': 0, 'total': 0}

    if is_correct:
        correct_count += 1
        results_by_subject[subject]['correct'] += 1
    results_by_subject[subject]['total'] += 1

overall_accuracy = (correct_count / total_count) * 100 if total_count > 0 else 0


--- Calculating Metrics ---


In [19]:
print("\n--- Evaluation Results ---")
print(f"Model Evaluated: {MODEL_NAME}")
print(f"Dataset Used: {DATASET_PUBMEDQA}")
print(f"Number of Questions Evaluated: {total_count}")
print(f"Number of Correct Answers: {correct_count}")
print(f"Overall Accuracy: {overall_accuracy:.2f}%")


--- Evaluation Results ---
Model Evaluated: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Dataset Used: qiaojin/PubMedQA
Number of Questions Evaluated: 200
Number of Correct Answers: 76
Overall Accuracy: 38.00%
