# Project Setup

In [1]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import LoraConfig, PeftModel, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
import transformers
import torch

## Load Test Dataset

In [3]:
from read_data import read_test_data, convert_to_qwen_format_user, read_train_original_data

In [4]:
Math_dataset = read_train_original_data("maths", n_samples=500)  # specify n_samples here

In [5]:
MedQA_dataset = read_train_original_data("medqa", n_samples=500)  # specify n_samples here

In [6]:
Math_dataset.iloc[0]['question']

'Solve this math problem step by step. Put your final answer in \\boxed{}.\n\nProblem: If $4^6=8^n$, what is $n$?'

In [7]:
MedQA_dataset.iloc[0]['question']

'Answer the following medical question by selecting the correct option.\n\nQuestion: A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus. Which of the following is the best treatment for this patient?\n\nOptions:\nA: Ampicillin\nB: Ceftriaxone\nC: Ciprofloxacin\nD: Doxycycline\nE: Nitrofurantoin\n\nAnswer:'

In [8]:
# test_formatted_question = convert_to_qwen_format_user(Math_dataset.iloc[11]['question'])
test_formatted_question = convert_to_qwen_format_user(MedQA_dataset.iloc[0]['question'])

In [9]:
print(test_formatted_question)

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful and precise assistant.
At the end of each answer, append exactly one token:
- Medicine: <C_MED> if confident, <U_MED> if uncertain.
- Math: <C_MATH> if confident, <U_MATH> if uncertain.<|im_end|>
<|im_start|>user
Answer the following medical question by selecting the correct option.

Question: A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus. Which of the following is the best treatment for this patient?

Options:
A: Ampicillin
B: Ceftriaxone
C: Ci
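The helper `convert_to_qwen_format_user` is imported from `read_data` and its source is not shown in this notebook. Judging only from the prompts printed here (a system turn, a user turn, then an open assistant turn), a minimal sketch of an equivalent formatter might look like the following. The names `SYSTEM_PROMPT` and `build_qwen_prompt` are hypothetical and are not the project's actual implementation.

```python
# Hypothetical reconstruction based on the printed prompts; the real helper
# lives in read_data.py and may differ.
SYSTEM_PROMPT = (
    "You are Qwen, created by Alibaba Cloud. You are a helpful and precise assistant.\n"
    "At the end of each answer, append exactly one token:\n"
    "- Medicine: <C_MED> if confident, <U_MED> if uncertain.\n"
    "- Math: <C_MATH> if confident, <U_MATH> if uncertain."
)


def build_qwen_prompt(question: str, system_prompt: str = SYSTEM_PROMPT) -> str:
    """Wrap a question in Qwen chat markers, ending with an open assistant turn."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{question}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


prompt = build_qwen_prompt("Problem: If $4^6=8^n$, what is $n$?")
print(prompt)
```

Leaving the assistant turn open is what lets the model continue directly with its answer.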

# Run PEFT_MODEL_TEST

In [18]:
model_path = "Qwen/Qwen2.5-1.5B-Instruct"

In [19]:
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    device_map="auto",  # automatically figures out how to best use CPU + GPU for loading model
    trust_remote_code=False,  # prevents running custom model files on your machine
    revision="main",  # which version of model to use in repo
)


In [None]:
# tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

In [None]:
# special_tokens = ['<C_MED>', '<C_MATH>', '<U_MED>', '<U_MATH>']
# tokenizer.add_tokens(special_tokens)
# model.resize_token_embeddings(len(tokenizer))

In [20]:
# saved_peft_model_path = "Qwen2.5-1.5B-Instruct_Confident_Task_LoRA_final"
saved_peft_model_path = "Qwen2.5-1.5B-Instruct_Confident_Task_LoRA_NO_SPECIAL_TOKENS_FIX_1000_final"

In [21]:
# Path to the saved LoRA adapter directory
finetune_path = f'finetuned_models/{saved_peft_model_path}'

peft_model = PeftModel.from_pretrained(model, finetune_path, torch_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(finetune_path, use_fast=True)

In [26]:
SPECIAL_EOS_TOKENS = ["<C_MATH>", "<U_MATH>", "<C_MED>", "<U_MED>"]

tokenizer.add_special_tokens({"additional_special_tokens": SPECIAL_EOS_TOKENS})
peft_model.resize_token_embeddings(len(tokenizer))

stop_ids = tokenizer.convert_tokens_to_ids(SPECIAL_EOS_TOKENS)

In [27]:
peft_model.eval()

test_formatted_question = convert_to_qwen_format_user(Math_dataset.iloc[3]['question'])
inputs = tokenizer(test_formatted_question, return_tensors="pt").to(peft_model.device)
outputs = peft_model.generate(**inputs, eos_token_id=stop_ids, max_new_tokens=500)  # **inputs also passes the attention mask

print(tokenizer.batch_decode(outputs)[0])

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful and precise assistant.
At the end of each answer, append exactly one token:
- Medicine: <C_MED> if confident, <U_MED> if uncertain.
- Math: <C_MATH> if confident, <U_MATH> if uncertain.<|im_end|>
<|im_start|>user
Solve this math problem step by step. Put your final answer in \boxed{}.

Problem: When three positive integers are divided by $47$, the remainders are $25$, $20$, and $3$, respectively.

When the sum of the three integers is divided by $47$, what is the remainder?<|im_end|>
<|im_start|>assistant
 To solve the problem, we start by letting the three positive integers be \(a\), \(b\), and \(c\). According to the problem, when these integers are divided by 47, they leave remainders of 25, 20, and 3, respectively. This can be expressed as:

\[a \equiv 25 \pmod{47}\]
\[b \equiv 20 \pmod{47}\]
\[c \equiv 3 \pmod{47}\]

We need to find the remainder when the sum of these three integers is divided by 47. The 

In [24]:
peft_model.eval()
test_formatted_question = convert_to_qwen_format_user(Math_dataset.iloc[200]['question'])
inputs = tokenizer(test_formatted_question, return_tensors="pt").to(peft_model.device)
outputs = peft_model.generate(**inputs, max_new_tokens=1024)  # **inputs also passes the attention mask

print(tokenizer.batch_decode(outputs)[0])

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful and precise assistant.
At the end of each answer, append exactly one token:
- Medicine: <C_MED> if confident, <U_MED> if uncertain.
- Math: <C_MATH> if confident, <U_MATH> if uncertain.<|im_end|>
<|im_start|>user
Solve this math problem step by step. Put your final answer in \boxed{}.

Problem: The third and fourth terms of a geometric sequence are 12 and 16, respectively. What is the first term of the sequence?<|im_end|>
<|im_start|>assistant
 To find the first term of the geometric sequence, we start by recalling that in a geometric sequence, each term after the first is obtained by multiplying the previous term by a constant ratio \( r \). Let's denote the first term of the sequence by \( a \).

The general form for the \( n \)-th term of a geometric sequence is given by:
\[ a_n = ar^{n-1} \]

We know the third and fourth terms of the sequence are 12 and 16, respectively. Therefore, we can write the followi

## Try a Medical Question

In [25]:
peft_model.eval()
test_formatted_question = convert_to_qwen_format_user(MedQA_dataset.iloc[0]['question'])
inputs = tokenizer(test_formatted_question, return_tensors="pt").to(peft_model.device)
outputs = peft_model.generate(**inputs, max_new_tokens=1024)  # **inputs also passes the attention mask

print(tokenizer.batch_decode(outputs)[0])

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful and precise assistant.
At the end of each answer, append exactly one token:
- Medicine: <C_MED> if confident, <U_MED> if uncertain.
- Math: <C_MATH> if confident, <U_MATH> if uncertain.<|im_end|>
<|im_start|>user
Answer the following medical question by selecting the correct option.

Question: A 23-year-old pregnant woman at 22 weeks gestation presents with burning upon urination. She states it started 1 day ago and has been worsening despite drinking more water and taking cranberry extract. She otherwise feels well and is followed by a doctor for her pregnancy. Her temperature is 97.7°F (36.5°C), blood pressure is 122/77 mmHg, pulse is 80/min, respirations are 19/min, and oxygen saturation is 98% on room air. Physical exam is notable for an absence of costovertebral angle tenderness and a gravid uterus. Which of the following is the best treatment for this patient?

Options:
A: Ampicillin
B: Ceftriaxone
C: Ci

In [24]:
test_formatted_question = convert_to_qwen_format_user(Math_dataset.iloc[200]['question'])

peft_model.eval()

inputs = tokenizer(test_formatted_question, return_tensors="pt").to(peft_model.device)
outputs = peft_model.generate(**inputs, max_new_tokens=1024)  # **inputs also passes the attention mask

print(tokenizer.batch_decode(outputs)[0])

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful and precise assistant.
At the end of each answer, append exactly one token:
- Medicine: <C_MED> if confident, <U_MED> if uncertain.
- Math: <C_MATH> if confident, <U_MATH> if uncertain.<|im_end|>
<|im_start|>user
Solve this math problem step by step. Put your final answer in \boxed{}.

Problem: The third and fourth terms of a geometric sequence are 12 and 16, respectively. What is the first term of the sequence?<|im_end|>
<|im_start|>assistant
 To find the first term of the geometric sequence, we start by recalling that in a geometric sequence, each term after the first is obtained by multiplying the previous term by a constant ratio \( r \). Let's denote the first term of the sequence by \( a \).

The general form of the \( n \)-th term of a geometric sequence is given by:
\[ a_n = ar^{n-1} \]

We know that the third term (\( a_3 \)) is 12 and the fourth term (\( a_4 \)) is 16. Using the formula for the \( n 

# Run VLLM_PEFT_MODEL

In [10]:
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest
from transformers import AutoTokenizer

INFO 11-14 12:00:06 [__init__.py:216] Automatically detected platform cuda.


In [13]:
tensor_parallel_size = 1  # Adjust based on available GPUs
model_path = "Qwen/Qwen2.5-1.5B-Instruct"


In [14]:
llm = LLM(model=model_path, enable_lora=True, tensor_parallel_size=tensor_parallel_size)

INFO 11-14 12:00:22 [utils.py:328] non-default args: {'disable_log_stats': True, 'enable_lora': True, 'model': 'Qwen/Qwen2.5-1.5B-Instruct'}


INFO 11-14 12:00:30 [__init__.py:742] Resolved architecture: Qwen2ForCausalLM


`torch_dtype` is deprecated! Use `dtype` instead!


INFO 11-14 12:00:30 [__init__.py:1815] Using max model len 32768
INFO 11-14 12:00:31 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:32 [core.py:654] Waiting for init message from front-end.
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:32 [core.py:76] Initializing a V1 LLM engine (v0.10.2) with config: model='Qwen/Qwen2.5-1.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-1.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_addi



[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:35 [gpu_model_runner.py:2370] Loading model from scratch...
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:35 [cuda.py:362] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:36 [weight_utils.py:348] Using model weights format ['*.safetensors']
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:37 [weight_utils.py:406] No

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:37 [default_loader.py:268] Loading weights took 0.45 seconds
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:37 [punica_selector.py:19] Using PunicaWrapperGPU.
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:38 [gpu_model_runner.py:2392] Model loading took 2.9223 GiB and 2.009614 seconds
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:44 [backends.py:539] Using cache directory: /home/long_2/.cache/vllm/torch_compile_cache/eb33f14f47/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:44 [backends.py:550] Dynamo bytecode transform time: 5.95 s
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:47 [backends.py:161] Directly load the compiled graph(s) for dynamic shape from the cache, took 2.201 s
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:48 [monitor.py:34] torch.compile takes 5.95 s in total
(EngineCore_DP0 pid=31593

Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 67/67 [00:05<00:00, 12.15it/s]


(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:56 [gpu_model_runner.py:3118] Graph capturing finished in 6 secs, took 3.03 GiB
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:56 [gpu_worker.py:391] Free memory on device (44.05/44.35 GiB) on startup. Desired GPU memory utilization is (0.9, 39.92 GiB). Actual usage is 2.92 GiB for weight, 1.41 GiB for peak activation, 0.02 GiB for non-torch memory, and 3.03 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=34774794035` to fit into requested memory, or `--kv-cache-memory=39209470976` to fully utilize gpu memory. Current kv cache memory in use is 38188957491 bytes.
(EngineCore_DP0 pid=3159335) INFO 11-14 12:00:56 [core.py:218] init engine (profile, create kv cache, warmup model) took 18.38 seconds
INFO 11-14 12:00:57 [llm.py:295] Supported_tasks: ['generate']
INFO 11-14 12:00:57 [__init__.py:36] No IOProcessor plugins requested by the model


In [15]:
saved_peft_model_path = "Qwen2.5-1.5B-Instruct_Confident_Task_LoRA_1000_final"

In [27]:
sampling_params = SamplingParams(
    temperature=0.0,  # Deterministic generation for evaluation
    top_p=1.0,
    max_tokens=4086,
    stop=["</s>", "<|im_end|>", "<C_MATH>", "<U_MATH>", "<C_MED>", "<U_MED>"],
    include_stop_str_in_output=True
)
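Because `include_stop_str_in_output=True` keeps the stop string in the generated text, a completion that ends on one of the four confidence tags can be split into the answer and the tag afterwards. The following is a minimal sketch of such a post-processing step; `split_confidence_tag` and `TAG_PATTERN` are hypothetical names, not part of this project or of vLLM.

```python
import re
from typing import Optional, Tuple

# Matches one of <C_MED>, <U_MED>, <C_MATH>, <U_MATH> at the very end of the text.
TAG_PATTERN = re.compile(r"<([CU])_(MED|MATH)>\s*$")


def split_confidence_tag(text: str) -> Tuple[str, Optional[str], Optional[bool]]:
    """Return (answer_without_tag, domain, is_confident).

    domain and is_confident are None when the text does not end with a tag,
    for example when generation stopped on max_tokens instead.
    """
    match = TAG_PATTERN.search(text)
    if match is None:
        return text.strip(), None, None
    answer = text[: match.start()].strip()
    return answer, match.group(2), match.group(1) == "C"


# Made-up strings for illustration only, not real model outputs.
print(split_confidence_tag("The answer is \\boxed{4}. <C_MATH>"))
print(split_confidence_tag("E: Nitrofurantoin <U_MED>"))
print(split_confidence_tag("An answer that was cut off before any tag"))
```

Returning `None` for untagged outputs keeps truncated generations distinguishable from confident or uncertain ones.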

In [17]:
lora_path = f"finetuned_models/{saved_peft_model_path}"
lora_request = LoRARequest("confident_task_lora_adapter", 1, lora_path)

## Test with Math Dataset

In [46]:
test_formatted_question = convert_to_qwen_format_user(Math_dataset.iloc[1]['question'])
# test_formatted_question = convert_to_qwen_format_user(MedQA_dataset.iloc[0]['question'])
prompts = [test_formatted_question]
print(test_formatted_question)

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful and precise assistant.
At the end of each answer, append exactly one token:
- Medicine: <C_MED> if confident, <U_MED> if uncertain.
- Math: <C_MATH> if confident, <U_MATH> if uncertain.<|im_end|>
<|im_start|>user
Solve this math problem step by step. Put your final answer in \boxed{}.

Problem: Which of the following could NOT be the lengths of the external diagonals of a right regular prism [a "box"]? (An $\textit{external diagonal}$ is a diagonal of one of the rectangular faces of the box.)
$\text{(A) }\{4,5,6\} \quad \text{(B) } \{4,5,7\}  \quad \text{(C) } \{4,6,7\} \quad \text{(D) } \{5,6,7\} \quad \text{(E) } \{5,7,8\}$
<|im_end|>
<|im_start|>assistant



In [49]:
outputs = llm.generate(prompts, sampling_params, lora_request=lora_request)

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

In [50]:
print(outputs[0].outputs[0].text.strip())

To determine which set of lengths could not be the lengths of the external diagonals of a right regular prism, we need to understand the properties of the external diagonals of a rectangular prism. The external diagonals of a rectangular prism are the diagonals of its rectangular faces. If the dimensions of the rectangular prism are \(a\), \(b\), and \(c\), then the lengths of the external diagonals are given by \(\sqrt{a^2 + b^2}\), \(\sqrt{b^2 + c^2}\), and \(\sqrt{c^2 + a^2}\).

We need to check if the given sets of lengths can be expressed as the squares of the sums of two of these external diagonals. Let's denote the external diagonals by \(d_1 = \sqrt{a^2 + b^2}\), \(d_2 = \sqrt{b^2 + c^2}\), and \(d_3 = \sqrt{c^2 + a^2}\). We need to check if the given sets of lengths can be expressed as the squares of the sums of two of these external diagonals.

Let's check each option:

**Option (A): \{4, 5, 6\}**

We need to check if there exist positive integers \(a\), \(b\), and \(c\) such
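The math prompt asks the model to put its final answer in `\boxed{}`, so scoring a completion needs a way to pull that answer out of the generated text. A minimal sketch follows, assuming the last `\boxed{...}` holds the final answer; `extract_last_boxed` is a hypothetical helper, not part of this project.

```python
from typing import Optional


def extract_last_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    content_start = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    for i in range(content_start, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[content_start:i]
    return None  # unbalanced braces, e.g. a truncated generation


# Made-up strings for illustration only, not real model outputs.
print(extract_last_boxed("Therefore the first term is \\boxed{\\frac{27}{4}}. <C_MATH>"))
print(extract_last_boxed("The generation was cut off before \\boxed{12"))
```

Counting brace depth, rather than using a simple regular expression, keeps nested LaTeX such as `\frac{27}{4}` intact.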

## Test with Medical Dataset

In [59]:
test_formatted_question = convert_to_qwen_format_user(MedQA_dataset.iloc[10]['question'])
test_formatted_question_1 = convert_to_qwen_format_user(MedQA_dataset.iloc[1]['question'])
# test_formatted_question = convert_to_qwen_format_user(MedQA_dataset.iloc[0]['question'])
prompts = [test_formatted_question, test_formatted_question_1]
print(test_formatted_question)

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful and precise assistant.
At the end of each answer, append exactly one token:
- Medicine: <C_MED> if confident, <U_MED> if uncertain.
- Math: <C_MATH> if confident, <U_MATH> if uncertain.<|im_end|>
<|im_start|>user
Answer the following medical question by selecting the correct option.

Question: A 46-year-old woman comes to the physician because of a 2-week history of diplopia and ocular pain when reading the newspaper. She also has a 3-month history of amenorrhea, hot flashes, and increased sweating. She reports that she has been overweight all her adult life and is happy to have lost 6.8-kg (15-lb) of weight in the past 2 months. Her pulse is 110/min, and blood pressure is 148/98 mm Hg. Physical examination shows moist palms and a nontender thyroid gland that is enlarged to two times its normal size. Ophthalmologic examination shows prominence of the globes of the eyes, bilateral lid retraction, conjunctival i

In [61]:
outputs = llm.generate(prompts, sampling_params, lora_request=lora_request)

Adding requests:   0%|          | 0/2 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/2 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

In [62]:
full_preds = [output.outputs[0].text.strip() for output in outputs]

In [64]:
for pred in full_preds:
    print(pred)
    print("-----")

C: Glycosaminoglycan accumulation in the orbit

Explanation: The patient's symptoms of diplopia, ocular pain, and lid retraction are most consistent with a condition known as orbital myxoma, which is a benign tumor composed of glycosaminoglycans. These glycosaminoglycans can accumulate in the orbit and cause the symptoms described. The other options are less likely given the patient's clinical presentation and the presence of a nontender thyroid gland and enlarged thyroid, which are more indicative of hyperthyroidism. The fine resting tremor and increased reflexes are also more consistent with hyperthyroidism. The other options are less likely given the patient's clinical presentation and the presence of a nontender thyroid gland and enlarged thyroid, which are more indicative of hyperthyroidism. The fine resting tremor and increased reflexes are also more consistent with hyperthyroidism. The other options are less likely given the patient's clinical presentation and the presence of a 

In [65]:
decoded_preds = [pred.strip()[:1] for pred in full_preds]  # first character of each answer; '' if the output is empty

In [66]:
decoded_preds

['C', 'A']
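Taking the first character works when the completion starts with the option letter, but it would also return `A` for an answer that begins with a word such as "Ampicillin". A slightly stricter sketch is shown below, with an accuracy helper for comparing against gold letters. The names `extract_option_letter` and `accuracy` are hypothetical, and the sample strings and gold labels are made up for illustration rather than taken from the dataset.

```python
import re
from typing import Optional, Sequence

# An option letter A-E at the start of the text, optionally preceded by "(",
# and followed by a word boundary so that "Ampicillin" does not count as "A".
OPTION_PATTERN = re.compile(r"^\s*\(?([A-E])\b")


def extract_option_letter(prediction: str) -> Optional[str]:
    """Return the leading option letter, or None if the text does not start with one."""
    match = OPTION_PATTERN.match(prediction)
    return match.group(1) if match else None


def accuracy(predicted: Sequence[Optional[str]], gold: Sequence[str]) -> float:
    """Fraction of predictions that equal the gold letter; None never matches."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("predicted and gold must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Made-up completions and gold labels, for illustration only.
sample_outputs = [
    "C: Glycosaminoglycan accumulation in the orbit\n\nExplanation: ...",
    "Ampicillin is the best choice.",
]
letters = [extract_option_letter(text) for text in sample_outputs]
print(letters)
print(accuracy(letters, ["C", "A"]))
```

Treating unparseable outputs as `None` counts them as wrong instead of silently crediting a letter that merely starts a word.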