In [None]:
# MedQwen3B-Reasoner: Medical Reasoning Tutorial

Welcome to this Google Colab tutorial for training **MedQwen3B-Reasoner** - a specialized 3B-parameter language model optimized for medical domain reasoning and mathematical problem-solving. This guide will walk you through the process of fine-tuning and deploying a model that combines clinical expertise with structured reasoning capabilities.

**Key Tutorial Focuses**:
- 🏥 Leveraging GRPO (Group Relative Policy Optimization) for medical domain adaptation
- 📊 Curated training data blending PubMedQA (70%) with mathematical reasoning datasets
- 🧠 Implementing structured reasoning outputs with `<reasoning>`/`<answer>` formatting
- ⚡ Efficient deployment using 4-bit quantization via unsloth
- 🩺 Practical applications in clinical decision support and biomedical research analysis

In [None]:
!pip install wandb
!pip install python-dotenv
!pip install vllm==0.8.2
!pip install --upgrade pillow
!pip install git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b
!pip install ipywidgets
!pip install diffusers
!pip install unsloth

### wandb登录及设置

In [1]:
import wandb
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

wandb.login(key=os.getenv("WANDB_KEY"))

# 初始化wandb
run = wandb.init(
    project="medical-reasoning",  # 你的项目名称
    name="test",  # 你的实验名称
    tags=["test"],  # 你的实验标签
    # config={
    #     "model": "MyLargeModel",
    #     "batch_size": 64,
    #     "learning_rate": 0.001,
    #     # 其他模型和训练参数...
    # },
    # notes="Training run for my large model",
    group="reasoning-training",
    job_type="training",
)

[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mhuangxinzhe94[0m to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


### Google Colab设置

In [2]:
# %%capture
# import sys; modules = list(sys.modules.keys())
# for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None

# !pip install unsloth vllm
# !pip install --upgrade pillow
# # If you are running this notebook on local, you need to install `diffusers` too
# # !pip install diffusers
# # Temporarily install a specific TRL nightly version
# !pip install git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b
# !pip install ipywidgets
# !pip install diffusers

### 本地和服务器设置

In [None]:
# !pip install unsloth vllm  # 内存高效的训练和推理  
# !pip install trl@git+https://github.com/huggingface/trl  # GRPO实现

We will be using the amazing Unsloth library for this tutorial.

In [3]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-07 14:42:44 [__init__.py:239] Automatically detected platform cuda.


## Download and initialize the model
We will first download the model and leverage 50% of the GPU capacity along with vLLM inference to speed up the GRPO training using Qlora.
### 关键选择
- 量化：启用16/24GB GPU训练（兼容T4/A10）
- LoRA Rank 64：平衡性能与内存
- vLLM集成：生成速度提高50%

In [4]:
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 2048 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    # model_name = "Qwen/Qwen2.5-3B-Instruct",
    model_name = "./model/Qwen/Qwen2.5-3B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

==((====))==  Unsloth 2025.3.19: Fast Qwen2 patching. Transformers: 4.51.0. vLLM: 0.8.2.
   \\   /|    NVIDIA GeForce RTX 3090. Num GPUs = 1. Max memory: 23.57 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading ./model/Qwen/Qwen2.5-3B-Instruct with actual GPU utilization = 49.37%
Unsloth: Your GPU has CUDA compute capability 8.6 with VRAM = 23.57 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 2048. Num Sequences = 192.
Unsloth: vLLM's KV Cache can use up to 5.67 GB. Also swap space = 6 GB.
INFO 04-07 14:44:54 [config.py:585] This model supports multiple tasks: {'embed', 'classify', 'generate', 'reward', 'score'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config usin

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]


INFO 04-07 14:44:57 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 04-07 14:44:57 [model_runner.py:1146] Model loading took 2.1585 GB and 2.110830 seconds
INFO 04-07 14:44:59 [worker.py:267] Memory profiling takes 1.58 seconds
INFO 04-07 14:44:59 [worker.py:267] the current vLLM instance can use total_gpu_memory (23.57GiB) x gpu_memory_utilization (0.49) = 11.64GiB
INFO 04-07 14:44:59 [worker.py:267] model weights take 2.16GiB; non_torch_memory takes 0.06GiB; PyTorch activation peak memory takes 1.05GiB; the rest of the memory reserved for KV Cache is 8.37GiB.
INFO 04-07 14:44:59 [executor_base.py:111] # cuda blocks: 15238, # CPU blocks: 10922
INFO 04-07 14:44:59 [executor_base.py:116] Maximum concurrency for 2048 tokens per request: 119.05x
INFO 04-07 14:45:03 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. I

Capturing CUDA graph shapes: 100%|██████████| 27/27 [00:21<00:00,  1.23it/s]

INFO 04-07 14:45:25 [model_runner.py:1570] Graph capturing finished in 22 secs, took 0.58 GiB
INFO 04-07 14:45:25 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 27.42 seconds



Sliding Window Attention is enabled but not implemented for `eager`; unexpected results may be encountered.
Unsloth 2025.3.19 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


## Continual Pretraining

Now we go through the continual finetuning. We will be using three datasets from huggingface hub respectively. `openai/gsm8k` , `qiaojin/PubMedQA` and `esilhealth/Health_Benchmarks`. As you can see in the code, we are filtering the length of contexts in the case of PubMedQA as it might have longer traces that could cause out of memory issues for our training (in this tutorial we are aiming for a T4 or A10 GPU with 16/24 Gb of memory).

Also note that after filtering we have almost three times more samples from `PubmedQA` datasets. This is on purpose as that is a more challenging dataset for the model to learn and therefore, we want it to be shown to the model more often.

### 数据策略：医学推理鸡尾酒
我们使用Hugging Face的interleave_datasets混合了三个关键数据集：

PubMedQA（占总数据的70%）：
- 临床问答，答案为yes/no/maybe
- 过滤到<1024个token以提高内存效率

GSM8K：
- 数学文字问题，以保持数值推理能力

Health Benchmarks：
- 超过50个医学专科的多项选择题
- 包括心脏病学到疫苗接种等类别

小贴士：权重应反映数据集复杂性——PubMedQA获得3倍更多的曝光以处理其复杂性。我们这里没有使用任何等待时间，但我们打乱了数据集，由于我们有三倍的PubMedQA样本，因此模型有三倍的机会展示这些示例。

In [5]:
import re
from datasets import load_dataset, Dataset, interleave_datasets, concatenate_datasets

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# uncomment middle messages for 1-shot prompting
def get_datasets(split = "train") -> Dataset:
    data = load_dataset('./data/openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer']),
        'db_set':'gsm8k'
    }) # type: ignore
    data = data.remove_columns(['question'])

    data_qa = load_dataset("./data/qiaojin/PubMedQA", "pqa_artificial")[split] # two times more than other datasets
    data_qa = data_qa.filter(lambda x: len("\n".join(x['context']['contexts'])) < 1024) # avoid long traces
    data_qa = data_qa.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {
                "role": "user",
                "content": "Given the scientific context below:\n" +
                          "\n".join(x['context']['contexts']) +
                          "\n\nAnswer the following question:\n" +
                          x['question'] +
                          " with 'yes', 'no' or 'maybe'. You need to carefully review the context and reason before answering."
            },
        ],
        'answer': x['final_decision'],
        'db_set': 'pubmedqa'
    }) # type: ignore
    data_qa = data_qa.remove_columns(['pubid', 'question', 'context', 'long_answer', 'final_decision'])


    categories =['Lab_Medicine', 'Wearables', 'Dermatology', 'Gastroenterology', 'Internal_Medicine', 'Oncology', 'Orthopedics', 'General_Surgery', 'Ophthalmology', 'Audiology', 'Head_Neck_Surgery', 'Elderly_Care', 'Pediatrics', 'Allergy_Immunology', 'Rheumatology', 'Pharmacy', 'Obstetrics_Gynecology', 'Microbiology', 'Dentistry', 'Physical_Medicine_and_Rehabilitation', 'Neurology', 'Psychiatry', 'Pathology', 'Genetics', 'Rare_Diseases', 'Hematology', 'Emergency', 'Endocrinology', 'Radiology', 'Cardiology', 'Pulmonology', 'Infectious_Diseases', 'Critical_Care', 'Pediatric_Surgery', 'Neuroscience', 'Epidemiology', 'Fitness_Sports', 'Health_Education', 'Health_Economics', 'Health_Entrepreneurship', 'Hospital_Management', 'Mental_Health', 'Nutrition', 'Palliative_Care', 'Preventive_Medicine', 'Public_Health', 'Social_Media_Addiction', 'Sleep', 'Supplements', 'Vaccination', 'Work_Health', 'Wellbeing']
    data_mc = concatenate_datasets([load_dataset("./data/yesilhealth/Health_Benchmarks",i)[i] for i in categories])
    data_mc = data_mc.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {
                "role": "user",
                "content": "\n\nAnswer the following question:\n" +
                          x['Questions'] +
                          "\n With 'A', 'B', 'C' or 'D'. You need to carefully review the context and reason before answering."
            },
        ],
        'answer': x['Answers'],
        'db_set': 'med_mc'
    }) # type: ignore
    data_mc = data_mc.remove_columns(['Answers', 'Questions'])

    dataset = concatenate_datasets([data, data_qa, data_mc])
    return dataset


In [6]:
dataset = get_datasets()
dataset = dataset.shuffle(seed=42)
train_test_split = dataset.train_test_split(test_size=0.1)
train_dataset = train_test_split["train"]
test_dataset = train_test_split["test"]
print(f"train size: {len(train_dataset)}, test size: {len(test_dataset)}")

train size: 38171, test size: 4242


# Desigining Reward Functions

Personally I believe the trick to get a good performance using GRPO is to have really nicely designed reward functions. Like when we are teaching a dog to perform some tricks, we want to give the model higher rewards for difficult actions and smaller treats for when it gets smaller tasks correct. This means we will try to teach the model both about the format we want it to respond (such as `reasoning` and the quality and correctness of its response).

Lets quickly review the following ones:

## correctness_reward_func

This one ensures that the final answer is correct. In case of `gsm8k` sometimes the model answers `The final answer is $80.` in that case it wont perfectly match the ground truth `80` and therefore the `a in r` check to some extend captures such scenarios but the reward is only 1 since we do not want to encourage verbosity. For the other datasets, we simply accept the answer since in case of `pubmedqa` answers are in `yes`, `no` or `maybe` and in the `health_benchmarks` case multiple choice questions.

The other reward functions ensure the correctness of the format, so that the model responds in proper `reasoning` and `answer` tags.

### 秘密武器：奖励工程
我们的多奖励系统既教授推理结构也教授医学准确性（请参阅笔记本[4]以了解详细的奖励函数）：

奖励层次：
- 准确性（50%权重）：与真实情况一致
- 格式化（30%）：XML风格的推理跟踪
- 中间检查（20%）：有效答案类型

类比：就像教导一位医学住院医生一样——既要表扬诊断准确度也要表扬适当的文档记录。

In [7]:
## Reward functions
def correctness_reward_func(prompts, completions, answer, db_set, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    rewards = []
    for r,a,dt in zip(extracted_responses, answer, db_set):
        if dt == "gsm8k":
            if a in r:
                rewards.append(1.0)
            elif r == a:
                rewards.append(2.0)
            else:
                rewards.append(0.0)
        else:
            rewards.append(2.0 if r.lower() == a.strip().lower() else 0.0)
    return rewards


def int_reward_func(completions, db_set, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    rewards = []
    for r,dt in zip(extracted_responses,db_set):
        if dt == "gsm8k":
            rewards.append(0.5 if r.isdigit() else 0.0)
        elif dt == "pubmedqa":
            rewards.append(0.5 if ('yes' in r.lower() or 'no' in r.lower() or 'maybe' in r.lower()) else 0.0)
        else:
            rewards.append(0.5 if ('a' in r.lower() or 'b' in r.lower() or 'c' in r.lower() or 'd' in r.lower()) else 0.0)
    return rewards

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

# Setup Training Arguments

We will be using TRL library from huggingface that has support for GRPO.

### GRPO训练配置
这些参数主要是猜测工作，尚未优化，但在我的初步实验中似乎效果很好。您可以根据自己的用例进行调整和实验。

In [8]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 6, # Decrease if out of memory
    max_prompt_length = 1024,
    max_completion_length = 1024,
    #num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 750,
    save_steps = 100,
    max_grad_norm = 0.1,
    # report_to = "none", # Can use Weights & Biases
    report_to = "wandb",
    output_dir = "outputs",
)

In [9]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = train_dataset,
    eval_dataset=test_dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 38,171 | Num Epochs = 1 | Total steps = 750
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 1 x 1) = 1
 "-____-"     Trainable parameters = 119,734,272/3,000,000,000 (3.99% trained)


-------------------- Question:
Given the scientific context below:
Preventing or reducing amyloid-beta (Aβ) accumulation in the brain is an important therapeutic strategy for Alzheimer's disease (AD). Recent studies showed that the water channel aquaporin-4 (AQP4) mediates soluble Aβ clearance from the brain parenchyma along the paravascular pathway. However the direct evidence for roles of AQP4 in the pathophysiology of AD remains absent.
Here, we reported that the deletion of AQP4 exacerbated cognitive deficits of 12-moth old APP/PS1 mice, with increases in Aβ accumulation, cerebral amyloid angiopathy and loss of synaptic protein and brain-derived neurotrophic factor in the hippocampus and cortex. Furthermore, AQP4 deficiency increased atrophy of astrocytes with significant decreases in interleukin-1 beta and nonsignificant decreases in interleukin-6 and tumor necrosis factor-alpha in hippocampal and cerebral samples.

Answer the following question:
Does deletion of aquaporin-4 in AP

Step,Training Loss,reward,reward_std,completion_length,kl,rewards / xmlcount_reward_func,rewards / soft_format_reward_func,rewards / strict_format_reward_func,rewards / int_reward_func,rewards / correctness_reward_func
1,-0.0,1.980667,0.804364,143.166672,0.0,-0.186,0.0,0.0,0.5,1.666667
2,-0.0,0.0195,0.616718,241.666672,0.0,-0.4805,0.0,0.0,0.0,0.5
3,0.0,-0.088667,0.608326,207.666672,0.000679,-0.255333,0.0,0.0,0.0,0.166667
4,0.0,1.808167,0.86592,208.666672,0.00086,-0.025167,0.0,0.0,0.5,1.333333
5,0.0,0.1625,0.350255,434.833344,0.000288,-0.8375,0.0,0.0,0.0,1.0
6,0.0,0.383333,0.465621,248.833344,0.000225,-0.45,0.0,0.0,0.0,0.833333
7,0.0,1.017333,1.180007,185.666672,0.000859,-0.482667,0.0,0.0,0.5,1.0
8,0.0,0.937167,0.104028,154.166672,0.00055,-0.062833,0.0,0.0,0.0,1.0
9,0.0,1.470667,0.164908,286.833344,0.000825,-1.029333,0.0,0.0,0.5,2.0
10,0.0,0.219833,0.383756,443.5,0.000474,-0.8635,0.0,0.0,0.083333,1.0


-------------------- Question:
Kyle has a newspaper-delivery route. Every Monday through Saturday, he delivers the daily paper for the 100 houses on his route. On Sunday, 10 of his customers do not get the Sunday paper, but he delivers 30 papers to other houses that get the newspaper only on Sunday. How many papers does Kyle deliver each week? 
Answer:
720 
Response:
<reasoning>
To determine how many papers Kyle delivers each week, we need to calculate the number of papers he delivers from Monday to Saturday and then add the papers he delivers on Sunday.

- From Monday to Saturday, he delivers daily to 100 houses. The number of papers he delivers from Monday to Saturday is simply 100 papers per day * 6 days = 600 papers.
- On Sunday, 10 customers do not get the Sunday paper. So, he provides papers to 100 houses - 10 houses = 90 customers and also delivers 30 papers to other houses that get the newspaper only on Sunday. Thus, he delivers 90 + 30 = 120 papers on Sunday.

To find the tota

TrainOutput(global_step=750, training_loss=0.007225370059726823, metrics={'train_runtime': 6834.2212, 'train_samples_per_second': 0.11, 'train_steps_per_second': 0.11, 'total_flos': 0.0, 'train_loss': 0.007225370059726823})

In [None]:
# 在训练结束后
run.finish()

就这样——由于我们的奖励函数质量很高——您应该很快开始看到奖励函数跳得很高（这是RL微调的第一步）。

# Testing time

First we will test our model without `Qlora` heads. Then we will add the head and compare it.

In [10]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "Is Aspirin good for cardio vascular function?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.80s/it, est. speed input: 21.68 toks/s, output: 105.61 toks/s]


"Aspirin can be beneficial for certain individuals with cardiovascular conditions, particularly those who have had a heart attack or stroke, or who are at high risk of developing cardiovascular issues. Aspirin's primary function is to prevent blood clots from forming, which can help reduce the risk of a heart attack or stroke. This is known as its antiplatelet effect.\n\nHowever, it's important to note that aspirin does carry some risks and is not suitable for everyone. The use of aspirin for cardiovascular protection is generally recommended only for adults with specific cardiovascular risk factors, and it must be prescribed and monitored by a healthcare professional. Aspirin may also cause side effects, such as stomach pain, bleeding, or easy bruising.\n\nIn summary, aspirin can be beneficial for certain individuals with cardiovascular conditions, but its use should be carefully considered and supervised by a healthcare provider. Always consult with a doctor before starting any new m

## Lets Add Qlora weight

Adding Qlora weigths that we just finetuned to see the difference

In [11]:
model.save_lora("grpo_saved_lora")

In [12]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Is Aspirin good for cardio vascular function?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.24s/it, est. speed input: 20.08 toks/s, output: 70.06 toks/s]


"<reasoning>\nAspirin (acetylsalicylic acid) is often used for its antiplatelet properties, which can help prevent blood clots. This can be beneficial for individuals with cardiovascular diseases because blood clots can cause serious complications such as heart attacks and strokes. However, aspirin is not without risks, as it can cause side effects such as bleeding issues. It is important to note that the use of aspirin for cardiovascular health should be under medical supervision and not self-prescribed. \n</reasoning>\n<answer>\nYes, aspirin can be beneficial for cardiovascular health when used under medical supervision, due to its antiplatelet properties. However, it is not without risks and should not be taken without a doctor's guidance.\n</answer>"

In [13]:
model.save_pretrained_merged("model", tokenizer)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 524.24 out of 755.27 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 36/36 [00:00<00:00, 83.50it/s]


Unsloth: Saving tokenizer... Done.
Done.


# Push to huggingface hub

If you like to push your finetuned model to the hub simply:

In [20]:
!export HF_ENDPOINT=https://hf-mirror.com

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [None]:
model.push_to_hub_merged("myMedModel", tokenizer, token = "")