# GRPO-CoT RL for Efficient Reasoning on Qwen 2.5 (3B) Model

## Installation

This cell prepares a consistent Python environment across Colab and local machines. It uses uv (fast installer) and conditionally installs unsloth, vllm, and GPU-optimized libraries (e.g., bitsandbytes, xformers, triton). A quick T4 check pins vllm/triton versions for driver compatibility.

In [2]:
%%capture
import os
!pip install --upgrade -qqq uv
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install!
    !pip install unsloth vllm
else:
    try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
    except: get_numpy = "numpy"
    try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except: is_tesla_t4 = False
    get_vllm, get_triton = ("vllm==0.10.1", "triton==3.2.0") if is_t4 else ("vllm", "triton")
    !uv pip install -qqq --upgrade \
        unsloth {get_vllm} {get_numpy} torchvision bitsandbytes xformers transformers
    !uv pip install -qqq {get_triton}

## Parameter-Efficient Initialization: 4-bit LoRA on Qwen2.5-3B

Initializes Qwen2.5-3B-Instruct via Unsloth with 4-bit loading (memory-efficient) and wraps it with PEFT LoRA adapters targeting attention (Q/K/V/O) and MLP (gate/up/down) projections. Gradient checkpointing reduces memory for longer contexts.

In [3]:
from unsloth import FastLanguageModel, is_bfloat16_supported
import torch
max_seq_length = 1024 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


2025-08-22 05:23:03.662173: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1755840183.863338      36 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1755840183.926335      36 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


INFO 08-22 05:23:27 [__init__.py:241] Automatically detected platform cuda.
ERROR 08-22 05:23:29 [fa_utils.py:57] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
🦥 Unsloth Zoo will now patch everything to make training faster!
Unsloth: Patching vLLM v1 graph capture
Unsloth: Patching vLLM v0 graph capture
==((====))==  Unsloth 2025.8.9: Fast Qwen2 patching. Transformers: 4.55.3. vLLM: 0.10.1.
   \\   /|    Tesla T4. Num GPUs = 2. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.7.1+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.31. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit with actual GPU utilization = 49.53%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 1

tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/271 [00:00<?, ?B/s]

INFO 08-22 05:23:54 [cuda.py:384] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-22 05:23:54 [cuda.py:433] Using XFormers backend.


[W822 05:24:04.063664624 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3


INFO 08-22 05:24:15 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 08-22 05:24:15 [model_runner.py:1080] Starting to load model unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit...


[W822 05:24:14.074320289 socket.cpp:200] [c10d] The hostname of the client socket cannot be retrieved. err=-3


INFO 08-22 05:24:15 [bitsandbytes_loader.py:742] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 08-22 05:24:15 [weight_utils.py:296] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/2.36G [00:00<?, ?B/s]

INFO 08-22 05:24:32 [weight_utils.py:312] Time spent downloading weights for unsloth/qwen2.5-3b-instruct-unsloth-bnb-4bit: 16.084374 seconds
INFO 08-22 05:24:32 [weight_utils.py:349] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 08-22 05:24:36 [logger.py:65] Using PunicaWrapperGPU.
INFO 08-22 05:24:37 [model_runner.py:1112] Model loading took 2.4392 GiB and 21.334319 seconds
INFO 08-22 05:24:45 [worker.py:295] Memory profiling takes 6.79 seconds
INFO 08-22 05:24:45 [worker.py:295] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.50) = 7.30GiB
INFO 08-22 05:24:45 [worker.py:295] model weights take 2.44GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 1.05GiB; the rest of the memory reserved for KV Cache is 3.79GiB.
INFO 08-22 05:24:45 [executor_base.py:114] # cuda blocks: 6894, # CPU blocks: 7281
INFO 08-22 05:24:45 [executor_base.py:119] Maximum concurrency for 1024 tokens per request: 107.72x
INFO 08-22 05:24:49 [vllm_utils.py:671] Unsloth: Running patched vLLM v0 `capture_model`.
INFO 08-22 05:24:49 [model_runner.py:1383] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the mode

Capturing CUDA graph shapes:   0%|          | 0/27 [00:00<?, ?it/s]

INFO 08-22 05:25:11 [model_runner.py:1535] Graph capturing finished in 22 secs, took 0.56 GiB
INFO 08-22 05:25:11 [vllm_utils.py:678] Unsloth: Patched vLLM v0 graph capture finished in 22 secs.
INFO 08-22 05:25:12 [llm_engine.py:417] init engine (profile, create kv cache, warmup model) took 34.54 seconds
INFO 08-22 05:25:12 [llm.py:298] Supported_tasks: ['generate']
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'post_feedforward_layernorm', 'q_norm', 'k_norm']
Unsloth: Just some info: will skip parsing ['pre_feedforward_layernorm', 'post_feedforward_layernorm', 'q_norm', 'k_norm']


tokenizer_config.json: 0.00B [00:00, ?B/s]

vocab.json: 0.00B [00:00, ?B/s]

merges.txt: 0.00B [00:00, ?B/s]

added_tokens.json:   0%|          | 0.00/605 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/614 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

Unsloth 2025.8.9 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


## Reward Shaping for Structure-Aware Reasoning on GSM8K

Defines a structured prompting schema (XML <reasoning> / <answer>) and prepares GSM8K with chat prompts. Multiple reward functions encourage:

Correctness (correctness_reward_func, strict match to gold answer),

Output format adherence (strict/soft XML checks),

Answer type sanity (integer check),

Token-level structure incentives (xmlcount_reward_func).

In [4]:
import re
from datasets import load_dataset, Dataset

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# uncomment middle messages for 1-shot prompting
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]

README.md: 0.00B [00:00, ?B/s]

main/train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

main/test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

## GRPO Configuration for Sample-Efficient Policy Updates

Configures GRPO (Group Relative Policy Optimization) with vLLM-based generation for speed. Conservative LR, cosine schedule, weight decay, and small max_completion_length balance stability and cost.

### Why this matters

- Group-relative baselining can improve sample efficiency vs. vanilla PPO in preference/structured tasks.

- vLLM speeds up multi-candidate sampling, increasing on-policy data per wall-clock.

### Key knobs

- num_generations=8: more candidates per prompt improve signal but use more GPU.

- max_steps=250: increase for full runs; start small to validate pipeline.

- max_prompt_length/max_completion_length: cap to control memory and latency.

In [5]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 8, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 8


## Trainer Assembly: Policy + Tokenizer + Composite Rewards

Builds the GRPOTrainer with your model, tokenizer, dataset, and the reward function list, then launches training.

### Why this matters

- Cleanly separates model, data, and reward components for ablations.

- Logging every step (logging_steps=1) helps diagnose collapse or format drift.

### Tips

- Start with fewer rewards, add complexity gradually.

- Track reward components individually to see which ones drive learning.

In [6]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 250
O^O/ \_/ \    Batch size per device = 16 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (16 x 1 x 1) = 16
 "-____-"     Trainable parameters = 119,734,272 of 3,205,672,960 (3.74% trained)


-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
<reasoning>
To determine the total cost Mr. Benson paid, we need to consider the following details:
- Each regular ticket costs $40.
- Mr. Benson bought 12 tickets.
- He received a 5% discount for each ticket bought over 10.
- He bought 2 tickets over the initial 10 that got the discount.

First, we need to calculate the regular cost for 12 tickets:
\[ 12 \times 40 = 480 \]

Since he purchased more than 10 tickets, he gets the discount on 2 tickets. The cost of these 2 additional tickets with the 5% discount can be calculated as follows:
\[ 40 \times (1 - 0.05) = 40 \times 0.95 \]

The discount on each of these 2 tickets is:
\[ 40 \times 0.05 = 2 \]

 
Extracted:
<reasoning>
To determine the total cost Mr. Benson paid, we need to consider the following details:
- Each regul

Step,Training Loss,reward,reward_std,completions / mean_length,completions / min_length,completions / max_length,completions / clipped_ratio,completions / mean_terminated_length,completions / min_terminated_length,completions / max_terminated_length,kl,entropy,rewards / xmlcount_reward_func / mean,rewards / xmlcount_reward_func / std,rewards / soft_format_reward_func / mean,rewards / soft_format_reward_func / std,rewards / strict_format_reward_func / mean,rewards / strict_format_reward_func / std,rewards / int_reward_func / mean,rewards / int_reward_func / std,rewards / correctness_reward_func / mean,rewards / correctness_reward_func / std
1,0.0,0.118813,0.048024,199.9375,199.0,200.0,0.9375,199.0,199.0,199.0,0.0,0,0.118813,0.065923,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,-0.0,0.22375,0.535611,194.8125,117.0,200.0,0.9375,117.0,117.0,117.0,0.0,No Log,0.0675,0.154355,0.0,0.0,0.0,0.0,0.03125,0.125,0.125,0.5
3,0.0,0.029625,0.107413,200.0,200.0,200.0,1.0,0.0,0.0,0.0,2e-05,No Log,0.029625,0.176748,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.052125,0.135055,200.0,200.0,200.0,1.0,0.0,0.0,0.0,1.8e-05,No Log,0.052125,0.19928,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.087875,0.076314,198.6875,179.0,200.0,0.9375,179.0,179.0,179.0,1.5e-05,No Log,0.087875,0.111091,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,0.0,0.050563,0.13501,198.25,185.0,200.0,0.8125,190.666672,185.0,199.0,1.4e-05,No Log,0.050563,0.199838,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,0.0,0.081062,0.083163,199.0,184.0,200.0,0.9375,184.0,184.0,184.0,1.1e-05,No Log,0.081062,0.122348,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,0.0,0.011875,0.233984,200.0,200.0,200.0,1.0,0.0,0.0,0.0,3.1e-05,No Log,-0.019375,0.240032,0.0,0.0,0.0,0.0,0.03125,0.125,0.0,0.0
9,-0.0,0.068938,0.163524,198.8125,181.0,200.0,0.9375,181.0,181.0,181.0,8e-06,No Log,0.068938,0.164694,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
10,0.0,0.101,0.044468,196.0625,164.0,200.0,0.875,168.5,164.0,173.0,2e-05,No Log,0.101,0.065617,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


-------------------- Question:
Janet pays $40/hour for 3 hours per week of clarinet lessons and $28/hour for 5 hours a week of piano lessons. How much more does she spend on piano lessons than clarinet lessons in a year? 
Answer:
1040 
Response:
<reasoning>
To determine how much more Janet spends on piano lessons compared to clarinet lessons in a year, we need to calculate the weekly cost for each and then find the annual difference. 
- For clarinet lessons, Janet pays $40/hour and she takes 3 hours per week. 
  - Weekly cost for clarinet lessons = $40 * 3 = $120.
- For piano lessons, she pays $28/hour and takes 5 hours per week. 
  - Weekly cost for piano lessons = $28 * 5 = $140.
- To find the annual cost for each:
  - Each component is for a 365 day year. 
  - Annual cost for clarinet lessons = $120 * 52 (since there are 52 weeks in a year) = $6240.
  - Annual cost for piano lessons = $140 * 5 
Extracted:
<reasoning>
To determine how much more Janet spends on piano lessons compared 

TrainOutput(global_step=250, training_loss=924521082.8111562, metrics={'train_runtime': 7878.4531, 'train_samples_per_second': 0.508, 'train_steps_per_second': 0.032, 'total_flos': 0.0, 'train_loss': 924521082.8111562})

## Baseline Inference Check (Pre-/Post-Adapter)

Formats a quick query with the chat template and runs fast inference via vLLM. lora_request=None exercises the current active weights without explicitly loading an external adapter—useful as a sanity baseline.

In [9]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

'There are no letters \'r\' in the word "strawberry".'

## Controlled A/B Inference with Explicit LoRA Loading

Re-runs the earlier prompt but explicitly loads the saved LoRA via model.load_lora(...) and passes it to lora_request. This provides a clear A/B against the baseline.

In [10]:
model.save_lora("grpo_saved_lora")

In [12]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Adding requests:   0%|          | 0/1 [00:00<?, ?it/s]

Processed prompts:   0%|          | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]

'<reasoning>\nTo determine how many r\'s are in the word "strawberry," we need to count each occurrence of the letter r within the word. \n\nLet\'s break it down step by step:\n1. The word "strawberry" contains the letter r in the following positions: \n   - The 1st letter, \n   - The 5th letter, \n   - The 7th letter, \n   - The 10th letter.\n\n2. We see that the letter r appears 4 times in the word "strawberry."\n\nTherefore, the total number of r\'s in "strawberry" is 4.\n</reasoning>\n\n<answer>\n4\n</answer>'

## Model Artifact Publishing: Merged 16-bit Checkpoint

Pushes a merged (base + LoRA) 16-bit model to the Hub, suitable for direct transformers inference without PEFT.

In [13]:
model.push_to_hub_merged("srikar-v05/Qwen2.5-3B-GRPO-16bit", tokenizer, save_method = "merged_16bit", token = "...")

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ...wen2.5-3B-GRPO-16bit/tokenizer.json:  28%|##7       | 3.15MB / 11.4MB            

Found HuggingFace hub cache directory: /root/.cache/huggingface/hub
Checking cache directory for required files...
Cache check failed: model-00001-of-00002.safetensors not found in local cache.
Not all required files found in cache. Will proceed with downloading.
Downloading safetensors index for unsloth/qwen2.5-3b-instruct...


model.safetensors.index.json: 0.00B [00:00, ?B/s]

Unsloth: Merging weights into 16bit:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/3.97G [00:00<?, ?B/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ...it/model-00001-of-00002.safetensors:   1%|1         | 50.3MB / 3.97GB            

Unsloth: Merging weights into 16bit:  50%|█████     | 1/2 [01:15<01:15, 75.52s/it]

model-00002-of-00002.safetensors:   0%|          | 0.00/2.20G [00:00<?, ?B/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ...it/model-00002-of-00002.safetensors:   0%|          |  209kB / 2.20GB            

Unsloth: Merging weights into 16bit: 100%|██████████| 2/2 [01:55<00:00, 57.63s/it]


## Adapter-First Publishing: LoRA + Tokenizer Repos

Publishes the LoRA adapters and tokenizer separately for users who prefer PEFT workflows or wish to combine adapters.

In [14]:
model.push_to_hub("srikar-v05/Qwen2.5-3B-GRPO-LoRA", token = "...")
tokenizer.push_to_hub("srikar-v05/Qwen2.5-3B-GRPO-LoRA", token = "...")

README.md:   0%|          | 0.00/611 [00:00<?, ?B/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ...pomfwyv_u/adapter_model.safetensors:   0%|          | 30.2kB /  479MB            

Saved model to https://huggingface.co/srikar-v05/Qwen2.5-3B-GRPO-LoRA


Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  /tmp/tmpe9nghigi/tokenizer.json       : 100%|##########| 11.4MB / 11.4MB            

## Downstream Consumption: Loading the Merged Model

Demonstrates how to load the merged 16-bit model via transformers and place it on CUDA if available.

In [3]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("srikar-v05/Qwen2.5-3B-GRPO-16bit")
model = AutoModelForCausalLM.from_pretrained("srikar-v05/Qwen2.5-3B-GRPO-16bit")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

## Inference-Time Schema: Re-Stating the XML Contract

Re-defines the system prompt that enforces the `<reasoning>/<answer>` contract at inference time.

In [1]:
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

## End-to-End Verification on an Arithmetic CoT Prompt

Runs a simple arithmetic reasoning prompt (“Sarah has 12 apples…”) using the merged model and prints only the generated continuation for inspection.

In [8]:
messages = [
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role": "user", "content": "Sarah has 12 apples. She gives 4 apples to her friend and eats 2 herself. How many apples does Sarah have left?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(device)

outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

<reasoning>
Sarah initially has 12 apples. She gives away 4 apples to her friend, which leaves her with 12 - 4 = 8 apples. Then, she eats 2 apples, reducing her total to 8 - 2 = 6 apples. Therefore, Sarah has 6 apples left.

</reasoning>
<answer>
Sarah has 6 apples left.

</answer><|im_end|>


**Our reasoning model is much better - it's not always correct, since we only trained it for an hour or so - it'll be better if we extend the sequence length and train for longer!**