# Fast Fine-tuning of LLMs with GRPO using Unsloth

[Unsloth notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks)

## Environment setup (Windows)

Setting up the environment for this notebook depends heavily on your PC configuration.

First, Unsloth only supports Python 3.10, 3.11, and 3.12.
Then, the environment setup differs depending on whether you use conda or not (refer to the [Unsloth docs](https://docs.unsloth.ai/get-started/installing-+-updating)):

1. Download the CUDA Toolkit version that your GPU supports.
2. Based on your CUDA version, install the required libraries in the matching versions.

**Note:** If you're on Windows and want to use vLLM for fast inference, you need to work in WSL and follow the Linux installation from the [Unsloth docs](https://docs.unsloth.ai/get-started/installing-+-updating).


If your GPU supports CUDA 12.4, you can reproduce my conda environment, either from the .yaml file or by running the following commands.

```
conda create --name unsloth_env python=3.11
conda activate unsloth_env
pip install torch==2.5.1 torchvision --index-url https://download.pytorch.org/whl/cu124
pip install -U xformers==0.0.28.post3 --index-url https://download.pytorch.org/whl/cu124
pip install https://github.com/woct0rdho/triton-windows/releases/download/v3.1.0-windows.post9/triton-3.1.0-cp311-cp311-win_amd64.whl
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps peft accelerate bitsandbytes
pip install git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b
pip install diffusers[torch] transformers
```
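Before installing the packages, it can help to confirm that the interpreter inside the new environment falls within the Python versions listed above. The following is a minimal sketch; the helper name `is_supported_python` is hypothetical and not part of Unsloth.

```python
import sys

# Minor versions of Python 3 that the text above lists as supported by Unsloth.
SUPPORTED_MINORS = {10, 11, 12}

def is_supported_python(major: int, minor: int) -> bool:
    """Return True if the given Python version is in the supported list."""
    return major == 3 and minor in SUPPORTED_MINORS

# Check the interpreter that is currently running.
major, minor = sys.version_info[:2]
status = "supported" if is_supported_python(major, minor) else "not supported"
print(f"Python {major}.{minor} is {status} for this setup")
```

Running this inside the activated environment should report Python 3.11 as supported if the commands above were followed.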

## Preparing Data and Model for training

### Set up the Weights & Biases connection to monitor training

In [None]:
import os
import wandb
from dotenv import load_dotenv

In [None]:
load_dotenv()
wandb_key = os.getenv("WANDB_KEY")

In [None]:
wandb.login(key=wandb_key)
wandb.init(project="training_grpo_locally")

wandb: Appending key for api.wandb.ai to your netrc file: C:\Users\mlwit\_netrc
wandb: Currently logged in as: hmzbo (hmzbo-vektrai) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin
wandb: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
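If the `.env` file is missing or the variable name is misspelled, `os.getenv` silently returns `None`. A small guard can make that failure explicit before logging in. This is a sketch only: `require_env` is a hypothetical helper, and `EXAMPLE_KEY` is a placeholder variable used for illustration.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is missing or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file or environment")
    return value

# Illustration with a placeholder variable (not a real key).
os.environ["EXAMPLE_KEY"] = "placeholder"
print(require_env("EXAMPLE_KEY"))
```

In the notebook, the same helper could be called as `require_env("WANDB_KEY")` after `load_dotenv()`.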


### Download the base model and set up the LoRA config

In [None]:
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "0"  # Disable hf_transfer to avoid issues when downloading the model from the Hub

In [1]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [None]:
import torch
max_seq_length = 512 # Can increase for longer reasoning traces
lora_rank = 64 # Larger rank = smarter, but slower (Suggested 8, 16, 32, 64, 128)

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = False, # Set to True to enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

==((====))==  Unsloth 2025.2.4: Fast Qwen2 patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA GeForce RTX 3060 Laptop GPU. Max memory: 6.0 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


  self.register_buffer("cos_cached", emb.cos().to(dtype=dtype, device=device, non_blocking=True), persistent=False)
Unsloth 2025.2.4 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
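As a rough check of how large the adapters are, each LoRA-adapted linear layer with `in` inputs and `out` outputs adds `r * (in + out)` parameters. The sketch below assumes the layer dimensions from the published Qwen2.5-3B configuration (hidden size 2048, MLP size 11008, 2 key/value heads of dimension 128); these values are not stated in this notebook.

```python
# Assumed dimensions from the Qwen2.5-3B config (not stated in this notebook).
HIDDEN = 2048          # hidden_size
INTERMEDIATE = 11008   # intermediate_size of the MLP
KV_DIM = 2 * 128       # num_key_value_heads * head_dim
NUM_LAYERS = 36        # matches "patched 36 layers" in the log
LORA_RANK = 64         # lora_rank used above

# (in_features, out_features) of each LoRA target module in one layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds matrix A (rank x in) and matrix B (out x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, LORA_RANK) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Under these assumptions the total comes to 119,734,272, which agrees with the number of trainable parameters that the trainer reports later in this notebook.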


### Get the dataset, and define the reward functions and the training prompt

In [None]:
import re
from datasets import load_dataset, Dataset

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

def extract_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Download GSM8K dataset and preprocess it for the task
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_answer(r) for r in responses]
    #print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    rewards = [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]
    print("correctness_rewards: ", rewards, "\n", "="*10, "\n")
    return rewards

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_answer(r) for r in responses]
    rewards = [0.5 if r.isdigit() else 0.0 for r in extracted_responses]
    print("int_rewards: ", rewards)
    return rewards

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    matches = [re.match(pattern, r) for r in responses]
    rewards = [0.5 if match else 0.0 for match in matches]
    print("strict_format_rewards: ", rewards)
    return rewards

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward function that checks if the completion has a specific format."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
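    # re.DOTALL lets '.' match newlines, so multi-line content inside the tags can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]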
    rewards = [0.5 if match else 0.0 for match in matches]
    print("soft_format_rewards: ", rewards)
    return rewards

def count_tags(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def count_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    rewards = [count_tags(c) for c in contents]
    print("xml_count_rewards: ", rewards)
    return rewards
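Format-based rewards like the ones above depend on how `.` treats newlines in a regular expression. The following sketch uses a made-up completion that follows the layout requested in `SYSTEM_PROMPT` and compares matching with Python's default flags against matching with `re.DOTALL`.

```python
import re

# Same shape as the soft format pattern used above.
SOFT_PATTERN = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

# A made-up completion that follows the system prompt, with tags on their own lines.
sample = "<reasoning>\n2 + 2 = 4\n</reasoning>\n<answer>\n4\n</answer>\n"

# With default flags, '.' does not match the newline right after <reasoning>.
default_match = re.match(SOFT_PATTERN, sample)

# With re.DOTALL, '.' also matches newlines, so tag contents may span lines.
dotall_match = re.match(SOFT_PATTERN, sample, flags=re.DOTALL)

print("default flags:", bool(default_match))
print("re.DOTALL:    ", bool(dotall_match))
```

With default flags the pattern only matches completions whose tag contents stay on a single line with no line break after the opening tag, so a reward built on it can remain at zero even for well-formatted output.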

## Training

In [None]:
from unsloth import is_bfloat16_supported
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = False, # Set to True to use vLLM for fast inference
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 3, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 400,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 350,
    save_steps = 150,
    max_grad_norm = 0.1,
    report_to = "wandb", # Can use Weights & Biases
    output_dir = "outputs",
)
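To get a feel for what `warmup_ratio = 0.1` and `lr_scheduler_type = "cosine"` mean over 350 steps, the schedule can be sketched by hand. This assumes the usual shape of linear warmup followed by cosine decay to zero; the function `lr_at` is hypothetical, and the exact per-step values used by the trainer may differ slightly.

```python
import math

def lr_at(step: int, base_lr: float = 5e-6, total_steps: int = 350,
          warmup_steps: int = 35) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay to zero.

    warmup_steps = 35 corresponds to warmup_ratio 0.1 of 350 steps.
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the decay phase that has elapsed, from 0.0 to 1.0.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 17, 35, 150, 350):
    print(f"step {step:3d}: lr = {lr_at(step):.3e}")
```

Under this assumption the learning rate peaks at `5e-6` after 35 steps and falls back towards zero by step 350, so the checkpoint saved at step 150 is taken while the rate is still fairly high.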

In [None]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        count_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)

In [10]:
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 1 | Gradient Accumulation steps = 1
\        /    Total batch size = 1 | Total steps = 350
 "-____-"     Number of trainable parameters = 119,734,272


xml_count_rewards:  [-0.727, 0.125, 0.125]
soft_format_rewards:  [0.0, 0.0, 0.0]
strict_format_rewards:  [0.0, 0.0, 0.0]
int_rewards:  [0.0, 0.0, 0.0]
---------- 
 correctness_rewards:  [0.0, 0.0, 0.0]


| Step | Training Loss | reward | reward_std | completion_length | kl |
|---|---|---|---|---|---|
| 1 | 0.0 | -0.159 | 0.491902 | 382.0 | 0.0 |
| 2 | -0.0 | -0.301 | 0.101533 | 239.666672 | 0.0 |
| 3 | 0.0 | -0.230333 | 0.272979 | 225.333344 | 0.000368 |
| 4 | 0.0 | -0.885 | 0.123746 | 364.666687 | 0.001245 |
| 5 | 0.0 | -0.64 | 0.077078 | 337.333344 | 0.000415 |
| 6 | 0.0 | -0.269667 | 0.20288 | 205.666672 | 0.000494 |
| 7 | 0.0 | -0.331667 | 0.1785 | 202.0 | 0.00069 |
| 8 | 0.0 | -0.268667 | 0.162802 | 227.0 | 0.000959 |
| 9 | 0.0 | -0.064667 | 0.102163 | 158.333344 | 0.000341 |
| 10 | 0.0 | -0.074 | 0.100195 | 158.333344 | 0.000669 |


xml_count_rewards:  [-0.353, -0.18400000000000005, -0.366]
soft_format_rewards:  [0.0, 0.0, 0.0]
strict_format_rewards:  [0.0, 0.0, 0.0]
int_rewards:  [0.0, 0.0, 0.0]
---------- 
 correctness_rewards:  [0.0, 0.0, 0.0]
xml_count_rewards:  [-0.05700000000000005, -0.545, -0.08899999999999997]
soft_format_rewards:  [0.0, 0.0, 0.0]
strict_format_rewards:  [0.0, 0.0, 0.0]
int_rewards:  [0.0, 0.0, 0.0]
---------- 
 correctness_rewards:  [0.0, 0.0, 0.0]
xml_count_rewards:  [-0.8940000000000001, -0.757, -1.004]
soft_format_rewards:  [0.0, 0.0, 0.0]
strict_format_rewards:  [0.0, 0.0, 0.0]
int_rewards:  [0.0, 0.0, 0.0]
---------- 
 correctness_rewards:  [0.0, 0.0, 0.0]
xml_count_rewards:  [-0.719, -0.636, -0.5650000000000001]
soft_format_rewards:  [0.0, 0.0, 0.0]
strict_format_rewards:  [0.0, 0.0, 0.0]
int_rewards:  [0.0, 0.0, 0.0]
---------- 
 correctness_rewards:  [0.0, 0.0, 0.0]
xml_count_rewards:  [-0.401, -0.03600000000000003, -0.372]
soft_format_rewards:  [0.0, 0.0, 0.0]
strict_format_rewar

TrainOutput(global_step=350, training_loss=0.002211461592364462, metrics={'train_runtime': 11050.8236, 'train_samples_per_second': 0.032, 'train_steps_per_second': 0.032, 'total_flos': 0.0, 'train_loss': 0.002211461592364462})
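The `reward` and `reward_std` columns logged for step 1 can be reproduced from the first block of printed rewards: sum the five reward functions per generation, then take the mean and sample standard deviation over the 3 generations. The sketch below also shows a group-relative advantage, assuming the common GRPO normalization of subtracting the group mean and dividing by the group standard deviation; the `EPS` value is a hypothetical stabilizer, not taken from this notebook.

```python
import statistics

# Per-generation rewards printed for the first step (3 generations).
xml_count = [-0.727, 0.125, 0.125]
soft_format = [0.0, 0.0, 0.0]
strict_format = [0.0, 0.0, 0.0]
int_reward = [0.0, 0.0, 0.0]
correctness = [0.0, 0.0, 0.0]

# Total reward of each generation is the sum over all reward functions.
totals = [
    sum(parts)
    for parts in zip(xml_count, soft_format, strict_format, int_reward, correctness)
]

mean = statistics.mean(totals)
std = statistics.stdev(totals)  # sample standard deviation (n - 1 in the denominator)

# Assumed group-relative advantage: centre on the group mean, scale by the group std.
EPS = 1e-4  # hypothetical small constant to avoid division by zero
advantages = [(r - mean) / (std + EPS) for r in totals]

print("mean reward:", round(mean, 6))
print("reward std: ", round(std, 6))
print("advantages: ", [round(a, 3) for a in advantages])
```

The mean and standard deviation agree with the logged values for step 1 (-0.159 and 0.491902). The generation with the long trailing text gets a negative advantage and the two shorter ones get a positive one, even though none of them earned a correctness reward.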

In [20]:
model.save_pretrained("grpo_saved_lora")

## Testing

### Base Model

In [1]:
from unsloth import FastLanguageModel

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!


In [2]:
model_ref, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-3B-Instruct",
    max_seq_length = 512,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = False, # Set to True to enable vLLM fast inference
)

model_ref = FastLanguageModel.for_inference(model_ref) 

text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

output = model_ref.generate(
    input_ids=tokenizer(text, return_tensors="pt").input_ids.cuda(),
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=1024,
)

output = tokenizer.decode(output[0], skip_special_tokens=True)

==((====))==  Unsloth 2025.2.4: Fast Qwen2 patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA GeForce RTX 3060 Laptop GPU. Max memory: 6.0 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


  self.register_buffer("cos_cached", emb.cos().to(dtype=dtype, device=device, non_blocking=True), persistent=False)
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [3]:
print(output)

system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.
user
How many r's are in strawberry?
assistant
There are no letters 'r' in the word "strawberry".


### Fine-tuned model for reasoning

In [None]:
from unsloth import FastLanguageModel

In [2]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "grpo_saved_lora",
    max_seq_length = 512,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = False, # Set to True to enable vLLM fast inference
    max_lora_rank = 64,
)

==((====))==  Unsloth 2025.2.4: Fast Qwen2 patching. Transformers: 4.48.3.
   \\   /|    GPU: NVIDIA GeForce RTX 3060 Laptop GPU. Max memory: 6.0 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.5.1+cu124. CUDA: 8.6. CUDA Toolkit: 12.4. Triton: 3.1.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.28.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


  self.register_buffer("cos_cached", emb.cos().to(dtype=dtype, device=device, non_blocking=True), persistent=False)
Unsloth 2025.2.4 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


In [3]:
model = FastLanguageModel.for_inference(model)

In [4]:
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

In [5]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "How many r's are in strawberry?"},
], tokenize = False, add_generation_prompt = True)

output = model.generate(
    input_ids=tokenizer(text, return_tensors="pt").input_ids.cuda(),
    temperature=0.8,
    top_p=0.95,
    max_new_tokens=1024,
    do_sample=True,
)

output = tokenizer.decode(output[0], skip_special_tokens=True)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [6]:
print(output)

system

Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

user
How many r's are in strawberry?
assistant
<reasoning>
To determine how many times the letter 'r' appears in the word "strawberry", we can go through each letter of the word and count the occurrences of 'r'. The word "strawberry" has the following letters: s, t, r, a, w, b, r, r, e, y. Counting the occurrences, we find that 'r' appears three times.
</reasoning>
<answer>
3
</answer>
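
To score an output like this automatically rather than by eye, the same `extract_answer` logic used for the reward functions can be applied to the decoded text. A minimal sketch follows, with the reasoning text shortened and the ground truth computed directly in Python. Because the decoded output also echoes the system prompt, taking the last `<answer>` block matters.

```python
def extract_answer(text: str) -> str:
    """Return the content of the last <answer>...</answer> block, stripped."""
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

# Decoded output in the same layout as above; the reasoning is shortened here.
generated = (
    "system\n\n"
    "Respond in the following format:\n"
    "<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n\n"
    "user\nHow many r's are in strawberry?\n"
    "assistant\n"
    "<reasoning>\n(reasoning shortened)\n</reasoning>\n"
    "<answer>\n3\n</answer>\n"
)

predicted = extract_answer(generated)
expected = str("strawberry".count("r"))  # ground truth computed directly

print("predicted:", predicted)
print("expected: ", expected)
print("correct:  ", predicted == expected)
```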
