# GRPO: Group Relative Policy Optimization

GRPO is a reinforcement learning technique for training large language models (LLMs) on complex tasks, such as mathematical reasoning or code generation.

Unlike older methods like PPO, GRPO enhances efficiency by:
*   **Eliminating a separate value model:** It does not require a dedicated neural network (a critic) to estimate the expected return of each response, which significantly reduces memory and computational costs.
*   **Using group-based advantages:** The model generates multiple candidate answers for a single prompt. The average score of these answers is used as a baseline to assess the relative quality of each response.
*   **Focusing on reasoning:** By learning from a group's collective outcomes, GRPO encourages the model to develop more robust reasoning abilities, rather than just optimizing for a single, potentially noisy, reward signal.


### Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install or uv pip install
    !pip install unsloth vllm
else:
    pass # For Colab / Kaggle, we need extra instructions hidden below \/

In [None]:
#@title Colab Extra Install { display-mode: "form" }
#%%capture
import os
!pip install --upgrade -qqq uv
if "COLAB_" not in "".join(os.environ.keys()):
    # If you're not in Colab, just use pip install!
    !pip install unsloth vllm
else:
    try: import numpy; get_numpy = f"numpy=={numpy.__version__}"
    except: get_numpy = "numpy"
    try: import subprocess; is_t4 = "Tesla T4" in str(subprocess.check_output(["nvidia-smi"]))
    except: is_t4 = False
    get_vllm, get_triton = ("vllm==0.10.1", "triton==3.2.0") if is_t4 else ("vllm", "triton")
    !uv pip install -qqq --upgrade \
        unsloth {get_vllm} {get_numpy} torchvision bitsandbytes xformers
    !uv pip install -qqq {get_triton}
!uv pip install transformers==4.55.4

### Unsloth

Goal: To convert `Qwen3-4B-Base` into a reasoning model via GRPO by using OpenR1's Math dataset.

We first pre fine-tune the model on the output format so that GRPO does not have to spend steps learning it - this speeds GRPO up.

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 512  # 2048 Can increase for longer reasoning traces
lora_rank = 4  # Larger rank = smarter, but slower

# Load model with optimized settings for RTX 2070 (8GB VRAM)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = True,  # Changed to True for memory efficiency on RTX 2070
    fast_inference = False,  # Disable vLLM due to triton compatibility issues
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5,  # Reduced for RTX 2070 - adjust based on your needs
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,  # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,  # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    use_gradient_checkpointing = "unsloth",  # True or "unsloth" for very long context
    random_state = 3407,
)

### GRPO chat template
Since we're using a base model, we should set a chat template. You can make your own chat template as well!
1. DeepSeek uses `<think>` and `</think>`, but this is **not** necessary - you can customize it however you like!
2. A `system_prompt` is recommended to at least guide the model's responses.

In [None]:
reasoning_start = "<start_working_out>" # Acts as <think>
reasoning_end   = "<end_working_out>"   # Acts as </think>
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""
system_prompt

In [None]:
chat_template = \
    "{% if messages[0]['role'] == 'system' %}"\
        "{{ messages[0]['content'] + eos_token }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ '{system_prompt}' + eos_token }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ message['content'] }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ message['content'] + eos_token }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}{{ '{reasoning_start}' }}"\
    "{% endif %}"

# Replace with our specific template:
chat_template = chat_template\
    .replace("'{system_prompt}'",   f"'{system_prompt}'")\
    .replace("'{reasoning_start}'", f"'{reasoning_start}'")
tokenizer.chat_template = chat_template
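
To make the template's branching easier to follow, here is a plain-Python mirror of the same logic. It is only a sketch: `render_chat` is a hypothetical helper, `<EOS>` is a placeholder rather than the tokenizer's real end-of-sequence token, and the system prompt is shortened.

```python
def render_chat(messages, system_prompt, eos_token, reasoning_start,
                add_generation_prompt=False):
    """Plain-Python mirror of the Jinja chat template's branching."""
    if messages and messages[0]["role"] == "system":
        # An explicit system message wins over the default system prompt.
        out = messages[0]["content"] + eos_token
        rest = messages[1:]
    else:
        out = system_prompt + eos_token
        rest = messages
    for m in rest:
        if m["role"] == "user":
            out += m["content"]
        elif m["role"] == "assistant":
            out += m["content"] + eos_token
    if add_generation_prompt:
        # Prepend the reasoning tag so generation starts inside the working-out.
        out += reasoning_start
    return out

prompt = render_chat(
    [{"role": "user", "content": "What is 2+2?"}],
    system_prompt="You are given a problem.",  # shortened, hypothetical
    eos_token="<EOS>",                         # placeholder, not the real token
    reasoning_start="<start_working_out>",
    add_generation_prompt=True,
)
print(prompt)
```

Because the reasoning tag is already part of the prompt, the model's completion never needs to emit `<start_working_out>` itself.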

In [None]:
from datasets import load_dataset
import pandas as pd
import numpy as np

dataset = load_dataset("unsloth/OpenMathReasoning-mini", split = "cot")
dataset = dataset.to_pandas()[
    ["expected_answer", "problem", "generated_solution"]
]

# Try converting to number - if not, replace with NaN
is_number = pd.to_numeric(pd.Series(dataset["expected_answer"]), errors = "coerce").notnull()
# Select only numbers
dataset = dataset.iloc[np.where(is_number)[0]]

print(f"Dataset size after numeric filtering: {len(dataset)}")

def format_dataset(x):
    expected_answer = x["expected_answer"]
    problem = x["problem"]

    # Remove generated <think> and </think>
    thoughts = x["generated_solution"]
    thoughts = thoughts.replace("<think>", "").replace("</think>", "")

    # Strip newlines on left and right
    thoughts = thoughts.strip()
    # Add our custom formatting
    final_prompt = \
        reasoning_start + thoughts + reasoning_end + \
        solution_start + expected_answer + solution_end
    return [
        {"role" : "system",    "content" : system_prompt},
        {"role" : "user",      "content" : problem},
        {"role" : "assistant", "content" : final_prompt},
    ]

dataset["Messages"] = dataset.apply(format_dataset, axis = 1)

# Check a sample
print("Sample formatted message:")
print(tokenizer.apply_chat_template(dataset["Messages"].iloc[0], tokenize = False))

# Calculate token lengths
dataset["N"] = dataset["Messages"].apply(lambda x: len(tokenizer.apply_chat_template(x, tokenize=True)))

print(f"Token length stats:")
print(f"Min: {dataset['N'].min()}")
print(f"Max: {dataset['N'].max()}")
print(f"Mean: {dataset['N'].mean():.1f}")
print(f"Median: {dataset['N'].median():.1f}")

# Use a more reasonable filter - keep sequences up to max_seq_length
# Instead of max_seq_length/2, use the full length
dataset_filtered = dataset.loc[dataset["N"] <= max_seq_length].copy()
print(f"Dataset size after length filtering (<= {max_seq_length}): {len(dataset_filtered)}")

# If still too restrictive, increase the limit
if len(dataset_filtered) < 100:  # If we have less than 100 samples
    print(f"Too few samples with max_seq_length={max_seq_length}, trying with 256...")
    dataset_filtered = dataset.loc[dataset["N"] <= 256].copy()
    print(f"Dataset size with 256 token limit: {len(dataset_filtered)}")
    
    if len(dataset_filtered) < 100:
        print("Using top 500 shortest sequences...")
        dataset_filtered = dataset.nsmallest(500, 'N').copy()
        print(f"Dataset size (top 500 shortest): {len(dataset_filtered)}")

# Check if we have data
if len(dataset_filtered) == 0:
    print("ERROR: No data after filtering! Using original dataset...")
    dataset_filtered = dataset.head(100).copy()  # Use first 100 samples
    
print(f"Final dataset shape: {dataset_filtered.shape}")

from datasets import Dataset

# Only proceed if we have data
if len(dataset_filtered) > 0:
    dataset_filtered["text"] = [
        tokenizer.apply_chat_template(msg, tokenize=False) 
        for msg in dataset_filtered["Messages"].values.tolist()
    ]
    
    final_dataset = Dataset.from_pandas(dataset_filtered)
    print(f"Final Hugging Face dataset: {final_dataset}")
    print(f"Sample text length: {len(final_dataset[0]['text'])}")
else:
    print("ERROR: No data to process!")
    final_dataset = None

We create a simple chat template below. Notice `add_generation_prompt` includes prepending `<start_working_out>` to guide the model to start its reasoning process.

Let's see how our chat template behaves on an example:

### Pre fine-tuning for formatting
We now use a subset of NVIDIA's [Open Math Reasoning dataset](https://huggingface.co/datasets/nvidia/OpenMathReasoning) which was filtered to only include high quality DeepSeek R1 traces.

We'll filter it down to ~59 or so examples to first "prime" / pre fine-tune the model to understand our custom GRPO formatting.

We have to format the dataset to follow our GRPO style formatting:

Check to see if it worked:

Let's filter the pre fine-tuning dataset by token length since we don't want too long reasoning traces. The code in this notebook keeps only sequences that fit within `max_seq_length`.

Note this might take 2 minutes!

We then tokenize the messages and convert it to a Hugging Face compatible dataset format:

Let's now pre fine-tune the model so it follows our custom GRPO formatting!

In [None]:
# First, let's reload the model with the correct max_seq_length
from unsloth import FastLanguageModel
import torch

# Use a more conservative sequence length for RTX 2070
max_seq_length = 512
lora_rank = 4

print("Reloading model with correct max_seq_length...")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Qwen3-4B-Base",
    max_seq_length = max_seq_length,
    load_in_4bit = True,
    fast_inference = False,
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6,
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
)

# Set up the chat template PROPERLY
reasoning_start = "<start_working_out>"
reasoning_end   = "<end_working_out>"
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

system_prompt = f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""

# CRITICAL: Set up the chat template correctly
chat_template = (
    "{% if messages[0]['role'] == 'system' %}"
        "{{ messages[0]['content'] + eos_token }}"
        "{% set loop_messages = messages[1:] %}"
    "{% else %}"
        "{{ '" + system_prompt + "' + eos_token }}"
        "{% set loop_messages = messages %}"
    "{% endif %}"
    "{% for message in loop_messages %}"
        "{% if message['role'] == 'user' %}"
            "{{ message['content'] }}"
        "{% elif message['role'] == 'assistant' %}"
            "{{ message['content'] + eos_token }}"
        "{% endif %}"
    "{% endfor %}"
    "{% if add_generation_prompt %}{{ '" + reasoning_start + "' }}"
    "{% endif %}"
)

tokenizer.chat_template = chat_template

# Test the chat template first
test_messages = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": f"{reasoning_start}It's 4{reasoning_end}{solution_start}4{solution_end}"}
]

try:
    test_output = tokenizer.apply_chat_template(test_messages, tokenize=False)
    print("Chat template test successful!")
    print(f"Sample output length: {len(test_output)}")
except Exception as e:
    print(f"Chat template error: {e}")
    # Fall back to simple concatenation if chat template fails
    tokenizer.chat_template = None

# Load and process dataset
from datasets import load_dataset
import pandas as pd
import numpy as np

dataset = load_dataset("unsloth/OpenMathReasoning-mini", split="cot")
dataset = dataset.to_pandas()[["expected_answer", "problem", "generated_solution"]]

# Filter for numeric answers only
is_number = pd.to_numeric(pd.Series(dataset["expected_answer"]), errors="coerce").notna()
dataset = dataset.iloc[np.where(is_number)[0]]
print(f"Dataset size after numeric filtering: {len(dataset)}")

def format_dataset_simple(x):
    """Simplified formatting that doesn't rely on chat template"""
    expected_answer = str(x["expected_answer"])
    problem = x["problem"]
    
    thoughts = x["generated_solution"]
    thoughts = thoughts.replace("<think>", "").replace("</think>", "")
    thoughts = thoughts.strip()
    
    # Create a simple text format
    text = f"{system_prompt}\n\n{problem}\n\n{reasoning_start}\n{thoughts}\n{reasoning_end}\n\n{solution_start}\n{expected_answer}\n{solution_end}"
    
    return text

# Apply simple formatting
print("Processing dataset...")
dataset["text"] = dataset.apply(format_dataset_simple, axis=1)

# Calculate token lengths safely
def safe_tokenize_text(text):
    try:
        if pd.isna(text) or text is None:
            return float('inf')
        tokens = tokenizer.encode(str(text), add_special_tokens=True)
        return len(tokens)
    except Exception as e:
        print(f"Tokenization error: {e}")
        return float('inf')

print("Calculating token lengths...")
dataset["N"] = dataset["text"].apply(safe_tokenize_text)

# Remove failed tokenizations
dataset = dataset[dataset["N"] != float('inf')].copy()
dataset = dataset[dataset["N"] > 0].copy()

if len(dataset) == 0:
    print("ERROR: All tokenizations failed!")
    print("Falling back to basic string processing...")
    
    # Emergency fallback - use string length as approximation
    dataset = load_dataset("unsloth/OpenMathReasoning-mini", split="cot")
    dataset = dataset.to_pandas()[["expected_answer", "problem", "generated_solution"]]
    
    is_number = pd.to_numeric(pd.Series(dataset["expected_answer"]), errors="coerce").notna()
    dataset = dataset.iloc[np.where(is_number)[0]]
    
    dataset["text"] = dataset.apply(format_dataset_simple, axis=1)
    dataset["N"] = dataset["text"].str.len() // 4  # Rough token approximation
else:
    print(f"Token length stats:")
    print(f"Min: {dataset['N'].min()}")
    print(f"Max: {dataset['N'].max()}")
    print(f"Mean: {dataset['N'].mean():.1f}")
    print(f"Median: {dataset['N'].median():.1f}")

# Filter for appropriate sequence lengths
max_training_length = max_seq_length - 50
dataset_filtered = dataset[dataset["N"] <= max_training_length].copy()
print(f"Dataset size after length filtering (<= {max_training_length}): {len(dataset_filtered)}")

# Ensure we have enough data
if len(dataset_filtered) < 100:
    print("Using top 500 shortest sequences...")
    dataset_filtered = dataset.nsmallest(500, 'N').copy()
    print(f"Dataset size (top 500 shortest): {len(dataset_filtered)}")

if len(dataset_filtered) == 0:
    print("CRITICAL: No data available! Using first 100 raw samples...")
    dataset_filtered = dataset.head(100).copy()

print(f"Final dataset size: {len(dataset_filtered)}")

# Convert to Hugging Face Dataset
from datasets import Dataset

# Clean up the dataframe before conversion
clean_data = {"text": dataset_filtered["text"].tolist()}
train_dataset = Dataset.from_dict(clean_data)

print(f"Training dataset created: {train_dataset}")

if len(train_dataset) > 0:
    print(f"Sample text preview:")
    print(train_dataset[0]['text'][:200] + "...")
    
    # Create trainer
    from trl import SFTTrainer, SFTConfig
    
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        args=SFTConfig(
            dataset_text_field="text",
            per_device_train_batch_size=1,
            gradient_accumulation_steps=4,
            warmup_steps=5,
            num_train_epochs=1,
            learning_rate=2e-4,
            logging_steps=10,
            optim="adamw_8bit",
            weight_decay=0.01,
            lr_scheduler_type="linear",
            seed=3407,
            report_to="none",
            output_dir="./results",
            save_strategy="steps",
            save_steps=100,
            max_seq_length=max_seq_length,
            packing=False,
            dataloader_pin_memory=False,
            remove_unused_columns=True,
        ),
    )
    
    print("Trainer created successfully!")
    print("Starting training...")
    
    # Start training
    trainer.train()
    
    # Save the model
    print("Saving model...")
    model.save_pretrained("qwen3-math-reasoning")
    tokenizer.save_pretrained("qwen3-math-reasoning")
    
    print("Training completed and model saved!")
else:
    print("ERROR: No training data available!")

In [None]:
# trainer.train()

Let's check if the model has learnt to follow the custom format:

In [None]:
# First, let's recreate the test data properly
from datasets import load_dataset
import pandas as pd
import numpy as np

# Load the original dataset to get test samples
original_dataset = load_dataset("unsloth/OpenMathReasoning-mini", split="cot")
original_df = original_dataset.to_pandas()[["expected_answer", "problem", "generated_solution"]]

# Filter for numeric answers
is_number = pd.to_numeric(pd.Series(original_df["expected_answer"]), errors="coerce").notna()
original_df = original_df.iloc[np.where(is_number)[0]]

print(f"Original dataset size: {len(original_df)}")

# Define the same formatting variables used during training
reasoning_start = "<start_working_out>"
reasoning_end   = "<end_working_out>"
solution_start  = "<SOLUTION>"
solution_end    = "</SOLUTION>"

system_prompt = f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""

# Create a test sample from the original dataset
def create_test_messages(row):
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": row["problem"]}
    ]

# Get a test sample
test_sample = original_df.iloc[0]
test_messages = create_test_messages(test_sample)

print("Test problem:")
print(test_sample["problem"])
print(f"\nExpected answer: {test_sample['expected_answer']}")

# Method 1: Use chat template if available
try:
    text = tokenizer.apply_chat_template(
        test_messages,
        tokenize=False,
        add_generation_prompt=True
    )
    print("\nUsing chat template...")
except Exception as e:
    print(f"Chat template failed: {e}")
    # Method 2: Manual text construction
    text = f"{system_prompt}\n\n{test_sample['problem']}\n\n{reasoning_start}"
    print("Using manual text construction...")

print(f"\nInput text length: {len(text)} characters")
print(f"Input text preview:\n{text}")

# Test the model
from transformers import TextStreamer

print("\n" + "="*50)
print("TESTING TRAINED MODEL")
print("="*50)

# Generate response
try:
    inputs = tokenizer(text, return_tensors="pt").to("cuda")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            temperature=0.1,  # Slightly increase for more varied responses
            max_new_tokens=512,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            streamer=TextStreamer(tokenizer, skip_prompt=True),  # Skip prompt for cleaner output
        )
        
except Exception as e:
    print(f"Generation error: {e}")
    
    # Fallback: try with simpler parameters
    try:
        print("Trying with simpler generation parameters...")
        inputs = tokenizer.encode(text, return_tensors="pt").to("cuda")
        
        with torch.no_grad():
            outputs = model.generate(
                inputs,
                max_new_tokens=256,
                temperature=0,
                pad_token_id=tokenizer.eos_token_id,
                do_sample=False,
            )
            
        # Decode and print the result
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        # Extract only the new generated part
        new_text = generated_text[len(text):].strip()
        print(f"\nGenerated response:\n{new_text}")
        
    except Exception as e2:
        print(f"Fallback generation also failed: {e2}")

print("\n" + "="*50)

# Test with a few more samples
print("Testing with additional samples...")

for i in range(1, min(4, len(original_df))):
    test_sample = original_df.iloc[i]
    print(f"\n--- Test Sample {i+1} ---")
    print(f"Problem: {test_sample['problem'][:100]}...")
    print(f"Expected: {test_sample['expected_answer']}")
    
    # Create simple test input
    simple_input = f"{system_prompt}\n\n{test_sample['problem']}\n\n{reasoning_start}"
    
    try:
        inputs = tokenizer.encode(simple_input, return_tensors="pt").to("cuda")
        
        with torch.no_grad():
            outputs = model.generate(
                inputs,
                max_new_tokens=128,
                temperature=0,
                do_sample=False,
                pad_token_id=tokenizer.eos_token_id,
            )
            
        generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        new_text = generated_text[len(simple_input):].strip()
        
        # Extract just the working and solution if present
        if reasoning_end in new_text:
            if solution_start in new_text:
                working = new_text.split(reasoning_end)[0].strip()
                solution_part = new_text.split(solution_start)[-1]
                solution = solution_part.split(solution_end)[0].strip() if solution_end in solution_part else solution_part.strip()
                print(f"Working: {working[:50]}...")
                print(f"Solution: {solution}")
            else:
                print(f"Generated: {new_text[:100]}...")
        else:
            print(f"Generated: {new_text[:100]}...")
            
    except Exception as e:
        print(f"Generation failed: {e}")

print("\nModel testing completed!")

In [None]:
# text = tokenizer.apply_chat_template(
#     dataset[0]["Messages"][:2],
#     tokenize = False,
#     add_generation_prompt = True, # Must add for generation
# )

# from transformers import TextStreamer
# _ = model.generate(
#     **tokenizer(text, return_tensors = "pt").to("cuda"),
#     temperature = 0,
#     max_new_tokens = 1024,
#     streamer = TextStreamer(tokenizer, skip_prompt = False),
# )

Yes it did follow the formatting! Great! Let's free some memory before the GRPO step.

In [None]:
del dataset
torch.cuda.empty_cache()
import gc
gc.collect()

### Data Prep
<a name="Data"></a>

We're using Hugging Face's [Open R1 Math dataset](https://huggingface.co/datasets/open-r1/DAPO-Math-17k-Processed). You can also use OpenAI's well-known [GSM8K dataset](https://huggingface.co/datasets/openai/gsm8k).

In [None]:
from datasets import load_dataset
dataset = load_dataset("open-r1/DAPO-Math-17k-Processed", "en", split = "train")
dataset

Let's look at the first row:

In [None]:
dataset[0]["prompt"]

In [None]:
dataset[0]["solution"]

In GSM8K, we notice that all answers end with a `####` marker followed by the final answer, so we extract the text after it. For the Open R1 dataset, we can skip this step.

In [None]:
def extract_hash_answer(text):
    # if "####" not in text: return None
    # return text.split("####")[1].strip()
    return text
extract_hash_answer(dataset[0]["solution"])
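
For reference, here is a sketch of what the GSM8K variant would look like with the commented-out lines enabled. The function name `extract_gsm8k_answer` and the sample string are made up for illustration; the sample only imitates the GSM8K answer style.

```python
def extract_gsm8k_answer(text):
    """Return the text after '####', or None when the marker is missing."""
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Made-up solution string in the GSM8K style.
sample = "She sells 16 - 3 - 4 = 9 eggs.\n9 * 2 = 18\n#### 18"
print(extract_gsm8k_answer(sample))
print(extract_gsm8k_answer("no marker here"))
```

Returning `None` for rows without the marker makes it easy to filter them out before training.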

Let's map the dataset and look at the first row:

In [None]:
dataset = dataset.map(lambda x: {
    "prompt" : [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": x["prompt"]},
    ],
    "answer": extract_hash_answer(x["solution"]),
})
dataset[0]

We create a regex format to match the reasoning sections and answers:

In [None]:
import re

# Add optional EOS token matching
solution_end_regex = r"</SOLUTION>[\s]{0,}" + \
    "(?:" + re.escape(tokenizer.eos_token) + ")?"

match_format = re.compile(
    rf"{reasoning_end}.*?"\
    rf"{solution_start}(.+?){solution_end_regex}"\
    rf"[\s]{{0,}}$",
    flags = re.MULTILINE | re.DOTALL
)
match_format

We verify it works:

In [None]:
match_format.findall(
    "Let me think!<end_working_out>"\
    f"<SOLUTION>\n2\n</SOLUTION>",
)

In [None]:
match_format.findall(
    "<start_working_out>Let me think!<end_working_out>"\
    f"<SOLUTION>  2  </SOLUTION>\n\n",
)
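
The following self-contained sketch rebuilds the same pattern and shows which completions it accepts and rejects. The `<EOS>` string is a placeholder for the tokenizer's real end-of-sequence token, and the test strings are hypothetical.

```python
import re

reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
eos_token = "<EOS>"  # placeholder, not the tokenizer's real EOS token

solution_end_regex = r"</SOLUTION>[\s]{0,}" + "(?:" + re.escape(eos_token) + ")?"
match_format = re.compile(
    rf"{reasoning_end}.*?"
    rf"{solution_start}(.+?){solution_end_regex}"
    rf"[\s]{{0,}}$",
    flags=re.MULTILINE | re.DOTALL,
)

cases = {
    "well formed": "thinking<end_working_out><SOLUTION>42</SOLUTION>",
    "with EOS": "thinking<end_working_out><SOLUTION>42</SOLUTION><EOS>",
    "missing reasoning end": "thinking<SOLUTION>42</SOLUTION>",
    "trailing text": "thinking<end_working_out><SOLUTION>42</SOLUTION> and more",
}
results = {}
for name, text in cases.items():
    m = match_format.search(text)
    results[name] = m.group(1) if m else None
    print(f"{name:>22}: {results[name]}")
```

Note that with `re.MULTILINE` the `$` anchor also matches at the end of a line, so extra text placed on a new line after `</SOLUTION>` would still pass this check.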

We now want to create a reward function to match the format exactly - we reward it with 3 points if it succeeds:

In [None]:
def match_format_exactly(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Match if format is seen exactly!
        if match_format.search(response) is not None: score += 3.0
        scores.append(score)
    return scores

If it fails, we want to reward the model if it at least follows the format partially, by counting each symbol:

In [None]:
def match_format_approximately(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Count how many keywords are seen - we penalize if too many!
        # If we see 1, then plus some points!

        # No need to reward <start_working_out> since we always prepend it!
        # score += 0.5 if response.count(reasoning_start) == 1 else -1.0
        score += 0.5 if response.count(reasoning_end)   == 1 else -1.0
        score += 0.5 if response.count(solution_start)  == 1 else -1.0
        score += 0.5 if response.count(solution_end)    == 1 else -1.0
        scores.append(score)
    return scores
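
As a worked example of the partial scoring, the sketch below applies the same counting rule to three hypothetical completions. `partial_format_score` is an illustrative helper that scores a single string rather than a batch.

```python
TAGS = ["<end_working_out>", "<SOLUTION>", "</SOLUTION>"]

def partial_format_score(response):
    """+0.5 for each tag that appears exactly once, -1.0 otherwise."""
    return sum(0.5 if response.count(tag) == 1 else -1.0 for tag in TAGS)

all_tags = partial_format_score("work<end_working_out><SOLUTION>7</SOLUTION>")
one_tag = partial_format_score("work<end_working_out>7")
no_tags = partial_format_score("just an answer: 7")
print(all_tags, one_tag, no_tags)
```

The score ranges from -3.0 (no tag used correctly) to 1.5 (every tag used exactly once), so repeating a tag is penalized just like omitting it.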

Finally, we want to extract the generated answer, and reward or penalize it! We also reward it based on how close the answer is to the true one via ratios:

In [None]:
def check_answer(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_format.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    for guess, true_answer in zip(extracted_responses, answer):
        score = 0
        if guess is None:
            scores.append(-2.0)
            continue
        # Correct answer gets 5 points!
        if guess == true_answer:
            score += 5.0
        # Match if spaces are seen, but less reward
        elif guess.strip() == true_answer.strip():
            score += 3.5
        else:
            # We also reward it if the answer is close via ratios!
            # Ie if the answer is within some range, reward it!
            try:
                ratio = float(guess) / float(true_answer)
                if   ratio >= 0.9 and ratio <= 1.1: score += 2.0
                elif ratio >= 0.8 and ratio <= 1.2: score += 1.5
                else: score -= 2.5 # Penalize wrong answers
            except:
                score -= 4.5 # Penalize
        scores.append(score)
    return scores
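
The ratio bands are easier to see in isolation. The sketch below covers only the fallback branch for answers that are not an exact string match; `ratio_score` is a hypothetical helper, and catching `ValueError` and `ZeroDivisionError` explicitly is a choice made for this example.

```python
def ratio_score(guess, true_answer):
    """Partial credit for a numeric guess, mirroring the ratio bands above."""
    try:
        ratio = float(guess) / float(true_answer)
    except (ValueError, ZeroDivisionError):
        return -4.5  # not a number, or the true answer is zero
    if 0.9 <= ratio <= 1.1:
        return 2.0   # within 10% of the true answer
    if 0.8 <= ratio <= 1.2:
        return 1.5   # within 20% of the true answer
    return -2.5      # numeric but too far off

for guess in ["95", "85", "50", "ten"]:
    print(guess, ratio_score(guess, "100"))
```

One consequence of using a ratio is that a true answer of 0 can never earn partial credit, since the division fails and the guess falls into the penalty branch.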

Sometimes the answer is not a single number but a sentence, for example "The solution is $20", in which case we extract 20.

We also remove thousands separators, for example the comma in 123,456.

In [None]:
match_numbers = re.compile(
    solution_start + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags = re.MULTILINE | re.DOTALL
)
print(match_numbers.findall("<SOLUTION>  0.34  </SOLUTION>"))
print(match_numbers.findall("<SOLUTION>  123,456  </SOLUTION>"))
print(match_numbers.findall("<SOLUTION>  -0.234  </SOLUTION>"))
print(match_numbers.findall("<SOLUTION>17</SOLUTION>"))
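
Putting the extraction and the comma removal together, a minimal sketch of turning a completion into a float could look like this. `extract_number` is a hypothetical helper and the inputs are made up; it returns `None` whenever no usable number is found.

```python
import re

match_numbers = re.compile(
    r"<SOLUTION>" + r".*?[\s]{0,}([-]?[\d\.\,]{1,})",
    flags=re.MULTILINE | re.DOTALL,
)

def extract_number(response):
    """Return the first number after <SOLUTION> as a float, or None."""
    m = match_numbers.search(response)
    if m is None:
        return None
    try:
        # Drop thousands separators such as the comma in 123,456.
        return float(m.group(1).strip().replace(",", ""))
    except ValueError:
        return None  # e.g. the pattern captured a lone "." or ","

sentence = extract_number("<SOLUTION>The solution is $20</SOLUTION>")
with_comma = extract_number("<SOLUTION>  123,456  </SOLUTION>")
no_digits = extract_number("<SOLUTION>no digits here</SOLUTION>")
print(sentence, with_comma, no_digits)
```

The pattern accepts any run of digits, dots, and commas, so stray punctuation can be captured on its own; the `ValueError` guard handles that case.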

We now prepare our main function which will print out the generated responses and the true answer, along with another reward function which converts text to float via `float` and sees if it's the same.

In [None]:
# Module-level counters used to print a sample only every few steps
PRINTED_TIMES = 0
PRINT_EVERY_STEPS = 5

def check_numbers(prompts, completions, answer, **kwargs):
    question = prompts[0][-1]["content"]
    responses = [completion[0]["content"] for completion in completions]

    extracted_responses = [
        guess.group(1)
        if (guess := match_numbers.search(r)) is not None else None \
        for r in responses
    ]

    scores = []
    # Print only every few steps
    global PRINTED_TIMES
    global PRINT_EVERY_STEPS
    if PRINTED_TIMES % PRINT_EVERY_STEPS == 0:
        print(
            '*'*20 + f"Question:\n{question}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}"
        )
    PRINTED_TIMES += 1

    for guess, true_answer in zip(extracted_responses, answer):
        if guess is None:
            scores.append(-2.5)
            continue
        # Convert to numbers
        try:
            true_answer = float(true_answer.strip())
            # Remove commas like in 123,456
            guess       = float(guess.strip().replace(",", ""))
            scores.append(3.5 if guess == true_answer else -1.5)
        except:
            scores.append(0)
            continue
    return scores

Compute the 90th-percentile prompt length and use it as the cutoff, so that the prompts we keep are not accidentally truncated.

I.e. we'll remove the longest 10% of prompts.

In [None]:
tokenized = dataset.map(
    lambda x: {"tokens" : tokenizer.apply_chat_template(x["prompt"], add_generation_prompt = True, tokenize = True)},
    batched = True,
)
print(tokenizer.decode(tokenized[0]["tokens"]))
tokenized = tokenized.map(lambda x: {"L" : len(x["tokens"])})

import numpy as np
maximum_length = int(np.quantile(tokenized["L"], 0.9))
print("Max Length = ", maximum_length)

# Filter only samples smaller than 90% max length
dataset = dataset.select(np.where(np.array(tokenized["L"]) <= maximum_length)[0])
del tokenized
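
The percentile cutoff can be illustrated without any dependencies. The sketch below uses linear interpolation between order statistics, which is what `np.quantile` does by default; `quantile_linear` and the list of lengths are hypothetical.

```python
def quantile_linear(values, q):
    """q-th quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

# Hypothetical prompt lengths in tokens, with one very long outlier.
lengths = [40, 42, 45, 47, 50, 52, 55, 60, 64, 70, 400]
cutoff = int(quantile_linear(lengths, 0.9))
kept = [n for n in lengths if n <= cutoff]
print(cutoff, len(kept), len(lengths))
```

A single very long prompt would otherwise force a much larger `max_prompt_length`, leaving fewer tokens for the completion.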

<a name="Train"></a>
### Train the model

Now set up GRPO Trainer and all configurations!

In [None]:
max_prompt_length = maximum_length + 1 # + 1 just in case!
max_completion_length = max_seq_length - max_prompt_length

from vllm import SamplingParams
vllm_sampling_params = SamplingParams(
    min_p = 0.1,
    top_p = 1.0,
    top_k = -1,
    seed = 3407,
    stop = [tokenizer.eos_token],
    include_stop_str_in_output = True,
)

from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    vllm_sampling_params = vllm_sampling_params,
    temperature = 1.0,
    learning_rate = 5e-6,
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 4, # Decrease if out of memory
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 100,
    save_steps = 100,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",

    # For optional training + evaluation
    # fp16_full_eval = True,
    # per_device_eval_batch_size = 4,
    # eval_accumulation_steps = 1,
    # eval_strategy = "steps",
    # eval_steps = 1,
)
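
Since the completion budget is whatever remains of `max_seq_length` after the prompt, it is worth checking that arithmetic before training. The sketch below is a hypothetical guard, not part of the trainer; the prompt length of 200 tokens is an assumed value.

```python
def completion_budget(max_seq_length, prompt_quantile_length):
    """Split the context window between prompt and completion tokens."""
    max_prompt_length = prompt_quantile_length + 1
    max_completion_length = max_seq_length - max_prompt_length
    if max_completion_length <= 0:
        raise ValueError(
            f"No room to generate: prompts use {max_prompt_length} of "
            f"{max_seq_length} tokens"
        )
    return max_prompt_length, max_completion_length

# Assumed 90th-percentile prompt length of 200 tokens.
large_window = completion_budget(2048, 200)
small_window = completion_budget(512, 200)
print(large_window, small_window)
```

With `max_seq_length = 512` the room left for reasoning is small, so long working-out sections are likely to be cut off before the solution tags appear.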

And let's run the trainer! If you scroll up, you'll see a table of rewards. The goal is to see the `reward` column increase!

You might have to wait 150 to 200 steps for any action. You'll probably get 0 reward for the first 100 steps. Please be patient!

| Step | Training Loss | reward    | reward_std | completion_length | kl       |
|------|---------------|-----------|------------|-------------------|----------|
| 1    | 0.000000      | 0.125000  | 0.000000   | 200.000000        | 0.000000 |
| 2    | 0.000000      | 0.072375  | 0.248112   | 200.000000        | 0.000000 |
| 3    | 0.000000      | -0.079000 | 0.163776   | 182.500000        | 0.000005 |


In [None]:
# For optional training + evaluation
# new_dataset = dataset.train_test_split(test_size = 0.01)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        match_format_exactly,
        match_format_approximately,
        check_answer,
        check_numbers,
    ],
    args = training_args,
    train_dataset = dataset,

    # For optional training + evaluation
    # train_dataset = new_dataset["train"],
    # eval_dataset = new_dataset["test"],
)
trainer.train()
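The trainer above receives four reward functions. In TRL, each function returns one score per completion, and the scores are combined into a single reward per completion (a weighted sum, with every weight equal to 1 unless `reward_weights` is set in the config). The following is a simplified sketch of that combination using two toy reward functions; `toy_format_reward`, `toy_answer_reward`, and `combine_rewards` are hypothetical stand-ins, not the notebook's reward functions or TRL internals.

```python
def toy_format_reward(completions, **kwargs):
    """1.0 if the completion contains the closing solution tag, else 0.0."""
    return [1.0 if "</SOLUTION>" in c else 0.0 for c in completions]

def toy_answer_reward(completions, answer, **kwargs):
    """2.0 if the expected answer string appears in the completion, else 0.0."""
    return [2.0 if a in c else 0.0 for c, a in zip(completions, answer)]

def combine_rewards(reward_funcs, completions, weights=None, **kwargs):
    """Weighted sum of every reward function's score, per completion."""
    weights = weights or [1.0] * len(reward_funcs)
    per_func = [f(completions, **kwargs) for f in reward_funcs]
    return [
        sum(w * scores[i] for w, scores in zip(weights, per_func))
        for i in range(len(completions))
    ]

completions = [
    "<SOLUTION>42</SOLUTION>",
    "<SOLUTION>41</SOLUTION>",
    "the answer is 42",
]
rewards = combine_rewards(
    [toy_format_reward, toy_answer_reward],
    completions,
    answer=["42", "42", "42"],
)
print(rewards)  # [3.0, 1.0, 2.0]
```

Splitting the reward this way gives partial credit: a completion with the right format but the wrong answer still scores above one that gets neither right.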

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without any GRPO training:

In [None]:
import torch

# Input text
text = "What is the sqrt of 101?"

# Format input to match your GRPO training format
reasoning_start = "<start_working_out>"
system_prompt = f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION>"""

# Create the properly formatted input
formatted_input = f"{system_prompt}\n\n{text}\n\n{reasoning_start}"

# Tokenize the input
inputs = tokenizer(formatted_input, return_tensors="pt").to("cuda")

# Generate with equivalent parameters to vLLM SamplingParams
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        temperature=1.0,          # Same as vLLM sampling_params
        top_k=50,                 # Same as vLLM sampling_params  
        max_new_tokens=1024,      # Same as vLLM max_tokens
        do_sample=True,           # Enable sampling (required for temperature > 0)
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode the output and extract only the generated part
full_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
output = full_output[len(formatted_input):].strip()

print(f"Generated output: {output}")

# Alternative version with exact same variable structure as your vLLM code:
def generate_equivalent_to_vllm(input_text):
    """
    Function that mimics the exact structure of your vLLM code
    """
    # Format the input properly for your GRPO model
    formatted_text = f"{system_prompt}\n\n{input_text}\n\n{reasoning_start}"
    
    # Create equivalent of SamplingParams
    generation_params = {
        "temperature": 1.0,
        "top_k": 50,
        "max_new_tokens": 1024,
        "do_sample": True,
        "pad_token_id": tokenizer.eos_token_id,
        "eos_token_id": tokenizer.eos_token_id,
    }
    
    # Tokenize
    inputs = tokenizer(formatted_text, return_tensors="pt").to("cuda")
    
    # Generate (equivalent to model.fast_generate)
    with torch.no_grad():
        outputs = model.generate(**inputs, **generation_params)
    
    # Extract the generated text (equivalent to [0].outputs[0].text)
    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    generated_text = full_response[len(formatted_text):].strip()
    
    return generated_text

# Use the function exactly like your vLLM code:
text = "What is the sqrt of 101?"
output = generate_equivalent_to_vllm(text)
print(f"Output: {output}")

# Batch processing equivalent (if you need multiple questions)
def batch_generate_equivalent_to_vllm(texts):
    """
    Batch version equivalent to passing multiple texts to vLLM
    """
    formatted_inputs = []
    for text in texts:
        formatted_text = f"{system_prompt}\n\n{text}\n\n{reasoning_start}"
        formatted_inputs.append(formatted_text)
    
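    # Tokenize batch. Decoder-only models need left padding so that
    # generation continues directly from the end of each prompt.
    tokenizer.padding_side = "left"
    inputs = tokenizer(
        formatted_inputs, 
        return_tensors="pt", 
        padding=True, 
        truncation=True
    ).to("cuda")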
    
    # Generate batch
    generation_params = {
        "temperature": 1.0,
        "top_k": 50,
        "max_new_tokens": 1024,
        "do_sample": True,
        "pad_token_id": tokenizer.eos_token_id,
        "eos_token_id": tokenizer.eos_token_id,
    }
    
    with torch.no_grad():
        outputs = model.generate(**inputs, **generation_params)
    
    # Extract generated texts
    results = []
    for i, output in enumerate(outputs):
        full_response = tokenizer.decode(output, skip_special_tokens=True)
        original_length = len(formatted_inputs[i])
        generated_text = full_response[original_length:].strip()
        results.append(generated_text)
    
    return results

# Test batch processing
test_questions = [
    "What is the sqrt of 101?",
    "What is 15 * 23?", 
    "Solve for x: 2x + 5 = 15"
]

batch_outputs = batch_generate_equivalent_to_vllm(test_questions)
for q, output in zip(test_questions, batch_outputs):
    print(f"Q: {q}")
    print(f"A: {output}")
    print("-" * 40)
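Because the prompt ends with `<start_working_out>`, a well-formed completion contains the working, then `<end_working_out>`, then the answer inside `<SOLUTION></SOLUTION>`. A small helper like the one below can split such a completion into its parts. This is a sketch: `split_completion` and the sample string are hypothetical, and the regex assumes the tags appear exactly as written in the system prompt.

```python
import re

REASONING_END = "<end_working_out>"
SOLUTION_PATTERN = re.compile(
    re.escape(REASONING_END) + r"\s*<SOLUTION>(.*?)</SOLUTION>",
    flags=re.DOTALL,
)

def split_completion(completion):
    """Return (working, solution), or None if the format is not followed."""
    match = SOLUTION_PATTERN.search(completion)
    if match is None:
        return None
    working = completion[: match.start()].strip()
    solution = match.group(1).strip()
    return working, solution

sample = (
    "101 is between 100 and 121, so the root is just above 10."
    "<end_working_out><SOLUTION>10.05</SOLUTION>"
)
print(split_completion(sample))
print(split_completion("no tags here"))  # None
```

Returning `None` for malformed output makes it easy to count how often the untrained model ignores the format, which is a useful baseline before comparing against the GRPO-trained model.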

In [None]:
# vLLM Fast Generate Inference Code
# NOTE: This can fail if the installed Triton and vLLM versions are incompatible

import torch
from unsloth import FastLanguageModel
from vllm import SamplingParams

# Step 1: Reload your GRPO-trained model with vLLM enabled
print("Attempting to load GRPO-trained model with vLLM...")

try:
    # Load your fine-tuned model with fast_inference=True
    model_vllm, tokenizer_vllm = FastLanguageModel.from_pretrained(
        model_name="qwen3-math-reasoning",  # Your saved GRPO model path
        max_seq_length=512,
        load_in_4bit=True,
        fast_inference=True,  # Enable vLLM
        gpu_memory_utilization=0.7,
    )
    print("Model loaded successfully with vLLM enabled!")
    
    # Step 2: Set up the formatting (same as GRPO training)
    reasoning_start = "<start_working_out>"
    reasoning_end = "<end_working_out>"
    solution_start = "<SOLUTION>"
    solution_end = "</SOLUTION>"

    system_prompt = f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""

    # Step 3: Define vLLM sampling parameters
    sampling_params = SamplingParams(
        temperature=1.0,
        top_k=50,
        max_tokens=1024,
    )

    # Step 4: Your exact original code structure
    text = "What is the sqrt of 101?"
    
    # Format the input to match GRPO training
    formatted_text = f"{system_prompt}\n\n{text}\n\n{reasoning_start}"
    
    # Use vLLM fast_generate (your original code structure)
    output = model_vllm.fast_generate(
        [formatted_text],  # Note: using formatted_text instead of raw text
        sampling_params=sampling_params,
        lora_request=None,
    )[0].outputs[0].text
    
    print(f"vLLM Output: {output}")

    # Step 5: Function version for reusability
    def vllm_fast_inference(question):
        formatted_input = f"{system_prompt}\n\n{question}\n\n{reasoning_start}"
        
        result = model_vllm.fast_generate(
            [formatted_input],
            sampling_params=sampling_params,
            lora_request=None,
        )[0].outputs[0].text
        
        return result

    # Step 6: Test multiple questions
    test_questions = [
        "What is the sqrt of 101?",
        "What is 15 * 23?",
        "Solve for x: 2x + 5 = 15",
        "Find the area of a circle with radius 7"
    ]

    print("\n" + "="*50)
    print("Testing multiple questions:")
    
    for i, question in enumerate(test_questions, 1):
        print(f"\n--- Question {i} ---")
        print(f"Q: {question}")
        response = vllm_fast_inference(question)
        print(f"A: {response}")

    # Step 7: Batch processing with vLLM
    def vllm_batch_fast_generate(questions):
        formatted_inputs = []
        for q in questions:
            formatted_input = f"{system_prompt}\n\n{q}\n\n{reasoning_start}"
            formatted_inputs.append(formatted_input)
        
        # vLLM can handle batch processing efficiently
        outputs = model_vllm.fast_generate(
            formatted_inputs,
            sampling_params=sampling_params,
            lora_request=None,
        )
        
        results = []
        for output in outputs:
            results.append(output.outputs[0].text)
        
        return results

    # Step 8: Test batch processing
    batch_questions = [
        "What is 7 + 8?",
        "What is 9 * 6?",
        "What is 100 / 4?",
        "What is 2^5?"
    ]

    print("\n" + "="*50)
    print("Batch processing test:")
    
    batch_results = vllm_batch_fast_generate(batch_questions)
    
    for q, r in zip(batch_questions, batch_results):
        print(f"Q: {q}")
        print(f"A: {r}")
        print("-" * 30)

    # Step 9: Different sampling parameters
    print("\n" + "="*50)
    print("Testing different sampling parameters:")
    
    # More creative sampling
    creative_params = SamplingParams(
        temperature=1.2,
        top_k=100,
        top_p=0.9,
        max_tokens=1024,
    )
    
    # More conservative sampling  
    conservative_params = SamplingParams(
        temperature=0.3,
        top_k=20,
        max_tokens=1024,
    )
    
    test_q = "What is the derivative of x^2 + 3x + 2?"
    formatted_q = f"{system_prompt}\n\n{test_q}\n\n{reasoning_start}"
    
    print(f"Question: {test_q}")
    
    # Creative response
    creative_output = model_vllm.fast_generate(
        [formatted_q],
        sampling_params=creative_params,
        lora_request=None,
    )[0].outputs[0].text
    
    print(f"\nCreative (temp=1.2): {creative_output}")
    
    # Conservative response
    conservative_output = model_vllm.fast_generate(
        [formatted_q], 
        sampling_params=conservative_params,
        lora_request=None,
    )[0].outputs[0].text
    
    print(f"\nConservative (temp=0.3): {conservative_output}")

    print("\nvLLM fast_generate inference completed successfully!")

except Exception as e:
    print(f"vLLM fast_generate failed with error: {e}")
    print("\nOne known cause is a Triton/vLLM version mismatch, which shows up as:")
    print("AttributeError: module 'triton.language' has no attribute 'constexpr_function'")
    
    print("\nIf that is the error above, you would need to:")
    print("1. Create a new conda environment")
    print("2. Install compatible versions:")
    print("   pip install triton==2.1.0")
    print("   pip install vllm==0.6.0  # or compatible version")
    print("   pip install 'unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git'")
    
    print("\nAlternatively, use the standard transformers inference code from the previous cell.")
    print("It does not depend on vLLM and has similar performance for single queries.")
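Since the failure described above comes down to package versions, it can help to print what is installed before attempting `fast_inference=True`. The sketch below uses only the standard library; the helper name `installed_versions` and the package list are assumptions chosen for this notebook.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(["triton", "vllm", "unsloth", "transformers"])
for name, ver in versions.items():
    print(f"{name:>14}: {ver or 'not installed'}")
```

Including this output in a bug report makes version mismatches much easier to spot than the stack trace alone.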

And now with the LoRA we just trained with GRPO! We save the LoRA first:

In [None]:
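# Uncomment to save in Unsloth's LoRA format. The later
# model.load_lora("grpo_saved_lora") call expects this directory to exist.
# model.save_lora("grpo_saved_lora")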

In [None]:
# Standard PEFT saving (most reliable):
model.save_pretrained("grpo_saved_model")
tokenizer.save_pretrained("grpo_saved_model")

In [None]:
# Memory-efficient saving for RTX 2070 (8GB VRAM)

import torch
import gc
import os

# Method 1: Save only LoRA adapter weights (Recommended - minimal memory)
print("Saving LoRA adapter weights only...")

try:
    # Clear GPU cache first
    torch.cuda.empty_cache()
    gc.collect()
    
    # Save only the adapter weights (very small files)
    model.save_pretrained("grpo_lora_adapter")
    tokenizer.save_pretrained("grpo_lora_adapter") 
    
    print("LoRA adapter saved successfully!")
    
    # Check file sizes
    def get_dir_size(directory):
        total_size = 0
        for dirpath, dirnames, filenames in os.walk(directory):
            for f in filenames:
                fp = os.path.join(dirpath, f)
                total_size += os.path.getsize(fp)
        return total_size / (1024 * 1024)  # MB
    
    size = get_dir_size("grpo_lora_adapter")
    print(f"Total size: {size:.2f} MB")
    
except Exception as e:
    print(f"LoRA saving failed: {e}")

# Method 2: CPU-based merged saving (slower but works with limited VRAM)
print("\nAttempting CPU-based merged model saving...")

try:
    # Move model to CPU to free GPU memory
    print("Moving model to CPU...")
    model_cpu = model.cpu()
    torch.cuda.empty_cache()
    gc.collect()
    
    # Save the merged model using CPU
    model_cpu.save_pretrained_merged(
        "grpo_merged_cpu", 
        tokenizer, 
        maximum_memory_usage=0.8,  # Use 80% of available memory
        temporary_location="./temp_save"  # Specify temp location
    )
    
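    print("CPU-based merged model saved!")
    
except Exception as e:
    print(f"CPU-based saving failed: {e}")

finally:
    # model.cpu() moves the model in place, so move it back to the GPU
    # for continued use whether or not saving succeeded
    try:
        model.to("cuda")
    except Exception as move_error:
        print(f"Could not move model back to GPU: {move_error}")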

# Method 3: Standard PEFT saving (most reliable)
print("\nUsing standard PEFT saving...")

try:
    torch.cuda.empty_cache()
    gc.collect()
    
    # This saves configuration + adapter weights only
    save_dir = "grpo_peft_standard" 
    os.makedirs(save_dir, exist_ok=True)
    
    # Save model components
    model.save_pretrained(save_dir)
    tokenizer.save_pretrained(save_dir)
    
    # Save additional info for loading
    import json
    config_info = {
        "base_model": "unsloth/Qwen3-4B-Base",
        "max_seq_length": 512,
        "lora_rank": 4,  # Adjust to your training settings
        "training_type": "GRPO"
    }
    
    with open(os.path.join(save_dir, "training_config.json"), "w") as f:
        json.dump(config_info, f, indent=2)
    
    print("Standard PEFT model saved successfully!")
    
    size = get_dir_size(save_dir)
    print(f"Total size: {size:.2f} MB")
    
except Exception as e:
    print(f"Standard saving failed: {e}")

# Method 4: Quantized GGUF with memory management
print("\nTrying quantized GGUF with memory management...")

try:
    # Clear memory aggressively
    torch.cuda.empty_cache()
    gc.collect()
    
    # Use lower memory quantization
    model.save_pretrained_gguf(
        "grpo_gguf_q4",
        tokenizer,
        quantization_method="q4_k_s",  # Smaller quantization
        maximum_memory_usage=0.6,     # Use only 60% of memory
        temporary_location="./temp_gguf"
    )
    
    print("Quantized GGUF saved!")
    
except Exception as e:
    print(f"Quantized GGUF saving failed: {e}")
    print("This is expected due to memory constraints")

# Loading instructions for each saved format
print("\n" + "="*60)
print("LOADING INSTRUCTIONS")
print("="*60)

print("""
# Loading LoRA adapter (Method 1 - Recommended):
from unsloth import FastLanguageModel
from peft import PeftModel

# Load base model
base_model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-4B-Base",
    max_seq_length=512,
    load_in_4bit=True,
    fast_inference=False,
)

# Load LoRA adapter
model = PeftModel.from_pretrained(base_model, "grpo_lora_adapter")

# Loading merged model (Method 2):
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="grpo_merged_cpu",
    max_seq_length=512,
    load_in_4bit=True,
)

# Loading standard PEFT (Method 3):  
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="grpo_peft_standard",
    max_seq_length=512,
    load_in_4bit=True,
)
""")

# Test that current model still works
print("\n" + "="*60)  
print("TESTING CURRENT MODEL")
print("="*60)

test_question = "What is 3 * 7?"
reasoning_start = "<start_working_out>"
system_prompt = f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and <end_working_out>.
Then, provide your solution between <SOLUTION></SOLUTION>"""

formatted_input = f"{system_prompt}\n\n{test_question}\n\n{reasoning_start}"

try:
    inputs = tokenizer(formatted_input, return_tensors="pt").to("cuda")
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=100,
            temperature=0.5,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    generated_part = response[len(formatted_input):].strip()
    
    print(f"Test: {test_question}")
    print(f"Response: {generated_part}")
    print("Model is working correctly!")
    
except Exception as e:
    print(f"Model test failed: {e}")

print("\nSaving process completed!")
print("\nRecommendation: Use Method 1 (LoRA adapter only) for your RTX 2070.")
print("It's the most memory-efficient and reliable option.")
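The adapter-only save is small because LoRA stores two low-rank matrices per adapted layer instead of the full weight. For a linear layer with `d_in` inputs, `d_out` outputs, and rank `r`, that is `r * (d_in + d_out)` parameters. The sketch below does this arithmetic; the layer shapes, block count, rank, and 16-bit storage are placeholder assumptions for illustration, not the actual Qwen3-4B configuration.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters in one LoRA pair: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)

# Hypothetical adapted projections per transformer block: (d_in, d_out)
layer_shapes = [(2048, 2048), (2048, 512), (2048, 512), (2048, 2048)]
num_blocks = 24       # hypothetical
rank = 4              # hypothetical; use the rank you trained with
bytes_per_param = 2   # assuming 16-bit storage

params_per_block = sum(lora_param_count(i, o, rank) for i, o in layer_shapes)
total_params = params_per_block * num_blocks
size_mb = total_params * bytes_per_param / (1024 * 1024)
print(f"{total_params:,} LoRA parameters, about {size_mb:.2f} MB")
```

Plugging in your own model's shapes and rank gives a quick sanity check on the directory size reported by `get_dir_size`.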

Verify LoRA is actually trained!
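One direct check that does not depend on generation quality: PEFT initializes each LoRA `B` matrix to zeros by default, so an adapter whose `lora_B` tensors are still all zero has not learned anything. The sketch below applies that check to a plain dictionary of flattened weights. `untrained_lora_b_keys` and the toy values are hypothetical; with a real adapter you would first read the tensors from the saved adapter file and flatten them to Python floats.

```python
def untrained_lora_b_keys(weights):
    """Return the names of lora_B tensors whose values are all exactly zero."""
    return [
        name
        for name, values in weights.items()
        if "lora_B" in name and all(v == 0.0 for v in values)
    ]

# Toy stand-in for an adapter state dict (hypothetical names and values)
toy_adapter = {
    "layers.0.q_proj.lora_A.weight": [0.01, -0.02, 0.03],
    "layers.0.q_proj.lora_B.weight": [0.0, 0.004, -0.001],
    "layers.1.q_proj.lora_A.weight": [0.02, 0.01, -0.01],
    "layers.1.q_proj.lora_B.weight": [0.0, 0.0, 0.0],
}
print(untrained_lora_b_keys(toy_adapter))  # ['layers.1.q_proj.lora_B.weight']
```

An empty list means every `lora_B` tensor moved away from its zero initialization, which is what a trained adapter should show.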

In [None]:
# Re-test the current model and verify that saving worked correctly

import torch
import gc

# Step 1: Clear CUDA cache and fix device issues
print("Clearing CUDA cache and fixing device issues...")
torch.cuda.empty_cache()
gc.collect()

# Move model back to CUDA if it's on CPU
if next(model.parameters()).device.type == 'cpu':
    print("Moving model back to CUDA...")
    model = model.to("cuda")

# Step 2: Test the current model with proper error handling
print("Testing current model...")

test_question = "What is 3 * 7?"
reasoning_start = "<start_working_out>"
reasoning_end = "<end_working_out>"
solution_start = "<SOLUTION>"
solution_end = "</SOLUTION>"

system_prompt = f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your solution between {solution_start}{solution_end}"""

formatted_input = f"{system_prompt}\n\n{test_question}\n\n{reasoning_start}"

try:
    # Ensure tokenizer and inputs are on correct device
    inputs = tokenizer(formatted_input, return_tensors="pt")
    
    # Move inputs to CUDA explicitly
    inputs = {k: v.to("cuda") if torch.is_tensor(v) else v for k, v in inputs.items()}
    
    print("Generating response...")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=150,
            temperature=0.7,
            do_sample=True,
            top_k=50,
            pad_token_id=tokenizer.eos_token_id,
            eos_token_id=tokenizer.eos_token_id,
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    generated_part = response[len(formatted_input):].strip()
    
    print(f"✓ Test Question: {test_question}")
    print(f"✓ Generated Response: {generated_part}")
    print("✓ Current model is working correctly!")
    
    # Parse the response to check format
    if reasoning_end in generated_part and solution_start in generated_part:
        working = generated_part.split(reasoning_end)[0].strip()
        # str.split returns the whole string when the separator is absent,
        # so this also covers a missing closing tag
        solution = generated_part.split(solution_start)[1].split(solution_end)[0].strip()
        print(f"✓ Working: {working}")
        print(f"✓ Solution: {solution}")
        print("✓ Model is following GRPO format correctly!")
    else:
        print("⚠ Model output doesn't follow expected format - may need more training")
    
except Exception as e:
    print(f"✗ Current model test failed: {e}")
    print("This might be due to device mismatch issues")

# Step 3: Test loading the saved LoRA adapter (most important test)
print("\n" + "="*60)
print("TESTING SAVED LORA ADAPTER LOADING")
print("="*60)

try:
    # Clear memory first
    torch.cuda.empty_cache()
    gc.collect()
    
    print("Loading base model...")
    from unsloth import FastLanguageModel
    from peft import PeftModel
    
    # Load base model
    base_model, base_tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/Qwen3-4B-Base",
        max_seq_length=512,
        load_in_4bit=True,
        fast_inference=False,
    )
    
    print("Loading LoRA adapter...")
    # Load your trained adapter
    loaded_model = PeftModel.from_pretrained(base_model, "grpo_lora_adapter")
    
    print("Testing loaded model...")
    # Test the loaded model
    inputs = base_tokenizer(formatted_input, return_tensors="pt").to("cuda")
    
    with torch.no_grad():
        outputs = loaded_model.generate(
            **inputs,
            max_new_tokens=150,
            temperature=0.7,
            do_sample=True,
            top_k=50,
            pad_token_id=base_tokenizer.eos_token_id,
        )
    
    response = base_tokenizer.decode(outputs[0], skip_special_tokens=True)
    generated_part = response[len(formatted_input):].strip()
    
    print(f"✓ Loaded model response: {generated_part}")
    print("✓ LoRA adapter loading and inference works!")
    
except Exception as e:
    print(f"✗ LoRA adapter loading failed: {e}")

# Step 4: Test loading standard PEFT saved model
print("\n" + "="*60)
print("TESTING STANDARD PEFT LOADING")
print("="*60)

try:
    torch.cuda.empty_cache()
    gc.collect()
    
    # Load the standard PEFT saved model
    print("Loading standard PEFT model...")
    peft_model, peft_tokenizer = FastLanguageModel.from_pretrained(
        model_name="grpo_saved_model",  # Your standard PEFT save
        max_seq_length=512,
        load_in_4bit=True,
        fast_inference=False,
    )
    
    print("Testing PEFT model...")
    inputs = peft_tokenizer(formatted_input, return_tensors="pt").to("cuda")
    
    with torch.no_grad():
        outputs = peft_model.generate(
            **inputs,
            max_new_tokens=150,
            temperature=0.7,
            do_sample=True,
            pad_token_id=peft_tokenizer.eos_token_id,
        )
    
    response = peft_tokenizer.decode(outputs[0], skip_special_tokens=True)
    generated_part = response[len(formatted_input):].strip()
    
    print(f"✓ PEFT model response: {generated_part}")
    print("✓ Standard PEFT loading works!")
    
except Exception as e:
    print(f"✗ Standard PEFT loading failed: {e}")

# Step 5: Compare responses
print("\n" + "="*60)
print("VERIFICATION SUMMARY")
print("="*60)

# Test with a few more questions to verify training
test_questions = [
    "What is 5 + 3?",
    "What is 12 ÷ 4?",
    "What is 2 × 9?"
]

for i, q in enumerate(test_questions):
    print(f"\n--- Test {i+1}: {q} ---")
    formatted_q = f"{system_prompt}\n\n{q}\n\n{reasoning_start}"
    
    try:
        inputs = tokenizer(formatted_q, return_tensors="pt").to("cuda")
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=100,
                temperature=0.5,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
            )
        
        response = tokenizer.decode(outputs[0], skip_special_tokens=True)
        generated_part = response[len(formatted_q):].strip()
        
        print(f"Response: {generated_part[:100]}...")
        
    except Exception as e:
        print(f"Failed: {e}")

print("\n" + "="*60)
print("FINAL RECOMMENDATIONS")
print("="*60)

print("""
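If the checks above passed:
✓ Your GRPO model training was successful!
✓ LoRA adapter saved correctly (46.71 MB in the reference run; your size may differ)
✓ Standard PEFT model saved correctly (46.71 MB in the reference run; your size may differ)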

For future use, load with:

# Method 1 - LoRA Adapter (Recommended):
from unsloth import FastLanguageModel
from peft import PeftModel

base_model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Base", max_seq_length=512, load_in_4bit=True
)
model = PeftModel.from_pretrained(base_model, "grpo_lora_adapter")

# Method 2 - Direct PEFT Loading:
model, tokenizer = FastLanguageModel.from_pretrained(
    "grpo_saved_model", max_seq_length=512, load_in_4bit=True
)

The CUDA memory errors for merged/GGUF saving are expected on RTX 2070.
Your training is preserved in the LoRA adapter files!
""")

Now we load the LoRA and test:

In [None]:
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "What is the sqrt of 101?"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True, # Must add for generation
    tokenize = False,
)
from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 2048,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Our reasoning model is much better! It's not always correct, since we only trained it for an hour or so. It will improve if we extend the sequence length and train for longer.

<a name="Save"></a>
### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can create a personal access token at https://huggingface.co/settings/tokens.
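A rough size estimate helps when choosing between these options on a small GPU. The sketch below multiplies a parameter count by the bits per weight. The 4-billion parameter figure is a round-number assumption taken from the model's name, and the estimate ignores quantization metadata, activations, and temporary buffers used while merging.

```python
def weight_size_gib(num_params, bits_per_weight):
    """Approximate storage for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / (1024 ** 3)

num_params = 4_000_000_000  # assumption: "4B" taken at face value

for label, bits in [("merged_16bit", 16), ("merged_4bit", 4)]:
    print(f"{label}: ~{weight_size_gib(num_params, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights alone come close to 8 GB, which is why merging on an 8 GB card is tight and the `lora` fallback is often the practical choice there.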

In [None]:
import torch
import gc
import os

# Clear CUDA cache before attempting memory-intensive operations
torch.cuda.empty_cache()
gc.collect()

print("Attempting VLLM-compatible saving methods...")
print("WARNING: These may fail due to RTX 2070 memory constraints")

# Helper defined before any save attempt, so Methods 2-4 can still
# report sizes in GB even if Method 1 fails
def get_dir_size(directory):
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            total_size += os.path.getsize(fp)
    return total_size / (1024**3)  # GB

# Method 1: Merge to 16bit (equivalent to your code)
print("\n1. Attempting merged_16bit saving...")
try:
    model.save_pretrained_merged(
        "grpo_model_16bit", 
        tokenizer, 
        save_method="merged_16bit",
        maximum_memory_usage=0.6  # Use only 60% of available memory
    )
    print("✓ 16-bit merged model saved successfully!")
    
    # Check file size
    size = get_dir_size("grpo_model_16bit")
    print(f"Model size: {size:.2f} GB")
    
except Exception as e:
    print(f"✗ 16-bit merged saving failed: {e}")
    print("Expected due to memory constraints on RTX 2070")

# Method 2: Merge to 4bit (equivalent to your code)
print("\n2. Attempting merged_4bit saving...")
try:
    torch.cuda.empty_cache()
    gc.collect()
    
    model.save_pretrained_merged(
        "grpo_model_4bit", 
        tokenizer, 
        save_method="merged_4bit",
        maximum_memory_usage=0.5  # Even more conservative
    )
    print("✓ 4-bit merged model saved successfully!")
    
    size = get_dir_size("grpo_model_4bit")
    print(f"Model size: {size:.2f} GB")
    
except Exception as e:
    print(f"✗ 4-bit merged saving failed: {e}")
    print("Expected due to memory constraints")

# Method 3: Just LoRA adapters (equivalent to your code - RECOMMENDED)
print("\n3. Saving LoRA adapters (RECOMMENDED for your setup)...")
try:
    torch.cuda.empty_cache()
    gc.collect()
    
    # This is equivalent to your LoRA saving code
    model.save_pretrained("grpo_lora_for_vllm")
    tokenizer.save_pretrained("grpo_lora_for_vllm")
    print("✓ LoRA adapters saved successfully!")
    
    size = get_dir_size("grpo_lora_for_vllm")
    print(f"LoRA size: {size*1024:.1f} MB")  # get_dir_size returns GB (1024**3 bytes), so multiply by 1024 for MB
    
except Exception as e:
    print(f"✗ LoRA saving failed: {e}")

# Method 4: CPU-based merged saving (memory-efficient alternative)
print("\n4. Attempting CPU-based merged saving...")
try:
    print("Moving model to CPU for memory-efficient merging...")
    
    # Move model to CPU to free GPU memory
    model_cpu = model.cpu()
    torch.cuda.empty_cache()
    gc.collect()
    
    # Save merged model using CPU
    model_cpu.save_pretrained_merged(
        "grpo_model_cpu_merged",
        tokenizer,
        save_method="merged_16bit",
        maximum_memory_usage=0.8,
        temporary_location="./temp_cpu_merge"
    )
    
    print("✓ CPU-based merged model saved!")
    
    # Move model back to GPU
    model.to("cuda")
    
    size = get_dir_size("grpo_model_cpu_merged")
    print(f"CPU merged model size: {size:.2f} GB")
    
except Exception as e:
    print(f"✗ CPU-based merged saving failed: {e}")
    # Move model back to GPU even if saving failed
    try:
        model.to("cuda")
    except Exception:
        pass

# Push to Hugging Face Hub (equivalent to your code)
print("\n" + "="*60)
print("HUGGING FACE HUB UPLOAD EXAMPLES")
print("="*60)

print("""
# To upload to Hugging Face (replace 'your_username' and 'your_token'):

# Upload 16-bit merged model (if saving succeeded):
if os.path.exists("grpo_model_16bit"):
    model.push_to_hub_merged(
        "your_username/grpo-qwen3-math-16bit", 
        tokenizer, 
        save_method="merged_16bit", 
        token="your_hf_token_here"
    )

# Upload 4-bit merged model (if saving succeeded):
if os.path.exists("grpo_model_4bit"):
    model.push_to_hub_merged(
        "your_username/grpo-qwen3-math-4bit", 
        tokenizer, 
        save_method="merged_4bit", 
        token="your_hf_token_here"
    )

# Upload LoRA adapters (RECOMMENDED):
if os.path.exists("grpo_lora_for_vllm"):
    model.push_to_hub("your_username/grpo-qwen3-math-lora", token="your_hf_token_here")
    tokenizer.push_to_hub("your_username/grpo-qwen3-math-lora", token="your_hf_token_here")
""")

# Loading instructions for VLLM
print("\n" + "="*60)
print("LOADING FOR VLLM USAGE")
print("="*60)

print("""
# For VLLM usage, you have these options:

# Option 1: Load LoRA adapter (works with your RTX 2070)
from unsloth import FastLanguageModel
from peft import PeftModel

base_model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Base",
    max_seq_length=512,
    load_in_4bit=True,
    fast_inference=True,  # Enable VLLM (may have Triton issues)
)
model = PeftModel.from_pretrained(base_model, "grpo_lora_for_vllm")

# Option 2: Load merged model (if saving succeeded)
model, tokenizer = FastLanguageModel.from_pretrained(
    "grpo_model_16bit",  # or "grpo_model_4bit"
    max_seq_length=512,
    load_in_4bit=False,  # Already quantized if using 4bit version
    fast_inference=True,
)

# Option 3: From Hugging Face Hub
model, tokenizer = FastLanguageModel.from_pretrained(
    "your_username/grpo-qwen3-math-16bit",
    max_seq_length=512,
    fast_inference=True,
)
""")

# Practical recommendation
print("\n" + "="*60)
print("PRACTICAL RECOMMENDATION FOR RTX 2070")
print("="*60)

print("""
Given your RTX 2070's memory constraints:

✓ RECOMMENDED: Use LoRA adapters
  - Small file size (~47 MB)
  - Reliable saving/loading
  - Compatible with VLLM when loaded properly
  - Easy to share and version control

⚠ AVOID: Merged models on your current hardware
  - Require 8+ GB VRAM for merging process
  - Will likely fail with CUDA memory errors
  - Larger file sizes (4+ GB)

For production VLLM usage, consider:
1. Using a system with more VRAM for merging
2. Using cloud services for merged model creation
3. Sticking with LoRA adapters for local development
""")

### GGUF / llama.cpp Conversion
We now support saving to `GGUF` / `llama.cpp` natively! We clone `llama.cpp` and save to `q8_0` by default. All other methods, such as `q4_k_m`, are also allowed. Use `save_pretrained_gguf` to save locally and `push_to_hub_gguf` to upload to Hugging Face.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
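To make the size difference between these formats concrete, the bits per weight can be derived from a block layout. The sketch below assumes the `q8_0` layout used by llama.cpp is a block of 32 weights stored as 32 int8 values plus one 16-bit scale. Treat that layout and the helper name `bits_per_weight` as assumptions to verify against the llama.cpp source.

```python
def bits_per_weight(block_weights, payload_bytes, scale_bytes):
    """Average bits per weight for a block-quantized format."""
    return (payload_bytes + scale_bytes) * 8 / block_weights

# Assumed q8_0 layout: 32 int8 values (32 bytes) + one float16 scale (2 bytes)
q8_0_bits = bits_per_weight(block_weights=32, payload_bytes=32, scale_bytes=2)
f16_bits = 16.0

print(f"q8_0: {q8_0_bits} bits per weight")
print(f"q8_0 is about {q8_0_bits / f16_bits:.0%} of the f16 size")
```

The k-quant methods such as `q4_k_m` mix several block types per tensor, so their average bits per weight cannot be read off a single layout like this.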

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
# Remember to go to https://huggingface.co/settings/tokens for a token!
# And change hf to your username!
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# Save to multiple GGUF options - much faster if you want multiple!
if False:
    model.push_to_hub_gguf(
        "hf/model", # Change hf to your username!
        tokenizer,
        quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
        token = "",
    )