# Example code for VLMGRPOTrainer

This notebook is a proof of concept that runs VLMGRPOTrainer on a simplified dataset; do not expect meaningful results from it. Before using the model for your own use case, fine-tune it on your target data. For a proper GRPO training methodology, refer to the official Unsloth notebooks.

In [None]:
!pip install transformers==4.49.0
!pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# Do this only in Colab notebooks! Otherwise, use `pip install unsloth`.
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
!pip install --no-deps unsloth==2025.4.1

In [2]:
# Either download the code manually or clone it, then adjust the install path below to match
#!git clone https://github.com/GAD-cell/VLM_GRPO.git
!pip install -e /content/VLM_GRPO-main

Obtaining file:///content/vlmgrpo
  Preparing metadata (setup.py) ... done
Installing collected packages: vlmgrpo
  Running setup.py develop for vlmgrpo
Successfully installed vlmgrpo-0.1


In [2]:
from unsloth import FastVisionModel # FastLanguageModel for LLMs

model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Qwen2.5-VL-7B-Instruct",
    load_in_4bit = True, # Use 4bit to reduce memory use. False for 16bit LoRA.
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for long context
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.4.1: Fast Qwen2 patching. Transformers: 4.49.0.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.495 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.


In [3]:
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers     = True, # False if not finetuning vision layers
    finetune_language_layers   = True, # False if not finetuning language layers
    finetune_attention_modules = True, # False if not finetuning attention layers
    finetune_mlp_modules       = True, # False if not finetuning MLP layers

    r = 16,           # The larger, the higher the accuracy, but might overfit
    lora_alpha = 16,  # Recommended alpha == r at least
    lora_dropout = 0.1,
    bias = "none",
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
    # target_modules = "all-linear", # Optional now! Can specify a list if needed
)

In [4]:
from datasets import load_dataset, Dataset

ds = load_dataset("AI4Math/MathVista",split="testmini")

def is_numeric_answer(example):
    """Return True if the sample's answer can be parsed as a float."""
    try:
        float(example["answer"])
        return True
    except (TypeError, ValueError):
        return False

# Keep only the samples whose answer parses as a float (e.g. "42"); this keeps the reward function simple.
numeric_ds = ds.filter(is_numeric_answer)
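
As a quick check of what this filter keeps, the sketch below applies the same `float()` test to a few hand-written answer values. The values are illustrative and not taken from MathVista, and `parses_as_float` is a hypothetical helper that mirrors `is_numeric_answer` but takes the answer directly instead of a dataset row.

```python
def parses_as_float(value):
    """Same test as is_numeric_answer, applied to the answer value itself."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

# Hypothetical answer values, chosen only to illustrate the filter
samples = ["42", "3.5", "1e3", "B", "1.0 to 3.0 years", None]
kept = [s for s in samples if parses_as_float(s)]
print(kept)
```

Note that `float()` also accepts scientific notation, so an answer such as `"1e3"` passes the filter, while multiple-choice letters and free-text answers are dropped.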

In [13]:
print(numeric_ds)

Dataset({
    features: ['pid', 'question', 'image', 'decoded_image', 'choices', 'unit', 'precision', 'answer', 'question_type', 'answer_type', 'metadata', 'query'],
    num_rows: 566
})


# Data loader and reward functions

In [5]:
reasoning_start = "<think>"
reasoning_end   = "</think>"
solution_start  = "<answer>"
solution_end    = "</answer>"

def format_fn(sample):
    """Convert a MathVista sample into a dict with a chat prompt, a resized image, and the answer."""
    image = sample['decoded_image']
    prompt = sample['question']
    answer = sample['answer']
    # Named `formatted` to avoid shadowing the built-in `format`
    formatted = {
        "prompt": [
            {
                "role": "user",
                "content": [
                    {"type": "image"},  # a single placeholder, because there is only 1 image per sample
                    {"type": "text", "text": f"{prompt} provide your reasoning between {reasoning_start} and {reasoning_end} and then your final answer between {solution_start} and (put a float here) {solution_end}"},
                ],
            }
        ],
        "image": [image.resize((512, 512))],
        "answer": answer,
    }
    return formatted

class FormattedDataset():
    def __init__(self, dataset, format_fn):
        self.dataset = dataset
        self.format_fn = format_fn

    def __getitem__(self, idx):
        item = self.dataset[idx]
        return self.format_fn(item)

    def __len__(self):
        return len(self.dataset)


train_dataset=FormattedDataset(dataset=numeric_ds,format_fn=format_fn)

In [6]:
# Reward functions
def formatting_reward_func(completions, **kwargs):
    """Score 1.0 for exactly one reasoning block plus 1.0 for exactly one answer block."""
    import re
    print("test")  # debug marker, printed each time the reward function is called
    thinking_pattern = f'{reasoning_start}(.*?){reasoning_end}'
    answer_pattern = f'{solution_start}(.*?){solution_end}'

    scores = []
    for completion in completions:
        score = 0.0
        thinking_matches = re.findall(thinking_pattern, completion[0]['content'], re.DOTALL)
        answer_matches = re.findall(answer_pattern, completion[0]['content'], re.DOTALL)
        if len(thinking_matches) == 1:
            score += 1.0
        if len(answer_matches) == 1:
            score += 1.0
        scores.append(score)
    return scores


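def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    """Score 2.0 when the single answer block equals the reference answer string, else 0.0."""
    import re

    answer_pattern = f'{solution_start}(.*?){solution_end}'

    responses = [re.findall(answer_pattern, completion[0]['content'], re.DOTALL) for completion in completions]
    q = prompts[0][-1]['content']

    # Log the first question, reference answer, and completion of the batch
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:{completions[0]}")
    # Exact string comparison after removing newlines: "30.0" does not match a reference answer of "30"
    return [2.0 if len(r)==1 and a == r[0].replace('\n','') else 0.0 for r, a in zip(responses, answer)]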

# To counter the design of GRPO
#def length_reward_func(prompts, completions, answer, **kwargs):
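
To make the behavior of the two reward functions concrete, the sketch below re-implements their checks on plain strings. The completion is a hand-written string modeled on the last lines of the first response in the training log, not the full logged text. `format_score`, `exact_match`, and `numeric_match` are hypothetical helpers, not part of vlmgrpo; `numeric_match` is one possible alternative that compares parsed floats within an assumed tolerance instead of comparing strings.

```python
import math
import re

THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

def format_score(text):
    """1.0 for exactly one <think> block plus 1.0 for exactly one <answer> block."""
    return float(len(THINK.findall(text)) == 1) + float(len(ANSWER.findall(text)) == 1)

def exact_match(text, reference):
    """String comparison after removing newlines, as in correctness_reward_func."""
    found = ANSWER.findall(text)
    return len(found) == 1 and reference == found[0].replace("\n", "")

def numeric_match(text, reference, tol=1e-6):
    """Assumed alternative: compare the parsed floats within a tolerance."""
    found = ANSWER.findall(text)
    if len(found) != 1:
        return False
    try:
        return math.isclose(float(found[0]), float(reference), abs_tol=tol)
    except ValueError:
        return False

# Hand-written completion modeled on the logged response (reference answer: "30")
completion = "The highest value on the X-axis is 30.\n\n<answer>\n30.0\n</answer>"
print(format_score(completion))         # 1.0: one <answer> block, no <think> block
print(exact_match(completion, "30"))    # False: "30.0" != "30"
print(numeric_match(completion, "30"))  # True: 30.0 equals 30 numerically
```

Because the prompt asks for a float while the reference answers are stored as strings such as `30`, a string comparison can reject answers that are numerically right; a numeric comparison is one way to avoid that, at the cost of choosing a tolerance.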


# GRPO Training

In [7]:
from unsloth import is_bf16_supported
from vlmgrpo import VLMGRPOTrainer
from trl import GRPOConfig

training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    bf16 = is_bf16_supported(),
    fp16 = not is_bf16_supported(),
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 2, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)

trainer = VLMGRPOTrainer(
    model = model,
    processing_class = tokenizer, # Must be the Unsloth processor returned by FastVisionModel.from_pretrained
    reward_processing_classes = tokenizer, # Same here
    reward_funcs = [
        formatting_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = train_dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 566 | Num Epochs = 1 | Total steps = 250
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 1 x 1) = 2
 "-____-"     Trainable parameters = 51,521,536/7,000,000,000 (0.74% trained)
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...


test
-------------------- Question:
[{'type': 'image'}, {'type': 'text', 'text': 'What is the highest value on the X axis? provide your reasoning between <think> and </think> and then your final answer between <answer> and (put a float here) </answer>'}] 
Answer:
30 
Response:[{'role': 'assistant', 'content': 'What is the highest value on the X axis? provide your reasoning between <think> and </think> and then your final answer between <answer> and (put a float here) </answer>\n addCriterion\n\n addCriterion\nThe image depicts a plot with an X-axis ranging from 0 to 30. The highest value on the X-axis is 30.\n\n<answer>\n30.0\n</answer>'}]


| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / formatting_reward_func | rewards / correctness_reward_func |
|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 1.5 | 0.707107 | 125.0 | 0.0 | 1.5 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 | 127.0 | 0.0 | 1.0 | 0.0 |
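
The `reward` and `reward_std` values logged for these two steps can be reproduced by hand. With `num_generations = 2`, each step scores a group of two completions; a mean of 1.5 with a standard deviation of 0.707107 is consistent with rewards of 1.0 and 2.0. The sketch below computes these statistics along with the group-normalized advantage commonly used in GRPO, `(r - mean) / (std + eps)`. The individual reward values are inferred from the log rather than printed by it, and `eps` is an assumed small constant; `group_advantages` is a hypothetical helper, not the trainer's own code.

```python
import statistics

def group_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards)  # sample standard deviation (divides by n - 1)
    return mean, std, [(r - mean) / (std + eps) for r in rewards]

# Step 1: rewards inferred from reward = 1.5 and reward_std = 0.707107
mean1, std1, adv1 = group_advantages([1.0, 2.0])
print(mean1, round(std1, 6), adv1)

# Step 2: rewards inferred from reward = 1.0 and reward_std = 0.0
mean2, std2, adv2 = group_advantages([1.0, 1.0])
print(mean2, std2, adv2)
```

Under this formulation, when every completion in a group receives the same reward, as in step 2, all advantages are zero and the rewards contribute no learning signal for that step.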


test
-------------------- Question:
[{'type': 'image'}, {'type': 'text', 'text': 'What is the age gap between these two people in image? provide your reasoning between <think> and </think> and then your final answer between <answer> and (put a float here) </answer>'}] 
Answer:
6 
Response:[{'role': 'assistant', 'content': "What is the age gap between these two people in image? provide your reasoning between <think> and </think> and then your final answer between <answer> and (put a float here) </answer>\n addCriterion\n\n addCriterion\nThe image shows two individuals standing in a docked boat. Both individuals appear to be wearing striped shirts, and their clothing and hairstyles suggest they are young. The individual on the left appears slightly taller and perhaps a few years older. However, without knowing their exact ages, it's difficult to determine an exact age gap.\n\n<answer>\n1.0 to 3.0 years\n</answer>"}]
test
-------------------- Question:
[{'type': 'image'}, {'type': 'text',

KeyboardInterrupt: 

In [None]:
model.save_pretrained("V0_GRPO")  # Local saving
tokenizer.save_pretrained("V0_GRPO")

[]