<a href="https://colab.research.google.com/github/HarleyCoops/TrainingRun/blob/main/Qwen_0_5b__GRPO.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


<h1 align="center">Working Qwen 0.5b on GRPO</h1>

---

<h1 align="center">Training a small math reasoner with RL</h1>

This notebook is an alternate version of the [GRPO demo](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) by [Will Brown](https://x.com/willccbb), which trains llama-1b on the gsm8k math dataset.

We've made only a few changes so the code runs more smoothly on Colab:
* Replaced llama-1b with Qwen-0.5b.
* Switched generation to vllm, which yields a significant speed-up. Qwen's small size makes it possible to run vllm on the same GPU that is used for GRPO.
* Dropped flash-attn (a recurring bug when modeling Qwen; the cause is unclear).

## Setting up the models

## Understanding vLLM in this Project

### What is vLLM?
vLLM is a high-performance open-source library for efficient LLM inference and serving, originally developed in the Sky Computing Lab at UC Berkeley. It offers production-grade performance and is used by companies such as Databricks and Anyscale.

### Core Features and Benefits
1. **PagedAttention**
   - Novel memory management system similar to operating system page management
   - Dramatically reduces memory usage during inference
   - Enables efficient handling of multiple requests simultaneously

2. **Performance Optimizations**
   - Continuous batching for dynamic request processing
   - Optimized CUDA kernels for maximum GPU utilization
   - Efficient KV cache management for transformer architectures
   - Supports both CPU and GPU inference
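
The paging idea in the list above can be sketched with a toy allocator. This is only an illustration of the bookkeeping (fixed-size blocks, a per-sequence block table, a shared free pool); the class name `ToyBlockManager` and the block size of 16 tokens are assumptions made for this sketch, and none of it is vLLM's actual code.

```python
# Toy sketch of paged KV-cache bookkeeping. Illustrative only; not vLLM's code.
BLOCK_SIZE = 16  # tokens per block (assumed value for this sketch)


class ToyBlockManager:
    """Hands out fixed-size cache blocks to sequences on demand."""

    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))  # shared pool of block ids
        self.block_tables = {}  # sequence id -> list of block ids it owns
        self.seq_lens = {}      # sequence id -> number of cached tokens

    def append_token(self, seq_id):
        """Reserve cache space for one more token of a sequence."""
        length = self.seq_lens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        # A new block is needed only when the last one is full (or none exists).
        if length % BLOCK_SIZE == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop())
        self.seq_lens[seq_id] = length + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the shared pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


manager = ToyBlockManager(num_blocks=8)
for _ in range(20):
    manager.append_token("request-a")  # 20 tokens need 2 blocks
for _ in range(5):
    manager.append_token("request-b")  # 5 tokens need 1 block

blocks_used_a = len(manager.block_tables["request-a"])
blocks_used_b = len(manager.block_tables["request-b"])
manager.free("request-a")  # its 2 blocks become reusable immediately
```

Because memory is reserved block by block rather than for the maximum sequence length up front, short requests waste at most one partially filled block each.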

### Why vLLM is Critical for This Training Pipeline
1. **Speed Benefits**
   - Significantly faster inference during training
   - Essential for GRPO (Group Relative Policy Optimization)
   - Enables rapid model evaluation during reinforcement learning

2. **Memory Efficiency**
   - Allows both training and inference on the same GPU
   - Particularly important for our Qwen-0.5B model setup
   - Optimizes GPU memory usage through smart caching
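
To make the memory split concrete, here is a small arithmetic sketch that reproduces the figures from the vLLM startup log printed later in this notebook (39.56 GiB total, `gpu_memory_utilization` of 0.30). The block size of 16 tokens is an assumption on my part; it is not stated in the log, although it is consistent with the reported concurrency figure.

```python
# Numbers copied from the vLLM startup log in this notebook.
total_gpu_memory_gib = 39.56
gpu_memory_utilization = 0.30
model_weights_gib = 0.93
non_torch_gib = 0.09
activation_peak_gib = 1.44

# Memory vLLM is allowed to use on the shared GPU.
vllm_budget_gib = total_gpu_memory_gib * gpu_memory_utilization

# Whatever is left after weights and working memory goes to the KV cache.
kv_cache_gib = (
    vllm_budget_gib - model_weights_gib - non_torch_gib - activation_peak_gib
)

# Concurrency estimate: how many max-length requests fit in the KV cache.
cuda_blocks = 51373
tokens_per_block = 16  # ASSUMPTION: not shown in the log
max_seq_len = 32768
max_concurrency = cuda_blocks * tokens_per_block / max_seq_len

print(f"vLLM budget: {vllm_budget_gib:.2f} GiB")
print(f"KV cache:    {kv_cache_gib:.2f} GiB")
print(f"Concurrency: {max_concurrency:.2f}x")
```

The remaining roughly 70% of the GPU stays available for the policy model, its gradients, and the optimizer state during GRPO training.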

### Installation Requirements
- Must be installed BEFORE TRL (Transformer Reinforcement Learning)
- Requires CUDA support for GPU acceleration
- Dependencies are automatically handled by pip

### Documentation & Resources
- Official Docs: [vllm.readthedocs.io](https://vllm.readthedocs.io/)
- GitHub: [github.com/vllm-project/vllm](https://github.com/vllm-project/vllm)
- Paper: ["Efficient Memory Management for Large Language Model Serving with PagedAttention"](https://arxiv.org/abs/2309.06180)

### Important Note
After installing vLLM, you must restart the runtime before proceeding with other installations. This is due to a known interaction with the TRL library that requires vLLM to be installed first.

In [None]:
!pip install vllm

Collecting vllm
  Downloading vllm-0.7.0-cp38-abi3-manylinux1_x86_64.whl.metadata (12 kB)
Collecting blake3 (from vllm)
  Downloading blake3-1.0.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting fastapi!=0.113.*,!=0.114.0,>=0.107.0 (from vllm)
  Downloading fastapi-0.115.8-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn[standard] (from vllm)
  Downloading uvicorn-0.34.0-py3-none-any.whl.metadata (6.5 kB)
Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)
  Downloading prometheus_fastapi_instrumentator-7.0.2-py3-none-any.whl.metadata (13 kB)
Collecting tiktoken>=0.6.0 (from vllm)
  Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting lm-format-enforcer<0.11,>=0.10.9 (from vllm)
  Downloading lm_format_enforcer-0.10.9-py3-none-any.whl.metadata (17 kB)
Collecting outlines==0.1.11 (from vllm)
  Downloading outlines-0.1.11-py3-none-any.whl.metadata (17 kB)
Collecting lark=

Next we install trl and datasets. The order matters: installing vllm after trl triggers a bug in trl.

In [None]:
!pip install trl datasets
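
After restarting the runtime, it can help to confirm which versions actually got installed. The sketch below uses only the standard library; the helper name `installed_version` is hypothetical and introduced just for this example.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the packages this notebook depends on.
for name in ("vllm", "trl", "datasets"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```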



## Defining the RL rewards

Now we have everything ready to set up our RL training set and reward policy.

First we set the general prompt structure (with the reasoning tags).

In [None]:
import re
import torch
from datasets import load_dataset, Dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import GRPOConfig, GRPOTrainer

# Load and prep dataset

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

INFO 02-01 23:29:04 __init__.py:183] Automatically detected platform cuda.


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Now we import the gsm8k dataset and restructure it to fit into a conversational prompt format:

In [None]:
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Build a conversational prompt (system + user) and extract the reference answer
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()
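
To see what the two extraction helpers return, here is a small self-contained check. The helpers are copied from the cell above (with `Optional` instead of `str | None` so the snippet also runs on older Python versions); the three sample strings are made up for illustration and are not actual dataset records.

```python
from typing import Optional


def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


def extract_hash_answer(text: str) -> Optional[str]:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()


# Illustrative strings only (hypothetical, written for this example).
tagged = "<reasoning>\n54 / 6 = 9, and 9 - 2 = 7\n</reasoning>\n<answer>\n7\n</answer>\n"
untagged = "So, Gerald will have 8.8 pfennigs left."
reference = "54 / 6 = 9 pfennigs. 9 - 2 = 7 pfennigs left.\n#### 7"

print(repr(extract_xml_answer(tagged)))      # answer between the tags
print(repr(extract_xml_answer(untagged)))    # no tags: the whole text comes back
print(repr(extract_hash_answer(reference)))  # text after the '####' marker
```

Note the second case: when a completion contains no `<answer>` tags, `extract_xml_answer` returns the entire response, which is why the early training logs show the full response under "Extracted:".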

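We now move on to the reward functions. The most important one is the "correctness" function, which acts as a verifier by comparing the answer extracted from each completion against the reference answer (worth 2.0). The other four reward formatting: an integer-answer check, a strict and a soft format check (0.5 each), and an XML tag counter (up to 0.5).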

In [None]:
# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that match the exact tag layout, newlines included."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` span newlines, so multi-line reasoning still matches
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that have both tag pairs in order, ignoring exact whitespace."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # Without re.DOTALL, `.` cannot cross the newline that follows <reasoning>
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
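
A quick way to sanity-check the formatting rewards is to run them on a hand-written completion. The sketch below copies `count_xml` from the cell above and also shows how the soft format pattern behaves with and without `re.DOTALL`, which matters because the requested format puts newlines inside the tags. The sample completion is made up for illustration.

```python
import re


def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1]) * 0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1) * 0.001
    return count


SOFT_PATTERN = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

# Hypothetical completion in exactly the requested format.
well_formed = (
    "<reasoning>\n54 / 6 = 9\n9 - 2 = 7\n</reasoning>\n<answer>\n7\n</answer>\n"
)

clean_score = count_xml(well_formed)               # all four tags present once
trailing_score = count_xml(well_formed + "extra")  # text after </answer> is penalized

# By default `.` does not match a newline, so the pattern cannot get past
# the line break right after <reasoning>.
default_match = re.match(SOFT_PATTERN, well_formed)
dotall_match = re.match(SOFT_PATTERN, well_formed, flags=re.DOTALL)

print(clean_score, trailing_score)
print(default_match is not None, dotall_match is not None)
```

Each of the five trailing characters costs 0.001 in both penalty terms, so trailing text lowers the tag-count reward slightly rather than zeroing it.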

We now set the training arguments:

In [None]:
model_name = "Qwen/Qwen2.5-0.5B-Instruct"

output_dir="outputs/Qwen-0.5B-GRPO"
run_name="Qwen-0.5B-GRPO-gsm8k"

training_args = GRPOConfig(
    output_dir=output_dir,
    run_name=run_name,
    learning_rate=5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type='cosine',
    logging_steps=1,
    bf16=True,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=16,
    max_prompt_length=256,
    max_completion_length=200,
    num_train_epochs=1,
    save_steps=100,
    max_grad_norm=0.1,
    log_on_each_node=False,
    use_vllm=True,
    vllm_gpu_memory_utilization=.3,
    vllm_device="cuda:0",
    report_to="none",  # disable Weights & Biases logging
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map=None
).to("cuda")

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
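
With `num_generations=16`, every prompt is sampled 16 times and each completion is scored relative to the others in its group. The sketch below shows that group-relative normalization in its commonly described form, (reward - group mean) / (group std + eps). It is a simplified illustration under my own assumptions (population standard deviation, `eps=1e-4`, a hypothetical function name), not TRL's implementation, which may differ in details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize the rewards of one prompt's completions within their group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # ASSUMPTION: population std; libraries may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Summed rewards for four hypothetical completions of the same prompt
# (a real run with this config would have 16 per prompt).
rewards = [2.5, 0.5, 0.0, 0.5]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

# If every completion earns the same reward, no completion is preferred.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

This is why the small formatting rewards matter early on: while no completion is correct yet, differences in format are the only signal that separates completions within a group.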

And launch the actual training:

In [6]:
# use peft at your own risk; not working for me with multi-GPU training
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func],
    args=training_args,
    train_dataset=dataset,
    #peft_config=peft_config
)
trainer.train()



INFO 02-04 12:30:24 config.py:526] This model supports multiple tasks: {'embed', 'score', 'reward', 'classify', 'generate'}. Defaulting to 'generate'.
INFO 02-04 12:30:24 llm_engine.py:232] Initializing a V0 LLM engine (v0.7.1) with config: model='Qwen/Qwen2.5-0.5B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-0.5B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda:0, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=Qwen/Qwen2.5-0.5B-Instruct, n

Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]


INFO 02-04 12:30:26 model_runner.py:1116] Loading model weights took 0.9279 GB
INFO 02-04 12:30:28 worker.py:266] Memory profiling takes 1.07 seconds
INFO 02-04 12:30:28 worker.py:266] the current vLLM instance can use total_gpu_memory (39.56GiB) x gpu_memory_utilization (0.30) = 11.87GiB
INFO 02-04 12:30:28 worker.py:266] model weights take 0.93GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 1.44GiB; the rest of the memory reserved for KV Cache is 9.41GiB.
INFO 02-04 12:30:28 executor_base.py:108] # CUDA blocks: 51373, # CPU blocks: 21845
INFO 02-04 12:30:28 executor_base.py:113] Maximum concurrency for 32768 tokens per request: 25.08x
INFO 02-04 12:30:31 model_runner.py:1435] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_u

Capturing CUDA graph shapes: 100%|██████████| 35/35 [00:29<00:00,  1.20it/s]

INFO 02-04 12:31:01 model_runner.py:1563] Graph capturing finished in 29 secs, took 0.16 GiB
INFO 02-04 12:31:01 llm_engine.py:429] init engine (profile, create kv cache, warmup model) took 34.25 seconds





-------------------- Question:
Gerald wants to buy a meat pie that costs 2 pfennigs. Gerald has 54 farthings, and there are 6 farthings to a pfennig. How many pfennigs will Gerald have left after buying the pie? 
Answer:
7 
Response:
To solve this problem, we first need to understand how far possessions are worth in pfennigs and how far they are worth in farthings. We know that 5 farthings are equivalent to 1 pfennig. Therefore, in pfennigs, 54 farthings would be equivalent to:

\( 54 \text{ farthings} \times \frac{1 \text{ pfennig}}{5 \text{ farthings}} = 10.8 \text{ pfennigs} \)

Since Gerald has 10.8 pfennigs, after purchasing the meat pie, he will have:

\( 10.8 \text{ pfennigs} - 2 \text{ pfennigs} = 8.8 \text{ pfennigs} \)

So, Gerald will have 8.8 pfennigs left. 
Extracted:
To solve this problem, we first need to understand how far possessions are worth in pfennigs and how far they are worth in farthings. We know that 5 farthings are equivalent to 1 pfennig. Therefore, in pfenni

Step,Training Loss
1,0.0
2,0.0
3,0.0
4,0.0
5,0.0
6,0.0
7,0.0
8,0.0
9,0.0
10,0.0


Streaming output truncated to the last 5000 lines.
Extracted:
To solve this problem, we need to determine the total number of sandstone blocks in a four-layer pyramid. Each layer has three times as many sandstone blocks as the layer directly below it, with the top layer being a single block.

Let's denote the number of sandstone blocks in the top layer (which is the largest layer) as \( x \). According to the problem, each subsequent layer has 3 blocks less than the layer directly below it. Therefore, the sequence of number of blocks in each layer is:

1. \( x \)
2. \( x - 3 \)
3. \( x - 6 \)
4. \( x - 9 \)

We know that the top layer has one block, so \( x = 1 \).

Using this value, we can calculate the total number of sandstone blocks in the pyramid:
\[
\text{Total blocks} = 1 + (1 - 3) + (1 - 6)
-------------------- Question:
Jerry can run from his house to his school and back in the time it takes his brother Carson to run to the school. If it takes Jerry 15 minutes to

Streaming output truncated to the last 5000 lines.
Extracted:
195
-------------------- Question:
Arthur knows that he needs to finish 800 pages of reading over the summer. He has read 80% of a 500-page book and 1/5 of a 1000 page book. How many more pages does he need to reach to meet his goal? 
Answer:
200 
Response:
<reasoning>
Arthur has read 80% of a 500-page book, so he has read 80/100 * 500 = 400 pages. He has also read 1/5 of a 1000-page book, so he has read 1/5 * 1000 = 200 pages. In total, he has read 400 + 200 = 600 pages. Arthur needs to finish 800 pages and has already read 600 pages, so he still needs to read 800 - 600 = 200 more pages.
</reasoning>
<answer>
200
</answer>
 
Extracted:
200
-------------------- Question:
Jeanine bought 18 pencils and Clare bought half as many pencils. If Jeanine gave one third of his pencils to Abby, how many more pencils than Clare does Jeanine have now? 
Answer:
3 
Response:
<reasoning>
Clare bought half as many pencils as Je

TrainOutput(global_step=1868, training_loss=0.006341318532330285, metrics={'train_runtime': 6990.8268, 'train_samples_per_second': 1.069, 'train_steps_per_second': 0.267, 'total_flos': 0.0, 'train_loss': 0.006341318532330285})
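
The numbers in `TrainOutput` can be cross-checked with a little arithmetic. The sketch below assumes a single GPU and that the GSM8K `main` train split has 7,473 questions (a figure not stated in this notebook, though it is consistent with the reported samples per second multiplied by the runtime).

```python
# Values from the training config and the TrainOutput line above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
train_runtime_s = 6990.8268
samples_per_second = 1.069

train_examples = 7473  # ASSUMPTION: size of the GSM8K "main" train split

# One optimizer step consumes batch_size * accumulation_steps prompts.
prompts_per_step = per_device_train_batch_size * gradient_accumulation_steps
expected_steps = train_examples // prompts_per_step

runtime_hours = train_runtime_s / 3600
seconds_per_step = train_runtime_s / expected_steps
prompts_seen = samples_per_second * train_runtime_s

print(f"expected steps:   {expected_steps}")
print(f"runtime:          {runtime_hours:.2f} h")
print(f"seconds per step: {seconds_per_step:.2f}")
print(f"prompts seen:     {prompts_seen:.0f}")
```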

In [21]:
# Import necessary libraries
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model directly from the saved checkpoint folder.
# The checkpoint contains the full weights and config, so the base model
# does not need to be loaded first.
checkpoint_path = "outputs/Qwen-0.5B-GRPO/checkpoint-1868"  # Specific checkpoint folder
model = AutoModelForCausalLM.from_pretrained(
    checkpoint_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    checkpoint_path,
    trust_remote_code=True
)

# Same system prompt as used during training
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

def solve_math_problem(question: str):
    prompt = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question}
    ]

    input_text = tokenizer.apply_chat_template(
        prompt,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=200,
        temperature=0.7,
        num_return_sequences=1,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id
    )

    # Decoding the full sequence returns the prompt (system and user turns) followed by the completion
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Test with a sample problem
test_question = "A train travels at 60 miles per hour. If the journey is 270 miles long, how many hours will the trip take?"
print("\nQuestion:", test_question)
print("\nResponse:", solve_math_problem(test_question))


Question: A train travels at 60 miles per hour. If the journey is 270 miles long, how many hours will the trip take?

Response: system

Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>

user
A train travels at 60 miles per hour. If the journey is 270 miles long, how many hours will the trip take?
assistant
<reasoning>
The time taken to travel a certain distance is calculated by dividing the distance by the speed. So for a journey of 270 miles and a speed of 60 mph, the time taken would be \( \frac{270}{60} = 4.5 \) hours.
</reasoning>
<answer>
4.5
</answer>
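
The printed response above includes the system and user turns because the whole generated sequence is decoded. Below is a minimal sketch for keeping only the completion and pulling out the final answer from the decoded text. The helper name `completion_only` and the `"assistant\n"` marker are assumptions based on the output shown above; a more robust alternative is to slice the generated token ids before decoding, for example `outputs[0][inputs["input_ids"].shape[1]:]`.

```python
def completion_only(decoded: str, marker: str = "assistant\n") -> str:
    """Keep the text after the last role marker (whole text if marker is absent)."""
    # Caveat: this would cut too much if the completion itself contained the marker.
    return decoded.rpartition(marker)[2].strip()


def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()


# Decoded text shaped like the output above (shortened for the example).
decoded = (
    "system\n\nRespond in the following format:\n<reasoning>\n...\n</reasoning>\n"
    "<answer>\n...\n</answer>\n\n"
    "user\nA train travels at 60 miles per hour. If the journey is 270 miles long, "
    "how many hours will the trip take?\n"
    "assistant\n<reasoning>\n270 / 60 = 4.5\n</reasoning>\n<answer>\n4.5\n</answer>\n"
)

completion = completion_only(decoded)
final_answer = extract_xml_answer(completion)
print(completion)
print("Final answer:", final_answer)
```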



Save the Colab output to a local zip file:

In [17]:
!zip -r /content/model_checkpoint.zip /content/outputs/Qwen-0.5B-GRPO/checkpoint-1868

  adding: content/outputs/Qwen-0.5B-GRPO/checkpoint-1868/ (stored 0%)
  adding: content/outputs/Qwen-0.5B-GRPO/checkpoint-1868/rng_state.pth (deflated 25%)
  adding: content/outputs/Qwen-0.5B-GRPO/checkpoint-1868/tokenizer_config.json (deflated 83%)
  adding: content/outputs/Qwen-0.5B-GRPO/checkpoint-1868/training_args.bin (deflated 51%)
  adding: content/outputs/Qwen-0.5B-GRPO/checkpoint-1868/model.safetensors (deflated 21%)
  adding: content/outputs/Qwen-0.5B-GRPO/checkpoint-1868/special_tokens_map.json (deflated 63%)
  adding: content/outputs/Qwen-0.5B-GRPO/checkpoint-1868/config.json (deflated 47%)
  adding: content/outputs/Qwen-0.5B-GRPO/checkpoint-1868/generation_config.json (deflated 39%)
  adding: content/outputs/Qwen-0.5B-GRPO/checkpoint-1868/tokenizer.json (deflated 81%)
  adding: content/outputs/Qwen-0.5B-GRPO/checkpoint-1868/scheduler.pt (deflated 56%)
  adding: content/outputs/Qwen-0.5B-GRPO/checkpoint-1868/optimizer.pt (deflated 24%)
  adding: content/outputs/Qwen-0.5B-GR
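
The same archive can also be produced from Python, which is handy outside of a notebook shell. This is a sketch using the standard library's `shutil.make_archive`; the helper name `zip_directory` and the demo folder are hypothetical, and the example runs in a temporary directory rather than on the real checkpoint.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path


def zip_directory(src_dir: str, archive_stem: str) -> str:
    """Zip src_dir (keeping its folder name) into '<archive_stem>.zip'."""
    src = Path(src_dir)
    return shutil.make_archive(
        archive_stem, "zip", root_dir=str(src.parent), base_dir=src.name
    )


# Demo on a throwaway folder standing in for a checkpoint directory.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "checkpoint-demo"
    ckpt.mkdir()
    (ckpt / "config.json").write_text("{}")

    archive_path = zip_directory(str(ckpt), str(Path(tmp) / "model_checkpoint"))
    with zipfile.ZipFile(archive_path) as zf:
        archived_names = zf.namelist()

print(archived_names)
```

For the checkpoint in this notebook, the call would take the checkpoint folder path as `src_dir` and a destination such as `/content/model_checkpoint` as `archive_stem`.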

In [20]:
from google.colab import drive
drive.mount('/content/drive')

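# Copy to Drive (create the target folder first in case it does not exist)
!mkdir -p /content/drive/MyDrive/my_models
!cp -r /content/outputs/Qwen-0.5B-GRPO/checkpoint-1868 /content/drive/MyDrive/my_models/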

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
