# 🧠 Mastering LLM Fine-Tuning with TRL – Group Relative Policy Optimization

Welcome! This notebook is part of a tutorial series where you'll learn how to fine-tune Large Language Models (LLMs) using 🤗 TRL.
We introduce key concepts, set up the required tools, and use techniques like Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO).

## 📋 Prerequisites

Before you begin, make sure you have the following:

* A working knowledge of Python and PyTorch
* A basic understanding of machine learning and deep learning concepts
* **Access to 2 GPU accelerators** (one for training, one for the vLLM generation server used during GRPO training)
* The `trl` library installed – this tutorial has been tested with **TRL version 0.17**
  If you don’t have `trl` installed yet, you can install it by running the following code block:

In [None]:
%pip install trl

* A [Hugging Face account](https://huggingface.co) with a configured access token. If you have not logged in yet, run the following code, which will prompt you for your access token. You can generate one in your Hugging Face account settings under [Access Tokens](https://huggingface.co/settings/tokens). The token must have the `Write access to contents/settings of all repos under your personal namespace` permission.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

## 🔄 Quick Recap of the Last Session

In the previous session, we explored Supervised Fine-Tuning (SFT) and how to use it to post-train a language model on a custom dataset.

* SFT is a technique used to adapt a pre-trained (base) language model to a specific task or domain by training it on a labeled dataset.
* We used the `trl` library to load a pre-trained model, and then fine-tuned it on a custom dataset.
* We also discussed the importance of data preprocessing and how to prepare your dataset for training.
* We discussed how to manage memory, which is crucial when working with large models.
* We pushed the fine-tuned model to the Hugging Face Hub, making it accessible for others to use.
* We showed that, even though the model is now capable of being conversational, it is still not very good at scientific tasks.

Our goal here is to continue fine-tuning our model to improve its performance on scientific tasks. To do this, we’ll use a technique that has recently proven highly effective and led to the development of some of the best-performing reasoning models, such as DeepSeek-R1 and Qwen3: RLVR — **Reinforcement Learning with Verifiable Rewards**.

## What is RLVR?

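RLVR is a form of reinforcement learning in which the reward comes from a programmatic check of the model's output rather than from a learned reward model or human preference labels. For each prompt, the model generates completions, a verifier function scores each one (for example, 1.0 if the final answer matches a known solution and 0.0 otherwise), and the policy is updated so that high-scoring completions become more likely.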

This approach is well-suited for problem types where it’s easy to assess whether a completion is correct or not. In our case, we want our model (Rick) to first lay out its reasoning, enclosed in `<think></think>` tags, and then give the final answer after the closing tag.

These two requirements, a correct format and a correct answer, can each be implemented as a reward function.

In [None]:
# How do you calculate the power consumed by a 4-ohm resistor with a 6A current running through it?
# Correct answer, correct format
completion_1 = [
    {
        "role": "assistant",
        "content": "<think>(burps) Morty, haven’t you learned anything? Power is just I squared R. Current times current times resistance. That's 6 squared times 4, which is 144 watts.</think> So, the power consumed by a 4-ohm resistor with a 6A current running through it is 144 watts. You got that, Morty? 144 watts! That's a lot of power, Morty! You gotta be careful with that kind of power. It can really mess you up if you're not careful.",
    }
]
# Wrong format, correct answer
completion_2 = [
    {
        "role": "assistant",
        "content": "Easy, Morty. I know that power is equal to current squared times resistance. So, 6A squared times 4 ohms equals 144 watts.",
    }
]
# Wrong answer, correct format
completion_3 = [
    {
        "role": "assistant",
        "content": "<think>Well, Morty, you know that power is equal to voltage times current. So, if we have a 4-ohm resistor and a 6A current, we can use Ohm's law to find the voltage across the resistor. V = I * R, so V = 6A * 4 ohms = 12V. Now we can calculate the power: P = V * I = 12V * 6A = 72W.</think> So, the power consumed by the 4-ohm resistor with a 6A current is 72 watts. But remember, Morty, this is just a simple example. In real life, things can get a lot more complicated. You gotta be careful with these calculations, Morty. You don't want to fry your circuits or anything.",
    }
]


Let's check the format first:

In [None]:
import re


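def format_reward(completions, **kwargs):
    """Return 1.0 for completions that start with exactly one <think>...</think> block, else 0.0."""
    # The negative lookahead rejects completions that open a second <think> tag.
    pattern = r"^<think>(?!.*<think>)(.*?)</think>.*$"
    completion_contents = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets "." match newlines, so multi-line reasoning is accepted.
    matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completion_contents]
    return [1.0 if match else 0.0 for match in matches]

format_reward([completion_1, completion_2, completion_3])  # -> [1.0, 0.0, 1.0]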
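
To get a feel for what the format pattern accepts and rejects, here is a small sketch that applies the same regular expression to a few raw strings. The helper name `has_valid_format` and the example strings are ours, introduced only for illustration.

```python
import re

# Same pattern as in format_reward, applied to plain strings for brevity.
THINK_PATTERN = r"^<think>(?!.*<think>)(.*?)</think>.*$"


def has_valid_format(text: str) -> bool:
    """Hypothetical helper: True if the text starts with a single <think>...</think> block."""
    return re.match(THINK_PATTERN, text, re.DOTALL | re.MULTILINE) is not None


# Illustrative inputs (not taken from the dataset)
examples = {
    "reasoning then answer": "<think>P = I^2 * R</think> 144 watts",
    "no think block": "144 watts",
    "text before think": "Sure. <think>P = I^2 * R</think> 144 watts",
    "two think blocks": "<think>a</think> <think>b</think> 144 watts",
    "unclosed think": "<think>P = I^2 * R 144 watts",
}

for name, text in examples.items():
    print(f"{name}: {has_valid_format(text)}")
```

Only the first example passes: the reward requires the completion to begin with `<think>`, to close it with `</think>`, and to open no further `<think>` tag.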

Now let's check the correctness of the answer. We do this by checking whether the ground-truth answer appears as a substring of the completion: if it does, the reward is 1.0; otherwise it is 0.0. This check is basic and not very robust, but it is enough for our simple use case.

In [None]:
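def correctness_reward(completions, solution, **kwargs):
    """Return 1.0 when the ground-truth string appears anywhere in the completion, else 0.0."""
    rewards = []
    for completion, ground_truth in zip(completions, solution):
        content = completion[0]["content"]
        # Plain substring test: no unit normalization and no numeric parsing
        reward = 1.0 if ground_truth in content else 0.0
        rewards.append(reward)
    return rewards

correctness_reward([completion_1, completion_2, completion_3], solution=["144 watts", "144 watts", "144 watts"])  # -> [1.0, 1.0, 0.0]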
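
One weakness of the substring check is that it also rewards a completion that mentions the right value inside its reasoning but concludes with a different answer. A possible variant, sketched below under the assumption that the final answer is whatever follows the last `</think>` tag, only looks at that part. The name `final_answer_reward` is hypothetical and this function is not used in the rest of the notebook.

```python
def final_answer_reward(completions, solution, **kwargs):
    """Sketch: reward 1.0 only if the ground truth appears after the closing </think> tag."""
    rewards = []
    for completion, ground_truth in zip(completions, solution):
        content = completion[0]["content"]
        # Keep only the text after the last </think>; without a tag, use the whole text.
        final_part = content.rsplit("</think>", 1)[-1]
        rewards.append(1.0 if ground_truth in final_part else 0.0)
    return rewards


# Illustrative completions (made up for this example)
answer_in_conclusion = [{"role": "assistant", "content": "<think>6^2 * 4 = 144</think> It's 144 watts."}]
answer_only_in_reasoning = [{"role": "assistant", "content": "<think>Maybe 144 watts?</think> It's 72 watts."}]

print(final_answer_reward(
    [answer_in_conclusion, answer_only_in_reasoning],
    solution=["144 watts", "144 watts"],
))  # -> [1.0, 0.0]
```

The second completion would receive 1.0 from the plain substring check, because "144 watts" appears in its reasoning.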

In [None]:
from datasets import load_dataset

dataset = load_dataset("qgallouedec/rick-physics-grpo")

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("qgallouedec/SmolLM2-360M-Rickified")
model = AutoModelForCausalLM.from_pretrained("qgallouedec/SmolLM2-360M-Rickified", device_map="auto", force_download=True)

As before, let's define a chat template and make the necessary modifications to the template and tokenizer.

Remember, configuring a chat template doesn't make the model capable of chatting.
It only gives the ability to format inputs in a dialogue structure; the model still needs to be fine-tuned on conversational data to respond like a chatbot.

In [None]:
from transformers import pipeline

# Use a distinct name so the imported `pipeline` function is not shadowed
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

# Other questions to try:
# question = "A ball is thrown vertically upward with an initial speed of 14 m/s. What is its maximum height?"
# question = "What is Edmonton?"
question = "How do you calculate the power consumed by a 4-ohm resistor with a 6A current running through it?"
prompt = [{"role": "user", "content": question}]

generator(prompt, max_new_tokens=400)[0]["generated_text"]

In [None]:
from datasets import load_dataset

dataset = load_dataset("qgallouedec/rick-physics-grpo", split="train")


def format_dataset(example):
    return {"prompt": [{"role": "user", "content": example["question"]}]}


dataset = dataset.map(format_dataset)
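
The mapping step only adds a `prompt` column; the other columns of each row are kept. The sketch below mimics this on a plain dictionary. The row is hypothetical, and the `solution` column name is an assumption inferred from the signature of `correctness_reward`; the real dataset may contain different columns.

```python
def format_dataset(example):
    return {"prompt": [{"role": "user", "content": example["question"]}]}


# Hypothetical row: the actual dataset's columns and contents may differ.
row = {
    "question": "How do you calculate the power consumed by a 4-ohm resistor with a 6A current running through it?",
    "solution": "144 watts",
}

# Dataset.map merges the returned dict into the existing row; a dict merge mimics that here.
formatted = {**row, **format_dataset(row)}

print(formatted["prompt"])
print(formatted["solution"])
```

Keeping the extra column matters because the reward function needs the ground truth for each prompt: `correctness_reward` expects it as its `solution` argument.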

In [None]:
from trl import GRPOTrainer, GRPOConfig


args = GRPOConfig(
    max_completion_length=512,
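    # Speed up training and reduce memory usage
    gradient_checkpointing=True,
    bf16=True,
    # Generate with vLLM. Start the server on the second GPU first, in a separate terminal:
    #   CUDA_VISIBLE_DEVICES=1 trl vllm-serve --model qgallouedec/SmolLM2-360M-Rickified
    use_vllm=True,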
    output_dir="data/SmolLM2-360M-Rickified-GRPO",
    # Logging
    run_name="SmolLM2-360M-Rickified-GRPO",
    logging_steps=2,
    log_completions=True,
    num_completions_to_print=1,
)

trainer = GRPOTrainer(
    model="qgallouedec/SmolLM2-360M-Rickified",
    reward_funcs=[format_reward, correctness_reward],
    train_dataset=dataset,
    args=args,
)
trainer.train()
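
To give some intuition for the "group relative" part of GRPO, here is a minimal sketch of how rewards within a group of completions for the same prompt can be turned into advantages: each reward is compared with the group mean and scaled by the group standard deviation. This illustrates the idea only and is not TRL's implementation; the function name, the `eps` value, and the reward numbers are assumptions, and details such as the exact standard-deviation estimator vary between implementations.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Sketch: normalize rewards within one group (all completions of one prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids a division by zero when every completion gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical total rewards (format + correctness) for 4 completions of one prompt
group_rewards = [2.0, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(group_rewards)

for reward, advantage in zip(group_rewards, advantages):
    print(f"reward={reward:.1f} advantage={advantage:+.3f}")
```

Completions that score above the group average get a positive advantage and are reinforced, while those below get a negative one. When all completions in a group receive the same reward, every advantage is zero and that group provides no learning signal.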