# NLP PROJECT 2024/2025: Dataset OpenCodeReasoning
Team **o5 mini**: Alberti Emanuele Emilio, Biagi Ottavia, Capodanno Mario, Crippa Tommaso, Dussin Michele

## Local Installation

In [None]:
!pip install unsloth vllm google.generativeai

## Colab Installation

In [None]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm google.generativeai
else:
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    !pip install --no-deps unsloth vllm

In [None]:
#@title Colab Extra Install { display-mode: "form" }
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth vllm
else:
    !pip install --no-deps unsloth vllm
    # [NOTE] Do the below ONLY in Colab! Use [[pip install unsloth vllm]]
    # Skip restarting message in Colab
    import sys, re, requests; modules = list(sys.modules.keys())
    for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft "trl==0.15.2" triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer

    # vLLM requirements - vLLM breaks Colab due to reinstalling numpy
    f = requests.get("https://raw.githubusercontent.com/vllm-project/vllm/refs/heads/main/requirements/common.txt").content
    with open("vllm_requirements.txt", "wb") as file:
        file.write(re.sub(rb"(transformers|numpy|xformers)[^\n]{1,}\n", b"", f))
    !pip install -r vllm_requirements.txt

# Load Base Model & Tokenizer (LoRA-ready)

This code initializes a pretrained large language model (LLM) and prepares it for fine-tuning with LoRA (Low-Rank Adaptation), a parameter-efficient fine-tuning (PEFT) method.

LoRA adds a small set of trainable parameters while freezing the rest of the model.
The get_peft_model function returns the same model instance with the LoRA parameters integrated (essentially, it returns the "wrapped" model).

The target_modules refer to the list of layer names within the model where the LoRA blocks are injected, typically including the Q/K/V/O projections and the feed-forward layers.
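As a rough illustration of how few parameters LoRA trains, the sketch below counts the adapter weights added to the seven target modules. The layer dimensions are an assumption taken from the public Qwen2.5-3B configuration (hidden size 2048, MLP size 11008, 2 key/value heads of dimension 128, 36 layers), and the helper names are hypothetical. Each adapted layer of shape `in x out` receives two small matrices, `A` (`rank x in`) and `B` (`out x rank`).

```python
# ASSUMPTION: dimensions from the public Qwen2.5-3B config, not from this notebook.
HIDDEN, MLP, KV_DIM, NUM_LAYERS = 2048, 11008, 2 * 128, 36

# (in_features, out_features) of every projection listed in target_modules
TARGET_SHAPES = {
    "q_proj":    (HIDDEN, HIDDEN),
    "k_proj":    (HIDDEN, KV_DIM),
    "v_proj":    (HIDDEN, KV_DIM),
    "o_proj":    (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj":   (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Parameters in A (rank x in) plus B (out x rank) for one adapted layer."""
    return rank * (in_features + out_features)

def total_lora_params(rank: int) -> int:
    """Adapter parameters summed over all target modules and all layers."""
    per_layer = sum(lora_params(i, o, rank) for i, o in TARGET_SHAPES.values())
    return per_layer * NUM_LAYERS

print(total_lora_params(32))  # 59867136
```

With rank 32 this gives 59,867,136 parameters, which agrees with the "Trainable parameters" figure printed by the training log later in this notebook.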

Base Model: [[Qwen2.5-3B]](https://huggingface.co/Qwen/Qwen2.5-3B)

Unsloth Library: [[website]](https://unsloth.ai/) [[docs]](https://docs.unsloth.ai/)


⚠️ Computational Resources (Google Colab - NVIDIA T4 Tier)
The entire setup has been configured to allow fine-tuning within the resource constraints of the Google Colab Free Tier (which provides an NVIDIA T4 GPU with 15GB VRAM). When experimenting with larger models, it is advisable to load the model with reduced precision, such as 8-bit (load_in_8bit=True) or 4-bit (load_in_4bit=True) quantization.
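To get a feeling for why reduced precision helps, the back-of-the-envelope sketch below estimates the memory needed for the model weights alone at different bit widths. The base parameter count is derived from the figures in this notebook's training log (total parameters minus LoRA parameters); the function name is hypothetical, and the estimate deliberately ignores quantization overhead, activations, optimizer state and the KV cache.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate size of the raw weights in GiB (weights only)."""
    return num_params * bits_per_param / 8 / 2**30

# Total parameters reported by the training log minus the LoRA adapter parameters
BASE_PARAMS = 3_145_805_824 - 59_867_136

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gib(BASE_PARAMS, bits):.2f} GiB")
```

The 16-bit estimate is a little under 6 GiB, in the same range as the model size that vLLM reports when loading the weights; 4-bit loading would shrink the weights to roughly a quarter of that.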

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 8192 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

#The FastLanguageModel.from_pretrained function loads the base model (Qwen2.5-3B) along with its tokenizer

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Qwen/Qwen2.5-3B",
    max_seq_length = max_seq_length,
    load_in_4bit = False, # False for LoRA 16bit
    fast_inference = True, # Enable fast inference by using vLLM
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.5, # Reduce if out of memory
)

#LoRA adds a set of trainable parameters while freezing the rest of the model
model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj", # Attention Mechanism Projections
        "gate_proj", "up_proj", "down_proj", # Feed-Forward Network Layers
    ],
    lora_alpha = lora_rank*2, # *2 speeds up training
    use_gradient_checkpointing = "unsloth", # Reduces memory usage
    random_state = 3407,
)

==((====))==  Unsloth 2025.4.7: Fast Qwen2 patching. Transformers: 4.51.3. vLLM: 0.8.2.
   \\   /|    NVIDIA A100-SXM4-40GB. Num GPUs = 1. Max memory: 39.496 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 8.0. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/Qwen2.5-3B with actual GPU utilization = 49.43%
Unsloth: Your GPU has CUDA compute capability 8.0 with VRAM = 39.5 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 8192. Num Sequences = 256.
Unsloth: vLLM's KV Cache can use up to 13.66 GB. Also swap space = 6 GB.
INFO 05-24 18:46:05 [config.py:585] This model supports multiple tasks: {'score', 'reward', 'classify', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 05-24 18:46:05 [arg_utils.py:1865] LORA is experi

INFO 05-24 18:46:12 [cuda.py:291] Using Flash Attention backend.
INFO 05-24 18:46:13 [parallel_state.py:954] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 05-24 18:46:14 [model_runner.py:1110] Starting to load model unsloth/Qwen2.5-3B...
INFO 05-24 18:46:16 [weight_utils.py:265] Using model weights format ['*.safetensors']


INFO 05-24 18:46:42 [weight_utils.py:281] Time spent downloading weights for unsloth/Qwen2.5-3B: 25.929875 seconds


INFO 05-24 18:46:44 [loader.py:447] Loading weights took 1.95 seconds
INFO 05-24 18:46:44 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 05-24 18:46:45 [model_runner.py:1146] Model loading took 5.8821 GB and 30.702653 seconds
INFO 05-24 18:46:51 [worker.py:267] Memory profiling takes 6.49 seconds
INFO 05-24 18:46:51 [worker.py:267] the current vLLM instance can use total_gpu_memory (39.50GiB) x gpu_memory_utilization (0.49) = 19.52GiB
INFO 05-24 18:46:51 [worker.py:267] model weights take 5.88GiB; non_torch_memory takes 0.09GiB; PyTorch activation peak memory takes 1.43GiB; the rest of the memory reserved for KV Cache is 12.11GiB.
INFO 05-24 18:46:52 [executor_base.py:111] # cuda blocks: 22053, # CPU blocks: 10922
INFO 05-24 18:46:52 [executor_base.py:116] Maximum concurrency for 8192 tokens per request: 43.07x
INFO 05-24 18:47:05 [model_runner.py:1442] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eage

Capturing CUDA graph shapes: 100%|█████████████████████████████████████████████| 35/35 [00:32<00:00,  1.07it/s]

INFO 05-24 18:47:37 [model_runner.py:1570] Graph capturing finished in 33 secs, took 0.32 GiB
INFO 05-24 18:47:37 [llm_engine.py:447] init engine (profile, create kv cache, warmup model) took 52.81 seconds



Sliding Window Attention is enabled but not implemented for `eager`; unexpected results may be encountered.
Unsloth 2025.4.7 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.


# System Prompt & Chat Template Setup

This **system_prompt** is structured to guide the language model in solving programming or logical problems through step-by-step reasoning followed by code generation.

In [None]:
reasoning_start = "<think>"
reasoning_end   = "</think>"
solution_start  = "```python"
solution_end    = "```"

system_prompt = \
f"""You are given a problem.
Think about the problem and provide your working out.
Place it between {reasoning_start} and {reasoning_end}.
Then, provide your code solution between {solution_start} and {solution_end}"""
system_prompt

'You are given a problem.\nThink about the problem and provide your working out.\nPlace it between <think> and </think>.\nThen, provide your code solution between ```python and ```'

This code defines a custom chat formatting template and assigns it to the Hugging Face tokenizer's chat_template attribute. It is written in Jinja2 syntax and is used to generate the final prompt string that will be passed to a language model during inference or training (e.g., in RLHF-style training).

The goal is to:

- Dynamically compose a multi-turn prompt (system + user + assistant).

- Ensure that the system prompt is injected properly at the beginning.

- When add_generation_prompt is set, automatically append the <think> (or other reasoning-start marker) token after the last message to trigger step-by-step reasoning in the model.
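The three goals listed above can be sketched in plain Python, which may be easier to follow than the Jinja2 string. The function `render_chat` is a hypothetical stand-in written for illustration; it mirrors the branching of the template defined in the next cell but is not part of the tokenizer API, and it additionally tolerates an empty message list.

```python
def render_chat(messages, system_prompt, eos_token, reasoning_start,
                add_generation_prompt=False):
    """Plain-Python mirror of the chat template logic (illustrative only)."""
    parts = []
    # 1. System prompt: use the one in the conversation, else the default
    if messages and messages[0]["role"] == "system":
        parts.append(messages[0]["content"] + eos_token)
        loop_messages = messages[1:]
    else:
        parts.append(system_prompt + eos_token)
        loop_messages = messages
    # 2. User turns are copied as-is; assistant turns are closed with EOS
    for message in loop_messages:
        if message["role"] == "user":
            parts.append(message["content"])
        elif message["role"] == "assistant":
            parts.append(message["content"] + eos_token)
    # 3. Optionally end with the reasoning marker so generation starts there
    if add_generation_prompt:
        parts.append(reasoning_start)
    return "".join(parts)

demo = render_chat(
    [{"role": "user", "content": "Add two numbers."}],
    system_prompt="You are given a problem.",
    eos_token="<eos>",
    reasoning_start="<think>",
    add_generation_prompt=True,
)
print(demo)  # You are given a problem.<eos>Add two numbers.<think>
```

Note that nothing separates the system prompt from the user message other than the EOS token, which is the behavior of the template as written.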

In [None]:
# user prompt is the problem from the dataset
# system prompt is what we pass to the model to guide its behavior
# <reasoning start> is a marker indicating the beginning of the reasoning process

chat_template = \
    "{% if messages[0]['role'] == 'system' %}"\
        "{{ messages[0]['content'] + eos_token }}"\
        "{% set loop_messages = messages[1:] %}"\
    "{% else %}"\
        "{{ '{system_prompt}' + eos_token }}"\
        "{% set loop_messages = messages %}"\
    "{% endif %}"\
    "{% for message in loop_messages %}"\
        "{% if message['role'] == 'user' %}"\
            "{{ message['content'] }}"\
        "{% elif message['role'] == 'assistant' %}"\
            "{{ message['content'] + eos_token }}"\
        "{% endif %}"\
    "{% endfor %}"\
    "{% if add_generation_prompt %}{{ '{reasoning_start}' }}"\
    "{% endif %}"

chat_template = chat_template\
    .replace("'{system_prompt}'",   f"'{system_prompt}'")\
    .replace("'{reasoning_start}'", f"'{reasoning_start}'")
tokenizer.chat_template = chat_template

# Load & Preprocess Dataset

**SFT PRE-FINE-TUNING**

The SFT pre-fine-tuning stage serves as an initialization step that aligns a pretrained LLM with the structure and style of the task (e.g., reasoning + code), which is intended to make LoRA adaptation and the downstream reinforcement learning more effective.

This code loads the split_0 portion of the OpenCodeReasoning dataset from Hugging Face using streaming mode.
Streaming allows you to iterate through the dataset without downloading the entire file locally, which is useful for large-scale data.
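The benefit of streaming is that only as many records are pulled as are needed to fill the sample. The sketch below illustrates this with a toy generator standing in for the remote dataset; `take_matching` and `toy_stream` are hypothetical helpers, and a whitespace token count replaces the real tokenizer.

```python
from itertools import islice

def take_matching(stream, keep, limit):
    """Return the first `limit` items of a lazy stream for which keep(item) is true."""
    return list(islice(filter(keep, stream), limit))

pulled = []  # records which items were actually read from the stream

def toy_stream(n_records):
    """Hypothetical stand-in for a streamed dataset: yields records lazily."""
    for i in range(n_records):
        pulled.append(i)
        yield {"id": i, "output": "tok " * (i % 5)}

# Keep the first 4 records whose "reasoning" has fewer than 3 tokens
short = take_matching(
    toy_stream(1_000_000),
    keep=lambda ex: len(ex["output"].split()) < 3,
    limit=4,
)
print([ex["id"] for ex in short], len(pulled))  # [0, 1, 2, 5] 6
```

Although the toy stream could produce a million records, only six are read before the quota is reached, which is the same early-exit pattern the collection loop in this notebook relies on.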

In [None]:
from datasets import load_dataset
import pandas as pd
import numpy as np

# split_0 is the portion of the dataset containing reasoning steps and public information
config_name = "split_0"

# Load the dataset in streaming mode (otherwise 700K+ samples would be downloaded)
dataset = load_dataset("nvidia/OpenCodeReasoning", config_name, split="split_0", streaming=True)



This step serves as a preprocessing phase before any fine-tuning or evaluation. Specifically this **dataset** will be used for Supervised Fine-Tuning (SFT).

In [None]:
# We will collect the first 80 samples with a reasoning length < 2000 tokens
data_list = []
counter_accepted = 0

for example in dataset:
    # Extract the output text
    output_text = example.get("output")
    if output_text is None:
        continue

    # Tokenize the reasoning
    tokens = tokenizer.encode(output_text, add_special_tokens=False)

    # Accept if the reasoning is short enough (< 2000 tokens)
    if len(tokens) < 2000:

        data_list.append({
            "output":   output_text,
            "input":    example.get("input"),
            "solution": example.get("solution")
        })
        counter_accepted += 1

    # Stop after collecting 80 valid samples
    if counter_accepted >= 80:
        break

# Convert the collected data into a pandas DataFrame
dataset = pd.DataFrame(data_list)

dataset.head()



In [None]:
# Check the data shape
dataset.shape

(80, 3)

# Format Dataset for SFT

This function reformats a dataset sample into a structured conversation format (system-user-assistant) suitable for fine-tuning.

In [None]:
def format_dataset(x):
    solution = x["solution"]
    problem = x["input"]

    # Remove markers <think> and </think> from the reasoning
    thoughts = x["output"]
    thoughts = thoughts.replace("<think>", "").replace("</think>", "")

    # Format the reasoning
    thoughts = thoughts.strip()

    # Add our custom formatting
    final_prompt = \
        reasoning_start + thoughts + reasoning_end + \
        solution_start + solution + solution_end
    return [
        {"role": "system",    "content": system_prompt}, # Instruction provided to the model to try solving problems
        {"role": "user",      "content": problem},       # The problem prompt taken from the dataset
        {"role": "assistant", "content": final_prompt},  # Reasoning and solution output from the dataset
    ]

In [None]:
# Apply the formatting function to each row and store the result in a new 'Messages' column
# This column contains the formatted system-user-assistant chat used for model training or inference
dataset["Messages"] = dataset.apply(format_dataset, axis = 1)

In [None]:
# Check the column names of the dataset
dataset.columns

Index(['output', 'input', 'solution', 'Messages'], dtype='object')

In [None]:
# Display the formatted conversation (system, user, assistant) for the first sample
dataset["Messages"][0]

In [None]:
# Compute the number of tokens in each formatted conversation (apply_chat_template tokenizes by default),
# and store the result in a new column 'N'
dataset["N"] = dataset["Messages"].apply(lambda x: len(tokenizer.apply_chat_template(x)))

In [None]:
# Keep only samples with N ≤ 2000 to limit input size
dataset = dataset.loc[dataset["N"] <= 2000].copy()

In [None]:
# Check the dataset shape
dataset.shape

(52, 5)

**Data Formatting and Preprocessing for Supervised Fine-Tuning (SFT)**

In this phase, we preprocess and format raw dataset samples into structured prompt–response pairs suitable for instruction tuning. Each example is transformed into a multi-turn conversation format (system, user, assistant) and rendered as plain text using a tokenizer’s chat template.

We apply filtering based on input length (e.g., max token count) to ensure compatibility with the model’s context window. The processed data is then converted into a Hugging Face Dataset object, making it ready for SFT with Transformer-based models.

This preparation step is crucial for aligning the model’s behavior with task-specific reasoning patterns and output structure.

In [None]:
from datasets import Dataset

# Apply the tokenizer's chat template to generate full formatted text (without tokenizing)
dataset["text"] = tokenizer.apply_chat_template(dataset["Messages"].values.tolist(), tokenize = False)

# Convert the DataFrame into a Hugging Face Dataset object
dataset = Dataset.from_pandas(dataset)

In [None]:
# Show a sample formatted input prompt
dataset["text"][13]

# Supervised Fine-Tuning (SFT)

**SFT** - In this stage, we fine-tune the language model on the previously formatted dataset using **Supervised Fine-Tuning (SFT)**. We leverage the SFTTrainer from the TRL (Transformer Reinforcement Learning) library, which simplifies the training loop and integrates seamlessly with Hugging Face models and datasets.

The dataset contains structured conversational samples in plain text format (system, user, assistant) stored in the text field. Each example encourages the model to perform task-specific reasoning followed by code generation or response formulation.

This fine-tuning step aligns the base model with the specific structure and reasoning style of the task, as defined in the dataset. It allows the model to learn from high-quality demonstrations (reasoning + solution) and **is intended to provide a better starting point for downstream evaluations or reinforcement learning stages**.
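As a sanity check on the training configuration, the number of optimizer steps can be estimated from the values used in this notebook (52 filtered examples, per-device batch size 1, gradient accumulation 2, 2 epochs). The helper below is a hypothetical sketch; how a trainer rounds a partial final batch can differ between library versions, but it does not matter here because 52 divides evenly.

```python
import math

def total_optimizer_steps(num_examples, per_device_batch, grad_accum,
                          epochs, num_devices=1):
    """Estimate optimizer steps: one step per effective batch, per epoch."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs

print(total_optimizer_steps(52, per_device_batch=1, grad_accum=2, epochs=2))  # 52
```

The result matches the "Total steps = 52" line printed by the trainer: an effective batch of 2 gives 26 steps per epoch, and two epochs give 52.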

In [None]:
from trl import SFTTrainer, SFTConfig

# Initialize the SFTTrainer to fine-tune the Qwen2.5-3B model with LoRA adapters.
# The model passed here was previously loaded via FastLanguageModel.from_pretrained and wrapped with LoRA using get_peft_model.
# This allows us to fine-tune only a small number of trainable parameters (efficient PEFT),
# while keeping the rest of the model frozen — reducing memory usage and accelerating training.
# We use SFTTrainer because it makes it easy to fine-tune Hugging Face models — especially when using LoRA.


trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    args = SFTConfig(
        dataset_text_field = "text", # column holding the rendered chat text created above
        per_device_train_batch_size = 1,
        gradient_accumulation_steps = 2,
        warmup_steps = 5,
        num_train_epochs = 2, # Two full passes over the training set
        learning_rate = 2e-4, # Reduce to 2e-5 for long training runs
        logging_steps = 5,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        report_to = "none",
    ),
)

num_proc must be <= 52. Reducing num_proc to 52 for dataset of size 52.


Unsloth: Tokenizing ["text"] (num_proc=52):   0%|          | 0/52 [00:00<?, ? examples/s]

In [None]:
# Start the supervised fine-tuning process using the configured trainer
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 52 | Num Epochs = 2 | Total steps = 52
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 2
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 2 x 1) = 2
 "-____-"     Trainable parameters = 59,867,136/3,145,805,824 (1.90% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
5,1.1544
10,1.0224
15,0.8473
20,0.8501
25,0.8705
30,0.6754
35,0.5996
40,0.656
45,0.626
50,0.7322


TrainOutput(global_step=52, training_loss=0.8020016940740439, metrics={'train_runtime': 118.8149, 'train_samples_per_second': 0.875, 'train_steps_per_second': 0.438, 'total_flos': 2730915716333568.0, 'train_loss': 0.8020016940740439})

In [None]:
# Preview the first fully formatted training example (used in 'text' field for SFT)
dataset["text"][0]

In [None]:
# Inspect the token count of each formatted conversation (column 'N' computed during preprocessing)
# Useful for validating prompt length against the 2000-token cutoff
dataset['N']

In [None]:
# Save the fine-tuned model to a variable for later use
ft_model = trainer.model

After fine-tuning the model, we proceed to **test its generation capabilities**. To simulate a real inference scenario, we **construct a prompt using only the system instruction and the problem description** — that is, the first two elements from the Messages field (the system and user roles). The ground-truth solution from the dataset is intentionally excluded, allowing the model to reason and generate the answer autonomously.

We apply the **same chat template used during training**, this time with add_generation_prompt=True. This ensures that the input ends with the reasoning start marker, which was defined earlier in the system prompt. This marker is crucial, as it signals the model to begin generating its reasoning process, following the structure it learned during supervised fine-tuning.

For generation, we use the fine-tuned model with **greedy decoding** (temperature=0) to produce the most likely output. A TextStreamer is used to display the output in real time as the model generates it.
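Greedy decoding simply appends the highest-scoring next token at every step until an end-of-sequence token or the length limit is reached. The toy sketch below illustrates the idea with a hand-written lookup table in place of a language model; `greedy_decode`, `toy_scores` and the table itself are hypothetical and unrelated to the transformers implementation.

```python
def greedy_decode(next_token_scores, start, eos, max_new_tokens):
    """Repeatedly append the highest-scoring next token (no sampling)."""
    tokens = list(start)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)      # dict: token -> score
        best = max(scores, key=scores.get)      # argmax over the vocabulary
        tokens.append(best)
        if best == eos:
            break
    return tokens

# Toy "model": the score table depends only on the last token
TOY_TABLE = {
    "<think>":  {"plan": 0.7, "guess": 0.3},
    "plan":     {"</think>": 0.9, "plan": 0.1},
    "</think>": {"<eos>": 1.0},
}

def toy_scores(tokens):
    return TOY_TABLE[tokens[-1]]

print(greedy_decode(toy_scores, ["<think>"], eos="<eos>", max_new_tokens=10))
# ['<think>', 'plan', '</think>', '<eos>']
```

Because no randomness is involved, the same prompt always yields the same output, which makes greedy decoding convenient for spot-checking a fine-tuned model.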

In [None]:
from transformers import TextStreamer

# Construct a prompt for inference using only the system instruction and the problem statement (without the solution)
# This simulates the real use case where the model must reason and generate an answer
text = tokenizer.apply_chat_template(
    dataset[7]["Messages"][:2],  # First two message blocks: system prompt + user problem (no assistant solution)
    tokenize = False,
    add_generation_prompt = True,   # Required to append the reasoning start marker (e.g., <think>)
                                    # This aligns with the system prompt template defined earlier (reasoning_start tag)
)


# Generate a prediction using the fine-tuned model
_ = ft_model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    temperature = 0, # For greedy decoding (most likely tokens)
    max_new_tokens = 4096, # Allows for long reasoning chains
    streamer = TextStreamer(tokenizer, skip_prompt = False), # TextStreamer provides real-time output printing during generation
)

In [None]:
# Free up unused GPU and CPU memory before starting a new data processing session
import torch
torch.cuda.empty_cache()  # Clears cached GPU memory in PyTorch

import gc
gc.collect()              # Triggers Python garbage collection for unused CPU memory


493

Before reloading and reprocessing data, we **explicitly clear the GPU and Python memory to avoid residual usage from previous steps**. This is especially important when working with large models or datasets, as it helps prevent out-of-memory errors.

We then stream a subset of the OpenCodeReasoning dataset again. As written, the cell below **keeps every sample whose reasoning trace is shorter than 3000 tokens**, so that prompts and completions stay within the model’s context window. It does not filter on the difficulty label, so restricting the set to "HARD" or "VERY_HARD" problems would require an additional check.

Once we collect 250 suitable samples, we convert the filtered list into a datasets.Dataset object. Finally, **we adapt each entry into a format suitable for GRPO fine-tuning**, separating the prompt (system + user messages) from the answer (ground-truth solution), enabling step-by-step reward modeling and reinforcement learning on reasoning-heavy tasks.

In [None]:
from datasets import load_dataset
from datasets import Dataset
import pandas as pd

# Load the OpenCodeReasoning dataset in streaming mode (efficient for large datasets)
dataset = load_dataset("nvidia/OpenCodeReasoning", "split_0", split="split_0", streaming=True)

data_list = []
counter_accepted = 0

# Iterate over the streaming dataset and filter examples by reasoning length
for example in dataset:
    output_text = example.get("output")
    if output_text is None:
        continue

    # Tokenize the reasoning output
    tokens = tokenizer.encode(output_text, add_special_tokens=False)

    # Accept examples whose reasoning is short enough (< 3000 tokens)
    if len(tokens) < 3000:
        data_list.append({
            "output":   output_text,
            "input":    example.get("input"),
            "solution": example.get("solution")
        })
        counter_accepted += 1

    # Stop after collecting 250 valid examples
    if counter_accepted >= 250:
        break

# Convert the filtered list to a Hugging Face Dataset object
dataset = Dataset.from_list(data_list)

# Adapt the dataset structure for GRPO-style fine-tuning or reward modeling
# Format each sample as a conversation with a prompt (system + user) and a separate answer field
dataset = dataset.map(lambda x: {
    "prompt": [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": x["input"]},
    ],
    "answer": x["solution"],
})


HTTP Error 504 thrown while requesting GET https://huggingface.co/datasets/nvidia/OpenCodeReasoning/resolve/20a1ca19c0d050fe9057fc08339d6b370ec1c67a/split_0/train-00000-of-00030.parquet
Retrying in 1s [Retry 1/5].


Map:   0%|          | 0/250 [00:00<?, ? examples/s]

In [None]:
# Convert the Hugging Face Dataset object to a pandas DataFrame for easier inspection or manipulation
dataset_pd = pd.DataFrame(dataset)

# Display the solution field of the first example (used as the ground-truth answer)
dataset_pd['solution'][0]

'T = int(input())\nfor _ in range(T):\n    s = input().strip()\n    stack = []\n    valid = True\n    for c in s:\n        if c == \'(\':\n            stack.append(c)\n        elif c == \')\':\n            if not stack:\n                valid = False\n                break\n            stack.pop()\n    if valid and not stack:\n        print("YES")\n    else:\n        print("NO")'

# Structural Reward Functions

The extract_code function **extracts Python code** from a text that contains reasoning and a solution, such as a dataset output or a response generated by a language model. The call extract_code(dataset[0]["output"]) applies it to the first sample in the dataset, isolating the Python code segment from the full reasoning trace. This is useful for evaluating or executing code separately from the accompanying reasoning or commentary.

In [None]:
import re

def extract_code(generated_text: str) -> str | None:
    """Extracts Python code block, supporting both ```python``` and <python>...</python> formats."""
    # Match ```python ... ```
    code_block_match = re.search(r"```python\n(.*?)\n```", generated_text, re.DOTALL)
    if code_block_match:
        return code_block_match.group(1).strip()

    # Match <python> ... </python>
    xml_block_match = re.search(r"<python>(.*?)</python>", generated_text, re.DOTALL)
    if xml_block_match:
        return xml_block_match.group(1).strip()

    # Fallback: try using full text if it looks like Python
    print("Warning: Could not find code block. Assuming entire output is code.")
    if re.search(r"^\s*(def |class |import |from )", generated_text, re.MULTILINE):
        return generated_text.strip()

    print("Warning: Fallback code doesn't look like Python definition/import. Skipping execution.")
    return None

# Isolate the Python code
extract_code(dataset[0]["output"])

'T = int(input())\nfor _ in range(T):\n    s = input().strip()\n    stack = []\n    valid = True\n    for c in s:\n        if c == \'(\':\n            stack.append(c)\n        elif c == \')\':\n            if not stack:\n                valid = False\n                break\n            stack.pop()\n    if valid and not stack:\n        print("YES")\n    else:\n        print("NO")'

**Format Validation (Structural Reward)**

Before using Gemini to assign content-based reward scores to model outputs, we check whether the model's response follows the expected structure.

This ensures that the model outputs answers in a consistent, structured way, which is essential for downstream parsing, readability, or execution.

In [None]:
# Define a regex to optionally match closing code block with optional EOS (End Of Sequence) token
solution_end_regex = r"```[\s]{0,}" + \
    "(?:" + re.escape(tokenizer.eos_token) + ")?"

# Compile regex to extract the solution code between reasoning_end and code block
# Supports optional EOS token and whitespace at the end
match_format = re.compile(
    rf"{reasoning_end}.*?"\
    rf"{solution_start}(.+?){solution_end_regex}"\
    rf"[\s]{{0,}}$",
    flags = re.MULTILINE | re.DOTALL
)
match_format

re.compile(r'</think>.*?```python(.+?)```[\s]{0,}(?:<\|endoftext\|>)?[\s]{0,}$',
           re.MULTILINE|re.DOTALL|re.UNICODE)

In [None]:
# Test examples: does the pattern correctly extract code after </think> and ```python ...```?
match_format.findall(
    "Let me think!</think>"\
    f"```python a=input()```",
)

[' a=input()']

In [None]:
# Same check with an opening <think> tag, padded code and trailing newlines
match_format.findall(
    "<think>Let me think!</think>"\
    f"```python  a=input()  ```\n\n",
)

['  a=input()  ']

We compute structure-based rewards (exact or approximate) by checking the presence and position of formatting markers. This helps ensure that model outputs remain machine-readable and consistent.

In [None]:
# Reward function component: exact match score
# Gives a fixed score if the format is perfectly correct (reasoning + solution blocks)
def match_format_exactly(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]
        # Match if format is seen exactly!
        if match_format.search(response) is not None: score += 1.5
        scores.append(score)
    return scores

If the format is not perfectly matched, we apply a softer check (match_format_approximately) that gives partial credit.

In [None]:
# Approximate structural validation
# Gives partial score based on presence of key format markers in the model's output
def match_format_approximately(completions, **kwargs):
    scores = []
    for completion in completions:
        score = 0
        response = completion[0]["content"]

        # Reward presence of each structural marker; penalize if missing
        # Note: we skip reasoning_start ("<think>") since it's always prepended
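        score += 0.25 if response.count(reasoning_end)   == 1 else -0.5
        score += 0.25 if response.count(solution_start)  == 1 else -0.5
        # solution_end ("```") is a substring of solution_start ("```python"),
        # so subtract the opening fences to count only the closing fences
        closing_fences = response.count(solution_end) - response.count(solution_start)
        score += 0.25 if closing_fences == 1 else -0.5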
        scores.append(score)
    return scores
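One subtlety when counting markers is that the closing fence is a substring of the opening fence, so a well-formed response contains the three backticks twice. The self-contained sketch below makes this visible; the upper-case constants and the `marker_counts` helper are hypothetical names chosen to avoid clashing with the notebook's variables.

```python
REASONING_END  = "</think>"
SOLUTION_START = "```python"
SOLUTION_END   = "```"

def marker_counts(response: str) -> dict:
    """Count structural markers, separating opening from closing fences."""
    opening = response.count(SOLUTION_START)
    all_fences = response.count(SOLUTION_END)   # includes the opening fences
    return {
        "reasoning_end": response.count(REASONING_END),
        "opening_fence": opening,
        "closing_fence": all_fences - opening,
    }

well_formed = "Plan the loop.</think>```python\nprint(1)\n```"
unclosed    = "Plan the loop.</think>```python\nprint(1)\n"

print(well_formed.count(SOLUTION_END))  # 2
print(marker_counts(well_formed))
print(marker_counts(unclosed))
```

For the well-formed response every count is 1, while the unclosed response reports 0 closing fences; a raw count of the three-backtick string alone would report 2 and 1 respectively.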

In [None]:
match_format.search(dataset['output'][0])

<re.Match object; span=(9233, 9617), match='</think>\n\n```python\nT = int(input())\nfor _ in>

**Model-Generated Reward (Semantic Evaluation)**

We use the evaluate_answer() function to compute semantic reward scores by sending the candidate reasoning and code, alongside the reference solution and problem description, to Gemini. Gemini returns a numerical score between -3 and 3, which we extract using regex. This score reflects how well the model has reasoned and solved the task compared to the ground truth.

This allows us to use an LLM as a reward model, **evaluating both code quality and reasoning accuracy**.

The reward model is based on the paper ["Applying RLAIF for Code Generation with API-usage in Lightweight LLMs"](https://machinelearning.apple.com/research/applying-rlaif).
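The score-extraction step can be examined in isolation, without calling Gemini. The sketch below is a minimal stand-alone version under the assumption that the judge's reply is a short free-text string; `parse_score` is a hypothetical helper that takes the first number found, clamps it to [-3, 3], and falls back to 0 when no number is present.

```python
import re

NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?")

def parse_score(reply: str, low: float = -3.0, high: float = 3.0,
                default: float = 0.0) -> float:
    """Extract the first number in the reply and clamp it to [low, high]."""
    match = NUMBER_RE.search(reply)
    if match is None:
        return default              # neutral reward when nothing is parseable
    return max(low, min(high, float(match.group(0))))

for reply in ["2", "Score: -7", "I cannot judge this."]:
    print(repr(reply), "->", parse_score(reply))
```

Because only the first number is used, a reply that mentions another number before the score (for example a problem constraint) would be misread, which is one reason the scoring prompt asks the judge to respond with the score and nothing else.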

In [None]:
import os
import google.generativeai as genai

# Initialize Gemini model used as the reward function
# This model will evaluate completions based on reasoning and code quality
os.environ["GOOGLE_API_KEY"] = "YOURKEY" # Placeholder; loading the key from a Colab secret is advised
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model_gem = genai.GenerativeModel("models/gemini-2.0-flash")

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Compile a regex to extract numeric scores from Gemini output text
_NUMBER_RE = re.compile(r"(-?\d+(\.\d+)?)")

# prompts: list of message lists (system + user), taken from the dataset's "prompt" column
# completions: model-generated responses (e.g., from the fine-tuned policy model)
# answer: reference (ground-truth) solutions, taken from the dataset's "answer" column

def evaluate_answer(prompts, completions, answer, **kwargs):
    """
    Computes reward scores for model-generated answers using Gemini as the evaluator.
    Each answer is evaluated on:
    - reasoning quality
    - code correctness
    - alignment with the dataset solution

    Returns a score in [-3, 3], where:
    -3 = completely wrong or hallucinated
     3 = perfect step-by-step reasoning and correct solution
    """

    # Build the instruction for Reward LLM
    SCORING_PROMPT = (
        "You are a code expert. You will be given a competitive programming problem, "
        "the candidate's internal reasoning between <think>…</think>, the Python solution "
        "between ```python ... ``` and the official reference solution from the dataset.\n\n"
        "Evaluate the following:\n"
        " - correctness of the reasoning steps\n"
        " - completeness and accuracy of the code\n"
        " - adherence to step-by-step methodology\n"
        " - whether the code uses the correct functions/APIs\n"
        " - whether the code is free of bugs and code smells\n"
        " - whether the code is sufficient to accomplish the task\n"
        " - whether the code uses quotes in string literals correctly\n"
        " - whether the code contains duplicate parameters in functions\n\n"
        "You must also compare the candidate’s solution with the official reference solution from the dataset.\n\n"
        "Assign an integer score from -3 to 3 where:\n"
        " -3 = completely incorrect reasoning and/or code\n"
        " 3 = perfect step-by-step reasoning and a flawless solution\n\n"
        "Respond only with the score and nothing else.\n\n"
        "Reminder: If the candidate’s internal reasoning exceeds 8000 tokens, "
        "they should have admitted inability to continue; penalize that accordingly."
    )

    scores = []
    for prompt, completion_list, ans in zip(prompts, completions, answer):
        problem = prompt[1]["content"]  # user input (problem description)
        candidate = completion_list[0]["content"]  # model's generated answer

        # Build the full input prompt for the reward model
        inp = (
            f"{SCORING_PROMPT}\n\n"
            f"Problem:\n{problem}\n\n"
            f"Dataset answer: \n{ans}\n\n"
            f"Candidate answer:\n{candidate}\n\n"
            "Score:"
        )
        # Generate reward score from Gemini
        resp = model_gem.generate_content(inp).text.strip()
        m = _NUMBER_RE.search(resp)

        # Extract and normalize the score
        if m:
            score = float(m.group(1))  # Extract numeric score
            score = max(-3.0, min(3.0, score))  # Clamp to valid range
        else:
            score = 0.0  # Default fallback if no score found
        scores.append(score)

    return scores
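
The score-parsing step above can be exercised on its own, without calling Gemini. The sketch below isolates it in a helper named `parse_score`, which is hypothetical and not part of the original notebook. It assumes the same regex, the same clamping range of [-3, 3], and the same fallback of 0.0 when no number is found.

```python
import re

# Same pattern as in the reward function: an optional minus sign,
# digits, and an optional decimal part.
_NUMBER_RE = re.compile(r"(-?\d+(\.\d+)?)")


def parse_score(text, low=-3.0, high=3.0, default=0.0):
    """Extract the first number in `text` and clamp it to [low, high].

    Hypothetical helper mirroring the parsing logic of evaluate_answer.
    """
    match = _NUMBER_RE.search(text)
    if match is None:
        return default  # the evaluator returned no usable number
    return max(low, min(high, float(match.group(1))))


print(parse_score("2"))          # well-formed reply
print(parse_score("Score: -7"))  # out-of-range value gets clamped
print(parse_score("no score"))   # fallback when nothing is parsed
```

Only the first number in the reply is used, so a reply such as "3/3" would be read as 3.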

This section performs **length-based filtering** of the dataset to remove outliers and improve training efficiency. First, each sample is tokenized using the same chat_template structure used for inference and fine-tuning. Then, the number of tokens for each prompt is calculated and stored.

To avoid issues caused by overly long examples (which can exceed the model's context window or slow down training), the **code computes the 90th percentile of token lengths**. Only samples at or below this threshold are retained, which filters out roughly the longest 10% of inputs.

This step **ensures that the final training dataset remains efficient**, memory-friendly, and consistent in input size, which is especially important when training with limited GPU resources or long-context models.
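
To make the percentile rule concrete, here is a minimal sketch on a toy list of token lengths. The numbers and the helper `quantile` are made up for illustration. The helper uses linear interpolation between neighbouring sorted values, which is the rule NumPy's `np.quantile` applies by default.

```python
def quantile(values, q):
    """Quantile with linear interpolation between sorted neighbours."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)          # fractional index into the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + (ordered[hi] - ordered[lo]) * frac


# Toy token lengths (illustrative only): one prompt is far longer than the rest.
lengths = [120, 95, 300, 110, 2500, 130, 140, 105, 98, 150, 125]

threshold = int(quantile(lengths, 0.9))
kept = [i for i, n in enumerate(lengths) if n <= threshold]

print("threshold:", threshold)
print("kept indices:", kept)
```

In this toy case only the 2500-token outlier is dropped, while the threshold stays close to the typical prompt length.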

In [None]:
# Tokenize the dataset using the chat template (includes system + user, and adds generation marker)
# This will generate input token IDs for each prompt
tokenized = dataset.map(
    lambda x: {"tokens": tokenizer.apply_chat_template(
        x["prompt"], add_generation_prompt=True, tokenize=True)},
    batched=True,
)

# Decode the first tokenized example to inspect the final input string (for debugging/validation)
print(tokenizer.decode(tokenized[0]["tokens"]))

# Add a new column 'L' that stores the length (number of tokens) for each example
tokenized = tokenized.map(lambda x: {"L": len(x["tokens"])})

# Compute the 90th percentile of token lengths — this is used as a filtering threshold
import numpy as np
maximum_length = int(np.quantile(tokenized["L"], 0.9))
print("Max Length = ", maximum_length)

# Keep only the samples whose token length is below or equal to the 90th percentile
# This filters out the longest 10% of samples to avoid extreme lengths that may break context limits
dataset = dataset.select(np.where(np.array(tokenized["L"]) <= maximum_length)[0])

# Clean up memory by deleting the temporary tokenized dataset
del tokenized

In [None]:
print(dataset.column_names)
print(dataset[0])

['output', 'input', 'solution', 'prompt', 'answer']
{'output': '<think>\nOkay, I need to solve this problem where I have to check if the brackets in a string are balanced. Hmm, so the input is a string consisting of parentheses, and I have to determine if they\'re properly balanced. The example given has three test cases. Let\'s see.\n\nSo, the problem is similar to the classic valid parentheses problem. Right. The standard approach for that is using a stack. Let me think. For every opening bracket \'(\', we push it onto the stack. When we encounter a closing bracket \')\', we check if the stack is not empty and the top is an opening bracket. If so, we pop it. If at any point the closing bracket doesn\'t have a matching opening, it\'s invalid. At the end, the stack should be empty for it to be balanced.\n\nWait, but what about other characters? Oh, the problem says the string S is the one to be checked, but the examples only have parentheses. So maybe the input string only contains \'(

# GRPO Training Setup

After filtering and tokenizing the dataset, we compute maximum prompt and completion lengths to safely fit within the model’s context window. We then configure the GRPO training process, which **fine-tunes the model** using **multiple reward functions**:

- strict format matching,

- approximate format structure,

- and semantic evaluation via Gemini.

The **GRPOTrainer** samples multiple completions per prompt and updates the policy model to maximize average reward.
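
The "group" part of GRPO can be illustrated with a small sketch. Assuming the usual formulation, each completion's reward is compared with the other completions sampled for the same prompt: the group mean is subtracted and the result is divided by the group standard deviation. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for illustration; the trainer's internal implementation may differ in detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt.

    Illustrative sketch only, not the trainer's actual code.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical total rewards for num_generations = 4 completions of one prompt.
rewards = [3.0, -1.0, 1.0, -3.0]
advantages = group_relative_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:+.1f}  advantage={a:+.3f}")
```

Completions that score above the group mean get a positive advantage and are reinforced; those below the mean are discouraged. If all completions in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal.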

In [None]:
# Compute prompt and completion lengths based on the 90th percentile from previous filtering
max_prompt_length = maximum_length + 1  # +1 for safety margin
max_completion_length = max_seq_length - max_prompt_length


# Set sampling parameters for vLLM during generation (used within GRPO)
from vllm import SamplingParams


vllm_sampling_params = SamplingParams(
    min_p = 0.1,                  # Drop tokens whose probability is below 10% of the most likely token's
    top_p = 1.0,                  # 1.0 disables top-p (nucleus) filtering
    top_k = -1,                   # -1 disables top-k filtering
    seed = 3407,                  # For reproducibility
    stop = [tokenizer.eos_token], # Stop generation when the EOS token is produced
    include_stop_str_in_output = True,
)


from trl import GRPOConfig, GRPOTrainer

# Configure the GRPO trainer: a PPO-style method that replaces the learned value function with group-relative advantages
training_args = GRPOConfig(
    vllm_sampling_params = vllm_sampling_params,
    temperature = 0.6,                   # Sampling diversity
    learning_rate = 5e-5,                # Suitable for LoRA training
    weight_decay = 0.01,
    warmup_ratio = 0.1,
    lr_scheduler_type = "linear",
    optim = "adamw_8bit",                # Memory-efficient optimizer
    logging_steps = 1,
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1,     # Can be increased for larger batch sizes
    num_generations = 4,                 # Number of completions sampled per prompt
    max_prompt_length = max_prompt_length,
    max_completion_length = max_completion_length,
    max_steps = 200,                     # Short run; increase for full training
    save_steps = 100,
    report_to = "none",                  # Can enable Weights & Biases later
    output_dir = "outputs",              # Directory to save checkpoints
)

Unsloth: We now expect `per_device_train_batch_size` to be a multiple of `num_generations`.
We will change the batch size of 1 to the `num_generations` of 4


In [None]:
# Initialize the GRPO trainer with multiple reward functions:
# - match_format_exactly: reward if full output format matches
# - match_format_approximately: softer reward for partial format structure
# - evaluate_answer: semantic reward from Gemini (LLM-based scoring)
# new_dataset = dataset.train_test_split(test_size = 0.01)
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        match_format_exactly,
        match_format_approximately,
        evaluate_answer,

    ],
    args = training_args,
    train_dataset = dataset, # Full dataset; can split if evaluation is enabled

)

trainer.train() # Start training with GRPO (reward-based fine-tuning)

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 225 | Num Epochs = 1 | Total steps = 200
O^O/ \_/ \    Batch size per device = 4 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (4 x 1 x 1) = 4
 "-____-"     Trainable parameters = 59,867,136/3,145,805,824 (1.90% trained)


| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / match_format_exactly | rewards / match_format_approximately | rewards / evaluate_answer |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.005 | -2.875 | 2.046338 | 4560.5 | 0.125548 | 0.75 | -1.125 | -2.5 |
| 2 | 0.0084 | 0.6875 | 3.653623 | 2651.5 | 0.209572 | 1.125 | -0.9375 | 0.5 |
| 3 | 0.012 | -1.25 | 0.816497 | 988.75 | 0.299481 | 1.5 | -0.75 | -2.0 |
| 4 | 0.0094 | -1.5 | 0.957427 | 1427.75 | 0.235765 | 1.5 | -0.75 | -2.25 |
| 5 | 0.0134 | 2.5 | 1.892969 | 858.75 | 0.333861 | 1.5 | -0.75 | 1.75 |
| 6 | 0.016 | 3.25 | 0.57735 | 590.25 | 0.399651 | 1.5 | -0.75 | 2.5 |
| 7 | 0.0052 | -2.5625 | 1.375 | 4020.0 | 0.131 | 1.125 | -0.9375 | -2.75 |
| 8 | 0.0048 | -2.875 | 2.046338 | 4853.25 | 0.118931 | 0.75 | -1.125 | -2.5 |
| 9 | 0.0082 | -1.25 | 1.154701 | 1800.75 | 0.205388 | 1.5 | -0.75 | -2.0 |
| 10 | 0.0089 | -1.0 | 0.957427 | 1542.75 | 0.223277 | 1.5 | -0.75 | -1.75 |
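
In the log above, the `reward` column equals the sum of the three per-function reward columns at every step shown. The short check below reproduces that arithmetic from the logged values. The tuple layout and variable names are only for illustration.

```python
# (reward, match_format_exactly, match_format_approximately, evaluate_answer)
# copied from the ten logged steps above.
rows = [
    (-2.875, 0.75, -1.125, -2.5),
    (0.6875, 1.125, -0.9375, 0.5),
    (-1.25, 1.5, -0.75, -2.0),
    (-1.5, 1.5, -0.75, -2.25),
    (2.5, 1.5, -0.75, 1.75),
    (3.25, 1.5, -0.75, 2.5),
    (-2.5625, 1.125, -0.9375, -2.75),
    (-2.875, 0.75, -1.125, -2.5),
    (-1.25, 1.5, -0.75, -2.0),
    (-1.0, 1.5, -0.75, -1.75),
]

# Compare the logged total with the sum of its three components.
consistent = [abs(total - (a + b + c)) < 1e-9 for total, a, b, c in rows]
print(all(consistent))
```

Reading the components separately is useful here: in these steps the format rewards stay in a narrow band, so most of the variation in the total comes from the Gemini-based `evaluate_answer` score.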


Unexpected exception formatting exception. Falling back to standard exception


# Inference (Before and After GRPO)

Now let's try the model we just trained. First, let's try the model without any GRPO training:

In [None]:
# Show that this gets stuck in an endless loop (the model before fine-tuning does not reason well)
# Simple inference example using vLLM fast_generate interface (no reward model involved)
text = "Solve the famous twosum coding problem"

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 1024,
)

# Generate a response using the base model
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,  # No LoRA applied here
)[0].outputs[0].text



Processed prompts: 100%|█| 4/4 [00:06<00:00,  1.67s/it, est. speed input: 72.56 toks/s, output: 3503.09 toks/s]


In [None]:
output

After training, we **run inference** using the newly **trained LoRA** weights and optionally export the merged model.

In [None]:
from safetensors import safe_open

# Load and inspect the saved LoRA adapter weights to ensure they are not empty.
# Assumes the adapter was already saved to "grpo_trainer_lora_model".
with safe_open("grpo_trainer_lora_model/adapter_model.safetensors", framework="pt") as f:
    for key in f.keys():
        tensor = f.get_tensor(key)
        zero_fraction = (tensor == 0).sum() / tensor.numel()

        # Sanity check: ensure the tensor is not entirely zero-filled
        assert zero_fraction.item() != 1.0, f"Tensor {key} appears to be empty or uninitialized"


# LoRA Verification and Export


Now we load the LoRA and test:

In [None]:
from vllm import SamplingParams

# Construct a full prompt using system + user (structured via chat template)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Solve the famous twosum coding problem"},
]

text = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt = True,  # Required to trigger generation after user input
    tokenize = False,
)

# Define sampling parameters for a longer response
sampling_params = SamplingParams(
    temperature = 1.0,
    top_k = 50,
    max_tokens = 2048,
)

# Generate using the trained GRPO LoRA weights
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_trainer_lora_model"),
)[0].outputs[0].text


Processed prompts: 100%|███| 1/1 [00:02<00:00,  2.54s/it, est. speed input: 19.72 toks/s, output: 75.72 toks/s]


In [None]:
output

'First, we should read the problem and understand the goal: given a list of numbers and a target number, we need to find two numbers in the list that add up to the target. Then, we can solve this problem by using a hash table to store the numbers and their indices. We start with the first number in the list, and for each subsequent number, we check if the target minus the current number is in the hash table. If it is, then we have found the two numbers that add up to the target. If not, we add the current number to the hash table.</think>```pythondef twosum(nums, target):    hash_table = {}    for i, num in enumerate(nums):        if target - num in hash_table:            return [hash_table[target-num], i]        hash_table[num] = i    return None\n\ntwosum([4,1,3,7], 6)'
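
For readability, the function embedded in the generated string above can be laid out as follows. The line breaks and indentation are restored by hand, which is an assumption about the intended layout; the logic is unchanged. The second call uses a made-up input to show a case where a valid pair exists.

```python
def twosum(nums, target):
    hash_table = {}  # maps each number seen so far to its index
    for i, num in enumerate(nums):
        if target - num in hash_table:
            return [hash_table[target - num], i]
        hash_table[num] = i
    return None  # no pair adds up to the target


# The call from the generated answer: no two elements of the list sum to 6.
print(twosum([4, 1, 3, 7], 6))

# Illustrative input (not from the model output) where a pair exists.
print(twosum([2, 7, 11, 15], 9))
```

The model's own example call returns `None`, since no pair in `[4, 1, 3, 7]` adds up to 6, so it does not actually demonstrate a successful match.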