<div class="align-center">
<a href="https://nvidia.com/"><img src="https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/01-nvidia-logo-horiz-500x200-2c50-d@2x.png" width="115"></a>

    
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>

</div>

# Goal: Teach a model to play sudoku with GRPO using Unsloth and NeMo Gym

Our goal is to teach Qwen2.5-1.5B-Instruct to play sudoku using GRPO on a single GPU!

You will learn how to:
- configure an Unsloth-optimized model
- start a NeMo Gym resources server
- train using Unsloth and NeMo Gym
- test and save the trained model

To install NeMo Gym, follow the guide [here](https://docs.nvidia.com/nemo/gym/latest/get-started/setup-installation.html).

To install Unsloth on your local device, follow the guide [here](https://docs.unsloth.ai/get-started/install-and-update).


This notebook was developed on a single H100 GPU. If your GPU has less VRAM, adjust the configuration accordingly, for example by reducing the maximum output length, enabling quantization, or using parameter-efficient finetuning. Unsloth has many low-VRAM training examples that work with NeMo Gym verifiers!

# Load the model

In this example, we will do full finetuning, but Unsloth also supports optimized low-precision training (e.g. 4-bit or 8-bit) and parameter-efficient training methods (e.g. LoRA). Check out Unsloth's documentation if you are interested in these methods!

In [1]:
from unsloth import FastLanguageModel
import torch

model_name = "unsloth/Qwen2.5-1.5B-Instruct" 
max_seq_length = 4096 # Can increase for longer outputs, or decrease if running into OOM
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_name,
    max_seq_length = max_seq_length,
    load_in_4bit = False, # set to True for low precision training to save VRAM
    full_finetuning=True, # set to False for LoRA training
    offload_embedding = True, # Reduces VRAM a little
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
INFO 12-11 19:11:34 [__init__.py:216] Automatically detected platform cuda.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.12.1: Fast Qwen2 patching. Transformers: 4.57.3. vLLM: 0.10.2.
   \\   /|    NVIDIA H100 80GB HBM3. Num GPUs = 8. Max memory: 79.205 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu128. CUDA: 9.0. CUDA Toolkit: 12.8. Triton: 3.4.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.32.post1. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: Using bfloat16 full finetuning which cuts memory usage by 50%.
To enable float32 training, use `float32_mixed_precision = True` during FastLanguageModel.from_pretrained
Unsloth: Offloading embeddings to RAM to save 0.43 GB.
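
For a rough sense of what full finetuning costs in memory, the sketch below does a back-of-the-envelope estimate. It assumes 2 bytes per parameter for bfloat16 weights, 2 bytes for gradients, and roughly 2 bytes for the two 8-bit optimizer states (the training configuration in this notebook uses `adamw_8bit`). These byte counts are assumptions, and activations, generation buffers, and framework overhead are not included, so real usage will be higher. The parameter count is the one reported in this notebook's training log.

```python
# Back-of-the-envelope VRAM estimate for bf16 full finetuning.
# All byte counts are assumptions; activations and overhead are excluded.
NUM_PARAMS = 1_543_714_304  # trainable parameters reported in the training log

BYTES_PER_PARAM = {
    "weights (bf16)": 2,
    "gradients (bf16, assumed)": 2,
    "8-bit Adam states (2 x 1 byte, assumed)": 2,
}


def estimate_gib(num_params, bytes_per_param):
    """Return the estimated memory in GiB for each component."""
    return {name: num_params * b / 1024**3 for name, b in bytes_per_param.items()}


estimate = estimate_gib(NUM_PARAMS, BYTES_PER_PARAM)
for name, gib in estimate.items():
    print(f"{name}: {gib:.2f} GiB")
print(f"total (excluding activations): {sum(estimate.values()):.2f} GiB")
```

Halving any of these terms, for example through quantized weights or by training only LoRA adapters, is what makes lower-VRAM GPUs workable.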


If you want to try out LoRA, uncomment the code below, and make sure that `full_finetuning` is set to `False` above. LoRA is a parameter-efficient training method that reduces computational cost by only training a small percentage of the full model parameters.


In [2]:
# lora_rank = 4 # Larger rank = smarter, but slower
# model = FastLanguageModel.get_peft_model(
#     model,
#     r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
#     target_modules = [
#         "q_proj", "k_proj", "v_proj", "o_proj",
#         "gate_proj", "up_proj", "down_proj",
#     ],
#     lora_alpha = lora_rank*2, # *2 speeds up training
#     use_gradient_checkpointing = "unsloth", # Reduces memory usage
#     random_state = 42,
# )
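
To make "a small percentage" concrete: a LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds two matrices of shapes `r x d_in` and `d_out x r`, which is `r * (d_in + d_out)` trainable parameters. The sketch below counts them for the seven target modules listed above. The layer dimensions are assumptions based on the published Qwen2.5-1.5B configuration (hidden size 1536, MLP size 8960, 28 layers, key/value projection width 256), so check `model.config` before relying on them.

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters added by one rank-r LoRA adapter on a d_in -> d_out layer."""
    return r * (d_in + d_out)


# Assumed Qwen2.5-1.5B dimensions; verify against model.config.
HIDDEN, MLP, KV, LAYERS = 1536, 8960, 256, 28

# (d_in, d_out) for each target module in one transformer layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def total_lora_params(r):
    """Sum adapter parameters over all target modules and layers."""
    per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in TARGETS.values())
    return per_layer * LAYERS


FULL_PARAMS = 1_543_714_304  # full-model count from this notebook's training log
for rank in (4, 16, 64):
    n = total_lora_params(rank)
    print(f"rank {rank}: {n:,} trainable parameters ({100 * n / FULL_PARAMS:.2f}% of full)")
```

Under these assumptions the adapter size grows linearly with the rank, which is the trade-off behind the "larger rank" comment in the cell above.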

# NeMo Gym resources server setup

NeMo Gym resources servers provide tool implementations and the logic to process actions, update state, return observations, and calculate rewards for the actions taken.

First, start the reasoning_gym resources server in a terminal: 

```
cd ~/Gym
uv venv
source .venv/bin/activate
uv sync --active
ng_run "+config_paths=[resources_servers/reasoning_gym/configs/resources_only.yaml]"
```


You should see output similar to the following in the terminal:

```
All 1 / 1 servers ready! Polling every 60s

####################################################################################################
#
# Server Instances
#
####################################################################################################

[1] reasoning_gym (resources_servers/reasoning_gym)
{
    'process_name': 'reasoning_gym',
    'server_type': 'resources_servers',
    'name': 'reasoning_gym',
    'dir_path': (
        '/home/ubuntu/Gym/resources_servers/reasoning_gym'
    ),
    'entrypoint': 'app.py',
    'host': '127.0.0.1',
    'port': 19815,
    'pid': 801468,
    'config_path': 'reasoning_gym',
    'url': 'http://127.0.0.1:19815',
}
####################################################################################################
```

NeMo Gym starts a head server on port 11000 by default. The resources server port is selected at random from the available ports unless specified otherwise, so it will differ between runs. We can automatically look up the resources server port through the head server:

In [3]:
import yaml
import requests
from omegaconf import OmegaConf


# NeMo Gym head server is hosted on port 11000
head_port = 11000

# We launched the reasoning gym resources server in the previous step!
resources_server_name = "reasoning_gym"

# Retrieve the server config which contains the port that the resources server is hosted on
response = requests.get(f"http://127.0.0.1:{head_port}/global_config_dict_yaml", timeout=5)

# Extract the host ip and port of the resources server
global_config_dict = OmegaConf.create(yaml.safe_load(response.text))
config = global_config_dict[resources_server_name].resources_servers[resources_server_name]
verify_endpoint = f"http://{config.host}:{config.port}/verify"

verify_endpoint

'http://127.0.0.1:59203/verify'
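
The lookup in the cell above can be sketched on a plain dictionary, which is handy for checking the nesting without a running head server. The nesting (`config[name]["resources_servers"][name]`) mirrors the cell above; `build_verify_endpoint` is a hypothetical helper, and `sample_config` is a trimmed, made-up stand-in for the real global config, which contains more keys.

```python
def build_verify_endpoint(global_config, server_name):
    """Build the /verify URL for a resources server from a parsed global config."""
    try:
        server = global_config[server_name]["resources_servers"][server_name]
    except KeyError as missing:
        raise KeyError(f"resources server {server_name!r} not found in config") from missing
    return f"http://{server['host']}:{server['port']}/verify"


# Hypothetical, trimmed config; host and port copied from the terminal output above.
sample_config = {
    "reasoning_gym": {
        "resources_servers": {
            "reasoning_gym": {"host": "127.0.0.1", "port": 19815},
        }
    }
}

print(build_verify_endpoint(sample_config, "reasoning_gym"))
```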

# Dataset prep

The resources server we started is an integration of [reasoning gym](https://github.com/open-thought/reasoning-gym), which provides verifiers for more than 100 tasks across many domains, including but not limited to algebra, arithmetic, computation, cognition, geometry, graph theory, logic, and many common games. Reasoning gym also supports creating composite datasets of many tasks for multi-verifier training, and even implements a difficulty curriculum to enable continued learning.

Next, let's create and load the dataset. We can generate a mini sudoku dataset using the script in NeMo Gym:

```
cd ~/Gym

python resources_servers/reasoning_gym/scripts/create_dataset.py \
    --task mini_sudoku \
    --size 2000 \
    --seed 42 \
    --output resources_servers/reasoning_gym/data/train_mini_sudoku.jsonl
```



Now load the dataset!

In [4]:
import os
import json 
from datasets import Dataset

dataset_path = "~/Gym/resources_servers/reasoning_gym/data/train_mini_sudoku.jsonl"

train_data = []
max_length_seen = 0
with open(os.path.expanduser(dataset_path), 'r') as f:
    for line in f:
        data = json.loads(line)

        # extract the prompt from the NeMo Gym format
        task_prompt = data["responses_create_params"]["input"][0]["content"]
        
        train_data.append({
            "prompt": [{"role": "user", "content": task_prompt}],
            "answer": data["answer"],
            "metadata": data["metadata"],
        })
        
        prompt_length = len(tokenizer.apply_chat_template(
            [{"role": "user", "content": task_prompt}],
            add_generation_prompt=True
        ))
        max_length_seen = max(max_length_seen, prompt_length)

print(f"Loaded {len(train_data)} examples!\n\n")
print(f"Example prompt:\n\n{train_data[0]['prompt'][0]['content']}")
train_dataset = Dataset.from_list(train_data)

Loaded 2000 examples!


Example prompt:

In 4x4 Mini Sudoku:
- Each row must contain each number from 1-4 exactly once
- Each column must contain each number 1-4 exactly once
- Each 2x2 subgrid must contain each number 1-4 exactly once
Solve this 4x4 Mini Sudoku puzzle:
4 _ _ _
_ 3 _ _
_ 1 3 _
_ _ _ _
Format your response as the puzzle above, with spaces separating each number within a row, and newlines separating rows.
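
The loader above reads three fields from each JSONL line. The sketch below shows that record shape on a single made-up record. Only the key names come from the loader; the values, especially `answer` and `metadata`, are placeholders, since their real format is determined by the dataset script. `to_training_row` is a hypothetical helper name.

```python
import io
import json

# Hypothetical record: key names follow the loader above, values are placeholders.
record = {
    "responses_create_params": {
        "input": [{"role": "user", "content": "Solve this puzzle (placeholder text)"}]
    },
    "answer": "<placeholder answer>",
    "metadata": {"placeholder": True},
}


def to_training_row(raw_line):
    """Convert one JSONL line into the prompt/answer/metadata row used for training."""
    data = json.loads(raw_line)
    prompt_text = data["responses_create_params"]["input"][0]["content"]
    return {
        "prompt": [{"role": "user", "content": prompt_text}],
        "answer": data["answer"],
        "metadata": data["metadata"],
    }


# An in-memory file stands in for the .jsonl file on disk.
jsonl_file = io.StringIO(json.dumps(record) + "\n")
rows = [to_training_row(line) for line in jsonl_file if line.strip()]
print(rows[0]["prompt"])
```

Keeping `answer` and `metadata` as dataset columns matters because the trainer passes extra columns through to the reward function.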



# Define reward function

Now let's create a reward function that uses NeMo Gym's verifier.

In [5]:
import numpy as np 


def reward_fn(completions, prompts=None, **kwargs):
    answers = kwargs['answer']
    metadatas = kwargs['metadata']
    scores = []
    for i, completion in enumerate(completions):
        completion_text = completion[0]["content"]
        task_prompt = prompts[i][0]["content"]

        # prepare data in the NeMo Gym verifier request format
        payload = {
            "responses_create_params": {"input": [{"role": "user", "content": task_prompt, "type": "message"}]},
            "response": {
                "id": "resp", "created_at": 0.0, "model": model_name, "object": "response",
                "output": [{"id": "msg", "role": "assistant", "type": "message", "status": "completed",
                           "content": [{"type": "output_text", "text": completion_text, "annotations": []}]}],
                "parallel_tool_calls": True, "tool_choice": "auto", "tools": []
            },
            "question": task_prompt,
            "answer": answers[i],
            "metadata": metadatas[i],
        }
        try:
            # send verify request to the NeMo Gym resources server
            resp = requests.post(verify_endpoint, json=payload, timeout=30)
            reward = resp.json().get("reward", 0.0) if resp.status_code == 200 else 0.0
        except Exception:
            # any request or parsing failure counts as zero reward
            reward = 0.0
        scores.append(reward)
    return np.array(scores)
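
Before launching training, it can help to exercise the reward function's input and output shapes without a running server. The sketch below mirrors the calling convention used above (conversational completions, with dataset columns such as `answer` arriving as keyword-argument lists) but swaps the HTTP call for a stub. `make_reward_fn` and `exact_match_verifier` are hypothetical names, and exact string matching is only a stand-in, not how the reasoning gym verifier scores.

```python
def make_reward_fn(verify):
    """Wrap any callable verify(prompt, completion, answer) -> float as a reward function."""
    def reward_fn(completions, prompts=None, **kwargs):
        answers = kwargs["answer"]
        scores = []
        for i, completion in enumerate(completions):
            completion_text = completion[0]["content"]
            task_prompt = prompts[i][0]["content"]
            try:
                scores.append(float(verify(task_prompt, completion_text, answers[i])))
            except Exception:
                scores.append(0.0)  # a failing verifier yields zero reward
        return scores
    return reward_fn


def exact_match_verifier(prompt, completion, answer):
    """Stub verifier: full reward only for an exact (whitespace-trimmed) match."""
    return 1.0 if completion.strip() == answer.strip() else 0.0


offline_reward_fn = make_reward_fn(exact_match_verifier)
scores = offline_reward_fn(
    completions=[
        [{"role": "assistant", "content": "1 2\n"}],
        [{"role": "assistant", "content": "2 2"}],
    ],
    prompts=[[{"role": "user", "content": "p"}]] * 2,
    answer=["1 2", "1 2"],
)
print(scores)
```

The important property is one score per completion, in the same order as the completions.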

# Configure and launch GRPO

Unsloth also supports GSPO, GAPO, Dr GRPO and more! Check out the Unsloth docs for more info: https://docs.unsloth.ai/

We will train for 100 steps. The goal is to see the reward go up! You should see the reward start around 0.15, and quickly rise to 0.6 or higher! 
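
GRPO scores each completion relative to the other completions sampled for the same prompt. A minimal sketch of that group-relative advantage in its commonly described form (reward minus the group mean, divided by the group standard deviation) follows. It is illustrative only: the trainer's actual computation may differ in details such as the normalization, the choice of population versus sample standard deviation, and the epsilon, and the reward values here are made up.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of completions for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption
    return [(r - mean) / (std + eps) for r in rewards]


# Made-up rewards for the 8 completions of a single prompt.
group_rewards = [0.0, 0.25, 0.25, 0.5, 0.5, 0.5, 0.75, 1.0]
advantages = group_relative_advantages(group_rewards)
for reward, advantage in zip(group_rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")

# If every completion earns the same reward, all advantages are zero.
print(group_relative_advantages([0.5] * 8))
```

This is why a reward with some spread inside each group matters: a group whose completions all score the same contributes no learning signal.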

In [6]:
from trl import GRPOConfig, GRPOTrainer

max_prompt_length = max_length_seen + 1 # +1 just in case as in other unsloth examples
max_completion_length = max_seq_length - max_prompt_length 

training_args = GRPOConfig(
    temperature=1.0,
    learning_rate=1e-5,
    weight_decay=0.001,
    warmup_ratio=0.0,
    lr_scheduler_type="linear",
    optim="adamw_8bit",
    logging_steps=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
    num_generations=8,
    max_prompt_length=max_prompt_length,
    max_completion_length=max_completion_length,
    num_train_epochs=1,
    max_steps=100,
    save_steps=100,
    report_to="none", # Can use Weights & Biases 
    # run_name=run_name, # for Weights & Biases
    output_dir="outputs",
    epsilon_high=0.28,
    mask_truncated_completions=True,
    # log_completions=True, # uncomment to see rollouts printed to the console!
    # num_completions_to_print=1,
)
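
The batch settings above interact, so it is worth working out how many prompts each step covers. The sketch below assumes the usual convention that the effective batch counts completions (per-device batch size times gradient accumulation steps times number of GPUs) and that each prompt is sampled `num_generations` times. A single GPU is also an assumption. Check the trainer's startup banner for the values it actually uses.

```python
# Values copied from the GRPOConfig above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 64
num_generations = 8
max_steps = 100
num_gpus = 1  # assumption: single-GPU training

# Completions consumed per optimizer step.
completions_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_gpus

# Each prompt contributes num_generations completions, so the batch must divide evenly.
assert completions_per_step % num_generations == 0
prompts_per_step = completions_per_step // num_generations
total_prompts = prompts_per_step * max_steps

print(f"completions per step: {completions_per_step}")
print(f"unique prompts per step: {prompts_per_step}")
print(f"prompts seen in {max_steps} steps: {total_prompts}")
```

Under these assumptions, 100 steps cover fewer prompts than the 2,000 in the dataset, so `max_steps` rather than `num_train_epochs` decides when training stops.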

In [7]:
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[reward_fn],
    args=training_args,
    train_dataset=train_dataset,
)

In [8]:
trainer.train()

The model is already on multiple devices. Skipping the move to device specified in `args`.
==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 2,000 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 64
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 64 x 1) = 64
 "-____-"     Trainable parameters = 1,543,714,304 of 1,543,714,304 (100.00% trained)


| Step | Training Loss | reward | reward_std | completions / mean_length | completions / min_length | completions / max_length | completions / clipped_ratio | completions / mean_terminated_length | completions / min_terminated_length | completions / max_terminated_length | kl | rewards / reward_fn / mean | rewards / reward_fn / std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 0.167067 | 0.148951 | 31.890625 | 18.0 | 115.0 | 0.0 | 31.890625 | 18.0 | 115.0 | 0.001452 | 0.167067 | 0.193551 |
| 2 | 0.0004 | 0.300987 | 0.118009 | 32.0625 | 32.0 | 36.0 | 0.0 | 32.0625 | 32.0 | 36.0 | 0.421682 | 0.300987 | 0.136417 |
| 3 | 0.0 | 0.340543 | 0.133478 | 32.0 | 32.0 | 32.0 | 0.0 | 32.0 | 32.0 | 32.0 | 0.0441 | 0.340543 | 0.167031 |
| 4 | 0.0001 | 0.340528 | 0.122462 | 32.0 | 32.0 | 32.0 | 0.0 | 32.0 | 32.0 | 32.0 | 0.081246 | 0.340528 | 0.145538 |
| 5 | 0.0001 | 0.495667 | 0.13495 | 32.0 | 32.0 | 32.0 | 0.0 | 32.0 | 32.0 | 32.0 | 0.115817 | 0.495667 | 0.156288 |
| 6 | 0.0002 | 0.416524 | 0.090819 | 32.0 | 32.0 | 32.0 | 0.0 | 32.0 | 32.0 | 32.0 | 0.232326 | 0.416524 | 0.118735 |
| 7 | 0.0002 | 0.486607 | 0.091526 | 32.0 | 32.0 | 32.0 | 0.0 | 32.0 | 32.0 | 32.0 | 0.186393 | 0.486607 | 0.104969 |
| 8 | 0.0009 | 0.456758 | 0.075614 | 32.28125 | 32.0 | 50.0 | 0.0 | 32.28125 | 32.0 | 50.0 | 0.937504 | 0.456758 | 0.134173 |
| 9 | 0.0003 | 0.503749 | 0.110835 | 31.78125 | 24.0 | 32.0 | 0.0 | 31.78125 | 24.0 | 32.0 | 0.279072 | 0.503749 | 0.127213 |
| 10 | 0.0006 | 0.499595 | 0.077543 | 32.75 | 32.0 | 80.0 | 0.0 | 32.75 | 32.0 | 80.0 | 0.627995 | 0.499595 | 0.144955 |


TrainOutput(global_step=100, training_loss=0.00043772288860054686, metrics={'train_runtime': 1207.5573, 'train_samples_per_second': 5.3, 'train_steps_per_second': 0.083, 'total_flos': 0.0, 'train_loss': 0.00043772288860054686})
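
To check the "reward goes up" expectation against a log, one simple approach is to compare averages over the early and late logged steps. The values below are the `reward` column from the ten logged steps shown above. The window size of 3 is an arbitrary choice, and `window_mean` is a hypothetical helper.

```python
# Reward column copied from the ten logged training steps above.
rewards = [
    0.167067, 0.300987, 0.340543, 0.340528, 0.495667,
    0.416524, 0.486607, 0.456758, 0.503749, 0.499595,
]


def window_mean(values, start, size):
    """Average of `size` consecutive values beginning at index `start`."""
    window = values[start:start + size]
    return sum(window) / len(window)


early = window_mean(rewards, 0, 3)
late = window_mean(rewards, len(rewards) - 3, 3)
print(f"first 3 steps: {early:.3f}, last 3 steps: {late:.3f}, change: {late - early:+.3f}")
```

Averaging over a few steps smooths out the step-to-step noise that is visible in the raw column.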

# Test the trained model!

In [9]:
text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "In 4x4 Mini Sudoku:\n- Each row must contain each number from 1-4 exactly once\n- Each column must contain each number 1-4 exactly once\n- Each 2x2 subgrid must contain each number 1-4 exactly once\nSolve this 4x4 Mini Sudoku puzzle:\n4 _ _ _\n_ 3 _ _\n_ 1 3 _\n_ _ _ _\nFormat your response as the puzzle above, with spaces separating each number within a row, and newlines separating rows.\n"}],
    tokenize = False,
    add_generation_prompt = True,
)

from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors = "pt").to("cuda"),
    temperature = 1.0,
    max_new_tokens = 4096,
    streamer = TextStreamer(tokenizer, skip_prompt = False),
)

<|im_start|>system
You are Qwen, created by Alibaba Cloud. You are a helpful assistant.<|im_end|>
<|im_start|>user
In 4x4 Mini Sudoku:
- Each row must contain each number from 1-4 exactly once
- Each column must contain each number 1-4 exactly once
- Each 2x2 subgrid must contain each number 1-4 exactly once
Solve this 4x4 Mini Sudoku puzzle:
4 _ _ _
_ 3 _ _
_ 1 3 _
_ _ _ _
Format your response as the puzzle above, with spaces separating each number within a row, and newlines separating rows.
<|im_end|>
<|im_start|>assistant
4 2 3 1
2 3 1 4
3 1 3 2
1 2 4 3<|im_end|>
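
A completion can also be checked locally against the three rules stated in the prompt. The sketch below parses a grid in the requested format and tests rows, columns, and 2x2 subgrids. `parse_grid`, `is_valid_solution`, and `respects_givens` are hypothetical helpers, and this all-or-nothing check is not the scoring used by the NeMo Gym verifier.

```python
def parse_grid(text):
    """Parse rows of space-separated cells; '_' marks an empty cell (None).

    Raises ValueError if a cell is neither '_' nor an integer.
    """
    return [[None if cell == "_" else int(cell) for cell in line.split()]
            for line in text.strip().splitlines()]


def is_valid_solution(grid, size=4, box=2):
    """True if every row, column, and box contains each of 1..size exactly once."""
    expected = set(range(1, size + 1))
    if len(grid) != size or any(len(row) != size for row in grid):
        return False
    rows = [set(row) for row in grid]
    cols = [{grid[r][c] for r in range(size)} for c in range(size)]
    boxes = [{grid[r][c] for r in range(br, br + box) for c in range(bc, bc + box)}
             for br in range(0, size, box) for bc in range(0, size, box)]
    return all(group == expected for group in rows + cols + boxes)


def respects_givens(puzzle, grid):
    """True if the grid keeps every pre-filled cell of the puzzle unchanged."""
    return all(p is None or p == g
               for p_row, g_row in zip(puzzle, grid)
               for p, g in zip(p_row, g_row))


puzzle = parse_grid("4 _ _ _\n_ 3 _ _\n_ 1 3 _\n_ _ _ _")
model_output = parse_grid("4 2 3 1\n2 3 1 4\n3 1 3 2\n1 2 4 3")  # output shown above
candidate = parse_grid("4 2 1 3\n1 3 4 2\n2 1 3 4\n3 4 2 1")     # worked out by hand

for name, grid in [("model output", model_output), ("candidate", candidate)]:
    print(name, is_valid_solution(grid), respects_givens(puzzle, grid))
```

Applied to the sample output above, the check fails because the third row contains 3 twice, so a single generation after 100 steps is not guaranteed to be a full solution even when the average reward has risen.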


<a name="Save"></a>
### Saving to float16 or MXFP4 for vLLM

Unsloth supports saving to `float16` directly. Select `merged_16bit` for float16. Unsloth also supports saving in low or mixed precision such as `mxfp4`, and allows `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [10]:
# Merge and save, or push to the Hub, in MXFP4 4-bit format
if False:
    model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "mxfp4")
if False: # Pushing to HF Hub
    model.push_to_hub_merged("repo_id/repo_name", tokenizer, token = "hf...", save_method = "mxfp4")

# Merge and save, or push to the Hub, in 16-bit
if False:
    model.save_pretrained_merged("finetuned_model", tokenizer, save_method = "merged_16bit")
if False: # Pushing to HF Hub
    model.push_to_hub_merged("hf/gpt-oss-finetune", tokenizer, save_method = "merged_16bit", token = "")

And we're done! If you have any questions about NeMo Gym, please open an issue on the GitHub repository!

For Unsloth questions, there is a [Discord](https://discord.gg/unsloth) channel! Feel free to join if you find bugs, need help, want to join projects, or want to keep up with the latest LLM news.

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, continued pretraining, conversational finetuning and more in our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>