# Unsloth GRPO Training of DeepSeek-R1

In this notebook, we use Unsloth's GRPO tooling to walk through a small-scale version of the DeepSeek-R1 training process, applied here to Llama 3.1 8B Instruct.

This is not a 1-to-1 reproduction, but it does cover the major innovations in the paper: [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https://arxiv.org/pdf/2501.12948).

Let's dive in!

### What is the GRPO Training Process with RL?

1. **TLDR - What is GRPO anyway?** GRPO stands for ***Group Relative Policy Optimization***, a reinforcement learning algorithm used in AI models like DeepSeek to enhance reasoning capabilities. It simplifies training by eliminating the need for a separate value function, making it more efficient and accessible for developing large language models.

2. **Group Sampling:** For a single prompt or state, the policy generates a batch of outputs (instead of just one). This produces a small “group” of possible actions or answers.

3. **Reward Scoring:** Each output is scored by a reward function, which reflects how good or desirable that output is for the task at hand.

4. **Group-Based Advantage:** The algorithm calculates each output’s “advantage” by comparing its reward to the average reward of the entire group. If the output’s reward is above average, it has a positive advantage (and vice versa).

5. **Policy Update:** The policy is adjusted to promote outputs with a positive advantage and discourage those with a negative advantage. A KL penalty term prevents the policy from changing too drastically.

6. **Iterative Process:** The updated policy is used again to generate new groups, score them, and update—repeating until the policy converges or meets performance goals.

This group-based approach removes the need for a separate value function (critic) and helps the policy quickly learn which outputs are relatively better within each sampled group.
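To make the group-based advantage concrete, here is a minimal sketch of steps 3 and 4 for a single prompt. It assumes the common formulation in which each reward is centered on the group mean and scaled by the group's standard deviation; the function name, the `eps` value, and the sample rewards are illustrative, and real implementations may differ in detail (for example, in how the standard deviation is computed).

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Center each reward on its group's mean and scale by the group's spread.

    `eps` (an assumed value) avoids division by zero when every reward is equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for six completions sampled from one prompt.
group_rewards = [2.5, 0.0, 0.5, 2.5, 0.0, 0.5]
advantages = group_relative_advantages(group_rewards)

for r, a in zip(group_rewards, advantages):
    print(f"reward={r:.1f}  advantage={a:+.2f}")

# If every completion earns the same reward, all advantages are zero,
# so that group contributes no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0])
print(flat)
```

Note that no learned critic appears anywhere: the group itself serves as the baseline.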

> NOTE: This notebook is heavily based on the reasoning model notebooks provided by [Unsloth](https://docs.unsloth.ai/basics/reasoning-grpo). They've done amazing work to make these methods available to all!

### Installation

As you can see, we'll only need a few dependencies thanks to the collective hard work of the community!

In [1]:
%%capture
# Skip restarting message in Colab
import sys; modules = list(sys.modules.keys())
for x in modules: sys.modules.pop(x) if "PIL" in x or "google" in x else None

!pip install unsloth vllm
!pip install --upgrade pillow
# If you are running this notebook on local, you need to install `diffusers` too
# !pip install diffusers
# Temporarily install a specific TRL nightly version
!pip install git+https://github.com/huggingface/trl.git@e95f9fb74a3c3647b86f251b7e230ec51c64b72b

### Unsloth

Use [`PatchFastRL`](https://github.com/unslothai/unsloth/blob/646ad2f141a3a0721d1ec9449cf9454b5612a84a/unsloth/models/rl.py#L44) before all functions to patch GRPO and other RL algorithms!

> NOTE: This patch overwrites the TRL `.generate` to be a bit more optimized. Classic Unsloth!

In [2]:
from unsloth import FastLanguageModel, PatchFastRL
PatchFastRL("GRPO", FastLanguageModel)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Unsloth: Failed to patch Gemma3ForConditionalGeneration.
🦥 Unsloth Zoo will now patch everything to make training faster!
INFO 04-09 21:36:22 [__init__.py:239] Automatically detected platform cuda.


The following cell will take care of loading our model, as well as setting some LoRA specific hyperparameters, as is customary!

There's a lot going on behind the scenes, but the idea is simple: Unsloth is patching things to make this much simpler for us.

> NOTE: You can check out [Unsloth's blog](https://unsloth.ai/blog/r1-reasoning) if you want more insights into how to best train these models.

In [3]:
from unsloth import is_bfloat16_supported
import torch
max_seq_length = 512 # Can increase for longer reasoning traces
lora_rank = 32 # Larger rank = smarter, but slower

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "meta-llama/meta-Llama-3.1-8B-Instruct",
    max_seq_length = max_seq_length,
    load_in_4bit = True, # False for LoRA 16bit
    fast_inference = True, # Enable vLLM fast inference
    max_lora_rank = lora_rank,
    gpu_memory_utilization = 0.6, # Reduce if out of memory
)

model = FastLanguageModel.get_peft_model(
    model,
    r = lora_rank, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ], # Remove QKVO if out of memory
    lora_alpha = lora_rank,
    use_gradient_checkpointing = "unsloth", # Enable long context finetuning
    random_state = 3407,
)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.51.1. vLLM: 0.8.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: vLLM loading unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit with actual GPU utilization = 59.43%
Unsloth: Your GPU has CUDA compute capability 7.5 with VRAM = 14.74 GB.
Unsloth: Using conservativeness = 1.0. Chunked prefill tokens = 512. Num Sequences = 160.
Unsloth: vLLM's KV Cache can use up to 2.59 GB. Also swap space = 2 GB.
INFO 04-09 21:37:41 [config.py:600] This model supports multiple tasks: {'embed', 'score', 'reward', 'generate', 'classify'}. Defaulting to 'generate'.
Unsloth: vLLM Bitsandbytes config

tokenizer_config.json:   0%|          | 0.00/55.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

INFO 04-09 21:37:45 [cuda.py:240] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 04-09 21:37:45 [cuda.py:289] Using XFormers backend.
INFO 04-09 21:37:46 [parallel_state.py:957] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 04-09 21:37:46 [model_runner.py:1110] Starting to load model unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit...
INFO 04-09 21:37:46 [loader.py:1155] Loading weights with BitsAndBytes quantization. May take a while ...
INFO 04-09 21:37:47 [weight_utils.py:265] Using model weights format ['*.safetensors']


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

INFO 04-09 21:38:53 [weight_utils.py:281] Time spent downloading weights for unsloth/meta-llama-3.1-8b-instruct-unsloth-bnb-4bit: 66.267446 seconds
INFO 04-09 21:38:53 [weight_utils.py:315] No model.safetensors.index.json found in remote.


Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]




INFO 04-09 21:39:30 [punica_selector.py:18] Using PunicaWrapperGPU.
INFO 04-09 21:39:30 [model_runner.py:1146] Model loading took 5.7737 GiB and 103.990201 seconds
INFO 04-09 21:39:43 [worker.py:267] Memory profiling takes 11.92 seconds
INFO 04-09 21:39:43 [worker.py:267] the current vLLM instance can use total_gpu_memory (14.74GiB) x gpu_memory_utilization (0.59) = 8.76GiB
INFO 04-09 21:39:43 [worker.py:267] model weights take 5.77GiB; non_torch_memory takes 0.03GiB; PyTorch activation peak memory takes 0.74GiB; the rest of the memory reserved for KV Cache is 2.22GiB.
INFO 04-09 21:39:43 [executor_base.py:112] # cuda blocks: 1136, # CPU blocks: 1024
INFO 04-09 21:39:43 [executor_base.py:117] Maximum concurrency for 512 tokens per request: 35.50x
INFO 04-09 21:39:45 [model_runner.py:1456] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI. If

Capturing CUDA graph shapes: 100%|██████████| 23/23 [00:53<00:00,  2.33s/it]

INFO 04-09 21:40:38 [model_runner.py:1598] Graph capturing finished in 54 secs, took 0.53 GiB
INFO 04-09 21:40:38 [llm_engine.py:448] init engine (profile, create kv cache, warmup model) took 68.10 seconds






Unsloth 2025.3.19 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
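As a rough check on what `r = lora_rank` costs, the sketch below counts the LoRA parameters added by the configuration above. The layer shapes are assumptions taken from the published Llama 3.1 8B architecture (hidden size 4096, MLP size 14336, 8 key/value heads of dimension 128, 32 layers); they are not stated in this notebook. Each adapted linear layer gains two low-rank matrices, for `r * (in_features + out_features)` extra parameters.

```python
# Assumed Llama 3.1 8B shapes (not stated in this notebook).
hidden = 4096
intermediate = 14336
kv_dim = 8 * 128  # grouped-query attention: 8 KV heads x head dim 128
num_layers = 32
rank = 32  # lora_rank from the cell above

# (in_features, out_features) for every module listed in target_modules
shapes = {
    "q_proj": (hidden, hidden),
    "k_proj": (hidden, kv_dim),
    "v_proj": (hidden, kv_dim),
    "o_proj": (hidden, hidden),
    "gate_proj": (hidden, intermediate),
    "up_proj": (hidden, intermediate),
    "down_proj": (intermediate, hidden),
}

def lora_params(in_features, out_features, r):
    # A is (r x in_features), B is (out_features x r)
    return r * (in_features + out_features)

per_layer = sum(lora_params(i, o, rank) for i, o in shapes.values())
total = per_layer * num_layers
print(f"{total:,}")  # 83,886,080
```

Under these assumptions the total matches the `Trainable parameters = 83,886,080` figure that the trainer banner reports in this notebook, roughly 1% of the model.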


### Data Preparation

You'll notice a peculiarity here - our dataset is just...inputs and outputs! (specifically from the GSM8K data).

But wait, we said this was different from SFT - but this is seemingly just SFT all over again!

Well, we still need questions and answers to verify that we're learning *something* productive. Importantly, though, we are not leveraging a human preference reward model or a process reward model to bake our responses into the model. We just need a way to verify whether an answer provided by our model was correct or incorrect: a way to *reward* correct generations!

For now, let's examine what our input data looks like.

> NOTE: Unsloth directly leveraged the work that [@willccbb](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb) did, for data prep and all reward functions.

In [4]:
import re
from datasets import load_dataset, Dataset

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Build chat-style prompts (system + user) and keep only the final answer after "####"
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()

README.md:   0%|          | 0.00/7.94k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/419k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/7473 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/1319 [00:00<?, ? examples/s]

Map:   0%|          | 0/7473 [00:00<?, ? examples/s]

As you can see in this data, there is no specific information about preference, or "how to reason", or anything close to it. It's simply the question and the answer.

This is the core idea behind this style of training - we're not going to give the model *how* to think as an example - we're simply going to let it play in a sandbox defined by the question and answer.

> NOTE: This is not the case for DeepSeek-R1, where there is a *very small* amount of SFT that occurs (called the "cold-start") to "prime" the model for the subsequent RL stage of training.

In [5]:
dataset[0]

{'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
 'answer': '72',
 'prompt': [{'content': '\nRespond in the following format:\n<reasoning>\n...\n</reasoning>\n<answer>\n...\n</answer>\n',
   'role': 'system'},
  {'content': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
   'role': 'user'}]}
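The `'answer': '72'` field above comes from `extract_hash_answer`, which keeps only the text after the `####` marker that GSM8K places before the final answer. A minimal sketch of that step follows; the `raw_answer` string is an illustrative GSM8K-style solution written for this example, not a value read from the dataset here.

```python
from typing import Optional

def extract_hash_answer(text: str) -> Optional[str]:
    # GSM8K solutions end with "#### <final answer>"
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Illustrative GSM8K-style solution text (assumed for this example).
raw_answer = (
    "Natalia sold 48/2 = 24 clips in May.\n"
    "Natalia sold 48+24 = 72 clips altogether in April and May.\n"
    "#### 72"
)

final_answer = extract_hash_answer(raw_answer)
missing = extract_hash_answer("No marker in this text")
print(final_answer, missing)
```

The worked solution before the marker is discarded, so the model never sees a reference chain of thought during training.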

## Now we get to the *magic* of the approach - a collection of reward functions.

Notice that we have a number of different "checks" we do - these come together as expressed in the following diagram:

![ppo grpo](https://raw.githubusercontent.com/100stacks/ml-projects/refs/heads/main/content/data/ppo-grpo.png)

What this means, essentially, is that we use a suite of reward functions to determine if our model is *learning "how we want"*, as opposed to giving it examples that show it how we want it to learn.

These reward functions are totally customizable - and allow users to effectively steer how and what the model is incentivized to "get good at".

In [6]:
# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that follow the exact newline-delimited XML layout."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` match newlines, so multi-line reasoning can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that contain both tag pairs in order, with flexible whitespace."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` match newlines, so multi-line reasoning can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
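Because every reward function above shares the same shape (a batch of completions in, one float per completion out), adding your own is straightforward. The sketch below is a hypothetical example and not part of the original recipe: `brevity_reward_func` and its `max_chars` threshold are invented here to show the pattern.

```python
def brevity_reward_func(completions, max_chars=600, **kwargs) -> list[float]:
    """Hypothetical reward: a small bonus for completions under a length cap."""
    # Each completion is a list of chat messages; take the assistant text.
    responses = [completion[0]["content"] for completion in completions]
    return [0.25 if len(r) <= max_chars else 0.0 for r in responses]

# Two fake completions in the same structure the trainer passes in.
sample_completions = [
    [{"role": "assistant", "content": "<reasoning>\n48 + 24 = 72\n</reasoning>\n<answer>\n72\n</answer>\n"}],
    [{"role": "assistant", "content": "x" * 1000}],
]
scores = brevity_reward_func(sample_completions)
print(scores)
```

A function like this could be appended to the `reward_funcs` list passed to the trainer. Its weight relative to the correctness reward (2.0) decides how strongly it steers the model.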

### Train the model

All that is left to do, now that we have our:

1. Training examples
2. Reward Functions

Is to train our model!

We'll start with setting a number of hyper-parameters.

> NOTE: These hyper-parameters are based around the free-tier Colab T4 instance. You can modify them to "right size" for your hardware.

### GRPOConfig

First and foremost - we have a number of typical hyper-parameters (as always).

You'll also notice a distinct *lack* of GRPO hyper-parameters being used in this implementation - we'll stick with the defaults to keep this notebook manageable, but you're welcome to dive deep into TRL and play around to see what works best for your use-case.

> NOTE: If you wish to walk away with the classic "RL" image of "line going up to the right", you can remove the `report_to = "none"` from the following config.

In [7]:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
    use_vllm = True, # use vLLM for fast inference!
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "paged_adamw_8bit",
    logging_steps = 1,
    bf16 = is_bfloat16_supported(),
    fp16 = not is_bfloat16_supported(),
    per_device_train_batch_size = 1,
    gradient_accumulation_steps = 1, # Increase to 4 for smoother training
    num_generations = 6, # Decrease if out of memory
    max_prompt_length = 256,
    max_completion_length = 200,
    # num_train_epochs = 1, # Set to 1 for a full training run
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none", # Can use Weights & Biases
    output_dir = "outputs",
)
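A quick sanity check on these numbers is worthwhile before committing to a multi-hour run. The sketch below uses only values from this notebook and makes one assumption, flagged in the code: that each training step samples a single group of `num_generations` completions for one prompt.

```python
# Values copied from this notebook's configuration cells.
max_seq_length = 512
max_prompt_length = 256
max_completion_length = 200
num_generations = 6
max_steps = 250
train_runtime_s = 11123.5353  # train_runtime reported by TrainOutput in this notebook

# The prompt and completion together must fit in the model's sequence length.
token_budget = max_prompt_length + max_completion_length
headroom = max_seq_length - token_budget
print(f"prompt + completion = {token_budget} tokens ({headroom} below max_seq_length)")

# Assumption: one prompt (one group of completions) per training step.
completions_total = max_steps * num_generations
max_generated_tokens = completions_total * max_completion_length
print(f"{completions_total} completions, at most {max_generated_tokens:,} generated tokens")
print(f"about {train_runtime_s / max_steps:.1f} s per step")
```

Raising `max_completion_length` for longer reasoning traces means raising `max_seq_length` as well, at the cost of memory and time per step.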

Finally, we can run our trainer!

The basic idea in this RL-focused approach is that, instead of watching loss go down, we want to watch reward *go up*!

> NOTE: The training has a kind of "Aha!" moment, as it's been described, whereby the reward sits at ~0 and then suddenly begins increasing. This is expected behaviour, but you may not see changes in the reward column (the combined output of the reward functions defined above) until you get past the 100th-150th step.

In [8]:
trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)
trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 7,473 | Num Epochs = 1 | Total steps = 250
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 1 x 1) = 1
 "-____-"     Trainable parameters = 83,886,080/8,000,000,000 (1.05% trained)


-------------------- Question:
A concert ticket costs $40. Mr. Benson bought 12 tickets and received a 5% discount for every ticket bought that exceeds 10. How much did Mr. Benson pay in all? 
Answer:
476 
Response:
Let's break down the problem:

1. The first 10 tickets will be full price ($40 each).
2. The next 2 tickets (12 - 10 = 2) receive a 5% discount.

First, calculate the cost of the first 10 tickets:
10 tickets * $40 = $400

Now, calculate the cost of the 2 tickets with the discount:

The discount on each ticket is 5% of $40 = $2.
The price of each discounted ticket is $40 - $2 = $38.
2 tickets * $38 = $76

Now, add the cost of the full price tickets and the discounted tickets:
$400 + $76 = $476

<answer>
Mr. Benson paid $476.
</answer> 
Extracted:
Mr. Benson paid $476.
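The sample above reaches the right number yet earns no correctness reward, because the extracted answer is a full sentence rather than the bare value. A minimal sketch of that scoring, reusing the notebook's `extract_xml_answer` logic on the tail of the logged response:

```python
def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

# Tail of the logged response and its target answer.
response_tail = "$400 + $76 = $476\n\n<answer>\nMr. Benson paid $476.\n</answer>"
target = "476"

extracted = extract_xml_answer(response_tail)
correctness = 2.0 if extracted == target else 0.0  # exact string match
int_bonus = 0.5 if extracted.isdigit() else 0.0    # digits only
print(repr(extracted), correctness, int_bonus)

# A bare numeric answer would earn both rewards.
clean = extract_xml_answer("<answer>\n476\n</answer>")
clean_total = (2.0 if clean == target else 0.0) + (0.5 if clean.isdigit() else 0.0)
print(repr(clean), clean_total)
```

This strictness is part of the training signal: the model is pushed toward putting only the final value inside `<answer>`.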


| Step | Training Loss | reward | reward_std | completion_length | kl | rewards / xmlcount_reward_func | rewards / soft_format_reward_func | rewards / strict_format_reward_func | rewards / int_reward_func | rewards / correctness_reward_func |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | -0.0 | -0.048 | 0.117575 | 183.666672 | 0.0 | -0.048 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 200.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 171.666672 | 4e-06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.5665 | 1.231273 | 185.333344 | 3e-06 | -0.100167 | 0.0 | 0.0 | 0.333333 | 1.333333 |
| 5 | 0.0 | 0.075167 | 0.18412 | 128.5 | 1.2e-05 | -0.008167 | 0.0 | 0.0 | 0.083333 | 0.0 |
| 6 | 0.0 | 0.4375 | 1.011651 | 193.666672 | 4e-06 | 0.020833 | 0.0 | 0.0 | 0.083333 | 0.333333 |
| 7 | 0.0 | 0.0 | 0.0 | 190.5 | 4e-06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 8 | 0.0 | -0.0405 | 0.099204 | 144.666672 | 9e-06 | -0.0405 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.291 | 0.959376 | 188.5 | 1e-05 | -0.125667 | 0.0 | 0.0 | 0.083333 | 0.333333 |
| 10 | 0.0 | 0.0 | 0.0 | 200.0 | 4e-06 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
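Per-step rewards are noisy, since each step scores only one small group of completions. A trailing moving average makes the trend easier to read. The sketch below applies one to the ten `reward` values logged above; the `moving_average` helper and the window size are illustrative choices, not part of the training code.

```python
def moving_average(values, window):
    """Trailing mean over up to `window` most recent values."""
    smoothed = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            running -= values[i - window]  # drop the value leaving the window
        smoothed.append(running / min(i + 1, window))
    return smoothed

# `reward` column for steps 1-10, copied from the training log above.
rewards = [-0.048, 0.0, 0.0, 1.5665, 0.075167, 0.4375, 0.0, -0.0405, 0.291, 0.0]
smoothed = moving_average(rewards, window=5)

for step, (raw, avg) in enumerate(zip(rewards, smoothed), start=1):
    print(f"step {step:2d}  reward={raw:+.3f}  avg5={avg:+.3f}")
```

Over a full run, the same smoothing applied to all 250 steps should make the point where the reward starts climbing much easier to spot.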


Streaming output truncated to the last 5000 lines.
To find the total amount Tara has to pay for the laptop as down payment, we need to find 20% of $1000 and then add $20.

20% of $1000 is (20/100) * 1000 = $200.
The total down payment Tara will pay is $200 + $20 = $220.

Since the laptop costs $1000, and she paid $220, the remaining amount to be paid is $1000 - $220 = $780.

The monthly installment is $65. To find the balance after 4 months, we need to subtract 4 times the monthly installment from the remaining amount.

The balance after 4 months will be: $780 - (4 * $65) = $780 - $260 = $520.

<reasoning>
To find the total down payment, we need to calculate 20% of $1000 and add $20. Then, we need to find the remaining amount by subtracting the
-------------------- Question:
Bill picked 50 apples from the orchard with his wife and two children.  He sends each of his kids to school with 3 apples for their two favorite teachers.  His wife Jill bakes two apple pies, using 10

TrainOutput(global_step=250, training_loss=0.0012113185200519822, metrics={'train_runtime': 11123.5353, 'train_samples_per_second': 0.022, 'train_steps_per_second': 0.022, 'total_flos': 0.0, 'train_loss': 0.0012113185200519822})

<a name="Inference"></a>
### Inference
Now let's try the model we just trained! First, let's try the model without the GRPO-trained LoRA:

In [12]:
text = tokenizer.apply_chat_template([
    {"role" : "user", "content" : "Calculate pi!"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    [text],
    sampling_params = sampling_params,
    lora_request = None,
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:25<00:00, 25.72s/it, est. speed input: 1.52 toks/s, output: 18.43 toks/s]


'**Calculating Pi**\n\nPi (π) is an irrational number approximately equal to 3.14159. Calculating its value is a complex task, especially for a large number of decimal places. Here\'s an example of how you can calculate pi using the Monte Carlo method in Python:\n\n```python\nimport random\nimport math\n\ndef calculate_pi(num_samples):\n    """\n    Calculate pi using the Monte Carlo method.\n\n    :param num_samples: The number of random points to generate.\n    :return: An estimate of pi.\n    """\n    points_inside_circle = 0\n    for _ in range(num_samples):\n        x = random.uniform(-1, 1)\n        y = random.uniform(-1, 1)\n        distance = x**2 + y**2\n        if distance <= 1:\n            points_inside_circle += 1\n    return 4 * points_inside_circle / num_samples\n\n# Example usage:\nnum_samples = 1000000\nestimated_pi = calculate_pi(num_samples)\nprint(f"Estimated pi: {estimated_pi}")\nprint(f"Actual pi: {math.pi}")\n\n# Compare the estimated value to the actual value of

And now with the LoRA we just trained with GRPO. First, we save the LoRA:

In [13]:
model.save_lora("grpo_saved_lora")

Now we load the LoRA and test:

In [11]:
text = tokenizer.apply_chat_template([
    {"role" : "system", "content" : SYSTEM_PROMPT},
    {"role" : "user", "content" : "Calculate pi"},
], tokenize = False, add_generation_prompt = True)

from vllm import SamplingParams
sampling_params = SamplingParams(
    temperature = 0.8,
    top_p = 0.95,
    max_tokens = 1024,
)
output = model.fast_generate(
    text,
    sampling_params = sampling_params,
    lora_request = model.load_lora("grpo_saved_lora"),
)[0].outputs[0].text

output

Processed prompts: 100%|██████████| 1/1 [00:27<00:00, 27.49s/it, est. speed input: 2.18 toks/s, output: 16.40 toks/s]


'The value of pi (π) is an irrational number, which means it cannot be expressed exactly as a finite decimal or fraction. However, we can calculate its value to a certain number of decimal places using various mathematical techniques and algorithms.\n\nOne way to calculate pi is to use the Gregory-Leibniz series, which is a infinite series that converges to pi:\n\nπ/4 = 1 - 1/3 + 1/5 - 1/7 + 1/9 - ...\n\nWe can also use the Bailey–Borwein–Plouffe formula (BBP formula), which is a spigot algorithm for computing the nth binary digit of pi:\n\nπ = Σ (1/(16^n)) * ((4/(8n + 1)) + (2/(8n + 4)) - (1/(8n + 5)) - (1/(8n + 6)) - (1/(8n + 7)) + (1/(8n + 2)) + (1/(8n + 3)) + (1/(8n + 5)) + (1/(8n + 6)))\n\nUsing this formula, we can calculate the value of pi to a certain number of decimal places.\n\nHere\'s a simplified example using Python code to calculate the value of pi to 10 decimal places:\n\n```python\ndef calc_pi(n):\n    pi = 0.0\n    for i in range(n):\n        pi += (1/(16**i)) * ((4/(8

## Summary of Results

### First the pre-GRPO model results

**Calculating Pi**

Pi (π) is an irrational number approximately equal to 3.14159. Calculating its value is a complex task, especially for a large number of decimal places. Here's an example of how you can calculate pi using the Monte Carlo method in Python:

```python
import random
import math

def calculate_pi(num_samples):
    """
    Calculate pi using the Monte Carlo method.

    :param num_samples: The number of random points to generate.
    :return: An estimate of pi.
    """
    points_inside_circle = 0
    for _ in range(num_samples):
        x = random.uniform(-1, 1)
        y = random.uniform(-1, 1)
        distance = x**2 + y**2
        if distance <= 1:
            points_inside_circle += 1
    return 4 * points_inside_circle / num_samples

# Example usage:
num_samples = 1000000
estimated_pi = calculate_pi(num_samples)
print(f"Estimated pi: {estimated_pi}")
print(f"Actual pi: {math.pi}")

# Compare the estimated value to the actual value of pi
print(f"Difference: {abs(estimated_pi - math.pi)}")
```

This code generates `num_samples` random points in a square with side length 2. It checks how many points fall inside the unit circle (i.e., `x**2 + y**2 <= 1`). The ratio of points inside the circle to the total number of points is then multiplied by 4 to estimate pi. The more points you generate, the more accurate the estimate will be.

Keep in mind that this method is not the most efficient way to calculate pi, but it's a fun and educational example of the Monte Carlo method.

**Another method to calculate pi**

You can also use the Bailey–Borwein–Plouffe formula (BBP formula) to calculate pi, which is a spigot algorithm that calculates the nth binary digit of pi without having to compute the preceding digits. Here is a Python function that uses the BBP formula:

```python
def calculate_pi_bbpf(num_digits):
    """
    Calculate pi using the Bailey–Borwein–Plouffe formula (BBP formula).

    :param num_digits: The number
```


### GRPO Model

The value of pi (π) is an irrational number, which means it cannot be expressed exactly as a finite decimal or fraction. However, we can calculate its value to a certain number of decimal places using various mathematical techniques and algorithms.

One way to calculate pi is to use the Gregory-Leibniz series, which is a infinite series that converges to pi:

π/4 = 1 - 1/3 + 1/5 - 1/7 + 1/9 - ...

We can also use the Bailey–Borwein–Plouffe formula (BBP formula), which is a spigot algorithm for computing the nth binary digit of pi:

π = Σ (1/(16^n)) * ((4/(8n + 1)) + (2/(8n + 4)) - (1/(8n + 5)) - (1/(8n + 6)) - (1/(8n + 7)) + (1/(8n + 2)) + (1/(8n + 3)) + (1/(8n + 5)) + (1/(8n + 6)))

Using this formula, we can calculate the value of pi to a certain number of decimal places.

Here's a simplified example using Python code to calculate the value of pi to 10 decimal places:

```python
def calc_pi(n):
    pi = 0.0
    for i in range(n):
        pi += (1/(16**i)) * ((4/(8*i + 1)) + (2/(8*i + 4)) - (1/(8*i + 5)) - (1/(8*i + 6)) - (1/(8*i + 7)) + (1/(8*i + 2)) + (1/(8*i + 3)) + (1/(8*i + 5)) + (1/(8*i + 6)))
    return pi

n = 1000  # number of iterations
pi = calc_pi(n)
print("The value of pi is approximately", round(pi, 10))
```

Answer: <answer>3.1415926545</answer>

### Final Words

And in classic fashion - the "reasoning model" is better at the task than the non-reasoning variant.

All this within a few hours in a free Colab instance - this is the power of open source! Of course, a higher-throughput instance with multiple GPUs would yield results faster. You could also train the model for longer to get better results. Still, these are incredible results!

----