# Make Gemma-3-4b Think!

In this notebook, we'll leverage the work done by Unsloth to explore a small-scale version of the training process described in the DeepSeek-R1 paper.

This isn't an exact 1-to-1 reproduction, but it highlights the main innovations presented in the paper.

Let's dive in!

### What is the GRPO training process with RL?

1. **Group Sampling:**  
   For a single prompt or state, the policy generates a batch of responses instead of just one. This produces a small "group" of possible actions or responses.

2. **Reward Scoring:**  
   Each response is scored using a reward function, which reflects how good or desirable that response is for the specific task.

3. **Group-Based Advantage:**  
   The algorithm calculates the "advantage" of each response by comparing its reward against the group's average reward. If a response's reward is above the average, it has a positive advantage (and vice versa).

4. **Policy Update:**  
   The policy is adjusted to encourage responses with positive advantage and discourage those with negative advantage. A KL penalty term is included to avoid overly drastic changes to the policy.

5. **Iterative Process:**  
   The updated policy is used again to generate new groups, evaluate them, and update further. This process repeats until the policy converges or meets performance objectives.

This group-based approach eliminates the need for a separate value function (critic) and helps the policy rapidly learn which responses are relatively better within each sampled group.
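Steps 2 and 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not TRL's implementation: the function name `group_advantages` and the `eps` constant are assumptions made for the example, and dividing by the group's standard deviation follows the normalization described for GRPO in the DeepSeekMath paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-4):
    """Return one advantage per response, relative to its own group.

    `rewards` holds the scalar rewards of all responses sampled for ONE prompt.
    `eps` (an assumed small constant) avoids division by zero when every
    response in the group receives the same reward.
    """
    baseline = mean(rewards)   # the group's average reward acts as the baseline
    spread = pstdev(rewards)   # population standard deviation of the group
    return [(r - baseline) / (spread + eps) for r in rewards]

# Hypothetical rewards for four responses sampled from the same prompt.
advantages = group_advantages([2.0, 0.0, 0.0, 2.0])
print(advantages)  # positive for above-average responses, negative otherwise

# If every response earns the same reward, all advantages are zero:
# the group carries no learning signal for that prompt.
print(group_advantages([0.5, 0.5, 0.5]))
```

No critic network appears anywhere in this computation: the baseline comes entirely from the other responses in the group.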

> **NOTE:** This notebook is largely based on the model's notebooks.

---

We're working this time with Google's model: [google/gemma-3-4b-it](https://huggingface.co/google/gemma-3-4b-it)


In [None]:
from huggingface_hub import notebook_login

notebook_login()


### Install dependencies

In [None]:
!pip install -qqq git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3 \
                  git+https://github.com/huggingface/trl.git@main \
                  bitsandbytes


This code loads the Gemma multimodal (image-text) model from Hugging Face, optimized for efficient fine-tuning. It uses PEFT's LoRA technique to fine-tune only specific layers (all-linear) to reduce training cost and memory usage. It also initializes the associated processor and tokenizer for handling inputs to the model.

> **NOTE:** Check out [Unsloth's blog](https://unsloth.ai/blog/r1-reasoning) for more information on how to best train these models.
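To get a feel for why LoRA is cheap, the sketch below counts the extra parameters a rank-`r` adapter adds to one linear layer. It assumes the standard LoRA formulation (two low-rank matrices `A` and `B`, with the update scaled by `lora_alpha / r`). The helper names and the 2560×2560 layer shape are illustrative assumptions, not values read from the Gemma configuration.

```python
def lora_extra_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * in_features + out_features * r

def lora_scaling(lora_alpha: int, r: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return lora_alpha / r

# Hypothetical square projection layer; the shape is illustrative only.
in_features, out_features = 2560, 2560
full = in_features * out_features
extra = lora_extra_params(in_features, out_features, r=16)

print(f"full layer: {full:,} weights, LoRA adapter: {extra:,} weights")
print(f"adapter / full = {extra / full:.2%}")
print(f"scaling with lora_alpha=32, r=16: {lora_scaling(32, 16)}")
```

Only the adapter weights receive gradients, which is why the trainable fraction reported by `print_trainable_parameters()` stays around one percent.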


In [None]:
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
from peft import LoraConfig, get_peft_model

model_name = "google/gemma-3-4b-it"

model = AutoModelForImageTextToText.from_pretrained(
    model_name, device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="eager"
)

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
)
model = get_peft_model(model, lora_config)

model.print_trainable_parameters()  # prints the summary itself and returns None

processor = AutoProcessor.from_pretrained(model_name)
tokenizer = processor.tokenizer

trainable params: 42,734,080 || all params: 4,342,813,552 || trainable%: 0.9840

### Model Test
Let's test the model before proceeding, to see how it performs.


In [None]:
problems = [
    "Laura wants to buy candy bags for a party. Each candy bag costs €2.50, and she wants to buy 8 bags. Additionally, there's a special offer: if she buys 10 bags, she gets a 20% discount on the total price. How much will she pay if she only buys the 8 bags? How much would it cost if she bought 10 bags with the discount applied? Is it better to buy 8 bags or take advantage of the offer and buy 10? How much would she save or spend extra?",

    "Arnau's grandfather has an orchard with apple and pear trees. In total, he has 45 fruit trees. If the number of apple trees is twice that of pear trees, how many trees of each type does he have?",

    "A cyclist travels 60 km in 3 hours at a constant speed. If he maintains the same speed, how many kilometers will he travel in 5 hours?",

    "Andrea has 50 euros and wants to buy 3 books. Each book costs 12 euros. Does she have enough money to buy them? If not, how much more money does she need?"
]

# Expected Results

- **Buying bags of candy:** It is better to buy 10 bags. You pay the same (€20) but get 2 extra bags.

- **Fruit trees:** 15 pear trees and 30 apple trees.

- **Distance traveled by the cyclist:** 100 km.

- **Andrea and the books:** Yes, she has enough money. She will have €14 left.


In [None]:
generation_params = {
    "temperature": 0.3,
    "top_p": 0.95,
    "max_new_tokens": 1024,
    "do_sample": True,
}

for problem in problems:
    text = tokenizer.apply_chat_template(
        [{"role": "user", "content": problem}],
        tokenize=False,
        add_generation_prompt=True
    )

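    # The chat template already adds <bos>, so don't let the tokenizer add a second one
    inputs = tokenizer(text, return_tensors="pt", padding=True, add_special_tokens=False)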
    inputs = {k: v.to("cuda") for k, v in inputs.items()}

    with torch.no_grad():
        output = model.generate(**inputs, **generation_params)

    response = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"🔹Problem: \n\n{problem}\n\n🟦 Solution: {response}\n\n{'='*50}")

🔹Problem: 

Laura wants to buy candy bags for a party. Each candy bag costs €2.50, and she wants to buy 8 bags. Additionally, there's a special offer: if she buys 10 bags, she gets a 20% discount on the total price. How much will she pay if she only buys the 8 bags? How much would it cost if she bought 10 bags with the discount applied? Is it better to buy 8 bags or take advantage of the offer and buy 10? How much would she save or spend extra?

🟦 Solution: user
Laura wants to buy candy bags for a party. Each candy bag costs €2.50, and she wants to buy 8 bags. Additionally, there's a special offer: if she buys 10 bags, she gets a 20% discount on the total price. How much will she pay if she only buys the 8 bags? How much would it cost if she bought 10 bags with the discount applied? Is it better to buy 8 bags or take advantage of the offer and buy 10? How much would she save or spend extra?
model
Okay, let's break down the costs for Laura's candy bags.

**1. Cost of 8 bags:**

* Cost p

✅ **Conclusion:** The model performed very well on all problems. It already produces chain-of-thought (CoT) reasoning out of the box.


### Data Preparation

Here you'll notice something peculiar: our dataset only contains inputs and outputs! (specifically from the **GSM8K** dataset).

But wait, we said this was different from **SFT**… yet, it looks exactly the same!

Well, we still need questions and answers to verify we're learning *something* useful. However, the key difference is that **we are not using a human preference-based reward model nor a process reward model** to embed responses into the model. Instead, we simply need a way to verify whether a response generated by the model is correct or incorrect. In other words, we just need a way to *reward* correct answers!
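In practice, "verifying" an answer can be as simple as a string comparison against the reference. The sketch below shows the idea on a made-up GSM8K-style record: the solution text and the helper names `gold_answer` and `verify` are illustrative assumptions. GSM8K solutions end with `#### <final answer>`, which is what makes this check possible.

```python
from typing import Optional

def gold_answer(solution: str) -> Optional[str]:
    """Pull the final answer that follows the '####' marker, if present."""
    if "####" not in solution:
        return None
    return solution.split("####")[1].strip()

def verify(model_answer: str, solution: str) -> float:
    """Return 1.0 when the model's answer equals the reference, else 0.0."""
    reference = gold_answer(solution)
    if reference is None:
        return 0.0
    return 1.0 if model_answer.strip() == reference else 0.0

# Made-up record in the GSM8K style (not taken from the dataset).
solution = "Each box holds 6 eggs, so 4 boxes hold 4 * 6 = 24 eggs.\n#### 24"

print(verify("24", solution))  # correct answer
print(verify("26", solution))  # wrong answer
```

No human labeler and no learned reward model is involved: the reference answer alone decides the reward.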

For now, let's see what our input dataset looks like.

> **NOTE:** The data preparation and reward code below is adapted from [Will Brown's gist](https://gist.github.com/willccbb/4676755236bb08cab5f4e54a0475d6fb).


In [None]:
import re
from datasets import load_dataset, Dataset

# Load and prep dataset
SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""

XML_COT_FORMAT = """\
<reasoning>
{reasoning}
</reasoning>
<answer>
{answer}
</answer>
"""

def extract_xml_answer(text: str) -> str:
    answer = text.split("<answer>")[-1]
    answer = answer.split("</answer>")[0]
    return answer.strip()

def extract_hash_answer(text: str) -> str | None:
    if "####" not in text:
        return None
    return text.split("####")[1].strip()

# Build chat-style prompts (system + user) and keep only the final answer as the target
def get_gsm8k_questions(split = "train") -> Dataset:
    data = load_dataset('openai/gsm8k', 'main')[split] # type: ignore
    data = data.map(lambda x: { # type: ignore
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_hash_answer(x['answer'])
    }) # type: ignore
    return data # type: ignore

dataset = get_gsm8k_questions()


As you can see from this data, there is no specific information about preferences, "how to reason," or anything similar. We simply present the question and the answer.

This is the central idea behind this training style: **we won't tell the model *how* it should think**; instead, we'll let it explore within an environment defined by the question and answer.

> **NOTE:** This is not the case for **DeepSeek-R1**, where a **small** amount of **SFT** (known as "cold-start") is used to "prime" the model before entering the **RL** training phase.


In [None]:
dataset[0]

Now we reach the *magic* of this approach: a collection of **reward functions**.

Notice that we perform a series of **"checks"**, which combine as shown in the diagram below:

![image](https://i.imgur.com/7Dp0qdt.png)

Essentially, this means we use a set of **reward functions** to evaluate whether our model is learning *as intended*, rather than providing explicit examples that dictate *how* it should learn.

These reward functions are completely **customizable**, enabling users to effectively **guide** **how** and **in which areas** the model should specialize.

| Reward Function | Purpose |
|---|---|
| `correctness_reward_func` | Rewards the model when its answer matches the correct answer |
| `int_reward_func` | Rewards the model when the extracted answer consists only of digits (a non-negative integer) |
| `strict_format_reward_func` and `soft_format_reward_func` | Reward the model for following the specified format |
| `xmlcount_reward_func` | Rewards proper XML tag usage and penalizes extra content after the closing tags |


In [None]:
# Reward functions
def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

def int_reward_func(completions, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    extracted_responses = [extract_xml_answer(r) for r in responses]
    return [0.5 if r.isdigit() else 0.0 for r in extracted_responses]

def strict_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that follow the exact line-by-line XML layout."""
    pattern = r"^<reasoning>\n.*?\n</reasoning>\n<answer>\n.*?\n</answer>\n$"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` span newlines, so multi-line reasoning can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def soft_format_reward_func(completions, **kwargs) -> list[float]:
    """Reward completions that open with both tag pairs in order, with lenient whitespace."""
    pattern = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"
    responses = [completion[0]["content"] for completion in completions]
    # re.DOTALL lets `.` span newlines, so multi-line reasoning can match
    matches = [re.match(pattern, r, flags=re.DOTALL) for r in responses]
    return [0.5 if match else 0.0 for match in matches]

def count_xml(text) -> float:
    count = 0.0
    if text.count("<reasoning>\n") == 1:
        count += 0.125
    if text.count("\n</reasoning>\n") == 1:
        count += 0.125
    if text.count("\n<answer>\n") == 1:
        count += 0.125
        count -= len(text.split("\n</answer>\n")[-1])*0.001
    if text.count("\n</answer>") == 1:
        count += 0.125
        count -= (len(text.split("\n</answer>")[-1]) - 1)*0.001
    return count

def xmlcount_reward_func(completions, **kwargs) -> list[float]:
    contents = [completion[0]["content"] for completion in completions]
    return [count_xml(c) for c in contents]
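Before spending GPU hours, it can help to run a format pattern against a hand-written completion. The sketch below is a standalone check with a made-up completion and an assumed helper name, `matches_format`. It illustrates one detail of Python's `re` module that matters for multi-line reasoning: `.` does not match a newline unless `re.DOTALL` is passed.

```python
import re

# Same shape as the lenient format pattern: tags in order, anything in between.
PATTERN = r"<reasoning>.*?</reasoning>\s*<answer>.*?</answer>"

def matches_format(text: str, dotall: bool) -> bool:
    """Check whether `text` starts with the expected tag structure."""
    flags = re.DOTALL if dotall else 0
    return re.match(PATTERN, text, flags) is not None

# Made-up completion with multi-line reasoning.
completion = (
    "<reasoning>\n"
    "3 books cost 3 * 12 = 36 euros.\n"
    "50 - 36 = 14 euros remain.\n"
    "</reasoning>\n"
    "<answer>\n"
    "14\n"
    "</answer>\n"
)

print(matches_format(completion, dotall=False))  # `.` stops at the first newline
print(matches_format(completion, dotall=True))   # `.` may span newlines
```

A pattern that never matches yields a reward of zero for every completion, so a dry run like this is a cheap way to confirm that each reward function can actually fire.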

### Train the Model

Now that we have:

1. **Training examples**
2. **Reward functions**

All that's left is to train our model!

We'll start by setting a series of **hyperparameters**.

> **NOTE:** These hyperparameters are optimized for the free **Colab T4** instance, but feel free to modify them to better suit your hardware.


### GRPOConfig

First and foremost: we have a set of **typical hyperparameters** (as usual).

You'll also notice that very few **GRPO**-specific hyperparameters are set in this implementation (essentially `num_generations` and the prompt and completion length limits). We'll stick with default values for the rest to keep this notebook manageable, but feel free to explore **TRL** and fine-tune the parameters to find the best configuration for your use case.

> **NOTE:** If you want the classic **RL** image with the "line moving up and to the right," you can remove `report_to = "none"` from the configuration below.


In [None]:
from trl import GRPOConfig, GRPOTrainer

max_prompt_length = 256
max_seq_length = 1024


training_args = GRPOConfig(
    learning_rate = 5e-6,
    adam_beta1 = 0.9,
    adam_beta2 = 0.99,
    weight_decay = 0.1,
    warmup_ratio = 0.1,
    lr_scheduler_type = "cosine",
    optim = "adamw_8bit",
    logging_steps = 1,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 1,
    num_generations = 2,
    max_prompt_length = max_prompt_length,
    max_completion_length = max_seq_length - max_prompt_length,
    num_train_epochs = 1,
    max_steps = 250,
    save_steps = 250,
    max_grad_norm = 0.1,
    report_to = "none",
)
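A few quantities implied by this configuration are worth working out by hand. The sketch below assumes a single GPU, and assumes that each prompt is repeated `num_generations` times within the batch, which is my understanding of how TRL's GRPO trainer groups completions. The batching rules have changed between TRL releases, so check the `GRPOConfig` documentation for your version. The function name `grpo_budget` is hypothetical.

```python
def grpo_budget(max_seq_length, max_prompt_length, per_device_train_batch_size,
                gradient_accumulation_steps, num_generations, max_steps,
                num_devices=1):
    """Derive token and prompt budgets from the training hyperparameters."""
    completions_per_step = (
        per_device_train_batch_size * num_devices * gradient_accumulation_steps
    )
    # Assumption: completions are grouped per prompt, so the batch must split evenly.
    if completions_per_step % num_generations != 0:
        raise ValueError("batch size must be divisible by num_generations")
    prompts_per_step = completions_per_step // num_generations
    return {
        "max_completion_length": max_seq_length - max_prompt_length,
        "completions_per_step": completions_per_step,
        "prompts_per_step": prompts_per_step,
        "prompts_seen": prompts_per_step * max_steps,
    }

budget = grpo_budget(
    max_seq_length=1024, max_prompt_length=256,
    per_device_train_batch_size=2, gradient_accumulation_steps=1,
    num_generations=2, max_steps=250,
)
for name, value in budget.items():
    print(f"{name}: {value}")
```

Under these assumptions the run compares only two completions per prompt and touches a small fraction of the 7,473 training questions, which is worth keeping in mind when reading the results.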

Finally, we can run our **trainer**!

The main idea behind this **RL-based** approach is that instead of watching the loss decrease, we want to observe how the **reward increases**.

> **NOTE:** The training has an "Aha!" moment, as described, where the reward goes from ~0 and suddenly starts increasing. This behavior is expected, but you may **not see changes in the reward column** (which reflects the combined output of our previously defined reward functions) until after **steps 100-150**.


In [None]:
processor.pad_token_id = 0
processor.bos_token_id = 1
processor.eos_token_id = 2

trainer = GRPOTrainer(
    model = model,
    processing_class = processor,
    reward_funcs = [
        xmlcount_reward_func,
        soft_format_reward_func,
        strict_format_reward_func,
        int_reward_func,
        correctness_reward_func,
    ],
    args = training_args,
    train_dataset = dataset,
)

trainer.train()


-------------------- Question:
Ahmed and Emily are having a contest to see who can get the best grade in the class. There have been 9 assignments and Ahmed has a 91 in the class. Emily has a 92. The final assignment is worth the same amount as all the other assignments. Emily got a 90 on the final assignment. What is the minimum grade Ahmed needs to get to beat Emily if all grades are whole numbers? 
Answer:
100 
Response:
<reasoning>
Let $A$ be Ahmed's grade and $E$ be Emily's grade.
Ahmed has a total grade of 91 from the first 9 assignments.
Emily has a total grade of 92 from the first 9 assignments and a final grade of 90.
So, Emily's total grade is $92 + 90 = 182$.
Ahmed needs to beat Emily, so $A > E$.
We want to find the minimum grade $A$ Ahmed needs to get on the final assignment such that $A > 182$. Since $A$ must be a whole number, we need $A \geq 183$.

However, each assignment is worth the same amount, so we can calculate the average grade for Ahmed and Emily.
Ahmed's averag

| Step | Training Loss |
|---|---|
| 1 | 0.0 |
| 2 | -0.0 |
| 3 | 0.0 |
| 4 | 0.0 |
| 5 | 0.0 |
| 6 | 0.0 |
| 7 | 0.0001 |
| 8 | 0.0 |
| 9 | 0.0 |
| 10 | 0.0 |


Streaming output truncated to the last 5000 lines.
Substituting R = 43 into the equation:
43 + 5 = 4(M + 5)
48 = 4(M + 5)
Divide both sides by 4:
12 = M + 5
Subtract 5 from both sides:
M = 12 - 5
M = 7
So, Maurice's current age is 7.

We can check our answer:
Ron's current age is 43.
Maurice's current age is 7.
In five years, Ron will be 43 + 5 = 48.
In five years, Maurice will be 7 + 5 = 12.
Is 48 = 4 * 12?
48 = 48. Yes, the equation holds.

</reasoning>
<answer>
7
</answer> 
Extracted:
7
-------------------- Question:
Cathy and Chris got summer jobs at the cake shop and were supposed to work 20 hours per week each for 2 months. During that time, Chris got sick for one week and Cathy took her shifts. If there are four weeks in a month, how many hours did Cathy work during the 2 months? 
Answer:
180 
Response:
<reasoning>
First, let's calculate the total number of weeks in two months. Since there are four weeks in a month, two months have 2 * 4 = 8 weeks.
Each p



TrainOutput(global_step=250, training_loss=3.798358620349518e-05, metrics={'train_runtime': 23752.222, 'train_samples_per_second': 0.021, 'train_steps_per_second': 0.011, 'total_flos': 0.0, 'train_loss': 3.798358620349518e-05})

# Upload version to HF

In [None]:
HF_USERNAME = "ericrisco"
MODEL_NAME = "gemma-3-4b-reasoning"
HF_REPO_ID = f"{HF_USERNAME}/{MODEL_NAME}"

merged_model = model.merge_and_unload()
merged_model.push_to_hub(HF_REPO_ID)


CommitInfo(commit_url='https://huggingface.co/ericrisco/gemma-3-4b-reasoning/commit/168875146a3bc63241dcd57f0df910abb7054f09', commit_message='Upload Gemma3ForConditionalGeneration', commit_description='', oid='168875146a3bc63241dcd57f0df910abb7054f09', pr_url=None, repo_url=RepoUrl('https://huggingface.co/ericrisco/gemma-3-4b-reasoning', endpoint='https://huggingface.co', repo_type='model', repo_id='ericrisco/gemma-3-4b-reasoning'), pr_revision=None, pr_num=None)

# Testing the Trained Model Against the Base Model

In this section, we download the fine-tuned model from Hugging Face and load it alongside the base model, so we can compare the GRPO-trained version against the untrained one.

This allows us to clearly observe performance differences on the given tasks.


In [None]:
from transformers import AutoModelForImageTextToText, AutoTokenizer
import torch

BASE_MODEL = "google/gemma-3-4b-it"
TUNED_MODEL = "ericrisco/gemma-3-4b-reasoning"

base_model = AutoModelForImageTextToText.from_pretrained(
    BASE_MODEL, device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="eager"
)
tuned_model = AutoModelForImageTextToText.from_pretrained(
    TUNED_MODEL, device_map="auto", torch_dtype=torch.bfloat16, attn_implementation="eager"
)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>
"""


In [None]:
problems = [
    "A factory produces electronic components. On Monday, it produces 1,200 components, and each day after that, production increases by 15%. If this trend continues, how many components will the factory produce on Friday?",

    "A rectangular water tank is being filled at a rate of 25 liters per minute. The tank has a base area of 4 square meters and a height of 3 meters. How long will it take to completely fill the tank if it starts empty?",

    "A farmer divides his land into two sections: one for growing wheat and the other for growing corn. The total area is 500 hectares. If the wheat section is three times the size of the corn section, what is the area of each section?",

    "A train leaves Station A at 8:00 AM traveling at 80 km/h. Another train leaves Station B at 9:30 AM traveling at 120 km/h on the same track but in the opposite direction. If the distance between the two stations is 500 km, at what time will the trains meet?"
]
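To judge the two models' outputs, it helps to have reference answers. The short script below works them out directly from the four problem statements. The variable names are illustrative, and the factory result is left unrounded because the problem does not say how to round.

```python
# 1. Factory: Monday is day 1 and Friday is day 5, so four 15% increases apply.
friday_output = 1200 * 1.15 ** 4

# 2. Tank: volume in cubic meters, converted to liters (1 m^3 = 1000 L),
#    then divided by the fill rate in liters per minute.
tank_liters = 4 * 3 * 1000
fill_minutes = tank_liters / 25

# 3. Farm: wheat = 3 * corn and wheat + corn = 500, so corn = 500 / 4.
corn = 500 / 4
wheat = 3 * corn

# 4. Trains: train A travels alone from 8:00 to 9:30, then the remaining gap
#    closes at the combined speed of 80 + 120 km/h.
head_start_km = 80 * 1.5
hours_after_930 = (500 - head_start_km) / (80 + 120)
meet_minutes = round((9.5 + hours_after_930) * 60)  # minutes after midnight
meet_time = f"{meet_minutes // 60}:{meet_minutes % 60:02d}"

print(f"Friday production: {friday_output:.4f} components")
print(f"Tank fill time: {fill_minutes:.0f} minutes ({fill_minutes / 60:.0f} hours)")
print(f"Corn: {corn:.0f} ha, wheat: {wheat:.0f} ha")
print(f"Trains meet at {meet_time} AM")
```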

In [None]:
for problem in problems:
    text = tokenizer.apply_chat_template([
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": problem},
    ], tokenize=False, add_generation_prompt=True)

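    # apply_chat_template already adds <bos>; skip special tokens to avoid a duplicate
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False).to("cuda")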

    with torch.no_grad():
        output_base_tokens = base_model.generate(**inputs, max_new_tokens=1200)
    output_base_text = tokenizer.decode(output_base_tokens[0], skip_special_tokens=True)

    with torch.no_grad():
        output_tuned_tokens = tuned_model.generate(**inputs, max_new_tokens=1200)
    output_tuned_text = tokenizer.decode(output_tuned_tokens[0], skip_special_tokens=True)

    print(f"🔹 **Problem:** {problem}\n")
    print(f"🟦 **Base model ({BASE_MODEL}):**\n{output_base_text}\n")
    print(f"🟩 **Reasoning model ({TUNED_MODEL}):**\n{output_tuned_text}\n")
    print("=" * 100, "\n")


🔹 **Problem:** A factory produces electronic components. On Monday, it produces 1,200 components, and each day after that, production increases by 15%. If this trend continues, how many components will the factory produce on Friday?

🟦 **Base model (google/gemma-3-4b-it):**
user

Respond in the following format:
<reasoning>
...
</reasoning>
<answer>
...
</answer>


A factory produces electronic components. On Monday, it produces 1,200 components, and each day after that, production increases by 15%. If this trend continues, how many components will the factory produce on Friday?
model
<reasoning>
The problem states that the factory produces 1200 components on Monday and that production increases by 15% each day. We need to find the production on Friday.
Monday is day 1, Tuesday is day 2, Wednesday is day 3, Thursday is day 4, and Friday is day 5.
We can calculate the production for each day as follows:
Monday: 1200
Tuesday: 1200 * 1.15 = 1380
Wednesday: 1380 * 1.15 = 1587
Thursday: 158

The results suggest that Gemma-3 may already be well optimized for structured reasoning, since both the instruction-tuned model and our GRPO-tuned model solve these multi-step problems. If so, reasoning-focused training of this kind may be part of what makes small models such as Gemma-3 so efficient and capable. This could help explain why Gemma-3 is among the strongest models that can run in resource-limited environments.