# Illustrating DeepSeek-R1-Zero's Reinforcement Learning Innovations

copyright 2025, Denis Rothman

*Reference*:
[DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via
Reinforcement Learning, January 2025](https://arxiv.org/pdf/2501.12948)

This notebook provides a simplified, educational example to help understand the core concepts behind DeepSeek-R1-Zero's approach of using Reinforcement Learning (RL) directly on a base model, as described in Section 2.2 of the [DeepSeek-R1 paper](https://arxiv.org/pdf/2501.12948).

We will use the calculation of Gini impurity, a common concept in decision trees and machine learning, as our illustrative task.

**Disclaimer**: This example uses a highly simplified "model" and "task" for illustrative purposes. It does not represent the complexity or scale of actual LLM training. The goal is to demonstrate the ideas of rule-based rewards and iterative learning in an RL context using a more relevant technical example.

**I. Background: Pure RL**

Traditional approaches to training powerful language models often involve a multi-stage process:

**1. Pre-training**: Training on a massive text dataset to learn language fundamentals.

**2. Supervised Fine-Tuning (SFT)**: Training on a smaller dataset of high-quality examples (e.g., question-answer pairs, instructions) to align the model's output with desired formats and behaviors.

**3. Reinforcement Learning (RL)**: Further fine-tuning using feedback (rewards) to optimize performance on specific tasks or align with human preferences.

 *The DeepSeek-R1 paper introduces DeepSeek-R1-Zero, which skips the SFT step and applies large-scale RL directly to a pre-trained base model (DeepSeek-V3-Base).*

 The paper shows that this pure RL approach can incentivize the model to develop impressive reasoning capabilities on its own, driven by the reward signal.

**II. Key aspects of DeepSeek-R1-Zero's RL approach (from Section 2.2)**:

**1. Pure RL**: **No initial SFT phase**.

**2. Rule-Based Rewards**: Using simple, verifiable rules (like correctness of a final answer or adherence to a specific output format) to provide feedback, rather than complex learned reward models.

**3. Training Template**: Guiding the model's output structure (e.g., including a thinking process before the final answer).

**4. Self-Evolution**: The model naturally learns to improve its reasoning process over iterations to maximize rewards.

 Let's simulate these ideas with a simple example using Gini impurity.

**Simplified Example: Calculating Gini Impurity**

Our simplified task will be calculating the Gini impurity for a small dataset split.

Gini impurity is calculated as

$$
  G = 1 - \sum_{i=1}^{C} p_i^2
$$

where $p_i$ is the proportion of instances belonging to class $i$ in the subset, and $C$ is the number of classes.

 Consider a node in a decision tree that has been split, resulting in a subset of data with the following class distribution:

* Class A: 4 instances
* Class B: 6 instances
* Total instances = 10

Proportion of Class A: $p_A = 4/10 = 0.4$

Proportion of Class B: $p_B = 6/10 = 0.6$

$$
\begin{aligned}
  G &= 1 - p_A^2 - p_B^2 \\
    &= 1 - (0.4)^2 - (0.6)^2 \\
    &= 1 - 0.16 - 0.36 \\
    &= 1 - 0.52 = 0.48
\end{aligned}
$$

So, the correct answer is 0.48.

Our simplified "model" will be a function that tries to generate a response following a specific template.

 Our rule-based reward function will evaluate the response based on format and correctness of the calculated Gini impurity.

 The desired template is similar to the one used in DeepSeek-R1-Zero (Table 1 in the paper):

`<think> [reasoning process] </think> <answer> [final answer] </answer>`
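
To make the template concrete, here is a minimal sketch of how a response could be checked against it and split into its two parts. The names `TEMPLATE` and `parse_response` are hypothetical helpers introduced only for this illustration; they are not taken from the paper, which describes the template but not a parser.

```python
import re

# Hypothetical pattern for the <think>...</think> <answer>...</answer> template.
# re.DOTALL lets the reasoning span several lines.
TEMPLATE = re.compile(r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", re.DOTALL)

def parse_response(response):
    """Return (reasoning, answer) if the whole response follows the template, else None."""
    match = TEMPLATE.fullmatch(response.strip())
    if match is None:
        return None
    reasoning, answer = match.groups()
    return reasoning.strip(), answer.strip()

# A well-formed response is split into its two parts.
print(parse_response("<think>p_A=0.4, p_B=0.6</think> <answer>0.48</answer>"))
# A response that ignores the template is rejected.
print(parse_response("The answer is 0.48"))
```

The first call returns the reasoning and answer strings as a tuple, and the second returns `None` because neither tag is present.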

 Let's define the components in Python.


## Introduction to the simplified RL Example in Python

In this notebook we simulate the core ideas from Section 2.2 of the *DeepSeek-R1* paper, demonstrating how **pure reinforcement learning** (RL) with **rule-based rewards** can drive a base model to improve its reasoning over time. We choose a very simple task—calculating the Gini impurity of a two-class split—to focus on the RL mechanics rather than on large-scale model details.

### 1. Gini Impurity Calculation
```python
def gini_impurity(counts):
    total = sum(counts.values())
    return 1 - sum((count / total) ** 2 for count in counts.values())
```

* **Definition**
  Gini impurity measures how often a randomly chosen element from a set would be misclassified if labeled according to the class distribution.
* **Formula**

  $$
    G = 1 - \sum_{i=1}^{C} p_i^2
  $$

  where $p_i$ is the proportion of class *i* in the node.
* **Usage**
  We compute the “true” Gini for our toy split (4 of A, 6 of B) as

  $$
    1 - (0.4^2 + 0.6^2) = 0.48.
  $$
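
As a quick check of the formula, the sketch below repeats the `gini_impurity` definition so that it runs on its own and evaluates it on the toy split plus a few extra distributions. The extra distributions (a pure node, a balanced node, and a three-class node) are assumptions added for illustration and do not appear elsewhere in this notebook.

```python
def gini_impurity(counts):
    """Gini impurity of a node given a {class: count} mapping."""
    total = sum(counts.values())
    return 1 - sum((count / total) ** 2 for count in counts.values())

toy = gini_impurity({"A": 4, "B": 6})             # the split used in this notebook
pure = gini_impurity({"A": 10})                   # one class only: no impurity
balanced = gini_impurity({"A": 5, "B": 5})        # maximum impurity for two classes
three = gini_impurity({"A": 2, "B": 3, "C": 5})   # 1 - (0.04 + 0.09 + 0.25)

print(round(toy, 2), round(pure, 2), round(balanced, 2), round(three, 2))
```

A pure node scores 0, and a balanced two-class node scores 0.5, which is the largest value two classes can produce.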

### 2. SimpleModel: A Noisy “Policy”

```python
class SimpleModel:
    def __init__(self, noise_level=0.5):
        self.noise_level = noise_level

    def generate_response(self, counts):
        # Outline only; the full implementation is in the code cell at the end.
        # 1) Build a <think>...</think> reasoning string.
        # 2) Add uniform noise in [-noise_level, +noise_level] to the true Gini.
        # 3) Round and wrap the result in <answer>...</answer>.
        ...

    def update(self, reward):
        # Reduce noise_level by a factor proportional to the reward.
        ...
```

* **Analogy to an LLM**
  The model’s `generate_response` stands in for a large language model composing a chain-of-thought:

  ```
  <think>… reasoning …</think> <answer>… value …</answer>
  ```
* **Noise parameter**
  Controls how far the model’s answer can deviate from the true Gini.
* **Update rule**
  Whenever the model receives reward, it **reduces** its noise, thereby making future answers more precise.
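
The effect of the update rule can be sketched in isolation. The helper `decayed_noise` below is a hypothetical function written for this illustration; it applies the same multiplicative rule as `SimpleModel.update` in the full code cell, `noise_level *= (1 - lr * reward)` with `lr = 0.1`, to a list of rewards chosen by hand.

```python
def decayed_noise(initial_noise, rewards, lr=0.1):
    """Apply the multiplicative update once per reward and return the final noise level."""
    noise = initial_noise
    for reward in rewards:
        noise *= (1 - lr * reward)
        noise = max(0.0, min(1.0, noise))  # keep the noise level in [0, 1]
    return noise

# Zero reward leaves the noise level untouched.
print(decayed_noise(0.5, [0.0] * 5))
# Ten perfect rewards shrink it by a factor of 0.9 each time: 0.5 * 0.9**10.
print(round(decayed_noise(0.5, [1.0] * 10), 4))
```

Because the noise only shrinks when the reward is positive, a model that rarely lands within the tolerance improves slowly at first.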

### 3. Rule-Based Reward Function

```python
def reward_fn(response, true_value, tol=0.1):
    # Outline only; the full implementation is in the code cell at the end.
    # 1) Check that the response follows the exact <think>/<answer> template.
    # 2) Extract the numeric answer.
    # 3) If the answer is within ±tol, return (1 - error/tol), else 0.
    ...
```

* **Template enforcement**
  Returns zero reward unless the response contains a reasoning block followed by a final answer block.
* **Continuous reward shaping**
  Gives partial credit for near-correct answers:

  $$
    \text{reward} = \max\bigl(0,\;1 - \tfrac{|\,\hat G - G\,|}{\text{tol}}\bigr).
  $$
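
The shaping formula is easier to read with a few numbers plugged in. The sketch below isolates the numeric part of the reward in a hypothetical helper, `shaped_reward`, and skips the template check; the sample predictions are arbitrary values chosen for illustration.

```python
def shaped_reward(predicted, true_value, tol=0.1):
    """Linear partial credit: 1 at zero error, falling to 0 once the error reaches tol."""
    error = abs(predicted - true_value)
    return max(0.0, 1 - error / tol)

true_gini = 0.48
for predicted in (0.48, 0.50, 0.53, 0.62):
    print(f"predicted={predicted:.2f} reward={shaped_reward(predicted, true_gini):.2f}")
```

An exact answer earns the full reward of 1, an answer that is off by half the tolerance earns about 0.5, and anything beyond the tolerance earns nothing.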

### 4. Training Loop

```python
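N = 50  # number of epochs, matching the full code cell
class_counts = {'A': 4, 'B': 6}
true_gini = gini_impurity(class_counts)

model = SimpleModel(noise_level=0.5)
for epoch in range(1, N + 1):
    response = model.generate_response(class_counts)
    r = reward_fn(response, true_gini, tol=0.1)
    model.update(r)
    # (optional) log noise & reward every few epochs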
```

* **Pure RL**
  Notice there is **no supervised fine-tuning** step. We apply the reward function directly to the base model.
* **Self-evolution**
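  Over many iterations, the model steadily **reduces its noise** whenever it earns a reward, a loose analogy to how DeepSeek-R1-Zero improves its reasoning purely from rule-based feedback. Note that the toy model's "chain of thought" template is hard-coded, so only the accuracy of the answer is learned here.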
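
How quickly the loop can learn depends on how often the noisy answer lands within the tolerance. As a rough sketch, assume the error is exactly uniform on `[-noise_level, +noise_level]` and ignore the rounding to two decimals; both are simplifying assumptions. The average shaped reward then has a closed form, computed by the hypothetical helper `expected_reward` below.

```python
def expected_reward(noise_level, tol=0.1):
    """Mean of max(0, 1 - |x| / tol) for x uniform on [-noise_level, +noise_level]."""
    if noise_level <= 0:
        return 1.0  # no noise: the answer is always exact
    if noise_level >= tol:
        # Only a fraction tol / noise_level of draws earn any reward,
        # and those draws earn 0.5 on average.
        return tol / (2 * noise_level)
    # Every draw is within tolerance; the mean absolute error is noise_level / 2.
    return 1 - noise_level / (2 * tol)

for noise_level in (0.5, 0.3, 0.1, 0.05):
    print(f"noise={noise_level:.2f} expected reward={expected_reward(noise_level):.3f}")
```

Under these assumptions the starting noise level of 0.5 gives an average reward of only 0.1 per epoch, so the multiplicative update shrinks the noise slowly in the early epochs.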

---

### How to Experiment Further

1. **Adjust the tolerance** (`tol`) to make the reward harder or easier to attain.
2. **Vary the learning rate** (`lr` in `SimpleModel.update`, the noise reduction per reward point).
3. **Increase the number of classes** or data splits to see how the template scales.
4. **Log intermediate predictions** (or plot convergence curves) to visualize learning dynamics.
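
One possible way to run the first two experiments is to wrap the loop in a function that takes the tolerance and learning rate as parameters. The sketch below is a compact, hypothetical re-implementation (`run_experiment` is not part of the notebook's code): it hard-codes the true Gini of 0.48, skips the template formatting, and uses its own `random.Random` instance so that each run is reproducible.

```python
import random

def run_experiment(tol=0.1, lr=0.1, epochs=50, noise_level=0.5, seed=42):
    """Run the toy loop once; return the final noise level and the reward history."""
    rng = random.Random(seed)  # private generator, independent of the global seed
    true_gini = 0.48           # Gini impurity of the 4/6 split
    rewards = []
    for _ in range(epochs):
        # "Model" step: the true value plus uniform noise, rounded to two decimals.
        predicted = round(true_gini + rng.uniform(-noise_level, noise_level), 2)
        # Rule-based reward: linear partial credit within the tolerance.
        error = abs(predicted - true_gini)
        reward = max(0.0, 1 - error / tol)
        # Update step: shrink the noise in proportion to the reward.
        noise_level = max(0.0, min(1.0, noise_level * (1 - lr * reward)))
        rewards.append(reward)
    return noise_level, rewards

for tol in (0.05, 0.1, 0.2):
    final_noise, rewards = run_experiment(tol=tol)
    mean_reward = sum(rewards) / len(rewards)
    print(f"tol={tol:.2f} | final noise={final_noise:.3f} | mean reward={mean_reward:.3f}")
```

Collecting the reward history in a list also covers the fourth suggestion, since the list can be printed or plotted after the run.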

This minimal example illustrates the idea of applying RL directly to a base model with simple, verifiable rewards and a fixed output template, the same recipe that DeepSeek-R1-Zero applies at scale to a real language model.


```python
import re
import random

# Set seed for reproducibility
random.seed(42)

def gini_impurity(counts):
    """Gini impurity of a node given a {class: count} mapping."""
    total = sum(counts.values())
    return 1 - sum((count / total) ** 2 for count in counts.values())

# Task parameters
class_counts = {'A': 4, 'B': 6}
true_gini = gini_impurity(class_counts)

# Matches a signed decimal or integer inside a trailing <answer> block
ANSWER_PATTERN = re.compile(r"<answer>([-+]?(?:\d*\.\d+|\d+))</answer>$")

class SimpleModel:
    def __init__(self, noise_level=0.5):
        self.noise_level = noise_level

    def generate_response(self, counts):
        # Build the reasoning text for the <think> block
        total = sum(counts.values())
        reasoning = (f"Node has counts {counts}. Total = {total}. "
                     "Proportions: " +
                     ", ".join(f"{cls}={count/total:.2f}"
                               for cls, count in counts.items()))
        # The answer is the true Gini plus uniform noise, rounded to 2 decimals
        base = gini_impurity(counts)
        noise = random.uniform(-self.noise_level, self.noise_level)
        answer = round(base + noise, 2)
        return f"<think>{reasoning}</think> <answer>{answer}</answer>"

    def update(self, reward):
        # Shrink the noise in proportion to the reward, then clamp to [0, 1]
        lr = 0.1
        self.noise_level *= (1 - lr * reward)
        self.noise_level = max(0, min(1, self.noise_level))

def reward_fn(response, true_value, tol=0.1):
    # 1) Format rule: the response must follow the <think>/<answer> template
    if not (response.startswith("<think>") and "</think>" in response and
            "<answer>" in response and response.endswith("</answer>")):
        return 0.0

    # 2) Extract the numeric answer
    match = ANSWER_PATTERN.search(response)
    if not match:
        return 0.0
    predicted = float(match.group(1))

    # 3) Accuracy rule: linear partial credit within the tolerance
    error = abs(predicted - true_value)
    if error > tol:
        return 0.0
    return 1 - (error / tol)

# Initialize model
model = SimpleModel(noise_level=0.5)
print(f"True Gini impurity: {true_gini:.2f}")

# Show initial response before any updates
initial_response = model.generate_response(class_counts)
match = ANSWER_PATTERN.search(initial_response)
initial_pred = float(match.group(1)) if match else float("nan")
print(f"Initial model response: {initial_response}")
print(f"Initial predicted value: {initial_pred:.2f}\n")

# Training loop summary
for epoch in range(1, 51):
    response = model.generate_response(class_counts)
    reward = reward_fn(response, true_gini, tol=0.1)
    model.update(reward)
    if epoch % 10 == 0:
        print(f"Epoch {epoch:2d} | Noise: {model.noise_level:.3f} | Last reward: {reward:.2f}")
```

Output:

```
True Gini impurity: 0.48
Initial model response: <think>Node has counts {'A': 4, 'B': 6}. Total = 10. Proportions: A=0.40, B=0.60</think> <answer>0.62</answer>
Initial predicted value: 0.62

Epoch 10 | Noise: 0.490 | Last reward: 0.00
Epoch 20 | Noise: 0.407 | Last reward: 0.00
Epoch 30 | Noise: 0.398 | Last reward: 0.00
Epoch 40 | Noise: 0.321 | Last reward: 0.00
Epoch 50 | Noise: 0.308 | Last reward: 0.20
```
