In [None]:
#@title üéß Download Narration Audio & Play Introduction
import os as _os
if not _os.path.exists("/content/narration"):
    !pip install -q gdown
    import gdown
    gdown.download(id="1RJjttCvltRK-j5XaI_Tp752cibGKRYMf", output="/content/narration.zip", quiet=False)
    !unzip -q /content/narration.zip -d /content/narration
    !rm /content/narration.zip
    print(f"Loaded {len(_os.listdir('/content/narration'))} narration segments")
else:
    print("Narration audio already loaded.")

from IPython.display import Audio, display
display(Audio("/content/narration/02_00_intro.mp3"))


In [None]:
#@title üéß Code Walkthrough: Setup
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_01_setup.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


In [None]:
# 🔧 Setup: Run this cell first!
# Check GPU availability and install dependencies

import torch
import sys

# Check GPU
if torch.cuda.is_available():
    device = torch.device('cuda')
    print(f"✅ GPU available: {torch.cuda.get_device_name(0)}")
    # The device-properties attribute is `total_memory` (in bytes), not `total_mem`
    print(f"   Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
else:
    device = torch.device('cpu')
    print("⚠️ No GPU detected. Some cells may run slowly.")
    print("   Go to Runtime → Change runtime type → GPU")

print(f"\n📦 Python {sys.version.split()[0]}")
print(f"🔥 PyTorch {torch.__version__}")

# Set random seeds for reproducibility
import random
import numpy as np

SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

print(f"🎲 Random seed set to {SEED}")

%matplotlib inline

# 🚀 Binary RL with GRPO-TCR: Scoring Responses and Training from Feedback

*Part 2 of the Vizuara series on OpenClaw-RL*
*Estimated time: 50 minutes*

# 🤖 AI Teaching Assistant

Need help with this notebook? Open the **AI Teaching Assistant**. It has already read this entire notebook and can help with concepts, code, and exercises.

**[👉 Open AI Teaching Assistant](https://pods.vizuara.ai/courses/openclaw-rl/practice/2/assistant)**

*Tip: Open it in a separate tab and work through this notebook side-by-side.*


In [None]:
#@title üéß Listen: Why This Matters
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_02_why_this_matters.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 1. Why Does This Matter?

In the previous notebook, we built a system that extracts training samples from conversations. Each sample tells us what the assistant said and how the user reacted.

But how do we actually **use** this data to improve the model?

We need two things:
1. A way to **score** each response (Was it good? Bad? Neutral?)
2. A way to **update** the model so it produces more good responses and fewer bad ones

This is where **Binary RL with GRPO-TCR** comes in. GRPO (Group-Relative Policy Optimization) eliminates the need for a critic network, cutting memory and compute costs by roughly 50% because no separate value model has to be stored or trained. The TCR recipe (Token-level loss + Clip-higher + Reward shaping) makes it work reliably in practice.

By the end of this notebook, you will have implemented:
- A **Process Reward Model (PRM)** with majority voting
- **GRPO advantage computation** from scratch
- The **clip-higher** asymmetric clipping mechanism
- **Overlong reward shaping** for length control
- The complete **GRPO-TCR loss function**
- A working training loop on synthetic conversation data

In [None]:
# 🎯 Teaser: We'll train a model where responses improve over iterations
# Starting: random rewards, no learning
# After GRPO-TCR: model learns to prefer good responses over bad ones!

In [None]:
#@title üéß Listen: Building Intuition
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_03_building_intuition.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 2. Building Intuition

Let us build some intuition before we touch any math.

Imagine you are a teacher grading essays. You have 4 students who each wrote an essay on the same topic.

- Student A wrote an excellent essay → Grade: A+
- Student B wrote a terrible essay → Grade: F
- Student C wrote a good essay → Grade: A
- Student D wrote an average essay → Grade: C

Now, instead of giving absolute grades, you **rank them relative to each other**:
- A and C are above average → encourage this kind of writing
- B is far below average → strongly discourage this
- D is slightly below average → mildly discourage this

This is the core idea of GRPO: **group-relative advantages**. We do not need an absolute standard (a critic network). We just need to know which responses in the group were better than others.

But how do we get those initial grades? In OpenClaw-RL, a **Process Reward Model (PRM)** acts as the grader. It is a separate language model that looks at the (response, user feedback) pair and decides: good, neutral, or bad.

One evaluation can be noisy (maybe the PRM had a bad day). So we run it **multiple times** and take a **majority vote**, just like having multiple teachers grade the same essay.

### 🤔 Think About This

If the PRM is correct 70% of the time, and we run 5 evaluations, what is the probability that the majority vote gives the correct answer? (Hint: it is significantly higher than 70%.)

In [None]:
#@title üéß Listen: Mathematics
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_04_mathematics.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 3. The Mathematics

### 3.1 PRM Majority Voting

For each (response, next-state) pair, the PRM runs $m$ evaluations and produces votes $v_1, v_2, \ldots, v_m \in \{-1, 0, +1\}$.

The final reward is determined by majority vote:

$$r = \text{mode}(\{v_1, v_2, \ldots, v_m\})$$

Computationally: count how many votes are +1, 0, and -1. Whichever category has the most votes wins.

For example, with $m = 5$ votes: $[+1, +1, -1, +1, 0]$. Count: three +1, one -1, one 0. The majority is +1, so $r = +1$.
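
To make the counting rule concrete, here is a minimal sketch that reproduces the example above and then estimates how much voting helps. It rests on a simplifying assumption that is not part of the original setup: each vote is treated as simply right or wrong (two outcomes), and the vote counts as correct only when more than half of the votes are right. With three labels a plurality can win with fewer votes, so the real figure will differ somewhat. The helper names `majority_vote` and `majority_correct_prob` are hypothetical.

```python
from collections import Counter
from math import comb


def majority_vote(votes):
    """Return the most frequent label; ties go to the label seen first."""
    return Counter(votes).most_common(1)[0][0]


def majority_correct_prob(p, m):
    """P(more than half of m independent votes are correct), each correct w.p. p.

    Simplifying assumption: a vote is either right or wrong (two outcomes).
    """
    need = m // 2 + 1  # smallest count that is a strict majority
    return sum(comb(m, k) * p**k * (1 - p) ** (m - k) for k in range(need, m + 1))


example_votes = [+1, +1, -1, +1, 0]
print(majority_vote(example_votes))             # 1
print(round(majority_correct_prob(0.7, 5), 4))  # 0.8369
```

Under this simplification, five votes at 70% single-vote accuracy give about 84%, which is one way to check your answer to the question in the previous section.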

### 3.2 Group-Relative Advantages

Given $G$ responses to the same prompt with rewards $r_1, r_2, \ldots, r_G$, the normalized advantage is:

$$\hat{A}_i = \frac{r_i - \text{mean}(\{r_1, \ldots, r_G\})}{\text{std}(\{r_1, \ldots, r_G\})}$$

Computationally: subtract the group mean from each reward, then divide by the standard deviation. This centers the advantages around zero, so the model knows which responses were above average and which were below.
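
The formula leaves one detail open: which standard deviation to use. The sketch below works the formula by hand for the rewards `[1, -1, 1, 0]` with both conventions, using only the standard library. The function name `group_advantages` and the `population` flag are assumptions made for this illustration. For reference, `torch.Tensor.std()` defaults to the sample (Bessel-corrected) estimate, while the rounded advantages `[0.9, -1.5, 0.9, -0.3]` used in the TODO 1 verification cell correspond to the population version.

```python
from statistics import mean, pstdev, stdev


def group_advantages(rewards, population=True):
    """Normalize rewards within a group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards) if population else stdev(rewards)
    if sigma < 1e-8:  # all rewards identical: the group carries no signal
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


rewards = [1.0, -1.0, 1.0, 0.0]
pop_adv = group_advantages(rewards, population=True)
sample_adv = group_advantages(rewards, population=False)
print([round(a, 2) for a in pop_adv])     # [0.9, -1.51, 0.9, -0.3]
print([round(a, 2) for a in sample_adv])  # [0.78, -1.31, 0.78, -0.26]
```

Either convention keeps the sign and ordering of the advantages; only the scale changes, which matters mostly for small groups.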

### 3.3 The GRPO Clipped Surrogate Loss

$$J_{\text{GRPO}}(\theta) = \mathbb{E}\left[\min\left(\rho_t \hat{A}_t,\; \text{clip}(\rho_t, 1-\epsilon_{\text{low}}, 1+\epsilon_{\text{high}}) \hat{A}_t\right) - \beta D_{\text{KL}}(\pi_\theta \| \pi_{\text{ref}})\right]$$

where $\rho_t = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)}$ is the probability ratio between the new and reference policies.

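Computationally: we compute how much the current policy differs from the reference policy at each token. In this notebook the reference policy also stands in for the "old" policy of standard PPO. If the ratio moves too far from 1 (too high or too low), the clip mechanism caps that token's contribution so the update cannot be too aggressive. This keeps training stable.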
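
As a rough numeric illustration of the clipped term (leaving out the KL penalty), the sketch below evaluates a single token for three cases. The function name `clipped_term` and the example probabilities are made up for this illustration; the bounds 0.2 and 0.28 are the defaults used later in this notebook.

```python
import math


def clipped_term(log_p_new, log_p_ref, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate: min(rho * A, clip(rho) * A)."""
    rho = math.exp(log_p_new - log_p_ref)
    rho_clipped = min(max(rho, 1.0 - eps_low), 1.0 + eps_high)
    return min(rho * advantage, rho_clipped * advantage)


# Ratio 1.5 with a positive advantage: capped at 1.28 * A
print(round(clipped_term(math.log(0.15), math.log(0.10), advantage=1.0), 4))   # 1.28
# Ratio 0.5 with a negative advantage: capped at 0.8 * A
print(round(clipped_term(math.log(0.05), math.log(0.10), advantage=-1.0), 4))  # -0.8
# Ratio 1.5 with a negative advantage: not clipped, the full penalty applies
print(round(clipped_term(math.log(0.15), math.log(0.10), advantage=-1.0), 4))  # -1.5
```

The third case shows why the outer `min` matters: clipping only removes the incentive to move further in the direction the advantage already favors, and never hides a move in the wrong direction.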

### 3.4 Overlong Reward Shaping

$$r_{\text{length}}(y) = \begin{cases} 0 & \text{if } |y| \leq L_{\max} - L_{\text{cache}} \\ \frac{(L_{\max} - L_{\text{cache}}) - |y|}{L_{\text{cache}}} & \text{if } L_{\max} - L_{\text{cache}} < |y| \leq L_{\max} \\ -1 & \text{if } |y| > L_{\max} \end{cases}$$

Computationally: responses within the safe zone get no penalty. Responses approaching the limit get a linearly increasing penalty. Responses exceeding the limit get the maximum penalty of -1.

In [None]:
#@title üéß Transition: Setup Imports
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_05_setup_imports.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 4. Let's Build It: Component by Component

### 4.1 Setup and Imports

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import matplotlib.pyplot as plt
from collections import Counter
from typing import List, Tuple

# Set seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"✅ Using device: {device}")

In [None]:
#@title üéß Code Walkthrough: Prm Majority Voting
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_06_prm_majority_voting.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


### 4.2 Process Reward Model (PRM) with Majority Voting

Let us build a simulated PRM. In the real system, the PRM is a separate language model. Here, we simulate it with a noisy oracle that is correct ~70% of the time.

In [None]:
class ProcessRewardModel:
    """
    Simulated Process Reward Model (PRM) with majority voting.
    In production, this would be a separate language model.
    """

    def __init__(self, accuracy: float = 0.7, num_votes: int = 5):
        """
        Args:
            accuracy: Probability of the PRM giving the correct verdict
            num_votes: Number of evaluations (m) for majority voting
        """
        self.accuracy = accuracy
        self.num_votes = num_votes

    def evaluate_single(self, true_quality: int) -> int:
        """
        Single PRM evaluation. Returns +1 (good), 0 (neutral), or -1 (bad).

        Args:
            true_quality: The actual quality of the response (-1, 0, or +1)
        """
        if np.random.random() < self.accuracy:
            return true_quality  # Correct evaluation
        else:
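            # Random incorrect evaluation: pick one of the two wrong labels
            options = [-1, 0, 1]
            options.remove(true_quality)
            # Cast to a plain int so votes print cleanly and match the annotation
            return int(np.random.choice(options))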

    def evaluate_with_majority_voting(self, true_quality: int) -> Tuple[int, List[int]]:
        """
        Run m evaluations and return majority vote.

        Args:
            true_quality: The actual quality of the response

        Returns:
            (final_reward, list_of_individual_votes)
        """
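        votes = [self.evaluate_single(true_quality) for _ in range(self.num_votes)]
        vote_counts = Counter(votes)
        # most_common() orders equal counts by first occurrence, so a tie
        # (e.g. a 2-2-1 split) resolves to whichever tied label was voted first
        majority_reward = vote_counts.most_common(1)[0][0]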
        return majority_reward, votes

# Test the PRM
prm = ProcessRewardModel(accuracy=0.7, num_votes=5)

# Simulate evaluating a good response
print("Evaluating a GOOD response (true quality = +1):")
for trial in range(5):
    reward, votes = prm.evaluate_with_majority_voting(true_quality=+1)
    print(f"  Trial {trial+1}: votes={votes} → majority={reward:+d}")

print("\nEvaluating a BAD response (true quality = -1):")
for trial in range(5):
    reward, votes = prm.evaluate_with_majority_voting(true_quality=-1)
    print(f"  Trial {trial+1}: votes={votes} → majority={reward:+d}")

### 📊 Visualization: How Majority Voting Improves Accuracy

In [None]:
def measure_majority_voting_accuracy(prm_accuracy, num_votes_list, num_trials=1000):
    """Measure how majority voting improves PRM accuracy."""
    results = {}
    for num_votes in num_votes_list:
        prm = ProcessRewardModel(accuracy=prm_accuracy, num_votes=num_votes)
        correct = 0
        for _ in range(num_trials):
            true_quality = np.random.choice([-1, 0, 1])
            reward, _ = prm.evaluate_with_majority_voting(true_quality)
            if reward == true_quality:
                correct += 1
        results[num_votes] = correct / num_trials
    return results

# Test with different numbers of votes
num_votes_list = [1, 3, 5, 7, 9, 11, 15, 21]
results = measure_majority_voting_accuracy(0.7, num_votes_list)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(list(results.keys()), list(results.values()), 'o-',
        color='#3498db', linewidth=2, markersize=8)
ax.axhline(y=0.7, color='#e74c3c', linestyle='--', label='Single PRM accuracy (70%)')
ax.set_xlabel('Number of Majority Votes (m)', fontsize=12)
ax.set_ylabel('Effective Accuracy', fontsize=12)
ax.set_title('Majority Voting Dramatically Improves PRM Accuracy', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
ax.set_ylim(0.5, 1.0)
plt.tight_layout()
plt.show()

print("Effective accuracy with majority voting:")
for m, acc in results.items():
    improvement = acc - 0.7
    print(f"  m={m:2d}: {acc:.1%} ({improvement:+.1%} vs single eval)")

In [None]:
#@title üéß Code Walkthrough: Grpo Advantages
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_07_grpo_advantages.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


### 4.3 GRPO Advantage Computation

Now let us implement the group-relative advantage normalization:

In [None]:
def compute_grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """
    Compute group-relative advantages.

    Args:
        rewards: Tensor of shape (G,) with rewards for G responses

    Returns:
        advantages: Tensor of shape (G,) with normalized advantages
    """
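    mean = rewards.mean()
    # torch's std() defaults to the sample (Bessel-corrected) estimate;
    # pass unbiased=False to get the population standard deviation instead.
    std = rewards.std()

    # Avoid division by zero (all rewards equal) and a NaN std (single response)
    if torch.isnan(std) or std < 1e-8:
        return torch.zeros_like(rewards)

    advantages = (rewards - mean) / std
    return advantages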

# Worked example from the article
rewards = torch.tensor([1.0, -1.0, 1.0, 0.0])

advantages = compute_grpo_advantages(rewards)

print("GRPO Advantage Computation:")
print(f"  Rewards:    {rewards.tolist()}")
print(f"  Mean:       {rewards.mean():.2f}")
print(f"  Std:        {rewards.std():.2f}")
print(f"  Advantages: {[f'{a:.2f}' for a in advantages.tolist()]}")
print()
print("  Response 1 (r=+1): Advantage = {:.2f} → INCREASE probability".format(advantages[0]))
print("  Response 2 (r=-1): Advantage = {:.2f} → DECREASE probability".format(advantages[1]))
print("  Response 3 (r=+1): Advantage = {:.2f} → INCREASE probability".format(advantages[2]))
print("  Response 4 (r= 0): Advantage = {:.2f} → Slightly decrease".format(advantages[3]))

### 📊 Visualization: How Advantages Distribute Across a Group

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Left: Rewards vs Advantages
x = np.arange(len(rewards))
width = 0.35
axes[0].bar(x - width/2, rewards.numpy(), width, label='Raw Rewards',
            color='#3498db', alpha=0.8)
axes[0].bar(x + width/2, advantages.numpy(), width, label='GRPO Advantages',
            color='#e74c3c', alpha=0.8)
axes[0].set_xlabel('Response Index')
axes[0].set_title('Raw Rewards vs Group-Relative Advantages', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].axhline(y=0, color='black', linewidth=0.5)
axes[0].set_xticks(x)
axes[0].grid(True, alpha=0.3)

# Right: Distribution over many groups
all_advantages = []
for _ in range(500):
    r = torch.tensor(np.random.choice([-1.0, 0.0, 1.0], size=8))
    a = compute_grpo_advantages(r)
    all_advantages.extend(a.tolist())

axes[1].hist(all_advantages, bins=50, color='#9b59b6', alpha=0.7, edgecolor='white')
axes[1].set_xlabel('Advantage Value')
axes[1].set_ylabel('Count')
axes[1].set_title('Distribution of GRPO Advantages (500 groups)', fontsize=12, fontweight='bold')
axes[1].axvline(x=0, color='#e74c3c', linestyle='--', label='Zero (average)')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
#@title üéß Code Walkthrough: Overlong Reward
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_08_overlong_reward.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


### 4.4 Overlong Reward Shaping

Responses that are too long should be penalized, but smoothly rather than with a hard cutoff:

In [None]:
def overlong_reward(response_length: int, L_max: int = 1000, L_cache: int = 200) -> float:
    """
    Compute the overlong reward penalty.

    Args:
        response_length: Number of tokens in the response
        L_max: Maximum allowed response length
        L_cache: Size of the penalty transition zone

    Returns:
        Penalty in [-1, 0]
    """
    safe_zone = L_max - L_cache

    if response_length <= safe_zone:
        return 0.0    # No penalty: within safe zone
    elif response_length <= L_max:
        # Linear penalty in the transition zone
        return (safe_zone - response_length) / L_cache
    else:
        return -1.0   # Maximum penalty: exceeded limit

# Visualize the penalty function
lengths = np.arange(0, 1300)
penalties = [overlong_reward(l) for l in lengths]

fig, ax = plt.subplots(figsize=(12, 5))
ax.plot(lengths, penalties, linewidth=2.5, color='#e74c3c')
ax.axvline(x=800, color='#2ecc71', linestyle='--', alpha=0.7, label='Safe zone boundary (L_max - L_cache)')
ax.axvline(x=1000, color='#e74c3c', linestyle='--', alpha=0.7, label='Maximum length (L_max)')
# Shade the three zones with vertical spans. The penalty is flat in the safe and
# over-limit zones, so filling between the curve and a constant would be invisible.
ax.axvspan(0, 800, alpha=0.1, color='#2ecc71', label='Safe zone (no penalty)')
ax.axvspan(800, 1000, alpha=0.1, color='#f39c12', label='Transition zone')
ax.axvspan(1000, lengths[-1], alpha=0.1, color='#e74c3c', label='Over-limit zone')
ax.set_xlabel('Response Length (tokens)', fontsize=12)
ax.set_ylabel('Length Penalty', fontsize=12)
ax.set_title('Overlong Reward Shaping: Smooth Penalty for Long Responses', fontsize=14, fontweight='bold')
ax.legend(loc='lower left', fontsize=10)
ax.grid(True, alpha=0.3)
ax.set_ylim(-1.15, 0.15)
plt.tight_layout()
plt.show()

# Worked examples
print("Overlong reward examples (L_max=1000, L_cache=200):")
for length in [500, 700, 800, 850, 900, 950, 1000, 1100]:
    r = overlong_reward(length)
    print(f"  {length:4d} tokens → penalty = {r:+.2f}")

In [None]:
#@title üéß Code Walkthrough: Clip Higher
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_09_clip_higher.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


### 4.5 The Clip-Higher Mechanism

Standard PPO uses symmetric clipping: $[1-\epsilon, 1+\epsilon]$. OpenClaw-RL uses **asymmetric clipping** where the upper bound is larger. This allows the model to increase the probability of good responses more aggressively while staying conservative about decreasing probabilities.

In [None]:
def clip_higher(ratio: torch.Tensor, eps_low: float = 0.2, eps_high: float = 0.28) -> torch.Tensor:
    """
    Asymmetric clipping (the clip-higher technique).

    Args:
        ratio: Policy ratio tensor (pi_new / pi_old)
        eps_low: Lower bound offset (conservative on negative updates)
        eps_high: Upper bound offset (more exploratory on positive updates)

    Returns:
        Clipped ratio
    """
    lower = 1.0 - eps_low
    upper = 1.0 + eps_high
    return torch.clamp(ratio, lower, upper)

# Visualize symmetric vs asymmetric clipping
ratios = torch.linspace(0.3, 2.0, 200)

symmetric_clipped = torch.clamp(ratios, 0.8, 1.2)  # Standard PPO
asymmetric_clipped = clip_higher(ratios)  # OpenClaw-RL

fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(ratios.numpy(), ratios.numpy(), '--', color='gray', alpha=0.5, label='Unclipped')
ax.plot(ratios.numpy(), symmetric_clipped.numpy(), linewidth=2.5,
        color='#3498db', label='Symmetric clip [0.8, 1.2]')
ax.plot(ratios.numpy(), asymmetric_clipped.numpy(), linewidth=2.5,
        color='#e74c3c', label='Clip-higher [0.8, 1.28]')
ax.axvline(x=1.0, color='gray', linestyle=':', alpha=0.3)
ax.set_xlabel('Policy Ratio (ρ)', fontsize=12)
ax.set_ylabel('Clipped Ratio', fontsize=12)
ax.set_title('Clip-Higher: Asymmetric Clipping for Exploration', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("Key insight: clip-higher allows larger positive updates (up to 1.28)")
print("while keeping negative updates conservative (down to 0.8).")
print("This expands the exploration budget without risking catastrophic forgetting.")
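
To put a number on the extra headroom, the following sketch computes the largest probability a token can reach before the clipped term stops contributing gradient for a positive advantage. The helper `max_prob_after_update` and the example probabilities are hypothetical and only meant to illustrate the bound `p_old * (1 + eps_high)`.

```python
def max_prob_after_update(p_old, eps_high):
    """Largest new probability that still receives gradient for a positive advantage.

    Beyond p_old * (1 + eps_high) the clipped term is constant, so this token
    gets no further push upward from the surrogate objective.
    """
    return min(1.0, p_old * (1.0 + eps_high))


for p_old in (0.01, 0.5):
    sym = max_prob_after_update(p_old, eps_high=0.20)
    asym = max_prob_after_update(p_old, eps_high=0.28)
    print(f"p_old={p_old:.2f}: symmetric cap={sym:.4f}, clip-higher cap={asym:.4f}")
```

The absolute gain is small for rare tokens (0.0120 versus 0.0128 here), but it compounds over many updates, which is the sense in which the larger upper bound leaves more room for exploration.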

In [None]:
#@title üéß Narration: Todo Clipped Loss
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_10_todo_clipped_loss.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 5. 🔧 Your Turn

### TODO 1: Implement the GRPO Clipped Surrogate Loss

This is the core loss function. Implement it step by step:

In [None]:
def grpo_clipped_loss(
    log_probs_new: torch.Tensor,    # Log probs under current policy
    log_probs_ref: torch.Tensor,    # Log probs under reference policy
    advantages: torch.Tensor,        # Group-relative advantages (one per response, broadcast over tokens)
    eps_low: float = 0.2,
    eps_high: float = 0.28,
    beta_kl: float = 0.01
) -> torch.Tensor:
    """
    Compute the GRPO clipped surrogate loss with clip-higher and KL penalty.

    Args:
        log_probs_new: shape (batch, seq_len), log π_θ(a_t | s_t)
        log_probs_ref: shape (batch, seq_len), log π_ref(a_t | s_t)
        advantages: shape (batch,) ‚Äî group-relative advantages
        eps_low: Lower clipping bound offset
        eps_high: Upper clipping bound offset (higher = more exploration)
        beta_kl: KL divergence penalty coefficient

    Returns:
        Scalar loss value (to be MAXIMIZED, so negate for gradient descent)

    Steps:
        1. Compute the probability ratio: ρ_t = exp(log_new - log_ref)
        2. Compute unclipped objective: ρ_t * A_hat
        3. Compute clipped objective: clip(ρ_t, 1-eps_low, 1+eps_high) * A_hat
        4. Take the minimum of unclipped and clipped (pessimistic bound)
        5. Compute KL penalty: beta * mean(log_new - log_ref)
        6. Return mean(min_objective) - KL_penalty
    """
    # Expand advantages to match token dimension
    # advantages shape: (batch,) → (batch, 1)
    adv = advantages.unsqueeze(-1)

    # ============ TODO ============
    # Step 1: Compute ratio ρ = exp(log_probs_new - log_probs_ref)
    # Step 2: Compute unclipped term = ρ * advantages
    # Step 3: Compute clipped_ratio using clip_higher
    # Step 4: Compute clipped term = clipped_ratio * advantages
    # Step 5: Take element-wise minimum
    # Step 6: Compute KL penalty = beta_kl * mean(log_probs_new - log_probs_ref)
    # Step 7: Return mean(min_term) - kl_penalty
    # ==============================

    loss = ???  # YOUR CODE HERE

    return loss

# ✅ Verification
batch, seq = 4, 10
log_new = torch.randn(batch, seq) * 0.1 - 2.0  # Typical log-prob values
log_ref = torch.randn(batch, seq) * 0.1 - 2.0
advs = torch.tensor([0.9, -1.5, 0.9, -0.3])

loss = grpo_clipped_loss(log_new, log_ref, advs)
assert loss.dim() == 0, f"❌ Loss should be scalar, got shape {loss.shape}"
assert not torch.isnan(loss), "❌ Loss is NaN!"
print(f"✅ GRPO loss computed successfully: {loss.item():.4f}")
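
If you get stuck, the sketch below is one possible way to turn the seven listed steps into code. It is not an official solution: the name `grpo_clipped_loss_sketch` is hypothetical, and the KL term is the plain mean log-ratio that Step 6 describes rather than a more refined estimator.

```python
import torch


def grpo_clipped_loss_sketch(log_probs_new, log_probs_ref, advantages,
                             eps_low=0.2, eps_high=0.28, beta_kl=0.01):
    """One possible implementation of the listed steps (objective to maximize)."""
    adv = advantages.unsqueeze(-1)                 # (batch,) -> (batch, 1)
    log_ratio = log_probs_new - log_probs_ref      # (batch, seq_len)
    ratio = torch.exp(log_ratio)                   # Step 1: probability ratio
    unclipped = ratio * adv                        # Step 2: unclipped term
    clipped_ratio = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)  # Step 3
    clipped = clipped_ratio * adv                  # Step 4: clipped term
    surrogate = torch.min(unclipped, clipped)      # Step 5: pessimistic bound
    kl_penalty = beta_kl * log_ratio.mean()        # Step 6: mean log-ratio penalty
    return surrogate.mean() - kl_penalty           # Step 7: scalar objective


torch.manual_seed(0)
log_new = (torch.randn(4, 10) * 0.1 - 2.0).requires_grad_()
log_ref = torch.randn(4, 10) * 0.1 - 2.0
advs = torch.tensor([0.9, -1.5, 0.9, -0.3])

objective = grpo_clipped_loss_sketch(log_new, log_ref, advs)
objective.backward()  # gradients flow back to the current-policy log-probs
print(objective.dim(), torch.isfinite(objective).item())
```

A quick sanity check is to pass the same tensor as both policies: the ratio is then exactly 1, the KL term vanishes, and the objective reduces to the mean advantage.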

In [None]:
#@title üéß Narration: Todo Tcr Loss
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_11_todo_tcr_loss.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


### TODO 2: Implement the Full GRPO-TCR Loss

Now combine everything (PRM rewards, advantages, clipping, and reward shaping):

In [None]:
def grpo_tcr_loss(
    log_probs_new: torch.Tensor,   # (batch, seq_len)
    log_probs_ref: torch.Tensor,   # (batch, seq_len)
    response_lengths: torch.Tensor, # (batch,): length of each response in tokens
    prm_rewards: torch.Tensor,      # (batch,): PRM majority vote rewards
    eps_low: float = 0.2,
    eps_high: float = 0.28,
    beta_kl: float = 0.01,
    L_max: int = 1000,
    L_cache: int = 200,
) -> Tuple[torch.Tensor, dict]:
    """
    Full GRPO-TCR loss: Token-level + Clip-higher + Reward shaping.

    Args:
        log_probs_new, log_probs_ref: Policy log-probabilities
        response_lengths: Token counts per response
        prm_rewards: PRM majority vote rewards per response
        eps_low, eps_high: Clipping bounds
        beta_kl: KL penalty coefficient
        L_max, L_cache: Overlong reward shaping parameters

    Returns:
        (loss, metrics_dict)

    Steps:
        1. Compute overlong penalties for each response
        2. Combine PRM rewards with length penalties
        3. Compute GRPO advantages from combined rewards
        4. Compute clipped surrogate loss with KL penalty
    """
    # ============ TODO ============
    # Step 1: Compute overlong penalties for each response
    # Step 2: combined_reward = prm_reward + length_penalty
    # Step 3: advantages = compute_grpo_advantages(combined_rewards)
    # Step 4: loss = grpo_clipped_loss(log_new, log_ref, advantages, ...)
    # ==============================

    loss = ???  # YOUR CODE HERE
    metrics = ???  # YOUR CODE HERE

    return loss, metrics

# ✅ Verification
batch, seq = 8, 20
log_new = torch.randn(batch, seq) * 0.1 - 2.0
log_ref = torch.randn(batch, seq) * 0.1 - 2.0
lengths = torch.tensor([500, 800, 900, 1100, 600, 750, 950, 400]).float()
prm_rewards = torch.tensor([1.0, -1.0, 1.0, 0.0, 1.0, -1.0, 0.0, 1.0])

loss, metrics = grpo_tcr_loss(log_new, log_ref, lengths, prm_rewards)
assert loss.dim() == 0, "❌ Loss should be scalar"
assert "mean_advantage" in metrics, "❌ Metrics should include mean_advantage"
print(f"✅ GRPO-TCR loss: {loss.item():.4f}")
print(f"   Mean advantage: {metrics['mean_advantage']:.4f}")
print(f"   Mean length penalty: {metrics['mean_length_penalty']:.4f}")
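
For comparison with your own attempt, here is a self-contained sketch of how the four steps might be wired together. All helper names (`overlong_penalties`, `group_advantages`, `clipped_objective`, `grpo_tcr_loss_sketch`) are hypothetical, the length penalty is a vectorized form of the piecewise rule from Section 3.4, and the extra `mean_combined_reward` metric is an addition made here for illustration.

```python
import torch


def overlong_penalties(lengths, L_max=1000, L_cache=200):
    """Vectorized overlong penalty: 0 in the safe zone, linear ramp, then -1."""
    ramp = ((L_max - L_cache) - lengths) / L_cache
    return torch.clamp(ramp, min=-1.0, max=0.0)


def group_advantages(rewards):
    """Group-relative advantages; zeros when the group has no spread."""
    std = rewards.std()
    if torch.isnan(std) or std < 1e-8:
        return torch.zeros_like(rewards)
    return (rewards - rewards.mean()) / std


def clipped_objective(log_new, log_ref, advantages, eps_low, eps_high, beta_kl):
    """Clip-higher surrogate minus a mean log-ratio penalty."""
    log_ratio = log_new - log_ref
    ratio = torch.exp(log_ratio)
    adv = advantages.unsqueeze(-1)
    clipped = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high) * adv
    return torch.min(ratio * adv, clipped).mean() - beta_kl * log_ratio.mean()


def grpo_tcr_loss_sketch(log_probs_new, log_probs_ref, response_lengths, prm_rewards,
                         eps_low=0.2, eps_high=0.28, beta_kl=0.01,
                         L_max=1000, L_cache=200):
    length_penalty = overlong_penalties(response_lengths.float(), L_max, L_cache)  # Step 1
    combined = prm_rewards + length_penalty                                        # Step 2
    advantages = group_advantages(combined)                                        # Step 3
    loss = clipped_objective(log_probs_new, log_probs_ref, advantages,
                             eps_low, eps_high, beta_kl)                           # Step 4
    metrics = {
        "mean_advantage": advantages.mean().item(),
        "mean_length_penalty": length_penalty.mean().item(),
        "mean_combined_reward": combined.mean().item(),
    }
    return loss, metrics


torch.manual_seed(0)
log_new = torch.randn(8, 20) * 0.1 - 2.0
log_ref = torch.randn(8, 20) * 0.1 - 2.0
lengths = torch.tensor([500, 800, 900, 1100, 600, 750, 950, 400]).float()
prm_rewards = torch.tensor([1.0, -1.0, 1.0, 0.0, 1.0, -1.0, 0.0, 1.0])

loss, metrics = grpo_tcr_loss_sketch(log_new, log_ref, lengths, prm_rewards)
print(f"objective={loss.item():.4f}, metrics={metrics}")
```

Clamping the linear ramp to `[-1, 0]` reproduces all three branches of the piecewise definition at once, which avoids a Python loop over responses.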

In [None]:
#@title üéß Code Walkthrough: Training Loop
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_12_training_loop.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 6. Putting It All Together: A Complete Training Loop

Let us build a small-scale training loop that demonstrates GRPO-TCR in action. We will use a simple model that learns to prefer certain token patterns over others.

In [None]:
class SimplePolicyModel(nn.Module):
    """
    A tiny policy model for demonstration.
    Maps a prompt embedding to a distribution over responses.
    """
    def __init__(self, vocab_size=100, hidden_size=64, max_seq_len=20):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.transformer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=4, dim_feedforward=128, batch_first=True
        )
        self.head = nn.Linear(hidden_size, vocab_size)
        self.vocab_size = vocab_size
        self.max_seq_len = max_seq_len

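    def forward(self, input_ids):
        """Compute log-probabilities for each position."""
        x = self.embedding(input_ids)
        # Causal mask so position t cannot attend to later tokens; without it
        # the model would see the very token it is asked to predict.
        seq_len = input_ids.shape[1]
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float('-inf'), device=x.device), diagonal=1
        )
        x = self.transformer(x, src_mask=causal_mask)
        logits = self.head(x)
        log_probs = F.log_softmax(logits, dim=-1)
        return log_probs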

    def get_response_log_probs(self, input_ids, response_ids):
        """
        Get the log-probability of generating specific response tokens.

        Args:
            input_ids: (batch, prompt_len)
            response_ids: (batch, response_len)

        Returns:
            log_probs: (batch, response_len), log p(response_t | context)
        """
        full_ids = torch.cat([input_ids, response_ids], dim=1)
        all_log_probs = self.forward(full_ids)
        # Get log-probs for the response portion only
        prompt_len = input_ids.shape[1]
        response_log_probs = all_log_probs[:, prompt_len-1:-1, :]  # Shifted by 1
        # Gather the log-probs for the actual response tokens
        selected = response_log_probs.gather(2, response_ids.unsqueeze(-1)).squeeze(-1)
        return selected

# Create model
vocab_size = 100
model = SimplePolicyModel(vocab_size=vocab_size).to(device)
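ref_model = SimplePolicyModel(vocab_size=vocab_size).to(device)
ref_model.load_state_dict(model.state_dict())  # Same initial weights
# Actually freeze the reference: eval mode (dropout off) and no gradients
ref_model.eval()
for p in ref_model.parameters():
    p.requires_grad_(False)

print(f"✅ Policy model: {sum(p.numel() for p in model.parameters()):,} parameters")
print("✅ Reference model frozen (same architecture)")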
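
The slice `[:, prompt_len-1:-1, :]` in `get_response_log_probs` is easy to get wrong by one position, so here is a small list-based sketch of the same alignment. The function `align_predictions` and the token labels are hypothetical; the only assumption carried over from the model is that the output at position `i` is the prediction for token `i + 1`.

```python
def align_predictions(prompt, response):
    """Pair each response token with the input position whose output predicts it."""
    full = prompt + response
    prompt_len = len(prompt)
    # The output at position i predicts token i + 1, so the predictors of the
    # response are positions prompt_len - 1 through len(full) - 2.
    predictors = full[prompt_len - 1:-1]
    return list(zip(predictors, response))


pairs = align_predictions(["p0", "p1"], ["r0", "r1", "r2"])
print(pairs)  # [('p1', 'r0'), ('r0', 'r1'), ('r1', 'r2')]
```

The first response token is predicted from the last prompt position, and the final response token never acts as a predictor, which is why the slice ends at `-1`.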

### The Training Loop

In [None]:
def generate_synthetic_batch(batch_size=8, prompt_len=5, response_len=15, vocab_size=100):
    """Generate a synthetic batch of (prompt, response, reward) triples."""
    prompts = torch.randint(0, vocab_size, (batch_size, prompt_len))
    responses = torch.randint(0, vocab_size, (batch_size, response_len))

    # Simulate PRM rewards: responses with more low-value tokens are "bad"
    # (This is artificial; in reality, the PRM evaluates actual response quality)
    rewards = []
    for resp in responses:
        low_token_ratio = (resp < vocab_size // 3).float().mean().item()
        if low_token_ratio > 0.5:
            rewards.append(-1.0)
        elif low_token_ratio < 0.3:
            rewards.append(1.0)
        else:
            rewards.append(0.0)
    rewards = torch.tensor(rewards)

    lengths = torch.full((batch_size,), response_len, dtype=torch.float)
    return prompts.to(device), responses.to(device), rewards.to(device), lengths.to(device)

# Training loop
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
num_steps = 200
losses = []
mean_advantages = []

print("Training with GRPO-TCR...")
for step in range(num_steps):
    prompts, responses, prm_rewards, lengths = generate_synthetic_batch()

    # Get log-probs from current and reference policy
    log_probs_new = model.get_response_log_probs(prompts, responses)
    with torch.no_grad():
        log_probs_ref = ref_model.get_response_log_probs(prompts, responses)

    # Compute GRPO-TCR loss
    combined = prm_rewards + torch.tensor(
        [overlong_reward(int(l.item())) for l in lengths], device=device
    )
    advantages = compute_grpo_advantages(combined)

    loss = grpo_clipped_loss(log_probs_new, log_probs_ref, advantages)
    neg_loss = -loss  # We maximize the objective, so minimize the negative

    optimizer.zero_grad()
    neg_loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    losses.append(loss.item())
    mean_advantages.append(advantages.mean().item())

    if (step + 1) % 50 == 0:
        print(f"  Step {step+1}/{num_steps}: loss={loss.item():.4f}, "
              f"mean_adv={advantages.mean().item():.4f}")

print("✅ Training complete!")

In [None]:
#@title üéß Code Walkthrough: Training Results
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_13_training_results.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 7. 📊 Training Results

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training loss curve
window = 10
smoothed_losses = np.convolve(losses, np.ones(window)/window, mode='valid')
axes[0].plot(smoothed_losses, linewidth=2, color='#3498db')
axes[0].set_xlabel('Training Step')
axes[0].set_ylabel('GRPO Objective')
axes[0].set_title('GRPO-TCR Training Curve', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

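# Policy divergence from reference
# Note: this is the mean log-ratio on randomly drawn responses, which is only a
# rough proxy for KL (individual values can be negative).
model.eval()  # turn off dropout in the trained policy for this comparison
with torch.no_grad():
    test_prompts, test_responses, _, _ = generate_synthetic_batch(batch_size=32)
    new_lp = model.get_response_log_probs(test_prompts, test_responses)
    ref_lp = ref_model.get_response_log_probs(test_prompts, test_responses)
    kl_div = (new_lp - ref_lp).mean(dim=1)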

axes[1].hist(kl_div.cpu().numpy(), bins=30, color='#e74c3c', alpha=0.7, edgecolor='white')
axes[1].axvline(x=0, color='black', linestyle='--', linewidth=1)
axes[1].set_xlabel('Mean Log-Ratio vs. Reference (KL proxy)')
axes[1].set_ylabel('Count')
axes[1].set_title('Policy Divergence Distribution', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Mean log-ratio vs. reference (KL proxy): {kl_div.mean().item():.4f}")
print(f"Max log-ratio: {kl_div.max().item():.4f}")
print("(Values near 0 mean the policy has stayed close to the reference.)")
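
The quantity plotted above, `new_lp - ref_lp`, is a log-ratio rather than a KL divergence in the strict sense: individual values can be negative, and it only estimates KL(new ‖ ref) when the responses were sampled from the new policy. The sketch below compares that naive per-sample estimate with a commonly used non-negative alternative, `exp(-d) - 1 + d`. Both function names and the example log-probabilities are hypothetical and chosen only for illustration.

```python
import math

def k1_estimate(new_logp, ref_logp):
    """Naive per-sample estimate: the log-ratio itself. Can be negative."""
    return new_logp - ref_logp

def k3_estimate(new_logp, ref_logp):
    """Non-negative per-sample estimate: exp(-d) - 1 + d, with d = new_logp - ref_logp."""
    d = new_logp - ref_logp
    return math.exp(-d) - 1.0 + d

# Hypothetical (new, reference) log-probabilities for three samples
pairs = [(-1.0, -1.2), (-2.0, -1.5), (-0.7, -0.7)]
for new_lp, ref_lp in pairs:
    k1 = k1_estimate(new_lp, ref_lp)
    k3 = k3_estimate(new_lp, ref_lp)
    print(f"new={new_lp:+.2f} ref={ref_lp:+.2f}  k1={k1:+.4f}  k3={k3:.4f}")
```

The second estimator is zero when the two log-probabilities agree and positive otherwise, which makes a histogram of it easier to read as a measure of drift.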

## 8. 🎯 Final Output: GRPO-TCR Component Summary

In [None]:
# Demonstrate the full GRPO-TCR pipeline visually
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. PRM majority voting
prm = ProcessRewardModel(accuracy=0.7, num_votes=7)
good_results = [prm.evaluate_with_majority_voting(+1)[0] for _ in range(100)]
bad_results = [prm.evaluate_with_majority_voting(-1)[0] for _ in range(100)]

axes[0, 0].bar(['Good → +1', 'Good → 0', 'Good → -1'],
               [good_results.count(1), good_results.count(0), good_results.count(-1)],
               color=['#2ecc71', '#f39c12', '#e74c3c'], alpha=0.8)
axes[0, 0].set_title('PRM Majority Voting (true: GOOD)', fontsize=11, fontweight='bold')
axes[0, 0].set_ylabel('Count (out of 100)')

# 2. Advantage distribution
sample_rewards = torch.tensor([1.0, -1.0, 1.0, 0.0, 1.0, -1.0, 0.0, 1.0])
sample_advs = compute_grpo_advantages(sample_rewards)
colors = ['#2ecc71' if a > 0 else '#e74c3c' for a in sample_advs]
axes[0, 1].bar(range(len(sample_advs)), sample_advs.numpy(), color=colors, alpha=0.8)
axes[0, 1].axhline(y=0, color='black', linewidth=0.5)
axes[0, 1].set_title('Group-Relative Advantages', fontsize=11, fontweight='bold')
axes[0, 1].set_xlabel('Response Index')

# 3. Clip-higher comparison
ratios = torch.linspace(0.3, 2.0, 100)
sym = torch.clamp(ratios, 0.8, 1.2)
asym = clip_higher(ratios)
axes[1, 0].plot(ratios.numpy(), ratios.numpy(), '--', color='gray', alpha=0.5)
axes[1, 0].plot(ratios.numpy(), sym.numpy(), linewidth=2, label='Symmetric')
axes[1, 0].plot(ratios.numpy(), asym.numpy(), linewidth=2, label='Clip-higher')
axes[1, 0].legend()
axes[1, 0].set_title('Clip-Higher vs Symmetric', fontsize=11, fontweight='bold')

# 4. Training curve
axes[1, 1].plot(smoothed_losses, linewidth=2, color='#9b59b6')
axes[1, 1].set_title('GRPO-TCR Training Convergence', fontsize=11, fontweight='bold')
axes[1, 1].set_xlabel('Step')
axes[1, 1].set_ylabel('Objective')
axes[1, 1].grid(True, alpha=0.3)

plt.suptitle('GRPO-TCR: Complete Binary RL Pipeline', fontsize=16, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

print("🎉 Congratulations! You've built the Binary RL pipeline from scratch!")
print("   ✅ Process Reward Model with majority voting")
print("   ✅ Group-relative advantage computation")
print("   ✅ Clip-higher asymmetric clipping")
print("   ✅ Overlong reward shaping")
print("   ✅ Complete GRPO-TCR training loop")

In [None]:
#@title üéß Narration: Reflection
from IPython.display import Audio, display
import os as _os
_f = "/content/narration/02_14_reflection.mp3"
if _os.path.exists(_f):
    display(Audio(_f))
else:
    print("Run the first cell to download narration audio.")


## 9. Reflection and Next Steps

### 🤔 Reflection Questions
1. Why does asymmetric clipping (clip-higher) help with exploration? What would happen if $\epsilon_{\text{high}}$ were very large (say, 10)?
2. The overlong reward shaping uses a linear penalty in the transition zone. What would happen with a quadratic penalty instead? Would it be better or worse?
3. In our demo, we used synthetic rewards. In the real system, PRM rewards come from a language model. What happens if the PRM is biased (e.g., it always rates longer responses higher)?
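
A small numeric experiment can help with the first question. The sketch below is a self-contained, PPO-style clipped term, not the notebook's own `clip_higher` or `grpo_clipped_loss`; the function name `clipped_surrogate` and the default `eps_high=0.28` are assumptions for illustration (the lower bound of 0.2 matches the symmetric clamp plotted earlier).

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """PPO-style clipped term: min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A)."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)

# A positively rewarded response that the policy now rates 5x more likely
ratio, advantage = 5.0, 1.0

moderate = clipped_surrogate(ratio, advantage)             # capped at 1 + eps_high
huge = clipped_surrogate(ratio, advantage, eps_high=10.0)  # cap at 11 is never reached

print(f"eps_high=0.28: {moderate:.2f}")
print(f"eps_high=10:   {huge:.2f}")
```

With the large bound, the term keeps growing with the ratio, so nothing in the objective stops a single update from pushing the policy far from the reference on positively rewarded responses.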

### 🏆 Optional Challenges
1. **Dynamic clip bounds**: Implement a version where $\epsilon_{\text{high}}$ decreases over training (start exploratory, become conservative).
2. **Reward aggregation**: Instead of simple majority voting, implement weighted voting where more confident PRM evaluations count more.
3. **Group size ablation**: Run the training loop with different group sizes (G=2, 4, 8, 16) and compare convergence speed.
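
As a possible starting point for the first challenge, the sketch below linearly anneals the upper clip bound over training. The function name `eps_high_schedule` and the start and end values (0.4 and 0.2) are hypothetical; other decay shapes, such as cosine or exponential, would work just as well.

```python
def eps_high_schedule(step, num_steps, start=0.4, end=0.2):
    """Linearly anneal the upper clip bound from `start` to `end` (hypothetical values)."""
    if num_steps <= 1:
        return end
    # Fraction of training completed, clamped to [0, 1]
    frac = min(max(step / (num_steps - 1), 0.0), 1.0)
    return start + frac * (end - start)

num_steps = 200
for step in (0, 50, 100, 199):
    print(step, round(eps_high_schedule(step, num_steps), 4))
```

Inside the training loop, the scheduled value would be passed to the clipping step at each iteration. Whether `clip_higher` or `grpo_clipped_loss` accept such an argument depends on how they were defined earlier in the notebook, so check their signatures before wiring this in.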