# Day 29: Proximal Policy Optimization Algorithms

> Schulman, Wolski, Dhariwal, Radford, Klimov -- OpenAI (2017)
> https://arxiv.org/abs/1707.06347

What you will build in this notebook:
1. The probability ratio r_t = pi_new / pi_old
2. The clipped surrogate objective (Equation 7)
3. Generalized Advantage Estimation (GAE)
4. The full PPO loss (Equation 9)
5. A complete PPO training loop on CartPole-v1
6. Visualization of clip fraction and entropy over training


In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical
import matplotlib.pyplot as plt

%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 4)
plt.rcParams['font.size'] = 12

torch.manual_seed(42)
np.random.seed(42)
print('Setup complete.')

## 1. The Probability Ratio

The ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) measures
how much the new policy differs from the old policy for a specific action.

- r_t = 1: policies are identical for this action
- r_t > 1: new policy assigns higher probability to this action
- r_t < 1: new policy assigns lower probability to this action

Computed in log space for numerical stability: r_t = exp(log_new - log_old)
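
A minimal sketch of why the log-space form matters, using only the Python standard library. The log-probabilities are illustrative extremes, not values from the paper. When both probabilities are tiny, exponentiating each one first underflows to zero and the ratio is lost, while subtracting the log-probabilities first stays well defined.

```python
import math

# Illustrative log-probabilities far below the float64 underflow threshold (about -745).
log_p_old = -800.5
log_p_new = -800.0

# Naive route: exponentiate each probability first, then divide.
p_old = math.exp(log_p_old)  # underflows to 0.0
p_new = math.exp(log_p_new)  # underflows to 0.0
try:
    naive_ratio = p_new / p_old
except ZeroDivisionError:
    naive_ratio = None  # the ratio cannot be recovered

# Log-space route: subtract first, exponentiate once.
stable_ratio = math.exp(log_p_new - log_p_old)  # exp(0.5)

print("naive ratio: ", naive_ratio)
print("stable ratio:", round(stable_ratio, 4))
```

The same subtraction-then-exp pattern is what `torch.exp(log_probs_new - log_probs_old)` does elementwise.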

In [None]:
# Demonstrate the probability ratio
log_probs_old = torch.tensor([-1.0, -0.5, -2.0, -0.3])
log_probs_new = torch.tensor([-0.8, -0.5, -2.5, -0.1])  # slightly different policy

ratio = torch.exp(log_probs_new - log_probs_old)
print('Old log probs:', log_probs_old.numpy())
print('New log probs:', log_probs_new.numpy())
print('Ratio r_t:    ', ratio.numpy().round(3))
print()
print('ratio > 1: new policy assigns MORE probability to this action')
print('ratio < 1: new policy assigns LESS probability to this action')
print('ratio = 1: policies are identical for this action')

## 2. The Clipped Surrogate Objective (Equation 7)

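The core contribution of the paper. The min() makes this a pessimistic
lower bound on the unclipped objective: it stops improving once the ratio
rises above 1+epsilon for a positive advantage, or falls below 1-epsilon
for a negative advantage. Moves in the harmful direction are not clipped,
so they are still penalized.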

L_CLIP = E[min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)]

The paper suggests epsilon=0.2 as a typical value (Section 3); this notebook
uses that value throughout.
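
The four combinations of advantage sign and ratio side can be checked by hand. The helper `clipped_term` below is a hypothetical name introduced for this illustration; it evaluates the per-sample term inside the expectation with plain floats.

```python
def clipped_term(ratio, advantage, eps=0.2):
    """Per-sample term of L_CLIP: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

cases = [
    (1.5, +1.0),  # good action, probability raised too far -> capped at 1.2
    (0.5, +1.0),  # good action, probability lowered        -> 0.5, not capped
    (0.5, -1.0),  # bad action, probability lowered too far -> capped at -0.8
    (1.5, -1.0),  # bad action, probability raised          -> -1.5, not capped
]
for ratio, advantage in cases:
    print(f"r={ratio:.1f}, A={advantage:+.0f} -> {clipped_term(ratio, advantage):+.2f}")
```

Only the first and third cases are flat in r, so they contribute zero gradient. In the other two the unclipped term is the smaller one, and its gradient pulls the ratio back toward 1.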

In [None]:
def clipped_surrogate_loss(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Negated Equation 7 from Schulman et al. (2017): minimizing this maximizes L_CLIP."""
    ratio = torch.exp(log_probs_new - log_probs_old)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    return -torch.min(surr1, surr2).mean()

# Visualize: what happens to the objective as the ratio changes?
ratios = torch.linspace(0.5, 1.5, 100)
advantage_pos = torch.tensor(1.0)   # positive advantage (good action)
advantage_neg = torch.tensor(-1.0)  # negative advantage (bad action)
epsilon = 0.2

obj_pos = torch.min(ratios * advantage_pos, torch.clamp(ratios, 1-epsilon, 1+epsilon) * advantage_pos)
obj_neg = torch.min(ratios * advantage_neg, torch.clamp(ratios, 1-epsilon, 1+epsilon) * advantage_neg)

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
axes[0].plot(ratios.numpy(), obj_pos.numpy(), 'steelblue', linewidth=2)
axes[0].axvline(1-epsilon, color='gray', linestyle='--', alpha=0.7, label=f'1-eps={1-epsilon}')
axes[0].axvline(1+epsilon, color='gray', linestyle='--', alpha=0.7, label=f'1+eps={1+epsilon}')
axes[0].set_title('Positive Advantage (A_t > 0)')
axes[0].set_xlabel('Probability ratio r_t')
axes[0].set_ylabel('Objective contribution')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(ratios.numpy(), obj_neg.numpy(), 'coral', linewidth=2)
axes[1].axvline(1-epsilon, color='gray', linestyle='--', alpha=0.7, label=f'1-eps={1-epsilon}')
axes[1].axvline(1+epsilon, color='gray', linestyle='--', alpha=0.7, label=f'1+eps={1+epsilon}')
axes[1].set_title('Negative Advantage (A_t < 0)')
axes[1].set_xlabel('Probability ratio r_t')
axes[1].set_ylabel('Objective contribution')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.suptitle('PPO Clipped Surrogate Objective (Equation 7)', fontsize=13)
plt.tight_layout()
plt.show()
print('The objective is flat above 1+eps when A_t > 0 and below 1-eps when A_t < 0: no incentive to move further.')

## 3. Generalized Advantage Estimation (GAE)

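PPO estimates advantages with GAE (Schulman et al., 2015), which Section 5 of
the PPO paper uses in truncated form.

A_t = sum_{l=0}^{inf} (gamma * lambda)^l * delta_{t+l}
delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)

Here r_t denotes the reward at step t (not the probability ratio from
Section 1), and done_t = 1 when the episode ends at step t. In practice the
sum stops at the end of the rollout or episode.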

lambda controls bias vs. variance:
- lambda=0: one-step TD (low variance, high bias)
- lambda=1: Monte Carlo (high variance, low bias)
- lambda=0.95: paper default, good balance
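
The two extremes can be verified numerically. The sketch below uses plain Python lists with made-up rewards and values and no episode terminations; the helper name `gae` is ours. With lambda=0 each advantage equals the one-step TD error, and with lambda=1 it equals the discounted return, bootstrapped from the final value, minus V(s_t).

```python
def gae(rewards, values, gamma, lam):
    """Backward GAE recurrence without terminations; values has len(rewards) + 1 entries."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 0.2]  # last entry is the bootstrap value V(s_T)
gamma = 0.9

# lambda = 0: each advantage is just the TD error delta_t.
td_errors = [rewards[t] + gamma * values[t + 1] - values[t] for t in range(3)]
adv_td = gae(rewards, values, gamma, lam=0.0)

# lambda = 1: discounted return with bootstrap, minus the baseline V(s_0).
mc_return_0 = rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + gamma**3 * values[3]
adv_mc = gae(rewards, values, gamma, lam=1.0)

print("lambda=0:", [round(a, 4) for a in adv_td])
print("lambda=1:", [round(a, 4) for a in adv_mc])
print("return - V(s_0):", round(mc_return_0 - values[0], 4))
```

The lambda=1 case works because the TD errors telescope: every intermediate value estimate cancels, leaving only the rewards, the bootstrap value, and the baseline.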

In [None]:
def compute_gae(rewards, values, dones, gamma=0.99, lambda_=0.95):
    """GAE backward recurrence. values has shape (T+1,) including bootstrap."""
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t+1] * (1 - dones[t]) - values[t]
        gae = delta + gamma * lambda_ * (1 - dones[t]) * gae
        advantages[t] = gae
    return advantages

# Demonstrate lambda effect
T = 20
rewards = np.ones(T) * 1.0
values = np.ones(T + 1) * 5.0  # constant baseline that underestimates the return (TD error = 0.95 > 0)
dones = np.zeros(T)

fig, ax = plt.subplots(figsize=(10, 4))
for lam, color in [(0.0, 'steelblue'), (0.5, 'mediumseagreen'), (0.95, 'coral'), (1.0, 'mediumpurple')]:
    adv = compute_gae(rewards, values, dones, gamma=0.99, lambda_=lam)
    ax.plot(adv, label=f'lambda={lam}', color=color, linewidth=2)

ax.set_xlabel('Timestep t')
ax.set_ylabel('Advantage A_t')
ax.set_title('GAE Advantages for Different Lambda Values')
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print('lambda=0: only uses immediate TD error (low variance)')
print('lambda=1: accumulates all future TD errors (high variance)')

## 4. The Full PPO Loss (Equation 9)

The complete objective combines three terms (Equation 9 from the paper). It is
maximized, so the code below minimizes its negation:

L = L_CLIP - c1 * L_VF + c2 * S[pi]

- L_CLIP: clipped policy loss (Section 3)
- L_VF = (V(s) - V_target)^2: value function MSE
- S[pi]: entropy bonus (encourages exploration)
- c1=1, c2=0.01 for Atari (Table 5 in paper); the MuJoCo experiments use
  separate policy and value networks and no entropy bonus (Section 6.1)
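
Equation 9 is an objective to maximize, while optimizers in frameworks such as PyTorch minimize. A small sketch with made-up scalar values (not results from any run) shows the sign flip: the loss handed to the optimizer is the negated objective.

```python
# Illustrative scalar values for the three terms.
l_clip = 0.12    # clipped surrogate (to be maximized)
l_vf = 0.40      # value-function squared error (to be minimized)
entropy = 0.65   # policy entropy S[pi] (to be maximized)
c1, c2 = 1.0, 0.01

# Equation 9 as an objective to maximize.
objective = l_clip - c1 * l_vf + c2 * entropy

# The same quantity as a loss for a minimizer: every sign flips.
loss = -l_clip + c1 * l_vf - c2 * entropy

print(f"objective = {objective:.4f}")
print(f"loss      = {loss:.4f}")
```

This is why implementations typically negate the surrogate and the entropy term but add the value loss with a positive coefficient.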

In [None]:
def ppo_full_loss(log_probs_new, log_probs_old, advantages, values, returns, entropy,
                  epsilon=0.2, c1=1.0, c2=0.01):
    """Full PPO objective, Equation 9 from Schulman et al. (2017)."""
    # Normalize advantages within minibatch (standard practice, not in paper equations)
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    # L_CLIP: clipped surrogate (Equation 7)
    ratio = torch.exp(log_probs_new - log_probs_old)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # L_VF: value function MSE
    value_loss = F.mse_loss(values, returns)

    # S[pi]: entropy bonus
    entropy_loss = -entropy.mean()

    # Combined: minimize policy_loss + c1*value_loss + c2*entropy_loss
    total = policy_loss + c1 * value_loss + c2 * entropy_loss

    clip_fraction = ((ratio - 1.0).abs() > epsilon).float().mean().item()
    return total, {'policy': policy_loss.item(), 'value': value_loss.item(),
                   'entropy': -entropy_loss.item(), 'clip_frac': clip_fraction}

# Quick sanity check
batch = 64
log_new = -torch.rand(batch) - 0.5  # synthetic log-probabilities (kept negative, as real ones are)
log_old = log_new.detach() + torch.randn(batch) * 0.1  # slightly different
adv = torch.randn(batch)
vals = torch.randn(batch)
rets = torch.randn(batch)
ent = torch.rand(batch) * 0.5 + 0.1

loss, info = ppo_full_loss(log_new, log_old, adv, vals, rets, ent)
print('PPO full loss components:')
for k, v in info.items():
    print(f'  {k}: {v:.4f}')
print(f'  total loss: {loss.item():.4f}')

## 5. Training PPO on CartPole-v1

CartPole-v1 is considered solved when the average reward over 100
consecutive episodes exceeds 475.

We use the PPOAgent from implementation.py, which follows Algorithm 1
from the paper.
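
For orientation, the update phase of Algorithm 1 can be sketched as nested loops over epochs and shuffled minibatches. This is a hedged sketch, not the contents of implementation.py: it assumes T is the rollout length, K the number of epochs, and M the minibatch size (the paper's meaning of those symbols, for a single actor), and `minibatch_indices` is a hypothetical helper.

```python
import random

def minibatch_indices(T, K, M, seed=0):
    """Yield (epoch, index_list) pairs: K passes over T samples in shuffled chunks of size M."""
    rng = random.Random(seed)
    for epoch in range(K):
        order = list(range(T))
        rng.shuffle(order)
        for start in range(0, T, M):
            yield epoch, order[start:start + M]

T, K, M = 512, 4, 64
steps = list(minibatch_indices(T, K, M))

# Each gradient step would evaluate the PPO loss on one of these index sets.
print("gradient steps per iteration:", len(steps))  # (T / M) * K
visits = sum(1 for _, idx in steps if 0 in idx)
print("times sample 0 is reused:", visits)          # once per epoch
```

Under these assumptions, the settings T=512, K=4, M=64 give 32 gradient steps per rollout, with every transition seen four times before it is discarded.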

In [None]:
import sys
sys.path.insert(0, '..')
from implementation import PPOAgent

try:
    import gymnasium as gym
except ImportError:
    import gym

env = gym.make('CartPole-v1')
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.n

agent = PPOAgent(
    obs_dim=obs_dim, act_dim=act_dim,
    T=512, K=4, M=64,
    epsilon=0.2, gamma=0.99, lambda_=0.95,
    lr=3e-4, c1=1.0, c2=0.01,
)

rewards_history = []
stats_history = []

for i in range(60):
    mean_reward = agent.collect_rollout(env)
    stats = agent.update()
    rewards_history.append(mean_reward)
    stats_history.append(stats)
    if (i + 1) % 10 == 0:
        print(f'Iter {i+1:3d} | reward={mean_reward:6.1f} | '
              f'clip_frac={stats["clip_fraction"]:.3f} | '
              f'entropy={stats["entropy"]:.4f}')

env.close()

## 6. Training Diagnostics

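The clip fraction is a key diagnostic for PPO. As rough rules of thumb (not
from the paper):
- Too low (< 5%): the policy barely moves per update; the learning rate or
  number of epochs may be too small, or epsilon is too large to ever bind
- Too high (> 50%): most samples hit the clip; the learning rate or number
  of epochs is likely too large for the chosen epsilon
- Healthy range: 5-30%

Entropy typically decreases over training as the policy becomes more
deterministic. A collapse toward zero early in training usually signals
premature convergence.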
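
Both diagnostics are simple to compute. The sketch below uses plain Python with made-up ratios and action probabilities, and the helper names are ours. For a two-action environment such as CartPole, a uniform policy has the maximum entropy of ln 2 (about 0.693 nats), and a deterministic policy has entropy 0.

```python
import math

def clip_fraction(ratios, eps=0.2):
    """Fraction of samples whose ratio lies outside [1 - eps, 1 + eps]."""
    return sum(abs(r - 1.0) > eps for r in ratios) / len(ratios)

def categorical_entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

ratios = [0.95, 1.02, 1.30, 0.70, 1.10, 1.00, 0.85, 1.25]
print("clip fraction:", clip_fraction(ratios))

for probs in ([0.5, 0.5], [0.9, 0.1], [1.0, 0.0]):
    print(probs, "->", round(categorical_entropy(probs), 4), "nats")
```

Here three of the eight ratios (1.30, 0.70, 1.25) fall outside the clip range, giving a clip fraction of 0.375.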

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(14, 4))

# Reward curve
axes[0].plot(rewards_history, alpha=0.5, color='steelblue')
if len(rewards_history) >= 10:
    rolling = np.convolve(rewards_history, np.ones(10)/10, mode='valid')
    axes[0].plot(np.arange(9, len(rewards_history)), rolling, 'steelblue', linewidth=2)
axes[0].set_title('Reward over Training')
axes[0].set_xlabel('Iteration')
axes[0].set_ylabel('Mean Episode Reward')
axes[0].grid(True, alpha=0.3)

# Clip fraction
clip_fracs = [s['clip_fraction'] for s in stats_history]
axes[1].plot(clip_fracs, color='coral')
axes[1].axhline(0.05, color='gray', linestyle='--', alpha=0.5)
axes[1].axhline(0.30, color='gray', linestyle='--', alpha=0.5)
axes[1].set_title('Clip Fraction')
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Fraction of steps clipped')
axes[1].grid(True, alpha=0.3)

# Entropy
entropies = [s['entropy'] for s in stats_history]
axes[2].plot(entropies, color='mediumseagreen')
axes[2].set_title('Policy Entropy')
axes[2].set_xlabel('Iteration')
axes[2].set_ylabel('Entropy (nats)')
axes[2].grid(True, alpha=0.3)

plt.suptitle('PPO Training Diagnostics: CartPole-v1', fontsize=13)
plt.tight_layout()
plt.show()

## Key Takeaways

1. **The clip is a pessimistic lower bound.** The min() in L_CLIP takes the
   smaller of the clipped and unclipped terms, so the agent gains nothing by
   pushing the ratio beyond [1-eps, 1+eps]. (Section 3 of the paper.)

2. **Reusing each batch for multiple epochs is the efficiency gain.** A
   standard policy gradient method discards its data after one gradient
   step. PPO reuses each batch for K epochs of minibatch updates.
   (Algorithm 1, Section 5.)

3. **GAE lambda controls bias-variance tradeoff.** lambda=0 is one-step TD
   (low variance, high bias). lambda=1 is Monte Carlo (high variance, low bias).
   The paper uses lambda=0.95 throughout.

4. **PPO is the algorithm inside RLHF.** InstructGPT and ChatGPT use PPO to
   fine-tune language models on human preference signals. Day 30 covers this
   directly.

   Note: The RLHF connection is our retrospective addition, not from the 2017 paper.

**Next:** [Day 30 - Deep Reinforcement Learning from Human Feedback](../30_RLHF/)
