# PPO (Proximal Policy Optimization) - Interactive Exercise

Welcome! In this notebook, you will implement **PPO**, one of the most popular and successful policy gradient algorithms in modern Deep RL.

## What is PPO?

PPO (Schulman et al., 2017) is a policy gradient method that achieves:
- **Stable training**: Conservative policy updates prevent collapse
- **Sample efficiency**: Reuses data through multiple epochs
- **Strong performance**: State-of-the-art on many benchmarks
- **Simplicity**: Easier to implement than TRPO (Trust Region Policy Optimization)

PPO has become the **default choice** for many RL applications!

## Key Innovation: Clipped Surrogate Objective

**Standard Policy Gradient**:
$$L^{PG} = \mathbb{E}[\log \pi(a|s) A(s,a)]$$

**PPO Clipped Objective**:
$$L^{CLIP} = \mathbb{E}[\min(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t)]$$

where $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the **probability ratio**.

The **clip** prevents the new policy from deviating too far from the old policy!
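
To make the ratio and the clip concrete, here is a minimal sketch in plain Python (no PyTorch). The helper names `probability_ratio` and `clip`, and the probabilities used, are illustrative assumptions and not part of the graded code.

```python
import math

def probability_ratio(new_log_prob, old_log_prob):
    """r = pi_new(a|s) / pi_old(a|s), computed from log-probabilities."""
    return math.exp(new_log_prob - old_log_prob)

def clip(x, low, high):
    """Restrict x to the interval [low, high]."""
    return max(low, min(x, high))

eps = 0.2
old_prob = 0.5  # hypothetical pi_old(a|s)
for new_prob in (0.25, 0.45, 0.5, 0.6, 0.9):  # hypothetical pi_new(a|s)
    r = probability_ratio(math.log(new_prob), math.log(old_prob))
    print(f"pi_new={new_prob:.2f}  r={r:.2f}  clipped r={clip(r, 1 - eps, 1 + eps):.2f}")
```

Working with log-probabilities and exponentiating the difference is the same form the loss exercise uses later (`torch.exp(new_log_probs - old_log_probs)`).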

## Differences from A2C

| Aspect | A2C | PPO |
|--------|-----|-----|
| Update Rule | Standard PG | **Clipped surrogate** |
| Data Usage | Single update | **Multiple epochs** |
| Advantage | n-step TD | **GAE (λ-returns)** |
| Stability | Good | **Excellent** |
| Performance | Good | **State-of-the-art** |

## Learning Objectives

By the end of this notebook, you will:
- Understand the clipped surrogate objective
- Implement Generalized Advantage Estimation (GAE)
- Build a complete PPO agent
- See why PPO is so stable and effective
- Compare PPO with A2C and other methods

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import gymnasium as gym
import matplotlib.pyplot as plt
from ppo_tests import *

In [None]:
# Set random seeds
torch.manual_seed(42)
np.random.seed(42)

## The Environment: CartPole

We'll use CartPole-v1 to demonstrate PPO's superior stability.

In [None]:
env = gym.make('CartPole-v1')
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n

print(f"State dimension: {state_dim}")
print(f"Action dimension: {action_dim}")

## Exercise 1: Compute GAE (Generalized Advantage Estimation)

GAE (Schulman et al., 2016) provides a better advantage estimate by combining n-step returns with different values of n.

**TD Error at time t**:
$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

**GAE Advantage**:
$$A_t^{GAE(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}$$
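
The sum can be unrolled into a one-step recursion, which is why the exercise below loops backwards through the trajectory. A short derivation, assuming a finite trajectory in which the advantage after the final step is zero:

$$A_t = \delta_t + (\gamma \lambda)\,\delta_{t+1} + (\gamma \lambda)^2\,\delta_{t+2} + \dots = \delta_t + \gamma \lambda \left(\delta_{t+1} + (\gamma \lambda)\,\delta_{t+2} + \dots\right) = \delta_t + \gamma \lambda\, A_{t+1}$$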

λ controls the bias-variance tradeoff:
- λ=0: A(s,a) = δ (1-step TD, high bias, low variance)
- λ=1: A(s,a) = sum of all future δ (Monte Carlo, low bias, high variance)
- λ=0.95: **Typical choice** (good balance)
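
The two limiting cases can be checked numerically. The sketch below evaluates the GAE sum term by term on a made-up 3-step trajectory; the helper names `td_errors` and `gae_direct` and all numbers are assumptions for illustration. It deliberately uses the direct sum rather than the backward loop you will write in the exercise, and it assumes no episode boundary inside the trajectory.

```python
def td_errors(rewards, values, gamma):
    """delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); values has one extra bootstrap entry."""
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]

def gae_direct(deltas, gamma, lam):
    """Evaluate sum_l (gamma * lam)^l * delta_{t+l} for every t."""
    n = len(deltas)
    return [
        sum((gamma * lam) ** l * deltas[t + l] for l in range(n - t))
        for t in range(n)
    ]

# Toy trajectory with made-up numbers; the final 0.0 is the bootstrap value.
rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5, 0.0]
gamma = 0.9

deltas = td_errors(rewards, values, gamma)
print("deltas:", [round(d, 3) for d in deltas])
for lam in (0.0, 0.95, 1.0):
    advantages = gae_direct(deltas, gamma, lam)
    print(f"lambda={lam}:", [round(a, 3) for a in advantages])
```

With λ=0 the advantages equal the TD errors. With λ=1 the first advantage equals the discounted return minus V(s_0), which is the Monte Carlo estimate.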

**Task**: Implement GAE computation.

In [None]:
# GRADED FUNCTION: compute_gae

def compute_gae(rewards, values, dones, gamma=0.99, lambda_=0.95):
    """
    Compute Generalized Advantage Estimation.
    
    Arguments:
    rewards -- list of rewards [r_0, r_1, ..., r_T]
    values -- list of state values [V(s_0), V(s_1), ..., V(s_T), V(s_{T+1})]
              Note: values has one more element for bootstrap
    dones -- list of done flags [d_0, d_1, ..., d_T]
    gamma -- discount factor
    lambda_ -- GAE lambda parameter
    
    Returns:
    advantages -- list of advantages [A_0, A_1, ..., A_T]
    returns -- list of returns (for critic training) [R_0, R_1, ..., R_T]
    """
    # (approx. 15-18 lines)
    # 1. Initialize advantages list
    # 2. Initialize gae = 0
    # 3. Loop backwards through trajectory (t from len(rewards) - 1 down to 0):
    #    a. Compute TD error:
    #       if done[t]:
    #           delta = reward[t] - value[t]
    #       else:
    #           delta = reward[t] + gamma * value[t+1] - value[t]
    #    b. Update GAE:
    #       if done[t]:
    #           gae = delta
    #       else:
    #           gae = delta + gamma * lambda_ * gae
    #    c. Insert gae at beginning of advantages list
    # 4. Compute returns element-wise: returns[t] = advantages[t] + values[t]
    #    (note: `+` on two Python lists concatenates instead of adding)
    # 5. Return advantages and returns
    
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return advantages, returns

In [None]:
# Test your implementation
compute_gae_test(compute_gae)

## Exercise 2: PPO Actor Network

Same architecture as A2C, but we'll use it differently (multiple epochs).

**Task**: Implement the Actor network.

In [None]:
# GRADED FUNCTION: PPOActorNetwork

class PPOActorNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        """
        Actor network for PPO.
        
        Arguments:
        state_dim -- dimension of state space
        action_dim -- dimension of action space
        hidden_dim -- number of hidden units
        """
        super(PPOActorNetwork, self).__init__()
        
        # (approx. 2 lines)
        # fc1: state_dim -> hidden_dim
        # fc2: hidden_dim -> action_dim
        
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
    
    def forward(self, state):
        """
        Forward pass.
        
        Arguments:
        state -- state tensor
        
        Returns:
        action_probs -- action probability distribution
        """
        # (approx. 3 lines)
        
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
        
        return action_probs

In [None]:
# Test your implementation
ppo_actor_network_test(PPOActorNetwork)

## Exercise 3: PPO Critic Network

Same as A2C - estimates state values.

**Task**: Implement the Critic network.

In [None]:
# GRADED FUNCTION: PPOCriticNetwork

class PPOCriticNetwork(nn.Module):
    def __init__(self, state_dim, hidden_dim=128):
        """
        Critic network for PPO.
        
        Arguments:
        state_dim -- dimension of state space
        hidden_dim -- number of hidden units
        """
        super(PPOCriticNetwork, self).__init__()
        
        # (approx. 2 lines)
        
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
    
    def forward(self, state):
        """
        Forward pass.
        """
        # (approx. 3 lines)
        
        # YOUR CODE STARTS HERE
        
        
        # YOUR CODE ENDS HERE
        
        return value

In [None]:
# Test your implementation
ppo_critic_network_test(PPOCriticNetwork)

## Exercise 4: Compute PPO Loss (The Core Innovation!)

This is where PPO shines! The **clipped surrogate objective**.

**Probability Ratio**:
$$r(\theta) = \frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}$$

**Clipped Objective**:
$$L^{CLIP} = \mathbb{E}[\min(r(\theta) A, \text{clip}(r(\theta), 1-\epsilon, 1+\epsilon) A)]$$

The clip prevents:
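- **Overly large increases** in an action's probability when A > 0: the objective stops growing once $r(\theta) > 1+\epsilon$
- **Overly large decreases** in an action's probability when A < 0: the objective stops growing once $r(\theta) < 1-\epsilon$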
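
The two cases can be verified by looking at how the per-sample term changes with the ratio. Below is a minimal sketch in plain Python that uses a finite difference in place of autograd; `clipped_term`, `slope`, and the test points are illustrative assumptions.

```python
def clipped_term(ratio, advantage, eps=0.2):
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def slope(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference estimate of d(term)/d(ratio)."""
    up = clipped_term(ratio + h, advantage, eps)
    down = clipped_term(ratio - h, advantage, eps)
    return (up - down) / (2 * h)

# Hypothetical (ratio, advantage) pairs covering both signs of A.
cases = [(1.1, +1.0), (1.5, +1.0), (0.5, +1.0),
         (0.9, -1.0), (0.5, -1.0), (1.5, -1.0)]
for ratio, advantage in cases:
    print(f"r={ratio:.1f}  A={advantage:+.0f}  "
          f"term={clipped_term(ratio, advantage):+.2f}  "
          f"slope={slope(ratio, advantage):+.2f}")
```

A slope of zero means the sample contributes no gradient, so the update has no incentive to push the ratio further. The slope stays nonzero when the ratio has moved in the direction that makes the objective worse (for example r=1.5 with A<0), so the policy can still be pulled back.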

**Critic Loss**: Same MSE as before

**Task**: Implement the PPO clipped loss.

In [None]:
# GRADED FUNCTION: compute_ppo_loss

def compute_ppo_loss(actor, states, actions, old_log_probs, advantages, 
                     critic, returns, clip_epsilon=0.2):
    """
    Compute PPO clipped loss.
    
    Arguments:
    actor -- current actor network
    states -- tensor of states
    actions -- tensor of actions taken
    old_log_probs -- tensor of log probs from old policy
    advantages -- tensor of advantages (should be normalized)
    critic -- current critic network
    returns -- tensor of returns (for critic training)
    clip_epsilon -- clipping parameter (typical: 0.2)
    
    Returns:
    actor_loss -- clipped policy loss
    critic_loss -- value function loss
    total_loss -- combined loss
    approx_kl -- approximate KL divergence (for monitoring)
    """
    # (approx. 18-22 lines)
    # 1. Get current action probabilities from actor
    # 2. Create distribution and get log probs of actions:
    #    dist = torch.distributions.Categorical(action_probs)
    #    new_log_probs = dist.log_prob(actions)
    # 3. Compute probability ratio:
    #    ratio = torch.exp(new_log_probs - old_log_probs)
    # 4. Compute surrogate losses:
    #    surr1 = ratio * advantages
    #    surr2 = torch.clamp(ratio, 1-clip_epsilon, 1+clip_epsilon) * advantages
    # 5. Actor loss (take minimum, then negate for gradient ascent):
    #    actor_loss = -torch.min(surr1, surr2).mean()
    # 6. Critic loss:
    #    values = critic(states).squeeze(-1)  # shape (N, 1) -> (N,) so it matches returns
    #    critic_loss = F.mse_loss(values, returns)
    # 7. Total loss:
    #    total_loss = actor_loss + 0.5 * critic_loss
    # 8. Approximate KL (for monitoring):
    #    approx_kl = (old_log_probs - new_log_probs).mean()
    
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return actor_loss, critic_loss, total_loss, approx_kl

In [None]:
# Test your implementation
compute_ppo_loss_test(compute_ppo_loss, PPOActorNetwork, PPOCriticNetwork)

## Exercise 5: Train PPO

PPO's training loop is more complex than A2C:
1. Collect trajectory (like A2C)
2. Compute advantages with GAE
3. **Multiple epochs** over the data
4. **Mini-batch updates** within each epoch

This reuse of data makes PPO more sample-efficient!
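
Steps 3 and 4 amount to reshuffling the collected samples at the start of every epoch and slicing them into mini-batches. The sketch below shows only that index bookkeeping in plain Python; the generator name `minibatch_indices` and the tiny sizes are assumptions for illustration, and the actual loss computation and optimizer step are left to the exercise.

```python
import random

def minibatch_indices(n_samples, batch_size, n_epochs, rng):
    """Yield (epoch, index_list) pairs; indices are reshuffled at the start of every epoch."""
    for epoch in range(n_epochs):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        for start in range(0, n_samples, batch_size):
            # The last batch of an epoch may be smaller than batch_size.
            yield epoch, indices[start:start + batch_size]

rng = random.Random(0)  # seeded only to make the sketch reproducible
for epoch, batch in minibatch_indices(n_samples=10, batch_size=4, n_epochs=2, rng=rng):
    print(f"epoch {epoch}: {batch}")
```

Every sample is visited exactly once per epoch, so with `update_epochs=4` each transition contributes to four gradient passes before fresh data is collected.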

**Task**: Implement the complete PPO training algorithm.

In [None]:
# GRADED FUNCTION: train_ppo

def train_ppo(env, actor, critic, optimizer, n_episodes=500, gamma=0.99, 
              lambda_=0.95, clip_epsilon=0.2, update_epochs=4, 
              batch_size=64, trajectory_length=2048):
    """
    Train PPO on the environment.
    
    Arguments:
    env -- Gym environment
    actor -- Actor network
    critic -- Critic network
    optimizer -- shared optimizer
    n_episodes -- number of episodes to train
    gamma -- discount factor
    lambda_ -- GAE lambda
    clip_epsilon -- PPO clip parameter
    update_epochs -- number of epochs to update on each batch
    batch_size -- mini-batch size
    trajectory_length -- collect this many steps before update
    
    Returns:
    episode_rewards -- list of episode rewards
    """
    episode_rewards = []
    
    # (approx. 50-60 lines - this is complex!)
    # For each episode:
    #   1. Collect trajectory up to trajectory_length steps
    #      Store: states, actions, rewards, log_probs, values, dones
    #   2. Compute GAE advantages and returns
    #   3. Normalize advantages: (advantages - mean) / (std + 1e-8)
    #   4. Convert to tensors
    #   5. For each update_epoch:
    #      a. Shuffle indices
    #      b. For each mini-batch:
    #         - Get batch of (states, actions, old_log_probs, advantages, returns)
    #         - Compute PPO loss
    #         - Update networks:
    #           * optimizer.zero_grad()
    #           * total_loss.backward()
    #           * optimizer.step()
    #   6. Track episode rewards
    #   7. Print progress
    
    # YOUR CODE STARTS HERE
    
    
    # YOUR CODE ENDS HERE
    
    return episode_rewards

In [None]:
# Test your implementation
train_ppo_test(train_ppo, PPOActorNetwork, PPOCriticNetwork)

## Full Training Run

Let's train PPO on CartPole and see the difference!

In [None]:
# Initialize networks
actor = PPOActorNetwork(state_dim, action_dim)
critic = PPOCriticNetwork(state_dim)
optimizer = optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

# Train PPO
episode_rewards = train_ppo(
    env, actor, critic, optimizer,
    n_episodes=300,
    gamma=0.99,
    lambda_=0.95,
    clip_epsilon=0.2,
    update_epochs=4,
    batch_size=64
)

# Plot
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(episode_rewards, alpha=0.6)
plt.plot(np.convolve(episode_rewards, np.ones(50)/50, mode='valid'), linewidth=2)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('PPO Training Progress')
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
window = 100
if len(episode_rewards) >= window:
    moving_avg = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
    plt.plot(moving_avg)
    plt.axhline(y=475, color='r', linestyle='--', label='Solved')
    plt.xlabel('Episode')
    plt.ylabel(f'Avg Reward ({window} ep)')
    plt.title('Moving Average')
    plt.legend()
    plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

if len(episode_rewards) >= 100:
    final = np.mean(episode_rewards[-100:])
    print(f"\n{'🎉 Solved!' if final >= 475 else '📊 Training complete'} Final avg: {final:.2f}")

## Why PPO is So Popular

### 1. **Stability**
- Clipping prevents catastrophic policy updates
- Approximates TRPO's trust-region update, which motivates (but does not formally guarantee) monotonic improvement
- Much more robust to hyperparameters than A2C

### 2. **Sample Efficiency**
- Multiple epochs reuse data effectively
- GAE provides better advantage estimates
- Typically needs fewer environment steps than single-update on-policy methods like A2C (PPO is itself on-policy)

### 3. **Performance**
- State-of-the-art on many continuous control tasks
- Used to train OpenAI Five, which defeated the Dota 2 world champions in 2019
- Used in robotics, game AI, and more

### 4. **Simplicity**
- Easier than TRPO (no complex constraint optimization)
- Few hyperparameters to tune
- Reliable default settings work well

## Comparison Summary

| Algorithm | Stability | Sample Efficiency | Performance | Complexity |
|-----------|-----------|-------------------|-------------|------------|
| REINFORCE | Low | Low | Moderate | Simple |
| Actor-Critic | Moderate | Moderate | Good | Simple |
| A2C | Good | Good | Good | Moderate |
| **PPO** | **Excellent** | **Excellent** | **Excellent** | Moderate |

## Hyperparameter Guidelines

**Typical values** (good starting points):
- **clip_epsilon**: 0.2 (range: 0.1-0.3)
- **lambda_**: 0.95 (range: 0.9-0.99)
- **update_epochs**: 4-10
- **batch_size**: 64-256
- **learning_rate**: 3e-4

**Tuning tips**:
- Start with defaults, they usually work!
- Increase clip_epsilon for more aggressive updates
- Increase lambda_ for lower bias (but higher variance)
- More epochs = better data usage (but risk overfitting)

## Congratulations!

You've successfully implemented PPO! You now understand:
- ✅ The clipped surrogate objective and why it works
- ✅ Generalized Advantage Estimation (GAE)
- ✅ Multi-epoch training with mini-batches
- ✅ Why PPO is the gold standard for policy gradient methods
- ✅ How to tune PPO hyperparameters

**You've completed one of the most important algorithms in modern Deep RL!**

**Next Steps**:
- Try PPO on continuous control tasks (MuJoCo, PyBullet)
- Implement PPO for continuous action spaces
- Explore PPO variants (PPO-penalty, PPO with curiosity)
- Read the original paper: "Proximal Policy Optimization Algorithms" (Schulman et al., 2017)