# Proximal Policy Optimization (PPO)

This notebook introduces **Proximal Policy Optimization (PPO)**, a stable and widely used policy-gradient method. We present the intuition, the clipped surrogate objective, and a concise PyTorch implementation applied to a discrete-action environment (`CartPole-v1`).

You'll find:
- PPO theory (clipped objective)
- A compact actor-critic implementation (shared network)
- Trajectory collection and advantage computation
- PPO update loop

This implementation is educational: it prioritizes clarity over raw performance.

## 🔍 Quick theory recap

PPO maximizes a clipped surrogate objective to keep policy updates **proximal** (not too large):

$$ L^{CLIP}(\theta) = \mathbb{E}_t\Big[ \min\big( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \big)\Big] $$
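where $r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio between the new and old policies and $\hat{A}_t$ is the advantage estimate. Clipping removes the incentive to move $r_t$ outside $[1-\epsilon, 1+\epsilon]$, which discourages large policy shifts and stabilizes training.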
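
To make the clipping concrete, the following pure-Python sketch evaluates the per-sample term inside the expectation for a few hand-picked ratios. The helper name `clipped_term` and the sample values are illustrative assumptions, not part of the implementation later in this notebook.

```python
def clipped_term(ratio, advantage, eps=0.2):
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
print(clipped_term(1.5, advantage=2.0))   # uses 1.2 * 2.0 instead of 1.5 * 2.0
# Negative advantage: the term is capped once the ratio drops below 1 - eps.
print(clipped_term(0.5, advantage=-2.0))  # uses 0.8 * -2.0 instead of 0.5 * -2.0
# Raising the probability of a bad action is never clipped: min keeps the worse value.
print(clipped_term(1.5, advantage=-2.0))  # keeps 1.5 * -2.0
```

In the first two cases the clipped branch is selected, so its gradient with respect to the ratio is zero and the sample stops pushing the policy further. In the third case the unclipped branch is kept, so the penalty for a harmful change is still applied in full.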

## 🧩 Minimal PyTorch Implementation (CartPole-v1)
Run cells sequentially. If running in a fresh environment, install `gymnasium` and `torch` first.

In [None]:
# Uncomment to install if needed
# !pip install gymnasium torch --quiet

# Only the modules used in this notebook are imported
import gymnasium as gym
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Device:', device)

## 🔧 Actor-Critic (shared) network
The network outputs action logits (turned into a categorical policy when sampling) and a state-value estimate.

In [None]:
class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU()
        )
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, x):
        x = self.shared(x)
        logits = self.policy_head(x)
        value = self.value_head(x).squeeze(-1)
        return logits, value

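    @torch.no_grad()
    def get_action(self, obs):
        # Sample one action for a single observation; gradients are not needed during rollout
        obs_t = torch.as_tensor(obs, dtype=torch.float32, device=device).unsqueeze(0)
        logits, value = self.forward(obs_t)
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        # Returns: action, log-prob of that action, value estimate, full action probabilities
        return a.item(), dist.log_prob(a).item(), value.item(), dist.probs.cpu().numpy()[0]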


## 📦 Trajectory buffer & advantage computation (simple returns minus baseline)
We compute **discounted returns** `G`, bootstrapping from the critic's value estimate when the rollout stops mid-episode, and use `G - V` as the advantage (you can replace this with GAE for better performance).
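
Written out, the backward recursion implemented in the cell below is as follows, where $d_t \in \{0,1\}$ marks the end of an episode at step $t$ and $T$ is the rollout length (this notation is introduced here for illustration):

$$ G_t = r_t + \gamma\,(1 - d_t)\,G_{t+1}, \qquad G_T = V(s_T), \qquad \hat{A}_t = G_t - V(s_t) $$

For example, with rewards $(1, 1, 1)$, done flags $(0, 1, 0)$, bootstrap value $V(s_T) = 0.5$ and $\gamma = 0.9$:

$$ G_2 = 1 + 0.9 \cdot 0.5 = 1.45, \qquad G_1 = 1 + 0.9 \cdot 0 \cdot G_2 = 1, \qquad G_0 = 1 + 0.9 \cdot 1 = 1.9 $$

The episode boundary at step 1 keeps reward from the following episode from leaking backward into the earlier one.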

In [None]:
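def compute_returns(rewards, dones, last_value, gamma=0.99):
    """Discounted returns, bootstrapped from last_value and reset at episode ends."""
    returns = []
    R = last_value
    for r, d in zip(reversed(rewards), reversed(dones)):
        if d:
            R = 0.0  # episode ended at this step: nothing to bootstrap from
        R = r + gamma * R
        returns.append(R)
    returns.reverse()  # restore chronological order (avoids repeated insert(0, ...))
    return returns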

class RolloutBuffer:
    def __init__(self):
        self.obs = []
        self.actions = []
        self.log_probs = []
        self.rewards = []
        self.dones = []
        self.values = []

    def clear(self):
        self.__init__()


## 🧪 PPO Update step (clipped surrogate + value loss + entropy bonus)
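The update cell below minimizes a single scalar that combines the three terms named in the heading. In the notation of the theory recap, and with $L^{VF}$ and $S$ as shorthand introduced here for the `value_loss` and `entropy` variables in the code:

$$ L(\theta) = -L^{CLIP}(\theta) + c_1\, L^{VF}(\theta) - c_2\, S[\pi_\theta], \qquad L^{VF}(\theta) = \mathbb{E}_t\big[(G_t - V_\theta(s_t))^2\big] $$

Here $G_t$ is the discounted return and $S[\pi_\theta]$ is the mean policy entropy over the minibatch. The sign on $L^{CLIP}$ flips because the optimizer minimizes, while the entropy term is subtracted so that higher entropy lowers the loss.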

In [None]:
def ppo_update(model, optimizer, obs, actions, old_log_probs, returns, values, clip_eps=0.2, c1=0.5, c2=0.01, epochs=4, batch_size=64):
    # Convert to tensors
    obs = torch.tensor(np.array(obs), dtype=torch.float32, device=device)
    actions = torch.tensor(np.array(actions), dtype=torch.long, device=device)
    old_log_probs = torch.tensor(np.array(old_log_probs), dtype=torch.float32, device=device)
    returns = torch.tensor(np.array(returns), dtype=torch.float32, device=device)
    values = torch.tensor(np.array(values), dtype=torch.float32, device=device)

    # Advantage = return minus the value predicted at rollout time, normalized over the rollout
    advantages = returns - values
    advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

    dataset_size = len(obs)
    inds = np.arange(dataset_size)

    for _ in range(epochs):
        np.random.shuffle(inds)
        for start in range(0, dataset_size, batch_size):
            batch_inds = inds[start:start+batch_size]
            b_obs = obs[batch_inds]
            b_actions = actions[batch_inds]
            b_old_logp = old_log_probs[batch_inds]
            b_returns = returns[batch_inds]
            b_adv = advantages[batch_inds]

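            logits, vals = model(b_obs)
            # Build the distribution from logits directly (skips the explicit softmax)
            dist = torch.distributions.Categorical(logits=logits)
            new_logp = dist.log_prob(b_actions)
            entropy = dist.entropy().mean()

            # Probability ratio r_t = pi_new / pi_old, computed in log space
            ratio = torch.exp(new_logp - b_old_logp)
            surr1 = ratio * b_adv
            surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * b_adv
            # Negated because the optimizer minimizes while PPO maximizes the surrogate
            policy_loss = -torch.min(surr1, surr2).mean()

            # Critic regression toward the discounted returns
            value_loss = ((b_returns - vals) ** 2).mean()

            # Subtracting the entropy term rewards exploration
            loss = policy_loss + c1 * value_loss - c2 * entropy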

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


## ▶️ Training loop (compact)
Adjust `TOTAL_UPDATES` and `steps_per_update` to change the training length. This compact demo shows the full flow: collect a rollout, compute returns, then run the PPO update.
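
As a rough guide to what these settings cost, the sketch below counts environment steps and optimizer steps for the values used in the training cell. The helper `training_budget` is a hypothetical name introduced here; it is not called anywhere else in the notebook.

```python
def training_budget(total_updates, steps_per_update, epochs, batch_size):
    """Count environment steps and optimizer steps for a PPO run."""
    # Ceiling division: a final partial minibatch still costs one optimizer step.
    minibatches_per_epoch = -(-steps_per_update // batch_size)
    grad_steps_per_update = epochs * minibatches_per_epoch
    return {
        "env_steps": total_updates * steps_per_update,
        "grad_steps_per_update": grad_steps_per_update,
        "grad_steps_total": total_updates * grad_steps_per_update,
    }

# Settings copied from the training cell.
budget = training_budget(total_updates=300, steps_per_update=2048, epochs=8, batch_size=64)
print(budget)
```

With these values the run collects 614,400 environment steps and performs 256 optimizer steps per update. Lowering `TOTAL_UPDATES` or `steps_per_update` shrinks the environment-step count proportionally, which is the quickest way to shorten a demo run.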

In [None]:
env = gym.make('CartPole-v1')
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

model = ActorCritic(obs_dim, n_actions).to(device)
optimizer = optim.Adam(model.parameters(), lr=3e-4)

buffer = RolloutBuffer()

TOTAL_UPDATES = 300  # number of outer loops
steps_per_update = 2048  # collect this many steps per update (reduce for faster demo)
gamma = 0.99

ep_rewards = []
obs, _ = env.reset()
episode_reward = 0

for update in range(1, TOTAL_UPDATES + 1):
    buffer.clear()
    for step in range(steps_per_update):
        action, logp, value, _ = model.get_action(obs)
        next_obs, reward, terminated, truncated, info = env.step(action)
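        # Simplification: a time-limit truncation is treated like a true termination,
        # so the return is not bootstrapped past it
        done = terminated or truncated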

        buffer.obs.append(obs)
        buffer.actions.append(action)
        buffer.log_probs.append(logp)
        buffer.values.append(value)
        buffer.rewards.append(reward)
        buffer.dones.append(done)

        obs = next_obs
        episode_reward += reward

        if done:
            ep_rewards.append(episode_reward)
            obs, _ = env.reset()
            episode_reward = 0

    # Bootstrap value for the observation after the last collected step
    # (compute_returns ignores it if that step ended an episode)
    with torch.no_grad():
        _, last_value = model(torch.as_tensor(obs, dtype=torch.float32, device=device).unsqueeze(0))
        last_value = last_value.item()

    returns = compute_returns(buffer.rewards, buffer.dones, last_value, gamma=gamma)

    # update
    ppo_update(model, optimizer, buffer.obs, buffer.actions, buffer.log_probs, returns, buffer.values,
               clip_eps=0.2, c1=0.5, c2=0.01, epochs=8, batch_size=64)

    if update % 5 == 0:
        avg_r = np.mean(ep_rewards[-20:]) if len(ep_rewards) > 0 else 0.0
        print(f'Update {update}/{TOTAL_UPDATES} | AvgReward(20): {avg_r:.2f}')

# Plot training rewards
plt.plot(ep_rewards)
plt.xlabel('Episode')
plt.ylabel('Reward')
plt.title('PPO: Episode rewards')
plt.show()
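
Raw episode rewards are noisy, so a trailing average can make the trend easier to read, in the same spirit as the 20-episode average printed during training. The helper below is a minimal pure-Python sketch; the name `moving_average` and the default window are assumptions made here for illustration.

```python
def moving_average(values, window=20):
    """Trailing mean over the last `window` values (shorter at the start)."""
    smoothed = []
    running_sum = 0.0
    for i, v in enumerate(values):
        running_sum += v
        if i >= window:
            # Drop the value that just left the window
            running_sum -= values[i - window]
        smoothed.append(running_sum / min(i + 1, window))
    return smoothed

print(moving_average([10, 20, 30, 40], window=2))  # [10.0, 15.0, 25.0, 35.0]
```

In this notebook, plotting `moving_average(ep_rewards)` on top of `ep_rewards` would overlay the smoothed curve on the raw one.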

## 🧾 Notes & Next steps
- This implementation uses `returns - values` as the advantage; for improved performance use **GAE (Generalized Advantage Estimation)**.
- PPO has many practical tricks (learning-rate schedules, entropy annealing, observation normalization, value-function clipping). Explore those once the base algorithm works.
- For continuous actions, replace the categorical policy with a Gaussian policy (mean/std outputs).
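
For the GAE suggestion above, a minimal pure-Python sketch might look like the following. The function name `compute_gae` and the default `lam=0.95` are assumptions chosen for illustration (0.95 is a commonly used value, not one taken from this notebook), and the sketch treats every `done` flag as a true termination, as the rest of the notebook does.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one rollout.

    values[t] is V(s_t) recorded during the rollout; last_value is V of the
    observation that follows the final step. Returns (advantages, returns).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    next_value = last_value
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; the bootstrap term is dropped at episode ends.
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        # Exponentially weighted sum of TD errors, reset at episode ends.
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Critic targets: each advantage plus the baseline it was measured against.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

Because `ppo_update` computes advantages as `returns - values`, passing the `returns` from this function together with `buffer.values` should reproduce the GAE advantages without changing the update code. Setting `lam=1.0` recovers the plain discounted returns used in this notebook, while smaller values trade some bias for lower variance.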

## ✅ Summary
- PPO is a reliable, comparatively simple policy-gradient algorithm.
- The clipped surrogate objective stabilizes updates and is the core innovation.
- This notebook demonstrates a compact actor-critic PPO that you can extend and improve.