# 🧠 PPO: Proximal Policy Optimization — A Deep Dive

## 📌 What Problem is PPO Solving?
PPO optimizes the standard RL objective: find the policy parameters \(\theta\) that maximize the expected return over trajectories sampled from the policy:

$$
\max_\theta \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]
$$

Where:
- \( \pi_\theta \) is the policy parameterized by \(\theta\)
- \( \tau = (s_0, a_0, s_1, a_1, ..., s_T, a_T) \) is a trajectory: a sequence of states and actions
- \( R(\tau) = \sum_t r_t \) is the return (total reward for that trajectory)
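
As a small illustration of the objective, the expectation can be approximated by averaging returns over sampled trajectories. The sketch below assumes made-up reward lists; the function names `trajectory_return` and `estimate_objective` are hypothetical helpers, not part of any library.

```python
def trajectory_return(rewards):
    """R(tau): the undiscounted sum of per-step rewards for one trajectory."""
    return sum(rewards)


def estimate_objective(sampled_reward_lists):
    """Monte Carlo estimate of E[R(tau)] from a batch of sampled trajectories."""
    returns = [trajectory_return(rewards) for rewards in sampled_reward_lists]
    return sum(returns) / len(returns)


# Made-up per-step rewards for three sampled trajectories (illustrative only)
sampled = [[0.0, 0.0, 1.0], [0.0, 0.5, 0.5], [0.0, 0.0, 0.0]]
objective_estimate = estimate_objective(sampled)  # mean of the returns 1.0, 1.0, 0.0
```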

## 🧱 Key Components of PPO

### 1. Policy Gradient Foundation
PPO builds on the policy gradient theorem:

$$
\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a_t | s_t) \cdot A_t]
$$

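- \(A_t\): Advantage, which measures how much better (or worse) action \(a_t\) is than the policy's average behavior in state \(s_t\).

Nothing in this gradient limits how far a single update moves the policy, so training can be unstable. That limit is what PPO adds.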
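
To make the update rule concrete, here is a minimal sketch for a softmax policy over three actions, where the gradient of \(\log \pi_\theta(a)\) with respect to the logits is the one-hot vector of \(a\) minus the action probabilities. The helper names and the learning rates are assumptions chosen for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs."""
    probs = softmax(logits)
    return [(1.0 if k == action else 0.0) - p for k, p in enumerate(probs)]


def policy_gradient_step(logits, action, advantage, lr=0.1):
    """One gradient-ascent step on log pi(action) * advantage."""
    grad = grad_log_prob(logits, action)
    return [x + lr * advantage * g for x, g in zip(logits, grad)]


logits = [0.0, 0.0, 0.0]
before = softmax(logits)[1]
after_good = softmax(policy_gradient_step(logits, action=1, advantage=2.0))[1]
after_bad = softmax(policy_gradient_step(logits, action=1, advantage=-2.0))[1]
# Same sample, much larger step size: almost all probability lands on action 1
after_huge = softmax(policy_gradient_step(logits, action=1, advantage=2.0, lr=5.0))[1]
```

A positive advantage raises the probability of the sampled action and a negative one lowers it. The last line shows the instability: with a large step, one sample is enough to make the policy nearly deterministic.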

---

## 🚧 Problem with Vanilla Policy Gradient: Instability

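If one update sharply increases the probability of a single action, the policy can:
- Overfit to a handful of high-advantage samples
- Lose diversity, so exploration dries up
- Collapse training, because later data is collected from the degraded policy

PPO addresses this by **restricting how far the new policy can move** from the policy that collected the data, using a **clipping mechanism**.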

---

## ✅ PPO’s Surrogate Objective (Clipped)

PPO maximizes a clipped surrogate objective (in code, its negative is minimized as the loss):

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \cdot \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \cdot \hat{A}_t\right)\right]
$$

Where:
- \( r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)} \) — the probability ratio
- \( \hat{A}_t \) — advantage estimate
- \(\epsilon\) — clipping threshold (e.g., 0.2)

---

## 🧮 Step-by-Step Breakdown

### 🔁 Step 1: Sample Trajectories

```python
response = ppo_trainer.generate(query, **generation_kwargs)
```
Generates \(\tau = \text{prompt} + \text{response}\)

---

### 🧠 Step 2: Compute Rewards \(R(\tau)\)

```python
rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]
```
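Assigns one scalar reward to each generated sequence. Here the reward is the sentiment classifier's raw `POSITIVE` logit, which is the entry at index 1 of each output.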

---

### 💡 Step 3: Estimate Advantage \(\hat{A}_t\)

Using **Generalized Advantage Estimation (GAE)**:

$$
\hat{A}_t = \sum_{l=0}^{\infty} (\gamma\lambda)^l \cdot \delta_{t+l}
$$

Where:

$$
\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$

- \(\gamma\): Discount factor (e.g., 0.99)
- \(\lambda\): Smoothing factor (e.g., 0.95)
- \(V(s)\): Estimated value function

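In TRL there is no separate call for this: `ppo_trainer.step` runs the value head, computes the advantages, and applies the PPO update internally.

```python
ppo_trainer.step(query_tensors, response_tensors, rewards)
```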
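
A minimal sketch of the GAE recursion, assuming a single finite trajectory with one extra bootstrap value appended to `values`. The infinite sum becomes a backward pass using \(\hat{A}_t = \delta_t + \gamma\lambda \hat{A}_{t+1}\). The function name `compute_gae` and the toy numbers are assumptions for illustration, not TRL's internal code.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Return GAE advantages for one finite trajectory.

    `values` must hold one more entry than `rewards`: the last entry is the
    value estimate for the state after the final step (0.0 if terminal).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values needs len(rewards) + 1 entries")
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the advantage of the step after it
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Toy trajectory: reward only at the end, value estimates all zero
toy_rewards = [0.0, 0.0, 1.0]
toy_values = [0.0, 0.0, 0.0, 0.0]
advantages = compute_gae(toy_rewards, toy_values, gamma=0.5, lam=1.0)
# advantages == [0.25, 0.5, 1.0]: the final reward is discounted back in time
```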

---

### 🔢 Step 4: Compute Probability Ratio \(r_t\)

```python
ratio = torch.exp(new_log_probs - old_log_probs)
```

This is:
$$
\frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{\text{old}}}(a_t | s_t)}
$$
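
The ratio is computed from log-probabilities because language models produce very small token probabilities, and subtracting logs is more stable than dividing them. A small sketch with made-up probabilities (the helper name `probability_ratio` is hypothetical):

```python
import math


def probability_ratio(new_log_prob, old_log_prob):
    """r_t = pi_new / pi_old, computed as exp(log pi_new - log pi_old)."""
    return math.exp(new_log_prob - old_log_prob)


# One action whose probability rose from 0.2 to 0.3 after an update
ratio = probability_ratio(math.log(0.3), math.log(0.2))  # about 1.5

# Per-token ratios for a two-token response (illustrative probabilities)
old_token_log_probs = [math.log(0.5), math.log(0.2)]
new_token_log_probs = [math.log(0.6), math.log(0.25)]
token_ratios = [
    probability_ratio(new, old)
    for new, old in zip(new_token_log_probs, old_token_log_probs)
]
```

A ratio above 1 means the new policy makes the action more likely than the old one did; a ratio of exactly 1 means nothing changed.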

---

### 🔒 Step 5: Clip the Ratio

```python
clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
```

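This approximates a **trust region**: once the ratio leaves \([1 - \epsilon, 1 + \epsilon]\) in the direction the advantage favors, the objective goes flat, so there is no incentive to push the policy further.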

---

### 🔁 Step 6: Apply the PPO Objective

```python
surr1 = ratio * advantages
surr2 = clipped_ratio * advantages
# The surrogate is maximized, so the loss to minimize is its negative mean
loss = -torch.min(surr1, surr2).mean()
```

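Taking the `min` ensures:
- If \(\hat{A}_t > 0\), the action is reinforced, but the gain is capped once \(r_t > 1 + \epsilon\)
- If \(\hat{A}_t < 0\), the action is suppressed, but there is no further benefit once \(r_t < 1 - \epsilon\)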
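
The four combinations of advantage sign and ratio direction can be checked with scalars. This is a sketch with \(\epsilon = 0.2\) and hand-picked ratios; `clipped_surrogate` is a hypothetical helper that evaluates the per-sample objective (not the negated loss).

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


cases = [
    (1.5, 1.0),   # good action, already much more likely: capped at 1.2
    (0.5, 1.0),   # good action, made less likely: unclipped, 0.5
    (1.5, -1.0),  # bad action, made more likely: unclipped, -1.5
    (0.5, -1.0),  # bad action, already much less likely: capped at -0.8
]
values = [clipped_surrogate(ratio, adv) for ratio, adv in cases]
```

Clipping only removes the incentive to keep moving in a direction that already helps. When the policy has moved the wrong way (the second and third cases), the objective stays unclipped and the full corrective gradient is kept.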

---

## 🧭 KL Divergence as a Constraint

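The PPO paper also describes a KL-penalized variant of the objective, which some implementations use instead of, or alongside, clipping:

$$
L^{\text{KL-PEN}}(\theta) = \mathbb{E}_t\left[r_t(\theta) \cdot \hat{A}_t - \beta \cdot D_{\text{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot | s_t) \,\|\, \pi_\theta(\cdot | s_t)\right)\right]
$$

This form penalizes drift from the previous policy \(\pi_{\theta_{\text{old}}}\). In RLHF the penalty is usually measured against a frozen copy of the starting model (the reference policy) instead, so the fine-tuned model stays close to where it began: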

```python
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
```
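
A common way to apply the reference-model penalty in RLHF is to fold it into the reward instead of the loss: each generated token receives \(-\beta\) times the log-probability gap between the policy and the reference model, and the sequence-level score is added on the final token. The sketch below assumes that scheme; the function name, the value of `beta`, and the log-probabilities are made up for illustration.

```python
def kl_shaped_rewards(log_probs, ref_log_probs, score, beta=0.2):
    """Per-token rewards: -beta * (log pi - log pi_ref), plus `score` on the last token."""
    if not log_probs or len(log_probs) != len(ref_log_probs):
        raise ValueError("need two non-empty lists of equal length")
    # Sample-based per-token KL estimate between policy and reference
    kl_terms = [lp - ref_lp for lp, ref_lp in zip(log_probs, ref_log_probs)]
    rewards = [-beta * kl for kl in kl_terms]
    rewards[-1] += score  # the reward model scores the whole sequence once
    return rewards


# Two-token response: the policy is more confident than the reference on token 0
shaped = kl_shaped_rewards(
    log_probs=[-1.0, -2.0], ref_log_probs=[-1.5, -2.0], score=1.0, beta=0.2
)
```

Token 0 is penalized for the 0.5-nat gap, while token 1 matches the reference and keeps the full score.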

---

## 🔄 Full PPO Loop in Practice
1. Generate responses: \(\tau = \text{prompt} + \text{response}\)
2. Compute reward: \(R(\tau)\)
3. Estimate value: \(V(s)\)
4. Compute advantage: \(\hat{A}_t\)
5. Calculate ratio \(r_t\)
6. Clip ratio & apply surrogate loss
7. (Optionally) apply KL penalty
8. Update policy
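
The loop above can be sketched end to end on a toy problem. The example below is an illustration only: it assumes a two-armed bandit (hypothetical rewards 0 and 1), a softmax policy with two logits, a mean-reward baseline in place of a learned \(V(s)\), and no KL penalty. The gradient of the clipped objective is written out by hand, using \(\partial r / \partial \text{logit}_k = r \cdot (\mathbb{1}[k = a] - \pi_k)\).

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_bandit_ppo(iterations=50, batch_size=64, ppo_epochs=4,
                     lr=0.5, eps=0.2, seed=0):
    rng = random.Random(seed)
    arm_rewards = [0.0, 1.0]  # hypothetical environment: arm 1 is better
    logits = [0.0, 0.0]       # policy parameters, starting uniform

    for _ in range(iterations):
        # Steps 1-2: sample actions from the current (old) policy, collect rewards
        old_probs = softmax(logits)
        actions = [0 if rng.random() < old_probs[0] else 1
                   for _ in range(batch_size)]
        rewards = [arm_rewards[a] for a in actions]

        # Steps 3-4: advantage = reward minus a mean-reward baseline
        baseline = sum(rewards) / batch_size
        advantages = [r - baseline for r in rewards]

        # Steps 5-8: several gradient-ascent epochs on the clipped surrogate
        for _ in range(ppo_epochs):
            probs = softmax(logits)
            grad = [0.0, 0.0]
            for a, adv in zip(actions, advantages):
                ratio = probs[a] / old_probs[a]
                # The clipped branch is flat, so it contributes no gradient
                if (adv > 0 and ratio > 1 + eps) or (adv < 0 and ratio < 1 - eps):
                    continue
                for k in range(2):
                    indicator = 1.0 if k == a else 0.0
                    grad[k] += adv * ratio * (indicator - probs[k]) / batch_size
            logits = [x + lr * g for x, g in zip(logits, grad)]

    return softmax(logits)


final_probs = train_bandit_ppo()
```

Within each batch the clipping stops the update once sampled actions have moved about 20% in probability, so the policy shifts toward the better arm in bounded steps instead of jumping there at once.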

---


## 🧪 PPO in Natural Language Tasks (e.g. RLHF)

| Concept        | Implementation |
|----------------|----------------|
| Query          | Prompt         |
| Response       | Generated text |
| Reward         | Sentiment, helpfulness, preference |
| Trajectory     | Token sequence |
| Policy         | Causal LM with value head |
| Update         | PPO clipped objective |

---

## ✅ Why PPO Works
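- **Stable**: Clipping limits how far each update can move the policy
- **Effective**: Advantage estimates reduce gradient variance and focus learning on better-than-expected actions
- **General**: Works with any scalar reward signal, learned or hand-written
- **Safe**: The KL penalty discourages drift from the reference policy, which helps prevent policy collapse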

---

## 📚 References
- [Policy Gradient](https://huggingface.co/learn/deep-rl-course/unit4/policy-gradient)
- [Reinforcement Learning from Human Feedback explained with math derivations and the PyTorch code](https://www.youtube.com/watch?v=qGyFrqc34yc&t=6400s&ab_channel=UmarJamil)
- [OpenAI PPO Blog](https://openai.com/research/openai-baselines-ppo/)
- [SpinningUp PPO](https://spinningup.openai.com/en/latest/algorithms/ppo.html)
- [TRL PPO Cookbook (Hugging Face)](https://huggingface.co/learn/cookbook/en/ppo_rlhf_trl)

In [1]:
# !pip install trl==0.11.3
# !pip install wandb

In [2]:
import torch
from tqdm import tqdm
from transformers import pipeline, AutoTokenizer
from datasets import load_dataset
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl.core import LengthSampler
import wandb



In [None]:
# 🚀 1. Define PPO Configuration
config = PPOConfig(
    model_name="lvwerra/gpt2-imdb",
    learning_rate=1.41e-5,
    # log_with="wandb",
)



In [None]:
# 🧱 2. Build the IMDB Prompt Dataset for Training
def build_dataset(
    config, dataset_name="imdb", input_min_text_length=2, input_max_text_length=8
):
    tokenizer = AutoTokenizer.from_pretrained(config.model_name)
    tokenizer.pad_token = tokenizer.eos_token
    ds = load_dataset(dataset_name, split="train")
    ds = ds.rename_columns({"text": "review"})
    ds = ds.filter(lambda x: len(x["review"]) > 200)

    input_size = LengthSampler(input_min_text_length, input_max_text_length)

    def tokenize(sample):
        sample["input_ids"] = tokenizer.encode(sample["review"])[: input_size()]
        sample["query"] = tokenizer.decode(sample["input_ids"])
        return sample

    ds = ds.map(tokenize)
    ds.set_format(type="torch")
    return ds

In [5]:
# 📦 Custom data collator for PPOTrainer
def collator(data):
    return dict((key, [d[key] for d in data]) for key in data[0])
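
To make the collator's effect concrete, here is a small sketch on made-up records (the field values are illustrative, not IMDB data). It turns a list of per-sample dicts into one dict of per-field lists, so variable-length `input_ids` stay as separate unpadded entries.

```python
def collator(data):
    # Same logic as the notebook's collator: transpose list-of-dicts to dict-of-lists
    return dict((key, [d[key] for d in data]) for key in data[0])


# Made-up samples with different input lengths
samples = [
    {"input_ids": [101, 7], "query": "This film"},
    {"input_ids": [101, 7, 42], "query": "I rented this"},
]
batch = collator(samples)
# batch["input_ids"] is a list of two lists with lengths 2 and 3 (no padding)
```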

In [None]:
# 🔧 3. Load Dataset and Initialize PPO Components
dataset = build_dataset(config)

model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)
tokenizer.pad_token = tokenizer.eos_token

ppo_trainer = PPOTrainer(
    config, model, ref_model, tokenizer, dataset=dataset, data_collator=collator
)



In [None]:
# 🧠 4. Load Reward Model (Sentiment Classifier)
device = ppo_trainer.accelerator.device
if ppo_trainer.accelerator.num_processes == 1:
    device = 0 if torch.cuda.is_available() else "cpu"

sentiment_pipe = pipeline(
    "sentiment-analysis",
    model="lvwerra/distilbert-imdb",
    device=device,
)
sent_kwargs = {
    "return_all_scores": True,
    "function_to_apply": "none",
    "batch_size": 16,
}

Device set to use cuda:0


In [8]:
# 🔍 Test reward model
print(sentiment_pipe("this movie was really bad!!", **sent_kwargs))
print(sentiment_pipe("this movie was really good!!", **sent_kwargs))

[[{'label': 'NEGATIVE', 'score': 2.335048198699951}, {'label': 'POSITIVE', 'score': -2.726576328277588}]]
[[{'label': 'NEGATIVE', 'score': -2.2947897911071777}, {'label': 'POSITIVE', 'score': 2.557039737701416}]]
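
The training loop later reads the reward as `output[1]["score"]`, which relies on `POSITIVE` being the second entry. As a hedged alternative, the sketch below looks the entry up by label instead. It reuses the two outputs printed above, combined into one list as if both texts had been passed in a single call; `positive_logit` is a hypothetical helper.

```python
# The two pipeline outputs shown above, combined into one batch
pipe_outputs = [
    [{"label": "NEGATIVE", "score": 2.335048198699951},
     {"label": "POSITIVE", "score": -2.726576328277588}],
    [{"label": "NEGATIVE", "score": -2.2947897911071777},
     {"label": "POSITIVE", "score": 2.557039737701416}],
]


def positive_logit(scores):
    """Return the POSITIVE score by label instead of by list position."""
    return next(s["score"] for s in scores if s["label"] == "POSITIVE")


rewards = [positive_logit(scores) for scores in pipe_outputs]
# The negative review gets a negative reward, the positive review a positive one
```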




In [11]:
# ⚙️ 5. Response Generation Configuration
output_min_length = 4
output_max_length = 16
output_length_sampler = LengthSampler(output_min_length, output_max_length)

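generation_kwargs = {
    "min_length": -1,  # do not enforce a minimum length
    "top_k": 0,  # 0 disables top-k filtering
    "top_p": 1.0,  # 1.0 disables nucleus (top-p) filtering
    "do_sample": True,  # sample from the distribution instead of greedy decoding
    "pad_token_id": tokenizer.eos_token_id,
}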

In [None]:
# 🔁 6. PPO Training Loop
for epoch, batch in tqdm(enumerate(ppo_trainer.dataloader)):
    query_tensors = batch["input_ids"]

    # Step 1: Generate responses
    response_tensors = []
    for query in query_tensors:
        gen_len = output_length_sampler()
        generation_kwargs["max_new_tokens"] = gen_len
        response = ppo_trainer.generate(query, **generation_kwargs)
        # The output holds query + response; keep only the last gen_len tokens
        response_tensors.append(response.squeeze()[-gen_len:])

    batch["response"] = [tokenizer.decode(r) for r in response_tensors]

    # Step 2: Compute rewards with the sentiment classifier
    texts = [q + r for q, r in zip(batch["query"], batch["response"])]
    pipe_outputs = sentiment_pipe(texts, **sent_kwargs)
    # output[1] is the POSITIVE entry; its raw logit is the scalar reward
    rewards = [torch.tensor(output[1]["score"]) for output in pipe_outputs]

    # Step 3: PPO update (policy gradient step with clipping & advantage estimation)
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)
    ppo_trainer.log_stats(stats, batch, rewards)

0it [00:00, ?it/s]The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
8it [01:15,  9.05s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
45it [07:13,  9.29s/it]

In [None]:
# 💾 7. Save Fine-Tuned Model
model.save_pretrained("gpt2-imdb-pos-v2", push_to_hub=False)
tokenizer.save_pretrained("gpt2-imdb-pos-v2", push_to_hub=False)