![Logo](../assets/logo.png)

Made by **Zoltán Barta**

[<img src="https://colab.research.google.com/assets/colab-badge.svg">](https://colab.research.google.com/github/Fortuz/rl_education/blob/main/9.%20On-policy%20Control/ppo_homework.ipynb)


# Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is a popular policy gradient algorithm in reinforcement learning, widely used for continuous control tasks. PPO approximates the trust-region idea of TRPO (Trust Region Policy Optimization) with a clipped surrogate objective, which keeps training stable while needing only first-order gradient methods instead of TRPO's constrained optimization.

## What is PPO?

PPO is an on-policy, actor-critic algorithm with the following features:

1. **Actor-Critic Structure**:  
   - Actor network: policy $\pi_\theta(a|s)$ that selects actions  
   - Critic network: value function $V_\phi(s)$ that evaluates states

2. **On-Policy Learning**:  
   - Data is collected using the current policy only.

3. **Clipped Surrogate Objective**:  
   - Avoids large policy updates that destabilize learning.

4. **Multiple Epochs per Batch**:  
   - Improves sample efficiency by reusing collected data.

## PPO Objective Function

### Clipped Surrogate Objective

Given:

- $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$
- $\hat{A}_t$: estimated advantage

The clipped objective is:

$$
L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min \left( r_t(\theta) \hat{A}_t,\ \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\hat{A}_t \right) \right]
$$

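Clipping removes the incentive to push $r_t(\theta)$ outside $[1 - \epsilon, 1 + \epsilon]$, which keeps the new policy close to the old one within a single update. Taking the minimum makes the clipped objective a pessimistic lower bound on the unclipped one.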
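A minimal sketch of this objective in PyTorch may make the clipping concrete. The function name `clipped_surrogate` and the toy probabilities are assumptions made for illustration; the ratio is computed from log-probabilities, which is the usual numerically safer route.

```python
import torch

def clipped_surrogate(log_probs_new, log_probs_old, advantages, epsilon=0.2):
    """Clipped surrogate objective (to be maximized), averaged over the batch."""
    # r_t(theta) = pi_theta / pi_theta_old, computed in log space
    ratio = torch.exp(log_probs_new - log_probs_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Element-wise minimum, then the empirical expectation over timesteps
    return torch.min(unclipped, clipped).mean()

# Toy batch: ratios are 1.5, 0.5 and 1.0
log_probs_old = torch.log(torch.tensor([0.5, 0.5, 0.5]))
log_probs_new = torch.log(torch.tensor([0.75, 0.25, 0.5]))
advantages = torch.tensor([1.0, -1.0, 2.0])

objective = clipped_surrogate(log_probs_new, log_probs_old, advantages)
print(objective)  # approximately 0.8
```

In the first sample the ratio 1.5 is clipped to 1.2 because the advantage is positive; in the second the ratio 0.5 is clipped to 0.8 and the minimum keeps the more pessimistic value $-0.8$.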

### Value Function Loss

The critic is trained to minimize the squared error:

$$
L^{VF}(\phi) = \mathbb{E}_t \left[ \left( V_\phi(s_t) - V_t^{\text{target}} \right)^2 \right]
$$

### Entropy Bonus

An entropy bonus encourages exploration by rewarding policies whose action distribution stays spread out:

$$
S[\pi_\theta](s_t) = \mathbb{E}_{a \sim \pi_\theta} [-\log \pi_\theta(a|s_t)]
$$
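For a Gaussian policy the entropy is available in closed form, so no sampling is needed. The sketch below assumes a hypothetical two-dimensional action space with a fixed `log_std` vector and compares `Normal.entropy()` with the closed-form expression.

```python
import math
import torch
from torch.distributions import Normal

log_std = torch.tensor([-0.5, 0.0])  # hypothetical 2-D action space
dist = Normal(torch.zeros(2), torch.exp(log_std))

# Entropies of independent action dimensions add up, so sum over the action axis
entropy = dist.entropy().sum(-1)

# Closed form for a diagonal Gaussian: sum_i (0.5 * log(2 * pi * e) + log_std_i)
closed_form = (0.5 * math.log(2 * math.pi * math.e) + log_std).sum()

print(entropy, closed_form)
```

Note that the entropy depends only on the standard deviation, not on the mean. With a state-independent `log_std`, the bonus therefore acts directly on that parameter.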

### Combined PPO Loss

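The full objective combines the three components, with positive coefficients $c_1$ and $c_2$ weighting the value loss and the entropy bonus:

$$
L^{PPO}(\theta, \phi) = L^{CLIP}(\theta) - c_1 L^{VF}(\phi) + c_2\, \mathbb{E}_t \left[ S[\pi_\theta](s_t) \right]
$$

This objective is maximized. Since optimizers minimize, implementations perform gradient descent on $-L^{PPO}$, so the value loss enters with a positive sign and the clipped surrogate and the entropy bonus with a negative sign.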
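A minimal sketch of the sign handling when this expression is turned into a loss for a gradient-descent optimizer. The function name `ppo_loss` and the default coefficients `c1=0.5` and `c2=0.01` are assumptions chosen for illustration, not values prescribed by this notebook.

```python
def ppo_loss(clip_objective, value_loss, entropy, c1=0.5, c2=0.01):
    """Negate the combined objective so that minimizing the loss maximizes the objective."""
    objective = clip_objective - c1 * value_loss + c2 * entropy
    return -objective

# Toy scalar values standing in for batch averages
loss = ppo_loss(clip_objective=0.8, value_loss=2.0, entropy=1.4)
print(loss)  # approximately 0.186
```

The same function works on PyTorch scalars. When the actor and critic use separate optimizers, as in the training loop later in this notebook, the policy terms and the value term can instead be minimized separately.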

## PPO Algorithm Steps

1. Initialize policy and value networks and hyperparameters
2. Collect trajectories using the current policy
3. Estimate advantages (typically using GAE)
4. For several epochs:
   - Divide data into mini-batches
   - Compute the clipped surrogate objective, value loss, and entropy bonus
   - Update the network parameters by gradient descent on the resulting loss
5. Repeat

## Generalized Advantage Estimation (GAE)

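GAE trades bias against variance in the advantage estimate through the parameter $\lambda \in [0, 1]$: $\lambda = 0$ gives the one-step TD residual (low variance, more bias), while $\lambda = 1$ gives the discounted sum of all future residuals (less bias, higher variance):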

$$
\hat{A}^{GAE}_t = \sum_{l=0}^{\infty} (\gamma \lambda)^l \delta_{t+l}, \quad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
$$
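As a worked example, the sum can be evaluated directly on a short trajectory, truncating it at the final step. The helper name `gae_by_direct_sum` and the toy numbers are assumptions for illustration; this plain-Python form is meant for checking values by hand, not for efficiency.

```python
def gae_by_direct_sum(rewards, values, next_values, gamma, lam):
    """Evaluate the GAE sum, truncated at the end of a finite trajectory."""
    # One-step TD residuals: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    deltas = [r + gamma * nv - v for r, v, nv in zip(rewards, values, next_values)]
    horizon = len(deltas)
    return [
        sum((gamma * lam) ** l * deltas[t + l] for l in range(horizon - t))
        for t in range(horizon)
    ]

advantages = gae_by_direct_sum(
    rewards=[1.0, 1.0, 1.0],
    values=[0.5, 0.5, 0.5],
    next_values=[0.5, 0.5, 0.0],  # the state after the last step is terminal
    gamma=0.9,
    lam=0.5,
)
print(advantages)  # approximately [1.47875, 1.175, 0.5]
```

Here the residuals are $0.95, 0.95, 0.5$ and $\gamma\lambda = 0.45$, so for example $\hat{A}_0 = 0.95 + 0.45 \cdot 0.95 + 0.45^2 \cdot 0.5 = 1.47875$.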


## Why PPO?

- Simpler than TRPO to implement
- More stable than vanilla policy gradients
- Supports multiple training epochs per batch
- Scales well to large models and complex tasks



In [None]:

# Import necessary libraries
import numpy as np
import random
# Import PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal
# Set up device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set random seeds for reproducibility
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(seed)

In [None]:
import gymnasium as gym

env = gym.make('Pendulum-v1')
obs, info = env.reset(seed=seed)  # seed the environment too, for reproducible rollouts


### Task: Implement a Policy Network for PPO (Continuous Action Space)

In this task, you will implement a multi-layer perceptron (MLP) policy network for a PPO agent operating in a continuous action space.

The goal is for the network to output the parameters of a Normal (Gaussian) distribution from which actions can be sampled.

---

### Requirements:

- Subclass `nn.Module`
- Use two hidden layers with ReLU activations
- The final layer should output the **mean** of the action distribution
- Define a learnable parameter `log_std` to represent the log standard deviation
- In the `forward()` method, return a `torch.distributions.Normal(mean, std)` object

---

### Network Structure:

- Input: `n_observations` (number of input features)
- Hidden Layer 1: 128 units + ReLU
- Hidden Layer 2: 128 units + ReLU
- Output Layer: `n_actions` units (mean of the Gaussian)
- Standard deviation: computed using `torch.exp(log_std)`, where `log_std` is a learnable `nn.Parameter`

---

### Implementation Notes:

- Use `torch.nn.Parameter` for `log_std` so that it can be learned during training
- In the `forward()` method, make sure the input is a float tensor on the correct device
- If the input is a 1D tensor (single observation), add a batch dimension with `unsqueeze(0)`
- Return a `Normal(mean, std)` distribution from `torch.distributions`

---

This network will allow you to sample continuous actions for a PPO agent and compute log-probabilities needed for training. Make sure to test the output distribution to verify it behaves as expected.
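One possible way to test the output distribution is a small shape check like the sketch below. `TinyGaussianPolicy` is a hypothetical stand-in with a single linear layer, used only so that the check is runnable; it is not the two-hidden-layer network the task asks for. `check_policy` is likewise an assumed helper name and relies on the policy exposing a `log_std` attribute.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class TinyGaussianPolicy(nn.Module):
    """Hypothetical stand-in: one linear layer, NOT the architecture required by the task."""
    def __init__(self, n_observations: int, n_actions: int):
        super().__init__()
        self.mean_layer = nn.Linear(n_observations, n_actions)
        self.log_std = nn.Parameter(torch.zeros(n_actions))

    def forward(self, x):
        x = torch.as_tensor(x, dtype=torch.float32)
        if x.dim() == 1:
            x = x.unsqueeze(0)  # add a batch dimension for a single observation
        return Normal(self.mean_layer(x), torch.exp(self.log_std))

def check_policy(policy, n_observations, n_actions):
    """Check output type and shapes for a single (unbatched) observation."""
    dist = policy(torch.zeros(n_observations))
    assert isinstance(dist, Normal)
    action = dist.sample()
    assert action.shape == (1, n_actions)
    # Sum per-dimension log-probabilities to get one value per sample
    log_prob = dist.log_prob(action).sum(-1)
    assert log_prob.shape == (1,)
    assert isinstance(policy.log_std, nn.Parameter)
    return action, log_prob

action, log_prob = check_policy(TinyGaussianPolicy(3, 1), n_observations=3, n_actions=1)
```

The same `check_policy` call can be pointed at your own `PolicyNetwork` once it is implemented; if the network lives on a GPU, its `forward()` needs to move the input to that device first.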


In [None]:
class PolicyNetwork(nn.Module):
    """ MLP Actor network for PPO with continuous action space """
    def __init__(self, n_observations: int, n_actions: int):
        super(PolicyNetwork, self).__init__()
        ###################CODE HERE###################
        
        
        
        
        
        
        ################################################

    def forward(self, x: torch.Tensor):
        """
        Forward pass, returns a Normal (Gaussian) distribution over actions.
        """
        ###################CODE HERE###################
        
        
        
        
        
        
        ################################################

### Task: Implement a Value Network for PPO (Critic)

In this task, you will build a multi-layer perceptron (MLP) value network that serves as the **critic** in a PPO setup. The value network estimates the **state value** for a given observation, which is used in advantage estimation and value loss computation.

---

### Requirements:

- Subclass `nn.Module`
- Use two hidden layers with ReLU activations
- The final layer should output a single scalar value per input state
- In the `forward()` method, return the estimated value as a tensor

---

### Network Structure:

- Input: `n_observations` (state dimension)
- Hidden Layer 1: 128 units + ReLU
- Hidden Layer 2: 128 units + ReLU
- Output Layer: 1 unit (scalar value)

---

### Implementation Notes:

- Convert input to `torch.float32` if needed
- If input is 1D (a single state), add a batch dimension with `unsqueeze(0)`
- Use `F.relu()` as the activation function after each hidden layer
- The final layer should not apply any activation

---

This network is used to approximate the expected return (value) of a given state. It will be trained by minimizing the squared difference between predicted values and target returns.
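A small sketch of the squared-error computation, with one shape detail that is easy to miss: a critic output of shape `[batch, 1]` and targets of shape `[batch]` broadcast to a `[batch, batch]` matrix when subtracted. The tensors below are toy values assumed for illustration.

```python
import torch
import torch.nn.functional as F

predicted = torch.tensor([[1.0], [2.0], [3.0]])  # critic output, shape [batch, 1]
targets = torch.tensor([1.0, 2.0, 4.0])          # target returns, shape [batch]

# Pitfall: subtracting mismatched shapes broadcasts to [batch, batch]
broadcast_shape = (predicted - targets).shape

# Squeeze the trailing dimension so both tensors have shape [batch]
value_loss = F.mse_loss(predicted.squeeze(-1), targets)
print(broadcast_shape, value_loss)  # torch.Size([3, 3]) and approximately 0.3333
```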

In [None]:
class ValueNetwork(nn.Module):
    """ MLP Critic network for PPO """
    def __init__(self, n_observations: int):
        super(ValueNetwork, self).__init__()
        ###################CODE HERE###################
        
        
        
        
        
        
        ################################################

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass, returns the estimated state value.
        """
        ###################CODE HERE###################
        
        
        
        
        
        
        ################################################



### Task: Implement Generalized Advantage Estimation (GAE)

In this task, you will implement the **Generalized Advantage Estimation (GAE)** function. GAE is used in PPO to compute a lower-variance estimate of the advantage function, which guides policy updates.

---

### Inputs:

- `rewards`: Tensor of rewards collected from the environment.
- `values`: Estimated state values from the value network at each timestep.
- `next_values`: Value predictions for the next states.
- `dones`: Tensor indicating episode terminations (1 if done, 0 otherwise).
- `gamma`: Discount factor (typically around 0.99).
- `lambda_gae`: GAE lambda parameter (typically around 0.95).
- `standardize` (optional): Whether to normalize the advantages.

---

### Key Concepts:

1. **TD Residual** (`delta`):  
   $$
   \delta_t = r_t + \gamma \cdot V_{t+1} \cdot (1 - \text{done}_t) - V_t
   $$

2. **Recursive Advantage Calculation**:  
   Starting from the end of the trajectory and moving backward:
   $$
   A_t = \delta_t + \gamma \lambda \cdot A_{t+1} \cdot (1 - \text{done}_t)
   $$

3. **Standardization** (optional):  
   Normalize advantages to have zero mean and unit variance to improve training stability.
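The effect of the `done` mask and of standardization can be seen on a toy trajectory. The sketch below is one possible arrangement under assumed names (`gae_backward_sketch`, `standardize_sketch`) and toy numbers; treat it as an illustration of the three concepts rather than as the reference solution for `compute_gae`.

```python
import torch

def gae_backward_sketch(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Backward GAE recursion; the done mask cuts bootstrapping at episode ends."""
    deltas = rewards + gamma * next_values * (1.0 - dones) - values
    advantages = torch.zeros_like(rewards)
    running = rewards.new_zeros(())  # advantage carried over from step t + 1
    for t in reversed(range(rewards.shape[0])):
        running = deltas[t] + gamma * lam * running * (1.0 - dones[t])
        advantages[t] = running
    return advantages

def standardize_sketch(x, eps=1e-8):
    """Shift to zero mean and scale to unit standard deviation."""
    return (x - x.mean()) / (x.std() + eps)

rewards = torch.ones(4)
values = torch.full((4,), 0.5)
next_values = torch.full((4,), 0.5)
dones = torch.tensor([0.0, 1.0, 0.0, 0.0])  # an episode ends at step 1

advantages = gae_backward_sketch(rewards, values, next_values, dones, gamma=0.9, lam=0.5)
print(advantages)  # approximately [1.175, 0.5, 1.3775, 0.95]
normalized = standardize_sketch(advantages)
```

At step 1 the episode ends, so neither $V_{t+1}$ nor $A_{t+1}$ from the following episode leaks into $A_1 = 0.5$.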


In [None]:
def compute_gae(rewards: torch.Tensor, 
                values: torch.Tensor, 
                next_values: torch.Tensor, 
                dones: torch.Tensor, 
                gamma: float, 
                lambda_gae: float, 
                standardize: bool = True) -> torch.Tensor:
    """
    Computes Generalized Advantage Estimation (GAE).
    """
    ###################CODE HERE###################
        
        
        
        
        
        
    ################################################
    return advantages

In [None]:
def collect_data(env, policy_net, max_steps):
    """Roll out policy_net for max_steps steps, resetting the environment when an episode ends."""
    observations, actions, rewards, log_probs, dones = [], [], [], [], []

    obs, _ = env.reset()
    for _ in range(max_steps):
        obs_tensor = torch.tensor(obs.flatten(), dtype=torch.float32, device=device)

        # No gradients are needed while sampling; log-probs are recomputed during the update
        with torch.no_grad():
            dist = policy_net(obs_tensor)
            action = dist.sample()
            log_prob = dist.log_prob(action).sum(-1)
        # Drop the batch dimension added by the policy: action -> [n_actions], log_prob -> scalar
        action = action.reshape(-1)
        log_prob = log_prob.reshape(())

        next_obs, reward, terminated, truncated, _ = env.step(action.cpu().numpy())
        # Time-limit truncation is treated like termination here (a simplification)
        done = terminated or truncated

        observations.append(obs_tensor)  # shape [n_observations]
        actions.append(action)
        rewards.append(torch.tensor(reward, dtype=torch.float32, device=device))
        log_probs.append(log_prob)
        dones.append(torch.tensor(done, dtype=torch.float32, device=device))

        obs = next_obs
        if done:
            obs, _ = env.reset()

    # Stack into tensors of shape [max_steps, ...]
    observations = torch.stack(observations)
    actions = torch.stack(actions)
    rewards = torch.stack(rewards)
    log_probs = torch.stack(log_probs)
    dones = torch.stack(dones)

    return observations, actions, log_probs, rewards, dones

### Task: Implement the Training Loop

Implement the main training loop for the Proximal Policy Optimization (PPO) algorithm. This loop should train a policy network and a value network using data collected from interaction with an environment.

---
### What the Function Should Do

1. **Collect Experience**  
   Interact with the environment for a fixed number of steps. Record:
   - Observations
   - Actions
   - Rewards
   - Log-probabilities of actions (under the current policy)
   - Done flags

2. **Estimate Values and Compute Advantages**  
   Use the value network to estimate state values. Then, compute advantages using Generalized Advantage Estimation (GAE). Optionally standardize the advantages for numerical stability.

3. **Compute Returns**  
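   Add the unstandardized advantages to the estimated values to get the target returns for value function learning. If the advantages are standardized, do that only after computing the returns (or call `compute_gae` with `standardize=False` for this step), because standardized advantages no longer have the scale of the rewards.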

4. **Optimize Policy and Value Networks**  
   For a number of epochs:
   - Shuffle and divide the data into mini-batches.
   - For each batch:
     - Compute the policy loss using the clipped surrogate PPO objective.
     - Compute the value loss using mean squared error between predicted and target returns.
     - Backpropagate and update both networks using their respective optimizers.

5. **Log Progress**  
   After each iteration, print or store:
   - Total reward collected
   - Average policy loss
   - Average value loss
   - Mean and standard deviation of advantages

---

### Notes

- Detach tensors where appropriate to avoid reusing computation graphs.
- Make sure log-probabilities from the old policy are detached before being used in the surrogate loss.
- The value targets (`returns`) should not have gradients.

This loop should repeat for a specified number of iterations to progressively improve the policy.
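The notes on detaching can be illustrated with a few toy tensors. In this sketch `values` stands in for the output of the value network and `raw_advantages` for unstandardized GAE output; both names and numbers are assumptions for illustration.

```python
import torch

# Stand-in for value_net(observations): a tensor that tracks gradients
values = torch.tensor([0.5, 0.4, 0.3], requires_grad=True)
raw_advantages = torch.tensor([1.0, -0.5, 0.25])  # unstandardized advantages

# Value targets: built from the raw advantages and cut off from the graph
returns = (raw_advantages + values).detach()

# Standardized copy, used only in the policy loss
policy_advantages = (raw_advantages - raw_advantages.mean()) / (raw_advantages.std() + 1e-8)

print(returns.requires_grad)  # False
```

Because `returns` is detached, the value loss only pushes the predictions toward the targets and never moves the targets themselves.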

In [None]:
def ppo_training_loop(env, policy_net, value_net, optimizer_policy, optimizer_value,
                      iterations, max_steps_per_iter, gamma=0.99, lambda_gae=0.95,
                      epsilon_clip=0.2, epochs=10, batch_size=64):

    for iteration in range(iterations):
        # --- Collect data ---
        observations, actions, log_probs_old, rewards, dones = collect_data(env, policy_net, max_steps_per_iter)
        ###################CODE HERE###################
        
        
        
        
        
        
        ################################################
        avg_policy_loss = 0.0
        avg_value_loss = 0.0
        num_batches = 0

        for _ in range(epochs):
            indices = torch.randperm(dataset_size)

            for start in range(0, dataset_size, batch_size):
                end = start + batch_size
                batch_indices = indices[start:end]
                ###################CODE HERE###################
        
        
        
        
        
        
                
                
                
                
                
                
                ################################################

                avg_policy_loss += policy_loss.item()
                avg_value_loss += value_loss.item()
                num_batches += 1

        avg_policy_loss /= num_batches
        avg_value_loss /= num_batches
        total_reward = rewards.sum().item()

        print(f"Iteration {iteration + 1}/{iterations} | "
              f"Total Reward: {total_reward:.2f} | "
              f"Avg Policy Loss: {avg_policy_loss:.4f} | "
              f"Avg Value Loss: {avg_value_loss:.4f} | "
              f"Advantage Mean: {advantages.mean().item():.4f} | Std: {advantages.std().item():.4f}")

In [None]:
from torch.optim import Adam
n_observations = env.observation_space.shape[0]
n_actions = env.action_space.shape[0]

policy_net = PolicyNetwork(n_observations, n_actions).to(device)
value_net = ValueNetwork(n_observations).to(device)

optimizer_policy = Adam(policy_net.parameters(), lr=3e-4)
optimizer_value = Adam(value_net.parameters(), lr=1e-3)

In [None]:
ppo_training_loop(
    env=env,
    policy_net=policy_net,
    value_net=value_net,
    optimizer_policy=optimizer_policy,
    optimizer_value=optimizer_value,
    iterations=50,              
    max_steps_per_iter=2048,    
    gamma=0.99,
    lambda_gae=0.95,
    epsilon_clip=0.2,
    epochs=10,
    batch_size=64
)