<div style="font-size:22pt; line-height:25pt; font-weight:bold; text-align:center;">Class 5: Proximal Policy Optimization (PPO) and Vectorized Environments</div>

1. <a href="#reinforce">The REINFORCE Algorithm</a>
2. <a href="#trpo">Trust Region Policy Optimization (TRPO)</a>
3. <a href="#ppo">Proximal Policy Optimization (PPO)</a>
4. <a href="#environments">Environments</a>
5. <a href="#your_turn">Practice: your turn!</a>

# <div id="reinforce"></div> The REINFORCE Algorithm

### Reminder on the PG Theorem

In the previous notebook, we've seen methods to directly optimize a given policy $\pi$ instead of deducing it through the optimal Q-value function $Q^*$.
For a policy $\pi$, we wrote the policy optimization's objective as
$$J(\pi) = \mathbb{E}_{s \sim p_0} \left[ V^{\pi} (s) \right] = \mathbb{E}_{(s_i,a_i)_{i \in [0,\infty]}} \left[ \sum_{t=0}^\infty \gamma^t r(s_t,a_t)  | \pi \right].$$

We then introduced the PG Theorem that allows us to write the gradient of this objective:
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \nabla_\theta \log\pi_\theta(a|s)\right]$$
with $R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t,a_t)$.

Intuitively, we said that the $\nabla_\theta \log\pi(a | s)$ term "selects" the part of the network that is responsible for taking the action $a$ in the state $s$, and the sum of rewards $R(\tau)$ indicates whether to encourage or discourage this action $a$.
In other words, if a trajectory in the environment results in a good return, we encourage all the actions that were taken during the corresponding episode, i.e. we increase the probability that they're selected.

This works, but it ignores that an action can only influence the future: if we received a high reward *before* we took an action $a_t$ in our trajectory, this alone should not impact the probability of taking $a_t$.
To fix this, instead of the total return $R(\tau)$, we can use the **reward-to-go** $\hat{R}_t(\tau) = \sum_{k=t}^\infty \gamma^{k - t} r(s_k, a_k)$, which gives us:
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^\infty \hat{R}_t(\tau) \nabla_\theta \log\pi_\theta(a_t|s_t)\right]$$
Using the reward-to-go decreases the noise on the gradient estimation, and thus leads empirically to better performance.
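
As a minimal sketch, the reward-to-go of a finite trajectory can be computed with a single backward pass, using the recursion $\hat{R}_t = r_t + \gamma \hat{R}_{t+1}$. The function name `reward_to_go` and the example rewards are illustrative assumptions, not part of any library.

```python
def reward_to_go(rewards, gamma=0.99):
    """Return [R_hat_0, ..., R_hat_{T-1}] for one finite trajectory of rewards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that R_hat_t = r_t + gamma * R_hat_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Three steps with reward 1 and gamma = 0.5.
print(reward_to_go([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Note that the first entry equals the full discounted return $R(\tau)$, while later entries ignore the rewards collected before step $t$.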

## The REINFORCE algorithm

Let's rewind and try to apply this theorem directly in its simplest form, the **REINFORCE** algorithm:
1. Initialize the policy parameters $\theta$ randomly.
2. Collect a batch of trajectories $D = \{\tau_1, \tau_2, \ldots, \tau_N\}$ by running the policy in the environment.
3. Compute an estimate of the policy gradient, $\hat{d}$:
    * The policy gradient is given by:
        $$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \hat{R}_t(\tau) \nabla_\theta \log \pi_\theta(a_t|s_t) \right]$$
    * We estimate it using Monte Carlo sampling:
        $$\hat{d} = \frac{1}{N} \sum_{i=1}^N \sum_{t=0}^T \hat{R}_t(\tau_i) \nabla_\theta \log \pi_\theta(a_{i,t}|s_{i,t})$$
4. Update the policy using gradient ascent:
    $$\theta \leftarrow \theta + \alpha \hat{d}$$
5. Repeat steps 2-4 until convergence.
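
The steps above can be sketched on a deliberately tiny problem. The example below assumes a multi-armed bandit (every "trajectory" is a single action, so the reward-to-go is just the reward) and a softmax policy whose parameters are the logits; `reinforce_bandit`, the reward values and all hyperparameters are hypothetical choices made for illustration. For simplicity, the logits start at zero (a uniform policy) rather than at random values.

```python
import math
import random


def softmax(logits):
    """Turn a list of logits into a list of probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, iterations=200, batch_size=16, lr=0.5, seed=0):
    """Run REINFORCE on a bandit with deterministic rewards; return the final policy."""
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)  # step 1: initial parameters (uniform policy)
    for _ in range(iterations):
        probs = softmax(theta)
        grad = [0.0] * len(theta)
        for _ in range(batch_size):  # step 2: collect N one-step trajectories
            a = rng.choices(range(len(theta)), weights=probs)[0]
            ret = arm_rewards[a]
            # step 3: for a softmax policy, d log pi(a) / d theta_j = 1[j == a] - pi(j)
            for j in range(len(theta)):
                indicator = 1.0 if j == a else 0.0
                grad[j] += ret * (indicator - probs[j]) / batch_size
        # step 4: gradient ascent on the Monte Carlo estimate
        theta = [th + lr * g for th, g in zip(theta, grad)]
    return softmax(theta)


# Arm 1 pays 1, arm 0 pays nothing: the policy should end up preferring arm 1.
print(reinforce_bandit([0.0, 1.0]))
```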

This algorithm, introduced [in 1992 by Ronald Williams](https://link.springer.com/article/10.1007/BF00992696), is simple and easy to implement, but it has two main flaws:
1. The policy gradient estimation has high variance as we use Monte Carlo sampling
2. The algorithm is on-policy, which means that we can only use a trajectory once to improve the policy and we need to collect new trajectories after each update

## Reducing the gradient estimation variance

Let's rewrite the policy gradient.
We can show that by subtracting a baseline $b(s)$, we can significantly shrink the variance of our updates without changing the gradient.
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{\substack{s\sim\rho^\pi \\ a\sim \pi}} \left[ \nabla_\theta \log\pi(a|s) ( Q^\pi(s, a) - b(s) ) \right]$$
For this, we need to show that the baseline term does not modify the expectation computation:
$$
\begin{align}
\mathbb{E}_{a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) b(s) \right] &= b(s) \mathbb{E}_{a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a|s) \right] && b \text{ does not depend on the action} \\
&= b(s) \sum_{a} \pi_\theta(a|s) \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)}  && \text{log-derivative trick} \\
&= b(s) \nabla_\theta \sum_{a} \pi_\theta(a|s) && \text{gradient linearity} \\
&= b(s) \nabla_\theta (1) && \text{probabilities sum to 1} \\
&= 0
\end{align}
$$

To see why adding a baseline can reduce the variance, imagine an environment in which rewards are always high (e.g. between 990 and 1010).
In this case, $Q^\pi(s, a)$ will be high and the gradient will be huge: a tiny difference in performance will lead to massive swings in the weight updates.
By setting the baseline to 1000, we only modify the weights of the network depending on whether the outcome was better or worse than expected.
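
To make this concrete, here is a small simulation under stated assumptions: a one-step problem with two actions, a fixed policy $\pi(0) = \pi(1) = 0.5$, and hypothetical rewards that are always close to 1000 (action 1 being slightly better). It samples the gradient estimate with respect to the logit of action 1, with and without a baseline of 1000. The function `gradient_samples` is illustrative only.

```python
import random
import statistics


def gradient_samples(baseline, n_samples=20000, seed=0):
    """Sample score * (reward - baseline) for the logit of action 1."""
    rng = random.Random(seed)
    pi_1 = 0.5  # probability of action 1 under the (fixed) softmax policy
    samples = []
    for _ in range(n_samples):
        a = 1 if rng.random() < pi_1 else 0
        # Hypothetical rewards: always near 1000, action 1 is slightly better.
        reward = rng.uniform(1000, 1010) if a == 1 else rng.uniform(990, 1000)
        score = (1.0 if a == 1 else 0.0) - pi_1  # d log pi(a) / d theta_1
        samples.append(score * (reward - baseline))
    return samples


for b in (0.0, 1000.0):
    g = gradient_samples(b)
    print(f"baseline={b:6.0f}  mean={statistics.mean(g):9.3f}  "
          f"variance={statistics.variance(g):12.3f}")
```

Both settings estimate the same expected gradient (2.5 in this toy setup), but without the baseline each sample is roughly $\pm 500$, so the sample mean is far noisier.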

In practice, we usually use the Value function $V^\pi(s)$ as baseline.
Let's define the **Advantage function** $$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$
Intuitively, the advantage $A^\pi(s, a)$ represents how much better the action $a$ is compared to the average.
We can then write:
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{\substack{s\sim\rho^\pi \\ a\sim \pi}} \left[ \nabla_\theta \log\pi(a|s) A^\pi(s, a) \right]$$

## Using efficiently collected trajectories

As REINFORCE is **on-policy**, we can only use each trajectory once for computing the policy gradient.
This is highly inefficient and greatly limits the capacity to learn in complex environments such as robotics systems.

To mitigate this, a first approach would be to increase our algorithm's learning rate to take larger steps at each update.
However, a greater learning rate would make the optimization process unstable.
In Supervised Learning, taking gradient steps too large might result in the loss jumping up temporarily, but the data distribution remains fixed.
In RL, the trajectory distribution is given by the policy $\pi$ so there is a supplementary risk: if the update step is too large, $\pi_{\theta_{new}}$ might be a terrible policy (e.g. the robot falls over immediately).
Because the policy leads to bad states, the next trajectory will also be uninformative (i.e. only containing states where the robot is lying on the ground).
The agent then learns from this data, reinforcing failure and often never recovers: this is known as **performance collapse**.

# <div id="trpo"></div> Trust Region Policy Optimization (TRPO)

To solve this problem, Schulman et al. introduced [**Trust Region Policy Optimization (TRPO)** in 2015](https://arxiv.org/abs/1502.05477).
The problem is that the gradient steps operate in the **parameter space** (the weights of the neural network).
But the relationship between the weights $\theta$ and the resulting policy $\pi_\theta$ is highly non-linear.
A very small change in the weights $\theta$ can result in a massive change in the policy distribution $\pi_\theta$ (and thus the agent's behavior).

Instead of limiting how much we change the *parameters* (which is hard to tune), TRPO suggests we limit how much we change the **policy distribution**.
We define a **Trust Region** around the current policy where we believe the update is safe.
But how do we define how much one policy's behavior differs from another?

### Kullback-Leibler divergence

Recall that the output of a policy is a probability distribution over actions.
For discrete action spaces, this is a categorical distribution, and for continuous action spaces, this is typically a multivariate Gaussian distribution.

This means that a natural way to measure the difference between two policies is to measure the difference between their output distributions.
This is where the **Kullback-Leibler divergence** comes in.
The KL divergence is a measure of how different two probability distributions are.

For two probability distributions $P$ and $Q$ over some set $X$, it is defined as:
$$D_{KL}(P||Q) = \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$

We won't go into too much depth here since the exact details of KL divergence are not relevant to our understanding, but some of the more important points to note are:
1. The KL divergence is a measure of relative entropy: it measures how "surprised" we are if we use distribution $Q$ to model distribution $P$
2. The KL divergence is not symmetric, i.e. in general $D_{KL}(P||Q) \neq D_{KL}(Q||P)$
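
As a quick illustration of the definition and of the asymmetry, the sketch below computes the KL divergence between two categorical distributions given as lists of probabilities. The helper `kl_divergence` and the two example policies are assumptions made for this example.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) for two categorical distributions given as probability lists."""
    # Terms with P(x) = 0 contribute 0 by convention.
    # Q(x) must be strictly positive wherever P(x) > 0.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


old_policy = [0.5, 0.5]  # action probabilities in some state
new_policy = [0.9, 0.1]

print(kl_divergence(old_policy, new_policy))  # D_KL(old || new)
print(kl_divergence(new_policy, old_policy))  # D_KL(new || old), a different value
print(kl_divergence(old_policy, old_policy))  # identical distributions give 0
```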

### TRPO Objective

TRPO formulates the update as a constrained optimization problem:
$$\max_\theta \mathbb{E}_t \left[ \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)} A^{\pi_{old}}(s_t, a_t) \right]$$
$$\text{subject to } \mathbb{E}_t \left[ D_{KL}(\pi_{\theta_{old}}(\cdot|s_t) || \pi_\theta(\cdot|s_t)) \right] \le \delta$$
where $\delta$ is a hard threshold (e.g., 0.01) defining the maximum allowed change in the policy.

The probability ratio $r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}$ (not to be confused with the reward function) is called the **importance sampling ratio**: it is used to correct for the fact that we are using data collected under an old policy to optimize the new policy.

This objective $\max_\theta \mathbb{E}_t \left[ r_t(\theta) A^{\pi_{old}}(s_t, a_t) \right]$ is called the **Surrogate Objective** as it uses data collected by the old policy to *guess* how the new policy will perform.
It essentially says: "If the old policy found that action $a$ was good (high advantage), and the new policy is now $10\%$ more likely to take action $a$, then the new policy's performance will probably be roughly $10\%$ of that advantage better."
Intuitively, this means that we want to increase the probability of actions that have high advantage and decrease the probability of actions with low (negative) advantage.

However, if the new policy differs too much from the old policy, the importance sampling ratio becomes high-variance (we now take actions almost never seen before) and the states visited under the policy change, driving the surrogate estimation further from the true policy performance.
That's why TRPO adds the KL constraint to force the update to not drastically modify the policy.

### TRPO in practice

What we've described above is the *theoretical TRPO update*, but in practice we do not use this update rule directly, as the constrained problem is too expensive to solve exactly.
Instead, we use a first-order approximation of the objective and a second-order approximation of the KL constraint (which involves the Hessian matrix of the KL divergence), and then solve the resulting problem using a [conjugate gradient](https://en.wikipedia.org/wiki/Conjugate_gradient_method) algorithm.

We won't delve too much into the details of this, but if you want to know more, you can read the [OpenAI Spinning Up documentation](https://spinningup.openai.com/en/latest/algorithms/trpo.html)

While theoretically grounded, TRPO is difficult to use in practice:
1.  **Computationally Expensive:** Enforcing the KL constraint strictly requires calculating second-order derivatives (the Hessian matrix), which is very slow for large neural networks (for a network with $n$ parameters, the Hessian is of size $n \times n$)
2.  **Complex Implementation:** It requires using the Conjugate Gradient algorithm instead of standard optimizers like Adam or SGD

# <div id="ppo"></div> Proximal Policy Optimization (PPO)

Our goal is thus to prevent the new policy obtained after an update from being too different from the old policy.
To achieve this, whereas TRPO adds a constraint to the optimization problem, we can take a different approach: changing the objective function so that there's no incentive to change the policy too much in one update.
That is the approach used in the [**Proximal Policy Optimization** paper, also written by Schulman et al. in 2017](https://arxiv.org/abs/1707.06347).

### PPO policy objective

The policy objective of PPO, called the **Clipped Surrogate Objective**, is:
$$\mathcal{L}^{CLIP}(\theta, \theta_{\text{old}}) = \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, a \sim \pi_{\theta_{\text{old}}}} \left[ \min \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)} A^{\pi_{\theta_{\text{old}}}}(s, a), \text{clip} \left( \frac{\pi_\theta(a|s)}{\pi_{\theta_{\text{old}}}(a|s)}, 1 - \epsilon, 1 + \epsilon \right) A^{\pi_{\theta_{\text{old}}}}(s, a) \right) \right] = \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, a \sim \pi_{\theta_{\text{old}}}} \left[ l(s, a) \right] $$
with $\epsilon$ a hyperparameter, usually set to a small value like $0.2$.

This raw expression can be scary, but let's break it down in small pieces to understand it better.
The first term of the min is the same as in TRPO: it is the surrogate objective $r_t(\theta) A^{\pi_{\theta_{old}}}(s, a)$.
The second term uses a clipped ratio, and the effect of the min can be broken down into two cases depending on the sign of the advantage:
- when $A^{\pi_{\theta_{old}}}(s, a) \geq 0$, we have
$$
\begin{align}
l(s, a) &= A^{\pi_{\theta_{old}}}(s, a) \min \left( r_t(\theta), \text{clip} (r_t(\theta), 1 - \epsilon, 1 + \epsilon) \right) \\
&= A^{\pi_{\theta_{old}}}(s, a) \min \left( r_t(\theta), 1 + \epsilon \right)
\end{align}
$$
In this case, as the advantage is positive, we want to increase the probability of taking the action $a$, but (with $\epsilon = 0.2$) by at most $20\%$ relative to the old policy, so as not to get too greedy.
If the importance sampling ratio becomes greater than $1 + \epsilon$, we then have $l(s, a) = (1 + \epsilon) A^{\pi_{\theta_{old}}}(s, a)$ which does not depend on $\theta$, so the gradient is null and the policy is not updated.
- when $A^{\pi_{\theta_{old}}}(s, a) \lt 0$, we have
$$
\begin{align}
l(s, a) &= A^{\pi_{\theta_{old}}}(s, a) \max \left( r_t(\theta), \text{clip} (r_t(\theta), 1 - \epsilon, 1 + \epsilon) \right) \\
&= A^{\pi_{\theta_{old}}}(s, a) \max \left( r_t(\theta), 1 - \epsilon \right)
\end{align}
$$
In this case, the advantage is negative, so we want to decrease the probability of taking the action $a$, again (with $\epsilon = 0.2$) by at most $20\%$ so as not to differ too much from the previous policy.
If the importance sampling ratio becomes lower than $1 - \epsilon$, we then have $l(s, a) = (1 - \epsilon) A^{\pi_{\theta_{old}}}(s, a)$ which also leads to a null gradient.
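
The two cases can be checked numerically with a per-sample version of $l(s, a)$. This is a minimal sketch: the function name `clipped_surrogate` is hypothetical, and the ratio is computed from log-probabilities on the assumption that those are what a policy network would typically provide.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantage, eps=0.2):
    """Per-sample clipped surrogate l(s, a) from log-probabilities and an advantage."""
    ratio = math.exp(new_logp - old_logp)  # r_t(theta)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


old_logp = math.log(0.4)
for new_prob in (0.2, 0.4, 0.6):  # ratios of 0.5, 1.0 and 1.5
    for adv in (1.0, -1.0):
        value = clipped_surrogate(math.log(new_prob), old_logp, adv)
        print(f"ratio={new_prob / 0.4:.2f}  advantage={adv:+.0f}  l={value:+.3f}")
```

With a positive advantage the value saturates once the ratio exceeds $1 + \epsilon$, and with a negative advantage it saturates once the ratio drops below $1 - \epsilon$. In the two other out-of-range cases the unclipped term is kept, so the objective can still pull the policy back.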

Let's visualize this behavior (directly from the PPO paper):
<center><img src="img/PPO_clip_objective.png"></center>

An important remark is that this clipped objective **does not guarantee** that the KL divergence between the old and new policies will stay small, and it is still possible to end up with a new policy which is too far from the old one.
In practice, the clipped objective is usually enough, but if needed it is still possible to control this using a smaller $\epsilon$ or a simple method like early stopping: if the mean KL divergence grows beyond a threshold, we stop taking gradient steps.
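
One possible way to implement such an early-stopping check is sketched below. It assumes we only have the log-probabilities of the sampled actions under the old and new policies, and it uses the sample-based estimator $(r - 1) - \log r$ of the KL divergence, which is the one found in the CleanRL implementation. The function names and the threshold `target_kl=0.015` are illustrative assumptions.

```python
import math


def approx_kl(old_logps, new_logps):
    """Estimate D_KL(old || new) from log-probs of actions sampled from the old policy."""
    total = 0.0
    for old_lp, new_lp in zip(old_logps, new_logps):
        log_ratio = new_lp - old_lp
        # (r - 1) - log(r) is non-negative and equals 0 only when r = 1.
        total += (math.exp(log_ratio) - 1.0) - log_ratio
    return total / len(old_logps)


def should_stop_early(old_logps, new_logps, target_kl=0.015):
    """Return True when the estimated KL exceeds the (hypothetical) threshold."""
    return approx_kl(old_logps, new_logps) > target_kl


old = [math.log(0.5), math.log(0.5)]
print(should_stop_early(old, old))                             # False: policy unchanged
print(should_stop_early(old, [math.log(0.9), math.log(0.1)]))  # True: large change
```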

Nevertheless, because the clipping prevents the policy from diverging too fast, the main benefit is that we can **perform multiple epochs of gradient descent on the same batch of data** and thus mitigate the fact that the algorithm is on-policy.

### Generalized Advantage Estimation (GAE)

In order to compute the clipped objective, we still need a way to compute the advantage.
As mentioned in the original paper, any method that approximates the advantage can be used here, but in practice PPO implementations usually use the **Generalized Advantage Estimation** equation.
This formula, established [in a 2015 paper also by Schulman et al.](https://arxiv.org/abs/1506.02438), defines a family of advantage estimators that can be easily computed as a discounted sum of Bellman TD-errors:
$$\hat{A}^{GAE(\gamma, \lambda)}(s_t, a_t) = \sum_{k=0}^\infty (\gamma \lambda)^k \delta_{t + k}$$
with $\delta_t$ the TD-error $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$ and $\lambda \in [0, 1]$ a hyperparameter that controls the bias-variance tradeoff, usually set to $0.95$.
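
For a finite rollout, the sum can be computed backwards with the recursion $\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}$. The sketch below assumes a single trajectory without episode boundaries inside it, and a bootstrap value `last_value` for the state that follows the last step (0 if that state is terminal). The name `compute_gae` and its signature are illustrative assumptions.

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """GAE advantages for one rollout.

    rewards[t] and values[t] = V(s_t) cover steps 0..T-1;
    last_value = V(s_T) is used to bootstrap after the final step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    next_advantage = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]       # TD-error
        next_advantage = delta + gamma * lam * next_advantage     # A_t
        advantages[t] = next_advantage
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.5, 0.5, 0.5]
print(compute_gae(rewards, values, last_value=0.0, gamma=0.5, lam=0.0))  # TD-errors only
print(compute_gae(rewards, values, last_value=0.0, gamma=0.5, lam=1.0))  # reward-to-go minus V
```

The two extremes show the tradeoff: $\lambda = 0$ relies entirely on the value estimates (low variance, more bias), while $\lambda = 1$ relies on the sampled rewards (less bias, more variance).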

To compute the advantage in practice, we thus need an approximation of the value function $V$ which can be represented as usual using a deep neural network.
To train this network, we can just learn through supervised learning on the return-to-go i.e. $L^{VF}(\theta) = \text{MSE} (V(s_t), \hat{R}_t)$.

However, as for REINFORCE, computing the return-to-go from the sampled trajectories has high variance, so we usually approximate it using the generalized advantage estimation:
$$\hat{R}_t \approx V(s_t) + \hat{A}^{GAE}(s_t, a_t)$$
$$L^{VF}(\theta) = \text{MSE} \left( V_\theta(s_t), \hat{R}_t \right)$$

### Total PPO loss

Finally, in order to ensure sufficient exploration, we add an entropy bonus to the total PPO loss, defined as in SAC: $H(\pi_\theta(\cdot|s)) = -\int \pi_\theta(a|s) \log \pi_\theta(a|s) \, da$.

This gives us the following final expression:
$$L^{PPO}(\theta) = -\mathbb{E} \left[ L^{CLIP}(\theta) - c_1 L^{VF}(\theta) + c_2 H(\pi_\theta(s_t)) \right]$$
with $c_1$ and $c_2$ coefficients to control the weight of each term.
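
To make the signs explicit, here is a minimal sketch that assembles the three terms for a mini-batch of already-computed per-sample quantities. The function `ppo_total_loss` and the default coefficients `c1=0.5` and `c2=0.01` are assumptions chosen for illustration. In a real implementation these would be tensors so that gradients can flow through them.

```python
def ppo_total_loss(clip_terms, values, returns, entropies, c1=0.5, c2=0.01):
    """Combine the PPO terms into one scalar to *minimize*.

    clip_terms: per-sample clipped surrogate values l(s, a)
    values:     per-sample value predictions V(s_t)
    returns:    per-sample value targets R_hat_t
    entropies:  per-sample policy entropies
    """
    n = len(clip_terms)
    l_clip = sum(clip_terms) / n                                   # to maximize
    l_vf = sum((v - r) ** 2 for v, r in zip(values, returns)) / n  # MSE, to minimize
    entropy = sum(entropies) / n                                   # to maximize
    return -(l_clip - c1 * l_vf + c2 * entropy)


loss = ppo_total_loss(
    clip_terms=[1.0, 3.0],
    values=[1.0, 2.0],
    returns=[2.0, 4.0],
    entropies=[1.0, 1.0],
)
print(loss)
```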

### PPO Pseudocode

The final pseudocode for PPO is the following (from [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/algorithms/ppo.html)):
<center><img src="https://spinningup.openai.com/en/latest/_images/math/e62a8971472597f4b014c2da064f636ffe365ba3.svg"></center>

Here, <img src="https://spinningup.openai.com/en/latest/_images/math/39f524858866b80e627840ba77a54360e3bac55e.svg">

#  <div id="environments"></div> Environments

Your objective in the rest of the notebook will be to implement the PPO algorithm on various simplified robotics environments.

### MuJoCo library

From [the Gymnasium documentation](https://gymnasium.farama.org/environments/mujoco/):
> MuJoCo stands for Multi-Joint dynamics with Contact. It is a physics engine for facilitating research and development in robotics, biomechanics, graphics and animation, and other areas where fast and accurate simulation is needed. There is physical contact between the robots and their environment - and MuJoCo attempts at getting realistic physics simulations for the possible physical contact dynamics by aiming for physical accuracy and computational efficiency.

Although we'll use it for simple robotics environments, this library is also used in research labs and universities to model real and complex hardware.

Let's look at a classic example: the `Ant-v5` environment.
<center><img src="https://gymnasium.farama.org/_images/ant.gif" width="30%"></center>

Despite its name, it looks more like a four-legged robotic torso.
The action space is continuous of dimension 8, representing the torque applied to each of the 8 hinge joints.
The state space has 105 dimensions corresponding to the positions, velocities and external forces applied on each of the robot body parts.
The agent is rewarded for moving forward as fast as possible, with a few penalties to prevent it from learning undesirable behaviors.

You can read the exact environment definition and more information on [the environment documentation](https://gymnasium.farama.org/environments/mujoco/ant/).

In [None]:
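import gymnasium as gym

# render_mode="human" opens a viewer window showing the simulation
env = gym.make("Ant-v5", render_mode="human")

_, _ = env.reset()
while True:
    # Sample a random action (8 joint torques) and step the simulation
    action = env.action_space.sample()
    _, _, terminated, truncated, _ = env.step(action)
    # Stop when the episode ends (terminated) or the time limit is reached (truncated)
    if terminated or truncated:
        break

env.close()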

### Vectorized environments

In Reinforcement Learning, the biggest bottleneck in terms of training speed is generally the interaction with the environment, as it usually needs to compute physics, collisions or game logic at every step.

As we've seen previously, for off-policy algorithms each collected sample can be stored in a Replay Buffer and reused continuously when computing the loss function, which improves sample efficiency and mitigates the need for very fast environments.
However in the case of on-policy algorithms, each sample can be used only a few times at most, making the environment interaction speed the limiting factor.

To fix this, two complementary approaches exist:
1. Accelerate the environment by running it on GPU (libraries like IsaacSim, Brax, Genesis...)
2. Run multiple environments in parallel

These two methods can be used simultaneously, and libraries that implement environments on GPU are almost always able to run multiple of them at the same time.
This second approach is usually implemented using **vectorized environments**.

A vectorized environment is simply a wrapper that runs multiple independent instances of the same environment simultaneously.
Instead of sending one action and getting one state, we send it a batch of actions and receive a batch of states.
This method presents several benefits:
- **data diversity:** in one update, our algorithm sees experiences from diverse starting positions or random seeds. This prevents the agent from "overfitting" to a specific lucky trajectory
- **less noisy estimations:** by averaging gradients over many parallel trajectories, the noise in the advantage estimation is naturally smoothed out
- **hardware efficiency:** as the actor and critic networks are usually run on GPU, batches of data are processed in parallel making the forward pass almost as efficient as when processing only one environment

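The Gymnasium library makes vectorization very simple through `gym.make_vec`, which returns either a `SyncVectorEnv` (environments stepped one after the other in the current process) or an `AsyncVectorEnv` (environments run in separate CPU processes), depending on its `vectorization_mode` argument.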

In [None]:
import gymnasium as gym

num_envs = 16
envs = gym.make_vec("Ant-v5", num_envs=num_envs)

# Every quantity now has a leading batch dimension of size num_envs
obs, _ = envs.reset()
print(f"Observation batch shape: {obs.shape}")

actions = envs.action_space.sample()
print(f"Action batch shape: {actions.shape}")

_, rewards, terminations, _, _ = envs.step(actions)
print(f"Rewards received: {rewards}")
print(f"Terminations received: {terminations}")

envs.close()

One important feature of Gymnasium vectorized environments is **auto-resetting**.
In a single environment, when an agent dies it must be manually reset by calling `env.reset()`.
In a vectorized environment, if an agent dies but the others are still alive, the wrapper automatically resets the dead agent and puts the new starting state into the next observation batch.
This allows the training loop to run indefinitely without ever stopping to check if an episode finished.

# <div id="your_turn"></div> Your turn!

Although PPO is one of the most widely used RL algorithms in the literature and in industry today, it is far from simple, and understanding all the implementation details is quite tricky.
Moreover, as often in RL the performance of any algorithm implementation is usually very dependent on many different tricks that are not always described in the original papers.

Your task now is to implement the PPO algorithm for a vectorized MuJoCo robotics environment such as Ant, Half-Cheetah or Humanoid.
The objective is for you to apply everything we saw in previous notebooks on how to implement RL algorithms, combined with the theory of PPO developed in this notebook.
The more you code on your own, the more you'll learn so try to implement the general RL training loop and PPO optimization steps with the minimum help possible!

Here are some great resources to help you in your task:
- [the original PPO paper](https://arxiv.org/abs/1707.06347)
- [OpenAI SpinningUp PPO documentation](https://spinningup.openai.com/en/latest/algorithms/ppo.html)
- [CleanRL PPO implementation](https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/ppo.py): a single-file clean PPO implementation which focuses on understanding
- [A great ICLR 2022 blogpost](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/) about PPO important implementation details

This last link in particular is a great resource that discusses every little choice and common error found in different public implementations of the algorithm.
You don't need to follow every one of their recommendations, but reading about them will help you understand PPO better.
It also comes with video tutorials if you prefer.

You will also find countless blog posts, YouTube videos, [academic papers](https://fse.studenttheses.ub.rug.nl/25709/1/mAI_2021_BickD.pdf), GitHub repositories and other resources all focused on explaining PPO, but be aware that they're not always of the best quality!

In [None]:
# Implement the PPO algorithm on a vectorized MuJoCo environment!