# **Proximal Policy Optimization**

Environment: Lunar Lander

## References

#### Papers
- [Proximal Policy Optimization Algorithms, Schulman et al. 2017](https://arxiv.org/abs/1707.06347)
- [Emergence of Locomotion Behaviours in Rich Environments, Heess et al. 2017](https://arxiv.org/abs/1707.02286)

#### Blogs
- [OpenAI Spinning Up - Proximal Policy Optimization](https://spinningup.openai.com/en/latest/algorithms/ppo.html)

#### Others
- [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/index.html)
- [OpenAI Gym](https://gym.openai.com/)

## Preparation

In [None]:
%%capture
!sudo apt update
!sudo apt install python-opengl xvfb -y
!pip install gym[box2d] pyvirtualdisplay pyglet tqdm

In [None]:
%%capture
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

%matplotlib inline
import matplotlib.pyplot as plt

from IPython import display

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
from tqdm import tqdm_notebook

In [None]:
%%capture
# Import gym and create a Lunar Lander environment
# Observation/State: Box(8,)
# Action: Discrete(4)
import gym
env = gym.make('LunarLander-v2')

## PPO Algorithm
The Proximal Policy Optimization algorithm, introduced in the [original paper](https://arxiv.org/abs/1707.06347).

### Pseudocode

1. Input: initial policy parameters $\theta_0$, initial value function parameters $\phi_0$

2. **For** $k = 0, 1, 2, ...$ **do**

3. > Collect set of trajectories $D_k = \{\tau_i \}$ by running policy $\pi_k = \pi(\theta_k)$ in the environment.

4. > Compute rewards-to-go $\hat{R}_t$.

5. > Compute advantage estimates, $\hat{A}_t$ (using any method of advantage estimation) based on the current value function $V_{\phi_k}$.

6. > Update the policy by maximizing the PPO-Clip objective:
<br>
$$\theta_{k+1} = \arg\max_{\theta}\ \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^T \min \left( \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)} A^{\pi_{\theta_k}}(s_t, a_t),\ g(\epsilon, A^{\pi_{\theta_k}}(s_t, a_t)) \right),$$
<br>
typically via stochastic gradient ascent with Adam, where $g(\epsilon, A) = (1 + \epsilon) A$ if $A \geq 0$ and $g(\epsilon, A) = (1 - \epsilon) A$ otherwise.

7. > Fit value function by regression on mean-squared error:
<br>
$$\phi_{k+1} = \arg\min_{\phi}\ \frac{1}{|D_k| T} \sum_{\tau \in D_k} \sum_{t=0}^T {\left( V_\phi(s_t) - \hat{R}_t \right)}^2,$$
<br>
typically via some gradient descent algorithm.

8. **end for**
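
Step 4 of the pseudocode can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's implementation: the function name `rewards_to_go`, the flat-batch layout with a `dones` flag per step, and the discount factor `gamma` are assumptions made for this example.

```python
def rewards_to_go(rewards, dones, gamma=0.99):
    """Discounted reward-to-go for each step of a flat batch of transitions.

    rewards: list of floats, one per step
    dones:   list of bools, True where the episode ended at that step
    gamma:   discount factor (assumed default; not fixed by the pseudocode)
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the return of the step after it.
    for t in reversed(range(len(rewards))):
        if dones[t]:
            running = 0.0  # do not leak reward across episode boundaries
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(rewards_to_go([1.0, 1.0, 1.0], [False, False, True], gamma=0.5))
# [1.75, 1.5, 1.0]
```

The resulting list is the regression target $\hat{R}_t$ used in step 7.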

###  Policy Gradient Network

> In this notebook, we implement two policy gradient networks: the one used as an example in hw15 of HY-Lee's course (PGNetLee) and the one in the [original paper](https://arxiv.org/abs/1707.06347) (PGNet).

In [None]:
class PGNetLee(nn.Module):

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 16)
        self.fc3 = nn.Linear(16, 4)

    def forward(self, state):
        hid = torch.tanh(self.fc1(state))
        hid = torch.tanh(self.fc2(hid))
        return F.softmax(self.fc3(hid), dim=-1)


#class PGNet(nn.Module):

#    def __init__(self):
#       super().__init__()
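
One possible way to fill in `PGNet` is sketched below. The original paper describes, for its continuous-control experiments, a fully connected MLP with two hidden layers of 64 units and tanh nonlinearities that outputs the mean of a Gaussian. Lunar Lander has four discrete actions, so the softmax head and the constructor arguments (`state_dim`, `hidden_dim`, `n_actions`) are assumptions of this sketch rather than something the paper specifies.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PGNet(nn.Module):
    """Two hidden layers of 64 tanh units, adapted to a discrete action space."""

    def __init__(self, state_dim=8, hidden_dim=64, n_actions=4):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, n_actions)

    def forward(self, state):
        hid = torch.tanh(self.fc1(state))
        hid = torch.tanh(self.fc2(hid))
        # Softmax head (assumption): action probabilities instead of a Gaussian mean.
        return F.softmax(self.fc3(hid), dim=-1)


probs = PGNet()(torch.zeros(1, 8))  # one dummy 8-dimensional state
```

The output has the same shape and meaning as that of `PGNetLee`, so the two networks could be swapped inside an agent.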

### Proximal Policy Optimization Agent

In the original paper, the authors compare several surrogate objectives: clipping $L^{CLIP}$, conservative policy iteration (no clipping or penalty) $L^{CPI}$, and KL penalty with a fixed or an adaptive coefficient, $L^{PENfix}$ and $L^{PENadp}$. Here we implement each objective in its own agent.

#### Conservative Policy Iteration

In [None]:
class PPOAgentCPI():

    def __init__(self, network):
        self.network = network
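
The objective this agent would optimize is $L^{CPI} = \hat{\mathbb{E}}_t \left[ r_t(\theta) \hat{A}_t \right]$ with $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_k}(a_t|s_t)$. A minimal sketch of that loss follows; the function name `cpi_loss` and its log-probability inputs are hypothetical choices for this example, not part of the agent class above.

```python
import torch


def cpi_loss(new_log_probs, old_log_probs, advantages):
    """Negative unclipped surrogate, so that minimizing it maximizes L^CPI."""
    # Probability ratio computed in log space for numerical stability.
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    return -(ratio * advantages.detach()).mean()


new_lp = torch.log(torch.tensor([0.5, 0.25])).requires_grad_()
old_lp = torch.log(torch.tensor([0.25, 0.5]))
adv = torch.tensor([1.0, -1.0])
loss = cpi_loss(new_lp, old_lp, adv)
loss.backward()  # gradients flow only through the new policy's log-probabilities
```

Nothing in this loss keeps the ratio near 1, which is the problem the clipped and penalized variants address.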

#### Clipping

In [None]:
class PPOAgentCLIP():

    def __init__(self, network):
        self.network = network
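
The clipped objective from the paper is $L^{CLIP} = \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t,\ \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]$. The sketch below shows one way to write it as a loss; `clip_loss` and its arguments are hypothetical names, and `eps=0.2` is an assumed default.

```python
import torch


def clip_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    """Negative PPO-Clip surrogate, so that minimizing it maximizes L^CLIP."""
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    adv = advantages.detach()
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    # Element-wise minimum gives a pessimistic bound on the unclipped objective.
    return -torch.min(unclipped, clipped).mean()


new_lp = torch.log(torch.tensor([0.5, 0.25])).requires_grad_()
old_lp = torch.log(torch.tensor([0.25, 0.5]))
adv = torch.tensor([1.0, -1.0])
loss = clip_loss(new_lp, old_lp, adv)
loss.backward()
```

In this example the ratios are 2 and 0.5, both outside $[0.8, 1.2]$ in the direction the advantage favors, so the clipped term is selected for both samples and the gradient with respect to `new_lp` is zero.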

#### KL Penalty (Fixed)


In [None]:
class PPOAgentKLFix():

    def __init__(self, network):
        self.network = network
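
With a fixed coefficient $\beta$, the penalized objective is $\hat{\mathbb{E}}_t \left[ r_t(\theta) \hat{A}_t - \beta\, \mathrm{KL}\left[ \pi_{\theta_k}(\cdot|s_t),\ \pi_\theta(\cdot|s_t) \right] \right]$. A minimal sketch for a categorical policy is given below; the function name `kl_penalty_loss`, the choice to pass full action-probability tensors, and the value `beta=1.0` are assumptions for illustration.

```python
import torch


def kl_penalty_loss(new_probs, old_probs, actions, advantages, beta=1.0):
    """Negative KL-penalized surrogate for a categorical policy.

    new_probs, old_probs: (batch, n_actions) action probabilities
    actions:              (batch,) indices of the actions actually taken
    Returns the loss and the mean KL (useful for monitoring).
    """
    old_probs = old_probs.detach()
    idx = actions.unsqueeze(-1)
    ratio = (new_probs.gather(-1, idx) / old_probs.gather(-1, idx)).squeeze(-1)
    # KL[old || new], summed over the action dimension.
    kl = (old_probs * (old_probs.log() - new_probs.log())).sum(dim=-1)
    loss = -(ratio * advantages.detach() - beta * kl).mean()
    return loss, kl.mean().detach()


uniform = torch.full((2, 4), 0.25)
loss, mean_kl = kl_penalty_loss(
    uniform.clone().requires_grad_(), uniform,
    actions=torch.tensor([0, 3]), advantages=torch.tensor([1.0, 3.0]),
)
```

When the new and old policies coincide, as in this example, the ratio is 1 and the KL term vanishes, so the loss reduces to the negative mean advantage.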

#### KL Penalty (Adaptive)

In [None]:
class PPOAgentKLAdp():

    def __init__(self, network):
        self.network = network
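
In the adaptive variant, the paper adjusts $\beta$ after each policy update by comparing the measured KL divergence $d$ with a target $d_{targ}$: halve $\beta$ if $d < d_{targ} / 1.5$, double it if $d > d_{targ} \times 1.5$. The rule can be sketched as a small helper; the name `update_beta` is hypothetical.

```python
def update_beta(beta, kl, kl_target):
    """Adapt the KL penalty coefficient after one policy update."""
    if kl < kl_target / 1.5:
        return beta / 2.0  # policy moved too little: weaken the penalty
    if kl > kl_target * 1.5:
        return beta * 2.0  # policy moved too much: strengthen the penalty
    return beta


beta = 1.0
beta = update_beta(beta, kl=0.1, kl_target=0.01)  # KL far above target
print(beta)
# 2.0
```

The updated `beta` would then be passed to the penalized loss for the next policy update.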

### Estimator

#### Value Function Network

In [None]:
#class valueNet(nn.Module):
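
A possible shape for the value network, together with one regression step on the mean-squared error from step 7 of the pseudocode, is sketched below. The class name `ValueNet`, the 64-unit hidden layers, the learning rate, and the random placeholder data are all assumptions of this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ValueNet(nn.Module):
    """Maps an 8-dimensional Lunar Lander state to a scalar value estimate."""

    def __init__(self, state_dim=8, hidden_dim=64):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        hid = torch.tanh(self.fc1(state))
        hid = torch.tanh(self.fc2(hid))
        return self.fc3(hid).squeeze(-1)  # shape (batch,)


value_net = ValueNet()
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

states = torch.randn(5, 8)  # placeholder batch of states
returns = torch.randn(5)    # placeholder rewards-to-go

# One gradient descent step on the mean-squared error.
loss = F.mse_loss(value_net(states), returns)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```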

#### Generalized Advantage Estimation Agent

In [None]:
#class GAEAgent():
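
Generalized advantage estimation computes $\hat{A}_t = \sum_{l \geq 0} (\gamma \lambda)^l \delta_{t+l}$ with $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. The sketch below evaluates this with a single backward pass over a flat batch. The function name `gae_advantages`, the `dones` and `last_value` arguments, and the defaults `gamma=0.99` and `lam=0.95` are assumptions for illustration.

```python
def gae_advantages(rewards, values, dones, last_value=0.0, gamma=0.99, lam=0.95):
    """GAE over a flat batch of transitions.

    rewards, values: lists of floats, one per step (values[t] = V(s_t))
    dones:           list of bools, True where the episode ended at that step
    last_value:      V of the state following the final step (0 if terminal)
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    next_adv = 0.0
    for t in reversed(range(len(rewards))):
        mask = 0.0 if dones[t] else 1.0  # stop bootstrapping at episode ends
        delta = rewards[t] + gamma * next_value * mask - values[t]
        next_adv = delta + gamma * lam * mask * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages


print(gae_advantages([1.0, 1.0], [0.0, 0.0], [False, True], gamma=1.0, lam=1.0))
# [2.0, 1.0]
```

With `lam=1.0` and zero value estimates, as in the example, the advantages equal the rewards-to-go; with `lam=0.0` they reduce to the one-step TD errors.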

### Train

### Test