# **Trust Region Policy Optimization**

Environment: Lunar Lander

## References

#### Papers
- [Trust Region Policy Optimization, Schulman et al. 2015](https://arxiv.org/abs/1502.05477)
- [High-Dimensional Continuous Control Using Generalized Advantage Estimation, Schulman et al. 2015](https://arxiv.org/abs/1506.02438)
- [Approximately Optimal Approximate Reinforcement Learning, Kakade and Langford 2002](https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/KakadeLangford-icml2002.pdf)

#### Blogs
- [OpenAI Spinning Up - Trust Region Policy Optimization](https://spinningup.openai.com/en/latest/algorithms/trpo.html)

#### Others
- [OpenAI Spinning Up](https://spinningup.openai.com/en/latest/index.html)
- [OpenAI Gym](https://gym.openai.com/)

## Preparation

In [0]:
%%capture
!sudo apt update
!sudo apt install python-opengl xvfb -y
!pip install gym[box2d] pyvirtualdisplay piglet tqdm

%%capture
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

%matplotlib inline
import matplotlib.pyplot as plt

from IPython import display

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
from tqdm import tqdm_notebook

In [0]:
%%capture
# Import gym and create a Lunar Lander environment.
import gym
env = gym.make('LunarLander-v2')

## TRPO Algorithm
Trust Region Policy Optimization algorithm from the [original paper](https://arxiv.org/abs/1502.05477)

> Initialize $\pi_0$.
> <br>**for** i = 0, 1, 2,... until convergence **do**

> > Compute all advantage values $A_{\pi_i}(s, a)$. 
> > <br> Solve the constrained optimization problem:
> > <br>
> > <br> $\pi_{i+1} = \underset{\pi}{arg\ max}\ L_{\pi_i}(\pi)$ $~~~~$ s.t. $\bar{D}^{\rho_{\pi_i}}_{KL}(\pi_i, \pi) \leq \delta$
> > <br> where $L_{\pi_i}(\pi) = \eta(\pi_i) + \underset{s}{\sum} \rho_{\pi_i}(s) \underset{a}{\sum} \pi(a|s) A_{\pi_i}(s, a)$

> > which is equivalent to solving:
> > <br> $\underset{\theta}{maximize}\ \underset{s}{\sum} \rho_{\theta_{old}}(s) \underset{a}{\sum} \pi_\theta(a|s) A_{\theta_{old}}(s, a)$ $~~~~$ s.t. $\bar{D}^{\rho_{\theta_{old}}}_{KL}(\theta_{old}, \theta) \leq \delta$

> **end for**

<br> Next, we approximate the objective and constraint functions using Monte Carlo simulation.

1. Replace $\underset{s}{\sum} \rho_{\theta_{old}}(s)[...]$ by $\frac{1}{1-\gamma}\mathbb{E}_{s \sim \rho_{\theta_{old}}}[...]$.
2. Replace the advantage function $A_{\theta_{old}}$ by the state-action value function $Q_{\theta_{old}}$.
3. Replace the sum over actions by an importance sampling estimator ($q$ is the sampling distribution):
> $\underset{a}{\sum} \pi_\theta(a|s_n) A_{\theta_{old}}(s_n, a) = \mathbb{E}_{a \sim q}\ [\frac{\pi_{\theta}(a|s_n)}{q(a|s_n)} A_{\theta_{old}}(s_n, a)]$
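
The importance sampling identity in step 3 can be checked numerically. The sketch below is only an illustration: the probabilities and advantage values are made-up numbers for a single state with 4 actions, and $q$ is assumed to be uniform.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical numbers for a single state s_n with 4 discrete actions.
pi_theta = np.array([0.1, 0.2, 0.3, 0.4])    # target policy pi_theta(a|s_n)
q = np.array([0.25, 0.25, 0.25, 0.25])       # sampling distribution q(a|s_n)
advantage = np.array([1.0, -0.5, 2.0, 0.0])  # A_theta_old(s_n, a)

# Left-hand side: exact sum over actions.
exact = float(np.sum(pi_theta * advantage))

# Right-hand side: Monte Carlo average of the importance-weighted advantage
# over actions drawn from q.
actions = rng.choice(4, size=100_000, p=q)
estimate = float(np.mean(pi_theta[actions] / q[actions] * advantage[actions]))

print(exact, estimate)  # the two values should agree closely
```

The estimator is unbiased for any $q$ that puts nonzero probability on every action, but its variance grows when $q$ is far from $\pi_\theta$.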

<br> The optimization problem becomes:

> $\underset{\theta}{maximize}\ \mathbb{E}_{s \sim \rho_{\theta_{old}},\ a \sim q}\ [\frac{\pi_{\theta}(a|s)}{q(a|s)} Q_{\theta_{old}}(s, a)]$ $~~~~$ s.t. $\mathbb{E}_{s \sim \rho_{\theta_{old}}}\ [\ D_{KL}(\ \pi_{\theta_{old}}(\cdot|s)\ \|\ \pi_{\theta}(\cdot|s)\ )\ ] \leq \delta$

<br> The remaining steps are:

1. Replace the expectations by sample averages.
2. Replace the $Q$ value by an empirical estimate.

In the original paper, the authors provide two schemes, single path and vine, for these steps. We will implement both of them.
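
Once the expectations are replaced by sample averages, both the objective and the constraint can be evaluated directly from a batch of sampled data. The following is a minimal sketch for a discrete action space, assuming the actions were sampled from the old policy (so $q = \pi_{\theta_{old}}$). The function name `surrogate_and_kl` and its arguments are hypothetical and not part of this notebook.

```python
import torch
from torch.distributions import Categorical, kl_divergence


def surrogate_and_kl(new_probs, old_probs, actions, q_values):
    """Sample-average surrogate objective and mean KL for a batch of states.

    new_probs, old_probs: (N, num_actions) action probabilities under
        pi_theta and pi_theta_old.
    actions: (N,) actions sampled from pi_theta_old (assumes q = pi_theta_old).
    q_values: (N,) empirical Q estimates for the sampled state-action pairs.
    """
    idx = actions.unsqueeze(1)
    new_p = new_probs.gather(1, idx).squeeze(1)
    old_p = old_probs.gather(1, idx).squeeze(1).detach()
    # Importance ratio pi_theta(a|s) / pi_theta_old(a|s) for the sampled actions.
    surrogate = (new_p / old_p * q_values).mean()
    # KL(pi_theta_old || pi_theta), averaged over the sampled states.
    mean_kl = kl_divergence(Categorical(probs=old_probs.detach()),
                            Categorical(probs=new_probs)).mean()
    return surrogate, mean_kl


# Toy batch of two states; with identical policies the ratio is 1 and the KL is 0.
old_probs = torch.tensor([[0.25, 0.25, 0.25, 0.25], [0.1, 0.2, 0.3, 0.4]])
actions = torch.tensor([2, 0])
q_values = torch.tensor([1.0, -1.0])
surrogate, mean_kl = surrogate_and_kl(old_probs.clone(), old_probs, actions, q_values)
print(surrogate.item(), mean_kl.item())
```

A TRPO update would then look for parameters that increase `surrogate` while keeping `mean_kl` below $\delta$.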

### Single Path

In this scheme, we generate a trajectory ($s_0, a_0, \ldots, s_{T-1}, a_{T-1}, s_T$) by sampling $s_0 \sim \rho_0$ and then following $\pi_{\theta_{old}}$, which means $q(a|s) = \pi_{\theta_{old}}(a|s)$. The $Q$ value at each state-action pair is estimated as the discounted sum of future rewards along the trajectory: $$\hat{Q}_{\theta_{old}}(s_t, a_t) = \sum_{l=0}^{T-1-t} \gamma^l r(s_{t+l})$$

### Policy Gradient Network

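> First, we construct the policy gradient neural network, following the [TRPO paper](https://arxiv.org/abs/1502.05477). It maps the 8-dimensional Lunar Lander state to a probability distribution over the 4 discrete actions.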

In [0]:
class PolicyGradientNetwork(nn.Module):

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 16)
        self.fc3 = nn.Linear(16, 4)

    def forward(self, state):
        hid = torch.tanh(self.fc1(state))
        hid = torch.tanh(self.fc2(hid))
        return F.softmax(self.fc3(hid), dim=-1)

### Trust Region Policy Optimization Agent

> Next, we build a TRPO agent that uses the policy gradient network above to take actions and has the following functions:
1. `learn()`:

In [0]:
class TRPOAgent():

    def __init__(self, network):
        self.network = network
        self.optimizer = optim.SGD(self.network.parameters(), lr=0.001)
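
The training loop later in this notebook calls `agent.sample(state)`, whose definition is not shown in this section. One plausible way to draw an action and its log-probability from a softmax policy is sketched below as a standalone function. The name `sample_action` and the small stand-in network are assumptions for illustration, not the notebook's actual method.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


def sample_action(network, state):
    """Sample a discrete action and its log-probability from a policy network.

    Hypothetical helper: assumes `network` maps a state tensor to action
    probabilities, as PolicyGradientNetwork.forward does.
    """
    probs = network(torch.as_tensor(state, dtype=torch.float32))
    dist = Categorical(probs)
    action = dist.sample()
    # Return a plain int for env.step() and keep the log-probability as a
    # tensor so gradients can flow back into the network.
    return action.item(), dist.log_prob(action)


torch.manual_seed(0)
# Stand-in policy with the same input and output sizes as the network above.
policy = nn.Sequential(nn.Linear(8, 4), nn.Softmax(dim=-1))
action, log_prob = sample_action(policy, [0.0] * 8)
print(action, log_prob.item())
```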

### Compute State-Action Value Function Estimates

$$\hat{Q}_{\theta_{old}}(s_t, a_t) = \sum_{l=0}^{T-1-t} \gamma^l r(s_{t+l})$$


In [0]:
def computeQ():

    return Qvalues
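
The training loop below calls a helper `discount(episode_reward, gamma)` that is not defined in this section. A minimal sketch of such a helper, assuming it should return the discounted sum of future rewards at every step of one episode, is:

```python
import numpy as np


def discount(rewards, gamma):
    """Return the discounted sum of future rewards at every time step."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    # Work backwards so each step reuses the return of the step after it:
    # G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Rewards 1, 2, 3 with gamma = 0.5 give returns 2.75, 3.5, 3.0.
print(discount([1.0, 2.0, 3.0], gamma=0.5))
```

Computing the returns backwards takes one pass over the episode instead of re-summing the tail of the trajectory at every step.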

### Train

#### Single Path

In [0]:
network = PolicyGradientNetwork()
agent = TRPOAgent(network)

In [0]:
agent.network.train() 
EPISODE_PER_BATCH = 5  # Update agent once per EPISODE_PER_BATCH episodes.
NUM_BATCH = 400        # Update agent NUM_BATCH times in total.
gamma = 0.5            # Discount parameter

avg_total_rewards, avg_final_rewards = [], []

prg_bar = tqdm_notebook(range(NUM_BATCH))
for batch in prg_bar:

    log_probs = []
    total_rewards, final_rewards = [], []

    discounted_rewards = []

    # Collect training data
    for episode in range(EPISODE_PER_BATCH):
        
        state = env.reset()
        total_reward, total_step = 0, 0

        episode_reward = []
        

        while True:

            action, log_prob = agent.sample(state)
            next_state, reward, done, _ = env.step(action)

            log_probs.append(log_prob)
            state = next_state
            total_reward += reward
            total_step += 1

            episode_reward.append(reward)

            if done:
                final_rewards.append(reward)
                total_rewards.append(total_reward)
                                
                discounted_reward = discount(episode_reward, gamma)
                discounted_rewards.append(discounted_reward)
                break

    # Log training process
    avg_total_reward = sum(total_rewards) / len(total_rewards)
    avg_final_reward = sum(final_rewards) / len(final_rewards)
    avg_total_rewards.append(avg_total_reward)
    avg_final_rewards.append(avg_final_reward)
    prg_bar.set_description(f"Total: {avg_total_reward: 4.1f}, Final: {avg_final_reward: 4.1f}")

    # Update Policy Gradient Network
    discounted_rewards = np.concatenate(discounted_rewards, axis=0)
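    # Standardize the returns; the small constant in the denominator guards against division by zero.
    discounted_rewards = (discounted_rewards - np.mean(discounted_rewards)) / (np.std(discounted_rewards) + 1e-9)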
    agent.learn(torch.stack(log_probs), torch.from_numpy(discounted_rewards), EPISODE_PER_BATCH)

#### Vine