In [None]:
%config IPCompleter.greedy = True
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
%load_ext tensorboard

# !sudo apt-get install -y xvfb ffmpeg
# !pip install 'xvfbwrapper==0.2.9'
# !pip install 'gym==0.10.11'
# !pip install 'imageio==2.4.0'
# !pip install PILLOW
# !pip install 'pyglet==1.3.2'
# !pip install pyvirtualdisplay
# !pip install tf-agents
# !pip install gast

import matplotlib.pyplot as plt
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sn
import tensorflow as tf
import os
from datetime import datetime
import sklearn

import abc
import base64
import imageio
import io
import IPython
import PIL.Image
import pyvirtualdisplay
import tensorflow_probability as tfp
import shutil
import tempfile
import zipfile

pd.set_option('mode.chained_assignment', None)
sn.set(rc={'figure.figsize':(9,9)})
sn.set(font_scale=1.4)

# make results reproducible (use the same seed for NumPy and TensorFlow)
seed = 0
np.random.seed(seed)
tf.random.set_seed(seed)

# Value and Policy based methods

Let's review the taxonomy of RL algorithms:

![rl_algorithms_9_15.svg](attachment:rl_algorithms_9_15.svg)

Our first categorisation when choosing an RL algorithm is whether the agent uses a **model** of the world (environment), either given or learned. A model is often a function that predicts state transitions and rewards. Models allow us to **plan** by simulating ahead what would happen given a range of inputs. Agents can then distill the results of planning into a learned policy (e.g. AlphaZero). When this works, it can substantially improve sample efficiency compared to model-free algorithms. In practice, however, a **model** is rarely known, and the agent has to learn one purely from experience. This introduces bias in the model that the agent can exploit, producing an agent that performs well on the learned model but poorly in the real environment. For this reason, training **model-based** agents is difficult.
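As a rough illustration of planning with a model, the sketch below simulates every short action sequence through a known model and picks the first action of the best one. The `toy_model` (a walk on the integers 0 to 4 with a reward at 4), the `plan` helper, the horizon, and the discount are all hypothetical choices made for this example, not part of any library.

```python
import itertools

def toy_model(state, action):
    """Hypothetical known model: walk on the integers 0..4, reward at 4."""
    next_state = min(max(state + action, 0), 4)
    reward = 1.0 if next_state == 4 else 0.0
    return next_state, reward

def plan(state, model, actions=(-1, 1), horizon=3, gamma=0.9):
    """Return the first action of the best simulated action sequence."""
    best_return, best_first_action = float("-inf"), None
    # Exhaustively simulate every action sequence of length `horizon`.
    for sequence in itertools.product(actions, repeat=horizon):
        sim_state, total = state, 0.0
        for t, action in enumerate(sequence):
            sim_state, reward = model(sim_state, action)
            total += (gamma ** t) * reward
        if total > best_return:
            best_return, best_first_action = total, sequence[0]
    return best_first_action, best_return

first_action, simulated_return = plan(state=2, model=toy_model)
print(first_action, simulated_return)
```

Exhaustive search only scales to tiny problems. It is used here purely to show that no interaction with the real environment is needed while planning.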

**Model-free** agents don't use a model; they tend to be more popular and easier to implement. They fall into two broad categories: **Value** based (Q-learning) and **Policy** based methods.

## Value based methods

Often referred to as *Q-learning*, these methods learn an approximator $Q_\theta(s,a)$ for the optimal action-value function $Q^*(s,a)$. They typically use an objective function (the loss minimised when updating the parameters) based on the Bellman equation. This optimization is almost always performed **off-policy**, which means each update can use data collected at any point during training, regardless of how the agent was choosing to explore the environment when the data was collected. The policy taken by the Q-learning agent is given by:

$$a(s) =  \operatorname{arg max}_{a} Q(s, a)$$
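A minimal sketch of this greedy policy and of a single Bellman-style update, assuming a small tabular $Q$ stored as a NumPy array. The table values, the function names `greedy_action` and `q_learning_update`, and the step size and discount are illustrative assumptions.

```python
import numpy as np

def greedy_action(q_table, state):
    """Return argmax_a Q(state, a) for a tabular Q-function."""
    return int(np.argmax(q_table[state]))

def q_learning_update(q_table, state, action, reward, next_state, done,
                      alpha=0.1, gamma=0.99):
    """Apply one off-policy Bellman backup to q_table in place."""
    # The target bootstraps from the best next action, regardless of
    # which action the behaviour policy actually takes next.
    bootstrap = 0.0 if done else gamma * np.max(q_table[next_state])
    td_error = reward + bootstrap - q_table[state, action]
    q_table[state, action] += alpha * td_error
    return td_error

# Hypothetical 2-state, 3-action table, for illustration only.
q = np.array([[0.0, 1.0, 0.5],
              [2.0, 0.0, 0.0]])
a = greedy_action(q, 0)
td = q_learning_update(q, state=0, action=a, reward=1.0, next_state=1,
                       done=False, alpha=0.5, gamma=0.9)
print(a, td, q[0, a])
```

Because the target uses the maximum over next actions rather than the action actually taken, the same update can be applied to transitions collected by any exploration strategy.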

Examples of Q-Learning methods are:
* [DQN](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) (Deep Q-Networks)
* [C51](https://arxiv.org/abs/1707.06887) (A DQN variation, that learns a distribution over return whose expectation is $Q^*$)
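To make the C51 description concrete, the sketch below shows how a distribution over returns collapses to a scalar action value by taking its expectation. The support range, the 51 atoms, and the probability vectors are assumed values for illustration; in a real agent the probabilities would come from the network.

```python
import numpy as np

# Assumed support: 51 evenly spaced return values ("atoms") in [v_min, v_max].
v_min, v_max, num_atoms = -10.0, 10.0, 51
atoms = np.linspace(v_min, v_max, num_atoms)

def expected_q(probs, atoms):
    """Collapse per-action return distributions into scalar Q-values.

    probs has shape (num_actions, num_atoms); each row sums to 1.
    """
    return probs @ atoms

# Hypothetical distributions for two actions, for illustration only.
probs = np.zeros((2, num_atoms))
probs[0, 25] = 1.0   # action 0: all mass on the atom at 0.0
probs[1, 30] = 0.5   # action 1: half the mass on the atom at 2.0 ...
probs[1, 40] = 0.5   # ... and half on the atom at 6.0

q_values = expected_q(probs, atoms)
best_action = int(np.argmax(q_values))
print(q_values, best_action)
```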

## Policy based methods

Also known as *Policy Optimization*, where we parameterise the policy $\pi_\theta(a|s)$ and keep updating its parameters until performance stops improving. These methods optimise the parameters $\theta$ either directly by gradient ascent on a defined performance objective, often a **policy value** $\rho(\theta)$ (the expected reward we get when following $\pi_\theta$), or by maximising local approximations of $\rho(\theta)$. This is usually performed **on-policy**, which restricts each update to data collected while acting according to the most recent version of the policy $\pi_\theta$. Policy optimization often also involves learning a value function approximator $V_\phi(s)$ for the on-policy value function $V^{\pi}(s)$, which is used when updating the policy.
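A minimal sketch of one gradient-ascent step on the parameters $\theta$, assuming a tabular softmax policy and a Monte Carlo (REINFORCE-style) gradient estimate. The episode data, the learning rate, and the helper names are hypothetical and chosen only to keep the example small.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_gradient(theta, states, actions, rewards, gamma=0.99):
    """Monte Carlo policy-gradient estimate for a tabular softmax policy.

    theta has shape (num_states, num_actions) and holds the logits.
    """
    grad = np.zeros_like(theta)
    returns = discounted_returns(rewards, gamma)
    for s, a, g in zip(states, actions, returns):
        probs = softmax(theta[s])
        # For a softmax, d log pi(a|s) / d theta[s] = one_hot(a) - pi(.|s).
        grad_log_pi = -probs
        grad_log_pi[a] += 1.0
        grad[s] += g * grad_log_pi
    return grad

# Hypothetical one-state, two-action episode in which action 1 was rewarded.
theta = np.zeros((1, 2))
grad = reinforce_gradient(theta, states=[0, 0], actions=[1, 1],
                          rewards=[1.0, 1.0], gamma=0.5)
theta_new = theta + 0.1 * grad  # gradient ascent on the expected return
print(grad, softmax(theta_new[0]))
```

After the step, the policy assigns more probability to the rewarded action. The data used for the estimate must come from the current policy, which is what makes this update on-policy.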

Examples of Policy based methods are:
* [REINFORCE](https://papers.nips.cc/paper/1713-policy-gradient-methods-for-reinforcement-learning-with-function-approximation.pdf) (Vanilla Policy Optimization)
* [A2C / A3C](https://arxiv.org/abs/1602.01783) (Advantage Actor-Critic / Asynchronous Advantage Actor-Critic) Gradient ascent to directly maximize performance. Combines REINFORCE with a learned baseline (critic) to improve the stability of learning. Can be trained in parallel (asynchronously) or synchronously, for both discrete and continuous action spaces.
* [PPO](https://arxiv.org/abs/1707.06347) (Proximal Policy Optimization) Maximizes a surrogate objective function that gives a conservative estimate of how much $\rho(\theta)$ will change as a result of the update. An actor-critic scheme that uses bounded updates to the policy in order to make the learning process very stable.
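The bounded update that PPO relies on can be sketched with the clipped surrogate objective from the PPO paper. The function name, the clipping range of 0.2, and the sample log-probabilities and advantages below are assumptions made for illustration.

```python
import numpy as np

def ppo_clipped_objective(new_log_probs, old_log_probs, advantages,
                          clip_eps=0.2):
    """Clipped surrogate objective, averaged over a batch of samples."""
    # Probability ratio between the updated policy and the data-collecting one.
    ratio = np.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum gives a pessimistic (conservative) estimate.
    return np.mean(np.minimum(unclipped, clipped))

# Hypothetical batch of two samples.
old_log_probs = np.log(np.array([0.5, 0.5]))
new_log_probs = np.log(np.array([0.75, 0.25]))
advantages = np.array([1.0, -1.0])

objective = ppo_clipped_objective(new_log_probs, old_log_probs, advantages)
print(objective)
```

Once the ratio leaves the interval $[1 - \epsilon, 1 + \epsilon]$ in the direction that would improve the objective, the clipped term stops growing, so there is no incentive to move the policy further in a single update.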

## Trade-offs between Policy and Value based methods

The primary strength of policy optimization methods is that they are principled, in that we directly optimize for the thing we want (an optimal policy $\pi$). This tends to make them stable and reliable. In contrast, value based methods (Q-learning methods) only indirectly optimize for agent performance, by training $Q_{\theta}$ to satisfy a self-consistency equation (the Bellman equation). There are many failure modes for this kind of learning, so it tends to be less stable. However, Q-learning methods gain the advantage of being substantially more sample efficient when they do work, because they can reuse data more effectively than policy optimization techniques. Policy methods also tend to perform best on continuous action spaces, where the $\operatorname{arg max}$ over actions that a Q-learning policy requires is expensive to compute.

## Interpolating between Policy and Value based methods

Policy optimization and Q-learning methods are not incompatible (and under some circumstances, it turns out, equivalent), and there exists a range of algorithms that live in between the two extremes. Algorithms on this spectrum are able to carefully trade off the strengths and weaknesses of either side.

Examples include:
* [DDPG](https://arxiv.org/abs/1509.02971) (Deep Deterministic Policy Gradient) An actor critic scheme for continuous action spaces which assumes that the policy is deterministic, and therefore it is able to use a replay buffer in order to improve sample efficiency.
* [TD3](https://arxiv.org/pdf/1802.09477.pdf) Very similar to DDPG, i.e. an actor-critic for continuous action spaces, that uses a replay buffer in order to improve sample efficiency. TD3 uses two critic networks in order to mitigate the overestimation in the Q state-action value prediction, slows down the actor updates in order to increase stability and adds noise to actions while training the critic in order to smooth out the critic's predictions.
* [SAC](https://arxiv.org/abs/1801.01290) (Soft Actor-Critic) A DDPG variant that uses stochastic policies, entropy regularization, and a few other tricks to stabilize learning and score higher than DDPG on standard benchmarks. It optimizes a stochastic policy in an off-policy way. One of the key features of SAC is that it solves a maximum entropy reinforcement learning problem.
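All three algorithms above reuse past experience through a replay buffer. A minimal sketch of such a buffer is shown below; the class name `ReplayBuffer`, its methods, and the transition layout are hypothetical and are not the TF-Agents replay buffer API.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen silently drops the oldest transition when full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        """Draw batch_size distinct transitions uniformly at random."""
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)

# Hypothetical usage: store five transitions in a buffer of capacity three.
buffer = ReplayBuffer(capacity=3, seed=0)
for step in range(5):
    buffer.add(state=step, action=0, reward=1.0, next_state=step + 1,
               done=False)
batch = buffer.sample(batch_size=2)
print(len(buffer), batch)
```

Sampling from old transitions is only valid for off-policy updates, which is why the critic in these methods is trained with a Q-learning style target.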

# TF-Agents examples

## Policy agents

### REINFORCE Agent

In [1]:
from tf_agents.agents.behavioral_cloning.behavioral_cloning_agent import BehavioralCloningAgent
from tf_agents.agents.categorical_dqn.categorical_dqn_agent import CategoricalDqnAgent
from tf_agents.agents.ddpg.ddpg_agent import DdpgAgent
from tf_agents.agents.dqn.dqn_agent import DqnAgent
from tf_agents.agents.ppo.ppo_agent import PPOAgent
from tf_agents.agents.reinforce.reinforce_agent import ReinforceAgent
from tf_agents.agents.sac.sac_agent import SacAgent
from tf_agents.agents.td3.td3_agent import Td3Agent