# Actor-critic methods

## Taxonomy of RL

<img src="http://drive.google.com/uc?export=view&id=1Gz0WBOtTxYrZ91uAidFE_Tuw0RBSfl9O" width=55%>

<img src="http://drive.google.com/uc?export=view&id=1Is1w5j2e-3WOSSl9qya7BwKW0Hh1hIJL" width=55%>

## The actor-critic framework

The actor-critic framework consists of an actor and a critic. The actor is the policy (the behavior of the agent) that we want to learn. The critic's task is to help train the actor: technically, it is used to calculate the policy gradient. A critic can be:

* Q-function
* V-function
* A-function (advantage: Q-V)
* other

The objective function $\rho$ measures the performance of a policy. It has two common formulations:

**Start-state formulation:** The start-state formulation uses the following function for measuring the performance of a policy:

$$\rho(\pi_\theta) = E_\tau\left[ \left. \sum_{t=0}^\infty{\gamma^t r_t} \right| s_0, \pi_\theta \right]$$

**Average-reward formulation:** The average-reward formulation uses the following function for measuring the performance of a policy:

$$\rho(\pi_\theta) = \lim_{n \rightarrow \infty} \frac{1}{n}E_\tau\left[ \left. \sum_{t=0}^n{r_t} \right| \pi_\theta \right]$$
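
For a single sampled trajectory, both objectives reduce to simple sums over the observed rewards. The following is a minimal sketch, assuming that a finite reward list stands in for the infinite sum and for the limit. The function names `discounted_return` and `average_reward` are illustrative and do not come from the original text.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t over a finite reward sequence (start-state objective)."""
    total = 0.0
    # Work backwards so that each step is G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def average_reward(rewards):
    """Mean reward per step over a finite trajectory (average-reward objective)."""
    return sum(rewards) / len(rewards)


rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
print(average_reward(rewards))                # (1 + 0 + 2) / 3 = 1.0
```

Averaging these quantities over many trajectories sampled with $\pi_\theta$ gives a Monte Carlo estimate of the expectation.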

**Theorem 1:** In both the start-state and the average-reward formulation, the gradient of the objective satisfies:

$$\frac{\partial \rho(\pi_\theta)}{\partial \theta} = E_\tau \left[ \left. \frac{\partial \log \pi_\theta(s, a)}{\partial \theta} \cdot Q^{\pi_\theta}(s, a) \right| \pi_\theta \right] $$

Here the actor is the policy $\pi_\theta$ and the critic is $Q^{\pi_\theta}$. The problem with using the $Q$-function directly is that the gradient estimate will be **too noisy**. Noisy gradients are a general issue in policy-based RL, and there is a commonly used technique to reduce the variance (noise).

It can be proven that subtracting an arbitrary function $B(s)$ (a baseline that depends only on the state) from $Q(s, a)$ does not change the expected gradient, but with a careful choice the **variance can be reduced significantly**. A natural and widely used choice for $B(s)$ is the state-value function itself, $V^{\pi_\theta}(s)$.

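First, let us see why subtracting $B(s)$ does not change the expected value. Here $d^{\pi_\theta}(s)$ denotes the state distribution induced by $\pi_\theta$, and the first step uses the identity $\pi_\theta \cdot \partial \log \pi_\theta = \partial \pi_\theta$: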

$$E\left[ \frac{\partial \log \pi_\theta(s, a)}{\partial \theta} B(s) \right] = \sum_s{d^{\pi_\theta}(s) \sum_a{\frac{\partial \pi_\theta(s, a)}{\partial \theta} B(s)} } = \sum_s{d^{\pi_\theta}(s) B(s) \frac{\partial}{\partial \theta} \underbrace{\sum_a{\pi_\theta(s, a)}}_{1}} = 0$$
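
The identity can be checked numerically for a single state. The sketch below assumes a softmax policy over three actions whose parameters are the logits themselves. The helper names and the example numbers are illustrative, not part of the original derivation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def grad_log_softmax(probs, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - probs."""
    return [(1.0 if j == action else 0.0) - p for j, p in enumerate(probs)]


def expected_baseline_term(logits, baseline):
    """E_a[ d log pi(a) / d theta * B ] for one state, computed exactly."""
    probs = softmax(logits)
    total = [0.0] * len(logits)
    for a, p_a in enumerate(probs):
        g = grad_log_softmax(probs, a)
        for j in range(len(logits)):
            total[j] += p_a * g[j] * baseline
    return total


term = expected_baseline_term([0.2, -1.0, 0.7], baseline=5.0)
print(term)  # every component is zero up to floating-point rounding
```

Because the inner sum vanishes for every state separately, weighting by $d^{\pi_\theta}(s)$ and summing over states also gives zero.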

We have already seen that the difference between $Q$ and $V$ is the so-called advantage:

$$A(s, a) = Q(s, a) - V(s)$$
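
To see the effect of the baseline on the variance, the mean and variance of the single-sample gradient estimator can be computed exactly for a small example. This is a sketch under assumptions: one state, a softmax policy with three equally likely actions, and made-up values $Q = (10, 11, 12)$. Only the gradient component for the first logit is examined.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def estimator_stats(logits, q_values, baseline, param_index=0):
    """Exact mean and variance of d log pi(a)/d theta_j * (Q(a) - B) for a ~ pi."""
    probs = softmax(logits)
    samples = []
    for a, q in enumerate(q_values):
        # Score function of a softmax policy w.r.t. logit j.
        score = (1.0 if a == param_index else 0.0) - probs[param_index]
        samples.append(score * (q - baseline))
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var


logits = [0.0, 0.0, 0.0]
q_values = [10.0, 11.0, 12.0]
# V(s) is the expectation of Q(s, a) under the policy.
v = sum(p * q for p, q in zip(softmax(logits), q_values))

mean_q, var_q = estimator_stats(logits, q_values, baseline=0.0)
mean_a, var_a = estimator_stats(logits, q_values, baseline=v)
print(mean_q, mean_a)  # same expected gradient
print(var_q, var_a)    # the variance is much smaller with the baseline V
```

In this toy setting the expected gradient is identical in both cases, while using $A = Q - V$ shrinks the variance by more than two orders of magnitude.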

Using the **advantage** indeed reduces the variance significantly. But how can we estimate it efficiently? Estimating both the $Q$ and the $V$ function is not a good idea: neither of them will be accurate enough, and we would have to maintain three neural networks ($Q$, $V$ and $\pi$). Can we do better?

Fortunately, the **advantage can be estimated by the TD-error**:

$$A^{\pi_\theta}(s, a) = E\left[ \delta^{\pi_\theta} | s, a \right]$$

where

$$\delta^{\pi_\theta} = r + \gamma V^{\pi_\theta}(s') - V^{\pi_\theta}(s)$$

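Since $E\left[ r + \gamma V^{\pi_\theta}(s') \mid s, a \right] = Q^{\pi_\theta}(s, a)$, the TD-error computed with the true $V^{\pi_\theta}$ is an unbiased estimate of the advantage. That means it is enough to estimate the $V$ function.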
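
The relation between the TD-error and the advantage can be verified on a toy transition model. The sketch below assumes a fixed state-action pair with two possible outcomes and made-up value estimates. The `done` flag is an added convention for terminal transitions and is not part of the formulas above.

```python
def td_error(reward, value_s, value_next, gamma, done=False):
    """One-step TD-error: r + gamma * V(s') - V(s).

    `done` is an assumed convention: at a terminal transition there is
    no next state, so the bootstrap term is dropped.
    """
    bootstrap = 0.0 if done else gamma * value_next
    return reward + bootstrap - value_s


# Toy model for one fixed (s, a): each outcome is (probability, reward, V(s')).
gamma = 0.9
value_s = 3.0
outcomes = [(0.5, 1.0, 2.0), (0.5, 0.0, 4.0)]

# Q(s, a) = E[r + gamma * V(s')], and A(s, a) = Q(s, a) - V(s).
q_sa = sum(p * (r + gamma * v_next) for p, r, v_next in outcomes)
advantage = q_sa - value_s

# E[delta | s, a], averaged over the same outcomes.
expected_delta = sum(
    p * td_error(r, value_s, v_next, gamma) for p, r, v_next in outcomes
)

print(advantage, expected_delta)  # the two values agree up to rounding
```

In practice $V$ is a learned approximation, so a single sampled TD-error is a noisy and slightly biased advantage estimate.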

### Similarity to GANs

In GANs there are generators and discriminators. The goal of the generator is to create images (or voices, or other artifacts) that the discriminator cannot distinguish from real ones.

In actor-critic algorithms, the actor (the analogue of the generator) tries to behave so that the critic (the analogue of the discriminator) cannot criticize it, i.e. the policy gradient becomes 0.

## Actor-critic algorithms

From here on, we look into the details of different actor-critic algorithms.

### A3C (and A2C)

[paper](https://arxiv.org/pdf/1602.01783.pdf)

A3C stands for Asynchronous Advantage Actor-Critic. A2C is its synchronous variant.

The advantage is estimated with the $k$-step return (we saw it earlier), which is a multi-step version of the TD-error:

$$\sum_{i=0}^{k-1}{\gamma^i r_{t+i}} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v).$$
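
The formula translates directly into a few lines of code. This is a minimal sketch, assuming that the $k$ rewards and the two value estimates have already been collected. The function name `k_step_advantage` is illustrative.

```python
def k_step_advantage(rewards, value_start, value_end, gamma):
    """k-step advantage estimate.

    rewards      -- [r_t, ..., r_{t+k-1}], so k = len(rewards)
    value_start  -- V(s_t)
    value_end    -- V(s_{t+k}), the bootstrap value
    """
    k = len(rewards)
    k_step_return = sum(gamma ** i * r for i, r in enumerate(rewards))
    return k_step_return + gamma ** k * value_end - value_start


# k = 2: 1 + 0.5 * 1 + 0.25 * 4 - 1 = 1.5
print(k_step_advantage([1.0, 1.0], value_start=1.0, value_end=4.0, gamma=0.5))
```

With a single reward ($k = 1$) the expression reduces to the one-step TD-error.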

<img src="https://drive.google.com/uc?export=download&id=1QDvdX5cu2JJ6wuX8EFyJehn5_XmRWqBr" >

<img src="https://drive.google.com/uc?export=download&id=1lSf_vAVT5BWjSBBlNy-IMxtH11qSvP1m" >

Several agents learn in parallel and share their knowledge with each other (the networks in the threads are synchronized with a central network). According to the paper, this algorithm learns faster on CPUs than DQN does on a GPU.

[Torch video](https://youtu.be/0xo1Ldx3L5Q)

[Labyrinth video](https://youtu.be/nMR5mjCFZCw)

### DDPG

[paper](https://arxiv.org/pdf/1509.02971.pdf)

DDPG (Deep Deterministic Policy Gradient) is applicable to continuous action spaces, such as those in robotics. The policy (actor) is deterministic.

**Why is a continuous action space challenging?**

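One approach would be to discretize the continuous action space, but the number of discrete actions grows exponentially with the number of dimensions. For instance, if each of 7 degrees of freedom can take the values $(-k, 0, k)$, then we already have $3^7 = 2187$ actions. In practice, fine control needs a much finer discretization (possibly hundreds of values per dimension), which makes the action space larger still.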
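
The growth is easy to reproduce by enumerating the joint actions. This is an illustrative sketch: the helper name `discretized_actions` and the choice of 11 values per joint in the second example are assumptions made only to show the trend.

```python
from itertools import product


def discretized_actions(values_per_joint, num_joints):
    """All joint actions when every joint picks one of the given values."""
    return list(product(values_per_joint, repeat=num_joints))


k = 1.0
actions = discretized_actions((-k, 0.0, k), num_joints=7)
print(len(actions))  # 3 ** 7 = 2187

# A finer grid is usually counted rather than enumerated.
print(11 ** 7)       # 19487171 joint actions with 11 values per joint
```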

The maximum operator is also computationally expensive when the action space is continuous, because finding the greedy action requires an optimization over $a$ at every step:

$$\pi(s) = \arg\max_a Q(s, a)$$

<img src="http://drive.google.com/uc?export=view&id=19_QJ8nt5mTrr531QajPEh9kNTU8C_c_v" width=75%>

**Deterministic policy gradient theorem:**

$$\frac{\partial \rho(\pi_\theta)}{\partial \theta} = E_s\left[ \frac{\partial \pi_\theta(s)}{\partial \theta} \left. \frac{\partial Q(s, a)}{\partial a} \right|_{a = \pi_\theta(s)} \right]$$
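
The theorem is the chain rule applied through the critic: the actor parameters are moved in the direction that increases $Q$ at the action the actor currently outputs. The sketch below is a toy illustration, not DDPG itself. It assumes a scalar linear policy $a = \theta s$ and a hand-written critic $Q(s, a) = -(a - 2s)^2$ whose best action is $2s$, so the policy should converge to $\theta = 2$.

```python
def policy(theta, state):
    """Deterministic linear policy: a = theta * s (scalar for simplicity)."""
    return theta * state


def dq_da(state, action):
    """Gradient w.r.t. a of the assumed toy critic Q(s, a) = -(a - 2 s)^2."""
    return -2.0 * (action - 2.0 * state)


def dpg_step(theta, states, lr):
    """One gradient-ascent step using the deterministic policy gradient."""
    grad = 0.0
    for s in states:
        a = policy(theta, s)
        dpi_dtheta = s                      # derivative of theta * s w.r.t. theta
        grad += dpi_dtheta * dq_da(s, a)    # chain rule, evaluated at a = pi(s)
    grad /= len(states)
    return theta + lr * grad


theta = 0.0
for _ in range(100):
    theta = dpg_step(theta, states=[1.0, 2.0], lr=0.1)
print(theta)  # approaches 2.0, the parameter of the best policy for this critic
```

In DDPG both the policy and the critic are neural networks and the critic is learned, but the actor update follows the same chain-rule pattern.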

[video](https://www.youtube.com/watch?v=tJBIqkC1wWM&feature=youtu.be)

### SAC

[paper](https://arxiv.org/abs/1801.01290)

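Soft Actor-Critic (SAC) is an off-policy actor-critic algorithm like DDPG, but it learns a stochastic policy with a maximum-entropy objective. It is more robust and has better convergence guarantees.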

### TRPO

[paper](https://arxiv.org/abs/1502.05477)

Trust Region Policy Optimization.

TRPO uses natural gradients to search for the next policy.

<img src="http://drive.google.com/uc?export=view&id=1fpkg2tgbhGi7-lkpSr8uObrVrcqPsxZe" width=75%>

<img src="http://drive.google.com/uc?export=view&id=1Rs6OBsrO9vYBMODGCbWvdwu0jC69iZF0" width=55%>

* the constrained optimization problem is difficult to solve
* it can learn high-quality locomotion controllers

### PPO

[paper](https://arxiv.org/pdf/1707.06347.pdf)

Proximal Policy Optimization

**Constraint is expressed with KL divergence:**

<img src="http://drive.google.com/uc?export=view&id=1lkbVd04tv8L7yhVRdyHrEjx6TeuNdMh9" width=55%>

<img src="https://drive.google.com/uc?export=download&id=1hUfS4t_2dcw3Lt8QwwcpEXqFbpCdMjH4" width=75%>
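
As a rough illustration of the KL-based variant, the sketch below computes the KL divergence between the old and the new policy in one state and subtracts it as a penalty from the surrogate term. It assumes categorical action distributions and the penalized form $r \cdot A - \beta \cdot \mathrm{KL}$ from the PPO paper. The function names and the coefficient value `beta=1.0` are illustrative.

```python
import math


def kl_divergence(p_old, p_new):
    """KL(p_old || p_new) for two categorical distributions over the same actions."""
    return sum(
        po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0
    )


def kl_penalized_objective(ratio, advantage, kl, beta):
    """Surrogate objective with a KL penalty: ratio * A - beta * KL."""
    return ratio * advantage - beta * kl


p_old = [0.5, 0.5]
p_new = [0.6, 0.4]

kl = kl_divergence(p_old, p_new)
ratio = p_new[0] / p_old[0]  # probability ratio for the sampled action 0
objective = kl_penalized_objective(ratio, advantage=1.0, kl=kl, beta=1.0)
print(kl, objective)  # a small positive KL slightly lowers the objective
```

The further the new policy moves from the old one, the larger the penalty, which discourages destructively large updates.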

**Constraint is expressed with clipping:**

<img src="http://drive.google.com/uc?export=view&id=1AcbJVnReKjNqDovugTL2HxqUgfo5hPEX" width=55%>

<img src="http://drive.google.com/uc?export=view&id=1zf5o6ykK1pA3JGn9dT7-6rJM5q4i0oec" width=65%>
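
The clipped surrogate objective from the PPO paper can be written per sample as $\min\left(r A, \mathrm{clip}(r, 1 - \epsilon, 1 + \epsilon) A\right)$, where $r$ is the probability ratio between the new and the old policy. Below is a minimal sketch of that per-sample term. The function name is illustrative, and `epsilon=0.2` is an assumed, commonly used value.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped objective: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: the gain from raising the ratio is capped at 1 + epsilon.
print(clipped_surrogate(1.5, advantage=1.0))   # about 1.2 instead of 1.5

# Negative advantage: the gain from lowering the ratio is capped at 1 - epsilon.
print(clipped_surrogate(0.5, advantage=-1.0))  # about -0.8 instead of -0.5

# Moves that make the objective worse are not clipped.
print(clipped_surrogate(1.5, advantage=-1.0))  # -1.5
```

Taking the minimum makes the objective a pessimistic bound, so the optimizer has no incentive to push the ratio outside the interval $[1 - \epsilon, 1 + \epsilon]$.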

## Dota 2 with PPO

[paper](https://cdn.openai.com/dota-2.pdf)

The policy for controlling OpenAI Five was trained with a scaled-up version of PPO.

**Rapid:**

<img src="http://drive.google.com/uc?export=view&id=1TzDjjxwU2TbTE-Ij0w6bgI0AR54bkZXn" width=65%>

**Architecture:**

<img src="http://drive.google.com/uc?export=view&id=1Gdnd9lIM-Hap5wqDRJKg7PatQCZ1oX92" width=65%>

The algorithm is quite simple but effective:

* it copes with long time horizons
* collaboration among the agents emerges