**Policy Gradient methods.**

1. [Learning parametric policies](#policies)
2. [Policy gradient methods](#PG)
3. [Objective function](#objective)
4. [The Policy Gradient theorem](#theorem)
5. [REINFORCE: Monte Carlo Policy Gradient](#reinforce)
6. [Actor-Critic methods](#ac)
7. [The link with Policy Iteration](#pi)
8. [Deep Actor-Critic algorithms and references](#deep)
9. [Practice](#practice)

# <a id="policies"></a>Learning parametric policies

**Bottomline question:**<br>
The previous classes have focussed on *action-value methods*; they aimed at estimating $Q^*$ in order to deduce $\pi^*$. Could we directly optimize $\pi$?

Suppose we have a policy $\pi_\theta$ parameterized by a vector $\theta$. Our goal is to find the parameter $\theta^*$ corresponding to $\pi^*$.

Remarks:
- $\pi_\theta$ might not be able to represent $\pi^*$. We will take a shortcut and call $\pi^*$ the best policy among the $\pi_\theta$ ones.
- for discrete state and action space, the tabular policy representation is a special case of policy parameterization.
- policy parameterization is a (possibly useful) way of introducing prior knowledge on the set of the desired policies.
- the optimal deterministic policies might not belong to the policy subspace of $\pi_\theta$, thus it makes sense to consider stochastic policies for $\pi_\theta$.
- for problems with significant policy approximation, the best approximate policy (among $\pi_\theta$ ones) may very well be stochastic.
- it makes even more sense to consider stochastic policies since they widen the family of environments that we can tackle, such as partially observable MDPs or multi-player games.

For stochastic policies, we shall write $\pi_\theta(a|s)$.

In the remainder of the class, we will assume that $\pi_\theta$ is differentiable with respect to $\theta$.
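To make the notion of a differentiable stochastic policy concrete, here is a minimal sketch of a tabular softmax policy. The layout `theta[s][a]` (one preference per state-action pair) and the function names are illustrative assumptions, not something imposed by the class.

```python
import math
import random

def softmax_policy(theta, s):
    """Return the list pi_theta(.|s) for a tabular softmax policy."""
    prefs = theta[s]
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def sample_action(theta, s, rng):
    """Draw a ~ pi_theta(.|s)."""
    probs = softmax_policy(theta, s)
    return rng.choices(range(len(probs)), weights=probs)[0]

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a|s) w.r.t. the preferences of state s.

    For a softmax, d/d theta[s][b] log pi(a|s) = 1[a == b] - pi(b|s).
    The gradient w.r.t. the preferences of other states is zero.
    """
    probs = softmax_policy(theta, s)
    return [(1.0 if b == a else 0.0) - probs[b] for b in range(len(probs))]

theta = [[0.0, 1.0, -1.0], [0.5, 0.5, 0.5]]  # 2 states, 3 actions
rng = random.Random(0)
probs = softmax_policy(theta, 0)
action = sample_action(theta, 0, rng)
score = grad_log_pi(theta, 0, action)
```

The vector returned by `grad_log_pi` always sums to zero, because the probabilities sum to one whatever the value of $\theta$.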

To directly optimize $\theta$ we need a criterion $J(\theta)$.

# <a id="PG"></a>Policy gradient methods

Suppose now we define some performance metric $J(\pi_\theta) = J(\theta)$. If $J$ is differentiable and a stochastic estimate $\nabla_\theta J(\theta)$ of the gradient is available, then we can define the gradient ascent update procedure:
$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta).$$

We will call **policy gradient methods** all methods that follow such a procedure (whether or not they also learn a value function).
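As a toy illustration of this update rule, the sketch below runs stochastic gradient ascent on a hypothetical scalar criterion $J(\theta) = -(\theta - 3)^2$ whose gradient is only observed through Gaussian noise. The criterion, the noise level and the step size are assumptions chosen for the example.

```python
import random

def noisy_grad(theta, rng, noise=1.0):
    """Stochastic estimate of dJ/dtheta for the toy criterion J(theta) = -(theta - 3)^2."""
    return -2.0 * (theta - 3.0) + rng.gauss(0.0, noise)

def gradient_ascent(theta0, alpha=0.05, steps=2000, seed=0):
    """Apply theta <- theta + alpha * (noisy gradient estimate) repeatedly."""
    rng = random.Random(seed)
    theta = theta0
    for _ in range(steps):
        theta = theta + alpha * noisy_grad(theta, rng)
    return theta

theta_final = gradient_ascent(theta0=-5.0)
```

With a constant step size, $\theta$ does not converge exactly but keeps fluctuating around the maximizer $\theta = 3$.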

Remarks: 
- Note that $J$ is a more general criterion that might differ from $Q$ in the definition above (even though it seems reasonable to assume both should be related). For example, $J$ could be defined as the value of a starting state (or a distribution of starting states) in episodic cases, or as the undiscounted reward over a certain horizon, or as the average reward.
- Why is it interesting to look at policy gradient methods? Because for continuous actions there is no maximization step ($\max_a Q(s,a)$) during evaluation but only a call to $\pi_\theta(s)$ (or a draw from $\pi_\theta(a|s)$). This makes Policy Gradient a method of choice for continuous action domains (especially common in robotics).
- When do policy gradient approaches outperform value-based ones? It's hard to give a precise criterion; it really depends on the problem. One thing that comes into play is how easy it is to approximate the optimal policy or the optimal value function. If one is simpler than the other (by "simpler", we mean "it is easier to find a parameterization whose spanned function space almost includes the function to approximate"), then it is a good heuristic to try to approximate it. But this criterion might itself be hard to assess.


# Notations

- We consider probability density functions $p(X)$ for all random variables $X$.
- For a policy $\pi_\theta$ and a random variable $X$ we write indifferently $p(X|\pi_\theta) = p(X|\theta)$.
- A trajectory is noted $\tau = (s_t,a_t)_{t\in [0,\infty]}$.
- The state random variable at step $t$ is $S_t$ and its law's density is $p_t(s)$.
- The action random variable at step $t$ is $A_t$.

# <a id="objective"></a>Rewriting the policy optimization's objective function

Although it is not strictly necessary for the following sections, let us play a bit with the policy optimization's objective function.

We defined the policy optimization's objective as:  
$$J(\pi) = \mathbb{E}_{s \sim p_0} \left[ V^{\pi} (s) \right].$$
Or equivalently:  
$$J(\pi) = \mathbb{E}_{(s_i,a_i)_{i \in [0,\infty]}} \left[ \sum_{t=0}^\infty \gamma^t r(s_t,a_t)  | \pi \right].$$
We can switch the sum and the expectation and get:  
$$J(\pi) = \sum_{t=0}^\infty \gamma^t \mathbb{E}_{(s_i,a_i)_{i \in [0,\infty]}} \left[ r(s_t,a_t)  | \pi \right]$$
But $\mathbb{E}_{(s_i,a_i)_{i \in [0,\infty]}} \left[ r(s_t,a_t)  | \pi \right] = \mathbb{E}_{s_t,a_t} \left[ r(s_t,a_t)  | \pi \right]$. So:
$$J(\pi) = \sum_{t=0}^\infty \gamma^t \mathbb{E}_{s_t,a_t} \left[ r(s_t,a_t)  | \pi \right].$$
Now let's introduce the density of $(s_t,a_t)$:
$$J(\pi) = \sum_{t=0}^\infty \gamma^t \int_S \int_A r(s_t,a_t) p(s_t,a_t|\pi) ds_t da_t.$$
But $p(s_t,a_t|\pi) = p(s_t|\pi) p(a_t|s_t,\pi)$. By definition, $p(s_t|\pi) = p_t(s|\pi)$ and $p(a_t=a|s_t=s,\pi) = \pi(a|s)$. So:
$$J(\pi) = \sum_{t=0}^\infty \gamma^t \int_S \int_A r(s,a) p_t(s|\pi) \pi(a|s) ds da.$$
Let us isolate the terms that concern only states:
$$J(\pi) = \int_S \left[ \int_A r(s,a) \pi(a|s) da \right] \sum_{t=0}^\infty \gamma^t p_t(s|\pi) ds.$$
Let's note $\rho^\pi(s) = \sum_{t=0}^\infty \gamma^t p_t(s|\pi)$. We will call this quantity the density of the *improper state distribution under policy $\pi$*. Then we have:
$$J(\pi) = \int_S \left[ \int_A r(s,a) \pi (a|s) da \right] \rho^\pi(s) ds.$$
And so finally:
$$J(\theta) = \mathbb{E}_{\substack{s\sim\rho^\pi \\ a\sim \pi}} \left[ r(s,a) \right].$$

In plain words, the value of a policy $\pi$ is the average value of the rewards when states are sampled according to $\rho^\pi$ and actions are sampled according to $\pi$.
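We can check this identity numerically on a small example. The sketch below uses a hypothetical 2-state, 2-action MDP (the numbers in `P`, `R`, `PI` and `P0` are arbitrary assumptions) and compares $\mathbb{E}_{s\sim p_0}[V^\pi(s)]$ with $\sum_s \rho^\pi(s) \sum_a \pi(a|s) r(s,a)$.

```python
GAMMA = 0.9
# Hypothetical MDP: P[s][a][s2] = p(s2|s,a) and R[s][a] = r(s,a).
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 0.0],
     [0.0, 2.0]]
PI = [[0.6, 0.4], [0.3, 0.7]]   # PI[s][a] = pi(a|s)
P0 = [0.5, 0.5]                 # initial state distribution

def value_function(n_iter=2000):
    """Iterative policy evaluation: V <- T^pi V."""
    V = [0.0, 0.0]
    for _ in range(n_iter):
        V = [sum(PI[s][a] * (R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in range(2)))
                 for a in range(2)) for s in range(2)]
    return V

def improper_state_distribution(n_steps=2000):
    """rho(s) = sum_t gamma^t p_t(s), with p_t propagated forward from p_0."""
    p_t = list(P0)
    rho = [0.0, 0.0]
    discount = 1.0
    for _ in range(n_steps):
        rho = [rho[s] + discount * p_t[s] for s in range(2)]
        p_t = [sum(p_t[s] * PI[s][a] * P[s][a][s2] for s in range(2) for a in range(2))
               for s2 in range(2)]
        discount *= GAMMA
    return rho

V = value_function()
rho = improper_state_distribution()
J_from_V = sum(P0[s] * V[s] for s in range(2))
J_from_rho = sum(rho[s] * PI[s][a] * R[s][a] for s in range(2) for a in range(2))
```

Both expressions agree up to numerical precision. Note that `rho` sums to $1/(1-\gamma)$ rather than to one, which is why $\rho^\pi$ is called improper.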

# <a id="theorem"></a>The Policy Gradient theorem

The crucial problem of computing $\nabla_\theta J(\theta)$ lies in the fact that when $\theta$ changes, both $\pi$ and $\rho^\pi$ change in turn. So there seems to be no straightforward way of evaluating this gradient. One could fall back on a *finite differences* approach to estimating this gradient, but this would require trying out a series of increments $\Delta \theta$ which quickly becomes impractical (because the increment size is hard to tune, especially in stochastic systems, and also because of the sample inefficiency of the approach).

Remark:
- Let's not discard finite difference methods too quickly. They have their merits and showed great successes through methods such as PEGASUS (Ng and Jordan, 2000).

The key result of this class is that one can express the gradient of $J(\theta)$ as directly proportional to the value of $Q^\pi$ and the gradient of $\pi$:
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{\substack{s\sim\rho^\pi \\ a\sim \pi}} \left[ Q^\pi(s,a) \nabla_\theta \log\pi(a|s)\right]$$

The proof of this result is simple but a bit tedious. We can however give the general intuition. Let's consider trajectories $\tau = (s_0,a_0,r_0,...)$ drawn according to $\pi$ from the starting state. Each of these trajectories has an overall payoff of $G(\tau) = \sum_t \gamma^t r_t$ and is drawn with probability density $p(\tau|\theta)$. Then the objective function can be written:
\begin{align}
J(\theta) &= \mathbb{E}_\tau \left[ G(\tau) | \theta \right]\\
 &= \int G(\tau) p(\tau | \theta) d\tau
\end{align}

So the objective function's gradient is:
\begin{align}
\nabla_\theta J(\theta) &= \nabla_\theta \int G(\tau) p(\tau|\theta) d\tau,\\
 &= \int G(\tau) \nabla_\theta p(\tau|\theta) d\tau,\\
 &= \int G(\tau) p(\tau|\theta) \frac{\nabla_\theta p(\tau|\theta)}{p(\tau|\theta)} d\tau,\\
 &= \mathbb{E}_\tau \left[ G(\tau) \nabla_\theta \log p(\tau|\theta) \right].
\end{align}
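The last line can be tested on a one-dimensional toy problem. In the sketch below, a "trajectory" is a single draw $x \sim \mathcal{N}(\theta, 1)$ and its payoff is $G(x) = x^2$. Both choices are assumptions made for illustration. Since $\mathbb{E}[x^2] = \theta^2 + 1$, the exact gradient is $2\theta$, and $\nabla_\theta \log p(x|\theta) = x - \theta$.

```python
import random

def score_function_gradient(theta, n_samples=100_000, seed=0):
    """Monte Carlo estimate of E[G(x) * d/dtheta log p(x|theta)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(theta, 1.0)   # sample drawn from p(.|theta)
        payoff = x * x              # G(x)
        score = x - theta           # d/dtheta log N(x; theta, 1)
        total += payoff * score
    return total / n_samples

theta = 1.5
estimate = score_function_gradient(theta)
exact = 2.0 * theta   # gradient of theta^2 + 1
```

The estimate only requires samples and the gradient of the log-density. The payoff function itself is never differentiated.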

Let us study a little the $\nabla_\theta \log p(\tau|\theta)$ term along a series of remarks.

Remark 1: law of $s_{t+1},a_{t+1}$ given the policy and history.  
One has $p(s_{t+1},a_{t+1} | (s_i,a_i)_{i \in [0,t]}, \theta) = p(s_{t+1} | (s_i,a_i)_{i \in [0,t]}, \theta) p(a_{t+1} | s_{t+1}, (s_i,a_i)_{i \in [0,t]}, \theta)$.  
But the transition model is Markovian, so $p(s_{t+1} | (s_i,a_i)_{i \in [0,t]}, \theta) = p(s_{t+1} | s_t, a_t)$.  
And the law of $a_{t+1}$ is given by the policy, so $p(a_{t+1} | s_{t+1}, (s_i,a_i)_{i \in [0,t]}, \theta) = \pi_\theta(a_{t+1}|s_{t+1})$.  
Consequently:
$$p(s_{t+1},a_{t+1} | (s_i,a_i)_{i \in [0,t]}, \theta) = p(s_{t+1} | s_t, a_t) \pi_\theta(a_{t+1}|s_{t+1}).$$

Remark 2: probability density of a trajectory.  
Recall that $p(\tau|\theta) = p((s_t,a_t)_{t\in [0,\infty]}|\theta)$.  
This joint probability can be decomposed into conditional probabilities:  
$p(\tau|\theta) = p(s_0,a_0|\theta) \prod_{t=0}^\infty p(s_{t+1},a_{t+1} | (s_i,a_i)_{i \in [0,t]}, \theta)$.
The previous remark allows us to simplify this to:
$p(\tau|\theta) = p(s_0,a_0|\theta) \prod_{t=0}^\infty p(s_{t+1} | s_t, a_t) \pi_\theta(a_{t+1}|s_{t+1})$.
By expanding the first term into $p(s_0)\pi_\theta(a_0|s_0)$ and reordering the terms inside the product, we obtain:
$$p(\tau|\theta) = p(s_0) \prod_{t=0}^\infty p(s_{t+1} | s_t, a_t) \pi_\theta(a_t|s_t).$$

Remark 3: the grad-log-prob trick.  
Now let us consider the full $\nabla_\theta \log p(\tau|\theta)$ term. The previous remark tells us that  
$$\nabla_\theta \log p(\tau|\theta) = \nabla_\theta \log p(s_0) + \sum_{t=0}^\infty \left[ \nabla_\theta \log p(s_{t+1} | s_t, a_t) + \nabla_\theta \log \pi_\theta(a_t|s_t)\right].$$
But the initial state distribution and the transition model do not depend on $\theta$, so this expression boils down to:
$$\nabla_\theta \log p(\tau|\theta) = \sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t).$$
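As a sanity check, the sketch below compares this sum of scores with a finite-difference gradient of the full $\log p(\tau|\theta)$ (including the initial state and transition terms) on a short, finite trajectory. The transition model, the tabular softmax policy and the trajectory are hypothetical choices.

```python
import math

P0 = [0.5, 0.5]                  # p(s_0)
P = [[[0.8, 0.2], [0.1, 0.9]],   # P[s][a][s2] = p(s2|s,a)
     [[0.5, 0.5], [0.3, 0.7]]]

def pi(theta, s):
    """Tabular softmax policy pi_theta(.|s)."""
    exps = [math.exp(x) for x in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def log_prob_trajectory(theta, states, actions):
    """log p(tau|theta) for a finite trajectory s_0, a_0, ..., s_T."""
    logp = math.log(P0[states[0]])
    for t, a in enumerate(actions):
        s, s_next = states[t], states[t + 1]
        logp += math.log(P[s][a][s_next]) + math.log(pi(theta, s)[a])
    return logp

def sum_of_scores(theta, states, actions):
    """Sum over t of grad_theta log pi_theta(a_t|s_t)."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for t, a in enumerate(actions):
        s = states[t]
        probs = pi(theta, s)
        for b in range(2):
            grad[s][b] += (1.0 if b == a else 0.0) - probs[b]
    return grad

def finite_difference(theta, states, actions, eps=1e-6):
    """Central finite differences on log p(tau|theta)."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in range(2):
        for b in range(2):
            plus = [row[:] for row in theta]
            minus = [row[:] for row in theta]
            plus[s][b] += eps
            minus[s][b] -= eps
            grad[s][b] = (log_prob_trajectory(plus, states, actions)
                          - log_prob_trajectory(minus, states, actions)) / (2 * eps)
    return grad

theta = [[0.3, -0.2], [0.0, 0.7]]
states, actions = [0, 1, 1, 0], [1, 0, 0]
analytic = sum_of_scores(theta, states, actions)
numeric = finite_difference(theta, states, actions)
max_gap = max(abs(analytic[s][b] - numeric[s][b]) for s in range(2) for b in range(2))
```

The transition model appears in `log_prob_trajectory` but not in `sum_of_scores`, which is the practical benefit of the trick: the gradient can be computed without knowing the dynamics.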

And we will admit the step which leads to:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\substack{s\sim\rho^\pi \\ a\sim \pi}} \left[ G(\tau) \nabla_\theta \log \pi_\theta(a|s) \right].$$

And finally, since $Q^\pi(s,a) = \mathbb{E} [G(\tau) | S_0=s, A_0=a, \theta]$, we obtain that:
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{\substack{s\sim\rho^\pi \\ a\sim \pi}} \left[ Q^\pi(s,a) \nabla_\theta \log\pi(a|s)\right].$$
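The theorem can be verified numerically when everything is known. The sketch below uses a hypothetical 2-state, 2-action MDP and a tabular softmax policy, computes $V^\pi$, $Q^\pi$ and $\rho^\pi$ by fixed-point iteration, and compares the right-hand side of the theorem with a finite-difference gradient of $J(\theta)$. With $\rho^\pi$ defined as the unnormalized discounted sum $\sum_t \gamma^t p_t(s|\pi)$, the proportionality is an equality in this setting.

```python
import math

GAMMA = 0.9
P = [[[0.8, 0.2], [0.1, 0.9]],   # hypothetical P[s][a][s2]
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 0.0], [0.0, 2.0]]     # hypothetical r(s,a)
P0 = [0.5, 0.5]
S, A = range(2), range(2)

def pi(theta, s):
    exps = [math.exp(x) for x in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def evaluate(theta, n_iter=1000):
    """Return (V, Q, rho, J) for pi_theta, computed by fixed-point iteration."""
    probs = [pi(theta, s) for s in S]
    V = [0.0, 0.0]
    for _ in range(n_iter):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * V[s2] for s2 in S) for a in A] for s in S]
        V = [sum(probs[s][a] * Q[s][a] for a in A) for s in S]
    rho, p_t, discount = [0.0, 0.0], list(P0), 1.0
    for _ in range(n_iter):
        rho = [rho[s] + discount * p_t[s] for s in S]
        p_t = [sum(p_t[s] * probs[s][a] * P[s][a][s2] for s in S for a in A) for s2 in S]
        discount *= GAMMA
    J = sum(P0[s] * V[s] for s in S)
    return V, Q, rho, J

def policy_gradient_theorem(theta):
    """sum_s rho(s) sum_a pi(a|s) Q(s,a) grad log pi(a|s)."""
    _, Q, rho, _ = evaluate(theta)
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in S:
        probs = pi(theta, s)
        for a in A:
            for b in A:
                score_b = (1.0 if a == b else 0.0) - probs[b]
                grad[s][b] += rho[s] * probs[a] * Q[s][a] * score_b
    return grad

def finite_difference(theta, eps=1e-5):
    """Central finite differences on J(theta)."""
    grad = [[0.0, 0.0], [0.0, 0.0]]
    for s in S:
        for b in A:
            plus = [row[:] for row in theta]
            minus = [row[:] for row in theta]
            plus[s][b] += eps
            minus[s][b] -= eps
            grad[s][b] = (evaluate(plus)[3] - evaluate(minus)[3]) / (2 * eps)
    return grad

theta = [[0.2, -0.3], [0.1, 0.4]]
g_theorem = policy_gradient_theorem(theta)
g_fd = finite_difference(theta)
max_gap = max(abs(g_theorem[s][b] - g_fd[s][b]) for s in S for b in A)
```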

# <a id="reinforce"></a>REINFORCE: Monte Carlo Policy Gradient

Let's directly apply the Policy Gradient theorem. To compute the gradient, we can run the policy within the environment; this will provide us with states distributed according to $\rho^\pi$ and actions distributed according to $\pi$. The full trajectory of states, actions and rewards provides a Monte Carlo estimate $G_t$ of $Q^\pi(s_t,a_t)$ from any state $s_t$ traversed by the trajectory. In turn, this allows us to compute $G_t \nabla_\theta \log \pi(a_t|s_t)$, an estimate of $Q^\pi(s_t,a_t) \nabla_\theta \log \pi(a_t|s_t)$, for any of these states. The sum over all states provides the gradient estimate.

This algorithm, introduced by Williams (1992), is called REINFORCE. It requires a finite-length trajectory and its pseudo-code goes as follows.
1. Initialize policy parameter $\theta$
2. Generate a trajectory by playing $\pi$: $s_0,a_0,r_0,...s_{T}$
3. For $t\in [0, 1, \dots, T-1]$:
    1. Estimate the return $G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r_k$
    1. Update policy parameter: $\theta \leftarrow \theta + \alpha \gamma^t G_t \nabla_\theta \log \pi(a_t|s_t)$
    
When one takes a look at the $G_t \nabla_\theta \log \pi(a_t|s_t) = G_t \frac{\nabla_\theta \pi(a_t|s_t)}{\pi(a_t|s_t)}$ product, its interpretation is intuitive. $\nabla_\theta \pi(a_t|s_t)$ is a vector in parameter space that points in the direction of greatest increase of $\pi(a_t|s_t)$. The update encourages taking a step in this direction if the action provided a high return (through $G_t$), but discourages moving in this direction if the action is already picked frequently (through the division by $\pi(a_t|s_t)$), so that other actions also have a chance.
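The REINFORCE procedure can be sketched in a few lines on a toy problem. The environment below is a hypothetical three-cell corridor (start in cell 0, reward 1 when reaching the goal on the right, episodes truncated after `MAX_STEPS`), and the policy is a tabular softmax. The environment, step size and number of episodes are assumptions for illustration.

```python
import math
import random

N_STATES, GOAL, MAX_STEPS = 3, 3, 50   # states 0..2 are non-terminal, 3 is the goal
GAMMA, ALPHA = 0.9, 0.2
LEFT, RIGHT = 0, 1

def pi(theta, s):
    m = max(theta[s])
    exps = [math.exp(x - m) for x in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def step(s, a):
    """Toy corridor dynamics: reward 1 when the goal is reached, 0 otherwise."""
    s_next = max(0, s - 1) if a == LEFT else s + 1
    return s_next, (1.0 if s_next == GOAL else 0.0)

def generate_episode(theta, rng):
    s, states, actions, rewards = 0, [], [], []
    while s != GOAL and len(actions) < MAX_STEPS:
        a = rng.choices([LEFT, RIGHT], weights=pi(theta, s))[0]
        s_next, r = step(s, a)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
    return states, actions, rewards

def reinforce(n_episodes=5000, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(n_episodes):
        states, actions, rewards = generate_episode(theta, rng)
        # Returns computed backwards: G_t = r_t + gamma * G_{t+1}.
        G, returns = 0.0, [0.0] * len(rewards)
        for t in reversed(range(len(rewards))):
            G = rewards[t] + GAMMA * G
            returns[t] = G
        # theta <- theta + alpha * gamma^t * G_t * grad log pi(a_t|s_t)
        for t, (s, a) in enumerate(zip(states, actions)):
            probs = pi(theta, s)
            for b in (LEFT, RIGHT):
                score = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += ALPHA * (GAMMA ** t) * returns[t] * score
    return theta

theta = reinforce()
p_right = [pi(theta, s)[RIGHT] for s in range(N_STATES)]
```

After training, `p_right` holds the probability of moving towards the goal in each cell. It should be well above one half everywhere, since moving right is the shortest way to the reward.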

The gradient estimate of the policy gradient often has a high variance. A common practice consists in subtracting an action-independent *baseline* $b(s)$ from the estimate of $Q^\pi(s,a)$, yielding:
$$\nabla_\theta J(\theta) \propto \mathbb{E}_{\substack{s\sim\rho^\pi \\ a\sim \pi}} \left[ \left( Q^\pi(s,a) -b(s) \right) \nabla_\theta \log\pi(a|s)\right]$$

It is rather easy to see that this baseline does not bias the gradient estimate, since the term it adds has zero expected value: $\mathbb{E}_{a\sim\pi} \left[ b(s) \nabla_\theta \log\pi(a|s) \right] = b(s) \int_A \nabla_\theta \pi(a|s) da = b(s) \nabla_\theta 1 = 0$. However, it can strongly decrease the estimate's variance. One common choice for such a baseline is the policy's value function $V^\pi$. This turns REINFORCE into an *advantage* estimation algorithm where the advantage is the function defined as:
$$A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)$$
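The effect of the baseline can be computed exactly in a single state with a handful of actions. The sketch below enumerates the actions of a softmax policy and returns the mean and the total variance of $(Q(s,a) - b) \nabla_\theta \log \pi(a|s)$, with and without the baseline $b = V^\pi(s)$. The $Q$ values are made up for the example.

```python
import math

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def gradient_stats(q_values, prefs, baseline):
    """Exact mean and total variance of (Q(a) - b) * grad log pi(a), for a ~ pi."""
    probs = softmax(prefs)
    n = len(probs)
    samples = []  # one gradient vector per possible action
    for a in range(n):
        score = [(1.0 if b == a else 0.0) - probs[b] for b in range(n)]
        samples.append([(q_values[a] - baseline) * g for g in score])
    mean = [sum(probs[a] * samples[a][b] for a in range(n)) for b in range(n)]
    var = sum(probs[a] * sum((samples[a][b] - mean[b]) ** 2 for b in range(n))
              for a in range(n))
    return mean, var

q_values = [10.0, 11.0, 12.0]   # hypothetical Q(s, a) in a single state
prefs = [0.0, 0.0, 0.0]         # uniform softmax policy
probs = softmax(prefs)
v = sum(p * q for p, q in zip(probs, q_values))   # V(s) = E_a[Q(s, a)]
mean_no_baseline, var_no_baseline = gradient_stats(q_values, prefs, baseline=0.0)
mean_with_baseline, var_with_baseline = gradient_stats(q_values, prefs, baseline=v)
```

Both means coincide, while the variance drops sharply in this example because the large common offset of the $Q$ values no longer multiplies the score.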

But estimating this advantage function requires estimating $V^\pi$, which in turn requires maintaining a (possibly parameterized) function $V$ on top of the current (parameterized) policy $\pi$.

We will leave this topic aside for now and will directly look at the generalization of the Policy Gradient theorem to Actor-Critic methods.

# <a id="ac"></a>Actor-Critic methods

Suppose now that we don't want a Monte Carlo estimate of $Q^\pi(s,a)$ in the Policy Gradient theorem, and are rather willing to store a function approximator for $Q^\pi(s,a)$. This leads us to store both a policy and a value function. The value function *criticizes* the policy's selected actions, hence the names of *critic* and *actor*.

Note that the temporal difference at each time step $\delta = r + \gamma V^\pi(s') - V^\pi(s)$ is an estimate of the advantage $A^\pi(s,a)$. Using this remark, a simple one-step Actor-Critic method based on TD(0) and a value function $V_w$ goes as follows:
1. In $s$, draw $a \sim \pi$
2. Observe $r, s'$
3. Compute $\delta = r + \gamma V_w(s') - V_w(s)$
4. Update critic's parameters (TD(0) step) $w \leftarrow w + \alpha \delta \nabla_w V_w(s)$
5. Update actor's parameters (policy gradient theorem) $\theta \leftarrow \theta + \alpha \delta \nabla_\theta \log \pi(a|s)$
6. $s\leftarrow s'$ and repeat
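The six steps above can be sketched on a toy problem. The code below assumes a hypothetical three-cell corridor (reward 1 at the goal on the right), a tabular critic $V_w$ and a tabular softmax actor. It uses two separate step sizes, `ALPHA_CRITIC` and `ALPHA_ACTOR`, which is an illustrative choice; the pseudo-code above writes both as $\alpha$.

```python
import math
import random

N_STATES, GOAL, MAX_STEPS = 3, 3, 50   # states 0..2 are non-terminal, 3 is the goal
GAMMA, ALPHA_CRITIC, ALPHA_ACTOR = 0.9, 0.1, 0.1
LEFT, RIGHT = 0, 1

def pi(theta, s):
    m = max(theta[s])
    exps = [math.exp(x - m) for x in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def step(s, a):
    """Toy corridor dynamics: reward 1 when the goal is reached, 0 otherwise."""
    s_next = max(0, s - 1) if a == LEFT else s + 1
    return s_next, (1.0 if s_next == GOAL else 0.0)

def actor_critic(n_episodes=3000, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES)]   # actor: softmax preferences
    w = [0.0] * (N_STATES + 1)                      # critic: tabular V_w, w[GOAL] stays 0
    for _ in range(n_episodes):
        s, n_steps = 0, 0
        while s != GOAL and n_steps < MAX_STEPS:
            probs = pi(theta, s)
            a = rng.choices([LEFT, RIGHT], weights=probs)[0]   # 1. draw a ~ pi
            s_next, r = step(s, a)                             # 2. observe r, s'
            delta = r + GAMMA * w[s_next] - w[s]               # 3. TD error
            w[s] += ALPHA_CRITIC * delta                       # 4. critic step (tabular gradient is 1 at s)
            for b in (LEFT, RIGHT):                            # 5. actor step
                score = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += ALPHA_ACTOR * delta * score
            s = s_next                                         # 6. move on
            n_steps += 1
    return theta, w

theta, w = actor_critic()
p_right = [pi(theta, s)[RIGHT] for s in range(N_STATES)]
```

Unlike REINFORCE, the updates happen at every transition and do not wait for the end of the episode.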

# <a id="pi"></a>The link with Policy Iteration

Let's take a step back and reconsider the Actor-Critic architecture we have just introduced.

Basically, on the one hand, we have a critic $V_w$ or $Q_w$ that aims at estimating the $V^\pi$ or $Q^\pi$ value function of policy $\pi$. And on the other hand, we have an actor whose policy $\pi_\theta$ is incrementally improved so as to maximize $J(\theta)$. This should sound familiar.

In part, this is familiar because it closely resembles the SARSA update. Let's take a minute to spot the differences.

But more generally, this actually belongs to the class of approximate Policy Iteration algorithms. Let's recall the Policy Iteration procedure:
1. Solve $Q=T^\pi Q$
2. Solve $\pi = Greedy(Q)$
3. Repeat
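For reference, here is a minimal sketch of this exact procedure on a hypothetical 2-state, 2-action MDP (the numbers in `P` and `R` are arbitrary assumptions). The evaluation step solves $Q = T^\pi Q$ by fixed-point iteration rather than by a linear solve, to keep the example dependency-free.

```python
GAMMA = 0.9
P = [[[0.8, 0.2], [0.1, 0.9]],   # hypothetical P[s][a][s2]
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 0.0], [0.0, 2.0]]     # hypothetical r(s,a)

def evaluate_q(policy, n_iter=1000):
    """Solve Q = T^pi Q for a deterministic policy given as a list of actions."""
    Q = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(n_iter):
        Q = [[R[s][a] + GAMMA * sum(P[s][a][s2] * Q[s2][policy[s2]] for s2 in range(2))
              for a in range(2)] for s in range(2)]
    return Q

def greedy(Q):
    """Greedy policy with respect to Q."""
    return [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]

def policy_iteration():
    policy = [0, 0]
    while True:
        Q = evaluate_q(policy)        # 1. policy evaluation
        new_policy = greedy(Q)        # 2. policy improvement
        if new_policy == policy:      # stop when the policy is stable
            return policy, Q
        policy = new_policy           # 3. repeat

optimal_policy, optimal_Q = policy_iteration()
```

On this MDP the procedure stops after three evaluations and returns the policy that picks action 1 in both states.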

And let's now allow for an approximate resolution of these steps, via gradient descent:
1. Approximately solve $\min \|Q - T^\pi Q \|$ via gradient descent
2. Approximately solve $\pi = Greedy(Q)$ using the policy gradient theorem
3. Repeat

After each collected sample, the update of the Actor-Critic algorithm above performs exactly one gradient step on the critic and one gradient step for the actor.

This perspective allows us to define a much broader family of Actor-Critic methods that perform various numbers of gradient steps on the critic or the actor, use n-step returns, introduce a sequence of $Q_{i+1} = T^\pi Q_i$ functions, soften the policy gradient steps, etc. in order to make the overall Actor-Critic algorithm more efficient and robust.

# <a id="deep"></a>Deep Actor-Critic algorithms and references

At this stage, it is really tempting to throw a neural network at our critic and our actor, start collecting samples in a replay buffer, and try to design Deep Policy Gradient methods. Let's review some of the key ones from the literature.

- [Asynchronous Advantage Actor-Critic (A3C) (2016)](https://arxiv.org/abs/1602.01783). Builds a single network that approximates both $V$ and $\pi$, and replaces the replay buffer with an army of asynchronous actors that provide independent samples for the gradient computations. It is the direct adaptation of the Actor-Critic algorithm above. Its little brother A2C discards the asynchronous aspect, while keeping the good overall performance.
- [Trust Region Policy Optimization (TRPO) (2015)](https://arxiv.org/abs/1502.05477). Imposes small policy gradient steps by introducing a "maximum KL divergence between successive policies" constraint in the actor's update.
- [Proximal Policy Optimization (PPO) (2017)](https://arxiv.org/abs/1707.06347). Same philosophy as TRPO but simpler and more efficient. Instead of a KL divergence constraint, it clips the probability ratio between the new and the old policy in the surrogate objective, which removes the incentive for large policy changes.
- [Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) (2017)](https://arxiv.org/abs/1708.05144). Uses Kronecker factorization to make TRPO's update more efficient.
- [Sample Efficient Actor-Critic with Experience Replay (ACER) (2017)](https://arxiv.org/abs/1611.01224). Several improvements upon TRPO and A2C.
- [Soft Actor Critic Algorithms (SAC) (2019)](https://arxiv.org/abs/1812.05905). Introduces an entropy regularization term in the objective function.
- [Modified Actor Critic algorithms (MoPPO) (2019)](https://arxiv.org/abs/1907.01298). Casts the critic update in a modified policy iteration scheme by building the sequence of $Q_{i+1} = T^\pi Q_i$ functions, applies this to PPO.

One thing that was not covered in this class is the [Deterministic Policy Gradient theorem](http://proceedings.mlr.press/v32/silver14.html), which allows performing policy gradient steps on deterministic policies (with many benefits). This family of algorithms spawned its own deep counterparts. Notably:
- [Deep Deterministic Policy Gradients (DDPG) (2015)](https://arxiv.org/abs/1509.02971). Implements the DPG theorem on deep neural networks, with a replay buffer.
- [Twin Delayed Deep Deterministic Policy Gradients (TD3) (2018)](https://arxiv.org/abs/1802.09477). Introduces three improvements over DDPG, namely clipped double Q-learning with two critic networks, delayed policy updates, and target policy smoothing.
- [Distributed Distributional Deep Deterministic Policy Gradients (D4PG) (2018)](https://arxiv.org/abs/1804.08617). Improves on DDPG with parallel actors, a distributional value function estimator, n-step returns, and prioritized experience replay.

One can remark that the (stochastic) Policy Gradient update is an *on-policy* update: it requires the samples to have been drawn from the current policy's stationary distribution. This was generalized to off-policy updates in the [Off-Policy Actor Critic](https://arxiv.org/abs/1205.4839) paper (2012) and has been used in most of the algorithms above.

Lilian Weng keeps a nice overview (and zoo) of Actor-Critic methods [on her blog](https://lilianweng.github.io/lil-log/2018/04/08/policy-gradient-algorithms.html).

The introduction to Policy Gradients from [OpenAI's Spinning Up](https://spinningup.openai.com/en/latest/spinningup/rl_intro3.html#) is also a great read after this class.

Finally, to grasp an overview of Policy search methods (even beyond the scope of Policy Gradients) a good read is the [Policy Search in Continuous Action Domains: an Overview](https://arxiv.org/abs/1803.04706) paper (2019).

# <a id="practice"></a>Practice

Implement an Advantage Actor-Critic algorithm using a single network representing both the policy and the value function. Test your implementation on your favorite Atari game, or on a new gym environment (like the [box2d ones](https://gym.openai.com/envs/#box2d)).
Draw inspiration from the [A3C](https://arxiv.org/abs/1602.01783) paper!