# Policy gradient methods

## Taxonomy of RL

<img src="http://drive.google.com/uc?export=view&id=1Gz0WBOtTxYrZ91uAidFE_Tuw0RBSfl9O" width=45%>

<img src="http://drive.google.com/uc?export=view&id=1A2z6bd0GePxIbR6f7rz2k0HfSQ0e2MTw" width=45%>

## Value-based RL recap

* Value-based RL revolves around the value functions: the state-value function $V$ and the action-value function $Q$.
* These functions can be approximated by neural networks as well.
* The updates of the value functions (or of their weights) are guided by the Bellman equations.
* The action-value based algorithms with bootstrapping were model-free.

(For further details see previous lectures.)

### Disadvantages of value-based RL

As we have already seen, value-based RL derives the policy from a value-function:

With Q-function:

$$\pi(s) = \arg \max_a{ Q(s, a) }$$

When the V-function is given (this additionally requires the transition model $T$):

$$\pi(s) = \arg\max_a \sum_{s'} T(s, a, s') \cdot \left[ r(s, a) + \gamma V(s') \right]$$
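Both ways of deriving a greedy policy can be sketched in a few lines of NumPy. This is only an illustration: the array layout (`Q[s, a]`, `T[s, a, s2]`, `r[s, a]`), the function names and the tiny two-state MDP are assumptions made for the example, not something defined in the lecture. The V-based variant takes the expectation over next states before the $\arg\max$.

```python
import numpy as np

def greedy_from_q(Q):
    """Q has shape (n_states, n_actions); returns one greedy action per state."""
    return np.argmax(Q, axis=1)

def greedy_from_v(V, T, r, gamma=0.9):
    """T[s, a, s2] is a transition probability, r[s, a] the immediate reward."""
    # Expected one-step lookahead: r(s, a) + gamma * sum_s2 T(s, a, s2) * V(s2).
    # This equals sum_s2 T * [r + gamma * V] because each T[s, a, :] sums to one.
    lookahead = r + gamma * np.einsum("ijk,k->ij", T, V)
    return np.argmax(lookahead, axis=1)

# Hypothetical 2-state, 2-action MDP: action 0 leads to state 0, action 1 to state 1.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [0.0, 2.0]])
V = np.array([0.0, 10.0])

policy_from_v = greedy_from_v(V, T, r)
policy_from_q = greedy_from_q(np.array([[1.0, 2.0], [3.0, 0.0]]))
```

Note that the V-based version needs `T` and `r`, i.e. a model of the environment, while the Q-based version does not.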

Due to the $\arg \max$ function, the resulting policy is deterministic. However, in the first session (day 3), we saw an example in which stochastic policies have an advantage over deterministic ones. Let's recall the example:

<img src="http://drive.google.com/uc?export=view&id=1b-EDUk5cFVpqtOvZ0o4begzKCYwO0dMS" width=65%>

In this example, the cells marked with quotation marks look the same to the agent, because the state is defined by the walls the agent can see around itself. A deterministic policy must choose the same action in both cells, so it cannot solve the problem as efficiently as a stochastic policy can.

One main disadvantage of value-based RL is that it cannot produce a stochastic policy directly. Furthermore, changing the value function can cause an unpredictable change in the derived policy.

Summarized:
* naturally gives a **deterministic policy**
* **small changes** in the value function can cause severe changes in the policy
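The second point can be made concrete with a small numerical illustration. The action values below are made up for the example, and the softmax policy is used only as one possible stochastic alternative for comparison; it is an assumption of this sketch, not part of the value-based methods above.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))  # shift for numerical stability
    return z / z.sum()

# Hypothetical action values for a single state, before and after a tiny update.
q_before = np.array([1.00, 0.99])
q_after = np.array([1.00, 1.01])  # the second value moved by only 0.02

# The greedy (arg max) policy flips completely from action 0 to action 1.
greedy_before = int(np.argmax(q_before))
greedy_after = int(np.argmax(q_after))

# A softmax policy over the same values barely moves.
soft_before = softmax(q_before)
soft_after = softmax(q_after)
max_prob_change = float(np.max(np.abs(soft_after - soft_before)))
```

Here the greedy action changes while each action probability under the softmax policy shifts by less than one percentage point.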

## Policy-gradient 

To overcome the limitations of value-based RL, policy-based methods parametrize the policy directly: $\pi(s, a, \theta)$.
The question is how to optimize the parameters $\theta$ of the policy.
As usual in RL and machine learning, the optimization steps are taken in the direction of the gradient of a loss function (or objective function). So the real question is: how can we define a good objective function?
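As a minimal sketch of what a directly parametrized policy can look like, the snippet below uses a tabular softmax policy. The choice of softmax, the table `theta[s, a]` of action preferences and the sizes are assumptions made for illustration; in practice a neural network often plays the role of the table.

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities pi(. | s) for tabular preferences theta[s, a]."""
    prefs = theta[s] - np.max(theta[s])  # shift for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

rng = np.random.default_rng(0)
theta = np.zeros((3, 2))   # hypothetical sizes: 3 states, 2 actions
theta[1] = [2.0, 0.0]      # in state 1 the policy prefers action 0

probs_uniform = softmax_policy(theta, 0)   # equal preferences -> uniform policy
probs_biased = softmax_policy(theta, 1)    # stochastic, but biased to action 0
action = rng.choice(len(probs_biased), p=probs_biased)  # sample an action
```

Because the probabilities are smooth functions of `theta`, a small change in the parameters causes only a small change in the policy, and the policy remains stochastic.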

### Derivation (likelihood trick)

The goal here is to derive an appropriate objective function and then find its gradient in order to update the parameters of the policy.

Let's start with the general formula of the value:

$$\rho(\pi_\theta) = E_\tau\left[ G(\tau) | \pi_\theta \right]$$

Because the expected value of the return (calculated over the different trajectories) depends on the policy, we can use this expectation as the objective function for the (parametrized) policy.

Now, we rewrite the expectation by using the definition of expectation:

$$\rho(\pi_\theta) = \sum_\tau p^{\pi_\theta}(\tau) \cdot G(\tau)$$

where $p^{\pi_\theta}(\tau)$ is the probability of the trajectory $\tau$ when it was generated by the policy $\pi_\theta$.

Our intention is to find the gradient of this objective with respect to $\theta$, in order to update the parameters of the policy along it:

$$\theta \leftarrow \theta + \alpha \cdot \frac{\partial \rho(\pi_\theta)}{\partial \theta}$$

The derivative term on the right-hand side is the **policy gradient**.

Take the derivative of $\rho$:

$$\frac{\partial \rho(\pi_\theta)}{\partial \theta} = \frac{\partial \sum_\tau p^{\pi_\theta}(\tau) \cdot G(\tau)}{\partial \theta} =  \sum_\tau \frac{\partial p^{\pi_\theta}(\tau)}{\partial \theta} \cdot G(\tau) + \underbrace{\sum_\tau p^{\pi_\theta}(\tau) \cdot \frac{\partial G(\tau)}{\partial \theta}}_{0}$$

In the last equation, the second term is zero because $G(\tau)$ is determined by the trajectory alone: once $\tau$ is fixed, the return does not depend on the policy and therefore not on $\theta$ either.

Then we can apply the likelihood trick:

$$\frac{\partial p^{\pi_\theta}(\tau)}{\partial \theta} = p^{\pi_\theta}(\tau) \frac{\partial \log p^{\pi_\theta}(\tau)}{\partial \theta}$$

Then the policy gradient:

$$\frac{\partial \rho(\pi_\theta)}{\partial \theta} = \sum_\tau p^{\pi_\theta}(\tau) \frac{\partial \log p^{\pi_\theta}(\tau)}{\partial \theta} \cdot G(\tau)$$

Which is the same as:

$$\frac{\partial \rho(\pi_\theta)}{\partial \theta} = E_\tau \left[ \left. \frac{\partial \log p^{\pi_\theta}(\tau)}{\partial \theta} \cdot G(\tau) \right| \pi_\theta \right] $$
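The identity can be checked numerically on a toy problem. In the sketch below, three abstract "trajectories" have probabilities given by a softmax over $\theta$ and fixed returns `G`; both the softmax form of $p^{\pi_\theta}(\tau)$ and the numbers are assumptions chosen only to make the check easy. The gradient computed through the log-probabilities is compared against a finite-difference gradient of $\rho$.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def objective(theta, G):
    """rho = sum_tau p(tau) * G(tau)."""
    return float(softmax(theta) @ G)

def score_form_gradient(theta, G):
    """sum_tau p(tau) * d log p(tau) / d theta * G(tau)."""
    p = softmax(theta)
    # For a softmax, d log p_i / d theta_j = 1[i == j] - p_j.
    grad_log_p = np.eye(len(theta)) - p  # row i is the gradient of log p_i
    return (p * G) @ grad_log_p

def finite_difference_gradient(theta, G, eps=1e-6):
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (objective(theta + step, G) - objective(theta - step, G)) / (2 * eps)
    return grad

theta = np.array([0.2, -0.5, 1.0])
G = np.array([1.0, 3.0, -2.0])  # hypothetical returns of the three trajectories

analytic = score_form_gradient(theta, G)
numeric = finite_difference_gradient(theta, G)
```

The two gradients should agree up to the finite-difference error, which is what makes the expectation form usable: it only needs samples of $\tau$ and the gradient of $\log p^{\pi_\theta}(\tau)$.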

So we have to take the derivative of the logarithm of the trajectory probability. Recall the figure about the trajectory:

<img src="http://drive.google.com/uc?export=view&id=1cjBsgkWXO-jlJRMjV7QDuqODs4-IRcCU" width=75%>

The probability of a trajectory:

$$p^{\pi_\theta}(\tau) = \mu(s_0) \cdot \pi_\theta(a_0|s_0) \cdot T(s_1|s_0, a_0) \cdot \pi_\theta(a_1|s_1) \cdot T(s_2|s_1, a_1) \dots T(s_T |s_{T-1}, a_{T-1})$$

After taking the logarithm:

$$\log p^{\pi_\theta}(\tau) = \log\mu(s_0) + \log\pi_\theta(a_0|s_0) + \log T(s_1|s_0, a_0) + \log\pi_\theta(a_1|s_1) + \log T(s_2|s_1, a_1) + \dots + \log T(s_T |s_{T-1}, a_{T-1})$$

Only the policy terms depend on the parameter $\theta$. Therefore all the other terms vanish when we take the derivative:

$$\frac{\partial \log p^{\pi_\theta}(\tau)}{\partial \theta} = \frac{\partial \log \pi_\theta(a_0|s_0)}{\partial \theta} + \frac{\partial \log\pi_\theta(a_1|s_1)}{\partial \theta} + \dots + \frac{\partial \log \pi_\theta(a_{T-1}|s_{T-1})}{\partial \theta}$$

We can write this more compactly:

$$\frac{\partial \log p^{\pi_\theta}(\tau)}{\partial \theta} = \sum_t\frac{\partial \log \pi_\theta(a_t|s_t)}{\partial \theta}$$

By combining the last formula with the policy gradient:

$$\frac{\partial \rho(\pi_\theta)}{\partial \theta} = E_\tau \left[ \left. \sum_t\frac{\partial \log \pi_\theta(a_t|s_t)}{\partial \theta} \cdot G(\tau) \right| \pi_\theta \right] $$

The expectation can be approximated by sampling trajectories and then calculating the value inside the brackets for each of them (a noisy gradient). Fortunately, there are other ways to calculate noisy gradients: we can use the policy-gradient theorems. They are similar to the equation above but easier to sample.
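The quantity inside the expectation can be computed for a single sampled trajectory as sketched below. The tabular softmax policy, the undiscounted return $G(\tau) = \sum_t r_t$ and the hand-written trajectory are assumptions made for the illustration; averaging this estimate over many sampled trajectories would approximate the expectation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def grad_log_pi(theta, s, a):
    """Gradient of log pi_theta(a | s) for a tabular softmax policy theta[s, a]."""
    grad = np.zeros_like(theta)
    grad[s] = -softmax(theta[s])  # only the row of the visited state is non-zero
    grad[s, a] += 1.0
    return grad

def trajectory_gradient(theta, states, actions, rewards):
    """Single-sample estimate: (sum_t grad log pi(a_t | s_t)) * G(tau)."""
    G = float(np.sum(rewards))  # undiscounted return of the whole trajectory
    score = sum(grad_log_pi(theta, s, a) for s, a in zip(states, actions))
    return score * G

theta = np.zeros((2, 2))  # hypothetical: 2 states, 2 actions, uniform policy
g = trajectory_gradient(theta, states=[0, 1], actions=[1, 0], rewards=[1.0, 2.0])
```

With a positive return, the estimate raises the preference of the actions that were taken and lowers the preference of the others in the visited states.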

### Policy-gradient theorems

The $\rho$ objective function has different formulations:

**Start-state formulation:** The start-state formulation uses the following function for measuring the performance of a policy:

$$\rho(\pi_\theta) = E_\tau\left[ \left. \sum_{t=0}^\infty{\gamma^t r_t} \right| s_0, \pi_\theta \right]$$

**Average-reward formulation:** The average-reward formulation uses the following function for measuring the performance of a policy:

$$\rho(\pi_\theta) = \lim_{n \rightarrow \infty} \frac{1}{n}E_\tau\left[ \left. \sum_{t=0}^n{r_t} \right| \pi_\theta \right]$$
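For a single sampled reward sequence, the two performance measures can be evaluated as in the rough sketch below. The function names and the reward sequence are hypothetical, and a finite sequence is only a stand-in for the infinite sum and the limit in the definitions.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """sum_t gamma^t * r_t for one finite reward sequence."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(discounts @ rewards)

def average_reward(rewards):
    """Finite-horizon stand-in for the limit of (1/n) * sum_t r_t."""
    return float(np.mean(rewards))

rewards = [1.0, 0.0, 2.0]                         # hypothetical sampled rewards
start_state_value = discounted_return(rewards, gamma=0.5)  # 1 + 0 + 2 * 0.25
avg_value = average_reward(rewards)
```

Averaging such values over many trajectories sampled with $\pi_\theta$ would give a Monte Carlo estimate of $\rho(\pi_\theta)$ in the respective formulation.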

**Theorem 1:** In both the start-state formulation and the average-reward formulation, the following formula holds for the gradient:

$$\frac{\partial \rho(\pi_\theta)}{\partial \theta} = \sum_s{d^{\pi_\theta}(s) \sum_a{\frac{\partial \pi_\theta(s, a)}{\partial \theta} Q^{\pi_\theta}(s, a)}}$$

By applying the likelihood trick as before, we get the following formula:

$$\frac{\partial \rho(\pi_\theta)}{\partial \theta} = E_\tau \left[ \left. \frac{\partial \log \pi_\theta(s, a)}{\partial \theta} \cdot Q^{\pi_\theta}(s, a) \right| \pi_\theta \right] $$

This formula is true even when the $Q$ function is approximated. (More precisely, there are some conditions for this but we do not consider them now.)

Therefore it is enough to sample trajectories by following the current policy and simultaneously approximate the action-value function in order to calculate the noisy gradient.

### REINFORCE

<img src="http://drive.google.com/uc?export=view&id=1Il8lf_096OKqwVpZ6ohUIdYsjr-dtGHk" width=75%>

Here the return is calculated with the Monte Carlo method, as we saw in the first section.
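A minimal REINFORCE sketch is given below. It is not a transcription of the figure: the chain environment, the tabular softmax policy, the per-step update with the return-to-go $G_t$ and all hyperparameters are assumptions chosen to keep the example small.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def run_episode(theta, rng, n_states=3, max_steps=20):
    """Hypothetical chain: action 1 moves right, action 0 moves left, reward -1 per step."""
    s = 0
    states, actions, rewards = [], [], []
    for _ in range(max_steps):
        a = rng.choice(2, p=softmax(theta[s]))
        states.append(s)
        actions.append(a)
        rewards.append(-1.0)
        s = s + 1 if a == 1 else max(s - 1, 0)
        if s == n_states:  # the state right of the chain is terminal
            break
    return states, actions, rewards

def reinforce(n_episodes=2000, alpha=0.05, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((3, 2))  # preferences theta[s, a], uniform policy at the start
    for _ in range(n_episodes):
        states, actions, rewards = run_episode(theta, rng)
        G = 0.0
        # Walk the episode backwards to accumulate the Monte Carlo returns G_t.
        for s, a, r in zip(reversed(states), reversed(actions), reversed(rewards)):
            G = r + gamma * G
            grad_log = -softmax(theta[s])  # gradient of log pi(a | s) w.r.t. theta[s]
            grad_log[a] += 1.0
            theta[s] += alpha * G * grad_log  # gradient ascent step
    return theta

theta = reinforce()
prob_right = np.array([softmax(theta[s])[1] for s in range(3)])
```

Since every step costs $-1$, shorter episodes have higher returns, so one would expect the learned policy to prefer moving right in every state. The estimates are noisy because each update relies on a single sampled return.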

## Natural policy gradient

### The problem

If we simply follow the policy gradient, the agent's performance is not guaranteed to improve monotonically (in expectation), and the change in the policy can still be dramatic. We want to ensure that the policy changes only slightly during each update.

### Distance between two distributions

One way to measure the distance between two distributions is the Kullback-Leibler divergence:

$$KL(p, q) = \sum_{x \in X}{p(x) \log \frac{p(x)}{q(x)}}$$

The properties of KL-divergence:

* always non-negative
* additive for independent distributions ($p=p_1p_2$ and $q=q_1q_2$, $p_1, p_2$ are independent and the same for $q_1, q_2$)
* convex

<img src="http://drive.google.com/uc?export=view&id=1ncd_iAOxHsu3Uqvs6syjjn7qjifz-oOs" width=55%>

The image shows the KL divergence between two Bernoulli distributions. Here $p=[h, 1-h]$, where $h$ goes from zero to one, and $q=[0.3, 0.7]$ is fixed. That is the reason for the minimum at $h = 0.3$. The function is convex and always non-negative.
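The curve in the image can be reproduced numerically with a short sketch. The grid of $h$ values and the helper name `kl_divergence` are choices made for this example.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p, q) = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

q = [0.3, 0.7]                       # fixed Bernoulli distribution
hs = np.linspace(0.01, 0.99, 99)     # grid with step 0.01
values = np.array([kl_divergence([h, 1.0 - h], q) for h in hs])
h_min = hs[np.argmin(values)]        # location of the smallest divergence
```

On this grid the smallest value is found at $h = 0.3$, where the two distributions coincide and the divergence is zero up to floating-point error.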

### The natural gradient

We have an objective function for the policy $\pi_\theta$:

$$\rho(\pi_\theta)$$

The goal of updating the parameters $\theta$ is to increase (maximize) this objective function. The natural gradient applies a further constraint besides the maximization:

$$\max_{\pi_{\theta'}} \rho(\pi_{\theta'})$$

subject to:

$$KL(\pi_{\theta'}, \pi_\theta) < \delta$$

$\delta$ is the maximum allowed change in the policy (here the policy is stochastic, so for each state it is a probability distribution over the actions).

Unfortunately, solving this constrained optimization problem is quite inefficient. Several tricks and approaches have been developed to relax the formulation, but the underlying intention remains the same.
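To illustrate the intention behind the constraint, the sketch below shrinks a gradient step until the new policy stays within the KL bound. This simple backtracking rule is an assumption made for illustration and is not the natural gradient itself; the single-state softmax policy and the gradient estimate are also made up.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

def kl(p, q):
    """KL(p, q) for strictly positive discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def constrained_step(theta, grad, delta, alpha=1.0, shrink=0.5, max_tries=30):
    """Shrink the step along grad until KL(new policy, old policy) < delta."""
    old = softmax(theta)
    for _ in range(max_tries):
        candidate = theta + alpha * grad
        if kl(softmax(candidate), old) < delta:
            return candidate
        alpha *= shrink
    return theta  # no acceptable step found: keep the old parameters

theta = np.array([0.0, 0.0, 0.0])      # single state, three actions
grad = np.array([5.0, -2.0, -3.0])     # hypothetical policy-gradient estimate
new_theta = constrained_step(theta, grad, delta=0.01)
divergence = kl(softmax(new_theta), softmax(theta))
```

The parameters still move in the direction of the gradient, but the step size is limited by how much the policy, as a distribution, is allowed to change.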

**Side notes:** 

* The constraint can be something else. It is also common to apply a constraint on the Frobenius norm of the weights.
* The policy is parametrized. Neural networks can represent the policy.

### Algorithms using this type of approach

In the next part we will talk about TRPO and PPO as well. These algorithms are based on the natural gradient approach.