<a href="https://colab.research.google.com/github/RLWH/reinforcement-learning-notebook/blob/master/6.%20Policy%20Gradient%20Methods/Policy_Gradient_Methods.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Policy Gradient Methods

In previous chapters, almost all the methods we have learnt have been *action-value methods*. They learned the values of actions and then selected actions based on their estimated action values. 

Instead, we can learn a *parameterized policy* that can select actions without consulting a value function. Why?

- Some optimal policies are not deterministic; the best policy can be stochastic.
- A parameterized policy can scale to large problems, because it relies on function approximation.

#### What is a Parameterized policy?
A parameterized policy is a policy defined by a parameter vector, denoted $\vec{\theta} \in \mathop{\mathbb{R}}^{d'}$, which is used to approximate the policy. 

##### The Parameterized policy formal representation
- Denote $\vec{\theta} \in \mathop{\mathbb{R}}^{d'} $ for the policy's parameter vector, and 
- Denote
$\pi(a|s, \vec{\theta}) = \Pr\{A_t=a | S_t=s, \vec{\theta}_t=\vec{\theta}\}$
for the probability that action $a$ is taken at time $t$, given that the environment is in state $s$ at time $t$ with parameter $\vec{\theta}$. 
- If a method uses a learned value function as well, then the value function's weight vector is denoted $\vec{w} \in \mathop{\mathbb{R}}^d$, as in $\hat{v}(s, \vec{w})$


## Why policy approximation?
In practice, to ensure exploration, we generally require that the policy never becomes deterministic. 

The policy can be parameterized in any way, as long as $\pi(a|s, \vec{\theta})$ is differentiable w.r.t. its parameters. Gradient-based optimisation can then be used to search for a (locally) optimal $\vec{\theta}$. 

#### Advantages
- Better convergence properties
- Effective in high-dimensional or continuous action spaces
- Can learn stochastic policies

#### Disadvantages
- Typically converge to a local rather than global optimum
- Evaluating a policy is typically inefficient and high variance

# 1. The objective function - Performance Measure $J(\vec{\theta})$
How do we learn the policy, and what are we optimising?

Learning the policy parameter is based on the gradient of some **scalar performance measure** $J(\vec{\theta})$ w.r.t. the policy parameters $\vec{\theta}$. 

#### Optimisation based on Gradient Ascent
These methods seek to maximise performance, so their updates approximate gradient *ascent* in $J$:
\begin{equation}
\vec{\theta}_{t+1} = \vec{\theta}_t + \alpha \widehat{\nabla J(\vec{\theta}_t)}
\end{equation}
where $\widehat{\nabla J(\vec{\theta}_t)} \in \mathop{\mathbb{R}}^{d'}$ is a stochastic estimate whose expectation approximates the gradient of the performance measure w.r.t. its argument $\vec{\theta}_t$

All methods that follow this general schema are called *policy gradient methods*. 

#### Actor-Critic?
Methods that learn approximations to both policy and value functions are often called *actor-critic methods*, where *'actor'* is a reference to the learned policy, and *'critic'* refers to the learned value function, usually a state-value function. This will be covered later.



# 2. The Policy Objective Functions

The policy performance measurement function $J(\vec{\theta})$ is different in episodic environments and continuing environments. 

## Problem definition
- Goal: given a policy $\pi_{\theta}(s, a)$ with parameters $\vec{\theta}$, find the best $\vec{\theta}$, i.e. the one maximising the performance measure $J(\vec{\theta})$
- The quality of the policy $\pi_{\theta}$ is measured by the performance measure function $J(\vec{\theta})$
- The setup of function $J(\vec{\theta})$ is different for episodic environment and continuing environment

## The formulation of $J(\vec{\theta})$
### Episodic environments
For Episodic environments, we define the *performance measure* $J(\vec{\theta})$ as the value of the start state of the episode
\begin{equation}
J_1(\theta) = V_{\pi_{\theta}}(s_1) = \mathop{\mathbb{E}}[v_1]
\end{equation}
where $V_{\pi_{\theta}}(s_1)$ is the true value function for $\pi_{\theta}$, the policy determined by $\vec{\theta}$, evaluated at the start state $s_1$. 

This essentially means that we want the value of the start state to be as high as possible: a good policy is one that collects a high expected return from the start of the episode.

### Continuing environments
For continuing environments, we define the *performance measure* $J(\vec{\theta})$ as the **average value**
\begin{equation}
J_{avV}(\theta) = \sum_{s}\mu_{\pi_{\theta}}(s) V_{\pi_{\theta}}(s)
\end{equation}
where $\mu_{\pi_{\theta}}(s) = p(S_t=s | \pi_{\theta})$ is the probability of being in state $s$ in the long run

OR the **average reward per time-step**
\begin{equation}
J_{avR}(\theta) = \sum_s{\mu_{\pi_{\theta}}(s)} \sum_{a}\pi_{\theta}(s, a) \sum_{r}p(r|s, a)r
\end{equation}
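
As a rough illustration of the two continuing-task objectives, the sketch below computes the stationary distribution and both measures for a tiny tabular MDP. The MDP, the policy table `pi`, the discount `gamma` and every number in it are made-up assumptions for the example, not part of these notes.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; every number below is an illustrative assumption.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.1, 0.9]]])   # P[s, a, s'] transition probabilities
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])                 # R[s, a] expected immediate reward
pi = np.array([[0.5, 0.5],
               [0.2, 0.8]])                # pi[s, a] = probability of action a in state s
gamma = 0.9

P_pi = np.einsum("sa,sat->st", pi, P)      # state-to-state transitions under pi
r_pi = (pi * R).sum(axis=1)                # expected reward per state under pi

# Stationary distribution mu: left eigenvector of P_pi with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
mu = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
mu = mu / mu.sum()

# Discounted state values solve V = r_pi + gamma * P_pi V.
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

J_avV = float(mu @ V)      # average value
J_avR = float(mu @ r_pi)   # average reward per time-step
```

With discounted values and the stationary distribution, the two measures differ only by the factor $\frac{1}{1-\gamma}$, which is easy to confirm numerically here.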





# 3. Policy Optimisation

- Let $J(\vec{\theta})$ be any policy objective function
- Policy gradient algorithms search for a local maximum in $J(\vec{\theta})$ by gradient ascent w.r.t. the policy parameters $\vec{\theta}$

\begin{equation}
\Delta \vec{\theta} = \alpha \nabla_{\theta} J(\vec{\theta})
\end{equation}
- where $\nabla_{\theta}J(\theta)$ is the policy gradient
- and $\alpha$ is a step-size parameter

#### Computing an estimate of the policy gradient
- Assume policy $\pi_{\theta}$ is differentiable almost everywhere
- The goal is to compute $\nabla_{\theta}J(\vec{\theta}) = \nabla_{\theta} \mathop{\mathbb{E_\mu}}[v_{\pi_{\theta}}(S)]$
- We will use Monte Carlo samples to compute this gradient
- In practice, the gradient ascent step can be carried out by the optimizers in TF

The policy gradient can be calculated computationally or analytically.

#### Computing Gradients By Finite Differences
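- If there is no access to the policy gradient, we can approximate it numerically: for each dimension $k$, perturb $\vec{\theta}$ by a small amount $\epsilon$ along the unit vector $\vec{u}_k$ and use $\frac{\partial J(\vec{\theta})}{\partial \theta_k} \approx \frac{J(\vec{\theta} + \epsilon \vec{u}_k) - J(\vec{\theta})}{\epsilon}$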
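
A minimal sketch of a finite-difference gradient estimate, perturbing one coordinate at a time and dividing the change in $J$ by the perturbation size. The objective `toy_objective` is a hypothetical stand-in with a known gradient; in a real problem each call to `J` would mean evaluating the policy, which is why this approach is noisy and costs one evaluation per parameter.

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-5):
    """Approximate grad J(theta) with one-sided differences, one coordinate at a time."""
    grad = np.zeros_like(theta, dtype=float)
    base = J(theta)
    for k in range(theta.size):
        u_k = np.zeros_like(theta, dtype=float)
        u_k[k] = 1.0                                   # unit vector along dimension k
        grad[k] = (J(theta + eps * u_k) - base) / eps
    return grad

def toy_objective(theta):
    """Hypothetical smooth objective; its exact gradient is -2 * (theta - 1)."""
    return -np.sum((theta - 1.0) ** 2)

theta = np.array([0.0, 2.0, 3.0])
approx_grad = finite_difference_gradient(toy_objective, theta)
exact_grad = -2.0 * (theta - 1.0)
```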

#### Evaluating the differentiable policy gradients analytically

We defined $J(\vec{\theta})$ in terms of the state-value function $v_{\pi_{\theta}}$, so it depends on both the action selections and the distribution of states in which those selections are made, and both of these are affected by the policy parameter. 

But how can we differentiate the state-value function? The effect of the policy on the state distribution is a function of the environment, and it is typically unknown. So, how can we estimate the performance gradient $\nabla J(\vec{\theta})$ w.r.t. the policy parameter when the gradient depends on the unknown effect of policy changes on the state distribution?

Fortunately, as long as the policy itself is differentiable, we can use the *policy gradient theorem* to obtain an analytic expression for the gradient of performance measure. 


 


## One-step MDP (Contextual Bandit problem) Gradient of performance measure
Consider a one-step MDP (Contextual Bandit problem),  the performance measure is the expected reward that we can get $J(\vec{\theta}) = \mathop{\mathbb{E}}[R(S,A)]$
- We want to calculate the gradient $\nabla_{\theta}J(\vec{\theta}) = \nabla_{\theta} \mathop{\mathbb{E}}[R(S, A)]$, where the expectation is over states $S \sim \mu$ and actions $A \sim \pi_{\theta}$
- Now, we use the identity $\nabla_{\theta} \mathop{\mathbb{E}}[R(S, A)] = \mathop{\mathbb{E}}[\nabla_{\theta} \log \pi_{\theta} (A|S) R(S,A)]$
- The right-hand side is an expectation of a gradient, so it can be estimated from samples
- The stochastic policy-gradient update is then

\begin{equation}
\theta_{t+1} = \theta_t + \alpha R_{t+1} \nabla_{\theta} \log \pi_{\theta_t}(A_t | S_t)
\end{equation}
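
The update above can be sketched on a toy contextual bandit. Everything about the problem here (two contexts, three actions, the `mean_reward` table, the noise level, the step size) is an assumption made up for illustration, and the policy is a tabular softmax with one preference per (state, action) pair.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical contextual bandit: 2 contexts, 3 actions, noisy rewards.
mean_reward = np.array([[1.0, 0.0, 0.0],
                        [0.0, 0.0, 1.0]])
n_states, n_actions = mean_reward.shape
theta = np.zeros((n_states, n_actions))   # one preference per (state, action)
alpha = 0.1

def policy(theta, s):
    """Softmax over the preferences of state s."""
    prefs = theta[s] - theta[s].max()      # subtract max for numerical stability
    p = np.exp(prefs)
    return p / p.sum()

for t in range(5000):
    s = rng.integers(n_states)                        # context drawn uniformly
    probs = policy(theta, s)
    a = rng.choice(n_actions, p=probs)                # A_t ~ pi(.|S_t)
    r = mean_reward[s, a] + 0.1 * rng.standard_normal()
    # For a tabular softmax, grad of log pi(a|s) w.r.t. theta[s] is onehot(a) - pi(.|s).
    score = -probs
    score[a] += 1.0
    theta[s] += alpha * r * score                     # theta += alpha * R * grad log pi
```

After training, `policy(theta, 0)` and `policy(theta, 1)` should put most of their mass on the action with the highest mean reward in each context.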


## Proof of the identity - The score function trick
- Assume the policy $\pi_{\theta}$ is differentiable whenever it is non-zero
- Assume the gradient $\nabla_{\theta}\pi_{\theta}$ is known
- The likelihood ratios exploit the following identity
\begin{equation}
\begin{split}
\nabla_{\theta} \mathop{\mathbb{E}}[R(S, A)] &= \nabla_{\theta} \sum_s{\mu (s)}\sum_a \pi_{\theta}(a|s)R(s,a) \\
& = \sum_s{\mu(s)}\sum_a \nabla_{\theta}\pi_{\theta}(a|s)R(s,a) \\
& = \sum_s \mu(s) \sum_a \pi_{\theta}(a|s) \frac{\nabla_{\theta} \pi_{\theta} (a|s)}{\pi_{\theta}(a|s)}R(s,a) \\
& = \sum_s \mu(s) \sum_a \pi_{\theta}(a|s) \nabla_{\theta} \log \pi_{\theta}(a|s) R(s,a) \\
& = \mathop{\mathbb{E}}[\nabla_{\theta} \log \pi_{\theta} (A|S) R(S,A)]
\end{split}
\end{equation}
- The score function is $\nabla_{\theta} \log \pi_{\theta}(a|s)$, i.e. the gradient of the log likelihood
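
The identity can be checked numerically. The sketch below uses a single state, a softmax policy over three actions, and made-up preferences and rewards (all assumptions for the example). It compares the exact gradient of the expected reward with a Monte Carlo average of score times reward.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.5])      # hypothetical softmax preferences
rewards = np.array([1.0, 2.0, 0.5])     # hypothetical reward R(a) for each action

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(theta)

# Left-hand side: exact gradient of E[R] = sum_a pi(a) R(a).
# For a softmax, d pi(a) / d theta_k = pi(a) * (1[a == k] - pi(k)).
jacobian = np.diag(pi) - np.outer(pi, pi)
exact_grad = jacobian @ rewards

# Right-hand side: sample actions and average score * reward.
actions = rng.choice(3, size=200_000, p=pi)
scores = np.eye(3)[actions] - pi        # grad log pi(a) = onehot(a) - pi
mc_grad = (scores * rewards[actions, None]).mean(axis=0)
```

The two vectors agree up to sampling noise, which is the point of the trick: the right-hand side needs only sampled actions and rewards, not a model of how the reward depends on $\theta$.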

## Theorem
- To extend the policy gradient approach beyond the one-step MDP, we replace the instantaneous reward of the one-step MDP with the long-term action value $Q_{\pi_{\theta}}$. We have
> For any differentiable policy $\pi_{\theta}(s, a)$ and
> for any of the policy objective functions $J = J_1, J_{avR}, \frac{1}{1-\gamma}J_{avV}$,
> the policy gradient is
> \begin{equation}
\nabla_{\theta}J(\vec{\theta}) = \mathop{\mathbb{E_{\pi_{\theta}}}}[\nabla_{\theta} \log \pi_{\theta} (A|S) Q_{\pi_{\theta}}(S, A)]
\end{equation}
- The policy gradient theorem applies to start state objective, average reward and average value objective

## Score function under the softmax policy
One of the differentiable policies is the softmax policy

The actions with the highest preferences in each state are given the highest probabilities of being selected, for example, according to a softmax distribution:
\begin{equation}
\pi(a|s, \vec{\theta}) = \frac{e^{h(s, a, \vec{\theta})}}{\sum_{b}{e^{h(s, b, \vec{\theta})}}}
\end{equation}
where the function $h(s, a, \vec{\theta})$ is the parameterized state-action preference.

We call this kind of policy parameterization *softmax in action preferences*.

The action preferences themselves can be parameterized arbitrarily, for example using a deep artificial neural network (ANN), or it could simply be linear in features. For example:

\begin{equation}
h(s, a, \vec{\theta}) = \vec{\theta}^T \vec{x}(s, a)
\end{equation}
where $\vec{x}(s, a) \in \mathop{\mathbb{R}}^{d'}$ is a feature vector constructed by any of the methods described in Section 9.5. 

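Since the softmax policy is differentiable, we can derive its score function. With linear action preferences it is
\begin{equation}
\nabla_{\theta} \log \pi_{\theta}(s, a) = \vec{x}(s, a) - \sum_{b} \pi(b|s, \vec{\theta}) \vec{x}(s, b) = \vec{x}(s, a) - \mathop{\mathbb{E}}_{\pi_{\theta}}[\vec{x}(s, \cdot)]
\end{equation}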


#### Summary
- Policy function: $\pi(a|s, \vec{\theta}) = \frac{e^{h(s, a, \vec{\theta})}}{\sum_{b}{e^{h(s, b, \vec{\theta})}}}$
- Score function: $\nabla_{\theta} \log \pi_{\theta}(s, a) = \vec{x}(s, a) - \mathop{\mathbb{E}}_{\pi_{\theta}}[\vec{x}(s, \cdot)]$
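
A minimal sketch of the softmax policy with linear preferences and its score function, checked against a central finite difference of $\log \pi$. The feature matrix and parameter vector are random placeholders (assumptions for the example), and `features[a]` plays the role of $\vec{x}(s, a)$ for one fixed state.

```python
import numpy as np

def softmax_policy(theta, features):
    """features[a] = x(s, a) for a fixed state s; returns pi(.|s, theta)."""
    prefs = features @ theta
    prefs = prefs - prefs.max()          # numerical stability, does not change pi
    e = np.exp(prefs)
    return e / e.sum()

def score(theta, features, a):
    """grad_theta log pi(a|s, theta) = x(s, a) - E_pi[x(s, .)]."""
    pi = softmax_policy(theta, features)
    return features[a] - pi @ features

rng = np.random.default_rng(0)
features = rng.standard_normal((4, 3))   # hypothetical: 4 actions, d' = 3
theta = rng.standard_normal(3)

# Check against a central finite difference of log pi(a|s, theta).
a, eps = 2, 1e-6
numeric = np.array([
    (np.log(softmax_policy(theta + eps * e_k, features)[a])
     - np.log(softmax_policy(theta - eps * e_k, features)[a])) / (2 * eps)
    for e_k in np.eye(3)
])
analytic = score(theta, features, a)
```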

##  Policy Gradient as a supervised learning problem
http://karpathy.github.io/2016/05/31/rl/   
https://amoudgl.github.io/blog/policy-gradient/

An alternative way to look at policy gradient is from the angle of supervised learning. 

#### A deeper thought on cross entropy
Recall that in supervised learning, say for a classification problem with $C$ classes, we train the classifier with cross-entropy loss. Cross entropy measures the mismatch between the distribution predicted by the model and the true label distribution.

\begin{equation}
H(P, Q) = \mathop{\mathbb{E}_{x \sim P}}\big[-\log Q(x)\big]
\end{equation}

$H(P, Q)$ means that we calculate the expectation using P and the encoding size using $Q$.  

As such, $H(P, Q)$ and $H(Q, P)$ are not necessarily the same, except when $Q=P$, in which case $H(P, Q) = H(P, P) = H(P)$ and it becomes the entropy itself.

#### Cross Entropy as a loss function
In a classification problem, suppose we have $C$ classes, $c_1, c_2, ..., c_C$. Our job is to calculate the likelihood of each class based on the feature input. The label is always given with 100% certainty. For a 5-class classification problem, we may have a table like this:

|Prediction ($\hat{y}$)|Label ($y$)|
|----------|-----|
|[0.4 0.3 0.05 0.05 0.2]|[1 0 0 0 0]|

The question is: how good was the model's prediction? We can calculate the cross-entropy as follows
\begin{equation}
H(y, \hat{y}) = -\sum_{i=1}^{C} y_i \log(\hat{y}_i)
\end{equation}
where $y_i$ and $\hat{y}_i$ are the ground truth and the predicted score for each class $i$ in $C$. Usually an activation function (softmax) is used to calculate the score for each class. For the table above, only the first class has $y_i = 1$, so $H(y, \hat{y}) = -\log(0.4) \approx 0.92$.

If we have $N$ training examples, then the cross-entropy loss over the samples will be $\sum_{m=1}^{N} H(y^{(m)}, \hat{y}^{(m)})$
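
As a small worked example, the sketch below evaluates the cross entropy for the 5-class table above, following the definition $H(P, Q)$ with $P$ the one-hot label and $Q$ the prediction. The helper name `cross_entropy` and the clipping constant are choices made for this illustration.

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    """H(y, y_hat) = -sum_i y_i * log(y_hat_i); predictions are clipped to avoid log(0)."""
    return float(-np.sum(y_true * np.log(np.clip(y_pred, eps, 1.0))))

y_pred = np.array([0.4, 0.3, 0.05, 0.05, 0.2])   # model prediction from the table
y_true = np.array([1.0, 0.0, 0.0, 0.0, 0.0])     # one-hot label from the table

loss = cross_entropy(y_true, y_pred)             # equals -log(0.4), about 0.916
```

Because the label is one-hot, only the probability assigned to the correct class contributes, so the loss shrinks towards 0 as that probability approaches 1.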

The objective of a classification problem is to maximise the likelihood of the correct class. In other words, the objective is to build a model with parameters $\hat{\theta}$ that maximize the probability of the observed data, i.e. **Maximum Likelihood Estimation (MLE)**
\begin{equation}
\hat{\theta} = {\arg \max}_{\theta} \prod_{i=1}^{N}p(x_i|\theta)
\end{equation}

However, a product of many probabilities is numerically unstable and can easily underflow (or overflow). Since $\log$ is monotonic, taking the $\log$ of the objective does not change the maximiser, and it lets us rewrite the product as a sum. 

\begin{equation}
\hat{\theta} = {\arg \max}_{\theta} \sum_{i=1}^{N}\log p(x_i|\theta)
\end{equation}
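
The numerical motivation for the log form is easy to see in floating point. In the sketch below the per-example likelihoods are a made-up constant (an assumption for illustration); the product underflows while the sum of logs stays well within range.

```python
import numpy as np

probs = np.full(1000, 0.01)               # hypothetical per-example likelihoods

likelihood = np.prod(probs)               # 1e-2000 is below the float64 range: underflows to 0.0
log_likelihood = np.sum(np.log(probs))    # 1000 * log(0.01), about -4605.17
```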

[In practice](https://stats.stackexchange.com/questions/174481/why-to-optimize-max-log-probability-instead-of-probability/), instead of maximising the log likelihood, we tend to minimise the negative log likelihood. This minimises the cross entropy between the data distribution and the model, which is equivalent to minimising the KL divergence between them.

\begin{equation}
\hat{\theta} = {\arg \min}_{\theta} -\sum_{i=1}^{N}\log p(x_i|\theta)
\end{equation}

Here are two good articles that revisit cross-entropy at a deeper level:  
https://towardsdatascience.com/demystifying-cross-entropy-e80e3ad54a8  
https://jhui.github.io/2017/01/05/Deep-learning-Information-theory/

## The samples from Monte Carlo as labels
The problem of Reinforcement Learning is that we don't have true labels. However, as mentioned in previous lectures, we can generate labels by our experience. 

In policy gradient, we first run our agent for an episode and observe the reward. Since the actions follow a stochastic policy, its actions will be selected according to the "scores" of each action. 

Our objective is to find an optimal policy. So, we want
- If the sequence of actions leads to a win, the reward is positive and we encourage all actions of that episode by minimizing the negative log likelihood of the actions we took under the network probabilities
- If the sequence of actions leads to a loss, the reward is negative and we discourage all actions of that episode by maximizing the negative log likelihood of the actions we took under the network probabilities

Thus, we tweak the log likelihood formulation of supervised learning by adding an extra term called the "Advantage" $A_i$, which represents whether the action should be encouraged or discouraged. A good choice of advantage is the discounted reward over time.

Thus, the formulation becomes
\begin{equation}
L = -\sum_{i=1}^{N}A_i \log p(x_i|\theta)
\end{equation}
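
A minimal sketch of this advantage-weighted loss, using the discounted reward-to-go as the advantage. The episode data (rewards, policy outputs, chosen actions) and the function names are made up for illustration.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Discounted reward-to-go for every step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def weighted_nll(action_probs, actions, advantages):
    """L = -sum_i A_i * log p(a_i), where action_probs[i] is the policy output at step i."""
    chosen = action_probs[np.arange(len(actions)), actions]
    return float(-np.sum(advantages * np.log(chosen)))

# Hypothetical 3-step episode with 2 actions and a win at the end.
rewards = [0.0, 0.0, 1.0]
advantages = discounted_returns(rewards, gamma=0.5)     # [0.25, 0.5, 1.0]
action_probs = np.array([[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]])
actions = np.array([0, 1, 0])
loss = weighted_nll(action_probs, actions, advantages)
```

With all advantages equal to 1 this reduces to the ordinary negative log likelihood, which is the sense in which policy gradient looks like supervised learning on self-generated labels.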

# 4. Monte-Carlo Policy Gradient (REINFORCE)

## Algorithms

#### Steps:
- Update parameters by stochastic gradient ascent
- Using policy gradient theorem
- Using return $G_t$ as an unbiased sample of $Q_{\pi_{\theta}}(s_t, a_t)$
\begin{equation}
\begin{split}
\Delta \theta_t &= \alpha \widehat{\nabla_{\theta} J(\vec{\theta})} \\
&= \alpha \nabla_{\theta} \log \pi_{\theta}(s_t, a_t)G_t
\end{split}
\end{equation}
- Loss function $L = -\sum_t G_t \log \pi(a_t|s_t, \vec{\theta})$

#### Algorithm
---
```
Input: a differentiable policy parameterization pi(a|s, theta)
Algorithm parameter: step size alpha > 0
Initialise policy parameter theta with dimension d'

Loop forever (for each episode):
        Generate an episode S0, A0, R1, ..., ST-1, AT-1, RT, following pi(.|., theta)
        Loop for each step of the episode t = 0, 1, ..., T-1:
                G = sum over k = t+1, ..., T of gamma^(k-t-1) * Rk
                theta = theta + alpha * gamma^t * G * grad_theta ln pi(At|St, theta)
```
---
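
The algorithm box above can be sketched in NumPy on a toy problem. The environment (a 4-state corridor with a reward of -1 per step until the goal), the tabular softmax policy, the episode cap and the hyperparameters are all assumptions chosen for illustration; this is one possible instance, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GOAL = 4, 2, 3      # hypothetical corridor; action 1 moves right
GAMMA, ALPHA = 0.99, 0.02

def step(s, a):
    """Move left or right; reward is -1 per step until the goal is reached."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s_next, -1.0, s_next == GOAL

def policy(theta, s):
    """Tabular softmax policy pi(.|s, theta)."""
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def generate_episode(theta, max_steps=100):
    """Roll out one episode as a list of (S_t, A_t, R_{t+1}) tuples."""
    s, trajectory = 0, []
    for _ in range(max_steps):
        a = rng.choice(N_ACTIONS, p=policy(theta, s))
        s_next, r, done = step(s, a)
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break
    return trajectory

theta = np.zeros((N_STATES, N_ACTIONS))
for episode in range(3000):
    trajectory = generate_episode(theta)

    # Discounted return G_t for every step, computed backwards.
    G, returns = 0.0, []
    for (_, _, r) in reversed(trajectory):
        G = r + GAMMA * G
        returns.append(G)
    returns.reverse()

    for t, ((s, a, _), G_t) in enumerate(zip(trajectory, returns)):
        score = -policy(theta, s)        # grad log pi(a|s) = onehot(a) - pi(.|s)
        score[a] += 1.0
        theta[s] += ALPHA * (GAMMA ** t) * G_t * score
```

After training, the greedy action in every non-terminal state should be "right". No baseline is subtracted here, so the updates are noisy, which is the variance issue noted for Monte-Carlo policy gradient.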

The problem with Monte-Carlo policy gradient is that the gradient estimate still has high variance.

# 5. From REINFORCE to Actor-Critic methods

- Simple actor-critic algorithm based on action-value critic
- Using linear value function approximation $Q_w(s, a) = \vec{x}(s, a)^T \vec{w}$
- The critic updates $w$ by linear TD(0), and the actor updates $\theta$ by policy gradient

Recall the **Policy Gradient Theorem** that REINFORCE is built on:
\begin{equation}
\nabla_{\theta}J(\theta) = \mathop{\mathbb{E}_{\pi_{\theta}}}\big[\nabla_{\theta} \log \pi_{\theta}(a_t|s_t)Q^{\pi_{\theta}}(s_t,a_t) \big]
\end{equation}

As we know, the Q value can be learnt by parameterizing the Q function with a neural network (parameters denoted by $w$), and this leads us to **Actor Critic Methods**, where
- *'actor'* is a reference to the learned policy, and
- *'critic'* refers to the learned value function, usually a state-value function.

Both the Actor and Critic functions are parameterized with neural networks. We will cover the Q Actor Critic in particular; the other variants follow the same pattern.

#### Baselines and Advantage values
Intuitively, the advantage function measures how much better it is to take a specific action compared to the average, general action at the given state.

\begin{equation}
A(s_t, a_t) = Q_{w}(s_t, a_t) - V_{v}(s_t)
\end{equation}
where $w$ parameterises the action-value function $Q$ and $v$ parameterises the state-value function $V$. Does it mean that we need two neural networks? No.

From the Bellman expectation equation, we know that 
\begin{equation}
Q(s_t, a_t) = \mathop{\mathbb{E}}\big[r_{t+1} + \gamma V(s_{t+1}) \big]
\end{equation}

Thus, the advantage function can be estimated from a single sampled transition as
\begin{equation}
A(s_t, a_t) = r_{t+1} + \gamma V_{v}(s_{t+1}) - V_{v}(s_t)
\end{equation}
and this approach only requires one set of critic parameters $v$
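
A minimal sketch of this one-sample advantage estimate. The function name, the `terminal` flag and the numbers in the example are assumptions for illustration; the value estimates would come from the critic $V_v$.

```python
import numpy as np

def td_advantage(reward, v_s, v_next, gamma=0.99, terminal=False):
    """One-sample advantage estimate r + gamma * V(s') - V(s).

    If s' is terminal, its value is taken to be 0, so there is no bootstrap term.
    """
    bootstrap = 0.0 if terminal else gamma * v_next
    return reward + bootstrap - v_s

# Hypothetical critic outputs for one transition.
adv = td_advantage(reward=1.0, v_s=2.0, v_next=3.0, gamma=0.9)            # 1 + 2.7 - 2
adv_terminal = td_advantage(reward=1.0, v_s=2.0, v_next=3.0, gamma=0.9,
                            terminal=True)                                # 1 - 2
```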

#### Algorithm
---
```
Input: a differentiable policy parameterization pi(a|s, theta)
Input: a differentiable action-value function parameterization Q_w(s, a, w)
Parameters: step sizes alpha_theta > 0; alpha_w > 0
Initialise policy parameter theta and action-value weights w

Loop forever (for each episode):

        Initialise S (first state of the episode)
        Sample A ~ pi(.|S, theta)

        Loop while S is not terminal (for each time step):
                Take action A, observe S', R
                Sample A' ~ pi(.|S', theta)
                delta = R + gamma * Q_w(S', A', w) - Q_w(S, A, w)  [TD(0) error; Q_w(terminal, ., w) = 0]
                theta = theta + alpha_theta * Q_w(S, A, w) * grad_theta log pi(A|S, theta)     [policy gradient update]
                w = w + alpha_w * delta * x(S, A)    [linear TD(0) update]
                A = A', S = S'
```
---
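
A rough NumPy sketch of a Q actor-critic loop of this kind. It assumes a made-up 4-state corridor with a reward of -1 per step, a tabular softmax actor, and one-hot features $\vec{x}(s, a)$ so that $Q_w(s, a)$ is simply `w[s, a]`; the hyperparameters and the per-episode step cap are also assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, GOAL = 4, 2, 3      # hypothetical corridor; action 1 moves right
GAMMA, ALPHA_THETA, ALPHA_W = 0.99, 0.02, 0.1

def step(s, a):
    """Move left or right; reward is -1 per step until the goal is reached."""
    s_next = min(s + 1, GOAL) if a == 1 else max(s - 1, 0)
    return s_next, -1.0, s_next == GOAL

def policy(theta, s):
    """Tabular softmax actor pi(.|s, theta)."""
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

theta = np.zeros((N_STATES, N_ACTIONS))  # actor parameters
w = np.zeros((N_STATES, N_ACTIONS))      # critic weights; Q_w(s, a) = w[s, a]

for episode in range(3000):
    s = 0
    a = rng.choice(N_ACTIONS, p=policy(theta, s))
    for _ in range(200):                 # cap on episode length
        s_next, r, done = step(s, a)
        a_next = rng.choice(N_ACTIONS, p=policy(theta, s_next))
        q_next = 0.0 if done else w[s_next, a_next]   # terminal state has value 0
        delta = r + GAMMA * q_next - w[s, a]          # TD(0) error

        score = -policy(theta, s)        # grad log pi(a|s) = onehot(a) - pi(.|s)
        score[a] += 1.0
        theta[s] += ALPHA_THETA * w[s, a] * score     # actor: policy gradient step
        w[s, a] += ALPHA_W * delta                    # critic: x(s, a) is one-hot

        if done:
            break
        s, a = s_next, a_next
```

Compared with REINFORCE, the actor is updated at every step using the critic's current estimate instead of waiting for the full return, trading some bias for lower variance.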

## Full advantage actor critic agent

Advantage actor critic includes:
- A representation (e.g. LSTM): $(S_{t-1}, O_t) \mapsto S_t$
- A network $v_{w}: S \mapsto v$
- A network $\pi_{\theta}: S \mapsto \pi$
- Copies/variants $\pi^m$ of $\pi_{\theta}$ to use as policies: $S_{t}^{m} \mapsto A_{t}^{m}$
- An n-step TD loss on $v_{w}$

\begin{equation}
L(w) = \frac{1}{2}(G_{t}^{(n)} - v_{w}(S_t))^2
\end{equation}
where $G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^{n} v_{w}(S_{t+n})$

- An n-step REINFORCE loss on $\pi_{\theta}$
\begin{equation}
L(\theta) = -\big[ G_{t}^{(n)} - v_{w}(S_t) \big] \log \pi_{\theta}(A_t | S_t)
\end{equation}

- And use optimizers to minimize the losses
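
The two losses can be sketched for a single time step as follows. The sketch assumes that the n-step return bootstraps from the critic's value at $S_{t+n}$ and that the policy loss carries a minus sign, so that minimizing it performs gradient ascent on the objective. The function names and the numbers are hypothetical.

```python
import numpy as np

def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """G_t^(n) = R_{t+1} + gamma R_{t+2} + ... + gamma^(n-1) R_{t+n} + gamma^n v(S_{t+n})."""
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def a2c_losses(rewards, v_s, v_bootstrap, log_prob_action, gamma=0.99):
    """Value loss and policy loss for one time step of advantage actor-critic."""
    g = n_step_return(rewards, v_bootstrap, gamma)
    advantage = g - v_s                  # treated as a constant in the policy loss
    value_loss = 0.5 * advantage ** 2
    policy_loss = -advantage * log_prob_action
    return value_loss, policy_loss

# Hypothetical 3-step rollout: rewards R_{t+1..t+3}, critic values and log pi(A_t|S_t).
g = n_step_return([1.0, 0.0, 2.0], bootstrap_value=4.0, gamma=0.5)   # 1 + 0 + 0.5 + 0.5
value_loss, policy_loss = a2c_losses([1.0, 0.0, 2.0], v_s=1.5, v_bootstrap=4.0,
                                     log_prob_action=np.log(0.5), gamma=0.5)
```

In a framework with automatic differentiation, the advantage inside the policy loss would typically be detached from the graph so that only $\log \pi_{\theta}$ receives gradients from that term.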