# Workshop RL03: policy-based RL (policy-gradient focus)

## Motivation
So far we have learnt model-based and value-based RL. Now let's talk about the more commonly used approach, namely the policy-based method. **Our goal is to parameterise the policy $\pi_\theta(s,a)$ and find the parameters of the best policy, i.e. the one that maximises the value function.**

Before we go into the details, let's look at the pros and cons of policy-based methods.

**Pros:** 
- Better convergence properties
- Effective in high-dimensional or continuous action spaces
    - value-based methods are effective in high-dimensional state spaces but not in high-dimensional action spaces
- Can learn stochastic policies
    - important for non-stationary and/or non-Markov states
    - e.g. a deterministic policy can easily be exploited in a game like rock-paper-scissors
    - e.g. the agent cannot distinguish between the states that are far away from the rewards (the grey states in the following picture)
    <img src = 'non-markov.png' width=400>
    

**Cons:**
- Typically converges to a local optimum rather than the global optimum
    - i.e. a good policy but not necessarily the best
- Evaluating a policy is typically inefficient and has high variance


## How to parameterise policy 
Let's start with how to parameterise our policy. There are many choices of differentiable parameterised policy: 
1. Softmax policy
$$\pi_\theta(s,a) = \frac{e^{\phi(s,a)^T \theta}}{\sum_a e^{\phi(s,a)^T \theta}},$$
where $\phi(s,a)^T \theta$ are the weights assigned to actions, computed as a linear combination of features.
2. Gaussian policy 
$$a \sim N(\mu(s),\sigma^2),$$
where $\mu(s)$ is a linear combination of state features, $\mu(s)=\phi(s)^T \theta$, and $\sigma$ is either fixed or also parameterised.
3. Neural networks policy 
    
    The output layer gives the policy $\pi(s,a)$.
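
The first two parameterisations above can be sketched in a few lines of plain Python. This is only an illustration: the function names `softmax_policy` and `gaussian_policy_sample`, the feature vectors and the parameter values are all made up for the example.

```python
import math
import random

def softmax_policy(theta, features):
    """Return action probabilities for a single state.

    features[a] is a hypothetical feature vector phi(s, a) for action a.
    """
    # Linear preference phi(s, a)^T theta for every action.
    prefs = [sum(f * w for f, w in zip(phi, theta)) for phi in features]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]

def gaussian_policy_sample(theta, phi_s, sigma, rng):
    """Sample a ~ N(mu(s), sigma^2) with mu(s) = phi(s)^T theta."""
    mu = sum(f * w for f, w in zip(phi_s, theta))
    return rng.gauss(mu, sigma)

theta = [0.5, -0.25]                              # made-up parameters
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three actions, made-up features
probs = softmax_policy(theta, features)

rng = random.Random(0)
action = gaussian_policy_sample(theta, [1.0, 2.0], sigma=0.5, rng=rng)
```

Here `probs` is a valid probability distribution over the three actions, and the action with the largest preference $\phi(s,a)^T \theta$ gets the largest probability.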

## Gradient-free method
To find the parameters of the optimal policy, there are gradient-based and gradient-free methods. Our focus is on gradient-based methods, but you can explore the following gradient-free methods in your own time. They are often a great simple baseline to try, and they also work with non-differentiable policies:
- Hill climbing 
- Simplex/amoeba/Nelder Mead
- Genetic algorithms
- Cross-entropy method (CEM)
- Covariance matrix adaptation (CMA)
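
As a rough illustration of the simplest method in the list, here is a minimal hill-climbing sketch. It assumes we can evaluate the value of any parameter vector; `toy_value` is a made-up stand-in for $V(\theta)$ (in a real problem it would be estimated by running episodes), and the step size and iteration count are arbitrary.

```python
import random

def hill_climb(value_fn, theta, n_iters=500, step=0.1, seed=0):
    """Random-perturbation hill climbing: keep a candidate only if it improves."""
    rng = random.Random(seed)
    best = value_fn(theta)
    for _ in range(n_iters):
        # Perturb every parameter with Gaussian noise.
        candidate = [w + rng.gauss(0.0, step) for w in theta]
        v = value_fn(candidate)
        if v > best:
            theta, best = candidate, v
    return theta, best

def toy_value(theta):
    # Hypothetical value function with its maximum (0) at theta = [1, -2].
    return -((theta[0] - 1.0) ** 2) - (theta[1] + 2.0) ** 2

theta_star, v_star = hill_climb(toy_value, [0.0, 0.0])
```

No gradient of `toy_value` is ever computed, which is why this kind of method also works when the policy is not differentiable.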

## Gradient-based method: Policy gradient 
To use a gradient-based method to find the best policy, we first define $V(\theta)=V^{\pi_\theta}$ to make the dependence of the value on the policy parameters explicit. We also assume episodic MDPs (Markov Decision Processes).

Since we're looking for the parameters that **maximise** the value function, we update the parameters by **gradient ascent**, i.e.

$$\Delta \theta = \alpha \nabla_\theta V(\theta),$$
where $\alpha$ is the step size and $\nabla_\theta V(\theta)$ is the **policy gradient**:
$$\nabla_\theta V(\theta) = 
        \begin{bmatrix}
            \frac{\partial V(\theta)}{\partial \theta _1} \\
            \frac{\partial V(\theta)}{\partial \theta _2}  \\
            \vdots  \\
            \frac{\partial V(\theta)}{\partial \theta _n} 
        \end{bmatrix}$$
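
The update rule above can be tried out on a toy objective. In this sketch the gradient vector is approximated one partial derivative at a time with central finite differences, purely for illustration; `toy_value` is a made-up stand-in for $V(\theta)$, and the step size and number of iterations are arbitrary.

```python
def finite_difference_gradient(value_fn, theta, eps=1e-4):
    """Approximate each partial derivative dV/dtheta_k by a central difference."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[k] += eps
        down[k] -= eps
        grad.append((value_fn(up) - value_fn(down)) / (2 * eps))
    return grad

def toy_value(theta):
    # Hypothetical value function with its maximum at theta = [1, -2].
    return -((theta[0] - 1.0) ** 2) - (theta[1] + 2.0) ** 2

theta = [0.0, 0.0]
alpha = 0.1  # step size
for _ in range(100):
    grad = finite_difference_gradient(toy_value, theta)
    # Gradient ascent: move in the direction of the gradient.
    theta = [w + alpha * g for w, g in zip(theta, grad)]
```

Note that every gradient estimate needs two evaluations of the value function per parameter, which becomes expensive when each evaluation means running episodes.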


## Computing policy gradient analytically 
Now the big problem is how to compute these gradients. We can estimate the policy gradient by finite differences, but that is simple rather than efficient: it needs separate policy evaluations for every parameter dimension, and the resulting estimates are noisy. So we will focus on the more common practice, which is to compute the policy gradient analytically. There are many different ways:
1. [Score function](https://en.wikipedia.org/wiki/Score_(statistics)) policy gradient 
2. Monte-Carlo policy gradient
3. "Vanilla" policy gradient
... and so on.

The following are the results of computing the policy gradient analytically. The derivations are included in the optional workshop if you're interested.


### 1. Score function policy gradient 
$$\nabla_\theta V(\theta) \approx (1/m) \sum_{i=1}^m R(\tau^{(i)}) \sum_{t=0}^{T-1} \underbrace{\nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)})}_{score\ function} $$
$$\Delta \theta = (\alpha/m) \sum_{i=1}^m R(\tau^{(i)}) \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) $$

For a Gaussian policy, the score function is $\nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) = \frac{(a_t^{(i)}-\mu(s_t^{(i)}))\,\phi(s_t^{(i)})}{\sigma^2}$.
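
As a sanity check, the Gaussian score function can be compared against a numerical derivative of the log-density, and then plugged into the estimator above. This is a minimal sketch: the helper names, the features, the actions and the returns are all made up for the example, and $\sigma$ is assumed fixed.

```python
import math

def gaussian_log_prob(a, theta, phi_s, sigma):
    """log N(a; mu(s), sigma^2) with mu(s) = phi(s)^T theta."""
    mu = sum(f * w for f, w in zip(phi_s, theta))
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (a - mu) ** 2 / (2 * sigma ** 2)

def gaussian_score(a, theta, phi_s, sigma):
    """Analytic score: (a - mu(s)) * phi(s) / sigma^2."""
    mu = sum(f * w for f, w in zip(phi_s, theta))
    return [(a - mu) * f / sigma ** 2 for f in phi_s]

def score_function_gradient(episodes, theta, sigma):
    """(1/m) * sum_i R(tau_i) * sum_t score(a_t, s_t).

    episodes is a list of (steps, total_return), steps a list of (phi_s, a).
    """
    m = len(episodes)
    grad = [0.0] * len(theta)
    for steps, total_return in episodes:
        for phi_s, a in steps:
            score = gaussian_score(a, theta, phi_s, sigma)
            for k in range(len(theta)):
                grad[k] += total_return * score[k] / m
    return grad

theta, phi_s, sigma, a = [0.3, -0.1], [1.0, 2.0], 0.5, 0.7
score = gaussian_score(a, theta, phi_s, sigma)

# Numerical check of the score with central differences.
eps = 1e-5
numeric = []
for k in range(len(theta)):
    up, down = list(theta), list(theta)
    up[k] += eps
    down[k] -= eps
    numeric.append((gaussian_log_prob(a, up, phi_s, sigma)
                    - gaussian_log_prob(a, down, phi_s, sigma)) / (2 * eps))

# Two made-up one-step episodes with total returns +1 and -1.
episodes = [([(phi_s, 0.7)], 1.0), ([(phi_s, -0.5)], -1.0)]
grad = score_function_gradient(episodes, theta, sigma)
```
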
### 2. Monte-Carlo policy gradient (REINFORCE)
Apart from using the above score function policy gradient, we can also leverage the temporal structure of an episode (weight the score at each time step by the return from that step onwards, rather than by the return of the whole trajectory) and use the Monte-Carlo policy gradient as follows:
$$\Delta \theta = \alpha \nabla_\theta \log \pi_\theta(s_t,a_t) G_t, $$
where $G_t$ is the Monte-Carlo return at time $t$.

**REINFORCE** algorithm:
- initialise a random policy with parameter $\theta$
- for each episode {$s_1,a_1,r_1,...,s_{T-1},a_{T-1},r_{T-1}$} $\sim \pi_\theta$ do
    - for $t=1$ to $T-1$ do 
        - $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s_t,a_t) G_t$
- return $\theta$
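
The REINFORCE loop above can be sketched on a tiny made-up problem. Everything about the environment here is an assumption for illustration: a corridor of three states where action 1 moves one step right, action 0 stays put, every step costs a reward of -1, and the episode ends after leaving the last state (or after `MAX_STEPS`). The policy is a tabular softmax, for which the score is $\mathbb{1}[b=a]-\pi_\theta(s,b)$.

```python
import math
import random

N_STATES, N_ACTIONS, MAX_STEPS = 3, 2, 20  # hypothetical corridor environment

def action_probs(theta, s):
    """Tabular softmax policy: theta[s][a] is the preference for action a in state s."""
    m = max(theta[s])
    exps = [math.exp(p - m) for p in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def run_episode(theta, rng):
    """Sample one episode as a list of (state, action, reward) tuples."""
    s, trajectory = 0, []
    for _ in range(MAX_STEPS):
        probs = action_probs(theta, s)
        a = 0 if rng.random() < probs[0] else 1
        trajectory.append((s, a, -1.0))   # every step costs -1
        if a == 1:
            s += 1                        # action 1 moves right
        if s == N_STATES:                 # walked off the end: episode over
            break
    return trajectory

def reinforce(n_episodes=2000, alpha=0.01, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(n_episodes):
        trajectory = run_episode(theta, rng)
        # Monte-Carlo returns G_t (undiscounted), accumulated backwards in time.
        g, returns = 0.0, []
        for _, _, r in reversed(trajectory):
            g += r
            returns.append(g)
        returns.reverse()
        # theta <- theta + alpha * grad log pi(s_t, a_t) * G_t for every step.
        for (s, a, _), g_t in zip(trajectory, returns):
            probs = action_probs(theta, s)
            for b in range(N_ACTIONS):
                score = (1.0 if b == a else 0.0) - probs[b]
                theta[s][b] += alpha * score * g_t
    return theta

theta = reinforce()
p_right = [action_probs(theta, s)[1] for s in range(N_STATES)]
```

After training, `p_right` should favour moving right in every state, since staying put only accumulates more negative reward.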

### 3. "Vanilla" policy gradient
We can also introduce a baseline $b(s_t)$ to reduce the variance of the gradient estimate. The great thing about this is that the gradient estimator remains unbiased for any choice of baseline, as long as the baseline depends only on the state and not on the action.

**"Vanilla" policy gradient** algorithm:
- initialise a policy with parameter $\theta$ and a baseline $b$
- for iteration = 1, 2, ... do 
    - collect a set of episodes by executing the current policy 
    - for each episode {$s_1,a_1,r_1,...,s_{T-1},a_{T-1},r_{T-1}$} $\sim \pi_\theta$ do
        - for $t=1$ to $T-1$ do
            - the return $R_t = \sum_{t'=t}^{T-1} r_{t'}$, and
            - the advantage estimate $\hat{A}_t = R_t - b(s_t)$
    - re-fit the baseline by minimising $||b(s_t)-R_t||^2$ summed over all episodes and timesteps
    - update the policy using the policy gradient estimate $\hat{g} = \sum_t  \nabla_\theta \log \pi_\theta(s_t,a_t) \hat{A}_t$
        - $\theta \leftarrow \theta + \alpha \hat{g} $
        - (in practice, plug $\hat{g}$ into SGD or Adam)
- return $\theta$
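
A minimal sketch of this loop, under several assumptions that are not part of the algorithm itself: the environment is a made-up three-state corridor (action 1 moves right, action 0 stays, every step costs -1), the policy is a tabular softmax, the baseline is tabular (so the least-squares fit is simply the mean return per state), and $\hat{g}$ is averaged over the episodes in the batch and applied with a plain step size `alpha`.

```python
import math
import random

N_STATES, N_ACTIONS, MAX_STEPS = 3, 2, 20  # hypothetical corridor environment

def action_probs(theta, s):
    m = max(theta[s])
    exps = [math.exp(p - m) for p in theta[s]]
    z = sum(exps)
    return [e / z for e in exps]

def run_episode(theta, rng):
    s, trajectory = 0, []
    for _ in range(MAX_STEPS):
        a = 0 if rng.random() < action_probs(theta, s)[0] else 1
        trajectory.append((s, a, -1.0))
        if a == 1:
            s += 1
        if s == N_STATES:
            break
    return trajectory

def vanilla_policy_gradient(n_iters=200, batch_size=10, alpha=0.05, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    baseline = [0.0] * N_STATES
    for _ in range(n_iters):
        # Collect a batch of episodes and the return R_t from every time step.
        samples = []  # (state, action, R_t)
        for _ in range(batch_size):
            ret = 0.0
            for s, a, r in reversed(run_episode(theta, rng)):
                ret += r
                samples.append((s, a, ret))
        # Gradient estimate using advantages A_t = R_t - b(s_t).
        g_hat = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
        for s, a, ret in samples:
            advantage = ret - baseline[s]
            probs = action_probs(theta, s)
            for b in range(N_ACTIONS):
                score = (1.0 if b == a else 0.0) - probs[b]
                g_hat[s][b] += score * advantage / batch_size
        # Re-fit the baseline: the per-state mean minimises the squared error.
        for s in range(N_STATES):
            rets = [ret for s2, _, ret in samples if s2 == s]
            if rets:
                baseline[s] = sum(rets) / len(rets)
        # Gradient ascent step on the policy parameters.
        for s in range(N_STATES):
            for b in range(N_ACTIONS):
                theta[s][b] += alpha * g_hat[s][b]
    return theta, baseline

theta, baseline = vanilla_policy_gradient()
p_right = [action_probs(theta, s)[1] for s in range(N_STATES)]
```

With a function approximator instead of a table, the baseline fit would be a regression step and the parameter update would typically be handed to an optimiser.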

Computing $\sum_t  \nabla_\theta \log \pi_\theta(s_t,a_t) \hat{A}_t$ term by term is usually inefficient, so we want to batch the data and define a surrogate function on the current batch, treating the advantage estimates $\hat{A}_t$ as constants:
$$L(\theta) = \sum_t \log \pi_\theta(s_t,a_t) \hat{A}_t$$
Or we can include the value function fit error:
$$L(\theta) = \sum_t \big( \log \pi_\theta(s_t,a_t) \hat{A}_t -||V(s_t)-\hat{R}_t||^2 \big)$$
Then the policy gradient estimate becomes $\hat{g} = \nabla_\theta L(\theta)$.
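
To see that differentiating the first surrogate recovers $\sum_t \nabla_\theta \log \pi_\theta(s_t,a_t) \hat{A}_t$, the sketch below compares the two on a made-up batch with a tabular softmax policy. In practice an automatic differentiation library would differentiate $L(\theta)$; here central finite differences stand in for it, and all names and numbers are hypothetical.

```python
import math

def log_prob(theta, s, a):
    """log pi_theta(s, a) for a tabular softmax policy."""
    m = max(theta[s])
    log_z = m + math.log(sum(math.exp(p - m) for p in theta[s]))
    return theta[s][a] - log_z

def surrogate(theta, batch):
    """L(theta) = sum_t log pi_theta(s_t, a_t) * A_t, advantages held constant."""
    return sum(log_prob(theta, s, a) * adv for s, a, adv in batch)

def analytic_gradient(theta, batch):
    """g_hat = sum_t grad log pi_theta(s_t, a_t) * A_t, using the softmax score."""
    grad = [[0.0] * len(row) for row in theta]
    for s, a, adv in batch:
        for b in range(len(theta[s])):
            p_b = math.exp(log_prob(theta, s, b))
            grad[s][b] += ((1.0 if b == a else 0.0) - p_b) * adv
    return grad

def numeric_gradient(theta, batch, eps=1e-5):
    """Central-difference gradient of the surrogate (stand-in for autodiff)."""
    grad = [[0.0] * len(row) for row in theta]
    for s in range(len(theta)):
        for b in range(len(theta[s])):
            up = [row[:] for row in theta]
            down = [row[:] for row in theta]
            up[s][b] += eps
            down[s][b] -= eps
            grad[s][b] = (surrogate(up, batch) - surrogate(down, batch)) / (2 * eps)
    return grad

theta = [[0.2, -0.1], [0.0, 0.3]]                   # made-up parameters
batch = [(0, 1, 1.5), (0, 0, -0.5), (1, 1, 2.0)]    # made-up (s_t, a_t, A_t)
g_analytic = analytic_gradient(theta, batch)
g_numeric = numeric_gradient(theta, batch)
```
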

## Exercise 
Again, we will be using the [Stanford reinforcement learning course assignment](http://web.stanford.edu/class/cs234/assignment3/index.html) for the exercise.