# Math

<h3>Policy Gradient Methods</h3>

In this notebook of the series we are going to discuss another method of solving the control problem called policy gradient methods

We have so far been parameterising the value function but it's plausible that we can parameterise the policy as well

In particular we want to find the optimal policy $\pi^*$

At first this might seem like a weird idea compared to what we were doing before

Recall that our current strategy is policy iteration

We iteratively switch between policy evaluation, which means finding the value function given the current policy, and policy improvement, which means acting greedily with respect to the current value function

We've seen that this converges so that we get the optimal value function, for which the optimal policy is just taking the arg max over actions of this optimal value function

---

So what would a parameterized policy look like?

Well we know that the policy has to be some kind of probability $\pi(a \vert s)$ 

In particular we can score each action $a$ using a linear model or any other kind of model

$$score_j = f(a_j,s,\theta) (= \varphi(s)^T\theta_j \ if \ linear)$$

And then as we know from deep learning we can use the softmax function to turn these action scores into probabilities

$$\pi(a_j \vert s) = \frac{\exp(score_j)}{\sum_{j'} \exp(score_{j'})}$$

This ensures that all the probabilities sum to one
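As a minimal sketch of the linear case, the scoring and softmax steps above could look like this in NumPy. The feature size, the number of actions and the names `softmax_policy`, `phi_s` and `theta` are assumptions made up for illustration

```python
import numpy as np

def softmax_policy(phi_s, theta):
    """Return pi(a_j | s) for every action j.

    phi_s: feature vector phi(s), shape (D,)
    theta: weight matrix, shape (D, K), one column theta_j per action
    """
    scores = phi_s @ theta             # score_j = phi(s)^T theta_j
    scores = scores - scores.max()     # shift for numerical stability, result is unchanged
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

rng = np.random.default_rng(0)
phi_s = rng.normal(size=4)             # hypothetical 4-dimensional feature vector
theta = rng.normal(size=(4, 3))        # hypothetical problem with 3 actions
probs = softmax_policy(phi_s, theta)
action = rng.choice(len(probs), p=probs)   # sample an action from the policy
```

Subtracting the maximum score before exponentiating is a common stability trick and does not change the probabilities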

---

<h3>Policy Gradient Objective</h3>

For a policy to be optimal it needs to have some objective

This is something we should be used to from machine learning 

Most machine learning methods we've looked at start by trying to optimise some objective

If our model is differentiable then we can use gradient descent or gradient ascent to reach our objective

Because this notebook is on policy gradient methods, we are of course going to be taking a similar approach with respect to the policy

The big question is: what should this objective be?

Consider that we start in some start state $s_0$ 

As we know we want to maximize the total return of the entire episode which is $V(s_0)$

Also remember that the value function $V$ is also dependent on the policy $\pi$, so we can explicitly show that by subscripting $V$ with $\pi$ 

$$Performance : \eta(\theta_p) = V_\pi(s_0)$$

Unfortunately the convention is to use the letter $\eta$ for the policy objective, even though we've used it for other purposes in past notebooks

Just remember that $\eta$ in this notebook means policy objective which we usually call the performance

Note that $\theta_p$ here means the parameters we are using to parameterize the policy 

We are subscripting the policy parameters with $p$ since the value function will also have a set of parameters which we'll call $\theta_v$


$$Policy : \pi(a \vert s,\theta_p) = f(s;\theta_p)$$

$$Value \ Function \ Approximation : \hat V_\pi(s) = f(s;\theta_v)$$

---

<h3>Policy Gradient Methods</h3>

The next few steps are unfortunately not straightforward at all

However the interpretation makes intuitive sense

The important part here is more about being able to implement the algorithm in code so that we have yet another tool for our reinforcement learning toolbox

---

<h3>Policy Gradient Theorem</h3>

It can be shown that the gradient of the performance takes this form which is dependent on the gradient of the policy itself which is convenient

$$\nabla \eta(\theta_p) = E \left[ \sum_a Q_\pi(s,a) \nabla_{\theta_p} \pi(a \vert s, \theta_p) \right] $$

This is called the policy gradient theorem

---

<h3>Policy Gradient Methods</h3>

What we can do is manipulate this equation by multiplying and dividing by $\pi$

$$\nabla \eta (\theta_p) = E \left[\sum_a \pi(a \vert s, \theta_p) Q_\pi(s,a) \nabla_{\theta_p} \pi(a \vert s,\theta_p) \frac{1}{\pi(a \vert s,\theta_p)}\right]$$

Once we do this we can see that the summation is actually just another expected value over $\pi$

$$\nabla \eta (\theta_p) = E \left[E \left\{Q_\pi(s,a)\nabla_{\theta_p} \pi(a \vert s,\theta_p) \frac{1}{\pi(a \vert s, \theta_p)}\right\}\right]$$

But the expected value of an expected value is still just an expected value, so we can make it one expected value

$$\nabla \eta (\theta_p) = E \left[Q_\pi(s,a) \nabla_{\theta_p} \pi(a \vert s,\theta_p) \frac{1}{\pi(a \vert s,\theta_p)}\right]$$

---

What we can further do is use an identity from calculus 

The gradient of $\log f$ is the gradient of $f$ divided by $f$

$$\nabla \log f(x) = \frac{\nabla f(x)}{f(x)}$$

Apply this identity to

$$\nabla \eta (\theta_p) = E \left[Q_\pi(s,a) \nabla_{\theta_p} \pi(a \vert s,\theta_p) \frac{1}{\pi(a \vert s,\theta_p)}\right]$$

to get 

$$\nabla \eta (\theta_p) = E \left[Q_\pi(s,a) \nabla_{\theta_p} \log \pi(a \vert s, \theta_p)\right]$$
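To convince ourselves that the rewriting is valid, here is a small numerical check for a single state, assuming a softmax policy whose parameters are the action scores themselves. The numbers in `theta` and `q` are made up; the point is only that $\sum_a Q \nabla \pi$ and $E_\pi[Q \nabla \log \pi]$ give the same vector

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.array([0.5, -1.0, 2.0])    # hypothetical scores, one per action
q = np.array([1.0, 3.0, -2.0])        # hypothetical Q(s, a) values
pi = softmax(theta)
K = len(theta)

# Softmax Jacobian: d pi_a / d theta_b = pi_a * (1[a == b] - pi_b), row a is grad pi_a
grad_pi = np.diag(pi) - np.outer(pi, pi)
# Row a is grad log pi_a = e_a - pi (pi is broadcast across the rows)
grad_log_pi = np.eye(K) - pi

# Form from the policy gradient theorem: sum_a Q(s,a) grad pi(a|s)
sum_form = (q[:, None] * grad_pi).sum(axis=0)
# Form after the log trick: sum_a pi(a|s) Q(s,a) grad log pi(a|s)
expectation_form = (pi[:, None] * q[:, None] * grad_log_pi).sum(axis=0)

print(np.allclose(sum_form, expectation_form))
```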

---

The last step is to realize that $Q$ is actually the expected value of the return $G$

So we can replace that with $G$ itself since it all goes inside the expected value

$$\nabla \eta(\theta_p) = E \left[G \nabla_{\theta_p} \log \pi(a \vert s,\theta_p)\right]$$

---

Now we have an expression full of stuff we can actually use 

$G$ which is the return we get from playing an episode and $\pi$ which is our parameterized policy

So what we would do is we would play an episode calculate the returns and then perform gradient ascent

Notice that this is gradient ascent and not gradient descent because we are trying to maximize the total return not minimize it

---

In fact we could do this as batch gradient ascent because by the time the episode is over we have all the returns

In fact this is suggested by the expected value symbol as well

We know from before that an expected value can be approximated by sample mean

$$E(X) \approx \frac{1}{N} \sum^N_{n=1} X_n$$

$$\nabla \eta (\theta_p) \approx \frac{1}{T} \sum^T_{t=1} G_t \nabla_{\theta_p} \log \pi(a_t \vert s_t,\theta_p)$$

---

But also remember that tensorflow is going to take gradients for us 

In particular we want just one expression we can pass in as the cost to the optimizer 

To turn what we have into that form, we realize that $G$ is a constant with respect to $\theta_p$ so it can be moved inside the gradient

$$\frac{1}{T} \sum^T_{t=1} G_t \nabla_{\theta_p} \log \pi(a_t \vert s_t, \theta_p) = \frac{1}{T} \sum^T_{t=1} \nabla_{\theta_p} G_t \log \pi(a_t \vert s_t,\theta_p)$$

---

We also know that the derivative of a sum is just the sum of all the individual derivatives

So we can move the gradient outside the sum

$$\frac{1}{T} \sum^T_{t=1} \nabla_{\theta_p} G_t \log \pi(a_t \vert s_t,\theta_p) = \frac{1}{T} \nabla_{\theta_p} \sum^T_{t=1} G_t \log \pi (a_t \vert s_t,\theta_p)$$

---

And finally we know that $\frac{1}{T}$ is a meaningless constant because it can be absorbed into the learning rate, so we can get rid of that too

And so finally we have an expression for the thing we want to maximize

$$maximise : \sum^T_{t=1} G_t \log \pi(a_t \vert s_t, \theta_p)$$

$$minimise : - \sum^T_{t=1} G_t \log \pi (a_t \vert s_t,\theta_p)$$

Since tensorflow optimizers only have a `minimize` function we can minimize the negative of this

And to be clear capital T represents the length of an episode and the index lowercase t represents the t-th timestep of an episode 

Because this involves the sum of returns over an entire episode, this is a Monte-Carlo method
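A minimal NumPy sketch of the quantity we would hand to the optimizer is given below. It assumes discounted returns with a discount factor `gamma` (the derivation above does not fix one) and assumes we have already recorded $\pi(a_t \vert s_t, \theta_p)$ for the actions that were actually taken; the function names are hypothetical

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t for every timestep by working backwards through the episode."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def policy_loss(returns, action_probs):
    """Negative of sum_t G_t log pi(a_t | s_t); this is the value to minimize."""
    return -np.sum(returns * np.log(action_probs))

rewards = [1.0, 1.0, 1.0]                       # toy 3-step episode
returns = discounted_returns(rewards, gamma=0.5)
action_probs = np.array([0.5, 0.5, 0.5])        # pi(a_t | s_t) of the actions taken
loss = policy_loss(returns, action_probs)
```

In a framework with automatic differentiation the same expression would be built from the policy model's outputs, with the returns fed in as constants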

---

<h3>Intuition</h3>

To gain better intuition about the gradient ascent update rule, it helps to look at what it would look like if we were to do stochastic gradient ascent or in other words the update for just one return, one state and one action

$$\theta_{p,t+1} = \theta_{p,t} + \alpha G_t \frac{\nabla \pi(a_t \vert s_t)}{\pi(a_t \vert s_t)} (= \theta_{p,t} + \alpha G_t \nabla \log \pi(a_t \vert s_t))$$

So there are three terms here that affect the new value of $\theta$ 

The return $G$, the gradient of $\pi$, $\nabla \pi(a_t \vert s_t)$ and $\pi$ itself $\pi(a_t \vert s_t)$

Remember that $\pi$ is the probability of choosing an action $a$ given state $s$ using the current policy

---

First consider $G$ the return

We are moving in a direction proportional to $G$

The bigger $G$ is the bigger step we take

This is good because we want to maximize our reward

Second consider $\pi$ the probability of choosing the action

We are moving in a direction inversely proportional to $\pi$

This is good because if $\pi$ is small but the return is good then we can take an even bigger step in that direction

And finally the gradient of $\pi$, $\nabla \pi$ is a vector so that gives us the actual direction we want to go

The gradient tells us the direction of greatest increase in $\pi$
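The single-sample update can be sketched for a linear softmax policy, where $\nabla \log \pi(a \vert s)$ has the closed form $\varphi(s)(1[j=a] - \pi_j)$ for each column $\theta_j$. The features, step size and the name `reinforce_step` are assumptions for illustration only

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reinforce_step(theta, phi_s, action, G, alpha=0.1):
    """One update theta <- theta + alpha * G * grad log pi(action | s)."""
    probs = softmax(phi_s @ theta)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    # Gradient of log pi(action | s) with respect to theta, shape (D, K)
    grad_log_pi = np.outer(phi_s, one_hot - probs)
    return theta + alpha * G * grad_log_pi

phi_s = np.array([1.0, 0.5])          # hypothetical features of the state
theta = np.zeros((2, 3))              # 3 actions, all equally likely at first
before = softmax(phi_s @ theta)[1]
theta_new = reinforce_step(theta, phi_s, action=1, G=2.0)
after = softmax(phi_s @ theta_new)[1]
print(before, after)
```

With a positive return the probability of the chosen action goes up, and with a negative return the same step would push it down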

---

<h3>What about V(s)?</h3>

We'll notice that earlier in this section, we mentioned using an approximation of $V(s)$ as well but so far that hasn't come into play

One common modification of the policy gradient that we are going to use is to add a baseline

So instead of our constant being just $G$ it'll be $G-V(s)$, where $V(s)$ is our prediction of the value at state $s$

$$maximise : \sum^T_{t=1} (G_t - V(s_t)) \log \pi(a_t \vert s_t, \theta_p)$$

The baseline can actually be any function that depends only on $s$, but of course since we already know about $V$, it seems the most appropriate choice

We call this difference between $G$ and $V$ the advantage

The reason we want to add a baseline is because it has been shown to significantly reduce the variance of the update rule

This in turn has been shown to speed up learning

---

The parameters of $V$ are of course just updated using gradient descent as usual

$$\theta_{V,t+1} = \theta_{V,t} + \alpha(G_t - V_t) \nabla V_t$$
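For a linear value model $V(s) = \theta_v^T \varphi(s)$ the gradient $\nabla V$ is just $\varphi(s)$, so the update above can be sketched as follows. The feature values, learning rate and function names are hypothetical

```python
import numpy as np

def value(phi_s, theta_v):
    """Linear value estimate V(s) = theta_v^T phi(s)."""
    return phi_s @ theta_v

def update_value(theta_v, phi_s, G, alpha=0.1):
    """One step theta_v <- theta_v + alpha * (G - V(s)) * grad V(s)."""
    advantage = G - value(phi_s, theta_v)      # also the weight used in the policy update
    return theta_v + alpha * advantage * phi_s, advantage

phi_s = np.array([1.0, 2.0])       # hypothetical features of the state
theta_v = np.zeros(2)
theta_v_new, adv = update_value(theta_v, phi_s, G=5.0)
print(adv, value(phi_s, theta_v_new))
```

After the step the prediction for this state has moved towards the observed return, so the next advantage computed for it would be smaller in magnitude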

---

<h3>What about TD methods?</h3>

A natural question at this point is can we convert this from a Monte Carlo method to a TD method so that we don't have to wait for an episode to end before doing any updates

Of course this is possible and in reinforcement learning this has a special name, the actor critic method

It's called actor critic because we think of the policy as the actor and the TD error which depends on the value estimate as the critic

So in the updates, all we do is replace $G$ with the one step estimate of $G$

$$\theta_{p,t+1} = \theta_{p,t} + \alpha(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)) \nabla \log \pi(a_t \vert s_t)$$

$$\theta_{V,t+1} = \theta_{V,t} + \alpha(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)) \nabla V(s_t)$$
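The two updates can be sketched together for a linear softmax actor and a linear critic. Two things here are assumptions that go beyond the equations above: separate learning rates `alpha_p` and `alpha_v`, and treating $V(s_{t+1})$ as zero when the episode has ended

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def actor_critic_step(theta_p, theta_v, phi_s, action, reward, phi_s_next, done,
                      gamma=0.99, alpha_p=0.01, alpha_v=0.1):
    """One TD(0) actor critic update for linear models."""
    v_s = phi_s @ theta_v
    v_next = 0.0 if done else phi_s_next @ theta_v   # assumed: terminal value is 0
    td_error = reward + gamma * v_next - v_s          # the critic's signal

    probs = softmax(phi_s @ theta_p)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    grad_log_pi = np.outer(phi_s, one_hot - probs)    # grad of log pi(action | s)

    theta_p = theta_p + alpha_p * td_error * grad_log_pi   # actor update
    theta_v = theta_v + alpha_v * td_error * phi_s         # critic update
    return theta_p, theta_v, td_error

theta_p, theta_v = np.zeros((2, 3)), np.zeros(2)
phi_s, phi_s_next = np.array([1.0, 0.0]), np.array([0.0, 1.0])
theta_p_new, theta_v_new, td = actor_critic_step(
    theta_p, theta_v, phi_s, action=1, reward=1.0, phi_s_next=phi_s_next, done=False)
```

Because each step only needs the current transition, this can be called inside the episode loop rather than after it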

---

<h3>Why use it?</h3>

Now that we've gone through the heavy parts of the policy gradient method let's talk about why we might want to use it

We know that the policy gradient method yields a probabilistic policy 

This should be reminiscent of epsilon greedy which is also probabilistic

However it should be clear why the policy gradient method is more expressive 

With epsilon greedy all the suboptimal actions have the same probability of happening even though one might be better than the other 

With the policy gradient method, we can model this betterness directly

For example it might actually be optimal to do action one 90% of the time, action two 8% of the time and action three only 2% of the time
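To make the contrast concrete, here is a small sketch of the action probabilities that epsilon greedy implies. The action values in `q_values` are made up, and `epsilon_greedy_probs` is a hypothetical helper

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Probability of each action under epsilon greedy."""
    K = len(q_values)
    probs = np.full(K, epsilon / K)               # every action gets the exploration share
    probs[np.argmax(q_values)] += 1.0 - epsilon   # the greedy action gets the rest
    return probs

q_values = np.array([1.0, 0.5, -3.0])             # action two is far better than action three
eg = epsilon_greedy_probs(q_values, epsilon=0.1)
print(eg)
```

The two non-greedy actions end up with the same probability even though their values differ a lot, whereas a softmax policy is free to represent something like 90%, 8% and 2%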

---

In addition we should keep in mind that states themselves can be stochastic

One of the sources of this randomness is that the state does not give us the full information about the environment

For example in blackjack we don't know the dealer's next card

So the optimal action needs to be probabilistic to account for different possibilities

---

<h3>Summary</h3>

Now let's summarize this section since there was a lot of information in it

First we saw that we can parameterize the policy so that in effect we get a probabilistic policy using a softmax output

Next we saw that the objective that the policy tries to optimize is the expected return from the start state

In other words this is the expected return over the entire episode

We call this objective the performance

Next we looked at the policy gradient theorem

We manipulated the results of the policy gradient theorem in order to give us a single cost function that we could then input into tensorflow which is going to be helpful during implementation

Next, we looked at a modification of the policy gradient algorithm that uses a baseline and we call this difference between the return and the baseline the advantage

We then looked at the actor critic method which uses TD updates instead of Monte-Carlo updates

Finally we discussed why we might want to use policy gradient methods rather than policy iteration

It allows us to explicitly model arbitrary probabilistic policies when a probabilistic policy could in fact be the optimal policy

This could in turn be because of the fact that the state transitions are probabilistic

# code

# Math

In this section we're going to extend our knowledge of the policy gradient method by looking at continuous action spaces

---

<h3>Continuous Action Spaces</h3>

Remember that both of the environments we've looked at so far, cartpole and mountain car, have discrete action spaces

Luckily there is an environment called `MountainCarContinuous-v0` that gives us continuous action spaces, so we have something to test this out on

---

<h3>Policy Gradient Method</h3>

So if we think about our current policy model that allows us to choose from a discrete action space what's the main idea?

The main idea is that the output of the model is a probability distribution that we can then sample an action from

So if we want to have a continuous action space to sample from how might we go about creating a distribution for that, what distribution might be appropriate?

Well technically we could choose any continuous distribution but the gaussian seems like a good place to start

---

<h3>Continuous Policy</h3>

Remember that a gaussian has two parameters the mean and the variance

$$\pi(a \vert s) = \frac{1}{\sqrt{2 \pi \nu}} e^{-\frac{1}{2\nu}(a-\mu)^2} = N(a;\mu,\nu)$$

So in order to create a parameterized policy we would parameterize the mean and variance

Let's assume for the purposes of this section that both the mean and variance are linear in their parameters

So in other words 

$$\mu(s;\theta_\mu) = \theta^T_\mu \varphi(s)$$

$$\nu(s;\theta_\nu) = \exp(\theta_\nu^T \varphi(s))$$

Alternatively we could use the softplus function instead of the exponential

$$Alternative : \nu(s;\theta_\nu) = softplus(\theta_\nu^T \varphi(s))$$

$$softplus(a) = \ln(1+\exp(a))$$

---

So why would we want to use the exponential or softplus after multiplying the feature vector by $\theta_\nu$?

Well remember that the variance must be positive but when we use gradient descent $\theta_\nu$ is unbounded, so the dot product $\theta_\nu^T \varphi(s)$ can be negative

Therefore in order to make sure the variance is always positive we can exponentiate the output of the dot product or use the softplus
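A minimal sketch of this Gaussian policy with a linear mean and a softplus variance is shown below. The feature values and parameters are made up, and `np.logaddexp(0, a)` is used as a numerically stable way of computing $\ln(1+\exp(a))$

```python
import numpy as np

def softplus(a):
    # ln(1 + exp(a)), computed without overflow for large a
    return np.logaddexp(0.0, a)

def gaussian_policy(phi_s, theta_mu, theta_nu):
    """Return the mean and variance of pi(a | s)."""
    mu = theta_mu @ phi_s                 # mean is linear in the features
    nu = softplus(theta_nu @ phi_s)       # variance is forced to be positive
    return mu, nu

def sample_action(mu, nu, rng):
    # rng.normal takes a standard deviation, so we pass sqrt(variance)
    return rng.normal(loc=mu, scale=np.sqrt(nu))

rng = np.random.default_rng(0)
phi_s = np.array([1.0, -2.0, 0.5])        # hypothetical features of the state
theta_mu = np.array([0.1, 0.2, 0.3])
theta_nu = np.array([-1.0, 2.0, 0.0])     # dot product is negative here
mu, nu = gaussian_policy(phi_s, theta_mu, theta_nu)
action = sample_action(mu, nu, rng)
```

Even though the dot product for the variance is negative in this example, the resulting variance is still positive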

---

<h3>Softplus</h3>

Recall that the softplus is asymptotically the same as `ReLU` for large inputs, but it stays strictly positive and only approaches zero for very negative inputs, which is useful for a variance

<img src="https://upload.wikimedia.org/wikipedia/commons/6/6c/Rectifier_and_softplus_functions.svg" width="450">

This might be a little better than the exponential since the exponential grows very fast if its input is large

---

<h3>Policy Gradient</h3>

Once we've defined the model we have to figure out how to update it

Luckily the principles of the policy gradient method don't change just because we've redefined the policy

$$\theta = \theta + \alpha \sum^T_{t=1} (G_t - V(s_t)) \nabla \log \pi(a_t \vert s_t)$$

$$\theta_{t+1} = \theta_t + \alpha(r_{t+1} + \gamma V(s_{t+1}) - V(s_t)) \nabla \log \pi(a_t \vert s_t)$$

We still have the same cost and updates as before

And of course we can replace plain gradient descent with any of its modifications like `AdaGrad` or `RMSprop`
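The only piece that changes for the Gaussian policy is $\log \pi$ itself. As a sketch, with a linear mean the gradient with respect to $\theta_\mu$ works out to $\frac{a-\mu}{\nu}\varphi(s)$; the numbers below are hypothetical and the variance is held fixed to keep the example short

```python
import numpy as np

def gaussian_log_prob(a, mu, nu):
    """log N(a; mu, nu) where nu is the variance."""
    return -0.5 * np.log(2.0 * np.pi * nu) - (a - mu) ** 2 / (2.0 * nu)

def grad_log_prob_wrt_theta_mu(a, phi_s, theta_mu, nu):
    """Gradient of log pi(a | s) with respect to theta_mu for a linear mean."""
    mu = theta_mu @ phi_s
    return (a - mu) / nu * phi_s

phi_s = np.array([1.0, -0.5])      # hypothetical features of the state
theta_mu = np.array([0.2, 0.4])    # gives a mean of 0 for this state
nu = 0.5                           # fixed variance for this example
a = 1.0                            # the action that was sampled
grad = grad_log_prob_wrt_theta_mu(a, phi_s, theta_mu, nu)
print(grad)
```

The gradient points in the direction that moves the mean towards the sampled action, and the update then scales it by the advantage or the TD error exactly as before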

---

<h3>Worth repeating</h3>

This is an important idea so it's worth saying again 

The policy gradient method doesn't change just because we redefine the policy

The policy can be any function we make it, and the policy gradient method still applies

So we can have any kind of output distribution any kind of parameterization

The method remains the same

---

<h3>Final thought</h3>

One last thing to think about 

This wouldn't be possible if we didn't have policy gradient methods

Consider regular $Q$-learning with function approximation 

Function approximation of the action value $Q$ lets us handle infinite state spaces but not infinite action spaces

Remember our earlier strategy

We created a different linear model for each action and then took the $\arg\max$ in order to determine what action to do at any step

Deep $Q$-learning is the same where you have a neural network with multiple output nodes, one for each separate action

Because we need a separate output or model for each action, this clearly can't scale to an infinite number of actions

So with $Q$-learning as we've implemented it a continuous action space is impossible but with policy gradient methods it is possible

# Math

In this section we're going to go through more details on how we're going to apply a parameterised policy to the continuous version of mountain car

---

<h3>MountainCarContinuous-v0</h3>

It's good to go over what the differences are between discrete mountain car and continuous mountain car since they are not obvious 

Specifically the reward structure for mountain car continuous is different

We can also see this information on the wiki page <a href="https://github.com/openai/gym/wiki/MountainCarContinuous-v0">here</a>

So in continuous mountain car if we get to the goal point then we automatically get a reward of 100

So unlike discrete mountain car our reward can be positive

However subtracted from the 100 is the sum of squared actions we take

So the larger the magnitude of our actions, the more each step we take is penalized

So our agent should be incentivized to take smaller actions

The environment is considered solved if we can reach an average total reward of 90 over 100 consecutive episodes

Note that while the action space is continuous our action must be between -1 and 1

---

<h3>Policy Gradient?</h3>

Next thing is, while we could implement the policy gradient theory directly we are instead going to take another approach

The problem with gradient descent or gradient ascent is that it doesn't seem to converge nicely in this scenario

What we'll do instead is a special type of random search called hill climbing

---

<h3>Hill Climbing</h3>

Hill climbing works as follows

Suppose we only have two parameters so that we can show the parameters on a 2D grid

First we initialize the parameters randomly, this is the green dot

<img src="extras/62.0.PNG" width="400">

Then we play one episode or a few episodes with these parameters 

Playing fewer episodes will increase the variance of our estimates and playing more episodes will decrease the variance yielding better estimates but it will take longer

The reason we want to play a few episodes is to get an idea if these parameters are good or not

If the parameters are better than the best parameters we found so far then we set these as the best parameters, otherwise we stick with the original

---

So next thing we do is we generate a new set of parameters by randomly perturbing the current best parameters

<img src="extras/62.1.PNG" width="400">

So basically this new set of parameters will be in the neighborhood of the current best parameters

We then test these new parameters to see if they yield a good average return

If they do then we set these to be the current best parameters and so on

<img src="extras/62.2.PNG" width="400">

---

Eventually we will have tested many sets of parameters, sort of walking on a path up a hill, hence the term hill climbing

<img src="extras/62.3.PNG" width="400">

Note that the path is always going to go in an increasing direction of the average return
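The procedure can be sketched in a few lines. To keep the example self-contained, the `evaluate` function here is a made-up toy objective standing in for "play a few episodes with these parameters and average the returns"; the step size and iteration count are also assumptions

```python
import numpy as np

def hill_climb(evaluate, dim, n_iters=200, step_size=0.5, seed=0):
    """Random-search hill climbing: keep a perturbation only if it scores better."""
    rng = np.random.default_rng(seed)
    best_params = rng.normal(size=dim)            # random initialization
    best_score = evaluate(best_params)
    history = [best_score]
    for _ in range(n_iters):
        # New candidate in the neighborhood of the current best parameters
        candidate = best_params + step_size * rng.normal(size=dim)
        score = evaluate(candidate)
        if score > best_score:                    # otherwise stick with the current best
            best_params, best_score = candidate, score
        history.append(best_score)
    return best_params, best_score, history

def toy_return(params):
    # Hypothetical stand-in for the average return, maximized at (1, -2)
    return -np.sum((params - np.array([1.0, -2.0])) ** 2)

best_params, best_score, history = hill_climb(toy_return, dim=2)
print(best_params, best_score)
```

The recorded `history` never decreases, which matches the picture of a path that only ever moves uphill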

---

<h3>When did we first see this?</h3>

We may also recall from deep learning notebooks that this is one of the methods we suggested could be used to find a good set of hyper parameters for training a neural network

So this is in fact just another optimization technique not unique to reinforcement learning but particularly helpful in some situations

We will see that this method generally yields more consistent results than gradient descent and it beats trying to do a hyper parameter search

However since we have all the skills we need to build a parameterized policy and perform gradient ascent on the performance, we are encouraged to try it and see if we can find good hyperparameters

We may try using both neural networks and linear models with radial basis function kernels

# code

# code

# Math

In this section we are going to summarize everything we've learned in this notebook which was focused on the policy gradient method

---

<h3>PG Summary</h3>

The difference between policy gradient methods and non-policy gradient methods is that instead of just modeling the value function ($V(s)$ or $Q(s,a)$) we also create a model for the policy $\pi(a \vert s)$

This allows us to create a probabilistic policy naturally rather than doing something like epsilon greedy where we have to choose epsilon which we might end up doing suboptimally

---

We started this notebook by just looking at how we would parameterize the policy or in other words make up a function to model the policy with some learnable parameters

Then we asked well how can we optimize this policy?

For that we needed to create an objective function for the policy

Once we did that we saw that it was just business as usual: we learn by playing some episodes and doing gradient ascent on the objective

---

One benefit of using policy gradient methods is that they very naturally extend to continuous action spaces

So instead of doing softmax at the output layer of our model we could just output a mean and variance representing a gaussian and then sample from that gaussian to get an action