<h1>Math</h1>

In this notebook we are going to discuss another way of solving the control problem: policy gradient methods

---

<h3>Policy Gradient Methods</h3>

We have so far been parameterising the value function but it's plausible that we can parameterise the policy as well

In particular we want to find the optimal policy $\pi^*$

At first this might seem like a weird idea compared to what we were doing before

Recall that our current strategy is policy iteration

We iteratively switch between policy evaluation which means finding the value function given the current policy and policy improvement which means acting greedily with respect to the current value function

We've seen that this converges so that we get the optimal value function for which the optimal policy is just taking the arg max of this optimal value function

---

So what would a parameterized policy look like?

Well we know that the policy has to be some kind of probability $\pi(a \vert s)$ 

In particular we can score each action $a$ using a linear model or any other kind of model

$$\large score_j = f(a_j,s,\theta) (= \varphi(s)^T\theta_j \text{ if linear})$$

And then as we know from deep learning we can use the softmax function to turn these action scores into probabilities

$$\large \pi(a_j \vert s) = \frac{\exp(score_j)}{\sum_{j^\prime} \exp(score_{j^\prime})}$$

This ensures that all the probabilities sum to one
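
As a minimal sketch of the linear softmax policy described above, the NumPy code below scores each action and converts the scores to probabilities. The feature vector, the weight shapes and the names `softmax` and `linear_softmax_policy` are assumptions made for illustration only.

```python
import numpy as np

def softmax(scores):
    """Turn a vector of action scores into probabilities."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

def linear_softmax_policy(features, theta):
    """pi(a_j | s) for a linear model: score_j = features . theta[:, j]."""
    scores = features @ theta
    return softmax(scores)

rng = np.random.default_rng(0)
features = np.array([1.0, 0.5, -0.2])  # phi(s), a hypothetical feature vector
theta = rng.normal(size=(3, 4))        # one column of weights per action (4 actions)

probs = linear_softmax_policy(features, theta)
action = rng.choice(len(probs), p=probs)  # sample an action from the policy
```

Subtracting the maximum score before exponentiating does not change the probabilities, it only avoids overflow for large scores.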

---

For a policy to be optimal it needs to have some objective

This is something we should be used to from machine learning 

Most machine learning methods we've looked at start by trying to optimize some objective

If our model is differentiable then we can use gradient descent or gradient ascent to optimize our objective

Because this notebook is on policy gradient methods, we are of course going to be taking a similar approach with respect to the policy

The big question is: what should this objective be?

---

Suppose we start in some start state $s_0$.

As you know, we want to maximize the total return of the entire episode, which is $V(s_0)$.

Also remember that the value function $V$ depends on the policy $\pi$, so we can show that explicitly by writing $V_\pi$.

An unfortunate convention is that the letter $\eta$ (eta) is used for the policy objective.

Since we've used it for other purposes in past courses, just remember that $\eta$ in this course means the policy objective, which we usually call the performance.

The performance is written $\eta(\theta_P)$, where $\theta_P$ denotes the parameters we use to parameterize the policy.

We subscript the policy parameters with $P$ because the value function will also have its own set of parameters, which we'll call $\theta_V$.
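
In symbols, the objective described above can be written as follows. This is reconstructed from the description, with $s_0$ the start state, $\theta_P$ the policy parameters and $V_\pi$ the value function under the policy $\pi$.

$$\large \eta(\theta_P) = V_\pi(s_0)$$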

---

The next few steps are unfortunately not straightforward at all.

However, the interpretation makes intuitive sense, so if you just want to skim this part, that's fine.

The important part is being able to implement the algorithm in code, so that you have yet another tool in your reinforcement learning toolbox.

It can be shown that the gradient of the performance, $\nabla \eta(\theta_P)$, depends on the gradient of the policy itself, which is convenient.

This result is called the policy gradient theorem.

It expresses the gradient as an expected value of a sum over actions of $Q_\pi(s,a) \nabla \pi(a \vert s)$.

We can manipulate this expression by multiplying and dividing by $\pi$.

Once we do this, we can see that the summation over actions is actually just another expected value, this time over $\pi$.

But the expected value of an expected value is still just an expected value, so we can combine them into one.

We can then use an identity from calculus: the gradient of $\log f$ is the gradient of $f$ divided by $f$.

The last step is to realize that $Q$ is the expected value of the return $G$, so we can replace $Q$ with $G$ itself, since it all sits inside the expected value.
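
The steps above can be written out as follows. This is the standard form of the derivation rather than a copy of the original slides, so the exact notation is an assumption: $\eta(\theta_P)$ is the performance, $Q_\pi$ is the action-value function, and the outer expected value is over the states visited while following $\pi$. The first line holds up to a constant factor, which does not matter because it can be absorbed into the learning rate.

$$\large \begin{aligned}
\nabla \eta(\theta_P) &\propto E\left[ \sum_a Q_\pi(s,a) \nabla \pi(a \vert s, \theta_P) \right] \\
&= E\left[ \sum_a \pi(a \vert s, \theta_P) Q_\pi(s,a) \frac{\nabla \pi(a \vert s, \theta_P)}{\pi(a \vert s, \theta_P)} \right] \\
&= E\left[ Q_\pi(s,a) \frac{\nabla \pi(a \vert s, \theta_P)}{\pi(a \vert s, \theta_P)} \right] \\
&= E\left[ Q_\pi(s,a) \nabla \log \pi(a \vert s, \theta_P) \right] \\
&= E\left[ G \nabla \log \pi(a \vert s, \theta_P) \right]
\end{aligned}$$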

Now we have an expression made entirely of things we can actually use: $G$, the return we get from playing an episode, and $\pi$, our parameterized policy.

So what we would do is play an episode, calculate the returns, and then perform gradient ascent.

Notice that this is gradient ascent and not gradient descent, because we are trying to maximize the total return, not minimize it.

In fact, you could do this as batch gradient ascent, because by the time the episode is over you have all the returns.

This is also suggested by the expected value in the gradient expression: we know from before that an expected value can be approximated by a sample mean.

---

Also remember that TensorFlow and Theano are going to take gradients for us.

In particular, we want just one expression that we can pass in as the cost to the optimizer.

To turn what we have into that form, first note that $G$ is treated as a constant with respect to $\theta_P$, so it can be moved inside the gradient.

We also know that the derivative of a sum is just the sum of the individual derivatives, so we can move the gradient outside the sum.

Finally, the factor $1/T$ is a constant that can be absorbed into the learning rate, so we can get rid of that too.

This leaves us with an expression for the thing we want to maximize: the sum over time steps of $G_t \log \pi(a_t \vert s_t)$.
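
Written out, the manipulation above looks like this. The indexing from $t = 1$ to $T$ is an assumption, where $T$ is the length of the episode and $G_t$, $s_t$, $a_t$ are the return, state and action at time step $t$.

$$\large \begin{aligned}
\nabla \eta(\theta_P) &\approx \frac{1}{T} \sum_{t=1}^{T} G_t \nabla \log \pi(a_t \vert s_t, \theta_P) \\
&= \nabla \left[ \frac{1}{T} \sum_{t=1}^{T} G_t \log \pi(a_t \vert s_t, \theta_P) \right]
\end{aligned}$$

Dropping the constant $1/T$ gives the quantity to maximize:

$$\large \sum_{t=1}^{T} G_t \log \pi(a_t \vert s_t, \theta_P)$$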

Since TensorFlow optimizers only have a `minimize` function, we minimize the negative of this expression.

To be clear, capital $T$ is the length of an episode and lowercase $t$ indexes the $t$-th time step of the episode.

Because this involves the sum of returns over an entire episode, this is a Monte Carlo method.
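
A minimal NumPy sketch of the two quantities involved may help: the return $G_t$ at every time step of a finished episode, and the negated sum of $G_t \log \pi(a_t \vert s_t)$ that would be handed to an optimizer as the cost. The function names, the discount factor `gamma` (set to 1 to get the plain total return) and the example numbers are assumptions for illustration, not part of the original material.

```python
import numpy as np

def compute_returns(rewards, gamma=1.0):
    """Return G_t for every time step of one finished episode."""
    returns = np.zeros(len(rewards))
    g = 0.0
    # Work backwards: G_t = r_t + gamma * G_{t+1}
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def policy_gradient_cost(action_probs, actions, returns):
    """Negative of sum_t G_t * log pi(a_t | s_t), suitable for a minimizer."""
    # Pick out the probability of the action actually taken at each step
    chosen = action_probs[np.arange(len(actions)), actions]
    return -np.sum(returns * np.log(chosen))

# Hypothetical 3-step episode with 2 actions
rewards = [1.0, 1.0, 1.0]
actions = np.array([0, 1, 0])
action_probs = np.array([[0.5, 0.5],
                         [0.25, 0.75],
                         [1.0, 0.0]])

returns = compute_returns(rewards, gamma=0.5)
cost = policy_gradient_cost(action_probs, actions, returns)
```

In a real implementation the probabilities would come from the policy model, and the automatic differentiation library would take the gradient of this cost with respect to the policy parameters.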

---

To gain better intuition about the gradient ascent update rule, it helps to look at what it would look like if we were to do stochastic gradient ascent, in other words the update for just one return, for one state and one action.

There are three terms that affect the new value of $\theta_P$: the return $G$, the gradient of $\pi$, and $\pi$ itself.

Remember that $\pi$ is the probability of choosing action $a$ given state $s$ under the current policy.

First, consider $G$, the return.

We move in a direction proportional to $G$: the bigger $G$ is, the bigger the step we take.

This is good because we want to maximize our reward.

Second, consider $\pi$, the probability of choosing the action.

We move in a direction inversely proportional to $\pi$.

This is good because if $\pi$ is small but the return is good, then we take an even bigger step in that direction.

Finally, the gradient of $\pi$ is a vector, so it gives us the actual direction we want to go.

The gradient tells us the direction of greatest increase in $\pi$.
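
For reference, the single-sample update being described can be written as below. The learning rate symbol $\alpha$ and the time index $t$ are assumptions, and the ratio $\nabla \pi / \pi$ is the same thing as $\nabla \log \pi$.

$$\large \theta_P \leftarrow \theta_P + \alpha \, G_t \, \frac{\nabla \pi(a_t \vert s_t, \theta_P)}{\pi(a_t \vert s_t, \theta_P)}$$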

---

You'll notice that earlier we mentioned approximating $V(s)$ as well, but so far that hasn't come into play.

One common modification of the policy gradient that we are going to use is to add a baseline.

So instead of our constant being just $G$, it will be $G - V(s)$, where $V(s)$ is our prediction of the value at state $s$.

The baseline can actually be any function that depends only on $s$, but since we already know about $V$, it seems the most appropriate choice.

We call this difference between $G$ and $V(s)$ the advantage.

The reason we want to add a baseline is that it has been shown to significantly reduce the variance of the update rule, which in turn has been shown to speed up learning.

The parameters of $V$ are updated using gradient descent as usual.
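
The following is a minimal sketch of one update with a baseline, assuming both the policy and the value function are linear in a feature vector. The function name `baseline_update`, the learning rates and the parameter shapes are hypothetical. For a linear softmax policy, the gradient of $\log \pi(a \vert s)$ with respect to the weights is the outer product of the features with (one-hot action minus action probabilities).

```python
import numpy as np

def softmax(scores):
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

def baseline_update(theta_p, theta_v, features, action, G, lr_p=0.01, lr_v=0.01):
    """One Monte Carlo policy gradient step with V(s) as the baseline."""
    probs = softmax(features @ theta_p)  # pi(. | s), theta_p has shape (D, A)
    v = features @ theta_v               # V(s), theta_v has shape (D,)
    advantage = G - v

    # Gradient of log pi(action | s) for a linear softmax policy
    onehot = np.zeros_like(probs)
    onehot[action] = 1.0
    grad_log_pi = np.outer(features, onehot - probs)

    # Policy: gradient ascent on advantage * log pi
    new_theta_p = theta_p + lr_p * advantage * grad_log_pi
    # Value: gradient descent on 0.5 * (G - V(s))^2
    new_theta_v = theta_v + lr_v * advantage * features
    return new_theta_p, new_theta_v, advantage

features = np.array([1.0, 0.0, 2.0])  # hypothetical phi(s)
theta_p = np.zeros((3, 2))
theta_v = np.zeros(3)
new_theta_p, new_theta_v, advantage = baseline_update(
    theta_p, theta_v, features, action=1, G=2.0)
```

With a positive advantage, the probability of the chosen action goes up and the value estimate moves toward the observed return.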

---

A natural question at this point is whether you can convert this from a Monte Carlo method to a TD method, so that you don't have to wait for an episode to end before doing any updates.

This is indeed possible, and in reinforcement learning it has a special name: the actor-critic method.

It's called actor-critic because we think of the policy as the actor and the TD error, which depends on the value estimate, as the critic.

In the updates, all we do is replace $G$ with the one-step estimate of $G$.
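
As a small sketch, the one-step estimate of $G$ is the reward plus the estimated value of the next state. The discount factor `gamma`, the `done` flag for terminal states and the function names are assumptions for illustration.

```python
def td_target(reward, next_state_value, done, gamma=0.99):
    """One-step estimate of G: r + gamma * V(s'), or just r at the end of an episode."""
    if done:
        return reward
    return reward + gamma * next_state_value

def td_advantage(reward, state_value, next_state_value, done, gamma=0.99):
    """Replaces G - V(s) in the policy update; this is the TD error."""
    return td_target(reward, next_state_value, done, gamma) - state_value

target = td_target(1.0, 2.0, done=False, gamma=0.5)
error = td_advantage(1.0, 1.5, 2.0, done=False, gamma=0.5)
```

Because the target only needs the next state, the update can be applied after every step instead of at the end of the episode.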

---

Now that we've gone through the heavy parts of the policy gradient method, let's talk about why you might want to use it.

We know that the policy gradient method yields a probabilistic policy, which should be reminiscent of epsilon-greedy, which is also probabilistic.

However, it should be clear why the policy gradient method is more expressive.

With epsilon-greedy, all the suboptimal actions have the same probability of happening, even though one might be better than another.

With the policy gradient method, we can model this "betterness" directly.

For example, it might actually be optimal to do action 1 90 percent of the time, action 2 8 percent of the time, and action 3 only 2 percent of the time.

In addition, we should keep in mind that states themselves can be stochastic.

One source of this randomness is that the state does not give you full information about the environment.

For example, in blackjack you don't know the dealer's next card.

So the optimal policy may need to be probabilistic to account for the different possibilities.
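
To make the comparison with epsilon-greedy concrete, here is a small sketch of the action probabilities each approach produces for the same three scores. It assumes the common convention that epsilon-greedy picks uniformly among all actions with probability epsilon, and it uses made-up scores in place of a learned model.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Probability of each action under epsilon-greedy."""
    n = len(q_values)
    probs = np.full(n, epsilon / n)              # uniform exploration share
    probs[np.argmax(q_values)] += 1.0 - epsilon  # the rest goes to the greedy action
    return probs

def softmax_probs(scores):
    """Probability of each action under a softmax policy."""
    exp_scores = np.exp(scores - np.max(scores))
    return exp_scores / exp_scores.sum()

scores = np.array([3.0, 2.0, 1.0])  # hypothetical scores for three actions
eg = epsilon_greedy_probs(scores, epsilon=0.1)
sm = softmax_probs(scores)
```

Under epsilon-greedy the second and third actions get the same probability, while the softmax policy ranks them.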

---

Now let's summarize this notebook, since there was a lot of information in it.

First, we saw that we can parameterize the policy, so that we get a probabilistic policy using a softmax output.

Next, we saw that the objective the policy tries to optimize is the expected return from the start state, in other words the expected return over the entire episode.

We call this objective the performance.

Next, we looked at the policy gradient theorem.

We manipulated the result of the policy gradient theorem to obtain a single cost function that we can pass to Theano or TensorFlow, which is going to be helpful during implementation.

Next, we looked at a modification of the policy gradient algorithm that uses a baseline, and we call the difference between the return and the baseline the advantage.

We then looked at the actor-critic method, which uses TD updates instead of Monte Carlo updates.

Finally, we discussed why we might want to use policy gradient methods rather than policy iteration.

They allow us to explicitly model arbitrary probabilistic policies, and a probabilistic policy could in fact be the optimal policy.

This could in turn be because the state transitions are probabilistic.
