# Math

In this next notebook of the series we are going to study yet another famous deep reinforcement learning algorithm known as A3C

---

<h3>A3C</h3>

A3C stands for Asynchronous Advantage Actor Critic

As we can see this term has three A's in it and one C hence the abbreviation A3C

What's pretty nice is we've already done most of the groundwork for this algorithm

In fact there's nothing new in terms of theory that we have to discuss 

In particular, we've covered everything we need to know already in the notebook on policy gradients 

As we may recall, the final form of the policy gradient algorithm we used was called the Actor Critic, where the actor was a neural network that parameterises the policy and the critic was another neural network that parameterises the value function

As we may recall, the Advantage is just the term we use for the difference between the actual return from state s and the predicted value of state s

In other words it's how much better the action we took is relative to what we currently believe the value is

So now we know where the terms Advantage and Actor Critic come from

---

<h3>Policy Gradient Review</h3>

First let's review more closely how policy gradients work

As we recall we're going to have two neural networks: the policy network and the value network

We can imagine that the policy network outputs $\pi$ the policy and the value network outputs $V$ the value

These networks have the weights $\theta_p$ and $\theta_v$ respectively

$$\pi(a \vert s,\theta_p) = NeuralNet(input : s, weights : \theta_p)$$

$$V(s,\theta_v) = NeuralNet(input : s, weights : \theta_v)$$

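The objective for the policy is derived by working backwards from the policy gradient itself, or in other words by taking the integral of the gradient, so that differentiating the objective gives back the policy gradient

In the loss below, $G$ is just the return, so $G - V(s)$ is the advantage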

$$L_p = - (G-V(s)) \log \pi(a \vert s,\theta_p)$$

G : return

Notice we've represented it as a loss here, or in other words something to minimise rather than maximise, since minimising is what we'll be doing in TensorFlow

On the other hand the value network loss is simpler it's just the squared error between the return and the predicted value

$$L_v = (G-V(s,\theta_v))^2$$
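To make the two losses concrete, here is a minimal sketch that evaluates them for a single transition. The function names and the numbers are hypothetical and only illustrate the formulas above; in a real implementation the advantage $G - V(s)$ is treated as a constant when differentiating the policy loss.

```python
import math

def policy_loss(G, v, prob_a):
    """L_p = -(G - V(s)) * log pi(a|s) for a single transition."""
    advantage = G - v  # treated as a constant w.r.t. the policy weights
    return -advantage * math.log(prob_a)

def value_loss(G, v):
    """L_v = (G - V(s))^2 for a single transition."""
    return (G - v) ** 2

# Hypothetical numbers: return 1.0, predicted value 0.5, pi(a|s) = 0.25
G, v, prob_a = 1.0, 0.5, 0.25
print(policy_loss(G, v, prob_a))  # 0.5 * log(4), roughly 0.693
print(value_loss(G, v))           # 0.25
```

A positive advantage combined with a low action probability gives a large policy loss, so minimising it raises the probability of that action.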

---

<h3>Pseudocode</h3>

Then during training, all we need to do is loop through each step of the game, sample an action and then update the weights of both neural networks, as we can see in the pseudocode

```python
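# Pseudocode: one episode of online actor-critic updates
while not done:
  a = pi.sample(s)                    # sample an action from the policy network
  s_prime, r, done = env.step(a)      # perform it in the environment
  G = r + gamma * V(s_prime)          # TD target; V(s_prime) is taken as 0 if done
  Lp = -(G - V(s)) * log(pi(a, s))    # policy loss for the action actually taken
  Lv = (G - V(s)) ** 2                # value loss (squared error)
  theta_p = theta_p - learning_rate * dLp / d theta_p
  theta_v = theta_v - learning_rate * dLv / d theta_v
  s = s_prime                         # move on to the next state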
```

So the pseudocode is as follows

While we are not done, we sample an action a from the policy network

Then we perform the action a in the environment and we get s' and r 

We set G equal to r + gamma * V(s')

We calculate the policy loss and the value loss

Finally we perform gradient descent to update $\theta_p$ and $\theta_v$
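As a runnable illustration of these steps, here is a minimal sketch on a made-up task. Everything in it is an assumption for illustration: the "environment" has a single state and one-step episodes (action 1 pays reward 1, action 0 pays 0), the policy is a two-logit softmax instead of a neural network, and the gradients are written out by hand rather than computed by TensorFlow.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_toy_actor_critic(episodes=500, learning_rate=0.1, gamma=0.99, seed=0):
    """Hypothetical one-state task: action 1 pays 1, action 0 pays 0."""
    rng = random.Random(seed)
    logits = [0.0, 0.0]  # theta_p: tabular policy parameters
    v = 0.0              # theta_v: value estimate of the single state
    for _ in range(episodes):
        done = False
        while not done:
            probs = softmax(logits)
            a = 0 if rng.random() < probs[0] else 1  # a = pi.sample(s)
            r, done = float(a), True                  # env.step(a): one-step episode
            v_next = 0.0                              # V(s') = 0 at a terminal state
            G = r + gamma * v_next
            advantage = G - v
            # Gradient descent on Lp = -(G - V(s)) * log pi(a|s), using
            # d log pi(a|s) / d logit_k = 1[k == a] - probs[k]
            for k in range(2):
                grad_log_pi = (1.0 if k == a else 0.0) - probs[k]
                logits[k] += learning_rate * advantage * grad_log_pi
            # Gradient descent on Lv = (G - V(s))^2, where dLv/dv = -2 * (G - v)
            v += learning_rate * 2.0 * advantage
    return softmax(logits), v

probs, v = train_toy_actor_critic()
print(probs, v)  # the probability of action 1 should end up well above 0.5
```

The structure of the inner loop mirrors the pseudocode line by line; only the environment and the function approximators have been shrunk to something that fits on a page.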

---

<h3>N-Step return</h3>

One minor difference between this and A3C is that we're going to take N steps at a time which means we can use the N step return rather than the TD return 

E.g. with 3 steps, the return we use as the target is:

$$G = r + \gamma r' + \gamma^2 r'' + \gamma^3 V(s''')$$

So here we see another opportunity to combine the different things we've learned in this series previously
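The N step return can be computed with a short backwards loop. This is a minimal sketch with hypothetical names: `rewards` holds the N rewards collected in order and `bootstrap_value` is the value estimate of the state reached after the last step (taken as 0 if the episode ended).

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """G = r + gamma*r' + ... + gamma^N * V(s_N), accumulated from the last step backwards."""
    G = bootstrap_value
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# 3 steps with reward 1 each, gamma = 0.5 and V(s''') = 8:
# 1 + 0.5 + 0.25 + 0.125 * 8 = 2.75
print(n_step_return([1.0, 1.0, 1.0], bootstrap_value=8.0, gamma=0.5))  # 2.75
```

With a single reward in the list this reduces to the TD return used in the pseudocode earlier.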

---

<h3>Entropy</h3>

Another minor difference is that we are going to regularise the policy loss by adding the entropy as a regularisation term

If we recall, the entropy of a distribution is the negative of the sum over all events of the probability of the event multiplied by the log probability of the event

$$H = - \sum^K_{k=1} \pi_k \log \pi_k$$

In our case the $k$'s just represent the different actions and $\pi_k$ is just the probability of that action at the output of the policy network

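In the end we take the usual policy gradient loss and subtract the entropy multiplied by a regularisation constant $C > 0$, which controls the strength of the regularisation term

Since we minimise the loss, subtracting the entropy rewards policies with higher entropy

$$L^{'}_p = L_p - CH$$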

---

<h3>Entropy</h3>

In practice what this does is it encourages exploration

If we recall, entropy is somewhat like variance in that it measures the spread of a distribution

We get the maximum entropy when each event has equal probability and we get zero entropy when all the probability goes to a single event or in other words the outcome is deterministic 

So using entropy as a regularization term lets us strike a balance between those two extremes

One extreme being that all actions are equally probable and the other extreme being that only a single action can ever be chosen 
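The two extremes are easy to check numerically. This is a small sketch of the entropy formula for a discrete policy; the three example distributions are made up for illustration.

```python
import math

def entropy(probs):
    """H = -sum_k pi_k * log(pi_k), with 0 * log(0) treated as 0."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]      # all actions equally probable
deterministic = [1.0, 0.0, 0.0, 0.0]    # a single action is always chosen
peaked = [0.7, 0.1, 0.1, 0.1]           # somewhere in between

print(entropy(uniform))        # log(4), the maximum for 4 actions
print(entropy(deterministic))  # 0.0
print(entropy(peaked))         # strictly between the two
```

A policy that collapses onto one action too early has entropy close to zero, which is exactly what the regularisation term penalises.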

---

<h3>Neural Networks</h3>

Another big improvement with this algorithm is that we will now be able to incorporate neural networks as our function approximators

Specifically, we will be using convolutional neural networks just like we did for Deep Q-Learning to process the frames of the screen of an Atari game as input 

<img src="extras/64.0.PNG" width="400">

One thing to note is that for vanilla policy gradients this approach didn't seem to work

We'll see that with A3C we are now able to use neural networks and later in this section we'll explain why that's the case

---

<h3>The 3 A's</h3>

So what makes A3C interesting is that it's Asynchronous

In other words what it does is take the algorithm we had before but make it so that it's an Asynchronous algorithm 

<img src="extras/64.1.PNG" width="400">

In the rest of this section we're going to explain exactly what that means

---

<h3>Modern Computing</h3>

First in modern computing what we often like to do is run things in parallel

So for example suppose we had one million files and we need to do some processing on each of them

Well in code this is very simple, we can just loop over all 1 million files and run our function on each file

```python
for f in files:
  process_file(f)
```

But this process is slow

If we have 2 million files, the time it takes for this code to run doubles

So what can we do?

---

<h3>Multiple Machines</h3>

These days a popular method is to use multiple machines 

So we can have 1000 CPU cores each responsible for 1000 different files

In this way we've reduced our computation time by roughly a factor of 1000, less any overhead for coordinating the worker machines

---

<h3>Modern Processors</h3>

But we don't even have to go that far 

As you might already know, our computers, if they're reasonably modern, already have multiple cores which can run code in parallel

So while we might only have one machine, we can still take advantage of the multiple cores that exist inside it
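As a sketch of what this looks like in code, the file loop from before can be handed to a pool of workers from the standard library. `process_file` and the file names here are hypothetical stand-ins. Note one assumption that matters: in CPython, threads only speed things up when the work is I/O bound or releases the interpreter lock, so for CPU-heavy pure Python work a process pool (`ProcessPoolExecutor`) would be the usual substitute.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def process_file(name):
    """Hypothetical stand-in for the real per-file work."""
    return len(name)

files = [f"file_{i}.txt" for i in range(1000)]
n_workers = os.cpu_count() or 1  # one worker per available core

# pool.map splits the files across the workers and returns results in input order
with ThreadPoolExecutor(max_workers=n_workers) as pool:
    results = list(pool.map(process_file, files))

print(len(results), "files processed by", n_workers, "workers")
```

The calling code barely changes compared with the sequential loop, which is part of why this pattern is so common.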

---

<h3>Algorithm for A3C</h3>

So what is the basic algorithm for A3C?

Conceptually we're going to have one set of master networks one for the policy and one for the value function

We can think of that as a set of global networks

Now every so often, this global network is going to send its weights to a set of worker machines each with their own copy of the policy network and the value network

<img src="extras/64.2.PNG" width="400">

Then each of these worker machines is going to play a few episodes of the environment using its current network weights

From its own experience, each worker can also calculate the policy gradient updates as well as the value function updates

Now of course keep in mind that only this worker knows about its own updates

---

Finally the worker sends its gradients to the global networks so that the global network can update its parameters 

And every so often the Global Network gives its new updated parameters back to its worker machines 

So worker networks are always working with a relatively recent copy of the global network

<img src="extras/64.3.PNG" width="400">

---

<h3>Why is this good?</h3>

So why is this good?

Well firstly, now the global master network has no work to do

It's not going to be involved in actually playing the environment, only the workers do that

So the workers hence their name are doing all the work 

They play the episode, calculate the errors and find the update gradients

The global master network is the beneficiary of these updates 

And of course since the global master network has multiple workers playing episodes it can gather a lot more experience in parallel than it could have by playing the environment by itself

Long story short we save time

---

<h3>Stability</h3>

But there is another big advantage to doing things this way

Remember that there's always some randomness in each game

Each game isn't going to start the same way and each action sampled from a probabilistic policy will vary

So every game that the workers play is going to be different

If we recall sometimes in reinforcement learning the performance of our agent can drop off sharply due to a bad update

By having multiple workers contribute to the global master network we can reduce the variance of our updates and have a more stable learning trajectory 

To state this another way, by having multiple different workers play different episodes with the same parameters, we achieve more stability

---

<h3>Should remind us of SGD</h3>

This is like saying that instead of doing stochastic gradient descent, where we look at one sample at a time, we can do batch gradient descent, where we look at several samples at a time

<img src="extras/64.4.PNG" width="400">

As we may recall this makes the loss per iteration much smoother
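The smoothing effect can be seen in a small simulation. This is only an analogy under stated assumptions: each "worker" returns a noisy estimate of the same true gradient (Gaussian noise, made-up numbers), and we compare the variance of a single estimate with the variance of an average over several. A3C applies worker updates asynchronously rather than literally averaging them, so this illustrates the intuition, not the exact mechanism.

```python
import random
import statistics

def averaged_update(rng, n_workers, true_gradient=1.0, noise_std=1.0):
    """Average of n_workers independent noisy estimates of the same gradient."""
    samples = [rng.gauss(true_gradient, noise_std) for _ in range(n_workers)]
    return sum(samples) / n_workers

def update_variance(n_workers, trials=2000, seed=0):
    """Empirical variance of the averaged update over many trials."""
    rng = random.Random(seed)
    updates = [averaged_update(rng, n_workers) for _ in range(trials)]
    return statistics.variance(updates)

print(update_variance(1))   # close to noise_std^2 = 1
print(update_variance(16))  # close to 1/16, i.e. much smoother updates
```

The variance of the average shrinks roughly in proportion to the number of independent estimates, which is why more experience per update gives a steadier learning curve.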

---

<h3>DQN</h3>

This might also remind us of the Deep Q-Network notebook because in that case we also wanted to add different features that helped us stabilize learning 

In particular that involved techniques like freezing the target network and using an experience replay buffer 

Using an experience replay buffer allowed us to look at multiple examples during each training step, which is a little bit like batch gradient descent, and so intuitively we understand how this helps us stabilise learning

In short stabilised learning is good and we can achieve that by looking at multiple samples at the same time

It just so happens that with A3C the way that we do this is by having multiple parallel copies of our agent playing the game

This is what allows us to make use of neural networks as our function approximator whereas they don't work well with vanilla policy gradients without parallelisation

In other words Deep Q-Networks and A3C both try to solve the same problem which is how can we make use of neural networks in classic reinforcement learning algorithms

They just happen to do it in different ways

---

<h3>Code Structure</h3>

In the rest of this section, we're going to discuss how the code is going to be structured

Remember that the actual algorithm we'll be using is nothing new

It's just the same old Advantage Actor Critic that we saw before

The difference now is that we want to have concurrent agents that asynchronously copy and update their parameters

So the way the code is structured is this (according to instructor)

We're going to have three files

The first file is `main.py`

This is sort of the master file which is going to be responsible for creating the master global policy network and a value network

It's also going to coordinate the workers

In other words it's going to create a handful of workers each with their own thread and then wait for them to complete 

The second file is `worker.py` 

As we can tell by the name this is going to contain all the classes and functions that make up the worker 

Each worker is responsible for creating its own version of the policy network and value network, copying the weights from the global networks, playing episodes of the game and calculating gradients to be sent back to the global network

Finally the third file is `nets.py`

This contains the definitions for the policy network and value network

In the next few subsections we'll be looking at the pseudocode for each of these three files

---

<h3>main.py</h3>

In the first file main.py, the basic structure is this 

First we're going to instantiate our global policy network and our global value network

Then we're going to count how many CPUs are available and create that many threads with one worker object for each thread

We're also going to create a global thread safe counter that's going to tell us when to quit since we're going to go up to a maximum number of steps
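A minimal sketch of this coordination logic, using only the standard library, might look as follows. The names `SharedCounter`, `worker_loop` and `main` are hypothetical, the global networks are left out, and the worker body is a placeholder that only claims steps from the shared budget instead of playing the game.

```python
import os
import threading

class SharedCounter:
    """Thread-safe step counter shared by all workers."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._value += 1
            return self._value

def worker_loop(counter, max_steps, steps_per_worker, name):
    """Placeholder for the worker's run function: claim steps until the budget is spent."""
    while counter.increment() <= max_steps:
        steps_per_worker[name] += 1  # a real worker would play the game here

def main(max_steps=10_000):
    counter = SharedCounter()
    n_workers = os.cpu_count() or 1  # one worker thread per CPU
    steps_per_worker = {f"worker_{i}": 0 for i in range(n_workers)}
    threads = [
        threading.Thread(target=worker_loop,
                         args=(counter, max_steps, steps_per_worker, name))
        for name in steps_per_worker
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for all workers to complete
    return steps_per_worker

print(main())
```

Because every step is claimed through the locked counter, the workers together take exactly `max_steps` steps no matter how the threads are scheduled.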

---

<h3>worker.py</h3>

In the second file worker.py the basic structure is this 

We're going to have the run function which is going to be run by each thread 

The run function just runs a simple loop

```
def run():
  in a loop:
    copy params from global nets to local nets
    run N steps of game (and store the data - s,a,r,s')
    using gradients wrt local net, update the global net
```
Conceptually it's like:
<ol>
  <li>$$\mathbf{g}_\text{local} = \frac{\partial L(\theta_\text{local})}{\partial \theta_\text{local}}$$</li>
  <li>$$\theta_\text{global} = \theta_\text{global} - \eta \mathbf{g}_\text{local}$$</li>
</ol>

But in reality, we'll use RMSprop

First we copy the parameters from the global network to our local copy of the network

Then we run N steps of the game 

At this stage we store all the data we need to pass into our neural network such as the states we encountered and the rewards

Then we update the parameters of the global network using the gradients from the last N local steps

To give an idea of how this might work, we can think of it in terms of the above equations

The first step is to calculate the gradient of the local cost with respect to the local weights

The second step is to update the global weights using gradient descent with the local gradients

Now of course in TensorFlow it's not going to look exactly like this because we don't actually explicitly write out the gradient descent update

Instead we'll be using an adaptive training algorithm such as RMSprop, and our actual goal will be to figure out the correct TensorFlow function for applying this sort of mixed update, where the gradient used to update the weights of one network actually comes from the loss of another network

The worker.py file is probably going to be the most complicated of the three

So we just have to wait and see what the exact details are

Just keep in mind each of the above steps will actually turn out to be multiple functions
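To make the copy, compute, apply cycle concrete, here is a toy sketch in plain Python. It is not the TensorFlow implementation: the "networks" are two-element parameter lists, the loss is a made-up quadratic with optimum `TARGET`, the update is plain gradient descent as in the conceptual equations (not RMSprop), and a lock guards access to the shared parameters.

```python
import threading

TARGET = [3.0, -2.0]  # hypothetical optimum of the toy loss

def local_gradient(theta_local):
    """Gradient of the toy loss L = sum((theta - TARGET)^2) w.r.t. the local weights."""
    return [2.0 * (t - c) for t, c in zip(theta_local, TARGET)]

def run(global_theta, lock, n_updates, eta):
    for _ in range(n_updates):
        with lock:
            theta_local = list(global_theta)   # copy params from global to local
        g_local = local_gradient(theta_local)   # stands in for playing N steps
        with lock:
            for i, g in enumerate(g_local):     # update the global params
                global_theta[i] -= eta * g      # using the local gradient

def train(n_workers=4, n_updates=200, eta=0.05):
    global_theta = [0.0, 0.0]
    lock = threading.Lock()
    threads = [
        threading.Thread(target=run, args=(global_theta, lock, n_updates, eta))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return global_theta

print(train())  # should end up very close to TARGET
```

Between a worker's copy and its update, other workers may already have changed the global parameters, so the gradient can be slightly stale. That is the asynchronous part, and with a small step size the global parameters still converge.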

---

<h3>nets.py</h3>

The last file we're going to discuss is nets.py

This more or less just implements the policy network and the value network, although there are some important details to talk about before we dive into the code 

First these networks are going to be convolutional neural networks, which means they're going to consist of a series of convolutions followed by a series of fully connected layers

In addition both networks are going to share a common body

In other words only the last layer of each network is going to be different and all the preceding layers will have shared weights

<img src="extras/64.5.PNG" width="400">

Conceptually we can think of this as a two headed dragon
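The shared-body idea can be sketched without any deep learning library. In this toy version, which is an assumption-laden stand-in and not the actual network, the convolutional body is replaced by a single small dense layer, and the class and helper names are hypothetical. What it does show is the structure: one shared body feeding a policy head (softmax over actions) and a value head (a single number).

```python
import math
import random

def dense(x, weights, biases):
    """y_j = sum_i x_i * W[i][j] + b_j"""
    return [sum(xi * row[j] for xi, row in zip(x, weights)) + biases[j]
            for j in range(len(biases))]

def init_layer(rng, n_in, n_out):
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return weights, [0.0] * n_out

class TwoHeadedNet:
    def __init__(self, n_inputs, n_hidden, n_actions, seed=0):
        rng = random.Random(seed)
        self.body = init_layer(rng, n_inputs, n_hidden)          # shared by both heads
        self.policy_head = init_layer(rng, n_hidden, n_actions)  # policy only
        self.value_head = init_layer(rng, n_hidden, 1)           # value only

    def forward(self, state):
        hidden = [math.tanh(h) for h in dense(state, *self.body)]
        logits = dense(hidden, *self.policy_head)
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        probs = [e / total for e in exps]             # pi(a | s)
        value = dense(hidden, *self.value_head)[0]    # V(s)
        return probs, value

net = TwoHeadedNet(n_inputs=4, n_hidden=8, n_actions=3)
probs, value = net.forward([0.1, -0.2, 0.3, 0.0])
print(probs, value)
```

Both outputs are computed from the same `hidden` features, so any update to the body affects the policy and the value estimate at the same time.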

In the next section we'll look at this code in depth

# code

# Math

In this section, we are going to summarise everything we learned in this notebook

---

<h3>A3C - Section Summary</h3>

A majority of this notebook is just code because we already learned all the theory we needed in the policy gradient section

The only minor difference in theory is that we use the N step return instead of the TD return 

To recap, we learned how we could have a bunch of workers learning in parallel by using multi-threading

In terms of theory this allows us to add stability to the training algorithm because we're essentially taking the average result from a bunch of different random episodes

We hear that word a lot in reinforcement learning: the concept of stability is very central

This is mostly because stability is what is lacking

That's why we can't simply plug a neural network into Q-Learning as the function approximator and expect it to work

That's why we can't plug in a neural network into the vanilla policy gradient algorithm either 

And that's really the major theme of the recent few notebooks

Since merely plugging neural networks into existing algorithms causes instability, what can we do to make things more stable?

With Deep Q-Learning that involves creating an experience replay buffer and having a target network that didn't change too often 

With A3C that involves having multiple workers playing different episodes in parallel 

At the same time, one big hurdle in reinforcement learning is hyperparameter search

So even if our algorithm and our code are sound, if we choose the wrong hyperparameters our agent simply won't perform well

In other words lots of resources both in terms of compute and our own time and effort are needed in order to really determine whether or not an algorithm works

# Math

In this section we are going to summarise everything we've done in the recent few notebooks (last 5)

---

<h3>Checkpoint Summary</h3>

It's been a long journey to this point because not only have we had to study Deep Learning but we've had to study Reinforcement Learning as well

These notebooks were about combining these two techniques and also a little bit about expanding our knowledge of reinforcement learning

We tend to think of this as applying deep learning to reinforcement learning rather than the other way around

Deep learning is more like a general technique that can be applied to supervised learning or unsupervised learning as well as reinforcement learning 

But reinforcement learning is an entirely different paradigm, where we have an intelligent agent trying to maximize its reward in some environment

---

So what did we learn in these notebooks?

Well we started these notebooks with a review of some basic techniques in both reinforcement learning and deep learning

We looked at Markov decision processes (MDPs) and the three ways of learning how to solve them: dynamic programming, Monte-Carlo methods and temporal difference learning

Next we looked at the OpenAI Gym which provides us many environments for practicing reinforcement learning

This is especially valuable because coding our own environments would be extremely time consuming and completely unrelated to reinforcement learning itself

This allowed us to practice the reinforcement learning techniques we just reviewed and practice using TensorFlow, which is very helpful because of its automatic differentiation capabilities

These let us build arbitrarily complex neural networks and the update equations are automatically generated from the structure of the cost function

Also in this section we looked at a special type of neural network called the RBF network

Unlike our usual deep neural networks these networks are shallow and use fixed feature representations

That means features won't be learned using gradient descent

---

Next we looked at N step methods and TD-lambda 

We saw how these are both ways we can do something that is kind of in-between Monte-Carlo and TD learning

Whereas the N in N step methods is discrete, the lambda in TD lambda is continuous and between 0 and 1

Now of course there's nothing that makes these better or worse than the existing reinforcement learning methods we learned about, they just introduce new hyperparameters to be chosen

---

<h3>Policy Gradient Methods</h3>

Next we looked at policy gradient methods

This gave us a totally different way of solving reinforcement learning problems 

Instead of making the policy just greedy or epsilon greedy with respect to Q, we parameterise the policy itself

In this situation we parameterise both the policy and the state value function V(s) 

Parameterising the policy is interesting because it allows us to easily handle continuous action spaces as we saw with continuous mountain car

The only change we had to make was instead of modeling the output as a discrete distribution we model the output as a Gaussian
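As a reminder of what modelling the output as a Gaussian means in practice, here is a minimal sketch. The function names are hypothetical, and the mean and standard deviation are fixed numbers here, whereas in the policy network they would be outputs computed from the state.

```python
import math
import random

def gaussian_log_prob(action, mean, std):
    """log N(action | mean, std^2), the term that replaces log pi(a|s) in the policy loss."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (action - mean) ** 2 / (2.0 * std ** 2)

def sample_action(rng, mean, std):
    """Draw a continuous action from the Gaussian policy."""
    return rng.gauss(mean, std)

rng = random.Random(0)
mean, std = 0.5, 0.2  # made-up policy outputs for one state
a = sample_action(rng, mean, std)
print(a, gaussian_log_prob(a, mean, std))
```

The rest of the algorithm is unchanged: the log probability of the sampled action is still multiplied by the advantage to form the policy loss.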

---


<h3>Deep Q-Learning</h3>

Lastly we looked at Deep Q-Learning

Deep Q-Learning is just the combination of deep learning with reinforcement learning 

Initially deep learning looks like a pretty attractive option in reinforcement learning because we already have the theory behind approximation methods, and neural networks are function approximators

So a naive approach would be to just plug in a neural network

Of course this doesn't work, so we learned about new techniques to make it work

In particular we looked at experience replay, using a target network, and combining previous data into the current state in order to model motion
