# Reinforcement Learning

*Reinforcement Learning* (RL) is one of the most exciting fields of machine learning today, & also one of the oldest. It has been around since the 1950s, producing many interesting applications over the years, particularly in games (e.g., *TD-Gammon*, a backgammon-playing program) & in machine control, but seldom making the headline news. But a revolution took place in 2013, when researchers from a British startup called DeepMind demonstrated a system that could learn to play just about any Atari game from scratch, eventually outperforming humans in most of them, using only raw pixels as inputs & without any prior knowledge of the rules of the games. This was the first of a series of amazing feats, culminating in March 2016 with the victory of their system AlphaGo against Lee Sedol, a legendary professional player of the game of Go, & in May 2017 against Ke Jie, the world champion. No program had ever come close to beating a master of this game, let alone the world champion. Today, the whole field of RL is boiling with new ideas, with a wide range of applications. DeepMind was bought by Google for over 500 million dollars in 2014.

So how did DeepMind achieve all this? With hindsight, it seems rather simple: they applied the power of deep learning to the field of reinforcement learning, & it worked beyond their wildest dreams. In this lesson, we will first explain what reinforcement learning is & what it's good at, then present two of the most important techniques in deep reinforcement learning: *policy gradients* & *deep Q-networks* (DQNs), including a discussion of *Markov decision processes* (MDPs). We will use these techniques to train models to balance a pole on a moving cart; then we'll introduce the tf-agents library, which uses state-of-the-art algorithms that greatly simplify building powerful RL systems, & we will use the library to train an agent to play *Breakout*, the famous Atari game. We'll then close the lesson by taking a look at some of the latest advances in the field.

---

# Learning to Optimise Rewards

In reinforcement learning, a software *agent* makes *observations* & takes *actions* within an *environment*, & in return, it receives *rewards*. Its objective is to learn to act in a way that will maximise its expected rewards over time. If you don't mind a bit of anthropomorphism, you can think of positive rewards as pleasure, & negative rewards as pain. In short, the agent acts in the environment & learns by trial & error to maximise its pleasure & minimise its pain.

This is quite a broad setting, which can apply to a wide variety of tasks. Here are a few examples.

1. The agent can be the program controlling a robot. In this case, the environment is the real world, the agent observes the environment through a set of *sensors* such as cameras & touch sensors, & its actions consist of sending signals to activate motors. It may be programmed to get positive rewards whenever it approaches the target destination, & negative rewards whenever it wastes time or goes in the wrong direction.
2. The agent can be the program controlling *Ms. Pac-Man*. In this case, the environment is a simulation of the Atari game, the actions are the nine possible joystick positions (upper left, down, center, & so on), the observations are screenshots, & the rewards are just the game points.
3. Similarly, the agent can be the program playing a board game such as Go.
4. The agent does not have to control a physically (or virtually) moving thing. For example, it can be a smart thermostat, getting positive rewards whenever it is close to the target temperature & saves energy, & negative rewards when humans need to tweak the temperature, so the agent must learn to anticipate human needs.
5. The agent can observe stock market prices & decide how much to buy or sell every second. Rewards are obviously the monetary gains & losses.

<img src = "Images/Reinforcement Learning Examples.png" width = "600" style = "margin:auto"/>

Note that there may not be any positive rewards at all; for example, the agent may move around in a maze, getting a negative reward at every time step, so it had better find the exit as quickly as possible! There are many other examples of tasks to which reinforcement learning is well suited, such as self-driving cars, recommender systems, placing ads on a web page, or controlling where an image classification system should focus its attention.

---

# Policy Search

The algorithm a software agent uses to determine its actions is called its *policy*. The policy can be a neural network taking observations as inputs & outputting the action to take.

<img src = "Images/Reinforcement Learning Using NN Policy.png" width = "600" style = "margin:auto"/>

The policy can be any algorithm you can think of, & it does not have to be deterministic. In fact, in some cases, it does not even have to observe the environment! For example, consider a robotic vacuum cleaner whose reward is the amount of dust it picks up in 30 minutes. Its policy could be to move forward with some probability *p* every second, or randomly rotate left or right with probability 1 - *p*. The rotation angle would be a random angle between -r & +r. Since this policy involves some randomness, it is called a *stochastic policy*. The robot will have an erratic trajectory, which guarantees that it will eventually get to any place it can reach & pick up all the dust. The question is, how much dust will it pick up in 30 minutes?

How would you train such a robot? There are just two *policy parameters* you can tweak: the probability *p* & the angle range *r*. One possible learning algorithm could be to try out many different values for these parameters, & pick the combination that performs best.

<img src = "Images/Policy Search.png" width = "550" style = "margin:auto"/>

This is an example of *policy search*, in this case using a brute force approach. When the *policy space* is too large (which is generally the case), finding a good set of parameters this way is like searching for a needle in a gigantic haystack.
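The brute force search over *p* & *r* can be sketched in a few lines. This is only an illustration: `simulate_vacuum()` is a made-up toy simulator (an unbounded plane where "dust picked up" is simply the number of distinct unit cells visited), & the grids of candidate values are arbitrary assumptions.

```python
import math
import random

def simulate_vacuum(p, r, n_steps=300, seed=0):
    """Toy stand-in for the robot: count the distinct unit cells it visits."""
    rng = random.Random(seed)
    x, y, heading = 0.0, 0.0, 0.0
    visited = {(0, 0)}
    for _ in range(n_steps):
        if rng.random() < p:
            # Move forward one unit along the current heading.
            x += math.cos(heading)
            y += math.sin(heading)
            visited.add((math.floor(x), math.floor(y)))
        else:
            # Rotate by a random angle between -r & +r (radians).
            heading += rng.uniform(-r, r)
    return len(visited)

def brute_force_search(p_values, r_values, n_trials=5):
    """Try every (p, r) combination & keep the best average score."""
    best_params, best_score = None, -1.0
    for p in p_values:
        for r in r_values:
            score = sum(simulate_vacuum(p, r, seed=s) for s in range(n_trials)) / n_trials
            if score > best_score:
                best_params, best_score = (p, r), score
    return best_params, best_score

p_grid = [0.1, 0.3, 0.5, 0.7, 0.9]   # assumed candidate values for p
r_grid = [0.2, 0.8, 1.6, 3.1]        # assumed candidate values for r
best_params, best_score = brute_force_search(p_grid, r_grid)
print(best_params, best_score)
```

Even with only two parameters, the cost grows with the product of the grid sizes, which is why this approach breaks down quickly as the policy space grows.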

Another way to explore the policy space is to use *genetic algorithms*. For example, you could randomly create a first generation of 100 policies & try them out, then "kill" the 80 worst policies & make the 20 survivors produce 4 offspring each. An offspring is a copy of its parent plus some random variation. The surviving policies plus their offspring together constitute the second generation. You can continue to iterate through generations this way until you find a good policy.
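Here is a minimal sketch of that generational loop, using the numbers from the text (100 policies, 20 survivors, 4 offspring each). The `fitness()` function is a hypothetical placeholder with a made-up optimum; in practice it would be the reward measured by actually running the policy.

```python
import random

def fitness(policy):
    """Hypothetical score that peaks at p = 0.8, r = 1.0 (a made-up optimum)."""
    p, r = policy
    return -((p - 0.8) ** 2) - ((r - 1.0) ** 2)

def mutate(policy, rng, scale=0.05):
    """An offspring is a copy of its parent plus some random variation."""
    p, r = policy
    new_p = min(1.0, max(0.0, p + rng.gauss(0.0, scale)))
    new_r = max(0.0, r + rng.gauss(0.0, scale))
    return (new_p, new_r)

def next_generation(population, rng, n_survivors=20, n_offspring=4):
    # Keep the best policies, discard the rest.
    survivors = sorted(population, key=fitness, reverse=True)[:n_survivors]
    # Each survivor produces several mutated copies of itself.
    offspring = [mutate(parent, rng) for parent in survivors for _ in range(n_offspring)]
    return survivors + offspring

rng = random.Random(42)
population = [(rng.random(), rng.uniform(0.0, 3.0)) for _ in range(100)]
best_per_generation = []
for generation in range(30):
    best_per_generation.append(max(fitness(policy) for policy in population))
    population = next_generation(population, rng)

best_policy = max(population, key=fitness)
print(best_policy)
```

Because the survivors are carried over unchanged & this toy fitness is deterministic, the best score can never get worse from one generation to the next.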

Yet another approach is to use optimisation techniques, by evaluating the gradients of the rewards with regard to the policy parameters, then tweaking these parameters by following the gradients toward higher rewards. We will discuss this approach, *policy gradients* (PG), in more detail later in this lesson. Going back to the vacuum cleaner robot, you could slightly increase *p* & evaluate whether doing so increases the amount of dust picked up by the robot in 30 minutes; if it does, then increase *p* some more, or else reduce *p*. We will implement a popular PG algorithm using TensorFlow, but before we do, we need to create an environment for the agent to live in -- so it's time to introduce OpenAI Gym.
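Before moving on, the one-parameter tweak described for the vacuum cleaner can be sketched as follows. This is a crude probing rule rather than a true policy gradient algorithm, & `dust_collected()` is a hypothetical, noise-free reward curve invented for the example (a real robot would give noisy measurements).

```python
def dust_collected(p):
    """Hypothetical reward curve with its peak at p = 0.6."""
    return 100.0 - 200.0 * (p - 0.6) ** 2

def tune_p(p=0.2, step=0.05, n_updates=50):
    for _ in range(n_updates):
        candidate = min(1.0, p + step)
        if dust_collected(candidate) > dust_collected(p):
            p = candidate            # increasing p helped, so keep going up
        else:
            p = max(0.0, p - step)   # otherwise reduce p
    return p

tuned_p = tune_p()
print(tuned_p)
```

With a fixed step size, the value ends up hovering within one step of the peak rather than settling exactly on it.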

---

# Introduction to OpenAI Gym

One of the challenges of reinforcement learning is that in order to train an agent, you first need to have a working environment. If you want to program an agent that will learn to play an Atari game, you will need an Atari game simulator. If you want to program a walking robot, then the environment is the real world, & you can directly train your robot in that environment, but it has its limits: if the robot falls off a cliff, you can't just click Undo. You can't speed up training either; adding computing power won't make your robot move any faster. It's also generally too expensive to train 1,000 robots in parallel. In short, training is hard & slow in the real world, so you generally need a *simulated environment* at least for bootstrap training. For example, you may use a library like pybullet or mujoco for 3D physics simulation.

*OpenAI Gym* is a toolkit that provides a wide variety of simulated environments (Atari games, board games, 2D & 3D physical simulations, & so on), so you can train agents, compare them, or develop new RL algorithms.

First, install Gymnasium, the maintained successor to OpenAI Gym that the code below imports (if you are not using a virtual environment, you will need to add the `--user` option, or have administrator rights):

In [None]:
#pip install --user gymnasium

Depending on your system, you may also need to install the Mesa OpenGL Utility (GLU) library (e.g., on Ubuntu 18.04, you need to run `apt install libglu1-mesa`). This library will be needed to render the first environment. Next, open up a Python shell or a Jupyter notebook, then create & display an environment with `gym.make()`, passing the environment's name & a `render_mode`.

In [None]:
import gymnasium as gym

env = gym.make("CartPole-v1", render_mode = "human")
obs, info = env.reset() # reset() returns the first observation & an info dictionary
obs

Here, we've created a cartpole environment. This is a 2D simulation in which a cart can be accelerated left or right in order to balance a pole placed on top of it.

<img src = "Images/CartPole Environment.png" width = "500" style = "margin:auto"/>

You can get a list of all available environments by running `gym.envs.registry.keys()`. After the environment is created, you must first initialise it using the `reset()` method. This returns the first observation, along with an info dictionary. Observations depend on the type of environment. For the cartpole environment, each observation is a 1D numpy array containing four floats: these floats represent the cart's horizontal position (`0.0` = center), its velocity (positive means right), the angle of the pole (`0.0` = vertical), & its angular velocity (positive means clockwise).

You can display this environment by setting `render_mode` when calling `make()`. On Windows, this requires first installing an X Server, such as VcXsrv or Xming. If you want to get the rendered image as a numpy array instead of the window shown above, you can set `render_mode = "rgb_array"` & call `env.render()`.

Let's ask the environment what actions are possible:

In [None]:
env.action_space

`Discrete(2)` means that the possible actions are integers 0 & 1, which represent accelerating left (0) or right (1). Other environments may have additional discrete actions, or other kinds of actions (e.g., continuous). Since the pole is leaning toward the right (`obs[2] > 0`), let's accelerate the cart toward the right:

In [None]:
obs[2] > 0

In [None]:
action = 1 # Accelerate right.
obs, reward, done, truncated, info = env.step(action)

In [None]:
obs

In [None]:
reward

In [None]:
done

In [None]:
truncated

In [None]:
info

The `step()` method executes the given action & returns five values:

* `obs`
   - This is the new observation. The cart is now moving right (`obs[1] > 0`). The pole is still tilted toward the right (`obs[2] > 0`), but its angular velocity is now negative (`obs[3] < 0`), so it will likely be tilted toward the left after the next step.
* `reward`
   - In this environment, you get a reward of 1.0 at every step, no matter what you do, so the goal is to keep the episode running as long as possible.
* `done`
   - This value will be `True` when the episode is over because the task itself has ended. This will happen when the pole tilts too much, or when the cart goes off the screen. After that, the environment must be reset before it can be used again.
* `truncated`
   - This value will be `True` if the episode is cut short by a time limit or for other reasons not defined as part of the task. For `CartPole-v1`, the time limit is 500 steps; if you reach it, you have won.
* `info`
   - This environment-specific dictionary can provide some extra information that you may find useful for debugging or for training. For example, in some games, it may indicate how many lives the agent has.
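To see how the five values fit together, here is a small episode loop. So that it runs without a display or any installation, it uses `CountdownEnv`, a hypothetical stand-in class written for this example (it is not part of Gymnasium); it only mimics the shapes of the values returned by `reset()` & `step()`.

```python
class CountdownEnv:
    """Hypothetical stand-in that mimics the reset()/step() return shapes."""

    def __init__(self, fail_at=None, time_limit=5):
        self.fail_at = fail_at        # step at which the task itself ends
        self.time_limit = time_limit  # step at which the episode is cut short
        self.t = 0

    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0], {}   # (observation, info)

    def step(self, action):
        self.t += 1
        obs = [float(self.t), 0.0, 0.0, 0.0]
        reward = 1.0
        done = self.fail_at is not None and self.t >= self.fail_at
        truncated = not done and self.t >= self.time_limit
        return obs, reward, done, truncated, {"t": self.t}

def run_episode(env, policy):
    """Play until the episode ends for either reason; report which one it was."""
    obs, info = env.reset()
    total_reward = 0.0
    while True:
        obs, reward, done, truncated, info = env.step(policy(obs))
        total_reward += reward
        if done or truncated:
            return total_reward, done, truncated

def always_right(obs):
    return 1

failed = run_episode(CountdownEnv(fail_at=3), always_right)  # the task ended
timed_out = run_episode(CountdownEnv(), always_right)        # the time limit was hit
print(failed, timed_out)
```

Checking both flags matters: a loop that only tests `done` would keep stepping an environment whose episode has already been cut short.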
 
Let's hardcode a simple policy that accelerates left when the pole is leaning toward the left & accelerates right when the pole is leaning toward the right. We will run this policy to see the average rewards it gets over 500 episodes:

In [None]:
def basic_policy(obs):
    # obs = [cart position, cart velocity, pole angle, pole angular velocity]
    angle = obs[2]
    return 0 if angle < 0 else 1

totals = []
for episode in range(500):
    episode_rewards = 0
    obs, info = env.reset()
    for step in range(200):
        action = basic_policy(obs)
        obs, reward, done, truncated, info = env.step(action)
        episode_rewards += reward
        if done or truncated:
            break
    totals.append(episode_rewards)

Let's look at the result:

In [None]:
import numpy as np
np.mean(totals), np.std(totals), np.min(totals), np.max(totals)

Even with 500 tries, this policy never managed to keep the pole upright for more than 68 consecutive steps. Not great. If you look at the simulation, you will see that the cart oscillates left & right more & more strongly until the pole tilts too much. Let's see if a neural network can come up with a better policy.

---

# Neural Network Policies

Let's create a neural network policy. Just like with the policy we hardcoded earlier, this neural network will take an observation as input, & it will output the action to be executed. More precisely, it will estimate a probability for each action, & then we will select an action randomly, according to the estimated probabilities.

<img src = "Images/Neural Network Policy.png" width = "500" style = "margin:auto"/>

In the case of the CartPole environment, there are just two possible actions (left or right), so we only need one output neuron. It will output the probability *p* of action 0 (left) & of course the probability of action 1 (right) will be 1 - *p*. For example, if it outputs 0.7, then we will pick action 0 with 70% probability & action 1 with 30% probability.

You may wonder why we are picking a random action based on the probabilities given by the neural network, rather than just picking the action with the highest score. This approach lets the agent find the right balance between *exploring* new actions & *exploiting* the actions that are known to work well. Here's an analogy: suppose you go to a restaurant for the first time, & all the dishes look equally appealing, so you randomly pick one. If it turns out to be good, you can increase the probability that you'll order it next time, but you shouldn't increase that probability up to 100%, or else you will never try out the other dishes, some of which may be even better than the one you tried.
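The difference between sampling an action & always picking the most likely one can be checked with a quick experiment. The helper names below are made up for this sketch, & the probability 0.7 is the example value used above.

```python
import random

def sample_action(left_proba, rng):
    """Return 0 (left) with probability left_proba, otherwise 1 (right)."""
    return 0 if rng.random() < left_proba else 1

def greedy_action(left_proba):
    """Always pick the action with the highest probability."""
    return 0 if left_proba >= 0.5 else 1

rng = random.Random(0)
samples = [sample_action(0.7, rng) for _ in range(10_000)]
left_fraction = samples.count(0) / len(samples)

print(left_fraction)        # close to 0.7: action 1 still gets explored
print(greedy_action(0.7))   # always 0: action 1 is never tried
```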

Also note that in this particular environment, the past actions & observations can safely be ignored, since each observation contains the environment's full state. If there were some hidden state, then you might need to consider past actions & observations as well. For example, if the environment only revealed the position of the cart but not its velocity, you would have to consider not only the current observation but also the previous observation in order to estimate the current velocity. Another example is when the observations are noisy; in that case, you generally want to use the past few observations to estimate the most likely current state. The cartpole problem is thus as simple as can be; the observations are noise-free, & they contain the environment's full state.

Here's the code to build this neural network policy using tf.keras.

In [None]:
import tensorflow as tf
from tensorflow import keras

n_inputs = 4 # == env.observation_space.shape[0]

model = keras.models.Sequential([
    keras.layers.Input(shape = [n_inputs]),
    keras.layers.Dense(5, activation = "elu"),
    keras.layers.Dense(1, activation = "sigmoid")
])

After the imports, we use a simple `Sequential` model to define the policy network. The number of inputs is the size of the observation space (which in the case of cartpole is 4), & we have just 5 hidden units because it's a simple problem. Finally, we want to output a single probability (the probability of going left), so we have a single output neuron using the sigmoid activation function. If there were more than two possible actions, there would be one output neuron per action, & we would use the softmax activation function instead.
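For the case with more than two actions, the sketch below shows what the softmax output means & how an action would be sampled from it. It is written in plain Python to keep the arithmetic visible; the three logits are arbitrary example values, not outputs of the model above.

```python
import math
import random

def softmax(logits):
    """Turn one raw score per action into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(probas, rng):
    """Pick an action index at random, according to the given probabilities."""
    return rng.choices(range(len(probas)), weights=probas, k=1)[0]

logits = [2.0, 1.0, 0.1]   # assumed raw outputs, one per action
probas = softmax(logits)
action = sample_action(probas, random.Random(0))
print(probas, action)
```

In Keras, the same effect would come from an output layer with one unit per action & `activation = "softmax"`.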

OK, we now have a neural network policy that will take observations & output action probabilities. But how do we train it?

---

# Evaluating Actions: The Credit Assignment Problem

If we knew what the best action was at each step, we could train the neural network as usual, by minimising the cross entropy between the estimated probability distribution & the target probability distribution. It would just be regular supervised learning. However, in reinforcement learning, the only guidance the agent gets is through rewards, & rewards are typically sparse & delayed. For example, if the agent manages to balance the pole for 100 steps, how can it know which of the 100 actions it took were good, & which of them were bad? All it knows is that the pole fell after the last action, but surely the last action is not entirely responsible. This is called the *credit assignment problem*: when the agent gets a reward, it is hard for it to know which actions should get credited (or blamed) for it. Think of a dog that gets rewarded hours after it behaved well; will it understand what it is being rewarded for?

To tackle this problem, a common strategy is to evaluate an action based on the sum of all the rewards that come after it, usually applying a *discount factor $\gamma$* (gamma) at each step. This sum of discounted rewards is called the action's *return*. Consider the figure below.

<img src = "Images/Computing an Action's Return - Sum of Discounted Future Rewards.png" width = "600" style = "margin:auto"/>

If an agent decides to go right three times in a row & gets +10 reward after the first step, 0 after the second step, & finally -50 after the third step, then assuming we use a discount factor $\gamma = 0.8$, the first action will have a return of $10 + \gamma \times 0 + \gamma^2 \times (-50) = -22$. If the discount factor is close to 0, then future rewards won't count for much compared to immediate rewards. Conversely, if the discount factor is close to 1, then rewards far into the future will count almost as much as immediate rewards. Typical discount factors vary from 0.9 to 0.99. With a discount factor of 0.95, rewards 13 steps into the future count roughly for half as much as immediate rewards (since $0.95^{13} \approx 0.5$), while with a discount factor of 0.99, rewards 69 steps into the future count for half as much as immediate rewards. In the cartpole environment, actions have fairly short-term effects, so choosing a discount factor of 0.95 seems reasonable.
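The numbers in this paragraph are easy to check. The two helper functions below are written just for this check (their names are not from any library): one computes the return of the first action, the other the number of steps after which a reward is worth about half of an immediate one.

```python
import math

def action_return(rewards, discount_factor):
    """Sum of the rewards, each discounted by discount_factor ** (steps ahead)."""
    return sum(reward * discount_factor ** k for k, reward in enumerate(rewards))

def half_life(discount_factor):
    """Number of steps n such that discount_factor ** n equals 0.5."""
    return math.log(0.5) / math.log(discount_factor)

first_return = action_return([10, 0, -50], 0.8)
print(first_return)                        # about -22
print(half_life(0.95), half_life(0.99))    # about 13.5 & 69 steps
```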

Of course, a good action may be followed by several bad actions that cause the pole to fall quickly, resulting in the good action getting a low return (similarly, a good actor may sometimes star in a terrible movie). However, if we play the game enough times, on average, good actions will get a higher return than bad ones. We want to estimate how much better or worse an action is, compared to the other possible actions, on average. This is called the *action advantage*. For this, we must run many episodes & normalise all the action returns (by subtracting the mean & dividing by the standard deviation). After that, we can reasonably assume that actions with a negative advantage were bad while actions with a positive advantage were good. Perfect -- now that we have a way to evaluate each action, we are ready to train our first agent using policy gradients. Let's see how.

---

# Policy Gradients

Policy gradient algorithms optimise the parameters of a policy by following the gradients toward higher rewards. One popular class of PG algorithms, called *REINFORCE algorithms*, was introduced back in 1992 by Ronald Williams. Here is one common variant:

1. First, let the neural network policy play the game several times, & at each step, compute the gradients that would make the chosen action even more likely -- but don't apply these gradients yet.
2. Once you have run several episodes, compute each action's advantage (using the method described in the previous section).
3. If an action's advantage is positive, it means that the action was probably good & you want to apply the gradients computed earlier to make the action even more likely to be chosen in the future. However, if the action's advantage is negative, it means the action was probably bad, & you want to apply the opposite gradients to make this action slightly less likely in the future. The solution is simply to multiply each gradient vector by the corresponding action's advantage.
4. Finally, compute the mean of all the resulting gradient vectors, & use it to perform a gradient descent step.

Let's use tf.keras to implement this algorithm. We will train the neural network policy we built earlier so that it learns to balance the pole on the cart. First, we need a function that will play one step. We will pretend for now that whatever action it takes is the right one, so that we can compute the loss & its gradients (these gradients will just be saved for a while, & we will modify them later depending on how good or bad the action turned out to be):

In [None]:
def play_one_step(env, obs, model, loss_fn):
    with tf.GradientTape() as tape:
        left_proba = model(obs[np.newaxis])
        action = (tf.random.uniform([1, 1]) > left_proba)
        y_target = tf.constant([[1.]]) - tf.cast(action, tf.float32)
        loss = tf.reduce_mean(loss_fn(y_target, left_proba))
    grads = tape.gradient(loss, model.trainable_variables)
    obs, reward, done, truncated, info = env.step(int(action[0, 0].numpy()))
    return obs, reward, done, grads

Let's walk through this function:

* Within the `GradientTape` block, we start by calling the model, giving it a single observation (we reshape the observation so it becomes a batch containing a single instance, as the model expects a batch). This outputs the probability of going left.
* Next, we sample a random float between 0 & 1, & we check whether it is greater than `left_proba`. The `action` will be `False` with probability `left_proba`, or `True` with probability `1 - left_proba`. Once we cast this boolean to a number, the action will be 0 (left) or 1 (right) with the appropriate probabilities.
* Next, we define the target probability of going left: it is 1 minus the action (cast to a float). If the action is 0 (left), then the target probability of going left will be 1. If the action is 1 (right), then the target probability will be 0.
* Then we compute the loss using the given loss function, & we use the tape to compute the gradient of the loss with regard to the model's trainable variables. Again, these gradients will be tweaked later, before we apply them, depending on how good or bad the action turned out to be.
* Finally, we play the selected action, & we return the new observation, the reward, whether the episode is ended or not, & of course the gradients that we just computed.

Now let's create another function that will rely on the `play_one_step()` function to play multiple episodes, returning all the rewards & gradients for each episode & each step:

In [None]:
def play_multiple_episodes(env, n_episodes, n_max_steps, model, loss_fn):
    all_rewards = []
    all_grads = []
    for episode in range(n_episodes):
        current_rewards = []
        current_grads = []
        obs, info = env.reset() # reset() returns the observation & an info dictionary
        for step in range(n_max_steps):
            obs, reward, done, grads = play_one_step(env, obs, model, loss_fn)
            current_rewards.append(reward)
            current_grads.append(grads)
            if done:
                break
        all_rewards.append(current_rewards)
        all_grads.append(current_grads)
    return all_rewards, all_grads

This code returns a list of reward lists (one reward list per episode, containing one reward per step), as well as a list of gradient lists (one gradient list per episode, each containing one tuple of gradients per step & each tuple containing one gradient tensor per trainable variable).

The algorithm will use the `play_multiple_episodes()` function to play the game several times (e.g., 10 times), then it will go back & look at all the rewards, discount them, & normalise them. To do that, we need a couple more functions: the first will compute the sum of future discounted rewards at each step, & the second will normalise all these discounted rewards (returns) across many episodes by subtracting the mean & dividing by the standard deviation:

In [None]:
def discount_rewards(rewards, discount_factor):
    discounted = np.array(rewards)
    for step in range(len(rewards) - 2, -1, -1):
        discounted[step] += discounted[step + 1] * discount_factor
    return discounted

def discount_and_normalise_rewards(all_rewards, discount_factor):
    all_discounted_rewards = [discount_rewards(rewards, discount_factor)
                              for rewards in all_rewards]
    flat_rewards = np.concatenate(all_discounted_rewards)
    reward_mean = flat_rewards.mean()
    reward_std = flat_rewards.std()
    return [(discounted_rewards - reward_mean) / reward_std
            for discounted_rewards in all_discounted_rewards]

Let's check that this works:

In [None]:
discount_rewards([10, 0, -50], discount_factor = 0.8)

In [None]:
discount_and_normalise_rewards([[10, 0, -50], [10, 20]],
                               discount_factor = 0.8)

The call to `discount_rewards()` returns exactly what we expect. You can verify that the function `discount_and_normalise_rewards()` does indeed return the normalised action advantages for each action in both episodes. Notice that the first episode was much worse than the second, so its normalised advantages are all negative; all actions from the first episode would be considered bad & conversely all actions from the second episode would be considered good.
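As a check on the arithmetic, here is the computation behind the second call written out by hand. It assumes the population standard deviation, which is what NumPy's `std()` computes by default; the final values are rounded to two decimals.

$$
\begin{aligned}
\text{returns} &: [-22, -40, -50] \quad \text{and} \quad [10 + 0.8 \times 20,\ 20] = [26, 20] \\
\text{mean} &= \frac{-22 - 40 - 50 + 26 + 20}{5} = -13.2 \\
\text{std} &= \sqrt{\frac{8.8^2 + 26.8^2 + 36.8^2 + 39.2^2 + 33.2^2}{5}} = \sqrt{957.76} \approx 30.95 \\
\text{advantages} &\approx [-0.28, -0.87, -1.19] \quad \text{and} \quad [1.27, 1.07]
\end{aligned}
$$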

We are almost ready to run the algorithm! Now let's define the hyperparameters. We will run 150 training iterations, playing 10 episodes per iteration, & each episode will last at most 200 steps. We will use a discount factor of 0.95:

In [None]:
n_interations = 150
n_episodes_per_update = 10
n_max_steps = 200
discount_factor = 0.95

We also need an optimiser & the loss function. A regular Adam optimiser with a learning rate of 0.01 will do just fine, & we will use the binary cross-entropy loss function because we are training a binary classifier (there are two possible actions: left or right):

In [None]:
optimiser = keras.optimizers.Adam(learning_rate = 0.01)
loss_fn = keras.losses.binary_crossentropy

We are now ready to build & run the training loop!

In [None]:
for iteration in range(n_interations):
    all_rewards, all_grads = play_multiple_episodes(env, n_episodes_per_update, n_max_steps, model, loss_fn)
    all_final_rewards = discount_and_normalise_rewards(all_rewards, discount_factor)
    all_mean_grads = []
    for var_index in range(len(model.trainable_variables)):
        # Weight each step's gradient by that step's normalised advantage, then average.
        mean_grads = tf.reduce_mean([final_reward * all_grads[episode_index][step][var_index]
                                     for episode_index, final_rewards in enumerate(all_final_rewards)
                                     for step, final_reward in enumerate(final_rewards)], axis = 0)
        all_mean_grads.append(mean_grads)
    optimiser.apply_gradients(zip(all_mean_grads, model.trainable_variables))

Let's walk through this code:

* At each training iteration, this loop calls the `play_multiple_episodes()` function, which plays the game 10 times & returns all the rewards & gradients for every episode & step.
* Then we call the `discount_and_normalise_rewards()` function to compute each action's normalised advantage (which in the code we call the `final_reward`). This provides a measure of how good or bad each action actually was, in hindsight.
* Next, we go through each trainable variable, & for each of them we compute the weighted mean of the gradients for that variable over all episodes & all steps, weighted by the `final_reward`.
* Finally, we apply these mean gradients using the optimiser: the model's trainable variables will be tweaked, & hopefully the policy will be a bit better.

We're done! This code will train the neural network policy, & it will successfully learn to balance the pole on the cart. The mean reward per episode will get very close to 200 (which is the maximum allowed here by `n_max_steps`). Success!

The simple policy gradients algorithm we just trained solved the cartpole task, but it would not scale well to larger & more complex tasks. Indeed, it is highly *sample inefficient*, meaning it needs to explore the game for a very long time before it can make significant progress. This is due to the fact that it must run multiple episodes to estimate the advantage of each action, as we have seen. However, it is the foundation of more powerful algorithms, such as *actor-critic* algorithms.

We will now look at another popular family of algorithms. Whereas PG algorithms directly try to optimise the policy to increase rewards, the algorithms we will look at now are less direct: the agent learns to estimate the expected return for each state, or for each action in each state, then it uses this knowledge to decide how to act. To understand these algorithms, we must first introduce *Markov decision processes*.

---

# Markov Decision Processes

In the early 20th century, the mathematician Andrey Markov studied stochastic processes with no memory, called *Markov chains*. Such a process has a fixed number of states, & it randomly evolves from one state to another at each step. The probability for it to evolve from a state *s* to a state *s'* is fixed, & it depends only on the pair (*s*, *s'*), not on past states (this is why we say that the system has no memory).

<img src = "Images/Markov Chain.png" width = "400" style = "margin:auto"/>

Suppose that the process starts in state $s_0$ & there is a 70% chance that it will remain in that state in the next step. Eventually, it is bound to leave that state & never come back because no other state points back to $s_0$. If it goes to state $s_1$, it will then most likely go to state $s_2$ (90% probability), then immediately back to state $s_1$ (with 100% probability). It may alternate a number of times between these two states, but eventually it will fall into state $s_3$ & remain there forever (this is a *terminal state*). Markov chains can have very different dynamics, & they are heavily used in thermodynamics, chemistry, statistics, & much more.
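A chain like this one is simple to simulate. In the sketch below, only three probabilities come from the text (70% to stay in $s_0$, 90% from $s_1$ to $s_2$, 100% from $s_2$ to $s_1$); how the remaining probability mass is split is an assumption made for the example.

```python
import random

# For each state: a list of (next_state, probability) pairs.
# The entries marked "assumed" are not given in the text.
TRANSITIONS = {
    0: [(0, 0.7), (1, 0.2), (3, 0.1)],   # 0.2 & 0.1 are assumed
    1: [(2, 0.9), (3, 0.1)],             # 0.1 to s3 is assumed
    2: [(1, 1.0)],
    3: [(3, 1.0)],                       # terminal state
}

def run_chain(start=0, max_steps=1000, rng=None):
    """Follow random transitions until the terminal state s3 is reached."""
    rng = rng or random.Random()
    state, path = start, [start]
    for _ in range(max_steps):
        if state == 3:
            break
        next_states, probas = zip(*TRANSITIONS[state])
        state = rng.choices(next_states, weights=probas, k=1)[0]
        path.append(state)
    return path

path = run_chain(rng=random.Random(42))
print(path)
```

Note that the next state is drawn using only the current state, never the earlier part of `path`: that is the "no memory" property.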

Markov decision processes were first described in the 1950s by Richard Bellman. They resemble Markov chains but with a twist: at each step, an agent can choose one of several possible actions, & the transition probabilities depend on the chosen action. Moreover, some state transitions return some reward (positive or negative), & the agent's goal is to find a policy that will maximise reward over time.

For example, the MDP represented below has three states (represented by circles) & up to three possible discrete actions at each step (represented by diamonds).

<img src = "Images/Markov Decision Process.png" width = "400" style = "margin:auto"/>

If it starts in state $s_0$, the agent can choose between actions $a_0$, $a_1$, & $a_2$. If it chooses action $a_1$, it just remains in state $s_0$ with certainty, & without any reward. It can thus decide to stay there forever if it wants to. But if it chooses action $a_0$, it has a 70% chance of gaining a reward of +10 & remaining in state $s_0$. It can then try again & again to gain as many rewards as possible, but at some point it is going to end up in state $s_1$ instead. In state $s_1$, it has only two possible actions: $a_0$ or $a_2$. It can choose to stay put by repeatedly choosing action $a_0$, or it can choose to move on to state $s_2$ & get a negative reward of -50 (ouch). In state $s_2$, it has no other choice than to take action $a_1$, which will most likely lead it back to state $s_0$, gaining a reward of +40 on the way. You get the picture. By looking at this MDP, can you guess which strategy will give the most reward over time? In state $s_0$, it is clear that action $a_0$ is the best option, & in state $s_2$, the agent has no choice but to take action $a_1$, but in state $s_1$ it is not obvious whether the agent should stay put ($a_0$) or go through the fire ($a_2$).

Bellman found a way to estimate the *optimal state value* of any state $s$, noted $V^{*}(s)$, which is the sum of all discounted future rewards the agent can expect on average after it reaches a state $s$, assuming it acts optimally. He showed that if the agent acts optimally, then the *Bellman Optimality Equation* applies. This recursive equation says that if the agent acts optimally, then the optimal value of the current state is equal to the reward it will get on average after taking one optimal action, plus the expected optimal value of all possible next states that this action can lead to.

$$V^*(s) = \underset{a}{max} \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V^*(s')]\ for\ all\ s$$

In this equation:

* $T(s, a, s')$ is the transition probability from state $s$ to state $s'$, given that the agent chose action $a$. For example, in the above figure, $T(s_2, a_1, s_0) = 0.8$.
* $R(s, a, s')$ is the reward that the agent gets when it goes from state $s$ to state $s'$, given that the agent chose action $a$. For example, in the above figure, $R(s_2, a_1, s_0) = +40$.
* $\gamma$ is the discount factor.

This equation leads directly to an algorithm that can precisely estimate the optimal state value of every possible state: you first initialise all the state value estimates to zero, & then you iteratively update them using the *value iteration* algorithm below. A remarkable result is that, given enough time, these estimates are guaranteed to converge to the optimal state values, corresponding to the optimal policy.

$$V_{k + 1}(s) \leftarrow \underset{a}{max} \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma V_k(s')]\ for\ all\ s$$

In this equation, $V_k(s)$ is the estimated value of state $s$ at the $k^{th}$ iteration of the algorithm.
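
As a quick sanity check of the update rule, here is a minimal sketch of value iteration on a tiny hypothetical MDP (two states, two actions, deterministic transitions) whose optimal state values can be worked out by hand. The arrays `T` & `R` & their contents are assumptions made for this illustration, not the MDP from the figure.

```python
import numpy as np

# Hypothetical MDP: T[s, a, s'] is the transition probability,
# R[s, a, s'] the reward for that transition.
T = np.zeros((2, 2, 2))
R = np.zeros((2, 2, 2))
T[0, 0, 0] = 1.0   # in s0, action a0: stay in s0...
R[0, 0, 0] = 1.0   # ...& collect a reward of +1
T[0, 1, 1] = 1.0   # in s0, action a1: move to s1, no reward
T[1, 0, 1] = 1.0   # in s1, both actions stay in s1...
T[1, 1, 1] = 1.0
R[1, 0, 1] = 2.0   # ...& collect a reward of +2
R[1, 1, 1] = 2.0

gamma = 0.9
V = np.zeros(2)    # initialise all state value estimates to zero
for iteration in range(200):
    # expected return of each (s, a): sum over s' of T * (R + gamma * V[s'])
    Q = np.sum(T * (R + gamma * V), axis = 2)
    # value iteration update: keep the best action's value for each state
    V = Q.max(axis = 1)

print(V)
```

By hand, staying in $s_1$ forever is worth $2 / (1 - 0.9) = 20$, & the best move from $s_0$ is to go to $s_1$ straight away, which is worth $0.9 \times 20 = 18$ (staying in $s_0$ forever would only be worth 10). The estimates converge to these values.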

Knowing the optimal state values can be useful, in particular to evaluate a policy, but it does not give us the optimal policy for the agent. Luckily, Bellman found a very similar algorithm to estimate the optimal *state-action values*, generally called *Q-values* (Quality Values). The optimal Q-value of the state-action pair ($s$, $a$), noted $Q^*(s, a)$, is the sum of discounted future rewards the agent can expect on average after it reaches the state $s$ & chooses action $a$, but before it sees the outcome of this action, assuming it acts optimally after that action.

Here is how it works: once again, you start by initialising all the Q-value estimates to zero, then you update them using the *Q-value iteration* algorithm below.

$$Q_{k + 1}(s, a) \leftarrow \sum_{s'} T(s, a, s') [R(s, a, s') + \gamma * \underset{a'}{max}\ Q_k(s', a')]\ for\ all\ (s, a)$$

Once you have the optimal Q-values, defining the optimal policy, noted $\pi^*(s)$, is trivial: when the agent is in state $s$, it should choose the action with the highest Q-value for that state: $\pi^*(s) = \underset{a}{argmax}\ Q^*(s, a)$.

Let's apply this algorithm to the MDP represented above. First, we need to define the MDP:

In [None]:
transition_probabilities = [[[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
                            [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
                            [None, [0.8, 0.1, 0.1], None]]
rewards = [[[+10, 0, 0], [0, 0, 0], [0, 0, 0]],
           [[0, 0, 0], [0, 0, 0], [0, 0, -50]],
           [[0, 0, 0], [+40, 0, 0], [0, 0, 0]]]
possible_actions = [[0, 1, 2], [0, 2], [1]]

For example, to know the transition probability from $s_2$ to $s_0$ after playing action $a_1$, we will look up `transition_probabilities[2][1][0]` (which is 0.8). Similarly, to get the corresponding reward, we will look up `rewards[2][1][0]` (which is +40). To get the list of possible actions in $s_2$, we will look up `possible_actions[2]` (in this case, only action $a_1$ is possible). Next, we must initialise all the Q-values to 0 (except for the impossible actions, for which we set the Q-values to $-\infty$):

In [None]:
Q_values = np.full((3, 3), -np.inf)
for state, actions in enumerate(possible_actions):
    Q_values[state, actions] = 0.0

Now let's run the Q-value iteration algorithm. It applies the algorithm repeatedly to all Q-values, for every state & every possible action:

In [None]:
gamma = 0.9

for iteration in range(50):
    Q_prev = Q_values.copy()
    for s in range(3):
        for a in possible_actions[s]:
            Q_values[s, a] = np.sum([transition_probabilities[s][a][sp]
                                     * (rewards[s][a][sp] + gamma * np.max(Q_prev[sp])) 
                                     for sp in range(3)])

That's it! The resulting Q-values look like this:

In [None]:
Q_values

For example, when the agent is in state $s_0$ & it chooses action $a_1$, the expected sum of discounted future rewards is approximately 17.0.

For each state, let's look at the action that has the highest Q-value.

In [None]:
np.argmax(Q_values, axis = 1)

This gives us the optimal policy for this MDP, when using a discount factor of 0.9: in state $s_0$ choose action $a_0$; in state $s_1$ choose action $a_0$ (i.e., stay put); & in state $s_2$ choose action $a_1$ (the only possible action). Interestingly, if we increase the discount factor to 0.95, the optimal policy changes: in state $s_1$, the best action becomes $a_2$ (go through the fire!). This makes sense, because the more you value future rewards, the more you are willing to put up with some pain now for the promise of future bliss.
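
To check the effect of the discount factor, the sketch below wraps Q-value iteration in a small helper & runs it for both values. The MDP arrays are repeated so that the block is self-contained; the helper name `optimal_policy()` & the choice of 200 iterations are assumptions made for this illustration.

```python
import numpy as np

transition_probabilities = [
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
    [None, [0.8, 0.1, 0.1], None]]
rewards = [[[+10, 0, 0], [0, 0, 0], [0, 0, 0]],
           [[0, 0, 0], [0, 0, 0], [0, 0, -50]],
           [[0, 0, 0], [+40, 0, 0], [0, 0, 0]]]
possible_actions = [[0, 1, 2], [0, 2], [1]]

def optimal_policy(gamma, n_iterations = 200):
    """Run Q-value iteration & return the greedy action for each state."""
    Q = np.full((3, 3), -np.inf)           # -inf marks impossible actions
    for state, actions in enumerate(possible_actions):
        Q[state, actions] = 0.0
    for _ in range(n_iterations):
        Q_prev = Q.copy()
        for s in range(3):
            for a in possible_actions[s]:
                Q[s, a] = sum(transition_probabilities[s][a][sp]
                              * (rewards[s][a][sp] + gamma * Q_prev[sp].max())
                              for sp in range(3))
    return Q.argmax(axis = 1)

for gamma in (0.90, 0.95):
    print(gamma, optimal_policy(gamma))
```

Only the action chosen in state $s_1$ differs between the two runs.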

---

# Temporal Difference Learning

Reinforcement learning problems with discrete actions can often be modeled as Markov decision processes, but the agent initially has no idea what the transition probabilities are (it does not know $T(s, a, s')$), & it does not know what the rewards are going to be either (it does not know $R(s, a, s')$). It must experience each state & each transition at least once to know the rewards, & it must experience them multiple times if it is to have a reasonable estimate of the transition probabilities.

The *temporal difference learning* (TD Learning) algorithm is very similar to the value iteration algorithm, but tweaked to take into account the fact that the agent has only partial knowledge of the MDP. In general, we assume that the agent initially knows only the possible states & actions, & nothing more. The agent uses an *exploration policy* -- for example, a purely random policy -- to explore the MDP, & as it progresses, the TD learning algorithm updates the estimates of the state values based on the transitions & rewards that are actually observed.

$$\begin{split}
V_{k + 1}(s) \leftarrow (1 - \alpha)V_k(s) + \alpha(r + \gamma * V_k(s')) \\
or,\ equivalently:  \\
V_{k + 1}(s) \leftarrow V_k(s) + \alpha * \sigma_k(s, r, s') \\
with\ \sigma_k(s, r, s') = r + \gamma * V_k(s') - V_k(s)
\end{split}$$

In this equation:

* $\alpha$ is the learning rate (e.g., 0.01).
* $r + \gamma * V_k(s')$ is called the *TD target*.
* $\sigma_k(s, r, s')$ is called the *TD error*.

A more concise way of writing the first form of the equation is to use the notation $a \underset{\alpha}{\leftarrow} b$ which means $a_{k + 1} \leftarrow (1 - \alpha) * a_k + \alpha * b_k$. So the first line of the above equation can be rewritten like this: $V(s) \underset{\alpha}{\leftarrow} r + \gamma * V(s')$.

For each state $s$, this algorithm simply keeps track of a running average of the immediate rewards the agent gets upon leaving that state, plus the rewards it expects to get later (assuming it acts optimally).
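
Here is a minimal sketch of the TD update on a hypothetical three-state chain ($s_0 \rightarrow s_1 \rightarrow s_2$, with $s_2$ terminal), chosen because its state values are easy to compute by hand. The chain, its rewards (`step_rewards`) & the hyperparameter values are assumptions made for this illustration.

```python
import numpy as np

gamma = 0.9
alpha = 0.1
step_rewards = [1.0, 2.0]   # reward for leaving s0, then for leaving s1
V = np.zeros(3)             # V[2] stays at 0 because s2 is terminal

for episode in range(500):
    state = 0
    while state != 2:
        next_state = state + 1                       # the only possible transition
        reward = step_rewards[state]
        td_target = reward + gamma * V[next_state]   # r + gamma * V(s')
        td_error = td_target - V[state]
        # equivalent to V[state] = (1 - alpha) * V[state] + alpha * td_target
        V[state] += alpha * td_error
        state = next_state

print(V)
```

The estimates approach $V(s_1) = 2$ & $V(s_0) = 1 + 0.9 \times 2 = 2.8$ without the algorithm ever being given the transition or reward tables: it only sees the transitions it experiences.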

---

# Q-Learning

Similarly, the Q-learning algorithm is an adaptation of the Q-value iteration algorithm to the situation where the transition probabilities & the rewards are initially unknown. Q-learning works by watching an agent play (e.g., randomly) & gradually improving its estimates of the Q-values. Once it has accurate Q-value estimates (or close enough), then the optimal policy is choosing the action that has the highest Q-value (i.e., the greedy policy).

$$Q(s, a) \underset{\alpha}{\leftarrow} r + \gamma * \underset{a'}{max}\ Q(s', a')$$

For each state-action pair ($s$, $a$), this algorithm keeps track of a running average of the rewards $r$ the agent gets upon leaving the state $s$ with action $a$, plus the sum of discounted future rewards it expects to get. To estimate this sum, we take the maximum of the Q-value estimates for the next state $s'$, since we assume that the target policy would act optimally from then on.

Let's implement the Q-learning algorithm. First, we will need to make an agent explore the environment. For this, we need a step function so that the agent can execute one action & get the resulting state & reward:

In [None]:
def step(state, action):
    probas = transition_probabilities[state][action]
    next_state = np.random.choice([0, 1, 2], p = probas)
    reward = rewards[state][action][next_state]
    return next_state, reward

Now let's implement the agent's exploration policy. Since the state space is pretty small, a simple random policy will be sufficient. If we run the algorithm for long enough, the agent will visit every state many times, & it will also try every possible action many times:

In [None]:
def exploration_policy(state):
    return np.random.choice(possible_actions[state])

Next, after we initialise the Q-values just like earlier, we are ready to run the Q-learning algorithm with learning rate decay (using power scheduling):

In [None]:
alpha0 = 0.05
decay = 0.005
gamma = 0.9
state = 0

for iteration in range(10000):
    action = exploration_policy(state)
    next_state, reward = step(state, action)
    next_value = np.max(Q_values[next_state])
    alpha = alpha0 / (1 + iteration * decay)
    Q_values[state, action] *= 1 - alpha
    Q_values[state, action] += alpha * (reward + gamma * next_value)
    state = next_state

This algorithm will converge to the optimal Q-values, but it will take many iterations, & possibly quite a lot of hyperparameter tuning. As you can see, the Q-value iteration algorithm (left) converges very quickly, in fewer than 20 iterations, while the Q-learning algorithm (right) takes about 8,000 iterations to converge. Obviously, not knowing the transition probabilities or the rewards makes finding the optimal policy significantly harder!

<img src = "Images/Q-Value Iteration vs Q-Learning Algorithm.png" width = "600" style = "margin:auto"/>

The Q-learning algorithm is called an *off-policy* algorithm because the policy being trained is not necessarily the one being executed: in the previous code example, the policy being executed (the exploration policy) is completely random, while the policy being trained will always choose the actions with the highest Q-values. Conversely, the policy gradients algorithm is an *on-policy* algorithm: it explores the world using the policy being trained. It is somewhat surprising that Q-learning is capable of learning the optimal policy by just watching an agent act randomly (imagine learning to play golf when your teacher is a drunk monkey). Can we do better?

## Exploration Policies

Of course, Q-learning can work only if the exploration policy explores the MDP thoroughly enough. Although a purely random policy is guaranteed to visit every state & every transition many times, it may take an extremely long time to do so. Therefore, a better option is to use the *$\varepsilon$-greedy policy* ($\varepsilon$ is epsilon): at each step it acts randomly with probability $\varepsilon$, or greedily with probability 1 - $\varepsilon$ (i.e., choosing the action with the highest Q-value). The advantage of the $\varepsilon$-greedy policy (compared to a random policy) is that it will spend more & more time exploring the interesting parts of the environment, as the Q-value estimates get better & better, while still spending some time visiting unknown regions of the MDP. It is quite common to start with a high value for $\varepsilon$ (e.g., 1.0) & then gradually reduce it (e.g., down to 0.05).
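
For a tabular setting, an $\varepsilon$-greedy choice could be sketched as follows. The function name `epsilon_greedy_action()`, its signature & the example Q-values are assumptions made for this illustration.

```python
import numpy as np

def epsilon_greedy_action(Q_row, valid_actions, epsilon, rng):
    """Pick a random valid action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.choice(valid_actions))
    # greedy choice, restricted to the actions that are possible in this state
    return max(valid_actions, key = lambda a: Q_row[a])

rng = np.random.default_rng(0)
Q_row = np.array([1.5, -np.inf, 4.0])   # hypothetical Q-values for one state
valid_actions = [0, 2]

greedy = epsilon_greedy_action(Q_row, valid_actions, epsilon = 0.0, rng = rng)
explore = [epsilon_greedy_action(Q_row, valid_actions, epsilon = 1.0, rng = rng)
           for _ in range(10)]
print(greedy, explore)
```

With `epsilon = 0.0` the function is purely greedy (it always returns action 2 here), & with `epsilon = 1.0` it behaves like the purely random exploration policy.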

Alternatively, rather than relying only on chance for exploration, another approach is to encourage the exploration policy to try actions that it has not tried much before. This can be implemented as a bonus added to the Q-value estimates, as shown in the equation below:

$$Q(s, a) \underset{\alpha}{\leftarrow} r + \gamma * \underset{a'}{max}\ f(Q(s', a'), N(s', a'))$$

In this equation:

* $N(s', a')$ counts the number of times the action $a'$ was chosen in state $s'$.
* $f(Q, N)$ is an *exploration function*, such as $f(Q, N) = Q + k/(1 + N)$, where $k$ is a curiosity hyperparameter that measures how much the agent is attracted to the unknown.
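
The effect of the exploration function can be seen on a small numerical example. The values of `Q_next`, `N_next`, `k`, the reward & the discount factor below are all made up for illustration.

```python
import numpy as np

def exploration_function(Q, N, k):
    """f(Q, N) = Q + k / (1 + N): rarely tried actions get a larger bonus."""
    return Q + k / (1 + N)

Q_next = np.array([1.0, 0.5])   # hypothetical estimates of Q(s', a')
N_next = np.array([10, 0])      # how often each action was tried in s'
k = 2.0                         # curiosity hyperparameter

bonus_values = exploration_function(Q_next, N_next, k)
reward, gamma = 0.0, 0.9
target = reward + gamma * bonus_values.max()
print(bonus_values, target)
```

Although the first action has the higher Q-value estimate, the second one has never been tried, so its bonus dominates & it determines the target. As its count grows, the bonus shrinks & the plain Q-value estimates take over.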

## Approximate Q-Learning & Deep Q-Learning

The main problem with Q-learning is that it does not scale well to large (or even medium) MDPs with many states & actions: there are far too many state-action pairs to keep track of an estimate for every single one. The solution is to find a function $Q_{\theta}(s, a)$ that approximates the Q-value of any state-action pair ($s$, $a$) using a manageable number of parameters (given by the parameter vector $\theta$). This is called *approximate Q-learning*. For years, it was recommended to use linear combinations of handcrafted features extracted from the state (e.g., distance of the closest ghosts, their directions, & so on) to estimate Q-values, but in 2013, DeepMind showed that using deep neural networks can work much better, especially for complex problems, & it does not require any feature engineering. A DNN used to estimate Q-values is called a *deep Q-network* (DQN), & using a DQN for approximate Q-learning is called *deep Q-learning*.

Now, how can we train a DQN? Well, consider the approximate Q-value computed by the DQN for a given state-action pair ($s$, $a$). Thanks to Bellman, we know we want this approximate Q-value to be as close as possible to the reward $r$ that we actually observe after playing action $a$ in state $s$, plus the discounted value of playing optimally from then on. To estimate this sum of future discounted rewards, we can simply execute the DQN on the next state $s'$ & for all possible actions $a'$. We get an approximate future Q-value for each possible action. We then pick the highest (since we assume we will be playing optimally) & discount it, & this gives us an estimate of the sum of future discounted rewards. By summing the reward $r$ & the future discounted value estimate, we get a target Q-value $Q_{target}(s, a)$ for the state-action pair ($s$, $a$), as shown below:

$$Q_{target}(s, a) = r + \gamma * \underset{a'}{max}\ Q_{\theta}(s', a')$$

With this target Q-value, we can run a training step using any gradient descent algorithm. Specifically, we generally try to minimise the squared error between the estimated Q-value $Q(s, a)$ & the target Q-value (or the Huber loss to reduce the algorithm's sensitivity to large errors). That's all for the basic deep Q-learning algorithm! Let's see how to implement it to solve the cartpole environment.
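
Before moving on, here is a small NumPy sketch of the two quantities just described: the target Q-values for a batch, & the squared error versus the Huber loss. The helper names `q_targets()` & `huber()`, the threshold `delta = 1.0` & all the numbers are assumptions made for this illustration; terminal states are ignored for simplicity.

```python
import numpy as np

def q_targets(rewards, next_Q_values, gamma):
    """Target = r + gamma * max over a' of Q(s', a'), for each experience."""
    return rewards + gamma * next_Q_values.max(axis = 1)

def huber(error, delta = 1.0):
    """Quadratic for small errors, linear for large ones."""
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    return np.where(abs_error <= delta, quadratic, linear)

rewards = np.array([1.0, 0.0])
next_Q_values = np.array([[0.5, 2.0],    # Q-values predicted for each next state
                          [1.0, -1.0]])
targets = q_targets(rewards, next_Q_values, gamma = 0.9)

predicted = np.array([2.3, 3.9])         # hypothetical Q(s, a) predictions
errors = targets - predicted
print(targets, errors ** 2, huber(errors))
```

The second experience has a large error (-3): the squared error penalises it with 9, while the Huber loss only gives 2.5, which is why it is less sensitive to large errors.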

---

# Implementing Deep Q-Learning

The first thing we need is a deep Q-network. In theory, we need a neural net that takes a state-action pair & outputs an approximate Q-value, but in practice it's much more efficient to use a neural network that takes a state & outputs one approximate Q-value for each possible action. To solve the cartpole environment, we do not need a very complicated neural network: a couple of hidden layers will do:

In [None]:
env = gymnasium.make("CartPole-v0", render_mode = "human")
input_shape = [4]  # == env.observation_space.shape
n_outputs = 2

model = keras.models.Sequential([
    keras.layers.Input(shape = input_shape),
    keras.layers.Dense(32, activation = "elu"),
    keras.layers.Dense(32, activation = "elu"),
    keras.layers.Dense(n_outputs)
])

To select an action using this DQN, we pick the action with the largest predicted Q-value. To ensure that the agent explores the environment, we will use an $\varepsilon$-greedy policy (i.e., we will choose a random action with probability $\varepsilon$):

In [None]:
def epsilon_greedy_policy(state, epsilon = 0):
    if np.random.rand() < epsilon:
        return np.random.randint(2)
    else:
        Q_values = model.predict(state[np.newaxis], verbose = 0)
        return np.argmax(Q_values[0])

Instead of training the DQN based only on the latest experiences, we will store all experiences in a *replay buffer* (or *replay memory*), & we will sample a random training batch from it at each training iteration. This helps reduce the correlations between the experiences in a training batch, which tremendously helps training. For this, we will just use a `deque`:

In [None]:
from collections import deque

replay_buffer = deque(maxlen = 2000)

Each experience will be composed of five elements: a state, the action the agent took, the resulting reward, the next state it reached, & finally a boolean indicating whether the episode ended at that point (`done`). We will need a small function to sample a random batch of experiences from the replay buffer. It will return five NumPy arrays corresponding to the five experience elements:

In [None]:
def sample_experiences(batch_size):
    indices = np.random.randint(len(replay_buffer), size = batch_size)
    batch = [replay_buffer[index] for index in indices]
    states, actions, rewards, next_states, dones = [
        np.array([experience[field_index] for experience in batch])
        for field_index in range(5)
    ]
    return states, actions, rewards, next_states, dones

Let's also create a function that will play a single step using the $\varepsilon$-greedy policy, then store the resulting experience in the replay buffer:

In [None]:
def play_one_step(env, state, epsilon):
    action = epsilon_greedy_policy(state, epsilon)
    # gymnasium's step() returns five values, including a `truncated` flag
    next_state, reward, done, truncated, info = env.step(action)
    replay_buffer.append((state, action, reward, next_state, done))
    return next_state, reward, done, truncated, info

Finally, let's create one last function that will sample a batch of experiences from the replay buffer & train the DQN by performing a single gradient descent step on this batch:

In [None]:
batch_size = 32
discount_factor = 0.95
optimiser = keras.optimizers.Adam(learning_rate = 1e-3)
loss_fn = keras.losses.mean_squared_error

def training_step(batch_size):
    experiences = sample_experiences(batch_size)
    states, actions, rewards, next_states, dones = experiences
    next_Q_values = model.predict(next_states)
    max_next_Q_values = np.max(next_Q_values, axis = 1)
    target_Q_values = (rewards + (1 - dones) * discount_factor * max_next_Q_values)
    target_Q_values = target_Q_values.reshape(-1, 1)  # match the shape of Q_values
    mask = tf.one_hot(actions, n_outputs)
    with tf.GradientTape() as tape:
        all_Q_values = model(states)
        Q_values = tf.reduce_sum(all_Q_values * mask, axis = 1, keepdims = True)
        loss = tf.reduce_mean(loss_fn(target_Q_values, Q_values))
    grads = tape.gradient(loss, model.trainable_variables)
    optimiser.apply_gradients(zip(grads, model.trainable_variables))

Let's go through this code:

* First we define some hyperparameters, & we create the optimiser & the loss function.
* Then we create the `training_step()` function. It starts by sampling a batch of experiences, then it uses the DQN to predict the Q-value for each possible action in each experience's next state. Since we assume that the agent will be playing optimally, we only keep the maximum Q-value for each next state. Next, we compute the target Q-value for each experience's state-action pair.
* Next, we want to use the DQN to compute the Q-value for each experienced state-action pair. However, the DQN will also output the Q-values for the other possible actions, not just for the action that was actually chosen by the agent. So we need to mask out all the Q-values we do not need. The `tf.one_hot()` function makes it easy to convert an array of action indices into such a mask. For example, if the first three experiences contain actions 1, 1, 0, respectively, then the mask will start with `[[0, 1], [0, 1], [1, 0], ...]`. We can then multiply the DQN's output with this mask, & this will zero out all the Q-values we do not want. We then sum over axis 1 to get rid of all the zeros, keeping only the Q-values of the experienced state-action pairs. This gives us the `Q_values` tensor, containing one predicted Q-value for each experience in the batch.
* Then we compute the loss: it is the mean squared error between the target & predicted Q-values for the experienced state-action pairs.
* Finally, we perform a gradient descent step to minimise the loss with regard to the model's trainable variables.
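
The masking step can be reproduced in plain NumPy to see what it does. In this sketch `np.eye(n_outputs)[actions]` stands in for `tf.one_hot()`, & the Q-values & targets are made up. It also shows why the targets should have the same `(batch, 1)` shape as `Q_values` before computing the loss.

```python
import numpy as np

n_outputs = 2
actions = np.array([1, 1, 0])               # actions taken in three experiences
all_Q_values = np.array([[0.2, 0.7],        # hypothetical DQN outputs
                         [1.1, 0.4],
                         [0.9, 0.3]])

mask = np.eye(n_outputs)[actions]           # NumPy stand-in for tf.one_hot()
Q_values = (all_Q_values * mask).sum(axis = 1, keepdims = True)
print(mask)
print(Q_values)

targets = np.array([1.0, 0.5, 0.8])         # shape (3,)
# a (3,) array minus a (3, 1) array silently broadcasts to (3, 3)
print((targets - Q_values).shape)
# reshaping the targets to (3, 1) gives one error per experience
print((targets.reshape(-1, 1) - Q_values).shape)
```

Each row of `Q_values` keeps only the Q-value of the action that was actually played, & the last two lines show the shape mismatch that broadcasting would otherwise hide.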

This was the hardest part. Now training the model is straightforward:

In [None]:
for episode in range(600):
    obs, info = env.reset()  # gymnasium's reset() returns (observation, info)
    for step in range(200):
        epsilon = max(1 - episode / 500, 0.01)
        obs, reward, done, truncated, info = play_one_step(env, obs, epsilon)
        if done or truncated:
            break
    if episode > 50:
        training_step(batch_size)

We run 600 episodes, each for a maximum of 200 steps. At each step, we first compute the `epsilon` value for the $\varepsilon$-greedy policy: it will go from 1 down to 0.01, linearly, in a bit under 500 episodes. Then we call the `play_one_step()` function, which will use the $\varepsilon$-greedy policy to pick an action, then execute it & record the experience in the replay buffer. If the episode is done, we exit the loop. Finally, if we are past the 50th episode, we call the `training_step()` function to train the model on one batch sampled from the replay buffer. The reason we play 50 episodes without training is to give the replay buffer some time to fill up (if we don't wait enough, then there will not be enough diversity in the replay buffer). That's it, we just implemented the deep Q-learning algorithm.
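
The linear decay of `epsilon` described above is easy to check in isolation. The helper name `epsilon_for()` & the sampled episode numbers are assumptions made for this illustration.

```python
def epsilon_for(episode):
    """Linear decay from 1.0, clipped at a floor of 0.01."""
    return max(1 - episode / 500, 0.01)

# a few sample points along the schedule
schedule = {episode: epsilon_for(episode) for episode in (0, 250, 495, 599)}
for episode, epsilon in schedule.items():
    print(f"episode {episode}: epsilon = {epsilon:.3f}")
```

The floor of 0.01 is reached at episode 495, which is the "bit under 500 episodes" mentioned above; from then on the agent still explores 1% of the time.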

<img src = "Images/Learning Curve of Deep Q-Learning Algorithm.png" width = "550" style = "margin:auto"/>

As you can see, the algorithm made no apparent progress at all for almost 300 episodes (in part because $\varepsilon$ was very high at the beginning), then its performance suddenly skyrocketed up to 200 (which is the maximum possible performance in this environment). That's great news: the algorithm worked fine, & it actually ran much faster than the policy gradient algorithm! But wait... just a few episodes later, it forgot everything it knew, & its performance dropped below 25! This is called *catastrophic forgetting*, & it is one of the big problems facing virtually all RL algorithms: as the agent explores the environment, it updates its policy, but what it learns in one part of the environment may break what it learned earlier in other parts of the environment. The experiences are quite correlated, & the learning environment keeps changing -- this is not ideal for gradient descent. If you increase the size of the replay buffer, the algorithm will be less subject to this problem. Reducing the learning rate may also help. But the truth is, reinforcement learning is hard: training is often unstable, & you may need to try many hyperparameter values & random seeds before you find a combination that works well. For example, if you try changing the number of neurons per layer in the preceding code from 32 to 30 or 34, the performance will never go above 100 (the DQN may be more stable with one hidden layer instead of two).

You might wonder why we didn't plot the loss. It turns out that loss is a poor indicator of the model's performance. The loss might go down, yet the agent might perform worse (e.g., this can happen when the agent gets stuck in one small region of the environment, & the DQN starts overfitting this region). Conversely, the loss could go up, yet the agent might perform better (e.g., if the DQN was underestimating the Q-values, & it starts correctly increasing its predictions, the agent will likely perform better, getting more rewards, but the loss might increase because the DQN also sets the targets, which will be larger too).

The basic deep Q-learning algorithm we've been using so far would be too unstable to learn to play Atari games. So how did DeepMind do it? Well, they tweaked the algorithm!

---

# Deep Q-Learning Variants

Let's look at a few variants of the deep Q-learning algorithm that can stabilise & speed up training.

## Fixed Q-Value Targets

In the basic deep Q-learning algorithm, the model is used both to make predictions & to set its own targets. This can lead to a situation analogous to a dog chasing its own tail. This feedback loop can make the network unstable: it can diverge, oscillate, freeze, & so on. To solve this problem, in their 2013 paper, the DeepMind researchers used two DQNs instead of one: the first is the *online model*, which learns at each step & is used to move the agent around, & the other is the *target model* used only to define the targets. The target model is just a clone of the online model:

In [None]:
target = keras.models.clone_model(model)
target.set_weights(model.get_weights())

Then, in the `training_step()` function, we just need to change one line to use the target model instead of the online model when computing the Q-values of the next states:

In [None]:
next_Q_values = target.predict(next_states)

Finally, in the training loop, we must copy the weights of the online model to the target model, at regular intervals (e.g., every 50 episodes):

In [None]:
if episode % 50 == 0:
    target.set_weights(model.get_weights())

Since the target model is updated much less often than the online model, the Q-value targets are more stable, the feedback loop we discussed earlier is dampened, & its effects are less severe. This approach was one of the DeepMind researchers' main contributions in their 2013 paper, allowing agents to learn to play Atari games from raw pixels. To stabilise training, they used a tiny learning rate of 0.00025, they updated the target model only every 10,000 steps (instead of the 50 in the previous code example), & they used a very large replay buffer of 1 million experiences. They decreased `epsilon` very slowly, from 1 to 0.1 in 1 million steps, & they let the algorithm run for 50 million steps.

Later in the lesson, we will use the tf-agents library to train a DQN agent to play *Breakout* using these hyperparameters, but before we get there, let's take a look at another DQN variant that managed to beat the state of the art once more.

## Double DQN

In a 2015 paper, DeepMind researchers tweaked their DQN algorithm, increasing its performance & somewhat stabilising training. They called this variant *Double DQN*. The update was based on the observation that the target network is prone to overestimating Q-values. Indeed, suppose all actions are equally good: the Q-values estimated by the target model should be identical, but since they are approximations, some may be slightly greater than others, by pure chance. The target model will always select the largest Q-value, which will be slightly greater than the mean Q-value, most likely overestimating the true Q-value (a bit like counting the height of the tallest random wave when measuring the depth of a pool). To fix this, they proposed using the online model instead of the target model when selecting the best actions for the next states, & using the target model only to estimate the Q-values for these best actions. Here is the updated `training_step()` function:

In [None]:
def training_step(batch_size):
    experiences = sample_experiences(batch_size)
    states, actions, rewards, next_states, dones = experiences
    next_Q_values = model.predict(next_states)
    best_next_actions = np.argmax(next_Q_values, axis = 1)
    next_mask = tf.one_hot(best_next_actions, n_outputs).numpy()
    next_best_Q_values = (target.predict(next_states) * next_mask).sum(axis = 1)
    target_Q_values = (rewards + (1 - dones) * discount_factor * next_best_Q_values)
    target_Q_values = target_Q_values.reshape(-1, 1)  # match the shape of Q_values
    mask = tf.one_hot(actions, n_outputs)
    with tf.GradientTape() as tape:
        all_Q_values = model(states)
        Q_values = tf.reduce_sum(all_Q_values * mask, axis = 1, keepdims = True)
        loss = tf.reduce_mean(loss_fn(target_Q_values, Q_values))
    grads = tape.gradient(loss, model.trainable_variables)
    optimiser.apply_gradients(zip(grads, model.trainable_variables))
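
The overestimation effect that motivates Double DQN can be reproduced with a small NumPy experiment. This is only a sketch under idealised assumptions: all true Q-values are zero, & the estimation noise of the two models is independent (real online & target models are correlated, so the correction is only partial in practice).

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_actions = 100_000, 4

# True Q-values are all zero; each model only sees a noisy estimate of them.
online_estimates = rng.normal(size = (n_trials, n_actions))
target_estimates = rng.normal(size = (n_trials, n_actions))

# Single estimator: the target model both selects & evaluates the best action
single = target_estimates.max(axis = 1)

# Double estimator: the online model selects, the target model evaluates
best_actions = online_estimates.argmax(axis = 1)
double = target_estimates[np.arange(n_trials), best_actions]

print(f"single estimator bias: {single.mean():.3f}")
print(f"double estimator bias: {double.mean():.3f}")
```

Taking the maximum of noisy estimates is biased upward even though every true value is zero, whereas selecting with one set of estimates & evaluating with the other gives an average close to zero.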

Just a few months later, another improvement to the DQN algorithm was proposed.

## Prioritised Experience Replay

Instead of sampling experiences *uniformly* from the replay buffer, why not sample important experiences more frequently? This idea is called *importance sampling* (IS) or *prioritised experience replay* (PER), & it was introduced in a 2015 paper by DeepMind researchers (once again!).

More specifically, experiences are considered "important" if they are likely to lead to fast learning progress. But how can we estimate this? One reasonable approach is to measure the magnitude of the TD error $\sigma = r + \gamma * V(s') - V(s)$. A large TD error indicates that a transition ($s$, $r$, $s'$) is very surprising, & thus probably worth learning from. When an experience is recorded in the replay buffer, its priority is set to a very large value, to ensure that it gets sampled at least once. However, once it is sampled (& every time it is sampled), the TD error $\sigma$ is computed, & this experience's priority is set to $p = |\sigma|$ (plus a small constant to ensure that every experience has a non-zero probability of being sampled). The probability $P$ of sampling an experience with priority $p$ is proportional to $p^{\zeta}$, where $\zeta$ is a hyperparameter that controls how greedy we want importance sampling to be: when $\zeta = 0$, we just get uniform sampling, & when $\zeta = 1$, we get full-blown importance sampling. In the paper, the authors used $\zeta = 0.6$, but the optimal value will depend on the task.

There's one catch, though: since the samples will be biased toward important experiences, we must compensate for this bias during training by downweighting the experiences according to their importance, or else the model will just overfit the important experiences. To be clear, we want important experiences to be sampled more often, but this also means we must give them a lower weight during training. To do this, we define each experience's training weight as $w = (nP)^{-\beta}$, where $n$ is the number of experiences in the replay buffer, & $\beta$ is a hyperparameter that controls how much we want to compensate for the importance sampling bias (0 means not at all, while 1 means entirely). In the paper, the authors used $\beta = 0.4$ at the beginning of training & linearly increased it to $\beta = 1$ by the end of training. Again, the optimal value will depend on the task, but if you increase one, you will usually want to increase the other as well.
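
These two formulas can be sketched in a few lines of NumPy. The helper names, the TD errors & the small constant `1e-3` are assumptions made for this illustration. The weights use a negative exponent, $w = (nP)^{-\beta}$, so that frequently sampled experiences are downweighted, & they are rescaled so that the largest weight is 1 (a normalisation used in the paper for stability).

```python
import numpy as np

def sampling_probabilities(priorities, zeta):
    """P is proportional to p ** zeta (zeta = 0 gives uniform sampling)."""
    scaled = priorities ** zeta
    return scaled / scaled.sum()

def training_weights(probabilities, beta):
    """w = (n * P) ** (-beta), rescaled so that the largest weight is 1."""
    n = len(probabilities)
    weights = (n * probabilities) ** (-beta)
    return weights / weights.max()

td_errors = np.array([0.1, -2.0, 0.5, 4.0])   # hypothetical TD errors
priorities = np.abs(td_errors) + 1e-3         # small constant keeps P non-zero

P = sampling_probabilities(priorities, zeta = 0.6)
w = training_weights(P, beta = 0.4)

rng = np.random.default_rng(0)
batch_indices = rng.choice(len(P), size = 2, p = P)
print(P.round(3), w.round(3), batch_indices)
```

The experience with the largest TD error is the most likely to be sampled, & it also receives the smallest training weight.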

Now let's look at one last important variant of the DQN algorithm.

## Dueling DQN

The *Dueling DQN* algorithm (DDQN, not to be confused with Double DQN, although both techniques can easily be combined) was introduced in yet another 2015 paper by DeepMind researchers. To understand how it works, we must first note that the Q-value of a state-action pair ($s$, $a$) can be expressed as $Q(s, a) = V(s) + A(s, a)$, where $V(s)$ is the value of state $s$ & $A(s, a)$ is the *advantage* of taking the action $a$ in state $s$, compared to all other possible actions in that state. Moreover, the value of a state is equal to the Q-value of the best action $a^*$ for that state (since we assume the optimal policy will pick the best action), so $V(s) = Q(s, a^*)$, which implies that $A(s, a^*) = 0$. In a dueling DQN, the model estimates both the value of the state & the advantage of each possible action. Since the best action should have an advantage of 0, the model subtracts the maximum predicted advantage from all predicted advantages. Here is a simple dueling DQN model, implemented using the functional API:

In [None]:
K = keras.backend
input_states = keras.layers.Input(shape = [4])
hidden1 = keras.layers.Dense(32, activation = "elu")(input_states)
hidden2 = keras.layers.Dense(32, activation = "elu")(hidden1)
state_values = keras.layers.Dense(1)(hidden2)
raw_advantages = keras.layers.Dense(n_outputs)(hidden2)
advantages = raw_advantages - K.max(raw_advantages, axis = 1, keepdims = True)
Q_values = state_values + advantages
model = keras.Model(inputs = [input_states], outputs = [Q_values])
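
The way the two output heads are combined can be checked in NumPy on made-up numbers. The values of `state_values` & `raw_advantages` below are assumptions for illustration only.

```python
import numpy as np

state_values = np.array([[3.0], [-1.0]])             # V(s), shape (batch, 1)
raw_advantages = np.array([[0.5, 2.0],               # shape (batch, n_actions)
                           [1.0, -4.0]])

# shift the advantages so that the best action has an advantage of exactly 0
advantages = raw_advantages - raw_advantages.max(axis = 1, keepdims = True)
Q_values = state_values + advantages
print(advantages)
print(Q_values)
```

In each row, the largest Q-value equals the state value, which is the property $V(s) = Q(s, a^*)$ discussed above.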

The rest of the algorithm is just the same as earlier. In fact, you can build a double dueling DQN & combine it with prioritised experience replay! More generally, many RL techniques can be combined, as DeepMind demonstrated in a 2017 paper. The paper's authors combined six different techniques into an agent called *Rainbow*, which largely outperformed the state of the art.

Unfortunately, implementing all of these techniques, debugging them, fine-tuning them, & of course training the models can require a huge amount of work. So instead of reinventing the wheel, it is often best to reuse scalable & well-tested libraries, such as tf-agents.

---

# The TF-Agents Library

The tf-agents library is a reinforcement learning library based on TensorFlow, developed at Google & open sourced in 2018. Just like OpenAI gym, it provides many off-the-shelf environments (including wrappers for all OpenAI gym environments), plus it supports the pybullet library (for 3D physics simulation), DeepMind's DM control library (based on MuJoCo's physics engine), & Unity's ML-Agents library (simulating many 3D environments). It also implements many RL algorithms, including REINFORCE, DQN, DDQN, as well as various RL components such as efficient replay buffers & metrics. It is fast, scalable, easy to use, & customisable: you can create your own environments & neural nets, & you can customise pretty much any component. In this section, we will use tf-agents to train an agent to play *Breakout*, the famous Atari game, using the DQN algorithm (you can easily switch to another algorithm if you prefer).

<img src = "Images/Breakout Game.png" width = "600" style = "margin:auto"/>

## Installing TF-Agents

Let's start by installing TF-Agents. This can be done using pip (as always, if you are using a virtual environment, make sure to activate it first; if not, you will need to use the `--user` option, or have administrator rights):

In [None]:
#pip install --user tf-agents

Next, let's create a TF-agents environment that will just wrap OpenAI gym's breakout environment. For this, we must install OpenAI Gym's Atari dependencies:

In [None]:
# pip install --user "gym[atari]"

Among other libraries, this command will install `atari-py`, which is a python interface for the arcade learning environment (ALE), a framework built on top of the Atari 2600 emulator Stella.

## TF-Agents Environments

If everything went well, you should be able to import TF-agents & create a breakout environment:

In [None]:
from tf_agents.environments import suite_gym

env = suite_gym.load("Breakout-v4")
env

This is just a wrapper around an OpenAI gym environment, which you can access through the `gym` attribute:

In [None]:
env.gym

TF-agents environments are very similar to OpenAI gym environments, but there are a few differences. First, the `reset()` method does not return an observation; instead it returns a `TimeStep` object that wraps the observation, as well as some extra information:

In [None]:
env.reset()

The `step()` method returns a `TimeStep` object as well:

In [None]:
env.step(1)

The `reward` & `observation` attributes are self-explanatory, & they are the same as for OpenAI gym (except the `reward` is represented as a numpy array). The `step_type` attribute is equal to 0 for the first time step in the episode, 1 for intermediate time steps, & 2 for the final time step. You can call the time step's `is_last()` method to check whether it is the final one or not. Lastly, the `discount` attribute indicates the discount factor to use at this time step. In this example, it is equal to 1, so there will be no discount at all. You can define the discount factor by setting the `discount` parameter when loading the environment.
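
The sketch below mimics these attributes with a plain named tuple, to show how the step type & the discount are typically used together. `SimpleTimeStep` & `is_last()` are simplified stand-ins written for this example, not the TF-agents classes, & the three-step episode is made up. The way the discount is accumulated is one plausible convention, assumed here for illustration.

```python
from collections import namedtuple

# Simplified stand-in for a time step (not the TF-Agents class itself).
SimpleTimeStep = namedtuple("SimpleTimeStep",
                            ["step_type", "reward", "discount", "observation"])
FIRST, MID, LAST = 0, 1, 2


def is_last(time_step):
    """True if this time step ends the episode."""
    return time_step.step_type == LAST


# A made-up three-step episode with a discount of 1 at every step.
episode = [
    SimpleTimeStep(FIRST, 0.0, 1.0, "obs0"),
    SimpleTimeStep(MID, 1.0, 1.0, "obs1"),
    SimpleTimeStep(LAST, 2.0, 1.0, "obs2"),
]

# Accumulate the return, scaling later rewards by the running discount.
episode_return, running_discount = 0.0, 1.0
for time_step in episode:
    episode_return += running_discount * time_step.reward
    running_discount *= time_step.discount
    if is_last(time_step):
        break

print(episode_return)
```

With a discount of 1 at every step, the return is simply the sum of the rewards.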

## Environment Specifications

Conveniently, a TF-agents environment provides the specifications of the observations, actions & time steps, including their shapes, data types, & names, as well as their minimum & maximum values:

In [None]:
env.observation_spec()

In [None]:
env.action_spec()

In [None]:
env.time_step_spec()

As you can see, the observations are simply screenshots of the Atari screen, represented as numpy arrays of shape [210, 160, 3]. To render an environment, you can call `env.render(mode = "human")`, & if you want to get back the image in the form of a numpy array, just call `env.render(mode = "rgb_array")` (unlike in OpenAI gym, this is the default mode).

There are four actions available. Gym's Atari environments have an extra method that you can call to know what each action corresponds to:

In [None]:
env.gym.get_action_meanings()

The observations are quite large, so we will downsample them & also convert them to grayscale. This will speed up training & use less RAM. For this, we can use an *environment wrapper*.

## Environment Wrappers & Atari Preprocessing

TF-agents provides several environment wrappers in the `tf_agents.environments.wrappers` package. As their name suggests, they wrap an environment, forwarding every call to it, but also adding some extra functionality. Here are some of the available wrappers:

* `ActionClipWrapper`
   - Clips the actions to the action spec.
* `ActionDiscretizeWrapper`
   - Quantises a continuous action space to a discrete action space. For example, if the original environment's action space is the continuous range from -1.0 to +1.0, but you want to use an algorithm that only supports discrete action spaces, such as a DQN, then you can wrap the environment using `discrete_env = ActionDiscretizeWrapper(env, num_actions = 5)`, & the new `discrete_env` will have a discrete action space with 5 possible actions: 0, 1, 2, 3, 4. These actions correspond to the actions -1.0, -0.5, 0.0, 0.5, & 1.0 in the original environment.
* `ActionRepeat`
   - Repeats each action over *n* steps, while accumulating the rewards. In many environments, this can speed up training significantly.
* `RunStats`
   - Records environment statistics such as the number of steps & the number of episodes.
* `TimeLimit`
   - Interrupts the environment if it runs for longer than a maximum number of steps.
* `VideoWrapper`
   - Records a video of the environment.
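
The mapping described for `ActionDiscretizeWrapper` can be reproduced in a few lines of plain Python. The function `discretized_actions()` is a hypothetical helper written for this example; it only illustrates the evenly spaced mapping, not the wrapper's actual implementation.

```python
def discretized_actions(low, high, num_actions):
    """Evenly spaced continuous actions, one per discrete action index."""
    step = (high - low) / (num_actions - 1)
    return [low + i * step for i in range(num_actions)]


# Mapping for the example above: 5 discrete actions over [-1.0, +1.0].
action_map = discretized_actions(-1.0, 1.0, 5)
for index, value in enumerate(action_map):
    print(index, "->", value)
```

A discrete agent then picks an index from 0 to 4, & the wrapper forwards the corresponding continuous value to the wrapped environment.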
 
To create a wrapped environment, you must create a wrapper, passing the wrapped environment to the constructor. That's all! For example, the following code will wrap our environment in an `ActionRepeat` wrapper so that every action is repeated four times:

In [None]:
from tf_agents.environments.wrappers import ActionRepeat

repeating_env = ActionRepeat(env, times = 4)
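
To make the idea of a wrapper concrete, here is a minimal sketch of an action-repeating wrapper around a toy environment. Both classes (`CountingEnv` & `RepeatAction`) are hypothetical & use a simplified `step()` signature; they are not the TF-agents implementation, only an illustration of forwarding calls while accumulating rewards.

```python
class CountingEnv:
    """Toy environment: each step gives reward 1; the episode ends after 10 steps."""
    def __init__(self):
        self.steps = 0

    def step(self, action):
        self.steps += 1
        done = self.steps >= 10
        return self.steps, 1.0, done  # observation, reward, done


class RepeatAction:
    """Hypothetical wrapper that repeats each action and sums the rewards."""
    def __init__(self, env, times):
        self.env = env
        self.times = times

    def step(self, action):
        total_reward = 0.0
        for _ in range(self.times):
            observation, reward, done = self.env.step(action)
            total_reward += reward
            if done:  # stop repeating once the episode is over
                break
        return observation, total_reward, done


repeating = RepeatAction(CountingEnv(), times=4)
first = repeating.step(0)
second = repeating.step(0)
third = repeating.step(0)
print(first, second, third)
```

The third call stops early because the toy episode ends after ten underlying steps, so it only accumulates two rewards.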

OpenAI gym has some environment wrappers of its own in the `gym.wrappers` package. They are meant to wrap gym environments, though, not TF-agents environments, so to use them, you must first wrap the gym environment with a gym wrapper, then wrap the resulting environment with a TF-agents wrapper. The `suite_gym.wrap_env()` function will do this for you, provided you give it a gym environment & a list of gym wrappers &/or a list of TF-agents wrappers. Alternatively, the `suite_gym.load()` function will both create the gym environment & wrap it for you, if you give it some wrappers. Each wrapper will be created without any arguments, so if you want to set some arguments, you must pass a `lambda`. For example, the following code creates a breakout environment that will run for a maximum of 10,000 steps during each episode, & each action will be repeated four times:

In [None]:
from gym.wrappers import TimeLimit

limited_repeating_env = suite_gym.load("Breakout-v4",
                                       gym_env_wrappers = [lambda env: TimeLimit(env,
                                                                                 max_episode_steps = 10000)],
                                       env_wrappers = [lambda env: ActionRepeat(env, times = 4)])

For Atari environments, some standard preprocessing steps are applied in most papers that use them, so TF-agents provides a handy `AtariPreprocessing` wrapper that implements them. Here is the list of preprocessing steps it supports:

* *Grayscale & downsampling*
   - Observations are converted to grayscale & downsampled (by default to 84 x 84 pixels)
* *Max pooling*
   - The last two frames of the game are max-pooled using a 1 x 1 filter. This is to remove the flickering that occurs in some Atari games due to the limited number of sprites that the Atari 2600 could display in each frame.
* *Frame skipping*
   - The agent only gets to see every *n* frames of the game (by default *n* = 4), & its actions are repeated for each frame, collecting all the rewards. This effectively speeds up the game from the perspective of the agent, & it also speeds up training because rewards are less delayed.
* *End on life lost*
   - In some games, the rewards are just based on the score, so the agent gets no immediate penalty for losing a life. One solution is to end the game immediately whenever a life is lost. There is some debate over the actual benefits of this strategy, so it is off by default.
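
The first two steps can be sketched with NumPy on tiny made-up frames. The grayscale conversion below simply averages the colour channels, which is an assumption made for brevity (real preprocessing code may weight the channels differently), & the function names are hypothetical.

```python
import numpy as np


def to_grayscale(frame):
    """Average the RGB channels (a simple choice; real wrappers may weight them)."""
    return frame.mean(axis=2).astype(np.uint8)


def max_pool_frames(frame_a, frame_b):
    """Pixel-wise maximum of two frames, so a sprite drawn in either one survives."""
    return np.maximum(frame_a, frame_b)


# Two made-up 4 x 4 RGB frames: a sprite flickers between them.
frame_1 = np.zeros((4, 4, 3), dtype=np.uint8)
frame_2 = np.zeros((4, 4, 3), dtype=np.uint8)
frame_1[0, 0] = 200   # sprite visible only in the first frame
frame_2[3, 3] = 100   # another sprite visible only in the second frame

pooled = max_pool_frames(to_grayscale(frame_1), to_grayscale(frame_2))
print(pooled)
```

Both sprites appear in the pooled frame, even though each one was missing from one of the two source frames.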
 
Since the default Atari environment already applies random frame skipping & max pooling, we will need to load the raw nonskipping variant called `"BreakoutNoFrameskip-v4"`. Moreover, a single frame from the *Breakout* game is insufficient to know the direction & speed of the ball, which will make it very difficult for the agent to play the game properly (unless it is an RNN agent, which preserves some internal state between steps). One way to handle this is to use an environment wrapper that will output observations composed of multiple frames stacked on top of each other along the channel dimension. This strategy is implemented by the `FrameStack4` wrapper, which returns stacks of four frames. Let's create the wrapped Atari environment!

In [None]:
from tf_agents.environments import suite_atari
from tf_agents.environments.atari_preprocessing import AtariPreprocessing
from tf_agents.environments.atari_wrappers import FrameStack4

max_episode_steps = 27000
environment_name = "BreakoutNoFrameskip-v4"

env = suite_atari.load(environment_name,
                       max_episode_steps = max_episode_steps,
                       gym_env_wrappers = [AtariPreprocessing, FrameStack4])
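
Frame stacking itself is simple to sketch: keep the last four frames in a fixed-size queue & stack them along a new last axis. The `FrameStacker` class below is a hypothetical helper, not the `FrameStack4` wrapper; in particular, filling the stack with copies of the first frame on reset is an assumption made for this example.

```python
from collections import deque

import numpy as np


class FrameStacker:
    """Hypothetical helper keeping the last frames stacked on the channel axis."""
    def __init__(self, num_frames=4):
        self.frames = deque(maxlen=num_frames)

    def reset(self, first_frame):
        # Fill the stack with copies of the first frame of the episode.
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return self.observation()

    def add(self, frame):
        self.frames.append(frame)  # the oldest frame is dropped automatically
        return self.observation()

    def observation(self):
        return np.stack(list(self.frames), axis=-1)


stacker = FrameStacker()
obs = stacker.reset(np.zeros((84, 84), dtype=np.uint8))
obs = stacker.add(np.full((84, 84), 7, dtype=np.uint8))
print(obs.shape)
```

The newest frame always sits in the last channel, so the differences between channels carry the motion information that a single frame lacks.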

The result of all this preprocessing is shown below.

<img src = "Images/Preprocessing Breakout Observation.png" width = "600" style = "margin:auto"/>

Lastly, we can wrap the environment inside a `TFPyEnvironment`:

In [None]:
from tf_agents.environments.tf_py_environment import TFPyEnvironment

tf_env = TFPyEnvironment(env)

This will make the environment usable from within a TensorFlow graph (under the hood, this class relies on `tf.py_function()`, which allows a graph to call arbitrary Python code). Thanks to the `TFPyEnvironment` class, TF-agents supports both pure Python environments & TensorFlow-based environments. More generally, TF-agents supports & provides both pure Python & TensorFlow-based components (agents, replay buffers, metrics, & so on).

Now that we have a nice breakout environment, with all the appropriate preprocessing & TensorFlow support, we must create the DQN agent & the other components we will need to train it. Let's look at the architecture of the system we will build.

## Training Architecture

A TF-agents training program is usually split into two parts that run in parallel.

<img src = "Images/Typical TF-Agents Training Architecture.png" width = "500" style = "margin:auto"/>

The figure begs a few questions, which I'll attempt to answer here:

* Why are there multiple environments? Instead of exploring a single environment, you generally want the driver to explore multiple copies of the environment in parallel, taking advantage of the power of all your CPU cores, keeping the training GPUs busy, & providing less-correlated trajectories to the training algorithm.
* What is a *trajectory*? It is a concise representation of a *transition* from one time step to the next, or a sequence of consecutive transitions from time step *n* to time step *n + t*. The trajectories collected by the driver are passed to the observer, which saves them in the replay buffer, & they are later sampled by the agent & used for training.
* Why do we need an observer? Can't the driver save the trajectories directly? Indeed, it could, but this would make the architecture less flexible. For example, what if you don't want to use a replay buffer? What if you want to use the trajectories for something else, like computing metrics? In fact, an observer is just any function that takes a trajectory as an argument. You can use an observer to save the trajectories to a replay buffer, or to save them to a TFRecord file, or to compute the metrics, or for anything else. Moreover, you can pass multiple observers to the driver, & it will broadcast the trajectories to all of them.

Now we will create all these components: first the deep Q-network, then the DQN agent (which will take care of creating the collect policy), then the replay buffer & the observer to write to it, then a few training metrics, then the driver, & finally the dataset. Once we have all the components in place, we will populate the replay buffer with some initial trajectories, then we will run the main training loop. So, let's start by creating the deep Q-network.

## Creating the Deep Q-Network

The TF-agents library provides many networks in the `tf_agents.networks` package & its subpackages. We will use the `tf_agents.networks.q_network.QNetwork` class:

In [None]:
from tf_agents.networks.q_network import QNetwork

preprocessing_layer = keras.layers.Lambda(lambda obs: tf.cast(obs, np.float32) / 255.0)
conv_layer_params = [(32, (8, 8), 4), (64, (4, 4), 2), (64, (3, 3), 1)]
fc_layer_params = [512]

q_net = QNetwork(tf_env.observation_spec(),
                 tf_env.action_spec(),
                 preprocessing_layers = preprocessing_layer,
                 conv_layer_params = conv_layer_params,
                 fc_layer_params = fc_layer_params)

This `QNetwork` takes an observation as input & outputs one Q-value per action, so we must give it the specifications of the observations & the actions. It starts with a preprocessing layer: a simple `Lambda` layer that casts the observations to 32-bit floats & normalises them (the values will range from 0.0 to 1.0). The observations contain unsigned bytes, which use 4 times less space than 32-bit floats, which is why we did not cast the observations to 32-bit floats earlier; we want to save RAM in the replay buffer. Next, the network applies three convolutional layers: the first has 32 8 x 8 filters & uses a stride of 4, the second has 64 4 x 4 filters & a stride of 2, & the third has 64 3 x 3 filters & a stride of 1. Lastly, it applies a dense layer with 512 units, followed by a dense output layer with 4 units, one per Q-value to output (i.e., one per action). All convolutional layers & all dense layers except the output layer use the ReLU activation function by default (you can change this by setting the `activation_fn` argument). The output layer does not use any activation function.
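
It can help to trace how the three convolutional layers shrink an 84 x 84 x 4 observation. The sketch below assumes that the convolutions use no padding ("valid" padding), which is an assumption about the defaults rather than something stated above; `conv_output_size()` is a helper written for this example.

```python
def conv_output_size(size, kernel, stride):
    """Output width/height of a convolution without padding ("valid")."""
    return (size - kernel) // stride + 1


conv_layer_params = [(32, (8, 8), 4), (64, (4, 4), 2), (64, (3, 3), 1)]

size, channels = 84, 4  # 84 x 84 observations with 4 stacked frames
for filters, (kernel, _), stride in conv_layer_params:
    size = conv_output_size(size, kernel, stride)
    channels = filters
    print("{} x {} x {}".format(size, size, channels))

# Number of features fed to the 512-unit dense layer after flattening.
flattened = size * size * channels
print(flattened)
```

Under this assumption, the feature maps go from 20 x 20 to 9 x 9 to 7 x 7, so the dense layer receives 3,136 inputs.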

Under the hood, a `QNetwork` is composed of two parts: an encoding network that processes the observations, followed by a dense output layer that outputs one Q-value per action. TF-agent's `EncodingNetwork` class implements a neural network architecture found in various agents.

<img src = "Images/Architecture of Encoding Network.png" width = "550" style = "margin:auto"/>

It may have one or more inputs. For example, if each observation is composed of some sensor data plus an image from a camera, you will have two inputs. Each input may require some preprocessing steps, in which case you can specify a list of keras layers via the `preprocessing_layers` argument, with one preprocessing layer per input, & the network will apply each layer to the corresponding input (if an input requires multiple layers of preprocessing, you can pass a whole model, since a keras model can always be used as a layer). If there are two inputs or more, you must also pass an extra layer via the `preprocessing_combiner` argument, to combine the outputs from the preprocessing layers into a single output.

Next, the encoding network will optionally apply a list of convolutions sequentially, provided you specify their parameters via the `conv_layer_params` argument. This must be a list composed of 3-tuples (one per convolutional layer) indicating the number of filters, the kernel size, & the stride. After these convolutional layers, the encoding network will optionally apply a sequence of dense layers, if you set the `fc_layer_params` argument: it must be a list containing the number of neurons for each dense layer. Optionally, you can also pass a list of dropout rates (one per dense layer) via the `dropout_layer_params` argument if you want to apply dropout after each dense layer. The `QNetwork` takes the output of this encoding network & passes it to the dense output layer (with one unit per action).

Now that we have the DQN, we are ready to build the DQN agent.

## Creating the DQN Agent

The TF-agents library implements many types of agents, located in the `tf_agents.agents` package & its subpackages. We will use the `tf_agents.agents.dqn.dqn_agent.DqnAgent` class:

In [None]:
from tf_agents.agents.dqn.dqn_agent import DqnAgent

train_step = tf.Variable(0)
update_period = 4
optimiser = keras.optimizers.RMSprop(learning_rate = 2.5e-4, rho = 0.95, momentum = 0.0,
                                     epsilon = 0.00001, centered = True)
epsilon_fn = keras.optimizers.schedules.PolynomialDecay(initial_learning_rate = 1.0,
                                                        decay_steps = 250000 // update_period,
                                                        end_learning_rate = 0.01)
agent = DqnAgent(tf_env.time_step_spec(),
                 tf_env.action_spec(),
                 q_network = q_net,
                 optimizer = optimiser,
                 target_update_period = 2000,
                 td_errors_loss_fn = keras.losses.Huber(reduction = "none"),
                 gamma = 0.99,
                 train_step_counter = train_step,
                 epsilon_greedy = lambda: epsilon_fn(train_step))
agent.initialize()

Let's walk through this code:

* We first create a variable that will count the number of training steps.
* Then we build the optimiser, using the same hyperparameters as in the 2015 DQN paper.
* Next, we create a `PolynomialDecay` object that will compute the $\varepsilon$ value for the $\varepsilon$-greedy collect policy, given the current training step (it is normally used to decay the learning rate, hence the names of the arguments, but it will work just fine to decay any other value). It will go from 1.0 down to 0.01 (the value used in the 2015 DQN paper) in 1 million ALE frames, which corresponds to 250,000 steps, since we use frame skipping with a period of 4. Moreover, we will train the agent every 4 steps (i.e., 16 ALE frames), so $\varepsilon$ will actually decay over 62,500 *training* steps.
* We then build the `DqnAgent`, passing it the time step & action specs, the `QNetwork` to train, the optimiser, the number of training steps between target model updates, the loss function to use, the discount factor, the `train_step` variable, & a function that returns the $\varepsilon$ value (it must take no argument, which is why we need a lambda to pass the `train_step`). Note that the loss function must return an error per instance, not the mean error, which is why we set `reduction = "none"`.
* Lastly, we initialise the agent.
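
The $\varepsilon$ schedule described above is a plain linear decay followed by a constant value, assuming the schedule's default power of 1 & no cycling. The function `epsilon_at()` below is a hypothetical pure-Python equivalent written to show the numbers; it is not the Keras class.

```python
def epsilon_at(train_step, initial=1.0, final=0.01, decay_steps=250000 // 4):
    """Linear decay from `initial` to `final`, then constant."""
    fraction = min(train_step, decay_steps) / decay_steps
    return (initial - final) * (1.0 - fraction) + final


# Epsilon at the start, halfway through, at the end & beyond the decay period.
for step in (0, 31250, 62500, 100000):
    print(step, epsilon_at(step))
```

Halfway through the 62,500 training steps, the agent still picks a random action about half of the time; after that point, $\varepsilon$ stays at 0.01.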

Next, let's build the replay buffer & the observer that will write to it.

## Creating the Replay Buffer & the Corresponding Observer

The TF-agents library provides various replay buffer implementations in the `tf_agents.replay_buffers` package. Some are purely written in Python (their module names start with `py_`), & others are written based on TensorFlow (their module names start with `tf_`). We will use the `TFUniformReplayBuffer` class in the `tf_agents.replay_buffers.tf_uniform_replay_buffer` package. It provides a high-performance implementation of a replay buffer with uniform sampling:

In [None]:
from tf_agents.replay_buffers import tf_uniform_replay_buffer

replay_buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(data_spec = agent.collect_data_spec,
                                                               batch_size = tf_env.batch_size,
                                                               max_length = 1000000)

Let's look at each of these arguments:

* `data_spec`
   - The specification of the data that will be saved in the replay buffer. The DQN agent knows what the collected data will look like, & it makes the data spec available via its `collect_data_spec` attribute, so that's what we give the replay buffer.
* `batch_size`
   - The number of trajectories that will be added at each step. In our case, it will be one, since the driver will just execute one action per step & collect one trajectory. If the environment were a *batched environment*, meaning an environment that takes a batch of actions at each step & returns a batch of observations, then the driver would have to save a batch of trajectories at each step. Since we are using a TensorFlow replay buffer, it needs to know the size of the batches it will handle (to build the computation graph). An example of a batched environment is the `ParallelPyEnvironment` (from the `tf_agents.environments.parallel_py_environment` package): it runs multiple environments in parallel in separate processes (they can be different as long as they have the same action & observation specs), & at each step it takes a batch of actions & executes them in the environments (one action per environment), then it returns all the resulting observations.
* `max_length`
   - The maximum size of the replay buffer. We created a large replay buffer that can store one million trajectories (as was done in the 2015 DQN paper). This will require a lot of RAM.
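
The effect of `max_length` & the memory cost can be sketched without TensorFlow. `TinyReplayBuffer` is a hypothetical toy class, not the TF-agents buffer, & the memory estimate only counts the observations (84 x 84 x 4 unsigned bytes each), ignoring the other trajectory fields & any overhead.

```python
from collections import deque


class TinyReplayBuffer:
    """Hypothetical in-memory buffer that drops its oldest items when full."""
    def __init__(self, max_length):
        self.items = deque(maxlen=max_length)

    def add_batch(self, batch):
        self.items.extend(batch)  # one item per environment in the batch

    def __len__(self):
        return len(self.items)


toy_buffer = TinyReplayBuffer(max_length=3)
for step in range(5):
    toy_buffer.add_batch([step])  # batch size 1, as with a single environment
print(list(toy_buffer.items))

# Rough memory needed for the observations alone in a one-million-item buffer.
bytes_per_observation = 84 * 84 * 4  # uint8 pixels, 4 stacked frames
total_gb = bytes_per_observation * 1000000 / 1e9
print(total_gb)
```

Once the toy buffer is full, each new item pushes out the oldest one. The estimate comes to roughly 28 GB for the observations alone, & it would be four times larger if they were stored as 32-bit floats.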
 
Now we can create the observer that will write the trajectories to the replay buffer. An observer is just a function (or any callable object) that takes a trajectory argument, so we can directly use the `add_batch()` method (bound to the `replay_buffer` object) as our observer:

In [None]:
replay_buffer_observer = replay_buffer.add_batch

If you wanted to create your own observer, you could write any function with a `trajectory` argument. If it must have a state, you can write a class with a `__call__(self, trajectory)` method. For example, here is a simple observer that will increment a counter every time it is called (except when the trajectory represents a boundary between two episodes, which does not count as a step), & every 100 increments it displays the progress up to a given total (the carriage return `\r` along with `end = ""` ensures that the displayed counter remains on the same line):

In [None]:
class ShowProgress:
    def __init__(self, total):
        self.counter = 0
        self.total = total
    def __call__(self, trajectory):
        if not trajectory.is_boundary():
            self.counter += 1
        if self.counter % 100 == 0:
            print("\r{}/{}".format(self.counter, self.total), end = "")

Now let's create a few training metrics.

## Creating Training Metrics

TF-agents implements several RL metrics in the `tf_agents.metrics` package, some purely in Python & some based on TensorFlow. Let's create a few of them in order to count the number of episodes, the number of steps taken, & most importantly the average return per episode & the average episode length:

In [None]:
from tf_agents.metrics import tf_metrics

train_metrics = [tf_metrics.NumberOfEpisodes(),
                 tf_metrics.EnvironmentSteps(),
                 tf_metrics.AverageReturnMetric(),
                 tf_metrics.AverageEpisodeLengthMetric()]

At any time, you can get the value of each of these metrics by calling its `result()` method (e.g., `train_metrics[0].result()`). Alternatively, you can log all metrics by calling `log_metrics(train_metrics)` (this function is located in the `tf_agents.eval.metric_utils` package):

In [None]:
from tf_agents.eval.metric_utils import log_metrics
import logging

logging.getLogger().setLevel(logging.INFO)
log_metrics(train_metrics)
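
To give an idea of what such a metric does internally, here is a minimal pure-Python sketch of an average-return metric written as an observer. The `Step` tuple & the `AverageReturn` class are hypothetical stand-ins, & averaging over the last 10 episodes is an assumption chosen for this example.

```python
from collections import deque, namedtuple

# Simplified stand-in for a trajectory: a reward plus an end-of-episode flag.
Step = namedtuple("Step", ["reward", "is_last"])


class AverageReturn:
    """Hypothetical observer averaging the returns of the most recent episodes."""
    def __init__(self, buffer_size=10):
        self.returns = deque(maxlen=buffer_size)
        self.current_return = 0.0

    def __call__(self, step):
        self.current_return += step.reward
        if step.is_last:  # episode finished: store its return & start over
            self.returns.append(self.current_return)
            self.current_return = 0.0

    def result(self):
        return sum(self.returns) / len(self.returns) if self.returns else 0.0


metric = AverageReturn()
for step in [Step(1.0, False), Step(2.0, True), Step(5.0, True)]:
    metric(step)
print(metric.result())
```

The two made-up episodes have returns of 3 & 5, so the metric reports their mean.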

Next, let's create the collect driver.

## Creating the Collect Driver

A driver is an object that explores an environment using a given policy, collects experiences, & broadcasts them to some observers. At each step, the following things happen:

* The driver passes the current time step to the collect policy, which uses this time step to choose an action & returns an *action step* object containing the action.
* The driver then passes the action to the environment, which returns the next time step.
* Finally, the driver creates a trajectory object to represent this transition & broadcasts it to all the observers.
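
These three steps can be sketched as a short loop. Everything below is a simplified, hypothetical stand-in (dictionaries instead of time step objects, a bare action instead of an action step, a tuple instead of a trajectory), so it only illustrates the control flow, not the TF-agents driver itself.

```python
import random


class ThreeStepEnv:
    """Toy environment whose episodes last exactly three steps."""
    def __init__(self):
        self.t = 0

    def reset(self):
        self.t = 0
        return {"observation": self.t, "is_last": False}

    def step(self, action):
        self.t += 1
        return {"observation": self.t, "is_last": self.t >= 3}


def random_policy(time_step):
    return random.choice([0, 1])  # the "action step" is reduced to a bare action


def run_driver(env, policy, observers, num_steps, time_step=None):
    """Run `policy` in `env` for `num_steps`, broadcasting every transition."""
    if time_step is None or time_step["is_last"]:
        time_step = env.reset()
    for _ in range(num_steps):
        action = policy(time_step)
        next_time_step = env.step(action)
        trajectory = (time_step, action, next_time_step)
        for observer in observers:
            observer(trajectory)
        time_step = env.reset() if next_time_step["is_last"] else next_time_step
    return time_step


collected = []
final_time_step = run_driver(ThreeStepEnv(), random_policy,
                             observers=[collected.append], num_steps=4)
print(len(collected))
```

Because an observer is just a callable, the list's `append` method is enough to collect the transitions here. The function returns the last time step so that a later call can resume where this one stopped.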

Some policies, such as RNN policies, are stateful: they choose an action based on both the given time step & their own internal state. Stateful policies return their own state in the action step, along with the chosen action. The driver will then pass this state back to the policy at the next time step. Moreover, the driver saves the policy state to the trajectory (in the `policy_info` field), so it ends up in the replay buffer. This is essential when training a stateful policy: when the agent samples a trajectory, it must set the policy's state to the state it was in at the time of the sampled time step.

Also, as discussed earlier, the environment may be a batched environment, in which case the driver passes a *batched time step* to the policy (i.e., a time step object containing a batch of observations, a batch of step types, a batch of rewards, & a batch of discounts, all four batches of the same size). The driver also passes a batch of previous policy states, & the policy returns a batch of actions & a batch of policy states. Finally, the driver creates a *batched trajectory* (i.e., a trajectory containing a batch of step types, a batch of observations, a batch of actions, a batch of rewards, & more generally a batch for each trajectory attribute, with all batches of the same size).

There are two main driver classes: `DynamicStepDriver` & `DynamicEpisodeDriver`. The first one collects experiences for a given number of steps, while the second collects experiences for a given number of episodes. We want to collect experiences for four steps for each training iteration (as was done in the 2015 DQN paper), so let's create a `DynamicStepDriver`:

In [None]:
from tf_agents.drivers.dynamic_step_driver import DynamicStepDriver

collect_driver = DynamicStepDriver(tf_env, agent.collect_policy,
                                   observers = [replay_buffer_observer] + train_metrics,
                                   num_steps = update_period)

We give it the environment to play with, the agent's collect policy, a list of observers (including the replay buffer observer & the training metrics), & finally the number of steps to run (in this case, four). We could now run it by calling its `run()` method, but it's best to warm up the replay buffer with experiences collected using a purely random policy. For this, we can use the `RandomTFPolicy` class & create a second driver that will run this policy for 20,000 steps (which is equivalent to 80,000 simulator frames, as was done in the 2015 DQN paper). We can use our `ShowProgress` observer to display the progress.

In [None]:
from tf_agents.policies.random_tf_policy import RandomTFPolicy

initial_collect_policy = RandomTFPolicy(tf_env.time_step_spec(),
                                        tf_env.action_spec())
init_driver = DynamicStepDriver(tf_env, initial_collect_policy,
                                observers = [replay_buffer.add_batch, ShowProgress(20000)],
                                num_steps = 20000)
final_time_step, final_policy_state = init_driver.run()

We're almost ready to run the training loop! We just need one last component: the dataset.

## Creating the Dataset

To sample a batch of trajectories from the replay buffer, call its `get_next()` method. This returns the batch of trajectories plus a `BufferInfo` object that contains the sample identifiers & their sampling probabilities (this may be useful for some algorithms, such as PER). For example, the following code will sample a small batch of two trajectories (subepisodes), each containing three consecutive steps.

In [None]:
trajectories, buffer_info = replay_buffer.get_next(sample_batch_size = 2, num_steps = 3)
trajectories._fields

In [None]:
trajectories.observation.shape

In [None]:
trajectories.step_type.numpy()

The `trajectories` object is a named tuple, with seven fields. Each field contains a tensor whose first two dimensions are 2 & 3 (since there are two trajectories, each with three steps). This explains why the shape of the `observation` field is [2, 3, 84, 84, 4]: that's two trajectories, each with three steps, & each step's observation is 84 x 84 x 4. Similarly, the `step_type` tensor has a shape of [2, 3]: in this example, both trajectories contain three consecutive steps in the middle of an episode (types 1, 1, 1). In the second trajectory, you can barely see the ball at the lower left of the first observation, & it disappears in the next two observations, so the agent is about to lose a life, but the episode will not end immediately because it still has several lives left.

<img src = "Images/Two Trajectories of Three Consecutive Steps.png" width = "600" style = "margin:auto"/>

Each trajectory is a concise representation of a sequence of consecutive time steps & action steps, designed to avoid redundancy. How so? 

<img src = "Images/Trajectories, Transitions, Time Step, & Action Steps.png" width = "500" style = "margin:auto"/>

Well, transition *n* is composed of time step *n*, action step *n*, & time step *n* + 1, while transition *n* + 1 is composed of time step *n* + 1, action step *n* + 1, & time step *n* + 2. If we just stored these two transitions directly in the replay buffer, the time step *n* + 1 would be duplicated. To avoid this duplication, the $n^{th}$ trajectory step includes only the type & observation from time step *n* (not its reward & discount), & it does not contain the observation from time step *n* + 1 (however, it does contain a copy of the next time step's type; that's the only duplication).

So if you have a batch of trajectories where each trajectory has *t* + 1 steps (from time step *n* to time step *n* + *t*), then it contains all the data from time step *n* to time step *n* + *t*, except for the reward & discount from time step *n* (but it contains the reward & discount of time step *n* + *t* + 1). This represents *t* transitions (*n* to *n* + 1, *n* + 1 to *n* + 2, ..., *n* + *t* - 1 to *n* + *t*).

The `to_transition()` function in the `tf_agents.trajectories.trajectory` module converts a batched trajectory into a list containing a batched `time_step`, a batched `action_step`, & a batched `next_time_step`. Notice that the second dimension is 2 instead of 3, since there are *t* transitions between *t* + 1 time steps (don't worry if you're a bit confused; you'll get the hang of it):

In [None]:
from tf_agents.trajectories.trajectory import to_transition

time_steps, action_steps, next_time_steps = to_transition(trajectories)
time_steps.observation.shape
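
The counting argument (*t* + 1 steps give *t* transitions) is easy to check with a toy example. The function `to_transitions()` below is a hypothetical helper that pairs each step with its successor; it only mirrors the idea & is not the TF-agents function.

```python
def to_transitions(steps):
    """Pair each step with its successor: t + 1 steps give t transitions."""
    return list(zip(steps[:-1], steps[1:]))


# A made-up sub-episode of three consecutive steps (n, n + 1, n + 2).
steps = ["step_n", "step_n+1", "step_n+2"]
transitions = to_transitions(steps)
print(transitions)
```

The middle step appears in both pairs, which is exactly the duplication that the trajectory format avoids by storing each step only once.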

For our main training loop, instead of calling the `get_next()` method, we will use a `tf.data.Dataset`. This way, we can benefit from the power of the data API (e.g., parallelism & prefetching). For this, we call the replay buffer's `as_dataset()` method:

In [None]:
dataset = replay_buffer.as_dataset(sample_batch_size = 64,
                                   num_steps = 2,
                                   num_parallel_calls = 3).prefetch(3)

We will sample batches of 64 trajectories at each training step (as in the 2015 DQN paper), each with 2 steps (i.e., 2 steps = 1 full transition, including the next step's observation). This dataset will process three elements in parallel, & prefetch three batches.
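
Conceptually, such a dataset is an endless stream of random batches of short windows taken from the buffer. The generator below is a hypothetical pure-Python sketch of that idea on a toy list; it ignores episode boundaries, parallelism & prefetching, which the real dataset takes care of.

```python
import random


def sample_windows(buffer, sample_batch_size, num_steps, rng):
    """Endlessly yield batches of `num_steps` consecutive items from `buffer`."""
    last_start = len(buffer) - num_steps
    while True:
        starts = [rng.randint(0, last_start) for _ in range(sample_batch_size)]
        yield [buffer[s:s + num_steps] for s in starts]


toy_buffer = list(range(100))  # stand-in for stored trajectory steps
toy_dataset = sample_windows(toy_buffer, sample_batch_size=64, num_steps=2,
                             rng=random.Random(42))
batch = next(toy_dataset)
print(len(batch), batch[0])
```

Each window holds two consecutive items, which corresponds to one full transition per sampled trajectory.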

Now that we have all the components in place, we are ready to train the model.

## Creating the Training Loop

To speed up training, we will convert the main functions to TensorFlow functions. For this we will use the `tf_agents.utils.common.function()` function, which wraps `tf.function()`, with some extra experimental options:

In [None]:
from tf_agents.utils.common import function

collect_driver.run = function(collect_driver.run)
agent.train = function(agent.train)

Let's create a small function that will run the main training loop for `n_iterations`:

In [None]:
def train_agent(n_iterations):
    time_step = None  # initially there is no current time step
    policy_state = agent.collect_policy.get_initial_state(tf_env.batch_size)
    iterator = iter(dataset)
    for iteration in range(n_iterations):
        # Collect experience & broadcast it to the replay buffer & the metrics
        time_step, policy_state = collect_driver.run(time_step, policy_state)
        # Sample one batch of trajectories & run one training step on it
        trajectories, buffer_info = next(iterator)
        train_loss = agent.train(trajectories)
        print("\r{} loss:{:.5f}".format(iteration, train_loss.loss.numpy()), end = "")
        if iteration % 1000 == 0:
            log_metrics(train_metrics)

The function first asks the collect policy for its initial state (given the environment batch size, which is 1 in this case). Since the policy is stateless, this returns an empty tuple (so we could have written `policy_state = ()`). Next, we create an iterator over the dataset, & we run the training loop. At each iteration, we call the driver's `run()` method, passing it the current time step (initially `None`) & the current policy state. It will run the collect policy & collect experience for four steps (as we configured earlier), broadcasting the collected trajectories to the replay buffer & the metrics. Next, we sample one batch of trajectories from the dataset, & we pass it to the agent's `train()` method. It returns a `train_loss` object, which may vary depending on the type of agent. Next, we display the iteration number & the training loss, & every 1,000 iterations we log all the metrics. Now you can just call `train_agent()` for some number of iterations, & see the agent gradually learn to play *Breakout*!

In [None]:
train_agent(1000000)

This will take a lot of computing power & a lot of patience (it may take hours, or even days, depending on your hardware), plus you may need to run the algorithm several times with different random seeds to get good results, but once it's done, the agent will be superhuman (at least at *Breakout*). You can also try training this DQN agent on other Atari games: it can achieve superhuman skill at most action games, but it is not so good at games with long-running storylines.
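
The collect-then-train rhythm of `train_agent()` can be tried in a fraction of a second with a toy stand-in. The sketch below makes strong simplifying assumptions: `ToyDriver` & `toy_train_loop()` are hypothetical names, integers play the role of time steps, & the "loss" is just the mean of the sampled batch. It only shows how the time step returned by one `run()` call is fed into the next one, & how logging happens at a fixed interval:

```python
import random
from collections import deque


class ToyDriver:
    """Hypothetical stand-in for a collect driver: plays a fixed number of steps."""

    def __init__(self, buffer, num_steps=4):
        self.buffer = buffer
        self.num_steps = num_steps

    def run(self, time_step=None):
        # None means there is no current time step yet, so start from step 0
        if time_step is None:
            time_step = 0
        for _ in range(self.num_steps):
            self.buffer.append(time_step)  # store the step as experience
            time_step += 1
        return time_step  # passed back in on the next call


def toy_train_loop(n_iterations, batch_size=8, log_every=5, seed=0):
    rng = random.Random(seed)
    buffer = deque(maxlen=100)
    driver = ToyDriver(buffer)
    time_step = None
    logged = []
    for iteration in range(n_iterations):
        time_step = driver.run(time_step)                # 1. collect experience
        batch = rng.choices(list(buffer), k=batch_size)  # 2. sample a batch
        loss = sum(batch) / len(batch)                   # 3. stand-in for a training step
        if iteration % log_every == 0:                   # 4. log at a fixed interval
            logged.append((iteration, loss))
    return time_step, len(buffer), logged


final_step, buffer_size, logged = toy_train_loop(10)
print(final_step, buffer_size, [iteration for iteration, _ in logged])
# prints: 40 40 [0, 5]
```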

---

# Overview of Some Popular RL Algorithms

Before finishing this lesson, let's take a quick look at a few popular RL algorithms:

* *Actor-Critic algorithms*
   - A family of RL algorithms that combine policy gradients with deep Q-networks. An actor-critic agent contains two neural networks: a policy net & a DQN. The DQN is trained normally, by learning from the agent's experiences. The policy net learns differently (& much faster) than in regular PG: instead of estimating the value of each action by going through multiple episodes, then summing the future discounted rewards for each action, & finally normalising them, the agent (actor) relies on the action values estimated by the DQN (critic). It's a bit like an athlete (agent) learning with the help of a coach (DQN).
* *Asynchronous Advantage Actor-Critic* (A3C)
   - An important actor-critic variant introduced by DeepMind researchers in 2016, where multiple agents learn in parallel, exploring different copies of the environment. At regular intervals, but asynchronously (hence the name), each agent pushes some weight updates to a master network, then it pulls the latest weights from that network. Each agent thus contributes to improving the master network & benefits from what the other agents have learned. Moreover, instead of estimating Q-values, the DQN estimates the advantage of each action (hence the second A in the name), which stabilises training.
* *Advantage Actor-Critic* (A2C)
   - A variant of the A3C algorithm that removes the asynchronicity. All model updates are synchronous, so gradient updates are performed over larger batches, which allows the model to better utilise the power of the GPU.
* *Soft Actor-Critic* (SAC)
   - An actor-critic variant proposed in 2018 by Tuomas Haarnoja & other UC Berkeley researchers. It learns not only rewards, but also to maximise the entropy of its actions. In other words, it tries to be as unpredictable as possible while still getting as many rewards as possible. This encourages the agent to explore the environment, which speeds up training & makes it less likely to repeatedly execute the same action when the DQN produces imperfect estimates. This algorithm has demonstrated an amazing sample efficiency (contrary to all the previous algorithms, which learn very slowly). SAC is available in tf-agents.
* *Proximal Policy Optimisation* (PPO)
   - An algorithm based on A2C that clips the loss function to avoid excessively large weight updates (which often lead to training instabilities). OpenAI made the news in April 2019 with their AI called OpenAI Five, based on the PPO algorithm, which defeated the world champions at the multiplayer game *Dota 2*. PPO is also available in tf-agents.
* *Curiosity-based exploration*
   - A recurring problem in RL is the sparsity of the rewards, which makes learning very slow & inefficient. Deepak Pathak & other UC Berkeley researchers have proposed an exciting way to tackle this issue: why not ignore the rewards, & just make the agent extremely curious to explore the environment? The rewards thus become intrinsic to the agent, rather than coming from the environment. Similarly, stimulating curiosity in a child is more likely to give good results than purely rewarding the child for getting good grades. How does this work? The agent continuously tries to predict the outcome of its actions, & it seeks situations where the outcome does not match its predictions. In other words, it wants to be surprised. If the outcome is predictable (boring), it goes elsewhere. However, if the outcome is unpredictable but the agent notices that it has no control over it, it also gets bored after a while. With only curiosity, the authors succeeded in training an agent at many video games: even though the agent gets no penalty for losing, the game starts over, which is boring, so it learns to avoid it.
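
Two of the ideas in the list above fit in a few lines of plain Python. The sketch below is a per-sample illustration rather than a training algorithm, & the function names are hypothetical. It uses the usual definition of the advantage, $A(s, a) = Q(s, a) - V(s)$ with $V(s) = \sum_a \pi(a \mid s) Q(s, a)$, & the clipped surrogate objective commonly associated with PPO, $\min(rA, \text{clip}(r, 1 - \epsilon, 1 + \epsilon)A)$, where $r$ is the ratio between the new & old action probabilities. The value $\epsilon = 0.2$ is just a frequently used default, assumed here:

```python
def advantages(q_values, action_probs):
    """Return A(s, a) = Q(s, a) - V(s) for every action in one state."""
    # V(s) is the expected Q-value under the policy's action probabilities
    state_value = sum(p * q for p, q in zip(action_probs, q_values))
    return [q - state_value for q in q_values]


def ppo_clipped_objective(ratio, advantage, epsilon=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


# Two actions with Q-values 1 & 3 under a uniform policy: V(s) = 2
example_advantages = advantages([1.0, 3.0], [0.5, 0.5])
print(example_advantages)  # prints: [-1.0, 1.0]

# A large ratio with a positive advantage is capped at (1 + epsilon) * A
capped = ppo_clipped_objective(ratio=1.5, advantage=2.0)

# With a negative advantage, the same ratio is not capped: the minimum
# keeps the more pessimistic (unclipped) value
uncapped = ppo_clipped_objective(ratio=1.5, advantage=-1.0)
```

The clipping removes any incentive to move the new policy far from the old one in a single update, which is the instability the PPO entry refers to.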

We covered many topics in this lesson: policy gradients, Markov chains, Markov decision processes, Q-learning, approximate Q-learning, & deep Q-learning & its main variants (fixed Q-value targets, double DQN, dueling DQN, & prioritised experience replay). We discussed how to use tf-agents to train agents at scale, & finally we took a quick look at a few other popular algorithms. Reinforcement learning is a huge & exciting field, with new ideas & algorithms popping up every day, so I hope this lesson sparked your curiosity: there is a whole world to explore!