# CL3 - Reinforcement Learning

In this computer lab, we study the reinforcement learning algorithms presented in the lectures. Each of them comes with an interactive visualization, so you can see how it works in practice. The algorithms are only explained briefly: the notebook is meant to serve as a complement to the lectures, not a replacement. Watch the lectures first, or pause the videos in the middle, then come here and see the algorithms in action!

The structure for this notebook is as follows, where the corresponding lectures are in parentheses:

1. [Describing the environments](#describing_the_environments) (L9)
    1. Deterministic transitions
    2. Stochastic transitions
    3. Randomly-permuted actions

2. [Model-based methods](#model_based_methods) (L10)
    1. Policy evaluation
        - Model-based policy evaluation
    2. Policy optimization
        - Policy iteration
    
3. [Model-free methods](#model_free_methods)
    1. Policy evaluation (L10)
        - Monte Carlo policy evaluation
        - TD-learning
    2. Policy optimization (L11)
        - Monte Carlo control
        - SARSA
        - Q-learning
        
We will start with the model-based methods, which require knowledge of the MDP's transition dynamics. We will cover both policy evaluation (finding $V_\pi$ from $\pi$) and policy optimization (finding $\pi^*$). Then, we will move to the model-free methods, which can perform both of these tasks by learning from the experience of agents interacting with the environment, thereby sidestepping the need to know the transition dynamics.

Before starting, make sure you have the `ipywidgets` Python library. To install it, run the command:

`conda install -c conda-forge ipywidgets`

on your terminal, with the `dml` environment activated.

<a id='describing_the_environments'></a>
## 1. Describing the environments

The environments used for the interactive visualizations that follow will all be grid worlds, where the state is the $(x,y)$ position of the agent. Run the cell below to generate an illustration of one of them with a 3x3 state-space:

In [None]:
from lib.grid_world import grid_world_3x3 as env
from lib.plot_utils import plot_all_states_with_indices
plot_all_states_with_indices(env)

The state-space $\mathcal S$ for these grid-world environments is the set of all allowed $(x, y)$ positions. For instance, in the above example we have that 

$$\mathcal S =\{(0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2)\}~.$$ 

In every state there are four actions available, corresponding to the intent of moving up, right, down, or left: $\mathcal A=\{↑, →, ↓, ←\}$. The reward for all transitions is -1, regardless of the action taken or the state the agent was in, i.e., $\mathcal R_{ss'}^a=-1$, and we use a discount factor $\gamma= 0.9$. Every environment has one terminal state (illustrated in gray), and the episode ends when the agent reaches that state.
The dynamics for the environments can be of two forms: deterministic or stochastic. 

### 1.A Deterministic transitions

The deterministic grid-worlds all work in the following way: the agent always moves one step in the desired direction if the state it would end up in is in $\mathcal S$ (e.g., taking action → at state $(1,1)$ above takes the agent to state $(2,1)$ ). If the agent tries moving into a state not in $\mathcal S$ it simply stays in the same place (e.g. taking action → at state $(2,1)$ results in the agent not moving).

More formally, given the current state of the agent $s=(x, y)$, taking action $a$ moves the agent to state $s'=T(s,a)$, where $T$ is defined as:

$$T(s, a) = \left\{\begin{matrix}
f(s, a),  & \text{if }f(s, a)\in\mathcal S
\\ 
s, & \text{otherwise}
\end{matrix}\right.$$

$$f\big((x, y), a\big) = \left\{\begin{matrix}
(x, y+1),  & \text{if }a=↑
\\ 
(x+1, y),  & \text{if }a=→
\\
(x, y-1),  & \text{if }a=↓
\\
(x-1, y),  & \text{if }a=←~.
\end{matrix}\right. $$
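The two definitions above can be sketched in a few lines of Python. This is only an illustration: the names `STATES`, `MOVES`, `f` and `T` are hypothetical and are not taken from `lib/grid_world.py`, and the terminal state is not modeled here.

```python
# Hypothetical sketch of the deterministic transition function T(s, a).
# None of these names come from lib/grid_world.py.
STATES = {(x, y) for x in range(3) for y in range(3)}  # 3x3 state space
MOVES = {"up": (0, 1), "right": (1, 0), "down": (0, -1), "left": (-1, 0)}

def f(s, a):
    """Candidate next state: one step from s in the direction of action a."""
    dx, dy = MOVES[a]
    return (s[0] + dx, s[1] + dy)

def T(s, a):
    """Move to f(s, a) if it is an allowed state, otherwise stay in s."""
    candidate = f(s, a)
    return candidate if candidate in STATES else s

print(T((1, 1), "right"))  # (2, 1)
print(T((2, 1), "right"))  # (2, 1): the move would leave the grid, so the agent stays
```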

To familiarize yourself with this type of environment, we provide you with an opportunity to control an agent in a 3x3 gridworld. Run the cell below to start an episode, then use the buttons to decide which action the agent should take at each time-step. The terminal state for this MDP is shown in gray, and when the agent reaches it the episode ends, the return for time-step 1 is computed, and the environment is reset.

In [None]:
from lib.playground_wrapper import playground_wrapper
from lib.grid_world import grid_world_3x3 as env
playground_wrapper(env)

Having interacted with the environment, consider now the following questions:
- What is the optimal policy for this environment?
- What is the optimal value for the starting state, i.e. $V^*((0,0))$?
- Are there stochastic policies which are optimal in this environment?

### 1.B Stochastic transitions

The stochastic grid-worlds work similarly to the deterministic ones, with the difference that when the agent chooses an intended direction of movement there is a 70% chance that it will move in that direction, and 10% chance of moving in each of the other 3 directions. For example, if the agent is in state $(1, 1)$ in the above grid-world, and takes action ←, it has 70% chance of moving to state $(0, 1)$, 10% chance of moving to state $(1,2)$, 10% chance of moving to state $(2, 1)$, and 10% chance of moving to state $(1,0)$.

The same rule about moving only to states within $\mathcal S$ is applied. For instance, if the agent is in state $(0,0)$ and takes action ←, it has 80% chance of staying in the same place (70% from going left, 10% from going down), 10% of moving to state $(0, 1)$, and 10% of moving to state $(1, 0)$.
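To check the numbers in the two examples above, the next-state distribution can be sketched as follows. The function `next_state_distribution` and the constants around it are hypothetical names chosen for this illustration; the actual environment code in `lib/grid_world.py` may be organized differently.

```python
from collections import defaultdict

# Hypothetical sketch; names are illustrative, not taken from lib/grid_world.py.
STATES = {(x, y) for x in range(3) for y in range(3)}
MOVES = {"up": (0, 1), "right": (1, 0), "down": (0, -1), "left": (-1, 0)}

def next_state_distribution(s, a, p_intended=0.7):
    """Return {next_state: probability} when action a is intended in state s."""
    p_other = (1.0 - p_intended) / (len(MOVES) - 1)  # 10% for each other direction
    dist = defaultdict(float)
    for direction, (dx, dy) in MOVES.items():
        candidate = (s[0] + dx, s[1] + dy)
        landing = candidate if candidate in STATES else s  # blocked moves stay put
        dist[landing] += p_intended if direction == a else p_other
    return dict(dist)

corner = next_state_distribution((0, 0), "left")
print({state: round(p, 2) for state, p in corner.items()})
```

For the corner state, the probability mass of the two blocked moves (left and down) accumulates on $(0,0)$, which gives the 80% mentioned above.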

Run the cell below to control an agent in a 3x3 gridworld with stochastic transitions defined as above. Note how the intended and actual directions of movement sometimes do not match.

In [None]:
from lib.grid_world import grid_world_3x3_stoch as env_stoch
playground_wrapper(env_stoch)

- Does this MDP have the same optimal policy as the previous one with deterministic dynamics?

### 1.C Randomly-permuted actions

In reality, RL agents usually interact with the environment without having any prior knowledge about the transition dynamics. To help you experience for yourself how it feels to solve a task in such conditions, the cell below gives you the opportunity to interact with a deterministic environment where actions do different things depending on which state you are in. Every time you run the cell a new environment with new rules is created. Can you find the optimal policy?

**Note**: When an episode is finished, the environment is reset and the agent goes back to the starting point, but the rules are *not* changed. The rules change only when you re-run the cell, so you can try out the same environment for several episodes before changing it.

In [None]:
from lib.grid_world import create_3x3_env_randomly_permuted_actions
playground_wrapper(create_3x3_env_randomly_permuted_actions(stochastic=False))

Points for reflection:
- How long does it take you to solve one of these environments? That is, to find the optimal policy at all states?
- The added challenge here is that we have to learn about what the actions do in each of the different states. However, there is one simplifying factor, which is that the environment is deterministic. Change `stochastic=False` in the cell above to `stochastic=True` and try solving the stochastic version of the MDP. Why is it more challenging?

Hopefully this interaction helped you understand how using grid-world environments as defined in Sections 1.A and 1.B may give the impression that the RL task is easier than it actually is. We humans start solving new tasks with a lot of accumulated knowledge about how the world works (sometimes referred to as our inductive biases), which helps us quickly learn how to solve new tasks in familiar environments. For instance, it is very easy for us to understand how these types of environments function, since we have an intuitive understanding of what "moving up" or "moving left" means in a 2D grid, and we assume that this is the way the grid-world MDP will also work. This makes it much easier for us to, say, optimally guide an agent towards the terminal state.

Because of this, it is often the case that we underestimate the challenge that an RL agent without any inductive biases faces, making us fail to appreciate the power and ingenuity of the RL algorithms mentioned in the lectures (e.g., "This task is so simple, why can't the agent simply learn to move towards the gray square?"). For pedagogical reasons, from this point forward we will only use the environments as defined in 1.A and 1.B for the visualizations, but do not forget the insights you've gained from this section when analyzing an agent's behavior in a new environment.

#### Note about code

The code for creating these MDPs is available for you to read and change. You can for instance look in the file `lib/grid_world.py` to see how the grid-worlds are created, and even create your own grid-world environment to run the algorithms on! If you want to change which environment is used by a certain algorithm, you can modify its corresponding `lib/<algorithm_name>_wrapper.py` file.

The code for the algorithms you will see below (e.g. Policy evaluation, Monte Carlo control) is also available for you to read (in its corresponding `lib/<algorithm_name>.py` file), but we recommend that you focus on the interactive visualizations in this notebook instead. This is because the code not only implements the algorithms, but also the necessary logic for handling the button clicks, graphics, etc, and can be unnecessarily complicated for someone trying to understand the RL logic.

<a id='model_based_methods'></a>
## 2. Model-based methods

As mentioned in the introduction, we will start with methods that you can use when you have access to the dynamics of the MDP, so-called *model-based methods*. Although the true dynamics of the environment will not be available in most real-world RL problems, sometimes approximate models are available, and using those can yield reasonable results. Furthermore, learning about these methods helps us build a strong foundation for RL, serving as a stepping stone towards learning more complicated (and generally applicable) algorithms.

<a id='model_based_prediction'></a>
### 2.A Policy evaluation

*Policy evaluation* (also known as the *prediction* task) is the task of evaluating how well a certain policy performs in an environment. That is, given a policy $\pi$, we should compute its value function $V_\pi(s)$ for all states $s\in\mathcal S$. This is not only an important task in and of itself, but will also be the main building block later, when we are trying to find the *optimal* policy.

<a id='model_based_policy_evaluation'></a>
#### Model-based policy evaluation
If you do have access to the MDP dynamics, then you can apply an algorithm henceforth referred to as *model-based policy evaluation*, which works by updating the value of every state $s \in \mathcal S$ according to Bellman's expectation equation:

$$ V(s) \leftarrow \sum_{a\in\mathcal A} \pi(a|s) \sum_{s'\in\mathcal S} \mathcal P_{ss'}^a \Big[\mathcal R_{ss'}^a+\gamma V(s')\Big]$$

where $\mathcal P_{ss'}^a$ is the probability of moving to state $s'$ when in state $s$ and taking action $a$, according to the dynamics of the environment, and $V(x)$ is the current estimate for the value of a state $x$. Repeatedly applying this update to all states in $\mathcal S$ makes $V$ an increasingly accurate approximation of $V_\pi$.
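As a rough illustration of this update, the sketch below evaluates a uniformly random policy on a 3x3 grid world. To keep it short it assumes deterministic transitions, a terminal state at $(2,2)$ whose value is fixed at 0, and in-place sweeps. All names (`transitions`, `evaluate_policy`, ...) are hypothetical and unrelated to the code in `lib/policy_evaluation.py`.

```python
# Hypothetical sketch of model-based policy evaluation (not the lab's implementation).
GAMMA = 0.9
STATES = [(x, y) for x in range(3) for y in range(3)]
TERMINAL = (2, 2)
MOVES = {"up": (0, 1), "right": (1, 0), "down": (0, -1), "left": (-1, 0)}

def transitions(s, a):
    """List of (probability, next_state, reward); deterministic dynamics assumed here."""
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return [(1.0, nxt if nxt in STATES else s, -1.0)]

def evaluate_policy(policy, n_sweeps=200):
    """Repeated in-place sweeps of the Bellman expectation update."""
    V = {s: 0.0 for s in STATES}
    for _ in range(n_sweeps):
        for s in STATES:
            if s == TERMINAL:
                continue  # the terminal state keeps value 0
            V[s] = sum(
                policy[s][a] * sum(p * (r + GAMMA * V[s2]) for p, s2, r in transitions(s, a))
                for a in MOVES
            )
    return V

uniform = {s: {a: 0.25 for a in MOVES} for s in STATES}
V = evaluate_policy(uniform)
print({s: round(v, 2) for s, v in V.items()})
```

Replacing `transitions` with a function that returns several `(probability, next_state, reward)` triples per action would cover the stochastic case without changing `evaluate_policy`.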

#### Interactive visualization

Below we provide you with an interactive visualization of the model-based policy evaluation algorithm being applied to a 3x3 grid-world with stochastic transitions, where the state $(2,2)$ is terminal (top-right corner). Note that in this version of the algorithm we are using the most recently computed values of $V(s')$ when updating $V(s)$. Running the cell below initializes the interactive visualization and shows you two buttons: "Next step" and "Finish iteration". Clicking on "Next step" starts the first iteration of policy evaluation. Once you click there, you will see three figures appear, and some information below them, such as which iteration we are in, and what state is being processed. 

The figure to the left illustrates the probability mass function for the next state, given one of the actions (shown as an arrow in the figure). The figure in the middle illustrates our current value table, color-coded by value. This will be updated as we run the algorithm, and will eventually converge to the value function of the policy being evaluated. The figure to the right illustrates the policy being evaluated.

From this point forward, when you click on "Next step", the visualization will cycle through all the states in the MDP, and for each state update its value according to the equation above. To facilitate the illustration, it first computes $\sum_{s'\in\mathcal S} \mathcal P_{ss'}^a \Big[\mathcal R_{ss'}^a+\gamma V(s')\Big]$ for all actions (note that this is simply the $Q(s, a)$), and then updates the value as a weighted average of those using the policy:
$$ V(s) \leftarrow \sum_{a\in\mathcal A} \pi(a|s) Q(s,a)$$

(check that you understand how this is equivalent to the equation shown before). 

Once all states have been updated, the algorithm moves to the next iteration, and recomputes the values of all states again. If you would like to speed up the visualization and skip the computation of the new value for some of the states, you can click on "Finish iteration", which runs the policy evaluation algorithm until the end of the iteration. Clicking this button multiple times quickly shows you how the value table evolves throughout multiple iterations of the algorithm, converging to $V_\pi$. You can restart the visualization at any point by re-running the cell below.

In [None]:
from lib.policy_evaluation_wrapper import policy_evaluation_wrapper
policy_evaluation_wrapper()

Points for reflection:
- After a few iterations (say 30), the value table is already close enough to $V_\pi$. Do the values of all states make intuitive sense to you? Remember that we have stochastic transitions in this MDP, so even if the agent intends to move in a certain direction, it might move in another one.
- Would this algorithm work if we had a different initial value table? For example, if $V(s) = 10$ for all $s$? Would it work no matter our first estimate for $V$?
- If you want to dig deeper, note that you can easily change which policy is being evaluated and the initial value table by inspecting the code at `lib/policy_evaluation_wrapper.py`.

### 2.B Policy optimization

The *policy optimization* task (also known as the *control* task) is about finding the *best* way to behave in a certain environment. That is, to find the optimal policy $\pi^*$ and/or its corresponding optimal value function $V_{\pi^*}$.

#### Policy iteration
One of the main ways of doing that is with the idea of policy iteration.  You can think of policy iteration as the general concept of interleaving steps of policy evaluation with steps of policy improvement. As long as you do both of them often enough you converge to the optimal policy.

In the model-based setting, we can use the algorithm described in Section 2.A to evaluate the performance of a policy. Given the value function of the current policy, we can improve the current policy by acting greedily with respect to this newly computed value function. That is, we construct a greedy policy $\pi_\text{greedy}$ that always chooses the action with the highest action-value, according to our estimate of $V$:

$$ 
\pi_\text{greedy}(a|s) = 
\begin{cases}
1 & \text{if } a=\arg\max_{a}Q(s,a)
\\
0 & \text{otherwise.}
\end{cases} 
$$

Since we have access to the environment's dynamics, the Q-values are easy to compute:
$$ Q(s,a) = \sum_{s'\in\mathcal S} \mathcal P_{ss'}^a \Big[\mathcal R_{ss'}^a+\gamma V(s')\Big]~. $$


If $V$ is correct, this will always yield a new policy which is strictly better than the previous one, unless we have converged to the optimal policy. Note that if multiple actions maximize $Q$, either a single one is chosen for the equation above, or the probability 1 has to be divided among all of them.
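The interplay of evaluation and greedy improvement can be sketched as below. This is a simplified illustration under stated assumptions: deterministic transitions (so the sum over $s'$ collapses to a single term), a deterministic policy stored as one action per state, ties broken by the order of `MOVES`, and a terminal state at $(2,2)$. The names are hypothetical and do not mirror `lib/policy_iteration.py`.

```python
# Hypothetical sketch of policy iteration on a deterministic 3x3 grid world.
GAMMA = 0.9
STATES = [(x, y) for x in range(3) for y in range(3)]
TERMINAL = (2, 2)
MOVES = {"up": (0, 1), "right": (1, 0), "down": (0, -1), "left": (-1, 0)}

def step(s, a):
    """Deterministic transition: move if possible, otherwise stay."""
    nxt = (s[0] + MOVES[a][0], s[1] + MOVES[a][1])
    return nxt if nxt in STATES else s

def q_value(V, s, a):
    """Q(s, a) = R + gamma * V(s') with reward -1 for every transition."""
    return -1.0 + GAMMA * V[step(s, a)]

def policy_iteration(n_eval_sweeps=50, max_rounds=100):
    policy = {s: "left" for s in STATES}  # deliberately poor initial policy
    V = {s: 0.0 for s in STATES}
    for _ in range(max_rounds):
        # Policy evaluation: a fixed number of in-place sweeps (approximate).
        for _ in range(n_eval_sweeps):
            for s in STATES:
                if s != TERMINAL:
                    V[s] = q_value(V, s, policy[s])
        # Policy improvement: act greedily w.r.t. the current value table.
        new_policy = {s: max(MOVES, key=lambda a: q_value(V, s, a)) for s in STATES}
        if new_policy == policy:
            break  # the greedy policy no longer changes
        policy = new_policy
    return policy, V

policy, V = policy_iteration()
print(policy[(0, 0)], round(V[(0, 0)], 3))
```

Note that the evaluation step here is truncated to a fixed number of sweeps, which is one of the design choices you can experiment with in the visualization.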

#### Interactive visualization

Below we provide you with an interactive visualization of the policy iteration algorithm being applied to the MDP defined in Section 1. Running the cell below starts the interactive visualization and shows you two figures. The figure to the left illustrates the current value table, color-coded as before. The figure to the right illustrates the current policy we have.

Clicking on "Policy eval iteration" performs a full policy evaluation iteration (see Section 2.A for details), evaluating the performance of whichever policy is shown on the figure to the right. Clicking on "Policy improvement" updates the policy to be greedy w.r.t. the current value table. Click on these buttons in any sequence you would like, and see if the policy/value table converges to the optimal values. You can restart the visualization by re-running the cell below.

In [None]:
from lib.policy_iteration_wrapper import policy_iteration_wrapper
policy_iteration_wrapper()

Points for reflection:
- What happens to the value table if you repeatedly click only on "Policy eval iteration"?
- What happens to the policy table if you repeatedly click only on "Policy improvement"?
- Is it better to perform a full policy evaluation and then improve, or just a few steps?
- Can you learn the optimal value of all states like this?
- Is there some combination of initial value table and policy for which the policy would not converge to $\pi^*$?
- If you want to dig deeper, note that you can easily change the initial policy and value table by inspecting the code at `lib/policy_iteration_wrapper.py`.

<a id='model_free_methods'></a>
## 3. Model-free methods

In this section we discuss reinforcement learning methods that do not require access to the dynamics of the MDP. This makes them much more useful for real-world applications, where such information is often either impossible or too costly to obtain.

Instead of relying on knowledge about the transition dynamics, these algorithms will actually interact with the environment as agents, and learn from experience gathered during this interaction. To make the examples easier to follow, from now on we will use grid worlds with deterministic transitions. Note that the algorithms are exactly the same in the stochastic case, so you can run them on stochastic environments if you want by changing the corresponding `<algorithm>_wrapper.py` file for each algorithm.

### 3.A Policy evaluation

Just like in the model-based setting, the policy evaluation task is to compute the value function of a given policy $\pi$. However, this time the computation relies on data gathered by an agent interacting with the environment, instead of on the transition dynamics of the MDP (as in the model-based policy evaluation algorithm).

#### Monte Carlo Policy Evaluation

The first algorithm that computes the value function of a policy $\pi$ from experience is Monte Carlo Policy Evaluation. The idea is simple: in order to estimate the value function $V_\pi$ at a certain state $s$, you compute an average of all the returns obtained at $s$ throughout your experience. Since the value of a state is the expected return, this will converge to the correct value of $V_\pi(s)$ as we collect more episodes that pass through the state $s$.

In the version implemented here, we are computing *Every-visit* Monte Carlo, so that if you visit a state multiple times in an episode, the returns from all your visits will be sequentially used to update the average. Furthermore, instead of computing an actual average of all returns, we compute an exponential moving average. That is, when using a new return $G(s)$ to update our estimate of $V(s)$, we use the following update rule:

$$ V(s) \leftarrow V(s) + \alpha \big(G(s)-V(s)\big)~, $$

which shifts $V(s)$ towards $G(s)$. The learning rate $\alpha$ controls how quickly the exponential moving average updates $V(s)$. Note that $V(s)$ denotes our estimate of $V_\pi(s)$. 

#### Interactive visualization

Below we provide you with an interactive visualization of the Monte Carlo Policy Evaluation algorithm ($\alpha=0.1$). Running the cell initializes the visualization, and clicking on "Next step" will show you three figures. The figure to the left illustrates the environment and the agent, along with its policy at the current state. As the algorithm proceeds to interact with the environment the agent will change position. The figure in the middle illustrates the current value table, and to the right the policy which is being evaluated. 

Repeatedly clicking on "Next step" will show you the agent sampling actions and interacting with the environment. After every step a transition of the form $<s, a, r, s'>$ is saved, so that at the end of the episode the returns for each state can be computed. Once the episode ends at time-step $T$, the returns are computed recursively, starting from the last state visited all the way back to the initial state, using:

$$ G_T = R_T  $$
$$G_t = R_t + \gamma G_{t+1}\quad \text{for }t=T-1, \cdots, 1$$

After the return for all visited states is computed, the value table is updated with the exponential moving average mentioned before. Clicking on "Finish episode" runs the algorithm until the end of the episode, computing returns and updating the value function.
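The return recursion and the exponential moving average update can be sketched as follows. The helper names `compute_returns` and `mc_update` are hypothetical, and the episode is assumed to be stored as two parallel lists (visited states and the rewards received after leaving them); the lab's own code may store transitions differently.

```python
GAMMA, ALPHA = 0.9, 0.1  # discount factor and learning rate used in this section

def compute_returns(rewards):
    """Backward recursion: G_T = R_T and G_t = R_t + gamma * G_{t+1}."""
    returns = [0.0] * len(rewards)
    G = 0.0
    for t in reversed(range(len(rewards))):
        G = rewards[t] + GAMMA * G
        returns[t] = G
    return returns

def mc_update(V, states, rewards):
    """Every-visit update: each visit shifts V[s] towards the return from that visit."""
    for s, G in zip(states, compute_returns(rewards)):
        V[s] += ALPHA * (G - V[s])
    return V

# One episode of four steps, each with reward -1.
states = [(0, 0), (1, 0), (2, 0), (2, 1)]
rewards = [-1.0, -1.0, -1.0, -1.0]
print([round(g, 3) for g in compute_returns(rewards)])  # [-3.439, -2.71, -1.9, -1.0]

V = mc_update({s: 0.0 for s in states}, states, rewards)
print(round(V[(0, 0)], 4))  # -0.3439
```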

In [None]:
from lib.monte_carlo_evaluation_wrapper import monte_carlo_evaluation_wrapper
monte_carlo_evaluation_wrapper()

Points for reflection:
- Does it take more episodes to learn the values of certain states? Why?
- Can you learn $V_\pi$ for all states using this algorithm? If not, why?
- If you run an infinite number of episodes with this algorithm, will it converge to the correct $V_\pi$, or are there additional requirements for convergence (e.g. certain conditions on the learning rate)?
- What if the underlying environment does not obey the Markov assumption, will MC policy evaluation still converge to the correct values?

#### Temporal Difference learning

Although it is great that we have an algorithm that can learn the value function of a policy without any knowledge of how the MDP works, you probably noticed that MC policy evaluation is relatively slow. The value table is only updated at the end of every episode, and updates for states visited long before the end of the episode may have a lot of variance (since many rewards are used to compute them). 

An alternative to MC policy evaluation, which often yields better performance, is temporal difference (TD) learning.  In TD learning, you update the value function after every step taken in the environment, using the following equation

$$ V(s) \leftarrow V(s) + \alpha \big(r + \gamma V(s')-V(s)\big) $$

where $s$ is the state that the transition started in, $r$ is the sampled reward, and $s'$ is the state the transition ended in. Note that MC learning updates the value $V(s)$ towards the return $G$, whereas TD learning updates $V(s)$ towards $r+\gamma V(s')$. We refer to $G$ and $r+\gamma V(s')$ as the targets in the respective algorithms, and note that both are estimates of the value $V(s)$; $G$ is unbiased whereas $r+\gamma V(s')$ often has lower variance. 

Note how even if we are updating a state from long before the end of the episode, if we assume that $V(s')$ is correct, the update only contains a single random variable, $r$ (we are exchanging variance for bias, because $V(s')$ might be wrong).
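A single TD update can be sketched as below. The function name `td_update` and the dictionary-based value table are assumptions made for this illustration. The example assumes the value of the terminal state is never updated and stays at its initial value of 0.

```python
GAMMA, ALPHA = 0.9, 0.2  # discount factor and the learning rate used for TD in this lab

def td_update(V, s, r, s_next):
    """Apply one TD update to V[s] from a single transition (s, r, s_next)."""
    target = r + GAMMA * V[s_next]   # TD target: bootstraps on the current estimate
    V[s] += ALPHA * (target - V[s])  # shift V[s] a fraction alpha towards the target
    return V

# Two visits to the same transition from (2, 1) into the terminal state (2, 2).
V = {(2, 1): 0.0, (2, 2): 0.0}
td_update(V, (2, 1), -1.0, (2, 2))
print(round(V[(2, 1)], 3))  # -0.2
td_update(V, (2, 1), -1.0, (2, 2))
print(round(V[(2, 1)], 3))  # -0.36
```

Unlike the Monte Carlo update, nothing here waits for the end of the episode: each transition is used as soon as it is observed.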

#### Interactive visualization

Below we provide you with an interactive visualization of the TD learning algorithm in practice, with $\alpha=0.2$. Running the cell initializes it, and clicking on "Next step" shows you three figures. Just like in the interactive visualization of Monte Carlo policy evaluation, the three figures represent: (1) the agent and the environment, (2) the value table being learned, and (3) the policy which is being evaluated.

Repeatedly clicking on "Next step" will show the agent sampling actions and interacting with the environment. However, the value table is now updated after every single transition, not only at the end of the episode. Clicking "Finish episode" runs the algorithm until the end of the episode, updating the value table during the computations.

In [None]:
from lib.td_evaluation_wrapper import td_evaluation_wrapper
td_evaluation_wrapper()

Points for reflection:
- Can you learn $V_\pi$ for all states using this algorithm? If not, why?
- What if the underlying environment does not obey the Markov assumption, will TD learning still converge to the correct values?


### 3.B Policy optimization

In this section we deal again with the problem of finding an optimal policy, but this time in a context where we do not have access to the environment's transition dynamics. Just like in Section 3.A, these algorithms will learn from experience gathered by an agent interacting with the environment. Because of this, these algorithms can get stuck on suboptimal policies if we do not explore the environment well enough when learning.

#### Monte Carlo Control

We can apply the idea of policy iteration (policy evaluation + policy improvement) in the model-free context as well. For the policy evaluation step, we can use Monte Carlo Policy Evaluation ($\alpha=0.1$), with a small change. Instead of learning the value function of the current policy, we learn the Q-value function, so that we can perform the policy improvement step (acting greedily w.r.t. the value function) without requiring access to the environment's dynamics.

Furthermore, if we perform the greedy update to improve our policy, we generally do not explore the environment well enough. Instead, we act $\epsilon$-greedy w.r.t. the value function. That is, given our current estimate for the action-value function $Q$, the policy used to interact with the environment will behave randomly with probability $\epsilon$, and greedily with probability $1-\epsilon$, resulting in the following form:

$$ \pi(a|s) = \begin{cases}
\epsilon/n + 1-\epsilon & \text{if } a=\arg\max_{a} Q(s, a)
\\
\epsilon/n & \text{otherwise,}
\end{cases} $$

where $n=|\mathcal A|=4$ in our grid world MDPs. If we use this policy to interact with the environment, and decrease $\epsilon$ slowly enough during training (such that we satisfy the GLIE conditions), enough exploration will be done to correctly find the optimal policy.
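The $\epsilon$-greedy probabilities above can be sketched as follows. The function names are hypothetical, the Q-values of one state are assumed to be stored in a dictionary keyed by action, and ties are resolved in favor of the first maximizing action.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Action probabilities of the epsilon-greedy policy for one state."""
    n = len(q_values)
    best = max(q_values, key=q_values.get)  # first maximizer wins ties
    return {a: epsilon / n + (1.0 - epsilon if a == best else 0.0) for a in q_values}

def sample_action(q_values, epsilon, rng=random):
    """Sample one action according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    actions = list(probs)
    return rng.choices(actions, weights=[probs[a] for a in actions])[0]

q = {"up": -1.0, "right": -3.0, "down": -2.5, "left": -2.0}
print({a: round(p, 3) for a, p in epsilon_greedy_probs(q, 0.2).items()})
# {'up': 0.85, 'right': 0.05, 'down': 0.05, 'left': 0.05}
```

With $\epsilon=1$ every action gets probability $1/n$, and with $\epsilon=0$ all the probability mass goes to the greedy action.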

#### Interactive visualization

Below we provide you with an interactive visualization of the Monte Carlo Control algorithm. In order to make the example easier to follow, we will use a 2x2 deterministic grid-world for this example (the state $(1,1)$ is terminal). Similarly to before, run the cell below and press "Next step" to start the first iteration. The three figures that will be shown represent: (1) the agent and the environment, (2) the Q-value table (each state is now divided in 4, to illustrate the Q-values for the 4 available actions), and (3) the current policy the agent is using to explore the environment ($\epsilon$-greedy w.r.t. the Q-values). Note that now that policy will change during training, since whenever we update the Q-value function the $\epsilon$-greedy policy may change. 

Underneath the two buttons you will see a numeric field where you can choose the value of $\epsilon$. A value of 1 means that the $\epsilon$-greedy policy will behave completely randomly (all actions are chosen with the same probability), whereas a value of 0 transforms the $\epsilon$-greedy policy into a greedy policy w.r.t. the current value function. As you sample episodes you will have to decrease $\epsilon$ slowly from 1 to 0 to converge to the optimal policy. Using $\epsilon=0$ is not recommended for this visualization, since it could lead to infinite-length episodes, causing the notebook to hang (if this happens, just click on the stop button in the toolbar above). Lastly, remember that if you want to restart the visualization just re-run the cell below.

In [None]:
from lib.mc_control_wrapper import mc_control_wrapper
mc_control_wrapper()

Points for reflection:
- When does the policy change? Is it fixed throughout an entire episode?
- If you maintain $\epsilon$ fixed at 1, and keep sampling episodes, you will explore the environment really well, but what happens to the Q-value table? Can it converge to $Q^*$ if you never decrease $\epsilon$ to 0? If not, what does it converge to instead?
- What happens to the policy if you decrease $\epsilon$ too fast? Why?
- How slowly should $\epsilon$ be decreased to *guarantee* that we converge to $\pi^*$?


#### SARSA

When adapting the policy iteration idea to the model-free setting, we can also use TD learning instead of Monte Carlo for the policy evaluation step (with the same change as before: learning action-value functions and using an epsilon-greedy policy for exploration). Doing so yields an algorithm for policy optimization that updates the Q-value table after every step in the environment, and which usually converges faster.

In order to use TD learning for learning Q-values, we need slightly different transition tuples. After every step in the environment the agent now obtains a transition tuple $<s, a, r, s', a'>$, which contains the state it was previously in ($s$), the action taken ($a$), the reward obtained ($r$), the next state it moved to ($s'$), **and** the action sampled from the policy at state $s'$ ($a'$). This is necessary because the update rule for learning Q-values will depend on the Q-value of the next state, evaluated at the next action taken:

$$ Q(s, a) \leftarrow Q(s,a) + \alpha \Big(\text{TD-target} - Q(s,a)\Big) $$
$$ \text{TD-target} = r + \gamma Q(s', a')~. $$ 

This algorithm is named SARSA, because of the letters used to denote the transition tuples ($<s, a, r, s', a'>$). 
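A single SARSA update for a non-terminal transition can be sketched as below. The function name `sarsa_update` and the Q-table layout (a dictionary keyed by `(state, action)` pairs, defaulting to 0) are assumptions made for this illustration. Transitions that end in the terminal state are deliberately left out of the example.

```python
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.2  # discount factor and the learning rate used for SARSA in this lab

def sarsa_update(Q, s, a, r, s_next, a_next):
    """Move Q[(s, a)] towards the SARSA TD-target r + gamma * Q[(s', a')]."""
    td_target = r + GAMMA * Q[(s_next, a_next)]
    Q[(s, a)] += ALPHA * (td_target - Q[(s, a)])
    return Q

Q = defaultdict(float)        # hypothetical Q-table, all entries start at 0
Q[((1, 0), "up")] = -2.0      # pretend this entry has already been learned
sarsa_update(Q, (0, 0), "right", -1.0, (1, 0), "up")
print(round(Q[((0, 0), "right")], 3))  # -0.56
```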

#### Interactive visualization

Below we provide you with an interactive visualization of the SARSA algorithm ($\alpha=0.2$). The controls are the same as for Monte Carlo Control, the only difference is that now the Q-values are updated after every transition in the environment (once the next action $a'$ is sampled), and so is the $\epsilon$-greedy policy. Just like before, you will have to decrease $\epsilon$ slowly enough from 1 to 0 to converge to the optimal policy and Q-values.

In [None]:
from lib.sarsa_wrapper import sarsa_wrapper
sarsa_wrapper()

Points for reflection:
- How do we deal with a transition that ends in a terminal state? Why?

#### Q-learning

In SARSA (and MC control), we are always learning about the Q-values of the exploratory policy, i.e. the policy that is $\epsilon$-greedy w.r.t. the Q-value function. This requires us to decrease $\epsilon$ during learning in a very specific way, so that we satisfy the GLIE requirement on the sequence of policies used. In other words, we have to decay $\epsilon$ so that we eventually converge to the deterministic optimal policy, but not so fast that we fail to explore the environment well enough.

By using off-policy learning, we are able to decouple the exploratory policy from the policy we are learning about. We can use one policy to interact with the environment in a way that explores the environment very well, while at the same time learning about the policy which is greedy w.r.t. the current Q-values (sometimes called the exploitative policy). This decoupling between the two policies frees us from having to satisfy the GLIE requirements on the sequence of policies used. As long as we explore the environment enough with the exploratory policy, we will converge to $\pi^*$.

If we use an $\epsilon$-greedy policy as the exploratory policy, and a greedy-policy as the exploitative policy, the algorithm we get is called Q-learning, and its update rule is given by:

$$ Q(s,a) \leftarrow Q(s,a) + \alpha \Big(\text{TD-target} - Q(s,a) \Big)$$
$$ \text{TD-target} = r + \gamma\max_{a'}Q(s', a') $$

Note that this equation is *very* similar to the equation for SARSA's update rule. We are computing the TD-target and updating the value of $Q(s,a)$ towards it. The important difference between SARSA and Q-learning is in how the TD-target is computed:

$$ \text{TD-target}_\text{SARSA} = r + \gamma~Q(s', a') $$
$$ \text{TD-target}_\text{Q-learning} = r + \gamma\max_{a'}Q(s', a') $$

In the first equation, the TD-target is computed using the Q-value of state $s'$, evaluated for an action $a'$ which was sampled from the **exploratory policy**. In the second equation, the Q-value is evaluated for an action $a'$ which was sampled from the **exploitative policy** (since that is the greedy policy, it maximizes the Q-value). This means that SARSA is learning about the Q-values of the **exploratory policy**, whereas Q-learning is learning about the Q-values of the **exploitative policy**. While SARSA has to carefully change its exploratory policy during learning to eventually converge to $\pi^*$, Q-learning can use any exploratory policy (that explores enough) and still learn $\pi^*$.
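The difference between the two targets can be made concrete with a small numerical sketch. The Q-values below are made up for illustration, and the function names and the nested-dictionary Q-table are assumptions of this example.

```python
GAMMA = 0.9

def sarsa_target(Q, r, s_next, a_next):
    """Uses the action a' actually sampled from the exploratory policy."""
    return r + GAMMA * Q[s_next][a_next]

def q_learning_target(Q, r, s_next):
    """Uses the action the greedy (exploitative) policy would take in s'."""
    return r + GAMMA * max(Q[s_next].values())

# Made-up Q-values for the next state s' = (1, 0).
Q = {(1, 0): {"up": -1.0, "right": -3.0, "down": -2.5, "left": -2.0}}

# Suppose the exploratory policy happened to sample the non-greedy action "left".
print(round(sarsa_target(Q, -1.0, (1, 0), "left"), 3))  # -2.8
print(round(q_learning_target(Q, -1.0, (1, 0)), 3))     # -1.9
```

The two targets coincide whenever the sampled action $a'$ happens to be the greedy one, which is always the case when $\epsilon=0$.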

#### Interactive visualization

Below we provide an interactive visualization of the Q-learning algorithm ($\alpha=0.2$). The visualization is very similar to the one for SARSA, but the Q-value table in the middle is now showing the learned Q-values for the **exploitative policy**, which is now shown in the policy table on the right (for SARSA the Q-values and the policy shown were for the exploratory policy).

In [None]:
from lib.q_learning_wrapper import q_learning_wrapper
q_learning_wrapper()

Points for reflection:
- What happens now if you keep sampling episodes without reducing $\epsilon$? Is Q-learning able to find $\pi^*$?
- If decreasing $\epsilon$ is not necessary anymore, is there any reason for doing it?

## Next steps

Congratulations, you have finished this computer lab! If you want to dig deeper and learn even more, take a look at how the environments are defined in `lib/grid_world.py` and create your own environment! You can change how the transition dynamics work, the state-space, the reward function, which states are terminal, etc. Once you have created a new environment, change the corresponding `<algorithm>_wrapper.py` file to import it and see the algorithms solving your own environment!