# **Neuro RLs** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/TomGeorge1234/NeuroRLTutorial/blob/main/NeuroRL.ipynb)
### **University of Amsterdam Neuro-AI Summer School, 2024**
#### made by: **Tom George (UCL) and Jesse Geerts (Imperial)**

In this tutorial we'll study and build reinforcement learning models inspired by the brain.

## **Contents** 
0. [Import dependencies and data](#dependencies)
1. [Rescorla-Wagner Model](#rescorla)
2. [Temporal Difference Learning](#td)
3. [Policy learning](#q)
    1. [Q-Values](#q-values)
    1. [Navigating in a grid world](#grid)
4. [Deep Q-Learning](#dqn)
    1. [Neuroscience inspired basis functions](#basis)

---
## **0. Import dependencies and data** <a name="dependencies"></a>
Run the following code: It'll install some dependencies, download some files and import some functions. You can mostly ignore it. 

In [None]:
#@title Click to see code {display-mode: "form" }
!pip install wget ratinabox 

import numpy as np 
import matplotlib.pyplot as plt 
from tqdm import tqdm 
import os
import wget 
from IPython.display import HTML
#if running on colab we need to download the data and utils files
if os.path.exists("NeuroRL_utils.py"):
    print("utils located")
    pass
else: 
    wget.download("https://github.com/TomGeorge1234/NeuroRLTutorial/raw/main/NeuroRL_utils.py")
    print("...utils downloaded!")

from NeuroRL_utils import *
%load_ext autoreload
%autoreload 2

---
## **1. Rescorla-Wagner** <a name="rescorla"></a>

Classical conditioning is where a neutral stimulus (which becomes the conditioned stimulus, CS) is paired with a response-producing stimulus (the unconditioned stimulus, US). After the association is learned, the neutral stimulus *alone* can produce the response.
 
The most famous example is Pavlov's dogs: Pavlov rang a bell before feeding his dogs which would cause them to salivate. After a while, the dogs would start salivating when they heard the bell, even if no food was presented.

In 1972 Rescorla and Wagner proposed a simple model to explain this learning process. The model is based on the idea that the _change_ in the strength of the association between the conditioned stimulus (CS) and the unconditioned stimulus (US) is proportional to the discrepancy between the expected and actual US.

### **1.1. Model (maths)**
Following on from the Pavlov's dogs example, suppose the bell 🔔 is the conditioned stimulus, $S$, and the food 🦴 is the unconditioned stimulus with a response (reward) of strength $R$. The bell is paired with the food allowing an association to be learned. We notate this as follows: 
$$ S \rightarrow R $$

Under the Rescorla-Wagner model, the goal is to learn the _value_ of the conditioned stimulus:

$$ V(S) = \mathbb{E}[R] $$ 

(Note: this slightly excessive notation will come in useful later.) We can "learn" this association by updating $\hat{V}(S)$ (our current _estimate_ of the value of the stimulus) using the following simple learning rule: 

$$ \hat{V} \leftarrow \hat{V} + \alpha \cdot \underbrace{(R - \hat{V})}_{\delta = \textrm{``error"}}$$

This is Rescorla-Wagner, i.e. the increment in the value of the stimulus is proportional to the discrepancy between the reward (the unconditioned response) that was received, $R$, and the reward that was predicted from the stimulus, $\hat{V}$. The proportionality constant $\alpha$ is the learning rate.

$\delta$ is the _prediction error_ and is the key concept in reinforcement learning. It is the discrepancy between what was expected and what was received. 
- Positive $\delta$ means the reward was better than expected, so the value of the stimulus should be increased.
- Negative $\delta$ means the reward was worse than expected, so the value of the stimulus should be decreased.

> 📝 **Exercise 1.1** 
> 1. Consider a simple example where there is only one stimulus with zero initial value. A constant reward, $R$ is given each trial. Show the value of the stimulus after the first trial is given by $V(1) = \alpha  \cdot R$.
> 2. Show that, for small $\alpha$, $\hat{V}(t) \approx R \cdot (1 - e^{-\alpha\cdot t})$  (_Hint: consider using the change of variables $\delta(t) = R - \hat{V}(t)$._)

### **1.2 Model implementation (python)**

Below we provide some basic code implementing a Rescorla-Wagner model. 

Suppose we initialise a Rescorla-Wagner model with a learning rate of 0.1 and an initial value estimate of 0 as 

```python
rescorlawagner = RescorlaWagner(0.1, 0)
```

Some initialisation logic and plotting functions are hidden away in the `BaseRescorlaWagner` class in `NeuroRL_utils.py`. What's important is the following: 

**Attributes**
- `rescorlawagner.alpha`: is the learning rate, $\alpha$
- `rescorlawagner.V`: the current value estimate of the stimulus, $\hat{V}(t)$
- `rescorlawagner.V_history`: a list of the value of the stimulus at each trial, [$\hat{V}(0)$, $\hat{V}(1)$, ...]
- `rescorlawagner.R_history`: a list of the reward received at each trial, [0, $R(1)$, $R(2)$, ...] (the reward at trial 0 is always set to zero)

**Methods**
- `rescorlawagner.learn(R)`: updates the value of the stimulus based on the reward received <span style="color:red"> _[TO DO: NOT YET DEFINED]_ </span>
- `rescorlawagner.plot()`: plots the value of the stimulus over time


> 📝 **Exercise 1.2:** 
> 
> Complete the `def learn(self, R):` function to implement the Rescorla-Wagner learning rule. 


In [None]:
class RescorlaWagner(BaseRescorlaWagner):
    def __init__(self, 
                 alpha=0.1, 
                 initial_V=0): 
        self.V = initial_V
        super().__init__(n_stimuli=1, alpha=alpha)

    def learn(self, R):
        raise NotImplementedError("You need to implement this method")
        # error = ???
        # self.V += ???
        # self.R_history.append(R) # include these lines to store the reward and value history
        # self.V_history.append(self.V) 

In [None]:
#@title Click to see solution {display-mode: "form" }
def learn(self, R):
    error = R - self.V
    self.V += self.alpha * error
    self.R_history.append(R)
    self.V_history.append(self.V) 
RescorlaWagner.learn = learn # set the learn method to the function we just defined.

Now let's run an experiment where a reward of 1 is given each trial. We'll plot the value of the stimulus over time using the pre-written `RescorlaWagner.plot()` function.

In [None]:
# Set your learning rate and reward
alpha = 0.1
R = 1

# Create the model
rescorlawagner = RescorlaWagner(alpha=alpha)

# Run the model
for trial in range(100):
    rescorlawagner.learn(R=R)

> 📝 **Exercise 1.3**
>
> Print the past value of the stimulus at each trial and check it approaches 1 (the reward value).

In [None]:
# Print the value history

In [None]:
#@title Click to see solution {display-mode: "form" }
print(rescorlawagner.V_history)

> 📝 **Exercise 1.4**
>
> 1. Plot the value history using the `ax = rescorlawagner.plot()` method. 
> 2. Plot the theoretical solution you derived earlier onto the `ax` and see if it fits.

In [None]:
# Your code here

In [None]:
#@title Click to see solution {display-mode: "form" }
# Plot the results
ax = rescorlawagner.plot()
plt.close()

# Plot the analytic solution
t_range = np.arange(100)
V = R * (1 - np.exp(-t_range*alpha))
ax.plot(t_range, V, label='Analytic solution', linewidth=0.5, color='k', linestyle='--')
ax.legend()
ax.figure
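
The exponential curve is the small-$\alpha$ (continuous-time) approximation. Iterating the discrete update from $\hat{V}(0) = 0$ with a constant reward gives exactly $\hat{V}(t) = R \cdot (1 - (1-\alpha)^t)$. The sketch below compares the two in plain NumPy; it does not use the `RescorlaWagner` class, and the variable names are illustrative.

```python
import numpy as np

alpha, R, n_trials = 0.1, 1.0, 100  # same values as the experiment above

# Iterate the Rescorla-Wagner update directly (no helper classes needed)
V = 0.0
V_sim = [V]
for _ in range(n_trials):
    V += alpha * (R - V)
    V_sim.append(V)
V_sim = np.array(V_sim)

t = np.arange(n_trials + 1)
V_exact = R * (1 - (1 - alpha) ** t)     # exact solution of the discrete recursion
V_approx = R * (1 - np.exp(-alpha * t))  # continuous-time approximation

print("max |simulated - exact| :", np.abs(V_sim - V_exact).max())
print("max |simulated - approx|:", np.abs(V_sim - V_approx).max())
```

The gap between the two curves shrinks as $\alpha$ gets smaller, since $(1-\alpha)^t \approx e^{-\alpha t}$ only when $\alpha \ll 1$.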

> 📝 **Exercise 1.5**
> 
> 1. **Acquisition:** Repeat the above experiment with a lower and a higher learning rate. What do you observe?
> 2. **Extinction:** Repeat the above but this time reward is given only for the first 50 trials, then the reward is set to zero. What do you observe?


In [None]:
# Your code for high learning rate acquisition goes here

In [None]:
#@title Click to see solution {display-mode: "form" }
# Set your learning rate and reward
alpha = 0.3
R = 1

# Create the model
RW = RescorlaWagner(alpha=alpha)

# Run the model
for trial in range(100):
    RW.learn(R=R)

# Plot the results
ax = RW.plot()
ax.set_title("Higher learning rate")

In [None]:
# Your code for low learning rate acquisition goes here

In [None]:
#@title Click to see solution {display-mode: "form" }
# Set your learning rate and reward
alpha = 0.05
R = 1

# Create the model
RW = RescorlaWagner(alpha=alpha)

# Run the model
for trial in range(100):
    RW.learn(R=R)

# Plot the results
ax = RW.plot()
ax.set_title("Lower learning rate")

In [None]:
# Your code for extinction goes here

In [None]:
#@title Click to see solution {display-mode: "form" }
# Set your learning rate and reward
alpha = 0.1

# Create the model
RW = RescorlaWagner(alpha=alpha)

# Run the model
for trial in range(50):
    RW.learn(R=1)
for trial in range(50): #remove the reward
    RW.learn(R=0)

# Plot the results
ax = RW.plot()
ax.set_title("Extinction experiment")

### **1.3. Rescorla-Wagner with multiple stimuli**

It's easy to extend the Rescorla-Wagner model to multiple stimuli:

* Stimuli are now represented by vectors. If there are two possible stimuli we can use a 2D vector e.g.:  
    * Stimulus A: $\mathbf{s} =[1, 0]$ 
    * Stimulus B: $\mathbf{s} =[0, 1]$
    * Stimulus A & B: $\mathbf{s} =[1, 1]$
    * Stimulus A weakly and B strongly: $\mathbf{s} =[0.1, 0.9]$
    * ...etc. 

* A vector of association "weights", $\mathbf{w}$, denotes the strength of the association between each stimulus and the unconditioned response (i.e. the value of each stimulus). 
    * $\mathbf{w} = [w_1, w_2]$.
* The total value of the stimuli is the sum of the values of each stimulus present on a given trial: 
$$ \hat{V}(\mathbf{s}) = \mathbf{s} \cdot \mathbf{w}  = s_1 \cdot w_1 + s_2 \cdot w_2$$

The full Rescorla-Wagner model is then:
$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \big(R - \hat{V}(\mathbf{s})\big) \cdot \mathbf{s}$$

> 📝 **Exercise 1.6**
>
> Reason why $\mathbf{s}$ now appears in the learning rule.

> 📝 **Exercise 1.7**
> 
> As before, complete the `def learn(self, S, R):` function to implement the Rescorla-Wagner learning rule for multiple stimuli.

In [None]:
class RescorlaWagner_multistim(BaseRescorlaWagner):
    def __init__(self, n_stimuli=2, alpha=0.1, initial_value=0): 
        self.W = np.zeros(n_stimuli)
        super().__init__(n_stimuli=n_stimuli, alpha=alpha)

    def learn(self, S, R):
        raise NotImplementedError("You need to implement this method")
        # V = ????  # calculate the value of the stimuli
        # error = # ???? # calculate the error
        # self.W += # ???? # update the weights

        # store the history
        # self.W_history.append(self.W.copy())
        # self.R_history.append(R)
        # self.V_history.append(S @ self.W)
        # self.S_history.append(S)


In [None]:
#@title Click to see solution {display-mode: "form" }
def learn(self, S, R):
    V = S @ self.W # calculate the value of the stimuli
    error = R - V # calculate the error
    self.W += self.alpha * S * error # update the weights (use the model's own learning rate, not a global)

    # store the history
    self.W_history.append(self.W.copy())
    self.R_history.append(R)
    self.V_history.append(S @ self.W)
    self.S_history.append(S)

# Set the learn method to the function we just defined.
RescorlaWagner_multistim.learn = learn

> 📝 **Exercise 1.8**
>
> With your partner, implement these four experiments in the Rescorla-Wagner model:
> 1. **Blocking**
>     * A single stimulus is paired with the US (A --> R), then a compound stimulus is paired with the US (AB --> R). What happens?
> 2. **Overshadowing**
>     * Two stimuli are paired with a reward (AB --> R) but one is much more salient than the other, e.g. $\mathbf{s} = [1, 0.1]$.
> 3. **Overexpectation**
>     * Two stimuli are separately paired with the US (A --> R, B --> R), then the compound stimulus is presented (AB --> ?). What happens?
> 4. **Conditioned Inhibition**
>     * A single stimulus is paired with the US (A --> R), then a second stimulus is added and the reward is removed (AB --> _). What do you observe?

In [None]:
# Your code for the blocking experiment goes here

In [None]:
#@title Double click to see solution {display-mode: "form" }
rescorla_blocking = RescorlaWagner_multistim(n_stimuli=2, alpha=0.1)
for i in range(50):
    rescorla_blocking.learn(S=np.array([1,0]), R=1)
for i in range(50):
    rescorla_blocking.learn(S=np.array([1,1]), R=1)

rescorla_blocking.plot()

Let's break down this plot: 

* The **top** plot shows the stimuli which were present on each trial. Their transparency denotes their strength. 
* The **middle** plot shows the association weights of the stimuli.
* The **bottom** plot shows the total value of the presented stimuli $V(\mathbf{s}) = \mathbf{s}\cdot\mathbf{w}$ along with the reward received on each trial.
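
The blocking result can also be checked numerically without the plotting helpers. The following is a minimal sketch in plain NumPy; `rw_step` is a hypothetical helper written for this illustration, not part of `NeuroRL_utils.py`.

```python
import numpy as np

alpha = 0.1
w = np.zeros(2)  # association weights for stimuli A and B

def rw_step(w, s, R, alpha):
    """One multi-stimulus Rescorla-Wagner update; returns the new weights."""
    error = R - s @ w
    return w + alpha * error * s

for _ in range(50):  # phase 1: A alone -> reward
    w = rw_step(w, np.array([1.0, 0.0]), 1.0, alpha)
for _ in range(50):  # phase 2: A and B together -> reward
    w = rw_step(w, np.array([1.0, 1.0]), 1.0, alpha)

print("w_A = %.3f, w_B = %.3f" % (w[0], w[1]))
```

Because A already predicts the reward almost perfectly by the end of phase 1, the prediction error in phase 2 is close to zero and B acquires almost no weight.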


In [None]:
# Your code for the overshadowing experiment goes here

In [None]:
#@title Double click to see solution {display-mode: "form" }
rescorla_overshadowing = RescorlaWagner_multistim(n_stimuli=2)
for i in range(100):
    rescorla_overshadowing.learn(S=np.array([0.9, 0.1]), R=1)
ax = rescorla_overshadowing.plot()
ax[0].set_title("Overshadowing")

In [None]:
# Your code for the overexpectation experiment goes here

In [None]:
#@title Double click to see solution {display-mode: "form" }
rescorla_overexpectation = RescorlaWagner_multistim(n_stimuli=2)
for i in range(50):
    rescorla_overexpectation.learn(S=np.array([1, 0]), R=1)
for i in range(50):
    rescorla_overexpectation.learn(S=np.array([0, 1]), R=1)
for i in range(50):
    rescorla_overexpectation.learn(S=np.array([1, 1]), R=1)
ax = rescorla_overexpectation.plot()
ax[0].set_title("Overexpectation")

In [None]:
# Your code for the inhibition experiment goes here

In [None]:
#@title Double click to see solution {display-mode: "form" }
rescorla_inhibition = RescorlaWagner_multistim(n_stimuli=2)
for i in range(50):
    rescorla_inhibition.learn(S=np.array([1, 0]), R=1)
for i in range(50):
    rescorla_inhibition.learn(S=np.array([1, 1]), R=0)
ax = rescorla_inhibition.plot()
ax[0].set_title("Inhibition")

> 📝 **Exercise 1.9**
>
> 1. Starting from the inhibition experiment above, extend it so that the reward prediction goes negative.
> 2. Simulate a conditioning experiment with three stimuli, where the first two are paired with the US and the third is not. What do you observe?

In [None]:
#Your code for the negative reward prediction experiment goes here

In [None]:
#@title Double click to see solution {display-mode: "form" }
RW = RescorlaWagner_multistim(n_stimuli=2)
for i in range(50):
    RW.learn(S=np.array([1, 0]), R=1)
for i in range(50):
    RW.learn(S=np.array([1, 1]), R=0)
for i in range(50):
    RW.learn(S=np.array([0, 1]), R=1)
ax = RW.plot()
ax[0].set_title("Negative reward prediction")

In [None]:
# Your code for three stimuli goes here

In [None]:
#@title Double click to see solution {display-mode: "form" }
RW = RescorlaWagner_multistim(n_stimuli=3)
for i in range(10):
    RW.learn(S=np.array([1, 0, 0]), R=1)
for i in range(10):
    RW.learn(S=np.array([1, 1, 0]), R=1)
for i in range(10):
    RW.learn(S=np.array([1, 1, 1]), R=1)
ax = RW.plot()
ax[0].set_title("Three stimuli")

---
## **2. Temporal difference learning** <a name="td"></a>

One limitation of the Rescorla-Wagner model is that it doesn't take into account the temporal structure of the environment. Associations are made between stimuli _now_ and rewards _now_. In reality, rewards are often delayed. 

Now we consider temporally evolving "Markov Reward Processes" where states progress through time and may (or may not) return rewards.

$$ S_{0} \rightarrow R_{1}, S_{1} \rightarrow R_{2}, S_{2} \rightarrow R_{3}, \ldots $$

In a Markov reward process, the state transitions and rewards may be probabilistic (later we'll consider Markov "Decision" processes where transitions may depend on "actions" the agent chooses to take but for now we'll keep things simple). The "value" of a state is the expected sum of rewards that will be received in the future, starting from that state.

$$V(S_t) = \mathbb{E} \big[ \underbrace{R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots}_{G_t = \textrm{``return'' from state } S_t}\big]$$

This differs from Rescorla-Wagner only in that the value of a state is not just based on the reward received _now_ ($V(S_t) = \mathbb{E} [R_{t+1} ]$) but also on the rewards that will be received in the future.

$\gamma$ is the discount factor, a number between 0 and 1 that is typically set close to 1. When $\gamma < 1$, rewards received in the future are "worth" less than rewards received now. This is generally a wise assumption since rewards in the future are less certain ("a bird in the hand is worth two in the bush").
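
To make the return concrete, here is a small sketch that computes $G_t$ from a list of future rewards. The function name `discounted_return` and the example rewards are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Return G_t = R_{t+1} + gamma*R_{t+2} + ... for rewards = [R_{t+1}, R_{t+2}, ...]."""
    G = 0.0
    # Work backwards so each step applies one more factor of gamma to later rewards
    for r in reversed(rewards):
        G = r + gamma * G
    return G

rewards = [0, 0, 1]  # a single reward arriving two steps in the future
print(discounted_return(rewards, gamma=0.9))  # discounted by gamma**2
print(discounted_return(rewards, gamma=1.0))  # no discounting
```

With $\gamma = 0.9$ the reward two steps away contributes $\gamma^2 \cdot 1$ to the return, whereas with $\gamma = 1$ it counts in full.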

As before, our goal is to learn an estimate of $V(S)$: we want to refine $\hat{V}(S_t)$ to be as close as possible to $V(S_t)$.

#### **2.0.1 Tabular vs. Function Approximation**

When there are multiple stimuli (or states) in an environment you have a choice. Either:
1. **Tabular**: You can store the value of each state in a big table (tabular learning) 
$$ \hat{V} = [ \hat{V}(S_0), \hat{V}(S_1), \ldots, \hat{V}(S_{N_{\textrm{states}}-1}) ] $$
2. **Function Approximation**: You can learn a function that maps states (typically then represented by a feature vector) to values (function approximation).
\begin{align}
\hat{V}(\mathbf{s}) & = f(\mathbf{s})  \hspace{4cm} \textrm{(general)} \\
& = \mathbf{w} \cdot \mathbf{s}  \hspace{4cm} \textrm{(linear)}\\
& = \phi\big(\mathbf{W}_{2}\cdot\phi(\mathbf{W}_1 \cdot \mathbf{s} + \mathbf{b}_1) + \mathbf{b}_2 \big)  \hspace{1cm}\textrm{(2-layer neural network)} \\
& = \ldots
\end{align}

Both have advantages and disadvantages: tabular methods are typically easier to understand and (under certain conditions) have better convergence guarantees. On the other hand, function approximation methods can scale to much larger state spaces, are more biologically plausible and typically generalize better to previously unseen states.

We used a linear function approximation in the Rescorla-Wagner model with multiple stimuli. This was because we wanted to be able to represent multiple stimuli simultaneously which is difficult to do with a tabular method.
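
One way to see how the two views relate: a tabular value function is the special case of linear function approximation in which every state is a one-hot feature vector. The sketch below illustrates this with made-up values; `one_hot` is a hypothetical helper.

```python
import numpy as np

n_states = 5
V_table = np.array([0.0, 0.2, 0.4, 0.6, 0.8])  # tabular values (made-up numbers)

def one_hot(state, n_states):
    """Feature vector with a single 1 at the index of the given state."""
    s = np.zeros(n_states)
    s[state] = 1.0
    return s

# With one-hot features, the weight vector *is* the table
w = V_table.copy()
for state in range(n_states):
    assert np.isclose(w @ one_hot(state, n_states), V_table[state])

# A feature vector with several active entries has no tabular counterpart,
# but the linear model still assigns it a value
s_mixed = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
print(w @ s_mixed)
```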

For now we'll revert back to tabular methods. These will be sufficient because in a Markov process you can only be in one state at any given time. We'll consider function approximation again in section 4.



> 📝 **Exercise 2.1**
>
> By considering the following loss function: 
> $$L_t = \big[ \hat{V}(S_t) - V(S_t) \big]^2$$
> show, by gradient descent in $\hat{V}(S_{t})$, that the optimal update rule for $\hat{V}(S_t)$ is:
> $$\hat{V}(S_t) \leftarrow \hat{V}(S_t) + \alpha \big[ V(S_t) - \hat{V}(S_t) \big]$$

### **2.1 Monte-Carlo learning**
One way to estimate the value of a state is to wait until the end of each episode, collect all the rewards that were received along the way, calculate the single-episode return $G_t$, and use this as a target. 

$$ \hat{V}(S_t) \leftarrow \hat{V}(S_t) + \alpha  \big[ G_t - \hat{V}(S_t) \big] $$

Since the _expectation_ of $G_t$ is equal to $V(S_t)$ this is equivalent to _stochastic gradient descent_ of the loss function and is called Monte-Carlo learning. 

Although in theory this does work, in practice, Monte-Carlo learning is often infeasible because it requires waiting until the end of each episode to update the value of each state. This turns out to be quite a serious limitation in real-world applications - imagine waiting until the end of a game of chess to update the value of each state! Or worse, some environments don't have a clearly defined end such as the game of life we're all currently playing. 
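
As a minimal sketch of the Monte-Carlo update, consider a hypothetical 10-state chain (states visited in order 0 to 9) where only the final state is rewarded. The environment, parameter values and variable names here are assumptions chosen for illustration.

```python
import numpy as np

gamma, alpha, n_states = 0.9, 0.1, 10
states = np.arange(n_states)
rewards = np.zeros(n_states)
rewards[-1] = 1.0  # rewards[i] is the reward received on leaving state i

V_hat = np.zeros(n_states)
for episode in range(500):
    G = 0.0
    # The episode must be finished before any update can happen:
    # walk backwards through it so that G accumulates the return from each state
    for s, r in zip(states[::-1], rewards[::-1]):
        G = r + gamma * G
        V_hat[s] += alpha * (G - V_hat[s])

print(np.round(V_hat, 3))
```

Note that every update needs the full return $G_t$, so nothing can be learned until the episode has ended.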

This is where TD learning comes in...

### **2.2 TD-Learning**
The key idea behind TD-learning is to estimate the value of a state by bootstrapping from the value of the next state. 

> 📝 **Exercise 2.2**
>
> 1. Show that the value of a state can be written as the expected sum of the reward received on leaving that state and the discounted value of the next state, i.e.
> $$V(S_t) = \mathbb{E} [R_{t+1} + \gamma V(S_{t+1})]. $$

This is called the _Bellman equation_ and is the basis of TD-learning. It encodes a _very_ important idea: 

**Bellman Equation:  The value of a state now is equal to the reward received now plus the value of the next state (discounted a little bit).**

But wait! How do we know the value of the next state? We don't, that's why we're learning. So we'll use our current estimate of the value of the next state, $\hat{V}(S_{t+1})$ as a _proxy_.

$$V(S_t) = \mathbb{E} [\color{red}{\underbrace{R_{t+1} + \gamma V(S_{t+1})}_{\textrm{I don't know this}}} \color{d}{}] \approx  \mathbb{E}[\color{green}{\underbrace{R_{t+1} + \gamma \hat{V}(S_{t+1})}_{\textrm{I do know this}}}\color{d}{}] $$

This gives us the TD-learning update rule:

$$\hat{V}(S_t) \leftarrow \hat{V}(S_t) + \alpha \big[\underbrace{R_{t+1} + \gamma \hat{V}(S_{t+1}) - \hat{V}(S_t)}_{\delta_t = \textrm{``TD-error''}} \big]$$

The term $\delta_t$ is the _temporal difference error_ and it can be high for two reasons: 
1. The reward was better (or worse) than expected.
2. The value of the next state was higher (or lower) than expected.

The second point is crucial: even though we might not observe reward in a given state we may assign this state value because the _next_ state has value. This is called _bootstrapping_. It's the same reason you might be happy to have received an invitation to a job interview even though interviews aren't inherently fun: you're bootstrapping from the value of the next state.
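
As a worked example with made-up numbers, suppose the interview invitation itself carries no reward, the current estimate of its value is $\hat{V}(S_t) = 0$, the next state (getting the job) is currently valued at $\hat{V}(S_{t+1}) = 10$, and we take $\gamma = 0.9$ and $\alpha = 0.5$. Then:

$$\delta_t = \underbrace{0}_{\textrm{reward}} + \underbrace{0.9 \times 10}_{\gamma \hat{V}(S_{t+1})} - \underbrace{0}_{\hat{V}(S_t)} = 9, \qquad \hat{V}(S_t) \leftarrow 0 + 0.5 \times 9 = 4.5$$

The invitation acquires value purely because of what it predicts, without any reward having been observed in that state.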


> 📝 **Exercise 2.3**
>
> 1. Implement the TD-learning update rule in the `def learn(self, S, S_next, R):` function in the `TD_ValueLearner` class below.

In [None]:
class TD_ValueLearner(BaseTDLearner):
    def __init__(self, gamma=0.5, alpha=0.1, n_states=10):
        super().__init__(gamma=gamma, alpha=alpha, n_states=n_states)
        
    def learn(self, S, S_next, R):
        raise NotImplementedError("You need to implement this method")
        # Gets the value of the current and next state
        # V = self.V[S] if S is not None else 0
        # V_next = self.V[S_next] if S_next is not None else 0
        
        # Calculates the TD error (hint: remember to use self.gamma)
        # TD_error = # ???

        # Updates the value of the current state
        # if S is not None:
            # self.V[S] = # ???

        # return TD_error

In [None]:
#@title Click to see solution {display-mode: "form" }
def learn(self, S, S_next, R):
    # Gets the value of the current and next state
    V = self.V[S] if S is not None else 0
    V_next = self.V[S_next] if S_next is not None else 0
    
    # Calculates the TD error 
    TD_error = R + self.gamma * V_next - V

    # Updates the value of the current state
    if S is not None:
        self.V[S] = self.V[S] + self.alpha * TD_error

    return TD_error

TD_ValueLearner.learn = learn

### **2.3 Training a TD-Learner on a sequence of states** 

Now we're going to train the TD learner in a simple Markov reward process environment. 

The method (not shown) `TD_ValueLearner.learn_episode(states, rewards)` takes a list of `states` = $[S_0, S_1, \ldots, S_T]$ and  `rewards` = $[R_1, R_2, \ldots, R_{T+1}]$ and trains the TD-learner on this sequence of states and rewards by calling the `learn` method at each time step. It then saves the results to history. 

**Note:** this prepends a state $S_{-1} =$`None` and reward $R_0 = 0$ to the start of the sequence. This represents the initial state and reward before the first state is observed, which cannot be predicted (it will be useful in Exercise 2.5.3).


For example, an MRP where states deterministically progress from 0 to 1 to ... to 9, with a reward of zero everywhere except at state 9 where a reward of 1 is received, can be simulated as follows:

```python
TD.learn_episode(
    states  = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]),
    rewards = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1]),
)
```

> 📝 **Exercise 2.4**
>
> Simulate 50 episodes of the above MRP (let $\alpha = 0.5$) then use the `TD_ValueLearner.plot(episode=0)` method to show the results after the first and last episode. 

In [None]:
#@title Click to see solution {display-mode: "form" }
# Set some parameters 
n_episodes = 50
gamma = 0.95
alpha = 0.5 # for now use a high learning rate
n_states = 10

# Initialize the TD learner
TD = TD_ValueLearner(gamma=gamma, n_states=n_states, alpha=alpha)

# Generate the MRP 
states = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
rewards = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# Run the experiment
for episode in range(n_episodes):
    TD.learn_episode(
        states=states,
        rewards=rewards,)

# Plot the results
ax0 = TD.plot(episode=0)
ax1 = TD.plot(episode=n_episodes-1)

Let's discuss these plots: 

- The **top-left** plot shows the states (y-axis) visited across time (x-axis).
- The **top-right** plot shows the current state value estimate after this episode. 
- The **bottom** plots show the reward received at each state and the TD error at each state.

We can also animate _all_ the episodes using the `TD_ValueLearner.animate_plot()` method.

In [None]:
anim = TD.animate_plot()
HTML(anim.to_html5_video())

> 📝 **Exercise 2.5**
> 
> With your partner, answer the following questions:
> 1. After learning has converged, what is the value of state 0? 
> 2. Approximately how fast does the value bump move backwards? How does this relate to the notion of one-step bootstrapping?
> 3. Why does a residual TD-error accumulate at the start? Understanding this is important for understanding TD-learning.

<div>
  <button onclick="this.nextElementSibling.style.display = this.nextElementSibling.style.display === 'none' ? 'block' : 'none';">
    Click to reveal
  </button>
  <div style="display: none;">
    <h1></h1>
    <p>Exercise 2.5 solutions</p>
    <ul>
      <li>State 9 has value V = R, state 8 has value V = 0 + γR = γR, state 7 has value V = 0 + γ^2 R, etc. So state 0 has value V = γ^9 R. So if R=1 and γ = 0.9 then V = 0.9^9 = 0.38742.</li>
      <li>Suppose α = 1 and γ is close to one. The first time the agent receives the reward at state 9 its value will be updated to V = 1 and no further learning will occur on this state (its TD error will be zero). On the next trial the value of state 8 will be updated because of a TD error: the new value of the upcoming state 9 wasn't predicted. Thus, the bump moves backwards at a rate of approximately one step per trial.</li>
      <li>The residual TD-error at the start is because the first state is never predictable. Pavlov's dog may be able to associate the bell with the food, but it can't predict the bell so hearing the bell will always come as a positive surprise (aka. a positive TD-error).</li>
    </ul>
  </div>
</div>

> 📝 **Exercise 2.6**
>
> 1. What would happen to the TD error if a small positive reward is given at the end (as we just simulated) and then, after learning has converged, the reward is removed? _Hint: think about what happens in the brain, and why_.
> 2. Simulate this experiment. 

In [None]:
# Your code goes here

In [None]:
#@title Click to see solution {display-mode: "form" }
# Set some parameters 
n_episodes = 50
gamma = 0.95
alpha = 0.5 # for now use a high learning rate
n_states = 10

# Initialize the TD learner
TD = TD_ValueLearner(gamma=gamma, n_states=n_states, alpha=alpha)

# Generate the MRP 
states = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
rewards = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
later_rewards = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

# Run the experiment
for episode in range(n_episodes):
    TD.learn_episode(
        states=states,
        rewards=rewards,)
# Continue the experiment with the reward removed
for episode in range(n_episodes):
    TD.learn_episode(
        states=states,
        rewards=later_rewards,)

# Animating the results
anim = TD.animate_plot()   
HTML(anim.to_html5_video())

> 📝 **Exercise 2.7**
> 
> What happens if this environment is stochastic? Adapt the code to model the following very stochastic environment...
>
> 1. **The state transitions are now probabilistic**: 
>     - As before there are $N = 10$ states from $i=0$ to $i=N-1$.
>     - When the agent is in state $S_t = i$ it has a probability $p_t = 0.8$ of moving to state $S_{t+1} = i+1$ and a probability $1 - p_t = 0.2$ of staying in state $S_{t+1} = i$. 
>     - The agent starts in state $S = 0$ and the episode ends whenever the agent reaches state $S=N-1$.
>
> 2. **The rewards are probabilistic** as a function of the state: 
>     - When the agent is in state $S_t = i$ it has a probability $p_r = (i+1) / N$ of getting a reward of $R_{t+1} = 1$, otherwise it receives nothing, i.e. it is guaranteed to receive a reward of 1 in the last state but only has a 0.1 probability of receiving a reward in the first state.
>
> 3. **Simulate** 200 episodes (let $\gamma = 0.8$ and $\alpha = 0.1$) of this new MRP and plot the results.
> Despite the highly stochastic nature of this environment, discuss why the TD learner converges on a fairly stable estimate of the value of each state. Why would this be useful in the brain? 
> 
> 4. **OPTIONAL**: Though stochastic, there is an exact solution to the value of each state in this environment. Can you derive it? 
> You'll want to start from the Bellman equation $V(S_t) = \mathbb{E} [R_{t+1} + \gamma V(S_{t+1})]$, show that the expected value of the last state $S = N-1$ is 1.0, then derive the following recursion relation:
> $$ V(N-1) = 1$$
> $$ V(n) = \frac{1}{1 - \gamma (1-p_t)} \left( \frac{n + 1}{N} + \gamma p_t \cdot V(n+1) \right) $$ 

In [None]:
# Here is the previous code for convenience. Adapt it as required. 

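# Set some parameters 
n_episodes = 50
gamma = 0.8
alpha = 0.5 # for now use a high learning rate
n_states = 10
p_transition = 0.8 # probability of moving to the next state (used by the theoretical values below)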
# Initialize the TD learner
TD = TD_ValueLearner(gamma=gamma, n_states=n_states, alpha=alpha)


# ===================   REWRITE THIS SECTION================================
# Generate the MRP: THESE WILL NEED TO BE REWRITTEN INSIDE THE LOOP
raise NotImplementedError("You need to rewrite this loop according to exercise 2.7") 
states = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
rewards = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

# Run the experiment
for episode in range(n_episodes):
    TD.learn_episode(
        states=states,
        rewards=rewards,)
# ==========================================================================



# NEW Here we calculate the theoretical value estimates and set them to be plotted
theoretical_value = np.zeros(n_states)
for i in range(n_states)[::-1]:
    if i == n_states - 1:
        theoretical_value[i] = 1 
    else:
        theoretical_value[i] = (1 / (1 - gamma * (1 - p_transition))) * (((i + 1)/n_states) + gamma * p_transition * theoretical_value[i+1])
TD.theoretical_value = theoretical_value #this will plot the theoretical value onto the animation 

# Plot the results
TD.plot(episode=n_episodes-1)

# Animate the results (this takes time so only do this at the end)
# anim = TD.animate_plot()
# HTML(anim.to_html5_video())

In [None]:
#@title Click to see solution (warning: may take up to a minute to animate) {display-mode: "form" }
# Set some parameters 
n_episodes = 200
gamma = 0.8
alpha = 0.1 # a low learning rate averages over the stochasticity
n_states = 10
p_transition = 0.8 

# Initialize the TD learner
TD = TD_ValueLearner(gamma=gamma, n_states=n_states, alpha=alpha)


# Run the experiment
for episode in range(n_episodes):
    # Generate the MRP 
    states = np.array([0])
    rewards = np.array([0])
    while states[-1] != n_states - 1:
        # Stochastic transition
        if np.random.rand() < p_transition: state_next = states[-1] + 1
        else: state_next = states[-1]
        # Stochastic reward
        p_reward = (states[-1]+1) / n_states
        reward = np.random.choice([0, 1], p=[1-p_reward, p_reward])
        rewards = np.append(rewards, reward)
        states = np.append(states, state_next)
    TD.learn_episode(
        states=states,
        rewards=rewards,)

# NEW Here we calculate the theoretical value estimates and set them to be plotted
theoretical_value = np.zeros(n_states)
for i in range(n_states)[::-1]:
    if i == n_states - 1:
        theoretical_value[i] = 1 
    else:
        theoretical_value[i] = (1 / (1 - gamma * (1 - p_transition))) * (((i + 1)/n_states) + gamma * p_transition * theoretical_value[i+1])
TD.theoretical_value = theoretical_value #this will plot the theoretical value onto the animation 


# Plot the results
anim = TD.animate_plot()
HTML(anim.to_html5_video())

Even though the environment is stochastic, the _expected_ value of each state is well defined. With a small learning rate the TD learner can learn this expected value, essentially smoothing over the stochasticity. This is a key insight into how the brain might learn in a stochastic world.
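
The smoothing effect can be seen in isolation with a minimal sketch that is separate from the `TD_ValueLearner` class: a single estimate is nudged towards noisy 0/1 rewards using the update `V += alpha * (R - V)`. The reward probability, seed and learning rates below are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the run is repeatable
p_reward = 0.7                          # hypothetical reward probability (expected reward 0.7)
rewards = rng.random(5000) < p_reward   # Bernoulli rewards stored as booleans

def running_estimate(rewards, alpha):
    """Apply V <- V + alpha * (R - V) and return the estimate after every sample."""
    V = 0.0
    history = []
    for R in rewards:
        V += alpha * (float(R) - V)
        history.append(V)
    return np.array(history)

small = running_estimate(rewards, alpha=0.01)
large = running_estimate(rewards, alpha=0.5)

# Compare how much each estimate fluctuates over the last 1000 samples
print(f"alpha=0.01: mean {small[-1000:].mean():.2f}, std {small[-1000:].std():.3f}")
print(f"alpha=0.5 : mean {large[-1000:].mean():.2f}, std {large[-1000:].std():.3f}")
```

The estimate with the smaller learning rate should hover near the expected reward with far less jitter, at the cost of taking longer to get there.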

---
## **3. Q-Values and Policy Improvement** <a name="q"></a>

So far we have only considered environments where there is no choice. The agent is simply moving through a sequence of states. In the real world agents have choices (called _actions_) and the value of a state is not just the value of the state itself but the value of the state _and the action taken in that state_.

For this we need to introduce the concept of a _policy_. A policy is any function which maps states to actions; actions then determine how one state leads to another. Policies are often denoted by $\pi$:

$$ \pi : S \rightarrow A $$

A _Markov reward process_ (MRP) where the agent has a policy is called a _Markov Decision Process_ (MDP).

$$ S_0 \xrightarrow{A_0 \sim \pi(S_0)} R_1, S_1 \xrightarrow{A_1\sim \pi(S_1)} R_2, S_2 \xrightarrow{A_2\sim \pi(S_2)} R_3, S_3 \xrightarrow{A_3\sim \pi(S_3)} R_4, S_4 \xrightarrow{A_4\sim \pi(S_4)} \ldots $$

The _action_ the agent took at each state determines the state the agent ends up in next which may (or may not) come with a reward.
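
As a rough illustration, the sequence above can be generated in code. The three-state chain below is a made-up example (not one of the tutorial environments): `transitions`, `policy` and `rollout` are hypothetical names, and both the policy and the environment are deterministic for simplicity.

```python
# Hypothetical 3-state chain: transitions[state][action] = (next_state, reward)
transitions = {
    0: {"stay": (0, 0), "go": (1, 0)},
    1: {"stay": (1, 0), "go": (2, 1)},  # reaching state 2 gives a reward of 1
}
terminal_state = 2

# A deterministic policy is just a lookup from state to action
policy = {0: "go", 1: "go"}

def rollout(policy, start_state=0, max_steps=10):
    """Follow the policy and record (S_t, A_t, R_{t+1}, S_{t+1}) for each step."""
    trajectory = []
    state = start_state
    for _ in range(max_steps):
        if state == terminal_state:
            break
        action = policy[state]
        next_state, reward = transitions[state][action]
        trajectory.append((state, action, reward, next_state))
        state = next_state
    return trajectory

for step in rollout(policy):
    print(step)
```

Swapping in a different `policy` dictionary changes which states are visited and which rewards are collected, even though the environment itself is unchanged.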

### **3.1 Q-Values**

The "value" of a state now depends not just on the state but on the action taken in that state and the policy followed thereafter.

Q-values are a natural generalisation of state values to state-action pairs under a policy:

$$ Q_{\pi}(s, a) = \mathbb{E}_{\pi} \big[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \big| S_t = s, A_t = a \big] $$

In plain English this says: "$Q_{\pi}(s, a)$ is the expected return from taking action $a$ in state $s$ and then following policy $\pi$". Note the expectation is over $\pi$ because the policy determines the future actions, states and rewards. The policy (which actions are taken in which states) and the environment (which states and rewards are reached from which states) could _both_ be stochastic.

Like the value of a state, the Q-values satisfy the Bellman equation:

$$ Q_{\pi}(s, a) = \mathbb{E}_{\pi} \big[ R_{t+1} + \gamma Q_{\pi}(S_{t+1}, A_{t+1}) \big| S_t = s, A_t = a, A_{t+1}\sim\pi(S_{t+1}) \big] $$

> 📝 **Exercise 3.1**
>
> Consider a simple MDP where there are two states ($S_1$ and $S_2$) and two actions ($A_1$ and $A_2$).
> 1. From $S_1$: taking action $A_1$ leads to $S_2$ with reward 2, taking action $A_2$ leads to $S_1$ with reward 1.
> 2. From $S_2$: taking action $A_1$ leads to $S_1$ with reward 3, taking action $A_2$ leads to $S_2$ with reward 1.
> 
> Question 1: Given the policy $\pi_1$ where $\pi_1(S_1) = A_1$, $\pi_1(S_2) = A_2$, what are the Q-values of each state-action pair under this policy (it might help to calculate them in the following order)?
> * $Q_{\pi_1}(S_2, A_2)$
> * $Q_{\pi_1}(S_1, A_1)$
> * $Q_{\pi_1}(S_1, A_2)$
> * $Q_{\pi_1}(S_2, A_1)$
>
> Question 2: Write down the optimal policy $\pi^*$
> * $\pi^*(S_1) = ?$
> * $\pi^*(S_2) = ?$
>
> Question 3: What are the Q-values of each state-action pair under the optimal policy $\pi^*$?
> * $Q_{\pi^*}(S_1, A_1)$
> * $Q_{\pi^*}(S_2, A_2)$
> * $Q_{\pi^*}(S_1, A_2)$
> * $Q_{\pi^*}(S_2, A_1)$

### **3.2 SARSA**

Not all MDPs are simple enough to solve analytically like the example above. In practice we often use _learning algorithms_ to estimate the Q-values from observations.

The closest equivalent of the TD-learning rule for Q-values is:

$$\hat{Q}_{\pi}(S_t, A_t) \leftarrow \hat{Q}_{\pi}(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma \hat{Q}_{\pi}(S_{t+1}, A_{t+1}) - \hat{Q}_{\pi}(S_t, A_t) \big]$$

This is often called the SARSA learning rule because it takes into account the **S**tate $S_t$, the **A**ction $A_t$, the **R**eward $R_{t+1}$, the next **S**tate $S_{t+1}$ and the next **A**ction $A_{t+1}$.

Note how it looks almost identical to the TD-learning rule for state values but with the addition of the action terms. 
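
As a worked example with made-up numbers, suppose the current estimate is $\hat{Q}_{\pi}(S_t, A_t) = 0.5$, the estimate for the next state-action pair is $\hat{Q}_{\pi}(S_{t+1}, A_{t+1}) = 2$, the reward received is $1$, $\gamma = 0.9$ and $\alpha = 0.1$. Writing $\delta$ for the term in square brackets (the TD error), a single update gives:

\begin{align}
\delta &= 1 + 0.9 \times 2 - 0.5 = 2.3 \\
\hat{Q}_{\pi}(S_t, A_t) &\leftarrow 0.5 + 0.1 \times 2.3 = 0.73
\end{align}

The estimate moves a fraction $\alpha$ of the way towards the bootstrapped target $1 + 0.9 \times 2 = 2.8$.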

Let's adapt our `TD_ValueLearner` class into a `TD_QValueLearner` to learn Q-values. It's pretty simple, we just need to:

1. Change `self.V` (a list) to `self.Q` (an array) to store Q-values for each state-action pair. 
    - `self.Q[s, a]` is the Q-value of state `s` and action `a`.
2. Change the `learn()` method to update Q-values instead of state values. I.e. instead of 
    - `self.learn(S, S_next, R)` updating `self.V[S]`...
    - `self.learn(S, S_next, A, A_next, R)` should update `self.Q[S, A]`.

> 📝 **Exercise 3.2**
>
> Complete the missing lines (those with `????`) in the `def learn(self, S, S_next, A, A_next, R):` function in the `TD_QValueLearner` class below.


In [None]:
class TD_QValueLearner(BaseTDQLearner):
    def __init__(self, gamma=0.5, alpha=0.1, n_states=10, n_actions=2):
        # self.Q = # ???? # initialize the Q table with the right shape
        super().__init__(gamma=gamma, alpha=alpha, n_states=n_states, n_actions=n_actions)
  
    def learn(self, S, S_next, A, A_next, R):
        # Get the Q-values of the current and next state-action pairs
        raise NotImplementedError("You need to implement this method")
        # Q = # ???? # get the Q-value of the current state-action pair (remember it's zero if S is None)
        # Q_next = # ???? # get the Q-value of the next state-action pair (zero if S_next is None)

        # Calculate the TD error (hint: remember to use self.gamma)
        # TD_error = # ???? # calculate the TD error

        # Update the Q-value of the current state-action pair
        # if S is not None:
        #     self.Q[S,A] = # ???? # update the Q value

        # return TD_error

In [None]:
#@title Click to see solution {display-mode: "form" }
class TD_QValueLearner(BaseTDQLearner):
    def __init__(self, gamma=0.5, alpha=0.1, n_states=10, n_actions=2):
        self.Q = np.zeros((n_states, n_actions))
        super().__init__(gamma=gamma, alpha=alpha, n_states=n_states, n_actions=n_actions)
  
    def learn(self, S, S_next, A, A_next, R):
        # Get the Q-values of the current and next state-action pairs (zero if the state is None)
        Q = self.Q[S,A] if S is not None else 0
        Q_next = self.Q[S_next, A_next] if S_next is not None else 0

        # Calculate the TD error
        TD_error = R + self.gamma * Q_next - Q

        # Update the Q-value of the current state-action pair
        if S is not None:
            self.Q[S,A] = self.Q[S,A] + self.alpha * TD_error 

        return TD_error

> 📝 **Exercise 3.3**
> 
> Perform the following experiment: each episode, with a 50:50 probability, is either a "left" episode or a "right" episode:
> 1. **"Right"** Agent moves to the right from $S=0$ to $S=9$ (action $A = 0$), receiving a reward ($R=1$) at the terminal state.
> 2. **"Left"** Agent moves to the left from $S=9$ to $S=0$ (action $A = 1$), receiving a negative reward ($R=-1$) at the terminal state.

> 📝 **Exercise 3.4**
>
> Using `TDQ.Q`, plot the Q-values of each state-action pair. What difference do you observe between the Q-values of _the same state_ under the two different actions?

In [None]:
gamma = 0.90 # discount factor
alpha = 0.5 # learning rate
n_episodes = 100

# Make the TDQ learner
TDQ = TD_QValueLearner(gamma=gamma, alpha=alpha, n_states=10, n_actions=2)

for episode in range(n_episodes):
    
    # randomly choose between a left or right episode
    episode_type = np.random.choice(['left', 'right'])

    # write code which generates the states, actions and rewards for the episode and then performs the episode

In [None]:
#@title Click to see solution {display-mode: "form" }
gamma = 0.90 # discount factor
alpha = 0.5 # learning rate
n_episodes = 100

# Make the TDQ learner
TDQ = TD_QValueLearner(gamma=gamma, alpha=alpha, n_states=10, n_actions=2)

for episode in range(n_episodes):
    
    # randomly choose between a left or right episode
    episode_type = np.random.choice(['left', 'right'])

    if episode_type == 'right':
        # move right from S=0 to S=9 (action 0)
        states = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
        actions = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
        rewards = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
    elif episode_type == 'left':
        # move left from S=9 to S=0 (action 1)
        states = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1, 0])
        actions = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
        rewards = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, -1])
    
    TDQ.learn_episode(
        states=states,
        actions=actions,
        rewards=rewards)

In [None]:
# Write code to plot the Q values 

In [None]:
#@title Click to see solution {display-mode: "form" }
Q_values = TDQ.Q
fig, ax = plt.subplots(1,1 , figsize=(4,2))
ax.bar(np.arange(10), Q_values[:,0], color='C0', label='Action = 0 ("Right")', alpha=0.5)
ax.bar(np.arange(10), Q_values[:,1], color='C1', label='Action = 1 ("Left")', alpha=0.5)
ax.axhline(0, color='k', linewidth=0.5)
ax.set_xlabel('State')
ax.legend() 


### **3.3 Policy Improvement**

So far we have learnt how to: 
1. **Rewards** Associate states with rewards by minimising the prediction error (Rescorla-Wagner).
2. **Values** Evaluate a _state_ by associating it with future rewards, bootstrapping from the value of the next state and minimising the temporal difference error (TD-learning).
3. **Q-Values** Evaluate a _policy_ by calculating the Q-value of taking each action in each state and minimising the temporal difference error (SARSA).

The final pillar of model-free reinforcement learning is **Policy Improvement**. This is the process of using the Q-values to improve the policy.

4. **Policy Improvement** Use some algorithm to refine the policy so that it maximises the expected return. The goal is to find the optimal policy $\pi^*$, which maximises the expected return from each state.

$$ \pi^* = \arg\max_{\pi} Q_{\pi}(s, a) \hspace{3mm} \forall \hspace{3mm}s, a$$

In principle there are many ways to learn the policy. There are two main families of algorithms:
1. **Policy-based methods**: Learn the policy directly, for example using a neural network to map states to actions directly (example: _REINFORCE_, or _Actor-Critic_ methods). We will not be covering these methods in this tutorial.
1. **Value-based methods**: Learn the Q-values and then derive the policy from the Q-values.

#### **3.3.1 $\epsilon$-greedy policy**
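
A common value-based choice is the $\epsilon$-greedy policy: with probability $\epsilon$ the agent explores by picking a random action, otherwise it exploits by picking the action with the highest Q-value in its current state. The function below is a minimal sketch of this idea. `epsilon_greedy_action` is a hypothetical helper written for illustration and is not necessarily how `MiniGrid.perform_episode` handles its `epsilon` argument.

```python
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1, rng=None):
    """Pick an action for `state` from a Q-table of shape (n_states, n_actions)."""
    rng = np.random.default_rng() if rng is None else rng
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        # Explore: choose uniformly among all actions
        return int(rng.integers(n_actions))
    # Exploit: choose the best action, breaking ties at random
    best_actions = np.flatnonzero(Q[state] == Q[state].max())
    return int(rng.choice(best_actions))

# Toy Q-table with 3 states and 2 actions (values made up for illustration)
Q = np.array([[0.0, 1.0],
              [0.5, 0.2],
              [0.0, 0.0]])
rng = np.random.default_rng(0)
actions = [epsilon_greedy_action(Q, state=0, epsilon=0.1, rng=rng) for _ in range(1000)]
print("Fraction of greedy choices in state 0:", np.mean(np.array(actions) == 1))
```

Setting `epsilon=0` gives the purely greedy policy, while `epsilon=1` gives a uniformly random one. Random tie-breaking matters early in learning, when the Q-table is all zeros and every action looks equally good.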





In [None]:
grid = np.array( # 0 = empty, 1 = wall, 2 = reward
    [
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],        
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
        [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    ])


minigrid = MiniGrid(
    grid = grid,
    init_agent_pos=(3,7),
    reward_locations = [(10,2),]
    )

tdqlearner = TD_QValueLearner(
    gamma=0.95, 
    alpha=0.1, 
    n_states=minigrid.n_states, 
    n_actions=minigrid.n_actions)

# minigrid.render()

minigrid.plot_Q_values(tdqlearner.Q)

In [None]:
episode_lengths = []
for i in (pbar := tqdm(range(1500))):
    try: 
        episode_data = minigrid.perform_episode(Q_values=tdqlearner.Q, epsilon=0.1)
        tdqlearner.learn_episode(
            states=episode_data['states'],
            actions=episode_data['actions'],
            rewards=episode_data['rewards'])
        minigrid.reset()
        pbar.set_description(f"Episode length (recent average): {episode_data['smoothed_episode_length'] : .0f}")
        episode_lengths.append(len(episode_data['states']))
    except KeyboardInterrupt:
        break

In [None]:
minigrid.plot_Q_values(tdqlearner.Q)
minigrid.plot_episode_length()

In [None]:
#@title Run this code to plot the first and last 5 episodes  {display-mode: "form" }
fig, ax = plt.subplots(1,5, figsize=(10,2))
for i in range(5):
    minigrid.plot_episode(i, ax=ax[i])
fig.suptitle("First 5 episodes")

fig, ax = plt.subplots(1,5, figsize=(10,2))
for i in range(5):
    minigrid.plot_episode(-i-1, ax=ax[i])
fig.suptitle("Last 5 episodes")

* Play around with epsilon, gamma and environment size / shape.

**Limitations**
* Change in rewards
* Change in transitions (shortcuts)
* Curse of dimensionality


---
## **4. State features and function approximation** <a name="dqn"></a>

---
## **5. Solutions** <a name="solutions"></a>

## Solution to exercise 2.4:

The terminal state $S_t = N-1$ has a known value of $V(S = N-1) = \mathbb{E}[ R(S = N-1)] = 1$ (guaranteed reward of 1). 

Using the Bellman equation, and writing $p_t$ for the transition probability (`p_transition` in the code):

\begin{align}
V(S_t = n) &= \mathbb{E} [R_t + \gamma V(S_{t+1})] \\
           &= \mathbb{E} [R_t]  + \mathbb{E} [\gamma V(S_{t+1})]  \\
           &= \frac{n + 1}{N}\cdot 1 + \gamma \mathbb{E}_{S_{t+1}}[ V(S_{t+1}) ] \\
           &= \frac{n + 1}{N}\cdot 1 + \underbrace{\gamma p_t \cdot V(S_{t+1} = n+1)}_{\textrm{it transitioned to next state}} + \underbrace{\gamma (1-p_t) \cdot V(S_{t}=n)}_{\textrm{it stayed in the same state}} \\
(1 - \gamma (1-p_t)) V(S_t = n) &= \frac{n + 1}{N} + \gamma p_t \cdot V(S_{t+1} = n+1) \\
V(S_t = n) &= \frac{1}{1 - \gamma (1-p_t)} \left( \frac{n + 1}{N} + \gamma p_t \cdot V(S_{t+1} = n+1) \right) \\
V(n) &= \frac{1}{1 - \gamma (1-p_t)} \left( \frac{n + 1}{N} + \gamma p_t \cdot V(n+1) \right)
\end{align}

For comparison, consider the deterministic chain with a single reward $R$ at the final state. State 9 has value $V = R$, state 8 has value $V = 0 + \gamma R = \gamma R$, state 7 has value $V = 0 + \gamma^2 R$, etc. So state 0 has value $V = \gamma^9 R$. If $R=1$ and $\gamma = 0.9$ then $V = 0.9^9 = 0.387420489$.

## Solutions to exercise 3.1: 

**Question 1:**

Recall the Bellman equation for taking action $A_t$ in state $S_t$ and transitioning to state $S_{t+1}$ getting reward $R_{t+1}$: $Q_{\pi}(S_t, A_t) = \mathbb{E} \big[ R_{t+1} + \gamma Q_{\pi}(S_{t+1}, \pi(S_{t+1})) \big]$. In our case everything is deterministic so we can drop the expectation.

\begin{align}
Q_{\pi_1}(S_2, A_2) &= 1 + \gamma Q_{\pi_1}(S_2, \pi_1(S_2)) \\
                    &= 1 + \gamma Q_{\pi_1}(S_2,A_2) \\
                    &= \frac{1}{1 - \gamma} 
\end{align}

Likewise 

\begin{align}
Q_{\pi_1}(S_1, A_1) &= 2 + \gamma Q_{\pi_1}(S_2, \pi_1(S_2)) \\
                    &= 2 + \gamma Q_{\pi_1}(S_2,A_2) \\
                    &= 2 + \gamma \frac{1}{1 - \gamma} \\
                    &= \frac{2 - \gamma}{1 - \gamma}
\end{align}


\begin{align}
Q_{\pi_1}(S_1, A_2) &= 1 + \gamma Q_{\pi_1}(S_1, \pi_1(S_1)) \\
                    &= 1 + \gamma Q_{\pi_1}(S_1,A_1) \\
                    &= 1 + \gamma \frac{2 - \gamma}{1 - \gamma} \\
                    &= \frac{1 + \gamma - \gamma^2}{1 - \gamma}
\end{align}

\begin{align}
Q_{\pi_1}(S_2, A_1) &= 3 + \gamma Q_{\pi_1}(S_1, \pi_1(S_1)) \\
                    &= 3 + \gamma Q_{\pi_1}(S_1,A_1) \\
                    &= 3 + \gamma \frac{2 - \gamma}{1 - \gamma} \\
                    &= \frac{3 - \gamma - \gamma^2}{1 - \gamma}
\end{align}



**Question 2:**
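The optimal policy $\pi^{*}$ is to take action $A_1$ in state $S_1$ and action $A_1$ in state $S_2$. Alternating between the two states earns rewards of 2 and 3, whereas staying in either state earns only 1 per step.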

**Question 3:**
The value of each state-action pair under the optimal policy $\pi^{*}$ is:

\begin{align}
Q_{\pi^{*}}(S_1, A_1) &= 2 + \gamma Q_{\pi^{*}}(S_2, \pi^{*}(S_2)) \\
                    &= 2 + \gamma Q_{\pi^{*}}(S_2,A_1) \\
\end{align}

\begin{align}
Q_{\pi^{*}}(S_2, A_1) &= 3 + \gamma Q_{\pi^{*}}(S_1, \pi^{*}(S_1)) \\
                    &= 3 + \gamma Q_{\pi^{*}}(S_1,A_1) \\
\end{align}
Solving these simultaneously gives:
\begin{align}
Q_{\pi^{*}}(S_2, A_1) &= \frac{3 + 2\gamma}{1 - \gamma^2} \\
Q_{\pi^{*}}(S_1, A_1) &= \frac{2 + 3\gamma}{1 - \gamma^2} \\
\end{align}

For the other state-action pairs:
\begin{align}
Q_{\pi^{*}}(S_1, A_2) &= 1 + \gamma Q_{\pi^{*}}(S_1, \pi^{*}(S_1)) \\
                    &= 1 + \gamma Q_{\pi^{*}}(S_1,A_1) \\
                    &= 1 + \gamma \frac{2 + 3\gamma}{1 - \gamma^2} \\
                    &= \frac{1 + 2\gamma + 2\gamma^2}{1 - \gamma^2} \\
Q_{\pi^{*}}(S_2, A_2) &= 1 + \gamma Q_{\pi^{*}}(S_2, \pi^{*}(S_2)) \\
                    &= 1 + \gamma Q_{\pi^{*}}(S_2,A_1) \\
                    &= 1 + \gamma \frac{3 + 2\gamma}{1 - \gamma^2} \\
                    &= \frac{1 + 3\gamma + \gamma^2}{1 - \gamma^2}
\end{align}
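
These closed-form answers can be checked numerically. The sketch below is one way to do so, assuming $\gamma = 0.9$ (any $0 \le \gamma < 1$ should work): it repeatedly applies the Bellman equation to the two-state MDP until the Q-values stop changing. The names `mdp` and `evaluate_policy` are introduced here for illustration only. States and actions are indexed from 0, so `Q[0, 1]` is $Q(S_1, A_2)$.

```python
import numpy as np

gamma = 0.9  # assumed discount factor, chosen for illustration

# mdp[state][action] = (next_state, reward), with S_1 -> 0, S_2 -> 1, A_1 -> 0, A_2 -> 1
mdp = {
    0: {0: (1, 2), 1: (0, 1)},
    1: {0: (0, 3), 1: (1, 1)},
}

def evaluate_policy(policy, gamma, n_sweeps=1000):
    """Iterate Q(s, a) = r + gamma * Q(s', policy[s']) for a deterministic policy."""
    Q = np.zeros((2, 2))
    for _ in range(n_sweeps):
        Q_new = np.zeros_like(Q)
        for s in mdp:
            for a, (s_next, r) in mdp[s].items():
                Q_new[s, a] = r + gamma * Q[s_next, policy[s_next]]
        Q = Q_new
    return Q

Q_pi1 = evaluate_policy(policy={0: 0, 1: 1}, gamma=gamma)   # pi_1: A_1 in S_1, A_2 in S_2
Q_star = evaluate_policy(policy={0: 0, 1: 0}, gamma=gamma)  # pi^*: A_1 in both states

# Closed-form answers from the derivations above
Q_pi1_exact = np.array([[2 - gamma, 1 + gamma - gamma**2],
                        [3 - gamma - gamma**2, 1]]) / (1 - gamma)
Q_star_exact = np.array([[2 + 3 * gamma, 1 + 2 * gamma + 2 * gamma**2],
                         [3 + 2 * gamma, 1 + 3 * gamma + gamma**2]]) / (1 - gamma**2)

print(np.allclose(Q_pi1, Q_pi1_exact), np.allclose(Q_star, Q_star_exact))
```

Taking `Q_star.argmax(axis=1)` recovers the optimal action in each state, which is a quick way to confirm the answer to Question 2.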