## **Solutions to the Neuro RL tutorial exercises** [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/TomGeorge1234/NeuroRLTutorial/blob/main/solutions.ipynb)
Original tutorial at: https://github.com/TomGeorge1234/NeuroRLTutorial 

### Exercise 1.1

>1. $\hat{V}_0 = 0$, then $\hat{V}_1 = \hat{V}_0 + \alpha (R - \hat{V}_0) = \alpha R$
>
>2. Formulating this as an ODE we get: 
>     $$\frac{d\hat{V}(t)}{dt} = \alpha (R - \hat{V}(t))$$
>    Using the change of variable $\delta(t) = R - \hat{V}(t)$ $\longrightarrow$ $\frac{d\hat{V}(t)}{dt} = -\frac{d\delta(t)}{dt}$, we get: $$\frac{d\delta(t)}{dt} = -\alpha \delta(t)$$
>This is a simple exponential decay ODE with solution $\delta(t) = \delta(0) e^{-\alpha t}$. Since $\delta(0) = R$, we get $\delta(t) = R e^{-\alpha t}$. Finally, 
>$$\hat{V}(t) = R - \delta(t) = R(1 - e^{-\alpha t})$$
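
As a quick numerical sanity check of the result above, the following minimal sketch iterates the discrete update and compares it with $R(1 - e^{-\alpha t})$. The values of `R`, `alpha` and `n_trials` are illustrative assumptions, not taken from the tutorial.

```python
import math

def rescorla_wagner_trace(reward, alpha, n_trials):
    """Iterate V <- V + alpha * (R - V) from V = 0 and return V after each trial."""
    v = 0.0
    trace = [v]
    for _ in range(n_trials):
        v += alpha * (reward - v)
        trace.append(v)
    return trace

R, alpha, n_trials = 1.0, 0.05, 100  # illustrative values (assumption)
discrete = rescorla_wagner_trace(R, alpha, n_trials)
continuous = [R * (1 - math.exp(-alpha * t)) for t in range(n_trials + 1)]

# Largest disagreement between the discrete rule and the ODE solution
max_gap = max(abs(d - c) for d, c in zip(discrete, continuous))
print(f"V_1 = {discrete[1]:.3f}, alpha * R = {alpha * R:.3f}")
print(f"largest gap to the ODE solution: {max_gap:.4f}")
```

The ODE is only an approximation of the discrete rule, so the two curves agree closely for small $\alpha$ and drift apart as $\alpha$ grows.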

### Exercise 1.6 
> 2. A high learning rate means that the agent will quickly update its value function to the new reward, while a low learning rate means that the agent will take longer to update its value function. If the environment is noiseless, a high learning rate is better because the agent will quickly learn the expected value. However, if the environment is noisy, a high learning rate can lead to the agent learning the noise instead of the expected value. In this case, a low learning rate is better because the agent will average across the noise.
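
The trade-off described above can be illustrated with a small simulation. This is only a sketch: the reward mean, noise level, learning rates and seed are all assumed for illustration, and `run_trials` and `trials_to_reach` are hypothetical helper names.

```python
import random
import statistics

def run_trials(alpha, n_trials, reward_mean=1.0, reward_std=0.5, seed=0):
    """Apply V <- V + alpha * (R - V) to noisy rewards; return V after every trial."""
    rng = random.Random(seed)
    v, estimates = 0.0, []
    for _ in range(n_trials):
        r = rng.gauss(reward_mean, reward_std)
        v += alpha * (r - v)
        estimates.append(v)
    return estimates

def trials_to_reach(estimates, threshold):
    """Number of trials until the estimate first reaches the threshold."""
    return next(i + 1 for i, v in enumerate(estimates) if v >= threshold)

# Noiseless rewards: the high learning rate reaches 90% of R much sooner.
fast_clean = trials_to_reach(run_trials(0.5, 200, reward_std=0.0), 0.9)
slow_clean = trials_to_reach(run_trials(0.05, 200, reward_std=0.0), 0.9)

# Noisy rewards: the high learning rate keeps chasing the noise,
# so its estimate jitters more once learning has settled.
fast_spread = statistics.pstdev(run_trials(0.5, 2000)[-1000:])
slow_spread = statistics.pstdev(run_trials(0.05, 2000)[-1000:])
print(fast_clean, slow_clean, round(fast_spread, 3), round(slow_spread, 3))
```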

### Exercise 1.7
> 1. The update is now applied to the weight vector $\mathbf{w}$, which has size $n$, where $n$ is the number of features (or stimuli), so the update must also be a vector of size $n$. The state vector $\mathbf{s}$ gives the strength of each feature in the current state and thus determines which features the reward is "assigned" to. This learning rule can be derived formally by gradient descent on the loss function $L = \frac{1}{2} (R - \mathbf{s} \cdot \mathbf{w})^2$.
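
A minimal sketch of this vector update, written as the gradient step $\mathbf{w} \leftarrow \mathbf{w} + \alpha (R - \mathbf{s} \cdot \mathbf{w})\,\mathbf{s}$. The stimulus vector, reward and learning rate are assumed values and `rw_update` is a hypothetical helper name.

```python
def rw_update(w, s, reward, alpha):
    """One Rescorla-Wagner step: w <- w + alpha * (R - s.w) * s."""
    prediction = sum(wi * si for wi, si in zip(w, s))
    delta = reward - prediction
    return [wi + alpha * delta * si for wi, si in zip(w, s)]

w = [0.0, 0.0, 0.0]
s = [1.0, 0.0, 1.0]  # features 0 and 2 present, feature 1 absent (assumption)
for _ in range(200):
    w = rw_update(w, s, reward=1.0, alpha=0.1)
print(w)
```

Only the weights of features that are present change, and the two present features end up sharing the reward prediction between them.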

### Exercise 1.11 
> 1. Would _not_ be captured by the current Rescorla Wagner model
> 2. Would _not_ be captured by the current Rescorla Wagner model
> 3. Would be captured by the current Rescorla Wagner model

### Exercise 2.1
> 1. Working backwards: 
>    - $V_4 = R_5 = 5$
>    - $V_3 = R_4 + \gamma R_5 = R_4 + \gamma V_4 = 4 + 0.9 \cdot 5 = 8.5$
>    - $V_2 = R_3 + \gamma R_4 + \gamma^2 R_5 = R_3 + \gamma V_3 = 3 + 0.9 \cdot 8.5 = 10.65$
>    - $V_1 = R_2 + \gamma V_2 = 2 + 0.9 \cdot 10.65 = 11.585$
>    - $V_0 = R_1 + \gamma V_1 = 1 + 0.9 \cdot 11.585 = 11.4265$
>
>2. The value is the _discounted sum of future rewards_. The fact that this is a sum explains why state 3 has a higher value than state 4: state 3 is "valuable" because it is followed by two rewards ($R_4$ and $R_5$) whereas state 4 is only followed by one reward ($R_5$). On the other hand, the discounting is why state 1 has a higher value than state 0: state 1 is closer to the larger rewards coming later (so discounts them less). At some point these counteracting effects balance out, leaving state 1 (not state 0 or state 4) with the highest value.
>
>3. From state $S_0$ you receive rewards of $1, 1, 1, \ldots$ on to infinity. Therefore the value of the state $S_0$ is $1 + \gamma\cdot 1 + \gamma^2\cdot 1 + \ldots$ which is a geometric series that converges to $\frac{1}{1 - \gamma}$. By symmetry, the value of the state $S_1$ is $\frac{1}{1 - \gamma}$ as well.
>
>4. If $\gamma = 1$ the value of the state $S_0$ is $\infty$ (according to the previous answer) - numerically the algorithm may run into convergence issues.
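
The backward recursion in part 1 and the geometric series in part 3 can both be checked numerically. The sketch below assumes the rewards $R_1, \ldots, R_5 = 1, \ldots, 5$ and $\gamma = 0.9$ used in the working above; `backward_values` is a hypothetical helper name.

```python
def backward_values(rewards, gamma):
    """rewards[k] is R_{k+1}, the reward received on leaving state k.
    Returns [V_0, ..., V_{n-1}] using V_k = R_{k+1} + gamma * V_{k+1}."""
    values = [0.0] * len(rewards)
    next_value = 0.0  # nothing is received after the final reward
    for k in reversed(range(len(rewards))):
        values[k] = rewards[k] + gamma * next_value
        next_value = values[k]
    return values

gamma = 0.9
values = backward_values([1, 2, 3, 4, 5], gamma)
print([round(v, 4) for v in values])

# Part 3: a long truncated sum of gamma^k approaches 1 / (1 - gamma)
truncated_sum = sum(gamma ** k for k in range(1000))
print(truncated_sum, 1 / (1 - gamma))
```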

### Exercise 2.2
> 1. When $\hat{V}(\mathbf{s};\mathbf{w}) = \mathbf{s} \cdot \mathbf{w}$ and $\mathbf{s}$ is one-hot then the dot product just selects the weight corresponding to the active feature. For example if $\mathbf{s}_2 = [0, 1, 0, 0]$ and $\mathbf{w} = [\hat{V}_1, \hat{V}_2, \hat{V}_3, \hat{V}_4]$ then $\hat{V}(\mathbf{s}_2; \mathbf{w}) = \hat{V}_2$.

### Exercise 2.3
> $$L_t = \big[ \hat{V}(S_t) - V(S_t) \big]^2$$ 
> Find the gradient with respect to the state value function $\hat{V}(S_t)$:
> $$ \frac{\partial L_t}{\partial \hat{V}(S_t)} = 2 \big[ \hat{V}(S_t) - V(S_t) \big]$$
> The optimal learning rule is therefore to update the weights in the direction of the negative gradient:
> \begin{align}
> \Delta \hat{V}(S_t) &= -\alpha \frac{\partial L_t}{\partial \hat{V}(S_t)} = -2\alpha \big[ \hat{V}(S_t) - V(S_t) \big] \\
> \Delta \hat{V}(S_t) &= \alpha \big[ V(S_t) - \hat{V}(S_t) \big]
> \end{align}
> Where the factor of 2 has been absorbed into the learning rate. As an update rule: 
> $$\hat{V}(S_t) \leftarrow \hat{V}(S_t) + \alpha \big[ V(S_t) - \hat{V}(S_t) \big]$$

### Exercise 2.4 

> \begin{align}
> V(S_t) &= \mathbb{E} \big[ G_t \big] \\
> &= \mathbb{E} \big[ R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \ldots \big] \\
> &= \mathbb{E} \big[ R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \ldots) \big] \\
> &= \mathbb{E} \big[ R_{t+1} + \gamma G_{t+1} \big] \\
> &= \mathbb{E} \big[ R_{t+1} + \gamma V(S_{t+1}) \big]
> \end{align}
> where the final line follows from the linearity of expectation and the definition of the value function.

### Exercise 2.7

> 1. Terminal state 9 has value $V = R$, state 8 has value $V = 0 + \gamma R = \gamma R$, state 7 has value $V = 0 + \gamma^2 R$, etc. So state 0 has value $V = \gamma^9 R$. So if $R=1$ and $\gamma = 0.9$ then $V = 0.9^9 = 0.38742$.
> 2. Suppose $\alpha = 1$ and $\gamma$ is close to one. The first time the agent receives the reward at state 9 its value will be updated to $V = 1$ and no further learning will occur on this state (its TD error will be zero). On the next trial the value of state 8 will be updated because of a TD error: the new value of the upcoming state 9 wasn't predicted. Thus, the bump moves backwards at a rate of approximately one step per episode. This makes sense because each state bootstraps from the next state's value. If there are 10 steps between state 0 and state 9 then it will take at least 10 episodes for the value of state 0 to be updated, and more for it to converge (depending on the learning rate and other factors).
> 3. The residual TD-error at the start is because the first state is never predictable. Pavlov's dog may be able to associate the bell with the food, but it can't predict the bell so hearing the bell will always come as a positive surprise (aka. a positive TD-error).
> 4. The modelling results and the neuroscience results are similar in that the dopamine neurons fire successively earlier. In the brain there is a short delay (~0.2 seconds) between the stimulus and the dopamine firing, probably due to processing time in the brain.
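
Parts 1 and 2 can be reproduced with a minimal tabular TD(0) sketch on a 10-state chain. It assumes, as in the answer above, that the only reward is delivered at state 9 and that the episode then ends; the function name `td_chain` and the learning rates are illustrative assumptions.

```python
def td_chain(n_states=10, reward=1.0, gamma=0.9, alpha=0.5, n_episodes=200):
    """Tabular TD(0) on a left-to-right chain with a single reward at the last state."""
    v = [0.0] * n_states
    for _ in range(n_episodes):
        for s in range(n_states):
            is_last = s == n_states - 1
            r = reward if is_last else 0.0
            v_next = 0.0 if is_last else v[s + 1]  # the value after the episode ends is zero
            delta = r + gamma * v_next - v[s]      # TD error
            v[s] += alpha * delta
    return v

converged = td_chain()
print(round(converged[0], 5), round(0.9 ** 9, 5))

# With alpha = 1 the learned values creep back one state per episode
after_three = td_chain(alpha=1.0, n_episodes=3)
n_learned = sum(1 for x in after_three if x > 0)
print(n_learned)
```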

### Exercise 2.8

> 1. If the reward is removed the TD error will go _negative_. This is because the "observed" value will be lower than the predicted value (which was previously high because reward, historically, was delivered on this state). $$\delta_t = \underbrace{R_{t+1} + \gamma V(S_{t+1})}_{\textrm{``observed'' value}} - \underbrace{V(S_t)}_{\textrm{predicted value}} < 0$$
> 2. Before the reward was removed the "value" of the state was just equal to the reward received, $V(S_t) = R$ (since the state was terminal so $V(S_{t+1}) = 0$), so after the reward is removed the temporal difference error will be $-R$ since:
> $$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t) = 0 + \gamma \cdot 0 - R = -R$$
> 4. The figure and the simulation results seem similar. The dopamine neurons fire at baseline and then fire when the reward is delivered as expected but then fire _less_ when the reward is removed. This is consistent with the TD error going negative when the reward is removed. One key difference is that, in the brain, neurons can't have a negative firing rate so the "negative" TD error is probably represented by a decrease in firing rate relative to baseline.

### Exercise 2.9 

> 2. The terminal state $S_t = N-1$ has a known value of $V(S = N-1) = \mathbb{E}[ R(S = N-1)] = 1$ (guaranteed reward of 1). 
> 
>    Using the Bellman equation: 
> 
> \begin{align}
> V(S_t = n) &= \mathbb{E} [R_t + \gamma V(S_{t+1})] \\
>            &= \mathbb{E} [R_t]  + \mathbb{E} [\gamma V(S_{t+1})]  \\
>            &= \frac{n + 1}{N}\cdot 1 + \gamma \mathbb{E}_{S_{t+1}}[ V(S_{t+1}) ] \\
>            &= \frac{n + 1}{N}\cdot 1 + \underbrace{\gamma p_t \cdot V(S_{t+1} = n+1)}_{\textrm{it transitioned to next state}} + \underbrace{\gamma (1-p_t) \cdot V(S_{t}=n)}_{\textrm{it stayed in the same state}} \\
> (1 - \gamma (1-p_t)) V(S_t = n) &= \frac{n + 1}{N} + \gamma p_t \cdot V(S_{t+1} = n+1) \\
> V(S_t = n) &= \frac{1}{1 - \gamma (1-p_t)} \left( \frac{n + 1}{N} + \gamma p_t \cdot V(S_{t+1} = n+1) \right) \\
> V(n) &= \frac{1}{1 - \gamma (1-p_t)} \left( \frac{n + 1}{N} + \gamma p_t \cdot V(n+1) \right)
> \end{align}
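
The recursion above can be evaluated by working backwards from the terminal state. In this sketch the transition probability is treated as a single constant `p_move` (an assumption, since the derivation writes it as $p_t$), and `stochastic_chain_values` is a hypothetical helper name.

```python
def stochastic_chain_values(n_states, gamma, p_move):
    """V(n) = ((n + 1) / N + gamma * p * V(n + 1)) / (1 - gamma * (1 - p)), with V(N - 1) = 1."""
    values = [0.0] * n_states
    values[-1] = 1.0  # terminal state: guaranteed reward of 1
    for n in reversed(range(n_states - 1)):
        expected_reward = (n + 1) / n_states
        values[n] = (expected_reward + gamma * p_move * values[n + 1]) / (
            1 - gamma * (1 - p_move)
        )
    return values

N, gamma, p = 5, 0.9, 0.5  # illustrative values (assumption)
values = stochastic_chain_values(N, gamma, p)
print([round(v, 4) for v in values])
```

Each value returned should satisfy the un-rearranged Bellman equation, $V(n) = \frac{n+1}{N} + \gamma p V(n+1) + \gamma (1-p) V(n)$, which is a convenient check on the algebra.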

### Exercise 2.10

> 1. In the real world the delivery of rewards and the transitions between states are often stochastic. For example your train could be delayed (transitions are stochastic) or your favourite cafe could be closed (rewards are stochastic). Since these stochastic events are inherently unpredictable the value function can never be perfect. This is why the value function is an expectation and not a certainty. Learning this expectation helps an agent make better decisions in the face of uncertainty: "on the balance of probability the train will probably be on time and the cafe will probably be open, so you should probably go to the cafe".

### Exercise 3.1

>1. Recall the Bellman equation for taking action $A_t$ in state $S_t$ and transitioning to state $S_{t+1}$ getting reward $R_{t+1}$: $Q_{\pi}(S_t, A_t) = \mathbb{E} \big[ R_{t+1} + \gamma Q_{\pi}(S_{t+1}, \pi(S_{t+1}))  \big]$. In our case everything is deterministic so we can drop the expectation.
> 
>     \begin{align}
>     Q_{\pi_1}(S_2, A_2) &= 1 + \gamma Q_{\pi_1}(S_2, \pi_1(S_2)) \\
>                         &= 1 + \gamma Q_{\pi_1}(S_2,A_2) \\
>                         &= \frac{1}{1 - \gamma} 
>     \end{align}
>     Likewise 
> 
> \begin{align}
> Q_{\pi_1}(S_1, A_1) &= 2 + \gamma Q_{\pi_1}(S_2, \pi_1(S_2)) \\
>                     &= 2 + \gamma Q_{\pi_1}(S_2,A_2) \\
>                     &= 2 + \gamma \frac{1}{1 - \gamma} \\
>                     &= \frac{2 - \gamma}{1 - \gamma}
> \end{align}
> 
> 
> \begin{align}
> Q_{\pi_1}(S_1, A_2) &= 1 + \gamma Q_{\pi_1}(S_1, \pi_1(S_1)) \\
>                     &= 1 + \gamma Q_{\pi_1}(S_1,A_1) \\
>                     &= 1 + \gamma \frac{2 - \gamma}{1 - \gamma} \\
>                     &= \frac{1 + \gamma - \gamma^2}{1 - \gamma}
> \end{align}
> 
> \begin{align}
> Q_{\pi_1}(S_2, A_1) &= 3 + \gamma Q_{\pi_1}(S_1, \pi_1(S_1)) \\
>                     &= 3 + \gamma Q_{\pi_1}(S_1,A_1) \\
>                     &= 3 + \gamma \frac{2 - \gamma}{1 - \gamma} \\
>                     &= \frac{3 - \gamma - \gamma^2}{1 - \gamma}
> \end{align}
> 
> 
> 2. The optimal policy, $\pi^{*}$ is to take action $A_1$ in state $S_1$ and action $A_1$ in state $S_2$. 
> 
> 3. The value of each state-action pair under the optimal policy $\pi^{*}$ is:
> 
>    \begin{align}
>    Q_{\pi^{*}}(S_1, A_1) &= 2 + \gamma Q_{\pi^{*}}(S_2, \pi^{*}(S_2)) \\
>                        &= 2 + \gamma Q_{\pi^{*}}(S_2,A_1) \\
>    \end{align}
>     
>    \begin{align}
>     Q_{\pi^{*}}(S_2, A_1) &= 3 + \gamma Q_{\pi^{*}}(S_1, \pi^{*}(S_1)) \\
>                         &= 3 + \gamma Q_{\pi^{*}}(S_1,A_1) \\
>    \end{align}
>    Solving these simultaneously gives:
>     
>    \begin{align}
>     Q_{\pi^{*}}(S_2, A_1) &= \frac{3 + 2\gamma}{1 - \gamma^2} \\
>     Q_{\pi^{*}}(S_1, A_1) &= \frac{2 + 3\gamma}{1 - \gamma^2} \\
>    \end{align}
>    For the other state-action pairs:
>    \begin{align}
>     Q_{\pi^{*}}(S_1, A_2) &= 1 + \gamma Q_{\pi^{*}}(S_1, \pi^{*}(S_1)) \\
>                         &= 1 + \gamma Q_{\pi^{*}}(S_1,A_1) \\
>                         &= 1 + \gamma \frac{2 + 3\gamma}{1 - \gamma^2} \\
>                         &= \frac{1 + 2\gamma + 2\gamma^2}{1 - \gamma^2} \\
>     Q_{\pi^{*}}(S_2, A_2) &= 1 + \gamma Q_{\pi^{*}}(S_2, \pi^{*}(S_2)) \\
>                         &= 1 + \gamma Q_{\pi^{*}}(S_2,A_1) \\
>                         &= 1 + \gamma \frac{3 + 2\gamma}{1 - \gamma^2} \\
>                         &= \frac{1 + 3\gamma + \gamma^2}{1 - \gamma^2}
>    \end{align}
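
The closed-form Q-values above can be checked by iterating the Bellman equation until it stops changing. The transition table below is read off from the equations in this answer (for example $(S_1, A_1)$ gives reward 2 and leads to $S_2$) rather than copied from the tutorial, so treat it as an assumption; `evaluate_policy` is a hypothetical helper name.

```python
# (state, action) -> (reward, next_state), inferred from the Bellman equations above
MDP = {
    ("S1", "A1"): (2.0, "S2"),
    ("S1", "A2"): (1.0, "S1"),
    ("S2", "A1"): (3.0, "S1"),
    ("S2", "A2"): (1.0, "S2"),
}

def evaluate_policy(policy, gamma, n_sweeps=2000):
    """Iterate Q(s, a) <- r + gamma * Q(s', policy(s')) for a deterministic MDP."""
    q = {sa: 0.0 for sa in MDP}
    for _ in range(n_sweeps):
        q = {
            (s, a): r + gamma * q[(s_next, policy[s_next])]
            for (s, a), (r, s_next) in MDP.items()
        }
    return q

gamma = 0.9
q_pi1 = evaluate_policy({"S1": "A1", "S2": "A2"}, gamma)
q_star = evaluate_policy({"S1": "A1", "S2": "A1"}, gamma)
for key in MDP:
    print(key, round(q_pi1[key], 3), round(q_star[key], 3))
```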

### Exercise 3.4

>1. - **Moving to a new neighbourhood** which you don't know well: You must decide whether to exploit what you already know (e.g. go to the same restaurant you always go to) or explore new options (try a new restaurant, which, given your lack of knowledge, might be better or worse than your usual choice). 
>     - **The Netflix recommendation algorithm** must balance showing you things you already like (exploitation) with showing you new things you might like (exploration).
> 2. In a stable environment that is well-understood and not subject to change, it is probably **better to prioritize exploitation**. An example of such an environment is a manufacturing assembly line with consistent demand for a very specific product. In this case, it is best to exploit what you know works to maximize efficiency and output, rather than exploring new methods that may be less efficient or break the system.
>3. In an unstable and constantly changing environment that is not well understood, it is better to prioritize exploration. An example of such an environment is the early stages of a startup in a rapidly evolving technology sector. In this case, it is important to explore new ideas and methods to adapt to the changing landscape and find the best path forward. Another example might be starting a new job: it's worth exploring different ways of working, different projects and different collaborators to find the best fit.
>4. In many real-world scenarios _stochastic_ policies are optimal. One example is bluffing in poker. If you always bluff when you have a poor hand your opponents will quickly learn this and exploit you by betting against you. If you never bluff then your opponents will always fold when you have a good hand and you will never win much money. However, if you bluff randomly (stochastically) then your opponents will be unable to predict your behaviour and will be forced to play more cautiously. Rock-Paper-Scissors is similar. Another example is the behaviour of animals in the wild: if a predator always follows the same path it will be easy for its prey to avoid it. However, if it follows a stochastic path then it will be more likely to catch its prey. 
>5. Softmax action selection is a popular alternative to $\epsilon$-greedy. In softmax action selection, the probability of selecting an action is proportional to the exponentiated value of the Q-value for that action. This allows for a smooth transition between exploration and exploitation, with the probability of selecting the best action increasing as the Q-values become more certain.
>    $$ P(A_t = a | S_t = s) = \frac{e^{Q(s, a) / \tau}}{\sum_{a'} e^{Q(s, a') / \tau}} $$
>    where $\tau$ is a temperature parameter that controls the degree of exploration. When $\tau$ is high, the policy is close to uniform random action selection, and as $\tau$ approaches zero, the policy becomes deterministic and selects the action with the highest Q-value.
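
A minimal sketch of the softmax rule above, showing the two temperature limits. The Q-values and temperatures are illustrative assumptions, and `softmax_policy` is a hypothetical helper name.

```python
import math

def softmax_policy(q_values, tau):
    """Return action probabilities proportional to exp(Q / tau)."""
    q_max = max(q_values)  # subtracting the max leaves the ratios unchanged but avoids overflow
    exps = [math.exp((q - q_max) / tau) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 2.0, 3.0]
hot = softmax_policy(q, tau=100.0)   # high temperature: close to uniform
cold = softmax_policy(q, tau=0.01)   # low temperature: close to greedy
print([round(p, 3) for p in hot])
print([round(p, 3) for p in cold])
```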

### Exercise 3.5

> 2. - A slight negative cost to movement encourages the agent to take the shortest path to the goal. This is because the agent will prefer to move directly to the goal rather than take a longer path that incurs more cost.
>    - A secondary effect of a negative cost to movement is to encourage exploration. In the early stages of learning the negative cost will cause the Q-values for most state-action pairs to go negative, _below_ their initial value of zero. Thus, under the almost-greedy policy, these state-action pairs will not be chosen next time and the agent will be forced to explore new paths. This is a form of _intrinsic motivation_.

### Exercise 3.9

>1. Suppose the agent has learnt the _optimal policy_. From any given starting point this will be either to move left and carry on moving left until the left-episode-ending-reward $R_1=15$ is encountered, or to move right and carry on moving right until the right-episode-ending-reward $R_2=5$ is encountered. We can calculate the value of always moving left and of always moving right and find where they are equal: this is the "switch point". If the agent is $D_{\textrm{switch}}$ steps from the left reward and $\gamma = 1.0$ then the value of always moving left is just the (undiscounted) sum of all the rewards and costs along the way: $Q(D_{\textrm{switch}}, \textrm{``move left''}) = R_1 - D_{\textrm{switch}}\cdot c$, where $c$ is the cost of moving. The value of always moving right is $Q(D_{\textrm{switch}}, \textrm{``move right''}) = R_2 - (D - D_{\textrm{switch}})\cdot c$, where $D$ is the number of steps between the two rewards. Setting these equal gives $R_1 - D_{\textrm{switch}}\cdot c = R_2 - (D - D_{\textrm{switch}})\cdot c$ which can be solved for $D_{\textrm{switch}} = \frac{Dc - R_2 + R_1}{2c} = \frac{20\cdot1 - 5 + 15}{2\cdot 1} = 15$. So the agent will choose the right policy if it is more than 15 steps from the left reward and the left policy if it is fewer than 15 steps away (at exactly 15 steps the two are equally good). 
>4. The number of physical states is $100 \times 100 = 10000 \sim 2^{13}$. The number of possible reward arrangements is $2^{10000}$. So the total number of states is $2^{13} \times 2^{10000} = 2^{10013}$. The total number of actions is $360 \times 200 = 72000 \sim 2^{16}$. So the total number of state-action pairs is $2^{10013} \times 2^{16} = 2^{10029}$. The number of atoms in the observable universe is estimated to be $10^{80} \sim 2^{266}$.
>    - The obscene example should make clear the limitations of tabular methods. It's clear that treating all possible states as totally independent (even if, in practice, many of them are incredibly similar to one another) is a very wasteful way to go. In this case the set-up describes a simple foraging task which, in practice, is easily solved by RL using function approximation and state features, even though the number of independent states dramatically exceeds the number of atoms in the observable universe. 
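
Both calculations above are easy to reproduce. The sketch below uses the numbers quoted in this answer ($R_1 = 15$, $R_2 = 5$, $D = 20$, $c = 1$, a $100 \times 100$ grid and $360 \times 200$ actions); `switch_distance` is a hypothetical helper name.

```python
import math

def switch_distance(r_left, r_right, total_distance, step_cost):
    """Distance from the left reward at which going left and going right are equally good."""
    return (total_distance * step_cost + r_left - r_right) / (2 * step_cost)

d_switch = switch_distance(r_left=15, r_right=5, total_distance=20, step_cost=1)
q_left = 15 - d_switch * 1
q_right = 5 - (20 - d_switch) * 1
print(d_switch, q_left, q_right)

# Part 4: size of the tabular state-action space, measured in powers of two
log2_states = math.log2(100 * 100) + 100 * 100  # positions x reward arrangements
log2_actions = math.log2(360 * 200)
log2_atoms = 80 * math.log2(10)
print(round(log2_states + log2_actions), round(log2_atoms))
```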

### Exercise 4.1

>1. Some features you may build into your feature vector are: 
>    1. Proximity to boundaries or obstacles (better decision making regarding the environment) 
>    2. Orientation of the agent. 
>    3. Velocity and acceleration of the agent (help in understanding how motion and dynamics of the agent affect future states)
>    4. Previous states or memory (in complex worlds the Markov assumption may break down and it's useful to allow previous states to influence future actions) 
>    5. Energy constraints of the agent (if the agent has limited energy, it may need to conserve energy and make decisions accordingly)

### Exercise 4.2

>1. The gradient of the loss function wrt the weight vector, $\theta$, is: 
>    $$ \nabla_{\theta} L_t(\theta) = -2 \big[ Q_{\pi}(s,a) - \hat{Q}(s, a; \theta) \big] \nabla_{\theta} \hat{Q}(s, a; \theta) $$
>    By the standard bootstrapping trick and the Bellman equation we can replace the bracketed term with the TD-error, $\delta_t$:
>    $$ \nabla_{\theta} L_t(\theta) = - 2 \delta_t \nabla_{\theta} \hat{Q}(s, a; \theta) $$
>    so the update rule is (absorbing the factor of 2 into the learning rate):
>    $$ \theta \leftarrow \theta + \alpha \delta_t \nabla_{\theta} \hat{Q}(s, a; \theta) $$
>
>2. The gradient of the linear Q-function wrt the weight vector, $\theta$, is:
>    $$ \nabla_{\theta} \hat{Q}(s, a; \theta) = \nabla_{\theta} [\theta^{\top} \mathbf{\phi}(s)] = \mathbf{\phi}(s) $$
>    So the update rule is:
> $$ \theta \leftarrow \theta + \alpha \delta_t \mathbf{\phi}(s) $$
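
A minimal sketch of the linear update rule above. It assumes one weight vector per action (so that $\hat{Q}(s, a; \theta) = \theta_a^{\top} \phi(s)$) and a SARSA-style TD error; both choices, along with the names `q_hat` and `td_update`, are illustrative assumptions rather than the tutorial's implementation.

```python
def q_hat(theta, phi, action):
    """Linear Q-value: dot product of the chosen action's weights with the features."""
    return sum(t * f for t, f in zip(theta[action], phi))

def td_update(theta, phi, action, reward, phi_next, next_action,
              alpha, gamma, terminal=False):
    """theta_a <- theta_a + alpha * delta * phi(s); returns the TD error delta."""
    target = reward if terminal else reward + gamma * q_hat(theta, phi_next, next_action)
    delta = target - q_hat(theta, phi, action)
    theta[action] = [t + alpha * delta * f for t, f in zip(theta[action], phi)]
    return delta

theta = {0: [0.0, 0.0], 1: [0.0, 0.0]}  # two actions, two features
phi = [1.0, 0.5]
delta = td_update(theta, phi, action=0, reward=1.0, phi_next=[0.0, 0.0],
                  next_action=0, alpha=0.1, gamma=0.9, terminal=True)
print(delta, theta)
```

Only the weights of the action that was taken move, and each moves in proportion to how active its feature was.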

### Exercise 4.3

> 1. Yes, it is possible to have a continuous action space. For example, in a robotic control task, the agent may need to control the speed and direction of the robot's movement, which can be represented as continuous actions. However, calculating the greedy action in a continuous action space can be challenging because there are infinitely many possible actions to consider. One common approach is to use function approximation methods to approximate the action-value function and find the optimal action using optimization techniques such as gradient ascent on the Q-value with respect to the action.

### Exercise 4.5
> 2. In principle the function approximation technique, using place cell state features, should learn in fewer episodes. This is because the place cells inherently encode the continuity of space and the fact that nearby states are likely to have similar values. When the agent finds a reward in a particular state, the value of nearby states will be updated as well because _the same place cells are active_. This is in contrast to the tabular method where each state is treated as independent and the agent must visit each state individually to learn its value.
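
The generalisation argument above can be seen in a toy example: a single update at the rewarded location also raises the value of nearby, unvisited locations. The sketch assumes 11 Gaussian place cells tiling a 1D track of unit length; the cell width, learning rate and reward location are illustrative assumptions.

```python
import math

def place_cell_features(x, centres, width=0.1):
    """Gaussian place-cell activations at position x."""
    return [math.exp(-((x - c) ** 2) / (2 * width ** 2)) for c in centres]

centres = [i / 10 for i in range(11)]  # hypothetical place cells tiling a 1D track
weights = [0.0] * len(centres)

def value(x):
    """Linear value estimate: weights dotted with the place-cell activations."""
    return sum(w * f for w, f in zip(weights, place_cell_features(x, centres)))

# One update after finding a reward of 1 at x = 0.5 (terminal, so the target is just R)
alpha, x_reward = 0.1, 0.5
delta = 1.0 - value(x_reward)
features = place_cell_features(x_reward, centres)
weights = [w + alpha * delta * f for w, f in zip(weights, features)]

print(round(value(0.5), 4), round(value(0.6), 4), round(value(0.9), 4))
```

In a tabular agent the same single update would change the value of the rewarded state only.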

### Exercise 4.6
> 1. - The Q-value map for the "North" action peaks just a little to the south of the reward because in this location North is the best action to take to get to the reward most quickly. Likewise, the "East" action peaks just a little to the west of the reward. 
>    - The optimal policy is to always move south when due-north of the reward and always move east when due-west of the reward, and to move south-east when due-north-west of the reward, etc. In reality it's possible that the agent will not find this exact policy and instead finds a policy that is "good enough" to get the reward. For example, such a policy could be: 
>        - If the agent is in the north-west to north-east quadrant of the reward then move south
>        - If the agent is in the south-west to north-west quadrant of the reward then move east
>        - If the agent is in the south-east to south-west quadrant of the reward then move north
>        - If the agent is in the north-east to south-east quadrant of the reward then move west
> This policy does not make use of the diagonal actions but is still good enough to get the reward in _almost_ the minimum number of steps.

### Exercise 4.7
> 3. By many metrics grid cells are not optimal state features for performing RL. This is because of their multi-modal shape. If a reward is observed in one location its "value" may be incorrectly assigned to non-local states where that grid cell is also active. In principle this can be counteracted by summing together multiple grid cells of slightly different scales and orientations such that they sum positively at the location of the reward and cancel out elsewhere. However, as you may have seen in the simulation, this requires a lot of grid cells and a lot of fine tuning and is not always successful.

### Exercise 4.8
> 2. Crossing the road when the light is green: if one feature tells you when you're at the crossing and another feature tells you the state of the traffic light, you can use the two features to determine whether it is safe to cross the road, but only when both are active at the same time. This is a non-linear interaction between the two features. 
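
To make the non-linearity concrete, the sketch below fits a linear read-out to a toy version of this example, with and without an extra conjunction feature. The binary encoding (`at_crossing`, `light_is_green`), the training settings and the helper names are all assumptions made for illustration.

```python
def fit_linear(inputs, targets, alpha=0.1, n_epochs=5000):
    """Fit weights by repeated delta-rule updates: w <- w + alpha * error * x."""
    w = [0.0] * len(inputs[0])
    for _ in range(n_epochs):
        for x, y in zip(inputs, targets):
            err = y - sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + alpha * err * xi for wi, xi in zip(w, x)]
    return w

def max_error(w, inputs, targets):
    """Worst-case absolute prediction error over the dataset."""
    return max(abs(y - sum(wi * xi for wi, xi in zip(w, x)))
               for x, y in zip(inputs, targets))

# Features: (at_crossing, light_is_green); safe to cross only when both are 1
raw = [(0, 0), (1, 0), (0, 1), (1, 1)]
safe = [0, 0, 0, 1]
with_product = [(a, b, a * b) for a, b in raw]  # adds a conjunction feature

err_raw = max_error(fit_linear(raw, safe), raw, safe)
err_product = max_error(fit_linear(with_product, safe), with_product, safe)
print(round(err_raw, 3), round(err_product, 6))
```

No weighted sum of the two raw features can be right in all four cases, whereas adding a feature that is active only when both are present makes the problem linear again.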