## **Solutions to the Neuro RL tutorial exercises**

### Exercise 1.1

>1. $\hat{V}_0 = 0$, then $\hat{V}_1 = \hat{V}_0 + \alpha (R - \hat{V}_0) = \alpha R$
>
>2. Formulating this as an ODE we get: 
>     $$\frac{d\hat{V}(t)}{dt} = \alpha (R - \hat{V}(t))$$
>    Using the change of variable $\delta(t) = R - \hat{V}(t)$ $\longrightarrow$ $\frac{d\hat{V}(t)}{dt} = -\frac{d\delta(t)}{dt}$, we get: $$\frac{d\delta(t)}{dt} = -\alpha \delta(t)$$
>This is a simple exponential decay ODE with solution $\delta(t) = \delta(0) e^{-\alpha t}$. Since $\delta(0) = R$, we get $\delta(t) = R e^{-\alpha t}$. Finally, 
>$$\hat{V}(t) = R - \delta(t) = R(1 - e^{-\alpha t})$$
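
As a quick numerical check of the result above, the sketch below iterates the discrete update and compares it with the continuous-time solution. The values of `alpha`, `R` and `n_trials` are arbitrary choices for illustration, not taken from the tutorial. For small $\alpha$ the two should agree closely, since the exact discrete solution is $R(1 - (1-\alpha)^t)$ and $(1-\alpha)^t \approx e^{-\alpha t}$.

```python
import numpy as np

def simulate_value(alpha, R, n_trials):
    """Iterate V <- V + alpha * (R - V) starting from V = 0."""
    V = np.zeros(n_trials + 1)
    for t in range(n_trials):
        V[t + 1] = V[t] + alpha * (R - V[t])
    return V

alpha, R, n_trials = 0.05, 1.0, 100            # illustrative values (assumption)
V_discrete = simulate_value(alpha, R, n_trials)
t = np.arange(n_trials + 1)
V_continuous = R * (1 - np.exp(-alpha * t))    # ODE solution from part 2
V_exact = R * (1 - (1 - alpha) ** t)           # exact solution of the discrete update

max_gap = np.max(np.abs(V_discrete - V_continuous))
print(f"V after 1 trial: {V_discrete[1]:.3f} (alpha * R = {alpha * R:.3f})")
print(f"largest gap to the ODE solution: {max_gap:.4f}")
```

The gap shrinks as `alpha` gets smaller, which is the regime where treating the trial-by-trial update as an ODE is a good approximation.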

### Exercise 1.6 
> 2. A high learning rate means that the agent will quickly update its value function to the new reward, while a low learning rate means that the agent will take longer to update its value function. If the environment is noiseless, a high learning rate is better because the agent will quickly learn the expected value. However, if the environment is noisy, a high learning rate can lead to the agent learning the noise instead of the expected value. In this case, a low learning rate is better because the agent will average across the noise.
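
The trade-off described above can be illustrated with a small simulation. This is a minimal sketch under assumed settings (Gaussian reward noise with standard deviation 0.5, learning rates 0.5 and 0.05, a fixed random seed); none of these numbers come from the tutorial.

```python
import numpy as np

def run_rw(alpha, rewards):
    """Rescorla-Wagner estimate after each reward in the sequence."""
    V, history = 0.0, []
    for r in rewards:
        V += alpha * (r - V)
        history.append(V)
    return np.array(history)

rng = np.random.default_rng(0)                     # fixed seed for reproducibility
rewards = 1.0 + rng.normal(0.0, 0.5, size=5000)    # noisy rewards, true mean 1.0

V_fast = run_rw(alpha=0.5, rewards=rewards)
V_slow = run_rw(alpha=0.05, rewards=rewards)

# Compare how much each estimate fluctuates once learning has settled
spread_fast = V_fast[500:].std()
spread_slow = V_slow[500:].std()
print(f"alpha=0.5 : std of estimate = {spread_fast:.3f}")
print(f"alpha=0.05: std of estimate = {spread_slow:.3f}")
```

The high learning rate keeps chasing individual noisy rewards, so its estimate fluctuates more around the true mean than the low learning rate's estimate does.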

### Exercise 1.7
> 1. The update is now an update to the weight vector $\mathbf{w}$, which is a vector of size $n$, where $n$ is the number of features (or stimuli), so the update must also be a vector of size $n$. $\mathbf{s}$, the state vector, gives the strength of each feature in the current state and thus which features to "assign" the reward to. This learning rule can be derived formally by gradient descent on the loss function $L = \frac{1}{2} (R - \mathbf{s} \cdot \mathbf{w})^2$.
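
For completeness, the gradient-descent step can be written out as follows (a short sketch, taking $\alpha$ to be the learning rate as in the earlier exercises):

$$
\frac{\partial L}{\partial \mathbf{w}} = -(R - \mathbf{s}\cdot\mathbf{w})\,\mathbf{s}
\quad\Longrightarrow\quad
\mathbf{w} \leftarrow \mathbf{w} - \alpha \frac{\partial L}{\partial \mathbf{w}} = \mathbf{w} + \alpha\,(R - \mathbf{s}\cdot\mathbf{w})\,\mathbf{s}
$$

The scalar prediction error $(R - \mathbf{s}\cdot\mathbf{w})$ is shared by all weights, and $\mathbf{s}$ scales how much of it each weight receives.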

### Exercise 1.11 
> 1. Would _not_ be captured by the current Rescorla-Wagner model
> 2. Would _not_ be captured by the current Rescorla-Wagner model
> 3. Would be captured by the current Rescorla-Wagner model

### Exercise 2.1
> 1. Working backwards: 
>    - $V_4 = R_5 = 5$
>    - $V_3 = R_4 + \gamma R_5 = R_4 + \gamma V_4 = 4 + 0.9 \cdot 5 = 8.5$
>    - $V_2 = R_3 + \gamma R_4 + \gamma^2 R_5 = R_3 + \gamma V_3 = 3 + 0.9 \cdot 8.5 = 10.65$
>    - $V_1 = R_2 + \gamma V_2 = 2 + 0.9 \cdot 10.65 = 11.585$
>    - $V_0 = R_1 + \gamma V_1 = 1 + 0.9 \cdot 11.585 = 11.4265$
>
>2. The value is the _discounted sum of future rewards_. The fact that this is a sum explains why state 3 has a higher value than state 4 - state 3 is "valuable" because it is followed by two rewards ($R_4$ and $R_5$) whereas state 4 is only followed by one reward ($R_5$). On the other hand, the discounting is why state 1 has a higher value than state 0 - state 1 is closer to the larger rewards coming later (so discounts them less). At some point these counteracting effects balance out, leaving state 1 (not state 0 or state 4) with the highest value.
>
>3. From state $S_0$ you receive rewards of $1, 1, 1, \ldots$ on to infinity. Therefore the value of the state $S_0$ is $1 + \gamma\cdot 1 + \gamma^2\cdot 1 + \ldots$, which is a geometric series that converges to $\frac{1}{1 - \gamma}$ (for $\gamma < 1$). By symmetry, the value of the state $S_1$ is $\frac{1}{1 - \gamma}$ as well.
>
>4. If $\gamma = 1$ the value of the state $S_0$ is $\infty$ (according to the previous answer) - numerically the algorithm may run into convergence issues.
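
The backward recursion from part 1 and the geometric series from part 3 can be checked numerically. This is a minimal sketch; the function name `backward_values` is hypothetical, and the truncation at 500 terms is an arbitrary choice.

```python
def backward_values(rewards, gamma):
    """Return [V_0, ..., V_{T-1}] where V_t = R_{t+1} + gamma * V_{t+1}."""
    values = [0.0] * len(rewards)
    next_value = 0.0                      # nothing follows the final reward
    for t in reversed(range(len(rewards))):
        values[t] = rewards[t] + gamma * next_value
        next_value = values[t]
    return values

gamma = 0.9
rewards = [1, 2, 3, 4, 5]                 # R_1 ... R_5, as in part 1
V = backward_values(rewards, gamma)
print([round(v, 4) for v in V])

# Part 3: truncated geometric series versus the closed form 1 / (1 - gamma)
partial_sum = sum(gamma ** k for k in range(500))
print(partial_sum, 1 / (1 - gamma))
```

Setting `gamma = 1` in the last two lines shows the problem from part 4: the closed form divides by zero and the partial sum simply grows with the number of terms.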

### Exercise 2.2
> 1. When $\hat{V}(\mathbf{s};\mathbf{w}) = \mathbf{s} \cdot \mathbf{w}$ and $\mathbf{s}$ is one-hot then the dot product just selects the weight corresponding to the active feature. For example if $\mathbf{s}_2 = [0, 1, 0, 0]$ and $\mathbf{w} = [\hat{V}_1, \hat{V}_2, \hat{V}_3, \hat{V}_4]$ then $\hat{V}(\mathbf{s}_2; \mathbf{w}) = \hat{V}_2$.
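
In code, the same selection looks like this (a tiny sketch; the weight values are made up for illustration):

```python
import numpy as np

w = np.array([0.3, 1.2, -0.5, 2.0])   # arbitrary example values for V_1..V_4
s2 = np.array([0, 1, 0, 0])           # one-hot vector for the second state

v_hat = s2 @ w                        # the dot product picks out w[1]
print(v_hat)
```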

### Exercise 2.3

### Exercise 2.4 

### Exercise 2.7

1. Terminal state 9 has value $V = R$, state 8 has value $V = 0 + \gamma R = \gamma R$, state 7 has value $V = 0 + \gamma^2 R$, etc. So state 0 has value $V = \gamma^9 R$. So if $R=1$ and $\gamma = 0.9$ then $V = 0.9^9 = 0.38742$.
2. Suppose $\alpha = 1$ and $\gamma$ is close to one. The first time the agent receives the reward at state 9, its value will be updated to $V = 1$ and no further learning will occur in this state (its TD error will be zero). On the next trial the value of state 8 will be updated, because the new value of the upcoming state 9 wasn't predicted and so generates a TD error. Thus, the bump moves backwards at a rate of approximately one step per episode. This makes sense because each state bootstraps from the next state's value. If there are 10 steps between state 0 and state 9 then it will take at least 10 episodes for the value of state 0 to be updated, and more to converge (depending on the learning rate and other factors).
3. The residual TD error at the start is because the first state is never predictable. Pavlov's dog may be able to associate the bell with the food, but it can't predict the bell, so hearing the bell will always come as a positive surprise (i.e. a positive TD error).
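
The backward propagation of value can be reproduced with a few lines of TD(0). This is a minimal sketch under assumed conventions: 10 states visited in order on every episode, a reward of 1 delivered in state 9, nothing after state 9, and $\alpha = 1$. The function name `run_td` is hypothetical and the tutorial's own implementation may differ in detail.

```python
import numpy as np

def run_td(n_states=10, n_episodes=10, alpha=1.0, gamma=0.9, reward=1.0):
    """TD(0) on a linear track; the reward is delivered in the final state."""
    V = np.zeros(n_states)
    snapshots = []
    for _ in range(n_episodes):
        for s in range(n_states):
            is_last = s == n_states - 1
            r = reward if is_last else 0.0
            v_next = 0.0 if is_last else V[s + 1]   # nothing follows the last state
            td_error = r + gamma * v_next - V[s]
            V[s] += alpha * td_error
        snapshots.append(V.copy())                  # values at the end of each episode
    return np.array(snapshots)

snapshots = run_td()
for episode, V in enumerate(snapshots, start=1):
    print(episode, np.round(V, 3))
```

Each printed row has one more non-zero entry than the previous one, and under these assumptions state 0 only reaches $\gamma^9 R$ on the tenth episode.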

### Exercise 2.8

### Exercise 2.9 

2. The terminal state $S_t = N-1$ has a known value of $V(S = N-1) = \mathbb{E}[ R(S = N-1)] = 1$ (guaranteed reward of 1). 

   Using the Bellman equation: 

\begin{align}
V(S_t = n) &= \mathbb{E} [R_t + \gamma V(S_{t+1})] \\
           &= \mathbb{E} [R_t]  + \mathbb{E} [\gamma V(S_{t+1})]  \\
           &= \frac{n + 1}{N}\cdot 1 + \gamma \mathbb{E}_{S_{t+1}}[ V(S_{t+1}) ] \\
           &= \frac{n + 1}{N}\cdot 1 + \underbrace{\gamma p_t \cdot V(S_{t+1} = n+1)}_{\textrm{it transitioned to next state}} + \underbrace{\gamma (1-p_t) \cdot V(S_{t}=n)}_{\textrm{it stayed in the same state}} \\
(1 - \gamma (1-p_t)) V(S_t = n) &= \frac{n + 1}{N} + \gamma p_t \cdot V(S_{t+1} = n+1) \\
V(S_t = n) &= \frac{1}{1 - \gamma (1-p_t)} \left( \frac{n + 1}{N} + \gamma p_t \cdot V(S_{t+1} = n+1) \right) \\
V(n) &= \frac{1}{1 - \gamma (1-p_t)} \left( \frac{n + 1}{N} + \gamma p_t \cdot V(n+1) \right)
\end{align}
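
The final recursion can be evaluated backwards from the terminal state. The sketch below assumes a constant transition probability `p` in place of $p_t$, and the values of `N`, `p` and `gamma` are illustrative; the function name `track_values` is hypothetical.

```python
def track_values(N, p, gamma):
    """Values V(0..N-1) from the recursion above, assuming a constant p."""
    V = [0.0] * N
    V[N - 1] = 1.0                            # terminal state: guaranteed reward of 1
    for n in reversed(range(N - 1)):
        expected_reward = (n + 1) / N
        V[n] = (expected_reward + gamma * p * V[n + 1]) / (1 - gamma * (1 - p))
    return V

N, p, gamma = 5, 0.5, 0.9                     # illustrative parameter values (assumption)
V = track_values(N, p, gamma)
print([round(v, 3) for v in V])
```

Each non-terminal value returned this way satisfies the Bellman equation in the fourth line of the derivation.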

### Exercise 2.10

### Exercise 3.1

**Question 1:**

Recall the Bellman equation for taking action $A_t$ in state $S_t$ and transitioning to state $S_{t+1}$, getting reward $R_{t+1}$: $Q_{\pi}(S_t, A_t) = \mathbb{E} \big[ R_{t+1} + \gamma Q_{\pi}(S_{t+1}, \pi(S_{t+1})) \big]$. In our case everything is deterministic so we can drop the expectation.

\begin{align}
Q_{\pi_1}(S_2, A_2) &= 1 + \gamma Q_{\pi_1}(S_2, \pi_1(S_2)) \\
                    &= 1 + \gamma Q_{\pi_1}(S_2,A_2) \\
                    &= \frac{1}{1 - \gamma} 
\end{align}

Likewise 

\begin{align}
Q_{\pi_1}(S_1, A_1) &= 2 + \gamma Q_{\pi_1}(S_2, \pi_1(S_2)) \\
                    &= 2 + \gamma Q_{\pi_1}(S_2,A_2) \\
                    &= 2 + \gamma \frac{1}{1 - \gamma} \\
                    &= \frac{2 - \gamma}{1 - \gamma}
\end{align}


\begin{align}
Q_{\pi_1}(S_1, A_2) &= 1 + \gamma Q_{\pi_1}(S_1, \pi_1(S_1)) \\
                    &= 1 + \gamma Q_{\pi_1}(S_1,A_1) \\
                    &= 1 + \gamma \frac{2 - \gamma}{1 - \gamma} \\
                    &= \frac{1 + \gamma - \gamma^2}{1 - \gamma}
\end{align}

\begin{align}
Q_{\pi_1}(S_2, A_1) &= 3 + \gamma Q_{\pi_1}(S_1, \pi_1(S_1)) \\
                    &= 3 + \gamma Q_{\pi_1}(S_1,A_1) \\
                    &= 3 + \gamma \frac{2 - \gamma}{1 - \gamma} \\
                    &= \frac{3 - \gamma - \gamma^2}{1 - \gamma}
\end{align}



**Question 2:**
The optimal policy, $\pi^{*}$, is to take action $A_1$ in state $S_1$ and action $A_1$ in state $S_2$. 

**Question 3:**
The value of each state-action pair under the optimal policy $\pi^{*}$ is:

\begin{align}
Q_{\pi^{*}}(S_1, A_1) &= 2 + \gamma Q_{\pi^{*}}(S_2, \pi^{*}(S_2)) \\
                    &= 2 + \gamma Q_{\pi^{*}}(S_2,A_1)
\end{align}

\begin{align}
Q_{\pi^{*}}(S_2, A_1) &= 3 + \gamma Q_{\pi^{*}}(S_1, \pi^{*}(S_1)) \\
                    &= 3 + \gamma Q_{\pi^{*}}(S_1,A_1)
\end{align}
Solving these simultaneously gives:
\begin{align}
Q_{\pi^{*}}(S_2, A_1) &= \frac{3 + 2\gamma}{1 - \gamma^2} \\
Q_{\pi^{*}}(S_1, A_1) &= \frac{2 + 3\gamma}{1 - \gamma^2}
\end{align}

For the other state-action pairs:
\begin{align}
Q_{\pi^{*}}(S_1, A_2) &= 1 + \gamma Q_{\pi^{*}}(S_1, \pi^{*}(S_1)) \\
                    &= 1 + \gamma Q_{\pi^{*}}(S_1,A_1) \\
                    &= 1 + \gamma \frac{2 + 3\gamma}{1 - \gamma^2} \\
                    &= \frac{1 + 2\gamma + 2\gamma^2}{1 - \gamma^2} \\
Q_{\pi^{*}}(S_2, A_2) &= 1 + \gamma Q_{\pi^{*}}(S_2, \pi^{*}(S_2)) \\
                    &= 1 + \gamma Q_{\pi^{*}}(S_2,A_1) \\
                    &= 1 + \gamma \frac{3 + 2\gamma}{1 - \gamma^2} \\
                    &= \frac{1 + 3\gamma + \gamma^2}{1 - \gamma^2}
\end{align}
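
The closed-form answers to Questions 1 and 3 can be cross-checked by solving the Bellman equations as a linear system. In this sketch the transition table `MDP` is inferred from the equations above (for example, $A_1$ in $S_1$ gives reward 2 and leads to $S_2$), and $\gamma = 0.5$ is an arbitrary test value; `evaluate_policy` is a hypothetical helper name.

```python
import numpy as np

# (state, action) -> (next_state, reward), inferred from the equations above
MDP = {
    ("S1", "A1"): ("S2", 2.0),
    ("S1", "A2"): ("S1", 1.0),
    ("S2", "A1"): ("S1", 3.0),
    ("S2", "A2"): ("S2", 1.0),
}

def evaluate_policy(policy, gamma):
    """Solve Q(s, a) = r + gamma * Q(s', policy[s']) as a linear system."""
    pairs = list(MDP)
    index = {pair: i for i, pair in enumerate(pairs)}
    A = np.eye(len(pairs))
    b = np.zeros(len(pairs))
    for pair, (s_next, r) in MDP.items():
        A[index[pair], index[(s_next, policy[s_next])]] -= gamma
        b[index[pair]] = r
    q = np.linalg.solve(A, b)
    return {pair: q[index[pair]] for pair in pairs}

gamma = 0.5                                        # illustrative value (assumption)
Q_pi1 = evaluate_policy({"S1": "A1", "S2": "A2"}, gamma)
Q_star = evaluate_policy({"S1": "A1", "S2": "A1"}, gamma)
print(Q_pi1)
print(Q_star)
```

Under $\pi^{*}$ the value of $A_1$ exceeds that of $A_2$ in both states, which is consistent with the answer to Question 2.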

### Exercise 3.4

### Exercise 3.5

1. A good example might be moving to a new neighbourhood which you don't know well. You must decide whether to exploit what you already know (e.g. go to the same restaurant you always go to) or explore new options (try a new restaurant, which, given your lack of knowledge, might be better or worse than your usual choice). Another example might be the Netflix recommendation algorithm which must balance showing you things you already like (exploitation) with showing you new things you might like (exploration).
2. In a stable environment that is well-understood and not subject to change, it is probably better to prioritize exploitation. An example of such an environment is a manufacturing assembly line with consistent demand for a specific product. In this case, it is best to exploit what you know works to maximize efficiency and output, rather than exploring new methods that may be less efficient or break the system.
3. In an unstable and constantly changing environment that is not well understood, it is better to prioritize exploration. An example of such an environment is the early stages of a startup in a rapidly evolving technology sector. In this case, it is important to explore new ideas and methods to adapt to the changing landscape and find the best path forward. Another example might be starting a new job: it's worth exploring different ways of working, different projects and different collaborators to find the best fit.
4. In many real-world scenarios _stochastic_ policies are optimal. One example is bluffing in poker. If you always bluff when you have a poor hand, your opponents will quickly learn this and exploit you by betting against you. If you never bluff, your opponents will always fold when you have a good hand and you will never win much money. However, if you bluff randomly (stochastically), your opponents will be unable to predict your behaviour and will be forced to play more cautiously. Rock-paper-scissors is similar. Another example is the behaviour of animals in the wild: if a predator always follows the same path it will be easy for its prey to avoid it. However, if it follows a stochastic path then it will be more likely to catch its prey. 
5. Softmax action selection is a popular alternative to $\epsilon$-greedy. In softmax action selection, the probability of selecting an action is proportional to the exponential of its Q-value (scaled by a temperature). This allows for a smooth transition between exploration and exploitation, with the probability of selecting the best action increasing as its Q-value pulls further ahead of the alternatives.

$$ P(A_t = a | S_t = s) = \frac{e^{Q(s, a) / \tau}}{\sum_{a'} e^{Q(s, a') / \tau}} $$

   where $\tau$ is a temperature parameter that controls the degree of exploration. When $\tau$ is high, the policy is close to uniform random action selection, and as $\tau$ approaches zero, the policy becomes deterministic and selects the action with the highest Q-value.
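
A minimal sketch of softmax action selection follows. The Q-values and temperatures are made-up illustrative numbers, and `softmax_policy` is a hypothetical helper name. Subtracting the maximum before exponentiating does not change the probabilities but avoids overflow at small $\tau$.

```python
import numpy as np

def softmax_policy(q_values, tau):
    """Action probabilities proportional to exp(Q / tau)."""
    z = np.asarray(q_values, dtype=float) / tau
    z -= z.max()                    # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

q = [1.0, 2.0, 1.5]                 # hypothetical Q-values for three actions
rng = np.random.default_rng(0)      # fixed seed for reproducibility
for tau in (100.0, 1.0, 0.01):
    probs = softmax_policy(q, tau)
    action = rng.choice(len(q), p=probs)   # sample an action from the policy
    print(tau, np.round(probs, 3), action)
```

At $\tau = 100$ the three probabilities are nearly equal, while at $\tau = 0.01$ almost all of the probability mass sits on the action with the highest Q-value.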

### Exercise 3.9

### Exercise 4.1

1. Some features you may build into your feature vector are: 
    1. Proximity to boundaries or obstacles (better decision making regarding the environment) 
    2. Orientation of the agent. 
    3. Velocity and acceleration of the agent (help in understanding how motion and dynamics of the agent affect future states)
    4. Previous states or memory (in complex worlds the Markov assumption may break down and it's useful to allow previous states to influence future actions) 
    5. Energy constraints of the agent (if the agent has limited energy, it may need to conserve energy and make decisions accordingly)

### Exercise 4.7
2. Crossing the road when the light is green: if one feature tells you when you're at the crossing and another feature tells you the state of the traffic light, you can use the two features to determine whether it is safe to cross the road, but they must both be on the same time scale. This is a non-linear interaction between the two features. 
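
To see why this interaction is non-linear, the sketch below fits a purely linear readout $\mathbf{s}\cdot\mathbf{w}$ to a "safe to cross" target that is 1 only when both features are active. The binary encoding of the two features and the added product feature are assumptions made for illustration, not the tutorial's own setup.

```python
import numpy as np

# Rows: (at_crossing, light_is_green); target is 1 only when both are true
S = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
safe = np.array([0.0, 0.0, 0.0, 1.0])

def fit_error(features, target):
    """Largest absolute error of the best least-squares linear readout."""
    w, *_ = np.linalg.lstsq(features, target, rcond=None)
    return np.max(np.abs(features @ w - target))

err_linear = fit_error(S, safe)
S_conj = np.column_stack([S, S[:, 0] * S[:, 1]])   # add the product (conjunction) feature
err_conj = fit_error(S_conj, safe)
print(err_linear, err_conj)
```

No weighting of the two raw features reproduces the target exactly, whereas adding the conjunction feature makes the fit exact.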