##### Exercise 7.1

Why do you think a larger random walk task (19 states instead of 5) was used in the examples of this chapter? Would a smaller walk have shifted the advantage to a different value of n? How about the change in left-side outcome from 0 to -1? Would that have made any difference in the best value of n?

A smaller random walk has shorter episodes, so for large n the n-step return usually runs past the end of the episode and is truncated to the complete return (i.e. large n just gives a constant-$\alpha$ MC method). The larger values of n would then all behave alike, so we should expect the advantage to shift to a lower n for smaller random walks.

With values initialized at 0, if the left side terminated with a reward of 0, episodes that terminate on the left would cause no updates at first; only episodes that terminate on the right end with a non-zero reward. The states left of center would then get correct values only from information propagated from the right end, which takes longer returns. Thus I would expect the best value of n to increase.
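The two claims above could be checked empirically. The following is a minimal sketch, not the book's experimental setup: it assumes an undiscounted walk that starts in the center state, on-line n-step TD updates, and values initialized at 0. The names `true_values` and `n_step_td_rmse` are hypothetical.

```python
import random

def true_values(num_states, left_reward=-1.0, right_reward=1.0):
    """True state values: linear between the two terminal outcomes."""
    span = right_reward - left_reward
    return [left_reward + span * i / (num_states + 1)
            for i in range(1, num_states + 1)]

def n_step_td_rmse(num_states, n, alpha, episodes=10, left_reward=-1.0, seed=0):
    """Average RMS error over the first `episodes` episodes of one run."""
    rng = random.Random(seed)
    V = [0.0] * (num_states + 2)          # indices 0 and num_states+1 are terminal
    target = true_values(num_states, left_reward)
    total = 0.0
    for _ in range(episodes):
        states = [(num_states + 1) // 2]  # start in the center state
        rewards = [0.0]                   # rewards[i] is the reward on entering states[i]
        T = float("inf")
        t = 0
        while True:
            if t < T:
                s_next = states[t] + rng.choice((-1, 1))
                if s_next == 0:
                    r, T = left_reward, t + 1
                elif s_next == num_states + 1:
                    r, T = 1.0, t + 1
                else:
                    r = 0.0
                states.append(s_next)
                rewards.append(r)
            tau = t - n + 1               # time step whose state is updated now
            if tau >= 0:
                G = sum(rewards[i] for i in range(tau + 1, min(tau + n, T) + 1))
                if tau + n < T:           # bootstrap only if the episode has not ended
                    G += V[states[tau + n]]
                V[states[tau]] += alpha * (G - V[states[tau]])
            if tau == T - 1:
                break
            t += 1
        mse = sum((V[i + 1] - target[i]) ** 2 for i in range(num_states)) / num_states
        total += mse ** 0.5
    return total / episodes
```

Comparing `n_step_td_rmse(5, n, alpha)` with `n_step_td_rmse(19, n, alpha)` over a grid of n and $\alpha$, averaged over many seeds, would show whether the best n moves down for the smaller walk. Passing `left_reward=0.0` would test the second claim.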

---------

##### Exercise 7.2

Why do you think on-line methods worked better than off-line methods on the example task?

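Off-line methods accumulate their increments during an episode and apply them only when the episode ends, whereas on-line methods change the value estimates after every step. (This is a different distinction from on-policy versus off-policy; the random walk is a prediction task with no action selection.) On the 19-state walk, episodes are long and the same states are visited many times. With on-line updating, later updates in an episode already bootstrap from estimates that were improved earlier in the same episode. With off-line updating, every increment in an episode is computed from the same stale estimates, and the increments for a repeatedly visited state are summed before being applied, which can overshoot unless $\alpha$ is small. Both effects would lead to higher RMSEs for off-line methods over the first few episodes. We also see that a larger n is better for off-line methods than for on-line ones, presumably because a longer return depends less on the stale bootstrapped estimates.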

-----------

##### Exercise 7.3

In the lower part of Figure  7.2, notice that the plot for n=3 is different from the others, dropping to low performance at a much lower value of $\alpha$ than similar methods. In fact, the same was observed for n=5, n=7, and n=9. Can you explain why this might have been so? In fact, we are not sure ourselves.

My hypothesis is that odd values of n have higher RMSE because of a parity effect in the environment. Every step moves exactly one state left or right, so from the center start state (state 10 of 19) both terminals are 10 steps away and every episode takes an even number of steps. An odd n always bootstraps from a state whose parity is opposite to that of the state being updated, while an even n bootstraps from a state of the same parity. With off-line updating, the increments for a state are summed over the whole episode before being applied, so any systematic error tied to this parity would accumulate, and odd n-step methods might become unstable at a smaller $\alpha$ than even n-step methods. A quick way to test this would be to create a random walk in which termination always takes an odd number of steps (for example, by starting the same 19-state walk one state off center) and then to compare the resulting plots with those in Figure 7.2.

----------

##### Exercise 7.4

The parameter $\lambda $ characterizes how fast the exponential weighting in Figure  7.4 falls off, and thus how far into the future the $\lambda $-return algorithm looks in determining its backup. But a rate factor such as $\lambda $ is sometimes an awkward way of characterizing the speed of the decay. For some purposes it is better to specify a time constant, or half-life. What is the equation relating $\lambda $ and the half-life, $\tau$, the time by which the weighting sequence will have fallen to half of its initial value?

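The weights in the $\lambda$-return are $(1-\lambda)\lambda^{n-1}$, so $\tau$ steps after the first one the weight has been scaled by $\lambda^{\tau}$. The half-life is reached when this factor equals one half:

$\lambda^{\tau} = \frac{1}{2}$,

which gives

$\tau = -\frac{\ln 2}{\ln \lambda}$, or equivalently $\lambda = 2^{-1/\tau}$.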
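
As a quick numerical check of this relation, here is a small sketch. The function names `half_life` and `lam_from_half_life` are illustrative only, and $\tau$ is treated as a real number of steps rather than rounded to an integer.

```python
import math

def half_life(lam: float) -> float:
    """Number of steps tau after which lam**tau equals 0.5."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    return -math.log(2.0) / math.log(lam)

def lam_from_half_life(tau: float) -> float:
    """Inverse relation: the lam whose weighting halves every tau steps."""
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    return 2.0 ** (-1.0 / tau)
```

For instance, `0.9 ** half_life(0.9)` should come out very close to 0.5, and `lam_from_half_life(half_life(0.9))` should recover 0.9 up to rounding.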


-----
Getting (7.3) from the equation above it:

$R_t^\lambda = (1 - \lambda) \sum_{n=1}^\infty \lambda^{n-1} R^{(n)}_t$.

For $n \geq T-t$ the n-step return runs to the end of the episode, so $R^{(n)}_t$ is just the total return $R_t$. Splitting the sum at that point:

$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R^{(n)}_t + (1 - \lambda) R_t \sum_{n=T-t}^{\infty} \lambda^{n-1} $

We can factor $\lambda^{T-t-1}$ out of the last sum to get $ (1 - \lambda) R_t \lambda^{T-t-1} \sum_{k=0}^\infty \lambda^k = (1 - \lambda) R_t \lambda^{T-t-1} \frac{1}{1 - \lambda}$, so that: 

$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R^{(n)}_t + \lambda^{T-t-1} R_t  $
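
The truncated form (7.3) can be sanity-checked against a long partial sum of the infinite series. This is a sketch with hypothetical function names; `n_step_returns` is assumed to hold $R^{(1)}_t, \dots, R^{(T-t-1)}_t$ and `full_return` is $R_t$.

```python
def lambda_return_truncated(n_step_returns, full_return, lam):
    """Equation (7.3): finite sum plus the weight of the complete return."""
    head = (1 - lam) * sum(lam ** (n - 1) * g
                           for n, g in enumerate(n_step_returns, start=1))
    return head + lam ** len(n_step_returns) * full_return

def lambda_return_long_sum(n_step_returns, full_return, lam, terms=10_000):
    """Partial sum of the infinite series; every return past the episode end is R_t."""
    total = 0.0
    for n in range(1, terms + 1):
        g = n_step_returns[n - 1] if n <= len(n_step_returns) else full_return
        total += lam ** (n - 1) * g
    return (1 - lam) * total
```

The two functions should agree to within rounding for any $0 \le \lambda < 1$, since the neglected tail of the partial sum is of order $\lambda^{10000}$.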

----------

##### Exercise 7.5

In order to get TD($\lambda$) to be equivalent to the $\lambda$-return algorithm in the online case, the proposal is that $\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_{t-1}(s_t) $ and the n-step return is $R_t^{(n)} = r_{t+1} + \dots + \gamma^{n-1} r_{t+n} + \gamma^n V_{t+n-1}(s_{t+n}) $. To show that this new TD method is equivalent to the $\lambda$ return, it suffices to show that $\Delta V_t(s_t)$ is equivalent to the new TD with modified $\delta_t$ and $R_t^{(n)}$.

As such, we expand the $\lambda$ return:

$$
\begin{aligned}
\frac{1}{\alpha} \Delta V_t(s_t) =&  -V_{t-1}(s_t) + R_t^\lambda\\
=& -V_{t-1}(s_t) + (1 - \lambda) \lambda^0 [r_{t+1} + \gamma V_t(s_{t+1})] + (1-\lambda) \lambda^1 [r_{t+1} + \gamma r_{t+2} + \gamma^2 V_{t+1}(s_{t+2})] + \dots\\
=& -V_{t-1}(s_t) + (\gamma \lambda)^0 [r_{t+1} + \gamma V_t(s_{t+1}) - \gamma \lambda V_t(s_{t+1})] + (\gamma \lambda)^1 [r_{t+2} + \gamma V_{t+1}(s_{t+2}) - \gamma \lambda V_{t+1}(s_{t+2})] + \dots\\
=& (\gamma \lambda)^0 [r_{t+1} + \gamma V_t(s_{t+1}) - V_{t-1}(s_t)] + (\gamma \lambda)^1 [r_{t+2} + \gamma V_{t+1}(s_{t+2}) - V_t(s_{t+1})] + \dots\\
=& \sum_{k=t}^\infty (\gamma \lambda)^{k-t} \delta_k
\end{aligned}
$$

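where $\delta_k = r_{k+1} + \gamma V_k(s_{k+1}) - V_{k-1}(s_k)$ as defined in the problem. The last line is the discounted sum of TD errors that TD($\lambda$) applies to $s_t$, so with the modified $\delta_t$ and $R_t^{(n)}$ the on-line TD($\lambda$) update is exactly equivalent to the $\lambda$-return update.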
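
The identity can also be checked numerically on a random finite episode. The sketch below makes several assumptions for illustration: $t = 0$, the episode has `T` steps, the terminal state has value 0, and the time-indexed estimates are drawn at random because only the algebra is being tested. The name `check_identity` is hypothetical.

```python
import random

def check_identity(gamma, lam, T, seed=0):
    """Return (R^lambda - V_{t-1}(s_t), sum of (gamma*lam)^k * delta_k) for t = 0."""
    rng = random.Random(seed)
    r = [0.0] + [rng.uniform(-1.0, 1.0) for _ in range(T)]      # r[1..T]
    base = rng.uniform(-1.0, 1.0)                               # V_{t-1}(s_t)
    # b[k] stands for V_k(s_{k+1}); the last entry is the terminal state, value 0.
    b = [rng.uniform(-1.0, 1.0) for _ in range(T - 1)] + [0.0]

    def n_step(n):
        # Modified n-step return: bootstraps from V_{t+n-1}(s_{t+n}).
        rewards = sum(gamma ** (i - 1) * r[i] for i in range(1, n + 1))
        return rewards + gamma ** n * b[n - 1]

    lam_return = ((1 - lam) * sum(lam ** (n - 1) * n_step(n) for n in range(1, T))
                  + lam ** (T - 1) * n_step(T))
    forward = lam_return - base

    backward = 0.0
    for k in range(T):
        prev = base if k == 0 else b[k - 1]                     # V_{k-1}(s_k)
        delta = r[k + 1] + gamma * b[k] - prev
        backward += (gamma * lam) ** k * delta
    return forward, backward
```

For any choice of `gamma`, `lam` and `T`, the two returned numbers should match up to floating-point rounding.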


-------------

##### Exercise 7.6

  In Example 7.5, suppose from state s the wrong action is taken twice before the right action is taken. If accumulating traces are used, then how big must the trace parameter $\lambda $ be in order for the wrong action to end up with a larger eligibility trace than the right action?
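
One possible way to approach this, offered as a sketch rather than a definitive answer: assume the wrong action leaves the agent in $s$ (as in Example 7.5), all traces decay by $\gamma\lambda$ on every step, and the traces are compared right after the right action is taken. The wrong action's trace is then $\gamma\lambda(1+\gamma\lambda)$ and the right action's trace is 1, so the wrong action has the larger trace when $(\gamma\lambda)^2 + \gamma\lambda - 1 > 0$, i.e. $\gamma\lambda > (\sqrt{5}-1)/2 \approx 0.618$. The helper `traces_after` below is a hypothetical name used to replay that sequence.

```python
import math

def traces_after(actions, gamma, lam):
    """Accumulating traces for two actions in one state after a sequence of choices."""
    e = {"wrong": 0.0, "right": 0.0}
    for a in actions:
        for k in e:
            e[k] *= gamma * lam   # decay every trace
        e[a] += 1.0               # accumulate on the action just taken
    return e

# Positive root of x**2 + x - 1 = 0, where x = gamma * lam.
threshold = (math.sqrt(5.0) - 1.0) / 2.0
```

With `gamma = 1`, calling `traces_after(["wrong", "wrong", "right"], 1.0, lam)` for values of `lam` just above and just below `threshold` shows the ordering of the two traces flipping.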

-----------

##### Exercise 7.7


-----------

##### Exercise 7.8

sarsa($\lambda$) with replacing traces

-------

##### Exercise 7.9

Write pseudocode for an implementation of TD($\lambda $) that updates only value estimates for states whose traces are greater than some small positive constant.
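
A minimal sketch of one way this could look, written as runnable Python rather than pseudocode. It assumes tabular on-line TD($\lambda$) with accumulating traces, episodes given as lists of `(state, reward, next_state)` with `next_state` set to `None` at termination, and a terminal value of 0. The name `td_lambda_sparse` and the data layout are illustrative choices, not taken from the book.

```python
def td_lambda_sparse(episodes, alpha, gamma, lam, threshold=1e-4):
    """TD(lambda) that keeps and updates only states whose trace exceeds `threshold`."""
    V = {}                                   # value estimates, default 0
    for episode in episodes:
        traces = {}                          # only traces above the threshold are stored
        for state, reward, next_state in episode:
            v_next = 0.0 if next_state is None else V.get(next_state, 0.0)
            delta = reward + gamma * v_next - V.get(state, 0.0)
            traces[state] = traces.get(state, 0.0) + 1.0   # accumulating trace
            for s in list(traces):           # copy keys so entries can be deleted
                V[s] = V.get(s, 0.0) + alpha * delta * traces[s]
                traces[s] *= gamma * lam
                if traces[s] <= threshold:   # drop negligible traces
                    del traces[s]
    return V
```

Because the trace dictionary only holds states visited recently enough, the inner loop costs time proportional to the number of significant traces rather than to the size of the state space. A larger `threshold` trades accuracy for speed.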
  
  


-------

##### Exercise 7.10

Prove that the forward and backward views of off-line TD($\lambda $) remain equivalent under their new definitions with variable $\lambda $ given in this section. Follow the example of the proof in Section 7.4.


**"Eligibility traces are the first line of defense against both long-delayed rewards and non-Markov tasks."**

"In the future it may be possible to vary the trade-off between TD and Monte Carlo methods more finely by using variable $\lambda $, but at present it is not clear how this can be done reliably and usefully."