# Day 11 - $n$-step Bootstrapping

Monte Carlo (MC) methods base their updates on the entire remainder of an episode, however many steps that takes. The TD methods covered so far base their updates on a single time step. $n$-step bootstrapping methods generalize both and allow a smooth transition between the two.

## $n$-step TD Prediction

* Instead of bootstrapping from the value estimate of the next state, we can bootstrap from the value estimate $n$ steps ahead and use the intermediate rewards in the update
* The update target is thus the $n$-step return $G_{t:t+n}$:

$$
G_{t:t+n}\doteq R_{t+1}+\gamma R_{t+2}+\dots+\gamma^{n-1}R_{t+n}+\gamma^nV_{t+n-1}(S_{t+n})
$$
* As the value function will be updated at all steps along the way, the update for time step $t$ can use the most recent estimate $V_{t+n-1}$
* If $t+n\ge T$, all terms beyond termination are taken to be $0$, so the $n$-step return equals the ordinary full sample return $G_t$
* As the $n$-step return is only available after $n$ steps, the update rule looks like the following:

$$
V_{t+n}(S_t)\leftarrow V_{t+n-1}(S_t)+\alpha\left[G_{t:t+n}-V_{t+n-1}(S_t)\right]
$$
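The return and update rule above can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: the function names `n_step_return` and `n_step_td_update`, and the convention that `rewards[k]` holds $R_k$ and `states[k]` holds $S_k$, are assumptions made for this example.

```python
def n_step_return(rewards, states, V, t, n, gamma, T):
    """Compute G_{t:t+n}.

    Assumed layout: rewards[k] holds R_k (rewards[0] is unused),
    states[k] holds S_k, and T is the time step of termination.
    """
    last = min(t + n, T)
    # Discounted sum of the rewards R_{t+1}, ..., R_{min(t+n, T)}.
    G = sum(gamma ** (k - t - 1) * rewards[k] for k in range(t + 1, last + 1))
    # Bootstrap only if the episode has not terminated by time t + n.
    if t + n < T:
        G += gamma ** n * V[states[t + n]]
    return G


def n_step_td_update(V, rewards, states, t, n, gamma, alpha, T):
    """Move V(S_t) towards the n-step return by step size alpha (in place)."""
    s = states[t]
    V[s] += alpha * (n_step_return(rewards, states, V, t, n, gamma, T) - V[s])


# A three-step episode: S_0=0, S_1=1, S_2=2, then termination at T=3.
rewards = [0.0, 1.0, 0.0, 2.0]
states = [0, 1, 2]
V = [0.5, 0.2, -0.1]
gamma, T = 0.9, 3

g2 = n_step_return(rewards, states, V, t=0, n=2, gamma=gamma, T=T)  # bootstraps from V[2]
g5 = n_step_return(rewards, states, V, t=0, n=5, gamma=gamma, T=T)  # full sample return
```

With $n=2$ the return bootstraps from $V(S_2)$, while with $n=5\ge T$ it reduces to the discounted sum of all rewards in the episode.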
* In expectation, the worst-case error of the $n$-step return is at most $\gamma^n$ times the worst-case error of the value estimate $V_{t+n-1}$:

$$
\underset{s}{\operatorname{max}}\biggl|\mathbb E_\pi\left[G_{t:t+n}|S_t=s\right]-v_\pi(s)\biggr|\le\gamma^n\underset{s}{\operatorname{max}}\biggl|V_{t+n-1}(s)-v_\pi(s)\biggr|
$$
* This means that the $n$-step TD target is, in expectation, a better estimate of $v_\pi(s)$ than $V_{t+n-1}(s)$
* This is the $error\ reduction\ property$ of $n$-step returns
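The property can be checked numerically on a small example. The sketch below assumes a randomly generated Markov reward process (the transition matrix `P`, expected rewards `r`, and the helper `expected_n_step_return` are all made up for illustration). It uses the fact that the expected $n$-step return is obtained by applying the Bellman expectation backup $n$ times to the current estimate.

```python
import numpy as np


def expected_n_step_return(P, r, V, gamma, n):
    """E[G_{t:t+n} | S_t = s] for every state s of a Markov reward process.

    P[s, s'] is the transition probability and r[s] the expected
    immediate reward when leaving state s (both hypothetical here).
    """
    expected = V.copy()
    for _ in range(n):
        # One Bellman expectation backup.
        expected = r + gamma * P @ expected
    return expected


rng = np.random.default_rng(0)
num_states, gamma = 5, 0.9

# Random row-stochastic transition matrix and random expected rewards.
P = rng.random((num_states, num_states))
P /= P.sum(axis=1, keepdims=True)
r = rng.normal(size=num_states)

# True values solve v = r + gamma * P v.
v_pi = np.linalg.solve(np.eye(num_states) - gamma * P, r)

# An arbitrary (wrong) value estimate.
V = rng.normal(size=num_states)

errors = []
for n in range(1, 6):
    lhs = np.max(np.abs(expected_n_step_return(P, r, V, gamma, n) - v_pi))
    rhs = gamma ** n * np.max(np.abs(V - v_pi))
    errors.append((lhs, rhs))
    assert lhs <= rhs + 1e-12
```

Each pair in `errors` holds the left- and right-hand side of the inequality for one value of $n$.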

### $Exercise\ \mathcal{7.1}$

#### In Chapter 6 we noted that the Monte Carlo error can be written as the sum of TD errors (6.6) if the value estimates don’t change from step to step. Show that the $n$-step error used in (7.2) can also be written as a sum of TD errors (again if the value estimates don’t change) generalizing the earlier result.

$$
\begin{align}
G_{t:t+n}-V(S_t)&=R_{t+1}+\gamma V(S_{t+1})-V(S_t)+\gamma G_{t+1:t+n}-\gamma V(S_{t+1}) \\
&=\delta_t+\gamma(G_{t+1:t+n}-V(S_{t+1})) \\
&=\delta_t+\gamma\delta_{t+1}+\gamma^2(G_{t+2:t+n}-V(S_{t+2})) \\
&=\delta_t+\gamma\delta_{t+1}+\gamma^2\delta_{t+2}+\dots+\gamma^{n-1}\delta_{t+n-1} \\
&=\sum_{k=t}^{t+n-1}\gamma^{k-t}\delta_{k}
\end{align}
$$
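As a sanity check, the identity can be verified numerically for a fixed value function. The trajectory, rewards, and value estimates below are random placeholders, and the window $t,\dots,t+n$ is assumed to lie entirely before termination.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n, t = 0.9, 4, 2
num_states, horizon = 6, 10

states = rng.integers(num_states, size=horizon + 1)  # states[k] is S_k
rewards = rng.normal(size=horizon + 1)               # rewards[k] is R_k
V = rng.normal(size=num_states)                      # fixed value estimates

# Left-hand side: the n-step error G_{t:t+n} - V(S_t).
G = sum(gamma ** (k - t - 1) * rewards[k] for k in range(t + 1, t + n + 1))
G += gamma ** n * V[states[t + n]]
n_step_error = G - V[states[t]]


def td_error(k):
    """One-step TD error delta_k = R_{k+1} + gamma * V(S_{k+1}) - V(S_k)."""
    return rewards[k + 1] + gamma * V[states[k + 1]] - V[states[k]]


# Right-hand side: discounted sum of the TD errors delta_t, ..., delta_{t+n-1}.
td_error_sum = sum(gamma ** (k - t) * td_error(k) for k in range(t, t + n))

assert np.isclose(n_step_error, td_error_sum)
```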

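```python
import numpy as np


class NStateRandomWalk:
    """Symmetric random walk over n states; exits pay -1 (left) or +1 (right)."""

    def __init__(self, n):
        # An odd number of states guarantees a unique centre start state.
        assert n % 2 == 1
        self.n = n
        self.reset()
        # True state values under the random policy, evenly spaced in (-1, 1).
        self.values = np.array([
            (-(n + 1) + 2 * (i + 1)) / (n + 1)
            for i in range(self.n)
        ])

    def reset(self):
        self.state = self.n // 2
        return self.state

    def step(self):
        # Move one state left or right with equal probability.
        self.state += 1 if np.random.rand() < 0.5 else -1
        if self.state == -1:
            # Left exit. The returned state is the reset state, a placeholder
            # that callers should ignore once done is True.
            self.reset()
            return self.state, -1, True
        elif self.state == self.n:
            # Right exit.
            self.reset()
            return self.state, 1, True
        else:
            return self.state, 0, False
```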

```python
import numpy as np

from tqdm import tqdm
from math import inf


class NStepStorage:
    """Circular buffer holding the most recent n + 1 entries, indexed by time step."""

    def __init__(self, n, dtype=float):
        self.data = np.zeros(n + 1, dtype=dtype)
        self.n = n

    def __getitem__(self, key):
        return self.data[key % (self.n + 1)]

    def __setitem__(self, key, value):
        self.data[key % (self.n + 1)] = value


class NStepTD:
    """n-step TD prediction for the random walk (undiscounted, gamma = 1)."""

    def __init__(self, n, alpha, walk):
        self.n = n
        self.alpha = alpha
        self.walk = walk
        self.V = np.zeros(walk.n)

    def train(self, num_episodes=10, quiet=False):
        for _ in tqdm(range(num_episodes), disable=quiet):
            states = NStepStorage(self.n, dtype=int)
            rewards = NStepStorage(self.n)
            states[0] = self.walk.reset()
            t = 0
            T = inf
            while True:
                t += 1
                if t < T:
                    # Take one step; this stores S_t and R_t.
                    states[t], rewards[t], done = self.walk.step()
                    if done:
                        T = t
                # tau is the time step whose state value is updated now.
                tau = t - self.n
                if tau >= 0:
                    # Sum of the rewards R_{tau+1}, ..., R_{min(tau+n, T)}.
                    ret = sum(rewards[step] for step in range(tau + 1, min(T, t) + 1))
                    if t < T:
                        # Episode still running: bootstrap from V(S_{tau+n}).
                        ret += self.V[states[t]]
                    self.V[states[tau]] += self.alpha * (ret - self.V[states[tau]])
                if tau == T - 1:
                    break
```

```python
walk = NStateRandomWalk(19)

# Average the RMS error after 10 episodes over 10,000 independent runs.
rmss = []
for _ in tqdm(range(10_000)):
    agent = NStepTD(n=4, alpha=0.4, walk=walk)
    agent.train(10, quiet=True)
    rmss.append(np.sqrt(np.average((agent.V - walk.values)**2)))
print(np.average(rmss))
```

0.21283498531057776





### $Exercise\ \mathcal{7.2}\ (programming)$

#### With an n-step method, the value estimates do change from step to step, so an algorithm that used the sum of TD errors (see previous exercise) in place of the error in (7.2) would actually be a slightly different algorithm. Would it be a better algorithm or a worse one? Devise and program a small experiment to answer this question empirically.

As can be seen below, this method performs slightly worse than the true $n$-step TD algorithm. This is to be expected, as it essentially uses outdated information instead of the most recent value estimate. If no states were visited multiple times within $n$ steps, the value function for the relevant states would not be updated along the way, and the two methods would be equivalent.

```python
class NStepTDAlt:
    """Variant of n-step TD that updates with the sum of stored one-step TD errors."""

    def __init__(self, n, alpha, walk):
        self.n = n
        self.alpha = alpha
        self.walk = walk
        self.V = np.zeros(walk.n)

    def train(self, num_episodes=10, quiet=False):
        for _ in tqdm(range(num_episodes), disable=quiet):
            states = NStepStorage(self.n, dtype=int)
            states[0] = self.walk.reset()
            errors = NStepStorage(self.n)
            t = 0
            T = inf
            while True:
                t += 1
                if t < T:
                    states[t], reward, done = self.walk.step()
                    # Store delta_{t-1}, computed from the value estimates at this moment.
                    if done:
                        T = t
                        # The terminal state has value 0, so there is no bootstrap term.
                        errors[t-1] = reward - self.V[states[t-1]]
                    else:
                        errors[t-1] = reward + self.V[states[t]] - self.V[states[t-1]]
                tau = t - self.n
                if tau >= 0:
                    # Sum delta_tau, ..., delta_{min(tau+n, T)-1} (gamma = 1).
                    error = sum(errors[i] for i in range(tau, min(T, t)))
                    self.V[states[tau]] += self.alpha * error
                if tau == T - 1:
                    break
```

```python
walk = NStateRandomWalk(19)

# Compare both variants: RMS error after 10 episodes, averaged over 1,000 runs each.
rmss = []
for _ in tqdm(range(1_000)):
    agent = NStepTD(n=4, alpha=0.4, walk=walk)
    agent.train(10, quiet=True)
    rmss.append(np.sqrt(np.average((agent.V - walk.values)**2)))
print(f"True n-step TD RMS: {np.average(rmss)}")

rmss = []
for _ in tqdm(range(1_000)):
    agent = NStepTDAlt(n=4, alpha=0.4, walk=walk)
    agent.train(10, quiet=True)
    rmss.append(np.sqrt(np.average((agent.V - walk.values)**2)))
print(f"Alternate n-step TD RMS: {np.average(rmss)}")
```

True n-step TD RMS: 0.2177865100402878

Alternate n-step TD RMS: 0.24565859843320265





### $Exercise\ \mathcal{7.3}$

#### Why do you think a larger random walk task (19 states instead of 5) was used in the examples of this chapter? Would a smaller walk have shifted the advantage to a different value of $n$? How about the change in left-side outcome from $0$ to $-1$ made in the larger walk? Do you think that made any difference in the best value of n?

If $n\ge T$, then the values of all encountered states are updated towards $R_T$, turning it into a constant-$\alpha$ MC algorithm. The advantages of bootstrapping and learning from every step are lost, and the method becomes less effective. Thus, on the 5-state random walk, an even smaller $n$, probably $n=1$, would have been optimal for faster learning. I don't see how a change in reward from the left-side outcome would make an impact on the optimal choice of $n$.
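To get a feel for how $n$ compares to the episode length $T$ in the two tasks, the following sketch estimates the mean episode length by simulation. The function `mean_episode_length` and its parameters are hypothetical helpers written for this illustration. For a symmetric walk started in the centre, the gambler's-ruin result gives an expected length of $\left(\frac{N+1}{2}\right)^2$ steps for $N$ states, i.e. $9$ steps for 5 states and $100$ steps for 19 states.

```python
import numpy as np


def mean_episode_length(num_states, num_episodes=2000, seed=0):
    """Estimate the average number of steps until the walk leaves the chain."""
    rng = np.random.default_rng(seed)
    lengths = []
    for _ in range(num_episodes):
        state, steps = num_states // 2, 0  # start in the centre state
        while 0 <= state < num_states:
            state += 1 if rng.random() < 0.5 else -1
            steps += 1
        lengths.append(steps)
    return float(np.mean(lengths))


mean_5 = mean_episode_length(5)    # theoretical expectation: 9 steps
mean_19 = mean_episode_length(19)  # theoretical expectation: 100 steps
```

On the 5-state walk, even a moderate $n$ already covers most episodes in full, which is consistent with the argument above that a smaller $n$ would be favoured there.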

## $n$-step Sarsa

*