# Day 11 - Temporal-Difference Learning

## Maximization Bias and Double Learning

* Imagine a set of actions whose true values are all zero, but whose estimates are noisy:
    - Some estimates will be positive by chance, so one action appears to be maximizing even though it is no better than the others
    - This is worse if the rewards are, for example, drawn from $\mathcal{N}(-0.1, 1)$
    - The largest estimate can then be positive, even though every true value is actually *worse* than zero
* Using the maximum of the estimates as an estimate of the maximum of the true values is therefore biased upward; this is called the $maximization\ bias$
* One way to overcome this is to learn two independent estimates, say $Q_1$ and $Q_2$, by updating only one of them on any given play, so that each is learned from a different set of samples
* The action is chosen according to one estimate, and evaluated according to the other, say $Q_2(\operatorname{arg}\underset{a}{\operatorname{max}}Q_1(a))$; because the noise in $Q_2$ is independent of the choice made with $Q_1$, this estimate is unbiased for the value of the chosen action
* A second unbiased estimate is obtained by reversing the roles, $Q_1(\operatorname{arg}\underset{a}{\operatorname{max}}Q_2(a))$
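* The update rule for double Q-learning looks like this when $Q_1$ is the estimate being updated (when $Q_2$ is updated, the indices $1$ and $2$ are swapped):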

$$
Q_1(S_t,A_t)\leftarrow Q_1(S_t,A_t)+\alpha\left[R_{t+1}+\gamma Q_2\left(S_{t+1},\operatorname{arg}\underset{a}{\operatorname{max}}Q_1(S_{t+1},a)\right)-Q_1(S_t,A_t)\right]
$$
* This is called $double\ learning$
* The behavior policy can use both estimates, for example by summing or averaging them
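
The effect described in the list above can be checked numerically. The following is a minimal sketch rather than anything from the text: it assumes ten actions whose rewards are all drawn from $\mathcal{N}(-0.1, 1)$, and the function names, the number of actions, and the number of plays are illustrative choices. It compares the maximum of the pooled estimates with the double estimate $Q_2(\operatorname{arg}\underset{a}{\operatorname{max}}Q_1(a))$.

```python
import random


def sample_mean(rng, mean, n):
    """Average of n noisy rewards drawn from N(mean, 1)."""
    return sum(rng.gauss(mean, 1.0) for _ in range(n)) / n


def compare_estimators(n_actions=10, plays_per_set=10, mean=-0.1,
                       n_trials=2000, seed=0):
    """Return the average single (max) estimate and the average double estimate.

    Every action has true value `mean`, so the true maximum is `mean` as well.
    All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    single_total = 0.0
    double_total = 0.0
    for _ in range(n_trials):
        # Two independent estimates per action, each from its own set of plays.
        q1 = [sample_mean(rng, mean, plays_per_set) for _ in range(n_actions)]
        q2 = [sample_mean(rng, mean, plays_per_set) for _ in range(n_actions)]

        # Single estimator: pool both sets, then take the max of the estimates.
        pooled = [(a + b) / 2 for a, b in zip(q1, q2)]
        single_total += max(pooled)

        # Double estimator: choose the action with Q1, evaluate it with Q2.
        best = max(range(n_actions), key=lambda a: q1[a])
        double_total += q2[best]
    return single_total / n_trials, double_total / n_trials


if __name__ == "__main__":
    single, double = compare_estimators()
    print(f"true max value:   {-0.1:+.3f}")
    print(f"single estimator: {single:+.3f}")
    print(f"double estimator: {double:+.3f}")
```

Under these assumptions the single estimator should come out clearly positive, while the double estimator should stay close to the true value of $-0.1$; the exact numbers depend on the seed.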

### $Exercise\ \mathcal{6.13}$

#### What are the update equations for Double Expected Sarsa with an $\varepsilon$-greedy target policy?

$$
Q_1(S_t,A_t)\leftarrow Q_1(S_t,A_t)+\alpha\left[R_{t+1}+\gamma \sum_a\pi_1(a|S_{t+1})Q_2\left(S_{t+1},a\right)-Q_1(S_t,A_t)\right],
$$
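where $\pi_i$ is the policy that is $\varepsilon$-greedy with respect to the value estimate $Q_i$. The update rule for $Q_2$ is the same, with the indices $1$ and $2$ swapped; on each step only one of the two estimates is updated, for example chosen by a coin flip. The action $A_t$ can be selected by some behavior policy $b$, for example one that is $\varepsilon$-greedy with respect to the sum or average of $Q_1$ and $Q_2$.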
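
As a rough illustration of this update, here is a minimal tabular sketch. It assumes the value estimates are stored as dictionaries mapping a state to a list of action values; the function names, the `done` flag for terminal transitions, and the tie-breaking rule (first maximizing action) are assumptions of the sketch, not part of the exercise.

```python
def epsilon_greedy_probs(action_values, epsilon):
    """Probabilities pi(a|s) of a policy that is epsilon-greedy w.r.t. action_values.

    Ties are broken in favor of the first maximizing action (an assumption).
    """
    n = len(action_values)
    greedy = max(range(n), key=lambda a: action_values[a])
    probs = [epsilon / n] * n
    probs[greedy] += 1.0 - epsilon
    return probs


def double_expected_sarsa_update(q_learn, q_eval, s, a, r, s_next, done,
                                 alpha=0.1, gamma=1.0, epsilon=0.1):
    """Update q_learn[s][a] in place; q_eval is only read.

    The target policy is epsilon-greedy w.r.t. q_learn, but the expectation
    is taken over the action values stored in q_eval.
    """
    if done:
        expected = 0.0  # no bootstrapping from a terminal state
    else:
        pi = epsilon_greedy_probs(q_learn[s_next], epsilon)
        expected = sum(p * q for p, q in zip(pi, q_eval[s_next]))
    q_learn[s][a] += alpha * (r + gamma * expected - q_learn[s][a])


if __name__ == "__main__":
    Q1 = {"s": [0.0, 0.0], "s2": [1.0, 0.0]}
    Q2 = {"s": [0.0, 0.0], "s2": [2.0, 4.0]}
    # Update Q1 using Q2 for evaluation; swap the first two arguments to update Q2.
    double_expected_sarsa_update(Q1, Q2, "s", 0, 1.0, "s2", False,
                                 alpha=0.5, gamma=0.5, epsilon=0.2)
    print(Q1["s"])
```

Passing `(Q1, Q2)` applies the update written above, and passing `(Q2, Q1)` applies the version with the indices swapped, so a caller could pick between the two with a coin flip on each step.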

## Games, Afterstates, and Other Special Cases

* In some environments, like games, the immediate effects of an action are known
* After making a move in chess, we know what the $afterstate$ will be
* Yet a conventional action-value function treats every state-action pair that leads to this same $afterstate$ separately, so each one has to learn its value on its own, even though those values must be equal
* Instead, an afterstate-value function can be learned, so that anything learned about an afterstate immediately transfers to all state-action pairs that lead to it
* There are many different kinds of special problems where small changes like these can improve performance, but the general principles, like GPI, still apply

### $Exercise\ \mathcal{6.14}$

#### Describe how the task of Jack’s Car Rental (Example 4.2) could be reformulated in terms of afterstates. Why, in terms of this specific task, would such a reformulation be likely to speed convergence?

Suppose there are 10 cars at the first location and 15 at the second, represented as the state (10, 15). We could then move three cars to the first location, leaving (13, 12) cars for the coming day. The value of this action is the sum of the (negative) cost of moving the cars and the value of the afterstate (13, 12), from which the day plays out. If we instead started in the state (13, 12) and moved no cars, the value of that action would be the sum of an immediate cost of zero and the value of that exact same afterstate. Instead of keeping a value estimate for each state-action pair, we could keep one only for each afterstate, and compute action values for action selection from these estimates. Since many state-action pairs lead to the same afterstate, this would, first, lower the memory requirement and, second, speed convergence: once an afterstate's value is updated, the action values of all state-action pairs that land in it are implicitly updated as well.
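
The reformulation can be sketched in a few lines. The constants follow Example 4.2 (at most 20 cars per location, at most 5 cars moved overnight, a cost of 2 dollars per car moved). Everything else is an assumption of this sketch: the function names, the sign convention for actions, the default value of 0 for afterstates not yet seen, and the choice to define an afterstate's value as the expected return from that point on, so that any discounting of later days is already contained in it.

```python
MAX_CARS = 20     # capacity per location in Example 4.2
MAX_MOVE = 5      # most cars that can be moved overnight in Example 4.2
MOVE_COST = 2.0   # dollars per car moved in Example 4.2


def afterstate(state, action):
    """Cars at each location after the overnight move.

    Sign convention (an assumption of this sketch): a positive action moves
    cars from the first location to the second, a negative one the reverse.
    """
    n1, n2 = state
    return (n1 - action, n2 + action)


def legal_actions(state):
    """Actions that keep both locations between 0 and MAX_CARS cars."""
    n1, n2 = state
    return [a for a in range(-MAX_MOVE, MAX_MOVE + 1)
            if 0 <= n1 - a <= MAX_CARS and 0 <= n2 + a <= MAX_CARS]


def action_value(state, action, afterstate_values):
    """Moving cost plus the learned value of the resulting afterstate."""
    w = afterstate_values.get(afterstate(state, action), 0.0)
    return -MOVE_COST * abs(action) + w


def greedy_action(state, afterstate_values):
    """Legal action with the highest value under the current estimates."""
    return max(legal_actions(state),
               key=lambda a: action_value(state, a, afterstate_values))


if __name__ == "__main__":
    values = {(13, 12): 100.0}  # hypothetical learned afterstate value
    print(action_value((10, 15), -3, values))  # move three cars to location 1
    print(action_value((13, 12), 0, values))   # move nothing
```

Both calls read the same entry `values[(13, 12)]`, so a single update to that entry changes the value of every state-action pair that lands in (13, 12), which is the sharing the answer above relies on.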