# Bias-Variance Tradeoff in Variations of SARSA


Here, we analyze the bias-variance tradeoff in SARSA [1], Expected SARSA [1], Double SARSA [2], and Double Expected SARSA [2].

## Theoretical

### SARSA vs. Expected SARSA

First, let's explore the theoretical bias-variance tradeoff as provided in [1]. Van Seijen et al. show that the update targets of Expected SARSA and SARSA have the same mean, and therefore the same bias, but that the Expected SARSA target has lower variance.

Let $X_t$ denote the update target, which is one of:

$$X_t \in \{ v_t, \hat{v}_t \}$$

where in Expected SARSA:

$$ v_t = r_t + \gamma \sum_a \pi_t (s_{t+1}, a) Q_t (s_{t+1}, a)$$

and in SARSA:

$$ \hat{v}_t = r_t + \gamma Q_t (s_{t+1}, a_{t+1})$$
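The two targets can be computed side by side for a single transition. The following is a minimal sketch, not code from [1]: the function names and the numeric values of `q_next`, `pi_next`, `r`, and `gamma` are assumptions chosen only for illustration.

```python
import numpy as np

def sarsa_target(r, gamma, q_next, a_next):
    """SARSA target: r + gamma * Q(s', a') for the single sampled action a'."""
    return r + gamma * q_next[a_next]

def expected_sarsa_target(r, gamma, q_next, pi_next):
    """Expected SARSA target: r + gamma * sum_a pi(s', a) * Q(s', a)."""
    return r + gamma * float(np.dot(pi_next, q_next))

# Hypothetical values for one transition (not taken from the paper).
q_next = np.array([1.0, 3.0])   # Q_t(s', .)
pi_next = np.array([0.5, 0.5])  # pi_t(s', .)
r, gamma = 1.0, 0.9

v_hat_0 = sarsa_target(r, gamma, q_next, a_next=0)    # if a' = 0 is sampled
v_hat_1 = sarsa_target(r, gamma, q_next, a_next=1)    # if a' = 1 is sampled
v = expected_sarsa_target(r, gamma, q_next, pi_next)  # no sampling of a'
```

The SARSA target depends on which next action happens to be sampled, while the Expected SARSA target is the policy-weighted average of those outcomes.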

Bias is represented by:

$$Bias(s,a) = Q^{\pi} (s,a) - E\{X_t\}$$

Variance is denoted by:

$$Var(s,a) = E\{(X_t)^2\} - (E\{X_t\})^2$$

From [1], the variance of the Expected SARSA target is:

$$ Var(s,a) = \sum_{s'} T^{s'}_{sa} \left( \gamma^2 \left( \sum_{a'} \pi_{s'a'}Q_t(s', a') \right)^2 + (R_{sa}^{s'})^2 + 2\gamma R_{sa}^{s'} \sum_{a'} \pi_{s'a'} Q_t(s',a') \right) - (E \{ v_t \})^2$$

Where does this come from? The first term is $E\{v_t^2\}$, and the quantity inside that expectation is simply:

$$ v_t^2 = \left(r_t + \gamma \sum_a \pi_t (s_{t+1}, a) Q_t (s_{t+1}, a)\right) \left(r_t + \gamma \sum_a \pi_t (s_{t+1}, a) Q_t (s_{t+1}, a)\right)$$

Multiplying out, we get:

$$ = r_t^2 + 2\gamma r_t \sum_a \pi_t (s_{t+1}, a) Q_t (s_{t+1}, a) + \gamma^2 \left( \sum_a \pi_t (s_{t+1}, a) Q_t (s_{t+1}, a) \right)^2$$

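The paper then changes notation, moving from the sampled quantities $r_t$ and $s_{t+1}$ to quantities of the model. Read here, $T^{s'}_{sa}$ is the probability of moving to state $s'$ after taking action $a$ in state $s$, $R^{s'}_{sa}$ is the reward for that transition (treated as fixed given the transition), and $\pi_{s'a'}$ is the probability that the policy selects $a'$ in $s'$. The change is needed because the variance is defined for a given state-action pair $(s,a)$: the only randomness left in $v_t$ is the next state, so $E\{v_t^2\}$ is obtained by weighting the squared target for each possible $s'$ by its transition probability $T^{s'}_{sa}$ and summing.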

The squared target is slightly different for SARSA:

$$ \hat{v}_t^2 = \left(r_t + \gamma Q_t (s_{t+1}, a_{t+1})\right) \left(r_t + \gamma Q_t (s_{t+1}, a_{t+1})\right)$$

$$ = r_t^2 + 2\gamma r_t Q_t (s_{t+1}, a_{t+1}) + \gamma^2 Q_t^2 (s_{t+1}, a_{t+1})$$

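Taking the expectation of this over both the next state $s'$ and the next action $a'$ (sampled from $\pi$), and subtracting the squared mean, yields:

$$ Var(s,a) = \sum_{s'} T^{s'}_{sa} \left( \gamma^2 \sum_{a'} \pi_{s'a'}(Q_t(s', a'))^2 + (R_{sa}^{s'})^2 + 2\gamma R_{sa}^{s'} \sum_{a'} \pi_{s'a'} Q_t(s',a') \right) - (E \{ \hat{v}_t \})^2$$

Here the policy weights $\pi_{s'a'}$ multiply the squared values $(Q_t(s',a'))^2$ instead of sitting inside the square, because SARSA uses a single sampled action $a_{t+1}$ and the expectation runs over that sample as well as over the state transitions.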
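Both closed-form variances can be evaluated directly for a small example. The sketch below is an illustration under stated assumptions, not an implementation from [1]: the function `target_moments`, the array layout (one row per next state $s'$), and the toy numbers are all hypothetical, and the reward is assumed to be fixed given the transition.

```python
import numpy as np

def target_moments(T, R, pi, Q, gamma):
    """Mean and variances of the SARSA / Expected SARSA targets for one (s, a).

    T[s']     : transition probability to s'
    R[s']     : reward for the transition to s' (assumed deterministic)
    pi[s',a'] : policy probabilities in s'
    Q[s',a']  : current action-value estimates in s'
    """
    pi_q = (pi * Q).sum(axis=1)        # sum_a' pi * Q, one value per s'
    pi_q2 = (pi * Q ** 2).sum(axis=1)  # sum_a' pi * Q^2, one value per s'
    mean = np.sum(T * (R + gamma * pi_q))
    cross = 2 * gamma * R * pi_q
    var_expected = np.sum(T * (R ** 2 + gamma ** 2 * pi_q ** 2 + cross)) - mean ** 2
    var_sarsa = np.sum(T * (R ** 2 + gamma ** 2 * pi_q2 + cross)) - mean ** 2
    return mean, var_sarsa, var_expected

# Hypothetical two-state, two-action example (gamma = 1 keeps the arithmetic simple).
T = np.array([0.5, 0.5])
R = np.array([0.0, 1.0])
pi = np.array([[0.5, 0.5], [1.0, 0.0]])
Q = np.array([[0.0, 2.0], [1.0, 5.0]])

mean, var_sarsa, var_expected = target_moments(T, R, pi, Q, gamma=1.0)
print(mean, var_sarsa, var_expected)
```

In this toy case the two targets share a mean, and the SARSA variance is the larger of the two because the policy in the first next state is stochastic.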

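Note that $E\{v_t\} = E\{ \hat{v}_t\}$, so both variance expressions subtract the same squared mean. In [1]'s notation:

$$ E\{r_t + \gamma Q_t (s_{t+1}, a_{t+1})\} = E\{r_t + \gamma \sum_a \pi_t (s_{t+1}, a) Q_t (s_{t+1}, a)\}$$

because taking the expectation over $a_{t+1} \sim \pi$ on the left-hand side gives the same expression on both sides:

$$ \sum_{s'} T^{s'}_{sa} \left( R_{sa}^{s'} + \gamma \sum_{a'} \pi_{s'a'} Q_t(s',a') \right) =  \sum_{s'} T^{s'}_{sa} \left( R_{sa}^{s'} + \gamma \sum_{a'} \pi_{s'a'} Q_t(s',a') \right)$$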
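To make the comparison explicit, here is a short worked step, written in the notation used here and not quoted from [1]. Since the two targets have the same mean, the gap between their variances is the gap between their second moments. The reward terms and the cross terms are identical, so only the $\gamma^2$ terms remain:

$$ Var_{\text{SARSA}}(s,a) - Var_{\text{Expected SARSA}}(s,a) = \gamma^2 \sum_{s'} T^{s'}_{sa} \left( \sum_{a'} \pi_{s'a'} (Q_t(s',a'))^2 - \left( \sum_{a'} \pi_{s'a'} Q_t(s',a') \right)^2 \right) \geq 0$$

The bracketed quantity is the variance of $Q_t(s', a')$ when $a'$ is drawn from $\pi$ in state $s'$, which is never negative. The gap is zero only when, for every reachable $s'$, the policy is deterministic or all actions it can select have the same $Q_t$ value.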




## Experimental

Now, we show that the theoretical bias-variance tradeoff results are reflected in experiments with the various methods.
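As a first sanity check before running full learning experiments, the two targets can be sampled for a single state-action pair and their empirical mean and variance compared. This is a minimal sketch and not one of the experiments from [1] or [2]: the random model (`T`, `R`, `pi`, `Q`), its sizes, the seed, and the sample count are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_next, n_actions, gamma = 4, 3, 0.9

# Hypothetical model for one fixed (s, a): all values are randomly generated.
T = rng.dirichlet(np.ones(n_next))                   # T[s']
R = rng.normal(size=n_next)                          # R[s'], fixed per transition
pi = rng.dirichlet(np.ones(n_actions), size=n_next)  # pi[s', a']
Q = rng.normal(size=(n_next, n_actions))             # Q_t[s', a']

n_samples = 100_000
s_next = rng.choice(n_next, size=n_samples, p=T)

# Sample a' ~ pi(s', .) for every draw via the inverse CDF of each row.
u = rng.random(n_samples)
cdf = np.cumsum(pi, axis=1)
a_next = (u[:, None] > cdf[s_next]).sum(axis=1)
a_next = np.minimum(a_next, n_actions - 1)  # guard against rounding in the CDF

# SARSA uses the sampled action; Expected SARSA averages over the policy.
sarsa_targets = R[s_next] + gamma * Q[s_next, a_next]
expected_targets = R[s_next] + gamma * (pi * Q).sum(axis=1)[s_next]

print("mean     SARSA / Expected SARSA:", sarsa_targets.mean(), expected_targets.mean())
print("variance SARSA / Expected SARSA:", sarsa_targets.var(), expected_targets.var())
```

With enough samples the two means should agree up to sampling noise, while the SARSA variance should exceed the Expected SARSA variance by roughly $\gamma^2$ times the transition-weighted variance of $Q_t$ under the policy.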

## Citations

[1] Van Seijen, Harm, et al. "A theoretical and empirical analysis of Expected Sarsa." Adaptive Dynamic Programming and Reinforcement Learning, 2009. ADPRL'09. IEEE Symposium on. IEEE, 2009. http://webdocs.cs.ualberta.ca/~vanseije/resources/papers/vanseijenadprl09.pdf

[2] Ganger, Michael, Ethan Duryea, and Wei Hu. "Double Sarsa and Double Expected Sarsa with Shallow and Deep Learning." Journal of Data Analysis and Information Processing 4.04 (2016): 159. http://file.scirp.org/pdf/JDAIP_2016101714072270.pdf

