# Day 21 - Planning and Learning with Tabular Methods

## Expected vs. Sample Updates

* We have focused on different kinds of value updates, and, limiting our analysis to the one-step case for now, these vary along three binary dimensions:
    1. State- vs. action-value
    2. Optimal vs. arbitrary policy
    3. Expected vs. sample
* Seven of the $2^3=8$ possible combinations correspond to specific algorithms (the eighth, State + Optimal + Sample, does not seem to correspond to any useful update):
    1. State + Optimal + Expected = Value Iteration
    2. State + Arbitrary + Expected = Policy Evaluation
    3. State + Arbitrary + Sample = TD(0)
    4. Action + Optimal + Expected = Q-Value Iteration
    5. Action + Optimal + Sample = Q-Learning
    6. Action + Arbitrary + Expected = Q-Policy Evaluation
    7. Action + Arbitrary + Sample = Sarsa
* Any of these can be used as update rules for planning
* When comparing sample updates to expected updates, the question remains whether expected updates are actually to be preferred
* While expected updates are exact, not suffering from a sampling error, they are more expensive to compute
* In practice, the cost of an update is usually dominated by the number of state-action pairs at which $Q$ is evaluated
* If there is enough time for computing an expected update, this more exact estimate is usually preferred
* If not, which is usually the case in practice, a sample update is preferable: it at least makes some progress, whereas an expected update might not even finish within the available time
* For large branching factors $b$ (the number of possible next states for a given state-action pair), sample updates reduce the error far more compute-efficiently, as most of the error reduction comes from the first few samples
* An expected update requires roughly $b$ times as much computation as a sample update, which is highly excessive if $b=10{,}000$, for example
* In this case, even $100$ sample updates usually reduce the error significantly
* The analysis, from Sutton & Barto, likely even underestimates the advantage of sample updates: in a real problem, the successor values are themselves estimates that get updated, and because sample updates make these estimates more accurate sooner, the values backed up from successor states become more accurate as well
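
The trade-off described above can be made concrete with a minimal sketch. It is not taken from Sutton & Barto; the function names, the uniform transition probabilities, the zero rewards, and the normally distributed successor values are all illustrative assumptions. `next_values` stands for whatever value is backed up from each successor (for example $\max_a Q(s', a)$ in the optimal case). The expected update touches all $b$ successors once, while the sample updates use a sample-average step size $\alpha = 1/t$ and touch only one successor each.

```python
import random


def expected_update(probs, rewards, next_values, gamma=0.9):
    """Expected update: exact expectation over all b successors (cost ~ b)."""
    return sum(p * (r + gamma * v) for p, r, v in zip(probs, rewards, next_values))


def sample_update(q, reward, next_value, alpha, gamma=0.9):
    """Sample update: move q toward a single sampled target (cost ~ 1)."""
    return q + alpha * (reward + gamma * next_value - q)


def run_sample_updates(probs, rewards, next_values, n_updates, rng, gamma=0.9):
    """Apply n_updates sample updates with step size 1/t (sample averaging)."""
    q = 0.0  # initial estimate (assumed)
    draws = rng.choices(range(len(probs)), weights=probs, k=n_updates)
    for t, i in enumerate(draws, start=1):
        q = sample_update(q, rewards[i], next_values[i], alpha=1.0 / t, gamma=gamma)
    return q


# Hypothetical problem: b equally likely successors, zero reward,
# successor values drawn around 1 so the initial estimate 0 is clearly wrong.
rng = random.Random(0)
b = 10_000
probs = [1.0 / b] * b
rewards = [0.0] * b
next_values = [rng.gauss(1.0, 1.0) for _ in range(b)]

exact = expected_update(probs, rewards, next_values)
approx = run_sample_updates(probs, rewards, next_values, n_updates=100, rng=rng)
print(f"expected update ({b} terms): {exact:.3f}")
print(f"after 100 sample updates:   {approx:.3f}")
```

Under these assumptions, 100 sample updates cost about 1% of one expected update, yet they typically remove most of the initial error.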

### $Exercise\ \mathcal{8.6}$

#### Exercise 8.6 The analysis above assumed that all of the $b$ possible next states were equally likely to occur. Suppose instead that the distribution was highly skewed, that some of the $b$ states were much more likely to occur than most. Would this strengthen or weaken the case for sample updates over expected updates? Support your answer.

If the transition probabilities are highly skewed, the value of a state will be skewed towards the values of the most likely transitions. Unless the rare transitions differ so extremely in value that they compensate for this effect, sample updates will be even more efficient, as they focus computation on the most probable successor states. Expected updates spend equal amounts of computation on likely and highly unlikely transitions alike, wasting a lot of resources.
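
A rough way to check this argument numerically is sketched below. Everything in it is an assumption made for illustration: the geometric skew, the i.i.d. standard normal successor values (so rare transitions are not systematically extreme in value), zero rewards, and no discounting. It compares the RMS error of a sample-average estimate after the same small number of samples under a uniform and a skewed distribution, while the cost of the expected update stays at $b$ terms in both cases.

```python
import math
import random


def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def rms_sample_error(probs, n_samples, n_trials, rng):
    """RMS error of a sample-average estimate after n_samples draws.

    Each trial draws fresh successor values (assumed i.i.d. standard normal),
    computes the exact expectation, and compares it with the average of
    n_samples successors sampled according to probs.
    """
    b = len(probs)
    total_sq_err = 0.0
    for _ in range(n_trials):
        values = [rng.gauss(0.0, 1.0) for _ in range(b)]
        exact = sum(p * v for p, v in zip(probs, values))  # expected update
        draws = rng.choices(values, weights=probs, k=n_samples)
        estimate = sum(draws) / n_samples  # sample updates with alpha = 1/t
        total_sq_err += (estimate - exact) ** 2
    return math.sqrt(total_sq_err / n_trials)


b = 100
uniform = [1.0 / b] * b
skewed = normalize([0.1 ** i for i in range(b)])  # hypothetical geometric skew

rng = random.Random(0)
err_uniform = rms_sample_error(uniform, n_samples=10, n_trials=1000, rng=rng)
err_skewed = rms_sample_error(skewed, n_samples=10, n_trials=1000, rng=rng)
print(f"RMS error after 10 samples, uniform: {err_uniform:.3f}, skewed: {err_skewed:.3f}")
```

In this setup the skewed case should show the smaller error for the same sampling budget, which is consistent with the answer above. If the rare successors had much larger values than the common ones, the comparison could come out differently.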