# Day 23 - Planning and Learning with Tabular Methods

## Trajectory Sampling

* We look at the different ways of distributing updates
* The classical approach is to sweep through the entire state space
* The second approach is to sample according to some distribution
* Sampling individual trajectories according to, for example, the on-policy distribution, is called $trajectory\ sampling$
* This can be beneficial, as it can ignore the uninteresting parts of the state space
* However, it may also ignore important parts of the state space that the current policy never visits
* It turns out that, on the large problems we care about, on-policy sampling generally seems to work better
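
To make the contrast concrete, here is a minimal sketch, assuming a hypothetical deterministic corridor model. The names `model`, `sweep`, `trajectory_sample`, and all constants are illustrative and not from the source. `sweep` distributes expected updates uniformly over every state-action pair, while `trajectory_sample` only updates the pairs encountered on one simulated $\varepsilon$-greedy trajectory.

```python
import random

N_STATES = 6          # hypothetical corridor; state N_STATES - 1 is terminal
ACTIONS = (0, 1)      # 0 = left, 1 = right
GAMMA = 0.9

def model(state, action):
    """Deterministic toy model: returns (next_state, reward, done)."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def new_q():
    """Fresh table of action values, all zero."""
    return [[0.0, 0.0] for _ in range(N_STATES)]

def expected_update(q, state, action):
    """One-step expected update of q[state][action] using the model."""
    nxt, reward, done = model(state, action)
    q[state][action] = reward if done else reward + GAMMA * max(q[nxt])

def sweep(q):
    """Classical approach: update every non-terminal state-action pair once."""
    for state in range(N_STATES - 1):
        for action in ACTIONS:
            expected_update(q, state, action)

def trajectory_sample(q, rng, start=0, epsilon=0.1, max_steps=50):
    """Update only the pairs visited on one simulated epsilon-greedy trajectory."""
    state = start
    for _ in range(max_steps):
        if rng.random() < epsilon:
            action = rng.choice(ACTIONS)
        else:
            best = max(q[state])
            action = rng.choice([a for a in ACTIONS if q[state][a] == best])
        expected_update(q, state, action)
        state, _, done = model(state, action)
        if done:
            break
```

In a state space this small both schemes reach the same values quickly. The difference only matters when most states are never visited under the policy, which this toy problem does not attempt to show.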

## Real-Time Dynamic Programming

* RTDP is an example of $asynchronous\ DP$, where the values to be updated are chosen by the current policy
* As a form of DP, it is possible to apply some of the theoretical results of DP to RTDP
* RTDP can skip regions of the state space that are irrelevant to optimal policies
* All that is required is an $optimal\ partial\ policy$, which is a policy that is optimal in all states that a full optimal policy would visit
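
A minimal sketch of the idea, assuming a hypothetical one-way track in which the agent advances one or two cells per step, each move succeeds with probability 0.8, and every step costs $-1$. The environment, the constants, and the function names are assumptions made for illustration. Expected (value-iteration style) updates are applied only to states visited on greedy trajectories from the start state, so states that cannot be reached from the start are never touched.

```python
import random

N_STATES = 17        # hypothetical one-way track; the last state is the goal
GOAL = N_STATES - 1
START = 10           # states 0..9 can never be reached from here
ACTIONS = (1, 2)     # advance one or two cells
P_SUCCESS = 0.8      # otherwise the agent stays where it is

def outcomes(state, action):
    """Model: list of (probability, next_state) pairs for a state-action pair."""
    return [(P_SUCCESS, min(GOAL, state + action)), (1.0 - P_SUCCESS, state)]

def q_value(values, state, action):
    """Expected return of an action: step cost of -1 plus expected next value."""
    return -1.0 + sum(p * values[nxt] for p, nxt in outcomes(state, action))

def rtdp(n_episodes, rng):
    values = [0.0] * N_STATES      # optimistic start: true values are <= 0
    n_updates = [0] * N_STATES     # how often each state was updated
    for _ in range(n_episodes):
        state = START
        while state != GOAL:
            # Expected update, applied only to the state currently visited.
            q = {a: q_value(values, state, a) for a in ACTIONS}
            values[state] = max(q.values())
            n_updates[state] += 1
            # Act greedily, then sample the outcome from the model.
            action = max(q, key=q.get)
            if rng.random() < P_SUCCESS:
                state = min(GOAL, state + action)
    return values, n_updates
```

With enough episodes the value of `START` approaches $-3.75$ (three two-cell moves, each taking $1/0.8$ steps in expectation), while `n_updates` stays at zero for every state before `START`.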

## Planning at Decision Time

* Aside from $background\ planning$, the approach used so far, where simulated experience is used to update arbitrary value estimates, there is also $decision$-$time\ planning$
* Planning at decision time means simulating trajectories from $S_t$, to figure out which action would yield the highest value
* When time for deliberation is available, this is a strong method
* When fast reactions are important, background planning is best to produce a policy that can be immediately applied at all times

## Heuristic Search

* $Heuristic\ search$ builds a tree, starting from the state to be considered, and then backs up computation of values from the value estimates of the leaf nodes
* If our estimates are imperfect, but we have a perfect model of the environment, then deeper search leads to better policies
* However, this of course requires more and more computation
* The success of heuristic search methods is likely due to their tight focus on relevant states
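
The effect of search depth can be sketched with a depth-limited lookahead. This is a minimal illustration under stated assumptions: a hypothetical deterministic corridor with a small reward (0.1) for exiting on the left and a large reward (1.0) for exiting on the right, and a deliberately uninformative leaf estimate that returns 0 everywhere. All names are illustrative.

```python
GAMMA = 0.9
ACTIONS = (-1, +1)   # move left or right along a hypothetical corridor

def model(state, action):
    """Perfect deterministic model: returns (next_state, reward, done).
    State 0 is terminal with a small reward, state 5 with a large one."""
    nxt = state + action
    if nxt == 0:
        return nxt, 0.1, True
    if nxt == 5:
        return nxt, 1.0, True
    return nxt, 0.0, False

def leaf_estimate(state):
    """Deliberately poor value estimate: it knows nothing about the rewards."""
    return 0.0

def lookahead(state, depth):
    """Back up the best value reachable within `depth` steps of `state`."""
    if depth == 0:
        return leaf_estimate(state)
    return max(action_value(state, a, depth) for a in ACTIONS)

def action_value(state, action, depth):
    """Value of taking `action`, searching `depth - 1` further steps."""
    nxt, reward, done = model(state, action)
    return reward if done else reward + GAMMA * lookahead(nxt, depth - 1)

def heuristic_search(state, depth):
    """Pick the action with the highest backed-up value."""
    return max(ACTIONS, key=lambda a: action_value(state, a, depth))
```

From state 1, a depth-1 search only sees the nearby small reward and moves left, while a depth-4 search reaches the large reward and moves right. The tree grows exponentially with depth, which is the computational cost mentioned above.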

## Rollout Algorithms

* A rollout is a simulated trajectory that starts from the current state, takes a specific first action, and then follows the rollout policy
* The values of many sampled trajectories for each action are averaged, and the highest-value action is selected to be executed
* Then, more rollouts are performed from the next state
* The values computed are the values for the so-called $rollout\ policy$
* As the Monte Carlo samples are independently sampled, this can be done in parallel
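
A minimal sketch of rollout-based action selection, assuming a hypothetical undiscounted corridor where only the right end pays a reward, and a uniformly random rollout policy. The model, the policy, and the function names are assumptions for illustration.

```python
import random

ACTIONS = (-1, +1)
LEFT_END, RIGHT_END = 0, 6   # hypothetical corridor; only RIGHT_END pays a reward

def model(state, action):
    """Deterministic toy model: returns (next_state, reward, done)."""
    nxt = state + action
    return nxt, (1.0 if nxt == RIGHT_END else 0.0), nxt in (LEFT_END, RIGHT_END)

def rollout_policy(state, rng):
    """Rollout policy assumed here: choose uniformly at random."""
    return rng.choice(ACTIONS)

def rollout_return(state, first_action, rng):
    """Simulate one trajectory: take `first_action`, then follow the rollout policy."""
    state, reward, done = model(state, first_action)
    total = reward
    while not done:
        state, reward, done = model(state, rollout_policy(state, rng))
        total += reward
    return total

def rollout_action(state, n_rollouts, rng):
    """Average the sampled returns per action and pick the best action."""
    estimates = {
        a: sum(rollout_return(state, a, rng) for _ in range(n_rollouts)) / n_rollouts
        for a in ACTIONS
    }
    return max(estimates, key=estimates.get), estimates
```

Each call to `rollout_return` is independent of the others, so the calls could be distributed across workers. The sketch keeps them sequential for simplicity.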

## Monte Carlo Tree Search

* MCTS is an algorithm that is run at each new state that is encountered
* It builds a tree by iterating a four-step process until it needs to make a decision
    1. Selection: Starting at the root, follow the tree policy through the most promising child nodes until a leaf node is reached
    2. Expansion: New child nodes are added to the selected leaf node
    3. Simulation: A simulated trajectory is run to a terminal state, from the expanded node, using the rollout policy
    4. Backup: The result of the Monte Carlo sample is propagated back through the nodes in the tree, updating their values
* Once it has to make a decision, it chooses an action based on the statistics accumulated
* This might either be the action with the largest value, or the one that was visited most often, to avoid outliers
* AlphaGo combines MCTS with value estimates learned by a deep neural network
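
The four steps can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a deterministic model (so each action leads to exactly one child node), uses UCB1 as the tree policy and a uniformly random rollout policy, and picks the most-visited root action at the end. `Node`, `ucb_score`, `mcts`, and `corridor_model` are hypothetical names.

```python
import math
import random

class Node:
    def __init__(self, state, parent=None, reward=0.0, terminal=False):
        self.state = state
        self.parent = parent
        self.reward = reward        # reward received when entering this node
        self.terminal = terminal
        self.children = {}          # action -> child Node
        self.visits = 0
        self.total_return = 0.0     # sum of sampled returns from this state onward

def ucb_score(child, c, gamma):
    """UCB1: estimated action value plus an exploration bonus."""
    value = child.reward + gamma * child.total_return / child.visits
    return value + c * math.sqrt(math.log(child.parent.visits) / child.visits)

def mcts(root_state, model, actions, n_iterations, rng, c=1.4, gamma=1.0):
    root = Node(root_state)
    for _ in range(n_iterations):
        node = root
        # 1. Selection: descend through fully expanded nodes via the tree policy.
        while not node.terminal and len(node.children) == len(actions):
            node = max(node.children.values(), key=lambda ch: ucb_score(ch, c, gamma))
        # 2. Expansion: add one untried child to the selected node.
        if not node.terminal:
            action = rng.choice([a for a in actions if a not in node.children])
            nxt, reward, done = model(node.state, action)
            node.children[action] = Node(nxt, node, reward, done)
            node = node.children[action]
        # 3. Simulation: follow the random rollout policy to a terminal state.
        ret, discount, state, done = 0.0, 1.0, node.state, node.terminal
        while not done:
            state, reward, done = model(state, rng.choice(actions))
            ret += discount * reward
            discount *= gamma
        # 4. Backup: propagate the sampled return up to the root.
        while node is not None:
            node.visits += 1
            node.total_return += ret
            ret = node.reward + gamma * ret
            node = node.parent
    # Decision: the root action that was visited most often.
    return max(root.children, key=lambda a: root.children[a].visits)

def corridor_model(state, action):
    """Hypothetical corridor: states 0 and 5 are terminal, only state 5 pays 1."""
    nxt = state + action
    return nxt, (1.0 if nxt == 5 else 0.0), nxt in (0, 5)
```

For example, `mcts(1, corridor_model, (-1, +1), 2000, random.Random(0), gamma=0.9)` searches from state 1, where moving left ends the episode with no reward and moving right leads toward the rewarding end. Returning the action with the largest estimated value instead of the most visits would only require changing the final `max` key.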