# Day 7 - Monte Carlo Methods

Monte Carlo methods do not require a complete model of the environment. Instead, they learn from sampled experience, either real or simulated by a simpler sample model that generates transitions without specifying explicit probability distributions over all states.

## Monte Carlo Prediction

* Monte Carlo prediction is concerned with learning the state-value functions from experience
* As the state-value function is the expected return from that state under the policy $\pi$, it can be approximated by the average of real returns experienced by the agent
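* Each occurrence of state $s$ in an episode is called a *visit* to $s$
* Every-visit MC averages the returns following all visits to $s$, while first-visit MC averages only the returns following the first visit to $s$ in each episode
* Both converge to $v_\pi(s)$ as the number of visits to $s$ goes to infinity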
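* Under first-visit MC, each return is an independent, identically distributed, unbiased estimate of $v_\pi(s)$, so the standard deviation of the averaged estimate's error falls as $1/\sqrt{n}$, where $n$ is the number of returns averaged
* Every-visit MC is less straightforward to analyze, since returns from the same episode are not independent, but its estimates also converge quadratically to $v_\pi(s)$
* As the value estimate for one state does not depend on the estimates of other states, there is no bootstrapping, and therefore no bias inherited from other, possibly inaccurate, estimates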
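
As a rough illustration of the $1/\sqrt{n}$ behaviour, the sketch below averages $n$ i.i.d. sampled returns many times over and measures the spread of the resulting estimates. The Gaussian return distribution, the true value of 0.5, and the helper name `std_of_mean_error` are assumptions made purely for illustration; none of them come from the notes above.

```python
import numpy as np

def std_of_mean_error(n, n_trials=2000, true_value=0.5, return_std=1.0, seed=0):
    """Empirical std of the error of an n-sample average of i.i.d. returns."""
    rng = np.random.default_rng(seed)
    # Each row is one independent experiment with n sampled returns
    returns = rng.normal(true_value, return_std, size=(n_trials, n))
    errors = returns.mean(axis=1) - true_value
    return errors.std()

for n in (10, 100, 1000):
    # The measured spread should stay close to return_std / sqrt(n)
    print(n, std_of_mean_error(n), 1.0 / np.sqrt(n))
```

Multiplying $n$ by 100 should shrink the spread by roughly a factor of 10, which is what makes plain Monte Carlo averaging slow to reach high precision.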

In [1]:
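# First-visit MC prediction.
# Assumes generate_episode(pi) returns a dict with equal-length lists
# "S" (states S_0..S_{T-1}) and "R" (rewards, where R[t] follows S[t]).
import numpy as np

def first_visit_mc_prediction(pi, generate_episode, n_states=3, gamma=1.0, n_episodes=1000):
    V = np.zeros(n_states)
    returns = [[] for _ in range(n_states)]

    for _ in range(n_episodes):
        episode = generate_episode(pi)
        states, rewards = episode["S"], episode["R"]
        G = 0.0
        # Walk backwards so that G is the discounted return following step t
        for t in reversed(range(len(states))):
            G = gamma * G + rewards[t]
            # Every-visit MC would leave out this check!
            if states[t] not in states[:t]:
                returns[states[t]].append(G)
                V[states[t]] = np.mean(returns[states[t]])
    return V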
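
To see the effect of the first-visit check next to the every-visit variant mentioned in the comment above, here is a minimal, self-contained sketch on a hypothetical one-state chain: state 0 loops back to itself with probability `p_loop` and otherwise terminates, with a reward of +1 per step. The environment, the function names, and the parameters are all assumptions for illustration. With `gamma = 1` the true value is $1/(1 - p_{loop})$.

```python
import random

def generate_episode(p_loop, rng):
    """One episode of the hypothetical chain: lists of states and rewards."""
    states, rewards = [], []
    while True:
        states.append(0)
        rewards.append(1.0)
        if rng.random() >= p_loop:
            return states, rewards

def mc_prediction(n_episodes, first_visit, p_loop=0.5, gamma=1.0, seed=0):
    rng = random.Random(seed)
    total, count = 0.0, 0  # running sum and number of returns recorded for state 0
    for _ in range(n_episodes):
        states, rewards = generate_episode(p_loop, rng)
        G = 0.0
        # Iterate backwards so G is the return following time t
        for t in reversed(range(len(states))):
            G = gamma * G + rewards[t]
            if first_visit and states[t] in states[:t]:
                continue  # not the first visit to this state, so skip it
            total += G
            count += 1
    return total / count

print("first-visit:", mc_prediction(20000, first_visit=True))
print("every-visit:", mc_prediction(20000, first_visit=False))
```

Both estimates should land near 2 for `p_loop = 0.5`, even though the every-visit variant records several correlated returns per episode.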

### $Exercise\ \mathcal{5.1}$

#### Consider the diagrams on the right in Figure 5.1. Why does the estimated value function jump up for the last two rows in the rear?
The last two rows correspond to player sums of 20 and 21, the only sums on which the evaluated policy sticks. From these states the player is very likely to win, whereas on every lower sum the policy hits and risks going bust.

#### Why does it drop off for the whole last row on the left?
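In this row the dealer is showing an ace, which makes the dealer's hand more flexible. The ace can count as 11 toward a strong total, and if that would take the dealer over 21, it counts as 1 instead, essentially giving the dealer a "second chance." The dealer is therefore less likely to go bust, which lowers the player's value across the whole row.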

#### Why are the frontmost values higher in the upper diagrams than in the lower?
The upper diagrams show the states in which the player has a usable ace. The same flexibility now benefits the player: if hitting would take them over 21 with the ace counted as 11, the ace counts as 1 instead, so they get a "second chance" rather than going bust.

### $Exercise\ \mathcal{5.2}$

#### Suppose every-visit MC was used instead of first-visit MC on the blackjack task. Would you expect the results to be very different? Why or why not?
In a single episode, the same state cannot be visited more than once: every hit increases the card total, and even if the same sum is encountered twice, the "usable ace" part of the state must have changed from true to false in between. Therefore, every-visit MC and first-visit MC are equivalent on this problem, and the results would not differ.
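
The argument can be checked empirically with a small simulation. The sketch below is a simplified, assumed model of the player's side only: cards come from an infinite deck, the player hits until reaching `stick_at`, and the state is the pair `(sum, usable_ace)`. The dealer's showing card is left out because it is fixed within an episode, and sums below 12 are recorded as states for simplicity. All function names are hypothetical.

```python
import random

def draw_card(rng):
    # Infinite deck: ranks 1..13, face cards count as 10, an ace is returned as 1
    return min(rng.randint(1, 13), 10)

def hand_state(cards):
    """Return (sum, usable_ace), counting one ace as 11 when that does not bust."""
    total = sum(cards)
    usable = 1 in cards and total + 10 <= 21
    return (total + 10 if usable else total, usable)

def player_states(rng, stick_at=20):
    """States visited by a player who hits until reaching stick_at or busting."""
    cards = [draw_card(rng), draw_card(rng)]
    visited = []
    while True:
        total, usable = hand_state(cards)
        if total > 21:
            return visited  # bust is terminal, so it is not recorded as a state
        visited.append((total, usable))
        if total >= stick_at:
            return visited
        cards.append(draw_card(rng))

rng = random.Random(0)
episodes = [player_states(rng) for _ in range(50000)]
repeats = sum(len(states) != len(set(states)) for states in episodes)
print("episodes with a repeated state:", repeats)
```

The count should be 0: within each value of the usable-ace flag the sum strictly increases, so a repeated sum can only occur after the flag has flipped.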