# Chapter 5: Monte Carlo Methods

### Exercise 5.1 - Why does the estimated value function jump up for the last two rows in the rear?


The value function jumps up in the last two rows in the rear because in those rows the player's hand sums to 20 or 21. With 21 points the player will most likely win or (at worst and quite unlikely) draw. Under the current policy, which hits until the hand reaches 20 or 21, the player sticks in these states and takes no more cards. As the dealer is unlikely to reach a better total, these states are highly valued under this policy.

In every other state the player keeps hitting until reaching 20 or 21, and so frequently goes bust. The policy is therefore unlikely to win from any state other than 20 and 21, which produces the value diagram of Figure 5.1.

The leftmost column corresponds to the dealer showing an ace (plotted as 1). The values drop there because an ace is a strong card for the dealer: it can count as 1 or 11, which raises the dealer's chance of ending with a high total.

### Exercise 5.2 - Would every-visit MC output a different value function than first-visit MC?

First-visit MC averages only the return that follows the first visit to a state in each episode, while every-visit MC averages the returns that follow all visits. As the policy is fixed, every sampled return $G_i$ from a state $s$ has the same expectation, $E[G_i] = v_\pi(s)$, so both averages converge to $v_\pi(s)$ and any difference between them can be neglected after 500,000 episodes.

In blackjack the two estimators are in fact identical: a state (player sum, dealer's showing card, usable ace) cannot recur within an episode, because each hit either raises the player's sum or turns a usable ace into a non-usable one. Every visit is therefore a first visit.
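
For blackjack specifically, one can check by simulation whether a state ever recurs within an episode; if it never does, first-visit and every-visit MC give the same averages. The following is a minimal sketch, not the book's code: `player_states` is a hypothetical helper, cards are drawn from an infinite deck, and the player hits until reaching 20 or 21.

```python
import random

def player_states(rng, dealer_card):
    """States (sum, dealer card, usable ace) seen while hitting up to 20 or 21."""
    total, usable_ace = 0, False
    states = []
    while True:
        card = min(rng.randint(1, 13), 10)  # J, Q, K count as 10
        if card == 1 and total + 11 <= 21:
            total, usable_ace = total + 11, True  # count the ace as 11
        else:
            total += card
        if total > 21 and usable_ace:
            total, usable_ace = total - 10, False  # ace falls back to 1
        if total > 21:
            return states  # bust
        if total >= 12:
            states.append((total, dealer_card, usable_ace))
        if total >= 20:
            return states  # the policy sticks here

rng = random.Random(0)
repeats = 0
for _ in range(10_000):
    states = player_states(rng, dealer_card=1)
    repeats += len(states) != len(set(states))  # count episodes with a repeated state
print(repeats)  # 0
```

The count stays at zero because the sum strictly increases while the ace status is unchanged, and an ace can only go from usable to non-usable.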




### Exercise 5.3 - What is the backup diagram for the Monte Carlo estimation of $q_\pi$?

The backup diagram for Monte Carlo estimation of $q_\pi$ starts from the state-action pair (s,a) we are interested in, rather than from a state. Unlike the Dynamic Programming diagram, which branches over all possible one-step transitions, it follows a single sampled trajectory: (s,a) leads through the reward r to the next state (s'), then to the next action, and so on down to the terminal state.


**Question:** In MC methods, do we assume a greedy policy when evaluating the value of the next state, as was done in Dynamic Programming?


### Exercise 5.4 - How can we compute the average return of the states in an MC MDP more efficiently?

In the same way as for the Multi-armed Bandits problem. The average can be maintained by storing just two values, the current average and its sample count, and applying the following update for each new sample:

- Current average: CurrAvg
- Number of samples in the current average: N
- New sample: V

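```python
def NewAverage(CurrAvg, N, V):
    """Fold the new sample V into the average CurrAvg of N samples.

    Returns the updated average and the updated sample count.
    """
    if N == 0:
        # No samples yet: the first sample is the average.
        return V, 1
    else:
        # Weight the old average by its sample count before adding V.
        NewAvg = (CurrAvg*N + V)/(N+1)
        return NewAvg, N+1
```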
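
The update in `NewAverage` can also be written in the incremental form used for the bandit problem. As a short derivation, writing $Q_n$ for `CurrAvg` after $n$ samples and $G_{n+1}$ for the new sample `V` (symbols introduced here only for illustration):

$ Q_{n+1} = \frac{n Q_n + G_{n+1}}{n+1} = Q_n + \frac{1}{n+1}\left(G_{n+1} - Q_n\right) $

In this form the new average is the old one plus a step of size $\frac{1}{n+1}$ towards the new sample.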

### Exercise 5.5 - What are the values of the first-visit and every-visit estimators of the value of the non-terminal state?

The first-visit estimator averages only the return that follows the first visit to the state in each episode, while the every-visit estimator averages the returns that follow all visits. The episode has T = 10 steps, a reward of +1 on every transition and $\gamma = 1$, so working backwards through it gives:

- `t = (T-1) = 9`: G <- 0*1 + 1 = 1, thus Vs = [1]
- `t = (T-2) = 8`: G <- 1*1 + 1 = 2, thus Vs = [1, 2]
- `t = (T-3) = 7`: G <- 2*1 + 1 = 3, thus Vs = [1, 2, 3]
- (...)
- `t = (T-9) = 1`: G <- 8*1 + 1 = 9, thus Vs = [1, 2, 3, 4, 5, 6, 7, 8, 9] -> avg(Vs) = 5
- `t = (T-10) = 0`: G <- 9*1 + 1 = 10, thus Vs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] -> avg(Vs) = 5.5

The first visit happens at t = 0, where the return is 10, so the first-visit estimate is 10. The every-visit estimate is the average of all ten returns, 5.5.
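
The two estimators can be checked numerically for this single-state episode. The sketch below assumes the episode is given as a list of rewards for one non-terminal state; `mc_estimates` is a hypothetical helper name, not something from the book.

```python
def mc_estimates(rewards, gamma=1.0):
    """First-visit and every-visit estimates for an episode with one state."""
    G = 0.0
    returns = []  # return following each visit, from the last visit to the first
    for r in reversed(rewards):
        G = gamma * G + r
        returns.append(G)
    first_visit = returns[-1]  # return following the visit at t = 0
    every_visit = sum(returns) / len(returns)  # average over all visits
    return first_visit, every_visit

first, every = mc_estimates([1] * 10)
print(first, every)  # 10.0 5.5
```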

### Exercise 5.6 - What is the equivalent of *weighted importance sampling* for q(s,a) instead of v(s)?

For q(s,a) the analogous *weighted importance sampling* estimate is almost the same, with the exception that we don't include the very first ratio $\frac{\pi(A_t|S_t)}{b(A_t|S_t)}$: the action $A_t = a$ is given (we are evaluating the value of that action and assuming that all further actions are taken following the policy), so only the actions from time t+1 onwards need to be reweighted. As a consequence, with $T(s,a)$ the set of time steps at which the pair (s,a) is visited and $T(t)$ the termination time of the episode containing t:

$ Q(s,a) = \frac{\sum_{t \in T(s,a)} \rho_{t+1:T(t)-1} G_t}{\sum_{t \in T(s,a)} \rho_{t+1:T(t)-1}} $
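
As an illustration, the weighted importance-sampling estimate for a state-action pair can be sketched in Python. Everything here is an assumption made for the example: episodes are lists of `(state, action, reward)` tuples generated by the behaviour policy, `pi(a, s)` and `b(a, s)` are hypothetical callables returning action probabilities, and every visit to the pair is counted.

```python
def weighted_is_q(episodes, s, a, pi, b, gamma=1.0):
    """Weighted importance-sampling estimate of Q(s, a), every-visit variant."""
    num = den = 0.0
    for episode in episodes:
        T = len(episode)
        for t, (s_t, a_t, _) in enumerate(episode):
            if (s_t, a_t) != (s, a):
                continue
            # Return G_t: discounted sum of the rewards from step t onwards.
            G = sum(gamma ** (k - t) * episode[k][2] for k in range(t, T))
            # Ratio over steps t+1 .. T-1 only: the action at step t is given.
            rho = 1.0
            for k in range(t + 1, T):
                s_k, a_k, _ = episode[k]
                rho *= pi(a_k, s_k) / b(a_k, s_k)
            num += rho * G
            den += rho
    return num / den if den > 0 else 0.0

# Toy data: the target policy always picks "L", the behaviour policy is uniform.
pi = lambda a, s: 1.0 if a == "L" else 0.0
b = lambda a, s: 0.5
episodes = [
    [("s", "L", 0.0), ("s", "L", 1.0)],
    [("s", "L", 0.0), ("s", "R", 5.0)],
    [("s", "L", 4.0)],
]
print(weighted_is_q(episodes, "s", "L", pi, b))  # 1.75
```

The second episode contributes nothing, because the action "R" taken after the visit has probability zero under the target policy.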

### Exercise 5.7 - Why did the error of the weighted importance sampling first increase and then decrease with the number of episodes?

**I don't know**, but maybe: the error in the plot is the mean squared error between the estimated value of the initial state under the target policy and its true value.
The target and behaviour policies assign different probabilities P(a|s), and weighted importance sampling is a biased estimator whose bias only vanishes as samples accumulate. With the initially low number of samples for the state values, the estimate may therefore first drift away from the true value before converging to it.

### Exercise 5.8 - Would the variance of the MC Method used still be infinite if instead of first-visit MC an every-visit MC was used?

Yes. The infinite variance does not come from the sampled returns (which usually have bounded variance) but from the *importance-sampling ratio*, which is unbounded for ordinary importance sampling.

### Exercise 5.9 - Modifying the algorithm for first-visit MC policy evaluation (Section 5.1) to use an incremental implementation for sample averages

The pseudo-code for an incremental implementation of the first-visit MC prediction algorithm from Section 5.1 is:

Input: a policy $\pi$ to be evaluated

Initialize:

 - $V(s) \in \mathbb{R}$, arbitrarily, for all $s \in S$
 - $N(s) \leftarrow 0$ for all $s \in S$, which holds the number of episodes in which the state was visited
 
Loop forever (for each episode):

- Generate an episode following $\pi$: $S_0, A_0, R_1, S_1, A_1, R_2, ..., S_{T-1}, A_{T-1}, R_T$
- G $\leftarrow$ 0
- Loop for each step of the episode, $t = T-1, T-2, ..., 0$
    - G $\leftarrow \gamma G + R_{t+1}$
    - Unless $S_t$ appears in $S_0, S_1, ..., S_{t-1}$
        - $N(S_t) \leftarrow N(S_t) + 1$
        - $V(S_t) \leftarrow V(S_t) + (G - V(S_t))/N(S_t)$
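
The pseudo-code above could be sketched in Python as follows. This is a minimal illustration under stated assumptions: episodes are supplied as lists of `(state, reward)` pairs, where the reward is $R_{t+1}$, instead of being generated from a policy, values start at 0, and the function name is hypothetical.

```python
from collections import defaultdict

def first_visit_mc_incremental(episodes, gamma=1.0):
    """Incremental first-visit MC prediction over pre-generated episodes."""
    V = defaultdict(float)  # value estimates, initialised to 0
    N = defaultdict(int)    # number of first visits per state
    for episode in episodes:
        states = [s for s, _ in episode]
        G = 0.0
        # Walk the episode backwards so G accumulates the return from step t.
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G = gamma * G + r
            if s not in states[:t]:  # first visit to s in this episode
                N[s] += 1
                V[s] += (G - V[s]) / N[s]  # incremental sample average
    return V, N

V, N = first_visit_mc_incremental([[("a", 1), ("b", 2)], [("a", 5)]])
print(V["a"], V["b"], N["a"])  # 4.0 2.0 2
```

No list of returns is stored: each state keeps only its current estimate and a counter.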

### Exercise 5.11 - Why is the update of W $\frac{1}{b(A_t|S_t)}$ when you (likely) expected it to be $\frac{\pi(A_t|S_t)}{b(A_t|S_t)}$ from weighted importance sampling?


The target policy is the greedy (deterministic) policy, and the inner loop of the algorithm exits whenever $A_t \neq \pi(S_t)$, since in that case $\pi(A_t|S_t) = 0$ and any continuation of the calculation would be meaningless. Whenever the update of W is reached we therefore have $A_t = \pi(S_t)$, so $\pi(A_t|S_t) = 1$ and the algorithm simply replaces it with its numeric value.
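
The place where the $\frac{1}{b(A_t|S_t)}$ factor appears can be seen in a sketch of the inner loop of off-policy MC control. The function name, the `(state, action, reward)` episode format and the `b_prob(a, s)` callable are assumptions made for this example, and ties in the greedy choice are broken by the order of `actions`.

```python
from collections import defaultdict

def off_policy_mc_control_update(Q, C, episode, b_prob, actions, gamma=1.0):
    """Process one episode backwards, updating Q and C in place. Returns W."""
    G, W = 0.0, 1.0
    for s, a, r in reversed(episode):
        G = gamma * G + r
        C[(s, a)] += W
        Q[(s, a)] += (W / C[(s, a)]) * (G - Q[(s, a)])
        greedy = max(actions, key=lambda x: Q[(s, x)])  # pi(S_t)
        if a != greedy:
            break  # pi(A_t|S_t) = 0: earlier steps would get zero weight
        W *= 1.0 / b_prob(a, s)  # pi(A_t|S_t) = 1 here, so only 1/b remains
    return W

Q, C = defaultdict(float), defaultdict(float)
episode = [("s0", "L", 0.0), ("s1", "R", 1.0)]
W = off_policy_mc_control_update(Q, C, episode, lambda a, s: 0.5, ["L", "R"])
print(Q[("s1", "R")], Q[("s0", "L")], C[("s0", "L")], W)  # 1.0 1.0 2.0 4.0
```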