# 1. Monte Carlo Intro
In this section we discuss another technique for solving MDPs, known as **Monte Carlo**. In the last section, you may have noticed something a bit odd: we have talked about how RL is all about learning from experience and playing games, yet in none of our dynamic programming algorithms did we actually play the game. We had a full model of the environment, which included all of the state transition probabilities. You may wonder: is it reasonable to assume that we would have that type of information in a real-life environment? For board games, perhaps. But what about self-driving cars?

The way that we manipulated our dynamic programming algorithms also required us to be able to put an agent into any state we liked. That may not always be possible, especially when talking about self-driving cars, or even video games. A video game starts in the state that it decides; you can't choose any state you want. This is another instance of having god-mode capabilities, which is not a realistic assumption in general. In this section, we will be playing the game and learning purely from experience.

## 1.2 Monte Carlo Methods
Monte Carlo is a rather poorly defined term. Usually, it refers to any algorithm that involves a significant random component. With Monte Carlo methods in RL, the random component is the _**return**_. Recall that what we are always looking for is the expected return given that you are in state $s$. With MC, instead of calculating the true expected value of $G$ (which requires probability distributions), we calculate its sample mean.

In order for this to work, we need to assume that we are doing episodic tasks only. The reason is that an episode has to terminate before we can calculate any of the returns. This also means that MC methods are _not_ fully online algorithms. We don't do an update after every action, but rather after every episode.

The methods that we use in the Monte Carlo section should be somewhat reminiscent of the multi-armed bandit problem. With the multi-armed bandit problem, we were always averaging the reward after every action. With MDPs, we are always averaging the return. One way to think of Monte Carlo is that _every state_ is a _separate multi-armed bandit problem_. What we are trying to do is learn to behave optimally for all of the multi-armed bandit problems, all at once. In this section, we are again going to follow the same pattern that we did in the DP section. We will start by looking at the _**prediction problem**_ (_finding the value function given the policy_), and then look at the _**control problem**_ (_finding the optimal policy_).

---

# 2.0 Monte Carlo Policy Evaluation
We are now going to solve the prediction problem using Monte Carlo Estimation. Recall that the definition of the value function is that it is the expected value of the future return, given that the current state is $s$:

#### $$V_\pi(s) = E \big[G(t) \mid S_t = s\big]$$

We know that we can estimate any expected value simply by adding up samples and dividing by the total number of samples:

#### $$\bar{V}_\pi(s) = \frac{1}{N} \sum_{i =1}^N G_{i,s}$$

Here, $i$ indexes the episode and $s$ indexes the state. The question now is: how do we get these sample returns?

## 2.1 How do we generate $G$?
In order to get these sample returns, we need to play many episodes to generate them! For every episode that we play, we will have a sequence of states and rewards. From the rewards, we can calculate the returns by definition: the return is the discounted sum of all future rewards, which can be written recursively as:

#### $$G(t) = r(t+1) + \gamma * G(t+1)$$

Notice how, to actually implement this in code, it would be very useful to loop through the states in reverse order, since $G$ depends only on future values. Once we have done this for many episodes, we will have multiple lists of $s$'s and $G$'s. We can then take the sample mean. 
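
As a minimal sketch of that reverse loop (the function name `compute_returns` and the example rewards are made up for illustration, and `rewards[t]` is assumed to hold $r(t+1)$), the recursion can be applied from the end of the episode backwards:

```python
def compute_returns(rewards, gamma):
    """Return G(t) for t = 0..T-1, where rewards[t] holds r(t+1)."""
    returns = [0.0] * len(rewards)
    g = 0.0  # the return after the final step is zero
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g  # G(t) = r(t+1) + gamma * G(t+1)
        returns[t] = g
    return returns


# Hypothetical three-step episode: only the last transition pays a reward.
rewards = [0.0, 0.0, 1.0]  # r(1), r(2), r(3)
print(compute_returns(rewards, gamma=0.9))
```

Each return is built from the one after it, so a single backward pass is enough and no return has to be summed from scratch.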

## 2.2 Multiple Visits to $s$
One interesting question that comes up is: what if you see the same state more than once in an episode? For instance, what if you see state $s$ at $t=1$ and $t=3$? What is the return for state $s$? Should we use $G(1)$ or $G(3)$? There are two ways to handle this, and surprisingly both estimates converge to the same value.

**First Visit Monte Carlo**<br>
The first method is called _first visit monte carlo_. That means that you would only count the return for time $t=1$. 

**Every Visit Monte Carlo**<br>
The second method is called _every visit monte carlo_. That means that you would calculate the return for every time you visited the state $s$, and all of them would contribute to the sample mean; i.e. use both $t=1$ and $t=3$ as samples. 

Surprisingly, it has been proven that both methods converge to the same answer, $V_\pi(s)$, as the number of visits to $s$ grows.
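
To make the difference concrete, here is a small sketch that groups the returns of a single episode by state under each rule. The helper `collect_returns`, the state labels, and the return values are all invented for illustration:

```python
from collections import defaultdict


def collect_returns(states, returns, first_visit=True):
    """Group the returns of one episode by state."""
    samples = defaultdict(list)
    seen = set()
    for s, g in zip(states, returns):
        if first_visit and s in seen:
            continue  # first-visit: ignore later visits to the same state
        seen.add(s)
        samples[s].append(g)
    return dict(samples)


# Hypothetical episode: state "A" is visited on the first and third steps.
states = ["A", "B", "A", "C"]
returns = [2.0, 1.5, 1.0, 0.5]  # made-up returns, one per step

first = collect_returns(states, returns, first_visit=True)
every = collect_returns(states, returns, first_visit=False)
print(first["A"])  # only the return from the first visit
print(every["A"])  # returns from both visits
```

Within one episode the two rules give different samples for "A"; it is only the long-run averages that agree.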

## 2.3 First-Visit MC Pseudocode
Let's now look at some pseudocode for first visit monte carlo prediction. 

---

```
def first_visit_monte_carlo_prediction(pi, N):
  V = random initialization
  all_returns = {} # default = []
  do N times:
    states, returns = play_episode(pi) # follow the policy pi until the episode ends
    for s, g in zip(states, returns):
      if not seen s in this episode yet:
        all_returns[s].append(g)
        V(s) = sample_mean(all_returns[s])
  return V
```

---

In the above pseudocode we can see the following:
> * The input is a policy, and the number of samples we want to generate
* We initialize $V$ randomly, and we create a dictionary to store our returns, with a default value being an empty list
* We loop N times. Inside the loop we generate an episode by playing the game. 
* Next, we loop through the state sequence and return sequence. We only include the return if this is the first time we have seen this state in this episode, since this is first visit MC.
* If so, we add this return to our list of returns for this state. 
* Next, we update V(s) to be the sample mean of all the returns we have collected for this state. 
* At the end, we return $V$. 
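
The pseudocode above can be turned into a runnable sketch. Everything about the environment here is an assumption made for illustration: `play_episode` is a hypothetical 5-state random walk with the policy baked in, `GAMMA` is set to 1.0 so the true values are easy to check by hand, and `V` starts at 0 for unseen states rather than at random values.

```python
import random
from collections import defaultdict

GAMMA = 1.0  # assumption: undiscounted, to keep the true values easy to check


def play_episode(rng):
    """Hypothetical 5-state random walk: start in 2, terminate in 0 or 4.

    The policy moves left or right with equal probability; reaching
    state 4 pays +1 and every other transition pays 0.
    Returns the non-terminal states visited and the return from each.
    """
    s, states, rewards = 2, [], []
    while s not in (0, 4):
        states.append(s)
        s += rng.choice((-1, 1))
        rewards.append(1.0 if s == 4 else 0.0)
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + GAMMA * g
        returns.append(g)
    returns.reverse()
    return states, returns


def first_visit_mc_prediction(play_episode, n_episodes, seed=0):
    rng = random.Random(seed)
    V = defaultdict(float)           # value estimates, 0 for unseen states
    all_returns = defaultdict(list)  # every first-visit return, per state
    for _ in range(n_episodes):
        states, returns = play_episode(rng)
        seen = set()
        for s, g in zip(states, returns):
            if s in seen:
                continue             # first-visit: skip repeat visits
            seen.add(s)
            all_returns[s].append(g)
            V[s] = sum(all_returns[s]) / len(all_returns[s])
    return dict(V)


V = first_visit_mc_prediction(play_episode, n_episodes=5000)
for s in sorted(V):
    print(s, round(V[s], 3))
```

For this particular walk the exact values are $V(1) = 0.25$, $V(2) = 0.5$ and $V(3) = 0.75$, so the printed estimates should land close to those numbers.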

## 2.4 Sample Mean
One thing that you may have noticed in the pseudocode is that it requires us to store all of the returns that we get for each state so that the sample mean can be calculated. But, if you recall from our section on the multi-armed bandit, there are more efficient ways to calculate the mean, such as updating it from the previous mean. There are also techniques for nonstationary problems, like using a moving average. So, all of the techniques we have learned already still apply here.
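
As a brief sketch of both ideas (the function names and the step size `alpha` are illustrative choices, not part of the original code), the running mean only needs the previous mean and a count, and the moving average only needs the previous mean:

```python
def update_mean(mean, count, g):
    """Fold one new return g into a running mean of `count` samples."""
    count += 1
    mean += (g - mean) / count
    return mean, count


def update_moving_average(mean, g, alpha=0.1):
    """Constant step size: recent returns weigh more (nonstationary case)."""
    return mean + alpha * (g - mean)


returns = [1.0, 0.0, 2.0, 1.0]  # made-up returns for a single state
mean, count = 0.0, 0
for g in returns:
    mean, count = update_mean(mean, count, g)

# The running mean matches the batch mean without storing the list.
print(mean, sum(returns) / len(returns))
```

With this update, the per-state list of returns in the pseudocode can be replaced by one float and one integer per state.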

Another thing that we should notice about MC is that, because we are calculating the sample mean, all of the same rules of probability apply. That means that the sample mean is approximately Gaussian distributed (by the central limit theorem), and its variance is the original variance of the data divided by the number of samples collected:

#### $$\text{Variance of Estimate} = \frac{\text{variance of RV}}{N}$$

Therefore, we are going to be more confident in estimates that are based on more samples, but that confidence grows slowly with respect to the number of samples: the standard deviation of the estimate shrinks only as $1/\sqrt{N}$.
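
A quick simulation can illustrate the formula. This is only a sketch: the "returns" are stand-in draws from a Uniform(0, 1) distribution (variance $1/12$), and the trial count and seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 20000  # number of independent sample means to compute


def variance_of_sample_mean(n_samples):
    """Empirical variance of the mean of n_samples uniform draws."""
    draws = rng.uniform(0.0, 1.0, size=(n_trials, n_samples))
    return draws.mean(axis=1).var()


true_var = 1.0 / 12.0  # variance of a Uniform(0, 1) random variable
for n in (10, 100):
    # Empirical variance of the estimate next to the predicted true_var / n.
    print(n, variance_of_sample_mean(n), true_var / n)
```

Going from 10 to 100 samples cuts the variance by a factor of ten, but the standard deviation only by about a factor of three.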

## 2.5 Calculating Returns from Rewards
For full clarity, we will also quickly go over how to calculate the returns from the rewards in pseudocode. 

---

```
# Calculating State and Reward Sequences 
s = grid.current_state()
states_and_rewards = [(s, 0)]
while not game_over:
  a = policy(s)
  r = grid.move(a)
  s = grid.current_state()
  states_and_rewards.append((s, r))
  
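# Calculating the Returns
G = 0  # the return following the terminal state is 0
states_and_returns = []
for s, r in reversed(states_and_rewards):
  states_and_returns.append((s, G))  # G is the return that follows state s
  G = r + gamma*G  # r is the reward received on arriving in s
states_and_returns.reverse()  # restore the order in which states were visited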
```

---
The above pseudocode shows two main steps:
1. Calculating State and Reward Sequences. This is just playing the game and keeping a log of all the states and rewards that we get, in the order we get them. Notice that this is a list of tuples. Also, the first reward is assumed to be 0: we do not get any reward simply for arriving at the start state.
2. Calculating the Returns. We start with an empty list, and then loop through the states and rewards in reverse order. On the first iteration of this loop, the state s is the terminal state and G is 0, so the terminal state is recorded with a return of 0. Next we update G. Notice how, on the first iteration, this update includes the reward received on arriving at the terminal state. Once the loop is done, we reverse the list of states and returns, since we want it to be in the order that we visited the states.
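
For a version that actually runs, the two steps can be wrapped around a tiny stand-in environment. The `ToyCorridor` class, its `game_over()` method, and the `play_game` wrapper are all hypothetical; they only imitate the `current_state()` and `move()` calls used in the pseudocode.

```python
class ToyCorridor:
    """Hypothetical stand-in for the grid: states 0..3, state 3 is terminal."""

    def __init__(self):
        self.s = 0

    def current_state(self):
        return self.s

    def game_over(self):
        return self.s == 3

    def move(self, action):
        # The action is ignored: every move steps right, reaching 3 pays +1.
        self.s += 1
        return 1.0 if self.s == 3 else 0.0


def play_game(grid, policy, gamma):
    # Step 1: log (state, reward) pairs in the order they occur.
    s = grid.current_state()
    states_and_rewards = [(s, 0.0)]  # no reward for arriving at the start
    while not grid.game_over():
        r = grid.move(policy(s))
        s = grid.current_state()
        states_and_rewards.append((s, r))

    # Step 2: walk backwards, pairing each state with the return after it.
    G = 0.0
    states_and_returns = []
    for s, r in reversed(states_and_rewards):
        states_and_returns.append((s, G))
        G = r + gamma * G
    states_and_returns.reverse()
    return states_and_returns


result = play_game(ToyCorridor(), policy=lambda s: 'R', gamma=0.9)
print(result)
```

The terminal state comes out with a return of 0, and each earlier state gets the reward of 1 discounted once for every extra step it is away from the goal.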

## 2.6 Note on MC
One final thing to note about MC. Recall that one of the disadvantages of DP is that we have to loop through the entire set of states on every iteration, and that this is bad for most practical scenarios in which there are a large number of states. Notice how MC only updates the value for states that we actually visit. That means even if the state space is large, if we only ever visit a small subset of states, then it doesn't matter. 

Also notice, we don't even need to know what the states are! We can simply discover them by playing the game. So, there are some advantages to MC in situations where doing full exact calculations is infeasible. 

---

# 3.0 Monte Carlo Policy Evaluation in Code
We are now going to implement Monte Carlo for finding the State-Value function in code. 


```python
import numpy as np
from common import standard_grid, negative_grid, print_policy, print_values

SMALL_ENOUGH = 10e-4
GAMMA = 0.9
ALL_POSSIBLE_ACTIONS = ('U', 'D', 'L', 'R')
```
