
# Monte Carlo methods

In Monte Carlo methods, the agent follows a policy for an entire episode. The methods approximate values by interacting with the environment to generate sample returns and then averaging them.
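The averaging of sample returns can be sketched in a few lines. This is a minimal illustration, not a full implementation: the episode format (a list of `(state, reward)` pairs), the function names, and the discount `gamma` are all assumptions made for the example.

```python
from collections import defaultdict

def discounted_returns(rewards, gamma=0.99):
    """Compute the return G_t for every step of one episode, working backwards."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def mc_state_values(episodes, gamma=0.99):
    """Every-visit Monte Carlo: average the returns observed from each state.

    `episodes` is a hypothetical format: each episode is a list of
    (state, reward) pairs, where the reward is received after leaving the state.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for episode in episodes:
        states = [s for s, _ in episode]
        rewards = [r for _, r in episode]
        for s, g in zip(states, discounted_returns(rewards, gamma)):
            totals[s] += g
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Two made-up episodes that both visit A and then B.
episodes = [
    [("A", 0.0), ("B", 1.0)],
    [("A", 0.0), ("B", 3.0)],
]
values = mc_state_values(episodes, gamma=0.5)
```

Note that the estimate for `"A"` is built only from returns that followed `"A"`; no estimate of another state is read while computing it.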

Advantages of Monte Carlo methods:

 - The estimate for one state does not depend on the estimates for other states. The cost of estimating a single state's value is independent of the total number of states, so it can be much cheaper than DP, which sweeps over the whole state space.
 - We can focus estimation on the states that matter for solving the task and skip the irrelevant ones.
 - There is no need to know the dynamics of the environment. For many tasks it is easier to generate samples than to write down the full dynamics.


Monte Carlo methods depend on the following equation:


$\pi'(s) = \arg\max_a \sum_{s',r} p(s', r|s,a)[r+\gamma v_\pi (s')]$


Note that this equation was used earlier, in the policy improvement theorem.


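Because Monte Carlo methods do not have access to the dynamics $p(s', r|s,a)$, they cannot evaluate the sum in the equation above directly. That sum equals $q_\pi(s,a)$, so we use q-values of state-action pairs instead: we keep a table of q-values rather than state values and choose $\pi'(s) = \arg\max_a q_\pi(s,a)$. Like dynamic programming, the Monte Carlo method alternates between policy evaluation and policy improvement. This continues until the q-values reach their optimal values.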
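As a rough sketch of what a q-value table and the greedy improvement step might look like, assuming a 5x5 grid of states and 4 actions (both sizes, and the names `q_table` and `greedy_action`, are hypothetical choices for this example):

```python
import numpy as np

# Assumed sizes for illustration: a 5x5 maze with 4 actions per state.
n_rows, n_cols, n_actions = 5, 5, 4

# One q-value per (row, col, action), initialised to zero.
q_table = np.zeros((n_rows, n_cols, n_actions))

def greedy_action(q_table, state):
    """Policy improvement step: pick the action with the highest q-value."""
    row, col = state
    return int(np.argmax(q_table[row, col]))

# Made-up estimates for state (0, 0), only to show the lookup.
q_table[0, 0] = [0.1, 0.7, 0.2, 0.0]
best = greedy_action(q_table, (0, 0))
```

No dynamics model appears anywhere in the lookup: the arg max runs over stored estimates only.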


The maze example works through the following steps to reach an optimal solution:

 - Initialize the environment
 - Define value table Q(s,a)
	 - Create the Q(s,a) table
	 - Plot Q(s,a)
 - Define the policy
	 - Create the policy $\pi(s)$
	 - Test the policy with state (0, 0)
	 - Plot the policy
 - Implement the algorithm
 - Show results
	 - Show resulting value tables Q(s, a)
	 - Show resulting policy $\pi(\cdot|s)$
	 - Test the resulting agent

![enter image description here](https://i.imgur.com/lHRirvh.png)
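The "implement the algorithm" step could look roughly like the sketch below. It is not the maze code from these notes: it assumes a toy 5-state corridor (goal at the right end, reward of -1 per step), an epsilon-greedy policy, and every-visit averaging. All names and constants are hypothetical.

```python
import random

N_STATES, GOAL = 5, 4   # hypothetical corridor: states 0..4, goal at state 4
ACTIONS = (0, 1)        # 0 = left, 1 = right

def step(state, action):
    """Toy environment, used only to sample experience."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    return next_state, -1.0, next_state == GOAL

def epsilon_greedy(q, state, epsilon, rng):
    """Mostly greedy action choice, with random tie-breaking and exploration."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q[state])
    return rng.choice([a for a in ACTIONS if q[state][a] == best])

def mc_control(episodes=500, gamma=0.99, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    counts = [[0, 0] for _ in range(N_STATES)]
    for _ in range(episodes):
        # Play one full episode with the current policy.
        state, done, trajectory = 0, False, []
        while not done:
            action = epsilon_greedy(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            trajectory.append((state, action, reward))
            state = next_state
        # Policy evaluation: walk backwards and average returns into Q(s, a).
        g = 0.0
        for state, action, reward in reversed(trajectory):
            g = reward + gamma * g
            counts[state][action] += 1
            q[state][action] += (g - q[state][action]) / counts[state][action]
        # Policy improvement is implicit: epsilon_greedy reads the updated q.
    return q

q = mc_control()
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(GOAL)]
```

On this corridor the greedy policy should end up preferring "right", but the exact q-values depend on the seed and the number of episodes.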


The Monte Carlo method is not too different from dynamic programming: both use policy evaluation and policy improvement. However, keep in mind that a Monte Carlo estimate for a state does not depend on the estimates of other states; it is built only from sampled returns.


## Exploration


A purely greedy Monte Carlo agent does not explore every possible path, only the ones its current q-values deem "optimal". However, there may be a better path overall that is never tried because it starts with an action whose current q-value estimate looks sub-optimal.


The algorithm therefore has to try all options occasionally, that is, explore all actions and update their estimates in the q-value table. There are two approaches to this.

 - **Exploring Starts**
	 - Every time the agent interacts with the environment to collect experience, it starts in a random initial state and takes a random initial action.
	 - Every q-value is updated eventually, whenever its state-action pair is chosen to start an episode.
	 - Not very realistic, since in many tasks we cannot choose the starting state and action.
 - **Stochastic Policies**
	 - $\pi(a|s)>0, \forall a \in A(s)$
	 - The policy gives every action a probability greater than 0 of being chosen. This ensures that, from time to time, the agent takes an action it doesn't consider optimal, which improves its understanding of the task.
	 - This is easier to implement than Exploring Starts.
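Both approaches can be sketched briefly. The example below assumes an epsilon-greedy policy as one common way to satisfy $\pi(a|s)>0$; the function names, the value of `epsilon`, and the state and action counts are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon=0.2):
    """Return pi(.|s): every action gets at least epsilon / n probability."""
    q_values = np.asarray(q_values, dtype=float)
    n = q_values.size
    probs = np.full(n, epsilon / n)
    probs[np.argmax(q_values)] += 1.0 - epsilon
    return probs

def sample_action(q_values, epsilon=0.2, rng=None):
    """Stochastic policy: sample an action from the epsilon-greedy distribution."""
    if rng is None:
        rng = np.random.default_rng()
    probs = epsilon_greedy_probs(q_values, epsilon)
    return int(rng.choice(len(probs), p=probs))

def exploring_start(n_states, n_actions, rng=None):
    """Exploring Starts: draw a random initial state and a random initial action."""
    if rng is None:
        rng = np.random.default_rng()
    return int(rng.integers(n_states)), int(rng.integers(n_actions))

# Made-up q-values for a single state with four actions.
probs = epsilon_greedy_probs([0.1, 0.7, 0.2, 0.0], epsilon=0.2)
```

With `epsilon=0.2` and four actions, the greedy action gets probability 0.85 and each other action gets 0.05, so no action is ever ruled out.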



On the topic of stochastic policies, there are two approaches to learning:

 - On-policy learning:
	 - Generates samples using the same policy $\pi$ that we're going to optimize.
 - Off-policy learning:
	 - Generates samples with an exploratory policy $b$ that is different from the policy $\pi$ we're going to optimize.
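The difference comes down to which policy picks the actions while collecting experience. The sketch below shows only that part, not the value update; it assumes an epsilon-greedy policy for the on-policy case and a uniformly random behaviour policy $b$ for the off-policy case, and every name in it is hypothetical.

```python
import random

ACTIONS = (0, 1, 2, 3)  # assumed action set

def greedy(q_values):
    """Target policy pi in the off-policy case: always the best-looking action."""
    return max(ACTIONS, key=lambda a: q_values[a])

def epsilon_greedy(q_values, rng, epsilon=0.2):
    """On-policy case: this one policy both acts and gets improved."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    return greedy(q_values)

def uniform_random(q_values, rng):
    """One possible exploratory behaviour policy b: ignore q-values entirely."""
    return rng.choice(ACTIONS)

def collect_actions(acting_policy, q_values, rng, n=1000):
    """Record which actions the acting policy chooses in one state."""
    return [acting_policy(q_values, rng) for _ in range(n)]

rng = random.Random(0)
q_values = [0.1, 0.7, 0.2, 0.0]  # made-up estimates for a single state

# On-policy: samples come from the same epsilon-greedy policy we optimize.
on_policy_actions = collect_actions(epsilon_greedy, q_values, rng)

# Off-policy: b generates the samples, while the greedy pi is what we optimize.
off_policy_actions = collect_actions(uniform_random, q_values, rng)
target_action = greedy(q_values)
```

In the off-policy case the sampled actions often differ from what $\pi$ would have chosen, which is what lets $\pi$ stay greedy while $b$ keeps exploring.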

