#Monte Carlo Methods
Unlike dynamic programming, the policy picks the action with the highest q(s,a) rather than relying on v(s). In Monte Carlo methods the model of the environment is not available, so we cannot use the state-transition probabilities to turn state values into action choices. Instead, we can only pick actions based on an ESTIMATE of the value, which in this case is q(s,a) \\

A family of methods that learn the optimal values v_*(s) or q_*(s,a) from samples \\

They approximate the values by interacting with the environment to generate sample returns and averaging them \\
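
As a minimal sketch of "averaging sample returns", the snippet below computes the discounted return of each sampled episode and averages them. The function names and the reward lists are hypothetical and only serve as illustration; in practice the rewards would come from interacting with the environment.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return of one episode: r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
    G = 0.0
    # Walk backwards so each step is a single multiply-add
    for r in reversed(rewards):
        G = r + gamma * G
    return G


def mc_value_estimate(episodes_rewards, gamma=0.99):
    """Monte Carlo estimate of a start-state value: the mean of sampled returns."""
    returns = [discounted_return(rewards, gamma) for rewards in episodes_rewards]
    return sum(returns) / len(returns)


# Two hypothetical episodes that start from the same state
sampled_rewards = [[1.0, 1.0, 1.0], [0.0, 0.0, 2.0]]
estimate = mc_value_estimate(sampled_rewards, gamma=0.5)
```

The estimate only becomes reliable as more episodes are averaged; two episodes are used here just to keep the arithmetic easy to follow.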

#Advantages
* The estimate for one state does not depend on the estimates of the other states
* The cost of estimating the value of a state is independent of the total number of states
* No model of the environment is needed

#Stochastic Policies
* On-policy learning: generate samples using the same policy $π$ that we are optimizing
* Off-policy learning: generate samples with an exploratory policy $b$ that is different from the one we are optimizing


**On Policy Learning** \\
$\epsilon$-greedy policy: with probability $ϵ$ select a uniformly random action, and with probability $1-ϵ$ select the action with the highest Q(s,a)

Greedy action: $$π(a|s) = 1-ϵ+ϵ_r \quad \text{when } a = a^*$$
Non-greedy action: $$π(a|s) = ϵ_r \quad \text{when } a \neq a^*$$

Here $a^*$ is the greedy action under the current q-value estimates, and $ϵ_r$ is the probability that the random branch picks any one particular action: $$ϵ_r=\frac{ϵ}{|A|}$$
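
The probabilities above can be sketched in code as follows. This is an illustrative helper, not the notebook's own `policy` function: the names `epsilon_greedy_probs` and `epsilon_greedy_action` are hypothetical, the q-values are passed in as a plain list for a single state, and ties are broken in favor of the first maximal action.

```python
import random


def epsilon_greedy_probs(q_values, epsilon=0.2):
    """Return pi(a|s) for every action, given the q-value estimates of one state."""
    n_actions = len(q_values)
    eps_r = epsilon / n_actions  # probability of any single action in the random branch
    best = max(range(n_actions), key=lambda a: q_values[a])  # greedy action a*
    return [1 - epsilon + eps_r if a == best else eps_r for a in range(n_actions)]


def epsilon_greedy_action(q_values, epsilon=0.2, rng=random):
    """Sample one action index according to the epsilon-greedy probabilities."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Hypothetical q-value estimates for a state with four actions
probs = epsilon_greedy_probs([1.0, 3.0, 2.0, 0.0], epsilon=0.2)
action = epsilon_greedy_action([1.0, 3.0, 2.0, 0.0], epsilon=0.2)
```

Note that the greedy action can also be chosen by the random branch, which is why its probability is $1-ϵ+ϵ_r$ rather than $1-ϵ$.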


#Algorithm:

In [None]:
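import numpy as np

def on_policy_mc_control(policy,action_values,episodes,gamma=0.99,epsilon=0.2):
    # Assumption: `env` is a Gym-style environment defined elsewhere in the notebook
    # (reset() returns a state, step() returns a 4-tuple) and states are hashable.

    sa_returns = {} # maps each (state, action) pair to the list of returns observed for it

    for episode in range(1,episodes+1):
        # Generate one complete episode by following the epsilon-greedy policy
        state = env.reset()
        done = False
        transitions = []

        while not done:
            action = policy(state,epsilon)
            next_state, reward, done, _ = env.step(action)
            transitions.append([state,action,reward])
            state = next_state

        # Walk the episode backwards, accumulating the discounted return G
        G = 0

        for state_t, action_t, reward_t in reversed(transitions):
            G = reward_t + gamma * G

            if (state_t, action_t) not in sa_returns:
                sa_returns[(state_t,action_t)] = []

            # Every-visit update: Q(s,a) becomes the mean of all returns seen for (s,a)
            sa_returns[(state_t,action_t)].append(G)
            action_values[state_t][action_t] = np.mean(sa_returns[(state_t,action_t)])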

#Off Policy Learning
Find the optimal actions while taking sub-optimal actions to explore the environment \\

In off-policy learning, we separate exploration from optimization and use a different policy for each \\

* Exploratory policy $b(a|s)$: performs the task and collects the experience. The experience is a trajectory containing S, A, R at each time step

* Target policy $π(a|s)$: updated using the experience collected by the exploratory policy

The exploratory policy has to cover every action that the target policy can take, i.e. if $π(a|s) > 0$ then $b(a|s) > 0$

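**Problem:** returns generated by $b$ estimate the values of $b$, not of $π$: $\mathbb{E}_b[G_t|S_t=s,A_t=a]=q_b(s,a)$

**Importance Sampling:** $W_t=\prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{b(A_k|S_k)}$

**Approach the goal:** weighting each return by $W_t$ recovers the value under the target policy: $\mathbb{E}_b[W_tG_t|S_t=s]=v_\pi(s)$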
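
To make the importance sampling weight concrete, here is a small sketch that computes $W_t$ for the remainder of a trajectory. The function name, the `target_prob(action, state)` and `behavior_prob(action, state)` callables, and the example policies are all assumptions introduced for illustration.

```python
def importance_weight(state_actions, target_prob, behavior_prob):
    """Product of pi(A_k|S_k) / b(A_k|S_k) over the given (state, action) steps."""
    W = 1.0
    for state, action in state_actions:
        b = behavior_prob(action, state)
        if b == 0:
            # Coverage is violated: b can never have generated this action
            raise ValueError("behavior policy assigns zero probability to a taken action")
        W *= target_prob(action, state) / b
    return W


# Hypothetical policies over two actions {0, 1}:
# the target policy always picks action 0, the behavior policy picks uniformly.
def target_prob(action, state):
    return 1.0 if action == 0 else 0.0


def behavior_prob(action, state):
    return 0.5


w_match = importance_weight([("s0", 0), ("s1", 0)], target_prob, behavior_prob)
w_mismatch = importance_weight([("s0", 0), ("s1", 1)], target_prob, behavior_prob)
```

A trajectory that the target policy would also have followed gets a weight above 1, while a trajectory containing an action the target policy never takes gets weight 0 and contributes nothing to the estimate.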

Update rules:


1.   Store the return $G_t$ (weighted by $W_t$) for each t and compute the average
2.   Every time a new return is observed, update the Q value
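
The two update rules can be combined into an incremental average, sketched below under some assumptions. It uses ordinary importance sampling (a plain mean of weighted returns), and the names `off_policy_mc_update`, `Q`, `counts`, `target_prob` and `behavior_prob` are hypothetical. Because the estimate here is an action value, where $A_t$ is already given, the weight applied to $Q(S_t,A_t)$ covers only the steps after t, so it is updated after the Q update.

```python
from collections import defaultdict


def off_policy_mc_update(transitions, Q, counts, target_prob, behavior_prob, gamma=0.99):
    """Update Q in place from one episode of (state, action, reward) generated by b."""
    G = 0.0  # discounted return from the current step onwards
    W = 1.0  # importance weight of the steps AFTER the current one
    for state, action, reward in reversed(transitions):
        G = reward + gamma * G
        key = (state, action)
        counts[key] += 1
        # Incremental mean of the weighted returns W * G observed for (state, action)
        Q[key] += (W * G - Q[key]) / counts[key]
        # Fold this step's ratio in for the earlier time steps
        W *= target_prob(action, state) / behavior_prob(action, state)


# Hypothetical policies: target always picks action 0, behavior is uniform over {0, 1}
def target_prob(action, state):
    return 1.0 if action == 0 else 0.0


def behavior_prob(action, state):
    return 0.5


Q = defaultdict(float)
counts = defaultdict(int)
off_policy_mc_update([("s0", 0, 1.0), ("s1", 0, 2.0)], Q, counts,
                     target_prob, behavior_prob, gamma=0.5)
off_policy_mc_update([("s0", 0, 1.0), ("s1", 1, 0.0)], Q, counts,
                     target_prob, behavior_prob, gamma=0.5)
```

The incremental form gives the same result as storing every weighted return and averaging, but it keeps only a running mean and a counter per state-action pair.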


