<a href="https://colab.research.google.com/github/RLWH/reinforcement-learning-notebook/blob/master/4.%20Model%20Free%20Control/Ch5(b)_Monte_Carlo_Control.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Model-Free Control
The best introduction to this chapter is probably the one David Silver gives in his lecture video:
> For everything in the course up to this point is leading to this lecture. We gotta finally find out how can you drop the robot or agent into some unknown environment, and you don't tell it anything about how the environment works, how can it figure out the right thing to do.

In previous chapters, we saw how to estimate the value of each state with different methods, including Monte Carlo, one-step TD and TD($\lambda$). In this chapter, we carry those ideas over to control and look for optimal policies.

We will talk about:
1. On-policy Monte-Carlo Control
2. On-policy TD Learning
3. Off-policy Learning

#### On and Off-policy learning
- On-policy learning
  - "Learn on the job"
  - Learn about policy $\pi$ from experience sampled from $\pi$

- Off-policy learning
  - "Look over someone's shoulder"
  - Learn about policy $\pi$ from experience sampled from $\mu$
  - Learn from others

# 1. On-policy MC Control

## Generalised Policy Iteration
The main idea that we will use here is policy iteration, the same method that was used in dynamic programming.

![Generalised Policy Iteration](https://raw.githubusercontent.com/RLWH/reinforcement-learning-notebook/master/images/generalised_policy_iteration.png)

Steps:
1. Policy evaluation
  - Estimate $v_{\pi}$
  - Use sampled trajectories (episodes) to estimate the value function
2. Policy Improvement
  - Generate $\pi' \geq \pi$ by greedy selection
  
There's an issue:
In step 2, choosing the greedy action with respect to $V$ requires the rewards and transition probabilities of the environment. In other words, we still need a model of the MDP if we iterate on the state-value function.

Greedy policy
\begin{equation}
\pi'(s) = \text{argmax}_{a \in A} \left( R^{a}_s + \gamma \sum_{s'} P^{a}_{ss'}V(s') \right)
\end{equation}

What can we do?
- We iterate on the action-value function $Q(s,a)$ instead
- i.e. at each state, we evaluate how good it is to take each action, so the greedy action can be read off without a model

\begin{equation}
 \pi'(s) = \text{argmax}_{a \in A} Q(s,a)
\end{equation}
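
The difference between the two greedy rules can be sketched in a few lines. This is only an illustration under assumptions: the tabular arrays `R`, `P`, `V` and `Q`, the discount `gamma`, and the tiny 2-state MDP are all hypothetical and not part of the text above.

```python
import numpy as np

def greedy_from_v(R, P, V, gamma):
    """Greedy policy from state values: needs the model (R and P)."""
    # R: (n_states, n_actions), P: (n_states, n_actions, n_states), V: (n_states,)
    q = R + gamma * (P @ V)        # one-step lookahead through the model
    return np.argmax(q, axis=1)

def greedy_from_q(Q):
    """Greedy policy from action values: no model required."""
    return np.argmax(Q, axis=1)

# Hypothetical 2-state, 2-action MDP, for illustration only.
R = np.array([[0.0, 1.0], [0.0, 0.0]])
P = np.zeros((2, 2, 2))
P[0, 0, 1] = 1.0   # state 0, action 0 -> state 1
P[0, 1, 0] = 1.0   # state 0, action 1 -> state 0
P[1, 0, 0] = 1.0   # state 1, action 0 -> state 0
P[1, 1, 1] = 1.0   # state 1, action 1 -> state 1
V = np.array([0.0, 10.0])

policy_v = greedy_from_v(R, P, V, gamma=0.9)

# Here Q is built from the model just to have numbers to work with;
# in model-free control it would be estimated from sampled returns.
Q = R + 0.9 * (P @ V)
policy_q = greedy_from_q(Q)
```

Note that `greedy_from_q` never touches `R` or `P`, which is the reason for switching to $Q(s,a)$ in the model-free setting.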

## Improvement: Generalised Policy Iteration with Action-Value function

1. Start with an action-value function $Q$ and some policy $\pi$
2. Policy evaluation: estimate $Q(s,a)$ as the mean return observed after each state-action pair
3. Policy improvement: choose actions greedily with respect to $Q$
4. Iterate

![Policy Iteration on Q](https://raw.githubusercontent.com/RLWH/reinforcement-learning-notebook/master/images/policy_iteration_on_q.png)

There is still an issue:
- If we always act greedily, we can get stuck on a suboptimal policy, because some actions are never explored.

### $\epsilon$-greedy exploration
This is the simplest idea for ensuring continual exploration. 

- Suppose there are $m$ actions
- Define a small probability $\epsilon$
  - With probability $1-\epsilon$ choose the greedy action
  - With probability $\epsilon$ choose an action uniformly at random (the greedy action included)
  
#### Formulation
Note that the $\epsilon$-greedy policy is a stochastic policy: every action has a non-zero probability of being selected.
\begin{equation}
\pi(a|s) = 
\begin{cases}
    \frac{\epsilon}{m} + 1 - \epsilon, & \text{if } a = \text{argmax}_{a' \in A} Q(s,a') \\
    \frac{\epsilon}{m}, & \text{otherwise}
\end{cases}
\end{equation}
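
As a minimal sketch of this formulation, the probabilities can be computed for one state as follows. The function names and the example values are hypothetical, and ties in the argmax are broken in favour of the first maximal action, which is an assumption the text does not specify.

```python
import numpy as np

def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(.|s) for one state, given that state's action values."""
    m = len(q_values)
    probs = np.full(m, epsilon / m)               # every action gets epsilon / m
    probs[np.argmax(q_values)] += 1.0 - epsilon   # greedy action gets the rest
    return probs

def epsilon_greedy_action(q_values, epsilon, rng):
    """Sample one action from the epsilon-greedy distribution."""
    return int(rng.choice(len(q_values), p=epsilon_greedy_probs(q_values, epsilon)))

# Hypothetical action values for a state with m = 4 actions.
q_values = np.array([0.0, 2.0, 1.0, -1.0])
probs = epsilon_greedy_probs(q_values, epsilon=0.2)

rng = np.random.default_rng(0)
action = epsilon_greedy_action(q_values, epsilon=0.2, rng=rng)
```

With $\epsilon = 0.2$ and $m = 4$, the greedy action receives $0.05 + 0.8 = 0.85$ and each of the others $0.05$.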

#### Theorem
For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_{\pi}$ is an improvement, i.e. $v_{\pi'}(s) \geq v_{\pi}(s)$ for all states $s$.
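
A sketch of the standard argument, assuming $m$ actions, $\epsilon < 1$, and writing $q_{\pi}(s, \pi'(s))$ for the expected value of following $\pi'$ for one step and $\pi$ afterwards:

\begin{equation}
\begin{split}
q_{\pi}(s, \pi'(s)) &= \sum_{a \in A} \pi'(a|s) q_{\pi}(s,a) \\
&= \frac{\epsilon}{m} \sum_{a \in A} q_{\pi}(s,a) + (1-\epsilon) \max_{a \in A} q_{\pi}(s,a) \\
&\geq \frac{\epsilon}{m} \sum_{a \in A} q_{\pi}(s,a) + (1-\epsilon) \sum_{a \in A} \frac{\pi(a|s) - \frac{\epsilon}{m}}{1-\epsilon} q_{\pi}(s,a) \\
&= \sum_{a \in A} \pi(a|s) q_{\pi}(s,a) = v_{\pi}(s)
\end{split}
\end{equation}

The inequality holds because $\pi$ is itself $\epsilon$-greedy, so the weights $\frac{\pi(a|s) - \epsilon/m}{1-\epsilon}$ are non-negative and sum to 1, and a weighted average can never exceed the maximum. The policy improvement theorem then gives $v_{\pi'}(s) \geq v_{\pi}(s)$.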


## GLIE policy (Greedy in the Limit with Infinite Exploration)
How can we actually find the optimal policy? We need to balance two things: keep exploring forever, while asymptotically converging to a greedy policy.

#### Properties
1. All state-action pairs are explored infinitely many times, so that every part of the state-action space keeps being tried
\begin{equation}
\lim_{k \to \infty} N_k(s,a) = \infty
\end{equation}

2. The policy eventually converges to a greedy policy, i.e. a deterministic policy that maximises the Q value

\begin{equation}
\lim_{k \to \infty} \pi_{k}(a|s) = \mathop{\mathbb{1}}(a = \text{argmax}_{a' \in A}Q_k(s, a'))
\end{equation}

For instance, $\epsilon$-greedy is GLIE if $\epsilon$ decays to zero with $\epsilon_k = \frac{1}{k}$

## GLIE Monte-Carlo Control

1. First, sample the $k$th episode using $\pi$: $\{S_1, A_1, R_2, S_2, A_2, \ldots, S_T\} \sim \pi$
2. For each state $S_t$ and action $A_t$ in the episode,

 \begin{equation}
 \begin{split}
 N(S_t, A_t) & \leftarrow N(S_t, A_t) + 1 \\
 Q(S_t, A_t) & \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}(G_t - Q(S_t, A_t))
 \end{split}
 \end{equation}
 
3. Improve policy based on new action-value function

 \begin{equation}
 \begin{split}
 \epsilon &\leftarrow \frac{1}{k} \\
 \pi &\leftarrow \epsilon\text{-greedy}(Q)
 \end{split}
 \end{equation}
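
The three steps above can be sketched as follows. This is an illustration under assumptions rather than a reference implementation: the small two-step MDP in `TRANSITIONS`, the function names and the default arguments are all hypothetical. The update is applied at every visit of a pair, which coincides with first-visit here because no pair can repeat within an episode of this MDP.

```python
import numpy as np

# Hypothetical two-step MDP, for illustration only:
# TRANSITIONS[state][action] = (reward, next_state); next_state None ends the episode.
TRANSITIONS = {
    0: {0: (0.0, 1), 1: (1.0, 2)},
    1: {0: (5.0, None), 1: (0.0, None)},
    2: {0: (0.0, None), 1: (1.0, None)},
}
N_STATES, N_ACTIONS = 3, 2

def sample_episode(Q, epsilon, rng):
    """Step 1: roll out one episode with the epsilon-greedy policy derived from Q."""
    episode, state = [], 0
    while state is not None:
        probs = np.full(N_ACTIONS, epsilon / N_ACTIONS)
        probs[np.argmax(Q[state])] += 1.0 - epsilon
        action = int(rng.choice(N_ACTIONS, p=probs))
        reward, next_state = TRANSITIONS[state][action]
        episode.append((state, action, reward))
        state = next_state
    return episode

def glie_mc_control(n_episodes, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    N = np.zeros((N_STATES, N_ACTIONS), dtype=int)
    for k in range(1, n_episodes + 1):
        epsilon = 1.0 / k                     # Step 3: GLIE schedule
        episode = sample_episode(Q, epsilon, rng)
        G = 0.0
        # Step 2: walk backwards so G accumulates the discounted return G_t.
        for state, action, reward in reversed(episode):
            G = gamma * G + reward
            N[state, action] += 1
            Q[state, action] += (G - Q[state, action]) / N[state, action]
    return Q, N

Q, N = glie_mc_control(n_episodes=2000)
greedy_policy = np.argmax(Q, axis=1)
```

The policy is never stored explicitly: since $\pi$ is always $\epsilon$-greedy with respect to the current $Q$, it is recomputed from `Q` and `epsilon` whenever an action is needed.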

## Monte Carlo Exploring Starts Algorithm
Monte Carlo policy iteration alternates between evaluation and improvement on an episode-by-episode basis. After each episode, the observed returns are used for policy evaluation, and then the policy is improved at all the states visited in the episode.

#### Pseudo Code
---
```
Initialise:
  pi(s) in A(s) arbitrarily, for all s in S
  Q(s,a) arbitrarily, for all s in S, a in A(s)
  Returns(s,a) = empty list, for all s in S, a in A(s)

Loop forever (for each episode):
  Choose S0 in S, A0 in A(S0) randomly such that all pairs have probability > 0
  Generate an episode from S0, A0, following pi: S0, A0, R1, S1, A1, ..., RT
  G = 0
  Loop for each step of episode, t = T-1, T-2, ..., 0:
    G = gamma * G + R_{t+1}
    Unless the pair St, At appears in S0, A0, S1, A1, ..., St-1, At-1:
      Append G to Returns(St, At)
      Q(St, At) = Average(Returns(St, At))
      pi(St) = argmax_a(Q(St, a))
```
---
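
A possible Python rendering of the pseudo code, as a sketch under assumptions: the two-step MDP in `TRANSITIONS`, the function names, the all-ones initial policy and the uniform choice of the starting pair are hypothetical choices made for illustration.

```python
import numpy as np

# Hypothetical two-step MDP, for illustration only:
# TRANSITIONS[state][action] = (reward, next_state); next_state None ends the episode.
TRANSITIONS = {
    0: {0: (0.0, 1), 1: (1.0, 2)},
    1: {0: (5.0, None), 1: (0.0, None)},
    2: {0: (0.0, None), 1: (1.0, None)},
}
N_STATES, N_ACTIONS = 3, 2

def generate_episode(policy, start_state, start_action):
    """Start from the given pair, then follow the deterministic policy."""
    episode, state, action = [], start_state, start_action
    while state is not None:
        reward, next_state = TRANSITIONS[state][action]
        episode.append((state, action, reward))
        state = next_state
        if state is not None:
            action = int(policy[state])
    return episode

def mc_exploring_starts(n_episodes, gamma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    policy = np.ones(N_STATES, dtype=int)      # arbitrary initial policy
    Q = np.zeros((N_STATES, N_ACTIONS))
    returns = {(s, a): [] for s in range(N_STATES) for a in range(N_ACTIONS)}
    for _ in range(n_episodes):
        # Exploring start: every state-action pair has probability > 0.
        s0 = int(rng.integers(N_STATES))
        a0 = int(rng.integers(N_ACTIONS))
        episode = generate_episode(policy, s0, a0)
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            s, a, r = episode[t]
            G = gamma * G + r
            # First-visit check: skip if the pair occurred earlier in the episode.
            earlier_pairs = [(x, y) for x, y, _ in episode[:t]]
            if (s, a) not in earlier_pairs:
                returns[(s, a)].append(G)
                Q[s, a] = np.mean(returns[(s, a)])
                policy[s] = int(np.argmax(Q[s]))
    return Q, policy

Q, policy = mc_exploring_starts(n_episodes=500)
```

In this toy MDP the myopic choice at state 0 is action 1 (immediate reward 1), while action 0 leads to the larger total return of 5, so the learned policy should prefer action 0 there once both pairs have been sampled.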