# Back to Gym - Frozen Lake

Gym is a toolkit for developing and comparing reinforcement learning algorithms.

https://gym.openai.com/docs/

It provides various environments, including video games and control problems. Let's start with toy examples

https://gym.openai.com/envs/#toy_text

in particular with Frozen Lake problem

https://gym.openai.com/envs/FrozenLake8x8-v0/

<img src="http://1.bp.blogspot.com/-P2GC1uKB-ss/UqHSAXOIXhI/AAAAAAAAAMU/JFNXwAFmV1c/s1600/800px-Frozen_Lake_-_Kosovo.JPG" />

Let's see how it is implemented in Gym.

In [None]:
import gym
from IPython.display import clear_output
import time
env = gym.make('FrozenLake8x8-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        clear_output()
        env.render()
        time.sleep(0.999)
        #print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            #print("Episode finished after {} timesteps".format(t+1))
            break

<br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br /><br />













We have four possible actions - directions where we try to move.

"Try" means ice is slippery and our step might result in a different state than expected. We don't know exactly how. **Welcome to Reinforcement Learning!**

In [None]:
env.action_space

The documentation does not tell us which direction is north and which direction is south. We have to figure it out during the interaction with the environment. **Welcome to Reinforcement Learning!**

The number of states is $8\times 8$, i.e. $64$. They correspond to the positions on the lake.

In [None]:
env.observation_space

    SFFFFFFF
    FFFFFFFF
    FFFHFFFF
    FFFFFHFF
    FFFHFFFF
    FHHFFFHF
    FHFFHFHF
    FFFHFFFG
    
Where 

- `S` is start
- `F` is frozen lake -  you can move there
- `H` is hole - you will fall there and end the episode
- `G` is the place which you want to reach

For this environment, we want to find the optimal policy $\pi^{*}$ so we can safely reach the goal without falling into a hole.

# Monte Carlo Methods
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/2/2f/Monaco_Monte_Carlo_1.jpg/1200px-Monaco_Monte_Carlo_1.jpg" />

<a href="https://en.wikipedia.org/wiki/Monte_Carlo">Monte Carlo</a>, part of Monacco known as a place of cassionos, gave the name to <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method">methods based on random sampling</a>.

In Monte Carlo, we assume that we don't have the model $p(s',r|s,a)$ of the environment.

## Monte Carlo Prediction

We are given a policy $\pi$, we want to know $v_{\pi}(s)$ for all $s$.

It is suitable for *episodic* tasks only. The whole idea is that for given policy $\pi$ we:

* Sample an episode 
* For each state $s$ visited during the episode, we quantify the return $G$
* We calculate the value function for that state as $v_\pi(s)$ of all returns that relate to $s$ from all already sampled episodes
* This process is iterated

The backup diagram looks like this:

<img src="https://dnddnjs.gitbooks.io/rl/content/MC5.png" width="50%"/>

**Question**:

* Do we need *bootstrapping*, i.e. $v_\pi(s')$ to update $v_\pi(s)$? <a href="https://datascience.stackexchange.com/questions/26938/what-exactly-is-bootstrapping-in-reinforcement-learning">Hint<a/>.

Update strategies:

* First-visit - for a state, we consider only its first visit in the episode
* Every-visit - for a state, we consider all visits in the episode

### Example
Let's implement the algorithm for Frozen Lake. First, we have to define $\gamma$. Note that $\gamma$ is not part of the environment.

**Question**:

* Why?


In [1]:
gamma = 0.999

We also need a policy. Let's keep it simple for now:

In [2]:
import numpy as np
def pi(s):
    return np.random.randint(0,4)

In [4]:
import gym
env = gym.make('FrozenLake8x8-v0')
V = {i:0 for i in range(64)}
N = {i:0 for i in range(64)}
for i_episode in range(20000):
    observation = env.reset()
    state_reward_pairs = []
    for t in range(100):
        action = pi(observation)
        observation_old = observation
        observation, reward, done, info = env.step(action)
        state_reward_pairs.append((observation_old,reward))
        
        if done:
#             print(done)
            break
    G = 0
    for pair in state_reward_pairs[::-1]:
        state, reward = pair
        G = reward + gamma*G
        N[state] += 1
        V[state] += G/N[state]

print(V)
        

[2018-11-02 04:56:17,894] Making new env: FrozenLake8x8-v0


{0: 0.003716225850810045, 1: 0.0054641323289057984, 2: 0.005623687092938593, 3: 0.007611951468950891, 4: 0.014299666409243813, 5: 0.025236619093151107, 6: 0.04494507446456874, 7: 0.034153000196943964, 8: 0.004468929987904574, 9: 0.0072372157661682816, 10: 0.004416078505074799, 11: 0.006401682323738126, 12: 0.014912218676564195, 13: 0.020759443305580022, 14: 0.03678730683357182, 15: 0.033045262556350714, 16: 0.004400435245163378, 17: 0.004702209999103089, 18: 0.0007064126900516699, 19: 0, 20: 0.012233680523468961, 21: 0.02151240696090669, 22: 0.048426209515567996, 23: 0.05907103878743424, 24: 0.00648399563168673, 25: 0.0046670045794323526, 26: 0.002694458376934921, 27: 0.006530433257746656, 28: 0.00936777407669713, 29: 0, 30: 0.06854653585616957, 31: 0.09889376604180707, 32: 0.006211352564805115, 33: 0.0030274178754337157, 34: 0.0006817123099151571, 35: 0, 36: 0.004930526945899872, 37: 0.00629123405072781, 38: 0.07158463777403194, 39: 0.2724575445330732, 40: 0.0, 41: 0, 42: 0, 43: 0.0, 

## Monte Carlo Estimation of Action Values

In this case, we want to estimate the action-value function $q_\pi (s,a)$ for given policy $\pi$.

**Question**

* Why is this more important than in case of Dynamic Programming?



In [None]:
env = gym.make('FrozenLake8x8-v0')
Q = {i:{j:0 for j in range(4)} for i in range(64)}
N = {i:{j:0 for j in range(4)} for i in range(64)}
for i_episode in range(20000):
    observation = env.reset()
    state_action_reward_tuples = []
    for t in range(100):
        action = pi(observation)
        observation_old = observation
        observation, reward, done, info = env.step(action)
        state_action_reward_tuples.append((observation_old,action,reward))
        
        if done:
#             print(done)
            break
    G = 0
    for my_tuple in state_action_reward_tuples[::-1]:
        state, action, reward = my_tuple
        G = reward + gamma*G
        N[state][action] += 1
        Q[state][action] += G/N[state][action]

print(Q)

This evaluation can assure convergence if each action-state pair is selected *asymptotically many times*. This makes the situation similar to *bandits*.

We will see that this needs with Monte Carlo *special care* to be assured.

## Monte Carlo Control

Similarly to DP, we can iterate *policy evaluation* and *policy improvement*. If the *policy evaluation* is based on MC approach, we speak about *MC policy iteration*.

We will use the principle of *generalized policy iteration* as we will be selective regarding the states that are being updated. 

We assume that the evaluation of $q_{\pi_k}$ is perfect which is assured if:

* Infinite number of episodes
* Each episode starts with random pair $(s,a)$

Then, the policy improvement is much simpler than in case of DP:

$$
\pi_{k+1}(s) = \arg\max_{a} q_{\pi_k}(s,a)
$$

Will the policy improvement work?

$$
q_{\pi_k}(s,\pi_{k+1}(s)) = q_{\pi_k}(s,\arg\max_a q_{\pi_k}(s,a))
$$

$$
= \max_a q_{\pi_k}(s,a)
$$

$$
\geq q_{\pi_k}(s,\pi_k(s))
$$

$$
= v_{\pi_k}(s)
$$

The conditions for the evaluation are too strict and practically not useful. Two approaches how to cope with that:

- To keep evaluating $q_{\pi_k}$ until the changes are small.
- After one episode, do directly the policy improvement step.

Let's see the second option in detail.

### Monte Carlo Exploring Starts

At first, we have to be sure that we can start from any $(s,a)$.

This is somehow arfificial, we have to make the environment to be in arbitrary:

In [3]:
env.env.s = 1
observation,reward,_,_ = env.step(3)
[observation,reward]

NameError: name 'env' is not defined

Initialization:

In [None]:
Q = {i:{j:0 for j in range(4)} for i in range(64)}
N = {i:{j:0 for j in range(4)} for i in range(64)}
pi = {i:0 for i in range(64)}


The main loop:

In [None]:
for k in range(20000):
    s = np.random.randint(0,64)
    a = np.random.randint(0,4)
    env.reset()
    env.env.s = s
    observation,reward,done,info = env.step(a)
    state_action_reward_tuples = [(s,a,reward)]
    t = 0
    while not done:
        action = pi[observation]
        observation_old = observation
        observation, reward, done, info = env.step(action)
        state_action_reward_tuples.append((observation_old,action,reward))
        t=+1
        if done or t==99:
            break
        
    G = 0
    visited_states = []
    for my_tuple in state_action_reward_tuples[::-1]:
        state, action, reward = my_tuple
        visited_states.append(state)
        G = reward + gamma*G
        N[state][action] += 1
        Q[state][action] += G/N[state][action]
    
    for state in list(set(visited_states)):
        best_value = -10000000000
        best_action = None
        for a in range(4):
            if Q[state][a]>best_value:
                best_value = Q[state][a]
                best_action = a
        pi[state] = best_action
    print(pi)

In [None]:
Q

## Monte Carlo Control without Exploring States (On Policy)

Exploring initial states - not very realistic. Even with `gym` we had to use less standard API `env.env.s` to change the initial state. In some systems, this will not be possible.

Another approach is to ensure that all actions will be selected infinitely often. There are two ways how to cope with that:

* On-policy - we improve the policy that is used for exploration
* Off-policy - the exploration has own policy; we update another one

For the on-policy approach, we assume that the policy is *soft*, i.e. all actions are possible $\pi(a|s)>0$ for all $s$ and $a$. We can adopt $\epsilon$-greedy that we know from the world of bandits.

That is:
$$\pi(a|s)=\frac{\epsilon}{|\mathcal{A}(s)|}$$
for all actions with the exception of the greedy action (greedy in terms of $q$):
$$
\pi(a|s) = 1-\epsilon + \frac{\epsilon}{|\mathcal{A}(s)|}
$$

We will need a way how to sample actions proportionally randomly:

In [None]:
np.random.choice(list(range(3)),p=[0.5,0.4,0.1],size=20)

This will result in a slightly different initialization:

In [None]:
epsilon = 0.8
Q = {i:{j:0 for j in range(4)} for i in range(64)}
N = {i:{j:0 for j in range(4)} for i in range(64)}
pi = np.ones([64,4])/4
def sample_action(policy):
    return np.random.choice(list(range(len(policy))),p=policy)
sample_action(pi[0,:])

In [None]:
for k in range(20000):
    observation = env.reset()
    state_action_reward_tuples = []
    t = 0
    done = False
    while not done:
        action = sample_action(pi[observation,:])
        observation_old = observation
        observation, reward, done, info = env.step(action)
        state_action_reward_tuples.append((observation_old,action,reward))
        t=+1
        if done or t==99:
            break
        
    G = 0
    visited_states = []
    for my_tuple in state_action_reward_tuples[::-1]:
        state, action, reward = my_tuple
        visited_states.append(state)
        G = reward + gamma*G
        N[state][action] += 1
        Q[state][action] += G/N[state][action]
    
    for state in list(set(visited_states)):
        best_value = -10000000000
        best_action = None
        for a in range(4):
            if Q[state][a]>best_value:
                best_value = Q[state][a]
                best_action = a
        
        for action in range(4):
            pi[state,action] = epsilon/4
        pi[state,best_action] = 1 - epsilon + epsilon/4
        
    print(pi[visited_states,:])

In [None]:
Q

$$
q_\pi(s,\pi'(s)) = \sum_a \pi'(a|s) q_\pi(s,a)
$$

$$
= \frac{\epsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s,a) + (1-\epsilon)\max_a q_\pi(s,a)
$$

$$
\geq \frac{\epsilon}{|\mathcal{A}(s)|} \sum_a q_\pi(s,a) + (1-\epsilon)\max_a q_\pi(s,a)
$$

## Off-policy Prediction via Importance Sampling

* We want to learn the optimal, i.e. *target* policy $\pi$ - responsible for exploitation
* We interact with the environment based on *behavior* policy $b$ - responsible for explorations

We say that $b$ has coverage of $\pi$ iff $\pi(a|s)>0$ implies $b(a|s)>0$.

We want calculate $q$ from the results based on behavior policy. The trick is *weight average* and the weights express the match with the estimation policy.


We can decomposeeach of them by <a href="https://en.wikipedia.org/wiki/Chain_rule">chain rule</a> and based on Markov property like this:
$$
\prod_{k=1}^{T-1}\pi(A_k|S_k)p(S_{k+1}|s_k,a_k)
$$

Given a trajectory, we can calculate its relative probability under target and estimation policies

$$\rho_{t:T-1} = 
\frac{
\prod_{k=1}^{T-1}\pi(A_k|S_k)p(S_{k+1}|s_k,a_k)
}{
\prod_{k=1}^{T-1}b(A_k|S_k)p(S_{k+1}|s_k,a_k)
}$$
we can reduce the fraction and get
$$
=  
\frac{
\prod_{k=1}^{T-1}\pi(A_k|S_k)
}{
\prod_{k=1}^{T-1}b(A_k|S_k)
}$$
which does not depent on model $p$ (otherwise, we could not speak about Monte Carlo method).

We distinguish *ordinary* importance sampling
$$
V(s) = \frac{\sum_{t \in \mathcal{T}} \rho_{t:T(t-1)}G_t}{|\mathcal{T}(s)|}
$$

and *weighted* importance sampling
$$
V(s) = \frac{\sum_{t \in \mathcal{T}} \rho_{t:T(t-1)}G_t}{\sum_{t \in \mathcal{T}} \rho_{t:T(t-1)}}
$$

Note: even if we have the notation here in terms of $V$, we can similarly do it for $Q$. The only difference is that we have less letters in the notation.

**Question:**

- What is the difference between these two?

## Incremental Implementation

* For ordinary importance sampling: no problem, we simply calculate the weighted $G_t$ and then update it recursively (increasing the nominator by 1)
* For weighted importance sampling: something else is needed (similar principles will be followed):

Let's have some returns for one state $G_1,G_2,\dots G_{n-1}$  and the corresponding importance weights $W_1,W_2,\dots W_{n-1}$. We want to estimate:

$$
V_n = \frac{\sum_{k=1}^{n-1} W_k G_k}{\sum_{k=1}^{n-1} W_k}
$$

similarly to other incremental implementations, we will maintain the nominator as a cummulative sum $C_n = \sum_{k=1}^{n-1} W_k$. Then we can do the incremental update like this:
$$
V_{n+1} = V_n + \frac{W_n}{C_n}\left[G_n-V_n\right]
$$

and
$$
C_{n+1} = C_n + W_{n+1}
$$

## Off-policy Monte Carlo Control
### Estimation of $Q$ for $\pi$

- Input: policy $\pi$
- Initialize $Q$ arbitrarily and $C(s,a)\gets=0$
- In a loop
 - take a behavior policy $b$ with coverage of $\pi$
 - generate an episode using $b$: $S_0,A_0,R_1\dots A_{T-1},R_{T},S_{T}$
 - initiate $G\gets 0$ and $W\gets 1$
 - for $t=T-1,T-2,\dots,0$:
  - update 
   - $G\gets \gamma G + R_{t+1}$
   - $C(S_t,A_t)\gets C(S_t,A_t) + W$
   - $Q(S_t,A_t)\gets Q(S_t,A_t) + \frac{W}{C(S_t,A_t)} [G - Q(S_t,A_t)]$
   - $W\gets W \cdot \frac{\pi(A_t|S_t)}{b(A_t|S_t)}$
  - if $W=0$ exit for loop
  
**Question**:

- Is this *first* visit or *every* visit update?

### Off-Policy Monte Carlo Control $\pi\approx \pi^{*}$

- Input: policy $\pi$
- Initialize $Q$ arbitrarily and $C(s,a)\gets=0$
- $\pi(s)\gets\arg\max_a Q(s,a)$
- In a loop
 - take a behavior policy $b$ with coverage of $\pi$
 - generate an episode using $b$: $S_0,A_0,R_1\dots A_{T-1},R_{T},S_{T}$
 - initiate $G\gets 0$ and $W\gets 1$
 - for $t=T-1,T-2,\dots,0$:
  - update 
   - $G\gets \gamma G + R_{t+1}$
   - $C(S_t,A_t)\gets C(S_t,A_t) + W$
   - $Q(S_t,A_t)\gets Q(S_t,A_t) + \frac{W}{C(S_t,A_t)} [G - Q(S_t,A_t)]$
   - $\pi(S_t)\gets\arg\max_a Q(S_t,a)$
   - if $A_t\neq \pi(S_t)$ then exit for loop
   - $W\gets W \cdot \frac{1}{b(A_t|S_t)}$
   
 

## Summary

- Monte Carlo has several advantages over Dynamic Programming:
-  Learns directly from interaction with environment
 - Full models not needed
 - No need to learn about *all* states
 - Less harm by Markovian violations 
- MC methods provide an alternate policy evaluation process
- Challenge to be addressed: maintaining sufficient exploration
 - Exploring starts, soft policies
- No bootstrapping (as opposed to DP)

# Homework

Obligatory

- Implement the off-policy control with weighted importance sampling for Frozen Lake. As a behavior $b$ use random actions.

Optional

- Create the wrapper for your environment (one of the previous home works)