# Reinforcement Learning - Toy examples with Gym

## FrozenLake

In this notebook, we use functions from the module TD in auxModules that implement the $SARSA$ and $Q-Learning$ methods to solve the FrozenLake environments (https://gym.openai.com/envs/FrozenLake-v0/), comparing different exploration policies. Note that the basic FrozenLake is a stochastic environment since, as mentioned in the documentation, "the ice is slippery, so you won't always move in the direction you intend". We create deterministic variants by calling the function register from gym.envs.registration with is_slippery set to False.

Both algorithms update a tabular estimate of the $Q-function$ using the following update rules:
* $SARSA$ algorithm is an on-policy method using:
$$Q_{t+1}(s_t,a_t) \leftarrow Q_t(s_t,a_t) + \alpha (r_t + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t,a_t))$$

* $Q-Learning$ algorithm is an off-policy method using:
$$Q_{t+1}(s_t,a_t) \leftarrow Q_t(s_t,a_t) + \alpha (r_t + \gamma \max_b Q_t(s_{t+1}, b) - Q_t(s_t,a_t))$$
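
The two rules differ only in the bootstrap term, which the following minimal sketch makes explicit. It is not the code from auxModules.TD: the function names, the default values of `alpha` and `gamma`, and the `done` flag (which drops the bootstrap term on terminal transitions) are assumptions made for illustration.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """On-policy update: bootstrap from the action a_next actually chosen in s_next."""
    # Assumption: on a terminal transition there is no next state to bootstrap from.
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Off-policy update: bootstrap from the greedy (max) action value in s_next."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

# Toy table with the shape of the 4x4 FrozenLake (16 states, 4 actions).
Q_sarsa = np.zeros((16, 4))
Q_sarsa[1] = [0.0, 0.5, 1.0, 0.2]
Q_qlearn = Q_sarsa.copy()

# Same transition (s=0, a=2, r=0, s'=1); SARSA additionally needs a'=1.
sarsa_update(Q_sarsa, s=0, a=2, r=0.0, s_next=1, a_next=1, alpha=0.5, gamma=0.9)
q_learning_update(Q_qlearn, s=0, a=2, r=0.0, s_next=1, alpha=0.5, gamma=0.9)
print(Q_sarsa[0, 2], Q_qlearn[0, 2])
```

With the same transition, SARSA moves toward the value of the action it will actually take next, while Q-Learning moves toward the value of the best action in the next state, whatever the behaviour policy does.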

For exploration, we compare the performance of:
* $\epsilon$-greedy policy with a fixed $\epsilon$.
* $\epsilon$-greedy policy with a decaying $\epsilon$.
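* softmax exploration, which selects action $a_i$ in state $s$ with probability
$$P(a_i \vert s) = \frac{e^{\frac{1}{\tau}Q(s,a_i)}}{\sum_j e^{\frac{1}{\tau}Q(s,a_j)}}$$
where $\tau > 0$ is a temperature parameter: a small $\tau$ makes the policy nearly greedy, a large $\tau$ makes it nearly uniform.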
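
The three exploration strategies can be sketched as below. This is an illustrative sketch and not the implementation in auxModules.TD: all function names are hypothetical, and the exponential decay schedule with its parameters (`eps_start`, `eps_min`, `decay`) is an assumption, since the notebook does not state how $\epsilon$ is decreased.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decayed_epsilon(episode, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Assumed schedule: exponential decay per episode, floored at eps_min."""
    return max(eps_min, eps_start * decay ** episode)

def softmax_probs(q_values, tau=1.0):
    """P(a_i | s) proportional to exp(Q(s, a_i) / tau)."""
    z = np.asarray(q_values, dtype=float) / tau
    z -= z.max()  # subtracting the max leaves the probabilities unchanged but avoids overflow
    p = np.exp(z)
    return p / p.sum()

def softmax_action(q_values, tau, rng):
    """Sample an action index from the softmax distribution."""
    p = softmax_probs(q_values, tau)
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.4, 0.2, 0.3])  # Q(s, .) for one state with 4 actions

a_fixed = epsilon_greedy(q_row, epsilon=0.1, rng=rng)
a_decay = epsilon_greedy(q_row, epsilon=decayed_epsilon(episode=500), rng=rng)
a_soft = softmax_action(q_row, tau=0.5, rng=rng)
print(a_fixed, a_decay, a_soft)
print(softmax_probs(q_row, tau=0.5))
```

Lowering `tau` concentrates the softmax probabilities on the highest-valued action, which plays a role similar to lowering `epsilon` in the $\epsilon$-greedy policies.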

In [1]:
import sys
sys.path.append("../") # go to parent dir
import matplotlib.pyplot as plt
%matplotlib inline
from auxModules.TD import *
from gym.envs.registration import register

## Deterministic 4x4 FrozenLake

In [2]:
register(
    id='FrozenLakeNotSlippery4x4-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name' : '4x4', 'is_slippery': False},
    max_episode_steps=100,
    reward_threshold=0.78, # optimum = .8196
)

In [3]:
compareMethods("FrozenLakeNotSlippery4x4-v0", nEpisodeAccuracy=1, threshold=0.99, nEpisodeMax=2000)



epsilon-greedy with fixed epsilon


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:25<00:00,  5.18s/it]


epsilon-greedy with decreasing epsilon


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:18<00:00,  6.31s/it]


Softmax


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [01:15<00:00, 17.14s/it]


## Deterministic 8x8 FrozenLake

In [4]:
register(
    id='FrozenLakeNotSlippery8x8-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name' : '8x8', 'is_slippery': False},
    max_episode_steps=100,
    reward_threshold=0.78, # optimum = .8196
)

In [None]:
compareMethods("FrozenLakeNotSlippery8x8-v0", nEpisodeAccuracy=1, threshold=0.99, nEpisodeMax=1000)



epsilon-greedy with fixed epsilon


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [02:00<00:00, 24.22s/it]


epsilon-greedy with decreasing epsilon


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [01:32<00:00, 30.89s/it]


Softmax


100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [03:03<00:00, 38.77s/it]


## Stochastic 4x4 FrozenLake

In [None]:
compareMethods("FrozenLake-v0", nEpisodeAccuracy=1000, threshold=0.8, nEpisodeMax=10000)

epsilon-greedy with fixed epsilon


 80%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▊                             | 4/5 [03:17<00:50, 50.27s/it]

## Stochastic 8x8 FrozenLake

In [None]:
compareMethods("FrozenLake8x8-v0", nEpisodeAccuracy=1000, threshold=0.8, nEpisodeMax=10000)