# Reinforcement Learning - Toy examples with Gym

## FrozenLake

In this notebook, we use functions from the `TD` module in `auxModules` that implement the $SARSA$ and $Q-Learning$ methods to solve the FrozenLake environments (https://gym.openai.com/envs/FrozenLake-v0/), and we compare different exploration policies. Note that the basic FrozenLake is a stochastic environment since, as mentioned in the documentation, "the ice is slippery, so you won't always move in the direction you intend". We create deterministic variants by calling the `register` function from `gym.envs.registration` with `is_slippery` set to `False`.
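To make the difference between the slippery and the deterministic variants concrete, here is a minimal sketch of a single grid move. It does not use Gym: the helper names `candidate_actions` and `step` are hypothetical, and the sketch assumes that on slippery ice the executed action is drawn uniformly among the intended direction and the two perpendicular ones, with the action encoding 0=left, 1=down, 2=right, 3=up.

```python
import random

# Assumed action encoding: 0=left, 1=down, 2=right, 3=up.
MOVES = {0: (0, -1), 1: (1, 0), 2: (0, 1), 3: (-1, 0)}

def candidate_actions(action, is_slippery):
    """Actions that may actually be executed when `action` is requested."""
    if not is_slippery:
        return [action]
    # Slippery ice (assumption): intended direction or one of the two
    # perpendicular directions, each equally likely.
    return [(action - 1) % 4, action, (action + 1) % 4]

def step(row, col, action, is_slippery, size=4, rng=random):
    """Move on a size x size grid; bumping into a wall leaves the agent in place."""
    executed = rng.choice(candidate_actions(action, is_slippery))
    d_row, d_col = MOVES[executed]
    new_row = min(max(row + d_row, 0), size - 1)
    new_col = min(max(col + d_col, 0), size - 1)
    return new_row, new_col
```

With `is_slippery=False` the same state and action always lead to the same next cell, which is what makes the registered `NotSlippery` variants deterministic.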

Both algorithms update a tabular estimate of the $Q$-function using the following update rules:
* $SARSA$ algorithm is an on-policy method using:
$$Q_{t+1}(s_t,a_t) \leftarrow Q_t(s_t,a_t) + \alpha (r_t + \gamma Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t,a_t))$$

* $Q-Learning$ algorithm is an off-policy method using:
$$Q_{t+1}(s_t,a_t) \leftarrow Q_t(s_t,a_t) + \alpha (r_t + \gamma \max_b Q_t(s_{t+1}, b) - Q_t(s_t,a_t))$$
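The two update rules above can be sketched on a NumPy table as follows. This is an illustration only, not the code from `auxModules.TD`: the function names, the default values of `alpha` and `gamma`, and the `done` flag (which drops the bootstrap term on terminal transitions, a common convention that the formulas above do not spell out) are assumptions.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99, done=False):
    """On-policy target: bootstraps on the action a_next actually chosen in s_next."""
    target = r if done else r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99, done=False):
    """Off-policy target: bootstraps on the greedy value max_b Q(s_next, b)."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

# Usage on a 16-state, 4-action table (the size of the 4x4 lake).
Q_sarsa = np.zeros((16, 4))
Q_sarsa[1] = [0.0, 0.5, 1.0, 0.0]
Q_ql = Q_sarsa.copy()

# Same transition (s=0, a=2, r=0, s'=1); SARSA is told that a'=1 was chosen.
sarsa_update(Q_sarsa, 0, 2, 0.0, 1, 1, alpha=0.5, gamma=0.9)
q_learning_update(Q_ql, 0, 2, 0.0, 1, alpha=0.5, gamma=0.9)
# Q_sarsa[0, 2] is about 0.225 (uses Q[1, 1] = 0.5),
# Q_ql[0, 2] is about 0.45 (uses max_b Q[1, b] = 1.0).
```

The only difference between the two functions is the bootstrap term, which is exactly the on-policy versus off-policy distinction.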

For exploration, we will compare performances between:
* $\epsilon$-greedy policy with a fixed $\epsilon$.
* $\epsilon$-greedy policy with a decaying $\epsilon$.
* softmax exploration, which assigns each action a probability of being selected according to the following rule, where $\tau > 0$ is a temperature parameter:
$$P(a_i \vert s) = \frac{e^{\frac{1}{\tau}Q(s,a_i)}}{\sum_j e^{\frac{1}{\tau}Q(s,a_j)}}$$
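The three exploration rules above can be sketched as follows. This is a minimal illustration rather than the implementation in `auxModules.TD`: the function names, the `rng` argument, and the geometric schedule `epsilon0 * decay_rate ** episode` for the decaying case are assumptions made for the example.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Uniformly random action with probability epsilon, greedy action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

def decayed_epsilon(epsilon0, decay_rate, episode):
    """Assumed schedule: epsilon shrinks geometrically with the episode index."""
    return epsilon0 * decay_rate ** episode

def softmax_probabilities(q_values, tau):
    """Boltzmann distribution over actions with temperature tau."""
    scaled = np.asarray(q_values, dtype=float) / tau
    scaled -= scaled.max()  # shift for numerical stability; probabilities are unchanged
    weights = np.exp(scaled)
    return weights / weights.sum()

def softmax_action(q_values, tau, rng):
    """Sample an action index from the softmax distribution."""
    probs = softmax_probabilities(q_values, tau)
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
q_row = [0.0, 0.2, 0.8, 0.1]
greedy_action = epsilon_greedy(q_row, epsilon=0.0, rng=rng)  # always the argmax
sampled_action = softmax_action(q_row, tau=0.1, rng=rng)
```

A small $\tau$ concentrates the probability mass on the highest-valued action, while a large $\tau$ flattens the distribution towards uniform; when all $Q$ values lie in $[0, 1]$, $\tau = 1$ keeps the ratio between any two action probabilities below $e$.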

In [1]:
import sys
sys.path.append("../")  # make the parent directory importable
import gym
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from auxModules.TD import *  # provides SARSA, QLearning and compareMethods
from gym.envs.registration import register

## Deterministic 4x4 FrozenLake

In [2]:
register(
    id='FrozenLakeNotSlippery4x4-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name' : '4x4', 'is_slippery': False},
    max_episode_steps=100,
    reward_threshold=0.78, # optimum = .8196
)

In [3]:
compareMethods("FrozenLakeNotSlippery4x4-v0", nEpisodeAccuracy=1, threshold=0.99, nEpisodeMax=2000)

epsilon-greedy with fixed epsilon

100%|██████████| 5/5 [00:51<00:00, 10.35s/it]

epsilon-greedy with decreasing epsilon

100%|██████████| 3/3 [00:28<00:00,  8.81s/it]

Softmax

100%|██████████| 4/4 [03:26<00:00, 40.34s/it]

| Exploration strategy | Accuracy (SARSA) | Number of episodes (SARSA) | Accuracy (Q-Learning) | Number of episodes (Q-Learning) |
|---|---|---|---|---|
| Fixed $\epsilon$: $\epsilon$ = 0.1 | 1.0 | 114.6 | 1.0 | 180.2 |
| Fixed $\epsilon$: $\epsilon$ = 0.3 | 0.6 | 136.0 | 1.0 | 201.2 |
| Fixed $\epsilon$: $\epsilon$ = 0.5 | 0.4 | 136.2 | 1.0 | 194.0 |
| Fixed $\epsilon$: $\epsilon$ = 0.7 | 1.0 | 123.6 | 1.0 | 191.8 |
| Fixed $\epsilon$: $\epsilon$ = 0.9 | 1.0 | 135.0 | 1.0 | 195.6 |
| Decaying $\epsilon$: decay rate = 0.9 | 1.0 | 119.0 | 1.0 | 195.8 |
| Decaying $\epsilon$: decay rate = 0.99 | 1.0 | 119.6 | 1.0 | 220.0 |
| Decaying $\epsilon$: decay rate = 0.999 | 1.0 | 123.2 | 1.0 | 189.8 |
| Softmax: $\tau$ = 1 | 0.4 | 2000.0 | 1.0 | 2000.0 |
| Softmax: $\tau$ = 0.1 | 0.6 | 705.0 | 1.0 | 2000.0 |


## Deterministic 8x8 FrozenLake

In [4]:
register(
    id='FrozenLakeNotSlippery8x8-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name' : '8x8', 'is_slippery': False},
    max_episode_steps=100,
    reward_threshold=0.78, # optimum = .8196
)

In [5]:
compareMethods("FrozenLakeNotSlippery8x8-v0", nEpisodeAccuracy=1, threshold=0.99, nEpisodeMax=1000)

epsilon-greedy with fixed epsilon

100%|██████████| 5/5 [05:24<00:00, 60.96s/it]

epsilon-greedy with decreasing epsilon

100%|██████████| 3/3 [02:21<00:00, 47.99s/it]

Softmax

100%|██████████| 4/4 [06:29<00:00, 83.17s/it]

| Exploration strategy | Accuracy (SARSA) | Number of episodes (SARSA) | Accuracy (Q-Learning) | Number of episodes (Q-Learning) |
|---|---|---|---|---|
| Fixed $\epsilon$: $\epsilon$ = 0.1 | 1.0 | 315.0 | 1.0 | 318.6 |
| Fixed $\epsilon$: $\epsilon$ = 0.3 | 1.0 | 291.8 | 1.0 | 382.2 |
| Fixed $\epsilon$: $\epsilon$ = 0.5 | 1.0 | 332.8 | 1.0 | 293.6 |
| Fixed $\epsilon$: $\epsilon$ = 0.7 | 1.0 | 265.0 | 1.0 | 394.4 |
| Fixed $\epsilon$: $\epsilon$ = 0.9 | 1.0 | 399.2 | 1.0 | 370.8 |
| Decaying $\epsilon$: decay rate = 0.9 | 1.0 | 319.4 | 1.0 | 369.2 |
| Decaying $\epsilon$: decay rate = 0.99 | 1.0 | 308.8 | 1.0 | 339.8 |
| Decaying $\epsilon$: decay rate = 0.999 | 1.0 | 277.8 | 1.0 | 341.8 |
| Softmax: $\tau$ = 1 | 0.0 | 1000.0 | 0.0 | 1000.0 |
| Softmax: $\tau$ = 0.1 | 0.2 | 1000.0 | 0.0 | 1000.0 |


## Stochastic 4x4 FrozenLake

In [None]:
compareMethods("FrozenLake-v0", nEpisodeAccuracy=1000, threshold=0.8, nEpisodeMax=10000)

epsilon-greedy with fixed epsilon

100%|██████████| 5/5 [08:41<00:00, 110.47s/it]

epsilon-greedy with decreasing epsilon

100%|██████████| 3/3 [04:56<00:00, 85.45s/it]

Softmax

100%|██████████| 4/4 [07:28<00:00, 101.21s/it]

| Exploration strategy | Accuracy (SARSA) | Number of episodes (SARSA) | Accuracy (Q-Learning) | Number of episodes (Q-Learning) |
|---|---|---|---|---|
| Fixed $\epsilon$: $\epsilon$ = 0.1 | 0.6986 | 2490.2 | 0.7086 | 1926.8 |
| Fixed $\epsilon$: $\epsilon$ = 0.3 | 0.7364 | 684.8 | 0.6958 | 807.2 |
| Fixed $\epsilon$: $\epsilon$ = 0.5 | 0.7382 | 630.2 | 0.7398 | 1209.2 |
| Fixed $\epsilon$: $\epsilon$ = 0.7 | 0.734 | 613.4 | 0.7352 | 1056.6 |
| Fixed $\epsilon$: $\epsilon$ = 0.9 | 0.7496 | 598.8 | 0.742 | 944.2 |
| Decaying $\epsilon$: decay rate = 0.9 | 0.7364 | 663.6 | 0.7324 | 795.8 |
| Decaying $\epsilon$: decay rate = 0.99 | 0.7454 | 2612.6 | 0.7424 | 1114.6 |
| Decaying $\epsilon$: decay rate = 0.999 | 0.7332 | 974.2 | 0.741 | 755.0 |
| Softmax: $\tau$ = 1 | 0.097 | 10000.0 | 0.3874 | 10000.0 |
| Softmax: $\tau$ = 0.1 | 0.0982 | 10000.0 | 0.3546 | 10000.0 |


## Stochastic 8x8 FrozenLake

In [None]:
compareMethods("FrozenLake8x8-v0", nEpisodeAccuracy=1000, threshold=0.8, nEpisodeMax=10000)

epsilon-greedy with fixed epsilon


  0%|          | 0/5 [00:00<?, ?it/s]

## Method comparison

In [None]:
env = gym.make("FrozenLake-v0")

In [None]:
nEp = 1000

plt.figure(figsize=(18,10))

plt.subplot(2,3,1)
Eps = [0.9,0.5,0.1]
for eps in Eps:
    q, a = SARSA(env, nEpisode=nEp, epsilon0=eps, decreaseRate=1, softmax=False,
          tau=0.01)
    plt.plot(np.arange(len(a))*50, a, label = r"$\epsilon =$ {}".format(eps))

plt.legend()
plt.title(r"(SARSA) Fixed-$\epsilon$ strategy")

plt.subplot(2,3,2)
DR = [0.9,0.99,0.999]
for dr in DR:
    q, a = SARSA(env, nEpisode=nEp, epsilon0=0.9, decreaseRate=dr, softmax=False,
          tau=0.01)
    plt.plot(np.arange(len(a))*50, a, label = "Decreasing rate = {}".format(dr))

plt.legend()
plt.title(r"(SARSA) Decreasing-$\epsilon$ strategy")

plt.subplot(2,3,3)
Tau = [0.1,0.01,0.003]
for tau in Tau:
    q, a = SARSA(env, nEpisode=nEp, epsilon0=0.9, decreaseRate=1, softmax=True,
          tau=tau)
    plt.plot(np.arange(len(a))*50, a, label = "$\\tau =$ {}".format(tau))

plt.legend()
plt.title("(SARSA) Softmax strategy")

plt.subplot(2,3,4)
Eps = [0.9,0.5,0.1]
for eps in Eps:
    q, a = QLearning(env, nEpisode=nEp, epsilon0=eps, decreaseRate=1, softmax=False,
          tau=0.01)
    plt.plot(np.arange(len(a))*50, a, label = r"$\epsilon =$ {}".format(eps))

plt.legend()
plt.title(r"(QL) Fixed-$\epsilon$ strategy")

plt.subplot(2,3,5)
DR = [0.9,0.99,0.999]
for dr in DR:
    q, a = QLearning(env, nEpisode=nEp, epsilon0=0.9, decreaseRate=dr, softmax=False,
          tau=0.01)
    plt.plot(np.arange(len(a))*50, a, label = "Decreasing rate = {}".format(dr))

plt.legend()
plt.title(r"(QL) Decreasing-$\epsilon$ strategy")

plt.subplot(2,3,6)
Tau = [0.1,0.01,0.003]
for tau in Tau:
    q, a = QLearning(env, nEpisode=nEp, epsilon0=0.9, decreaseRate=1, softmax=True,
          tau=tau)
    plt.plot(np.arange(len(a))*50, a, label = "$\\tau =$ {}".format(tau))

plt.legend()
plt.title("(QL) Softmax strategy")

In [None]:
env = gym.make("FrozenLakeNotSlippery8x8-v0")

In [None]:
nEp = 7000

plt.figure(figsize=(18,10))

plt.subplot(2,3,1)
Eps = [0.9,0.5,0.1]
for eps in Eps:
    q, a = SARSA(env, nEpisode=nEp, epsilon0=eps, decreaseRate=1, softmax=False,
          tau=0.01)
    plt.plot(a, label = r"$\epsilon =$ {}".format(eps))

plt.legend()
plt.title(r"(SARSA) Fixed-$\epsilon$ strategy")

plt.subplot(2,3,2)
DR = [0.9,0.99,0.999]
for dr in DR:
    q, a = SARSA(env, nEpisode=nEp, epsilon0=0.9, decreaseRate=dr, softmax=False,
          tau=0.01)
    plt.plot(a, label = "Decreasing rate = {}".format(dr))

plt.legend()
plt.title(r"(SARSA) Decreasing-$\epsilon$ strategy")

plt.subplot(2,3,3)
Tau = [0.1,0.01,0.001]
for tau in Tau:
    q, a = SARSA(env, nEpisode=nEp, epsilon0=0.9, decreaseRate=1, softmax=True,
          tau=tau)
    plt.plot(a, label = "$\\tau =$ {}".format(tau))

plt.legend()
plt.title("(SARSA) Softmax strategy")

plt.subplot(2,3,4)
Eps = [0.9,0.5,0.1]
for eps in Eps:
    q, a = QLearning(env, nEpisode=nEp, epsilon0=eps, decreaseRate=1, softmax=False,
          tau=0.01)
    plt.plot(a, label = r"$\epsilon =$ {}".format(eps))

plt.legend()
plt.title(r"(QL) Fixed-$\epsilon$ strategy")

plt.subplot(2,3,5)
DR = [0.9,0.99,0.999]
for dr in DR:
    q, a = QLearning(env, nEpisode=nEp, epsilon0=0.9, decreaseRate=dr, softmax=False,
          tau=0.01)
    plt.plot(a, label = "Decreasing rate = {}".format(dr))

plt.legend()
plt.title(r"(QL) Decreasing-$\epsilon$ strategy")

plt.subplot(2,3,6)
Tau = [0.1,0.01,0.001]
for tau in Tau:
    q, a = QLearning(env, nEpisode=nEp, epsilon0=0.9, decreaseRate=1, softmax=True,
          tau=tau)
    plt.plot(a, label = "$\\tau =$ {}".format(tau))

plt.legend()
plt.title("(QL) Softmax strategy")

In [None]:
env = gym.make("FrozenLake-v0")

In [None]:
nEp = 1000

plt.figure()

# Plot SARSA and Q-Learning with softmax exploration on the same axes.
Tau = [0.1]
for tau in Tau:
    q, a = SARSA(env, nEpisode=nEp, epsilon0=0.9, decreaseRate=1, softmax=True,
          tau=tau)
    plt.plot(np.arange(len(a))*50, a, label = "(SARSA) $\\tau =$ {}".format(tau))

for tau in Tau:
    q, a = QLearning(env, nEpisode=nEp, epsilon0=0.9, decreaseRate=1, softmax=True,
          tau=tau)
    plt.plot(np.arange(len(a))*50, a, label = "(Q-Learning) $\\tau =$ {}".format(tau))

# A single legend and title, since both curves share one figure.
plt.legend()
plt.title("Softmax strategy: SARSA vs Q-Learning")