### Cliff Walking Playground
A playground for experimenting with different algorithms on the Cliff Walking environment from Example 6.6 of Sutton and Barto.

In [1]:
# imports
import matplotlib.pyplot as plt
import sys
# Make the local `library` package importable; adjust this path to your own checkout
sys.path.append('/Users/bogdanfloris/Downloads/Code/Reinforcement-Learning/')
from library.dynamic_programming import dynamic_programming as dp
from library.td_learning import temporal_diff_learning as td
from library.environments.cliff_walking import CliffWalkingEnv

Define the environment

In [10]:
env = CliffWalkingEnv()

#### Dynamic Programming
Experiments on the Cliff Walking environment using Dynamic Programming algorithms.

In [11]:
iterations = 100
policy, state_values = dp.policy_iteration(env=env, iterations=iterations)

In [12]:
print("Optimal policy found using Policy Iteration algorithm")
env.render_policy(policy=policy)

Optimal policy found using Policy Iteration algorithm
 →  →  →  →  →  →  →  →  →  →  →  ↓ 
 →  →  →  →  →  →  →  →  →  →  →  ↓ 
 →  →  →  →  →  →  →  →  →  →  →  ↓ 
 ↑  C  C  C  C  C  C  C  C  C  C  G 

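Policy Iteration converges to the optimal policy: from the start state the agent moves up one row, follows the row directly above the cliff, and then steps down into the goal.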
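For readers who want to see the evaluate/improve loop behind policy iteration, here is a minimal sketch on a hypothetical four-state chain. It does not show how `dp.policy_iteration` is implemented in the library; the toy environment, the discount factor of 0.9, and all helper names (`step`, `evaluate`, `improve`) are assumptions made for illustration.

```python
# Toy deterministic chain (hypothetical, not the Cliff Walking grid):
# states 0..3, state 3 is terminal, every non-terminal step costs -1.
N_STATES = 4
ACTIONS = (0, 1)  # 0 = left, 1 = right
GAMMA = 0.9       # assumed discount, keeps evaluation finite for bad policies


def step(state, action):
    """Deterministic transition: returns (next_state, reward)."""
    if state == N_STATES - 1:
        return state, 0.0  # terminal state absorbs with zero reward
    nxt = max(state - 1, 0) if action == 0 else state + 1
    return nxt, -1.0


def evaluate(policy, sweeps=200):
    """Iterative policy evaluation with in-place updates."""
    v = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(N_STATES - 1):
            nxt, r = step(s, policy[s])
            v[s] = r + GAMMA * v[nxt]
    return v


def improve(v):
    """Greedy policy improvement with respect to the state values."""
    policy = []
    for s in range(N_STATES):
        returns = []
        for a in ACTIONS:
            nxt, r = step(s, a)
            returns.append(r + GAMMA * v[nxt])
        policy.append(returns.index(max(returns)))
    return policy


# Start from a deliberately bad policy (always move left)
policy = [0] * N_STATES
while True:
    values = evaluate(policy)
    new_policy = improve(values)
    if new_policy == policy:  # policy is stable, so stop
        break
    policy = new_policy

print(policy[:3], values)
```

In this sketch the loop stops once an improvement step leaves the policy unchanged, at which point every non-terminal state moves right toward the terminal state.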

#### Temporal Difference Learning
Experiments on the Cliff Walking environment using Temporal Difference Learning algorithms.

In [16]:
# SARSA hyperparameters
num_episodes = 1000
epsilon = 0.1

In [17]:
q, _ = td.sarsa(env, num_episodes, epsilon=epsilon)
policy = td.make_epsilon_greedy_policy(q=q, epsilon=0.0, action_count=env.action_space.n)
print("Policy after {} episodes of {} for \u03B5 = {}".format(num_episodes, 'SARSA', epsilon))
env.render_policy(policy=policy)

Policy after 1000 episodes of SARSA for ε = 0.1
 →  →  →  →  →  →  →  →  →  →  →  ↓ 
 →  ↑  →  ↑  ←  ↑  ↑  ↑  ↑  ↑  →  ↓ 
 ↑  ↑  ↑  ↑  ↑  ←  ↑  →  ↑  ←  →  ↓ 
 ↑  C  C  C  C  C  C  C  C  C  C  G 

In [18]:
# Q Learning hyperparameters
num_episodes = 1000
epsilon = 0.1

In [19]:
q, _ = td.q_learning(env, num_episodes, epsilon=epsilon)
policy = td.make_epsilon_greedy_policy(q=q, epsilon=0.0, action_count=env.action_space.n)
print("Policy after {} episodes of {} for \u03B5 = {}".format(num_episodes, 'Q-Learning', epsilon))
env.render_policy(policy=policy)

Policy after 1000 episodes of Q-Learning for ε = 0.1
 ↑  →  →  ↓  →  →  ↓  ←  ↓  ↑  ↓  ↓ 
 ↓  →  →  →  →  ↓  ↓  →  →  ↓  ↓  ↓ 
 →  →  →  →  →  →  →  →  →  →  →  ↓ 
 ↑  C  C  C  C  C  C  C  C  C  C  G 

As the rendered policies show, SARSA learns the safer route along the top row, while Q-Learning learns the shortest route along the edge of the cliff. The Q-Learning route is optimal under a greedy policy, but it risks falling off the cliff whenever actions are selected ε-greedily. These results are consistent with Example 6.6 of Sutton and Barto.
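The two routes differ because the algorithms bootstrap from different targets: SARSA uses the value of the action actually taken in the next state, while Q-Learning uses the value of the greedy action. The sketch below illustrates this with made-up Q-values for a single state next to the cliff. The function names, the state label `"edge"`, the action ordering, and the numbers are all hypothetical and are not taken from the `library.td_learning` module.

```python
def sarsa_target(q, reward, next_state, next_action, gamma=1.0):
    """On-policy target: bootstraps from the action actually taken next."""
    return reward + gamma * q[next_state][next_action]


def q_learning_target(q, reward, next_state, gamma=1.0):
    """Off-policy target: bootstraps from the greedy action in next_state."""
    return reward + gamma * max(q[next_state])


# Made-up Q-values for one state beside the cliff.
# Assumed action order: 0 = up, 1 = right, 2 = down (into the cliff), 3 = left.
q = {"edge": [-15.0, -12.0, -100.0, -14.0]}

# Suppose the exploratory next action is "down", i.e. stepping off the cliff.
sarsa_t = sarsa_target(q, reward=-1.0, next_state="edge", next_action=2)
q_t = q_learning_target(q, reward=-1.0, next_state="edge")

print("SARSA target:     ", sarsa_t)  # -101.0
print("Q-Learning target:", q_t)      # -13.0
```

Under these assumptions, the occasional exploratory step into the cliff drags down SARSA's estimates for states near the edge, which pushes its policy toward the top row. Q-Learning ignores the exploratory action in its target and so keeps valuing the edge route as if it were always followed greedily.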