# Tutorial 4: TD methods

Let's implement TD methods (SARSA and Q-learning) for a simple classic problem, Cliff Walking, from the Sutton and Barto book. The code below can help you get started.


In [1]:
import gym
env = gym.make("CliffWalking-v0")

**First things first:** Spend some time getting familiar with the environment.

    The board is a 4x12 matrix, indexed as 1D array:
        0 = top left
        11 = top right
        12 = beginning of 2nd row from top at left side
        ...
    Each time step incurs -1 reward, and stepping into the cliff incurs -100 reward 
    and a reset to the start. An episode terminates when the agent reaches the goal.
    
    env.step(action) returns (new_state, reward, done, info), where info is a dict
    holding the transition probability, e.g. {'prob': 1.0}
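
To make the indexing and reward rules above concrete, here is a minimal sketch that re-implements them by hand, without gym. The helper names `to_row_col` and `toy_step` are hypothetical and not part of the environment. The start cell (36), goal cell (47) and cliff cells (37 to 46) are read off the rendered board in this tutorial; the action encoding (0 = up, 1 = right, 2 = down, 3 = left) is an assumption that agrees with the `env.step(0)` outputs shown here but should be checked against your gym version.

```python
import numpy as np

N_ROWS, N_COLS = 4, 12
START, GOAL = 36, 47            # bottom-left and bottom-right cells
CLIFF = set(range(37, 47))      # bottom row between start and goal

# Assumed action encoding: 0 = up, 1 = right, 2 = down, 3 = left
MOVES = {0: (-1, 0), 1: (0, 1), 2: (1, 0), 3: (0, -1)}

def to_row_col(state):
    """Convert a 1D state index into (row, col) on the 4x12 board."""
    row, col = np.unravel_index(state, (N_ROWS, N_COLS))
    return int(row), int(col)

def toy_step(state, action):
    """Hand-written stand-in for env.step: returns (new_state, reward, done)."""
    row, col = to_row_col(state)
    d_row, d_col = MOVES[action]
    # Moves that would leave the board keep the agent in place
    row = min(max(row + d_row, 0), N_ROWS - 1)
    col = min(max(col + d_col, 0), N_COLS - 1)
    new_state = row * N_COLS + col
    if new_state in CLIFF:
        # Falling off the cliff: large penalty and reset to the start
        return START, -100, False
    return new_state, -1, new_state == GOAL

print(to_row_col(12))    # (1, 0)
print(toy_step(36, 0))   # (24, -1, False)
print(toy_step(36, 1))   # (36, -100, False)
```

The second printed tuple matches the first three entries of the `env.step(0)` output from the start state shown in this tutorial.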

In [2]:
env.observation_space.n

48

In [3]:
env.action_space.n

4

In [4]:
env.step(0)

(24, -1, False, {'prob': 1.0})

In [5]:
env.render()

o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  o  o  o  o  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T



In [6]:
env.step(0)

(12, -1, False, {'prob': 1.0})

In [7]:
env.render()

o  o  o  o  o  o  o  o  o  o  o  o
x  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T



In [24]:
import numpy as np


def epsilon_greedy_policy(Q, epsilon, actions):
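    """Return an epsilon-greedy policy function for the action-value table Q.

    Q is a numpy array of shape (n_states, n_actions), epsilon is the
    exploration probability in [0, 1], and actions is a sequence of actions.
    """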
    
    def policy_fn(state):
        if np.random.rand()>epsilon:
            action = np.argmax(Q[state,:])
        else:
            action = np.random.choice(actions)
        return action
    return policy_fn

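def sarsa_update(Q, state, action, reward, new_state, new_action):
    # On-policy TD target: bootstrap from the action actually chosen in new_state.
    # alpha and gamma are read from the global scope (defined below).
    Q[state, action] = Q[state, action] + alpha*(reward + gamma*Q[new_state, new_action] - Q[state, action])
    return Q

def Q_learning_update(Q, state, action, reward, new_state, new_action):
    # Off-policy TD target: bootstrap from the greedy action in new_state.
    # new_action is unused; it is kept so that both updates share one signature.
    Q[state, action] = Q[state, action] + alpha*(reward + gamma*np.max(Q[new_state, :]) - Q[state, action])
    return Q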


Q = np.zeros([env.observation_space.n, env.action_space.n])

gamma = 0.99 
alpha = 0.1 # learning rate
n_episodes = 1000

actions = range(env.action_space.n)

score = []    
for j in range(n_episodes):
    done = False
    state = env.reset()
    
    # Act fully at random for the first 10 episodes (epsilon >= 1), then slowly decay epsilon
    policy = epsilon_greedy_policy(Q, epsilon=10./(j+1), actions = actions ) 
    
    
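    ### Generate sample episode
    t=0
    total_reward = 0
    action = policy(state)  # choose the first action before entering the loop
    while not done:
        t+=1
        new_state, reward, done, _ =  env.step(action)
        # Sample the next action once, so the action used in the SARSA target
        # is the same one that is executed on the next step
        new_action = policy(new_state)
        total_reward += reward
        
        # Extra penalty (not part of the environment) when the agent did not move,
        # e.g. because it bumped into a wall
        if state == new_state:
            reward -= 1
        
        # Swap in sarsa_update here to train with SARSA instead
        Q = Q_learning_update(Q, state, action, reward, new_state, new_action)
            
        state, action = new_state, new_action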
            
        if done:
            score.append(total_reward)
            
            if (j+1)%10 == 0:
                print("INFO: Episode {} finished after {} timesteps with r={}. "
                      "Avg score: {}".format(j+1, t, total_reward, np.mean(score)))
            

env.close()

INFO: Episode 10 finished after 4733 timesteps with r=-51362. Avg score: -51953.8
INFO: Episode 20 finished after 73 timesteps with r=-568. Avg score: -27855.3
INFO: Episode 30 finished after 31 timesteps with r=-31. Avg score: -18675.533333333333
INFO: Episode 40 finished after 18 timesteps with r=-18. Avg score: -14038.975
INFO: Episode 50 finished after 32 timesteps with r=-230. Avg score: -11256.06
INFO: Episode 60 finished after 15 timesteps with r=-15. Avg score: -9402.033333333333
INFO: Episode 70 finished after 41 timesteps with r=-338. Avg score: -8066.3
INFO: Episode 80 finished after 23 timesteps with r=-23. Avg score: -7070.625
INFO: Episode 90 finished after 28 timesteps with r=-127. Avg score: -6289.533333333334
INFO: Episode 100 finished after 28 timesteps with r=-28. Avg score: -5665.83
INFO: Episode 110 finished

**Control question**: Which trajectories are found by which algorithm?
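
As a hint for the question above, the following sketch compares the two bootstrap targets on a single transition. The Q-values, the reward and the sampled action are made up purely for illustration and do not come from a training run; only `gamma` matches the value used in the training cell.

```python
import numpy as np

gamma = 0.99   # same discount factor as in the training cell

# Made-up action values for one next state (illustration only);
# the third action stands for a step into the cliff
Q_next = np.array([-12.0, -10.0, -110.0, -13.0])

reward = -1
exploratory_action = 2   # suppose epsilon-greedy happened to sample this action

# SARSA bootstraps from the action that was actually sampled
sarsa_target = reward + gamma * Q_next[exploratory_action]

# Q-learning bootstraps from the best action, whatever was sampled
q_learning_target = reward + gamma * np.max(Q_next)

print("SARSA target:     ", sarsa_target)
print("Q-learning target:", q_learning_target)
```

With these numbers the SARSA target is much lower than the Q-learning target, because the occasional exploratory step into the cliff is counted against the cell next to it. Q-learning ignores the sampled action, so cells along the cliff edge keep their high values as long as a safe greedy action exists.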

### Trajectory found by SARSA

In [23]:
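def simulate_best():
    """Follow the greedy policy (epsilon=0) for the current Q for at most 20 steps, rendering each one."""
    policy = epsilon_greedy_policy(Q, epsilon=0, actions = actions )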
    state = env.reset()
    t = 0
    done = False
    
    while not done and t < 20:
        t+= 1
        action = policy(state)    
        state, reward, done, _ =  env.step(action)
        env.render()

simulate_best()

o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  o  o  o  o  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

o  o  o  o  o  o  o  o  o  o  o  o
x  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

x  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

o  x  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

o  o  x  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

o  o  o  x  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

o  o  o  o  x  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

o  o  o  o  o

### Trajectory found by Q-learning

In [25]:
simulate_best()

o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
x  o  o  o  o  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  x  o  o  o  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  x  o  o  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  x  o  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  x  o  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  x  o  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  o  o  o  o  o  o
o  o  o  o  o  o  x  o  o  o  o  o
o  C  C  C  C  C  C  C  C  C  C  T

o  o  o  o  o
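
Rendering every step produces long output. As an alternative, the greedy policy can be summarized in a single grid of arrows, which makes the two trajectories easier to compare. This is a minimal sketch: `greedy_arrows` is a hypothetical helper, the action encoding (0 = up, 1 = right, 2 = down, 3 = left) is an assumption to verify against your gym version, and the random table below is only a placeholder for the trained `Q`.

```python
import numpy as np

# Assumed action encoding: 0 = up, 1 = right, 2 = down, 3 = left
ARROWS = np.array(["^", ">", "v", "<"])

def greedy_arrows(Q, n_rows=4, n_cols=12):
    """Return one string per board row showing the greedy action in each cell."""
    best = np.argmax(Q, axis=1).reshape(n_rows, n_cols)
    return ["  ".join(ARROWS[row]) for row in best]

# Placeholder table for illustration; pass the trained Q instead
rng = np.random.default_rng(0)
demo_Q = rng.normal(size=(48, 4))
for line in greedy_arrows(demo_Q):
    print(line)
```

Note that cells the agent never visited keep their initial values, so the arrows there carry no information about the learned path.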