**Reinforcement Learning with TensorFlow & TRFL: TD(λ)**

Outline:
1. TD(λ)
1. TD(λ) with trfl.td_lambda() toy example
1. FrozenLake
* TD(λ) with trfl.td_lambda() on FrozenLake


In [0]:
#TRFL works with TensorFlow 1.12
#installs TensorFlow version 1.12 then restarts the runtime
!pip install tensorflow==1.12

import os
os.kill(os.getpid(), 9)



In [1]:
#install tensorflow-probability 0.5.0 that works with TensorFlow 1.12
!pip install tensorflow-probability==0.5.0

#install TRFL
!pip install trfl==1.0



In [0]:
import gym
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import trfl
import tensorflow_probability as tfp

**TD(λ)**

TD(λ) weights n-step returns and, in doing so, bridges Monte Carlo and TD methods. TD(0), i.e. λ = 0, uses the one-step return. TD(1), i.e. λ = 1, uses the Monte Carlo return. In the toy example and the FrozenLake examples in this notebook, you can modify λ to see how the state value estimates change.
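
To make the two extremes concrete, here is a minimal NumPy sketch of the λ-return computed by backward recursion. The helper `lambda_returns` and the sample numbers are illustrative assumptions, not part of TRFL; the sketch only checks that λ = 0 gives the one-step target and λ = 1 gives the discounted Monte Carlo return.

```python
import numpy as np

def lambda_returns(rewards, next_values, discount, lam):
    """Backward recursion:
    G_t = r_t + discount * ((1 - lam) * V(s_{t+1}) + lam * G_{t+1})."""
    rewards = np.asarray(rewards, dtype=float)
    next_values = np.asarray(next_values, dtype=float)
    returns = np.zeros_like(rewards)
    # Past the last step, the return is seeded with the final bootstrap value.
    next_return = next_values[-1]
    for t in reversed(range(len(rewards))):
        returns[t] = rewards[t] + discount * (
            (1.0 - lam) * next_values[t] + lam * next_return)
        next_return = returns[t]
    return returns

rewards = [0.0, 0.0, 1.0]
next_values = [0.5, 0.2, 0.0]  # V(s_1), V(s_2), V(terminal) = 0

td0 = lambda_returns(rewards, next_values, discount=0.9, lam=0.0)
mc = lambda_returns(rewards, next_values, discount=0.9, lam=1.0)
print("lambda = 0:", td0)  # one-step targets r_t + discount * V(s_{t+1})
print("lambda = 1:", mc)   # discounted Monte Carlo returns
```

Values of λ between 0 and 1 blend the two targets, leaning further toward the Monte Carlo return as λ grows.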


**TD(λ) Toy Example**

Below is a toy example to illustrate the effects of different values of λ, discount, and learning_rate. Imagine a number line where the agent starts at 0. The agent can only move right, and the episode ends once the agent moves past state 10. That final move yields a reward of 1; every other move yields a reward of 0. The toy example lets you change various hyperparameters to see how the TD(λ) updates change.
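
As a reference point for the estimates, the true state values of this toy problem can be written in closed form. The sketch below assumes the reward of 1 arrives on the final move out of state 10 (the `else` branch of the toy loop) and uses the default discount of 0.9, so the value of state s is the discount raised to the number of moves remaining before that final move.

```python
import numpy as np

discount = 0.9
done_state = 10

# Moves remaining before the rewarded final move, for states 0..10.
steps_to_reward = done_state - np.arange(done_state + 1)

# A single reward of 1 discounted once per remaining move.
true_values = discount ** steps_to_reward
print(np.round(true_values, 3))
```

With enough episodes, the printed estimates should approach these values, with states near the end of the line converging first.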

In [3]:
#hyperparameters
learning_rate = 0.1 #update state values with this size learning rate
lambda_value = 0.9 #lambda value. range: [0,1]. weight lambda returns by this amount
discount = 0.9 #discount factor (gamma). range: [0,1]. Decay rewards by this amount each step
n_episode = 10 #number of episodes to run toy example
done_state = 10 #episode ends after reaching this state

#set up env, tensors and TRFL
state_value_array = np.zeros((done_state+1,1)) #estimated values of states

tf.reset_default_graph()

state_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="state_value")
reward_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="reward")
discount_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="discount")
bootstrap_ = tf.placeholder(dtype=tf.float32, shape=[1], name="bootstrap")

#TRFL usage: TD Lambda
td_lambda_return_ = trfl.td_lambda(state_, reward_, discount_, bootstrap_, lambda_=lambda_value)

with tf.Session() as sess:
  for i in range(n_episode):
    current_state = 0
    state_list, reward_list, state_int_list = [], [], []
    done = 0
    
    while not done:
      state_list.append(state_value_array[current_state])
      state_int_list.append(current_state)

      current_state += 1

      if current_state < done_state+1:
        reward = 0
        done = 0
        bootstrap_v = state_value_array[current_state]
      else:
        reward = 1
        done = 1
        bootstrap_v = 0.
      reward_list.append(reward)   
      
      td_lambda_output = sess.run(td_lambda_return_, feed_dict={
          state_:np.array(state_list).reshape(-1,1),
          reward_:np.array(reward_list).reshape(-1,1),
          discount_:np.array([discount]*len(state_list)).reshape(-1,1),
          bootstrap_:np.array(bootstrap_v).reshape((1,))
        })

      state_value_array[state_int_list] += learning_rate*td_lambda_output.extra.temporal_differences

    print("Finished episode {}: state values are {}".format(i,np.round(np.squeeze(state_value_array),3)))

Finished episode 0: state values are [0.012 0.015 0.019 0.023 0.028 0.035 0.043 0.053 0.066 0.081 0.1  ]
Finished episode 1: state values are [0.03  0.036 0.043 0.051 0.061 0.073 0.088 0.106 0.128 0.156 0.19 ]
Finished episode 2: state values are [0.051 0.059 0.069 0.081 0.095 0.112 0.132 0.157 0.187 0.224 0.271]
Finished episode 3: state values are [0.072 0.084 0.096 0.111 0.129 0.15  0.175 0.205 0.242 0.287 0.344]
Finished episode 4: state values are [0.094 0.108 0.123 0.141 0.162 0.187 0.216 0.251 0.293 0.344 0.41 ]
Finished episode 5: state values are [0.115 0.131 0.149 0.17  0.194 0.222 0.255 0.293 0.34  0.397 0.469]
Finished episode 6: state values are [0.135 0.153 0.174 0.197 0.224 0.255 0.291 0.333 0.383 0.445 0.522]
Finished episode 7: state values are [0.154 0.174 0.196 0.222 0.251 0.285 0.324 0.369 0.423 0.488 0.57 ]
Finished episode 8: state values are [0.172 0.193 0.218 0.245 0.277 0.313 0.355 0.403 0.46  0.528 0.613]
Finished episode 9: state values are [0.188 0.211 0.237

**TRFL Usage**

The lambda_ argument of trfl.td_lambda() defaults to 1.0. It is optional and can be a constant between 0 and 1 or a tensor of values. The inputs have shape [sequence_length, batch_size], where sequence_length is the time dimension and index 0 is the start of the sequence. The exception is the bootstrap_ input, which has shape [batch_size] and holds the bootstrap value for each sequence.
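
The shape convention can be checked without TensorFlow. The sketch below uses a hypothetical helper, `build_td_lambda_feed`, that arranges one trajectory (batch_size = 1) into time-major arrays; the dictionary keys simply mirror the placeholder names used in this notebook.

```python
import numpy as np

def build_td_lambda_feed(values, rewards, discount, bootstrap_value):
    """Shape a single trajectory as [sequence_length, 1] arrays plus a [1] bootstrap."""
    seq_len = len(values)
    return {
        "state_": np.asarray(values, dtype=np.float32).reshape(seq_len, 1),
        "reward_": np.asarray(rewards, dtype=np.float32).reshape(seq_len, 1),
        # The same discount is repeated at every time step.
        "discount_": np.full((seq_len, 1), discount, dtype=np.float32),
        # One bootstrap value per sequence, so the shape is [batch_size].
        "bootstrap_": np.asarray(bootstrap_value, dtype=np.float32).reshape(1),
    }

feed = build_td_lambda_feed([0.1, 0.2, 0.3], [0.0, 0.0, 1.0], 0.9, 0.0)
shapes = {name: array.shape for name, array in feed.items()}
print(shapes)
```

Row 0 of each time-major array corresponds to the first step of the sequence.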

**FrozenLake**

FrozenLake is a GridWorld-like environment (env) where the agent tries to navigate a two-dimensional world from a starting point to a goal. The agent has to avoid falling into holes while navigating a slippery surface that sometimes causes the agent to move unexpectedly.
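
The 4x4 layout that env.render() prints can be indexed directly. This sketch assumes states are numbered row by row from the top-left tile, which is consistent with the 4x4 reshaping of the value estimates later in the notebook; `tile_at` is a hypothetical helper, not part of Gym.

```python
# 4x4 layout as printed by env.render(): S start, F frozen, H hole, G goal.
LAKE_MAP = ["SFFF", "FHFH", "FFFH", "HFFG"]
N_COLS = len(LAKE_MAP[0])

def tile_at(state):
    """Return the map letter for a state index, assuming row-major numbering."""
    row, col = divmod(state, N_COLS)
    return LAKE_MAP[row][col]

n_states = len(LAKE_MAP) * N_COLS
holes = [s for s in range(n_states) if tile_at(s) == "H"]
goal = [s for s in range(n_states) if tile_at(s) == "G"]
print("holes:", holes, "goal:", goal)
```

Holes and the goal are terminal, so the agent never acts from them and their value estimates stay at zero.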

In [4]:
env = gym.make('FrozenLake-v0')
env.reset()
env.render()

for i in range(10):
  action = env.action_space.sample()
  obs, reward, done, info = env.step(action)
  env.render()
  if done:
    env.reset()
    
env.close()


[S]FFF
FHFH
FFFH
HFFG
  (Left)
[S]FFF
FHFH
FFFH
HFFG
  (Up)
[S]FFF
FHFH
FFFH
HFFG
  (Down)
SFFF
[F]HFH
FFFH
HFFG
  (Left)
SFFF
[F]HFH
FFFH
HFFG
  (Up)
SFFF
F[H]FH
FFFH
HFFG
  (Up)
[S]FFF
FHFH
FFFH
HFFG
  (Up)
S[F]FF
FHFH
FFFH
HFFG
  (Up)
SF[F]F
FHFH
FFFH
HFFG
  (Down)
S[F]FF
FHFH
FFFH
HFFG
  (Up)
SF[F]F
FHFH
FFFH
HFFG


**TD(λ) with trfl.td_lambda()**

We'll do two examples. In both cases the agent follows a random policy. In the first example the environment is deterministic: every action moves the agent in the chosen direction. In the second the environment is non-deterministic (slippery), so the agent sometimes moves in an unexpected direction.
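
The difference between the two settings can be sketched in plain Python. This is an illustration under stated assumptions, not Gym's code: the action encoding (0=Left, 1=Down, 2=Right, 3=Up) and the slip model (the chosen action or one of its two perpendicular neighbours, each with probability 1/3) are assumed to match Gym's FrozenLake, and `step_position` is a hypothetical helper.

```python
import random

# Action encoding assumed to match Gym's FrozenLake.
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3
MOVES = {LEFT: (0, -1), DOWN: (1, 0), RIGHT: (0, 1), UP: (-1, 0)}

def step_position(row, col, action, is_slippery, rng, size=4):
    """Move on a size x size grid; hypothetical helper, not part of Gym."""
    if is_slippery:
        # Assumed slip model: the chosen action or one of its two
        # perpendicular neighbours, each with probability 1/3.
        action = rng.choice([(action - 1) % 4, action, (action + 1) % 4])
    d_row, d_col = MOVES[action]
    # Moves that would leave the grid keep the agent at the edge.
    new_row = min(max(row + d_row, 0), size - 1)
    new_col = min(max(col + d_col, 0), size - 1)
    return new_row, new_col

rng = random.Random(0)
deterministic = step_position(1, 1, RIGHT, False, rng)
slippery = {step_position(1, 1, RIGHT, True, rng) for _ in range(300)}
print("deterministic:", deterministic)
print("slippery outcomes:", sorted(slippery))
```

Under this model the agent never moves opposite to the chosen direction, but it ends up where it intended only about a third of the time.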

In [0]:
#deterministic env
from gym.envs.registration import register
register(
    id='FrozenLakeNotSlippery-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name' : '4x4', 'is_slippery': False}
)

In [0]:
#env = gym.make('FrozenLake-v0')
env = gym.make('FrozenLakeNotSlippery-v0')

#hyperparameters
episodes = 10000
learning_rate = 0.01
discount = 1.
lambda_val = 0.5
stats_every = 1000

tf.reset_default_graph()

state_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="state_value")
reward_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="reward")
discount_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="discount")
bootstrap_ = tf.placeholder(dtype=tf.float32, shape=[1], name="bootstrap")
#lambda_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="lambda") #optionally can do lambda placeholder
td_lambda_return_ = trfl.td_lambda(state_, reward_, discount_, bootstrap_, lambda_=lambda_val)

In [7]:
with tf.Session() as sess:
  
  #initialize the estimated state values to zero
  state_value_array = np.zeros((16,1))
  #reset the env
  current_state = env.reset()

  current_episode = 1
  state_list, reward_list, state_int_list = [], [], []

  while current_episode < episodes:
    #take a random action
    random_action = env.action_space.sample()
    next_state, rew, done, info = env.step(random_action)

    state_list.append(state_value_array[current_state])
    state_int_list.append(current_state)
    reward_list.append(rew)
    bootstrap_v = state_value_array[next_state]
    
    current_state = next_state
    
    #run td lambda in the session to get lambda returns
    td_lambda_output = sess.run(td_lambda_return_, feed_dict={
        state_:np.array(state_list).reshape(-1,1),
        reward_:np.array(reward_list).reshape(-1,1),
        discount_:np.array([discount]*len(state_list)).reshape(-1,1),
        bootstrap_:np.array(bootstrap_v).reshape((1,))
      })
    #use the temporal differences from the TD(lambda) op to update the tabular state value estimates
    state_value_array[state_int_list] += learning_rate*td_lambda_output.extra.temporal_differences
    
    if done:
      state_list, reward_list, state_int_list = [], [], []
      current_state = env.reset()
      current_episode += 1
      if current_episode % stats_every == 0:
        print("Current Episode: {}".format(current_episode))
        print("Reshaped State Value Estimates:")
        print(np.round(state_value_array.reshape(4,4),3))
        print("")  
          

Current Episode: 1000
Reshaped State Value Estimates:
[[0.008 0.007 0.009 0.005]
 [0.008 0.    0.013 0.   ]
 [0.013 0.022 0.031 0.   ]
 [0.    0.03  0.102 0.   ]]

Current Episode: 2000
Reshaped State Value Estimates:
[[0.017 0.016 0.019 0.017]
 [0.019 0.    0.025 0.   ]
 [0.023 0.037 0.048 0.   ]
 [0.    0.054 0.132 0.   ]]

Current Episode: 3000
Reshaped State Value Estimates:
[[0.027 0.027 0.036 0.024]
 [0.027 0.    0.053 0.   ]
 [0.035 0.056 0.094 0.   ]
 [0.    0.076 0.19  0.   ]]

Current Episode: 4000
Reshaped State Value Estimates:
[[0.039 0.039 0.05  0.035]
 [0.037 0.    0.069 0.   ]
 [0.045 0.069 0.112 0.   ]
 [0.    0.11  0.213 0.   ]]

Current Episode: 5000
Reshaped State Value Estimates:
[[0.038 0.037 0.044 0.04 ]
 [0.039 0.    0.055 0.   ]
 [0.053 0.077 0.111 0.   ]
 [0.    0.12  0.222 0.   ]]

Current Episode: 6000
Reshaped State Value Estimates:
[[0.045 0.044 0.057 0.047]
 [0.046 0.    0.078 0.   ]
 [0.06  0.095 0.132 0.   ]
 [0.    0.14  0.285 0.   ]]

Current Episode:

In [0]:
#non-deterministic (slippery) env
env = gym.make('FrozenLake-v0')
#env = gym.make('FrozenLakeNotSlippery-v0')

#hyperparameters
episodes = 10000
learning_rate = 0.01
discount = 0.99
lambda_val = 0.5
stats_every = 1000

tf.reset_default_graph()

state_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="state_value")
reward_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="reward")
discount_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="discount")
bootstrap_ = tf.placeholder(dtype=tf.float32, shape=[1], name="bootstrap")
#lambda_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="lambda") #optionally can do lambda placeholder
td_lambda_return_ = trfl.td_lambda(state_, reward_, discount_, bootstrap_, lambda_=lambda_val)


In [9]:
with tf.Session() as sess:
  
  #initialize the estimated state values to zero
  state_value_array = np.zeros((16,1))
  #reset the env
  current_state = env.reset()

  current_episode = 1
  state_list, reward_list, state_int_list = [], [], []

  while current_episode < episodes:
    #take a random action
    random_action = env.action_space.sample()
    next_state, rew, done, info = env.step(random_action)

    state_list.append(state_value_array[current_state])
    state_int_list.append(current_state)
    reward_list.append(rew)
    bootstrap_v = state_value_array[next_state]
    
    current_state = next_state
    #run td lambda in the session to get lambda returns
    td_lambda_output = sess.run(td_lambda_return_, feed_dict={
        state_:np.array(state_list).reshape(-1,1),
        reward_:np.array(reward_list).reshape(-1,1),
        discount_:np.array([discount]*len(state_list)).reshape(-1,1),
        bootstrap_:np.array(bootstrap_v).reshape((1,))
      })
    #use the temporal differences from the TD(lambda) op to update the tabular state value estimates
    state_value_array[state_int_list] += learning_rate*td_lambda_output.extra.temporal_differences
    
    if done:
      state_list, reward_list, state_int_list = [], [], []
      current_state = env.reset()
      current_episode += 1
      if current_episode % stats_every == 0:
        print("Current Episode: {}".format(current_episode))
        print("Reshaped State Value Estimates:")
        print(np.round(state_value_array.reshape(4,4),3))
        print("")  

Current Episode: 1000
Reshaped State Value Estimates:
[[0.006 0.006 0.007 0.004]
 [0.006 0.    0.011 0.   ]
 [0.008 0.015 0.025 0.   ]
 [0.    0.015 0.067 0.   ]]

Current Episode: 2000
Reshaped State Value Estimates:
[[0.015 0.012 0.013 0.009]
 [0.016 0.    0.02  0.   ]
 [0.024 0.038 0.053 0.   ]
 [0.    0.04  0.122 0.   ]]

Current Episode: 3000
Reshaped State Value Estimates:
[[0.02  0.018 0.021 0.016]
 [0.023 0.    0.026 0.   ]
 [0.029 0.042 0.048 0.   ]
 [0.    0.062 0.129 0.   ]]

Current Episode: 4000
Reshaped State Value Estimates:
[[0.037 0.031 0.033 0.026]
 [0.043 0.    0.047 0.   ]
 [0.055 0.075 0.097 0.   ]
 [0.    0.098 0.181 0.   ]]

Current Episode: 5000
Reshaped State Value Estimates:
[[0.033 0.03  0.034 0.026]
 [0.036 0.    0.051 0.   ]
 [0.054 0.09  0.1   0.   ]
 [0.    0.153 0.238 0.   ]]

Current Episode: 6000
Reshaped State Value Estimates:
[[0.034 0.033 0.038 0.035]
 [0.036 0.    0.048 0.   ]
 [0.046 0.076 0.105 0.   ]
 [0.    0.131 0.219 0.   ]]

Current Episode:

The agent is not very successful when acting randomly. When we do Q(λ) later in this section, we'll see the agent perform better with a learned policy and see how changing values like discount, lambda, and learning rate affects performance.