**Reinforcement Learning with TensorFlow & TRFL: Q(λ)**

Outline:
* Q(λ) 
* TRFL usage with trfl.qlambda()




In [0]:
#TRFL works with TensorFlow 1.12
#installs TensorFlow version 1.12 then restarts the runtime
!pip install tensorflow==1.12

import os
os.kill(os.getpid(), 9)



In [1]:
#install tensorflow-probability 0.5.0 that works with TensorFlow 1.12
!pip install tensorflow-probability==0.5.0

#install TRFL
!pip install trfl==1.0




In [0]:
import gym
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import trfl
import tensorflow_probability as tfp

**Q(λ)**

Q(λ) has many variants. They differ in which values of λ they use, how they handle the max over next-state actions, and how they treat on-policy versus off-policy actions. In Watkins’s Q(λ), the eligibility traces are reset to 0 whenever a non-greedy action is taken, so each backup looks ahead only as far as the first exploratory action. Naive Q(λ) and TB(λ) don’t set eligibility traces to 0 on non-greedy actions. Peng’s Q(λ) is a hybrid of SARSA(λ) and Watkins’s Q(λ).
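
To make the difference between cutting and not cutting traces concrete, here is a minimal backward-view sketch in plain Python. The function name `decay_traces` and the dictionary layout are assumptions made for illustration; the notebook itself does not maintain traces explicitly, because `trfl.qlambda()` works from whole trajectories.

```python
def decay_traces(traces, next_action_is_greedy, discount, lam, variant="naive"):
    """Return the eligibility traces carried into the next time step.

    traces: dict mapping (state, action) -> trace value.
    variant: "watkins" zeroes every trace when the next action is exploratory,
             "naive" always decays traces by discount * lam.
    """
    if variant == "watkins" and not next_action_is_greedy:
        return {key: 0.0 for key in traces}
    return {key: value * discount * lam for key, value in traces.items()}


# hypothetical traces for two visited (state, action) pairs
traces = {(0, 2): 1.0, (1, 2): 1.0}
print(decay_traces(traces, False, 0.99, 0.5, variant="naive"))
print(decay_traces(traces, False, 0.99, 0.5, variant="watkins"))
```

With a greedy next action both variants decay the traces identically; they only diverge on exploratory actions.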

In this notebook we'll use naive Q(λ) and ignore whether the action is on-policy or off-policy. We'll solve deterministic (i.e. not slippery) FrozenLake 4x4, non-deterministic FrozenLake 4x4, and FrozenLake 8x8.


**Example 1: FrozenLake 4x4 Not Slippery**

In the first example we set is_slippery to False in FrozenLake. Every action the agent takes then has a deterministic outcome, which makes the env much easier.
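
As a rough illustration of what "deterministic" means here, the sketch below computes the next state on a 4x4 grid for a given action. It assumes gym's FrozenLake action encoding (0=LEFT, 1=DOWN, 2=RIGHT, 3=UP) and row-major state numbering; the helper `move` is hypothetical and ignores holes and the goal.

```python
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3  # assumed FrozenLake action encoding


def move(state, action, nrow=4, ncol=4):
    """Next state for a non-slippery move; moves into a wall leave the state unchanged."""
    row, col = divmod(state, ncol)
    if action == LEFT:
        col = max(col - 1, 0)
    elif action == DOWN:
        row = min(row + 1, nrow - 1)
    elif action == RIGHT:
        col = min(col + 1, ncol - 1)
    elif action == UP:
        row = max(row - 1, 0)
    return row * ncol + col


print(move(0, RIGHT))   # one step right from the start state
print(move(14, RIGHT))  # stepping right from state 14 reaches the goal state
```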

In [0]:
from gym.envs.registration import register
register(
    id='FrozenLakeNotSlippery-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name' : '4x4', 'is_slippery': False}
)

In [0]:
#hyperparameters
episodes = 10000
learning_rate = 0.01
discount = 0.99
lambda_val = 0.5
epsilon_start = 1.0
epsilon_min = 0.01
epsilon_step = (epsilon_start - epsilon_min)/(episodes*.9)

env = gym.make('FrozenLakeNotSlippery-v0')
num_actions = env.action_space.n
stats_every = 1000

tf.reset_default_graph()

#set up input tensors
q_value_ = tf.placeholder(dtype=tf.float32, shape=[None, 1, num_actions], name="q_value")
action_ = tf.placeholder(dtype=tf.int32, shape=[None, 1], name="action")
reward_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="reward")
discount_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="discount")
q_next_ = tf.placeholder(dtype=tf.float32, shape=[None, 1, num_actions], name="q_next")
lambda_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="lambda")
#set up TRFL qlambda tensor
q_lambda_return_ = trfl.qlambda(q_value_, action_, reward_, discount_, q_next_, lambda_)

**TRFL Usage**

Q(λ) is similar to the λ methods covered in earlier videos. As in Section 1, we replace state values with Q values and add a tensor for actions. trfl.qlambda() returns a loss and, under extra, the TD error; either can be used to perform updates. Here we apply the TD error directly to a tabular array of action values.
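
To show what kind of quantity the TD error is, here is a plain-Python sketch of a forward-view Q(λ) target for a single trajectory. It follows the recursion as we understand it from TRFL's documentation (each target mixes the next target with the max next-state action value, and the last step bootstraps fully); the function name `qlambda_td_errors` is hypothetical and the sketch has not been checked against the library's output.

```python
def qlambda_td_errors(q_values, actions, rewards, discounts, q_next, lambdas):
    """Forward-view Q(lambda) targets and TD errors for one trajectory.

    Every argument is a list with one entry per time step; q_values and
    q_next hold one list of action values per step.
    """
    steps = len(rewards)
    # max over next-state action values, as in Q-learning
    state_values = [max(row) for row in q_next]
    targets = [0.0] * steps
    # the last step bootstraps fully from the next state's max action value
    targets[-1] = rewards[-1] + discounts[-1] * state_values[-1]
    # earlier steps mix the following target with the bootstrap value
    for t in range(steps - 2, -1, -1):
        mixed = lambdas[t] * targets[t + 1] + (1.0 - lambdas[t]) * state_values[t]
        targets[t] = rewards[t] + discounts[t] * mixed
    td_errors = [targets[t] - q_values[t][actions[t]] for t in range(steps)]
    return targets, td_errors


# two-step trajectory that ends with a reward of 1, all action values still zero
zeros = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
targets, td_errors = qlambda_td_errors(
    zeros, [2, 2], [0.0, 1.0], [0.99, 0.99], zeros, [0.5, 0.5])
print(targets, td_errors)
```

Because the final reward is mixed back into the earlier target, the first step receives a non-zero TD error even though its own reward is 0.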

In [5]:
stats_success = []
epsilon = epsilon_start

with tf.Session() as sess:
  action_value_array = np.zeros((16,num_actions))
  #reset the env
  current_state = env.reset()

  current_episode = 1
  q_list, action_list, reward_list, q_next_list, state_int_list = [], [], [], [], []

  while current_episode < episodes:
    #take epsilon greedy action
    if np.random.rand() < epsilon:
      action = env.action_space.sample()
    else:
      #choose a greedy action; if several actions tie for the max, choose randomly among them
      max_actions = np.argwhere(action_value_array[current_state] == np.max(action_value_array[current_state])).reshape((-1))
      action = np.random.choice(max_actions)

    next_state, rew, done, info = env.step(action)
      
    q_list.append(action_value_array[current_state])
    reward_list.append(rew)
    action_list.append(action)
    q_next_list.append(action_value_array[next_state])
    state_int_list.append(current_state)
    
    current_state = next_state
    #run TRFL qlambda tensor to get TD error
    q_lambda_output = sess.run(q_lambda_return_, feed_dict={
        q_value_:np.array(q_list).reshape(-1,1,num_actions),
        action_:np.array(action_list).reshape(-1,1),
        reward_:np.array(reward_list).reshape(-1,1),
        discount_:np.array([discount]*len(q_list)).reshape(-1,1),
        q_next_:np.array(q_next_list).reshape(-1,1,num_actions),
        lambda_:np.array([lambda_val]*len(q_list)).reshape(-1,1),
      })
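    #use the TD error output to update the action values
    #note: with repeated (state, action) pairs a fancy-indexed += applies only one update per pair;
    #Examples 2 and 3 use an explicit loop instead
    action_value_array[state_int_list, action_list] += np.squeeze(learning_rate*q_lambda_output.extra.td_error)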

    if done:
      if next_state == 15:
        stats_success.append(1)
      else:
        stats_success.append(0)
        
      q_list, action_list, reward_list, q_next_list, state_int_list = [], [], [], [], []
      current_state = env.reset()
      current_episode += 1
      
      #decrease epsilon
      epsilon -= epsilon_step
      if epsilon < epsilon_min:
        epsilon = epsilon_min
        
      if current_episode % stats_every == 0:
        print("Current Episode, Epsilon, Trailing Success %: {}, {:.2f}, {:.2f}".format(current_episode, epsilon,
                                                                                        np.mean(stats_success[-1000:])))
        optimal_action_estimates = np.max(action_value_array,axis=1)
        print("Optimal Action Value Estimates:")
        print(np.round(optimal_action_estimates.reshape(4,4),2))
        print("estimate of the optimal state value at each state")
        print("")
        print("All Action Value Estimates:")
        print(np.round(action_value_array.reshape((16,4)),2))
        print("")
        

Current Episode, Epsilon, Trailing Success %: 1000, 0.89, 0.03
Optimal Action Value Estimates:
[[0.05 0.06 0.07 0.05]
 [0.05 0.   0.08 0.  ]
 [0.04 0.05 0.12 0.  ]
 [0.   0.05 0.22 0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[0.05 0.04 0.05 0.05]
 [0.05 0.   0.06 0.05]
 [0.05 0.07 0.04 0.05]
 [0.05 0.   0.03 0.03]
 [0.04 0.03 0.   0.05]
 [0.   0.   0.   0.  ]
 [0.   0.08 0.   0.04]
 [0.   0.   0.   0.  ]
 [0.02 0.   0.04 0.03]
 [0.01 0.02 0.05 0.  ]
 [0.01 0.12 0.   0.03]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.01 0.05 0.01]
 [0.01 0.08 0.22 0.02]
 [0.   0.   0.   0.  ]]

Current Episode, Epsilon, Trailing Success %: 2000, 0.78, 0.07
Optimal Action Value Estimates:
[[0.36 0.37 0.39 0.35]
 [0.33 0.   0.43 0.  ]
 [0.25 0.32 0.49 0.  ]
 [0.   0.33 0.6  0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[0.33 0.31 0.36 0.33]
 [0.33 0.   0.37 0.34]
 [0.32 0.39 0.31 0.34]
 [0.35 0.   0.26 0.24]
 [0.

In [6]:
print("Current Episode, Epsilon, Trailing Success %: {}, {:.2f}, {:.2f}".format(current_episode, epsilon,
                                                                                        np.mean(stats_success[-1000:])))
optimal_action_estimates = np.max(action_value_array,axis=1)
print("Optimal Action Value Estimates:")
print(np.round(optimal_action_estimates.reshape(4,4),2))
print("estimate of the optimal state value at each state")
print("")
print("All Action Value Estimates:")
print(np.round(action_value_array.reshape((16,4)),2))
print("")

Current Episode, Epsilon, Trailing Success %: 10000, 0.01, 0.98
Optimal Action Value Estimates:
[[0.95 0.96 0.97 0.95]
 [0.94 0.   0.98 0.  ]
 [0.95 0.98 0.99 0.  ]
 [0.   0.99 1.   0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[0.93 0.92 0.95 0.93]
 [0.93 0.   0.96 0.94]
 [0.94 0.97 0.93 0.95]
 [0.95 0.   0.88 0.88]
 [0.9  0.94 0.   0.91]
 [0.   0.   0.   0.  ]
 [0.   0.98 0.   0.95]
 [0.   0.   0.   0.  ]
 [0.92 0.   0.95 0.92]
 [0.92 0.95 0.98 0.  ]
 [0.95 0.99 0.   0.96]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.88 0.99 0.88]
 [0.97 0.99 1.   0.97]
 [0.   0.   0.   0.  ]]



**Example 2: FrozenLake 4x4 Slippery**

Standard FrozenLake env where slippery is enabled. Notice the increased failure rate and the lower Q values.
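
For intuition about why the slippery env is harder, the sketch below lists the moves that can actually be executed for a chosen action. It assumes gym's FrozenLake slip rule, in which the intended direction and the two perpendicular directions are each taken with probability 1/3; the helper `candidate_actions` is hypothetical.

```python
def candidate_actions(action, is_slippery=True):
    """Actions that may be executed when the agent chooses `action`.

    Assumes the encoding 0=LEFT, 1=DOWN, 2=RIGHT, 3=UP, so the neighbours
    of an action modulo 4 are the two perpendicular directions.
    """
    if not is_slippery:
        return [action]
    return [(action - 1) % 4, action, (action + 1) % 4]


# choosing RIGHT (2) may result in DOWN, RIGHT or UP, each equally likely
print(candidate_actions(2))
print(candidate_actions(2, is_slippery=False))
```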

In [0]:
#hyperparameters
episodes = 10000
learning_rate = 0.01
discount = 1.
lambda_val = 0.5
epsilon_start = 1.0
epsilon_min = 0.01
epsilon_step = (epsilon_start - epsilon_min)/(episodes*.9)

seed = 31
env = gym.make('FrozenLake-v0')
env.seed(seed)
np.random.seed(seed)
num_actions = env.action_space.n
stats_every = 1000

tf.reset_default_graph()

q_value_ = tf.placeholder(dtype=tf.float32, shape=[None, 1, num_actions], name="q_value")
action_ = tf.placeholder(dtype=tf.int32, shape=[None, 1], name="action")
reward_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="reward")
discount_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="discount")
q_next_ = tf.placeholder(dtype=tf.float32, shape=[None, 1, num_actions], name="q_next")
lambda_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="lambda")

q_lambda_return_ = trfl.qlambda(q_value_, action_, reward_, discount_, q_next_, lambda_)

In [8]:
stats_success = []
epsilon = epsilon_start

with tf.Session() as sess:
  #initialize the estimated action values to zero
  action_value_array = np.zeros((16,num_actions))
  #reset the env
  current_state = env.reset()

  current_episode = 1
  q_list, action_list, reward_list, q_next_list, state_int_list = [], [], [], [], []

  while current_episode < episodes:
    #take epsilon greedy action
    if np.random.rand() < epsilon:
      action = env.action_space.sample()
    else:
      #choose a greedy action; if several actions tie for the max, choose randomly among them
      max_actions = np.argwhere(action_value_array[current_state] == np.max(action_value_array[current_state])).reshape((-1))
      action = np.random.choice(max_actions)

    next_state, rew, done, info = env.step(action)
      
    q_list.append(action_value_array[current_state])
    reward_list.append(rew)
    action_list.append(action)
    q_next_list.append(action_value_array[next_state])
    state_int_list.append(current_state)
    
    current_state = next_state
    
    q_lambda_output = sess.run(q_lambda_return_, feed_dict={
        q_value_:np.array(q_list).reshape(-1,1,num_actions),
        action_:np.array(action_list).reshape(-1,1),
        reward_:np.array(reward_list).reshape(-1,1),
        discount_:np.array([discount]*len(q_list)).reshape(-1,1),
        q_next_:np.array(q_next_list).reshape(-1,1,num_actions),
        lambda_:np.array([lambda_val]*len(q_list)).reshape(-1,1),
      })

    #loop over the trajectory so that repeated (state, action) pairs each receive their update;
    #a fancy-indexed += would apply only one update per repeated pair
    for s, a, td in zip(state_int_list,action_list,q_lambda_output.extra.td_error.tolist()):
      action_value_array[s,a] += learning_rate*td[0]

    if done:
      if next_state == 15:
        stats_success.append(1)
      else:
        stats_success.append(0)
        
      q_list, action_list, reward_list, q_next_list, state_int_list = [], [], [], [], []
      current_state = env.reset()
      current_episode += 1
      epsilon -= epsilon_step
      if epsilon < epsilon_min:
        epsilon = epsilon_min
      if current_episode % stats_every == 0:
        print("Current Episode, Epsilon, Trailing Success %: {}, {:.2f}, {:.2f}".format(current_episode, epsilon,
                                                                                        np.mean(stats_success[-1000:])))
        optimal_action_estimates = np.max(action_value_array,axis=1)
        print("Optimal Action Value Estimates:")
        print(np.round(optimal_action_estimates.reshape(4,4),2))
        print("estimate of the optimal state value at each state")
        print("")
        print("All Action Value Estimates:")
        print(np.round(action_value_array.reshape((16,4)),2))
        print("")

Current Episode, Epsilon, Trailing Success %: 1000, 0.89, 0.01
Optimal Action Value Estimates:
[[0.01 0.01 0.01 0.01]
 [0.01 0.   0.01 0.  ]
 [0.02 0.02 0.02 0.  ]
 [0.   0.02 0.05 0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[0.01 0.01 0.01 0.01]
 [0.01 0.01 0.01 0.01]
 [0.01 0.01 0.01 0.01]
 [0.   0.   0.   0.01]
 [0.01 0.01 0.01 0.01]
 [0.   0.   0.   0.  ]
 [0.01 0.01 0.01 0.  ]
 [0.   0.   0.   0.  ]
 [0.01 0.01 0.01 0.02]
 [0.01 0.02 0.01 0.01]
 [0.01 0.02 0.02 0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.01 0.01 0.02 0.02]
 [0.01 0.04 0.04 0.05]
 [0.   0.   0.   0.  ]]

Current Episode, Epsilon, Trailing Success %: 2000, 0.78, 0.03
Optimal Action Value Estimates:
[[0.04 0.03 0.03 0.02]
 [0.04 0.   0.03 0.  ]
 [0.04 0.06 0.06 0.  ]
 [0.   0.1  0.14 0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[0.04 0.03 0.03 0.03]
 [0.03 0.03 0.03 0.03]
 [0.03 0.03 0.02 0.02]
 [0.02 0.02 0.02 0.02]
 [0.

In [9]:
print("Current Episode, Epsilon, Trailing Success %: {}, {:.2f}, {:.2f}".format(current_episode, epsilon,
                                                                                        np.mean(stats_success[-1000:])))
optimal_action_estimates = np.max(action_value_array,axis=1)
print("Optimal Action Value Estimates:")
print(np.round(optimal_action_estimates.reshape(4,4),2))
print("estimate of the optimal state value at each state")
print("")
print("All Action Value Estimates:")
print(np.round(action_value_array.reshape((16,4)),2))
print("")

Current Episode, Epsilon, Trailing Success %: 10000, 0.01, 0.62
Optimal Action Value Estimates:
[[0.81 0.8  0.8  0.8 ]
 [0.81 0.   0.8  0.  ]
 [0.81 0.81 0.81 0.  ]
 [0.   0.81 0.84 0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[0.81 0.8  0.79 0.8 ]
 [0.71 0.79 0.79 0.8 ]
 [0.77 0.78 0.78 0.8 ]
 [0.76 0.79 0.78 0.8 ]
 [0.81 0.78 0.79 0.79]
 [0.   0.   0.   0.  ]
 [0.8  0.39 0.52 0.57]
 [0.   0.   0.   0.  ]
 [0.75 0.77 0.75 0.81]
 [0.66 0.81 0.72 0.74]
 [0.81 0.63 0.73 0.6 ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.73 0.69 0.81 0.73]
 [0.71 0.84 0.73 0.71]
 [0.   0.   0.   0.  ]]



**Example 3: FrozenLake 8x8**

FrozenLake on an 8x8 grid, where the goal is much harder to find by random exploration. To speed up learning, we add a penalty for falling into a hole.
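
The penalty can be sketched as a small reward-shaping helper that mirrors the check used in the training loop. The function name `shape_reward` is hypothetical. Note that the check fires on any episode that ends without the goal reward, which would also include hitting a step limit if the env enforces one.

```python
def shape_reward(reward, done, hole_penalty=-0.1):
    """Replace the reward of an unsuccessful terminal step with a penalty."""
    if done and reward < 1:
        return hole_penalty
    return reward


print(shape_reward(0.0, done=True))   # episode ended without reaching the goal
print(shape_reward(1.0, done=True))   # goal reached, reward unchanged
print(shape_reward(0.0, done=False))  # ordinary step, reward unchanged
```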

In [0]:
#hyperparameters
episodes = 20000
learning_rate = 0.01
discount = 0.99
lambda_val = 0.75
epsilon_start = 1.0
epsilon_min = 0.01
epsilon_step = (epsilon_start - epsilon_min)/(episodes*.9)
hole_penalty = -0.1 #penalty for falling into a hole

seed = 31
env = gym.make('FrozenLake8x8-v0')
env.seed(seed)
np.random.seed(seed)
num_actions = env.action_space.n
stats_every = 1000

tf.reset_default_graph()

q_value_ = tf.placeholder(dtype=tf.float32, shape=[None, 1, num_actions], name="q_value")
action_ = tf.placeholder(dtype=tf.int32, shape=[None, 1], name="action")
reward_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="reward")
discount_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="discount")
q_next_ = tf.placeholder(dtype=tf.float32, shape=[None, 1, num_actions], name="q_next")
lambda_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="lambda")

q_lambda_return_ = trfl.qlambda(q_value_, action_, reward_, discount_, q_next_, lambda_)

In [0]:
stats_success = []
epsilon = epsilon_start

with tf.Session() as sess:
  #initialize the estimated action values to zero
  action_value_array = np.zeros((64,num_actions))
  #reset the env
  current_state = env.reset()

  current_episode = 1
  q_list, action_list, reward_list, q_next_list, state_int_list = [], [], [], [], []

  while current_episode < episodes:
    #take epsilon greedy action
    if np.random.rand() < epsilon:
      action = env.action_space.sample()
    else:
      #choose a greedy action; if several actions tie for the max, choose randomly among them
      max_actions = np.argwhere(action_value_array[current_state] == np.max(action_value_array[current_state])).reshape((-1))
      action = np.random.choice(max_actions)

    next_state, rew, done, info = env.step(action)
      
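    #an episode that ends without the goal reward is penalized
    #(this also covers hitting the env's step limit, if one applies)
    if done and rew < 1:
      rew = hole_penalty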
      
    q_list.append(action_value_array[current_state])
    reward_list.append(rew)
    action_list.append(action)
    q_next_list.append(action_value_array[next_state])
    state_int_list.append(current_state)
    
    current_state = next_state
    
    q_lambda_output = sess.run(q_lambda_return_, feed_dict={
        q_value_:np.array(q_list).reshape(-1,1,num_actions),
        action_:np.array(action_list).reshape(-1,1),
        reward_:np.array(reward_list).reshape(-1,1),
        discount_:np.array([discount]*len(q_list)).reshape(-1,1),
        q_next_:np.array(q_next_list).reshape(-1,1,num_actions),
        lambda_:np.array([lambda_val]*len(q_list)).reshape(-1,1),
      })

    #loop over the trajectory so that repeated (state, action) pairs each receive their update;
    #a fancy-indexed += would apply only one update per repeated pair
    for s, a, td in zip(state_int_list,action_list,q_lambda_output.extra.td_error.tolist()):
      action_value_array[s,a] += learning_rate*td[0]

    if done:
      if next_state == 63:
        stats_success.append(1)
      else:
        stats_success.append(0)
        
      q_list, action_list, reward_list, q_next_list, state_int_list = [], [], [], [], []
      current_state = env.reset()
      current_episode += 1
      epsilon -= epsilon_step
      if epsilon < epsilon_min:
        epsilon = epsilon_min
      if current_episode % stats_every == 0:
        print("Current Episode, Epsilon, Trailing Success %: {}, {:.2f}, {:.2f}".format(current_episode, epsilon,
                                                                                        np.mean(stats_success[-1000:])))
        optimal_action_estimates = np.max(action_value_array,axis=1)
        print("Optimal Action Value Estimates:")
        print(np.round(optimal_action_estimates.reshape(8,8),2))
        print("estimate of the optimal state value at each state")
        print("")
        print("All Action Value Estimates:")
        print(np.round(action_value_array.reshape((64,4)),2))
        print("")

Current Episode, Epsilon, Trailing Success %: 1000, 0.95, 0.00
Optimal Action Value Estimates:
[[-0.   -0.   -0.   -0.   -0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.   -0.   -0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.    0.   -0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.01 -0.01 -0.01  0.   -0.   -0.  ]
 [-0.01 -0.01 -0.01  0.   -0.   -0.   -0.    0.  ]
 [-0.01  0.    0.    0.    0.    0.    0.    0.  ]
 [-0.01  0.    0.    0.    0.    0.    0.    0.01]
 [-0.   -0.    0.    0.    0.    0.    0.    0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[-0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.   -0.01]
 [-0.01 -0.01 -0.   -0.  ]
 [-0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.   -0.  ]
 [-0.   -0.   -0.   -0.  ]
 [-0.   -0. 

In [0]:
print("Current Episode, Epsilon, Trailing Success %: {}, {:.2f}, {:.2f}".format(current_episode, epsilon,
                                                                                np.mean(stats_success[-1000:])))
optimal_action_estimates = np.max(action_value_array,axis=1)
print("Optimal Action Value Estimates:")
print(np.round(optimal_action_estimates.reshape(8,8),2))
print("estimate of the optimal state value at each state")
print("")
print("All Action Value Estimates:")
print(np.round(action_value_array.reshape((64,4)),2))
print("")