**Reinforcement Learning with TensorFlow & TRFL: TD Learning**
* This notebook shows how to use TD learning with TRFL.
* This simple example is meant to familiarize you with TRFL. For one-step TD learning updates in the tabular case, the classic method (shown in the code for reference) is faster. The TRFL method pays off when the TD learning loss is applied to batches of tensors. Section 2, Deep Q Networks, will highlight this.

Outline:
1. Install TRFL
2. Define the GridWorld environment
3. Introduce Gym Environments
4. Estimate the value of each state in the environment using TD learning and a random policy






In [0]:
#TRFL works with TensorFlow 1.12
#installs TensorFlow version 1.12 then restarts the runtime
!pip install tensorflow==1.12

import os
os.kill(os.getpid(), 9)



In [1]:
#install TRFL
!pip install trfl==1.0

#install Tensorflow Probability
!pip install tensorflow-probability==0.5.0



**Gridworld**

The Gridworld environment is a four-by-four grid. The agent starts in a random cell and can move up, right, down, or left. The episode ends when the agent reaches the upper-left or lower-right corner. Every action taken from a non-terminal cell earns a reward of -1. The Gridworld code is from Denny Britz's Reinforcement Learning repo: https://github.com/dennybritz/reinforcement-learning
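
To make the state numbers in the printed output easier to read, here is a minimal sketch of the movement rule in plain Python. It assumes cells are numbered row by row from 0 to 15, as in the environment code in this notebook. The helper next_state is hypothetical and is not part of GridworldEnv.

```python
# Hypothetical helper, not part of GridworldEnv: mirrors its movement rule.
UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3


def next_state(state, action, rows=4, cols=4):
    """Return the state reached from `state` by taking `action`."""
    # Terminal states (upper-left and lower-right corners) are absorbing.
    if state in (0, rows * cols - 1):
        return state
    row, col = divmod(state, cols)  # states are numbered row by row
    if action == UP:
        row = max(row - 1, 0)
    elif action == RIGHT:
        col = min(col + 1, cols - 1)
    elif action == DOWN:
        row = min(row + 1, rows - 1)
    elif action == LEFT:
        col = max(col - 1, 0)
    return row * cols + col


print(next_state(8, UP))    # 4: one row up from the start of row 2
print(next_state(4, LEFT))  # 4: blocked by the left edge
print(next_state(5, UP))    # 1
```

Moves that would leave the grid keep the agent in place, which is why a state can repeat between consecutive steps.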


In [0]:
#Environment from: https://github.com/dennybritz/reinforcement-learning/blob/cee9e78652f8ce98d6079282daf20680e5e17c6a/lib/envs/gridworld.py

#define the environment

import io
import numpy as np
import sys
from gym.envs.toy_text import discrete
import pprint

UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3

class GridworldEnv(discrete.DiscreteEnv):
    """
    Grid World environment from Sutton's Reinforcement Learning book chapter 4.
    You are an agent on an MxN grid and your goal is to reach the terminal
    state at the top left or the bottom right corner.
    For example, a 4x4 grid looks as follows:
    T  o  o  o
    o  x  o  o
    o  o  o  o
    o  o  o  T
    x is your position and T are the two terminal states.
    You can take actions in each direction (UP=0, RIGHT=1, DOWN=2, LEFT=3).
    Actions going off the edge leave you in your current state.
    You receive a reward of -1 at each step until you reach a terminal state.
    """

    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self, shape=[4,4]):
        if not isinstance(shape, (list, tuple)) or not len(shape) == 2:
            raise ValueError('shape argument must be a list/tuple of length 2')

        self.shape = shape

        nS = np.prod(shape)
        nA = 4

        MAX_Y = shape[0]
        MAX_X = shape[1]

        P = {}
        grid = np.arange(nS).reshape(shape)
        it = np.nditer(grid, flags=['multi_index'])

        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            # P[s][a] = (prob, next_state, reward, is_done)
            P[s] = {a : [] for a in range(nA)}

            is_done = lambda s: s == 0 or s == (nS - 1)
            reward = 0.0 if is_done(s) else -1.0
            #reward = 1.0 if is_done(s) else 0.0

            # We're stuck in a terminal state
            if is_done(s):
                P[s][UP] = [(1.0, s, reward, True)]
                P[s][RIGHT] = [(1.0, s, reward, True)]
                P[s][DOWN] = [(1.0, s, reward, True)]
                P[s][LEFT] = [(1.0, s, reward, True)]
            # Not a terminal state
            else:
                ns_up = s if y == 0 else s - MAX_X
                ns_right = s if x == (MAX_X - 1) else s + 1
                ns_down = s if y == (MAX_Y - 1) else s + MAX_X
                ns_left = s if x == 0 else s - 1
                P[s][UP] = [(1.0, ns_up, reward, is_done(ns_up))]
                P[s][RIGHT] = [(1.0, ns_right, reward, is_done(ns_right))]
                P[s][DOWN] = [(1.0, ns_down, reward, is_done(ns_down))]
                P[s][LEFT] = [(1.0, ns_left, reward, is_done(ns_left))]

            it.iternext()

        # Initial state distribution is uniform
        isd = np.ones(nS) / nS

        # We expose the model of the environment for educational purposes
        # This should not be used in any model-free learning algorithm
        self.P = P

        super(GridworldEnv, self).__init__(nS, nA, P, isd)

    def _render(self, mode='human', close=False):
        """ Renders the current gridworld layout
         For example, a 4x4 grid with the mode="human" looks like:
            T  o  o  o
            o  x  o  o
            o  o  o  o
            o  o  o  T
        where x is your position and T are the two terminal states.
        """
        if close:
            return

        outfile = io.StringIO() if mode == 'ansi' else sys.stdout

        grid = np.arange(self.nS).reshape(self.shape)
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            if self.s == s:
                output = " x "
            elif s == 0 or s == self.nS - 1:
                output = " T "
            else:
                output = " o "

            if x == 0:
                output = output.lstrip()
            if x == self.shape[1] - 1:
                output = output.rstrip()

            outfile.write(output)

            if x == self.shape[1] - 1:
                outfile.write("\n")

            it.iternext()
            
pp = pprint.PrettyPrinter(indent=2)

**An Introduction to Gym Environments**

Gym is a popular RL library created by OpenAI. Gym has a variety of environments (https://gym.openai.com/). In a gym environment, an agent takes actions and receives rewards, which lets you test algorithms and policies (among other things). At the core of interacting with a gym environment (env) are a few key methods:

* *env.reset()*: Resets the environment and returns an observation of the current state.
* *env.step(action)*: Takes an action and returns an observation, reward, done indication, and info. Agents typically interact with and receive feedback from the env using this method.

I'll show you a simple example of the GridWorld env in action. The 'x' is the agent, and the 'T' cells are the terminal states that end the episode. Watch how the agent moves in the grid by taking actions. In this notebook we are using a random policy, i.e., the agent takes a random action at each step.

In [3]:
#declare the environment
env = GridworldEnv()
#reset the environment and get the agent's current position (observation)
observation = env.reset()
env._render()
print("")
action_dict = {0:"UP",1:"RIGHT", 2:"DOWN",3:"LEFT"}

for i in range(10):
  #get a random action
  random_action = env.action_space.sample()
  observation,reward,done,info = env.step(random_action)
  print("Agent took action {} and is now in state {} ".format(action_dict[random_action], observation))
  env._render()
  print("")
  if done:
    print("Agent reached end of episode, resetting the env")
    print(env.reset())
    print("")
    env._render()
    print("")

T  o  o  o
o  o  o  o
x  o  o  o
o  o  o  T

Agent took action UP and is now in state 4 
T  o  o  o
x  o  o  o
o  o  o  o
o  o  o  T

Agent took action LEFT and is now in state 4 
T  o  o  o
x  o  o  o
o  o  o  o
o  o  o  T

Agent took action RIGHT and is now in state 5 
T  o  o  o
o  x  o  o
o  o  o  o
o  o  o  T

Agent took action UP and is now in state 1 
T  x  o  o
o  o  o  o
o  o  o  o
o  o  o  T

Agent took action LEFT and is now in state 0 
x  o  o  o
o  o  o  o
o  o  o  o
o  o  o  T

Agent reached end of episode, resetting the env
15

T  o  o  o
o  o  o  o
o  o  o  o
o  o  o  x

Agent took action LEFT and is now in state 15 
T  o  o  o
o  o  o  o
o  o  o  o
o  o  o  x

Agent reached end of episode, resetting the env
13

T  o  o  o
o  o  o  o
o  o  o  o
o  x  o  T

Agent took action LEFT and is now in state 12 
T  o  o  o
o  o  o  o
o  o  o  o
x  o  o  T

Agent took action LEFT and is now in state 12 
T  o  o  o
o  o  o  o
o  o  o  o
x  o  o  T

Agent took action RIGHT and is no

**TRFL Usage**

The main steps for using trfl:
1. In the TensorFlow graph, define the necessary TensorFlow tensors
2. In the graph, feed the tensors into the trfl method
3. In the TensorFlow session, run the graph operation

Steps 1 and 2 are in the next cell. We define the tensors needed (step 1) and then pass them to trfl.td_learning (step 2). Step 3 happens inside the td_learning_value_estimate() function, in the line sess.run([td_learning_t],..)

In [0]:
#set up TRFL graph
import tensorflow as tf
import trfl

#https://github.com/deepmind/trfl/blob/master/docs/trfl.md#td_learningv_tm1-r_t-pcont_t-v_t-nametdlearning
# Args:
# v_tm1: Tensor holding values at previous timestep, shape [B].
# r_t: Tensor holding rewards, shape [B].
# pcont_t: Tensor holding pcontinue values, shape [B].
# v_t: Tensor holding values at current timestep, shape [B].
# name: name to prefix ops created by this function.

state_value_t = tf.placeholder(dtype=tf.float32,name="state_value")
reward_t = tf.placeholder(dtype=tf.float32,name='reward')
gamma_t = tf.placeholder(dtype=tf.float32,name='discount_factor')
next_state_value_t = tf.placeholder(dtype=tf.float32,name='next_state_value')
  
td_learning_t = trfl.td_learning(state_value_t,reward_t,gamma_t,next_state_value_t,name="td_learning")


**The RL Training Loop**

In the next cell we are going to define the training loop and then run it in the following cell. The goal is to estimate the value of each state (each cell in the gridworld) under a random policy using TD learning. state_value_array holds the estimated values and after each step the agent takes in the env, we update the state_value_array with the TD learning formula.
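
For reference, the one-step TD learning update can be written as follows. The notation is an assumption added here: V is the state value estimate (state_value_array), r_t the reward, gamma the discount factor, alpha the learning rate, and delta_t the TD error.

```latex
\delta_t = r_t + \gamma \, V(s_{t+1}) - V(s_t),
\qquad
V(s_t) \leftarrow V(s_t) + \alpha \, \delta_t
```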

**TRFL Usage**

The TRFL usage here is to run the trfl operation td_learning_t in sess.run(). We then take the output (td_sess_output) and extract its td_error field. Using the td_error we update the state_value_array. For reference, the code below shows the full output of trfl.td_learning and the classic RL method of performing tabular TD learning updates.
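
To see what the output fields mean without running a TensorFlow session, here is a plain-Python sketch of the same arithmetic. It is a stand-in, not TRFL itself: the function td_learning_sketch and both namedtuples are hypothetical, and the half-squared-error loss is an assumption about how trfl.td_learning defines its loss.

```python
from collections import namedtuple

# Hypothetical stand-ins that mirror the field names of the TRFL output.
TDExtra = namedtuple("td_extra", ["target", "td_error"])
LossOutput = namedtuple("loss_output", ["loss", "extra"])


def td_learning_sketch(v_tm1, r_t, pcont_t, v_t):
    """Plain-Python version of the one-step TD computation."""
    target = r_t + pcont_t * v_t   # bootstrapped return estimate
    td_error = target - v_tm1      # how far the current estimate is off
    loss = 0.5 * td_error ** 2     # assumed half-squared-error loss
    return LossOutput(loss, TDExtra(target, td_error))


# One update for a state currently valued at -3.0.
out = td_learning_sketch(v_tm1=-3.0, r_t=-1.0, pcont_t=1.0, v_t=-1.5)
alpha = 0.05
new_value = -3.0 + alpha * out.extra.td_error

print(out.extra.target)    # -2.5
print(out.extra.td_error)  # 0.5
print(out.loss)            # 0.125
```

The tabular update only needs td_error. The loss field matters once the values come from a network trained by gradient descent.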

In [0]:
def td_learning_value_estimate(env,episodes=1000,alpha=0.05,discount_factor=1.0):
  """
  Estimates state values under a random policy with one-step TD learning.

  Args:
    env: OpenAI env. env.P represents the transition probabilities of the environment.
      env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
      env.nS is the number of states in the environment.
      env.nA is the number of actions in the environment.
    episodes: number of episodes to run
    alpha: learning rate for state value updates
    discount_factor: Gamma discount factor. Passed to TRFL as the pcont_t argument.

  Returns:
    Estimated value of each state under a random policy
  """
  
  with tf.Session() as sess:
    #initialize the estimated state values to zero
    state_value_array = np.zeros(env.nS)
    #reset the env
    current_state = env.reset()
    #env._render()

    #run through each episode taking a random action each time
    #update the estimated state value after each action
    current_episode = 0
    while current_episode < episodes:
      #take a random action
      random_action = env.action_space.sample()
      next_state, rew, done, info = env.step(random_action)
      
      #run TRFL operation in the session
      td_sess_output = sess.run([td_learning_t],feed_dict={state_value_t:state_value_array[current_state],reward_t:rew,
                                                                gamma_t:discount_factor,next_state_value_t:state_value_array[next_state]})

      #trfl.td_learning() returns a namedtuple, for example:
        #loss_output(loss=0.0, extra=td_extra(target=2.0, td_error=0.0))
          #loss can be used with a gradient descent optimizer (we will see this in the Deep Q Network section)
          #td_extra contains:
            #target: reward + pcont * next_state_value
            #td_error: target - state_value; this is what we use to update our state values in TD learning

      #use the TD learning TD error to update estimated state values
      state_value_array[current_state] = state_value_array[current_state] + alpha*td_sess_output[0].extra.td_error
      #For reference, here is the tabular TD learning method
      #state_value_array[current_state] = state_value_array[current_state] + alpha * (rew + discount_factor*state_value_array[next_state] -state_value_array[current_state])
        
      #if the episode is done, reset the env; if not, the next state becomes the current state and the loop repeats
      if done:
        current_state = env.reset()
        current_episode += 1
      else:
        current_state = next_state


    return state_value_array
  

  

In [6]:
#run episodes with TD learning and get the state value estimates
state_values = td_learning_value_estimate(env,episodes=2000,alpha=0.03)

print("State Value Estimates:")
print(np.round(state_values,2))
print("")

print("Reshaped State Value Estimates:")
print(np.round(state_values.reshape(env.shape),2))
print("")

State Value Estimates:
[  0.   -13.06 -19.26 -20.96 -12.74 -17.34 -19.32 -19.88 -19.   -18.89
 -16.99 -16.74 -20.88 -18.12 -11.52   0.  ]

Reshaped State Value Estimates:
[[  0.   -13.06 -19.26 -20.96]
 [-12.74 -17.34 -19.32 -19.88]
 [-19.   -18.89 -16.99 -16.74]
 [-20.88 -18.12 -11.52   0.  ]]



The 'Reshaped State Value Estimates' show the TD learning estimates of the state values laid out on the grid. The closer a state is to a terminal state, the higher (less negative) its estimate, since a random walk starting there tends to end the episode in fewer steps and so collects fewer -1 rewards.
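
As a cross-check, the exact values under the random policy can be computed with iterative policy evaluation, which, unlike TD learning, uses the environment's transition model. The sketch below is not part of the original notebook. It re-implements the 4x4 grid dynamics in plain Python (the function name random_policy_values is hypothetical) and assumes a discount factor of 1.0 and a reward of -1 for every step from a non-terminal state, as in the env above.

```python
def random_policy_values(rows=4, cols=4, gamma=1.0, tol=1e-10):
    """Iterative policy evaluation for a uniformly random policy."""
    n_states = rows * cols
    terminal = {0, n_states - 1}
    values = [0.0] * n_states

    def next_state(state, action):
        # Actions: 0=UP, 1=RIGHT, 2=DOWN, 3=LEFT; moves off the grid are blocked.
        row, col = divmod(state, cols)
        if action == 0:
            row = max(row - 1, 0)
        elif action == 1:
            col = min(col + 1, cols - 1)
        elif action == 2:
            row = min(row + 1, rows - 1)
        else:
            col = max(col - 1, 0)
        return row * cols + col

    while True:
        delta = 0.0
        for s in range(n_states):
            if s in terminal:
                continue  # terminal values stay at 0
            # Average the one-step backup over the four equally likely actions.
            new_v = sum(0.25 * (-1.0 + gamma * values[next_state(s, a)])
                        for a in range(4))
            delta = max(delta, abs(new_v - values[s]))
            values[s] = new_v
        if delta < tol:
            return values


exact = random_policy_values()
for r in range(4):
    print([round(v) for v in exact[r * 4:(r + 1) * 4]])
```

For this grid the exact values are -14 next to a terminal corner and -22 in the two far corners. The TD estimates above should land near them, with the gap coming from the finite number of episodes and the constant learning rate.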