**Reinforcement Learning with TD Learning**

Outline:
1. Define the GridWorld environment
2. Estimate the value of each state in the environment using TD learning (TD(0)) under a random policy






**Gridworld**

The Gridworld environment is a four-by-four grid. The agent starts in a random cell and can move up, right, down, or left; a move that would take it off the grid leaves it where it is. The episode ends when the agent reaches the upper-left or lower-right corner. Every action the agent takes earns a reward of -1 until it reaches one of those two corners. The Gridworld code is from Denny Britz's Reinforcement Learning repo: https://github.com/dennybritz/reinforcement-learning
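As a quick illustration of the rules just described, here is a minimal sketch of a single transition that does not depend on gym. The `step` helper and the constant names are hypothetical and are not part of Denny Britz's code; states are assumed to be numbered 0 to 15 row by row, matching the environment defined below.

```python
# Hypothetical gym-free sketch of the 4x4 Gridworld dynamics.
N_ROWS, N_COLS = 4, 4
TERMINAL_STATES = {0, N_ROWS * N_COLS - 1}  # upper-left and lower-right corners
UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3

def step(state, action):
    """Return (next_state, reward, done) for one move on the grid."""
    # Terminal states are absorbing and give no further reward
    if state in TERMINAL_STATES:
        return state, 0.0, True

    row, col = divmod(state, N_COLS)
    # Moves off the edge are clipped, so the agent stays in place
    if action == UP:
        row = max(row - 1, 0)
    elif action == RIGHT:
        col = min(col + 1, N_COLS - 1)
    elif action == DOWN:
        row = min(row + 1, N_ROWS - 1)
    elif action == LEFT:
        col = max(col - 1, 0)

    next_state = row * N_COLS + col
    # Every move from a non-terminal state costs -1
    return next_state, -1.0, next_state in TERMINAL_STATES

print(step(12, UP))    # move up from the lower-left corner
print(step(8, LEFT))   # bump into the left wall
print(step(1, LEFT))   # step into the upper-left terminal state
```

The first two calls mirror the first two moves in the sample run further down: moving up from state 12 lands in state 8, and moving left from state 8 leaves the agent in state 8.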


In [0]:
#Environment from: https://github.com/dennybritz/reinforcement-learning/blob/cee9e78652f8ce98d6079282daf20680e5e17c6a/lib/envs/gridworld.py

#define the environment

import io
import numpy as np
import sys
from gym.envs.toy_text import discrete
import pprint

UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3

class GridworldEnv(discrete.DiscreteEnv):
    """
    Grid World environment from Sutton's Reinforcement Learning book chapter 4.
    You are an agent on an MxN grid and your goal is to reach the terminal
    state at the top left or the bottom right corner.
    For example, a 4x4 grid looks as follows:
    T  o  o  o
    o  x  o  o
    o  o  o  o
    o  o  o  T
    x is your position and T are the two terminal states.
    You can take actions in each direction (UP=0, RIGHT=1, DOWN=2, LEFT=3).
    Actions going off the edge leave you in your current state.
    You receive a reward of -1 at each step until you reach a terminal state.
    """

    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self, shape=[4,4]):
        if not isinstance(shape, (list, tuple)) or not len(shape) == 2:
            raise ValueError('shape argument must be a list/tuple of length 2')

        self.shape = shape

        nS = np.prod(shape)
        nA = 4

        MAX_Y = shape[0]
        MAX_X = shape[1]

        P = {}
        grid = np.arange(nS).reshape(shape)
        it = np.nditer(grid, flags=['multi_index'])

        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            # P[s][a] = (prob, next_state, reward, is_done)
            P[s] = {a : [] for a in range(nA)}

            is_done = lambda s: s == 0 or s == (nS - 1)
            reward = 0.0 if is_done(s) else -1.0
            #reward = 1.0 if is_done(s) else 0.0

            # We're stuck in a terminal state
            if is_done(s):
                P[s][UP] = [(1.0, s, reward, True)]
                P[s][RIGHT] = [(1.0, s, reward, True)]
                P[s][DOWN] = [(1.0, s, reward, True)]
                P[s][LEFT] = [(1.0, s, reward, True)]
            # Not a terminal state
            else:
                ns_up = s if y == 0 else s - MAX_X
                ns_right = s if x == (MAX_X - 1) else s + 1
                ns_down = s if y == (MAX_Y - 1) else s + MAX_X
                ns_left = s if x == 0 else s - 1
                P[s][UP] = [(1.0, ns_up, reward, is_done(ns_up))]
                P[s][RIGHT] = [(1.0, ns_right, reward, is_done(ns_right))]
                P[s][DOWN] = [(1.0, ns_down, reward, is_done(ns_down))]
                P[s][LEFT] = [(1.0, ns_left, reward, is_done(ns_left))]

            it.iternext()

        # Initial state distribution is uniform
        isd = np.ones(nS) / nS

        # We expose the model of the environment for educational purposes
        # This should not be used in any model-free learning algorithm
        self.P = P

        super(GridworldEnv, self).__init__(nS, nA, P, isd)

    def _render(self, mode='human', close=False):
        """ Renders the current gridworld layout
         For example, a 4x4 grid with the mode="human" looks like:
            T  o  o  o
            o  x  o  o
            o  o  o  o
            o  o  o  T
        where x is your position and T are the two terminal states.
        """
        if close:
            return

        outfile = io.StringIO() if mode == 'ansi' else sys.stdout

        grid = np.arange(self.nS).reshape(self.shape)
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            if self.s == s:
                output = " x "
            elif s == 0 or s == self.nS - 1:
                output = " T "
            else:
                output = " o "

            if x == 0:
                output = output.lstrip()
            if x == self.shape[1] - 1:
                output = output.rstrip()

            outfile.write(output)

            if x == self.shape[1] - 1:
                outfile.write("\n")

            it.iternext()
            
pp = pprint.PrettyPrinter(indent=2)

In [0]:
#declare the environment
env = GridworldEnv()
#reset the environment and get the agent's current position (observation)
observation = env.reset()
env._render()
print("")
action_dict = {0:"UP",1:"RIGHT", 2:"DOWN",3:"LEFT"}

for i in range(10):
    #get a random action
    random_action = env.action_space.sample()
    observation,reward,done,info = env.step(random_action)
    print("Agent took action {} and is now in state {} ".format(action_dict[random_action], observation))
    env._render()
    print("")
    if done:
        print("Agent reached end of episode, resetting the env")
        print(env.reset())
        print("")
        env._render()
        print("")

T  o  o  o
o  o  o  o
o  o  o  o
x  o  o  T

Agent took action UP and is now in state 8 
T  o  o  o
o  o  o  o
x  o  o  o
o  o  o  T

Agent took action LEFT and is now in state 8 
T  o  o  o
o  o  o  o
x  o  o  o
o  o  o  T

Agent took action RIGHT and is now in state 9 
T  o  o  o
o  o  o  o
o  x  o  o
o  o  o  T

Agent took action UP and is now in state 5 
T  o  o  o
o  x  o  o
o  o  o  o
o  o  o  T

Agent took action LEFT and is now in state 4 
T  o  o  o
x  o  o  o
o  o  o  o
o  o  o  T

Agent took action LEFT and is now in state 4 
T  o  o  o
x  o  o  o
o  o  o  o
o  o  o  T

Agent took action LEFT and is now in state 4 
T  o  o  o
x  o  o  o
o  o  o  o
o  o  o  T

Agent took action LEFT and is now in state 4 
T  o  o  o
x  o  o  o
o  o  o  o
o  o  o  T

Agent took action RIGHT and is now in state 5 
T  o  o  o
o  x  o  o
o  o  o  o
o  o  o  T

Agent took action LEFT and is now in state 4 
T  o  o  o
x  o  o  o
o  o  o  o
o  o  o  T



**The RL Training Loop**

The next cell defines the training loop, and the cell after it runs the loop. The goal is to estimate the value of each state (each cell in the gridworld) under a random policy using TD(0). state_value_array holds the estimated values; after each step the agent takes in the env, we update the entry for the state it just left using the TD(0) formula.
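For reference, the TD(0) update for a transition from state $s$ to state $s'$ with reward $r$ is $V(s) \leftarrow V(s) + \alpha \, [r + \gamma V(s') - V(s)]$, where $\alpha$ is the step size and $\gamma$ is the discount factor. The sketch below applies that rule once to a made-up transition; `td0_update` is a hypothetical helper used only for illustration, and the transition from state 5 to state 4 is an assumed example.

```python
import numpy as np

def td0_update(values, state, reward, next_state, alpha=0.05, discount_factor=1.0):
    """Apply one TD(0) update in place and return the TD error."""
    # TD target: observed reward plus the discounted estimate of the next state
    td_target = reward + discount_factor * values[next_state]
    # TD error: how far the current estimate is from the target
    td_error = td_target - values[state]
    # Move the estimate a fraction alpha of the way toward the target
    values[state] += alpha * td_error
    return td_error

values = np.zeros(16)  # all estimates start at zero
td_error = td0_update(values, state=5, reward=-1.0, next_state=4)
print(td_error, values[5])
```

With all estimates at zero, the TD error is -1 and the estimate for state 5 moves from 0 to -0.05. Repeating this over many transitions is what the training loop does.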


In [0]:
def td_learning_value_estimate(env, episodes=1000, alpha=0.05, discount_factor=1.0):
    """Estimate state values under a uniformly random policy using TD(0)."""
    #initialize the estimated state values to zero
    state_size = env.nS
    state_value_array = np.zeros(state_size)
    #reset the env to get the starting state
    current_state = env.reset()

    #run until the requested number of episodes has finished,
    #taking a random action at every step and updating the
    #estimate for the state the agent just left
    current_episode = 0
    while current_episode < episodes:
        #take a random action
        random_action = env.action_space.sample()
        next_state, rew, done, info = env.step(random_action)

        #TD(0) update: move V(current_state) toward rew + discount_factor * V(next_state)
        td_error = rew + discount_factor * state_value_array[next_state] - state_value_array[current_state]
        state_value_array[current_state] += alpha * td_error

        #if the episode is done, reset the env and count the episode;
        #otherwise the next state becomes the current state
        if done:
            current_state = env.reset()
            current_episode += 1
        else:
            current_state = next_state

    return state_value_array

In [0]:
#run episodes with TD learning and get the state value estimates
state_values = td_learning_value_estimate(env,episodes=10000,alpha=0.01)

print("State Value Estimates:")
print(np.round(state_values,2))
print("")

print("Reshaped State Value Estimates:")
print(np.round(state_values.reshape(env.shape),2))
print("")

State Value Estimates:
[  0.   -14.12 -19.95 -22.11 -14.39 -18.12 -20.1  -20.33 -20.52 -20.17
 -18.05 -14.43 -22.27 -20.02 -13.36   0.  ]

Reshaped State Value Estimates:
[[  0.   -14.12 -19.95 -22.11]
 [-14.39 -18.12 -20.1  -20.33]
 [-20.52 -20.17 -18.05 -14.43]
 [-22.27 -20.02 -13.36   0.  ]]



The 'Reshaped State Value Estimates' show the TD(0) estimates laid out on the grid. States closer to a terminal state have higher (less negative) estimates: under the random policy the agent tends to reach a terminal state in fewer steps from there, so it collects fewer -1 rewards before the episode ends.
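
As a sanity check on the estimates, the state values under the random policy can also be computed exactly, because the Gridworld dynamics are known. The sketch below is not part of the original notebook: it assumes the same 4x4 layout, a uniform random policy, and a discount factor of 1.0, and it solves the Bellman equations as a linear system with NumPy instead of learning from samples. The function name `exact_random_policy_values` is hypothetical.

```python
import numpy as np

def exact_random_policy_values(shape=(4, 4), discount_factor=1.0):
    """Solve the Bellman equations for the uniform random policy."""
    n_rows, n_cols = shape
    n_states = n_rows * n_cols
    terminals = {0, n_states - 1}

    # Build A v = b, where each non-terminal row encodes
    # v(s) - discount_factor * sum_a 0.25 * v(next_state(s, a)) = -1
    A = np.eye(n_states)
    b = np.zeros(n_states)
    for s in range(n_states):
        if s in terminals:
            continue  # row stays v(s) = 0 for terminal states
        y, x = divmod(s, n_cols)
        next_states = [
            s if y == 0 else s - n_cols,           # UP
            s if x == n_cols - 1 else s + 1,       # RIGHT
            s if y == n_rows - 1 else s + n_cols,  # DOWN
            s if x == 0 else s - 1,                # LEFT
        ]
        b[s] = -1.0  # every action from a non-terminal state earns -1
        for ns in next_states:
            A[s, ns] -= 0.25 * discount_factor
    return np.linalg.solve(A, b)

exact_values = exact_random_policy_values()
print(np.round(exact_values.reshape(4, 4), 2))
```

The exact values for the first row are 0, -14, -20 and -22, so the TD(0) estimates printed above (for example -14.12, -19.95 and -22.11) are close but still carry some sampling noise.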