**Reinforcement Learning with First Visit Monte Carlo**
* This notebook shows how to apply first-visit Monte Carlo to the GridWorld environment


Outline:
1. Define the GridWorld environment
2. Estimate the Q value of each state-action pair in the environment using first-visit Monte Carlo





**GridWorld**

The GridWorld environment is a four-by-four grid. The agent starts in a random cell and can move up, right, down, or left. The episode ends when the agent reaches the upper-left or lower-right corner. Every action the agent takes earns a reward of -1 until it reaches one of those two corners.
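
The movement rule can be sketched on its own before reading the full environment class. The snippet below is a minimal illustration, not part of the environment code: `next_state` is a hypothetical helper, and it assumes cells are numbered 0 to 15 row by row from the upper left, with moves off the grid leaving the agent where it is.

```python
# Hypothetical helper for illustration only; the notebook's environment
# class builds the same transitions into its P table.
UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3

def next_state(s, action, n_rows=4, n_cols=4):
    """Return the state reached from s; off-grid moves leave s unchanged."""
    y, x = divmod(s, n_cols)  # row and column of state s
    if action == UP:
        y = max(y - 1, 0)
    elif action == RIGHT:
        x = min(x + 1, n_cols - 1)
    elif action == DOWN:
        y = min(y + 1, n_rows - 1)
    elif action == LEFT:
        x = max(x - 1, 0)
    return y * n_cols + x

print(next_state(5, UP))     # 1
print(next_state(3, RIGHT))  # 3, the move hits the right edge
print(next_state(1, LEFT))   # 0, the upper-left terminal state
```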

In [0]:
#Environment from: https://github.com/dennybritz/reinforcement-learning/blob/cee9e78652f8ce98d6079282daf20680e5e17c6a/lib/envs/gridworld.py

#define the environment

import io
import numpy as np
import sys
from gym.envs.toy_text import discrete
import pprint

UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3

class GridworldEnv(discrete.DiscreteEnv):
    """
    Grid World environment from Sutton's Reinforcement Learning book chapter 4.
    You are an agent on an MxN grid and your goal is to reach the terminal
    state at the top left or the bottom right corner.
    For example, a 4x4 grid looks as follows:
    T  o  o  o
    o  x  o  o
    o  o  o  o
    o  o  o  T
    x is your position and T are the two terminal states.
    You can take actions in each direction (UP=0, RIGHT=1, DOWN=2, LEFT=3).
    Actions going off the edge leave you in your current state.
    You receive a reward of -1 at each step until you reach a terminal state.
    """

    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self, shape=[4,4]):
        if not isinstance(shape, (list, tuple)) or not len(shape) == 2:
            raise ValueError('shape argument must be a list/tuple of length 2')

        self.shape = shape

        nS = np.prod(shape)
        nA = 4

        MAX_Y = shape[0]
        MAX_X = shape[1]

        P = {}
        grid = np.arange(nS).reshape(shape)
        it = np.nditer(grid, flags=['multi_index'])

        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            # P[s][a] = (prob, next_state, reward, is_done)
            P[s] = {a : [] for a in range(nA)}

            is_done = lambda s: s == 0 or s == (nS - 1)
            reward = 0.0 if is_done(s) else -1.0
            #reward = 1.0 if is_done(s) else 0.0

            # We're stuck in a terminal state
            if is_done(s):
                P[s][UP] = [(1.0, s, reward, True)]
                P[s][RIGHT] = [(1.0, s, reward, True)]
                P[s][DOWN] = [(1.0, s, reward, True)]
                P[s][LEFT] = [(1.0, s, reward, True)]
            # Not a terminal state
            else:
                ns_up = s if y == 0 else s - MAX_X
                ns_right = s if x == (MAX_X - 1) else s + 1
                ns_down = s if y == (MAX_Y - 1) else s + MAX_X
                ns_left = s if x == 0 else s - 1
                P[s][UP] = [(1.0, ns_up, reward, is_done(ns_up))]
                P[s][RIGHT] = [(1.0, ns_right, reward, is_done(ns_right))]
                P[s][DOWN] = [(1.0, ns_down, reward, is_done(ns_down))]
                P[s][LEFT] = [(1.0, ns_left, reward, is_done(ns_left))]

            it.iternext()

        # Initial state distribution is uniform
        isd = np.ones(nS) / nS

        # We expose the model of the environment for educational purposes
        # This should not be used in any model-free learning algorithm
        self.P = P

        super(GridworldEnv, self).__init__(nS, nA, P, isd)

    def _render(self, mode='human', close=False):
        """ Renders the current gridworld layout
         For example, a 4x4 grid with the mode="human" looks like:
            T  o  o  o
            o  x  o  o
            o  o  o  o
            o  o  o  T
        where x is your position and T are the two terminal states.
        """
        if close:
            return

        outfile = io.StringIO() if mode == 'ansi' else sys.stdout

        grid = np.arange(self.nS).reshape(self.shape)
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            if self.s == s:
                output = " x "
            elif s == 0 or s == self.nS - 1:
                output = " T "
            else:
                output = " o "

            if x == 0:
                output = output.lstrip()
            if x == self.shape[1] - 1:
                output = output.rstrip()

            outfile.write(output)

            if x == self.shape[1] - 1:
                outfile.write("\n")

            it.iternext()
            
pp = pprint.PrettyPrinter(indent=2)

**The RL Training Loop**

In the next cell we define the training loop, and we run it in the cell after that. The goal is to estimate the Q value of each state-action pair using first-visit Monte Carlo. q_value_array holds the estimated values. At the end of each episode, we update q_value_array using the first-visit Monte Carlo pseudocode seen in the video. The pseudocode is from http://incompleteideas.net/book/the-book.html.



![alt text](https://drive.google.com/uc?export=view&id=1PxGyfu124QLrSL77NwDEdRUlJZE2ehh2)
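
To make the "first visit" rule concrete, here is a small worked sketch on a hand-written trajectory. It assumes the same `(state, reward, action)` tuple layout used for trajectories in this notebook; `first_visit_returns` is a hypothetical helper written for this illustration. The pair (state 1, action 0) occurs twice, and only the return following its first occurrence is kept.

```python
def first_visit_returns(traj, discount=1.0):
    """Map each (state, action) pair to the return following its first visit."""
    # timestep at which each pair first appears in the episode
    first_index = {}
    for t, (state, _reward, action) in enumerate(traj):
        first_index.setdefault((state, action), t)

    returns = {}
    g = 0.0
    # walk backwards so the return can be accumulated incrementally
    for t in range(len(traj) - 1, -1, -1):
        state, reward, action = traj[t]
        g = discount * g + reward
        if first_index[(state, action)] == t:
            returns[(state, action)] = g
    return returns

# state 5 UP -> 1, state 1 UP (wall) -> 1, state 1 UP (wall) -> 1, state 1 LEFT -> 0
traj = [(5, -1.0, 0), (1, -1.0, 0), (1, -1.0, 0), (1, -1.0, 3)]
returns = first_visit_returns(traj)
print(returns[(5, 0)])  # -4.0
print(returns[(1, 0)])  # -3.0, from the first of the two visits
print(returns[(1, 3)])  # -1.0
```

Keeping the return from the second visit to (1, 0) instead would give -2.0, so the choice of visit changes the sample that enters the average.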

In [0]:
def monte_carlo_first_visit_update(q_values, q_returns, traj, discount=1.):
  g_return = 0.
  # record the timestep at which each (state, action) pair first occurs
  first_visit_dict = {}
  for t, (state, reward, action) in enumerate(traj):
    if (state, action) not in first_visit_dict:
      first_visit_dict[(state, action)] = t
  # iterate backwards through the trajectory, accumulating the return
  for t in range(len(traj)-1,-1,-1):
    state, reward, action = traj[t]
    # calculate return
    g_return = discount*g_return + reward
    # only update at the first visit to this (state, action) pair in the episode
    if first_visit_dict[(state, action)] == t:
      # running average of returns; index 1 counts the returns seen so far
      q_returns[state][action][1] += 1
      count = q_returns[state][action][1]
      q_returns[state][action][0] += (g_return - q_returns[state][action][0]) / count
      # update the q_value with average return
      q_values[state][action] = q_returns[state][action][0]
      # in pseudocode you find argmax action here too; in this code we do it at action selection time

  return q_values, q_returns

def monte_carlo_q_value_estimate(env,episodes=1000,discount_factor=1.0,epsilon=0.1):
  state_size = env.nS
  action_size = env.nA
  max_timesteps = 100 # halt episode after this many timesteps
  timesteps = 0
  #initialize the estimated Q values to zero
  q_value_array = np.zeros((state_size, action_size))
  #initialize the collected array for holding returns; we use a running average
  q_return_array = np.zeros((state_size,action_size,2))
  
  #list for holding trajectories
  trajectory_list = []
  
  #reset the env
  current_state = env.reset()
  #env._render()

  #run through each episode, choosing each action with an epsilon-greedy policy
  #update the estimated Q values at the end of each episode
  current_episode = 0
  while current_episode < episodes:
    #choose action based on epsilon-greedy policy
    if np.random.rand() < epsilon:
      eg_action = env.action_space.sample()
    else:
      #Choose a greedy action from available max actions
      argmax_index = np.argmax(q_value_array[current_state])
      argmax_value = q_value_array[current_state][argmax_index]
      greedy_indices = np.argwhere(q_value_array[current_state] == argmax_value).reshape(-1)
      eg_action = np.random.choice(greedy_indices)

    #take a step using epsilon-greedy action
    next_state, rew, done, info = env.step(eg_action)
    trajectory_list.append((current_state,rew,eg_action))
    #optional: end gridworld early if too many timesteps taken in an episode
    timesteps += 1
    if timesteps > max_timesteps:
      done = True

    #if the episode is done use Monte Carlo to update q values and reset the env
    if done:
      q_value_array, q_return_array = monte_carlo_first_visit_update(q_value_array, q_return_array, trajectory_list, discount_factor)
      trajectory_list = []
      timesteps = 0
      current_state = env.reset()
      current_episode += 1
    else:
      current_state = next_state

  return q_value_array, q_return_array

In [6]:
env = GridworldEnv()

#run episodes with Monte Carlo method and get the Q value estimates
q_values, q_returns = monte_carlo_q_value_estimate(env, episodes=10000, discount_factor=1., epsilon=0.1)

print("All Q Value Estimates:")
print(np.round(q_values.reshape((16,4)),1))
print("each row is a state, each column is an action")
print("")

#action_dict = {0:"UP",1:"RIGHT", 2:"DOWN",3:"LEFT"}
greedy_q_value_estimates = np.max(q_values,axis=1)
print("Greedy Q Value Estimates:")
print(np.round(greedy_q_value_estimates.reshape(env.shape),1))
print("estimate of the optimal state value at each state")
print("")

env.close()

All Q Value Estimates:
[[  0.    0.    0.    0. ]
 [ -2.1  -3.9  -4.   -1. ]
 [ -5.  -11.3  -5.9  -2.1]
 [-16.9 -14.   -3.4  -5.8]
 [ -1.   -5.8  -5.6  -4.3]
 [ -2.1  -7.4  -4.3  -3.1]
 [ -6.7 -14.6  -5.2  -3.3]
 [ -4.7 -15.1  -2.2  -6.6]
 [ -2.1  -4.5  -5.4  -4.7]
 [ -3.2  -4.   -4.4  -6.2]
 [ -5.1  -4.4  -2.1  -5.8]
 [ -5.5  -2.2  -1.   -4.6]
 [ -5.6  -3.2  -6.4  -5.3]
 [ -7.8  -2.1  -3.5  -6. ]
 [ -3.4  -1.   -2.2  -3.4]
 [  0.    0.    0.    0. ]]
each row is a state, each column is an action

Greedy Q Value Estimates:
[[ 0.  -1.  -2.1 -3.4]
 [-1.  -2.1 -3.3 -2.2]
 [-2.1 -3.2 -2.1 -1. ]
 [-3.2 -2.1 -1.   0. ]]
estimate of the optimal state value at each state



The first output shows the estimated value of each action in each state. For example, row 4, column 4 (counting from 1) is the value of taking the LEFT action from the upper-right grid cell, which is -5.8 in the run above. In the second output, we take the highest Q value in each of the 16 states, which gives the agent's estimate of the state value assuming it always acts greedily.
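
The same Q table can also be read as a greedy policy by taking the argmax instead of the max. The sketch below is an illustration under stated assumptions: `greedy_policy` and `ACTION_NAMES` are hypothetical names, the Q table is a toy array rather than the trained `q_values`, and ties are resolved by `np.argmax`, which picks the lowest action index.

```python
import numpy as np

# Same order as the action constants: UP=0, RIGHT=1, DOWN=2, LEFT=3
ACTION_NAMES = ["UP", "RIGHT", "DOWN", "LEFT"]

def greedy_policy(q_values, shape=(4, 4)):
    """Return a grid of greedy action names; the two terminal corners are 'T'."""
    best = np.argmax(q_values, axis=1)  # best action index per state
    labels = np.array([ACTION_NAMES[a] for a in best], dtype=object)
    labels[0] = labels[-1] = "T"        # first and last states are terminal
    return labels.reshape(shape)

# Toy Q table: every action is worth -5 except LEFT from state 1
toy_q = np.full((16, 4), -5.0)
toy_q[1, 3] = -1.0
print(greedy_policy(toy_q))
```

Passing the trained `q_values` in place of `toy_q` would show which direction the agent prefers in each cell.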