<a href="https://colab.research.google.com/github/RLWH/reinforcement-learning-notebook/blob/master/Ch3%20MPD/Ch3_Gridworld_MDP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# The Gridworld problem and solving gridworld by Dynamic Programming
1. Policy Evaluation
2. Policy Iteration
3. Value Iteration

## Set Up OpenAI Gym Rendering in Colab

In [0]:
!pip install gym pyvirtualdisplay > /dev/null 2>&1
!apt-get install -y xvfb python-opengl ffmpeg > /dev/null 2>&1

In [0]:
import gym
from gym import logger as gymlogger
from gym.wrappers import Monitor
gymlogger.set_level(40) #error only

import math
import glob
import io
import base64
from IPython.display import HTML
from IPython import display as ipythondisplay

In [3]:
from pyvirtualdisplay import Display
display = Display(visible=0, size=(400, 300))
display.start()

<Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '400x300x24', ':1001'] cmd=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', '0', '400x300x24', ':1001'] oserror=None return_code=None stdout="None" stderr="None" timeout_happened=False>

In [0]:
def show_video():
  mp4list = glob.glob('video/*.mp4')
  if len(mp4list) > 0:
    mp4 = mp4list[0]
    video = io.open(mp4, 'r+b').read()
    encoded = base64.b64encode(video)
    ipythondisplay.display(HTML(data='''<video alt="test" autoplay 
                controls style="height: 400px;">
                <source src="data:video/mp4;base64,{0}" type="video/mp4" />
             </video>'''.format(encoded.decode('ascii'))))
  else: 
    print("Could not find video")

In [0]:
def wrap_env(env):
  env = Monitor(env, './video', force=True)
  return env

In [6]:
env = wrap_env(gym.make('CartPole-v0'))
# env.reset()
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
#         print(observation)
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t+1))
            break
env.close()

Episode finished after 12 timesteps
Episode finished after 13 timesteps
Episode finished after 14 timesteps
Episode finished after 21 timesteps
Episode finished after 23 timesteps
Episode finished after 12 timesteps
Episode finished after 16 timesteps
Episode finished after 9 timesteps
Episode finished after 12 timesteps
Episode finished after 15 timesteps
Episode finished after 24 timesteps
Episode finished after 17 timesteps
Episode finished after 11 timesteps
Episode finished after 18 timesteps
Episode finished after 28 timesteps
Episode finished after 17 timesteps
Episode finished after 31 timesteps
Episode finished after 33 timesteps
Episode finished after 12 timesteps
Episode finished after 14 timesteps


In [7]:
show_video()

## Set Up the Gridworld Environment
Reference implementation: https://github.com/dennybritz/reinforcement-learning/blob/master/lib/envs/gridworld.py

In [0]:
import numpy as np
import sys
from io import StringIO  # used by GridworldEnv.render in 'ansi' mode

from collections import defaultdict
from gym.envs.toy_text import discrete

In [0]:
UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3

In [0]:
class GridworldEnv(discrete.DiscreteEnv):
  """
  The Gridworld environment that is a discrete space
  
  Grid World environment from Sutton's Reinforcement Learning book chapter 4.
  You are an agent on an MxN grid and your goal is to reach the terminal
  state at the top left or the bottom right corner.

  For example, a 4x4 grid looks as follows:

  T  o  o  o
  o  x  o  o
  o  o  o  o
  o  o  o  T

  x is your position and T are the two terminal states.

  You can take actions in each direction (UP=0, RIGHT=1, DOWN=2, LEFT=3).
  Actions going off the edge leave you in your current state.
  You receive a reward of -1 at each step until you reach a terminal state.
  
  A toy text discrete environment has the following attributes:
  1. nS: Number of states
  2. nA: Number of actions
  3. P: transitions (*)
  4. isd: initial state distribution (**)
  
  (*) dict of dicts of lists, where
    P[s][a] == [(probability, nextstate, reward, done), ...]
  (**) list or array of length nS

  """
  
  metadata = {'render.modes': ['human', 'ansi']}
  
  def __init__(self, shape=[4, 4]):
    if not isinstance(shape, (list, tuple)) or not len(shape) == 2:
      raise ValueError('shape argument must be a list/tuple of length 2')
      
    self.shape = shape
    
    # The grid world has shape[0] * shape[1] states
    nS = np.prod(shape)
    
    # There are only 4 possible actions
    actions = [UP, RIGHT, DOWN, LEFT]
    nA = len(actions)
    
    # Define the maximum board shape
    MAX_Y = shape[0]
    MAX_X = shape[1]
    
    # Terminal states are the top-left and bottom-right corners
    is_done = lambda s: s == 0 or s == (nS - 1)
    
    
    # Initialise P
    P = {}
    
    grid = np.arange(nS).reshape(shape)
    it = np.nditer(grid, flags=['multi_index'])
    
    while not it.finished:
      s = it.iterindex
      y, x = it.multi_index
      
      # Reward is -1 for each step and 0 once in a terminal state
      reward = 0.0 if is_done(s) else -1.0
      
      # Initialize P
      P[s] = {a: [] for a in actions}
      
      if is_done(s):
        for a in actions:
          P[s][a] = [(1.0, s, reward, True)]
      else:
        
        # Define the next state of each state
        ns_up = s if y == 0 else s - MAX_X
        ns_right = s if x == (MAX_X - 1) else s + 1
        ns_down = s if y == (MAX_Y - 1) else s + MAX_X
        ns_left = s if x == 0 else s - 1
        
        P[s][UP] = [(1.0, ns_up, reward, is_done(ns_up))]
        P[s][RIGHT] = [(1.0, ns_right, reward, is_done(ns_right))]
        P[s][DOWN] = [(1.0, ns_down, reward, is_done(ns_down))]
        P[s][LEFT] = [(1.0, ns_left, reward, is_done(ns_left))]
        
        
      it.iternext()

    # The initial state distribution is uniform.
    # These lines run once, after the loop has filled in P for every state.
    isd = np.ones(nS) / nS

    self.P = P

    super().__init__(nS, nA, P, isd)
    
    
  def render(self, mode='human', close=False):
    """
    Render the environment
    """
    if close:
      return
    
    outfile = StringIO() if mode == 'ansi' else sys.stdout
    
    grid = np.arange(self.nS).reshape(self.shape)
    it = np.nditer(grid, flags=['multi_index'])
    
    while not it.finished:
      s = it.iterindex
      y, x = it.multi_index
      
      if self.s == s:
        output = "x "
      elif s == 0 or s == self.nS - 1:
        output = "T "
      else:
        output = "o "
        
      if x == 0:
        output = output.lstrip()
        
      if x == self.shape[1] - 1:
        output = output.rstrip()

      outfile.write(output)
      
      if x == self.shape[1] - 1:
        outfile.write("\n")
        
      it.iternext()
    
    

In [11]:
env = GridworldEnv()
for t in range(100):
    env.render()
#         print(observation)
    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)
    if done:
        print("Episode finished after {} timesteps".format(t+1))
        break
env.close()

T o o o
o o o o
o o o o
o o o x
Episode finished after 1 timesteps


In [12]:
env.P[1]

{0: [(1.0, 1, -1.0, False)],
 1: [(1.0, 2, -1.0, False)],
 2: [(1.0, 5, -1.0, False)],
 3: [(1.0, 0, -1.0, True)]}
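The same transition table can be rebuilt in plain Python, which may be handy if `gym.envs.toy_text.discrete` is not available in the installed gym version. The block below is a minimal sketch: `build_gridworld_P` is a hypothetical helper (it is not part of gym or of the original environment) that mirrors the rules described in the `GridworldEnv` docstring.

```python
# Hypothetical helper, not part of gym: rebuilds the P[s][a] structure
# used by GridworldEnv without depending on DiscreteEnv.
UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3


def build_gridworld_P(rows=4, cols=4):
    """Return P where P[s][a] == [(probability, next_state, reward, done)]."""
    n_states = rows * cols

    def is_done(s):
        return s == 0 or s == n_states - 1

    P = {}
    for s in range(n_states):
        y, x = divmod(s, cols)
        if is_done(s):
            # Terminal states are absorbing and give zero reward.
            P[s] = {a: [(1.0, s, 0.0, True)] for a in range(4)}
            continue
        # Moves that would leave the grid keep the agent in place.
        next_states = {
            UP: s if y == 0 else s - cols,
            RIGHT: s if x == cols - 1 else s + 1,
            DOWN: s if y == rows - 1 else s + cols,
            LEFT: s if x == 0 else s - 1,
        }
        P[s] = {a: [(1.0, ns, -1.0, is_done(ns))]
                for a, ns in next_states.items()}
    return P


P = build_gridworld_P()
print(P[1])
```

For state 1 this should reproduce the `env.P[1]` dictionary shown above: moving UP bumps into the wall and stays in state 1, while moving LEFT reaches terminal state 0.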

## Solving Gridworld by dynamic programming
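We will solve the Bellman optimality equation using two algorithms:
1. Value iteration
2. Policy iteration

Both rely on the action value Q(s, a), which sums over all possible next states s':

$$Q(s, a) = \sum_{s'} P(s' \mid s, a)\,\left[R(s, a, s') + \gamma V(s')\right]$$

Here $P(s' \mid s, a)$ is the transition probability, $R(s, a, s')$ is the reward, $\gamma$ is the discount factor, and $V(s')$ is the value of the next state.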
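As a worked example of the Q(s, a) formula, the sketch below evaluates it for state 1 using the transitions printed by `env.P[1]` earlier. The value table `V` is an illustrative assumption (0 for the two terminal states, -1 everywhere else), not a result taken from the environment.

```python
import numpy as np

# Transitions for state 1, copied from the env.P[1] output shown earlier:
# action -> [(probability, next_state, reward, done)]
P_state_1 = {
    0: [(1.0, 1, -1.0, False)],  # UP: bumps into the wall, stays in state 1
    1: [(1.0, 2, -1.0, False)],  # RIGHT
    2: [(1.0, 5, -1.0, False)],  # DOWN
    3: [(1.0, 0, -1.0, True)],   # LEFT: reaches terminal state 0
}

# Illustrative value table (an assumption): 0 at both terminals, -1 elsewhere.
V = -np.ones(16)
V[0] = V[15] = 0.0
gamma = 1.0

# Q(1, a) = sum over next states of prob * (reward + gamma * V[next_state])
Q = np.array([
    sum(prob * (reward + gamma * V[next_state])
        for prob, next_state, reward, _ in P_state_1[action])
    for action in range(4)
])

print(Q)        # Q(1, UP), Q(1, RIGHT), Q(1, DOWN), Q(1, LEFT)
print(Q.max())  # the value a single Bellman backup assigns to state 1
```

With these numbers UP, RIGHT and DOWN each evaluate to -2 and LEFT to -1, so the backup picks LEFT and sets the value of state 1 to -1.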

### Value Iteration
In value iteration, we start with an arbitrary value function (all zeros here), then iteratively improve it until it converges to the optimal value function.

1. Initialise the value function arbitrarily
2. For each state, calculate Q(s, a) for every action
3. Since V(s) = max over a of Q(s, a), update the value function with the maximum Q(s, a)
4. If V(s) changed by less than a small threshold, stop. Otherwise, repeat from step 2.

#### Environment Inspection

In [13]:
print(env.observation_space.n)

16


In [14]:
print(env.action_space.n)

4


#### Initialise the value table

In [0]:
value_table = np.zeros(env.observation_space.n)
no_of_iterations = 100

In [16]:
value_table

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [0]:
# At the start of each iteration, we copy value_table to updated_value_table.
# A single sweep is shown here; the loop over no_of_iterations is added in the function below.
updated_value_table = np.copy(value_table)

In [0]:
for state in range(env.observation_space.n):
  Q_value = []
  
  for action in range(env.action_space.n):
    next_states_rewards = []
    for next_sr in env.P[state][action]:
      trans_prob, next_state, reward, _ = next_sr
      # gamma is 1 here, so the next state's value is not discounted
      next_states_rewards.append(trans_prob * (reward + 1 * updated_value_table[next_state]))
    # Sum over all possible next states to get Q(state, action)
    Q_value.append(np.sum(next_states_rewards))

  # Pick the maximum Q value and use it as the value of the state
  value_table[state] = max(Q_value)

In [19]:
value_table.reshape((4, 4))

array([[ 0., -1., -1., -1.],
       [-1., -1., -1., -1.],
       [-1., -1., -1., -1.],
       [-1., -1., -1.,  0.]])

#### Combine them into a function

In [0]:
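def value_iteration(env, gamma=1.0, no_of_iterations=100000):
  """
  Run value iteration on env and return the converged value table.

  gamma is the discount factor; no_of_iterations caps the number of sweeps.
  """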
  
  # Initialise value tables
  value_table = np.zeros(env.observation_space.n)
  threshold = 1e-20
  
  print(no_of_iterations)
  
  # Table update
  for i in range(no_of_iterations):
    
    # Copy the current value table to updated value table
    updated_value_table = np.copy(value_table)
    
    for state in range(env.observation_space.n):
      print("state: %s" % state, end="")
      # For each of the state, calculate the state-action value q(s,a)
      Q_value = []
      
      for action in range(env.action_space.n):
        # One step ahead search for each action
        next_states_rewards = []
        
        for next_sr in env.P[state][action]:
          # P[s][a] == [(probability, nextstate, reward, done), ...]
          trans_prob, next_state, reward, _ = next_sr
          next_states_rewards.append((trans_prob * (reward + gamma * updated_value_table[next_state])))
          
        Q_value.append(np.sum(next_states_rewards))
          
      # Pick the maximum Q value and update it as value of a state
      value_table[state] = max(Q_value)
    
    print("Diff: %s" % np.sum(np.fabs(updated_value_table - value_table)))

    if (np.sum(np.fabs(updated_value_table - value_table))) <= threshold:
      print("Value-iteration converged at iteration %d" % (i + 1))
      break

  return value_table

In [0]:
env = GridworldEnv()

In [49]:
optimal_value_function = value_iteration(env=env, gamma=1.0)

100000
state: 0state: 1state: 2state: 3state: 4state: 5state: 6state: 7state: 8state: 9state: 10state: 11state: 12state: 13state: 14state: 15Diff: 14.0
state: 0state: 1state: 2state: 3state: 4state: 5state: 6state: 7state: 8state: 9state: 10state: 11state: 12state: 13state: 14state: 15Diff: 10.0
state: 0state: 1state: 2state: 3state: 4state: 5state: 6state: 7state: 8state: 9state: 10state: 11state: 12state: 13state: 14state: 15Diff: 4.0
state: 0state: 1state: 2state: 3state: 4state: 5state: 6state: 7state: 8state: 9state: 10state: 11state: 12state: 13state: 14state: 15Diff: 0.0
Value-iteration converged at iteration 4


In [51]:
optimal_value_function.reshape((4,4))

array([[ 0., -1., -2., -3.],
       [-1., -2., -3., -2.],
       [-2., -3., -2., -1.],
       [-3., -2., -1.,  0.]])
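A quick sanity check on this result: with a reward of -1 per step and gamma = 1, the optimal value of a state should equal minus the number of steps to the nearest terminal corner. The sketch below compares the table above with that Manhattan-distance formula. The array `reported` is copied from the output above, and the variable names are chosen for illustration only.

```python
import numpy as np

# Optimal values reported by value_iteration, copied from the output above.
reported = np.array([[ 0., -1., -2., -3.],
                     [-1., -2., -3., -2.],
                     [-2., -3., -2., -1.],
                     [-3., -2., -1.,  0.]])

rows, cols = reported.shape
y, x = np.indices((rows, cols))

# Manhattan distance from every cell to each terminal corner.
dist_top_left = y + x
dist_bottom_right = (rows - 1 - y) + (cols - 1 - x)

# Minus the number of steps to the nearest terminal state.
expected = -np.minimum(dist_top_left, dist_bottom_right).astype(float)

print(np.array_equal(reported, expected))
```

The comparison holds for every cell, which suggests that value iteration converged to the shortest-path values for this grid.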

#### Extracting the optimal policy
After finding the optimal value function, how can we extract the optimal policy from it?  
We calculate the Q values using the optimal value function and, for each state, greedily pick the action with the highest Q value. The result is the optimal policy.

We do this via a function called `extract_policy()`

#### Build a Q table for each state

A Q table looks like the following:

|State|Action|Value|
|--------|-----------|--------|
|State 1|Action 1|Value 1|
|State 1|Action 2|Value 2|
|State 1|Action 3|Value 3|
|State 1|Action 4|Value 4|



In [0]:
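def extract_policy(value_table, gamma=1.0):
  """
  Return the greedy policy with respect to value_table.

  Note: this function reads the global `env` defined above.
  """
  
  # Initialise the policy with zeros; every entry is overwritten below
  policy = np.zeros(env.observation_space.n)
  
  # Build a Q table - One step ahead
  for state in range(env.observation_space.n):
    # For each state, the Q table holds one value per action
    Q_table = np.zeros(env.action_space.n)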

    for action in range(env.action_space.n):

      for next_sr in env.P[state][action]:
        # One step look ahead
        trans_prob, next_state, reward, _ = next_sr

        new_value = trans_prob * (reward + gamma * value_table[next_state])

        Q_table[action] += new_value
        
    policy[state] =  np.argmax(Q_table)
    
  return policy

In [0]:
optimal_policy = extract_policy(optimal_value_function)

In [56]:
optimal_policy

array([0., 3., 3., 2., 0., 0., 0., 2., 0., 0., 1., 2., 0., 1., 1., 0.])
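The policy array is easier to read as a grid of arrows. The sketch below is one possible way to display it; `render_policy` is a hypothetical helper introduced here for illustration, and the policy values are copied from the output above.

```python
import numpy as np

# Policy reported by extract_policy, copied from the output above.
optimal_policy = np.array([0., 3., 3., 2., 0., 0., 0., 2.,
                           0., 0., 1., 2., 0., 1., 1., 0.])

# Action encoding used by the notebook: UP=0, RIGHT=1, DOWN=2, LEFT=3.
symbols = {0: '^', 1: '>', 2: 'v', 3: '<'}
terminal_states = {0, 15}


def render_policy(policy, shape=(4, 4)):
    """Return the policy as a text grid, marking terminal states with 'T'."""
    cells = ['T' if s in terminal_states else symbols[int(a)]
             for s, a in enumerate(policy)]
    grid = np.array(cells).reshape(shape)
    return '\n'.join(' '.join(row) for row in grid)


print(render_policy(optimal_policy))
```

Several states have more than one optimal action (in state 3, for example, DOWN and LEFT are equally good). `np.argmax` returns the first maximum, so the array above is one of several optimal policies.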

### Policy Iteration
1. Policy evaluation: compute the value function of the current policy, which is initially a random one
2. Policy improvement: if the evaluated policy is not optimal, derive a new, improved policy $\pi'$ by acting greedily with respect to its value function

#### How can we evaluate the policies?
We evaluate a randomly initialized policy by computing its value function. If the policy is not yet optimal, we derive an improved policy from that value function.   
We repeat this process until the policy stops changing.

#### Steps
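1. Initialize a random policy $\pi$
2. Calculate the value function V(s) for the policy
3. If the greedy policy with respect to V(s) equals $\pi$, the policy is optimal: stop. Otherwise, take the greedy policy as the improved policy
4. Repeat from step 2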

In [0]:
gamma = 1.0

In [0]:
# Create a value table with the number of states
value_table = np.zeros(env.nS)

In [0]:
# For each state, we get the action from the policy and compute the value function
# according to that `action` and `state` as follows
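The evaluation step described in the comments above could look roughly like the following. This is a minimal sketch rather than the notebook's own implementation: `step` is a hypothetical stand-in for looking up `env.P[state][action]`, `evaluate_policy` is an illustrative name, and the policy is the one returned by `extract_policy` earlier, cast to integers.

```python
import numpy as np

# Deterministic policy copied from the extract_policy output above
# (UP=0, RIGHT=1, DOWN=2, LEFT=3).
policy = np.array([0, 3, 3, 2, 0, 0, 0, 2, 0, 0, 1, 2, 0, 1, 1, 0])

ROWS, COLS = 4, 4
TERMINALS = (0, ROWS * COLS - 1)


def step(state, action):
    """Hypothetical stand-in for env.P[state][action]: (next_state, reward)."""
    if state in TERMINALS:
        return state, 0.0
    y, x = divmod(state, COLS)
    if action == 0:
        y = max(y - 1, 0)
    elif action == 1:
        x = min(x + 1, COLS - 1)
    elif action == 2:
        y = min(y + 1, ROWS - 1)
    else:
        x = max(x - 1, 0)
    return y * COLS + x, -1.0


def evaluate_policy(policy, gamma=1.0, threshold=1e-8, max_sweeps=1000):
    """Iterative policy evaluation for a deterministic policy."""
    value_table = np.zeros(len(policy))
    for _ in range(max_sweeps):
        updated_value_table = np.copy(value_table)
        for state in range(len(policy)):
            # Follow the action the policy prescribes in this state.
            next_state, reward = step(state, policy[state])
            value_table[state] = reward + gamma * updated_value_table[next_state]
        if np.sum(np.fabs(updated_value_table - value_table)) <= threshold:
            break
    return value_table


print(evaluate_policy(policy).reshape(ROWS, COLS))
```

Evaluating this policy should reproduce the optimal value table found by value iteration. Note that with gamma = 1 a policy that never reaches a terminal state has no finite value, which is why the sketch caps the number of sweeps with `max_sweeps`.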


#### Combining them into `extract_policy` function
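One possible way to combine policy evaluation with greedy improvement (the same one-step lookahead that `extract_policy` uses) is sketched below. It is an illustration under stated assumptions, not the notebook's own code: `build_P` is a hypothetical gym-free stand-in for `env.P`, and the policy is stored as a matrix of action probabilities so that the initial uniform random policy can be evaluated with gamma = 1.

```python
import numpy as np

UP, RIGHT, DOWN, LEFT = 0, 1, 2, 3


def build_P(rows=4, cols=4):
    """Hypothetical stand-in for env.P: P[s][a] == [(prob, next_state, reward, done)]."""
    n = rows * cols
    done = lambda s: s in (0, n - 1)
    P = {}
    for s in range(n):
        y, x = divmod(s, cols)
        if done(s):
            P[s] = {a: [(1.0, s, 0.0, True)] for a in range(4)}
            continue
        moves = {UP: s if y == 0 else s - cols,
                 RIGHT: s if x == cols - 1 else s + 1,
                 DOWN: s if y == rows - 1 else s + cols,
                 LEFT: s if x == 0 else s - 1}
        P[s] = {a: [(1.0, ns, -1.0, done(ns))] for a, ns in moves.items()}
    return P


def q_from_v(P, V, state, gamma):
    """One-step lookahead: Q(state, a) for every action a."""
    return np.array([sum(p * (r + gamma * V[ns]) for p, ns, r, _ in P[state][a])
                     for a in sorted(P[state])])


def policy_evaluation(P, policy, gamma=1.0, threshold=1e-8, max_sweeps=10_000):
    """Iteratively compute V for a stochastic policy of shape (nS, nA)."""
    V = np.zeros(len(P))
    for _ in range(max_sweeps):
        new_V = np.array([policy[s] @ q_from_v(P, V, s, gamma) for s in P])
        if np.max(np.abs(new_V - V)) < threshold:
            return new_V
        V = new_V
    return V


def policy_iteration(P, gamma=1.0, max_iterations=100):
    n_states, n_actions = len(P), len(P[0])
    # Start from the uniform random policy, which reaches a terminal state
    # with probability 1, so its evaluation converges even with gamma = 1.
    policy = np.ones((n_states, n_actions)) / n_actions
    for _ in range(max_iterations):
        V = policy_evaluation(P, policy, gamma)
        # Greedy improvement: put all probability on the best action.
        greedy = np.array([np.argmax(q_from_v(P, V, s, gamma)) for s in P])
        new_policy = np.eye(n_actions)[greedy]
        if np.array_equal(new_policy, policy):
            break  # the policy is stable, so it is optimal
        policy = new_policy
    return V, greedy


V, policy = policy_iteration(build_P())
print(V.reshape(4, 4))
print(policy.reshape(4, 4))
```

On the 4x4 grid this should arrive at the same value table and greedy policy as value iteration. The iteration and sweep caps are a safeguard: with gamma = 1, a deterministic policy that never reaches a terminal state has no finite value, so an arbitrary starting policy is not guaranteed to evaluate cleanly.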