**Reinforcement Learning with TensorFlow & TRFL: SARSE & SARSE**
* This notebook shows how to apply the classic Reinforcement Learning (RL) concepts of SARSA and SARSE with TRFL.
* In SARSA, we estimate action values: Q(s,a) like we did in Q learning. However in SARSA we do on-policy updates while in Q learning we do off-policy updates
* We can create a policy from the action values. Two types of policy categorizations are on-policy and off-policy methods. 
* In off-policy methods we use one policy for exploration (behavior policy) while we learn a seperate policy (target policy). In on-policy methods, the exploration and learned policy are the same. In SARSA we explore with the policy we are learning.
* SARSE is a slight variation of SARSA. In SARSA the next state is found by sampling an action from the policy, in SARSE the next state is the expected value of all states weighted by the policy. In SARS**A** we take an **A**ction while in SARS**E** we use **E**xpected value.

Outline:
1. Install TRFL
2. Define the GridWorld environment
3. Discuss On-policy and Off-policy methods
4. Find the value of each state-action value in the environment using SARSA
5. Find the value of each state-action value in the environment using SARSE





In [0]:
#TRFL has issues on Colab with TensorFlow version tensorflow-1.13.0rc1
#install TensorFlow 1.12 and restart run time
!pip install tensorflow==1.12

import os
os.kill(os.getpid(), 9)



In [2]:
#install TRFL
!pip install trfl==1.0

#install Tensorflow Probability
!pip install tensorflow-probability==0.5.0



**GridWorld**

The GridWorld environment is a four by four grid. The agent randomly starts on the grid and can move either up, left, right, or down. If the agent reaches the upper left or lower right the episode is over. Every action the agent takes gets a reward of -1 until you reach the upper left or over right.

In [0]:
#Environment from: https://github.com/dennybritz/reinforcement-learning/blob/cee9e78652f8ce98d6079282daf20680e5e17c6a/lib/envs/gridworld.py
#https://github.com/dennybritz/reinforcement-learning/blob/cee9e78652f8ce98d6079282daf20680e5e17c6a/DP/Value%20Iteration%20Solution.ipynb

#define the environment

import io
import numpy as np
import sys
from gym.envs.toy_text import discrete
import pprint

UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3

class GridworldEnv(discrete.DiscreteEnv):
    """
    Grid World environment from Sutton's Reinforcement Learning book chapter 4.
    You are an agent on an MxN grid and your goal is to reach the terminal
    state at the top left or the bottom right corner.
    For example, a 4x4 grid looks as follows:
    T  o  o  o
    o  x  o  o
    o  o  o  o
    o  o  o  T
    x is your position and T are the two terminal states.
    You can take actions in each direction (UP=0, RIGHT=1, DOWN=2, LEFT=3).
    Actions going off the edge leave you in your current state.
    You receive a reward of -1 at each step until you reach a terminal state.
    """

    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self, shape=[4,4]):
        if not isinstance(shape, (list, tuple)) or not len(shape) == 2:
            raise ValueError('shape argument must be a list/tuple of length 2')

        self.shape = shape

        nS = np.prod(shape)
        nA = 4

        MAX_Y = shape[0]
        MAX_X = shape[1]

        P = {}
        grid = np.arange(nS).reshape(shape)
        it = np.nditer(grid, flags=['multi_index'])

        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            # P[s][a] = (prob, next_state, reward, is_done)
            P[s] = {a : [] for a in range(nA)}

            is_done = lambda s: s == 0 or s == (nS - 1)
            reward = 0.0 if is_done(s) else -1.0
            #reward = 1.0 if is_done(s) else 0.0

            # We're stuck in a terminal state
            if is_done(s):
                P[s][UP] = [(1.0, s, reward, True)]
                P[s][RIGHT] = [(1.0, s, reward, True)]
                P[s][DOWN] = [(1.0, s, reward, True)]
                P[s][LEFT] = [(1.0, s, reward, True)]
            # Not a terminal state
            else:
                ns_up = s if y == 0 else s - MAX_X
                ns_right = s if x == (MAX_X - 1) else s + 1
                ns_down = s if y == (MAX_Y - 1) else s + MAX_X
                ns_left = s if x == 0 else s - 1
                P[s][UP] = [(1.0, ns_up, reward, is_done(ns_up))]
                P[s][RIGHT] = [(1.0, ns_right, reward, is_done(ns_right))]
                P[s][DOWN] = [(1.0, ns_down, reward, is_done(ns_down))]
                P[s][LEFT] = [(1.0, ns_left, reward, is_done(ns_left))]

            it.iternext()

        # Initial state distribution is uniform
        isd = np.ones(nS) / nS

        # We expose the model of the environment for educational purposes
        # This should not be used in any model-free learning algorithm
        self.P = P

        super(GridworldEnv, self).__init__(nS, nA, P, isd)

    def _render(self, mode='human', close=False):
        """ Renders the current gridworld layout
         For example, a 4x4 grid with the mode="human" looks like:
            T  o  o  o
            o  x  o  o
            o  o  o  o
            o  o  o  T
        where x is your position and T are the two terminal states.
        """
        if close:
            return

        outfile = io.StringIO() if mode == 'ansi' else sys.stdout

        grid = np.arange(self.nS).reshape(self.shape)
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            if self.s == s:
                output = " x "
            elif s == 0 or s == self.nS - 1:
                output = " T "
            else:
                output = " o "

            if x == 0:
                output = output.lstrip()
            if x == self.shape[1] - 1:
                output = output.rstrip()

            outfile.write(output)

            if x == self.shape[1] - 1:
                outfile.write("\n")

            it.iternext()
            
pp = pprint.PrettyPrinter(indent=2)

**Policies: On-Policy vs. Off-Policy**

A policy is the agent's action selection method for each state (a probability distribution over actions). This can be a deterministic choice like a greedy policy where the highest valued action is always chosen or a stochastic choice like in the TD learning notebook were we used a random policy at each state. Two categorizations of policies are on-policy and off-policy methods. SARSA and Q learning are very similar. The difference is in how the  action value estimate is updated. In Q learning the update is off-policy, in SARSA the update is on-policy.

In off-policy methods we use one policy for exploration (behavior policy) while we learn a separate policy (target policy). In on-policy methods, the exploration and learned policy are the same. In SARSA we explore and learn with one policy. The difference is in how we use the TD error. In Q learning the TD error is:

reward + gamma*max(Q(s',a)) - current_state_estimate. 

The max value isn't based on the current policy that the agent is actually following, it's based on a greedy policy that is always selecting the highest action value estimate. Contrast this to SARSA where the TD error is:

reward + gamma*Q(s',sampled_action) - current_state_estimate

In SARSA we sample the next action selected from the policy and use that for our next action value estimate. The code cell below has the updates side by side. SARSA is making updates using the policy that SARSA is exploring the env with.


In [4]:
#declare the environment
env = GridworldEnv()
#reset the environment and get the agent's current position (observation)
current_state = env.reset()
env._render()
print("")
action_dict = {0:"UP",1:"RIGHT", 2:"DOWN",3:"LEFT"}
q_table = np.array([[ 0.,   0.,   0.,   0. ],
 [-1.7, -2.4, -2.2, -1. ],
 [-2.3, -2.8, -2.6, -2. ],
 [-3.2, -3.3, -3.,  -3. ],
 [-1.,  -2.4, -2.6, -1.8],
 [-2.,  -2.8, -2.5, -2. ],
 [-3.,  -3.,  -3.,  -3. ],
 [-2.7, -2.5, -2.,  -2.5],
 [-2.,  -2.4, -2.6, -2.4],
 [-3.,  -3.,  -3.,  -3. ],
 [-2.5, -2.,  -2.,  -2.9],
 [-1.9, -1.5, -1.,  -2.3],
 [-3.,  -3.,  -3.5, -3.1],
 [-2.9, -2.,  -2.6, -2.9],
 [-2.5, -1.,  -1.6, -2.3],
 [ 0.,   0.,   0.,   0. ]])
alpha = 0.1
gamma = 1.

epsilon = 0.1

def get_action(s):
  #choose random action epsilon amount of the time
  if np.random.rand() < epsilon:
    action = env.action_space.sample()
    action_type = "random"
  else:
    #Choose a greedy action.
    action = np.argmax(q_table[s])
    action_type = "greedy"
  return action, action_type
   
action,action_type = get_action(current_state)

for i in range(10):
  next_state,reward,done,info = env.step(action)
  print("Agent took {} action {} and is now in state {} ".format(action_type, action_dict[action], current_state))
  #in SARSA we find our next action based on the current policy (on-policy). In Q learning we don't need the next action, we take the max of the next state
  next_action, action_type = get_action(next_state) 
  
  #update q table on-policy (SARSA)
  q_table[current_state,action] = q_table[current_state,action] + alpha*(gamma*q_table[next_state,next_action] - q_table[current_state,action])
  
  #For reference update q table off-policy (Q learning)
  #q_table[current_state,action] = q_table[current_state,action] + alpha*(gamma*np.max(q_table[next_state]) - q_table[current_state,action])
  
  env._render()
  print("")
  if done:
    print("Agent reached end of episode, resetting the env")
    current_state = env.reset()
    print("")
    env._render()
    print("")
  else:
    current_state = next_state
    action = next_action

T  o  o  o
o  o  o  x
o  o  o  o
o  o  o  T

Agent took random action UP and is now in state 7 
T  o  o  x
o  o  o  o
o  o  o  o
o  o  o  T

Agent took greedy action DOWN and is now in state 3 
T  o  o  o
o  o  o  x
o  o  o  o
o  o  o  T

Agent took greedy action DOWN and is now in state 7 
T  o  o  o
o  o  o  o
o  o  o  x
o  o  o  T

Agent took greedy action DOWN and is now in state 11 
T  o  o  o
o  o  o  o
o  o  o  o
o  o  o  x

Agent reached end of episode, resetting the env

T  o  o  o
o  o  o  o
o  o  o  o
o  o  x  T

Agent took greedy action DOWN and is now in state 14 
T  o  o  o
o  o  o  o
o  o  o  o
o  o  x  T

Agent took greedy action RIGHT and is now in state 14 
T  o  o  o
o  o  o  o
o  o  o  o
o  o  o  x

Agent reached end of episode, resetting the env

T  o  o  o
o  o  o  o
o  o  o  o
o  x  o  T

Agent took greedy action RIGHT and is now in state 13 
T  o  o  o
o  o  o  o
o  o  o  o
o  o  x  T

Agent took greedy action RIGHT and is now in state 14 
T  o  o  o
o  o  o  o


** TRFL Usage **

Once again, the three main TRFL steps are:
1. In the TensorFlow graph, define the necessary TensorFlow tensors
2. In the graph, feed the tensors into the trfl method
3. In the TensorFlow session, run the graph operation

The difference between this trfl.sarsa and trfl.qlearning is that in trfl.sarsa an additional argument is needed: the next_action_t. SARSA updates estimated values using this next_action_t while in Q learning, the update is done with the max value of q_next_t.

In [0]:
#set up TRFL graph
import tensorflow as tf
import trfl

num_actions = env.action_space.n
batch_size = 1

#https://github.com/deepmind/trfl/blob/master/docs/trfl.md#sarsaq_tm1-a_tm1-r_t-pcont_t-q_t-a_t-namesarsa
# Args:
# q_tm1: Tensor holding Q-values for first timestep in a batch of transitions, shape [B x num_actions].
# a_tm1: Tensor holding action indices, shape [B].
# r_t: Tensor holding rewards, shape [B].
# pcont_t: Tensor holding pcontinue values, shape [B].
# q_t: Tensor holding Q-values for second timestep in a batch of transitions, shape [B x num_actions].
# a_t: Tensor holding action indices for second timestep, shape [B].
# name: name to prefix ops created within this op.

q_t = tf.placeholder(dtype=tf.float32,shape=[batch_size,num_actions],name="action_value")
action_t = tf.placeholder(dtype=tf.int32,shape=[batch_size],name="action")
reward_t = tf.placeholder(dtype=tf.float32,shape=[batch_size],name='reward')
gamma_t = tf.placeholder(dtype=tf.float32,shape=[batch_size],name='discount_factor')
q_next_t = tf.placeholder(dtype=tf.float32,shape=[batch_size,num_actions],name="next_action_value")
next_action_t = tf.placeholder(dtype=tf.int32,shape=[batch_size],name="next_action_action")

_, sarsa_t = trfl.sarsa(q_t, action_t, reward_t, gamma_t, q_next_t, next_action_t, name='Sarsa')

** The RL Training Loop **

In the next cell we are going to define the training loop and then run it in the following cell. The goal is to estimate the action value of each state (the value of each state-action combination) using SARSA. action_value_array holds the estimated values. After each step the agent takes in the env, we update the action_value_array with the SARSA formula. The SARSA loop differs in that prior to updating the estimate, we select the next action. We use the next action in the update and then in the agent's next step we use that next action as the action to take.

** TRFL Usage **

The TRFL usage here is to run the trfl operation sarsa_t in sess.run(). We then take the output (sarsa_output) and extract the td_error part of that tensor. Using the td_error we update the action_value_array. For reference, the code below shows the full output of trfl.sarsa and the classic RL method of performing tabular SARSA learning updates.

In [0]:
def choose_action(q_table, state, epsilon=0.1):
  #choose action based on epsilon-greedy policy
  if np.random.rand() < epsilon:
    eg_action = env.action_space.sample()
  else:
    #Choose a greedy action. We will learn greedy actions with Q learning in the following cells.
    eg_action = np.argmax(q_table[state])
  return eg_action

def sarsa_action_value_estimate(env,episodes=1000,alpha=0.05,discount_factor=1.0,epsilon=0.1):
  """
     Args:
        env: OpenAI env. env.P represents the transition probabilities of the environment.
            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
            env.nS is a number of states in the environment. 
            env.nA is a number of actions in the environment.
        episodes: number of episodes to run
        alpha: learning rate for state value updates
        discount_factor: Gamma discount factor. pcont_t TRFL argument
        
     Returns:
      Value of each state with random policy
  """
  
  with tf.Session() as sess:
    #initialize the estimated state values to zero
    action_value_array = np.zeros((env.nS,env.nA))
    #reset the env
    current_state = env.reset()
    eg_action = choose_action(action_value_array, current_state, epsilon)
    
    #run through each episode taking a random action each time
    #upgrade estimated state value after each action
    current_episode = 0
    while current_episode < episodes:
      
      #take a step using epsilon-greedy action
      next_state, rew, done, info = env.step(eg_action)
      next_action = choose_action(action_value_array, next_state, epsilon)
      
      #run TRFL operation in the session
      sarsa_output = sess.run([sarsa_t],feed_dict={q_t:np.expand_dims(action_value_array[current_state],axis=0),
                                                             action_t:np.expand_dims(eg_action,axis=0),
                                                             reward_t:np.expand_dims(rew,axis=0),
                                                             gamma_t:np.expand_dims(discount_factor,axis=0),
                                                             q_next_t:np.expand_dims(action_value_array[next_state],axis=0),
                                                             next_action_t:np.expand_dims(next_action,axis=0)})
      
#      trfl.sarsa() returns:
#       A namedtuple with fields:
#         * `loss`: a tensor containing the batch of losses, shape `[B]`.
#         * `extra`: a namedtuple with fields:
#             * `target`: batch of target values for `q_tm1[a_tm1]`, shape `[B]`.
#             * `td_error`: batch of temporal difference errors, shape `[B]`.
      
      #Use the SARSA TD error to update estimated state-action values
      action_value_array[current_state,eg_action] = action_value_array[current_state,eg_action] + alpha * sarsa_output[0].td_error
      
      #For reference, here is the tabular SARSA update method
#       action_value_array[current_state,eg_action] = action_value_array[current_state,eg_action] + \
#          alpha * (rew + discount_factor*action_value_array[next_state,next_action] - action_value_array[current_state,eg_action])
      
      #if the epsiode is done, reset the env, if not the next state becomes the current state and the loop repeats
      if done:
        current_state = env.reset()
        eg_action = choose_action(action_value_array, current_state, epsilon)
        current_episode += 1
      else:
        current_state = next_state
        eg_action = next_action


    return action_value_array
  

  

In [7]:
#run episodes with SARSA and get the state value estimates
action_values = sarsa_action_value_estimate(env,episodes=1000,alpha=0.1)

print("All Action Value Estimates:")
print(np.round(action_values.reshape((16,4)),2))
print("each row is a state, each column is an action")
print("")

optimal_action_estimates = np.max(action_values,axis=1)
print("Current Policy State Value Estimates:")
print(np.round(optimal_action_estimates.reshape(env.shape),2))
print("estimate of the current state value at each state")
print("")

All Action Value Estimates:
[[ 0.    0.    0.    0.  ]
 [-1.54 -2.14 -1.76 -1.  ]
 [-2.43 -2.52 -2.21 -2.07]
 [-3.16 -3.2  -3.01 -3.01]
 [-1.   -1.74 -1.85 -1.57]
 [-2.01 -2.52 -2.11 -2.08]
 [-2.89 -2.89 -2.88 -2.88]
 [-2.58 -2.35 -2.08 -2.37]
 [-2.04 -2.19 -2.56 -2.52]
 [-2.76 -2.77 -2.74 -2.77]
 [-2.69 -2.05 -2.02 -2.17]
 [-1.75 -1.3  -1.   -2.12]
 [-3.   -3.   -3.06 -3.01]
 [-2.3  -2.04 -2.23 -2.28]
 [-1.91 -1.   -1.55 -1.64]
 [ 0.    0.    0.    0.  ]]
each row is a state, each column is an action

Current Policy State Value Estimates:
[[ 0.   -1.   -2.07 -3.01]
 [-1.   -2.01 -2.88 -2.08]
 [-2.04 -2.74 -2.02 -1.  ]
 [-3.   -2.04 -1.    0.  ]]
estimate of the current state value at each state



**SARSE vs. SARSA**

SARSE slightly modifies SARSA. While in SARSA we sample to get the next action, in SARSE we use the policy probabilities to create an expected value of the next state estimate. For example, with SARSA we used epsilon-greedy exploration to get the next action. 92.5% of the time SARSA chose the greedy action (90% greedy + 2.5% random) and 2.5% of the time each of the other non-greedy actions were chosen. SARSE uses these probabilities (0.925, 0.025, 0.025, 0.025) and the state-action value estimates to create an expectation. The TD error update becomes:

reward + gamma*next_state_estimate - current_state_estimate


where next_state_estimate is:

next_state_estimate = 0.925 x q_table[next_state_0,next_action_0] + 0.025 x q_table[next_state_1,next_action_1]  + 0.025 x q_table[next_state_2, next_action_2] + 0.025 x q_table[next_state_3,next_action_3]



SARSE is on-policy.

**TRFL Usage**

In SARSE we use the sarse_action_probs_t instead of next_action_t. Ie we are using the expected distribution of actions rather than the action that was actually selected by the policy.


In [0]:
#set up TRFL graph
import tensorflow as tf
import trfl

num_actions = env.action_space.n
batch_size = 1

#SARSE replaces the next_action tensor with a tensor holding a probability of next_actions

#https://github.com/deepmind/trfl/blob/master/docs/trfl.md#sarseq_tm1-a_tm1-r_t-pcont_t-q_t-probs_a_t-debugfalse-namesarse
# Args:
# q_tm1: Tensor holding Q-values for first timestep in a batch of transitions, shape [B x num_actions].
# a_tm1: Tensor holding action indices, shape [B].
# r_t: Tensor holding rewards, shape [B].
# pcont_t: Tensor holding pcontinue values, shape [B].
# q_t: Tensor holding Q-values for second timestep in a batch of transitions, shape [B x num_actions].
# probs_a_t: Tensor holding action probabilities for second timestep, shape [B x num_actions].
# debug: Boolean flag, when set to True adds ops to check whether probs_a_t is a batch of (approximately) valid probability distributions.
# name: name to prefix ops created by this function.

sarse_q_t = tf.placeholder(dtype=tf.float32,shape=[batch_size,num_actions],name="action_value")
sarse_action_t = tf.placeholder(dtype=tf.int32,shape=[batch_size],name="action")
sarse_reward_t = tf.placeholder(dtype=tf.float32,shape=[batch_size],name='reward')
sarse_gamma_t = tf.placeholder(dtype=tf.float32,shape=[batch_size],name='discount_factor')
sarse_q_next_t = tf.placeholder(dtype=tf.float32,shape=[batch_size,num_actions],name="next_action_value")
sarse_action_probs_t = tf.placeholder(dtype=tf.float32,shape=[batch_size,num_actions],name='action_probs')

_, sarse_t = trfl.sarse(sarse_q_t, sarse_action_t, sarse_reward_t, sarse_gamma_t, sarse_q_next_t, sarse_action_probs_t, name='Sarse')

In [0]:
def sarse_action_value_estimate(env,episodes=1000,alpha=0.05,discount_factor=1.0,epsilon=0.1):
  """
     Args:
        env: OpenAI env. env.P represents the transition probabilities of the environment.
            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
            env.nS is a number of states in the environment. 
            env.nA is a number of actions in the environment.
        episodes: number of episodes to run
        alpha: learning rate for state value updates
        discount_factor: Gamma discount factor. pcont_t TRFL argument
        
     Returns:
      Value of each state with random policy
  """
  
  with tf.Session() as sess:
    #initialize the estimated state values to zero
    action_value_array = np.zeros((env.nS,env.nA))
    #reset the env
    current_state = env.reset()
    
    #chance of choosing random action based on epsilon. use this with SARSE's action probabilities
    random_prob = epsilon/env.nA
    greedy_prob = 1.-epsilon
    
    #run through each episode taking a random action each time
    #upgrade estimated state value after each action
    current_episode = 0
    while current_episode < episodes:
      #choose action based on epsilon-greedy policy
      if np.random.rand() < epsilon:
        eg_action = env.action_space.sample()
      else:
        #Choose a greedy action. We will learn greedy actions with Q learning in the following cells.
        eg_action = np.argmax(action_value_array[current_state])
      
      #take a step using epsilon-greedy action
      next_state, rew, done, info = env.step(eg_action)
      
      #generate action probabilities
      #randomly choose each action with probability epislon/4 
      action_probs = np.array([random_prob]*env.nA) 
      #choose greedy action with probability 1-epsilon
      action_probs[np.argmax(action_value_array[next_state])] += greedy_prob 
      
      #run TRFL operation in the session
      sarse_output = sess.run([sarse_t],feed_dict={sarse_q_t:np.expand_dims(action_value_array[current_state],axis=0),
                                                             sarse_action_t:np.expand_dims(eg_action,axis=0),
                                                             sarse_reward_t:np.expand_dims(rew,axis=0),
                                                             sarse_gamma_t:np.expand_dims(discount_factor,axis=0),
                                                             sarse_q_next_t:np.expand_dims(action_value_array[next_state],axis=0),
                                                             sarse_action_probs_t:np.expand_dims(action_probs,axis=0)})
      
#      trfl.sarse() returns:
#       A namedtuple with fields:
#         * `loss`: a tensor containing the batch of losses, shape `[B]`.
#         * `extra`: a namedtuple with fields:
#             * `target`: batch of target values for `q_tm1[a_tm1]`, shape `[B]`.
#             * `td_error`: batch of temporal difference errors, shape `[B]`.
      
      #Use the SARSE TD error to update estimated state-action values
      action_value_array[current_state,eg_action] = action_value_array[current_state,eg_action] + alpha * sarse_output[0].td_error
      
      #For reference, here is the tabular SARSE update method
#       next_action_value_estimate = 0.
#       for i in range(env.nA):
#         next_action_value_estimate += action_probs[i] * action_value_array[next_state,i]
#       action_value_array[current_state,eg_action] = action_value_array[current_state,eg_action] + \
#          alpha * (rew + discount_factor*next_action_value_estimate - action_value_array[current_state,eg_action])
      
      #if the epsiode is done, reset the env, if not the next state becomes the current state and the loop repeats
      if done:
        current_state = env.reset()
        current_episode += 1
      else:
        current_state = next_state

    return action_value_array

In [10]:
#run episodes with SARSE and get the state value estimates
action_values = sarse_action_value_estimate(env,episodes=1000,alpha=0.1)

print("All Action Value Estimates:")
print(np.round(action_values.reshape((16,4)),2))
print("each row is a state, each column is an action")
print("")

optimal_action_estimates = np.max(action_values,axis=1)
print("Current Policy State Value Estimates:")
print(np.round(optimal_action_estimates.reshape(env.shape),2))
print("estimate of the current state value at each state")
print("")

All Action Value Estimates:
[[ 0.    0.    0.    0.  ]
 [-1.49 -1.71 -1.77 -1.  ]
 [-2.16 -2.35 -2.38 -2.05]
 [-3.   -3.16 -3.   -3.  ]
 [-1.   -1.86 -1.25 -1.42]
 [-2.03 -2.25 -2.33 -2.03]
 [-2.83 -2.83 -2.81 -2.82]
 [-2.26 -2.25 -2.05 -2.33]
 [-2.04 -2.17 -2.31 -2.29]
 [-2.88 -2.86 -2.87 -2.87]
 [-2.35 -2.04 -2.04 -2.84]
 [-1.75 -1.53 -1.   -1.93]
 [-3.   -3.   -3.01 -3.18]
 [-2.75 -2.05 -2.36 -2.48]
 [-2.06 -1.   -1.59 -1.72]
 [ 0.    0.    0.    0.  ]]
each row is a state, each column is an action

Current Policy State Value Estimates:
[[ 0.   -1.   -2.05 -3.  ]
 [-1.   -2.03 -2.81 -2.05]
 [-2.04 -2.86 -2.04 -1.  ]
 [-3.   -2.05 -1.    0.  ]]
estimate of the current state value at each state

