**Reinforcement Learning with TensorFlow & TRFL: Multi-step Forward View**

Outline:
* Watkins Q(λ) with trfl.multistep_forward_view()



In [0]:
#TRFL works with TensorFlow 1.12
#installs TensorFlow version 1.12 then restarts the runtime
!pip install tensorflow==1.12

import os
os.kill(os.getpid(), 9)

Collecting tensorflow==1.12
[?25l  Downloading https://files.pythonhosted.org/packages/22/cc/ca70b78087015d21c5f3f93694107f34ebccb3be9624385a911d4b52ecef/tensorflow-1.12.0-cp36-cp36m-manylinux1_x86_64.whl (83.1MB)
[K    100% |████████████████████████████████| 83.1MB 329kB/s 
Collecting tensorboard<1.13.0,>=1.12.0 (from tensorflow==1.12)
[?25l  Downloading https://files.pythonhosted.org/packages/07/53/8d32ce9471c18f8d99028b7cef2e5b39ea8765bd7ef250ca05b490880971/tensorboard-1.12.2-py3-none-any.whl (3.0MB)
[K    100% |████████████████████████████████| 3.1MB 11.6MB/s 
Installing collected packages: tensorboard, tensorflow
  Found existing installation: tensorboard 1.13.1
    Uninstalling tensorboard-1.13.1:
      Successfully uninstalled tensorboard-1.13.1
  Found existing installation: tensorflow 1.13.1
    Uninstalling tensorflow-1.13.1:
      Successfully uninstalled tensorflow-1.13.1
Successfully installed tensorboard-1.12.2 tensorflow-1.12.0


In [0]:
#install tensorflow-probability 0.5.0 that works with TensorFlow 1.12
!pip install tensorflow-probability==0.5.0

#install TRFL
!pip install trfl




In [0]:
import gym
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import trfl
import tensorflow_probability as tfp

** Multi-step Forward View **

trfl.td_lambda() calls trfl.generalized_lambda_returns() which calls trfl.multistep_forward_view(). trfl.qlambda() also calls trfl.multistep_forward_view(). You can alter the state_values argument in trfl.multistep_forward_view() to implement Q(λ) or SARSA(λ) and alter the lambda_ argument in trfl.multistep_forward_view to implement Peng’s Q(λ), Watkins’ Q(λ) and Retrace (more on this in Section 5). 

In this notebook we'll implemnt Watkins’ Q(λ) by setting the eligibility trace to 0 after the first non-greedy action is taken.


** Example 1: FrozenLake 4x4 Not Slippery **

First example we set is_slippery to False in FrozenLake. Every action the agent takes becomes deterministic, making the env much easier.

In [0]:
from gym.envs.registration import register
register(
    id='FrozenLakeNotSlippery-v0',
    entry_point='gym.envs.toy_text:FrozenLakeEnv',
    kwargs={'map_name' : '4x4', 'is_slippery': False}
)

In [0]:
#hyperparameters
episodes = 10000
learning_rate = 0.01
discount = 0.99
lambda_val = 0.8
epsilon_start = 1.0
epsilon_min = 0.01
epsilon_step = (epsilon_start - epsilon_min)/(episodes*.9)

env = gym.make('FrozenLakeNotSlippery-v0')
num_actions = env.action_space.n
stats_every = 1000

tf.reset_default_graph()

reward_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="reward")
discount_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="discount")
state_value_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="state_value")
lambda_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="lambda")
                              
mfv_return_ = trfl.multistep_forward_view(reward_, discount_, state_value_, lambda_, back_prop=False)

** TRFL Usage **

We input tensors for reward_, discount_, state_value_, and lambda_ into trfl.multistep_forward_view(). The state_value_ tensor_ in this case is the max value from the next state Q value. We run the mfv_return_ tensor in the session and use return minus the current Q value to get the TD error.

For Watkins’ Q(λ) we'll feed in a list of lambda_ values to trfl.multistep_forward_view(). The list of lambda_ values will be 0 from the point of the first non-greedy action onwards, as you can see below in the lambda_list.append() code in the action selection section.

In [0]:
stats_success = []
epsilon = epsilon_start

with tf.Session() as sess:
  #initialize the estimated state values to zero
  action_value_array = np.zeros((16,num_actions))
  #reset the env
  current_state = env.reset()

  current_episode = 1
  state_value_list, q_action_list, reward_list, state_int_list, action_list = [], [], [], [], []
  eligibility_cutoff = False
  
  while current_episode < episodes:
    #take epsilon greedy action
    if np.random.rand() < epsilon:
      action = env.action_space.sample()
      #if random action is not max action, cutoff eligibility cutoffs
      if action_value_array[current_state, action] != np.max(action_value_array[current_state]):
        eligibility_cutoff = True
    else:
      #Choose a greedy action. If multiple greedy actions randomly choose between
      max_actions = np.argwhere(action_value_array[current_state] == np.max(action_value_array[current_state])).reshape((-1))
      action = np.random.choice(max_actions)

    next_state, rew, done, info = env.step(action)
      
    reward_list.append(rew)
    action_list.append(action)
    q_action_list.append(action_value_array[current_state, action])
    state_int_list.append(current_state)
    arg_action = np.argmax(action_value_array[next_state])
    state_value_list = action_value_array[next_state, arg_action]
    
    current_state = next_state
    
    #run TRFL tensor
    mfv_output = sess.run(mfv_return_, feed_dict={
        reward_:np.array(reward_list).reshape(-1,1),
        discount_:np.array([discount]*len(reward_list)).reshape(-1,1),
        state_value_:np.array(state_value_list).reshape(-1,1),
        lambda_:np.array([lambda_val]*len(reward_list)).reshape(-1,1),
      })
    #use mfv output and subtract q_value to get td_error
    td_error = mfv_output - np.array(q_action_list).reshape(-1,1)

    #update action values
    #action_value_array[state_int_list, action_list] += learning_rate*td_error
    for s, a, td in zip(state_int_list, action_list, td_error.tolist()):
      action_value_array[s,a] += learning_rate*td[0]
    
    # cut off one action past the first exploration
    if eligibility_cutoff:
      done = 1
      
    if done:
      if next_state == 15:
        stats_success.append(1)
      else:
        stats_success.append(0)
        
      state_value_list, q_action_list, reward_list, state_int_list, action_list = [], [], [], [], []
      eligibility_cutoff = False
      current_state = env.reset()
      current_episode += 1
      #decrease epsilon
      epsilon -= epsilon_step
      if epsilon < epsilon_min:
        epsilon = epsilon_min
        
      if current_episode % stats_every == 0:
        print("Current Episode, Epsilon, Trailing Success %: {}, {:.2f}, {:.2f}".format(current_episode, epsilon,
                                                                                        np.mean(stats_success[-1000:])))
        optimal_action_estimates = np.max(action_value_array,axis=1)
        print("Optimal Action Value Estimates:")
        print(np.round(optimal_action_estimates.reshape(4,4),2))
        print("estimate of the optimal state value at each state")
        print("")
        print("All Action Value Estimates:")
        print(np.round(action_value_array.reshape((16,4)),2))
        print("")
        

Current Episode, Epsilon, Trailing Success %: 1000, 0.89, 0.00
Optimal Action Value Estimates:
[[0.01 0.02 0.02 0.  ]
 [0.   0.   0.03 0.  ]
 [0.   0.   0.03 0.  ]
 [0.   0.   0.04 0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[0.01 0.   0.01 0.01]
 [0.   0.   0.02 0.  ]
 [0.   0.02 0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.03 0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.03 0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.04 0.  ]
 [0.   0.   0.   0.  ]]

Current Episode, Epsilon, Trailing Success %: 2000, 0.78, 0.00
Optimal Action Value Estimates:
[[0.03 0.03 0.03 0.  ]
 [0.   0.   0.04 0.  ]
 [0.   0.   0.05 0.  ]
 [0.   0.   0.07 0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[0.02 0.   0.03 0.02]
 [0.01 0.   0.03 0.02]
 [0.01 0.03 0.   0.01]
 [0.   0.   0.   0.  ]
 [0.

In [0]:
print("Current Episode, Epsilon, Trailing Success %: {}, {:.2f}, {:.2f}".format(current_episode, epsilon,
                                                                                        np.mean(stats_success[-1000:])))
optimal_action_estimates = np.max(action_value_array,axis=1)
print("Optimal Action Value Estimates:")
print(np.round(optimal_action_estimates.reshape(4,4),2))
print("estimate of the optimal state value at each state")
print("")
print("All Action Value Estimates:")
print(np.round(action_value_array.reshape((16,4)),2))
print("")

Current Episode, Epsilon, Trailing Success %: 10000, 0.01, 0.96
Optimal Action Value Estimates:
[[0.74 0.76 0.78 0.  ]
 [0.   0.   0.82 0.  ]
 [0.   0.   0.89 0.  ]
 [0.   0.   1.   0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[0.62 0.   0.74 0.61]
 [0.57 0.   0.76 0.59]
 [0.56 0.78 0.   0.57]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.82 0.   0.47]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.89 0.   0.45]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.55 1.   0.46]
 [0.   0.   0.   0.  ]]



** Example 2: FrozenLake 4x4 Slippery **

Standard FrozenLake env where slippery is enabled. Notice the increased failure rate and lower Q values

In [0]:
#hyperparameters
episodes = 10000
learning_rate = 0.01
discount = 0.99
lambda_val = 0.5
epsilon_start = 1.0
epsilon_min = 0.01
epsilon_step = (epsilon_start - epsilon_min)/(episodes*.9)

env = gym.make('FrozenLake-v0')
num_actions = env.action_space.n
stats_every = 1000

tf.reset_default_graph()

reward_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="reward")
discount_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="discount")
state_value_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="state_value")
lambda_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="lambda")
                              
mfv_return_ = trfl.multistep_forward_view(reward_, discount_, state_value_, lambda_, back_prop=False)

In [0]:
stats_success = []
epsilon = epsilon_start

with tf.Session() as sess:
  #initialize the estimated state values to zero
  action_value_array = np.zeros((16,num_actions))
  #reset the env
  current_state = env.reset()

  current_episode = 1
  state_value_list, q_action_list, reward_list, state_int_list, action_list = [], [], [], [], []
  eligibility_cutoff = False
  
  while current_episode < episodes:
    #take epsilon greedy action
    if np.random.rand() < epsilon:
      action = env.action_space.sample()
      #if random action is not max action, cutoff eligibility cutoffs
      if action_value_array[current_state, action] != np.max(action_value_array[current_state]):
        eligibility_cutoff = True
    else:
      #Choose a greedy action. If multiple greedy actions randomly choose between
      max_actions = np.argwhere(action_value_array[current_state] == np.max(action_value_array[current_state])).reshape((-1))
      action = np.random.choice(max_actions)

    next_state, rew, done, info = env.step(action)
      
    reward_list.append(rew)
    action_list.append(action)
    q_action_list.append(action_value_array[current_state, action])
    state_int_list.append(current_state)
    arg_action = np.argmax(action_value_array[next_state])
    state_value_list = action_value_array[next_state, arg_action]
    
    current_state = next_state
    
    
    mfv_output = sess.run(mfv_return_, feed_dict={
        reward_:np.array(reward_list).reshape(-1,1),
        discount_:np.array([discount]*len(reward_list)).reshape(-1,1),
        state_value_:np.array(state_value_list).reshape(-1,1),
        lambda_:np.array([lambda_val]*len(reward_list)).reshape(-1,1),
      })
    td_error = mfv_output - np.array(q_action_list).reshape(-1,1)

    #action_value_array[state_int_list, action_list] += learning_rate*td_error
    for s, a, td in zip(state_int_list, action_list, td_error.tolist()):
      action_value_array[s,a] += learning_rate*td[0]
    
    # cut off one action past the first exploration
    if eligibility_cutoff:
      done = 1
      
    if done:
      if next_state == 15:
        stats_success.append(1)
      else:
        stats_success.append(0)
        
      state_value_list, q_action_list, reward_list, state_int_list, action_list = [], [], [], [], []
      eligibility_cutoff = False
      current_state = env.reset()
      current_episode += 1
      epsilon -= epsilon_step
      if epsilon < epsilon_min:
        epsilon = epsilon_min
      if current_episode % stats_every == 0:
        print("Current Episode, Epsilon, Trailing Success %: {}, {:.2f}, {:.2f}".format(current_episode, epsilon,
                                                                                        np.mean(stats_success[-1000:])))
        optimal_action_estimates = np.max(action_value_array,axis=1)
        print("Optimal Action Value Estimates:")
        print(np.round(optimal_action_estimates.reshape(4,4),2))
        print("estimate of the optimal state value at each state")
        print("")
        print("All Action Value Estimates:")
        print(np.round(action_value_array.reshape((16,4)),2))
        print("")     

Current Episode, Epsilon, Trailing Success %: 1000, 0.89, 0.00
Optimal Action Value Estimates:
[[0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.01 0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.01 0.   0.  ]
 [0.   0.   0.   0.  ]]

Current Episode, Epsilon, Trailing Success %: 2000, 0.78, 0.00
Optimal Action Value Estimates:
[[0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.01 0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.

In [0]:
print("Current Episode, Epsilon, Trailing Success %: {}, {:.2f}, {:.2f}".format(current_episode, epsilon,
                                                                                        np.mean(stats_success[-1000:])))
optimal_action_estimates = np.max(action_value_array,axis=1)
print("Optimal Action Value Estimates:")
print(np.round(optimal_action_estimates.reshape(4,4),2))
print("estimate of the optimal state value at each state")
print("")
print("All Action Value Estimates:")
print(np.round(action_value_array.reshape((16,4)),2))
print("")

Current Episode, Epsilon, Trailing Success %: 10000, 0.01, 0.07
Optimal Action Value Estimates:
[[0.08 0.08 0.08 0.  ]
 [0.08 0.   0.1  0.  ]
 [0.11 0.11 0.13 0.  ]
 [0.   0.13 0.21 0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[ 0.04  0.    0.08  0.03]
 [ 0.01  0.01  0.01  0.08]
 [ 0.08  0.01  0.01  0.01]
 [ 0.    0.    0.    0.  ]
 [-0.01  0.08  0.03  0.03]
 [ 0.    0.    0.    0.  ]
 [ 0.01  0.1   0.    0.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.02  0.11  0.02  0.02]
 [ 0.01  0.11  0.01  0.01]
 [ 0.01  0.13  0.01  0.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.    0.    0.    0.  ]
 [ 0.01  0.01  0.01  0.13]
 [ 0.01  0.21  0.01  0.07]
 [ 0.    0.    0.    0.  ]]



** Example 3: FrozenLake 8x8 **

FrozenLake on an 8x8 grid. Much harder to randomly find the goal. To make learning faster, we add a penalty for falling into a hole.

In [0]:
#hyperparameters
episodes = 20000
learning_rate = 0.01
discount = 0.99
lambda_val = 0.95
epsilon_start = 1.0
epsilon_min = 0.01
epsilon_step = (epsilon_start - epsilon_min)/(episodes*.9)
hole_penalty = -0.1 #penalty for falling into a hole

seed = 31
env = gym.make('FrozenLake8x8-v0')
env.seed(seed)
np.random.seed(seed)
num_actions = env.action_space.n
stats_every = 1000

tf.reset_default_graph()

reward_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="reward")
discount_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="discount")
state_value_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="state_value")
lambda_ = tf.placeholder(dtype=tf.float32, shape=[None, 1], name="lambda")
                              
mfv_return_ = trfl.multistep_forward_view(reward_, discount_, state_value_, lambda_, back_prop=False)

In [0]:
stats_success = []
epsilon = epsilon_start

with tf.Session() as sess:
  #initialize the estimated state values to zero
  action_value_array = np.zeros((64,num_actions))
  #reset the env
  current_state = env.reset()

  current_episode = 1
  state_value_list, q_action_list, reward_list, state_int_list, action_list = [], [], [], [], []
  eligibility_cutoff = False
  
  while current_episode < episodes:
    #take epsilon greedy action
    if np.random.rand() < epsilon:
      action = env.action_space.sample()
      #if random action is not max action, cutoff eligibility cutoffs
      if action_value_array[current_state, action] != np.max(action_value_array[current_state]):
        eligibility_cutoff = True
    else:
      #Choose a greedy action. If multiple greedy actions randomly choose between
      max_actions = np.argwhere(action_value_array[current_state] == np.max(action_value_array[current_state])).reshape((-1))
      action = np.random.choice(max_actions)

    next_state, rew, done, info = env.step(action)
    
#     if done and rew < 1:
#       rew = hole_penalty
      
    reward_list.append(rew)
    action_list.append(action)
    q_action_list.append(action_value_array[current_state, action])
    state_int_list.append(current_state)
    arg_action = np.argmax(action_value_array[next_state])
    state_value_list = action_value_array[next_state, arg_action]
    
    current_state = next_state
    
    
    mfv_output = sess.run(mfv_return_, feed_dict={
        reward_:np.array(reward_list).reshape(-1,1),
        discount_:np.array([discount]*len(reward_list)).reshape(-1,1),
        state_value_:np.array(state_value_list).reshape(-1,1),
        lambda_:np.array([lambda_val]*len(reward_list)).reshape(-1,1),
      })
    td_error = mfv_output - np.array(q_action_list).reshape(-1,1)

    #action_value_array[state_int_list, action_list] += learning_rate*td_error
    for s, a, td in zip(state_int_list, action_list, td_error.tolist()):
      action_value_array[s,a] += learning_rate*td[0]
    
    # cut off one action past the first exploration
    if eligibility_cutoff:
      done = 1
      
    if done:
      if next_state == 63:
        stats_success.append(1)
      else:
        stats_success.append(0)
        
      state_value_list, q_action_list, reward_list, state_int_list, action_list = [], [], [], [], []
      eligibility_cutoff = False
      current_state = env.reset()
      current_episode += 1
      epsilon -= epsilon_step
      if epsilon < epsilon_min:
        epsilon = epsilon_min
      if current_episode % stats_every == 0:
        print("Current Episode, Epsilon, Trailing Success %: {}, {:.2f}, {:.2f}".format(current_episode, epsilon,
                                                                                        np.mean(stats_success[-1000:])))
        optimal_action_estimates = np.max(action_value_array,axis=1)
        print("Optimal Action Value Estimates:")
        print(np.round(optimal_action_estimates.reshape(8,8),2))
        print("estimate of the optimal state value at each state")
        print("")
        print("All Action Value Estimates:")
        print(np.round(action_value_array.reshape((64,4)),2))
        print("")  

Current Episode, Epsilon, Trailing Success %: 1000, 0.95, 0.00
Optimal Action Value Estimates:
[[0.   0.   0.   0.   0.   0.   0.01 0.01]
 [0.   0.   0.   0.   0.   0.   0.01 0.  ]
 [0.   0.   0.   0.   0.   0.   0.01 0.  ]
 [0.   0.   0.   0.   0.   0.   0.01 0.01]
 [0.   0.   0.   0.   0.   0.   0.   0.01]
 [0.   0.   0.   0.   0.   0.   0.   0.01]
 [0.   0.   0.   0.   0.   0.   0.   0.01]
 [0.   0.   0.   0.   0.   0.   0.   0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.01 0.   0.01 0.  ]
 [0.01 0.01 0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.01 0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  ]
 [0.   0.   0.   0.  

In [0]:
print("Current Episode, Epsilon, Trailing Success %: {}, {:.2f}, {:.2f}".format(current_episode, epsilon,
                                                                                np.mean(stats_success[-1000:])))
optimal_action_estimates = np.max(action_value_array,axis=1)
print("Optimal Action Value Estimates:")
print(np.round(optimal_action_estimates.reshape(8,8),2))
print("estimate of the optimal state value at each state")
print("")
print("All Action Value Estimates:")
print(np.round(action_value_array.reshape((64,4)),2))
print("")

Current Episode, Epsilon, Trailing Success %: 20000, 0.01, 0.00
Optimal Action Value Estimates:
[[-0.    0.    0.    0.01  0.01  0.02  0.02  0.03]
 [-0.    0.   -0.    0.01  0.01  0.01  0.02  0.03]
 [ 0.    0.    0.    0.    0.02  0.02  0.03  0.03]
 [ 0.    0.    0.    0.01  0.01  0.    0.02  0.04]
 [ 0.    0.    0.    0.    0.02  0.02  0.03  0.05]
 [ 0.    0.    0.    0.    0.01  0.01  0.    0.06]
 [ 0.    0.    0.    0.    0.    0.    0.    0.06]
 [ 0.    0.    0.    0.    0.    0.    0.    0.  ]]
estimate of the optimal state value at each state

All Action Value Estimates:
[[-0.05 -0.   -0.02 -0.06]
 [ 0.   -0.02 -0.03 -0.03]
 [-0.   -0.01  0.   -0.04]
 [ 0.01 -0.01 -0.05 -0.03]
 [ 0.01  0.    0.   -0.02]
 [ 0.    0.    0.    0.02]
 [ 0.    0.    0.02  0.  ]
 [ 0.    0.03 -0.   -0.  ]
 [-0.02 -0.   -0.03 -0.02]
 [ 0.   -0.01 -0.02 -0.01]
 [-0.   -0.01 -0.   -0.  ]
 [ 0.    0.    0.    0.01]
 [ 0.   -0.    0.01  0.  ]
 [-0.    0.    0.    0.01]
 [ 0.    0.02  0.    0.  ]
 [-0.    0.