# Executive Summary

I'm using the non game environment to practice how to train an agent to complete task by reinforcement learning. In this notebook, I tried to ***build single robot in a warehouse environment*** and ***train the AI Agent using Q-Learning***. 
 
### Here are key steps during the process:

1. Choose a random, non-terminal state (white square) for the agent to begin this new episode.
2. Choose an action (move *up*, *right*, *down*, or *left*) for the current state. Actions will be chosen using an *epsilon greedy algorithm*.
3. Perform the chosen action, and transition to the next state (i.e., move to the next location).
4. Receive the reward for moving to the new state, and calculate the temporal difference.
5. Update the Q-value for the previous state and action pair.
6. If the new (current) state is a terminal state, go to #1. Else, go to #2.

This entire process will be repeated across 1 million episodes. This will provide the AI agent sufficient opportunity to learn the shortest paths between the item packaging area and all other locations in the warehouse where the robot is allowed to travel, while simultaneously avoiding crashing into any of the item storage locations!

### ***Table of Content:***

0. What is Q-Learningg
1. Scenario - Robots in a Warehouse
2. Define the Environment
3. Train the Model
4. Get Shortest Paths
5. Travel from the Packaging Area to Another Location in the Warehouse

# 0. What is Q-Learning

***Q-Learning is a type of reinforcement learning***
- Involves an AI ***agent operating in an environment with states & rewards (input)*** and ***action (outputs)***

***Q-Learning invalves model-free environments***
- The AI agent is not seeking to learn about an underlying mathematical model or probability distribution (as with Thompson Sampling)
- Instead, the AI ***agent attempts to construct an optimal policy directly by interacting with the environment***

***Q-Learning uses a trial-and-error-based approach***
- The AI ***agent repeatedly tries to solve the problem using varied approaches***, and ***continuously updates its policy*** as it learns more and more about its environment

***All of the fundamental characteristics of reinforcement learning apply to Q-Learning models***
- An input and output system, rewards, an environment, Markov decision processes, and training & inference

***Q-Learning includes 2 additional characteristics***
- The ***number of possible states is finite :*** The AI agent will always be in one of a fixed number of possible situations
- The ***number of possible actions is finite :*** The AI agent will always need to choose from among a fixed number of possible actions


### What are Q-Values

<img src='image/q_value.png' width='500px'/>

***A Q-value indicates the quality of a particular action a in a given state s: Q(s, a)***

***Q-values are our current estimates of the sum of future rewards***
- That is, ***Q-values estimate how much additional reward we can accumulate through all remaining steps in the current episode*** if the AI agent is in state s and takes action a
- ***Q-values therefore increase as the AI agent gets closer and closer to the highest reward***

***Q-values are stored in a Q-table, which has one row for each possible state, and one column for each possible action***
- An optimal Q-table contains values that allow the AI agent to take the best action in any possible state, thus providing the agent with the optimal path to the highest reward
- The Q-table therefore represents the AI agent's policy for acting in the current environment

### What are Temporal Differences

- Temporal differences (TDs) provide us with a method of caluclating how much the Q-value for the action taken in the previous state should be changed based on what the AI agent has learned about the Q-values for the current state's actions

<img src='image/temporal_differences.png' width='750px'/>

### What is the Bellman Equation

The Bellman Equation tells us ***what new value to use as the Q-value for the action taken in the previous state***
- Relies on a both the old Q-value for the action taken in the previous state and what has been learned after moving to the next state
- Includes a learning rate parameter (alpha) that defines how quickly Q-values are adjusted
- Invented by Richard Bellman

<img src='image/Bellman_Equation.png' width='750px'/>

### Q-Learing Process

- Step 1: Initialize Q-Table
- Step 2: Choose an Action from the Q-Table for the Current State
- Step 3: Perform the Chosen Action and Transition to the Next State
- Step 4: Receive Reward and Compute the Temporal Difference
- Step 5: Update Q-Value for Previous State
- Step 6: Repeat Step 2 until Reach a Terminal State

After the terminal state, move the AI agent back to an initial state and start a new episode in which it tries to solve the problem again by using the updated Q-values from the previous episode. 

The agent will often run through 1,000 or more episodes before we are satisfied that it has learned optimal policy for the environment.

# 1. Scenario - Robots in a Warehouse

A growing e-commerce company is building a new warehouse, and the company would like all of the picking operations in the new warehouse to be performed by warehouse robots.

* In the context of e-commerce warehousing, ***“picking” is the task of gathering individual items from various locations in the warehouse*** in order to fulfill customer orders.

After picking items from the shelves, the robots must bring the items to a specific location within the warehouse where the items can be packaged for shipping.

In order ***to ensure maximum efficiency and productivity, the robots will need to learn the shortest path*** between the box and all other locations within the warehouse where the robots are allowed to travel.

Let's use Q-learning to accomplish this task!

In [1]:
#import libraries
import numpy as np

# 2. Define the Environment
The environment consists of **states**, **actions**, and **rewards**. 
- States and actions are inputs for the Q-learning AI agent, while the possible actions are the AI agent's outputs.

#### States
The states in the environment are all of the possible locations within the warehouse. Some of these locations are for storing items (**black squares**), while other locations are aisles that the robot can use to travel throughout the warehouse (**white squares**).

The black squares is **terminal states**!

<img src='image/map.png' width=500/>

The AI agent's goal is to learn the shortest path between the box and all of the other locations in the warehouse where the robot is allowed to travel.

As shown in the image above, there are 121 possible states (locations) within the warehouse. These states are arranged in a grid containing 11 rows and 11 columns. Each location can hence be identified by its row and column index.

In [2]:
#define the shape of the environment (i.e., its states)
environment_rows = 11
environment_columns = 11

# Create a 4D numpy array to hold the current Q-values for each state and action pair: Q(s, a) 
# The array contains 11 rows and 11 columns (to match the shape of the environment), as well as a third "action" dimension.
# The "action" dimension consists of 4 layers that will allow us to keep track of the Q-values for each possible action in
# each state (see next cell for a description of possible actions). 
# The value of each (state, action) pair is initialized to 0.


In [3]:
# define q_values table
index = environment_rows*environment_columns

q_values = np.zeros((index, environment_rows, environment_columns, 4))      # 4: nunber of possible actions (up, right, down, left)

q_values.shape                                                              

(121, 11, 11, 4)

In [4]:
index_table = {}     # each square in the map represent single index number

count_index = 1
    
for row_index in range(0, environment_rows):
        
    for column_index in range(0, environment_columns):
    
        index_table[row_index, column_index] = count_index
        
        count_index += 1
        
index_table

{(0, 0): 1,
 (0, 1): 2,
 (0, 2): 3,
 (0, 3): 4,
 (0, 4): 5,
 (0, 5): 6,
 (0, 6): 7,
 (0, 7): 8,
 (0, 8): 9,
 (0, 9): 10,
 (0, 10): 11,
 (1, 0): 12,
 (1, 1): 13,
 (1, 2): 14,
 (1, 3): 15,
 (1, 4): 16,
 (1, 5): 17,
 (1, 6): 18,
 (1, 7): 19,
 (1, 8): 20,
 (1, 9): 21,
 (1, 10): 22,
 (2, 0): 23,
 (2, 1): 24,
 (2, 2): 25,
 (2, 3): 26,
 (2, 4): 27,
 (2, 5): 28,
 (2, 6): 29,
 (2, 7): 30,
 (2, 8): 31,
 (2, 9): 32,
 (2, 10): 33,
 (3, 0): 34,
 (3, 1): 35,
 (3, 2): 36,
 (3, 3): 37,
 (3, 4): 38,
 (3, 5): 39,
 (3, 6): 40,
 (3, 7): 41,
 (3, 8): 42,
 (3, 9): 43,
 (3, 10): 44,
 (4, 0): 45,
 (4, 1): 46,
 (4, 2): 47,
 (4, 3): 48,
 (4, 4): 49,
 (4, 5): 50,
 (4, 6): 51,
 (4, 7): 52,
 (4, 8): 53,
 (4, 9): 54,
 (4, 10): 55,
 (5, 0): 56,
 (5, 1): 57,
 (5, 2): 58,
 (5, 3): 59,
 (5, 4): 60,
 (5, 5): 61,
 (5, 6): 62,
 (5, 7): 63,
 (5, 8): 64,
 (5, 9): 65,
 (5, 10): 66,
 (6, 0): 67,
 (6, 1): 68,
 (6, 2): 69,
 (6, 3): 70,
 (6, 4): 71,
 (6, 5): 72,
 (6, 6): 73,
 (6, 7): 74,
 (6, 8): 75,
 (6, 9): 76,
 (6, 10): 77,
 

#### Actions
The actions that are available to the AI agent are to move the robot in one of four directions:
* Up
* Right
* Down
* Left

Obviously, the AI agent must learn to avoid driving into the item storage locations (e.g., shelves)!

In [5]:
# define actions
actions = ['up', 'right', 'down', 'left']            # numeric action codes: 0 = up, 1 = right, 2 = down, 3 = left

#### Rewards
The last component of the environment that we need to define are the **rewards**. 

To help the AI agent learn, each state (location) in the warehouse is assigned a reward value.

The agent may begin at any white square, but its goal is always the same: ***to maximize its total rewards***!

Negative rewards (i.e., **punishments**) are used for all states except the goal.
* This encourages the AI to identify the *shortest path* to the goal by *minimizing its punishments*!

To maximize its cumulative rewards (by minimizing its cumulative punishments), the AI agent will need find the shortest paths between the box and all of the other locations in the warehouse where the robot is allowed to travel (white squares). The agent will also need to learn to avoid crashing into any of the item storage locations (black squares)!

In [6]:
# Create a 2D numpy array to hold the rewards for each state. 
revised_rewards = np.full((environment_rows, environment_columns), -100.)  # The array contains 11 rows and 11 columns (to match the shape of the environment), and each value is initialized to -100.

#rewards[0, 5] = 100.                                               # set the reward for the packaging area (i.e., the goal) to 100

In [7]:
# define aisle locations (i.e., white squares) for rows 1 through 9
aisles = {}                                                        # store locations in a dictionary
aisles[1] = [i for i in range(1, 10)]
aisles[2] = [1, 7, 9]
aisles[3] = [i for i in range(1, 8)]
aisles[3].append(9)
aisles[4] = [3, 7]
aisles[5] = [i for i in range(11)]
aisles[6] = [5]
aisles[7] = [i for i in range(1, 10)]
aisles[8] = [3, 7]
aisles[9] = [i for i in range(11)]

# set the rewards for all aisle locations (i.e., white squares)
for row_index in range(1, 10):
    
    for column_index in aisles[row_index]:
        
        revised_rewards[row_index, column_index] = -1.
  

for row in revised_rewards:                                                # print rewards matrix
    
    print(row)

[-100. -100. -100. -100. -100. -100. -100. -100. -100. -100. -100.]
[-100.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1. -100.]
[-100.   -1. -100. -100. -100. -100. -100.   -1. -100.   -1. -100.]
[-100.   -1.   -1.   -1.   -1.   -1.   -1.   -1. -100.   -1. -100.]
[-100. -100. -100.   -1. -100. -100. -100.   -1. -100. -100. -100.]
[-1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
[-100. -100. -100. -100. -100.   -1. -100. -100. -100. -100. -100.]
[-100.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1. -100.]
[-100. -100. -100.   -1. -100. -100. -100.   -1. -100. -100. -100.]
[-1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1.]
[-100. -100. -100. -100. -100. -100. -100. -100. -100. -100. -100.]


In [34]:
# define function the reset the map as initial state
def map_reset(box_row_index, box_column_index):
    
    revised_rewards = np.full((environment_rows, environment_columns), -100.)  # The array contains 11 rows and 11 columns (to match the shape of the environment), and each value is initialized to -100.

    # define aisle locations (i.e., white squares) for rows 1 through 9
    aisles = {}                                                                # store locations in a dictionary
    aisles[1] = [i for i in range(1, 10)]
    aisles[2] = [1, 7, 9]
    aisles[3] = [i for i in range(1, 8)]
    aisles[3].append(9)
    aisles[4] = [3, 7]
    aisles[5] = [i for i in range(11)]
    aisles[6] = [5]
    aisles[7] = [i for i in range(1, 10)]
    aisles[8] = [3, 7]
    aisles[9] = [i for i in range(11)]

    # set the rewards for all aisle locations (i.e., white squares)
    for row_index in range(1, 10):

        for column_index in aisles[row_index]:

            revised_rewards[row_index, column_index] = -1.
            
    #box_starting_location()
            
    revised_rewards[box_row_index, box_column_index] = 1000             # set box_starting location with reward +100
            
    #for row in rewards:                                                       # print rewards matrix
    
    #    print(row)
        
    return revised_rewards

# 3. Train the Model
Our next task is for our AI agent to learn about its environment by implementing a Q-learning model. The learning process will follow these steps:

***Part 1***
1. Initialize ***single agent_A and one box need to be pick up starting in random location*** 
2. Choose a random, non-terminal state (white square) for ***the box need to be pick up to begin in this new episode with reward +1000***.
3. Choose an action (move *up*, *right*, *down*, or *left*) for that agent moving to pick up the box. Actions will be chosen using an *epsilon greedy algorithm*. This algorithm will usually choose the most promising action for the AI agent, but it will occasionally choose a less promising option in order to encourage the agent to explore the environment.
4. Perform the chosen action, and transition to the next state (i.e., move to the next loation).
5. Receive the reward for moving to the new state, and calculate the temporal difference.
6. Update the Q-value for the previous state and action pair.
7. If the new (current) state is a terminal state, go to #1. Else, go to #3.

This entire process will be repeated across 1 million episodes. This will provide the AI agent sufficient opportunity to learn the shortest paths between the box location and all other locations in the warehouse where the robot is allowed to travel, while simultaneously avoiding crashing into any of the item storage locations!

In [9]:
# define a function that determines if the specified location is a terminal state
def is_terminal_state(row_index, column_index):
    
    if revised_rewards[row_index, column_index] == -1.:  # if the reward for this location is -1, then it is
                                                                 # not a terminal state (i.e., it is a 'white square')
        
        return False

    else:
        
        return True

In [10]:
# define a function that will put a box in a random, non-terminal starting location 
def box_starting_location():
    
    box_row_index = np.random.randint(environment_rows)        # get a random row index for the box
    
    box_column_index = np.random.randint(environment_columns)  # get a random column index for the box
    
    while is_terminal_state(box_row_index, box_column_index):  # check if the box located in non-terminal starting location
        
        box_row_index = np.random.randint(environment_rows)        # get a random row index for the box
    
        box_column_index = np.random.randint(environment_columns)  # get a random column index for the box
        
    return box_row_index, box_column_index  

In [11]:
# define a function that will put a box in a random, non-terminal starting location 
def agent_A_starting_location():
    
    agent_A_row_index = np.random.randint(environment_rows)        # get a random row index for the box
    
    agent_A_column_index = np.random.randint(environment_columns)  # get a random column index for the box
    
    while is_terminal_state(agent_A_row_index, agent_A_column_index):  # check if the box located in non-terminal starting location
        
        agent_A_row_index = np.random.randint(environment_rows)        # get a random row index for the box
    
        agent_A_column_index = np.random.randint(environment_columns)  # get a random column index for the box
        
    return agent_A_row_index, agent_A_column_index  

In [12]:
# define an epsilon greedy algorithm that agent_A will choose which action to take next (i.e., where to move next)
def agent_A_get_next_action(agent_A_row_index, agent_A_column_index, epsilon):
    
    if np.random.random() < epsilon:                     # if a randomly chosen value between 0 and 1 is less than epsilon, 
                                                          # then choose the most promising value from the Q-table for this state.

        return np.argmax(q_values[agent_A_row_index, agent_A_column_index])
    
    else:                                                # choose a random action
        
        return np.random.randint(4)

In [13]:
#define a function that agent_A will get the next location based on the chosen action
def agent_A_get_next_location(agent_A_row_index, agent_A_column_index, action_index):
    
    agent_A_new_row_index = agent_A_row_index
    
    agent_A_column_index = agent_A_column_index
    
    if actions[action_index] == 'up' and agent_A_row_index > 0:
        
        agent_A_new_row_index -= 1
        
    elif actions[action_index] == 'right' and agent_A_column_index < environment_columns - 1:
        
        agent_A_column_index += 1
        
    elif actions[action_index] == 'down' and agent_A_row_index < environment_rows - 1:
        
        agent_A_new_row_index += 1
        
    elif actions[action_index] == 'left' and agent_A_column_index > 0:
        
        agent_A_column_index -= 1
        
    return agent_A_new_row_index, agent_A_column_index

### Train the AI Agent using Q-Learning
- 1. get_starting_location
- 2. get_next_action
- 3. store the old row and column indexes
- 4. get_next_location
- 5. received reward for moving to the new state
- 6. calculate the old q_value
- 7. calculate the temporal difference
- 8. calculate the new q_value
- 9. update the q_value

In [14]:
%%time
# define training parameters
epsilon = 0.9                       # the percentage of time when we should take the best action (instead of a random action)
discount_factor = 0.9               # discount factor for future rewards
learning_rate = 0.9                 # the rate at which the AI agent should learn

counter = 0

box_starting_location_dict = np.zeros((index, environment_rows, environment_columns))      # 4: nunber of possible actions (up, right, down, left)

agent_A_starting_location_dict = np.zeros((index, environment_rows, environment_columns))          

for episode in range(1000000):      # run through 1 million training episodes
    
    box_row_index, box_column_index = box_starting_location()                 # get the starting location for this episode
    
    agent_A_row_index, agent_A_column_index = agent_A_starting_location()
    
    if not (box_row_index, box_column_index) == (agent_A_row_index, agent_A_column_index):
        
        # find out which index of q_values table
        index = index_table[box_row_index, box_column_index]
    
        # record box and agent_A starting location
        box_starting_location_dict[index, box_row_index, box_column_index] += 1

        agent_A_starting_location_dict[index, agent_A_row_index, agent_A_column_index] += 1


        # find the suitable q_value table
        #q_values_table = find_q_values_table(box_row_index, box_column_index, agent_A_row_index, agent_A_column_index)

        



        # update rewards +100 in box location
        revised_rewards = map_reset(box_row_index, box_column_index)
        
        ###print(revised_rewards)

        #print(box_row_index, box_column_index, agent_A_row_index, agent_A_column_index)


        # continue taking actions (i.e., moving) until we reach a terminal state (i.e., until we reach the item packaging area or crash into an item storage location)    
        while not (is_terminal_state(box_row_index, box_column_index) and is_terminal_state(agent_A_row_index, agent_A_column_index)):

            ###print('agent_A_row_index, agent_A_column_index :', agent_A_row_index, agent_A_column_index)

            action_index = agent_A_get_next_action(agent_A_row_index, agent_A_column_index, epsilon) # choose which action to take (i.e., where to move next)

            ###print('action_index :', actions[action_index])


            # perform the chosen action, and transition to the next state (i.e., move to the next location)
            agent_A_old_row_index, agent_A_old_column_index = agent_A_row_index, agent_A_column_index     # store the old row and column indexes

            ###print('agent_A_old_row_index, agent_A_old_column_index :', agent_A_old_row_index, agent_A_old_column_index)


            agent_A_row_index, agent_A_column_index = agent_A_get_next_location(agent_A_old_row_index, agent_A_old_column_index, action_index)

            ###print('agent_A_row_index, agent_A_column_index :', agent_A_row_index, agent_A_column_index)


            # receive the reward for moving to the new state, and calculate the temporal difference
            rewards = revised_rewards[agent_A_row_index, agent_A_column_index]

            ###print('revised_rewards :', rewards)

            # find out which index of q_values table
            #index = index_table[box_row_index, box_column_index]

            old_q_value = q_values[index, agent_A_old_row_index, agent_A_old_column_index, action_index]

            ###print('old_q_value :', old_q_value)

            temporal_difference = rewards + (discount_factor * np.max(q_values[index, agent_A_row_index, agent_A_column_index])) - old_q_value

            # update the Q-value for the previous state and action pair
            new_q_value = old_q_value + (learning_rate * temporal_difference)

            q_values[index, agent_A_old_row_index, agent_A_old_column_index, action_index] = new_q_value
            
            ###print(index)

            ###print(new_q_value)
            
            counter += 1
            
            if counter % 100000 == 0:
                
                print('Training complete :', counter)
        
print('Training complete :', counter)

Training complete : 100000
Training complete : 200000
Training complete : 300000
Training complete : 400000
Training complete : 500000
Training complete : 600000
Training complete : 700000
Training complete : 800000
Training complete : 900000
Training complete : 1000000
Training complete : 1100000
Training complete : 1200000
Training complete : 1300000
Training complete : 1400000
Training complete : 1500000
Training complete : 1516395
CPU times: user 1min 9s, sys: 3.48 s, total: 1min 12s
Wall time: 1min 10s


# 4. Get Shortest Paths
Now that the AI agent has been fully trained, we can see what it has learned by displaying the shortest path between any location in the warehouse where the robot is allowed to travel and the item packaging area.

<img src='image/map.png' width=500/>

Run the code cell below to try a few different starting locations for both the box and agent!

In [56]:
# Define a function that agent_A will get the shortest path between any location within the warehouse that the robot is allowed to travel and the item packaging location.
def agent_A_get_shortest_path(agent_A_row_index, agent_A_column_index, box_row_index, box_column_index):
    
    if is_terminal_state(box_row_index, box_column_index):   # return immediately if this is an invalid starting location
        
        print('Box Start in Terminal State')
        
        return []
    
    if is_terminal_state(agent_A_row_index, agent_A_column_index):   # return immediately if this is an invalid starting location
        
        print('Agent Start in Terminal State')
        
        return []
    
    else:                                                        # if this is a 'legal' starting location
        
        revised_rewards = map_reset(box_row_index, box_column_index)
        
        print('revised_rewards:',revised_rewards)
        
        current_row_index, current_column_index = agent_A_row_index, agent_A_column_index
        
        print('current_row_index, current_column_index :', current_row_index, current_column_index)
        
        shortest_path = []
        
        shortest_path.append([current_row_index, current_column_index])
        
        index = index_table[box_row_index, box_column_index]
        
        print('index :', index)
        
        while (revised_rewards[current_row_index, current_column_index] == -1):            # continue moving along the path until we reach the goal (i.e., the item packaging location)

            #action_index = agent_A_get_next_action(agent_A_row_index, agent_A_column_index, 1.)  # get the best action to take

            action_index = np.argmax(q_values[index , current_row_index, current_column_index])
            
            print('action_index :', action_index)
            
            current_row_index, current_column_index = agent_A_get_next_location(current_row_index, current_column_index, action_index)  # move to the next location on the path, and add the new location to the list
            
            shortest_path.append([current_row_index, current_column_index])
            
            print('current:',current_row_index, current_column_index)
            
            print('is terminal:',is_terminal_state(current_row_index, current_column_index))
            
        return shortest_path

In [76]:
# display a few shortest paths
print(agent_A_get_shortest_path(5, 0, 5,10))   

revised_rewards: [[-100. -100. -100. -100. -100. -100. -100. -100. -100. -100. -100.]
 [-100.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1. -100.]
 [-100.   -1. -100. -100. -100. -100. -100.   -1. -100.   -1. -100.]
 [-100.   -1.   -1.   -1.   -1.   -1.   -1.   -1. -100.   -1. -100.]
 [-100. -100. -100.   -1. -100. -100. -100.   -1. -100. -100. -100.]
 [  -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1. 1000.]
 [-100. -100. -100. -100. -100.   -1. -100. -100. -100. -100. -100.]
 [-100.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1. -100.]
 [-100. -100. -100.   -1. -100. -100. -100.   -1. -100. -100. -100.]
 [  -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.]
 [-100. -100. -100. -100. -100. -100. -100. -100. -100. -100. -100.]]
current_row_index, current_column_index : 5 0
index : 66
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
cu

is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
a

action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
cur

is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
a

action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
cur

current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is term

is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
a

current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is term

is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
a

current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is term

action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
cur

action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
cur

is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
a

is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
a

is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
a

current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is term

is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
action_index : 3
current: 5 0
is terminal: False
a

KeyboardInterrupt: 

In [72]:
#display a few shortest paths
print(agent_A_get_shortest_path(1, 1, 9, 1))   

revised_rewards: [[-100. -100. -100. -100. -100. -100. -100. -100. -100. -100. -100.]
 [-100.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1. -100.]
 [-100.   -1. -100. -100. -100. -100. -100.   -1. -100.   -1. -100.]
 [-100.   -1.   -1.   -1.   -1.   -1.   -1.   -1. -100.   -1. -100.]
 [-100. -100. -100.   -1. -100. -100. -100.   -1. -100. -100. -100.]
 [  -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.]
 [-100. -100. -100. -100. -100.   -1. -100. -100. -100. -100. -100.]
 [-100.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1. -100.]
 [-100. -100. -100.   -1. -100. -100. -100.   -1. -100. -100. -100.]
 [  -1. 1000.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.   -1.]
 [-100. -100. -100. -100. -100. -100. -100. -100. -100. -100. -100.]]
current_row_index, current_column_index : 1 1
index : 101
action_index : 1
current: 1 2
is terminal: False
action_index : 1
current: 1 3
is terminal: False
action_index : 1
current: 1 4
is terminal: False
action_index : 1
c

***End of Page***