<a href="https://colab.research.google.com/github/SteviDunn/Q-Learning-Pressure-Data/blob/main/AOA_3D_Matrix__N_Iteration_Dunn_Copy_of_Notebook_for_Topic_08_Video_Q_Learning_A_Complete_Example_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Scenario - Robots in a Warehouse
A growing e-commerce company is building a new warehouse, and the company would like all of the picking operations in the new warehouse to be performed by warehouse robots.
* In the context of e-commerce warehousing, “picking” is the task of gathering individual items from various locations in the warehouse in order to fulfill customer orders.

After picking items from the shelves, the robots must bring the items to a specific location within the warehouse where the items can be packaged for shipping.

In order to ensure maximum efficiency and productivity, the robots will need to learn the shortest path between the item packaging area and all other locations within the warehouse where the robots are allowed to travel.
* We will use Q-learning to accomplish this task!

#### Import Required Libraries

In [None]:
#import libraries
import numpy as np

## Define the Environment
The environment consists of **states**, **actions**, and **rewards**. States and rewards are inputs for the Q-learning AI agent, while the possible actions are the AI agent's outputs.
#### States
The states in the environment are all of the possible locations within the warehouse. Some of these locations are for storing items (**black squares**), while other locations are aisles that the robot can use to travel throughout the warehouse (**white squares**). The **green square** indicates the item packaging and shipping area.

The black and green squares are **terminal states**!

![warehouse map](https://www.danielsoper.com/teaching/img/08-warehouse-map.png)

The AI agent's goal is to learn the shortest path between the item packaging area and all of the other locations in the warehouse where the robot is allowed to travel.

As shown in the image above, there are 121 possible states (locations) within the warehouse. These states are arranged in a grid containing 11 rows and 11 columns. Each location can hence be identified by its row and column index.

In [None]:
# Define the shape of the 4D environment (i.e., its states)
environment_x = 4  # Number of rows (vertical axis)
environment_y = 8  # Number of columns (horizontal axis)
environment_z = 3  # Number of layers (depth)
environment_aoa = 3  # Number of angle of attack levels

# Define the number of possible actions: 6 moves along x, y, and z, 'stay', and 2 AOA changes
num_actions = 9

# Create a 5D numpy array to hold the Q-values for each state-action pair: Q(s, a)
# The first four axes ('environment_x', 'environment_y', 'environment_z', 'environment_aoa')
# match the 4D shape of the environment; the last axis indexes the action.
q_values = np.zeros((environment_x, environment_y, environment_z, environment_aoa, num_actions))

# The q_values array is now a 5D array with dimensions (4, 8, 3, 3, 9),
# representing the 4D state space (including AOA) and the action space.


#### Actions
The actions that are available to the AI agent are to move the robot in one of four directions:
* Up
* Right
* Down
* Left

Obviously, the AI agent must learn to avoid driving into the item storage locations (e.g., shelves)!


In [None]:
# Define actions
# Numeric action codes:
# 0 = move up z, 1 = move down z, 2 = move right x, 3 = move left x,
# 4 = move forward y, 5 = move backward y, 6 = stay,
# 7 = increase AOA, 8 = decrease AOA
actions = [
    'up_z', 'down_z', 'right_x', 'left_x',
    'forward_y', 'backward_y', 'stay',
    'increase_aoa', 'decrease_aoa'
]

# Now, there are 9 possible actions, accounting for the new AOA dimension.
num_actions = len(actions)  # This will be 9


#### Rewards
The last component of the environment that we need to define is the **rewards**.

To help the AI agent learn, each state (location) in the warehouse is assigned a reward value.

The agent may begin at any white square, but its goal is always the same: ***to maximize its total rewards***!

Negative rewards (i.e., **punishments**) are used for all states except the goal.
* This encourages the AI to identify the *shortest path* to the goal by *minimizing its punishments*!

![warehouse map](https://www.danielsoper.com/teaching/img/08-warehouse-map-rewards.png)

To maximize its cumulative rewards (by minimizing its cumulative punishments), the AI agent will need to find the shortest paths between the item packaging area (green square) and all of the other locations in the warehouse where the robot is allowed to travel (white squares). The agent will also need to learn to avoid crashing into any of the item storage locations (black squares)!

In [None]:
import numpy as np
# Data taken from L/D true stats 0 (not 1)
# One 3x4x8 L/D matrix per angle of attack (AOA); each matrix is indexed as [z, x, y]
ld_ratios_aoa1 = np.array([
    [
        [-0.560847533, -0.549910806, -0.274771177, 0.17906459, 0.508083076, 0.479364257, 0.631179823, 0.544423392],
        [-1.896456314, -1.128275428, -0.379404476, 0.343553674, 1.196006183, 0.69787821, 0.56373206, 0.592049292],
        [-0.672843077, -0.522756134, 0.112989564, 0.660673904, 0.906478581, 0.87273689, 0.840761812, 0.811992305],
        [-0.280506088, 0.041536102, 0.703439779, 0.829640412, 0.94494181, 0.956008818, 0.831228384, 0.784962628]
    ],
    [
        [-0.073252714, -0.024258678, 0.391823059, 0.717513434, 0.913279534, 0.898604341, 0.856517598, 0.848944202],
        [-1.230233893, -0.472085117, 0.256742042, 0.879774892, 1.537068472, 1.015542931, 0.933517792, 0.862737841],
        [-0.482303262, -0.387791443, 0.399816267, 0.914603997, 1.103226727, 1.026804074, 0.904824902, 0.861630871],
        [-0.133467972, -0.149268045, 0.37746769, 0.633622619, 0.765301144, 0.68454384, 0.681714189, 0.590227279]
    ],
    [
        [1.68008597, 2.014682889, 1.979787083, 4.06447641, 5.058449713, 4.855977718, 4.769882173, 4.701115418],
        [-0.26305801, 1.079966811, 3.261120016, 5.427790842, 7.714367372, 5.238375795, 5.16754281, 4.82601987],
        [0.679493776, 1.727854201, 2.708559173, 5.014285095, 5.613175464, 5.192517442, 5.140730408, 4.778637736],
        [2.126749894, 2.128782325, 3.435657384, 4.349849545, 4.211052237, 4.532229962, 4.341667419, 4.280858792]
    ]
])

ld_ratios_aoa2 = np.array([
    [
        [-0.85379, -0.84453, -0.56022, -0.06628, 0.37986, 0.41998, 0.41249, 0.4571],
        [-1.87015, -1.20318, -0.26315, 0.39479, 1.53792, 0.58283, 0.48226, 0.43149],
        [-0.75481, -0.9494, 0.13764, 0.59038, 0.9183, 0.80119, 0.62282, 0.54975],
        [-0.61532, -0.42584, -0.06139, 0.36071, 0.65771, 0.65834, 0.60638, 0.50963]
    ],
    [
        [-0.68866, -0.74915, -0.47427, 0.05647, 0.46502, 0.40518, 0.32778, 0.32082],
        [-1.61831, -0.86785, 0.18366, 0.47784, 1.417, 0.66128, 0.48675, 0.32428],
        [-0.82634, -0.51604, 0.14678, 0.41475, 0.77826, 0.57504, 0.45781, 0.35237],
        [-0.53136, -0.5462, -0.32186, 0.28728, 0.48974, 0.43762, 0.41432, 0.32416]
    ],
    [
        [1.62563, 1.47, 1.68325, 2.86456, 4.24603, 4.21611, 4.33648, 4.35567],
        [-0.07066, 1.10294, 3.67149, 5.63637, 9.06201, 5.33921, 4.6232, 4.58389],
        [1.09722, 2.03986, 2.77107, 4.52965, 5.68317, 5.35768, 4.73271, 4.28755],
        [2.58787, 3.55767, 4.26496, 4.42522, 4.75202, 4.9988, 4.65062, 4.34528]
    ]
])

ld_ratios_aoa3 = np.array([
    [
        [-0.2419, -0.15264, 0.00376, 0.56413, -0.40147, -0.42427, -0.44084, 0.77374],
        [-1.27657, -0.74033, 0.04363, -0.73607, -0.02309, -0.36898, -0.46431, -0.49032],
        [-0.15158, -0.30351, 0.08621, 0.6776, -0.36884, 0.53997, 0.8615, 0.86059],
        [-0.06312, -0.01893, 0.52361, 0.77201, 0.89356, 0.84575, 0.86792, 0.79792]
    ],
    [
        [-0.04231, -0.07856, -0.00763, 0.49826, -0.47588, -0.46418, 0.71219, -0.51256],
        [-0.83529, -0.60825, -0.13244, -0.78788, -0.17724, -0.41714, -0.49839, 0.23287],
        [-0.23809, -0.29614, -0.08143, 0.44881, -0.44097, -0.43293, 0.61755, -0.53545],
        [-0.07924, -0.15566, 0.17938, 0.4737, 0.65039, 0.6634, -0.54116, -0.53267]
    ],
    [
        [1.88072, 1.86878, 2.81673, 3.58065, 4.34279, 4.1078, 3.00279, 4.28956],
        [0.61285, 1.37992, 2.8145, 3.98211, 7.11544, 4.99674, 4.46355, 4.2167],
        [1.48737, 2.55674, 2.93813, 0.94013, 1.15169, 1.0747, 1.0509, 4.09221],
        [2.8802, 3.38346, 3.58961, 4.43283, 1.31512, 5.05872, 5.00518, 4.6079]
    ]
])

# Combine the matrices along a new trailing axis for AOA, so rewards is indexed as [z, x, y, aoa]
rewards = np.stack([ld_ratios_aoa1, ld_ratios_aoa2, ld_ratios_aoa3], axis=3)
max_reward = np.max(rewards)

# Confirming the shape of the resulting matrix
rewards.shape
print(f"Largest Reward: {max_reward}")


Largest Reward: 9.06201
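
Because each per-AOA matrix has shape (3, 4, 8) and the AOA axis is added last, the stacked `rewards` array is indexed as `[z, x, y, aoa]`, not `[x, y, z, aoa]` like the Q-table. The sketch below illustrates that axis order using placeholder arrays (the `demo_*` names and their constant values are assumptions for illustration, not the real L/D data):

```python
import numpy as np

# Placeholder arrays standing in for the three per-AOA matrices (not the real L/D data).
# Each has shape (z=3, x=4, y=8), the same layout as ld_ratios_aoa1.
demo_aoa1 = np.full((3, 4, 8), 1.0)
demo_aoa2 = np.full((3, 4, 8), 2.0)
demo_aoa3 = np.full((3, 4, 8), 3.0)

# Stacking along axis=3 appends the AOA axis after (z, x, y).
demo_rewards = np.stack([demo_aoa1, demo_aoa2, demo_aoa3], axis=3)
print(demo_rewards.shape)  # (3, 4, 8, 3)

# Look up a reward: z first, then x, then y, then the AOA index.
z, x, y, aoa = 2, 3, 7, 1
print(demo_rewards[z, x, y, aoa])  # 2.0, i.e. the value from the second AOA matrix
```

Any code that reads `rewards` with a state taken from the Q-table therefore has to reorder the indices from `(x, y, z, aoa)` to `(z, x, y, aoa)`.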


## Train the Model
Our next task is for our AI agent to learn about its environment by implementing a Q-learning model. The learning process will follow these steps:
1. Choose a random, non-terminal state (white square) for the agent to begin this new episode.
2. Choose an action (move *up*, *right*, *down*, or *left*) for the current state. Actions will be chosen using an *epsilon greedy algorithm*. This algorithm will usually choose the most promising action for the AI agent, but it will occasionally choose a less promising option in order to encourage the agent to explore the environment.
3. Perform the chosen action, and transition to the next state (i.e., move to the next location).
4. Receive the reward for moving to the new state, and calculate the temporal difference.
5. Update the Q-value for the previous state and action pair.
6. If the new (current) state is a terminal state, go to #1. Else, go to #2.

This entire process will be repeated across 10,000 episodes of 100 steps each. This will provide the AI agent sufficient opportunity to learn the shortest paths between the item packaging area and all other locations in the warehouse where the robot is allowed to travel, while simultaneously avoiding crashing into any of the item storage locations!
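
Steps 4 and 5 amount to the standard Q-learning update: the temporal difference is `reward + discount_factor * max(Q(next state)) - Q(state, action)`, and the Q-value moves toward it by a fraction `learning_rate`. A small worked example follows; the numbers and the names `alpha` and `gamma` are chosen purely for illustration and do not come from the training data:

```python
# Hypothetical values, chosen only to make the arithmetic easy to follow.
alpha = 0.7   # learning rate
gamma = 0.9   # discount factor

old_q_value = 0.0        # Q(s, a) before the update
reward = -2.0            # reward received on entering the new state
max_next_q_value = 1.0   # largest Q-value available from the new state

# Step 4: temporal difference (how far the old estimate is from the new target)
temporal_difference = reward + gamma * max_next_q_value - old_q_value

# Step 5: move the old estimate toward the target by a fraction alpha
new_q_value = old_q_value + alpha * temporal_difference

print(round(temporal_difference, 4))  # -1.1
print(round(new_q_value, 4))          # -0.77
```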

#### Define Helper Functions

In [None]:
# Get starting location at random in 4D space
def get_starting_location():
    # Get a random x, y, z, and AOA index
    current_x_index = np.random.randint(environment_x)
    current_y_index = np.random.randint(environment_y)
    current_z_index = np.random.randint(environment_z)
    current_aoa_index = np.random.randint(environment_aoa)
    return current_x_index, current_y_index, current_z_index, current_aoa_index

# Define an epsilon greedy algorithm that will choose which action to take next
def get_next_action(current_x_index, current_y_index, current_z_index, current_aoa_index, epsilon):
    # If a randomly chosen value between 0 and 1 is less than epsilon,
    # then choose the most promising value from the Q-table for this state.
    if np.random.random() < epsilon:
        # Choose the action with the highest Q-value for the current state
        return np.argmax(q_values[current_x_index, current_y_index, current_z_index, current_aoa_index])
    else:
        # Choose a random action from the available action space
        return np.random.randint(num_actions)  # Use the total number of actions (9)


# Define a function that will get the next location based on the chosen action
def get_next_location(current_x_index, current_y_index, current_z_index, current_aoa_index, action_index):
    new_x_index = current_x_index
    new_y_index = current_y_index
    new_z_index = current_z_index
    new_aoa_index = current_aoa_index

    if actions[action_index] == 'up_z' and current_z_index > 0:
        new_z_index -= 1
    elif actions[action_index] == 'down_z' and current_z_index < environment_z - 1:
        new_z_index += 1
    elif actions[action_index] == 'right_x' and current_x_index < environment_x - 1:
        new_x_index += 1
    elif actions[action_index] == 'left_x' and current_x_index > 0:
        new_x_index -= 1
    elif actions[action_index] == 'forward_y' and current_y_index < environment_y - 1:
        new_y_index += 1
    elif actions[action_index] == 'backward_y' and current_y_index > 0:
        new_y_index -= 1
    elif actions[action_index] == 'increase_aoa' and current_aoa_index < environment_aoa - 1:
        new_aoa_index += 1
    elif actions[action_index] == 'decrease_aoa' and current_aoa_index > 0:
        new_aoa_index -= 1
    elif actions[action_index] == 'stay':
        pass  # No movement if the action is 'stay'

    return new_x_index, new_y_index, new_z_index, new_aoa_index
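
Note that `get_next_action` treats `epsilon` as the probability of taking the greedy action, whereas many descriptions of epsilon-greedy use `epsilon` for the probability of exploring. The sketch below reproduces this notebook's convention on a single made-up row of Q-values; `choose_action`, `demo_q_row`, and the seeded generator are assumptions introduced for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so repeated runs give the same draws

def choose_action(q_row, epsilon, rng):
    """Exploit with probability epsilon, otherwise pick a uniformly random action."""
    if rng.random() < epsilon:
        return int(np.argmax(q_row))
    return int(rng.integers(len(q_row)))

# A made-up row of 9 Q-values in which action 4 is currently the best.
demo_q_row = np.zeros(9)
demo_q_row[4] = 1.0

choices = [choose_action(demo_q_row, 0.7, rng) for _ in range(20000)]
greedy_fraction = np.mean(np.array(choices) == 4)

# The greedy action is also picked by chance during exploration,
# so the expected fraction is 0.7 + 0.3 / 9, roughly 0.73.
print(greedy_fraction)
```

With `epsilon = 0.70`, roughly 30% of steps are random, which keeps the agent exploring throughout training.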


In [None]:
#define helper functions for the termination condition after convergence of steps

# Initialize Q-table with zeros for 4D environment
def initialize_q_table(environment_x, environment_y, environment_z, environment_aoa, num_actions=9):
    return np.zeros((environment_x, environment_y, environment_z, environment_aoa, num_actions))

# Q-learning update function with temporal difference for 4D state space
def q_learning_update(q_values, state, action, reward, next_state, learning_rate, discount_factor):
    old_q_value = q_values[state[0], state[1], state[2], state[3], action]

    # Calculate the temporal difference (TD error)
    future_reward = np.max(q_values[next_state[0], next_state[1], next_state[2], next_state[3]])  # Max reward for the next state
    temporal_difference = reward + discount_factor * future_reward - old_q_value

    # Update the Q-value with the TD error
    new_q_value = old_q_value + learning_rate * temporal_difference

    # Update the Q-value in the Q-table
    q_values[state[0], state[1], state[2], state[3], action] = new_q_value

    # Return the absolute difference between old and new Q-values for convergence tracking
    return abs(new_q_value - old_q_value)


# Print the converged results helper function for 4D state space
def print_converged_results(q_values, rewards):
    # Find the maximum Q-value
    max_q_value = np.max(q_values)

    # Find the state-action pair corresponding to the highest Q-value
    max_q_value_location = np.unravel_index(np.argmax(q_values), q_values.shape)
    max_state = max_q_value_location[:4]  # extract the (x, y, z, aoa) state
    max_action = max_q_value_location[4]  # extract the action

    # Get the L/D ratio from the rewards grid, which is indexed as [z, x, y, aoa]
    max_ld_ratio = rewards[max_state[2], max_state[0], max_state[1], max_state[3]]

    # Print the highest Q-value and the L/D ratio at the corresponding state
    print(f"\nHighest Q-value (indicating best state for L/D ratio): {max_q_value}")
    print(f"Location (State): {max_state}, Action: {max_action}")
    print(f"L/D Ratio at this location: {max_ld_ratio}")


# Convergence function for 4D state space
def check_convergence(q_value_diff_sum, convergence_threshold, q_values, rewards):
    if q_value_diff_sum < convergence_threshold:
        print("Convergence Achieved!")
        print(f"Q-value Difference Sum: {q_value_diff_sum}")
        print_converged_results(q_values, rewards)


# Find the highest L/D ratio in the rewards grid for 4D environment
def find_highest_ld_ratio(rewards):
    # Find the highest L/D ratio in the rewards grid
    max_ld_ratio = np.max(rewards)
    max_ld_location = np.unravel_index(np.argmax(rewards), rewards.shape)

    # Print the highest L/D ratio and its location
    print(f"Highest L/D Ratio: {max_ld_ratio} at Location: {max_ld_location}")

    return max_ld_ratio, max_ld_location

def extract_policy(q_values):
    # The policy maps each state (x, y, z, aoa) to the best action based on the Q-values
    policy = np.argmax(q_values, axis=4)  # Find the action with the highest Q-value for each state
    return policy
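
To make the output of `extract_policy` concrete, the sketch below applies the same `argmax` over the action axis to a toy Q-table with the notebook's shape. The Q-values are made up, and `demo_q`, `demo_policy`, and `action_names` are illustrative names rather than part of the notebook:

```python
import numpy as np

action_names = ['up_z', 'down_z', 'right_x', 'left_x',
                'forward_y', 'backward_y', 'stay',
                'increase_aoa', 'decrease_aoa']

# Toy Q-table with shape (x=4, y=8, z=3, aoa=3, actions=9); values are made up.
demo_q = np.zeros((4, 8, 3, 3, len(action_names)))
demo_q[0, 0, 0, 0, 4] = 1.5   # make 'forward_y' the best action in state (0, 0, 0, 0)
demo_q[3, 7, 2, 1, 6] = 2.0   # make 'stay' the best action in state (3, 7, 2, 1)

# Taking argmax over the last axis leaves one action index per state.
demo_policy = np.argmax(demo_q, axis=-1)

print(demo_policy.shape)                      # (4, 8, 3, 3)
print(action_names[demo_policy[0, 0, 0, 0]])  # forward_y
print(action_names[demo_policy[3, 7, 2, 1]])  # stay
```

`np.argmax` returns the first index on ties, so any state whose Q-values are all equal (for example, a state never updated during training) maps to action 0, `'up_z'`.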





#### Train the AI Agent using Q-Learning

In [None]:
# Update training parameters
epsilon = 0.70  # the percentage of time when we should take the best action (instead of a random action)
learning_rate = 0.7  # the rate at which the AI agent should learn
discount_factor = 0.9  # discount factor for future rewards

n_iterations = 10000  # the number of training episodes
num_steps = 100  # the number of steps taken in each episode

# Q-value helper function call
q_values = initialize_q_table(environment_x, environment_y, environment_z, environment_aoa)  # Initialize Q-table for 4D space

# Run through 10000 training episodes
for episode in range(n_iterations):
    # Get the starting location for this episode in 4D space
    x_index, y_index, z_index, aoa_index = get_starting_location()

    # Print the starting location to verify it's different each time
    print(f"Episode {episode + 1}: Starting Location - X: {x_index}, Y: {y_index}, Z: {z_index}, AOA: {aoa_index}")

    q_value_diff_sum = 0  # Track the total Q-value convergence for training

    for step in range(num_steps):
        # Choose which action to take using epsilon-greedy
        action_index = get_next_action(x_index, y_index, z_index, aoa_index, epsilon)

        # Perform the action and transition to the next state
        old_x_index, old_y_index, old_z_index, old_aoa_index = x_index, y_index, z_index, aoa_index
        x_index, y_index, z_index, aoa_index = get_next_location(x_index, y_index, z_index, aoa_index, action_index)

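        # Receive the reward for moving to the new state (rewards is indexed as [z, x, y, aoa]).
        # The constant offset of 7.7 makes every reward negative except the two largest L/D values.
        reward = rewards[z_index, x_index, y_index, aoa_index] - 7.7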

        # Perform Q-learning update with helper function
        q_value_diff = q_learning_update(q_values,
                                         (old_x_index, old_y_index, old_z_index, old_aoa_index),
                                         action_index,
                                         reward,
                                         (x_index, y_index, z_index, aoa_index),
                                         learning_rate,
                                         discount_factor)

        # Accumulate the absolute difference for convergence tracking
        q_value_diff_sum += q_value_diff

    # Optionally, check for convergence here after each episode if needed
    if episode % 1000 == 0:  # Print convergence info every 1000 episodes
        print(f"After {episode + 1} episodes, Q-value difference sum: {q_value_diff_sum}")

# Call the function to find the highest L/D ratio in the rewards grid
max_ld_ratio, max_ld_location = find_highest_ld_ratio(rewards)

# Optional: Print a summary of your Q-learning results if needed
#print_converged_results(q_values, rewards)

print()

#extract the policy
policy = extract_policy(q_values)

# Training is complete
print('Training Complete')

# Print the whole Q-value table
# print("Q-values after training:")
# print(q_values)


Streaming output truncated to the last 5000 lines.
Episode 5008: Starting Location - X: 1, Y: 1, Z: 0, AOA: 1
Episode 5009: Starting Location - X: 1, Y: 2, Z: 2, AOA: 2
Episode 5010: Starting Location - X: 1, Y: 3, Z: 0, AOA: 1
Episode 5011: Starting Location - X: 2, Y: 7, Z: 2, AOA: 0
Episode 5012: Starting Location - X: 2, Y: 3, Z: 1, AOA: 0
Episode 5013: Starting Location - X: 1, Y: 4, Z: 0, AOA: 1
Episode 5014: Starting Location - X: 2, Y: 6, Z: 1, AOA: 2
Episode 5015: Starting Location - X: 1, Y: 2, Z: 2, AOA: 1
Episode 5016: Starting Location - X: 2, Y: 0, Z: 0, AOA: 1
Episode 5017: Starting Location - X: 1, Y: 4, Z: 1, AOA: 1
Episode 5018: Starting Location - X: 0, Y: 5, Z: 1, AOA: 2
Episode 5019: Starting Location - X: 2, Y: 5, Z: 2, AOA: 1
Episode 5020: Starting Location - X: 3, Y: 4, Z: 0, AOA: 2
Episode 5021: Starting Location - X: 2, Y: 1, Z: 0, AOA: 0
Episode 5022: Starting Location - X: 2, Y: 7, Z: 1, AOA: 1
Episode 5023: Starting Location - X: 2, Y: 3, Z: 1

In [None]:
# Helper function: Testing the policy on new data
def apply_policy(policy, rewards_new, max_steps=100):
    # Choose a starting location for the new data (random or specified)
    x_index, y_index, z_index, aoa_index = get_starting_location()

    total_reward = 0
    max_ld_ratio = float('-inf')  # Initialize to a very low value to find the maximum
    steps = 0
    # Loop until reaching the maximum number of steps
    while steps < max_steps:
        # Choose the best action based on the policy
        action = policy[x_index, y_index, z_index, aoa_index]

        # Transition to the next state using the chosen action
        x_index, y_index, z_index, aoa_index = get_next_location(x_index, y_index, z_index, aoa_index, action)

        # Collect the reward (L/D ratio) from the new state
        reward = rewards_new[z_index, x_index, y_index, aoa_index]
        total_reward += reward

        # Update the maximum L/D ratio if the current reward is higher
        if reward > max_ld_ratio:
            max_ld_ratio = reward

        # Increment the step count
        steps += 1

    # Return the total reward collected, the number of steps taken, and the max L/D ratio found
    return total_reward, steps, max_ld_ratio


In [None]:
# New data: taken from Excel L/D true stats 1 (not 0)
import numpy as np
# The data is formatted into the same 3x4x8 matrix structure used for ld_ratios_aoa1
ld_ratios_aoa1_new = np.array([
    [
        [-0.449103915, -0.476704828, -0.197835889, 0.485549045, 0.754181376, 0.697325337, 0.707674753, 0.704522477],
        [-1.942295073, -1.087457877, -0.344006028, 0.445383109, 1.27784131, 0.785712036, 0.73605239, 0.629326025],
        [-0.711570416, -0.550548873, 0.072867277, 0.504347759, 0.760418777, 0.732372, 0.721307771, 0.643072318],
        [-0.462149214, -0.062391545, 0.500210224, 0.571818582, 0.718347384, 0.636749984, 0.685264857, 0.632480888]
    ],
    [
        [-0.351557131, -0.365309554, -0.013108486, 0.565348282, 0.609791921, 0.589128265, 0.4835965, 0.502810265],
        [-1.502448431, -0.714586576, 0.081132463, 0.543758225, 1.217006464, 0.739692564, 0.53266645, 0.490037523],
        [-0.773900004, -0.572850942, 0.138414821, 0.476295008, 0.700371423, 0.665684194, 0.527901642, 0.424678049],
        [-0.345265935, -0.5156932, -0.043721715, 0.352461351, 0.506919983, 0.48782122, 0.45818463, 0.408908831]
    ],
    [
        [0.728380074, 0.693093596, 1.978689223, 2.610779422, 2.94155075, 3.009631673, 2.87680253, 2.867997189],
        [-1.506864953, -0.124090095, 1.494815707, 2.942580682, 4.763446717, 3.257724414, 3.064342554, 2.954570791],
        [-0.058551016, 0.526905254, 1.384439154, 2.737848846, 3.472699817, 3.408255482, 3.111604709, 2.907793935],
        [0.761653003, 0.841330889, 1.960720736, 2.68338884, 2.995094553, 3.017492757, 2.979645791, 2.728300039]
    ]
])




# Formatting the provided data for AOA 2 into a similar matrix structure
ld_ratios_aoa2_new = np.array([
    [
        [-0.728902438, -0.687459928, -0.716079276, -0.116394361, 0.290754594, 0.265684526, 0.186290962, 0.235915442],
        [-1.690390073, -1.062964859, -0.343809628, 0.291673873, 1.434050365, 0.49132875, 0.272777704, 0.195975723],
        [-0.602373508, -0.901315073, 0.037482677, 0.358408236, 0.629982837, 0.503994306, 0.338269938, 0.297074474],
        [-0.486084915, -0.3632743, -0.173961456, 0.20456086, 0.49085958, 0.439489301, 0.378782697, 0.281871642]
    ],
    [
        [-0.781882689, -0.752830856, -0.704690719, -0.042418902, 0.283006112, 0.116871816, 0.011689997, 0.023881062],
        [-1.542850311, -0.827942662, -0.152190582, 0.226456743, 1.043213562, 0.25481176, 0.115690214, 0.039885584],
        [-0.811811177, -0.547548312, -0.16369801, 0.187718503, 0.411008929, 0.302150869, 0.068256425, 0.036385521],
        [-0.529119352, -0.425220034, -0.231294456, 0.01334563, 0.150675714, 0.140887635, 0.061624851, -0.028894791]
    ],
    [
        [1.21164152, 1.168660253, 1.202499412, 3.066973411, 4.037391437, 3.76517043, 3.513615773, 3.553893539],
        [-0.099640823, 0.810929893, 2.667934629, 4.054394162, 7.848565063, 4.556453664, 3.456825858, 3.405663696],
        [1.018808435, 1.946533942, 2.504762005, 3.475290686, 4.563951418, 3.99064608, 3.70068588, 3.19518864],
        [2.344989606, 3.450319564, 3.466689719, 3.43399609, 3.460875167, 3.574771244, 3.370572031, 2.988145469]
    ]
])

# Formatting the provided data for AOA 3 into a similar matrix structure
ld_ratios_aoa3_new = np.array([
    [
        [-0.370496607, -0.30425536, -0.191295319, 0.420209856, 0.724103277, -0.524621484, -0.509801836, -0.526324808],
        [-1.338289739, -0.834195263, -0.093501296, -0.841419877, -0.072999874, -0.504434403, -0.558810743, -0.59509547],
        [-0.393411604, -0.496004557, -0.082545632, 0.49061867, -0.488215018, -0.48528805, -0.532432537, -0.555095227],
        [-0.070731815, -0.176015894, 0.220982403, 0.526025749, 0.677067443, -0.513097286, -0.523472487, 0.621450167]
    ],
    [
        [-0.211922228, -0.143062199, 0.014541715, 0.461282639, -0.562129516, -0.551629113, -0.596580334, 0.53955199],
        [-1.053438941, -0.604497122, 0.160199659, -0.789874713, -0.097796013, -0.452737949, -0.57768559, -0.607481508],
        [-0.416588445, -0.382010274, 0.050758166, -0.641302334, -0.506230625, -0.515072581, -0.590788019, -0.626345036],
        [-0.039996216, -0.153987945, 0.089160705, 0.458987589, -0.558654358, 0.600517613, -0.60311788, -0.629717487]
    ],
    [
        [3.132597421, 3.408769911, 3.428038088, 1.189432862, 1.578855024, 1.558425844, 1.558823953, 1.556166112],
        [1.501633535, 2.001823038, 3.396848663, 0.993446593, 2.379573585, 1.703913242, 1.63557189, 2.626581121],
        [2.74768833, 2.908571897, 3.81389174, 1.357878598, 1.635146417, 1.618708741, 5.588516984, 1.34104996],
        [3.7399687, 3.926933468, 4.340883038, 5.14995458, 1.575843881, 1.548489528, 1.575424976, 5.156511936]
    ]
])

# Combining the matrices along a new dimension for AOA
rewards_new = np.stack([ld_ratios_aoa1_new, ld_ratios_aoa2_new, ld_ratios_aoa3_new], axis=3)
max_reward_new = np.max(rewards_new)

# Confirming the shape of the resulting matrix
rewards_new.shape
print(f"Largest Reward: {max_reward_new}")


Largest Reward: 7.848565063


In [None]:
# Evaluate the learned policy on the new data (no further training takes place here)
total_reward, steps, max_ld_ratio = apply_policy(policy, rewards_new)
print(f"Total Reward Collected: {total_reward}")
print(f"Number of Steps Taken: {steps}")
print(f"Maximum L/D Ratio Encountered: {max_ld_ratio}")


Total Reward Collected: 752.8823828130014
Number of Steps Taken: 100
Maximum L/D Ratio Encountered: 7.848565063


## Get Shortest Paths
Now that the AI agent has been fully trained, we can see what it has learned by displaying the shortest path between any location in the warehouse where the robot is allowed to travel and the item packaging area.

![warehouse map](https://www.danielsoper.com/teaching/img/08-warehouse-map.png)

Run the code cell below to try a few different starting locations!

In [None]:
#display a few shortest paths
print(get_shortest_path(3, 9)) #starting at row 3, column 9
print(get_shortest_path(5, 0)) #starting at row 5, column 0
print(get_shortest_path(9, 5)) #starting at row 9, column 5
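
`get_shortest_path` is not defined in the cells above, and its two-argument (row, column) form does not match the 4D state used during training. A comparable idea for the 4D environment is to follow the learned policy greedily from a starting state until it stops moving. The sketch below is one possible way to do that, not the notebook's own method: `get_greedy_path`, `step`, `DELTAS`, and the demo policy are hypothetical, and `step` mirrors the boundary handling of `get_next_location`.

```python
import numpy as np

ACTIONS = ['up_z', 'down_z', 'right_x', 'left_x',
           'forward_y', 'backward_y', 'stay',
           'increase_aoa', 'decrease_aoa']
SHAPE = (4, 8, 3, 3)  # state space sizes in (x, y, z, aoa) order

# Change in (x, y, z, aoa) for each action, mirroring get_next_location.
DELTAS = {
    'up_z': (0, 0, -1, 0), 'down_z': (0, 0, 1, 0),
    'right_x': (1, 0, 0, 0), 'left_x': (-1, 0, 0, 0),
    'forward_y': (0, 1, 0, 0), 'backward_y': (0, -1, 0, 0),
    'stay': (0, 0, 0, 0),
    'increase_aoa': (0, 0, 0, 1), 'decrease_aoa': (0, 0, 0, -1),
}

def step(state, action_index):
    """Apply an action; a move that would leave the grid keeps the state unchanged."""
    delta = DELTAS[ACTIONS[action_index]]
    candidate = tuple(s + d for s, d in zip(state, delta))
    in_bounds = all(0 <= c < n for c, n in zip(candidate, SHAPE))
    return candidate if in_bounds else state

def get_greedy_path(policy, start, max_steps=50):
    """Follow the policy from start until the state stops changing or max_steps is hit."""
    path = [start]
    state = start
    for _ in range(max_steps):
        next_state = step(state, int(policy[state]))
        if next_state == state:
            break
        path.append(next_state)
        state = next_state
    return path

# Made-up policy that always chooses 'forward_y' (action index 4).
demo_policy = np.full(SHAPE, 4)
demo_path = get_greedy_path(demo_policy, (1, 5, 0, 2))
print(demo_path)  # [(1, 5, 0, 2), (1, 6, 0, 2), (1, 7, 0, 2)]
```

Passing the `policy` array returned by `extract_policy` in place of `demo_policy` would trace the trained agent's route instead. The `max_steps` cap guards against policies that cycle between states.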

#### Finally...
It's great that our robot can automatically take the shortest path from any 'legal' location in the warehouse to the item packaging area. **But what about the opposite scenario?**

Put differently, our robot can currently deliver an item from anywhere in the warehouse ***to*** the packaging area, but after it delivers the item, it will need to travel ***from*** the packaging area to another location in the warehouse to pick up the next item!

Don't worry -- this problem is easily solved simply by ***reversing the order of the shortest path***.

Run the code cell below to see an example:

In [None]:
#display an example of reversed shortest path
path = get_shortest_path(5, 2) #go to row 5, column 2
path.reverse()
print(path)