# Theoritical Overview


# Implementation example

## Prisoner Escape Environment

### Overview
The **Prisoner Escape Environment** is a custom **grid-based** environment where a **prisoner** attempts to escape while avoiding a **guard**. The environment is implemented using **OpenAI Gym** and designed for **reinforcement learning (RL)** training.

---

### Grid Layout
- The environment is a **7x7 grid**.
- The **prisoner (`P`)** starts at a **fixed position** `(0,0)`.
- The **escape point (`E`)** is placed **randomly** in a **specific area** `(2,2)` to `(5,5)`.
- The **guard (`G`)** is placed **randomly** anywhere on the grid but **cannot** start at the same position as the prisoner or escape point.

[['P' ' ' ' ' ' ' ' ' ' ' '] <br>
[' ' ' ' ' ' ' ' ' ' ' ' '] <br>
[' ' ' ' ' ' ' ' 'E' ' '] <br>
[' ' ' ' ' ' 'G' ' ' ' '] <br>
[' ' ' ' ' ' ' ' ' ' ' '] <br>
[' ' ' ' ' ' ' ' ' ' ' '] <br>
[' ' ' ' ' ' ' ' ' ' ' ']]<br>


---

### Actions
The prisoner has **4 possible actions**:
1. **Move Left** `(0)`: Decrease X position (if not at the left boundary).
2. **Move Right** `(1)`: Increase X position (if not at the right boundary).
3. **Move Up** `(2)`: Decrease Y position (if not at the top boundary).
4. **Move Down** `(3)`: Increase Y position (if not at the bottom boundary).

The **guard does not move**—it remains stationary.

---

### Observations
The observation space consists of **three integer values**:
1. **Prisoner’s Position**: `(x, y)` represented as a **single value** (`x + 7*y`).
2. **Guard’s Position**: `(x, y)` represented similarly.
3. **Escape Point’s Position**: `(x, y)` represented similarly.

Since the grid is `7x7`, each position is encoded as a **single integer between 0 and 48**.

- **Total observation space:** `shape = (3,)`

---

### Rewards
- **+1** if the prisoner **reaches the escape point (`E`)**.
- **-1** if the prisoner is **caught by the guard (`G`)**.
- **0** if the prisoner is still trying to escape.
- The episode **ends** when the prisoner **escapes or gets caught**.

---

### Episode Termination
- The episode **terminates** if:
  - The **prisoner reaches the escape point** (**success**).
  - The **prisoner collides with the guard** (**failure**).
  - The episode reaches **100 timesteps** (**truncation**).

---


### Implementation of custom single environment in gym

In [1]:
import gymnasium as gym
from gymnasium import spaces
from gymnasium.envs.registration import register
from gymnasium.utils import EzPickle, seeding
import random

import numpy as np
import os

In [2]:
class CustomEnvironment(gym.Env):
    """Custom Gym Environment for the prisoner escape task"""
    
    metadata = {
        "name": "custom_escape_environment_v0"
    }

    def __init__(self):
        """Initialize environment parameters"""
        super().__init__()
        
        # Initialize coordinates and other environment variables
        self.escape_y = None
        self.escape_x = None
        self.guard_y = None
        self.guard_x = None
        self.prisoner_y = None
        self.prisoner_x = None
        self.timestep = None
        self.render_mode = None

        # Define action space (4 possible actions)
        self.action_space = spaces.Discrete(4)  # Prisoner has 4 actions

        # Observation space - Prisoner's observation space
        self.observation_space = spaces.Box(low=0, high=48, shape=(3,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        """Reset the environment to a starting state"""


        self.prisoner_x = 0
        self.prisoner_y = 0
        self.escape_x = random.randint(2, 5)
        self.escape_y = random.randint(2, 5)
        
        # Randomly place the guard
        self.guard_x = random.randint(0, 6)
        self.guard_y = random.randint(0, 6)
        
        # Ensure the guard does not spawn at the same location as the escape point or the prisoner
        while (self.guard_x == self.prisoner_x and self.guard_y == self.prisoner_y) or \
              (self.guard_x == self.escape_x and self.guard_y == self.escape_y):
            self.guard_x = random.randint(0, 6)
            self.guard_y = random.randint(0, 6)

        self.timestep = 0

        # Initialize observations for the prisoner
        observations = np.array([
                self.prisoner_x + 7 * self.prisoner_y,
                self.guard_x + 7 * self.guard_y,
                self.escape_x + 7 * self.escape_y]
            )
        infos = {}

        return observations, infos

    def step(self, action):
        """Take a step in the environment"""
        # Execute prisoner action
        if action == 0 and self.prisoner_x > 0:
            self.prisoner_x -= 1
        elif action == 1 and self.prisoner_x < 6:
            self.prisoner_x += 1
        elif action == 2 and self.prisoner_y > 0:
            self.prisoner_y -= 1
        elif action == 3 and self.prisoner_y < 6:
            self.prisoner_y += 1

        # Check if the prisoner has collided with the guard
        terminated = False
        reward = 0
        if self.prisoner_x == self.guard_x and self.prisoner_y == self.guard_y:
            # Prisoner caught by the guard
            reward = -1
            terminated = True
        elif self.prisoner_x == self.escape_x and self.prisoner_y == self.escape_y:
            # Prisoner escaped successfully
            reward = 1
            terminated = True

        # Check truncation conditions (max steps)
        truncation = False
        if self.timestep > 100:
            reward = 0
            truncation = True

        self.timestep += 1

        # Get the new observations
        observations = np.array([
                self.prisoner_x + 7 * self.prisoner_y,
                self.guard_x + 7 * self.guard_y,
                self.escape_x + 7 * self.escape_y])

        return observations, reward, terminated, truncation, {}

    def render(self):
        """Render the current state of the environment"""
        grid = np.full((7, 7), " ")
        grid[self.prisoner_y, self.prisoner_x] = "P"
        grid[self.guard_y, self.guard_x] = "G"
        grid[self.escape_y, self.escape_x] = "E"
        print(f"{grid} \n")



### Training using Ray-RLlib


In [3]:
import ray
from ray import air, tune
from ray.rllib.env.wrappers.pettingzoo_env import ParallelPettingZooEnv
from ray.tune.logger import UnifiedLogger
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.utils.test_utils import (
    add_rllib_example_script_args,
    run_rllib_example_script_experiment,
)
from ray.tune.registry import get_trainable_cls, register_env
from ray.tune.logger import JsonLoggerCallback, CSVLoggerCallback, TBXLoggerCallback
from ray.rllib.algorithms.callbacks import DefaultCallbacks

2025-02-06 13:15:14,805	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.
2025-02-06 13:15:15,446	INFO util.py:154 -- Missing packages: ['ipywidgets']. Run `pip install -U ipywidgets`, then restart the notebook server for rich notebook output.


In [4]:
single_agent_env = CustomEnvironment()


# Register the single-agent environment
register_env(
    "SingleAgentEnvironment",
    lambda _: single_agent_env
)

In [5]:
ppo_config_single = (
    PPOConfig()
    .environment("SingleAgentEnvironment")  # Use the registered environment name
    .env_runners(
        num_env_runners=1,  # Single environment
        num_envs_per_env_runner=1,
        batch_mode="complete_episodes"
    )
    .framework("torch")  # Use PyTorch
)

In [6]:
# Initialize Ray
ray.init()

# Define the stop criteria and logging settings
stop_criteria = {
    'training_iteration': 5  # Adjust as needed
}




2025-02-06 13:15:31,316	INFO worker.py:1816 -- Started a local Ray instance.


In [7]:
# Define the logging directory
log_dir = f"./tensorboard_logs_rlclass/single_example"
storage_path="file://" + os.path.abspath(log_dir)

In [8]:
# Create a Tuner for training
tuner = tune.Tuner(
    "PPO",
    param_space=ppo_config_single.to_dict(),
    run_config=air.RunConfig(
        stop=stop_criteria,
        verbose=1,
        checkpoint_config=air.CheckpointConfig(
            checkpoint_frequency=1,  # Frequency of checkpoints (e.g., every .... iteration)
            checkpoint_at_end=True  # Ensure final checkpoint at end
        ),
        storage_path=storage_path,  # Directory for TensorBoard logs
        name="PPO_Training_Experiment_example"
    ),
)

# Run the training
results = tuner.fit()

# Shutdown Ray
ray.shutdown()

0,1
Current time:,2025-02-06 13:17:04
Running for:,00:01:27.95
Memory:,5.8/8.0 GiB

Trial name,status,loc,iter,total time (s),ts,num_healthy_workers,num_in_flight_async_ sample_reqs,num_remote_worker_re starts
PPO_SingleAgentEnvironment_12378_00000,TERMINATED,127.0.0.1:16940,5,64.8528,20263,1,0,0


[36m(PPO pid=16940)[0m Trainable.setup took 11.945 seconds. If your trainable is slow to initialize, consider setting reuse_actors=True to reduce actor creation overheads.
[36m(PPO pid=16940)[0m Install gputil for GPU system monitoring.
[36m(PPO pid=16940)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sakshisharma/Desktop/rl_scheduling/tensorboard_logs_rlclass/single_example/PPO_Training_Experiment_example/PPO_SingleAgentEnvironment_12378_00000_0_2025-02-06_13-15-36/checkpoint_000000)
[36m(PPO pid=16940)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sakshisharma/Desktop/rl_scheduling/tensorboard_logs_rlclass/single_example/PPO_Training_Experiment_example/PPO_SingleAgentEnvironment_12378_00000_0_2025-02-06_13-15-36/checkpoint_000001)
[36m(PPO pid=16940)[0m Checkpoint successfully created at: Checkpoint(filesystem=local, path=/Users/sakshisharma/Desktop/rl_scheduling/tensorboard_logs_rlclass/single_example/PPO_

2025-02-06 13:17:04,409	INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/Users/sakshisharma/Desktop/rl_scheduling/tensorboard_logs_rlclass/single_example/PPO_Training_Experiment_example' in 0.0207s.
2025-02-06 13:17:04,695	INFO tune.py:1041 -- Total run time: 88.32 seconds (87.93 seconds for the tuning loop).


In [9]:
from ray.rllib.policy.policy import Policy

checkpoint_dir = "/Users/sakshisharma/Desktop/rl_scheduling/tensorboard_logs_rlclass/single_example/PPO_Training_Experiment_example/PPO_SingleAgentEnvironment_12378_00000_0_2025-02-06_13-15-36/checkpoint_000004" 
checkpoint_path = "file://" + os.path.abspath(checkpoint_dir)
# Restore the model from the checkpoint
policy = Policy.from_checkpoint(checkpoint_path)
print(policy)

{'default_policy': PPOTorchPolicy}


In [15]:
for episode in range(1):  # You can change the number of episodes for testing
    print(f"Testing Episode: {episode + 1}")
    
    obs,info = single_agent_env.reset()  # Reset the environment at the beginning of the episode
    done = False
    # episode_rewards = {"Primary": 0, "Auxillary": 0}

    rewards = {} 
    container_states = {}
    actions = {}
    
    
    while not done:
        # Get actions from the trained model
        s = 0
        action = policy['default_policy'].compute_single_action(obs)[0]
        
        # Step the environment    
        next_obs, reward, done, _ , infos = single_agent_env.step(action)
        print('step',s )
        single_agent_env.render()
        print(reward)
        s =+ 1
        # Update observations
        obs = next_obs
# Shutdown Ray after testing is done
ray.shutdown()

Testing Episode: 1
step 0
[['P' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' 'G' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' 'E' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']] 

0
step 0
[[' ' 'P' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' 'G' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' 'E' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']] 

0
step 0
[[' ' ' ' 'P' ' ' ' ' ' ' ' ']
 [' ' ' ' 'G' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' 'E' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']] 

0
step 0
[[' ' 'P' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' 'G' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' 'E' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']] 

0
step 0
[[' ' ' ' 'P' ' ' ' ' ' ' ' ']
 [' ' ' ' 'G' ' ' ' ' ' ' '

# Partially Observable Markov Games (POMG) and Multi-Agent RL

## Partially Observable Markov Games (POMG)

A **Partially Observable Markov Game (POMG)** extends the **Markov Decision Process (MDP)** to multi-agent settings where agents have limited information about the environment.

A POMG is defined by:
- **N agents** interacting in an environment.
- Each agent **i** has a private observation \( o_i \) derived from the state \( s \).
- Each agent selects an action \( a_i \) based on \( o_i \).
- The environment transitions to a new state \( s' \) based on all agents' actions.
- Agents receive individual rewards \( R_i \), which may be cooperative or competitive.

## Multi-Agent Reinforcement Learning (MARL)

In MARL, multiple agents learn simultaneously, affecting each other’s learning process. The two primary settings are:
1. **Cooperative**: Agents share rewards and work towards a common goal.
2. **Competitive**: Agents have conflicting objectives (e.g., adversarial games).

Common MARL algorithms:
- **Independent Q-Learning**: Each agent learns its own Q-function, treating others as part of the environment.
- **Centralized Training, Decentralized Execution (CTDE)**: Agents train with shared knowledge but act independently.
- **Multi-Agent Deep Deterministic Policy Gradient (MADDPG)**: An extension of DDPG for multi-agent settings.


---

## Multi-Agent Prisoner Escape Environment  

###  Overview  
The **Multi-Agent Prisoner Escape Environment** is extention of previous environment where two agents—the **Prisoner** and the **Guard**—interact with the goal of either escaping or preventing the escape. 




### Environment Details  

- **Initial Positions:**  
  - **Prisoner:** Always starts at **(0,0)**.  
  - **Escape Point:** Randomly placed between **(2,2) to (5,5)**.  
  - **Guard:** Randomly placed **anywhere** except the escape point or prisoner’s start location.  

- **Observations (for each agent):**  
  - **Prisoner:**  
    - Own position `(x, y)`  
    - Guard’s position `(x, y)`  
    - Escape point position `(x, y)`  
  - **Guard:**  
    - Own position `(x, y)`  
    - Prisoner’s position `(x, y)`  



##  Game Rules & Termination  
The game ends when **either**:  
1. **The prisoner reaches the escape point** → **Prisoner wins**   
2. **The guard catches the prisoner** → **Guard wins** 
3. **Maximum steps (100) reached** → **Game ends in a draw**   

### **Rewards:**  
- **Prisoner:**  
  - `+1` for escaping  
  - `-1` if caught  
- **Guard:**  
  - `+1` for catching the prisoner   
  - `-1` if the prisoner escapes   

---






## MARL Environment in petting Zoo

In [43]:
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
import functools
import random
from copy import copy

import numpy as np
from gymnasium.spaces import Discrete, MultiDiscrete

from pettingzoo import ParallelEnv

import supersuit 

In [44]:
class MRLEnvironment(ParallelEnv):
    """The metadata holds environment constants.

    The "name" metadata allows the environment to be pretty printed.
    """

    metadata = {
        "name": "custom_environment_v0"
    }

    def __init__(self):
        """The init method takes in environment arguments.

        Should define the following attributes:
        - escape x and y coordinates
        - guard x and y coordinates
        - prisoner x and y coordinates
        - timestamp
        - possible_agents

        Note: as of v1.18.1, the action_spaces and observation_spaces attributes are deprecated.
        Spaces should be defined in the action_space() and observation_space() methods.
        If these methods are not overridden, spaces will be inferred from self.observation_spaces/action_spaces, raising a warning.

        These attributes should not be changed after initialization.
        """
        self.escape_y = None
        self.escape_x = None
        self.guard_y = None
        self.guard_x = None
        self.prisoner_y = None
        self.prisoner_x = None
        self.timestep = None
        self.possible_agents = ["guard","prisoner"]
        self.render_mode = None


    def reset(self, seed=None, options=None):
        """Reset set the environment to a starting point.

        It needs to initialize the following attributes:
        - agents
        - timestamp
        - prisoner x and y coordinates
        - guard x and y coordinates
        - escape x and y coordinates
        - observation
        - infos

        And must set up the environment so that render(), step(), and observe() can be called without issues.
        """
        self.agents = copy(self.possible_agents)
        self.timestep = 0

        self.prisoner_x = 0
        self.prisoner_y = 0

        self.guard_x = 6
        self.guard_y = 6

        self.escape_x = random.randint(2, 5)
        self.escape_y = random.randint(2, 5)

        observations = {
            a: (
                self.prisoner_x + 7 * self.prisoner_y,
                self.guard_x + 7 * self.guard_y,
                self.escape_x + 7 * self.escape_y,
            )
            for a in self.agents
        }

        # Get dummy infos. Necessary for proper parallel_to_aec conversion
        infos = {a: {} for a in self.agents}

        return observations, infos

    def step(self, actions):
        """Takes in an action for the current agent (specified by agent_selection).

        Needs to update:
        - prisoner x and y coordinates
        - guard x and y coordinates
        - terminations
        - truncations
        - rewards
        - timestamp
        - infos

        And any internal state used by observe() or render()
        """
        # Execute actions
        prisoner_action = actions["prisoner"]
        guard_action = actions["guard"]

        if prisoner_action == 0 and self.prisoner_x > 0:
            self.prisoner_x -= 1
        elif prisoner_action == 1 and self.prisoner_x < 6:
            self.prisoner_x += 1
        elif prisoner_action == 2 and self.prisoner_y > 0:
            self.prisoner_y -= 1
        elif prisoner_action == 3 and self.prisoner_y < 6:
            self.prisoner_y += 1

        if guard_action == 0 and self.guard_x > 0:
            self.guard_x -= 1
        elif guard_action == 1 and self.guard_x < 6:
            self.guard_x += 1
        elif guard_action == 2 and self.guard_y > 0:
            self.guard_y -= 1
        elif guard_action == 3 and self.guard_y < 6:
            self.guard_y += 1

        # Check termination conditions
        terminations = {a: False for a in self.agents}
        rewards = {a: 0 for a in self.agents}
        if self.prisoner_x == self.guard_x and self.prisoner_y == self.guard_y:
            rewards = {"prisoner": -1, "guard": 1}
            terminations = {a: True for a in self.agents}

        elif self.prisoner_x == self.escape_x and self.prisoner_y == self.escape_y:
            rewards = {"prisoner": 1, "guard": -1}
            terminations = {a: True for a in self.agents}

        # Check truncation conditions (overwrites termination conditions)
        truncations = {a: False for a in self.agents}
        if self.timestep > 100:
            rewards = {"prisoner": 0, "guard": 0}
            truncations = {"prisoner": True, "guard": True}

        self.timestep += 1

        # Get observations
        observations = {
            a: (
                self.prisoner_x + 7 * self.prisoner_y,
                self.guard_x + 7 * self.guard_y,
                self.escape_x + 7 * self.escape_y,
            )
            for a in self.agents
        }

        # Get dummy infos (not used in this example)
        infos = {a: {} for a in self.agents}

        # if any(terminations.values()) or any(truncations.values()):
            # self.agents = []

        return observations, rewards, terminations, truncations, infos

    def render(self):
        """Renders the environment."""
        grid = np.full((7, 7), " ")
        grid[self.prisoner_y, self.prisoner_x] = "P"
        grid[self.guard_y, self.guard_x] = "G"
        grid[self.escape_y, self.escape_x] = "E"
        print(f"{grid} \n")

    # Observation space should be defined here.
    # lru_cache allows observation and action spaces to be memoized, reducing clock cycles required to get each agent's space.
    # If your spaces change over time, remove this line (disable caching).
    @functools.lru_cache(maxsize=None)
    def observation_space(self, agent):
        # gymnasium spaces are defined and documented here: https://gymnasium.farama.org/api/spaces/
        return spaces.Box(low=0, high=48, shape=(3,), dtype=np.float32)

    # Action space should be defined here.
    # If your spaces change over time, remove this line (disable caching).
    @functools.lru_cache(maxsize=None)
    def action_space(self, agent):
        return Discrete(4)

In [45]:
env_marl = MRLEnvironment()

In [46]:
env = supersuit.pettingzoo_env_to_vec_env_v1(env_marl)
env = supersuit.concat_vec_envs_v1(env,4,num_cpus=1, base_class="stable_baselines3")

In [47]:
model = PPO(MlpPolicy,env,verbose=1,learning_rate=1e-4,batch_size=256,)

Using cpu device


In [48]:
model.learn(total_timesteps=1000)

------------------------------
| time/              |       |
|    fps             | 8850  |
|    iterations      | 1     |
|    time_elapsed    | 1     |
|    total_timesteps | 16384 |
------------------------------


<stable_baselines3.ppo.ppo.PPO at 0x1998aead0>

In [50]:
from pettingzoo.utils import parallel_to_aec

aec_env = parallel_to_aec(env_marl)

In [51]:
aec_env.reset()
step = 0
for agent in aec_env.agent_iter():
  obs, rewards, terminations, truncations, infos = aec_env.last()
  print(f'{agent}_reward',rewards)

  if not truncations or terminations:
    act = model.predict(obs, deterministic=True)[0]

  aec_env.step(act)
  aec_env.render()

  if agent == 'prisoner':
    print('step',step)
    step = step + 2

guard_reward 0
[['P' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' 'E' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' 'G']] 

prisoner_reward 0
[['P' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' 'E' ' ' ' ' 'G']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']] 

step 0
guard_reward 0
[['P' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' 'E' ' ' ' ' 'G']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']] 

prisoner_reward 0
[['P' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' 'G']
 [' ' ' ' ' ' 'E' ' ' ' ' ' ']
 [' ' ' ' ' ' ' ' ' ' ' ' ' ']] 

step 2
guard_reward 0
[['P' ' ' ' ' ' ' ' ' ' '

AssertionError: 