# PettingZoo Parallel API Custom Env Tutorial

I'm going to complete a simple tutorial on creating and running custom Environments using PettingZoo's Parallel API (PettingZoo is the Farama Foundation's multi-agent counterpart to Gymnasium).
### Simple (Rock-Paper-Scissors)
This simple game will cover the basics of parallel agent actions and observations, and calculating the reward structure.
### Gridworld (Guard & Prisoner)
This Gridworld game will be much closer to the Embodied Communication Game, and will teach me how to update the Env's internal logic after each joint call to the Env.step() function.

## Rock-Paper-Scissors

In [13]:
#Imports

#MISC
from copy import copy
from typing import Optional
import functools
import random as rng
import numpy as np

#GYMNASIUM / PETTINGZOO
import gymnasium as gym
from gymnasium import Env
from gymnasium.spaces import Discrete, MultiDiscrete
from gymnasium.utils import seeding
from pettingzoo import ParallelEnv
from pettingzoo.butterfly import pistonball_v6
from pettingzoo.test import parallel_api_test
from pettingzoo.utils import parallel_to_aec, wrappers

#SB3
import supersuit as ss
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

In [8]:
#GLOBAL CONSTANTS
ROCK = 0
PAPER = 1
SCISSORS = 2
NONE = 3
MOVES = ["ROCK","PAPER","SCISSORS","None"]
NUM_ITERS = 100
REWARD_MAP = {
    (ROCK, ROCK): (0,0),
    (ROCK, PAPER): (-1,1),
    (ROCK, SCISSORS): (1,-1),
    (PAPER, PAPER): (0,0),
    (PAPER, SCISSORS): (-1,1),
    (PAPER, ROCK): (1,-1),
    (SCISSORS, SCISSORS): (0,0),
    (SCISSORS, ROCK): (-1, 1),
    (SCISSORS, PAPER): (1, -1),
}

#The Env function wraps the environment in some wrappers by default.
def env(render_mode=None):
    internal_render_mode = render_mode if render_mode != "ansi" else "human"
    env = raw_env(render_mode=internal_render_mode)
    #This wrapper is meant only for Envs which print results to the terminal
    if render_mode == "ansi":
        env = wrappers.CaptureStdoutWrapper(env)
    #This wrapper helps error handling for discrete action spaces.
    env = wrappers.AssertOutOfBoundsWrapper(env)
    #This wrapper provides a variety of helpful user errors
    env = wrappers.OrderEnforcingWrapper(env)
    return env

#The raw_env function uses parallel_to_aec to convert from ParallelEnv to AEC env.
def raw_env(render_mode=None):
    env = RockPaperScissors(render_mode=render_mode)
    env = parallel_to_aec(env)
    return env

class RockPaperScissors(ParallelEnv):
    metadata = {"render_modes": ["human"], "name": "rps_v2"}

    def __init__(self, render_mode=None):
        """
        The init method takes in environment arguments and should define the following attributes:
        -self.possible_agents
        -self.render_mode

        Note: self.action_space and self.observation_space are now deprecated. Action/Observation Spaces are
        defined within the action_space() and observation_space() methods. These methods automatically return the
        aforementioned variables, unless otherwise specified. So using them is fine unless we have a reason not to.
        """
        #The player names
        self.possible_agents = ["player_"+str(r) for r in range(2)]
        
        #render_mode, action_space, and observation_space will be public variables. All other variables will be private.
        self.render_mode = render_mode

    #We will define the observation space as a function rather than a variable for this class.
    #lru_cache allows the observation_space and action_space functions to be memoized for better performance
    @functools.lru_cache(maxsize=None)
    def observation_space(self, agent):
        return Discrete(4)

    @functools.lru_cache(maxsize=None)
    def action_space(self, agent):
        return Discrete(3)

    def render(self):
        """
        Renders the Env. In human mode, it can print to terminal, open a graphical window, or open some other type
        of display that's visible to the user.
        """
        if self.render_mode is None:
            gym.logger.warn(
                "You are calling the render method, but haven't specified a render mode."
            )
            return
        
        if len(self.agents) == 2:
            string = f"Current state: Agent1: {MOVES[self._state[self.agents[0]]]}, Agent2: {MOVES[self._state[self.agents[1]]]}"
        else:
            string = "Game Over."
        print(string)

    def close(self):
        """
        Close should release any graphical displays, subprocesses, network connections, or any other environment data
        which should not be kept around after the user is no longer using the Env. As we are not using any of these in
        this version of the class, it currently does nothing.
        """
        pass

    def reset(self, seed=None, options=None):
        """
        Resets the Env to its initial state. Sets up the Env so that render() and step() can be called without issue.
        Here it re-initializes the 'num_moves' variable which counts the number of hands played.
        Returns the observations/infos for each agent.
        """
        self.agents = self.possible_agents[:]
        self._num_moves = 0
        #No hand has been played yet, so each agent observes the NONE placeholder (3).
        obs = {agent: NONE for agent in self.agents}
        info = {agent: {} for agent in self.agents}
        #We will use this variable to track the complete Env state, and update it within the step() method.
        self._state = obs

        return (obs, info)

    def step(self, actions):
        """
        step(actions) takes an action for each agent as input and should return 5 variables:
        [observations, rewards, terminations, truncations, infos]
        each of these will be dicts containing one key per agent, like so:
        {agent_1: item_1, agent_2: item_2}
        """
        #If a user passes in actions containing no agents, then the returned dicts will be empty.
        if not actions:
            self.agents = []
            return {}, {}, {}, {}, {}
            
        #Rewards for all agents are placed in a rewards dict to be returned.
        rewards = {}
        #I really dislike the fact that these methods are referencing external constants, but I'll build this
        # the way the author wrote it for the purpose of completing this tutorial.
        rewards[self.agents[0]], rewards[self.agents[1]] = REWARD_MAP[(
            actions[self.agents[0]], actions[self.agents[1]]
        )]

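        #Rock-Paper-Scissors never terminates on its own; the episode is truncated after NUM_ITERS hands.
        terminations = {agent: False for agent in self.agents}
        self._num_moves += 1
        env_truncation = self._num_moves >= NUM_ITERS
        truncations = {agent: env_truncation for agent in self.agents}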

        observations = {
            self.agents[i]: int(actions[self.agents[1-i]]) for i in range(len(self.agents))
        }

        self._state = observations

        #typically there won't be any information in infos, but step() must still return an info for each agent.
        infos = {agent: {} for agent in self.agents}

        if env_truncation:
            self.agents = []

        if self.render_mode == "human":
            self.render()

        return observations, rewards, terminations, truncations, infos
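
The nine entries of REWARD_MAP encode a single cyclic rule: each move beats the one "behind" it in the order ROCK, PAPER, SCISSORS. As a minimal sketch, assuming the same 0/1/2 encoding as the constants above, the same payoffs can be derived with modular arithmetic. The helper name `rps_rewards` is hypothetical and is not part of the tutorial code.

```python
from typing import Dict, Tuple

ROCK, PAPER, SCISSORS = 0, 1, 2


def rps_rewards(a: int, b: int) -> Tuple[int, int]:
    """Return (reward for the player of a, reward for the player of b)."""
    diff = (a - b) % 3
    if diff == 0:
        return (0, 0)    # same move: tie
    if diff == 1:
        return (1, -1)   # a is one step ahead of b in the cycle, so a wins
    return (-1, 1)       # otherwise b wins


# Rebuild the full 3x3 payoff table from the rule.
payoffs: Dict[Tuple[int, int], Tuple[int, int]] = {
    (a, b): rps_rewards(a, b) for a in range(3) for b in range(3)
}

# Every hand is zero-sum: one player's gain is the other's loss.
assert all(ra + rb == 0 for ra, rb in payoffs.values())
```

A lookup table is easier to read for three moves; the arithmetic form mainly helps when checking that the table has no typos.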


### Well, now we've made a parallel environment. 
Now the question is how to run it.

In [11]:
env = RockPaperScissors()
env.reset()

({'player_0': 3, 'player_1': 3}, {'player_0': {}, 'player_1': {}})
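
Calling reset() only sets the episode up. To play it through, the usual pattern is to keep calling step() with one action per live agent until env.agents is empty. The sketch below shows that loop against a small stand-in class so that it runs without PettingZoo installed; `MatchingPenniesStub` and `run_episode` are hypothetical names, and the stub only mimics the reset()/step() return shapes described above.

```python
import random


class MatchingPenniesStub:
    """Hypothetical stand-in that mimics the ParallelEnv reset()/step() contract."""

    possible_agents = ["player_0", "player_1"]

    def __init__(self, max_hands=5):
        self.max_hands = max_hands
        self.agents = []
        self._hands = 0

    def reset(self, seed=None, options=None):
        self.agents = list(self.possible_agents)
        self._hands = 0
        observations = {agent: None for agent in self.agents}
        infos = {agent: {} for agent in self.agents}
        return observations, infos

    def step(self, actions):
        first, second = self.agents
        same = actions[first] == actions[second]
        # player_0 wins when the choices match, player_1 wins otherwise.
        rewards = {first: 1 if same else -1, second: -1 if same else 1}
        # Each agent observes the other agent's choice.
        observations = {first: actions[second], second: actions[first]}
        self._hands += 1
        done = self._hands >= self.max_hands
        terminations = {agent: False for agent in self.agents}
        truncations = {agent: done for agent in self.agents}
        infos = {agent: {} for agent in self.agents}
        if done:
            self.agents = []  # an empty agent list marks the end of the episode
        return observations, rewards, terminations, truncations, infos


def run_episode(env, policy, seed=None):
    """Play one episode and return the summed reward per agent."""
    observations, infos = env.reset(seed=seed)
    totals = {agent: 0 for agent in env.agents}
    while env.agents:
        actions = {agent: policy(agent, observations[agent]) for agent in env.agents}
        observations, rewards, terminations, truncations, infos = env.step(actions)
        for agent, reward in rewards.items():
            totals[agent] += reward
    return totals


rng = random.Random(0)
totals = run_episode(MatchingPenniesStub(), lambda agent, obs: rng.randint(0, 1))
print(totals)
```

With the RockPaperScissors class above, the same loop should apply with a policy such as `lambda agent, obs: env.action_space(agent).sample()`.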

This is an odd procedure for creating and running the environment. I'm a much bigger fan of the gym.register() and gym.make() approach. Unfortunately, that approach produces an error when calling gym.make(), presumably because a **ParallelEnv** doesn't implement the same interface as a normal Gymnasium Env.

Let's try making the next Env, **Guard and Prisoner**. We'll put more effort into running that Env successfully with the SB3 RL models, since, being a Grid-World, it's much more similar to the Embodied Communication Game.

## Gridworld (Guard and Prisoner)

In [56]:
class GuardAndPrisoner(ParallelEnv):
    """
    metadata holds Env constants. "name" metadata allows Env to be pretty printed.
    """
    metadata = { "render_modes": [],
                "name": "guard_and_prisoner_v0" }
    def __init__(self, height = 5, width = 5, step_limit = 100):
        """
        Takes in Env arguments.

        Should define the following:

        -Escape Coords
        -Prisoner starting Coords
        -Guard starting Coords
        -Timestep
        -Possible_Agents
        """
        #Private Variables
        self._height = height
        self._width = width
        self._escape_coords = np.array([-1, -1], dtype=np.int32)
        self._guard_coords = np.array([-1,-1],dtype=np.int32)
        self._prisoner_coords = np.array([-1,-1],dtype=np.int32)
        self._timestep = 0
        self._step_limit = step_limit
        
        #Public Variables (necessary for ParallelEnv implementation)
        self.possible_agents = ["prisoner","guard"]
        self.render_mode = None

        #Dictionary to map action space to a direction on the Gridworld.
        self._action_to_direction = {
            0: np.array([0,-1]), #Down
            1: np.array([1,0]), #Right
            2: np.array([0,1]), #Up
            3: np.array([-1,0]), #Left
        }

    #Because this Environment doesn't actually extend the base Env class, I can't set the seed
    # by calling super().reset(). Instead, this function will manually set the seed.
    def _seed(self, seed=None):
        self.np_random, seed = seeding.np_random(seed)

    #Farama recommends defining the Spaces directly in a getter function, rather than relying on self.x_space variables.
    #Defining observation_space in a memo-ized getter function for best performance.
    @functools.lru_cache(maxsize=None)
    def observation_space(self,agent):
        return gym.spaces.Dict({
            "escape" : gym.spaces.Box(0, max([self._height, self._width]) - 1, shape=(2,), dtype=int),
            "prisoner" : gym.spaces.Box(0, max([self._height, self._width]) - 1, shape=(2,), dtype=int),
            "guard" : gym.spaces.Box(0, max([self._height, self._width]) - 1, shape=(2,), dtype=int),
        })
    #Defining action_space in a memo-ized getter function.
    @functools.lru_cache(maxsize=None)
    def action_space(self,agent):
        return Discrete(4)

    #Helper function to return agent observations. We will call this in self.reset() and self.step()
    def _get_obs(self):
        observations = {
            a : {
                "escape": self._escape_coords,
                "prisoner": self._prisoner_coords,
                "guard": self._guard_coords,
            } for a in self.agents
        }
        return observations

    #Helper function to return infos. This doesn't currently return anything, but can be useful for auxiliary info.
    def _get_infos(self):
        infos = {a : {} for a in self.agents}
        return infos

    #Helper function to calculate rewards and terminateds, because terminated is true whenever rewards are doled out. Called in self.step()
    def _get_rewards_and_terminateds(self):
        terminateds = {a : False for a in self.agents}
        rewards = {a : 0 for a in self.agents}
        #Reward if the prisoner escapes. In this version of the game, a tie goes to the runner.
        if(np.array_equal(self._prisoner_coords, self._escape_coords)):
            rewards = {"prisoner": 1, "guard": 0}
            terminateds = {a : True for a in self.agents}
        #Reward if the guard catches the prisoner.
        elif(np.array_equal(self._guard_coords, self._prisoner_coords)):
            rewards = {"prisoner": 0, "guard": 1}
            terminateds = {a : True for a in self.agents}
        return rewards, terminateds

    #Helper function to get Truncateds
    def _get_truncateds(self):
        #Truncate once step_limit steps have been taken (using > would allow step_limit + 1 steps).
        if(self._timestep >= self._step_limit):
            truncateds = {"prisoner": True, "guard": True}
        else:
            truncateds = {"prisoner": False, "guard": False}
        return truncateds

    def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):
        """
        Reset Env to a randomized starting configuration.
        Must initialize the following:

        -agents
        -timestep
        -prisoner coords
        -guard coords
        -escape coords
        -observations
        -infos
        """
        #Always (re)create self.np_random. With seed=None the generator is seeded from fresh entropy,
        # so unseeded resets are not reproducible.
        self._seed(seed)
        
        #Start self.agents with a copy of self.possible_agents and set self._timestep to 0.
        self.agents = copy(self.possible_agents)
        self._timestep = 0
        
        #Use the set() object to generate three unique coords to assign to guard, prisoner, and escape.
        coords = set()
        while len(coords) < 3:
            x = self.np_random.integers(0, self._width)
            y = self.np_random.integers(0, self._height)
            coords.add((x,y))
            
        coords = np.array(list(coords))
        
        #instantiate the guard, prisoner, and escape coords.
        self._escape_coords = coords[0]
        self._guard_coords = coords[1]
        self._prisoner_coords = coords[2]
        
        #Get the agent observations and auxiliary info and return it.
        obs = self._get_obs()
        infos = self._get_infos()
        
        return (obs,infos)

    def step(self,actions):
        """
        This function takes in a dictionary of actions as an argument and updates the Env according to the actions.
        The actions dict contains two actions and looks something like this:
        {
            "guard": Guard Action,
            "prisoner": Prisoner Action,
        }
        Must update:
        -prisoner and guard coords
        -terminations and rewards if guard reaches prisoner or prisoner reaches escape.
        
        This function returns 5 variables:
        Observations, Rewards, Terminateds, Truncateds, Infos
        """
        prisoner_direction = self._action_to_direction[actions["prisoner"]]
        guard_direction = self._action_to_direction[actions["guard"]]
        
        #Update state based on agent actions.

        #Firstly, increment self._timestep
        self._timestep += 1
        
        #To update the agent locations, we must add the direction coords to their coords, and apply np.clip to ensure
        # the agents are still in-bounds.
        self._prisoner_coords = np.clip(self._prisoner_coords+prisoner_direction,[0,0],[self._width-1,self._height-1])
        self._guard_coords = np.clip(self._guard_coords+guard_direction,[0,0],[self._width-1,self._height-1])
        
        #Check the new state for rewards, terminateds, and truncateds, and generate the return values.
        observations = self._get_obs()
        infos = self._get_infos()
        rewards, terminateds = self._get_rewards_and_terminateds()
        truncateds = self._get_truncateds()
        
        #The Parallel API uses an empty self.agents list to signal that the episode is over, so self.agents
        # must be emptied whenever the game is terminated or truncated.
        if all(terminateds.values()) or all(truncateds.values()):
            self.agents = []

        return observations, rewards, terminateds, truncateds, infos
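
The movement rule in step() is "add the direction vector, then clamp to the grid". As a minimal sketch, assuming the same action numbering and the default 5x5 grid, here is that rule on plain tuples. The function `move` is a hypothetical helper, and it uses min/max in place of np.clip so that it runs with the standard library only.

```python
# Same action numbering as self._action_to_direction above.
ACTION_TO_DIRECTION = {
    0: (0, -1),   # Down
    1: (1, 0),    # Right
    2: (0, 1),    # Up
    3: (-1, 0),   # Left
}


def move(pos, action, width=5, height=5):
    """Apply one action to an (x, y) position and clamp it to the grid."""
    dx, dy = ACTION_TO_DIRECTION[action]
    x = min(max(pos[0] + dx, 0), width - 1)
    y = min(max(pos[1] + dy, 0), height - 1)
    return (x, y)


print(move((0, 0), 3))  # (0, 0): moving Left at the left wall is clamped
print(move((0, 0), 1))  # (1, 0): moving Right is unobstructed
print(move((4, 4), 2))  # (4, 4): moving Up at the top wall is clamped
```

One consequence of clamping is that walking into a wall leaves the agent where it is while the timestep still advances, so a wasted move costs part of the step budget.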
        

In [57]:
#Trying to create an Environment which multiple PPO models can train in using SuperSuit's converter
# functions for PettingZoo Envs.
env = GuardAndPrisoner()

#First, we must pass parallel_api_test to make sure we are implementing ParallelEnv correctly. We
# made quite a few errors in our initial implementation of the Environment:
# 1. Calling super().reset() on an Env which does not extend the base Env class (ParallelEnv doesn't
#    extend gymnasium.Env).
# 2. Not realizing that self.agents and self.possible_agents are essential variables used by the
#    ParallelEnv API and not just variable names chosen by the tutorial's author. They must be
#    named self.agents and self.possible_agents, and they must be set and cleared at specific points.
# 3. The trickiest mistake to fix: self.agents must be reset to [] within self.step() whenever the
#    game is terminated or truncated. It wasn't obvious that this was causing the test failure, and
#    I had to look back through the tutorial code to figure out what I was doing wrong.
parallel_api_test(env, num_cycles=1000)

#Now that our ParallelEnv passes the API test, we want to use SuperSuit to hook it up to SB3's PPO models.
env = ss.pettingzoo_env_to_vec_env_v1(env)
env = ss.concat_vec_envs_v1(env, 1, base_class="stable_baselines3")

mPPO = PPO(
    "MultiInputPolicy",
    env,
    verbose = 0
)
mean, std = evaluate_policy(mPPO, env, n_eval_episodes=100)
print(f"baseline performance. Mean reward of {mean} +/- {std}")
print(f"starting training on {str(env.venv.metadata['name'])}.")

mPPO.learn(total_timesteps=100000, progress_bar = True)
mean, std = evaluate_policy(mPPO, env, n_eval_episodes=100)
print(f"performance after 100000 training timesteps. Mean reward of {mean} +/- {std}")

Passed Parallel API test



baseline performance. Mean reward of 0.1 +/- 0.3
starting training on guard_and_prisoner_v0.


performance after 100000 training timesteps. Mean reward of 0.43 +/- 0.4950757517794625
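
A note on reading the "+/-" figures. Under the reward structure above, an agent's episode return is either 0 or 1, so the standard deviation is fully determined by the mean: for a 0/1 variable with mean p it is sqrt(p(1 - p)). The sketch below checks that the reported spreads match this formula, and also computes a standard error for the mean. The helper names are hypothetical, and the standard error treats the 100 evaluation episodes as independent, which is only approximate here because both agents' returns come from the same games.

```python
import math


def bernoulli_std(mean):
    """Standard deviation of a 0/1 variable whose mean is `mean`."""
    return math.sqrt(mean * (1.0 - mean))


def standard_error(mean, n_episodes):
    """Approximate uncertainty of the mean itself, assuming independent episodes."""
    return bernoulli_std(mean) / math.sqrt(n_episodes)


print(round(bernoulli_std(0.10), 4))        # 0.3
print(round(bernoulli_std(0.43), 4))        # 0.4951
print(round(standard_error(0.10, 100), 4))  # 0.03
print(round(standard_error(0.43, 100), 4))  # 0.0495
```

Read this way, the jump from 0.1 to 0.43 is large compared with a standard error of a few hundredths, even though the printed "+/-" values look wide.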


It looks like it did learn in the parallelized VecEnv! However, this approach is severely limited because both 'agents' are actually the same model, so they must learn together and end up with identical strategies. This training process can therefore only reach the optimal strategy in environments that can be solved with two identical strategies, i.e. "symmetric" games. This game is fundamentally asymmetric: the prisoner wants to escape and the guard wants to catch the prisoner. My guess is that the model made a beeline for the exit with both the guard and the prisoner, and as a result was able to receive some kind of reward each run. Fortunately, the **ECG** (Embodied Communication Game) can be solved either symmetrically or asymmetrically, so I have hope that this training method will work for it.
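
To make the shared-model limitation concrete, here is a conceptual sketch of one policy function serving both agents. It is not how SuperSuit or SB3 are implemented; `shared_policy` and `joint_action` are hypothetical names, and the hand-written rule merely stands in for a trained network. The point it illustrates is that _get_obs() gives both agents the same dict with no marker for which agent is being controlled, so a deterministic shared policy has no way to give the guard and the prisoner different actions.

```python
AGENTS = ["prisoner", "guard"]


def shared_policy(obs):
    """Stand-in for the single model: head toward the escape from the prisoner's square."""
    escape_x, escape_y = obs["escape"]
    prisoner_x, prisoner_y = obs["prisoner"]
    if escape_x > prisoner_x:
        return 1  # Right
    if escape_x < prisoner_x:
        return 3  # Left
    if escape_y > prisoner_y:
        return 2  # Up
    return 0      # Down


def joint_action(observations, policy):
    """Apply the one shared policy to every agent's observation."""
    return {agent: policy(obs) for agent, obs in observations.items()}


state = {"escape": (4, 4), "prisoner": (0, 0), "guard": (2, 2)}
# Both agents receive the same observation, as in _get_obs().
observations = {agent: state for agent in AGENTS}

print(joint_action(observations, shared_policy))  # {'prisoner': 1, 'guard': 1}
```

A sampled (stochastic) policy can still produce different actions for the two agents by chance, so this describes deterministic action selection only. One common remedy, mentioned here as an assumption rather than something tested in this notebook, is to add an agent identifier to each observation so that one network can learn role-specific behavior.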