# PettingZoo Speaker-Listener Environment Demonstration

## Introduction

[PettingZoo](https://www.pettingzoo.ml/) is a Python library for conducting research in multi-agent reinforcement learning, akin to a multi-agent version of [Gym](https://github.com/openai/gym). It implements a variety of environments, including:
- [Atari](https://www.pettingzoo.ml/atari): Multi-player Atari 2600 games (cooperative, competitive and mixed sum)
- [Butterfly](https://www.pettingzoo.ml/butterfly): Cooperative graphical games developed by the PettingZoo team, requiring a high degree of coordination
- [Classic](https://www.pettingzoo.ml/classic): Classical games including card games, board games, etc.
- [MAgent](https://www.pettingzoo.ml/magent): Configurable environments with massive numbers of particle agents, originally from https://github.com/geek-ai/MAgent
- [MPE](https://www.pettingzoo.ml/mpe): A set of simple nongraphical communication tasks, originally from https://github.com/openai/multiagent-particle-envs
- [SISL](https://www.pettingzoo.ml/sisl): 3 cooperative environments, originally from https://github.com/sisl/MADRL

<img src="https://www.pettingzoo.ml/mpe/mpe_simple_speaker_listener.gif" width="500" align="center"/>

The [Simple Speaker Listener Environment](https://www.pettingzoo.ml/mpe/simple_speaker_listener) is implemented in the MPE library. It is a 2-agent environment in which one agent, the "speaker", has information about the goal and has a limited mode of communication with the second agent, the "listener", which must use the speaker's communications and its own limited observations to navigate a 2D space toward the goal. The speaker agent cannot navigate, and the listener agent cannot communicate.

<img src="images/speaker_listener_screenshot.png" width="500" align="center"/>

In [1]:
!pip install 'pettingzoo[mpe]'



## Environment Description

The latest implementation of the speaker-listener environment is `simple_speaker_listener_v3`. We can create an environment instance using the `env` function which accepts two parameters:
1. `max_cycles` - the number of actions each agent can perform before the end of the episode. _default=25_
2. `continuous_actions` - if `True`, both the speaker and the listener have a continuous action space. Otherwise, their action spaces are discrete and finite. _default=False_

The environment object implements many useful tools that help us understand and properly use the environment. Below we use the `agents` attribute to iterate over the agent names, and the `observation_space` and `action_space` functions to show the [gym spaces](https://gym.openai.com/docs/#spaces) for each agent's observation and action spaces.

In [2]:
from pettingzoo.mpe import simple_speaker_listener_v3
import numpy as np


def print_env_info(continuous_actions):
    env = simple_speaker_listener_v3.env(continuous_actions=continuous_actions)
    env.reset()
    
    print('continuous actions:' if continuous_actions else 'discrete actions:')
    
    for i, agent in enumerate(env.agents, 1):
        print(f'- agent {i}: {agent}')
        print(f'\t- observation space: {env.observation_space(agent)}')
        print(f'\t- action space: {env.action_space(agent)}')


print_env_info(continuous_actions=False)
print()
print_env_info(continuous_actions=True)

discrete actions:
- agent 1: speaker_0
	- observation space: Box(-inf, inf, (3,), float32)
	- action space: Discrete(3)
- agent 2: listener_0
	- observation space: Box(-inf, inf, (11,), float32)
	- action space: Discrete(5)

continuous actions:
- agent 1: speaker_0
	- observation space: Box(-inf, inf, (3,), float32)
	- action space: Box(0.0, 1.0, (3,), float32)
- agent 2: listener_0
	- observation space: Box(-inf, inf, (11,), float32)
	- action space: Box(0.0, 1.0, (5,), float32)





### Observation Spaces

The `Box(low, high, shape, dtype)` space contains every array of shape `shape` and type `dtype` whose values all lie within the closed interval between `low` and `high`. Both the speaker and the listener receive one-dimensional `Box` observations of different sizes that may hold any 32-bit floating point value. Note that the observation spaces remain the same regardless of the action space type (continuous or discrete). We can get the observation of the agent that acts next using the environment's `last` function, which returns that agent's most recent observation, reward, "done" flag, and info dictionary. The acting agent is chosen sequentially according to the agents' order in the `agents` attribute. The next agent is selected when calling the `step` function, which sets the current agent's action.
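As a rough illustration of the `Box` membership rule described above, the following dependency-free sketch checks whether a flat vector lies inside a one-dimensional box. The helper name `box_contains` is hypothetical and is not part of Gym; real Gym spaces provide their own `contains` method, which should be preferred in practice.

```python
import math


def box_contains(vector, low, high, size):
    """Return True if `vector` has `size` entries, all within [low, high]."""
    if len(vector) != size:
        return False
    return all(low <= value <= high for value in vector)


# the speaker observation from this notebook lies in Box(-inf, inf, (3,))
speaker_ok = box_contains([0.65, 0.15, 0.15], -math.inf, math.inf, 3)

# a continuous speaker action must lie in Box(0.0, 1.0, (3,)); 1.5 is out of range
action_ok = box_contains([0.5, 1.5, 0.2], 0.0, 1.0, 3)

print(speaker_ok, action_ok)
```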

#### Speaker
The speaker observation is of type `Box(-inf, inf, (3,), float32)`, which is any vector of 3 values. The values represent the RGB color of the goal to which the listener must navigate to maximize rewards.

#### Listener
The listener observation is of type `Box(-inf, inf, (11,), float32)`, which is any vector of 11 values. The first two values are the agent's velocity in 2D space. The next six values are the red, blue, and green landmarks' positions relative to the listener. The last three values correspond to the communication received from the speaker. Below is the precise ordering of the values in the observation vector:
1. listener agent velocity X
2. listener agent velocity Y
3. red landmark X pos - listener agent X pos
4. red landmark Y pos - listener agent Y pos
5. blue landmark X pos - listener agent X pos
6. blue landmark Y pos - listener agent Y pos
7. green landmark X pos - listener agent X pos
8. green landmark Y pos - listener agent Y pos
9. communication channel 1
10. communication channel 2
11. communication channel 3

Note that the communication observation (values 9, 10, and 11) will always be 0 in the first round, since no communication has yet been received from the speaker.
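The ordering above can be sketched as a small helper that splits a listener observation into named parts. The function name `split_listener_observation` and the dictionary keys are hypothetical, chosen only for illustration; the sample vector is taken from the example output in this section.

```python
def split_listener_observation(obs):
    """Split the 11-value listener observation into named parts."""
    if len(obs) != 11:
        raise ValueError(f'expected 11 values, got {len(obs)}')
    return {
        'velocity': tuple(obs[0:2]),
        # three (x, y) offsets, in the landmark order listed above
        'landmark_offsets': [tuple(obs[i:i + 2]) for i in range(2, 8, 2)],
        'communication': tuple(obs[8:11]),
    }


# a listener observation after the speaker has sent message "A"
obs = [-0.5, 0.0, -1.1073222, 0.05878095, 0.4923617, -0.9294968,
       0.75731, -1.4845002, 1.0, 0.0, 0.0]
parts = split_listener_observation(obs)

print(parts['velocity'])
print(parts['landmark_offsets'])
print(parts['communication'])
```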

In [3]:
env = simple_speaker_listener_v3.env()
env.reset()  # reset the environment, selected agent is "speaker_0"

# run twice to show the change in the communication vector
for i in range(2):
    # speaker obs
    obs, _, _, _ = env.last()  # get speaker observation vector
    print(f'agent: {env.agents[0]}')
    print(f'observation: {obs}')
    print()
    env.step(0)  # send discrete message "A". next agent is selected (listener_0)
    
    obs, _, _, _ = env.last()  # get listener observation vector
    print(f'agent: {env.agents[1]}')
    print(f'observation: {obs}')
    print()
    env.step(1)  # perform the "push left" action

agent: speaker_0
observation: [0.65 0.15 0.15]

agent: listener_0
observation: [ 0.          0.         -1.1573222   0.05878095  0.4423617  -0.9294968
  0.70731    -1.4845002   0.          0.          0.        ]

agent: speaker_0
observation: [0.65 0.15 0.15]

agent: listener_0
observation: [-0.5         0.         -1.1073222   0.05878095  0.4923617  -0.9294968
  0.75731    -1.4845002   1.          0.          0.        ]



### Action Spaces

The agents' action spaces can be either discrete or continuous, depending on the `continuous_actions` parameter. If discrete, the action spaces are of type `Discrete(n)`, which contains the integer values 0 to n-1. Otherwise, the action spaces are of type `Box` (like the observation spaces), but with values constrained between 0 and 1. Actions are given sequentially, in the agent order defined by the `agents` attribute, using the `step` function. Each action must belong to the corresponding agent's action space.

#### Discrete Actions

##### Speaker
The action space is `Discrete(3)` containing the values 0, 1, 2. Each value corresponds to a possible message. Value 0 corresponds to message A, which can be seen in the communication vector of the listener's observation as \[1, 0, 0\]. Similarly, values 1 and 2 correspond to messages B and C and appear as \[0, 1, 0\] and \[0, 0, 1\] in the listener's observation respectively.
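The mapping from a discrete message to the communication vector can be sketched as a one-hot encoding. This is only an illustration of the correspondence described above: the helper `message_to_one_hot` is a hypothetical name and not part of PettingZoo.

```python
MESSAGES = ('A', 'B', 'C')


def message_to_one_hot(action, size=3):
    """Encode a discrete speaker action as the vector the listener observes."""
    if not 0 <= action < size:
        raise ValueError(f'action must be in [0, {size - 1}], got {action}')
    return [1.0 if i == action else 0.0 for i in range(size)]


for action, name in enumerate(MESSAGES):
    print(name, message_to_one_hot(action))
```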

##### Listener
The action space is `Discrete(5)` containing the values 0 to 4. Each value applies a force on the agent, increasing its velocity in some direction. The velocity will gradually decay until the agent stops, unless a constant force is applied. The values' meanings are as follows:
* 0 - do nothing
* 1 - push left (add velocity in negative x-axis direction)
* 2 - push right (add velocity in positive x-axis direction)
* 3 - push down (add velocity in negative y-axis direction)
* 4 - push up (add velocity in positive y-axis direction)
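The table above can be sketched as a lookup from action index to a unit force direction. The helper `velocity_after_push` is hypothetical; it assumes a listener that starts at rest and uses the values `accel=5`, `mass=1`, and `dt=0.1` that appear in the world model section of this notebook. Under those assumptions a single push changes the velocity by 0.5 along the chosen axis.

```python
# unit force direction (x, y) for each discrete listener action
ACTION_DIRECTIONS = {
    0: (0.0, 0.0),    # do nothing
    1: (-1.0, 0.0),   # push left
    2: (1.0, 0.0),    # push right
    3: (0.0, -1.0),   # push down
    4: (0.0, 1.0),    # push up
}


def velocity_after_push(action, accel=5.0, mass=1.0, dt=0.1):
    """Velocity of a listener at rest after one step of the given action.

    accel, mass, and dt are assumed defaults, not values read from an environment.
    """
    ux, uy = ACTION_DIRECTIONS[action]
    return (ux * accel / mass * dt, uy * accel / mass * dt)


for action in ACTION_DIRECTIONS:
    print(action, velocity_after_push(action))
```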

In [4]:
# speaker action-to-index dict
SPEAKER_DISCRETE_ACTIONS = {
    'A': 0,
    'B': 1,
    'C': 2
}

# listener action-to-index dict
LISTENER_DISCRETE_ACTIONS = {
    'nothing': 0,
    'left':    1,
    'right':   2,
    'down':    3,
    'up':      4
}

env = simple_speaker_listener_v3.env(continuous_actions=False)  # discrete actions env
env.reset()

# CHANGE ACTION ACCORDING TO THE SPEAKER TABLE AND SEE THE LISTENER'S COMMUNICATION OBSERVATIONS CHANGE
chosen_speaker_action = SPEAKER_DISCRETE_ACTIONS['A']

# CHANGE ACTION ACCORDING TO THE LISTENER TABLE AND SEE THE VELOCITY OBSERVATIONS CHANGE
chosen_listener_action = LISTENER_DISCRETE_ACTIONS['left']

# run twice to show the change in the communication vector
for i in range(2):
    # speaker action
    obs, _, _, _ = env.last()
    print(f'agent: {env.agents[0]}')
    print(f'observation: {obs}')
    print()
    env.step(chosen_speaker_action)
    
    # listener action
    obs, _, _, _ = env.last()
    print(f'agent: {env.agents[1]}')
    print(f'observation: {obs}')
    print()
    env.step(chosen_listener_action)

agent: speaker_0
observation: [0.65 0.15 0.15]

agent: listener_0
observation: [ 0.          0.         -1.1739286   1.2495128  -0.5537686   1.1765938
 -1.0880105   0.92581254  0.          0.          0.        ]

agent: speaker_0
observation: [0.65 0.15 0.15]

agent: listener_0
observation: [-0.5         0.         -1.1239287   1.2495128  -0.50376856  1.1765938
 -1.0380106   0.92581254  1.          0.          0.        ]



#### Continuous Actions

##### Speaker
The action space is `Box(0.0, 1.0, (3,), float32)` containing 3D vectors of values in \[0, 1\]. The values have no specific meaning, and are given, as is, as the listener observation's communication vector.

##### Listener
The action space is `Box(0.0, 1.0, (5,), float32)` containing 5D vectors of values in \[0, 1\]. The first value applies no force, and each of the remaining four values describes the amount of force applied in one direction (right, left, up, and down). Below is the precise ordering of the values in the action vector:
1. No force (useful for some implementations)
2. Force right (positive in x-axis)
3. Force left (negative in x-axis)
4. Force up (positive in y-axis)
5. Force down (negative in y-axis)
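Opposing entries of the action vector cancel each other, so the effective push is the difference within each axis. A minimal sketch of that reduction, with the hypothetical helper name `net_force`:

```python
def net_force(action):
    """Collapse a 5-value listener action into an (x, y) force."""
    if len(action) != 5:
        raise ValueError(f'expected 5 values, got {len(action)}')
    _, right, left, up, down = action  # the first entry applies no force
    return (right - left, up - down)


# equal right and left pushes cancel; down is slightly stronger than up
fx, fy = net_force([0.0, 0.8, 0.8, 0.5, 0.7])
print(round(fx, 6), round(fy, 6))
```

With these inputs the horizontal force vanishes and a small downward force remains, so the listener drifts in the negative y direction.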

In [5]:
# speaker continuous action function
def speaker_continuous_action(v1, v2, v3):
    return np.array([v1, v2, v3], dtype=np.float32)

# listener continuous action function
def listener_continuous_action(right, left, up, down):
    return np.array([0, right, left, up, down], dtype=np.float32)

env = simple_speaker_listener_v3.env(continuous_actions=True)  # continuous actions env
env.reset()

# CHANGE ACTION AS NEEDED
chosen_speaker_action = speaker_continuous_action(v1=0.5, v2=1.0, v3=0.2)

# CHANGE ACTION AS NEEDED
chosen_listener_action = listener_continuous_action(right=0.8, left=0.8, up=0.5, down=0.7)

# run twice to show the change in the communication vector
for i in range(2):
    # speaker action
    obs, _, _, _ = env.last()
    print(f'agent: {env.agents[0]}')
    print(f'observation: {obs}')
    print()
    env.step(chosen_speaker_action)
    
    # listener action
    obs, _, _, _ = env.last()
    print(f'agent: {env.agents[1]}')
    print(f'observation: {obs}')
    print()
    env.step(chosen_listener_action)

agent: speaker_0
observation: [0.15 0.15 0.65]

agent: listener_0
observation: [ 0.          0.         -0.5547875   0.23933914  1.2868022  -0.4283366
  0.5035991   0.9161092   0.          0.          0.        ]

agent: speaker_0
observation: [0.15 0.15 0.65]

agent: listener_0
observation: [ 0.         -0.09999999 -0.5547875   0.24933913  1.2868022  -0.4183366
  0.5035991   0.9261092   0.5         1.          0.2       ]



## Running the Environment

We can run a game simulation by resetting an environment and playing out the episode. This is done by iterating over the agents repeatedly, providing an action for each agent at every iteration, until we wish to stop or until the `max_cycles` limit has been reached. In the example below, we define a policy function to generate random actions within each agent's action space.

### Policies

We implement a policy as a function that, given the current observation, returns an action to perform. Below we define a policy class that supports both `Discrete` and `Box` action spaces. It completely ignores the observation and samples a random valid action from the given action space.

In [6]:
from gym.spaces import Discrete, Box

# define a random policy for discrete and continuous action agents.
# the policy ignores the observation and returns a random valid action from the agent's action space.
class RandomPolicy:
    def __init__(self, action_space):
        # choose a policy function for this action space type
        if isinstance(action_space, Discrete):  # discrete action policy
            self.policy_fn = self.__discrete_policy
        elif isinstance(action_space, Box):  # continuous action policy
            self.policy_fn = self.__continuous_policy
        else:  # other types are not supported
            raise TypeError(f'action_space must be of type Box or Discrete. got {type(action_space).__name__}')
        
        self.action_space = action_space
        
    def __call__(self, observation):
        # we completely ignore the observation and create a random valid action.
        return self.policy_fn()
    
    def __discrete_policy(self):
        # a random number within the discrete action range
        return np.random.randint(self.action_space.n)
    
    def __continuous_policy(self):
        # a random vector within the continuous range of the appropriate dimensionality
        # convert to the right dtype to avoid clipping warnings (e.g. float64 to float32)
        return np.random.uniform(self.action_space.low, self.action_space.high, self.action_space.shape).astype(self.action_space.dtype)

### Simulation

We now define an environment with either a discrete or continuous action space and, for the purposes of this demonstration, limit it to a small number of steps using the `max_cycles` parameter. We then create a policy for each agent and iterate over the agents repeatedly. For this, the environment implements the `agent_iter` function, which cycles through the agents in order, allowing us to perform up to `max_cycles` steps for each agent. An added bonus of using `agent_iter` is that it raises an error if there was no `step` call within the iteration (which can prevent horrible bugs). After the episode has ended, the agents' "done" status will be true. In this case the episode is complete and the environment must be reset if we wish to run it again.
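To make the turn-taking pattern concrete before running the real environment, here is a toy stand-in written in plain Python. `ToyTurnEnv` is hypothetical and is not PettingZoo's implementation: it only mimics the shape of the `last`, `step`, and `agent_iter` calls described above, returns no real observations or rewards, and assumes the episode ends once every agent has acted `max_cycles` times.

```python
class ToyTurnEnv:
    """A toy stand-in for a turn-based environment (not PettingZoo code)."""

    def __init__(self, agents, max_cycles):
        self.possible_agents = list(agents)
        self.max_cycles = max_cycles

    def reset(self):
        self.agents = list(self.possible_agents)
        self.turn = 0    # index of the agent that acts next
        self.cycle = 0   # number of completed cycles through all agents
        self.log = []    # (agent, action) pairs, in execution order

    def last(self):
        # the episode is over once every agent has acted max_cycles times
        done = self.cycle >= self.max_cycles
        return None, 0.0, done, {}

    def step(self, action):
        self.log.append((self.agents[self.turn], action))
        self.turn = (self.turn + 1) % len(self.agents)
        if self.turn == 0:
            self.cycle += 1

    def agent_iter(self, max_iter=1000):
        # keep yielding the acting agent; the caller stops when "done" is set
        for _ in range(max_iter):
            yield self.agents[self.turn]


env = ToyTurnEnv(['speaker_0', 'listener_0'], max_cycles=2)
env.reset()

for agent in env.agent_iter():
    observation, reward, done, info = env.last()
    if done:
        break
    env.step(f'{agent}-action')

print(env.log)
```

The loop body has the same structure as the real simulation loop: query `last`, stop when done, otherwise act with `step`.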

In [7]:
# CHOOSE MAX CYCLES AND DISCRETE OR CONTINUOUS ACTION SPACE
env = simple_speaker_listener_v3.env(max_cycles=5, continuous_actions=True)
env.reset()

# create a random policy for both agents' action spaces.
policies = {
    env.agents[0]: RandomPolicy(env.action_space(env.agents[0])),
    env.agents[1]: RandomPolicy(env.action_space(env.agents[1]))
}

# iterate over agents until the episode is complete
for agent in env.agent_iter():
    observation, reward, done, info = env.last()

    # if done, the episode is complete. no more actions can be taken
    if done:
        break
    
    # choose an action and execute
    action = policies[agent](observation)
    env.step(action)
    
    # log everything
    print(f'{agent} reward:      {reward}')
    print(f'{agent} observation: {observation}')
    print(f'{agent} action:      {action}')
    print()

speaker_0 reward:      0.0
speaker_0 observation: [0.15 0.65 0.15]
speaker_0 action:      [0.7946723  0.82636803 0.71717334]

listener_0 reward:      0.0
listener_0 observation: [ 0.          0.         -1.0930883  -1.4339958  -1.3082466  -1.5381738
 -0.51077074 -0.5750493   0.          0.          0.        ]
listener_0 action:      [0.17187975 0.81018907 0.45053533 0.829903   0.79029036]

speaker_0 reward:      -4.130959731095398
speaker_0 observation: [0.15 0.65 0.15]
speaker_0 action:      [0.3222627 0.4749429 0.4140355]

listener_0 reward:      -4.130959731095398
listener_0 observation: [ 0.17982687  0.01980633 -1.111071   -1.4359765  -1.3262293  -1.5401543
 -0.5287534  -0.57702994  0.7946723   0.82636803  0.71717334]
listener_0 action:      [0.19847904 0.7819158  0.15683617 0.35229257 0.48536792]

speaker_0 reward:      -4.235741904178669
speaker_0 observation: [0.15 0.65 0.15]
speaker_0 action:      [0.8946533  0.27243754 0.10652413]

listener_0 reward:      -4.235741904178669

### Rendering

We can render the environment to visualize the observation space. We must call the `render` function at every iteration to create and update a rendering in a separate window. Below we show a 100-step episode controlled by our random policies.

In [8]:
# long episode for interesting rendering.
env = simple_speaker_listener_v3.env(max_cycles=100, continuous_actions=True)
env.reset()

# create a random policy for both agents' action spaces.
policies = {
    env.agents[0]: RandomPolicy(env.action_space(env.agents[0])),
    env.agents[1]: RandomPolicy(env.action_space(env.agents[1]))
}

# run an episode
for agent in env.agent_iter():
    observation, reward, done, info = env.last()
    
    # stop if done
    if done:
        break
    
    # choose and execute action
    action = policies[agent](observation)
    env.step(action)
    
    # render the environment
    env.render('human')

# This line is SUPPOSED to close the rendering window, but it does not.
# To close the window, restart the kernel, and then don't run this cell again.
env.close()

## Advanced Tools

### World Model

All MPE environments define a world with customizable physical properties that make up the world model. The `pettingzoo.mpe._mpe_utils.core.World` object is the template for such a world. It contains a collection of `Entity` objects, divided into agents and landmarks. Both the world object and the different entities have physical attributes that affect transitions within the environment.

Let us explore the Simple Speaker Listener environment's world attributes.

In [9]:
env = simple_speaker_listener_v3.env(continuous_actions=True)
env.reset()

# list environment "world" attributes
vars(env.unwrapped.world)

{'agents': [<pettingzoo.mpe._mpe_utils.core.Agent at 0x7f9c079268e0>,
  <pettingzoo.mpe._mpe_utils.core.Agent at 0x7f9c079263d0>],
 'landmarks': [<pettingzoo.mpe._mpe_utils.core.Landmark at 0x7f9c079269a0>,
  <pettingzoo.mpe._mpe_utils.core.Landmark at 0x7f9c07926a90>,
  <pettingzoo.mpe._mpe_utils.core.Landmark at 0x7f9c07926130>],
 'dim_c': 3,
 'dim_p': 2,
 'dim_color': 3,
 'dt': 0.1,
 'damping': 0.25,
 'contact_force': 100.0,
 'contact_margin': 0.001,
 'collaborative': True}

`agents` and `landmarks` are the world entities. The `dim_X` attributes define the dimensions of values in the environment, e.g., `dim_p` defines the world position dimensions (the default is a 2D world). The rest are physical properties that affect transitions:
- `dt` - time units per step
- `damping` - applies a multiplicative drag on moving agents.
- `contact_force` and `contact_margin` - used to calculate collision force

To complete our view of the world, let us explore an entity. Specifically, below are the attributes of the listener agent.

In [10]:
# list "listener" agent entity attributes
vars(env.unwrapped.world.agents[1])

{'name': 'listener_0',
 'size': 0.075,
 'movable': True,
 'collide': False,
 'density': 25.0,
 'color': array([0.6, 1.1, 0.6]),
 'max_speed': None,
 'accel': None,
 'state': <pettingzoo.mpe._mpe_utils.core.AgentState at 0x7f9c079263a0>,
 'initial_mass': 1.0,
 'silent': True,
 'blind': False,
 'u_noise': None,
 'c_noise': None,
 'u_range': 1.0,
 'action': <pettingzoo.mpe._mpe_utils.core.Action at 0x7f9c07926250>,
 'action_callback': None,
 'goal_a': None,
 'goal_b': None}

Some interesting attributes:

- `movable` - if `True`, the entity can change position via actions / collisions.
- `collide` - if `True`, the entity can collide with other entities with `collide=True`
- `max_speed` - a maximal limit on the velocity norm. if `None`, the speed is not bounded.
- `accel` - a constant value that scales the force acceleration. The default is `None` which defaults to `5`.
- `state` - contains the position, velocity, and communications vector of the entity.
- `initial_mass` - the entity's mass
- `u_noise` and `c_noise` - the standard deviation for additive, zero-mean Gaussian noise for the action force and communications respectively. `None` is equivalent to 0.
- `u_range` - The maximal force that can be applied to the agent in any axis.

As we can see, this environment has one movable agent that cannot collide with other entities, and whose actions are deterministic (i.e., no noise). In this scenario, when the listener performs its action `[x, right, left, up, down]` at time step $t$, the velocity and position of the agent are calculated as follows:
$$u_{t} = [\text{right} - \text{left}, \text{up} - \text{down}]$$
$$v_{t + 1} = v_{t}\cdot (1 - \text{damping}) + \frac{u_{t} \cdot\text{accel}}{\text{mass}}\cdot \text{dt}$$
$$x_{t + 1} = x_{t} + v_{t + 1}\cdot \text{dt}$$
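The update rule above can be sketched in plain Python to see how damping makes the velocity decay once the force stops. The function `next_state` is a hypothetical helper, not part of PettingZoo. Its defaults (`damping=0.25`, `accel=5`, `mass=1`, `dt=0.1`) are assumptions taken from the attribute listings in this section.

```python
def next_state(v, x, u, damping=0.25, accel=5.0, mass=1.0, dt=0.1):
    """Apply one step of the update rule to 2D velocity v and position x."""
    v_next = tuple(vi * (1 - damping) + ui * accel / mass * dt
                   for vi, ui in zip(v, u))
    x_next = tuple(xi + vi * dt for xi, vi in zip(x, v_next))
    return v_next, x_next


# one push to the left, then coast with no force and watch the velocity decay
v, x = (0.0, 0.0), (0.0, 0.0)
v, x = next_state(v, x, u=(-1.0, 0.0))
speeds = [v[0]]
for _ in range(3):
    v, x = next_state(v, x, u=(0.0, 0.0))
    speeds.append(v[0])

print(speeds)  # [-0.5, -0.375, -0.28125, -0.2109375]
```

Each coasting step multiplies the velocity by `1 - damping`, which is why a single push fades out rather than moving the agent forever.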

The example below demonstrates how to use the world's physical attributes to determine the next velocity and position of the listener, and compares them to the actual environment update. We also show that it is possible to customize the environment by directly changing the `damping` value.

In [11]:
# get world and listener agent instances
world = env.unwrapped.world
listener = world.agents[1]

# change world damping
world.damping = 0.05

# get constants
dt = world.dt
damping = world.damping
accel = listener.accel or 5  # defaults to 5 if None
mass = listener.mass
v_prev = listener.state.p_vel
x_prev = listener.state.p_pos

# SET ACTION
# values must be within [0, u_range]
# x - nothing
# r - right
# l - left
# u - up
# d - down        [x, r,   l,    u,    d]
ACTION = np.array([0, 1, 0.5, 0.75, 0.75], dtype=np.float32)

# calculate the applied force
# [right - left, up - down]
u = ACTION[1::2] - ACTION[2::2]

# calculate v_{t+1} and x_{t+1}
expected_next_v = v_prev * (1 - damping) + ((u * accel) / mass) * dt
expected_next_x = x_prev + expected_next_v * dt

# do speaker step (ignoring this value in communication)
env.step(np.array([0, 0, 1], dtype=np.float32))

# do listener step
env.step(ACTION)

actual_next_v = listener.state.p_vel
actual_next_x = listener.state.p_pos

print(f'expected next v: {expected_next_v}')
print(f'actual next v:   {actual_next_v}')
print()
print(f'expected next x: {expected_next_x}')
print(f'actual next x:   {actual_next_x}')

expected next v: [0.25 0.  ]
actual next v:   [0.25 0.  ]

expected next x: [0.44215233 0.92596952]
actual next x:   [0.44215233 0.92596952]


Only by understanding the environment model can we hope to implement model-based algorithms, e.g., planning (BFS, DFS, A*, etc.). The above example showcases only the listener agent of a single environment using one configuration. To better understand the effect of different configurations (and in different MPE environments), e.g., introducing collisions between the speaker and listener agents, one must dive into the [MPE code](https://github.com/Farama-Foundation/PettingZoo/tree/master/pettingzoo/mpe). <span style="color:yellow">Be warned</span> that this code is not fully documented.
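As a hedged illustration of what a model-based approach might look like, the sketch below uses the transition rule from this section to pick the discrete action whose predicted next position is closest to a goal. This is a greedy one-step lookahead, a much simpler stand-in for the planners named above, not an implementation of BFS, DFS, or A*. The names `predict` and `greedy_action`, the goal coordinates, and the default physical constants are all assumptions made for illustration.

```python
import math

# unit force direction (x, y) for each discrete listener action
DIRECTIONS = {0: (0.0, 0.0), 1: (-1.0, 0.0), 2: (1.0, 0.0),
              3: (0.0, -1.0), 4: (0.0, 1.0)}


def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def predict(v, x, action, damping=0.25, accel=5.0, mass=1.0, dt=0.1):
    """Predict the listener's next velocity and position for a discrete action."""
    u = DIRECTIONS[action]
    v_next = tuple(vi * (1 - damping) + ui * accel / mass * dt
                   for vi, ui in zip(v, u))
    x_next = tuple(xi + vi * dt for xi, vi in zip(x, v_next))
    return v_next, x_next


def greedy_action(v, x, goal):
    """Pick the action whose predicted next position is closest to the goal."""
    return min(DIRECTIONS, key=lambda a: distance(predict(v, x, a)[1], goal))


# roll the model forward from rest toward a hypothetical goal position
v, x, goal = (0.0, 0.0), (0.0, 0.0), (1.0, 0.2)
initial_distance = distance(x, goal)
for _ in range(30):
    v, x = predict(v, x, greedy_action(v, x, goal))
final_distance = distance(x, goal)

print(initial_distance, final_distance)
```

A one-step lookahead ignores momentum beyond the next step, so it can overshoot the goal; a deeper search over the same model would be needed to brake in time.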

### Wrappers

PettingZoo provides utilities called wrappers. They are used to alter the behavior of the environment with minimal effort without changing the general environment API. In fact, the Simple Speaker Listener environment is already wrapped upon creation. In the demonstration below, we can see that the environment type is actually a wrapper called `OrderEnforcingWrapper`. This class works like the original environment, adds checks that enforce the agent order, and adds extra functionality, e.g., the `agent_iter` function is implemented in this wrapper (and is not available in the raw environment). The underlying environment, revealed using the environment's `env` attribute, is actually wrapped by another wrapper called `AssertOutOfBoundsWrapper`, which checks that given actions are compatible with the agents' discrete action space (similarly, the continuous action space uses another wrapper called `ClipOutOfBoundsWrapper`). Only under this wrapper do we find the raw environment object. However, we can jump directly to the raw environment by using the `unwrapped` property.
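The layering described above follows a general delegation pattern, which can be sketched without PettingZoo. All class names below (`ToyEnv`, `ToyWrapper`, `ToyAssertBounds`) are hypothetical and only imitate the idea: a wrapper forwards everything to the wrapped object by default and overrides just the calls it wants to change.

```python
class ToyEnv:
    """A toy environment with a Discrete(3)-like action space (not PettingZoo code)."""
    n_actions = 3

    def __init__(self):
        self.history = []

    def step(self, action):
        self.history.append(action)


class ToyWrapper:
    """Forward every attribute to the wrapped environment by default."""

    def __init__(self, env):
        self.env = env

    def __getattr__(self, name):
        # only called when normal lookup fails, so overrides take priority
        return getattr(self.env, name)

    @property
    def unwrapped(self):
        # follow the chain of wrappers down to the raw environment
        return getattr(self.env, 'unwrapped', self.env)


class ToyAssertBounds(ToyWrapper):
    """Reject actions outside the discrete range before they reach the env."""

    def step(self, action):
        if not 0 <= action < self.env.n_actions:
            raise ValueError(f'invalid action {action}')
        self.env.step(action)


# two layers of wrapping around the raw toy environment
env = ToyWrapper(ToyAssertBounds(ToyEnv()))
env.step(2)

print(type(env).__name__, type(env.env).__name__, type(env.unwrapped).__name__)
print(env.history)
```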

In [12]:
env = simple_speaker_listener_v3.env()
print(f'external wrapper:      {type(env)}')
print(f'inner wrapper:         {type(env.env)}')
print(f'raw environment:       {type(env.env.env)}')
print(f'environment unwrapped: {type(env.unwrapped)}')

external wrapper:      <class 'pettingzoo.utils.wrappers.order_enforcing.OrderEnforcingWrapper'>
inner wrapper:         <class 'pettingzoo.utils.wrappers.assert_out_of_bounds.AssertOutOfBoundsWrapper'>
raw environment:       <class 'pettingzoo.mpe.simple_speaker_listener_v3.raw_env'>
environment unwrapped: <class 'pettingzoo.mpe.simple_speaker_listener_v3.raw_env'>


#### Custom Wrappers

Using the `BaseWrapper` abstraction, we can create our own wrappers. We demonstrate this below with a custom wrapper that turns this environment into a single-agent environment. This is done by performing the speaker's step automatically, before the user has a chance to do so. The speaker's action will be the goal color if continuous, or a constant message for each color if discrete.

In [13]:
from pettingzoo.utils import wrappers
import numpy as np

class ListenerOnlyWrapper(wrappers.BaseWrapper):
    
    def __init__(self, env):
        super().__init__(env)
        
        # reset to skip speaker before new game
        self.reset()
        
        # set single agent list
        self.agents = self.agents[1:]
        
    def reset(self):
        super().reset()
        
        # skip speaker action
        self.__step_speaker()
    
    def step(self, action):
        super().step(action)  # do listener action
        
        # skip speaker action
        self.__step_speaker()
        
    def __step_speaker(self):
        # the speaker's observation is the goal color
        goal_color, _, done, _ = self.env.last()
        
        # speaker is done before the listener.
        if done:
            return
        
        # step with the correct action type
        if self.env.unwrapped.continuous_actions:
            super().step(goal_color)
        else:
            super().step(np.argmax(goal_color))

We can wrap a new environment by creating a new wrapper instance initialized with the environment to wrap.

In [14]:
env = simple_speaker_listener_v3.env(max_cycles=5, continuous_actions=True)
env = ListenerOnlyWrapper(env)
print(f'custom wrapped environment: {env}')
print(f'list of agents:             {env.agents}')

custom wrapped environment: ListenerOnlyWrapper<simple_speaker_listener_v3>
list of agents:             ['listener_0']


Like before, we can simulate an episode with `agent_iter`, which will now always select the listener. Note that the initial observation contains a communication vector of zeros because no communication is received before the first step. In the subsequent observations, the communication values carry the speaker's message (here, the goal color) until the end of the episode.

In [15]:
policy = RandomPolicy(env.action_space(env.agents[0]))
for agent in env.agent_iter():
    observation, reward, done, info = env.last()
    
    # stop if done
    if done:
        break
    
    # choose and execute action
    action = policy(observation)
    env.step(action)
    
    # log everything
    print(f'{agent} reward:      {reward}')
    print(f'{agent} observation: {observation}')
    print(f'{agent} action:      {action}')
    print()

listener_0 reward:      0.0
listener_0 observation: [ 0.          0.         -0.34709144  0.967302   -0.01558756  0.34916973
 -0.01930392  0.6651746   0.          0.          0.        ]
listener_0 action:      [0.29963553 0.42740586 0.9696676  0.94981784 0.36658645]

listener_0 reward:      -0.10253806598032186
listener_0 observation: [-0.27113086  0.2916157  -0.31997836  0.93814045  0.01152553  0.32000816
  0.00780917  0.63601303  0.15        0.65        0.15      ]
listener_0 action:      [0.25012997 0.70812196 0.6843732  0.01767832 0.87168497]

listener_0 reward:      -0.11711090866196903
listener_0 observation: [-0.19147377 -0.20829156 -0.30083096  0.95896965  0.0306729   0.34083733
  0.02695655  0.6568422   0.15        0.65        0.15      ]
listener_0 action:      [0.47302386 0.75251347 0.12384099 0.40512335 0.7007381 ]

listener_0 reward:      -0.1380040380749845
listener_0 observation: [ 0.17073092 -0.30402604 -0.31790406  0.98937225  0.01359981  0.37123993
  0.00988346  0.68

### Parallel Environments

Up until now we were able to view each agent's observations and act individually, even though the actions were only applied after a full cycle through all the agents. Many PettingZoo environments, including Simple Speaker Listener, support acting in parallel using yet another wrapper. This is important for implementing algorithms that consider joint actions, e.g., centralized control. We can create a parallel environment by invoking the `parallel_env` function.

In [16]:
env = simple_speaker_listener_v3.parallel_env()

In this environment, the observations of both the speaker and listener agents are bundled together in a dictionary. The initial environment observation is returned from the `reset` function.

In [17]:
observations = env.reset()
observations

{'speaker_0': array([0.15, 0.65, 0.15], dtype=float32),
 'listener_0': array([ 0.        ,  0.        ,  0.35934502,  0.12687765,  0.8555883 ,
         0.44403875,  0.3446538 , -0.3340047 ,  0.        ,  0.        ,
         0.        ], dtype=float32)}

Actions are performed jointly using the `step` function. This will return the next observation bundle together with separate dictionaries for the reward and "done" status for each agent.

In [18]:
joint_action = {'speaker_0': 1, 'listener_0': 2}
observations, rewards, done, info = env.step(joint_action)

print('new observations:')
print(observations)
print('step rewards:')
print(rewards)
print('done status:')
print(done)

new observations:
{'speaker_0': array([0.15, 0.65, 0.15], dtype=float32), 'listener_0': array([ 0.5       ,  0.        ,  0.30934504,  0.12687765,  0.8055883 ,
        0.44403875,  0.29465377, -0.3340047 ,  0.        ,  1.        ,
        0.        ], dtype=float32)}
step rewards:
defaultdict(<class 'int'>, {'speaker_0': -0.8461429721052728, 'listener_0': -0.8461429721052728})
done status:
{'speaker_0': False, 'listener_0': False}


Putting it all together, a simulation might look something like this:

In [19]:
env = simple_speaker_listener_v3.parallel_env(max_cycles=5)
observations = env.reset()

# a random policy for each agent
policies = {
    env.agents[0]: RandomPolicy(env.action_space(env.agents[0])),
    env.agents[1]: RandomPolicy(env.action_space(env.agents[1]))
}

for _ in range(env.unwrapped.max_cycles):
    joint_action = {agent: policies[agent](obs) for agent, obs in observations.items()}
    observations, rewards, done, info = env.step(joint_action)
    
    if any(done.values()):
        break
    
    # log everything
    print(f'rewards:      {rewards}')
    print(f'observations: {observations}')
    print(f'actions:      {joint_action}')
    print()

rewards:      defaultdict(<class 'int'>, {'speaker_0': -2.360987262366699, 'listener_0': -2.360987262366699})
observations: {'speaker_0': array([0.65, 0.15, 0.15], dtype=float32), 'listener_0': array([ 0.        , -0.5       , -0.09914988, -1.5333482 ,  0.63261783,
       -1.0574328 , -1.0721401 , -1.3541496 ,  0.        ,  0.        ,
        1.        ], dtype=float32)}
actions:      {'speaker_0': 2, 'listener_0': 3}

rewards:      defaultdict(<class 'int'>, {'speaker_0': -2.247392400772925, 'listener_0': -2.247392400772925})
observations: {'speaker_0': array([0.65, 0.15, 0.15], dtype=float32), 'listener_0': array([ 0.        , -0.375     , -0.09914988, -1.4958482 ,  0.63261783,
       -1.0199329 , -1.0721401 , -1.3166496 ,  1.        ,  0.        ,
        0.        ], dtype=float32)}
actions:      {'speaker_0': 0, 'listener_0': 0}

rewards:      defaultdict(<class 'int'>, {'speaker_0': -2.1640419577025956, 'listener_0': -2.1640419577025956})
observations: {'speaker_0': array([0.65,

### Other Environments

MPE also includes a single-agent, single-landmark version of Simple Speaker Listener, called Simple. Here, the one agent acts like the listener and has a simplified observation space containing the agent's velocity and the landmark's relative location. The MPE library contains other, more complex environments, and PettingZoo offers further environment libraries. You can explore them on the [PettingZoo website](https://www.pettingzoo.ml/envs).
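
To make the simplified observation of Simple concrete, here is a rough sketch of how such a vector could be assembled. The function `simple_observation` is hypothetical, and the ordering (velocity first, then the landmark position relative to the agent) is an assumption rather than something taken from the library source.

```python
def simple_observation(agent_velocity, agent_position, landmark_position):
    """Build a 4-element observation: velocity (2) + relative landmark position (2).

    Hypothetical helper; the element ordering is an assumption.
    """
    # landmark position expressed relative to the agent
    relative = [l - a for l, a in zip(landmark_position, agent_position)]
    return list(agent_velocity) + relative


obs = simple_observation(
    agent_velocity=[0.5, 0.0],
    agent_position=[1.0, 1.0],
    landmark_position=[1.5, 0.25],
)
print(obs)
```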

In [20]:
from pettingzoo.mpe import simple_v2


print_env_info(continuous_actions=False)
print()
print_env_info(continuous_actions=True)

discrete actions:
- agent 1: speaker_0
	- observation space: Box(-inf, inf, (3,), float32)
	- action space: Discrete(3)
- agent 2: listener_0
	- observation space: Box(-inf, inf, (11,), float32)
	- action space: Discrete(5)

continuous actions:
- agent 1: speaker_0
	- observation space: Box(-inf, inf, (3,), float32)
	- action space: Box(0.0, 1.0, (3,), float32)
- agent 2: listener_0
	- observation space: Box(-inf, inf, (11,), float32)
	- action space: Box(0.0, 1.0, (5,), float32)


### Single-Agent Reduction to Gym

The MPE Simple environment shown above is a single-agent environment. It would be beneficial to align its API with that of gym, which would allow us to use algorithms already implemented for gym. The gym API is similar to the parallel environment API, except that parallel environments accept dictionaries of actions and return dictionaries of observations, rewards, etc., keyed by agent.

Below is a wrapper for this environment that does exactly that. It overrides the `reset` method to return the single observation, and the `step` method to accept a single action and return a single tuple of step results. Note that parallel environments must be wrapped with the dedicated wrapper base class, imported below as `BaseParallelWraper`.

In [21]:
# use a different wrapper base class for parallel environments
from pettingzoo.utils.wrappers import BaseParallelWraper

class SingleAgentParallelEnvGymWrapper(BaseParallelWraper):
    """
    A wrapper for single-agent parallel environments aligning the environments'
    API with OpenAI Gym.
    """

    def reset(self):
        # run `reset` as usual.
        # returned value is a dictionary of observations with a single entry
        obs = self.env.reset()

        # return the single entry value as is.
        # no need for the key (only one agent)
        return next(iter(obs.values()))

    def step(self, action):
        # step using "joint action" of a single agent as a dictionary
        step_rets = self.env.step({self.env.agents[0]: action})

        # unpack step return values from their dictionaries
        return tuple(next(iter(ret.values())) for ret in step_rets)

    @property  # make property for gym-like access
    def action_space(self, _=None):  # ignore second argument in API
        # get action space of the single agent
        return self.env.action_space(self.env.possible_agents[0])

    @property  # make property for gym-like access
    def observation_space(self, _=None):  # ignore second argument in API
        # get observation space of the single agent
        return self.env.observation_space(self.env.possible_agents[0])
    
simple_gym_env = simple_v2.parallel_env(max_cycles=5, continuous_actions=True)
simple_gym_env = SingleAgentParallelEnvGymWrapper(simple_gym_env)
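
The unpacking idiom used by the wrapper can be tried in isolation, without PettingZoo installed. The sketch below uses a hypothetical stand-in class, `ToyParallelEnv`, that only mimics the dictionary-shaped return values of a single-agent parallel environment; it is not a real environment, and `unwrap_single` is likewise an illustrative name.

```python
class ToyParallelEnv:
    """Hypothetical single-agent stand-in with dictionary-based returns."""

    agents = ['agent_0']

    def reset(self):
        # one observation, keyed by the only agent
        return {'agent_0': [0.0, 0.0]}

    def step(self, actions):
        action = actions['agent_0']
        observations = {'agent_0': [float(action), 0.0]}
        rewards = {'agent_0': -1.0}
        dones = {'agent_0': False}
        infos = {'agent_0': {}}
        return observations, rewards, dones, infos


def unwrap_single(ret):
    """Return the only value of a one-entry dictionary."""
    return next(iter(ret.values()))


toy = ToyParallelEnv()
first_obs = unwrap_single(toy.reset())

# wrap the single action in a dictionary, then unpack each returned dictionary
step_result = tuple(unwrap_single(r) for r in toy.step({toy.agents[0]: 1}))

print(first_obs)
print(step_result)
```

Using `next(iter(d.values()))` avoids hard-coding the agent's name, at the cost of silently ignoring extra entries if the environment ever had more than one agent.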

We can now use the wrapped environment, `simple_gym_env`, as we would any other gym environment.

In [22]:
# reset the wrapped environment to get the first (single) observation
observation = simple_gym_env.reset()

# create a random policy for the single agent's action space
policy = RandomPolicy(simple_gym_env.action_space)

# step through the environment until the episode is complete
for i in range(simple_gym_env.unwrapped.max_cycles):
    # choose an action and execute
    action = policy(observation)
    observation, reward, done, info = simple_gym_env.step(action)
    
    # log everything
    print(f'step {i}')
    print(f'reward:      {reward}')
    print(f'observation: {observation}')
    print(f'action:      {action}')
    print()
    
    # if done, the episode is complete. no more actions can be taken
    if done:
        break

step 0
reward:      -1.062627617748555
observation: [0.3680436  0.32354856 0.9243896  0.45621428]
action:      [0.32831636 0.81992656 0.08383939 0.6577298  0.01063272]

step 1
reward:      -0.9718449655002888
observation: [0.39994454 0.20679925 0.8843952  0.43553433]
action:      [0.27477986 0.86535984 0.6175362  0.25367478 0.3253991 ]

step 2
reward:      -0.994246645519305
observation: [-0.08008812 -0.09282242  0.89240396  0.4448166 ]
action:      [0.8639809  0.2307866  0.99087965 0.21249507 0.7083388 ]

step 3
reward:      -1.0399584335001262
observation: [-0.3644977   0.23878156  0.92885375  0.42093843]
action:      [0.12598233 0.00312544 0.6119886  0.7112211  0.09442434]

step 4
reward:      -1.1326313435091178
observation: [-0.45510942 -0.071326    0.9743647   0.42807102]
action:      [0.69730085 0.5179444  0.88141674 0.14974637 0.6505707 ]
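
With one scalar reward per step, the episode return is a plain sum. The sketch below uses the five rewards logged above; `step_rewards` is a hypothetical list assembled by hand from that output rather than something the environment provides.

```python
# Rewards copied by hand from the five logged steps above.
step_rewards = [
    -1.062627617748555,
    -0.9718449655002888,
    -0.994246645519305,
    -1.0399584335001262,
    -1.1326313435091178,
]

# undiscounted episode return and mean reward per step
episode_return = sum(step_rewards)
mean_reward = episode_return / len(step_rewards)

print(f'episode return: {episode_return:.4f}')
print(f'mean reward:    {mean_reward:.4f}')
```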



In [23]:
simple_gym_env.render()