In [1]:
!pip install "pettingzoo[butterfly]>=1.24.0"
!pip install "supersuit>=3.9.0"
!pip install "stable-baselines3>=2.0.0"
!pip install imageio



### PettingZoo

Gym is the standard API for single-agent reinforcement learning environments, and PettingZoo is its equivalent for multi-agent environments.

We will set up a PPO training run for a multi-agent environment, Knights Archers Zombies.

In [1]:
from __future__ import annotations

import glob
import os
import time

import supersuit as ss
from stable_baselines3 import PPO
from stable_baselines3.ppo import CnnPolicy, MlpPolicy
from stable_baselines3.common.vec_env import VecVideoRecorder

from pettingzoo.butterfly import knights_archers_zombies_v10



### Env kwargs
These keyword arguments are passed to the PettingZoo environment constructor and control how the environment is initialised.

- `max_cycles`: End each episode after at most 100 cycles.
- `max_zombies`: Allow at most 4 zombies to exist at any point in time.
- `vector_state`: If `True`, represent the state as vectors; if `False`, use images instead.

In [2]:
env_kwargs = dict(max_cycles=100, max_zombies=4, vector_state=True)
env_fn = knights_archers_zombies_v10

### Training parameters
steps = 81_920
seed = 0


### Today's environment: Knights Archers Zombies
<img src="https://pettingzoo.farama.org/_images/butterfly_knights_archers_zombies.gif"/>
The environment has two types of agents, which run around killing zombies:
- Archers: shoot arrows at zombies from a distance
- Knights: move in to attack zombies at melee range

- Actions: each agent can move forward or backward, rotate, and attack
- Reward: an agent earns reward for each zombie it kills
- Setup: the environment can be created with either the AEC or the Parallel API

#### Types of Environment
AEC (Agent Environment Cycle): agents act one at a time, each taking its turn before passing on to the next agent.

Parallel: all agents act simultaneously at every step.
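
The difference between the two APIs is easiest to see in the order of calls. The following is a toy sketch in plain Python, not PettingZoo itself: `aec_cycle`, `parallel_step` and `count_seen` are hypothetical names made up for illustration, and the "action" is simply the number of earlier moves an agent could observe.

```python
# Toy illustration only: these helpers are hypothetical, not part of PettingZoo.
AGENTS = ["archer_0", "archer_1", "knight_0", "knight_1"]


def aec_cycle(agents, policy):
    """AEC style: agents act one at a time, each seeing the moves made before it."""
    log = []
    for agent in agents:
        action = policy(agent, tuple(log))  # earlier moves in this cycle are visible
        log.append((agent, action))         # the state changes after every single action
    return log


def parallel_step(agents, policy):
    """Parallel style: every agent chooses from the same snapshot, then all act at once."""
    snapshot = ()  # nobody sees anyone else's move for this step
    return {agent: policy(agent, snapshot) for agent in agents}


def count_seen(agent, seen):
    """Dummy policy: the 'action' is how many earlier moves the agent could observe."""
    return len(seen)


print(aec_cycle(AGENTS, count_seen))
print(parallel_step(AGENTS, count_seen))
```

In the AEC run each agent sees one more move than the agent before it, while in the parallel run every agent sees none. In PettingZoo itself, the AEC loop is driven by `env.agent_iter()`, `env.last()` and `env.step(action)`, as in the evaluation function later in this notebook, whereas a parallel environment's `step` takes a dictionary of actions keyed by agent name.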



In [3]:
### Initialise environment

env = env_fn.parallel_env(**env_kwargs)

### Introducing Supersuit

SuperSuit is a collection of small wrappers that preprocess reinforcement learning environments. It supports both Gymnasium and PettingZoo.

To make the trained policy more robust, we add two wrappers that perturb how actions are applied and how often the agent observes the environment.

`sticky_actions_v0` gives the previous action a probability, set by `repeat_action_probability`, of "sticking": the environment occasionally repeats the old action instead of applying the newly requested one.

Moreover, `frame_skip_v0` repeats each action for several frames. The observations in between are skipped and their rewards are accumulated, so the model has to act on less frequent feedback.
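
To make the two perturbations concrete, here is a minimal sketch in plain Python. It is not SuperSuit's implementation: `sticky_actions` and `frame_skip` are hypothetical helper names, and the behaviour (old actions repeat with a fixed probability, skipped rewards are summed) reflects our reading of the wrapper descriptions above.

```python
import random


def sticky_actions(requested, repeat_action_probability, rng):
    """Return the actions actually executed when old actions may 'stick'."""
    executed = []
    previous = None
    for action in requested:
        # With the given probability, ignore the new request and repeat the old action.
        if previous is not None and rng.random() < repeat_action_probability:
            action = previous
        executed.append(action)
        previous = action
    return executed


def frame_skip(frame_rewards, num_frames):
    """Repeat one action for num_frames frames and sum the rewards seen along the way."""
    return sum(frame_rewards[:num_frames])


rng = random.Random(0)  # seeded so the sketch is reproducible
requested = ["up", "down", "left", "right", "attack", "up"]
print(sticky_actions(requested, 0.3, rng))        # some requests are replaced by repeats
print(frame_skip([0.0, 1.0, 0.0], num_frames=3))  # one observation, summed reward
```

With a probability of 0 every requested action goes through unchanged, and with a probability of 1 the first action is repeated forever, so 0.3 sits much closer to the unperturbed environment while still breaking exact action sequences.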

In [4]:
# Add the black death wrapper so the number of agents stays constant.
# MarkovVectorEnv does not support environments with a varying number of
# active agents unless black_death is set to True.
env = ss.black_death_v3(env)

# Robustness wrappers (applied to both vector and image observations)
repeat_action_prob = 0.3  # Probability of repeating the previous action
frameskip = (1, 3)  # Repeat each action for 1 to 3 frames, chosen at random

env = ss.sticky_actions_v0(env, repeat_action_probability=repeat_action_prob)
env = ss.frame_skip_v0(env, frameskip)

# Pre-process using SuperSuit (only when vector_state is False, i.e. observations are images)
visual_observation = not env.unwrapped.vector_state
if visual_observation:
    # If the observation space is visual, reduce the color channels, resize from 512px to 84px, and apply frame stacking
    env = ss.color_reduction_v0(env, mode="B")
    env = ss.resize_v1(env, x_size=84, y_size=84)
    env = ss.frame_stack_v1(env, 3)


### Setting up the policy and training

#### Housekeeping code

`ss.pettingzoo_env_to_vec_env_v1(env)` converts the PettingZoo environment into a Gym-style VecEnv. It treats each agent in the multi-agent environment as if it were in its own sub-environment, so the result is a single VecEnv in which each index corresponds to one PettingZoo agent. This bridges the gap between what Stable Baselines 3 expects and PettingZoo's API.

`ss.concat_vec_envs_v1(env, num_vec_envs=8, num_cpus=1, base_class="stable_baselines3")` takes an existing VecEnv and concatenates 8 copies of it into a larger VecEnv that collects 8 rollouts side by side.
- `num_cpus=1` means everything runs in a single process.
- `base_class="stable_baselines3"` makes the result use SB3's `VecEnv` base class so that SB3 algorithms can work with it.
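
A quick back-of-the-envelope calculation shows what this vectorisation means for the training run. The sketch below assumes PPO's default rollout length of `n_steps=2048` per sub-environment, which this notebook does not override, and the four default agents of Knights Archers Zombies; the variable names are only for illustration.

```python
import math

num_agents = 4            # archer_0, archer_1, knight_0, knight_1
num_vec_envs = 8          # copies requested from concat_vec_envs_v1
n_steps = 2048            # assumed: PPO's default rollout length per sub-environment
requested_steps = 81_920  # the `steps` value set earlier in this notebook

sub_envs = num_agents * num_vec_envs                   # one VecEnv slot per agent per copy
per_rollout = sub_envs * n_steps                       # transitions gathered before each update
iterations = math.ceil(requested_steps / per_rollout)  # PPO only stops after a full rollout
total = iterations * per_rollout

print(sub_envs, per_rollout, iterations, total)  # 32 65536 2 131072
```

Because each rollout already collects 65,536 transitions, a request for 81,920 timesteps is rounded up to two full rollouts, or 131,072 timesteps in total.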




In [5]:
env.reset(seed=seed)

print(f"Starting training on {str(env.metadata['name'])}.")

env = ss.pettingzoo_env_to_vec_env_v1(env)
env = ss.concat_vec_envs_v1(env, 8, num_cpus=1, base_class="stable_baselines3")

# Use a CNN policy if the observation space is visual
model = PPO(
    CnnPolicy if visual_observation else MlpPolicy,
    env,
    verbose=1,  # SB3 documents verbosity levels 0 to 2; 1 prints the training tables
    batch_size=256,
)

model.learn(total_timesteps=steps, progress_bar = True)

model.save(f"{env.unwrapped.metadata.get('name')}_{time.strftime('%Y%m%d-%H%M%S')}")

print("Model has been saved.")

print(f"Finished training on {str(env.unwrapped.metadata['name'])}.")

env.close()

Starting training on knights_archers_zombies_v10.
Using cpu device



------------------------------
| time/              |       |
|    fps             | 362   |
|    iterations      | 1     |
|    time_elapsed    | 180   |
|    total_timesteps | 65536 |
------------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 337         |
|    iterations           | 2           |
|    time_elapsed         | 387         |
|    total_timesteps      | 131072      |
| train/                  |             |
|    approx_kl            | 0.009844935 |
|    clip_fraction        | 0.0813      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.78       |
|    explained_variance   | -1.19       |
|    learning_rate        | 0.0003      |
|    loss                 | 0.0178      |
|    n_updates            | 10          |
|    policy_gradient_loss | -0.00612    |
|    value_loss           | 0.0432      |
-----------------------------------------


Model has been saved.
Finished training on knights_archers_zombies_v10.


### Evaluation of Model

We load our trained model and experiment with it here, using imageio to save a video of each recorded run. Notice that we apply the same robustness and visual preprocessing wrappers as in training, but leave out Black Death. That is because we are no longer running the environment through SB3 and instead step the PettingZoo environment directly.

We also added `cvar`, or conditional value at risk (CVaR), which is the mean of the worst `alpha` fraction of episode returns, as well as the worst return, to help evaluate how our models behave in their bad episodes.
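
As a quick worked example of the metric, the sketch below computes CVaR with the standard library only. `cvar_sketch` is a hypothetical stand-in for the NumPy-based `cvar` defined in the next cell, and the episode returns are made up for illustration.

```python
import math


def cvar_sketch(returns, alpha):
    """Mean of the worst ceil(alpha * n) returns (always at least one)."""
    ordered = sorted(returns)                    # worst episodes first
    k = max(1, math.ceil(alpha * len(ordered)))  # size of the lower tail
    return sum(ordered[:k]) / k


episode_returns = [5, 0, 2, 9, 1, 7, 3, 8, 4, 6]  # made-up returns from 10 episodes

print(cvar_sketch(episode_returns, alpha=0.1))      # 0.0 (the single worst episode)
print(cvar_sketch(episode_returns, alpha=0.5))      # 2.0 (mean of 0, 1, 2, 3, 4)
print(sum(episode_returns) / len(episode_returns))  # 4.5 (plain mean, for comparison)
```

The plain mean hides how bad the worst episodes are, whereas CVaR at a small `alpha` reports only that lower tail, which is why it is a useful companion to the mean when judging robustness.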

In [6]:
import numpy as np

def cvar(returns, alpha=0.1):
    """CVaR_α of a 1‑D list/array (alpha in (0,1])."""
    if len(returns) == 0:
        return np.nan
    k = max(1, int(np.ceil(alpha * len(returns))))  # Number of episodes in the worst alpha fraction
    return np.mean(np.sort(returns)[:k])  # Mean of the k worst returns

In [9]:
import imageio

def eval(env_fn, num_games: int = 100, render_mode: str | None = None, repeat_action_prob = 0.3, frameskip = 2, cvar_alpha: float = 0.1, **env_kwargs):
    # Evaluate the trained policy, with a random policy controlling the first agent
    video_folder = "./logs"
    os.makedirs(video_folder, exist_ok=True)
    env = env_fn.env(render_mode=render_mode, **env_kwargs)
    # Test for Robustness in evals
    env = ss.sticky_actions_v0(env, repeat_action_probability=repeat_action_prob)
    env = ss.frame_skip_v0(env, frameskip)

    # Pre-process using SuperSuit
    visual_observation = not env.unwrapped.vector_state
    if visual_observation:
        # If the observation space is visual, reduce the color channels, resize from 512px to 84px, and apply frame stacking
        env = ss.color_reduction_v0(env, mode="B")
        env = ss.resize_v1(env, x_size=84, y_size=84)
        env = ss.frame_stack_v1(env, 3)
    print(
        f"\nStarting evaluation on {str(env.metadata['name'])} (num_games={num_games}, render_mode={render_mode})"
    )

    try:
        latest_policy = max(
            glob.glob(f"{env.metadata['name']}*.zip"), key=os.path.getctime
        )
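    except ValueError:
        # max() raises ValueError when glob finds no saved policy.
        # Raise instead of calling exit(), which would shut down the notebook kernel.
        raise FileNotFoundError("Policy not found.") from None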



    model = PPO.load(latest_policy)

    ep_returns = {agent: [] for agent in env.possible_agents}
    # Note: we evaluate here using an AEC environment, to allow for easy A/B testing against random policies
    # For example, we can see here that using a random agent for archer_0 results in fewer points than the trained agent
    for i in range(num_games):
        env.reset(seed=i)
        rewards_this_ep = {a: 0.0 for a in env.possible_agents}
        env.action_space(env.possible_agents[0]).seed(i)
        frames = []

        for agent in env.agent_iter():
            obs, reward, termination, truncation, info = env.last()
            if render_mode == "rgb_array":
                frames.append(env.render())

            for a in env.agents:
                rewards_this_ep[a] += env.rewards[a]

            if termination or truncation:
                break
            else:
                if agent == env.possible_agents[0]:
                    act = env.action_space(agent).sample()
                else:
                    act = model.predict(obs, deterministic=True)[0]
            env.step(act)

        # store per‑episode returns
        for a in env.possible_agents:
            ep_returns[a].append(rewards_this_ep[a])

        if render_mode == "rgb_array":
            path = f"{video_folder}/PPO_game_{i}.mp4"
            imageio.mimsave(path, frames, fps = 30)
            print(f"Saved to {path}")
    env.close()

    # ---- aggregate metrics ------------------------------------------------
    mean_return  = {a: np.mean(ep_returns[a]) for a in env.possible_agents}
    worst_return = {a: np.min( ep_returns[a]) for a in env.possible_agents}
    cvar_return  = {a: cvar(ep_returns[a], alpha=cvar_alpha) for a in env.possible_agents}
    var_return   = {a: np.var(ep_returns[a]) for a in env.possible_agents}

    # Compute overall metrics
    mean_overall  = np.mean(list(mean_return.values()))
    worst_overall = np.min([min(r) for r in ep_returns.values()])
    cvar_overall  = np.mean(list(cvar_return.values()))
    var_overall   = np.mean(list(var_return.values()))

    # Print summary
    print(f"\n==== Evaluation Results (α={cvar_alpha:.2f}) ====")
    print(
        f"Overall Rewards → "
        f"mean: {mean_overall:.2f} | "
        f"worst: {worst_overall:.2f} | "
        f"CVaR: {cvar_overall:.2f} | "
        f"Variance: {var_overall:.2f}"
    )

    print("\nPer‑agent metrics:")
    for a in env.possible_agents:
        print(
            f"  {a:>10}: "
            f"mean={mean_return[a]:.2f}, "
            f"worst={worst_return[a]:.2f}, "
            f"CVaR={cvar_return[a]:.2f}, "
            f"Variance={var_return[a]:.2f}"
        )

    return {
        "mean": mean_overall,
        "worst": worst_overall,
        "cvar": cvar_overall,
        "variance": var_overall,
        "per_agent": {
            a: {
                "mean": mean_return[a],
                "worst": worst_return[a],
                "cvar": cvar_return[a],
                "variance": var_return[a],
            }
            for a in env.possible_agents
        },
    }

In [10]:
env_fn = knights_archers_zombies_v10

# Set vector_state to False in order to use visual observations (significantly longer training time)
env_kwargs = dict(max_cycles=100, max_zombies=4, vector_state=True)


# Evaluate 10 games (takes ~10 seconds on a laptop CPU)
eval(env_fn, num_games=10, render_mode=None, **env_kwargs)

# Record 2 games as videos under ./logs (takes ~10 seconds on a laptop CPU)
eval(env_fn, num_games=2, render_mode="rgb_array", **env_kwargs)


Starting evaluation on knights_archers_zombies_v10 (num_games=10, render_mode=None)

==== Evaluation Results (α=0.10) ====
Overall Rewards → mean: 0.03 | worst: 0.00 | CVaR: 0.00 | Variance: 0.02

Per‑agent metrics:
    archer_0: mean=0.10, worst=0.00, CVaR=0.00, Variance=0.09
    archer_1: mean=0.00, worst=0.00, CVaR=0.00, Variance=0.00
    knight_0: mean=0.00, worst=0.00, CVaR=0.00, Variance=0.00
    knight_1: mean=0.00, worst=0.00, CVaR=0.00, Variance=0.00

Starting evaluation on knights_archers_zombies_v10 (num_games=2, render_mode=rgb_array)
Saved to ./logs/PPO_game_0.mp4
Saved to ./logs/PPO_game_1.mp4

==== Evaluation Results (α=0.10) ====
Overall Rewards → mean: 0.00 | worst: 0.00 | CVaR: 0.00 | Variance: 0.00

Per‑agent metrics:
    archer_0: mean=0.00, worst=0.00, CVaR=0.00, Variance=0.00
    archer_1: mean=0.00, worst=0.00, CVaR=0.00, Variance=0.00
    knight_0: mean=0.00, worst=0.00, CVaR=0.00, Variance=0.00
    knight_1: mean=0.00, worst=0.00, CVaR=0.00, Variance=0.00


{'mean': np.float64(0.0),
 'worst': np.float64(0.0),
 'cvar': np.float64(0.0),
 'variance': np.float64(0.0),
 'per_agent': {'archer_0': {'mean': np.float64(0.0),
   'worst': np.float64(0.0),
   'cvar': np.float64(0.0),
   'variance': np.float64(0.0)},
  'archer_1': {'mean': np.float64(0.0),
   'worst': np.float64(0.0),
   'cvar': np.float64(0.0),
   'variance': np.float64(0.0)},
  'knight_0': {'mean': np.float64(0.0),
   'worst': np.float64(0.0),
   'cvar': np.float64(0.0),
   'variance': np.float64(0.0)},
  'knight_1': {'mean': np.float64(0.0),
   'worst': np.float64(0.0),
   'cvar': np.float64(0.0),
   'variance': np.float64(0.0)}}}

In [11]:
### Visualise one of the games
import IPython.display as ipd

ipd.Video("logs/PPO_game_0.mp4", embed = True)