In [30]:
!pip install pettingzoo[classic]==1.23.1
!pip install stable_baselines3==2.0.0
!pip install supersuit==3.9.0
!pip install sb3-contrib==2.0.0
# Restart the runtime after installing so that these package versions are picked up.


Collecting sb3-contrib==2.0.0
  Using cached sb3_contrib-2.0.0-py3-none-any.whl (80 kB)
Installing collected packages: sb3-contrib
Successfully installed sb3-contrib-2.0.0


In [31]:
import os
os.environ['SDL_VIDEODRIVER']='dummy'
import pygame
pygame.display.set_mode((640,480))

<Surface(640x480x32 SW)>

If you recall from the previous lesson (Deep Q-learning), we used a DQN to find the optimal path through the maze, and it performed very poorly. That has a lot to do with the fact that a maze, unlike most of the RL environments you have seen so far, has constraints: in certain states, some actions cannot be performed.

[Maskable PPO](https://arxiv.org/abs/2006.14171) is one of the algorithms that tackle this type of constrained RL problem. It uses an action mask so that only valid actions are selected during training.
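
To make the idea concrete, here is a minimal sketch of what an action mask does to a policy's output, written in plain Python rather than with SB3's internals. The function name `masked_softmax` and the example logits are illustrative assumptions, not part of sb3-contrib. The technique shown (replacing the logits of invalid actions with negative infinity before the softmax) follows the general approach of invalid action masking, but the library's actual implementation may differ in detail.

```python
import math


def masked_softmax(logits, action_mask):
    """Return action probabilities with invalid actions forced to zero.

    logits: list of floats, one unnormalized score per action.
    action_mask: list of 0/1 flags, where 1 means the action is legal.
    """
    if not any(action_mask):
        raise ValueError("At least one action must be valid.")
    # Invalid actions get a logit of -inf, so exp() maps them to exactly 0.
    masked = [x if m else -math.inf for x, m in zip(logits, action_mask)]
    # Subtract the largest valid logit for numerical stability.
    peak = max(masked)
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for the 7 Connect Four columns; columns 0 and 3 are full.
logits = [2.0, 0.5, 0.5, 3.0, 0.5, 0.5, 0.5]
action_mask = [0, 1, 1, 0, 1, 1, 1]
probs = masked_softmax(logits, action_mask)
print(probs)
```

Even though the two full columns have the highest raw scores, they receive probability zero, so the policy can never sample them. The remaining probability mass is renormalized over the legal columns.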

Below is the code for training agents to play [Connect Four](https://pettingzoo.farama.org/environments/classic/connect_four/) using [Maskable PPO](https://sb3-contrib.readthedocs.io/en/master/modules/ppo_mask.html) from SB3. A single model is trained to play as both players. Feel free to read through the code and experiment with it (train and evaluate)!

In [37]:
"""Uses Stable-Baselines3 to train agents in the Connect Four environment using invalid action masking.

For information about invalid action masking in PettingZoo, see https://pettingzoo.farama.org/api/aec/#action-masking
For more information about invalid action masking in SB3, see https://sb3-contrib.readthedocs.io/en/master/modules/ppo_mask.html

Author: Elliot (https://github.com/elliottower)
"""
import glob
import os
import time

from sb3_contrib import MaskablePPO
from sb3_contrib.common.maskable.policies import MaskableActorCriticPolicy
from sb3_contrib.common.wrappers import ActionMasker

import pettingzoo.utils
from pettingzoo.classic import connect_four_v3


class SB3ActionMaskWrapper(pettingzoo.utils.BaseWrapper):
    """Wrapper to allow PettingZoo environments to be used with SB3 illegal action masking."""

    def reset(self, seed=None, options=None):
        """Gymnasium-like reset function which assigns obs/action spaces to be the same for each agent.

        This is required as SB3 is designed for single-agent RL and doesn't expect obs/action spaces to be functions
        """
        super().reset(seed, options)

        # Strip the action mask out from the observation space
        self.observation_space = super().observation_space(self.possible_agents[0])[
            "observation"
        ]
        self.action_space = super().action_space(self.possible_agents[0])

        # Return initial observation, info (PettingZoo AEC envs do not by default)
        return self.observe(self.agent_selection), {}

    def step(self, action):
        """Gymnasium-like step function, returning observation, reward, termination, truncation, info."""
        super().step(action)
        return super().last()

    def observe(self, agent):
        """Return only raw observation, removing action mask."""
        return super().observe(agent)["observation"]

    def action_mask(self):
        """Separate function used in order to access the action mask."""
        return super().observe(self.agent_selection)["action_mask"]


def mask_fn(env):
    # Do whatever you'd like in this function to return the action mask
    # for the current env. In this example, we assume the env has a
    # helpful method we can rely on.
    return env.action_mask()


def train_action_mask(env_fn, steps=10_000, seed=0, **env_kwargs):
    """Train a single model to play as each agent in a zero-sum game environment using invalid action masking."""
    env = env_fn.env(**env_kwargs)

    print(f"Starting training on {str(env.metadata['name'])}.")

    # Custom wrapper to convert PettingZoo envs to work with SB3 action masking
    env = SB3ActionMaskWrapper(env)

    env.reset(seed=seed)  # Must call reset() in order to re-define the spaces

    env = ActionMasker(env, mask_fn)  # Wrap to enable masking (SB3 function)
    # MaskablePPO behaves the same as SB3's PPO unless the env is wrapped
    # with ActionMasker. If the wrapper is detected, the masks are automatically
    # retrieved and used when learning. Note that MaskablePPO does not accept
    # a new action_mask_fn kwarg, as it did in an earlier draft.
    model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1)
    model.set_random_seed(seed)
    model.learn(total_timesteps=steps)

    model.save(f"{env.unwrapped.metadata.get('name')}_{time.strftime('%Y%m%d-%H%M%S')}")

    print("Model has been saved.")

    print(f"Finished training on {str(env.unwrapped.metadata['name'])}.\n")

    env.close()


def eval_action_mask(env_fn, num_games=100, render_mode=None, **env_kwargs):
    # Evaluate a trained agent vs a random agent
    env = env_fn.env(render_mode=render_mode, **env_kwargs)

    print(
        f"Starting evaluation vs a random agent. Trained agent will play as {env.possible_agents[1]}."
    )

    try:
        latest_policy = max(
            glob.glob(f"{env.metadata['name']}*.zip"), key=os.path.getctime
        )
    except ValueError:
        print("Policy not found.")
        exit(0)

    model = MaskablePPO.load(latest_policy)

    scores = {agent: 0 for agent in env.possible_agents}
    total_rewards = {agent: 0 for agent in env.possible_agents}
    round_rewards = []

    for i in range(num_games):
        env.reset(seed=i)
        env.action_space(env.possible_agents[0]).seed(i)

        for agent in env.agent_iter():
            obs, reward, termination, truncation, info = env.last()

            # Separate observation and action mask
            observation, action_mask = obs.values()

            if termination or truncation:
                # If there is a winner, keep track, otherwise don't change the scores (tie)
                if (
                    env.rewards[env.possible_agents[0]]
                    != env.rewards[env.possible_agents[1]]
                ):
                    winner = max(env.rewards, key=env.rewards.get)
                    scores[winner] += env.rewards[
                        winner
                    ]  # only tracks the largest reward (winner of game)
                # Also track negative and positive rewards (penalizes illegal moves)
                for a in env.possible_agents:
                    total_rewards[a] += env.rewards[a]
                # List of rewards by round, for reference
                round_rewards.append(env.rewards)
                break
            else:
                if agent == env.possible_agents[0]:
                    act = env.action_space(agent).sample(action_mask)
                else:
                    # Note: PettingZoo expects integer actions, so cast the prediction to int
                    act = int(
                        model.predict(
                            observation, action_masks=action_mask, deterministic=True
                        )[0]
                    )
            env.step(act)
    env.close()

    # Avoid dividing by zero
    if sum(scores.values()) == 0:
        winrate = 0
    else:
        winrate = scores[env.possible_agents[1]] / sum(scores.values())
    print("Rewards by round: ", round_rewards)
    print("Total rewards (incl. negative rewards): ", total_rewards)
    print("Winrate: ", winrate)
    print("Final scores: ", scores)
    return round_rewards, total_rewards, winrate, scores


if __name__ == "__main__":
    env_fn = connect_four_v3

    env_kwargs = {}

    # Evaluation/training hyperparameter notes:
    # 10k steps: Winrate:  0.76, loss order of 1e-03
    # 20k steps: Winrate:  0.86, loss order of 1e-04
    # 40k steps: Winrate:  0.86, loss order of 7e-06

    # Train a model against itself. steps=1000 keeps this demo short;
    # use 10k or more steps to reach the winrates noted above.
    train_action_mask(env_fn, steps=1000, seed=0, **env_kwargs)

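    # Evaluate against a random agent. Only 1 game is played here, so the winrate is 0 or 1;
    # use num_games=100 for a meaningful estimate (~80% after 10k+ training steps).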
    eval_action_mask(env_fn, num_games=1, render_mode=None, **env_kwargs)

    # Watch two games vs a random agent
    eval_action_mask(env_fn, num_games=2, render_mode="human", **env_kwargs)

Starting training on connect_four_v3.
Using cuda device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 21.7     |
|    ep_rew_mean     | 1        |
| time/              |          |
|    fps             | 447      |
|    iterations      | 1        |
|    time_elapsed    | 4        |
|    total_timesteps | 2048     |
---------------------------------
Model has been saved.
Finished training on connect_four_v3.

Starting evaluation vs a random agent. Trained agent will play as player_1.
Rewards by round:  [{'player_0': 1, 'player_1': -1}]
Total rewards (incl. negative rewards):  {'player_0': 1, 'player_1': -1}
Winrate:  0.0
Final scores:  {'player_0': 1, 'player_1': 0}
Starting evaluation vs a random agent. Trained agent will play as player_1.
Rewards by round:  [{'player_0': 1, 'player_1': -1}, {'player_0': -1, 'player_1': 1}]
Total rewards (incl. negative rewards):  {
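
The winrate reported by `eval_action_mask` counts only decisive games. As a rough sketch of that bookkeeping, the snippet below recomputes the scores and winrate from a list of per-round rewards. The helper name `summarize_rounds` is hypothetical, and the sample data is copied from the two-game `Rewards by round` line above.

```python
def summarize_rounds(round_rewards, trained_agent="player_1"):
    """Recompute scores and winrate from per-round reward dicts.

    Mirrors the logic in eval_action_mask: a round counts toward the
    scores only if the two rewards differ (the game was not a tie).
    """
    scores = {}
    for rewards in round_rewards:
        for agent in rewards:
            scores.setdefault(agent, 0)
        if len(set(rewards.values())) > 1:  # decisive game
            winner = max(rewards, key=rewards.get)
            scores[winner] += rewards[winner]
    decided = sum(scores.values())
    # Avoid dividing by zero when every game was a tie.
    winrate = scores.get(trained_agent, 0) / decided if decided else 0
    return scores, winrate


rounds = [{"player_0": 1, "player_1": -1}, {"player_0": -1, "player_1": 1}]
scores, winrate = summarize_rounds(rounds)
print(scores, winrate)  # {'player_0': 1, 'player_1': 1} 0.5
```

With one win each, the sketch gives a winrate of 0.5 for `player_1`. A sample this small says very little about the policy, which is why the evaluation is more informative with many games.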

### Mentor-Guided Challenge

Since this is the challenge for the last lesson, we are going to try something different. Look at the [Atari MARL environments](https://pettingzoo.farama.org/environments/atari/) and the [SB3](https://stable-baselines3.readthedocs.io/en/master/guide/algos.html) algorithms. Pick one environment and one algorithm of your choice and train an RL agent. Have fun!

**Note**: For some environments, you may need to install extra PettingZoo dependencies. For example, for the *Atari* environments, the install command looks something like
> !pip install pettingzoo[atari]==1.23.1

In [None]:
# TODO