# Task

We will train a PPO agent to play the classic Super Mario Bros. game.

You can use the Stable-Baselines3 implementation of PPO or write your own version.

For the env, we will use gym_super_mario_bros. Read more about it [here](https://github.com/Kautenja/gym-super-mario-bros/).

Note that the stable-baselines3 implementations expect a gymnasium environment, not a gym environment. (gymnasium is the maintained successor to gym. gym is deprecated, but many environments are still written for it.)

Fortunately, gymnasium has a way to resolve that issue and convert a gym env to a gymnasium env. We do need to install a compatible version of gym though.
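
To make the difference between the two APIs concrete, here is a minimal sketch of what such a conversion has to do: the old `step()` returns four values, the new one returns five, and `reset()` gains an `info` dict. The classes `OldStyleEnv` and `OldToNewAdapter` are hypothetical and written only for illustration; this is not the code of the actual compatibility layer, and reading the truncation flag from `info["TimeLimit.truncated"]` is an assumption of the sketch.

```python
class OldStyleEnv:
    """Hypothetical old-API env: step() returns (obs, reward, done, info)."""
    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # old API: reset() returns only the observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.max_steps
        return self.t, 1.0, done, {}


class OldToNewAdapter:
    """Hypothetical adapter exposing the gymnasium-style API."""
    def __init__(self, env):
        self.env = env

    def reset(self, seed=None, options=None):
        # new API: reset() returns (obs, info)
        return self.env.reset(), {}

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        # Assumption: a time-limit cutoff is flagged in info["TimeLimit.truncated"]
        truncated = bool(info.get("TimeLimit.truncated", False))
        terminated = done and not truncated
        return obs, reward, terminated, truncated, info


adapted = OldToNewAdapter(OldStyleEnv(max_steps=2))
obs, info = adapted.reset()
first = adapted.step(0)
second = adapted.step(0)
print(first)   # 5-tuple, episode still running
print(second)  # 5-tuple, terminated is True
```

The single `done` flag is split into `terminated` (the episode really ended) and `truncated` (it was cut off, for example by a time limit).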

In [1]:
%pip install swig
%pip install stable-baselines3 gymnasium[all] gym_super_mario_bros nes_py gym==0.10.9  # might need a restart of the session.

Collecting swig
  Downloading swig-4.2.1-py2.py3-none-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.9 MB)
Installing collected packages: swig
Successfully installed swig-4.2.1
Collecting stable-baselines3
  Downloading stable_baselines3-2.3.2-py3-none-any.whl (182 kB)
Collecting gymnasium[all]
  Downloading gymnasium-0.29.1-py3-none-any.whl (953 kB)
Collecting gym_super_mario_bros
  Downloading gym_super_mario_bros-7.4.0-py3-none-any.whl (199 kB)
Collecting nes_py
  Downloading nes_py-8

In [2]:
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecFrameStack
from stable_baselines3.common.evaluation import evaluate_policy

from gymnasium.wrappers import GrayScaleObservation
import gymnasium as gym
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from nes_py.wrappers import JoypadSpace

import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

In [3]:
def frames_to_video(frames, fps=24):
    fig = plt.figure(figsize=(frames[0].shape[1] / 100, frames[0].shape[0] / 100), dpi=100)
    ax = plt.axes()
    ax.set_axis_off()

    if len(frames[0].shape) == 2:  # Grayscale image
        im = ax.imshow(frames[0], cmap='gray')
    else:  # Color image
        im = ax.imshow(frames[0])

    def init():
        # set_data() takes only the image array; the colormap was fixed by imshow()
        im.set_data(frames[0])
        return im,

    def update(frame):
        # `frame` is the index of the frame to draw
        im.set_data(frames[frame])
        return im,

    interval = 1000 / fps
    anim = FuncAnimation(fig, update, frames=len(frames), init_func=init, blit=True, interval=interval)
    plt.close()
    return HTML(anim.to_html5_video())

## Making the environment

On top of creating the gym environment, we will wrap it in a vectorized environment (provided by Stable-Baselines3).

A vectorized environment lets training step several environments as one batch, which can make training faster. We will use DummyVecEnv, which steps its environments sequentially in a single process rather than in subprocesses. For a complex environment with a higher compute time per step, we could use SubprocVecEnv instead.
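
As a rough illustration of the sequential approach, the sketch below steps a list of toy environments one after another and returns the results as a batch, resetting any environment whose episode has ended. `CounterEnv` and `SequentialVecEnv` are hypothetical names made up for this example, and the code is a simplified picture of the idea, not the Stable-Baselines3 implementation.

```python
class CounterEnv:
    """Hypothetical toy env: the episode ends after `length` steps."""
    def __init__(self, length):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.length
        return self.t, float(action), done, {}


class SequentialVecEnv:
    """Steps each env in turn, in the same process, and batches the results."""
    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]

    def reset(self):
        return [env.reset() for env in self.envs]

    def step(self, actions):
        obs_batch, rewards, dones, infos = [], [], [], []
        for env, action in zip(self.envs, actions):
            obs, reward, done, info = env.step(action)
            if done:
                obs = env.reset()  # auto-reset so the batch never stalls
            obs_batch.append(obs)
            rewards.append(reward)
            dones.append(done)
            infos.append(info)
        return obs_batch, rewards, dones, infos


vec = SequentialVecEnv([lambda: CounterEnv(2), lambda: CounterEnv(3)])
print(vec.reset())       # one observation per env
print(vec.step([1, 0]))
print(vec.step([1, 0]))  # the first env finishes here and is reset
```

With only one environment in the list, as in this notebook, the batch has size one and the wrapper mainly provides the interface the training code expects.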

Think about what wrappers you can use to make the job easier. You can also make the action-space simpler. Read more about it in the env page referenced above.
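
One wrapper worth considering is frame stacking: a single frame does not show which way Mario is moving, but the last few frames together do. The sketch below shows the bookkeeping with a fixed-length buffer. `FrameStacker` is a hypothetical helper, the string "frames" stand in for images, and padding the buffer with the first frame on reset is an assumption of this sketch (some implementations pad with zeros instead).

```python
from collections import deque


class FrameStacker:
    """Hypothetical helper that keeps the `n` most recent frames."""
    def __init__(self, n):
        self.frames = deque(maxlen=n)

    def reset(self, first_frame):
        # Assumption: pad with copies of the first frame so the stack is always full
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def step(self, new_frame):
        # Appending to a full deque drops the oldest frame automatically
        self.frames.append(new_frame)
        return list(self.frames)


stacker = FrameStacker(n=4)
print(stacker.reset("f0"))
print(stacker.step("f1"))
print(stacker.step("f2"))
```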

Use the `'SuperMarioBros-v0'` version of the environment.

In [4]:
# Create and wrap the environment
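def make_env():
    env = gym_super_mario_bros.make('SuperMarioBros-v0')
    # Restrict the controller to the small SIMPLE_MOVEMENT action set
    env = JoypadSpace(env, SIMPLE_MOVEMENT)
    # Convert the old gym env into a gymnasium env
    env = gym.make("GymV21Environment-v0", env=env, render_mode="rgb_array")
    # Drop colour but keep the channel axis so observations stay image-shaped
    env = GrayScaleObservation(env, keep_dim=True)
    env = DummyVecEnv([lambda: env])
    # Stack the last 4 frames so the agent can infer motion
    env = VecFrameStack(env, 4, channels_order="last")
    return env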

# Create the vectorized environment
env = make_env()



## Creating and training the model

In [8]:
# Import necessary libraries
import matplotlib.pyplot as plt
from stable_baselines3.common.callbacks import CallbackList, EvalCallback, StopTrainingOnRewardThreshold, BaseCallback # Import BaseCallback
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Define a custom callback to store episode rewards
class RewardCallback(BaseCallback):
    """
    Callback for saving episode rewards
    """
    def __init__(self, verbose=0):
        super().__init__(verbose)
        self.episode_rewards = []
        self.current_episode_reward = 0

    def _on_step(self) -> bool:
        # Accumulate reward for the current step
        self.current_episode_reward += self.locals.get("rewards")[0]

        # Store the episode reward when an episode ends
        if self.locals.get("dones")[0]:  # Check if the episode is done
            self.episode_rewards.append(self.current_episode_reward)
            self.current_episode_reward = 0 # Reset for the next episode

        return True

# Create the callback
reward_callback = RewardCallback()
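
Raw episode rewards tend to be noisy, so it can help to smooth them before plotting. The helper below is a minimal sketch of a simple moving average. `moving_average` is a hypothetical function written for this example, and the reward values are made up.

```python
def moving_average(values, window):
    """Mean of each run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


example_rewards = [100.0, 300.0, 200.0, 600.0]  # made-up numbers for illustration
print(moving_average(example_rewards, window=2))
```

After training, the same helper could be applied to `reward_callback.episode_rewards` and the result passed to `plt.plot`.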

In [20]:
!pip install --upgrade stable-baselines3  # Make sure you have the latest version
# Do not `import gym` here: it would rebind the name `gym`, which make_env() expects to be gymnasium
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback, EvalCallback  # Import necessary callbacks



In [33]:
# Create the PPO model
model = PPO('CnnPolicy', env, verbose=1, tensorboard_log="./mario_tensorboard/")

# Train the model
model.learn(total_timesteps=500_000, callback=reward_callback)  # You can adjust the number of timesteps as needed

# Save the model
model.save("ppo_mario")

Using cpu device
Wrapping the env in a VecTransposeImage.
Logging to ./mario_tensorboard/PPO_6


-----------------------------
| time/              |      |
|    fps             | 44   |
|    iterations      | 1    |
|    time_elapsed    | 46   |
|    total_timesteps | 2048 |
-----------------------------
-----------------------------------------
| time/                   |             |
|    fps                  | 6           |
|    iterations           | 2           |
|    time_elapsed         | 620         |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.021459041 |
|    clip_fraction        | 0.279       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.92       |
|    explained_variance   | 0.00352     |
|    learning_rate        | 0.0003      |
|    loss                 | 1.56        |
|    n_updates            | 10          |
|    policy_gradient_loss | 0.0129      |
|    value_loss           | 22.8        |
-----------------------------------------
----------------------------------

KeyboardInterrupt: 
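
The `clip_range` and `clip_fraction` entries in the log refer to PPO's clipped surrogate objective, which limits how far one update can move the policy away from the one that collected the data. The sketch below computes the per-sample objective and the clip fraction in plain Python. The function names and the ratio values are made up for illustration, and this is not the library's code. The objective is maximised, so an implementation that minimises a loss would use its negative.

```python
def clipped_surrogate(ratio, advantage, clip_range=0.2):
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


def clip_fraction(ratios, clip_range=0.2):
    """Share of samples whose ratio is further than clip_range from 1."""
    clipped = [abs(r - 1.0) > clip_range for r in ratios]
    return sum(clipped) / len(ratios)


ratios = [0.7, 0.95, 1.1, 1.5]  # made-up probability ratios new_policy / old_policy
print(clipped_surrogate(1.5, advantage=2.0))   # gain is capped by the clipped ratio
print(clipped_surrogate(1.5, advantage=-2.0))  # penalty is not capped
print(clip_fraction(ratios))
```

A `clip_fraction` of 0.279, as in the log above, would mean that roughly 28% of the samples in that update had a ratio outside the range 0.8 to 1.2.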

In [37]:
# Evaluate the policy
mean_reward, std_reward = evaluate_policy(model, env, n_eval_episodes=1)
print(f"Mean reward: {mean_reward:.2f}")
print(f"Standard deviation of reward: {std_reward:.2f}")

Mean reward: -1086.00
Standard deviation of reward: 0.00
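
The standard deviation of 0.00 is a consequence of `n_eval_episodes=1`: a single value has no spread. The short sketch below shows this with the standard library. The second list mixes the reported reward with two made-up values purely for illustration.

```python
from statistics import mean, pstdev

single_episode = [-1086.0]                     # the one evaluation episode reported above
several_episodes = [-1086.0, -900.0, -1200.0]  # the last two values are made up

print(mean(single_episode), pstdev(single_episode))
print(mean(several_episodes), pstdev(several_episodes))
```

Evaluating over more episodes, for example `n_eval_episodes=10`, would give a more informative mean and spread.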


## Visualizing the results

In [None]:
t_env = make_env()
state = t_env.reset()
frames = []

while True:
    action, _ = model.predict(state)
    state_next, r, done, info = t_env.step(action)
    state = state_next.copy()
    frames.append(t_env.render().copy())
    if done:
        print("done")
        break
    if len(frames) > 5000:  # to limit the video length in case mario is stuck on untrained models. can be removed
        print("limit")
        break

t_env.close()

In [8]:
frames_to_video(frames, fps=24)