# 1. Install and Import Dependencies

In [None]:
!pip install nes-py gym-super-mario-bros
!pip install "stable-baselines3[extra]"

# Only works on NVIDIA GPUs with CUDA installed
!conda install pytorch torchvision torchaudio cudatoolkit=11.3 -c pytorch

In [None]:
# For the Environment
import gym_super_mario_bros
from nes_py.wrappers import JoypadSpace
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT
from gym.wrappers import GrayScaleObservation
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.vec_env import VecFrameStack, DummyVecEnv, VecTransposeImage

# For the Learning Model
from stable_baselines3.common.callbacks import CheckpointCallback, EvalCallback, CallbackList
from stable_baselines3 import PPO

# 2. Create and Preprocess Environments

In [None]:
env_name = 'SuperMarioBros-v3'

def create_and_preprocess_env(env_name):
    env = gym_super_mario_bros.make(env_name)
    env = JoypadSpace(env, SIMPLE_MOVEMENT)
    env = GrayScaleObservation(env, keep_dim=True)
    env = Monitor(env)
    env = DummyVecEnv([lambda: env])
    env = VecFrameStack(env, 4, channels_order='last')
    env = VecTransposeImage(env)
    return env
    
train_env = create_and_preprocess_env(env_name)
eval_env = create_and_preprocess_env(env_name)

Note: The function is called twice, so there are now two environments: one for training and one for evaluation. This is because most learning models use exploration noise during training, and evaluating in a separate environment keeps the evaluation episodes from interfering with the training episodes.
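
Note: The sketch below traces how the observation shape changes through the wrappers in `create_and_preprocess_env`. It only manipulates shape tuples and does not touch the real environment. The raw frame size of 240 x 256 RGB is an assumption based on what nes-py exposes, and the helper function names are made up for illustration.

```python
# Trace the observation shape through the preprocessing wrappers.
# Assumption: the raw NES frame is 240 x 256 with 3 colour channels.

def grayscale(shape):
    """GrayScaleObservation(keep_dim=True): collapse RGB into one channel."""
    height, width, _ = shape
    return (height, width, 1)

def frame_stack(shape, n_stack):
    """VecFrameStack(channels_order='last'): stack frames along the last axis."""
    height, width, channels = shape
    return (height, width, channels * n_stack)

def transpose_image(shape):
    """VecTransposeImage: move channels first (HWC -> CHW) for the CNN policy."""
    height, width, channels = shape
    return (channels, height, width)

raw_shape = (240, 256, 3)
gray_shape = grayscale(raw_shape)
stacked_shape = frame_stack(gray_shape, n_stack=4)
final_shape = transpose_image(stacked_shape)

print(raw_shape, gray_shape, stacked_shape, final_shape)
```

Because the environment is vectorized, the observations returned by `reset()` and `step()` carry one extra leading dimension for the number of environments (1 here).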

# 3. Create and Train the Optimized Agent

Note: I optimized the original MarioAI agent through hyperparameter tuning in 'MarioAI Experiment'. There, I trained four agents with different hyperparameter values. I chose the values with help from <a href="https://stable-baselines3.readthedocs.io/en/master/guide/rl_zoo.html"><u>Stable Baselines3 Zoo</u></a>, which features auto-tuned hyperparameters for various types of learning models on popular environments. It does not include auto-tuned hyperparameters for the Super Mario Bros environment, so I reviewed those for similar environments instead. Once all agents had trained, I compared their performance over time with <a href="https://www.tensorflow.org/tensorboard/"><u>TensorBoard</u></a> and by watching them play the game. Finally, I used the hyperparameter values of the best-performing agent here.

In [None]:
save_path = './Saved Models/'
log_path = './Logs/'

# Save the model every 200,000 timesteps
checkpoint_callback = CheckpointCallback(
    save_freq = 200000, 
    save_path = save_path,
    name_prefix = 'Optimized')

# Evaluate the model every 20,000 timesteps and save the best model
eval_callback = EvalCallback(
    eval_env, 
    eval_freq = 20000, 
    best_model_save_path = save_path)

callback = CallbackList([checkpoint_callback, eval_callback])
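
Note: As a rough check on the callback settings above, the sketch below works out how many checkpoints and evaluations a 2,000,000-timestep run produces. It assumes a single environment (as in the `DummyVecEnv` above), where one callback call corresponds to one timestep, and that checkpoint files are named `<name_prefix>_<num_timesteps>_steps.zip`.

```python
total_timesteps = 2_000_000
save_freq = 200_000
eval_freq = 20_000
n_envs = 1  # assumption: one environment inside DummyVecEnv

# Both callbacks count calls to env.step(); each call advances n_envs timesteps.
n_checkpoints = total_timesteps // (save_freq * n_envs)
n_evaluations = total_timesteps // (eval_freq * n_envs)

# Expected checkpoint file names under the assumed naming pattern.
checkpoint_names = [
    f"Optimized_{i * save_freq * n_envs}_steps.zip"
    for i in range(1, n_checkpoints + 1)
]

print(n_checkpoints, n_evaluations)
print(checkpoint_names[0], checkpoint_names[-1])
```

With more than one environment, `save_freq` and `eval_freq` would need to be divided by the number of environments to keep the same spacing in timesteps.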

In [None]:
model = PPO('CnnPolicy', train_env, verbose=1, tensorboard_log=log_path,
            # These are the optimized values
            learning_rate = 3e-5,
            n_steps = 512,
            batch_size = 128,
            n_epochs = 20)

model.learn(total_timesteps=2000000, callback=callback) # Train the model for 2,000,000 timesteps
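
Note: To make the interaction between `n_steps`, `batch_size`, and `n_epochs` concrete, the sketch below counts the gradient updates implied by the values above. It is plain arithmetic under the assumption of a single environment and does not call Stable Baselines3.

```python
import math

n_envs = 1  # assumption: one environment inside DummyVecEnv
n_steps = 512
batch_size = 128
n_epochs = 20
total_timesteps = 2_000_000

# Samples collected before each policy update.
rollout_size = n_steps * n_envs
# Each epoch splits the rollout into minibatches of batch_size samples.
minibatches_per_epoch = rollout_size // batch_size
# Every rollout is reused for n_epochs passes.
updates_per_rollout = minibatches_per_epoch * n_epochs
# Training collects whole rollouts until total_timesteps is reached.
n_rollouts = math.ceil(total_timesteps / rollout_size)
total_updates = n_rollouts * updates_per_rollout

print(minibatches_per_epoch, updates_per_rollout, n_rollouts, total_updates)
```

Raising `n_epochs` or lowering `batch_size` increases the number of updates drawn from the same rollout data, which is one reason these values are usually tuned together with the learning rate.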

Note: The optimized agent performed slightly better and more reliably than the other agents, and appeared to play more intelligently, as shown in TensorBoard and by watching the agent play. It consistently scored a reward between about 1,800 and 2,200, whereas the other agents scored lower, less reliably, or both. This agent also adopted interesting strategies and managed to complete the first level of Super Mario Bros once. I believe the agent's success is primarily due to its learning rate, as this has the largest potential to affect performance. In my experiments, a learning rate between 3e-5 and 3e-7, or potentially lower, worked best and led to high-performing, consistent agents within about 1,000,000 timesteps.

Note: I stored images of the TensorBoard results in "MarioAI Experiment Tensorboard Results.docx".

Note: The experiment I used to optimize the hyperparameters could be improved. Ideally, I would adjust each hyperparameter individually to see clearly how each one impacts performance, and I would test many different values of each hyperparameter. This would, of course, involve training many more agents. To compensate, I could reduce the number of timesteps each agent trains for. I have found that the agents usually stop improving, in terms of reward, around 1,000,000 timesteps. Therefore, instead of training each agent for 6,000,000 timesteps, I could train each agent for 2,000,000 timesteps without noticeably affecting its final performance. I could also simply run the experiment for longer or use more computers to train multiple agents at the same time.
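
Note: The one-hyperparameter-at-a-time experiment described above could be set up roughly as follows. This is only a sketch: the baseline uses the optimized values from this notebook, but the candidate values and the `one_at_a_time` helper are hypothetical and were not part of the original experiment.

```python
# Baseline: the optimized values used in this notebook.
baseline = {"learning_rate": 3e-5, "n_steps": 512, "batch_size": 128, "n_epochs": 20}

# Hypothetical candidate values, chosen only for illustration.
candidates = {
    "learning_rate": [3e-4, 3e-6, 3e-7],
    "n_steps": [256, 1024],
    "batch_size": [64, 256],
    "n_epochs": [10],
}

def one_at_a_time(baseline, candidates):
    """Return (label, config) pairs that each change a single hyperparameter."""
    configs = [("baseline", dict(baseline))]
    for name, values in candidates.items():
        for value in values:
            if value == baseline[name]:
                continue  # skip duplicates of the baseline
            config = dict(baseline)
            config[name] = value
            configs.append((f"{name}={value}", config))
    return configs

configs = one_at_a_time(baseline, candidates)
for label, config in configs:
    print(label, config)
```

Each configuration could then be passed to the model as keyword arguments, for example `PPO('CnnPolicy', train_env, **config)`, with one agent trained per configuration.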

# 4. Evaluate the Optimized Agent

Note: The steps below load and run the pre-trained optimized MarioAI agent.

In [None]:
# Load the model
model = PPO.load('./Saved Models/Experiment1_6000000_steps', env=eval_env)

In [None]:
# Start the game
state = eval_env.reset()
# Loop through the game until the cell is interrupted;
# the vectorized environment resets itself when an episode ends
while True:
    action, _ = model.predict(state)
    state, reward, done, info = eval_env.step(action)
    eval_env.render()
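
Note: The loop above runs until it is interrupted. For a bounded run that also records the total reward of each episode, something like the sketch below may help. It assumes the vectorized environment interface used above (`reset()` returns observations; `step()` returns observations, rewards, dones, and infos) with a single environment. `run_episodes` and `CountdownEnv` are hypothetical names; `CountdownEnv` is a tiny stand-in so the example runs without the game installed.

```python
def run_episodes(predict, env, n_episodes):
    """Play n_episodes on a single-environment VecEnv; return each episode's total reward."""
    episode_rewards = []
    total = 0.0
    state = env.reset()
    while len(episode_rewards) < n_episodes:
        action = predict(state)
        state, rewards, dones, infos = env.step(action)
        total += float(rewards[0])
        if dones[0]:  # a VecEnv resets itself automatically at this point
            episode_rewards.append(total)
            total = 0.0
    return episode_rewards


class CountdownEnv:
    """Hypothetical stand-in for eval_env: each episode lasts 3 steps with reward 1 per step."""

    def __init__(self):
        self.steps = 0

    def reset(self):
        self.steps = 0
        return [0]

    def step(self, action):
        self.steps += 1
        done = self.steps == 3
        if done:
            self.steps = 0  # mimic the automatic reset
        return [self.steps], [1.0], [done], [{}]


rewards = run_episodes(lambda state: [0], CountdownEnv(), n_episodes=2)
print(rewards)  # [3.0, 3.0]
```

With the real agent, the call would look like `run_episodes(lambda s: model.predict(s)[0], eval_env, n_episodes=5)`, adding `eval_env.render()` inside the loop if the game window is wanted.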

In [None]:
# Close the game
eval_env.close()