# Monitor Envs, Recording Tools, and Evaluating Model Performance in Custom Envs

In this notebook, I'll learn how to record model performance using the *Monitor* wrapper for Gymnasium **Envs**. Stable-Baselines3 (SB3) has an excellent suite of tools for recording and plotting model performance, all designed to work with the *Monitor* Env wrapper. The plan for this notebook is to accomplish the following:

1. Wrap an existing Gymnasium Env in the Monitor wrapper.
2. Log the performance of an SB3 model during training in that environment.
3. Save the log to a separate folder in this directory.
4. Use the log data to plot model performance over training timesteps using SB3 tools.
5. Wrap my custom TimedColorGame-v0 Env in the Monitor wrapper and register it as TimedColorGame-v1.
6. Record the performance of an SB3 model in my Env and save it in a local folder.
7. Use the SB3 plotting tools to plot model performance over training timesteps for the Timed Color Game.
8. Record the performance of multiple models across their training runs and plot them together in a single graph using SB3 tools.
9. Streamline this process into a simple workflow, or even a function, that I can reuse for future model testing.

#### 1. Wrap a Gymnasium Env in the SB3 *Monitor* Wrapper

For this task, I'll be following a simple tutorial provided by SB3, which is publicly available here: https://colab.research.google.com/github/Stable-Baselines-Team/rl-colab-notebooks/blob/sb3/monitor_training.ipynb#scrollTo=pUWGZp3i9wyf . Let's go!

In [6]:
#IMPORTS
import os
#Gymnasium and base Imports
import gymnasium as gym
import numpy as np
import matplotlib.pyplot as plt
#SB3 Imports
from stable_baselines3 import TD3
from stable_baselines3.common import results_plotter
from stable_baselines3.common.monitor import Monitor
from stable_baselines3.common.results_plotter import load_results, ts2xy
from stable_baselines3.common.noise import NormalActionNoise
from stable_baselines3.common.callbacks import BaseCallback

In [10]:
# Create a custom callback, as per the tutorial, that saves the model whenever the mean training reward improves.

class SaveOnBestTrainingRewardCallback(BaseCallback):
    """
    Callback for saving a model based on the training reward (the check is done every `check_freq` steps).
    In practice, we recommend using `EvalCallback`.

    :param check_freq: (int) Number of callback calls between two checks of the training reward.
    :param log_dir: (str) Path to the folder where the model will be saved. Must contain the file created by the
    `Monitor` wrapper.
    :param verbose: (int) Verbosity level; values above 0 print progress messages.
    """
    def __init__(self,check_freq: int, log_dir: str, verbose=1):
        super().__init__(verbose)
        self.check_freq = check_freq
        self.log_dir = log_dir
        self.save_path = os.path.join(log_dir,"best_model")
        self.best_mean_reward = -np.inf
    
    def _init_callback(self) -> None:
        if self.save_path is not None:
            os.makedirs(self.save_path, exist_ok=True)
    
    def _on_step(self) -> bool:
        if self.n_calls % self.check_freq == 0:
            #Retrieve training reward
            x, y = ts2xy(load_results(self.log_dir),"timesteps")
            if len(x) > 0:
                #Mean training reward over the past 100 episodes
                mean_reward = np.mean(y[-100:])
                if self.verbose > 0:
                    print(f"Num timesteps: {self.num_timesteps}")
                    print(f"Best mean reward: {self.best_mean_reward:.2f} - Last mean reward per episode: {mean_reward:.2f}")

                # New best model: this check must sit outside the verbose block,
                # otherwise nothing is ever saved when verbose=0
                if mean_reward > self.best_mean_reward:
                    self.best_mean_reward = mean_reward
                    # Example for saving best model
                    if self.verbose > 0:
                        print(f"Saving new best model to {self.save_path}.zip")
                    self.model.save(self.save_path)

        return True
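
To see what the callback reads at each check, here is a minimal standard-library sketch of the same computation. It assumes the usual layout of the file written by `Monitor` (one `#`-prefixed JSON header line, then the columns `r`, `l`, `t` for episode reward, episode length, and elapsed time). The sample data and the helper names `read_monitor_csv`, `timesteps_and_rewards`, and `recent_mean_reward` are made up for illustration; in the notebook itself, `load_results` and `ts2xy` do this work.

```python
import csv
import json
from itertools import accumulate
from statistics import mean

# Made-up sample in the assumed Monitor CSV layout (not real training data)
SAMPLE_MONITOR_CSV = """#{"t_start": 0.0, "env_id": "Example-v0"}
r,l,t
-120.5,80,1.2
-95.0,100,2.9
-60.25,120,4.8
"""


def read_monitor_csv(text):
    """Parse the JSON header line and the per-episode rows."""
    lines = text.splitlines()
    header = json.loads(lines[0][1:])  # drop the leading '#'
    rows = list(csv.DictReader(lines[1:]))
    rewards = [float(row["r"]) for row in rows]
    lengths = [int(row["l"]) for row in rows]
    return header, rewards, lengths


def timesteps_and_rewards(rewards, lengths):
    """x = cumulative timesteps at the end of each episode, y = episode rewards."""
    return list(accumulate(lengths)), rewards


def recent_mean_reward(rewards, window=100):
    """Mean reward over the last `window` episodes (fewer if not enough exist)."""
    return mean(rewards[-window:])


header, rewards, lengths = read_monitor_csv(SAMPLE_MONITOR_CSV)
x, y = timesteps_and_rewards(rewards, lengths)
print(x, y, recent_mean_reward(rewards))
```

Because the slice `rewards[-window:]` simply returns every episode when fewer than `window` exist, the mean is noisy early in training, which is worth keeping in mind when reading the first few "best mean reward" messages.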

In [11]:
# Create log dir (relative path, so the logs land in a folder inside this directory)
log_dir = "log/"
os.makedirs(log_dir, exist_ok=True)

#Create Env and wrap in Monitor wrapper
env = gym.make("LunarLanderContinuous-v2")
env = Monitor(env, log_dir)
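
As a rough mental model of what the wrapper adds, the sketch below accumulates rewards during an episode and records the total reward and length when the episode ends. It is a simplified stand-in, not SB3's implementation: `CoinFlipEnv` and `EpisodeRecorder` are hypothetical classes written for this illustration, and the real `Monitor` also tracks elapsed time and writes each episode to the log file.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy env: each step pays +1 or -1; episodes last 5 steps."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._steps = 0

    def reset(self):
        self._steps = 0
        return 0, {}

    def step(self, action):
        self._steps += 1
        reward = 1.0 if self._rng.random() < 0.5 else -1.0
        terminated = self._steps >= 5
        return 0, reward, terminated, False, {}


class EpisodeRecorder:
    """Simplified Monitor-style wrapper: records reward and length per episode."""

    def __init__(self, env):
        self.env = env
        self.episode_rewards = []
        self.episode_lengths = []
        self._current_rewards = []

    def reset(self):
        self._current_rewards = []
        return self.env.reset()

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        self._current_rewards.append(reward)
        if terminated or truncated:
            ep_reward = sum(self._current_rewards)
            ep_length = len(self._current_rewards)
            self.episode_rewards.append(ep_reward)
            self.episode_lengths.append(ep_length)
            # Attach the episode summary to the final step's info dict
            info = dict(info, episode={"r": ep_reward, "l": ep_length})
        return obs, reward, terminated, truncated, info


recorded_env = EpisodeRecorder(CoinFlipEnv(seed=0))
for _ in range(3):
    recorded_env.reset()
    done = False
    while not done:
        _, _, terminated, truncated, info = recorded_env.step(action=0)
        done = terminated or truncated

print(recorded_env.episode_rewards, recorded_env.episode_lengths)
```

The wrapped env is used exactly like the unwrapped one, which is why the training code in the next cell does not need to know that logging is happening.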

In [12]:
# Create action noise because TD3 and DDPG use a deterministic policy
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
# Create the callback: check every 1000 steps
callback = SaveOnBestTrainingRewardCallback(check_freq=1000, log_dir=log_dir)
# Create RL model
model = TD3('MlpPolicy', env, action_noise=action_noise, verbose=0)
# Train the agent
model.learn(total_timesteps=int(5e4), callback=callback)

TypeError: Converting from sequence to b2Vec2, expected int/float arguments index 0