# Mario PPO RL Project

I'll start by saying this was the very first reinforcement learning project I've ever attempted. I didn't implement PPO from scratch, but I learned a whole lot about what goes into such projects - preprocessing, understanding the environment (an NES emulator in this case), and a lifetime of waiting on my computer!

I watched about a dozen YouTube videos on AIs that play various games and found the whole concept intriguing, especially how many different algorithms were used. I was eager to learn the differences and to know when one algorithm might be a better choice than another. I love tackling challenges that push my programming knowledge to new heights, so Mario came to mind as the perfect fit for my expedition into machine learning.

### Fun Facts

1. The whole NES system runs on 2 KB of RAM and a 1.79 MHz processor! The game screen is 256 pixels wide by 240 pixels tall.
2. Due to how early power systems were developed, the North American (NTSC) version of the game runs at 60 fps, and there are two European (PAL) versions - one version is optimized for PAL and the other is not. The unoptimized PAL version runs slower and has slower music, while the optimized version keeps pace with the NTSC version, but occasionally has to skip a frame to do so. This leads to frame-perfect or pixel-perfect tricks being slightly different.
3. The NES has 5 channels for sound - 3 were usable for the game's music and audio effects. This is why some of the music stopped if you collected a coin or powerup. This also forced the composers of the era to get creative with the limited number of parts their song could have.
4. Licking your old NES cartridges to get them to work is just a placebo effect and you are actually making the problem worse by doing so!
5. The average person has a reaction time of about 273 ms (or 16 frames in Mario terms) and can mash a controller button 6.69 times every second (or once every 9 frames). My initial AI was not limited by these factors.
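
The frame counts in the last fact follow from the 60 fps figure quoted for the NTSC version. Here is a quick check of the arithmetic; the function names are my own and only there for illustration.

```python
NTSC_FPS = 60  # frame rate quoted above for the NTSC version


def ms_to_frames(milliseconds, fps=NTSC_FPS):
    """Convert a duration in milliseconds to a (fractional) number of frames."""
    return milliseconds / 1000 * fps


def presses_to_frame_interval(presses_per_second, fps=NTSC_FPS):
    """Number of frames that pass between two consecutive button presses."""
    return fps / presses_per_second


reaction_frames = ms_to_frames(273)              # about 16.4 frames
mash_interval = presses_to_frame_interval(6.69)  # about 9.0 frames
print(round(reaction_frames), round(mash_interval))
```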

## Starting the Game

### Installing the dependencies

There were several things for me to install for this project. `gym`, `gym_super_mario_bros`, and `nes_py` were necessary to get the game running. The game itself is the NTSC version running on an emulator. `nes_py` makes hacking the RAM and cheating with the game quite easy! This is handy if I want to make any sort of save states to revert to.

I am using a custom `PyTorch` installation which lets me use my GPU (a little GeForce GTX 1050 Ti) via CUDA, and `stable-baselines3` for the actual PPO model. My first version didn't use any multiprocessing; running several copies of the game in parallel (as `SubprocVecEnv` does in the code further down) was the next step for me to learn in order to speed up training drastically.

In [1]:
# Install the environment
!pip install gym==0.21.0 gym_super_mario_bros==7.3.0 nes_py --no-input

# This line must be run before the installation of stable-baselines3 to force
# PyTorch to use the GPU via CUDA.
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu117 --no-input

# Stable-baselines3 gives us access to many RL models.
!pip install stable-baselines3[extra] --no-input

Defaulting to user installation because normal site-packages is not writeable
Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu117


### Understanding the Game

The NES controller has a total of 8 buttons - `up`, `down`, `left`, `right`, `A`, `B`, `start`, and `select`. On the emulator, all of these buttons can be pressed at the same time, including combinations that are physically impossible on a real controller, such as `up` and `down` simultaneously. This gives a total of 2^8 = 256 possible input combinations per frame.

We need to limit the action space so the agent can learn more easily. We don't want the agent pushing `start` or `select` for any reason. Invalid combinations can be avoided as well. Lastly, certain combinations don't yield any special results, such as pushing `down` in conjunction with any other button.
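
To get a feel for how much these restrictions shrink the action space, here is a small counting sketch. It treats a combination as a plain set of pressed buttons and is not how `nes_py` represents inputs; the helper names are made up for this example.

```python
from itertools import product

BUTTONS = ['up', 'down', 'left', 'right', 'A', 'B', 'start', 'select']


def all_combinations(buttons):
    """Return every subset of `buttons` as a frozenset of pressed buttons."""
    combos = []
    for mask in product([False, True], repeat=len(buttons)):
        combos.append(frozenset(b for b, pressed in zip(buttons, mask) if pressed))
    return combos


def is_physically_possible(combo):
    """Reject opposing D-pad directions pressed at the same time."""
    return not ({'up', 'down'} <= combo or {'left', 'right'} <= combo)


def avoids_menu_buttons(combo):
    """Reject anything that presses start or select."""
    return not (combo & {'start', 'select'})


combos = all_combinations(BUTTONS)
possible = [c for c in combos if is_physically_possible(c)]
useful = [c for c in possible if avoids_menu_buttons(c)]
print(len(combos), len(possible), len(useful))  # 256 144 36
```

Even after removing the impossible and unwanted inputs, 36 combinations remain, which is why the predefined action lists in `gym_super_mario_bros.actions` (or a custom list) cut things down much further.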

The goal is to run to the right as quickly as possible, and the environment rewards Mario based on his change in X position. (Rewarding Mario for the actual score on the screen will cause him to mope around collecting all the coins and grinding enemies.) I decided to force the AI to hold `right` and `B` to constantly run right. All it has to decide is when to push the `A` button to jump. I figured this would net the quickest results as there are no parts of the game where you are forced to move left or stop moving.

The environment will only return frames which are rewardable, so the start screen is skipped and 1 player mode is automatically selected. Similarly, death and other animations are also skipped.
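
Here is a rough sketch of a reward built around the change in X position. The extra terms (a clock penalty, a death penalty of -15, and clipping to the range -15 to 15) reflect my reading of the `gym_super_mario_bros` documentation and should be treated as assumptions; the function itself is illustrative and is not the library's code.

```python
DEATH_PENALTY = -15        # assumed value, see the note above
REWARD_RANGE = (-15, 15)   # assumed clipping range, see the note above


def step_reward(x_before, x_after, clock_before, clock_after, died):
    """Illustrative per-step reward: progress + clock penalty + death penalty."""
    velocity = x_after - x_before              # positive when moving right
    time_penalty = clock_after - clock_before  # the clock counts down, so <= 0
    death = DEATH_PENALTY if died else 0
    raw = velocity + time_penalty + death
    low, high = REWARD_RANGE
    return max(low, min(high, raw))


print(step_reward(100, 103, 400, 400, died=False))  # moved right by 3 pixels
print(step_reward(100, 100, 400, 399, died=False))  # stood still for a clock tick
print(step_reward(100, 98, 400, 400, died=True))    # died while moving left
```

Under these assumptions, standing still slowly loses reward, which is one reason an agent that only ever runs right can still do reasonably well.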

In [2]:
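# Import the game and the predefined action sets used below
import gym_super_mario_bros
from gym_super_mario_bros.actions import SIMPLE_MOVEMENT, COMPLEX_MOVEMENT
# Custom action set: always hold right + B (run), optionally add A (jump)
SPEEDRUN_MOVEMENT = [['right', 'B'], ['right', 'A', 'B']]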

# Import wrappers
from gym import Wrapper
from nes_py.wrappers import JoypadSpace
from gym.wrappers import GrayScaleObservation, RecordEpisodeStatistics, ResizeObservation
from stable_baselines3.common.vec_env import VecFrameStack, SubprocVecEnv

# Import the RL model
from stable_baselines3 import PPO

# Import utilities for image viewing, file I/O, RNG, etc.
from stable_baselines3.common.callbacks import BaseCallback
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

  logger.warn("Overriding environment {}".format(id))

## Testing the Environment

The specific environment I used was one which selects a random stage on each reset. I did this to help the AI generalize and to reduce overfitting on the first level. Plus, it would be very slow if Mario had to clear the first 7 worlds before ever reaching world 8, which the AI also needs to learn.

This had me thinking of implementing a similar idea for a single level as well - what if we could rewind a second or two and the AI could retry just the part where it died instead of having to run through the first half of the level? There were some pitfalls in this logic about perhaps getting stuck in death loops, so I did not implement this until later to measure the improvements that could be made.

The environment also turns everything into rectangles to reduce the complexity of the information for the agent. The code below tests the game for a few seconds to ensure that it is working properly.

In [3]:
# Create and wrap the environment
env = gym_super_mario_bros.make('SuperMarioBros-v3')
env = JoypadSpace(env, SIMPLE_MOVEMENT)
state = env.reset()
env.render()

# Test the game with random actions
try:
    for step in range(10 ** 3):
        # Pick a random action
        action = env.action_space.sample()
        state, reward, done, info = env.step(action)
        
        if done:
            # Start a new episode and keep the fresh observation
            state = env.reset()
        env.render()
except KeyboardInterrupt:
    pass
finally:
    env.close()



## Preprocessing

The design choices I made here were intended to speed up the rate at which the AI learns and to model human reactions a little better. The base environment that I used is a rectangular downsample of the whole game - sprites are reduced to only their hitboxes. I also grayscale the image, which could cut out important information (such as the difference between red lava and a gray floor) but will overall speed up the learning process.

Original | Downsampled | Grayscaled
- | - | -
![mario original](smb_original.png) | ![mario downsampled](smb_rectangle.png) | ![mario grayscale](smb_grayscale.png)
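
The lava-versus-floor worry mentioned above is easy to reproduce with a weighted grayscale conversion. The sketch below uses the common luma weights (0.299, 0.587, 0.114); the wrapper used later may differ in detail, and the two colors are made-up examples rather than values taken from the game's palette.

```python
def to_gray(rgb):
    """Convert an (R, G, B) pixel to a single brightness value."""
    r, g, b = rgb
    return round(0.299 * r + 0.587 * g + 0.114 * b)


lava_red = (255, 0, 0)     # hypothetical lava color
floor_gray = (76, 76, 76)  # hypothetical floor color

# Two very different colors can end up with the same brightness
print(to_gray(lava_red), to_gray(floor_gray))  # 76 76
```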

### Deadlock

I decided that if Mario gets stuck and makes no progress for 3 seconds, the episode should end right away instead of waiting minutes for the in-game timer to run out.

In [4]:
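class DeadlockEnv(Wrapper):
    """End the episode early when Mario stops making progress to the right."""
    
    def __init__(self, env, threshold):
        super().__init__(env)
        self.threshold = threshold  # steps without progress before giving up
        self.max_x_pos = 0          # furthest x position reached this episode
        self.count = 0              # consecutive steps without a new best
    
    def reset(self, **kwargs):
        self.max_x_pos = 0
        self.count = 0
        return self.env.reset(**kwargs)
    
    def step(self, action):
        state, reward, done, info = self.env.step(action)
        x_pos = info['x_pos']
        if x_pos <= self.max_x_pos:
            self.count += 1
        else:
            self.count = 0
            self.max_x_pos = x_pos
        
        # Stuck for too long: apply a penalty and end the episode
        if self.count >= self.threshold:
            reward -= 15
            done = True
        
        return state, reward, done, info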
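
To see the counter logic in isolation, here is the same idea without `gym` or the emulator. `StallDetector` is a made-up helper that mirrors the bookkeeping in the wrapper, fed with a hand-written list of x positions and a tiny threshold.

```python
class StallDetector:
    """Track the best x position and flag when progress has stalled."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.max_x_pos = 0
        self.count = 0

    def update(self, x_pos):
        """Return True once `threshold` updates in a row brought no new best."""
        if x_pos <= self.max_x_pos:
            self.count += 1
        else:
            self.count = 0
            self.max_x_pos = x_pos
        return self.count >= self.threshold


detector = StallDetector(threshold=3)
positions = [40, 41, 42, 42, 41, 42, 43]
flags = [detector.update(x) for x in positions]
print(flags)  # [False, False, False, False, False, True, False]
```

Note that moving left and coming back to the same spot still counts as being stuck, because only a new furthest position resets the counter. In the wrapper the episode would already have ended at the first `True`.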

### Skipping Frames

The AI can only change inputs every 5 frames to model human limitations. This also makes it easier to do things like jump over pipes, where the `A` button needs to be held for several frames. The rewards for the skipped frames are summed.

In [5]:
class SkipFrame(Wrapper):
    """Repeat each action for `skip` frames and return the summed reward."""
    
    def __init__(self, env, skip):
        super().__init__(env)
        self._skip = skip
    
    def step(self, action):
        total_reward = 0
        for _ in range(self._skip):
            state, reward, done, info = self.env.step(action)
            total_reward += reward
            # Stop repeating the action as soon as the episode ends
            if done:
                break
        return state, total_reward, done, info
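
The effect of frame skipping is easiest to see on a toy environment. `CountingEnv` below is a hypothetical stand-in for the emulator (every frame pays a reward of 1), and `skip_step` repeats the loop from the wrapper as a plain function so the example runs without `gym`.

```python
class CountingEnv:
    """Toy environment: each frame gives reward 1, episode ends after `length` frames."""

    def __init__(self, length):
        self.length = length
        self.frame = 0

    def step(self, action):
        self.frame += 1
        done = self.frame >= self.length
        return self.frame, 1, done, {}


def skip_step(env, action, skip):
    """Repeat `action` for up to `skip` frames and sum the rewards."""
    total_reward = 0
    for _ in range(skip):
        state, reward, done, info = env.step(action)
        total_reward += reward
        if done:
            break
    return state, total_reward, done, info


toy_env = CountingEnv(length=12)
results = [skip_step(toy_env, 'jump', 5)[1:3] for _ in range(3)]
print(results)  # [(5, False), (5, False), (2, True)]
```

The last call stops after two frames because the episode ended, so the agent never acts on frames that come after `done`.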

### Other Wrappers

I also grayscale the image and reduce its size by half to cut down the information passed to the AI. (If you try to display the result using Matplotlib, note that it applies its default colormap to single-channel images, so the colors look wrong unless you ask for a gray colormap.)
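
To make the size reduction concrete, here is a dependency-free sketch that halves a grayscale frame by averaging each 2x2 block. This is only a stand-in for whatever interpolation the resize wrapper applies, and the input frame is synthetic.

```python
def downsample_2x(image):
    """Average each 2x2 block of a grayscale image given as a list of rows."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) // 4
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


# Synthetic 240-row by 256-column frame (height x width)
frame = [[(x + y) % 256 for x in range(256)] for y in range(240)]
small = downsample_2x(frame)
print(len(small), len(small[0]))  # 120 128
```

Halving both dimensions leaves a quarter of the pixels, and dropping from three color channels to one cuts the input by another factor of three.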

The environments are wrapped in a `SubprocVecEnv`, which runs each copy of the game in its own process so that PPO can collect experience from multiple games at once while the GPU trains the network.

Then, a total of 4 frames are stacked on each other to give the AI a sense of motion and acceleration.
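
Frame stacking boils down to keeping a sliding window over the most recent frames. The sketch below shows the idea with a `deque`; `FrameStack` is a made-up class, and the way it pads the first observation (by repeating the first frame) is my own choice and may differ from what `VecFrameStack` does.

```python
from collections import deque


class FrameStack:
    """Keep the most recent `n_frames` frames as one observation."""

    def __init__(self, n_frames):
        self.frames = deque(maxlen=n_frames)

    def reset(self, first_frame):
        """Fill the window with copies of the first frame of an episode."""
        self.frames.clear()
        for _ in range(self.frames.maxlen):
            self.frames.append(first_frame)
        return list(self.frames)

    def push(self, frame):
        """Add the newest frame; the oldest one falls out automatically."""
        self.frames.append(frame)
        return list(self.frames)


stack = FrameStack(4)
stack.reset('f0')
for name in ['f1', 'f2', 'f3', 'f4', 'f5']:
    observation = stack.push(name)
print(observation)  # ['f2', 'f3', 'f4', 'f5']
```

With four consecutive frames in one observation, the network can compare positions across frames, which is what gives it a sense of speed and acceleration.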

Now we just create the actual environment and wrap everything up.

In [6]:
def create_env(ranks, seed = 0):
    
    def helper(rank):
        # 1. Create the base environment
        env = gym_super_mario_bros.make('SuperMarioBros-v3')
        env.seed(seed + rank)

        # 2. Restrict the controls to the 12 actions in COMPLEX_MOVEMENT
        env = JoypadSpace(env, COMPLEX_MOVEMENT)

        # 3. If Mario is not making progress after 180 frames (3 seconds), end it
        env = DeadlockEnv(env, 60 * 3)

        # 4. Skip frames so the agent can only change button presses every 5 frames
        # env = SkipFrame(env, 5)

        # 5. Grayscale - This cuts down on info for the agent to learn
        env = GrayScaleObservation(env, keep_dim=True)
        
        # 6. Resize the image to 50%
        env = ResizeObservation(env, (120, 128))
        
        # 7. Record stats about the env for tensorboard
        env = RecordEpisodeStatistics(env)
        
        return env

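    # 8. Run each environment in its own subprocess. The default argument
    #    binds the current value of i, so every worker gets its own rank.
    env = SubprocVecEnv([(lambda i=i: helper(i)) for i in range(ranks)])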

    # 9. Stack the frames
    env = VecFrameStack(env, 4, channels_order='last')
    
    return env
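
A detail that is easy to trip over when building one environment factory per rank is that Python closures look up loop variables late. The snippet below shows the pitfall and the usual default-argument fix with plain lambdas; the function names are made up for this example.

```python
def make_factories_late(n):
    """Each lambda looks up `i` only when called, after the loop has finished."""
    return [lambda: i for i in range(n)]


def make_factories_bound(n):
    """A default argument captures the value of `i` when the lambda is defined."""
    return [lambda i=i: i for i in range(n)]


late = [factory() for factory in make_factories_late(4)]
bound = [factory() for factory in make_factories_bound(4)]
print(late)   # [3, 3, 3, 3]
print(bound)  # [0, 1, 2, 3]
```

For `create_env`, late binding would mean every worker is built with the same rank and therefore the same seed, so the parallel games would be less varied than intended.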

## Logging and Training

Initially I decided to train the model for a total of 1 million timesteps and save a checkpoint every 100,000 steps. I used a learning rate of 1e-5 and chose to update the model every 512 steps per environment (2,048 steps in total across the 4 environments).

In [7]:
# Helper class that saves a checkpoint of the model every `save_freq` calls
class TrainAndLoggingCallback(BaseCallback):
    
    def __init__(self, save_freq, save_path, verbose = 1):
        super().__init__(verbose)
        self.save_freq = save_freq
        self.save_path = save_path
    
    def _init_callback(self):
        # Create the checkpoint folder if it doesn't exist yet
        if self.save_path is not None:
            os.makedirs(self.save_path, exist_ok=True)
    
    def _on_step(self):
        # n_calls counts vectorized steps, while num_timesteps counts steps
        # across all environments, so it gives the more accurate file name
        if self.n_calls % self.save_freq == 0:
            model_path = os.path.join(self.save_path,
                                      f'best_model_{self.num_timesteps // 1000}k')
            self.model.save(model_path)
        return True

In [8]:
# Set up save dirs
CHECKPOINT_DIR = 'E:/train/'
TRAIN_DIR = 'E:/logs/'
MODEL_PATH = 'E:/train/finished_model'
REWARD_LOG = TRAIN_DIR + 'reward_log.csv'

# Set up constants
SAVE_FREQ = 10 ** 5
NUM_SUBPROCESSES = 4

# Set up the model saving callback
callback = TrainAndLoggingCallback(SAVE_FREQ // NUM_SUBPROCESSES,
                                   CHECKPOINT_DIR)
env = create_env(NUM_SUBPROCESSES)
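
The reason `SAVE_FREQ` is divided by `NUM_SUBPROCESSES` is that, as far as I understand Stable-Baselines3, the callback runs once per vectorized step, and each of those steps advances every environment by one timestep. The simulation below is a simplified model of that bookkeeping (it ignores that training proceeds in whole rollouts), and `checkpoint_timesteps` is a made-up helper.

```python
def checkpoint_timesteps(total_timesteps, n_envs, save_freq_calls):
    """Timesteps at which a callback firing every `save_freq_calls` calls would save."""
    saves = []
    n_calls = 0
    timesteps = 0
    while timesteps < total_timesteps:
        n_calls += 1
        timesteps += n_envs  # one call advances every environment by one step
        if n_calls % save_freq_calls == 0:
            saves.append(timesteps)
    return saves


SAVE_FREQ = 10 ** 5
NUM_SUBPROCESSES = 4
saves = checkpoint_timesteps(10 ** 6, NUM_SUBPROCESSES,
                             SAVE_FREQ // NUM_SUBPROCESSES)
print(saves[:3], len(saves))  # [100000, 200000, 300000] 10
```

Without the division, a checkpoint would only be written every 400,000 timesteps in this model.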

In [9]:
# Create the model (policy)
model = PPO('CnnPolicy',
            env = env,
            verbose = 1,
            tensorboard_log = TRAIN_DIR,
            learning_rate = 1e-5,
            n_steps = 512)

Using cuda device
Wrapping the env in a VecTransposeImage.
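
Several numbers in the training log printed by `model.learn` follow directly from these settings. The sketch below reproduces them; `n_epochs=10` is an assumption on my part (I believe it is the PPO default in Stable-Baselines3, and it matches the `n_updates` column of the log), and the helper names are made up.

```python
import math


def rollout_stats(n_steps, n_envs, total_timesteps):
    """Timesteps collected per iteration and iterations needed for the budget."""
    steps_per_iteration = n_steps * n_envs
    iterations = math.ceil(total_timesteps / steps_per_iteration)
    return steps_per_iteration, iterations


def n_updates_logged(iteration, n_epochs=10):
    """Gradient epochs completed before the log for `iteration` is printed."""
    return n_epochs * (iteration - 1)


steps_per_iteration, iterations = rollout_stats(512, 4, 10 ** 6)
print(steps_per_iteration)   # 2048, the total_timesteps after iteration 1
print(iterations)            # 489 iterations to cover 1 million timesteps
print(n_updates_logged(11))  # 100, matching the log entry for iteration 11
```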


In [10]:
try:
    # Train and save model
    model.learn(total_timesteps = 10 ** 6,
                callback = callback)
    model.save(MODEL_PATH)
except KeyboardInterrupt:
    pass
finally:
    # Shut down the worker processes, then free memory
    env.close()
    del model
    del env
    print('Freed memory and closed resources.')

Logging to E:/logs/PPO_3
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 207      |
|    ep_rew_mean     | -34.4    |
| time/              |          |
|    fps             | 283      |
|    iterations      | 1        |
|    time_elapsed    | 7        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 302         |
|    ep_rew_mean          | 9.75        |
| time/                   |             |
|    fps                  | 236         |
|    iterations           | 2           |
|    time_elapsed         | 17          |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.015081482 |
|    clip_fraction        | 0.123       |
|    clip_range           | 0.2         |
|    entropy_loss         | -2.48       |
|    explained_variance   | 0.000377    |
|    

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 615         |
|    ep_rew_mean          | 336.82352   |
| time/                   |             |
|    fps                  | 209         |
|    iterations           | 11          |
|    time_elapsed         | 107         |
|    total_timesteps      | 22528       |
| train/                  |             |
|    approx_kl            | 0.010392243 |
|    clip_fraction        | 0.108       |
|    clip_range           | 0.2         |
|    entropy_loss         | -2.28       |
|    explained_variance   | 0.895       |
|    learning_rate        | 1e-05       |
|    loss                 | 28.1        |
|    n_updates            | 100         |
|    policy_gradient_loss | -0.00772    |
|    value_loss           | 105         |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 610   

----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 592        |
|    ep_rew_mean          | 417.9014   |
| time/                   |            |
|    fps                  | 207        |
|    iterations           | 21         |
|    time_elapsed         | 207        |
|    total_timesteps      | 43008      |
| train/                  |            |
|    approx_kl            | 0.01372692 |
|    clip_fraction        | 0.141      |
|    clip_range           | 0.2        |
|    entropy_loss         | -2.04      |
|    explained_variance   | 0.975      |
|    learning_rate        | 1e-05      |
|    loss                 | 23.6       |
|    n_updates            | 200        |
|    policy_gradient_loss | -0.00401   |
|    value_loss           | 59.3       |
----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 588         |
|    ep_rew_m

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 608         |
|    ep_rew_mean          | 497.37      |
| time/                   |             |
|    fps                  | 206         |
|    iterations           | 31          |
|    time_elapsed         | 307         |
|    total_timesteps      | 63488       |
| train/                  |             |
|    approx_kl            | 0.007682711 |
|    clip_fraction        | 0.0864      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.66       |
|    explained_variance   | 0.913       |
|    learning_rate        | 1e-05       |
|    loss                 | 252         |
|    n_updates            | 300         |
|    policy_gradient_loss | -0.00283    |
|    value_loss           | 287         |
-----------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 590 

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 566         |
|    ep_rew_mean          | 561.08      |
| time/                   |             |
|    fps                  | 206         |
|    iterations           | 41          |
|    time_elapsed         | 407         |
|    total_timesteps      | 83968       |
| train/                  |             |
|    approx_kl            | 0.007549705 |
|    clip_fraction        | 0.178       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.68       |
|    explained_variance   | 0.986       |
|    learning_rate        | 1e-05       |
|    loss                 | 32.4        |
|    n_updates            | 400         |
|    policy_gradient_loss | -0.0032     |
|    value_loss           | 60.9        |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 562   

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 550         |
|    ep_rew_mean          | 591.45      |
| time/                   |             |
|    fps                  | 205         |
|    iterations           | 51          |
|    time_elapsed         | 508         |
|    total_timesteps      | 104448      |
| train/                  |             |
|    approx_kl            | 0.017585145 |
|    clip_fraction        | 0.151       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.56       |
|    explained_variance   | 0.993       |
|    learning_rate        | 1e-05       |
|    loss                 | 18.4        |
|    n_updates            | 500         |
|    policy_gradient_loss | -0.00417    |
|    value_loss           | 36.9        |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 547   

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 537         |
|    ep_rew_mean          | 578.33      |
| time/                   |             |
|    fps                  | 204         |
|    iterations           | 61          |
|    time_elapsed         | 610         |
|    total_timesteps      | 124928      |
| train/                  |             |
|    approx_kl            | 0.011246964 |
|    clip_fraction        | 0.125       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.46       |
|    explained_variance   | 0.957       |
|    learning_rate        | 1e-05       |
|    loss                 | 59.4        |
|    n_updates            | 600         |
|    policy_gradient_loss | -0.00232    |
|    value_loss           | 114         |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 530   

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 520         |
|    ep_rew_mean          | 568.4       |
| time/                   |             |
|    fps                  | 204         |
|    iterations           | 71          |
|    time_elapsed         | 710         |
|    total_timesteps      | 145408      |
| train/                  |             |
|    approx_kl            | 0.010220475 |
|    clip_fraction        | 0.125       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.36       |
|    explained_variance   | 0.978       |
|    learning_rate        | 1e-05       |
|    loss                 | 50.9        |
|    n_updates            | 700         |
|    policy_gradient_loss | -0.000817   |
|    value_loss           | 111         |
-----------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 522 

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 528          |
|    ep_rew_mean          | 580.03       |
| time/                   |              |
|    fps                  | 204          |
|    iterations           | 81           |
|    time_elapsed         | 810          |
|    total_timesteps      | 165888       |
| train/                  |              |
|    approx_kl            | 0.0063026194 |
|    clip_fraction        | 0.0753       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.968       |
|    explained_variance   | 0.98         |
|    learning_rate        | 1e-05        |
|    loss                 | 54.2         |
|    n_updates            | 800          |
|    policy_gradient_loss | -0.00227     |
|    value_loss           | 76.7         |
------------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_m

----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 543        |
|    ep_rew_mean          | 613.47     |
| time/                   |            |
|    fps                  | 204        |
|    iterations           | 91         |
|    time_elapsed         | 910        |
|    total_timesteps      | 186368     |
| train/                  |            |
|    approx_kl            | 0.00895274 |
|    clip_fraction        | 0.122      |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.945     |
|    explained_variance   | 0.959      |
|    learning_rate        | 1e-05      |
|    loss                 | 65.2       |
|    n_updates            | 900        |
|    policy_gradient_loss | 0.00119    |
|    value_loss           | 175        |
----------------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 543        |
|    ep_rew_mean

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 562         |
|    ep_rew_mean          | 659.1       |
| time/                   |             |
|    fps                  | 204         |
|    iterations           | 101         |
|    time_elapsed         | 1012        |
|    total_timesteps      | 206848      |
| train/                  |             |
|    approx_kl            | 0.006489372 |
|    clip_fraction        | 0.112       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.878      |
|    explained_variance   | 0.904       |
|    learning_rate        | 1e-05       |
|    loss                 | 348         |
|    n_updates            | 1000        |
|    policy_gradient_loss | -0.000811   |
|    value_loss           | 366         |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 564   

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 585          |
|    ep_rew_mean          | 708.1        |
| time/                   |              |
|    fps                  | 204          |
|    iterations           | 111          |
|    time_elapsed         | 1114         |
|    total_timesteps      | 227328       |
| train/                  |              |
|    approx_kl            | 0.0132488515 |
|    clip_fraction        | 0.141        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.939       |
|    explained_variance   | 0.959        |
|    learning_rate        | 1e-05        |
|    loss                 | 86.4         |
|    n_updates            | 1100         |
|    policy_gradient_loss | -0.00275     |
|    value_loss           | 151          |
------------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_m

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 594          |
|    ep_rew_mean          | 735.56       |
| time/                   |              |
|    fps                  | 203          |
|    iterations           | 121          |
|    time_elapsed         | 1215         |
|    total_timesteps      | 247808       |
| train/                  |              |
|    approx_kl            | 0.0061640018 |
|    clip_fraction        | 0.0607       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.62        |
|    explained_variance   | 0.852        |
|    learning_rate        | 1e-05        |
|    loss                 | 187          |
|    n_updates            | 1200         |
|    policy_gradient_loss | -0.00157     |
|    value_loss           | 398          |
------------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_m

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 601          |
|    ep_rew_mean          | 752.2        |
| time/                   |              |
|    fps                  | 203          |
|    iterations           | 131          |
|    time_elapsed         | 1317         |
|    total_timesteps      | 268288       |
| train/                  |              |
|    approx_kl            | 0.0058018714 |
|    clip_fraction        | 0.055        |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.584       |
|    explained_variance   | 0.902        |
|    learning_rate        | 1e-05        |
|    loss                 | 69.5         |
|    n_updates            | 1300         |
|    policy_gradient_loss | -0.00192     |
|    value_loss           | 307          |
------------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_m

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 601         |
|    ep_rew_mean          | 775.79      |
| time/                   |             |
|    fps                  | 203         |
|    iterations           | 141         |
|    time_elapsed         | 1417        |
|    total_timesteps      | 288768      |
| train/                  |             |
|    approx_kl            | 0.004935762 |
|    clip_fraction        | 0.055       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.524      |
|    explained_variance   | 0.947       |
|    learning_rate        | 1e-05       |
|    loss                 | 59.4        |
|    n_updates            | 1400        |
|    policy_gradient_loss | -0.000927   |
|    value_loss           | 186         |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 606   

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 614         |
|    ep_rew_mean          | 783.26      |
| time/                   |             |
|    fps                  | 203         |
|    iterations           | 151         |
|    time_elapsed         | 1520        |
|    total_timesteps      | 309248      |
| train/                  |             |
|    approx_kl            | 0.008461941 |
|    clip_fraction        | 0.105       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.698      |
|    explained_variance   | 0.941       |
|    learning_rate        | 1e-05       |
|    loss                 | 49          |
|    n_updates            | 1500        |
|    policy_gradient_loss | -0.00426    |
|    value_loss           | 158         |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 616   

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 620          |
|    ep_rew_mean          | 827.29       |
| time/                   |              |
|    fps                  | 203          |
|    iterations           | 161          |
|    time_elapsed         | 1623         |
|    total_timesteps      | 329728       |
| train/                  |              |
|    approx_kl            | 0.0072175125 |
|    clip_fraction        | 0.0639       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.534       |
|    explained_variance   | 0.794        |
|    learning_rate        | 1e-05        |
|    loss                 | 313          |
|    n_updates            | 1600         |
|    policy_gradient_loss | -0.00317     |
|    value_loss           | 463          |
------------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_m

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 619         |
|    ep_rew_mean          | 832.01      |
| time/                   |             |
|    fps                  | 202         |
|    iterations           | 171         |
|    time_elapsed         | 1726        |
|    total_timesteps      | 350208      |
| train/                  |             |
|    approx_kl            | 0.011944721 |
|    clip_fraction        | 0.0934      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.908      |
|    explained_variance   | 0.962       |
|    learning_rate        | 1e-05       |
|    loss                 | 67          |
|    n_updates            | 1700        |
|    policy_gradient_loss | -0.00122    |
|    value_loss           | 131         |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 615   

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 617         |
|    ep_rew_mean          | 823.72      |
| time/                   |             |
|    fps                  | 202         |
|    iterations           | 181         |
|    time_elapsed         | 1829        |
|    total_timesteps      | 370688      |
| train/                  |             |
|    approx_kl            | 0.021555662 |
|    clip_fraction        | 0.104       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.797      |
|    explained_variance   | 0.937       |
|    learning_rate        | 1e-05       |
|    loss                 | 91.6        |
|    n_updates            | 1800        |
|    policy_gradient_loss | -0.00537    |
|    value_loss           | 162         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 631         |
|    ep_rew_mean          | 852.02      |
| time/                   |             |
|    fps                  | 202         |
|    iterations           | 191         |
|    time_elapsed         | 1931        |
|    total_timesteps      | 391168      |
| train/                  |             |
|    approx_kl            | 0.013657369 |
|    clip_fraction        | 0.154       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.876      |
|    explained_variance   | 0.818       |
|    learning_rate        | 1e-05       |
|    loss                 | 57.5        |
|    n_updates            | 1900        |
|    policy_gradient_loss | -0.000924   |
|    value_loss           | 148         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 659         |
|    ep_rew_mean          | 875.74      |
| time/                   |             |
|    fps                  | 202         |
|    iterations           | 201         |
|    time_elapsed         | 2033        |
|    total_timesteps      | 411648      |
| train/                  |             |
|    approx_kl            | 0.010440534 |
|    clip_fraction        | 0.103       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.839      |
|    explained_variance   | 0.903       |
|    learning_rate        | 1e-05       |
|    loss                 | 70          |
|    n_updates            | 2000        |
|    policy_gradient_loss | -0.00676    |
|    value_loss           | 207         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 682         |
|    ep_rew_mean          | 950.43      |
| time/                   |             |
|    fps                  | 202         |
|    iterations           | 211         |
|    time_elapsed         | 2135        |
|    total_timesteps      | 432128      |
| train/                  |             |
|    approx_kl            | 0.007776398 |
|    clip_fraction        | 0.0952      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.758      |
|    explained_variance   | 0.649       |
|    learning_rate        | 1e-05       |
|    loss                 | 224         |
|    n_updates            | 2100        |
|    policy_gradient_loss | -0.00755    |
|    value_loss           | 439         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 671         |
|    ep_rew_mean          | 969.36      |
| time/                   |             |
|    fps                  | 202         |
|    iterations           | 221         |
|    time_elapsed         | 2237        |
|    total_timesteps      | 452608      |
| train/                  |             |
|    approx_kl            | 0.008153491 |
|    clip_fraction        | 0.0965      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.624      |
|    explained_variance   | 0.811       |
|    learning_rate        | 1e-05       |
|    loss                 | 94.1        |
|    n_updates            | 2200        |
|    policy_gradient_loss | -0.00128    |
|    value_loss           | 161         |
-----------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 673          |
|    ep_rew_mean          | 1005.33      |
| time/                   |              |
|    fps                  | 202          |
|    iterations           | 231          |
|    time_elapsed         | 2339         |
|    total_timesteps      | 473088       |
| train/                  |              |
|    approx_kl            | 0.0076605896 |
|    clip_fraction        | 0.0915       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.683       |
|    explained_variance   | 0.766        |
|    learning_rate        | 1e-05        |
|    loss                 | 64.5         |
|    n_updates            | 2300         |
|    policy_gradient_loss | -0.00436     |
|    value_loss           | 232          |
------------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 677         |
|    ep_rew_mean          | 985.27      |
| time/                   |             |
|    fps                  | 202         |
|    iterations           | 241         |
|    time_elapsed         | 2440        |
|    total_timesteps      | 493568      |
| train/                  |             |
|    approx_kl            | 0.008546933 |
|    clip_fraction        | 0.0867      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.739      |
|    explained_variance   | 0.682       |
|    learning_rate        | 1e-05       |
|    loss                 | 134         |
|    n_updates            | 2400        |
|    policy_gradient_loss | -0.00388    |
|    value_loss           | 323         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 619         |
|    ep_rew_mean          | 845.84      |
| time/                   |             |
|    fps                  | 202         |
|    iterations           | 251         |
|    time_elapsed         | 2541        |
|    total_timesteps      | 514048      |
| train/                  |             |
|    approx_kl            | 0.021312999 |
|    clip_fraction        | 0.135       |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.08       |
|    explained_variance   | 0.976       |
|    learning_rate        | 1e-05       |
|    loss                 | 19.1        |
|    n_updates            | 2500        |
|    policy_gradient_loss | -0.00829    |
|    value_loss           | 65.5        |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 552         |
|    ep_rew_mean          | 690.65      |
| time/                   |             |
|    fps                  | 202         |
|    iterations           | 261         |
|    time_elapsed         | 2642        |
|    total_timesteps      | 534528      |
| train/                  |             |
|    approx_kl            | 0.025280248 |
|    clip_fraction        | 0.149       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.952      |
|    explained_variance   | 0.988       |
|    learning_rate        | 1e-05       |
|    loss                 | 10.4        |
|    n_updates            | 2600        |
|    policy_gradient_loss | -0.0128     |
|    value_loss           | 39.3        |
-----------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 520          |
|    ep_rew_mean          | 575.49       |
| time/                   |              |
|    fps                  | 202          |
|    iterations           | 271          |
|    time_elapsed         | 2742         |
|    total_timesteps      | 555008       |
| train/                  |              |
|    approx_kl            | 0.0109510245 |
|    clip_fraction        | 0.0939       |
|    clip_range           | 0.2          |
|    entropy_loss         | -1.08        |
|    explained_variance   | 0.993        |
|    learning_rate        | 1e-05        |
|    loss                 | 6.75         |
|    n_updates            | 2700         |
|    policy_gradient_loss | -0.0102      |
|    value_loss           | 19.4         |
------------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 515         |
|    ep_rew_mean          | 576.88      |
| time/                   |             |
|    fps                  | 202         |
|    iterations           | 281         |
|    time_elapsed         | 2842        |
|    total_timesteps      | 575488      |
| train/                  |             |
|    approx_kl            | 0.016428413 |
|    clip_fraction        | 0.101       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.956      |
|    explained_variance   | 0.965       |
|    learning_rate        | 1e-05       |
|    loss                 | 16.9        |
|    n_updates            | 2800        |
|    policy_gradient_loss | -0.00289    |
|    value_loss           | 123         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 524         |
|    ep_rew_mean          | 626.77      |
| time/                   |             |
|    fps                  | 202         |
|    iterations           | 291         |
|    time_elapsed         | 2940        |
|    total_timesteps      | 595968      |
| train/                  |             |
|    approx_kl            | 0.015410436 |
|    clip_fraction        | 0.129       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.902      |
|    explained_variance   | 0.89        |
|    learning_rate        | 1e-05       |
|    loss                 | 71.6        |
|    n_updates            | 2900        |
|    policy_gradient_loss | -0.00903    |
|    value_loss           | 241         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 564         |
|    ep_rew_mean          | 731.81      |
| time/                   |             |
|    fps                  | 202         |
|    iterations           | 301         |
|    time_elapsed         | 3041        |
|    total_timesteps      | 616448      |
| train/                  |             |
|    approx_kl            | 0.010139663 |
|    clip_fraction        | 0.142       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.987      |
|    explained_variance   | 0.944       |
|    learning_rate        | 1e-05       |
|    loss                 | 79.2        |
|    n_updates            | 3000        |
|    policy_gradient_loss | -0.00248    |
|    value_loss           | 114         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 585         |
|    ep_rew_mean          | 799.77      |
| time/                   |             |
|    fps                  | 202         |
|    iterations           | 311         |
|    time_elapsed         | 3140        |
|    total_timesteps      | 636928      |
| train/                  |             |
|    approx_kl            | 0.005480572 |
|    clip_fraction        | 0.0699      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.579      |
|    explained_variance   | 0.88        |
|    learning_rate        | 1e-05       |
|    loss                 | 137         |
|    n_updates            | 3100        |
|    policy_gradient_loss | -0.00581    |
|    value_loss           | 264         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 614         |
|    ep_rew_mean          | 845.73      |
| time/                   |             |
|    fps                  | 203         |
|    iterations           | 321         |
|    time_elapsed         | 3238        |
|    total_timesteps      | 657408      |
| train/                  |             |
|    approx_kl            | 0.015161225 |
|    clip_fraction        | 0.0644      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.42       |
|    explained_variance   | 0.727       |
|    learning_rate        | 1e-05       |
|    loss                 | 98.4        |
|    n_updates            | 3200        |
|    policy_gradient_loss | -0.0112     |
|    value_loss           | 321         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 613         |
|    ep_rew_mean          | 838.74      |
| time/                   |             |
|    fps                  | 203         |
|    iterations           | 331         |
|    time_elapsed         | 3337        |
|    total_timesteps      | 677888      |
| train/                  |             |
|    approx_kl            | 0.012460374 |
|    clip_fraction        | 0.0969      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.881      |
|    explained_variance   | 0.959       |
|    learning_rate        | 1e-05       |
|    loss                 | 24.9        |
|    n_updates            | 3300        |
|    policy_gradient_loss | -0.00605    |
|    value_loss           | 80          |
-----------------------------------------

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 624          |
|    ep_rew_mean          | 778.72       |
| time/                   |              |
|    fps                  | 203          |
|    iterations           | 341          |
|    time_elapsed         | 3436         |
|    total_timesteps      | 698368       |
| train/                  |              |
|    approx_kl            | 0.0066098506 |
|    clip_fraction        | 0.0785       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.833       |
|    explained_variance   | 0.988        |
|    learning_rate        | 1e-05        |
|    loss                 | 22.4         |
|    n_updates            | 3400         |
|    policy_gradient_loss | -0.00372     |
|    value_loss           | 51.9         |
------------------------------------------

----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 622        |
|    ep_rew_mean          | 738.46     |
| time/                   |            |
|    fps                  | 203        |
|    iterations           | 351        |
|    time_elapsed         | 3536       |
|    total_timesteps      | 718848     |
| train/                  |            |
|    approx_kl            | 0.00668725 |
|    clip_fraction        | 0.0899     |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.886     |
|    explained_variance   | 0.966      |
|    learning_rate        | 1e-05      |
|    loss                 | 42.2       |
|    n_updates            | 3500       |
|    policy_gradient_loss | -0.00715   |
|    value_loss           | 105        |
----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 594         |
|    ep_rew_mean          | 718.7       |
| time/                   |             |
|    fps                  | 203         |
|    iterations           | 361         |
|    time_elapsed         | 3634        |
|    total_timesteps      | 739328      |
| train/                  |             |
|    approx_kl            | 0.027531888 |
|    clip_fraction        | 0.18        |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.17       |
|    explained_variance   | 0.971       |
|    learning_rate        | 1e-05       |
|    loss                 | 9.72        |
|    n_updates            | 3600        |
|    policy_gradient_loss | -0.00469    |
|    value_loss           | 31.5        |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 578         |
|    ep_rew_mean          | 774.03      |
| time/                   |             |
|    fps                  | 203         |
|    iterations           | 371         |
|    time_elapsed         | 3733        |
|    total_timesteps      | 759808      |
| train/                  |             |
|    approx_kl            | 0.013492268 |
|    clip_fraction        | 0.116       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.674      |
|    explained_variance   | 0.823       |
|    learning_rate        | 1e-05       |
|    loss                 | 75.8        |
|    n_updates            | 3700        |
|    policy_gradient_loss | -0.00721    |
|    value_loss           | 243         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 570         |
|    ep_rew_mean          | 723.71      |
| time/                   |             |
|    fps                  | 203         |
|    iterations           | 381         |
|    time_elapsed         | 3832        |
|    total_timesteps      | 780288      |
| train/                  |             |
|    approx_kl            | 0.008734304 |
|    clip_fraction        | 0.138       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.861      |
|    explained_variance   | 0.956       |
|    learning_rate        | 1e-05       |
|    loss                 | 23.2        |
|    n_updates            | 3800        |
|    policy_gradient_loss | -0.00545    |
|    value_loss           | 194         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 591         |
|    ep_rew_mean          | 734.75      |
| time/                   |             |
|    fps                  | 203         |
|    iterations           | 391         |
|    time_elapsed         | 3931        |
|    total_timesteps      | 800768      |
| train/                  |             |
|    approx_kl            | 0.027822316 |
|    clip_fraction        | 0.0742      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.579      |
|    explained_variance   | 0.909       |
|    learning_rate        | 1e-05       |
|    loss                 | 40.2        |
|    n_updates            | 3900        |
|    policy_gradient_loss | -0.008      |
|    value_loss           | 265         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 609         |
|    ep_rew_mean          | 741.87      |
| time/                   |             |
|    fps                  | 203         |
|    iterations           | 401         |
|    time_elapsed         | 4030        |
|    total_timesteps      | 821248      |
| train/                  |             |
|    approx_kl            | 0.009841463 |
|    clip_fraction        | 0.0855      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.65       |
|    explained_variance   | 0.722       |
|    learning_rate        | 1e-05       |
|    loss                 | 70.8        |
|    n_updates            | 4000        |
|    policy_gradient_loss | -0.00305    |
|    value_loss           | 289         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 623         |
|    ep_rew_mean          | 819.38      |
| time/                   |             |
|    fps                  | 203         |
|    iterations           | 411         |
|    time_elapsed         | 4129        |
|    total_timesteps      | 841728      |
| train/                  |             |
|    approx_kl            | 0.005224688 |
|    clip_fraction        | 0.0597      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.441      |
|    explained_variance   | 0.843       |
|    learning_rate        | 1e-05       |
|    loss                 | 98.2        |
|    n_updates            | 4100        |
|    policy_gradient_loss | -0.00339    |
|    value_loss           | 200         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 598         |
|    ep_rew_mean          | 784.74      |
| time/                   |             |
|    fps                  | 203         |
|    iterations           | 421         |
|    time_elapsed         | 4228        |
|    total_timesteps      | 862208      |
| train/                  |             |
|    approx_kl            | 0.012124794 |
|    clip_fraction        | 0.0878      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.649      |
|    explained_variance   | 0.919       |
|    learning_rate        | 1e-05       |
|    loss                 | 28          |
|    n_updates            | 4200        |
|    policy_gradient_loss | -0.00509    |
|    value_loss           | 136         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 595         |
|    ep_rew_mean          | 779.95      |
| time/                   |             |
|    fps                  | 203         |
|    iterations           | 431         |
|    time_elapsed         | 4326        |
|    total_timesteps      | 882688      |
| train/                  |             |
|    approx_kl            | 0.007914569 |
|    clip_fraction        | 0.0756      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.655      |
|    explained_variance   | 0.848       |
|    learning_rate        | 1e-05       |
|    loss                 | 349         |
|    n_updates            | 4300        |
|    policy_gradient_loss | -0.00045    |
|    value_loss           | 272         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 584         |
|    ep_rew_mean          | 749.67      |
| time/                   |             |
|    fps                  | 204         |
|    iterations           | 441         |
|    time_elapsed         | 4426        |
|    total_timesteps      | 903168      |
| train/                  |             |
|    approx_kl            | 0.008060316 |
|    clip_fraction        | 0.0717      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.608      |
|    explained_variance   | 0.894       |
|    learning_rate        | 1e-05       |
|    loss                 | 226         |
|    n_updates            | 4400        |
|    policy_gradient_loss | -0.00514    |
|    value_loss           | 316         |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 592         |
|    ep_rew_mean          | 765.98      |
| time/                   |             |
|    fps                  | 204         |
|    iterations           | 451         |
|    time_elapsed         | 4526        |
|    total_timesteps      | 923648      |
| train/                  |             |
|    approx_kl            | 0.039807193 |
|    clip_fraction        | 0.148       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.777      |
|    explained_variance   | 0.963       |
|    learning_rate        | 1e-05       |
|    loss                 | 19.3        |
|    n_updates            | 4500        |
|    policy_gradient_loss | -0.0142     |
|    value_loss           | 94.2        |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 610         |
|    ep_rew_mean          | 728.22      |
| time/                   |             |
|    fps                  | 204         |
|    iterations           | 461         |
|    time_elapsed         | 4624        |
|    total_timesteps      | 944128      |
| train/                  |             |
|    approx_kl            | 0.025326822 |
|    clip_fraction        | 0.145       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.85       |
|    explained_variance   | 0.989       |
|    learning_rate        | 1e-05       |
|    loss                 | 8.68        |
|    n_updates            | 4600        |
|    policy_gradient_loss | -0.0135     |
|    value_loss           | 28.1        |
-----------------------------------------

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 647         |
|    ep_rew_mean          | 725.03      |
| time/                   |             |
|    fps                  | 204         |
|    iterations           | 471         |
|    time_elapsed         | 4722        |
|    total_timesteps      | 964608      |
| train/                  |             |
|    approx_kl            | 0.008841213 |
|    clip_fraction        | 0.119       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.942      |
|    explained_variance   | 0.991       |
|    learning_rate        | 1e-05       |
|    loss                 | 6.59        |
|    n_updates            | 4700        |
|    policy_gradient_loss | -0.00535    |
|    value_loss           | 22.7        |
-----------------------------------------

----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 653        |
|    ep_rew_mean          | 704.18     |
| time/                   |            |
|    fps                  | 204        |
|    iterations           | 481        |
|    time_elapsed         | 4821       |
|    total_timesteps      | 985088     |
| train/                  |            |
|    approx_kl            | 0.00989325 |
|    clip_fraction        | 0.0854     |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.727     |
|    explained_variance   | 0.969      |
|    learning_rate        | 1e-05      |
|    loss                 | 16         |
|    n_updates            | 4800       |
|    policy_gradient_loss | -0.00339   |
|    value_loss           | 63.7       |
----------------------------------------

## Testing the Model

In [11]:
# Load the model
model = PPO.load('E:/Train/finished_model.zip',
                 #env = create_env(4),
                 reset_num_timesteps = False,
                 tensorboard_log=TRAIN_DIR)

In [None]:
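env = create_env(1)
state = env.reset()
try:
    # Let the trained agent play until the kernel is interrupted
    while True:
        # predict() also returns a recurrent hidden state, which is unused here
        action, _ = model.predict(state)
        state, reward, done, info = env.step(action)
        env.render()
except KeyboardInterrupt:
    pass
finally:
    # Always release the emulator, even after an interrupt
    env.close()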

## The Results



# Future improvements

Two basic improvements are lowering the learning rate and giving the model more time to train. Algorithms and model architectures other than PPO could also be tried. If the model is struggling with pixel-perfect boundaries, the screen could be kept at its original size rather than resized. If some color palettes clash in the grayscale image, the full RGB bands could be used instead. However, both of these changes will slow down training.
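As a rough sketch of the learning-rate idea: instead of picking one lower constant, the rate could decay over the course of training. The helper below is hypothetical (the name `linear_schedule` and the `1e-6` end value are my own choices for illustration, not something used in this project). It assumes the training library accepts a callable that receives the remaining training progress, going from 1.0 at the start to 0.0 at the end, which is how Stable-Baselines3 treats a callable `learning_rate`.

```python
from typing import Callable


def linear_schedule(initial_lr: float, final_lr: float = 0.0) -> Callable[[float], float]:
    """Build a schedule that maps remaining progress (1.0 -> 0.0) to a learning rate."""

    def schedule(progress_remaining: float) -> float:
        # Interpolate linearly: initial_lr at the start, final_lr at the end
        return final_lr + progress_remaining * (initial_lr - final_lr)

    return schedule


# Hypothetical values: start at the 1e-05 shown in the logs, end ten times lower
lr_schedule = linear_schedule(1e-5, 1e-6)

for progress in (1.0, 0.5, 0.0):
    print(f"progress_remaining={progress:.1f} -> lr={lr_schedule(progress):.2e}")
```

If the model is built with Stable-Baselines3, `lr_schedule` could be passed as the `learning_rate` argument when constructing `PPO` in place of a fixed number.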

One issue the AI faces is having to spend a lot of time playing through the first part of a level before it reaches a new part to learn from. This is especially troublesome if we want it to play all the levels. The gym-super-mario-bros environment allows Mario to start on a random world, which can help with this. Another remedy is to save the current state every second (keeping up to 5 saved states) and, when Mario dies, revert to the state from 5 seconds ago to retry the new part.
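The save-state idea can be sketched without touching the emulator at all. Everything below is hypothetical: `RewindBuffer`, `ToyGame`, and the `save`/`load` callables are stand-ins I made up for whatever snapshot mechanism the real environment would expose, and `interval=60` assumes one environment step per frame at 60 fps (with frame skipping, the interval would need to be smaller). It is only meant to show the bookkeeping, not a working integration.

```python
from collections import deque
from typing import Any, Callable, Deque


class RewindBuffer:
    """Keep the most recent `capacity` snapshots, one taken every `interval` steps."""

    def __init__(self, snapshot: Callable[[], Any], restore: Callable[[Any], None],
                 interval: int, capacity: int = 5) -> None:
        self._snapshot = snapshot
        self._restore = restore
        self._interval = interval
        # A deque with maxlen silently drops the oldest snapshot when full
        self._states: Deque[Any] = deque(maxlen=capacity)
        self._steps = 0

    def step(self) -> None:
        """Call once per environment step, before the step is applied."""
        if self._steps % self._interval == 0:
            self._states.append(self._snapshot())
        self._steps += 1

    def rewind(self) -> bool:
        """Restore the oldest snapshot still held; return False if there is none."""
        if not self._states:
            return False
        self._restore(self._states[0])
        # Newer snapshots now lie in the "future", so start over from here
        self._states.clear()
        self._steps = 0
        return True


class ToyGame:
    """Stand-in for the emulator: its whole state is a frame counter."""

    def __init__(self) -> None:
        self.frame = 0

    def save(self) -> int:
        return self.frame

    def load(self, state: int) -> None:
        self.frame = state


game = ToyGame()
buffer = RewindBuffer(game.save, game.load, interval=60, capacity=5)

# Play 10 "seconds" of 60 steps each, then pretend Mario died
for _ in range(600):
    buffer.step()
    game.frame += 1

buffer.rewind()
print(game.frame)  # 300, i.e. 5 seconds before the death at frame 600
```

Using a fixed-length `deque` keeps memory bounded, since only the five most recent snapshots ever exist at once.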