<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Imports-and-setup" data-toc-modified-id="Imports-and-setup-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Imports and setup</a></span></li><li><span><a href="#Lunar-Lander-Environment" data-toc-modified-id="Lunar-Lander-Environment-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Lunar Lander Environment</a></span><ul class="toc-item"><li><span><a href="#What-do-we-want?" data-toc-modified-id="What-do-we-want?-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>What do we want?</a></span></li><li><span><a href="#Start" data-toc-modified-id="Start-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Start</a></span><ul class="toc-item"><li><span><a href="#Basic-random-actions" data-toc-modified-id="Basic-random-actions-2.2.1"><span class="toc-item-num">2.2.1&nbsp;&nbsp;</span>Basic random actions</a></span></li><li><span><a href="#Basic-algorithm-picking-an-action" data-toc-modified-id="Basic-algorithm-picking-an-action-2.2.2"><span class="toc-item-num">2.2.2&nbsp;&nbsp;</span>Basic algorithm picking an action</a></span></li></ul></li><li><span><a href="#Saving-and-loading-models" data-toc-modified-id="Saving-and-loading-models-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Saving and loading models</a></span><ul class="toc-item"><li><span><a href="#Train-and-save-some-PPO-models" data-toc-modified-id="Train-and-save-some-PPO-models-2.3.1"><span class="toc-item-num">2.3.1&nbsp;&nbsp;</span>Train and save some PPO models</a></span></li><li><span><a href="#Train-and-save-some-A2C-models" data-toc-modified-id="Train-and-save-some-A2C-models-2.3.2"><span class="toc-item-num">2.3.2&nbsp;&nbsp;</span>Train and save some A2C models</a></span></li><li><span><a href="#Load-some-models" data-toc-modified-id="Load-some-models-2.3.3"><span class="toc-item-num">2.3.3&nbsp;&nbsp;</span>Load some models</a></span></li></ul></li></ul></li><li><span><a href="#Custom-Environment" 
data-toc-modified-id="Custom-Environment-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Custom Environment</a></span><ul class="toc-item"><li><span><a href="#What-do-we-want?" data-toc-modified-id="What-do-we-want?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>What do we want?</a></span></li></ul></li></ul></div>

# Imports and setup

A [video link](https://www.youtube.com/watch?v=XbWhJdQgi7E&list=PLQVvvaa0QuDf0O2DWwLZBfJeYY-JOeZB1&index=1&t=7s) to the tutorial I am following.

A [text link](https://pythonprogramming.net/introduction-reinforcement-learning-stable-baselines-3-tutorial/), if you are that way inclined.

This is just to help me familiarise myself with reinforcement learning; I tried to get into it in the past but got bored.

I have the table of contents plugin activated in my notebooks, and I would highly suggest using it along with collapsible headings. Together they make navigating this notebook much easier.

In [1]:
!date

Thu 17 Feb 2022 15:00:52 NZDT


In [13]:
from tqdm.auto import tqdm

In [3]:
from IPython.display import display, HTML

# Widen the notebook's main container to 85% of the browser window
display(HTML("<style>.container { width:85% !important; }</style>"))

# Lunar Lander Environment

## What do we want?
* A very basic understanding of how reinforcement learning works

* Mimic the _LunarLander_ example from OpenAI Gym

**Side Note**: The tutor likes to constantly rewrite the same file, building up the process step by step. I will instead make a new cell every time they reach a new topic or iteration of the process.

## Start

### Basic random actions

In [20]:
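import gym

env = gym.make("LunarLander-v2")

# After creating an environment, we need to reset it before using it
env.reset()

for step in range(200):
    env.render()
    # Take a random action. Pre-0.26 gym API: step() returns 4 values (newer versions return 5)
    obs, reward, done, info = env.step(env.action_space.sample())
    print(reward)
    # Start a new episode once the lander has crashed or landed
    if done:
        env.reset()

env.close()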

-0.447524422554493
-3.8767619686296255
-2.3093069830390958
0.9173874309346683
-4.195947235907528
-5.72795666640518
0.42032748617768334
-0.5233807257040997
1.2003884961460483
1.4212392965232834
-3.79897247857524
1.624323732192438
1.6994997013431725
-3.6969935649800787
0.09144582272230764
0.878363122997456
-3.3727232560114997
1.804787293566535
1.0639306333117418
-0.017093455086295534
1.9647508927575654
-0.1267357128212143
0.7329161827238124
-0.5212742225848206
-0.8503572764268756
-1.0091147952510073
-3.88306902524518
-0.10867512606577634
-4.346111946198351
-1.474354101010049
-1.3782708426236059
-1.7957474761167578
0.2937667488143927
-0.773491259921343
-0.8497429907785659
0.08728403111555963
-2.329847659678704
-1.1581697617969553
-1.236436889135092
-2.242925819368564
-1.2690650434764905
-3.5652214852252824
-2.6699234883554253
-2.0475148015385
-2.7471422410664443
-0.43206977930847645
-3.2837777151760177
-2.6962104622498644
-1.5648686674130659
-2.6964975427082707
-3.1454958234986905
-3.1237
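The loop above prints the reward for each individual step, but what an agent actually tries to maximise is the total reward over an episode (the return). Below is a minimal sketch of the same reset/step loop that sums the rewards of one episode. To keep it runnable without gym and Box2D, `ToyEnv` is a hypothetical stand-in that copies only the pre-0.26 gym interface used in this notebook (`reset()` returns an observation, `step()` returns `obs, reward, done, info`); its `sample_action()` method plays the role of `env.action_space.sample()`.

```python
import random


class ToyEnv:
    """Hypothetical stand-in for a gym environment, using the pre-0.26 gym API.

    Each episode lasts `episode_length` steps. Action 1 earns +1 reward and
    action 0 earns -1, so the best possible return is +episode_length.
    """

    def __init__(self, episode_length=5, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # the "observation" is just the step counter

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else -1.0
        done = self.t >= self.episode_length
        return self.t, reward, done, {}  # obs, reward, done, info

    def sample_action(self):
        # Plays the role of env.action_space.sample() in gym
        return self.rng.choice([0, 1])


def run_episode(env, policy):
    """Run one episode with `policy(obs) -> action` and return the summed reward."""
    obs = env.reset()
    done = False
    episode_return = 0.0
    while not done:
        action = policy(obs)
        obs, reward, done, info = env.step(action)
        episode_return += reward
    return episode_return


env = ToyEnv()
random_returns = [run_episode(env, lambda obs: env.sample_action()) for _ in range(3)]
print(random_returns)                   # varies from episode to episode
print(run_episode(env, lambda obs: 1))  # always-1 policy: 5.0
```

A random policy gives a noisy return that sits somewhere between the worst and best case; training is about pushing that return up consistently.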

### Basic algorithm picking an action

In [21]:
import gym

# Pick an algorithm to use
from stable_baselines3 import A2C as model_algo
# from stable_baselines3 import PPO as model_algo

env = gym.make("LunarLander-v2")
env.reset()

# Use a multilayer perceptron (MLP) policy for now, because it is a good default for vector (non-image) observations like LunarLander's
model = model_algo("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10_000)

episodes = 10

for episode in tqdm(range(episodes)):
    obs = env.reset()
    done = False
    
    while not done:
        env.render()
        action, _states = model.predict(obs)
        obs, reward, done, info = env.step(action)
    
env.close()

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 123      |
|    ep_rew_mean        | -357     |
| time/                 |          |
|    fps                | 1056     |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -1.01    |
|    explained_variance | 0.0436   |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | -10.1    |
|    value_loss         | 578      |
------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 120       |
|    ep_rew_mean        | -378      |
| time/                 |           |
|    fps                | 1097      |
|    iterations         | 200       |
|    time_e

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 135      |
|    ep_rew_mean        | -308     |
| time/                 |          |
|    fps                | 888      |
|    iterations         | 1400     |
|    time_elapsed       | 7        |
|    total_timesteps    | 7000     |
| train/                |          |
|    entropy_loss       | -0.622   |
|    explained_variance | -0.00615 |
|    learning_rate      | 0.0007   |
|    n_updates          | 1399     |
|    policy_loss        | 7.4      |
|    value_loss         | 349      |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 135      |
|    ep_rew_mean        | -308     |
| time/                 |          |
|    fps                | 861      |
|    iterations         | 1500     |
|    time_elapsed       | 8        |
|    total_timesteps    | 7500     |
| train/                |          |
|

  0%|          | 0/10 [00:00<?, ?it/s]

KeyboardInterrupt: 
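The `MlpPolicy` in the cell above is a multilayer perceptron: a stack of fully connected layers that maps an observation to a probability for each action. As I understand SB3's defaults for A2C and PPO (which may vary between versions), the policy network has two hidden layers of 64 units with tanh activations, plus a separate value network of the same shape that is omitted here. The sketch below is a plain-Python forward pass with random, untrained weights, assuming LunarLander's 8-dimensional observation and 4 discrete actions. It only illustrates the shape of the computation; it is not SB3's implementation, and the `deterministic` flag merely mirrors the idea behind the argument of the same name in `model.predict`.

```python
import math
import random

OBS_DIM, HIDDEN, N_ACTIONS = 8, 64, 4  # LunarLander: 8 observations, 4 discrete actions


def init_layer(n_in, n_out, rng):
    """Random weights (scaled by 1/sqrt(n_in)) and zero biases for one dense layer."""
    weights = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
               for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(weights, biases, x, activation=None):
    """Fully connected layer: activation(W x + b), computed row by row."""
    out = []
    for row, b in zip(weights, biases):
        z = sum(w * xi for w, xi in zip(row, x)) + b
        out.append(activation(z) if activation else z)
    return out


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
layers = [
    init_layer(OBS_DIM, HIDDEN, rng),    # observation -> hidden layer 1
    init_layer(HIDDEN, HIDDEN, rng),     # hidden layer 1 -> hidden layer 2
    init_layer(HIDDEN, N_ACTIONS, rng),  # hidden layer 2 -> one score per action
]


def action_probs(obs):
    """Forward pass: two tanh hidden layers, then a softmax over the actions."""
    h = dense(*layers[0], obs, math.tanh)
    h = dense(*layers[1], h, math.tanh)
    return softmax(dense(*layers[2], h))


def act(obs, rng, deterministic=False):
    """Sample an action from the policy, or take the most likely one if deterministic."""
    probs = action_probs(obs)
    if deterministic:
        return max(range(N_ACTIONS), key=lambda a: probs[a])
    return rng.choices(range(N_ACTIONS), weights=probs)[0]


print(action_probs([0.1, -0.2, 0.3, 0.0, 0.5, -0.1, 1.0, 0.0]))
```

Training adjusts these weights so that actions leading to higher returns get higher probabilities.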

## Saving and loading models

In [None]:
import gym
import os

**Side note**: While the models are training, you can look at the TensorBoard logs for a pretty helpful view of how training is going.

Enter the following into a terminal and then use your browser to navigate to the [TensorBoard dashboard](http://localhost:6006):

```tensorboard --logdir=logs```

If you set `logs_dir` in the code below to something else, you will need to update the `--logdir` argument above to match.
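Every `learn()` call in the training cells below reports `Logging to logs/PPO_0` (or `logs/A2C_0`), so all of the calls should end up as one continuous curve in TensorBoard rather than as separate runs. As far as I can tell from reading the SB3 source, the run folder is chosen roughly like this: find the highest existing `<tb_log_name>_<N>` folder under the log directory and open folder `N + 1`, unless `reset_num_timesteps=False`, in which case folder `N` is reused (which works out to `_0` on a completely fresh log directory). `run_dir` below is a hypothetical re-implementation of that reading for illustration, not an SB3 function, and the real behaviour may differ between versions.

```python
import os
import re


def run_dir(log_dir, tb_log_name, reset_num_timesteps):
    """Hypothetical re-implementation of how SB3 seems to choose the run folder.

    Not an SB3 function; based on my reading of the library and may differ by version.
    """
    # Find the highest N among existing "<tb_log_name>_<N>" folders (0 if none exist)
    latest = 0
    pattern = re.compile(rf"{re.escape(tb_log_name)}_(\d+)")
    if os.path.isdir(log_dir):
        for name in os.listdir(log_dir):
            match = pattern.fullmatch(name)
            if match:
                latest = max(latest, int(match.group(1)))
    if not reset_num_timesteps:
        # Keep writing into the latest run instead of opening a new one
        latest -= 1
    return os.path.join(log_dir, f"{tb_log_name}_{latest + 1}")


# On a fresh "logs" directory this should give logs/PPO_0
print(run_dir("logs", "PPO", reset_num_timesteps=False))
```

With the default `reset_num_timesteps=True`, the same logic would give `logs/PPO_1` on a fresh directory and a new numbered folder for every call.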

### Train and save some PPO models

In [9]:
from stable_baselines3 import PPO as model_algo

model_type = "PPO"
models_dir = f"models/{model_type}"
logs_dir = "logs"
os.makedirs(models_dir, exist_ok=True)
os.makedirs(logs_dir, exist_ok=True)

env = gym.make("LunarLander-v2")
env.reset()

model = model_algo("MlpPolicy", env, verbose=1, tensorboard_log=logs_dir)

TIME_STEPS = 10_000
for i in tqdm(range(1, 30)):
    # reset_num_timesteps=False keeps the step counter (and the TensorBoard curve)
    # running across calls instead of restarting from zero each time
    model.learn(
        total_timesteps=TIME_STEPS,
        reset_num_timesteps=False,
        tb_log_name=model_type
    )
    # Save a checkpoint named after the (approximate) total number of timesteps trained
    model.save(f"{models_dir}/{TIME_STEPS * i}")

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


  0%|          | 0/29 [00:00<?, ?it/s]

Logging to logs/PPO_0
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 83.3     |
|    ep_rew_mean     | -181     |
| time/              |          |
|    fps             | 595      |
|    iterations      | 1        |
|    time_elapsed    | 3        |
|    total_timesteps | 2048     |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 86.1        |
|    ep_rew_mean          | -183        |
| time/                   |             |
|    fps                  | 706         |
|    iterations           | 2           |
|    time_elapsed         | 5           |
|    total_timesteps      | 4096        |
| train/                  |             |
|    approx_kl            | 0.012127435 |
|    clip_fraction        | 0.0587      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.38       |
|    explained_variance   | 0.0127      |
|    lea

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 140         |
|    ep_rew_mean          | -119        |
| time/                   |             |
|    fps                  | 3417        |
|    iterations           | 3           |
|    time_elapsed         | 7           |
|    total_timesteps      | 26624       |
| train/                  |             |
|    approx_kl            | 0.008806859 |
|    clip_fraction        | 0.0683      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.2        |
|    explained_variance   | 0.105       |
|    learning_rate        | 0.0003      |
|    loss                 | 236         |
|    n_updates            | 120         |
|    policy_gradient_loss | -0.00784    |
|    value_loss           | 441         |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 147   

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 260         |
|    ep_rew_mean          | -56.6       |
| time/                   |             |
|    fps                  | 3033        |
|    iterations           | 4           |
|    time_elapsed         | 16          |
|    total_timesteps      | 49152       |
| train/                  |             |
|    approx_kl            | 0.005513802 |
|    clip_fraction        | 0.0346      |
|    clip_range           | 0.2         |
|    entropy_loss         | -1.08       |
|    explained_variance   | 0.74        |
|    learning_rate        | 0.0003      |
|    loss                 | 45.5        |
|    n_updates            | 230         |
|    policy_gradient_loss | -0.00545    |
|    value_loss           | 132         |
-----------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 279   

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 411         |
|    ep_rew_mean          | -6.52       |
| time/                   |             |
|    fps                  | 3208        |
|    iterations           | 5           |
|    time_elapsed         | 22          |
|    total_timesteps      | 71680       |
| train/                  |             |
|    approx_kl            | 0.011646489 |
|    clip_fraction        | 0.107       |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.988      |
|    explained_variance   | 0.932       |
|    learning_rate        | 0.0003      |
|    loss                 | 5.53        |
|    n_updates            | 340         |
|    policy_gradient_loss | -0.00784    |
|    value_loss           | 14.5        |
-----------------------------------------
Logging to logs/PPO_0
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 429  

Logging to logs/PPO_0
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 566      |
|    ep_rew_mean     | 35.8     |
| time/              |          |
|    fps             | 29332    |
|    iterations      | 1        |
|    time_elapsed    | 3        |
|    total_timesteps | 94208    |
---------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 574          |
|    ep_rew_mean          | 39.6         |
| time/                   |              |
|    fps                  | 11212        |
|    iterations           | 2            |
|    time_elapsed         | 8            |
|    total_timesteps      | 96256        |
| train/                  |              |
|    approx_kl            | 0.0061628595 |
|    clip_fraction        | 0.0209       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.829       |
|    explained_variance   | 0.749   

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 701         |
|    ep_rew_mean          | 68.1        |
| time/                   |             |
|    fps                  | 13147       |
|    iterations           | 2           |
|    time_elapsed         | 8           |
|    total_timesteps      | 116736      |
| train/                  |             |
|    approx_kl            | 0.010563273 |
|    clip_fraction        | 0.0925      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.95       |
|    explained_variance   | 0.805       |
|    learning_rate        | 0.0003      |
|    loss                 | 84          |
|    n_updates            | 560         |
|    policy_gradient_loss | -0.000985   |
|    value_loss           | 52.4        |
-----------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 710 

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 718          |
|    ep_rew_mean          | 90           |
| time/                   |              |
|    fps                  | 11887        |
|    iterations           | 3            |
|    time_elapsed         | 11           |
|    total_timesteps      | 139264       |
| train/                  |              |
|    approx_kl            | 0.0053332155 |
|    clip_fraction        | 0.0371       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.666       |
|    explained_variance   | 0.873        |
|    learning_rate        | 0.0003       |
|    loss                 | 37.6         |
|    n_updates            | 670          |
|    policy_gradient_loss | -0.00242     |
|    value_loss           | 44.1         |
------------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_m

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 724         |
|    ep_rew_mean          | 83.7        |
| time/                   |             |
|    fps                  | 8269        |
|    iterations           | 4           |
|    time_elapsed         | 19          |
|    total_timesteps      | 161792      |
| train/                  |             |
|    approx_kl            | 0.010319103 |
|    clip_fraction        | 0.0918      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.956      |
|    explained_variance   | 0.803       |
|    learning_rate        | 0.0003      |
|    loss                 | 25.7        |
|    n_updates            | 780         |
|    policy_gradient_loss | -0.00529    |
|    value_loss           | 45.1        |
-----------------------------------------
----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 711     

----------------------------------------
| rollout/                |            |
|    ep_len_mean          | 670        |
|    ep_rew_mean          | 70.1       |
| time/                   |            |
|    fps                  | 8738       |
|    iterations           | 5          |
|    time_elapsed         | 21         |
|    total_timesteps      | 184320     |
| train/                  |            |
|    approx_kl            | 0.00651766 |
|    clip_fraction        | 0.0293     |
|    clip_range           | 0.2        |
|    entropy_loss         | -0.846     |
|    explained_variance   | 0.678      |
|    learning_rate        | 0.0003     |
|    loss                 | 19.8       |
|    n_updates            | 890        |
|    policy_gradient_loss | -0.000569  |
|    value_loss           | 120        |
----------------------------------------
Logging to logs/PPO_0
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 658      |
|    ep_rew_mea

Logging to logs/PPO_0
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 690      |
|    ep_rew_mean     | 56       |
| time/              |          |
|    fps             | 49923    |
|    iterations      | 1        |
|    time_elapsed    | 4        |
|    total_timesteps | 206848   |
---------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 701         |
|    ep_rew_mean          | 54.5        |
| time/                   |             |
|    fps                  | 23912       |
|    iterations           | 2           |
|    time_elapsed         | 8           |
|    total_timesteps      | 208896      |
| train/                  |             |
|    approx_kl            | 0.006302519 |
|    clip_fraction        | 0.0542      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.764      |
|    explained_variance   | 0.757       |
|    lea

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 711          |
|    ep_rew_mean          | 67           |
| time/                   |              |
|    fps                  | 15382        |
|    iterations           | 3            |
|    time_elapsed         | 15           |
|    total_timesteps      | 231424       |
| train/                  |              |
|    approx_kl            | 0.0045236563 |
|    clip_fraction        | 0.0271       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.82        |
|    explained_variance   | 0.757        |
|    learning_rate        | 0.0003       |
|    loss                 | 33           |
|    n_updates            | 1120         |
|    policy_gradient_loss | -0.00151     |
|    value_loss           | 69.6         |
------------------------------------------
------------------------------------------
| rollout/                |              |
|    ep_len

------------------------------------------
| rollout/                |              |
|    ep_len_mean          | 710          |
|    ep_rew_mean          | 66.7         |
| time/                   |              |
|    fps                  | 20026        |
|    iterations           | 4            |
|    time_elapsed         | 12           |
|    total_timesteps      | 253952       |
| train/                  |              |
|    approx_kl            | 0.0055842455 |
|    clip_fraction        | 0.0823       |
|    clip_range           | 0.2          |
|    entropy_loss         | -0.808       |
|    explained_variance   | 0.634        |
|    learning_rate        | 0.0003       |
|    loss                 | 71.6         |
|    n_updates            | 1230         |
|    policy_gradient_loss | -0.00231     |
|    value_loss           | 105          |
------------------------------------------
-----------------------------------------
| rollout/                |             |
|    ep_len_m

-----------------------------------------
| rollout/                |             |
|    ep_len_mean          | 587         |
|    ep_rew_mean          | 52.4        |
| time/                   |             |
|    fps                  | 17783       |
|    iterations           | 5           |
|    time_elapsed         | 15          |
|    total_timesteps      | 276480      |
| train/                  |             |
|    approx_kl            | 0.015101545 |
|    clip_fraction        | 0.0678      |
|    clip_range           | 0.2         |
|    entropy_loss         | -0.77       |
|    explained_variance   | 0.685       |
|    learning_rate        | 0.0003      |
|    loss                 | 30.3        |
|    n_updates            | 1340        |
|    policy_gradient_loss | -0.00187    |
|    value_loss           | 114         |
-----------------------------------------
Logging to logs/PPO_0
---------------------------------
| rollout/           |          |
|    ep_len_mean     | 575  
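The training loop above saves a checkpoint after every `learn()` call as `models/PPO/<TIME_STEPS * i>`. As far as I know, SB3's `save()` appends a `.zip` extension when the path has none, so the files should come out as `10000.zip`, `20000.zip` and so on up to `290000.zip`. When loading them back, sorting those names as plain strings puts `100000.zip` before `20000.zip`, so it is safer to sort by the number in the file name. `sorted_checkpoints` is a hypothetical helper written for this notebook, not part of SB3.

```python
import re
from pathlib import Path


def sorted_checkpoints(models_dir):
    """Return the "<timesteps>.zip" files in models_dir, ordered by timestep number.

    Hypothetical helper; files whose stem is not a plain integer are ignored.
    """
    checkpoints = []
    for path in Path(models_dir).glob("*.zip"):
        if re.fullmatch(r"\d+", path.stem):
            checkpoints.append((int(path.stem), path))
    # Sort numerically: as strings, "100000.zip" would come before "20000.zip"
    return [path for _, path in sorted(checkpoints)]
```

With SB3 installed, the newest PPO checkpoint could then be loaded with something like `PPO.load(sorted_checkpoints("models/PPO")[-1], env=env)`.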

### Train and save some A2C models

A2C is apparently an older algorithm that isn't used as much anymore, but it should still be useful as a point of comparison with PPO.
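One visible difference between the two algorithms in the logs is how many environment steps each `learn()` call really runs. In SB3, both are on-policy algorithms that only check the step counter between whole rollouts of `n_steps` transitions per environment, and with `reset_num_timesteps=False` each call's target is the current counter plus `total_timesteps`. Assuming SB3's default `n_steps` of 2048 for PPO and 5 for A2C with a single environment, a 10,000-step request becomes 5 rollouts (10,240 steps) for PPO but exactly 10,000 steps for A2C. The PPO log above that reports `total_timesteps` 49152 at iteration 4 fits this: four earlier calls of 10,240 steps plus 4 × 2,048. The function below is a hypothetical sketch of that arithmetic, not an SB3 API.

```python
import math


def timesteps_after_calls(requested, n_steps, n_calls, n_envs=1):
    """Estimate the step counter after n_calls of learn(requested, reset_num_timesteps=False).

    Assumes an on-policy algorithm that only stops between whole rollouts of
    n_steps * n_envs transitions, and that each call must advance the counter
    by at least `requested` steps.
    """
    rollout_size = n_steps * n_envs
    num_timesteps = 0
    for _ in range(n_calls):
        # Round each call up to a whole number of rollouts
        num_timesteps += math.ceil(requested / rollout_size) * rollout_size
    return num_timesteps


# SB3 defaults as I understand them: PPO n_steps=2048, A2C n_steps=5 (one environment)
for name, n_steps in [("PPO", 2048), ("A2C", 5)]:
    print(name, timesteps_after_calls(10_000, n_steps, n_calls=29))
# PPO 296960
# A2C 290000
```

So the last PPO checkpoint, saved as `290000`, has actually seen about 2.4% more steps than its name suggests, while the A2C checkpoint names should be exact.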

In [10]:
from stable_baselines3 import A2C as model_algo

model_type = "A2C"
models_dir = f"models/{model_type}"
logs_dir = "logs"
os.makedirs(models_dir, exist_ok=True)
os.makedirs(logs_dir, exist_ok=True)

env = gym.make("LunarLander-v2")
env.reset()

model = model_algo("MlpPolicy", env, verbose=1, tensorboard_log=logs_dir)

TIME_STEPS = 10_000
for i in tqdm(range(1, 30)):
    # Same setup as the PPO cell: one continuous step counter and TensorBoard run
    model.learn(
        total_timesteps=TIME_STEPS,
        reset_num_timesteps=False,
        tb_log_name=model_type
    )
    # Save a checkpoint named after the total number of timesteps trained so far
    model.save(f"{models_dir}/{TIME_STEPS * i}")

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


  0%|          | 0/29 [00:00<?, ?it/s]

Logging to logs/A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 113      |
|    ep_rew_mean        | -311     |
| time/                 |          |
|    fps                | 879      |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 500      |
| train/                |          |
|    entropy_loss       | -0.809   |
|    explained_variance | 0.071    |
|    learning_rate      | 0.0007   |
|    n_updates          | 99       |
|    policy_loss        | -2.23    |
|    value_loss         | 5.01     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 140      |
|    ep_rew_mean        | -437     |
| time/                 |          |
|    fps                | 817      |
|    iterations         | 200      |
|    time_elapsed       | 1        |
|    total_timesteps    | 1000     |
| train/        

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 196      |
|    ep_rew_mean        | -301     |
| time/                 |          |
|    fps                | 690      |
|    iterations         | 1400     |
|    time_elapsed       | 10       |
|    total_timesteps    | 7000     |
| train/                |          |
|    entropy_loss       | -0.273   |
|    explained_variance | 0.00457  |
|    learning_rate      | 0.0007   |
|    n_updates          | 1399     |
|    policy_loss        | -0.331   |
|    value_loss         | 37.5     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 204      |
|    ep_rew_mean        | -294     |
| time/                 |          |
|    fps                | 689      |
|    iterations         | 1500     |
|    time_elapsed       | 10       |
|    total_timesteps    | 7500     |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 207      |
|    ep_rew_mean        | -204     |
| time/                 |          |
|    fps                | 3422     |
|    iterations         | 700      |
|    time_elapsed       | 3        |
|    total_timesteps    | 13500    |
| train/                |          |
|    entropy_loss       | -0.024   |
|    explained_variance | -0.0121  |
|    learning_rate      | 0.0007   |
|    n_updates          | 2699     |
|    policy_loss        | 0.018    |
|    value_loss         | 46.3     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 207      |
|    ep_rew_mean        | -205     |
| time/                 |          |
|    fps                | 3139     |
|    iterations         | 800      |
|    time_elapsed       | 4        |
|    total_timesteps    | 14000    |
| train/                |          |
|

Logging to logs/A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 225      |
|    ep_rew_mean        | -164     |
| time/                 |          |
|    fps                | 23561    |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 20500    |
| train/                |          |
|    entropy_loss       | -0.981   |
|    explained_variance | -0.356   |
|    learning_rate      | 0.0007   |
|    n_updates          | 4099     |
|    policy_loss        | -4.31    |
|    value_loss         | 62.7     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 226      |
|    ep_rew_mean        | -162     |
| time/                 |          |
|    fps                | 13960    |
|    iterations         | 200      |
|    time_elapsed       | 1        |
|    total_timesteps    | 21000    |
| train/        

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 243      |
|    ep_rew_mean        | -74.9    |
| time/                 |          |
|    fps                | 2852     |
|    iterations         | 1400     |
|    time_elapsed       | 9        |
|    total_timesteps    | 27000    |
| train/                |          |
|    entropy_loss       | -0.644   |
|    explained_variance | -0.416   |
|    learning_rate      | 0.0007   |
|    n_updates          | 5399     |
|    policy_loss        | 3.16     |
|    value_loss         | 47.7     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 243      |
|    ep_rew_mean        | -67.2    |
| time/                 |          |
|    fps                | 2698     |
|    iterations         | 1500     |
|    time_elapsed       | 10       |
|    total_timesteps    | 27500    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 236      |
|    ep_rew_mean        | -20.9    |
| time/                 |          |
|    fps                | 8588     |
|    iterations         | 700      |
|    time_elapsed       | 3        |
|    total_timesteps    | 33500    |
| train/                |          |
|    entropy_loss       | -0.235   |
|    explained_variance | -0.476   |
|    learning_rate      | 0.0007   |
|    n_updates          | 6699     |
|    policy_loss        | 0.00149  |
|    value_loss         | 0.00246  |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 239      |
|    ep_rew_mean        | -15.8    |
| time/                 |          |
|    fps                | 7459     |
|    iterations         | 800      |
|    time_elapsed       | 4        |
|    total_timesteps    | 34000    |
| train/                |          |
|

Logging to logs/A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 226      |
|    ep_rew_mean        | 26.8     |
| time/                 |          |
|    fps                | 49430    |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 40500    |
| train/                |          |
|    entropy_loss       | -0.754   |
|    explained_variance | 0.294    |
|    learning_rate      | 0.0007   |
|    n_updates          | 8099     |
|    policy_loss        | 0.944    |
|    value_loss         | 7.74     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 227      |
|    ep_rew_mean        | 24.7     |
| time/                 |          |
|    fps                | 25829    |
|    iterations         | 200      |
|    time_elapsed       | 1        |
|    total_timesteps    | 41000    |
| train/        

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 214      |
|    ep_rew_mean        | 41.7     |
| time/                 |          |
|    fps                | 5074     |
|    iterations         | 1400     |
|    time_elapsed       | 9        |
|    total_timesteps    | 47000    |
| train/                |          |
|    entropy_loss       | -1.14    |
|    explained_variance | 0.509    |
|    learning_rate      | 0.0007   |
|    n_updates          | 9399     |
|    policy_loss        | 2.24     |
|    value_loss         | 4.68     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 210      |
|    ep_rew_mean        | 37.1     |
| time/                 |          |
|    fps                | 4824     |
|    iterations         | 1500     |
|    time_elapsed       | 9        |
|    total_timesteps    | 47500    |
| train/                |          |
|

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 200      |
|    ep_rew_mean        | 13.6     |
| time/                 |          |
|    fps                | 11879    |
|    iterations         | 700      |
|    time_elapsed       | 4        |
|    total_timesteps    | 53500    |
| train/                |          |
|    entropy_loss       | -1       |
|    explained_variance | -0.691   |
|    learning_rate      | 0.0007   |
|    n_updates          | 10699    |
|    policy_loss        | -0.113   |
|    value_loss         | 0.27     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 201      |
|    ep_rew_mean        | 13.4     |
| time/                 |          |
|    fps                | 10264    |
|    iterations         | 800      |
|    time_elapsed       | 5        |
|    total_timesteps    | 54000    |
| train/                |          |
|

Logging to logs/A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 249      |
|    ep_rew_mean        | 6.95     |
| time/                 |          |
|    fps                | 42135    |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 60500    |
| train/                |          |
|    entropy_loss       | -0.896   |
|    explained_variance | 0.378    |
|    learning_rate      | 0.0007   |
|    n_updates          | 12099    |
|    policy_loss        | -3.45    |
|    value_loss         | 9.81     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 257      |
|    ep_rew_mean        | 6.29     |
| time/                 |          |
|    fps                | 22473    |
|    iterations         | 200      |
|    time_elapsed       | 2        |
|    total_timesteps    | 61000    |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 305      |
|    ep_rew_mean        | -2.22    |
| time/                 |          |
|    fps                | 3567     |
|    iterations         | 1400     |
|    time_elapsed       | 18       |
|    total_timesteps    | 67000    |
| train/                |          |
|    entropy_loss       | -0.47    |
|    explained_variance | 0.357    |
|    learning_rate      | 0.0007   |
|    n_updates          | 13399    |
|    policy_loss        | 3.7      |
|    value_loss         | 11.2     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 305      |
|    ep_rew_mean        | -2.22    |
| time/                 |          |
|    fps                | 3387     |
|    iterations         | 1500     |
|    time_elapsed       | 19       |
|    total_timesteps    | 67500    |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 364      |
|    ep_rew_mean        | -1.76    |
| time/                 |          |
|    fps                | 6430     |
|    iterations         | 700      |
|    time_elapsed       | 11       |
|    total_timesteps    | 73500    |
| train/                |          |
|    entropy_loss       | -0.578   |
|    explained_variance | -0.042   |
|    learning_rate      | 0.0007   |
|    n_updates          | 14699    |
|    policy_loss        | 6.77     |
|    value_loss         | 22.3     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 364      |
|    ep_rew_mean        | -1.76    |
| time/                 |          |
|    fps                | 5860     |
|    iterations         | 800      |
|    time_elapsed       | 12       |
|    total_timesteps    | 74000    |
| train/                |          |

Logging to logs/A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 415      |
|    ep_rew_mean        | -4.08    |
| time/                 |          |
|    fps                | 64121    |
|    iterations         | 100      |
|    time_elapsed       | 1        |
|    total_timesteps    | 80500    |
| train/                |          |
|    entropy_loss       | -0.802   |
|    explained_variance | 0.253    |
|    learning_rate      | 0.0007   |
|    n_updates          | 16099    |
|    policy_loss        | 0.29     |
|    value_loss         | 2.51     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 423      |
|    ep_rew_mean        | -4.55    |
| time/                 |          |
|    fps                | 23286    |
|    iterations         | 200      |
|    time_elapsed       | 3        |
|    total_timesteps    | 81000    |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 464      |
|    ep_rew_mean        | -16.7    |
| time/                 |          |
|    fps                | 4196     |
|    iterations         | 1400     |
|    time_elapsed       | 20       |
|    total_timesteps    | 87000    |
| train/                |          |
|    entropy_loss       | -0.97    |
|    explained_variance | 0.121    |
|    learning_rate      | 0.0007   |
|    n_updates          | 17399    |
|    policy_loss        | -1.09    |
|    value_loss         | 2.87     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 464      |
|    ep_rew_mean        | -16.7    |
| time/                 |          |
|    fps                | 3934     |
|    iterations         | 1500     |
|    time_elapsed       | 22       |
|    total_timesteps    | 87500    |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 501      |
|    ep_rew_mean        | -31.7    |
| time/                 |          |
|    fps                | 10471    |
|    iterations         | 700      |
|    time_elapsed       | 8        |
|    total_timesteps    | 93500    |
| train/                |          |
|    entropy_loss       | -0.571   |
|    explained_variance | 0.127    |
|    learning_rate      | 0.0007   |
|    n_updates          | 18699    |
|    policy_loss        | 1.91     |
|    value_loss         | 14.1     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 508      |
|    ep_rew_mean        | -35.4    |
| time/                 |          |
|    fps                | 9191     |
|    iterations         | 800      |
|    time_elapsed       | 10       |
|    total_timesteps    | 94000    |
| train/                |          |

Logging to logs/A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 554      |
|    ep_rew_mean        | -39.5    |
| time/                 |          |
|    fps                | 114907   |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 100500   |
| train/                |          |
|    entropy_loss       | -0.442   |
|    explained_variance | -0.528   |
|    learning_rate      | 0.0007   |
|    n_updates          | 20099    |
|    policy_loss        | 3.97     |
|    value_loss         | 150      |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 555      |
|    ep_rew_mean        | -39.9    |
| time/                 |          |
|    fps                | 55037    |
|    iterations         | 200      |
|    time_elapsed       | 1        |
|    total_timesteps    | 101000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 592      |
|    ep_rew_mean        | -16.9    |
| time/                 |          |
|    fps                | 7492     |
|    iterations         | 1400     |
|    time_elapsed       | 14       |
|    total_timesteps    | 107000   |
| train/                |          |
|    entropy_loss       | -0.599   |
|    explained_variance | 0.792    |
|    learning_rate      | 0.0007   |
|    n_updates          | 21399    |
|    policy_loss        | 0.148    |
|    value_loss         | 2.2      |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 595      |
|    ep_rew_mean        | -14.1    |
| time/                 |          |
|    fps                | 7058     |
|    iterations         | 1500     |
|    time_elapsed       | 15       |
|    total_timesteps    | 107500   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 614      |
|    ep_rew_mean        | 5.42     |
| time/                 |          |
|    fps                | 18925    |
|    iterations         | 700      |
|    time_elapsed       | 5        |
|    total_timesteps    | 113500   |
| train/                |          |
|    entropy_loss       | -0.9     |
|    explained_variance | 0.961    |
|    learning_rate      | 0.0007   |
|    n_updates          | 22699    |
|    policy_loss        | -1.17    |
|    value_loss         | 0.687    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 610      |
|    ep_rew_mean        | 4.86     |
| time/                 |          |
|    fps                | 16834    |
|    iterations         | 800      |
|    time_elapsed       | 6        |
|    total_timesteps    | 114000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 338      |
|    ep_rew_mean        | 31.5     |
| time/                 |          |
|    fps                | 7153     |
|    iterations         | 2000     |
|    time_elapsed       | 16       |
|    total_timesteps    | 120000   |
| train/                |          |
|    entropy_loss       | -0.0606  |
|    explained_variance | 0.363    |
|    learning_rate      | 0.0007   |
|    n_updates          | 23999    |
|    policy_loss        | -0.0614  |
|    value_loss         | 98.4     |
------------------------------------
Logging to logs/A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 316      |
|    ep_rew_mean        | 32.7     |
| time/                 |          |
|    fps                | 154942   |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 120500   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 156      |
|    ep_rew_mean        | 1.03     |
| time/                 |          |
|    fps                | 11143    |
|    iterations         | 1300     |
|    time_elapsed       | 11       |
|    total_timesteps    | 126500   |
| train/                |          |
|    entropy_loss       | -0.206   |
|    explained_variance | 0.571    |
|    learning_rate      | 0.0007   |
|    n_updates          | 25299    |
|    policy_loss        | 0.000921 |
|    value_loss         | 3.65     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 158      |
|    ep_rew_mean        | 4.72     |
| time/                 |          |
|    fps                | 10253    |
|    iterations         | 1400     |
|    time_elapsed       | 12       |
|    total_timesteps    | 127000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 180      |
|    ep_rew_mean        | 8.32     |
| time/                 |          |
|    fps                | 26076    |
|    iterations         | 600      |
|    time_elapsed       | 5        |
|    total_timesteps    | 133000   |
| train/                |          |
|    entropy_loss       | -0.00351 |
|    explained_variance | 0.98     |
|    learning_rate      | 0.0007   |
|    n_updates          | 26599    |
|    policy_loss        | 0.000213 |
|    value_loss         | 0.744    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 179      |
|    ep_rew_mean        | 5.66     |
| time/                 |          |
|    fps                | 22716    |
|    iterations         | 700      |
|    time_elapsed       | 5        |
|    total_timesteps    | 133500   |
| train/                |          |

-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 215       |
|    ep_rew_mean        | 11.1      |
| time/                 |           |
|    fps                | 6896      |
|    iterations         | 2000      |
|    time_elapsed       | 20        |
|    total_timesteps    | 140000    |
| train/                |           |
|    entropy_loss       | -0.084    |
|    explained_variance | 0.116     |
|    learning_rate      | 0.0007    |
|    n_updates          | 27999     |
|    policy_loss        | -0.000444 |
|    value_loss         | 0.00258   |
-------------------------------------
Logging to logs/A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 218      |
|    ep_rew_mean        | 16.2     |
| time/                 |          |
|    fps                | 169040   |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 140500   |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 234      |
|    ep_rew_mean        | 53.5     |
| time/                 |          |
|    fps                | 14009    |
|    iterations         | 1300     |
|    time_elapsed       | 10       |
|    total_timesteps    | 146500   |
| train/                |          |
|    entropy_loss       | -0.407   |
|    explained_variance | 0.501    |
|    learning_rate      | 0.0007   |
|    n_updates          | 29299    |
|    policy_loss        | 0.224    |
|    value_loss         | 2.05     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 238      |
|    ep_rew_mean        | 57.9     |
| time/                 |          |
|    fps                | 13055    |
|    iterations         | 1400     |
|    time_elapsed       | 11       |
|    total_timesteps    | 147000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 232      |
|    ep_rew_mean        | 62.7     |
| time/                 |          |
|    fps                | 30375    |
|    iterations         | 600      |
|    time_elapsed       | 5        |
|    total_timesteps    | 153000   |
| train/                |          |
|    entropy_loss       | -0.399   |
|    explained_variance | 0.865    |
|    learning_rate      | 0.0007   |
|    n_updates          | 30599    |
|    policy_loss        | 0.775    |
|    value_loss         | 3.13     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 231      |
|    ep_rew_mean        | 63.7     |
| time/                 |          |
|    fps                | 26262    |
|    iterations         | 700      |
|    time_elapsed       | 5        |
|    total_timesteps    | 153500   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 245      |
|    ep_rew_mean        | 61.5     |
| time/                 |          |
|    fps                | 8686     |
|    iterations         | 2000     |
|    time_elapsed       | 18       |
|    total_timesteps    | 160000   |
| train/                |          |
|    entropy_loss       | -0.735   |
|    explained_variance | -5.18    |
|    learning_rate      | 0.0007   |
|    n_updates          | 31999    |
|    policy_loss        | 1.27     |
|    value_loss         | 14.9     |
------------------------------------
Logging to logs/A2C_0
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 245      |
|    ep_rew_mean        | 62.4     |
| time/                 |          |
|    fps                | 220333   |
|    iterations         | 100      |
|    time_elapsed       | 0        |
|    total_timesteps    | 160500   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 260      |
|    ep_rew_mean        | 68.2     |
| time/                 |          |
|    fps                | 14886    |
|    iterations         | 1300     |
|    time_elapsed       | 11       |
|    total_timesteps    | 166500   |
| train/                |          |
|    entropy_loss       | -0.69    |
|    explained_variance | -1.31    |
|    learning_rate      | 0.0007   |
|    n_updates          | 33299    |
|    policy_loss        | -0.118   |
|    value_loss         | 0.85     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 260      |
|    ep_rew_mean        | 68.2     |
| time/                 |          |
|    fps                | 13621    |
|    iterations         | 1400     |
|    time_elapsed       | 12       |
|    total_timesteps    | 167000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 304      |
|    ep_rew_mean        | 64.3     |
| time/                 |          |
|    fps                | 31746    |
|    iterations         | 600      |
|    time_elapsed       | 5        |
|    total_timesteps    | 173000   |
| train/                |          |
|    entropy_loss       | -0.832   |
|    explained_variance | 0.843    |
|    learning_rate      | 0.0007   |
|    n_updates          | 34599    |
|    policy_loss        | 0.238    |
|    value_loss         | 0.422    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 310      |
|    ep_rew_mean        | 64.8     |
| time/                 |          |
|    fps                | 27630    |
|    iterations         | 700      |
|    time_elapsed       | 6        |
|    total_timesteps    | 173500   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 333      |
|    ep_rew_mean        | 42.3     |
| time/                 |          |
|    fps                | 12734    |
|    iterations         | 1900     |
|    time_elapsed       | 14       |
|    total_timesteps    | 179500   |
| train/                |          |
|    entropy_loss       | -0.801   |
|    explained_variance | 0.959    |
|    learning_rate      | 0.0007   |
|    n_updates          | 35899    |
|    policy_loss        | 0.787    |
|    value_loss         | 1.78     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 333      |
|    ep_rew_mean        | 42.3     |
| time/                 |          |
|    fps                | 12082    |
|    iterations         | 2000     |
|    time_elapsed       | 14       |
|    total_timesteps    | 180000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 380      |
|    ep_rew_mean        | 37.5     |
| time/                 |          |
|    fps                | 15727    |
|    iterations         | 1200     |
|    time_elapsed       | 11       |
|    total_timesteps    | 186000   |
| train/                |          |
|    entropy_loss       | -0.449   |
|    explained_variance | 0.85     |
|    learning_rate      | 0.0007   |
|    n_updates          | 37199    |
|    policy_loss        | -0.229   |
|    value_loss         | 0.245    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 388      |
|    ep_rew_mean        | 37.2     |
| time/                 |          |
|    fps                | 14781    |
|    iterations         | 1300     |
|    time_elapsed       | 12       |
|    total_timesteps    | 186500   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 440      |
|    ep_rew_mean        | 33.4     |
| time/                 |          |
|    fps                | 40447    |
|    iterations         | 500      |
|    time_elapsed       | 4        |
|    total_timesteps    | 192500   |
| train/                |          |
|    entropy_loss       | -0.397   |
|    explained_variance | -0.0875  |
|    learning_rate      | 0.0007   |
|    n_updates          | 38499    |
|    policy_loss        | -0.389   |
|    value_loss         | 2.27     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 440      |
|    ep_rew_mean        | 33.4     |
| time/                 |          |
|    fps                | 31777    |
|    iterations         | 600      |
|    time_elapsed       | 6        |
|    total_timesteps    | 193000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 499      |
|    ep_rew_mean        | 21.7     |
| time/                 |          |
|    fps                | 11074    |
|    iterations         | 1900     |
|    time_elapsed       | 18       |
|    total_timesteps    | 199500   |
| train/                |          |
|    entropy_loss       | -0.161   |
|    explained_variance | 0.778    |
|    learning_rate      | 0.0007   |
|    n_updates          | 39899    |
|    policy_loss        | 0.00646  |
|    value_loss         | 0.17     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 499      |
|    ep_rew_mean        | 21.7     |
| time/                 |          |
|    fps                | 10463    |
|    iterations         | 2000     |
|    time_elapsed       | 19       |
|    total_timesteps    | 200000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 549      |
|    ep_rew_mean        | 14.2     |
| time/                 |          |
|    fps                | 17487    |
|    iterations         | 1200     |
|    time_elapsed       | 11       |
|    total_timesteps    | 206000   |
| train/                |          |
|    entropy_loss       | -0.634   |
|    explained_variance | 0.308    |
|    learning_rate      | 0.0007   |
|    n_updates          | 41199    |
|    policy_loss        | -0.191   |
|    value_loss         | 1.13     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 556      |
|    ep_rew_mean        | 10.1     |
| time/                 |          |
|    fps                | 16532    |
|    iterations         | 1300     |
|    time_elapsed       | 12       |
|    total_timesteps    | 206500   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 608      |
|    ep_rew_mean        | 4.15     |
| time/                 |          |
|    fps                | 50425    |
|    iterations         | 500      |
|    time_elapsed       | 4        |
|    total_timesteps    | 212500   |
| train/                |          |
|    entropy_loss       | -0.407   |
|    explained_variance | 0.249    |
|    learning_rate      | 0.0007   |
|    n_updates          | 42499    |
|    policy_loss        | 0.14     |
|    value_loss         | 1.04     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 608      |
|    ep_rew_mean        | 4.15     |
| time/                 |          |
|    fps                | 41697    |
|    iterations         | 600      |
|    time_elapsed       | 5        |
|    total_timesteps    | 213000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 663      |
|    ep_rew_mean        | -1.16    |
| time/                 |          |
|    fps                | 12183    |
|    iterations         | 1900     |
|    time_elapsed       | 18       |
|    total_timesteps    | 219500   |
| train/                |          |
|    entropy_loss       | -0.409   |
|    explained_variance | 0.663    |
|    learning_rate      | 0.0007   |
|    n_updates          | 43899    |
|    policy_loss        | -1.07    |
|    value_loss         | 2.62     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 663      |
|    ep_rew_mean        | -1.16    |
| time/                 |          |
|    fps                | 11640    |
|    iterations         | 2000     |
|    time_elapsed       | 18       |
|    total_timesteps    | 220000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 695      |
|    ep_rew_mean        | -6.95    |
| time/                 |          |
|    fps                | 20054    |
|    iterations         | 1200     |
|    time_elapsed       | 11       |
|    total_timesteps    | 226000   |
| train/                |          |
|    entropy_loss       | -0.542   |
|    explained_variance | 0.726    |
|    learning_rate      | 0.0007   |
|    n_updates          | 45199    |
|    policy_loss        | -0.0805  |
|    value_loss         | 0.559    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 695      |
|    ep_rew_mean        | -6.95    |
| time/                 |          |
|    fps                | 18334    |
|    iterations         | 1300     |
|    time_elapsed       | 12       |
|    total_timesteps    | 226500   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 738      |
|    ep_rew_mean        | -21.7    |
| time/                 |          |
|    fps                | 39156    |
|    iterations         | 500      |
|    time_elapsed       | 5        |
|    total_timesteps    | 232500   |
| train/                |          |
|    entropy_loss       | -0.385   |
|    explained_variance | -0.15    |
|    learning_rate      | 0.0007   |
|    n_updates          | 46499    |
|    policy_loss        | 1.18     |
|    value_loss         | 5.35     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 738      |
|    ep_rew_mean        | -21.7    |
| time/                 |          |
|    fps                | 34249    |
|    iterations         | 600      |
|    time_elapsed       | 6        |
|    total_timesteps    | 233000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 774      |
|    ep_rew_mean        | -30.2    |
| time/                 |          |
|    fps                | 12350    |
|    iterations         | 1900     |
|    time_elapsed       | 19       |
|    total_timesteps    | 239500   |
| train/                |          |
|    entropy_loss       | -0.32    |
|    explained_variance | 0.753    |
|    learning_rate      | 0.0007   |
|    n_updates          | 47899    |
|    policy_loss        | 0.0785   |
|    value_loss         | 0.394    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 774      |
|    ep_rew_mean        | -30.2    |
| time/                 |          |
|    fps                | 11848    |
|    iterations         | 2000     |
|    time_elapsed       | 20       |
|    total_timesteps    | 240000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 804      |
|    ep_rew_mean        | -46.3    |
| time/                 |          |
|    fps                | 18401    |
|    iterations         | 1200     |
|    time_elapsed       | 13       |
|    total_timesteps    | 246000   |
| train/                |          |
|    entropy_loss       | -0.292   |
|    explained_variance | 0.963    |
|    learning_rate      | 0.0007   |
|    n_updates          | 49199    |
|    policy_loss        | 0.0643   |
|    value_loss         | 0.182    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 804      |
|    ep_rew_mean        | -46      |
| time/                 |          |
|    fps                | 17206    |
|    iterations         | 1300     |
|    time_elapsed       | 14       |
|    total_timesteps    | 246500   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 820      |
|    ep_rew_mean        | -54.3    |
| time/                 |          |
|    fps                | 49246    |
|    iterations         | 500      |
|    time_elapsed       | 5        |
|    total_timesteps    | 252500   |
| train/                |          |
|    entropy_loss       | -0.117   |
|    explained_variance | 0.676    |
|    learning_rate      | 0.0007   |
|    n_updates          | 50499    |
|    policy_loss        | 0.038    |
|    value_loss         | 1.75     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 820      |
|    ep_rew_mean        | -54.3    |
| time/                 |          |
|    fps                | 41905    |
|    iterations         | 600      |
|    time_elapsed       | 6        |
|    total_timesteps    | 253000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 836      |
|    ep_rew_mean        | -63.2    |
| time/                 |          |
|    fps                | 12382    |
|    iterations         | 1900     |
|    time_elapsed       | 20       |
|    total_timesteps    | 259500   |
| train/                |          |
|    entropy_loss       | -0.204   |
|    explained_variance | -0.0325  |
|    learning_rate      | 0.0007   |
|    n_updates          | 51899    |
|    policy_loss        | -1.24    |
|    value_loss         | 84.6     |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 836      |
|    ep_rew_mean        | -62      |
| time/                 |          |
|    fps                | 11916    |
|    iterations         | 2000     |
|    time_elapsed       | 21       |
|    total_timesteps    | 260000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 875      |
|    ep_rew_mean        | -61.4    |
| time/                 |          |
|    fps                | 14471    |
|    iterations         | 1200     |
|    time_elapsed       | 18       |
|    total_timesteps    | 266000   |
| train/                |          |
|    entropy_loss       | -0.374   |
|    explained_variance | 0.969    |
|    learning_rate      | 0.0007   |
|    n_updates          | 53199    |
|    policy_loss        | 0.101    |
|    value_loss         | 0.743    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 883      |
|    ep_rew_mean        | -58.6    |
| time/                 |          |
|    fps                | 13665    |
|    iterations         | 1300     |
|    time_elapsed       | 19       |
|    total_timesteps    | 266500   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 883      |
|    ep_rew_mean        | -55.3    |
| time/                 |          |
|    fps                | 42856    |
|    iterations         | 500      |
|    time_elapsed       | 6        |
|    total_timesteps    | 272500   |
| train/                |          |
|    entropy_loss       | -0.653   |
|    explained_variance | 0.966    |
|    learning_rate      | 0.0007   |
|    n_updates          | 54499    |
|    policy_loss        | 0.556    |
|    value_loss         | 0.861    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 883      |
|    ep_rew_mean        | -55.3    |
| time/                 |          |
|    fps                | 35248    |
|    iterations         | 600      |
|    time_elapsed       | 7        |
|    total_timesteps    | 273000   |
| train/                |          |

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 846      |
|    ep_rew_mean        | -35.5    |
| time/                 |          |
|    fps                | 11926    |
|    iterations         | 1800     |
|    time_elapsed       | 23       |
|    total_timesteps    | 279000   |
| train/                |          |
|    entropy_loss       | -0.0217  |
|    explained_variance | 1.51e-05 |
|    learning_rate      | 0.0007   |
|    n_updates          | 55799    |
|    policy_loss        | -0.00073 |
|    value_loss         | 0.0624   |
------------------------------------
-------------------------------------
| rollout/              |           |
|    ep_len_mean        | 839       |
|    ep_rew_mean        | -32.3     |
| time/                 |           |
|    fps                | 11447     |
|    iterations         | 1900      |
|    time_elapsed       | 24        |
|    total_timesteps    | 279500    |
| train/                |    

------------------------------------
| rollout/              |          |
|    ep_len_mean        | 793      |
|    ep_rew_mean        | -2.14    |
| time/                 |          |
|    fps                | 26585    |
|    iterations         | 1100     |
|    time_elapsed       | 10       |
|    total_timesteps    | 285500   |
| train/                |          |
|    entropy_loss       | -0.121   |
|    explained_variance | 0.933    |
|    learning_rate      | 0.0007   |
|    n_updates          | 57099    |
|    policy_loss        | 0.0256   |
|    value_loss         | 0.239    |
------------------------------------
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 788      |
|    ep_rew_mean        | -2.68    |
| time/                 |          |
|    fps                | 25290    |
|    iterations         | 1200     |
|    time_elapsed       | 11       |
|    total_timesteps    | 286000   |
| train/                |          |
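The tables above are Stable Baselines3's default console output, and `rollout/ep_rew_mean` is the column that tracks whether the agent is improving. When several runs print into the same cell, reading that trend by eye is awkward. The sketch below turns pasted log text back into one dict per table. It assumes the `| key | value |` layout shown above. `parse_sb3_log` is a hypothetical helper and not part of SB3.

```python
import re
from typing import Dict, List

# Matches a row such as "|    ep_rew_mean        | -62      |"
# (section headers like "| rollout/ |  |" also match and are handled below)
ROW = re.compile(r"^\|\s+(\S+)\s+\|\s*(\S+)\s*\|?\s*$")


def parse_sb3_log(text: str) -> List[Dict[str, float]]:
    """Split SB3-style console tables into one dict per table.

    Keys are "<section>/<name>", e.g. "rollout/ep_rew_mean".
    Truncated rows (such as a bare "|") are skipped.
    """
    tables: List[Dict[str, float]] = []
    current: Dict[str, float] = {}
    section = ""
    for line in text.splitlines():
        line = line.rstrip()
        if line.startswith("---"):
            # A dashed border closes the table collected so far
            if current:
                tables.append(current)
                current = {}
            continue
        match = ROW.match(line)
        if not match:
            continue
        key, value = match.groups()
        if key.endswith("/"):
            section = key  # section header such as "rollout/" or "train/"
            continue
        try:
            current[section + key] = float(value)
        except ValueError:
            pass  # ignore non-numeric cells
    if current:
        tables.append(current)
    return tables


SAMPLE = """\
------------------------------------
| rollout/              |          |
|    ep_len_mean        | 793      |
|    ep_rew_mean        | -2.14    |
| time/                 |          |
|    total_timesteps    | 285500   |
------------------------------------
"""

tables = parse_sb3_log(SAMPLE)
print(tables[0]["rollout/ep_rew_mean"])  # -2.14
```

If you control the training run, SB3's own logger (`stable_baselines3.common.logger.configure` with a `"csv"` output format, attached via `model.set_logger`) records the same values to a file and avoids parsing stdout altogether.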

### Load some models

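import gym
from tqdm import tqdm
from stable_baselines3 import PPO as model_algo

# Assumed to match the environment created earlier in the notebook
env = gym.make("LunarLander-v2")

model_type = "PPO"
models_dir = f"models/{model_type}"
model_path = f"{models_dir}/140000.zip"

# Passing `env` makes SB3 wrap it in Monitor + DummyVecEnv (see the output below)
model = model_algo.load(model_path, env=env)

episodes = 10

for episode in tqdm(range(episodes)):
    obs = env.reset()  # old gym API: reset() returns only the observation
    done = False

    while not done:
        env.render()
        # Ask the trained policy for an action; the second value is the recurrent state (unused here)
        action, _ = model.predict(obs)
        obs, reward, done, info = env.step(action)

env.close()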

Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


  0%|          | 0/10 [00:00<?, ?it/s]
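The loop above renders each episode but discards `reward`, so it gives no number for comparing checkpoints. The sketch below runs the same reset/step loop and sums the reward per episode. So that it runs without Box2D, it uses a hypothetical `CountdownEnv` stub that mimics the old gym API the notebook relies on: `reset()` returns an observation, and `step()` returns a 4-tuple. `run_episodes` is also a hypothetical helper.

```python
from typing import Callable, List


class CountdownEnv:
    """Hypothetical stand-in for a gym env with the old 4-tuple step API.

    Each episode lasts `length` steps; action 1 earns +1 reward, any other action earns 0.
    """

    def __init__(self, length: int = 5):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t  # observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 1 else 0.0
        done = self.t >= self.length
        return self.t, reward, done, {}


def run_episodes(env, policy: Callable, episodes: int) -> List[float]:
    """Roll out `policy` and return the total (undiscounted) reward of each episode."""
    returns = []
    for _ in range(episodes):
        obs = env.reset()
        done = False
        total = 0.0
        while not done:
            action = policy(obs)
            obs, reward, done, info = env.step(action)
            total += reward  # keep the reward instead of discarding it
        returns.append(total)
    return returns


returns = run_episodes(CountdownEnv(length=5), policy=lambda obs: 1, episodes=3)
print(returns)  # [5.0, 5.0, 5.0]
```

With a loaded SB3 model, the policy argument would be something like `lambda obs: model.predict(obs, deterministic=True)[0]`. SB3 also ships `stable_baselines3.common.evaluation.evaluate_policy`, which runs a similar loop and returns the mean and standard deviation of the episode reward.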

# Custom Environment

## What do we want?

* Familiarise yourself with how to form action spaces and set rewards