<a href="https://colab.research.google.com/github/ScorcaF/imitation/blob/master/MBRL_GAILcartpole.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Train an Agent using Generative Adversarial Imitation Learning

Generative adversarial imitation learning (GAIL) trains a discriminator network to distinguish expert trajectories from learner trajectories.
The learner is trained with a standard reinforcement learning algorithm such as PPO and is rewarded for producing trajectories that the discriminator classifies as expert trajectories.
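
To make the reward signal concrete, here is a minimal sketch of one common formulation. It is not necessarily the exact computation used inside the `imitation` library. It assumes the discriminator outputs a logit where larger values mean "more expert-like"; the helper names `sigmoid` and `gail_reward` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def gail_reward(logit: float) -> float:
    """Hypothetical helper: reward = -log(1 - D), where D = sigmoid(logit).

    Algebraically this equals softplus(logit) = log(1 + exp(logit)),
    computed here in a numerically stable way.
    """
    return max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))


# The more expert-like the discriminator finds a transition, the larger the reward.
for logit in (-2.0, 0.0, 2.0):
    print(f"logit={logit:+.1f}  D={sigmoid(logit):.3f}  reward={gail_reward(logit):.3f}")
```

With this form the reward is always positive and grows as the discriminator becomes more convinced that a transition came from the expert.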

In [1]:
%%capture 
%%bash

git clone https://github.com/ScorcaF/imitation
cd imitation && git checkout 0861607f146457e3e086ee91c362c39aeac1d8c4
pip install -e .

pip install mbrl
pip install omegaconf
pip install matplotlib==3.1.1

# install required system dependencies (-y avoids the interactive confirmation prompt)
apt-get install -y swig xvfb x11-utils

# install required Python dependencies (additional gym extras may be needed, depending on the environment)
# the requirement specifiers are quoted so the shell does not expand "[...]" or "*"
pip install "gym[box2d]==0.17.*" "pyvirtualdisplay==0.2.*" "PyOpenGL==3.1.*" "PyOpenGL-accelerate==3.1.*"

pip3 install box2d-py
pip3 install gym[Box_2D]
pip install stable_baselines3


# git clone https://github.com/ScorcaF/mbrl-lib.git
# pip install -e ".[dev]"
# pip install imitation

In [2]:
import pyvirtualdisplay

_display = pyvirtualdisplay.Display(visible=False,  # use False with Xvfb
                                    size=(1400, 900))
_ = _display.start()


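As usual, we first need an expert.
Note that the original tutorial uses a variant of the CartPole environment from the seals package, which has fixed episode durations. This notebook instead uses the continuous-action CartPole environment from mbrl-lib (`mbrl.env.cartpole_continuous`), so the GAIL trainer below is run with `allow_variable_horizon=True`. Read more about why the episode horizon matters [here](https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html).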
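
A toy calculation illustrates why variable horizons can confound imitation results. It assumes, purely for illustration, a learned reward that happens to be strictly positive and identical at every step; the function name and all numbers are hypothetical.

```python
def episode_return(per_step_reward: float, length: int) -> float:
    """Undiscounted return of an episode with a constant per-step reward."""
    return per_step_reward * length


# Variable horizon: the same uninformative per-step reward, but different lengths.
short_episode = episode_return(per_step_reward=0.1, length=20)
long_episode = episode_return(per_step_reward=0.1, length=200)
print(short_episode, long_episode)

# Fixed horizon: both episodes have the same length, so the return can only
# differ through the per-step reward itself.
fixed_a = episode_return(per_step_reward=0.1, length=200)
fixed_b = episode_return(per_step_reward=0.1, length=200)
print(fixed_a == fixed_b)
```

In the variable-horizon case the longer episode earns a higher return even though the reward carries no information about behavior, so a learner can score well simply by surviving longer.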

In [21]:
%%capture 
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
import gym
import mbrl.env.cartpole_continuous as cartpole_env


env = cartpole_env.CartPoleEnv()

expert = PPO(
    policy=MlpPolicy,
    env=env,
    seed=0)
expert.learn(100_000)  # 100,000 timesteps, to train a proficient expert

We generate some expert trajectories that the discriminator needs to distinguish from the learner's trajectories.

In [22]:
%%capture 
%cd imitation/src
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from stable_baselines3.common.vec_env import DummyVecEnv
%cd -

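# Each of the 5 vectorized environments wraps its own CartPoleEnv instance,
# so the parallel rollouts do not share (and overwrite) one environment's state.
rollouts = rollout.rollout(
    expert,
    DummyVecEnv([lambda: RolloutInfoWrapper(cartpole_env.CartPoleEnv())] * 5),
    rollout.make_sample_until(min_timesteps=None, min_episodes=60),
)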
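
The stopping rule used above (`min_timesteps=None, min_episodes=60`) can be pictured with the following sketch. It is a simplified stand-in, not the library's implementation: `make_stop_condition` is a hypothetical name, and it works on plain episode lengths rather than trajectory objects.

```python
from typing import Callable, Optional, Sequence


def make_stop_condition(
    min_timesteps: Optional[int] = None,
    min_episodes: Optional[int] = None,
) -> Callable[[Sequence[int]], bool]:
    """Hypothetical stand-in for a rollout stopping rule.

    The returned function takes the lengths of the episodes finished so far
    and returns True once every requested minimum is met.
    """
    if min_timesteps is None and min_episodes is None:
        raise ValueError("Specify at least one of min_timesteps or min_episodes")

    def should_stop(episode_lengths: Sequence[int]) -> bool:
        enough_episodes = min_episodes is None or len(episode_lengths) >= min_episodes
        enough_steps = min_timesteps is None or sum(episode_lengths) >= min_timesteps
        return enough_episodes and enough_steps

    return should_stop


stop = make_stop_condition(min_timesteps=None, min_episodes=60)
print(stop([200] * 59))  # still collecting
print(stop([200] * 60))  # enough episodes gathered
```

Leaving `min_timesteps` as `None` means only the episode count matters, so collection ends once 60 complete expert episodes are available.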

In [None]:
%cd imitation/src
from imitation.algorithms.adversarial.gail import GAIL
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm
%cd -

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv

import gym



venv = DummyVecEnv([lambda: env])
learner = PPO(
    env=venv,
    policy=MlpPolicy,
    batch_size=64,
    ent_coef=0.0,
    learning_rate=0.0003,
    n_epochs=10,
)
reward_net = BasicRewardNet(
    venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm
)
gail_trainer = GAIL(
    demonstrations=rollouts,
    demo_batch_size=1024,
    gen_replay_buffer_capacity=2048,
    n_disc_updates_per_round=4,
    venv=venv,
    gen_algo=learner,
    reward_net=reward_net,
    allow_variable_horizon=True,  # episodes can end early here; see the variable-horizon guide in the imitation docs
)

learner_rewards_before_training, _ = evaluate_policy(
    learner, venv, 100, return_episode_rewards=True
)
gail_trainer.train(300_000)  # train for 300,000 timesteps
learner_rewards_after_training, _ = evaluate_policy(
    learner, venv, 100, return_episode_rewards=True
)
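
Once training finishes, the two lists returned by `evaluate_policy` can be compared with a few summary statistics. The sketch below uses a hypothetical `summarize` helper and placeholder numbers that are for illustration only and are not results from this notebook; in practice, pass `learner_rewards_before_training` and `learner_rewards_after_training` instead.

```python
from statistics import mean, stdev
from typing import Sequence


def summarize(label: str, episode_rewards: Sequence[float]) -> str:
    """Format the mean, sample standard deviation, and count of episode rewards."""
    return (
        f"{label}: mean={mean(episode_rewards):.1f}, "
        f"std={stdev(episode_rewards):.1f}, n={len(episode_rewards)}"
    )


# Placeholder values for illustration only; NOT results from this notebook.
before = [12.0, 15.0, 9.0, 14.0]
after = [180.0, 200.0, 195.0, 200.0]

print(summarize("before training", before))
print(summarize("after training", after))
```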

/content/imitation/src
/content
Running with `allow_variable_horizon` set to True. Some algorithms are biased towards shorter or longer episodes, which may significantly confound results. Additionally, even unbiased algorithms can exploit the information leak from the termination condition, producing spuriously high performance. See https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html for more information.


round:   0%|          | 0/146 [00:00<?, ?it/s]

--------------------------------------
| raw/                        |      |
|    gen/time/fps             | 780  |
|    gen/time/iterations      | 1    |
|    gen/time/time_elapsed    | 2    |
|    gen/time/total_timesteps | 2048 |
--------------------------------------
--------------------------------------------------
| raw/                                |          |
|    disc/disc_acc                    | 0.327    |
|    disc/disc_acc_expert             | 0.0156   |
|    disc/disc_acc_gen                | 0.638    |
|    disc/disc_entropy                | 0.69     |
|    disc/disc_loss                   | 0.737    |
|    disc/disc_proportion_expert_pred | 0.189    |
|    disc/disc_proportion_expert_true | 0.5      |
|    disc/global_step                 | 1        |
|    disc/n_expert                    | 1.02e+03 |
|    disc/n_generated                 | 1.02e+03 |
--------------------------------------------------
--------------------------------------------------
| raw/       

round:   1%|          | 1/146 [00:03<09:15,  3.83s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 16.9        |
|    gen/time/fps                    | 785         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 4096        |
|    gen/train/approx_kl             | 0.009815651 |
|    gen/train/clip_fraction         | 0.0854      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.43       |
|    gen/train/explained_variance    | 0.00182     |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 2.66        |
|    gen/train/n_updates             | 10          |
|    gen/train/policy_gradient_loss  | -0.0128     |
|    gen/train/std                   | 1.02        |
|    gen/train/value_loss            | 21.7        |
----------------------------------------------------

round:   1%|▏         | 2/146 [00:07<09:17,  3.87s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 19.8        |
|    gen/time/fps                    | 782         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 6144        |
|    gen/train/approx_kl             | 0.008101222 |
|    gen/train/clip_fraction         | 0.0753      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.43       |
|    gen/train/explained_variance    | 0.175       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 5.95        |
|    gen/train/n_updates             | 20          |
|    gen/train/policy_gradient_loss  | -0.0106     |
|    gen/train/std                   | 1.01        |
|    gen/train/value_loss            | 15          |
----------------------------------------------------

round:   2%|▏         | 3/146 [00:11<09:08,  3.84s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 24.2        |
|    gen/time/fps                    | 784         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 8192        |
|    gen/train/approx_kl             | 0.012296966 |
|    gen/train/clip_fraction         | 0.119       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.42       |
|    gen/train/explained_variance    | 0.402       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 7.68        |
|    gen/train/n_updates             | 30          |
|    gen/train/policy_gradient_loss  | -0.0184     |
|    gen/train/std                   | 0.995       |
|    gen/train/value_loss            | 18.2        |
----------------------------------------------------

round:   3%|▎         | 4/146 [00:15<09:02,  3.82s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 30.9        |
|    gen/time/fps                    | 776         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 10240       |
|    gen/train/approx_kl             | 0.009129245 |
|    gen/train/clip_fraction         | 0.102       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.41       |
|    gen/train/explained_variance    | 0.39        |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 8.89        |
|    gen/train/n_updates             | 40          |
|    gen/train/policy_gradient_loss  | -0.0179     |
|    gen/train/std                   | 0.984       |
|    gen/train/value_loss            | 24.9        |
----------------------------------------------------

round:   3%|▎         | 5/146 [00:19<08:57,  3.81s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 39.6        |
|    gen/time/fps                    | 785         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 12288       |
|    gen/train/approx_kl             | 0.006962429 |
|    gen/train/clip_fraction         | 0.0747      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.4        |
|    gen/train/explained_variance    | 0.471       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 5.17        |
|    gen/train/n_updates             | 50          |
|    gen/train/policy_gradient_loss  | -0.0128     |
|    gen/train/std                   | 0.984       |
|    gen/train/value_loss            | 24.8        |
----------------------------------------------------

round:   4%|▍         | 6/146 [00:22<08:51,  3.80s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 50.1        |
|    gen/time/fps                    | 782         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 14336       |
|    gen/train/approx_kl             | 0.009967764 |
|    gen/train/clip_fraction         | 0.0861      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.41       |
|    gen/train/explained_variance    | 0.702       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 9.4         |
|    gen/train/n_updates             | 60          |
|    gen/train/policy_gradient_loss  | -0.0139     |
|    gen/train/std                   | 0.992       |
|    gen/train/value_loss            | 13.7        |
----------------------------------------------------

round:   5%|▍         | 7/146 [00:26<08:47,  3.80s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 60.9         |
|    gen/time/fps                    | 786          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 16384        |
|    gen/train/approx_kl             | 0.0048443167 |
|    gen/train/clip_fraction         | 0.04         |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.41        |
|    gen/train/explained_variance    | 0.627        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 12.1         |
|    gen/train/n_updates             | 70           |
|    gen/train/policy_gradient_loss  | -0.00734     |
|    gen/train/std                   | 0.991        |
|    gen/train/value_loss            | 21.6         |
-----------------------------------------------------

round:   5%|▌         | 8/146 [00:30<08:42,  3.79s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 72.6         |
|    gen/time/fps                    | 787          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 18432        |
|    gen/train/approx_kl             | 0.0064208265 |
|    gen/train/clip_fraction         | 0.0633       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.41        |
|    gen/train/explained_variance    | 0.942        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 1.69         |
|    gen/train/n_updates             | 80           |
|    gen/train/policy_gradient_loss  | -0.00861     |
|    gen/train/std                   | 0.984        |
|    gen/train/value_loss            | 6.14         |
-----------------------------------------------------

round:   6%|▌         | 9/146 [00:34<08:39,  3.79s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 85.2         |
|    gen/time/fps                    | 781          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 20480        |
|    gen/train/approx_kl             | 0.0064841136 |
|    gen/train/clip_fraction         | 0.0634       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.4         |
|    gen/train/explained_variance    | 0.866        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 1.44         |
|    gen/train/n_updates             | 90           |
|    gen/train/policy_gradient_loss  | -0.00522     |
|    gen/train/std                   | 0.983        |
|    gen/train/value_loss            | 12.1         |
-----------------------------------------------------

round:   7%|▋         | 10/146 [00:38<08:35,  3.79s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 95.8         |
|    gen/time/fps                    | 792          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 22528        |
|    gen/train/approx_kl             | 0.0014354013 |
|    gen/train/clip_fraction         | 0.0496       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.4         |
|    gen/train/explained_variance    | 0.86         |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 9.02         |
|    gen/train/n_updates             | 100          |
|    gen/train/policy_gradient_loss  | -0.0054      |
|    gen/train/std                   | 0.982        |
|    gen/train/value_loss            | 10.3         |
-----------------------------------------------------

round:   8%|▊         | 11/146 [00:41<08:30,  3.78s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 109         |
|    gen/time/fps                    | 782         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 24576       |
|    gen/train/approx_kl             | 0.006712849 |
|    gen/train/clip_fraction         | 0.0617      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.4        |
|    gen/train/explained_variance    | 0.921       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 1.81        |
|    gen/train/n_updates             | 110         |
|    gen/train/policy_gradient_loss  | -0.00453    |
|    gen/train/std                   | 0.977       |
|    gen/train/value_loss            | 6.46        |
----------------------------------------------------

round:   8%|▊         | 12/146 [00:45<08:26,  3.78s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 121         |
|    gen/time/fps                    | 780         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 26624       |
|    gen/train/approx_kl             | 0.008124143 |
|    gen/train/clip_fraction         | 0.0864      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.42       |
|    gen/train/explained_variance    | 0.98        |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.285       |
|    gen/train/n_updates             | 120         |
|    gen/train/policy_gradient_loss  | -0.00938    |
|    gen/train/std                   | 1.02        |
|    gen/train/value_loss            | 1.69        |
----------------------------------------------------

round:   9%|▉         | 13/146 [00:49<08:31,  3.85s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 128         |
|    gen/time/fps                    | 637         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 3           |
|    gen/time/total_timesteps        | 28672       |
|    gen/train/approx_kl             | 0.014247734 |
|    gen/train/clip_fraction         | 0.193       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.43       |
|    gen/train/explained_variance    | 0.984       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.709       |
|    gen/train/n_updates             | 130         |
|    gen/train/policy_gradient_loss  | -0.0243     |
|    gen/train/std                   | 1.01        |
|    gen/train/value_loss            | 1.63        |
----------------------------------------------------

round:  10%|▉         | 14/146 [00:53<08:49,  4.01s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 147          |
|    gen/time/fps                    | 779          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 30720        |
|    gen/train/approx_kl             | 0.0036991392 |
|    gen/train/clip_fraction         | 0.0258       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.43        |
|    gen/train/explained_variance    | 0.0628       |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 39.3         |
|    gen/train/n_updates             | 140          |
|    gen/train/policy_gradient_loss  | -0.00411     |
|    gen/train/std                   | 1.01         |
|    gen/train/value_loss            | 44.5         |
-----------------------------------------------------

round:  10%|█         | 15/146 [00:57<08:38,  3.95s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 147          |
|    gen/time/fps                    | 790          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 32768        |
|    gen/train/approx_kl             | 0.0065063494 |
|    gen/train/clip_fraction         | 0.0424       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.42        |
|    gen/train/explained_variance    | 0.176        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 0.309        |
|    gen/train/n_updates             | 150          |
|    gen/train/policy_gradient_loss  | -0.00446     |
|    gen/train/std                   | 0.993        |
|    gen/train/value_loss            | 1.61         |
-----------------------------------------------------

round:  11%|█         | 16/146 [01:01<08:27,  3.91s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 147          |
|    gen/time/fps                    | 785          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 34816        |
|    gen/train/approx_kl             | 0.0073189223 |
|    gen/train/clip_fraction         | 0.0723       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.41        |
|    gen/train/explained_variance    | 0.242        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 0.227        |
|    gen/train/n_updates             | 160          |
|    gen/train/policy_gradient_loss  | -0.0082      |
|    gen/train/std                   | 0.98         |
|    gen/train/value_loss            | 1.11         |
-----------------------------------------------------

round:  12%|█▏        | 17/146 [01:05<08:19,  3.87s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 181         |
|    gen/time/fps                    | 793         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 36864       |
|    gen/train/approx_kl             | 0.003987509 |
|    gen/train/clip_fraction         | 0.0218      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.4        |
|    gen/train/explained_variance    | 0.0209      |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 21.1        |
|    gen/train/n_updates             | 170         |
|    gen/train/policy_gradient_loss  | -0.000982   |
|    gen/train/std                   | 0.976       |
|    gen/train/value_loss            | 17.2        |
----------------------------------------------------

round:  12%|█▏        | 18/146 [01:09<08:14,  3.86s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 181          |
|    gen/time/fps                    | 792          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 38912        |
|    gen/train/approx_kl             | 0.0035848161 |
|    gen/train/clip_fraction         | 0.05         |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.39        |
|    gen/train/explained_variance    | 0.12         |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 0.24         |
|    gen/train/n_updates             | 180          |
|    gen/train/policy_gradient_loss  | -0.00669     |
|    gen/train/std                   | 0.97         |
|    gen/train/value_loss            | 0.567        |
-----------------------------------------------------

round:  13%|█▎        | 19/146 [01:12<08:06,  3.83s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 216         |
|    gen/time/fps                    | 792         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 40960       |
|    gen/train/approx_kl             | 0.003438307 |
|    gen/train/clip_fraction         | 0.0256      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.39       |
|    gen/train/explained_variance    | 0.0184      |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 6.73        |
|    gen/train/n_updates             | 190         |
|    gen/train/policy_gradient_loss  | -0.00192    |
|    gen/train/std                   | 0.97        |
|    gen/train/value_loss            | 26.3        |
----------------------------------------------------

round:  14%|█▎        | 20/146 [01:16<07:58,  3.80s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 232          |
|    gen/time/fps                    | 794          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 43008        |
|    gen/train/approx_kl             | 0.0022603474 |
|    gen/train/clip_fraction         | 0.0105       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.38        |
|    gen/train/explained_variance    | 0.0127       |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 1.17         |
|    gen/train/n_updates             | 200          |
|    gen/train/policy_gradient_loss  | -0.000534    |
|    gen/train/std                   | 0.958        |
|    gen/train/value_loss            | 34.2         |
-----------------------------------------------------

round:  14%|█▍        | 21/146 [01:20<07:52,  3.78s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 243          |
|    gen/time/fps                    | 789          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 45056        |
|    gen/train/approx_kl             | 0.0038601337 |
|    gen/train/clip_fraction         | 0.0219       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.38        |
|    gen/train/explained_variance    | 0.158        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 31.5         |
|    gen/train/n_updates             | 210          |
|    gen/train/policy_gradient_loss  | -0.00432     |
|    gen/train/std                   | 0.959        |
|    gen/train/value_loss            | 41.5         |
-----------------------------------------------------

round:  15%|█▌        | 22/146 [01:24<07:47,  3.77s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 257         |
|    gen/time/fps                    | 796         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 47104       |
|    gen/train/approx_kl             | 0.004461592 |
|    gen/train/clip_fraction         | 0.0435      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.38       |
|    gen/train/explained_variance    | -0.272      |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 3.83        |
|    gen/train/n_updates             | 220         |
|    gen/train/policy_gradient_loss  | -0.00203    |
|    gen/train/std                   | 0.961       |
|    gen/train/value_loss            | 17.5        |
----------------------------------------------------

round:  16%|█▌        | 23/146 [01:27<07:42,  3.76s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 257         |
|    gen/time/fps                    | 790         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 49152       |
|    gen/train/approx_kl             | 0.010432835 |
|    gen/train/clip_fraction         | 0.0961      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.36       |
|    gen/train/explained_variance    | -0.148      |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.449       |
|    gen/train/n_updates             | 230         |
|    gen/train/policy_gradient_loss  | -0.0152     |
|    gen/train/std                   | 0.931       |
|    gen/train/value_loss            | 0.999       |
----------------------------------------------------

round:  16%|█▋        | 24/146 [01:31<07:37,  3.75s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 286         |
|    gen/time/fps                    | 785         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 51200       |
|    gen/train/approx_kl             | 0.006183514 |
|    gen/train/clip_fraction         | 0.062       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.34       |
|    gen/train/explained_variance    | -0.0312     |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.626       |
|    gen/train/n_updates             | 240         |
|    gen/train/policy_gradient_loss  | -0.00535    |
|    gen/train/std                   | 0.925       |
|    gen/train/value_loss            | 15.4        |
----------------------------------------------------

round:  17%|█▋        | 25/146 [01:35<07:34,  3.76s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 299          |
|    gen/time/fps                    | 800          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 53248        |
|    gen/train/approx_kl             | 0.0025190385 |
|    gen/train/clip_fraction         | 0.0145       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.34        |
|    gen/train/explained_variance    | -0.00991     |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 8.58         |
|    gen/train/n_updates             | 250          |
|    gen/train/policy_gradient_loss  | -0.00198     |
|    gen/train/std                   | 0.924        |
|    gen/train/value_loss            | 86.4         |
-----------------------------------------------------

round:  18%|█▊        | 26/146 [01:39<07:28,  3.74s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 307          |
|    gen/time/fps                    | 785          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 55296        |
|    gen/train/approx_kl             | 0.0077374526 |
|    gen/train/clip_fraction         | 0.0376       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.34        |
|    gen/train/explained_variance    | -0.0664      |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 3.7          |
|    gen/train/n_updates             | 260          |
|    gen/train/policy_gradient_loss  | -0.00222     |
|    gen/train/std                   | 0.913        |
|    gen/train/value_loss            | 28.9         |
-----------------------------------------------------

round:  18%|█▊        | 27/146 [01:42<07:26,  3.75s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 319          |
|    gen/time/fps                    | 798          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 57344        |
|    gen/train/approx_kl             | 0.0028507109 |
|    gen/train/clip_fraction         | 0.0305       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.33        |
|    gen/train/explained_variance    | 0.0336       |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 14.1         |
|    gen/train/n_updates             | 270          |
|    gen/train/policy_gradient_loss  | -0.00467     |
|    gen/train/std                   | 0.914        |
|    gen/train/value_loss            | 45.6         |
-----------------------------------------------------

round:  19%|█▉        | 28/146 [01:46<07:22,  3.75s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 319          |
|    gen/time/fps                    | 789          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 59392        |
|    gen/train/approx_kl             | 0.0072529158 |
|    gen/train/clip_fraction         | 0.0577       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.31        |
|    gen/train/explained_variance    | -0.588       |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 0.227        |
|    gen/train/n_updates             | 280          |
|    gen/train/policy_gradient_loss  | -0.00478     |
|    gen/train/std                   | 0.883        |
|    gen/train/value_loss            | 0.999        |
-----------------------------------------------------

round:  20%|█▉        | 29/146 [01:50<07:18,  3.75s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 319         |
|    gen/time/fps                    | 783         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 61440       |
|    gen/train/approx_kl             | 0.010014379 |
|    gen/train/clip_fraction         | 0.108       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.3        |
|    gen/train/explained_variance    | 0.363       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.807       |
|    gen/train/n_updates             | 290         |
|    gen/train/policy_gradient_loss  | -0.0121     |
|    gen/train/std                   | 0.887       |
|    gen/train/value_loss            | 1.83        |
----------------------------------------------------

round:  21%|██        | 30/146 [01:54<07:15,  3.76s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 352          |
|    gen/time/fps                    | 791          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 63488        |
|    gen/train/approx_kl             | 0.0035680933 |
|    gen/train/clip_fraction         | 0.0373       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.3         |
|    gen/train/explained_variance    | 0.0142       |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 67           |
|    gen/train/n_updates             | 300          |
|    gen/train/policy_gradient_loss  | -0.00647     |
|    gen/train/std                   | 0.885        |
|    gen/train/value_loss            | 111          |
-----------------------------------------------------

round:  21%|██        | 31/146 [01:57<07:11,  3.75s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 361          |
|    gen/time/fps                    | 792          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 65536        |
|    gen/train/approx_kl             | 0.0039362866 |
|    gen/train/clip_fraction         | 0.0208       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.29        |
|    gen/train/explained_variance    | 0.0682       |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 10.5         |
|    gen/train/n_updates             | 310          |
|    gen/train/policy_gradient_loss  | -0.00273     |
|    gen/train/std                   | 0.88         |
|    gen/train/value_loss            | 40.4         |
-----------------------------------------------------

round:  22%|██▏       | 32/146 [02:01<07:07,  3.75s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 366          |
|    gen/time/fps                    | 780          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 67584        |
|    gen/train/approx_kl             | 0.0037909802 |
|    gen/train/clip_fraction         | 0.0271       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.29        |
|    gen/train/explained_variance    | 0.112        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 23.1         |
|    gen/train/n_updates             | 320          |
|    gen/train/policy_gradient_loss  | -0.00177     |
|    gen/train/std                   | 0.875        |
|    gen/train/value_loss            | 21.8         |
-----------------------------------------------------

round:  23%|██▎       | 33/146 [02:05<07:05,  3.77s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 380         |
|    gen/time/fps                    | 788         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 69632       |
|    gen/train/approx_kl             | 0.009019945 |
|    gen/train/clip_fraction         | 0.104       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.29       |
|    gen/train/explained_variance    | 0.0432      |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 1.93        |
|    gen/train/n_updates             | 330         |
|    gen/train/policy_gradient_loss  | -0.00659    |
|    gen/train/std                   | 0.88        |
|    gen/train/value_loss            | 14.2        |
----------------------------------------------------

round:  23%|██▎       | 34/146 [02:09<07:01,  3.76s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 386         |
|    gen/time/fps                    | 799         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 71680       |
|    gen/train/approx_kl             | 0.002149808 |
|    gen/train/clip_fraction         | 0.0163      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.29       |
|    gen/train/explained_variance    | 0.0826      |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 3.95        |
|    gen/train/n_updates             | 340         |
|    gen/train/policy_gradient_loss  | -0.001      |
|    gen/train/std                   | 0.879       |
|    gen/train/value_loss            | 22          |
----------------------------------------------------

round:  24%|██▍       | 35/146 [02:13<06:59,  3.78s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 386         |
|    gen/time/fps                    | 797         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 73728       |
|    gen/train/approx_kl             | 0.010327541 |
|    gen/train/clip_fraction         | 0.131       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.28       |
|    gen/train/explained_variance    | -1.57       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.661       |
|    gen/train/n_updates             | 350         |
|    gen/train/policy_gradient_loss  | -0.0104     |
|    gen/train/std                   | 0.867       |
|    gen/train/value_loss            | 1.62        |
----------------------------------------------------

round:  25%|██▍       | 36/146 [02:16<06:54,  3.77s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 412         |
|    gen/time/fps                    | 790         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 75776       |
|    gen/train/approx_kl             | 0.007772349 |
|    gen/train/clip_fraction         | 0.0841      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.27       |
|    gen/train/explained_variance    | 0.036       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 5.92        |
|    gen/train/n_updates             | 360         |
|    gen/train/policy_gradient_loss  | -0.00395    |
|    gen/train/std                   | 0.86        |
|    gen/train/value_loss            | 13          |
----------------------------------------------------

round:  25%|██▌       | 37/146 [02:21<07:05,  3.91s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 430          |
|    gen/time/fps                    | 696          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 77824        |
|    gen/train/approx_kl             | 0.0026420727 |
|    gen/train/clip_fraction         | 0.0362       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.27        |
|    gen/train/explained_variance    | 0.0595       |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 5.73         |
|    gen/train/n_updates             | 370          |
|    gen/train/policy_gradient_loss  | -0.000632    |
|    gen/train/std                   | 0.86         |
|    gen/train/value_loss            | 27.6         |
-----------------------------------------------------

round:  26%|██▌       | 38/146 [02:25<07:08,  3.97s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 440          |
|    gen/time/fps                    | 791          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 79872        |
|    gen/train/approx_kl             | 0.0025323334 |
|    gen/train/clip_fraction         | 0.0142       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.27        |
|    gen/train/explained_variance    | 0.189        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 27.5         |
|    gen/train/n_updates             | 380          |
|    gen/train/policy_gradient_loss  | -0.00113     |
|    gen/train/std                   | 0.858        |
|    gen/train/value_loss            | 33.5         |
-----------------------------------------------------

round:  27%|██▋       | 39/146 [02:28<06:57,  3.90s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 445          |
|    gen/time/fps                    | 792          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 81920        |
|    gen/train/approx_kl             | 0.0075871693 |
|    gen/train/clip_fraction         | 0.0527       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.26        |
|    gen/train/explained_variance    | 0.294        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 22.6         |
|    gen/train/n_updates             | 390          |
|    gen/train/policy_gradient_loss  | -0.00643     |
|    gen/train/std                   | 0.854        |
|    gen/train/value_loss            | 44           |
-----------------------------------------------------

round:  27%|██▋       | 40/146 [02:32<06:49,  3.86s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 452          |
|    gen/time/fps                    | 791          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 83968        |
|    gen/train/approx_kl             | 0.0030961563 |
|    gen/train/clip_fraction         | 0.0292       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.26        |
|    gen/train/explained_variance    | 0.153        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 2.32         |
|    gen/train/n_updates             | 400          |
|    gen/train/policy_gradient_loss  | -0.00246     |
|    gen/train/std                   | 0.854        |
|    gen/train/value_loss            | 22           |
-----------------------------------------------------

round:  28%|██▊       | 41/146 [02:36<06:42,  3.83s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 452         |
|    gen/time/fps                    | 790         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 86016       |
|    gen/train/approx_kl             | 0.010870019 |
|    gen/train/clip_fraction         | 0.133       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.26       |
|    gen/train/explained_variance    | -0.517      |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.657       |
|    gen/train/n_updates             | 410         |
|    gen/train/policy_gradient_loss  | -0.00827    |
|    gen/train/std                   | 0.851       |
|    gen/train/value_loss            | 1.72        |
----------------------------------------------------

round:  29%|██▉       | 42/146 [02:40<06:36,  3.81s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 475          |
|    gen/time/fps                    | 797          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 88064        |
|    gen/train/approx_kl             | 0.0063049207 |
|    gen/train/clip_fraction         | 0.0939       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.25        |
|    gen/train/explained_variance    | 0.132        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 21.7         |
|    gen/train/n_updates             | 420          |
|    gen/train/policy_gradient_loss  | -0.00532     |
|    gen/train/std                   | 0.847        |
|    gen/train/value_loss            | 23.3         |
-----------------------------------------------------

round:  29%|██▉       | 43/146 [02:43<06:31,  3.80s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 483          |
|    gen/time/fps                    | 794          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 90112        |
|    gen/train/approx_kl             | 0.0034109028 |
|    gen/train/clip_fraction         | 0.0222       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.25        |
|    gen/train/explained_variance    | 0.156        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 6.6          |
|    gen/train/n_updates             | 430          |
|    gen/train/policy_gradient_loss  | -0.00043     |
|    gen/train/std                   | 0.846        |
|    gen/train/value_loss            | 26.7         |
-----------------------------------------------------

round:  30%|███       | 44/146 [02:47<06:26,  3.79s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 490         |
|    gen/time/fps                    | 784         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 92160       |
|    gen/train/approx_kl             | 0.007405732 |
|    gen/train/clip_fraction         | 0.0602      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.24       |
|    gen/train/explained_variance    | 0.0334      |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 3.05        |
|    gen/train/n_updates             | 440         |
|    gen/train/policy_gradient_loss  | -0.00389    |
|    gen/train/std                   | 0.828       |
|    gen/train/value_loss            | 13.4        |
----------------------------------------------------

round:  31%|███       | 45/146 [02:51<06:21,  3.78s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 507          |
|    gen/time/fps                    | 789          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 94208        |
|    gen/train/approx_kl             | 0.0055139987 |
|    gen/train/clip_fraction         | 0.0778       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.23        |
|    gen/train/explained_variance    | -0.04        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 4.94         |
|    gen/train/n_updates             | 450          |
|    gen/train/policy_gradient_loss  | -0.00405     |
|    gen/train/std                   | 0.823        |
|    gen/train/value_loss            | 22.7         |
-----------------------------------------------------

round:  32%|███▏      | 46/146 [02:55<06:17,  3.77s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 516         |
|    gen/time/fps                    | 793         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 96256       |
|    gen/train/approx_kl             | 0.008674208 |
|    gen/train/clip_fraction         | 0.0793      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.22       |
|    gen/train/explained_variance    | 0.133       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 27.9        |
|    gen/train/n_updates             | 460         |
|    gen/train/policy_gradient_loss  | -0.00546    |
|    gen/train/std                   | 0.82        |
|    gen/train/value_loss            | 23.6        |
----------------------------------------------------

round:  32%|███▏      | 47/146 [02:58<06:13,  3.77s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 516          |
|    gen/time/fps                    | 793          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 98304        |
|    gen/train/approx_kl             | 0.0071753305 |
|    gen/train/clip_fraction         | 0.109        |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.23        |
|    gen/train/explained_variance    | 0.414        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 0.65         |
|    gen/train/n_updates             | 470          |
|    gen/train/policy_gradient_loss  | -0.00477     |
|    gen/train/std                   | 0.828        |
|    gen/train/value_loss            | 1.42         |
-----------------------------------------------------

round:  33%|███▎      | 48/146 [03:02<06:08,  3.76s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 536         |
|    gen/time/fps                    | 788         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 100352      |
|    gen/train/approx_kl             | 0.006253198 |
|    gen/train/clip_fraction         | 0.0941      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.24       |
|    gen/train/explained_variance    | 0.11        |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 1.89        |
|    gen/train/n_updates             | 480         |
|    gen/train/policy_gradient_loss  | -0.00295    |
|    gen/train/std                   | 0.834       |
|    gen/train/value_loss            | 13.4        |
----------------------------------------------------

round:  34%|███▎      | 49/146 [03:06<06:05,  3.77s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 547          |
|    gen/time/fps                    | 796          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 102400       |
|    gen/train/approx_kl             | 0.0010886884 |
|    gen/train/clip_fraction         | 0.016        |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.24        |
|    gen/train/explained_variance    | 0.141        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 6.93         |
|    gen/train/n_updates             | 490          |
|    gen/train/policy_gradient_loss  | 0.00028      |
|    gen/train/std                   | 0.83         |
|    gen/train/value_loss            | 23.8         |
-----------------------------------------------------

round:  34%|███▍      | 50/146 [03:10<06:01,  3.77s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 547         |
|    gen/time/fps                    | 786         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 104448      |
|    gen/train/approx_kl             | 0.008866068 |
|    gen/train/clip_fraction         | 0.109       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.24       |
|    gen/train/explained_variance    | 0.598       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 1.02        |
|    gen/train/n_updates             | 500         |
|    gen/train/policy_gradient_loss  | -0.00931    |
|    gen/train/std                   | 0.834       |
|    gen/train/value_loss            | 1.65        |
----------------------------------------------------

round:  35%|███▍      | 51/146 [03:14<06:00,  3.80s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 560         |
|    gen/time/fps                    | 790         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 106496      |
|    gen/train/approx_kl             | 0.003765387 |
|    gen/train/clip_fraction         | 0.0444      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.23       |
|    gen/train/explained_variance    | -0.033      |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 33.5        |
|    gen/train/n_updates             | 510         |
|    gen/train/policy_gradient_loss  | -0.0051     |
|    gen/train/std                   | 0.829       |
|    gen/train/value_loss            | 81.8        |
----------------------------------------------------

round:  36%|███▌      | 52/146 [03:17<05:55,  3.78s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 566          |
|    gen/time/fps                    | 795          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 108544       |
|    gen/train/approx_kl             | 0.0039844085 |
|    gen/train/clip_fraction         | 0.0367       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.23        |
|    gen/train/explained_variance    | -0.212       |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 5.35         |
|    gen/train/n_updates             | 520          |
|    gen/train/policy_gradient_loss  | -0.00157     |
|    gen/train/std                   | 0.829        |
|    gen/train/value_loss            | 27.1         |
-----------------------------------------------------

round:  36%|███▋      | 53/146 [03:21<05:51,  3.77s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 567          |
|    gen/time/fps                    | 793          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 110592       |
|    gen/train/approx_kl             | 0.0077832276 |
|    gen/train/clip_fraction         | 0.0816       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.23        |
|    gen/train/explained_variance    | 0.153        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 34           |
|    gen/train/n_updates             | 530          |
|    gen/train/policy_gradient_loss  | -0.00917     |
|    gen/train/std                   | 0.829        |
|    gen/train/value_loss            | 38.6         |
-----------------------------------------------------

round:  37%|███▋      | 54/146 [03:25<05:46,  3.77s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 574         |
|    gen/time/fps                    | 791         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 112640      |
|    gen/train/approx_kl             | 0.004648091 |
|    gen/train/clip_fraction         | 0.0721      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.23       |
|    gen/train/explained_variance    | 0.452       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 2.03        |
|    gen/train/n_updates             | 540         |
|    gen/train/policy_gradient_loss  | -0.00232    |
|    gen/train/std                   | 0.832       |
|    gen/train/value_loss            | 10.4        |
----------------------------------------------------

round:  38%|███▊      | 55/146 [03:29<05:42,  3.76s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 585          |
|    gen/time/fps                    | 797          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 114688       |
|    gen/train/approx_kl             | 0.0072025955 |
|    gen/train/clip_fraction         | 0.0663       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.23        |
|    gen/train/explained_variance    | 0.00811      |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 6.9          |
|    gen/train/n_updates             | 550          |
|    gen/train/policy_gradient_loss  | 0.00237      |
|    gen/train/std                   | 0.827        |
|    gen/train/value_loss            | 12.7         |
-----------------------------------------------------

round:  38%|███▊      | 56/146 [03:32<05:37,  3.75s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 594          |
|    gen/time/fps                    | 797          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 116736       |
|    gen/train/approx_kl             | 0.0029278006 |
|    gen/train/clip_fraction         | 0.0397       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.23        |
|    gen/train/explained_variance    | -0.0226      |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 4.87         |
|    gen/train/n_updates             | 560          |
|    gen/train/policy_gradient_loss  | -0.00415     |
|    gen/train/std                   | 0.823        |
|    gen/train/value_loss            | 29.3         |
-----------------------------------------------------

round:  39%|███▉      | 57/146 [03:36<05:33,  3.75s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 594         |
|    gen/time/fps                    | 794         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 118784      |
|    gen/train/approx_kl             | 0.003961526 |
|    gen/train/clip_fraction         | 0.0363      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.22       |
|    gen/train/explained_variance    | 0.212       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.842       |
|    gen/train/n_updates             | 570         |
|    gen/train/policy_gradient_loss  | -0.000184   |
|    gen/train/std                   | 0.821       |
|    gen/train/value_loss            | 12          |
----------------------------------------------------

round:  40%|███▉      | 58/146 [03:40<05:29,  3.75s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 594         |
|    gen/time/fps                    | 795         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 120832      |
|    gen/train/approx_kl             | 0.007366457 |
|    gen/train/clip_fraction         | 0.0999      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.23       |
|    gen/train/explained_variance    | 0.135       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.329       |
|    gen/train/n_updates             | 580         |
|    gen/train/policy_gradient_loss  | -0.00623    |
|    gen/train/std                   | 0.834       |
|    gen/train/value_loss            | 0.776       |
----------------------------------------------------

round:  40%|████      | 59/146 [03:44<05:26,  3.75s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 594         |
|    gen/time/fps                    | 793         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 122880      |
|    gen/train/approx_kl             | 0.014739519 |
|    gen/train/clip_fraction         | 0.145       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.26       |
|    gen/train/explained_variance    | 0.308       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.49        |
|    gen/train/n_updates             | 590         |
|    gen/train/policy_gradient_loss  | -0.00928    |
|    gen/train/std                   | 0.866       |
|    gen/train/value_loss            | 1.13        |
----------------------------------------------------

round:  41%|████      | 60/146 [03:47<05:22,  3.75s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_rew_wrapped_mean | 629        |
|    gen/time/fps                    | 788        |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 2          |
|    gen/time/total_timesteps        | 124928     |
|    gen/train/approx_kl             | 0.00502079 |
|    gen/train/clip_fraction         | 0.075      |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.28      |
|    gen/train/explained_variance    | 0.0335     |
|    gen/train/learning_rate         | 0.0003     |
|    gen/train/loss                  | 0.818      |
|    gen/train/n_updates             | 600        |
|    gen/train/policy_gradient_loss  | -0.000798  |
|    gen/train/std                   | 0.871      |
|    gen/train/value_loss            | 12.6       |
---------------------------------------------------

round:  42%|████▏     | 61/146 [03:51<05:21,  3.78s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 635         |
|    gen/time/fps                    | 632         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 3           |
|    gen/time/total_timesteps        | 126976      |
|    gen/train/approx_kl             | 0.005165441 |
|    gen/train/clip_fraction         | 0.0445      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.28       |
|    gen/train/explained_variance    | 0.00481     |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 26.2        |
|    gen/train/n_updates             | 610         |
|    gen/train/policy_gradient_loss  | -0.00381    |
|    gen/train/std                   | 0.872       |
|    gen/train/value_loss            | 33.7        |
----------------------------------------------------

round:  42%|████▏     | 62/146 [03:56<05:33,  3.97s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 635         |
|    gen/time/fps                    | 793         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 129024      |
|    gen/train/approx_kl             | 0.006219656 |
|    gen/train/clip_fraction         | 0.0503      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.29       |
|    gen/train/explained_variance    | 0.113       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.395       |
|    gen/train/n_updates             | 620         |
|    gen/train/policy_gradient_loss  | 0.000344    |
|    gen/train/std                   | 0.885       |
|    gen/train/value_loss            | 0.762       |
----------------------------------------------------

round:  43%|████▎     | 63/146 [03:59<05:24,  3.90s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 641          |
|    gen/time/fps                    | 786          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 131072       |
|    gen/train/approx_kl             | 0.0049381377 |
|    gen/train/clip_fraction         | 0.0378       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.3         |
|    gen/train/explained_variance    | 0.146        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 5.82         |
|    gen/train/n_updates             | 630          |
|    gen/train/policy_gradient_loss  | -0.0037      |
|    gen/train/std                   | 0.883        |
|    gen/train/value_loss            | 26.5         |
-----------------------------------------------------

round:  44%|████▍     | 64/146 [04:03<05:17,  3.87s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 626          |
|    gen/time/fps                    | 783          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 133120       |
|    gen/train/approx_kl             | 0.0057743015 |
|    gen/train/clip_fraction         | 0.0479       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.29        |
|    gen/train/explained_variance    | 0.104        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 16.8         |
|    gen/train/n_updates             | 640          |
|    gen/train/policy_gradient_loss  | -0.00561     |
|    gen/train/std                   | 0.879        |
|    gen/train/value_loss            | 40.5         |
-----------------------------------------------------

round:  45%|████▍     | 65/146 [04:07<05:11,  3.85s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 626         |
|    gen/time/fps                    | 781         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 135168      |
|    gen/train/approx_kl             | 0.010783653 |
|    gen/train/clip_fraction         | 0.107       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.29       |
|    gen/train/explained_variance    | -0.318      |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.35        |
|    gen/train/n_updates             | 650         |
|    gen/train/policy_gradient_loss  | -0.00427    |
|    gen/train/std                   | 0.876       |
|    gen/train/value_loss            | 1.08        |
----------------------------------------------------

round:  45%|████▌     | 66/146 [04:11<05:09,  3.87s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 580          |
|    gen/time/fps                    | 794          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 137216       |
|    gen/train/approx_kl             | 0.0014184937 |
|    gen/train/clip_fraction         | 0.024        |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.29        |
|    gen/train/explained_variance    | 0.048        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 33.2         |
|    gen/train/n_updates             | 660          |
|    gen/train/policy_gradient_loss  | -0.00233     |
|    gen/train/std                   | 0.876        |
|    gen/train/value_loss            | 43           |
-----------------------------------------------------

round:  46%|████▌     | 67/146 [04:15<05:02,  3.83s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 561         |
|    gen/time/fps                    | 791         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 139264      |
|    gen/train/approx_kl             | 0.006193195 |
|    gen/train/clip_fraction         | 0.0685      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.28       |
|    gen/train/explained_variance    | 0.132       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 13.1        |
|    gen/train/n_updates             | 670         |
|    gen/train/policy_gradient_loss  | -0.00708    |
|    gen/train/std                   | 0.873       |
|    gen/train/value_loss            | 50.8        |
----------------------------------------------------

round:  47%|████▋     | 68/146 [04:18<04:56,  3.81s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 524         |
|    gen/time/fps                    | 788         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 141312      |
|    gen/train/approx_kl             | 0.004737477 |
|    gen/train/clip_fraction         | 0.0724      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.28       |
|    gen/train/explained_variance    | 0.32        |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 20          |
|    gen/train/n_updates             | 680         |
|    gen/train/policy_gradient_loss  | -0.0083     |
|    gen/train/std                   | 0.871       |
|    gen/train/value_loss            | 34.6        |
----------------------------------------------------

round:  47%|████▋     | 69/146 [04:22<04:52,  3.80s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 524          |
|    gen/time/fps                    | 792          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 143360       |
|    gen/train/approx_kl             | 0.0063402513 |
|    gen/train/clip_fraction         | 0.105        |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.28        |
|    gen/train/explained_variance    | -0.836       |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 0.514        |
|    gen/train/n_updates             | 690          |
|    gen/train/policy_gradient_loss  | -0.00285     |
|    gen/train/std                   | 0.87         |
|    gen/train/value_loss            | 1.5          |
-----------------------------------------------------

round:  48%|████▊     | 70/146 [04:26<04:47,  3.79s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 524         |
|    gen/time/fps                    | 794         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 145408      |
|    gen/train/approx_kl             | 0.009119112 |
|    gen/train/clip_fraction         | 0.0858      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.3        |
|    gen/train/explained_variance    | 0.25        |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.283       |
|    gen/train/n_updates             | 700         |
|    gen/train/policy_gradient_loss  | -0.00409    |
|    gen/train/std                   | 0.896       |
|    gen/train/value_loss            | 0.695       |
----------------------------------------------------

round:  49%|████▊     | 71/146 [04:30<04:43,  3.79s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 554          |
|    gen/time/fps                    | 787          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 147456       |
|    gen/train/approx_kl             | 0.0044103772 |
|    gen/train/clip_fraction         | 0.0579       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.32        |
|    gen/train/explained_variance    | -0.0403      |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 0.775        |
|    gen/train/n_updates             | 710          |
|    gen/train/policy_gradient_loss  | -0.000942    |
|    gen/train/std                   | 0.905        |
|    gen/train/value_loss            | 12.5         |
-----------------------------------------------------

round:  49%|████▉     | 72/146 [04:33<04:39,  3.78s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 561         |
|    gen/time/fps                    | 783         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 149504      |
|    gen/train/approx_kl             | 0.004434803 |
|    gen/train/clip_fraction         | 0.0553      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.32       |
|    gen/train/explained_variance    | -0.0184     |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 12.4        |
|    gen/train/n_updates             | 720         |
|    gen/train/policy_gradient_loss  | -0.000454   |
|    gen/train/std                   | 0.906       |
|    gen/train/value_loss            | 12.4        |
----------------------------------------------------

round:  50%|█████     | 73/146 [04:37<04:36,  3.79s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 561         |
|    gen/time/fps                    | 789         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 151552      |
|    gen/train/approx_kl             | 0.004109665 |
|    gen/train/clip_fraction         | 0.05        |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.33       |
|    gen/train/explained_variance    | 0.236       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.273       |
|    gen/train/n_updates             | 730         |
|    gen/train/policy_gradient_loss  | 1.3e-05     |
|    gen/train/std                   | 0.92        |
|    gen/train/value_loss            | 0.569       |
----------------------------------------------------

round:  51%|█████     | 74/146 [04:41<04:33,  3.79s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 582         |
|    gen/time/fps                    | 764         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 153600      |
|    gen/train/approx_kl             | 0.004263496 |
|    gen/train/clip_fraction         | 0.0584      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.34       |
|    gen/train/explained_variance    | 0.0348      |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.442       |
|    gen/train/n_updates             | 740         |
|    gen/train/policy_gradient_loss  | -0.00398    |
|    gen/train/std                   | 0.927       |
|    gen/train/value_loss            | 11.9        |
----------------------------------------------------

round:  51%|█████▏    | 75/146 [04:45<04:31,  3.82s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 582         |
|    gen/time/fps                    | 775         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 155648      |
|    gen/train/approx_kl             | 0.008628234 |
|    gen/train/clip_fraction         | 0.11        |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.35       |
|    gen/train/explained_variance    | 0.543       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.26        |
|    gen/train/n_updates             | 750         |
|    gen/train/policy_gradient_loss  | -0.00525    |
|    gen/train/std                   | 0.946       |
|    gen/train/value_loss            | 0.593       |
----------------------------------------------------

round:  52%|█████▏    | 76/146 [04:49<04:28,  3.84s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 598          |
|    gen/time/fps                    | 779          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 157696       |
|    gen/train/approx_kl             | 0.0051231394 |
|    gen/train/clip_fraction         | 0.0734       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.36        |
|    gen/train/explained_variance    | 0.113        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 0.39         |
|    gen/train/n_updates             | 760          |
|    gen/train/policy_gradient_loss  | -9e-05       |
|    gen/train/std                   | 0.942        |
|    gen/train/value_loss            | 11.6         |
-----------------------------------------------------

round:  53%|█████▎    | 77/146 [04:53<04:24,  3.84s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 598          |
|    gen/time/fps                    | 783          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 159744       |
|    gen/train/approx_kl             | 0.0071421647 |
|    gen/train/clip_fraction         | 0.0958       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.36        |
|    gen/train/explained_variance    | 0.658        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 0.534        |
|    gen/train/n_updates             | 770          |
|    gen/train/policy_gradient_loss  | -0.00776     |
|    gen/train/std                   | 0.94         |
|    gen/train/value_loss            | 1.18         |
-----------------------------------------------------

round:  53%|█████▎    | 78/146 [04:57<04:20,  3.83s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 613          |
|    gen/time/fps                    | 790          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 161792       |
|    gen/train/approx_kl             | 0.0060063824 |
|    gen/train/clip_fraction         | 0.0502       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.36        |
|    gen/train/explained_variance    | 0.298        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 5.99         |
|    gen/train/n_updates             | 780          |
|    gen/train/policy_gradient_loss  | -0.00218     |
|    gen/train/std                   | 0.939        |
|    gen/train/value_loss            | 20.2         |
-----------------------------------------------------

round:  54%|█████▍    | 79/146 [05:00<04:15,  3.82s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 617         |
|    gen/time/fps                    | 782         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 163840      |
|    gen/train/approx_kl             | 0.009351809 |
|    gen/train/clip_fraction         | 0.118       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.36       |
|    gen/train/explained_variance    | 0.0938      |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 18.1        |
|    gen/train/n_updates             | 790         |
|    gen/train/policy_gradient_loss  | -0.00691    |
|    gen/train/std                   | 0.939       |
|    gen/train/value_loss            | 13.6        |
----------------------------------------------------

round:  55%|█████▍    | 80/146 [05:04<04:11,  3.82s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 593         |
|    gen/time/fps                    | 790         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 165888      |
|    gen/train/approx_kl             | 0.003251575 |
|    gen/train/clip_fraction         | 0.0467      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.36       |
|    gen/train/explained_variance    | 0.073       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 36.2        |
|    gen/train/n_updates             | 800         |
|    gen/train/policy_gradient_loss  | -0.00523    |
|    gen/train/std                   | 0.938       |
|    gen/train/value_loss            | 79.1        |
----------------------------------------------------

round:  55%|█████▌    | 81/146 [05:08<04:07,  3.80s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 599          |
|    gen/time/fps                    | 793          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 167936       |
|    gen/train/approx_kl             | 0.0066148373 |
|    gen/train/clip_fraction         | 0.0662       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.35        |
|    gen/train/explained_variance    | 0.171        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 15.3         |
|    gen/train/n_updates             | 810          |
|    gen/train/policy_gradient_loss  | -0.00196     |
|    gen/train/std                   | 0.937        |
|    gen/train/value_loss            | 24.3         |
-----------------------------------------------------

round:  56%|█████▌    | 82/146 [05:12<04:02,  3.79s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 605         |
|    gen/time/fps                    | 795         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 169984      |
|    gen/train/approx_kl             | 0.004712427 |
|    gen/train/clip_fraction         | 0.0351      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.35       |
|    gen/train/explained_variance    | 0.255       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 7.32        |
|    gen/train/n_updates             | 820         |
|    gen/train/policy_gradient_loss  | -0.0051     |
|    gen/train/std                   | 0.934       |
|    gen/train/value_loss            | 37.3        |
----------------------------------------------------

round:  57%|█████▋    | 83/146 [05:16<04:00,  3.81s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 605          |
|    gen/time/fps                    | 790          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 172032       |
|    gen/train/approx_kl             | 0.0077405563 |
|    gen/train/clip_fraction         | 0.0706       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.35        |
|    gen/train/explained_variance    | -0.447       |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 0.307        |
|    gen/train/n_updates             | 830          |
|    gen/train/policy_gradient_loss  | -0.000932    |
|    gen/train/std                   | 0.929        |
|    gen/train/value_loss            | 1.29         |
-----------------------------------------------------

round:  58%|█████▊    | 84/146 [05:19<03:55,  3.80s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 628         |
|    gen/time/fps                    | 780         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 174080      |
|    gen/train/approx_kl             | 0.006814155 |
|    gen/train/clip_fraction         | 0.102       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.34       |
|    gen/train/explained_variance    | 0.369       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 4.29        |
|    gen/train/n_updates             | 840         |
|    gen/train/policy_gradient_loss  | -0.00607    |
|    gen/train/std                   | 0.925       |
|    gen/train/value_loss            | 9.26        |
----------------------------------------------------

round:  58%|█████▊    | 85/146 [05:24<03:59,  3.93s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 630         |
|    gen/time/fps                    | 688         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 176128      |
|    gen/train/approx_kl             | 0.011045111 |
|    gen/train/clip_fraction         | 0.148       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.34       |
|    gen/train/explained_variance    | 0.161       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 8.71        |
|    gen/train/n_updates             | 850         |
|    gen/train/policy_gradient_loss  | -0.00877    |
|    gen/train/std                   | 0.926       |
|    gen/train/value_loss            | 25.5        |
----------------------------------------------------

round:  59%|█████▉    | 86/146 [05:28<04:00,  4.00s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_rew_wrapped_mean | 630        |
|    gen/time/fps                    | 787        |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 2          |
|    gen/time/total_timesteps        | 178176     |
|    gen/train/approx_kl             | 0.01147538 |
|    gen/train/clip_fraction         | 0.143      |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.35      |
|    gen/train/explained_variance    | 0.306      |
|    gen/train/learning_rate         | 0.0003     |
|    gen/train/loss                  | 0.366      |
|    gen/train/n_updates             | 860        |
|    gen/train/policy_gradient_loss  | -0.00718   |
|    gen/train/std                   | 0.936      |
|    gen/train/value_loss            | 1.35       |
---------------------------------------------------

round:  60%|█████▉    | 87/146 [05:31<03:52,  3.94s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 630         |
|    gen/time/fps                    | 791         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 180224      |
|    gen/train/approx_kl             | 0.013238069 |
|    gen/train/clip_fraction         | 0.152       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.36       |
|    gen/train/explained_variance    | 0.588       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.422       |
|    gen/train/n_updates             | 870         |
|    gen/train/policy_gradient_loss  | -0.014      |
|    gen/train/std                   | 0.945       |
|    gen/train/value_loss            | 0.947       |
----------------------------------------------------

round:  60%|██████    | 88/146 [05:35<03:45,  3.89s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 658         |
|    gen/time/fps                    | 783         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 182272      |
|    gen/train/approx_kl             | 0.010457922 |
|    gen/train/clip_fraction         | 0.139       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.37       |
|    gen/train/explained_variance    | 0.142       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 10.1        |
|    gen/train/n_updates             | 880         |
|    gen/train/policy_gradient_loss  | -0.0059     |
|    gen/train/std                   | 0.953       |
|    gen/train/value_loss            | 6.54        |
----------------------------------------------------

round:  61%|██████    | 89/146 [05:39<03:39,  3.86s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 657         |
|    gen/time/fps                    | 787         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 184320      |
|    gen/train/approx_kl             | 0.003784472 |
|    gen/train/clip_fraction         | 0.0369      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.37       |
|    gen/train/explained_variance    | -0.07       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 5.24        |
|    gen/train/n_updates             | 890         |
|    gen/train/policy_gradient_loss  | -0.000376   |
|    gen/train/std                   | 0.95        |
|    gen/train/value_loss            | 41.3        |
----------------------------------------------------

round:  62%|██████▏   | 90/146 [05:43<03:34,  3.83s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 657         |
|    gen/time/fps                    | 797         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 186368      |
|    gen/train/approx_kl             | 0.011000747 |
|    gen/train/clip_fraction         | 0.101       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.36       |
|    gen/train/explained_variance    | -1.07       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.225       |
|    gen/train/n_updates             | 900         |
|    gen/train/policy_gradient_loss  | -0.000771   |
|    gen/train/std                   | 0.94        |
|    gen/train/value_loss            | 0.864       |
----------------------------------------------------

round:  62%|██████▏   | 91/146 [05:47<03:29,  3.81s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 657         |
|    gen/time/fps                    | 794         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 188416      |
|    gen/train/approx_kl             | 0.006103541 |
|    gen/train/clip_fraction         | 0.0812      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.37       |
|    gen/train/explained_variance    | 0.416       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.195       |
|    gen/train/n_updates             | 910         |
|    gen/train/policy_gradient_loss  | -0.00256    |
|    gen/train/std                   | 0.968       |
|    gen/train/value_loss            | 0.853       |
----------------------------------------------------

round:  63%|██████▎   | 92/146 [05:50<03:25,  3.80s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 669         |
|    gen/time/fps                    | 792         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 190464      |
|    gen/train/approx_kl             | 0.009831965 |
|    gen/train/clip_fraction         | 0.0801      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.39       |
|    gen/train/explained_variance    | 0.271       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 4.71        |
|    gen/train/n_updates             | 920         |
|    gen/train/policy_gradient_loss  | -0.00298    |
|    gen/train/std                   | 0.977       |
|    gen/train/value_loss            | 9.61        |
----------------------------------------------------

round:  64%|██████▎   | 93/146 [05:54<03:20,  3.79s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 654          |
|    gen/time/fps                    | 786          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 192512       |
|    gen/train/approx_kl             | 0.0062568276 |
|    gen/train/clip_fraction         | 0.0644       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.4         |
|    gen/train/explained_variance    | 0.164        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 1.65         |
|    gen/train/n_updates             | 930          |
|    gen/train/policy_gradient_loss  | -0.00216     |
|    gen/train/std                   | 0.985        |
|    gen/train/value_loss            | 15.3         |
-----------------------------------------------------

round:  64%|██████▍   | 94/146 [05:58<03:16,  3.78s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 631         |
|    gen/time/fps                    | 791         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 194560      |
|    gen/train/approx_kl             | 0.002353487 |
|    gen/train/clip_fraction         | 0.0228      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.4        |
|    gen/train/explained_variance    | 0.214       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 18.2        |
|    gen/train/n_updates             | 940         |
|    gen/train/policy_gradient_loss  | -0.00254    |
|    gen/train/std                   | 0.983       |
|    gen/train/value_loss            | 45          |
----------------------------------------------------

round:  65%|██████▌   | 95/146 [06:02<03:12,  3.77s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 631         |
|    gen/time/fps                    | 781         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 196608      |
|    gen/train/approx_kl             | 0.006686069 |
|    gen/train/clip_fraction         | 0.0897      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.4        |
|    gen/train/explained_variance    | 0.461       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 3.85        |
|    gen/train/n_updates             | 950         |
|    gen/train/policy_gradient_loss  | -0.00623    |
|    gen/train/std                   | 0.982       |
|    gen/train/value_loss            | 22.7        |
----------------------------------------------------

round:  66%|██████▌   | 96/146 [06:05<03:08,  3.78s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 609          |
|    gen/time/fps                    | 791          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 198656       |
|    gen/train/approx_kl             | 0.0064319363 |
|    gen/train/clip_fraction         | 0.0634       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.4         |
|    gen/train/explained_variance    | 0.472        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 30.7         |
|    gen/train/n_updates             | 960          |
|    gen/train/policy_gradient_loss  | -0.00626     |
|    gen/train/std                   | 0.982        |
|    gen/train/value_loss            | 39.9         |
-----------------------------------------------------

round:  66%|██████▋   | 97/146 [06:09<03:04,  3.77s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 611          |
|    gen/time/fps                    | 798          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 200704       |
|    gen/train/approx_kl             | 0.0064326026 |
|    gen/train/clip_fraction         | 0.103        |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.4         |
|    gen/train/explained_variance    | 0.549        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 4.58         |
|    gen/train/n_updates             | 970          |
|    gen/train/policy_gradient_loss  | -0.00687     |
|    gen/train/std                   | 0.992        |
|    gen/train/value_loss            | 10.6         |
-----------------------------------------------------

round:  67%|██████▋   | 98/146 [06:13<03:00,  3.77s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 607         |
|    gen/time/fps                    | 792         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 202752      |
|    gen/train/approx_kl             | 0.008172163 |
|    gen/train/clip_fraction         | 0.0798      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.41       |
|    gen/train/explained_variance    | 0.724       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 11.1        |
|    gen/train/n_updates             | 980         |
|    gen/train/policy_gradient_loss  | -0.00449    |
|    gen/train/std                   | 0.991       |
|    gen/train/value_loss            | 15.5        |
----------------------------------------------------

round:  68%|██████▊   | 99/146 [06:17<02:56,  3.76s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 612         |
|    gen/time/fps                    | 760         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 204800      |
|    gen/train/approx_kl             | 0.011870781 |
|    gen/train/clip_fraction         | 0.169       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.41       |
|    gen/train/explained_variance    | 0.515       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 3.47        |
|    gen/train/n_updates             | 990         |
|    gen/train/policy_gradient_loss  | -0.0138     |
|    gen/train/std                   | 0.996       |
|    gen/train/value_loss            | 13.5        |
----------------------------------------------------

round:  68%|██████▊   | 100/146 [06:21<02:54,  3.80s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 602          |
|    gen/time/fps                    | 799          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 206848       |
|    gen/train/approx_kl             | 0.0053247176 |
|    gen/train/clip_fraction         | 0.0425       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.41        |
|    gen/train/explained_variance    | 0.56         |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 28.2         |
|    gen/train/n_updates             | 1000         |
|    gen/train/policy_gradient_loss  | -0.00271     |
|    gen/train/std                   | 0.988        |
|    gen/train/value_loss            | 36.4         |
-----------------------------------------------------

round:  69%|██████▉   | 101/146 [06:24<02:50,  3.79s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 602         |
|    gen/time/fps                    | 787         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 208896      |
|    gen/train/approx_kl             | 0.010456721 |
|    gen/train/clip_fraction         | 0.151       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.4        |
|    gen/train/explained_variance    | 0.779       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 2.62        |
|    gen/train/n_updates             | 1010        |
|    gen/train/policy_gradient_loss  | -0.014      |
|    gen/train/std                   | 0.973       |
|    gen/train/value_loss            | 5.47        |
----------------------------------------------------

round:  70%|██████▉   | 102/146 [06:28<02:46,  3.79s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 602         |
|    gen/time/fps                    | 785         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 210944      |
|    gen/train/approx_kl             | 0.011344118 |
|    gen/train/clip_fraction         | 0.15        |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.39       |
|    gen/train/explained_variance    | 0.854       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 2.53        |
|    gen/train/n_updates             | 1020        |
|    gen/train/policy_gradient_loss  | -0.0119     |
|    gen/train/std                   | 0.977       |
|    gen/train/value_loss            | 4.64        |
----------------------------------------------------

round:  71%|███████   | 103/146 [06:32<02:43,  3.79s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 602          |
|    gen/time/fps                    | 786          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 212992       |
|    gen/train/approx_kl             | 0.0072132843 |
|    gen/train/clip_fraction         | 0.113        |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.39        |
|    gen/train/explained_variance    | 0.803        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 3.81         |
|    gen/train/n_updates             | 1030         |
|    gen/train/policy_gradient_loss  | -0.011       |
|    gen/train/std                   | 0.96         |
|    gen/train/value_loss            | 8.32         |
-----------------------------------------------------

round:  71%|███████   | 104/146 [06:36<02:39,  3.79s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 617         |
|    gen/time/fps                    | 796         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 215040      |
|    gen/train/approx_kl             | 0.010856345 |
|    gen/train/clip_fraction         | 0.124       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.38       |
|    gen/train/explained_variance    | 0.408       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 39.1        |
|    gen/train/n_updates             | 1040        |
|    gen/train/policy_gradient_loss  | -0.00886    |
|    gen/train/std                   | 0.959       |
|    gen/train/value_loss            | 53.5        |
----------------------------------------------------

round:  72%|███████▏  | 105/146 [06:39<02:35,  3.78s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 617         |
|    gen/time/fps                    | 793         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 217088      |
|    gen/train/approx_kl             | 0.012359882 |
|    gen/train/clip_fraction         | 0.137       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.38       |
|    gen/train/explained_variance    | 0.693       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 2.94        |
|    gen/train/n_updates             | 1050        |
|    gen/train/policy_gradient_loss  | -0.00793    |
|    gen/train/std                   | 0.96        |
|    gen/train/value_loss            | 5.61        |
----------------------------------------------------

round:  73%|███████▎  | 106/146 [06:43<02:31,  3.79s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_rew_wrapped_mean | 617        |
|    gen/time/fps                    | 784        |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 2          |
|    gen/time/total_timesteps        | 219136     |
|    gen/train/approx_kl             | 0.01185911 |
|    gen/train/clip_fraction         | 0.14       |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.38      |
|    gen/train/explained_variance    | 0.741      |
|    gen/train/learning_rate         | 0.0003     |
|    gen/train/loss                  | 1.43       |
|    gen/train/n_updates             | 1060       |
|    gen/train/policy_gradient_loss  | -0.00836   |
|    gen/train/std                   | 0.966      |
|    gen/train/value_loss            | 2.86       |
---------------------------------------------------

round:  73%|███████▎  | 107/146 [06:47<02:27,  3.79s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_rew_wrapped_mean | 658        |
|    gen/time/fps                    | 787        |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 2          |
|    gen/time/total_timesteps        | 221184     |
|    gen/train/approx_kl             | 0.00854413 |
|    gen/train/clip_fraction         | 0.131      |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.39      |
|    gen/train/explained_variance    | 0.334      |
|    gen/train/learning_rate         | 0.0003     |
|    gen/train/loss                  | 19.7       |
|    gen/train/n_updates             | 1070       |
|    gen/train/policy_gradient_loss  | -0.00386   |
|    gen/train/std                   | 0.972      |
|    gen/train/value_loss            | 22.3       |
---------------------------------------------------

round:  74%|███████▍  | 108/146 [06:51<02:24,  3.79s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 658         |
|    gen/time/fps                    | 718         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 223232      |
|    gen/train/approx_kl             | 0.008127629 |
|    gen/train/clip_fraction         | 0.118       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.4        |
|    gen/train/explained_variance    | 0.717       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 1.17        |
|    gen/train/n_updates             | 1080        |
|    gen/train/policy_gradient_loss  | -0.00756    |
|    gen/train/std                   | 0.981       |
|    gen/train/value_loss            | 2.22        |
----------------------------------------------------

round:  75%|███████▍  | 109/146 [06:55<02:29,  4.04s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_rew_wrapped_mean | 671        |
|    gen/time/fps                    | 786        |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 2          |
|    gen/time/total_timesteps        | 225280     |
|    gen/train/approx_kl             | 0.00851368 |
|    gen/train/clip_fraction         | 0.0968     |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.4       |
|    gen/train/explained_variance    | 0.15       |
|    gen/train/learning_rate         | 0.0003     |
|    gen/train/loss                  | 15.4       |
|    gen/train/n_updates             | 1090       |
|    gen/train/policy_gradient_loss  | 0.00105    |
|    gen/train/std                   | 0.994      |
|    gen/train/value_loss            | 22.3       |
---------------------------------------------------

round:  75%|███████▌  | 110/146 [06:59<02:22,  3.97s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 671         |
|    gen/time/fps                    | 790         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 227328      |
|    gen/train/approx_kl             | 0.013904126 |
|    gen/train/clip_fraction         | 0.161       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.41       |
|    gen/train/explained_variance    | 0.784       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.851       |
|    gen/train/n_updates             | 1100        |
|    gen/train/policy_gradient_loss  | -0.0124     |
|    gen/train/std                   | 0.994       |
|    gen/train/value_loss            | 1.74        |
----------------------------------------------------

round:  76%|███████▌  | 111/146 [07:03<02:16,  3.91s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_rew_wrapped_mean | 671        |
|    gen/time/fps                    | 782        |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 2          |
|    gen/time/total_timesteps        | 229376     |
|    gen/train/approx_kl             | 0.01122525 |
|    gen/train/clip_fraction         | 0.144      |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.4       |
|    gen/train/explained_variance    | 0.763      |
|    gen/train/learning_rate         | 0.0003     |
|    gen/train/loss                  | 0.601      |
|    gen/train/n_updates             | 1110       |
|    gen/train/policy_gradient_loss  | -0.00993   |
|    gen/train/std                   | 0.977      |
|    gen/train/value_loss            | 1.23       |
---------------------------------------------------

round:  77%|███████▋  | 112/146 [07:07<02:11,  3.88s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 671         |
|    gen/time/fps                    | 788         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 231424      |
|    gen/train/approx_kl             | 0.010461642 |
|    gen/train/clip_fraction         | 0.11        |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.4        |
|    gen/train/explained_variance    | 0.661       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.568       |
|    gen/train/n_updates             | 1120        |
|    gen/train/policy_gradient_loss  | -0.00694    |
|    gen/train/std                   | 0.975       |
|    gen/train/value_loss            | 1.52        |
----------------------------------------------------

round:  77%|███████▋  | 113/146 [07:11<02:06,  3.84s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 720         |
|    gen/time/fps                    | 796         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 233472      |
|    gen/train/approx_kl             | 0.008927832 |
|    gen/train/clip_fraction         | 0.0668      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.39       |
|    gen/train/explained_variance    | -0.107      |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 5.13        |
|    gen/train/n_updates             | 1130        |
|    gen/train/policy_gradient_loss  | 0.000226    |
|    gen/train/std                   | 0.978       |
|    gen/train/value_loss            | 21.9        |
----------------------------------------------------

round:  78%|███████▊  | 114/146 [07:14<02:02,  3.82s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 720         |
|    gen/time/fps                    | 792         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 235520      |
|    gen/train/approx_kl             | 0.008572119 |
|    gen/train/clip_fraction         | 0.104       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.42       |
|    gen/train/explained_variance    | 0.532       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.306       |
|    gen/train/n_updates             | 1140        |
|    gen/train/policy_gradient_loss  | -0.00493    |
|    gen/train/std                   | 1.03        |
|    gen/train/value_loss            | 1.63        |
----------------------------------------------------

round:  79%|███████▉  | 115/146 [07:18<01:58,  3.84s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 736         |
|    gen/time/fps                    | 788         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 237568      |
|    gen/train/approx_kl             | 0.008075712 |
|    gen/train/clip_fraction         | 0.0978      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.45       |
|    gen/train/explained_variance    | -0.0106     |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.553       |
|    gen/train/n_updates             | 1150        |
|    gen/train/policy_gradient_loss  | -0.00166    |
|    gen/train/std                   | 1.03        |
|    gen/train/value_loss            | 16.3        |
----------------------------------------------------

round:  79%|███████▉  | 116/146 [07:22<01:54,  3.82s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 743          |
|    gen/time/fps                    | 786          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 239616       |
|    gen/train/approx_kl             | 0.0033422373 |
|    gen/train/clip_fraction         | 0.0584       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.45        |
|    gen/train/explained_variance    | -0.0555      |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 3.42         |
|    gen/train/n_updates             | 1160         |
|    gen/train/policy_gradient_loss  | -0.00164     |
|    gen/train/std                   | 1.03         |
|    gen/train/value_loss            | 37.7         |
-----------------------------------------------------

round:  80%|████████  | 117/146 [07:26<01:50,  3.81s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 743         |
|    gen/time/fps                    | 787         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 241664      |
|    gen/train/approx_kl             | 0.004288719 |
|    gen/train/clip_fraction         | 0.0568      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.45       |
|    gen/train/explained_variance    | 0.17        |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 2.32        |
|    gen/train/n_updates             | 1170        |
|    gen/train/policy_gradient_loss  | -0.00115    |
|    gen/train/std                   | 1.03        |
|    gen/train/value_loss            | 10.5        |
----------------------------------------------------

round:  81%|████████  | 118/146 [07:30<01:46,  3.81s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 742          |
|    gen/time/fps                    | 782          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 243712       |
|    gen/train/approx_kl             | 0.0026195704 |
|    gen/train/clip_fraction         | 0.0294       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.45        |
|    gen/train/explained_variance    | 0.187        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 5.13         |
|    gen/train/n_updates             | 1180         |
|    gen/train/policy_gradient_loss  | -0.00399     |
|    gen/train/std                   | 1.04         |
|    gen/train/value_loss            | 41.4         |
-----------------------------------------------------

round:  82%|████████▏ | 119/146 [07:33<01:42,  3.81s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 719          |
|    gen/time/fps                    | 784          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 245760       |
|    gen/train/approx_kl             | 0.0021713136 |
|    gen/train/clip_fraction         | 0.016        |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.45        |
|    gen/train/explained_variance    | 0.227        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 27.3         |
|    gen/train/n_updates             | 1190         |
|    gen/train/policy_gradient_loss  | 0.000157     |
|    gen/train/std                   | 1.03         |
|    gen/train/value_loss            | 33.4         |
-----------------------------------------------------

round:  82%|████████▏ | 120/146 [07:37<01:38,  3.81s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 714          |
|    gen/time/fps                    | 787          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 247808       |
|    gen/train/approx_kl             | 0.0069446317 |
|    gen/train/clip_fraction         | 0.0453       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.45        |
|    gen/train/explained_variance    | 0.424        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 5.88         |
|    gen/train/n_updates             | 1200         |
|    gen/train/policy_gradient_loss  | 0.000178     |
|    gen/train/std                   | 1.03         |
|    gen/train/value_loss            | 10.6         |
-----------------------------------------------------

round:  83%|████████▎ | 121/146 [07:41<01:35,  3.81s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 714         |
|    gen/time/fps                    | 790         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 249856      |
|    gen/train/approx_kl             | 0.009087678 |
|    gen/train/clip_fraction         | 0.0988      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.46       |
|    gen/train/explained_variance    | 0.165       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.382       |
|    gen/train/n_updates             | 1210        |
|    gen/train/policy_gradient_loss  | -0.00139    |
|    gen/train/std                   | 1.06        |
|    gen/train/value_loss            | 1.12        |
----------------------------------------------------

round:  84%|████████▎ | 122/146 [07:45<01:31,  3.80s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 714         |
|    gen/time/fps                    | 784         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 251904      |
|    gen/train/approx_kl             | 0.018265575 |
|    gen/train/clip_fraction         | 0.128       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.49       |
|    gen/train/explained_variance    | 0.684       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.458       |
|    gen/train/n_updates             | 1220        |
|    gen/train/policy_gradient_loss  | -0.00724    |
|    gen/train/std                   | 1.09        |
|    gen/train/value_loss            | 1.04        |
----------------------------------------------------

round:  84%|████████▍ | 123/146 [07:49<01:27,  3.80s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 653         |
|    gen/time/fps                    | 789         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 253952      |
|    gen/train/approx_kl             | 0.005339601 |
|    gen/train/clip_fraction         | 0.0557      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.51       |
|    gen/train/explained_variance    | 0.121       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 18.7        |
|    gen/train/n_updates             | 1230        |
|    gen/train/policy_gradient_loss  | -0.00446    |
|    gen/train/std                   | 1.09        |
|    gen/train/value_loss            | 33.1        |
----------------------------------------------------

round:  85%|████████▍ | 124/146 [07:52<01:23,  3.80s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 644          |
|    gen/time/fps                    | 789          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 256000       |
|    gen/train/approx_kl             | 0.0044585112 |
|    gen/train/clip_fraction         | 0.0294       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.51        |
|    gen/train/explained_variance    | 0.199        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 16.7         |
|    gen/train/n_updates             | 1240         |
|    gen/train/policy_gradient_loss  | -0.00259     |
|    gen/train/std                   | 1.09         |
|    gen/train/value_loss            | 20.6         |
-----------------------------------------------------

round:  86%|████████▌ | 125/146 [07:56<01:19,  3.80s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 630          |
|    gen/time/fps                    | 791          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 258048       |
|    gen/train/approx_kl             | 0.0071211113 |
|    gen/train/clip_fraction         | 0.045        |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.51        |
|    gen/train/explained_variance    | 0.211        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 4.84         |
|    gen/train/n_updates             | 1250         |
|    gen/train/policy_gradient_loss  | -0.0021      |
|    gen/train/std                   | 1.1          |
|    gen/train/value_loss            | 14           |
-----------------------------------------------------

round:  86%|████████▋ | 126/146 [08:00<01:15,  3.79s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 634         |
|    gen/time/fps                    | 784         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 260096      |
|    gen/train/approx_kl             | 0.007468432 |
|    gen/train/clip_fraction         | 0.0988      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.51       |
|    gen/train/explained_variance    | 0.356       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 1.44        |
|    gen/train/n_updates             | 1260        |
|    gen/train/policy_gradient_loss  | -0.00096    |
|    gen/train/std                   | 1.09        |
|    gen/train/value_loss            | 5.41        |
----------------------------------------------------

round:  87%|████████▋ | 127/146 [08:04<01:12,  3.80s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 630          |
|    gen/time/fps                    | 783          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 262144       |
|    gen/train/approx_kl             | 0.0044705635 |
|    gen/train/clip_fraction         | 0.0715       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.51        |
|    gen/train/explained_variance    | 0.382        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 9.32         |
|    gen/train/n_updates             | 1270         |
|    gen/train/policy_gradient_loss  | -0.00188     |
|    gen/train/std                   | 1.09         |
|    gen/train/value_loss            | 10.2         |
-----------------------------------------------------

round:  88%|████████▊ | 128/146 [08:08<01:08,  3.80s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 628          |
|    gen/time/fps                    | 791          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 264192       |
|    gen/train/approx_kl             | 0.0045119068 |
|    gen/train/clip_fraction         | 0.0945       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.5         |
|    gen/train/explained_variance    | 0.506        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 7.13         |
|    gen/train/n_updates             | 1280         |
|    gen/train/policy_gradient_loss  | -0.00448     |
|    gen/train/std                   | 1.08         |
|    gen/train/value_loss            | 12.8         |
-----------------------------------------------------

round:  88%|████████▊ | 129/146 [08:11<01:04,  3.80s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 632         |
|    gen/time/fps                    | 789         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 266240      |
|    gen/train/approx_kl             | 0.006755358 |
|    gen/train/clip_fraction         | 0.0951      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.5        |
|    gen/train/explained_variance    | 0.674       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 3.51        |
|    gen/train/n_updates             | 1290        |
|    gen/train/policy_gradient_loss  | -0.00317    |
|    gen/train/std                   | 1.09        |
|    gen/train/value_loss            | 7.13        |
----------------------------------------------------

round:  89%|████████▉ | 130/146 [08:15<01:00,  3.79s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 605         |
|    gen/time/fps                    | 786         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 268288      |
|    gen/train/approx_kl             | 0.009840423 |
|    gen/train/clip_fraction         | 0.147       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.5        |
|    gen/train/explained_variance    | 0.479       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 2.94        |
|    gen/train/n_updates             | 1300        |
|    gen/train/policy_gradient_loss  | -0.00923    |
|    gen/train/std                   | 1.08        |
|    gen/train/value_loss            | 11.3        |
----------------------------------------------------

round:  90%|████████▉ | 131/146 [08:19<00:56,  3.79s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 517         |
|    gen/time/fps                    | 789         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 270336      |
|    gen/train/approx_kl             | 0.007991383 |
|    gen/train/clip_fraction         | 0.0741      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.49       |
|    gen/train/explained_variance    | 0.444       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 9.58        |
|    gen/train/n_updates             | 1310        |
|    gen/train/policy_gradient_loss  | -0.0059     |
|    gen/train/std                   | 1.08        |
|    gen/train/value_loss            | 21.3        |
----------------------------------------------------

round:  90%|█████████ | 132/146 [08:23<00:53,  3.82s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 521         |
|    gen/time/fps                    | 626         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 3           |
|    gen/time/total_timesteps        | 272384      |
|    gen/train/approx_kl             | 0.007371483 |
|    gen/train/clip_fraction         | 0.121       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.49       |
|    gen/train/explained_variance    | 0.495       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 0.89        |
|    gen/train/n_updates             | 1320        |
|    gen/train/policy_gradient_loss  | -0.00532    |
|    gen/train/std                   | 1.07        |
|    gen/train/value_loss            | 5.47        |
----------------------------------------------------

round:  91%|█████████ | 133/146 [08:27<00:52,  4.05s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 527         |
|    gen/time/fps                    | 789         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 274432      |
|    gen/train/approx_kl             | 0.008402305 |
|    gen/train/clip_fraction         | 0.116       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.49       |
|    gen/train/explained_variance    | 0.539       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 5.01        |
|    gen/train/n_updates             | 1330        |
|    gen/train/policy_gradient_loss  | -0.00745    |
|    gen/train/std                   | 1.07        |
|    gen/train/value_loss            | 7.22        |
----------------------------------------------------

round:  92%|█████████▏| 134/146 [08:31<00:47,  3.98s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 536         |
|    gen/time/fps                    | 786         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 276480      |
|    gen/train/approx_kl             | 0.011223714 |
|    gen/train/clip_fraction         | 0.106       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.49       |
|    gen/train/explained_variance    | 0.519       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 4.47        |
|    gen/train/n_updates             | 1340        |
|    gen/train/policy_gradient_loss  | -0.00859    |
|    gen/train/std                   | 1.09        |
|    gen/train/value_loss            | 13.8        |
----------------------------------------------------

round:  92%|█████████▏| 135/146 [08:35<00:43,  3.93s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_rew_wrapped_mean | 542        |
|    gen/time/fps                    | 783        |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 2          |
|    gen/time/total_timesteps        | 278528     |
|    gen/train/approx_kl             | 0.00974463 |
|    gen/train/clip_fraction         | 0.0849     |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.5       |
|    gen/train/explained_variance    | 0.626      |
|    gen/train/learning_rate         | 0.0003     |
|    gen/train/loss                  | 3.68       |
|    gen/train/n_updates             | 1350       |
|    gen/train/policy_gradient_loss  | -0.00585   |
|    gen/train/std                   | 1.09       |
|    gen/train/value_loss            | 10.8       |
---------------------------------------------------

round:  93%|█████████▎| 136/146 [08:39<00:38,  3.89s/it]

-----------------------------------------------------
| raw/                               |              |
|    gen/rollout/ep_rew_wrapped_mean | 538          |
|    gen/time/fps                    | 785          |
|    gen/time/iterations             | 1            |
|    gen/time/time_elapsed           | 2            |
|    gen/time/total_timesteps        | 280576       |
|    gen/train/approx_kl             | 0.0032951096 |
|    gen/train/clip_fraction         | 0.0563       |
|    gen/train/clip_range            | 0.2          |
|    gen/train/entropy_loss          | -1.5         |
|    gen/train/explained_variance    | 0.444        |
|    gen/train/learning_rate         | 0.0003       |
|    gen/train/loss                  | 18.8         |
|    gen/train/n_updates             | 1360         |
|    gen/train/policy_gradient_loss  | -0.00326     |
|    gen/train/std                   | 1.07         |
|    gen/train/value_loss            | 38.7         |
-----------------------------------------------------

round:  94%|█████████▍| 137/146 [08:43<00:34,  3.86s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 540         |
|    gen/time/fps                    | 790         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 282624      |
|    gen/train/approx_kl             | 0.011426598 |
|    gen/train/clip_fraction         | 0.122       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.49       |
|    gen/train/explained_variance    | 0.416       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 7.3         |
|    gen/train/n_updates             | 1370        |
|    gen/train/policy_gradient_loss  | -0.00645    |
|    gen/train/std                   | 1.07        |
|    gen/train/value_loss            | 18.5        |
----------------------------------------------------

round:  95%|█████████▍| 138/146 [08:46<00:30,  3.84s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 524         |
|    gen/time/fps                    | 800         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 284672      |
|    gen/train/approx_kl             | 0.007776484 |
|    gen/train/clip_fraction         | 0.129       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.48       |
|    gen/train/explained_variance    | 0.627       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 5.11        |
|    gen/train/n_updates             | 1380        |
|    gen/train/policy_gradient_loss  | -0.00938    |
|    gen/train/std                   | 1.06        |
|    gen/train/value_loss            | 15.5        |
----------------------------------------------------

round:  95%|█████████▌| 139/146 [08:50<00:26,  3.81s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 524         |
|    gen/time/fps                    | 786         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 286720      |
|    gen/train/approx_kl             | 0.009682796 |
|    gen/train/clip_fraction         | 0.138       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.48       |
|    gen/train/explained_variance    | 0.668       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 4.98        |
|    gen/train/n_updates             | 1390        |
|    gen/train/policy_gradient_loss  | -0.0104     |
|    gen/train/std                   | 1.06        |
|    gen/train/value_loss            | 10.6        |
----------------------------------------------------

round:  96%|█████████▌| 140/146 [08:54<00:22,  3.80s/it]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_rew_wrapped_mean | 492        |
|    gen/time/fps                    | 784        |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 2          |
|    gen/time/total_timesteps        | 288768     |
|    gen/train/approx_kl             | 0.00181282 |
|    gen/train/clip_fraction         | 0.0636     |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -1.47      |
|    gen/train/explained_variance    | 0.14       |
|    gen/train/learning_rate         | 0.0003     |
|    gen/train/loss                  | 22         |
|    gen/train/n_updates             | 1400       |
|    gen/train/policy_gradient_loss  | -0.00113   |
|    gen/train/std                   | 1.04       |
|    gen/train/value_loss            | 47.6       |
---------------------------------------------------

round:  97%|█████████▋| 141/146 [08:58<00:19,  3.81s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 465         |
|    gen/time/fps                    | 788         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 290816      |
|    gen/train/approx_kl             | 0.004896651 |
|    gen/train/clip_fraction         | 0.0608      |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.46       |
|    gen/train/explained_variance    | 0.297       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 20.4        |
|    gen/train/n_updates             | 1410        |
|    gen/train/policy_gradient_loss  | -0.00427    |
|    gen/train/std                   | 1.04        |
|    gen/train/value_loss            | 44.5        |
----------------------------------------------------

round:  97%|█████████▋| 142/146 [09:02<00:15,  3.82s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 469         |
|    gen/time/fps                    | 773         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 292864      |
|    gen/train/approx_kl             | 0.010666106 |
|    gen/train/clip_fraction         | 0.125       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.45       |
|    gen/train/explained_variance    | 0.518       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 6.57        |
|    gen/train/n_updates             | 1420        |
|    gen/train/policy_gradient_loss  | -0.00644    |
|    gen/train/std                   | 1.02        |
|    gen/train/value_loss            | 10.6        |
----------------------------------------------------

round:  98%|█████████▊| 143/146 [09:06<00:11,  3.84s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 469         |
|    gen/time/fps                    | 768         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 294912      |
|    gen/train/approx_kl             | 0.010382044 |
|    gen/train/clip_fraction         | 0.163       |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.45       |
|    gen/train/explained_variance    | 0.505       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 1.24        |
|    gen/train/n_updates             | 1430        |
|    gen/train/policy_gradient_loss  | -0.0107     |
|    gen/train/std                   | 1.03        |
|    gen/train/value_loss            | 2.63        |
----------------------------------------------------

round:  99%|█████████▊| 144/146 [09:09<00:07,  3.86s/it]

--------------------------------------------------
| raw/                               |           |
|    gen/rollout/ep_rew_wrapped_mean | 497       |
|    gen/time/fps                    | 781       |
|    gen/time/iterations             | 1         |
|    gen/time/time_elapsed           | 2         |
|    gen/time/total_timesteps        | 296960    |
|    gen/train/approx_kl             | 0.0087937 |
|    gen/train/clip_fraction         | 0.107     |
|    gen/train/clip_range            | 0.2       |
|    gen/train/entropy_loss          | -1.45     |
|    gen/train/explained_variance    | 0.327     |
|    gen/train/learning_rate         | 0.0003    |
|    gen/train/loss                  | 4.35      |
|    gen/train/n_updates             | 1440      |
|    gen/train/policy_gradient_loss  | -0.00535  |
|    gen/train/std                   | 1.03      |
|    gen/train/value_loss            | 21.3      |
--------------------------------------------------

round:  99%|█████████▉| 145/146 [09:13<00:03,  3.86s/it]

----------------------------------------------------
| raw/                               |             |
|    gen/rollout/ep_rew_wrapped_mean | 497         |
|    gen/time/fps                    | 780         |
|    gen/time/iterations             | 1           |
|    gen/time/time_elapsed           | 2           |
|    gen/time/total_timesteps        | 299008      |
|    gen/train/approx_kl             | 0.010147174 |
|    gen/train/clip_fraction         | 0.14        |
|    gen/train/clip_range            | 0.2         |
|    gen/train/entropy_loss          | -1.44       |
|    gen/train/explained_variance    | 0.477       |
|    gen/train/learning_rate         | 0.0003      |
|    gen/train/loss                  | 1.26        |
|    gen/train/n_updates             | 1450        |
|    gen/train/policy_gradient_loss  | -0.00443    |
|    gen/train/std                   | 1.02        |
|    gen/train/value_loss            | 2.59        |
----------------------------------------------------

round: 100%|██████████| 146/146 [09:17<00:00,  3.82s/it]
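
The counters in the log above can be sanity-checked against the round index. The sketch below is a minimal illustration, not part of the original notebook: it takes the `total_timesteps` and `n_updates` values from three of the tables above, paired with the round count shown by the progress bar line printed just before each table, and computes how much each counter grows per round. `per_round_increments` is a hypothetical helper name.

```python
# (round count shown by the preceding progress bar, total_timesteps, n_updates)
# Values copied from three tables in the training log.
logged = [(115, 237568, 1150), (116, 239616, 1160), (145, 299008, 1450)]


def per_round_increments(samples):
    """Return the sets of per-round timestep and update increments
    observed between consecutive samples."""
    steps, updates = set(), set()
    for (r0, t0, u0), (r1, t1, u1) in zip(samples, samples[1:]):
        rounds = r1 - r0
        steps.add((t1 - t0) // rounds)
        updates.add((u1 - u0) // rounds)
    return steps, updates


steps_per_round, updates_per_round = per_round_increments(logged)
print(steps_per_round, updates_per_round)
```

A single value in each set means the generator collects the same number of timesteps and performs the same number of gradient updates in every round, at least across the sampled tables.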


In [None]:
import matplotlib.pyplot as plt
import numpy as np

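# Label the two means so the printed numbers cannot be confused
print("Mean reward after training: ", np.mean(learner_rewards_after_training))
print("Mean reward before training:", np.mean(learner_rewards_before_training))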

plt.hist(
    [learner_rewards_before_training, learner_rewards_after_training],
    label=["untrained", "trained"],
)
plt.legend()
plt.show()
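
The histogram shows the shape of the two reward distributions, but a few summary numbers can make the comparison easier to read. The sketch below is a minimal illustration that uses only the standard library: `summarize_rewards` is a hypothetical helper, and the reward lists are made-up placeholders, not results from this notebook.

```python
from statistics import mean, stdev


def summarize_rewards(before, after):
    """Return mean and standard deviation for both runs,
    plus the change in mean reward."""
    return {
        "before_mean": mean(before),
        "before_std": stdev(before),
        "after_mean": mean(after),
        "after_std": stdev(after),
        "improvement": mean(after) - mean(before),
    }


# Placeholder values for illustration only (not measured in this notebook).
example_before = [10.0, 12.0, 14.0]
example_after = [100.0, 110.0, 120.0]

summary = summarize_rewards(example_before, example_after)
print(summary)
```

In the notebook itself, the same helper could presumably be called with `learner_rewards_before_training` and `learner_rewards_after_training`, provided each holds at least two episode returns.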

In [None]:
%%capture 
from IPython import display
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np
import torch
import omegaconf
import gym

import mbrl.env.reward_fns as reward_fns
import mbrl.env.termination_fns as termination_fns
import mbrl.models as models
import mbrl.planning as planning
import mbrl.util.common as common_util
import mbrl.util as util

%load_ext autoreload
%autoreload 2

mpl.rcParams.update({"font.size": 16})

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

seed = 0
# env = cartpole_env.CartPoleEnv()
env = gail_trainer.venv_train
env.seed(seed)
rng = np.random.default_rng(seed=seed)
generator = torch.Generator(device=device)
generator.manual_seed(seed)
obs_shape = env.observation_space.shape
act_shape = env.action_space.shape

# This function lets the model evaluate the true reward for a given observation
# reward_fn = reward_fns.cartpole
# This function tells the model whether an observation should end the episode
term_fn = termination_fns.cartpole

In [None]:
%%capture 
trial_length = 200
num_trials = 10
ensemble_size = 1

# Everything with "???" indicates an option with a missing value.
# Our utility functions will fill in these details using the 
# environment information
cfg_dict = {
    # dynamics model configuration
    "dynamics_model": {
        "model": 
        {
            "_target_": "mbrl.models.GaussianMLP",
            "device": device,
            "num_layers": 3,
            "ensemble_size": ensemble_size,
            "hid_size": 200,
            "in_size": "???",
            "out_size": "???",
            "deterministic": False,
            "propagation_method": "fixed_model",
            # can also configure activation function for GaussianMLP
            "activation_fn_cfg": {
                "_target_": "torch.nn.LeakyReLU",
                "negative_slope": 0.01
            }
        }
    },
    # options for training the dynamics model
    "algorithm": {
        "learned_rewards": False,
        "target_is_delta": True,
        "normalize": True,
    },
    # these are experiment specific options
    "overrides": {
        "trial_length": trial_length,
        "num_steps": num_trials * trial_length,
        "model_batch_size": 32,
        "validation_ratio": 0.05
    }
}
cfg = omegaconf.OmegaConf.create(cfg_dict)


# Create a 1-D dynamics model for this environment
dynamics_model = common_util.create_one_dim_tr_model(cfg, obs_shape, act_shape)

# Create a gym-like environment to encapsulate the model
model_env = models.ModelEnv(env, dynamics_model, term_fn, generator=generator)


replay_buffer = common_util.create_replay_buffer(cfg, obs_shape, act_shape, rng=rng)
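
The `"???"` entries in the configuration above are placeholders that are resolved from the environment's shapes. The sketch below illustrates the idea in plain Python. `fill_model_sizes` is a hypothetical helper, not part of mbrl-lib, and the sizing rule (input = observation dimension + action dimension, output = observation dimension, plus one when rewards are learned) is an assumption about what `create_one_dim_tr_model` does.

```python
# Hypothetical helper for illustration only; not part of mbrl-lib.
def fill_model_sizes(model_cfg, obs_shape, act_shape, learned_rewards=False):
    """Return a copy of model_cfg with "???" sizes derived from the shapes."""
    obs_dim = obs_shape[0]
    # Assumption: a shape-less (scalar) action space counts as one dimension.
    act_dim = act_shape[0] if act_shape else 1
    filled = dict(model_cfg)
    if filled.get("in_size") == "???":
        # The model sees the observation concatenated with the action.
        filled["in_size"] = obs_dim + act_dim
    if filled.get("out_size") == "???":
        # It predicts the next observation (or its delta), plus the reward
        # when learned_rewards is enabled.
        filled["out_size"] = obs_dim + int(learned_rewards)
    return filled


# CartPole-like shapes: 4-D observation, 1-D continuous action.
example_cfg = {"hid_size": 200, "in_size": "???", "out_size": "???"}
filled_cfg = fill_model_sizes(example_cfg, obs_shape=(4,), act_shape=(1,))
print(filled_cfg["in_size"], filled_cfg["out_size"])
```

The original dictionary is left untouched, so the same template can be reused for environments with different shapes.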

In [None]:
common_util.rollout_agent_trajectories(
    env,
    trial_length, # initial exploration steps
    planning.RandomAgent(env),
    {}, # keyword arguments to pass to agent.act()
    replay_buffer=replay_buffer,
    trial_length=trial_length)

In [None]:
%%capture 


agent_cfg = omegaconf.OmegaConf.create({
    # this class evaluates many trajectories and picks the best one
    "_target_": "mbrl.planning.TrajectoryOptimizerAgent",
    "planning_horizon": 15,
    "replan_freq": 1,
    "verbose": False,
    "action_lb": "???",
    "action_ub": "???",
    # this is the optimizer to generate and choose a trajectory
    "optimizer_cfg": {
        "_target_": "mbrl.planning.CEMOptimizer",
        "num_iterations": 5,
        "elite_ratio": 0.1,
        "population_size": 500,
        "alpha": 0.1,
        "device": device,
        "lower_bound": "???",
        "upper_bound": "???",
        "return_mean_elites": True,
    }
})

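# The GAIL discriminator is meant to act as the reward signal for planning
discriminator = gail_trainer._reward_net

# TODO: the discriminator still has to be wired in, either via reward_fn or inside ModelEnv
# reward_fn = reward_fns.discriminator
# Passing the discriminator as the first argument assumes a modified ModelEnv signature
model_env = models.ModelEnv(discriminator, env, dynamics_model, term_fn, generator=generator)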

agent = planning.create_trajectory_optim_agent_for_model(
    model_env,
    agent_cfg,
    num_particles=20
)
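
The `optimizer_cfg` above selects a cross-entropy method (CEM) optimizer. As a rough illustration of what its parameters control, here is a minimal standard-library sketch for a single bounded scalar. `cem_maximize` is a hypothetical name, the initial variance and the convention that `alpha` weights the previous statistics are assumptions, and the real optimizer works on whole action sequences of length `planning_horizon` rather than one number.

```python
import random
import statistics


def cem_maximize(score_fn, lower, upper, population_size=500, elite_ratio=0.1,
                 num_iterations=5, alpha=0.1, seed=0):
    """Cross-entropy method for one bounded scalar; returns the final mean."""
    rng = random.Random(seed)
    num_elites = max(2, int(population_size * elite_ratio))
    mean = (lower + upper) / 2.0
    var = ((upper - lower) ** 2) / 16.0  # assumed initial spread
    for _ in range(num_iterations):
        std = var ** 0.5
        # Sample candidates and clip them to the action bounds.
        population = [min(upper, max(lower, rng.gauss(mean, std)))
                      for _ in range(population_size)]
        # Keep the highest-scoring fraction of the population.
        elites = sorted(population, key=score_fn, reverse=True)[:num_elites]
        # Momentum update: alpha weights the old statistics (assumed convention).
        mean = alpha * mean + (1.0 - alpha) * statistics.fmean(elites)
        var = alpha * var + (1.0 - alpha) * statistics.pvariance(elites)
    return mean


# Toy objective with its maximum at 0.3, standing in for a trajectory return.
best_action = cem_maximize(lambda a: -(a - 0.3) ** 2, lower=-1.0, upper=1.0)
print(f"best action found: {best_action:.3f}")
```

A larger `population_size` or more iterations tightens the estimate at the cost of more model evaluations per planning step.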

In [None]:
model_env.reset(np.random.random((4,4)))["obs"]

In [None]:
train_losses = []
val_scores = []

def train_callback(_model, _total_calls, _epoch, tr_loss, val_score, _best_val):
    train_losses.append(tr_loss)
    val_scores.append(val_score.mean().item())  # val_score holds one score per ensemble member; store their mean

def update_axes(_axs, _frame, _text, _trial, _steps_trial, _all_rewards, force_update=False):
    if not force_update and (_steps_trial % 10 != 0):
        return
    _axs[0].imshow(_frame)
    _axs[0].set_xticks([])
    _axs[0].set_yticks([])
    _axs[1].clear()
    _axs[1].set_xlim([0, num_trials + .1])
    _axs[1].set_ylim([0, 200])
    _axs[1].set_xlabel("Trial")
    _axs[1].set_ylabel("Trial reward")
    _axs[1].plot(_all_rewards, 'bs-')
    _text.set_text(f"Trial {_trial + 1}: {_steps_trial} steps")
    display.display(plt.gcf())  
    display.clear_output(wait=True)
    plt.savefig("mygraph.png")

# Create a trainer for the model
model_trainer = models.ModelTrainer(dynamics_model, optim_lr=1e-3, weight_decay=5e-5)
num_trials = 10
# Create visualization objects
fig, axs = plt.subplots(1, 2, figsize=(14, 3.75), gridspec_kw={"width_ratios": [1, 1]})
ax_text = axs[0].text(300, 50, "")
    
# Main PETS loop
all_rewards = [0]
for trial in range(num_trials):
    obs = env.reset()    
    agent.reset()
    
    done = False
    total_reward = 0.0
    steps_trial = 0
    update_axes(axs, env.render(mode="rgb_array"), ax_text, trial, steps_trial, all_rewards)
    while not done:
        # --------------- Model Training -----------------
        if steps_trial == 0:
            # Update normalizer stats.
            # Error raised by this call: "all the input arrays must have same number of dimensions,
            # but the array at index 0 has 4 dimension(s) and the array at index 1 has 2 dimension(s)"
            dynamics_model.update_normalizer(replay_buffer.get_all())
            
            dataset_train, dataset_val = common_util.get_basic_buffer_iterators(
                replay_buffer,
                batch_size=cfg.overrides.model_batch_size,
                val_ratio=cfg.overrides.validation_ratio,
                ensemble_size=ensemble_size,
                shuffle_each_epoch=True,
                bootstrap_permutes=False,  # build bootstrap dataset using sampling with replacement
            )
                
            model_trainer.train(
                dataset_train, 
                dataset_val=dataset_val, 
                num_epochs=50, 
                patience=10, 
                callback=train_callback,
            )

        # --- Doing env step using the agent and adding to model dataset ---
        next_obs, reward, done, _ = common_util.step_env_and_add_to_buffer(
            env, obs, agent, {}, replay_buffer)
            
        update_axes(axs, env.render(mode="rgb_array"), ax_text, trial, steps_trial, all_rewards)
        
        obs = next_obs
        total_reward += reward
        steps_trial += 1
        
        if steps_trial == trial_length:
            break
    
    all_rewards.append(total_reward)

update_axes(axs, env.render(mode="rgb_array"), ax_text, trial, steps_trial, all_rewards, force_update=True)

In [None]:
%matplotlib inline
fig, ax = plt.subplots(2, 1, figsize=(12, 10))
ax[0].plot(train_losses)
ax[0].set_xlabel("Total training epochs")
ax[0].set_ylabel("Training loss (avg. NLL)")
ax[1].plot(val_scores)
ax[1].set_xlabel("Total training epochs")
ax[1].set_ylabel("Validation score (avg. MSE)")
plt.show()
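
The training call in the main loop passes `num_epochs=50` and `patience=10`, so a round may end before all 50 epochs if validation stops improving. The sketch below shows one simple version of that stopping rule. `epochs_until_stop` is a hypothetical helper, and the "strictly lower is better" criterion is an assumption; mbrl-lib's trainer may use a different improvement threshold.

```python
def epochs_until_stop(val_scores, patience):
    """Return how many epochs run before patience-based early stopping triggers."""
    best = float("inf")
    epochs_since_improvement = 0
    for epoch, score in enumerate(val_scores, start=1):
        if score < best:
            # New best validation score: reset the counter.
            best = score
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
        if epochs_since_improvement >= patience:
            return epoch
    # Patience never ran out, so every epoch was used.
    return len(val_scores)


toy_scores = [1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.9]
stopped_at = epochs_until_stop(toy_scores, patience=3)
print(f"stopped after {stopped_at} of {len(toy_scores)} epochs")
```

This is why the x-axis of the plots counts total training epochs across all rounds rather than a fixed 50 per round.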

# Code for printing sequences

In [18]:
obs_seq, rew_seq, act_seq = common_util.rollout_model_env(
    model_env=model_env,
    initial_obs=env.reset().squeeze(),
    agent=agent,
    num_samples=1,
)

# Code for tests with MBRL inserted into GAIL

In [None]:
%%capture 
%cd imitation/src
from imitation.algorithms.adversarial.gail import GAIL
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm
%cd -

from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv


# TODO: handle vectorized envs properly; for now wrap env in a single-env DummyVecEnv
venv = DummyVecEnv([lambda: env] * 1)

reward_net = BasicRewardNet(
    venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm
)

model_trainer = models.ModelTrainer(dynamics_model, optim_lr=1e-3, weight_decay=5e-5)

# The rollout below is disabled because it raises:
# RuntimeError: BufferingWrapper reset() before samples were accessed, n_stored = 200*times_execution
# common_util.rollout_agent_trajectories(
#     env,
#     trial_length, # initial exploration steps
#     planning.RandomAgent(env),
#     {}, # keyword arguments to pass to agent.act()
#     replay_buffer=replay_buffer,
#     trial_length=trial_length)


gail_trainer = GAIL(
    demonstrations=rollouts,
    demo_batch_size=1024,
    gen_replay_buffer_capacity=2048,
    n_disc_updates_per_round=4,
    venv=venv,
    gen_algo=agent,
    reward_net=reward_net,
    cfg=cfg,
    model_trainer=model_trainer,
    dynamics_model=dynamics_model,
    replay_buffer=replay_buffer,
    gen_train_timesteps=2000,
    allow_variable_horizon=True,  # see https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html
)
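
The `reward_net` passed to the trainer is the discriminator, and the generator is rewarded when the discriminator takes its transitions for expert ones. A minimal sketch of one common way to turn a discriminator logit into such a reward is shown below. The function name is hypothetical, and the sign convention (a higher logit means "more expert-like") is an assumption; the `imitation` library may use the opposite convention internally.

```python
import math


def gail_reward_from_logit(logit):
    """Reward -log(1 - D) with D = sigmoid(logit), which equals softplus(logit)."""
    # Numerically stable softplus: max(x, 0) + log(1 + exp(-|x|)).
    return max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))


for logit in (-4.0, 0.0, 4.0):
    # D is the discriminator's estimated probability that the sample is expert data.
    d_expert = 1.0 / (1.0 + math.exp(-logit))
    reward = gail_reward_from_logit(logit)
    print(f"logit={logit:+.1f}  D={d_expert:.3f}  reward={reward:.3f}")
```

The reward grows monotonically with the logit, so transitions that look more expert-like to the discriminator earn the generator more reward.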




In [None]:
gail_trainer.train(20000)  # Note: set to 300000 for better results

# ValueError: Wrong data shape for acts (the acts array is one-dimensional, the others are two-dimensional)
# Fix for a single-dimensional action space: added "action = action.reshape(-1, 1)"

round:   0%|          | 0/10 [02:38<?, ?it/s]

gen_trajs [TrajectoryWithRew(obs=array([[ 3.82382199e-02, -2.71885116e-02,  8.95104650e-03,
        -1.65020972e-02],
       [ 3.76944467e-02,  1.19631752e-01,  8.62100441e-03,
        -2.34092101e-01],
       [ 4.00870815e-02, -4.49558208e-03,  3.93916247e-03,
        -4.53734733e-02],
       [ 3.99971716e-02,  8.10967535e-02,  3.03169270e-03,
        -1.72602862e-01],
       [ 4.16191071e-02, -2.32035737e-03, -4.20364464e-04,
        -4.65864614e-02],
       [ 4.15727012e-02, -1.52969480e-01, -1.35209365e-03,
         1.79263622e-01],
       [ 3.85133103e-02, -2.38789730e-02,  2.23317859e-03,
        -1.47694843e-02],
       [ 3.80357318e-02,  2.75642928e-02,  1.93778903e-03,
        -9.12776366e-02],
       [ 3.85870151e-02, -1.32560745e-01,  1.12236303e-04,
         1.49479181e-01],
       [ 3.59358005e-02, -2.45503664e-01,  3.10181989e-03,
         3.18926543e-01],
       [ 3.10257282e-02, -2.01829851e-01,  9.48035065e-03,
         2.54328072e-01],
       [ 2.69891303e-02, -3.6757




AttributeError: ignored
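
The shape mismatch noted in the comments of the cell above can be reproduced in isolation. The sketch below uses made-up action values and zero observations purely for illustration. It shows how `reshape(-1, 1)` turns a flat array of scalar actions into a two-dimensional array that lines up with the observations.

```python
import numpy as np

# Five scalar actions collected as a flat array: shape (5,).
flat_actions = np.array([0.28, 0.03, 0.05, -0.15, -0.20], dtype=np.float32)
# Matching observations for a CartPole-like task: shape (5, 4).
observations = np.zeros((5, 4), dtype=np.float32)

# reshape(-1, 1) keeps the number of steps and adds an action dimension of size 1.
column_actions = flat_actions.reshape(-1, 1)
print(flat_actions.shape, column_actions.shape)

# With both arrays two-dimensional, per-step concatenation works.
model_input = np.concatenate([observations, column_actions], axis=1)
print(model_input.shape)
```

Concatenating `observations` with `flat_actions` directly would fail, because NumPy requires all inputs to have the same number of dimensions.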

In [None]:
# Changed "action.reshape(-1,1)" in trajectory_opt.py
agent.act(gail_trainer.venv_train.reset())

plan [[ 0.2782188 ]
 [ 0.02829593]
 [ 0.05018038]
 [-0.14561155]
 [-0.1997795 ]
 [-0.1083254 ]
 [-0.07284492]
 [-0.31208798]
 [ 0.16547456]
 [-0.12820692]
 [-0.12892792]
 [ 0.03909729]
 [-0.06817953]
 [ 0.0910966 ]
 [-0.04198803]]
self.actions_to_use  [array([0.2782188], dtype=float32)]


array([[0.2782188]], dtype=float32)
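
The output above shows a 15-step plan (matching `planning_horizon`) from which only the first action ends up in `actions_to_use`, which is what `replan_freq=1` implies. The sketch below illustrates that receding-horizon pattern with a stand-in planner. `RecedingHorizonPlanner` and `toy_plan` are hypothetical names, not mbrl-lib classes.

```python
class RecedingHorizonPlanner:
    """Replans every `replan_freq` steps and replays cached actions in between."""

    def __init__(self, plan_fn, replan_freq=1):
        self.plan_fn = plan_fn  # maps an observation to a list of actions
        self.replan_freq = replan_freq
        self.actions_to_use = []
        self.num_plans = 0

    def act(self, obs):
        if not self.actions_to_use:
            plan = self.plan_fn(obs)
            self.num_plans += 1
            # Only the first `replan_freq` actions of each plan are executed.
            self.actions_to_use = list(plan[: self.replan_freq])
        return self.actions_to_use.pop(0)


# Stand-in planner: a 15-step plan whose entries record (obs, step index).
def toy_plan(obs):
    return [(obs, step) for step in range(15)]


planner = RecedingHorizonPlanner(toy_plan, replan_freq=1)
executed = [planner.act(obs) for obs in range(4)]
print(executed, planner.num_plans)
```

With `replan_freq=1` every executed action is the first step of a fresh plan, which is the most reactive setting and also the most expensive in planning calls.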

The histograms of rewards before and after training show that the learner is not yet perfect, but it has made some progress.
If it has not, re-run the cell above.