[download this notebook here](https://github.com/HumanCompatibleAI/imitation/blob/master/docs/tutorials/5_train_preference_comparisons.ipynb)
# Learning a Reward Function using Preference Comparisons

The preference comparisons algorithm learns a reward function from preferences between pairs of trajectory segments (fragments), rather than from demonstrations or a hand-written reward.
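To make the idea concrete, here is a minimal sketch of how a single comparison can become a training signal. It assumes the common Bradley-Terry formulation, in which the probability of preferring one fragment is a sigmoid of the difference in summed predicted rewards, trained with a cross-entropy loss. The function names and the toy reward values are illustrative only and are not part of the `imitation` API.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Probability that fragment A is preferred over fragment B.

    Assumes a Bradley-Terry model on the summed per-step rewards.
    """
    diff = sum(rewards_a) - sum(rewards_b)
    return 1.0 / (1.0 + math.exp(-diff))


def cross_entropy(prob_a, label_a):
    """Cross-entropy between the predicted and the observed preference.

    label_a is 1.0 if A was preferred and 0.0 if B was preferred.
    """
    eps = 1e-12  # guards against log(0)
    return -(
        label_a * math.log(prob_a + eps)
        + (1.0 - label_a) * math.log(1.0 - prob_a + eps)
    )


# Toy per-step rewards predicted by a hypothetical reward model.
frag_a = [-0.1, -0.2, -0.1]
frag_b = [-1.0, -0.9, -1.1]

p_a = preference_probability(frag_a, frag_b)
loss_if_a_preferred = cross_entropy(p_a, 1.0)
loss_if_b_preferred = cross_entropy(p_a, 0.0)
print(f"P(A preferred) = {p_a:.3f}")
print(f"loss if A preferred = {loss_if_a_preferred:.3f}")
print(f"loss if B preferred = {loss_if_b_preferred:.3f}")
```

Minimizing this loss over many labelled pairs pushes the reward model to assign a higher total reward to the fragments that were preferred.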

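Before running the algorithm, we need to set up several of its components: the environment, the reward network, the fragmenter, the preference gatherer, the reward trainer, and the agent that generates trajectories: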

In [1]:
from imitation.algorithms import preference_comparisons
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm
from imitation.util.util import make_vec_env
from imitation.policies.base import FeedForward32Policy, NormalizeFeaturesExtractor
from stable_baselines3 import PPO
import numpy as np

rng = np.random.default_rng(0)

venv = make_vec_env("Pendulum-v1", rng=rng)

reward_net = BasicRewardNet(
    venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm
)

fragmenter = preference_comparisons.RandomFragmenter(
    warning_threshold=0,
    rng=rng,
)
gatherer = preference_comparisons.SyntheticGatherer(rng=rng)
preference_model = preference_comparisons.PreferenceModel(reward_net)
reward_trainer = preference_comparisons.BasicRewardTrainer(
    preference_model=preference_model,
    loss=preference_comparisons.CrossEntropyRewardLoss(),
    epochs=3,
    rng=rng,
)


# Several hyperparameters (reward_epochs, ppo_clip_range, ppo_ent_coef,
# ppo_gae_lambda, ppo_n_epochs, discount_factor, use_sde, sde_sample_freq,
# ppo_lr, exploration_frac, num_iterations, initial_comparison_frac,
# initial_epoch_multiplier, query_schedule) used in this example have been
# approximately fine-tuned to reach a reasonable level of performance.
agent = PPO(
    policy=FeedForward32Policy,
    policy_kwargs=dict(
        features_extractor_class=NormalizeFeaturesExtractor,
        features_extractor_kwargs=dict(normalize_class=RunningNorm),
    ),
    env=venv,
    seed=0,
    n_steps=2048 // venv.num_envs,
    batch_size=64,
    ent_coef=0.01,
    learning_rate=2e-3,
    clip_range=0.1,
    gae_lambda=0.95,
    gamma=0.97,
    n_epochs=10,
)

trajectory_generator = preference_comparisons.AgentTrainer(
    algorithm=agent,
    reward_fn=reward_net,
    venv=venv,
    exploration_frac=0.05,
    rng=rng,
)

pref_comparisons = preference_comparisons.PreferenceComparisons(
    trajectory_generator,
    reward_net,
    # Set to 60 for better performance. The logged output below matches a
    # 60-iteration run (83 timesteps per iteration).
    num_iterations=5,
    fragmenter=fragmenter,
    preference_gatherer=gatherer,
    reward_trainer=reward_trainer,
    fragment_length=100,
    transition_oversampling=1,
    initial_comparison_frac=0.1,
    allow_variable_horizon=False,
    initial_epoch_multiplier=4,
    query_schedule="hyperbolic",
)

Then we can start training the reward model. Note that we need to specify the total number of timesteps for which the agent is trained and the total number of fragment comparisons to gather.
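As a rough sanity check, the sketch below shows how these two budgets break down per iteration. It is plain arithmetic, not a call into `imitation`. It assumes `num_iterations = 60`, the value the logged run appears to use, and it assumes the `"hyperbolic"` schedule weights iteration `t` in `[0, 1]` by `1 / (1 + t)`. The library's exact integer rounding is not reproduced here.

```python
total_timesteps = 5_000
total_comparisons = 200
initial_comparison_frac = 0.1
fragment_length = 100
exploration_frac = 0.05
num_iterations = 60  # assumption: matches the logged run, not the value 5 in the cell

# Comparisons gathered before the first reward-model update.
initial_comparisons = int(total_comparisons * initial_comparison_frac)
remaining_comparisons = total_comparisons - initial_comparisons

# Each comparison needs two fragments of `fragment_length` transitions.
initial_fragments = 2 * initial_comparisons
initial_transitions = initial_fragments * fragment_length
exploratory_transitions = int(initial_transitions * exploration_frac)

# Agent training budget per iteration.
timesteps_per_iteration, extra_timesteps = divmod(total_timesteps, num_iterations)

# Assumed hyperbolic weighting: earlier iterations get more comparisons.
ts = [i / (num_iterations - 1) for i in range(num_iterations)]
weights = [1.0 / (1.0 + t) for t in ts]
total_weight = sum(weights)
shares = [remaining_comparisons * w / total_weight for w in weights]

print("initial comparisons:", initial_comparisons)
print("initial fragments / transitions:", initial_fragments, initial_transitions)
print("exploratory transitions:", exploratory_transitions)
print("timesteps per iteration:", timesteps_per_iteration)
print("unrounded share, first vs. last iteration:", shares[0], shares[-1])
```

These figures line up with the log: 20 initial comparisons, 40 fragments (4000 transitions), 200 exploratory transitions, and 83 agent timesteps per iteration. The per-iteration shares fall from a little over 4 to a little over 2.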

In [2]:
pref_comparisons.train(
    total_timesteps=5_000,
    total_comparisons=200,
)

Query schedule: [20, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
Collecting 40 fragments (4000 transitions)
Requested 3800 transitions but only 0 in buffer. Sampling 3800 additional transitions.
Sampling 200 exploratory transitions.
Creating fragment pairs
Gathering preferences
Dataset now contains 20 comparisons


Training reward model:   0%|          | 0/12 [00:00<?, ?it/s]

Training agent for 83 timesteps
----------------------------------------------------
| raw/                                 |           |
|    agent/rollout/ep_len_mean         | 200       |
|    agent/rollout/ep_rew_mean         | -1.44e+03 |
|    agent/rollout/ep_rew_wrapped_mean | 33.1      |
|    agent/time/fps                    | 17650     |
|    agent/time/iterations             | 1         |
|    agent/time/time_elapsed           | 0         |
|    agent/time/total_timesteps        | 2048      |
----------------------------------------------------
-------------------------------------------------------
| mean/                                   |           |
|    agent/rollout/ep_len_mean            | 200       |
|    agent/rollout/ep_rew_mean            | -1.44e+03 |
|    agent/rollout/ep_rew_wrapped_mean    | 33.1      |
|    agent/time/fps                       | 1.76e+04  |
|    agent/time/iterations                | 1         |
|    agent/time/time_elapsed              | 0 

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.26e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 30.1         |
|    agent/time/fps                    | 13715        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 4096         |
|    agent/train/approx_kl             | 0.0013779316 |
|    agent/train/clip_fraction         | 0.055        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.43        |
|    agent/train/explained_variance    | -0.82        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.0412       |
|    agent/train/n_updates             | 10           |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.3e+03     |
|    agent/rollout/ep_rew_wrapped_mean | 29.3         |
|    agent/time/fps                    | 17657        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 6144         |
|    agent/train/approx_kl             | 0.0022071619 |
|    agent/train/clip_fraction         | 0.0955       |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.43        |
|    agent/train/explained_variance    | 0.429        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.0744       |
|    agent/train/n_updates             | 20           |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.3e+03     |
|    agent/rollout/ep_rew_wrapped_mean | 27.4         |
|    agent/time/fps                    | 16993        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 8192         |
|    agent/train/approx_kl             | 0.0019730625 |
|    agent/train/clip_fraction         | 0.0891       |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.42        |
|    agent/train/explained_variance    | 0.799        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.0823       |
|    agent/train/n_updates             | 30           |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.27e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 26.3         |
|    agent/time/fps                    | 4921         |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 10240        |
|    agent/train/approx_kl             | 0.0030079854 |
|    agent/train/clip_fraction         | 0.146        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.42        |
|    agent/train/explained_variance    | 0.83         |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.159        |
|    agent/train/n_updates             | 40           |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.24e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 25.8         |
|    agent/time/fps                    | 17802        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 12288        |
|    agent/train/approx_kl             | 0.0023911644 |
|    agent/train/clip_fraction         | 0.0959       |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.43        |
|    agent/train/explained_variance    | 0.643        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.141        |
|    agent/train/n_updates             | 50           |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.22e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 24.4         |
|    agent/time/fps                    | 16657        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 14336        |
|    agent/train/approx_kl             | 0.0023503092 |
|    agent/train/clip_fraction         | 0.135        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.44        |
|    agent/train/explained_variance    | 0.798        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.209        |
|    agent/train/n_updates             | 60           |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.22e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 25.2         |
|    agent/time/fps                    | 17806        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 16384        |
|    agent/train/approx_kl             | 0.0023587407 |
|    agent/train/clip_fraction         | 0.151        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.44        |
|    agent/train/explained_variance    | 0.71         |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.369        |
|    agent/train/n_updates             | 70           |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -1.21e+03   |
|    agent/rollout/ep_rew_wrapped_mean | 24.9        |
|    agent/time/fps                    | 14366       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 18432       |
|    agent/train/approx_kl             | 0.002118437 |
|    agent/train/clip_fraction         | 0.0851      |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.44       |
|    agent/train/explained_variance    | 0.736       |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 0.199       |
|    agent/train/n_updates             | 80          |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-----------------------------------------------------
| raw/                                 |            |
|    agent/rollout/ep_len_mean         | 200        |
|    agent/rollout/ep_rew_mean         | -1.2e+03   |
|    agent/rollout/ep_rew_wrapped_mean | 26.9       |
|    agent/time/fps                    | 17273      |
|    agent/time/iterations             | 1          |
|    agent/time/time_elapsed           | 0          |
|    agent/time/total_timesteps        | 20480      |
|    agent/train/approx_kl             | 0.00333608 |
|    agent/train/clip_fraction         | 0.131      |
|    agent/train/clip_range            | 0.1        |
|    agent/train/entropy_loss          | -1.45      |
|    agent/train/explained_variance    | 0.725      |
|    agent/train/learning_rate         | 0.002      |
|    agent/train/loss                  | 0.228      |
|    agent/train/n_updates             | 90         |
|    agent/train/policy_gradient_loss  | -0.00325 

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -1.19e+03   |
|    agent/rollout/ep_rew_wrapped_mean | 27.6        |
|    agent/time/fps                    | 17659       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 22528       |
|    agent/train/approx_kl             | 0.003017107 |
|    agent/train/clip_fraction         | 0.154       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.48       |
|    agent/train/explained_variance    | 0.744       |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 0.323       |
|    agent/train/n_updates             | 100         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.19e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 28.5         |
|    agent/time/fps                    | 18099        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 24576        |
|    agent/train/approx_kl             | 0.0041196984 |
|    agent/train/clip_fraction         | 0.179        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.48        |
|    agent/train/explained_variance    | 0.591        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.214        |
|    agent/train/n_updates             | 110          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.17e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 28.6         |
|    agent/time/fps                    | 14945        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 26624        |
|    agent/train/approx_kl             | 0.0029389746 |
|    agent/train/clip_fraction         | 0.164        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.48        |
|    agent/train/explained_variance    | 0.731        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.331        |
|    agent/train/n_updates             | 120          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -1.17e+03   |
|    agent/rollout/ep_rew_wrapped_mean | 29.2        |
|    agent/time/fps                    | 17513       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 28672       |
|    agent/train/approx_kl             | 0.002608165 |
|    agent/train/clip_fraction         | 0.128       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.49       |
|    agent/train/explained_variance    | 0.654       |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 0.0208      |
|    agent/train/n_updates             | 130         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
----------------------------------------------------
| raw/                                 |           |
|    agent/rollout/ep_len_mean         | 200       |
|    agent/rollout/ep_rew_mean         | -1.16e+03 |
|    agent/rollout/ep_rew_wrapped_mean | 29        |
|    agent/time/fps                    | 17505     |
|    agent/time/iterations             | 1         |
|    agent/time/time_elapsed           | 0         |
|    agent/time/total_timesteps        | 30720     |
|    agent/train/approx_kl             | 0.0047767 |
|    agent/train/clip_fraction         | 0.158     |
|    agent/train/clip_range            | 0.1       |
|    agent/train/entropy_loss          | -1.51     |
|    agent/train/explained_variance    | 0.635     |
|    agent/train/learning_rate         | 0.002     |
|    agent/train/loss                  | 0.136     |
|    agent/train/n_updates             | 140       |
|    agent/train/policy_gradient_loss  | -0.00448  |
|    agent/tra

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -1.16e+03   |
|    agent/rollout/ep_rew_wrapped_mean | 28.1        |
|    agent/time/fps                    | 16185       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 32768       |
|    agent/train/approx_kl             | 0.003623765 |
|    agent/train/clip_fraction         | 0.123       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.53       |
|    agent/train/explained_variance    | 0.65        |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 0.137       |
|    agent/train/n_updates             | 150         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.16e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 26.8         |
|    agent/time/fps                    | 16141        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 34816        |
|    agent/train/approx_kl             | 0.0025193437 |
|    agent/train/clip_fraction         | 0.134        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.55        |
|    agent/train/explained_variance    | 0.743        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.0505       |
|    agent/train/n_updates             | 160          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.16e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 25.4         |
|    agent/time/fps                    | 18085        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 36864        |
|    agent/train/approx_kl             | 0.0030511282 |
|    agent/train/clip_fraction         | 0.124        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.56        |
|    agent/train/explained_variance    | 0.804        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.141        |
|    agent/train/n_updates             | 170          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -1.16e+03   |
|    agent/rollout/ep_rew_wrapped_mean | 23.4        |
|    agent/time/fps                    | 18032       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 38912       |
|    agent/train/approx_kl             | 0.002671984 |
|    agent/train/clip_fraction         | 0.123       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.57       |
|    agent/train/explained_variance    | 0.832       |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 0.0691      |
|    agent/train/n_updates             | 180         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.16e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 20.9         |
|    agent/time/fps                    | 18827        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 40960        |
|    agent/train/approx_kl             | 0.0030450737 |
|    agent/train/clip_fraction         | 0.124        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.58        |
|    agent/train/explained_variance    | 0.885        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.031        |
|    agent/train/n_updates             | 190          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -1.16e+03   |
|    agent/rollout/ep_rew_wrapped_mean | 19.2        |
|    agent/time/fps                    | 15200       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 43008       |
|    agent/train/approx_kl             | 0.002927749 |
|    agent/train/clip_fraction         | 0.153       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.58       |
|    agent/train/explained_variance    | 0.925       |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 0.0532      |
|    agent/train/n_updates             | 200         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.15e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 17.3         |
|    agent/time/fps                    | 16320        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 45056        |
|    agent/train/approx_kl             | 0.0036576793 |
|    agent/train/clip_fraction         | 0.165        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.58        |
|    agent/train/explained_variance    | 0.886        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.00774      |
|    agent/train/n_updates             | 210          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.15e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 15.8         |
|    agent/time/fps                    | 18513        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 47104        |
|    agent/train/approx_kl             | 0.0049780402 |
|    agent/train/clip_fraction         | 0.202        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.59        |
|    agent/train/explained_variance    | 0.898        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.0143       |
|    agent/train/n_updates             | 220          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.14e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 14.3         |
|    agent/time/fps                    | 18106        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 49152        |
|    agent/train/approx_kl             | 0.0042216815 |
|    agent/train/clip_fraction         | 0.188        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.58        |
|    agent/train/explained_variance    | 0.873        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.129        |
|    agent/train/n_updates             | 230          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-----------------------------------------------------
| raw/                                 |            |
|    agent/rollout/ep_len_mean         | 200        |
|    agent/rollout/ep_rew_mean         | -1.11e+03  |
|    agent/rollout/ep_rew_wrapped_mean | 13.6       |
|    agent/time/fps                    | 17913      |
|    agent/time/iterations             | 1          |
|    agent/time/time_elapsed           | 0          |
|    agent/time/total_timesteps        | 51200      |
|    agent/train/approx_kl             | 0.00329648 |
|    agent/train/clip_fraction         | 0.174      |
|    agent/train/clip_range            | 0.1        |
|    agent/train/entropy_loss          | -1.57      |
|    agent/train/explained_variance    | 0.847      |
|    agent/train/learning_rate         | 0.002      |
|    agent/train/loss                  | 0.328      |
|    agent/train/n_updates             | 240        |
|    agent/train/policy_gradient_loss  | -0.0075  

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -1.08e+03   |
|    agent/rollout/ep_rew_wrapped_mean | 13.5        |
|    agent/time/fps                    | 17450       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 53248       |
|    agent/train/approx_kl             | 0.004688721 |
|    agent/train/clip_fraction         | 0.177       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.56       |
|    agent/train/explained_variance    | 0.834       |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 0.102       |
|    agent/train/n_updates             | 250         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.05e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 13.7         |
|    agent/time/fps                    | 18264        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 55296        |
|    agent/train/approx_kl             | 0.0045436854 |
|    agent/train/clip_fraction         | 0.202        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.55        |
|    agent/train/explained_variance    | 0.828        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.222        |
|    agent/train/n_updates             | 260          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -1.01e+03    |
|    agent/rollout/ep_rew_wrapped_mean | 14.2         |
|    agent/time/fps                    | 17681        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 57344        |
|    agent/train/approx_kl             | 0.0038139299 |
|    agent/train/clip_fraction         | 0.213        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.56        |
|    agent/train/explained_variance    | 0.813        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.196        |
|    agent/train/n_updates             | 270          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -972         |
|    agent/rollout/ep_rew_wrapped_mean | 15.2         |
|    agent/time/fps                    | 17112        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 59392        |
|    agent/train/approx_kl             | 0.0054650456 |
|    agent/train/clip_fraction         | 0.178        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.54        |
|    agent/train/explained_variance    | 0.755        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.234        |
|    agent/train/n_updates             | 280          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -935         |
|    agent/rollout/ep_rew_wrapped_mean | 15.6         |
|    agent/time/fps                    | 17244        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 61440        |
|    agent/train/approx_kl             | 0.0032520462 |
|    agent/train/clip_fraction         | 0.172        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.54        |
|    agent/train/explained_variance    | 0.799        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.231        |
|    agent/train/n_updates             | 290          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -888         |
|    agent/rollout/ep_rew_wrapped_mean | 17           |
|    agent/time/fps                    | 18225        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 63488        |
|    agent/train/approx_kl             | 0.0046639964 |
|    agent/train/clip_fraction         | 0.2          |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.52        |
|    agent/train/explained_variance    | 0.832        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.238        |
|    agent/train/n_updates             | 300          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -846        |
|    agent/rollout/ep_rew_wrapped_mean | 18.3        |
|    agent/time/fps                    | 17746       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 65536       |
|    agent/train/approx_kl             | 0.005667625 |
|    agent/train/clip_fraction         | 0.177       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.5        |
|    agent/train/explained_variance    | 0.787       |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 0.36        |
|    agent/train/n_updates             | 310         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -796         |
|    agent/rollout/ep_rew_wrapped_mean | 19           |
|    agent/time/fps                    | 8204         |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 67584        |
|    agent/train/approx_kl             | 0.0040816227 |
|    agent/train/clip_fraction         | 0.177        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.49        |
|    agent/train/explained_variance    | 0.791        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.544        |
|    agent/train/n_updates             | 320          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -747         |
|    agent/rollout/ep_rew_wrapped_mean | 19.9         |
|    agent/time/fps                    | 17073        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 69632        |
|    agent/train/approx_kl             | 0.0045624655 |
|    agent/train/clip_fraction         | 0.192        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.47        |
|    agent/train/explained_variance    | 0.833        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.44         |
|    agent/train/n_updates             | 330          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -696         |
|    agent/rollout/ep_rew_wrapped_mean | 21.4         |
|    agent/time/fps                    | 17767        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 71680        |
|    agent/train/approx_kl             | 0.0035519698 |
|    agent/train/clip_fraction         | 0.184        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.46        |
|    agent/train/explained_variance    | 0.82         |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.851        |
|    agent/train/n_updates             | 340          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -645         |
|    agent/rollout/ep_rew_wrapped_mean | 21.9         |
|    agent/time/fps                    | 17430        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 73728        |
|    agent/train/approx_kl             | 0.0040941145 |
|    agent/train/clip_fraction         | 0.188        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.45        |
|    agent/train/explained_variance    | 0.862        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.289        |
|    agent/train/n_updates             | 350          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-----------------------------------------------------
| raw/                                 |            |
|    agent/rollout/ep_len_mean         | 200        |
|    agent/rollout/ep_rew_mean         | -609       |
|    agent/rollout/ep_rew_wrapped_mean | 22         |
|    agent/time/fps                    | 17707      |
|    agent/time/iterations             | 1          |
|    agent/time/time_elapsed           | 0          |
|    agent/time/total_timesteps        | 75776      |
|    agent/train/approx_kl             | 0.00592582 |
|    agent/train/clip_fraction         | 0.181      |
|    agent/train/clip_range            | 0.1        |
|    agent/train/entropy_loss          | -1.41      |
|    agent/train/explained_variance    | 0.805      |
|    agent/train/learning_rate         | 0.002      |
|    agent/train/loss                  | 1.35       |
|    agent/train/n_updates             | 360        |
|    agent/train/policy_gradient_loss  | 7.44e-05 

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -556        |
|    agent/rollout/ep_rew_wrapped_mean | 21          |
|    agent/time/fps                    | 17099       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 77824       |
|    agent/train/approx_kl             | 0.005700239 |
|    agent/train/clip_fraction         | 0.175       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.38       |
|    agent/train/explained_variance    | 0.846       |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 0.7         |
|    agent/train/n_updates             | 370         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -509        |
|    agent/rollout/ep_rew_wrapped_mean | 21.5        |
|    agent/time/fps                    | 16466       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 79872       |
|    agent/train/approx_kl             | 0.002861008 |
|    agent/train/clip_fraction         | 0.175       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.38       |
|    agent/train/explained_variance    | 0.81        |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 1.29        |
|    agent/train/n_updates             | 380         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -465         |
|    agent/rollout/ep_rew_wrapped_mean | 22.2         |
|    agent/time/fps                    | 17402        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 81920        |
|    agent/train/approx_kl             | 0.0046677506 |
|    agent/train/clip_fraction         | 0.202        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.38        |
|    agent/train/explained_variance    | 0.818        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.471        |
|    agent/train/n_updates             | 390          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -424        |
|    agent/rollout/ep_rew_wrapped_mean | 21.9        |
|    agent/time/fps                    | 18358       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 83968       |
|    agent/train/approx_kl             | 0.004301121 |
|    agent/train/clip_fraction         | 0.183       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.39       |
|    agent/train/explained_variance    | 0.74        |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 0.476       |
|    agent/train/n_updates             | 400         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -383         |
|    agent/rollout/ep_rew_wrapped_mean | 22.1         |
|    agent/time/fps                    | 17073        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 86016        |
|    agent/train/approx_kl             | 0.0053577996 |
|    agent/train/clip_fraction         | 0.196        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.39        |
|    agent/train/explained_variance    | 0.805        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 1.45         |
|    agent/train/n_updates             | 410          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -339         |
|    agent/rollout/ep_rew_wrapped_mean | 22.3         |
|    agent/time/fps                    | 18214        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 88064        |
|    agent/train/approx_kl             | 0.0037097735 |
|    agent/train/clip_fraction         | 0.166        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.4         |
|    agent/train/explained_variance    | 0.83         |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.725        |
|    agent/train/n_updates             | 420          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -310         |
|    agent/rollout/ep_rew_wrapped_mean | 23.9         |
|    agent/time/fps                    | 17513        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 90112        |
|    agent/train/approx_kl             | 0.0034513667 |
|    agent/train/clip_fraction         | 0.174        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.41        |
|    agent/train/explained_variance    | 0.825        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 1.55         |
|    agent/train/n_updates             | 430          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -280         |
|    agent/rollout/ep_rew_wrapped_mean | 23.1         |
|    agent/time/fps                    | 17682        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 92160        |
|    agent/train/approx_kl             | 0.0026148246 |
|    agent/train/clip_fraction         | 0.163        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.39        |
|    agent/train/explained_variance    | 0.849        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 1.96         |
|    agent/train/n_updates             | 440          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -267        |
|    agent/rollout/ep_rew_wrapped_mean | 21          |
|    agent/time/fps                    | 17740       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 94208       |
|    agent/train/approx_kl             | 0.003616991 |
|    agent/train/clip_fraction         | 0.161       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.4        |
|    agent/train/explained_variance    | 0.779       |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 1.06        |
|    agent/train/n_updates             | 450         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -252        |
|    agent/rollout/ep_rew_wrapped_mean | 18.5        |
|    agent/time/fps                    | 18034       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 96256       |
|    agent/train/approx_kl             | 0.003840716 |
|    agent/train/clip_fraction         | 0.176       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.43       |
|    agent/train/explained_variance    | 0.861       |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 1.23        |
|    agent/train/n_updates             | 460         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -239         |
|    agent/rollout/ep_rew_wrapped_mean | 15.1         |
|    agent/time/fps                    | 17938        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 98304        |
|    agent/train/approx_kl             | 0.0040787766 |
|    agent/train/clip_fraction         | 0.166        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.44        |
|    agent/train/explained_variance    | 0.885        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 1.26         |
|    agent/train/n_updates             | 470          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -226        |
|    agent/rollout/ep_rew_wrapped_mean | 12.5        |
|    agent/time/fps                    | 16580       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 100352      |
|    agent/train/approx_kl             | 0.005924271 |
|    agent/train/clip_fraction         | 0.168       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.43       |
|    agent/train/explained_variance    | 0.878       |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 1.26        |
|    agent/train/n_updates             | 480         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -225         |
|    agent/rollout/ep_rew_wrapped_mean | 10.8         |
|    agent/time/fps                    | 18266        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 102400       |
|    agent/train/approx_kl             | 0.0032167926 |
|    agent/train/clip_fraction         | 0.173        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.42        |
|    agent/train/explained_variance    | 0.876        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.959        |
|    agent/train/n_updates             | 490          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -228         |
|    agent/rollout/ep_rew_wrapped_mean | 8.55         |
|    agent/time/fps                    | 18022        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 104448       |
|    agent/train/approx_kl             | 0.0056822794 |
|    agent/train/clip_fraction         | 0.177        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.42        |
|    agent/train/explained_variance    | 0.861        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.889        |
|    agent/train/n_updates             | 500          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -235         |
|    agent/rollout/ep_rew_wrapped_mean | 7.5          |
|    agent/time/fps                    | 18472        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 106496       |
|    agent/train/approx_kl             | 0.0037184297 |
|    agent/train/clip_fraction         | 0.166        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.41        |
|    agent/train/explained_variance    | 0.888        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.587        |
|    agent/train/n_updates             | 510          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -231         |
|    agent/rollout/ep_rew_wrapped_mean | 7.41         |
|    agent/time/fps                    | 18035        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 108544       |
|    agent/train/approx_kl             | 0.0076132854 |
|    agent/train/clip_fraction         | 0.18         |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.41        |
|    agent/train/explained_variance    | 0.859        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 1.58         |
|    agent/train/n_updates             | 520          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -232         |
|    agent/rollout/ep_rew_wrapped_mean | 10.4         |
|    agent/time/fps                    | 18398        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 110592       |
|    agent/train/approx_kl             | 0.0041163685 |
|    agent/train/clip_fraction         | 0.166        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.4         |
|    agent/train/explained_variance    | 0.82         |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 1.11         |
|    agent/train/n_updates             | 530          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -237        |
|    agent/rollout/ep_rew_wrapped_mean | 10.5        |
|    agent/time/fps                    | 17663       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 112640      |
|    agent/train/approx_kl             | 0.004275536 |
|    agent/train/clip_fraction         | 0.162       |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.41       |
|    agent/train/explained_variance    | 0.863       |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 2.43        |
|    agent/train/n_updates             | 540         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -236         |
|    agent/rollout/ep_rew_wrapped_mean | 10.5         |
|    agent/time/fps                    | 18392        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 114688       |
|    agent/train/approx_kl             | 0.0032792748 |
|    agent/train/clip_fraction         | 0.191        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.42        |
|    agent/train/explained_variance    | 0.853        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 2.17         |
|    agent/train/n_updates             | 550          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
------------------------------------------------------
| raw/                                 |             |
|    agent/rollout/ep_len_mean         | 200         |
|    agent/rollout/ep_rew_mean         | -236        |
|    agent/rollout/ep_rew_wrapped_mean | 12.3        |
|    agent/time/fps                    | 18447       |
|    agent/time/iterations             | 1           |
|    agent/time/time_elapsed           | 0           |
|    agent/time/total_timesteps        | 116736      |
|    agent/train/approx_kl             | 0.005018767 |
|    agent/train/clip_fraction         | 0.16        |
|    agent/train/clip_range            | 0.1         |
|    agent/train/entropy_loss          | -1.4        |
|    agent/train/explained_variance    | 0.846       |
|    agent/train/learning_rate         | 0.002       |
|    agent/train/loss                  | 1.49        |
|    agent/train/n_updates             | 560         |
|    agent/train/policy_gradient_

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-----------------------------------------------------
| raw/                                 |            |
|    agent/rollout/ep_len_mean         | 200        |
|    agent/rollout/ep_rew_mean         | -232       |
|    agent/rollout/ep_rew_wrapped_mean | 12         |
|    agent/time/fps                    | 17560      |
|    agent/time/iterations             | 1          |
|    agent/time/time_elapsed           | 0          |
|    agent/time/total_timesteps        | 118784     |
|    agent/train/approx_kl             | 0.00420962 |
|    agent/train/clip_fraction         | 0.179      |
|    agent/train/clip_range            | 0.1        |
|    agent/train/entropy_loss          | -1.41      |
|    agent/train/explained_variance    | 0.895      |
|    agent/train/learning_rate         | 0.002      |
|    agent/train/loss                  | 0.687      |
|    agent/train/n_updates             | 570        |
|    agent/train/policy_gradient_loss  | 0.0034   

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -223         |
|    agent/rollout/ep_rew_wrapped_mean | 12.4         |
|    agent/time/fps                    | 18031        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 120832       |
|    agent/train/approx_kl             | 0.0045357402 |
|    agent/train/clip_fraction         | 0.2          |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.4         |
|    agent/train/explained_variance    | 0.845        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 0.885        |
|    agent/train/n_updates             | 580          |
|    agent/train

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 103 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -217         |
|    agent/rollout/ep_rew_wrapped_mean | 13.1         |
|    agent/time/fps                    | 18079        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 122880       |
|    agent/train/approx_kl             | 0.0037307334 |
|    agent/train/clip_fraction         | 0.154        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.39        |
|    agent/train/explained_variance    | 0.808        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 2.5          |
|    agent/train/n_updates             | 590          |
|    agent/trai

Training reward model:   0%|          | 0/3 [00:00<?, ?it/s]

Training agent for 83 timesteps
-------------------------------------------------------
| raw/                                 |              |
|    agent/rollout/ep_len_mean         | 200          |
|    agent/rollout/ep_rew_mean         | -204         |
|    agent/rollout/ep_rew_wrapped_mean | 14.5         |
|    agent/time/fps                    | 17098        |
|    agent/time/iterations             | 1            |
|    agent/time/time_elapsed           | 0            |
|    agent/time/total_timesteps        | 124928       |
|    agent/train/approx_kl             | 0.0072140205 |
|    agent/train/clip_fraction         | 0.203        |
|    agent/train/clip_range            | 0.1          |
|    agent/train/entropy_loss          | -1.36        |
|    agent/train/explained_variance    | 0.831        |
|    agent/train/learning_rate         | 0.002        |
|    agent/train/loss                  | 3.17         |
|    agent/train/n_updates             | 600          |
|    agent/train

{'reward_loss': 0.051602873152920184, 'reward_accuracy': 0.9776785714285714}
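
The dictionary above reports the final loss and accuracy of the reward model on the gathered comparisons. As a rough illustration of what these two numbers measure, the following sketch assumes a Bradley-Terry style preference model, in which the probability of preferring a fragment depends on the difference of the summed predicted rewards. The function names and the example rewards are hypothetical, and the library's own `PreferenceModel` and `CrossEntropyRewardLoss` may differ in details such as discounting or noise handling.

```python
import math


def preference_prob(rews1, rews2):
    """Probability that fragment 1 is preferred, from summed per-step rewards."""
    diff = sum(rews2) - sum(rews1)
    return 1.0 / (1.0 + math.exp(diff))


def loss_and_accuracy(pairs, labels):
    """Mean cross-entropy loss and accuracy over labelled fragment pairs.

    pairs: list of (rews1, rews2) per-step reward lists.
    labels: 1.0 if fragment 1 was preferred, 0.0 if fragment 2 was.
    """
    eps = 1e-12  # guards against log(0)
    total_loss, correct = 0.0, 0
    for (rews1, rews2), label in zip(pairs, labels):
        p = preference_prob(rews1, rews2)
        total_loss -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
        correct += int((p > 0.5) == (label > 0.5))
    return total_loss / len(labels), correct / len(labels)


# Hypothetical predicted rewards for three fragment pairs.
pairs = [
    ([1.0, 1.0], [0.0, 0.0]),
    ([0.0, 0.5], [1.0, 1.5]),
    ([0.2, 0.2], [0.1, 0.1]),
]
labels = [1.0, 0.0, 0.0]  # the third label disagrees with the predicted rewards
loss, accuracy = loss_and_accuracy(pairs, labels)
print({"reward_loss": loss, "reward_accuracy": accuracy})
```

In this toy data the model ranks two of the three pairs correctly, so the accuracy is 2/3. An accuracy near 1, as in the output above, means the learned reward agrees with almost all of the synthetic preferences.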

After training the reward network with the preference comparisons algorithm, we can wrap our environment so that it returns the learned reward instead of the original one.

In [3]:
from imitation.rewards.reward_wrapper import RewardVecEnvWrapper

learned_reward_venv = RewardVecEnvWrapper(venv, reward_net.predict_processed)
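
Conceptually, a reward wrapper of this kind steps the underlying environment as usual but substitutes the output of a reward function for the environment's own reward. The sketch below shows the idea on a single, non-vectorized toy environment. `ToyEnv`, `LearnedRewardWrapper`, and the `original_env_rew` info key are names chosen for illustration only. This is not the implementation of `RewardVecEnvWrapper`.

```python
class ToyEnv:
    """Hypothetical stand-in environment with a constant true reward of -1."""

    def __init__(self):
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        self.state += action
        done = self.state >= 3
        return self.state, -1.0, done, {}


class LearnedRewardWrapper:
    """Replaces the environment reward with reward_fn(obs, act, next_obs, done)."""

    def __init__(self, env, reward_fn):
        self.env = env
        self.reward_fn = reward_fn
        self._last_obs = None

    def reset(self):
        self._last_obs = self.env.reset()
        return self._last_obs

    def step(self, action):
        obs = self._last_obs
        next_obs, true_reward, done, info = self.env.step(action)
        learned_reward = self.reward_fn(obs, action, next_obs, done)
        # Keep the true reward available for logging and evaluation.
        info = dict(info, original_env_rew=true_reward)
        self._last_obs = next_obs
        return next_obs, learned_reward, done, info


# A hypothetical "learned" reward that simply returns the next observation.
env = LearnedRewardWrapper(ToyEnv(), lambda obs, act, next_obs, done: float(next_obs))
env.reset()
obs, reward, done, info = env.step(1)
print(reward, info["original_env_rew"])
```

The agent trained on the wrapped environment only ever sees `reward`, while the true reward stays available on the side. The same split appears in the training logs above, where `ep_rew_wrapped_mean` and `ep_rew_mean` are reported separately.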

Next, we train an agent that sees only the shaped, learned reward.

In [4]:
learner = PPO(
    seed=0,
    policy=FeedForward32Policy,
    policy_kwargs=dict(
        features_extractor_class=NormalizeFeaturesExtractor,
        features_extractor_kwargs=dict(normalize_class=RunningNorm),
    ),
    env=learned_reward_venv,
    batch_size=64,
    ent_coef=0.01,
    n_epochs=10,
    n_steps=2048 // learned_reward_venv.num_envs,
    clip_range=0.1,
    gae_lambda=0.95,
    gamma=0.97,
    learning_rate=2e-3,
)
learner.learn(1_000)  # Note: set to 100_000 to train a proficient expert

<stable_baselines3.ppo.ppo.PPO at 0x29db989d0>

Then we can evaluate the learner on the original environment, which reports the true reward rather than the learned one.

In [5]:
from stable_baselines3.common.evaluation import evaluate_policy

n_eval_episodes = 10
reward_mean, reward_std = evaluate_policy(learner.policy, venv, n_eval_episodes)
reward_stderr = reward_std / np.sqrt(n_eval_episodes)
print(f"Reward: {reward_mean:.0f} +/- {reward_stderr:.0f}")

Reward: -139 +/- 28
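
The reported uncertainty is the standard error of the mean: the standard deviation of the per-episode returns divided by the square root of the number of episodes. A minimal standalone sketch of that arithmetic follows. The episode returns and the helper name `summarize_returns` are made up for illustration, and the population standard deviation is an assumption about how the spread is computed.

```python
import math
import statistics


def summarize_returns(returns):
    """Return (mean, std, standard error of the mean) for episode returns."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)  # population standard deviation (assumed)
    stderr = std / math.sqrt(len(returns))
    return mean, std, stderr


# Hypothetical per-episode returns, not taken from the run above.
episode_returns = [-100.0, -200.0, -100.0, -200.0]
mean, std, stderr = summarize_returns(episode_returns)
print(f"Reward: {mean:.0f} +/- {stderr:.0f}")  # prints "Reward: -150 +/- 25"
```

Because the standard error shrinks only with the square root of the episode count, halving the uncertainty takes roughly four times as many evaluation episodes.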
