[download this notebook here](https://github.com/HumanCompatibleAI/imitation/blob/master/docs/tutorials/5_train_preference_comparisons.ipynb)
# Learning a Reward Function using Preference Comparisons with Synchronous Human Feedback

You can also gather human feedback synchronously, through a command-line or notebook interaction. The setup differs only slightly from the one used with a synthetic preference gatherer.

Here's the starting setup. The major differences from the synthetic setup are marked with comments.

In [None]:
import pathlib
import random
import tempfile
from imitation.algorithms import preference_comparisons
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util import video_wrapper
from imitation.util.networks import RunningNorm
from imitation.util.util import make_vec_env
from imitation.policies.base import FeedForward32Policy, NormalizeFeaturesExtractor
import gym
from stable_baselines3 import PPO
import numpy as np

# Create a temporary directory for video recordings of trajectories. Unfortunately,
# Jupyter won't play videos from outside the current directory, so we have to put them
# here. We'll delete them at the end of the notebook.
video_dir = tempfile.mkdtemp(dir=".", prefix="videos_")

rng = np.random.default_rng(0)

# Add a video wrapper to the environment. This will record videos of the agent's
# trajectories so we can review them later.
venv = make_vec_env(
    "Pendulum-v1",
    rng=rng,
    post_wrappers=[
        video_wrapper.video_wrapper_factory(pathlib.Path(video_dir), single_video=False)
    ],
)

reward_net = BasicRewardNet(
    venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm
)

fragmenter = preference_comparisons.RandomFragmenter(
    warning_threshold=0,
    rng=rng,
)

querent = preference_comparisons.PreferenceQuerent()

# This gatherer will show the user (you!) pairs of trajectories and ask you to choose
# which one is better. Your choices are then used to train the reward network.
gatherer = preference_comparisons.SynchronousHumanGatherer(video_dir=video_dir)

preference_model = preference_comparisons.PreferenceModel(reward_net)
reward_trainer = preference_comparisons.BasicRewardTrainer(
    preference_model=preference_model,
    loss=preference_comparisons.CrossEntropyRewardLoss(),
    epochs=3,
    rng=rng,
)

agent = PPO(
    policy=FeedForward32Policy,
    policy_kwargs=dict(
        features_extractor_class=NormalizeFeaturesExtractor,
        features_extractor_kwargs=dict(normalize_class=RunningNorm),
    ),
    env=venv,
    seed=0,
    n_steps=2048 // venv.num_envs,
    batch_size=64,
    ent_coef=0.0,
    learning_rate=0.0003,
    n_epochs=10,
)

trajectory_generator = preference_comparisons.AgentTrainer(
    algorithm=agent,
    reward_fn=reward_net,
    venv=venv,
    exploration_frac=0.0,
    rng=rng,
)

pref_comparisons = preference_comparisons.PreferenceComparisons(
    trajectory_generator,
    reward_net,
    num_iterations=5,
    fragmenter=fragmenter,
    preference_querent=querent,
    preference_gatherer=gatherer,
    reward_trainer=reward_trainer,
    fragment_length=100,
    transition_oversampling=1,
    initial_comparison_frac=0.1,
    allow_variable_horizon=False,
    initial_epoch_multiplier=1,
)
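
To make the roles of the preference model and the cross-entropy loss more concrete, here is a minimal, standard-library sketch of the general technique: a Bradley-Terry style model turns the summed predicted rewards of two fragments into a preference probability, and the loss compares that probability with the human's choice. The function names and reward values are hypothetical, and this is not necessarily how `PreferenceModel` or `CrossEntropyRewardLoss` are implemented.

```python
import math


def preference_probability(rewards_a, rewards_b):
    """Probability that fragment A is preferred over fragment B.

    Uses the logistic of the difference in summed rewards, which equals
    exp(R_a) / (exp(R_a) + exp(R_b)).
    """
    return_a = sum(rewards_a)
    return_b = sum(rewards_b)
    return 1.0 / (1.0 + math.exp(return_b - return_a))


def cross_entropy(prob_a, label):
    """Cross-entropy between the predicted probability and a human label.

    label is 1.0 if the human chose A and 0.0 if the human chose B.
    """
    eps = 1e-12  # avoids log(0) for extreme probabilities
    return -(
        label * math.log(prob_a + eps)
        + (1.0 - label) * math.log(1.0 - prob_a + eps)
    )


# Hypothetical per-step rewards predicted for two short fragments.
frag_a = [-0.1, -0.2, -0.1]
frag_b = [-1.0, -0.8, -0.9]

p_a = preference_probability(frag_a, frag_b)
loss_if_human_chose_a = cross_entropy(p_a, 1.0)
loss_if_human_chose_b = cross_entropy(p_a, 0.0)
print(p_a, loss_if_human_chose_a, loss_if_human_chose_b)
```

The loss is small when the human's choice agrees with the fragment that the reward model already scores higher, and large otherwise, which is what pushes the reward network toward the human's preferences.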

We're going to train with only 20 comparisons so that the labelling stays quick for you. The videos will appear inline in this notebook for you to watch, along with a text input where you enter your choice.

In [None]:
pref_comparisons.train(
    total_timesteps=5_000,  # For good performance this should be 1_000_000
    total_comparisons=20,  # For good performance this should be 5_000
)
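
As a rough illustration of how small this budget is, the sketch below splits the comparisons across the run. It assumes that a fraction `initial_comparison_frac` of the comparisons is gathered up front and the remainder is divided as evenly as possible over the iterations; the helper `comparison_schedule` is hypothetical, and the library's actual schedule and rounding may differ.

```python
def comparison_schedule(total_comparisons, num_iterations, initial_comparison_frac):
    """Split a comparison budget into an initial batch plus per-iteration batches."""
    initial = int(total_comparisons * initial_comparison_frac)
    remaining = total_comparisons - initial
    # Spread the remainder evenly; earlier iterations absorb any leftover.
    base, extra = divmod(remaining, num_iterations)
    per_iteration = [base + (1 if i < extra else 0) for i in range(num_iterations)]
    return initial, per_iteration


# Values taken from the cells above.
initial, per_iteration = comparison_schedule(
    total_comparisons=20, num_iterations=5, initial_comparison_frac=0.1
)
print(initial, per_iteration)
```

Under these assumptions you would be asked for only a handful of comparisons per iteration, which keeps the session short but is far too little data for a good reward model.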

From this point onward, this notebook is the same as [the synthetic gatherer notebook](5_train_preference_comparisons.ipynb).

After training the reward network with the preference comparisons algorithm, we can wrap our environment with the learned reward.

In [None]:
from imitation.rewards.reward_wrapper import RewardVecEnvWrapper


learned_reward_venv = RewardVecEnvWrapper(venv, reward_net.predict)
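
Conceptually, wrapping the environment means that the reward returned by each step is replaced with the output of the learned reward function, while observations and episode termination are left unchanged. The toy sketch below shows that idea in plain Python. The step function, the reward function, and the assumed `(obs, action, next_obs, done)` argument order are all hypothetical, and this is not how `RewardVecEnvWrapper` is implemented.

```python
def with_learned_reward(env_step, reward_fn):
    """Return a step function whose reward comes from reward_fn."""

    def step(obs, action):
        next_obs, _true_reward, done = env_step(obs, action)
        # Discard the environment's own reward and substitute the learned one.
        learned_reward = reward_fn(obs, action, next_obs, done)
        return next_obs, learned_reward, done

    return step


def toy_env_step(obs, action):
    """Hypothetical one-dimensional environment: the state moves by the action."""
    next_obs = obs + action
    true_reward = -abs(next_obs)
    done = abs(next_obs) >= 10
    return next_obs, true_reward, done


def toy_reward_fn(obs, action, next_obs, done):
    """Stand-in for a learned reward function."""
    return -(next_obs**2)


wrapped_step = with_learned_reward(toy_env_step, toy_reward_fn)
next_obs, reward, done = wrapped_step(1.0, 2.0)
print(next_obs, reward, done)
```

An agent trained through `wrapped_step` never sees `true_reward`, which mirrors the setup in this notebook: the learner is trained on the learned reward and only evaluated on the original one.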

Now we can train an agent that only sees the learned reward.

In [None]:
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy

learner = PPO(
    policy=MlpPolicy,
    env=learned_reward_venv,
    seed=0,
    batch_size=64,
    ent_coef=0.0,
    learning_rate=0.0003,
    n_epochs=10,
    n_steps=64,
)
learner.learn(1000)  # Note: set to 100000 to train a proficient expert

Then we can evaluate it using the original reward.

In [None]:
from stable_baselines3.common.evaluation import evaluate_policy

reward, _ = evaluate_policy(learner.policy, venv, 10)
print(reward)
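
The cell above keeps only the mean return and discards the second value returned by `evaluate_policy`, which is the standard deviation of the episode returns. With so few evaluation episodes the spread is worth checking. The sketch below shows the same kind of summary computed by hand; the returns in it are made-up placeholders rather than results from this notebook, and `summarize_returns` is a hypothetical helper.

```python
import statistics


def summarize_returns(episode_returns):
    """Return the mean and population standard deviation of episode returns."""
    mean_return = statistics.fmean(episode_returns)
    std_return = statistics.pstdev(episode_returns)
    return mean_return, std_return


# Placeholder values for illustration only; not measured results.
example_returns = [-1200.0, -900.0, -1500.0, -1100.0, -1300.0]
mean_return, std_return = summarize_returns(example_returns)
print(f"{mean_return:.1f} +/- {std_return:.1f}")
```

A standard deviation that is large relative to the mean suggests that ten episodes are not enough to tell two policies apart.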

In [None]:
# clean up the videos we made
import shutil

shutil.rmtree(video_dir)