[download this notebook here](https://github.com/HumanCompatibleAI/imitation/blob/master/docs/tutorials/10_train_sqil.ipynb)
# Train an Agent using Soft Q Imitation Learning

Soft Q Imitation Learning ([SQIL](https://arxiv.org/abs/1905.11108)) is a simple algorithm that can be used to clone expert behavior.
It's fundamentally a modification of the DQN algorithm. At each training step, whenever we sample a batch of data from the replay buffer,
we also sample a batch of expert data. Expert demonstrations are assigned a reward of 1, while the agent's own transitions are assigned a reward of 0.
This approach encourages the agent to imitate the expert's behavior, but also to avoid unfamiliar states.
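
The reward relabelling described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the `imitation` implementation: the `Transition` container and the `sample_sqil_batch` helper are hypothetical, and the sketch assumes the agent and expert batches have equal size.

```python
import random
from typing import List, NamedTuple, Tuple


class Transition(NamedTuple):
    """A single environment step (hypothetical container for this sketch)."""

    obs: Tuple[float, ...]
    action: int
    next_obs: Tuple[float, ...]
    done: bool


def sample_sqil_batch(
    agent_buffer: List[Transition],
    expert_buffer: List[Transition],
    batch_size: int,
    rng: random.Random,
) -> List[Tuple[Transition, float]]:
    """Sample `batch_size` agent and `batch_size` expert transitions.

    The environment reward is not used: agent transitions are labelled
    with reward 0.0 and expert transitions with reward 1.0.
    """
    # Sampling with replacement keeps the sketch valid for tiny buffers.
    agent_part = [(t, 0.0) for t in rng.choices(agent_buffer, k=batch_size)]
    expert_part = [(t, 1.0) for t in rng.choices(expert_buffer, k=batch_size)]
    return agent_part + expert_part


# Toy data: two agent transitions and two expert transitions.
agent_buffer = [
    Transition((0.0,), 0, (0.1,), False),
    Transition((0.1,), 1, (0.0,), True),
]
expert_buffer = [
    Transition((0.0,), 1, (0.2,), False),
    Transition((0.2,), 1, (0.4,), False),
]

batch = sample_sqil_batch(agent_buffer, expert_buffer, batch_size=4, rng=random.Random(0))
rewards = [reward for _, reward in batch]
print(rewards)
```

The mixed batch is then used for an ordinary Q-learning update, so the only signal the agent receives is whether a transition came from the expert.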

In this tutorial we will use the `imitation` library to train an agent using SQIL.

First, we need an expert in CartPole-v1 so that we can sample expert trajectories.
Let's train one using stable-baselines3.

Note that you can use other environments, but the action space must be discrete for this algorithm.

In [None]:
import gym
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy

env = gym.make("CartPole-v1")
expert = PPO(
    policy=MlpPolicy,
    env=env,
    seed=0,
    batch_size=64,
    ent_coef=0.0,
    learning_rate=0.0003,
    n_epochs=10,
    n_steps=64,
)
expert.learn(100_000)  # Note: 100_000 timesteps are used to train a proficient expert

Let's quickly check if the expert is any good.
It should usually reach a mean reward of 500, which is the maximum achievable value in CartPole-v1.

In [None]:
from stable_baselines3.common.evaluation import evaluate_policy

reward, _ = evaluate_policy(expert, env, 10)
print(reward)
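
`evaluate_policy` returns two values, the mean and the standard deviation of the per-episode rewards; the cell above discards the second one. As a rough sketch of what that summary looks like, the snippet below computes both from a list of made-up episode returns. The helper `summarize_returns` and the numbers are hypothetical and not part of stable-baselines3.

```python
import statistics
from typing import Sequence, Tuple


def summarize_returns(episode_returns: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and population standard deviation of episode returns."""
    mean_return = statistics.fmean(episode_returns)
    std_return = statistics.pstdev(episode_returns)
    return mean_return, std_return


# Made-up returns for four evaluation episodes.
episode_returns = [500.0, 500.0, 480.0, 500.0]
mean_return, std_return = summarize_returns(episode_returns)
print(f"{mean_return:.1f} +/- {std_return:.1f}")
```

A standard deviation close to zero alongside a mean of 500 suggests that the expert balances the pole for the full episode every time.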

Now we can use the expert to sample some trajectories.
`imitation` comes with a number of helper functions that make collecting them easy. First we collect at least 100 episode rollouts, then we flatten them right away, since we only need the individual transitions for training.
Note that the rollout function requires a vectorized environment, and each of its environments must be wrapped in a `RolloutInfoWrapper`.

In [None]:
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from stable_baselines3.common.vec_env import DummyVecEnv
import numpy as np

venv = DummyVecEnv([lambda: RolloutInfoWrapper(env)])
rng = np.random.default_rng()
rollouts = rollout.rollout(
    expert,
    venv,
    rollout.make_sample_until(min_timesteps=None, min_episodes=100),
    rng=rng,
)
transitions = rollout.flatten_trajectories(rollouts)

Let's have a quick look at what we just generated using those library functions:

In [None]:
print(
    f"""The `rollout` function generated a list of {len(rollouts)} {type(rollouts[0])}.
After flattening, this list is turned into a {type(transitions)} object containing {len(transitions)} transitions.
The transitions object contains arrays for: {', '.join(transitions.__dict__.keys())}.
"""
)
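
To make the flattening step concrete, here is a minimal pure-Python sketch of the idea. It is not the `imitation` implementation: `ToyTrajectory` and `flatten` are hypothetical names, observations are single floats for brevity, and every trajectory is assumed to be a complete episode, so its last step is marked as done.

```python
from typing import List, NamedTuple, Sequence, Tuple


class ToyTrajectory(NamedTuple):
    """One episode: T + 1 observations and T actions (hypothetical container)."""

    obs: Sequence[float]
    acts: Sequence[int]


def flatten(
    trajectories: Sequence[ToyTrajectory],
) -> List[Tuple[float, int, float, bool]]:
    """Turn episodes into a flat list of (obs, act, next_obs, done) tuples."""
    transitions = []
    for traj in trajectories:
        # Each action connects one observation to the next one.
        assert len(traj.obs) == len(traj.acts) + 1
        last_step = len(traj.acts) - 1
        for t, act in enumerate(traj.acts):
            transitions.append((traj.obs[t], act, traj.obs[t + 1], t == last_step))
    return transitions


# Two toy episodes with two steps and one step respectively.
toy_rollouts = [
    ToyTrajectory(obs=[0.0, 0.1, 0.2], acts=[1, 0]),
    ToyTrajectory(obs=[0.5, 0.4], acts=[0]),
]
flat = flatten(toy_rollouts)
print(len(flat))
```

After flattening, episode boundaries survive only through the done flag, which is sufficient for an off-policy algorithm that learns from individual transitions.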

Now that we have collected our transitions, it's time to set up the SQIL trainer.

In [None]:
from imitation.algorithms import sqil

sqil_trainer = sqil.SQIL(
    venv=venv,
    demonstrations=transitions,
    policy="MlpPolicy",
)

As you can see, the untrained policy only achieves a low reward:

In [None]:
reward_before_training, _ = evaluate_policy(sqil_trainer.policy, env, 10)
print(f"Reward before training: {reward_before_training}")

After training, we can match the rewards of the expert (500):

In [None]:
sqil_trainer.train(total_timesteps=1_000_000)  # Note: 1_000_000 timesteps are used to obtain good results
reward_after_training, _ = evaluate_policy(sqil_trainer.policy, env, 10)
print(f"Reward after training: {reward_after_training}")