# Train an Agent using Generative Adversarial Imitation Learning

Generative adversarial imitation learning (GAIL) trains a discriminator network to distinguish expert trajectories from learner trajectories.
The learner is trained with a standard reinforcement learning algorithm such as PPO and is rewarded for producing trajectories that the discriminator classifies as expert trajectories.
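
As a rough illustration of this idea, the sketch below shows one common formulation. The discriminator is trained with binary cross-entropy to assign a high "expert" probability to expert samples, and the learner's reward grows as the discriminator's "expert" probability for its samples increases. The function names and the reward form `-log(1 - D)` are assumptions chosen for illustration; they are not taken from this notebook and are not necessarily what the imitation library computes internally.

```python
import math


def sigmoid(logit: float) -> float:
    """Map a raw discriminator logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def discriminator_loss(expert_logits, learner_logits):
    """Binary cross-entropy: expert samples are labelled 1, learner samples 0."""
    eps = 1e-12  # avoids log(0)
    loss = 0.0
    for logit in expert_logits:
        loss -= math.log(sigmoid(logit) + eps)
    for logit in learner_logits:
        loss -= math.log(1.0 - sigmoid(logit) + eps)
    return loss / (len(expert_logits) + len(learner_logits))


def learner_reward(logit: float) -> float:
    """Reward -log(1 - D): large when the discriminator believes 'expert'."""
    return -math.log(1.0 - sigmoid(logit) + 1e-12)
```

With this choice, a learner sample that fools the discriminator (high logit) earns a larger reward than one that is easily recognised as learner-generated (low logit).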

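As usual, we first need an expert.
Note that this notebook uses the `Walker2d-v2` environment from gym, whose episodes can have different lengths. The imitation library discourages such variable-horizon environments; the environment variants in the seals package have fixed episode durations instead. Read more about why this matters [here](https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html).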

In [1]:
import gym
env = gym.make("Walker2d-v2")

In [2]:
import numpy as np
from stable_baselines3 import DDPG
from stable_baselines3.common.noise import NormalActionNoise

# Gaussian exploration noise with one dimension per action component
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
expert = DDPG(
    policy="MlpPolicy",
    env=env,
    action_noise=action_noise, 
    verbose=1,
    tensorboard_log="/Users/kang/GitHub/GAIL-Fail/tensorboard/ddpg_walker2dv2_log3/"
)

Using cpu device
Wrapping the env with a `Monitor` wrapper
Wrapping the env in a DummyVecEnv.


We first train the expert and then generate some expert trajectories that the discriminator needs to distinguish from the learner's trajectories.

In [None]:
expert.learn(100_000, tb_log_name="first_run")  # 100000 timesteps to train a proficient expert

In [5]:
from imitation.data import rollout
from imitation.data.wrappers import RolloutInfoWrapper
from stable_baselines3.common.vec_env import DummyVecEnv

rollouts = rollout.rollout(
    expert,
    DummyVecEnv([lambda: RolloutInfoWrapper(gym.make("Walker2d-v2"))] * 5),
    rollout.make_sample_until(min_timesteps=None, min_episodes=60),
)
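
The `min_episodes=60` argument above asks the rollout to continue until at least 60 episodes have been collected. The sketch below illustrates how such a stopping condition could work. `make_stop_condition` and `should_stop` are hypothetical names for illustration, not the library's implementation, and it assumes that when both limits are given, both must be met.

```python
def make_stop_condition(min_timesteps=None, min_episodes=None):
    """Hypothetical helper: build a predicate over the episodes collected so far."""
    if min_timesteps is None and min_episodes is None:
        raise ValueError("Specify at least one of min_timesteps or min_episodes")

    def should_stop(episode_lengths):
        # A limit that is None is treated as already satisfied.
        enough_steps = min_timesteps is None or sum(episode_lengths) >= min_timesteps
        enough_episodes = min_episodes is None or len(episode_lengths) >= min_episodes
        return enough_steps and enough_episodes

    return should_stop


stop = make_stop_condition(min_episodes=60)
print(stop([100] * 59))  # False
print(stop([100] * 60))  # True
```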

Now we are ready to set up our GAIL trainer.
Note that the `reward_net` is actually the discriminator's network.
We evaluate the learner before and after training so we can see if it made any progress.

In [6]:
import gym
from imitation.algorithms.adversarial.gail import GAIL
from imitation.rewards.reward_nets import BasicRewardNet
from imitation.util.networks import RunningNorm
from stable_baselines3 import PPO
from stable_baselines3.ppo import MlpPolicy
from stable_baselines3.common.evaluation import evaluate_policy
from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv


venv = DummyVecEnv([lambda: gym.make("Walker2d-v2")] * 8)


In [7]:
learner = PPO(
    env=venv,
    policy=MlpPolicy,
)


In [8]:
reward_net = BasicRewardNet(
    venv.observation_space, venv.action_space, normalize_input_layer=RunningNorm
)


In [9]:
gail_trainer = GAIL(
    demonstrations=rollouts,
    demo_batch_size=1024,
    gen_replay_buffer_capacity=2048,
    n_disc_updates_per_round=4,
    venv=venv,
    gen_algo=learner,
    reward_net=reward_net,
)


In [16]:

learner_rewards_before_training, _ = evaluate_policy(
    learner, venv, 100, return_episode_rewards=True
)



In [17]:
gail_trainer.train(300_000)  # total learner timesteps; fewer run faster but give worse results
learner_rewards_after_training, _ = evaluate_policy(
    model=learner, env=venv, n_eval_episodes=100, return_episode_rewards=True
)

round:   0%|          | 0/18 [00:00<?, ?it/s]

---------------------------------------------------
| raw/                               |            |
|    gen/rollout/ep_rew_wrapped_mean | 19.7       |
|    gen/time/fps                    | 6719       |
|    gen/time/iterations             | 1          |
|    gen/time/time_elapsed           | 2          |
|    gen/time/total_timesteps        | 65536      |
|    gen/train/approx_kl             | 0.01862851 |
|    gen/train/clip_fraction         | 0.249      |
|    gen/train/clip_range            | 0.2        |
|    gen/train/entropy_loss          | -8.37      |
|    gen/train/explained_variance    | 0.541      |
|    gen/train/learning_rate         | 0.0003     |
|    gen/train/loss                  | 5.18       |
|    gen/train/n_updates             | 20         |
|    gen/train/policy_gradient_loss  | -0.0329    |
|    gen/train/std                   | 0.973      |
|    gen/train/value_loss            | 9.69       |
---------------------------------------------------


round:   0%|          | 0/18 [00:06<?, ?it/s]


ValueError: Episodes of different length detected: {10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 63, 64, 65, 67, 68, 69, 71, 72, 73, 74, 75, 76, 77, 78, 80, 81, 82, 83, 84, 86, 87, 88, 89, 90, 91, 98, 101, 103, 104, 107, 108, 110, 125, 137, 144, 166}. Variable horizon environments are discouraged -- termination conditions leak information about reward. See https://imitation.readthedocs.io/en/latest/guide/variable_horizon.html for more information. If you are SURE you want to run imitation on a variable horizon task, then please pass in the flag: `allow_variable_horizon=True`.
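
The error above is raised because `Walker2d-v2` episodes end at different times. The message names one way out, the `allow_variable_horizon=True` flag. The other is to make every episode the same length. The sketch below shows the idea behind a fixed-horizon wrapper on a toy environment. `ToyEnv` and `FixedHorizon` are hypothetical classes written for illustration, not the seals implementation, and the choice to repeat the last observation with zero reward after termination is an assumption.

```python
class ToyEnv:
    """Hypothetical stand-in environment that terminates after `fail_at` steps."""

    def __init__(self, fail_at):
        self.fail_at = fail_at
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        done = self.t >= self.fail_at
        return float(self.t), 1.0, done, {}


class FixedHorizon:
    """Hypothetical wrapper: every episode lasts exactly `horizon` steps.

    After the wrapped env terminates, the last observation is repeated with
    zero reward until the horizon is reached (an absorbing state).
    """

    def __init__(self, env, horizon):
        self.env = env
        self.horizon = horizon
        self.t = 0
        self.inner_done = False
        self.last_obs = None

    def reset(self):
        self.t = 0
        self.inner_done = False
        self.last_obs = self.env.reset()
        return self.last_obs

    def step(self, action):
        self.t += 1
        if self.inner_done:
            reward = 0.0  # absorbing state: no further reward
        else:
            self.last_obs, reward, self.inner_done, _ = self.env.step(action)
        # Only the step counter decides when the episode ends.
        return self.last_obs, reward, self.t >= self.horizon, {}


def episode_length(env):
    """Run one episode with a dummy action and count its steps."""
    env.reset()
    steps, done = 0, False
    while not done:
        _, _, done, _ = env.step(0)
        steps += 1
    return steps


lengths = [episode_length(FixedHorizon(ToyEnv(fail_at=n), horizon=20)) for n in (3, 7, 50)]
print(lengths)  # [20, 20, 20]
```

Because the episode end no longer depends on what the agent does, the termination signal cannot leak information about the reward.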

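The run above stopped with the variable-horizon `ValueError`, so the comparison below is only meaningful once training completes.
When it does, the histograms of rewards before and after training show whether the learner made progress; it will usually not be perfect yet. If it made no progress, re-run the training cell.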

In [None]:
import matplotlib.pyplot as plt
import numpy as np

print(np.mean(learner_rewards_after_training))
print(np.mean(learner_rewards_before_training))

plt.hist(
    [learner_rewards_before_training, learner_rewards_after_training],
    label=["untrained", "trained"],
)
plt.legend()
plt.show()