<a href="https://colab.research.google.com/drive/1WhiULuo9oBo1kKgXqQjNY53ht3J0TlEG?usp=sharing" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

By [Ibrahim Sobh](https://www.linkedin.com/in/ibrahim-sobh-phd-8681757/)


# GAIL [Generative Adversarial Imitation Learning](https://arxiv.org/pdf/1606.03476.pdf)

In GANs ([Generative Adversarial Networks](https://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf)), two networks learn together:

- Generator network: tries to fool the discriminator by generating real-looking images
- Discriminator network: tries to distinguish between real and fake images

### GAIL
GAIL uses a discriminator that tries to separate expert trajectories from trajectories of the learned policy, which plays the role of the generator here.
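
To make the adversarial setup concrete, here is a minimal NumPy sketch of a logistic discriminator objective and the surrogate reward a GAIL-style policy could be trained on. It is an illustration only, not the stable-baselines implementation. The labeling (expert pairs as 1, policy pairs as 0), the function names, and the toy logits are all assumptions made for this example, and sign conventions differ between write-ups.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def discriminator_loss(expert_logits, policy_logits, eps=1e-8):
    """Binary cross-entropy with expert pairs labeled 1 and policy pairs labeled 0 (assumed convention)."""
    d_expert = sigmoid(expert_logits)
    d_policy = sigmoid(policy_logits)
    return -(np.mean(np.log(d_expert + eps)) + np.mean(np.log(1.0 - d_policy + eps)))

def surrogate_reward(policy_logits, eps=1e-8):
    """Reward that grows as the discriminator mistakes policy pairs for expert pairs."""
    return -np.log(1.0 - sigmoid(policy_logits) + eps)

# Hypothetical discriminator logits for a few (state, action) pairs
expert_logits = np.array([2.0, 1.5, 3.0])
policy_logits = np.array([-2.0, -1.0, 0.5])

loss = discriminator_loss(expert_logits, policy_logits)
rewards = surrogate_reward(policy_logits)
print("discriminator loss:", loss)
print("surrogate rewards:", rewards)
```

Under this convention the policy is rewarded most for the pairs the discriminator finds hardest to tell apart from expert data, which is what pushes its trajectories toward the expert's.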

### Steps
- Generate and save expert dataset
- Load the expert dataset
- Train GAIL agent and evaluate



## Install

In [0]:
!pip install gym
!pip install box2d-py
# !pip install pyglet==1.3.2
!pip install pyglet
!pip install --upgrade stable-baselines

Collecting box2d-py
  Downloading https://files.pythonhosted.org/packages/06/bd/6cdc3fd994b0649dcf5d9bad85bd9e26172308bbe9a421bfc6fdbf5081a6/box2d_py-2.3.8-cp36-cp36m-manylinux1_x86_64.whl (448kB)
Installing collected packages: box2d-py
Successfully installed box2d-py-2.3.8
Collecting stable-baselines
  Downloading https://files.pythonhosted.org/packages/c0/05/f6651855083020c0363acf483450c23e38d96f5c18bec8bded113d528da5/stable_baselines-2.9.0-py3-none-any.whl (232kB)
Installing collected packages: stable-baselines
  Found existing installation: stable-baselines 2.2.1
    Uninstalling stable-baselines-2.2.1:
      Successfully uninstalled stable-baselines-2.2.1
Successfully installed stable-baselines-2.9.0


## Generate and save expert trajectories

In [0]:
# Train a TD3 model and use it to generate expert trajectories for GAIL
import gym
import numpy as np
from stable_baselines import TD3
from stable_baselines.td3.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines.ddpg.noise import NormalActionNoise
from stable_baselines.gail import generate_expert_traj

env = gym.make('Pendulum-v0')
env = DummyVecEnv([lambda: env])

# Gaussian exploration noise for TD3
n_actions = env.action_space.shape[-1]
action_noise = NormalActionNoise(mean=np.zeros(n_actions), sigma=0.1 * np.ones(n_actions))
model = TD3(MlpPolicy, env, action_noise=action_noise, verbose=1)
# Train for n_timesteps, then record n_episodes to expert_pendulum.npz
generate_expert_traj(model, 'expert_pendulum', n_timesteps=20000, n_episodes=10)


---------------------------------------
| current_lr              | 0.0003    |
| episodes                | 4         |
| fps                     | 143       |
| mean 100 episode reward | -1.47e+03 |
| n_updates               | 400       |
| qf1_loss                | 2.8665037 |
| qf2_loss                | 2.272807  |
| time_elapsed            | 4         |
| total timesteps         | 600       |
---------------------------------------
---------------------------------------
| current_lr              | 0.0003    |
| episodes                | 8         |
| fps                     | 152       |
| mean 100 episode reward | -1.42e+03 |
| n_updates               | 1200      |
| qf1_loss               

{'actions': array([[-0.6231084 ],
        [-0.61850584],
        [-0.62992656],
        ...,
        [-1.7786973 ],
        [-1.9231621 ],
        [-1.9736314 ]], dtype=float32),
 'episode_returns': array([-639.91789679, -767.61171303, -742.2489131 , -503.10143113,
        -780.40580989, -768.87662937, -789.46892238, -763.61080178,
        -765.05994363, -528.3228281 ]),
 'episode_starts': array([ True, False, False, ..., False, False, False]),
 'obs': array([[ 0.99738544,  0.0722651 ,  0.34185213],
        [ 0.99617803,  0.08734594,  0.3025847 ],
        [ 0.9948813 ,  0.10105053,  0.27531826],
        ...,
        [ 0.6024808 , -0.7981334 , -3.8638847 ],
        [ 0.3987389 , -0.9170645 , -4.7292895 ],
        [ 0.12453896, -0.99221474, -5.705562  ]], dtype=float32),
 'rewards': array([-0.01730591, -0.01718709, -0.01822298, ..., -2.35025263,
        -3.58743644, -5.34996176])}
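
The recorded dictionary stores all episodes as flat arrays, with `episode_starts` flagging the first step of each episode. As a small sketch of how those flags can be used, the helper below sums rewards per episode. The function name and toy arrays are hypothetical, and it assumes the first flag is `True`, as in the output above.

```python
import numpy as np

def split_episode_returns(rewards, episode_starts):
    """Sum rewards per episode, using True flags in episode_starts as boundaries."""
    rewards = np.asarray(rewards, dtype=float)
    starts = np.flatnonzero(np.asarray(episode_starts, dtype=bool))
    # np.add.reduceat sums rewards[starts[i]:starts[i + 1]] for each episode
    return np.add.reduceat(rewards, starts)

# Toy data (hypothetical values): two episodes of three steps each
rewards = np.array([-1.0, -2.0, -3.0, -0.5, -0.5, -1.0])
episode_starts = np.array([True, False, False, True, False, False])

returns = split_episode_returns(rewards, episode_starts)
print(returns)  # [-6. -2.]
```

On the real data, the same helper could be applied to the arrays in `np.load('expert_pendulum.npz')`. The result would be expected to line up with `episode_returns`, though that has not been checked here.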

In [0]:
results_mean_list = []
results_std_list = []

In [0]:
# Evaluate the TD3 model (which generated the trajectories)
env = model.get_env()
obs = env.reset()
r_list = []

for i in range(10):
    print("\riteration: {}".format(i), end="")
    reward_sum = 0.0
    for _ in range(1000):
        action, _ = model.predict(obs)
        obs, reward, done, _ = env.step(action)
        reward_sum += reward
        if done:
            # Episode finished: record its return and start a new one
            r_list.append(reward_sum)
            reward_sum = 0.0
            obs = env.reset()

print('\nmean, std')
print(np.mean(r_list), np.std(r_list))
results_mean_list.append(np.mean(r_list))
results_std_list.append(np.std(r_list))
env.close()

iteration: 9
mean, std
-722.691918814492 111.12368136240866


## Train

In [0]:
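n_steps = 0

def callback(_locals, _globals):
    # Counts callback invocations during training, not environment timesteps
    global n_steps
    print("\r Steps: {}".format(n_steps), end="")
    n_steps += 1
    # Returning True lets training continue
    return True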

In [0]:
from stable_baselines import GAIL
from stable_baselines.gail import ExpertDataset

# Load the expert dataset, using all 10 recorded trajectories
dataset = ExpertDataset(expert_path='expert_pendulum.npz', traj_limitation=10, verbose=1)

model = GAIL("MlpPolicy", 'Pendulum-v0', dataset, verbose=0)
# Note: in practice, you need to train for 1M steps to have a working policy
model.learn(total_timesteps=200000, callback=callback)
model.save("gail_pendulum")

actions (2000, 1)
obs (2000, 3)
rewards (2000,)
episode_returns (10,)
episode_starts (2000,)
Total trajectories: 10
Total transitions: 2000
Average returns: -704.8624889178202
Std for returns: 102.69602528313158












 Steps: 196

## Evaluate 

In [0]:
# Evaluate the trained GAIL model
env = model.get_env()
obs = env.reset()
r_list = []

for i in range(10):
    print("\riteration: {}".format(i), end="")
    reward_sum = 0.0
    for _ in range(1000):
        action, _ = model.predict(obs)
        obs, reward, done, _ = env.step(action)
        reward_sum += reward
        if done:
            # Episode finished: record its return and start a new one
            r_list.append(reward_sum)
            reward_sum = 0.0
            obs = env.reset()

print('\nmean, std')
print(np.mean(r_list), np.std(r_list))
results_mean_list.append(np.mean(r_list))
results_std_list.append(np.std(r_list))
env.close()

iteration: 9
mean, std
-932.6035695613036 173.95717308774894
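
The two evaluation cells append their results to `results_mean_list` and `results_std_list`, but the notebook never prints them side by side. The sketch below shows one way to do so. The numbers are copied from the outputs above so that it runs standalone, and the labels are an assumption about the order in which the cells were run.

```python
# Values copied from the two evaluation outputs in this notebook
labels = ["TD3 (expert)", "GAIL"]
results_mean_list = [-722.691918814492, -932.6035695613036]
results_std_list = [111.12368136240866, 173.95717308774894]

for label, mean, std in zip(labels, results_mean_list, results_std_list):
    print("{:<14} mean return: {:8.1f} +/- {:.1f}".format(label, mean, std))

# Negative gap means the GAIL agent scored below the expert it imitates
gap = results_mean_list[1] - results_mean_list[0]
print("GAIL - expert gap: {:.1f}".format(gap))
```

In this run the GAIL agent trails the TD3 expert, which fits the note in the training cell that far more than 200000 timesteps are typically needed.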
