In [None]:
import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
tf.config.experimental.set_memory_growth(physical_devices[0], True)

# Generative Adversarial Imitation Learning (GAIL)

By the end of this tutorial you will know how to use the GAIL algorithm provided in this library to solve an imitation learning problem where you are the expert.

If you completed the 10_IRL_tutorial, you almost know how to use GAIL already, because its usage is very similar to that of DeepIRL.

In [None]:
import gym
from gym.utils import play
from IL_Problem.base.utils.callbacks import Callbacks, load_expert_memories
from RL_Agent import dddqn_agent, ppo_agent_discrete_parallel
from RL_Agent.base.utils.networks import networks
from IL_Problem.base.utils.networks import networks_dictionaries as il_networks
from RL_Problem import rl_problem as rl_p
from IL_Problem.deepirl import DeepIRL
from IL_Problem.gail import GAIL
from tensorflow.keras.models import Model, Sequential
from tensorflow.keras.layers import Dense, Conv2D, Flatten, Input, MaxPooling2D
import numpy as np

# Collecting Expert Trajectories

In the next cells, you take the role of an expert and play Space Invaders to record some experiences. To record the experiences we use the callback provided in "IL_Problem.base.utils.callbacks.py". To play the environment we use the Gym utility "gym.utils.play".

In [None]:
env_name = "SpaceInvaders-v0"
env = gym.make(env_name)

In [None]:
cb = Callbacks()

To control the ship, use "A" and "D" to move left or right and the space bar to shoot. When you think you have recorded enough experiences, close the environment window with the cross (x) or by pressing "Esc".

In [None]:
play.play(env, zoom=3, callback=cb.remember_callback)

Save the experiences to disk.

In [None]:
exp_path = "expert_demonstrations/SpaceInvaders_expert.pkl"
cb.save_memories(exp_path)
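
To make the recording step less opaque, here is a minimal sketch of what such a callback could look like. `DemoRecorder` is a hypothetical stand-in, not the library's `Callbacks` class, and the pickle layout is an assumption. The only thing borrowed from Gym is the argument order that older `gym.utils.play` releases pass to the callback (`obs_t, obs_tp1, action, rew, done, info`); newer releases may differ.

```python
import os
import pickle
import tempfile

import numpy as np


class DemoRecorder:
    """Hypothetical recorder; the library's Callbacks class may store data differently."""

    def __init__(self):
        self.memory = []

    def remember_callback(self, obs_t, obs_tp1, action, rew, done, info):
        # Keep only the state and the action the expert took in it.
        self.memory.append((np.asarray(obs_t), int(action)))

    def save_memories(self, path):
        # Create the target folder if needed, then pickle the list of pairs.
        folder = os.path.dirname(path)
        if folder:
            os.makedirs(folder, exist_ok=True)
        with open(path, "wb") as f:
            pickle.dump(self.memory, f)


# Simulate three steps by hand instead of opening the game window.
recorder = DemoRecorder()
for step in range(3):
    frame = np.full((210, 160, 3), step, dtype=np.uint8)
    recorder.remember_callback(frame, frame, action=step % 2, rew=0.0, done=False, info={})

demo_path = os.path.join(tempfile.mkdtemp(), "demo", "expert.pkl")
recorder.save_memories(demo_path)

# Read the file back to check that the round trip preserves the data.
with open(demo_path, "rb") as f:
    loaded = pickle.load(f)
print(len(loaded), loaded[0][0].shape, loaded[1][1])
```

Storing only state-action pairs keeps the file small, which matters when every state is a full game frame.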

# Define a RL Problem

## Preprocessing and Normalization

We preprocess the input images to reduce their dimensionality: we crop the edges, downsample, convert to grayscale and normalize the pixel values. The function below performs all of these steps.

In [None]:
def atari_preprocess(obs):
    # Crop the top and bottom borders and downsample by a factor of 2
    obs = obs[20:200:2, ::2]

    # Convert the image to grayscale
    obs = obs.mean(axis=2)

    # normalize between [0, 1]
    obs = obs / 255.
    
    # Pass from 2D of shape (90, 80) to 3D array of shape (90, 80, 1)
    obs = obs[:, :, np.newaxis]

    return obs
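
As a quick sanity check, we can run the same preprocessing on a dummy frame and confirm that the result matches the `state_size` we will later pass to the agent. The random array is only a stand-in for a real 210x160 RGB Atari frame.

```python
import numpy as np


def atari_preprocess(obs):
    # Same steps as the cell above: crop/downsample, grayscale, scale, add channel axis.
    obs = obs[20:200:2, ::2]
    obs = obs.mean(axis=2)
    obs = obs / 255.
    return obs[:, :, np.newaxis]


# A random array standing in for one 210x160 RGB Atari frame.
rng = np.random.default_rng(0)
fake_frame = rng.integers(0, 256, size=(210, 160, 3), dtype=np.uint8)

processed = atari_preprocess(fake_frame)
print(processed.shape)
print(processed.min() >= 0.0, processed.max() <= 1.0)
```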


## Defining the Neural Network Architecture

We define the network architecture using the function "ppo_net" from "RL_Agent.base.utils.networks.networks.py", which returns a dictionary.

In [None]:
net_architecture = networks.ppo_net(actor_conv_layers=2,
                                    actor_kernel_num=[8, 8],
                                    actor_kernel_size=[3, 3],
                                    actor_kernel_strides=[2, 2],
                                    actor_conv_activation=['relu', 'relu'],
                                    actor_dense_layers=2,
                                    actor_n_neurons=[128, 128],
                                    actor_dense_activation=['relu', 'relu'],

                                    critic_conv_layers=2,
                                    critic_kernel_num=[8, 8],
                                    critic_kernel_size=[3, 3],
                                    critic_kernel_strides=[2, 2],
                                    critic_conv_activation=['relu', 'relu'],
                                    critic_dense_layers=2,
                                    critic_n_neurons=[128, 128],
                                    critic_dense_activation=['relu', 'relu'],
                                    use_custom_network=False)

## Define the RL Agent

Here, we define the RL agent using the following parameters:

* actor_lr: learning rate for training the actor neural network.
* critic_lr: learning rate for training the critic neural network.
* batch_size: Size of the batches used for training the neural network. 
* memory_size: Size of the buffer filled with experiences in each algorithm iteration. 
* epsilon: Determines the amount of exploration (float between [0, 1]). 0 -> Full Exploitation; 1 -> Full exploration.
* epsilon_decay: Decay factor of epsilon. In each iteration we calculate the new epsilon value as: epsilon' = epsilon * epsilon_decay.
* epsilon_min: minimum value epsilon can reach during the training procedure.
* net_architecture: net architecture defined before.
* n_stack: number of stacked timesteps to form the state.
* img_input: boolean. Set to True when the states are images in the form of 3D numpy arrays.
* state_size: tuple, size of the state.

In [None]:
agent = ppo_agent_discrete_parallel.Agent(actor_lr=1e-4,
                                              critic_lr=1e-4,
                                              batch_size=128,
                                              memory_size=128,
                                              epsilon=0.9,
                                              epsilon_decay=0.97,
                                              epsilon_min=0.15,
                                              net_architecture=net_architecture,
                                              n_stack=5,
                                              img_input=True,
                                              state_size=(90, 80, 1)
                                              )
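
To get a feeling for how `epsilon`, `epsilon_decay` and `epsilon_min` interact, the sketch below applies the update rule epsilon' = epsilon * epsilon_decay once per iteration. The helper `epsilon_schedule` is hypothetical, and clamping with `max` is our assumption about how the lower bound is enforced; the library may handle it differently.

```python
def epsilon_schedule(epsilon, epsilon_decay, epsilon_min, n_iterations):
    """Return the epsilon value used at each iteration (assumes clamping at epsilon_min)."""
    values = []
    for _ in range(n_iterations):
        values.append(epsilon)
        # Multiplicative decay, never going below the configured minimum.
        epsilon = max(epsilon * epsilon_decay, epsilon_min)
    return values


schedule = epsilon_schedule(0.9, 0.97, 0.15, n_iterations=200)

# Index of the first iteration that already runs with the minimum value.
first_at_min = next(i for i, eps in enumerate(schedule) if eps == 0.15)
print(schedule[:3], first_at_min)
```

Under these assumptions, epsilon sits at its floor for most of a 200-iteration run, so a slower decay may be worth trying if the agent stops exploring too early.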

## Build a RL Problem

Create an RL problem, where the communication between agent and environment is managed. In this case, we use the functionality from "RL_Problem.rl_problem.py", which makes the selection of the matching problem transparent to the user. The function "Problem" automatically selects the problem based on the agent used.

In [None]:
rl_problem = rl_p.Problem(env, agent)
rl_problem.preprocess = atari_preprocess

# Define the IRL Problem

## Loading Expert Experiences

In "IL_Problem.base.utils.callbacks.py" we have some utilities for storing and loading expert experiences. Specifically, we use the function "load_expert_memories", which receives three parameters: 1) "path", a string with the path to the data. 2) "load_action", a boolean that controls whether the actions are loaded. We can perform IRL by training the discriminator to distinguish only the states reached by the expert from the states reached by the agent, or to distinguish the state-action pairs of the expert from those of the agent. 3) "n_stack", which defines how many time steps are stacked to form the state fed to the discriminator. We can use stacked states in the agent and not in the discriminator, or we can use them in both.

In [None]:
use_expert_actions = True
discriminator_stack = 5
exp_memory = load_expert_memories(exp_path, load_action=use_expert_actions, n_stack=discriminator_stack)
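
The idea behind `n_stack` can be illustrated with a few lines of NumPy. The function `stack_timesteps` below is a hypothetical helper written for this explanation; how the library actually stacks states (for instance, how it pads the beginning of an episode) is not shown here and may differ.

```python
import numpy as np


def stack_timesteps(states, n_stack):
    """Group consecutive states into overlapping windows of length n_stack."""
    states = np.asarray(states)
    n_windows = len(states) - n_stack + 1
    if n_windows < 1:
        raise ValueError("trajectory is shorter than n_stack")
    # Window i contains states i, i + 1, ..., i + n_stack - 1.
    return np.stack([states[i:i + n_stack] for i in range(n_windows)])


# Eight toy states with three features each, standing in for preprocessed frames.
trajectory = np.arange(8 * 3, dtype=np.float32).reshape(8, 3)
stacked = stack_timesteps(trajectory, n_stack=5)
print(stacked.shape)
```

Each window gives the discriminator a short history instead of a single frame, which lets it pick up on motion that one frame alone cannot show.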


## Defining Discriminator Neural Network

The procedure for defining the neural network for the discriminator is the same as the one we have seen in all past tutorials for the RL agent network. The main difference is that the utilities are found inside the "IL_Problem.base" folder.

As we did for the RL agent, we can define the neural network architecture by creating a Keras model:

In [None]:
def one_layer_custom_model(input_shape):
    x_input = Input(shape=input_shape, name='disc_common_input')
    x = Dense(128, activation='relu')(x_input)
    x = Dense(128, activation='relu')(x)
    x = Dense(1, activation='sigmoid')(x)
    model = Model(inputs=x_input, outputs=x)
    return model


In [None]:
irl_net_architecture = il_networks.irl_discriminator_net(use_custom_network=True,
                                                         state_custom_network=None,
                                                         common_custom_network=one_layer_custom_model,
                                                         define_custom_output_layer=False)

## Build the IRL Problem

Like an RL problem, an IRL problem has some parameters, detailed below:

* rl_problem: RL problem defined before. This is formed by an environment and an RL agent.
* expert_traj: expert trajectories loaded before with "load_expert_memories".
* lr_disc: learning rate for training the discriminator neural network.
* batch_size_disc: Size of the batches used for training the discriminator neural network. 
* epochs_disc: Number of epochs for training the discriminator in each algorithm iteration.
* val_split_disc: Validation split of the data used when training the discriminator.
* n_stack_disc: number of stacked timesteps to form the state in the discriminator input.
* net_architecture: net architecture defined before.
* use_expert_actions: Flag that controls whether actions are used for training the discriminator. If True, the discriminator will receive state-action pairs as input. If False, the discriminator will receive only states as input.


In [None]:
irl_problem = GAIL(rl_problem, exp_memory, lr_disc=1e-5, batch_size_disc=128, epochs_disc=2, val_split_disc=0.1,
                   n_stack_disc=discriminator_stack, net_architecture=irl_net_architecture,
                   use_expert_actions=use_expert_actions)
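
The effect of `use_expert_actions` on the discriminator input can be sketched as follows. This is an illustration, not the library's code: the helper `build_discriminator_input`, the flattening of the state and the one-hot encoding of the action are all assumptions.

```python
import numpy as np


def build_discriminator_input(states, actions, n_actions, use_expert_actions):
    """Flatten each state and, optionally, append a one-hot encoded action."""
    flat_states = np.asarray(states, dtype=np.float32).reshape(len(states), -1)
    if not use_expert_actions:
        # States only: the discriminator never sees which action was taken.
        return flat_states
    one_hot = np.eye(n_actions, dtype=np.float32)[np.asarray(actions)]
    return np.concatenate([flat_states, one_hot], axis=1)


# Four toy states of shape (5, 2) and four discrete actions out of six possible ones.
states = np.zeros((4, 5, 2))
actions = [0, 3, 5, 1]

with_actions = build_discriminator_input(states, actions, n_actions=6, use_expert_actions=True)
states_only = build_discriminator_input(states, actions, n_actions=6, use_expert_actions=False)
print(with_actions.shape, states_only.shape)
```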

## Solving the IRL Problem

As we always do in this series of tutorials, let's solve the instantiated problem, in this case an IRL problem. The parameters of this function are:

- iterations: Number of GAIL iterations. GAIL is integrated into the PPO workflow, so the number of iterations is equivalent to the "episodes" parameter of the solve function of RL agents. Training with the GAIL algorithm is like training a PPO agent with an extra step that trains the discriminator and estimates the reward values before the RL agent is trained.
- render: Whether to render the environment during the process.
- max_step_epi: Limits the number of steps of each episode.
- render_after: render after n iterations.
- skip_state: State skipping technique from Mnih, V., Kavukcuoglu, K., Silver, D. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015). https://doi.org/10.1038/nature14236

In [None]:
irl_problem.solve(200, render=False, max_step_epi=100, render_after=10)
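
To make the extra step of each GAIL iteration more concrete, the sketch below shows one common way to compute a discriminator loss and to turn discriminator outputs into rewards. It is plain NumPy and not the library's implementation: the labelling convention (expert = 1, agent = 0), the reward formula -log(1 - D) and the function names are assumptions for illustration, and implementations differ on these choices.

```python
import numpy as np


def discriminator_loss(d_expert, d_agent, eps=1e-8):
    """Binary cross-entropy with expert samples labelled 1 and agent samples labelled 0."""
    expert_term = -np.log(d_expert + eps).mean()
    agent_term = -np.log(1.0 - d_agent + eps).mean()
    return expert_term + agent_term


def surrogate_reward(d_agent, eps=1e-8):
    """Reward that grows as the discriminator mistakes agent samples for expert ones."""
    return -np.log(1.0 - d_agent + eps)


# Toy sigmoid outputs of the discriminator on an expert batch and an agent batch.
d_expert = np.array([0.9, 0.8, 0.95])
d_agent = np.array([0.1, 0.5, 0.7])

loss = discriminator_loss(d_expert, d_agent)
rewards = surrogate_reward(d_agent)
print(loss)
print(rewards)
```

With this convention, the agent is rewarded for producing samples that the discriminator scores close to 1, that is, samples that look like the expert's.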

In [None]:
rl_problem.test(10, render=False)

# Takeaways

- We learned how to collect our own expert trajectories in an interactive way.
- We learned how to combine PPO with GAIL.
- We saw the special parameters of GAIL.
- We trained an RL agent with the GAIL algorithm.