# Training an agent through Inverse Reinforcement Learning

This tutorial shows how to use the Inverse Reinforcement Learning tools to train an RL agent via Imitation Learning.

This library provides three Imitation Learning algorithms:

## 1) Deep Inverse Reinforcement Learning (DeepIRL)


It consists of an implementation of the apprenticeship learning algorithm from "Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. ICML '04."

As an overview, this algorithm has two main entities with adversarial tasks: 1) an RL agent that generates actions intended to be as similar as possible to the expert's actions, and 2) a discriminator that tries to distinguish which actions come from the RL agent and which come from the expert. The discriminator's output is a value that is used as the reward to train the RL agent.

This particular implementation uses Deep Learning. For this purpose, we have replaced the classifier used for the discriminator in the original work with a neural network. This algorithm is compatible with all Deep Reinforcement Learning agents in this library.
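The adversarial loop described above can be illustrated with a toy discriminator. This is only a sketch under stated assumptions: a logistic-regression classifier written in NumPy stands in for the neural network, the data is synthetic, and the function names (`train_discriminator`, `discriminator_reward`) as well as the choice of using the "expert" probability directly as the reward are hypothetical, not this library's API.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_discriminator(expert_x, agent_x, lr=0.1, epochs=200):
    """Fit a logistic-regression discriminator: label 1 = expert, 0 = agent."""
    x = np.vstack([expert_x, agent_x])
    y = np.concatenate([np.ones(len(expert_x)), np.zeros(len(agent_x))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(x @ w + b)
        grad = p - y  # gradient of binary cross-entropy w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def discriminator_reward(samples, w, b):
    """Assumed reward: probability that each sample looks like expert data."""
    return sigmoid(samples @ w + b)


# Synthetic state-action vectors: expert samples cluster around +1, agent samples around -1.
expert_samples = rng.normal(loc=1.0, scale=0.5, size=(256, 4))
agent_samples = rng.normal(loc=-1.0, scale=0.5, size=(256, 4))

w, b = train_discriminator(expert_samples, agent_samples)
expert_like_reward = discriminator_reward(expert_samples, w, b).mean()
agent_like_reward = discriminator_reward(agent_samples, w, b).mean()
print(expert_like_reward, agent_like_reward)
```

In a full training loop the agent would be updated to increase this reward, new agent samples would be collected, and the discriminator would be retrained, alternating between the two steps.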

## 2) Generative Adversarial Imitation Learning (GAIL)

This is an implementation of "HO, Jonathan; ERMON, Stefano. Generative adversarial imitation learning. Advances in Neural Information Processing Systems, 2016, vol. 29, p. 4565-4573."

This algorithm is very similar to DeepIRL but uses the workflow of the Trust Region Policy Optimization (TRPO) algorithm, another RL algorithm, to make the process more efficient. It has two main entities: 1) a reinforcement learning agent that generates actions intended to be as similar as possible to the expert's actions, and 2) a discriminator neural network that tries to distinguish which actions come from the RL agent and which come from the expert. The discriminator's output is a value that is used as the reward to train the RL agent.

This particular implementation uses Proximal Policy Optimization (PPO) instead of TRPO because PPO was created as a refined version of TRPO and both have the same workflow. This means that GAIL is only compatible with PPO; no other RL agent can be used with it.
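To make the PPO side more concrete, the following NumPy sketch computes PPO's clipped surrogate objective, which is where a discriminator-derived advantage would end up during the policy update. The function name and the value `clip_range=0.2` are illustrative assumptions and are not taken from this library.

```python
import numpy as np


def ppo_clipped_objective(new_log_prob, old_log_prob, advantage, clip_range=0.2):
    """Mean clipped surrogate objective (to be maximized) over a batch."""
    ratio = np.exp(new_log_prob - old_log_prob)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range) * advantage
    # Taking the minimum removes the incentive to move the policy too far in one update.
    return np.minimum(unclipped, clipped).mean()


# Three hypothetical samples: probabilities under the old and the updated policy.
old_log_prob = np.log(np.array([0.5, 0.5, 0.5]))
new_log_prob = np.log(np.array([0.9, 0.5, 0.1]))
advantage = np.array([1.0, 1.0, -1.0])

objective = ppo_clipped_objective(new_log_prob, old_log_prob, advantage)
print(objective)
```

The first sample has a probability ratio of 1.8, but its contribution is capped at 1.2 times the advantage, which is the mechanism that plays the role of TRPO's trust region.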

## 3) Behavioral Cloning

This algorithm consists of a supervised deep learning problem where a neural network is trained on a dataset of expert experiences that contains states paired with actions. The neural network is trained using the states as inputs and the actions as labels.

In this library we provide the tools to train RL agents through behavioral cloning. These tools also make it possible to pretrain an RL agent on labeled data and then fine-tune it with RL or IRL.
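The supervised setup can be sketched in a few lines. To keep the example runnable with NumPy alone, a linear least-squares model stands in for the neural network, and the expert dataset is synthetic; the variable and function names are hypothetical and unrelated to this library's behavioral cloning tools.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic expert dataset: 8-dimensional states, 2-dimensional continuous actions.
true_weights = rng.normal(size=(8, 2))
states = rng.normal(size=(500, 8))
actions = states @ true_weights + 0.01 * rng.normal(size=(500, 2))

# Supervised fit: states are the inputs, expert actions are the labels.
weights, *_ = np.linalg.lstsq(states, actions, rcond=None)


def cloned_policy(state):
    """Predict the action the expert would take in the given state(s)."""
    return state @ weights


mse = float(np.mean((cloned_policy(states) - actions) ** 2))
print(mse)
```

With a neural network the fitting step becomes a regular regression (continuous actions) or classification (discrete actions) training loop over the same inputs and labels.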

## Expert Data

All Imitation Learning methods need a dataset of expert demonstrations. This dataset should contain the experiences at each time step. Depending on the problem, these experiences may contain only the states or the states paired with actions. We also provide some utilities to store and load the expert datasets.
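As an illustration of what such a dataset can look like, the sketch below stores a list of (state, action) pairs with `pickle` and loads it back either as pairs or as states only. The on-disk layout and the helper names (`save_demonstrations`, `load_demonstrations`) are assumptions made for this example; the `.pkl` files used by this library may be organized differently.

```python
import os
import pickle
import tempfile

import numpy as np


def save_demonstrations(path, demonstrations):
    """Write a list of (state, action) pairs to disk."""
    with open(path, "wb") as f:
        pickle.dump(demonstrations, f)


def load_demonstrations(path, load_action=True):
    """Read the pairs back; optionally drop the actions and keep only the states."""
    with open(path, "rb") as f:
        demonstrations = pickle.load(f)
    if load_action:
        return demonstrations
    return [state for state, _action in demonstrations]


# Five synthetic time steps: an 8-dimensional state and a 2-dimensional action each.
demonstrations = [
    (np.full(8, t, dtype=np.float32), np.array([0.1 * t, -0.1 * t]))
    for t in range(5)
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "expert_demo.pkl")
    save_demonstrations(path, demonstrations)
    pairs = load_demonstrations(path, load_action=True)
    states_only = load_demonstrations(path, load_action=False)

print(len(pairs), len(states_only))
```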

In [1]:
from RL_Problem import rl_problem
from IL_Problem.gail import GAIL
from IL_Problem.deepirl import DeepIRL
from RL_Agent import ppo_agent_continuous_parallel, dpg_agent_continuous
from IL_Problem.base.utils.callbacks import load_expert_memories, Callbacks
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, LSTM, Input
from RL_Agent.base.utils import agent_saver
from RL_Agent.base.utils.networks import networks as rl_networks
from IL_Problem.base.utils.networks import networks_dictionaries as il_networks
import gym


We are going to use the LunarLander environment from OpenAI Gym. 

In [2]:
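# The continuous-action variant matches the continuous agents and the expert dataset used below.
environment = "LunarLanderContinuous-v2"
environment = gym.make(environment)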

We provide an expert demonstrations dataset in "tutorials/tf_tutorials/expert_demonstrations/Expert_LunarLander.pkl". This dataset was created by running an already trained DPG agent on the environment.

Next, we provide the code we have used to generate the dataset with a DPG agent. If you already have a dataset, you do not need to run the next cell.

In [None]:
exp_path = "tutorials/tf_tutorials/expert_demonstrations/Expert_LunarLanderContinuous.pkl"
net_architecture = rl_networks.net_architecture(dense_layers=2,
                                           n_neurons=[256, 256],
                                           dense_activation=['relu', 'relu'])

expert = dpg_agent_continuous.Agent(learning_rate=5e-4,
                         batch_size=32,
                         net_architecture=net_architecture)

expert_problem = rl_problem.Problem(environment, expert)

callback = Callbacks()

# Comment out if an experience file such as "Expert_LunarLander.pkl" is already available
print("Starting expert training")
expert_problem.solve(1000, render=False, max_step_epi=250, render_after=980, skip_states=3)
expert_problem.test(render=False, n_iter=400, callback=callback.remember_callback)

callback.save_memories(exp_path)

Define the agent neural network.

In [3]:
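def lstm_custom_model(input_shape):
    # Shared body for the actor and the critic: one LSTM layer followed by two dense layers.
    actor_model = Sequential()
    actor_model.add(LSTM(16, input_shape=input_shape, activation='tanh'))
    # Only the first layer needs input_shape; later layers infer their input size.
    actor_model.add(Dense(256, activation='relu'))
    actor_model.add(Dense(256, activation='relu'))
    return actor_model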


Load the expert experiences.

"IL_Problem.base.utils.callbacks.py" provides utilities for storing and loading expert experiences. Specifically, we use the function "load_expert_memories", which receives three parameters: 1) "path", a string with the path to the data. 2) "load_action", a boolean that selects whether the actions are loaded. IRL can train the discriminator to distinguish only the states reached by the expert from the states reached by the agent, or to distinguish the state-action pairs of the expert from those of the agent. 3) "n_stack", which defines how many time steps are stacked to form the state fed to the discriminator. Stacked states can be used for the agent but not for the discriminator, or for both.
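The idea of stacking time steps can be sketched with NumPy as follows. The helper `stack_states` is hypothetical and only illustrates the concept: it builds sliding windows of `n_stack` consecutive states and drops incomplete windows at the start of the episode, which may differ from how this library handles the first steps.

```python
import numpy as np


def stack_states(states, n_stack):
    """Group consecutive states into windows of length n_stack."""
    states = np.asarray(states)
    if n_stack <= 1:
        return states
    windows = [states[i:i + n_stack] for i in range(len(states) - n_stack + 1)]
    return np.stack(windows)


# A synthetic episode with 5 time steps and 8 features per state.
episode_states = np.arange(5 * 8, dtype=np.float32).reshape(5, 8)
stacked = stack_states(episode_states, n_stack=3)
print(stacked.shape)
```

Each stacked sample has shape (time steps, features), which is the kind of input a recurrent layer such as an LSTM consumes.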

In [4]:
exp_path = "tutorials/tf_tutorials/expert_demonstrations/Expert_LunarLanderContinuous.pkl"

use_expert_actions = True
discriminator_stack = 3
exp_memory = load_expert_memories(exp_path, load_action=use_expert_actions, n_stack=discriminator_stack)

In [5]:
net_architecture = rl_networks.ppo_net(use_custom_network=True,
                                        actor_custom_network=lstm_custom_model,
                                        critic_custom_network=lstm_custom_model
                                        )

In [6]:
agent = ppo_agent_continuous_parallel.Agent(actor_lr=1e-4,
                                          critic_lr=1e-4,
                                          batch_size=128,
                                          epsilon=0.9,
                                          epsilon_decay=0.97,
                                          epsilon_min=0.15,
                                          memory_size=1024,
                                          net_architecture=net_architecture,
                                          n_stack=discriminator_stack)

In [7]:
# Note: this rebinds the name rl_problem from the imported module to the Problem instance.
rl_problem = rl_problem.Problem(environment, agent)


In [8]:
def one_layer_custom_model(input_shape):
    # Discriminator network: two hidden dense layers and a sigmoid output.
    x_input = Input(shape=input_shape, name='disc_s_input')
    x = Dense(128, activation='relu')(x_input)
    # input_shape is only needed on the Input layer, so it is not repeated here.
    x = Dense(128, activation='relu')(x)
    x = Dense(1, activation='sigmoid')(x)
    model = Model(inputs=x_input, outputs=x)
    return model

In [9]:
irl_net_architecture = il_networks.irl_discriminator_net(use_custom_network=True,
                                                         common_custom_network=one_layer_custom_model,
                                                         define_custom_output_layer=True,
                                                         use_tf_custom_model=False)

In [10]:
irl_problem = DeepIRL(rl_problem, exp_memory, lr_disc=1e-5, batch_size_disc=128, epochs_disc=2, val_split_disc=0.1,
                      agent_collect_iter=10, agent_train_iter=25, n_stack_disc=discriminator_stack,
                      net_architecture=irl_net_architecture, use_expert_actions=use_expert_actions, tensorboard_dir="logs")


In [11]:
print("Training the agent with imitation learning")
irl_problem.solve(10, render=False, max_step_epi=None, render_after=1500, skip_states=1,
                  save_live_histogram='hist.json')

Training the agent with imitation learning
Test episode:  1 Epochs:  84  Reward: -99.2 Smooth Reward: -99.2  Epsilon: 0.9000
Test episode:  2 Epochs:  67  Reward: -120.5 Smooth Reward: -109.9  Epsilon: 0.9000
Test episode:  3 Epochs:  122  Reward: -81.9 Smooth Reward: -101.2  Epsilon: 0.9000
Test episode:  4 Epochs:  69  Reward: -163.0 Smooth Reward: -122.5  Epsilon: 0.9000
Test episode:  5 Epochs:  105  Reward: -155.6 Smooth Reward: -159.3  Epsilon: 0.9000
Test episode:  6 Epochs:  65  Reward: -106.6 Smooth Reward: -131.1  Epsilon: 0.9000
Test episode:  7 Epochs:  78  Reward: -93.9 Smooth Reward: -100.2  Epsilon: 0.9000
Test episode:  8 Epochs:  99  Reward: -121.6 Smooth Reward: -107.7  Epsilon: 0.9000
Test episode:  9 Epochs:  99  Reward: -114.0 Smooth Reward: -117.8  Epsilon: 0.9000
Test episode:  10 Epochs:  135  Reward: -181.1 Smooth Reward: -147.6  Epsilon: 0.9000
Training discriminator
epoch 1	 loss  0.2551 binary_accuracy 0.3445 val_loss  0.2525 val_binary_accuracy 0

Test episode:  8 Epochs:  86  Reward: -429.8 Smooth Reward: -309.5  Epsilon: 0.7054
Test episode:  9 Epochs:  75  Reward: -403.8 Smooth Reward: -416.8  Epsilon: 0.7054
Test episode:  10 Epochs:  86  Reward: -410.3 Smooth Reward: -407.0  Epsilon: 0.7054
Training discriminator
epoch 1	 loss  0.2360 binary_accuracy 0.4901 val_loss  0.2332 val_binary_accuracy 0.5090
epoch 2	 loss  0.2311 binary_accuracy 0.5111 val_loss  0.2282 val_binary_accuracy 0.5256
Episode:  65 Epochs:  1024  Reward: 441.9 Smooth Reward: 442.6  Epsilon: 0.7054
Episode:  66 Epochs:  1024  Reward: 445.0 Smooth Reward: 442.6  Epsilon: 0.7054
Episode:  67 Epochs:  1024  Reward: 447.4 Smooth Reward: 442.6  Epsilon: 0.7054
Episode:  68 Epochs:  1024  Reward: 455.3 Smooth Reward: 442.6  Epsilon: 0.7054
Episode:  69 Epochs:  1024  Reward: 453.6 Smooth Reward: 442.6  Epsilon: 0.7054
Episode:  70 Epochs:  1024  Reward: 433.2 Smooth Reward: 442.6  Epsilon: 0.7054
Episode:  71 Epochs:  1024  Reward: 421.7 Smooth Reward: 442.6  Ep

Actor loss -0.1258593 16
Critic loss 0.046334863 16
Episode:  137 Epochs:  1024  Reward: 319.7 Smooth Reward: 323.2  Epsilon: 0.5362
Episode:  138 Epochs:  1024  Reward: 324.1 Smooth Reward: 323.2  Epsilon: 0.5362
Episode:  139 Epochs:  1024  Reward: 316.3 Smooth Reward: 323.2  Epsilon: 0.5362
Episode:  140 Epochs:  1024  Reward: 329.1 Smooth Reward: 323.2  Epsilon: 0.5362
Episode:  141 Epochs:  1024  Reward: 323.3 Smooth Reward: 323.2  Epsilon: 0.5362
Episode:  142 Epochs:  1024  Reward: 321.9 Smooth Reward: 323.2  Epsilon: 0.5362
Episode:  143 Epochs:  1024  Reward: 327.4 Smooth Reward: 323.2  Epsilon: 0.5362
Episode:  144 Epochs:  1024  Reward: 324.7 Smooth Reward: 323.2  Epsilon: 0.5362
Actor loss 0.019834746 17
Critic loss 0.08491403 17
Episode:  145 Epochs:  1024  Reward: 311.2 Smooth Reward: 321.7  Epsilon: 0.5202
Episode:  146 Epochs:  1024  Reward: 328.8 Smooth Reward: 321.7  Epsilon: 0.5202
Episode:  147 Epochs:  1024  Reward: 309.5 Smooth Reward: 321.7  Epsilon: 0.5202
Episo

Actor loss -0.029988991 25
Critic loss 0.036090434 25
Episode:  209 Epochs:  1024  Reward: 266.6 Smooth Reward: 273.7  Epsilon: 0.4077
Episode:  210 Epochs:  1024  Reward: 300.6 Smooth Reward: 273.7  Epsilon: 0.4077
Episode:  211 Epochs:  1024  Reward: 288.8 Smooth Reward: 273.7  Epsilon: 0.4077
Episode:  212 Epochs:  1024  Reward: 261.2 Smooth Reward: 273.7  Epsilon: 0.4077
Episode:  213 Epochs:  1024  Reward: 280.3 Smooth Reward: 273.7  Epsilon: 0.4077
Episode:  214 Epochs:  1024  Reward: 266.1 Smooth Reward: 273.7  Epsilon: 0.4077
Episode:  215 Epochs:  1024  Reward: 264.4 Smooth Reward: 273.7  Epsilon: 0.4077
Episode:  216 Epochs:  1024  Reward: 274.1 Smooth Reward: 273.7  Epsilon: 0.4077
Actor loss -0.004431309 26
Critic loss 0.06750213 26
Episode:  217 Epochs:  1024  Reward: 260.9 Smooth Reward: 269.7  Epsilon: 0.3954
Episode:  218 Epochs:  1024  Reward: 258.5 Smooth Reward: 269.7  Epsilon: 0.3954
Episode:  219 Epochs:  1024  Reward: 258.1 Smooth Reward: 269.7  Epsilon: 0.3954
Ep

Actor loss 0.2850467 34
Critic loss 0.18974122 34
Episode:  281 Epochs:  1024  Reward: 169.3 Smooth Reward: 172.2  Epsilon: 0.3099
Episode:  282 Epochs:  1024  Reward: 174.8 Smooth Reward: 172.2  Epsilon: 0.3099
Episode:  283 Epochs:  1024  Reward: 169.4 Smooth Reward: 172.2  Epsilon: 0.3099
Episode:  284 Epochs:  1024  Reward: 167.3 Smooth Reward: 172.2  Epsilon: 0.3099
Episode:  285 Epochs:  1024  Reward: 163.7 Smooth Reward: 172.2  Epsilon: 0.3099
Episode:  286 Epochs:  1024  Reward: 180.7 Smooth Reward: 172.2  Epsilon: 0.3099
Episode:  287 Epochs:  1024  Reward: 164.2 Smooth Reward: 172.2  Epsilon: 0.3099
Episode:  288 Epochs:  1024  Reward: 168.3 Smooth Reward: 172.2  Epsilon: 0.3099
Actor loss 0.020746285 35
Critic loss 0.13113049 35
Test episode:  1 Epochs:  63  Reward: -473.7 Smooth Reward: -473.7  Epsilon: 0.3006
Test episode:  2 Epochs:  53  Reward: -457.4 Smooth Reward: -465.5  Epsilon: 0.3006
Test episode:  3 Epochs:  58  Reward: -534.2 Smooth Reward: -495.8  Epsilon: 0.300

In [None]:
rl_problem.test(10)

In [None]:
agent_saver.save(agent, 'agent_ppo.json')