In [1]:
import tensorflow as tf

# Enable memory growth so TensorFlow does not reserve all GPU memory up front.
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "No GPU devices available"
tf.config.experimental.set_memory_growth(physical_devices[0], True)

2021-12-13 15:49:07.609572: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-12-13 15:49:07.642815: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2021-12-13 15:49:07.643103: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce RTX 2070 computeCapability: 7.5
coreClock: 1.71GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 417.29GiB/s
2021-12-13 15:49:07.643228: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-12-13 15:49:07.644292: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-12-13 15:49:07.645295: I tensorflow/stream_executor/platform/de

# Training an agent through Inverse Reinforcement Learning

This tutorial shows how to use the library's Inverse Reinforcement Learning tools to train an RL agent via Imitation Learning.

This library provides three Imitation Learning algorithms:

## 1) Deep Inverse Reinforcement Learning (DeepIRL)


It is an implementation of the apprenticeship learning algorithm from "Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship learning via inverse reinforcement learning. ICML '04."

As an overview, this algorithm has two main entities with adversarial tasks: 1) an RL agent generates actions that aim to be very similar to the expert's actions. 2) a discriminator tries to distinguish which actions come from the RL agent and which come from the expert. The discriminator's output is used as the reward to train the RL agent.

This particular implementation uses Deep Learning. For this purpose we have replaced the classifier used for the discriminator in the original work with a neural network. This algorithm is compatible with all Deep Reinforcement Learning agents in this library.
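
As a rough illustration of the adversarial loop described above, the following sketch trains a tiny logistic-regression discriminator with NumPy to separate expert samples from agent samples, then uses its output as a reward. It is not the library's implementation: the synthetic data and the names `train_discriminator` and `discriminator_reward` are hypothetical and exist only for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_discriminator(expert_x, agent_x, lr=0.1, epochs=200):
    """Fit a logistic-regression discriminator (label 1 = expert, 0 = agent)."""
    x = np.vstack([expert_x, agent_x])
    y = np.concatenate([np.ones(len(expert_x)), np.zeros(len(agent_x))])
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(x @ w + b)
        grad = p - y  # gradient of the binary cross-entropy w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def discriminator_reward(samples, w, b):
    """Reward = estimated probability that a sample comes from the expert."""
    return sigmoid(samples @ w + b)


# Synthetic stand-ins for expert and agent state-action vectors (assumption).
rng = np.random.default_rng(0)
expert_samples = rng.normal(loc=1.0, scale=0.5, size=(200, 4))
agent_samples = rng.normal(loc=-1.0, scale=0.5, size=(200, 4))

w, b = train_discriminator(expert_samples, agent_samples)
expert_like_reward = discriminator_reward(expert_samples, w, b).mean()
agent_like_reward = discriminator_reward(agent_samples, w, b).mean()
```

In a full training loop the agent would be updated with these rewards and the discriminator retrained on fresh agent samples, alternating between the two.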

## 2) Generative Adversarial Imitation Learning (GAIL)

This is an implementation of "HO, Jonathan; ERMON, Stefano. Generative adversarial imitation learning. Advances in Neural Information Processing Systems, 2016, vol. 29, p. 4565-4573."

This algorithm is very similar to DeepIRL but uses the workflow of the Trust Region Policy Optimization (TRPO) algorithm (another RL algorithm) to make the process more efficient. It has two main entities: 1) a reinforcement learning agent that generates actions that aim to be very similar to the expert's actions. 2) a discriminator neural network that tries to distinguish which actions come from the RL agent and which come from the expert. The discriminator's output is used as the reward to train the RL agent.

This particular implementation uses Proximal Policy Optimization (PPO) instead of TRPO because PPO was created as a refined version of TRPO and both have the same workflow. This means that GAIL is only compatible with PPO; no other RL agent can be used with it.
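
For readers unfamiliar with PPO, its core is a clipped surrogate objective that limits how far one update can move the policy away from the old one, which is the role the trust region plays in TRPO. The NumPy sketch below computes that objective on toy numbers. The function name `ppo_clipped_objective` and the clip range of 0.2 are illustrative choices, not values taken from this library.

```python
import numpy as np


def ppo_clipped_objective(new_probs, old_probs, advantages, clip_range=0.2):
    """Mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over the batch."""
    ratio = new_probs / old_probs
    clipped_ratio = np.clip(ratio, 1.0 - clip_range, 1.0 + clip_range)
    return np.mean(np.minimum(ratio * advantages, clipped_ratio * advantages))


# Toy batch: probabilities of the taken actions under the old and new policy.
old_probs = np.array([0.5, 0.5])
new_probs = np.array([0.9, 0.1])
advantages = np.array([1.0, -1.0])

objective = ppo_clipped_objective(new_probs, old_probs, advantages)
```

In a GAIL-style setup the advantages would be estimated from the discriminator's rewards rather than from the environment's rewards.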

## 3) Behavioral Cloning

This algorithm consists of a supervised deep learning problem where a neural network is trained using a dataset of expert experiences that contains states paired with actions. The neural network is trained using the states as inputs and the actions as labels.

In this library we provide the tools to train RL agents through behavioral cloning. These tools also allow pretraining an RL agent on labeled data and then fine-tuning it with RL or IRL.
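
To make the supervised formulation concrete, here is a minimal sketch that fits a softmax classifier mapping states to discrete actions with plain NumPy. It stands in for the neural network and does not use the library's API: the toy expert rule and the `behavioral_cloning` function are assumptions made for this example.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def behavioral_cloning(states, actions, n_actions, lr=0.5, epochs=300):
    """Train a linear softmax policy with states as inputs and actions as labels."""
    n_samples, n_features = states.shape
    weights = np.zeros((n_features, n_actions))
    bias = np.zeros(n_actions)
    one_hot = np.eye(n_actions)[actions]
    for _ in range(epochs):
        probs = softmax(states @ weights + bias)
        grad = (probs - one_hot) / n_samples  # cross-entropy gradient
        weights -= lr * (states.T @ grad)
        bias -= lr * grad.sum(axis=0)
    return weights, bias


# Toy expert dataset (assumption): the expert picks action 1 when state[0] > 0.
rng = np.random.default_rng(0)
states = rng.normal(size=(500, 3))
expert_actions = (states[:, 0] > 0).astype(int)

weights, bias = behavioral_cloning(states, expert_actions, n_actions=2)
predicted = np.argmax(states @ weights + bias, axis=1)
accuracy = (predicted == expert_actions).mean()
```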

## Expert Data

All Imitation Learning methods need a dataset of expert demonstrations. This dataset should contain the experiences at each time step. Depending on the problem, these experiences may contain only the states or the states paired with actions. We also provide some utilities to store and load the expert datasets.
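
The on-disk format used by the library is not shown in this tutorial, so the sketch below only illustrates the general idea with an assumed layout: a pickled list of `(state, action)` tuples. The helpers `save_expert_dataset` and `load_expert_dataset` are hypothetical and are not the library's own utilities.

```python
import os
import pickle
import tempfile

import numpy as np


def save_expert_dataset(path, states, actions=None):
    """Store one (state, action) tuple per time step; action is None if absent."""
    experiences = [
        (state, None if actions is None else actions[i])
        for i, state in enumerate(states)
    ]
    with open(path, "wb") as f:
        pickle.dump(experiences, f)


def load_expert_dataset(path, load_action=True):
    """Return the states, and the actions too when load_action is True."""
    with open(path, "rb") as f:
        experiences = pickle.load(f)
    states = np.array([state for state, _ in experiences])
    if not load_action:
        return states
    actions = np.array([action for _, action in experiences])
    return states, actions


# Round trip on a toy dataset of four time steps.
states = np.arange(12, dtype=float).reshape(4, 3)
actions = np.array([0, 1, 1, 0])
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_expert.pkl")
    save_expert_dataset(path, states, actions)
    loaded_states, loaded_actions = load_expert_dataset(path)
    states_only = load_expert_dataset(path, load_action=False)
```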

In [30]:
from RL_Problem import rl_problem
from IL_Problem.gail import GAIL
from IL_Problem.deepirl import DeepIRL
from RL_Agent import ppo_agent_discrete_parallel, dpg_agent
from IL_Problem.base.utils.callbacks import load_expert_memories, Callbacks
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, LSTM, Input
from RL_Agent.base.utils import agent_saver
from RL_Agent.base.utils.networks import networks as rl_networks
from IL_Problem.base.utils.networks import networks_dictionaries as il_networks
import gym

We are going to use the LunarLander environment from OpenAI Gym. 

In [4]:
environment = "LunarLander-v2"
environment = gym.make(environment)

We provide an expert demonstrations dataset in "tutorials/tf_tutorials/expert_demonstrations/Expert_LunarLander.pkl". This dataset was created by running an already trained DPG agent on the environment.

Next, we provide the code we have used to generate the dataset with a DPG agent. If you already have a dataset, you do not need to run the next cell.

In [21]:
exp_path = "tutorials/tf_tutorials/expert_demonstrations/ExpertLunarLander.pkl"
net_architecture = rl_networks.net_architecture(dense_layers=2,
                                           n_neurons=[256, 256],
                                           dense_activation=['relu', 'relu'])

expert = dpg_agent.Agent(learning_rate=5e-4,
                         batch_size=32,
                         net_architecture=net_architecture)

expert_problem = rl_problem.Problem(environment, expert)

callback = Callbacks()

# Comment this out if you already have an experiences file such as "Expert_LunarLander.pkl"
print("Starting expert training")
expert_problem.solve(1000, render=False, max_step_epi=250, render_after=980, skip_states=3)
expert_problem.test(render=False, n_iter=400, callback=callback.remember_callback)

callback.save_memories(exp_path)

Starting expert training
Episode:  1 Epochs:  96  Reward: -427.1 Smooth Reward: -427.1  Epsilon: 0.0000
Episode:  2 Epochs:  82  Reward: -439.7 Smooth Reward: -433.4  Epsilon: 0.0000
Episode:  3 Epochs:  131  Reward: -82.1 Smooth Reward: -316.3  Epsilon: 0.0000
Episode:  4 Epochs:  70  Reward: -74.6 Smooth Reward: -255.9  Epsilon: 0.0000
Episode:  5 Epochs:  179  Reward: -201.7 Smooth Reward: -245.0  Epsilon: 0.0000
Episode:  6 Epochs:  80  Reward: -96.6 Smooth Reward: -220.3  Epsilon: 0.0000
Episode:  7 Epochs:  134  Reward: -118.0 Smooth Reward: -205.7  Epsilon: 0.0000
Episode:  8 Epochs:  60  Reward: -112.3 Smooth Reward: -194.0  Epsilon: 0.0000
Episode:  9 Epochs:  100  Reward: -146.5 Smooth Reward: -188.7  Epsilon: 0.0000
Episode:  10 Epochs:  83  Reward: -280.8 Smooth Reward: -197.9  Epsilon: 0.0000
Episode:  11 Epochs:  62  Reward: -112.3 Smooth Reward: -166.4  Epsilon: 0.0000
Episode:  12 Epochs:  86  Reward: -142.0 Smooth Reward: -136.7  Epsilon: 0.0000
Episode:  1

Episode:  102 Epochs:  94  Reward: -116.2 Smooth Reward: -160.0  Epsilon: 0.0000
Episode:  103 Epochs:  100  Reward: -43.6 Smooth Reward: -135.5  Epsilon: 0.0000
Episode:  104 Epochs:  152  Reward: -365.6 Smooth Reward: -156.3  Epsilon: 0.0000
Episode:  105 Epochs:  61  Reward: -80.1 Smooth Reward: -155.0  Epsilon: 0.0000
Episode:  106 Epochs:  98  Reward: -234.3 Smooth Reward: -157.1  Epsilon: 0.0000
Episode:  107 Epochs:  152  Reward: -29.3 Smooth Reward: -145.8  Epsilon: 0.0000
Episode:  108 Epochs:  96  Reward: -89.1 Smooth Reward: -137.0  Epsilon: 0.0000
Episode:  109 Epochs:  109  Reward: -199.6 Smooth Reward: -148.2  Epsilon: 0.0000
Episode:  110 Epochs:  89  Reward: -75.1 Smooth Reward: -139.5  Epsilon: 0.0000
Episode:  111 Epochs:  96  Reward: -112.0 Smooth Reward: -134.5  Epsilon: 0.0000
Episode:  112 Epochs:  101  Reward: -95.2 Smooth Reward: -132.4  Epsilon: 0.0000
Episode:  113 Epochs:  111  Reward: -235.4 Smooth Reward: -151.6  Epsilon: 0.0000
Episode:  114 Epochs:  78  R

Episode:  205 Epochs:  93  Reward: -39.6 Smooth Reward: -76.7  Epsilon: 0.0000
Episode:  206 Epochs:  89  Reward: -31.0 Smooth Reward: -66.9  Epsilon: 0.0000
Episode:  207 Epochs:  89  Reward: -110.8 Smooth Reward: -75.5  Epsilon: 0.0000
Episode:  208 Epochs:  158  Reward: -78.6 Smooth Reward: -77.7  Epsilon: 0.0000
Episode:  209 Epochs:  157  Reward: -195.2 Smooth Reward: -88.5  Epsilon: 0.0000
Episode:  210 Epochs:  152  Reward: -45.9 Smooth Reward: -102.5  Epsilon: 0.0000
Episode:  211 Epochs:  128  Reward: 6.1 Smooth Reward: -93.2  Epsilon: 0.0000
Episode:  212 Epochs:  150  Reward: -38.6 Smooth Reward: -97.7  Epsilon: 0.0000
Episode:  213 Epochs:  252  Reward: 103.7 Smooth Reward: -76.0  Epsilon: 0.0000
Episode:  214 Epochs:  252  Reward: -81.8 Smooth Reward: -51.2  Epsilon: 0.0000
Episode:  215 Epochs:  187  Reward: -150.6 Smooth Reward: -62.3  Epsilon: 0.0000
Episode:  216 Epochs:  108  Reward: -50.2 Smooth Reward: -64.2  Epsilon: 0.0000
Episode:  217 Epochs:  104  Reward: -67.2

Episode:  310 Epochs:  222  Reward: -81.4 Smooth Reward: 8.7  Epsilon: 0.0000
Episode:  311 Epochs:  230  Reward: 39.6 Smooth Reward: 14.4  Epsilon: 0.0000
Episode:  312 Epochs:  252  Reward: 72.9 Smooth Reward: 26.2  Epsilon: 0.0000
Episode:  313 Epochs:  252  Reward: 18.8 Smooth Reward: 26.5  Epsilon: 0.0000
Episode:  314 Epochs:  252  Reward: 89.1 Smooth Reward: 27.1  Epsilon: 0.0000
Episode:  315 Epochs:  252  Reward: 12.1 Smooth Reward: 21.5  Epsilon: 0.0000
Episode:  316 Epochs:  252  Reward: 36.5 Smooth Reward: 26.1  Epsilon: 0.0000
Episode:  317 Epochs:  252  Reward: 58.7 Smooth Reward: 24.1  Epsilon: 0.0000
Episode:  318 Epochs:  252  Reward: 51.6 Smooth Reward: 31.6  Epsilon: 0.0000
Episode:  319 Epochs:  169  Reward: -226.1 Smooth Reward: 7.2  Epsilon: 0.0000
Episode:  320 Epochs:  252  Reward: -9.9 Smooth Reward: 14.3  Epsilon: 0.0000
Episode:  321 Epochs:  115  Reward: -207.6 Smooth Reward: -10.4  Epsilon: 0.0000
Episode:  322 Epochs:  252  Reward: -0.8 Smooth Reward: -17.

Episode:  415 Epochs:  252  Reward: 178.7 Smooth Reward: 114.4  Epsilon: 0.0000
Episode:  416 Epochs:  252  Reward: 163.1 Smooth Reward: 118.5  Epsilon: 0.0000
Episode:  417 Epochs:  152  Reward: 11.5 Smooth Reward: 107.8  Epsilon: 0.0000
Episode:  418 Epochs:  252  Reward: 125.8 Smooth Reward: 114.4  Epsilon: 0.0000
Episode:  419 Epochs:  93  Reward: -19.4 Smooth Reward: 96.8  Epsilon: 0.0000
Episode:  420 Epochs:  252  Reward: 117.4 Smooth Reward: 94.4  Epsilon: 0.0000
Episode:  421 Epochs:  252  Reward: 134.6 Smooth Reward: 103.9  Epsilon: 0.0000
Episode:  422 Epochs:  241  Reward: -25.3 Smooth Reward: 86.1  Epsilon: 0.0000
Episode:  423 Epochs:  252  Reward: 135.2 Smooth Reward: 86.7  Epsilon: 0.0000
Episode:  424 Epochs:  252  Reward: 150.8 Smooth Reward: 97.3  Epsilon: 0.0000
Episode:  425 Epochs:  252  Reward: 76.7 Smooth Reward: 87.1  Epsilon: 0.0000
Episode:  426 Epochs:  252  Reward: 102.0 Smooth Reward: 80.9  Epsilon: 0.0000
Episode:  427 Epochs:  252  Reward: 43.0 Smooth Re

Episode:  521 Epochs:  252  Reward: 102.6 Smooth Reward: 58.3  Epsilon: 0.0000
Episode:  522 Epochs:  252  Reward: 53.6 Smooth Reward: 60.9  Epsilon: 0.0000
Episode:  523 Epochs:  252  Reward: -5.9 Smooth Reward: 53.8  Epsilon: 0.0000
Episode:  524 Epochs:  252  Reward: 44.8 Smooth Reward: 51.3  Epsilon: 0.0000
Episode:  525 Epochs:  252  Reward: 57.2 Smooth Reward: 50.7  Epsilon: 0.0000
Episode:  526 Epochs:  252  Reward: 25.8 Smooth Reward: 47.7  Epsilon: 0.0000
Episode:  527 Epochs:  252  Reward: 75.5 Smooth Reward: 55.3  Epsilon: 0.0000
Episode:  528 Epochs:  252  Reward: 111.9 Smooth Reward: 58.0  Epsilon: 0.0000
Episode:  529 Epochs:  252  Reward: 58.6 Smooth Reward: 55.0  Epsilon: 0.0000
Episode:  530 Epochs:  252  Reward: 122.8 Smooth Reward: 64.7  Epsilon: 0.0000
Episode:  531 Epochs:  252  Reward: 12.4 Smooth Reward: 55.7  Epsilon: 0.0000
Episode:  532 Epochs:  252  Reward: 93.0 Smooth Reward: 59.6  Epsilon: 0.0000
Episode:  533 Epochs:  252  Reward: 53.6 Smooth Reward: 65.6 

Episode:  626 Epochs:  252  Reward: 21.0 Smooth Reward: 54.8  Epsilon: 0.0000
Episode:  627 Epochs:  252  Reward: 102.4 Smooth Reward: 61.7  Epsilon: 0.0000
Episode:  628 Epochs:  252  Reward: 53.6 Smooth Reward: 61.6  Epsilon: 0.0000
Episode:  629 Epochs:  252  Reward: 63.5 Smooth Reward: 63.2  Epsilon: 0.0000
Episode:  630 Epochs:  252  Reward: 17.7 Smooth Reward: 57.5  Epsilon: 0.0000
Episode:  631 Epochs:  252  Reward: 19.5 Smooth Reward: 51.5  Epsilon: 0.0000
Episode:  632 Epochs:  252  Reward: 41.9 Smooth Reward: 47.4  Epsilon: 0.0000
Episode:  633 Epochs:  252  Reward: 40.7 Smooth Reward: 48.4  Epsilon: 0.0000
Episode:  634 Epochs:  252  Reward: 52.3 Smooth Reward: 45.8  Epsilon: 0.0000
Episode:  635 Epochs:  252  Reward: 137.6 Smooth Reward: 55.0  Epsilon: 0.0000
Episode:  636 Epochs:  252  Reward: 132.0 Smooth Reward: 66.1  Epsilon: 0.0000
Episode:  637 Epochs:  252  Reward: 71.8 Smooth Reward: 63.1  Epsilon: 0.0000
Episode:  638 Epochs:  252  Reward: 56.0 Smooth Reward: 63.3 

Episode:  732 Epochs:  252  Reward: 52.7 Smooth Reward: 68.0  Epsilon: 0.0000
Episode:  733 Epochs:  252  Reward: 75.9 Smooth Reward: 69.1  Epsilon: 0.0000
Episode:  734 Epochs:  252  Reward: 52.6 Smooth Reward: 69.1  Epsilon: 0.0000
Episode:  735 Epochs:  252  Reward: 30.4 Smooth Reward: 62.9  Epsilon: 0.0000
Episode:  736 Epochs:  252  Reward: 41.9 Smooth Reward: 57.3  Epsilon: 0.0000
Episode:  737 Epochs:  252  Reward: 12.8 Smooth Reward: 51.0  Epsilon: 0.0000
Episode:  738 Epochs:  252  Reward: 7.9 Smooth Reward: 52.0  Epsilon: 0.0000
Episode:  739 Epochs:  252  Reward: 64.6 Smooth Reward: 50.1  Epsilon: 0.0000
Episode:  740 Epochs:  252  Reward: 15.1 Smooth Reward: 41.6  Epsilon: 0.0000
Episode:  741 Epochs:  252  Reward: 19.1 Smooth Reward: 37.3  Epsilon: 0.0000
Episode:  742 Epochs:  252  Reward: 34.4 Smooth Reward: 35.5  Epsilon: 0.0000
Episode:  743 Epochs:  252  Reward: 64.5 Smooth Reward: 34.3  Epsilon: 0.0000
Episode:  744 Epochs:  252  Reward: 105.7 Smooth Reward: 39.6  Ep

Episode:  839 Epochs:  162  Reward: 17.8 Smooth Reward: -18.4  Epsilon: 0.0000
Episode:  840 Epochs:  165  Reward: 27.1 Smooth Reward: -11.5  Epsilon: 0.0000
Episode:  841 Epochs:  252  Reward: 62.3 Smooth Reward: 5.0  Epsilon: 0.0000
Episode:  842 Epochs:  241  Reward: 233.5 Smooth Reward: 31.6  Epsilon: 0.0000
Episode:  843 Epochs:  207  Reward: 17.4 Smooth Reward: 37.1  Epsilon: 0.0000
Episode:  844 Epochs:  165  Reward: 76.0 Smooth Reward: 41.1  Epsilon: 0.0000
Episode:  845 Epochs:  181  Reward: 29.2 Smooth Reward: 51.2  Epsilon: 0.0000
Episode:  846 Epochs:  169  Reward: 2.4 Smooth Reward: 52.7  Epsilon: 0.0000
Episode:  847 Epochs:  187  Reward: 14.4 Smooth Reward: 48.6  Epsilon: 0.0000
Episode:  848 Epochs:  183  Reward: 27.7 Smooth Reward: 50.8  Epsilon: 0.0000
Episode:  849 Epochs:  176  Reward: -1.1 Smooth Reward: 48.9  Epsilon: 0.0000
Episode:  850 Epochs:  164  Reward: 25.8 Smooth Reward: 48.8  Epsilon: 0.0000
Episode:  851 Epochs:  153  Reward: -14.5 Smooth Reward: 41.1  

Episode:  945 Epochs:  179  Reward: 32.4 Smooth Reward: -6.0  Epsilon: 0.0000
Episode:  946 Epochs:  252  Reward: 124.6 Smooth Reward: 11.6  Epsilon: 0.0000
Episode:  947 Epochs:  181  Reward: 33.8 Smooth Reward: 12.1  Epsilon: 0.0000
Episode:  948 Epochs:  183  Reward: 27.6 Smooth Reward: 18.9  Epsilon: 0.0000
Episode:  949 Epochs:  197  Reward: 27.7 Smooth Reward: 30.2  Epsilon: 0.0000
Episode:  950 Epochs:  196  Reward: 3.8 Smooth Reward: 29.9  Epsilon: 0.0000
Episode:  951 Epochs:  195  Reward: 50.1 Smooth Reward: 35.4  Epsilon: 0.0000
Episode:  952 Epochs:  88  Reward: -20.1 Smooth Reward: 33.1  Epsilon: 0.0000
Episode:  953 Epochs:  174  Reward: 42.2 Smooth Reward: 36.7  Epsilon: 0.0000
Episode:  954 Epochs:  175  Reward: 30.9 Smooth Reward: 35.3  Epsilon: 0.0000
Episode:  955 Epochs:  176  Reward: -20.9 Smooth Reward: 30.0  Epsilon: 0.0000
Episode:  956 Epochs:  173  Reward: 44.7 Smooth Reward: 22.0  Epsilon: 0.0000
Episode:  957 Epochs:  192  Reward: 45.3 Smooth Reward: 23.1  E

Test episode:  47 Epochs:  436  Reward: 170.3 Smooth Reward: 132.3  Epsilon: 0.0000
Test episode:  48 Epochs:  1000  Reward: -6.5 Smooth Reward: 116.9  Epsilon: 0.0000
Test episode:  49 Epochs:  656  Reward: -177.1 Smooth Reward: 104.4  Epsilon: 0.0000
Test episode:  50 Epochs:  386  Reward: 201.1 Smooth Reward: 104.8  Epsilon: 0.0000
Test episode:  51 Epochs:  382  Reward: 241.1 Smooth Reward: 103.9  Epsilon: 0.0000
Test episode:  52 Epochs:  1000  Reward: 14.0 Smooth Reward: 80.8  Epsilon: 0.0000
Test episode:  53 Epochs:  536  Reward: 108.1 Smooth Reward: 69.3  Epsilon: 0.0000
Test episode:  54 Epochs:  396  Reward: 187.5 Smooth Reward: 75.5  Epsilon: 0.0000
Test episode:  55 Epochs:  346  Reward: -5.8 Smooth Reward: 71.5  Epsilon: 0.0000
Test episode:  56 Epochs:  432  Reward: 202.7 Smooth Reward: 93.5  Epsilon: 0.0000
Test episode:  57 Epochs:  414  Reward: 177.1 Smooth Reward: 94.2  Epsilon: 0.0000
Test episode:  58 Epochs:  340  Reward: -52.1 Smooth Reward: 89.7  Epsilon: 0.0000

Test episode:  145 Epochs:  287  Reward: 3.0 Smooth Reward: 94.8  Epsilon: 0.0000
Test episode:  146 Epochs:  472  Reward: 114.8 Smooth Reward: 92.1  Epsilon: 0.0000
Test episode:  147 Epochs:  507  Reward: 179.7 Smooth Reward: 114.3  Epsilon: 0.0000
Test episode:  148 Epochs:  339  Reward: -33.0 Smooth Reward: 94.4  Epsilon: 0.0000
Test episode:  149 Epochs:  426  Reward: 195.4 Smooth Reward: 93.5  Epsilon: 0.0000
Test episode:  150 Epochs:  574  Reward: -144.6 Smooth Reward: 60.1  Epsilon: 0.0000
Test episode:  151 Epochs:  293  Reward: -9.1 Smooth Reward: 36.5  Epsilon: 0.0000
Test episode:  152 Epochs:  438  Reward: 177.7 Smooth Reward: 58.8  Epsilon: 0.0000
Test episode:  153 Epochs:  546  Reward: 110.4 Smooth Reward: 45.9  Epsilon: 0.0000
Test episode:  154 Epochs:  1000  Reward: 7.9 Smooth Reward: 60.2  Epsilon: 0.0000
Test episode:  155 Epochs:  428  Reward: 192.2 Smooth Reward: 79.1  Epsilon: 0.0000
Test episode:  156 Epochs:  377  Reward: 206.5 Smooth Reward: 88.3  Epsilon: 0

Test episode:  243 Epochs:  562  Reward: 148.7 Smooth Reward: 105.4  Epsilon: 0.0000
Test episode:  244 Epochs:  466  Reward: 161.8 Smooth Reward: 100.5  Epsilon: 0.0000
Test episode:  245 Epochs:  333  Reward: -113.8 Smooth Reward: 93.0  Epsilon: 0.0000
Test episode:  246 Epochs:  529  Reward: 158.6 Smooth Reward: 87.3  Epsilon: 0.0000
Test episode:  247 Epochs:  464  Reward: 217.1 Smooth Reward: 93.5  Epsilon: 0.0000
Test episode:  248 Epochs:  495  Reward: 205.7 Smooth Reward: 105.3  Epsilon: 0.0000
Test episode:  249 Epochs:  601  Reward: -146.7 Smooth Reward: 109.0  Epsilon: 0.0000
Test episode:  250 Epochs:  465  Reward: 171.8 Smooth Reward: 109.8  Epsilon: 0.0000
Test episode:  251 Epochs:  519  Reward: -130.4 Smooth Reward: 84.0  Epsilon: 0.0000
Test episode:  252 Epochs:  565  Reward: 127.3 Smooth Reward: 80.0  Epsilon: 0.0000
Test episode:  253 Epochs:  606  Reward: -99.8 Smooth Reward: 55.2  Epsilon: 0.0000
Test episode:  254 Epochs:  428  Reward: 155.5 Smooth Reward: 54.5  

Test episode:  341 Epochs:  452  Reward: 199.9 Smooth Reward: 82.5  Epsilon: 0.0000
Test episode:  342 Epochs:  520  Reward: -92.9 Smooth Reward: 73.5  Epsilon: 0.0000
Test episode:  343 Epochs:  424  Reward: -38.9 Smooth Reward: 72.4  Epsilon: 0.0000
Test episode:  344 Epochs:  353  Reward: 205.8 Smooth Reward: 79.9  Epsilon: 0.0000
Test episode:  345 Epochs:  426  Reward: 228.2 Smooth Reward: 90.1  Epsilon: 0.0000
Test episode:  346 Epochs:  533  Reward: -122.6 Smooth Reward: 76.7  Epsilon: 0.0000
Test episode:  347 Epochs:  441  Reward: 191.8 Smooth Reward: 102.3  Epsilon: 0.0000
Test episode:  348 Epochs:  271  Reward: -1.6 Smooth Reward: 87.5  Epsilon: 0.0000
Test episode:  349 Epochs:  407  Reward: 180.0 Smooth Reward: 89.9  Epsilon: 0.0000
Test episode:  350 Epochs:  314  Reward: 12.1 Smooth Reward: 76.2  Epsilon: 0.0000
Test episode:  351 Epochs:  461  Reward: 228.9 Smooth Reward: 79.1  Epsilon: 0.0000
Test episode:  352 Epochs:  541  Reward: -118.6 Smooth Reward: 76.5  Epsilon

Define the agent neural network.

In [22]:
def lstm_custom_model(input_shape):
    # LSTM over the stacked states, followed by two fully connected layers.
    actor_model = Sequential()
    actor_model.add(LSTM(16, input_shape=input_shape, activation='tanh'))
    actor_model.add(Dense(256, activation='relu'))
    actor_model.add(Dense(256, activation='relu'))
    return actor_model


Load the expert experiences.

In "IL_Problem.base.utils.callbacks.py" we have some utilities for storing and loading expert experiences. Specifically, we use the function "load_expert_memories", which receives three parameters: 1) "path", a string with the path to the data. 2) "load_action", a boolean that selects whether the actions are loaded. IRL can train the discriminator to distinguish only the states reached by the expert from the states reached by the agent, or to distinguish the state-action pairs of the expert from those of the agent. 3) "n_stack", which defines how many time steps are stacked to form the state fed to the discriminator. Stacked states can be used for the agent but not for the discriminator, or for both.
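
To illustrate what stacking time steps means, the sketch below builds sliding windows of `n_stack` consecutive states with NumPy. The helper `stack_states` is hypothetical. How the library pads the first steps or handles episode boundaries is not covered here and may differ.

```python
import numpy as np


def stack_states(states, n_stack):
    """Return sliding windows with shape (T - n_stack + 1, n_stack, state_dim)."""
    states = np.asarray(states)
    n_windows = len(states) - n_stack + 1
    return np.stack([states[i:i + n_stack] for i in range(n_windows)])


# Five time steps of a 2-dimensional state, stacked three at a time.
episode_states = np.arange(10, dtype=float).reshape(5, 2)
stacked = stack_states(episode_states, n_stack=3)
```

Each stacked sample has shape `(n_stack, state_dim)`, which is the kind of input a recurrent layer such as an LSTM expects.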

In [24]:
exp_path = "tutorials/tf_tutorials/expert_demonstrations/ExpertLunarLander.pkl"

use_expert_actions = True
discriminator_stack = 3
exp_memory = load_expert_memories(exp_path, load_action=use_expert_actions, n_stack=discriminator_stack)

In [25]:
net_architecture = rl_networks.ppo_net(use_custom_network=True,
                                        actor_custom_network=lstm_custom_model,
                                        critic_custom_network=lstm_custom_model
                                        )

In [32]:
agent = ppo_agent_discrete_parallel.Agent(actor_lr=1e-4,
                                  critic_lr=1e-4,
                                  batch_size=128,
                                  epsilon=0.9,
                                  epsilon_decay=0.97,
                                  epsilon_min=0.15,
                                  memory_size=1024,
                                  net_architecture=net_architecture,
                                  n_stack=discriminator_stack)

In [33]:
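# Note: this rebinds the name rl_problem from the imported module to the Problem instance,
# so re-running this cell will fail unless the module is imported again first.
rl_problem = rl_problem.Problem(environment, agent)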


In [34]:
def one_layer_custom_model(input_shape):
    # Discriminator network: two hidden layers and a sigmoid output in [0, 1].
    x_input = Input(shape=input_shape, name='disc_s_input')
    x = Dense(128, activation='relu')(x_input)
    x = Dense(128, activation='relu')(x)
    x = Dense(1, activation='sigmoid')(x)
    model = Model(inputs=x_input, outputs=x)
    return model

In [35]:
irl_net_architecture = il_networks.irl_discriminator_net(use_custom_network=True,
                                                         common_custom_network=one_layer_custom_model,
                                                         define_custom_output_layer=True,
                                                         use_tf_custom_model=False)

In [36]:
irl_problem = DeepIRL(rl_problem, exp_memory, lr_disc=1e-5, batch_size_disc=128, epochs_disc=2, val_split_disc=0.1,
                      agent_collect_iter=10, agent_train_iter=25, n_stack_disc=discriminator_stack,
                      net_architecture=irl_net_architecture, use_expert_actions=use_expert_actions, tensorboard_dir="logs")



In [37]:
print("Training the agent with imitation learning")
irl_problem.solve(10, render=False, max_step_epi=None, render_after=1500, skip_states=1,
                  save_live_histogram='hist.json')

Training the agent with imitation learning


2021-12-13 16:34:49.068521: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7


Test episode:  1 Epochs:  62  Reward: -99.8 Smooth Reward: -99.8  Epsilon: 0.9000
Test episode:  2 Epochs:  52  Reward: -262.3 Smooth Reward: -181.0  Epsilon: 0.9000
Test episode:  3 Epochs:  53  Reward: -248.1 Smooth Reward: -255.2  Epsilon: 0.9000
Test episode:  4 Epochs:  56  Reward: -295.9 Smooth Reward: -272.0  Epsilon: 0.9000
Test episode:  5 Epochs:  82  Reward: -494.4 Smooth Reward: -395.2  Epsilon: 0.9000
Test episode:  6 Epochs:  59  Reward: -127.6 Smooth Reward: -311.0  Epsilon: 0.9000
Test episode:  7 Epochs:  57  Reward: -308.8 Smooth Reward: -218.2  Epsilon: 0.9000
Test episode:  8 Epochs:  54  Reward: -148.0 Smooth Reward: -228.4  Epsilon: 0.9000
Test episode:  9 Epochs:  80  Reward: -476.2 Smooth Reward: -312.1  Epsilon: 0.9000
Test episode:  10 Epochs:  66  Reward: -393.7 Smooth Reward: -434.9  Epsilon: 0.9000
Training discriminator
epoch 1	 loss  0.2473 binary_accuracy 0.4618 val_loss  0.2470 val_binary_accuracy 0.5459
epoch 2	 loss  0.2445 binary_accuracy 0.5351 val_

Test episode:  5 Epochs:  58  Reward: -416.1 Smooth Reward: -678.7  Epsilon: 0.7497
Test episode:  6 Epochs:  58  Reward: -451.5 Smooth Reward: -433.8  Epsilon: 0.7497
Test episode:  7 Epochs:  69  Reward: -512.2 Smooth Reward: -481.9  Epsilon: 0.7497
Test episode:  8 Epochs:  63  Reward: -599.8 Smooth Reward: -556.0  Epsilon: 0.7497
Test episode:  9 Epochs:  60  Reward: -539.6 Smooth Reward: -569.7  Epsilon: 0.7497
Test episode:  10 Epochs:  88  Reward: -815.1 Smooth Reward: -677.4  Epsilon: 0.7497
Training discriminator
epoch 1	 loss  0.2328 binary_accuracy 0.5463 val_loss  0.2324 val_binary_accuracy 0.5649
epoch 2	 loss  0.2277 binary_accuracy 0.5633 val_loss  0.2279 val_binary_accuracy 0.5758
Episode:  73 Epochs:  1024  Reward: 446.6 Smooth Reward: 445.9  Epsilon: 0.7497
Episode:  74 Epochs:  1024  Reward: 435.1 Smooth Reward: 445.9  Epsilon: 0.7497
Episode:  75 Epochs:  1024  Reward: 444.9 Smooth Reward: 445.9  Epsilon: 0.7497
Episode:  76 Epochs:  1024  Reward: 444.0 Smooth Rewar

Test episode:  5 Epochs:  255  Reward: -360.0 Smooth Reward: -484.2  Epsilon: 0.6245
Test episode:  6 Epochs:  273  Reward: -256.0 Smooth Reward: -308.0  Epsilon: 0.6245
Test episode:  7 Epochs:  212  Reward: -251.5 Smooth Reward: -253.7  Epsilon: 0.6245
Test episode:  8 Epochs:  213  Reward: -420.5 Smooth Reward: -336.0  Epsilon: 0.6245
Test episode:  9 Epochs:  306  Reward: -245.7 Smooth Reward: -333.1  Epsilon: 0.6245
Test episode:  10 Epochs:  246  Reward: -263.9 Smooth Reward: -254.8  Epsilon: 0.6245
Training discriminator
epoch 1	 loss  0.2183 binary_accuracy 0.5926 val_loss  0.2157 val_binary_accuracy 0.6053
epoch 2	 loss  0.2107 binary_accuracy 0.6036 val_loss  0.2092 val_binary_accuracy 0.6133
Episode:  145 Epochs:  1024  Reward: 414.4 Smooth Reward: 402.2  Epsilon: 0.6245
Episode:  146 Epochs:  1024  Reward: 409.5 Smooth Reward: 402.2  Epsilon: 0.6245
Episode:  147 Epochs:  1024  Reward: 402.5 Smooth Reward: 402.2  Epsilon: 0.6245
Episode:  148 Epochs:  1024  Reward: 413.8 Sm

Test episode:  5 Epochs:  94  Reward: -181.4 Smooth Reward: -205.4  Epsilon: 0.5202
Test episode:  6 Epochs:  110  Reward: -294.3 Smooth Reward: -237.8  Epsilon: 0.5202
Test episode:  7 Epochs:  142  Reward: -269.8 Smooth Reward: -282.0  Epsilon: 0.5202
Test episode:  8 Epochs:  103  Reward: -266.3 Smooth Reward: -268.1  Epsilon: 0.5202
Test episode:  9 Epochs:  79  Reward: -235.1 Smooth Reward: -250.7  Epsilon: 0.5202
Test episode:  10 Epochs:  164  Reward: -285.0 Smooth Reward: -260.0  Epsilon: 0.5202
Training discriminator
epoch 1	 loss  0.1936 binary_accuracy 0.6241 val_loss  0.1929 val_binary_accuracy 0.6307
epoch 2	 loss  0.1865 binary_accuracy 0.6295 val_loss  0.1862 val_binary_accuracy 0.6351
Episode:  217 Epochs:  1024  Reward: 329.3 Smooth Reward: 341.3  Epsilon: 0.5202
Episode:  218 Epochs:  1024  Reward: 357.4 Smooth Reward: 341.3  Epsilon: 0.5202
Episode:  219 Epochs:  1024  Reward: 335.5 Smooth Reward: 341.3  Epsilon: 0.5202
Episode:  220 Epochs:  1024  Reward: 327.6 Smoo

Test episode:  5 Epochs:  144  Reward: -267.9 Smooth Reward: -247.7  Epsilon: 0.4333
Test episode:  6 Epochs:  121  Reward: -236.2 Smooth Reward: -252.1  Epsilon: 0.4333
Test episode:  7 Epochs:  123  Reward: -221.5 Smooth Reward: -228.9  Epsilon: 0.4333
Test episode:  8 Epochs:  184  Reward: -234.9 Smooth Reward: -228.2  Epsilon: 0.4333
Test episode:  9 Epochs:  121  Reward: -237.6 Smooth Reward: -236.3  Epsilon: 0.4333
Test episode:  10 Epochs:  132  Reward: -234.3 Smooth Reward: -236.0  Epsilon: 0.4333
Training discriminator
epoch 1	 loss  0.1687 binary_accuracy 0.6465 val_loss  0.1671 val_binary_accuracy 0.6522
epoch 2	 loss  0.1610 binary_accuracy 0.6533 val_loss  0.1595 val_binary_accuracy 0.6587
Episode:  289 Epochs:  1024  Reward: 320.6 Smooth Reward: 338.6  Epsilon: 0.4333
Episode:  290 Epochs:  1024  Reward: 334.2 Smooth Reward: 338.6  Epsilon: 0.4333
Episode:  291 Epochs:  1024  Reward: 347.6 Smooth Reward: 338.6  Epsilon: 0.4333
Episode:  292 Epochs:  1024  Reward: 303.9 Sm

In [None]:
rl_problem.test(10)

In [None]:
agent_saver.save(agent, 'agent_ppo.json')