In [None]:
import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
# Allocate GPU memory on demand instead of reserving it all at start-up.
tf.config.experimental.set_memory_growth(physical_devices[0], True)

# Behavioral Cloning for Pretraining Agents

By the end of this tutorial you will know how to pretrain the neural network of a selected agent through behavioral cloning, which gives you a starting point for fine-tuning the agent via RL.

In [None]:
import gym
from RL_Agent import ddpg_agent, dqn_agent, dpg_agent, a2c_agent_discrete_queue, ppo_agent_discrete, \
    ppo_agent_discrete_parallel, dpg_agent_continuous, a2c_agent_continuous_queue, ppo_agent_continuous,\
    ppo_agent_continuous_parallel, a2c_agent_continuous
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Dense, LSTM, Input
from RL_Agent.base.utils.networks import networks
from IL_Problem.base.utils.callbacks import load_expert_memories, Callbacks
from RL_Problem import rl_problem
from IL_Problem.bclone import BehaviorCloning
from tensorflow.keras.optimizers import Adam
import tensorflow as tf
import numpy as np


## Collecting the Expert Experiences (only if needed)

We provide an expert demonstrations dataset in "tutorials/tf_tutorials/expert_demonstrations/ExpertLunarLander.pkl". This dataset was created by running an already trained DPG agent in the environment.

Below is the code we used to generate the dataset with a DPG agent. If you already have a dataset, you do not need to run the next cell. The code instantiates an RL problem to train an agent and passes a callback that records the experiences during the test function. These callbacks are provided in "IL_Problem.base.utils.callbacks.py".

In [None]:
environment = "LunarLander-v2"
environment = gym.make(environment)
exp_path = "tutorials/tf_tutorials/expert_demonstrations/ExpertLunarLander.pkl"
net_architecture = networks.dpg_net(dense_layers=2,
                                    n_neurons=[256, 256],
                                    dense_activation=['relu', 'relu'])

expert = dpg_agent.Agent(learning_rate=5e-4,
                         batch_size=32,
                         net_architecture=net_architecture)

expert_problem = rl_problem.Problem(environment, expert)

callback = Callbacks()

expert_problem.solve(1000, render=False, max_step_epi=250, render_after=980, skip_states=3)
expert_problem.test(render=False, n_iter=400, callback=callback.remember_callback)

callback.save_memories(exp_path)
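
The recording step can be pictured with the sketch below. It is not the library's "Callbacks" class: `ExperienceRecorder`, the `remember` signature, and the pickle layout are all hypothetical, chosen only to illustrate the idea of storing one (observation, action) pair per step and saving the result to disk.

```python
import os
import pickle
import tempfile


class ExperienceRecorder:
    """Hypothetical stand-in for a recording callback (not the library's class)."""

    def __init__(self):
        self.memory = []

    def remember(self, obs, action):
        # Store one (observation, action) pair per environment step.
        self.memory.append((list(obs), action))

    def save(self, path):
        # Persist the whole memory as a single pickle file.
        with open(path, "wb") as f:
            pickle.dump(self.memory, f)


def load_memories(path):
    # Read back the list of (observation, action) pairs.
    with open(path, "rb") as f:
        return pickle.load(f)


recorder = ExperienceRecorder()
# Fake three steps with 8-dimensional observations, the size LunarLander-v2 uses.
for step in range(3):
    recorder.remember([0.1 * step] * 8, action=step % 4)

demo_path = os.path.join(tempfile.mkdtemp(), "demo_expert.pkl")
recorder.save(demo_path)
restored = load_memories(demo_path)
```

After the round trip, `restored` holds the same three pairs that were recorded.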

## Loading the Expert Experiences

In "IL_Problem.base.utils.callbacks.py" we have some utilities for storing and loading expert experiences. Specifically, we use the function "load_expert_memories", which receives three parameters:

* path: string with the path to the data.
* load_action: boolean that selects whether the actions are loaded. When performing IRL, the discriminator can be trained to distinguish only the states reached by an expert from the states reached by an agent, or to distinguish the state-action pairs of the expert from those of the agent. Behavioral cloning uses the expert actions as training labels, so we load them here.
* n_stack: number of time steps stacked to form the state when using the discriminator. Stacked states can be used in the agent and not in the discriminator, or in both.

In [None]:
exp_path = "tutorials/tf_tutorials/expert_demonstrations/ExpertLunarLander.pkl"
use_action = True
n_stack = 5
exp_memory = load_expert_memories(exp_path, load_action=use_action, n_stack=n_stack)
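
To make the role of "n_stack" concrete, here is a minimal sketch of how consecutive observations could be stacked into states. The helper `stack_observations` is hypothetical, and padding the first steps by repeating the first observation is an assumption; the library may handle the start of an episode differently.

```python
from collections import deque


def stack_observations(observations, n_stack):
    """Return one stacked state per time step, each holding n_stack observations."""
    # Pre-fill the window with the first observation so early steps are full-sized.
    window = deque([observations[0]] * n_stack, maxlen=n_stack)
    stacked = []
    for obs in observations:
        window.append(obs)  # the oldest observation drops out automatically
        stacked.append(list(window))
    return stacked


# Toy episode: four 2-dimensional observations.
episode = [[0, 0], [1, 1], [2, 2], [3, 3]]
stacked_states = stack_observations(episode, n_stack=3)
```

Each stacked state has shape (n_stack, observation size), which is the kind of input a recurrent layer such as an LSTM expects.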

## Defining the Agent's Neural Network Architecture

We define only one network architecture because the actor and critic networks will share the same architecture.

In [None]:
def lstm_custom_model(input_shape):
    # Shared body for actor and critic; input_shape is expected to be (n_stack, observation_size).
    model = Sequential()
    # The LSTM summarizes the stacked time steps into a single 16-dimensional vector.
    model.add(LSTM(16, input_shape=input_shape, activation='tanh'))
    # Only the first layer needs input_shape; Keras infers it for the rest.
    model.add(Dense(128, activation='relu'))
    model.add(Dense(128, activation='relu'))
    model.add(Dense(128, activation='relu'))
    return model
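
As a rough sanity check on the size of this network, the trainable parameters can be counted by hand. The sketch below uses the standard Keras formulas for LSTM and Dense layers with biases; the helper names are hypothetical, and the count excludes any output layer the library may add on top of the custom network.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias vector.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # One weight per input-output pair plus one bias per output unit.
    return input_dim * units + units


obs_dim = 8  # LunarLander-v2 observation size
total = (lstm_params(obs_dim, 16)
         + dense_params(16, 128)
         + dense_params(128, 128)
         + dense_params(128, 128))
print(total)  # 36800
```

The count does not depend on "n_stack": an LSTM reuses the same weights at every time step, so stacking more observations lengthens the sequence without adding parameters.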

In [None]:
net_architecture = networks.actor_critic_net_architecture(
    use_custom_network=True,
    actor_custom_network=lstm_custom_model,
    critic_custom_network=lstm_custom_model,
)

## Define the RL Agent

Here we define the RL agent, using the following parameters:

* actor_lr: learning rate for training the actor neural network.
* critic_lr: learning rate for training the critic neural network.
* batch_size: size of the batches used for training the neural network.
* epsilon: determines the amount of exploration (float in [0, 1]). 0 -> full exploitation; 1 -> full exploration.
* epsilon_decay: decay factor of epsilon. In each iteration the new epsilon value is calculated as: epsilon' = epsilon * epsilon_decay.
* epsilon_min: minimum value epsilon can reach during the training procedure.
* n_step_return: number of steps used to compute the n-step return.
* net_architecture: network architecture defined above.
* n_stack: number of stacked time steps that form the state.
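
The interaction between epsilon, epsilon_decay, and epsilon_min can be sketched as follows. This is a minimal illustration of the update rule stated above combined with a lower bound; `decay_epsilon` is a hypothetical helper, and the library's exact schedule may differ.

```python
def decay_epsilon(epsilon, epsilon_decay, epsilon_min):
    # Multiplicative decay, clipped so exploration never drops below the floor.
    return max(epsilon_min, epsilon * epsilon_decay)


epsilon = 0.3
steps_to_floor = 0
while epsilon > 0.15:
    epsilon = decay_epsilon(epsilon, epsilon_decay=0.9999, epsilon_min=0.15)
    steps_to_floor += 1
```

With a start of 0.3, a decay of 0.9999, and a floor of 0.15, epsilon halves and reaches the floor after a little under 7,000 iterations, so exploration shrinks slowly.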

In [None]:
agent = a2c_agent_discrete_queue.Agent(actor_lr=1e-5,
                                       critic_lr=1e-5,
                                       batch_size=32,
                                       epsilon=0.3,
                                       epsilon_decay=0.9999,
                                       epsilon_min=0.15,
                                       n_step_return=15,
                                       net_architecture=net_architecture,
                                       n_stack=n_stack)
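
For readers unfamiliar with n-step returns, the sketch below shows the usual textbook computation: the rewards of the next n steps are discounted and summed, and the critic's estimate of the state reached afterwards is added as a bootstrap. The function name, the discount factor `gamma`, and the toy numbers are assumptions for illustration; the source does not show how the library computes it.

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Discounted sum of the given rewards plus the discounted bootstrap value."""
    ret = bootstrap_value
    # Work backwards so each reward is discounted once per remaining step.
    for reward in reversed(rewards):
        ret = reward + gamma * ret
    return ret


# Three rewards of 1.0, then a critic estimate of 4.0 for the state that follows.
value = n_step_return([1.0, 1.0, 1.0], bootstrap_value=4.0, gamma=0.5)
print(value)  # 2.25
```

Here 1 + 0.5 + 0.25 + 0.125 * 4 = 2.25. A larger n (such as the 15 used above) relies more on observed rewards and less on the critic's estimate.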

## Build the Behavioral Cloning Algorithm

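The next cell defines the behavioral cloning entity, which requires the following parameters:
- agent: RL agent defined above.
- state_size: input dimensions of the networks.
- n_actions: number of actions. Should be the same as the output dimension of the network.
- n_stack: number of stacked time steps.

The cell reads the observation and action spaces from the "environment" variable, so the environment must already exist at this point, even if you skipped the optional data collection cell.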

In [None]:
bc = BehaviorCloning(agent, state_size=(n_stack, environment.observation_space.shape[0]), n_actions=environment.action_space.n,
                    n_stack=n_stack)

In [None]:
# Alternative for agents with continuous actions:
# bc = BehaviorCloning(agent, state_size=(n_stack, environment.observation_space.shape[0]), n_actions=environment.action_space.shape[0],
#                     n_stack=n_stack, action_bounds=[-1., 1.])

Prepare the training data.

In [None]:
states = np.array([x[0] for x in exp_memory])
actions = np.array([x[1] for x in exp_memory])
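
The indexing above implies that each memory entry is a (state, action) pair. The toy example below mirrors that split in plain Python; the memory contents are made up, and the layout is inferred from the indices x[0] and x[1] rather than confirmed by the library.

```python
# Hypothetical memory: each entry is (stacked_state, action).
n_stack, obs_dim = 5, 8
toy_memory = [([[0.0] * obs_dim for _ in range(n_stack)], a) for a in (0, 2, 3)]

# Split the pairs into two parallel sequences, as the list comprehensions above do.
toy_states, toy_actions = zip(*toy_memory)

# Shape of the state batch: (n_samples, n_stack, obs_dim).
state_shape = (len(toy_states), len(toy_states[0]), len(toy_states[0][0]))
```

Checking the state shape before training is a cheap way to confirm that the "n_stack" used when loading the data matches the "state_size" passed to BehaviorCloning.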

## Training Through Behavioral Cloning

Let's train the agent's neural network. If you have worked with TensorFlow before, you may notice that the parameters are very similar to those of the "fit" function in Keras. This is because we are running supervised training.

* expert_traj_s: states are the input data for the training process.
* expert_traj_a: actions are the labels for the training process.
* epochs: number of training epochs.
* batch_size: size of the batches used for training.
* shuffle: whether to shuffle the examples in expert_traj_s and expert_traj_a.
* optimizer: Keras optimizer to be used in the training procedure.
* loss: loss function for the training procedure.
* metrics: metrics for the training procedure.
* verbose: sets the verbosity of the function: 0 -> no verbosity, 1 -> batch-level verbosity, 2 -> epoch-level verbosity.
* validation_split: fraction of the expert data held out for validation.
* one_hot_encode_actions: if True, expert_traj_a will be transformed into a one-hot encoding. If False, expert_traj_a will not be altered. Useful for discrete actions.
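
For discrete actions stored as integer indices, one-hot encoding works as in the sketch below. The helper `one_hot` is hypothetical and only illustrates the transformation; it is not the library's implementation.

```python
def one_hot(actions, n_actions):
    """Turn integer action indices into one-hot rows."""
    encoded = []
    for action in actions:
        row = [0.0] * n_actions
        row[action] = 1.0  # mark the chosen action
        encoded.append(row)
    return encoded


# LunarLander-v2 has four discrete actions (0 to 3).
labels = one_hot([2, 0, 3], n_actions=4)
```

Each label row then has the same length as the network output, so a loss can compare the two element by element.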
        

In [None]:
agent = bc.solve(expert_traj_s=states,
                 expert_traj_a=actions, 
                 epochs=10, 
                 batch_size=128, 
                 shuffle=True, 
                 optimizer=Adam(learning_rate=1e-4),
                 loss=tf.keras.losses.MeanSquaredError(),
                 metrics=tf.keras.metrics.MeanAbsoluteError(),
                 verbose=2,
                 validation_split=0.15, 
                 one_hot_encode_actions=False)
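
If "validation_split" is forwarded to the Keras "fit" function (an assumption; the source does not show this), the held-out data is taken from the end of the arrays before any shuffling. The sketch below reproduces that rule with a hypothetical helper.

```python
import math


def split_for_validation(samples, validation_split):
    # Keras-style split: the last fraction of the samples is held out,
    # and the cut is made before any shuffling.
    split_at = int(math.floor(len(samples) * (1.0 - validation_split)))
    return samples[:split_at], samples[split_at:]


train, val = split_for_validation(list(range(1000)), validation_split=0.15)
```

Because expert memories are stored in the order they were recorded, a split like this validates on the most recently recorded steps, not on a random sample.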

## Define the environment
We are going to use the LunarLander environment from OpenAI Gym. 

In [None]:
environment = "LunarLander-v2"
environment = gym.make(environment)

## Build a RL Problem

Once we have pretrained the agent, we can build an RL problem and fine-tune the network to obtain the more reactive behavior that RL training provides. We use the functionality in "RL_Problem.rl_problem.py", which makes the selection of the matching problem transparent to the user: the function "Problem" automatically selects the problem based on the agent used.

In [None]:
problem = rl_problem.Problem(environment, agent)
problem.test(render=True, n_iter=2)

## Solving the RL Problem

As a reminder, we defined the RL agent earlier and all of its parameters are kept. You may notice that we chose a very low learning rate and epsilon value; the goal is only to fine-tune the pretrained network, not to learn from scratch.

```python
agent = a2c_agent_discrete_queue.Agent(actor_lr=1e-5,
                                       critic_lr=1e-5,
                                       batch_size=32,
                                       epsilon=0.3,
                                       epsilon_decay=0.9999,
                                       epsilon_min=0.15,
                                       n_step_return=15,
                                       net_architecture=net_architecture,
                                       n_stack=n_stack)
```

The next step is to solve the RL problem that we have defined.

In [None]:
problem.solve(100, render=False, skip_states=1, max_step_epi=500)

In [None]:
problem.test(render=True, n_iter=10)

# Takeaways

- We loaded expert demonstrations for LunarLander and saw how to record them with a trained DPG agent.
- We pretrained the network of an A2C agent through behavioral cloning.
- We fine-tuned the pretrained agent by solving an RL problem.