In [None]:
import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
# Allocate GPU memory on demand instead of reserving it all up front (the call returns None)
tf.config.experimental.set_memory_growth(physical_devices[0], True)

# Fully Custom Networks with TensorFlow and Proximal Policy Optimization (PPO)

In this tutorial you will learn how to configure your own custom neural network in the most flexible way the library allows. You will need some TensorFlow knowledge to extend one of our neural models and create your own computation graph.

This example uses the Proximal Policy Optimization (PPO) agent.

By the end of this notebook you will know how to extend a PPO network model and modify it, create your own customized TensorBoard summaries, change the optimizer, set your preferred loss function and metrics, define your own fully customizable neural network, and even write your own custom train step that applies the forward pass and backpropagation.

In [None]:
import tensorflow as tf
from RL_Problem import rl_problem
from RL_Agent import ppo_agent_discrete
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Input
from RL_Agent.base.utils import agent_saver, history_utils
from RL_Agent.base.utils.networks.agent_networks import PPONet, TrainingHistory
from RL_Agent.base.utils.networks import networks, losses, returns_calculations

import gym


## Create a Custom Network Model with TensorFlow

To create your own neural network, it must extend "RLNetInterfaz" from RL_Agent.base.utils.networks.networks_interface.py. This interface contains the minimum mandatory parameters and functions that a network needs to work within the library. RL_Agent.base.utils.networks.networks_interface.py also provides the "RLNetModel" class, which extends "RLNetInterfaz" and implements some common functionality, so creating your network by extending "RLNetModel" is easier than extending the interface directly.

In this tutorial we are going to extend "PPONet" from "RL_Agent.base.utils.networks.agent_networks.py", which already extends "RLNetModel" and contains all the functionality that PPO needs. We recommend extending the classes implemented in "RL_Agent.base.utils.networks.agent_networks.py" if you plan to use a default RL agent from this library, and extending "RLNetModel" if you intend to make deep modifications to an agent or implement a new one.
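
The three-level hierarchy described above follows a common pattern: an abstract interface, a base class with shared behaviour, and a specialised subclass. The sketch below illustrates that pattern only. The classes `NetInterface`, `NetModelBase` and `MyNet` are hypothetical stand-ins, not the library's classes, and their method names are assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class NetInterface(ABC):
    """Hypothetical stand-in for an interface such as RLNetInterfaz."""

    @abstractmethod
    def compile(self, loss, optimizer, metrics=None):
        ...

    @abstractmethod
    def predict(self, x):
        ...


class NetModelBase(NetInterface):
    """Hypothetical stand-in for a base class such as RLNetModel."""

    def compile(self, loss, optimizer, metrics=None):
        # Shared behaviour: store the training configuration.
        self.loss = loss
        self.optimizer = optimizer
        self.metrics = metrics


class MyNet(NetModelBase):
    """Concrete network: only the missing pieces have to be written."""

    def predict(self, x):
        # Placeholder "forward pass" used only for this illustration.
        return [2 * value for value in x]


net = MyNet()
net.compile(loss="mse", optimizer="sgd")
prediction = net.predict([1, 2, 3])
```

Extending the intermediate class means only `predict` has to be written, which mirrors why extending "RLNetModel" or "PPONet" is less work than starting from the interface.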

### Modifications to PPONet

Here we explain the modifications that we are going to make to the default PPO network.

#### TensorBoard Summaries

We want to change the information recorded with TensorBoard, so we need to implement our own functions to write the summaries and assign them to the following attributes of the class:
* self.loss_sumaries: Writes information related to the loss calculation.
* self.rl_loss_sumaries: Writes information related to auxiliary data used in the loss and metrics calculations.
* self.rl_sumaries: Writes information related to the RL process, such as reward or epsilon values over epochs.

These three functions have default implementations in "RL_Agent.utils.network.tensor_board_loss_functions.py"
and receive as inputs:

* data: List of values to write to the summary.
* names: List of summary names, one for each value in data.
* step: Current step of the training process. We usually use the episode number.
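
A function with this signature only produces output when a summary writer is active. The following is a minimal sketch of how such a function might be exercised on its own, assuming a writer created with `tf.summary.create_file_writer`. The function name `demo_loss_summaries`, the summary names and the temporary log directory are illustrative assumptions; how the library itself creates the writer and calls the function is not shown here.

```python
import tempfile

import tensorflow as tf


def demo_loss_summaries(data, names, step):
    # Same (data, names, step) signature as the summary functions described above.
    with tf.name_scope('Losses'):
        for value, name in zip(data, names):
            tf.summary.scalar(name, value, step=step)


# Temporary directory used only for this illustration.
log_dir = tempfile.mkdtemp()
writer = tf.summary.create_file_writer(log_dir)

with writer.as_default():
    for episode in range(3):
        # Made-up loss values, one entry per name.
        demo_loss_summaries([0.5 / (episode + 1), 0.1],
                            ['actor_loss', 'critic_loss'],
                            step=episode)
writer.flush()
```

Pointing TensorBoard at `log_dir` would then show the two scalar series grouped under the `Losses` name scope.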



In [None]:
def custom_loss_sumaries(data, names, step):
    if isinstance(data, list):
        with tf.name_scope('Losses'):
            for d, n in zip(data, names):
                tf.summary.scalar(n, d, step=step)

def custom_rl_loss_sumaries(data, names, step):
    with tf.name_scope('RL_Values'):
        for d, n in zip(data, names):
            with tf.name_scope(n):
                tf.summary.histogram('histogram', d, step=step)
                tf.summary.scalar('mean', tf.reduce_mean(d), step=step)
                tf.summary.scalar('std', tf.math.reduce_std(d), step=step)
                tf.summary.scalar('max', tf.reduce_max(d), step=step)
                tf.summary.scalar('min', tf.reduce_min(d), step=step)

def custom_rl_sumaries(data, names, step):
    with tf.name_scope('RL'):
        for d, n in zip(data, names):
            with tf.name_scope(n):
                tf.summary.scalar(n, d, step=step)


#### Actor-Critic Neural Network modifications

Because we are using an Actor-Critic network, we would normally define two networks: 1) "self.actor_net" and 2) "self.critic_net". In this example, however, we want a single neural network that processes the input data and has two output heads, one for the Actor and one for the Critic. This deep modification forces us to re-implement the prediction and training methods.

We store our single network in the "self.actor_net" attribute so that a name change does not force modifications to other functionality.
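
A Keras model built with two outputs returns them as a list, which is why the prediction and training methods have to index the result. The sketch below shows this with a deliberately small network. The layer sizes and the input shape `(8,)` are arbitrary assumptions for illustration and do not match the architecture used in this tutorial.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense, Input

# Shared trunk followed by two heads.
inputs = Input(shape=(8,))
shared = Dense(16, activation='relu')(inputs)
policy_head = Dense(4, activation='softmax')(shared)  # Action probabilities
value_head = Dense(1, activation='linear')(shared)    # State value
model = tf.keras.models.Model(inputs=inputs, outputs=[policy_head, value_head])

# A batch of 5 random observations.
batch = np.random.rand(5, 8).astype(np.float32)

# The model returns one tensor per head, in the order given in `outputs`.
action_probs, state_values = model(batch, training=False)
print(action_probs.shape, state_values.shape)
```

Index 0 corresponds to the Actor head and index 1 to the Critic head, matching the order in which the outputs were passed to the model constructor.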

#### Optimizer and Loss Function

We redefine the "compile" method to set our preferred optimizer instead of the default one, and we select the PPO loss for discrete action spaces (this is the default loss for PPO, but a different loss could be specified here).
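
For readers unfamiliar with the PPO loss, the following NumPy sketch shows the generic clipped surrogate formulation for discrete actions, combined with a value-function term and an entropy bonus. It is not the library's `losses.ppo_loss_discrete`, whose exact implementation is not reproduced here. The function `ppo_clipped_loss` is hypothetical, and its parameter names are merely chosen to resemble those of the train step defined later in this notebook.

```python
import numpy as np


def ppo_clipped_loss(actions_onehot, new_probs, old_probs, advantages,
                     returns, values, loss_clipping=0.3,
                     critic_discount=0.5, entropy_beta=0.001):
    """Generic PPO loss for discrete actions (illustrative only)."""
    eps = 1e-10
    # Probability of the action actually taken, under the new and old policies.
    new_p = np.sum(actions_onehot * new_probs, axis=-1)
    old_p = np.sum(actions_onehot * old_probs, axis=-1)
    ratio = new_p / (old_p + eps)

    # Clipped surrogate objective (a quantity to maximize).
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - loss_clipping, 1.0 + loss_clipping) * advantages
    surrogate = np.mean(np.minimum(unclipped, clipped))

    # Value-function error and policy entropy.
    critic_loss = np.mean((returns - values) ** 2)
    entropy = np.mean(-np.sum(new_probs * np.log(new_probs + eps), axis=-1))

    # Quantity to minimize: negative surrogate, weighted critic loss, entropy bonus.
    return -surrogate + critic_discount * critic_loss - entropy_beta * entropy


# Two made-up transitions with two possible actions each.
actions = np.array([[1.0, 0.0], [0.0, 1.0]])
old_probs = np.array([[0.5, 0.5], [0.5, 0.5]])
new_probs = np.array([[0.9, 0.1], [0.5, 0.5]])
advantages = np.array([1.0, 1.0])
returns = np.array([1.0, 0.0])
values = np.array([1.0, 0.0])

loss = ppo_clipped_loss(actions, new_probs, old_probs, advantages,
                        returns, values, entropy_beta=0.0)
```

In the first transition the probability ratio is about 1.8, so the clipping limits its contribution to 1.3 times the advantage. This is the mechanism that keeps each policy update close to the previous policy.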

#### Train and Predict

We have modified the "predict" function to return only the actions, not the state values as the original one did. We have also modified the "_predict_values" function because it made use of the critic network.

Finally, we have modified the "_train_step" function to use a single network and removed the calls to the original "self.critic_net" variable, which we no longer need.
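
The core of a single-network train step is one forward pass, one combined loss, and one gradient update over a single set of variables. The sketch below illustrates that mechanism with `tf.GradientTape`. It uses placeholder losses (cross-entropy for the policy head and mean squared error for the value head) rather than the PPO loss, and the tiny model, the random data and the 0.5 weighting are assumptions made for illustration.

```python
import tensorflow as tf

tf.random.set_seed(0)

# Tiny two-headed model (illustrative sizes).
inputs = tf.keras.layers.Input(shape=(3,))
hidden = tf.keras.layers.Dense(8, activation='relu')(inputs)
policy = tf.keras.layers.Dense(2, activation='softmax')(hidden)
value = tf.keras.layers.Dense(1)(hidden)
model = tf.keras.models.Model(inputs, [policy, value])
optimizer = tf.keras.optimizers.RMSprop(1e-2)

# Made-up training data: always prefer action 0 and predict a return of 1.
x = tf.random.normal((16, 3))
target_actions = tf.one_hot(tf.zeros(16, dtype=tf.int32), depth=2)
target_returns = tf.ones((16, 1))


def train_step(x, actions, returns):
    with tf.GradientTape() as tape:
        probs, values = model(x, training=True)  # Forward pass through both heads
        policy_loss = tf.reduce_mean(
            tf.keras.losses.categorical_crossentropy(actions, probs))
        value_loss = tf.reduce_mean(tf.square(returns - values))
        loss = policy_loss + 0.5 * value_loss  # Single combined loss
    variables = model.trainable_variables
    gradients = tape.gradient(loss, variables)  # Backpropagation
    optimizer.apply_gradients(zip(gradients, variables))  # One update for the whole network
    return loss


first_loss = float(train_step(x, target_actions, target_returns))
for _ in range(50):
    last_loss = float(train_step(x, target_actions, target_returns))
```

Because both heads share the trunk, a single optimizer and a single gradient computation are enough, so no second optimizer or critic variable list is needed.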

In [None]:
class CustomNet(PPONet):
    def __init__(self, input_shape, tensorboard_dir=None):
        super().__init__(actor_net=self._build_net(input_shape), 
                         critic_net=None, 
                         tensorboard_dir=tensorboard_dir)

        self.loss_sumaries = custom_loss_sumaries
        self.rl_loss_sumaries = custom_rl_loss_sumaries
        self.rl_sumaries = custom_rl_sumaries
        
        # Dummy variables that stand in for the critic variables we no longer need.
        # Note: dtype must be passed by keyword; the second positional argument of tf.Variable is `trainable`.
        self.dummy_loss_critic = tf.Variable(0., dtype=tf.float32)
        variables_actor = self.actor_net.trainable_variables
        self.dummy_var_critic = [tf.Variable(tf.zeros(var.shape), dtype=tf.float32) for var in variables_actor]

    def _build_net(self, input_shape):
        input_data = Input(shape=input_shape)
        lstm = LSTM(64, activation='tanh')(input_data)
        dense1 = Dense(256, activation='relu')(lstm)
        dense2 = Dense(256, activation='relu')(dense1)

        # Actor head
        act_dense = Dense(128, activation='relu')(dense2)
        act_output = Dense(4, activation="softmax")(act_dense)  # One probability per LunarLander action (4 discrete actions)
        
        # Critic Head
        critic_dense = Dense(64, activation='relu')(dense2)
        critic_output = Dense(1, activation="linear")(critic_dense)

        return tf.keras.models.Model(inputs=input_data, outputs=[act_output, critic_output])


    def compile(self, loss, optimizer, metrics=None):
        # Define loss, metric and optimizer.
        # The `loss` and `optimizer` arguments are intentionally ignored; our own choices are set below.
        self.loss_func_actor = losses.ppo_loss_discrete
        self.loss_func_critic = None
        self.optimizer_actor = tf.keras.optimizers.RMSprop(1e-4)
        self.optimizer_critic = None
        self.calculate_advantages = returns_calculations.gae
        self.metrics = metrics
    
    def predict(self, x):
        y_ = self._predict(x)
        return y_[0].numpy()  # Take the predicted action 
    
    @tf.function(experimental_relax_shapes=True)
    def _predict_values(self, x):
        y_ = self.actor_net(tf.cast(x, tf.float32), training=False)
        return y_[1]  # Take the predicted value
    
    @tf.function(experimental_relax_shapes=True)
    def _train_step(self, x, old_prediction, y, returns, advantages, stddev=None, loss_clipping=0.3,
                   critic_discount=0.5, entropy_beta=0.001):
        with tf.GradientTape() as tape:
            y_ = self.actor_net(x, training=True)
            
            # Pass the corresponding actions (y_[0]) and values (y_[1]) to the loss function
            loss_actor, loss_complement_values = self.loss_func_actor(y, 
                                                                      y_[0], 
                                                                      advantages, 
                                                                      old_prediction, 
                                                                      returns, 
                                                                      y_[1], 
                                                                      stddev, 
                                                                      loss_clipping,
                                                                      critic_discount, 
                                                                      entropy_beta)

        variables_actor = self.actor_net.trainable_variables  # Get trainable variables 
        gradients_actor = tape.gradient(loss_actor, variables_actor)  # Get gradients
        self.optimizer_actor.apply_gradients(zip(gradients_actor, variables_actor))  # Update the network

        return [loss_actor, self.dummy_loss_critic], [gradients_actor, self.dummy_var_critic], [variables_actor, self.dummy_var_critic], returns, advantages, loss_complement_values



In the next cell, we define the network architecture dictionary used to pass the neural model to the agent. We do this through a function that receives the input shape. Then we create the dictionary, setting "use_tf_custom_model" to True, which means that we are going to use a model extended from "RLNetInterfaz", and we assign the model-building function to "tf_custom_model".

When we set the neural network model through the "use_tf_custom_model" and "tf_custom_model" parameters, we must define the output layers ourselves because the "define_custom_output_layer" parameter will be overridden.

In [None]:
def custom_model_tf(input_shape):
    return CustomNet(input_shape=input_shape, tensorboard_dir='tensorboard_logs')

net_architecture = networks.ppo_net(use_tf_custom_model=True,
                                     tf_custom_model=custom_model_tf)

Memory size...

In [None]:
agent = ppo_agent_discrete.Agent(batch_size=64,
                                 memory_size=500,
                                 epsilon=0.7,
                                 epsilon_decay=0.97,
                                 epsilon_min=0.15,
                                 net_architecture=net_architecture,
                                 n_stack=4,
                                 loss_critic_discount=0.001,
                                 loss_entropy_beta=0.01)


## Define the Environment

We chose the LunarLander environment from OpenAI Gym.

In [None]:
environment = "LunarLander-v2"
environment = gym.make(environment)

## Build a RL Problem

The RL problem is where the communication between agent and environment is managed. In this case, we use the functionality from "RL_Problem.rl_problem.py", which makes the selection of the matching problem transparent to the user. The "Problem" function automatically selects the problem based on the agent used.
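
Conceptually, managing the communication between agent and environment comes down to a loop in which the agent picks an action, the environment responds with an observation and a reward, and the cycle repeats until the episode ends. The sketch below shows that loop in plain Python. `ToyEnvironment`, `RandomAgent` and `run_episode` are hypothetical names invented for illustration and are unrelated to the library's own classes.

```python
import random


class ToyEnvironment:
    """Hypothetical environment: reach position 3 by moving right."""

    def reset(self):
        self.position = 0
        return self.position

    def step(self, action):
        # Action 1 moves right, action 0 stays in place.
        self.position += action
        done = self.position >= 3
        reward = 1.0 if done else -0.1
        return self.position, reward, done


class RandomAgent:
    """Hypothetical agent that ignores the observation."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def act(self, observation):
        return self.rng.choice([0, 1])


def run_episode(environment, agent, max_steps=200):
    observation = environment.reset()
    total_reward, steps = 0.0, 0
    for steps in range(1, max_steps + 1):
        action = agent.act(observation)                        # Agent decides
        observation, reward, done = environment.step(action)   # Environment responds
        total_reward += reward
        if done:
            break
    return total_reward, steps


episode_reward, episode_steps = run_episode(ToyEnvironment(), RandomAgent())
```

A full problem class would additionally store the transitions in the agent's memory and trigger training, which this sketch leaves out.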

In [None]:
problem = rl_problem.Problem(environment, agent)

## Solving the RL Problem

The next step is to solve the RL problem that we have defined. Here we specify the number of episodes and the skip_states parameter, limit the maximum number of steps per episode, and set rendering to start after 190 iterations.

In [None]:
problem.solve(200, render=False, max_step_epi=200, render_after=190, skip_states=1)

In [None]:
problem.test(render=False, n_iter=10)

In [None]:
hist = problem.get_histogram_metrics()
history_utils.plot_reward_hist(hist, 10)

## Run Tensorboard to See the Recorded Summaries

Let's look at the TensorBoard logs. The next cell executes the command that runs the TensorBoard service. To see the results, open a browser tab at the URL that the command shows, usually http://localhost:6006/

In [None]:
!tensorboard --logdir=tensorboard_logs

# Takeaways
- We learned how to deeply modify a PPO agent.
- We learned how to fully customize an agent's neural network model.
- We learned how to define custom TensorBoard summaries.
- We learned how to change the optimizer, the loss function, the metrics and the advantage calculation.
- We learned how to modify the predict methods.
- We learned how to modify the train method and perform the forward pass and backpropagation.