In [1]:
import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
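# Allocate GPU memory on demand instead of reserving it all up front.
# set_memory_growth returns None, so there is nothing to assign.
tf.config.experimental.set_memory_growth(physical_devices[0], True)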


# Fully Custom Networks with TensorFlow and Proximal Policy Optimization

In this tutorial you will learn how to configure your own custom neural network in the most versatile way. You will need some familiarity with TensorFlow to extend one of our neural models and create your own computation graph.

This example uses the Proximal Policy Optimization (PPO) agent.

In [2]:
import tensorflow as tf
from RL_Problem import rl_problem
from RL_Agent.legacy_agents import ppo_agent_discrete
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, Input
from RL_Agent.base.utils import agent_saver, history_utils
from RL_Agent.base.utils.networks.agent_networks import PPONet, TrainingHistory
from RL_Agent.base.utils.networks import networks, losses, returns_calculations

import gym



## Create the custom network

To create your own neural network, it must extend the "RLNetInterfaz" class from RL_Agent.base.utils.networks.networks_interface. This interface contains the minimum set of mandatory parameters and functions that a network needs in order to work within the library. RL_Agent.base.utils.networks.networks_interface also provides the "RLNetModel" class, which extends "RLNetInterfaz" and implements some common functionality, so creating your network by extending "RLNetModel" is easier than extending the interface directly.

In this tutorial we are going to extend "PPONet" from "RL_Agent.base.utils.networks.agent_networks", which already extends "RLNetModel" and contains all the functionality that PPO needs. We recommend extending the classes implemented in "RL_Agent.base.utils.networks.agent_networks" if you plan to use one of the library's default RL agents, and extending "RLNetModel" if you intend to modify an agent deeply or implement a new one.
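
The layering described above (an interface, then a base model with shared behaviour, then an agent-specific network) can be sketched in plain Python. The class names below are hypothetical stand-ins, not the library's classes, and the methods were chosen only for illustration; the real definitions live in RL_Agent.base.utils.networks.networks_interface.

```python
from abc import ABC, abstractmethod


class NetInterfaceSketch(ABC):
    """Stand-in for an interface: only declares what every network must offer."""

    @abstractmethod
    def compile(self, loss, optimizer, metrics=None):
        ...

    @abstractmethod
    def predict(self, x):
        ...


class NetModelSketch(NetInterfaceSketch):
    """Stand-in for a base model: implements the shared parts once."""

    def compile(self, loss, optimizer, metrics=None):
        self.loss_func = loss
        self.optimizer = optimizer
        self.metrics = metrics


class AgentNetSketch(NetModelSketch):
    """Stand-in for an agent-specific network: fills in what is left."""

    def predict(self, x):
        # Placeholder behaviour; a real network would run a forward pass here.
        return [0.0 for _ in x]


net = AgentNetSketch()
net.compile(loss="some_loss", optimizer="some_optimizer")
print(net.predict([1, 2, 3]))
```

Extending the most specific class means the least code to write; extending a class higher up leaves more methods to implement but imposes fewer assumptions.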

### Modification to PPONet

Here we explain the modifications that we are going to make to the default PPO network.

#### TensorBoard summaries

We want to change the information recorded with TensorBoard, so we need to implement our own functions to write the summaries and assign them to the corresponding attributes of the class:
* self.loss_sumaries: Writes information related to the loss calculation.
* self.rl_loss_sumaries: Writes information related to auxiliary data used in loss and metrics calculation.
* self.rl_sumaries: Writes information related to the RL process, such as reward over epochs or epsilon values over epochs.

These three functions have their default implementations in "RL_Agent.utils.network.tensor_board_loss_functions.py" and receive as inputs:

* data: List of values to write in the summary.
* names: List of summary names, one for each value contained in data.
* step: Current step of the training process. We usually use the episode number.



In [3]:
def custom_loss_sumaries(loss, names, step):
    if isinstance(loss, list):
        with tf.name_scope('Losses'):
            for l, n in zip(loss, names):
                tf.summary.scalar(n, l, step=step)

def custom_rl_loss_sumaries(data, names, step):
    with tf.name_scope('RL_Values'):
        for d, n in zip(data, names):
            with tf.name_scope(n):
                tf.summary.histogram('histogram', d, step=step)
                tf.summary.scalar('mean', tf.reduce_mean(d), step=step)
                tf.summary.scalar('std', tf.math.reduce_std(d), step=step)
                tf.summary.scalar('max', tf.reduce_max(d), step=step)
                tf.summary.scalar('min', tf.reduce_min(d), step=step)

def custom_rl_sumaries(data, names, step):
    with tf.name_scope('RL'):
        for d, n in zip(data, names):
            with tf.name_scope(n):
                tf.summary.scalar(n, d, step=step)
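
Summary functions like these record data only while a summary writer is active. The sketch below shows one way a function with the (data, names, step) signature might be called, assuming TensorFlow 2 in eager mode. The function name, the temporary log directory and the sample values are made up for illustration, and the library may set up its writer differently.

```python
import os
import tempfile

import tensorflow as tf


def rl_summaries_sketch(data, names, step):
    # Same (data, names, step) contract as the functions above.
    with tf.name_scope('RL'):
        for d, n in zip(data, names):
            tf.summary.scalar(n, d, step=step)


log_dir = tempfile.mkdtemp()  # illustrative location for the event files
writer = tf.summary.create_file_writer(log_dir)

# tf.summary calls write to whichever writer is the current default.
with writer.as_default():
    samples = [(-120.0, 1.0), (-95.5, 0.9)]  # made-up (reward, epsilon) pairs
    for episode, (reward, epsilon) in enumerate(samples):
        rl_summaries_sketch([reward, epsilon], ['reward', 'epsilon'], step=episode)

writer.flush()
event_files = os.listdir(log_dir)
print(event_files)
```

Pointing TensorBoard at that directory with `tensorboard --logdir` should then show the logged scalars.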


#### Actor-Critic Neural Network modifications

Because we are using an Actor-Critic agent, we would normally need to define two networks: 1) self.actor_net and 2) self.critic_net. In this example, however, we want a single neural network that processes the input data and has two output heads, one for the Actor and one for the Critic. To this end we define just one network, but this deep modification forces us to re-implement the prediction and training methods.

We will assign our single network to self.actor_net, which avoids having to modify other functionality because of a name change.

#### Optimizer and Loss Function

We redefine the "compile" method to set our preferred optimizer instead of the default one, and we select the PPO loss for discrete action spaces (this is the default loss, but a different loss could be specified here).
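
To make the role of the loss concrete, here is a NumPy sketch of a textbook clipped PPO objective for discrete actions, combined with a value loss and an entropy bonus. It is an assumption that the library's losses.ppo_loss_discrete follows this form. The function name and argument layout are illustrative; only the keyword names loss_clipping, critic_discount and entropy_beta are borrowed from the _train_step signature used in this tutorial.

```python
import numpy as np


def ppo_discrete_loss_sketch(actions_onehot, new_probs, old_probs, advantages,
                             returns, values, loss_clipping=0.3,
                             critic_discount=0.5, entropy_beta=0.001):
    eps = 1e-10  # avoids division by zero and log(0)

    # Probability that each policy assigned to the action actually taken.
    new_p = np.sum(actions_onehot * new_probs, axis=-1)
    old_p = np.sum(actions_onehot * old_probs, axis=-1)

    # Clipped surrogate objective (negated, because we minimise).
    ratio = new_p / (old_p + eps)
    clipped = np.clip(ratio, 1.0 - loss_clipping, 1.0 + loss_clipping)
    actor_loss = -np.mean(np.minimum(ratio * advantages, clipped * advantages))

    # Value head: mean squared error against the returns.
    critic_loss = np.mean((returns - values) ** 2)

    # Entropy bonus encourages exploration.
    entropy = -np.mean(np.sum(new_probs * np.log(new_probs + eps), axis=-1))

    return actor_loss + critic_discount * critic_loss - entropy_beta * entropy


actions = np.array([[1.0, 0.0], [0.0, 1.0]])
old_probs = np.array([[0.5, 0.5], [0.5, 0.5]])
new_probs = np.array([[0.6, 0.4], [0.3, 0.7]])
loss = ppo_discrete_loss_sketch(actions, new_probs, old_probs,
                                advantages=np.array([1.0, -0.5]),
                                returns=np.array([1.0, 0.0]),
                                values=np.array([0.8, 0.1]))
print(float(loss))
```

Because both heads feed a single scalar, one optimizer can update the shared layers and both heads in the same step.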

#### Train and Predict

We have modified the predict method to return only the actions, rather than also returning the state values as the original one does.

Finally, we have modified the _train_step method to use only one network and removed the calls to the original variable "self.critic_net", which we no longer need.

In [9]:
class CustomNet(PPONet):
    def __init__(self, input_shape, tensorboard_dir=None):
        super().__init__(actor_net=self._build_net(input_shape), 
                         critic_net=None, 
                         tensorboard_dir=tensorboard_dir)

        self.loss_sumaries = custom_loss_sumaries
        self.rl_loss_sumaries = custom_rl_loss_sumaries
        self.rl_sumaries = custom_rl_sumaries

    def _build_net(self, input_shape):
        input_data = Input(shape=input_shape)
        lstm = LSTM(64, activation='tanh')(input_data)
        dense = Dense(256, activation='relu')(lstm)
        
        # Actor head: one softmax probability per discrete action
        # (LunarLander-v2 has 4 actions).
        act_dense = Dense(128, activation='relu')(dense)
        act_output = Dense(4, activation="softmax")(act_dense)
        
        # Critic Head
        critic_dense = Dense(64, activation='relu')(dense)
        critic_output = Dense(1, activation="linear")(critic_dense)

        return tf.keras.models.Model(inputs=input_data, outputs=[act_output, critic_output])


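    def compile(self, loss, optimizer, metrics=None):
        # The loss and optimizer arguments are ignored on purpose: this network
        # always uses the discrete PPO loss and its own SGD optimizer.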
        self.loss_func_actor = losses.ppo_loss_discrete
        self.loss_func_critic = None
        self.optimizer_actor = tf.keras.optimizers.SGD(1e-3, momentum=0.2)
        self.optimizer_critic = None
        self.calculate_advantages = returns_calculations.gae
        self.metrics = metrics

    def predict(self, x):
        y_ = self._predict(x)
        return y_[0].numpy()

    @tf.function(experimental_relax_shapes=True)
    def _train_step(self, x, old_prediction, y, returns, advantages, stddev=None, loss_clipping=0.3,
                   critic_discount=0.5, entropy_beta=0.001):
        with tf.GradientTape() as tape:
            y_ = self.actor_net(x, training=True)
            loss_actor = self.loss_func_actor(y, y_[0], advantages, old_prediction, returns, y_[1], stddev, loss_clipping,
                                  critic_discount, entropy_beta)

        variables_actor = self.actor_net.trainable_variables
        # tape.gradient returns one gradient per variable, so keep the whole list.
        gradients_actor = tape.gradient(loss_actor, variables_actor)
        self.optimizer_actor.apply_gradients(zip(gradients_actor, variables_actor))

        return [loss_actor, 0.], [gradients_actor, 0.], [variables_actor, 0.], returns, advantages



In the next cell we define the network architecture dictionary used to pass the neural model to the agent. We do this through a function that receives the input shape. Then we create the dictionary, setting "use_tf_custom_model" to True, which means that we are going to use a model extended from "RLNetInterfaz". Finally, we assign the function that creates the model to "tf_custom_model".

When we set the neural network model through the "use_tf_custom_model" and "tf_custom_model" params, we are required to define the output layers ourselves, because the "define_custom_output_layer" param is overridden.

In [20]:
def custom_model_tf(input_shape):
    return CustomNet(input_shape=input_shape, tensorboard_dir='tensorboard_logs')

net_architecture = networks.ppo_net(use_tf_custom_model=True,
                                    tf_custom_model=custom_model_tf)
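
Passing a function instead of an already built model lets the network be constructed later, presumably once the input shape is known. A plain-Python sketch of that idea follows. `make_architecture`, `build_model` and `fake_model_factory` are hypothetical helpers written for this illustration; they do not reflect how networks.ppo_net or the agent work internally.

```python
def make_architecture(use_tf_custom_model=False, tf_custom_model=None):
    # Hypothetical stand-in: just stores the two settings in a dictionary.
    return {
        "use_tf_custom_model": use_tf_custom_model,
        "tf_custom_model": tf_custom_model,
    }


def build_model(architecture, input_shape):
    # Hypothetical consumer: calls the stored factory with the input shape.
    if not architecture["use_tf_custom_model"]:
        raise ValueError("no custom model factory configured")
    return architecture["tf_custom_model"](input_shape)


def fake_model_factory(input_shape):
    # Stands in for a function such as custom_model_tf; returns a description
    # instead of a real network so the sketch runs without TensorFlow.
    return {"kind": "custom", "input_shape": input_shape}


architecture = make_architecture(use_tf_custom_model=True,
                                 tf_custom_model=fake_model_factory)
model = build_model(architecture, input_shape=(4, 8))  # example shape only
print(model)
```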

In [21]:
agent = ppo_agent_discrete.Agent(batch_size=256,
                                 memory_size=500,
                                 epsilon=1.0,
                                 epsilon_decay=0.9,
                                 epsilon_min=0.15,
                                 net_architecture=net_architecture,
                                 n_stack=4)


In [22]:
environment_name = "LunarLander-v2"
environment = gym.make(environment_name)

In [23]:
problem = rl_problem.Problem(environment, agent)



AttributeError: 'CustomNet' object has no attribute 'add'