In [None]:
import tensorflow as tf
physical_devices = tf.config.experimental.list_physical_devices('GPU')
assert len(physical_devices) > 0, "Not enough GPU hardware devices available"
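# Let TensorFlow allocate GPU memory on demand instead of reserving it all up front.
# set_memory_growth returns None, so there is nothing to assign.
tf.config.experimental.set_memory_growth(physical_devices[0], True)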

# Deep Deterministic Policy Gradient (DDPG), Actor-Critic Agents and How to Define Output Layers

Deep Deterministic Policy Gradient (DDPG) is an Actor-Critic algorithm that extends DQN to problems with continuous actions. The agent consists of two neural networks: 1) the Actor network receives states and proposes actions, and 2) the Critic network receives the states and the actions and estimates the action value Q(s, a).

Additionally, we will see how to define the output layers of each network when we use Keras, since until now we have let the library create them automatically.

By the end of this tutorial you will know how to use DDPG agents and how to define the properties of the output layers of your networks.
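As a rough illustration of what the Critic is trained towards, the sketch below computes the one-step Bellman target used in the standard DDPG formulation. It is not taken from this library's internals: the function name `critic_target`, the discount `gamma=0.99`, and the toy numbers are assumptions made for the example.

```python
def critic_target(reward, next_q, done, gamma=0.99):
    """One-step target y = r + gamma * Q(s', a') for non-terminal transitions.

    next_q stands for the critic's estimate at the next state, evaluated
    at the action the actor proposes there (hypothetical input here).
    """
    return reward + gamma * (1.0 - float(done)) * next_q

# Toy numbers, chosen only for illustration.
y_mid = critic_target(reward=1.0, next_q=10.0, done=False)    # bootstraps from next_q
y_end = critic_target(reward=-100.0, next_q=10.0, done=True)  # terminal: reward only
print(y_mid, y_end)
```

The Critic is regressed towards this target, while the Actor is updated to propose actions that the Critic scores highly.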

In [None]:
from RL_Problem import rl_problem
from RL_Agent import ddpg_agent
from RL_Agent.base.utils import agent_saver, history_utils
from RL_Agent.base.utils.networks import networks
from RL_Problem.base.ActorCritic import ddpg_problem
from tensorflow.keras.layers import Dense, LSTM
import gym

## Defining the Neural Network Architecture

Here, we define the neural networks using Keras. We create two functions, one for the actor network and another for the critic network. When creating the critic network for DDPG, we have to take into account that it needs two different inputs: 1) the state, as usual in other agents (in this example we stack 5 time steps and use an LSTM as the first layer), and 2) the action, which is an array of 2 values because the selected environment has two action dimensions.
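To make the stacked state concrete, here is a minimal sketch of how 5 consecutive observations could be kept in a rolling buffer to form an LSTM input of shape (5, 8). The library handles stacking internally through its n_stack parameter, and its actual implementation may differ; `make_stack` and `push` are hypothetical helpers.

```python
from collections import deque

N_STACK = 5   # number of stacked time steps used in this tutorial
OBS_DIM = 8   # LunarLanderContinuous observations have 8 components

def make_stack(first_obs, n_stack=N_STACK):
    """Start an episode by repeating the first observation n_stack times."""
    return deque((list(first_obs) for _ in range(n_stack)), maxlen=n_stack)

def push(stack, obs):
    """Append the newest observation; the oldest one is dropped automatically."""
    stack.append(list(obs))
    return [row[:] for row in stack]  # nested list of shape (n_stack, obs_dim)

stack = make_stack([0.0] * OBS_DIM)
state = push(stack, [1.0] * OBS_DIM)
print(len(state), len(state[0]))
```

Each row is one time step, ordered from oldest to newest, which is the layout a Keras LSTM expects for a single sample.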

Notice that we also define the output layers. This lets the user choose, for example, a preferred activation, add normalization to the output, or even use a Lambda layer to sample from a normal distribution.
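One reason the output activation matters: a tanh output keeps every action component in [-1, 1], which is the action range LunarLanderContinuous expects. The plain-Python sketch below shows that squashing, plus a hypothetical `rescale` helper for environments with other bounds. It only illustrates the arithmetic and is not part of the library.

```python
import math

def tanh_output(pre_activations):
    """Apply tanh element-wise, as a tanh output layer would."""
    return [math.tanh(z) for z in pre_activations]

def rescale(actions, low, high):
    """Map values from [-1, 1] to [low, high] (hypothetical helper)."""
    return [low + (a + 1.0) * 0.5 * (high - low) for a in actions]

raw = [-5.0, 0.0, 5.0]               # unbounded values before the activation
squashed = tanh_output(raw)          # every value now lies in [-1, 1]
scaled = rescale(squashed, 0.0, 2.0)
print(squashed, scaled)
```

In a Keras model, the same rescaling could be expressed as an extra layer after the tanh output, which is the kind of customization that defining the output layer yourself makes possible.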

In [None]:
def actor_custom_model(input_shape):
    lstm = LSTM(32, activation='tanh', input_shape=input_shape, name='lstm_c')
    dense_1 = Dense(256, activation='relu')
    dense_2 = Dense(128, activation='relu')
    
    # Output layer
    output = Dense(2, activation='tanh')
    
    def model():
        model = tf.keras.models.Sequential([lstm, dense_1, dense_2, output])
        return model
    return model()

def critic_custom_model(input_shape, actor_net):
    
    lstm_s = LSTM(32, activation='tanh', input_shape=input_shape, name='lstm_state')
    dense_s = Dense(256, activation='relu', name='dense_state')
    
    dense_a = Dense(128, activation='relu', input_shape=(actor_net.output.shape[1:]), name='dense_act')
    
    dense_c = Dense(128, activation='relu', name='dense_common')
    output = Dense(1, activation='linear', name='output')
    
    def model():
        
        # state model
        state_model = tf.keras.models.Sequential([lstm_s, dense_s])   
        
        # action model
        act_model = tf.keras.models.Sequential([dense_a])
        
        # merge both models
        merge = tf.keras.layers.Concatenate()([state_model.output, act_model.output])
        merge = dense_c(merge)
        
        # Output layer
        out = output(merge)
        
        model = tf.keras.models.Model(inputs=[state_model.input, act_model.input], outputs=out)
        return model
    return model()

In the next cell, we define the network architecture using dictionaries. Since we have specified the output layers for the Actor and the Critic, we have to set the "define_custom_output_layer" parameter to True to inform the agent of this. We also need to set the "use_custom_network" parameter to True.

As we are using an Actor-Critic agent we need to set two parameters, one for each network: "actor_custom_network" and "critic_custom_network".

In [None]:
net_architecture = networks.ddpg_net(use_custom_network=True,
                                     actor_custom_network=actor_custom_model,
                                     critic_custom_network=critic_custom_model,
                                     define_custom_output_layer=True)

## Define the RL Agent

We define the Actor-Critic agent with the following parameters:

* actor_lr: learning rate for training the Actor neural network.
* critic_lr: learning rate for training the Critic neural network.
* batch_size: size of the batches used for training the neural networks.
* epsilon: determines the amount of exploration.
* epsilon_decay: decay factor applied to epsilon.
* epsilon_min: minimum value epsilon can reach during training.
* net_architecture: the network architecture defined before.
* n_stack: number of stacked time steps that form the state.
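To get a feel for how epsilon, epsilon_decay and epsilon_min interact, the sketch below assumes a multiplicative decay with a floor, which is a common scheme. Whether the library applies the decay per step or per episode, and with exactly this formula, is an assumption here; `decay_epsilon` is a hypothetical helper.

```python
def decay_epsilon(epsilon, epsilon_decay, epsilon_min):
    """Shrink epsilon by a constant factor, never going below epsilon_min."""
    return max(epsilon_min, epsilon * epsilon_decay)

# Values matching the agent configured in this tutorial.
epsilon, updates = 0.5, 0
while epsilon > 0.15:
    epsilon = decay_epsilon(epsilon, 0.9999, 0.15)
    updates += 1

print(updates, epsilon)
```

Under this assumption, it takes roughly twelve thousand decay updates for epsilon to fall from 0.5 to its floor of 0.15, so exploration fades slowly.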

In [None]:
agent = ddpg_agent.Agent(actor_lr=1e-3,
                         critic_lr=1e-3,
                         batch_size=64,
                         epsilon=0.5,
                         epsilon_decay=0.9999,
                         epsilon_min=0.15,
                         net_architecture=net_architecture,
                         n_stack=5)

## Define the Environment
We chose the LunarLanderContinuous environment from OpenAI Gym.

In [None]:
environment = "LunarLanderContinuous-v2"
environment = gym.make(environment)

## Build a RL Problem

The RL problem is where the communication between the agent and the environment is managed.

In [None]:
problem = ddpg_problem.DDPGPRoblem(environment, agent)

## Solving the RL Problem
Here, we introduce a new parameter: "max_step_epi". It limits the number of steps in every episode. This is useful if the environment has no maximum number of time steps or, as in this case, we want to reduce the maximum number of steps from 1000 to 200 to force the agent to solve the problem faster.
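The effect of a step cap can be sketched with a toy interaction loop. Everything here is hypothetical and independent of the library: `EndlessEnv` is a stand-in environment that never ends on its own, and `run_episode` is an illustrative loop using the classic Gym `(obs, reward, done, info)` step signature.

```python
class EndlessEnv:
    """Hypothetical toy environment that never terminates on its own."""

    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        return float(self.t), 1.0, False, {}  # obs, reward, done, info

def run_episode(env, policy, max_step_epi=None):
    """Run one episode, stopping early once max_step_epi steps are reached."""
    obs, total_reward, steps, done = env.reset(), 0.0, 0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total_reward += reward
        steps += 1
        if max_step_epi is not None and steps >= max_step_epi:
            break  # truncate the episode at the cap
    return steps, total_reward

steps, total = run_episode(EndlessEnv(), policy=lambda obs: 0.0, max_step_epi=200)
print(steps, total)
```

Without the cap this loop would never return, which is the situation a parameter like max_step_epi guards against.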


In [None]:
problem.solve(250, max_step_epi=200, render_after=150, skip_states=3)


Run the agent in test mode to see its final performance.

In [None]:
problem.test(render=False, n_iter=5)


In [None]:
hist = problem.get_histogram_metrics()
history_utils.plot_reward_hist(hist, 10)

# Takeaways
- We learned how to use DDPG agents.
- We learned how to create custom network architectures with Keras for Actor-Critic agents.
- We learned how to define the output layers of our networks so that we can set the desired activations or add other functionality.
- We learned how to limit the number of time steps per episode during training to avoid overly long episodes.