# Introduction

This Google Colab notebook proposes a possible approach to POMDP problems. The project relies on two papers:

  1.   [Memory-based Deep Reinforcement Learning for POMDPs](https://arxiv.org/pdf/2102.12344.pdf)
  >   Original implementation : https://github.com/LinghengMeng/lstm_td3  
  Network architecture described in paper
  2.   [Deep Reinforcement Learning in Large Discrete Action Spaces](https://arxiv.org/pdf/1512.07679.pdf)
  >   Original implementation : https://github.com/ChangyWen/wolpertinger_ddpg  
      [Network architecture](https://intellabs.github.io/coach/components/agents/policy_optimization/wolpertinger.html?highlight=wolpertinger)
 
My implementation takes inspiration from the works cited above, proposing a TensorFlow version and a different replay buffer construction that exploits TensorFlow ragged tensors.  
The project was designed to study the LSTM-TD3 framework and, in addition, to investigate whether its field of applicability can be extended to discrete action spaces, since TD3 is intended to work only with continuous action spaces. The idea came from the fact that the Wolpertinger architecture relies on Deep Deterministic Policy Gradient (DDPG) to train its policy; since TD3 is also based on DDPG, I thought it would be interesting to test Wolpertinger in this different framework. 
This integration has been achieved, but performance is not satisfactory because of difficulties with the action representation and the exploration method. 

## Environments used

Discrete action space - [Lunar Lander v2](https://www.gymlibrary.dev/environments/box2d/lunar_lander/)

Continuous action space - [HalfCheetahBulletEnv-v0](https://github.com/benelot/pybullet-gym)

### Required packages

---


In [None]:
!sudo apt-get update
!sudo apt-get install -y build-essential python-dev swig python-pygame

!pip install gdown
!pip install toml --quiet
!pip install --upgrade tensorflow
!pip  install tf-agents --quiet
!pip install pyflann-py3 --quiet
!pip install box2d-py --quiet
!pip install Box2D --quiet
!pip3 install gym[Box_2D] --quiet



In [None]:
!git clone https://github.com/benelot/pybullet-gym.git 
!pip install -e pybullet-gym/.
!cp -r ./pybullet-gym/pybulletgym /usr/local/lib/python3.7/dist-packages

In [None]:
# Downloading environment utility modules from Github project folder
!mkdir ./environment/
!wget "https://github.com/Smaike94/Deep_Learning_Project/blob/main/environment/action_space_TD3.py?raw=true" -O ./environment/action_space_TD3.py
!wget "https://github.com/Smaike94/Deep_Learning_Project/blob/main/environment/env_wrapper.py?raw=true" -O ./environment/env_wrapper.py
!wget "https://github.com/Smaike94/Deep_Learning_Project/blob/main/Utility.py?raw=true" -O ./Utility.py

# Download the checkpoints folder from a personal Google Drive, so the model can be tested without running the training
!gdown --folder https://drive.google.com/drive/folders/1SiXR89MvKBoADP9HYKOhJXYgubVa3cCQ?usp=sharing
!gdown 1tbeA-NHQ2Q47zojYztutfRrria2U1RTr
!gdown 1YWeWY0TXJ3BBZn-1auTK26wsWj8WZVVq

In [4]:
# Include required modules
import tensorflow as tf
import gym
import pybulletgym # To register tasks in PyBulletGym
import pybullet_envs  # To register tasks in PyBullet

import statistics
import os
import numpy as np
import tqdm
import Utility
import argparse

from environment.action_space_TD3 import Discrete_space
from environment.env_wrapper import POMDPWrapper
from tf_agents.replay_buffers import tf_uniform_replay_buffer
from tensorflow.keras import layers


# Background

---


## Partially Observable Markov Decision Process (POMDP)
A POMDP is defined by a 6-tuple **<S, A, R, P, O, Ω>**, where **S** is the state space, **A** is the action space, **P**(*s<sub>t+1</sub>*|*s<sub>t</sub>, a<sub>t</sub>*) is the transition probability, **R**(*s<sub>t+1</sub>, s<sub>t</sub>, a<sub>t</sub>*) is the reward function, **O** is the observation space and **Ω** is the observation model. The first four elements also describe an *MDP*, whereas the remaining two are specific to a *POMDP*.
At each time step *t* the agent chooses an action *a<sub>t</sub>* from **A**; starting from the current state *s<sub>t</sub>*, **P** models the probability of moving to the next state *s<sub>t+1</sub>* given *s<sub>t</sub>* and *a<sub>t</sub>*. 
This transition probability is the same in an *MDP* and in a *POMDP*; the difference is that in the latter the agent does not see the full state of the environment but receives an observation that is only a partial representation of it. When the next state *s<sub>t+1</sub>* is reached, the agent receives an observation *o<sub>t+1</sub>* from **O** with probability **Ω**(*o<sub>t+1</sub>*|*s<sub>t+1</sub>*). This probability could also be conditioned on the action *a<sub>t</sub>*, but if the action is included in the state features, as in this project, the two conditional probabilities are the same.  
Model-based approaches require knowledge of both **P** and **Ω**; this project instead adopts a model-free approach that exploits a *Long Short-Term Memory* (LSTM) network to help estimate the underlying state. 
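
As a minimal illustration of the observation model **Ω**, the sketch below turns a fully observed state into a partial observation by hiding some of its components. The function `partial_observation`, the toy state layout and the choice of hidden indices are hypothetical; they are not taken from `POMDPWrapper`.

```python
from typing import Sequence


def partial_observation(state: Sequence[float], hidden: Sequence[int]) -> list:
    """Return the state with the components listed in `hidden` removed.

    This is a deterministic observation model: the agent receives only the
    components that are not hidden, so distinct states can look identical.
    """
    hidden_set = set(hidden)
    return [x for i, x in enumerate(state) if i not in hidden_set]


# Toy 4-dimensional state: (x, y, x_velocity, y_velocity).
state_a = [0.5, 1.0, -0.2, 0.3]
state_b = [0.5, 1.0, 0.4, -0.1]

# Hide both velocity components (indices 2 and 3).
obs_a = partial_observation(state_a, hidden=[2, 3])
obs_b = partial_observation(state_b, hidden=[2, 3])

# The two states differ, yet their observations coincide; a memory of past
# observations is what can let the agent tell them apart.
print(obs_a, obs_b, obs_a == obs_b)
```

In this toy case the velocities could be recovered from two consecutive positions, which is the kind of information a recurrent memory can exploit.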



---

## Twin Delayed Deep Deterministic Policy Gradient (TD3)

This algorithm was introduced to overcome a common problem of Deep Deterministic Policy Gradient (DDPG), namely the overestimation of the Q-value function. It introduces three main changes:

* **Clipped Double-Q Learning**: two critic networks are trained, and the minimum of their two Q-values is used in the target of the Bellman error loss. 

*  **Delayed policy update**: the policy and the target networks are updated less frequently than the critic Q-value function. This allows the value network to become more stable and to reduce its error before it is used to update the policy network.

*  **Target policy smoothing**: the target value estimate is smoothed by adding clipped noise to the target policy. The idea is that the target value should have low variance, meaning that similar actions should have similar values. Unfortunately, a deterministic policy does not guarantee this, so it is enforced by adding noise to the target policy during training, which favours actions whose value is robust to perturbation.
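
The three changes can be sketched for a single transition with a scalar action. This is a simplified illustration in plain Python, not the notebook's update code: the names `td3_target` and `should_update_policy`, the toy critics and all numeric values are assumptions.

```python
import random


def clip(x: float, low: float, high: float) -> float:
    return max(low, min(high, x))


def td3_target(reward, done, gamma, next_action, q1_target, q2_target,
               sigma, noise_clip, act_low, act_high, rng):
    """Bellman target for one transition with a scalar action."""
    # Target policy smoothing: clipped Gaussian noise on the target action.
    noise = clip(rng.gauss(0.0, sigma), -noise_clip, noise_clip)
    smoothed_action = clip(next_action + noise, act_low, act_high)
    # Clipped double-Q learning: keep the smaller of the two target estimates.
    q_min = min(q1_target(smoothed_action), q2_target(smoothed_action))
    return reward + gamma * (1.0 - done) * q_min


def should_update_policy(step: int, policy_delay: int) -> bool:
    """Delayed policy update: the actor is updated once every `policy_delay` critic updates."""
    return step % policy_delay == 0


# Hypothetical target critics that disagree by a constant offset.
q1 = lambda a: 1.0 - a * a
q2 = lambda a: 0.8 - a * a

rng = random.Random(0)
target = td3_target(reward=1.0, done=0.0, gamma=0.99, next_action=0.5,
                    q1_target=q1, q2_target=q2, sigma=0.2, noise_clip=0.5,
                    act_low=-1.0, act_high=1.0, rng=rng)
print(target, [should_update_policy(s, policy_delay=2) for s in range(1, 5)])
```

Because the minimum is taken after the noise is applied, both critics are evaluated on the same smoothed action.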



# Proposed approach 


### Network architecture 


---


The implementation suggested in the first paper uses a recurrent actor-critic framework with an LSTM network in both the actor and the critic. The following images show the network architecture.

<img src="https://drive.google.com/uc?id=14p7JE9H4eYhFk62dV8AFkxdx4EZz17sW" />

> Fig.1 Actor network

<img src="https://drive.google.com/uc?id=1lLHvSUyDeqbu-rvUEi_BMPiEKHFP5wte" />

> Fig.2 Critic network

<img src="https://drive.google.com/uc?id=1to1NK1z_ejjw5h_g8F4MTonqqGdOwDkr" width="350" height="300"/>

Each fully connected block may contain more than one layer; the same holds for the LSTM blocks, where it is more accurate to speak of cells rather than layers.

*  **H<sub>t,l</sub>** is the concatenation of the batches of past observations and actions. Both are 3-dimensional ragged tensors, with *history_length* as the ragged dimension. After concatenation only the third dimension changes, becoming *observation_dimension* plus *action_dimension*.  
Each row of this tensor contains the history sequence of length *l* up to the current observation/action at time *t*.  
The value *n_observation* equals the number of sampled observations during training, whereas it is fixed to 1 during environment interaction.

*  **O<sub>t</sub>** and **A<sub>t</sub>** are regular rank-2 tensors holding the observations/actions at time *t*. *n_observation* has the same meaning as above. 

*  **Q(O<sub>t</sub>,A<sub>t</sub>,H<sub>t,l</sub>)** is the Q-value function represented by the critic deep neural network. Its purpose is to approximate the optimal action-value function **Q<sup>*</sup>**. 


*  **μ(O<sub>t</sub>,H<sub>t,l</sub>)** is the policy function represented by the actor deep neural network. Its purpose is to approximate the optimal policy **μ<sup>*</sup>**.
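
To make the shape of **H<sub>t,l</sub>** concrete, the sketch below builds history rows with plain Python lists instead of a `tf.RaggedTensor`. It assumes that the history holds the steps strictly before *t*; the helper `history_window` and the toy episode are hypothetical.

```python
def history_window(observations, actions, t, max_len):
    """Return the concatenated (observation, action) rows that precede step t,
    keeping at most `max_len` steps."""
    start = max(0, t - max_len)
    return [list(o) + list(a)
            for o, a in zip(observations[start:t], actions[start:t])]


# Toy episode: 2-dimensional observations, 1-dimensional actions.
observations = [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5], [0.6, 0.7]]
actions = [[1.0], [-1.0], [0.5], [0.0]]

# A batch of histories sampled at different time steps of the episode.
batch = [history_window(observations, actions, t, max_len=2) for t in (1, 2, 3)]

# Rows near the start of an episode are shorter than max_len, which is why
# the history dimension is ragged; the last dimension is always obs + act.
row_lengths = [len(history) for history in batch]
feature_dim = len(batch[0][0])
print(row_lengths, feature_dim)
```

The varying row lengths are the reason a ragged representation is convenient: no padding value has to be chosen for short histories.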



#### Implementation

The class *ActorCriticModel* implements both the actor and the critic network, since the two share an almost identical architecture, with two main differences:

*   the critic's feature network takes both observations and actions as input, whereas the actor's takes only observations
*   the activation function of the last FC layer is the identity for the critic and the hyperbolic tangent for the actor.

The function *get_input_emb_layers* adds separate input embedding layers for observations and actions, for both the feature and the memory network. It is used in the discrete action space case, where the reference architecture suggests handling the inputs in this way.

Each layer has a fixed activation function, but different network configurations can be tested by adding layers and changing their size, i.e. the number of hidden units. This is done through a configuration file written in **TOML** format.
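
As a rough illustration, the configuration of one network can be pictured as the dictionary obtained after parsing the TOML file. The key names below are the ones read by *ActorCriticModel*; the layer sizes and the helper `combination_input_dim` are hypothetical and only sketch how the block sizes determine the input width of the combination network.

```python
# Hypothetical configuration for one network (continuous action space case).
# Key names follow those read by ActorCriticModel; sizes are illustrative only.
actor_config = {
    "hist_past_act": True,
    "memory_net": {
        "pre_rnn_hid_sizes": [128],
        "rnn_hid_sizes": [128],
        "post_rnn_hid_sizes": [],
    },
    "features_net": {"fet_ext_hid_sizes": [128]},
    "combination_net": {"comb_net_hid_sizes": [128]},
}


def combination_input_dim(config: dict, features_input_dim: int) -> int:
    """Width of the concatenated memory/feature vector fed to the combination net."""
    mem = config["memory_net"]
    # Output width of the memory network: the last non-empty block decides it.
    mem_out = None
    for block in ("pre_rnn_hid_sizes", "rnn_hid_sizes", "post_rnn_hid_sizes"):
        if mem[block]:
            mem_out = mem[block][-1]
    if mem_out is None:
        raise ValueError("memory_net needs at least one non-empty block")
    # An empty feature block means the raw feature input is passed through.
    fet_sizes = config["features_net"]["fet_ext_hid_sizes"]
    fet_out = fet_sizes[-1] if fet_sizes else features_input_dim
    return mem_out + fet_out


print(combination_input_dim(actor_config, features_input_dim=8))
```

Leaving a block empty removes it from the network, so depth can be changed from the configuration file alone.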

In [5]:

def get_input_emb_layers(act_dim, obs_dim, kernel_init, input_emb_config, past_act=None):
    output_dim_act_emb, output_dim_obs_emb = 0, 0
    if past_act is not None:
        # Input Embedding for memory network
        ragged_input = True
        input_act_shape, input_obs_shape = (None, act_dim), (None, obs_dim)
    else:
        # Input Embedding for features network
        ragged_input = None
        input_act_shape, input_obs_shape = act_dim, obs_dim

    input_actor_embedding_net = tf.keras.Sequential()
    input_actor_embedding_net.add(layers.Input(shape=input_act_shape, ragged=ragged_input))
    for hid_layer_size in input_emb_config["input_emb_act"]:
        input_actor_embedding_net.add(layers.Dense(hid_layer_size, activation="relu",
                                                   kernel_initializer=kernel_init))
    if input_emb_config["input_emb_act"]:
        output_dim_act_emb += input_emb_config["input_emb_act"][-1]

    input_observation_embedding_net = tf.keras.Sequential()
    input_observation_embedding_net.add(layers.Input(shape=input_obs_shape, ragged=ragged_input))
    for hid_layer_size in input_emb_config["input_emb_obs"]:
        input_observation_embedding_net.add(layers.Dense(hid_layer_size, activation="relu",
                                                         kernel_initializer=kernel_init))
    if input_emb_config["input_emb_obs"]:
        output_dim_obs_emb += input_emb_config["input_emb_obs"][-1]

    if past_act is not None:
        if past_act:
            # Return both embedding net
            dim_output = output_dim_act_emb + output_dim_obs_emb
            return input_actor_embedding_net, input_observation_embedding_net, dim_output
        else:
            # Return only observation net
            dim_output = output_dim_obs_emb
            return None, input_observation_embedding_net, dim_output
    else:
        # Return both embedding net for features case
        dim_output = output_dim_act_emb + output_dim_obs_emb
        return input_actor_embedding_net, input_observation_embedding_net, dim_output


class ActorCriticModel(tf.keras.Model):
    """
    Combined actor-critic Model.
    The implementation follows the structure described in
    https://arxiv.org/pdf/2102.12344.pdf, with the addition of input embedding
    layers for the Wolpertinger architecture in the discrete action space case,
    applied at the input of both the memory and the features network. For the Wolpertinger
    architecture see https://intellabs.github.io/coach/components/agents/policy_optimization/wolpertinger.html?highlight=wolpertinger.
    """

    def __init__(self, obs_dim, config_net: dict, act_dim, actions_space_continuous, model_type):
        """Initialize."""
        super().__init__()
        self.model_type = model_type
        self.action_space_continuous = actions_space_continuous
        self.mem_net_config = config_net["memory_net"]
        self.fet_net_config = config_net["features_net"]
        self.comb_net_config = config_net["combination_net"]
        self.hist_past_act = config_net["hist_past_act"]

        if self.hist_past_act:
            input_dim_mem_net = obs_dim + act_dim
        else:
            input_dim_mem_net = obs_dim
        if self.model_type == "critic":
            self.output_dim = 1
            input_dim_fet_net = obs_dim + act_dim
            # I.e. use linear activation function according to documentation
            activation_output = None
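        elif self.model_type == "actor":
            self.output_dim = act_dim
            input_dim_fet_net = obs_dim
            activation_output = "tanh"
        else:
            # Fail early instead of hitting undefined variables further down.
            raise ValueError(f"model_type must be 'critic' or 'actor', got {model_type!r}")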

        kernel_init = "glorot_uniform"

        # Memory
        # In the discrete action space case, the batched observations and actions fed to the memory network
        # have to be embedded separately, according to the Wolpertinger network architecture.
        if not self.action_space_continuous:
            input_emb_act_net, \
            input_emb_obs_net, \
            dim_output_emb = get_input_emb_layers(act_dim, obs_dim,
                                                  kernel_init,
                                                  input_emb_config=self.mem_net_config,
                                                  past_act=self.hist_past_act)

            self.input_actor_memory_embedding = input_emb_act_net
            self.input_observation_memory_embedding = input_emb_obs_net
            input_dim_mem_net = dim_output_emb if dim_output_emb != 0 else input_dim_mem_net

        # pre-RNN
        self.memory_network = tf.keras.Sequential()
        self.memory_network.add(layers.Input(shape=(None, input_dim_mem_net), ragged=True))
        for hid_layer_size in self.mem_net_config["pre_rnn_hid_sizes"]:
            self.memory_network.add(layers.Dense(hid_layer_size, activation="relu",
                                                 kernel_initializer=kernel_init))

        # RNN
        if len(self.mem_net_config["rnn_hid_sizes"]) >= 1:
            rnn_cells = []
            for hid_layer_size in self.mem_net_config["rnn_hid_sizes"]:
                rnn_cells.append(layers.LSTMCell(units=hid_layer_size, kernel_initializer=kernel_init))
            self.memory_network.add(layers.RNN(cell=rnn_cells))

        # post-RNN
        for hid_layer_size in self.mem_net_config["post_rnn_hid_sizes"]:
            self.memory_network.add(layers.Dense(hid_layer_size, activation="relu",
                                                 kernel_initializer=kernel_init))

        # Feature extraction
        # According to the Wolpertinger architecture, observations and actions have to be embedded separately.
        if not self.action_space_continuous and self.model_type == "critic":
            input_emb_act_net, \
            input_emb_obs_net, \
            dim_output_emb = get_input_emb_layers(act_dim, obs_dim,
                                                  kernel_init,
                                                  input_emb_config=self.fet_net_config,
                                                  past_act=None)
            self.input_actor_features_embedding = input_emb_act_net
            self.input_observation_features_embedding = input_emb_obs_net
            input_dim_fet_net = dim_output_emb if dim_output_emb != 0 else input_dim_fet_net

        self.features_network = tf.keras.Sequential()
        self.features_network.add(layers.Input(shape=input_dim_fet_net))
        for hid_layer_size in self.fet_net_config["fet_ext_hid_sizes"]:
            self.features_network.add(layers.Dense(hid_layer_size, activation="relu",
                                                   kernel_initializer=kernel_init))

        # Combination of memory and feature extraction

        self.combination_network = tf.keras.Sequential()
        # Retrieve the name of the most recent block of memory network that has a non-empty value for output units
        # and accordingly get value.
        layer_name = [layers_name for layers_name, layers_struct in self.mem_net_config.items() if layers_struct]
        output_size_fet_net = self.fet_net_config["fet_ext_hid_sizes"][-1] if self.fet_net_config[
            "fet_ext_hid_sizes"] else input_dim_fet_net
        input_dim_comb_net = self.mem_net_config[layer_name[-1]][-1] + output_size_fet_net
        self.combination_network.add(layers.Input(shape=input_dim_comb_net))

        for hid_layer_size in self.comb_net_config["comb_net_hid_sizes"]:
            self.combination_network.add(layers.Dense(hid_layer_size, activation="relu",
                                                      kernel_initializer=kernel_init))

        self.combination_network.add(layers.Dense(self.output_dim, activation=activation_output,
                                                  kernel_initializer=kernel_init))

    def call(self, inputs: dict) -> tf.Tensor:
        memory = inputs["memory"]
        features = inputs["features"]
        hist_obs = memory["obs"]
        hist_act = memory["act"]

        obs = features["obs"]
        act = features["act"]

        if self.action_space_continuous:
            if self.hist_past_act:
                memory_inputs = tf.concat([hist_obs, hist_act], -1)
            else:
                memory_inputs = hist_obs
        else:
            if self.hist_past_act:
                out_emb_act_mem = self.input_actor_memory_embedding(hist_act)
                out_emb_obs_mem = self.input_observation_memory_embedding(hist_obs)
                memory_inputs = tf.concat([out_emb_obs_mem, out_emb_act_mem], -1)
            else:
                memory_inputs = self.input_observation_memory_embedding(hist_obs)

        # Feed memory net
        memory_outputs = self.memory_network(memory_inputs)

        # Feed extraction net

        if self.model_type == "critic":
            if self.action_space_continuous:
                features_inputs = tf.concat([obs, act], -1)
            else:
                out_emb_act_fet = self.input_actor_features_embedding(act)
                out_emb_obs_fet = self.input_observation_features_embedding(obs)
                features_inputs = tf.concat([out_emb_obs_fet, out_emb_act_fet], -1)
        elif self.model_type == "actor":
            features_inputs = obs

        features_outputs = self.features_network(features_inputs)

        # Post-combination
        comb_inputs = tf.concat([memory_outputs, features_outputs], -1)
        model_output = self.combination_network(comb_inputs)

        return model_output


### Agent 


---

The *ActorCriticAgent* class represents the entity that interacts with the environment. It is therefore structured to include instances of both the actor and the critic networks. Two types of agent are possible: *main* and *target*. They reflect the different moments at which the networks are used in the algorithm: the *main* agent produces the action used to interact with the environment, whereas the *target* agent produces the actions used during the update to build the Bellman target. Both use their critic networks during the update phase.

The *get_action* method decides which type of action to produce based on the type of action space (continuous or discrete).  


The standard deviation of the noise differs between the two agent types, whereas the network architecture and the action space parameters are the same. 


#### Action space

Definitions shared by both cases:

* $\large O,A$ are the observation and action spaces, and $\large ℝ^n$ is the continuous real space of dimension *n*, where *n* is the action dimension   
* $\large a_{\text{low}}, a_{\text{high}}$ are the action space boundaries that define the set of admissible actions; they are specific to the environment.
* $\Large \mu_{\theta}$ is the actor network 
* $\large ϵ $ is a random sample drawn from normally distributed noise with zero mean and standard deviation $\large σ$.
* $\large o_t, h^l_t$ are the observation and the history of length *l* at time *t*.

The equations below describe the case in which the policy is used to interact with the environment; during training the random noise is also clipped. 

---

##### Continuous  

$$ \boxed{\text{Equations describing how the continuous policy is composed}} $$
$$ ⇓ $$

$$\boxed{\large  \mu_{θ}: O, A → ℝ^n \\  
\mu_{θ}(o_t, h^l_t) = {\textbf{a}}}$$

$$ ⇓ $$
<center><strong>Exploration</strong></center>
$$\boxed{\large {\textbf{a}} = \text{clip}({\textbf{a}} + ϵ, a_{\text{low}},a_{\text{high}}), \;\;\;\; ϵ ∼ 𝑁(0, σ)}$$

The *continuous_act* method implements these operations.
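
The exploration step can be sketched without TensorFlow as follows. The function `explore`, the noise scale and the bounds are illustrative assumptions; the bounds of -1 and 1 simply match the range of a tanh output.

```python
import random


def explore(action, sigma, low, high, rng):
    """Add zero-mean Gaussian noise to each action component, then clip to the bounds."""
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in action]


rng = random.Random(0)
policy_output = [0.95, -0.3, 0.0]  # hypothetical actor output for one observation
noisy_action = explore(policy_output, sigma=0.1, low=-1.0, high=1.0, rng=rng)
print(noisy_action)
```

Clipping after the noise is added keeps the executed action admissible even when the policy output is already close to a boundary.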

---

##### Discrete

Here action generation takes two steps, and the whole process defines the *Wolpertinger policy*.
The first step produces a continuous action, and the second maps it onto one of the actions in the discrete set.
A crucial choice is how to represent the action space. This project uses a one-dimensional linear space subdivided by evenly spaced points, each representing one action of the finite set. The proto-action is then treated as a point in this space, and a *k-nearest-neighbour* search finds the *k* closest actions. *k* is expressed as a ratio between zero and one which, multiplied by the number of actions, gives the size of the action subset.
Finally, the action with the highest Q-value, as computed by the critic network, is selected and applied to the environment. 
This approach allows generalization over the action set in logarithmic time as the number of actions grows, unlike the general case in which the complexity grows linearly.  
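
The action representation and the neighbour lookup can be sketched as below. This is a brute-force stand-in written for clarity, not the search used by `Discrete_space`; the helper names and the `[-1, 1]` range of the grid are assumptions.

```python
def action_grid(num_actions, low=-1.0, high=1.0):
    """Evenly spaced points in [low, high], one per discrete action."""
    if num_actions == 1:
        return [(low + high) / 2.0]
    step = (high - low) / (num_actions - 1)
    return [low + i * step for i in range(num_actions)]


def k_nearest_actions(proto_action, grid, knn_ratio):
    """Indices of the k grid points closest to the proto-action.

    k is derived from the ratio as max(1, int(num_actions * knn_ratio)).
    Sorting every distance is O(n log n); it is used here only for clarity.
    """
    k = max(1, int(len(grid) * knn_ratio))
    order = sorted(range(len(grid)), key=lambda i: abs(grid[i] - proto_action))
    return order[:k]


# Four discrete actions, as in Lunar Lander.
grid = action_grid(4)
candidates = k_nearest_actions(proto_action=0.2, grid=grid, knn_ratio=0.5)
print(grid, candidates)
```

The returned indices are the integer actions understood by the environment, while the corresponding grid values are their real-valued representation.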

<img src="https://drive.google.com/uc?id=1Prh5Y5MNbYHV3mv0ZrEbuQAConNWDzgO" 
    width="350" height="500" align="left" style="margin:10px;"/>


$$ \boxed{\text{Equations describing how the discrete policy is composed}} $$
$$ ⇓ $$

$$\boxed{\large  \mu_{θ}: O, A → ℝ^n \\  
\mu_{θ}(o_t, h^l_t) = \hat{\textbf{a}}, \;\;\;\; \hat{\textbf{a}}=\text{proto-action}}$$

$$ ⇓ $$
<center><strong>Exploration</strong></center>
$$\boxed{\large \hat{\textbf{a}} = \text{clip}(\hat{\textbf{a}} + ϵ, a_{\text{low}},a_{\text{high}}), \;\;\;\; ϵ ∼ 𝑁(0, σ)}$$
$$ ⇓ $$

$$\boxed{\large g: ℝ^n → A \\
g_k(\hat{\textbf{a}}) = \text{arg}^k \text{min}_{a∈A}|\textbf{a}-\hat{\textbf{a}}|_2}$$  

$$g_k ∘ \mu_{θ} = A_k, \; \text{the } k \text{ actions closest to the proto-action by } L_2 \text{ distance}$$

$$ ⇓ $$

$$\boxed{\large π_{θ}(\textbf{O}, \textbf{H}_l) = \text{arg max}_{a ∈ A_k} Q_{\phi_1}(O, a, H_l)}$$  
$$ \large \pi_{θ} =  \text{Full Wolpertinger policy}$$

<br>


<br clear="left" /> 

> Fig.3 Wolpertinger policy

It is important to highlight that the final *Wolpertinger* policy $\pi_{θ}$ depends only on the parameters $θ$ of the actor network. This is because, to keep the policy differentiable, both $g_k$ and $\text{argmax}$ are treated as deterministic effects of the environment, so they are not considered during policy training.


The *wolp_act* method includes all these operations. Its most computationally expensive part is replicating the input data *k* times, which is necessary to evaluate the Q-value of each candidate action.
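
The replication step can be pictured for a single observation as follows. The history input is omitted for brevity, and both `wolpertinger_select` and the toy critic `toy_q` are hypothetical stand-ins for the batched TensorFlow code.

```python
def wolpertinger_select(obs, candidate_actions, q_function):
    """Pick the candidate action with the highest Q-value for one observation."""
    # Replicate the observation once per candidate so that every
    # (observation, action) pair can be scored by the critic.
    tiled_obs = [obs] * len(candidate_actions)
    q_values = [q_function(o, a) for o, a in zip(tiled_obs, candidate_actions)]
    best = max(range(len(candidate_actions)), key=lambda i: q_values[i])
    return candidate_actions[best], q_values


def toy_q(obs, action):
    # Hypothetical critic: prefers actions close to the first observation component.
    return -(action - obs[0]) ** 2


obs = [0.3, 0.0]
candidate_actions = [-1.0 / 3.0, 1.0 / 3.0]  # raw values of the k nearest actions
best_action, q_values = wolpertinger_select(obs, candidate_actions, toy_q)
print(best_action, q_values)
```

With a batch of observations the same pairing has to be built for every row, which is what makes the tiling the dominant cost.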


#### Implementation

In [6]:
class ActorCriticAgent(tf.keras.Model):
    """
    Combined actor-critic network.
    According to the TD3 algorithm, one actor and two critic networks are needed.
    Two types of agent are possible: main and target. They differ in how an action is selected.
    For the main agent, exploration noise with a given standard deviation is added to the actor output.
    For the target agent, the noise is clipped before being added to the actor output,
    according to the target policy smoothing described in the TD3 algorithm.
    Furthermore, TD3 relies on DDPG, which works only for continuous action spaces, so in order to also handle
    discrete action spaces this implementation of TD3 has been extended with the Wolpertinger policy.
    After noise is applied to the actor output (the so-called proto-action), this variant performs a further step:
    a k-nearest-neighbour mapping that reduces the continuous proto-action to a
    discrete set of candidate actions.
    For both continuous and discrete action spaces this action selection is performed by the main and target agents,
    during environment interaction and during the update of the critic parameters respectively. In both cases
    the parameters of the actor network are updated using the actor output without any
    added noise.
    """

    def __init__(self, obs_dim, config_net, act_dim, act_space_cont, act_up_limit, act_low_limit,
                 actions, knn_ratio, st_dev_noise, clip_noise, hist_len, agent_type):
        """Initialize."""
        super().__init__()

        self.agent_type = agent_type
        self.actor = ActorCriticModel(obs_dim, config_net["actor_net"], act_dim, act_space_cont, model_type="actor")
        self.critic_1 = ActorCriticModel(obs_dim, config_net["critic_net"], act_dim, act_space_cont,
                                         model_type="critic")
        self.critic_2 = ActorCriticModel(obs_dim, config_net["critic_net"], act_dim, act_space_cont,
                                         model_type="critic")

        self.upper_act_limit = act_up_limit
        self.lower_act_limit = act_low_limit
        self.actions_space_continuous = act_space_cont
        self.std_deviation_act_noise = st_dev_noise
        self.clip_noise = clip_noise
        self.hist_length = hist_len
        # self.random_process = ornstein_uhlenbeck_process(initial_value=0.0, stddev=st_dev_noise)

        if not self.actions_space_continuous:
            num_actions = actions
            self.action_space = Discrete_space(num_actions)
            self.knn = max(1, int(num_actions * knn_ratio))
            self.knn_tensor = tf.constant(self.knn, dtype=tf.int32)

    @tf.function
    def get_action(self, input_act_net):
        if self.actions_space_continuous:
            act = self.continuous_act(input_act_net)
            raw_act = act
            act = tf.squeeze(act)
        else:
            act, raw_act = self.wolp_act(input_act_net)
            act = tf.squeeze(act)
        return act, raw_act

    def continuous_act(self, inputs_act_net):
        action = self.actor(inputs_act_net)

        if self.agent_type == "main":
            action = tf.math.add(action, tf.random.normal(shape=action.shape, stddev=self.std_deviation_act_noise))

            action = tf.clip_by_value(action, clip_value_min=self.lower_act_limit, clip_value_max=self.upper_act_limit)

        elif self.agent_type == "target":
            target_policy_smooth = tf.random.normal(shape=action.shape, stddev=self.std_deviation_act_noise)
            target_policy_smooth = tf.clip_by_value(target_policy_smooth, clip_value_min=-self.clip_noise,
                                                    clip_value_max=self.clip_noise)
            action = tf.math.add(action, target_policy_smooth)
            action = tf.clip_by_value(action, clip_value_min=self.lower_act_limit, clip_value_max=self.upper_act_limit)

        return action

    def wolp_act(self, inputs_act_net):

        proto_action = self.actor(inputs_act_net)

        if self.agent_type == "main":
            proto_action = tf.math.add(proto_action, tf.random.normal(shape=proto_action.shape,
                                                                       stddev=self.std_deviation_act_noise))
            #proto_action = tf.math.add(proto_action, self.random_process())
            proto_action = tf.clip_by_value(proto_action, clip_value_min=self.lower_act_limit,
                                            clip_value_max=self.upper_act_limit)
            # action = tf.math.add(action, self.random_process())
        elif self.agent_type == "target":
            target_policy_smooth = tf.random.normal(shape=proto_action.shape, stddev=self.std_deviation_act_noise)
            # target_policy_smooth = self.random_process()
            target_policy_smooth = tf.clip_by_value(target_policy_smooth, clip_value_min=-self.clip_noise,
                                                    clip_value_max=self.clip_noise)
            proto_action = tf.math.add(proto_action, target_policy_smooth)
            proto_action = tf.clip_by_value(proto_action, clip_value_min=self.lower_act_limit,
                                            clip_value_max=self.upper_act_limit)

        proto_action_input = tf.cast(proto_action, dtype=tf.float64)
        # Raw actions are real values that represent one of the possible discrete set of actions.
        # Actions are the integer values that represent one of the possible actions.
        # Raw actions are used to be stored in replay buffer and used to feed networks, instead actions are used
        #  to interact with environment.
        raw_actions, actions = self.action_space.tf_search_point(proto_action_input, self.knn_tensor)

        obs = inputs_act_net["features"]["obs"]
        last_obs = inputs_act_net["memory"]["obs"]
        last_act = inputs_act_net["memory"]["act"]
        num_raw_actions = tf.cast(obs.get_shape()[0], dtype=tf.int64)

        raw_actions = tf.cast(raw_actions, dtype=tf.float32)

        if self.knn > 1:
            # If knn is greater than one, each observation (with its history sequence) has knn candidate
            # raw actions. To evaluate which raw action is best, i.e. gives the highest Q-value, each
            # observation and its history sequence must be tiled knn times so that
            # (observation, raw action) pairs are formed correctly.
            # The tiling principle is the same for inference and update, but the update case
            # is more complicated because of ragged tensors, so further operations are needed.

            obs_dim, act_dim = last_obs.shape[2], last_act.shape[2]
            s_t = tf.tile(obs, tf.constant(value=[1, self.knn]))
            s_t = tf.reshape(s_t, shape=[num_raw_actions, self.knn, obs_dim])

            if isinstance(last_obs, tf.RaggedTensor) and isinstance(last_act, tf.RaggedTensor):  # Update case
                last_obs_not_ragged, last_act_not_ragged = last_obs.to_tensor(), last_act.to_tensor()
                max_hist_len = tf.cast(self.hist_length, dtype=tf.int64)

                # Create boolean mask
                # Making ones
                total_ones = tf.reduce_sum(last_obs.row_lengths())
                ragged_ones = tf.RaggedTensor.from_row_lengths(tf.ones([total_ones]),
                                                               row_lengths=last_obs.row_lengths())
                # Making zeros
                total_zeros = (max_hist_len * num_raw_actions) - total_ones
                zeros_row_lengths = max_hist_len - last_obs.row_lengths()
                ragged_zeros = tf.RaggedTensor.from_row_lengths(tf.zeros([total_zeros]), row_lengths=zeros_row_lengths)

                bool_mask = tf.cast(tf.concat([ragged_ones, ragged_zeros], axis=1), dtype=tf.bool)
                bool_mask = bool_mask.to_tensor()
                bool_mask = tf.tile(bool_mask, tf.constant(value=[1, self.knn]))
                bool_mask = tf.reshape(bool_mask, shape=[num_raw_actions, self.knn, max_hist_len])

                # Apply boolean mask to last_obs  and last_act tiled and reshaped
                last_obs_not_ragged_tile = tf.tile(last_obs_not_ragged, tf.constant(value=[1, self.knn, 1]))
                last_obs_not_ragged_reshaped = tf.reshape(last_obs_not_ragged_tile, shape=[num_raw_actions, self.knn,
                                                                                           max_hist_len,
                                                                                           obs_dim])

                last_act_not_ragged_tile = tf.tile(last_act_not_ragged, tf.constant(value=[1, self.knn, 1]))
                last_act_not_ragged_reshaped = tf.reshape(last_act_not_ragged_tile, shape=[num_raw_actions, self.knn,
                                                                                           max_hist_len,
                                                                                           act_dim])
                # Obtain reshaped and tiled last_obs and last_act
                last_obs = tf.ragged.boolean_mask(last_obs_not_ragged_reshaped, bool_mask)
                last_act = tf.ragged.boolean_mask(last_act_not_ragged_reshaped, bool_mask)

            elif isinstance(last_obs, tf.Tensor) and isinstance(last_act, tf.Tensor):  # Inference case
                last_obs_tiled = tf.tile(last_obs, tf.constant(value=[1, self.knn, 1]))
                last_act_tiled = tf.tile(last_act, tf.constant(value=[1, self.knn, 1]))
                last_obs = tf.reshape(last_obs_tiled, shape=[num_raw_actions, self.knn,
                                                             last_obs.shape[1], obs_dim])
                last_act = tf.reshape(last_act_tiled, shape=[num_raw_actions, self.knn,
                                                             last_act.shape[1], act_dim])

            fn_to_map = lambda i: self.critic_1(dict(memory={"obs": last_obs[i],
                                                             "act": last_act[i]}, features={"obs": s_t[i],
                                                                                            "act": raw_actions[i]}))
            actions_evaluation = tf.map_fn(fn=fn_to_map, elems=tf.range(num_raw_actions), fn_output_signature=tf.float32)

            # Return the best action, i.e., wolpertinger action from the full wolpertinger policy
            max_index = tf.math.argmax(actions_evaluation, axis=1)
            raw_actions_max = tf.squeeze(tf.gather(raw_actions, indices=max_index, batch_dims=1), axis=1)
            actions_max = tf.gather(actions, indices=max_index, batch_dims=1)

        else:
            raw_actions_max = raw_actions
            actions_max = actions

        return actions_max, raw_actions_max

### Replay buffer 


---
In general, reinforcement learning algorithms are divided into two logical phases: one in which the agent interacts with the environment to collect experience, and one in which the agent trains the network parameters on this data. Whether the data is stored depends on the algorithm design; in the proposed algorithm it is stored in a so-called Replay Buffer, where *replay* indicates that the same experience can be used multiple times during training.    
A replay buffer is a typical tool in algorithms that learn to approximate the optimal action-value function **Q<sup>*</sup>**, as in this case.
This memory stores all of the agent's experience as tuples *(o<sub>t</sub>,  a<sub>t</sub>, r<sub>t</sub>, o<sub>t+1</sub>, d<sub>t</sub>)*, where *d<sub>t</sub>* indicates whether a terminal state is reached after observing *o<sub>t+1</sub>*, i.e., whether the current episode has ended.  
It was introduced to improve training stability, and it should usually be large enough to store a wide range of experiences.
Since the same replay buffer is shared across episodes, sampling a window of past history could also collect experience from a different episode. To avoid this, batches of past history are built with TensorFlow ragged tensors, which form batched tensors with variable sequence lengths that properly reflect the history of observations/actions preceding the sample at time *t*.
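
The episode-boundary rule can be illustrated with a minimal pure-Python sketch. It is not the notebook's TensorFlow code: the helper name `history_indices` is hypothetical, and plain lists stand in for ragged tensors. Given the done flags of the steps preceding time *t*, it keeps only the indices that follow the most recent terminal step.

```python
def history_indices(done_sequence):
    """Return the indices of the history steps that belong to the current episode.

    done_sequence holds the done flags (0 or 1) of the steps preceding time t,
    oldest first. Steps up to and including the last done == 1 belong to an
    earlier episode and are dropped.
    """
    last_done = -1
    for index, done in enumerate(done_sequence):
        if done == 1:
            last_done = index
    return list(range(last_done + 1, len(done_sequence)))


# Each row is one sampled history window. Rows end up with different lengths,
# which is why a ragged (variable-length) batch is needed.
batch_done = [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 0]]
ragged_indices = [history_indices(row) for row in batch_done]
print(ragged_indices)  # [[2, 3, 4], [4], [0, 1, 2, 3, 4]]
```

A window whose last flag is 1 yields an empty list in this sketch; the implementation in this notebook handles that case by substituting a zero sequence of length one.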
     
 


#### Implementation

In [7]:


class ReplayBuffer:
    def __init__(self, obs_dim, act_dim, capacity, device):
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        self.batch_size = 1  # Because one element is added for each episode step
        self.batch_length = capacity
        data_spec = (
            tf.TensorSpec([obs_dim], dtype=tf.float32, name='observation'),
            tf.TensorSpec([act_dim], dtype=tf.float32, name='action'),
            tf.TensorSpec([], dtype=tf.float32, name='reward'),
            tf.TensorSpec([obs_dim], dtype=tf.float32, name='next_observation'),
            tf.TensorSpec([], dtype=tf.float32, name='done'),
        )
        self.buffer = tf_uniform_replay_buffer.TFUniformReplayBuffer(
            data_spec, batch_size=self.batch_size, max_length=self.batch_length, device=device)

    def put(self, observation, action, reward, next_observation, done):
        self.buffer.add_batch((observation, action, reward, next_observation, done))

    @staticmethod
    def get_boolean_mask(done_sequence, batch_size, max_hist_len):
        """
        This method creates a ragged tensor of indices from the done sequence of the elements preceding a
        given time t. It serves to associate each element with its own past experience only, so that
        elements belonging to other episodes are not included.
        For example, if the done batch sequence is [[0, 1, 0, 0, 0], [0, 0, 0, 1, 0]] the returned ragged tensor is
        [[2, 3, 4], [4]], since elements up to and including those where done is 1 are not considered.
        In the special case where a sequence ends with 1, the element at time t is the first one
        in the sampled episode, so no prior experience can be retrieved. The row indices where this
        happens are also returned, in order to add a zero sequence of length one to the final batched sample.

        :param done_sequence: tensor of dimension [batch_size, max_hist_len]
        :param batch_size: number of rows of the done batch
        :param max_hist_len: number of columns of the done batch
        :return: ragged tensor of indices of dimension [batch_size, None], and row indices where no prior experience exists
        """
        coordinates_true_done = tf.where(done_sequence == 1)
        unique_obj = tf.unique_with_counts(coordinates_true_done[:, 0])
        rows_with_ones = unique_obj.y

        lengths_rows_with_ones = unique_obj.count
        row_lengths = tf.zeros(shape=[batch_size], dtype=tf.int32)
        row_lengths = tf.tensor_scatter_nd_update(row_lengths, tf.expand_dims(rows_with_ones, axis=-1),
                                                  lengths_rows_with_ones)

        tmp = tf.RaggedTensor.from_row_lengths(coordinates_true_done, row_lengths)
        max_index_for_row = tf.reduce_max(tmp, axis=1)[:, 1:]
        max_index_for_row = tf.where(max_index_for_row >= 0, tf.math.add(max_index_for_row, 1), 0)

        row_indexes_no_mem = tf.where(max_index_for_row[:, 0] == max_hist_len)
        tmp_mod = tf.where(max_index_for_row == max_hist_len, tf.math.add(max_index_for_row, -1), max_index_for_row)
        ragged_indices_to_preserve = tf.ragged.range(starts=tmp_mod[:, 0], limits=max_hist_len)

        return ragged_indices_to_preserve, row_indexes_no_mem

    @tf.function
    def sample_batch_with_history(self, batch_size, max_hist_len):
        """
        Get a random batch sample. The total sequence length sampled is max_hist_len + 2, because
        max_hist_len elements represent the history of observations at time t, which sits at position max_hist_len + 1.
        The last position, max_hist_len + 2, holds the elements at time t+1, used to fetch the action history at time t+1.
        Returns a dictionary of the sampled quantities, in which the history entries are ragged batched tensors.

        :param batch_size: number of element in batch
        :param max_hist_len: sequence length
        :return: dictionary including the sampled quantities
        """

        sampled_batch_with_history = self.buffer.get_next(sample_batch_size=batch_size, num_steps=max_hist_len + 2)

        # According to data spec of replay buffer in constructor
        # 0 - Observation at time t
        # 1 - Action at time t
        # 2 - Reward at time t
        # 3 - Observation at time t+1
        # 4 - Done signal at time t

        obs_batch = sampled_batch_with_history[0][0][:, -2, :]
        act_batch = sampled_batch_with_history[0][1][:, -2, :]
        rew_batch = tf.expand_dims(sampled_batch_with_history[0][2][:, -2], axis=-1)
        next_obs_batch = sampled_batch_with_history[0][3][:, -2, :]
        done_batch = tf.expand_dims(sampled_batch_with_history[0][4][:, -2], axis=-1)

        # Two boolean masks are needed, one for time t and one for time t+1
        done_history_seq_t = sampled_batch_with_history[0][4][:, :max_hist_len]
        done_history_seq_next_t = sampled_batch_with_history[0][4][:, 1:max_hist_len + 1]

        range_indices_t, indexes_no_mem_t = self.get_boolean_mask(done_history_seq_t, batch_size, max_hist_len)
        range_indices_next_t, indexes_no_mem_next_t = self.get_boolean_mask(done_history_seq_next_t, batch_size,
                                                                            max_hist_len)

        hist_obs_batch = sampled_batch_with_history[0][0][:, :max_hist_len, :]
        hist_act_batch = sampled_batch_with_history[0][1][:, :max_hist_len, :]
        hist_next_obs_batch = sampled_batch_with_history[0][3][:, :max_hist_len, :]
        hist_next_act_batch = sampled_batch_with_history[0][1][:, 1:max_hist_len + 1, :]

        if len(indexes_no_mem_t) != 0:
            num_seq_to_update = len(indexes_no_mem_t)
            hist_obs_batch = tf.tensor_scatter_nd_update(hist_obs_batch,
                                                         indexes_no_mem_t,
                                                         tf.zeros([num_seq_to_update, max_hist_len, self.obs_dim]))
            hist_act_batch = tf.tensor_scatter_nd_update(hist_act_batch,
                                                         indexes_no_mem_t,
                                                         tf.zeros([num_seq_to_update, max_hist_len, self.act_dim]))

        if len(indexes_no_mem_next_t) != 0:
            num_seq_to_update = len(indexes_no_mem_next_t)
            hist_next_obs_batch = tf.tensor_scatter_nd_update(hist_next_obs_batch,
                                                              indexes_no_mem_next_t,
                                                              tf.zeros([num_seq_to_update, max_hist_len, self.obs_dim]))

            hist_next_act_batch = tf.tensor_scatter_nd_update(hist_next_act_batch,
                                                              indexes_no_mem_next_t,
                                                              tf.zeros([num_seq_to_update, max_hist_len, self.act_dim]))

        hist_obs_batch = tf.gather(hist_obs_batch, range_indices_t, batch_dims=1)
        hist_act_batch = tf.gather(hist_act_batch, range_indices_t, batch_dims=1)
        hist_next_obs_batch = tf.gather(hist_next_obs_batch, range_indices_next_t, batch_dims=1)
        hist_next_act_batch = tf.gather(hist_next_act_batch, range_indices_next_t, batch_dims=1)

        batch_sampled = {"obs": obs_batch,
                         "act": act_batch,
                         "rew": rew_batch,
                         "next_obs": next_obs_batch,
                         "done": done_batch,
                         "hist_obs": hist_obs_batch,
                         "hist_act": hist_act_batch,
                         "hist_next_obs": hist_next_obs_batch,
                         "hist_next_act": hist_next_act_batch}

        return batch_sampled

### Training 

---

#### Critic 

---

TD3 concurrently learns two **Q**-functions, $Q_{\phi_1}$ and $Q_{\phi_2}$, by minimizing the mean squared Bellman error on a batch randomly sampled from the replay buffer. Since TD3 has two critics, two distinct loss functions are needed, but both share a single target $y$.

$\Large L_{\text{critic}}(\phi_j, {\mathcal B})_{j=1,2} = E_{ \{(h^l_t,h^l_{t+1},o_t,a_t,r_t,o_{t+1},d_t)_i\}^{|{\mathcal B}|}_{i=1}}
{\Big( Q_{\phi_j}(o_t,a_t,h^l_t) - y(r_t,d_t,o_{t+1},h^l_{t+1}) \Big)^2}  
$
> Eq.1 Loss Function


* ${\mathcal B}$ is the batch sampled from the replay buffer, and $|{\mathcal B}|$ is its cardinality.
* $Q_{\phi_j}$ is the Q-value predicted by the main critic network $j$ at time *t*.  
* $y$ is the Bellman backup target.  


$\Large y(r_t,d_t,o_{t+1},h^l_{t+1}) = r_t + \gamma (1-d_t) \min_{j=1,2} Q_{\phi_{j,\text{targ}}}(o_{t+1},a_{t+1},h^l_{t+1})$

> Eq.2 Target 

* $r_t$ is the reward at time *t*.
* $\gamma$ is the discount factor, which sets how much future returns matter. Its value lies between 0 and 1.
* $d_t$ is a boolean value indicating whether the observation at *t+1* led to a terminal state. If it is 1, future rewards are discarded.
* $Q_{\phi_{j,\text{targ}}}$ is the Q-value predicted by the target critic networks.
Target networks were introduced to stabilize the learning process. TD3 additionally uses two target critics and takes the smaller of their two estimates, which helps fend off overestimation of the Q-function.
* $a_{t+1}$ is the action at time *t+1* given by the target policy $μ_{θ_{\text{targ}}}$. For a **discrete** action space it is instead given by the full Wolpertinger policy $\pi_{θ_{\text{targ}}}$.

$\Large a_{t+1} = \text{clip}\left(\mu_{\theta_{\text{targ}}}(o_{t+1},h^l_{t+1}) + \text{clip}(\epsilon,-c,c), a_{Low}, a_{High}\right), \;\;\;\;\; \epsilon \sim \mathcal{N}(0, \sigma_{\text{targ}})$

> Eq 3. Target policy smoothing

* $c$ is the clipping value for the normally distributed noise. It ensures that the resulting action does not move too far from the one predicted by the target policy.
* $a_{Low}, a_{High}$ are the boundaries that define the set of valid actions for the current environment.  
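
Target policy smoothing (Eq. 3) can be sketched for a single scalar action in plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, $\sigma_{\text{targ}}$ is taken to be a standard deviation, and the numeric values are arbitrary rather than taken from the notebook's configuration.

```python
import random


def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(value, high))


def smoothed_target_action(mu_target, sigma_targ, c, a_low, a_high, rng):
    """Apply Eq. 3 to a scalar action proposed by the target policy."""
    epsilon = rng.gauss(0.0, sigma_targ)           # epsilon ~ N(0, sigma_targ)
    noise = clip(epsilon, -c, c)                   # bound the perturbation by c
    return clip(mu_target + noise, a_low, a_high)  # stay inside the valid actions


rng = random.Random(0)
# Arbitrary illustrative values: target action 0.9, valid action range [-1, 1].
actions = [smoothed_target_action(0.9, 0.2, 0.5, -1.0, 1.0, rng) for _ in range(1000)]
print(min(actions), max(actions))
```

Every sampled action stays within $c$ of the target action and never leaves $[a_{Low}, a_{High}]$, which is the effect the two nested clips are meant to guarantee.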

<br>

$\Large \text{min}_{ϕ_j} \; \{L_{\text{critic}}(\phi_j, {\mathcal B}) \}_{j=1,2}$
> Eq. 4 Optimization

Finally, the parameters ${ϕ_1}$ and ${ϕ_2}$ are optimized to minimize their respective loss functions. Here this is done by gradient descent, with the gradient computed only with respect to the main network parameters ${ϕ_1}$ and ${ϕ_2}$; ${ϕ_{1,\text{targ}}}$, ${ϕ_{2,\text{targ}}}$ and ${θ_{\text{targ}}}$ are treated as constants used to construct the target. 
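
As a worked illustration of Eq. 1 and Eq. 2, the following plain-Python sketch computes the target and the critic loss for scalar values. The function names and the numbers are hypothetical and chosen only to make the arithmetic easy to follow; the Q-values are given directly instead of being produced by networks.

```python
def td3_target(reward, done, q1_targ, q2_targ, gamma):
    """Eq. 2: clipped double-Q target for a single transition."""
    return reward + gamma * (1.0 - done) * min(q1_targ, q2_targ)


def critic_loss(q_predictions, targets):
    """Eq. 1: mean squared Bellman error over a batch."""
    errors = [(q - y) ** 2 for q, y in zip(q_predictions, targets)]
    return sum(errors) / len(errors)


gamma = 0.99
# Non-terminal transition: the smaller target estimate (4.0) is bootstrapped.
y_running = td3_target(reward=1.0, done=0.0, q1_targ=5.0, q2_targ=4.0, gamma=gamma)
# Terminal transition: the future value is discarded, so the target is the reward.
y_terminal = td3_target(reward=1.0, done=1.0, q1_targ=5.0, q2_targ=4.0, gamma=gamma)

# The same targets are used for both critics; only the predictions differ.
loss_q1 = critic_loss([5.0, 1.5], [y_running, y_terminal])
loss_q2 = critic_loss([4.5, 0.5], [y_running, y_terminal])
print(y_running, y_terminal, loss_q1 + loss_q2)
```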


##### Implementation

In [8]:
def update_critic_parameters(data, optimizer, discount_factor, main_agent, target_agent):
    # Input at time t
    inputs_obs = {"memory": {"obs": data['hist_obs'],
                             "act": data['hist_act']},
                  "features": {"obs": data['obs'],
                               "act": data['act']}}

    # Input at time t+1
    inputs_next_obs = {"memory": {"obs": data['hist_next_obs'],
                                  "act": data['hist_next_act']},
                       "features": {"obs": data['next_obs'],
                                    "act": None}}
    rewards_batch = data["rew"]
    done_batch = data["done"]

    _, next_act = target_agent.get_action(inputs_next_obs)

    inputs_next_obs["features"]["act"] = next_act
    # Target Q-values
    q1_pi_targ = target_agent.critic_1(inputs_next_obs)
    q2_pi_targ = target_agent.critic_2(inputs_next_obs)
    q_pi_targ = tf.math.minimum(q1_pi_targ, q2_pi_targ)

    backup = rewards_batch + discount_factor * (1 - done_batch) * q_pi_targ

    mse = tf.keras.losses.MeanSquaredError()

    # Compute critic loss for both critic network for main agent
    with tf.GradientTape() as tape:
        q1 = main_agent.critic_1(inputs_obs)
        q2 = main_agent.critic_2(inputs_obs)

        # MSE loss against Bellman backup
        loss_q1 = mse(backup, q1)
        loss_q2 = mse(backup, q2)
        loss_q = loss_q1 + loss_q2

    # Update trainable parameters
    q1_q2_parameters = main_agent.critic_1.trainable_variables + main_agent.critic_2.trainable_variables
    grads = tape.gradient(loss_q, q1_q2_parameters)
    optimizer.apply_gradients(zip(grads, q1_q2_parameters))

    return loss_q

#### Actor

---

The update of the policy network $μ_{θ}$ optimizes its parameters so that it outputs actions that maximize $Q_{\phi}$. Since the action space is continuous, $Q_{\phi}$ can be assumed differentiable with respect to the action, so gradient ascent on the expression below solves the optimization problem. This remains valid for discrete action spaces too, since training considers only $μ_{θ}$, which produces a continuous value in $ℝ^n$, and not the full Wolpertinger policy $\pi_{θ}$.

$\Large 
L_{\text{actor}}(θ,{\mathcal B})=  E_{ \{(h^l_t,o_t)_i\}^{|{\mathcal B}|}_{i=1}}Q_{ϕ_1}(o_t,μ_{θ}(o_t,h^l_t),h^l_t), \;\;\;\;\; \text{max}_{\theta} L_{\text{actor}}(θ,{\mathcal B})$

> Eq 5. Optimization for Actor loss function

The gradient is computed only with respect to the parameters $θ$ of the policy network, while the parameters $\phi_1$ are treated as constants. Although TD3 provides two critic networks, the first paper suggests using $\phi_1$ rather than $\phi_2$ for the actor update.
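
The chain rule behind this update can be shown on a toy problem in plain Python. Everything here is an assumption made for illustration: the critic is a fixed hand-written function with a known derivative (no history input, no autodiff), and the policy is the one-parameter map $μ_{θ}(o) = θ \cdot o$. The point is only that gradient ascent on $θ$, with the critic held constant, moves the policy toward the actions the critic rates highest.

```python
def q_value(obs, act):
    """Toy critic, treated as fixed: highest when act == 2 * obs."""
    return -(act - 2.0 * obs) ** 2


def dq_dact(obs, act):
    """Derivative of the toy critic with respect to the action."""
    return -2.0 * (act - 2.0 * obs)


def actor_gradient(theta, observations):
    """Chain rule: dL/dtheta = mean(dQ/da * dmu/dtheta), with mu_theta(o) = theta * o."""
    grads = [dq_dact(o, theta * o) * o for o in observations]
    return sum(grads) / len(grads)


observations = [0.5, 1.0, 1.5]
theta, learning_rate = 0.0, 0.1
for _ in range(200):
    # Gradient ascent on the actor objective; only theta is updated.
    theta += learning_rate * actor_gradient(theta, observations)

print(theta)  # approaches 2.0, the slope that maximizes the toy critic
```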


##### Implementation


In [9]:
def update_actor_parameters(data, optimizer, main_agent):
    inputs_obs = {"memory": {"obs": data['hist_obs'],
                             "act": data['hist_act']},
                  "features": {"obs": data['obs'],
                               "act": None}}

    # Compute actor loss for actor network. Here only parameters of actor net are watched by tape since critic
    # parameters are not required to be updated, but only used for compute the actor loss.
    with tf.GradientTape(watch_accessed_variables=False) as tape:
        tape.watch(main_agent.actor.trainable_variables)
        act_logits = main_agent.actor(inputs_obs)
        inputs_obs["features"]["act"] = act_logits
        q1_pi = main_agent.critic_1(inputs_obs)
        loss_act = tf.reduce_mean(-q1_pi)

    grads = tape.gradient(loss_act, main_agent.actor.trainable_variables)
    optimizer.apply_gradients(zip(grads, main_agent.actor.trainable_variables))

    return loss_act

#### Target 


---

As the TD3 algorithm prescribes, the target parameters are updated less frequently than the main parameters, according to the following expressions:  

$\Large \{ \phi_{j,\text{targ}} \leftarrow \tau \phi_j + (1 - \tau) \phi_{j,\text{targ}} \}_{j=1,2}$ 

$\Large \theta_{\text{targ}} \leftarrow \tau \theta + (1 - \tau) \theta_{\text{targ}}$ 

> Eq.6 Soft Update target parameters

τ is a small number, so each update is a moving average that shifts the previous target parameters slightly toward the main ones.
This is usually called a *soft update*, as opposed to a *hard update*, in which the target parameters are set to exactly match the main parameters. 
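
A minimal plain-Python sketch of Eq. 6 follows, with lists of floats standing in for network weights. The function name `soft_update` and the value of τ are illustrative assumptions, not values read from the notebook's configuration file.

```python
def soft_update(main_params, target_params, tau):
    """Eq. 6: move each target parameter a fraction tau toward its main counterpart."""
    return [tau * p_main + (1.0 - tau) * p_targ
            for p_main, p_targ in zip(main_params, target_params)]


main = [1.0, -2.0]
target = [0.0, 0.0]
tau = 0.005

one_step = soft_update(main, target, tau)      # a single small move toward main
for _ in range(2000):
    target = soft_update(main, target, tau)    # repeated updates approach main

hard = soft_update(main, [0.0, 0.0], tau=1.0)  # tau = 1 reproduces a hard update
print(one_step, target, hard)
```

With a small τ the target parameters lag behind the main ones, which is what keeps the Bellman target slowly varying during training.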

#### Update

The method below shows the update process in its entirety: the update of the critic networks, along with the delayed update of the actor and target networks. 
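
The delayed schedule can be pictured with a small plain-Python sketch that only tracks which networks would be updated at each value of the update counter. The helper `update_schedule` is hypothetical, and the counter is assumed to start from 1, as it does when it is incremented before each update.

```python
def update_schedule(num_updates, policy_delay):
    """List which networks are updated at each update counter value."""
    schedule = []
    for counter in range(1, num_updates + 1):
        updated = ["critics"]               # critics are updated at every step
        if counter % policy_delay == 0:     # delayed policy update
            updated += ["actor", "targets"]
        schedule.append((counter, updated))
    return schedule


for counter, updated in update_schedule(num_updates=4, policy_delay=2):
    print(counter, updated)
```
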

In [10]:
@tf.function
def update(main_agent, target_agent, upd_counter, batch_sizes, max_history_length,
           freq_policy_update, tau_target_update, disc_factor, critic_optimizer, actor_optimizer):
    loss_q, loss_a = tf.constant(0, dtype=tf.float32), tf.constant(0, dtype=tf.float32)
    sampled_batch = replay_buffer.sample_batch_with_history(batch_size=batch_sizes,
                                                            max_hist_len=max_history_length)

    loss_q = update_critic_parameters(sampled_batch, critic_optimizer, disc_factor, main_agent, target_agent)

    # Delayed policy updates
    if upd_counter % freq_policy_update == 0:
        loss_a = update_actor_parameters(sampled_batch, actor_optimizer, main_agent)

        # Soft update of target parameters according to small tau value
        for par_main, par_targ in zip(main_agent.trainable_variables,
                                      target_agent.trainable_variables):
            par_targ.assign(tau_target_update * par_main + (1 - tau_target_update) * par_targ)

    return loss_q, loss_a

### Algorithm 

The next sections will describe the TD3 algorithm in its entirety, including replay buffer initialization, environment interaction and the already explained update phase.


<img src="https://drive.google.com/uc?id=1UcDuD5zEvF_JxspXMM_LuHtWdrLYuMoY"    width="700"/>




#### Replay buffer initialization


---


As described in the pseudocode, the replay buffer is first filled with randomly collected experience, which provides some initial memory for later training.


In [11]:
def run_warmup_episodes(num_episodes_warmup):
    """
    Warmup episodes are executed in order to fill the replay buffer memory.
    Randomly selected actions are used to interact with the environment.
    :param num_episodes_warmup: number of warmup episodes to run
    """
    with tqdm.trange(num_episodes_warmup) as warm_episodes:
        for warm_ep in warm_episodes:
            obs_warm = tf.constant(env.reset(), shape=(1, obs_dim), dtype=tf.float32)
            for warm_step in range(episode_steps):
                if continuous:
                    action = tf.convert_to_tensor(env.action_space.sample())
                    raw_action = tf.expand_dims(action, axis=0)
                else:
                    proto_action = tf.random.uniform(minval=lower_act_limit, maxval=upper_act_limit,
                                                     shape=(1, act_dim), dtype=tf.float64)
                    knn = tf.constant(1, dtype=tf.int32)
                    raw_action, action = actor_critic_main.action_space.tf_search_point(proto_action, knn)
                    raw_action = tf.cast(raw_action, dtype=tf.float32)
                    action = tf.squeeze(action)

                next_obs_warm, reward_warm, done_warm = env.tf_step(action)
                done_warm = tf.ones_like(done_warm) if warm_step == episode_steps - 1 else done_warm
                replay_buffer.put(obs_warm, raw_action, reward_warm, next_obs_warm, done_warm)
                obs_warm = next_obs_warm

                if tf.cast(done_warm, dtype=tf.bool):
                    break

            warm_episodes.set_description(f"Warmup episode:[{warm_ep}]")

#### Environment interaction 


---


This step of the algorithm is implemented in a separate function so that it can benefit from TensorFlow graph execution, which reduces execution time.  

*action* is the action in the format accepted by the environment, whereas *raw_action* is the action as returned by the policy.

The history sequence is also updated in this method, since the update relies on the TensorFlow *concat* operation, which can be optimized when it is part of a TensorFlow function.
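
The history update amounts to a sliding window: append the newest element, then keep only the most recent `buffer_l` entries. The plain-Python sketch below uses lists of floats in place of tensors, and the helper name `append_to_history` is hypothetical. It also mirrors the convention used in the training loop, where the first step uses a window of length 1 so that the initial zero placeholder is dropped.

```python
def append_to_history(history, new_item, buffer_l):
    """Append new_item and keep only the most recent buffer_l entries."""
    return (history + [new_item])[-buffer_l:]


history = [0.0]  # zero placeholder used at the start of an episode
# First step: a window of length 1 discards the placeholder.
history = append_to_history(history, 1.0, buffer_l=1)
# Later steps: the window grows up to buffer_l and then slides.
for observation in [2.0, 3.0, 4.0]:
    history = append_to_history(history, observation, buffer_l=3)
print(history)  # [2.0, 3.0, 4.0]
```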

In [12]:
@tf.function
def env_interaction(input_act, main_agent, buffer_l):
    """
    Interaction with the environment is performed through an input dictionary that contains tensors for the memory
    and actor networks. After the interaction, this method returns the step reward, the done signal and the next input
    dictionary, which stores the last buffer_l observations and actions along with the observation for the next step.
    :param input_act: input dictionary with observation at time t and previous observations and actions
    :param main_agent: agent that performs the action
    :param buffer_l: number of previous elements to take into consideration
    :return: next input dictionary for the next step, step reward and done signal
    """
    observation = input_act["features"]["obs"]
    next_input_act = {"memory": {"obs": None,
                                 "act": None},
                      "features": {"obs": None,
                                   "act": None}}

    action, raw_action = main_agent.get_action(input_act)

    next_observation, reward, done = env.tf_step(action)

    next_input_act["features"]["obs"] = next_observation
    last_hist_obs, last_hist_act = input_act["memory"]["obs"], input_act["memory"]["act"]

    observation = tf.expand_dims(observation, axis=0)
    raw_action = tf.expand_dims(raw_action, axis=0)
    # For each iteration stack the last action and observation, keeping only the last buffer_l elements
    next_input_act["memory"]["obs"] = tf.concat([last_hist_obs, observation], axis=1)[:, -buffer_l:, :]
    next_input_act["memory"]["act"] = tf.concat([last_hist_act, raw_action], axis=1)[:, -buffer_l:, :]

    return next_input_act, reward, done

#### Implementation

The variables initialized here are not strictly related to the agents; they concern the experiment in general.

In [13]:
parser = argparse.ArgumentParser()
parser.add_argument('--seed', type=int, default=42, help="Seed for experiment reproducibility")
parser.add_argument('--env_name', type=str, default='HalfCheetahBulletEnv-v0', help="Environment name")
parser.add_argument('--pomdp_type', type=str, default='remove_velocity', help="Type of pomdp observation")
parser.add_argument('--config_filename', type=str, default='config_net.toml', help="Name configuration file")
#parser.add_argument('--num_episodes', type=int, default=1000, help="Number of episodes")
parser.add_argument('--num_warmup_episodes', type=int, default=10, help="Number of warmup episodes")
args = parser.parse_args(args=[])

The code below initializes the environment and the agents, restoring previous values if a related checkpoint is present.

In [None]:

# Avoid warnings to be displayed on console
tf.get_logger().setLevel('ERROR')
tf.config.run_functions_eagerly(False)

# Create the environment
env = POMDPWrapper(env_name=args.env_name, pomdp_type=args.pomdp_type)

# Set seed
env.seed(args.seed)
tf.random.set_seed(args.seed)
np.random.seed(args.seed)

device_name = tf.test.gpu_device_name()

continuous = None
try:  # TD3 continuous action space - normal way
    obs_dim = env.observation_space.shape[0] if len(
        env.observation_space.shape) != 0 else 1
    act_dim = env.action_space.shape[0]
    # Not used: np.inf only indicates that there is an infinite number of actions
    num_actions = np.inf
    upper_act_limit = env.action_space.high
    lower_act_limit = env.action_space.low
    continuous = True
except IndexError:  # TD3 discrete action space using Wolpertinger agent
    obs_dim = env.observation_space.shape[0] if len(
        env.observation_space.shape) != 0 else 1
    act_dim = env.action_space.shape[0] if len(
        env.action_space.shape) != 0 else 1
    num_actions = env.action_space.n
    lower_act_limit = -1.0
    upper_act_limit = 1.0
    continuous = False

config_net, dir_checkpoints = Utility.get_configuration(args.env_name, args.pomdp_type, args.config_filename)
logger = tf.summary.create_file_writer(logdir=dir_checkpoints + "log_dir/", experimental_trackable=True,
                                        max_queue=10)
logger_test = tf.summary.create_file_writer(logdir=dir_checkpoints + "log_dir/test/", max_queue=10)

# Hyper parameters from arg parsing

#num_episodes = args.num_episodes
warmup_episodes = args.num_warmup_episodes

# Hyper parameters from config file

hyper_parameters = config_net["hyper_parameters"]
knn_ratio = hyper_parameters["knn_ratio"]
buffer_capacity = hyper_parameters["replay_buffer_capacity"]
episode_steps = hyper_parameters["steps_per_episodes"]
buffer_length = hyper_parameters["buffer_length"]
gamma = hyper_parameters["discount_factor"]
tau = hyper_parameters["target_update_rate"]
lr_critic_parameters = hyper_parameters["learning_rate_critic"]
lr_actor_parameters = hyper_parameters["learning_rate_actor"]
batch_size = hyper_parameters["batch_sizes"]
max_hist_length = hyper_parameters["history_length"]
std_dev_act_inf = hyper_parameters["std_dev_actor"]
std_dev_act_update = hyper_parameters["std_dev_actor_target"]
clip_noise = hyper_parameters["clip_noise"]
policy_delay = hyper_parameters["policy_delay"]

# Agent initialization
actor_critic_main = ActorCriticAgent(obs_dim, config_net, act_dim, continuous, upper_act_limit, lower_act_limit, num_actions,
                                      knn_ratio, std_dev_act_inf, clip_noise=None, hist_len=None,
                                      agent_type="main")
actor_critic_target = ActorCriticAgent(obs_dim, config_net, act_dim, continuous, upper_act_limit, lower_act_limit,
                                        num_actions,
                                        knn_ratio, std_dev_act_update, clip_noise=clip_noise, hist_len=max_hist_length,
                                        agent_type="target")
actor_critic_target.set_weights(actor_critic_main.get_weights())

# Replay buffer
replay_buffer = ReplayBuffer(
    obs_dim=obs_dim, act_dim=act_dim, capacity=buffer_capacity, device=device_name)

# Parameters optimizers
critic_opt = tf.keras.optimizers.Adam(learning_rate=lr_critic_parameters)
actor_opt = tf.keras.optimizers.Adam(learning_rate=lr_actor_parameters)

start_episode = tf.Variable(0)
# Count each time an update occurs, and save this variable in checkpoint in order to correctly restart updating.
update_counter = tf.Variable(0)

checkpoint = tf.train.Checkpoint(main_model=actor_critic_main, target_model=actor_critic_target,
                                  rep_buffer=replay_buffer.buffer, critic_optimizer=critic_opt,
                                  actor_optimizer=actor_opt, ep_number=start_episode, update_counter=update_counter,
                                  logger=logger)

checkpoint_manager = tf.train.CheckpointManager(
    checkpoint, directory=dir_checkpoints, max_to_keep=1)

if checkpoint_manager.latest_checkpoint:
    checkpoint.restore(checkpoint_manager.latest_checkpoint)
else:
    # If no previous experience, then execute warmup episodes in order to store data in Replay buffer
    run_warmup_episodes(warmup_episodes)
    start_episode.assign(warmup_episodes - 1)

start_episode.assign_add(1)


This is the main loop, in which the TD3 algorithm is performed and the agents interact with the environment while their weights are updated.

In [15]:
def run_training(num_episodes=1000):
    with tqdm.trange(start_episode.numpy(), num_episodes) as episodes:
            for episode in episodes:
                obs = tf.expand_dims(tf.convert_to_tensor(
                    env.reset(), dtype=tf.float32), axis=0)
                ep_reward = tf.Variable(initial_value=(
                    0,), dtype=tf.float32, shape=(1,))
                ep_length = 0
                loss_critic_values, loss_actor_values = [], []

                act_buffer = tf.Variable(1)
                # Set initial input to actor model where previous experience has been set to zero
                input_act_selection = {"memory": {"obs": tf.zeros([1, 1, obs_dim], dtype=tf.float32),
                                                  "act": tf.zeros([1, 1, act_dim], dtype=tf.float32)},
                                      "features": {"obs": obs,
                                                    "act": None}}

                with logger.as_default(step=episode):

                    for step in tf.range(episode_steps):

                        next_input_act_selection, step_reward, done_step = env_interaction(input_act_selection,
                                                                                          actor_critic_main,
                                                                                          act_buffer)
                        ep_reward.assign_add(step_reward)
                        ep_length += 1

                        obs = input_act_selection["features"]["obs"]
                        next_obs = next_input_act_selection["features"]["obs"]
                        # Get last action, correspondent to the action performed at current step
                        act = next_input_act_selection["memory"]["act"][0][-1:, :]

                        # Force done to one when time horizon has been reached
                        done_step = tf.ones_like(
                            done_step) if step == episode_steps - 1 else done_step
                        replay_buffer.put(obs, act, step_reward,
                                          next_obs, done_step)

                        input_act_selection = next_input_act_selection
                        act_buffer.assign(buffer_length)

                        # Update the parameters at every step
                        update_counter.assign_add(1)
                        loss_critic, loss_actor = update(actor_critic_main, actor_critic_target, update_counter,
                                                        batch_size, max_hist_length, policy_delay, tau, gamma,
                                                        critic_opt, actor_opt)

                        loss_critic_values.append(loss_critic.numpy())
                        if update_counter.numpy() % policy_delay == 0:
                            loss_actor_values.append(loss_actor.numpy())

                        if tf.cast(done_step, dtype=tf.bool):
                            break

                    checkpoint_manager.save()
                    checkpoint.ep_number.assign_add(1)
                    episodes.set_description(f"Episode:[{episode}]")
                    episodes.set_postfix(episode_reward=ep_reward.numpy()[
                                        0], episode_length=ep_length)
                    tf.summary.scalar(name="Ep_reward",
                                      data=ep_reward.numpy()[0], step=episode)
                    tf.summary.scalar(name="Ep_length",
                                      data=ep_length, step=episode)
                    tf.summary.scalar(name="Loss_critic_mean", data=statistics.mean(
                        loss_critic_values), step=episode)
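                    # The actor is updated only every `policy_delay` steps, so a short
                    # episode may record no actor loss; skip logging in that case
                    if loss_actor_values:
                        tf.summary.scalar(name="Loss_actor_mean", data=statistics.mean(
                            loss_actor_values), step=episode)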

In [None]:
# Perform the training algorithm
run_training()
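The training loop above records a critic loss at every step but an actor loss only when `update_counter` is a multiple of `policy_delay`, which mirrors the delayed policy updates of TD3. The sketch below isolates just that bookkeeping in plain Python. The helper `collect_losses` and the placeholder loss values are assumptions made for illustration and are not part of the notebook.

```python
def collect_losses(num_steps, policy_delay, update_counter=0):
    """Mimic the loss bookkeeping of one episode of the training loop.

    Hypothetical helper: the loss values are placeholders, not real TD3 losses.
    """
    critic_losses, actor_losses = [], []
    for _ in range(num_steps):
        update_counter += 1
        critic_losses.append(1.0)      # the critic is updated at every step
        if update_counter % policy_delay == 0:
            actor_losses.append(0.5)   # the actor is updated every `policy_delay` steps
    return critic_losses, actor_losses, update_counter


# Five steps with policy_delay=2: the actor is updated at counter values 2 and 4
critic, actor, counter = collect_losses(num_steps=5, policy_delay=2)
print(len(critic), len(actor), counter)

# A one-step episode whose single update lands on an odd counter value
short_critic, short_actor, short_counter = collect_losses(
    num_steps=1, policy_delay=2, update_counter=6)
print(len(short_critic), len(short_actor), short_counter)
```

Because the counter is shared across episodes, a very short episode can end without any actor update, in which case the list of actor losses for that episode is empty.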

Test agent performance

In [17]:
def run_test(num_test_episodes=10):

    # Set the action noise standard deviation of the main network to zero, since exploration noise is no longer needed during testing
    actor_critic_main.std_deviation_act_noise = 0
    # env.render(mode='human')
    with tqdm.trange(0, num_test_episodes) as episodes_test:
        for episode_test in episodes_test:
            obs = tf.expand_dims(tf.convert_to_tensor(env.reset(), dtype=tf.float32), axis=0)
            ep_reward = tf.Variable(initial_value=(0,), dtype=tf.float32, shape=(1,))
            ep_length = 0

            act_buffer = tf.Variable(1)
            # Initial actor input: the memory of past observations and actions is set to zero
            input_act_selection = {"memory": {"obs": tf.zeros([1, 1, obs_dim], dtype=tf.float32),
                                              "act": tf.zeros([1, 1, act_dim], dtype=tf.float32)},
                                   "features": {"obs": obs,
                                                "act": None}}

            with logger_test.as_default(step=episode_test):


                for step in tf.range(episode_steps):

                    next_input_act_selection, step_reward, done_step = env_interaction(input_act_selection,
                                                                                       actor_critic_main,
                                                                                       act_buffer)
                    ep_reward.assign_add(step_reward)
                    ep_length += 1

                    # Force done to one when the time horizon has been reached
                    done_step = tf.ones_like(done_step) if step == episode_steps - 1 else done_step

                    input_act_selection = next_input_act_selection
                    act_buffer.assign(buffer_length)

                    if tf.cast(done_step, dtype=tf.bool):
                        break

                episodes_test.set_description(f"Test episode:[{episode_test}]")
                episodes_test.set_postfix(episode_reward=ep_reward.numpy()[0], episode_length=ep_length)
  

In [None]:
# Test agent performance 
run_test()
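Setting `std_deviation_act_noise` to zero in `run_test` turns action selection into a purely deterministic policy. The sketch below shows the idea under the assumption that exploration is done by adding zero-mean Gaussian noise to each action component and clipping to the action bounds. The function `noisy_action` and the bounds are hypothetical; the notebook's actual noise mechanism is defined in an earlier cell and may differ, especially in the discrete case.

```python
import random


def noisy_action(action, std, low=-1.0, high=1.0, rng=random):
    """Add zero-mean Gaussian noise to each component, then clip to [low, high].

    Hypothetical helper used only to illustrate the effect of std = 0.
    """
    return [min(high, max(low, a + rng.gauss(0.0, std))) for a in action]


deterministic = [0.3, -0.7]

# Training: the action is perturbed to encourage exploration
train_action = noisy_action(deterministic, std=0.1, rng=random.Random(0))

# Testing: with std = 0 the noise term vanishes and the action is unchanged
test_action = noisy_action(deterministic, std=0.0)
print(train_action, test_action)
```

With a standard deviation of zero, the evaluation measures the learned policy itself rather than the policy plus exploration noise.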

# Conclusions


---

The images below show the learning curves obtained during training for the continuous and the discrete action spaces.
In the continuous case the agent learns well and reaches performance similar to that reported for the same network in the related paper.

In the discrete case, by contrast, the agent does not perform as well: it appears to be stuck at a suboptimal solution and its learning curve stops improving.
This could be due to the chosen action space representation, together with an exploration method that, after many training episodes, no longer performs as well as it did at the beginning.

In both images the X-axis represents the number of episodes and the Y-axis represents the reward of each episode. The irregular shaded line is the raw reward, while the bold line is the exponential moving average computed by TensorBoard with a smoothing value of 0.99.

<img src="https://drive.google.com/uc?id=1XMYFaBGmKX_PyKSsrTigry9v3rkV96Ac" width="900" height="400"/>

> Fig 1. Continuous action space.

<img src="https://drive.google.com/uc?id=1F1Mws7gvZMWhiEhiffltZ32TziDnWOvg" width="900" height="400"/>

> Fig 2. Discrete action space. 
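The smoothed curves in the figures above can be approximated offline with a debiased exponential moving average. The sketch below is a minimal version under the assumption that the smoothing follows the usual debiased form; the function `ema_smooth` is hypothetical, and TensorBoard's exact implementation may differ between versions.

```python
def ema_smooth(values, weight=0.99):
    """Debiased exponential moving average of a sequence of rewards.

    `weight` plays the role of the smoothing value and must satisfy 0 <= weight < 1.
    """
    smoothed, last = [], 0.0
    for i, value in enumerate(values, start=1):
        last = weight * last + (1.0 - weight) * value
        # Dividing by (1 - weight**i) corrects the bias toward the zero start value
        smoothed.append(last / (1.0 - weight ** i))
    return smoothed


episode_rewards = [0.0, 10.0]
print(ema_smooth(episode_rewards, weight=0.5))
```

With a weight as high as 0.99 the smoothed line reacts slowly, so short-lived spikes in the episode reward barely move it.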
