# Notice

Original work Copyright (c) June 2020, Sergei Sovik <sergeisovik@yahoo.com>

Permission to use, copy, modify, and/or distribute this software for any purpose with or without fee is hereby granted, provided that the above copyright notice and this permission notice appear in all copies.

The software is provided `as is` and the author disclaims all warranties with regard to this software including all implied warranties of merchantability and fitness. In no event shall the author be liable for any special, direct, indirect, or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of this software.

# Foreword
The algorithm had to be implemented with the old Tensorflow functions and with Eager execution disabled, because the new functions of Tensorflow 2.2 caused a large memory leak: 32 GB of my memory were consumed in literally one hour.

The names of variables and classes follow a convention drawn from more than 25 years of programming experience. It may look unusual and is not a generally accepted standard, but it should be intuitive for readers.

This article was written to widen the circle of users of the `Soft Actor Critic` algorithm. At the time of writing it is, in my opinion, the best algorithm available, yet all existing articles are written in a language incomprehensible to many programmers.

For those who do not know what a `Computational Graph` is and do not want to go into details: it is a model that describes the relationships between all calculations, including their order of execution. Each operation is called a `Node`. A `Computational Graph` resembles a block diagram with many possible inputs and outputs. When the neural network engine is asked for the result of a `Node`, it calculates all of that node's dependencies and requests input data where necessary.
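The idea can be illustrated without Tensorflow. The sketch below is a minimal, assumed model of a graph: the `Node` class, its fields and its `eval` method are hypothetical names made up for this illustration and are not part of any library.

```python
# Minimal illustration of a computational graph (hypothetical names, not the Tensorflow API)
class Node:
    def __init__(self, fnOperation=None, aDependencies=(), sName=None):
        # Operation to apply to the dependency results, `None` for an input node
        self.fnOperation = fnOperation
        # Nodes whose results are required before this node can be calculated
        self.aDependencies = list(aDependencies)
        # Name used to look up the value of an input node
        self.sName = sName

    # Calculate the result of the node: dependencies first, inputs on request
    def eval(self, oInputs, oCache=None):
        oCache = {} if oCache is None else oCache
        if self not in oCache:
            if self.fnOperation is None:
                # Input node: the value must be supplied by the caller
                oCache[self] = oInputs[self.sName]
            else:
                aValues = [oNode.eval(oInputs, oCache) for oNode in self.aDependencies]
                oCache[self] = self.fnOperation(*aValues)
        return oCache[self]

# Graph for (x + y) ** 2
oX = Node(sName='x')
oY = Node(sName='y')
oSum = Node(lambda a, b: a + b, [oX, oY])
oResult = Node(lambda a: a * a, [oSum])

# Requesting `oResult` pulls in `oSum`, which in turn requests both inputs
fValue = oResult.eval({'x': 2.0, 'y': 3.0})
```

Only the nodes that the requested result depends on are calculated, which is the property the rest of the article relies on.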

# Algorithm `Genetic Soft Actor Critic`
The algorithm is implemented as a single graph, which reduces the amount of data exchanged with the GPU and speeds up the learning process.

The algorithm consists of four main blocks:
- Block `Neural Network`
- Block `Player`
- Block `Genetic Replay Buffer`
- Block `Trainer`

Each block can run in parallel or in series with the others.

## Block `Neural Network`
A neural network consists of several independently trained blocks:
- Two subnet clones `Trainer Actor` and `Target Actor`
- Two duplicate subnets `Trainer Critic`
- Two subnets `Target Critic`
- Coefficient `Alpha Regulator`

### Two subnet clones `Trainer Actor` and `Target Actor`
The `Target Actor` is a complete copy of the `Trainer Actor` neural network. It exists only so that training and filling the `Replay Buffer` with new data can run in parallel.

### Two duplicate subnets `Trainer Critic`
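Two critics are trained independently to reduce value-estimation errors. In the standard `Soft Actor Critic` scheme the lower of the two predictions is used, which limits overestimation.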

### Two subnets `Target Critic`
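Slowly updated counterparts of the `Trainer Critic` subnets. Their weights follow the trained weights by the moving average method, which keeps learning smooth.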
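A minimal sketch of the moving average update on plain Python lists, assuming the same formula as `fnSoftUpdate` in the Tensorflow extension section and the default `fTrainableTargetAverageUpdateCoef = 0.005` from the config. The function name `fnSoftUpdateSketch` is hypothetical.

```python
# Moving average update: target = target * (1 - coef) + source * coef
def fnSoftUpdateSketch(afTarget, afSource, fUpdateCoef=0.005):
    fForgetCoef = 1.0 - fUpdateCoef
    return [fT * fForgetCoef + fS * fUpdateCoef for fT, fS in zip(afTarget, afSource)]

# Target weights start far from the trained weights and approach them gradually
afTarget = [0.0, 0.0]
afSource = [1.0, -2.0]
for _ in range(1000):
    afTarget = fnSoftUpdateSketch(afTarget, afSource)
```

With a coefficient of 0.005 the target needs on the order of a thousand updates to come within one percent of the source, so a sudden change in the trained critic reaches the target only gradually.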

### Coefficient `Alpha Regulator`
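Fine-tunes the learning process to increase accuracy. In `Soft Actor Critic` this coefficient weights the entropy term, i.e. it sets the balance between exploring new actions and repeating known good ones.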

## Block `Player`
There is an environment in which certain actions must be carried out to achieve a goal. To simplify understanding, let's call the environment the `Game`. The task of the `Player` is to collect observation data from the `Game`, to perform actions, and either to receive a `Reward` from the `Game` or to make its own `Rating` of these actions. Every completed action is a step. One step produces the following data set: `Previous Observation`, `Current Observation`, `Completed Action`, `Reward` or `Rating`, `End status`. The decision about which action to take is made by the `Target Actor` network based on the `Previous Observation`. If the decision leads to a situation that can be considered the end, the `Player` completes and resets the `Game`.
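The data set of one step can be pictured as a small record. The sketch below is an illustration only: the `Step` record, `fnPlayEpisode` and the toy game are hypothetical names, and a fixed policy stands in for the `Target Actor` network.

```python
from typing import List, NamedTuple

# Data set produced by one step of the `Player`
class Step(NamedTuple):
    afPrevObservation: List[float]
    afObservation: List[float]
    uAction: int
    fReward: float
    bDone: bool

# Play one episode and return the list of steps
def fnPlayEpisode(fnReset, fnStep, fnPolicy, uStepLimit=1024):
    aSteps = []
    afObservation = fnReset()
    for _ in range(uStepLimit):
        # The action is chosen from the previous observation
        uAction = fnPolicy(afObservation)
        afNext, fReward, bDone = fnStep(uAction)
        aSteps.append(Step(afObservation, afNext, uAction, fReward, bDone))
        afObservation = afNext
        if bDone:
            break
    return aSteps

# Toy game: walk along a line, the game ends two cells away from the start
oState = {'iPosition': 0}

def fnReset():
    oState['iPosition'] = 0
    return [0.0]

def fnStep(uAction):
    # Action 1 moves right, any other action moves left
    oState['iPosition'] += 1 if uAction == 1 else -1
    bDone = abs(oState['iPosition']) >= 2
    fReward = 1.0 if oState['iPosition'] >= 2 else 0.0
    return [float(oState['iPosition'])], fReward, bDone

# A fixed policy that always moves right
aEpisode = fnPlayEpisode(fnReset, fnStep, lambda afObservation: 1)
```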

There are two types of rating:
- Rate rewards for every step
- Rate of the entire episode

Each step is stored in the `Replay Buffer` for further training and is called a `Trajectory`.

### Rate rewards for every step
The `Player` takes an action and immediately writes the following to the `Replay Buffer`: `Previous Observation`, `Current Observation`, `Completed Action`, `Reward`. The `Rating` is then produced by the `Trainer`.

### Rate of the entire episode
The `Player` takes an action and stores the step in the `Temporary Buffer`. At the end of the episode, it calculates the `Rating` of each step and then stores the entire episode in the `Replay Buffer` with the following data: `Previous Observation`, `Current Observation`, `Completed Action`, `Rating`.
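The rating formula itself is not given in this section. A common choice, assumed here, is the discounted sum of the rewards that follow a step, with `fRatingDiscountFactor` from the config as the discount. The function name `fnRateEpisode` is hypothetical.

```python
# Rate every step of a finished episode by the discounted sum of the rewards that follow it
# (assumed formula: rating[i] = reward[i] + discount * rating[i + 1])
def fnRateEpisode(afRewards, fRatingDiscountFactor=0.99):
    afRatings = [0.0] * len(afRewards)
    fRating = 0.0
    # Walk backwards so each rating can reuse the rating of the next step
    for i in reversed(range(len(afRewards))):
        fRating = afRewards[i] + fRatingDiscountFactor * fRating
        afRatings[i] = fRating
    return afRatings

# A discount of 0.5 is used only to keep the numbers easy to follow
afRatings = fnRateEpisode([1.0, 0.0, 2.0], 0.5)
# afRatings is [1.5, 1.0, 2.0]
```

Because the rating of a step depends on everything that happens after it, it can only be calculated once the episode has ended, which is why this mode needs the `Temporary Buffer`.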

## Block `Genetic Replay Buffer`
It is a cyclic `Replay Buffer`: when it overflows, newer data overwrites the oldest. It also includes a `Tree-based buffer of the sum` and a `Tree-based buffer of the maximum`, which are used to calculate the priority of each step stored in the `Replay Buffer`. Used as a pair, the tree-based buffers make the selection of data from the `Replay Buffer` resemble a genetic algorithm. This can greatly accelerate the learning process and reduces the likelihood of the trained model being knocked off course or freezing in a poor state. A poor state can arise when the neural network gets used to poor results.
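A minimal sketch of a cyclic buffer with a sum tree, assuming the usual layout where each parent holds the sum of its two children. It covers only the `Tree-based buffer of the sum`; the maximum tree and the genetic selection are not reproduced, and the class and method names are hypothetical.

```python
import random

class SumTreeBuffer:
    def __init__(self, uCapacity):
        # In this sketch the capacity must be a power of two
        self.uCapacity = uCapacity
        # Node 1 is the root, leaves occupy indices uCapacity .. 2 * uCapacity - 1
        self.afTree = [0.0] * (2 * uCapacity)
        self.aoData = [None] * uCapacity
        self.uNext = 0

    # Set the priority of a stored step and refresh the sums up to the root
    def setPriority(self, uIndex, fPriority):
        uNode = uIndex + self.uCapacity
        self.afTree[uNode] = fPriority
        uNode //= 2
        while uNode >= 1:
            self.afTree[uNode] = self.afTree[2 * uNode] + self.afTree[2 * uNode + 1]
            uNode //= 2

    # Store a step, overwriting the oldest one when the buffer is full
    def add(self, oStep, fPriority):
        uIndex = self.uNext
        self.aoData[uIndex] = oStep
        self.setPriority(uIndex, fPriority)
        self.uNext = (self.uNext + 1) % self.uCapacity
        return uIndex

    # Find the step whose priority interval contains fValue
    def find(self, fValue):
        uNode = 1
        while uNode < self.uCapacity:
            uLeft = 2 * uNode
            if fValue < self.afTree[uLeft]:
                uNode = uLeft
            else:
                fValue -= self.afTree[uLeft]
                uNode = uLeft + 1
        return uNode - self.uCapacity

    # Draw a step index with probability proportional to its priority
    def sample(self, oRandom):
        return self.find(oRandom.random() * self.afTree[1])

oBuffer = SumTreeBuffer(4)
oBuffer.add('a', 1.0)
oBuffer.add('b', 3.0)

# Step 'b' holds 3/4 of the total priority, so it is drawn about three times as often as 'a'
oRandom = random.Random(0)
auCounts = [0, 0, 0, 0]
for _ in range(10000):
    auCounts[oBuffer.sample(oRandom)] += 1
```

Both updating a priority and drawing a sample take a number of operations proportional to the depth of the tree, which matters for a buffer of `128 * 1024` steps.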

## Block `Trainer`
The control center of the algorithm: it drives all the other blocks.

The training cycle for each step of the `Player`:
1. Select a batch of `Trajectories` from the `Genetic Replay Buffer`, taking into account the priorities.
2. Train two `Trainer Critics` independently.
3. Update the `Target Critics` using the `Moving Average` method.
4. Train `Trainer Actor` and `Alpha Regulator`.
5. Update the `Target Actor`.
6. Update priorities in the `Genetic Replay Buffer` for processed steps from a batch.

The learning process is standard: forward propagation, loss calculation, gradient calculation, back propagation.
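The four stages can be seen on the smallest possible model, a single weight. This is an illustration only: plain gradient descent stands in for the `AMSgrad` optimizer used later in the article, and all names and numbers are chosen for the example.

```python
# One-weight linear model y = w * x trained on a single example
fWeight = 0.0
fLearningRate = 0.1
fInput, fTarget = 2.0, 6.0

for _ in range(100):
    # Forward propagation
    fPrediction = fWeight * fInput
    # Loss calculation
    fLoss = 0.5 * (fPrediction - fTarget) ** 2
    # Gradient of the loss with respect to the weight (back propagation through w * x)
    fGradient = (fPrediction - fTarget) * fInput
    # Weight update
    fWeight -= fLearningRate * fGradient
```

The weight converges to 3, the value at which the prediction equals the target.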

# Example results of training `LunarLander-v2`
<table style="float:left;">
    <tr>
        <td style="text-align: center;">Average Score per Episode</td>
        <td style="text-align: center;">Average Steps per Episode</td>
    </tr><tr>
        <td><img src="GSAC-Score.svg" width="320pt"></td>
        <td><img src="GSAC-Steps.svg" width="320pt"></td>
    </tr>
</table>

# Install

In [None]:
!pip install gym[box2d]
!pip install Box2D

# Config

In [1]:
# Use GPU?
bGPU = False
# Size of single layer of neural network `Encoder`
uEncoderLayerSize = 64
# Model and log name
sName = "%d" % uEncoderLayerSize
# Maximum number of steps per episode
uEpisodeStepLimit = 1024
# Size of replay buffer, must contain at least 100 episodes
uReplayCapacity = 128 * 1024
# Batch size of training data
uBatchSize = 256
# Restore graph from checkpoint?
bRestore = False
# Do on screen render?
bRender = False
# Rate the entire episode?
bEpisodeRating = False
# Prioritize replay buffer?
bPriorityMode = False
# Factor of discounting rating
fRatingDiscountFactor = 0.99
# Update coefficient of target neural network
fTrainableTargetAverageUpdateCoef = 0.005
# Logging and statistics level
uLogLevel = 2

# Prepare workspace
Import modules and configure.

In [2]:
# Disable Tensorflow console spam
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '1'
# Disable GPU for small networks, CPU is faster
if not bGPU:
    os.environ['CUDA_VISIBLE_DEVICES'] = '-1'

# Math
import numpy as np
import math

# Tensorflow
import tensorflow as tf
import tensorflow.compat.v1 as tf1
import tensorflow.keras.layers as kl
import tensorflow.keras.optimizers as ko
import tensorflow.keras.initializers as ki

# Turn off the allocation of all available memory
if bGPU:
    aoGPUPhysicalList = tf.config.experimental.list_physical_devices('GPU')
    if aoGPUPhysicalList:
        try:
            for oGPUDevice in aoGPUPhysicalList:
                tf.config.experimental.set_memory_growth(oGPUDevice, True)
        except RuntimeError as e:
            pass

# Set default type for keras networks
tf.keras.backend.set_floatx('float32')

# Manual control of graph build and session control, to remove memory leaking problems
tf1.disable_eager_execution()

# Date and time
import time
from datetime import datetime

# Virtual game environment for `Reinforcement learning`
import gym

# Abstract methods support
import abc

In [3]:
import os
import psutil

# Get current process memory usage
def getMemoryUsage():
    oProcess = psutil.Process(os.getpid())
    return "%d:%d" % (oProcess.memory_info().rss, oProcess.memory_info().vms)

# Tensorflow extension
A set of helper functions and classes that simplify working with _Tensorflow_ graphs and sessions.

In [4]:
# Default types
tfFloat = tf.keras.backend.floatx()
tfInt = tf.int32

# Graph array variable initializer (for internal usage only)
class ArrayInitializer(ki.Initializer):
    def __init__(self, aValue):
        super(ArrayInitializer, self).__init__()
        self.aValue = aValue

    def __call__(self, shape, dtype=None):
        return tf.convert_to_tensor(self.aValue, dtype=dtype)

# Create initializer defined by shape and value (for internal usage only)
def tfInitializer(tShape, oValue=None):
    if (oValue is not None) and (not isinstance(oValue, ki.Initializer)):
        if isinstance(oValue, list):
            oValue = np.array(oValue)
        if type(oValue) is np.ndarray:
            if oValue.shape != tShape:
                raise TypeError("Wrong initializer shape %s, must be %s" % (oValue.shape, tShape))
            return ArrayInitializer(oValue)
        elif oValue == 0:
            return ki.Zeros()
        elif oValue == 1:
            return ki.Ones()
        else:
            return ki.Constant(oValue)
    return oValue

# Define local variable, used inside one function block of graph
def tfLocalVariable(sName, oType, oValue=None, bTrainable=False):
    if isinstance(oType, tuple):
        sType, tShape = oType[0], oType[1]
    else:
        sType, tShape = oType, []
    oValue = tfInitializer(tShape, oValue)
    return tf1.get_local_variable(sName, shape=tShape, dtype=sType, initializer=oValue, trainable=bTrainable, use_resource=True)

# Define global variable, used within all graph inside one session
def tfGlobalVariable(sName, oType, oValue=None, bTrainable=False):
    if isinstance(oType, tuple):
        sType, tShape = oType[0], oType[1]
    else:
        sType, tShape = oType, []
    oValue = tfInitializer(tShape, oValue)
    return tf1.get_variable(sName, shape=tShape, dtype=sType, initializer=oValue, trainable=bTrainable, use_resource=True)

# Define constant
def tfConstant(sName, oType, oValue):
    sType = oType[0] if isinstance(oType, tuple) else oType
    return tf1.constant(oValue, dtype=sType, name=sName)

# Define graph input data node
def tfInput(sName, oType):
    if isinstance(oType, tuple):
        sType, tShape = oType[0], oType[1]
    else:
        sType, tShape = oType, []
    return tf1.placeholder(sType, shape=tShape, name=sName)

# Context manager alias: operations created inside it wait for the listed dependencies to complete
class tfWait(object):
    def __init__(self, aDependencies):
        self.tfControl = tf.control_dependencies(aDependencies)

    def __enter__(self):
        return self.tfControl.__enter__()

    def __exit__(self, clsErrorType, errValue, oTraceback):
        return self.tfControl.__exit__(clsErrorType, errValue, oTraceback)

# Current active session
tfActiveSession = None

# Create session for graph execution control
class tfSession(object):
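    def __init__(self, tfoGraph=None, sTarget=''):
        # Fall back to the default graph when no `tfGraph` wrapper is given
        self.tfoSession = tf1.Session(target=sTarget, graph=tfoGraph.tfoGraph if tfoGraph is not None else None)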

    def __enter__(self):
        global tfActiveSession
        self.tfoDefault = self.tfoSession.as_default()
        self.tfoDefault.__enter__()
        self.tfRestoreSession = tfActiveSession
        tfActiveSession = self
        return self

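    def __exit__(self, clsErrorType, errValue, oTraceback):
        global tfActiveSession
        # Leave the default-session context entered in `__enter__` before closing the session
        self.tfoDefault.__exit__(clsErrorType, errValue, oTraceback)
        self.tfoDefault = None
        self.tfoSession.close()
        tfActiveSession = self.tfRestoreSession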

    def initGlobal(self):
        self.tfoSession.run(tf1.global_variables_initializer())

    def initLocal(self):
        self.tfoSession.run(tf1.local_variables_initializer())

    def eval(self, tfoOutputTensor, oInputDictionary=None):
        return self.tfoSession.run(tfoOutputTensor, oInputDictionary)

# Evaluate result of graph node inside current session
def tfEval(tfoOutputTensor, oInputDictionary=None):
    return tfActiveSession.eval(tfoOutputTensor, oInputDictionary)

# Initialize global variables inside current session
def tfInitGlobal():
    tfActiveSession.initGlobal()

# Initialize local variables inside current session
def tfInitLocal():
    tfActiveSession.initLocal()

# Current active graph
tfActiveGraph = None

# Create graph
class tfGraph(object):
    def __init__(self):
        self.tfoGraph = tf.Graph()
        self.tfoDefault = None

    def __enter__(self):
        global tfActiveGraph
        self.tfoDefault = self.tfoGraph.as_default()
        self.tfoDefault.__enter__()
        self.tfRestoreGraph = tfActiveGraph
        tfActiveGraph = self
        return self

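    def __exit__(self, clsErrorType, errValue, oTraceback):
        global tfActiveGraph
        # Leave the default-graph context entered in `__enter__`
        self.tfoDefault.__exit__(clsErrorType, errValue, oTraceback)
        self.tfoDefault = None
        tfActiveGraph = self.tfRestoreGraph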
        
# Alias for optimizer
class AMSgrad(ko.Adam):
    def __init__(self, lr=0.001):
        super(AMSgrad, self).__init__(learning_rate=lr, amsgrad=True)

# Clip gradient values to avoid `inf` and `nan`
def fnClipGradients(aGradients, tfMaxNormal):
    aClippedGradients = []
    for tfoGrad in aGradients:
        if tfoGrad is not None:
            if isinstance(tfoGrad, tf.IndexedSlices):
                tfoTemp = tf.clip_by_norm(tfoGrad.values, tfMaxNormal)
                tfoGrad = tf.IndexedSlices(tfoTemp, tfoGrad.indices, tfoGrad.dense_shape)
            else:
                tfoGrad = tf.clip_by_norm(tfoGrad, tfMaxNormal)
        aClippedGradients.append(tfoGrad)
    return aClippedGradients

# Copy weights from one neural network to another
def fnHardUpdate(tafTargetVariables, tafSourceVariables):
    atopUpdates = []
    tfStrategy = tf.distribute.get_strategy()

    for (tfoTarget, tfoSource) in zip(tafTargetVariables, tafSourceVariables):
        def fnUpdate(tfoTarget, tfoSource):
            return tfoTarget.assign(tfoSource)

        if tf.distribute.has_strategy() and tfoTarget.trainable:
            topUpdate = tfStrategy.extended.update(tfoTarget, fnUpdate, args=(tfoSource,))
        else:
            topUpdate = fnUpdate(tfoTarget, tfoSource)

        atopUpdates.append(topUpdate)
    return tf.group(*atopUpdates)

# Copy weights from one neural network to another using the `moving average` method
def fnSoftUpdate(tafTargetVariables, tafSourceVariables, tfZero, tfOne, tfTrainableTargetAverageForgetCoef, tfTrainableTargetAverageUpdateCoef):
    atopUpdates = []
    tfStrategy = tf.distribute.get_strategy()

    for (tfoTarget, tfoSource) in zip(tafTargetVariables, tafSourceVariables):
        def fnUpdate(tfoTarget, tfoSource):
            if not tfoTarget.trainable:
                tfTargetAverageForgetCoef = tfZero
                tfTargetAverageUpdateCoef = tfOne
            else:
                tfTargetAverageForgetCoef = tfTrainableTargetAverageForgetCoef
                tfTargetAverageUpdateCoef = tfTrainableTargetAverageUpdateCoef

            return tfoTarget.assign(tfoTarget * tfTargetAverageForgetCoef + tfoSource * tfTargetAverageUpdateCoef)

        if tf.distribute.has_strategy() and tfoTarget.trainable:
            topUpdate = tfStrategy.extended.update(tfoTarget, fnUpdate, args=(tfoSource,))
        else:
            topUpdate = fnUpdate(tfoTarget, tfoSource)

        atopUpdates.append(topUpdate)
    return tf.group(*atopUpdates)

# Write detailed summary of tensor, one graph variable
def fnTensorSummary(oSummaryWriter, sTag, tfoVariable, tuStep):
    with oSummaryWriter.as_default():
        with tf.name_scope(sTag):
            return tf.group(
                tf.summary.histogram('Histogram', tfoVariable, tuStep),
                tf.summary.scalar('Mean', tf.reduce_mean(tfoVariable, 'fMean'), tuStep),
                tf.summary.scalar('MeanAbs', tf.reduce_mean(tf.abs(tfoVariable), 'fMeanAbs'), tuStep),
                tf.summary.scalar('Max', tf.reduce_max(tfoVariable), tuStep),
                tf.summary.scalar('Min', tf.reduce_min(tfoVariable), tuStep)
            )

# Write detailed summary of neural network weights
def fnWeightsSummary(oSummaryWriter, zipWeightfoGradientsAndVariables, tuStep):
    aOps = []
    with oSummaryWriter.as_default():
        for tfaGradientsGroup, tfaVariablesGroup in zipWeightfoGradientsAndVariables:
            sGroupName = tfaVariablesGroup.name.replace(':', '_')

            if isinstance(tfaVariablesGroup, tf.IndexedSlices):
                tfaValues = tfaVariablesGroup.values
            else:
                tfaValues = tfaVariablesGroup
            aOps.append(tf.summary.histogram('Weights/' + sGroupName, tfaValues, tuStep))
            aOps.append(tf.summary.scalar('WeightsNorm/' + sGroupName, tf.linalg.global_norm([tfaValues]), tuStep))

            if tfaGradientsGroup is not None:
                if isinstance(tfaGradientsGroup, tf.IndexedSlices):
                    tfaGradients = tfaGradientsGroup.values
                else:
                    tfaGradients = tfaGradientsGroup
                aOps.append(tf.summary.histogram('Gradients/' + sGroupName, tfaGradients, tuStep))
                aOps.append(tf.summary.scalar('GradientsNorm/' + sGroupName, tf.linalg.global_norm([tfaGradients]), tuStep))
    return tf.group(*aOps)

# Set of functions to select action inside discrete models
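The noisy variants in this section add `-log(-log(u))` noise, with `u` uniform on (0, 1), to the unnormalized log probabilities. This is Gumbel noise, and taking the `argmax` of the noisy values is known as the Gumbel-max trick: the index is then drawn with the same probabilities a softmax over the original values would give. The sketch below checks this in plain Python. It is not the Tensorflow code of the article, and the names `fnSoftmax` and `fnSelectNoisyBestSketch` are hypothetical.

```python
import math
import random

# Probabilities that correspond to unnormalized log probabilities
def fnSoftmax(afLogits):
    fMax = max(afLogits)
    afExp = [math.exp(f - fMax) for f in afLogits]
    fSum = sum(afExp)
    return [f / fSum for f in afExp]

# Add Gumbel noise to every value and return the index of the largest one
def fnSelectNoisyBestSketch(afLogits, oRandom):
    afNoisy = []
    for fLogit in afLogits:
        # Bounds of 1e-8 and 1 - 1e-8 keep both logarithms finite
        fUniform = oRandom.uniform(1e-8, 1 - 1e-8)
        afNoisy.append(fLogit - math.log(-math.log(fUniform)))
    return max(range(len(afNoisy)), key=afNoisy.__getitem__)

oRandom = random.Random(0)
afLogits = [1.0, 0.0, -1.0]
auCounts = [0, 0, 0]
for _ in range(20000):
    auCounts[fnSelectNoisyBestSketch(afLogits, oRandom)] += 1

# Observed selection frequencies against the softmax probabilities
afFrequencies = [uCount / 20000 for uCount in auCounts]
afExpected = fnSoftmax(afLogits)
```

The observed frequencies stay close to the softmax probabilities, so the most likely action is chosen most often while the other actions are still explored.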

In [5]:
def fnSelectBest(tafUnnormalizedLogProbabilities):
    return tf.argmax(tafUnnormalizedLogProbabilities, axis=-1, output_type=tfInt)

def fnSelectRandom(tafUnnormalizedLogProbabilities):
    return tf.squeeze(tf.random.categorical(tafUnnormalizedLogProbabilities, 1, dtype=tfInt), axis=-1)

def fnSelectNoisyBest(tafUnnormalizedLogProbabilities):
    with tf1.variable_scope('Const', reuse=tf1.AUTO_REUSE):
        tfTotalMinLogInput = tfConstant('fMinLogInput', tfFloat, 1e-8)
        tfTotalMaxLogInput = tfConstant('fMaxLogInput', tfFloat, 1-1e-8)

    with tf.name_scope('fnSelectNoisyBest'):
        tafRandomUniform = tf.random.uniform(tafUnnormalizedLogProbabilities.shape, minval=tfTotalMinLogInput, maxval=tfTotalMaxLogInput, dtype=tfFloat, seed=None) # pylint: disable=unexpected-keyword-arg
        tafGumbel = -tf.math.log(-tf.math.log(tafRandomUniform)) # pylint: disable=invalid-unary-operand-type
        tafUnnormalizedNoisyLogProbabilities = tafUnnormalizedLogProbabilities + tafGumbel

    return tf.argmax(tafUnnormalizedNoisyLogProbabilities, axis=-1, output_type=tfInt)

def fnSelectNoisyRandom(tafUnnormalizedLogProbabilities):
    with tf1.variable_scope('Const', reuse=tf1.AUTO_REUSE):
        tfTotalMinLogInput = tfConstant('fMinLogInput', tfFloat, 1e-8)
        tfTotalMaxLogInput = tfConstant('fMaxLogInput', tfFloat, 1-1e-8)

    with tf.name_scope('fnSelectNoisyRandom'):
        tafRandomUniform = tf.random.uniform(tafUnnormalizedLogProbabilities.shape, minval=tfTotalMinLogInput, maxval=tfTotalMaxLogInput, dtype=tfFloat, seed=None) # pylint: disable=unexpected-keyword-arg
        tafGumbel = -tf.math.log(-tf.math.log(tafRandomUniform)) # pylint: disable=invalid-unary-operand-type
        tafUnnormalizedNoisyLogProbabilities = tafUnnormalizedLogProbabilities + tafGumbel

    return tf.squeeze(tf.random.categorical(tafUnnormalizedNoisyLogProbabilities, 1, dtype=tfInt), axis=-1)

# Environment model
The environment model holds the current observation state. Given a control command, it calculates the next observation state and the reward for the completed action, and reports whether the episode has ended, i.e. whether the environment must be reset to its initial state.
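A toy environment makes the contract concrete. The class below is a hypothetical illustration that does not depend on `gym`. It follows the same `reset` and `step` convention as the classes in this section, with `step` returning the observation, the reward, the end flag and an info dictionary.

```python
# Toy environment: the goal is to move a counter forward to a fixed position
class CounterEnvironment:
    def __init__(self, uGoal=3):
        self.uGoal = uGoal
        self.uPosition = 0

    # Reset environment. Returns current observation after reset
    def reset(self):
        self.uPosition = 0
        return [float(self.uPosition)]

    # Next step of observation state. Returns new observation, reward, finish state and info
    def step(self, uAction):
        # Action 1 moves forward, any other action stays in place
        if uAction == 1:
            self.uPosition += 1
        bDone = self.uPosition >= self.uGoal
        # Small penalty for every step, reward for reaching the goal
        fReward = 1.0 if bDone else -0.1
        return [float(self.uPosition)], fReward, bDone, {}

oToy = CounterEnvironment()
afObservation = oToy.reset()
aResults = [oToy.step(1) for _ in range(3)]
```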

In [6]:
# Base environment class
class EnvironmentImpl(abc.ABC):
    def __init__(self, uObservationSize, uActionsSize, bDiscrete):
        # Size of observation array
        self.uObservationSize = uObservationSize
        # Size of actions array
        self.uActionsSize = uActionsSize
        # Discrete or continuous environment control
        self.bDiscrete = bDiscrete

    # Reset environment. Returns current observation after reset
    @abc.abstractmethod
    def reset(self):
        pass

    # Next step of observation state. Returns new observation, reward, and finish state
    @abc.abstractmethod
    def step(self, oAction):
        pass

In [7]:
# Environment model for game LunarLander
class CustomEnvironment(EnvironmentImpl):
    def __init__(self, bRender=True):
        sEnvironment = 'LunarLander-v2'
        self.oEnvironment = gym.make(sEnvironment)
        
        # Maximum one episode duration
        # Must be at least 4 times greater than the average episode duration, but not too big
        self.oEnvironment._max_episode_steps = uEpisodeStepLimit * 10
        
        tObservationShape = self.oEnvironment.observation_space.shape
        tActionsShape = (self.oEnvironment.action_space.n,)

        super(CustomEnvironment, self).__init__(
            np.prod(list(tObservationShape)),
            np.prod(list(tActionsShape)),
            True
        )

        self.bRender = bRender

    def reset(self):
        return self.oEnvironment.reset()

    def step(self, uAction):
        aObservation, fReward, bDone, _ = self.oEnvironment.step(uAction)

        if self.bRender:
            self.oEnvironment.render()

        oInfo = {}
        return aObservation, fReward, bDone, oInfo

    def close(self):
        self.oEnvironment.close()

# Neural networks models

In [8]:
# Base neural network model
class ModelImpl(tf.keras.Model):
    def __init__(self, sName=None):
        super(ModelImpl, self).__init__(name=sName)

    # Override method to make Tensorflow define `self._build_input_shape`
    #  `self._build_input_shape` used for logging of neural network model structure 
    def build(self, tInputShape):
        super(ModelImpl, self).build(tInputShape)

# WARNING! Do not use `relu`: its gradient is zero for inputs at or below zero, which causes the `vanishing gradient` problem

# Neural network `Critic`, assesses the potential benefits of the environment observations with actions
class CriticNetwork(ModelImpl):
    def __init__(self, sName=None):
        super(CriticNetwork, self).__init__(sName=sName)
        # `Concatenator` block
        self.fnFlattenObservations = kl.Flatten()
        self.fnFlattenActions = kl.Flatten()
        self.fnConcatInput = kl.Concatenate()
        # `Encoder` block
        self.fnEncoder = tf.keras.Sequential()
        self.fnEncoder.add(kl.Dense(uEncoderLayerSize, activation='elu'))
        self.fnEncoder.add(kl.Dense(uEncoderLayerSize, activation='elu'))
        self.fnEncoder.add(kl.Dense(32, activation='elu'))
        # `Critic` block
        self.fnCritic = kl.Dense(1, activation='linear')

    # Create weights variables for neural network
    def build(self, tObservationShape, tActionsShape):
        super(CriticNetwork, self).build((tObservationShape[0], np.prod(list(tObservationShape[1:])) + np.prod(list(tActionsShape[1:]))))

    # Prepare data using `Concatenator` block for getting result of neural network
    def prepare(self, tafObservations, tafActions):
        tafFlatObservations = self.fnFlattenObservations.call(tafObservations)
        tafFlatActions = self.fnFlattenActions.call(tafActions)
        tafStates = self.fnConcatInput([tafFlatObservations, tafFlatActions])
        return tafStates

    # Calculate result of neural network (predict rating)
    def call(self, tafStates):
        tafEncoded = self.fnEncoder(tafStates)
        tafPredRating = self.fnCritic(tafEncoded)
        return tafPredRating

# Neural network `Discrete actor`, generates unnormalized action log probabilities
class DiscreteActorNetwork(tf.keras.Sequential):
    def __init__(self, uActionCount, sName=None):
        super(DiscreteActorNetwork, self).__init__(name=sName)
        # `Encoder` block
        self.fnEncoder = tf.keras.Sequential()
        self.fnEncoder.add(kl.Dense(uEncoderLayerSize, activation='elu'))
        self.fnEncoder.add(kl.Dense(uEncoderLayerSize, activation='elu'))
        self.fnEncoder.add(kl.Dense(32, activation='elu'))
        # `Actor` block
        self.fnAction = kl.Dense(uActionCount, activation='linear')
        # Sequence of calculations
        self.add(self.fnEncoder)
        self.add(self.fnAction)

# Neural network `Continuous actor`, generates mean and standard deviation of possible actions
class ContinuousActorNetwork(ModelImpl):
    def __init__(self, uActionCount, sName=None):
        super(ContinuousActorNetwork, self).__init__(sName=sName)
        # `Encoder` block
        self.fnEncoder = tf.keras.Sequential()
        self.fnEncoder.add(kl.Dense(uEncoderLayerSize, activation='elu'))
        self.fnEncoder.add(kl.Dense(uEncoderLayerSize, activation='elu'))
        self.fnEncoder.add(kl.Dense(32, activation='elu'))
        # `Actor` block
        # The initial values are set around zero with a slight deviation, to avoid sudden and unexpected actions
        self.fnMean = kl.Dense(uActionCount, activation='linear',
            kernel_initializer=ki.VarianceScaling(0.1),
            bias_initializer=ki.Zeros(),)
        self.fnStd = kl.Dense(uActionCount, activation='linear',
            kernel_initializer=ki.VarianceScaling(0.1),
            bias_initializer=ki.Constant(0.0),)

    # Create weights variables for neural network
    def build(self, tObservationShape):
        super(ContinuousActorNetwork, self).build(tObservationShape)

    # Calculate result of neural network
    def call(self, tafState):
        tafEncoded = self.fnEncoder(tafState)
        tafPredLocation = self.fnMean(tafEncoded)
        tafPredScale = self.fnStd(tafEncoded)
        return tafPredLocation, tafPredScale

# Create neural network `Actor` depending on the type of environment
def CreateActor(oEnvironment, uBatchSize=256, sName='Actor'):
    if oEnvironment.bDiscrete:
        nnActor = DiscreteActorNetwork(oEnvironment.uActionsSize, sName)
    else:
        nnActor = ContinuousActorNetwork(oEnvironment.uActionsSize, sName)
    nnActor.build((uBatchSize, oEnvironment.uObservationSize))
    return nnActor

# Environment controller

In [9]:
# Class `Player` used to control environment via getting actions from neural network `Actor`
# fnPolicy - function to convert log probabilities into action id
class CustomPlayer(object):
    def __init__(self, oEnvironment, nnActor, fnPolicy=fnSelectBest):
        self.oEnvironment = oEnvironment
        self.nnActor = nnActor

        self.reset()

        # Constants used with graph
        with tf1.variable_scope('Const', reuse=tf1.AUTO_REUSE):
            tuActionsSize = tfConstant('uActionsSize', tfInt, self.oEnvironment.uActionsSize)
            tuZero = tfConstant('uZero', tfInt, 0)

        # Construct block of graph to compute actions
        with tf.name_scope('CustomPlayer'):
            # The `fnAction` function block
            with tf.name_scope('fnAction'):
                with tf1.variable_scope('Input', reuse=tf1.AUTO_REUSE):
                    # Input data node of environment observation
                    self.tinafReplayObservation = tfInput('afObservation', (tfFloat, [self.oEnvironment.uObservationSize]))

                # Calculate unnormalized log probabilities of next action
                tafActions = nnActor.call(self.tinafReplayObservation[None, :], training=False)
                self.tafActions = tf.squeeze(tafActions)
                # Convert probabilities into action using policy function, valid action ids are 0 .. uActionsSize - 1
                self.tuAction = tf.squeeze(tf.clip_by_value(fnPolicy(tafActions), tuZero, tuActionsSize - 1))

    # Calculate next action
    def action(self, afObservation):
        return tfEval([self.tafActions, self.tuAction], {self.tinafReplayObservation: afObservation})

    # Reset environment and observation state
    def reset(self):
        afObservation = self.oEnvironment.reset()
        self.afPrevObservation = None
        self.uAction = None
        self.afActions = None
        self.afObservation = np.array(afObservation, dtype=tfFloat).flatten()
        self.uStep = 0
        self.fScore = 0
        self.fReward = 0
        self.fAverageReward = 0
        self.bDone = False

    # Take the next intended action
    def next(self):
        if not self.bDone:
            self.afActions, self.uAction = self.action(self.afObservation)

            if self.uStep >= uEpisodeStepLimit:
                self.uAction = 0
            
            afObservation, self.fReward, bDone, _ = self.oEnvironment.step(self.uAction)

            self.afPrevObservation = self.afObservation
            self.afObservation = np.array(afObservation, dtype=tfFloat).flatten()
            self.uStep += 1
            self.fScore += self.fReward
            self.fAverageReward = self.fAverageReward * 0.999 + self.fReward * 0.001
            if bDone:
                self.bDone = True
        return self.bDone

# Agent `Soft Actor Critic`

In [10]:
class SacAgent(object):
    def __init__(self,
            oPlayer,
            uReplayCapacity=128*1024,
            fRatingDiscountFactor=0.99,
            uBatchSize=256,
            bEpisodeRating=False,
            bPriorityMode=True,
            fTrainableTargetAverageUpdateCoef=0.005,
            tfTrainStepCounter=None,
            uLogLevel=1,
            sLogsPath='logs/',
            sRestorePath='models/'):

        if uReplayCapacity < 2:
            raise ValueError('uReplayCapacity must be greater than 1, but got %d' % uReplayCapacity)

        # Input params

        self.oPlayer = oPlayer
        self.uBatchSize = uBatchSize
        self.bEpisodeRating = bEpisodeRating
        self.bPriorityMode = bPriorityMode
        self.fRatingDiscountFactor = fRatingDiscountFactor

        # Coefficient of smoothing statistics of the average value of gradients
        fGradientNormalUpdateCoef = 0.01
        # Maximum mean gradients value
        fMaxGradientNormal = 200

        self.uLogLevel = uLogLevel

        # Neural network optimizers

        self.koActorOptimizer = AMSgrad(3e-4)
        self.koCriticOptimizer = AMSgrad(3e-4)
        self.koAlphaOptimizer = AMSgrad(3e-4)

        # Summary writer

        dtCurrentTime = datetime.now().strftime("%Y%m%d-%H%M%S")
        sLogPath = sLogsPath + dtCurrentTime
        self.oSummaryWriter = tf.summary.create_file_writer(sLogPath)

        # Temporary buffer for full episode

        if self.bEpisodeRating:
            # Current data size inside temporary buffer
            self.uBufferSize = 0
            # Current temporary buffer capacity
            self.uBufferCapacity = 256

            # Temporary buffer storage: observations, actions, rewards
            self.afPrevObservations = np.zeros((self.uBufferCapacity, self.oPlayer.oEnvironment.uObservationSize), dtype=tfFloat)
            self.afObservations = np.zeros((self.uBufferCapacity, self.oPlayer.oEnvironment.uObservationSize), dtype=tfFloat)
            if self.oPlayer.oEnvironment.bDiscrete:
                self.auActions = np.zeros((self.uBufferCapacity,), dtype=np.int32)
            else:
                self.afActions = np.zeros((self.uBufferCapacity, self.oPlayer.oEnvironment.uActionsSize), dtype=tfFloat)
            self.afRewards = np.zeros((self.uBufferCapacity,), dtype=tfFloat)
            self.afRatings = np.zeros((self.uBufferCapacity,), dtype=tfFloat)

        # Lists of variables for storing on disk

        aTrainVariables = []
        aTargetVariables = oPlayer.nnActor.trainable_variables
        aReplayBufferVariables = []

        # Global constants

        with tf1.variable_scope('Const', reuse=tf1.AUTO_REUSE):
            tfZero = tfConstant('fZero', tfFloat, 0)
            tfHalf = tfConstant('fHalf', tfFloat, 0.5)
            tfOne = tfConstant('fOne', tfFloat, 1)
            tuOne = tfConstant('uOne', tfInt, 1)
            if self.bPriorityMode:
                tuTwo = tfConstant('uTwo', tfInt, 2)

            tuActionsSize = tfConstant('uActionsSize', tfInt, oEnvironment.uActionsSize)
            tuOne64 = tfConstant('uOne64', tf.int64, 1)

            if fMaxGradientNormal is not None:
                tfMaxGradientNormal = tfConstant('fMaxGradientNormal', tfFloat, fMaxGradientNormal)

            if self.uLogLevel > 1:
                tfGradientNormalUpdateCoef = tfConstant('fGradientNormalUpdateCoef', tfFloat, fGradientNormalUpdateCoef)
                tfGradientNormalForgetCoef = tfConstant('fGradientNormalForgetCoef', tfFloat, 1.0 - fGradientNormalUpdateCoef)

            if self.oPlayer.oEnvironment.bDiscrete:
                tfTotalMinLogInput = tfConstant('fMinClip', tfFloat, 1e-8)
                tfTotalMaxLogInput = tfConstant('fMaxClip', tfFloat, 1-1e-8)
            else:
                tfLogSqrtPi2 = tfConstant('fLogSqrtPi2', tfFloat, math.log(math.sqrt(math.pi * 2.0)))
                tfTotalMinScale = tfConstant('fMinScale', tfFloat, -20)
                tfTotalMaxScale = tfConstant('fMaxScale', tfFloat, 2)

            tfTrainableTargetAverageUpdateCoef = tfConstant('fTrainableTargetAverageUpdateCoef', tfFloat, fTrainableTargetAverageUpdateCoef)
            tfTrainableTargetAverageForgetCoef = tfConstant('fTrainableTargetAverageForgetCoef', tfFloat, 1.0 - fTrainableTargetAverageUpdateCoef)

        with tf1.variable_scope('Var', reuse=tf1.AUTO_REUSE):
            # Current episode
            self.tuEpisode = tfGlobalVariable('uEpisode', tfInt, 1)
            aTrainVariables.append(self.tuEpisode)

        # Body of the generic replay buffer

        with tf.name_scope('ReplayBuffer'):
            with tf1.variable_scope('Const', reuse=tf1.AUTO_REUSE):
                # Cyclic buffer capacity
                self.tuReplayCapacity = tfConstant('uCapacity', tfInt, uReplayCapacity)
                # Cyclic buffer start offset
                self.tuReplayStart = tfGlobalVariable('uStart', tfInt, 0)
                aReplayBufferVariables.append(self.tuReplayStart)
                # Cyclic buffer end offset
                self.tuReplayEnd = tfGlobalVariable('uEnd', tfInt, 0)
                aReplayBufferVariables.append(self.tuReplayEnd)

                # Tree-based buffer params
                if self.bPriorityMode:
                    uTreeSize = int(math.pow(2, math.ceil(math.log2(uReplayCapacity))))
                    tuReplayTreeSize = tfConstant('uTreeSize', tfInt, uTreeSize)
                    tuHalfReplayTreeSize = tfConstant('uHalfTreeSize', tfInt, uTreeSize>>1)
                    tauOne = tfConstant('auOne', (tfInt, [1]), [1])
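
                    # Layout note (a sketch inferred from how the trees are indexed in this listing):
                    # both trees are flat arrays of length `2 * uTreeSize` stored in heap order.
                    # Index 1 is the root, node `i` has the children `2 * i` and `2 * i + 1`,
                    # and replay slot `k` lives in leaf `uTreeSize + k`; index 0 is unused.
                    # For example, `uReplayCapacity = 1000` gives `uTreeSize = 1024`,
                    # so replay slot 0 maps to leaf 1024 and each tree array has 2048 entries.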

                    # Coefficients to calculate priority
                    # `priority = clip(pow(error + eps, -power), min, max)`
                    fErrorPower = 0.6
                    fMinPriority = 0.1
                    fMaxPriority = 1.0

                    tfReplayErrorPower = tfConstant('fErrorPower', tfFloat, fErrorPower)
                    tfReplayMinPriority = tfConstant('fMinPriority', tfFloat, fMinPriority)
                    tfReplayMaxPriority = tfConstant('fMaxPriority', tfFloat, fMaxPriority)
                    tafReplayMaxPriority = tfConstant('afMaxPriority', (tfFloat, [1]), [fMaxPriority])
                    tfReplayPriorityEpsilon = tfConstant('fPriorityEpsilon', tfFloat, 0.01)

                    # Coefficients to calculate weights based on priorities
                    fWeightPower = 0.4
                    uWeightFeedSteps = 2e5

                    tfReplayWeightDiff = tfConstant('fWeightDiff', tfFloat, (1.0 - fWeightPower) / float(uWeightFeedSteps))

            with tf1.variable_scope('Var', reuse=tf1.AUTO_REUSE):
                # Cyclic buffer: observations, actions, rewards
                self.tafReplayPrevObservations = tfGlobalVariable('afPrevObservations', (tfFloat, [uReplayCapacity, self.oPlayer.oEnvironment.uObservationSize]), 0)
                aReplayBufferVariables.append(self.tafReplayPrevObservations)
                self.tafReplayObservations = tfGlobalVariable('afObservations', (tfFloat, [uReplayCapacity, self.oPlayer.oEnvironment.uObservationSize]), 0)
                aReplayBufferVariables.append(self.tafReplayObservations)
                if self.oPlayer.oEnvironment.bDiscrete:
                    self.tauReplayActions = tfGlobalVariable('auActions', (tfInt, [uReplayCapacity]), 0)
                    aReplayBufferVariables.append(self.tauReplayActions)
                else:
                    self.tafReplayActions = tfGlobalVariable('afActions', (tfFloat, [uReplayCapacity, self.oPlayer.oEnvironment.uActionsSize]), 0)
                    aReplayBufferVariables.append(self.tafReplayActions)
                if self.bEpisodeRating:
                    self.tafReplayRatings = tfGlobalVariable('afRatings', (tfFloat, [uReplayCapacity]), 0)
                    aReplayBufferVariables.append(self.tafReplayRatings)
                else:
                    self.tafReplayRewards = tfGlobalVariable('afRewards', (tfFloat, [uReplayCapacity]), 0)
                    aReplayBufferVariables.append(self.tafReplayRewards)
                self.tafReplayDones = tfGlobalVariable('afDones', (tfFloat, [uReplayCapacity]), 0)
                aReplayBufferVariables.append(self.tafReplayDones)

                # Tree-based buffer for quick search and random sampling based on priority
                if self.bPriorityMode:
                    # Power of priority significance (tends to 1 over time)
                    self.tfReplayWeightPower = tfGlobalVariable('fWeightPower', tfFloat, fWeightPower)
                    aReplayBufferVariables.append(self.tfReplayWeightPower)

                    # Tree-based buffer of maximums
                    self.tafReplayMaxTree = tfGlobalVariable('afMaxTree', (tfFloat, [uTreeSize * 2]), 0)
                    aReplayBufferVariables.append(self.tafReplayMaxTree)
                    # Tree-based buffer of sums
                    self.tafReplaySumTree = tfGlobalVariable('afSumTree', (tfFloat, [uTreeSize * 2]), 0)
                    aReplayBufferVariables.append(self.tafReplaySumTree)

            # The `fnAdd` function block that adds entries to the replay buffer
            with tf.name_scope('fnAdd'):
                # Input data
                with tf1.variable_scope('Input', reuse=tf1.AUTO_REUSE):
                    self.tinafReplayPrevObservation = tfInput('afPrevObservation', (tfFloat, [self.oPlayer.oEnvironment.uObservationSize]))
                    self.tinafReplayObservation = tfInput('afObservation', (tfFloat, [self.oPlayer.oEnvironment.uObservationSize]))
                    if self.oPlayer.oEnvironment.bDiscrete:
                        self.tinauReplayAction = tfInput('auAction', (tfInt, [1]))
                    else:
                        self.tinafReplayActions = tfInput('afActions', (tfFloat, [self.oPlayer.oEnvironment.uActionsSize]))
                    if self.bEpisodeRating:
                        self.tinafReplayRating = tfInput('afRating', (tfFloat, [1]))
                    else:
                        self.tinafReplayReward = tfInput('afReward', (tfFloat, [1]))
                    self.tinfReplayDone = tfInput('fDone', tfFloat)
                    self.tinfReplayScore = tfInput('fScore', tfFloat)
                    self.tinuReplaySteps = tfInput('uSteps', tfInt)

                tafDones = tf.expand_dims(self.tinfReplayDone, axis=-1)

                # Index limit
                tuMaxIndex = self.tuReplayStart + self.tuReplayCapacity

                # Save input to buffer
                tuIndex = tf.math.floormod(self.tuReplayEnd, self.tuReplayCapacity)
                tauIndices = tf.expand_dims(tf.expand_dims(tuIndex, axis=-1), axis=-1)
                topUpdate1 = self.tafReplayPrevObservations.scatter_nd_update(tauIndices, self.tinafReplayPrevObservation[None, :])
                topUpdate2 = self.tafReplayObservations.scatter_nd_update(tauIndices, self.tinafReplayObservation[None, :])
                if oPlayer.oEnvironment.bDiscrete:
                    topUpdate3 = self.tauReplayActions.scatter_nd_update(tauIndices, self.tinauReplayAction)
                else:
                    topUpdate3 = self.tafReplayActions.scatter_nd_update(tauIndices, self.tinafReplayActions[None, :])
                if self.bEpisodeRating:
                    topUpdate4 = self.tafReplayRatings.scatter_nd_update(tauIndices, self.tinafReplayRating)
                else:
                    topUpdate4 = self.tafReplayRewards.scatter_nd_update(tauIndices, self.tinafReplayReward)
                topUpdate5 = self.tafReplayDones.scatter_nd_update(tauIndices, tafDones)

                with tfWait([topUpdate1, topUpdate2, topUpdate3, topUpdate4, topUpdate5]):
                    tuNewEnd = self.tuReplayEnd.assign_add(tuOne)

                with tfWait([tuNewEnd]):
                    # Check buffer overflow
                    def fnOverflow():
                        return self.tuReplayStart.assign(self.tuReplayEnd + tuOne - self.tuReplayCapacity)
                    def fnNoOverflow():
                        return self.tuReplayStart
                    tuNewStart = tf.cond(tf.greater(self.tuReplayEnd, tuMaxIndex), fnOverflow, fnNoOverflow, 'uNewStart')

                aWaitList = [tuNewStart]

                # Log episode total score
                if self.uLogLevel > 0:
                    def fnAddLogEpisode():
                        with self.oSummaryWriter.as_default(): # pylint: disable=not-context-manager
                            topLogScore = tf.summary.scalar('Stats/Score', self.tinfReplayScore, tf.cast(self.tuEpisode, tf.int64))
                            if self.uLogLevel > 1:
                                topLogSteps = tf.summary.scalar('Info/Steps', self.tinuReplaySteps, tf.cast(self.tuEpisode, tf.int64))
                                topLog = tf.group(topLogScore, topLogSteps)
                            else:
                                topLog = topLogScore
                        return topLog
                    def fnAddLogStep():
                        return tf.no_op()
                    aWaitList.append(tf.cond(tf.equal(self.tinfReplayDone, tfZero), true_fn=fnAddLogStep, false_fn=fnAddLogEpisode))

                with tfWait(aWaitList):
                    # Go to the next episode
                    def fnNextEpisode():
                        return self.tuEpisode.assign_add(tuOne)
                    def fnCurEpisode():
                        return self.tuEpisode
                    tuNewEpisode = tf.cond(tf.not_equal(self.tinfReplayDone, tfZero), fnNextEpisode, fnCurEpisode, 'uNewEpisode')

                # Update tree-based buffer priorities
                if self.bPriorityMode:
                    tuIndex = tuIndex + tuReplayTreeSize
                    tauIndices = tf.expand_dims(tf.expand_dims(tuIndex, axis=-1), axis=-1)
                    topUpdateMax = self.tafReplayMaxTree.scatter_nd_update(tauIndices, tafReplayMaxPriority)
                    topUpdateSum = self.tafReplaySumTree.scatter_nd_update(tauIndices, tafReplayMaxPriority)
                    tuIndex = tf.math.floordiv(tuIndex, tuTwo)
                    with tfWait([topUpdateMax, topUpdateSum]):
                        def fnAddCompare(tuLoopIndex):
                            return tf.greater_equal(tuLoopIndex, tuOne)
                        def fnAddLoop(tuLoopIndex):
                            tauLeft = tf.expand_dims(tuLoopIndex * tuTwo, axis=-1)
                            tauRight = tauLeft + tuOne
                            tfMax = tf.maximum(tf.gather(self.tafReplayMaxTree, tauLeft), tf.gather(self.tafReplayMaxTree, tauRight)) # pylint: disable=no-value-for-parameter
                            tfSum = tf.add(tf.gather(self.tafReplaySumTree, tauLeft), tf.gather(self.tafReplaySumTree, tauRight)) # pylint: disable=no-value-for-parameter
                            tauIndices = tf.expand_dims(tf.expand_dims(tuLoopIndex, axis=-1), axis=-1)
                            topUpdateMax = self.tafReplayMaxTree.scatter_nd_update(tauIndices, tfMax)
                            topUpdateSum = self.tafReplaySumTree.scatter_nd_update(tauIndices, tfSum)
                            with tfWait([topUpdateMax, topUpdateSum]):
                                tuLoopIndex = tf.math.floordiv(tuLoopIndex, tuTwo)
                                return [tuLoopIndex]
                        [topUpdateTree] = tf.while_loop(fnAddCompare, fnAddLoop, [tuIndex])

            # Result node of the function, evaluated after the data has been added to the replay buffer
            if self.bPriorityMode:
                self.tfnReplayAdd = tf.group(tuNewEpisode, topUpdateTree)
            else:
                self.tfnReplayAdd = tuNewEpisode

            # The `fnReadBatch` function block that reads a batch of entries from the replay buffer
            with tf.name_scope('fnReadBatch'):
                # Create batch indices for reading from the replay buffer
                if self.bPriorityMode:
                    # The total amount of all priorities
                    tfTotalPriority = tf.squeeze(tf.gather(self.tafReplaySumTree, tauOne)) # pylint: disable=no-value-for-parameter
                    # Set of random priorities offsets
                    tafRandomPriorities = tf.random.uniform([uBatchSize], minval=tfZero, maxval=tfTotalPriority, dtype=tfFloat, seed=None) # pylint: disable=unexpected-keyword-arg
                    # Search for matching indices of priorities offsets
                    tuIndex = tuOne
                    tauIndices = tf.ones([uBatchSize], dtype=tfInt)
                    def fnRandomCompare(tuLoopIndex, tauIndices, tafRandomPriorities):
                        return tf.less(tuLoopIndex, tuReplayTreeSize)
                    def fnRandomLoop(tuLoopIndex, tauIndices, tafRandomPriorities):
                        tauLeft = tauIndices * tuTwo
                        tafValues = tf.gather(self.tafReplaySumTree, tauLeft) # pylint: disable=no-value-for-parameter
                        tabLessEqual = tf.less_equal(tafValues, tafRandomPriorities)
                        return [tuLoopIndex * tuTwo, tauLeft + tf.cast(tabLessEqual, tfInt), tafRandomPriorities - tafValues * tf.cast(tabLessEqual, tfFloat)]
                    [tuReturnIndex, tauReplayIndices, tafReturnRandomPriorities] = tf.while_loop(fnRandomCompare, fnRandomLoop, [tuIndex, tauIndices, tafRandomPriorities])
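                    # Worked example of the descent above (illustrative numbers, not taken from the original):
                    # with a tree size of 4 and leaf priorities [0.5, 0.2, 0.3, 0.0] the sum tree holds
                    # node 1 = 1.0, node 2 = 0.7, node 3 = 0.3 and leaves 4..7 = 0.5, 0.2, 0.3, 0.0.
                    # For a random offset of 0.6: at node 1 the left child holds 0.7 > 0.6, so go left to node 2;
                    # at node 2 the left child holds 0.5 <= 0.6, so go right to node 5 and keep 0.6 - 0.5 = 0.1.
                    # Leaf 5 corresponds to replay slot 5 - 4 = 1, which covers the offsets in [0.5, 0.7).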
                    # Calculate weights for each index based on its priority
                    tfMaxPriorityScale = tfOne / tf.squeeze(tf.gather(self.tafReplayMaxTree, tauOne)) # pylint: disable=no-value-for-parameter
                    tafBatchPriorities = tf.gather(self.tafReplaySumTree, tauReplayIndices) # pylint: disable=no-value-for-parameter
                    tafBatchWeights = tf.pow(tafBatchPriorities * tfMaxPriorityScale, self.tfReplayWeightPower) # pylint: disable=invalid-unary-operand-type
                    # Update coefficient affecting weights
                    topUpdateReplayWeightPower = self.tfReplayWeightPower.assign(tf.minimum(tfOne, self.tfReplayWeightPower + tfReplayWeightDiff))

                    with tfWait([tuReturnIndex, tauReplayIndices, tafReturnRandomPriorities, topUpdateReplayWeightPower]):
                        # Set of random indices
                        tauRandomIndices = tauReplayIndices - tuReplayTreeSize
                else:
                    # Set of random offsets
                    tauRandomOffsets = tf.random.uniform([uBatchSize], minval=self.tuReplayStart, maxval=self.tuReplayEnd, dtype=tfInt, seed=None) # pylint: disable=unexpected-keyword-arg
                    # Set of random indices
                    tauRandomIndices = tf.math.floormod(tauRandomOffsets, self.tuReplayCapacity)

                # Create data block based on a sample of random indices
                tafBatchPrevObservations = tf.gather(self.tafReplayPrevObservations, tauRandomIndices) # pylint: disable=no-value-for-parameter
                tafBatchObservations = tf.gather(self.tafReplayObservations, tauRandomIndices) # pylint: disable=no-value-for-parameter
                if self.oPlayer.oEnvironment.bDiscrete:
                    tauBatchActions = tf.gather(self.tauReplayActions, tauRandomIndices) # pylint: disable=no-value-for-parameter
                else:
                    tafBatchActions = tf.gather(self.tafReplayActions, tauRandomIndices) # pylint: disable=no-value-for-parameter
                if self.bEpisodeRating:
                    tafBatchRatings = tf.gather(self.tafReplayRatings, tauRandomIndices) # pylint: disable=no-value-for-parameter
                else:
                    tafBatchRewards = tf.gather(self.tafReplayRewards, tauRandomIndices) # pylint: disable=no-value-for-parameter
                tafBatchDones = tf.gather(self.tafReplayDones, tauRandomIndices) # pylint: disable=no-value-for-parameter

        # Body of the `Soft Actor Critic` agent

        with tf.name_scope('SacAgent'):
            with tf.name_scope('Critic'):
                # Create two `Trainer Critic` neural networks and two `Target Critic` neural networks
                self.annCriticNetworks = [CriticNetwork() for _ in range(4)]
                self.nnTrainCritic1 = self.annCriticNetworks[0]
                self.nnTrainCritic2 = self.annCriticNetworks[1]
                self.nnTargetCritic1 = self.annCriticNetworks[2]
                self.nnTargetCritic2 = self.annCriticNetworks[3]
                for nnCritic in self.annCriticNetworks:
                    nnCritic.build((self.uBatchSize, oEnvironment.uObservationSize), (self.uBatchSize, oEnvironment.uActionsSize))
                    aTrainVariables.extend(nnCritic.trainable_variables)

            with tf.name_scope('Actor'):
                # Create neural network `Trainer Actor`
                self.nnTrainActor = CreateActor(oEnvironment, self.uBatchSize, 'nnSacTrainActor')
                aTrainVariables.extend(self.nnTrainActor.trainable_variables)

                if not self.oPlayer.oEnvironment.bDiscrete:
                    afActionsMin = np.array([0])
                    afActionsMax = np.array([1])

                    # Constants for the continuous model
                    with tf1.variable_scope('Const', reuse=tf1.AUTO_REUSE):
                        tafActionsMean = tfConstant('afActionsMean', tfFloat, (afActionsMin + afActionsMax) / 2)
                        tafActionsStd = tfConstant('afActionsStd', tfFloat, (afActionsMin - afActionsMax) / 2)

            with tf1.variable_scope('Var', reuse=tf1.AUTO_REUSE):
                # `Alpha Regulator`
                self.tfLogAlpha = tfGlobalVariable('fLogAlpha', tfFloat, 0, True)
                aTrainVariables.append(self.tfLogAlpha)

                if self.uLogLevel > 1:
                    tfCriticGradientAverageNormal = tfGlobalVariable('fCriticGradientClip', tfFloat, 0)
                    aTrainVariables.append(tfCriticGradientAverageNormal)
                    tfActorGradientAverageNormal = tfGlobalVariable('fActorGradientClip', tfFloat, 0)
                    aTrainVariables.append(tfActorGradientAverageNormal)

                if tfTrainStepCounter is None:
                    tfTrainStepCounter = tf.compat.v1.train.get_or_create_global_step()
                aTrainVariables.append(tfTrainStepCounter)

            with tf1.variable_scope('Const', reuse=tf1.AUTO_REUSE):
                # Factor of discounting rating
                tfRatingDiscountFactor = tfConstant('fRatingDiscountFactor', tfFloat, self.fRatingDiscountFactor)
                # Target entropy
                if self.oPlayer.oEnvironment.bDiscrete:
                    tfTargetEntropy = tfConstant('fTargetEntropy', tfFloat, -np.log(1.0 / oEnvironment.uActionsSize) * 0.98)
                else:
                    tfTargetEntropy = tfConstant('fTargetEntropy', tfFloat, oEnvironment.uActionsSize / 2.0)

            # The `fnInitialize` function block that initializes all variables on the first start
            with tf.name_scope('fnInitialize'):
                topCriticUpdate1 = fnHardUpdate(self.nnTargetCritic1.variables, self.nnTrainCritic1.variables)
                topCriticUpdate2 = fnHardUpdate(self.nnTargetCritic2.variables, self.nnTrainCritic2.variables)
                topActorUpdate = fnHardUpdate(self.oPlayer.nnActor.variables, self.nnTrainActor.variables)
            # Result node of the function, evaluated after initialization
            self.tfnInitialize = tf.group(topCriticUpdate1, topCriticUpdate2, topActorUpdate)

            # Update target neural networks
            def fnUpdateTarget():
                topCriticUpdate1 = fnSoftUpdate(self.nnTargetCritic1.variables, self.nnTrainCritic1.variables, tfZero, tfOne, tfTrainableTargetAverageForgetCoef, tfTrainableTargetAverageUpdateCoef)
                topCriticUpdate2 = fnSoftUpdate(self.nnTargetCritic2.variables, self.nnTrainCritic2.variables, tfZero, tfOne, tfTrainableTargetAverageForgetCoef, tfTrainableTargetAverageUpdateCoef)
                topActorUpdate = fnHardUpdate(self.oPlayer.nnActor.variables, self.nnTrainActor.variables)
                return tf.group(topCriticUpdate1, topCriticUpdate2, topActorUpdate)

            # The `fnTrain` function block that performs all operations related to training networks
            with tf.name_scope('fnTrain'):
                # Format input data
                if self.oPlayer.oEnvironment.bDiscrete:
                    tafActionsOneHots = tf.one_hot(tauBatchActions, tuActionsSize, on_value=tfOne, off_value=tfZero, dtype=tfFloat) # pylint: disable=unexpected-keyword-arg
                    tafStates = self.nnTrainCritic1.prepare(tafBatchPrevObservations, tafActionsOneHots)
                else:
                    tafStates = self.nnTrainCritic1.prepare(tafBatchPrevObservations, tafBatchActions)

                # Calculate ratings for input data
                if self.bEpisodeRating:
                    # For episode mode, ratings are already calculated in advance
                    tafRatings = tafBatchRatings
                else:
                    # Calculate the probabilities of the next actions
                    if self.oPlayer.oEnvironment.bDiscrete:
                        # Unnormalized predicted logarithmic probabilities of possible actions
                        tafNextUnnormalizedLogProbabilities = self.nnTrainActor.call(tafBatchObservations)
                        # Array of random numbers to add noise to the logarithmic probabilities
                        tafRandomUniforms = tf.random.uniform([self.uBatchSize, oEnvironment.uActionsSize], minval=tfTotalMinLogInput, maxval=tfTotalMaxLogInput, dtype=tfFloat, seed=None) # pylint: disable=unexpected-keyword-arg
                        # Gumbel distribution for random numbers
                        tafGumbels = -tf.math.log(-tf.math.log(tafRandomUniforms)) # pylint: disable=invalid-unary-operand-type
                        # Unnormalized predicted logarithmic probabilities of possible actions with added noise
                        tafNextUnnormalizedNoisyLogProbabilities = tafNextUnnormalizedLogProbabilities + tafGumbels
                        # Best estimated actions
                        tauNextBestActions = tf.argmax(tafNextUnnormalizedNoisyLogProbabilities, axis=-1, output_type=tfInt)
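                        # Note on the steps above: this is the Gumbel-max trick. If `u` is uniform on (0, 1),
                        # then `-log(-log(u))` follows a standard Gumbel distribution, and the argmax of
                        # `logits + gumbel` is distributed as a sample from `softmax(logits)`.
                        # The clipping bounds of the uniform noise are meant to keep both logarithms finite.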
                        # Best predicted actions vectors
                        tafNextPredActions = tf.one_hot(tauNextBestActions, tuActionsSize, on_value=tfOne, off_value=tfZero, dtype=tfFloat) # pylint: disable=unexpected-keyword-arg
                        # Normalized predicted logarithmic probabilities of possible actions
                        tafNextNormalizedLogProbabilities = tf.math.log_softmax(tafNextUnnormalizedLogProbabilities, axis=-1)
                        # Logarithmic probabilities of the best predicted actions (in the role of entropy)
                        tafNextProbabilitiesEntropy = -tf.math.reduce_sum(tafNextPredActions * tafNextNormalizedLogProbabilities, axis=-1)
                    else:
                        tafNextLocations, tafNextScales = self.nnTrainActor.call(tafBatchObservations)
                        tafNextClippedScales = tf.clip_by_value(tafNextScales, tfTotalMinScale, tfTotalMaxScale)
                        tafNextClippedScalesExp = tf.math.exp(tafNextClippedScales)
                        tafNextRandomNormals = tf.random.normal([self.uBatchSize, oEnvironment.uActionsSize], mean=tafNextLocations, stddev=tafNextClippedScalesExp, dtype=tfFloat, seed=None)
                        tafNextPredActions = tf.math.tanh(tafNextRandomNormals) * tafActionsStd + tafActionsMean
                        tafNextClippedScalesExpSquare = tf.square(tafNextClippedScalesExp)
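                        # Negative Gaussian log-density of the pre-squash sample, summed over the action dimensions so that the shape matches the critic output
                        tafNextProbabilitiesEntropy = tf.reduce_sum(tf.square(tafNextRandomNormals - tafNextLocations) / (2 * tafNextClippedScalesExpSquare) + tafNextClippedScales + tfLogSqrtPi2, axis=-1) # pylint: disable=invalid-unary-operand-type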

                    # Calculate the predicted ratings of the next states
                    tafNextStates = self.nnTargetCritic1.prepare(tafBatchObservations, tafNextPredActions)
                    tafPredNextRatings1 = tf.squeeze(self.nnTargetCritic1.call(tafNextStates), axis=-1)
                    tafPredNextRatings2 = tf.squeeze(self.nnTargetCritic2.call(tafNextStates), axis=-1)
                    tafPredNextRatings = tf.minimum(tafPredNextRatings1, tafPredNextRatings2) + tf.math.exp(self.tfLogAlpha) * tafNextProbabilitiesEntropy

                    # Calculate ratings
                    tafRatings = tafBatchRewards + (1. - tafBatchDones) * tafPredNextRatings * tfRatingDiscountFactor

                # Calculate the error of learning the neural network `Trainer Critic` #1
                tfaTrainableCriticVariables1 = self.nnTrainCritic1.trainable_variables
                with tf.GradientTape(watch_accessed_variables=False) as tape:
                    tape.watch(tfaTrainableCriticVariables1)

                    # Predicted ratings
                    tafPredRatings1 = tf.squeeze(self.nnTrainCritic1.call(tf.stop_gradient(tafStates)), axis=-1)
                    # The advantage of real ratings over predicted ones
                    tafAdvantages1 = tf.stop_gradient(tafRatings) - tafPredRatings1
                    # Learning error
                    if self.bPriorityMode:
                        tfCriticLoss1 = tf.reduce_mean(tf.square(tafAdvantages1) * tf.stop_gradient(tafBatchWeights))
                    else:
                        tfCriticLoss1 = tf.reduce_mean(tf.square(tafAdvantages1))

                # Calculate gradients
                aCriticGradients1 = tape.gradient(tfCriticLoss1, tfaTrainableCriticVariables1)

                # Calculate the error of learning the neural network `Trainer Critic` #2
                tfaTrainableCriticVariables2 = self.nnTrainCritic2.trainable_variables
                with tf.GradientTape(watch_accessed_variables=False) as tape:
                    tape.watch(tfaTrainableCriticVariables2)

                    # Predicted ratings
                    tafPredRatings2 = tf.squeeze(self.nnTrainCritic2.call(tf.stop_gradient(tafStates)), axis=-1)
                    # The advantage of real ratings over predicted ones
                    tafAdvantages2 = tf.stop_gradient(tafRatings) - tafPredRatings2
                    # Learning error
                    if self.bPriorityMode:
                        tfCriticLoss2 = tf.reduce_mean(tf.square(tafAdvantages2) * tf.stop_gradient(tafBatchWeights))
                    else:
                        tfCriticLoss2 = tf.reduce_mean(tf.square(tafAdvantages2))

                # Calculate gradients
                aCriticGradients2 = tape.gradient(tfCriticLoss2, tfaTrainableCriticVariables2)

                # Total critic error
                tfCriticLoss = (tfCriticLoss1 + tfCriticLoss2) * tfHalf

                # Calculate the running average norm of the `Trainer Critic` gradients
                if self.uLogLevel > 1:
                    tuGradientsCount = tfZero
                    tfGradientsNormalMass = tfZero
                    for tfaGradientsGroup in aCriticGradients1:
                        tfaGradients = tfaGradientsGroup.values if isinstance(tfaGradientsGroup, tf.IndexedSlices) else tfaGradientsGroup
                        tuCount = tf.cast(tf.size(tfaGradients), dtype=tfFloat) # pylint: disable=unexpected-keyword-arg, no-value-for-parameter
                        tuGradientsCount += tuCount
                        tfGradientsNormalMass += tf.linalg.global_norm([tfaGradients]) * tuCount
                    for tfaGradientsGroup in aCriticGradients2:
                        tfaGradients = tfaGradientsGroup.values if isinstance(tfaGradientsGroup, tf.IndexedSlices) else tfaGradientsGroup
                        tuCount = tf.cast(tf.size(tfaGradients), dtype=tfFloat) # pylint: disable=unexpected-keyword-arg, no-value-for-parameter
                        tuGradientsCount += tuCount
                        tfGradientsNormalMass += tf.linalg.global_norm([tfaGradients]) * tuCount
                    tfCriticGradientNormal = tfCriticGradientAverageNormal.assign(tfCriticGradientAverageNormal * tfGradientNormalForgetCoef + (tfGradientsNormalMass / tuGradientsCount) * tfGradientNormalUpdateCoef)

                # Clip gradient values to prevent `inf` and `nan` errors
                if fMaxGradientNormal is not None:
                    aCriticGradients1 = fnClipGradients(aCriticGradients1, tfMaxGradientNormal)
                    aCriticGradients2 = fnClipGradients(aCriticGradients2, tfMaxGradientNormal)

                # Update priorities in replay buffer
                if self.bPriorityMode:
                    # Calculate priorities based on the advantages of ratings
                    tafReplayErrors = tf.minimum(tf.abs(tafAdvantages1), tf.abs(tafAdvantages2))
                    tafReplayPriorities = tf.clip_by_value(tf.pow(tafReplayErrors + tfReplayPriorityEpsilon, -tfReplayErrorPower), tfReplayMinPriority, tfReplayMaxPriority)
                    # Update priorities
                    tauUpdateIndices = tf.expand_dims(tauReplayIndices, axis=-1)
                    topUpdateMax = self.tafReplayMaxTree.scatter_nd_update(tauUpdateIndices, tafReplayPriorities)
                    topUpdateSum = self.tafReplaySumTree.scatter_nd_update(tauUpdateIndices, tafReplayPriorities)
                    with tfWait([topUpdateMax, topUpdateSum]):
                        tuIndex = tuHalfReplayTreeSize
                        tauIndices = tf.math.floordiv(tauReplayIndices, tuTwo)
                        def fnUpdateCompare(tuLoopIndex, tauIndices):
                            return tf.greater_equal(tuLoopIndex, tuOne)
                        def fnUpdateLoop(tuLoopIndex, tauIndices):
                            tauLeft = tauIndices * tuTwo
                            tauRight = tauLeft + tuOne
                            tfMax = tf.maximum(tf.gather(self.tafReplayMaxTree, tauLeft), tf.gather(self.tafReplayMaxTree, tauRight)) # pylint: disable=no-value-for-parameter
                            tfSum = tf.add(tf.gather(self.tafReplaySumTree, tauLeft), tf.gather(self.tafReplaySumTree, tauRight)) # pylint: disable=no-value-for-parameter
                            tauUpdateIndices = tf.expand_dims(tauIndices, axis=-1)
                            topUpdateMax = self.tafReplayMaxTree.scatter_nd_update(tauUpdateIndices, tfMax)
                            topUpdateSum = self.tafReplaySumTree.scatter_nd_update(tauUpdateIndices, tfSum)
                            with tfWait([topUpdateMax, topUpdateSum]):
                                tauIndices = tf.math.floordiv(tauIndices, tuTwo)
                                tuLoopIndex = tf.math.floordiv(tuLoopIndex, tuTwo)
                                return [tuLoopIndex, tauIndices]
                        [tuIndex, tauIndices] = tf.while_loop(fnUpdateCompare, fnUpdateLoop, [tuIndex, tauIndices])
                        topUpdateReplayPriorities = tf.group(tuIndex, tauIndices)

                # Calculate the training loss of the `Trainer Actor` neural network
                tfaTrainableActorVariables = self.nnTrainActor.trainable_variables
                with tf.GradientTape(watch_accessed_variables=False) as tape:
                    tape.watch(tfaTrainableActorVariables)

                    # Calculate the probabilities of all possible actions
                    if self.oPlayer.oEnvironment.bDiscrete:
                        # Unnormalized predicted logarithmic probabilities of actions
                        tafUnnormalizedLogProbabilities = self.nnTrainActor.call(tf.stop_gradient(tafBatchPrevObservations))
                        # Array of random numbers to add noise to the logarithmic probabilities
                        tafRandomUniforms = tf.random.uniform([self.uBatchSize, oEnvironment.uActionsSize], minval=tfTotalMinLogInput, maxval=tfTotalMaxLogInput, dtype=tfFloat, seed=None) # pylint: disable=unexpected-keyword-arg
                        # Transform the uniform random numbers into Gumbel-distributed noise
                        tafGumbels = -tf.math.log(-tf.math.log(tafRandomUniforms)) # pylint: disable=invalid-unary-operand-type
                        # Unnormalized predicted logarithmic probabilities of possible actions with added noise
                        tafUnnormalizedNoisyLogProbabilities = tafUnnormalizedLogProbabilities + tafGumbels
                        # Predicted actions vectors
                        tafPredActions = tf.math.exp(tafUnnormalizedNoisyLogProbabilities - tf.math.reduce_logsumexp(tafUnnormalizedNoisyLogProbabilities, axis=-1, keepdims=True))
                        # Normalized predicted logarithmic probabilities of possible actions
                        tafPredNormalizedLogProbabilities = tf.math.log_softmax(tafUnnormalizedLogProbabilities, axis=-1)
                        # Logarithmic probabilities of the best predicted actions (in the role of entropy)
                        tafPredProbabilitiesEntropy = -tf.math.reduce_sum(tafPredActions * tafPredNormalizedLogProbabilities, axis=-1)
                    else:
                        tafLocations, tafScales = self.nnTrainActor.call(tf.stop_gradient(tafBatchPrevObservations))
                        tafClippedScales = tf.clip_by_value(tafScales, tfTotalMinScale, tfTotalMaxScale)
                        tafClippedScalesExp = tf.math.exp(tafClippedScales)
                        tafRandomNormals = tf.random.normal([self.uBatchSize, oEnvironment.uActionsSize], mean=tfZero, stddev=tfOne, dtype=tfFloat, seed=None)
                        tafRandomUnscaledActions = tafLocations + tafClippedScalesExp * tafRandomNormals
                        tafPredActions = tf.math.tanh(tafRandomUnscaledActions) * tafActionsStd + tafActionsMean
                        tafClippedScalesExpPower = tf.math.square(tafClippedScalesExp)
                        tafNormalizers = -tf.math.reduce_sum(tf.math.log(1 - tf.math.square(tafPredActions) + 1e-6), axis=1)
                        tafPredProbabilitiesEntropy = (tf.math.square(tafRandomUnscaledActions - tafLocations)) / (2 * tafClippedScalesExpPower) + tafClippedScales + tfLogSqrtPi2 + tafNormalizers # pylint: disable=invalid-unary-operand-type

                    # Calculate the predicted probable ratings
                    tafRandomStates = self.nnTrainCritic1.prepare(tf.stop_gradient(tafBatchPrevObservations), tafPredActions)
                    tafRandomPredRatings1 = tf.squeeze(self.nnTrainCritic1.call(tafRandomStates), axis=-1)
                    tafRandomPredRatings2 = tf.squeeze(self.nnTrainCritic2.call(tafRandomStates), axis=-1)
                    tafRandomPredRatings = tf.minimum(tafRandomPredRatings1, tafRandomPredRatings2) + tf.exp(self.tfLogAlpha) * tafPredProbabilitiesEntropy

                    # Learning error
                    if self.bPriorityMode:
                        tfActorLoss = -tf.math.reduce_mean(tafRandomPredRatings * tf.stop_gradient(tafBatchWeights))
                    else:
                        tfActorLoss = -tf.math.reduce_mean(tafRandomPredRatings)

                # Calculate gradients
                aActorGradients = tape.gradient(tfActorLoss, tfaTrainableActorVariables)

                # Calculate the running average of the `Trainer Actor` gradient norm
                if self.uLogLevel > 1:
                    tuGradientsCount = tfZero
                    tfGradientsNormalMass = tfZero
                    for tfaGradientsGroup in aActorGradients:
                        tfaGradients = tfaGradientsGroup.values if isinstance(tfaGradientsGroup, tf.IndexedSlices) else tfaGradientsGroup
                        tuCount = tf.cast(tf.size(tfaGradients), dtype=tfFloat) # pylint: disable=unexpected-keyword-arg, no-value-for-parameter
                        tuGradientsCount += tuCount
                        tfGradientsNormalMass += tf.linalg.global_norm([tfaGradients]) * tuCount
                    tfActorGradientNormal = tfActorGradientAverageNormal.assign(tfActorGradientAverageNormal * tfGradientNormalForgetCoef + (tfGradientsNormalMass / tuGradientsCount) * tfGradientNormalUpdateCoef)

                # Clip gradient values to prevent `inf` and `nan` errors
                if fMaxGradientNormal is not None:
                    aActorGradients = fnClipGradients(aActorGradients, tfMaxGradientNormal)

                # Calculate the training loss of the `Alpha Regulator`
                tfaTrainableAlphaVariable = [self.tfLogAlpha]
                with tf.GradientTape(watch_accessed_variables=False) as tape:
                    tape.watch(tfaTrainableAlphaVariable)

                    # Calculate entropy loss
                    tafEntropyLoss = -(tafPredProbabilitiesEntropy - tfTargetEntropy)

                    # Learning error
                    if self.bPriorityMode:
                        tfAlphaLoss = tf.math.reduce_mean(tafEntropyLoss * tf.stop_gradient(tafBatchWeights)) * self.tfLogAlpha
                    else:
                        tfAlphaLoss = tf.math.reduce_mean(tafEntropyLoss) * self.tfLogAlpha

                # Calculate gradients
                aAlphaGradient = tape.gradient(tfAlphaLoss, tfaTrainableAlphaVariable)

                # Clip gradient values to prevent `inf` and `nan` errors
                if fMaxGradientNormal is not None:
                    aAlphaGradient = fnClipGradients(aAlphaGradient, tfMaxGradientNormal)

                # Training
                with tfWait([tfCriticLoss, tfActorLoss, tfAlphaLoss]):
                    topOptimizeCritic1 = self.koCriticOptimizer.apply_gradients(zip(aCriticGradients1, tfaTrainableCriticVariables1))
                    topOptimizeCritic2 = self.koCriticOptimizer.apply_gradients(zip(aCriticGradients2, tfaTrainableCriticVariables2))
                    topOptimizeActor = self.koActorOptimizer.apply_gradients(zip(aActorGradients, tfaTrainableActorVariables))
                    topOptimizeAlpha = self.koAlphaOptimizer.apply_gradients(zip(aAlphaGradient, tfaTrainableAlphaVariable))

                aWaitList = [topOptimizeCritic1, topOptimizeCritic2, topOptimizeActor, topOptimizeAlpha]

                if self.bPriorityMode:
                    aWaitList.append(topUpdateReplayPriorities)

        # Logging function subblock that performs all operations related to statistics

        if self.uLogLevel > 0:
            with self.oSummaryWriter.as_default(): # pylint: disable=not-context-manager
                aWaitList.append(tf.summary.scalar('Stats/Loss/Critic', tf.reduce_mean(tfCriticLoss), tfTrainStepCounter))
                aWaitList.append(tf.summary.scalar('Stats/Loss/Actor', tf.reduce_mean(tfActorLoss), tfTrainStepCounter))
                aWaitList.append(tf.summary.scalar('Stats/Loss/Alpha', tfAlphaLoss, tfTrainStepCounter))

        if self.uLogLevel > 1:
            with self.oSummaryWriter.as_default(): # pylint: disable=not-context-manager
                aWaitList.append(tf.summary.scalar('Info/GradientNormal/Critic', tfCriticGradientNormal, tfTrainStepCounter))
                aWaitList.append(tf.summary.scalar('Info/GradientNormal/Actor', tfActorGradientNormal, tfTrainStepCounter))
                if self.bPriorityMode:
                    aWaitList.append(tf.summary.scalar('Info/LossWeight/Mean', tf.reduce_mean(tafBatchWeights), tfTrainStepCounter))

                aWaitList.append(tf.summary.scalar('Rating/LogAlpha', self.tfLogAlpha, tfTrainStepCounter))
                aWaitList.append(tf.summary.scalar('Rating/Alpha', tf.math.exp(self.tfLogAlpha), tfTrainStepCounter))
                aWaitList.append(tf.summary.scalar('Rating/Value/Mean', tf.reduce_mean(tafRatings), tfTrainStepCounter))
                aWaitList.append(tf.summary.scalar('Rating/Entropy/Mean', tf.reduce_mean(tafPredProbabilitiesEntropy), tfTrainStepCounter))

                aWaitList.append(tf.summary.histogram('Rating/Value', tafRatings, tfTrainStepCounter))
                aWaitList.append(tf.summary.histogram('Rating/Entropy', tafPredProbabilitiesEntropy, tfTrainStepCounter))
                aWaitList.append(tf.summary.histogram('Rating/Pred', tafRandomPredRatings, tfTrainStepCounter))
                tafAdvantages = tf.concat([tafAdvantages1, tafAdvantages2], 0)
                aWaitList.append(tf.summary.histogram('Rating/Advantage', tafAdvantages, tfTrainStepCounter))

        if self.uLogLevel > 2:
            aWaitList.append(fnWeightsSummary(self.oSummaryWriter, zip(aCriticGradients1, tfaTrainableCriticVariables1), tfTrainStepCounter))
            aWaitList.append(fnWeightsSummary(self.oSummaryWriter, zip(aCriticGradients2, tfaTrainableCriticVariables2), tfTrainStepCounter))
            aWaitList.append(fnWeightsSummary(self.oSummaryWriter, zip(aActorGradients, tfaTrainableActorVariables), tfTrainStepCounter))
            aWaitList.append(fnWeightsSummary(self.oSummaryWriter, zip(aAlphaGradient, tfaTrainableAlphaVariable), tfTrainStepCounter))

        with tf.name_scope('SacAgent'):
            # Continue... The `fnTrain` function block that performs all operations related to training networks
            with tf.name_scope('fnTrain'):
                with tfWait(aWaitList):
                    # Update step
                    topNewStepCounter = tfTrainStepCounter.assign_add(tuOne64)
                    # Update target networks using `moving average` method
                    topUpdateTarget = fnUpdateTarget()

                with tfWait([topNewStepCounter, topUpdateTarget]):
                    # Result node of the function, evaluated only after training has finished
                    self.tfnTrain = tfCriticLoss + tfActorLoss + tfAlphaLoss

        # Create folders for storing the model

        self.sTargetRestorePath = sRestorePath + 'Target/'
        self.sTrainRestorePath = sRestorePath + 'Train/'
        self.sReplayBufferRestorePath = sRestorePath + 'ReplayBuffer/'

        self.oTargetSaver = tf.compat.v1.train.Saver(aTargetVariables, save_relative_paths=True)
        self.oTrainSaver = tf.compat.v1.train.Saver(aTrainVariables, save_relative_paths=True)
        self.oReplayBufferSaver = tf.compat.v1.train.Saver(aReplayBufferVariables, save_relative_paths=True)

        # `exist_ok=True` already covers the case where the folder exists
        os.makedirs(self.sTargetRestorePath, exist_ok=True)
        os.makedirs(self.sTrainRestorePath, exist_ok=True)
        os.makedirs(self.sReplayBufferRestorePath, exist_ok=True)

    # Internal function that plays a single step and stores it in the replay buffer
    def __fill(self, uDebugLevel):
        uEpisode = tfEval(self.tuEpisode)
        bDone = self.oPlayer.next()
        uFillCount = 0

        if self.bEpisodeRating:
            if self.uBufferSize >= self.uBufferCapacity:
                self.uBufferCapacity += 256

                self.afPrevObservations = np.resize(self.afPrevObservations, (self.uBufferCapacity, self.oPlayer.oEnvironment.uObservationSize))
                self.afObservations = np.resize(self.afObservations, (self.uBufferCapacity, self.oPlayer.oEnvironment.uObservationSize))
                if self.oPlayer.oEnvironment.bDiscrete:
                    self.auActions = np.resize(self.auActions, (self.uBufferCapacity,))
                else:
                    self.afActions = np.resize(self.afActions, (self.uBufferCapacity, self.oPlayer.oEnvironment.uActionsSize))
                self.afRewards = np.resize(self.afRewards, (self.uBufferCapacity,))
                self.afRatings = np.resize(self.afRatings, (self.uBufferCapacity,))

            uIndex = self.uBufferSize

            self.afPrevObservations[uIndex] = self.oPlayer.afPrevObservation
            self.afObservations[uIndex] = self.oPlayer.afObservation
            if self.oPlayer.oEnvironment.bDiscrete:
                self.auActions[uIndex] = self.oPlayer.uAction
            else:
                self.afActions[uIndex] = self.oPlayer.afActions
            self.afRewards[uIndex] = self.oPlayer.fReward

            self.uBufferSize += 1
        else:
            if self.oPlayer.oEnvironment.bDiscrete:
                tfEval(self.tfnReplayAdd, {
                    self.tinafReplayPrevObservation: self.oPlayer.afPrevObservation,
                    self.tinafReplayObservation: self.oPlayer.afObservation,
                    self.tinauReplayAction: [self.oPlayer.uAction],
                    self.tinafReplayReward: [self.oPlayer.fReward],
                    self.tinfReplayDone: float(bDone),
                    self.tinfReplayScore: self.oPlayer.fScore,
                    self.tinuReplaySteps: self.oPlayer.uStep
                })
            else:
                tfEval(self.tfnReplayAdd, {
                    self.tinafReplayPrevObservation: self.oPlayer.afPrevObservation,
                    self.tinafReplayObservation: self.oPlayer.afObservation,
                    self.tinafReplayActions: self.oPlayer.afActions,
                    self.tinafReplayReward: [self.oPlayer.fReward],
                    self.tinfReplayDone: float(bDone),
                    self.tinfReplayScore: self.oPlayer.fScore,
                    self.tinuReplaySteps: self.oPlayer.uStep
                })
            uFillCount += 1

        if bDone:
            if self.bEpisodeRating:
                uIndex = self.uBufferSize - 1
                fLastScore = self.afRatings[uIndex] = self.afRewards[uIndex]
                while uIndex > 0:
                    uIndex -= 1
                    fLastScore = self.afRatings[uIndex] = self.afRewards[uIndex] + self.fRatingDiscountFactor * fLastScore

                for uIndex in range(self.uBufferSize - 1):
                    if self.oPlayer.oEnvironment.bDiscrete:
                        tfEval(self.tfnReplayAdd, {
                            self.tinafReplayPrevObservation: self.afPrevObservations[uIndex],
                            self.tinafReplayObservation: self.afObservations[uIndex],
                            self.tinauReplayAction: [self.auActions[uIndex]],
                            self.tinafReplayRating: [self.afRatings[uIndex]],
                            self.tinfReplayDone: 0.0,
                            self.tinfReplayScore: self.oPlayer.fScore,
                            self.tinuReplaySteps: self.oPlayer.uStep
                        })
                    else:
                        tfEval(self.tfnReplayAdd, {
                            self.tinafReplayPrevObservation: self.afPrevObservations[uIndex],
                            self.tinafReplayObservation: self.afObservations[uIndex],
                            self.tinafReplayActions: [self.afActions[uIndex]],
                            self.tinafReplayRating: [self.afRatings[uIndex]],
                            self.tinfReplayDone: 0.0,
                            self.tinfReplayScore: self.oPlayer.fScore,
                            self.tinuReplaySteps: self.oPlayer.uStep
                        })
                    uFillCount += 1

                uIndex = self.uBufferSize - 1
                if self.oPlayer.oEnvironment.bDiscrete:
                    tfEval(self.tfnReplayAdd, {
                        self.tinafReplayPrevObservation: self.afPrevObservations[uIndex],
                        self.tinafReplayObservation: self.afObservations[uIndex],
                        self.tinauReplayAction: [self.auActions[uIndex]],
                        self.tinafReplayRating: [self.afRatings[uIndex]],
                        self.tinfReplayDone: 1.0,
                        self.tinfReplayScore: self.oPlayer.fScore,
                        self.tinuReplaySteps: self.oPlayer.uStep
                    })
                else:
                    tfEval(self.tfnReplayAdd, {
                        self.tinafReplayPrevObservation: self.afPrevObservations[uIndex],
                        self.tinafReplayObservation: self.afObservations[uIndex],
                        self.tinafReplayActions: [self.afActions[uIndex]],
                        self.tinafReplayRating: [self.afRatings[uIndex]],
                        self.tinfReplayDone: 1.0,
                        self.tinfReplayScore: self.oPlayer.fScore,
                        self.tinuReplaySteps: self.oPlayer.uStep
                    })
                uFillCount += 1

                self.uBufferSize = 0

        if uDebugLevel > 0:
            uEnd, uStart, uReplayCapacity = tfEval([self.tuReplayEnd, self.tuReplayStart, self.tuReplayCapacity])

            print('\r[MEM:%s] Fill: Ep.%d:%d, Score %f, RewardAvg %f, Used %d of %d        ' % (
                getMemoryUsage(),
                uEpisode, self.oPlayer.uStep, self.oPlayer.fScore, self.oPlayer.fAverageReward,
                uEnd - uStart,
                uReplayCapacity
            ), end = '\n' if bDone else '')

        if bDone:
            self.oPlayer.reset()

        return bDone, uFillCount

    # Fill the replay buffer in N steps or N episodes, depending on the setting
    def fill(self, uFillSize=1, uDebugLevel=0, bPrefill=False):
        if uDebugLevel > 1:
            print('\rFill: Processing...', end='')

        uTotalFillCount = 0
        # Keep `bDone` defined even when `uFillSize` is 0
        bDone = False

        for _ in range(uFillSize):
            bDone, uFillCount = self.__fill(uDebugLevel)
            uTotalFillCount += uFillCount

        if bPrefill:
            while not bDone:
                bDone, uFillCount = self.__fill(uDebugLevel)
                uTotalFillCount += uFillCount

        if uDebugLevel > 1:
            uEnd, uStart, uReplayCapacity = tfEval([self.tuReplayEnd, self.tuReplayStart, self.tuReplayCapacity])

            print('\nStatus: Used %d of %d from %d to %d' % (
                uEnd - uStart,
                uReplayCapacity,
                uStart % uReplayCapacity,
                uEnd % uReplayCapacity
            ))

        return uTotalFillCount, bDone

    # Initialize graph
    def initialize(self):
        return tfEval([self.oSummaryWriter.init(), self.tfnInitialize])

    # Save model
    def save(self):
        self.oTargetSaver.save(tfActiveSession.tfoSession, self.sTargetRestorePath)
        self.oTrainSaver.save(tfActiveSession.tfoSession, self.sTrainRestorePath)
        self.oReplayBufferSaver.save(tfActiveSession.tfoSession, self.sReplayBufferRestorePath)

    # Restore model
    def restore(self):
        try:
            self.oTargetSaver.restore(tfActiveSession.tfoSession, self.sTargetRestorePath)
        except ValueError:
            pass

        try:
            self.oTrainSaver.restore(tfActiveSession.tfoSession, self.sTrainRestorePath)
        except ValueError:
            pass

        try:
            self.oReplayBufferSaver.restore(tfActiveSession.tfoSession, self.sReplayBufferRestorePath)
        except ValueError:
            return False

        return True

    # Perform one training cycle
    def train(self):
        fStartTime = time.time()
        fTotalLoss = tfEval(self.tfnTrain)
        fTimeElapsed = time.time() - fStartTime

        return fTotalLoss, fTimeElapsed

    # Flush all write buffers before closing
    def flush(self):
        tfEval(self.oSummaryWriter.flush())
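The priority update inside `fnTrain` first writes the new priorities into the leaves of the sum and max trees and then recomputes the parents one level at a time until the root is reached. The following is a minimal pure-Python sketch of that idea, not the graph code itself. It assumes a binary tree stored in a flat list where node `i` has children `2 * i` and `2 * i + 1` and the root is at index 1; the function name `update_priorities` and the example sizes are hypothetical.

```python
def update_priorities(sum_tree, max_tree, leaf_indices, priorities):
    # Write the new priorities into the leaves of both trees
    for index, priority in zip(leaf_indices, priorities):
        sum_tree[index] = priority
        max_tree[index] = priority
    # Recompute the parents level by level until the root (index 1) is refreshed
    level = sorted({index // 2 for index in leaf_indices})
    while level and level[0] >= 1:
        for node in level:
            left, right = 2 * node, 2 * node + 1
            sum_tree[node] = sum_tree[left] + sum_tree[right]
            max_tree[node] = max(max_tree[left], max_tree[right])
        level = sorted({node // 2 for node in level})

# Hypothetical example: 4 leaves stored at indices 4..7, index 0 is unused
sum_tree = [0.0] * 8
max_tree = [0.0] * 8
update_priorities(sum_tree, max_tree, [4, 6], [0.5, 2.0])
print(sum_tree[1], max_tree[1])  # 2.5 2.0
```

The root of the sum tree then holds the total priority mass, which is what proportional sampling needs, and the root of the max tree holds the largest priority currently stored.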

# Training loop
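When `bEpisodeRating` is enabled, `__fill` keeps the steps of the current episode in a temporary buffer and, once the episode ends, converts the rewards into discounted ratings by walking the episode backwards. A minimal sketch of that backward pass in plain Python is shown here; the function name `discounted_ratings` and the sample rewards are hypothetical, and `discount` plays the role of `fRatingDiscountFactor`.

```python
def discounted_ratings(rewards, discount):
    # Walk the episode from the last step to the first,
    # accumulating reward + discount * (rating of the next step)
    ratings = [0.0] * len(rewards)
    running = 0.0
    for index in range(len(rewards) - 1, -1, -1):
        running = rewards[index] + discount * running
        ratings[index] = running
    return ratings

ratings = discounted_ratings([1.0, 0.0, 2.0], 0.5)
print(ratings)  # [1.5, 1.0, 2.0]
```

The last step keeps its own reward as the rating, and every earlier step adds the discounted rating of the step that follows it.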

In [11]:
# Create environment
oEnvironment = CustomEnvironment(bRender=bRender)
# Define active graph
with tfGraph() as tfoGraph:
    # Create neural network `Target Actor`
    nnActor = CreateActor(oEnvironment, 1)
    # Create virtual player
    oRandomPlayer = CustomPlayer(oEnvironment, nnActor, fnSelectNoisyRandom)
    # Create `Soft Actor Critic` agent
    oAgent = SacAgent(oRandomPlayer,
        uReplayCapacity=uReplayCapacity,
        fRatingDiscountFactor=fRatingDiscountFactor,
        uBatchSize=uBatchSize,
        bEpisodeRating=bEpisodeRating,
        bPriorityMode=bPriorityMode,
        fTrainableTargetAverageUpdateCoef=fTrainableTargetAverageUpdateCoef,
        uLogLevel=uLogLevel,
        sLogsPath='logs/' + sName + '/',
        sRestorePath='models/' + sName + '/')

    # Start a session that evaluates the graph
    with tfSession(tfoGraph):
        # Initialize global variables
        tfInitGlobal()
        # Initialize local variables
        tfInitLocal()
        # Initialize agent
        oAgent.initialize()
        if not bRestore or not oAgent.restore():
            # Pre-fill the replay buffer with initial data
            oAgent.fill(uBatchSize * 4, uDebugLevel=2, bPrefill=True)

        try:
            while True:
                # Fill the replay buffer in one step or one episode, depending on the setting
                uFillCount, bDone = oAgent.fill(1, uDebugLevel=1)
                # For each new step in the replay buffer, do a train
                while uFillCount > 0:
                    fTotalLoss, fTimeElapsed = oAgent.train()
                    uFillCount -= 1
                # Save model
                if bDone:
                    oAgent.save()

        except KeyboardInterrupt:
            print('\nKeyboard Interrupt')

        finally:
            # Flush all write buffers before closing
            oAgent.flush()
            # Close environment
            oEnvironment.close()

[MEM:468791296:3907833856] Fill: Ep.1:98, Score -201.759663, RewardAvg -0.196427, Used 98 of 131072        
[MEM:468791296:3907833856] Fill: Ep.2:69, Score -335.363813, RewardAvg -0.332597, Used 167 of 131072        
[MEM:468791296:3907833856] Fill: Ep.3:63, Score -101.148985, RewardAvg -0.100947, Used 230 of 131072        
[MEM:468791296:3907833856] Fill: Ep.4:88, Score -409.718291, RewardAvg -0.399650, Used 318 of 131072        
[MEM:470147072:3907833856] Fill: Ep.5:98, Score -149.299674, RewardAvg -0.147940, Used 416 of 131072        
[MEM:470147072:3907833856] Fill: Ep.6:69, Score -99.570018, RewardAvg -0.099234, Used 485 of 131072        
[MEM:470147072:3908096000] Fill: Ep.7:72, Score -85.722617, RewardAvg -0.084817, Used 557 of 131072        
[MEM:470147072:3908096000] Fill: Ep.8:138, Score -402.452137, RewardAvg -0.388593, Used 695 of 131072        
[MEM:470147072:3908096000] Fill: Ep.9:119, Score -468.703400, RewardAvg -0.455598, Used 814 of 131072        
[MEM:470147072:39080