- Distributed A3C
- Distributed TRPO with GAE
- Distributed DDPG
- Distributed Policy Gradient
- Distributed DQN
- Other Algorithms
Inspired by the original paper - Asynchronous Methods for Deep Reinforcement Learning from DeepMind
It is an actor-critic algorithm that learns both a policy and a state-value function; the value function is used for bootstrapping, i.e., updating a state's estimate from subsequent estimates, to reduce variance and accelerate learning.
In DA3C, parallel actors employ different exploration policies to stabilize training, so experience replay is not needed. Unlike most deep learning algorithms, this distributed method can run on multiple nodes with a centralized parameter server. For Atari games, DA3C ran much faster than DQN, Gorila, D-DQN, Dueling D-DQN, and Prioritized D-DQN while performing better than or comparably with them. Furthermore, DA3C also performs better than the original A3C and its synchronous variant, called A2C. DA3C also succeeded on continuous motor control problems: TORCS car racing, MuJoCo physics manipulation and locomotion, and Labyrinth, a task of navigating random 3D mazes from visual input, in which the agent faces a new maze in each new episode and so must learn a general strategy for exploring random mazes.
Here is the original pseudocode for A3C:
DA3C maintains a policy `π(a_t|s_t; θ)` and an estimate of the value function `V(s_t; θ_v)`, updated with n-step returns in the forward view after every `t_max` actions or upon reaching a terminal state, similar to using minibatches. In contrast to the original code, we use processes for the agents instead of threads, so each agent (or some set of agents) can run on a separate node.
The gradient update can be represented with a TD-error multiplier as in the original paper:

`∇_θ' log π(a_t|s_t; θ') (R_t − V(s_t; θ_v))`, where `R_t = Σ_{i=0..k−1} γ^i r_{t+i} + γ^k V(s_{t+k}; θ_v)`,

or with an estimate of the advantage function (GAE):

`∇_θ' log π(a_t|s_t; θ') A(s_t, a_t)`, where `A(s_t, a_t) = Σ_{i=0..k−1} (γλ)^i δ_{t+i}` and `δ_t = r_t + γ V(s_{t+1}; θ_v) − V(s_t; θ_v)`,

with `k` upper-bounded by `t_max`.
To use the latter, just set the config parameter `use_gae` to `true`.
The full set of possible config options is described below.
Gradients are also applied to the parameter server with delay compensation for better convergence.
Environment (Client) - each client connects to a particular Agent (Learner).
The main role of any client is to feed data to an Agent by transferring state, reward, and terminal signals (for episodic tasks, if the episode ends). The client updates these signals at each time step: it receives an action signal from the Agent and then sends the updated values back.
- Process State: each state can be passed through some filtering procedure before transfer (if defined). This could be a color, edge, or blob transformation (for image input) or a more complex pyramidal, Kalman, or spline filter.
Agent (Parallel Learner) - each Agent connects to the Parameter Server.
The main role of any agent is to perform the main training loop. The Agent synchronizes its neural network weights with the global network by copying the latter's weights at the beginning of each training mini-loop. The Agent then executes N steps of receiving the Client's signals and sending actions back; these N steps amount to batch collection. Once a batch is collected, the Agent computes the loss (wrt the collected data) and passes it to the optimizer, some SGD optimizer (Adam or RMSProp), which computes the gradients and sends them to the Parameter Server to update the global network weights. All Agents work completely independently in an asynchronous fashion and can update or receive the global network weights at any time.
- Agent's Neural Network: we use a neural network architecture similar to the universe agent (by default); a sketch of this default architecture follows this list.
  - Input: `3D` input to pass through `2D` convolutions (default: `42x42x1`) or any other shape.
  - Convolution Layers: `4` layers with `32` filters each, `3x3` kernel, stride `2`, and `ELU` activation (by default).
  - Fully connected Layers: one layer with `256` hidden units and `ReLU` activation (by default).
  - LSTM Layers: one layer with `256` cell size (by default it's replaced with a fully connected layer).
  - Actor: fully connected layer with the number of units equal to `action_size` and `Softmax` activation (by default). It outputs a `1-D` array holding the probability distribution over all possible actions for the given state.
  - Critic: fully connected layer with `1` unit (by default). It outputs a `0-D` array (a scalar) representing the value of the state (the expected return from this point).
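For illustration, here is a minimal PyTorch sketch of this default tower. PyTorch and the class name `DA3CNetwork` are assumptions for illustration only; the framework's actual implementation differs:

```python
import torch
import torch.nn as nn

class DA3CNetwork(nn.Module):
    """Sketch of the default DA3C tower: 4 conv layers -> FC(256) -> actor & critic."""

    def __init__(self, action_size, in_channels=1):
        super().__init__()
        # 4 convolution layers, 32 filters each, 3x3 kernel, stride 2, ELU activation
        layers, channels = [], in_channels
        for _ in range(4):
            layers += [nn.Conv2d(channels, 32, kernel_size=3, stride=2, padding=1), nn.ELU()]
            channels = 32
        self.convs = nn.Sequential(*layers)
        # a 42x42 input is halved four times -> 3x3 spatial map
        self.fc = nn.Sequential(nn.Linear(32 * 3 * 3, 256), nn.ReLU())
        self.actor = nn.Linear(256, action_size)   # Softmax applied in forward()
        self.critic = nn.Linear(256, 1)            # scalar state value

    def forward(self, x):                          # x: [batch, 1, 42, 42]
        h = self.fc(self.convs(x).flatten(1))
        return torch.softmax(self.actor(h), dim=-1), self.critic(h).squeeze(-1)
```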
- Total Loss = `Policy_Loss + critic_scale * Value_Loss`. The `critic_scale` parameter sets the `critic learning rate` relative to the `policy learning rate`. It's set to `1.0` by default; a value of `4.0`, for example, would make the `critic learning rate` `4` times larger than the `policy learning rate`.
- Value Loss: the sum (over all batch samples) of the squared difference between the expected discounted return `R` and the value of the current sample state `V(s)`, i.e. the expected discounted return from this state:

  `Value_Loss = Σ_t (R_t − V(s_t))²`, where `R_t = Σ_{i=0..k−1} γ^i r_{t+i} + γ^k V(s_{t+k})` with `k` upper-bounded by `t_max`. If `s_t` is terminal, then `V(s_t) = 0`.

- Policy Loss:

  `Policy_Loss = −Σ_t [ log π(a_t|s_t) · A(s_t, a_t) + entropy_beta · H(π(·|s_t)) ]`,

  where the first term is the policy log-likelihood multiplied by the advantage function, and the last term is the entropy multiplied by the regularization parameter `entropy_beta = 0.01` (by default).
- Compute Gradients: computes the gradients of the total loss wrt the neural network weights. The gradients are also clipped wrt the parameter `gradients_norm_clipping = 40.0` (by default). To perform the clipping, the values `nn_weights[i]` are set to: `nn_weights[i] * clip_norm / max(global_norm, clip_norm)`, where `global_norm = sqrt(sum([l2norm(w)**2 for w in nn_weights]))`. If `clip_norm > global_norm`, the entries in `nn_weights` remain as they are; otherwise they're all shrunk by the global ratio (see the sketch after this item). To avoid clipping, just set `gradients_norm_clipping: false` in the config yaml.
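A minimal numpy sketch of this global-norm clipping rule (equivalent in spirit to TensorFlow's `clip_by_global_norm`; the function name here is hypothetical):

```python
import numpy as np

def clip_by_global_norm(gradients, clip_norm=40.0):
    """Scale all gradients by clip_norm / max(global_norm, clip_norm)."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in gradients))
    if global_norm <= clip_norm:          # clip_norm > global_norm: leave as is
        return gradients
    scale = clip_norm / global_norm       # otherwise shrink all by the global ratio
    return [g * scale for g in gradients]
```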
- Synchronize Weights: synchronizes the agent's weights with the global neural network by copying the latter to replace its own at the beginning of each batch collection step (`1..t_max` or terminal). The new step will not start until the weights are updated, but this procedure can be switched to a non-blocking mode by setting `hogwild` to `true` in the code.
- Softmax Action: uses a `Boltzmann` distribution to select an action, so actions with higher probability are chosen more often; we call this a `Softmax` action for simplicity (see the sketch below). This method has some benefits over the classical eps-greedy strategy and helps to avoid the "path along the cliff" problem. Furthermore, it encourages more exploration at the beginning of training; the Agent becomes more confident in some actions as training proceeds, and the probability distribution over actions becomes more peaked.
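A minimal sketch of this sampling rule:

```python
import numpy as np

def softmax_action(probs, rng=None):
    """Sample an action index from the policy's output probability distribution."""
    rng = rng or np.random.default_rng()
    return rng.choice(len(probs), p=probs)

# Actions with higher probability are chosen more often, but every action
# keeps a nonzero chance of being explored:
action = softmax_action(np.array([0.7, 0.2, 0.1]))  # returns 0 about 70% of the time
```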
Parameter Server (Global) - one for the whole algorithm (training process).
The main role of the Parameter Server is to synchronize the neural network weights between Agents. It holds the shared (global) neural network weights, which are updated with the Agents' gradients, and sends an up-to-date copy of its weights back to the Agents for synchronization.
- Global Neural Network: the neural network weights mirror the Agent's.
- Some SGD Optimizer: it holds an SGD optimizer and its state (`Adam` | `RMSProp`). It is shared by all Agents and is used to apply their gradients. The default optimizer is `Adam` with `initial_learning_rate = 1e-4`, since the learning rate is linearly annealed wrt the `max_global_step` parameter.
Distributed version of the A3C algorithm that can cope with continuous action spaces. The architecture is similar to the previous one, but it uses two separate neural networks for Policy & Critic, along with a different Actor type, Policy Loss, and Choose Action procedure, plus additional state (reward) filtering.
- Actor (Continuous): it outputs two values, `mu` & `sigma`, separately, parameterizing a `Normal` distribution. They are represented by `2` fully connected layers with the number of units equal to `action_size`.
  - mu: applies a `Tanh` activation (by default) and represents the `mean` of the distribution. The output can also be scaled by the config parameter `scale`, or another activation (or none) can be used.
  - sigma: applies a `SoftPlus` operator (by default) and represents the `variance` of the distribution.
- Choose Action: uses random sampling for exploration wrt `sigma` and defines the final `action` by the formula: `random(action_size) * sigma + mu` (a sketch follows this item).
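A minimal sketch of this sampling step, assuming `random(action_size)` denotes standard normal samples and `sigma` is used as the noise scale:

```python
import numpy as np

def choose_action(mu, sigma, rng=None):
    """Standard normal noise scaled by sigma and shifted by mu."""
    rng = rng or np.random.default_rng()
    return rng.standard_normal(mu.shape) * sigma + mu
```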
- Policy Loss: the full expansion of the `Policy Loss` for a `continuous` action space looks as follows:

  `Policy_Loss = −Σ_t [ log N(a_t | mu_t, sigma_t) · A(s_t, a_t) + beta · H(N(mu_t, sigma_t)) ]`,

  where the term before the `advantage` represents the negative log-likelihood (`NLL`) and the term multiplied by `beta` is the `entropy` of the `Normal` distribution.
- Signal Filtering: uses a `running` estimate of the `mean` and `variance` wrt a stream of data. Inspired by this source. It allows filtering both `states` and `rewards` (it is applied only to states by default). A sketch of such a filter follows this list.
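A minimal sketch of such a running filter (a Welford-style online estimate; the class name is hypothetical):

```python
import numpy as np

class RunningFilter:
    """Online mean/variance estimate used to normalize a stream of states (or rewards)."""

    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)   # running sum of squared deviations

    def __call__(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        var = self.m2 / self.n if self.n > 1 else np.ones_like(self.m2)
        return (x - self.mean) / (np.sqrt(var) + 1e-8)
```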
The `DA3C` algorithm can also be extended with additional models. By default it can use an ICM by setting the `use_icm` parameter to `True`.
`ICM` helps the Agent to explore an environment out of curiosity when extrinsic rewards are sparse or not present at all. This model provides an intrinsic reward which is learned jointly with the Agent's policy, even without any extrinsic rewards from the environment. The conceptual architecture is shown in the figure below:
You must specify the parameters for the algorithm in the corresponding `app.yaml` file to run:
algorithm:
  name: da3c                    # name of the algorithm to load
  input:
    shape: [42, 42]             # shape of the incoming state from an environment
    history: 4                  # number of consecutive states to stack for input
    use_convolutions: true      # set to True to process input by convolution layers
  output:
    continuous: false           # set to True to use continuous Actor
    action_size: 18             # action size for the given environment
    scale: 2.5                  # multiplier to scale continuous action symmetrically
    action_low: [-3]            # lower bound (or list of values) to clip continuous action
    action_high: [2]            # upper bound (or list of values) to clip continuous action
  batch_size: 5                 # t_max for batch collection step size
  hidden_sizes: [256]           # list defining layer sizes after convolutions
  use_icm: true                 # set to True to use ICM module
  gae_lambda: 1.00              # discount lambda for generalized advantage estimation
  use_lstm: true                # set to True to use LSTM instead of fully connected layers
  max_global_step: 1e8          # maximum number of global steps to pass through the training
  optimizer: Adam
  initial_learning_rate: 1e-4   # initial learning rate, linearly annealed during training
  RMSProp:                      # used only for the RMSProp optimizer (ignored with Adam)
    decay: 0.99
    epsilon: 0.1
  entropy_beta: 0.01            # entropy regularization constant
  entropy_type: Origin          # choice between the Normal (`Gauss`) and `Origin` A3C entropy
  rewards_gamma: 0.99           # rewards discount factor
  gradients_norm_clipping: 40.  # value for gradients norm clipping
  policy_clip: false            # false or a value (e.g. 5.0) to clip policy loss within [-value, +value]
  critic_clip: 2.0              # false or a value (e.g. 2.0) to clip value loss within [-value, +value]
  icm:                          # ICM-relevant parameters
    nu: 0.01                    # prediction bonus multiplier for intrinsic reward
    beta: 0.2                   # forward loss importance against the inverse model
    lr: 1e-3                    # ICM learning rate
Parameters that don't make sense for the current setup may be omitted (defaults are retrieved for some of them). It can also be helpful to use some notation to distinguish the different versions of `DA3C`.
Therefore `DA3C-LSTM` refers to an architecture with `LSTM` layers and `DA3C-FF` to a feedforward one. `Discrete DA3C-FF-ICM-16` denotes a feedforward architecture with a discrete actor and a curiosity model (`ICM`), run with `16` Agents.
DA3C graph sample from TensorBoard
Performance of `Vanilla A3C` on classic `Atari` environments from the original paper (`1` day = `80` million steps)
`DA3C-LSTM-8` with the Universe A3C architecture on Gym's Atari Pong (see the universe-starter-agent result to compare):
`DA3C-FF-8` with the Vanilla A3C architecture on Gym's Atari Boxing:
Continuous `DA3C-LSTM` on BipedalWalker:
| Agent Node | PS Node | Number of clients | Performance |
|---|---|---|---|
| m4.xlarge | m4.xlarge | 32 | 99 steps per sec |
| m4.xlarge | m4.xlarge | 48 | 167 steps per sec |
| m4.xlarge | m4.xlarge | 64 | 171 steps per sec |
| c4.xlarge | c4.xlarge | 48 | 169 steps per sec |
| c4.xlarge | c4.xlarge | 64 | 207 steps per sec |
| c4.xlarge | m4.xlarge | 64 | 170 steps per sec |
| c4.xlarge | m4.xlarge | 96 | 167 steps per sec |
| c4.xlarge | m4.xlarge | 128 | 177 steps per sec |
| c4.2xlarge | c4.2xlarge | 232 | 232 steps per sec |
| c4.2xlarge | c4.2xlarge | 271 | 271 steps per sec |
Distributed version of the TRPO-GAE algorithm, which can cope with both continuous & discrete action spaces.
Inspired by the original papers:
- Trust Region Policy Optimization
- High-Dimensional Continuous Control Using Generalized Advantage Estimation
The main pipeline of the algorithm is similar to the original sources, but the collection of trajectories is performed independently by parallel agents. Each agent holds a copy of the policy neural network to roll out trajectories from its client. The parameter server is locked for updates while the batch is being collected, and this procedure repeats. A sketch of the generalized advantage estimation used here follows.
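A minimal sketch of generalized advantage estimation, using the `rewards_gamma` and `gae_lambda` parameters from the configs in this document:

```python
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=1.0):
    """Compute GAE advantages for one trajectory; values has len(rewards) entries."""
    values = np.append(values, last_value)      # bootstrap with V(s_T), 0 if terminal
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error
        gae = delta + gamma * lam * gae                         # discounted sum of deltas
        advantages[t] = gae
    return advantages
```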
`batch_size == 10000`, `trajectory_length == 1600`, `parallel_agents == 8`
Inspired by the original paper - Continuous control with deep reinforcement learning from DeepMind
It is an actor-critic, model-free, deep deterministic policy gradient (DDPG) algorithm for continuous action spaces, extending DQN and DPG. With an actor-critic as in DPG, DDPG avoids optimizing the action at every time step to obtain a greedy policy as in Q-learning, which would be infeasible in complex action spaces with large, unconstrained function approximators like deep neural networks. To make learning stable and robust, similar to DQN, DDPG deploys experience replay and an idea similar to the target network, the "soft" target: rather than copying the weights directly as in DQN, it updates the soft target network weights `θ'` slowly to track the learned network weights `θ`: `θ' ← τθ + (1 − τ)θ'`, with `τ ≪ 1`. The authors adapted batch normalization to handle the issue that different components of the observation have different physical units. As an off-policy algorithm, DDPG learns an actor policy from experiences generated by an exploration policy, formed by adding noise sampled from a noise process to the actor policy.
Here is the original pseudocode for DDPG:
Environment (Client) - each client connects to a particular Agent (Learner).
The main role of any client is to feed data to an Agent by transferring state, reward, and terminal signals (for episodic tasks, if the episode ends). The client updates these signals at each time step: it receives an action signal from the Agent and then sends the updated values back.
- Process State: each state can be passed through some filtering procedure before transfer (if defined). This could be a color, edge, or blob transformation (for image input) or a more complex pyramidal, Kalman, or spline filter.
Agent (Parallel Learner) - each Agent connects to the Parameter Server.
The main role of any agent is to perform the main training loop. The Agent synchronizes its neural network weights with the global network by copying the latter's weights at the beginning of each update procedure. The Agent executes `1` step (TBD `N` steps) of receiving the Client's signals and sending actions back, then stores the interaction tuple `state | action | reward | next_state` in the `Replay Buffer`. These steps are similar to batch collection. Once the `Replay Buffer` has enough samples to retrieve a batch of the size defined by the config parameter `batch_size`, the Agent computes the loss (wrt the collected data) and passes it to the optimizer. It uses the ADAM optimizer (by default), which computes the gradients and sends them to the Parameter Server to update the global network weights. All Agents work completely independently in an asynchronous fashion and can update or receive the global network weights at any time.
- Agent's Neural Networks: `4` neural networks -> `Actor` & `Actor Target`, `Critic` & `Critic Target`.
  - Input: input with a shape suitable to pass through `2D` convolutions or to fully connected layers.
  - Convolution Layers: defined by the relevant dictionary, or use the default.
  - Fully connected Layers: set of layers defined by the parameter `hidden_sizes` with `ReLU` activation (by default).
  - Actor: fully connected layer with the number of units equal to `action_size` and no activation (by default). It outputs a `1-D` array representing the continuous action for the given state.
  - Critic: fully connected layer with `1` unit (by default). It outputs a `0-D` array (a scalar) representing the value of the state-action pair (the expected return from this point).
- Critic Loss: computes the loss for the `Critic` neural network, with the `Target` networks providing the baselines:

  `Critic_Loss = (1/N) Σ_i (y_i − Q(s_i, a_i | θ^Q))²`, where `N` is the `batch_size` and `y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1} | θ^μ') | θ^Q')`.

- Actor Loss: computes the loss for the `Actor` neural network wrt the `Critic` network:

  `∇_{θ^μ} J ≈ (1/N) Σ_i ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}`, where `N` is the `batch_size`.
- Compute Gradients: computes the gradients of the relevant loss wrt the neural network weights. Gradients are computed only for the `Actor` & `Critic` neural networks, not for the `Target` ones.
- Synchronize Weights: synchronizes the agent's weights with the global neural networks by copying the latter to replace its own at the beginning of each batch collection step. The new step will not start until the weights are updated.
- Define Action: takes the action proposed by the `Actor` network. Actions are summed with some noise that is annealed through the training, or (recommended) with noise from an `Ornstein-Uhlenbeck` process, enabled by setting the parameter `ou_noise` in the config to `True`: `a_t = μ(s_t | θ^μ) + N_t`, where `N_t` is the noise process.
- Noise Process: generates `noise` to add to the action (a sketch follows this list).
- Replay Buffer: holds tuples `state | action | reward | next_state` in a cyclic buffer with the size defined by the parameter `buffer_size`. Samples are retrieved from this buffer to perform an update, with the batch size defined by the parameter `batch_size`.
- Signal Filtering: uses a `running` estimate of the `mean` and `variance` wrt a stream of data. Inspired by this source. It allows filtering both `states` and `rewards` (it is applied only to states by default).
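A minimal sketch of the Ornstein-Uhlenbeck noise process (parameter names follow the `exploration` section of the config below; the class name is hypothetical):

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: x += theta * (mu - x) + sigma * N(0, 1)."""

    def __init__(self, action_size, ou_mu=0.0, ou_theta=0.15, ou_sigma=0.20):
        self.mu, self.theta, self.sigma = ou_mu, ou_theta, ou_sigma
        self.x = np.full(action_size, float(ou_mu))

    def sample(self, rng=None):
        rng = rng or np.random.default_rng()
        self.x += self.theta * (self.mu - self.x) \
                  + self.sigma * rng.standard_normal(self.x.shape)
        return self.x
```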
Parameter Server (Global) - one for the whole algorithm (training process).
The main role of the Parameter Server is to synchronize the neural network weights between Agents. It holds the shared (global) neural network weights, which are updated with the Agents' gradients, and sends an up-to-date copy of its weights back to the Agents for synchronization.
- Global Neural Networks: the neural network weights mirror the Agent's.
- ADAM Optimizers: it holds two ADAM optimizers, for the `Actor` and the `Critic` neural networks. The optimizers' states are global for all Agents and are used to apply their gradients. Gradients are applied only to the `Actor` & `Critic` neural networks, not to the `Target` ones. Then a soft update of the `Target` networks is performed wrt the parameter `tau`: `θ' ← τθ + (1 − τ)θ'`.
You must specify the parameters for the algorithm in the corresponding `app.yaml` file to run:
algorithm:
  name: ddpg                    # name of the algorithm to load
  input:
    shape: [3]                  # shape of the incoming state from an environment
    history: 1                  # number of consecutive states to stack for input
    use_convolutions: false     # set to True to process input by convolution layers
  output:
    action_size: 1              # action size for the given environment
    scale: 2.0                  # multiplier to scale continuous action symmetrically
  hidden_sizes: [400, 300]      # list of dense layer sizes, e.g. [128, 64]
  batch_size: 64                # batch size needed for one network update
  buffer_size: 10000            # local buffer size to sample experience (400k-1m)
  rewards_gamma: 0.99           # rewards discount factor
  actor_learning_rate: 0.0001   # actor learning rate
  critic_learning_rate: 0.001   # critic learning rate
  tau: 0.001                    # rate of target updates
  l2: true                      # set to True to add l2 regularization loss for the Critic
  l2_decay: 0.01                # regularization constant multiplier for the Critic's l2 loss
  ou_noise: true                # set to True to use an Ornstein-Uhlenbeck process for the noise
  exploration:                  # exploration parameters wrt the Ornstein-Uhlenbeck process
    ou_mu: 0.0
    ou_theta: 0.15
    ou_sigma: 0.20
    tau: 25
  log_lvl: INFO                 # additional metrics output wrt levels: INFO | DEBUG | VERBOSE
  no_ps: false                  # set to True to perform training without a parameter server
Parameters that don't make sense for the current setup may be omitted (defaults are retrieved for some of them).
Distributed DDPG with `4` Agents on the classic Pendulum continuous control task:
It is a classical method based on the `REINFORCE` rule (Williams, 1992). Policy Gradient (or `PG`) maintains a policy `π(a_t|s_t; θ)` and, similar to `DA3C`, is updated with n-step returns in the forward view, after every `t_max` actions or upon reaching a terminal state, similar to using minibatches.
It updates in the direction of:

`∇_θ log π(a_t|s_t; θ) R_t`, where `R_t = Σ_{i=0..k−1} γ^i r_{t+i}` with `k` upper-bounded by `t_max`.
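A minimal sketch of this n-step return computation:

```python
def discounted_returns(rewards, gamma=0.99, bootstrap=0.0):
    """R_t = r_t + gamma * R_{t+1}; bootstrap is the value past the last step (0 at a terminal state)."""
    returns, r = [], bootstrap
    for reward in reversed(rewards):
        r = reward + gamma * r
        returns.append(r)
    return returns[::-1]
```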
The principal architecture is similar to DA3C, except that it works with only one `Policy` neural network and always uses a `Discrete` actor.
Environment (Client) - each client connects to a particular Agent (Learner).
The main role of any client is to feed data to an Agent by transferring state, reward, and terminal signals (for episodic tasks, if the episode ends). The client updates these signals at each time step: it receives an action signal from the Agent and then sends the updated values back.
- Process State: each state can be passed through some filtering procedure before transfer (if defined). This could be a color, edge, or blob transformation (for image input) or a more complex pyramidal, Kalman, or spline filter.
Agent (Parallel Learner) - each Agent connects to the Parameter Server.
The main role of any agent is to perform the main training loop. The Agent synchronizes its neural network weights with the global network by copying the latter's weights at the beginning of each training mini-loop. The Agent then executes N steps of receiving the Client's signals and sending actions back; these N steps amount to batch collection. Once a batch is collected, the Agent computes the loss (wrt the collected data) and passes it to the optimizer, an ADAM optimizer, which computes the gradients and sends them to the Parameter Server to update the global network weights. All Agents work completely independently in an asynchronous fashion and can update or receive the global network weights at any time.
- Agent's Neural Network:
  - Input: input with a shape suitable to pass through `2D` convolutions or to fully connected layers.
  - Convolution Layers: defined by the relevant dictionary, or use the default.
  - Fully connected Layers: set of layers defined by the parameter `hidden_sizes` with `ReLU` activation (by default).
  - Actor: fully connected layer with the number of units equal to `action_size` and `Softmax` activation (by default). It outputs a `1-D` array holding the probability distribution over all possible actions for the given state.
- Compute Gradients: computes the gradients of the policy loss wrt the neural network weights.
- Synchronize Weights: synchronizes the agent's weights with the global neural network by copying the latter to replace its own at the beginning of each batch collection step (`1..t_max` or terminal). The new step will not start until the weights are updated.
- Softmax Action: uses a `Boltzmann` distribution to select an action, so actions with higher probability are chosen more often; we call this a `Softmax` action for simplicity. This method has some benefits over the classical eps-greedy strategy and helps to avoid the "path along the cliff" problem. Furthermore, it encourages more exploration at the beginning of training; the Agent becomes more confident in some actions as training proceeds, and the probability distribution over actions becomes more peaked.
Parameter Server (Global) - one for the whole algorithm (training process).
The main role of the Parameter Server is to synchronize the neural network weights between Agents. It holds the shared (global) neural network weights, which are updated with the Agents' gradients, and sends an up-to-date copy of its weights back to the Agents for synchronization.
- Global Neural Network: the neural network weights mirror the Agent's.
- Adam Optimizer: it holds the optimizer's state. It is shared by all Agents and is used to apply their gradients.
You must specify the parameters for the algorithm in the corresponding `app.yaml` file to run:
algorithm:
  name: policy_gradient         # name of the algorithm to load
  input:
    shape: [4]                  # shape of the incoming state from an environment
    history: 1                  # number of consecutive states to stack for input
    use_convolutions: false     # set to True to process input by convolution layers
  output:
    action_size: 2              # action size for the given environment
  hidden_sizes: [10]            # list defining layer sizes after convolutions
  batch_size: 200               # t_max for batch collection step size
  learning_rate: 0.01           # learning rate for the optimizer
  GAMMA: 0.99                   # rewards discount factor
Distributed Policy Gradient with `4` Agents on the classic CartPole task:
Inspired by the original paper - Massively Parallel Methods for Deep Reinforcement Learning from DeepMind
It is one of the first applications of deep learning models to reinforcement learning. It uses a deep neural network to approximate the future reward, trained with a variant of the Q-learning algorithm using stochastic gradient descent to update the weights. To alleviate the problems of correlated data and non-stationary distributions, an experience replay mechanism is used; it is implemented as a memory buffer which stores only the N most recent samples. DQN is model-free, i.e. it solves the reinforcement learning task directly using samples from the emulator. It is also off-policy: it learns about the greedy strategy `a = argmax_a Q(s, a; θ)` while following a behavior distribution that ensures adequate exploration of the state space, implemented as an eps-greedy strategy. To improve the stability of the training process and prevent it from diverging, a separate (target) Q-network is introduced. It is used for generating the targets in the Q-learning update and is synchronized with the online Q-network every `update_target_weights_interval` updates.
Here is the original pseudocode for DQN:
Environment (Client) - each client connects to a particular Agent (Learner).
The main role of any client is to feed data to an Agent by transferring state, reward, and terminal signals (for episodic tasks, if the episode ends). The client updates these signals at each time step: it receives an action signal from the Agent and then sends the updated values back.
- Process State: each state can be passed through some filtering procedure before transfer (if defined). This could be a color, edge, or blob transformation (for image input) or a more complex pyramidal, Kalman, or spline filter.
Agent (Parallel Learner) - each Agent connects to the Parameter Server.
The main role of any agent is to perform the main training loop. The Agent synchronizes its neural network weights with the global network by copying the latter's weights at the beginning of each update procedure. The Agent executes `1` step (TBD `N` steps) of receiving the Client's signals and sending actions back, then stores the interaction tuple `(state, action, reward, next_state, terminal)` in the Replay Buffer. These steps are similar to batch collection. Once the `ReplayBuffer` has enough samples to retrieve a batch of the size defined by the config parameter `batch_size`, the Agent computes the loss (wrt the collected data) and passes it to the optimizer. It uses the `Adam` optimizer (by default), which computes the gradients and sends them to the Parameter Server to update the global network weights. All Agents work completely independently in an asynchronous fashion and can update or receive the global network weights at any time.
Agent's Neural Networks: `2` identical neural networks named `Network` and `Target Network`, with weights `θ` and `θ⁻` respectively.
- Input: input with a shape suitable to pass through 2D convolutions or to fully connected layers.
- Convolution Layers: defined by the relevant dictionary, or use the default.
- Fully connected Layers: set of layers defined by the parameter `hidden_sizes` with `ReLU` activation (by default).
- Network: fully connected layer with the number of units equal to `action_size` and no activation (by default). It outputs a 1-D array which represents the Q values over all possible actions for the given state.
- Loss: let `y_t = r_t` if `s_{t+1}` is `terminal` and `y_t = r_t + γ max_a Q(s_{t+1}, a; θ⁻)` otherwise. Then `Loss = Σ_t (y_t − Q(s_t, a_t; θ))²` (a sketch of this target computation follows this list).
- Compute Gradients: computes the gradients of the relevant loss wrt the neural network weights. Gradients are computed for the `Network` only.
- Define Action: chooses an action according to an `eps`-greedy policy, i.e. a random action with probability `eps`, or the action with the maximum Q value otherwise. Each agent starts with `eps` equal to the `eps.initial` value, which linearly decreases to the `eps.end` value during `eps.decay_steps` training steps. If `eps.stochastic` is True, `eps.decay_steps` should be a pair `[eps_min, eps_max]` instead, and the actual number of steps is chosen randomly from that range.
- Replay Buffer: holds tuples `(state, action, reward, next_state, terminal)` in a cyclic buffer with the size defined by the parameter `max_len`. Samples are randomly retrieved from this buffer to perform an update with the batch size defined by the parameter `batch_size`. The choice of samples can be prioritized toward more recent ones by increasing the `alpha` parameter.
- Adam Optimizer: holds the Adam optimizer for the `Network`. Each Agent has its own optimizer, which is used to calculate gradients.
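A minimal sketch of this target computation, including the `double_dqn` variant from the config below:

```python
import numpy as np

def q_targets(rewards, terminals, q_next_online, q_next_target,
              gamma=1.0, double_dqn=True):
    """y = r for terminal transitions, else r + gamma * (target) Q of the next state."""
    if double_dqn:  # online network selects the action, target network evaluates it
        best = np.argmax(q_next_online, axis=1)
        next_q = q_next_target[np.arange(len(best)), best]
    else:           # vanilla DQN: max over the target network's Q values
        next_q = np.max(q_next_target, axis=1)
    return rewards + gamma * next_q * (1.0 - terminals)
```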
Parameter Server (Global) - one for the whole algorithm (training process).
The main role of the Parameter Server is to synchronize the neural network weights between Agents. It holds the shared (global) neural network weights, which are updated by the Agents, and sends an up-to-date copy of its weights back to the Agents for synchronization.
- Global Neural Network: a single neural network identical to the Agent's `Network`, with weights `θ`.
- Adam Optimizer: it holds the optimizer's state. It is one for all Agents and is used to apply gradients from them.
You must specify the parameters for the algorithm in the corresponding `app.yaml` file to run:
algorithm:
  name: dqn                               # short name of the algorithm to load
  input:
    shape: [4]                            # shape of the input state
    history: 1                            # number of consecutive states to stack
    use_convolutions: false               # set to true to process input by convolution layers
  output:
    action_size: 2                        # action size for the given environment
  double_dqn: true                        # use Double DQN if true
  dueling_dqn: true                       # use Dueling DQN if true
  hidden_sizes: [64]                      # list of dense layer sizes, e.g. [128, 64]
  batch_size: 32                          # maximum batch size to accumulate for one update
  max_global_step: 150000                 # maximum number of global steps to pass through the training
  start_sample_step: 2000                 # number of steps before training of the local Q-network starts
  update_weights_interval: 100            # interval for receiving target Q-network weights from the GlobalServer
  update_target_weights_interval: 2000    # approximate number of steps between target Q-network updates on the Agent
  update_target_weights_min_steps: 500    # minimum number of steps before updating the target Q-network after the last update
  rewards_gamma: 1.0                      # rewards discount factor
  initial_learning_rate: 5e-4             # initial learning rate, which can be annealed by some procedure
  gradients_norm_clipping: false          # gradients clipping by global norm; ignored if false
  optimizer: Adam                         # name of the optimizer to use during training
  replay_buffer_size: 50000               # maximum number of samples in the replay buffer
  alpha: 1.0                              # prioritization exponent; larger values lead to more prioritization
  eps:
    initial: 1.0                          # initial value for epsilon
    end: 0.02                             # end value for epsilon
    stochastic: False                     # use a stochastic number of epsilon decay steps if true
    decay_steps: 10000                    # number of decay steps, or a decay steps range if stochastic == true
These are other algorithms we are working on and plan to make run on the RELAAX server:
- ACER (A3C with experience replay). Inspired by:
- UNREAL. Inspired by:
- PPO with L-BFGS (similar to TRPO). Inspired by:
- CEM. Inspired by: