click on title to go to contents

Inspired by original paper - Asynchronous Methods for Deep Reinforcement Learning from DeepMind

It is an actor-critic algorithm that learns both a policy and a state-value function; the value function is used for bootstrapping, i.e., updating a state estimate from subsequent estimates, to reduce variance and accelerate learning.

In DA3C, parallel actors employ different exploration policies to stabilize training, so experience replay is not needed. Unlike most deep learning algorithms, this distributed method can run on multiple nodes with a centralised parameter server. On Atari games, DA3C ran much faster than DQN, Gorila, D-DQN, Dueling D-DQN and Prioritized D-DQN while performing better than or comparably to them. Furthermore, DA3C also performs better than the original A3C and its synchronous variant, called A2C. DA3C also succeeded on continuous motor control problems: the TORCS car racing game, MuJoCo physics manipulation and locomotion, and Labyrinth, a task of navigating random 3D mazes from visual input, in which the agent faces a new maze in each episode and therefore must learn a general strategy for exploring random mazes.

Here is the original pseudo-code for A3C:

img

DA3C maintains a policy img and an estimate of the value function img, updated with n-step returns in the forward view after every tmax actions or upon reaching a terminal state, similar to using minibatches. In contrast to the original code, we use processes for the agents instead of threads, so each agent, or a set of agents, can run on a separate node.

The gradient update can be expressed with a TD-error multiplier as in the original paper img, or with an estimate of the advantage function: img, where img with k bounded above by tmax. To use the latter, set the config parameter use_gae to true. The full set of possible config options is described later. Gradients are also applied with delay compensation on the parameter server for better convergence.
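As a concrete illustration, here is a minimal numpy sketch (not the repository code) of the two estimators mentioned above: plain n-step returns and GAE. It assumes `rewards` and `values` are collected over one batch of up to tmax steps and `bootstrap` is the value estimate of the last state (0 if terminal).

```python
import numpy as np

def nstep_advantages(rewards, values, bootstrap, gamma=0.99):
    # R_t = r_t + gamma * R_{t+1}; advantage = R_t - V(s_t)
    returns, r = [], bootstrap
    for reward in reversed(rewards):
        r = reward + gamma * r
        returns.append(r)
    return np.asarray(returns[::-1]) - np.asarray(values)

def gae_advantages(rewards, values, bootstrap, gamma=0.99, gae_lambda=1.0):
    # generalized advantage estimation (used when use_gae is true)
    values = np.append(values, bootstrap)
    gae, advantages = 0.0, []
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error
        gae = delta + gamma * gae_lambda * gae
        advantages.append(gae)
    return np.asarray(advantages[::-1])
```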

img

Environment (Client) - each client connects to a particular Agent (Learner).

The main role of any client is to feed data to an Agent by transferring state, reward and terminal signals (for episodic tasks, if the episode ends). The client updates these signals at each time step: it receives the action signal from the Agent and then sends the updated values back.

  • Process State: each state can be passed through a filtering procedure before transferring (if you define one). It could be a color, edge or blob transformation (for image input) or a more complex pyramidal, Kalman or spline filter.

Agent (Parallel Learner) - each Agent connects to the Parameter Server.

The main role of any agent is to perform the main training loop. The Agent synchronizes its neural network weights with the global network by copying the latest weights at the beginning of each training mini-loop. The Agent then executes N steps of receiving the Client's signals and sending actions back; these N steps are similar to batch collection. Once the batch is collected, the Agent computes the loss (w.r.t. the collected data) and passes it to the Optimizer. It can be an SGD optimizer (Adam or RMSProp), which computes gradients and sends them to the Parameter Server to update its neural network weights. All Agents work completely independently in an asynchronous way and can update or receive the global network weights at any time. A rough sketch of this loop is shown below.
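The sketch below is an illustrative rendering of this mini-loop; `ps`, `client`, `agent_net` and their methods are hypothetical names, not the actual RELAAX API.

```python
def training_loop(agent_net, ps, client, t_max=5):
    while True:
        agent_net.set_weights(ps.get_weights())      # synchronize with the global network
        batch = []
        for _ in range(t_max):                       # N-step batch collection
            state = client.receive_state()
            action = agent_net.act(state)
            reward, terminal = client.send_action(action)
            batch.append((state, action, reward))
            if terminal:
                break
        loss = agent_net.compute_loss(batch)         # policy + value loss
        grads = agent_net.compute_gradients(loss)
        ps.apply_gradients(grads)                    # asynchronous update on the server
```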

  • Agent's Neural Network: we use a neural network architecture similar to the universe agent (by default).

    • Input: 3D input to pass through 2D convolutions (default: 42x42x1) or any other shape.
    • Convolution Layers: 4 layers with 32 filters each, 3x3 kernel, stride 2, and ELU activation (by default).
    • Fully connected Layers: one layer with 256 hidden units and ReLU activation (by default).
    • LSTM Layers: one layer with a cell size of 256 (by default it is replaced with a fully connected layer).
    • Actor: fully connected layer with the number of units equal to action_size and Softmax activation (by default).
      It outputs a 1-D array with a probability distribution over all possible actions for the given state.
    • Critic: fully connected layer with 1 unit (by default).
      It outputs a 0-D array (scalar) representing the value of the state (the expected return from this point).
  • Total Loss = Policy_Loss + critic_scale * Value_Loss
    The critic_scale parameter sets the weight of the critic (value) loss relative to the policy loss,
    i.e. the critic's effective learning rate. It is set to 1.0 by default, i.e. both losses are weighted equally.
    A minimal sketch of this loss is shown after this list.

    • Value Loss: the sum (over all batch samples) of the squared difference between the expected discounted reward R and the value of the current sample state V(s), i.e. the expected discounted return from this state.
      img, where img with k bounded above by tmax.
      If st is terminal then V(st) = 0.

    • Policy Loss:
      img
      where the first term is the policy log-likelihood multiplied by the advantage function,
      and the last term is the entropy multiplied by the regularization parameter entropy_beta = 0.01 (by default).

  • Compute Gradients: it computes the gradients of the total loss wrt the neural network weights.
    Gradients are also clipped wrt the parameter gradients_norm_clipping = 40.0 (by default).
    To perform the clipping, each gradient tensor grads[i] is set to:

    grads[i] * clip_norm / max(global_norm, clip_norm)


    where:

    global_norm = sqrt(sum([l2norm(g)**2 for g in grads]))


    If clip_norm > global_norm then the gradients remain as they are,
    otherwise they are all shrunk by the global ratio.
    To avoid clipping, just set gradients_norm_clipping = false in the config yaml.

  • Synchronize Weights: the agent synchronizes its weights with the global neural network by copying
    the global weights over its own at the beginning of each batch collection step (1..tmax or terminal).
    A new step will not start until the weights are updated, but this procedure can be switched
    to a non-blocking mode by setting hogwild to true in the code.

  • Softmax Action: it uses a Boltzmann distribution to select an action, i.e. actions with higher
    probability are chosen more often; we call this Softmax action for simplicity.
    This method has some benefits over the classical e-greedy strategy and helps to avoid the
    "path along the cliff" problem. Furthermore, it encourages more exploration at the beginning of training.
    The Agent becomes more confident in some actions as training proceeds, and the probability distribution
    over actions becomes more peaked.
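A minimal numpy sketch of the total loss described above, under the assumption that `probs` are the Actor's softmax outputs and `values` the Critic's outputs for one collected batch (illustrative only, not the repository graph):

```python
import numpy as np

def da3c_loss(probs, values, actions, returns, advantages,
              critic_scale=1.0, entropy_beta=0.01):
    # probs: [batch, action_size] softmax outputs; values: [batch] critic outputs
    log_probs = np.log(probs + 1e-8)
    action_log_prob = log_probs[np.arange(len(actions)), actions]
    entropy = -np.sum(probs * log_probs, axis=1)
    policy_loss = -np.sum(action_log_prob * advantages + entropy_beta * entropy)
    value_loss = np.sum((returns - values) ** 2)      # sum over the batch samples
    return policy_loss + critic_scale * value_loss    # Total Loss
```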

Parameter Server (Global) - one for the whole algorithm (training process).

The main role of the Parameter Server is to synchronize neural network weights between Agents.
It holds the shared (global) neural network weights, which are updated by the Agents' gradients,
and sends a current copy of its weights back to the Agents for synchronization.

  • Global Neural Network: its weights have the same layout as the Agent's network.

  • Some SGD Optimizer: it holds an SGD optimizer and its state (Adam | RMSProp).
    There is one optimizer for all Agents, and it is used to apply their gradients.
    The default optimizer is Adam with initial_learning_rate = 1e-4,
    which is linearly annealed wrt the max_global_step parameter (see the sketch below).
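A small sketch of this linear annealing, assuming the rate decays from initial_learning_rate to zero over max_global_step global steps:

```python
def annealed_learning_rate(global_step, initial_learning_rate=1e-4, max_global_step=1e8):
    # linearly decay the learning rate to zero over max_global_step steps
    fraction = max(0.0, 1.0 - global_step / float(max_global_step))
    return initial_learning_rate * fraction
```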

This is the distributed version of the A3C algorithm that can cope with continuous action spaces.
The architecture is similar to the previous one, but it uses two separate neural networks for Policy & Critic,
a different Actor type, Policy Loss and Choose Action procedure, and additional state (and reward) filtering.

img

  • Actor (Continuous): it outputs two values, mu & sigma, separately, parameterizing a Normal distribution.
    They are represented by 2 fully connected layers with the number of units equal to action_size.

    • mu: it applies a Tanh activation (by default) and represents the mean of the distribution.
      The output can also be scaled by the config parameter scale, or another activation (or none) can be used.
    • sigma: it applies a SoftPlus operator (by default) and represents the variance of the distribution.
  • Choose Action: it uses random sampling for exploration wrt sigma
    and defines the final action by the formula: random(action_size) * sigma + mu

  • Policy Loss: img
    The full expansion of the Policy Loss for a continuous action space looks as follows:
    img

    where the term before the advantage is the negative log-likelihood (NLL)
    and the term multiplied by beta is the entropy of the Normal distribution.

  • Signal Filtering: it uses a running estimate of the mean and variance over a stream of data.
    Inspired by this source. It allows filtering of both states and rewards (by default it is used only for states); a minimal sketch follows.
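A minimal running mean/variance filter of this kind might look as follows (a Welford-style sketch, not the referenced source):

```python
import numpy as np

class RunningFilter:
    def __init__(self, shape):
        self.n = 0
        self.mean = np.zeros(shape)
        self.m2 = np.zeros(shape)        # sum of squared deviations (Welford's method)

    def __call__(self, x):
        x = np.asarray(x, dtype=np.float64)
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        std = np.sqrt(self.m2 / max(self.n - 1, 1)) + 1e-8
        return (x - self.mean) / std     # normalized signal
```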

The DA3C algorithm can also be extended with additional models.
By default it can use an ICM by setting the use_icm parameter to True.

ICM helps the Agent to explore an environment out of curiosity when extrinsic rewards are sparse or absent altogether. This model provides an intrinsic reward which is learned jointly with the Agent's policy, even without any extrinsic rewards from the environment. The conceptual architecture is shown in the figure below, followed by a small sketch of the curiosity bonus.

img
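A tiny sketch of how the curiosity bonus can be formed, following the ICM paper rather than necessarily this repository's code; `phi_next_pred` and `phi_next` stand for the predicted and actual next-state embeddings:

```python
import numpy as np

def intrinsic_reward(phi_next_pred, phi_next, nu=0.01):
    # r_intrinsic = nu / 2 * || phi_hat(s_{t+1}) - phi(s_{t+1}) ||^2
    return 0.5 * nu * np.sum((phi_next_pred - phi_next) ** 2)
```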

You must specify the parameters for the algorithm in the corresponding app.yaml file to run:

algorithm:
    name: da3c                  # name of the algorithm to load

input:
    shape: [42, 42]             # shape of the incoming state from an environment
    history: 4                  # number of consecutive states to stack for input
    use_convolutions: true      # set to True to process input by convolution layers

output:
    continuous: false           # set to True to use continuous Actor
    action_size: 18             # action size for the given environment
    scale: 2.5                  # multiplier to scale symmetrically continuous action
    action_low: [-3]            # lower bound (or list of values) to clip continuous action
    action_high: [2]            # upper bound (or list of values) to clip continuous action

batch_size: 5                   # t_max for batch collection step size
hidden_sizes: [256]             # list to define layers sizes after convolutions

use_icm: true                   # set to True to use ICM module
gae_lambda: 1.00                # discount lambda for generalized advantage estimation

use_lstm: true                  # set to True to use LSTM instead of Fully-Connected layers
max_global_step: 1e8            # maximum number of global steps to pass through the training

optimizer: Adam
initial_learning_rate: 1e-4     # initial learning rate, which is linearly annealed through training

RMSProp:                        # use only for RMSProp optimizer if you don't use Adam
    decay: 0.99
    epsilon: 0.1

entropy_beta: 0.01              # entropy regularization constant
entropy_type: Origin            # choice between the Normal (`Gauss`) and `Origin` A3C entropy
rewards_gamma: 0.99             # rewards discount factor
gradients_norm_clipping: 40.    # value for gradients norm clipping
policy_clip: false              # false or value (ex.: 5.0) to clip policy loss within range [-value, +value]
critic_clip: 2.0                # false or value (ex.: 2.0) to clip value loss within range [-value, +value]

icm:                            # ICM relevant parameters
    nu: 0.01                    # prediction bonus multiplier for intrinsic reward
    beta: 0.2                   # forward loss importance against inverse model
    lr: 1e-3                    # ICM learning rate

Parameters that make no sense for the current setup may be omitted (defaults are used for some of them).
It can also be helpful to use a notation that distinguishes different versions of DA3C.
Therefore DA3C-LSTM refers to the architecture with LSTM layers and DA3C-FF to the feedforward one.
Discrete DA3C-FF-ICM-16 denotes a feedforward architecture with a discrete actor
and a curiosity model (ICM), trained with 16 Agents.

DA3C Graph sample from Tensorboard

img

Performance of Vanilla A3C on classic Atari environments from the original paper (1 day = 80 million steps)

DA3C-LSTM-8 with Universe A3C architecture on Gym's Atari Pong (see universe-starter-agent result to compare): img
DA3C-FF-8 with Vanilla A3C architecture on Gym's Atari Boxing: img

Continuous DA3C-LSTM on BipedalWalker: img

| Agent Node | PS Node    | Number of clients | Performance       |
|------------|------------|-------------------|-------------------|
| m4.xlarge  | m4.xlarge  | 32                | 99 steps per sec  |
| m4.xlarge  | m4.xlarge  | 48                | 167 steps per sec |
| m4.xlarge  | m4.xlarge  | 64                | 171 steps per sec |
| c4.xlarge  | c4.xlarge  | 48                | 169 steps per sec |
| c4.xlarge  | c4.xlarge  | 64                | 207 steps per sec |
| c4.xlarge  | m4.xlarge  | 64                | 170 steps per sec |
| c4.xlarge  | m4.xlarge  | 96                | 167 steps per sec |
| c4.xlarge  | m4.xlarge  | 128               | 177 steps per sec |
| c4.2xlarge | c4.2xlarge | 232               | 232 steps per sec |
| c4.2xlarge | c4.2xlarge | 271               | 271 steps per sec |


This is the distributed version of the TRPO-GAE algorithm, which can cope with both continuous & discrete action spaces.

Inspired by original papers:

The main pipeline of the algorithm is similar to the original sources, but the collection of trajectories is performed independently by parallel agents. Each agent holds a copy of the policy neural network to roll out trajectories from its client. The parameter server is blocked for updates while the batch is being collected, then the update is applied and the procedure repeats. A rough sketch is shown below.
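A schematic sketch of this collection/update cycle; `ps`, `agents` and their methods are illustrative names only:

```python
def collect_and_update(ps, agents, batch_size=10000, trajectory_length=1600):
    while True:
        batch = []
        policy_weights = ps.get_policy_weights()
        while sum(len(t) for t in batch) < batch_size:
            for agent in agents:                       # runs in parallel in practice
                agent.set_weights(policy_weights)      # each agent uses a copy of the policy
                batch.append(agent.rollout(trajectory_length))
        ps.update(batch)                               # TRPO-GAE update on the server
```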

batch_size == 10,000, trajectory_length == 1600, parallel_agents == 8 img

Inspired by original paper - Continuous control with deep reinforcement learning from DeepMind

It is an actor-critic, model-free, deep deterministic policy gradient (DDPG) algorithm for continuous action spaces,
obtained by extending DQN and DPG. With an actor-critic setup as in DPG, DDPG avoids optimizing over the action at every time step
to obtain a greedy policy as in Q-learning, which would be infeasible in complex action spaces with large,
unconstrained function approximators like deep neural networks. To make learning stable and robust,
DDPG, similar to DQN, deploys experience replay and an idea similar to the target network: "soft" targets, which, rather than copying the weights directly as in DQN, update the target network weights slowly to track the learned network weights: theta_target <- tau * theta + (1 - tau) * theta_target, with tau << 1. The authors adopted batch normalization to handle the issue that different components of the observation have different physical units.
As an off-policy algorithm, DDPG learns an actor policy from experience generated by an exploration policy
formed by adding noise sampled from a noise process to the actor policy.

Here is the original pseudo-code for DDPG: img

img

Environment (Client) - each client connects to a particular Agent (Learner).

The main role of any client is to feed data to an Agent by transferring state, reward and terminal signals (for episodic tasks, if the episode ends). The client updates these signals at each time step: it receives the action signal from the Agent and then sends the updated values back.

  • Process State: each state can be passed through a filtering procedure before transferring (if you define one). It could be a color, edge or blob transformation (for image input) or a more complex pyramidal, Kalman or spline filter.

Agent (Parallel Learner) - each Agent connects to the Parameter Server.

The main role of any agent is to perform the main training loop. The Agent synchronizes its neural network weights with the global networks by copying the latest weights at the beginning of each update procedure. The Agent executes 1 step (TBD: N steps) of receiving the Client's signals and sending actions back, then stores the interaction tuple state | action | reward | next_state in the Replay Buffer; these N steps are similar to batch collection. Once the Replay Buffer has enough samples to retrieve a batch of the size defined by the config parameter batch_size, the Agent computes the loss (w.r.t. the sampled data) and passes it to the Optimizer. It uses an Adam optimizer (by default), which computes gradients and sends them to the Parameter Server to update its neural network weights. All Agents work completely independently in an asynchronous way and can update or receive the global network weights at any time.

  • Agent's Neural Networks: 4 neural networks -> Actor & Actor Target, Critic & Critic Target

    • Input: input with a shape suitable to pass through 2D convolutions or to fully connected layers.
    • Convolution Layers: defined by the relevant dictionary, or the default is used.
    • Fully connected Layers: a set of layers defined by the parameter hidden_sizes with ReLU activation (by default).
    • Actor: fully connected layer with the number of units equal to action_size and no activation (by default).
      It outputs a 1-D array representing the continuous action for the given state.
    • Critic: fully connected layer with 1 unit (by default).
      It outputs a 0-D array (scalar) representing the value of the state (the expected return from this point).
  • Critic Loss: it computes the loss for the Critic neural network with the Target networks as baselines.
    img, where N is the batch_size and img.

  • Actor Loss: it computes the loss for the Actor neural network wrt the Critic network.
    img, where N is the batch_size.

  • Compute Gradients: it computes the gradients of the relevant loss wrt the neural network weights.
    Gradients are computed only for the Actor & Critic networks, not for the Target ones.

  • Synchronize Weights: the agent synchronizes its weights with the global neural networks by copying
    the global ones over its own at the beginning of each batch collection step.
    A new step will not start until the weights are updated.

  • Define Action: it takes the action produced by the Actor network.
    Some noise is added to the action; the noise is either annealed throughout training
    or (recommended) generated by an Ornstein-Uhlenbeck process, enabled by setting the config parameter ou_noise to True:
    img, where Nt is the noise process.

  • Noise Process: it generates noise to add to the action (see the sketch after this list).

  • Replay Buffer: it holds tuples state | action | reward | next_state
    in a cyclic buffer with the size defined by the parameter buffer_size.
    Samples are retrieved from this buffer to perform an update,
    with the number of samples defined by the parameter batch_size.

  • Signal Filtering: it uses a running estimate of the mean and variance over a stream of data.
    Inspired by this source. It allows filtering of both states and rewards (by default it is used only for states).
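A minimal Ornstein-Uhlenbeck noise process sketch (a common implementation, not necessarily the repository's); parameter names follow the config (ou_mu, ou_theta, ou_sigma):

```python
import numpy as np

class OUNoise:
    def __init__(self, action_size, ou_mu=0.0, ou_theta=0.15, ou_sigma=0.20):
        self.mu, self.theta, self.sigma = ou_mu, ou_theta, ou_sigma
        self.state = np.ones(action_size) * self.mu

    def sample(self):
        # dx = theta * (mu - x) + sigma * N(0, 1)
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(*self.state.shape)
        self.state += dx
        return self.state
```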

Parameter Server (Global) - one for the whole algorithm (training process).

The main role of the Parameter Server is to synchronize neural network weights between Agents.
It holds the shared (global) neural network weights, which are updated by the Agents' gradients,
and sends a current copy of its weights back to the Agents for synchronization.

  • Global Neural Networks: their weights have the same layout as the Agent's networks.

  • Adam Optimizers: it holds two Adam optimizers, for the Actor and Critic neural networks.
    The optimizer state is global for all Agents and is used to apply their gradients.
    It applies gradients only to the Actor & Critic networks, not to the Target ones.
    Then a soft update of the Target networks is performed wrt the parameter tau:
    img
    img
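A sketch of this soft target update, assuming the weights are plain numpy arrays: theta_target <- tau * theta + (1 - tau) * theta_target.

```python
def soft_update(target_weights, source_weights, tau=0.001):
    # blend the learned weights into the target weights at rate tau
    return [tau * w + (1.0 - tau) * w_t
            for w, w_t in zip(source_weights, target_weights)]
```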

You must specify the parameters for the algorithm in the corresponding app.yaml file to run:

algorithm:
    name: ddpg                  # name of the algorithm to load

input:
    shape: [3]                  # shape of the incoming state from an environment
    history: 1                  # number of consecutive states to stack for input
    use_convolutions: false     # set to True to process input by convolution layers

output:
    action_size: 1              # action size for the given environment
    scale: 2.0                  # multiplier to scale symmetrically continuous action

hidden_sizes: [400, 300]        # list of dense layers sizes, for ex. [128, 64]
batch_size: 64                  # batch size, which needs for one network update
buffer_size: 10000              # local buffer size to sample experience (400k-1m)
rewards_gamma: 0.99             # rewards discount factor

actor_learning_rate: 0.0001     # actor learning rate
critic_learning_rate: 0.001     # critic learning rate
tau: 0.001                      # rate of target updates

l2: true                        # set to True to add l2 regularization loss for the Critic
l2_decay: 0.01                  # regularization constant multiplier for l2 loss for Critic
ou_noise: true                  # set to True to use Ornstein-Uhlenbeck process for the noise

exploration:                    # exploration parameters wrt Ornstein-Uhlenbeck process
    ou_mu: 0.0
    ou_theta: 0.15
    ou_sigma: 0.20
    tau: 25

log_lvl: INFO                   # additional metrics output wrt levels: INFO | DEBUG | VERBOSE
no_ps: false                    # set to True to perform training without parameter server

Parameters that make no sense for the current setup may be omitted (defaults are used for some of them).

Distributed DDPG with 4 Agents on classic Pendulum continuous control task: img

It is a classical method based on the REINFORCE rule (Williams, 1992).

Policy Gradient (or PG) maintains a policy img and, similar to DA3C, it is updated with n-step returns in the forward view, after every tmax actions or upon reaching a terminal state, similar to using minibatches.

It is updated in the direction of: img,
where img with k bounded above by tmax. A small sketch of this computation is shown below.
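A small sketch of the discounted return used above and the resulting REINFORCE-style policy loss (illustrative, not the exact code):

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99, bootstrap=0.0):
    # R_t = r_t + gamma * R_{t+1}, computed backwards over the batch
    returns, r = [], bootstrap
    for reward in reversed(rewards):
        r = reward + gamma * r
        returns.append(r)
    return np.asarray(returns[::-1])

def policy_loss(log_probs_of_taken_actions, returns):
    # maximizing sum(log pi(a_t|s_t) * R_t) == minimizing its negative
    return -np.sum(log_probs_of_taken_actions * returns)
```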

The principal architecture is similar to DA3C, except that it works
with only one Policy neural network and always uses a Discrete actor.

Environment (Client) - each client connects to a particular Agent (Learner).

The main role of any client is to feed data to an Agent by transferring state, reward and terminal signals (for episodic tasks, if the episode ends). The client updates these signals at each time step: it receives the action signal from the Agent and then sends the updated values back.

  • Process State: each state can be passed through a filtering procedure before transferring (if you define one). It could be a color, edge or blob transformation (for image input) or a more complex pyramidal, Kalman or spline filter.

Agent (Parallel Learner) - each Agent connects to the Parameter Server.

The main role of any agent is to perform the main training loop. The Agent synchronizes its neural network weights with the global network by copying the latest weights at the beginning of each training mini-loop. The Agent executes N steps of receiving the Client's signals and sending actions back; these N steps are similar to batch collection. Once the batch is collected, the Agent computes the loss (w.r.t. the collected data) and passes it to the Optimizer. It uses an Adam optimizer, which computes gradients and sends them to the Parameter Server to update its neural network weights. All Agents work completely independently in an asynchronous way and can update or receive the global network weights at any time.

  • Agent's Neural Network:

    • Input: input with a shape suitable to pass through 2D convolutions or to fully connected layers.
    • Convolution Layers: defined by the relevant dictionary, or the default is used.
    • Fully connected Layers: a set of layers defined by the parameter hidden_sizes with ReLU activation (by default).
    • Actor: fully connected layer with the number of units equal to action_size and Softmax activation (by default).
      It outputs a 1-D array with a probability distribution over all possible actions for the given state.
  • Policy Loss:
    img, where img with k bounded above by tmax.

  • Compute Gradients: it computes the gradients of the policy loss wrt the neural network weights.

  • Synchronize Weights: the agent synchronizes its weights with the global neural network by copying
    the global ones over its own at the beginning of each batch collection step (1..tmax or terminal).
    A new step will not start until the weights are updated.

  • Softmax Action: it uses a Boltzmann distribution to select an action, i.e. actions with higher
    probability are chosen more often; we call this Softmax action for simplicity.
    This method has some benefits over the classical e-greedy strategy and helps to avoid the
    "path along the cliff" problem. Furthermore, it encourages more exploration at the beginning of training.
    The Agent becomes more confident in some actions as training proceeds, and the probability distribution
    over actions becomes more peaked.

Parameter Server (Global) - one for the whole algorithm (training process).

The main role of the Parameter Server is to synchronize neural network weights between Agents.
It holds the shared (global) neural network weights, which are updated by the Agents' gradients,
and sends a current copy of its weights back to the Agents for synchronization.

  • Global Neural Network: its weights have the same layout as the Agent's network.

  • Adam Optimizer: it holds the optimizer's state.
    There is one optimizer for all Agents, and it is used to apply their gradients.

You must specify the parameters for the algorithm in the corresponding app.yaml file to run:

algorithm:
    name: policy_gradient       # name of the algorithm to load

input:
    shape: [4]                  # shape of the incoming state from an environment
    history: 1                  # number of consecutive states to stack for input
    use_convolutions: false     # set to True to process input by convolution layers

output:
    action_size: 2              # action size for the given environment

hidden_sizes: [10]              # list to define layers sizes after convolutions
batch_size: 200                 # t_max for batch collection step size
learning_rate: 0.01             # learning rate for the optimizer
GAMMA: 0.99                     # rewards discount factor

Distributed Policy Gradient with 4 Agents on classic CartPole task: img


Inspired by original paper - Massively Parallel Methods for Deep Reinforcement Learning from DeepMind

It is one of the first applications of deep learning models to reinforcement learning. It uses a deep neural network to approximate the future reward, trained with a variant of the Q-learning algorithm using stochastic gradient descent to update the weights. To alleviate the problems of correlated data and non-stationary distributions, an experience replay mechanism is used. It is implemented as a memory buffer that stores only the N most recent samples. DQN is model-free, i.e. it solves the reinforcement learning task directly from samples of the emulator. It is also off-policy: it learns about the greedy strategy a = argmax_a Q(s, a), while following a behaviour distribution that ensures adequate exploration of the state space, implemented as an eps-greedy strategy. In order to improve the stability of the training process and prevent divergence, a separate (target) Q-network is introduced. It is used to generate the targets in the Q-learning update and is synchronized with the online Q-network every update_target_weights_interval updates.

Here is the original pseudo-code for DQN: img

Distributed DQN Architecture

img

Environment (Client) - each client connects to a particular Agent (Learner).

The main role of any client is feeding data to an Agent by transferring: state, reward and terminal signals (for episodic tasks if episode ends). Client updates these signals at each time step by receiving the action signal from an Agent and then sends updated values back.

Process State: each state can be passed through a filtering procedure before transferring (if you define one). It could be a color, edge or blob transformation (for image input) or a more complex pyramidal, Kalman or spline filter.

Agent (Parallel Learner) - each Agent connects to the Parameter Server.

The main role of any agent is to perform the main training loop. The Agent synchronizes its neural network weights with the global network by copying the latest weights at the beginning of each update procedure. The Agent executes 1 step (TBD: N steps) of receiving the Client's signals and sending actions back, then stores the interaction tuple (state, action, reward, next_state, terminal) in the Replay Buffer; these N steps are similar to batch collection. Once the Replay Buffer has enough samples to retrieve a batch of the size defined by the config parameter batch_size, the Agent computes the loss (w.r.t. the sampled data) and passes it to the Optimizer. It uses an Adam optimizer (by default), which computes gradients and sends them to the Parameter Server to update its neural network weights. All Agents work completely independently in an asynchronous way and can update or receive the global network weights at any time.

Agent's Neural Networks: 2 identical neural networks named Network and Target Network, with weights theta and theta_target respectively.

  • Input: input with a shape suitable to pass through 2D convolutions or to fully connected layers.
  • Convolution Layers: defined by the relevant dictionary, or the default is used.
  • Fully connected Layers: a set of layers defined by the parameter hidden_sizes with ReLU activation (by default).
  • Network: fully connected layer with the number of units equal to action_size and no activation (by default). It outputs a 1-D array, which represents the Q values over all possible actions for the given state.
    • DuelingDQN Network: fully connected layer with 2 outputs, for the advantage and value functions, of size action_size and 1 respectively. The resulting output is Q(s, a) = V(s) + A(s, a), where the addition broadcasts V(s) action_size times.
  • Loss: let y = r if next_state is terminal, and y = r + rewards_gamma * max_a' Q_target(next_state, a') otherwise. Then the loss is the squared error (y - Q(state, action)) accumulated over the batch (a sketch of the target computation is shown after this list).
    • DoubleDQN Loss: it remains the same, but y = r + rewards_gamma * Q_target(next_state, argmax_a' Q(next_state, a')).
  • Compute Gradients: it computes the gradients of the relevant loss wrt the neural network weights. Gradients are computed for the Network only, not for the Target Network.
  • Define Action: it chooses an action according to an eps-greedy policy, i.e. a random action with probability eps, or the action with the maximum Q value otherwise. Each agent starts with eps equal to the eps.initial value, which linearly decreases to the eps.end value over eps.decay_steps training steps. If eps.stochastic is True, eps.decay_steps should be a pair [eps_min, eps_max] instead, and the actual number of steps is chosen randomly from that range.
  • Replay Buffer: it holds tuples (state, action, reward, next_state, terminal) in a cyclic buffer with the size defined by the parameter max_len. Samples are randomly retrieved from this buffer to perform an update, with the number of samples defined by the parameter batch_size. The choice of samples can be prioritized towards more recent ones by increasing the alpha parameter.
  • Adam Optimizer: it holds the Adam optimizer for the Network. Each Agent has its own optimizer, which is used to calculate gradients.
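A sketch of the target computations named in the Loss items above, assuming `q_next` and `q_next_online` are [batch, action_size] numpy arrays produced by the Target and online Networks respectively:

```python
import numpy as np

def dqn_targets(rewards, terminals, q_next, gamma=1.0):
    # y = r for terminal transitions, r + gamma * max_a' Q_target(s', a') otherwise
    return rewards + gamma * (1.0 - terminals) * q_next.max(axis=1)

def double_dqn_targets(rewards, terminals, q_next, q_next_online, gamma=1.0):
    best_actions = q_next_online.argmax(axis=1)                 # argmax by the online network
    chosen = q_next[np.arange(len(best_actions)), best_actions] # evaluated by the target network
    return rewards + gamma * (1.0 - terminals) * chosen
```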

Parameter Server (Global) - one for the whole algorithm (training process).

The main role of the Parameter Server is to synchronize neural network weights between Agents. It holds the shared (global) neural network weights, which are updated by the Agents, and sends a current copy of its weights back to the Agents for synchronization.

  • Global Neural Network: a single neural network identical to the Agent's Network, with weights theta.
  • Adam Optimizer: it holds the optimizer's state. There is one optimizer for all Agents, and it is used to apply their gradients.

Distributed DQN Config:

You must specify the parameters for the algorithm in the corresponding app.yaml file to run:

algorithm:
  name: dqn                             # short name for algorithm to load

  input:
    shape: [4]                          # shape of input state
    history: 1                          # number of consecutive states to stack
    use_convolutions: false             # set to true to process input by convolution layers

  output:
    action_size: 2                      # action size for the given environment

  double_dqn: true                      # use DoubleDQN if true
  dueling_dqn: true                     # use DuelingDQN if true

  hidden_sizes: [64]                    # list of dense layers sizes, for ex. [128, 64]
  batch_size: 32                        # maximum batch size to accumulate for one update

  max_global_step: 150000               # maximum number of global steps to pass through the training
  start_sample_step: 2000               # amount of steps before start training local Q-network
  update_weights_interval: 100          # interval for receiving target Q-network weights from GlobalServer
  update_target_weights_interval: 2000  # approximate number of steps for updating target Q-network on Agent
  update_target_weights_min_steps: 500  # minimum number of steps for updating target Q-network after last update

  rewards_gamma: 1.0                    # rewards discount factor

  initial_learning_rate: 5e-4           # initial learning rate, which can be annealed by some procedure
  gradients_norm_clipping: false        # gradients clipping by global norm, if false then it is ignored
  optimizer: Adam                       # name of optimizer to use within training

  replay_buffer_size: 50000             # maximum number of samples in replay buffer
  alpha: 1.0                            # prioritization exponent. Larger values lead to more prioritization.

  eps:
    initial: 1.0                        # initial value for epsilon
    end: 0.02                           # end value for epsilon
    stochastic: False                   # use stochastic number of epsilon decay steps if true
    decay_steps: 10000                  # number of decay steps or decay steps range if stochastic == true

These are other algorithms that we are working on and plan to make runnable on the RELAAX server: