why do we need x as argument of train() ? #16

etienne87 · 2017-02-28T10:54:11Z

In NetworkVP, ThreadTrainer, we see that we call train_model with x, r, a (states, rewards, actions)

Wouldn't it be simpler to just to maintain an history of p,v inside for each agent number inside NetworkVP & compute loss + backprop when rewards come back in the train function with agent indices ?

Thus we would avoid recomputing forward (already been done in predict), possibly even accelerating the whole process?

mbz · 2017-02-28T16:09:02Z

great observation! the main reason for recomputing forward pass is that there is a "lag" between prediction and training and the output of the network (and as a result gradients) might be (probably are if there is more than one agent) different in training. So, I don't think you can save any computation. However, if the problem is communication-bound (i.e. CPU-GPU connection is the bottleneck which is the case for Atari), then it is possible to optimize the algorithm by keeping X inside the GPU and then avoid sending it at training time.

…

On Tue, Feb 28, 2017 at 4:54 AM, etienne87 ***@***.***> wrote: In NetworkVP, ThreadTrainer, we see that we call train_model with x, r, a (states, rewards, actions) Wouldn't it be simpler to just to maintain an history of p,v inside for each agent number inside NetworkVP & compute loss + backprop when rewards come back in the train function with agent indices ? Thus we would avoid recomputing forward (already been done in predict), possibly even accelerating the whole process? — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub <#16>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABEfT-U-L1klVw1ipFZA_iZsVtvpEL_sks5rg_zTgaJpZM4MOP4D> .

etienne87 · 2017-02-28T18:53:59Z

@mbz Thanks a lot for taking the time to explain, makes more sense now!

I understand more now the issue with "old gradients", since network has been updated in the mean time. Do you have any reference proving this is critical ?

You are also right that most of the bottleneck is probably in the CPU-GPU connection for small models.

If I understand correctly, we should still maintain a db indexed by ProcessAgent ids inside NetworkVP right & progressively concatenate x's at each predict call inside its own batch maintained on GPU? (there would be NAgent batches maintained inside NetworkVP)

[EDIT] : I randomly stumbled upon this paper on drl for super smash brawl :

The many parallel agents periodically send their experiences
to a trainer, which maintains a circular queue of the
most recent experiences. With the help of a GPU, the trainer
continually performs (minibatched) stochastic gradient descent
on its set of experiences while periodically saving snapshots
of the neural network weights for the agents to load.
This asynchronous setup technically breaks the assumption
of the REINFORCE learning rule that the data is generated
from the current policy network (in reality the network has
since been updated by a few gradient steps), but in practice
this does not appear to be a problem, likely because the gradient
steps are sufficiently small to not change the policy significantly
in the time that an experience sits in the queue. The
upside is that no time is wasted waiting on the part of either
the agents or the trainer.

This looks pretty close to GA3C training configuration no?

mbz · 2017-03-01T18:35:12Z

Yes, this seems to be very related to the lag that I was referring to. You can also refer to Policy Lag section of our ICLR paper for more details on it.

Also yes, we need to keep the track of all the Xs and update their target values when we had them. Implementing this with TensorFlow might be a little tricky but I can help getting into more details if you are interested.

etienne87 · 2017-03-02T17:04:23Z

the new version of your paper is great & more detailed !

Stabilizing (by increasing batch size, thus reducing delay) does not seem to improve convergence, (Figure 8), or is it just because model is less frequently updated?

I am not sure to find in the paper if you tried to update the model with gradient which is not in sync? I mean without even recomputing forward (which makes things less delayed but delayed nonetheless If I understood correctly)

I am interested in helping. I will try small changes & point them to you as it goes.

ifrosio · 2017-03-02T17:54:08Z

Please have a look at the final version of the ICLR paper: here
Our latest experiments (after fixing issue 6) suggest that increasing the training batch size can indeed improve convergence - the model is updated less frequently, but the magnitude of each update is larger. This advantage tends to disappear if the training batch size becomes too large (probably because in this case training takes too much time and predictors stay idle).

etienne87 · 2017-03-05T19:09:19Z

Thanks the precision @ifrosio . Just to add some results confirming what you say, I ran a small experiment in my branch "origin/gae", where I test 4 different configs on CartPole-v0 on 2 paramters : batchsize & advantage formula (R-V or Generalized Advantage Estimation).
(I tested on CartPole in order to get very quick results, I think it is probably a good solution when testing new developments)

The results suggests Batch Size can have a huge impact on the stability of the algorithm.

(Sorry for the time axis in hours, I should have kept the number of steps)

All other hyperparams :

# Input of the DNN
    STACKED_FRAMES = 4
    IMAGE_WIDTH = 1
    IMAGE_HEIGHT = 4

    # Total number of episodes and annealing frequency
    EPISODES = 4000
    ANNEALING_EPISODE_COUNT = 4000

    # Entropy regualrization hyper-parameter
    BETA_START = 0.001
    BETA_END = 0.001
   # Learning rate
    LEARNING_RATE_START = 0.0004
    LEARNING_RATE_END = 0.0004

    # RMSProp parameters
    RMSPROP_DECAY = 0.99
    RMSPROP_MOMENTUM = 0.0
    RMSPROP_EPSILON = 0.1

    # Dual RMSProp - we found that using a single RMSProp for the two cost function works better and faster
    DUAL_RMSPROP = False
    
    # Gradient clipping
    USE_GRAD_CLIP = False
    GRAD_CLIP_NORM = 40.0 
    # Epsilon (regularize policy lag in GA3C)
    LOG_EPSILON = 1e-1

etienne87 · 2017-03-10T17:08:37Z

tried to apply loss in chainer variant . (By the way TF looks wayy faster (at least 3 times) than Chainer or Pytorch for the same problem). the "train_offpolicy" function did not succeed to make converge the basic CartPole problem. Closing issue as @mbz is right : the only thing possible is to keep x gpu tensors & concatenate them before training happen is the only credible acceleration. It also means LSTM will be at least 5x slower than FF one.

etienne87 mentioned this issue Mar 3, 2017

LSTM version #3

Open

etienne87 closed this as completed Mar 10, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

why do we need x as argument of train() ? #16

why do we need x as argument of train() ? #16

etienne87 commented Feb 28, 2017

mbz commented Feb 28, 2017 via email •

edited

Loading

etienne87 commented Feb 28, 2017 •

edited

Loading

mbz commented Mar 1, 2017 •

edited

Loading

etienne87 commented Mar 2, 2017 •

edited

Loading

ifrosio commented Mar 2, 2017 •

edited

Loading

etienne87 commented Mar 5, 2017

etienne87 commented Mar 10, 2017

why do we need x as argument of train() ? #16

why do we need x as argument of train() ? #16

Comments

etienne87 commented Feb 28, 2017

mbz commented Feb 28, 2017 via email • edited Loading

etienne87 commented Feb 28, 2017 • edited Loading

mbz commented Mar 1, 2017 • edited Loading

etienne87 commented Mar 2, 2017 • edited Loading

ifrosio commented Mar 2, 2017 • edited Loading

etienne87 commented Mar 5, 2017

etienne87 commented Mar 10, 2017

mbz commented Feb 28, 2017 via email •

edited

Loading

etienne87 commented Feb 28, 2017 •

edited

Loading

mbz commented Mar 1, 2017 •

edited

Loading

etienne87 commented Mar 2, 2017 •

edited

Loading

ifrosio commented Mar 2, 2017 •

edited

Loading