
why do we need x as an argument of train()? #16

Closed
etienne87 opened this issue Feb 28, 2017 · 7 comments

@etienne87

In NetworkVP and ThreadTrainer, we see that train_model is called with x, r, a (states, rewards, actions).

Wouldn't it be simpler to maintain a history of p, v for each agent id inside NetworkVP, and compute the loss + backprop in the train function when the rewards come back with the agent indices?

This way we would avoid recomputing the forward pass (already done in predict), possibly even accelerating the whole process?
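
Roughly what I have in mind, as a hypothetical sketch (not the actual GA3C code; the class and method names are made up, and p, v would need to stay attached to the computation graph for the backward pass to work):

    import math

    # Hypothetical sketch of the proposal above, not GA3C code.
    # predict() stores its outputs per agent; train() builds the loss from the
    # cache instead of re-running the forward pass on x.
    class PredictionCache(object):
        def __init__(self):
            self.history = {}  # agent_id -> list of (p, v) from predict()

        def on_predict(self, agent_id, p, v):
            self.history.setdefault(agent_id, []).append((p, v))

        def loss_on_train(self, agent_id, rewards, actions):
            # consume the cached outputs when the rewards come back
            cached = self.history.pop(agent_id, [])
            total = 0.0
            for (p, v), r, a in zip(cached, rewards, actions):
                advantage = r - v
                total += -advantage * math.log(p[a] + 1e-10)  # policy term
                total += 0.5 * advantage ** 2                 # value term
            return total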

@mbz
Contributor

mbz commented Feb 28, 2017 via email

@etienne87
Author

etienne87 commented Feb 28, 2017

@mbz Thanks a lot for taking the time to explain; it makes more sense now!

I now understand the issue with "old gradients" better, since the network has been updated in the meantime. Do you have any reference showing this is critical?

You are also right that most of the bottleneck is probably in the CPU-GPU connection for small models.

If I understand correctly, we should still maintain a store indexed by ProcessAgent ids inside NetworkVP, and progressively concatenate the x's at each predict call into a per-agent batch kept on the GPU? (There would be NAgent batches maintained inside NetworkVP.)

[EDIT]: I randomly stumbled upon this paper on deep RL for Super Smash Bros.:

The many parallel agents periodically send their experiences
to a trainer, which maintains a circular queue of the
most recent experiences. With the help of a GPU, the trainer
continually performs (minibatched) stochastic gradient descent
on its set of experiences while periodically saving snapshots
of the neural network weights for the agents to load.
This asynchronous setup technically breaks the assumption
of the REINFORCE learning rule that the data is generated
from the current policy network (in reality the network has
since been updated by a few gradient steps), but in practice
this does not appear to be a problem, likely because the gradient
steps are sufficiently small to not change the policy significantly
in the time that an experience sits in the queue. The
upside is that no time is wasted waiting on the part of either
the agents or the trainer.

This looks pretty close to the GA3C training configuration, no?
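
For concreteness, I imagine their setup as something like this (a hypothetical sketch based only on the quoted paragraph, not their code):

    import random
    from collections import deque

    # Hypothetical sketch of the quoted setup, not the paper's code.
    QUEUE_SIZE = 100000   # capacity of the circular experience queue
    BATCH_SIZE = 128

    experience_queue = deque(maxlen=QUEUE_SIZE)  # oldest experiences drop out

    def agent_push(experience):
        # called asynchronously by the parallel agents
        experience_queue.append(experience)

    def trainer_step(train_on_batch, save_snapshot, step):
        # the trainer loops over this without waiting on the agents
        if len(experience_queue) >= BATCH_SIZE:
            batch = random.sample(list(experience_queue), BATCH_SIZE)
            train_on_batch(batch)
        if step % 1000 == 0:
            save_snapshot()  # agents periodically reload these weights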

@mbz
Contributor

mbz commented Mar 1, 2017

Yes, this seems to be very related to the lag that I was referring to. You can also refer to the Policy Lag section of our ICLR paper for more details on it.

Also yes, we need to keep track of all the x's and update their target values when we have them. Implementing this with TensorFlow might be a little tricky, but I can help with more details if you are interested.
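
As a starting point, something along these lines might work (a rough, untested sketch in TF 1.x style; the shapes, names, and buffer size are placeholders, not the current code):

    import tensorflow as tf

    # Rough, untested sketch; shapes and names are placeholders.
    # Idea: keep the x's in a GPU-resident variable at predict() time and
    # gather them back at train() time, so each state only crosses the
    # CPU->GPU boundary once.
    STATE_SHAPE = [84, 84, 4]
    BUFFER_SIZE = 1024

    state_buffer = tf.Variable(
        tf.zeros([BUFFER_SIZE] + STATE_SHAPE), trainable=False, name='x_cache')

    write_idx = tf.placeholder(tf.int32, [None], name='write_idx')
    new_x = tf.placeholder(tf.float32, [None] + STATE_SHAPE, name='new_x')
    cache_op = tf.scatter_update(state_buffer, write_idx, new_x)  # run at predict()

    read_idx = tf.placeholder(tf.int32, [None], name='read_idx')
    cached_x = tf.gather(state_buffer, read_idx)  # feed into the training graph

The tricky part is the per-agent bookkeeping of indices, so that train() gathers exactly the rows whose target values have arrived.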

@etienne87
Author

etienne87 commented Mar 2, 2017

The new version of your paper is great and more detailed!

Stabilizing (by increasing the batch size, thus reducing the delay) does not seem to improve convergence (Figure 8), or is it just because the model is updated less frequently?

I could not find in the paper whether you tried to update the model with gradients that are out of sync, i.e. without even recomputing the forward pass (which makes things less delayed, but delayed nonetheless, if I understood correctly).

I am interested in helping. I will try small changes and point them out to you as I go.

@ifrosio
Collaborator

ifrosio commented Mar 2, 2017

Please have a look at the final version of the ICLR paper: here
Our latest experiments (after fixing issue 6) suggest that increasing the training batch size can indeed improve convergence - the model is updated less frequently, but the magnitude of each update is larger. This advantage tends to disappear if the training batch size becomes too large (probably because in this case training takes too much time and predictors stay idle).

@etienne87 etienne87 mentioned this issue Mar 3, 2017
@etienne87
Author

Thanks for the precision @ifrosio. Just to add some results confirming what you say: I ran a small experiment in my branch "origin/gae", where I test 4 different configs on CartPole-v0 over 2 parameters: batch size and advantage formula (R-V or Generalized Advantage Estimation).
(I tested on CartPole in order to get very quick results; I think it is probably a good setup for testing new developments.)
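
For reference, the two advantage estimators I compared look roughly like this (a sketch of the standard formulas, not the exact code in the branch; the gamma and lambda values are just examples):

    import numpy as np

    # Sketch of the two advantage variants compared above.
    def returns_minus_value(rewards, values, gamma=0.99):
        # "R - V": discounted return minus the value estimate
        R = 0.0
        adv = np.zeros(len(rewards))
        for t in reversed(range(len(rewards))):
            R = rewards[t] + gamma * R
            adv[t] = R - values[t]
        return adv

    def gae(rewards, values, gamma=0.99, lam=0.95):
        # Generalized Advantage Estimation; values has one bootstrap entry at the end
        adv = np.zeros(len(rewards))
        running = 0.0
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * values[t + 1] - values[t]
            running = delta + gamma * lam * running
            adv[t] = running
        return adv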

The results suggest the batch size can have a huge impact on the stability of the algorithm.

results txt

(Sorry for the time axis in hours, I should have kept the number of steps)

All other hyperparams:

    # Input of the DNN
    STACKED_FRAMES = 4
    IMAGE_WIDTH = 1
    IMAGE_HEIGHT = 4

    # Total number of episodes and annealing frequency
    EPISODES = 4000
    ANNEALING_EPISODE_COUNT = 4000

    # Entropy regularization hyper-parameter
    BETA_START = 0.001
    BETA_END = 0.001

    # Learning rate
    LEARNING_RATE_START = 0.0004
    LEARNING_RATE_END = 0.0004

    # RMSProp parameters
    RMSPROP_DECAY = 0.99
    RMSPROP_MOMENTUM = 0.0
    RMSPROP_EPSILON = 0.1

    # Dual RMSProp - we found that using a single RMSProp for the two cost functions works better and faster
    DUAL_RMSPROP = False

    # Gradient clipping
    USE_GRAD_CLIP = False
    GRAD_CLIP_NORM = 40.0

    # Epsilon (regularize policy lag in GA3C)
    LOG_EPSILON = 1e-1

@etienne87
Author

I tried to apply the loss in a Chainer variant. (By the way, TF looks way faster, at least 3x, than Chainer or PyTorch for the same problem.) The "train_offpolicy" function did not manage to make the basic CartPole problem converge. Closing the issue, as @mbz is right: keeping the x GPU tensors and concatenating them before training happens is the only credible acceleration. It also means an LSTM will be at least 5x slower than a feed-forward one.
