Dueling DQN implementation possibly wrong? #52
Comments
Hi! Thanks! You're right, we need to subtract the mean over actions, not over the full batch. Will fix.
Hi NikEyX, Shmuma, well spotted! This might have implications for convergence, I'd guess. Is the correct syntax (for the master branch of the code) `return val + adv - adv.mean(dim=1,keepdim=True).unsqueeze(1)`? Rgds, Dom
Hey Dom, the point of the forward call is to give you back BATCH_SIZE x Q-VALUES, e.g. for this script it should yield torch.Size([32, 1]), since you get one Q-value for each of the items in the batch. (Keep in mind Q-value = State_Value + Advantage_Value.) The advantage only ever affects a given state, so it needs to have the same 32x1 size; you can verify that both tensors do.
Keep in mind that the advantage operator isn't that useful in Pong and likely prolongs training a bit in this example. It's much more useful for other environments, especially where state values get very large. Compare a situation where one state gives you a Q-value of 9999 for one action and a Q-value of 10001 for the other. Objectively this reads as "both actions are approximately equal", so it shouldn't matter much which one you take (especially if you use Boltzmann exploration), but with the advantage decomposition the difference between -1 and 1 is suddenly quite large (assuming that the state value is 10000).
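To make the per-state subtraction concrete, here is a minimal sketch of a dueling head in PyTorch. The class and layer sizes are illustrative, not the book's actual network; the key line is the combine step, which averages the advantages over the action dimension (`dim=1`) rather than over the whole batch:

```python
import torch
import torch.nn as nn


class DuelingHead(nn.Module):
    """Sketch of a dueling combine step: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)).
    Layer sizes are hypothetical, chosen only for illustration."""

    def __init__(self, in_features: int, n_actions: int):
        super().__init__()
        self.val = nn.Linear(in_features, 1)          # state value V(s):      [B, 1]
        self.adv = nn.Linear(in_features, n_actions)  # advantages A(s, a):    [B, n_actions]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        val = self.val(x)
        adv = self.adv(x)
        # Subtract the per-state mean over actions (dim=1), not over the batch;
        # keepdim=True keeps the [B, 1] shape so broadcasting works.
        return val + adv - adv.mean(dim=1, keepdim=True)


head = DuelingHead(in_features=64, n_actions=6)
x = torch.randn(32, 64)
q = head(x)
print(q.shape)  # torch.Size([32, 6])
```

A useful sanity check: with the per-state mean subtracted, averaging Q over actions recovers V(s) for each batch element, which is exactly the identifiability constraint formula (9) of the paper imposes.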
Thanks NikEyX, crystal clear.
See also #6 |
Hi guys! Sorry for the long wait on those bugs. I've finally started working on the 2nd edition (whee!) and am going to clean up all the mess in this codebase and address all the issues. Thanks for the patience! :)
If you're working on a second edition, can I suggest using LZ4 instead of LazyFrames? It compresses significantly better at similar speed and can then even be used for distribution across servers via pyarrow/ray.
Haven't got the point. By LZ4 do you mean the generic compression algorithm used in BigData applications for fast compress/decompress? It might be useful to decrease the memory footprint of large replay buffers, but it shouldn't be used instead of LazyFrames. LazyFrames are used to avoid memory copies on frame stacking, so the two could be used together, but the size of the replay buffer is not an issue in any of the book's examples. I think organizing the replay buffer as a hash table might be even more beneficial, as lots of frames are the same, but I haven't tried this idea. Might be cool to try CuLE plus a replay buffer in GPU memory to have DQN run completely inside the GPU.
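For readers who haven't seen the wrapper being discussed: the copy-avoidance idea behind LazyFrames can be sketched in a few lines. This is an illustrative simplification, not the actual Gym/baselines implementation; the point is that consecutive stacked observations share references to the same per-step frames and only materialize a contiguous array on demand:

```python
import numpy as np


class LazyFrames:
    """Sketch of the LazyFrames idea: overlapping observations hold
    references to the same underlying frames, and the stacked array
    is only built (i.e., copied) when the observation is actually used."""

    def __init__(self, frames):
        self._frames = frames  # list of per-step frames, shared across observations

    def __array__(self, dtype=None):
        out = np.stack(self._frames, axis=0)  # materialize only on demand
        if dtype is not None:
            out = out.astype(dtype)
        return out


# Two consecutive 4-frame observations share 3 of their 4 frames,
# so a replay buffer holding both stores each frame only once.
frames = [np.zeros((84, 84), dtype=np.uint8) for _ in range(5)]
obs_t = LazyFrames(frames[0:4])
obs_t1 = LazyFrames(frames[1:5])
print(np.asarray(obs_t).shape)  # (4, 84, 84)
```

This is orthogonal to LZ4: LazyFrames removes duplicate storage, while LZ4 would shrink the bytes of each stored frame, so the two techniques can indeed be combined.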
Oh, well. What are the alternatives? The point of the book is to show how to implement methods from scratch, not how to use a particular RL toolkit. Ptan is introduced only in chapter 7 (almost half the book); before that, raw PyTorch is used. In fact, ptan is just small wrappers that help avoid writing experience sources and agent code over and over again, which I find the most annoying piece of RL methods to write. Writing everything from scratch might be an option, but I've already been told that too much code is given in the book :). Anyway, it could be a very useful exercise to reimplement everything without ptan, or to port it to some other RL toolkit. In the 2nd edition I'm going to use more PyTorch Ignite to reduce training-loop code, but ptan will remain, sorry :).
Hi Max,
this is the Dueling DQN implementation from DeepMind: https://arxiv.org/pdf/1511.06581.pdf
Formula 9 here shows that the advantage is centered using the mean over the actions of a given state. That's also in line with the usual definition of the advantage function, I believe.
In your implementation, however, you seem to subtract the mean of the advantages over all states:
whereas I believe this should be more correct:
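Since the original code snippets are not reproduced in this thread, here is a hedged sketch of the two variants being discussed (shapes are illustrative; `val` and `adv` stand for the value and advantage head outputs):

```python
import torch

batch_size, n_actions = 32, 6
val = torch.randn(batch_size, 1)          # V(s), one value per state
adv = torch.randn(batch_size, n_actions)  # A(s, a), one value per state-action

# Variant under discussion: adv.mean() is a single scalar averaged over
# the whole batch AND all actions, so every state is shifted identically.
q_batch_mean = val + adv - adv.mean()

# Formula (9) of the paper: subtract the mean over the actions of the
# SAME state (dim=1), keeping the dim so broadcasting against [B, n] works.
q_action_mean = val + adv - adv.mean(dim=1, keepdim=True)

# With the per-state mean, averaging Q over actions recovers V(s) exactly.
print(torch.allclose(q_action_mean.mean(dim=1, keepdim=True), val))  # True
```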
What do you think?