
Dueling DQN implementation possibly wrong? #52

Closed · NikEyX opened this issue Jun 3, 2019 · 8 comments

NikEyX commented Jun 3, 2019

Hi Max,

this is the Dueling DQN implementation from DeepMind: https://arxiv.org/pdf/1511.06581.pdf

Q(s, a; θ, α, β) = V(s; θ, β) + ( A(s, a; θ, α) − (1/|A|) · Σ_{a'} A(s, a'; θ, α) )    (9)

Formula (9) here shows that the advantage term subtracts the mean over the actions of a given state. That's also in line with the usual definition of the advantage operator, I believe.

In your implementation, however, you seem to subtract the mean of the advantages over the whole batch:

return val + adv - adv.mean() 

whereas I believe this would be more correct:

return val + adv - adv.mean(dim=1).unsqueeze(1)

What do you think?
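
For illustration, here's a minimal sketch of a dueling head with the per-state mean (the layer sizes and class name are made up for the example, not the network from the repo):

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Sketch of a dueling head with per-state mean subtraction."""
    def __init__(self, obs_size: int, n_actions: int):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(obs_size, 128), nn.ReLU())
        self.value = nn.Linear(128, 1)              # V(s): one scalar per state
        self.advantage = nn.Linear(128, n_actions)  # A(s, a): one value per action

    def forward(self, x):
        feat = self.feature(x)
        val = self.value(feat)        # shape: (batch, 1)
        adv = self.advantage(feat)    # shape: (batch, n_actions)
        # Subtract the mean advantage of each state (over dim=1, the actions),
        # not the mean over the whole batch, as in formula (9).
        return val + adv - adv.mean(dim=1, keepdim=True)
```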

Shmuma (Collaborator) commented Jun 3, 2019

Hi!

Thanks! You're right, we need to subtract the mean over actions, not over the full batch. Will fix.

domixit commented Jun 18, 2019

Hi NikEyX, Shmuma,

well spotted ...

this might have implications for convergence, I'd guess ...

would the correct syntax (for the master branch of the code) be

return val + adv - adv.mean(dim=1,keepdim=True).unsqueeze(1) ?

Rgds

Dom

NikEyX (Author) commented Jun 18, 2019

Hey Dom,

the point of the forward call is to give you back BATCH_SIZE x QVALUES, i.e. one Q-value per action for each item in the batch. (Keep in mind Q-value = State_Value + Advantage_Value.)

The mean of the advantages, on the other hand, only ever applies within a given state, so for this script it needs the shape torch.Size([32, 1]), one scalar per batch item, so that it broadcasts correctly. You can verify that both adv.mean(dim=1).unsqueeze(1) and adv.mean(dim=1, keepdim=True) give you that same shape. unsqueeze simply adds another dimension, and since keepdim=True already keeps it, you don't need the unsqueeze anymore; leave it out if you use keepdim.
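
A quick throwaway check (the 32 x 6 shape is just an example matching the batch size here):

```python
import torch

adv = torch.randn(32, 6)              # 32 states, 6 actions (illustrative)

a = adv.mean(dim=1).unsqueeze(1)      # mean over actions, then re-add the dim
b = adv.mean(dim=1, keepdim=True)     # same result in one step

print(a.shape, b.shape)               # torch.Size([32, 1]) torch.Size([32, 1])
print(torch.equal(a, b))              # True
print(adv.mean().shape)               # torch.Size([]) -- one scalar over the whole batch
```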

Keep in mind that the advantage operator isn't that useful in Pong and likely prolongs training a bit in this example. It's much more useful in other environments, especially where state values get very large. For example, compare a situation where one state gives you a Q-value of 9999 for one action and 10001 for the other. Objectively this reads as "both actions are approximately equal", so it shouldn't matter much which one you take (especially if you use Boltzmann exploration), but if you look at the advantages instead, the difference between -1 and +1 is suddenly quite large (assuming the state value is 10000).

domixit commented Jun 19, 2019

Thanks, NikEyX

crystal clear ...

Sycor4x (Contributor) commented Jul 6, 2019

See also #6

Shmuma (Collaborator) commented Jul 18, 2019

Hi guys!

Sorry for the long wait on these bugs. I've finally started working on the 2nd edition (whee!) and am going to clean up all the mess in this codebase and address all the issues.

Thanks for the patience! :)

NikEyX (Author) commented Jul 18, 2019

If you're working on a second edition, can I suggest using LZ4 instead of LazyFrames? It compresses significantly better at similar speed and can then even be used for distribution across servers via pyarrow/ray.

Lastly, I really liked your book, but I did find the heavy reliance on ptan a little bit annoying. It made things more complicated than necessary. Simple duplication of minimal code would have been better, since that way everyone can try their own code first and then compare it against the solution, instead of having to analyze a new framework. ptan just abstracted too much in my opinion. Anyway, keep up the good work!

Shmuma (Collaborator) commented Aug 4, 2019

> If you're working on a second edition, can I suggest using LZ4 instead of LazyFrames?

I haven't quite got the point. By LZ4 do you mean the generic compression algorithm used in BigData applications for fast compression/decompression? It might be useful for decreasing the memory footprint of large replay buffers, but it shouldn't be used instead of LazyFrames. LazyFrames are there to avoid memory copies on frame stacking. So they could be used together, but the size of the replay buffer is not an issue in any of the book examples.
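
To sketch what "used together" could look like (assuming the python lz4 package; the frame shape and dtype below are only examples), each buffer entry would keep compressed bytes and decompress on sampling:

```python
import lz4.frame
import numpy as np

def pack(obs: np.ndarray) -> bytes:
    # Compress one stacked observation (e.g. 4x84x84 uint8) before storing it
    return lz4.frame.compress(obs.tobytes())

def unpack(blob: bytes, shape=(4, 84, 84), dtype=np.uint8) -> np.ndarray:
    # Decompress back into an array when the transition is sampled
    return np.frombuffer(lz4.frame.decompress(blob), dtype=dtype).reshape(shape)

obs = np.zeros((4, 84, 84), dtype=np.uint8)
assert np.array_equal(unpack(pack(obs)), obs)
```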

I think that organizing the replay buffer as a hash table might be even more beneficial, as lots of frames are the same. But I haven't tried this idea. It might also be cool to try CuLE plus a replay buffer in GPU memory, to have DQN run completely inside the GPU.
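
One possible (and, as said, untested) reading of the hash-table idea, with a purely illustrative class; a real bounded buffer would also need eviction or reference counting:

```python
import hashlib
import numpy as np

class DedupFrameStore:
    """Keep each distinct frame once, keyed by a hash of its raw bytes."""
    def __init__(self):
        self.frames = {}                      # hash key -> frame array

    def add(self, frame: np.ndarray) -> str:
        key = hashlib.sha1(frame.tobytes()).hexdigest()
        self.frames.setdefault(key, frame)    # store only if unseen
        return key                            # transitions hold the key, not the array

    def get(self, key: str) -> np.ndarray:
        return self.frames[key]
```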

> I did find the heavy reliance on ptan a little bit annoying

Oh, well. What are the alternatives? The point of the book is to show how to implement methods from scratch, not how to use a particular RL toolkit. Ptan is introduced only in chapter 7 (almost halfway through the book); before that, raw PyTorch is used. In fact, ptan is just a set of small wrappers that help avoid writing experience sources and agent code over and over again, which I find the most annoying part of RL methods to write.

Writing everything from scratch might be an option, but I've already been told that too much code is given in the book :). Anyway, it could be a very useful exercise to reimplement everything without ptan, or to port it to some other RL toolkit.

In the 2nd edition I'm going to use more of torch.ignite to reduce the training loop code, but ptan will remain, sorry :).

Shmuma added a commit that referenced this issue Aug 4, 2019
Shmuma closed this as completed Aug 4, 2019