Dueling DQN implementation possibly wrong? #52
Comments
Hi! Thanks! You're right, we need to subtract the mean over actions, not over the full batch. Will fix.
Hi NikEyX, Shmuma, well spotted! This might have implications for convergence, I'd guess. Is the correct syntax (for the master branch of the code) `return val + adv - adv.mean(dim=1,keepdim=True).unsqueeze(1)`? Rgds, Dom
Hey Dom, the point of the forward call is to give you back BATCH_SIZE x Q-VALUES, e.g. for this script it should yield torch.Size([32, 1]), since you get one Q-value for each of the items in the batch. (Keep in mind Q-value = State_Value + Advantage_Value.) The advantage only ever affects a given state, so it needs to have the same 32x1 size; you can verify that both tensors do.
Keep in mind that the advantage operator isn't that useful in Pong and likely prolongs training a bit in this example. It's much more useful for other environments, especially where state values get very large. Compare a situation where one state gives you a Q-value of 9999 for one action and a Q-value of 10001 for the other. Objectively this reads as "both actions are approximately equal", so it shouldn't matter much which one you take (especially if you use Boltzmann exploration), but with the advantage decomposition the difference between -1 and 1 is suddenly quite large (assuming that the state value is 10000).
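To make the per-state subtraction concrete, here is a minimal sketch of a dueling head in PyTorch. The class and layer sizes are illustrative, not the book's actual network; the key line is the combine step, which averages the advantages over the action dimension (`dim=1`) rather than over the whole batch:

```python
import torch
import torch.nn as nn


class DuelingHead(nn.Module):
    """Sketch of a dueling combine step: Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a)).
    Layer sizes are hypothetical, chosen only for illustration."""

    def __init__(self, in_features: int, n_actions: int):
        super().__init__()
        self.val = nn.Linear(in_features, 1)          # state value V(s):      [B, 1]
        self.adv = nn.Linear(in_features, n_actions)  # advantages A(s, a):    [B, n_actions]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        val = self.val(x)
        adv = self.adv(x)
        # Subtract the per-state mean over actions (dim=1), not over the batch;
        # keepdim=True keeps the [B, 1] shape so broadcasting works.
        return val + adv - adv.mean(dim=1, keepdim=True)


head = DuelingHead(in_features=64, n_actions=6)
x = torch.randn(32, 64)
q = head(x)
print(q.shape)  # torch.Size([32, 6])
```

A useful sanity check: with the per-state mean subtracted, averaging Q over actions recovers V(s) for each batch element, which is exactly the identifiability constraint formula (9) of the paper imposes.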
Thanks NikEyX, crystal clear.
See also #6 |
Hi guys! Sorry for the long wait on those bugs. I've finally started working on the 2nd edition (whee!) and am going to clean up all the mess in this codebase and address all the issues. Thanks for the patience! :)
If you're working on a second edition, can I suggest using LZ4 instead of LazyFrames? It compresses significantly better at similar speed and can then even be used for distribution across servers via pyarrow/ray.
Haven't got the point. By LZ4 do you mean the generic compression algorithm used in BigData applications for fast compress/decompress? It might be useful to decrease the memory footprint of large replay buffers, but it shouldn't be used instead of LazyFrames. LazyFrames are used to avoid memory copies on frame stacking, so the two could be used together, but the size of the replay buffer is not an issue in any of the book's examples. I think organizing the replay buffer as a hash table might be even more beneficial, as lots of frames are the same, but I haven't tried this idea. Might be cool to try CuLE plus a replay buffer in GPU memory to have DQN run completely inside the GPU.
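For readers who haven't seen the wrapper being discussed: the copy-avoidance idea behind LazyFrames can be sketched in a few lines. This is an illustrative simplification, not the actual Gym/baselines implementation; the point is that consecutive stacked observations share references to the same per-step frames and only materialize a contiguous array on demand:

```python
import numpy as np


class LazyFrames:
    """Sketch of the LazyFrames idea: overlapping observations hold
    references to the same underlying frames, and the stacked array
    is only built (i.e., copied) when the observation is actually used."""

    def __init__(self, frames):
        self._frames = frames  # list of per-step frames, shared across observations

    def __array__(self, dtype=None):
        out = np.stack(self._frames, axis=0)  # materialize only on demand
        if dtype is not None:
            out = out.astype(dtype)
        return out


# Two consecutive 4-frame observations share 3 of their 4 frames,
# so a replay buffer holding both stores each frame only once.
frames = [np.zeros((84, 84), dtype=np.uint8) for _ in range(5)]
obs_t = LazyFrames(frames[0:4])
obs_t1 = LazyFrames(frames[1:5])
print(np.asarray(obs_t).shape)  # (4, 84, 84)
```

This is orthogonal to LZ4: LazyFrames removes duplicate storage, while LZ4 would shrink the bytes of each stored frame, so the two techniques can indeed be combined.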
Oh, well. What are the alternatives? The point of the book is to show how to implement methods from scratch, not how to use a particular RL toolkit. Ptan is introduced only in chapter 7 (almost half the book); before that, raw PyTorch is used. In fact, ptan is just small wrappers that help avoid writing experience sources and agent code over and over again, which I find the most annoying piece of RL methods to write. Writing everything from scratch might be an option, but I've already been told that too much code is given in the book :). Anyway, it could be a very useful exercise to reimplement everything without ptan, or to port it to some other RL toolkit. In the 2nd edition I'm going to use more PyTorch Ignite to reduce training-loop code, but ptan will remain, sorry :).
Hi Max,
this is the Dueling DQN implementation from DeepMind: https://arxiv.org/pdf/1511.06581.pdf
Formula 9 here shows that the advantage is centered using the mean over the actions of a given state. That's also in line with the usual definition of the advantage function, I believe.
In your implementation, however, you seem to subtract the mean of the advantages over all states:
whereas I believe this should be more correct:
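Since the original code snippets are not reproduced in this thread, here is a hedged sketch of the two variants being discussed (shapes are illustrative; `val` and `adv` stand for the value and advantage head outputs):

```python
import torch

batch_size, n_actions = 32, 6
val = torch.randn(batch_size, 1)          # V(s), one value per state
adv = torch.randn(batch_size, n_actions)  # A(s, a), one value per state-action

# Variant under discussion: adv.mean() is a single scalar averaged over
# the whole batch AND all actions, so every state is shifted identically.
q_batch_mean = val + adv - adv.mean()

# Formula (9) of the paper: subtract the mean over the actions of the
# SAME state (dim=1), keeping the dim so broadcasting against [B, n] works.
q_action_mean = val + adv - adv.mean(dim=1, keepdim=True)

# With the per-state mean, averaging Q over actions recovers V(s) exactly.
print(torch.allclose(q_action_mean.mean(dim=1, keepdim=True), val))  # True
```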
What do you think?