Implement rank-based prioritised experience replay #1
Update: A test on Frostbite has only achieved DQN/DDQN-level scores, despite rank-based prioritised experience replay achieving much higher scores in the paper. The current implementation is therefore either wrong or suboptimal. Proportional prioritised replay still needs to be implemented.
Hi,
As above, I attempted an implementation of rank-based prioritised experience replay, but it got nowhere near the scores it should have when I ran a test on Frostbite. The paper helpfully provides complete training graphs for comparison, so clearly there's a bug somewhere. Proportional prioritised experience replay has not been attempted yet, since it is trickier to get right.
Ah, I see. Any ideas on where the bug could be? I could help take a look as well. I've been trying to find a prioritised experience replay implementation, so if you know of any others available publicly, that'd be great! Thanks again.
For rank-based, you first need a priority queue, which I've implemented as a max binary heap. I did quick tests halfway through implementation, but checking that would be the first step. The priority (absolute TD-error) is the key, while the value is actually an index which indicates the location of the sample in the experience replay buffer. So first comes checking the new-sample update and the sample update after learning. The actual algorithm/maths to compare against the paper is here. The logic for picking the partitions for sampling is here. There are a few more bits, but those are the important ones to try first. I've looked a lot, but this does seem to be the only public attempt - so no chance of porting code from others!
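To make the key/value layout concrete, here's a minimal Python sketch of the same idea (illustrative only - the repo's actual implementation is the Lua BinaryHeap linked above): the priority (absolute TD-error) orders the heap, the stored value is an index into the replay buffer, and a position map lets a sample's priority be updated after learning.

```python
class MaxHeap:
    """Sketch of a max binary heap for rank-based prioritised replay.
    Entries are (priority, replay_index): the priority is the sample's
    absolute TD-error, the value locates it in the replay buffer."""

    def __init__(self):
        self.array = []  # heap storage (0-based here; the Lua code is 1-based)
        self.pos = {}    # replay_index -> current position in self.array

    def _swap(self, i, j):
        self.array[i], self.array[j] = self.array[j], self.array[i]
        self.pos[self.array[i][1]] = i
        self.pos[self.array[j][1]] = j

    def _up(self, i):
        while i > 0:
            parent = (i - 1) // 2
            if self.array[i][0] <= self.array[parent][0]:
                break  # heap property holds again: stop early
            self._swap(i, parent)
            i = parent

    def _down(self, i):
        while True:
            largest = i
            for child in (2 * i + 1, 2 * i + 2):
                if child < len(self.array) and \
                        self.array[child][0] > self.array[largest][0]:
                    largest = child
            if largest == i:
                break
            self._swap(i, largest)
            i = largest

    def insert(self, priority, replay_index):
        """New-sample update: new transitions enter with a given priority."""
        self.array.append((priority, replay_index))
        self.pos[replay_index] = len(self.array) - 1
        self._up(len(self.array) - 1)

    def update(self, replay_index, priority):
        """Sample update after learning: the element travels up or down
        (or stays) along its path, depending on the new |TD-error|."""
        i = self.pos[replay_index]
        old = self.array[i][0]
        self.array[i] = (priority, replay_index)
        self._up(i) if priority > old else self._down(i)
```

With this layout, the two operations worth testing first are exactly the ones mentioned above: `insert` for new transitions and `update` after each learning step.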
Just a quick glance to see if anything jumps out at my uneducated fresh eyes. I haven't used BinaryHeap structures enough to know if this is something or not: https://github.com/Kaixhin/Atari/blob/master/structures/BinaryHeap.lua#L116 Would having the recursion inside those conditions mean it only continues rebalancing the heap if it is currently working on an item which is in the wrong position? It stops as soon as it hits an item which doesn't need moving - is that intended? Are they already in order up to this point?
@mryellow Yes, that is the case - one of the ideas behind this data structure is that if an element is inserted (or replaced - the less straightforward case), it will either travel up or down (or stay) along its path in the tree (a heap is a specialised tree). So going from an empty or ordered tree, each operation only needs to deal with the element in question. Tests would be best, but barring that, referring to pseudocode/code in other languages would be one way of checking this. One error I might have made is in converting from the usual 0-based indexing to 1-based indexing.
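As a sanity check on that "stop as soon as it's in place" behaviour, here's an illustrative down-heap in Python (0-based indexing, so children sit at 2i+1 and 2i+2; in 1-based Lua they sit at 2i and 2i+1):

```python
def sift_down(heap, i):
    """Restore the max-heap property below position i (0-based array).
    The recursion stops as soon as the element is no smaller than both
    children, i.e. once it has reached its correct position - so from a
    valid heap, each operation only touches the element's own path."""
    n = len(heap)
    largest = i
    left, right = 2 * i + 1, 2 * i + 2
    if left < n and heap[left] > heap[largest]:
        largest = left
    if right < n and heap[right] > heap[largest]:
        largest = right
    if largest != i:
        heap[i], heap[largest] = heap[largest], heap[i]
        sift_down(heap, largest)  # only recurse while still out of place
```

For example, `sift_down([1, 9, 8, 4, 5, 6, 7], 0)` only moves the 1 down its single path (swapping with 9, then 5) and never touches the other branch.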
Seemed that was the idea; my formal CS is weak.
That was one area which stood out, the bumping up/down of
In https://github.com/mryellow/Atari/blob/test-prior/spec/structures.lua#L140-L151 I'm failing those after
Yeah, that's checking out, with "greater than or equal to". With
In the paper they mention that they use the order in the underlying array as a proxy for the priority (noted in line 218), so the final elements might not necessarily be in priority order.
Will have to check further, but that part might be failing in the end. If I fill the heap via
It looks fine to me - I just printed
Ahh, I was using the wrong index. Priorities mostly checking out when using Although items are out-of-order until After
After
Guess there is no guarantee that the items towards the end will be the lowest magnitude.
So a check would be to make sure Although it's unlikely to show anything, it might be worth halving
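An exhaustive check over the whole underlying array is cheap and easy to write; a Python sketch (verifying parent >= child for every node, 0-based):

```python
def is_valid_max_heap(heap):
    """Return True iff every parent is >= each of its children,
    i.e. the max-heap property holds across the whole array."""
    for i in range(1, len(heap)):
        if heap[(i - 1) // 2] < heap[i]:
            return False
    return True
```

Running this after every insert/update in the test suite would catch a broken heap immediately, even if the sampling results still look plausible.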
Idiots make the best testers; hopefully my rubber-duck ignorance hits some gold sooner or later ;-) I'm maxing out my RAM and running on CPU rather than GPU, otherwise I'd kick off some agents - maybe it's all actually working.
Could there be potential for it getting stuck visiting the same experiences, thus only updating those parts of the heap when
If only idiots made the best developers too... But really, this could benefit from a fresh pair of eyes. That problem should only happen if the heap is broken in the first place - only inserted or updated elements need to be fixed (the calls to
Well, it only gets to looking at the parent for an
Testing the index after each insert shows the new max going to the first position and
Ah, that's not an issue - check out how insert works in a binary heap. I'm fairly confident that the priority queue is fine now, so I've started going over the paper and my code more thoroughly. I've made some changes, have a few more to go, and will hopefully get to start a test on Frostbite by the weekend.
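For reference, insert in a binary heap always appends at the first free slot at the end of the array and then bubbles the element up, so a new maximum reaching the first position is exactly the expected behaviour; a small Python illustration:

```python
def heap_insert(heap, x):
    """Insert into a max-heap: append at the first free slot (the end of
    the array), then bubble the element up towards the root (0-based)."""
    heap.append(x)
    i = len(heap) - 1
    while i > 0 and heap[(i - 1) // 2] < heap[i]:
        parent = (i - 1) // 2
        heap[i], heap[parent] = heap[parent], heap[i]
        i = parent

heap = []
for x in [3, 1, 4, 1, 5, 9, 2, 6]:
    heap_insert(heap, x)
# every new maximum bubbles all the way to the first position...
assert heap[0] == 9
# ...but the array as a whole is only heap-ordered, not sorted
assert heap != sorted(heap, reverse=True)
```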
A missing implementation detail to do: the heap-based array "is infrequently sorted once every 10^6 steps to prevent the heap becoming too unbalanced". Also, in line 10 of Algorithm 1 it is unclear whether the max IS weight is calculated over all samples or just the ones that were chosen.
Once again, my math is weak, but noob eyes might spot something, and I can read it well enough to fall over trying. Where does subscript Finding the lowest
It's a bit ambiguous, but yes, it could well be over just the batch. I printed out the difference between the full-sample max and the batch max for a while, with the max difference being ~0.05. I don't think that should make much of a difference, so I'll switch back to the batch max for now. I worked out a simple trick for getting the full-sample max IS weight here - it comes from the smallest probability, which is the last element in the probability distribution. I'm hoping that rebalancing the heap is the last part of the puzzle. A quick guess is that sorting the heap's underlying array and reconstructing the heap from that might just do the trick.
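The trick can be sketched in a few lines (illustrative Python, not the repo's Lua code): since w_i = (N * P(i))^(-beta) with a negative exponent, the maximum weight over the full sample comes from the minimum probability, so it can be found without a pass over all N weights:

```python
def is_weights(probs, beta):
    """Importance-sampling weights w_i = (N * p_i) ** -beta, normalised by
    the maximum weight over the full distribution. Because the exponent is
    negative, the largest weight comes from the smallest probability - for
    rank-based PER, the last element of the probability distribution - so
    the maximum is known without scanning all N weights."""
    n = len(probs)
    max_w = (n * min(probs)) ** -beta  # weight of the lowest-priority sample
    return [((n * p) ** -beta) / max_w for p in probs]
```

After normalisation, the rarest sample gets weight 1 and every other sample gets a weight below 1, which is what keeps the gradient updates bounded.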
I was just made aware of your effort to reproduce our work and combine all the best bits - very cool, I'm very happy to see that! We tried to put as much information into the paper as possible, but if we missed something, feel free to ask. Concretely:
@schaul Hi, thanks for your great work! I have a few questions about prioritised experience replay as well. If you have time, could you answer all/some of them? Thank you!
1) Does the constant priority assigned to new transitions matter? I have 1000, for example, and all my errors are clipped to [-1, 1] before the square inside the error.
2) I see jumps in my error values after each max-heap sorting - is this normal? (Please see the edit.)
3) How often do I need to check whether the heap is unbalanced or not?
4) How do you remove elements when the heap is full? Currently I just remove the last element in the heap array.
5) To update the heap in a batch, do you need to keep a dictionary from element id to its current position? Or did you use some other method?
6) How bad is it for beta to reach 1.0 too soon, before the end of learning? (alpha stays the same.)
7) I train my network every 4 added transitions and my minibatch size is 32, so there are 4 new and unvisited transitions in the heap for every batch update. Should I increase the minibatch size, decrease 4 to 2 or 1, or leave it be?
I don't participate in this repository, but I hope my questions will be helpful to anyone who was trying to find a prioritised experience replay implementation and ended up here :)
Edit: the jumps occur not after sorting, but after recomputing boundaries while the heap is not full.
@avasilkov I'll attempt to answer a few questions based on what I've read in the paper and my implementation in this repo.
@Kaixhin What do you think about persistent advantage learning combined with double DQN? I'm not sure how to combine them - do they play along well? 2) About my second question, I asked it wrong: the jumps happen while the heap is not full and I call sort + recompute segment boundaries. So it doesn't happen after memory is full, but after each boundary recompute there are jumps:
Yes, I know, but @schaul wrote
Currently I'm rebalancing it every heap_size (800k) transitions.
Thanks again! Edit: the error pic is the averaged batch error, taken every 500 steps.
@avasilkov According to one of the authors, PAL does not work well with double Q-learning (see the notes in the code).
Thank you, @Kaixhin! I guess I will go and try FIFO order for removing. Maybe because of my way of removing elements I have stale transitions that only do harm. And thanks for the suggestion about not annealing the beta parameter.
How about https://docs.python.org/2/library/queue.html#Queue.PriorityQueue ? :-) And how about Python? At least the array index there is 0-based :-) I read this thread, and it seems you are struggling to get the low-level code working correctly. From a language perspective, Python is better than Lua; on top of that, Python has much, much bigger communities and (standard) libraries, so you would not need to worry about this kind of low-level code, or bugs that hinder your main work on machine learning. As you mentioned, this is probably the only public attempt at prioritised experience replay. It's a pity to see your progress blocked by such low-level issues. I have a suggestion: how about joining forces with others? E.g. https://github.com/tambetm/simple_dqn to "stand on the shoulders of giants" :-) That Python code is well written, and easy to understand/extend. I'd really like to see the relatively small ML community work together and make one good & complete implementation, instead of many fragmented ones. Just my 2 cents.
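For what it's worth, the standard-library version looks like this (heapq, which backs Queue.PriorityQueue, is a 0-based min-heap, so negating the priority gives the max-first order prioritised replay wants):

```python
import heapq

# heapq is a min-heap over a plain 0-based list; storing the negated
# |TD-error| alongside the replay buffer index gives max-first popping
pq = []
heapq.heappush(pq, (-0.9, 17))  # (negated priority, replay buffer index)
heapq.heappush(pq, (-0.2, 3))
heapq.heappush(pq, (-0.5, 42))
neg_priority, idx = heapq.heappop(pq)
assert (-neg_priority, idx) == (0.9, 17)  # highest-priority sample first
```

That said, prioritised replay also needs to update the priority of an arbitrary element and to sample by rank rather than always popping the max, neither of which Queue.PriorityQueue exposes, so a custom heap would still be needed either way.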
@mingwugmail I don't want to start a debate about the pros and cons of Torch vs Theano, but some points relevant to this repository:
I just want to provide more info: simple_dqn is based on Neon, not Theano, and its docs at http://www.nervanasys.com/deep-reinforcement-learning-with-neon/ claim the fastest convolutional kernels (see the convnet benchmarks: https://github.com/soumith/convnet-benchmarks). A mature library plus the fastest implementation will both speed up your experiments.
I have a funny non-stationary domain which gives prioritised replay a bit of a workout. No real results to compare against, but I have noticed the behaviour change. There are 3 actions,
@mryellow Actually, @lake4790k spotted an issue to do with indexing from the priority queue to the experience replay buffer - I've only had time to throw together a quick patch on a new I've also now spotted another issue, which is that terminal states shouldn't be stored in the queue at all. In Edit 2016-06-04: First test on the
Requires a "sum tree" binary heap for efficient execution.
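For the proportional variant, the usual structure (a sketch of the standard formulation, not code from this repo) is a sum tree: an array-backed binary tree whose internal nodes store the sum of their children's priorities, giving O(log n) priority updates and O(log n) sampling proportional to priority:

```python
class SumTree:
    """Array-backed sum tree: leaves hold sample priorities, each internal
    node holds the sum of its children. The root is at index 1 (index 0 is
    unused); the leaves for a buffer of size `capacity` occupy indices
    capacity .. 2*capacity - 1."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)

    def update(self, leaf, priority):
        """Set a leaf's priority and propagate the new sums up to the root."""
        i = leaf + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self, u):
        """Descend with a uniform draw u in [0, total sum): go left while u
        falls inside the left child's mass, otherwise subtract that mass and
        go right. Returns the sampled leaf (a replay buffer index)."""
        i = 1
        while i < self.capacity:
            left = 2 * i
            if u < self.tree[left]:
                i = left
            else:
                u -= self.tree[left]
                i = left + 1
        return i - self.capacity
```

With priorities [1, 2, 3, 4], the root holds 10, and a draw of u = 9.5 lands in the last leaf's mass, so high-priority samples are drawn proportionally more often.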
Edit 2016-06-02: Please keep to the contributing guidelines.