Use RoaringDocIdSet instead of FixedBitSet for caching. #8689
Conversation
sets[currentBlock] = new NotDocIdSet(new ShortArrayDocIdSet(excludedDocs), BLOCK_SIZE);
} else {
    // Neither sparse nor super dense, use a fixed bit set
    sets[currentBlock] = new FixedBitSet(denseBuffer.getBits(), denseBuffer.length());
FixedBitSet is used here instead of the BitDocIdSet implementation, which is from Lucene 5. It's worth noting that the constructors of BitDocIdSet and FixedBitSet have the same parameter declarations, but different meanings.
Something that keeps me from merging this pull request is that elasticsearch is heavily biased towards doc id sets that support random access (such as FixedBitSet, but not RoaringDocIdSet). For instance, there are lots of places where elasticsearch calls DocIdSets.toSafeBits. I really think this is the right move, but I'm afraid that if we backport it to 1.x without other changes in order to consume iterators rather than using random access, then it's going to bring more bad than good.
Thanks for your feedback. It does indeed seem like ES is heavily biased towards fixed bitsets. For your example with .toSafeBits, I've added a new commit to the PR, containing a Bits implementation that is fast for ordered access if the underlying DocIdSet's iterator has a fast advance() method, and replaced usages of .toSafeBits with calls to that method. I'm a bit uncertain about some things, please see the inline comments. Is this a decent approach to take? I also tried looking for parts of ES where FixedBitSet is used. It seems to be used most heavily in nested document search logic and geo features. I cannot see how that would affect this PR, but I have probably overlooked something. Could you help me identify the areas that need to be addressed in order for this PR to move forward?
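For reference, the idea described above (hiding an iterator behind a Bits-style interface, so that ascending lookups stay fast as long as advance() is fast) can be sketched as follows. This is a simplified, self-contained illustration: the interfaces below are minimal stand-ins for Lucene's Bits and DocIdSetIterator, and all names are hypothetical, not the PR's actual code.

```java
// Sketch: a Bits view backed by a doc-id iterator. Only correct when
// get() is called with ascending indices, which is the "ordered access"
// case the comment describes. All types here are simplified stand-ins.
interface Bits {
    boolean get(int index);
    int length();
}

interface DocIdIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int docID();             // current doc, or -1 before the first advance
    int advance(int target); // move to the first doc >= target
}

/** Backed by a sorted doc-id array, standing in for a real iterator. */
class ArrayDocIdIterator implements DocIdIterator {
    private final int[] docs;
    private int pos = -1;
    ArrayDocIdIterator(int[] docs) { this.docs = docs; }
    public int docID() {
        if (pos < 0) return -1;
        return pos >= docs.length ? NO_MORE_DOCS : docs[pos];
    }
    public int advance(int target) {
        // linear scan is fine for a sketch; real iterators advance faster
        while (++pos < docs.length) {
            if (docs[pos] >= target) return docs[pos];
        }
        return NO_MORE_DOCS;
    }
}

/** Bits implementation that is fast for ordered (ascending) access. */
class IteratorBackedBits implements Bits {
    private final DocIdIterator it;
    private final int maxDoc;
    IteratorBackedBits(DocIdIterator it, int maxDoc) {
        this.it = it;
        this.maxDoc = maxDoc;
    }
    public boolean get(int index) {
        int doc = it.docID();
        if (doc < index) {
            doc = it.advance(index); // relies on advance() being fast
        }
        return doc == index;
    }
    public int length() { return maxDoc; }
}

public class Demo {
    public static void main(String[] args) {
        Bits bits = new IteratorBackedBits(
                new ArrayDocIdIterator(new int[] {3, 7, 8}), 10);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bits.length(); i++) {
            sb.append(bits.get(i) ? '1' : '0'); // ascending access only
        }
        System.out.println(sb); // prints 0001000110
    }
}
```

Note the trade-off this sketch makes explicit: because the wrapper never rewinds the iterator, calling get() with a smaller index than a previous call would silently return wrong answers; the real change would need to guarantee ascending access at every call site.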
@@ -52,7 +52,7 @@ public static boolean isEmpty(@Nullable DocIdSet set) {
      * For example, it does not end up iterating one doc at a time to check for its "value".
      */
     public static boolean isFastIterator(DocIdSet set) {
-        return set instanceof FixedBitSet;
+        return set instanceof FixedBitSet || set instanceof RoaringDocIdSet;
The iterator in the RoaringDocIdSet is at best O(1) and at worst O(log N) (where N is the number of present/absent docs in a block), so I'd call it fast enough to meet the comment's criteria.
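The O(1)/O(log N) behaviour comes from the per-block layout shown in the diff above: each 2^16-doc block is stored either as a sorted array of offsets (sparse case, binary-searched) or as a bitset (dense case, constant-time test). The following toy sketch illustrates that lookup structure; the class names, the sparse threshold, and the use of java.util.BitSet are illustrative assumptions, not Lucene's exact layout (in particular, the real RoaringDocIdSet also has the inverted "super dense" case from the diff).

```java
// Toy sketch of Roaring-style per-block membership testing.
// Names and the SPARSE_MAX threshold are illustrative, not Lucene's.
import java.util.Arrays;
import java.util.BitSet;

class RoaringSketch {
    static final int BLOCK_SHIFT = 16;
    static final int BLOCK_SIZE = 1 << BLOCK_SHIFT; // 65536 docs per block
    // assumed cutoff: below this cardinality a sorted array is smaller
    static final int SPARSE_MAX = BLOCK_SIZE >>> 4;

    // per block: int[] of sorted in-block offsets, BitSet, or null (empty)
    private final Object[] blocks;

    RoaringSketch(int maxDoc, int[] sortedDocs) {
        int numBlocks = (maxDoc >>> BLOCK_SHIFT) + 1;
        int[] counts = new int[numBlocks];
        for (int doc : sortedDocs) counts[doc >>> BLOCK_SHIFT]++;
        blocks = new Object[numBlocks];
        int i = 0; // docs are globally sorted, so each block is contiguous
        for (int b = 0; b < numBlocks; b++) {
            if (counts[b] == 0) continue;
            if (counts[b] <= SPARSE_MAX) {
                int[] offs = new int[counts[b]];
                for (int j = 0; j < offs.length; j++) {
                    offs[j] = sortedDocs[i++] & (BLOCK_SIZE - 1);
                }
                blocks[b] = offs;
            } else {
                BitSet bs = new BitSet(BLOCK_SIZE);
                for (int j = 0; j < counts[b]; j++) {
                    bs.set(sortedDocs[i++] & (BLOCK_SIZE - 1));
                }
                blocks[b] = bs;
            }
        }
    }

    boolean contains(int doc) {
        Object block = blocks[doc >>> BLOCK_SHIFT];
        int off = doc & (BLOCK_SIZE - 1);
        if (block == null) return false;
        if (block instanceof BitSet) {
            return ((BitSet) block).get(off);            // dense: O(1)
        }
        return Arrays.binarySearch((int[]) block, off) >= 0; // sparse: O(log n)
    }
}

public class RoaringSketchDemo {
    public static void main(String[] args) {
        RoaringSketch set = new RoaringSketch(1 << 17, new int[] {5, 100, 70000});
        System.out.println(set.contains(100) + " " + set.contains(101)); // prints true false
    }
}
```

Since a lookup is one array index plus at most one binary search inside a 2^16-doc block, the worst case is bounded by the block cardinality, which matches the O(log N) claim above.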
@antonha Given that we have started efforts to move the master branch forward in preparation for version 2.0 (this is why there are more 2.0 commits these days than in the past), I think we should really focus on master and not backport such important changes. For instance, I like the way that you hide an iterator behind bits in order to easily move away from DocIdSets.toSafeBits. Ideally I think we should try to bring such changes to master and keep on using dense bitsets in 1.x? About nested documents, it is fine that they still use bitsets. The reason is that they need the
@jpountz I am very happy about the direction master and 2.0 are taking in this regard, and I would be very happy to contribute if that brings value. However, the main reason for opening this PR was to bring the functionality to ES users before 2.0, especially since that upgrade might be cumbersome to make (major version bumps for both ES and Lucene). In our ES setup, we are more or less blocked from using filter caches for most of our filters, due to the memory consumption of the cache. The same would be true for most users with large indices. Thus, I believe it would be a good thing if we could reduce the memory impact of the caches before 2.0. Is there a better approach to making that happen?
@antonha I just revisited this PR and spoke to @jpountz about it. We decided to pass on this PR, since we have already made this move in master, and because of the complexity of the change. In 1.x we still have a large amount of code that relies on FixedBitSet being used for filter caching, and we don't want to make those changes there, since they get quite complex when it comes to filter conjunctions, intersections, etc. I hope you understand our decision. Thanks for helping out on this end!
The default DocIdSet implementation used for caching is currently FixedBitSet. For very sparse or very dense sets, there are better implementations.
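To make the memory argument concrete, here is a back-of-the-envelope comparison: a FixedBitSet always costs about maxDoc / 8 bytes regardless of how many docs match, while a sparse per-block array representation costs roughly two bytes per matching doc. The shard size and match count below are hypothetical figures chosen for illustration, not measurements from this PR.

```java
// Back-of-the-envelope comparison of cached-filter memory cost.
// The maxDoc and match counts are assumed example values.
public class CacheMemorySketch {
    public static void main(String[] args) {
        long maxDoc = 100_000_000L; // hypothetical large shard
        long matches = 10_000L;     // very selective filter

        // FixedBitSet: one bit per doc in the index, matched or not
        long fixedBitSetBytes = maxDoc / 8;

        // sparse representation: ~one short (2 bytes) per matching doc
        long sparseBytes = matches * Short.BYTES;

        // prints 12500000 vs 20000 (~12.5 MB vs ~20 KB per cached filter)
        System.out.println(fixedBitSetBytes + " vs " + sparseBytes);
    }
}
```

With hundreds of cached filters, the difference between ~12.5 MB and ~20 KB per entry is what makes or breaks filter caching on large indices, which is the motivation stated in the following paragraphs.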
Lucene currently uses WAH8DocIdSet. #7577 suggested that ES should also switch to that implementation. The PR was rejected in favour of getting the RoaringDocIdSet from Lucene 5. Upgrading to Lucene 5 was done in 610ce07, and it does indeed use the RoaringDocIdSet implementation.
For ES users with large indices, it would be beneficial if the filter cache did not use so much memory, even before ES moves to Lucene 5 (which seems to be targeted for 2.0). This PR uses the RoaringDocIdSet from Lucene trunk (with minor modifications) instead of the FixedBitSet implementation. For the modifications, see the inline comments.
I've run the test suite locally a few times, and it passes. I did see some issues along the way, but fixed them.