Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use RoaringDocIdSet instead of FixedBitset for caching. #8689

Closed
wants to merge 2 commits into from

Conversation

antonha
Copy link

@antonha antonha commented Nov 27, 2014

The default DocIdBitSet used for caching is currently the FixedBitSet implementation. For very sparse or dense bitsets, there are better implementations.

Lucene currently uses WAH8DocIdSet. #7577 suggested that ES should also switch to that implementation. The PR was rejected, in favour of getting the RoaringDocIdSet from Lucene 5. Upgrading to Lucene 5 was done in 610ce07, and it does indeed use the RoaringDocId implementation.

For ES users with large indices, it would be beneficial if the filter cache did not use so much memory, even before ES uses Lucene 5 (seems to be targeted for 2.0). This PR uses the RoaringDocIdSet from Lucene trunk (with minor modifications) instead of the FixedBitSet implementation. For the modifications, see inline comments.

I've ran the test suite locally a few times, and it works. I did see some issues, but fixed them.

sets[currentBlock] = new NotDocIdSet(new ShortArrayDocIdSet(excludedDocs), BLOCK_SIZE);
} else {
// Neither sparse nor super dense, use a fixed bit set
sets[currentBlock] = new FixedBitSet(denseBuffer.getBits(), denseBuffer.length());
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the FixedBitSet instead of the BitDocIdSet implementation is Lucene 5. It's worth noting that the constructor arguments of BitDocIdSet and FibedBitSet have the same declaration, but different meanings.

@jpountz
Copy link
Contributor

jpountz commented Nov 27, 2014

Something that refrains me from merging this pull request is that elasticsearch is highly biased towards doc id sets that support random-access (such as FixedBitSet, but not RoaringDocIdSet). And for instance, there are lots of places where elasticsearch calls DocIdSet.toSafeBits in order to have random-access. If you call this method on a FixedBitSet, it will return the FixedBitSet itself. On the other hand if you call it on a RoaringDocIdSet, it will load the RoaringDocIdSet into a FixedBitSet, which is potentially very slow if the doc id set is dense. We started refactoring some consumers of filters so that they work with the iterator instead of relying on random-access, but it's quite some work and even on master it's not finished yet. For instance we still have the filter aggregation that uses this toSafeBits method on every filter/segment which we need to fix.

I really think this is the right move, but I'm afraid that if we do not backport it to 1.x without other changes in order to consume iterators rather than using random-access, then it's going to bring more bad than good.

@antonha
Copy link
Author

antonha commented Nov 28, 2014

Thanks for your feedback. It does indeed seem like ES is heavy biased towards fixed bitsets.

For your example with .toSafeBits, I've added a new commit to the PR, containing a Bits implementation that is fast for ordered access, if the underlying DocIdSet's iterator has a fast advance() method, and replaced usages of .toSafeBits with calls to that method. I'm a bit uncertain about some things, please see inline comments.

Is this a decent approach to take?

I also tried looking for parts of ES where the FixedBitSet is used. It seems like it is used most heavily in nested document search logic and geo things. I can not see how that would affect this PR, but I have probably overlooked something.

Could you help me identify the areas that need to be adressed in order for this PR to be able to move forward?

@@ -52,7 +52,7 @@ public static boolean isEmpty(@Nullable DocIdSet set) {
* For example, it does not ends up iterating one doc at a time check for its "value".
*/
public static boolean isFastIterator(DocIdSet set) {
return set instanceof FixedBitSet;
return set instanceof FixedBitSet || set instanceof RoaringDocIdSet;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The iterator in the RoaringDocIdSet is at best O(1) and at worst O(logN) (N = present/absent docs), so I'd call it fast enough to meet the comment's criteria.

@jpountz
Copy link
Contributor

jpountz commented Dec 3, 2014

@antonha Given that we started efforts to move the master branch forward in preparation for version 2.0 (this is why there are more 2.0 commits these days than in the past), I think we should really focus on master and not backport such important changes. For instance, I like the way that you hide an iterator behind bits in order to easily move away from DocIdSets.toSafeBits. Ideally I think we should try to bring such changes to master and keep on using dense bitsets on 1.x?

About nested documents, it is fine that they still use bitsets. The reason is that they need the prevSetBit method that only exists on the BitSet class, not DocIdSet(Iterator).

@antonha
Copy link
Author

antonha commented Dec 4, 2014

@jpountz I am very happy about the direction master and 2.0 is taking in this regard, and I would be very happy to contribute if that brings value.

However, the main reason for me opening this PR was for bringing the functionality to ES users before 2.0, especially since that upgrade might be cumbersome to make (major versions both for ES and Lucene). In our ES setup, we are more or less blocked from using filter caches for most of our filters, due to the memory consumption of the cache. The same would be true for most users with large indices.

Thus, I believe that it would be a good thing if we could reduce the memory impact of the caches before 2.0.

Is there a better approach to making such a thing happen?

@s1monw
Copy link
Contributor

s1monw commented Mar 24, 2015

@antonha I just revisited this PR and spoke to @jpountz about it. We decided to pass on this PR since we already move to it in master as well as for the sake of the complexity of the change. In 1.x we have a still a large amount of code that relies on FixedBitSet being used for filter caching and we don't want to make the moves there since changes are quite complex when it gets to filter conjunctions, intersections etc. I hope you understand our decision. Thanks for helping out on this end!

@s1monw s1monw closed this Mar 24, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants