Use RoaringDocIdSet instead of FixedBitSet for caching. #8689
Conversation
sets[currentBlock] = new NotDocIdSet(new ShortArrayDocIdSet(excludedDocs), BLOCK_SIZE);
} else {
    // Neither sparse nor super dense, use a fixed bit set
    sets[currentBlock] = new FixedBitSet(denseBuffer.getBits(), denseBuffer.length());
FixedBitSet is used here instead of the BitDocIdSet implementation, which is from Lucene 5. It's worth noting that the constructors of BitDocIdSet and FixedBitSet have the same parameter declarations, but different meanings.
Something that keeps me from merging this pull request is that elasticsearch is heavily biased towards doc id sets that support random access (such as FixedBitSet, but not RoaringDocIdSet). For instance, there are lots of places where elasticsearch calls DocIdSets.toSafeBits. I really think this is the right move, but I'm afraid that if we backport it to 1.x without other changes in order to consume iterators rather than using random access, then it's going to bring more bad than good.
Thanks for your feedback. It does indeed seem like ES is heavily biased towards fixed bitsets. For your example with .toSafeBits, I've added a new commit to the PR, containing a Bits implementation that is fast for ordered access if the underlying DocIdSet's iterator has a fast advance() method, and replaced usages of .toSafeBits with calls to that method. I'm a bit uncertain about some things, please see the inline comments. Is this a decent approach to take? I also tried looking for parts of ES where FixedBitSet is used. It seems to be used most heavily in nested document search logic and geo features. I cannot see how that would affect this PR, but I have probably overlooked something. Could you help me identify the areas that need to be addressed in order for this PR to move forward?
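For reference, the idea described above (hiding an iterator behind a Bits-style interface, so that ascending lookups stay fast as long as advance() is fast) can be sketched as follows. This is a simplified, self-contained illustration: the interfaces below are minimal stand-ins for Lucene's Bits and DocIdSetIterator, and all names are hypothetical, not the PR's actual code.

```java
// Sketch: a Bits view backed by a doc-id iterator. Only correct when
// get() is called with ascending indices, which is the "ordered access"
// case the comment describes. All types here are simplified stand-ins.
interface Bits {
    boolean get(int index);
    int length();
}

interface DocIdIterator {
    int NO_MORE_DOCS = Integer.MAX_VALUE;
    int docID();             // current doc, or -1 before the first advance
    int advance(int target); // move to the first doc >= target
}

/** Backed by a sorted doc-id array, standing in for a real iterator. */
class ArrayDocIdIterator implements DocIdIterator {
    private final int[] docs;
    private int pos = -1;
    ArrayDocIdIterator(int[] docs) { this.docs = docs; }
    public int docID() {
        if (pos < 0) return -1;
        return pos >= docs.length ? NO_MORE_DOCS : docs[pos];
    }
    public int advance(int target) {
        // linear scan is fine for a sketch; real iterators advance faster
        while (++pos < docs.length) {
            if (docs[pos] >= target) return docs[pos];
        }
        return NO_MORE_DOCS;
    }
}

/** Bits implementation that is fast for ordered (ascending) access. */
class IteratorBackedBits implements Bits {
    private final DocIdIterator it;
    private final int maxDoc;
    IteratorBackedBits(DocIdIterator it, int maxDoc) {
        this.it = it;
        this.maxDoc = maxDoc;
    }
    public boolean get(int index) {
        int doc = it.docID();
        if (doc < index) {
            doc = it.advance(index); // relies on advance() being fast
        }
        return doc == index;
    }
    public int length() { return maxDoc; }
}

public class Demo {
    public static void main(String[] args) {
        Bits bits = new IteratorBackedBits(
                new ArrayDocIdIterator(new int[] {3, 7, 8}), 10);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bits.length(); i++) {
            sb.append(bits.get(i) ? '1' : '0'); // ascending access only
        }
        System.out.println(sb); // prints 0001000110
    }
}
```

Note the trade-off this sketch makes explicit: because the wrapper never rewinds the iterator, calling get() with a smaller index than a previous call would silently return wrong answers; the real change would need to guarantee ascending access at every call site.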
@@ -52,7 +52,7 @@ public static boolean isEmpty(@Nullable DocIdSet set) {
      * For example, it does not end up iterating one doc at a time to check for its "value".
      */
     public static boolean isFastIterator(DocIdSet set) {
-        return set instanceof FixedBitSet;
+        return set instanceof FixedBitSet || set instanceof RoaringDocIdSet;
The iterator in the RoaringDocIdSet is at best O(1) and at worst O(log N) (where N is the number of present/absent docs in a block), so I'd call it fast enough to meet the comment's criteria.
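The O(1)/O(log N) behaviour comes from the per-block layout shown in the diff above: each 2^16-doc block is stored either as a sorted array of offsets (sparse case, binary-searched) or as a bitset (dense case, constant-time test). The following toy sketch illustrates that lookup structure; the class names, the sparse threshold, and the use of java.util.BitSet are illustrative assumptions, not Lucene's exact layout (in particular, the real RoaringDocIdSet also has the inverted "super dense" case from the diff).

```java
// Toy sketch of Roaring-style per-block membership testing.
// Names and the SPARSE_MAX threshold are illustrative, not Lucene's.
import java.util.Arrays;
import java.util.BitSet;

class RoaringSketch {
    static final int BLOCK_SHIFT = 16;
    static final int BLOCK_SIZE = 1 << BLOCK_SHIFT; // 65536 docs per block
    // assumed cutoff: below this cardinality a sorted array is smaller
    static final int SPARSE_MAX = BLOCK_SIZE >>> 4;

    // per block: int[] of sorted in-block offsets, BitSet, or null (empty)
    private final Object[] blocks;

    RoaringSketch(int maxDoc, int[] sortedDocs) {
        int numBlocks = (maxDoc >>> BLOCK_SHIFT) + 1;
        int[] counts = new int[numBlocks];
        for (int doc : sortedDocs) counts[doc >>> BLOCK_SHIFT]++;
        blocks = new Object[numBlocks];
        int i = 0; // docs are globally sorted, so each block is contiguous
        for (int b = 0; b < numBlocks; b++) {
            if (counts[b] == 0) continue;
            if (counts[b] <= SPARSE_MAX) {
                int[] offs = new int[counts[b]];
                for (int j = 0; j < offs.length; j++) {
                    offs[j] = sortedDocs[i++] & (BLOCK_SIZE - 1);
                }
                blocks[b] = offs;
            } else {
                BitSet bs = new BitSet(BLOCK_SIZE);
                for (int j = 0; j < counts[b]; j++) {
                    bs.set(sortedDocs[i++] & (BLOCK_SIZE - 1));
                }
                blocks[b] = bs;
            }
        }
    }

    boolean contains(int doc) {
        Object block = blocks[doc >>> BLOCK_SHIFT];
        int off = doc & (BLOCK_SIZE - 1);
        if (block == null) return false;
        if (block instanceof BitSet) {
            return ((BitSet) block).get(off);            // dense: O(1)
        }
        return Arrays.binarySearch((int[]) block, off) >= 0; // sparse: O(log n)
    }
}

public class RoaringSketchDemo {
    public static void main(String[] args) {
        RoaringSketch set = new RoaringSketch(1 << 17, new int[] {5, 100, 70000});
        System.out.println(set.contains(100) + " " + set.contains(101)); // prints true false
    }
}
```

Since a lookup is one array index plus at most one binary search inside a 2^16-doc block, the worst case is bounded by the block cardinality, which matches the O(log N) claim above.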
@antonha Given that we have started efforts to move the master branch forward in preparation for version 2.0 (this is why there are more 2.0 commits these days than in the past), I think we should really focus on master and not backport such important changes. For instance, I like the way that you hide an iterator behind bits in order to easily move away from DocIdSets.toSafeBits. Ideally I think we should try to bring such changes to master and keep on using dense bitsets in 1.x? About nested documents, it is fine that they still use bitsets. The reason is that they need the
@jpountz I am very happy about the direction master and 2.0 are taking in this regard, and I would be very happy to contribute if that brings value. However, the main reason for opening this PR was to bring the functionality to ES users before 2.0, especially since that upgrade might be cumbersome to make (major version bumps for both ES and Lucene). In our ES setup, we are more or less blocked from using filter caches for most of our filters, due to the memory consumption of the cache. The same would be true for most users with large indices. Thus, I believe it would be a good thing if we could reduce the memory impact of the caches before 2.0. Is there a better approach to making that happen?
@antonha I just revisited this PR and spoke to @jpountz about it. We decided to pass on this PR, since we have already made this move in master, and because of the complexity of the change. In 1.x we still have a large amount of code that relies on FixedBitSet being used for filter caching, and we don't want to make those changes there, since they get quite complex when it comes to filter conjunctions, intersections, etc. I hope you understand our decision. Thanks for helping out on this end!
The default DocIdSet implementation used for caching is currently FixedBitSet. For very sparse or very dense sets, there are better implementations.
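To make the memory argument concrete, here is a back-of-the-envelope comparison: a FixedBitSet always costs about maxDoc / 8 bytes regardless of how many docs match, while a sparse per-block array representation costs roughly two bytes per matching doc. The shard size and match count below are hypothetical figures chosen for illustration, not measurements from this PR.

```java
// Back-of-the-envelope comparison of cached-filter memory cost.
// The maxDoc and match counts are assumed example values.
public class CacheMemorySketch {
    public static void main(String[] args) {
        long maxDoc = 100_000_000L; // hypothetical large shard
        long matches = 10_000L;     // very selective filter

        // FixedBitSet: one bit per doc in the index, matched or not
        long fixedBitSetBytes = maxDoc / 8;

        // sparse representation: ~one short (2 bytes) per matching doc
        long sparseBytes = matches * Short.BYTES;

        // prints 12500000 vs 20000 (~12.5 MB vs ~20 KB per cached filter)
        System.out.println(fixedBitSetBytes + " vs " + sparseBytes);
    }
}
```

With hundreds of cached filters, the difference between ~12.5 MB and ~20 KB per entry is what makes or breaks filter caching on large indices, which is the motivation stated in the following paragraphs.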
Lucene currently uses WAH8DocIdSet. #7577 suggested that ES should also switch to that implementation. The PR was rejected in favour of getting the RoaringDocIdSet from Lucene 5. Upgrading to Lucene 5 was done in 610ce07, and it does indeed use the RoaringDocIdSet implementation.
For ES users with large indices, it would be beneficial if the filter cache did not use so much memory, even before ES moves to Lucene 5 (which seems to be targeted for 2.0). This PR uses the RoaringDocIdSet from Lucene trunk (with minor modifications) instead of the FixedBitSet implementation. For the modifications, see the inline comments.
I've run the test suite locally a few times, and it passes. I did see some issues along the way, but fixed them.