Change the default cache filter impl from FixedBitSet to WAH8DocIdSet #7577
Conversation
5bdebf3 to 17aff61 (Compare)
…cIdSet. A few months ago, Lucene switched from FixedBitSet to WAH8DocIdSet in order to cache filters. WAH8DocIdSet is especially better when dealing with sparse sets: iteration is faster, memory usage is lower, and there is an index that helps keep advance fast. This doesn't break existing code since elastic#6280 already made sure that there was no abusive cast from DocIdSet to Bits or FixedBitSet, and elastic#7037 moved consumers of the filter cache that absolutely need to get fixed bitsets to their own cache. Since cached filters will be more memory-efficient, the filter cache size has been decreased from 10% to 5%. Although smaller, this might still allow more filters to be cached.
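As a rough illustration of why a word-aligned encoding is so memory-efficient on sparse sets, here is a toy sketch of the general idea: runs of identical words are collapsed into a counter. This is NOT the actual WAH8DocIdSet encoding (which also uses byte alignment and an index for advance); the class and scheme below are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of the word-aligned hybrid idea behind WAH-style sets
// (NOT the real WAH8DocIdSet format): a bit set stored as 64-bit words
// is compressed by collapsing runs of all-zero words into a single
// counter, which is why sparse sets take so little memory.
class WordAlignedSketch {
    // Each entry is a pair: {0, runLengthOfZeroWords} or {1, literalWord}.
    static List<long[]> compress(long[] words) {
        List<long[]> out = new ArrayList<>();
        int i = 0;
        while (i < words.length) {
            if (words[i] == 0L) {
                int run = 0;
                while (i < words.length && words[i] == 0L) { run++; i++; }
                out.add(new long[] {0, run}); // run of zero words
            } else {
                out.add(new long[] {1, words[i]}); // literal word
                i++;
            }
        }
        return out;
    }

    static long[] decompress(List<long[]> encoded) {
        List<Long> words = new ArrayList<>();
        for (long[] e : encoded) {
            if (e[0] == 0) { for (long k = 0; k < e[1]; k++) words.add(0L); }
            else { words.add(e[1]); }
        }
        long[] out = new long[words.size()];
        for (int i = 0; i < out.length; i++) out[i] = words.get(i);
        return out;
    }
}
```

A set with two populated words out of 100 compresses to three entries instead of 100 words, while a dense set degenerates to mostly literal words (which is why FixedBitSet remains competitive there).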
17aff61 to 8d39d48 (Compare)
A few things:

(1) This change now means that it uses See for example We can do two things here: either remove the isCacheable check (and check on instanceof for the Fixed and WH ones), or go over all the filters that generate a bit set and make sure to "hack" the

(2) If we are going to have more WH variants of bitsets, I wonder if we should also optimize the

(3) The cache of the filters is reduced, but I wonder how much memory savings we will have in practice to warrant a reduction? For example,
More thoughts on (1): actually, since we allow field data to be evicted, I would change all the field data based impls to non-cacheable; that actually makes sense contract-wise (otherwise they still hold a reference). Still need to look at the other impls, for example,
I think FixedBitSet is still better suited here. WAH8 is nice, but it has the limitation that it needs to be built in order so I don't think we can use it here?
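The in-order limitation means a WAH8-style set can only be produced by a builder that accepts strictly increasing doc IDs, whereas a FixedBitSet supports random writes. A toy illustration of that constraint (hypothetical builder class, not Lucene's):

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of the in-order constraint on compressed-set builders
// (hypothetical class, not Lucene's API): docs must be added in strictly
// increasing order, so filters that set bits out of order cannot build
// such a set directly.
class InOrderBuilder {
    private int last = -1;
    private final List<Integer> docs = new ArrayList<>();

    InOrderBuilder add(int doc) {
        if (doc <= last) {
            throw new IllegalArgumentException("out-of-order doc: " + doc);
        }
        last = doc;
        docs.add(doc);
        return this;
    }

    List<Integer> docs() { return docs; }
}
```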
Good point. I will revert that part of the change.
I think that behavior is correct since the returned DocIdSet is based on a postings list (it's not in memory).
I read it as fast without doing disk I/O. Also, Field Data is in memory (cost of loading it once; though as I mentioned, it might get evicted, so the reference will still be there and not properly GC'ed). Regardless, maybe it's just semantics; the point I think remains: we need to fix those if we rely on
Yea, but if the inner filters are WH based and not Fixed based, the execution will be slower. And it will be now, because that's what's used for caching (on certain filters). Maybe we can add an optimization (dunno if it's possible) to do fast boolean operations on an existing FixedBitSet with a WH one?
No, look at TermFilter: it overrides DocIdSet without overriding isCacheable (so it defaults to true) and setting it to false. That's a bug, no? We didn't see it in ES because we didn't take isCacheable into account.
My bad, I got confused with the EMPTY implementation, so it's good and it defaults to false. Same goes for GeoFilter. I would love (unless you did it already) to go over all the filters we have and verify that the isCacheable behavior is correct, since now we rely on it.
I was just doing it: everything looks correct in Elasticsearch. However, there is one impl in Lucene that doesn't look correct: FieldCacheDocIdSet. I will open an issue. Since we don't use it in Elasticsearch, I don't think that should be a blocker though.
OK, I just understood how it works. I'm wondering whether, if we switch to WAH8DocIdSet, it will be slower in some cases? For example, the FixedBitSet.or optimization ORs 64 bits at once, so in the end you need about maxDoc/64 operations to compute the union of two sets. But if we take the union with a WAH8DocIdSet instead of a FixedBitSet, you need O(cardinality) operations to compute the union, which is faster on sparse sets but could be slower otherwise. I will try to benchmark the change.
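The contrast above can be sketched in a few lines. This is illustrative, not Lucene's actual code: a fixed bit set backed by a long[] unions one 64-bit word at a time (~maxDoc/64 operations regardless of how many bits are set), while a sparse representation merges sorted doc IDs in O(cardinality) operations.

```java
// Illustrative contrast (not Lucene's code): dense vs sparse union cost.
class UnionCost {
    // Dense union: a |= b, word by word. ~maxDoc/64 operations,
    // independent of cardinality -- cheap when sets are dense.
    static void orInPlace(long[] a, long[] b) {
        for (int i = 0; i < a.length; i++) a[i] |= b[i];
    }

    // Sparse union: merge two sorted doc-ID arrays.
    // O(cardinality) operations -- cheap when sets are sparse.
    static int[] mergeSorted(int[] a, int[] b) {
        int[] out = new int[a.length + b.length];
        int i = 0, j = 0, k = 0;
        while (i < a.length && j < b.length) {
            if (a[i] < b[j]) out[k++] = a[i++];
            else if (a[i] > b[j]) out[k++] = b[j++];
            else { out[k++] = a[i++]; j++; } // same doc in both sets
        }
        while (i < a.length) out[k++] = a[i++];
        while (j < b.length) out[k++] = b[j++];
        return java.util.Arrays.copyOf(out, k);
    }
}
```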
Just another idea about the boolean stuff. I was thinking about this when trying to figure out how to merge query/filter: We have FixedBitSet.or/and and WAH8.intersect/union. Should we somehow push this up to DocIdSet? Something simple like:
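The original snippet from this comment was not preserved in the thread; a hedged sketch of what such an API could look like (all names here are hypothetical, not Lucene's): DocIdSet grows default union/intersect methods that work on any implementation, and impls like FixedBitSet or WAH8DocIdSet override them with their optimized word-level versions.

```java
import java.util.ArrayList;
import java.util.List;

// Hedged sketch of the idea (hypothetical names, not Lucene API):
// push set operations up to DocIdSet with a generic default, and let
// FixedBitSet / WAH8DocIdSet override with word-wise or run-aware code.
abstract class SketchDocIdSet {
    abstract int[] docs(); // simplified stand-in for iterator()

    // Default: leap-frog intersection over sorted doc IDs. FixedBitSet
    // would override with a word-wise AND; WAH8 with a run-aware merge.
    SketchDocIdSet intersect(SketchDocIdSet other) {
        int[] a = docs(), b = other.docs();
        List<Integer> out = new ArrayList<>();
        for (int i = 0, j = 0; i < a.length && j < b.length; ) {
            if (a[i] < b[j]) i++;
            else if (a[i] > b[j]) j++;
            else { out.add(a[i]); i++; j++; }
        }
        int[] arr = new int[out.size()];
        for (int i = 0; i < arr.length; i++) arr[i] = out.get(i);
        return of(arr);
    }

    static SketchDocIdSet of(int... d) {
        return new SketchDocIdSet() { int[] docs() { return d; } };
    }
}
```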
I don't know if it makes sense or is a good or bad idea. We should measure whether the performance is really better with these optimized methods, versus just intersecting iterators with e.g. ConjunctionScorer. It could be that they are only faster because FixedBitSet has a crappy cost() impl, but WAH8 does not have this issue. I have a lot of concerns about how useful they really are: how general are they in the presence of other complexities? Or are they only useful for certain trivial cases like BooleanFilter? I kinda don't like the idea of having boolean logic embedded in these impls themselves for that reason. But it's an idea.
I did some very basic benchmarks on an index containing 50M documents in one shard, with field values that have various densities.

Pure iteration

For this test, I measured the response time of the following query on cached filters:

GET index1/_search?search_type=count
{
"query": {
"filtered": {
"filter": {
"term": {
"field": "value"
}
}
}
}
}
WAH8 is faster on sparse (< 1%) and dense (>= 90%) sets but slower otherwise. It also confirms that term filters do not benefit much from caching (#7583).

bool_filter

This time I tried a

GET index1/_search?search_type=count
{
"query": {
"filtered": {
"filter": {
"bool": {
"should": [] // <- 3 clauses on term filters
}
}
}
}
}
Here we can see the impact of the bool filter optimization when frequencies are high.

Leap frog (ie. advance)

Again 3 clauses, but this time the query is a conjunction and performs some leap-frog:

GET index1/_search?search_type=count
{
"query": {
"filtered": {
"filter": {
"and": {
"filters": [] // <- 3 clauses on term filters
}
}
}
}
}
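The "leap frog" in this benchmark refers to conjunction via advance(target): each iterator skips directly to the other's current doc instead of scanning every doc in between (the index inside WAH8DocIdSet exists to keep exactly this operation fast). A toy model of the pattern, with a simplified iterator that only mimics the relevant part of Lucene's DocIdSetIterator contract:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of leap-frog conjunction between two doc-ID iterators.
// ToyIterator is an illustration, not Lucene's DocIdSetIterator.
class LeapFrog {
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    static class ToyIterator {
        final int[] docs;
        int pos = -1;
        ToyIterator(int... docs) { this.docs = docs; }
        int nextDoc() { return ++pos < docs.length ? docs[pos] : NO_MORE_DOCS; }
        // advance(target): move to the first doc >= target. A real sparse
        // set keeps an index so this does not scan every intervening doc.
        int advance(int target) {
            int d;
            do { d = nextDoc(); } while (d < target);
            return d;
        }
    }

    static List<Integer> conjunction(ToyIterator a, ToyIterator b) {
        List<Integer> matches = new ArrayList<>();
        int doc = a.nextDoc();
        while (doc != NO_MORE_DOCS) {
            int other = b.advance(doc);        // leap b forward to a's doc
            if (other == doc) {                // both iterators agree: match
                matches.add(doc);
                doc = a.nextDoc();
            } else if (other == NO_MORE_DOCS) {
                break;
            } else {
                doc = a.advance(other);        // leap a forward to b's doc
            }
        }
        return matches;
    }
}
```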
Conclusion

Overall these benchmarks mostly agree with http://people.apache.org/~jpountz/doc_id_sets.html, which I worked on when I wrote those compact bitsets. Which one is faster depends on the use-case: FixedBitSet tends to be better on dense sets, while WAH8 has interesting memory savings. Independently of benchmarks, something I like about WAH8 is that it has an accurate cost, which would allow getting rid of some heuristics like
+1
Agreed, but in the most common use case, where one has either BooleanFilter or terms/range filters, those heuristics will not be possible, since a FixedBitSet is used even when cached. So I think there is a bigger question here regarding the use of cardinality, or trying to use WH for all one way or another (unclear to me). @rmuir's idea around pushing union/intersect to DocIdSet sounds promising and might be worth exploring in Lucene; even if the "default" impl is just an iteration, it will simplify a lot of code out there and allow more optimizations to be done down the road without having to go and apply them in different places. Regarding the benchmarks, I think the main difference that we have with this pull request is that, out of the filters that have caching enabled by default, the term filter is the only one that will actually use WH rather than FixedBitSet (since terms and range already produce a FixedBitSet), so it's not such a big deal potentially.
Lucene recently changed its multi-term query wrapper filter: when the density is high it still uses FixedBitSet, but when the density is low it now uses a sparse bit set: https://issues.apache.org/jira/browse/LUCENE-5938

Additionally, I'm hoping that we can get better performance out of RoaringDocIdSet compared to WAH8DocIdSet: https://issues.apache.org/jira/browse/LUCENE-5983

Another idea I have is to use the cost API in order to decide which impl to use for caching. This way we could still use FixedBitSet in the very dense case and fall back to something that is less brute-force in the sparse case.
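The cost-based idea in the last paragraph could look something like the sketch below. Everything here is hypothetical: the class, the enum, and especially the 1/16 density threshold, which is made up for illustration and would need benchmarking.

```java
// Hedged sketch (hypothetical names and threshold): pick the caching
// impl from the iterator's estimated cost (cardinality) vs maxDoc.
class CacheImplChooser {
    enum Impl { FIXED_BIT_SET, SPARSE_SET }

    static Impl choose(long cost, int maxDoc) {
        // Dense: the brute-force long[]-backed set wins on iteration
        // and boolean ops. Sparse: a compressed set saves memory and
        // iterates in O(cardinality). Cutoff is illustrative only.
        return cost >= maxDoc / 16L ? Impl.FIXED_BIT_SET : Impl.SPARSE_SET;
    }
}
```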
Hi. It seems like I got to see this ticket a tiny bit too late, but I wonder if there are any plans to do a version of this compatible with the 1.x branches? RoaringDocIdSet seems to have been introduced in Lucene 5, which means that we won't see this fix until 2.0.0 is released. For the use case we have, it would help a lot to have an implementation with lower memory usage. I guess this can be done either by using the WAH8 implementation or by copying the RoaringDocIdSet implementation from Lucene into the ES code base. Is there anything preventing that change?