Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change the default cache filter impl from FixedBitSet to WAH8DocIdSet #7577

Closed

Conversation

jpountz
Copy link
Contributor

@jpountz jpountz commented Sep 3, 2014

A few months ago, Lucene switched from FixedBitSet to WAH8DocIdSet in order
to cache filters. WAH8DocIdSet is especially better when dealing with sparse
sets: iteration is faster, memory usage is lower and there is an index that
helps keep advance fast.

This doesn't break existing code since #6280 already made sure that there was no
abusive cast from DocIdSet to Bits or FixedBitSet and #7037 moved consumers of
the filter cache that absolutely need to get fixed bitsets to their own cache.

Since cached filters will be more memory-efficient, the filter cache size has
been decreased from 10 to 5%. Although smaller, this might still allow to cache
more filters.

…cIdSet.

A few months ago, Lucene switched from FixedBitSet to WAH8DocIdSet in order
to cache filters. WAH8DocIdSet is especially better when dealing with sparse
sets: iteration is faster, memory usage is lower and there is an index that
helps keep advance fast.

This doesn't break existing code since elastic#6280 already made sure that there was no
abusive cast from DocIdSet to Bits or FixedBitSet and elastic#7037 moved consumers of
the filter cache that absolutely need to get fixed bitsets to their own cache.

Since cached filters will be more memory-efficient, the filter cache size has
been decreased from 10 to 5%. Although smaller, this might still allow to cache
more filters.
@jpountz jpountz force-pushed the enhancement/filter_cache/wah8 branch from 17aff61 to 8d39d48 Compare September 3, 2014 22:56
@kimchy
Copy link
Member

kimchy commented Sep 3, 2014

Few things:

(1) This change now means that it uses DocIdSet#set.isCacheable to not convert the DocIdSet before caching. I always had a problem with this method to be honest, since it actually means that it won't need the reader in order to get back IndexReader for expensive operations, for for example, today, if the DocIdSet is generated by a filter that access our field data, then its marked as cacheable.

See for example GeoDistanceFilter. Now, this filter generated a DocIdSet that based on the isCacehable contract, it should be true. On the other hand, when a user explicitly opts in to cache this filter, they are actually saying, please don't compute the distance each time. With this pull request, this will no longer happens.

We can do 2 things here, either remove the isCacheable check (and check on instanceof for Fixed and WH ones), or, go over all the filters that generate a bit set, and make sure to "hack" the isCacheable to make it do what we want.

(2) If we are going to have more WH variants of bitsets, I wonder if we should also optimize the XBooleanFilter to make use of it? Since today, if its a FixedBitSet, it will optimize the the boolean operations, can we do this also with WH variant? I am concerned that this will introduce a perf hit in boolean filter. We need to check.

(3) The cache of the filters is reduced, but I wonder how much memory savings will we have in practice to warrant a reduction? For example, TermsFilter still uses FixedBitSet, as well as Range filter variants. Which are the most common filters. In a case where the filter returns a FixedBitSst, we use it as is (to not have the cost of recomputing the bit set into WH), but then, we will end up, I suspect, in most cases, with filters where we turn on by default caching, with similar memory usage.

@kimchy
Copy link
Member

kimchy commented Sep 3, 2014

More thoughts on (1), actually, since we allow for field data to be evicted, I would change all the field data based impls to non cacheable, that actually makes sense contact wise (otherwise they still hold a reference). Still need to look at the other impls, for example, TermFilter mistakenly returns a DocIdSet that does not override the isCacheable method and returns false.

@jpountz
Copy link
Contributor Author

jpountz commented Sep 3, 2014

See for example GeoDistanceFilter. Now, this filter generated a DocIdSet that based on the isCacehable contract, it should be true.

isCacheable documentation says If you have an own <code>DocIdSet</code> implementation that does its iteration very effective and fast without doing disk I/O, override this method and return <code>true</code>. However The doc id set for GeoDistanceFilter does neither have fast iteration and potentially performs disk I/O since it's based on field data, so I don't think it is eligible?

If we are going to have more WH variants of bitsets, I wonder if we should also optimize the XBooleanFilter to make use of it?

I think FixedBitSet is still better suited here. WAH8 is nice, but it has the limitation that it needs to be built in order so I don't think we can use it here?

For example, TermsFilter still uses FixedBitSet, as well as Range filter variants. Which are the most common filters. In a case where the filter returns a FixedBitSst, we use it as is (to not have the cost of recomputing the bit set into WH), but then, we will end up, I suspect, in most cases, with filters where we turn on by default caching, with similar memory usage.

Good point. I will revert that part of the change.

TermFilter mistakenly returns a DocIdSet that does not override the isCacheable method and returns false

I think that behavior is correct since the returned DocIdSet is based on a postings list (it's not in memory).

@kimchy
Copy link
Member

kimchy commented Sep 3, 2014

isCacheable documentation says If you have an own DocIdSet implementation that does its iteration very effective and fast without doing disk I/O, override this method and return true. However The doc id set for GeoDistanceFilter does neither have fast iteration and potentially performs disk I/O since it's based on field data, so I don't think it is eligible?

I read it as fast without doing disk I//O. Also, Field Data is in memory (cost of loading it once, though as I mentioned, it might get evicted, so the referehce will still be there and not properly GC'ed) Regardless, maybe its just semantics, the point I think remains, we need to fix those if we rely on isCacheable now.

I think FixedBitSet is still better suited here. WAH8 is nice, but it has the limitation that it needs to be built in order so I don't think we can use it here?

Yea, but if the inner filters are WH based and not Fixed based, the execution will be slower. And it will now because thats what used for caching (on certain filters). Maybe we can add an optimization (donno if its possible) to do fast boolean operations on an existing FixedBitSet with WH one?

I think that behavior is correct since the returned DocIdSet is based on a postings list (it's not in memory).

No, look at TermFilter, it overrides DocIdSet, without overriding isCacheable (so it defaults to true) and setting it to false. Thats a bug, no? We didn't see it in ES because we didn't take isCacheable into account.

@kimchy
Copy link
Member

kimchy commented Sep 4, 2014

No, look at TermFilter, it overrides DocIdSet, without overriding isCacheable (so it defaults to true) and setting it to false. Thats a bug, no? We didn't see it in ES because we didn't take isCacheable into account.

My bad, got confused with the EMPTY implementation, so its good and it defaults to false. Same goes for GeoFilter. I would love (unless you did it already) and go over all the filters we have, and verify that the isCacheable behavior is correct, since now we rely on it.

@jpountz
Copy link
Contributor Author

jpountz commented Sep 4, 2014

I would love (unless you did it already) and go over all the filters we have, and verify that the isCacheable behavior is correct, since now we rely on it.

I was just doing it: everything looks correct in Elasticsearch. However there is one impl in Lucene that doesn't look correct: FieldCacheDocIdSet. I will open an issue. Since we don't use it in Elasticsearch, I don't think that should be a blocker though.

@jpountz
Copy link
Contributor Author

jpountz commented Sep 4, 2014

If we are going to have more WH variants of bitsets, I wonder if we should also optimize the XBooleanFilter to make use of it? Since today, if its a FixedBitSet, it will optimize the the boolean operations, can we do this also with WH variant? I am concerned that this will introduce a perf hit in boolean filter. We need to check.

OK, I just understood how it works: FixedBitSet.or(DocIdSetIterator) has some instanceof calls in order to optimize for when the provided iterator is an iterator over a fixed bit set. I didn't know about this optimization. :-)

I'm wondering that if we switch to WAH8DocIdSet it will not be slower in all cases? For example FixedBitSet.or optimization allows to or 64 bits at once, so in the end you need about maxDoc/64 operations in order to compute the union of two sets. But if we take the union with a WAH8DocIdSet instead of a FixedBitSet you will need O(cardinality) operations to compute the union, which is faster on sparse sets? I will try to benchmark the change.

@rmuir
Copy link
Contributor

rmuir commented Sep 4, 2014

Just another idea about the boolean stuff. I was thinking about this when trying to figure out how to merge query/filter:

We have FixedBitSet.or/and and WAH8.intersect/union. Should we somehow push this up to DocIdSet? Something simple like:

DocIdSet union(DocIDSet...)
DocIdSet intersect(DocIDSet...)

I dont know if it makes sense or is a good or bad idea. We should measure if the performance is really better with these optimized methods, versus just intersecting iterators with e.g. ConjunctionScorer. It could be, they are only faster because FixedBitSet has a crappy cost() impl, but WAH8 does not have this issue.

I have a lot of concerns about how useful they really are: how general are they in the presence of other complexities? Or are they only useful for certain trivial cases like BooleanFilter? I kinda don't like the idea of having boolean logic embedded in these impls themselves for that reason.

But its an idea

@jpountz
Copy link
Contributor Author

jpountz commented Sep 4, 2014

I did some very basic benchmarks on an index containing 50M documents in one shard with field values that have various densities.

Pure iteration

For this test, I measured the response time of the following query on cached filters:

GET index1/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "field": "value"
        }
      }
    }
  }
}
Term freq WAH8 FixedBitSet Not cached
99% 118 248 397
90% 143 228 375
10% 68 28 47
1% 13 12 12
1‰ 3 5 6

WAH8 is faster on sparse (< 1%) and dense (>=90%) sets but slower otherwise.

It also confirms that term filters do not benefit much from caching (#7583).

bool_filter

This time I tried a bool_filter on 3 should clauses that have various frequencies. The query looks like:

GET index1/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "should": [] // <- 3 clauses on term filters
        }
      }
    }
  }
}
freq1/freq2/freq3 WAH8 FixedBitSet
90%/10%/1% 486 257
10%/1%/1‰ 105 40
1%/1‰/0.1‰ 20 14

Here we can see the impact of the bool filter optimization when frequencies are high.

Leap frog (ie. advance)

Again 3 clauses, but this time the query is a conjunction and performs some leap-frog:

GET index1/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "and": {
          "filters": [] // <- 3 clauses on term filters
        }
      }
    }
  }
}
freq1/freq2/freq3 WAH8 FixedBitSet
90%/10%/1% 40 16
10%/1%/1‰ 11 8
1%/1‰/0.1‰ 5 6

Conclusion

Overall these benchmarks mostly agree with http://people.apache.org/~jpountz/doc_id_sets.html that I worked on when I wrote those compact bitsets.

Which one is faster depends on the use-case. FixedBitSet tends to be better on dense sets while WAH8 has interesting memory savings.

Independently of benchmarks, something I like about WAH8 is that it has an accurate cost which would allow to get rid of some heuristics like if the first document is less than 100.

@rmuir
Copy link
Contributor

rmuir commented Sep 5, 2014

Independently of benchmarks, something I like about WAH8 is that it has an accurate cost which would allow to get rid of some heuristics like if the first document is less than 100.

+1

@kimchy
Copy link
Member

kimchy commented Sep 5, 2014

Independently of benchmarks, something I like about WAH8 is that it has an accurate cost which would allow to get rid of some heuristics like if the first document is less than 100.

Agreed, but in the most common use case, where one has either BooleanFilter, or terms/range filters, those heuristics will not be possible, since a FixedBitSet is used, even when cached. So I think there is a bigger question here regarding the use of cardinality, or trying to use WH for all one way or another (unclear to me).

@rmuir idea around pushing union/intersect to DocIdSet sounds promising, might be worth exploring in Lucene, even if the "default" impl is just an iteration, it will simplify a lot of code out there, and allow for more optimization to be done down the road without having to go and apply them somehow on different places.

Regarding the benchmarks, I think the main difference that we have with this pull request is that out of the filters that have caching enabled by default, term filter is the only one that will actually use WH compared to FixedBitSet (since terms and range already produce a FixedBitSet), so its not such a big deal potentially.

@jpountz
Copy link
Contributor Author

jpountz commented Oct 2, 2014

Lucene recently changed its multi-term query wrapper filter: when the density is high it still uses FixedBitSet but when the density is low it now uses a sparse bit set: https://issues.apache.org/jira/browse/LUCENE-5938

Additionally, I'm hoping that we can get better performance out of RoaringDocIdSet compared to WAH8DocIdSet: https://issues.apache.org/jira/browse/LUCENE-5983

Another idea I have is to use the cost API in order to decide on which impl to use for caching. This way we could still use FixedBitSet in the very dense case and fall back to something that is less brute-force in the sparse case.

@jpountz jpountz added stalled and removed review labels Oct 2, 2014
@jpountz
Copy link
Contributor Author

jpountz commented Nov 10, 2014

This has been fixed by @s1monw (with RoaringDocIdSet instead of WAH8) in 610ce07

@antonha
Copy link

antonha commented Nov 24, 2014

Hi.

It seems like I got to see this ticket a tiny bit too late, but I wonder if there is any plans to do a version of this compatible with the 1.x branches? The RoaringDocIdSet seems to be introduced in Lucene 5, which means that we won't see this fix until 2.0.0 is released. For the use case we have, it would help a lot to have an implementation with a lower memory usage.

I guess this can either be done by using the WAH8 implementation, or by copying the RoaringDocIdSet implementation from Lucene to the ES code base. Is there anything preventing that change?

@clintongormley clintongormley added the :Core/Infra/Core Core issues without another label label Jun 7, 2015
@clintongormley clintongormley changed the title Core: Change the default cache filter impl from FixedBitSet to WAH8DocIdSet Change the default cache filter impl from FixedBitSet to WAH8DocIdSet Jun 7, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Core Core issues without another label >enhancement v2.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants