Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search: Track most commonly used filters and cache them #8449

Closed
jpountz opened this issue Nov 12, 2014 · 6 comments
Closed

Search: Track most commonly used filters and cache them #8449

jpountz opened this issue Nov 12, 2014 · 6 comments
Assignees

Comments

@jpountz
Copy link
Contributor

jpountz commented Nov 12, 2014

Whether or not a filter should be cached is currently up to the filter parsers. For instance, we cache terms filters because they are costly to build and produce a result that is already cacheable, and do not cache geo-distance filters because they are usually expected to be applied to the position of a user of the system that keeps changing all the time.

We could make the default better by tracking usage statistics of the filters and decide to cache based on the cost of the filter and how often the filter is used.

@jpountz jpountz added >enhancement v2.0.0-beta1 :Search/Search Search-related issues that do not fall into other categories labels Nov 12, 2014
@clintongormley
Copy link

+1

1 similar comment
@mikemccand
Copy link
Contributor

+1

@nik9000
Copy link
Member

nik9000 commented Nov 12, 2014

+1

  • Probably should be able to opt in/out of the behavior.
  • What about filters who's cache key is ridiculous without being overridden?
  • Might be cool to turn _cache support _cache as a number between 0 and 1 that the user can set closed to 1 to signal a priori knowledge about whether the filter is more likely to be reused. true would map to 1 (always cache) and false to 0 (never cache). I dunno. Maybe over complicated.

@jpountz jpountz self-assigned this Nov 14, 2014
@rjernst
Copy link
Member

rjernst commented Nov 14, 2014

Might be cool to turn _cache support _cache as a number between 0 and 1 that the user can set closed to 1 to signal a priori knowledge about whether the filter is more likely to be reused. true would map to 1 (always cache) and false to 0 (never cache). I dunno. Maybe over complicated.

That sounds too complicated. I think it would be better to have the default based on stats, and then have a special value, say "always" that would bypass the stats and ensure the filter is cached. Although I'm not sure how much we need the ability to override the stats based approach, since anything that should be cached should quickly rise statistically.

@clintongormley
Copy link

@rjernst @nik9000 I think the _cache setting should accept three parameters only: auto (default), true, false.

@rjernst
Copy link
Member

rjernst commented Nov 19, 2014

@clintongormley I like that, defaults are the right place for this. And I'm ok with the "auto" terminology.

jpountz added a commit to jpountz/elasticsearch that referenced this issue Dec 16, 2014
Up to now, all filters could be cached using the `_cache` flag that could be
set to `true` or `false` and the default was set depending on the type of the
`filter`. For instance, `script` filters are not cached by default while
`terms` are. For some filters, the default is more complicated and eg. date
range filters are cached unless they use `now` in a non-rounded fashion.

This commit adds a 3rd option called `auto`, which becomes the default for
all filters. So for all filters a cache wrapper will be returned, and the
decision will be made at caching time, per-segment. Here is the default logic:
 - if there is already a cache entry for this filter in the current segment,
   then return the cache entry.
 - else if the doc id set cannot iterate (eg. script filter) then do not cache.
 - else if the doc id set is already cacheable and it has been used twice or
   more in the last 1000 filters then cache it.
 - else if the filter is costly (eg. multi-term) and has been used twice or more
   in the last 1000 filters then cache it.
 - else if the doc id set is not cacheable and it has been used 5 times or more
   in the last 1000 filters, then load it into a cacheable set and cache it.
 - else return the uncached set.

So for instance geo-distance filters and script filters are going to use this
new default and are not going to be cached because of their iterators.

Similarly, date range filters are going to use this default all the time, but
it is very unlikely that those that use `now` in a not rounded fashion will get
reused so in practice they won't be cached.

`terms`, `range`, ... filters produce cacheable doc id sets with good iterators
so they will be cached as soon as they have been used twice.

Filters that don't produce cacheable doc id sets such as the `term` filter will
need to be used 5 times before being cached. This ensures that we don't spend
CPU iterating over all documents matching such filters unless we have good
evidence of reuse.

One last interesting point about this change is that it also applies to compound
filters. So if you keep on repeating the same `bool` filter with the same
underlying clauses, it will be cached on its own while up to now it used to
never be cached by default.

`_cache: true` has been changed to only cache on large segments, in order to not
pollute the cache since small segments should not be the bottleneck anyway.
However `_cache: false` still has the same semantics.

Close elastic#8449
@jpountz jpountz added :Search/Search Search-related issues that do not fall into other categories and removed :Search/Search Search-related issues that do not fall into other categories v2.0.0-beta1 release highlight >enhancement labels Dec 29, 2014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants