Filter cache: add a _cache: auto option and make it the default.
Up to now, all filters could be cached using the `_cache` flag, which could be
set to `true` or `false`, with a default that depended on the type of the
filter. For instance, `script` filters are not cached by default while `terms`
filters are. For some filters, the default is more complicated: e.g. date
range filters are cached unless they use `now` in a non-rounded fashion.

This commit adds a third option, `auto`, which becomes the default for all
filters. A cache wrapper is now returned for every filter, and the decision
whether to cache is made at caching time, per segment. Here is the default logic:
 - if there is already a cache entry for this filter in the current segment,
   then return the cache entry.
 - else if the doc id set cannot iterate (e.g. script filters), do not cache.
 - else if the doc id set is already cacheable and the filter has been used twice
   or more in the last 1000 filters, cache it.
 - else if the filter is costly (e.g. multi-term) and has been used twice or more
   in the last 1000 filters, cache it.
 - else if the doc id set is not cacheable and the filter has been used 5 times
   or more in the last 1000 filters, load it into a cacheable set and cache it.
 - else return the uncached set.
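The decision steps above can be sketched as follows. This is a simplified, hypothetical model: the class, method, and parameter names are illustrative, not the actual Lucene/Elasticsearch API, but the branch order and thresholds match the logic described above.

```java
// Simplified sketch of the per-segment caching decision described above.
// All names here are illustrative; the real logic lives in Lucene's
// UsageTrackingFilterCachingPolicy wired through AutoFilterCachingPolicy.
public class AutoCachingDecision {

    public enum Outcome { USE_CACHE_ENTRY, DO_NOT_CACHE, CACHE, LOAD_AND_CACHE, UNCACHED }

    public static Outcome decide(boolean alreadyCached,    // cache entry exists for this segment
                                 boolean canIterate,       // e.g. false for script filters
                                 boolean cacheable,        // doc id set is directly cacheable
                                 boolean costly,           // e.g. multi-term filters
                                 int usesInLast1000) {     // frequency in the recent history
        if (alreadyCached) {
            return Outcome.USE_CACHE_ENTRY;
        } else if (!canIterate) {
            return Outcome.DO_NOT_CACHE;
        } else if (cacheable && usesInLast1000 >= 2) {
            return Outcome.CACHE;
        } else if (costly && usesInLast1000 >= 2) {
            return Outcome.CACHE;
        } else if (!cacheable && usesInLast1000 >= 5) {
            // loading into a cacheable set is O(maxDoc), so demand stronger evidence of reuse
            return Outcome.LOAD_AND_CACHE;
        } else {
            return Outcome.UNCACHED;
        }
    }
}
```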

So for instance geo-distance and script filters will pick up this new default
and will not be cached, because their doc id sets cannot iterate.

Similarly, date range filters will always use this default, but those that use
`now` in a non-rounded fashion are very unlikely to be reused, so in practice
they won't be cached.

`terms`, `range`, ... filters produce cacheable doc id sets with good iterators
so they will be cached as soon as they have been used twice.

Filters that don't produce cacheable doc id sets such as the `term` filter will
need to be used 5 times before being cached. This ensures that we don't spend
CPU iterating over all documents matching such filters unless we have good
evidence of reuse.

One last interesting point about this change is that it also applies to compound
filters. So if you keep repeating the same `bool` filter with the same
underlying clauses, it will be cached on its own, while until now it was never
cached by default.

`_cache: true` has been changed to only cache on large segments, in order not to
pollute the cache, since small segments should not be the bottleneck anyway.
However `_cache: false` keeps the same semantics.
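The "only cache on large segments" behaviour boils down to a ratio check. The 1% threshold below mirrors the default of the `index.cache.filter.policy.min_segment_size_ratio` setting introduced by this commit, but the class and method names are illustrative, not the actual API:

```java
// Sketch of the "only cache on large segments" check. The 1% default mirrors
// the index.cache.filter.policy.min_segment_size_ratio setting added by this
// commit; SegmentSizeCheck and largeEnoughToCache are hypothetical names.
public class SegmentSizeCheck {
    public static boolean largeEnoughToCache(int segmentMaxDoc, int indexMaxDoc, float minRatio) {
        // a segment is eligible when it holds at least minRatio of the index's documents
        return (float) segmentMaxDoc / indexMaxDoc >= minRatio;
    }
}
```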

Close elastic#8449
jpountz committed Dec 16, 2014
1 parent 5910b17 commit ed0b9e3
Showing 81 changed files with 607 additions and 1,166 deletions.
7 changes: 6 additions & 1 deletion docs/reference/query-dsl/filters.asciidoc
@@ -42,7 +42,12 @@ The last type of filters are those working with other filters. The
cached as they basically just manipulate the internal filters.

All filters allow setting the `_cache` element on them to explicitly control
caching. It accepts three values: `true`, to cache the filter; `false`, to make
sure that the filter will not be cached; and `auto`, which is the default and
decides whether to cache the filter based on the cost of caching it and how
often the filter has been used.
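For instance, a `terms` filter with the caching behaviour spelled out explicitly (the field and values below are purely illustrative, and since `auto` is the default it normally does not need to be set):

[source,js]
--------------------------------------------------
{
    "filtered" : {
        "query" : { "match_all" : {} },
        "filter" : {
            "terms" : {
                "user" : ["kimchy", "elasticsearch"],
                "_cache" : "auto"
            }
        }
    }
}
--------------------------------------------------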

Filters also allow setting `_cache_key`, which will be used as the
caching key for that filter. This can be handy when using very large
filters (like a `terms` filter with many elements in it).

8 changes: 4 additions & 4 deletions docs/reference/query-dsl/filters/and-filter.asciidoc
@@ -33,10 +33,10 @@ filters. Can be placed within queries that accept a filter.
[float]
==== Caching

The result of the filter is not cached by default. The `_cache` can be
set to `true` in order to cache it (though usually not needed). Since
the `_cache` element requires to be set on the `and` filter itself, the
structure then changes a bit to have the filters provided within a
The result of the filter is only cached by default if there is evidence of
reuse. It is possible to opt in to caching explicitly by setting `_cache`
to `true`. Since the `_cache` element must be set on the `and` filter
itself, the structure then changes a bit to have the filters provided within a
`filters` element:

[source,js]
6 changes: 0 additions & 6 deletions docs/reference/query-dsl/filters/bool-filter.asciidoc
@@ -41,9 +41,3 @@ accept a filter.
}
--------------------------------------------------

[float]
==== Caching

The result of the `bool` filter is not cached by default (though
internal filters might be). The `_cache` can be set to `true` in order
to enable caching.
5 changes: 0 additions & 5 deletions docs/reference/query-dsl/filters/exists-filter.asciidoc
@@ -74,8 +74,3 @@ no values in the `user` field and thus would not match the `exists` filter:
{ "foo": "bar" }
--------------------------------------------------


[float]
==== Caching

The result of the filter is always cached.
6 changes: 0 additions & 6 deletions docs/reference/query-dsl/filters/missing-filter.asciidoc
@@ -130,9 +130,3 @@ When set to `false` (the default), these documents will not be included.
--

NOTE: Either `existence` or `null_value` or both must be set to `true`.


[float]
==== Caching

The result of the filter is always cached.
6 changes: 3 additions & 3 deletions docs/reference/query-dsl/filters/not-filter.asciidoc
@@ -53,9 +53,9 @@ Or, in a longer form with a `filter` element:
[float]
==== Caching

The result of the filter is not cached by default. The `_cache` can be
set to `true` in order to cache it (though usually not needed). Here is
an example:
The result of the filter is only cached if there is evidence of reuse.
The `_cache` can be set to `true` in order to cache it (though usually
not needed). Here is an example:

[source,js]
--------------------------------------------------
3 changes: 2 additions & 1 deletion docs/reference/query-dsl/filters/or-filter.asciidoc
@@ -28,7 +28,8 @@ filters. Can be placed within queries that accept a filter.
[float]
==== Caching

The result of the filter is not cached by default. The `_cache` can be
The result of the filter is only cached by default if there is evidence
of reuse. The `_cache` can be
set to `true` in order to cache it (though usually not needed). Since
the `_cache` element requires to be set on the `or` filter itself, the
structure then changes a bit to have the filters provided within a
6 changes: 3 additions & 3 deletions docs/reference/query-dsl/filters/prefix-filter.asciidoc
@@ -19,8 +19,8 @@ a filter. Can be placed within queries that accept a filter.
[float]
==== Caching

The result of the filter is cached by default. The `_cache` can be set
to `false` in order not to cache it. Here is an example:
The result of the filter is cached by default if there is evidence of reuse.
The `_cache` can be set to `true` in order to cache it. Here is an example:

[source,js]
--------------------------------------------------
@@ -29,7 +29,7 @@ to `false` in order not to cache it. Here is an example:
"filter" : {
"prefix" : {
"user" : "ki",
"_cache" : false
"_cache" : true
}
}
}
4 changes: 3 additions & 1 deletion docs/reference/query-dsl/filters/query-filter.asciidoc
@@ -22,7 +22,9 @@ that accept a filter.
[float]
==== Caching

The result of the filter is not cached by default. The `_cache` can be
The result of the filter is only cached by default if there is evidence of reuse.

The `_cache` can be
set to `true` to cache the *result* of the filter. This is handy when
the same query is used on several (many) other queries. Note, the
process of caching the first execution is higher when not caching (since
6 changes: 3 additions & 3 deletions docs/reference/query-dsl/filters/range-filter.asciidoc
@@ -98,8 +98,8 @@ you're already aggregating or sorting by.
[float]
==== Caching

The result of the filter is only automatically cached by default if the `execution` is set to `index`. The
The result of the filter is only cached by default if there is evidence of reuse. The
`_cache` can be set to `false` to turn it off.

If the `now` date math expression is used without rounding then a range filter will never be cached even if `_cache` is
set to `true`. Also any filter that wraps this filter will never be cached.
Having the `now` expression used without rounding will make the filter unlikely to be
cached since reuse is very unlikely.
4 changes: 2 additions & 2 deletions docs/reference/query-dsl/filters/term-filter.asciidoc
@@ -20,8 +20,8 @@ accept a filter, for example:
[float]
==== Caching

The result of the filter is automatically cached by default. The
`_cache` can be set to `false` to turn it off. Here is an example:
The result of the filter is only cached by default if there is evidence of reuse.
The `_cache` can be set to `false` to turn it off. Here is an example:

[source,js]
--------------------------------------------------
5 changes: 3 additions & 2 deletions docs/reference/query-dsl/filters/terms-filter.asciidoc
@@ -86,8 +86,9 @@ For example:
[float]
==== Caching

The result of the filter is automatically cached by default. The
`_cache` can be set to `false` to turn it off.
The result of the filter is cached if there is evidence of reuse. It is
possible to enable caching explicitly by setting `_cache` to `true` and
to disable caching by setting `_cache` to `false`.

[float]
==== Terms lookup mechanism
@@ -0,0 +1,103 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.index.cache.filter;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.FilterCachingPolicy;
import org.apache.lucene.search.UsageTrackingFilterCachingPolicy;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.lucene.docset.DocIdSets;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.AbstractIndexComponent;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;

import java.io.IOException;

/**
* This class is a wrapper around {@link UsageTrackingFilterCachingPolicy}
* which wires parameters through index settings and makes sure to not
* cache {@link DocIdSet}s which have a {@link DocIdSets#isBroken(DocIdSetIterator) broken}
* iterator.
*/
public class AutoFilterCachingPolicy extends AbstractIndexComponent implements FilterCachingPolicy {

// These settings don't have the purpose of being documented. They are only here so that
// if anyone ever hits an issue with elasticsearch that is due to the value of one of these
// parameters, then it might be possible to temporarily work around the issue without having
// to wait for a new release

// number of times a costly filter (e.g. multi-term) should be seen before its doc id sets are cached
public static final String MIN_FREQUENCY_COSTLY = "index.cache.filter.policy.min_frequency.costly";
// number of times a filter that produces cacheable doc id sets should be seen before they are cached
public static final String MIN_FREQUENCY_CACHEABLE = "index.cache.filter.policy.min_frequency.cacheable";
// same for filters that produce doc id sets that are not directly cacheable
public static final String MIN_FREQUENCY_OTHER = "index.cache.filter.policy.min_frequency.other";
// minimum ratio between the size of a segment and the size of the index for filters to be cached on this segment
public static final String MIN_SEGMENT_SIZE_RATIO = "index.cache.filter.policy.min_segment_size_ratio";
// size of the history to keep for filters. A filter will be cached if it has been seen more than a given
// number of times (depending on the filter, the segment and the produced DocIdSet) among the
// ${history_size} most recently used filters
public static final String HISTORY_SIZE = "index.cache.filter.policy.history_size";

public static final Settings AGGRESSIVE_CACHING_SETTINGS = ImmutableSettings.builder()
.put(MIN_FREQUENCY_CACHEABLE, 1)
.put(MIN_FREQUENCY_COSTLY, 1)
.put(MIN_FREQUENCY_OTHER, 1)
.put(MIN_SEGMENT_SIZE_RATIO, 0.000000001f)
.build();

private final FilterCachingPolicy in;

@Inject
public AutoFilterCachingPolicy(Index index, @IndexSettings Settings indexSettings) {
super(index, indexSettings);
final int historySize = indexSettings.getAsInt(HISTORY_SIZE, 1000);
// cache aggressively filters that produce sets that are already cacheable,
// ie. if the filter has been used twice or more among the 1000 most recently
// used filters
final int minFrequencyCacheable = indexSettings.getAsInt(MIN_FREQUENCY_CACHEABLE, 2);
// cache aggressively filters whose getDocIdSet method is costly
final int minFrequencyCostly = indexSettings.getAsInt(MIN_FREQUENCY_COSTLY, 2);
// be a bit less aggressive when the produced doc id sets are not cacheable
final int minFrequencyOther = indexSettings.getAsInt(MIN_FREQUENCY_OTHER, 5);
final float minSegmentSizeRatio = indexSettings.getAsFloat(MIN_SEGMENT_SIZE_RATIO, 0.01f);
in = new UsageTrackingFilterCachingPolicy(minSegmentSizeRatio, historySize, minFrequencyCostly, minFrequencyCacheable, minFrequencyOther);
}

@Override
public void onCache(Filter filter) {
in.onCache(filter);
}

@Override
public boolean shouldCache(Filter filter, LeafReaderContext context, DocIdSet set) throws IOException {
if (set != null && DocIdSets.isBroken(set.iterator())) {
// O(maxDoc) to cache, no thanks.
return false;
}

return in.shouldCache(filter, context, set);
}

}
@@ -20,7 +20,10 @@
package org.elasticsearch.index.cache.filter;

import org.apache.lucene.search.Filter;
import org.apache.lucene.search.FilterCachingPolicy;
import org.elasticsearch.common.Nullable;
import org.elasticsearch.common.component.CloseableComponent;
import org.elasticsearch.common.lucene.HashedBytesRef;
import org.elasticsearch.index.IndexComponent;
import org.elasticsearch.index.IndexService;

@@ -44,7 +47,7 @@ public EntriesStats(long sizeInBytes, long count) {

String type();

Filter cache(Filter filterToCache);
Filter cache(Filter filterToCache, @Nullable HashedBytesRef cacheKey, FilterCachingPolicy policy);

void clear(Object reader);

@@ -19,6 +19,7 @@

package org.elasticsearch.index.cache.filter;

import org.apache.lucene.search.FilterCachingPolicy;
import org.elasticsearch.common.inject.AbstractModule;
import org.elasticsearch.common.inject.Scopes;
import org.elasticsearch.common.settings.Settings;
@@ -44,5 +45,10 @@ protected void configure() {
bind(FilterCache.class)
.to(settings.getAsClass(FilterCacheSettings.FILTER_CACHE_TYPE, WeightedFilterCache.class, "org.elasticsearch.index.cache.filter.", "FilterCache"))
.in(Scopes.SINGLETON);
// the filter cache is a node-level thing, however we want the most popular filters
// to be computed on a per-index basis, that is why we don't use the SINGLETON
// scope below
bind(FilterCachingPolicy.class)
.to(AutoFilterCachingPolicy.class);
}
}
@@ -20,12 +20,15 @@
package org.elasticsearch.index.cache.filter.none;

import org.apache.lucene.search.Filter;
import org.apache.lucene.search.FilterCachingPolicy;
import org.elasticsearch.common.Nullable;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.lucene.HashedBytesRef;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.AbstractIndexComponent;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.cache.filter.FilterCache;
import org.elasticsearch.index.IndexService;
import org.elasticsearch.index.cache.filter.FilterCache;
import org.elasticsearch.index.settings.IndexSettings;

/**
@@ -55,7 +58,7 @@ public void close() {
}

@Override
public Filter cache(Filter filterToCache) {
public Filter cache(Filter filterToCache, @Nullable HashedBytesRef cacheKey, FilterCachingPolicy policy) {
return filterToCache;
}

@@ -73,4 +76,4 @@ public void clear(String reason, String[] keys) {
public void clear(Object reader) {
// nothing to do here
}
}
}
