Filter cache: add a _cache: auto option and make it the default.
Up to now, all filters could be cached using the `_cache` flag, which could be
set to `true` or `false`, with a default that depended on the type of the
filter. For instance, `script` filters are not cached by default while `terms`
filters are. For some filters, the default is more complicated: e.g. date
range filters are cached unless they use `now` in a non-rounded fashion.

This commit adds a third option, `auto`, which becomes the default for all
filters. A cache wrapper is now returned for every filter, and the decision
whether to cache is made at caching time, per segment. Here is the default logic:
 - if there is already a cache entry for this filter in the current segment,
   then return the cache entry.
 - else if the doc id set cannot iterate (e.g. script filters), do not cache.
 - else if the doc id set is already cacheable and the filter has been used twice
   or more in the last 1000 filters, cache it.
 - else if the filter is costly (e.g. multi-term) and has been used twice or more
   in the last 1000 filters, cache it.
 - else if the doc id set is not cacheable and the filter has been used 5 times
   or more in the last 1000 filters, load it into a cacheable set and cache it.
 - else return the uncached set.
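The decision steps above can be sketched as follows. This is a simplified, hypothetical model: the class, method, and parameter names are illustrative, not the actual Lucene/Elasticsearch API, but the branch order and thresholds match the logic described above.

```java
// Simplified sketch of the per-segment caching decision described above.
// All names here are illustrative; the real logic lives in Lucene's
// UsageTrackingFilterCachingPolicy wired through AutoFilterCachingPolicy.
public class AutoCachingDecision {

    public enum Outcome { USE_CACHE_ENTRY, DO_NOT_CACHE, CACHE, LOAD_AND_CACHE, UNCACHED }

    public static Outcome decide(boolean alreadyCached,    // cache entry exists for this segment
                                 boolean canIterate,       // e.g. false for script filters
                                 boolean cacheable,        // doc id set is directly cacheable
                                 boolean costly,           // e.g. multi-term filters
                                 int usesInLast1000) {     // frequency in the recent history
        if (alreadyCached) {
            return Outcome.USE_CACHE_ENTRY;
        } else if (!canIterate) {
            return Outcome.DO_NOT_CACHE;
        } else if (cacheable && usesInLast1000 >= 2) {
            return Outcome.CACHE;
        } else if (costly && usesInLast1000 >= 2) {
            return Outcome.CACHE;
        } else if (!cacheable && usesInLast1000 >= 5) {
            // loading into a cacheable set is O(maxDoc), so demand stronger evidence of reuse
            return Outcome.LOAD_AND_CACHE;
        } else {
            return Outcome.UNCACHED;
        }
    }
}
```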

So for instance geo-distance and script filters will pick up this new default
and will not be cached, because their doc id sets cannot iterate.

Similarly, date range filters will always use this default, but those that use
`now` in a non-rounded fashion are very unlikely to be reused, so in practice
they won't be cached.

`terms`, `range`, ... filters produce cacheable doc id sets with good iterators
so they will be cached as soon as they have been used twice.

Filters that don't produce cacheable doc id sets such as the `term` filter will
need to be used 5 times before being cached. This ensures that we don't spend
CPU iterating over all documents matching such filters unless we have good
evidence of reuse.

One last interesting point about this change is that it also applies to compound
filters. So if you keep repeating the same `bool` filter with the same
underlying clauses, it will be cached on its own, while until now it was never
cached by default.

`_cache: true` has been changed to only cache on large segments, in order not to
pollute the cache, since small segments should not be the bottleneck anyway.
However `_cache: false` keeps the same semantics.
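The "only cache on large segments" behaviour boils down to a ratio check. The 1% threshold below mirrors the default of the `index.cache.filter.policy.min_segment_size_ratio` setting introduced by this commit, but the class and method names are illustrative, not the actual API:

```java
// Sketch of the "only cache on large segments" check. The 1% default mirrors
// the index.cache.filter.policy.min_segment_size_ratio setting added by this
// commit; SegmentSizeCheck and largeEnoughToCache are hypothetical names.
public class SegmentSizeCheck {
    public static boolean largeEnoughToCache(int segmentMaxDoc, int indexMaxDoc, float minRatio) {
        // a segment is eligible when it holds at least minRatio of the index's documents
        return (float) segmentMaxDoc / indexMaxDoc >= minRatio;
    }
}
```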

Close elastic#8449
jpountz committed Dec 16, 2014
1 parent 5910b17 commit ed0b9e3
Showing 81 changed files with 607 additions and 1,166 deletions.
7 changes: 6 additions & 1 deletion docs/reference/query-dsl/filters.asciidoc
@@ -42,7 +42,12 @@ The last type of filters are those working with other filters. The
cached as they basically just manipulate the internal filters.

All filters allow setting the `_cache` element on them to explicitly control
caching. It accepts three values: `true`, to cache the filter; `false`, to make
sure that the filter will not be cached; and `auto`, which is the default and
decides whether to cache the filter based on the cost of caching it and how
often the filter has been used.
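For instance, a `terms` filter with the caching behaviour spelled out explicitly (the field and values below are purely illustrative, and since `auto` is the default it normally does not need to be set):

[source,js]
--------------------------------------------------
{
    "filtered" : {
        "query" : { "match_all" : {} },
        "filter" : {
            "terms" : {
                "user" : ["kimchy", "elasticsearch"],
                "_cache" : "auto"
            }
        }
    }
}
--------------------------------------------------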

Filters also allow setting `_cache_key`, which will be used as the
caching key for that filter. This can be handy when using very large
filters (like a `terms` filter with many elements in it).

8 changes: 4 additions & 4 deletions docs/reference/query-dsl/filters/and-filter.asciidoc
@@ -33,10 +33,10 @@ filters. Can be placed within queries that accept a filter.
[float]
==== Caching

The result of the filter is not cached by default. The `_cache` can be
set to `true` in order to cache it (though usually not needed). Since
the `_cache` element requires to be set on the `and` filter itself, the
structure then changes a bit to have the filters provided within a
The result of the filter is only cached by default if there is evidence of
reuse. It is possible to opt in to caching explicitly by setting `_cache`
to `true`. Since the `_cache` element must be set on the `and` filter
itself, the structure then changes a bit to have the filters provided within a
`filters` element:

[source,js]
6 changes: 0 additions & 6 deletions docs/reference/query-dsl/filters/bool-filter.asciidoc
@@ -41,9 +41,3 @@ accept a filter.
}
--------------------------------------------------

[float]
==== Caching

The result of the `bool` filter is not cached by default (though
internal filters might be). The `_cache` can be set to `true` in order
to enable caching.
5 changes: 0 additions & 5 deletions docs/reference/query-dsl/filters/exists-filter.asciidoc
@@ -74,8 +74,3 @@ no values in the `user` field and thus would not match the `exists` filter:
{ "foo": "bar" }
--------------------------------------------------


[float]
==== Caching

The result of the filter is always cached.
6 changes: 0 additions & 6 deletions docs/reference/query-dsl/filters/missing-filter.asciidoc
@@ -130,9 +130,3 @@ When set to `false` (the default), these documents will not be included.
--

NOTE: Either `existence` or `null_value` or both must be set to `true`.


[float]
==== Caching

The result of the filter is always cached.
6 changes: 3 additions & 3 deletions docs/reference/query-dsl/filters/not-filter.asciidoc
@@ -53,9 +53,9 @@ Or, in a longer form with a `filter` element:
[float]
==== Caching

The result of the filter is not cached by default. The `_cache` can be
set to `true` in order to cache it (though usually not needed). Here is
an example:
The result of the filter is only cached if there is evidence of reuse.
The `_cache` can be set to `true` in order to cache it (though usually
not needed). Here is an example:

[source,js]
--------------------------------------------------
3 changes: 2 additions & 1 deletion docs/reference/query-dsl/filters/or-filter.asciidoc
@@ -28,7 +28,8 @@ filters. Can be placed within queries that accept a filter.
[float]
==== Caching

The result of the filter is not cached by default. The `_cache` can be
The result of the filter is only cached by default if there is evidence
of reuse. The `_cache` can be
set to `true` in order to cache it (though usually not needed). Since
the `_cache` element requires to be set on the `or` filter itself, the
structure then changes a bit to have the filters provided within a
6 changes: 3 additions & 3 deletions docs/reference/query-dsl/filters/prefix-filter.asciidoc
@@ -19,8 +19,8 @@ a filter. Can be placed within queries that accept a filter.
[float]
==== Caching

The result of the filter is cached by default. The `_cache` can be set
to `false` in order not to cache it. Here is an example:
The result of the filter is cached by default if there is evidence of reuse.
The `_cache` can be set to `true` in order to cache it. Here is an example:

[source,js]
--------------------------------------------------
@@ -29,7 +29,7 @@ to `false` in order not to cache it. Here is an example:
"filter" : {
"prefix" : {
"user" : "ki",
"_cache" : false
"_cache" : true
}
}
}
4 changes: 3 additions & 1 deletion docs/reference/query-dsl/filters/query-filter.asciidoc
@@ -22,7 +22,9 @@ that accept a filter.
[float]
==== Caching

The result of the filter is not cached by default. The `_cache` can be
The result of the filter is only cached by default if there is evidence of reuse.

The `_cache` can be
set to `true` to cache the *result* of the filter. This is handy when
the same query is used on several (many) other queries. Note, the
process of caching the first execution is higher when not caching (since
6 changes: 3 additions & 3 deletions docs/reference/query-dsl/filters/range-filter.asciidoc
@@ -98,8 +98,8 @@ you're already aggregating or sorting by.
[float]
==== Caching

The result of the filter is only automatically cached by default if the `execution` is set to `index`. The
The result of the filter is only cached by default if there is evidence of reuse. The
`_cache` can be set to `false` to turn it off.

If the `now` date math expression is used without rounding then a range filter will never be cached even if `_cache` is
set to `true`. Also any filter that wraps this filter will never be cached.
Having the `now` expression used without rounding will make the filter unlikely to be
cached since reuse is very unlikely.
4 changes: 2 additions & 2 deletions docs/reference/query-dsl/filters/term-filter.asciidoc
@@ -20,8 +20,8 @@ accept a filter, for example:
[float]
==== Caching

The result of the filter is automatically cached by default. The
`_cache` can be set to `false` to turn it off. Here is an example:
The result of the filter is only cached by default if there is evidence of reuse.
The `_cache` can be set to `false` to turn it off. Here is an example:

[source,js]
--------------------------------------------------
5 changes: 3 additions & 2 deletions docs/reference/query-dsl/filters/terms-filter.asciidoc
@@ -86,8 +86,9 @@ For example:
[float]
==== Caching

The result of the filter is automatically cached by default. The
`_cache` can be set to `false` to turn it off.
The result of the filter is cached if there is evidence of reuse. It is
possible to enable caching explicitly by setting `_cache` to `true` and
to disable caching by setting `_cache` to `false`.

[float]
==== Terms lookup mechanism
@@ -0,0 +1,103 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.index.cache.filter;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.search.Filter;
import org.apache.lucene.search.FilterCachingPolicy;
import org.apache.lucene.search.UsageTrackingFilterCachingPolicy;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.lucene.docset.DocIdSets;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.AbstractIndexComponent;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.settings.IndexSettings;

import java.io.IOException;

/**
* This class is a wrapper around {@link UsageTrackingFilterCachingPolicy}
* which wires parameters through index settings and makes sure to not
* cache {@link DocIdSet}s which have a {@link DocIdSets#isBroken(DocIdSetIterator) broken}
* iterator.
*/
public class AutoFilterCachingPolicy extends AbstractIndexComponent implements FilterCachingPolicy {

// These settings don't have the purpose of being documented. They are only here so that
// if anyone ever hits an issue with elasticsearch that is due to the value of one of these
// parameters, then it might be possible to temporarily work around the issue without having
// to wait for a new release

// number of times a costly filter (e.g. multi-term) should be seen before its doc id sets are cached
public static final String MIN_FREQUENCY_COSTLY = "index.cache.filter.policy.min_frequency.costly";
// number of times a filter that produces cacheable doc id sets should be seen before they are cached
public static final String MIN_FREQUENCY_CACHEABLE = "index.cache.filter.policy.min_frequency.cacheable";
// same for filters that produce doc id sets that are not directly cacheable
public static final String MIN_FREQUENCY_OTHER = "index.cache.filter.policy.min_frequency.other";
// minimum ratio between the size of a segment and the size of the index for filters to be cached on this segment
public static final String MIN_SEGMENT_SIZE_RATIO = "index.cache.filter.policy.min_segment_size_ratio";
// size of the history to keep for filters. A filter will be cached if it has been seen more than a given
// number of times (depending on the filter, the segment and the produced DocIdSet) among the
// ${history_size} most recently used filters
public static final String HISTORY_SIZE = "index.cache.filter.policy.history_size";

public static final Settings AGGRESSIVE_CACHING_SETTINGS = ImmutableSettings.builder()
.put(MIN_FREQUENCY_CACHEABLE, 1)
.put(MIN_FREQUENCY_COSTLY, 1)
.put(MIN_FREQUENCY_OTHER, 1)
.put(MIN_SEGMENT_SIZE_RATIO, 0.000000001f)
.build();

private final FilterCachingPolicy in;

@Inject
public AutoFilterCachingPolicy(Index index, @IndexSettings Settings indexSettings) {
super(index, indexSettings);
final int historySize = indexSettings.getAsInt(HISTORY_SIZE, 1000);
// cache aggressively filters that produce sets that are already cacheable,
// ie. if the filter has been used twice or more among the 1000 most recently
// used filters
final int minFrequencyCacheable = indexSettings.getAsInt(MIN_FREQUENCY_CACHEABLE, 2);
// cache aggressively filters whose getDocIdSet method is costly
final int minFrequencyCostly = indexSettings.getAsInt(MIN_FREQUENCY_COSTLY, 2);
// be a bit less aggressive when the produced doc id sets are not cacheable
final int minFrequencyOther = indexSettings.getAsInt(MIN_FREQUENCY_OTHER, 5);
final float minSegmentSizeRatio = indexSettings.getAsFloat(MIN_SEGMENT_SIZE_RATIO, 0.01f);
in = new UsageTrackingFilterCachingPolicy(minSegmentSizeRatio, historySize, minFrequencyCostly, minFrequencyCacheable, minFrequencyOther);
}

@Override
public void onCache(Filter filter) {
in.onCache(filter);
}

@Override
public boolean shouldCache(Filter filter, LeafReaderContext context, DocIdSet set) throws IOException {
if (set != null && DocIdSets.isBroken(set.iterator())) {
// O(maxDoc) to cache, no thanks.
return false;
}

return in.shouldCache(filter, context, set);
}

}
@@ -20,7 +20,10 @@
package org.elasticsearch.index.cache.filter;

import org.apache.lucene.search.Filter;
import org.apache.lucene.search.FilterCachingPolicy;
import org.elasticsearch.common.Nullable;
import org.elasticsearch.common.component.CloseableComponent;
import org.elasticsearch.common.lucene.HashedBytesRef;
import org.elasticsearch.index.IndexComponent;
import org.elasticsearch.index.IndexService;

@@ -44,7 +47,7 @@ public EntriesStats(long sizeInBytes, long count) {

String type();

Filter cache(Filter filterToCache);
Filter cache(Filter filterToCache, @Nullable HashedBytesRef cacheKey, FilterCachingPolicy policy);

void clear(Object reader);

@@ -19,6 +19,7 @@

package org.elasticsearch.index.cache.filter;

import org.apache.lucene.search.FilterCachingPolicy;
import org.elasticsearch.common.inject.AbstractModule;
import org.elasticsearch.common.inject.Scopes;
import org.elasticsearch.common.settings.Settings;
@@ -44,5 +45,10 @@ protected void configure() {
bind(FilterCache.class)
.to(settings.getAsClass(FilterCacheSettings.FILTER_CACHE_TYPE, WeightedFilterCache.class, "org.elasticsearch.index.cache.filter.", "FilterCache"))
.in(Scopes.SINGLETON);
// the filter cache is a node-level thing, however we want the most popular filters
// to be computed on a per-index basis, that is why we don't use the SINGLETON
// scope below
bind(FilterCachingPolicy.class)
.to(AutoFilterCachingPolicy.class);
}
}
@@ -20,12 +20,15 @@
package org.elasticsearch.index.cache.filter.none;

import org.apache.lucene.search.Filter;
import org.apache.lucene.search.FilterCachingPolicy;
import org.elasticsearch.common.Nullable;
import org.elasticsearch.common.inject.Inject;
import org.elasticsearch.common.lucene.HashedBytesRef;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.index.AbstractIndexComponent;
import org.elasticsearch.index.Index;
import org.elasticsearch.index.cache.filter.FilterCache;
import org.elasticsearch.index.IndexService;
import org.elasticsearch.index.cache.filter.FilterCache;
import org.elasticsearch.index.settings.IndexSettings;

/**
@@ -55,7 +58,7 @@ public void close() {
}

@Override
public Filter cache(Filter filterToCache) {
public Filter cache(Filter filterToCache, @Nullable HashedBytesRef cacheKey, FilterCachingPolicy policy) {
return filterToCache;
}

@@ -73,4 +76,4 @@ public void clear(String reason, String[] keys) {
public void clear(Object reader) {
// nothing to do here
}
}
}
