Percentiles aggregation.
A new metric aggregation that can compute approximate values of arbitrary
percentiles.

Close #5323
polyfractal authored and jpountz committed Mar 3, 2014
1 parent bc5e8f4 commit 91c4c77
Showing 32 changed files with 4,284 additions and 4 deletions.
2 changes: 2 additions & 0 deletions docs/reference/search/aggregations/metrics.asciidoc
@@ -13,3 +13,5 @@ include::metrics/stats-aggregation.asciidoc[]
include::metrics/extendedstats-aggregation.asciidoc[]

include::metrics/valuecount-aggregation.asciidoc[]

include::metrics/percentile-aggregation.asciidoc[]
180 changes: 180 additions & 0 deletions docs/reference/search/aggregations/metrics/percentile-aggregation.asciidoc
@@ -0,0 +1,180 @@
[[search-aggregations-metrics-percentile-aggregation]]
=== Percentiles
A `multi-value` metrics aggregation that calculates one or more percentiles
over numeric values extracted from the aggregated documents. These values
can be extracted either from specific numeric fields in the documents, or
be generated by a provided script.

.Experimental!
[IMPORTANT]
=====
This feature is marked as experimental, and may be subject to change in the
future. If you use this feature, please let us know your experience with it!
=====

Percentiles show the point at which a certain percentage of observed values
occur. For example, the 95th percentile is the value which is greater than 95%
of the observed values.

Percentiles are often used to find outliers. In normal distributions, the
0.13th and 99.87th percentiles represent three standard deviations from the
mean. Any data which falls outside three standard deviations is often considered
an anomaly.

When a range of percentiles is retrieved, the results can be used to estimate
the data distribution and determine whether the data is skewed, bimodal, etc.

Assume your data consists of website load times. The average and median
load times are not overly useful to an administrator. The max may be interesting,
but it can be easily skewed by a single slow response.

Let's look at a range of percentiles representing load time:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time" <1>
            }
        }
    }
}
--------------------------------------------------
<1> The field `load_time` must be a numeric field

By default, the `percentiles` metric will generate a range of
percentiles: `[ 1, 5, 25, 50, 75, 95, 99 ]`. The response will look like this:

[source,js]
--------------------------------------------------
{
    ...
    "aggregations": {
        "load_time_outlier": {
            "1.0": 15,
            "5.0": 20,
            "25.0": 23,
            "50.0": 25,
            "75.0": 29,
            "95.0": 60,
            "99.0": 150
        }
    }
}
--------------------------------------------------

As you can see, the aggregation will return a calculated value for each percentile
in the default range. If we assume response times are in milliseconds, it is
immediately obvious that the webpage normally loads in 15-30ms, but occasionally
spikes to 60-150ms.

Often, administrators are only interested in outliers -- the extreme percentiles.
We can specify just the percents we are interested in (requested percentiles
must be values between 0 and 100 inclusive):

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "percents" : [95, 99, 99.9] <1>
            }
        }
    }
}
--------------------------------------------------
<1> Use the `percents` parameter to specify particular percentiles to calculate



==== Script

The `percentiles` metric supports scripting. For example, if our load times
are in milliseconds but we want percentiles calculated in seconds, we could use
a script to convert them on-the-fly:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "script" : "doc['load_time'].value / timeUnit", <1>
                "params" : {
                    "timeUnit" : 1000 <2>
                }
            }
        }
    }
}
--------------------------------------------------
<1> The `field` parameter is replaced with a `script` parameter, which uses the
script to generate the values on which the percentiles are calculated
<2> Scripting supports parameterized input just like any other script

==== Percentiles are (usually) approximate

There are many different algorithms to calculate percentiles. The naive
implementation simply stores all the values in a sorted array. To find the 50th
percentile, you simply find the value that is at `my_array[count(my_array) * 0.5]`.
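
To make the naive approach concrete, here is a minimal Java sketch of exactly
that strategy (the class and method names are illustrative only -- this is not
Elasticsearch code):

[source,java]
--------------------------------------------------
import java.util.Arrays;

public class NaivePercentile {

    /** Exact percentile: sort everything, then index into the sorted array. */
    public static double percentile(double[] values, double percent) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        // e.g. the 50th percentile of 1,000 values is the element at index 500
        int index = (int) Math.min(sorted.length - 1, (long) (sorted.length * percent / 100));
        return sorted[index];
    }
}
--------------------------------------------------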

Clearly, the naive implementation does not scale -- the sorted array grows
linearly with the number of values in your dataset. To calculate percentiles
across potentially billions of values in an Elasticsearch cluster, _approximate_
percentiles are calculated.

The algorithm used by the `percentile` metric is called TDigest (introduced by
Ted Dunning in
https://github.com/tdunning/t-digest/blob/master/docs/t-digest-paper/histo.pdf[Computing Accurate Quantiles using T-Digests]).

When using this metric, there are a few guidelines to keep in mind:

- Estimation error is proportional to `q(1-q)`, which is smallest at the extremes.
This means that extreme percentiles (e.g. the 99th) are more accurate than less
extreme percentiles, such as the median.
- For small sets of values, percentiles are highly accurate (and potentially
100% accurate if the data is small enough).
- As the quantity of values in a bucket grows, the algorithm begins to approximate
the percentiles. It is effectively trading accuracy for memory savings. The
exact level of inaccuracy is difficult to generalize, since it depends on your
data distribution and the volume of data being aggregated.
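
To get a feel for the algorithm's behaviour, here is a hedged sketch against
Ted Dunning's standalone t-digest library (the `com.tdunning:t-digest`
artifact). Note that this commit actually vendors the algorithm as
`TDigestState`, so the library and its API are an assumption here, not what
Elasticsearch runs:

[source,java]
--------------------------------------------------
import com.tdunning.math.stats.TDigest;

import java.util.Random;

public class TDigestSketch {
    public static void main(String[] args) {
        TDigest digest = TDigest.createDigest(100); // compression = 100, the default
        Random random = new Random(42);
        for (int i = 0; i < 1000000; i++) {
            digest.add(random.nextGaussian());
        }
        // quantile() takes q in [0, 1]; for a standard normal distribution the
        // 99th percentile should come out near 2.33
        System.out.println(digest.quantile(0.99));
    }
}
--------------------------------------------------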

==== Compression

Approximate algorithms must balance memory utilization with estimation accuracy.
This balance can be controlled using a `compression` parameter:

[source,js]
--------------------------------------------------
{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "compression" : 200 <1>
            }
        }
    }
}
--------------------------------------------------
<1> Compression controls memory usage and approximation error

The TDigest algorithm uses a number of "nodes" to approximate percentiles -- the
more nodes available, the higher the accuracy (and the larger the memory
footprint), proportional to the volume of data. The `compression` parameter
limits the maximum number of nodes to `100 * compression`.

Therefore, by increasing the compression value, you can increase the accuracy of
your percentiles at the cost of more memory. Larger compression values also
make the algorithm slower since the underlying tree data structure grows in size,
resulting in more expensive operations. The default compression value is
`100`.

A "node" uses roughly 48 bytes of memory, so under worst-case scenarios (large amount
of data which arrives sorted and in-order) the default settings will produce a
TDigest roughly 480KB in size. In practice data tends to be more random and
the TDigest will use less memory.
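
As a quick sanity check of that worst-case figure (illustrative arithmetic
only, not code from this commit):

[source,java]
--------------------------------------------------
public class TDigestMemoryEstimate {
    public static void main(String[] args) {
        double compression = 100;                    // the default
        long maxNodes = (long) (100 * compression);  // node cap described above
        long bytes = maxNodes * 48;                  // ~48 bytes per node
        System.out.println(bytes);                   // 480000, i.e. roughly 480KB
    }
}
--------------------------------------------------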
9 changes: 9 additions & 0 deletions pom.xml
@@ -63,6 +63,12 @@
            <version>4.3.1</version>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.mahout</groupId>
            <artifactId>mahout-core</artifactId>
            <version>0.9</version>
            <scope>test</scope>
        </dependency>

        <dependency>
            <groupId>org.apache.lucene</groupId>
@@ -1195,6 +1201,9 @@
                        <exclude>src/main/java/org/elasticsearch/common/lucene/search/XFilteredQuery.java</exclude>
                        <exclude>src/main/java/org/apache/lucene/queryparser/XSimpleQueryParser.java</exclude>
                        <exclude>src/main/java/org/apache/lucene/**/X*.java</exclude>
                        <!-- t-digest -->
                        <exclude>src/main/java/org/elasticsearch/search/aggregations/metrics/percentiles/tdigest/TDigestState.java</exclude>
                        <exclude>src/test/java/org/elasticsearch/search/aggregations/metrics/GroupTree.java</exclude>
                    </excludes>
                </configuration>
<!-- We can't run by default since the package is broken with java 1.6
72 changes: 72 additions & 0 deletions src/main/java/org/elasticsearch/common/util/ArrayUtils.java
@@ -0,0 +1,72 @@
/*
* Licensed to Elasticsearch under one or more contributor
* license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright
* ownership. Elasticsearch licenses this file to you under
* the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing,
* software distributed under the License is distributed on an
* "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
* KIND, either express or implied. See the License for the
* specific language governing permissions and limitations
* under the License.
*/

package org.elasticsearch.common.util;

import java.util.Arrays;

/**
 *
 */
public class ArrayUtils {

    private ArrayUtils() {}

    /**
     * Return the index of <code>value</code> in <code>array</code>, or <tt>-1</tt> if there is no such index.
     * If there are several values that are within <code>tolerance</code> or less of <code>value</code>, this method will return the
     * index of the closest value. In case of several values being as close to <code>value</code>, there is no guarantee which index
     * will be returned.
     * Results are undefined if the array is not sorted.
     */
    public static int binarySearch(double[] array, double value, double tolerance) {
        if (array.length == 0) {
            return -1;
        }
        return binarySearch(array, 0, array.length, value, tolerance);
    }

    private static int binarySearch(double[] array, int fromIndex, int toIndex, double value, double tolerance) {
        int index = Arrays.binarySearch(array, fromIndex, toIndex, value);
        if (index < 0) {
            final int highIndex = -1 - index; // first index of a value that is > value
            final int lowIndex = highIndex - 1; // last index of a value that is < value

            double lowError = Double.POSITIVE_INFINITY;
            double highError = Double.POSITIVE_INFINITY;
            if (lowIndex >= 0) {
                lowError = value - array[lowIndex];
            }
            if (highIndex < array.length) {
                highError = array[highIndex] - value;
            }

            if (highError < lowError) {
                if (highError < tolerance) {
                    index = highIndex;
                } else {
                    index = -1; // closest value is above, but beyond tolerance: enforce the documented -1 contract
                }
            } else if (lowError < tolerance) {
                index = lowIndex;
            } else {
                index = -1;
            }
        }
        return index;
    }
}
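
A brief usage sketch of the tolerance semantics above (an editor's
illustration, not part of this commit):

import org.elasticsearch.common.util.ArrayUtils;

public class ArrayUtilsDemo {
    public static void main(String[] args) {
        double[] sorted = { 1.0, 2.0, 4.0, 8.0 };
        // 4.0 lies within 0.1 of 4.05, so the index of 4.0 is returned
        System.out.println(ArrayUtils.binarySearch(sorted, 4.05, 0.1)); // 2
        // the closest neighbours of 3.0 are both 1.0 away, beyond the 0.5 tolerance
        System.out.println(ArrayUtils.binarySearch(sorted, 3.0, 0.5)); // -1
    }
}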
7 changes: 7 additions & 0 deletions src/main/java/org/elasticsearch/common/util/BigArrays.java
@@ -171,6 +171,13 @@ public int increment(long index, int inc) {
            return array[(int) index] += inc;
        }

        @Override
        public void fill(long fromIndex, long toIndex, int value) {
            assert indexIsInt(fromIndex);
            assert indexIsInt(toIndex);
            Arrays.fill(array, (int) fromIndex, (int) toIndex, value);
        }

    }

    private static class LongArrayWrapper extends AbstractArrayWrapper implements LongArray {
17 changes: 17 additions & 0 deletions src/main/java/org/elasticsearch/common/util/BigIntArray.java
@@ -19,6 +19,7 @@

package org.elasticsearch.common.util;

import com.google.common.base.Preconditions;
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.RamUsageEstimator;
import org.elasticsearch.cache.recycler.PageCacheRecycler;
@@ -69,6 +70,22 @@ public int increment(long index, int inc) {
        return pages[pageIndex][indexInPage] += inc;
    }

    @Override
    public void fill(long fromIndex, long toIndex, int value) {
        Preconditions.checkArgument(fromIndex <= toIndex);
        final int fromPage = pageIndex(fromIndex);
        final int toPage = pageIndex(toIndex - 1);
        if (fromPage == toPage) {
            // the whole range lives on a single page
            Arrays.fill(pages[fromPage], indexInPage(fromIndex), indexInPage(toIndex - 1) + 1, value);
        } else {
            // fill the tail of the first page, all intermediate pages, then the head of the last page
            Arrays.fill(pages[fromPage], indexInPage(fromIndex), pages[fromPage].length, value);
            for (int i = fromPage + 1; i < toPage; ++i) {
                Arrays.fill(pages[i], value);
            }
            Arrays.fill(pages[toPage], 0, indexInPage(toIndex - 1) + 1, value);
        }
    }
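
To illustrate the page arithmetic above (assuming, purely for illustration,
pages of 1024 ints, i.e. pageIndex(i) = i / 1024 and indexInPage(i) = i % 1024):

// fill(1000, 3000, 7) decomposes into three Arrays.fill calls:
//   page 0: slots 1000..1023 (tail of the first page)
//   page 1: slots 0..1023    (a whole intermediate page)
//   page 2: slots 0..951     (head of the last page, since 2999 % 1024 == 951)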

    @Override
    protected int numBytesPerElement() {
        return RamUsageEstimator.NUM_BYTES_INT;
@@ -85,4 +85,4 @@ public void resize(long newSize) {
        this.size = newSize;
    }

}
}
5 changes: 5 additions & 0 deletions src/main/java/org/elasticsearch/common/util/IntArray.java
@@ -39,4 +39,9 @@ public interface IntArray extends BigArray {
     */
    public abstract int increment(long index, int inc);

    /**
     * Fill slots between <code>fromIndex</code> inclusive to <code>toIndex</code> exclusive with <code>value</code>.
     */
    public abstract void fill(long fromIndex, long toIndex, int value);

}
4 changes: 2 additions & 2 deletions src/main/java/org/elasticsearch/common/util/ObjectArray.java
@@ -27,11 +27,11 @@ public interface ObjectArray<T> extends BigArray {
    /**
     * Get an element given its index.
     */
    public abstract T get(long index);
    T get(long index);

    /**
     * Set a value at the given index and return the previous value.
     */
    public abstract T set(long index, T value);
    T set(long index, T value);

}
5 changes: 5 additions & 0 deletions src/main/java/org/elasticsearch/search/aggregations/AggregationBuilders.java
@@ -33,6 +33,7 @@
import org.elasticsearch.search.aggregations.metrics.avg.AvgBuilder;
import org.elasticsearch.search.aggregations.metrics.max.MaxBuilder;
import org.elasticsearch.search.aggregations.metrics.min.MinBuilder;
import org.elasticsearch.search.aggregations.metrics.percentiles.PercentilesBuilder;
import org.elasticsearch.search.aggregations.metrics.stats.StatsBuilder;
import org.elasticsearch.search.aggregations.metrics.stats.extended.ExtendedStatsBuilder;
import org.elasticsearch.search.aggregations.metrics.sum.SumBuilder;
@@ -121,4 +122,8 @@ public static IPv4RangeBuilder ipRange(String name) {
    public static TermsBuilder terms(String name) {
        return new TermsBuilder(name);
    }

    public static PercentilesBuilder percentiles(String name) {
        return new PercentilesBuilder(name);
    }
}
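
A hypothetical usage sketch of the new factory method from the Java client API
(assuming `PercentilesBuilder` exposes `field(...)` and `percentiles(...)`
setters; the surrounding `client` calls are the standard search API of this
era, not part of the commit):

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.search.aggregations.AggregationBuilders;

// requests the same percentiles as the JSON examples in the docs above
SearchResponse response = client.prepareSearch("index")
        .addAggregation(AggregationBuilders.percentiles("load_time_outlier")
                .field("load_time")
                .percentiles(95, 99, 99.9))
        .execute().actionGet();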
2 changes: 2 additions & 0 deletions src/main/java/org/elasticsearch/search/aggregations/AggregationModule.java
@@ -36,6 +36,7 @@
import org.elasticsearch.search.aggregations.metrics.avg.AvgParser;
import org.elasticsearch.search.aggregations.metrics.max.MaxParser;
import org.elasticsearch.search.aggregations.metrics.min.MinParser;
import org.elasticsearch.search.aggregations.metrics.percentiles.PercentilesParser;
import org.elasticsearch.search.aggregations.metrics.stats.StatsParser;
import org.elasticsearch.search.aggregations.metrics.stats.extended.ExtendedStatsParser;
import org.elasticsearch.search.aggregations.metrics.sum.SumParser;
@@ -58,6 +59,7 @@ public AggregationModule() {
        parsers.add(StatsParser.class);
        parsers.add(ExtendedStatsParser.class);
        parsers.add(ValueCountParser.class);
        parsers.add(PercentilesParser.class);

        parsers.add(GlobalParser.class);
        parsers.add(MissingParser.class);
