Aggregations: Add standard error bounds to extended_stats #9389

polyfractal · 2015-01-22T23:04:43Z

This PR adds configurable upper/lower standard deviation bounds to extended_stats aggregation, as requested in #9356:

Syntax:

{
    "aggs": {
        "my_stats": {
            "extended_stats": {
                "field": "price",
                "sigma": 2
            }
        }
    }
}

sigma: controls how many standard deviations above/below the mean are reported. The default is 2 standard deviations. Accepts non-negative numbers.

Response

"aggregations": {
   "my_stats": {
      "count": 8,
      "min": 10,
      "max": 1000,
      "avg": 191.25,
      "sum": 1530,
      "sum_of_squares": 1047500,
      "variance": 94360.9375,
      "std_deviation": 307.1822545330378,
      "std_deviation_bounds": {
         "upper": 805.6145090660756,
         "lower": -423.1145090660756
      }
   }
}
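
The derived values in the response above can be reproduced from `count`, `sum`, and `sum_of_squares`. The following is a minimal sketch, not the PR's implementation: the class and method names are hypothetical, and it assumes the population variance formula, which is what the numbers in the example response are consistent with.

```java
// Sketch only: StdDevBoundsSketch and its methods are hypothetical names,
// not classes from the Elasticsearch code base.
final class StdDevBoundsSketch {

    /** Population variance computed from count, sum and sum of squares. */
    static double variance(long count, double sum, double sumOfSquares) {
        double mean = sum / count;
        return sumOfSquares / count - mean * mean;
    }

    /** Returns {lower, upper}, i.e. mean -/+ sigma * standard deviation. */
    static double[] bounds(long count, double sum, double sumOfSquares, double sigma) {
        if (sigma < 0) {
            throw new IllegalArgumentException("sigma must be non-negative");
        }
        double mean = sum / count;
        double std = Math.sqrt(variance(count, sum, sumOfSquares));
        return new double[] { mean - sigma * std, mean + sigma * std };
    }

    public static void main(String[] args) {
        // Values taken from the example response: count=8, sum=1530, sum_of_squares=1047500.
        double[] b = bounds(8, 1530, 1047500, 2);
        System.out.println("variance = " + variance(8, 1530, 1047500));
        System.out.println("lower = " + b[0] + ", upper = " + b[1]);
    }
}
```

With `sigma` set to 0 both bounds collapse to the mean, which is why non-negative values are the natural input range.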

Note: We talked about adding std_error and confidence intervals, but that introduces a set of restrictions on your data (sampled, normally distributed). We decided it would be better to investigate a new agg that deals with normally distributed data, so that users better understand the implications of those restrictions.

colings86 · 2015-01-23T16:31:08Z

...va/org/elasticsearch/search/aggregations/metrics/stats/extended/ExtendedStatsAggregator.java

@@ -167,19 +198,25 @@ public void doClose() {

    public static class Factory extends ValuesSourceAggregatorFactory.LeafOnly<ValuesSource.Numeric> {

-        public Factory(String name, ValuesSourceConfig<ValuesSource.Numeric> valuesSourceConfig) {
+        private double sigma = 2;


Usually we leave these to be set in the constructor, so that the Parser class takes care of setting default values. Also, these variables should probably be final.
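
The suggestion can be sketched as follows. This is a simplified illustration, not the actual classes: `Factory` and `Parser` here are hypothetical stand-ins that keep only the `sigma` handling, and `DEFAULT_SIGMA` is an assumed name.

```java
// Sketch only: simplified, hypothetical stand-ins for the Parser and Factory in the diff.
final class SigmaFactorySketch {

    /** Holds configuration only; every field is final and set by the constructor. */
    static final class Factory {
        private final String name;
        private final double sigma;

        Factory(String name, double sigma) {
            this.name = name;
            this.sigma = sigma;
        }

        String name() { return name; }
        double sigma() { return sigma; }
    }

    /** The single place where the default value lives. */
    static final class Parser {
        static final double DEFAULT_SIGMA = 2.0;

        /** requestedSigma is null when the request omits the parameter. */
        static Factory parse(String name, Double requestedSigma) {
            double sigma = requestedSigma == null ? DEFAULT_SIGMA : requestedSigma;
            if (sigma < 0) {
                throw new IllegalArgumentException("sigma must be non-negative");
            }
            return new Factory(name, sigma);
        }
    }

    public static void main(String[] args) {
        System.out.println(Parser.parse("my_stats", null).sigma());
        System.out.println(Parser.parse("my_stats", 3.0).sigma());
    }
}
```

Keeping the default in the parser means the factory never has to distinguish between "not set" and "set to the default".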

colings86 · 2015-01-23T16:44:07Z

@polyfractal left some comments. Also a few other general points:

I think the documentation is really important for this since there are some implementation details that need to be explained, such as:
- What we mean by standard deviation bounds and confidence intervals
- How to change the bounds in the request
- Limits of the confidence level (what are the maximum and minimum confidence levels we support?)

I also wonder if there is a better way of testing the bounds on the confidence intervals without recreating the formula in the test, as this is hard to maintain if we improve the method in the future.
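
One possible way to test bounds without recomputing them from the raw data is to check relationships between the values the aggregation returns. The sketch below is an assumption about how such a check could look, not what the PR ended up doing; the class and method names are hypothetical.

```java
// Sketch only: checks relationships between returned fields instead of
// recomputing the statistics from the indexed documents.
final class BoundsInvariantsSketch {

    static boolean boundsAreConsistent(double avg, double std,
                                       double lower, double upper, double sigma) {
        // Relative tolerance scaled to the magnitude of the values involved.
        double tol = 1e-9 * Math.max(1.0, Math.abs(avg) + sigma * std);
        boolean centred = Math.abs((upper + lower) / 2 - avg) <= tol;   // symmetric about the mean
        boolean width = Math.abs((upper - lower) - 2 * sigma * std) <= tol; // width scales with sigma
        boolean ordered = lower <= upper;
        return centred && width && ordered;
    }

    public static void main(String[] args) {
        // Values from the example response in the PR description (sigma = 2).
        System.out.println(boundsAreConsistent(
                191.25, 307.1822545330378, -423.1145090660756, 805.6145090660756, 2));
    }
}
```

These checks still encode the definition of the bounds, but they stay valid if the way `std_deviation` itself is computed changes later.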

colings86 · 2015-01-26T09:42:24Z

src/main/java/org/elasticsearch/search/aggregations/metrics/stats/extended/ZTable.java

+
+    public static double scoreFromAlpha(double alpha) {
+        if (alpha >= 0.99994) {
+            return 3.9;


The table above has values up to 0.99997, so should the limit of alpha here not be 0.99997? Also, I wonder if it should throw an exception above this value, since it is outside the bounds of what we can calculate, rather than mapping every value of alpha above it to a Z value of 3.9?

colings86 · 2015-01-26T09:43:22Z

@polyfractal left a small comment, but I think it's almost there.

polyfractal · 2015-01-26T20:26:05Z

Fixed. Will start writing docs and push those soon too.

colings86 · 2015-01-27T10:03:37Z

@polyfractal I reviewed your latest two commits, but I accidentally commented on the commits instead of the PR.

Comments are really minor, otherwise it LGTM

jpountz · 2015-01-27T10:59:38Z

I see the issue tagged with 1.4.3 although it looks like a feature rather than a bugfix? So I think it should only go to master and 1.x?

jpountz · 2015-01-27T11:07:11Z

...n/java/org/elasticsearch/search/aggregations/metrics/stats/extended/ExtendedStatsParser.java

+                    sigma = parser.intValue();
+                } else if (CONFIDENCE_INTERVAL.match(currentFieldName)) {
+                    confidenceInterval = parser.doubleValue();
+                    if (confidenceInterval < 0 || confidenceInterval > 0.99994) {


interesting, what is the reason for this funny upper bound? :)

Oh I see, it's the ZTable stuff. Sorry for the noise :)

The confidence interval is worked out using a Z-table. The formula for the value of Z is an integral with no closed-form solution, so the values are computed numerically. We have a Z-table with values for confidence from 0 to 0.99994, so that is simply where our Z-table stops. Since Z grows without bound as the confidence approaches 1, we have to stop somewhere before 1, and this value should cover almost all use cases. Does that make sense?

Sorry, I added this comment before I saw your reply :)
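
For reference, the integral in question is the standard normal cumulative distribution function. The notation below is the textbook definition rather than anything taken from the PR; a Z-table stores precomputed pairs of $z$ and $\Phi(z)$, and looking up a Z-score means inverting that mapping.

```latex
\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-t^{2}/2} \, dt,
\qquad
z = \Phi^{-1}(\phi) \quad \text{for } 0.5 \le \phi < 1 .
```

Because $\Phi(z) \to 1$ only as $z \to \infty$, any finite table has to stop at some $\phi$ strictly below 1.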

jpountz · 2015-01-27T13:32:59Z

I'm wondering if we should allow for reporting several confidence intervals (e.g. the way we can report several percentiles in other aggs)?

colings86 · 2015-01-27T13:41:00Z

If we decide to do that, I would vote for removing the confidence interval from the extended_stats aggregation and putting it in its own aggregation, since the extended_stats aggregation should not become bloated with every statistical measure we can think of. I think the same applies to the std_deviation interval too. It's not going to be very user friendly if we try to make the extended_stats aggregation do everything.

As a small aside, I was talking to @polyfractal the other day about this and suggested that if we add much more to the extended_stats aggregation, we should split the functionality into an aggregation for each data distribution type. What we have now is really only applicable to normally distributed data, and we are already getting into the realm where that might/could/should be split into sampled normal data and population normal data aggregations, but this would also open us up to providing statistical measures for other data distributions.

jpountz · 2015-01-27T13:52:17Z

src/main/java/org/elasticsearch/search/aggregations/metrics/stats/extended/ZTable.java

+            throw new ElasticsearchIllegalArgumentException("Z-Table has maximum resolution of phi=0.99997.  Please use a smaller value.");
+        } else if (phi < 0.5) {
+            throw new ElasticsearchIllegalArgumentException("Not possible to request a Z-score with phi < 0.5");
+        }


Comparisons on floating-point numbers can sometimes be... surprising. I'm wondering whether it might help to move these exceptions to the index computed by the binary search below? I.e. if index == -1 => error, phi must be >= 0.5, and if index == -1 - length => error, phi must be <= 0.99997?
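
A minimal sketch of that idea follows. It is not the PR's `ZTable`: the four-entry table is an abbreviated illustration, the class and method names are hypothetical, `IllegalArgumentException` stands in for `ElasticsearchIllegalArgumentException`, and rounding down to the nearest tabulated entry is an assumption about the lookup policy.

```java
import java.util.Arrays;

// Sketch only: range checks are derived from the binary-search result
// instead of from separate floating-point comparisons.
final class ZTableSketch {

    // Abbreviated, illustrative table: PHI[i] is the standard normal CDF at Z[i].
    private static final double[] PHI = { 0.5, 0.8413, 0.9772, 0.9987 };
    private static final double[] Z   = { 0.0, 1.0,    2.0,    3.0    };

    static double scoreFromPhi(double phi) {
        // Returns the index on an exact hit, otherwise -(insertionPoint) - 1.
        int index = Arrays.binarySearch(PHI, phi);
        if (index >= 0) {
            return Z[index];
        }
        if (index == -1) { // insertion point 0: below the first entry
            throw new IllegalArgumentException("phi must be >= " + PHI[0]);
        }
        if (index == -1 - PHI.length) { // insertion point == length: above the last entry
            throw new IllegalArgumentException("phi must be <= " + PHI[PHI.length - 1]);
        }
        int insertionPoint = -index - 1;
        return Z[insertionPoint - 1]; // assumed policy: round down to the nearest tabulated phi
    }

    public static void main(String[] args) {
        System.out.println(scoreFromPhi(0.9772));
        System.out.println(scoreFromPhi(0.9));
    }
}
```

The bounds in the error messages come straight from the table, so they cannot drift out of sync with it if entries are added later.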

colings86 · 2015-01-28T10:27:13Z

Would be good to add tests for non-default values of sigma. I didn't see any in the test class?

polyfractal · 2015-01-28T14:03:37Z

@colings86 Good call. Modified the tests to randomize sigma (a random double between 1 and 10). Found a bug in the parser, which was loading the intValue instead of the doubleValue.
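
To illustrate why only a non-integer sigma exposes this kind of bug, here is a small sketch. It assumes that reading the value as an integer drops the fractional part; `readAsInt` is a hypothetical helper that models this with a cast and does not call the real parser.

```java
// Sketch only: models an integer read of sigma with a cast; not the real parser API.
final class SigmaTruncationSketch {

    /** Hypothetical stand-in for reading the request value as an int. */
    static double readAsInt(double requested) {
        return (int) requested;
    }

    static double upperBound(double avg, double std, double sigma) {
        return avg + sigma * std;
    }

    public static void main(String[] args) {
        double avg = 191.25, std = 307.1822545330378;
        // An integer-valued sigma hides the bug: both reads agree.
        System.out.println(upperBound(avg, std, readAsInt(2.0)) == upperBound(avg, std, 2.0));
        // A fractional sigma exposes it: the bounds differ.
        System.out.println(upperBound(avg, std, readAsInt(2.5)) == upperBound(avg, std, 2.5));
    }
}
```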

colings86 · 2015-01-28T14:43:43Z

LGTM

Extended_stats now displays the upper and lower bounds on standard deviations (e.g. avg +/- std). The default is to show 2 std above/below, but this can be changed using the `sigma` parameter. Accepts non-negative doubles. Closes elastic#9356

polyfractal · 2015-01-28T17:17:18Z

Merged in a8f555d

polyfractal added :Analytics/Aggregations Aggregations >enhancement review and removed >enhancement :Analytics/Aggregations Aggregations review labels Jan 22, 2015

clintongormley added the v1.4.3 label Jan 23, 2015

colings86 self-assigned this Jan 23, 2015

colings86 reviewed Jan 23, 2015

colings86 reviewed Jan 26, 2015

jpountz reviewed Jan 27, 2015

colings86 assigned polyfractal and unassigned colings86 Jan 28, 2015

polyfractal force-pushed the enhancement/extended_stats_std branch from 21a23e0 to 3c40c68 January 28, 2015 15:51

polyfractal changed the title from "Aggregations: Add stdError and upper/lower interval bounds for stdDev and Confidence" to "Aggregations: Add standard error bounds to extended_stats" Jan 28, 2015

polyfractal closed this Jan 28, 2015

clintongormley removed the review label Mar 19, 2015
