Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregations: Add standard error bounds to extended_stats #9389

Closed

Conversation

polyfractal
Copy link
Contributor

This PR adds configurable upper/lower standard deviation bounds to extended_stats aggregation, as requested in #9356:

Syntax:

{
    "aggs": {
        "my_stats": {
            "extended_stats": {
                "field": "price",
                "sigma": 2
            }
        }
    }
}

sigma: controls how many standard deviations above/below the mean should be displayed. Default is 2 standard deviations. Accepts non-negative integers.

Response

"aggregations": {
   "my_stats": {
      "count": 8,
      "min": 10,
      "max": 1000,
      "avg": 191.25,
      "sum": 1530,
      "sum_of_squares": 1047500,
      "variance": 94360.9375,
      "std_deviation": 307.1822545330378
      "std_deviation_bounds": {
         "upper": 805.6145090660756,
         "lower": -423.1145090660756
      }
   }
}

Note: We talked about adding std_error and confidence intervals, but that introduces a set of restrictions on your data (sampled, normally distributed). Decided it would be better to investigate a new agg that deals with normally distributed data, so that users understand the implications of the restrictions better

@@ -167,19 +198,25 @@ public void doClose() {

public static class Factory extends ValuesSourceAggregatorFactory.LeafOnly<ValuesSource.Numeric> {

public Factory(String name, ValuesSourceConfig<ValuesSource.Numeric> valuesSourceConfig) {
private double sigma = 2;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usually we leave these to be set in the constructor so that the Parser class takes care of setting default values. Also these variables should probably be final

@colings86
Copy link
Contributor

@polyfractal left some comments. also a few other general points:

  • I think the documentation is really important for this since there are some implementation details that need to be explained, such as:
    • What we mean by standard deviation bounds and confidence intervals
    • How to change the bounds in the request
    • limits of the confidence level (whats the maximum and minimum confidence levels we support)

Also wonder if there is a better way of testing the bounds on the confidence intervals without recreating the formula in the test as this is hard to maintain if we improve the method in the future?


public static double scoreFromAlpha(double alpha) {
if (alpha >= 0.99994) {
return 3.9;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The table above has values up to 0.99997, should the limit of alpha here not be 0.99997? Also I wonder if it should throw an exception above this value since it is outside the bounds of what we can calculate rather than all value of alpha above this corresponding to a Z value of 3.9?

@colings86
Copy link
Contributor

@polyfractal left a small comment but I think its almost there

@polyfractal
Copy link
Contributor Author

Fixed. Will start writing docs and push those soon too.

@colings86
Copy link
Contributor

@polyfractal I reviewed your latest two commits, but I accidentally commented on the commits instead of the PR.

Comments are really minor, otherwise it LGTM

@jpountz
Copy link
Contributor

jpountz commented Jan 27, 2015

I see the issue tagged with 1.4.3 although it looks like a feature rather than a bugfix? So I think it should only go to master and 1.x?

sigma = parser.intValue();
} else if (CONFIDENCE_INTERVAL.match(currentFieldName)) {
confidenceInterval = parser.doubleValue();
if (confidenceInterval < 0 || confidenceInterval > 0.99994) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, what is the reason for this funny upper bound? :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see, it's the ZTable stuff. Sorry for the noise :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The confidence interval is worked out using a Z-table. The formula for working out the value of Z is a complex integral with no mathematical solution so the solutions are numerical. We have a Z-Table with values for confidence from 0 to 0.99994, so it just so happens to be where our Z-Table stops. Since the confidence interval approaches 1 asymptotically, we have to stop somewhere before 1 and this value seems like it will cover almost all use-cases so should be adequate. Does that make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I added this comment before I saw your reply :)

@jpountz
Copy link
Contributor

jpountz commented Jan 27, 2015

I'm wondering if we should allow for reporting several confidence intervals (like eg. we can report several percentiles in other aggs)?

@colings86
Copy link
Contributor

If we decide to do that, I would vote for removing confidence interval from the extended_stats aggregation and putting in its own aggregation since the extended_stats aggregation should not become bloated with every statistical measure we can think of. I think the same applies to the std_deviation interval too. It's not going to be very user friendly if we try to make the extended_stats aggregation do everything.

As a small aside, I was talking to @polyfractal the other day about this and suggested that if we add much more to the extended_stats aggregation we should split the functionality into an aggregation for each data distribution type. What we have now is really only applicable for normally distributed data, and we are already getting into the realms where that might/could/should be split into sampled normal data and population normal data aggregations, but this would open us up to providing statistical measure for other data distributions too.

throw new ElasticsearchIllegalArgumentException("Z-Table has maximum resolution of phi=0.99997. Please use a smaller value.");
} else if (phi < 0.5) {
throw new ElasticsearchIllegalArgumentException("Not possible to request a Z-score with phi < 0.5");
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comparisons on floating-point numbers can sometimes be... surprising. I'm wondering that it might help to move these exceptions to the computed index from binary search below? Ie. if index == -1 => error, phi must be >= 0.5 and if index == -1 - length => error, phi mist be <= 0.99997?

@colings86
Copy link
Contributor

Would be good to add tests for non-default values of sigma. I didn't see any in the test class?

@colings86 colings86 assigned polyfractal and unassigned colings86 Jan 28, 2015
@polyfractal
Copy link
Contributor Author

@colings86 Good call. Modified the tests to randomize sigma (random double between 1-10). Found a bug in the parser which was loading the intValue instead of doubleValue

@colings86
Copy link
Contributor

LGTM

Extended_stats now displays the upper and lower bounds on standard deviations (e.g. avg +/- std).
Default is to show 2 std above/below, but can be changed using the `sigma` parameter.
Accepts non-negative doubles

Closes elastic#9356
@polyfractal polyfractal changed the title Aggregations: Add stdError and upper/lower interval bounds for stdDev and Confidence Aggregations: Add standard error bounds to extended_stats Jan 28, 2015
@polyfractal
Copy link
Contributor Author

Merged in a8f555d

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants