New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Aggregations: Add standard error bounds to extended_stats #9389
Aggregations: Add standard error bounds to extended_stats #9389
Conversation
@@ -167,19 +198,25 @@ public void doClose() { | |||
|
|||
public static class Factory extends ValuesSourceAggregatorFactory.LeafOnly<ValuesSource.Numeric> { | |||
|
|||
public Factory(String name, ValuesSourceConfig<ValuesSource.Numeric> valuesSourceConfig) { | |||
private double sigma = 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
usually we leave these to be set in the constructor so that the Parser class takes care of setting default values. Also these variables should probably be final
@polyfractal left some comments. also a few other general points:
Also wonder if there is a better way of testing the bounds on the confidence intervals without recreating the formula in the test as this is hard to maintain if we improve the method in the future? |
|
||
public static double scoreFromAlpha(double alpha) { | ||
if (alpha >= 0.99994) { | ||
return 3.9; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The table above has values up to 0.99997, should the limit of alpha here not be 0.99997? Also I wonder if it should throw an exception above this value since it is outside the bounds of what we can calculate rather than all value of alpha above this corresponding to a Z value of 3.9?
@polyfractal left a small comment but I think its almost there |
Fixed. Will start writing docs and push those soon too. |
@polyfractal I reviewed your latest two commits, but I accidentally commented on the commits instead of the PR. Comments are really minor, otherwise it LGTM |
I see the issue tagged with 1.4.3 although it looks like a feature rather than a bugfix? So I think it should only go to master and 1.x? |
sigma = parser.intValue(); | ||
} else if (CONFIDENCE_INTERVAL.match(currentFieldName)) { | ||
confidenceInterval = parser.doubleValue(); | ||
if (confidenceInterval < 0 || confidenceInterval > 0.99994) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
interesting, what is the reason for this funny upper bound? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see, it's the ZTable stuff. Sorry for the noise :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The confidence interval is worked out using a Z-table. The formula for working out the value of Z is a complex integral with no mathematical solution so the solutions are numerical. We have a Z-Table with values for confidence from 0 to 0.99994, so it just so happens to be where our Z-Table stops. Since the confidence interval approaches 1 asymptotically, we have to stop somewhere before 1 and this value seems like it will cover almost all use-cases so should be adequate. Does that make sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I added this comment before I saw your reply :)
I'm wondering if we should allow for reporting several confidence intervals (like eg. we can report several percentiles in other aggs)? |
If we decide to do that, I would vote for removing confidence interval from the extended_stats aggregation and putting in its own aggregation since the extended_stats aggregation should not become bloated with every statistical measure we can think of. I think the same applies to the std_deviation interval too. It's not going to be very user friendly if we try to make the As a small aside, I was talking to @polyfractal the other day about this and suggested that if we add much more to the extended_stats aggregation we should split the functionality into an aggregation for each data distribution type. What we have now is really only applicable for normally distributed data, and we are already getting into the realms where that might/could/should be split into |
throw new ElasticsearchIllegalArgumentException("Z-Table has maximum resolution of phi=0.99997. Please use a smaller value."); | ||
} else if (phi < 0.5) { | ||
throw new ElasticsearchIllegalArgumentException("Not possible to request a Z-score with phi < 0.5"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comparisons on floating-point numbers can sometimes be... surprising. I'm wondering that it might help to move these exceptions to the computed index from binary search below? Ie. if index == -1 => error, phi must be >= 0.5 and if index == -1 - length => error, phi mist be <= 0.99997?
Would be good to add tests for non-default values of sigma. I didn't see any in the test class? |
@colings86 Good call. Modified the tests to randomize sigma (random double between 1-10). Found a bug in the parser which was loading the intValue instead of doubleValue |
LGTM |
Extended_stats now displays the upper and lower bounds on standard deviations (e.g. avg +/- std). Default is to show 2 std above/below, but can be changed using the `sigma` parameter. Accepts non-negative doubles Closes elastic#9356
21a23e0
to
3c40c68
Compare
Merged in a8f555d |
This PR adds configurable upper/lower standard deviation bounds to
extended_stats
aggregation, as requested in #9356:Syntax:
sigma: controls how many standard deviations above/below the mean should be displayed. Default is
2
standard deviations. Accepts non-negative integers.Response
Note: We talked about adding std_error and confidence intervals, but that introduces a set of restrictions on your data (sampled, normally distributed). Decided it would be better to investigate a new agg that deals with normally distributed data, so that users understand the implications of the restrictions better