Percentiles aggregation #5323

Closed
jpountz opened this issue Mar 3, 2014 · 4 comments

@jpountz

jpountz commented Mar 3, 2014

A percentiles aggregation would make it possible to compute (approximate) values of arbitrary percentiles based on the t-digest algorithm. Computing exact percentiles is not reasonably feasible, as it would require shards to stream all values to the node that coordinates search execution, which could amount to gigabytes on a high-cardinality field. t-digest, on the other hand, trades accuracy for memory by summarizing the set of accumulated values, and it has some interesting properties:

  • compression is configurable, meaning that you can tune it for better accuracy at the cost of higher memory usage,
  • accuracy is excellent for extreme percentiles,
  • percentiles remain accurate when only a few values have been accumulated.
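
To make the accuracy/memory trade-off concrete, here is a minimal toy sketch of a centroid-based summary in Python. It is not the t-digest library and not what the aggregation would ship with: `ToyDigest`, its methods, and the exact size bound are assumptions made for illustration, loosely in the spirit of the idea that centroids near the extremes stay small while centroids near the median may grow. It also shows why per-shard summaries can be merged on the coordinating node instead of streaming raw values.

```python
class ToyDigest:
    """Toy centroid summary. An illustration only, NOT the real t-digest."""

    def __init__(self, compression=100.0):
        self.compression = compression  # higher = more centroids = better accuracy
        self.centroids = []             # (mean, weight) pairs, sorted by mean
        self.total = 0                  # total weight summarized so far

    def add_all(self, values):
        self._absorb([(float(v), 1) for v in values])

    def merge(self, other):
        """Fold in another digest, e.g. one built on a different shard."""
        self._absorb(list(other.centroids))

    def _absorb(self, incoming):
        items = sorted(self.centroids + incoming)
        total = sum(weight for _, weight in items)
        merged = []
        before = 0  # weight strictly before merged[-1]
        for mean, weight in items:
            if merged:
                last_mean, last_weight = merged[-1]
                combined = last_weight + weight
                # Quantile at the middle of the would-be merged centroid.
                q = (before + combined / 2) / total
                # Assumed size bound: tiny near q=0 and q=1, largest at q=0.5.
                limit = 4 * total * q * (1 - q) / self.compression
                if combined <= limit:
                    new_mean = (last_mean * last_weight + mean * weight) / combined
                    merged[-1] = (new_mean, combined)
                    continue
                before += last_weight
            merged.append((mean, weight))
        self.centroids = merged
        self.total = total

    def percentile(self, p):
        if not self.centroids:
            raise ValueError("empty digest")
        target = p / 100 * self.total
        seen = 0
        prev_rank = prev_mean = None
        for mean, weight in self.centroids:
            rank = seen + weight / 2  # treat the centroid as sitting mid-weight
            if target <= rank:
                if prev_rank is None:
                    return mean
                frac = (target - prev_rank) / (rank - prev_rank)
                return prev_mean + frac * (mean - prev_mean)
            prev_rank, prev_mean = rank, mean
            seen += weight
        return self.centroids[-1][0]


# Two "shards" summarize their own values; only the summaries are merged.
shard_a = ToyDigest()
shard_a.add_all(range(0, 10000, 2))
shard_b = ToyDigest()
shard_b.add_all(range(1, 10000, 2))
shard_a.merge(shard_b)
print(len(shard_a.centroids), shard_a.percentile(99))
```

With only a handful of values every centroid stays a singleton in this sketch, which is the intuition behind the last bullet; raising `compression` keeps more centroids and so costs more memory.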

Example:

{
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "percents" : [95, 99, 99.9] 
            }
        }
    }
}
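
For illustration, the same request body could be built programmatically. This is only a sketch: `percentiles_request` is a hypothetical helper, not part of any client library, and it does nothing beyond reproducing the JSON structure shown above after a basic range check.

```python
import json


def percentiles_request(name, field, percents):
    """Hypothetical helper: build the aggregation body from the example."""
    for p in percents:
        if not 0 <= p <= 100:
            raise ValueError(f"percentile out of range: {p}")
    return {
        "aggs": {
            name: {
                "percentiles": {
                    "field": field,
                    "percents": list(percents),
                }
            }
        }
    }


body = percentiles_request("load_time_outlier", "load_time", [95, 99, 99.9])
print(json.dumps(body, indent=4))
```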
@jpountz jpountz self-assigned this Mar 3, 2014
polyfractal added a commit that referenced this issue Mar 3, 2014
A new metric aggregation that can compute approximate values of arbitrary
percentiles.

Close #5323
@lukas-vlcek

Beautiful!

@otisg

otisg commented Mar 4, 2014

Out of curiosity, why did you choose t-digest and not QDigest? Did you do an extensive comparison and conclude that t-digest has a lower memory footprint, higher speed, and better accuracy?

@jpountz

jpountz commented Mar 4, 2014

The two main reasons why we did not consider q-digest are that it does not work with doubles and that it looked less accurate than t-digest. The t-digest paper also gives interesting explanations of why t-digest performs better than q-digest.

@otisg

otisg commented Mar 4, 2014

Thanks Adrien! Sounds like you didn't actually run comparison tests, right? (not "blaming", just trying to understand). @tdunning may have more speed improvements in t-digest, a little bird told me...
