New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add nth percentile calculation #137
Changes from 4 commits
0de5750
541e236
cb3b388
400f1ef
29bd3a7
f15935d
dfc5167
b02a23e
99a3a22
92adee4
068cb09
f10387a
f693606
fe34046
9aead30
0a5ca7d
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,86 @@ | ||
defmodule Benchee.Statistics.Percentile do | ||
@moduledoc false | ||
|
||
@doc """ | ||
Calculates the value at the `percentile_number`-th percentile. Think of this as the | ||
value below which `percentile_number` percent of the samples lie. For example, | ||
if `Benchee.Statistics.Percentile.percentile(samples, 99)` == 123.45, | ||
99% of samples are less than 123.45. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. less than or equal to imo? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Definitely open to improvements in the documentation. My understanding is that this value we are calculating is a "less than" number. Here's what Wikipedia says:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. imo it's or equal. E.g. When we have 10 values 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 - the 80th percentile would be 8 and 8 is a member of that 80th percentile group so from my understanding it should be less than or equal There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. However.... iex(2)> Percentile.percentile([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 80)
8.8 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. but.... but... I guess I have to take a look at the calculation again then and the involved values or the algorithm used to obtain them. I had some suspicions about indexes being off somewhere... for my understanding of 80th percentile (aka 80 percent of the data) this should most definitely return 8 - or where am I going wrong? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. After cross checking this seems to be the right value (tm) but it still appearsto be weird to me. I mean 80% of my values are up until including 8. In this particular case I would have understood if it's like 8.2 or something... 8.8 just plain seems too weird. Need to up my percentile knowledge apparently |
||
|
||
## Examples | ||
|
||
iex> Benchee.Statistics.Percentile.percentile([5, 3, 4, 5, 1, 3, 1, 3], 8, 100) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thank you for the lovely doctests 🎉 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🎉 🎉 🎉 |
||
5.0 | ||
|
||
iex> Benchee.Statistics.Percentile.percentile([5, 3, 4, 5, 1, 3, 1, 3], 8, 150) | ||
5.0 | ||
|
||
iex> Benchee.Statistics.Percentile.percentile([5, 3, 4, 5, 1, 3, 1, 3], 8, 0) | ||
1.0 | ||
|
||
iex> Benchee.Statistics.Percentile.percentile([5, 3, 4, 5, 1, 3, 1, 3], 8, -1) | ||
1.0 | ||
|
||
iex> Benchee.Statistics.Percentile.percentile([5, 3, 4, 5, 1, 3, 1, 3], 50) | ||
3.0 | ||
|
||
iex> Benchee.Statistics.Percentile.percentile([5, 3, 4, 5, 1, 3, 1, 3], 75) | ||
4.75 | ||
""" | ||
@spec percentile(list(number()), integer()) :: float() | ||
def percentile(samples, percentile_number) do | ||
percentile(samples, length(samples), percentile_number) | ||
end | ||
|
||
@spec percentile(list(number()), integer(), integer()) :: float() | ||
def percentile(samples, number_of_samples, percentile_number) when percentile_number > 100 do | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. First off, do we want What would you think about failing in this case and the one on line 40, and giving a helpful error message? I can't imagine that a user would give us a percentage of more than 99 or less than 1 on purpose. I think it would be likely that these would be cases of typos, and that by coercing these error cases we might be giving users results they think are strange and hiding their typos when we should be telling them how to fix their typos and get the results they want. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @devonestes so....the reason I have There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Actually that was my first inclination, but then I backed it out. What do you imagine, throwing something like There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For performance see #139 - with what is reasonable to expect to hit us (~1 Million) it takes ~4ms on my machine, we might have multiple scenarios so let's say * 10 that gives 40ms which is ok I guess ;) For 10 Million we'd be at about 300ms total according to this calculation. Non noticable for the given scenario user wise. Raising an argument error sounds nice! 👍 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. so, not sure where we stand with |
||
percentile(samples, number_of_samples, 100) | ||
end | ||
|
||
def percentile(samples, number_of_samples, percentile_number) when percentile_number < 0 do | ||
percentile(samples, number_of_samples, 0) | ||
end | ||
|
||
def percentile(samples, number_of_samples, percentile_number) do | ||
sorted = Enum.sort(samples) | ||
rank = (percentile_number / 100) * max(0, number_of_samples + 1) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. why There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @PragTob I had the same thought. But it's part of the formula. everything comes out wrong if you just use currently: So, as-is, we look for the value halfway between the 5th and 6th elements (yay, median). If we remove the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. makes sense... for the odd case of 9, 1/2 * 9 would also be 4.5 but we are looking for 5. Thanks. However.... dum dum dum - we are 0 indexed. So in my example above we should be looking at number 4 not number 5? But maybe that is redeemed through the I think it's ok as the median tests I wrote back in the day + your tests are passing. One of these things where I trust the tests more than my brain :D |
||
percentile_value(sorted, rank) | ||
end | ||
|
||
defp percentile_value(sorted, rank) when trunc(rank) == 0 do | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I had never seen There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With our new error throwing ways we shouldn't need this method anymore :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same thing for the function version below but it won't let me comment :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this probably got buried in my initial comment, but I added this back as a private function mainly so we don't have to sort the list again while calculating multiple percentiles on the same list There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. i meant the specific There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'll try—it's actually a guard for the |
||
sorted | ||
|> hd | ||
|> to_float | ||
end | ||
|
||
defp percentile_value(sorted, rank) when trunc(rank) >= length(sorted) do | ||
sorted | ||
|> Enum.reverse | ||
|> hd | ||
|> to_float | ||
end | ||
|
||
defp percentile_value(sorted, rank) do | ||
index = trunc(rank) | ||
[lower_bound, upper_bound | _] = Enum.drop(sorted, index - 1) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. at first I was pretty skeptical, but this seems like a great usage of |
||
interpolation_value = interpolation_value(lower_bound, upper_bound, rank) | ||
lower_bound + interpolation_value | ||
end | ||
|
||
# "Type 6" interpolation strategy. There are many ways to interpolate a value | ||
# when the rank is not an integer (in other words, we don't exactly land on a | ||
# particular sample). Of the 9 main strategies, (types 1-9), types 6, 7, and 8 | ||
# are generally acceptable and give similar results. | ||
# | ||
# For more information on interpolation strategies, see: | ||
# - https://stat.ethz.ch/R-manual/R-devel/library/stats/html/quantile.html | ||
# - http://www.itl.nist.gov/div898/handbook/prc/section2/prc262.htm | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 👍 information |
||
defp interpolation_value(lower_bound, upper_bound, rank) do | ||
interpolation_weight = rank - trunc(rank) | ||
interpolation_weight * (upper_bound - lower_bound) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 😍 |
||
end | ||
|
||
defp to_float(maybe_integer) do | ||
:erlang.float maybe_integer | ||
end | ||
end |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
defmodule Benchee.Statistics.PercentileTest do | ||
use ExUnit.Case, async: true | ||
alias Benchee.Statistics.Percentile | ||
doctest Percentile | ||
|
||
@nist_sample_data [ | ||
95.1772, | ||
95.1567, | ||
95.1937, | ||
95.1959, | ||
95.1442, | ||
95.0610, | ||
95.1591, | ||
95.1195, | ||
95.1065, | ||
95.0925, | ||
95.1990, | ||
95.1682 | ||
] | ||
|
||
@tag run: true | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. we don't need the tag, do we? :) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 😊 |
||
test "90th percentile" do | ||
result = Percentile.percentile(@nist_sample_data, 90) | ||
assert Float.round(result, 4) == 95.1981 | ||
end | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call re-using this! I never would have put that together myself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hahaha, I think you would have. I didn't realize until I kept seeing the same values for median and 50th percentile. And reading!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's non obvious yeah but this is how I imagined it. Thanks 😁