Add quantilerank and percentilerank functions #741

AugustoCL · 2021-11-29T19:17:15Z

This PR continue the discussion of the merged/closed PR #733.

So, It has the code updated, having:

The new function names (quantilerank and percentilerank) as combined in the previous discussion.
An update to method :rank according to this book reference.
A cleaner commit history to facilitate the evaluations and suggestions that will be proposed by maintainers.

src/ranking.jl

test/ranking.jl

AugustoCL · 2021-12-13T00:53:55Z

Actually I was confused, I think method rather corresponds to the alpha and beta argument to quantile. For example, in Excel PERCENTILE.INC corresponds to PERCENTRANK.INC and PERCENTILE.EXC corresponds to PERCENTRANK.EXC. Could you check whether we could replace method with alpha and beta instead, so that we have a consistent API? If necessary, for simplicity we could support only some combinations of these two arguments, which would match the current possible choices for method (and therefore keep the same if structure). See ?quantile for details.

I have been looking up bibliographic references for percentilerankand it has been difficult to find and this worries me.
In Excel there are no references or source code. In scipy, only source code, no reference. In LibreOffice there is a brief description of the method used (unlike anything we've seen before), but also without references. In dplyr (R), they emulate the percent_rank of SQL2003, whose documentation has no source code, just description and example. On Wikipedia, there's only the reference to the book I've already cited.

I did a brief reading of the quantile reference article, found it a bit advanced for me, and couldn't see a direct relationship between the quantilerank methods and the different quantile methods. This worries me and makes me doubt whether we can maintain consistency in the API.

As it has been difficult to find references in the documentation of other languages, it is up to you to keep or not the construction of the functions with the 3 methods, :mean, :strict and :weak, using the book Fundamental Statistics for the Behavioral Sciences as reference for the :mean method.

Also, if you find that it's not worth implementing the function, please know that I have no problem closing this PR.

nalimilan · 2021-12-14T22:10:21Z

I did a brief reading of the quantile reference article, found it a bit advanced for me, and couldn't see a direct relationship between the quantilerank methods and the different quantile methods. This worries me and makes me doubt whether we can maintain consistency in the API.

These things are really tricky, but looking at the Wikipedia table we can see that the R-7 definition, which is used by Excel's PERCENTILE/PERCENTILE.INC and Julia has h=N − 1, just like in the denominator of the formula Excel's PERCENTRANK/PERCENTRANK.INC uses (discussion above at #741 (comment)). Likewise, the R-6 definition, which is used by Excel's PERCENTILE.EXC, has h=N + 1, just like in the denominator of the formula Excel's PERCENTRANK.EXC uses. So this seems to indicate there's a direct relationship with the different quantile types. The percentrank methods from SciPy have N as the denominator, and they seem to correspond to the R-1 (method=:strict) and R-2 (method=:mean) definitions. If we could confirm this the situation would be much clearer.

AugustoCL · 2021-12-23T20:03:30Z

So this seems to indicate there's a direct relationship with the different quantile types.

@nalimilan,

I checked the relationship of h in the percentile definitions presented in the wikipedia table and to me it looks like there is a direct relationship between them, but I'm not comfortable asserting this on my own.

So I proceeded with the implementation of the different methods using the method argument in the function and highlighted the direct relationship only in the :inc and :exc methods. I imagine if there is trust in this direct relationship we can use the alpha and beta arguments to be consistent with the API, as suggested.
And if you feel comfortable establishing this direct relationship in the docstring, please feel free to update the docstring as you think is better. I'm always open to suggestions.

nalimilan · 2021-12-27T11:17:19Z

Thanks! Let's get the method argument right with the proper correspondence with quantile definitions, alpha and beta can be added later.

src/ranking.jl

test/ranking.jl

AugustoCL · 2022-01-09T15:24:47Z

the question arose whether or not the array needs to be sorted and looking for the libreoffice implementation, I discovered that is necessary.

I saw in the quantile code tat there is a specific sort used there. So, since we need to sort the vector in some methods, could you please help us find the best recommended sort for the task? I don't have a CS background, so I'm not the best person to answer that problem.

I made some benchmark tests and saw that if I use !issorted(v) && (v = sort(v)) it is better than !sorted && (v = sort(v))
( I don't know why haha). So, I also removed the sorted argument in the function.

This sort question I'm up to you. whatever you suggest, we do.

src/ranking.jl

bkamins · 2022-01-16T22:16:17Z

I discovered that is necessary.

Sorting is not necessary. What you need to do is:

count_less = 0
count_equal = 0
last_less = value
first_greater = value
for x in v
    if x == value
        count_equal += 1
    elseif x < value
        count_less += 1
        if last_less == value || last_less < x
            last_less = x
        end
    else
        if first_greater == value || first_greater > x
            first_greater = x
        end
    end
end

(I have written this without looking at @nalimilan code not to get influenced by it but it seems it is the same).

Then you should be able to compute all required variants using count_less, count_equal, last_less and first greater. (note that you need to be careful to correctly handle cases when last_less == value or first_greater == value as they are special).

You have not provided the definitions you use in the docstring (you gave only references and I do not have enough time currently to digest them - sorry for this) but there should be a natural mapping.

E.g. for :inc it seems that the following would work (I am writing it out of my head as there is no definition given):

n = length(v)
if last_less == value
    return 0.0
elseif count_equal > 0
        return count_less / (n - 1)
elseif first_greater == value
        return 1.0
else
    lower = (count_less - 1) / (n - 1)
    upper = count_less / (n - 1)
    ratio = (value - last_less) / (first_greater - last_less)
    return lower * (1 - ratio) + upper * ratio
end

src/ranking.jl

AugustoCL · 2022-01-23T13:22:04Z

@bkamins ,

I used the for-loop you wrote, it worked perfectly and I also adapted all the other methods using the 4 variables generated in the for-loop. All tests passed, after the changes made in commit dca02ea. Tks.

I understand that an important point that remains to be resolved is to find a way to incorporate the use of the skipmissing() function without using collect() in the current code. I caught the use of the collect by looking at the quantile.jl source code on Statistics.jl, but if anyone knows an alternative, please share it with us.

bkamins · 2022-01-23T13:43:16Z

I understand that an important point that remains to be resolved is to find a way to incorporate the use of the skipmissing() function without using collect() in the current code.

I have written my implementation to work on any iterable. It does not rely on the fact that what you pass is a vector. So it can be just dropped.

The only consideration is (but we could potentially add it later, cc @nalimilan) if we want AbstractArray to get a special treatment with dims kwarg (i.e. doing reductions over some dimensions like other functions do. But I think we could add this later.

AugustoCL · 2022-01-28T22:17:16Z

I have not checked manually the tests line by line, but I hope you used some references to make sure they are correct

Yes. To ensure that all the methods are right, I reproduced on the tests the same examples of the docs of each method.

src/ranking.jl

bkamins

let us wait for @nalimilan approval before merging

src/ranking.jl

src/StatsBase.jl

docs/src/ranking.md

src/ranking.jl

test/ranking.jl

src/ranking.jl

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

nalimilan

Thanks for bearing with us @AugustoCL! The road was long but this is a very nice addition to StatsBase! I believe we now have one of the most complete and best documented implementations around.

I've made a few direct edits to the docstrings, let me know whether that's OK.

AugustoCL · 2022-01-31T22:19:11Z

let me know whether that's OK.

I like those edits in docstrings. It's very good to me.

Thanks for bearing with us @AugustoCL

You're welcome. I also must thank you for guiding me on this path. I did the best I could for my level and learned a lot on this journey.

The road was long but this is a very nice addition to StatsBase! I believe we now have one of the most complete and best documented implementations around.

I wondered if these functions were relevant enough to be accepted in a base package like StatsBase.jl. So knowing that makes me proud of this work. Thanks for all.

src/scalarstats.jl

AugustoCL · 2022-02-01T20:15:52Z

@nalimilan,

Just for curiosity,

is there an estimated date for the merge?
I don't know the factors that determine a new release, I would love to hear an explanation of the best practices of this.

bkamins · 2022-02-01T21:14:28Z

For now we are waiting for a few days to hear if other people have comments on the PR.
If you will not see this PR merged after this period can you please ping us?

AugustoCL · 2022-02-01T21:30:38Z

If you will not see this PR merged after this period can you please ping us?

yes. So I'll keep an eye out until this friday.

AugustoCL · 2022-02-04T15:31:37Z

@nalimilan, @bkamins

If you will not see this PR merged after this period can you please ping us?

First ping🏓

nalimilan · 2022-02-04T16:05:00Z

Let's merge, it sounds unlikely that somebody new is going to comment at this point!

AugustoCL added 2 commits November 29, 2021 15:55

add quantilerank and percentilerank functions

206f3ce

correct docstring

d3fa4d7

AugustoCL mentioned this pull request Nov 29, 2021

add percentileofscore() function #733

Merged

update references in docstring

9f6ce6c

nalimilan reviewed Dec 4, 2021

View reviewed changes

AugustoCL and others added 2 commits December 10, 2021 20:41

Merge branch 'JuliaStats:master' into quantile-percentile-rank

70a3e1d

updates for first review

11671f4

adjust missing/NaN entries for the first review

f1926c7

AugustoCL changed the title ~~add quantilerank and percentilerank functions~~ [WIP] add quantilerank and percentilerank functions Dec 21, 2021

AugustoCL and others added 2 commits December 23, 2021 16:33

Merge branch 'JuliaStats:master' into quantile-percentile-rank

3de82cd

updates for second review

9accf35

AugustoCL added 2 commits December 23, 2021 17:04

adjust jldoctest by addin 'using StatsBase'

f04b2e2

adjust jldoctest

332e204

nalimilan reviewed Dec 27, 2021

View reviewed changes

AugustoCL and others added 2 commits January 8, 2022 16:37

Merge branch 'JuliaStats:master' into quantile-percentile-rank

69d7202

updates for third review

0f74582

remove arg in func

7f91a49

bkamins reviewed Jan 16, 2022

View reviewed changes

src/ranking.jl Outdated Show resolved Hide resolved

bkamins reviewed Jan 16, 2022

View reviewed changes

src/ranking.jl Outdated Show resolved Hide resolved

bkamins reviewed Jan 16, 2022

View reviewed changes

src/ranking.jl Outdated Show resolved Hide resolved

updates for the fourth revision

dca02ea

add skipmissing example in docstring

fa28440

adjust suggestion in docstring '[0, 1] interval of ..."

c8a5ee1

bkamins reviewed Jan 29, 2022

View reviewed changes

src/ranking.jl Outdated Show resolved Hide resolved

bkamins reviewed Jan 29, 2022

View reviewed changes

src/ranking.jl Outdated Show resolved Hide resolved

bkamins reviewed Jan 29, 2022

View reviewed changes

src/ranking.jl Outdated Show resolved Hide resolved

adjust text in ArgumentError thrown

a4f7b16

bkamins approved these changes Jan 29, 2022

View reviewed changes

nalimilan reviewed Jan 29, 2022

View reviewed changes

apply suggestion from review of nalimian

3e30dd1

AugustoCL force-pushed the quantile-percentile-rank branch from 2e2ba27 to 3e30dd1 Compare January 29, 2022 23:55

AugustoCL added 3 commits January 29, 2022 21:10

adjust text PERCENTRANK in docstring

b668e5d

add a blank end line in files

818d15a

apply suggestions in tests and docstring

b74e8dc

AugustoCL requested a review from nalimilan January 30, 2022 20:05

bkamins reviewed Jan 31, 2022

View reviewed changes

src/ranking.jl Outdated Show resolved Hide resolved

AugustoCL and others added 4 commits January 31, 2022 09:25

apply suggestion on definition in docstring

6bbfa5d

Co-authored-by: Bogumił Kamiński <bkamins@sgh.waw.pl>

change location of function, from ranking to scalarstats

bfdffc7

Doc cleanup

a7c451d

Move median up

b61fb21

nalimilan changed the title ~~[WIP] add quantilerank and percentilerank functions~~ Add quantilerank and percentilerank functions Jan 31, 2022

nalimilan approved these changes Jan 31, 2022

View reviewed changes

nalimilan reviewed Feb 1, 2022

View reviewed changes

src/scalarstats.jl Show resolved Hide resolved

Update src/scalarstats.jl

b9f977e

nalimilan merged commit 7fca6e8 into JuliaStats:master Feb 4, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add quantilerank and percentilerank functions #741

Add quantilerank and percentilerank functions #741

AugustoCL commented Nov 29, 2021

AugustoCL commented Dec 13, 2021 •

edited

nalimilan commented Dec 14, 2021 •

edited

AugustoCL commented Dec 23, 2021

nalimilan commented Dec 27, 2021

AugustoCL commented Jan 9, 2022

bkamins commented Jan 16, 2022

AugustoCL commented Jan 23, 2022 •

edited

bkamins commented Jan 23, 2022

AugustoCL commented Jan 28, 2022 •

edited

bkamins left a comment

nalimilan left a comment

AugustoCL commented Jan 31, 2022

AugustoCL commented Feb 1, 2022

bkamins commented Feb 1, 2022

AugustoCL commented Feb 1, 2022

AugustoCL commented Feb 4, 2022

nalimilan commented Feb 4, 2022

Add quantilerank and percentilerank functions #741

Add quantilerank and percentilerank functions #741

Conversation

AugustoCL commented Nov 29, 2021

AugustoCL commented Dec 13, 2021 • edited

nalimilan commented Dec 14, 2021 • edited

AugustoCL commented Dec 23, 2021

nalimilan commented Dec 27, 2021

AugustoCL commented Jan 9, 2022

bkamins commented Jan 16, 2022

AugustoCL commented Jan 23, 2022 • edited

bkamins commented Jan 23, 2022

AugustoCL commented Jan 28, 2022 • edited

bkamins left a comment

Choose a reason for hiding this comment

nalimilan left a comment

Choose a reason for hiding this comment

AugustoCL commented Jan 31, 2022

AugustoCL commented Feb 1, 2022

bkamins commented Feb 1, 2022

AugustoCL commented Feb 1, 2022

AugustoCL commented Feb 4, 2022

nalimilan commented Feb 4, 2022

AugustoCL commented Dec 13, 2021 •

edited

nalimilan commented Dec 14, 2021 •

edited

AugustoCL commented Jan 23, 2022 •

edited

AugustoCL commented Jan 28, 2022 •

edited