Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add quantilerank and percentilerank functions #741

Merged
merged 31 commits into from
Feb 4, 2022

Conversation

AugustoCL
Copy link
Contributor

This PR continue the discussion of the merged/closed PR #733.

So, It has the code updated, having:

  • The new function names (quantilerank and percentilerank) as combined in the previous discussion.
  • An update to method :rank according to this book reference.
  • A cleaner commit history to facilitate the evaluations and suggestions that will be proposed by maintainers.

src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
test/ranking.jl Outdated Show resolved Hide resolved
test/ranking.jl Outdated Show resolved Hide resolved
@AugustoCL
Copy link
Contributor Author

AugustoCL commented Dec 13, 2021

Actually I was confused, I think method rather corresponds to the alpha and beta argument to quantile. For example, in Excel PERCENTILE.INC corresponds to PERCENTRANK.INC and PERCENTILE.EXC corresponds to PERCENTRANK.EXC. Could you check whether we could replace method with alpha and beta instead, so that we have a consistent API? If necessary, for simplicity we could support only some combinations of these two arguments, which would match the current possible choices for method (and therefore keep the same if structure). See ?quantile for details.

I have been looking up bibliographic references for percentilerankand it has been difficult to find and this worries me.
In Excel there are no references or source code. In scipy, only source code, no reference. In LibreOffice there is a brief description of the method used (unlike anything we've seen before), but also without references. In dplyr (R), they emulate the percent_rank of SQL2003, whose documentation has no source code, just description and example. On Wikipedia, there's only the reference to the book I've already cited.

I did a brief reading of the quantile reference article, found it a bit advanced for me, and couldn't see a direct relationship between the quantilerank methods and the different quantile methods. This worries me and makes me doubt whether we can maintain consistency in the API.

As it has been difficult to find references in the documentation of other languages, it is up to you to keep or not the construction of the functions with the 3 methods, :mean, :strict and :weak, using the book Fundamental Statistics for the Behavioral Sciences as reference for the :mean method.

Also, if you find that it's not worth implementing the function, please know that I have no problem closing this PR.

@nalimilan
Copy link
Member

nalimilan commented Dec 14, 2021

I did a brief reading of the quantile reference article, found it a bit advanced for me, and couldn't see a direct relationship between the quantilerank methods and the different quantile methods. This worries me and makes me doubt whether we can maintain consistency in the API.

These things are really tricky, but looking at the Wikipedia table we can see that the R-7 definition, which is used by Excel's PERCENTILE/PERCENTILE.INC and Julia has h=N − 1, just like in the denominator of the formula Excel's PERCENTRANK/PERCENTRANK.INC uses (discussion above at #741 (comment)). Likewise, the R-6 definition, which is used by Excel's PERCENTILE.EXC, has h=N + 1, just like in the denominator of the formula Excel's PERCENTRANK.EXC uses. So this seems to indicate there's a direct relationship with the different quantile types. The percentrank methods from SciPy have N as the denominator, and they seem to correspond to the R-1 (method=:strict) and R-2 (method=:mean) definitions. If we could confirm this the situation would be much clearer.

@AugustoCL AugustoCL changed the title add quantilerank and percentilerank functions [WIP] add quantilerank and percentilerank functions Dec 21, 2021
@AugustoCL
Copy link
Contributor Author

So this seems to indicate there's a direct relationship with the different quantile types.

@nalimilan,

I checked the relationship of h in the percentile definitions presented in the wikipedia table and to me it looks like there is a direct relationship between them, but I'm not comfortable asserting this on my own.

So I proceeded with the implementation of the different methods using the method argument in the function and highlighted the direct relationship only in the :inc and :exc methods. I imagine if there is trust in this direct relationship we can use the alpha and beta arguments to be consistent with the API, as suggested.
And if you feel comfortable establishing this direct relationship in the docstring, please feel free to update the docstring as you think is better. I'm always open to suggestions.

@nalimilan
Copy link
Member

Thanks! Let's get the method argument right with the proper correspondence with quantile definitions, alpha and beta can be added later.

src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
test/ranking.jl Outdated Show resolved Hide resolved
@AugustoCL
Copy link
Contributor Author

the question arose whether or not the array needs to be sorted and looking for the libreoffice implementation, I discovered that is necessary.

image

I saw in the quantile code tat there is a specific sort used there. So, since we need to sort the vector in some methods, could you please help us find the best recommended sort for the task? I don't have a CS background, so I'm not the best person to answer that problem.

I made some benchmark tests and saw that if I use !issorted(v) && (v = sort(v)) it is better than !sorted && (v = sort(v))
( I don't know why haha). So, I also removed the sorted argument in the function.

This sort question I'm up to you. whatever you suggest, we do.

src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
@bkamins
Copy link
Contributor

bkamins commented Jan 16, 2022

I discovered that is necessary.

Sorting is not necessary. What you need to do is:

count_less = 0
count_equal = 0
last_less = value
first_greater = value
for x in v
    if x == value
        count_equal += 1
    elseif x < value
        count_less += 1
        if last_less == value || last_less < x
            last_less = x
        end
    else
        if first_greater == value || first_greater > x
            first_greater = x
        end
    end
end

(I have written this without looking at @nalimilan code not to get influenced by it but it seems it is the same).

Then you should be able to compute all required variants using count_less, count_equal, last_less and first greater. (note that you need to be careful to correctly handle cases when last_less == value or first_greater == value as they are special).

You have not provided the definitions you use in the docstring (you gave only references and I do not have enough time currently to digest them - sorry for this) but there should be a natural mapping.

E.g. for :inc it seems that the following would work (I am writing it out of my head as there is no definition given):

n = length(v)
if last_less == value
    return 0.0
elseif count_equal > 0
        return count_less / (n - 1)
elseif first_greater == value
        return 1.0
else
    lower = (count_less - 1) / (n - 1)
    upper = count_less / (n - 1)
    ratio = (value - last_less) / (first_greater - last_less)
    return lower * (1 - ratio) + upper * ratio
end

src/ranking.jl Outdated Show resolved Hide resolved
@AugustoCL
Copy link
Contributor Author

AugustoCL commented Jan 23, 2022

@bkamins ,

I used the for-loop you wrote, it worked perfectly and I also adapted all the other methods using the 4 variables generated in the for-loop. All tests passed, after the changes made in commit dca02ea. Tks.

I understand that an important point that remains to be resolved is to find a way to incorporate the use of the skipmissing() function without using collect() in the current code. I caught the use of the collect by looking at the quantile.jl source code on Statistics.jl, but if anyone knows an alternative, please share it with us.

@bkamins
Copy link
Contributor

bkamins commented Jan 23, 2022

I understand that an important point that remains to be resolved is to find a way to incorporate the use of the skipmissing() function without using collect() in the current code.

I have written my implementation to work on any iterable. It does not rely on the fact that what you pass is a vector. So it can be just dropped.

The only consideration is (but we could potentially add it later, cc @nalimilan) if we want AbstractArray to get a special treatment with dims kwarg (i.e. doing reductions over some dimensions like other functions do. But I think we could add this later.

@AugustoCL
Copy link
Contributor Author

AugustoCL commented Jan 28, 2022

I have not checked manually the tests line by line, but I hope you used some references to make sure they are correct

Yes. To ensure that all the methods are right, I reproduced on the tests the same examples of the docs of each method.

src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
Copy link
Contributor

@bkamins bkamins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let us wait for @nalimilan approval before merging

src/ranking.jl Outdated Show resolved Hide resolved
src/StatsBase.jl Outdated Show resolved Hide resolved
docs/src/ranking.md Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
test/ranking.jl Outdated Show resolved Hide resolved
test/ranking.jl Outdated Show resolved Hide resolved
test/ranking.jl Outdated Show resolved Hide resolved
test/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
src/ranking.jl Outdated Show resolved Hide resolved
@nalimilan nalimilan changed the title [WIP] add quantilerank and percentilerank functions Add quantilerank and percentilerank functions Jan 31, 2022
Copy link
Member

@nalimilan nalimilan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bearing with us @AugustoCL! The road was long but this is a very nice addition to StatsBase! I believe we now have one of the most complete and best documented implementations around.

I've made a few direct edits to the docstrings, let me know whether that's OK.

@AugustoCL
Copy link
Contributor Author

let me know whether that's OK.

I like those edits in docstrings. It's very good to me.

Thanks for bearing with us @AugustoCL

You're welcome. I also must thank you for guiding me on this path. I did the best I could for my level and learned a lot on this journey.

The road was long but this is a very nice addition to StatsBase! I believe we now have one of the most complete and best documented implementations around.

I wondered if these functions were relevant enough to be accepted in a base package like StatsBase.jl. So knowing that makes me proud of this work. Thanks for all.

@AugustoCL
Copy link
Contributor Author

@nalimilan,

Just for curiosity,

is there an estimated date for the merge?
I don't know the factors that determine a new release, I would love to hear an explanation of the best practices of this.

@bkamins
Copy link
Contributor

bkamins commented Feb 1, 2022

For now we are waiting for a few days to hear if other people have comments on the PR.
If you will not see this PR merged after this period can you please ping us?

@AugustoCL
Copy link
Contributor Author

If you will not see this PR merged after this period can you please ping us?

yes. So I'll keep an eye out until this friday.

@AugustoCL
Copy link
Contributor Author

@nalimilan, @bkamins

If you will not see this PR merged after this period can you please ping us?

First ping🏓

@nalimilan
Copy link
Member

Let's merge, it sounds unlikely that somebody new is going to comment at this point!

@nalimilan nalimilan merged commit 7fca6e8 into JuliaStats:master Feb 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants