Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quantiles don't respect weights of 0 #313

Closed
joshday opened this issue Nov 1, 2017 · 10 comments
Closed

Quantiles don't respect weights of 0 #313

joshday opened this issue Nov 1, 2017 · 10 comments

Comments

@joshday
Copy link
Contributor

joshday commented Nov 1, 2017

I would expect these two to be the same, but maybe I'm missing something?

julia> quantile([1,2,3,4,5], fweights([0,1,2,1,0]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 1.0
 2.42857
 3.0
 3.57143
 5.0

julia> quantile([2,3,3,4])
5-element Array{Float64,1}:
 2.0
 2.75
 3.0
 3.25
 4.0
@joshday joshday changed the title Quantiles don't seem to respect weights of 0 Quantiles don't respect weights of 0 at endpoints Nov 1, 2017
@joshday joshday changed the title Quantiles don't respect weights of 0 at endpoints Quantiles don't respect weights of 0 Nov 1, 2017
@nalimilan
Copy link
Member

@matthieugomez Remarks?

@joshday
Copy link
Contributor Author

joshday commented Nov 6, 2017

The Implementation in StatsBase points to http://stats.stackexchange.com/questions/13169/defining-quantiles-over-a-weighted-sample, which doesn't provide a reference or theoretical guarantees.

My understanding is that any subtype of AbstractWeights{<:Integer, <:Integer} is essentially run length encoding, and it may be that StatsBase only need a new quantile method to handle this.

For context, my use case is creating approximate quantiles from a histogram (where bins may contain zero observations).

If anyone has a good reference for weighted quantiles (a few google searches returns there's no well-established method), I would be happy to take a look.

@nalimilan
Copy link
Member

These issues are really tricky. As Wikipedia notes, there are 9 different ways of computing sample quantiles. AFAIK we use the same definition as R (R-7), and indeed we get the same result for unweighted quantiles. For weighted quantiles, the implementation by @matthieugomez from #117 gives the same result as the unweighted quantiles when all weights are 1.

But then it doesn't appear to follow the (a priori natural) property that it's equivalent to replicating each entry the number of times corresponding to its integer weight:

julia> quantile([2,3,3,4], [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 2.0 
 2.75
 3.0 
 3.25
 4.0 

julia> quantile([2,3,3,4], fweights([1,1,1,1]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 2.0 
 2.75
 3.0 
 3.25
 4.0 

julia> quantile([2,3,4], fweights([1,2,1]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 2.0
 2.5
 3.0
 3.5
 4.0

julia> quantile([1,2,3,4,5], fweights([0,1,2,1,0]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 1.0    
 2.42857
 3.0    
 3.57143
 5.0    

Note that with R's Hmisc package, weighted and unweighted results are consistent, except in the presence of zero weights (last case). I've filed harrelfe/Hmisc#81 to try to get more details.

> quantile(c(2, 3, 3, 4))
  0%  25%  50%  75% 100% 
2.00 2.75 3.00 3.25 4.00 
> Hmisc::wtd.quantile(c(2,3,3,4), c(1,1,1,1), seq(0, 1, by=.25))
  0%  25%  50%  75% 100% 
2.00 2.75 3.00 3.25 4.00 
> Hmisc::wtd.quantile(c(2,3,4), c(1,2,1), seq(0, 1, by=.25))
  0%  25%  50%  75% 100% 
2.00 2.75 3.00 3.25 4.00 
> Hmisc::wtd.quantile(c(1,2,3,4,5), c(0,1,2,1,0), seq(0, 1, by=.25))
   0%   25%   50%   75%  100% 
2.000 2.750 3.000 3.375 4.500 

My understanding is that any subtype of AbstractWeights{<:Integer, <:Integer} is essentially run length encoding, and it may be that StatsBase only need a new quantile method to handle this.

More precisely, this is the case for FrequencyWeights. Other AbstractWeights types have different meanings, though in the context of quantiles I don't think it matters and taking frequency weights as a reference is simpler since it is equivalent to repeating values. Anyway, we don't need a new method, we just need to fix the current one.

For context, my use case is creating approximate quantiles from a histogram (where bins may contain zero observations).

If anyone has a good reference for weighted quantiles (a few google searches returns there's no well-established method), I would be happy to take a look.

I don't know of a good reference, but I've found this PR against Numpy which claims to be equivalent to repeating values: numpy/numpy#9211. Since Numpy is BSD, you can take inspiration from the code without licensing issues.

There are also various implementations at https://stackoverflow.com/questions/21844024/weighted-percentile-using-numpy and https://stackoverflow.com/questions/11585564/definitive-way-to-match-stata-weighted-xtile-command-using-python (the latter claiming to reproduce R's results). You could also look at what Hmisc does (from a quick glance it looks quite short) and describe the algorithm here so that somebody else can implement it without making StatsBase a derivative work of Hmisc.

@matthieugomez
Copy link
Contributor

matthieugomez commented Nov 7, 2017

It makes sense to change the current definition so that the output does not depend on values with zero weights.

That being said, it is not possible to have a quantile definition that corresponds to frequency weight and that does not depend on normalization of weights (w.sum). For instance, intuitively, we want

quantile([1, 2], fweights([2, 2]), [0.25]) = 1
#but
quantile([1, 2], fweights([1, 1]), [0.25]) > 1

In particular, we cannot use the frequency definition to handle non integer weights:

quantile([1, 2], fweights([0.1, 0.1]), [0.25])

So I guess we need two definitions of weighted quantile, one for fweights, and another one for other types of weights.

@nalimilan
Copy link
Member

It makes sense to change the current definition so that the output does not depend on values with zero weights.

Right. Though currently the problem seems to be more general: the weighted quantile is not equivalent to the unweighted quantile with replicated values. Is that expected?

That being said, it is not possible to have a quantile definition that corresponds to frequency weight and that does not depend on normalization of weights (w.sum). [...]
So I guess we need two definitions of weighted quantile, one for fweights, and another one for other types of weights.

That would certainly be possible (that's one of the advantages of having multiple types of weights), but I don't understand why it would be needed. Shouldn't we just take the values of the weights as they are in all cases? What kind of normalization would you want to apply? I'd have thought we should use a formula which is a generalization of what we expect for the special case of integer weights.

BTW, how confident are you that the algorithm you implemented is good? Do you have more references? I guess you have a much better understanding of the possible implementations than I do.

@matthieugomez
Copy link
Contributor

matthieugomez commented Nov 8, 2017

I don't think that the notion of quantile with frequency weights can be extended to non integer weights.

The observation that the results with weights = 1 should differ from the result with weights = 2 says two things.

  • It is unclear that we always want to return the result corresponding to frequency weights because it is sensitive to the sum of weights
  • Even if we wanted to do that, the frequency weight paradigm cannot give an answer on what to do if all weights equal 0.1.

@nalimilan
Copy link
Member

When you say "cannot be extended to non integer weights", do you mean that there are several ways of extending it (with no clear reason to choose a solution over another one), or that it's really not possible? The fact that "frequency weight paradigm cannot give an answer on what to do" doesn't mean we can't choose a particular behavior.

@matthieugomez
Copy link
Contributor

matthieugomez commented Nov 8, 2017

I mean that there is no clear way to do it. Moreover, any method that does it would make the result depend on the sum of weights. So it would be a bad idea because the result of an operation with non frequency weights should not depend on the sum of weights.

@nalimilan
Copy link
Member

We could certainly restrict the method to FrequencyWeights for now, at least that would help clarifying its meaning. Hmisc has a normwt option which when set to true normalizes the weights to sum to the length of the input vector, which might be somewhat arbitrary but at least is simple. Anyway the priority should clearly be to have a consistent behavior for frequency weights. Any ideas about how to do that?

@nalimilan
Copy link
Member

This has been fixed by #316.

julia> quantile([1,2,3,4,5], fweights([0,1,2,1,0]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 2.0
 2.75
 3.0
 3.25
 4.0

julia> quantile([2,3,3,4])
5-element Array{Float64,1}:
 2.0
 2.75
 3.0
 3.25
 4.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants