Quantiles don't respect weights of 0 #313

joshday · 2017-11-01T18:51:13Z

I would expect these two to be the same, but maybe I'm missing something?

julia> quantile([1,2,3,4,5], fweights([0,1,2,1,0]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 1.0
 2.42857
 3.0
 3.57143
 5.0

julia> quantile([2,3,3,4])
5-element Array{Float64,1}:
 2.0
 2.75
 3.0
 3.25
 4.0

nalimilan · 2017-11-03T21:39:43Z

@matthieugomez Remarks?

joshday · 2017-11-06T23:21:02Z

The Implementation in StatsBase points to http://stats.stackexchange.com/questions/13169/defining-quantiles-over-a-weighted-sample, which doesn't provide a reference or theoretical guarantees.

My understanding is that any subtype of AbstractWeights{<:Integer, <:Integer} is essentially run length encoding, and it may be that StatsBase only need a new quantile method to handle this.

For context, my use case is creating approximate quantiles from a histogram (where bins may contain zero observations).

If anyone has a good reference for weighted quantiles (a few google searches returns there's no well-established method), I would be happy to take a look.

nalimilan · 2017-11-07T09:40:16Z

These issues are really tricky. As Wikipedia notes, there are 9 different ways of computing sample quantiles. AFAIK we use the same definition as R (R-7), and indeed we get the same result for unweighted quantiles. For weighted quantiles, the implementation by @matthieugomez from #117 gives the same result as the unweighted quantiles when all weights are 1.

But then it doesn't appear to follow the (a priori natural) property that it's equivalent to replicating each entry the number of times corresponding to its integer weight:

julia> quantile([2,3,3,4], [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 2.0 
 2.75
 3.0 
 3.25
 4.0 

julia> quantile([2,3,3,4], fweights([1,1,1,1]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 2.0 
 2.75
 3.0 
 3.25
 4.0 

julia> quantile([2,3,4], fweights([1,2,1]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 2.0
 2.5
 3.0
 3.5
 4.0

julia> quantile([1,2,3,4,5], fweights([0,1,2,1,0]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 1.0    
 2.42857
 3.0    
 3.57143
 5.0

Note that with R's Hmisc package, weighted and unweighted results are consistent, except in the presence of zero weights (last case). I've filed harrelfe/Hmisc#81 to try to get more details.

> quantile(c(2, 3, 3, 4))
  0%  25%  50%  75% 100% 
2.00 2.75 3.00 3.25 4.00 
> Hmisc::wtd.quantile(c(2,3,3,4), c(1,1,1,1), seq(0, 1, by=.25))
  0%  25%  50%  75% 100% 
2.00 2.75 3.00 3.25 4.00 
> Hmisc::wtd.quantile(c(2,3,4), c(1,2,1), seq(0, 1, by=.25))
  0%  25%  50%  75% 100% 
2.00 2.75 3.00 3.25 4.00 
> Hmisc::wtd.quantile(c(1,2,3,4,5), c(0,1,2,1,0), seq(0, 1, by=.25))
   0%   25%   50%   75%  100% 
2.000 2.750 3.000 3.375 4.500

My understanding is that any subtype of AbstractWeights{<:Integer, <:Integer} is essentially run length encoding, and it may be that StatsBase only need a new quantile method to handle this.

More precisely, this is the case for FrequencyWeights. Other AbstractWeights types have different meanings, though in the context of quantiles I don't think it matters and taking frequency weights as a reference is simpler since it is equivalent to repeating values. Anyway, we don't need a new method, we just need to fix the current one.

For context, my use case is creating approximate quantiles from a histogram (where bins may contain zero observations).

If anyone has a good reference for weighted quantiles (a few google searches returns there's no well-established method), I would be happy to take a look.

I don't know of a good reference, but I've found this PR against Numpy which claims to be equivalent to repeating values: numpy/numpy#9211. Since Numpy is BSD, you can take inspiration from the code without licensing issues.

There are also various implementations at https://stackoverflow.com/questions/21844024/weighted-percentile-using-numpy and https://stackoverflow.com/questions/11585564/definitive-way-to-match-stata-weighted-xtile-command-using-python (the latter claiming to reproduce R's results). You could also look at what Hmisc does (from a quick glance it looks quite short) and describe the algorithm here so that somebody else can implement it without making StatsBase a derivative work of Hmisc.

matthieugomez · 2017-11-07T22:10:47Z

It makes sense to change the current definition so that the output does not depend on values with zero weights.

That being said, it is not possible to have a quantile definition that corresponds to frequency weight and that does not depend on normalization of weights (w.sum). For instance, intuitively, we want

quantile([1, 2], fweights([2, 2]), [0.25]) = 1
#but
quantile([1, 2], fweights([1, 1]), [0.25]) > 1

In particular, we cannot use the frequency definition to handle non integer weights:

quantile([1, 2], fweights([0.1, 0.1]), [0.25])

So I guess we need two definitions of weighted quantile, one for fweights, and another one for other types of weights.

nalimilan · 2017-11-08T08:26:21Z

It makes sense to change the current definition so that the output does not depend on values with zero weights.

Right. Though currently the problem seems to be more general: the weighted quantile is not equivalent to the unweighted quantile with replicated values. Is that expected?

That being said, it is not possible to have a quantile definition that corresponds to frequency weight and that does not depend on normalization of weights (w.sum). [...]
So I guess we need two definitions of weighted quantile, one for fweights, and another one for other types of weights.

That would certainly be possible (that's one of the advantages of having multiple types of weights), but I don't understand why it would be needed. Shouldn't we just take the values of the weights as they are in all cases? What kind of normalization would you want to apply? I'd have thought we should use a formula which is a generalization of what we expect for the special case of integer weights.

BTW, how confident are you that the algorithm you implemented is good? Do you have more references? I guess you have a much better understanding of the possible implementations than I do.

matthieugomez · 2017-11-08T16:30:11Z

I don't think that the notion of quantile with frequency weights can be extended to non integer weights.

The observation that the results with weights = 1 should differ from the result with weights = 2 says two things.

It is unclear that we always want to return the result corresponding to frequency weights because it is sensitive to the sum of weights
Even if we wanted to do that, the frequency weight paradigm cannot give an answer on what to do if all weights equal 0.1.

nalimilan · 2017-11-08T16:35:34Z

When you say "cannot be extended to non integer weights", do you mean that there are several ways of extending it (with no clear reason to choose a solution over another one), or that it's really not possible? The fact that "frequency weight paradigm cannot give an answer on what to do" doesn't mean we can't choose a particular behavior.

matthieugomez · 2017-11-08T16:39:42Z

I mean that there is no clear way to do it. Moreover, any method that does it would make the result depend on the sum of weights. So it would be a bad idea because the result of an operation with non frequency weights should not depend on the sum of weights.

nalimilan · 2017-11-08T17:57:53Z

We could certainly restrict the method to FrequencyWeights for now, at least that would help clarifying its meaning. Hmisc has a normwt option which when set to true normalizes the weights to sum to the length of the input vector, which might be somewhat arbitrary but at least is simple. Anyway the priority should clearly be to have a consistent behavior for frequency weights. Any ideas about how to do that?

nalimilan · 2019-02-01T15:35:20Z

This has been fixed by #316.

julia> quantile([1,2,3,4,5], fweights([0,1,2,1,0]), [0, .25, .5, .75, 1])
5-element Array{Float64,1}:
 2.0
 2.75
 3.0
 3.25
 4.0

julia> quantile([2,3,3,4])
5-element Array{Float64,1}:
 2.0
 2.75
 3.0
 3.25
 4.0

joshday changed the title ~~Quantiles don't seem to respect weights of 0~~ Quantiles don't respect weights of 0 at endpoints Nov 1, 2017

joshday changed the title ~~Quantiles don't respect weights of 0 at endpoints~~ Quantiles don't respect weights of 0 Nov 1, 2017

joshday mentioned this issue Nov 1, 2017

WIP: expand data by FrequencyWeights #314

Closed

matthieugomez mentioned this issue Nov 9, 2017

modify weighted quantile (aweights + fweights) #316

Merged

nalimilan mentioned this issue Dec 9, 2018

Wrong results for weighted quantile #435

Closed

nalimilan closed this as completed Feb 1, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Quantiles don't respect weights of 0 #313

Quantiles don't respect weights of 0 #313

joshday commented Nov 1, 2017

nalimilan commented Nov 3, 2017

joshday commented Nov 6, 2017

nalimilan commented Nov 7, 2017

matthieugomez commented Nov 7, 2017 •

edited

Loading

nalimilan commented Nov 8, 2017

matthieugomez commented Nov 8, 2017 •

edited

Loading

nalimilan commented Nov 8, 2017

matthieugomez commented Nov 8, 2017 •

edited

Loading

nalimilan commented Nov 8, 2017

nalimilan commented Feb 1, 2019

Quantiles don't respect weights of 0 #313

Quantiles don't respect weights of 0 #313

Comments

joshday commented Nov 1, 2017

nalimilan commented Nov 3, 2017

joshday commented Nov 6, 2017

nalimilan commented Nov 7, 2017

matthieugomez commented Nov 7, 2017 • edited Loading

nalimilan commented Nov 8, 2017

matthieugomez commented Nov 8, 2017 • edited Loading

nalimilan commented Nov 8, 2017

matthieugomez commented Nov 8, 2017 • edited Loading

nalimilan commented Nov 8, 2017

nalimilan commented Feb 1, 2019

matthieugomez commented Nov 7, 2017 •

edited

Loading

matthieugomez commented Nov 8, 2017 •

edited

Loading

matthieugomez commented Nov 8, 2017 •

edited

Loading