
Wrong results for weighted quantile #435

Closed

andreasnoack opened this issue Dec 7, 2018 · 19 comments

Comments

@andreasnoack
Member

andreasnoack commented Dec 7, 2018

From https://discourse.julialang.org/t/median-vs-50th-quantile-giving-different-answers/18414/4

julia> x = [1; 4; 3; 2; 2.5; 7];

julia> w = [0.1;0.3;0.05;0.05;0.2;0.3];

julia> quantile(x, weights(w), 0.5)
3.5

but

julia> sum(w[x .<= 3.5]) >= 0.5
false

so 3.5 is not a (weighted) median.
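For reference, a quick check (a sketch based on the defining property discussed in this thread, not StatsBase code) shows that 4 does satisfy both weighted-median inequalities for this data, whereas 3.5 fails the first one:

# Direct check of P(X <= v) >= 0.5 and P(X >= v) >= 0.5 for v = 4
x = [1, 4, 3, 2, 2.5, 7]
w = [0.1, 0.3, 0.05, 0.05, 0.2, 0.3]
sum(w[x .<= 4]) >= 0.5  # true (the cumulative weight is 0.7)
sum(w[x .>= 4]) >= 0.5  # true (the cumulative weight is 0.6)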

cc @matthieugomez

@matthieugomez
Contributor

matthieugomez commented Dec 7, 2018

There are multiple ways to extend the base function quantile to the case of weights, and I am happy to discuss how to do it properly.

But it is not obvious to me that the existing function has the wrong output. You reject 3.5 as the median because you say: if there is only one value v in the vector that satisfies P(x <= v) >= p and P(x >= v) >= 1 - p, then v should be the p quantile.

But this is not true with the definition of quantile in Julia. For instance:

x = [1, 2, 3, 4]
df= DataFrame(plow = cumsum([0.25, 0.25, 0.25, 0.25]), phigh = reverse(cumsum(reverse([0.25, 0.25, 0.25, 0.25]))), x = x)
#>│ Row │ plow    │ phigh   │ x     │
#>│     │ Float64 │ Float64 │ Int64 │
#>├─────┼─────────┼─────────┼───────┤
#>│ 1   │ 0.25    │ 1.0     │ 1     │
#>│ 2   │ 0.5     │ 0.75    │ 2     │
#>│ 3   │ 0.75    │ 0.5     │ 3     │
#>│ 4   │ 1.0     │ 0.25    │ 4     │

Julia says that the 0.49 quantile is 2.47 but, according to your definition, it should be 2.
So even though the property happens to be true for the unweighted 0.5 quantile, I am not sure it should be true for the weighted 0.5 quantile.
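For concreteness, a minimal sketch of the type-7 rule (an assumed reimplementation, not Base's actual source) that reproduces the 2.47 above:

# Type-7 quantile (the Base Julia / R default): linear interpolation between the
# order statistics around the fractional index h = (n - 1) * p + 1.
function quantile_type7(x::AbstractVector, p::Real)
    s = sort(x)
    h = (length(s) - 1) * p + 1
    lo, hi = floor(Int, h), ceil(Int, h)
    return s[lo] + (h - lo) * (s[hi] - s[lo])
end

quantile_type7([1, 2, 3, 4], 0.49)  # 2.47, matching quantile([1, 2, 3, 4], 0.49)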

That being said, I think the thread shows two problems with the existing implementation:

  • Whatever the definition of weighted quantile we choose in the end, wmedian should return the same output as wquantile(., 0.5). For now they return different things because they use different definitions of weighted medians.
  • With frequency weights, wquantile fails. This is because I wrote this function assuming that fweights were restricted to integers. How should a frequency weight that is not an integer be interpreted?

@matthieugomez
Contributor

matthieugomez commented Dec 7, 2018

Btw, note that the current definition of weighted median in StatsBase is wrong because the median with frequency weights does not give the same thing as the median of the vector with repeated values.

median([1, 2, 3, 4], fweights([2, 1, 3, 2]))
# 2.5
median([1, 1, 2, 3, 3, 3, 4, 4])
# 3
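A minimal cross-check of the repeated-values interpretation used in the second call above, expanding the data by its frequency weights before taking a plain median:

using Statistics

# Expand each value according to its frequency weight, then take the ordinary
# median; under the fweights interpretation the weighted median should agree.
x  = [1, 2, 3, 4]
fw = [2, 1, 3, 2]
expanded = reduce(vcat, fill.(x, fw))  # [1, 1, 2, 3, 3, 3, 4, 4]
median(expanded)                       # 3.0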

@matthieugomez
Contributor

matthieugomez commented Dec 7, 2018

At the end of the day, quantile type 7 is just not straightforward to extend to weights. I would be happier to have quantile defined as type 1 or 3 in base. It would make it simpler to extend it to weights. It would also make it possible to use quantile with types that are only ordinal. JuliaLang/julia#27367
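To illustrate why a non-interpolating definition generalizes easily, here is a minimal sketch (not a proposal for the actual StatsBase implementation) of a type-1 style weighted quantile, i.e. the inverse of the weighted empirical CDF:

# Type-1 style weighted quantile: return the smallest observation whose
# normalized cumulative weight reaches p. No interpolation is performed,
# so the result is always one of the data points and ordinal data works too.
function wquantile_type1(x::AbstractVector, w::AbstractVector, p::Real)
    perm = sortperm(x)
    cw = cumsum(w[perm]) ./ sum(w)
    i = findfirst(>=(p), cw)
    return x[perm[i]]
end

wquantile_type1([1, 4, 3, 2, 2.5, 7], [0.1, 0.3, 0.05, 0.05, 0.2, 0.3], 0.5)  # 4.0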

@nalimilan
Member

Using the same default as R and NumPy is nice, though. For ordinal data, it sounds OK to require people to choose a different method.

  • Whatever the definition of weighted quantile we choose in the end, wmedian should return the same output as wquantile(., 0.5). For now they return different things because they use different definitions of weighted medians.

Can't we just have it call quantile(x, 0.5)? Or would it be less efficient?

  • With frequency weights, wquantile fails. This is because I wrote this function assuming that fweights were restricted to integers. How should a frequency weight that is not an integer be interpreted?

Cf. previous discussion here: #313 (comment). I guess several generalizations are possible, we just need one which is reasonable (for some definition of that term). Does the current answer (7.0) fit that bill?

@matthieugomez
Contributor

matthieugomez commented Dec 9, 2018

  • quantile(x, 0.5) seems to be as (in)efficient as wmedian so there is no performance reason to have a separate implementation.
x = rand(10_000_000)
w = rand(10_000_000)
@time wquantile(x, w, 0.5)
  1.908000 seconds (44 allocations: 382.676 MiB, 2.88% gc time)
0.5001186970922967

julia> @time wmedian(x, w)
  1.911641 seconds (27 allocations: 231.275 MiB, 2.02% gc time)
0.5001187598955092
  • For the second part, I just don't understand what a non-integer frequency weight means. In Stata, it is impossible to have frequency weights that are not integers. Why is it possible in StatsBase? The previous discussion you linked to is about non-integer probability weights, which is fine.

@nalimilan
Member

For many algorithms there's an obvious generalization of frequency weights to non-integer values. For example, to compute the mean it doesn't matter whether the weight is an integer or not. Generally speaking, it sounds simple to give a meaning to non-integer weights: having two observations with weight w/2 should be equivalent to having one observation with weight w. That said, I'm not sure it's very useful (even though I remember somebody asking for them in another context), so if we're unsure we could throw an error in the presence of non-integer weights for now.
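For the mean, that splitting rule is easy to verify with the existing weighted mean:

using StatsBase

# Splitting one observation of weight 4.0 into two observations of weight 2.0
# each leaves the weighted mean unchanged.
mean([1, 2, 3], weights([4.0, 2.0, 1.0]))          # 11/7 ≈ 1.5714
mean([1, 1, 2, 3], weights([2.0, 2.0, 2.0, 1.0]))  # same value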

Regarding the weighted median, let's remove the separate implementation then.

@matthieugomez
Contributor

But, in your example, it should then be interpreted as a probability weight. I am having a hard time thinking of a case where someone wants non-integer frequency weights to be treated differently from probability weights. From Stata:

fweights, or frequency weights, are weights that indicate the number of duplicated observations.
pweights, or sampling weights, are weights that denote the inverse of the probability that the observation is included because of the sampling design.

@nalimilan
Member

I'm not saying they have to be different when computing quantiles. That's the case e.g. for inference or when computing the Bessel correction for variance, but not necessarily for descriptive statistics.

@matthieugomez
Contributor

While I am doing a pull request, why is wmean deprecated but not wsum or wmedian?

@nalimilan
Member

I guess that's just an oversight.

Do we have any idea why Julia uses a separate method for median rather than calling quantile? Just because it's simple? I'd just like to be sure we're not missing something.

@nalimilan
Member

So apart from #436, does something need to be addressed here @matthieugomez? We could reject non-integer frequency weights, but do you have an example where they give a clearly incorrect result?

@matthieugomez
Contributor

matthieugomez commented Feb 1, 2019 via email

@nalimilan
Member

I don't really know what correct or incorrect result would mean in this situation,

For example, wouldn't the rule I describe above make sense? "having two observations with weight w/2 should be equivalent to having one observation with weight w".

@matthieugomez
Contributor

I'm not sure this rule would work. Remember the example:

quantile([1, 2], fweights([2, 2]), [0.25]) = 1
#but
quantile([1, 2], fweights([1, 1]), [0.25]) > 1

which violates your rule.
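For reference, the two numbers in that example can be reproduced with the plain (type-7) quantile applied to the fweights-expanded data:

using Statistics

quantile([1, 1, 2, 2], 0.25)  # 1.0   (data expanded according to fweights([2, 2]))
quantile([1, 2], 0.25)        # 1.25  (fweights([1, 1]), i.e. the unweighted case)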

@nalimilan
Member

Does it really? What I suggest would only imply quantile([1, 2], fweights([2, 2]), [0.25]) == quantile([1, 2, 1, 2], fweights([1, 1, 1, 1]), [0.25]) == quantile([1, 2, 1, 2, 1, 2, 1, 2], fweights(fill(0.5, 8)), [0.25]), which is already true.

@matthieugomez
Contributor

matthieugomez commented Feb 2, 2019

Oh ok I misunderstood your rule. But then I'm not sure how it generalizes beyond the case where the weights for each value happen to sum up to an integer.

In any case, my point is just that I wrote this code under the wrong assumption that frequency weights were always integer. So until someone develops a new definition, which I'm not capable of, I think it is better to return an error.

@nalimilan
Member

Fine with me. I suspect there's nothing special about integers in this algorithm, but better safe than sorry.

@nalimilan
Member

Fixed by #436.

@perrette

perrette commented Jun 12, 2024

Hi, I clearly lack subtlety about the various definitions of weighted quantiles (and I passed quickly over the above discussion as a result), but I thought I'd share another, more obvious example of what the current implementation is doing:

The frequency weights seem to do what I'd expect:

julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1000, 1, 1, 1]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1000, 1, 1]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1000, 1]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1, 1000]))
4.0

unlike the ProbabilityWeights:

julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1000, 1, 1, 1]/1003))
2.500000000000056
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1000, 1, 1]/1003))
1.501
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1000, 1]/1003))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1, 1000]/1003))
3.499

which seems to:

  • ignore the first weight purely and simply (isn't that a bug ?? -- it's hard to believe it comes down to definition, unless it's a definition for continuous analytics applied bluntly to discrete arrays -- I ask the author(s) to please excuse me, I mean no offense)
  • take the mean between whatever two numbers the (weighted) median falls between, which can never be exactly reached

Worse, it is numerically inaccurate:

julia> eps = 1e-50 ;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
4.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
1.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
3.5

whereas setting epsilon to exactly zero yields the same (and for me, expected) result as FrequencyWeights

julia> eps = 0.;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
4.0

IMO the above shows surprising results that may go beyond the difference between various definitions (especially that the first weight is ignored, and possibly the discontinuity at the limit when the weights tend toward being concentrated on one element, though OK, discontinuities are part of mathematics -- but they often don't help when analyzing real data).

Anyway, for me the workaround will be to multiply my weights by a large number, convert to integers, and use FrequencyWeights instead of ProbabilityWeights.
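A minimal sketch of that workaround (the helper name and the scale factor are made up for illustration):

using StatsBase

# Hypothetical helper: scale probability-style weights up, round to integers,
# and fall back to FrequencyWeights instead of ProbabilityWeights.
function median_via_fweights(x, pw; scale = 10^6)
    fw = round.(Int, pw .* scale)
    return median(x, FrequencyWeights(fw))
end

median_via_fweights([1, 2, 3, 4], [1000, 1, 1, 1] / 1003)  # 1.0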
