
Wrong results for weighted quantile #435

Closed

andreasnoack opened this issue Dec 7, 2018 · 19 comments

Comments

@andreasnoack
Member

andreasnoack commented Dec 7, 2018

From https://discourse.julialang.org/t/median-vs-50th-quantile-giving-different-answers/18414/4

julia> x = [1; 4; 3; 2; 2.5; 7];

julia> w = [0.1;0.3;0.05;0.05;0.2;0.3];

julia> quantile(x, weights(w), 0.5)
3.5

but

julia> sum(w[x .<= 3.5]) >= 0.5
false

so 3.5 is not a (weighted) median.
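For reference, a quick check (a sketch based on the defining property discussed in this thread, not StatsBase code) shows that 4 does satisfy both weighted-median inequalities for this data, whereas 3.5 fails the first one:

# Direct check of P(X <= v) >= 0.5 and P(X >= v) >= 0.5 for v = 4
x = [1, 4, 3, 2, 2.5, 7]
w = [0.1, 0.3, 0.05, 0.05, 0.2, 0.3]
sum(w[x .<= 4]) >= 0.5  # true (the cumulative weight is 0.7)
sum(w[x .>= 4]) >= 0.5  # true (the cumulative weight is 0.6)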

cc @matthieugomez

@matthieugomez
Contributor

matthieugomez commented Dec 7, 2018

There are multiple ways to extend the base function quantile to the case of weights, and I am happy to discuss how to do it properly.

But it is not obvious to me that the existing function has the wrong output. You reject 3.5 as the median because you say: if there is only one value v in the vector that satisfies P(x <= v) >= p and P(x >= v) >= 1 - p, then v should be the p quantile.

But this is not true with the definition of quantile in Julia. For instance:

x = [1, 2, 3, 4]
df= DataFrame(plow = cumsum([0.25, 0.25, 0.25, 0.25]), phigh = reverse(cumsum(reverse([0.25, 0.25, 0.25, 0.25]))), x = x)
#>│ Row │ plow    │ phigh   │ x     │
#>│     │ Float64 │ Float64 │ Int64 │
#>├─────┼─────────┼─────────┼───────┤
#>│ 1   │ 0.25    │ 1.0     │ 1     │
#>│ 2   │ 0.5     │ 0.75    │ 2     │
#>│ 3   │ 0.75    │ 0.5     │ 3     │
#>│ 4   │ 1.0     │ 0.25    │ 4     │

Julia says that the 0.49 quantile is 2.47 but, according to your definition, it should be 2.
So even though the property happens to be true for the unweighted 0.5 quantile, I am not sure it should be true for the weighted 0.5 quantile.
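For concreteness, a minimal sketch of the type-7 rule (an assumed reimplementation, not Base's actual source) that reproduces the 2.47 above:

# Type-7 quantile (the Base Julia / R default): linear interpolation between the
# order statistics around the fractional index h = (n - 1) * p + 1.
function quantile_type7(x::AbstractVector, p::Real)
    s = sort(x)
    h = (length(s) - 1) * p + 1
    lo, hi = floor(Int, h), ceil(Int, h)
    return s[lo] + (h - lo) * (s[hi] - s[lo])
end

quantile_type7([1, 2, 3, 4], 0.49)  # 2.47, matching quantile([1, 2, 3, 4], 0.49)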

That being said, I think the thread shows two problems with the existing implementation:

  • Whatever the definition of weighted quantile we choose in the end, wmedian should return the same output as wquantile(., 0.5). For now they return different things because they use different definitions of weighted medians.
  • With frequency weights, wquantile fails. This is because I wrote this function assuming that fweights were restricted to integers. How should a frequency weight that is not an integer be interpreted?

@matthieugomez
Contributor

matthieugomez commented Dec 7, 2018

Btw, note that the current definition of weighted median in StatsBase is wrong because the median with frequency weights does not give the same thing as the median of the vector with repeated values.

median([1, 2, 3, 4], fweights([2, 1, 3, 2]))
# 2.5
median([1, 1, 2, 3, 3, 3, 4, 4])
# 3
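A minimal cross-check of the repeated-values interpretation used in the second call above, expanding the data by its frequency weights before taking a plain median:

using Statistics

# Expand each value according to its frequency weight, then take the ordinary
# median; under the fweights interpretation the weighted median should agree.
x  = [1, 2, 3, 4]
fw = [2, 1, 3, 2]
expanded = reduce(vcat, fill.(x, fw))  # [1, 1, 2, 3, 3, 3, 4, 4]
median(expanded)                       # 3.0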

@matthieugomez
Contributor

matthieugomez commented Dec 7, 2018

At the end of the day, quantile type 7 is just not straightforward to extend to weights. I would be happier to have quantile defined as type 1 or 3 in base. It would make it simpler to extend it to weights. It would also make it possible to use quantile with types that are only ordinal. JuliaLang/julia#27367
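To illustrate why a non-interpolating definition generalizes easily, here is a minimal sketch (not a proposal for the actual StatsBase implementation) of a type-1 style weighted quantile, i.e. the inverse of the weighted empirical CDF:

# Type-1 style weighted quantile: return the smallest observation whose
# normalized cumulative weight reaches p. No interpolation is performed,
# so the result is always one of the data points and ordinal data works too.
function wquantile_type1(x::AbstractVector, w::AbstractVector, p::Real)
    perm = sortperm(x)
    cw = cumsum(w[perm]) ./ sum(w)
    i = findfirst(>=(p), cw)
    return x[perm[i]]
end

wquantile_type1([1, 4, 3, 2, 2.5, 7], [0.1, 0.3, 0.05, 0.05, 0.2, 0.3], 0.5)  # 4.0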

@nalimilan
Member

Using the same default as R and NumPy is nice, though. For ordinal data, it sounds OK to require people to choose a different method.

  • Whatever the definition of weighted quantile we choose in the end, wmedian should return the same output as wquantile(., 0.5). For now they return different things because they use different definitions of weighted medians.

Can't we just have it call quantile(x, 0.5)? Or would it be less efficient?

  • With frequency weights, wquantile fails. This is because I wrote this function assuming that fweights were restricted to integers. How should a frequency weight that is not an integer be interpreted?

Cf. previous discussion here: #313 (comment). I guess several generalizations are possible, we just need one which is reasonable (for some definition of that term). Does the current answer (7.0) fit that bill?

@matthieugomez
Contributor

matthieugomez commented Dec 9, 2018

  • quantile(x, 0.5) seems to be as (in)efficient as wmedian so there is no performance reason to have a separate implementation.
x = rand(10_000_000)
w = rand(10_000_000)
@time wquantile(x, w, 0.5)
  1.908000 seconds (44 allocations: 382.676 MiB, 2.88% gc time)
0.5001186970922967

julia> @time wmedian(x, w)
  1.911641 seconds (27 allocations: 231.275 MiB, 2.02% gc time)
0.5001187598955092
  • For the second part, I just don't understand what a non-integer frequency weight means. In Stata, it is impossible to have frequency weights that are not integers. Why is it possible in StatsBase? The previous discussion you linked to is about non-integer probability weights, which is fine.

@nalimilan
Member

For many algorithms there's an obvious generalization of frequency weights to non-integer values. For example, to compute the mean it doesn't matter whether the weight is an integer or not. Generally speaking, it sounds simple to give a meaning to non-integer weights: having two observations with weight w/2 should be equivalent to having one observation with weight w. That said, I'm not sure it's very useful (even though I remember somebody asking for them in another context), so if we're unsure we could throw an error in the presence of non-integer weights for now.
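For the mean, that splitting rule is easy to verify with the existing weighted mean:

using StatsBase

# Splitting one observation of weight 4.0 into two observations of weight 2.0
# each leaves the weighted mean unchanged.
mean([1, 2, 3], weights([4.0, 2.0, 1.0]))          # 11/7 ≈ 1.5714
mean([1, 1, 2, 3], weights([2.0, 2.0, 2.0, 1.0]))  # same value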

Regarding the weighted median, let's remove the separate implementation then.

@matthieugomez
Contributor

But, in your example, it should then be interpreted as a probability weight. I am having a hard time thinking of a case where someone wants non-integer frequency weights to be treated differently from probability weights. From Stata:

fweights, or frequency weights, are weights that indicate the number of duplicated observations.
pweights, or sampling weights, are weights that denote the inverse of the probability that the observation is included because of the sampling design.

@nalimilan
Member

I'm not saying they have to be different when computing quantiles. That's the case e.g. for inference or when computing the Bessel correction for variance, but not necessarily for descriptive statistics.

@matthieugomez
Contributor

While I am doing a pull request, why is wmean deprecated but not wsum or wmedian?

@nalimilan
Member

I guess that's just an oversight.

Do we have any idea why Julia uses a separate method for median rather than calling quantile? Just because it's simple? I'd just like to be sure we're not missing something.

@nalimilan
Member

So apart from #436, does something need to be addressed here @matthieugomez? We could reject non-integer frequency weights, but do you have an example where they give a clearly incorrect result?

@matthieugomez
Contributor

matthieugomez commented Feb 1, 2019 via email

@nalimilan
Member

I don't really know what correct or incorrect result would mean in this situation,

For example, wouldn't the rule I describe above make sense? "having two observations with weight w/2 should be equivalent to having one observation with weight w".

@matthieugomez
Contributor

I'm not sure this rule would work. Remember the example:

quantile([1, 2], fweights([2, 2]), [0.25]) = 1
#but
quantile([1, 2], fweights([1, 1]), [0.25]) > 1

which violates your rule.
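For reference, the two numbers in that example can be reproduced with the plain (type-7) quantile applied to the fweights-expanded data:

using Statistics

quantile([1, 1, 2, 2], 0.25)  # 1.0   (data expanded according to fweights([2, 2]))
quantile([1, 2], 0.25)        # 1.25  (fweights([1, 1]), i.e. the unweighted case)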

@nalimilan
Member

Does it really? What I suggest would only imply quantile([1, 2], fweights([2, 2]), [0.25]) == quantile([1, 2, 1, 2], fweights([1, 1, 1, 1]), [0.25]) == quantile([1, 2, 1, 2, 1, 2, 1, 2], fweights(fill(0.5, 8)), [0.25]), which is already true.

@matthieugomez
Contributor

matthieugomez commented Feb 2, 2019

Oh ok I misunderstood your rule. But then I'm not sure how it generalizes beyond the case where the weights for each value happen to sum up to an integer.

In any case, my point is just that I wrote this code under the wrong assumption that frequency weights were always integer. So until someone develops a new definition, which I'm not capable of, I think it is better to return an error.

@nalimilan
Member

Fine with me. I suspect there's nothing special about integers in this algorithm, but better safe than sorry.

@nalimilan
Member

Fixed by #436.

@perrette

perrette commented Jun 12, 2024

Hi, I clearly lack subtlety about the various definitions of weighted quantiles (and I passed quickly over the above discussion as a result), but I thought I'd share another, more obvious example of what the current implementation is doing:

The frequency weights seem to do what I'd expect:

julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1000, 1, 1, 1]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1000, 1, 1]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1000, 1]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1, 1000]))
4.0

unlike the ProbabilityWeights:

julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1000, 1, 1, 1]/1003))
2.500000000000056
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1000, 1, 1]/1003))
1.501
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1000, 1]/1003))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1, 1000]/1003))
3.499

which seems to:

  • ignore the first weight purely and simply (isn't that a bug ?? -- it's hard to believe it comes down to definition, unless it's a definition for continuous analytics applied bluntly to discrete arrays -- I ask the author(s) to please excuse me, I mean no offense)
  • take the mean between whatever two numbers the (weighted) median falls between, which can never be exactly reached

Worse, it is numerically inaccurate:

julia> eps = 1e-50 ;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
4.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
1.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
3.5

whereas setting epsilon to exactly zero yields the same (and for me, expected) result as FrequencyWeights

julia> eps = 0.;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
4.0

IMO the above shows surprising results that may go beyond the difference between various definitions (especially that the first weight is ignored, and possibly the discontinuity at the limit when the weights tend toward being concentrated on one element, though OK, discontinuities are part of mathematics -- but they often don't help when analyzing real data).

Anyway, for me the workaround will be to multiply my weights by a large number, convert to integers, and use FrequencyWeights instead of ProbabilityWeights.
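A minimal sketch of that workaround (the helper name and the scale factor are made up for illustration):

using StatsBase

# Hypothetical helper: scale probability-style weights up, round to integers,
# and fall back to FrequencyWeights instead of ProbabilityWeights.
function median_via_fweights(x, pw; scale = 10^6)
    fw = round.(Int, pw .* scale)
    return median(x, FrequencyWeights(fw))
end

median_via_fweights([1, 2, 3, 4], [1000, 1, 1, 1] / 1003)  # 1.0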
