BUG: weightedVar() returning incorrect (including negative) values #105
Comments
Thanks Pete, I can reproduce. It certainly looks like a bug to me. |
Temporarily remove colWeightedSds() and colWeightedVars() (until HenrikBengtsson/matrixStats#105 is resolved). Some bugs fixed in other *Weighted* methods
Just a heads up; I don't think this is really a bug in the code per se, but rather an oversight in this weighted estimator. After briefly thinking about it, it's likely one wants to use different types of estimators depending on what type of weights one uses and whether the estimator should be unbiased or not. Some serious thought needs to go into this, and ideally we should find prior work on it, i.e. locate trustworthy sources that provide well-studied estimators. Because of this, I would hold off using |
FWIW, |
A reference is https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance which covers estimators for when the weights are (i) "frequency" weights, and when they are (ii) "reliability" / "normalization" weights. I believe this is also what Frank Harrell talks about in his |
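For readers who do not want to follow the link, the two unbiased estimators in that Wikipedia section can be summarized as below. This is a restatement of the linked reference, not of the matrixStats or Hmisc source code; the symbols $V_1$ and $V_2$ follow the article's notation.

```latex
\begin{align*}
\bar{x}_w &= \frac{\sum_i w_i x_i}{\sum_i w_i},
  \qquad V_1 = \sum_i w_i,
  \qquad V_2 = \sum_i w_i^2 \\[4pt]
% (i) frequency weights: w_i counts how many times x_i was observed
s^2_{\mathrm{freq}} &= \frac{\sum_i w_i \,(x_i - \bar{x}_w)^2}{V_1 - 1} \\[4pt]
% (ii) reliability / normalization weights: only relative sizes matter
s^2_{\mathrm{rel}} &= \frac{\sum_i w_i \,(x_i - \bar{x}_w)^2}{V_1 - V_2 / V_1}
\end{align*}
```

Note that the frequency-weight denominator $V_1 - 1$ is zero or negative whenever $\sum_i w_i \le 1$, whereas the reliability-weight estimator is unchanged when all weights are multiplied by a constant.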
A few more notes on this: so, there is no one estimator for the standard error of the weighted mean. Several different ones have been proposed (cf. Gatz & Smith (1995)), and they all have different properties. Because of this, the question is then (a) which set of them, (b) which should be the default, and (c) how to specify which estimator? As a start, I think (a) |
Agreed on (a) and (b). Not sure I understand (c);
> args(Hmisc::wtd.var)
function (x, weights = NULL, normwt = FALSE, na.rm = TRUE, method = c("unbiased",
    "ML"))
NULL
In any case, right now I have no strong opinion on the argument names |
Yeah, but |
Now... in the Hmisc update 4.0-3 => 4.1-0, they actually made the default of
n <- 3
x <- seq_len(n)
w <- c(0.1, 0.2, 0.6)
w2 <- w / min(w)
w3 <- w / sum(w)
s2m1 <- matrixStats::weightedVar(x = x, w = w)
s2m2 <- matrixStats::weightedVar(x = x, w = w2)
s2m3 <- matrixStats::weightedVar(x = x, w = w3)
s2h1 <- Hmisc::wtd.var(x = x, weights = w)
s2h2 <- Hmisc::wtd.var(x = x, weights = w2)
s2h3 <- Hmisc::wtd.var(x = x, weights = w3)
stopifnot(s2m1 == s2h1, s2m2 == s2h2, s2m3 == s2h3)
Now, what's more interesting is that this was also the behavior in Hmisc (<= 4.0.2). So, the original issue you reported only applied to Hmisc 4.0.3, which was available 2017-04-30 -- 2017-12-19 and was the version you happened to use. UPDATE 2018-01-21:
|
Just like in Hmisc (>= 4.1.0),
> options(warn = 1)
> n <- 3
> x <- seq_len(n)
> w <- c(0.1, 0.2, 0.6)
> w2 <- w / min(w)
> w3 <- w / sum(w)
> matrixStats::weightedVar(x = x, w = w)
Warning in matrixStats::weightedVar(x = x, w = w) :
Produced invalid variance estimate, because the weights suggest at most one effective observation (sum(w) <= 1): -4.22222 (wsum = 0.9)
[1] -4.222222
> matrixStats::weightedVar(x = x, w = w2)
[1] 0.5277778
> matrixStats::weightedVar(x = x, w = w3)
Warning in matrixStats::weightedVar(x = x, w = w3) :
Produced invalid variance estimate, because the weights suggest at most one effective observation (sum(w) <= 1): Inf (wsum = 1)
[1] Inf
Maybe one could argue it should return |
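A minimal sketch of where the negative value comes from, using base R only. It assumes the estimate is the frequency-weight form (weighted sum of squares divided by `sum(w) - 1`); that assumption is inferred from the printed values above, not taken from the matrixStats source, and `freq_weighted_var()` is a hypothetical helper name.

```r
# Hypothetical helper: frequency-weight variance computed by hand.
# Assumption: denominator is sum(w) - 1 (inferred from the output above).
freq_weighted_var <- function(x, w) {
  wsum <- sum(w)
  mu <- sum(w * x) / wsum       # weighted mean
  ss <- sum(w * (x - mu)^2)     # weighted sum of squared deviations (>= 0)
  ss / (wsum - 1)               # negative when sum(w) < 1
}

x <- 1:3
w <- c(0.1, 0.2, 0.6)

freq_weighted_var(x, w)           # about -4.2222 (sum(w) = 0.9)
freq_weighted_var(x, w / min(w))  # about 0.5278  (sum(w) = 9)
```

The numerator is never negative, so the sign of the result is decided entirely by `sum(w) - 1`; rescaling the weights changes the answer, which is why `w` and `w2` disagree.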
Interesting that Hmisc went through the same issue! I think this is a good solution for matrixStats. I find it a little surprising that the default option of |
Since the current implementation is no longer a "bug" per se, I'll push forward to release the next version of the package without further modifications to this function. However, I'm certainly open to adding support for alternative estimators here. We need to tread carefully, though, and understand what we're adding, since adding features is easy but removing them is complicated. So, I think identifying and understanding the estimators is key. Then we can start adding them one by one. Adding support for the ones in Hmisc makes sense. |
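As a rough illustration of what "alternative estimators" selected by an argument could look like, here is a sketch in base R. The function name `weighted_var_sketch()` and the `type` argument with its three values are hypothetical and are not part of matrixStats or Hmisc; the denominators follow the Wikipedia section referenced earlier in this thread.

```r
# Hypothetical sketch: one function, estimator chosen via 'type'.
weighted_var_sketch <- function(x, w,
                                type = c("frequency", "reliability", "ML")) {
  type <- match.arg(type)
  v1 <- sum(w)
  mu <- sum(w * x) / v1           # weighted mean
  ss <- sum(w * (x - mu)^2)       # weighted sum of squared deviations
  denom <- switch(type,
    frequency   = v1 - 1,                # weights are observation counts
    reliability = v1 - sum(w^2) / v1,    # weights are relative precisions
    ML          = v1                     # biased (maximum likelihood) form
  )
  ss / denom
}

x <- 1:3
w <- c(0.1, 0.2, 0.6)

weighted_var_sketch(x, w, type = "reliability")           # about 0.95
weighted_var_sketch(x, w / min(w), type = "reliability")  # about 0.95 as well
```

With these example weights the reliability form gives the same value regardless of how the weights are scaled, while the frequency form does not; which of them should be the default is exactly the open question discussed above.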
Sounds good. Do you have an ETA on the next release? |
I'm doing the final cleanup (#119) and then I'll be running the rev dep checks on ~160 packages. If all goes well I'll submit to CRAN after that. |
matrixStats 0.53.0 is now on CRAN |
Hi Henrik,
I encountered a situation where matrixStats::colWeightedSds() was giving me NaN variances. After doing some digging it appears this is due to matrixStats::weightedVar() returning negative variance estimates. Here's a minimal reproducible example.
I also compared matrixStats::weightedVar() to Hmisc::wtd.var() as my understanding is these should give the same result, right?
Thanks,
Pete
Session info