Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: weightedVar() returning incorrect (including negative) values #105

Open
PeteHaitch opened this issue Aug 27, 2017 · 14 comments
Open

BUG: weightedVar() returning incorrect (including negative) values #105

PeteHaitch opened this issue Aug 27, 2017 · 14 comments

Comments

@PeteHaitch
Copy link
Contributor

Hi Henrik,

I encountered a situation where matrixStats::colWeightedSds() was giving me NaN variances. After doing some digging it appears this is due to matrixStats::weightedVar() returning negative variance estimates. Here's a minimal reproducible example.

I also compared matrixStats::weightedVar() to Hmisc::wtd.var() as my understanding is these should give the same result, right?

Thanks,
Pete

n <- 3
x <- seq_len(n)
w <- c(0.1, 0.2, 0.6)
w2 <- w / min(w)
w3 <- w / sum(w)

matrixStats::weightedVar(x = x, w = w)
#> [1] -4.222222
matrixStats::weightedVar(x = x, w = w2)
#> [1] 0.5277778
matrixStats::weightedVar(x = x, w = w3)
#> [1] Inf

Hmisc::wtd.var(x = x, weights = w)
#> [1] 0.95
Hmisc::wtd.var(x = x, weights = w2)
#> [1] 0.95
Hmisc::wtd.var(x = x, weights = w3)
#> [1] 0.95
Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.1 (2017-06-30)
#>  system   x86_64, darwin15.6.0        
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_AU.UTF-8                 
#>  tz       America/New_York            
#>  date     2017-08-27
#> Packages -----------------------------------------------------------------
#>  package      * version date       source                         
#>  acepack        1.4.1   2016-10-29 CRAN (R 3.4.0)                 
#>  backports      1.1.0   2017-05-22 CRAN (R 3.4.0)                 
#>  base         * 3.4.1   2017-07-07 local                          
#>  base64enc      0.1-3   2015-07-28 CRAN (R 3.4.0)                 
#>  checkmate      1.8.3   2017-07-03 CRAN (R 3.4.1)                 
#>  cluster        2.0.6   2017-03-10 CRAN (R 3.4.1)                 
#>  colorspace     1.3-2   2016-12-14 CRAN (R 3.4.0)                 
#>  compiler       3.4.1   2017-07-07 local                          
#>  data.table     1.10.4  2017-02-01 CRAN (R 3.4.0)                 
#>  datasets     * 3.4.1   2017-07-07 local                          
#>  devtools       1.13.3  2017-08-02 CRAN (R 3.4.1)                 
#>  digest         0.6.12  2017-01-27 CRAN (R 3.4.0)                 
#>  evaluate       0.10.1  2017-06-24 CRAN (R 3.4.0)                 
#>  foreign        0.8-69  2017-06-22 CRAN (R 3.4.1)                 
#>  Formula        1.2-2   2017-07-10 CRAN (R 3.4.1)                 
#>  ggplot2        2.2.1   2016-12-30 CRAN (R 3.4.0)                 
#>  graphics     * 3.4.1   2017-07-07 local                          
#>  grDevices    * 3.4.1   2017-07-07 local                          
#>  grid           3.4.1   2017-07-07 local                          
#>  gridExtra      2.2.1   2016-02-29 CRAN (R 3.4.0)                 
#>  gtable         0.2.0   2016-02-26 CRAN (R 3.4.0)                 
#>  Hmisc          4.0-3   2017-05-02 CRAN (R 3.4.0)                 
#>  htmlTable      1.9     2017-01-26 CRAN (R 3.4.0)                 
#>  htmltools      0.3.6   2017-04-28 CRAN (R 3.4.0)                 
#>  htmlwidgets    0.9     2017-07-10 CRAN (R 3.4.0)                 
#>  knitr          1.17    2017-08-10 CRAN (R 3.4.1)                 
#>  lattice        0.20-35 2017-03-25 CRAN (R 3.4.1)                 
#>  latticeExtra   0.6-28  2016-02-09 CRAN (R 3.4.0)                 
#>  lazyeval       0.2.0   2016-06-12 CRAN (R 3.4.0)                 
#>  magrittr       1.5     2014-11-22 CRAN (R 3.4.0)                 
#>  Matrix         1.2-11  2017-08-16 CRAN (R 3.4.1)                 
#>  matrixStats    0.52.2  2017-04-14 CRAN (R 3.4.0)                 
#>  memoise        1.1.0   2017-05-26 Github (hadley/memoise@e372cde)
#>  methods      * 3.4.1   2017-07-07 local                          
#>  munsell        0.4.3   2016-02-13 CRAN (R 3.4.0)                 
#>  nnet           7.3-12  2016-02-02 CRAN (R 3.4.1)                 
#>  plyr           1.8.4   2016-06-08 CRAN (R 3.4.0)                 
#>  RColorBrewer   1.1-2   2014-12-07 CRAN (R 3.4.0)                 
#>  Rcpp           0.12.12 2017-07-15 CRAN (R 3.4.0)                 
#>  rlang          0.1.2   2017-08-09 CRAN (R 3.4.1)                 
#>  rmarkdown      1.6     2017-06-15 CRAN (R 3.4.0)                 
#>  rpart          4.1-11  2017-03-13 CRAN (R 3.4.1)                 
#>  rprojroot      1.2     2017-01-16 CRAN (R 3.4.0)                 
#>  scales         0.4.1   2016-11-09 CRAN (R 3.4.0)                 
#>  splines        3.4.1   2017-07-07 local                          
#>  stats        * 3.4.1   2017-07-07 local                          
#>  stringi        1.1.5   2017-04-07 CRAN (R 3.4.0)                 
#>  stringr        1.2.0   2017-02-18 CRAN (R 3.4.0)                 
#>  survival       2.41-3  2017-04-04 CRAN (R 3.4.1)                 
#>  tibble         1.3.4   2017-08-22 CRAN (R 3.4.1)                 
#>  tools          3.4.1   2017-07-07 local                          
#>  utils        * 3.4.1   2017-07-07 local                          
#>  withr          2.0.0   2017-07-28 CRAN (R 3.4.1)                 
#>  yaml           2.1.14  2016-11-12 CRAN (R 3.4.0)
@HenrikBengtsson
Copy link
Owner

Thanks Pete, I can reproduce. It certainly looks like a bug to me.

@HenrikBengtsson HenrikBengtsson added this to the Next release milestone Aug 27, 2017
PeteHaitch added a commit to PeteHaitch/DelayedMatrixStats that referenced this issue Aug 27, 2017
Temporarilily remove colWeightedSds() and colWeightedVars() (until HenrikBengtsson/matrixStats#105 is resolved)
Some bug fixed to other *Weighted* methods
@HenrikBengtsson
Copy link
Owner

Just a heads up; I don't think this is not really a bug in the code per se, but rather a oversight in this weighted estimator. After briefly thinking about it, it's likely one wants to us different types of estimators depending on what type of weights one uses and whether the estimator should be unbiased or not. Some serious thoughts needs to go into this and ideal we should find prior work on this, i.e. locate trustworthy sources that provide well studied estimators. Because of this, I would hold off using matrixStats::weightedVar() until resolved.

@PeteHaitch
Copy link
Contributor Author

FWIW, Hmisc::wtd.var() ultimately calls stats::cov.wt() after some processing of the weights. Unfortunately, in a cursory look the GitHub mirror of the R source, I couldn't find much about who implemented stats::cov.wt() or a reference.

@HenrikBengtsson
Copy link
Owner

A reference is https://en.wikipedia.org/wiki/Weighted_arithmetic_mean#Weighted_sample_variance which covers estimators for when the weights are (i) "frequency" weights, and when they are (ii) "reliability" / "normalization" weights. I believe this is also what Frank Harrell talks about his ?Hmisc::wtd.var.

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Nov 26, 2017

A few more notes on this: so, there is no one estimator for the standard error of the weighted mean. Several different has been proposed (cf. Gatz & Smith (1995) and they all have different properties. Because of this, weightedVar() needs to implement different estimators.

The question is then (a) which set of them, (b) which should be the default, and (c) how to specify which estimator? As a started I think (a) weightedVar() should support the estimators that Hmisc::wtd.var() supports, and (b) use the same default. The rationale for that is based on trust in Harrell's work. For (c), contrary to Hmisc::wtd.var(), we might use a single argument called method (character string) to specify what type of estimator. That will allow us to add more exotic estimators in the future, or emulate other software (e.g. WinCross, Quantum, SPSS) if not already supported by existing methods.

@PeteHaitch
Copy link
Contributor Author

Agreed on (a) and (b). Not sure I understand (c); Hmis::wtd.var() has a method argument

> args(Hmisc::wtd.var)
function (x, weights = NULL, normwt = FALSE, na.rm = TRUE, method = c("unbiased", 
    "ML")) 
NULL

In any case, right now I have no strong opinion on the argument names

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Nov 26, 2017

Yeah, but normwt also comes in play. So, that should (probably) be incorporated in method.

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Jan 22, 2018

Now... in the Hmisc update 4.0-3 => 4.1-0, they actually made the default of wtd.var() work identically with weightedVar(). Example:

n <- 3
x <- seq_len(n)
w <- c(0.1, 0.2, 0.6)
w2 <- w / min(w)
w3 <- w / sum(w)

s2m1 <- matrixStats::weightedVar(x = x, w = w)
s2m2 <- matrixStats::weightedVar(x = x, w = w2)
s2m3 <- matrixStats::weightedVar(x = x, w = w3)

s2h1 <- Hmisc::wtd.var(x = x, weights = w)
s2h2 <- Hmisc::wtd.var(x = x, weights = w2)
s2h3 <- Hmisc::wtd.var(x = x, weights = w3)

stopifnot(s2m1 == s2h1, s2m2 == s2h2, s2m3 == s2h3)

Now, what's more interesting is that this was also the behavior in Hmisc (<= 4.0.2). So, the original issue you reported did only apply to Hmisc 4.0.3, which was available 2017-04-30 -- 2017-12-19, and the version you happened use.

UPDATE 2018-01-21:

@HenrikBengtsson
Copy link
Owner

Just like in Hmisc (>= 4.1.0), weightedVar(x, w) now produces a warning when the estimate is invalid:

> options(warn = 1)

> n <- 3
> x <- seq_len(n)
> w <- c(0.1, 0.2, 0.6)
> w2 <- w / min(w)
> w3 <- w / sum(w)

> matrixStats::weightedVar(x = x, w = w)
Warning in matrixStats::weightedVar(x = x, w = w) :
  Produced invalid variance estimate, because the weights suggest at most one effective observation (sum(w) <= 1): -4.22222 (wsum = 0.9)
[1] -4.222222

> matrixStats::weightedVar(x = x, w = w2)
[1] 0.5277778

> matrixStats::weightedVar(x = x, w = w3)
Warning in matrixStats::weightedVar(x = x, w = w3) :
  Produced invalid variance estimate, because the weights suggest at most one effective observation (sum(w) <= 1): Inf (wsum = 1)
[1] Inf

Maybe one could argue it should return NaN or similar.

@PeteHaitch
Copy link
Contributor Author

Interesting that Hmisc went through the same issue! I think this is a good solution for matrixStats.

I find it a little surprising that the default option of Hmisc::wtd.var() is normwt = FALSE which consequently means that scaling the weights can lead to (vastly) different estimates and even 'invalid' estimates. But clearly people who think about weighting will know better than me. What do you think about adding a normwt-type argument to matrixStats::weightedVar()?

@HenrikBengtsson
Copy link
Owner

Since the current implementation is no longer a "bug" per se, I'll push forward to release the next version of the package without further modifications to this function.

However, I'm certainly open to add support for alternative estimators here. We need to strive carefully though and understanding what we're adding since adding features is easy but removing them is complicated.

So, I think identifying and understand estimators is key. Then we can start adding them one by one. Adding support for the ones in Hmisc makes sense.

@PeteHaitch
Copy link
Contributor Author

Sounds good. Do you have an ETA on the next release?

@HenrikBengtsson
Copy link
Owner

HenrikBengtsson commented Jan 22, 2018

I'm doing the final cleanup (#119) and then I'll be running the rev dep checks on ~160 packages. If all go well I'll submit to CRAN after that.

@HenrikBengtsson
Copy link
Owner

matrixStats 0.53.0 is now on CRAN

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants