The HodgesLehmann mean appears to have a bug #97

Open
tkpmep opened this issue Dec 7, 2022 · 28 comments
@tkpmep

tkpmep commented Dec 7, 2022

The HodgesLehmann function appears to give the wrong answer, even in simple cases:

Reprex:
DescTools::HodgesLehmann(c(0.7, 0.6, 0.5)) = 0.6. This is correct - it is the median of (0.7, 0.6, 0.5, 0.65, 0.6, 0.55), which is 0.6 (the average of 0.6 and 0.6).

But
DescTools::HodgesLehmann(c(0.7, 0.5, 0.5)) = 0.5828427. This is incorrect - it should be the median of (0.7, 0.6, 0.6, 0.5, 0.5, 0.5), which is 0.55 (the average of 0.6 and 0.5).
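For reference, the Walsh-average arithmetic above is easy to check directly in plain R (a brute-force sketch, independent of DescTools; walsh_median is just a throwaway name):

```r
# Hodges-Lehmann estimate by brute force: the median of all pairwise
# means (x_i + x_j)/2 with i <= j (the Walsh averages).
walsh_median <- function(x) {
  w <- outer(x, x, "+") / 2              # all pairwise means
  median(w[lower.tri(w, diag = TRUE)])   # keep each pair once, incl. i == j
}

walsh_median(c(0.7, 0.6, 0.5))  # 0.6
walsh_median(c(0.7, 0.5, 0.5))  # 0.55, not 0.5828427
```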

@AndriSignorell
Owner

This is indeed strange behaviour. The HL estimator in DescTools uses the estimate produced by the base R function wilcox.test, and that function yields exactly this result:

wilcox.test(c(0.7, 0.5, 0.5), exact=F, conf.int=T, conf.level = 0.95)

I have to investigate this further...

@tkpmep
Author

tkpmep commented Jan 14, 2023 via email

@tkpmep
Author

tkpmep commented May 1, 2023

Hello Andri,
Have you made any progress on this?

Sincerely

Tom

@AndriSignorell
Owner

Not yet, sorry - I'll keep it open...

@AndriSignorell
Owner

I did some literature research. It turns out that the existence of tied values forces wilcox.test to switch to an approximate algorithm that returns strange results.
A good description can be found at: https://aakinshin.net/posts/r-hodges-lehmann-problems/
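The effect is easy to reproduce directly: with conf.int = TRUE, wilcox.test stores its pseudomedian in $estimate, and with ties present it drifts away from the exact median of the Walsh averages (a quick sketch using only base R):

```r
x <- c(0.7, 0.5, 0.5)

# wilcox.test's pseudomedian: the ties push it onto the approximate code path
approx_est <- suppressWarnings(wilcox.test(x, conf.int = TRUE)$estimate)
approx_est  # ~0.583

# exact median of the Walsh averages, for comparison
w <- outer(x, x, "+") / 2
median(w[lower.tri(w, diag = TRUE)])  # 0.55
```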

I found the following alternative implementations for the HL-estimator:

f.HodgesLehmann <- function(x)
{
  ## Purpose:   Hodges-Lehmann estimate and confidence interval
  ## -------------------------------------------------------------------------
  ## Arguments:
  ## Remark: function changed so that CI covers >= 95%, before it was too
  ##         small (9/22/04)
  ## -------------------------------------------------------------------------
  ## Author: Werner Stahel, Date: 12 Aug 2002, 14:13
  ## Update: Beat Jaggi, Date: 22 Sept 2004
  .cexact <-
    # c(NA,NA,NA,NA,NA,21,26,33,40,47,56,65,74,84,95,107,119,131,144,158)
    c(NA,NA,NA,NA,NA,22,27,34,41,48,57,66,75,85,96,108,120,132,145,159)
  .d <- na.omit(x)
  .n <- length(.d)
  .wa <- sort(c(outer(.d,.d,"+")/2)[outer(1:.n,1:.n,"<=")])
  .c <- if (.n<=length(.cexact)) .n*(.n+1)/2+1-.cexact[.n] else
    floor(.n*(.n+1)/4-1.96*sqrt(.n*(.n+1)*(2*.n+1)/24))
  .r <- c(median(.wa), .wa[c(.c,.n*(.n+1)/2+1-.c)])
  names(.r) <- c("estimate","lower","upper")
  .r
}


hl <- function(x, y = NULL) {
  # https://aakinshin.net/posts/r-hodges-lehmann-problems/
  if (is.null(y)) {
    walsh <- outer(x, x, "+") / 2
    median(walsh[lower.tri(walsh, diag = TRUE)])
  } else {
    median(outer(x, y, "-"))
  }
}

A <- c(50.6, 39.2, 35.2, 17, 11.2, 14.2, 24.2, 37.4, 35.2)
B  <- c(38, 18.6, 23.2, 19, 6.6, 16.4, 14.4, 37.6, 24.4) 

wilcox.test(A, B, conf.int=T)
wilcox.test(A, B, conf.int=T, exact=T)

with(asht::wsrTest(A, B), unname(c(estimate, conf.int)))
rQCC::HL(A-B, estimator = "HL2")
ICSNP::hl.loc(A-B)
f.HodgesLehmann(A-B)
hl(A, B)


# performance 

set.seed(8923)
x <- sample(round(rnorm(1000), 3), 1e3, replace = TRUE)

library(microbenchmark)
microbenchmark(
  wilcox.test(x),
  wilcox.test(x, conf.int = TRUE),
  wilcox.test(x, exact = TRUE, conf.int = TRUE),
  asht = asht::wsrTest(x),
  rQCC = rQCC::HL(x),
  ICSNP = ICSNP::hl.loc(x),
  Stahel = f.HodgesLehmann(x),
  akinshin = hl(x)
)

rQCC turns out to perform best; the exact wsrTest() has massive overhead for merely computing the estimator, and so does wilcox.test.

So I will change the implementation to one of these candidates, or else try to add some engineering of my own.
(All exact implementations fail with large vectors; a screenshot of the resulting error was attached here.)
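For what it's worth, the large-vector failures are inherent to the O(N^2) approaches: outer() materialises the full N x N matrix of sums before anything is medianed. A back-of-the-envelope sketch (outer_gib is just a throwaway name):

```r
# Memory occupied by the N x N double matrix that outer() allocates
# (8 bytes per element); the Walsh-average vector itself is about half that.
outer_gib <- function(n) n^2 * 8 / 2^30

outer_gib(1e4)  # ~0.75 GiB, still workable
outer_gib(1e5)  # ~74.5 GiB, exhausts vector memory on ordinary machines
```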

So, stay tuned, or better yet, collaborate! ;-)

@tkpmep
Author

tkpmep commented May 17, 2023

Hello Andri,
I'm more than happy to work with you, but must confess that I am, at best, a mediocre programmer - I'm one of the people who puts the Overflow into StackOverflow. That said, there is an O(N log N) implementation due to John Monahan (Algorithm 616: Fast Computation of the Hodges-Lehmann Location Estimator, 1983), and its description is accessible at https://dl.acm.org/doi/pdf/10.1145/1271.319414. While the paper does not give the full code, it can be downloaded from https://calgo.acm.org/ (click the link at 616.gz). It contains two simple O(N^2) implementations as a sanity check, as well as the actual fast implementation, which is about 200 lines of spaghetti Fortran. I have attached both files - let me know how best we can work together.
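For intuition (this is my own illustration, not Monahan's ACM code, which performs a far cleverer selection on the implicit matrix of Walsh averages): the k-th smallest Walsh average can be found without materialising all N(N+1)/2 of them, because the number of Walsh averages below any threshold is countable in O(N) with two pointers over the sorted sample. A bisection sketch along those lines:

```r
# Illustrative sketch only -- not Algorithm 616 itself.  Finds the k-th
# smallest Walsh average by bisection on the value axis, with an O(N)
# two-pointer count per step, so no N x N matrix is ever built.
hl_select_sketch <- function(x) {
  x <- sort(x)
  n <- length(x)
  m <- n * (n + 1) / 2                    # number of Walsh averages

  # how many Walsh averages (x[i] + x[j])/2, i <= j, are <= t
  count_le <- function(t) {
    s <- 2 * t; j <- n; cnt <- 0
    for (i in 1:n) {
      while (j >= i && x[i] + x[j] > s) j <- j - 1
      if (j < i) break                    # later rows contribute nothing
      cnt <- cnt + (j - i + 1)
    }
    cnt
  }

  # smallest Walsh average strictly greater than t
  next_above <- function(t) {
    s <- 2 * t; best <- Inf; j <- n
    for (i in 1:n) {
      if (j < i) j <- i                   # keep the pair constraint j >= i
      while (j > i && x[i] + x[j - 1] > s) j <- j - 1
      if (x[i] + x[j] > s) best <- min(best, (x[i] + x[j]) / 2)
    }
    best
  }

  kth <- function(k) {                    # k-th smallest Walsh average
    lo <- x[1] - 1; hi <- x[n]            # count_le(lo) = 0, count_le(hi) = m
    for (it in 1:100) {                   # bisect until lo, hi are adjacent
      mid <- (lo + hi) / 2
      if (count_le(mid) >= k) hi <- mid else lo <- mid
    }
    next_above(lo)                        # exact value, not the midpoint
  }

  if (m %% 2 == 1) kth((m + 1) / 2) else (kth(m / 2) + kth(m / 2 + 1)) / 2
}
```

On the reprex from this thread, hl_select_sketch(c(0.7, 0.5, 0.5)) gives 0.55, matching the brute-force median of the Walsh averages.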

Separately, there is a robust t-test package, rt.test, on CRAN (https://cran.r-project.org/web/packages/rt.test/index.html), and it would be worth looking at its implementation as well.

Sincerely

Tom

616.txt
Monahan Algorithm 616 Fast Computation of the Hodges-Lehmann Location Estimator.pdf

@AndriSignorell
Owner

This is an excellent reference, thank you! I also did some research myself; Monahan's algorithm seems to be the only available candidate.
I will try to translate the Fortran code into C++. I'll report back when I've made some progress.

@AndriSignorell
Owner

Oh, btw, the authors of https://cran.r-project.org/web/packages/rt.test use the known O(N^2) outer(x, y, "-") approach.
Nothing new there...

@tkpmep
Author

tkpmep commented May 22, 2023

I'm glad you found it useful. I'm afraid I have not ever used C++, but even so, I'm happy to work with you on implementing Monahan's algorithm.

Sincerely

Tom

@tkpmep
Author

tkpmep commented Jun 29, 2023

Hello Andri,
Have you made any progress on implementing Monahan's algorithm?

Sincerely

Tom

@AndriSignorell
Owner

@tkpmep, @CyrilFMoser: news regarding Monahan's algorithm. The algorithm was implemented in C++ by Cyril in version 0.99.52. In a further step we will try to complete the Hodges-Lehmann-Sen estimator for two samples, including the respective confidence intervals.

@tkpmep
Author

tkpmep commented Dec 3, 2023

Thank you very much, Andri and Cyril. I updated all the packages in my R 4.3.1 installation (DescTools is now at 0.99.52) and tested it with the original reprex, but got NA and a warning:

DescTools::HodgesLehmann(c(0.7, 0.5, 0.5))
[1] NA
Warning message: In DescTools::HodgesLehmann(c(0.7, 0.5, 0.5)) : Two sample case for HL-estimator not yet implemented.

If it is any help, here is what sessionInfo() gives:

sessionInfo()
R version 4.3.1 (2023-06-16 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22631)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8

time zone: Asia/Calcutta
tzcode source: internal

attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base

other attached packages:
[1] DescTools_0.99.52

loaded via a namespace (and not attached):
[1] Matrix_1.6-4 expm_0.999-8 dplyr_1.1.4
[4] compiler_4.3.1 tidyselect_1.2.0 Rcpp_1.0.11
[7] boot_1.3-28.1 gld_2.6.6 readxl_1.4.3
[10] lattice_0.22-5 R6_2.5.1 generics_0.1.3
[13] MASS_7.3-60 tibble_3.2.1 pillar_1.9.0
[16] rlang_1.1.2 utf8_1.2.4 cli_3.6.1
[19] withr_2.5.2 magrittr_2.0.3 class_7.3-22
[22] grid_4.3.1 rstudioapi_0.15.0 mvtnorm_1.2-4
[25] rootSolve_1.8.2.4 rt.test_1.18.7.9 Exact_3.2
[28] lifecycle_1.0.4 vctrs_0.6.5 proxy_0.4-27
[31] glue_1.6.2 data.table_1.14.8 cellranger_1.1.0
[34] zoo_1.8-12 fansi_1.0.5 lmom_3.0
[37] e1071_1.7-13 httr_1.4.7 tools_4.3.1
[40] pkgconfig_2.0.3

Would you kindly look into this?

Many thanks in advance

Tom

@AndriSignorell
Owner

Ohh, what a mess! An untested last-minute change caused that; sorry for the inconvenience!
Please check it with some value for y... :-(
DescTools::HodgesLehmann(c(0.7, 0.5, 0.5), 1)

I will fix that on CRAN as soon as possible.

@tkpmep
Author

tkpmep commented Dec 3, 2023

Just tried it again: it works perfectly with y = 1, and the error message makes perfect sense for y = 2. Thank you very much, and I look forward to the two-sample test as well. By the way, Sebastian Krantz is building a set of fast C++ packages for data processing called the fastverse (https://fastverse.github.io/fastverse/), built around data.table, collapse (which I have used, see https://sebkrantz.github.io/collapse/), kit and magrittr. He might be happy to get a copy of your C++ implementation, to add a powerful new robust estimator of location to collapse and kit. Would you be OK with me telling him to reach out to you for some help with the code? One stated objective of the fastverse is stable syntax (the tidyverse tends to change syntax over time, requiring maintenance).

The attached pdf goes back to 2003 and gives the asymptotic relative efficiency of the Hodges-Lehmann estimator (i.e. relative to the MLE) as a measure of location for t distributions with varying degrees of freedom, all the way from Cauchy to normal (see the table on the last page). Its performance is truly remarkable; in fact, I routinely use the Hodges-Lehmann mean in place of the sample mean, as its efficiency is almost always higher, and it gives more than adequate (1 - 1/sqrt(2), i.e. roughly 29%) protection against outliers as a bonus.
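The outlier protection is easy to see in a small simulation (a sketch using the brute-force Walsh-average median, not DescTools; hl_brute is a throwaway name):

```r
# Contaminate a standard-normal sample with gross outliers: the sample mean
# is dragged far from 0, while the Hodges-Lehmann estimate barely moves.
hl_brute <- function(x) {
  w <- outer(x, x, "+") / 2
  median(w[lower.tri(w, diag = TRUE)])
}

set.seed(1)
x <- c(rnorm(100), rep(50, 10))  # 10 wild outliers among 110 points

mean(x)      # pulled above 4
hl_brute(x)  # stays well below 1
```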

Sincerely

Tom
Geyer Nonparametric Tests and Confidence Intervals including Hodges Lehmann Mean.pdf

clrpackages pushed a commit to clearlinux-pkgs/R-DescTools that referenced this issue Dec 7, 2023
…sion 0.99.52

DescTools 0.99.52 (2023-11-30)
------------------------------

NEW FUNCTIONS ADDED:
  * dTri(), pTri(), qTri() and rTri() return the specific values for
    a triangular distribution.

UPDATED FUNCTIONS:
  * New C++ port of the Hodges-Lehmann estimator following
    Monahan's algorithm https://dl.acm.org/doi/10.1145/1271.319414.
    (Credits go to Cyril Flurin Moser)
    (Two sample case and confidence intervals are still pending.)
    (AndriSignorell/DescTools#97)

BUGFIXES:

(NEWS truncated at 15 lines)
@AndriSignorell
Owner

Cyril's impressively productive! We now have the 2-sample estimator:

hl <- function(x, y = NULL) {
  # https://aakinshin.net/posts/r-hodges-lehmann-problems/
  if (is.null(y)) {
    walsh <- outer(x, x, "+") / 2
    median(walsh[lower.tri(walsh, diag = TRUE)])
  } else {
    median(outer(x, y, "-"))
  }
}

A <- c(50.6, 39.2, 35.2, 17, 11.2, 14.2, 24.2, 37.4, 35.2)
B  <- c(38, 18.6, 23.2, 19, 6.6, 16.4, 14.4, 37.6, 24.4) 

identical(hl(A, B), HodgesLehmann(A, B))
# [1] TRUE

 
A <- rnorm(1e4)
B  <- rnorm(length(A))
 
identical(hl(A, B), HodgesLehmann(A, B))
# [1] TRUE

@tkpmep
Author

tkpmep commented Dec 9, 2023

Terrific! Thank you very much for fixing this. It appears the second argument (i.e. 1 or 2) is no longer needed. I assume this will be on GitHub in a day or two, and I will test it again. It might be worth sharing this with the base R team as well, as wilcox.test is so widely used.

@AndriSignorell
Owner

yes, on its way to github...

@tkpmep
Author

tkpmep commented Dec 12, 2023

Hello Andri,
I have been checking regularly for updates to DescTools through CRAN for a few days, but it seems to be stuck on 0.99.52. Is there something that needs to be addressed before they will accept it? Should I be installing it directly from GitHub using devtools::install_github?

Sincerely

Tom

@tkpmep
Author

tkpmep commented Dec 15, 2023

Hello Andri
No sign of 0.99.53 on CRAN as yet, so I installed it from GitHub. HodgesLehmann works beautifully. Thanks a million.

Tom

@AndriSignorell
Owner

Well done. We are currently working on the confidence intervals and may come to a solution soon. If we get there this year, we will include it; in any case, there will be a new DescTools version this year fixing the H-L estimator.

@AndriSignorell
Owner

@mmaechler: @CyrilFMoser has done an excellent job. Referring to @tkpmep's comment above, this might be a good addition to wilcox.test, especially as the calculation is actually really fast.
What do you think, would that be something for R-base?

@tkpmep
Author

tkpmep commented Dec 18, 2023

Hello Andri,
I sent a note to this effect to the R Developers group a week ago and got a note back saying that it was being held for review before being forwarded to r-devel@r-project.org. I have not heard from them since, and I am not sure if it has indeed been forwarded. In any event, the email I sent is copied below. Separately, Andrey Akinshin sent me links to some of his other blog posts in which he discusses issues with other functions:

A possible improvement for the Mann-Whitney test implementation is using the Edgeworth expansion:

Sincerely

Tom


While using the Hodges-Lehmann mean in DescTools (DescTools::HodgesLehmann), I found that it generated incorrect answers (see #97). The error is driven by the existence of tied values, which forces wilcox.test in base R to switch to an approximate algorithm that returns incorrect results; see https://aakinshin.net/posts/r-hodges-lehmann-problems/ for a detailed exposition of the issue.

Andri Signorell and Cyril Moser have a new C++ implementation of DescTools::HodgesLehmann using an O(N log N) algorithm due to Monahan, but wilcox.test in base R appears to be still broken. Will someone kindly bring this observation, as well as the existence of a solution, to the attention of the relevant person(s) on the base R development team?

The paper by Monahan, as well as the original Fortran implementation of the algorithm, are linked from #97. Inefficient O(N^2) algorithms for the Hodges-Lehmann mean are well known and are implemented in a variety of packages; for example, the authors of rt.test (https://cran.r-project.org/web/packages/rt.test) use the O(N^2) approach. I suspect that Andri and Cyril will be more than happy to assist with fixing wilcox.test in base R using their implementation of Monahan's fast algorithm.

Sincerely

Thomas Philips

@tkpmep
Author

tkpmep commented Jan 22, 2024

Just updated to 0.99.53 from CRAN. Thanks as always!

@ahl27

ahl27 commented Feb 12, 2024

Hey all--Andreas Loeffler brought this discussion to my attention today, following up on our previous discussions about improving the base *wilcox functions. As far as I'm aware, there has been no discussion of this on R-devel or activity on Bugzilla regarding these issues, so I'm pretty sure it hasn't been brought up.

I'll make sure the discussion gets forwarded on to R-devel with the participants here cc'd; I really don't know why it didn't make it to the mailing list. I'm also planning to spend the next week or two continuing to patch the (numerous) issues with the wilcox-family functions that Andrey has comprehensively outlined on his blog (the same list linked above by Thomas).

A C++ implementation won't be directly usable in base R, but I'm sure we could port the Monahan implementation to R/C for it. Worst case, the O(N^2) implementation isn't awful; I'm just concerned about exhausting vector memory in the outer() call.

-Aidan

@tkpmep
Author

tkpmep commented Feb 12, 2024 via email

@AndreyAkinshin

Hello @ahl27,

Thanks for your interest in my blog posts! It would be awesome if we could extend the *wilcox family with the Edgeworth approximation. At the moment, I can't rely on the standard wilcox.test function because it produces unacceptably large errors. The equations presented in my blog post give significantly better accuracy. I use them in my own research, and they work great. However, in order to adopt it in R, we should think about the following:

  • How do we properly handle the tied values?
  • Can we completely replace the normal approximation with the Edgeworth expansion, or should we support both?
  • How do we extend the existing API to ensure backward compatibility?
  • How do we properly document this change so that users can understand what's going on?

I will be happy to help with brainstorming, implementing, and testing. Don't hesitate to reach out to me at andrey.akinshin@gmail.com. Speaking of the documentation, I can also publish an arXiv preprint that covers all the details so that we can refer to it (the blog posts just briefly describe the idea, but they lack context and examples).

@ahl27

ahl27 commented Feb 12, 2024

Sounds good! I'm in the process of putting some patches together--I have a small github repo opened here just to organize my own thoughts for the time being. I was able to quickly put together a patch to implement the Edgeworth approximation for 2-sample wilcox.test there, and I also successfully reproduced the differences in accuracy you posted about in your blog.

I'll email R-devel momentarily so we can start a discussion on the points you're raising--I think they're all valid and worth consideration. I think going forward we can move discussion to the R-devel mailing list and Bugzilla, since it'll be easier to loop in other people in R.

@AndreyAkinshin

@ahl27 awesome, thanks for the update! Could you please include me in these discussions? (I've never participated in the R-devel mailing list before and I don't have an R Bugzilla account)
