
apply() faster than colQuantiles() #153

Closed
jphill01 opened this issue Jul 18, 2019 · 13 comments

@jphill01

jphill01 commented Jul 18, 2019

Computing quantiles across columns via apply() (MARGIN = 2) appears to be faster than using colQuantiles(). How can this be? Since colQuantiles() is written in C, it should be much faster than the pure-R apply(), but it's actually the other way around...

library(matrixStats)
library(microbenchmark)

x <- array(1:(10000 * 1000), dim = c(10000, 1000, 1)) # example array

microbenchmark(apply(x, 2, function(x) quantile(x, 0.975)),
colQuantiles(drop(x), probs = 0.975)) # timing each function

Unit: milliseconds
                                        expr      min       lq     mean   median       uq      max neval
 apply(x, 2, function(x) quantile(x, 0.975)) 252.7643 263.4260 278.0485 268.7132 281.7958 356.3389   100
        colQuantiles(drop(x), probs = 0.975) 447.6436 514.2717 537.6342 526.2200 573.9136 615.9201   100

identical(apply(x, 2, function(x) quantile(x, 0.975)), colQuantiles(drop(x), probs = 0.975))
[1] TRUE

Can someone explain?

@HenrikBengtsson
Owner

HenrikBengtsson commented Jul 18, 2019

Not near a computer for a few days, but I suspect the drop(x) is what takes time, not colQuantiles(). Try adding a solo drop(x) entry to the benchmark to see this. Or drop that dimension before benchmarking.

@HenrikBengtsson
Owner

Forgot to say, colQuantiles() is one of the functions that is still not implemented in native code.

@jphill01
Author

Thanks. Any idea at what point colQuantiles() will be optimized?

@HenrikBengtsson
Owner

Since it's most likely drop(x) that is the culprit here, a native implementation is unlikely to make a big difference.

Note, matrixStats is designed for matrices (2d arrays) and some vectors (1d arrays). Your example uses a 3d array.
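
A minimal illustration (mine, not from the thread) of what drop() does here: it removes the length-one third dimension, turning the 3-d array into the 2-d matrix that matrixStats expects.

```r
# A 3-d array whose third dimension has length one can be reduced to a
# plain matrix with drop(); matrixStats functions then apply directly.
x <- array(1:(4 * 3 * 1), dim = c(4, 3, 1))
dim(x)         # c(4, 3, 1) -- still a 3-d array
xd <- drop(x)  # removes the length-one dimension
dim(xd)        # c(4, 3)    -- now a 2-d matrix
stopifnot(is.matrix(xd))
```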

@jphill01
Author

Alright, thanks. I guess I will have to find a faster version of drop(), or wait until an 'arrayStats' package is released...

@HenrikBengtsson
Owner

Now at a computer. My guess above was incorrect: drop() is not the culprit. I had expected it to slow things down due to memory copying, but the benchmarks below confirm that it does not:

library(matrixStats)

x <- array(1:(10000*1000), dim = c(10000, 1000, 1))
xd <- drop(x)

y0 <- apply(x, MARGIN = 2L, FUN = quantile, probs = 0.975)
y1 <- colQuantiles(drop(x), probs = 0.975)
y2 <- colQuantiles(xd, probs = 0.975)
stopifnot(identical(y1, y0), identical(y2, y0))

stats <- microbenchmark::microbenchmark(
  apply(x, MARGIN = 2L, FUN = quantile, probs = 0.975),
  colQuantiles(drop(x), probs = 0.975),
  colQuantiles(xd, probs = 0.975),
  drop(x),
  unit = "ms",
  times = 10L
)

print(stats)
## Unit: milliseconds
##                                                  expr        min         lq        mean     median         uq        max neval cld
##  apply(x, MARGIN = 2L, FUN = quantile, probs = 0.975) 142.430746 143.457675 165.3461039 150.036045 170.998975 267.598677    10  b
##                  colQuantiles(drop(x), probs = 0.975) 286.959245 363.392250 399.3671685 384.558625 457.812438 481.906615    10   d
##                       colQuantiles(xd, probs = 0.975) 264.892077 303.891283 345.1908052 364.407124 384.364789 389.169873    10  c
##                                               drop(x)   0.000419   0.000782   0.0029067   0.002321   0.002681   0.011392    10 a

In other words, there's indeed room for improvement.

@jphill01
Author

Thanks again! My original code has been released to CRAN as part of an R package.

It's sufficiently fast for now, but whenever you get around to optimizing colQuantiles(), I will use that instead of stats::quantile(), which code profiling suggests has room for improvement.

@HenrikBengtsson
Owner

HenrikBengtsson commented Sep 10, 2019

Benchmarking with

library(matrixStats)
options(width=120)

X <- matrix(1:(10000*1000), nrow=10000L, ncol=1000L)
y0 <- apply(X, MARGIN=2L, FUN=quantile, probs=0.975)
y1 <- colQuantiles(X, probs=0.975)
stopifnot(identical(y1, y0))

stats <- bench::mark(
  apply(X, MARGIN=2L, FUN=quantile, probs=0.975),
  colQuantiles(X, probs=0.975),
  min_iterations=10L
)

It does not look like replacing the generic sort() with sort.int() (Issue #155) makes a difference, i.e. that's not the culprit:

## matrixStats 0.55.0
print(stats)
## # A tibble: 2 x 13
##   expression                                             min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
##   <bch:expr>                                           <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
## 1 apply(X, MARGIN = 2L, FUN = quantile, probs = 0.975) 146ms  155ms      6.03     191MB     19.9    10    33      1.66s
## 2 colQuantiles(X, probs = 0.975)                       286ms  310ms      3.18     496MB     20.1    10    63      3.14s
## # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
## matrixStats 0.55.0-9000 (with sort.int() instead of generic sort())
print(stats)
## # A tibble: 2 x 13
##   expression                                             min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
##   <bch:expr>                                           <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
## 1 apply(X, MARGIN = 2L, FUN = quantile, probs = 0.975) 151ms  174ms      5.38     191MB     17.7    10    33      1.86s
## 2 colQuantiles(X, probs = 0.975)                       292ms  333ms      2.89     496MB     23.4    10    81      3.46s
## # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>

Now, looking at the above benchmark results, it's clear that colQuantiles() makes 2.6 times more memory allocations than apply() (496 MB vs 191 MB). That needs to be investigated.

The dominant memory allocations are:

> p1 <- profmem::profmem(y1 <- colQuantiles(X, probs=0.975))
> subset(p1, bytes > 100e3)
Rprofmem memory profiling of:
y1 <- colQuantiles(X, probs = 0.975)

Memory allocations:
       what     bytes                                                   calls
4     alloc  40000048 colQuantiles() -> apply() -> aperm() -> aperm.default()
4042  alloc  40000048                   colQuantiles() -> apply() -> unlist()
4043  alloc  40000048                    colQuantiles() -> apply() -> array()
4044  alloc  40000048 colQuantiles() -> apply() -> aperm() -> aperm.default()
6051  alloc  40000048 colQuantiles() -> apply() -> aperm() -> aperm.default()
total       200000240

@HenrikBengtsson
Owner

Great news: In matrixStats 0.55.0-9000 (develop branch) we now have:

PERFORMANCE:

  • colQuantiles() and rowQuantiles() with the default type=7L and when there
    are no missing values is now significantly faster and uses significantly
    fewer memory allocations.

For the example in this issue, the speedup is ~7 times (compared to #153 (comment)), which means colQuantiles() is now significantly faster than apply(..., MARGIN=2L, FUN=stats::quantile) (here ~3 times).
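
As a side note (my own check, not part of the release notes): type = 7 is also the default algorithm in base R's stats::quantile(), which is why the two approaches keep returning identical results.

```r
# Base-R check that type = 7 is the default quantile algorithm, matching
# the fast path in colQuantiles() (default type = 7L).
x <- matrix(1:200, nrow = 20L)
a <- apply(x, MARGIN = 2L, FUN = quantile, probs = 0.975)             # default type
b <- apply(x, MARGIN = 2L, FUN = quantile, probs = 0.975, type = 7L)  # explicit type = 7
stopifnot(identical(a, b))
```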

> library(matrixStats)
> options(width=120)
> 
> X <- matrix(1:(10000*1000), nrow=10000L, ncol=1000L)
> y0 <- apply(X, MARGIN=2L, FUN=quantile, probs=0.975)
> y1 <- colQuantiles(X, probs=0.975)
> stopifnot(identical(y1, y0))
> 
> stats <- bench::mark(
+   apply(X, MARGIN=2L, FUN=quantile, probs=0.975),
+   colQuantiles(X, probs=0.975),
+   min_iterations=10L
+ )
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 
> stats
# A tibble: 2 x 13
  expression                                               min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>                                           <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 apply(X, MARGIN = 2L, FUN = quantile, probs = 0.975) 209.6ms 218.6ms      4.56     191MB     20.5    10    45
2 colQuantiles(X, probs = 0.975)                        68.5ms  70.4ms     13.7      115MB     28.8    10    21
# … with 5 more variables: total_time <bch:tm>, result <list>, memory <list>, time <list>, gc <list>

This was achieved by avoiding lots of large memory allocations. Compared to #153 (comment), those large allocations are no longer made:

> p1 <- profmem::profmem(y1 <- colQuantiles(X, probs=0.975))
> subset(p1, bytes > 50e3)
Rprofmem memory profiling of:
y1 <- colQuantiles(X, probs = 0.975)

Memory allocations:
      what bytes calls
total          0      

PS. There's still some room for improvement, but it's minor compared to what was achieved here.

@jphill01
Author

Thanks! Do you know when you plan to submit an updated package to CRAN?

@HenrikBengtsson
Owner

No guesstimate. matrixStats 0.55.0 was just released, and it was a multi-day effort to check it across platforms and run all of the 244 reverse-dependency checks (I have access to a compute cluster, which helped). So releasing matrixStats is a bit of a time commitment.

@HenrikBengtsson
Owner

FYI, matrixStats 0.56.0 with this speedup is now on CRAN.

@jphill01
Author

Excellent! Thanks Henrik!
