apply() faster than colQuantiles() #153

jphill01 · 2019-07-18T00:37:32Z

Computing quantiles via apply() across columns (MARGIN = 2) appears to be faster than using colQuantiles(). How can this be? Since colQuantiles is written in pure C, it should be much faster than the pure R apply(), but it's actually the other way around...

library(matrixStats)
library(microbenchmark)

x <- array(1:(10000 * 1000), dim = c(10000, 1000, 1)) # example array

microbenchmark(apply(x, 2, function(x) quantile(x, 0.975)),
colQuantiles(drop(x), probs = 0.975)) # timing each function

Unit: milliseconds
                                        expr      min       lq     mean   median       uq      max neval
 apply(x, 2, function(x) quantile(x, 0.975)) 252.7643 263.4260 278.0485 268.7132 281.7958 356.3389   100
        colQuantiles(drop(x), probs = 0.975) 447.6436 514.2717 537.6342 526.2200 573.9136 615.9201   100

identical(apply(x, 2, function(x) quantile(x, 0.975)), colQuantiles(drop(x), probs = 0.975))
[1] TRUE

Can someone explain?

HenrikBengtsson · 2019-07-18T05:45:19Z

Not near a computer for a few days, but I suspect the drop(x) is what takes time, not colQuantiles(). Try adding a solo drop(x) entry to the benchmark to see this. Or drop that dimension before benchmarking.

HenrikBengtsson · 2019-07-18T05:50:12Z

Forgot to say, colQuantiles() is one of the functions that still is not implemented in native code.

jphill01 · 2019-07-18T12:36:54Z

Thanks. Any idea at what point colQuantiles() will be optimized?

HenrikBengtsson · 2019-07-18T15:37:07Z

Since drop(x) is most likely the culprit here, a native implementation is unlikely to make a big difference.

Note, matrixStats is designed for matrices (2d arrays) and some vectors (1d arrays). Your example uses a 3d array.
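As a minimal sketch of what "designed for matrices" means in practice for the example in this issue: an array with a trailing singleton dimension can be turned into a plain matrix before calling any matrixStats function. The variable names xd and xm are only illustrative.

```r
# Example 3-d array with a trailing singleton dimension, as in this issue
x <- array(1:(10000 * 1000), dim = c(10000, 1000, 1))

# Option 1: drop() removes every dimension of extent one
xd <- drop(x)

# Option 2: set the two leading dimensions explicitly; unlike drop(),
# this still yields a matrix when the row or column count happens to be 1
xm <- x
dim(xm) <- dim(x)[1:2]

# Both approaches give the same 10000 x 1000 matrix here
stopifnot(is.matrix(xd), is.matrix(xm), identical(xd, xm))
```

Either result can then be passed to colQuantiles() directly.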

jphill01 · 2019-07-18T19:09:20Z

Alright, thanks. I guess I will have to find a faster version of drop(), or wait until an 'arrayStats' package is released...

HenrikBengtsson · 2019-07-21T11:08:08Z

Now at a computer. My guess above was incorrect; it's not drop() that is the culprit. I expected it to slow things down due to memory copying. Here are some benchmarks confirming that drop() is not the bottleneck:

library(matrixStats)

x <- array(1:(10000*1000), dim = c(10000, 1000, 1))
xd <- drop(x)

y0 <- apply(x, MARGIN = 2L, FUN = quantile, probs = 0.975)
y1 <- colQuantiles(drop(x), probs = 0.975)
y2 <- colQuantiles(xd, probs = 0.975)
stopifnot(identical(y1, y0), identical(y2, y0))

stats <- microbenchmark::microbenchmark(
  apply(x, MARGIN = 2L, FUN = quantile, probs = 0.975),
  colQuantiles(drop(x), probs = 0.975),
  colQuantiles(xd, probs = 0.975),
  drop(x),
  unit = "ms",
  times = 10L
)

print(stats)
## Unit: milliseconds
##                                                  expr        min         lq
##  apply(x, MARGIN = 2L, FUN = quantile, probs = 0.975) 142.430746 143.457675
##                  colQuantiles(drop(x), probs = 0.975) 286.959245 363.392250
##                       colQuantiles(xd, probs = 0.975) 264.892077 303.891283
##                                               drop(x)   0.000419   0.000782
##         mean     median         uq        max neval  cld
##  165.3461039 150.036045 170.998975 267.598677    10  b  
##  399.3671685 384.558625 457.812438 481.906615    10    d
##  345.1908052 364.407124 384.364789 389.169873    10   c 
##    0.0029067   0.002321   0.002681   0.011392    10 a

In other words, there's indeed room for improvement.

jphill01 · 2019-07-21T15:19:42Z

Thanks again! My original code has been released to CRAN as part of an R package.

It's sufficiently fast for now, but whenever you get around to optimizing colQuantiles(), I will use that instead of stats::quantile(), which can be improved based on code profiling.

HenrikBengtsson · 2019-09-10T18:19:06Z

Benchmarking with

library(matrixStats)
options(width=120)

X <- matrix(1:(10000*1000), nrow=10000L, ncol=1000L)
y0 <- apply(X, MARGIN=2L, FUN=quantile, probs=0.975)
y1 <- colQuantiles(X, probs=0.975)
stopifnot(identical(y1, y0))

stats <- bench::mark(
  apply(X, MARGIN=2L, FUN=quantile, probs=0.975),
  colQuantiles(X, probs=0.975),
  min_iterations=10L
)

It does not look like replacing generic sort() with sort.int() (Issue #155) makes a difference, i.e. that's not the culprit:

## matrixStats 0.55.0
print(stats)
## # A tibble: 2 x 13
##   expression                                             min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
##   <bch:expr>                                           <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
## 1 apply(X, MARGIN = 2L, FUN = quantile, probs = 0.975) 146ms  155ms      6.03     191MB     19.9    10    33      1.66s
## 2 colQuantiles(X, probs = 0.975)                       286ms  310ms      3.18     496MB     20.1    10    63      3.14s
## # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>

## matrixStats 0.55.0-9000 (with sort.int() instead of generic sort())
print(stats)
## # A tibble: 2 x 13
##   expression                                             min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
##   <bch:expr>                                           <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
## 1 apply(X, MARGIN = 2L, FUN = quantile, probs = 0.975) 151ms  174ms      5.38     191MB     17.7    10    33      1.86s
## 2 colQuantiles(X, probs = 0.975)                       292ms  333ms      2.89     496MB     23.4    10    81      3.46s
## # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>

Now, looking at the above benchmark results, it's clear that colQuantiles() allocates about 2.6 times more memory than apply() (496MB vs. 191MB). That needs to be investigated.

The dominant memory allocations are:

> p1 <- profmem::profmem(y1 <- colQuantiles(X, probs=0.975))
> subset(p1, bytes > 100e3)
Rprofmem memory profiling of:
y1 <- colQuantiles(X, probs = 0.975)

Memory allocations:
       what     bytes                                                   calls
4     alloc  40000048 colQuantiles() -> apply() -> aperm() -> aperm.default()
4042  alloc  40000048                   colQuantiles() -> apply() -> unlist()
4043  alloc  40000048                    colQuantiles() -> apply() -> array()
4044  alloc  40000048 colQuantiles() -> apply() -> aperm() -> aperm.default()
6051  alloc  40000048 colQuantiles() -> apply() -> aperm() -> aperm.default()
total       200000240
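To put the numbers in the profile above into perspective, here is a rough back-of-the-envelope check. It assumes 4 bytes per integer element; the exact header overhead reported by object.size() may vary between R builds, so no specific output is claimed for that call.

```r
# X in this issue is an integer matrix with 10000 x 1000 elements
n_elements <- 10000 * 1000
bytes_per_integer <- 4

# Raw payload of one full copy of the data: 4e7 bytes
payload_bytes <- n_elements * bytes_per_integer

# Each large allocation in the profile is 40000048 bytes, i.e. the payload
# plus a small vector header, so each one is a full-size copy of X
header_bytes <- 40000048 - payload_bytes

# Five such copies add up to the reported total of 200000240 bytes
stopifnot(5 * 40000048 == 200000240)

# For comparison: size of an integer vector of the same length on this build
print(object.size(integer(n_elements)), units = "B")
```

In other words, the old code path made five full-size copies of the input, which is consistent with the extra memory seen in the benchmark.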

HenrikBengtsson · 2019-09-10T21:04:54Z

Great news: In matrixStats 0.55.0-9000 (develop branch) we now have:

PERFORMANCE:

colQuantiles() and rowQuantiles() with the default type=7L and when there
are no missing values is now significantly faster and uses significantly
fewer memory allocations.

For the example in this issue, the speedup is ~7 times (compared to #153 (comment)), which means colQuantiles() is now significantly faster than apply(..., MARGIN=2L, FUN=stats::quantile) (here ~3 times)

> library(matrixStats)
> options(width=120)
> 
> X <- matrix(1:(10000*1000), nrow=10000L, ncol=1000L)
> y0 <- apply(X, MARGIN=2L, FUN=quantile, probs=0.975)
> y1 <- colQuantiles(X, probs=0.975)
> stopifnot(identical(y1, y0))
> 
> stats <- bench::mark(
+   apply(X, MARGIN=2L, FUN=quantile, probs=0.975),
+   colQuantiles(X, probs=0.975),
+   min_iterations=10L
+ )
Warning message:
Some expressions had a GC in every iteration; so filtering is disabled. 
> stats
# A tibble: 2 x 13
  expression                                               min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>                                           <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 apply(X, MARGIN = 2L, FUN = quantile, probs = 0.975) 209.6ms 218.6ms      4.56     191MB     20.5    10    45
2 colQuantiles(X, probs = 0.975)                        68.5ms  70.4ms     13.7      115MB     28.8    10    21
# … with 5 more variables: total_time <bch:tm>, result <list>, memory <list>, time <list>, gc <list>

This was achieved by avoiding lots of large memory allocations. Compared to #153 (comment), those large allocations are no longer done:

> p1 <- profmem::profmem(y1 <- colQuantiles(X, probs=0.975))
> subset(p1, bytes > 50e3)
Rprofmem memory profiling of:
y1 <- colQuantiles(X, probs = 0.975)

Memory allocations:
      what bytes calls
total          0

PS. There's still some room for improvements, but that's minor to what was achieved here.
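For readers wondering what the type=7L case computes, below is a minimal sketch of a column-wise type-7 quantile for a single probability and no missing values. This is not the matrixStats implementation; col_quantile_type7() is a hypothetical helper written only to illustrate the idea of working on one column at a time without the intermediate copies that apply() makes.

```r
# Hypothetical helper (not part of matrixStats): type-7 quantile per column,
# for one probability and a matrix without missing values
col_quantile_type7 <- function(X, prob) {
  stopifnot(is.matrix(X), nrow(X) >= 1L, !anyNA(X),
            length(prob) == 1L, prob >= 0, prob <= 1)
  n <- nrow(X)
  h <- (n - 1) * prob + 1      # fractional position in the sorted column
  lo <- floor(h)
  hi <- ceiling(h)
  w <- h - lo                  # interpolation weight between the neighbours
  out <- numeric(ncol(X))
  for (j in seq_len(ncol(X))) {
    # A partial sort is enough: only positions lo and hi must be in place
    s <- as.numeric(sort.int(X[, j], partial = unique(c(lo, hi))))
    out[j] <- s[lo] + w * (s[hi] - s[lo])
  }
  out
}

# Usage: compare against stats::quantile() on a small random matrix
set.seed(1)
X_small <- matrix(rnorm(200), nrow = 50, ncol = 4)
expected <- apply(X_small, MARGIN = 2L, FUN = quantile, probs = 0.975, names = FALSE)
stopifnot(isTRUE(all.equal(col_quantile_type7(X_small, 0.975), expected)))
```

The comparison uses all.equal() rather than identical() because the interpolation is written slightly differently from stats::quantile(), so the results may differ in the last few bits.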

jphill01 · 2019-09-11T14:11:44Z

Thanks! Do you know when you plan to submit an updated package to CRAN?

HenrikBengtsson · 2019-09-11T19:32:38Z

No guesstimate. matrixStats 0.55.0 was just released, and it was a multi-day effort to check it across platforms and run all of the 244 reverse dependency checks (I have access to a compute cluster, so that helped). So, releasing matrixStats is a bit of a time commitment.

HenrikBengtsson · 2020-03-13T17:39:22Z

FYI, matrixStats 0.56.0 with this speedup is now on CRAN.

jphill01 · 2020-03-13T17:43:53Z

Excellent! Thanks Henrik!

HenrikBengtsson mentioned this issue Sep 10, 2019

PERFORMANCE: Use sort.int() instead of generic sort() #155

Closed

HenrikBengtsson added a commit that referenced this issue Sep 10, 2019

colQuantiles() and rowQuantiles() are now significantly faster for type=7L and no NAs [#153]

955678b

HenrikBengtsson closed this as completed Sep 10, 2019

HenrikBengtsson added this to the Next release milestone Sep 10, 2019

HenrikBengtsson added a commit that referenced this issue Sep 10, 2019

colQuantiles(): Avoid two t() calls [#153]

1407d25
