Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

R 3.4.3 -> 3.5 performance drop #2962

Closed
s-Nick-s opened this issue Jul 4, 2018 · 9 comments
Closed

R 3.4.3 -> 3.5 performance drop #2962

s-Nick-s opened this issue Jul 4, 2018 · 9 comments
Milestone

Comments

@s-Nick-s
Copy link

@s-Nick-s s-Nick-s commented Jul 4, 2018

I've noticed a significant drop in data.table performance after upgrading to R 3.5.
For example:

set.seed(1)
types <- c("A", "B", "C", "D", "E", "F")
obs <- 4e7
dt1 <- data.table(percent = round(runif(obs, min = 0, max = 1), digits = 2),
                  type = as.factor(sample(types, obs, replace = TRUE)))
microbenchmark::microbenchmark(
  test1 <- one[, list(percent_total = sum(percent)), by = type]
, times = 30)

On 3.5 gives me:

Unit: milliseconds
                                                          expr    min      lq     mean   median       uq     max neval
 test1 <- dt1[, list(percent_total = sum(percent)), by = type] 522.53 539.668 576.2352 572.8301 602.0069 677.892    30
>sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.4

loaded via a namespace (and not attached):
[1] microbenchmark_1.4-4 compiler_3.5.0       tools_3.5.0  

But on 3.4.3 it's ~20% faster:

Unit: milliseconds
                                                          expr      min       lq     mean   median       uq      max neval
 test1 <- dt1[, list(percent_total = sum(percent)), by = type] 415.5938 445.8137 477.6562 470.0536 499.8974 576.9203    30
> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=C              LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.4

loaded via a namespace (and not attached):
[1] microbenchmark_1.4-4 compiler_3.4.3       tools_3.4.3   
@jangorecki
Copy link
Member

@jangorecki jangorecki commented Jul 4, 2018

Thank you for reporting. Confirmed on latest devel.
Smaller example.

library(data.table)
set.seed(1)
types <- c("A", "B", "C", "D", "E", "F")
obs <- 4e7
dt1 <- data.table(percent = round(runif(obs, min = 0, max = 1), digits = 2),
                  type = as.factor(sample(types, obs, replace = TRUE)))
system.time(dt1[, list(percent_total = sum(percent)), by = type])

3.4.4 timings are same as in 3.4.3.
3.5.1 timings are same as in 3.5.0.

@mattdowle mattdowle added this to the 1.11.6 milestone Sep 13, 2018
@mattdowle mattdowle removed this from the 1.11.6 milestone Sep 20, 2018
@mattdowle mattdowle added this to the 1.12.0 milestone Sep 20, 2018
@jangorecki
Copy link
Member

@jangorecki jangorecki commented Dec 13, 2018

I tested latest devel on 3.5.1 and 3.4.4 and there is no performance difference anymore. It is likely to be fixed by recent rework of many C functions. Please re-open if issue is still valid for you on latest devel.

@s-Nick-s
Copy link
Author

@s-Nick-s s-Nick-s commented Feb 28, 2019

@jangorecki
Unfortunately, the issue is still valid. After rerunning tests (with 4e6 observations) I get:

#Platform: x86_64-pc-linux-gnu (64-bit)
#R3.5.1
#1.11.8
Unit: milliseconds
                                                          expr      min       lq     mean   median       uq      max neval
 test1 <- dt1[, list(percent_total = sum(percent)), by = type] 50.22021 53.60446 56.84797 54.67532 57.61063 118.4352   100
 
#R3.4.4
#1.11.8
Unit: milliseconds
                                                          expr      min       lq     mean   median       uq      max neval
 test1 <- dt1[, list(percent_total = sum(percent)), by = type] 43.01133 44.16637 46.24535 44.63486 44.84428 89.37498   100

And 1.12.0 gives me x2 time increase on both R versions. (I think this is due to #3395):

#1.12.0
Unit: milliseconds
                                                          expr      min       lq    mean   median     uq      max neval
 test1 <- dt1[, list(percent_total = sum(percent)), by = type] 90.60795 93.11161 119.436 109.1086 146.69 178.8412    30

It looks like I don't have permissions to re-open this issue. Please re-open it if you can confirm my tests.

@jangorecki
Copy link
Member

@jangorecki jangorecki commented Feb 28, 2019

@s-Nick-s I think it is related to number of threads (as in the link you mentioned). Below is R 3.5.2 vs R 3.4.4 using only single thread in both cases, data.table 1.12.0.
Please confirm that setting threads to 1 makes R 3.5+ faster than R 3.4.

library(data.table)
set.seed(1)
types <- c("A", "B", "C", "D", "E", "F")
setDTthreads(1L)
obs <- 4e6
dt1 <- data.table(percent = round(runif(obs, min = 0, max = 1), digits = 2),
                  type = as.factor(sample(types, obs, replace = TRUE)))
system.time(dt1[, list(percent_total = sum(percent)), by = type])
microbenchmark::microbenchmark(
  test1 <- dt1[, list(percent_total = sum(percent)), by = type]
, times = 30)

3.5.2

   user  system elapsed 
  0.095   0.012   0.107 
> microbenchmark::microbenchmark(
+   test1 <- dt1[, list(percent_total = sum(percent)), by = type]
+ , times = 30)
Unit: milliseconds
                                                          expr      min
 test1 <- dt1[, list(percent_total = sum(percent)), by = type] 85.36912
       lq    mean   median       uq      max neval
 87.16382 99.1758 98.38073 105.0603 120.9579    30

3.4.4

   user  system elapsed 
  0.103   0.004   0.107 
> microbenchmark::microbenchmark(
+   test1 <- dt1[, list(percent_total = sum(percent)), by = type]
+ , times = 30)

Unit: milliseconds
                                                          expr      min
 test1 <- dt1[, list(percent_total = sum(percent)), by = type] 101.8106
       lq     mean   median       uq      max neval
 106.1996 115.2929 114.8559 122.1709 136.3099    30

@jangorecki jangorecki reopened this Feb 28, 2019
@s-Nick-s
Copy link
Author

@s-Nick-s s-Nick-s commented Feb 28, 2019

@jangorecki
For me the 1.12.0 with setDTthreads(1L) has about the same speed on both R versions.
I guess that means that this issue is fixed in 1.12.0
Thank you for testing this. I will re-test and confirm once the thread number issue is fixed.

@RaynorJim
Copy link

@RaynorJim RaynorJim commented May 21, 2020

Hi @jangorecki

I worked with @s-Nick-s a while back when he discovered the performance drop. I did some tests these days, and although I find 1.12.+ to give same or better time on R3.5+ vs R.3.4.4, i do find that running R3.4.4 with data.table 1.11.8 is about 2x as quick as running this example on anything newer than that combination (any R version beginning R3.5 all the way to R4.0.0) with any version of data.table 1.12.x (I did use setDTthreads(1L))

R3.4.4 + data.table 1.11.8
Unit: milliseconds expr min lq mean median test1 <- dt1[, list(percent_total = sum(percent)), by = type] 51.30958 55.41111 73.35613 59.93117 uq max neval 87.61779 154.0714 30

R3.6.3 + data.table 1.12.8

Unit: milliseconds expr min lq mean median test1 <- dt1[, list(percent_total = sum(percent)), by = type] 121.8025 135.3461 178.0072 178.2529 uq max neval 195.0769 347.6791 30

Please if you have the time, make a clean install of R3.4.4 + data.table 1.11.8 and verify this.

Looking forward to your reply,
Cheers.

@jangorecki
Copy link
Member

@jangorecki jangorecki commented May 21, 2020

@RaynorJim Thanks for letting us know. I run some benchmarks

#docker run --rm -it r-base:3.4.4
#docker run --rm -it r-base:3.6.3

install.packages("microbenchmark", quiet=TRUE)
#install.packages("https://cloud.r-project.org/src/contrib/Archive/data.table/data.table_1.11.8.tar.gz", repos=NULL)
#install.packages("https://cloud.r-project.org/src/contrib/Archive/data.table/data.table_1.12.0.tar.gz", repos=NULL)
#install.packages("data.table", quiet=TRUE) # 1.12.8
options(width=999)
library(data.table)
set.seed(1)
types = c("A", "B", "C", "D", "E", "F")
#obs = 4e6
#obs = 4e7
dt1 = data.table(
  percent = round(runif(obs, min = 0, max = 1), digits = 2),
  type = as.factor(sample(types, obs, replace = TRUE))
)
setDTthreads(1L)
th1=microbenchmark::microbenchmark(th01 = dt1[, list(percent_total = sum(percent)), by = type], times = 10)
setDTthreads(40L)
th2=microbenchmark::microbenchmark(th40 = dt1[, list(percent_total = sum(percent)), by = type], times = 10)
rbind(th1,th2)
q("no")

And got the following timings (milliseconds)

4e6
R 3.4.4
     expr       min        lq      mean    median        uq       max neval
11.8 th01  62.63322  63.61023  69.46025  67.22793  70.69045  92.64278    30
11.8 th40  62.75540  66.81539  71.80807  69.78321  75.12829  87.32864    30
12.0 th01 162.46948 175.45990 192.31498 197.09693 204.19589  227.6725    30
12.0 th40  58.82906  67.18446  81.93788  73.72452  96.58676  120.0546    30
12.8 th01 166.99444 182.70240 193.66958 193.62505  208.8364  221.9188    30
12.8 th40  70.08485  73.37909  87.06192  75.40962  102.3513  129.4137    30
R 3.6.3
     expr       min        lq      mean    median        uq       max neval
11.8 th01  70.20075  82.61568  84.79386  86.75024  88.96923  93.98895    30
11.8 th40  70.03898  72.06237  83.17097  86.10463  89.22889 106.69598    30
12.0 th01 169.67496 178.29791 187.48729 184.22019 196.46390  224.7092    30
12.0 th40  60.06581  65.27729  73.94895  70.57542  77.37153  106.6533    30
12.8 th01 183.24098 183.93141 196.07175  198.5182 206.24578  211.0278    30
12.8 th40  67.44557  71.16225  83.00839   75.6796  98.07114  110.1393    30
4e7
R 3.4.4
     expr       min        lq      mean    median        uq       max neval
11.8 th01  812.8381  824.5753  835.4540  836.2424  847.0650   870.853    30
11.8 th40  813.1851  818.7180  832.1787  826.0476  848.4874   873.471    30
12.0 th01 2201.2274 2230.5826 2283.4230 2291.6118 2328.9514 2356.1045    30
12.0 th40  582.6014  619.0834  646.6066  647.6043  674.6852  726.8556    30
12.8 th01 2182.8077 2202.8348 2218.1768 2213.1107 2244.6785 2255.4742    10
12.8 th40  724.3372  729.4303  748.9661  736.9562  780.0272  796.9019    10
R 3.6.3
     expr       min        lq      mean    median        uq       max neval
11.8 th01  920.6722  936.0939  946.9035  943.9847  956.2975  977.2512    10
11.8 th40  935.6252  944.1761  954.3911  950.5467  956.6025  995.5108    10
12.0 th01 2241.5981 2252.7387 2262.9659 2260.5077 2274.1610 2282.4533    10
12.0 th40  625.8483  639.3864  659.6355  661.8368  675.7402  689.2096    10
12.8 th01 2229.2616 2279.0100 2324.2365 2346.7960 2364.8414 2390.7510    10
12.8 th40  664.3308  699.2312  740.8568  747.6503  779.1357  782.5371    10

Will come back soon to make conclusion. I would appriciate if you could briefly check if those numbers match against yours, more or less.

@jangorecki jangorecki reopened this May 21, 2020
@jangorecki
Copy link
Member

@jangorecki jangorecki commented May 22, 2020

One observation that is quite clear from above timings is that single threaded code is slower starting from 1.12.0 comparing to 1.11.8, this is covered by #3330.
At the same time we can see that more threads doesn't speed up 1.11.8, this is because parallelism was not used much there yet, as we can see in news entries for v1.12.0.

Let's then focus on R 3.4.4 vs post-R 3.5, here R 3.6.3.

As we can see here, there is a difference, but it was not really that big:

4e6
           expr       min        lq      mean    median        uq       max neval
3.4.4 11.8 th01  62.63322  63.61023  69.46025  67.22793  70.69045  92.64278    30
3.6.3 11.8 th01  70.20075  82.61568  84.79386  86.75024  88.96923  93.98895    30

4e7
           expr       min        lq      mean    median        uq       max neval
3.4.4 11.8 th01  812.8381  824.5753  835.4540  836.2424  847.0650   870.853    30
3.6.3 11.8 th01  920.6722  936.0939  946.9035  943.9847  956.2975  977.2512    10

Now lets see if that is still valid, doesn't seem so:

4e6
           expr       min        lq      mean    median        uq       max neval
3.4.4 12.8 th01 166.99444 182.70240 193.66958 193.62505  208.8364  221.9188    30
3.6.3 12.8 th01 183.24098 183.93141 196.07175  198.5182 206.24578  211.0278    30

4e7
           expr       min        lq      mean    median        uq       max neval
3.4.4 12.8 th01 2182.8077 2202.8348 2218.1768 2213.1107 2244.6785 2255.4742    10
3.6.3 12.8 th01 2229.2616 2279.0100 2324.2365 2346.7960 2364.8414 2390.7510    10

If you are after minimizing speed, and you cannot use multiple threads, then going back to 1.11.8 is probably the best choice for your. You may also want to upvote and subscribe to #3330.

Closing this issue for now. @RaynorJim If you believe your case is not covered by #3330 please let us know, we can always re-open.

@RaynorJim
Copy link

@RaynorJim RaynorJim commented May 22, 2020

Appologies, I have been busy all day. I wasn't aware single threaded performance is covered by #3330, I will watch that issue.

Indeed i cannot move away from 1.11.8 because we run all our code on a parallel cluster and all our CPUs are maxed out, so i am looking for as much speed from data.table as possible per one thread.

Thanks again for checking this out, I appreciate it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants