Calculations with many groups are much slower by default than with setDTthreads(1) #4294
Comments
I suspect it is due to copying all of the data to each of the 18 clusters |
This is largely a duplicate of #4200. I am making some additional notes at that issue. |
@chinsoon12 AFAIK there is no data copying in OpenMP. The overhead comes from creating a team of threads for every single group. |
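To make the per-group overhead explanation concrete, here is a rough, illustrative sketch (plain R, not data.table's internal OpenMP code): if starting a thread team for every group is the cost, the multi-threaded penalty should grow with the number of groups rather than the number of rows. Exact timings are machine-dependent.
library(data.table)
NN = 1e5
set.seed(1)
few = data.table(g = rep(1:100, each = NN/100), V = rpois(NN, 10))   # 100 large groups
many = data.table(g = rep(1:(NN/4), each = 4), V = rpois(NN, 10))    # 25,000 small groups
setDTthreads(8L)
system.time(few[, log(sum(V)), by = g])    # expect little or no multi-thread penalty
system.time(many[, log(sum(V)), by = g])   # expect the penalty, if any, to show up here
setDTthreads(1L)
system.time(few[, log(sum(V)), by = g])
system.time(many[, log(sum(V)), by = g])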
This seems resolved by #4558, as the timing is equal for 1 thread vs. 8 threads. Testing done via Windows 10 / R 4.0.2 / data.table 1.13.0:
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(8L)
system.time(DT[ , log(sum(V)), by = grp1])
#> user system elapsed
#> 0.06 0.03 0.08
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
#> user system elapsed
#> 0.08 0.00 0.08
Feel free to re-open. |
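For reference, one hedged way to check whether the throttling introduced around #4558 is active on a given install is getDTthreads(verbose = TRUE); only newer releases expose a throttle argument on setDTthreads(), so the sketch below guards for it, and 65536 is just an illustrative value, not a recommendation.
library(data.table)
getDTthreads(verbose = TRUE)   # reports threads in use and, on newer versions, the rows-per-thread throttle
if ("throttle" %in% names(formals(setDTthreads))) {
  # raising the throttle keeps small per-call grouping tasks on a single thread
  setDTthreads(throttle = 65536L)
  getDTthreads(verbose = TRUE)
}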
@ColeMiller1 not sure this qualifies for closing in my case. I ran it twice, reversing the order, to ensure warm-up is not affecting the timing.
R -q
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(8L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.295 0.009 0.063
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.04 0.00 0.04
q("no")
R -q
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.049 0.000 0.048
setDTthreads(8L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.351 0.000 0.061
q("no")
|
Another machine:
R -q
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(40L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 1.131 0.000 0.069
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.035 0.000 0.035
q("no")
R -q
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.033 0.001 0.033
setDTthreads(40L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 1.132 0.007 0.061
q("no")
|
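A possible stop-gap while the Linux slowdown stands is to drop to one thread only for the many-small-groups call and leave the global setting alone. A minimal sketch; with_dt_threads() is a hypothetical helper written for this note, not part of data.table:
library(data.table)
# hypothetical helper: evaluate one expression under a reduced thread count,
# then restore the previous setting even if the expression errors
with_dt_threads = function(n, expr) {
  old = getDTthreads()
  setDTthreads(n)
  on.exit(setDTthreads(old))
  expr
}
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                V = rpois(NN, 10))
setDTthreads(0L)   # 0 = all logical CPUs, to mimic a high machine-wide default
system.time(with_dt_threads(1L, DT[, log(sum(V)), by = grp1]))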
Sorry about that, Jan. It's interesting that this is now good on Windows but still bad on Linux. I ran it twice as well, but there was no real difference. It would be interesting if your major benchmark project could cover cross-platform differences - not to error out, just to highlight this type of thing. |
No worries. If you have access to a 32-40 core Windows machine, it would be useful to check it there as well. IMO benchmarking on Windows is low priority because this OS should not be used for any serious stuff other than gaming ;) |
I can check on a high core count Windows machine sometime in the next few days. |
I tested on a 64-core Windows PC. The performance is now approximately the same with 1 and 40 threads:
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(40L)
system.time(DT[ , log(sum(V)), by = grp1])
#user system elapsed
#0.04 0.00 0.04
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
#user system elapsed
#0.03 0.00 0.03
## Restart R
library(data.table)
NN = 1e5
set.seed(1)
DT = data.table(grp1 = as.character(rep(1:(NN/4), each = 4)),
                grp2 = sample(5000L, NN, TRUE),
                V = rpois(NN, 10))
setDTthreads(1L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.03 0.00 0.03
setDTthreads(40L)
system.time(DT[ , log(sum(V)), by = grp1])
# user system elapsed
# 0.05 0.06 0.04
##
## Restart R
##
library(data.table)
N = 1e6L
set.seed(108)
d = data.table(id3 = sample(c(seq.int(N*0.9), sample(N*0.9, N*0.1, TRUE))), # 9e5 unq values
               v1 = sample(5L, N, TRUE),
               v2 = sample(5L, N, TRUE))
setDTthreads(40L)
system.time(d[, max(v1)-min(v2), by=id3])
# user system elapsed
# 1.12   0.05    1.06
system.time(d[, max(v1)-min(v2), by=id3])
# user system elapsed
#    1      0       1
setDTthreads(1L)
system.time(d[, max(v1)-min(v2), by=id3])
# user system elapsed
# 1.03   0.03    1.06
system.time(d[, max(v1)-min(v2), by=id3])
# user system elapsed
# 1.03   0.00    1.03
Sys.getenv("NUMBER_OF_PROCESSORS")
# [1] "64"
sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.13.2
loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2 |
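For anyone profiling this further, data.table's verbose mode prints per-stage timings for the grouping step (finding the groups vs. evaluating j over the groups), which should help locate where any extra multi-threaded time goes. A minimal sketch reusing the data shape from the comment above:
library(data.table)
N = 1e6L
set.seed(108)
d = data.table(id3 = sample(c(seq.int(N*0.9), sample(N*0.9, N*0.1, TRUE))),
               v1 = sample(5L, N, TRUE),
               v2 = sample(5L, N, TRUE))
options(datatable.verbose = TRUE)    # print the internal timing breakdown for [.data.table
invisible(d[, max(v1) - min(v2), by = id3])
options(datatable.verbose = FALSE)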
The example below shows how the default number of threads on my machine (18) makes data.table much slower than when parallelism is disabled. A minimal reproducible example follows.
This is related to but not exactly like the following issues:
#
Minimal reproducible example
#
Output of sessionInfo()