test and confirm new parallel subset performance #3175
The following script tests subsetting by integer row ids and times each subset call.
vim dt-parallel-subset.R
args = as.integer(commandArgs(TRUE))
th = args[1L]
N = args[2L]
K = 100L
get_i = function(n.out, n.in) {
  n.out = as.integer(n.out)
  n.in = as.integer(n.in)
  set.seed(n.out)
  sample(n.in, n.out)
}
library(data.table)
cat(sprintf("# datagen %s rows\n", N))
set.seed(108)
DT = data.table(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
  id2 = sample(sprintf("id%03d",1:K), N, TRUE), # large groups (char)
  id3 = sample(sprintf("id%010d",1:(N/K)), N, TRUE), # small groups (char)
  id4 = sample(K, N, TRUE), # large groups (int)
  id5 = sample(K, N, TRUE), # large groups (int)
  id6 = sample(N/K, N, TRUE), # small groups (int)
  v1 = sample(5, N, TRUE), # int in range [1,5]
  v2 = sample(5, N, TRUE), # int in range [1,5]
  v3 = sample(round(runif(100,max=100),4), N, TRUE) # numeric e.g. 23.5749
)
cat(sprintf("# setDTthreads(%s)\n", th))
setDTthreads(th)
cat("# 0 row (first `[` call overhead):\n")
system.time(ans<-DT[0L])
cat("# 1 row:\n")
i = get_i(1L, nrow(DT))
system.time(ans<-DT[i])
cat("# 2 rows:\n")
i = get_i(2L, nrow(DT))
system.time(ans<-DT[i])
cat("# 5 rows:\n")
i = get_i(5L, nrow(DT))
system.time(ans<-DT[i])
cat("# 10% of rows:\n")
i = get_i(nrow(DT)*0.1, nrow(DT))
system.time(ans<-DT[i])
q("no")
Rscript dt-parallel-subset.R 1 1e6
timings coming soon
1th 1e7
20th 1e7
1th 1e8
20th 1e8
1th 1e9
20th 1e9
During the timings above I observed that a team of threads was started even for subsets of 1, 2 and 5 rows. Still, it did not result in noticeable overhead: all subsets of 1, 2 and 5 rows took 0.000-0.001 seconds.
The checks above used a single subset operation. I see a noticeable difference when I loop over the subset operation.
library(data.table)
m = matrix(1L, nrow=1e8, ncol=10)
DT = as.data.table(m)
setDTthreads(20)
system.time(for (i in 1:1000) DT[i,])
# user system elapsed
# 4.210 0.000 0.229
setDTthreads(1)
system.time(for (i in 1:1000) DT[i,])
# user system elapsed
# 0.107 0.007 0.114
@mattdowle does it qualify for reopen?
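To make the loop comparison above easier to repeat, here is a minimal sketch. The helper `time_subset_loop` is hypothetical (it is not part of data.table), and the table is deliberately much smaller than the 1e8 rows used above so that it runs quickly. It only relies on `setDTthreads()`, `getDTthreads()` and `system.time()`.

```r
library(data.table)

# Hypothetical helper, not part of data.table: time `reps` single-row
# subsets of `DT` under a given thread setting, then restore the old setting.
time_subset_loop = function(DT, threads, reps = 1000L) {
  old = getDTthreads()
  on.exit(setDTthreads(old), add = TRUE)
  setDTthreads(threads)
  tm = unclass(system.time(for (i in seq_len(reps)) DT[i, ]))
  c(threads = getDTthreads(), tm[c("user.self", "sys.self", "elapsed")])
}

# Assumption: a small table for a quick run; the thread above used nrow=1e8.
DT = as.data.table(matrix(1L, nrow = 1e5, ncol = 10))
res = rbind(
  multi  = time_subset_loop(DT, threads = 0L),  # 0 = all logical CPUs
  single = time_subset_loop(DT, threads = 1L)
)
print(res)
```

If the overhead reported above is present, the `multi` row should show user time well above elapsed time, while the `single` row should show the two roughly equal. Actual numbers depend on the machine and the data.table version.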
PR #4484 closes this one. Using v1.12.8 to confirm Jan's result:
> m = matrix(1L, nrow=1e8, ncol=10)
> DT = as.data.table(m)
> setDTthreads(0)
> system.time(for (i in 1:1000) DT[i,])
user system elapsed
1.512 0.000 0.143
> setDTthreads(1)
> system.time(for (i in 1:1000) DT[i,])
user system elapsed
0.083 0.000 0.083
With #4484:
> setDTthreads(0)
> system.time(for (i in 1:1000) DT[i,])
user system elapsed
0.071 0.000 0.071
> setDTthreads(1)
> system.time(for (i in 1:1000) DT[i,])
user system elapsed
0.072 0.000 0.072
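As a rough reading of the numbers above, the sketch below converts the reported timings into a per-call cost and a user/elapsed ratio. The values are copied from the output above; the variable names are only illustrative.

```r
# Timings (seconds) for 1000 single-row subsets, copied from the output above.
elapsed = rbind(
  "v1.12.8" = c(threads0 = 0.143, threads1 = 0.083),
  "#4484"   = c(threads0 = 0.071, threads1 = 0.072)
)
user = rbind(
  "v1.12.8" = c(threads0 = 1.512, threads1 = 0.083),
  "#4484"   = c(threads0 = 0.071, threads1 = 0.072)
)

# Average elapsed time per DT[i,] call, in microseconds.
per_call_us = elapsed / 1000 * 1e6
print(per_call_us)

# User time is summed over threads, so a ratio well above 1 suggests
# that several threads were busy during the loop.
cpu_ratio = user / elapsed
print(round(cpu_ratio, 1))
```

On v1.12.8 with `setDTthreads(0)` this gives about 143 microseconds per call against 83 with one thread, and a user/elapsed ratio above 10. With #4484 both settings come out at roughly 71-72 microseconds with a ratio of 1.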
Matt commented:
data.table/src/subset.c
Lines 27 to 30 in 1847500