parallelise subset in `[.data.table` operator #2951
In dev as of #3170, it's now parallel within column. So
library(data.table)
N=1e8; K=100
set.seed(1)
DT <- data.table(
  id1 = sample(sprintf("id%03d",1:K), N, TRUE),       # any character profile will do
  id2 = sample(sprintf("id%010d",1:(N/K)), N, TRUE),  # any character profile will do
  v1 = sample(5, N, TRUE),                            # int in range [1,5]
  v2 = sample(5, N, TRUE),                            # int in range [1,5]
  v3 = sample(5, N, TRUE),                            # int in range [1,5]
  v4 = sample(5, N, TRUE),                            # int in range [1,5]
  v5 = sample(5, N, TRUE),                            # int in range [1,5]
  v6 = sample(round(runif(100,max=100),4), N, TRUE)   # numeric e.g. 23.5749
)
x = which(DT$v1 > 3)
length(x)/nrow(DT) # select 40% of the rows
                                                    # dev  v1.11.8
                                                    # ---  -------
setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))  # 1.6  1.9 seconds
system.time(DT[x])                                  # 1.7  4.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))  # 1.9  2.7
system.time(DT[x])                                  # 2.0  4.9
DT[,id1:=as.factor(id1)]
DT[,id2:=as.factor(id2)]
setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))  # 0.4  0.6
system.time(DT[x])                                  # 0.6  3.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))  # 0.8  1.5
system.time(DT[x])                                  # 1.0  3.6
So as of dev now, using
Now with #3210 merged,
                                                    # dev  v1.11.8
                                                    # ---  -------
setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))  # 1.6  1.9 seconds
system.time(DT[x])                                  # 1.6  4.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))  # 1.9  2.7
system.time(DT[x])                                  # 1.9  4.9
DT[,id1:=as.factor(id1)]
DT[,id2:=as.factor(id2)]
setDTthreads(8)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))  # 0.4  0.6
system.time(DT[x])                                  # 0.4  3.7
setDTthreads(1)
system.time(.Call("CsubsetDT", DT, x, 1:ncol(DT)))  # 0.8  1.5
system.time(DT[x])                                  # 0.8  3.6
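For anyone who wants to repeat this kind of comparison without calling the non-exported `CsubsetDT`, here is a rough sketch using only exported functions. It assumes a much smaller `N` than the 1e8 rows above so that it runs quickly, and `subset_with_threads` is a hypothetical helper written for this illustration, not something data.table provides. Timings depend on the machine, so none are shown.

```r
library(data.table)

N <- 1e6; K <- 100   # assumption: scaled down from the N=1e8 used in the timings above
set.seed(1)
DT <- data.table(
  id1 = sample(sprintf("id%03d", 1:K), N, TRUE),           # character column
  v1  = sample(5, N, TRUE),                                # int in range [1,5]
  v6  = sample(round(runif(100, max = 100), 4), N, TRUE)   # numeric column
)
x <- which(DT$v1 > 3)   # roughly 40% of the rows

# Hypothetical helper (not part of data.table): run DT[x] under a given
# thread count, then restore whatever thread setting was active before.
subset_with_threads <- function(DT, x, threads) {
  old <- setDTthreads(threads)            # setDTthreads() returns the previous value
  on.exit(setDTthreads(old), add = TRUE)
  elapsed <- system.time(res <- DT[x])[["elapsed"]]
  list(result = res, elapsed = elapsed)
}

single <- subset_with_threads(DT, x, 1)
multi  <- subset_with_threads(DT, x, 2)

# Same comparison with the character column converted to factor,
# mirroring the as.factor() step in the timings above.
DTf <- copy(DT)[, id1 := as.factor(id1)]
fac <- subset_with_threads(DTf, x, 2)

# The thread count should only affect speed, never the result.
stopifnot(isTRUE(all.equal(single$result, multi$result)))

c(single = single$elapsed, multi = multi$elapsed, factor = fac$elapsed)
```

At this size the differences may be within noise; the gap reported in the thread was measured on 1e8 rows.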
The internal, non-exported `CsubsetDT` was parallelised quite a long time ago, but there is no API for users to call it. I am sure we benefit from it internally, but it should also be put to good use in `[.data.table`, which is currently not the case. Using data from #1660: