Selecting from data.table by row is very slow #3735

Open · chnynf opened this issue Jul 30, 2019 · 7 comments · May be fixed by #4488
chnynf commented Jul 30, 2019

library(data.table)

allIterations <- data.frame(v1 = runif(1e5), v2 = runif(1e5))
DoSomething <- function(row) {
  someCalculation <- row[["v1"]] + 1
}
system.time(
       {
         for (r in 1:nrow(allIterations)) {
           DoSomething(allIterations[r, ])
         }
       }
     )
##   user  system elapsed 
##   4.50    0.02    4.55 

allIterations <- as.data.table(allIterations)
system.time(
       {
         for (r in 1:nrow(allIterations)) {
           DoSomething(allIterations[r, ])
         }
       }
     )
##   user  system elapsed 
##   53.78   25.05   78.46 

I'm working on an R project that involves applying fairly complicated functions across a data.table or data.frame by rows.
In cases where vectorizing is not a good option, one might need to loop through the rows, and that's when I realized that selecting by row number from a data.table is actually much slower than from a data.frame.
I guess selecting by row number is not a recommended practice for data.table? Or would the team be interested in looking into this and optimizing the performance?

I have more details about my test here.

shrektan (Member) commented Jul 31, 2019

The main issue is not whether you select the rows by row number. It's that the loop invokes data.table's [ call too many times. data.table is fast because of internal optimization, which comes with a cost: each [ call on a data.table does much more work (query optimization, checks, etc.) than on a data.frame. In this row-by-row looping case, all of that optimization effort is wasted.

If looping over all the rows is unavoidable, I suggest using purrr::pmap().

df <- data.frame(v1 = runif(1e3), v2 = runif(1e3))
cal <- function(row) {
  row$v1 + 1
}
res1 <- res2 <- res3 <- res4 <- double(nrow(df))

t <- proc.time()
for (r in 1:nrow(df)) {
  res1[r] <- cal(df[r, ])
}
data.table::timetaken(t)
#> [1] "0.110s elapsed (0.090s cpu)"

dt <- data.table::as.data.table(df)
t <- proc.time()
for (r in 1:nrow(dt)) {
  res2[r] <- cal(dt[r, ])
}
data.table::timetaken(t)
#> [1] "0.510s elapsed (0.470s cpu)"

t <- proc.time()
res3 <- purrr::pmap_dbl(dt, function(...) {
  cal(list(...))
})
data.table::timetaken(t)
#> [1] "0.030s elapsed (0.010s cpu)"

all.equal(res2, res1)
#> [1] TRUE
all.equal(res3, res1)
#> [1] TRUE

Created on 2019-07-31 by the reprex package (v0.2.1)

jangorecki (Member) commented:

Confirming what @shrektan wrote. Anyway I think we should be able to speed up such things pretty easily.

jangorecki (Member) commented Jul 31, 2019

When selecting a single row by its integer index, it makes sense to switch to single-threaded mode, so setting setDTthreads(1L) might help. Related issue: #3175.
It is possible to match the performance of data.frame subsetting by integer index using an internal routine, but it is not exported.
DF: 3.697s
DT: 3.579s

library(data.table)
set.seed(108)
n = 1e5
df = data.frame(v1 = runif(n), v2 = runif(n))
dt1 = data.table(v1 = runif(n), v2 = runif(n))
dt2 = data.table(v1 = runif(n), v2 = runif(n))
# fast integer-row subset via the unexported internal CsubsetDT,
# bypassing the overhead of [.data.table
frow = function(x, irows, safe=FALSE) {
  stopifnot(is.data.table(x), is.integer(irows), length(irows)>0L, is.logical(safe), length(safe)==1L, !is.na(safe))
  if (safe) stopifnot(all(between(irows, 1L, nrow(x))))
  .Call(data.table:::CsubsetDT, x, irows, seq_along(x))
}
do = function(row) row[["v1"]]+1

system.time(for (r in 1:n) do(df[r, ]))
#   user  system elapsed 
#  3.693   0.003   3.697 

setDTthreads(4L)
system.time(for (r in 1:n) do(dt1[r, ]))
#   user  system elapsed 
# 73.497   0.299  19.205
system.time(for (r in 1:n) do(frow(dt2, r)))
#   user  system elapsed 
# 21.125   0.128   5.488 
system.time(for (r in 1:n) do(frow(dt2, r, safe=TRUE)))
#   user  system elapsed 
# 28.016   0.179   7.294 

setDTthreads(1L)
system.time(for (r in 1:n) do(dt1[r, ]))
#   user  system elapsed 
# 12.619   0.128  12.749 
system.time(for (r in 1:n) do(frow(dt2, r)))
#   user  system elapsed 
#  3.538   0.040   3.579 
system.time(for (r in 1:n) do(frow(dt2, r, safe=TRUE)))
#   user  system elapsed 
#  4.923   0.088   5.012 

It could be handled internally and transparently, but that requires a bit of a rewrite of [.data.table, because the i argument can take various forms and the NSE processing makes early detection of the input type (and thus this optimisation) harder.
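In the meantime, that dispatch can be sketched at user level, outside [.data.table. A minimal illustration, assuming the frow() helper defined above is in scope; subset_rows is just an illustrative name, not a data.table API:

# illustrative wrapper: route plain integer row indices to the fast path,
# fall back to [.data.table for every other form of i
subset_rows = function(x, i) {
  if (is.integer(i) && length(i) > 0L && !anyNA(i)) {
    frow(x, i)   # no NSE, no query-optimisation overhead
  } else {
    x[i, ]       # regular [.data.table
  }
}

system.time(for (r in 1:n) do(subset_rows(dt2, r)))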

jangorecki (Member) commented May 24, 2020

Some progress towards this issue has been made in #4484, but the overhead of [.data.table is still significant. I measured the time spent in [.data.table internals and tried to skip as much extra code as possible, but the speed-up I was able to get was around 13%. I am not sure we want another extra escape branch just for a 13% gain.
To address this issue fully, we either have to:

  • Provide a non-NSE interface for the i argument, so that it behaves like data.frame's i argument. We already provide that for the j argument via with, so this could be done with with=c(i=FALSE, j=TRUE), as already proposed (found it, that was you :) ) in "i argument could get with=FALSE" #4485. Alternatively we could reuse another argument, as in "[.data.table which argument could accept integer" #3736. (A hypothetical sketch of both follows this list.)
  • Rewrite the initial parts of [.data.table.
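Purely illustrative, not an existing API: neither proposed interface exists in data.table today, and the exact semantics would be settled in #4485 / #3736. Roughly, usage might look like:

# hypothetical: non-NSE i via with (proposal in #4485)
dt1[5L, with = c(i = FALSE, j = TRUE)]   # i taken as a plain integer vector, no NSE

# hypothetical: integer which argument (proposal in #3736)
dt1[which = 5L]                          # select row(s) by integer position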

ColeMiller1 (Contributor) commented Jun 1, 2020

Just promoting the idea - using by = 1:nrow(dt) solves this issue as well and is actually the fastest of the presented options.

Also, @chnynf, are you on Windows? Your high system.times reflect my experience on Windows.

library(data.table) ##1.12.8
setDTthreads(1L)

df <- data.frame(v1 = runif(1e3), v2 = runif(1e3))
cal <- function(row) row$v1 + 1

res1 <- res2 <- res3 <- res4 <- double(nrow(df))

t <- proc.time()
for (r in 1:nrow(df)) {
  res1[r] <- cal(df[r, ])
}
data.table::timetaken(t)
#> [1] "0.050s elapsed (0.030s cpu)"

dt <- data.table::as.data.table(df)
t <- proc.time()
for (r in 1:nrow(dt)) {
  res2[r] <- cal(dt[r, ])
}
data.table::timetaken(t)
#> [1] "0.240s elapsed (0.210s cpu)"

t <- proc.time()
res3 <- purrr::pmap_dbl(dt, function(...) {
  cal(list(...))
})
data.table::timetaken(t)
#> [1] "0.060s elapsed (0.040s cpu)"

t <- proc.time()
res4 <- dt[, cal(.SD), by = 1:nrow(dt)]$V1
data.table::timetaken(t)
#> [1] "0.010s elapsed (0.000s cpu)"

all.equal(res2, res1)
#> [1] TRUE
all.equal(res3, res1)
#> [1] TRUE
all.equal(res4, res1)
#> [1] TRUE

chnynf (Author) commented Jun 3, 2020

Yes @ColeMiller1, the test was on Windows. I tried your by = 1:nrow(dt) approach on my Windows machine and it is much faster.

Thank you guys for working on this!

MLopez-Ibanez (Contributor) commented:
I have a similar problem. In this code:

library(data.table)
parameters <- list(types = c(p1 = "r", p2 = "r", p3 = "r", dummy = "c"),
                   digits = 4)
n <- 10000
newConfigurations <- data.table(p1 = runif(n), p2 = runif(n), p3 = runif(n),
                                dummy = sample(c("d1", "d2"), n, replace=TRUE))

repair_sum2one <- function(configuration, parameters)
{
  isreal <- names(which(parameters$types[colnames(configuration)] == "r"))
  digits <- parameters$digits[isreal]
  c_real <- unlist(configuration[isreal])
  c_real <- c_real / sum(c_real)
  c_real[-1] <- round(c_real[-1], digits[-1])
  c_real[1] <- 1 - sum(c_real[-1])
  configuration[isreal] <- c_real
  return(configuration)
}
j <- colnames(newConfigurations)
for (i in seq_len(nrow(newConfigurations)))
      set(newConfigurations, i, j = j, value = repair_sum2one(as.data.frame(newConfigurations[i]), parameters))

More than half of the total time is spent in [.data.table; the extraction of each row is more expensive than the call to repair_sum2one itself.
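A minimal sketch of a possible workaround, reusing the frow() helper from the earlier comment (it relies on the unexported data.table:::CsubsetDT, so it is not a supported API) to extract each row without going through [.data.table; as.data.frame() keeps repair_sum2one's colnames()-based indexing working unchanged:

for (i in seq_len(nrow(newConfigurations)))
  set(newConfigurations, i, j = j,
      value = repair_sum2one(as.data.frame(frow(newConfigurations, i)), parameters))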
