Selecting from data.table by row is very slow #3735

Open · chnynf opened this issue Jul 30, 2019 · 7 comments · May be fixed by #4488
chnynf commented Jul 30, 2019

library(data.table)

allIterations <- data.frame(v1 = runif(1e5), v2 = runif(1e5))
DoSomething <- function(row) {
  someCalculation <- row[["v1"]] + 1
}
system.time(
       {
         for (r in 1:nrow(allIterations)) {
           DoSomething(allIterations[r, ])
         }
       }
     )
##   user  system elapsed 
##   4.50    0.02    4.55 

allIterations <- as.data.table(allIterations)
system.time(
       {
         for (r in 1:nrow(allIterations)) {
           DoSomething(allIterations[r, ])
         }
       }
     )
##   user  system elapsed 
##   53.78   25.05   78.46 

I'm working on an R project that involves applying fairly complicated functions across a data.table or data.frame by rows.
In cases where vectorizing is not a good option, one might need to loop through the rows, and that's when I realized that selecting by row number from a data.table is actually much slower than from a data.frame.
I guess selecting by row number is not a recommended practice for data.table? Or would the team be interested in looking into this and optimizing the performance?

I have more details about my test here.

shrektan (Member) commented Jul 31, 2019

The main issue is not whether you select the rows by row number. It's that the loop invokes data.table's [ call too many times. data.table is fast because of internal optimization, which comes with a cost: each [ call on a data.table does much more work (query optimization, checks, etc.) than on a data.frame. In this row-by-row looping case, all of that optimization effort is wasted.

If looping over all the rows is unavoidable, I suggest using purrr::pmap().

df <- data.frame(v1 = runif(1e3), v2 = runif(1e3))
cal <- function(row) {
  row$v1 + 1
}
res1 <- res2 <- res3 <- res4 <- double(nrow(df))

t <- proc.time()
for (r in 1:nrow(df)) {
  res1[r] <- cal(df[r, ])
}
data.table::timetaken(t)
#> [1] "0.110s elapsed (0.090s cpu)"

dt <- data.table::as.data.table(df)
t <- proc.time()
for (r in 1:nrow(dt)) {
  res2[r] <- cal(dt[r, ])
}
data.table::timetaken(t)
#> [1] "0.510s elapsed (0.470s cpu)"

t <- proc.time()
res3 <- purrr::pmap_dbl(dt, function(...) {
  cal(list(...))
})
data.table::timetaken(t)
#> [1] "0.030s elapsed (0.010s cpu)"

all.equal(res2, res1)
#> [1] TRUE
all.equal(res3, res1)
#> [1] TRUE

Created on 2019-07-31 by the reprex package (v0.2.1)

jangorecki (Member) commented:

Confirming what @shrektan wrote. Anyway I think we should be able to speed up such things pretty easily.

jangorecki (Member) commented Jul 31, 2019

When selecting a single row by its integer index, it makes sense to switch to single-threaded mode, so setting setDTthreads(1L) might help. Related issue: #3175.
It is possible to match the performance of data.frame subsetting by integer index using an internal routine, but it is not exported.
DF: 3.697s
DT: 3.579s

library(data.table)
set.seed(108)
n = 1e5
df = data.frame(v1 = runif(n), v2 = runif(n))
dt1 = data.table(v1 = runif(n), v2 = runif(n))
dt2 = data.table(v1 = runif(n), v2 = runif(n))
# fast integer-row subset via the unexported internal CsubsetDT,
# bypassing the overhead of [.data.table
frow = function(x, irows, safe=FALSE) {
  stopifnot(is.data.table(x), is.integer(irows), length(irows)>0L, is.logical(safe), length(safe)==1L, !is.na(safe))
  if (safe) stopifnot(all(between(irows, 1L, nrow(x))))
  .Call(data.table:::CsubsetDT, x, irows, seq_along(x))
}
do = function(row) row[["v1"]]+1

system.time(for (r in 1:n) do(df[r, ]))
#   user  system elapsed 
#  3.693   0.003   3.697 

setDTthreads(4L)
system.time(for (r in 1:n) do(dt1[r, ]))
#   user  system elapsed 
# 73.497   0.299  19.205
system.time(for (r in 1:n) do(frow(dt2, r)))
#   user  system elapsed 
# 21.125   0.128   5.488 
system.time(for (r in 1:n) do(frow(dt2, r, safe=TRUE)))
#   user  system elapsed 
# 28.016   0.179   7.294 

setDTthreads(1L)
system.time(for (r in 1:n) do(dt1[r, ]))
#   user  system elapsed 
# 12.619   0.128  12.749 
system.time(for (r in 1:n) do(frow(dt2, r)))
#   user  system elapsed 
#  3.538   0.040   3.579 
system.time(for (r in 1:n) do(frow(dt2, r, safe=TRUE)))
#   user  system elapsed 
#  4.923   0.088   5.012 

It could be handled internally and transparently, but that requires a bit of a rewrite of [.data.table, because the i argument can take various forms and the NSE processing makes early detection of the input type (and thus this optimisation) harder.
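In the meantime, that dispatch can be sketched at user level, outside [.data.table. A minimal illustration, assuming the frow() helper defined above is in scope; subset_rows is just an illustrative name, not a data.table API:

# illustrative wrapper: route plain integer row indices to the fast path,
# fall back to [.data.table for every other form of i
subset_rows = function(x, i) {
  if (is.integer(i) && length(i) > 0L && !anyNA(i)) {
    frow(x, i)   # no NSE, no query-optimisation overhead
  } else {
    x[i, ]       # regular [.data.table
  }
}

system.time(for (r in 1:n) do(subset_rows(dt2, r)))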

jangorecki (Member) commented May 24, 2020

Some progress towards this issue has been made in #4484, but the overhead of [.data.table is still significant. I measured the time spent in [.data.table internals and tried to skip as much extra code as possible, but the speed-up I was able to get was around 13%. I am not sure we want another extra escape branch just for a 13% gain.
To address this issue fully, we either have to:

  • Provide a non-NSE interface for the i argument, so that it behaves like data.frame's i argument. We already provide that for the j argument via with, so this could be done with with=c(i=FALSE, j=TRUE), as already proposed (found it, that was you :) ) in "i argument could get with=FALSE" #4485. Alternatively we could reuse another argument, as in "[.data.table which argument could accept integer" #3736. (A hypothetical sketch of both follows this list.)
  • Rewrite the initial parts of [.data.table.
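Purely illustrative, not an existing API: neither proposed interface exists in data.table today, and the exact semantics would be settled in #4485 / #3736. Roughly, usage might look like:

# hypothetical: non-NSE i via with (proposal in #4485)
dt1[5L, with = c(i = FALSE, j = TRUE)]   # i taken as a plain integer vector, no NSE

# hypothetical: integer which argument (proposal in #3736)
dt1[which = 5L]                          # select row(s) by integer position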

ColeMiller1 (Contributor) commented Jun 1, 2020

Just promoting the idea - using by = 1:nrow(dt) solves this issue as well and is actually the fastest of the presented options.

Also, @chnynf, are you on Windows? Your high system.times reflect my experience on Windows.

library(data.table) ##1.12.8
setDTthreads(1L)

df <- data.frame(v1 = runif(1e3), v2 = runif(1e3))
cal <- function(row) row$v1 + 1

res1 <- res2 <- res3 <- res4 <- double(nrow(df))

t <- proc.time()
for (r in 1:nrow(df)) {
  res1[r] <- cal(df[r, ])
}
data.table::timetaken(t)
#> [1] "0.050s elapsed (0.030s cpu)"

dt <- data.table::as.data.table(df)
t <- proc.time()
for (r in 1:nrow(dt)) {
  res2[r] <- cal(dt[r, ])
}
data.table::timetaken(t)
#> [1] "0.240s elapsed (0.210s cpu)"

t <- proc.time()
res3 <- purrr::pmap_dbl(dt, function(...) {
  cal(list(...))
})
data.table::timetaken(t)
#> [1] "0.060s elapsed (0.040s cpu)"

t <- proc.time()
res4 <- dt[, cal(.SD), by = 1:nrow(dt)]$V1
data.table::timetaken(t)
#> [1] "0.010s elapsed (0.000s cpu)"

all.equal(res2, res1)
#> [1] TRUE
all.equal(res3, res1)
#> [1] TRUE
all.equal(res4, res1)
#> [1] TRUE

chnynf (Author) commented Jun 3, 2020

Yes @ColeMiller1, the test was on Windows. I tried your by = 1:nrow(dt) approach on my Windows machine and it is much faster.

Thank you guys for working on this!

MLopez-Ibanez (Contributor) commented:
I have a similar problem. In this code:

library(data.table)
parameters <- list(types = c(p1 = "r", p2 = "r", p3 = "r", dummy = "c"),
                   digits = 4)
n <- 10000
newConfigurations <- data.table(p1 = runif(n), p2 = runif(n), p3 = runif(n),
                                dummy = sample(c("d1", "d2"), n, replace=TRUE))

repair_sum2one <- function(configuration, parameters)
{
  isreal <- names(which(parameters$types[colnames(configuration)] == "r"))
  digits <- parameters$digits[isreal]
  c_real <- unlist(configuration[isreal])
  c_real <- c_real / sum(c_real)
  c_real[-1] <- round(c_real[-1], digits[-1])
  c_real[1] <- 1 - sum(c_real[-1])
  configuration[isreal] <- c_real
  return(configuration)
}
j <- colnames(newConfigurations)
for (i in seq_len(nrow(newConfigurations)))
      set(newConfigurations, i, j = j, value = repair_sum2one(as.data.frame(newConfigurations[i]), parameters))

More than half of the total time is spent in [.data.table; the extraction of each row is more expensive than the call to repair_sum2one itself.
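A minimal sketch of a possible workaround, reusing the frow() helper from the earlier comment (it relies on the unexported data.table:::CsubsetDT, so it is not a supported API) to extract each row without going through [.data.table; as.data.frame() keeps repair_sum2one's colnames()-based indexing working unchanged:

for (i in seq_len(nrow(newConfigurations)))
  set(newConfigurations, i, j = j,
      value = repair_sum2one(as.data.frame(frow(newConfigurations, i)), parameters))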
