Major performance drop of keyed := when index is present #4311

renkun-ken · 2020-03-20T07:13:28Z

The performance of dt[selector, foo := bar] on key could significantly drop when an index is present. Following is my use case and reproducible example:

library(data.table)

dt <- data.table(symbol = rep(1:1000, each = 5000))
dt[, date := seq_len(.N), by = symbol]
setkeyv(dt, c("symbol", "date"))

flag_dt <- data.table(symbol = sample.int(500, 5000, replace = TRUE))
flag_dt[, start_date := sample.int(3000, .N, replace = TRUE)]
flag_dt[, end_date := start_date + sample.int(3000, .N, replace = TRUE)]
flag_dt[, id := seq_len(.N)]

calendar <- dt[, sort(unique(date))]

When dt has no index, the following code that repeatedly using symbol, date selector to modify flag is fast enough.

system.time({
  dt[, flag := 0L]
  flag_dt[, {
    dates <- calendar[calendar %between% c(start_date, end_date)]
    if (length(dates)) {
      selector <- list(symbol, dates)
      dt[selector, flag := 1L]
    }
    NULL
  }, by = id]
})

   user  system elapsed 
 26.189   0.399   3.344

  user  system elapsed 
 24.648   0.220   3.119

However, if an index is created intentionally, or in many cases unintentionally (auto index triggered by dt[flag0 == 1, ...]), the performance of the above code significantly decreases and could be unstable:

dt[, flag0 := sample(0:1, .N, replace = TRUE)]
setindexv(dt, "flag0")
system.time({
  dt[, flag := 0L]
  flag_dt[, {
    dates <- calendar[calendar %between% c(start_date, end_date)]
    if (length(dates)) {
      selector <- list(symbol, dates)
      dt[selector, flag := 1L]
    }
    NULL
  }, by = id]
})

   user  system elapsed 
386.415  27.380  52.938

   user  system elapsed 
212.908   7.289  27.665

I also tried explicitly writing dt[selector, flag := 1L, on = .(symbol, date)], still no luck.

Avoiding creating an index or disabling auto-index could avoid this problem but I'm still curious if there's something that significantly adds the overhead of keyed := while there's an index.

The text was updated successfully, but these errors were encountered:

tlapak · 2020-04-15T18:15:05Z

I ran your code through profvis. It shows that with the index you're de/allocating ~170 GB of memory as opposed to ~1.6 GB without one. Most of the time is spent in calls to .shallow() originating from bmerge() with about half being occupied by the gc. That latter part would be where the variability in timings you observed comes from.

Now what's going on? First, bmerge() makes a shallow copy of x which is just dt in this case. Looking into shallow(), at C level you find a call of DUPLICATE_ATTRIB which, among other things, creates a complete copy of the index vector. Since the vector occupies 20 MB and you're doing it 5000 times we've already accounted for ~100 GB of the memory footprint.

I haven't dug into it further to figure out the difference. But just to confirm,

setindex(dt, NULL)
setattr(dt, "test", rep(1L, 5e6))

gives the same time and memory footprint.

So the effect is not related to := or the key.

I'm not going to mess with shallow, but comments around that line indicate a vague plan to remove the copy. Maybe a remark in the relevant vignette or in the docs would be nice to have.

And just in case the code wasn't just contrived to produce the effect

dt[flag_dt, flag := 1L, on=c("symbol", "date>=start_date", "date<=end_date")]

gets the job done in 2%-3% of the time needed when doing it your way without index either way.

jangorecki · 2020-04-15T18:24:34Z

@tlapak thanks for digging into this. I am doing some bmerge rework, if you are going to touch only shallow, then fine, otherwise if also bmerge, then we will have a conflict to resolve :)

renkun-ken · 2020-04-16T01:16:55Z

@tlapak Thanks for digging into that. The non-equi join is the perfect way to do that.

Some of my practical use cases involve more calculations on each row to get the dates in which it might either not be consecutive in calendar or uses another variable for each row to determine if there is an end_date or none, and this makes is not easy to apply non-equi join in a uniform way.

jangorecki · 2020-06-21T17:56:26Z

I checked that #4440 resolves the speed issue on @renkun-ken example. And regression was already in 1.12.8, so not just current devel issue.

… data.table before the regression that got reported in Rdatatable#4311 was introduced

jangorecki added the performance label Apr 5, 2020

tlapak mentioned this issue May 10, 2020

Remove deep copy of indices from shallow() #4440

Merged

jangorecki added the High label May 10, 2020

jangorecki added this to the 1.12.9 milestone May 10, 2020

mattdowle modified the milestones: 1.12.11, 1.12.9 Jun 21, 2020

mattdowle closed this as completed in #4440 Jun 26, 2020

DorisAmoakohene mentioned this issue Oct 18, 2023

closed commit ids data.table issue#4440 DorisAmoakohene/data.table_test#7

Open

Anirban166 mentioned this issue Apr 1, 2024

GitHub Action + atime test to observe the performance regression introduced by PR #4491 and fixed by PR #5463 Anirban166/data.table#2

Merged

Anirban166 added a commit to Anirban166/data.table that referenced this issue Apr 1, 2024

Assigning the 1.12.6 commit SHA to potentially replicate the state of…

fa4353e

… data.table before the regression that got reported in Rdatatable#4311 was introduced

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Major performance drop of keyed := when index is present #4311

Major performance drop of keyed := when index is present #4311

renkun-ken commented Mar 20, 2020 •

edited

tlapak commented Apr 15, 2020 •

edited

jangorecki commented Apr 15, 2020 •

edited

renkun-ken commented Apr 16, 2020 •

edited

jangorecki commented Jun 21, 2020 •

edited

Major performance drop of keyed := when index is present #4311

Major performance drop of keyed := when index is present #4311

Comments

renkun-ken commented Mar 20, 2020 • edited

tlapak commented Apr 15, 2020 • edited

jangorecki commented Apr 15, 2020 • edited

renkun-ken commented Apr 16, 2020 • edited

jangorecki commented Jun 21, 2020 • edited

renkun-ken commented Mar 20, 2020 •

edited

tlapak commented Apr 15, 2020 •

edited

jangorecki commented Apr 15, 2020 •

edited

renkun-ken commented Apr 16, 2020 •

edited

jangorecki commented Jun 21, 2020 •

edited