Need an easier way to in-place merge multiple columns #3184

renkun-ken · 2018-12-05T07:08:26Z

In-place merge in the form of dt1[dt2, x := y, on = .(col1, col2)] is useful when dt1 is very large. It also supports merging multiple columns from dt2 using `:=`(x1 = y1, x2 = y2). However, when I need to merge many columns from dt2 into dt1, the only options seem to be listing every column explicitly, since the column names cannot be determined dynamically from a character vector as they can with .SD, or using meta-programming facilities to generate an expression and evaluate it.

One simple example follows. A practical use case is when dt1 and dt2 are very large, so using merge causes a copy that is very slow and may exceed the memory limit (which is exactly why in-place operations were introduced).

library(data.table)

d1 <- data.table(id = 1:10)
for (i in 1:10) {
  d1[, paste0("x", i) := rnorm(.N)]
}

d2 <- data.table(id = 3:6)
for (i in 1:5) {
  d2[, paste0("y", i) := rnorm(.N)]
}

d1[d2, paste0("z", 1:5) := list(y1, y2, y3, y4, y5), on = "id"]

Another similar problem is to in-place merge all columns of d2 without specifying source and target column names.

jangorecki · 2018-12-05T09:11:34Z

The most reliable way as of now is to use meta-programming. Good examples are https://stackoverflow.com/a/37008966/2490497
#3052 (comment)
#2655 (comment)
You can find some ready-made solutions in the data.cube package, where this is a very common operation:
https://gitlab.com/jangorecki/data.cube/blob/master/R/data.table.R#L91-110

MichaelChirico · 2018-12-06T03:05:33Z

Large overlap as well with #935, including #935 (comment)

jangorecki · 2019-05-21T05:14:17Z

@renkun-ken is there any API you would like to propose?
I don't think we can do significantly better than the proper way to achieve it now:

as.call(c(as.name("list"), lapply(paste0("y",1:5), as.name)))

The good thing is that the above relies only on base R, so the user does not need to learn any new non-base R function or package.
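As a minimal sketch of how such a call could be spliced into the in-place join: the query below is assembled with base R's substitute and then evaluated. The toy tables and the names src, rhs and query are illustrative assumptions, not taken from the linked solutions.

```r
library(data.table)

# Small deterministic stand-ins for the large tables in the original report
d1 <- data.table(id = 1:10)
d2 <- data.table(id = 3:6,
                 y1 = c(0.3, 0.4, 0.5, 0.6),
                 y2 = c(30, 40, 50, 60))

# Source columns chosen at run time
src <- paste0("y", 1:2)

# Build the call list(y1, y2) from the character vector
rhs <- as.call(c(as.name("list"), lapply(src, as.name)))

# Splice the call into the join; only RHS is replaced, the rest stays as written
query <- substitute(
  d1[d2, paste0("z", 1:2) := RHS, on = "id"],
  list(RHS = rhs)
)

# Evaluating the assembled query updates d1 by reference
invisible(eval(query))
```

After evaluation, d1 should carry the new columns z1 and z2, filled from d2 for the matching ids and NA elsewhere.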

franknarf1 · 2019-05-21T13:08:44Z

One idea:

update(x, mx, on, i = NULL, cols)

# translates to...
# if cols is a character vector
    # and it has names
    x[i, names(cols) := mx[.SD, on=on, mget(sprintf("x.%s", cols))]]

    # and it doesn't have names
    x[i, (cols) := mx[.SD, on=on, mget(sprintf("x.%s", cols))]]

# if cols instead is an expression (expected to yield one value per row of x[i])
    # eg, update(x, mx, on=.(id), cols = .(s = sum(x.v)))

    x[, "s" := mx[.SD, on=.(id), sum(x.v), by=.EACHI][, !"id"]]

I often do both of these. Avoiding mget would help save keystrokes and avoid finicky edge cases, I guess. Related: #935 (comment) already mentioned above. The summarization/expression syntax would help so that I don't need to type the on= column "id" twice, solving my main use-case for requesting #2061
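The character-vector case of this idea can be sketched as a small wrapper. This is only an illustration under stated assumptions: update_cols is a hypothetical helper name, it is not part of data.table, it drops the i argument from the proposal, and it uses the i. prefix inside x[mx, ...] rather than the mx[.SD, ...] form shown above.

```r
library(data.table)

# Hypothetical helper (not a data.table function): copy columns `cols` of `mx`
# into `x` by reference, matching rows on `on`. If `cols` is named, the names
# are used as the target column names in `x`; otherwise the source names are reused.
update_cols <- function(x, mx, on, cols) {
  tgt <- if (is.null(names(cols))) cols else names(cols)
  # Build list(i.<col1>, i.<col2>, ...) so the values come from mx
  rhs <- as.call(c(as.name("list"),
                   lapply(paste0("i.", cols), as.name)))
  # Only RHS is spliced in; x, mx, tgt and on are looked up in this function's frame
  eval(substitute(x[mx, (tgt) := RHS, on = on], list(RHS = rhs)))
  invisible(x)
}

d1 <- data.table(id = 1:10)
d2 <- data.table(id = 3:6,
                 y1 = c(0.3, 0.4, 0.5, 0.6),
                 y2 = c(30, 40, 50, 60))

# Named cols: y1 lands in z1 and y2 lands in z2
update_cols(d1, d2, on = "id", cols = c(z1 = "y1", z2 = "y2"))
```

A version built on bmerge and set would avoid constructing a call at all, but that relies on data.table internals and is not attempted here.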

jangorecki · 2019-05-21T13:20:44Z

If we want an update wrapper, would it be better to push it down a level and build it on bmerge and set?

renkun-ken · 2021-05-31T14:22:21Z

I think #4304 has already addressed this feature request given that env= will turn a list into a call.

d1[d2, paste0("z", 1:5) := list(y1, y2, y3, y4, y5), on = "id"]

could be nicely programmed into

d1[d2, paste0("z", 1:5) := Y, on = "id", env = list(Y = as.list(paste0("y", 1:5)))]

jangorecki · 2021-05-31T15:11:25Z

Yes, definitely. I would just add "i." into paste to have more precisely defined columns. Let's close this issue with just a unit test then.
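A minimal sketch of the env= form with the suggested "i." prefix, assuming a data.table release that ships the env argument from #4304. The toy tables and the names src and tgt are illustrative.

```r
library(data.table)

d1 <- data.table(id = 1:10)
d2 <- data.table(id = 3:6,
                 y1 = c(0.3, 0.4, 0.5, 0.6),
                 y2 = c(30, 40, 50, 60))

src <- paste0("y", 1:2)  # columns to take from d2
tgt <- paste0("z", 1:2)  # columns to create in d1

# env= turns the list of strings into the call list(i.y1, i.y2);
# the "i." prefix makes explicit that the values come from d2
d1[d2, (tgt) := Y, on = "id",
   env = list(Y = as.list(paste0("i.", src)))]
```

Here only Y is substituted through env; tgt is an ordinary variable evaluated in the calling scope.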

renkun-ken mentioned this issue Mar 22, 2020

programming on data.table #4304

Merged

jangorecki added the programming parameterizing queries: get, mget, eval, env label Apr 5, 2020

jangorecki mentioned this issue Jun 17, 2020

Convenience method for join+insertion operation #4080

Closed

jangorecki added the tests label Jun 10, 2021

jangorecki added a commit that referenced this issue Jun 22, 2021

easier way lkp all columns on join, closes #3184

b3aa08d

jangorecki mentioned this issue Jun 22, 2021

easier way lkp all columns on join, closes #3184 #5052

Merged

jangorecki linked a pull request Jun 22, 2021 that will close this issue

easier way lkp all columns on join, closes #3184 #5052

Merged

mattdowle added this to the 1.14.1 milestone Jun 22, 2021

mattdowle closed this as completed in #5052 Jun 22, 2021

mattdowle pushed a commit that referenced this issue Jun 22, 2021

easier way lkp all columns on join, closes #3184 (#5052)

db26698

jangorecki modified the milestones: 1.14.9, 1.15.0 Oct 29, 2023
