Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Need an easier way to in-place merge multiple columns #3184

Closed
renkun-ken opened this issue Dec 5, 2018 · 7 comments · Fixed by #5052
Closed

Need an easier way to in-place merge multiple columns #3184

renkun-ken opened this issue Dec 5, 2018 · 7 comments · Fixed by #5052
Labels
programming parameterizing queries: get, mget, eval, env tests
Milestone

Comments

@renkun-ken
Copy link
Member

In-place merge in the form of dt1[dt2, x := y, on = .(col1, col2)] is useful when dt1 is very large. It also supports merging multiple columns from dt2 using `:=`(x1 = y1, x2 = y2). However, when I need to merge many columns from dt2 to dt1, it seems only possible to explicitly list all columns rather than dynamically determine the column names via a character vector like done with .SD, or otherwise I need to use meta-programming facilities to generate an expression and evaluate it.

One simple example is as follows. A practice use case is when dt1 and dt2 is very large and using merge will cause copy that is very slow and may exceed memory limit (which is exactly why in-place operations are introduced)

library(data.table)

d1 <- data.table(id = 1:10)
for (i in 1:10) {
  d1[, paste0("x", i) := rnorm(.N)]
}

d2 <- data.table(id = 3:6)
for (i in 1:5) {
  d2[, paste0("y", i) := rnorm(.N)]
}

d1[d2, paste0("z", 1:5) := list(y1, y2, y3, y4, y5), on = "id"]

Another similar problem is to in-place merge all columns of d2 without specifying source and target columns names.

@jangorecki
Copy link
Member

jangorecki commented Dec 5, 2018

most reliable way as of now will be to use meta programming. Good examples are https://stackoverflow.com/a/37008966/2490497
#3052 (comment)
#2655 (comment)
some ready made solutions you can found in data.cube package where it is very common operation:
https://gitlab.com/jangorecki/data.cube/blob/master/R/data.table.R#L91-110

@MichaelChirico
Copy link
Member

Large overlap as well with #935, including #935 (comment)

@jangorecki
Copy link
Member

jangorecki commented May 21, 2019

@renkun-ken is there any API you would like to propose?
I don't think we can do significantly better than what is a proper way to achieve it now

as.call(c(as.name("list"), lapply(paste0("y",1:5), as.name)))

the good thing is that, above API relies only on base R, user does not need to learn any new non-base R function or package.

@franknarf1
Copy link
Contributor

franknarf1 commented May 21, 2019

One idea:

update(x, mx, on, i = NULL, cols)

# translates to...
# if cols is a character vector
    # and it has names
    x[i, names(cols) := mx[.SD, on=on, mget(sprintf("x.%s", cols))]]

    # and it doesn't have names
    x[i, (cols) := mx[.SD, on=on, mget(sprintf("x.%s", cols))]]

# if cols instead is an expression (expected to yield one value per row of x[i])
    # eg, update(x, mx, on=.(id), cols = .(s = sum(x.v))

    x[, "s" := mx[.SD, on=.(id), sum(x.v), by=.EACHI][, !"id"]]

I often do both of these. Avoiding mget would help save keystrokes and avoid finicky edge cases, I guess. Related: #935 (comment) already mentioned above. The summarization/expression syntax would help so that I don't need to type the on= column "id" twice, solving my main use-case for requesting #2061

@jangorecki
Copy link
Member

jangorecki commented May 21, 2019

If we want to have update wrapper then maybe better push down a little bit to use bmerge and set?

@jangorecki jangorecki added the programming parameterizing queries: get, mget, eval, env label Apr 5, 2020
@renkun-ken
Copy link
Member Author

I think #4304 has already addressed this feature request given that env= will turn a list into a call.

d1[d2, paste0("z", 1:5) := list(y1, y2, y3, y4, y5), on = "id"]

could be nicely programmed into

d1[d2, paste0("z", 1:5) := Y, on = "id", env = list(Y = as.list(paste0("y", 1:5)))]

@jangorecki
Copy link
Member

Yes, definitely. I would just add "i." into paste to have more precisely defined columns. Let's close this issue with just a unit test then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
programming parameterizing queries: get, mget, eval, env tests
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants