.SD mistakenly includes column being set when get() appears in j #2326

Closed
renkun-ken opened this issue Aug 31, 2017 · 3 comments · Fixed by #2329

@renkun-ken renkun-ken commented Aug 31, 2017

When I use := together with .SD, .SDcols and get(), for example,

dt[, z := ncol(.SD) + get("y"), .SDcols = "x"]

The first call works, but once z exists, the same code produces incorrect results: .SD then contains both x and z, although .SDcols restricts it to x only.

The following code is a minimal example and demonstrates the problem:

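library(data.table)

dt <- data.table(x = seq(1, 10), y = seq(10, 1))
# First call (no get(), z does not exist yet): ncol(.SD) is 1
dt[, z := ncol(.SD) + y, .SDcols = "x"]
dt
# Second call (get() in j, z already exists): ncol(.SD) is 2
dt[, z := ncol(.SD) + get("y"), .SDcols = "x"]
dt

# str() shows that .SD contains z in addition to x
dt[, z := {
  str(.SD)
  ncol(.SD) + get("y")
}, .SDcols = "x"]

# After z is removed, .SD contains only x again
dt[, z := NULL]
dt[, z := {
  str(.SD)
  ncol(.SD) + get("y", inherits = FALSE)
}, .SDcols = "x"]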

The output is:

> library(data.table)
> dt <- data.table(x = seq(1, 10), y = seq(10, 1))
> dt
     x  y
 1:  1 10
 2:  2  9
 3:  3  8
 4:  4  7
 5:  5  6
 6:  6  5
 7:  7  4
 8:  8  3
 9:  9  2
10: 10  1
> dt[, z := ncol(.SD) + y, .SDcols = "x"]
> dt
     x  y  z
 1:  1 10 11
 2:  2  9 10
 3:  3  8  9
 4:  4  7  8
 5:  5  6  7
 6:  6  5  6
 7:  7  4  5
 8:  8  3  4
 9:  9  2  3
10: 10  1  2
> dt[, z := ncol(.SD) + get("y"), .SDcols = "x"]
> dt
     x  y  z
 1:  1 10 12
 2:  2  9 11
 3:  3  8 10
 4:  4  7  9
 5:  5  6  8
 6:  6  5  7
 7:  7  4  6
 8:  8  3  5
 9:  9  2  4
10: 10  1  3
> dt[, z := {
+   str(.SD)
+   ncol(.SD) + get("y")
+ }, .SDcols = "x"]
Classes ‘data.table’ and 'data.frame':	10 obs. of  2 variables:
 $ x: int  1 2 3 4 5 6 7 8 9 10
 $ z: int  12 11 10 9 8 7 6 5 4 3
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, ".data.table.locked")= logi TRUE
> dt[, z := NULL]
> dt[, z := {
+   str(.SD)
+   ncol(.SD) + get("y", inherits = FALSE)
+ }, .SDcols = "x"]
Classes ‘data.table’ and 'data.frame':	10 obs. of  1 variable:
 $ x: int  1 2 3 4 5 6 7 8 9 10
 - attr(*, ".internal.selfref")=<externalptr> 
 - attr(*, ".data.table.locked")= logi TRUE

@renkun-ken renkun-ken commented Sep 1, 2017

I debugged the code and found that the problem seems to occur at https://github.com/Rdatatable/data.table/blob/master/R/data.table.R#L1054 where z and := are included in av and then carried into othervars, so both x and z end up in .SD:

Browse[2]> av
[1] ":="   "z"    "+"    "ncol" ".SD"  "get" 
Browse[2]> allcols
[1] "x" "y" "z"
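
The av value above can be reproduced in base R without data.table. This is only a sketch: it assumes the symbols are collected with all.vars(..., functions = TRUE), which is inferred from the function names (:=, +, ncol, get) appearing in av, and the variable names jsub_symbol, jsub_string, av_symbol and av_string are made up for illustration.

```r
# The j expression from the example, captured unevaluated,
# once with a symbol and once with a string on the lhs of `:=`.
jsub_symbol <- quote(z := ncol(.SD) + get("y"))
jsub_string <- quote("z" := ncol(.SD) + get("y"))

# functions = TRUE also returns function names such as `:=` and `+`.
av_symbol <- all.vars(jsub_symbol, functions = TRUE)
av_string <- all.vars(jsub_string, functions = TRUE)

allcols <- c("x", "y", "z")

print(av_symbol)                      # same names as the av shown above
print(intersect(av_symbol, allcols))  # "z": the lhs symbol matches a column
print(intersect(av_string, allcols))  # character(0): a string is not a symbol
```

With a symbol on the lhs, z is the only column name found in the expression, which is consistent with z ending up in .SD once it exists.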


@MichaelChirico MichaelChirico commented Sep 5, 2017

feel free to file a PR if you know how to fix what you found 👍


@renkun-ken renkun-ken commented Sep 5, 2017

The simplest workaround is to put the column name on the left-hand side in quotes:

dt[, "z" := ncol(.SD) + get("y"), .SDcols = "x"]

The problem is that [.data.table detects all symbols used in j with all.vars(). For an expression of the form symbol := expr, the result includes symbol, := and every symbol in expr. When the left-hand side of := is not a symbol, the problem does not occur, e.g.

# no symbol on lhs of `:=` but a call to create a character vector
dt[, c("z", "w") := list(ncol(.SD), get("y")), .SDcols = "x"]

It looks like we should specially handle := expressions in j when determining the variables to include in .SD.
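
One possible shape of such special handling, as a base R sketch only. j_vars is a hypothetical helper, not data.table's code and not necessarily what the eventual fix does; it assumes the infix form lhs := rhs and ignores the functional form `:=`(col = value).

```r
# Hypothetical helper: collect the names used in j, but for `lhs := rhs`
# look only at rhs so the column being assigned is not picked up.
j_vars <- function(jsub) {
  is_assign <- is.call(jsub) &&
    identical(jsub[[1L]], as.name(":=")) &&
    length(jsub) == 3L
  if (is_assign) jsub <- jsub[[3L]]  # keep only the right-hand side
  all.vars(jsub, functions = TRUE)
}

allcols <- c("x", "y", "z")

# The lhs symbol z no longer matches a column name.
print(intersect(j_vars(quote(z := ncol(.SD) + get("y"))), allcols))

# Columns referenced on the rhs are still detected.
print(intersect(j_vars(quote(z := ncol(.SD) + y)), allcols))
```

A column that is used on the right-hand side, for example z := z + 1, would still be detected, because only the assignment target is skipped.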
