set operations for DT containing x and y as column names (#5256)
ben-schwen committed Nov 19, 2021
1 parent 96860f2 commit d8dc315
Showing 3 changed files with 13 additions and 4 deletions.
2 changes: 2 additions & 0 deletions NEWS.md
@@ -453,6 +453,8 @@
 
 48. `DT[, prod(int64Col), by=grp]` produced wrong results for `bit64::integer64` due to incorrect optimization, [#5225](https://github.com/Rdatatable/data.table/issues/5225). Thanks to Benjamin Schwendinger for reporting and fixing.
 
+49. `fintersect(..., all=TRUE)` and `fsetdiff(..., all=TRUE)` could return incorrect results when the inputs had columns named `x` and `y`, [#5255](https://github.com/Rdatatable/data.table/issues/5255). Thanks @Fpadt for the report, and @ben-schwen for the fix.
+
 ## NOTES
 
 1. New feature 29 in v1.12.4 (Oct 2019) introduced zero-copy coercion. Our thinking is that requiring you to get the type right in the case of `0` (type double) vs `0L` (type integer) is too inconvenient for you the user. So such coercions happen in `data.table` automatically without warning. Thanks to zero-copy coercion there is no speed penalty, even when calling `set()` many times in a loop, so there's no speed penalty to warn you about either. However, we believe that assigning a character value such as `"2"` into an integer column is more likely to be a user mistake that you would like to be warned about. The type difference (character vs integer) may be the only clue that you have selected the wrong column, or typed the wrong variable to be assigned to that column. For this reason we view character to numeric-like coercion differently and will warn about it. If it is correct, then the warning is intended to nudge you to wrap the RHS with `as.<type>()` so that it is clear to readers of your code that a coercion from character to that type is intended. For example :
10 changes: 6 additions & 4 deletions R/setops.R
@@ -59,8 +59,9 @@ fintersect = function(x, y, all=FALSE) {
   .set_ops_arg_check(x, y, all, .seqn = TRUE)
   if (!nrow(x) || !nrow(y)) return(x[0L])
   if (all) {
-    x = shallow(x)[, ".seqn" := rowidv(x)]
-    y = shallow(y)[, ".seqn" := rowidv(y)]
+    .seqn_id = NULL # to avoid 'no visible binding for global variable' note from R CMD check
+    x = shallow(x)[, ".seqn" := rowidv(.seqn_id), env=list(.seqn_id=x)]
+    y = shallow(y)[, ".seqn" := rowidv(.seqn_id), env=list(.seqn_id=y)]
     jn.on = c(".seqn",setdiff(names(y),".seqn"))
     # fixes #4716 by preserving order of 1st (uses y[x] join) argument instead of 2nd (uses x[y] join)
     y[x, .SD, .SDcols=setdiff(names(y),".seqn"), nomatch=NULL, on=jn.on]
@@ -75,8 +76,9 @@ fsetdiff = function(x, y, all=FALSE) {
   if (!nrow(x)) return(x)
   if (!nrow(y)) return(if (!all) funique(x) else x)
   if (all) {
-    x = shallow(x)[, ".seqn" := rowidv(x)]
-    y = shallow(y)[, ".seqn" := rowidv(y)]
+    .seqn_id = NULL # to avoid 'no visible binding for global variable' note from R CMD check
+    x = shallow(x)[, ".seqn" := rowidv(.seqn_id), env=list(.seqn_id=x)]
+    y = shallow(y)[, ".seqn" := rowidv(.seqn_id), env=list(.seqn_id=y)]
     jn.on = c(".seqn",setdiff(names(x),".seqn"))
     x[!y, .SD, .SDcols=setdiff(names(x),".seqn"), on=jn.on]
   } else {
5 changes: 5 additions & 0 deletions inst/tests/tests.Rraw
@@ -18377,3 +18377,8 @@ if (test_bit64) {
   test(2226.3, DT[, prod(x,na.rm=TRUE), g], data.table(g=1:3, V1=as.integer64(c(NA,"9223372036854775807",-8L))))
 }
 
+# set ops when DT has column names x and y, #5255
+DT = data.table(x=c(1,2,2,2), y=LETTERS[c(1,2,2,3)])
+test(2227.1, fintersect(DT, DT, all=TRUE), DT)
+test(2227.2, fsetdiff(DT, DT, all=TRUE), DT[0])

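Note on the likely mechanism (a reading of the diff, not something the commit message states): inside `j`, a bare `x` resolves to the column `x` when the table has one, so the old `rowidv(x)` would number rows by that single column instead of by the whole table. The minimal sketch below calls `rowidv()` directly on the data from test 2227 to show how the two results differ. The names `ids_by_table` and `ids_by_col_x` are illustrative only and do not appear in the commit.

```r
library(data.table)

# Same data as test 2227: the table has columns literally named x and y.
DT = data.table(x=c(1,2,2,2), y=LETTERS[c(1,2,2,3)])

# Row ids counted within groups of identical rows (all columns).
# This is the sequence number the all=TRUE set operations need for .seqn.
ids_by_table = rowidv(DT)

# Row ids counted within groups of column x alone.
# This is what rowidv(x) yields if x is looked up as a column, not as the table.
ids_by_col_x = rowidv(DT$x)

print(ids_by_table)   # 1 1 2 1
print(ids_by_col_x)   # 1 1 2 3
```

Because the two vectors differ, a join on a `.seqn` computed from the column alone can pair up the wrong rows. Passing the table through `env=list(.seqn_id=x)`, as the diff does, appears to sidestep the name clash by substituting the table object itself into the expression.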