When setkey is called on a data.table with columns that have identical names, and then those names are updated, the keys appear not to update.
That means that if you want to cross-join the row IDs of a dataset and then update the result of CJ() with additional attributes from the original data, you have to (a) reset the key, (b) call CJ(..., sorted=F), or (c) use base::merge.data.frame() to get the merge to work (see MWE 2).
MWE 1 is a silly example to show the key/name issue. I think that it might be what drives the errors in MWE 2, which is based on the issue I ran into today.
I'm running data.table version 1.13.6.
MWE 1
library(data.table)
jnk <- data.table(x=1:3, x=4:6)
setkey(jnk, x, x)
setnames(jnk, c("y","z"))
all(key(jnk) %in% c("y","z")) # key(jnk) = c("z", "x")... but there is no "x" anymore
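The check in MWE 1 can be generalised into a small guard. This is only a sketch: `stale_key_cols()` and `rekey()` are hypothetical helpers, not part of data.table, and the example table uses unique column names so that it runs the same way whether or not the behaviour above is present. The guard only catches key entries that no longer exist as columns; it would not notice a key whose names are valid but in the wrong order.

```r
library(data.table)

# Hypothetical helper (not part of data.table): returns the key columns
# that no longer exist among the table's column names.
stale_key_cols <- function(dt) setdiff(key(dt), names(dt))

# Hypothetical helper: drop the existing key and re-key on `cols`.
rekey <- function(dt, cols = names(dt)) {
  setkey(dt, NULL)   # remove the current key
  setkeyv(dt, cols)  # set the key again from the current names
  invisible(dt)
}

dt <- data.table(a = 3:1, b = 4:6)
setkey(dt, a)
setnames(dt, c("y", "z"))

# Re-key only when the key refers to a column that is gone.
if (length(stale_key_cols(dt)) > 0L) rekey(dt)
key(dt)
```

Applied to `jnk` from MWE 1, `stale_key_cols(jnk)` would be expected to report the leftover "x" if the key really is stale as described.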
MWE 2
library(data.table)
nobs = 4
dat = data.table(id = 1:nobs, x = runif(nobs))
cj_sort <- with(dat, CJ(id, id, sorted=TRUE))  # no fix applied to this one
cj_srt2 <- with(dat, CJ(id, id, sorted=TRUE))  # works once re-keyed below
cj_unst <- with(dat, CJ(id, id, sorted=FALSE)) # works b/c we set the key below?
# set colnames to be unique
setnames(cj_sort, c("id_1", "id_2"))
setnames(cj_unst, c("id_1", "id_2"))
setnames(cj_srt2, c("id_1", "id_2"))
# fixes the issue
setkey(cj_unst, id_1, id_2) # key unsorted data to fix
setkey(cj_srt2, id_1, id_2) # re-key sorted data to fix
stopifnot(key(cj_sort) == c("id_1", "id_2")) # broken, keys are c("id_2","id")
stopifnot(key(cj_unst) == c("id_1", "id_2")) # ok
stopifnot(key(cj_srt2) == c("id_1", "id_2")) # ok
stopifnot(cj_sort[i = dat, on = .(id_1 = id), .N] == nobs^2) # ok
stopifnot(cj_sort[i = dat, on = .(id_2 = id), .N] == nobs^2) # broken - won't merge to nobs^2 rows
stopifnot(cj_unst[i = dat, on = .(id_1 = id), .N] == nobs^2) # ok
stopifnot(cj_unst[i = dat, on = .(id_2 = id), .N] == nobs^2) # ok - works b/c of setkey?
stopifnot(cj_srt2[i = dat, on = .(id_1 = id), .N] == nobs^2) # ok
stopifnot(cj_srt2[i = dat, on = .(id_2 = id), .N] == nobs^2) # ok - works b/c of setkey() workaround?
# data.table::merge
stopifnot(nrow(merge.data.table(cj_sort, dat, by.x="id_2", by.y="id")) == nobs^2) # broken
stopifnot(nrow(merge.data.table(cj_unst, dat, by.x="id_2", by.y="id")) == nobs^2) # ok
stopifnot(nrow(merge.data.table(cj_srt2, dat, by.x="id_2", by.y="id")) == nobs^2) # ok
# base::merge works, even though data.table::merge doesn't
stopifnot(nrow(merge.data.frame(cj_sort, dat, by.x="id_2", by.y="id")) == nobs^2) # workaround: use base::merge
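One further option, not among the workarounds listed above, is to avoid the duplicate names in the first place by naming the arguments to CJ(). The sketch below assumes that a key set on unique names, with no later setnames() call, is not affected by the problem; `cj_named` is an illustrative name only.

```r
library(data.table)

nobs <- 4
dat <- data.table(id = 1:nobs, x = runif(nobs))

# Name the CJ() arguments so the result never has duplicate column names;
# no setnames() call is needed afterwards.
cj_named <- CJ(id_1 = dat$id, id_2 = dat$id, sorted = TRUE)

key(cj_named)                         # key columns as set by CJ()
cj_named[dat, on = .(id_2 = id), .N]  # row count after joining on the second ID
```

If this behaves as assumed, the join on id_2 should return nobs^2 rows without any re-keying.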
Output of sessionInfo()
> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.13.6
loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2
>