
keys are wrong/don't update if column names aren't unique #4888

@magerton

Description


When setkey() is called on a data.table whose columns have identical names, and those columns are then renamed with setnames(), the key does not appear to update to the new names.

That means that if you want to do a cross-join of row IDs in a dataset and then update the CJ result with additional attributes from the original data, you have to (a) re-set the key with setkey(), (b) build the cross-join with CJ(..., sorted=FALSE), or (c) use base::merge.data.frame() to get the merge to work (see MWE 2).

MWE 1 is a silly example to show the key/name issue. I think that it might be what drives the errors in MWE 2, which is based on the issue I ran into today.

I'm running data.table version 1.13.6

MWE 1

library(data.table)

jnk <- data.table(x=1:3, x=4:6)
setkey(jnk, x, x)
setnames(jnk, c("y","z"))
all(key(jnk) %in% c("y","z")) # key(jnk) = c("z", "x")... but there is no "x" anymore
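
A quick way to check for this kind of mismatch, as a minimal sketch: `key_is_stale` below is a hypothetical helper written for this report, not part of data.table. It only tests whether the key names a column that no longer exists. The example uses a table with unique column names, where renaming does carry over to the key.

```r
library(data.table)

# Hypothetical helper (not part of data.table): TRUE when the key attribute
# names at least one column that is no longer present in the table.
key_is_stale <- function(dt) {
  k <- key(dt)
  !is.null(k) && !all(k %in% names(dt))
}

# Control case: with unique column names, setnames() updates the key too.
ok <- data.table(a = 3:1, b = 1:3)
setkey(ok, a)
setnames(ok, "a", "c")
stopifnot(identical(key(ok), "c"))
stopifnot(!key_is_stale(ok))
```

Calling `key_is_stale(jnk)` after the steps in MWE 1 should flag the table on 1.13.6, since the key still contains "x" while the columns are "y" and "z".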

MWE 2

library(data.table)

nobs = 4
dat = data.table(id = 1:nobs, x = runif(nobs))

cj_sort <- with(dat, CJ(id, id, sorted=TRUE))  # don't do fixes on this one
cj_srt2 <- with(dat, CJ(id, id, sorted=TRUE))  # works
cj_unst <- with(dat, CJ(id, id, sorted=FALSE)) # works b/c we update keys?

# set colnames to be unique
setnames(cj_sort, c("id_1", "id_2"))
setnames(cj_unst, c("id_1", "id_2"))
setnames(cj_srt2, c("id_1", "id_2"))

# fixes the issue
setkey(cj_unst, id_1, id_2)  # key unsorted data to fix
setkey(cj_srt2, id_1, id_2)   # re-key sorted data to fix

stopifnot(key(cj_sort) == c("id_1", "id_2")) # broken, keys are c("id_2","id")
stopifnot(key(cj_unst) == c("id_1", "id_2")) # ok
stopifnot(key(cj_srt2) == c("id_1", "id_2")) # ok

stopifnot(cj_sort[i = dat, on = .(id_1 = id), .N] == nobs^2)  # ok
stopifnot(cj_sort[i = dat, on = .(id_2 = id), .N] == nobs^2)  # broken - won't merge to nobs^2 rows

stopifnot(cj_unst[i = dat, on = .(id_1 = id), .N] == nobs^2)  # ok
stopifnot(cj_unst[i = dat, on = .(id_2 = id), .N] == nobs^2)  # ok - works b/c of setkey?

stopifnot(cj_srt2[i = dat, on = .(id_1 = id), .N] == nobs^2)  # ok
stopifnot(cj_srt2[i = dat, on = .(id_2 = id), .N] == nobs^2)  # ok - works b/c of setkey() workaround?

# data.table::merge
stopifnot(nrow(merge.data.table(cj_sort, dat, by.x="id_2", by.y="id")) == nobs^2) # broken
stopifnot(nrow(merge.data.table(cj_unst, dat, by.x="id_2", by.y="id")) == nobs^2) # ok
stopifnot(nrow(merge.data.table(cj_srt2, dat, by.x="id_2", by.y="id")) == nobs^2) # ok

# base::merge works, even though data.table::merge doesn't
stopifnot(nrow(merge.data.frame(cj_sort, dat, by.x="id_2", by.y="id")) == nobs^2) # workaround: use base::merge
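
Another possible workaround, as a sketch rather than a confirmed fix: pass named arguments to CJ() so the columns are never duplicated and no setnames() call is needed. The names `cj_named`, `id_1`, and `id_2` are just for illustration, and this does not address the underlying key/rename problem.

```r
library(data.table)

nobs <- 4
dat <- data.table(id = 1:nobs, x = runif(nobs))

# Name the CJ() arguments so the columns (and the key) are unique from the start
cj_named <- CJ(id_1 = dat$id, id_2 = dat$id, sorted = TRUE)

# The key matches the column names, and both joins return nobs^2 rows
stopifnot(identical(key(cj_named), c("id_1", "id_2")))
stopifnot(cj_named[dat, on = .(id_1 = id), .N] == nobs^2)
stopifnot(cj_named[dat, on = .(id_2 = id), .N] == nobs^2)
```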

Output of sessionInfo()

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.13.6

loaded via a namespace (and not attached):
[1] compiler_4.0.2 tools_4.0.2   
> 

 

Labels: by-reference (issues related to by-reference/copying behavior)