/ data.table Public

# "by = colA:colD" produces incorrect result when key = ("colA","colD")#4285

Closed
opened this issue Mar 7, 2020 · 1 comment · Fixed by #4376
Closed

# "by = colA:colD" produces incorrect result when key = ("colA","colD") #4285

opened this issue Mar 7, 2020 · 1 comment · Fixed by #4376
Assignees
Milestone

### cbilot commented Mar 7, 2020

 Here's a simple reproducible example that gets to the point quickly. Let's start with a very simple data.table: `DT <- data.table( col1 = c(1, 1, 1), col2 = c("a", "b", "a"), col3 = c("A", "B", "A"), col4 = c(2, 2, 2) )` `print(DT)` col1 col2 col3 col4 1: 1 a A 2 2: 1 b B 2 3: 1 a A 2 Note that rows 1 & 3 are identical, with a differing row (2) between them. This "interrupting" row is key to the bug that follows. If we run a simple grouping using columns 1 through 4 using two different syntax, we get the same (correct) result: `DT[, .N, by = c("col1", "col2", "col3", "col4")]` col1 col2 col3 col4 N 1: 1 a A 2 2 2: 1 b B 2 1 `DT[, .N, by = col1:col4]` col1 col2 col3 col4 N 1: 1 a A 2 2 2: 1 b B 2 1 Now, let's set a key, using columns 1 & 4, and re-run the above grouping commands: `setkey(DT, col1, col4)` `key(DT)` [1] "col1" "col4" `DT[, .N, by = c("col1", "col2", "col3", "col4")]` col1 col2 col3 col4 N 1: 1 a A 2 2 2: 1 b B 2 1 `DT[, .N, by = col1:col4]` col1 col2 col3 col4 N 1: 1 a A 2 1 2: 1 b B 2 1 3: 1 a A 2 1 Notice that the "by = col1:col4" now produces a different result. Removing the key -- or setting some key other than ("col1", "col4") -- will restore the correct results for both syntax. (Not shown) It's as though the presence of the key ("col1", "col4") induces the "by=col1:col4" syntax to assume that the data.table is already sorted by (col1, col2, col3, col4). And thus, the intervening row (2) causes the grouping to miss later matching row. So far, I have noticed this bug in only one case: when the key is ("colB", "colG") and the same two columns are named as endpoints in the by ":" syntax ("by = colB:colG"). FWIW, today is my first time ever using GitHub, so please forgive if I've missed something. (I joined today so that I could report what I noticed.) I searched the NEWS, the development version, open issues, and stack overflow .. but I found nothing similar. Perhaps I don't know the correct search terms ... or perhaps this is an edge case. As a mitigation for now, I've resorted to using only the "by = c("colA","colB", ..) syntax. The colB:colG syntax is very convenient for ad-hoc analysis, which is a good share of my daily work. `sessionInfo()` R version 3.6.3 (2020-02-29) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.4 LTS Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1 locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=en_US.UTF-8 [6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages: [1] parallel stats graphics grDevices utils datasets methods base other attached packages: readODS_1.6.7 nvimcom_0.9-85 rvisidata_0.0.0.9000 data.table_1.12.8 colorout_1.2-2 loaded via a namespace (and not attached): [1] Rcpp_1.0.3 crayon_1.3.4 cellranger_1.1.0 R6_2.4.1 magrittr_1.5 pillar_1.4.3 stringi_1.4.6 rlang_0.4.5 xml2_1.2.2 [10] vctrs_0.2.3 tools_3.6.3 readr_1.3.1 hms_0.5.3 compiler_3.6.3 pkgconfig_2.0.3 tibble_2.1.3 The text was updated successfully, but these errors were encountered:

### ColeMiller1 commented Mar 7, 2020

Thanks for the report - I can confirm bug. Lines 749, 752, and 753 are the offending lines:

Lines 749 to 753 in b1b1832

 allbyvars = intersect(all.vars(bysub), names_x) orderedirows = .Call(CisOrderedSubset, irows, nrow(x)) # TRUE when irows is NULL (i.e. no i clause). Similar but better than is.sorted(f__) bysameorder = byindex = FALSE if (all(vapply_1b(bysubl, is.name))) { bysameorder = orderedirows && haskey(x) && length(allbyvars) && identical(allbyvars,head(key(x),length(allbyvars)))

PR #4248 will close this.

added this to the 1.12.9 milestone Apr 5, 2020
self-assigned this Apr 12, 2020
mentioned this issue Apr 12, 2020
modified the milestones: 1.12.11, 1.12.9 May 27, 2020