Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

keyby= may use the wrong index #3498

Closed
shrektan opened this issue Apr 10, 2019 · 3 comments
Closed

keyby= may use the wrong index #3498

shrektan opened this issue Apr 10, 2019 · 3 comments
Labels
bug encoding
Milestone

Comments

@shrektan
Copy link
Member

@shrektan shrektan commented Apr 10, 2019

The keyby may use the wrong index if the keyby column name is the leading part of the index column name. For example, the index is CLASS_L3 while keyby is CLASS.

After a shallow investigation, I suspect the below line is the cause of this bug. I don't understand the reason of using substring() at all. It seems not necessary. @mattdowle Any ideas?

w = which.first(substring(indices(x),1L,nchar(tt)) == tt) # substring to avoid the overhead of grep

I will be happy to file a PR with tests if there's no particular reason to compare to a subset of the indices.

Thanks.

The reproducible example

(Note, although the below example is running under R3.4.4, they are reproducible on R3.5.3 as well.)

See after set up the indices, the keyby expression returns wrong results. In comparison, using by still returns correct results.

library(data.table)
dt <- data.table(
  CLASS_L3 = c("gggg", "iiii", "bbbb", "bbbb", "gggg", 
               "ffff", "bbbb", "Repo", "bbbb", "dddd", "hhhh", 
               "dddd", "gggg", "dddd", 
               "dddd", "hhhh", "dddd", "dddd", 
               "Repo", "bbbb", "dddd", "dddd", "dddd", 
               "dddd", "cccc", "aaaa", "dddd", 
               "cccc", "dddd", "dddd", "dddd", 
               "dddd", "dddd", "bbbb", "dddd", 
               "dddd", "cccc", "dddd", "dddd", 
               "dddd", "dddd", "dddd", "bbbb", 
               "dddd", "cccc", "cccc", "dddd", 
               "bbbb", "cccc", "aaaa", "cccc", "dddd", 
               "cccc", "cccc", "cccc", "aaaa", 
               "dddd", "dddd", "dddd", "dddd", 
               "b1111", "cccc", "dddd", "dddd", 
               "cccc", "cccc", "cccc", "Repo", 
               "bbbb", "bbbb", "dddd", "dddd", "cccc", 
               "dddd", "cccc", "dddd"),
  CLASS = c("aaaa", "dddd", "gggg", "gggg", 
            "aaaa", "eeee", "eeee", "ffff", "gggg", 
            "aaaa", "aaaa", "aaaa", "aaaa", "aaaa", 
            "aaaa", "aaaa", "aaaa", "aaaa", "ffff", 
            "cccc", "bbbb", "bbbb", 
            "aaaa", "aaaa", "aaaa", "dddd", "aaaa", 
            "aaaa", "aaaa", "aaaa", "aaaa", "aaaa", 
            "aaaa", "gggg", "bbbb", 
            "bbbb", "aaaa", "bbbb", 
            "aaaa", "aaaa", "aaaa", "aaaa", "cccc", 
            "aaaa", "aaaa", "aaaa", "aaaa", "eeee", 
            "aaaa", "dddd", "aaaa", "aaaa", "aaaa", 
            "aaaa", "aaaa", "dddd", "bbbb", 
            "aaaa", "aaaa", "aaaa", "eeee", "aaaa", 
            "aaaa", "aaaa", "aaaa", "aaaa", "aaaa", 
            "ffff", "gggg", "cccc", "bbbb", 
            "aaaa", "aaaa", "aaaa", "aaaa", "bbbb"
  )
)
indices(dt)
#> NULL
dt[, .N, keyby = CLASS, verbose = TRUE]
#> Detected that j uses these columns: <none> 
#> Finding groups using forderv ... 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as '.N'
#> GForce optimized j to '.N'
#> Making each group and running j (GForce TRUE) ... 0.000s elapsed (0.000s cpu)
#>    CLASS  N
#> 1:  aaaa 49
#> 2:  bbbb  8
#> 3:  cccc  3
#> 4:  dddd  4
#> 5:  eeee  4
#> 6:  ffff  3
#> 7:  gggg  5
setindex(dt, CLASS_L3)
indices(dt)
#> [1] "CLASS_L3"
dt[, .N, keyby = CLASS, verbose = TRUE]
#> Detected that j uses these columns: <none> 
#> Finding groups using uniqlist on index 'CLASS_L3' ... 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as '.N'
#> GForce optimized j to '.N'
#> Making each group and running j (GForce TRUE) ... 0.000s elapsed (0.000s cpu)
#>     CLASS  N
#>  1:  ffff  3
#>  2:  dddd  3
#>  3:  eeee  1
#>  4:  gggg  2
#>  5:  eeee  1
#>  6:  gggg  1
#>  7:  cccc  1
#>  8:  gggg  1
#>  9:  cccc  1
#> 10:  eeee  1
#> 11:  gggg  1
#> 12:  cccc  1
#> 13:  aaaa 22
#> 14:  bbbb  2
#> 15:  aaaa  8
#> 16:  bbbb  3
#> 17:  aaaa  7
#> 18:  bbbb  1
#> 19:  aaaa  5
#> 20:  bbbb  1
#> 21:  aaaa  2
#> 22:  bbbb  1
#> 23:  eeee  1
#> 24:  aaaa  5
#> 25:  dddd  1
#>     CLASS  N
dt[, .N, by = CLASS, verbose = TRUE]
#> Detected that j uses these columns: <none> 
#> Finding groups using forderv ... 0.000s elapsed (0.000s cpu) 
#> Finding group sizes from the positions (can be avoided to save RAM) ... 0.000s elapsed (0.000s cpu) 
#> Getting back original order ... 0.000s elapsed (0.000s cpu) 
#> lapply optimization is on, j unchanged as '.N'
#> GForce optimized j to '.N'
#> Making each group and running j (GForce TRUE) ... 0.000s elapsed (0.000s cpu)
#>    CLASS  N
#> 1:  aaaa 49
#> 2:  dddd  4
#> 3:  gggg  5
#> 4:  eeee  4
#> 5:  ffff  3
#> 6:  cccc  3
#> 7:  bbbb  8

Created on 2019-04-10 by the reprex package (v0.2.1)

Session info
devtools::session_info()
#> - Session info ----------------------------------------------------------
#>  setting  value                                              
#>  version  R version 3.4.4 (2018-03-15)                       
#>  os       Windows 7 x64 SP 1                                 
#>  system   x86_64, mingw32                                    
#>  ui       RTerm                                              
#>  language (EN)                                               
#>  collate  Chinese (Simplified)_People's Republic of China.936
#>  ctype    Chinese (Simplified)_People's Republic of China.936
#>  tz       Asia/Taipei                                        
#>  date     2019-04-10                                         
#> 
#> - Packages --------------------------------------------------------------
#>  package     * version    date       lib source        
#>  assertthat    0.2.1      2019-03-21 [2] CRAN (R 3.4.4)
#>  backports     1.1.3      2018-12-14 [2] CRAN (R 3.4.4)
#>  callr         3.2.0      2019-03-15 [2] CRAN (R 3.4.4)
#>  cli           1.1.0      2019-03-19 [2] CRAN (R 3.4.4)
#>  crayon        1.3.4      2017-09-16 [2] CRAN (R 3.4.4)
#>  data.table  * 1.12.3     2019-04-10 [1] local         
#>  desc          1.2.0      2018-05-01 [2] CRAN (R 3.4.4)
#>  devtools      2.0.1      2018-10-26 [2] CRAN (R 3.4.4)
#>  digest        0.6.18     2018-10-10 [2] CRAN (R 3.4.4)
#>  evaluate      0.13       2019-02-12 [2] CRAN (R 3.4.4)
#>  fs            1.2.7      2019-03-19 [2] CRAN (R 3.4.4)
#>  glue          1.3.1      2019-03-12 [2] CRAN (R 3.4.4)
#>  highr         0.8        2019-03-20 [2] CRAN (R 3.4.4)
#>  htmltools     0.3.6.9000 2018-05-07 [2] local         
#>  knitr         1.22       2019-03-08 [2] CRAN (R 3.4.4)
#>  magrittr      1.5        2014-11-22 [2] CRAN (R 3.4.4)
#>  memoise       1.1.0      2017-04-21 [2] CRAN (R 3.4.4)
#>  pkgbuild      1.0.3      2019-03-20 [2] CRAN (R 3.4.4)
#>  pkgload       1.0.2      2018-10-29 [2] CRAN (R 3.4.4)
#>  prettyunits   1.0.2      2015-07-13 [2] CRAN (R 3.4.1)
#>  processx      3.3.0      2019-03-10 [2] CRAN (R 3.4.4)
#>  ps            1.3.0      2018-12-21 [2] CRAN (R 3.4.4)
#>  R6            2.4.0      2019-02-14 [2] CRAN (R 3.4.4)
#>  Rcpp          1.0.1      2019-03-17 [2] CRAN (R 3.4.4)
#>  remotes       2.0.2      2018-10-30 [2] CRAN (R 3.4.4)
#>  rlang         0.3.3      2019-03-29 [2] CRAN (R 3.4.4)
#>  rmarkdown     1.12       2019-03-14 [2] CRAN (R 3.4.4)
#>  rprojroot     1.3-2      2018-01-03 [2] CRAN (R 3.4.3)
#>  sessioninfo   1.1.1      2018-11-05 [2] CRAN (R 3.4.4)
#>  stringi       1.4.3      2019-03-12 [2] CRAN (R 3.4.4)
#>  stringr       1.4.0      2019-02-10 [2] CRAN (R 3.4.4)
#>  testthat      2.0.1      2018-10-13 [2] CRAN (R 3.4.4)
#>  usethis       1.4.0      2018-08-14 [2] CRAN (R 3.4.4)
#>  withr         2.1.2      2018-03-15 [2] CRAN (R 3.4.4)
#>  xfun          0.6        2019-04-02 [2] CRAN (R 3.4.4)
#>  yaml          2.2.0      2018-07-25 [2] CRAN (R 3.4.4)
#> 
#> [1] D:/R/R-dev
#> [2] D:/app/R_lib/3.1
#> [3] D:/app/R/library
@shrektan shrektan added the bug label Apr 10, 2019
@mattdowle mattdowle added this to the 1.12.4 milestone Apr 11, 2019
@mattdowle
Copy link
Member

@mattdowle mattdowle commented Apr 11, 2019

Great find.
I tried removing the substring() but test 1942.7 then fails. The substring() is finding if any leading subsets of index column names match the by variables in the same order. The problem comes where one column name is the same as another column but with some extra letters on the end (as in your example).
Working on a fix ...

@mattdowle mattdowle changed the title [Bug Report] Keyby may use the wrong index at certain cases keyby= may use the wrong index Apr 11, 2019
@shrektan
Copy link
Member Author

@shrektan shrektan commented Apr 11, 2019

Thanks. I get the point by looking at the testing code of 1942.7...

@shrektan
Copy link
Member Author

@shrektan shrektan commented Apr 11, 2019

Thanks @mattdowle for the quick fix. It works great. 👏

@shrektan shrektan added the encoding label Sep 9, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug encoding
Projects
None yet
Development

No branches or pull requests

2 participants