Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug: split.data.table can not split on a factor column if it is named 'x' #3151

Closed
tdeenes opened this issue Nov 18, 2018 · 3 comments · Fixed by #3153
Closed

Bug: split.data.table can not split on a factor column if it is named 'x' #3151

tdeenes opened this issue Nov 18, 2018 · 3 comments · Fixed by #3153
Milestone

Comments

@tdeenes
Copy link
Member

@tdeenes tdeenes commented Nov 18, 2018

Just found the following bug:

library(data.table)
temp1 <- data.table(x = factor("a"))
split(temp1, by = "x")
# Error: is.data.table(x) is not TRUE

# this works
temp2 <- data.table(.x = factor("a"))
split(temp2, by = ".x")

# this works as well
temp3 <- data.table(x = "a")
split(temp3, by = "x")

So the problem only emerges if the splitting variable is called 'x' and it is a factor. The problem can be rooted back to this temp = eval(dtq) call in split.data.table. In the dtq unevaluated call, make.levels(x, cols=.cols, sorted=.sorted) finds the 'x' variable instead of the 'x' data.table. A quick fix is to write make.levels(.___x, cols=.cols, sorted=.sorted) instead, and do a temporary assignment .___x <- x before temp = eval(dtq).

SessionInfo (the bug is also present in 1.11.8):

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.2

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.11.9 switchr_0.13.0   

loaded via a namespace (and not attached):
[1] compiler_3.5.1  tools_3.5.1     RCurl_1.95-4.11 remotes_2.0.2  
[5] bitops_1.0-6
@tdeenes
Copy link
Member Author

@tdeenes tdeenes commented Nov 18, 2018

The previous report was not totally correct. 'x' must not be the splitting variable:

temp5 <- data.table(y = factor("a"), x = 1)
split(temp5, by = "y")
# Error: is.data.table(x) is not TRUE

However, I think the diagnosis given above is still correct, and the proposed fix eliminates the bug. I can submit a PR but maybe you guys have a less hackish solution for this.

@jangorecki
Copy link
Member

@jangorecki jangorecki commented Nov 19, 2018

Thanks for the report, and thanks for finding solution. PR is very welcome.
Note that split.data.table has verbose argument which I specifically added so the calls constructed and run with eval can be easily investigated.

library(data.table)
temp5 <- data.table(y = factor("a"), x = 1)
split(temp5, by = "y", verbose=TRUE) 
#Processing split.data.table with: x[i = make.levels(x, cols = "y", sorted = FALSE), j = list(.ll.tech.split = list(if (.N == 0L) .SD[0L] else .SD)), by = .EACHI, .SDcols = c("y", "x"), on = "y"]
#Error: is.data.table(x) is not TRUE

@jangorecki jangorecki added this to the 1.12.0 milestone Nov 19, 2018
@tdeenes
Copy link
Member Author

@tdeenes tdeenes commented Nov 19, 2018

OK, and thanks for the pointer to verbose=TRUE; I somehow missed it. In the meantime I found a much cleaner solution which I will submit as a PR soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants