Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lapply on .SD reorder columns when using get() in .SD #4089

dmongin opened this issue Dec 4, 2019 · 8 comments · Fixed by #4212

lapply on .SD reorder columns when using get() in .SD #4089

dmongin opened this issue Dec 4, 2019 · 8 comments · Fixed by #4212


Copy link

@dmongin dmongin commented Dec 4, 2019

I found a strange behavior of data.table. I found a situation where lapply applied on .SD changes the order of the column, messing then when one intend to assign the result.
An example:

Here the normal behavior

plouf <- data.table(x = 1, y = 2, z = 3)
cols <- c("y","x")
plouf[,.SD,.SDcols = cols ,by = z]
plouf[,lapply(.SD,function(x){x}),.SDcols = cols ,by = z]
plouf[,lapply(.SD[x == 1],function(x){x}),.SDcols = cols ,by = z]

All these lines give :

   z y x
1: 3 2 1

which I need for example to reassign to c("y","x"). But if I do:

plouf[,lapply(.SD[get("x") == 1],function(x){x}),.SDcols = c("y","x"),by = z]

   z x y
1: 3 1 2

Here the order of x and y changed without reason, when it should yield the same result as the last "working" example. If then assign the wrong values to c("y","x") if I assign the output of lapply to new vector of columns. It seems that the use of get in the i part of .SD triggers this bug.

Example of the effect of this on assignment:

plouf[, c(cols ) := lapply(.SD[get("x") == 1],function(x){x}),
      .SDcols = cols ,by = z][]
#    x y z
# 1: 2 1 3

# Output of sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

[1] LC_COLLATE=French_Switzerland.1252 LC_CTYPE=French_Switzerland.1252
[3] LC_MONETARY=French_Switzerland.1252 LC_NUMERIC=C
[5] LC_TIME=French_Switzerland.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ggplot2_3.2.1 lmerTest_3.1-0 lme4_1.1-21 Matrix_1.2-17
[5] plyr_1.8.4 lubridate_1.7.4 readxl_1.3.1 data.table_1.12.2
[9] stringr_1.4.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 cellranger_1.1.0 compiler_3.6.1
[4] pillar_1.4.2 nloptr_1.2.1 tools_3.6.1
[7] digest_0.6.21 boot_1.3-22 tibble_2.1.3
[10] gtable_0.3.0 nlme_3.1-140 lattice_0.20-38
[13] pkgconfig_2.0.3 rlang_0.4.0 rstudioapi_0.10
[16] withr_2.1.2 dplyr_0.8.3 grid_3.6.1
[19] tidyselect_0.2.5 glue_1.3.1 R6_2.4.0
[22] minqa_1.2.4 reshape2_1.4.3 purrr_0.3.2
[25] magrittr_1.5 scales_1.0.0 MASS_7.3-51.4
[28] splines_3.6.1 assertthat_0.2.1 colorspace_1.4-1
[31] numDeriv_2016.8-1.1 labeling_0.3 stringi_1.4.3
[34] lazyeval_0.2.2 munsell_0.5.0 crayon_1.3.4

Copy link

@jangorecki jangorecki commented Dec 4, 2019

@dmongin thank you for detailed report. I can reproduce it on recent devel. As a workaround you can use now substitute rather than get

var = "x"
expr = substitute(
  plouf[, c(cols) := lapply(.SD[.var == 1],function(x){x}), .SDcols = cols, by = z][],
#plouf[, `:=`(c(cols), lapply(.SD[x == 1], function(x) {
#    x
#})), .SDcols = cols, by = z][]
#   x y z
#1: 2 1 3

Personally I would use it regularly, not as a workaround, I find R metaprogramming features superior.
Also be aware that some day instead of get(var) we should be able to use ..var, #2816, #3199
R metaprogramming always worked and, I assume, will always work, thanks to the conservative backward compatible R code development.

@jangorecki jangorecki added the bug label Dec 4, 2019
Copy link

@MichaelChirico MichaelChirico commented Dec 4, 2019

Note this is sort of covered by the verbose message:

'(m)get' found in j. ansvars being set to all columns. Use .SDcols or a single j=eval(macro) instead. Both will detect the columns used which is important for efficiency.
Old: y,x 
New: x,y

The issue is the message is a bit misleading/slightly confusing. Actually you've used get in a nested [ call, whereas this behavior is designed for get alongside .SD in the same [ call.

I'm not sure how useful it is to create fully robust tree parsing for edge cases like this that are better suited by metaprogramming as Jan suggested, especially given that the order of columns shouldn't really matter all that much.

Will leave this open for now to keep the discussion going a bit.

Copy link

@jangorecki jangorecki commented Dec 4, 2019

order does matter if you provide columns to assign on LHS of :=, then you end up with incorrectly labelled columns

Copy link

@dmongin dmongin commented Dec 4, 2019

@MichaelChirico as @jangorecki suggested, I discovered the problem in wrongly assigned column in my code (that I assigned on LHS with :=), and it took me a while to traceback the error. There could be cases were the user does not notice it. I find it quite dangerous, so I would vote for something robust.

I did not know about the metaprogramming, and I am not sure to see the advantage compared to get() (apart from avoiding this bug, but I find the code more hard to read and to use). If use of ..var avoid the bug, maybe the use of get() should be depreciated ?

Copy link

@dmongin dmongin commented Dec 4, 2019

Copy link

@jangorecki jangorecki commented Dec 5, 2019

@dmongin best place to get proper information on metaprogramming is official R language manual, there is a dedicated chapter to this feature:
Advantage is that you can construct any query you need to. If you are coding parametrized interface to some API (like data.table) you can construct parametrized queries as you would type the values of parameters by hand, and evaluate those. There are multiple packages that meant to help user to do metaprogramming in R introducing various ways to handle substitution ~ var, !!var, !!!var, {{ var }}}. I haven't use them really because base R metaprogramming works well, it is also the most reliable way, backward compatible, future proof, least prone to bugs.

Copy link

@ColeMiller1 ColeMiller1 commented Jan 15, 2020

Looking into it, changing this:

if (any(c("get", "mget") %chin% av)) {


if (any(c("get", "mget") %chin% av) && !(all(names_x %in% c(ansvars, allbyvars)) && is.null(names_i))) {

Would address this issue. Test 717 would need to be changed but all other tests check out for me when making the change.

The logic is that if all of the names of the data.table are already accounted for in our .SD and by and there are no names in i (meaning it's not a join), we wouldn't reorder the columns. Joins could be accounted for as well but it seems unlikely to have a join, :=, get(), and cols to update.

Copy link

@ColeMiller1 ColeMiller1 commented Jan 16, 2020

I am working on a PR to refactor the code to move the m(get) statement up a little bit, per the code comment:


Lines 991 to 994 in c005296

# TODO remove as (m)get is now folded in above.
# added 'mget' - fix for #994
if (any(c("get", "mget") %chin% av)) {
if (verbose) {

What is the desired behaviour? Should there be a warning to users who use := and (m)get() or should we prevent the column switching from happening when all variables are already accounted for?

@mattdowle mattdowle added this to the 1.12.9 milestone Feb 17, 2020
@jangorecki jangorecki removed this from the 1.12.11 milestone May 26, 2020
@jangorecki jangorecki added this to the 1.12.9 milestone May 26, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
None yet

Successfully merging a pull request may close this issue.

5 participants