Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

setNames causes error when used with groupby (the by parameter) #4963

Closed
fujiaxiang opened this issue Apr 21, 2021 · 8 comments · Fixed by #5178
Closed

setNames causes error when used with groupby (the by parameter) #4963

fujiaxiang opened this issue Apr 21, 2021 · 8 comments · Fixed by #5178
Milestone

Comments

@fujiaxiang
Copy link

Hi data.table team, this is the first time I'm reporting an issue to this repo. I have tried to search in release docs, other issues, stackoverflow, and have not found anything similar reported yet. If that's not true then I apologise for the trouble. I did not try this on latest dev version.

# [Minimal reproducible example]
In V1.14.0, the following gives an error. However, the exact same code in V1.12.8 works fine with valid output.

> library(data.table)
> dt = data.table(id=c(1:3), grp=c('a', 'a', 'b'), value=c(4:6))
> dt
   id grp value
1:  1   a     4
2:  2   a     5
3:  3   b     6

> dt[, by = grp, .(agg = list(setNames(as.list(value), id)))]
Error in lapply(x, runlock, current_depth = current_depth + 1L) : 
  'names' attribute [2] must be the same length as the vector [1]

Also, not sure if this is relevant but dplyr gives the same output in both versions of data.table, although with slightly different formatting, which I assume is merely a change of how data.table print out the list type in those two versions.

> dt %>% group_by(grp) %>% summarise(agg = list(setNames(as.list(value), id))) %>% data.table
`summarise()` ungrouping output (override with `.groups` argument)
   grp       agg
1:   a <list[2]>
2:   b <list[1]>

# Output of sessionInfo()

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Catalina 10.15.7

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.14.0

loaded via a namespace (and not attached):
[1] compiler_4.0.3 tools_4.0.3
@avimallu
Copy link
Contributor

avimallu commented Apr 21, 2021

This should work:

> dt[, .(agg = list(setNames(value, id))), grp]
      grp    agg
   <char> <list>
1:      a    4,5
2:      b      6

You are getting the error because when you wrap as.list around the vector value, grouped by grp, the number of elements (by grp) of the output becomes 1 while your id has an extra element (i.e. 1 and 2) to name a single list list(4, 5).

It's been a long time since I used dplyr, so I'm not sure how list columns behave there now, however, the str from both indicates that dplyr might be storing list columns differently from data.table:

> dt %>% group_by(grp) %>% summarise(agg = list(setNames(as.list(value), id))) %>% data.table %>% str
Classesdata.tableand 'data.frame':	2 obs. of  2 variables:
 $ grp: chr  "a" "b"
 $ agg:List of 2
  ..$ :List of 2
  .. ..$ 1: int 4
  .. ..$ 2: int 5
  ..$ :List of 1
  .. ..$ 3: int 6
 - attr(*, ".internal.selfref")=<externalptr>

> dt[, .(agg = list(setNames(value, id))), grp] %>% str
Classesdata.tableand 'data.frame':	2 obs. of  2 variables:
 $ grp: chr  "a" "b"
 $ agg:List of 2
  ..$ : Named int  4 5
  .. ..- attr(*, "names")= chr [1:2] "1" "2"
  ..$ : Named int 6
  .. ..- attr(*, "names")= chr "3"
 - attr(*, ".internal.selfref")=<externalptr>

My best guess is that dplyr stores the column as a list of list, while data.table stores it simply as a list. This matches with the output you get from wrapping data.table on the output of the dplyr chain, where the printing does not match what you get with data.table.

@fujiaxiang
Copy link
Author

fujiaxiang commented Apr 21, 2021

hi @avimallu thanks for your reply.

You are right that dt[, .(agg = list(setNames(value, id))), grp] does work. However, this is not what I want. In this case dt$agg[[1]] would be of class integer rather than list.

I don't think as.list makes the list length equal to 1, as in my original data, the error message can look like this:

'names' attribute [58] must be the same length as the vector [40]

where 58 is the size of some groups and 40 is the size of some other group.

As an example, try the following.

> dt = data.table(id=c(1:6), grp=c('a', 'a', 'b', 'a', 'b', 'c'), value=c(4:9))
> dt[, by = grp, .(agg = list(setNames(as.list(value), id)))]
Error in lapply(x, runlock, current_depth = current_depth + 1L) : 
  'names' attribute [3] must be the same length as the vector [2]

@avimallu
Copy link
Contributor

avimallu commented Apr 21, 2021

In this case dt$agg[[1]] would be of class integer rather than list.

Could you let me know why you want it as a list rather than integer? I'm not quite able to see what purpose that would serve. Does this do what you seek?

> dt[, by = grp, .(agg = list(list(setNames(value, id))))]
   grp       agg
1:   a <list[1]>
2:   b <list[1]>
> str(dt[, by = grp, .(agg = list(list(setNames(value, id))))])
Classesdata.tableand 'data.frame':	2 obs. of  2 variables:
 $ grp: chr  "a" "b"
 $ agg:List of 2
  ..$ :List of 1
  .. ..$ : Named int  4 5
  .. .. ..- attr(*, "names")= chr [1:2] "1" "2"
  ..$ :List of 1
  .. ..$ : Named int 6
  .. .. ..- attr(*, "names")= chr "3"
 - attr(*, ".internal.selfref")=<externalptr> 

Reference. I'm looking deeper into your other comment, but don't have an answer right now.

@fujiaxiang
Copy link
Author

No that's still different from my original data structure.

# This gives a list of one list of many values for each row in `agg` column.
> dt[, by = grp, .(agg = list(setNames(as.list(value), id)))]

# This gives a list of one list of one vector of many values for each row in `agg` column.
> dt[, by = grp, .(agg = list(list(setNames(value, id))))]

# This should give the same result as intended but also gives error
> dt[, by = grp, .(agg = list(as.list(setNames(value, id))))]

To be frank I'm doing this just because of legacy reason. Someone else wrote the code making the data structure this way, and I have to be consistent.

However, I think there may be a bigger issue here in the implementation how the groups of data is handled.
A few key questions:

  1. Why is data.table trying to set names of one group's values with another group's keys?
  2. Why does the code work in V1.12.8 but not i V1.14.0?
  3. Why in the below two cases one does work and the other gives error?
> dt = data.table(id=c(1:6), grp=c('a', 'a', 'b', 'a', 'b', 'c'), value=c(4:9))

> dt[, by = grp, .(agg = list(list(setNames(value, id))))]  # this works
   grp       agg
1:   a <list[1]>
2:   b <list[1]>
3:   c <list[1]>

> dt[, by = grp, .(agg = list(as.list(setNames(value, id))))]  # this doesn't work
Error in lapply(x, runlock, current_depth = current_depth + 1L) : 
  'names' attribute [3] must be the same length as the vector [2]

@avimallu
Copy link
Contributor

Is there any difference in the R version between the two versions of data.table you are using?

@avimallu
Copy link
Contributor

avimallu commented Apr 22, 2021

In addition, as.list and list are two different functions, as observed in R's documentation, and outlined below:

> library(data.table)
> dt = data.table(id=c(1:6), grp=c('a', 'a', 'b', 'a', 'b', 'c'), value=c(4:9))
> as.list(setNames(dt$value, dt$id))
$`1`
[1] 4

$`2`
[1] 5

$`3`
[1] 6

$`4`
[1] 7

$`5`
[1] 8

$`6`
[1] 9

> list(setNames(dt$value, dt$id))
[[1]]
1 2 3 4 5 6 
4 5 6 7 8 9 

I realize I still haven't answered all of your questions - this is just to point out that your expectation that dt[, by = grp, .(agg = list(as.list(setNames(value, id))))] should work identically (or work at all) given that dt[, by = grp, .(agg = list(list(setNames(value, id))))] works may not be correct.

@fujiaxiang
Copy link
Author

yes I'm using the same version of R in both cases, and that is R4.0.3.

On the discussion of as.list vs list, yeah I understand they are different functions. I don't expect them to work identically. What puzzles me is, the error message seems to suggest the issue arose at the setNames step, but in the two examples both as.list and list should be evaluated after setNames is evaluated. Then why one works but the other one doesn't. I don't know if this has something to do with the lazy evaluation implementation of data.table.
Anyway these are just some thoughts of mine. I don't really need answers to those questions. I just thought those may help the team discover what's going on.

@tlapak
Copy link
Contributor

tlapak commented Sep 25, 2021

TL,DR: This is a version of an old bug where list columns returned by an aggregation would retain a pointer to the last group. #4655 rewrote how data.table handles an old bug with list columns which would retain a pointer to the last group in an aggregation. It missed this case where the elements of the list get their attributes set to values from another column.

Thanks for reporting, this looks like an actual bug in Cdogroups. Some cursory attempts have failed to reproduce the issue in other settings, but for your call, Cdogroups returns a malformed data.table. Specifically, the names of the list entries in the list column are wrong. If we put a breakpoint right after the call to dogroups and inspect its return value:

.Internal(inspect(ans))
# @0x00000029e6ff6c38 19 VECSXP g1c2 [MARK,REF(11)] (len=2, tl=0)
#   @0x00000029e6ff6bf8 16 STRSXP g1c2 [MARK,REF(3)] (len=2, tl=0)
#     @0x00000029d8cb9f70 09 CHARSXP g1c1 [MARK,REF(60),gp=0x61] [ASCII] [cached] "a"
#     @0x00000029d9088938 09 CHARSXP g1c1 [MARK,REF(65),gp=0x61] [ASCII] [cached] "b"
#   @0x00000029e6ff6bb8 19 VECSXP g1c2 [MARK,REF(5)] (len=2, tl=0)
#     @0x00000029e6ff6c78 19 VECSXP g1c2 [MARK,REF(6),ATT] (len=2, tl=0)
#       @0x00000029e2811908 13 INTSXP g1c1 [MARK,REF(3)] (len=1, tl=0) 4
#       @0x00000029e28118d0 13 INTSXP g1c1 [MARK,REF(3)] (len=1, tl=0) 5
#     ATTRIB:
#       @0x00000029e1273720 02 LISTSXP g1c0 [MARK,REF(1)] 
# 	TAG: @0x00000029d20016a0 01 SYMSXP g1c0 [MARK,REF(65535),LCK,gp=0x4000] "names" (has value)
# 	@0x00000029e1273758 16 STRSXP g1c0 [MARK,REF(65535)]   <expanded string conversion>
# 	  @0x00000029e7000e18 16 STRSXP g1c2 [MARK,REF(1)] (len=2, tl=0)
# 	    @0x00000029e10a4a70 09 CHARSXP g1c1 [MARK,REF(299),gp=0x60] [ASCII] [cached] "3"
# 	    @0x00000029d8effb70 09 CHARSXP g1c1 [MARK,REF(368),gp=0x60] [ASCII] [cached] "2"
#     @0x00000029e2811748 19 VECSXP g1c1 [MARK,REF(8),ATT] (len=1, tl=0)
#       @0x00000029e2811710 13 INTSXP g1c1 [MARK,REF(1)] (len=1, tl=0) 6
#     ATTRIB:
#       @0x00000029e1272df0 02 LISTSXP g1c0 [MARK,REF(1)] 
# 	TAG: @0x00000029d20016a0 01 SYMSXP g1c0 [MARK,REF(65535),LCK,gp=0x4000] "names" (has value)
# 	@0x00000029e1272e28 16 STRSXP g1c0 [MARK,REF(65535)]   <deferred string conversion>
# 	  @0x00000029e280f4e0 13 INTSXP g1c1 [MARK,REF(65535)] (len=2, tl=2) 3,2

Clearly, the names attribute of the first list item should be c("1", "2") instead of c("3", "2"). The names attribute of the second list item should be "3" but is c(3L, 2L) which is the wrong length. This causes the error message when runlock is called on it. The actual values of the list items are correct. Something goes wrong with updating the second argument.

git bisect shows that the bug was introduced in #4655.

The bug does not occur if we drop the as.list call. If I replace it with as.numeric we get the same bug (but no error message!) but now the string conversion gets deferred on both names attributes and I can see, as suspected, that they point to the same memory location.

I've also attempted to replace setNames with various other functions that take two arguments and the bug seems to be specific to assigning an attribute.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants