Empty data.table produced with .SD when grouping by all columns #3262

Open
st-pasha opened this issue Jan 9, 2019 · 5 comments

Comments

@st-pasha
Contributor

st-pasha commented Jan 9, 2019

> DT = data.table(A=c(1,2,1,2,1,2), B=c(1,2,1,1,2,2))
> DT[, .SD, by=.(A)]  # As expected
   A B
1: 1 1
2: 1 1
3: 1 2
4: 2 2
5: 2 1
6: 2 2
> DT[, .SD, by=.(A, B)]  # not expected
Empty data.table (0 rows) of 2 cols: A,B

Likewise,

> DT[, .SD, by=.(A+B)]  # not expected
Empty data.table (0 rows) of 1 col: A
@st-pasha st-pasha added the bug label Jan 9, 2019
@Henrik-P

Henrik-P commented Jan 9, 2019

If I understand this correctly, it may seem consistent with the help text:

.SD is a data.table containing the Subset of x's Data for each group, excluding any columns used in by

...with the part "excluding any columns used in by" being critical here. Thus, I don't think it is grouping by multiple columns per se that causes the empty data set, but grouping by all columns:

d <- data.table(x = 1, y = 2, z = 3, w = 4)

d[ , names(.SD), by = .(x)]$V1
# [1] "y" "z" "w"

d[ , names(.SD), by = .(x, y)]$V1
# [1] "z" "w"

d[ , names(.SD), by = .(x, y, z)]$V1
# [1] "w"

d[ , names(.SD), by = .(x, y, z, w)]$V1
# character(0)

Possibly related issue: Columns appearing in the function in by= disappear in j

@st-pasha
Contributor Author

st-pasha commented Jan 9, 2019

@Henrik-P You're right that this is closely related to #1427.
And you're right that .SD becomes empty when all columns are used up in the groupby.
Still, I feel the final result is incorrect: the columns used in by are supposed to be implicitly added to the front of the j result even when that j is an empty data.table.
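
A possible workaround sketch, not proposed by anyone in this thread: since .SD has no columns left once every column appears in by, one can ask for each group's row numbers via .I and use them to index the original table. The column name `row` and the variables `idx`, `res`, and `counts` are illustrative only.

```r
library(data.table)

DT <- data.table(A = c(1, 2, 1, 2, 1, 2), B = c(1, 2, 1, 1, 2, 2))

# .I holds the row numbers (in DT) of the rows belonging to the current group,
# so j is non-empty even though every column is consumed by `by`.
idx <- DT[, .(row = .I), by = .(A, B)]

# Index DT by those row numbers to get all rows back, ordered by group.
res <- DT[idx$row]

# If only the group sizes are needed, .N avoids .SD altogether.
counts <- DT[, .N, by = .(A, B)]
```

This keeps all six rows of DT, arranged by group in order of first appearance, which is roughly the result the by=.(A) case suggests one might expect.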

@mattdowle mattdowle removed the bug label Jan 10, 2019
@mattdowle mattdowle changed the title from "Empty data.table produced with .SD when grouping by multiple columns" to "Empty data.table produced with .SD when grouping by all columns" Jan 10, 2019
@r2evans

r2evans commented Jul 28, 2020

I think there should be a distinction between "excluding any columns used in by" and "0 columns". I think it's perfectly valid to have 0 columns and some rows. This is actually not unique to .SD:

data.frame(a=1:5)[,0]
# data frame with 0 columns and 5 rows
data.table(a=1:5)[,0]
# Null data.table (0 rows and 0 cols)
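
To make the "0 columns and some rows" point concrete, here is a small base-R sketch (no data.table involved; the names `df0` and `df1` are illustrative). It shows that a zero-column data.frame still remembers its row count, so a column can be attached to it afterwards.

```r
# Selecting zero columns from a data.frame keeps the row count.
df0 <- data.frame(a = 1:5)[, 0]
dim(df0)
# [1] 5 0

# Because nrow(df0) is still 5, a length-5 column can be added back.
df1 <- df0
df1$b <- 6:10
dim(df1)
# [1] 5 1
```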

@jangorecki
Member

Whether it is valid to have rows and 0 columns depends on how you internally define the structure of your data.
Databases do not allow rows and 0 columns at the same time.
Having rows but no columns is behaviour expected of a matrix/array rather than of a (db) table, which is what data.frame closely corresponds to. There was already a good discussion about that behaviour.

@r2evans

r2evans commented Jul 28, 2020

I'll look for the previous discussions. I'm not surprised they exist (though I didn't find them).
