Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A small note on the order of cube result #3179

Closed
Henrik-P opened this issue Dec 2, 2018 · 4 comments
Closed

A small note on the order of cube result #3179

Henrik-P opened this issue Dec 2, 2018 · 4 comments
Milestone

Comments

@Henrik-P
Copy link

Henrik-P commented Dec 2, 2018

In the Description in ?cube, we find that [cube]

Reflects SQLs GROUPING SETS

Then, the Value section is rather vague:

A data.table with various aggregates

...but References are provided to PostgreSQL 7.2.4. GROUPING SETS, CUBE, and ROLLUP.

There we find that:

CUBE ( a, b, c )

is equivalent to

GROUPING SETS (
    ( a, b, c ),
    ( a, b    ),
    ( a,    c ),
    ( a       ),
    (    b, c ),
    (    b    ),
    (       c ),
    (         )
)

It seems like the first grouping variable varies slowest, and the last variable varies fastest. Disclaimer: I don't have access to PostgreSQL, so I can't confirm that this actually reflects the final output ;)

In cube, it is the other way around: the last grouping variable in by varies slowest, and the first fastest, like:

GROUPING SETS (
    ( a, b, c ),
    (    b, c ),
    ( a,    c ),
    (       c ),
    ( a, b    ),
    (    b    ),
    ( a       ),
    (         )
)

I don't claim that PostgreSQL is "right", but given that no explicit Value section is provided in ?cube, and that the help text instead refers to PostgreSQL docs, the order of cube output may be considered inconsistent (but again, note my disclaimer above). In addition, I find the PostgreSQL ordering more intuitive, a matter of taste perhaps.


An example to illustrate the order of cube:

set.seed(1)
d <- data.table(a = rep(1:2, each = 8),
                b = rep(1:2, each = 4),
                c = rep(1:2, each = 2),
                val = sample(0:1, 16, replace = TRUE))

all.equal(
  cube(d, j= sum(val), by = c("a", "b", "c")),
  groupingsets(d, j = sum(val), by = c("a", "b", "c"),
               sets = list(c("a", "b", "c"),
                           c(     "b", "c"),
                           c("a",      "c"),
                           c(          "c"),
                           c("a", "b"     ),
                           c(     "b"     ),
                           c("a"          ),
                           character()))
)
# [1] TRUE

Update

When I added an id, it's obvious that the counter is in fact based on the PostgreSQL order, which in the current output order becomes non-consecutive. Somewhat odd. It seems to me that the output rather could have the PostgreSQL order right away.

cube(d, j= sum(val), by = c("a", "b", "c"), id = TRUE)
    grouping  a  b  c V1
 1:        0  1  1  1  0 #  ( a, b, c )
 2:        0  1  1  2  2
 3:        0  1  2  1  1
 4:        0  1  2  2  2
 5:        0  2  1  1  1
 6:        0  2  1  2  0
 7:        0  2  2  1  1
 8:        0  2  2  2  1
 9:        4 NA  1  1  1 # (    b, c )
10:        4 NA  1  2  2
11:        4 NA  2  1  2
12:        4 NA  2  2  3
13:        2  1 NA  1  1 # ( a,    c )
14:        2  1 NA  2  4
15:        2  2 NA  1  2
16:        2  2 NA  2  1
17:        6 NA NA  1  3 # (       c )
18:        6 NA NA  2  5
19:        1  1  1 NA  2 # ( a, b    )
20:        1  1  2 NA  3
21:        1  2  1 NA  1
22:        1  2  2 NA  2
23:        5 NA  1 NA  3 # (    b    )
24:        5 NA  2 NA  5
25:        3  1 NA NA  5 # ( a       )
26:        3  2 NA NA  3
27:        7 NA NA NA  8 # (         )
 grouping  a  b  c V1
@jangorecki
Copy link
Member

jangorecki commented Dec 3, 2018

Thanks for investigating that. BTW. there is a great webtool www.sqlfiddle.com

@Henrik-P
Copy link
Author

Henrik-P commented Dec 3, 2018

Thanks for the tip about sqlfiddle Jan - will check it out.

I also want to take the opportunity to thank all data.table developers. I'm so grateful for your incredible work.

@jangorecki
Copy link
Member

jangorecki commented Dec 4, 2018

keepBool = sapply(2L^(seq_len(n) - 1L), function(k) rep(c(FALSE, TRUE), each=k, times=(2L^n / (2L*k))))

has to be modified to produce expected order, probably each/times arguments. Eventually we can move to slower (but easier to follow) way to generate list of character strings that is passed to groupingsets.
data for sqlfiddle:

d[, sprintf("insert into t1 values (%s,%s,%s,%s);", a, b, c, val), 1:nrow(d)
  ][, cat(paste(c("create table t1 (a int, b int, c int, val int);", V1), collapse="\n"))
    ] -> nul
SELECT GROUPING(a, b, c), a, b, c, sum(val) FROM t1 GROUP BY cube (a, b, c);

http://sqlfiddle.com/#!17/059d0/5

@jangorecki
Copy link
Member

I pushed fix, while it fixes the order of sets that cube is using when calling groupingsets, still the end results are not strictly matching order of query in postgres, where grouping is not sorted. This shouldn't be issue as sql is unordered.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants