A small note on the order of cube result #3179

Henrik-P · 2018-12-02T14:56:32Z

In the Description in ?cube, we find that [cube]

Reflects SQLs GROUPING SETS

Then, the Value section is rather vague:

A data.table with various aggregates

...but References are provided to PostgreSQL 7.2.4. GROUPING SETS, CUBE, and ROLLUP.

There we find that:

CUBE ( a, b, c )

is equivalent to

GROUPING SETS (
    ( a, b, c ),
    ( a, b    ),
    ( a,    c ),
    ( a       ),
    (    b, c ),
    (    b    ),
    (       c ),
    (         )
)

It seems like the first grouping variable varies slowest, and the last variable varies fastest. Disclaimer: I don't have access to PostgreSQL, so I can't confirm that this actually reflects the final output ;)

In cube, it is the other way around: the last grouping variable in by varies slowest, and the first fastest, like:

GROUPING SETS (
    ( a, b, c ),
    (    b, c ),
    ( a,    c ),
    (       c ),
    ( a, b    ),
    (    b    ),
    ( a       ),
    (         )
)

I don't claim that PostgreSQL is "right", but given that no explicit Value section is provided in ?cube, and that the help text instead refers to PostgreSQL docs, the order of cube output may be considered inconsistent (but again, note my disclaimer above). In addition, I find the PostgreSQL ordering more intuitive, a matter of taste perhaps.

An example to illustrate the order of cube:

set.seed(1)
d <- data.table(a = rep(1:2, each = 8),
                b = rep(1:2, each = 4),
                c = rep(1:2, each = 2),
                val = sample(0:1, 16, replace = TRUE))

all.equal(
  cube(d, j= sum(val), by = c("a", "b", "c")),
  groupingsets(d, j = sum(val), by = c("a", "b", "c"),
               sets = list(c("a", "b", "c"),
                           c(     "b", "c"),
                           c("a",      "c"),
                           c(          "c"),
                           c("a", "b"     ),
                           c(     "b"     ),
                           c("a"          ),
                           character()))
)
# [1] TRUE

Update

When I added an id, it's obvious that the counter is in fact based on the PostgreSQL order, which in the current output order becomes non-consecutive. Somewhat odd. It seems to me that the output rather could have the PostgreSQL order right away.

cube(d, j= sum(val), by = c("a", "b", "c"), id = TRUE)
    grouping  a  b  c V1
 1:        0  1  1  1  0 #  ( a, b, c )
 2:        0  1  1  2  2
 3:        0  1  2  1  1
 4:        0  1  2  2  2
 5:        0  2  1  1  1
 6:        0  2  1  2  0
 7:        0  2  2  1  1
 8:        0  2  2  2  1
 9:        4 NA  1  1  1 # (    b, c )
10:        4 NA  1  2  2
11:        4 NA  2  1  2
12:        4 NA  2  2  3
13:        2  1 NA  1  1 # ( a,    c )
14:        2  1 NA  2  4
15:        2  2 NA  1  2
16:        2  2 NA  2  1
17:        6 NA NA  1  3 # (       c )
18:        6 NA NA  2  5
19:        1  1  1 NA  2 # ( a, b    )
20:        1  1  2 NA  3
21:        1  2  1 NA  1
22:        1  2  2 NA  2
23:        5 NA  1 NA  3 # (    b    )
24:        5 NA  2 NA  5
25:        3  1 NA NA  5 # ( a       )
26:        3  2 NA NA  3
27:        7 NA NA NA  8 # (         )
 grouping  a  b  c V1

The text was updated successfully, but these errors were encountered:

jangorecki · 2018-12-03T04:33:48Z

Thanks for investigating that. BTW. there is a great webtool www.sqlfiddle.com

Henrik-P · 2018-12-03T23:34:38Z

Thanks for the tip about sqlfiddle Jan - will check it out.

I also want to take the opportunity to thank all data.table developers. I'm so grateful for your incredible work.

jangorecki · 2018-12-04T13:12:34Z

data.table/R/groupingsets.R

Line 32 in e2a95fb

    
           keepBool = sapply(2L^(seq_len(n) - 1L), function(k) rep(c(FALSE, TRUE), each=k, times=(2L^n / (2L*k))))

has to be modified to produce expected order, probably each/times arguments. Eventually we can move to slower (but easier to follow) way to generate list of character strings that is passed to groupingsets.
data for sqlfiddle:

d[, sprintf("insert into t1 values (%s,%s,%s,%s);", a, b, c, val), 1:nrow(d)
  ][, cat(paste(c("create table t1 (a int, b int, c int, val int);", V1), collapse="\n"))
    ] -> nul

SELECT GROUPING(a, b, c), a, b, c, sum(val) FROM t1 GROUP BY cube (a, b, c);

http://sqlfiddle.com/#!17/059d0/5

jangorecki · 2018-12-04T14:30:37Z

I pushed fix, while it fixes the order of sets that cube is using when calling groupingsets, still the end results are not strictly matching order of query in postgres, where grouping is not sorted. This shouldn't be issue as sql is unordered.

jangorecki added a commit that referenced this issue Dec 4, 2018

cube calls groupingsets using expected order of sets, closes #3179

01d22cb

jangorecki mentioned this issue Dec 4, 2018

cube calls groupingsets using expected order of sets #3182

Merged

mattdowle added this to the 1.12.0 milestone Dec 4, 2018

mattdowle closed this as completed in #3182 Dec 5, 2018

jangorecki added a commit that referenced this issue Dec 5, 2018

proper tests for cube, related to #3179

7a2a5a3

jangorecki mentioned this issue Dec 5, 2018

proper tests for cube, related to #3179 #3183

Merged

mattdowle pushed a commit that referenced this issue Dec 5, 2018

proper tests for cube, related to #3179 (#3183)

1f439ec

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A small note on the order of cube result #3179

A small note on the order of cube result #3179

Henrik-P commented Dec 2, 2018 •

edited

jangorecki commented Dec 3, 2018 •

edited

Henrik-P commented Dec 3, 2018

jangorecki commented Dec 4, 2018 •

edited

jangorecki commented Dec 4, 2018

A small note on the order of cube result #3179

A small note on the order of cube result #3179

Comments

Henrik-P commented Dec 2, 2018 • edited

jangorecki commented Dec 3, 2018 • edited

Henrik-P commented Dec 3, 2018

jangorecki commented Dec 4, 2018 • edited

jangorecki commented Dec 4, 2018

Henrik-P commented Dec 2, 2018 •

edited

jangorecki commented Dec 3, 2018 •

edited

jangorecki commented Dec 4, 2018 •

edited