Unexpected results when combining `on` with `keyby` or `by` #1943

Henrik-P · 2016-11-30T13:22:13Z

I was working my way through the data.table vignette Secondary indices and auto indexing. To make it easier to compare original data and results, I replaced the non-minimal "flights" data with much smaller datasets throughout.

It worked fine until section 2f "Aggregation using by", where on is combined with keyby. There we find code to "Get the maximum departure delay for each month corresponding to origin = "JFK". Order the result by month":

flights["JFK", max(dep_delay), keyby = month, on = "origin"]

When I tried the equivalent code on two different data sets, two different issues appeared:

1. Label of the `keyby` and `by` variable identical for different levels

d1 <- data.table(x = rep(c("a", "b"), each = 4), y = 1:0, z = c(3, 6, 8, 5, 4, 1, 2, 7))
d1
#    x y z
# 1: a 1 3
# 2: a 0 6
# 3: a 1 8
# 4: a 0 5
# 5: b 1 4 <~~ max z for x = b & y = 1
# 6: b 0 1
# 7: b 1 2
# 8: b 0 7 <~~ max z for x = b & y = 0

Translating the code from the vignette: Get the maximum z for each y corresponding to x = "b". Order the result by y

d1["b", max(z), keyby = y, on = "x"]
#    y V1
# 1: 1  7 <~~ y should be 0 here
# 2: 1  4

The label of the keyby variable y is erroneously 1 also for the 0 level.

Also when using by instead of keyby together with on the labels are wrong:

d1["b", max(z), by = y, on = "x"]
#    y V1
# 1: 1  4 
# 2: 1  7 <~~ y should be 0 here

Just to verify that corresponding code without on works fine:

d1[x == "b", max(z), keyby = y]
#    y V1
# 1: 0  7
# 2: 1  4

d1[x == "b", max(z), by = y]
#    y V1
# 1: 1  4
# 2: 0  7

2. (a) Label of the `keyby` (or `by`) variable and the resulting value do not match. (b) Result not sorted by the `keyby` variable.

d2 <- as.data.table(mtcars[mtcars$cyl %in% c(6, 8), c("am", "vs", "hp")])

Again, equivalent desired outcome: get the maximum hp for each vs corresponding to am = 0. Order the result by vs.

First, just look at the data corresponding to am = 0 to easier spot the desired result:

d2[.(0), on = "am"]

#     am vs  hp
# 1:   0  1 110
# 2:   0  0 175
# 3:   0  1 105
# 4:   0  0 245 <~~ max hp for vs = 0
# 5:   0  1 123 <~~ max hp for vs = 1 
# 6:   0  1 123
# 7:   0  0 180
# 8:   0  0 180
# 9:   0  0 180
# 10:  0  0 205
# 11:  0  0 215
# 12:  0  0 230
# 13:  0  0 150
# 14:  0  0 150
# 15:  0  0 245
# 16:  0  0 175

When combining keyby and on, the result is not sorted and the labels don't match the values:

d2[.(0), max(hp), keyby = vs, on = "am"]
#    vs  V1
# 1:  1 245
# 2:  0 123

When combining by and on, the labels don't match the values:

d2[.(0), max(hp), by = vs, on = "am"]
#    vs  V1
# 1:  0 123
# 2:  1 245

Subsetting without on works fine:

d2[am == 0, max(hp), keyby = vs]
#    vs  V1
# 1:  0 245
# 2:  1 123


d2[am == 0 , max(hp), by = vs]
#    vs  V1
# 1:  1 123
# 2:  0 245

So far I have not been able to discern any particular pattern in the data which generates these results (e.g. any particular order of on and/or keyby variables in the original and/or subset data).

Can you spot any mistakes in my code or is there something strange going on here?

R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
data.table_1.9.8

The text was updated successfully, but these errors were encountered:

mattdowle · 2016-12-01T00:52:26Z

Interesting. The problem in part 1 was in v1.9.6 as well, so it doesn't seem to be a regression.

> d1[.(x="b"),by=y,max(z),on="x"]       # v1.9.6
   y V1
1: 1  4
2: 1  7

Works in v1.9.6 with keyed join rather than on= :

> setkey(d1,x)
> d1["b",keyby=y,max(z)]
   y V1
1: 0  7
2: 1  4
> d1["b",by=y,max(z)]
   y V1
1: 1  4
2: 0  7

And for completeness just to check, part 2 gives Internal error. icols[0] is NA in v1.9.6.

mattdowle added this to the v1.9.10 milestone Nov 30, 2016

Henrik-P changed the title ~~Unexpected results when combining keyby and on~~ Unexpected results when combining on with keyby and by Nov 30, 2016

Henrik-P changed the title ~~Unexpected results when combining on with keyby and by~~ Unexpected results when combining on with keyby or by Nov 30, 2016

mattdowle closed this as completed in 2699c59 Dec 2, 2016

Balter85 mentioned this issue Sep 18, 2017

Aggregate function giving incorrect results using indexing #2309

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unexpected results when combining `on` with `keyby` or `by` #1943

Unexpected results when combining `on` with `keyby` or `by` #1943

Henrik-P commented Nov 30, 2016 •

edited

mattdowle commented Dec 1, 2016 •

edited

Unexpected results when combining on with keyby or by #1943

Unexpected results when combining on with keyby or by #1943

Comments

Henrik-P commented Nov 30, 2016 • edited

1. Label of the keyby and by variable identical for different levels

2. (a) Label of the keyby (or by) variable and the resulting value do not match. (b) Result not sorted by the keyby variable.

mattdowle commented Dec 1, 2016 • edited

Unexpected results when combining `on` with `keyby` or `by` #1943

Unexpected results when combining `on` with `keyby` or `by` #1943

Henrik-P commented Nov 30, 2016 •

edited

1. Label of the `keyby` and `by` variable identical for different levels

2. (a) Label of the `keyby` (or `by`) variable and the resulting value do not match. (b) Result not sorted by the `keyby` variable.

mattdowle commented Dec 1, 2016 •

edited