Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexpected results when combining on with keyby or by #1943

Closed
Henrik-P opened this issue Nov 30, 2016 · 1 comment
Closed

Unexpected results when combining on with keyby or by #1943

Henrik-P opened this issue Nov 30, 2016 · 1 comment
Milestone

Comments

@Henrik-P
Copy link

Henrik-P commented Nov 30, 2016

I was working my way through the data.table vignette Secondary indices and auto indexing. To make it easier to compare original data and results, I replaced the non-minimal "flights" data with much smaller datasets throughout.

It worked fine until section 2f "Aggregation using by", where on is combined with keyby. There we find code to "Get the maximum departure delay for each month corresponding to origin = "JFK". Order the result by month":

flights["JFK", max(dep_delay), keyby = month, on = "origin"]

When I tried the equivalent code on two different data sets, two different issues appeared:

1. Label of the keyby and by variable identical for different levels

d1 <- data.table(x = rep(c("a", "b"), each = 4), y = 1:0, z = c(3, 6, 8, 5, 4, 1, 2, 7))
d1
#    x y z
# 1: a 1 3
# 2: a 0 6
# 3: a 1 8
# 4: a 0 5
# 5: b 1 4 <~~ max z for x = b & y = 1
# 6: b 0 1
# 7: b 1 2
# 8: b 0 7 <~~ max z for x = b & y = 0

Translating the code from the vignette: Get the maximum z for each y corresponding to x = "b". Order the result by y

d1["b", max(z), keyby = y, on = "x"]
#    y V1
# 1: 1  7 <~~ y should be 0 here
# 2: 1  4

The label of the keyby variable y is erroneously 1 also for the 0 level.

Also when using by instead of keyby together with on the labels are wrong:

d1["b", max(z), by = y, on = "x"]
#    y V1
# 1: 1  4 
# 2: 1  7 <~~ y should be 0 here

Just to verify that corresponding code without on works fine:

d1[x == "b", max(z), keyby = y]
#    y V1
# 1: 0  7
# 2: 1  4

d1[x == "b", max(z), by = y]
#    y V1
# 1: 1  4
# 2: 0  7

2. (a) Label of the keyby (or by) variable and the resulting value do not match. (b) Result not sorted by the keyby variable.

d2 <- as.data.table(mtcars[mtcars$cyl %in% c(6, 8), c("am", "vs", "hp")])

Again, equivalent desired outcome: get the maximum hp for each vs corresponding to am = 0. Order the result by vs.

First, just look at the data corresponding to am = 0 to easier spot the desired result:

d2[.(0), on = "am"]

#     am vs  hp
# 1:   0  1 110
# 2:   0  0 175
# 3:   0  1 105
# 4:   0  0 245 <~~ max hp for vs = 0
# 5:   0  1 123 <~~ max hp for vs = 1 
# 6:   0  1 123
# 7:   0  0 180
# 8:   0  0 180
# 9:   0  0 180
# 10:  0  0 205
# 11:  0  0 215
# 12:  0  0 230
# 13:  0  0 150
# 14:  0  0 150
# 15:  0  0 245
# 16:  0  0 175

When combining keyby and on, the result is not sorted and the labels don't match the values:

d2[.(0), max(hp), keyby = vs, on = "am"]
#    vs  V1
# 1:  1 245
# 2:  0 123

When combining by and on, the labels don't match the values:

d2[.(0), max(hp), by = vs, on = "am"]
#    vs  V1
# 1:  0 123
# 2:  1 245

Subsetting without on works fine:

d2[am == 0, max(hp), keyby = vs]
#    vs  V1
# 1:  0 245
# 2:  1 123


d2[am == 0 , max(hp), by = vs]
#    vs  V1
# 1:  1 123
# 2:  0 245

So far I have not been able to discern any particular pattern in the data which generates these results (e.g. any particular order of on and/or keyby variables in the original and/or subset data).

Can you spot any mistakes in my code or is there something strange going on here?


R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1
data.table_1.9.8

@mattdowle mattdowle added this to the v1.9.10 milestone Nov 30, 2016
@Henrik-P Henrik-P changed the title Unexpected results when combining keyby and on Unexpected results when combining on with keyby and by Nov 30, 2016
@Henrik-P Henrik-P changed the title Unexpected results when combining on with keyby and by Unexpected results when combining on with keyby or by Nov 30, 2016
@mattdowle
Copy link
Member

mattdowle commented Dec 1, 2016

Interesting. The problem in part 1 was in v1.9.6 as well, so it doesn't seem to be a regression.

> d1[.(x="b"),by=y,max(z),on="x"]       # v1.9.6
   y V1
1: 1  4
2: 1  7

Works in v1.9.6 with keyed join rather than on= :

> setkey(d1,x)
> d1["b",keyby=y,max(z)]
   y V1
1: 0  7
2: 1  4
> d1["b",by=y,max(z)]
   y V1
1: 1  4
2: 0  7

And for completeness just to check, part 2 gives Internal error. icols[0] is NA in v1.9.6.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants