[R-Forge #2696] change data.table by-without-by syntax to require a "by" #371

arunsrinivasan opened this Issue Jun 8, 2014 · 0 comments



arunsrinivasan commented Jun 8, 2014

Submitted by: Eduard Antonyan; Assigned to: Nobody; R-Forge link

This request stems from the following SO thread:

Old-timers, and of course the author of the package, are probably so used to the current behavior that the following examples will not seem unusual to them. I'll still do my best to show the progression of results that someone relatively new to the package would expect (I've been using it for about a month now, and love it so far), and how the current syntax breaks those expectations and forces one to go through an extensive investigation to figure out what's going on.

Let's take:

d = data.table(a = c(1,1,1,2,2,3,4), b = c(1,1,3,4,5,6,7), c = 1:7, key = "a")
t = data.table(a = c(1,2), key = "a")
z = data.table(a = 3, key = "a")

# first, the set up - getting to know data.table
# i,j,by syntax and running a few commands
d[6]
#    a b c
#1: 3 6 6

d[6, a]
# [1] 3

d[6, b]
# [1] 6

d[a <= 2]
#    a b c
#1: 1 1 1
#2: 1 1 2
#3: 1 3 3
#4: 2 4 4
#5: 2 5 5

d[a <= 2, sum(c)]
# [1] 15

d[a <= 2, sum(c), by = a]
#    a V1
#1: 1  6
#2: 2  9

# ok, so with the above set-up, let's do some merges and see what the results are (together with what I contend the results *should* be with that syntax)

d[z]
#    a b c
#1: 3 6 6

d[z, a]
#    a a
#1: 3 3
# "should" be
# [1] 3
# to get the above result, one "should" type instead d[z, a, by = a]

d[z, b]
#    a b
#1: 3 6
# "should" be
# [1] 6

d[t]
# prints same output as d[a <= 2]

d[t, sum(c)]
# prints same output as d[a <= 2, sum(c), by = a]
# "should" print same output as d[a <= 2, sum(c)]

d[t, sum(c), by = a]
# complains and prints same output as above ("should" not complain, and should silently do the by-without-by, for speed reasons, internally)

d[t, sum(c), by = b]
# no complaints and does exactly what one would expect, i.e. same as d[a <= 2, sum(c), by = b]

I can see how this may not seem obviously off to someone who has been relying on the current behavior for a while, but please believe me when I say that, for someone who is just getting to know the package, the current behavior makes no sense. Yes, it's documented in no fewer than 3 FAQ points (which seems to indicate that this syntax is a stumbling block not just for me), but that doesn't make it any less unintuitive.

The above completely breaks the reading of d[i,j,by=b] as "take d, apply i, then return j by b" and instead turns it into "take d, apply i; if there is no b, return j by key; else if b == key, complain and return j by b; else return j by b". I hope you can see how the latter interpretation of the syntax is much more complicated and needlessly taxes the user.
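To make the two readings concrete, here is a rough sketch in plain R. It only models the rules as described in this report and is not data.table's actual code. The helper names current_grouping and proposed_grouping, and the i_is_join flag, are made up for illustration.

```r
# Hypothetical helpers (not part of data.table). They model which grouping
# gets applied to j, as described in this report.
#   i_is_join: TRUE when i is a keyed data.table (a merge), FALSE otherwise
#   key:       the key column of d
#   by:        the user-supplied by, or NULL when none is given

# Current behavior as described above.
current_grouping <- function(i_is_join, key, by = NULL) {
  if (!i_is_join) {
    # ordinary subset: j is grouped only when by is given
    return(list(group_by = by, complains = FALSE))
  }
  if (is.null(by)) {
    # by-without-by: j is silently grouped by the join key
    return(list(group_by = key, complains = FALSE))
  }
  # explicit by on a merge: a complaint only when by equals the key
  list(group_by = by, complains = identical(by, key))
}

# Requested behavior: group by exactly what the user asked for, nothing else.
proposed_grouping <- function(i_is_join, key, by = NULL) {
  list(group_by = by, complains = FALSE)
}

current_grouping(TRUE, key = "a")             # d[t, sum(c)]: grouped by "a"
current_grouping(TRUE, key = "a", by = "a")   # d[t, sum(c), by = a]: complains
proposed_grouping(TRUE, key = "a")            # d[t, sum(c)]: no grouping
```

The first function needs three branches to explain a single call, while the second is a one-liner, which is the difference in mental load I'm trying to point out.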

Let me be very clear - I love data.table, and I love that it's trying to be fast when I merge and do a "by" by the key of the merge, but it really shouldn't be doing that "by" unless I ask for it specifically (and if I do, it should of course do the automagical merge and by at the same time).
