# [R-Forge #2696] change data.table by-without-by syntax to require a "by" #371

Closed
opened this Issue Jun 8, 2014 · 0 comments

### arunsrinivasan commented Jun 8, 2014

Submitted by: Eduard Antonyan; Assigned to: Nobody; R-Forge link

This request stems from the following SO thread:

Old timers, and of course the author of the package, are probably very used to this behavior, so the following examples may not seem unusual to them. However, I'll do my best to show the progression of expected results for someone relatively new to the package (I've been using it for about a month now, and love it so far), and how the current syntax breaks expectations and forces a newcomer into extensive investigation to figure out what's going on.

Let's take:

```r
d = data.table(a = c(1,1,1,2,2,3,4), b = c(1,1,3,4,5,6,7), c = 1:7, key = "a")
t = data.table(a = c(1,2), key = "a")
z = data.table(a = 3, key = "a")

# first, the set up - getting to know data.table
# i,j,by syntax and running a few commands
d[6]
#   a b c
#3: 3 6 6
d[6, a]
# [1] 3
d[6, b]
# [1] 6
d[a <= 2]
#   a b c
#1: 1 1 1
#1: 1 1 2
#1: 1 3 3
#2: 2 4 4
#2: 2 5 5
d[a <= 2, sum(c)]
# [1] 15
d[a <= 2, sum(c), by = a]
#   a V1
#1: 1  6
#2: 2  9

# ok, so with the above set-up, let's do some merges and see what the results
# are (together with what I contend the results *should* be with that syntax)
d[z]
#   a b c
#3: 3 6 6
d[z, a]
#   a a
#1: 3 3
# "should" be
# [1] 3
# to get the above result, one "should" type instead
d[z, a, by = a]

d[z, b]
#   a b
#1: 3 6
# "should" be
# [1] 6

d[t]
# prints same output as d[a <= 2]

d[t, sum(c)]
# prints same output as d[a <= 2, sum(c), by = a]
# "should" print same output as d[a <= 2, sum(c)]

d[t, sum(c), by = a]
# complains and prints same output as above ("should" not complain, and should
# silently do the by-without-by, for speed reasons, internally)

d[t, sum(c), by = b]
# no complaints and does exactly what one would expect, i.e. same as
# d[a <= 2, sum(c), by = b]
```

I can see how this may not seem obviously off to someone who has relied on the current behavior for a while, but please believe me when I say that, for someone who's just getting to know the package, the current behavior makes no sense. Yes, it's documented in no fewer than 3 FAQ points (which seems to indicate that this syntax is a stumbling block not just for me), but that doesn't make it any less unintuitive.

The above completely breaks the reading of `d[i,j,by=b]` from

- take `d`, apply `i`, then return `j` by `b`

and instead converts it to

- take `d`, apply `i`; if no `b`, then return `j` by key; else if `b` and `b == key`, complain and return `j` by `b`; else return `j` by `b`.

I hope you can see how the latter interpretation of the syntax is much more complicated and needlessly taxes the user.

Let me be very clear - I love `data.table`, and I love that it's trying to be fast when I merge and do a "by" by the key of the merge, but it really shouldn't be doing that "by" unless I ask for it specifically (and if I do, it should of course do the automagical merge and by at the same time).
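The two readings of `d[i,j,by=b]` contrasted above can be written out as plain decision rules. What follows is only a sketch of that reasoning, not data.table's internals: `grouping_current()` and `grouping_proposed()` are hypothetical helper names, `i_is_join` is an assumed flag meaning "`i` is a keyed data.table", and each helper simply returns the column that `j` would be grouped by (or `NULL` for no grouping).

```r
# Hypothetical helpers (NOT part of data.table) restating the two readings
# of d[i, j, by] as decision rules. Each returns the name of the grouping
# column, or NULL when j is evaluated once over all selected rows.

# Behaviour described in this report: a join in i implies grouping by the key.
grouping_current <- function(i_is_join, key, by = NULL) {
  if (is.null(by)) {
    # No explicit by: a join silently groups by the join key.
    if (i_is_join) key else NULL
  } else {
    if (i_is_join && identical(by, key)) {
      # Stand-in for the complaint mentioned in the report.
      message("by is the same as the join key")
    }
    by
  }
}

# Requested behaviour: group only when by is given explicitly.
grouping_proposed <- function(i_is_join, key, by = NULL) {
  by
}

grouping_current(TRUE, key = "a")             # d[t, sum(c)]: grouped by "a"
grouping_proposed(TRUE, key = "a")            # d[t, sum(c)]: no grouping (NULL)
grouping_proposed(TRUE, key = "a", by = "a")  # d[t, sum(c), by = a]: grouped by "a"
```

Under the proposed rule the result depends on `by` alone, which is the simpler reading argued for in this request; any speed-up from combining the join and the grouping would remain an internal detail.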
