
Inconsistency for sanity, being data.frame like for easy transition #1188

Closed
arunsrinivasan opened this issue Jun 21, 2015 · 32 comments

@arunsrinivasan (Member) commented Jun 21, 2015

(After a brief discussion with Matt)

The behaviour with=FALSE:

require(data.table)
DT = data.table(x=1:5, y=6:10, z=11:15)

DT[, c("y", "z"), with=FALSE]
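To make the contrast concrete, a short sketch of the two behaviours under discussion (the second comment describes the pre-change, pre-1.9.8 semantics, where j is evaluated as an expression unless with=FALSE is given):

```r
library(data.table)
DT <- data.table(x = 1:5, y = 6:10, z = 11:15)

DT[, c("y", "z"), with = FALSE]  # subsets columns y and z
DT[, c("y", "z")]                # old behaviour: j was evaluated as an
                                 # expression, so this returned the character
                                 # vector c("y", "z") itself, not the columns
```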

From talking to colleagues, at meetings, and over email, it seems that restoring the data.frame behaviour for just those cases where j is an integer/character vector would bring more sanity (at the cost of some inconsistency).

The issue is that data.table usage revolves around [ a lot, so users are confronted with this difference quite early, and having to learn new syntax for a known basic operation doesn't sit well. It also makes it harder to explain that a data.table is a data.frame when even this basic operation behaves differently.

AFAICT, there's no real use for having plain character/integer vectors in j. Therefore, it'd be great to make with=FALSE unnecessary and be able to subset columns the data.frame way:

DT[, c("y", "z")]
DT[, 2:3]

The default of returning a vector when a single column is selected, along with the use of drop=FALSE, should also be restored. This would help users get past basic data.frame-like usage quickly, without having to wonder "why", and start learning the genuinely enhanced functionality data.table provides.

It'd be great to hear thoughts from other users as well.

This has come up before (raised by Matt) : http://r.789695.n4.nabble.com/with-FALSE-td4589266.html but 'leave it as it is' was the response more or less.

@DavidArenburg (Member) commented Jun 21, 2015

+1MM

@jrowen commented Jun 21, 2015

I like this idea too. From the earlier responses, it seems the biggest drawback could be introducing inconsistency, as a new user would expect the two approaches below to return the same result.

DT[,c("colA","colB")] 

colvars = c("colA","colB") 
DT[,colvars]

Is there a way they both could return the same result?

@arunsrinivasan (Member, Author) commented Jun 21, 2015

@jrowen thanks. Yes they both should return the same result (as data.frame would).
Will have to think a bit more about this though.

x <- "z"
DT[, x]

It'd be ambiguous in this case, wouldn't it? (Does x mean the column x, or the column name stored in the variable x?)

One way off the top of my head is for the enhanced-ness to kick in j, only when it is wrapped with .() or list(), but perhaps that's too big a design change...

Hm, now I'm thinking if this'd only create more problems instead :-(

@markdanese commented Jun 21, 2015

Perhaps you could make the error message more friendly and help the user. Or even find the cases and add "with = FALSE" and advise the user that the change was made (like with setting column names the "old" way). I have been using data.table for a year and a half, and I periodically want to use column numbers for some quick interactive work and get an error. Not a big deal to type with = FALSE, but a nice reminder would be welcome. This would serve to teach new users as well.

@franknarf1 (Contributor) commented Jun 21, 2015

I don't know. It might just make it harder for people to learn. I agree with Mark that adding a discouraging warning would help with that.

If you allow too much prominence to this way of accessing columns, it may prove something of a slippery slope. Can you really do this without also doing these?

  • allowing numeric vectors (which are truncated to floor(j) in data.frames)
  • making DT[int_or_char] match the data.frame analogue (where it subsets DT like a list)

Aside: If you do this, perhaps you could add some faster accessor for j (in terms of shorter code), analogous to the list subsetting in my last bullet point. I find with=FALSE awkward and verbose and so had been doing workarounds like `[.listof`(DT, int_or_char) (broken in R 3.2.0 onward) and `[.noquote`(DT, int_or_char). A function like this would allow experienced users of the new functionality to sidestep the warning Mark suggested and to write clearer, more readable code (since, on reviewing their code, they wouldn't have to wonder whether they were looking at data.table- or data.frame-style j).

EDIT: I'm trying to explain what I mean over here: http://chat.stackoverflow.com/transcript/message/24012297#24012297

@eantonya (Contributor) commented Jun 22, 2015

I quite like the automatic with=FALSE guessing, but not the drop reinstatement - I don't want to see that terrible option resurrected and muddying the waters of data.table.

@raubreywhite commented Jun 22, 2015

I agree with eduard, drop=TRUE is one of the worst parts of data.frame. I think it makes sense to implement the with=FALSE guessing, as this improves consistency and doesn't materially degrade the quality of data.table, but drop=TRUE would just be implementing a bad idea for the sake of consistency.


@franknarf1 (Contributor) commented Jun 22, 2015

I think it defeats the purpose of the change if you don't use drop=TRUE for these character-or-integer cases. For data.table syntax like DT[, .(mycol)], retaining drop=FALSE is fine; I don't think changing that case would help anything.

@eantonya (Contributor) commented Jun 22, 2015

@franknarf1 I disagree. The drop argument is only relevant for single-column retrievals, so omitting it affects only some of the cases, and the effect on those cases is one of consistency, rather than the strange sometimes-this-sometimes-that behavior of data.frame.

@franknarf1 (Contributor) commented Jun 22, 2015

@eantonya Yeah, I guess we do disagree; sorry if I'm repeating myself, but I'll try to clarify. I'm not crazy about the sometimes-this-sometimes-that behavior of data.frame either, but the premise of this proposed enhancement is that data.frame syntax should be supported to some limited extent.

Within that limited scope (when j (1) does not use any columns of DT and (2) evaluates to character or integer... or something like that), we should give people what they expect. It's not like you or I are going to use it, so what harm? And if we don't give them what they expect, why bother giving them the concession to begin with? They'll still have grounds to complain about inconsistency. (I won't use it because I want to be able to read my code without the mental overhead of figuring out whether data.frame syntax is being used.)

@mattdowle mattdowle added this to the v1.9.8 milestone Jun 22, 2015
@arunsrinivasan (Member, Author) commented Jun 22, 2015

@franknarf1 perhaps I should clarify.

Ideally, what I'd like is for data.frame syntax in j to do everything that data.frame syntax does as shown below:

DT[, 1:2]
DT[, c("x", "y")]

cols = c("x", "y")
DT[, cols]

All of these should return a two-column data.table.

However, as @jrowen pointed out from the old post, the last case is tricky (for cases like the one I've shown in the previous post). Unless this case can be taken care of quite nicely, I personally don't see a huge advantage of implementing this feature. I can imagine myself explaining the behaviour to beginners (or in a talk) with too many ifs-and-buts.. and that's not helping.

So, what would be great is to figure out whether there's a way around the last scenario without breaking too many things. And whether it's worth it.

I don't feel strongly about drop = . being present or not. And IMO that's not the main part of this discussion, at least until it's clear that we are going to implement this functionality.

I'm also fully aware of the case DT[3:4] vs DF[3:4], but this doesn't seem to come up at all as an issue.. (on SO, or r-help or here or data.table-help) AFAICT.

@franknarf1 (Contributor) commented Jun 22, 2015

@arunsrinivasan Yeah, I also don't see a benefit from the feature change. As you say, it seems like it would make explaining the syntax harder and lead to messier code everywhere (as people start using data.frame syntax as a crutch).

Back to my aside (mentioned in your last sentence). Yeah, I've never seen anyone else complain about DT[1:3] vs DF[1:3], but maybe they should! Really, if we had the functionality mentioned in this thread so that DT[.SDcols=1:3] and DT[.SDcols=c("a","b")] worked as my intuition suggests they should, it would be really handy. It's off-topic here, because that change wouldn't be any sort of crutch for people who don't want to learn data.table syntax, though. Not sure if that's already a FR... Oh, just found it: #1149

@eantonya (Contributor) commented Jun 22, 2015

@arunsrinivasan I actually don't see a big problem with some cases not working. I see this as guessing with=FALSE, and it's ok to guess incorrectly sometimes. Maybe a warning message can be printed accompanying the guess, similar to the guesses melt/dcast make.

@franknarf1 I'm not sure what you mean - of course I'd use this feature myself - I use with=FALSE reasonably frequently, and would love to not have to type it.

The framework from which I see this change is that of enhancing data.table usage for everyone, and emphatically not one of trying to mimic what data.frame does. From that viewpoint adding drop back disintegrates usage for advanced users for what I see as a very minor short-term gain and long-term loss for beginners. Whereas the with=FALSE guess is a short- and long-term enhancement for everyone.

@franknarf1 (Contributor) commented Jun 22, 2015

@eantonya My mistake. I'd find the use of the feature in my code very hard to parse (by eye).

As far as the enhancement goes (excluding the mimickry), doesn't Richardo's DT[.SDcols=1:3] pull that off better (linked above, issue 1149)?

@eantonya (Contributor) commented Jun 22, 2015

I don't have anything against that option (and I think that should work regardless of this one going in), but would prefer typing DT[, 1:3] since it's less typing.

As far as how to guess, I would propose the following: if any of the names in j contain a column name or any of the special dot-symbols (.SD, etc.), then don't guess. Otherwise, attempt to evaluate the expression in the outer environment; if that succeeds and returns a character/int/numeric vector, then guess with=FALSE. Otherwise fall back to what we do now.

I think this takes care of the cases above and a few more I can think of right now.

Thinking some more - evaluating something twice is fairly dangerous, so perhaps it's ok to live with the evaluation result no matter what it is (so return columns for character/int/numeric and the actual result otherwise).
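The guessing rule sketched above might look something like this (a hypothetical helper, purely illustrative; guess_with and its signature are not part of data.table):

```r
# Return TRUE to evaluate j data.table-style, FALSE to treat it as a
# column selector (i.e. guess with=FALSE). jsub is the unevaluated j call.
guess_with <- function(DT, jsub, env = parent.frame()) {
  syms <- all.vars(jsub)  # symbols referenced by the j expression
  # any column name or special dot-symbol -> don't guess, keep current behaviour
  if (any(syms %in% c(names(DT), ".SD", ".N", ".I", ".BY", ".GRP")))
    return(TRUE)
  # otherwise try evaluating j in the calling environment
  val <- try(eval(jsub, envir = env), silent = TRUE)
  if (!inherits(val, "try-error") && (is.character(val) || is.numeric(val)))
    return(FALSE)  # looks like column names/numbers: guess with=FALSE
  TRUE
}
```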

@franknarf1 (Contributor) commented Jun 22, 2015

@eantonya I'm not really familiar with parsing R calls, but it sounds like cases like this:

DT <- data.table(a1=1:2, a2=3:4, a3=5:6)
suff = 2
DT[,mean(get(paste0("a",suff)))] # 3.5

suffy = 3
DT[,plot(get(paste0("a",suff)),get(paste0("a",suffy)))] # plots a2 v a3

would no longer work, since j does not find any names...?

If some guesswork way were implemented, maybe it could be made into an option, datatable.guesswith, off by default but recommended for folks strongly tied to data.frame syntax.

@eantonya (Contributor) commented Jun 22, 2015

Ok, let's add get to the list that includes .SD and friends. What other cases would it not work for? Let's see if it's easy to classify the expressions.

@franknarf1 (Contributor) commented Jun 22, 2015

Okay, I'll see if I think of or come across any others. Nothing comes to mind beyond mget (which I can't figure out how to actually use here) and eval, like

str  = paste0("a",suff)
expr = parse(text=str)
DT[,eval(expr)]

@mattdowle (Member) commented Aug 5, 2015

Great comments above. In an attempt to draw it all together, I'm thinking we should make the following changes. If I've read correctly, I think (hope!) this will please everyone and displease nobody.

  1. inspect j before evaluation (as is done anyway). If it's a single number or single string then with=FALSE will be assumed. These will then work:
    DT[,1]
    DT[,"someCol"]
    These don't do anything useful now anyway, so won't break existing code. In both cases, a single column data.table will be returned, consistent with 'with=FALSE' and dropping 'drop'. The possible surprise of getting a single column data.table (unlike data.frame) is unlikely to upset, especially since the column will print nicely (top and bottom 5 rows) rather than a long vector filling up the console.
  2. if j is a single symbol, it'll return that column as a vector, as it has always done. If however that column name is missing, raise a new error (wouldn't do anything useful now anyway so new error won't break existing code).
    DT[, existingCol] # return the column as a vector as before
    DT[, missingCol]
    Error: j (the 2nd argument inside [...]) is a single symbol that isn't a column name. In data.table, j is evaluated within its scope. If missingCol is a variable in calling scope that contains column names or numbers, then add with=FALSE; i.e. DT[, missingCol, with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1 . It allows more advanced usage: see example in ?data.table.
  3. if j contains no symbols (e.g. calls to c(), :, paste(), paste0() and only column numbers or strings), evaluate it and expect the result to be a number or character vector. Then set with=FALSE. These would then work:
    DT[, c(1:10, 50)]
    DT[, c("ColA","ColB")]
    DT[, paste0("V",20:25)]
  4. Otherwise, current behaviour.

We can always go further later depending on how it goes. We'll wait for everyone who's commented so far to confirm before going ahead (and only then after 1.9.6 is (finally) on CRAN!)
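For readers arriving later: the rules above describe roughly how current releases of data.table behave. A sketch (outputs assume a recent version, after this issue was closed):

```r
library(data.table)
DT <- data.table(x = 1:5, y = 6:10, z = 11:15)

# Rule 1: a single number or string in j -> with=FALSE assumed,
# returning a one-column data.table
DT[, 1]
DT[, "x"]

# Rule 2: a single symbol still returns that column as a vector
DT[, x]

# Rule 3: j built only from literals via c(), `:`, paste0() etc.
# -> evaluated, then treated as a column selector
DT[, c("y", "z")]
DT[, 2:3]
```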

@markdanese commented Aug 5, 2015

Seems good to me. As always, thanks to you and Arun for doing the hard work.

@DavidArenburg (Member) commented Aug 5, 2015

Sounds great to me.

@jangorecki (Member) commented Aug 5, 2015

To reduce inconsistency, it might be good to remove the default value for the with argument, or make it NULL/logical(0)/NA, which would correspond to guessing. Explicitly using with=TRUE would then still override the newly proposed behaviour. So the change would be focused on guessing with only when it is not provided.

@mattdowle (Member) commented Aug 5, 2015

@jangorecki Yes nice idea - agree.

@jrowen commented Aug 5, 2015

I too am in favor of the revised proposal.

@ronhylton commented Aug 6, 2015

Here's a slightly different viewpoint. There are places where I'd dearly love to dispatch a DT into some old code expecting a DF and automagically pick up a big improvement in merge() speed (and conceivably for other operations involving grouping). Unfortunately with the non-DF behavior of [ that often won't work, and sometimes I end up basically as.x'ing back and forth between DT and DF in order to keep old code happy.

One clean solution to this is setcompatibility(c("on", "off", "?")).

off provides "native DT" behavior for those who want to fully exploit DT capabilities.

on provides "native DF" behavior unless the operation clearly doesn't make sense for DF. E.g. I don't think you can have DF[DF,] so something like this would clearly be invoking a DT-style join.

Conceivably there could be other compatibility levels, e.g. almost-DF without drop.

Since this would be a setxxx by reference it also hopefully has very little performance cost.

@jangorecki (Member) commented Aug 6, 2015

@ronhylton your comment has a much wider scope than the topic discussed here. It might be better to isolate it as a new FR. The detection of j compatibility discussed here could be managed, for example, by implementing the with default value as getOption("datatable.with"), etc.

@mattdowle (Member) commented Aug 6, 2015

@ronhylton Agree with Jan - best raise a new issue. One option is to place your old code in a package, then it would automatically divert to base syntax when passed a data.table.

@ptoche commented Feb 17, 2016

I have reached this thread 3 times over the last year or so, which is when I started using data.table. Every time it was because I had forgotten about the with = FALSE option. Every time I've read this thread, I've thought "Ah yep, true, must remember that," but somehow I don't.

data.table is a fantastic package. My 2 cents on the topic of this thread: I do not care much about compatibility/consistency with data.frame. If it's there, great (1 stone, 2 birds), but consistency shouldn't be there just for the sake of it, it should be there if the feature is desirable. And the bottom line for me is that dt[i,j] is a very, very intuitive way to access data, it's pretty much standard notation that's been around for centuries (or at least one century). Intuitive and natural, that's what I think matters.

One of the top search hits for "r data.table subset by column and row" is this page: http://personal.colby.edu/personal/m/mgimond/RIntro/04_Manipulating_data_tables.html, which states "For example, to access one dat cell value at row 23 and column 4, type dat[23, 4]" because that's what everyone expects.

@arunsrinivasan arunsrinivasan added this to the v2.0.0 milestone Mar 8, 2016
@arunsrinivasan arunsrinivasan removed this from the v1.9.8 milestone Mar 8, 2016
@mattdowle mattdowle added this to the v1.9.8 milestone Apr 22, 2016
@mattdowle mattdowle removed this from the v2.0.0 milestone Apr 22, 2016
@arunsrinivasan arunsrinivasan removed this from the v1.9.8 milestone May 13, 2016
@mattdowle mattdowle added this to the v1.9.8 milestone Sep 29, 2016
@mattdowle mattdowle closed this in f78d790 Sep 30, 2016
@JoshOBrien (Contributor) commented Oct 18, 2016

This is such a nice fix to what has been a real stumbling block for data.table users. Superb! Leaving a note here as I referenced this in comments following a related SO answer that should itself eventually be edited to reflect the change.

@priyak1917 commented Aug 30, 2017

Error in [.data.table(mba, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[,..i] or DT[,i,with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1.

Such an annoying error; it's not letting me knit the file. I'm very new to coding, and this is making it even more daunting.
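For reference, the two spellings the error message above suggests (the .. prefix assumes a data.table version that supports it, 1.10.2 or later):

```r
library(data.table)
DT <- data.table(a = 1:3, b = 4:6)
i <- "a"

DT[, ..i]              # the .. prefix tells data.table to look i up
                       # in the calling scope, not among the columns
DT[, i, with = FALSE]  # equivalent, older spelling
```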

@skanskan commented Jan 11, 2019

Hello.

I've noticed that some of my old code doesn't work anymore because of this change.

For example this code was scaling the values contained in the columns defined by mycols.

mycols <- c("A", "B", "C")
myDT[,scale(mycols)]
or
myDT[paste0("z",mycols) := scale(mycols) ]

But now it doesn't work.
And the following line doesn't work either.
myDT[,scale(..mycols)]
adding with=FALSE doesn't solve the problem.

I need to do something like this:
myDT[,scale(.SD), .SDcols=esca ]
myDT[,lapply(.SD[,..esca], scale) ]
myDT[, paste0("z",names(.SD)) := scale(.SD) , .SDcols=mycols]

Is there a better way?
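One common pattern for the scaling task described above, as a sketch (column names as in the comment; note scale() returns a one-column matrix, hence the as.vector()):

```r
library(data.table)
myDT <- data.table(A = 1:5, B = 6:10, C = 11:15)
mycols <- c("A", "B", "C")

# add z-prefixed scaled copies of the selected columns, by reference
myDT[, paste0("z", mycols) := lapply(.SD, function(x) as.vector(scale(x))),
     .SDcols = mycols]
```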

@mattdowle (Member) commented Jan 11, 2019

Hi @skanskan,
Please start a new issue and link to this issue. It's very hard to track comments in closed issues. Please include the data to create myDT and the output. I've tried to follow what you've written but I can't without the data, the input and output shown: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
Thanks, Matt
