Inconsistency for sanity, being data.frame like for easy transition #1188

Closed
arunsrinivasan opened this Issue Jun 21, 2015 · 30 comments


arunsrinivasan commented Jun 21, 2015

(After a brief discussion with Matt)

The behaviour of with=FALSE:

require(data.table)
DT = data.table(x=1:5, y=6:10, z=11:15)

DT[, c("y", "z"), with=FALSE]

In talking to colleagues, at meetings, and over email, it seems that restoring the data.frame behaviour just for those cases where j is an integer/character vector would bring more sanity (at the cost of some inconsistency).

The issue is that data.table usage revolves around [ a lot, so users are confronted with this difference quite early, and having to learn new syntax for a known basic operation doesn't sit well. It also doesn't help when explaining how a data.table is a data.frame while this basic operation differs.

AFAICT, there's no real use for bare character/integer vectors in j. Therefore, it'd be great for with=FALSE to be unnecessary, so that columns can be subset the data.frame way:

DT[, c("y", "z")]
DT[, 2:3]

The default of returning a vector when only one column is selected, along with the drop=FALSE argument, should also be restored. This'll help users get past the basic data.frame-like usage very quickly without having to wonder "why", and start learning the essential enhancements data.table actually provides.
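For reference, a quick sketch of the data.frame behaviour being referred to (base R, using the same columns as the example above):

```r
DF <- data.frame(x = 1:5, y = 6:10, z = 11:15)

DF[, c("y", "z")]        # two-column data.frame, no extra argument needed
DF[, 2:3]                # the same selection, by position
DF[, "y"]                # a single column drops to a plain vector by default
DF[, "y", drop = FALSE]  # drop=FALSE keeps it as a one-column data.frame
```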

It'd be great to hear thoughts from other users as well.

This has come up before (raised by Matt): http://r.789695.n4.nabble.com/with-FALSE-td4589266.html but the response was more or less 'leave it as it is'.


DavidArenburg commented Jun 21, 2015

+1MM


jrowen commented Jun 21, 2015

I like this idea too. From the earlier responses, it seems the biggest drawback could be introducing inconsistency, as a new user would expect the two approaches below to return the same result.

DT[,c("colA","colB")] 

colvars = c("colA","colB") 
DT[,colvars]

Is there a way they both could return the same result?



arunsrinivasan commented Jun 21, 2015

@jrowen thanks. Yes they both should return the same result (as data.frame would).
Will have to think a bit more about this though.

x <- "z"
DT[, x]

It'd be ambiguous in this case, wouldn't it?

One way off the top of my head is for the enhanced behaviour to kick in for j only when it is wrapped with .() or list(), but perhaps that's too big a design change...

Hm, now I'm thinking if this'd only create more problems instead :-(
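To spell the ambiguity out (a sketch using the example table from the top of the thread, where x is both a column name and a variable in the calling scope):

```r
library(data.table)
DT <- data.table(x = 1:5, y = 6:10, z = 11:15)
DF <- as.data.frame(DT)

x <- "z"
DF[, x]  # data.frame: x is looked up in the calling scope, so column z is returned
DT[, x]  # data.table: x matches a column name first, so column x (1:5) is returned
```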



markdanese commented Jun 21, 2015

Perhaps you could make the error message more friendly and help the user. Or even find the cases and add "with = FALSE" and advise the user that the change was made (like with setting column names the "old" way). I have been using data.table for a year and a half, and I periodically want to use column numbers for some quick interactive work and get an error. Not a big deal to type with = FALSE, but a nice reminder would be welcome. This would serve to teach new users as well.



franknarf1 commented Jun 21, 2015

I don't know. It might just make it harder for people to learn. I agree with Mark that adding a discouraging warning would help with that.

If you allow too much prominence to this way of accessing columns, it may prove something of a slippery slope. Can you really do this without also doing these?

  • allowing numeric vectors (which are truncated to floor(j) in data.frames)
  • making DT[int_or_char] match the data.frame analogue (where it subsets DT like a list)

Aside: If you do this, perhaps you could add some faster accessor for j (in terms of shorter code), analogous to the list subsetting in my last bullet point. I find with=FALSE awkward and verbose and so had been doing workarounds like `[.listof`(DT, int_or_char) (broken in R 3.2.0 onward) and `[.noquote`(DT, int_or_char). A function like this would allow experienced users of the new functionality to sidestep the warning Mark suggested and to write clearer, more readable code (since, on reviewing their code, they wouldn't have to wonder whether they were looking at data.table- or data.frame-style j).

EDIT: I'm trying to explain what I mean over here: http://chat.stackoverflow.com/transcript/message/24012297#24012297
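A minimal sketch of such an accessor (the name cols is hypothetical, not an existing data.table function):

```r
library(data.table)

# Hypothetical shorthand: data.frame-style column selection without
# having to type with=FALSE at each call site.
cols <- function(DT, j) DT[, j, with = FALSE]

DT <- data.table(x = 1:5, y = 6:10, z = 11:15)
cols(DT, 2:3)          # columns y and z, selected by position
cols(DT, c("y", "z"))  # the same columns, selected by name
```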



eantonya commented Jun 22, 2015

I quite like the automatic with=FALSE guessing, but not the drop reinstatement - I don't want to see that terrible option resurrected and muddying the waters of data.table.



raubreywhite commented Jun 22, 2015

I agree with eduard; drop=TRUE is one of the worst parts of data.frame. I think it makes sense to implement the with=FALSE guessing, as this improves consistency and doesn't materially degrade the quality of data.table, but drop=TRUE would just be implementing a bad idea for the sake of consistency.




franknarf1 commented Jun 22, 2015

I think it defeats the purpose of the change if you don't use drop=TRUE for these character-or-integer cases. If using data.table syntax DT[,.(mycol)], retain drop=FALSE, sure; I don't think changing that case would help anything.



eantonya commented Jun 22, 2015

@franknarf1 I disagree. The drop argument is only relevant for single-column retrievals, so not having it only affects some of the cases, and the effect it has on those cases is one of consistency, not the strange sometimes-this-sometimes-that behaviour of data.frame.



franknarf1 commented Jun 22, 2015

@eantonya Yeah, I guess we do disagree; sorry if I'm repeating myself, but I'll try to clarify. I'm not crazy about the sometimes-this-sometimes-that behavior of data.frame either, but the premise of this proposed enhancement is that data.frame syntax should be supported to some limited extent.

Within that limited scope (when j (1) does not use any columns of DT and (2) evaluates to character or integer... or something like that), we should give people what they expect. It's not like you or I are going to use it, so what harm? And if we don't give them what they expect, why bother giving them the concession to begin with? They'll still have grounds to complain about inconsistency. (I won't use it because I want to be able to read my code without the mental overhead of figuring out whether data.frame syntax is being used.)


@mattdowle mattdowle added this to the v1.9.8 milestone Jun 22, 2015


arunsrinivasan commented Jun 22, 2015

@franknarf1 perhaps I should clarify.

Ideally, what I'd like is for data.frame syntax in j to do everything that data.frame syntax does as shown below:

DT[, 1:2]
DT[, c("x", "y")]

cols = c("x", "y")
DT[, cols]

All of these should return a two-column data.table.

However, as @jrowen pointed out from the old post, the last case is tricky (for cases like the one I've shown in the previous post). Unless this case can be handled quite nicely, I personally don't see a huge advantage in implementing this feature. I can imagine myself explaining the behaviour to beginners (or in a talk) with too many ifs-and-buts... and that's not helping.

So, what would be great is to figure out whether there's a way around the last scenario without breaking too many things. And whether it's worth it.

I don't feel strongly about drop = . being present or not. And IMO that's not the main part of this discussion, at least until it's clear that we are going to implement this functionality.

I'm also fully aware of the case DT[3:4] vs DF[3:4], but this doesn't seem to come up at all as an issue (on SO, r-help, here, or data.table-help) AFAICT.



franknarf1 commented Jun 22, 2015

@arunsrinivasan Yeah, I also don't see a benefit from the feature change. As you say, it seems like it would make explaining the syntax harder and lead to messier code everywhere (as people start using data.frame syntax as a crutch).

Back to my aside (mentioned in your last sentence). Yeah, I've never seen anyone else complain about DT[1:3] vs DF[1:3], but maybe they should! Really, if we had the functionality mentioned in this thread so that DT[.SDcols=1:3] and DT[.SDcols=c("a","b")] worked as my intuition suggests they should, it would be really handy. It's off-topic here, because that change wouldn't be any sort of crutch for people who don't want to learn data.table syntax, though. Not sure if that's already a FR... Oh, just found it: #1149



eantonya commented Jun 22, 2015

@arunsrinivasan I actually don't see a big problem with some cases not working. I see this as guessing with=FALSE, and it's ok to guess incorrectly sometimes. Maybe a warning message can be printed accompanying the guess, similar to the guesses melt/dcast make.

@franknarf1 I'm not sure what you mean - of course I'd use this feature myself - I use with=FALSE reasonably frequently, and would love to not have to type it.

The framework from which I see this change is that of enhancing data.table usage for everyone, and emphatically not one of trying to mimic what data.frame does. From that viewpoint adding drop back disintegrates usage for advanced users for what I see as a very minor short-term gain and long-term loss for beginners. Whereas the with=FALSE guess is a short- and long-term enhancement for everyone.



franknarf1 commented Jun 22, 2015

@eantonya My mistake. I'd find the use of the feature in my code very hard to parse (by eye).

As far as the enhancement goes (excluding the mimicry), doesn't Richardo's DT[.SDcols=1:3] pull that off better (linked above, issue 1149)?



eantonya commented Jun 22, 2015

I don't have anything against that option (and I think that should work regardless of this one going in), but would prefer typing DT[, 1:3] since it's less typing.

As far as how to guess - I would propose the following: if any of the names in j match a column name or any of the special dot-symbols (.SD, etc.), then don't guess. Otherwise, attempt to evaluate the expression in the outside environment - if that succeeds and returns a character/int/numeric vector, then guess with=FALSE. Otherwise, fall back to what we do now.

I think this takes care of the cases above and a few more I can think of right now.

Thinking some more - evaluating something twice is fairly dangerous, so perhaps it's ok to live with the evaluation result no matter what it is (so return columns for character/int/numeric and the actual result otherwise).
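A rough sketch of that classification rule (illustrative only, not data.table's actual dispatch; the guess_with name and the specials list are assumptions):

```r
library(data.table)

# Sketch of the proposed heuristic: if j mentions no column names and no
# special symbols, evaluate it in the calling environment; a
# character/integer/numeric result is treated as a column selection.
guess_with <- function(DT, jsub, env = parent.frame()) {
  specials <- c(".SD", ".N", ".I", ".GRP", ".BY", "get", "mget", "eval")
  syms <- all.names(jsub)  # every symbol and function name appearing in j
  if (any(syms %in% c(names(DT), specials))) return("with=TRUE")
  res <- try(eval(jsub, envir = env), silent = TRUE)
  if (!inherits(res, "try-error") &&
      (is.character(res) || is.numeric(res))) return("with=FALSE")
  "with=TRUE"
}

DT <- data.table(x = 1:5, y = 6:10, z = 11:15)
guess_with(DT, quote(c("y", "z")))  # "with=FALSE": plain character vector
guess_with(DT, quote(sum(x)))       # "with=TRUE": x is a column name
```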



franknarf1 commented Jun 22, 2015

@eantonya I'm not really familiar with parsing R calls, but it sounds like cases like this:

DT <- data.table(a1=1:2, a2=3:4, a3=5:6)
suff = 2
DT[,mean(get(paste0("a",suff)))] # 3.5

suffy = 3
DT[,plot(get(paste0("a",suff)),get(paste0("a",suffy)))] # plots a2 v a3

would no longer work, since j does not find any names...?

If some guesswork way were implemented, maybe it could be made into an option, datatable.guesswith, off by default but recommended for folks strongly tied to data.frame syntax.



eantonya commented Jun 22, 2015

Ok, let's add get to the list that includes .SD and friends. What other cases would it not work for? Let's see if it's easy to classify the expressions.



franknarf1 commented Jun 22, 2015

Okay, I'll see if I think of or come across any others. Nothing comes to mind beyond mget (which I can't figure out how to actually use here) and eval, like

str  = paste0("a",suff)
expr = parse(text=str)
DT[,eval(expr)]


mattdowle commented Aug 5, 2015

Great comments above. In an attempt to draw it all together, I'm thinking we should make the following changes. If I've read correctly, I think (hope!) this will please everyone and displease nobody.

  1. inspect j before evaluation (as is done anyway). If it's a single number or single string then with=FALSE will be assumed. These will then work:
    DT[,1]
    DT[,"someCol"]
    These don't do anything useful now anyway, so won't break existing code. In both cases, a single-column data.table will be returned, consistent with 'with=FALSE' and dropping 'drop'. The possible surprise of getting a single-column data.table (unlike data.frame) is unlikely to upset, especially since the column will print nicely (top and bottom 5 rows) rather than as a long vector filling up the console.
  2. if j is a single symbol, it'll return that column as a vector, as it has always done. If however that column name is missing, raise a new error (wouldn't do anything useful now anyway so new error won't break existing code).
    DT[, existingCol] # return the column as a vector as before
    DT[, missingCol]
    Error: j (the 2nd argument inside [...]) is a single symbol that isn't a column name. In data.table, j is evaluated within its scope. If missingCol is a variable in calling scope that contains column names or numbers, then add with=FALSE; i.e. DT[, missingCol, with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1 . It allows more advanced usage: see example in ?data.table.
  3. if j contains no symbols (e.g. calls to c(), :, paste(), paste0() and only column numbers or strings), evaluate it and expect the result to be a number or character vector. Then set with=FALSE. These would then work:
    DT[, c(1:10, 50)]
    DT[, c("ColA","ColB")]
    DT[, paste0("V",20:25)]
  4. Otherwise, current behaviour.

We can always go further later depending on how it goes. We'll wait for everyone who's commented so far to confirm before going ahead (and only then after 1.9.6 is (finally) on CRAN!)
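Assuming the rules land as described above, a sketch of what each one would mean for a small example table:

```r
library(data.table)
DT <- data.table(x = 1:5, y = 6:10, z = 11:15)

# Rule 1: a single number or string in j implies with=FALSE;
# the result would be a one-column data.table (drop is not reinstated).
DT[, 1]
DT[, "y"]

# Rule 2: a single symbol naming an existing column would still
# return that column as a vector, as it always has.
DT[, y]

# Rule 3: j built only from calls like c(), `:`, paste0() on column
# numbers/strings would be evaluated and treated as with=FALSE.
DT[, c("y", "z")]
DT[, 2:3]
```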



markdanese commented Aug 5, 2015

Seems good to me. As always, thanks to you and Arun for doing the hard work.



DavidArenburg commented Aug 5, 2015

Sounds great to me.



jangorecki commented Aug 5, 2015

To reduce the inconsistency, it could be good to remove the default value of the with argument, or make it NULL/logical(0)/NA, which would correspond to 'guess'. Then explicitly using with=TRUE would still be able to override the newly proposed behaviour, so the change would be focused on guessing with only when it is not provided.



mattdowle commented Aug 5, 2015

@jangorecki Yes nice idea - agree.



jrowen commented Aug 5, 2015

I too am in favor of the revised proposal.



ronhylton commented Aug 6, 2015

Here's a slightly different viewpoint. There are places where I'd dearly love to dispatch a DT into some old code expecting a DF and automagically pick up a big improvement in merge() speed (and conceivably for other operations involving grouping). Unfortunately with the non-DF behavior of [ that often won't work, and sometimes I end up basically as.x'ing back and forth between DT and DF in order to keep old code happy.

One clean solution to this is setcompatibility(c("on","off,"?")).

off provides "native DT" behavior for those who want to fully exploit DT capabilities.

on provides "native DF" behavior unless the operation clearly doesn't make sense for DF. E.g. I don't think you can have DF[DF,] so something like this would clearly be invoking a DT-style join.

Conceivably there could be other compatibility levels, e.g. almost-DF without drop.

Since this would be a setxxx by reference it also hopefully has very little performance cost.

ronhylton commented Aug 6, 2015

Here's a slightly different viewpoint. There are places where I'd dearly love to dispatch a DT into some old code expecting a DF and automagically pick up a big improvement in merge() speed (and conceivably for other operations involving grouping). Unfortunately with the non-DF behavior of [ that often won't work, and sometimes I end up basically as.x'ing back and forth between DT and DF in order to keep old code happy.

One clean solution to this is setcompatibility(c("on","off,"?")).

off provides "native DT" behavior for those who want to fully exploit DT capabilities.

on provides "native DF" behavior unless the operation clearly doesn't make sense for DF. E.g. I don't think you can have DF[DF,] so something like this would clearly be invoking a DT-style join.

Conceivably there could be other compatibility levels, e.g. almost-DF without drop.

Since this would be a setxxx by reference it also hopefully has very little performance cost.
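To make the "set by reference" point concrete, here is a hypothetical sketch of what such a setcompatibility() could look like. This API does not exist in data.table; only setattr() is real, and it is used here because it attaches the flag without copying the table:

```r
library(data.table)

# Hypothetical per-table compatibility switch (illustration only)
setcompatibility <- function(DT, mode = c("off", "on")) {
  mode <- match.arg(mode)
  # setattr() modifies by reference: no copy of the data, so the
  # performance cost is a single attribute write
  setattr(DT, "df.compat", identical(mode, "on"))
  invisible(DT)
}

# [.data.table could then consult attr(DT, "df.compat", exact = TRUE)
# and fall back to data.frame semantics when it is TRUE.
```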

@jangorecki (Member) commented Aug 6, 2015

@ronhylton your comment has a much wider scope than the topic discussed here. It might be better to isolate it as a new FR. The detection of j compatibility discussed here could be managed, for example, by implementing the default value of with as getOption("datatable.with"), etc.

@mattdowle (Member) commented Aug 6, 2015

@ronhylton Agree with Jan - best raise a new issue. One option is to place your old code in a package; then it would automatically divert to base syntax when passed a data.table.

@ptoche commented Feb 17, 2016

I have reached this thread 3 times over the last year or so, which is when I started using data.table. Every time it was because I had forgotten about the with = FALSE option. Every time I've read this thread, I've thought "Ah yep, true, must remember that," but somehow I don't.

data.table is a fantastic package. My 2 cents on the topic of this thread: I do not care much about compatibility/consistency with data.frame. If it's there, great (1 stone, 2 birds), but consistency shouldn't be there just for the sake of it; it should be there if the feature is desirable. And the bottom line for me is that dt[i,j] is a very, very intuitive way to access data; it's pretty much standard notation that's been around for centuries (or at least one century). Intuitive and natural, that's what I think matters.

One of the top search hits for "r data.table subset by column and row" is this page: http://personal.colby.edu/personal/m/mgimond/RIntro/04_Manipulating_data_tables.html, which states "For example, to access one dat cell value at row 23 and column 4, type dat[23, 4]" because that's what everyone expects.

@arunsrinivasan arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Mar 8, 2016

@mattdowle mattdowle modified the milestones: v1.9.8, v2.0.0 Apr 22, 2016

@arunsrinivasan arunsrinivasan removed this from the v1.9.8 milestone May 13, 2016

@mattdowle mattdowle added this to the v1.9.8 milestone Sep 29, 2016

@mattdowle mattdowle closed this in f78d790 Sep 30, 2016

@JoshOBrien (Contributor) commented Oct 18, 2016

This is such a nice fix to what has been a real stumbling block for data.table users. Superb! Leaving a note here as I referenced this in comments following a related SO answer that should itself eventually be edited to reflect the change.
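For readers landing here later, the behaviour after this change (available in data.table versions from late 2016 onward) matches the proposal at the top of the thread; a quick sketch:

```r
library(data.table)
DT <- data.table(x = 1:5, y = 6:10, z = 11:15)

DT[, c("y", "z")]   # selects columns y and z; no with=FALSE needed
DT[, 2:3]           # same result, by position

# Note: unlike the original proposal, a single column still returns a
# one-column data.table rather than dropping to a vector:
DT[, "y"]
```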

@priyak1917 commented Aug 30, 2017

Error in [.data.table(mba, , i) :
j (the 2nd argument inside [...]) is a single symbol but column name 'i' is not found. Perhaps you intended DT[,..i] or DT[,i,with=FALSE]. This difference to data.frame is deliberate and explained in FAQ 1.1.

Such an annoying error, not letting me knit the file. I'm very poor at coding, and this is daunting me even more.
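The error arises because j is evaluated within the table's scope by default, so a variable named i that holds column names gets looked up as a column. The two fixes the message itself suggests, illustrated with a hypothetical mba table (the `..` prefix requires data.table >= 1.10.2):

```r
library(data.table)
mba <- data.table(a = 1:3, b = 4:6)
i <- "a"

mba[, ..i]               # ".." prefix: look i up in the calling scope
mba[, i, with = FALSE]   # equivalent, older spelling
```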
