Data.table method for [[ #892

Closed
matthieugomez opened this issue Oct 15, 2014 · 6 comments

Comments

@matthieugomez
Contributor

Many answers on Stack Overflow assume that the names of newly created columns in a data.table don't conflict with existing variable names. These answers work for the particular examples given in the original questions, but they may fail in more general situations. As an example, this answer led to a bug in dplyr::filter.

Since the last column created is always at the end of the data.table, I think a way of extracting the last (or second-to-last, etc.) column would be useful. I have written a data.table method for [[ below, such that the symbol .M refers to the number of columns in the data.table.

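`[[.data.table` <- function(x,i){
    # capture the unevaluated index so that .M can be substituted
    isub <- substitute(i)
    # .M stands for the number of columns in x
    if (is.call(isub) || is.name(isub)) isub <- lazyeval::interp(isub, .M = length(x))
    i <- eval(isub, parent.frame())
    `[[.data.frame`(x,i)
}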

The answer in the original Stack Overflow post would then be

bdt[bdt[, .I[g == max(g)], by = id][[.M]]]

This code works regardless of the names of the existing variables in "by".
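For comparison, the same positional extraction can already be written without a new method, at the cost of an intermediate variable. The following is only a sketch: the table bdt and its columns id and g are made-up illustration data, not taken from the original post.

```r
library(data.table)

# made-up data, for illustration only
bdt <- data.table(id = c(1, 1, 2, 2, 2), g = c(3, 5, 2, 7, 7))

# row numbers of the group-wise maxima; the unnamed j column is placed last
idx <- bdt[, .I[g == max(g)], by = id]

# take the last column by position, so column names are never consulted
rows <- idx[[length(idx)]]
res <- bdt[rows]
```

Here length(idx) plays the role that .M would play inside [[.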

@arunsrinivasan
Member

It's just a bug in dplyr, isn't it? I thought it was quite clear that the functionality (of not providing column names in j) is for interactive use only, but seemingly not. Perhaps we should mention this explicitly in the documentation hereafter. While programming, of course one needs to take care of column names.

Here's a similar dplyr example that'd fail as well (due to auto-naming):

require(dplyr)
foo <- function(x) x^2
DF = data.frame(x_foo=c(1,1,1,2,2), x=1:5, y=6:10, z=11:15) 
DF %>% group_by(x_foo) %>% mutate_each(funs(sum, foo))
# Error: cannot modify grouping variable
DF %>% group_by(x_foo) %>% mutate_each(funs(sum, foo), y,z)
# works fine

[Since duplicate column names are allowed in data.table, and updating columns requires providing column names, I can't envision a scenario where the above case would happen in data.table.] However, allowing duplicate names does ensure some level of caution in naming columns appropriately, when programming (non-interactively).

In summary, using a special/rare column name in j, e.g. __tmp__, should fix the issue IMHO.
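A minimal sketch of that suggestion, assuming a made-up table bdt with columns id and g (illustration data only): the j column gets an explicit, unlikely name and is then extracted by that name.

```r
library(data.table)

# made-up data, for illustration only
bdt <- data.table(id = c(1, 1, 2, 2, 2), g = c(3, 5, 2, 7, 7))

# name the j column explicitly instead of relying on the automatic name
idx <- bdt[, list(`__tmp__` = .I[g == max(g)]), by = id]

# extract by the chosen name; this is safe only if no by column is called __tmp__
rows <- idx[["__tmp__"]]
res <- bdt[rows]
```

This still relies on the assumption that no existing column uses the chosen name.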

@matthieugomez
Contributor Author

It's a bug in dplyr, sure. But I suspect it's a bug that may happen in a lot of user programs. Honestly, it's not even really related to data.table - I find it just as hard to avoid duplicate names in data.frames.

More thoughts about this (sorry - this message is long)

In Stata, at the beginning of a function, one assigns a character to temporary variables:

tempvar v1 
egen `v1'=mean(v2), by(v3)

In the first command, Stata creates a corresponding character string with prefix __temp (the prefix __ is forbidden for user-defined variables, which ensures no duplicates), and, within the rest of the function, v1 (enclosed in quotes) refers to the name of this temporary variable. The second line therefore creates a variable named __temp001 in the user dataset. (As an aside, Stata also deletes these variables at the end of the program. This is important since in Stata datasets are always modified in place. This is replicable with on.exit(DT[, (tempvar) :=NULL]) for now, and, hopefully, shallow() would give an even more elegant way to safely define temporary variables.)
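The on.exit() cleanup mentioned above could look roughly like this. It is a sketch under assumptions: demean_by and the default name "__tmp_mean__" are hypothetical, and the function simply stops if that name is already taken rather than generating a fresh one.

```r
library(data.table)

# hypothetical helper: subtracts the group mean from column `col`
demean_by <- function(DT, col, bycols, tmp = "__tmp_mean__") {
  # assumption: `tmp` is not already a column of DT
  stopifnot(!tmp %in% names(DT))
  # add the temporary column by reference, as Stata would do in place
  DT[, (tmp) := mean(get(col)), by = c(bycols)]
  # drop it again when the function exits, even if a later step fails
  on.exit(DT[, (tmp) := NULL], add = TRUE)
  DT[[col]] - DT[[tmp]]
}

DT <- data.table(grp = c("a", "a", "b"), v = c(1, 3, 5))
out <- demean_by(DT, "v", "grp")
```

After the call, DT has its original columns again, because the return value is computed before the on.exit() expression runs.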

Now, the following steps are needed to be as careful in R.

a) define a function tempname that creates a character vector of names not present in an environment or a list (like a data.table)

b) use setNames when creating a list in j. Things get a little more complicated if one wants to avoid overhead when by is not null (http://stackoverflow.com/a/16150233/3662288): one needs to switch to setnames and include the names of the columns in by.

c) wrap in c() any name that refers to a character vector in j, by, or .SD, to be sure the function works even if a variable named "tempname" exists in the data.table. Some names in i or j cannot be substituted by characters and, for those, one needs to use get()

For instance, to write a function that keeps only the rows attaining the maximum within each group (suppose .SD[] does not exist)

 # tempname
 tempname <- function(where = globalenv(), n = 1, prefix = "temp") {
     all_names <- NULL
     i <- 0L
     name <- prefix
     while (n > 0) {
         i <- i + 1L
         # try prefix, prefix1, prefix2, ... until the name is unused in `where`
         while (exists(name, where = where)) {
             name <- paste0(prefix, as.character(i))
             i <- i + 1L
         }
         all_names <- c(all_names, name)
         name <- paste0(prefix, as.character(i))
         n <- n - 1
     }
     all_names
 }

 # then write the function
 keepmax <- function(DT, col, bycols){
     # local name differs from the tempname() function to avoid shadowing it
     tmp <- tempname(DT)
     ans <- setnames(DT[, .I[get(col)==max(get(col))], by = c(bycols)], c(bycols, tmp))
     # could not find better way to programmatically extract a column as a vector based on its name
     ans <- as.vector(as.matrix(ans[, tmp, with = FALSE]))
     DT[ans]
 }

My point is that making sure temporary variables don't create duplicate names in data.frames or data.tables is cumbersome. I've actually encountered very few packages dealing explicitly with this issue.

A way to simplify this problem may be to add a function like tempname in the package, capture setNames in j so that overhead is avoided, and explain the best practices in the FAQ.
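As a rough illustration of what such a helper could look like, here is a shorter variant built on base R's make.unique(). The name unused_names and its arguments are hypothetical, and it takes a character vector of existing names (e.g. names(DT)) rather than an environment or list.

```r
# hypothetical helper: returns n names that do not occur in `existing`
unused_names <- function(existing, n = 1, prefix = "temp") {
  # make.unique() keeps first occurrences and adds suffixes to later duplicates,
  # so the n appended candidates end up distinct from everything before them
  candidates <- make.unique(c(existing, rep(prefix, n)))
  tail(candidates, n)
}

a <- unused_names(c("a", "b"), n = 2)
b <- unused_names(c("temp", "temp.1"))
```

The generated names differ from those of the tempname function above (temp.1 rather than temp1), which should not matter as long as they are only used internally.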

Now, in a lot of common situations, only the last or second to last columns are needed. I thought that DT[[.M]] might be the simplest solution, so this is what I initially proposed.

@arunsrinivasan
Member

Neat example highlighting the actual issue. And yes, the underlying issue (in dealing with duplicate names) is complex. But it'd be easier (and cleaner?) to avoid it altogether (basically, if none exists, don't introduce a duplicate name, but if it already exists, play nicely as long as an index is provided). I remember a pending FR from @rsaporta regarding the very same problem.

On another note, this particular operation itself should be much cleaner using .SD, and shouldn't depend on .I. We'll have to think of some tweaks in j-expression. It's come up quite a few times recently.
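For reference, a sketch of the .SD form of the same operation, using a made-up table bdt with columns id and g (illustration data only). No intermediate column of row numbers is created, so there is no temporary name to worry about.

```r
library(data.table)

# made-up data, for illustration only
bdt <- data.table(id = c(1, 1, 2, 2, 2), g = c(3, 5, 2, 7, 7))

# subset each group's .SD directly; the result keeps the by column and g
res <- bdt[, .SD[g == max(g)], by = id]
```

Subsetting .SD per group tends to be slower when there are many groups, which is presumably why the .I form keeps coming up.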

@matthieugomez
Contributor Author

I'm actually not even dealing with duplicate names - as you say, I'm just making sure the temporary variables I create don't introduce duplicate names.

@arunsrinivasan
Member

@matthieugomez agreed. Found the issue from @rsaporta, #551. I've added it to 1.9.8 milestone. Closing this one (as it's very related, and will be linked in the other post now).

@matthieugomez
Contributor Author

Instead of a method for [[]], there could be an option to get only the columns created in j, without columns in by. Looking at the first example, one could do something like

bdt[bdt[, .I[g == max(g)], by = id, with.by = FALSE]]

It's more elegant than an intermediate line to get the length of the data.table and then get the last column. It may be worth it if columns in by are deeply copied.
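Note that with.by is only a proposed argument here, not an existing one. With the current interface, something similar can be sketched by position, relying on the by columns coming first in the result; the table bdt and the variable bycols are made up for illustration.

```r
library(data.table)

# made-up data, for illustration only
bdt <- data.table(id = c(1, 1, 2, 2, 2), g = c(3, 5, 2, 7, 7))
bycols <- "id"

# two unnamed j columns, placed after the by column in the result
res <- bdt[, list(.I[g == max(g)], max(g)), by = c(bycols)]

# keep everything after the by columns, selected by position rather than by name
jcols <- seq_len(ncol(res))[-seq_along(bycols)]
jonly <- res[, jcols, with = FALSE]
```

This is the intermediate-line approach the proposal wants to avoid, shown only to make the intended result concrete.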
