Data.table method for [[ #892

matthieugomez · 2014-10-15T13:37:54Z

Many answers on Stack Overflow assume that the names of newly created columns in a data.table don't conflict with existing variable names. Although these answers work for the particular examples given in the original question, they may fail in more general situations. As an example, this answer led to a bug in dplyr::filter.

Since the last column created is always at the end of the data.table, I think a way of extracting the last (or second to last, etc) column would be useful. I have written a data.table method for [[ below, such that the symbol .M refers to the number of columns in the data.table.

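`[[.data.table` <- function(x, i){
    # capture the index expression unevaluated
    isub <- substitute(i)
    # replace the symbol .M with the number of columns
    if (is.call(isub) || is.name(isub)) isub <- lazyeval::interp(isub, .M = length(x))
    i <- eval(isub, parent.frame())
    # delegate to the data.frame method
    `[[.data.frame`(x, i)
}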

The answer in the original Stack Overflow post would then be

bdt[bdt[, .I[g == max(g)], by = id][[.M]]]

This code works irrespective of the names of the existing variables used in by.
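For comparison, the same positional idea can be sketched without defining a new method, by indexing with length() directly. The table and its values below are made up for illustration; the grouping column is deliberately named V1, the name data.table gives an unnamed j result by default, so extraction by name would be ambiguous.

```r
library(data.table)

# Hypothetical data: the grouping column is called V1 on purpose.
bdt <- data.table(V1 = c(1, 1, 2, 2), g = c(1, 3, 2, 5))

# Row numbers of the per-group maxima; the unnamed j result is the last column.
res <- bdt[, .I[g == max(g)], by = V1]

# Extract the last column by position rather than by name.
idx <- res[[length(res)]]
out <- bdt[idx]
```

This is the "intermediate line" variant; the proposed .M symbol would only save the explicit call to length().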


arunsrinivasan · 2014-10-18T15:54:05Z

It's just a bug in dplyr, isn't it? I thought it was quite clear that the functionality (of not providing column names in j) is for interactive use only, but seemingly not. Perhaps we should mention this explicitly in the documentation hereafter. While programming, of course one needs to take care of column names.

Here's a similar dplyr example that'd fail as well (due to auto-naming):

require(dplyr)
foo <- function(x) x^2
DF = data.frame(x_foo=c(1,1,1,2,2), x=1:5, y=6:10, z=11:15) 
DF %>% group_by(x_foo) %>% mutate_each(funs(sum, foo))
# Error: cannot modify grouping variable
DF %>% group_by(x_foo) %>% mutate_each(funs(sum, foo), y,z)
# works fine

[Since duplicate column names are allowed in data.table, and updating columns requires providing column names, I can't envision a scenario where the above case would happen in data.table.] However, allowing duplicate names does call for some level of caution in naming columns appropriately when programming (non-interactively).

In summary, using a special/rare column name, e.g., __tmp__, in j should fix the issue IMHO.
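A minimal sketch of that suggestion, assuming a toy table bdt with columns id and g (both invented here): the j result is named explicitly, so it can be pulled out by name regardless of what the by columns are called.

```r
library(data.table)

# Hypothetical data for illustration.
bdt <- data.table(id = c(1, 1, 2, 2), g = c(1, 3, 2, 5))

# Name the j result with a rare name instead of relying on auto-naming.
res <- bdt[, .(`__tmp__` = .I[g == max(g)]), by = id]

# Extract the row numbers by that name and subset the original table.
idx <- res[["__tmp__"]]
out <- bdt[idx]
```

This only works as long as no existing column already uses the rare name, which is the residual risk discussed in the rest of the thread.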

matthieugomez · 2014-10-18T17:12:05Z

It's a bug in dplyr, sure. But I suspect it's a bug that may happen in a lot of user programs. Honestly, it's not even really related to data.table - I find it equally hard to avoid duplicate names in data.frames.

More thoughts about this (sorry - this message is long)

In Stata, at the beginning of a function, one assigns a character to temporary variables:

tempvar v1 
egen `v1'=mean(v2), by(v3)

In the first command, Stata creates a corresponding name with the prefix __temp (the prefix __ is forbidden for user-defined variables, which ensures no duplicates), and, within the rest of the function, v1 (enclosed in quotes) refers to the name of this temporary variable. The second line therefore creates a variable named __temp001 in the user dataset. (As an aside, Stata also deletes these variables at the end of the program. This is important since in Stata datasets are always modified in place. This can be replicated with on.exit(DT[, (tempvar) := NULL]) for now, and, hopefully, shallow() would give an even more elegant way to safely define temporary variables.)
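The on.exit() cleanup can be sketched roughly as follows. The function name group_demean, the temporary name "__tmp_mean__", and the toy data are all hypothetical; the point is only that the column added by reference is removed again when the function exits, even on error.

```r
library(data.table)

# Hypothetical helper: deviation of `col` from its group mean.
group_demean <- function(DT, col, bycol) {
  tmp <- "__tmp_mean__"              # assumed not to clash; checked below
  stopifnot(!tmp %in% names(DT))
  # add the temporary column by reference (DT is modified in place)
  DT[, (tmp) := mean(get(col)), by = c(bycol)]
  # remove it again when the function exits, even if a later step fails
  on.exit(DT[, (tmp) := NULL], add = TRUE)
  DT[[col]] - DT[[tmp]]
}

DT <- data.table(v2 = c(1, 3, 2, 6), v3 = c("a", "a", "b", "b"))
out <- group_demean(DT, "v2", "v3")
```

After the call, DT has its original columns again, which mirrors what Stata does with tempvar at the end of a program.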

Now, the following steps are needed to be as careful in R.

a) define a function tempname that creates a character vector of names not present in an environment or a list (like a data.table)

b) use setNames when creating the list in j. Things get a little more complicated if one wants to avoid overhead when by is not null (http://stackoverflow.com/a/16150233/3662288): one needs to switch to setnames and include the names of the columns in by.

c) add c() around names that refer to character vectors in j, by, .SD to be sure the function works even if a variable named "tempname" exists in the data.table. Some names in i or j cannot be substituted by character vectors and, for those, one needs to use get()

For instance, to write a function that keeps only the rows with the maximum value within each group (suppose .SD[] does not exist)

 # tempname
 tempname <- function(where = globalenv(), n = 1, prefix = "temp") {
     all_names <- NULL
     i <- 0L
     name <- prefix
     while (n > 0) {
         i <- i + 1L
         while (exists(name, where = where)) {
             name <- paste0(prefix, as.character(i))
             i <- i + 1L
         }
         all_names <- c(all_names, name)
         name <- paste0(prefix, as.character(i))
         n <- n - 1
     }
     all_names
 }

 # then write the function
 keepmax <- function(DT, col, bycols){
     # pick a column name that is not already used in DT
     tmp <- tempname(DT)
     ans <- setnames(DT[, .I[get(col) == max(get(col))], by = c(bycols)], c(bycols, tmp))
     # extract the temporary column as a vector by its name
     ans <- ans[[tmp]]
     DT[ans]
 }
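As a side note, step a) can also be sketched with base R's make.unique(), which appends numeric suffixes until a name no longer clashes. The helper name temp_names is hypothetical, and it takes a character vector of existing names (e.g. names(DT)) rather than an environment.

```r
# Hypothetical helper: return n names starting with `prefix`
# that do not occur in `existing`.
temp_names <- function(existing, n = 1L, prefix = "temp") {
  # make.unique() leaves first occurrences untouched and suffixes later ones
  candidates <- make.unique(c(existing, rep(prefix, n)))
  # the last n entries are the freshly generated names
  tail(candidates, n)
}

one_name  <- temp_names(c("x", "y"))
two_names <- temp_names(c("temp", "x"), n = 2L)
```

Here the generated names carry a ".1", ".2" suffix instead of the "temp1", "temp2" pattern used above; either convention serves the same purpose.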

My point is that making sure temporary variables don't create duplicate names in data.frames or data.tables is cumbersome. I've actually encountered very few packages dealing explicitly with this issue.

A way to simplify this problem may be to add a function like tempname in the package, capture setNames in j so that overhead is avoided, and explain the best practices in the FAQ.

Now, in a lot of common situations, only the last or second to last columns are needed. I thought that DT[[.M]] might be the simplest solution, so this is what I initially proposed.

arunsrinivasan · 2014-10-18T22:16:19Z

Neat example highlighting the actual issue. And yes, the underlying issue (in dealing with duplicate names) is complex. But it'd be easier (and cleaner?) to avoid it altogether (basically, if none exists, don't introduce a duplicate name, but if it already exists, play nicely as long as an index is provided). I remember a pending FR from @rsaporta regarding the very same problem.

On another note, this particular operation itself should be much cleaner using .SD, and shouldn't depend on .I. We'll have to think of some tweaks in the j-expression. It's come up quite a few times recently.
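For reference, the two forms can be compared on a toy table (the data and the column v are invented for this sketch). Both select the rows holding the per-group maximum of g; the .SD form needs no temporary column at all.

```r
library(data.table)

# Hypothetical data for illustration.
bdt <- data.table(id = c(1, 1, 2, 2), g = c(1, 3, 2, 5),
                  v = c("a", "b", "c", "d"))

# .I idiom: compute row numbers per group, then subset the original table.
via_I <- bdt[bdt[, .I[g == max(g)], by = id][[2L]]]

# .SD idiom: subset each group directly.
via_SD <- bdt[, .SD[g == max(g)], by = id]
```

In this sketch the .I result is taken by position ([[2L]]), which already assumes a single by column.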

matthieugomez · 2014-10-18T22:39:05Z

I'm actually not even dealing with duplicate names - as you say, I'm just making sure the temporary variables I create don't introduce duplicate names.

arunsrinivasan · 2014-10-19T08:26:09Z

@matthieugomez agreed. Found the issue from @rsaporta, #551. I've added it to 1.9.8 milestone. Closing this one (as it's very related, and will be linked in the other post now).

matthieugomez · 2014-11-13T22:07:54Z

Instead of a method for [[, there could be an option to get only the columns created in j, without the columns in by. Looking at the first example, one could do something like

bdt[bdt[, .I[g == max(g)], by = id, with.by = FALSE]]

It's more elegant than an intermediate line to get the length of the data.table and then get the last column. It may be worth it if columns in by are deeply copied.

arunsrinivasan closed this as completed Oct 19, 2014
