Data.table method for [[ #892
Comments
It's just a bug in Here's a similar
require(dplyr)
foo <- function(x) x^2
DF = data.frame(x_foo=c(1,1,1,2,2), x=1:5, y=6:10, z=11:15)
DF %>% group_by(x_foo) %>% mutate_each(funs(sum, foo))
# Error: cannot modify grouping variable
DF %>% group_by(x_foo) %>% mutate_each(funs(sum, foo), y,z)
# works fine
[Since duplicate column names are allowed in data.table, and updating columns requires providing column names, I can't envision a scenario where the above case would happen in data.table.] However, allowing duplicate names does require some level of caution in naming columns appropriately when programming (non-interactively). In summary, using a special/rare column name, for e.g.,
It's a bug in
More thoughts about this (sorry, this message is long).
In Stata, at the beginning of a function, one assigns a character to temporary variables:
In the first command, Stata creates a corresponding character with prefix __temp (the prefix __ is forbidden for user-defined variables, which ensures no duplicates), and, within the rest of the function, v1 (enclosed in quotes) refers to the name of this temporary variable. The second line therefore creates a variable named __temp001 in the user dataset. (As a parenthesis, Stata also deletes these variables at the end of the program. This is important since in Stata datasets are always modified in place. This is replicable with
Now, the following steps are needed to be as careful in R: a) define a function b) use c) add c() around names that refer to characters in j, by, .SD, to be sure the function works even if a variable named "tempname" exists in the data.table. Some names in
For instance, to write a function that keeps only rows within groups (suppose
# tempname
tempname <- function(where = globalenv(), n = 1, prefix = "temp") {
  # return n names starting with `prefix` that do not already exist in `where`
  all_names <- NULL
  i <- 0L
  name <- prefix
  while (n > 0) {
    i <- i + 1L
    # append an increasing integer until the candidate name is unused
    while (exists(name, where = where)) {
      name <- paste0(prefix, as.character(i))
      i <- i + 1L
    }
    all_names <- c(all_names, name)
    name <- paste0(prefix, as.character(i))
    n <- n - 1
  }
  all_names
}
# then write the function
keepmax <- function(DT, col, bycols) {
  tempname <- tempname(DT)
  ans <- setnames(DT[, .I[get(col) == max(get(col))], by = c(bycols)], c(bycols, tempname))
  # could not find better way to programmatically extract a column as a vector based on its name
  ans <- as.vector(as.matrix(ans[, tempname, with = FALSE]))
  DT[ans]
}
My point is that making sure temporary variables don't create duplicate names in data.frames or data.tables is cumbersome. I've actually encountered very few packages dealing explicitly with this issue. A way to simplify this problem may be to add a function like
Now, in a lot of common situations, only the last or second-to-last columns are needed. I thought that
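To make the "pick a name that cannot clash" idea concrete, here is a minimal base R sketch. It is not the tempname function from this comment and not a data.table API: fresh_name is a hypothetical helper, and it leans on base R's make.unique, which appends a numeric suffix to repeated names.

```r
# Hypothetical helper (not part of data.table): return a name based on `prefix`
# that is not already used as a column name in `df`.
fresh_name <- function(df, prefix = "temp") {
  # make.unique() renames repeated entries so the whole vector is unique;
  # the candidate is placed last, so its (possibly suffixed) form is the result.
  candidates <- make.unique(c(names(df), prefix))
  candidates[length(candidates)]
}

# Made-up data where the obvious temporary names are already taken
df <- data.frame(temp = 1:3, temp.1 = 4:6)
new_col <- fresh_name(df)
print(new_col)
```

The returned name can then be used wherever a temporary column is needed, for example as the second argument to setnames, without checking by hand whether "temp" already exists.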
Neat example highlighting the actual issue. And yes, the underlying issue (in dealing with duplicate names) is complex. But it'd be easier (and cleaner?) to avoid it altogether (basically, if none exists, don't introduce a duplicate name, but if one already exists, play nicely as long as an index is provided). I remember a pending FR from @rsaporta regarding the very same problem. On another note, this particular operation itself should be much cleaner using
I'm actually not even dealing with duplicate names. As you say, I'm just making sure the temporary variables I create don't introduce duplicate names.
@matthieugomez agreed. Found the issue from @rsaporta, #551. I've added it to the 1.9.8 milestone. Closing this one (as it's very related, and will be linked in the other post now).
Instead of a method for [[, there could be an option to get only the columns created in j, without the columns in by. Looking at the first example, one could do something like bdt[bdt[, .I[g == max(g)], by = id, with.by = FALSE]] It's more elegant than an intermediate line to get the length of the data.table and then get the last column. It may be worth it if columns in by are deep-copied.
Lots of answers on Stack Overflow assume that the names of newly created columns in a data.table don't conflict with existing variable names. Although these answers work for the particular examples given in the original question, they may fail in more general situations. As an example, this answer led to a bug in dplyr::filter.
Since the last column created is always at the end of the data.table, I think a way of extracting the last (or second to last, etc.) column would be useful. I have written a data.table method for [[ below, such that the symbol .M refers to the number of columns in the data.table.
The answer in the original Stack Overflow post would then be
This code works irrespective of the names of existing variables in "by".
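As a rough illustration of why extracting by position sidesteps name clashes, here is a base R sketch on a plain data.frame. It does not reproduce the [[ method proposed in this post; the column names and values are made up, and the duplicate name "V1" stands in for a "by" column that happens to share the name of the newly created column.

```r
# Made-up data: two columns that share the name "V1".
# check.names = FALSE keeps the duplicate names instead of renaming them.
df <- data.frame(V1 = c(1, 2), V1 = c(10, 20), check.names = FALSE)

by_name <- df[["V1"]]          # lookup by name returns the first column called "V1"
by_position <- df[[ncol(df)]]  # lookup by position returns the last column

print(by_name)
print(by_position)
```

Because a data.table is also a list of columns, the same positional [[ with ncol() should behave the same way there, which is the intermediate-line approach discussed in the comments.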