Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When using ".SDcols" and "by" in one call, and the function call produces NA for one group, but not the other, you get an error #5341

Closed
CGlemser opened this issue Feb 25, 2022 · 7 comments

Comments

@CGlemser
Copy link

CGlemser commented Feb 25, 2022

I have run into the following situation (simplified for reproducibility):

  • I created a function that returns the sum of a vector with all NAs removed and NA if all elements are NA
    sumNA <- function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm = TRUE))
  • then I want to apply it in a data.table to only certain columns grouped by another column
    dt <- data.table(col1 = c(rep(NA, 2), 1, 2), col2 = rep(c(1,2), each = 2), groupby = c(1,1,2,2))
    dt[, lapply(.SD, sumNA), .SDcols = c("col1", "col2"), by = "groupby"]
  • this returns an error since col1 is NA for all elements in group1, but has a sum value of 3 for the elements in group2. As a result data.table can't produce the column as the NA is coded as logical, but the result for group2 is double.
  • I have fixed this in my case by explicitly coercing the NA to be of the correct class in my function before returning the result... However, having this NA coercion happen in data.table directly instead would have been a very nice-to-have and could potentially save many users from having to code the same workaround in every function that returns NAs and takes several input classes
  • it also took me a while to figure out what caused the error message, so in case implementing the NA class coercion is not easily feasible, I also think that a more descriptive error message could be helpful

Output of sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.1.1 cli_3.1.0 tools_4.1.1 data.table_1.14.2 rlang_1.0.

@jangorecki
Copy link
Member

jangorecki commented Feb 25, 2022

Maybe you can just return double type NA from your function instead of logical NA? Otherwise your function is not type stable.

@CGlemser
Copy link
Author

Yes, I am aware of workarounds that will make it work and I have implemented it in my use case - I was just wondering if this is not something that could (and maybe should) be done by data.table implicitly, in particular cause we see this type of dynamic type conversion pretty much everywhere else in R. One example is if I for example use
sumNA <- function(x) ifelse(all(is.na(x)), as.double(NA), sum(x, na.rm = TRUE))
but one column is type integer
dt <- data.table(col1 = as.integer(c(rep(NA, 2), 1, 2)), col2 = rep(c(1,2), each = 2), groupby = c(1,1,2,2))
this again trips up data.table when calling
dt[, lapply(.SD, sumNA), .SDcols = c("col1", "col2"), by = "groupby"]
whereas I would usually expect R to dynamically change the typing of integer to double if needed, the same way e.g. class(as.integer(1) + as.double(1)) won't throw an error but simply return "numeric".

I am aware that this is not a huge issue and with some programming knowledge it should be easy to fix, but I find the very strong typing slightly peculiar and it might throw off users with less coding experience.

@avimallu
Copy link
Contributor

avimallu commented Feb 26, 2022

From the discussions on Github, automatic optimization of ifelse has not been made to fifelse. Some more comments that might help shed some light:

  1. fifelse is not a complete replacement for ifelse. See discussion here.
  2. There is a PR pending that allows for this, but is not yet implemented for the reason in (1).
  3. If you want automatic typecasting to the relevant NA value, it should already be there in fifelse, implemented via this PR. It is not yet implemented for fcase, however.

My terminology might be inaccurate for some of the described operations, but I hope this is helpful.

@jangorecki
Copy link
Member

jangorecki commented Feb 26, 2022

Instead of coercing to double you can always use NA_real_. This will not make it work well for mixed types on input, as example you asked. Best practice way to address this case is to have generic method and two methods, for double and for integers. Handling that in fifelse is another option, handling that in [ groupby call is probably not feasible because that means that for results of every group we have to check if types matches and then adjust them accordingly, which for low cardinality grouping will not be a big deal, but when there are many small groups will be definitely slowing down the aggregation.

@CGlemser
Copy link
Author

Alright, thank you both for your input! Then I will simply keep type stability better in mind in future functions and switch to fifelse where possible :)

@ben-schwen
Copy link
Member

Easiest option is probably to make a cast, e. g. as.numeric before returning in your function.

While as Jan pointed, it would be possible to do type checks and bump types if necessary, I think it would not really help users in the long run. Being aware of the different types of NA is something that useRs should learn imo.

@jangorecki
Copy link
Member

Closing.
As explained above, checking class of returned object for each group separately is too big overhead. Functions used in groupby should be type stable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants