When using ".SDcols" and "by" in one call, and the function call produces NA for one group, but not the other, you get an error #5341

CGlemser · 2022-02-25T14:11:23Z

I have run into the following situation (simplified for reproducibility):

I created a function that returns the sum of a vector with all NAs removed and NA if all elements are NA
sumNA <- function(x) ifelse(all(is.na(x)), NA, sum(x, na.rm = TRUE))
then I want to apply it in a data.table to only certain columns grouped by another column
dt <- data.table(col1 = c(rep(NA, 2), 1, 2), col2 = rep(c(1,2), each = 2), groupby = c(1,1,2,2))
dt[, lapply(.SD, sumNA), .SDcols = c("col1", "col2"), by = "groupby"]
this returns an error since col1 is NA for all elements in group1, but has a sum value of 3 for the elements in group2. As a result data.table can't produce the column as the NA is coded as logical, but the result for group2 is double.
I have fixed this in my case by explicitly coercing the NA to be of the correct class in my function before returning the result... However, having this NA coercion happen in data.table directly instead would have been a very nice-to-have and could potentially save many users from having to code the same workaround in every function that returns NAs and takes several input classes
it also took me a while to figure out what caused the error message, so in case implementing the NA class coercion is not easily feasible, I also think that a more descriptive error message could be helpful

Output of sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=German_Germany.1252 LC_CTYPE=German_Germany.1252 LC_MONETARY=German_Germany.1252
[4] LC_NUMERIC=C LC_TIME=German_Germany.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] compiler_4.1.1 cli_3.1.0 tools_4.1.1 data.table_1.14.2 rlang_1.0.

The text was updated successfully, but these errors were encountered:

jangorecki · 2022-02-25T17:16:03Z

Maybe you can just return double type NA from your function instead of logical NA? Otherwise your function is not type stable.

CGlemser · 2022-02-25T17:46:21Z

Yes, I am aware of workarounds that will make it work and I have implemented it in my use case - I was just wondering if this is not something that could (and maybe should) be done by data.table implicitly, in particular cause we see this type of dynamic type conversion pretty much everywhere else in R. One example is if I for example use
sumNA <- function(x) ifelse(all(is.na(x)), as.double(NA), sum(x, na.rm = TRUE))
but one column is type integer
dt <- data.table(col1 = as.integer(c(rep(NA, 2), 1, 2)), col2 = rep(c(1,2), each = 2), groupby = c(1,1,2,2))
this again trips up data.table when calling
dt[, lapply(.SD, sumNA), .SDcols = c("col1", "col2"), by = "groupby"]
whereas I would usually expect R to dynamically change the typing of integer to double if needed, the same way e.g. class(as.integer(1) + as.double(1)) won't throw an error but simply return "numeric".

I am aware that this is not a huge issue and with some programming knowledge it should be easy to fix, but I find the very strong typing slightly peculiar and it might throw off users with less coding experience.

avimallu · 2022-02-26T00:20:21Z

From the discussions on Github, automatic optimization of ifelse has not been made to fifelse. Some more comments that might help shed some light:

fifelse is not a complete replacement for ifelse. See discussion here.
There is a PR pending that allows for this, but is not yet implemented for the reason in (1).
If you want automatic typecasting to the relevant NA value, it should already be there in fifelse, implemented via this PR. It is not yet implemented for fcase, however.

My terminology might be inaccurate for some of the described operations, but I hope this is helpful.

jangorecki · 2022-02-26T08:05:34Z

Instead of coercing to double you can always use NA_real_. This will not make it work well for mixed types on input, as example you asked. Best practice way to address this case is to have generic method and two methods, for double and for integers. Handling that in fifelse is another option, handling that in [ groupby call is probably not feasible because that means that for results of every group we have to check if types matches and then adjust them accordingly, which for low cardinality grouping will not be a big deal, but when there are many small groups will be definitely slowing down the aggregation.

CGlemser · 2022-02-26T10:06:36Z

Alright, thank you both for your input! Then I will simply keep type stability better in mind in future functions and switch to fifelse where possible :)

ben-schwen · 2022-02-26T12:10:06Z

Easiest option is probably to make a cast, e. g. as.numeric before returning in your function.

While as Jan pointed, it would be possible to do type checks and bump types if necessary, I think it would not really help users in the long run. Being aware of the different types of NA is something that useRs should learn imo.

jangorecki · 2023-09-30T06:27:06Z

Closing.
As explained above, checking class of returned object for each group separately is too big overhead. Functions used in groupby should be type stable.

jangorecki closed this as completed Sep 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

When using ".SDcols" and "by" in one call, and the function call produces NA for one group, but not the other, you get an error #5341

When using ".SDcols" and "by" in one call, and the function call produces NA for one group, but not the other, you get an error #5341

CGlemser commented Feb 25, 2022 •

edited

Loading

jangorecki commented Feb 25, 2022 •

edited

Loading

CGlemser commented Feb 25, 2022

avimallu commented Feb 26, 2022 •

edited

Loading

jangorecki commented Feb 26, 2022 •

edited

Loading

CGlemser commented Feb 26, 2022

ben-schwen commented Feb 26, 2022

jangorecki commented Sep 30, 2023

When using ".SDcols" and "by" in one call, and the function call produces NA for one group, but not the other, you get an error #5341

When using ".SDcols" and "by" in one call, and the function call produces NA for one group, but not the other, you get an error #5341

Comments

CGlemser commented Feb 25, 2022 • edited Loading

jangorecki commented Feb 25, 2022 • edited Loading

CGlemser commented Feb 25, 2022

avimallu commented Feb 26, 2022 • edited Loading

jangorecki commented Feb 26, 2022 • edited Loading

CGlemser commented Feb 26, 2022

ben-schwen commented Feb 26, 2022

jangorecki commented Sep 30, 2023

CGlemser commented Feb 25, 2022 •

edited

Loading

jangorecki commented Feb 25, 2022 •

edited

Loading

avimallu commented Feb 26, 2022 •

edited

Loading

jangorecki commented Feb 26, 2022 •

edited

Loading