Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fgroup_by when there are two rows of data: grouping incorrect #320

Closed
vinhdizzo opened this issue Sep 26, 2022 · 9 comments
Closed

fgroup_by when there are two rows of data: grouping incorrect #320

vinhdizzo opened this issue Sep 26, 2022 · 9 comments

Comments

@vinhdizzo
Copy link

Hi,

Running into an odd behavior with collapse's fgroup_by and fsum, and it appears to only happen when the data contains 2 rows. I'm on version 1.8.8 of collapse.

library(data.table)
library(collapse)

data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:2, ] %>% fgroup_by(c('Ethnicity', 'Gender')) %>% get_vars('Transfer') %>% fsum

gives:

          NA Transfer
      <char>    <num>
1: Ethnicity        1
2:    Gender        1

I'm expecting a total of 3 columns: Ethnicity, Gender, and Transfer. The results look fine when the # of rows are 1 and 2:

> data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:1, ] %>% fgroup_by(c('Ethnicity', 'Gender')) %>% get_vars('Transfer') %>% fsum
   Ethnicity Gender Transfer
      <char> <char>    <num>
1:     Asian   Male        1

> data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:3, ] %>% fgroup_by(c('Ethnicity', 'Gender')) %>% get_vars('Transfer') %>% fsum
   Ethnicity Gender Transfer
      <char> <char>    <num>
1:     Asian   Male        3
@vinhdizzo
Copy link
Author

Also, it appears to be an issue when a vector of column names are passed to fgroup_by. When the columns are passed via NSE, the issue is not there:

> data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:2, ] %>% fgroup_by(Ethnicity, Gender) %>% get_vars('Transfer') %>% fsum
   Ethnicity Gender Transfer
      <char> <char>    <num>
1:     Asian   Male        2

@vinhdizzo
Copy link
Author

It appears that if we modify the following line in fgroup_by,

if (length(e) == 1L && length(e[[1L]]) != length(.X[[1L]]) && 
    is.null(name)) {

to

if (length(e) == 1L  && is.null(name)) {

then things should work. However, I don't know what length(e[[1L]]) != length(.X[[1L]]) is checking for, so I don't know how safe it is to remove that.

Point is, the following should be run:

e <- .X[cols2int(e[[1L]], .X, names(.X), FALSE)]

SebKrantz added a commit that referenced this issue Sep 26, 2022
Merge branch 'development' of github.com:SebKrantz/collapse into development

# Conflicts:
#	R/GRP.R
#	R/indexing.R
@SebKrantz
Copy link
Owner

SebKrantz commented Sep 26, 2022

Thanks! The issue is tricky, because fgroup_by() can handle a great number of SE and NSE inputs, including:

fgroup_by(GGDC10S, Variable, Decade = floor(Year / 10) * 10) 
fgroup_by(GGDC10S, 1:3, 5)
fgroup_by(GGDC10S, c("Variable", "Country")) 
fgroup_by(GGDC10S, is.character) 
fgroup_by(GGDC10S, Country:Variable, Year) 
fgroup_by(GGDC10S, Country:Region, Var = Variable, Year) 
v <- c("Variable", "Country")
fgroup_by(GGDC10S, v) 

The inputs are dots <- substitute(list(...)) and e <- eval(dots, .X, parent.frame()). If dots is a NSE specification of a single grouping column e.g. fgroup_by(GGDC10S, Variable), then e[[1]] == GGDC10S$Variable will have the same length as nrow(GGDC10S) (or length(.X[[1]])). If not, then dots can be a vector of column names / indices or a selector function. So the line length(e[[1L]]) != length(.X[[1L]]) is critical, but as your example shows, not quite sufficient to dispatch in all circumstances. I have now used length(e) == 1L && (!is.symbol(dots[[2L]]) || length(e[[1L]]) != length(.X[[1L]]) || is.function(e[[1L]])) && is.null(name), to better detect SE inputs. Available now in the GH version.

@vinhdizzo
Copy link
Author

Understood, thanks @SebKrantz.

The next version of my package is going to depend on the collapse package. When do you anticipate version 1.8.9 to be released? So I can time things correctly. Thanks.

@SebKrantz
Copy link
Owner

SebKrantz commented Sep 27, 2022

I'm not sure exactly when 1.8.9 will be released, it may take a while still though. I suggest to move ahead with a safer programming approach. You have 2 options: you can use the collapv() function e.g.:

data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:2, ] %>% 
collapv(c('Ethnicity', 'Gender'), fsum, cols = "Transfer", keep.col.order = FALSE)

# Or (can add multiple aggregation functions)
data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:2, ] %>% 
collapv(c('Ethnicity', 'Gender'), custom = list(fsum = "Transfer"), keep.col.order = FALSE)

Or you could do things manually:

data <- data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:2, ]
g <- GRP(data, c('Ethnicity', 'Gender'), call = FALSE)
add_vars(g$groups, fsum(get_vars(data, "Transfer"), g, use.g.names = FALSE))

SebKrantz added a commit that referenced this issue Sep 27, 2022
SebKrantz added a commit that referenced this issue Sep 27, 2022
@SebKrantz
Copy link
Owner

SebKrantz commented Sep 27, 2022

Actually I realized even the proposed check is insufficient if your ode looks like this:

data <- data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:2, ]
vars <- c('Ethnicity', 'Gender')
cols <- "Transfer"
data %>% fgroup_by(vars) %>% get_vars(cols) %>% fsum

... because is.symbol(dots[[2L]]) will be TRUE and the length condition will also not sort this out. I have now used a different check to detect SE input: if(length(e) == 1L && (length(vars) == 0L || !any(vars == names(.X))) && is.null(name)), where vars = all.vars(dots), which also guards against this case. But this should convince you that if possible functions designed for NSE should ideally not be used with SE inputs in critical applications. In particular, if your data frame also had a column called vars, there is absolutely no way for me to distinguish between your external variable and that column. tidyverse has solved this problem using {rlang} and a new object called 'quosure', but collapse will not utilized this because of performance and dependency reasons.

SebKrantz added a commit that referenced this issue Sep 27, 2022
SebKrantz added a commit that referenced this issue Sep 27, 2022
@vinhdizzo
Copy link
Author

Thanks @SebKrantz! The collapv function works well for my use case.

@SebKrantz
Copy link
Owner

Cool. I‘d actually recommend you use it with the custom = list(fsum = vars)
argument, as this will give an error in fsum if some vars are non-numeric.
Otherwise fmode will be applied to non-numeric variables (including those
that have a class).

@vinhdizzo
Copy link
Author

Thanks @SebKrantz , I will leverage the custom argument as you have suggested. Loving the collapse package by the way :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants