`fgroup_by` when there are two rows of data: grouping incorrect #320

vinhdizzo · 2022-09-26T20:02:52Z

Hi,

Running into an odd behavior with collapse's fgroup_by and fsum, and it appears to only happen when the data contains 2 rows. I'm on version 1.8.8 of collapse.

library(data.table)
library(collapse)

data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:2, ] %>% fgroup_by(c('Ethnicity', 'Gender')) %>% get_vars('Transfer') %>% fsum

gives:

          NA Transfer
      <char>    <num>
1: Ethnicity        1
2:    Gender        1

I'm expecting a total of 3 columns: Ethnicity, Gender, and Transfer. The results look fine when the # of rows are 1 and 2:

> data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:1, ] %>% fgroup_by(c('Ethnicity', 'Gender')) %>% get_vars('Transfer') %>% fsum
   Ethnicity Gender Transfer
      <char> <char>    <num>
1:     Asian   Male        1

> data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:3, ] %>% fgroup_by(c('Ethnicity', 'Gender')) %>% get_vars('Transfer') %>% fsum
   Ethnicity Gender Transfer
      <char> <char>    <num>
1:     Asian   Male        3

The text was updated successfully, but these errors were encountered:

vinhdizzo · 2022-09-26T20:09:30Z

Also, it appears to be an issue when a vector of column names are passed to fgroup_by. When the columns are passed via NSE, the issue is not there:

> data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:2, ] %>% fgroup_by(Ethnicity, Gender) %>% get_vars('Transfer') %>% fsum
   Ethnicity Gender Transfer
      <char> <char>    <num>
1:     Asian   Male        2

vinhdizzo · 2022-09-26T20:35:31Z

It appears that if we modify the following line in fgroup_by,

if (length(e) == 1L && length(e[[1L]]) != length(.X[[1L]]) && 
    is.null(name)) {

to

if (length(e) == 1L  && is.null(name)) {

then things should work. However, I don't know what length(e[[1L]]) != length(.X[[1L]]) is checking for, so I don't know how safe it is to remove that.

Point is, the following should be run:

e <- .X[cols2int(e[[1L]], .X, names(.X), FALSE)]

…320.

…320. -2.

Merge branch 'development' of github.com:SebKrantz/collapse into development # Conflicts: # R/GRP.R # R/indexing.R

SebKrantz · 2022-09-26T23:28:43Z

Thanks! The issue is tricky, because fgroup_by() can handle a great number of SE and NSE inputs, including:

fgroup_by(GGDC10S, Variable, Decade = floor(Year / 10) * 10) 
fgroup_by(GGDC10S, 1:3, 5)
fgroup_by(GGDC10S, c("Variable", "Country")) 
fgroup_by(GGDC10S, is.character) 
fgroup_by(GGDC10S, Country:Variable, Year) 
fgroup_by(GGDC10S, Country:Region, Var = Variable, Year) 
v <- c("Variable", "Country")
fgroup_by(GGDC10S, v)

The inputs are dots <- substitute(list(...)) and e <- eval(dots, .X, parent.frame()). If dots is a NSE specification of a single grouping column e.g. fgroup_by(GGDC10S, Variable), then e[[1]] == GGDC10S$Variable will have the same length as nrow(GGDC10S) (or length(.X[[1]])). If not, then dots can be a vector of column names / indices or a selector function. So the line length(e[[1L]]) != length(.X[[1L]]) is critical, but as your example shows, not quite sufficient to dispatch in all circumstances. I have now used length(e) == 1L && (!is.symbol(dots[[2L]]) || length(e[[1L]]) != length(.X[[1L]]) || is.function(e[[1L]])) && is.null(name), to better detect SE inputs. Available now in the GH version.

vinhdizzo · 2022-09-27T01:47:21Z

Understood, thanks @SebKrantz.

The next version of my package is going to depend on the collapse package. When do you anticipate version 1.8.9 to be released? So I can time things correctly. Thanks.

SebKrantz · 2022-09-27T06:56:12Z

I'm not sure exactly when 1.8.9 will be released, it may take a while still though. I suggest to move ahead with a safer programming approach. You have 2 options: you can use the collapv() function e.g.:

data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:2, ] %>% 
collapv(c('Ethnicity', 'Gender'), fsum, cols = "Transfer", keep.col.order = FALSE)

# Or (can add multiple aggregation functions)
data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:2, ] %>% 
collapv(c('Ethnicity', 'Gender'), custom = list(fsum = "Transfer"), keep.col.order = FALSE)

Or you could do things manually:

data <- data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:2, ]
g <- GRP(data, c('Ethnicity', 'Gender'), call = FALSE)
add_vars(g$groups, fsum(get_vars(data, "Transfer"), g, use.g.names = FALSE))

Better solution for #320.

SebKrantz · 2022-09-27T07:34:23Z

Actually I realized even the proposed check is insufficient if your ode looks like this:

data <- data.table(Ethnicity=c('Asian','Asian', 'Asian'), Gender=c('Male'), Transfer=1)[1:2, ]
vars <- c('Ethnicity', 'Gender')
cols <- "Transfer"
data %>% fgroup_by(vars) %>% get_vars(cols) %>% fsum

... because is.symbol(dots[[2L]]) will be TRUE and the length condition will also not sort this out. I have now used a different check to detect SE input: if(length(e) == 1L && (length(vars) == 0L || !any(vars == names(.X))) && is.null(name)), where vars = all.vars(dots), which also guards against this case. But this should convince you that if possible functions designed for NSE should ideally not be used with SE inputs in critical applications. In particular, if your data frame also had a column called vars, there is absolutely no way for me to distinguish between your external variable and that column. tidyverse has solved this problem using {rlang} and a new object called 'quosure', but collapse will not utilized this because of performance and dependency reasons.

Small tweak for #320 solution.

vinhdizzo · 2022-09-27T22:50:27Z

Thanks @SebKrantz! The collapv function works well for my use case.

SebKrantz · 2022-09-28T07:46:08Z

Cool. I‘d actually recommend you use it with the custom = list(fsum = vars)
argument, as this will give an error in fsum if some vars are non-numeric.
Otherwise fmode will be applied to non-numeric variables (including those
that have a class).

vinhdizzo · 2022-09-28T16:31:56Z

Thanks @SebKrantz , I will leverage the custom argument as you have suggested. Loving the collapse package by the way :)

SebKrantz added a commit that referenced this issue Sep 26, 2022

More secure checking of inputs to fgroup_by() and findex_by(). Fixes #…

fd11189

…320.

SebKrantz added a commit that referenced this issue Sep 26, 2022

More secure checking of inputs to fgroup_by() and findex_by(). Fixes #…

a5f5fad

…320. -2.

SebKrantz added a commit that referenced this issue Sep 26, 2022

Fixes #320.

6834201

Merge branch 'development' of github.com:SebKrantz/collapse into development # Conflicts: # R/GRP.R # R/indexing.R

SebKrantz closed this as completed in 6907bf6 Sep 26, 2022

SebKrantz added a commit that referenced this issue Sep 27, 2022

Better solution for #320.

ca3f39d

SebKrantz added a commit that referenced this issue Sep 27, 2022

Merge pull request #322 from SebKrantz/development

79b9fc1

Better solution for #320.

SebKrantz added a commit that referenced this issue Sep 27, 2022

Small tweak for #320 solution.

3736505

SebKrantz added a commit that referenced this issue Sep 27, 2022

Merge pull request #323 from SebKrantz/development

51a28ed

Small tweak for #320 solution.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

`fgroup_by` when there are two rows of data: grouping incorrect #320

`fgroup_by` when there are two rows of data: grouping incorrect #320

vinhdizzo commented Sep 26, 2022

vinhdizzo commented Sep 26, 2022

vinhdizzo commented Sep 26, 2022

SebKrantz commented Sep 26, 2022 •

edited

Loading

vinhdizzo commented Sep 27, 2022

SebKrantz commented Sep 27, 2022 •

edited

Loading

SebKrantz commented Sep 27, 2022 •

edited

Loading

vinhdizzo commented Sep 27, 2022

SebKrantz commented Sep 28, 2022

vinhdizzo commented Sep 28, 2022

fgroup_by when there are two rows of data: grouping incorrect #320

fgroup_by when there are two rows of data: grouping incorrect #320

Comments

vinhdizzo commented Sep 26, 2022

vinhdizzo commented Sep 26, 2022

vinhdizzo commented Sep 26, 2022

SebKrantz commented Sep 26, 2022 • edited Loading

vinhdizzo commented Sep 27, 2022

SebKrantz commented Sep 27, 2022 • edited Loading

SebKrantz commented Sep 27, 2022 • edited Loading

vinhdizzo commented Sep 27, 2022

SebKrantz commented Sep 28, 2022

vinhdizzo commented Sep 28, 2022

`fgroup_by` when there are two rows of data: grouping incorrect #320

`fgroup_by` when there are two rows of data: grouping incorrect #320

SebKrantz commented Sep 26, 2022 •

edited

Loading

SebKrantz commented Sep 27, 2022 •

edited

Loading

SebKrantz commented Sep 27, 2022 •

edited

Loading