Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

first function not working as fun.aggregate for dcast if fill argument is not provided #5390

Closed
iago-pssjd opened this issue May 23, 2022 · 6 comments · Fixed by #5549
Closed
Labels
bug reshape dcast melt

Comments

@iago-pssjd
Copy link
Contributor

iago-pssjd commented May 23, 2022

I would expect I could do:

set.seed(1)
DT <- data.table(v1 = rep(1:2, each = 6),
                 v2 = rep(rep(1:3, 2), each = 2),
                 v3 = rep(1:2, 6),
                 v4 = rnorm(6))
dcast(DT, v1 + v2 ~ ., value.var = "v4", fun.aggregate = data.table::first)
Error: Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.

However I get an error instead of

dcast(DT, v1 + v2 ~ ., value.var = "v4", fun.aggregate = \(.).[1])
   v1 v2          .
1:  1  1 -0.6264538
2:  1  2 -0.8356286
3:  1  3  0.3295078
4:  2  1 -0.6264538
5:  2  2 -0.8356286
6:  2  3  0.3295078

Is it expected or a bug?

@tdhock
Copy link
Member

tdhock commented May 25, 2022

Maybe the issue title could be misleading because first function works as fun.aggregate in dcast, as long as you provide your own fill argument:

> dcast(DT, v1 + v2  ~ ., value.var = "v4", fun.aggregate = data.table::first, fill=NA)
Key: <v1, v2>
      v1    v2          .
   <int> <int>      <num>
1:     1     1 -0.6264538
2:     1     2 -0.8356286
3:     1     3  0.3295078
4:     2     1 -0.6264538
5:     2     2 -0.8356286
6:     2     3  0.3295078

dcast is behaving as documented, although documentation could be clarified to indicate that only if fill=NULL then fun.aggregate is called on a 0-length vector to determine fill value. Current docs say:

    fill: Value with which to fill missing cells. If 'fun.aggregate' is
          present, takes the value by applying the function on a
          0-length vector.

In the example above it seems that since fill=NULL dcast uses fun.aggregate to compute a fill value, but the fill value is actually not used/needed. That may be a bug in dcast? I suggest updating dcast so that fun.aggregate is only used when needed:

    fill: Value with which to fill missing cells. If fill=NULL and 
         missing cells are present, then 'fun.aggregate' is
         used on a 0-length vector to obtain a fill value.

Another example would be where the fill argument is actually used/needed:

> (sub.DT <- DT[1:7])
      v1    v2    v3         v4
   <int> <int> <int>      <num>
1:     1     1     1 -0.6264538
2:     1     1     2  0.1836433
3:     1     2     1 -0.8356286
4:     1     2     2  1.5952808
5:     1     3     1  0.3295078
6:     1     3     2 -0.8204684
7:     2     1     1 -0.6264538
> dcast(sub.DT, v1 ~ v2, value.var = "v4", fun.aggregate = data.table::first, fill=NA)
Key: <v1>
      v1          1          2         3
   <int>      <num>      <num>     <num>
1:     1 -0.6264538 -0.8356286 0.3295078
2:     2 -0.6264538         NA        NA
> dcast(sub.DT, v1 ~ v2, value.var = "v4", fun.aggregate = data.table::first)
Error: Aggregating function(s) should take vector inputs and return a single value (length=1). However, function(s) returns length!=1. This value will have to be used to fill any missing combinations, and therefore must be length=1. Either override by setting the 'fill' argument explicitly or modify your function to handle this case appropriately.

I would argue that stopping with an error is reasonable in this case, since the user asked to aggregate using the first item, but there is no first item for two combinations.

The data.table::first docs do not mention what happens when the input has length=0, so this is probably an un-anticipated use case. We could fix by either (1) documenting the current behavior,

> data.table::first(numeric(0))
numeric(0)

or (2) changing first to return NA instead of numeric(0) in this case.

@tdhock tdhock added bug reshape dcast melt labels May 25, 2022
@iago-pssjd iago-pssjd changed the title first function not allowed as fun.aggregate for dcast first function not working as fun.aggregate for dcast if fill argument is not provided May 25, 2022
@iago-pssjd
Copy link
Contributor Author

Thanks! @tdhock, I modified the issue title.

@tdhock
Copy link
Member

tdhock commented Aug 8, 2022

#5168 adds na.rm=T support to first/last, but does not fix this issue: numeric(0) is still returned, but to fix this issue NA would have to be returned.

@tdhock
Copy link
Member

tdhock commented Nov 30, 2022

hey @ben-schwen do you think it makes sense for first/last to always return a vector with length=1, even in the case of an empty (length=0) input vector? If so, do you think it makes sense to add that change to #5168 or to create a new PR?

@ben-schwen
Copy link
Member

We could add an extra argument for that behavior, although, I'm not quite convinced since first/last are similar to head/tail and those also return fewer values if not enough are available.

Moreover, I wouldn't add it to #5168, since I don't know what the return value for first(integer(0), na.rm=TRUE) should be :D

@tdhock
Copy link
Member

tdhock commented Nov 30, 2022

ok.
according to the discussion in this issue, the return value of first(integer(0), na.rm=TRUE) should be NA_integer_.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug reshape dcast melt
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants