Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug in grouping external variables - related to #495. #875

Closed
arunsrinivasan opened this issue Oct 10, 2014 · 5 comments · Fixed by #3569
Closed

Bug in grouping external variables - related to #495. #875

arunsrinivasan opened this issue Oct 10, 2014 · 5 comments · Fixed by #3569
Labels
Milestone

Comments

@arunsrinivasan
Copy link
Member

arunsrinivasan commented Oct 10, 2014

This bug is related to #495.

require(data.table)
DT = data.table(x=c(1,1,1,2,2), y=1:5)
z = 1:5
options(datatable.verbose=TRUE)

# correct answer (?)
options(datatable.optimize = Inf)
DT[, list(mean(z), mean(y)), by=x]
# GForce optimized j to 'list(gmean(z), gmean(y))'
#    x  V1  V2
#1: 1 2.0 2.0
#2: 2 4.5 4.5

# incorrect answer (?)
options(datatable.optimize = 1L) # no GForce
DT[, list(mean(z), mean(y)), by=x]
#    x V1  V2
#1: 1  3 2.0
#2: 2  3 4.5

Basically mean is computed on entire z in the second case (where mean gets optimised to fastmean internally). This is most likely because .SD doesn't have this variable in it. So it comes back to #495.

For the same reason, say calculating variance or standard deviation won't work, even if optimise value if Inf (because GForce isn't implemented for those functions).

options(datatable.optimize=Inf)
DT[, list(sd(z), sd(y)), by=x]
#    x       V1        V2
#1: 1 1.581139 1.0000000
#2: 2 1.581139 0.7071068

For now, using external variables for grouping has a bug. This observation came from this SO post. Thanks to drstevok.

@eantonya
Copy link
Contributor

eantonya commented Oct 10, 2014

Hold on. I would say the first answer is incorrect and the second one is correct. Imagine if z was instead 1:7 - what do you think is the correct answer?

@arunsrinivasan
Copy link
Member Author

arunsrinivasan commented Oct 10, 2014

I think the right answer is debatable. I'd argue that since it's a grouping operation, the variable referred to should be grouped as well. But I can see the other side of this argument as well.

In this case, it should end in an error (if the first one is correct), as it does (for mean).

More and more, I'm leaning towards your line of thinking though.

@eantonya
Copy link
Contributor

eantonya commented Oct 10, 2014

Well, the way I see it, the grouping simply divides up the data.table into pieces to do various (j-expression) operations on. When you refer to the column of a data.table in a group, it's obvious that you want the column for that piece of the data.table.

But when you're referring to some external object, there is no reason why it should also be subdivided into pieces just because it happens to be inside the j-expression - it's not part of the data.table (and additionally, from technical point of view is only divisible if and only if it has the exact correct type and size).

@anoopshah
Copy link

anoopshah commented Oct 27, 2014

I expect mean(z) to give the same as mean(z)[1], i.e. a one element vector containing the mean of z.
Similarly I would expect DT[, list(mean(z), mean(y)), by=x] to give the same results as DT[, list(mean(z)[1], mean(y)[1]), by=x]. But this is not the case using G-Force.

    > DT[, list(mean(z), mean(y)), by=x]
    Detected that j uses these columns: y 
    Finding groups (bysameorder=FALSE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
    lapply optimization is on, j unchanged as 'list(mean(z), mean(y))'
    GForce optimized j to 'list(gmean(z), gmean(y))'
       x  V1  V2
    1: 1 2.0 2.0
    2: 2 4.5 4.5
    > DT[, list(mean(z)[1], mean(y)[1]), by=x]
    Detected that j uses these columns: y 
    Finding groups (bysameorder=FALSE) ... done in 0secs. bysameorder=TRUE and o__ is length 0
    lapply optimization is on, j unchanged as 'list(mean(z)[1], mean(y)[1])'
    GForce is on, left j unchanged
    Old mean optimization is on, left j unchanged.
    Starting dogroups ... 
      memcpy contiguous groups took 0.000s for 2 groups
      eval(j) took 0.000s for 2 calls
    done dogroups in 0 secs
       x V1  V2
    1: 1  3 2.0
    2: 2  3 4.5
    > 

@arunsrinivasan
Copy link
Member Author

arunsrinivasan commented Oct 27, 2014

Yes but this once again is due to the same issue - providing a variable not in the data.table in j, while grouping. It'll go away once we handle that properly.

@arunsrinivasan arunsrinivasan modified the milestones: v2.0.0, v1.9.8 Mar 20, 2016
@mattdowle mattdowle removed this from the Candidate milestone May 10, 2018
@mattdowle mattdowle added this to the 1.12.4 milestone May 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants