Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GForce shutoff for non-standard classes #3546

Closed
wants to merge 6 commits into from
Closed

Conversation

MichaelChirico
Copy link
Member

@MichaelChirico MichaelChirico commented May 6, 2019

Closes #3079
Closes #1876
Closes #3533

@MichaelChirico MichaelChirico changed the title [WIP] GForce shutoff for non-standard classes GForce shutoff for non-standard classes May 6, 2019
@MichaelChirico
Copy link
Member Author

Two old tests were altered/eliminated here; see comments on each

DT = data.table(id=1:2, val1=6:1, val2=6:1) # 5380
test(1199, DT[, sum(.SD), by=id], error="GForce sum can only be applied to columns, not .SD or similar.*looking for.*lapply\\(.SD")
#DT = data.table(id=1:2, val1=6:1, val2=6:1) # 5380
#test(1199, DT[, sum(.SD), by=id], error="GForce sum can only be applied to columns, not .SD or similar.*looking for.*lapply\\(.SD")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this ever should have been an error... Behavior in this PR is to degrade from trying to GForce sum(.SD) and erroring to just never trying to GForce in the first place.

The error is nice in that it provides feedback that the user may be doing it wrong...

Potentially we could add some more checks for this specific case and continue to pass this test?

@codecov
Copy link

codecov bot commented May 6, 2019

Codecov Report

Merging #3546 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3546      +/-   ##
==========================================
+ Coverage   97.45%   97.45%   +<.01%     
==========================================
  Files          66       66              
  Lines       12801    12809       +8     
==========================================
+ Hits        12475    12483       +8     
  Misses        326      326
Impacted Files Coverage Δ
R/data.table.R 97.66% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 572b6e4...99b8dcb. Read the comment docs.

@@ -7039,6 +7039,11 @@ test(1521, x[, b := 5], data.table(a=c(1,2), b=5))

# Fix for #1160, fastmean retaining attributes
x = data.table(a = c(2,2,1,1,2), b=setattr(1:5, 'class', c('bla', 'integer')))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is building explicitly wrong (IMO) behavior by expecting fastmean to run on bla-class objects.

Essentially it's forcing mean.default to be run on all classes & then returning the object with the old class... As we've seen in #3533 etc, this implicit bypassing of the method dispatch can be ill-advised.

IIUC that's why mean.default will coerce to numeric instead of retaining c('bla', 'integer') -- especially the integer part feels explicitly wrong.

x = data.table(a = c(2,2,1,1,2), b=1:5)
setattr(x$b, 'class', c('bla', 'integer'))
class(mean(x$b))
# [1] numeric

The original issue of #1160 would have been fixed by this PR as well (just not using fastmean), so leaving the test in and adjusting by providing a mean.bla method. @arunsrinivasan made the original PR, any thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only that there is one pending issue related to fast mean #2799

@mattdowle
Copy link
Member

mattdowle commented May 14, 2019

As we've seen in #3533 etc, this implicit bypassing of the method dispatch can be ill-advised.

But that issue is very unusual in that mean(c(20,350)) is not 185 but 5 (in the circular package). It's long-standing behavior that data.table calculates summary stats (mean, sum, sd, min, max) on the underlying type and then copies the class attributes on to the result. The class attributes are most often used for class-specific formatting. For unusual packages like circular, needing to write out the method name mean.circular(c(20,350)) seems reasonable to me as it reminds the reader of the code that it is not a standard mean. Either that or turn off optimization when used with circular.

The cause of #3079 is we're missing a copyMostAttrib in gmedian which is inconsistent with almost all the other gforce functions which already do that copy.

@mattdowle mattdowle closed this May 14, 2019
@MichaelChirico
Copy link
Member Author

MichaelChirico commented May 14, 2019

I think I disagree pretty strongly here, this leads to bad user experience, and also there's no guarantee for mean methods to be exported (we often don't export methods, for example). So

  1. New users have to be thoroughly aware that we're bypassing method dispatch (adversarial coding), and the only reason we're doing so is for speed, accuracy be damned. Unaware users can have silently incorrect answers if they're not scrutinizing the output of mean carefully.
  2. We're consigning people to use ::: which is bad practice and forces users to use undocumented objects in their code (which in turn will warn on CRAN), the alternative being that again, they have to know data.table front-to-back and be aware to shut off datatable.optimize.

I believe the approach here is much more user-friendly -- expert users can as.numeric their exotic classes if they really want GForce efficiency. And the fact that we have three outstanding related issues on this means that users are getting caught off-guard by the default behavior already.

Is there a compromise approach? Perhaps using this PR but also issuing a warning or verbose message that efficiency is lost due to class decoration?

@mattdowle
Copy link
Member

mattdowle commented May 14, 2019

This PR turns off gforce optimization for any object with a class, doesn't it? Just because of one case (circular) where the mean is not the mean.

And the fact that we have three outstanding related issues on this means that users are getting caught off-guard by the default behavior already.

I wrote that the cause of #3079 is we're missing a copyMostAttrib in gmedian which is inconsistent with almost all the other gforce functions which already do that copy. On #1876 I don't quite follow that one yet, but I don't think it's proof (yet) that gforce needs turning off for any vector with a class. I mentioned both here: #3533 (comment). Assuming #1876 is different, it's one case (#3533, circular) not 3.

accuracy be damned

I think this is a bit unfair to write.

@MichaelChirico
Copy link
Member Author

MichaelChirico commented May 14, 2019

This PR turns off gforce optimization for any object with a class, doesn't it? Just because of one case (circular) where the mean is not the mean.

Yes, but we also don't really know that circular is the only "culprit" (culprit used very loosely -- the point of S3 dispatch is you are allowed to define any mean you'd like and users don't really have to know the implementation details). I see a few mean methods exported in my (probably relatively small) .libPaths()[1L]:

unlist(lapply(list.files(.libPaths()[1L]), function(pkg) ls(envir = tryCatch(asNamespace(pkg), error = function(e) .GlobalEnv), pattern = '^mean\\.')))
# [1] "mean.Date"      "mean.default"   "mean.difftime"  "mean.POSIXct"   "mean.POSIXlt"   "mean.integer64" "mean.circular" 
#  [8] "mean.IDate"     "mean.quosure"   "mean.Surv"      "mean.units"     "mean.xts"       "mean.yearmon"   "mean.yearqtr"  
# [15] "mean.zoo"  

Only a few of those are exported:

find . -type f -print | grep NAMESPACE | xargs grep "S3method[(]mean,"
./data.table/NAMESPACE:S3method(mean, IDate)
./rlang/NAMESPACE:S3method(mean,quosure)
./units/NAMESPACE:S3method(mean,units)
./circular/NAMESPACE:S3method(mean, circular)

"the mean is not the mean" is not for us to decide. It's very heavy-handed of us to bypass method dispatch and say "your class wanted to do a mean this way, but trust us, the arithmetic mean is what you want". It's easy to imagine creating a class that would default to using geometric or harmonic means, for example, which are perfectly valid "mean"s, statistically speaking.

I think this is a bit unfair to write.

It was strongly worded & a bit flip, but the kernel holds true. Bypassing method dispatch for speed automatically is saying that accuracy is not paramount; speed is. We spent a lot of effort in #3056 trying to prevent users from potentially getting silently wrong results; does that diligence not apply here?

@jangorecki
Copy link
Member

Agree with Michael that we should not automatically ignore method dispatch. It will be more safe and still user friendly if we would export gmean/fmean so user who want to ignore method dispatch can call those directly.

@mattdowle
Copy link
Member

mattdowle commented May 14, 2019

@jangorecki

It will be more safe and still user friendly if we would export gmean/fmean so user who want to ignore method dispatch can call those directly.

You realize this is big change in ethos and will require a lot of people to change. Do I understand your suggestion correctly? How is it still user friendly to require (a lot of) users to start to use new g functions explicitly otherwise there could be significant slow downs.

@MichaelChirico

Bypassing method dispatch for speed automatically is saying that accuracy is not paramount; speed is. We spent a lot of effort in #3056 trying to prevent users from potentially getting silently wrong results; does that diligence not apply here?

Because nobody has proposed a solution yet that doesn't have a greater cost. The title of this PR refers to "non-standard" classes, but really it shuts off gforce for any class, for lots of standard classes.

In general, data.table does break some eggs to make an omelette. For example, := is not perfect. By your argument, we should shut down := too.

fastmean was probably 10 years ago, and gforce maybe 5 years old now. Both have always done it this way and in all that time it's only now recently we get one report where it doesn't work: circular.

How about adding a blacklist of classes for which gforce should be turned off. circular would be the first entry. In the same way that users can add packages to the cedta whitelist, they could add to a new gforce.off list.

@jangorecki
Copy link
Member

How is it still user friendly to require (a lot of) users to start to use new g functions explicitly otherwise there could be significant slow downs.

What I meant is that gmean would be required only when user want to skip S3 dispatch, normally mean will optimise to gmean unless there is a "non-standard" class having defined mean method.
Black/white listing is another option as it doesn't come out often.

@MichaelChirico
Copy link
Member Author

lots of standard classes

I'm not seeing the cost that much. We also only have one test in the suite that relies on GForce applying to non-numeric class. So as much as we don't have ample evidence users are getting bit by current practice, we also don't have much evidence that people are relying on the current behavior. We could potentially run rev-dep on this branch? That would also give empirical weight to whether this change imposes "great cost".

Getting GForce to work with this PR is as simple as applying mean to the as.numeric version of your column, if artihmetic mean is what you intend. If your class is really just decorated-numeric (e.g. something that has nice print methods but storage mode is simple numeric and mean.default applies), as.numeric should be almost instant even on large data.

Another alternative is to use mean.default as an entry point for GForce, since it is definitely exported -- users could use that to signal it's safe for us to redirect to GForce regardless of class.

How about adding a blacklist of classes for which gforce should be turned off

The worry is still that unaware users can end up with silently wrong results. Whereas with cedta, the worst that can happen is some extra printing.

@mattdowle
Copy link
Member

We also only have one test in the suite that relies on GForce applying to non-numeric class.

I'm not following what you mean by "non-numeric class". This issue and PR isn't about character data per se, for example. is.vector() is used in this PR for its feature of considering a vector with a class attribute not be a vector. Maybe you mean non-vector class rather than non-numeric class. The fact we're missing tests just shows that we thought one test was sufficient to make sure class attribute was retained by gforce functions. But over time the gforce abilities have been extended and one (gmedian) was missed. So we need more tests.

We could potentially run rev-dep on this branch? That would also give empirical weight to whether this change imposes "great cost".

revdep wouldn't show anything. revdep is for correctness (so much as revdep package tests cover). But this cost would be a slow-down.

Getting GForce to work with this PR is as simple as applying mean to the as.numeric version of your column, if artihmetic mean is what you intend.

I'm not following what you mean here. That sounds to me like they would have to know and then do something new manually. Currently gforce is an automatic optimization. It's not about the compute speed of having to do as.numeric, it's the inconvenience of users having to change their code to write as.numeric.

Another alternative is to use mean.default as an entry point for GForce, since it is definitely exported -- users could use that to signal it's safe for us to redirect to GForce regardless of class.

Again - I don't know what this means. What is an entry point for GForce ?

The worry is still that unaware users can end up with silently wrong results.

Not if circular is added to a list. The first known example in 10 years. If more are discovered, we can think again.

Whereas with cedta, the worst that can happen is some extra printing.

That's not exactly true. It's much better now that DT[,j] does do what DF[,j] does when j is a numeric or character vector, but it used to be that j was returned straight back and it could be a silent wrong-result. There are still some wrinkles related to that, but it's much better but not perfect.

@MichaelChirico
Copy link
Member Author

revdep wouldn't show anything

we could edit this branch to error instead of setting GForce = FALSE and try then

they would have to know and then do something new manually

Yes. I'm fine with this personally (perhaps pending some numeric evidence via the revdep check)

What is an entry point for GForce

Another case of users having to know and then doing something manually. This time, instead of as.numeric (which could have a memory hit of storing two columns), users fine with using arithmetic mean on non-strictly-numeric columns do so by having mean.default(col) in j.

This was referenced May 15, 2019
@MichaelChirico MichaelChirico deleted the gforce_dispatch branch December 20, 2019 08:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants