GForce shutoff for non-standard classes #3546

MichaelChirico · 2019-05-06T05:58:35Z

Closes #3079
Closes #1876
Closes #3533

MichaelChirico · 2019-05-06T10:04:29Z

Two old tests were altered/eliminated here; see comments on each

MichaelChirico · 2019-05-06T10:07:52Z

inst/tests/tests.Rraw

-DT = data.table(id=1:2, val1=6:1, val2=6:1)   # 5380
-test(1199, DT[, sum(.SD), by=id], error="GForce sum can only be applied to columns, not .SD or similar.*looking for.*lapply\\(.SD")
+#DT = data.table(id=1:2, val1=6:1, val2=6:1)   # 5380
+#test(1199, DT[, sum(.SD), by=id], error="GForce sum can only be applied to columns, not .SD or similar.*looking for.*lapply\\(.SD")


I'm not sure this ever should have been an error... Behavior in this PR is to degrade from trying to GForce sum(.SD) and erroring to just never trying to GForce in the first place.

The error is nice in that it provides feedback that the user may be doing it wrong...

Potentially we could add some more checks for this specific case and continue to pass this test?

codecov · 2019-05-06T10:14:41Z

Codecov Report

Merging #3546 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3546      +/-   ##
==========================================
+ Coverage   97.45%   97.45%   +<.01%     
==========================================
  Files          66       66              
  Lines       12801    12809       +8     
==========================================
+ Hits        12475    12483       +8     
  Misses        326      326

Impacted Files	Coverage Δ
R/data.table.R	`97.66% <100%> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 572b6e4...99b8dcb. Read the comment docs.

MichaelChirico · 2019-05-06T10:15:54Z

inst/tests/tests.Rraw

@@ -7039,6 +7039,11 @@ test(1521, x[, b := 5], data.table(a=c(1,2), b=5))

 # Fix for #1160, fastmean retaining attributes
 x = data.table(a = c(2,2,1,1,2), b=setattr(1:5, 'class', c('bla', 'integer')))


This test is building explicitly wrong (IMO) behavior by expecting fastmean to run on bla-class objects.

Essentially it's forcing mean.default to be run on all classes & then returning the object with the old class... As we've seen in #3533 etc, this implicit bypassing of the method dispatch can be ill-advised.

IIUC that's why mean.default will coerce to numeric instead of retaining c('bla', 'integer') -- especially the integer part feels explicitly wrong.

x = data.table(a = c(2,2,1,1,2), b=1:5) setattr(x$b, 'class', c('bla', 'integer')) class(mean(x$b)) # [1] numeric

The original issue of #1160 would have been fixed by this PR as well (just not using fastmean), so leaving the test in and adjusting by providing a mean.bla method. @arunsrinivasan made the original PR, any thoughts?

only that there is one pending issue related to fast mean #2799

mattdowle · 2019-05-14T00:03:57Z

As we've seen in #3533 etc, this implicit bypassing of the method dispatch can be ill-advised.

But that issue is very unusual in that mean(c(20,350)) is not 185 but 5 (in the circular package). It's long-standing behavior that data.table calculates summary stats (mean, sum, sd, min, max) on the underlying type and then copies the class attributes on to the result. The class attributes are most often used for class-specific formatting. For unusual packages like circular, needing to write out the method name mean.circular(c(20,350)) seems reasonable to me as it reminds the reader of the code that it is not a standard mean. Either that or turn off optimization when used with circular.

The cause of #3079 is we're missing a copyMostAttrib in gmedian which is inconsistent with almost all the other gforce functions which already do that copy.

MichaelChirico · 2019-05-14T00:12:19Z

I think I disagree pretty strongly here, this leads to bad user experience, and also there's no guarantee for mean methods to be exported (we often don't export methods, for example). So

New users have to be thoroughly aware that we're bypassing method dispatch (adversarial coding), and the only reason we're doing so is for speed, accuracy be damned. Unaware users can have silently incorrect answers if they're not scrutinizing the output of mean carefully.
We're consigning people to use ::: which is bad practice and forces users to use undocumented objects in their code (which in turn will warn on CRAN), the alternative being that again, they have to know data.table front-to-back and be aware to shut off datatable.optimize.

I believe the approach here is much more user-friendly -- expert users can as.numeric their exotic classes if they really want GForce efficiency. And the fact that we have three outstanding related issues on this means that users are getting caught off-guard by the default behavior already.

Is there a compromise approach? Perhaps using this PR but also issuing a warning or verbose message that efficiency is lost due to class decoration?

mattdowle · 2019-05-14T05:53:39Z

This PR turns off gforce optimization for any object with a class, doesn't it? Just because of one case (circular) where the mean is not the mean.

And the fact that we have three outstanding related issues on this means that users are getting caught off-guard by the default behavior already.

I wrote that the cause of #3079 is we're missing a copyMostAttrib in gmedian which is inconsistent with almost all the other gforce functions which already do that copy. On #1876 I don't quite follow that one yet, but I don't think it's proof (yet) that gforce needs turning off for any vector with a class. I mentioned both here: #3533 (comment). Assuming #1876 is different, it's one case (#3533, circular) not 3.

accuracy be damned

I think this is a bit unfair to write.

MichaelChirico · 2019-05-14T06:29:09Z

This PR turns off gforce optimization for any object with a class, doesn't it? Just because of one case (circular) where the mean is not the mean.

Yes, but we also don't really know that circular is the only "culprit" (culprit used very loosely -- the point of S3 dispatch is you are allowed to define any mean you'd like and users don't really have to know the implementation details). I see a few mean methods exported in my (probably relatively small) .libPaths()[1L]:

unlist(lapply(list.files(.libPaths()[1L]), function(pkg) ls(envir = tryCatch(asNamespace(pkg), error = function(e) .GlobalEnv), pattern = '^mean\\.')))
# [1] "mean.Date"      "mean.default"   "mean.difftime"  "mean.POSIXct"   "mean.POSIXlt"   "mean.integer64" "mean.circular" 
#  [8] "mean.IDate"     "mean.quosure"   "mean.Surv"      "mean.units"     "mean.xts"       "mean.yearmon"   "mean.yearqtr"  
# [15] "mean.zoo"

Only a few of those are exported:

find . -type f -print | grep NAMESPACE | xargs grep "S3method[(]mean,"
./data.table/NAMESPACE:S3method(mean, IDate)
./rlang/NAMESPACE:S3method(mean,quosure)
./units/NAMESPACE:S3method(mean,units)
./circular/NAMESPACE:S3method(mean, circular)

"the mean is not the mean" is not for us to decide. It's very heavy-handed of us to bypass method dispatch and say "your class wanted to do a mean this way, but trust us, the arithmetic mean is what you want". It's easy to imagine creating a class that would default to using geometric or harmonic means, for example, which are perfectly valid "mean"s, statistically speaking.

I think this is a bit unfair to write.

It was strongly worded & a bit flip, but the kernel holds true. Bypassing method dispatch for speed automatically is saying that accuracy is not paramount; speed is. We spent a lot of effort in #3056 trying to prevent users from potentially getting silently wrong results; does that diligence not apply here?

jangorecki · 2019-05-14T06:36:29Z

Agree with Michael that we should not automatically ignore method dispatch. It will be more safe and still user friendly if we would export gmean/fmean so user who want to ignore method dispatch can call those directly.

mattdowle · 2019-05-14T06:56:59Z

@jangorecki

It will be more safe and still user friendly if we would export gmean/fmean so user who want to ignore method dispatch can call those directly.

You realize this is big change in ethos and will require a lot of people to change. Do I understand your suggestion correctly? How is it still user friendly to require (a lot of) users to start to use new g functions explicitly otherwise there could be significant slow downs.

@MichaelChirico

Bypassing method dispatch for speed automatically is saying that accuracy is not paramount; speed is. We spent a lot of effort in #3056 trying to prevent users from potentially getting silently wrong results; does that diligence not apply here?

Because nobody has proposed a solution yet that doesn't have a greater cost. The title of this PR refers to "non-standard" classes, but really it shuts off gforce for any class, for lots of standard classes.

In general, data.table does break some eggs to make an omelette. For example, := is not perfect. By your argument, we should shut down := too.

fastmean was probably 10 years ago, and gforce maybe 5 years old now. Both have always done it this way and in all that time it's only now recently we get one report where it doesn't work: circular.

How about adding a blacklist of classes for which gforce should be turned off. circular would be the first entry. In the same way that users can add packages to the cedta whitelist, they could add to a new gforce.off list.

jangorecki · 2019-05-14T08:06:10Z

How is it still user friendly to require (a lot of) users to start to use new g functions explicitly otherwise there could be significant slow downs.

What I meant is that gmean would be required only when user want to skip S3 dispatch, normally mean will optimise to gmean unless there is a "non-standard" class having defined mean method.
Black/white listing is another option as it doesn't come out often.

MichaelChirico · 2019-05-14T08:45:02Z

lots of standard classes

I'm not seeing the cost that much. We also only have one test in the suite that relies on GForce applying to non-numeric class. So as much as we don't have ample evidence users are getting bit by current practice, we also don't have much evidence that people are relying on the current behavior. We could potentially run rev-dep on this branch? That would also give empirical weight to whether this change imposes "great cost".

Getting GForce to work with this PR is as simple as applying mean to the as.numeric version of your column, if artihmetic mean is what you intend. If your class is really just decorated-numeric (e.g. something that has nice print methods but storage mode is simple numeric and mean.default applies), as.numeric should be almost instant even on large data.

Another alternative is to use mean.default as an entry point for GForce, since it is definitely exported -- users could use that to signal it's safe for us to redirect to GForce regardless of class.

How about adding a blacklist of classes for which gforce should be turned off

The worry is still that unaware users can end up with silently wrong results. Whereas with cedta, the worst that can happen is some extra printing.

mattdowle · 2019-05-14T15:42:05Z

We also only have one test in the suite that relies on GForce applying to non-numeric class.

I'm not following what you mean by "non-numeric class". This issue and PR isn't about character data per se, for example. is.vector() is used in this PR for its feature of considering a vector with a class attribute not be a vector. Maybe you mean non-vector class rather than non-numeric class. The fact we're missing tests just shows that we thought one test was sufficient to make sure class attribute was retained by gforce functions. But over time the gforce abilities have been extended and one (gmedian) was missed. So we need more tests.

We could potentially run rev-dep on this branch? That would also give empirical weight to whether this change imposes "great cost".

revdep wouldn't show anything. revdep is for correctness (so much as revdep package tests cover). But this cost would be a slow-down.

Getting GForce to work with this PR is as simple as applying mean to the as.numeric version of your column, if artihmetic mean is what you intend.

I'm not following what you mean here. That sounds to me like they would have to know and then do something new manually. Currently gforce is an automatic optimization. It's not about the compute speed of having to do as.numeric, it's the inconvenience of users having to change their code to write as.numeric.

Another alternative is to use mean.default as an entry point for GForce, since it is definitely exported -- users could use that to signal it's safe for us to redirect to GForce regardless of class.

Again - I don't know what this means. What is an entry point for GForce ?

The worry is still that unaware users can end up with silently wrong results.

Not if circular is added to a list. The first known example in 10 years. If more are discovered, we can think again.

Whereas with cedta, the worst that can happen is some extra printing.

That's not exactly true. It's much better now that DT[,j] does do what DF[,j] does when j is a numeric or character vector, but it used to be that j was returned straight back and it could be a silent wrong-result. There are still some wrinkles related to that, but it's much better but not perfect.

MichaelChirico · 2019-05-15T06:45:45Z

revdep wouldn't show anything

we could edit this branch to error instead of setting GForce = FALSE and try then

they would have to know and then do something new manually

Yes. I'm fine with this personally (perhaps pending some numeric evidence via the revdep check)

What is an entry point for GForce

Another case of users having to know and then doing something manually. This time, instead of as.numeric (which could have a memory hit of storing two columns), users fine with using arithmetic mean on non-strictly-numeric columns do so by having mean.default(col) in j.

Michael Chirico added 4 commits May 6, 2019 13:18

Attempted fix for GForce & improper method dispatch

81ecfd0

Closes #3079, #1876, #3533 -- catch unsupported GForce/fastmean dispatch

8f48aac

try is.vector instead of inspecting class()

5b66b77

fixes to pass tests

d701d5a

MichaelChirico changed the title ~~[WIP] GForce shutoff for non-standard classes~~ GForce shutoff for non-standard classes May 6, 2019

MichaelChirico commented May 6, 2019

View reviewed changes

Michael Chirico and others added 2 commits May 13, 2019 10:27

Merge branch 'master' into gforce_dispatch

7923f44

Merge branch 'master' into gforce_dispatch

99b8dcb

mattdowle closed this May 14, 2019

d-sci mentioned this pull request May 14, 2019

Operations with "by" do not dispatch to S3 methods #3533

Open

This was referenced May 15, 2019

gmedian retain class #3564

Merged

fixed invalid trim #3567

Merged

MichaelChirico deleted the gforce_dispatch branch December 20, 2019 08:28

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

GForce shutoff for non-standard classes #3546

GForce shutoff for non-standard classes #3546

MichaelChirico commented May 6, 2019 •

edited by mattdowle

MichaelChirico commented May 6, 2019

MichaelChirico May 6, 2019

codecov bot commented May 6, 2019 •

edited

MichaelChirico May 6, 2019

jangorecki May 8, 2019

mattdowle commented May 14, 2019 •

edited

MichaelChirico commented May 14, 2019 •

edited

mattdowle commented May 14, 2019 •

edited

MichaelChirico commented May 14, 2019 •

edited

jangorecki commented May 14, 2019

mattdowle commented May 14, 2019 •

edited

jangorecki commented May 14, 2019

MichaelChirico commented May 14, 2019

mattdowle commented May 14, 2019

MichaelChirico commented May 15, 2019

		@@ -7039,6 +7039,11 @@ test(1521, x[, b := 5], data.table(a=c(1,2), b=5))

		# Fix for #1160, fastmean retaining attributes
		x = data.table(a = c(2,2,1,1,2), b=setattr(1:5, 'class', c('bla', 'integer')))

GForce shutoff for non-standard classes #3546

GForce shutoff for non-standard classes #3546

Conversation

MichaelChirico commented May 6, 2019 • edited by mattdowle

MichaelChirico commented May 6, 2019

MichaelChirico May 6, 2019

Choose a reason for hiding this comment

codecov bot commented May 6, 2019 • edited

Codecov Report

MichaelChirico May 6, 2019

Choose a reason for hiding this comment

jangorecki May 8, 2019

Choose a reason for hiding this comment

mattdowle commented May 14, 2019 • edited

MichaelChirico commented May 14, 2019 • edited

mattdowle commented May 14, 2019 • edited

MichaelChirico commented May 14, 2019 • edited

jangorecki commented May 14, 2019

mattdowle commented May 14, 2019 • edited

jangorecki commented May 14, 2019

MichaelChirico commented May 14, 2019

mattdowle commented May 14, 2019

MichaelChirico commented May 15, 2019

MichaelChirico commented May 6, 2019 •

edited by mattdowle

codecov bot commented May 6, 2019 •

edited

mattdowle commented May 14, 2019 •

edited

MichaelChirico commented May 14, 2019 •

edited

mattdowle commented May 14, 2019 •

edited

MichaelChirico commented May 14, 2019 •

edited

mattdowle commented May 14, 2019 •

edited