Free speed gains?!? Too good to be true?!?! #21

aaronrudkin · 2017-10-26T21:53:16Z

I've been profiling the bootstrap/resampling functionality in fabricatr this week. I have a branch where I've implemented a few incremental speedups, but these are child's play.

One big speed boost we can get is replacing rbind with rbindlist (a function from the data.table package). In benchmarks with moderately large data, rbindlist runs about 9x faster than rbind, and the overall resample process runs about 2x faster using rbindlist than rbind. This is a pretty huge gain and I am very much in favour of it.
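A rough way to reproduce this kind of comparison is sketched below. The list size and column layout are made up for illustration, so the ratio it prints will not necessarily match the 9x figure above, and the rbindlist timing only runs if data.table happens to be installed.

```r
# Hypothetical benchmark data: 2000 small data frames, sizes chosen arbitrarily.
set.seed(1)
results_all <- lapply(1:2000, function(i) {
  data.frame(id = i, x = rnorm(5), g = sample(letters, 5))
})

# Base R: bind every data frame in the list with rbind.
time_rbind <- system.time(res_base <- do.call(rbind, results_all))

# data.table: only timed when the package is installed.
if (requireNamespace("data.table", quietly = TRUE)) {
  time_rbindlist <- system.time(res_dt <- data.table::rbindlist(results_all))
  print(c(rbind = time_rbind[["elapsed"]],
          rbindlist = time_rbindlist[["elapsed"]]))
} else {
  print(c(rbind = time_rbind[["elapsed"]]))
}
```

A single system.time call is noisy; repeating the timing several times gives a steadier estimate.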

One issue is, of course, the "Malawi problem": we don't want to increase the number of dependencies for people who are extremely bandwidth constrained. But what if we could have it both ways, letting users who have data.table installed make use of it, while users who don't can still use our package without being told to install it?

Consider the following snippet:

        # quietly = TRUE suppresses the namespace-loading message
        if(!requireNamespace("data.table", quietly = TRUE)) {
            res = do.call(rbind, results_all)
            rownames(res) = NULL
        } else {
            # User has data.table, give them a speed benefit for it
            res = data.table::rbindlist(results_all)
            class(res) = "data.frame"
            attr(res, ".internal.selfref") = NULL
        }

requireNamespace will return FALSE if data.table is not installed, so people without the package will get the do.call rbind version. I've benchmarked various ways of zapping the row names, so don't worry about that.

If, on the other hand, the user DOES have data.table, then we can call the rbindlist function. I add the class and attr lines so that our function will return a data.frame -- in other words, the returned data will be exactly identical and pass an identical() call whether you have the data.table package or not.
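One way to check the identical() claim is to run both branches on the same input. In this sketch, `bind_results` and its `use_dt` argument are hypothetical names introduced only to force each branch; the body mirrors the snippet above, and the comparison is skipped when data.table is not installed.

```r
# Hypothetical helper: same logic as the snippet, but the branch is chosen
# by the caller so that both paths can be exercised on one machine.
bind_results <- function(results_all, use_dt) {
  if (!use_dt) {
    res <- do.call(rbind, results_all)
    rownames(res) <- NULL
  } else {
    res <- data.table::rbindlist(results_all)
    class(res) <- "data.frame"
    attr(res, ".internal.selfref") <- NULL
  }
  res
}

# Small made-up input: three data frames with two rows each.
results_all <- lapply(1:3, function(i) data.frame(id = i, x = c(0.5, 1.5) * i))
res_base <- bind_results(results_all, use_dt = FALSE)

# Compare against the data.table branch only when the package is available.
if (requireNamespace("data.table", quietly = TRUE)) {
  res_dt <- bind_results(results_all, use_dt = TRUE)
  print(identical(res_base, res_dt))
}
```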

In terms of how we signal this to users, we modify the docs/vignette. Neal assures me we can add arbitrary key/value pairs to the DESCRIPTION file, so we could also add a key/value pair that has no ordinary meaning to let people know (e.g. FasterWith: data.table).

I'll post a full standalone profile script in Slack so you guys can play around with this

Summary:

Users with data.table get a speed boost (2X speed boost across full resample)
Users without data.table see no change
No additional dependencies
Output from the function will be exactly identical regardless of which version runs

@graemeblair Suggested I post an issue to make clear my intent here and see if anyone has a strong objection, but I really think this is a solution that's great!

aaronrudkin · 2017-10-26T21:59:12Z

RE: You can add arbitrary key-value pairs to Description, here's Hadley:

You can also create your own fields to add additional metadata. The only restrictions are that you shouldn’t use existing names and that, if you plan to submit to CRAN, the names you use should be valid English words (so a spell-checking NOTE won’t be generated).

aaronrudkin · 2017-10-27T02:34:18Z

Doing the data.table::rbindlist call generates a warning in check if data.table is not added to Suggests. Hadley's book says that Suggests will not trigger an automatic download and recommends it for cases like this where you're doing a waterfall for a given function. I added it, but obviously we will ensure that a completely empty environment does not install it before we submit to CRAN.
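For illustration, the two DESCRIPTION entries discussed in this thread might look like the fragment built below. The package name and version are placeholders and FasterWith is the custom field floated earlier, not a field R itself interprets; the sketch writes the fragment to a temporary file and parses it with read.dcf to show that a custom field sits alongside Suggests without trouble.

```r
# Hypothetical DESCRIPTION fragment; "examplepkg" and its version are placeholders.
desc_lines <- c(
  "Package: examplepkg",
  "Version: 0.0.1",
  "Suggests: data.table",
  "FasterWith: data.table"
)
desc_file <- tempfile()
writeLines(desc_lines, desc_file)

# read.dcf() parses the "Key: value" format used by DESCRIPTION files
# and returns a character matrix with one column per field.
desc <- read.dcf(desc_file)
print(desc[1, c("Suggests", "FasterWith")])
```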

graemeblair · 2017-10-28T17:29:49Z

Thanks Aaron. I'm in favor of this solution. Let us know what you find about whether Suggests auto-installs.

nfultz · 2017-10-28T18:50:54Z

data.table should not get auto-installed when a user does `install.packages("DeclareDesign")`, but *would* get auto-installed by CRAN / CI servers. We might want an option for testing to turn off this optimization so that we can make sure that the non-optimized branch(es) don't break.

graemeblair · 2017-10-28T19:40:09Z

Ah, got it. Thanks, Neal. That seems like a good solution.

aaronrudkin · 2017-10-31T19:16:06Z

I'll be adding an option call (potentially private-only) for us to test the non-data.table code path and then merge the finished code.
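A minimal sketch of what such a switch could look like, assuming an R option is used: the option name `examplepkg.use_data_table` and both function names are hypothetical, since the thread does not say what the merged code calls them.

```r
# Hypothetical option name; defaults to TRUE so the fast path is used
# whenever data.table is installed and the option has not been set.
use_data_table <- function() {
  isTRUE(getOption("examplepkg.use_data_table", TRUE)) &&
    requireNamespace("data.table", quietly = TRUE)
}

# Hypothetical wrapper around the two code paths from the original snippet.
bind_rows_fast <- function(results_all) {
  if (use_data_table()) {
    res <- data.table::rbindlist(results_all)
    class(res) <- "data.frame"
    attr(res, ".internal.selfref") <- NULL
  } else {
    res <- do.call(rbind, results_all)
    rownames(res) <- NULL
  }
  res
}

# Force the base R branch, e.g. at the top of a test file, then restore.
old <- options(examplepkg.use_data_table = FALSE)
res <- bind_rows_fast(list(data.frame(a = 1:2), data.frame(a = 3:4)))
options(old)
```

With the option set to FALSE the rbind branch runs even on CRAN or CI machines where data.table is installed, which is the case Neal raised.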

aaronrudkin added a commit that referenced this issue Oct 27, 2017

Rewrite of resample_data for efficiency, including discussion in #21

f21ab04

aaronrudkin closed this as completed Oct 31, 2017
