Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Free speed gains?!? Too good to be true?!?! #21

Closed
aaronrudkin opened this issue Oct 26, 2017 · 6 comments
Closed

Free speed gains?!? Too good to be true?!?! #21

aaronrudkin opened this issue Oct 26, 2017 · 6 comments

Comments

@aaronrudkin
Copy link
Contributor

I've been profiling the bootstrap/resampling functionality in fabricatr this week. I have a branch where I've implemented a few incremental speedups, but these are child's play.

One big speed boost we can get is replacing rbind with rbindlist (a function from the data.table package). In benchmarks, with a moderately large data, rbindlist runs about 9x faster than rbind, and the overall resample process runs about 2x faster using rbindlist than rbind. This is a pretty huge gain and I am very much in favour of it.

One issue is of course, the "Malawi problem", where we don't want to increase the size of numbers of dependencies for people who are extremely bandwidth constrained. But what if we could trade-off with both, allowing users who have data.table installed to make use of it, while allowing users who don't to be able to use our package without being told to install it.

Consider the following snippet:

        if(!requireNamespace("data.table")) {
            res = do.call(rbind, results_all)
            rownames(res) = NULL
        } else {
            # User has data.table, give them a speed benefit for it
            res = data.table::rbindlist(results_all)
            class(res) = "data.frame"
            attr(res, ".internal.selfref") = NULL
        }

requireNamespace will return false if data.table is not installed, so people without the package will get the do.call rbind version. I've benchmarked various ways of zapping the row names, don't worry about that.

If, on the other hand, the user DOES have data.table, then we can call the rbindlist function. I add the class and attr lines so that our function will return a data.frame -- in other words, the returned data will be exactly identical and pass an identical() call whether you have the data.table package or not.

In terms of how we signal this to users, we modify the docs/vignette. Neal assures me we can arbitrary key/value pairs to the DESCRIPTION file, so we could also add a key/value pair that has no ordinary meaning to let people know (i.e. FasterWith: data.table)

I'll post a full standalone profile script in Slack so you guys can play around with this

Summary:

  • Users with data.table get a speed boost (2X speed boost across full resample)
  • Users without data.table see no change
  • No additional dependencies
  • Output from the function will be exactly identical regardless of which version runs

@graemeblair Suggested I post an issue to make clear my intent here and see if anyone has a strong objection, but I really think this is a solution that's great!

@aaronrudkin
Copy link
Contributor Author

RE: You can add arbitrary key-value pairs to Description, here's Hadley:

You can also create your own fields to add additional metadata. The only restrictions are that you shouldn’t use existing names and that, if you plan to submit to CRAN, the names you use should be valid English words (so a spell-checking NOTE won’t be generated).

@aaronrudkin
Copy link
Contributor Author

Doing the data.table::rbindlist call generates a warning in check if data.table is not added to Suggests. Hadley's book says that Suggests will not trigger an automatic download and recommends it for cases like this where you're doing a waterfall for a given function. I added it, but obviously we will ensure that a completely empty environment does not install it before we submit to CRAN.

@graemeblair
Copy link
Member

Thanks Aaron. I'm in favor of this solution. Let us know what you find about whether suggests auto-installs.

@nfultz
Copy link
Contributor

nfultz commented Oct 28, 2017 via email

@graemeblair
Copy link
Member

Ah, got it. Thanks, Neal. That seems like a good solution.

@aaronrudkin
Copy link
Contributor Author

I'll be adding an option call (potentially private-only) for us to test the non-data.table code path and then merge the finished code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants