Resampling speed improvements; continued coverage improvements #24

aaronrudkin · 2017-10-31T19:34:19Z

This PR encapsulates the last week or so of work, with an emphasis on the speed improvements for the multi-level bootstrap. I will leave this open for a brief code review -- if you want to tag suggestions or ideas below, I will merge the PR if everyone is happy and then add the suggestions as the first commits on the next feature branch I open.

nfultz

LGTM

nfultz · 2017-10-31T19:35:39Z

R/resample_data.R

@@ -35,12 +35,13 @@ resample_data = function(data, N, ID_labels=NULL) {
  .resample_data_internal(data, N, ID_labels)
 }

-.resample_data_internal = function(data, N, ID_labels=NULL, outer_level=1, use_dt = 0) {
+.resample_data_internal = function(data, N, ID_labels=NULL, outer_level=1, use_dt = NA) {


No need for the dot prefix, it isn't exported in NAMESPACE anyway.

nfultz · 2017-10-31T19:35:58Z

DESCRIPTION

-    rmarkdown
+    rmarkdown,
+    data.table
+FasterWith: data.table


We can drop the FasterWith

I recognize this one is not necessary; but I thought it would be handy to signal the nature of the suggestion (testthat, rmarkdown, etc. seem obviously about dev/building the package by definition) -- upon reading the information about DESCRIPTION files it doesn't sound like it's a problem to add arbitrary fields. What does everyone think about this?

There's probably a trade-off between "hey, the package devs are trying to tell me something" and "hey, the package devs have created a non-standard field that I need to think about"

nfultz · 2017-10-31T19:37:12Z

R/fabricate.R

+        "At the top level, ",
+        ifelse(!is.null(ID_label),
+               paste0(ID_label, ", "),
+               ""),


No need to check null in this case, eg

> foo <- NULL
> stop("Foo is ", foo)
Error: Foo is

This was mostly about making clean, legible text. As-is the error was "At the top level, , you must provide..." and I thought the , , section was ugly. By ifelsing here, we avoid the grammar problem.
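To make the grammar issue concrete, here is a minimal sketch of both versions of the message. `top_level_prefix()` is a hypothetical helper written for illustration, not a function in the package, and the message text is abbreviated the same way as in the comment above.

```r
# Hypothetical helper for illustration; not part of the package.
top_level_prefix <- function(ID_label = NULL) {
  # A NULL label adds nothing to paste0(), so only emit the label
  # and its trailing comma when a label was actually supplied.
  paste0(
    "At the top level, ",
    if (!is.null(ID_label)) paste0(ID_label, ", ") else ""
  )
}

ID_label <- NULL

# Unguarded: the NULL vanishes and the two separators run together.
unguarded <- paste0("At the top level, ", ID_label, ", you must provide ...")

# Guarded: the message reads cleanly with or without a label.
guarded_null <- paste0(top_level_prefix(NULL), "you must provide ...")
guarded_set  <- paste0(top_level_prefix("countries"), "you must provide ...")
```

A scalar `if`/`else` is used in the sketch because the condition is a single value; the `ifelse()` call in the diff should produce the same string in this case.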

nfultz · 2017-10-31T19:38:44Z

R/resample_data.R

-      "If you provide more than one ID_labels to resample data for multilevel data, please provide a vector for N of the same length representing the number to resample at each level."
-    )
-  }
+.resample_data_internal = function(data, N, ID_labels=NULL, outer_level=1, use_dt = NA) {


you can just use TRUE/FALSE for use_dt since it's an internal variable anyway.

nfultz · 2017-10-31T19:39:35Z

R/resample_data.R

+      use_dt = 1
+    } else {
+      use_dt = 0
+    }


use_dt <- use_dt || requireNamespace("data.table", quietly=T)

It doesn't seem like this would work if we want a unit test to be able to explicitly override the value, no? If the unit test sets use_dt = FALSE in the argument, the or operator evaluates FALSE || requireNamespace("data.table", quietly=T), which is TRUE whenever data.table is installed, so the override is ignored.

Now, I could do this:
use_dt = ifelse(is.na(use_dt), requireNamespace("data.table", quietly=T), use_dt)

But it's not clear if speed or legibility benefit from this?
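A minimal sketch of the three-state idea discussed here, assuming NA means "auto-detect" and TRUE/FALSE are explicit overrides. `resolve_use_dt()` is a hypothetical name used only for illustration; it is not a function from the PR.

```r
# Hypothetical helper for illustration; not part of the package.
# NA          -> auto-detect whether data.table is installed
# TRUE/FALSE  -> honour the caller's explicit choice (e.g. a unit test)
resolve_use_dt <- function(use_dt = NA) {
  if (is.na(use_dt)) {
    requireNamespace("data.table", quietly = TRUE)
  } else {
    as.logical(use_dt)
  }
}

forced_off <- resolve_use_dt(FALSE)  # stays FALSE even if data.table exists
forced_on  <- resolve_use_dt(TRUE)
detected   <- resolve_use_dt()       # depends on the local installation
```

Because the condition is a single value, a plain `if`/`else` is enough here; it also avoids evaluating `requireNamespace()` when the caller has already decided.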

nfultz · 2017-10-31T19:40:48Z

R/resample_data.R

+    # of N units by row.
+    if (missing(N) & is.null(ID_labels)) {
+      return(bootstrap_single_level(data, dim(data)[1], ID_label=NULL))
+    }


You can move this to the outer function resample_data() above.

All the error checking can be on the outside function, and the inner function just does the real work.

Then you don't really need the outer argument here
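The split suggested above could look roughly like the following sketch: the exported function validates and fills in defaults, and the inner function only does the resampling. Both function names are hypothetical and the single-level logic is a stand-in, not the package's actual implementation.

```r
# Hypothetical inner worker: assumes its inputs are already valid.
resample_inner_sketch <- function(data, N) {
  rows <- sample(seq_len(nrow(data)), N, replace = TRUE)
  data[rows, , drop = FALSE]
}

# Hypothetical outer wrapper: all checks and defaults live here.
resample_sketch <- function(data, N) {
  if (!is.data.frame(data)) {
    stop("`data` must be a data frame.")
  }
  if (missing(N)) {
    N <- nrow(data)  # default: resample as many rows as the input has
  }
  if (length(N) != 1 || N < 1) {
    stop("`N` must be a single positive number.")
  }
  resample_inner_sketch(data, N)
}

df <- data.frame(id = 1:5)
same_size <- resample_sketch(df)
smaller   <- resample_sketch(df, N = 3)
```

With this layout the inner function needs no bookkeeping arguments for validation, which is the point of the comment about dropping the outer argument.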

nfultz · 2017-10-31T19:42:40Z

R/resample_data.R

+  # OK, if not, we need to recurse
+
+  # Split indices of data frame by the thing we're strapping on
+  split_data_on_boot_id = split(seq_len(dim(data)[1]), data[,ID_labels[1]])


data[[ID_labels[1]]] will be slightly stricter
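One way to read "stricter" is that `[[` only ever extracts a single column as a vector. The sketch below uses a made-up data frame (the column names are assumptions for illustration) to compare the two forms on a base data.frame.

```r
df <- data.frame(
  cluster = c("a", "b", "a", "b"),
  value = 1:4,
  stringsAsFactors = FALSE
)
ID_labels <- c("cluster", "value")

# With a single column name, both forms return the same vector
# for a base data.frame.
by_bracket <- df[, ID_labels[1]]
by_double  <- df[[ID_labels[1]]]

# If more than one name slips through, `[` quietly returns a data frame ...
two_cols <- df[, ID_labels]
# ... whereas `[[` raises an error instead.
double_err <- tryCatch(df[[ID_labels]], error = function(e) e)

# Splitting row indices by the grouping column, as in the diff.
row_groups <- split(seq_len(nrow(df)), df[[ID_labels[1]]])
```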

nfultz · 2017-10-31T20:06:24Z

Good catch, should be &&

On Tue, Oct 31, 2017, 1:04 PM Aaron Rudkin wrote:

In R/resample_data.R:

-
-  # Iterate over each thing chosen at the current level
-  results_all = lapply(sampled_boot_values, function(i) {
-    new_results = resample_data(
-      data[data[, ID_labels[1]] == i, ],
-      N=N[2:length(N)],
-      ID_labels=ID_labels[2:length(ID_labels)]
-    )
-  })
-  #res = rbindlist(results_all)
+  }
+
+  # OK, if not, we need to recurse
+
+  # Split indices of data frame by the thing we're strapping on
+  split_data_on_boot_id = split(seq_len(dim(data)[1]), data[,ID_labels[1]])

👍
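The thread does not show which line the `&&` remark refers to; a likely candidate is the `missing(N) & is.null(ID_labels)` condition in the earlier diff. Under that assumption, this short sketch shows why `&&` is the usual choice for a scalar `if` condition in R.

```r
# `&&` short-circuits: the right-hand side is never evaluated here.
short_circuit <- FALSE && stop("never reached")

# `&` evaluates both operands, so the same expression raises an error.
elementwise_err <- tryCatch(FALSE & stop("evaluated"), error = function(e) e)

# `&` is also vectorised, which is rarely what an `if` condition wants.
vectorised <- c(TRUE, FALSE) & c(TRUE, TRUE)
```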

graemeblair · 2017-10-31T20:21:56Z

This looks great; merging now. Thanks!

coveralls · 2017-10-31T21:23:57Z

Coverage increased (+1.5%) to 95.778% when pulling 910d51b on profile_and_coverage into 38a2f02 on master.

aaronrudkin added 8 commits October 26, 2017 10:43

Made the code changes Neil suggested in code review for the previous PR.

2ac8aac

Fixed a serious bug which would break bootstrapping in cases where the columns being bootstrapped on did not start at 1.

9ed6814

Changed resampling to be more memory and speed efficient.

6863e51

Added a skipped test for fabricating and resampling extremely large data

86abcb5

Updated description file to indicate package is faster with data.table

a4c94da

Rewrite of resample_data for efficiency, including discussion in #21

f21ab04

Added test coverage for bootstrapping, made a wrapper function to hide internal arguments, added data.table to Suggests

da29cfb

Fixes to allow unit tests to override the data.table suggestion so that both branches are explored.

910d51b

nfultz approved these changes Oct 31, 2017

graemeblair merged commit ca85333 into master Oct 31, 2017

graemeblair deleted the profile_and_coverage branch October 31, 2017 20:22
