-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resampling speed improvements; continued coverage improvements #24
Conversation
…e columns being bootstrapped on did not start at 1.
…e internal arguments, added data.table to Suggests
…at both branches are explored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
R/resample_data.R
Outdated
@@ -35,12 +35,13 @@ resample_data = function(data, N, ID_labels=NULL) { | |||
.resample_data_internal(data, N, ID_labels) | |||
} | |||
|
|||
.resample_data_internal = function(data, N, ID_labels=NULL, outer_level=1, use_dt = 0) { | |||
.resample_data_internal = function(data, N, ID_labels=NULL, outer_level=1, use_dt = NA) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for the dot prefix, it isn't exported in NAMESPACE anyway.
rmarkdown | ||
rmarkdown, | ||
data.table | ||
FasterWith: data.table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can drop the FasterWith
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recognize this one is not necessary; but I thought it would be handy to signal the nature of the suggestion (testthat, rmarkdown, etc. seem obviously about dev/building the package by definition) -- upon reading the information about DESCRIPTION files it doesn't sound like it's a problem to add arbitrary fields. What does everyone think about this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's probably a trade-off between "hey, the package devs are trying to tell me something" and "hey, the package devs have created a non-standard field that I need to think about"
"At the top level, ", | ||
ifelse(!is.null(ID_label), | ||
paste0(ID_label, ", "), | ||
""), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to check null in this case, eg
> foo <- NULL
> stop("Foo is ", foo)
Error: Foo is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was mostly about making clean, legible text. As-is the error was "At the top level, , you must provide..." and I thought the , , section was ugly. By ifelsing here, we avoid the grammar problem.
"If you provide more than one ID_labels to resample data for multilevel data, please provide a vector for N of the same length representing the number to resample at each level." | ||
) | ||
} | ||
.resample_data_internal = function(data, N, ID_labels=NULL, outer_level=1, use_dt = NA) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just use TRUE/FALSE for use_dt since it's an internal variable anyway.
use_dt = 1 | ||
} else { | ||
use_dt = 0 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use_dt <- use_dt || requireNamespace("data.table", quietly=T)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wouldn't seem like this would work if we want a unit test to be able to explicitly override the value, no? If the unit test sets use_dt = FALSE
in the argument, then the or operator is going to evaluate to FALSE || requireNamespace("data.table", quietly=T)
which should evaluate to true.
Now, I could do this:
use_dt = ifelse(is.na(use_dt), requireNamespace("data.table", quietly=T), use_dt)
But it's not clear if speed or legibility benefit from this?
# of N units by row. | ||
if (missing(N) & is.null(ID_labels)) { | ||
return(bootstrap_single_level(data, dim(data)[1], ID_label=NULL)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can move this to the outer function resample_data()
above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the error checking can be on the outside function, and the inner function just does the real work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then you don't really need the outer
argument here
# OK, if not, we need to recurse | ||
|
||
# Split indices of data frame by the thing we're strapping on | ||
split_data_on_boot_id = split(seq_len(dim(data)[1]), data[,ID_labels[1]]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
data[[ID_labels[1]]])
will be slightly stricter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Good catch, should be &&
…On Tue, Oct 31, 2017, 1:04 PM Aaron Rudkin ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In R/resample_data.R
<#24 (comment)>
:
> -
- # Iterate over each thing chosen at the current level
- results_all = lapply(sampled_boot_values, function(i) {
- new_results = resample_data(
- data[data[, ID_labels[1]] == i, ],
- N=N[2:length(N)],
- ID_labels=ID_labels[2:length(ID_labels)]
- )
- })
- #res = rbindlist(results_all)
+ }
+
+ # OK, if not, we need to recurse
+
+ # Split indices of data frame by the thing we're strapping on
+ split_data_on_boot_id = split(seq_len(dim(data)[1]), data[,ID_labels[1]])
👍
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#24 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAZjTrUoufCE-woXuMKNZFsRRl0DmWclks5sx31IgaJpZM4QNTkx>
.
|
This looks great; merging now. Thanks! |
This PR encapsulates the last week or so of work, with an emphasis on the speed improvements for the multi-level bootstrap. I will leave this open for a brief code review -- if you want to tag suggestions or ideas below, I will merge the PR if everyone is happy and then add the suggestions as the first commits on the next feature branch I open.