Time-series bootstrapping #45

Open
colintbowers opened this issue Feb 12, 2019 · 12 comments

@colintbowers
Contributor

Hi all,

I noticed this package listed in the recent announcement for StatsKit. Reading through the docs, it sounds like this package is likely to be the default Julia package for bootstrapping. As it happens, I've registered a fairly complete package for bootstrapping time-series (which is mainly block bootstrapping procedures and block length selection procedures). Link here

Are there any good reasons to try to merge some of the functionality from that package into this one? Alternatively, do any of you think it would be useful for me to add an interface to the code in my package that matches the interface here, so frequent users of this package can quickly switch across to my package for any time-series needs without having to alter their code? Or should I just leave things as they are for now?

Any feedback is appreciated. I use Julia a lot, but haven't really been keeping my finger on the pulse of the package ecosystem.

Cheers,

Colin

@rofinn
Contributor

rofinn commented Feb 14, 2019

FWIW, I'd be interested in seeing DependentBootstrap.jl merged with Bootstrap.jl. Bootstrap.jl does have an implementation of maximum entropy bootstrapping, but having other options would be nice.

@juliangehring
Owner

I'm happy to explore together how we can bring the two packages closer together.
How well can the functionality of DependentBootstrap be mapped onto the Bootstrap API (bootstrap, confint, BootstrapSampling)? Are there any parts that don't translate well and would need changes to the existing API?

@colintbowers
Contributor Author

Thanks for the responses. Given you both think it would be useful to get some time-series functionality (beyond maximum entropy) into Bootstrap.jl, I'm happy to look a bit further into this.

> Are there any parts that don't translate well and would need changes to the existing API?

Not sure, since I'm not familiar with the Bootstrap API. I think a sensible first step would be for me to familiarize myself with the Bootstrap API, and then report back here once I've worked out how hard (or easy) it will be. In the interests of managing expectations, I should also mention I'm juggling a couple of things at the moment, so the time-span for me doing this should probably be measured in weeks, not days.

Once I've done that, then we can explore the best way forward in this thread. Now that I know there is some interest, I'll start allocating some evenings to work on this. Cheers.

@colintbowers
Contributor Author

Okay, this looks like it could be reasonably straightforward. I think I should be able to just add DependentBootstrap as a dependency, define a new abstract subtype of BootstrapSampling (say, DependentBootstrapSampling), and then extend the bootstrap function to that new abstract subtype. Each of the various block bootstrap methods can be defined as a subtype of DependentBootstrapSampling, but I think I should only need the one method extension to bootstrap, since almost all the code can be common for all the different block bootstrap methods.
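
A minimal sketch of that dispatch pattern, for illustration only. Every name ending in `Sketch` (and `bootstrap_sketch` itself) is a hypothetical stand-in defined locally; none of this is the actual Bootstrap.jl or DependentBootstrap.jl code.

```julia
using Statistics: mean

# Stand-in for the package's abstract sampling type (assumption).
abstract type SamplingSketch end

# Hypothetical intermediate abstract type covering every block bootstrap method.
abstract type DependentSamplingSketch <: SamplingSketch end

# Two illustrative concrete block methods.
struct MovingBlockSketch <: DependentSamplingSketch
    nrun::Int
end
struct StationarySketch <: DependentSamplingSketch
    nrun::Int
end

# The only method-specific piece in this sketch.
methodname(::MovingBlockSketch) = "moving block"
methodname(::StationarySketch) = "stationary"

# One method on the abstract type serves all concrete subtypes.
function bootstrap_sketch(statistic::Function, data::AbstractVector,
                          sampling::DependentSamplingSketch)
    # A real implementation would resample blocks here; this only shows
    # that dispatch reaches the shared method for each subtype.
    return (method = methodname(sampling), nrun = sampling.nrun, t0 = statistic(data))
end

r = bootstrap_sketch(mean, [1.0, 2.0, 3.0], MovingBlockSketch(100))
```

Adding another block method under this layout would only need a new struct and a `methodname` definition; `bootstrap_sketch` stays untouched.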

For each of the block bootstrap types I'll have an nrun field (as all the other ones do), and then I'll add an additional field blocklength::Float64. Users can specify a (strictly positive) block length, and if they don't, a default constructor can make the block length negative, which can be interpreted by the bootstrap function as asking it to estimate the block length using the appropriate block length selection procedure from DependentBootstrap.
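
The negative-block-length convention could look roughly like the sketch below. The type `BlockSamplingSketch`, the helper names, and the `n^(1/3)` rule of thumb are all assumptions made for illustration; the rule of thumb merely stands in for a proper block length selection procedure.

```julia
# Hypothetical sampling type; not the definition used in either package.
struct BlockSamplingSketch
    nrun::Int
    blocklength::Float64
end

# Default constructor: a negative block length means "estimate it from the data".
BlockSamplingSketch(nrun::Integer) = BlockSamplingSketch(nrun, -1.0)

# Placeholder estimator (assumption): a crude n^(1/3) rule of thumb.
estimate_blocklength(data::AbstractVector) = length(data)^(1 / 3)

# Use the user's value when it is strictly positive, otherwise estimate.
function resolve_blocklength(s::BlockSamplingSketch, data::AbstractVector)
    return s.blocklength > 0 ? s.blocklength : estimate_blocklength(data)
end

user_choice = resolve_blocklength(BlockSamplingSketch(100, 5.0), zeros(27))
estimated   = resolve_blocklength(BlockSamplingSketch(100), zeros(27))
```

A sentinel keeps the struct to two concrete fields; an alternative design would be a `Union{Nothing, Float64}` field, at the cost of a less uniform type.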

At this point, I can only see two potential complications:

  1. DependentBootstrap currently depends on Requires. I use this in DependentBootstrap to lazy-load DataFrames and TimeSeries if the user has their data in the form of a DataFrame or TimeArray. Does anyone see any problems with Requires becoming an additional dependency?

  2. It's a little less clear how to elegantly allow bootstrapping residuals of models (such as ResidualSampling and WildSampling currently do) using one of the block bootstrap methods. I suggest leaving this for now. I can get the basic functionality discussed above up and running, and then we can think about this additional step.

If anyone has any thoughts/objections, please chime in, otherwise I'll create a development branch for myself and get to work. As I said, it doesn't look too difficult, but I'm juggling a few things at the moment, so I suspect it'll take me a few weeks to get it in a shape that I'm happy with.

@rofinn
Contributor

rofinn commented Mar 15, 2019

> Does anyone see any problems with Requires becoming an additional dependency?

I'd be very careful with using Requires.jl in packages. In my experience, it can make things significantly slower than just loading the packages. I have a feeling that's why Plots.jl takes so long to load. JuliaPackaging/Requires.jl#39

@colintbowers
Contributor Author

colintbowers commented Mar 17, 2019

Interesting. After reading through that (and related posts) I will get rid of Requires from DependentBootstrap. Thanks for bringing that material to my attention!

This would mean that adding DependentBootstrap as a dependency to Bootstrap will only mean one additional dependency, namely TimeSeries. What are people's thoughts on this? I could comment out the TimeSeries dependency in DependentBootstrap for now if you would rather not have it as an implicit dependency of Bootstrap (in which case there will be no additional dependencies other than DependentBootstrap itself). I'm not convinced the ecosystem has settled on TimeArrays as the de facto type for storing time-series data anyway...

@colintbowers
Contributor Author

I've released a new tag for DependentBootstrap that removes the Requires dependency. I've left in TimeSeries since it is tiny and very fast to load, and upon reflection, will probably become the standard for time-series work.

I'm branching Bootstrap locally and starting work on wrapping DependentBootstrap. Will probably take a few weeks.

@juliangehring
Owner

Sorry for the radio silence, I have been off the grid for a while.
Your ideas sound very good to me. I'm in favour of working around Requires if possible. Having TimeSeries as a dependency sounds perfectly reasonable, and I wouldn't worry about the additional dependency (DataFrames has been a dependency for a while and is a much bigger beast).
Let me know if you need any help or ideas. Thanks for the time and effort that you are putting into this!

@colintbowers
Contributor Author

No problems!

I've actually already checked out a development branch and done most of the work (it was even easier than I was anticipating!).

I've ditched Requires from DependentBootstrap, but kept TimeSeries. I've done some rough timings on a using Bootstrap statement with and without DependentBootstrap as a dependency, and the difference is very small (roughly 3.7 seconds increasing to 3.85 seconds on my machine).

I'll finesse things over the next week or two, but should be able to submit a pull request to Bootstrap fairly soon after that with the changes. I'm putting all the new code in its own file for now, which should make things easy to review.

@colintbowers
Contributor Author

colintbowers commented Jul 28, 2019

Sorry for the rather long delay on this. I've just submitted PR #59 that adds the DependentBootstrap functionality. The new types in Bootstrap are StationarySampling, MovingBlockSampling, CircularBlockSampling, and NoOverlapSampling. Each of these types has two fields, the usual nrun field, and a blocklength field for indicating the block length to use. If you set the block length less than or equal to zero, then the DependentBootstrap package will attempt to estimate an optimal block length from the data.
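
For readers unfamiliar with block resampling, here is a rough sketch of how resampled indices can be drawn for the moving block and circular block schemes. It uses only the standard library, `block_indices` is a hypothetical helper, and it is not the code in PR #59 or in DependentBootstrap.

```julia
using Random

# Draw resampled indices for a series of length n using blocks of length b.
# circular = false: moving block scheme (block starts drawn from 1:n-b+1)
# circular = true:  circular block scheme (blocks may wrap past the end)
function block_indices(rng::AbstractRNG, n::Int, b::Int; circular::Bool = false)
    1 <= b <= n || throw(ArgumentError("block length must be in 1:n"))
    nstarts = circular ? n : n - b + 1
    inds = Int[]
    while length(inds) < n
        start = rand(rng, 1:nstarts)
        for offset in 0:(b - 1)
            # mod1 wraps n + 1 back to 1; it is a no-op in the moving block case
            push!(inds, mod1(start + offset, n))
        end
    end
    return inds[1:n]  # trim the final block so the resample has length n
end

rng = MersenneTwister(1)
x = collect(1.0:10.0)
resample = x[block_indices(rng, length(x), 3)]
```

Within each block the original ordering is preserved, which is what lets the resample retain short-range dependence in the series.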

Pretty much all the new code is in its own file, dependent_bootstrap.jl, so this should hopefully make it fairly easy to review. The main new method starts in line 70 of that file, and I've styled it fairly heavily based on the bootstrap methods in other files in Bootstrap, so hopefully it should be easy to follow.

Note, I also added a suite of tests (in the test subdirectory). The tests are designed to make sure that Bootstrap and DependentBootstrap provide the same output given the same input, and then I've also added an expected output from a call to bias in Bootstrap, just for extra safety. I've done this for two input data types, Vector{Float64} and DataFrame, with output types Float64 and Vector{Float64} respectively.
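
The idea behind such an equivalence test can be sketched as follows: seed two code paths identically and require identical output. The two functions here are hypothetical stand-ins (a direct implementation and a thin wrapper around it), not the actual tests from the PR.

```julia
using Random, Statistics, Test

# Direct implementation (stand-in): means of nrun i.i.d. resamples of x.
function direct_resample_means(rng::AbstractRNG, x::AbstractVector, nrun::Int)
    n = length(x)
    return [mean(x[rand(rng, 1:n, n)]) for _ in 1:nrun]
end

# Wrapper (stand-in) that should reproduce the direct call exactly.
wrapped_resample_means(rng, x, nrun) = direct_resample_means(rng, x, nrun)

@testset "wrapper matches direct call" begin
    x = randn(MersenneTwister(0), 50)
    a = direct_resample_means(MersenneTwister(1234), x, 20)
    b = wrapped_resample_means(MersenneTwister(1234), x, 20)
    @test a == b            # same seed, same input, same output
    @test length(a) == 20
end
```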

I also added a few lines to the README indicating the new functionality. Feel free to ask for clarification if anything is not clear.

Cheers

@azev77

azev77 commented Jan 12, 2020

@juliangehring & @colintbowers I think it's great for the Julia ecosystem that you guys are working on combining your packages!
I'm slowly moving my work away from R/STATA into Julia.
One feature economists find very useful is the Wild Clustered Bootstrap, here is the STATA code.
Sandwich in R now also has Wild Cluster Bootstrap options, code here.

These features would basically bring bootstrapping in Julia up to speed w/ R & STATA.

@colintbowers
Contributor Author

@azev77 I'm not that familiar with the Wild Clustered Bootstrap, so it would probably take me a little while to implement it. If I get some spare time I'll look into it.

Thanks for posting the code links. I'm not sure about the STATA code, but unfortunately we can't look at the R code when implementing a Julia version if we want to maintain the MIT license which the Julia Bootstrap and DependentBootstrap are currently under. (GPL is copy-left)
