CV Splits - Is it Possible to Start With Most Recent Data? #41
This is what I've been exploring over at pytimetk: https://github.com/business-science/pytimetk/blob/master/src/pytimetk/crossvalidation/time_series_cv.py
Hey Matt, first and foremost, thanks for your interest. Just to make sure I am getting the request correctly:
Reversing the order should be a fairly quick adjustment to the iterator, while an expanding backward option may be more challenging. For now, the lazy way of doing it is to return the splits and reverse them manually.
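The manual workaround mentioned above can be sketched as follows. This is only an illustration: `forward_rolling_splits` is a hypothetical stand-in for any oldest-first splitter and is not the package's API. It works on integer positions rather than timestamps to keep the example self-contained.

```python
from typing import Iterator, List, Tuple

Split = Tuple[range, range]  # (train positions, test positions)


def forward_rolling_splits(
    n_obs: int, train_size: int, test_size: int, step: int
) -> Iterator[Split]:
    """Hypothetical oldest-first rolling-window splitter over positions 0..n_obs-1."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += step


# Materialize the lazy iterator, then reverse so the most recent split comes first.
splits: List[Split] = list(forward_rolling_splits(n_obs=12, train_size=6, test_size=2, step=2))
recent_first: List[Split] = splits[::-1]

for train, test in recent_first:
    print(f"train={list(train)} test={list(test)}")
```

The cost of this approach is that every split has to be generated before the first one can be used, which is why a native reversed iterator is preferable for long series.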
Yep, I think you have it. The top need is the reversed order, with the first split being the most recent.
So which one of the two is the expected/desired behaviour for the expanding window in "reverse" order?
Rolling window with most recent first.
Hey Matt, I am still not sure what you are asking here. If the desired outcome is having the same splits just in a different order (see figures below), then I don't see why your validation score would change.
I can agree with this, but rather than only putting the splits in a different order, you could give different importance to different folds so that the most recent ones carry more weight in the final decision. The package gives the user enough flexibility to make these kinds of decisions afterwards. On the other hand, if you want to have a fixed test set and a moving training set, then we can have a (somewhat separate) discussion on that, and on why I don't think it is a good idea to support it.
Figures: Current behaviour; Reverse order
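One way to give recent folds more weight after the fact is a weighted average of the per-fold scores. The sketch below is an assumption for illustration: the geometric decay scheme and the name `recency_weighted_score` do not come from the thread or from either package.

```python
from typing import Sequence


def recency_weighted_score(scores: Sequence[float], decay: float = 0.5) -> float:
    """Weighted mean of fold scores, where scores[0] is the most recent fold.

    Fold i gets weight decay**i, so decay=1.0 reduces to the plain mean.
    The geometric scheme is an illustrative choice, not a recommendation.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < decay <= 1:
        raise ValueError("decay must be in (0, 1]")
    weights = [decay**i for i in range(len(scores))]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Fold errors ordered from most recent to oldest (made-up numbers).
fold_errors = [1.0, 2.0, 4.0]
weighted = recency_weighted_score(fold_errors, decay=0.5)
unweighted = recency_weighted_score(fold_errors, decay=1.0)
print(weighted, unweighted)
```

With this kind of post-processing the order in which splits are generated does not matter, as long as each score can be matched to its fold's position in time.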
Thanks for your message @FBruzzesi. This is what I'm planning to accomplish inside of In

Creating the CV Sets:

When the resampling is performed, the first set is always the most recent data. Here it's 24 months of data, but the user could have specified 24 periods numerically since it's a monthly frequency dataset. The initial is the window of training data. Skip is how many periods should be the gap.

Your package essentially does the same thing but in reverse. That's consistent with how Rob Hyndman does it, but in my experience it isn't the best way to do time series cross validation (again, because newer information is typically more relevant, and what people do is just take the top N resamples, where N is 5 or so). So this way if they select

resample_spec <- time_series_cv(data = m750,
initial = "6 years",
assess = "24 months",
skip = "24 months",
cumulative = FALSE,
slice_limit = 3)
#> Using date_var: date
resample_spec
#> # Time Series Cross Validation Plan
#> # A tibble: 3 × 2
#> splits id
#> <list> <chr>
#> 1 <split [72/24]> Slice1
#> 2 <split [72/24]> Slice2
#> 3 <split [72/24]> Slice3

When visualized the sets produced look like this:

resample_spec %>%
plot_time_series_cv_plan(date, value, .interactive = FALSE)

Like your package, it supports time series panels or groups so that all time series are split based on the sliding windows.

walmart_tscv <- walmart_sales_weekly %>%
time_series_cv(
date_var = Date,
initial = "12 months",
assess = "3 months",
skip = "3 months",
slice_limit = 4
)

The only other thing I do is provide a "cumulative" argument that simply extends the training data back to the first timestamp in the data.

# Cumulative TRUE
library(timetk)
library(tidyverse)
?time_series_cv
walmart_tscv <- walmart_sales_weekly %>%
time_series_cv(
date_var = Date,
initial = "12 months",
assess = "3 months",
skip = "3 months",
slice_limit = 4,
cumulative = TRUE
)
walmart_tscv %>%
plot_time_series_cv_plan(Date, Weekly_Sales, .interactive = FALSE)
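For readers on the Python side, the behaviour described in this comment can be sketched roughly as below. This is not the timetk or pytimetk implementation. The function name `recent_first_splits` is hypothetical, the arguments are integer period counts instead of strings like "24 months", and treating `skip` as the shift between consecutive slices is an assumption.

```python
from typing import List, Optional, Tuple

Split = Tuple[range, range]  # (train positions, test positions)


def recent_first_splits(
    n_obs: int,
    initial: int,
    assess: int,
    skip: int,
    cumulative: bool = False,
    slice_limit: Optional[int] = None,
) -> List[Split]:
    """Rolling (or cumulative) splits over positions 0..n_obs-1, most recent first."""
    if min(initial, assess, skip) < 1:
        raise ValueError("initial, assess and skip must all be >= 1")
    splits: List[Split] = []
    test_end = n_obs  # exclusive end of the assessment window
    # Stop once a full `initial` training window no longer fits before the test window.
    while test_end - assess - initial >= 0:
        if slice_limit is not None and len(splits) >= slice_limit:
            break
        test_start = test_end - assess
        # cumulative=True stretches training back to the first observation.
        train_start = 0 if cumulative else test_start - initial
        splits.append((range(train_start, test_start), range(test_start, test_end)))
        test_end -= skip  # move the window back in time
    return splits


# 20 years of monthly data (made-up length), 6-year training, 24-month assessment.
rolling = recent_first_splits(240, initial=72, assess=24, skip=24, slice_limit=3)
expanding = recent_first_splits(240, initial=72, assess=24, skip=24, cumulative=True, slice_limit=3)
for train, test in rolling:
    print(f"<split [{len(train)}/{len(test)}]>")
```

Under these assumptions the rolling call yields three slices of 72 training and 24 assessment periods each, with the first slice ending at the last observation, which mirrors the shape of the R output shown earlier in the comment.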
Now I see what you mean. I need to double-check how easy it is to flip the logic without having to maintain two different algorithms. I will take a closer look during the weekend. Regarding taking the first N splits (in any direction), that's very easy, and it could be enough to write a helper function that wraps the CV with
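A helper for taking only the first N splits could look like the sketch below. It assumes the CV object exposes its splits as an ordinary Python iterable; `take_first_splits` and `demo_splits` are hypothetical names, and using `itertools.islice` is one possible choice rather than necessarily what was meant above.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def take_first_splits(splits: Iterable[T], n: int) -> Iterator[T]:
    """Return a lazy iterator over at most the first n items of `splits`."""
    return islice(splits, n)


def demo_splits() -> Iterator[Tuple[str, str]]:
    """Stand-in generator for a CV splitter; yields placeholder (train, test) labels."""
    for k in range(100):
        yield (f"train_{k}", f"test_{k}")


first_three = list(take_first_splits(demo_splits(), 3))
print(first_three)
```

Because `islice` stops pulling from the generator after N items, the remaining splits are never computed, so the helper is cheap regardless of the direction in which the splitter walks through time.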
Ok, sounds good. Happy to help in any way I can.
Hey @mdancho84, I just released v0.2.0 with the new feature.
Excellent. I'll check it out this weekend. Update: Docs look great. Will test out 0.2.0 for integration with pytimetk.
First off, great package. I've been tinkering with CV for implementation in my pytimetk package, which aims to make it easier to do time series operations in Python.
I'm considering integrating this package. One issue is that my preference is to have cross validation start with the most recent data (meaning the first split should be the most recent data). My rationale is that the most recent time series data has the most information.
This may be a preference, but it's a mistake to start with the oldest data and think that the CV results will mean much.
So is it possible to have an argument to begin CV splits with the most recent data (basically inverting the default) for both sliding window and expanding (cumulative window)?
Thanks!
-Matt