CV Splits - Is it Possible to Start With Most Recent Data? #41

Closed
mdancho84 opened this issue May 5, 2024 · 11 comments · Fixed by #44
Labels
enhancement New feature or request

Comments

@mdancho84

First off, great package. I've been tinkering with CV for implementation in my pytimetk package, which aims to make it easier to do time series operations in python.

I'm considering integrating this package. One issue is that my preference is to have cross validation start with the most recent data (meaning the first split should be the most recent data). My rationale is that the most recent time series data has the most information.

This may be preference, but it's a mistake to start with oldest data and think that the CV results will mean much.

So is it possible to have an argument to begin CV splits with the most recent data (basically inverting the default) for both sliding window and expanding (cumulative window)?

Thanks!
-Matt

@FBruzzesi
Owner

Hey Matt, first and foremost, thanks for your interest.

Just to make sure I am getting the request correctly:

  • for the sliding window case, the splits would be the same, returned in opposite order;
  • for the expanding window, I can think of two possibilities:
    • same as sliding window: same splits but returned in opposite order;
    • fixed test and expanding window backward i.e.
      |            ======= /// *** |
      |        =========== /// *** |
      |    =============== /// *** |
      | ================== /// *** |
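The second possibility in the diagram (fixed test set, training window growing backward) can be sketched as follows. This is only an illustration on integer positions; `backward_expanding_splits` and all of its parameters are hypothetical names, not part of the package.

```python
from typing import List, Tuple

Split = Tuple[range, range]  # (train positions, test positions)


def backward_expanding_splits(
    n: int, test_size: int, gap: int, min_train: int, step: int
) -> List[Split]:
    """Hypothetical helper: keep the test set fixed at the end of the data
    and grow the training window backward by `step` positions per split."""
    if step < 1:
        raise ValueError("step must be positive")
    test = range(n - test_size, n)      # fixed test block (*** in the diagram)
    train_end = n - test_size - gap     # training stops before the gap (///)
    splits: List[Split] = []
    train_start = train_end - min_train
    while train_start >= 0:
        splits.append((range(train_start, train_end), test))
        train_start -= step             # extend the training window backward
    return splits


splits = backward_expanding_splits(n=30, test_size=3, gap=3, min_train=6, step=6)
```

With these illustrative numbers the helper yields four splits that share one test block, mirroring the four rows drawn above.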
      

Reversing the order should be a fairly quick adjustment to the iterator, while an expanding-backward option may be more challenging.

For now the lazy way of doing it is to return the splits and reverse them manually.
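A minimal sketch of that manual workaround, assuming the splits can be materialized in memory. `sliding_splits` is a hypothetical stand-in that yields integer position ranges oldest first; it is not the package's API.

```python
from typing import Iterator, List, Tuple

Split = Tuple[range, range]  # (train positions, test positions)


def sliding_splits(n: int, train_size: int, test_size: int, stride: int) -> Iterator[Split]:
    """Hypothetical stand-in for a CV splitter: yields the oldest split first."""
    start = 0
    while start + train_size + test_size <= n:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += stride


def most_recent_first(splits: Iterator[Split]) -> List[Split]:
    """Materialize the splits, then reverse them so the newest comes first."""
    return list(splits)[::-1]


splits = most_recent_first(sliding_splits(n=20, train_size=8, test_size=4, stride=4))
```

The cost of this approach is that every split is generated before the first one is used, which is why a reversed iterator inside the library would be preferable for long series.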

@mdancho84
Author

mdancho84 commented May 6, 2024

Yep I think you have it. The top need is the reversed order with the first split being the most recent.

@FBruzzesi
Owner

So which one of the two is the expected/desired behaviour for expanding window in "reverse" order?

@mdancho84
Author

Rolling window with most recent first.

@FBruzzesi
Owner

FBruzzesi commented May 7, 2024

Hey Matt, I am still not sure about what you are asking here.

If the desired outcome is the same splits, just in a different order (see figures below), then I don't see why your validation score would change.

> My rationale is that the most recent time series data has the most information.

I can agree with this, but rather than reordering the splits, you could weight the folds differently so that the most recent ones count more in the final decision.

The package gives the user enough flexibility to make this kind of decision afterwards.
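As an illustration of weighting folds by recency, here is a minimal sketch. The exponential decay scheme and the name `recency_weighted_score` are assumptions made for the example, not something the package provides.

```python
from typing import Sequence


def recency_weighted_score(scores: Sequence[float], decay: float = 0.5) -> float:
    """Weighted mean of per-fold scores, ordered oldest to newest.

    The newest fold gets weight 1, the one before it `decay`,
    the one before that `decay ** 2`, and so on.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Fold errors from oldest to newest; the newest fold dominates the result.
weighted = recency_weighted_score([4.0, 2.0, 1.0], decay=0.5)
```

Setting `decay=1.0` recovers the plain average, so the order in which the splits were produced has no effect on the result in that case.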

On the other hand, if you want to have a fixed test set, and a moving training set, then we can have a (somewhat separate) discussion on that, and why I don't think it is a good idea to support it.

Figures

Current behaviour

[Figure: fig1-mini]

Reverse order

[Figure: fig2-mini]

@mdancho84
Author

Thanks for your message @FBruzzesi. This is what I'm planning to accomplish inside of pytimetk:

In timetk (comparable R package), I have a function called time_series_cv() and then some plotting utilities to help visualize the Time Series Cross Validation Sets. https://business-science.github.io/timetk/reference/time_series_cv.html

Creating the CV Sets:

When the resampling is performed, the first set is always the most recent data. Here it's 24 months of data. But the user could have specified numerically 24 periods since it's a monthly frequency dataset. The initial is the window of training data. Skip is how many periods should be the gap.

Your package essentially does the same thing but in reverse. That's consistent with how Rob Hyndman does it, but in my experience isn't the best way to do time series cross validation (again because newer information is typically more relevant, and what people do is they just do the top N resamples where N is 5 or so). So this way if they select slice_limit = 3 they will get the 3 most recent splits.

resample_spec <- time_series_cv(data = m750,
                                initial     = "6 years",
                                assess      = "24 months",
                                skip        = "24 months",
                                cumulative  = FALSE,
                                slice_limit = 3)
#> Using date_var: date

resample_spec
#> # Time Series Cross Validation Plan 
#> # A tibble: 3 × 2
#>   splits          id    
#>   <list>          <chr> 
#> 1 <split [72/24]> Slice1
#> 2 <split [72/24]> Slice2
#> 3 <split [72/24]> Slice3

When visualized the sets produced look like this:

resample_spec %>%
    plot_time_series_cv_plan(date, value, .interactive = FALSE)

[Image: plot of the cross-validation plan]

Like your package, it supports time series panels or groups so that all time series are split based on the sliding windows.

walmart_tscv <- walmart_sales_weekly %>%
    time_series_cv(
        date_var    = Date,
        initial     = "12 months",
        assess      = "3 months",
        skip        = "3 months",
        slice_limit = 4
    )

[Image: plot of the cross-validation plan]

The only other thing I do is provide a "cumulative" argument that simply extends each training window back to the first timestamp in the data.

# Cumulative TRUE
library(timetk)
library(tidyverse)

?time_series_cv

walmart_tscv <- walmart_sales_weekly %>%
    time_series_cv(
        date_var    = Date,
        initial     = "12 months",
        assess      = "3 months",
        skip        = "3 months",
        slice_limit = 4,
        cumulative  = TRUE
    )

walmart_tscv %>%
    plot_time_series_cv_plan(Date, Weekly_Sales, .interactive = FALSE)

walmart_tscv
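For readers on the Python side, the most-recent-first scheme described in this comment can be sketched on integer positions as below. The parameter names mirror the R arguments (`initial`, `assess`, `skip`, `cumulative`, `slice_limit`), but the function `recent_first_splits` is hypothetical and is neither timetk's implementation nor this package's.

```python
from typing import List, Optional, Tuple

Split = Tuple[range, range]  # (train positions, test positions)


def recent_first_splits(
    n: int,
    initial: int,
    assess: int,
    skip: int,
    cumulative: bool = False,
    slice_limit: Optional[int] = None,
) -> List[Split]:
    """Hypothetical sketch: build splits backward from the end of the data,
    so the first split returned covers the most recent observations."""
    if min(initial, assess, skip) < 1:
        raise ValueError("initial, assess and skip must be positive")
    splits: List[Split] = []
    test_end = n
    # Stop once a full training window no longer fits before the test block.
    while test_end - assess - initial >= 0:
        if slice_limit is not None and len(splits) >= slice_limit:
            break
        test_start = test_end - assess
        # cumulative=True stretches the training window back to position 0.
        train_start = 0 if cumulative else test_start - initial
        splits.append((range(train_start, test_start), range(test_start, test_end)))
        test_end -= skip  # move one step further into the past
    return splits


# 72 training periods, 24 assessment periods, keep the 3 newest slices.
splits = recent_first_splits(n=200, initial=72, assess=24, skip=24, slice_limit=3)
```

Because slices are generated newest first, `slice_limit` naturally keeps the most recent ones without having to build the full plan.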

@FBruzzesi
Owner

Now I see what you mean. I need to double check how easy it is to flip the logic without having to maintain two different algorithms. I will take a closer look during the weekend.

Regarding taking the first N splits (in any direction), that's very easy: it could be enough to write a helper function that wraps the CV with itertools.islice. I will add it as an issue.
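Such a helper might look like the sketch below. `take_splits` is a hypothetical name, and it works on any iterable of splits rather than on a specific class from the package.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def take_splits(splits: Iterable[T], n: int) -> Iterator[T]:
    """Lazily yield at most the first `n` items of any split iterator."""
    return islice(splits, n)


# Any iterator works; a range stands in for a stream of CV splits here.
first_three = list(take_splits(iter(range(10)), 3))
```

Since `islice` is lazy, only the requested splits are ever computed, whichever direction the underlying iterator runs in.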

@mdancho84
Author

Ok sounds good. Happy to help in any way I can.

@FBruzzesi
Owner

Hey @mdancho84, I just released v0.2.0 with the new feature.
I documented it in a dedicated paragraph. Let me know if you have any feedback on that.

@mdancho84
Author

mdancho84 commented May 17, 2024

Excellent. I'll check it out this weekend.

Update: Docs look great. Will test out 0.2.0 for integration with pytimetk.
