CV Splits - Is it Possible to Start With Most Recent Data? #41

mdancho84 · 2024-05-05T13:11:16Z

First off, great package. I've been tinkering with CV for implementation in my pytimetk package, which aims to make it easier to do time series operations in python.

I'm considering integrating this package. One issue is that my preference is to have cross validation start with the most recent data (meaning the first split should be the most recent data). My rationale is that the most recent time series data has the most information.

This may be preference, but it's a mistake to start with oldest data and think that the CV results will mean much.

So is it possible to have an argument to begin CV splits with the most recent data (basically inverting the default) for both sliding window and expanding (cumulative window)?

Thanks!
-Matt

mdancho84 · 2024-05-05T13:12:37Z

This is what I've been exploring over at pytimetk: https://github.com/business-science/pytimetk/blob/master/src/pytimetk/crossvalidation/time_series_cv.py

FBruzzesi · 2024-05-06T08:09:23Z

Hey Matt, first and foremost, thanks for your interest.

Just to make sure I am getting the request correctly:

- for the sliding window case, the splits would be the same, returned in opposite order;
- for the expanding window, I can think of two possibilities:
  - same as sliding window: same splits but returned in opposite order;
  - fixed test and expanding window backward, i.e.
```
|            ======= /// *** |
|        =========== /// *** |
|    =============== /// *** |
| ================== /// *** |
```

Reversing the order should be a fairly quick adjustment to the iterator, while an expanding-backward option may be more challenging.

For now the lazy way of doing it is to return the splits and reverse them manually.
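
A minimal sketch of that manual reversal, assuming a generic forward splitter. `sliding_splits` is a hypothetical stand-in written for this example, not this package's API; any iterable of `(train, test)` pairs can be reversed the same way once it is materialized into a list.

```python
from typing import Iterator, Tuple

Split = Tuple[range, range]  # (train indices, test indices)


def sliding_splits(
    n_samples: int, train_size: int, test_size: int, stride: int
) -> Iterator[Split]:
    """Hypothetical forward sliding-window splitter (oldest split first)."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += stride


# Materialize the forward splits, then reverse them so the most recent is first.
forward = list(sliding_splits(n_samples=12, train_size=4, test_size=2, stride=2))
most_recent_first = forward[::-1]

for train, test in most_recent_first:
    print(list(train), list(test))
```

The cost of this approach is that every split has to be generated before the first one can be used, which is why a native backward mode is preferable when only the top few splits are needed.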

mdancho84 · 2024-05-06T10:35:08Z

Yep I think you have it. The top need is the reversed order with the first split being the most recent.

FBruzzesi · 2024-05-07T06:45:53Z

So which one of the two is the expected/desired behaviour for expanding window in "reverse" order?

mdancho84 · 2024-05-07T10:21:44Z

Rolling window with most recent first.

FBruzzesi · 2024-05-07T21:54:26Z

Hey Matt, I am still not sure about what you are asking here.

If what you want is the same splits in a different order (see figures below), then I don't see why your validation score would change.

> My rationale is that the most recent time series data has the most information.

I can agree with this, but rather than returning the splits in a different order, you could give different importance to different folds so that the most recent ones carry more weight in the final decision.

The package gives the user enough flexibility to make this kind of decision afterwards.
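
One way to weight folds by recency, as a sketch only: the geometric decay scheme and the `recency_weighted_score` helper below are assumptions made for illustration, not something the package provides.

```python
from typing import Sequence


def recency_weighted_score(scores: Sequence[float], decay: float = 0.5) -> float:
    """Weighted mean of per-fold scores, ordered oldest fold first.

    The newest fold gets weight 1, the one before it `decay`, then
    `decay ** 2`, and so on. `decay=1.0` reduces to the plain mean.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    if not 0 < decay <= 1:
        raise ValueError("decay must be in (0, 1]")
    n = len(scores)
    weights = [decay ** (n - 1 - i) for i in range(n)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


# Per-fold errors, oldest fold first (made-up numbers for the example).
fold_errors = [4.0, 2.0, 1.0]
weighted = recency_weighted_score(fold_errors, decay=0.5)
plain = recency_weighted_score(fold_errors, decay=1.0)
print(weighted, plain)
```

With this scheme the order in which the splits are generated does not matter, as long as each score is matched to the right fold.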

On the other hand, if you want to have a fixed test set, and a moving training set, then we can have a (somewhat separate) discussion on that, and why I don't think it is a good idea to support it.

Figures (images not preserved): "Current behaviour" and "Reverse order".

mdancho84 · 2024-05-09T15:01:29Z

Thanks for your message @FBruzzesi. This is what I'm planning to accomplish inside of pytimetk:

In timetk (comparable R package), I have a function called time_series_cv() and then some plotting utilities to help visualize the Time Series Cross Validation Sets. https://business-science.github.io/timetk/reference/time_series_cv.html

Creating the CV Sets:

When the resampling is performed, the first set is always the most recent data. Here it's 24 months of data. But the user could have specified numerically 24 periods since it's a monthly frequency dataset. The initial is the window of training data. Skip is how many periods should be the gap.

Your package essentially does the same thing but in reverse. That's consistent with how Rob Hyndman does it, but in my experience it isn't the best way to do time series cross validation: newer information is typically more relevant, and in practice people just take the top N resamples, where N is 5 or so. With the most recent split first, selecting slice_limit = 3 gives the 3 most recent splits.

```r
resample_spec <- time_series_cv(data = m750,
                                initial     = "6 years",
                                assess      = "24 months",
                                skip        = "24 months",
                                cumulative  = FALSE,
                                slice_limit = 3)
#> Using date_var: date

resample_spec
#> # Time Series Cross Validation Plan 
#> # A tibble: 3 × 2
#>   splits          id    
#>   <list>          <chr> 
#> 1 <split [72/24]> Slice1
#> 2 <split [72/24]> Slice2
#> 3 <split [72/24]> Slice3
```

When visualized, the sets produced look like this:

```r
resample_spec %>%
    plot_time_series_cv_plan(date, value, .interactive = FALSE)
```

Like your package, it supports time series panels or groups so that all time series are split based on the sliding windows.

```r
walmart_tscv <- walmart_sales_weekly %>%
    time_series_cv(
        date_var    = Date,
        initial     = "12 months",
        assess      = "3 months",
        skip        = "3 months",
        slice_limit = 4
    )
```

The only other thing I do is provide a "cumulative" argument that simply extends each training window back to the first timestamp in the data.

```r
# Cumulative TRUE
library(timetk)
library(tidyverse)

?time_series_cv

walmart_tscv <- walmart_sales_weekly %>%
    time_series_cv(
        date_var    = Date,
        initial     = "12 months",
        assess      = "3 months",
        skip        = "3 months",
        slice_limit = 4,
        cumulative  = TRUE
    )

walmart_tscv %>%
    plot_time_series_cv_plan(Date, Weekly_Sales, .interactive = FALSE)
```
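
For readers on the Python side, the behaviour described above can be sketched on integer positions. This is an illustration under stated assumptions, not the timetk or pytimetk implementation: `backward_splits` is a hypothetical helper, its parameter names only mirror the R arguments, the sizes are row counts rather than time spans, and `skip` is interpreted here as the step between consecutive assessment windows.

```python
from typing import Iterator, Optional, Tuple

Split = Tuple[range, range]  # (train positions, test positions)


def backward_splits(
    n_samples: int,
    initial: int,
    assess: int,
    skip: int,
    cumulative: bool = False,
    slice_limit: Optional[int] = None,
) -> Iterator[Split]:
    """Yield (train, test) position ranges, most recent split first."""
    if min(initial, assess, skip) < 1:
        raise ValueError("initial, assess and skip must be >= 1")
    test_end = n_samples
    produced = 0
    # Stop once a full training window no longer fits before the test window.
    while test_end - assess - initial >= 0:
        if slice_limit is not None and produced >= slice_limit:
            break
        test_start = test_end - assess
        # cumulative=True stretches the training window back to position 0.
        train_start = 0 if cumulative else test_start - initial
        yield range(train_start, test_start), range(test_start, test_end)
        produced += 1
        test_end -= skip


# 200 monthly rows (a made-up length), 72-month training and 24-month test windows.
rolling = list(backward_splits(200, initial=72, assess=24, skip=24, slice_limit=3))
expanding = list(
    backward_splits(200, initial=72, assess=24, skip=24, cumulative=True, slice_limit=3)
)
for train, test in rolling:
    print(len(train), len(test), test.start, test.stop)
```

Because the generator walks backward from the end of the data, `slice_limit` stops it early without computing the older splits.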

FBruzzesi · 2024-05-10T18:35:17Z

Now I see what you mean. I need to double check how easy it is to flip the logic without having to maintain two different algorithms. I will take a closer look during the weekend.

Regarding taking the first N splits (in any direction), that's very easy: it could be enough to write a helper function that wraps the CV with itertools.islice. I will add it as an issue.
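
Such a helper could look roughly like the sketch below. `take_splits` and `toy_splits` are hypothetical names used only for this example; `itertools.islice` is the standard-library function mentioned above.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def take_splits(splits: Iterable[T], n: int) -> Iterator[T]:
    """Lazily yield at most the first `n` items of any split iterable."""
    return islice(splits, n)


def toy_splits(n_splits: int) -> Iterator[Tuple[str, str]]:
    """Stand-in for a CV split generator; yields labelled placeholder pairs."""
    for i in range(n_splits):
        yield f"train_{i}", f"test_{i}"


first_three = list(take_splits(toy_splits(10), 3))
print(first_three)
```

Since `islice` is lazy, the underlying generator is only advanced for the splits that are actually consumed.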

mdancho84 · 2024-05-10T21:26:26Z

Ok sounds good. Happy to help in any way I can.

FBruzzesi · 2024-05-17T18:31:47Z

Hey @mdancho84, I just released v0.2.0 with the new feature.
I documented it in a dedicated paragraph. Let me know if you have any feedback on that.

mdancho84 · 2024-05-17T19:15:06Z

Excellent. I'll check it out this weekend.

Update: Docs look great. Will test out 0.2.0 for integration with pytimetk.

mdancho84 mentioned this issue May 5, 2024

Time Series Cross Validation business-science/pytimetk#291

Open

FBruzzesi added the enhancement New feature or request label May 6, 2024

FBruzzesi mentioned this issue May 10, 2024

Take top N splits #42

Open

FBruzzesi mentioned this issue May 12, 2024

feat: allow for mode="backward" #44

Merged

FBruzzesi closed this as completed in #44 May 14, 2024
