### Tutorial: train/test splitting with the s2spy `traintest` module

For cross-validation, we split data that was resampled with the s2spy `time` module into train and test groups.

We start by importing the required libraries and generating an example `AdventCalendar` along with example data.

In [1]:
import s2spy.time
import s2spy.traintest
import pandas as pd
import numpy as np

In [2]:
calendar = s2spy.time.AdventCalendar(anchor_date=(10, 15), freq="180d")

time_index = pd.date_range("20151020", "20211001", freq="60d")
test_data = np.random.random(len(time_index))
df = pd.DataFrame(test_data, index=time_index, columns=["data1"])
ds = df.to_xarray().rename({"index": "time"})

We first need to resample the data using the calendar:

In [3]:
df = calendar.resample(df)
df.keys()

Index(['anchor_year', 'i_interval', 'interval', 'data1', 'target'], dtype='object')
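The `interval` column holds the calendar intervals that each row was aggregated over. As a rough sketch of how such intervals relate to the anchor date, the snippet below steps back from 15 October 2016 in 180-day blocks using plain pandas. The names `anchor`, `width`, and `intervals` are illustrative assumptions, and this is not how s2spy builds its calendar internally.

```python
import pandas as pd

# Assumed anchor date (15 October) for anchor year 2016 and a 180-day interval width
anchor = pd.Timestamp("2016-10-15")
width = pd.Timedelta("180d")

# Build two right-closed intervals, counting backwards from the anchor date
intervals = [
    pd.Interval(anchor - (k + 1) * width, anchor - k * width, closed="right")
    for k in range(2)
]

for k, interval in enumerate(intervals):
    print(k, interval)
```

The interval with index 0 ends on the anchor date; the next one directly precedes it.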

#### Example of the `KFold` method.

All splitter classes from scikit-learn are supported; a list is available here:

https://scikit-learn.org/stable/modules/classes.html#splitter-classes

In [4]:
from sklearn.model_selection import KFold
splitter = KFold(n_splits=3)
df = s2spy.traintest.split_groups(splitter, df)

In [5]:
df

,anchor_year,i_interval,interval,data1,target,split_0,split_1,split_2
0,2016,0,"(2016-04-18, 2016-10-15]",0.397957,True,test,train,train
1,2016,1,"(2015-10-21, 2016-04-18]",0.21897,False,test,train,train
2,2017,0,"(2017-04-18, 2017-10-15]",0.45587,True,test,train,train
3,2017,1,"(2016-10-20, 2017-04-18]",0.636322,False,test,train,train
4,2018,0,"(2018-04-18, 2018-10-15]",0.326481,True,train,test,train
5,2018,1,"(2017-10-20, 2018-04-18]",0.569288,False,train,test,train
6,2019,0,"(2019-04-18, 2019-10-15]",0.33346,True,train,test,train
7,2019,1,"(2018-10-20, 2019-04-18]",0.423282,False,train,test,train
8,2020,0,"(2020-04-18, 2020-10-15]",0.512527,True,train,train,test
9,2020,1,"(2019-10-21, 2020-04-18]",0.439752,False,train,train,test
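In the table, both intervals of an anchor year always carry the same label, so the split appears to be made per anchor year rather than per row. The sketch below reproduces that labelling pattern with only pandas and scikit-learn. The `toy` frame and the labelling loop are assumptions for illustration, not the s2spy implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical stand-in for resampled data: two intervals per anchor year
toy = pd.DataFrame({
    "anchor_year": np.repeat(np.arange(2016, 2021), 2),
    "i_interval": np.tile([0, 1], 5),
})

# Split the unique anchor years, then map the labels back onto every row
unique_years = np.unique(toy["anchor_year"])
for i, (train_idx, test_idx) in enumerate(KFold(n_splits=3).split(unique_years)):
    test_years = unique_years[test_idx]
    toy[f"split_{i}"] = np.where(toy["anchor_year"].isin(test_years), "test", "train")

print(toy)
```

Splitting on anchor years keeps all intervals of one year together, so no year contributes rows to both the training and the testing side of a split.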


Select the rows that belong to the training groups of the first split (`split_0`):

In [6]:
training_data_split_0 = df.loc[df.split_0 == "train"]
training_data_split_0.dropna()

,anchor_year,i_interval,interval,data1,target,split_0,split_1,split_2
4,2018,0,"(2018-04-18, 2018-10-15]",0.326481,True,train,test,train
5,2018,1,"(2017-10-20, 2018-04-18]",0.569288,False,train,test,train
6,2019,0,"(2019-04-18, 2019-10-15]",0.33346,True,train,test,train
7,2019,1,"(2018-10-20, 2019-04-18]",0.423282,False,train,test,train
8,2020,0,"(2020-04-18, 2020-10-15]",0.512527,True,train,train,test
9,2020,1,"(2019-10-21, 2020-04-18]",0.439752,False,train,train,test
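The same boolean-mask pattern can be combined with the `target` column to separate target intervals from the remaining intervals inside one split. The following is a small sketch on a hand-written frame; the `toy` data and the names `y_train` and `x_train` are hypothetical and only mimic the column layout shown above.

```python
import pandas as pd

# Hypothetical frame with the same column layout as the resampled data
toy = pd.DataFrame({
    "anchor_year": [2016, 2016, 2017, 2017, 2018, 2018],
    "i_interval": [0, 1, 0, 1, 0, 1],
    "data1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "target": [True, False, True, False, True, False],
    "split_0": ["test", "test", "train", "train", "train", "train"],
})

# Rows of the first split, by label
train = toy.loc[toy["split_0"] == "train"]
test = toy.loc[toy["split_0"] == "test"]

# Within the training rows, separate target intervals from non-target intervals
y_train = train.loc[train["target"], "data1"]
x_train = train.loc[~train["target"], "data1"]
```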


Loop through all train/test splits using the `split_iterate` iterator:

In [7]:
for i, (train_data, test_data) in enumerate(s2spy.traintest.split_iterate(df), start=1):
    print(f"Split group {i}")
    print("Anchor years in training data", set(train_data['anchor_year']))
    print("Anchor years in testing data", set(test_data['anchor_year']))

Split group 1
Anchor years in training data {'2020', '2019', '2018'}
Anchor years in testing data {'2016', '2017'}
Split group 2
Anchor years in training data {'2016', '2017', '2020'}
Anchor years in testing data {'2019', '2018'}
Split group 3
Anchor years in training data {'2016', '2019', '2017', '2018'}
Anchor years in testing data {'2020'}
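Conceptually, iterating over the splits amounts to visiting each `split_*` column and selecting the rows labelled `train` and `test`. The generator below is a minimal sketch of that idea for a pandas frame. `iterate_splits` is a hypothetical helper written for this tutorial and should not be read as the implementation of `split_iterate`.

```python
import pandas as pd


def iterate_splits(frame):
    """Yield (train, test) frames for every `split_*` column (hypothetical helper)."""
    split_cols = [col for col in frame.columns if str(col).startswith("split_")]
    for col in split_cols:
        yield frame.loc[frame[col] == "train"], frame.loc[frame[col] == "test"]


# Hypothetical frame with two split columns
toy = pd.DataFrame({
    "anchor_year": [2016, 2017, 2018, 2019],
    "split_0": ["test", "test", "train", "train"],
    "split_1": ["train", "train", "test", "test"],
})

pairs = list(iterate_splits(toy))
for i, (train, test) in enumerate(pairs, start=1):
    print(i, sorted(train["anchor_year"]), sorted(test["anchor_year"]))
```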


### `xarray` example

In [8]:
ds = calendar.resample(ds)
ds

Here we choose the `ShuffleSplit` method:

In [9]:
from sklearn.model_selection import ShuffleSplit

splitter = ShuffleSplit(n_splits=3)
ds_traintest = s2spy.traintest.split_groups(splitter, ds)

In [10]:
ds_traintest

Loop through all train/test splits using the `split_iterate` iterator:

In [11]:
for i, (train_data, test_data) in enumerate(s2spy.traintest.split_iterate(ds_traintest), start=1):
    print(f"Split group {i}")
    print("Anchor years in training data", train_data.anchor_year.values)
    print("Anchor years in testing data", test_data.anchor_year.values)

Split group 1
Anchor years in training data [2016 2017 2019 2020]
Anchor years in testing data [2018]
Split group 2
Anchor years in training data [2016 2018 2019 2020]
Anchor years in testing data [2017]
Split group 3
Anchor years in training data [2016 2017 2018 2019]
Anchor years in testing data [2020]
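Each split above holds out a single anchor year. This is consistent with the scikit-learn default for `ShuffleSplit`, which reserves 10% of the samples (rounded up) for testing when `test_size` is not given. Unlike `KFold`, the test sets are drawn independently for each split and may overlap. The sketch below shows how `test_size` and `random_state` could be set to get larger, reproducible test sets. It works directly on an array of anchor years, which is an assumption for illustration, and does not involve s2spy.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Assumed set of anchor years, matching the example data
anchor_years = np.arange(2016, 2021)

# Hold out two years per split; a fixed random_state makes the shuffling reproducible
splitter = ShuffleSplit(n_splits=3, test_size=2, random_state=0)

splits = [
    (anchor_years[train_idx], anchor_years[test_idx])
    for train_idx, test_idx in splitter.split(anchor_years)
]

for i, (train_years, test_years) in enumerate(splits, start=1):
    print(f"Split group {i}: train {train_years}, test {test_years}")
```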
