### Tutorial on how to use the train/test method listed in the s2s `traintest` module

For cross-validation, we split data resampled in the s2s `time` module into groups.

We start by importing the required libraries and generating an example `AdventCalendar` along with example data.

In [11]:
import s2spy.time
import s2spy.traintest
import pandas as pd
import numpy as np

ModuleNotFoundError: No module named 's2spy.time'

In [9]:
os.getcwd()

'/home/svijverber/Lorentz_s2spy_workshop/s2spy'

In [4]:
calendar = s2spy.time.AdventCalendar(anchor=(10, 15), freq="180d")

time_index = pd.date_range("20151020", "20211001", freq="60d")
test_data = np.random.random(len(time_index))
df = pd.DataFrame(test_data, index=time_index, columns =["data1"])
ds = df.to_xarray().rename({"index": "time"})

We first need to resample the data using the calendar:

In [7]:
calendar.map_to_data(df)
df = s2spy.time.resample(calendar, df)
df

Unnamed: 0,anchor_year,i_interval,interval,data1,target
0,2016,0,"(2016-04-18, 2016-10-15]",0.368305,True
1,2016,1,"(2015-10-21, 2016-04-18]",0.358154,False
2,2017,0,"(2017-04-18, 2017-10-15]",0.475839,True
3,2017,1,"(2016-10-20, 2017-04-18]",0.608055,False
4,2018,0,"(2018-04-18, 2018-10-15]",0.390253,True
5,2018,1,"(2017-10-20, 2018-04-18]",0.3747,False
6,2019,0,"(2019-04-18, 2019-10-15]",0.241466,True
7,2019,1,"(2018-10-20, 2019-04-18]",0.834023,False
8,2020,0,"(2020-04-18, 2020-10-15]",0.424158,True
9,2020,1,"(2019-10-21, 2020-04-18]",0.580456,False


#### Example of the `KFold` method.

All splitter classes from sklearn are supported, a list is available here:

https://scikit-learn.org/stable/modules/classes.html#splitter-classes

In [8]:
from sklearn.model_selection import KFold
splitter = KFold(n_splits=3)
traintest_splitter = s2spy.traintest.split_groups(splitter, calendar)

In [9]:
traintest_splitter

Unnamed: 0,anchor_year,i_interval,interval,data1,target,split_0,split_1,split_2
0,2016,0,"(2016-04-18, 2016-10-15]",0.368305,True,test,train,train
1,2016,1,"(2015-10-21, 2016-04-18]",0.358154,False,test,train,train
2,2017,0,"(2017-04-18, 2017-10-15]",0.475839,True,test,train,train
3,2017,1,"(2016-10-20, 2017-04-18]",0.608055,False,test,train,train
4,2018,0,"(2018-04-18, 2018-10-15]",0.390253,True,train,test,train
5,2018,1,"(2017-10-20, 2018-04-18]",0.3747,False,train,test,train
6,2019,0,"(2019-04-18, 2019-10-15]",0.241466,True,train,test,train
7,2019,1,"(2018-10-20, 2019-04-18]",0.834023,False,train,test,train
8,2020,0,"(2020-04-18, 2020-10-15]",0.424158,True,train,train,test
9,2020,1,"(2019-10-21, 2020-04-18]",0.580456,False,train,train,test


Get data from all training groups of fold 0:

Loop through all train/test splits using split iterator `split_iterate`.

In [11]:
i = 1
for X_train, X_test in traintest_spitter.split_iterate(*[df]):
    print(y_train)
    print(X_test)
    print(f"Split group {i}")
    print("Anchor years in training data", set(y_train['anchor_year']))
    print("Anchor years in testing data", set(X_test['anchor_year']))
    i += 1

Split group 1
Anchor years in training data {'2019', '2018', '2020'}
Anchor years in testing data {'2016', '2017'}
Split group 2
Anchor years in training data {'2020', '2016', '2017'}
Anchor years in testing data {'2019', '2018'}
Split group 3
Anchor years in training data {'2019', '2018', '2016', '2017'}
Anchor years in testing data {'2020'}


### `xarray` example

In [13]:
calendar.map_to_data(ds)
ds = s2spy.time.resample(calendar, ds)
ds

Here we choose the `ShuffleSplit` method:

Loop through all train/test splits using split iterator `split_iterate`.

In [16]:
i = 1
for X_train, y_train, X_test, y_test in traintest_spitter.split_iterate(*[ds], y=df):
    print(X_train)
    print(y_train)
    print(f"Split group {i}")
    print("Anchor years in training data", set(X_train['anchor_year']))
    print("Anchor years in testing data", set(X_test['anchor_year']))
    i += 1

Split group 1
Anchor years in training data [2016 2017 2018 2019]
Anchor years in testing data [2020]
Split group 2
Anchor years in training data [2016 2017 2018 2020]
Anchor years in testing data [2019]
Split group 3
Anchor years in training data [2017 2018 2019 2020]
Anchor years in testing data [2016]
