# New data splitters

Randomly splitting a dataset into training, validation, and testing sets implicitly assumes that all three subsets are drawn from the same distribution, an assumption that is hard to satisfy in the real world. As discussed by Kadambi et al. (Nature Machine Intelligence (2023): 1-9), "Although theoretical machine learning research aims to guarantee neural network performance by bounding error (referred to as generalization bounds), such bounds are only valid under assumptions that cannot be validated in reality, for instance that **the finite training data and yet-unseen test data be drawn from the same unknown distribution**."

In some cases, we want to evaluate how well models generalize and use generalization as the criterion for model selection. This requires a splitter that makes the three subsets differ in distribution. If we assume that no samples from the real scenario are available to learn from, the validation and training sets come from the same distribution, while the testing set comes from a different, more realistic one. If we instead assume that a small dataset can be acquired from the real scenario, the validation and training sets can be made more similar to the testing set.

In this tutorial, we will show how to split the dataset into subsets with different distributions.

In [1]:
from tabensemb.data.datasplitter import AbstractSplitter
from tabensemb.utils import PickleAbleGenerator
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold

To create a new splitter, implement the `_split` method. As a showcase, the splitter below places the samples with the highest target values in the testing set, so the training and validation sets have lower target values than the testing set. The training and validation sets are then split randomly. The sizes of the three sets follow `self.train_val_test`, which is the `split_ratio` set in the configuration. Finally, it is good practice to shuffle the indices. The returned values should be 1d `np.ndarray`s.

```python
class TargetSplitter(AbstractSplitter):
    def _split(self, df, cont_feature_names, cat_feature_names, label_name):
        target = df[label_name[0]].values.flatten()
        test_indices = np.where(
            target >= np.percentile(target, np.sum(self.train_val_test[0:2]) * 100)
        )[0]
        train_val_indices = np.setdiff1d(df.index, test_indices)
        train_indices, val_indices = train_test_split(train_val_indices, test_size=self.train_val_test[1] / np.sum(self.train_val_test[0:2]), shuffle=True)

        np.random.shuffle(train_indices)
        np.random.shuffle(val_indices)
        np.random.shuffle(test_indices)
        return np.array(train_indices), np.array(val_indices), np.array(test_indices)
```
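The percentile arithmetic in `_split` can be checked without tabensemb. The following is a minimal sketch on toy data, assuming a hypothetical split ratio of `[0.6, 0.2, 0.2]` and positional indices; the variable names mirror the method above but are otherwise illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical (train, val, test) ratio; in tabensemb it would come from `split_ratio`.
train_val_test = np.array([0.6, 0.2, 0.2])

rng = np.random.default_rng(0)
target = rng.normal(size=100)  # toy target values

# Samples at or above the (0.6 + 0.2) * 100 = 80th percentile form the testing set.
threshold = np.percentile(target, np.sum(train_val_test[:2]) * 100)
test_indices = np.where(target >= threshold)[0]

# The remaining positions are split randomly. Within train + val,
# the validation share is 0.2 / (0.6 + 0.2) = 0.25.
train_val_indices = np.setdiff1d(np.arange(len(target)), test_indices)
train_indices, val_indices = train_test_split(
    train_val_indices,
    test_size=train_val_test[1] / np.sum(train_val_test[:2]),
    shuffle=True,
    random_state=0,
)

print(len(train_indices), len(val_indices), len(test_indices))
# Every training/validation target lies below every testing target.
print(max(target[train_indices].max(), target[val_indices].max()) < target[test_indices].min())
```

Note that `test_size` passed to `train_test_split` is relative to the training and validation samples only, which is why the validation ratio is divided by the sum of the first two ratios.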

Implementing k-fold splitting is optional. Here, only the training and validation sets are k-folded, while the testing set stays the same across folds. To enable k-fold splitting, the `support_cv` property must return `True`.

```python
    @property
    def support_cv(self):
        return True
```

`_next_cv` must be implemented for k-fold splitting. On its first call, it determines the testing set `self.test_indices` and the combined training and validation set `self.train_val_indices`, then initializes a k-fold generator from `sklearn.model_selection.KFold`. Because a generator cannot be pickled, it is wrapped in a pickle-able `PickleAbleGenerator` instance. `KFold().split(self.train_val_indices)` yields two arrays of positions into `self.train_val_indices`, one for the training set and one for the validation set, so they are mapped back through `self.train_val_indices` to obtain dataset indices. For simplicity, this tutorial does not prepare the method for unexpected cases. See the source code of `tabensemb.data.RandomSplitter._next_cv` and `tabensemb.data.AbstractSplitter._sklearn_k_fold` for a more robust implementation.
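Two points from the paragraph above can be illustrated on a toy array: a plain generator cannot be pickled, and `KFold.split` returns positions rather than dataset indices. This is a minimal sketch using only the standard library, NumPy, and scikit-learn; the array contents are made up, and it does not show how `PickleAbleGenerator` works internally.

```python
import pickle

import numpy as np
from sklearn.model_selection import KFold

# Toy stand-in for `self.train_val_indices`: dataset indices that do not start at 0.
train_val_indices = np.array([10, 11, 12, 13, 14, 15, 16, 17])

splits = KFold(n_splits=4, shuffle=True, random_state=0).split(train_val_indices)

# A plain generator object cannot be pickled.
try:
    pickle.dumps(splits)
    generator_is_picklable = True
except TypeError:
    generator_is_picklable = False

# KFold yields positions into `train_val_indices`; map them back to dataset indices.
train_pos, val_pos = next(splits)
train_indices = train_val_indices[train_pos]
val_indices = train_val_indices[val_pos]

print(generator_is_picklable)
print(train_pos, "->", train_indices)
print(val_pos, "->", val_indices)
```

Skipping the mapping step would silently return positions in the range `0..7` instead of the dataset indices `10..17`.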

In [2]:
class TargetSplitter(AbstractSplitter):
    def _split(self, df, cont_feature_names, cat_feature_names, label_name):
        target = df[label_name[0]].values.flatten()
        test_indices = np.where(
            target >= np.percentile(target, np.sum(self.train_val_test[0:2]) * 100)
        )[0]
        train_val_indices = np.setdiff1d(df.index, test_indices)
        train_indices, val_indices = train_test_split(train_val_indices, test_size=self.train_val_test[1] / np.sum(self.train_val_test[0:2]), shuffle=True)

        np.random.shuffle(train_indices)
        np.random.shuffle(val_indices)
        np.random.shuffle(test_indices)
        return np.array(train_indices), np.array(val_indices), np.array(test_indices)

    @property
    def support_cv(self):
        return True

    def _next_cv(self, df, cont_feature_names, cat_feature_names, label_name, cv):
        if self.cv_generator is None:
            train_indices, val_indices, test_indices = self._split(df, cont_feature_names, cat_feature_names, label_name)
            self.test_indices = test_indices
            self.train_val_indices = np.append(train_indices, val_indices)
            self.cv_generator = PickleAbleGenerator(
                KFold(n_splits=cv, shuffle=True).split(self.train_val_indices)
            )
        train_indices_idx, val_indices_idx = self.cv_generator.__next__()
        train_indices, val_indices = self.train_val_indices[train_indices_idx], self.train_val_indices[val_indices_idx]
        return train_indices, val_indices, self.test_indices

The implemented splitter should be registered as follows to be recognized by `DataModule.set_data_splitter` automatically.

In [3]:
from tabensemb.data.datasplitter import splitter_mapping
splitter_mapping["TargetSplitter"] = TargetSplitter

In [4]:
from tabensemb.trainer import Trainer
import tabensemb

prefix = "../../../../"
tabensemb.setting["default_output_path"] = prefix + "output"
tabensemb.setting["default_config_path"] = prefix + "configs"
tabensemb.setting["default_data_path"] = prefix + "data"

trainer = Trainer(device="cpu")

trainer.load_config("sample")
trainer.datamodule.set_data_splitter("TargetSplitter")
trainer.load_data()

The project will be saved to ../../../../output/sample/2023-09-12-11-20-44-0_sample
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-09-12-11-20-44-0_sample (data.csv and tabular_data.csv).


As expected, the target values in the testing set are much higher than those in the training and validation sets.

In [5]:
trainer.df.loc[trainer.train_indices, trainer.label_name[0]].mean(), trainer.df.loc[trainer.val_indices, trainer.label_name[0]].mean(), trainer.df.loc[trainer.test_indices, trainer.label_name[0]].mean()

(-71.48163032821536, -77.62461093871299, 236.44992911967717)

5-fold cross-validation is performed on the training and validation sets.

In [6]:
first_fold_train, first_fold_val, first_fold_test = trainer.datamodule.datasplitter.split(trainer.df, trainer.cont_feature_names, trainer.cat_feature_names, trainer.label_name, cv=5)
second_fold_train, second_fold_val, second_fold_test = trainer.datamodule.datasplitter.split(trainer.df, trainer.cont_feature_names, trainer.cat_feature_names, trainer.label_name, cv=5)
len(first_fold_train), len(first_fold_val), len(first_fold_test), len(second_fold_train), len(second_fold_val), len(second_fold_test)

(163, 41, 52, 163, 41, 52)
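The fold sizes above follow from the set sizes reported by `load_data`: 153 + 51 = 204 samples are shared between training and validation, and `cv=5` is passed to `split`. A minimal sketch reproducing the arithmetic with scikit-learn alone (the `random_state` is an arbitrary choice for reproducibility):

```python
import numpy as np
from sklearn.model_selection import KFold

n_train_val = 153 + 51  # training + validation sizes from the "Dataset size" line

fold_sizes = [
    (len(train), len(val))
    for train, val in KFold(n_splits=5, shuffle=True, random_state=0).split(
        np.arange(n_train_val)
    )
]
print(fold_sizes)
# [(163, 41), (163, 41), (163, 41), (163, 41), (164, 40)]
```

Since 204 is not divisible by 5, `KFold` gives the first 204 % 5 = 4 folds one extra validation sample, which is why the first two folds both contain 41 validation samples.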

The testing set stays unchanged across different folds.

In [7]:
all([x == y for x, y in zip(np.sort(first_fold_test), np.sort(second_fold_test))])

True

By the definition of k-fold, a disjoint part of the samples serves as the validation set in each fold, so the validation samples of one fold all appear in the training set of any other fold.

In [8]:
all([x in first_fold_train for x in second_fold_val]), all([x in second_fold_train for x in first_fold_val])

(True, True)
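The same property can be checked for all folds at once. The sketch below uses a toy index array with plain `KFold` (not the splitter defined in this tutorial) and verifies that the validation folds are pairwise disjoint and together cover every sample.

```python
from itertools import combinations

import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(20)  # toy stand-in for the training + validation indices

val_folds = [
    set(indices[val])
    for _, val in KFold(n_splits=5, shuffle=True, random_state=0).split(indices)
]

# No sample is used for validation in more than one fold.
pairwise_disjoint = all(a.isdisjoint(b) for a, b in combinations(val_folds, 2))
# Every sample is used for validation in exactly one fold.
covers_everything = set().union(*val_folds) == set(indices)

print(pairwise_disjoint, covers_everything)
```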