# Using data functionalities

Running `Trainer.load_data` or `DataModule.load_data` will process the dataset in the following order:

1. Data splitting (training/validation/testing sets): See "Data splitters"
2. Data imputation: See "Data imputers"
3. Data augmentation (for features): See "Data derivers"
4. Data processing **(orderless except for data scaling)**: See "Data processors"
    * Data augmentation (for data points)
    * Data filtering
    * Feature selection
    * Categorical encoding
    * Data scaling
    * etc.
5. Data augmentation (for features, especially multi-modal features): See "Data derivers"

In this part, we introduce the usage of "data splitters", "data imputers", "data processors", and "data derivers". Implementing new functionality is covered in a section of "Advanced Usage".

In [1]:
import torch
from tabensemb.trainer import Trainer
from tabensemb.model import *
import tabensemb
import os

prefix = "../../../../"
tabensemb.setting["default_output_path"] = prefix + "output"
tabensemb.setting["default_config_path"] = prefix + "configs"
tabensemb.setting["default_data_path"] = prefix + "data"

trainer = Trainer(device="cpu")

## Data splitters

Data splitters split the whole dataset into training, validation, and testing sets. They inherit from `tabensemb.data.AbstractSplitter` and implement `_split` (the main method) and `_next_cv` (which generates the next fold of a k-fold CV process).

**Remark**: If `AbstractSplitter.support_cv=False`, the data splitter does not support k-fold CV.
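
As a rough illustration of what a ratio-based random split does, the sketch below shuffles indices and cuts them according to a train/validation/test ratio. It uses only NumPy; `random_split_indices` is a hypothetical helper, not part of `tabensemb`, and its rounding rule is an assumption, so the exact set sizes may differ from what the library reports.

```python
import numpy as np

def random_split_indices(n_samples, ratio=(7, 1.5, 1.5), seed=0):
    """Hypothetical helper: shuffle indices and cut them by a train/val/test ratio."""
    ratio = np.asarray(ratio, dtype=float)
    ratio = ratio / ratio.sum()              # [7, 1.5, 1.5] -> [0.7, 0.15, 0.15]
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_train = int(round(ratio[0] * n_samples))
    n_val = int(round(ratio[1] * n_samples))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]        # the remainder goes to the testing set
    return train, val, test

train_idx, val_idx, test_idx = random_split_indices(256)
print(len(train_idx), len(val_idx), len(test_idx))
```

The three index arrays are disjoint and together cover every data point, which is the property any splitter needs to guarantee.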

There are several ways to specify which data splitter is used. The same approaches also work for other configuration entries.


1. Modify the configuration file, `configs/sample.py` for example:

```python
cfg = {
    "data_splitter": "RandomSplitter",
    # Some other configurations...
}
```

2. Use the `manual_config` argument of `Trainer.load_config`.

```python
trainer.load_config("sample", manual_config={"data_splitter": "RandomSplitter"})
```

3. If `Trainer.load_config` has already been called and you do not want to call it again, use `DataModule.set_data_splitter`.

In [2]:
trainer.load_config("sample")
trainer.datamodule.set_data_splitter("RandomSplitter", ratio=[7, 1.5, 1.5])
trainer.load_data()

Project will be saved to ../../../../output/sample/2023-08-27-14-13-06-0_sample
Dataset size: 178 39 39
Data saved to ../../../../output/sample/2023-08-27-14-13-06-0_sample (data.csv and tabular_data.csv).


The split ratio can also be given in the configuration file, in `manual_config`, or in `set_data_splitter` by passing `train_val_test` in the splitter's argument dictionary:

```python
cfg = {
    # This will overwrite the `split_ratio` configuration.
    "data_splitter": ["RandomSplitter", {"train_val_test": [7, 1.5, 1.5]}],
    # Some other configurations...
}
```

```python
trainer.load_config("sample", manual_config={"data_splitter": ["RandomSplitter", {"train_val_test": [7, 1.5, 1.5]}]})
```

```python
trainer.datamodule.set_data_splitter(["RandomSplitter", {"train_val_test": [7, 1.5, 1.5]}])
```
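
The examples above use two shapes for a configuration entry: a bare name, or a name paired with a dictionary of arguments. A minimal sketch of how such an entry could be normalized is given below. `normalize_component_config` is a hypothetical helper written for illustration only; it is not how `tabensemb` necessarily parses its configuration.

```python
def normalize_component_config(entry):
    """Hypothetical helper: turn "Name" or ["Name", {...}] into (name, kwargs)."""
    if isinstance(entry, str):
        # A bare name means "use the default arguments".
        return entry, {}
    name, kwargs = entry
    return name, dict(kwargs)

print(normalize_component_config("RandomSplitter"))
print(normalize_component_config(["RandomSplitter", {"train_val_test": [7, 1.5, 1.5]}]))
```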

Available data splitters can be seen using:

In [3]:
from tabensemb.data.datasplitter import splitter_mapping
splitter_mapping

{'AbstractSplitter': tabensemb.data.base.AbstractSplitter,
 'RandomSplitter': tabensemb.data.datasplitter.RandomSplitter}

## Data imputers

Imputation is necessary when NaNs exist in the dataset. `tabensemb` provides several imputation methods built on other packages such as `miceforest` and `scikit-learn`. The configuration for an imputer contains two parts: the name of the imputer and its arguments. Data imputers can be set in the same ways as data splitters:

1. Modify the configuration file, `configs/sample.py` for example:

```python
cfg = {
    "data_imputer": ["MiceImputer", {"max_iter": 10}],
    # "data_imputer": "MiceImputer", (If no kwargs is given)
    # Some other configurations...
}
```

2. Use the `manual_config` argument of `Trainer.load_config`.

```python
trainer.load_config("sample", manual_config={"data_imputer": ["MiceImputer", {"max_iter": 10}]})
trainer.load_config("sample", manual_config={"data_imputer": "MiceImputer"})
```

3. Use `DataModule.set_data_imputer`.

In [4]:
trainer.load_config("sample")
trainer.datamodule.set_data_imputer(["MiceImputer", {"max_iter": 10}])
trainer.load_data()

Project will be saved to ../../../../output/sample/2023-08-27-14-13-06-0_sample-I1
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-08-27-14-13-06-0_sample-I1 (data.csv and tabular_data.csv).


In [5]:
trainer.df.isna().any()

cont_0                False
cont_1                False
cont_2                False
cont_3                False
cont_4                False
cont_5                False
cont_6                False
cont_7                False
cont_8                False
cont_9                False
cat_0                 False
cat_1                 False
cat_2                 False
cat_3                 False
cat_4                 False
cat_5                 False
cat_6                 False
cat_7                 False
cat_8                 False
cat_9                 False
target                False
target_binary         False
target_multi_class    False
dtype: bool

Available data imputers can be seen using:

In [6]:
from tabensemb.data.dataimputer import imputer_mapping, get_data_imputer
imputer_mapping

{'AbstractImputer': tabensemb.data.base.AbstractImputer,
 'AbstractSklearnImputer': tabensemb.data.base.AbstractSklearnImputer,
 'GainImputer': tabensemb.data.dataimputer.GainImputer,
 'MeanImputer': tabensemb.data.dataimputer.MeanImputer,
 'MedianImputer': tabensemb.data.dataimputer.MedianImputer,
 'MiceImputer': tabensemb.data.dataimputer.MiceImputer,
 'MiceLightgbmImputer': tabensemb.data.dataimputer.MiceLightgbmImputer,
 'MissForestImputer': tabensemb.data.dataimputer.MissForestImputer,
 'ModeImputer': tabensemb.data.dataimputer.ModeImputer}

Arguments are documented in the API docs and in the docstrings:

In [7]:
print(get_data_imputer("MeanImputer").__doc__)


    Imputation with average values implemented using sklearn's SimpleImputer.

    Parameters
    ----------
    **kwargs
        Arguments for ``sklearn.impute.SimpleImputer`` (except for ``strategy``)
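
To make the idea of mean imputation concrete, here is a small NumPy-only sketch. It is not `tabensemb`'s implementation (which, according to the docstring above, relies on sklearn's `SimpleImputer`); the array and the choice of `fit_rows` are made up for illustration.

```python
import numpy as np

X = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 40.0],
])
fit_rows = [0, 1, 2]                          # rows used to estimate the column means
col_means = np.nanmean(X[fit_rows], axis=0)   # NaNs are ignored when averaging
X_imputed = np.where(np.isnan(X), col_means, X)  # fill each NaN with its column mean

print(col_means)
print(np.isnan(X_imputed).any())
```

Estimating the means on a subset of rows and applying them to all rows mirrors the usual fit/transform separation between the data used for fitting and the held-out data.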
    


## Data processors

As listed in Step 4 above, data processing includes filtering, augmentation, feature selection, and much more. `tabensemb` provides a unified framework for implementing various data processing steps. The imputation and processing procedure resembles the `Pipeline` structure in `sklearn`, but it is also fully compatible with the other two modules introduced in this part (data splitters and data derivers). Together, the four modules automatically handle all data preparation before training.

The configuration for a processor also contains two parts: the name of the processor and its arguments. Here we provide several examples:

* `CategoricalOrdinalEncoder`: same as the `OrdinalEncoder` from `sklearn`
* `NaNFeatureRemover`: remove features that are all NaNs
* `VarianceFeatureSelector`: same as the `VarianceThreshold` from `sklearn`
* `FeatureValueSelector`: select data points that have a certain value of a feature
* `CorrFeatureSelector`: remove highly correlated features
* `IQRRemover`: remove outliers found by the 1.5*IQR criterion
* `StdRemover`: remove outliers found by the 3*std criterion
* `SampleDataAugmentor`: just an example to show the data augmentation capability (it copies the last two data points in the validation set)
* `StandardScaler`: same as the `StandardScaler` from `sklearn`

**Remark**: A data scaler such as `StandardScaler` must be the last data processor.
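
The two outlier criteria named in the list above can be sketched in a few lines of NumPy. This is a conceptual illustration on a single made-up feature, not the code of `IQRRemover` or `StdRemover`; the function names and the choice of population standard deviation are assumptions.

```python
import numpy as np

def iqr_inlier_mask(x, k=1.5):
    """Keep points within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x >= q1 - k * iqr) & (x <= q3 + k * iqr)

def std_inlier_mask(x, k=3.0):
    """Keep points within k standard deviations of the mean."""
    return np.abs(x - x.mean()) <= k * x.std()

values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
print(iqr_inlier_mask(values))   # the value 100.0 falls outside the IQR fence
print(std_inlier_mask(values))   # but it stays within 3 standard deviations
```

On small samples a single extreme value inflates the standard deviation, so the 3*std rule can keep a point that the 1.5*IQR rule rejects.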

In [8]:
processor_configs = [
    ["CategoricalOrdinalEncoder", {}],
    ["NaNFeatureRemover", {}],
    ["VarianceFeatureSelector", {"thres": 0.1}],
    ["FeatureValueSelector", {"feature": "cat_1", "value": 0}],
    ["CorrFeatureSelector", {"thres": 0.1}],
    ["IQRRemover", {}],
    ["StdRemover", {}],
    ["SampleDataAugmentor", {}],
    ["StandardScaler", {}],
]


Data processors can be set in the same three ways:

1. Modify the configuration file:

```python
cfg = {
    "data_processors": processor_configs,
    # Some other configurations...
}
```

2. Use the `manual_config` argument of `Trainer.load_config`.

```python
trainer.load_config("sample", manual_config={"data_processors": processor_configs})
```

3. Use `DataModule.set_data_processors`.


In [9]:
import warnings
import numba
trainer.load_config("sample")
trainer.datamodule.set_data_processors(processor_configs)
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=numba.NumbaDeprecationWarning)
    trainer.load_data()

Project will be saved to ../../../../output/sample/2023-08-27-14-13-07-0_sample
Correlated features (Ranked by SHAP):
{
	'cont_2': 13.650428051938668,
	'cont_1': 8.98106859262871
}
1 features removed: ['cont_1']. 7 features retained: ['cont_0', 'cont_3', 'cont_4', 'cont_2', 'cat_0', 'cat_1', 'cat_2'].
Removing outliers by IQR. Original size: 36, Final size: 34.
Removing outliers by std. Original size: 34, Final size: 34.
Dataset size: 25 11 12
Data saved to ../../../../output/sample/2023-08-27-14-13-07-0_sample (data.csv and tabular_data.csv).


Let's check the effect of each processor. Categorical features are encoded by `CategoricalOrdinalEncoder`:

In [10]:
trainer.datamodule.categorical_data.head()

Unnamed: 0,cat_0,cat_1,cat_2
0,3,0,2
1,3,0,1
2,3,0,4
3,0,0,0
4,4,0,2


The original categorical values can be recovered using:

In [11]:
trainer.datamodule.categories_inverse_transform(trainer.datamodule.categorical_data).head()

Unnamed: 0,cat_0,cat_1,cat_2
0,category_3,0,2
1,category_3,0,1
2,category_3,0,4
3,category_0,0,0
4,category_4,0,2


One feature is removed by `CorrFeatureSelector`. It removes the feature with the lowest feature importance (ranked using `shap` in the example) in the correlation chain.

In [12]:
trainer.cont_feature_names

['cont_0', 'cont_2', 'cont_3', 'cont_4']
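
The idea of dropping the less important member of a correlated group can be sketched as follows. This is only one plausible reading of the procedure: `drop_correlated`, its greedy loop, the threshold of 0.8, the synthetic data, and the importance of `cont_3` are all assumptions made for illustration, and the importance values for `cont_1` and `cont_2` are simply copied from the output above rather than computed with `shap`.

```python
import numpy as np

def drop_correlated(X, names, importance, thres=0.8):
    """Hypothetical sketch: repeatedly drop the least important correlated feature."""
    keep = list(range(X.shape[1]))
    while len(keep) > 1:
        corr = np.abs(np.corrcoef(X[:, keep], rowvar=False))
        np.fill_diagonal(corr, 0.0)               # ignore self-correlation
        involved = [keep[i] for i in range(len(keep)) if (corr[i] > thres).any()]
        if not involved:
            break                                 # no correlated pair is left
        worst = min(involved, key=lambda j: importance[names[j]])
        keep.remove(worst)
    return [names[j] for j in keep]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), rng.normal(size=200)])
names = ["cont_1", "cont_2", "cont_3"]
importance = {"cont_1": 8.98, "cont_2": 13.65, "cont_3": 1.0}
retained = drop_correlated(X, names, importance)
print(retained)
```

Note that `cont_3` survives despite its low importance because it is not correlated with anything; importance only decides which member of a correlated group is removed.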

Only data points with the specified `cat_1` value are kept by `FeatureValueSelector`, and some outliers are removed by `IQRRemover`. The original indices of the removed data points can be seen using:

In [13]:
trainer.datamodule.dropped_indices

array([  0,   1,   2,   3,   4,   5,   8,   9,  10,  11,  12,  13,  14,
        15,  18,  19,  20,  22,  24,  25,  26,  27,  28,  29,  30,  31,
        32,  35,  37,  38,  39,  40,  41,  42,  43,  45,  46,  48,  49,
        50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,
        63,  64,  65,  66,  67,  68,  69,  70,  72,  74,  75,  76,  77,
        80,  81,  82,  83,  85,  86,  89,  90,  93,  94,  96,  97,  98,
        99, 100, 102, 103, 104, 105, 106, 107, 109, 110, 111, 112, 113,
       114, 115, 116, 118, 119, 120, 121, 123, 124, 125, 127, 130, 131,
       132, 133, 135, 136, 137, 139, 140, 141, 142, 143, 144, 145, 146,
       147, 148, 150, 151, 152, 153, 154, 157, 158, 159, 161, 163, 165,
       166, 167, 168, 170, 172, 173, 175, 176, 177, 178, 179, 180, 182,
       183, 184, 185, 186, 187, 188, 189, 190, 192, 193, 194, 195, 196,
       197, 199, 200, 201, 202, 203, 207, 208, 209, 210, 211, 212, 213,
       214, 216, 217, 218, 219, 220, 221, 222, 223, 224, 225, 22

The `SampleDataAugmentor` copies the last two data points in the validation set as a showcase. `DataModule.augmented_indices` holds the indices of these data points before `DataModule.dropped_indices` are dropped. The augmented data points can be seen using:

In [14]:
trainer.df.loc[trainer.datamodule.augmented_indices-len(trainer.datamodule.dropped_indices), :]

Unnamed: 0,cont_0,cont_1,cont_2,cont_3,cont_4,cont_5,cont_6,cont_7,cont_8,cont_9,...,cat_3,cat_4,cat_5,cat_6,cat_7,cat_8,cat_9,target,target_binary,target_multi_class
46,-0.505358,-0.104343,-0.507518,-0.988002,-0.815792,-1.284552,-1.05188,0.564009,2.4972,-2.245322,...,4,4,category_4,3,1,1,2,-246.101543,1,3
47,-2.115056,0.138315,1.618054,0.541008,1.405365,-1.449118,-0.824409,-0.813794,0.42258,0.547481,...,0,0,category_2,4,3,3,1,-156.813059,0,3


In [15]:
trainer.df.loc[trainer.datamodule.val_indices[-2:], :]

Unnamed: 0,cont_0,cont_1,cont_2,cont_3,cont_4,cont_5,cont_6,cont_7,cont_8,cont_9,...,cat_3,cat_4,cat_5,cat_6,cat_7,cat_8,cat_9,target,target_binary,target_multi_class
16,-0.505358,-0.104343,-0.507518,-0.988002,-0.815792,-1.284552,-1.05188,0.564009,2.4972,-2.245322,...,4,4,category_4,3,1,1,2,-246.101543,1,3
0,-2.115056,0.138315,1.618054,0.541008,1.405365,-1.449118,-0.824409,-0.813794,0.42258,0.547481,...,0,0,category_2,4,3,3,1,-156.813059,0,3


Finally, `StandardScaler` scales the dataset. `DataModule.df` is the unscaled data frame, and `DataModule.scaled_df` is the scaled one.

In [16]:
trainer.datamodule.df[trainer.cont_feature_names].describe()

Unnamed: 0,cont_0,cont_2,cont_3,cont_4
count,48.0,48.0,48.0,48.0
mean,-0.146982,-0.078769,0.114156,0.226035
std,0.973085,0.954908,0.733798,1.025565
min,-2.115056,-1.945703,-1.098768,-1.884586
25%,-0.653501,-0.802853,-0.41514,-0.615925
50%,-0.087749,0.082401,-0.059459,0.289189
75%,0.358191,0.77551,0.551649,1.187419
max,2.929096,1.618054,1.576299,2.285601


In [17]:
trainer.datamodule.scaled_df[trainer.cont_feature_names].describe()

Unnamed: 0,cont_0,cont_2,cont_3,cont_4
count,48.0,48.0,48.0,48.0
mean,0.070118,0.049228,-0.01036,0.026105
std,1.070437,0.972486,0.978905,1.00199
min,-2.09485,-1.852074,-1.628433,-2.035996
25%,-0.487075,-0.688186,-0.716455,-0.7965
50%,0.135278,0.213365,-0.241966,0.087807
75%,0.625831,0.919233,0.573267,0.965389
max,3.453941,1.777287,1.940177,2.038326


**Remark**: All modules are fitted on the training and validation sets and then used to transform the testing set.

In [18]:
import numpy as np
trainer.datamodule.scaled_df.loc[np.append(trainer.train_indices, trainer.val_indices), trainer.cont_feature_names].describe()

Unnamed: 0,cont_0,cont_2,cont_3,cont_4
count,36.0,36.0,36.0,36.0
mean,-6.208817e-10,-8.381903e-09,-9.520186e-09,4.139211e-10
std,1.014185,1.014185,1.014185,1.014185
min,-2.09485,-1.852074,-1.628433,-2.035996
25%,-0.455965,-0.6643361,-0.7956208,-0.8054215
50%,0.135278,0.02073498,-0.1113371,0.08780731
75%,0.5860012,0.9301667,0.573267,0.9653887
max,1.784157,1.777287,1.940177,1.965737
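
The pattern of fitting on one part of the data and transforming all of it can be reproduced with plain NumPy. The sketch below uses synthetic data and a stand-in index set; it is not `tabensemb` code. It also shows why the `std` row above reads 1.014185 rather than 1: standard scaling divides by the population standard deviation (`ddof=0`), while `DataFrame.describe` reports the sample standard deviation (`ddof=1`), and for 36 rows the ratio between the two is `sqrt(36/35)`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(48, 1))   # synthetic feature with 48 rows
fit_idx = np.arange(36)                             # stand-in for training + validation rows

mean = X[fit_idx].mean(axis=0)
std = X[fit_idx].std(axis=0)                        # population std (ddof=0)
X_scaled = (X - mean) / std                         # the same statistics transform every row

fit_part = X_scaled[fit_idx]
print(fit_part.mean(), fit_part.std(ddof=0), fit_part.std(ddof=1))
print(np.sqrt(36 / 35))
```

Over all 48 rows the mean and standard deviation are generally not exactly 0 and 1, because the remaining rows did not contribute to the fitted statistics.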


## Data derivers

Existing features in the dataset may not be sufficient to represent the underlying relations between features and the target. It can help to derive new features from existing ones that correlate more strongly with the target. Data derivers can be used to add continuous features (stacked in the tabular dataset, Step 3 above) or **multi-modal** features (unstacked, Step 5 above).

Derivers are configured in a similar way to the other modules. The following arguments are required and shared by all derivers:

* `stacked`: Should the derived feature stack in the processed `DataFrame`?
* `intermediate`: Is the derived `stacked` feature excluded from continuous features?
* `derived_name`: What is the name of the feature?

Here we give three examples:

* `RelativeDeriver` calculates the result of dividing `absolute_col` by `relative2_col`;
* `SampleWeightDeriver` calculates the degree to which a data point is an outlier (it is just an example and there isn't detailed research on it);
* `UnscaledDataDeriver` records all continuous features before scaling (standard scaling by default).

In [19]:
deriver_configs = [
    ("RelativeDeriver", {
        "stacked": True,
        "absolute_col": "cont_0",
        "relative2_col": "cont_1",
        "intermediate": False,
        "derived_name": "derived_cont",
    }),
    ("SampleWeightDeriver", {
        "stacked": True,
        "intermediate": True,
        "derived_name": "sample_weight",
    }),
    ("UnscaledDataDeriver", {"derived_name": "unscaled", "stacked": False}),
]

Data derivers can be set in the same three ways:

1. Modify the configuration file:

```python
cfg = {
    "data_derivers": deriver_configs,
    # Some other configurations...
}
```

2. Use the `manual_config` argument of `Trainer.load_config`.

```python
trainer.load_config("sample", manual_config={"data_derivers": deriver_configs})
```

3. Use `DataModule.set_data_derivers`.

In [20]:
trainer.load_config("sample")
trainer.datamodule.set_data_derivers(deriver_configs)
trainer.load_data()

Project will be saved to ../../../../output/sample/2023-08-27-14-13-07-0_sample-I1
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-08-27-14-13-07-0_sample-I1 (data.csv and tabular_data.csv).


Two `stacked` features can be found in `Trainer.df` or `Trainer.datamodule.df`. `derived_cont` is a continuous feature because `intermediate=False`, but `sample_weight` is not.

In [21]:
trainer.df[["derived_cont", "sample_weight"]]

Unnamed: 0,derived_cont,sample_weight
0,5.884222,1.045746
1,-9.058123,1.063506
2,-3.650394,0.959582
3,1.678893,0.974096
4,-4.099185,1.000761
...,...,...
251,-1.355422,0.958380
252,1.088160,0.978138
253,0.374183,0.969419
254,1.199032,0.967882


In [22]:
"derived_cont" in trainer.cont_feature_names, "sample_weight" in trainer.cont_feature_names

(True, False)

The unstacked feature `unscaled` can be found in `Trainer.derived_data`:

In [23]:
trainer.derived_data["unscaled"]

array([[-1.3065269 , -0.22203901, -0.11816405, -0.15957344,  1.65813065,
         5.88422203],
       [ 2.01125669, -0.22203901,  0.1950697 ,  0.52700418, -0.04459543,
        -9.05812263],
       [-1.21607661,  0.33313566, -0.74367219,  0.73018354,  0.14067191,
        -3.65039444],
       ...,
       [-0.06985649, -0.18669093, -1.02191329, -1.14364135,  0.2501139 ,
         0.37418258],
       [-1.03148246, -0.86026245, -0.06163805,  0.32830128, -1.42999125,
         1.19903231],
       [-1.46173275,  0.96069342,  0.36754489,  1.32906282, -0.68343979,
        -1.52153921]])
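
The relation between `derived_cont` and its source columns can be checked by hand from the numbers printed above. The sketch below copies the first three rows of the `unscaled` array, assuming that its first two columns are `cont_0` and `cont_1` and its last column is `derived_cont` (the column order is an assumption, although the values are consistent with it).

```python
import numpy as np

# First two columns of the first three rows of the array shown above
cont_0 = np.array([-1.3065269, 2.01125669, -1.21607661])
cont_1 = np.array([-0.22203901, -0.22203901, 0.33313566])

# RelativeDeriver divides `absolute_col` by `relative2_col`
derived_cont = cont_0 / cont_1
print(derived_cont)
```

Up to the rounding of the printed digits, the result agrees with the last column of the same rows and with the `derived_cont` column of `trainer.df` shown earlier.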

Available derivers can be seen using:

In [24]:
from tabensemb.data.dataderiver import deriver_mapping, get_data_deriver
deriver_mapping

{'AbstractDeriver': tabensemb.data.base.AbstractDeriver,
 'RelativeDeriver': tabensemb.data.dataderiver.RelativeDeriver,
 'SampleWeightDeriver': tabensemb.data.dataderiver.SampleWeightDeriver,
 'UnscaledDataDeriver': tabensemb.data.dataderiver.UnscaledDataDeriver}

Arguments are documented in the API docs and in the docstrings:

In [25]:
print(get_data_deriver("RelativeDeriver").__doc__)


    Dividing a feature by another to derive a new feature. Required arguments are:

    absolute_col: str
        The feature that needs to be divided.
    relative2_col: str
        The feature that acts as the denominator.
    


## Access the processed dataset

All of this data is available through the `DataModule` instance of the trainer, along with several derived data structures for further use:

* Continuous features
    * `DataModule.feature_data`: scaled
    * `DataModule.unscaled_feature_data`: not scaled
    * `DataModule.X_train/X_val/X_test[trainer.cont_feature_names]`: scaled and divided into three partitions
    * `DataModule.tensors[0]`: scaled and transformed into a `torch.Tensor`
* Categorical features
    * `DataModule.categorical_data`: ordinal-encoded
    * `DataModule.X_train/X_val/X_test[trainer.cat_feature_names]`: ordinal-encoded and divided into three partitions
    * `DataModule.derived_data["categorical"]`: ordinal-encoded
    * `trainer.datamodule.tensors[list(trainer.datamodule.derived_data.keys()).index("categorical")+1]`: ordinal-encoded and transformed into a `torch.Tensor`
* Derived unstacked features
    * `DataModule.derived_data`: includes unstacked features, categorical features, and a flag for each data point indicating whether it is an augmented one
    * `DataModule.tensors[1:-1]`: same as `DataModule.derived_data`, but are `torch.Tensor`s.
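
The index arithmetic used above to locate the categorical tensor can be illustrated with plain Python containers. Everything in this sketch is a placeholder: the key names other than `"categorical"`, their order, and the meaning of the final entry (presumably the target) are assumptions, and strings stand in for the real arrays and tensors.

```python
# Placeholder names only: real entries are arrays/tensors, and the key order is hypothetical.
derived_data = {"unscaled": "unscaled_array", "categorical": "cat_array", "augmented": "aug_flags"}

# Assumed layout: continuous features first, derived entries in key order, one trailing entry.
tensors = ["cont_tensor"] + [f"{key}_tensor" for key in derived_data] + ["last_tensor"]

# +1 skips tensors[0], which holds the continuous features.
cat_pos = list(derived_data.keys()).index("categorical") + 1
print(cat_pos, tensors[cat_pos])
print(tensors[1:-1])
```

Because dictionaries preserve insertion order, the position of a key in `derived_data` maps directly onto a position in `tensors[1:-1]`.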

**Remark**: Currently, derived unstacked features are not used by the supported external model bases. However, they can be accessed easily using the approaches above, and even more easily in a customized model base built on the `PyTorch`-based class `TorchModel`, which is introduced in the "Advanced Usage" sections.

**Remark**: Stacked (continuous) derived features are derived after imputation but before data processing. These features will also be imputed. Unstacked derived features are derived after all other steps are finished.
