# New data derivers

This package currently provides only a small number of derivers. A deriver calculates new features (continuous or categorical) from existing features, or loads images, text, etc. as multimodal data. This section extends the source code of the built-in `tabensemb.data.dataderiver.RelativeDeriver` to demonstrate how to implement a new one.


In [1]:
from tabensemb.data.dataderiver import AbstractDeriver

Data derivers inherit from `tabensemb.data.AbstractDeriver` and should implement four methods:

* `_required_cols`: Names of the arguments whose values are columns that must exist in the tabular dataset. The following code means that the arguments `absolute_col` and `relative2_col` should be given in the configuration, for example `"data_derivers": [("MyRelativeDeriver", {"absolute_col": "cont_0", "relative2_col": "cont_1"})]`.

```python
class MyRelativeDeriver(AbstractDeriver):
    def _required_cols(self):
        return ["absolute_col", "relative2_col"]
```

* `_required_kwargs`: Names of the parameters that must be specified in the configuration. The following code means that the parameter `some_param` should be given in the configuration, for example `"data_derivers": [("MyRelativeDeriver", {"some_param": 1.5})]`.

```python
    def _required_kwargs(self):
        return ["some_param"]
```

**Remark**: "stacked", "intermediate", "derived_name", and "is_continuous" are shared necessary kwargs and do not need to be added to `_required_kwargs`.

* `_defaults`: Default values for the arguments listed in `_required_cols`, `_required_kwargs`, and `["stacked", "intermediate", "derived_name", "is_continuous"]`. If a default value is given for an argument, no error is raised when that argument is not set in the configuration.

```python
    def _defaults(self):
        return dict(stacked=True, intermediate=False, is_continuous=True)
```

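* `_derive`: The main derivation step. It receives the tabular data (a `DataFrame`) and a `DataModule`, and should return an `np.ndarray`. The returned array must not be one-dimensional; a single derived feature should be reshaped into a column, as `reshape(-1, 1)` does in the code that follows. Arguments are checked and recorded in `self.kwargs` during initialization.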

```python
    def _derive(self, df, datamodule):
        absolute_col = self.kwargs["absolute_col"]
        relative2_col = self.kwargs["relative2_col"]
        some_param = self.kwargs["some_param"]
        stacked = self.kwargs["stacked"]

        relative = df[absolute_col] / df[relative2_col]
        relative = relative.values.reshape(-1, 1)
        return relative
```
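
The shape requirement on the return value of `_derive` can be checked in isolation. The following is a minimal sketch that uses only pandas and NumPy on a made-up three-row dataframe (the values are illustrative and are not taken from the sample dataset). It shows that dividing two columns yields a 1-D result, and that `reshape(-1, 1)` turns it into a column with one row per sample.

```python
import numpy as np
import pandas as pd

# Made-up data for illustration only.
df = pd.DataFrame({"cont_0": [1.0, 4.0, 9.0], "cont_1": [2.0, 2.0, 3.0]})

# Element-wise division of two columns gives a 1-D Series.
relative = df["cont_0"] / df["cont_1"]
print(relative.values.ndim)

# reshape(-1, 1) produces a 2-D column: one row per sample, one column.
relative_2d = relative.values.reshape(-1, 1)
print(relative_2d.shape)
```

Note that rows where the denominator is zero would produce `inf` or `NaN` in this division, so a deriver of this kind may need to handle such rows depending on the dataset.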

In [2]:
class MyRelativeDeriver(AbstractDeriver):
    def _required_cols(self):
        return ["absolute_col", "relative2_col"]

    def _required_kwargs(self):
        return ["some_param"]

    def _defaults(self):
        return dict(stacked=True, intermediate=False, is_continuous=True)

    def _derive(self, df, datamodule):
        absolute_col = self.kwargs["absolute_col"]
        relative2_col = self.kwargs["relative2_col"]
        some_param = self.kwargs["some_param"]
        stacked = self.kwargs["stacked"]

        relative = df[absolute_col] / df[relative2_col]
        relative = relative.values.reshape(-1, 1)
        return relative
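
The interplay between required arguments and defaults described above can be pictured with plain dictionaries. The sketch below is not tabensemb's implementation; `resolve_kwargs` is a hypothetical helper written only for illustration. It assumes that user settings override defaults and that any required name left without a value triggers an error.

```python
def resolve_kwargs(required, defaults, user_kwargs):
    """Hypothetical helper: merge defaults with user settings, then check required names."""
    merged = {**defaults, **user_kwargs}  # user settings override defaults
    missing = [name for name in required if name not in merged]
    if missing:
        raise ValueError(f"Missing arguments: {missing}")
    return merged


# Names mirror MyRelativeDeriver; the shared kwargs are treated as required too.
required = [
    "absolute_col", "relative2_col",  # from _required_cols
    "some_param",  # from _required_kwargs
    "stacked", "intermediate", "derived_name", "is_continuous",  # shared kwargs
]
defaults = dict(stacked=True, intermediate=False, is_continuous=True)

user_kwargs = {
    "absolute_col": "cont_0",
    "relative2_col": "cont_1",
    "some_param": 1.0,
    "derived_name": "cont_0_relative2_cont_1",
}
kwargs = resolve_kwargs(required, defaults, user_kwargs)
print(kwargs["stacked"], kwargs["intermediate"])

# Omitting arguments that have no default raises an error.
try:
    resolve_kwargs(required, defaults, {"absolute_col": "cont_0"})
    raised = False
except ValueError as err:
    raised = True
    print(err)
```

In this picture, `stacked`, `intermediate`, and `is_continuous` may be left out of the configuration because `_defaults` covers them, while `derived_name` and `some_param` must always be given.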

The implemented deriver should be registered as follows so that it is recognized by `DataModule.set_data_derivers` automatically.

In [3]:
from tabensemb.data.dataderiver import deriver_mapping
deriver_mapping["MyRelativeDeriver"] = MyRelativeDeriver
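
Registration by name is what lets a configuration refer to a deriver with a string. The following is a rough sketch of that general pattern using a plain dictionary and a stand-in class; `registry` and `DummyDeriver` are hypothetical names, and the sketch makes no claim about how `deriver_mapping` is consumed inside tabensemb.

```python
class DummyDeriver:
    """Stand-in class for illustration; it only records its keyword arguments."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs


# A name-to-class mapping, analogous in spirit to a registry of derivers.
registry = {}
registry["DummyDeriver"] = DummyDeriver

# A configuration entry is a (name, kwargs) pair, as in the examples above.
config = [("DummyDeriver", {"derived_name": "some_feature", "stacked": True})]

# Look up each class by name and instantiate it with its kwargs.
derivers = [registry[name](**kwargs) for name, kwargs in config]
print(type(derivers[0]).__name__, derivers[0].kwargs["derived_name"])

# A name that was never registered cannot be resolved.
print("UnknownDeriver" in registry)
```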

In [4]:
from tabensemb.trainer import Trainer
import tabensemb

prefix = "../../../../"
tabensemb.setting["default_output_path"] = prefix + "output"
tabensemb.setting["default_config_path"] = prefix + "configs"
tabensemb.setting["default_data_path"] = prefix + "data"

trainer = Trainer(device="cpu")

trainer.load_config("sample")

The project will be saved to ../../../../output/sample/2023-09-18-18-15-00-0_sample


If `stacked` is `True`, the derived feature is appended to the dataframe and included in the continuous features:

In [5]:
trainer.datamodule.set_data_derivers([("MyRelativeDeriver", {"absolute_col": "cont_0", "relative2_col": "cont_1", "derived_name": "cont_0_relative2_cont_1", "some_param": 1.0, "stacked": True})])
trainer.load_data()
print(f"cont_0_relative2_cont_1 in continuous features?: {'cont_0_relative2_cont_1' in trainer.cont_feature_names}")
trainer.df

Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-09-18-18-15-00-0_sample (data.csv and tabular_data.csv).
cont_0_relative2_cont_1 in continuous features?: True


Unnamed: 0,cont_0,cont_1,cont_2,cont_3,cont_4,cont_5,cont_6,cont_7,cont_8,cont_9,...,cat_4,cat_5,cat_6,cat_7,cat_8,cat_9,target,target_binary,target_multi_class,cont_0_relative2_cont_1
0,-1.306527,0.065895,-0.118164,-0.159573,1.658131,-1.346718,-0.680178,-1.334258,0.666383,-0.460720,...,2,category_4,3,4,4,3,-71.084217,0,1,-19.827301
1,2.011257,0.117717,0.195070,0.527004,-0.044595,0.616887,-1.781563,0.354758,-0.729045,0.196557,...,3,category_3,3,1,3,2,13.415675,1,2,17.085552
2,-1.216077,0.065895,-0.743672,0.730184,0.140672,1.272954,-0.159012,-0.475175,0.240057,0.100159,...,4,category_3,4,1,0,2,-47.492280,0,2,-18.454666
3,0.559299,0.117717,-0.431096,-0.809627,-1.063696,-0.860153,0.572751,-0.467441,0.677557,1.307184,...,1,category_3,4,2,0,0,-94.482614,1,2,4.751225
4,0.910179,-0.213096,0.786328,-0.042257,0.317218,0.379152,-0.466419,-0.017020,-0.944446,-0.410050,...,0,category_2,0,2,3,0,195.819531,1,3,-4.271217
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,0.280442,-0.206904,0.841631,0.880179,-0.993124,-1.570623,-0.249459,0.643314,0.049495,0.493837,...,2,category_2,2,3,0,2,-171.249549,0,0,-1.355422
252,-1.165150,-1.070753,0.465662,1.054452,0.900826,-0.179925,-1.536244,1.178780,1.488252,1.895889,...,2,category_4,4,2,1,1,23.708442,0,2,1.088160
253,-0.069856,-0.186691,-1.021913,-1.143641,0.250114,1.040239,-1.150438,0.258798,-0.836111,0.642211,...,3,category_3,2,2,2,2,-33.414215,1,1,0.374183
254,-1.031482,-0.860262,-0.061638,0.328301,-1.429991,-1.048170,-1.432735,0.607112,0.087531,0.938747,...,0,category_3,4,1,4,4,-359.199191,0,4,1.199032


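If `stacked` is `True` and `intermediate` is also `True`, the derived column is still appended to the dataframe but is not included in the continuous features: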

In [6]:
trainer.datamodule.set_data_derivers([("MyRelativeDeriver", {"absolute_col": "cont_0", "relative2_col": "cont_1", "derived_name": "cont_0_relative2_cont_1", "some_param": 1.0, "stacked": True, "intermediate": True})])
trainer.load_data()
print(f"cont_0_relative2_cont_1 in continuous features?: {'cont_0_relative2_cont_1' in trainer.cont_feature_names}")
trainer.df

Using previously used data path ../../../../data/sample.csv
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-09-18-18-15-00-0_sample (data.csv and tabular_data.csv).
cont_0_relative2_cont_1 in continuous features?: False


Unnamed: 0,cont_0,cont_1,cont_2,cont_3,cont_4,cont_5,cont_6,cont_7,cont_8,cont_9,...,cat_4,cat_5,cat_6,cat_7,cat_8,cat_9,target,target_binary,target_multi_class,cont_0_relative2_cont_1
0,-1.306527,-0.409756,-0.118164,-0.159573,1.658131,-1.346718,-0.680178,-1.334258,0.666383,-0.460720,...,2,category_4,3,4,4,3,-71.084217,0,1,3.188552
1,2.011257,-0.409756,0.195070,0.527004,-0.044595,0.616887,-1.781563,0.354758,-0.729045,0.196557,...,3,category_3,3,1,3,2,13.415675,1,2,-4.908431
2,-1.216077,0.104704,-0.743672,0.730184,0.140672,1.272954,-0.159012,-0.475175,0.240057,0.100159,...,4,category_3,4,1,0,2,-47.492280,0,2,-11.614467
3,0.559299,0.104704,-0.431096,-0.809627,-1.063696,-0.860153,0.572751,-0.467441,0.677557,1.307184,...,1,category_3,4,2,0,0,-94.482614,1,2,5.341736
4,0.910179,-0.409756,0.786328,-0.042257,0.317218,0.379152,-0.466419,-0.017020,-0.944446,-0.410050,...,0,category_2,0,2,3,0,195.819531,1,3,-2.221273
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,0.280442,-0.206904,0.841631,0.880179,-0.993124,-1.570623,-0.249459,0.643314,0.049495,0.493837,...,2,category_2,2,3,0,2,-171.249549,0,0,-1.355422
252,-1.165150,-1.070753,0.465662,1.054452,0.900826,-0.179925,-1.536244,1.178780,1.488252,1.895889,...,2,category_4,4,2,1,1,23.708442,0,2,1.088160
253,-0.069856,-0.186691,-1.021913,-1.143641,0.250114,1.040239,-1.150438,0.258798,-0.836111,0.642211,...,3,category_3,2,2,2,2,-33.414215,1,1,0.374183
254,-1.031482,-0.860262,-0.061638,0.328301,-1.429991,-1.048170,-1.432735,0.607112,0.087531,0.938747,...,0,category_3,4,1,4,4,-359.199191,0,4,1.199032


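If `stacked` is `False`, the derived data is neither appended to the dataframe nor included in the continuous features; it is kept in `trainer.derived_data` instead: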

In [7]:
trainer.datamodule.set_data_derivers([("MyRelativeDeriver", {"absolute_col": "cont_0", "relative2_col": "cont_1", "derived_name": "cont_0_relative2_cont_1", "some_param": 1.0, "stacked": False})])
trainer.load_data()
print(f"cont_0_relative2_cont_1 in continuous features?: {'cont_0_relative2_cont_1' in trainer.cont_feature_names}")
trainer.df

Using previously used data path ../../../../data/sample.csv
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-09-18-18-15-00-0_sample (data.csv and tabular_data.csv).
cont_0_relative2_cont_1 in continuous features?: False


Unnamed: 0,cont_0,cont_1,cont_2,cont_3,cont_4,cont_5,cont_6,cont_7,cont_8,cont_9,...,cat_3,cat_4,cat_5,cat_6,cat_7,cat_8,cat_9,target,target_binary,target_multi_class
0,-1.306527,0.138315,-0.118164,-0.159573,1.658131,-1.346718,-0.680178,-1.334258,0.666383,-0.460720,...,0,2,category_4,3,4,4,3,-71.084217,0,1
1,2.011257,-0.006111,0.195070,0.527004,-0.044595,0.616887,-1.781563,0.354758,-0.729045,0.196557,...,4,3,category_3,3,1,3,2,13.415675,1,2
2,-1.216077,0.138315,-0.743672,0.730184,0.140672,1.272954,-0.159012,-0.475175,0.240057,0.100159,...,0,4,category_3,4,1,0,2,-47.492280,0,2
3,0.559299,-0.006111,-0.431096,-0.809627,-1.063696,-0.860153,0.572751,-0.467441,0.677557,1.307184,...,4,1,category_3,4,2,0,0,-94.482614,1,2
4,0.910179,-0.006111,0.786328,-0.042257,0.317218,0.379152,-0.466419,-0.017020,-0.944446,-0.410050,...,1,0,category_2,0,2,3,0,195.819531,1,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,0.280442,-0.206904,0.841631,0.880179,-0.993124,-1.570623,-0.249459,0.643314,0.049495,0.493837,...,1,2,category_2,2,3,0,2,-171.249549,0,0
252,-1.165150,-1.070753,0.465662,1.054452,0.900826,-0.179925,-1.536244,1.178780,1.488252,1.895889,...,4,2,category_4,4,2,1,1,23.708442,0,2
253,-0.069856,-0.186691,-1.021913,-1.143641,0.250114,1.040239,-1.150438,0.258798,-0.836111,0.642211,...,0,3,category_3,2,2,2,2,-33.414215,1,1
254,-1.031482,-0.860262,-0.061638,0.328301,-1.429991,-1.048170,-1.432735,0.607112,0.087531,0.938747,...,0,0,category_3,4,1,4,4,-359.199191,0,4


In [8]:
trainer.derived_data.keys()

dict_keys(['cont_0_relative2_cont_1', 'categorical'])