# Dataset and configuration

This part shows how to prepare a new dataset and its configuration file, and covers the basic usage of `UserConfig` and `DataModule`. After reading it, you will be able to run benchmarks on your own dataset.

## The dataset

We provide a randomly generated sample dataset (`data/sample.csv`) and its configuration file (`configs/sample.py`) in the repository. First, let's check the content of `sample.csv`. It contains 256 data points, 10 continuous features (namely `cont_0` to `cont_9`), 10 categorical features (namely `cat_0` to `cat_9`), and one target column `target`.

**Remark**: The dataset file should not contain an index column.

**Remark**: Both `.csv` and `.xlsx` are supported. We recommend `.csv` files for their efficiency.
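As an illustration of the first remark, the sketch below writes a small table with pandas without an index column. The column names follow the sample dataset, but the values and the in-memory buffer (standing in for a real file path) are made up for the example.

```python
import io

import pandas as pd

# A tiny stand-in table using the same column naming scheme as sample.csv.
df = pd.DataFrame(
    {
        "cont_0": [0.1, -1.2, 0.7],
        "cat_0": ["category_1", "category_2", "category_1"],
        "target": [1.5, -0.3, 2.2],
    }
)

buffer = io.StringIO()  # stands in for a path such as "data/my_dataset.csv"
# index=False keeps pandas from writing the row index as an extra first column.
df.to_csv(buffer, index=False)

header = buffer.getvalue().splitlines()[0]
print(header)  # cont_0,cat_0,target
```

By default `DataFrame.to_csv` writes the row index as an extra column, so `index=False` is needed to avoid it.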

In [1]:
import pandas as pd

prefix = "../../../../"
pd.read_csv(prefix + "data/sample.csv")

Unnamed: 0,cont_0,cont_1,cont_2,cont_3,cont_4,cont_5,cont_6,cont_7,cont_8,cont_9,...,cat_1,cat_2,cat_3,cat_4,cat_5,cat_6,cat_7,cat_8,cat_9,target
0,-1.306527,,-0.118164,-0.159573,1.658131,-1.346718,-0.680178,-1.334258,0.666383,-0.460720,...,4,2,0,2,category_4,3,4,4,3,-71.084217
1,2.011257,,0.195070,0.527004,-0.044595,0.616887,-1.781563,0.354758,-0.729045,0.196557,...,4,0,4,3,category_3,3,1,3,2,13.415675
2,-1.216077,,-0.743672,0.730184,0.140672,1.272954,-0.159012,-0.475175,0.240057,0.100159,...,3,2,0,4,category_3,4,1,0,2,-47.492280
3,0.559299,,-0.431096,-0.809627,-1.063696,-0.860153,0.572751,-0.467441,0.677557,1.307184,...,2,4,4,1,category_3,4,2,0,0,-94.482614
4,0.910179,,0.786328,-0.042257,0.317218,0.379152,-0.466419,-0.017020,-0.944446,-0.410050,...,1,3,1,0,category_2,0,2,3,0,195.819531
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,0.280442,-0.206904,0.841631,0.880179,-0.993124,-1.570623,-0.249459,0.643314,0.049495,0.493837,...,3,4,1,2,category_2,2,3,0,2,-171.249549
252,-1.165150,-1.070753,0.465662,1.054452,0.900826,-0.179925,-1.536244,1.178780,1.488252,1.895889,...,3,0,4,2,category_4,4,2,1,1,23.708442
253,-0.069856,-0.186691,-1.021913,-1.143641,0.250114,1.040239,-1.150438,0.258798,-0.836111,0.642211,...,3,3,0,3,category_3,2,2,2,2,-33.414215
254,-1.031482,-0.860262,-0.061638,0.328301,-1.429991,-1.048170,-1.432735,0.607112,0.087531,0.938747,...,4,2,0,0,category_3,4,1,4,4,-359.199191


## The configuration file

A configuration file contains a dictionary stating modified values compared to a given default configuration.
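The idea of "modified values on top of defaults" can be sketched with plain dictionaries. This is only a conceptual illustration with a handful of keys, using a shallow merge. It is not the code `UserConfig` uses, and how the library treats nested keys is not shown here.

```python
# A few default values (shortened for the example).
defaults = {"epoch": 300, "lr": 0.001, "batch_size": 1024}

# The configuration file only states what differs from, or adds to, the defaults.
overrides = {"lr": 0.003, "label_name": ["target"]}

# Later entries win, so overrides replace defaults with the same key.
merged = {**defaults, **overrides}
print(merged)
# {'epoch': 300, 'lr': 0.003, 'batch_size': 1024, 'label_name': ['target']}
```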

**Remark**: The configuration file can be a `.py` file containing a `dict` object named `cfg`, or a `.json` file.
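For the `.json` variant, a configuration fragment could be written with the standard library as sketched below. The file name `my_dataset.json`, the temporary directory, and the assumption that the JSON file holds the same dictionary at its top level are illustrative and not taken from the library documentation.

```python
import json
import os
import tempfile

# A fragment of a configuration; a complete one needs more keys.
cfg = {
    "database": "sample",
    "lr": 0.003,
    "label_name": ["target"],
}

# Written to a temporary directory here; in a project it would sit with the other configs.
path = os.path.join(tempfile.mkdtemp(), "my_dataset.json")
with open(path, "w") as f:
    json.dump(cfg, f, indent=4)

# Reading the file back yields the same dictionary.
with open(path) as f:
    loaded = json.load(f)
print(loaded == cfg)  # True
```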

### The default configuration

To see the default values, use `tabensemb.config.UserConfig`, which inherits from `dict`.

In [2]:
from tabensemb.config import UserConfig
from tabensemb.utils import pretty
import tabensemb

tabensemb.setting["default_config_path"] = prefix + "configs"

cfg = UserConfig("sample")
print(pretty(cfg.defaults()))

{
	'database': 'sample',
	'bayes_opt': False,
	'bayes_calls': 50,
	'bayes_epoch': 30,
	'patience': 100,
	'epoch': 300,
	'lr': 0.001,
	'weight_decay': 1e-09,
	'batch_size': 1024,
	'layers': [
		64,
		128,
		256,
		128,
		64
	],
	'SPACEs': {
		'lr': {
			'type': 'Real',
			'low': 0.0001,
			'high': 0.05,
			'prior': 'log-uniform'
		},
		'weight_decay': {
			'type': 'Real',
			'low': 1e-09,
			'high': 0.05,
			'prior': 'log-uniform'
		},
		'batch_size': {
			'type': 'Categorical',
			'categories': [
				64,
				128,
				256,
				512,
				1024,
				2048
			]
		}
	},
	'data_splitter': 'RandomSplitter',
	'split_ratio': [
		0.6,
		0.2,
		0.2
	],
	'data_imputer': 'MissForestImputer',
	'data_processors': [
		(
			'CategoricalOrdinalEncoder',
			{
			}
		),
		(
			'NaNFeatureRemover',
			{
			}
		),
		(
			'VarianceFeatureSelector',
			{
				'thres': 1
			}
		),
		(
			'StandardScaler',
			{
			}
		)
	],
	'data_derivers': [
	],
	'feature_names_type': {
	},
	'categorical_feature_names': [
	],
	'f

### The configuration of the given sample dataset

`configs/sample.py` contains the following contents:
```python
cfg = {
    "database": "sample",
    "feature_types": ["Continuous", "Categorical", "Derived"],
    "feature_names_type": {
        "cont_0": 0,
        "cont_1": 0,
        "cont_2": 0,
        "cont_3": 0,
        "cont_4": 0,
        "cat_0": 1,
        "cat_1": 1,
        "cat_2": 1,
    },
    "categorical_feature_names": [
        "cat_0",
        "cat_1",
        "cat_2",
    ],
    "label_name": ["target"],
}
```
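For datasets with many columns, a dictionary like this can be assembled programmatically rather than typed by hand. The following is a minimal sketch, assuming that each value in `feature_names_type` is the index of the feature's type in `feature_types`, as in the sample. `build_cfg` is a hypothetical helper, not part of `tabensemb`.

```python
feature_types = ["Continuous", "Categorical", "Derived"]
continuous = [f"cont_{i}" for i in range(5)]  # cont_0 ... cont_4
categorical = [f"cat_{i}" for i in range(3)]  # cat_0 ... cat_2


def build_cfg(database, continuous, categorical, label):
    """Hypothetical helper (not part of tabensemb) that assembles a config dict."""
    # Each feature maps to the index of its type in feature_types.
    feature_names_type = {
        name: feature_types.index("Continuous") for name in continuous
    }
    feature_names_type.update(
        {name: feature_types.index("Categorical") for name in categorical}
    )
    return {
        "database": database,
        "feature_types": feature_types,
        "feature_names_type": feature_names_type,
        # Categorical features are listed a second time under this key.
        "categorical_feature_names": list(categorical),
        "label_name": [label],
    }


cfg = build_cfg("sample", continuous, categorical, "target")
print(cfg["feature_names_type"])
```

The resulting `cfg` has the same keys and values as the hand-written dictionary above.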
Load `configs/sample.py` and see the changes.

In [3]:
cfg = UserConfig("sample")
print(pretty(cfg))

{
	'database': 'sample',
	'bayes_opt': False,
	'bayes_calls': 50,
	'bayes_epoch': 30,
	'patience': 100,
	'epoch': 300,
	'lr': 0.001,
	'weight_decay': 1e-09,
	'batch_size': 1024,
	'layers': [
		64,
		128,
		256,
		128,
		64
	],
	'SPACEs': {
		'lr': {
			'type': 'Real',
			'low': 0.0001,
			'high': 0.05,
			'prior': 'log-uniform'
		},
		'weight_decay': {
			'type': 'Real',
			'low': 1e-09,
			'high': 0.05,
			'prior': 'log-uniform'
		},
		'batch_size': {
			'type': 'Categorical',
			'categories': [
				64,
				128,
				256,
				512,
				1024,
				2048
			]
		}
	},
	'data_splitter': 'RandomSplitter',
	'split_ratio': [
		0.6,
		0.2,
		0.2
	],
	'data_imputer': 'MissForestImputer',
	'data_processors': [
		(
			'CategoricalOrdinalEncoder',
			{
			}
		),
		(
			'NaNFeatureRemover',
			{
			}
		),
		(
			'VarianceFeatureSelector',
			{
				'thres': 1
			}
		),
		(
			'StandardScaler',
			{
			}
		)
	],
	'data_derivers': [
	],
	'feature_names_type': {
		'cont_0': 0,
		'cont_1': 0,
		'cont_2': 0

### Descriptions of keys in a configuration file

* `database`: The name of the database file. The file should be placed in the script directory or in `tabensemb.setting["default_data_path"]`. If no extension (`.csv` or `.xlsx`) is provided, the program automatically searches for a file with a matching extension. If both `.csv` and `.xlsx` files exist, an exception is raised.
* `bayes_opt`: Whether to perform Gaussian-process-based Bayesian hyperparameter optimization (HPO) using the `scikit-optimize` package when training each model.
* `bayes_calls`: The number of calls of the Bayesian HPO. During each call, the model is trained with a given set of hyperparameters, and the metric on the validation set is returned to the Bayesian HPO process.
* `bayes_epoch`: The number of epochs during each Bayesian HPO call.
* `patience`: Early stopping patience. If the metric on the validation set does not improve after `patience` epochs, the training process terminates and the best model is loaded.
* `epoch`: Total epochs to train each model.
* `lr`: Initial learning rate.
* `weight_decay`: Initial weight decay (for a `torch.optim.Adam` optimizer).
* `batch_size`: Initial batch size.
* `layers`: Default hidden layers for some models.
* `SPACEs`: Default Bayesian HPO spaces for `lr`, `weight_decay`, and `batch_size`. The key `type` determines the `skopt.space` class, and the remaining keys determine its arguments.
* `data_splitter`: The dataset splitting method to split training/validation/testing sets. See `tabensemb.data.datasplitter.splitter_mapping` for available classes.
* `split_ratio`: The ratio of training/validation/testing sets.
* `data_imputer`: The imputation method for `NaN` values. See `tabensemb.data.dataimputer.imputer_mapping` for available classes.
* `data_processors`: A list of data processing steps and their corresponding arguments. See `tabensemb.data.dataprocessor.processor_mapping` for available classes. See API docs for definitions of arguments.
* `data_derivers`: A list of feature augmentation steps and their corresponding arguments. Some fixed arguments are:
    * `stacked`: `True` to append the derived feature to continuous features and to the final `DataFrame` representing the processed dataset. `False` to leave it as an unstacked feature (mostly for multi-modal data).
    * `intermediate`: `True` to ignore the derived feature in continuous features even when `stacked=True`, but still append the feature to the `DataFrame`.

    See `tabensemb.data.dataderiver.deriver_mapping` for available classes. See API docs or `_required_cols` of each class for its additional arguments.
* `feature_types`: General types of features. `Categorical` and `Derived` are necessary for training and plotting.
* `feature_names_type`: A dict stating the features to use and their types. In this example, the value for continuous features is 0, which is the index of `Continuous` in `feature_types`, and the value for categorical features is 1, which is the index of `Categorical` in `feature_types`.
* `categorical_feature_names`: The features that are `Categorical`, listed again here.
* `label_name`: The name of the target to predict, given as a list.
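To make the `SPACEs` entry above more concrete, the sketch below draws one candidate set of hyperparameters from the default spaces using only the standard library. It is a stand-in for illustration: the actual optimization relies on `scikit-optimize` (`skopt.space.Real` and `skopt.space.Categorical`), and `sample_space` is a hypothetical function, not library code.

```python
import math
import random

# The default spaces, copied from the default configuration.
SPACEs = {
    "lr": {"type": "Real", "low": 0.0001, "high": 0.05, "prior": "log-uniform"},
    "weight_decay": {"type": "Real", "low": 1e-09, "high": 0.05, "prior": "log-uniform"},
    "batch_size": {"type": "Categorical", "categories": [64, 128, 256, 512, 1024, 2048]},
}


def sample_space(spec, rng):
    """Hypothetical stand-in for one draw from a search space (not skopt code)."""
    if spec["type"] == "Real" and spec.get("prior") == "log-uniform":
        # Uniform in log space, so each order of magnitude is equally likely.
        log_value = rng.uniform(math.log(spec["low"]), math.log(spec["high"]))
        return math.exp(log_value)
    if spec["type"] == "Categorical":
        return rng.choice(spec["categories"])
    raise ValueError(f"Unsupported space: {spec}")


rng = random.Random(0)  # fixed seed so the draw is repeatable
candidate = {name: sample_space(spec, rng) for name, spec in SPACEs.items()}
print(candidate)
```

A log-uniform prior suits `lr` and `weight_decay` because their plausible values span several orders of magnitude.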

## Use the configuration file to load the dataset

The `DataModule` requires a `UserConfig` to load the dataset; it then initializes and runs all data processing steps on the dataset or on a new dataset supplied later. The following lines load the dataset and display the loaded and processed `DataFrame` before imputation.

In [4]:
from tabensemb.data import DataModule
tabensemb.setting["default_data_path"] = prefix + "data"

datamodule = DataModule(cfg)
datamodule.load_data()
datamodule.get_not_imputed_df()

Dataset size: 153 51 52


Unnamed: 0,cont_0,cont_1,cont_2,cont_3,cont_4,cont_5,cont_6,cont_7,cont_8,cont_9,...,cat_1,cat_2,cat_3,cat_4,cat_5,cat_6,cat_7,cat_8,cat_9,target
0,-1.306527,,-0.118164,-0.159573,1.658131,-1.346718,-0.680178,-1.334258,0.666383,-0.460720,...,4,2,0,2,category_4,3,4,4,3,-71.084217
1,2.011257,,0.195070,0.527004,-0.044595,0.616887,-1.781563,0.354758,-0.729045,0.196557,...,4,0,4,3,category_3,3,1,3,2,13.415675
2,-1.216077,,-0.743672,0.730184,0.140672,1.272954,-0.159012,-0.475175,0.240057,0.100159,...,3,2,0,4,category_3,4,1,0,2,-47.492280
3,0.559299,,-0.431096,-0.809627,-1.063696,-0.860153,0.572751,-0.467441,0.677557,1.307184,...,2,4,4,1,category_3,4,2,0,0,-94.482614
4,0.910179,,0.786328,-0.042257,0.317218,0.379152,-0.466419,-0.017020,-0.944446,-0.410050,...,1,3,1,0,category_2,0,2,3,0,195.819531
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,0.280442,-0.206904,0.841631,0.880179,-0.993124,-1.570623,-0.249459,0.643314,0.049495,0.493837,...,3,4,1,2,category_2,2,3,0,2,-171.249549
252,-1.165150,-1.070753,0.465662,1.054452,0.900827,-0.179925,-1.536244,1.178780,1.488252,1.895889,...,3,0,4,2,category_4,4,2,1,1,23.708442
253,-0.069856,-0.186691,-1.021913,-1.143641,0.250114,1.040239,-1.150438,0.258798,-0.836111,0.642211,...,3,3,0,3,category_3,2,2,2,2,-33.414215
254,-1.031482,-0.860262,-0.061638,0.328301,-1.429991,-1.048170,-1.432735,0.607112,0.087531,0.938747,...,4,2,0,0,category_3,4,1,4,4,-359.199191


`DataModule.df` presents the imputed `DataFrame`.

In [5]:
datamodule.df

Unnamed: 0,cont_0,cont_1,cont_2,cont_3,cont_4,cont_5,cont_6,cont_7,cont_8,cont_9,...,cat_1,cat_2,cat_3,cat_4,cat_5,cat_6,cat_7,cat_8,cat_9,target
0,-1.306527,0.058868,-0.118164,-0.159573,1.658131,-1.346718,-0.680178,-1.334258,0.666383,-0.460720,...,4,2,0,2,category_4,3,4,4,3,-71.084217
1,2.011257,0.058868,0.195070,0.527004,-0.044595,0.616887,-1.781563,0.354758,-0.729045,0.196557,...,4,0,4,3,category_3,3,1,3,2,13.415675
2,-1.216077,0.058868,-0.743672,0.730184,0.140672,1.272954,-0.159012,-0.475175,0.240057,0.100159,...,3,2,0,4,category_3,4,1,0,2,-47.492280
3,0.559299,0.337892,-0.431096,-0.809627,-1.063696,-0.860153,0.572751,-0.467441,0.677557,1.307184,...,2,4,4,1,category_3,4,2,0,0,-94.482614
4,0.910179,0.058868,0.786328,-0.042257,0.317218,0.379152,-0.466419,-0.017020,-0.944446,-0.410050,...,1,3,1,0,category_2,0,2,3,0,195.819531
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,0.280442,-0.206904,0.841631,0.880179,-0.993124,-1.570623,-0.249459,0.643314,0.049495,0.493837,...,3,4,1,2,category_2,2,3,0,2,-171.249549
252,-1.165150,-1.070753,0.465662,1.054452,0.900827,-0.179925,-1.536244,1.178780,1.488252,1.895889,...,3,0,4,2,category_4,4,2,1,1,23.708442
253,-0.069856,-0.186691,-1.021913,-1.143641,0.250114,1.040239,-1.150438,0.258798,-0.836111,0.642211,...,3,3,0,3,category_3,2,2,2,2,-33.414215
254,-1.031482,-0.860262,-0.061638,0.328301,-1.429991,-1.048170,-1.432735,0.607112,0.087531,0.938747,...,4,2,0,0,category_3,4,1,4,4,-359.199191


`DataModule.train_indices`, `DataModule.val_indices`, and `DataModule.test_indices` represent indices of training/validation/testing sets, respectively.

In [6]:
datamodule.train_indices, datamodule.val_indices, datamodule.test_indices

(array([162, 128, 241, 153, 150, 222,  87, 229, 184, 169, 172, 232,  31,
        151, 139, 247, 132,   3, 211, 105, 109, 198, 188,  73, 125, 196,
         76, 124, 183,  53,  88, 135,  81, 123, 237, 217, 179, 216, 197,
        160, 186,  94, 212, 193, 141,  89, 255, 177, 252, 140,  69, 171,
        148,  35,  67, 111,  65, 208, 136, 167, 161, 145, 251,  71, 240,
        102, 226,   7, 202,  58, 242, 103, 174, 121,  16, 199, 159,   9,
        233, 122, 182,  50, 248,   2, 106,  84, 220, 176, 205, 200,  12,
         18,  70, 245,  68,  80, 185,  49,  91,  51,  17, 127, 146, 181,
        249,  86,   6,   8,  96, 192,  99,  22,  19,  14, 227, 170,  97,
        164,  37, 231,  72, 108, 133, 213, 152, 138,   0, 130, 236, 155,
         90, 234, 187,  93,  74, 119, 215,  95,  29, 115,  23,   5,  85,
         57, 244,  64, 180,  36, 114, 156, 117,  98, 110]),
 array([228, 175,  44, 118, 147,  21, 137,  26, 221, 165, 144,  60,  66,
        173, 134,  10,   1, 101,  63,  13,  43,  20,  40,  92, 2

For detailed functionalities of `DataModule`, please check the API documentation.
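The reported sizes (`Dataset size: 153 51 52`) are consistent with applying `split_ratio = [0.6, 0.2, 0.2]` to the 256 data points. The sketch below shows one rounding rule that reproduces them. It is an assumption for illustration and not necessarily how `RandomSplitter` is implemented, and the shuffled indices will differ from those printed above.

```python
import random

n_samples = 256
split_ratio = [0.6, 0.2, 0.2]

# Shuffle all row indices with a fixed seed so the split is repeatable.
indices = list(range(n_samples))
random.Random(0).shuffle(indices)

# Round the training and validation sizes down; the test set takes the rest.
n_train = int(n_samples * split_ratio[0])  # 153
n_val = int(n_samples * split_ratio[1])  # 51
train_indices = indices[:n_train]
val_indices = indices[n_train:n_train + n_val]
test_indices = indices[n_train + n_val:]  # the remaining 52 points

print(len(train_indices), len(val_indices), len(test_indices))  # 153 51 52
```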

## A `Trainer` does everything for you

In practice, a user does not need to create a `UserConfig` or a `DataModule` manually, because `Trainer` performs all of the steps above. After calling `Trainer.load_config` and `Trainer.load_data`, a `UserConfig` instance containing the configuration and a `DataModule` instance containing the processing steps and the loaded data are generated; they can be accessed through `Trainer.args` and `Trainer.datamodule`, respectively.


In [7]:
from tabensemb.trainer import Trainer

tabensemb.setting["default_output_path"] = prefix + "output"
trainer = Trainer(device="cpu")
trainer.load_config("sample")
trainer.load_data()
type(trainer.args), type(trainer.datamodule)

Project will be saved to ../../../../output/sample/2023-07-29-16-38-05-0_sample
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-07-29-16-38-05-0_sample (data.csv and tabular_data.csv).


(tabensemb.config.user_config.UserConfig, tabensemb.data.datamodule.DataModule)