# Dataset and configuration

This part shows how to prepare a new dataset and its configuration file, and introduces the basic usage of `UserConfig` and `DataModule`. After reading it, you will be able to run benchmarks on your own dataset.

## The dataset

The repository provides a randomly generated sample dataset (`data/sample.csv`) and its configuration file (`configs/sample.py`). First, let's check the content of `sample.csv`. It contains 256 data points, 10 continuous features (`cont_0` to `cont_9`), 10 categorical features (`cat_0` to `cat_9`), and three target columns: `target`, `target_binary`, and `target_multi_class`.

**Remark**: The dataset file should not contain an index column.

**Remark**: Both `.csv` and `.xlsx` are supported. We recommend `.csv` files for their efficiency.

**Remark**: Categorical features that contain non-numerical values (bool, string, or mixed types) are converted to strings. For example, the number `3` and the string `"3"` in such a feature are treated as the same category, because both are interpreted as the string `"3"`.
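
To make the last remark concrete, here is a minimal sketch of the conversion rule in plain Python. The helper `normalize_categorical` is hypothetical and only illustrates the behaviour described in the remark; it is not the library's implementation.

```python
def normalize_categorical(values):
    """Hypothetical helper: mimic the string-conversion rule of the remark."""

    def is_number(v):
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if all(is_number(v) for v in values):
        # Purely numerical feature: left unchanged in this sketch.
        return list(values)
    # At least one non-numerical value: every entry becomes a string.
    return [str(v) for v in values]


mixed = normalize_categorical([3, "3", "category_4"])
numeric = normalize_categorical([0, 2, 4])
print(mixed)    # ['3', '3', 'category_4']
print(numeric)  # [0, 2, 4]
```

In the mixed case, the integer `3` and the string `"3"` end up as the same category.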

In [1]:
import pandas as pd

prefix = "../../../../"
pd.read_csv(prefix + "data/sample.csv")

Unnamed: 0,cont_0,cont_1,cont_2,cont_3,cont_4,cont_5,cont_6,cont_7,cont_8,cont_9,...,cat_3,cat_4,cat_5,cat_6,cat_7,cat_8,cat_9,target,target_binary,target_multi_class
0,-1.306527,,-0.118164,-0.159573,1.658131,-1.346718,-0.680178,-1.334258,0.666383,-0.460720,...,0,2,category_4,3,4,4,3,-71.084217,0,1
1,2.011257,,0.195070,0.527004,-0.044595,0.616887,-1.781563,0.354758,-0.729045,0.196557,...,4,3,category_3,3,1,3,2,13.415675,1,2
2,-1.216077,,-0.743672,0.730184,0.140672,1.272954,-0.159012,-0.475175,0.240057,0.100159,...,0,4,category_3,4,1,0,2,-47.492280,0,2
3,0.559299,,-0.431096,-0.809627,-1.063696,-0.860153,0.572751,-0.467441,0.677557,1.307184,...,4,1,category_3,4,2,0,0,-94.482614,1,2
4,0.910179,,0.786328,-0.042257,0.317218,0.379152,-0.466419,-0.017020,-0.944446,-0.410050,...,1,0,category_2,0,2,3,0,195.819531,1,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,0.280442,-0.206904,0.841631,0.880179,-0.993124,-1.570623,-0.249459,0.643314,0.049495,0.493837,...,1,2,category_2,2,3,0,2,-171.249549,0,0
252,-1.165150,-1.070753,0.465662,1.054452,0.900826,-0.179925,-1.536244,1.178780,1.488252,1.895889,...,4,2,category_4,4,2,1,1,23.708442,0,2
253,-0.069856,-0.186691,-1.021913,-1.143641,0.250114,1.040239,-1.150438,0.258798,-0.836111,0.642211,...,0,3,category_3,2,2,2,2,-33.414215,1,1
254,-1.031482,-0.860262,-0.061638,0.328301,-1.429991,-1.048170,-1.432735,0.607112,0.087531,0.938747,...,0,0,category_3,4,1,4,4,-359.199191,0,4


## The configuration file

A configuration file contains a dictionary stating modified values compared to a given default configuration.
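
Conceptually, the modified values are layered on top of the defaults. The snippet below is a simplified sketch with plain dictionaries. The `defaults` subset is copied from the default configuration printed later in this part, the override values are made up for illustration, and the top-level merge is an assumption about the behaviour rather than the actual `UserConfig` code.

```python
# A few default values (a subset of the default configuration).
defaults = {"lr": 0.001, "batch_size": 1024, "epoch": 300}

# Values stated in a configuration file; only the differences are listed.
overrides = {"batch_size": 256, "label_name": ["target"]}

# Assumed behaviour: entries from the file replace or extend the defaults.
merged = {**defaults, **overrides}

print(merged["batch_size"])  # 256 (overridden)
print(merged["lr"])          # 0.001 (default kept)
print(merged["label_name"])  # ['target'] (added by the file)
```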

**Remark**: The configuration file can be a `.py` file containing a `dict` object named `cfg`, or a `.json` file.
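
As an illustration of the `.json` option, the sketch below serializes the dictionary used in `configs/sample.py` with Python's standard `json` module. It assumes that the `.json` file simply holds the content of `cfg` as its top-level object. The file name `sample.json` is hypothetical.

```python
import json
import os
import tempfile

cfg = {
    "database": "sample",
    "continuous_feature_names": ["cont_0", "cont_1", "cont_2", "cont_3", "cont_4"],
    "categorical_feature_names": ["cat_0", "cat_1", "cat_2"],
    "label_name": ["target"],
}

# Write to a temporary directory so the sketch does not touch the repository.
path = os.path.join(tempfile.mkdtemp(), "sample.json")  # hypothetical file name
with open(path, "w") as f:
    json.dump(cfg, f, indent=4)

# Reading the file back gives the same dictionary.
with open(path) as f:
    loaded = json.load(f)
print(loaded == cfg)  # True
```

Note that JSON has no tuple type, so values written as tuples in a `.py` configuration would be stored as lists in a `.json` file.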

### The default configuration

To see the default values, use `tabensemb.config.UserConfig`, which inherits from `dict`.

In [2]:
from tabensemb.config import UserConfig
from tabensemb.utils import pretty
import tabensemb

tabensemb.setting["default_config_path"] = prefix + "configs"

cfg = UserConfig("sample")
print(pretty(cfg.defaults()))

{
	'database': 'sample',
	'task': None,
	'loss': None,
	'bayes_opt': False,
	'bayes_calls': 50,
	'bayes_epoch': 30,
	'patience': 100,
	'epoch': 300,
	'lr': 0.001,
	'weight_decay': 1e-09,
	'batch_size': 1024,
	'layers': [
		64,
		128,
		256,
		128,
		64
	],
	'SPACEs': {
		'lr': {
			'type': 'Real',
			'low': 0.0001,
			'high': 0.05,
			'prior': 'log-uniform'
		},
		'weight_decay': {
			'type': 'Real',
			'low': 1e-09,
			'high': 0.05,
			'prior': 'log-uniform'
		},
		'batch_size': {
			'type': 'Categorical',
			'categories': [
				64,
				128,
				256,
				512,
				1024,
				2048
			]
		}
	},
	'data_splitter': 'RandomSplitter',
	'split_ratio': [
		0.6,
		0.2,
		0.2
	],
	'data_imputer': 'MissForestImputer',
	'data_processors': [
		(
			'CategoricalOrdinalEncoder',
			{
			}
		),
		(
			'NaNFeatureRemover',
			{
			}
		),
		(
			'VarianceFeatureSelector',
			{
				'thres': 1
			}
		),
		(
			'StandardScaler',
			{
			}
		)
	],
	'data_derivers': [
	],
	'categorical_feature_names': [
	],
	'

### The configuration of the given sample dataset

`configs/sample.py` contains the following contents:
```python
cfg = {
    "database": "sample",
    "continuous_feature_names": ["cont_0", "cont_1", "cont_2", "cont_3", "cont_4"],
    "categorical_feature_names": ["cat_0", "cat_1", "cat_2"],
    "label_name": ["target"],
}
```
Load `configs/sample.py` and see the changes.

In [3]:
cfg = UserConfig("sample")
print(pretty(cfg))

{
	'database': 'sample',
	'task': None,
	'loss': None,
	'bayes_opt': False,
	'bayes_calls': 50,
	'bayes_epoch': 30,
	'patience': 100,
	'epoch': 300,
	'lr': 0.001,
	'weight_decay': 1e-09,
	'batch_size': 1024,
	'layers': [
		64,
		128,
		256,
		128,
		64
	],
	'SPACEs': {
		'lr': {
			'type': 'Real',
			'low': 0.0001,
			'high': 0.05,
			'prior': 'log-uniform'
		},
		'weight_decay': {
			'type': 'Real',
			'low': 1e-09,
			'high': 0.05,
			'prior': 'log-uniform'
		},
		'batch_size': {
			'type': 'Categorical',
			'categories': [
				64,
				128,
				256,
				512,
				1024,
				2048
			]
		}
	},
	'data_splitter': 'RandomSplitter',
	'split_ratio': [
		0.6,
		0.2,
		0.2
	],
	'data_imputer': 'MissForestImputer',
	'data_processors': [
		(
			'CategoricalOrdinalEncoder',
			{
			}
		),
		(
			'NaNFeatureRemover',
			{
			}
		),
		(
			'VarianceFeatureSelector',
			{
				'thres': 1
			}
		),
		(
			'StandardScaler',
			{
			}
		)
	],
	'data_derivers': [
	],
	'categorical_feature_names': [
		'cat

### Descriptions of keys in a configuration file

* `database`: The name of the database file. The file should be placed in the script directory or in `tabensemb.setting["default_data_path"]`. If no extension (`.csv` or `.xlsx`) is provided, the program automatically searches for a file with a matching extension. If both `.csv` and `.xlsx` files exist, an exception is raised.
* `task`: "regression" for regression tasks, "binary" for binary classification, and "multiclass" for multiclass classification. If left as `None`, the task is guessed from the type of the target: if the target is of type `object` or integer, "binary" or "multiclass" is guessed depending on the number of unique target values; otherwise, "regression" is guessed.
* `loss`: "mse" (default) or "mae" for regression tasks, and "cross_entropy" for classification tasks. This loss is used across all model bases. If left as `None`, "mse" or "cross_entropy" is used according to the task.
* `bayes_opt`: Whether to perform Gaussian-process-based Bayesian hyperparameter optimization (HPO) using the `scikit-optimize` package when training each model.
* `bayes_calls`: The number of calls of the Bayesian HPO. During each call, the model will be trained given a set of hyperparameters, and then the metric on the validation set will be returned to the Bayesian HPO process.
* `bayes_epoch`: The number of epochs during each Bayesian HPO call.
* `patience`: Early stopping patience. If the metric on the validation set does not improve after `patience` epochs, the training process terminates and the best model is loaded.
* `epoch`: Total epochs to train each model.
* `lr`: Initial learning rate.
* `weight_decay`: Initial `weight_decay` (for a `torch.optim.Adam` optimizer).
* `batch_size`: Initial `batch_size`.
* `layers`: Default hidden layers for some models.
* `SPACEs`: Default Bayesian HPO spaces for `lr`, `weight_decay`, and `batch_size`. The key `type` determines the `skopt.space` class, and the remaining keys determine its arguments.
* `data_splitter`: The dataset splitting method to split training/validation/testing sets. See `tabensemb.data.datasplitter.splitter_mapping` for available classes.
* `split_ratio`: The ratio of training/validation/testing sets.
* `data_imputer`: The imputation method for `NaN` values. See `tabensemb.data.dataimputer.imputer_mapping` for available classes.
* `data_processors`: A list of data processing steps and their corresponding arguments. See `tabensemb.data.dataprocessor.processor_mapping` for available classes. See API docs for definitions of arguments.
* `data_derivers`: A list of feature augmentation steps and their corresponding arguments. Some fixed arguments are:
    * `stacked`: `True` to append the derived feature to continuous features and the final `DataFrame` representing the processed dataset. `False` to leave it as an unstacked feature (mostly for multi-modal data).
    * `intermediate`: `True` to ignore the derived feature in continuous features even when `stacked=True`, but still append the feature to the `DataFrame`.

    See `tabensemb.data.dataderiver.deriver_mapping` for available classes. See API docs or `_required_cols` of each class for its additional arguments.
* `continuous_feature_names`: Continuous features. Each of them should contain only floats or integers.
* `categorical_feature_names`: Categorical features. Each of them should contain only integers or strings.
* `feature_types`: A dictionary stating categories of each feature defined in `continuous_feature_names` and `categorical_feature_names`. If it is not given in the configuration, "Continuous" and "Categorical" will be automatically used to assign the values of continuous and categorical features, respectively.
* `unique_feature_types`: Unique values in the dictionary `feature_types`.
* `label_name`: The predicted target.
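
The rule for guessing `task` described in the list above can be sketched as follows. `guess_task` is a hypothetical helper written for illustration, and treating at most two unique values as "binary" is an assumption; the library's own check may differ in detail.

```python
def guess_task(targets):
    """Hypothetical helper: guess the task from a list of target values."""
    # Strings stand in for pandas' `object` dtype in this plain-Python sketch.
    discrete = all(isinstance(t, (str, int)) for t in targets)
    if not discrete:
        # Floats (or anything else) are treated as a regression target.
        return "regression"
    # Assumption: at most two unique values mean binary, more mean multiclass.
    return "binary" if len(set(targets)) <= 2 else "multiclass"


print(guess_task([-71.08, 13.42, -47.49]))  # regression
print(guess_task([0, 1, 0, 1]))             # binary
print(guess_task([1, 2, 2, 3, 0, 4]))       # multiclass
```

Setting `task` explicitly in the configuration avoids relying on this kind of guess, for example when a regression target happens to be stored as integers.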

## Use the configuration file to load the dataset

The `DataModule` requires a `UserConfig` to load the dataset. It then initializes and runs all data processing steps on the dataset, or on a new dataset provided later. The following lines load the dataset and present the loaded and processed `DataFrame` without imputation.

In [4]:
from tabensemb.data import DataModule
tabensemb.setting["default_data_path"] = prefix + "data"

datamodule = DataModule(cfg)
datamodule.load_data()
datamodule.get_not_imputed_df()

Dataset size: 153 51 52


Unnamed: 0,cont_0,cont_1,cont_2,cont_3,cont_4,cont_5,cont_6,cont_7,cont_8,cont_9,...,cat_3,cat_4,cat_5,cat_6,cat_7,cat_8,cat_9,target,target_binary,target_multi_class
0,-1.306527,,-0.118164,-0.159573,1.658131,-1.346718,-0.680178,-1.334258,0.666383,-0.460720,...,0,2,category_4,3,4,4,3,-71.084217,0,1
1,2.011257,,0.195070,0.527004,-0.044595,0.616887,-1.781563,0.354758,-0.729045,0.196557,...,4,3,category_3,3,1,3,2,13.415675,1,2
2,-1.216077,,-0.743672,0.730184,0.140672,1.272954,-0.159012,-0.475175,0.240057,0.100159,...,0,4,category_3,4,1,0,2,-47.492280,0,2
3,0.559299,,-0.431096,-0.809627,-1.063696,-0.860153,0.572751,-0.467441,0.677557,1.307184,...,4,1,category_3,4,2,0,0,-94.482614,1,2
4,0.910179,,0.786328,-0.042257,0.317218,0.379152,-0.466419,-0.017020,-0.944446,-0.410050,...,1,0,category_2,0,2,3,0,195.819531,1,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,0.280442,-0.206904,0.841631,0.880179,-0.993124,-1.570623,-0.249459,0.643314,0.049495,0.493837,...,1,2,category_2,2,3,0,2,-171.249549,0,0
252,-1.165150,-1.070753,0.465662,1.054452,0.900826,-0.179925,-1.536244,1.178780,1.488252,1.895889,...,4,2,category_4,4,2,1,1,23.708442,0,2
253,-0.069856,-0.186691,-1.021913,-1.143641,0.250114,1.040239,-1.150438,0.258798,-0.836111,0.642211,...,0,3,category_3,2,2,2,2,-33.414215,1,1
254,-1.031482,-0.860262,-0.061638,0.328301,-1.429991,-1.048170,-1.432735,0.607112,0.087531,0.938747,...,0,0,category_3,4,1,4,4,-359.199191,0,4
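
The `Dataset size: 153 51 52` line in the output above can be reproduced from `split_ratio` with simple arithmetic. The sketch below assumes that the training and validation sizes are rounded down and that the testing set receives the remainder. This is one scheme consistent with the printed sizes, not necessarily what `RandomSplitter` does internally, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n, ratio, seed=0):
    """Hypothetical helper: shuffle range(n) and cut it according to `ratio`."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(n * ratio[0])  # rounded down (assumption)
    n_val = int(n * ratio[1])    # rounded down (assumption)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the testing set
    return train, val, test


train, val, test = split_indices(256, [0.6, 0.2, 0.2])
print(len(train), len(val), len(test))  # 153 51 52
```

The three index lists are disjoint and together cover all 256 rows.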


`DataModule.df` presents the imputed `DataFrame`.

In [5]:
datamodule.df

Unnamed: 0,cont_0,cont_1,cont_2,cont_3,cont_4,cont_5,cont_6,cont_7,cont_8,cont_9,...,cat_3,cat_4,cat_5,cat_6,cat_7,cat_8,cat_9,target,target_binary,target_multi_class
0,-1.306527,-0.568944,-0.118164,-0.159573,1.658131,-1.346718,-0.680178,-1.334258,0.666383,-0.460720,...,0,2,category_4,3,4,4,3,-71.084217,0,1
1,2.011257,-0.410219,0.195070,0.527004,-0.044595,0.616887,-1.781563,0.354758,-0.729045,0.196557,...,4,3,category_3,3,1,3,2,13.415675,1,2
2,-1.216077,-0.568944,-0.743672,0.730184,0.140672,1.272954,-0.159012,-0.475175,0.240057,0.100159,...,0,4,category_3,4,1,0,2,-47.492280,0,2
3,0.559299,-0.276046,-0.431096,-0.809627,-1.063696,-0.860153,0.572751,-0.467441,0.677557,1.307184,...,4,1,category_3,4,2,0,0,-94.482614,1,2
4,0.910179,0.202563,0.786328,-0.042257,0.317218,0.379152,-0.466419,-0.017020,-0.944446,-0.410050,...,1,0,category_2,0,2,3,0,195.819531,1,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
251,0.280442,-0.206904,0.841631,0.880179,-0.993124,-1.570623,-0.249459,0.643314,0.049495,0.493837,...,1,2,category_2,2,3,0,2,-171.249549,0,0
252,-1.165150,-1.070753,0.465662,1.054452,0.900826,-0.179925,-1.536244,1.178780,1.488252,1.895889,...,4,2,category_4,4,2,1,1,23.708442,0,2
253,-0.069856,-0.186691,-1.021913,-1.143641,0.250114,1.040239,-1.150438,0.258798,-0.836111,0.642211,...,0,3,category_3,2,2,2,2,-33.414215,1,1
254,-1.031482,-0.860262,-0.061638,0.328301,-1.429991,-1.048170,-1.432735,0.607112,0.087531,0.938747,...,0,0,category_3,4,1,4,4,-359.199191,0,4


`DataModule.train_indices`, `DataModule.val_indices`, and `DataModule.test_indices` hold the indices of the training, validation, and testing sets, respectively.

In [6]:
datamodule.train_indices, datamodule.val_indices, datamodule.test_indices

(array([216,  72,  68,  62, 237, 116, 110, 236,  83,  66,   9, 219,  39,
         27, 176,  38, 211,   6, 114, 203,  92, 160, 238, 141, 163,  15,
         84,  13,  79, 198, 170, 197,  95, 107, 193, 135, 188, 137, 248,
        165, 112, 132, 194, 101,  48, 249, 186,  74, 202, 208, 235,   7,
        157,  30, 215, 243,  44, 242, 190,  65, 134, 187,  64,  47,  45,
        115, 109,  69,  86, 150,   4, 231,  63, 174, 106,  71, 204, 122,
        118, 205, 182, 126,  99,  40,  16, 217, 223,   0,  97,  96, 230,
         50, 206, 226, 147,  80, 221,  78, 179,  59, 154, 214,  26, 227,
         12, 245, 195,  46, 177,   2, 191, 119, 139, 181,  25,  52,  29,
         58, 128, 234, 167, 185, 196, 152,  57, 209,   1, 173,  23,  31,
         34, 250,  98,  82, 184,  28, 151, 192, 156, 255,  89,  76, 149,
        241,  24, 146, 252, 144, 175, 103, 143, 127,  56]),
 array([ 49, 200, 180, 121, 251,  67, 145, 212,  70,  90,  81, 253, 239,
         75, 246, 138, 228,  77, 130,  93,  32, 124, 142,  11, 2

For detailed functionalities of `DataModule`, please check the API documentation.

## A `Trainer` does everything for you

In practice, you do not need to create a `UserConfig` or a `DataModule` manually, because `Trainer` performs all of the steps above. After calling `Trainer.load_config` and `Trainer.load_data`, the `UserConfig` instance holding the configuration is available as `Trainer.args`, and the `DataModule` instance holding the processing steps and the loaded data is available as `Trainer.datamodule`.


In [7]:
from tabensemb.trainer import Trainer

tabensemb.setting["default_output_path"] = prefix + "output"
trainer = Trainer(device="cpu")
trainer.load_config("sample")
trainer.load_data()
type(trainer.args), type(trainer.datamodule)

The project will be saved to ../../../../output/sample/2023-09-18-17-22-27-0_sample
Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-09-18-17-22-27-0_sample (data.csv and tabular_data.csv).


(tabensemb.config.user_config.UserConfig, tabensemb.data.datamodule.DataModule)