# Basics of running benchmarks

Tabular Ensemble (`tabensemb`) is a benchmark platform for tabular prediction tasks. We support three well-established model bases as baselines:

* `autogluon`: [Link](https://github.com/autogluon/autogluon)

* `pytorch_widedeep`: [Link](https://github.com/jrzaurin/pytorch-widedeep)

* `pytorch_tabular`: [Link](https://github.com/manujosephv/pytorch_tabular)

Users can run benchmarks on custom datasets with custom preprocessing steps, and can implement custom models in the framework so that their performance is compared with the baselines under a consistent procedure.

This part walks through minimal examples of regression, binary classification, and multiclass classification to show the basic functionality of the package.

## Regression

### Loading packages

First, import the necessary modules. Then check whether CUDA is available and choose the training device accordingly.

In [1]:
import torch
from tabensemb.trainer import Trainer
from tabensemb.model import *
from tabensemb.config import UserConfig
import tabensemb
import os

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

Using cuda device


`tabensemb` uses paths relative to the current working directory, which can differ between IDEs (PyCharm, VSCode, etc.). Set the default paths explicitly to the desired locations.

* `tabensemb.setting["default_output_path"]`: The path used to save results. It will be created if it does not exist.
* `tabensemb.setting["default_config_path"]`: The path to configuration files (see "Using a configuration file" for an example).
* `tabensemb.setting["default_data_path"]`: The path to data files. It is also used to save downloaded datasets (see "Using a configuration file" for an example).

In this notebook, we use a temporary directory so that no files are left behind. To keep the results, replace `temp_path.name` with a directory of your own.

In [2]:
from tempfile import TemporaryDirectory

temp_path = TemporaryDirectory()
tabensemb.setting["default_output_path"] = os.path.join(temp_path.name, "output")
tabensemb.setting["default_config_path"] = os.path.join(temp_path.name, "configs")
tabensemb.setting["default_data_path"] = os.path.join(temp_path.name, "data")
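For a persistent workspace, the three settings can be derived from a single root directory. The following is a minimal sketch: the helper `build_default_paths` and the directory name `tabensemb_workspace` are hypothetical and not part of `tabensemb`; only the three setting keys come from the list above.

```python
from pathlib import Path


def build_default_paths(root):
    """Return the three path settings, all located under one root directory."""
    root = Path(root).expanduser().resolve()
    return {
        "default_output_path": str(root / "output"),
        "default_config_path": str(root / "configs"),
        "default_data_path": str(root / "data"),
    }


# "tabensemb_workspace" is a hypothetical directory name.
paths = build_default_paths("tabensemb_workspace")
for key, value in paths.items():
    print(f"{key}: {value}")

# With the package imported, each entry can be assigned as in the cell above:
# for key, value in paths.items():
#     tabensemb.setting[key] = value
```

Using absolute paths avoids surprises when the notebook and a script are started from different working directories.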

### Configuring a `Trainer`

Create a `Trainer`, which acts as a bridge between data and models and provides some useful utilities.

In [3]:
trainer = Trainer(device=device)

As an example, we use the Auto MPG dataset from [UCI datasets](https://archive.ics.uci.edu/datasets). We can import UCI datasets through the `UserConfig` class.

In [4]:
mpg_columns = [
    "mpg",
    "cylinders",
    "displacement",
    "horsepower",
    "weight",
    "acceleration",
    "model_year",
    "origin",
    "car_name",
]
cfg = UserConfig.from_uci("Auto MPG", column_names=mpg_columns, sep=r"\s+")
trainer.load_config(cfg)

Downloading https://archive.ics.uci.edu/static/public/9/auto+mpg.zip to /tmp/tmpqcxgn2l1/data/Auto MPG.zip
cylinders is Integer and will be treated as a continuous feature.
model_year is Integer and will be treated as a continuous feature.
origin is Integer and will be treated as a continuous feature.
Unknown values are detected in ['horsepower']. They will be treated as np.nan.
The project will be saved to /tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig


*Optional*: The `Logging` class records all output to a file in the project root printed above, so that the training process can be reviewed later. We strongly recommend using it.

`Trainer.project_root` is the output directory of the `trainer`, and here we log all `stdout` and `stderr` to `log.txt` in this directory.

In [5]:
from tabensemb.utils import Logging
log = Logging()
log.enter(os.path.join(trainer.project_root, "log.txt"))

### Viewing configurations

We can view a summary of the current environment, including the device and Python version, the loaded configuration, and the global settings of `tabensemb`.

In [6]:
trainer.summarize_setting()

Device:
{
	'System': 'Linux',
	'Node name': 'xlluo-WS',
	'System release': '5.15.6-custom',
	'System version': '#1 SMP Mon Dec 13 20:27:58 CST 2021',
	'Machine architecture': 'x86_64',
	'Processor architecture': 'x86_64',
	'Processor model': '11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz',
	'Physical cores': 8,
	'Total cores': 16,
	'Max core frequency': '5150.00Mhz',
	'Total memory': '31.20GB',
	'Python version': '3.10.12',
	'Python implementation': 'CPython',
	'Python compiler': 'GCC 11.2.0',
	'Cuda availability': True,
	'GPU devices': [
		'NVIDIA GeForce RTX 3090'
	]
}
Configurations:
{
	'database': 'auto-mpg',
	'task': 'regression',
	'loss': None,
	'bayes_opt': False,
	'bayes_calls': 50,
	'bayes_epoch': 30,
	'patience': 100,
	'epoch': 300,
	'lr': 0.001,
	'weight_decay': 1e-09,
	'batch_size': 1024,
	'layers': [
		64,
		128,
		256,
		128,
		64
	],
	'SPACEs': {
		'lr': {
			'type': 'Real',
			'low': 0.0001,
			'high': 0.05,
			'prior': 'log-uniform'
		},
		'weight_decay': {
			'type':

### Loading data

In the configuration summary above, the dataset file is defined by "database" under the `Configurations` category. `Trainer.load_data` automatically searches for the file in the current directory and in `tabensemb.setting["default_data_path"]`. Now, load the Auto MPG dataset into the `Trainer`. Loading processes the dataset and prepares it for training through the following steps:

1. Data splitting (training/validation/testing sets)
2. Data imputation
3. Data augmentation (for features)
4. Data processing
    * Data augmentation (for data points)
    * Data filtering
    * Feature selection
    * Categorical encoding
    * Data scaling
    * etc.
5. Data augmentation (for features, especially multi-modal features)
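The first steps of this pipeline (splitting, imputation, scaling) can be sketched in plain Python. This is only an illustration under stated assumptions: the 60/20/20 ratio is inferred from the dataset size printed for Auto MPG, mean imputation and standard scaling are chosen for simplicity, and the helpers `split_indices` and `impute_and_scale` are hypothetical rather than `tabensemb` functions.

```python
import random
from statistics import mean, pstdev


def split_indices(n_samples, seed=0):
    """Shuffle indices and split them 60/20/20 (ratio assumed for illustration)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = n_samples * 6 // 10
    n_val = (n_samples - n_train) // 2
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


def impute_and_scale(values, train_idx):
    """Fill missing values (None) with the training mean, then standardize
    with statistics computed on the training split only."""
    observed = [values[i] for i in train_idx if values[i] is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    train_values = [filled[i] for i in train_idx]
    mu, sigma = mean(train_values), pstdev(train_values)
    return [(v - mu) / sigma for v in filled]


# A toy numeric column with a few missing entries (made-up values).
rng = random.Random(1)
column = [rng.uniform(50, 200) for _ in range(398)]
for i in (10, 50, 200):
    column[i] = None

train_idx, val_idx, test_idx = split_indices(len(column))
scaled = impute_and_scale(column, train_idx)
print(len(train_idx), len(val_idx), len(test_idx))  # prints: 238 80 80
```

Computing the imputation value and the scaling statistics on the training split alone keeps information from the validation and testing sets out of the preprocessing.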


In [7]:
trainer.load_data()

Dataset size: 238 80 80
Data saved to /tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig (data.csv and tabular_data.csv).


### Initializing model bases

Initialize model bases and add them to the `Trainer`. For demonstration, we select only a subset of the models in each model base by passing the `model_subset` argument (without it, all available models will be trained).

In [8]:
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    WideDeep(trainer, model_subset=["TabMlp"]),
    AutoGluon(trainer, model_subset=["Random Forest"]),
]
trainer.add_modelbases(models)

### Start training

Now train the model bases. The argument `stderr_to_stdout` redirects warnings and log messages to `stdout`, which keeps the records in the notebook clean.

*Optional*: Using the following line, we can run k-fold cross-validation to get the leaderboard, where k is `cross_validation`.

```python
trainer.get_leaderboard(cross_validation=10, split_type="cv", stderr_to_stdout=True)
```

**Remark**: `split_type` can also be `random`, in which case the dataset is split randomly in each run according to the `split_ratio` given in the configuration, using a different random seed each time.
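The idea behind k-fold cross-validation can be pictured with a few lines of plain Python. This sketch only shows how the sample indices are partitioned into k held-out folds; it does not shuffle, and it makes no claim about how `tabensemb` further divides each fold into validation and testing sets. The function `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, held_out_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so that sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        held_out = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


folds = list(k_fold_indices(n_samples=398, k=10))
print([len(held_out) for _, held_out in folds])
# prints: [40, 40, 40, 40, 40, 40, 40, 40, 39, 39]
```

Every sample is held out exactly once, so the averaged metrics depend less on one particular split.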

In [9]:
trainer.train(stderr_to_stdout=True)


-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-23 20:36:01,062 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-23 20:36:01,062 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-09-23 20:36:01,070 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-23 20:36:01,081 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-09-23 20:36:01,991 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will tr

After training finishes, check the leaderboard to compare the performance of the models.

Metrics used in leaderboards can be found in `tabensemb.utils.utils.REGRESSION_METRICS/BINARY_METRICS/MULTICLASS_METRICS`. Most of the metrics are from `sklearn.metrics`.

In [10]:
trainer.get_leaderboard()

PytorchTabular metrics
Category Embedding 1/1
WideDeep metrics
TabMlp 1/1
AutoGluon metrics
Random Forest 1/1
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqcxgn2l1/output/auto-mpg/2023-09-23-20-36-00-0_UserInputConfig/trainer.pkl')


Unnamed: 0,Program,Model,Training RMSE,Training MSE,Training MAE,Training MAPE,Training R2,Training MEDIAN_ABSOLUTE_ERROR,Training EXPLAINED_VARIANCE_SCORE,Testing RMSE,...,Testing R2,Testing MEDIAN_ABSOLUTE_ERROR,Testing EXPLAINED_VARIANCE_SCORE,Validation RMSE,Validation MSE,Validation MAE,Validation MAPE,Validation R2,Validation MEDIAN_ABSOLUTE_ERROR,Validation EXPLAINED_VARIANCE_SCORE
0,AutoGluon,Random Forest,1.037981,1.077405,0.741566,0.031074,0.983285,0.5295,0.983293,2.047025,...,0.922065,1.156333,0.922591,3.378475,11.414091,2.269187,0.102995,0.796098,1.641334,0.796506
1,WideDeep,TabMlp,3.189102,10.170372,2.318564,0.096454,0.842218,1.669983,0.859805,2.537431,...,0.88025,1.767459,0.900587,3.415071,11.662707,2.539188,0.116035,0.791657,1.90416,0.806152
2,PytorchTabular,Category Embedding,3.354362,11.251746,2.445915,0.101659,0.825442,1.775388,0.854523,2.799644,...,0.854221,1.963455,0.888258,3.51671,12.36725,2.731159,0.125136,0.779071,2.375105,0.808039
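Several regression columns in the leaderboard are linked by simple formulas. For instance, the Training MSE of Random Forest (1.077405) is the square of its Training RMSE (1.037981). The sketch below computes a few of these metrics by hand on made-up numbers. It is not the code used by `tabensemb`, which the text says relies mostly on `sklearn.metrics`, and `regression_metrics` is a hypothetical helper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MSE, MAE and R2 from two equally long lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae, "R2": r2}


# Made-up targets and predictions, purely for illustration.
y_true = [18.0, 15.0, 36.0, 26.0, 30.5]
y_pred = [17.0, 16.5, 33.0, 27.0, 31.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

A negative R2, as can occur for an undertrained model, means the predictions are worse than always predicting the mean of the targets.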


## Binary classification

As a showcase for binary classification, we use the Adult dataset from UCI datasets. Note that the Adult dataset comes with a separate testing set, which is discussed in the "Inference on an upcoming dataset" part.

In [11]:
trainer = Trainer(device=device)
adult_columns = [
    "age",
    "workclass",
    "fnlwgt",
    "education",
    "education-num",
    "marital-status",
    "occupation",
    "relationship",
    "race",
    "sex",
    "capital-gain",
    "capital-loss",
    "hours-per-week",
    "native-country",
    "income",
]
cfg = UserConfig.from_uci("Adult", column_names=adult_columns, sep=", ")
trainer.load_config(cfg)
trainer.load_data()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    WideDeep(trainer, model_subset=["TabMlp"]),
    AutoGluon(trainer, model_subset=["Random Forest"]),
]
trainer.add_modelbases(models)
trainer.train(stderr_to_stdout=True)
trainer.get_leaderboard()

Downloading https://archive.ics.uci.edu/static/public/2/adult.zip to /tmp/tmpqcxgn2l1/data/Adult.zip


  df = pd.read_csv(StringIO(s), names=names, sep=sep)


age is Integer and will be treated as a continuous feature.
fnlwgt is Integer and will be treated as a continuous feature.
education-num is Integer and will be treated as a continuous feature.
capital-gain is Integer and will be treated as a continuous feature.
capital-loss is Integer and will be treated as a continuous feature.
hours-per-week is Integer and will be treated as a continuous feature.
The project will be saved to /tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig
Dataset size: 19536 6512 6513
Data saved to /tmp/tmpqcxgn2l1/output/adult/2023-09-23-20-36-15-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-23 20:36:17,315 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-23 20:36:17,317 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task
2023-09-23 20:36:17,382 - {pytorch_t

Unnamed: 0,Program,Model,Training F1_SCORE,Training PRECISION_SCORE,Training RECALL_SCORE,Training JACCARD_SCORE,Training ACCURACY_SCORE,Training BALANCED_ACCURACY_SCORE,Training COHEN_KAPPA_SCORE,Training HAMMING_LOSS,...,Validation ACCURACY_SCORE,Validation BALANCED_ACCURACY_SCORE,Validation COHEN_KAPPA_SCORE,Validation HAMMING_LOSS,Validation MATTHEWS_CORRCOEF,Validation ZERO_ONE_LOSS,Validation ROC_AUC_SCORE,Validation LOG_LOSS,Validation BRIER_SCORE_LOSS,Validation AVERAGE_PRECISION_SCORE
0,WideDeep,TabMlp,0.6942,0.728505,0.662981,0.531628,0.859388,0.792321,0.603167,0.140612,...,0.852426,0.784474,0.584884,0.147574,0.585738,0.147574,0.908951,0.317288,0.101612,0.86842
1,AutoGluon,Random Forest,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,...,0.853808,0.776665,0.580404,0.146192,0.583003,0.146192,0.90701,0.318016,0.100486,0.875084
2,PytorchTabular,Category Embedding,0.709806,0.738341,0.683394,0.550154,0.865479,0.803303,0.622423,0.134521,...,0.85043,0.784467,0.581612,0.14957,0.58215,0.14957,0.909318,0.316194,0.101722,0.86841
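The classification columns are related in a similar way. In the table above, HAMMING_LOSS equals one minus ACCURACY_SCORE for every row (for example, 0.140612 and 0.859388 for TabMlp on the training set). The following sketch derives a few binary metrics from confusion counts on made-up labels; `binary_metrics` is a hypothetical helper, not part of `tabensemb`.

```python
def binary_metrics(y_true, y_pred):
    """Compute basic binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / len(pairs)
    return {
        "PRECISION_SCORE": precision,
        "RECALL_SCORE": recall,
        "F1_SCORE": 2 * precision * recall / (precision + recall),
        "ACCURACY_SCORE": accuracy,
        "HAMMING_LOSS": 1.0 - accuracy,
    }


# Made-up labels and predictions, purely for illustration.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

Comparing training and validation columns is also informative: Random Forest reaches 1.0 on every training metric but about 0.85 validation accuracy, which is a typical sign of overfitting.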


## Multiclass classification

Iris is a well-known multiclass classification task, and it is also loaded from UCI datasets. In the examples above, we passed the `column_names` argument to `from_uci`. If `column_names` is not given, the column names from the UCI website are used (their order might be wrong, as is the case for the Auto MPG dataset), and the downloaded archive is not removed after `from_uci` returns. The archive should contain a file named `xxx.names` that lists the column names.

In [12]:
trainer = Trainer(device=device)
cfg = UserConfig.from_uci("Iris", datafile_name="iris")
trainer.load_config(cfg)
trainer.load_data()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    WideDeep(trainer, model_subset=["TabMlp"]),
    AutoGluon(trainer, model_subset=["Random Forest"]),
]
trainer.add_modelbases(models)
trainer.train(stderr_to_stdout=True)
trainer.get_leaderboard()

Downloading https://archive.ics.uci.edu/static/public/53/iris.zip to /tmp/tmpqcxgn2l1/data/Iris.zip




The project will be saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig
Dataset size: 90 30 30
Data saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-50-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-23 20:37:51,096 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-23 20:37:51,096 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task
2023-09-23 20:37:51,106 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-23 20:37:51,121 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-09-23 20:37:51,137 - {pytorch_tabul

Unnamed: 0,Program,Model,Training ACCURACY_SCORE,Training BALANCED_ACCURACY_SCORE,Training COHEN_KAPPA_SCORE,Training HAMMING_LOSS,Training MATTHEWS_CORRCOEF,Training ZERO_ONE_LOSS,Training PRECISION_SCORE_MACRO,Training PRECISION_SCORE_MICRO,...,Validation F1_SCORE_MICRO,Validation F1_SCORE_WEIGHTED,Validation JACCARD_SCORE_MACRO,Validation JACCARD_SCORE_MICRO,Validation JACCARD_SCORE_WEIGHTED,Validation TOP_K_ACCURACY_SCORE,Validation LOG_LOSS,Validation ROC_AUC_SCORE_OVR_MACRO,Validation ROC_AUC_SCORE_OVR_WEIGHTED,Validation ROC_AUC_SCORE_OVO
0,PytorchTabular,Category Embedding,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.833333,0.837232,0.756944,0.714286,0.732639,1.0,0.366983,0.974891,0.971616,0.976042
1,WideDeep,TabMlp,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.833333,0.837232,0.756944,0.714286,0.732639,1.0,0.295129,0.979747,0.977576,0.980833
2,AutoGluon,Random Forest,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.8,0.804615,0.721154,0.666667,0.689423,1.0,0.781551,0.950812,0.941465,0.951042


## Using a configuration file

The examples above use UCI datasets, whose configurations are generated automatically. A configuration can also be loaded from a local `.py` or `.json` file. To run a minimal example, we provide a randomly generated sample dataset (`data/sample.csv`) and its configuration file (`configs/sample.py`) in the repository. See "Dataset and configuration" for a detailed introduction to configuration files.
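As a rough picture of what a `.json` configuration might contain, the sketch below writes and reads back a small dictionary. The keys and values are copied from the configuration summary printed in the regression example, except for the `database` value, which is assumed to match the sample dataset. The file name `example.json` is hypothetical, and a real configuration presumably needs further entries (for example, those describing features and labels) that are covered in "Dataset and configuration".

```python
import json
import os
import tempfile

# Keys taken from the printed configuration summary; the set is NOT complete.
example_cfg = {
    "database": "sample",  # assumed to refer to data/sample.csv
    "task": "regression",
    "bayes_opt": False,
    "patience": 100,
    "epoch": 300,
    "lr": 0.001,
    "batch_size": 1024,
}

with tempfile.TemporaryDirectory() as config_dir:
    cfg_path = os.path.join(config_dir, "example.json")  # hypothetical file name
    with open(cfg_path, "w", encoding="utf-8") as f:
        json.dump(example_cfg, f, indent=4)
    with open(cfg_path, encoding="utf-8") as f:
        loaded_cfg = json.load(f)

print(loaded_cfg["database"], loaded_cfg["task"])
```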

As noted earlier, `tabensemb` uses paths relative to the current working directory, which might differ between IDEs (PyCharm, VSCode, etc.). Check the current working directory first, using a magic command such as `!pwd` in notebooks or `import os; os.getcwd()` in scripts, and then set the default paths to the desired ones.

In [13]:
path = "../../../../"
tabensemb.setting["default_config_path"] = path + "configs"
tabensemb.setting["default_data_path"] = path + "data"

Load the configuration file `sample.py` using `Trainer.load_config`, which automatically searches for the file in the current directory and in `tabensemb.setting["default_config_path"]`.

In [14]:
trainer.load_config("sample")
trainer.load_data()

The project will be saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample
Dataset size: 153 51 52
Data saved to /tmp/tmpqcxgn2l1/output/iris/2023-09-23-20-37-58-0_sample (data.csv and tabular_data.csv).


Then initialize models:

In [15]:
trainer.clear_modelbase()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"])
]
trainer.add_modelbases(models)

*Optional*: For a quick development test, changing the following global setting significantly reduces training time.

In [16]:
tabensemb.setting["debug_mode"] = True

In [17]:
trainer.train(stderr_to_stdout=True)
trainer.get_leaderboard()


-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-23 20:37:59,305 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-23 20:37:59,306 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-09-23 20:37:59,326 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-23 20:37:59,350 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-09-23 20:37:59,372 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
You are using a CUDA device ('NVIDIA GeForce RTX 3090') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will tr

Unnamed: 0,Program,Model,Training RMSE,Training MSE,Training MAE,Training MAPE,Training R2,Training MEDIAN_ABSOLUTE_ERROR,Training EXPLAINED_VARIANCE_SCORE,Testing RMSE,...,Testing R2,Testing MEDIAN_ABSOLUTE_ERROR,Testing EXPLAINED_VARIANCE_SCORE,Validation RMSE,Validation MSE,Validation MAE,Validation MAPE,Validation R2,Validation MEDIAN_ABSOLUTE_ERROR,Validation EXPLAINED_VARIANCE_SCORE
0,PytorchTabular,Category Embedding,181.893055,33085.083331,145.409738,1.055588,-0.004359,121.139843,0.001236,172.657206,...,-0.005851,118.665751,-0.001657,148.93831,22182.620185,121.146176,1.00906,-0.001817,92.916794,0.001214


Clean up the temporary directory used by the notebook.

In [18]:
temp_path.cleanup()