# Basics of running benchmarks

Tabular Ensemble (`tabensemb`) is a benchmark platform for tabular prediction tasks. We support three well-established model bases as baselines:

* `autogluon`: [Link](https://github.com/autogluon/autogluon)

* `pytorch_widedeep`: [Link](https://github.com/jrzaurin/pytorch-widedeep)

* `pytorch_tabular`: [Link](https://github.com/manujosephv/pytorch_tabular)

Users can run benchmarks on custom datasets with custom preprocessing steps, and can implement their own models in the framework to compare them with the baselines under a consistent procedure.

This part walks through minimal examples of regression, binary classification, and multiclass classification to show the basic functionality of the package.

## Regression

### Loading packages

First, import the necessary modules. Then check whether `CUDA` is available and choose the training device accordingly.

In [1]:
import torch
from tabensemb.trainer import Trainer
from tabensemb.model import *
from tabensemb.config import UserConfig
import tabensemb
import os

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

Using cuda device


`tabensemb` uses paths relative to the current directory, and that directory can differ between IDEs (PyCharm, VSCode, etc.). Set the default paths explicitly:

* `tabensemb.setting["default_output_path"]`: the path where results are saved. It is created if it does not exist.
* `tabensemb.setting["default_config_path"]`: the path to configuration files (see "Using a configuration file" for an example).
* `tabensemb.setting["default_data_path"]`: the path to data files. Downloaded datasets are also saved here (see "Using a configuration file" for an example).

In this notebook, we use a temporary directory for cleanliness. Change `temp_path.name` to your own directory.

In [2]:
from tempfile import TemporaryDirectory

temp_path = TemporaryDirectory()
tabensemb.setting["default_output_path"] = os.path.join(temp_path.name, "output")
tabensemb.setting["default_config_path"] = os.path.join(temp_path.name, "configs")
tabensemb.setting["default_data_path"] = os.path.join(temp_path.name, "data")
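If the results should outlive the notebook session, the same three settings can point to a persistent directory instead. The following is a minimal sketch: the directory name `tabensemb_workspace` is hypothetical, and the import is guarded only so that the snippet also runs where `tabensemb` is not installed.

```python
from pathlib import Path

# Hypothetical workspace directory; replace it with your own location.
base_dir = Path.cwd() / "tabensemb_workspace"

# The three setting keys are the ones used in the cell above.
paths = {
    "default_output_path": base_dir / "output",
    "default_config_path": base_dir / "configs",
    "default_data_path": base_dir / "data",
}

try:
    import tabensemb
except ImportError:
    tabensemb = None  # the sketch still prints the paths without the package

for key, path in paths.items():
    if tabensemb is not None:
        # Same key-based assignment as in the notebook cell.
        tabensemb.setting[key] = str(path)
    print(f"{key} -> {path}")
```

With a persistent directory, the final `temp_path.cleanup()` call at the end of this notebook is no longer needed.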

### Configuring a `Trainer`

Create a `Trainer`, which acts as a bridge between data and models and provides some useful utilities.

In [3]:
trainer = Trainer(device=device)

As an example, we use the Auto MPG dataset from [UCI datasets](https://archive.ics.uci.edu/datasets). UCI datasets can be imported through the `UserConfig` class.

In [4]:
cfg = UserConfig.from_uci("Auto MPG", sep=r"\s+")
trainer.load_config(cfg)

Downloading https://archive.ics.uci.edu/static/public/9/auto+mpg.zip to /tmp/tmpqij93vth/data/Auto MPG.zip
cylinders is Integer and will be treated as a continuous feature.
model_year is Integer and will be treated as a continuous feature.
origin is Integer and will be treated as a continuous feature.
Unknown values are detected in ['horsepower']. They will be treated as np.nan.
Project will be saved to /tmp/tmpqij93vth/output/auto-mpg/2023-08-03-20-50-47-0_UserInputConfig


*Optional*: We provide a useful `Logging` class to record all outputs to a file located in the above project root, so that users can review the training process. This step is optional but we strongly recommend using it.

`Trainer.project_root` is the output directory of the `trainer`, and here we log all `stdout` and `stderr` to `log.txt` in this directory.

In [5]:
from tabensemb.utils import Logging
log = Logging()
log.enter(os.path.join(trainer.project_root, "log.txt"))
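To illustrate the idea of recording `stdout` and `stderr` in a file while still showing them on screen, here is a minimal standard-library sketch. It is a stand-in for the concept only, not the implementation of `tabensemb.utils.Logging`; the `Tee` class and the file location are hypothetical.

```python
import contextlib
import os
import sys
import tempfile


class Tee:
    """Forward every write to several text streams at once."""

    def __init__(self, *streams):
        self.streams = streams

    def write(self, text):
        for stream in self.streams:
            stream.write(text)
        return len(text)

    def flush(self):
        for stream in self.streams:
            stream.flush()


with tempfile.TemporaryDirectory() as tmp_dir:
    log_path = os.path.join(tmp_dir, "log.txt")
    with open(log_path, "w") as log_file:
        # Keep the original stdout so messages still appear on screen.
        tee = Tee(sys.stdout, log_file)
        with contextlib.redirect_stdout(tee), contextlib.redirect_stderr(tee):
            print("epoch 1 finished")
            print("a warning", file=sys.stderr)
    # Read the file back before the temporary directory is removed.
    with open(log_path) as log_file:
        logged = log_file.read()
```

After the block runs, `logged` holds both messages in the order they were printed, which is what makes such a file useful for reviewing a training run afterwards.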

### Viewing configurations

We can view the summary of the current environment, including devices/Python version, the loaded configuration, and global settings of `tabensemb`.

In [6]:
trainer.summarize_setting()

Device:
{
	'System': 'Linux',
	'Node name': 'xlluo-WS',
	'System release': '5.15.6-custom',
	'System version': '#1 SMP Mon Dec 13 20:27:58 CST 2021',
	'Machine architecture': 'x86_64',
	'Processor architecture': 'x86_64',
	'Processor model': '11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz',
	'Physical cores': 8,
	'Total cores': 16,
	'Max core frequency': '5150.00Mhz',
	'Total memory': '31.20GB',
	'Python version': '3.8.17',
	'Python implementation': 'CPython',
	'Python compiler': 'GCC 11.2.0',
	'Cuda availability': True,
	'GPU devices': [
		'NVIDIA GeForce RTX 3090'
	]
}
Configurations:
{
	'database': 'auto-mpg',
	'task': 'regression',
	'loss': None,
	'bayes_opt': False,
	'bayes_calls': 50,
	'bayes_epoch': 30,
	'patience': 100,
	'epoch': 300,
	'lr': 0.001,
	'weight_decay': 1e-09,
	'batch_size': 1024,
	'layers': [
		64,
		128,
		256,
		128,
		64
	],
	'SPACEs': {
		'lr': {
			'type': 'Real',
			'low': 0.0001,
			'high': 0.05,
			'prior': 'log-uniform'
		},
		'weight_decay': {
			'type': 

### Loading data

In the configuration summary above, the dataset file is given by "database" under the `Configurations` category. `Trainer.load_data` automatically searches for this file in the current directory and in `tabensemb.setting["default_data_path"]`. Now load the Auto MPG dataset into the `Trainer`, which processes the dataset and prepares it for model training through the following steps:

1. Data splitting (training/validation/testing sets)
2. Data imputation
3. Data augmentation (for features)
4. Data processing
    * Data augmentation (for data points)
    * Data filtering
    * Feature selection
    * Categorical encoding
    * Data scaling
    * etc.
5. Data augmentation (for features, especially multi-modal features)
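The first step, data splitting, can be sketched with the standard library alone. This is an illustration under assumptions, not the splitter used by `tabensemb`: the helper `split_indices`, the 60/20/20 ratio, and the rounding rule are all hypothetical choices.

```python
import random


def split_indices(n_samples, ratio=(0.6, 0.2, 0.2), seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = round(n_samples * ratio[0])
    n_val = round(n_samples * ratio[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(10)
print(len(train_idx), len(val_idx), len(test_idx))  # 6 2 2
```

Fixing the seed makes the split reproducible, so that every model is trained and evaluated on the same partitions.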


In [7]:
trainer.load_data()

Dataset size: 238 80 80
Data saved to /tmp/tmpqij93vth/output/auto-mpg/2023-08-03-20-50-47-0_UserInputConfig (data.csv and tabular_data.csv).




### Initializing model bases

Initialize model bases and add them to the `Trainer`. For demonstration, we select only a subset of the models in each model base by passing the `model_subset` argument (without it, all available models are trained).

In [8]:
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    WideDeep(trainer, model_subset=["TabMlp"]),
    AutoGluon(trainer, model_subset=["Linear Regression"]),
]
trainer.add_modelbases(models)

### Start training

Now train the model bases. The argument `stderr_to_stdout` redirects warnings and log messages to `stdout`, which keeps the records in the notebook clean.

*Optional*: The following line runs k-fold cross-validation to build the leaderboard, where k is given by `cross_validation`.

```python
trainer.get_leaderboard(cross_validation=10, split_type="cv", stderr_to_stdout=True)
```

**Remark**: `split_type` can also be `random`, in which case the dataset is split randomly according to the `split_ratio` given in the configuration, with different random seeds.
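For readers unfamiliar with k-fold cross-validation, the following sketch shows how k train/test partitions can be derived from a dataset. It illustrates the general technique only; how `tabensemb` builds its folds internally is not shown in this notebook, and the helper `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


folds = list(k_fold_indices(n_samples=10, k=5))
for train, test in folds:
    print("test fold:", test)
```

Each sample appears in exactly one test fold, so averaging the metrics over the k runs uses every data point for evaluation once.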

In [9]:
trainer.train(stderr_to_stdout=True)


-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-08-03 20:50:48,586 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-08-03 20:50:48,586 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-08-03 20:50:48,595 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-08-03 20:50:48,606 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-08-03 20:50:49,530 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name             | Type                      | Params
---------------------------------------------------------------
0 | _backbone        | CategoryEmbeddingBackbone 

After training finishes, check the leaderboard to see their performance.

Metrics used in leaderboards can be found in `tabensemb.utils.utils.REGRESSION_METRICS/BINARY_METRICS/MULTICLASS_METRICS`. Most of the metrics are from `sklearn.metrics`.
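As a reminder of what some of the regression columns mean, the sketch below computes RMSE, MSE, MAE, and R2 by hand using their standard definitions. The helper `regression_metrics` and the sample values are illustrative only and are not part of `tabensemb`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute a few common regression metrics from two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)  # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": math.sqrt(mse), "MSE": mse, "MAE": mae, "R2": r2}


metrics = regression_metrics([3.0, -0.5, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(metrics)
```

Lower is better for RMSE, MSE, and MAE, while R2 is better the closer it is to 1.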

In [10]:
trainer.get_leaderboard()

PytorchTabular metrics
Category Embedding 1/1
WideDeep metrics
TabMlp 1/1
AutoGluon metrics
Linear Regression 1/1
Trainer saved. To load the trainer, run trainer = load_trainer(path='/tmp/tmpqij93vth/output/auto-mpg/2023-08-03-20-50-47-0_UserInputConfig/trainer.pkl')


Unnamed: 0,Program,Model,Training RMSE,Training MSE,Training MAE,Training MAPE,Training R2,Training MEDIAN_ABSOLUTE_ERROR,Training EXPLAINED_VARIANCE_SCORE,Testing RMSE,...,Testing R2,Testing MEDIAN_ABSOLUTE_ERROR,Testing EXPLAINED_VARIANCE_SCORE,Validation RMSE,Validation MSE,Validation MAE,Validation MAPE,Validation R2,Validation MEDIAN_ABSOLUTE_ERROR,Validation EXPLAINED_VARIANCE_SCORE
0,PytorchTabular,Category Embedding,0.357999,0.128164,0.269086,0.049639,0.954546,0.201576,0.961487,0.508347,...,0.916572,0.254563,0.930668,0.409699,0.167853,0.325146,0.062197,0.941091,0.24285,0.947143
1,WideDeep,TabMlp,0.507421,0.257476,0.370808,0.070433,0.908686,0.272293,0.915921,0.526612,...,0.91047,0.32388,0.921738,0.53413,0.285295,0.419972,0.081513,0.899874,0.35455,0.905366
2,AutoGluon,Linear Regression,0.492177,0.242239,0.360783,0.066718,0.91409,0.239442,0.91409,0.573398,...,0.893855,0.37383,0.894028,0.493568,0.243609,0.387519,0.073325,0.914504,0.327456,0.914646
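The rows above appear in order of increasing Testing RMSE. To rank the models by another column, the values can be re-sorted. The sketch below does this with plain Python on the Testing and Validation RMSE values copied from the table; working on a list of dictionaries instead of the object returned by `get_leaderboard()` is a simplification made for this example.

```python
# Values copied from the regression leaderboard above.
leaderboard = [
    {"Program": "PytorchTabular", "Model": "Category Embedding",
     "Testing RMSE": 0.508347, "Validation RMSE": 0.409699},
    {"Program": "WideDeep", "Model": "TabMlp",
     "Testing RMSE": 0.526612, "Validation RMSE": 0.53413},
    {"Program": "AutoGluon", "Model": "Linear Regression",
     "Testing RMSE": 0.573398, "Validation RMSE": 0.493568},
]

# Rank by validation error instead of testing error (lower is better).
ranked = sorted(leaderboard, key=lambda row: row["Validation RMSE"])
for row in ranked:
    print(f"{row['Program']} / {row['Model']}: {row['Validation RMSE']:.6f}")
```

On the validation set, Linear Regression moves ahead of TabMlp, which shows that the ranking can depend on which partition is used.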


## Binary classification

As a showcase for binary classification, we use the Adult dataset from UCI datasets. Note that the Adult dataset has a separate testing set, which is discussed in the "Inference on an upcoming dataset" part.

In [11]:
trainer = Trainer(device=device)
cfg = UserConfig.from_uci("Adult", sep=", ")
trainer.load_config(cfg)
trainer.load_data()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    WideDeep(trainer, model_subset=["TabMlp"]),
    AutoGluon(trainer, model_subset=["Linear Regression"]),
]
trainer.add_modelbases(models)
trainer.train(stderr_to_stdout=True)
trainer.get_leaderboard()

Downloading https://archive.ics.uci.edu/static/public/2/adult.zip to /tmp/tmpqij93vth/data/Adult.zip


  df = pd.read_csv(StringIO(s), names=names, sep=sep)


Project will be saved to /tmp/tmpqij93vth/output/adult/2023-08-03-20-51-08-0_UserInputConfig
Dataset size: 19536 6512 6513
Data saved to /tmp/tmpqij93vth/output/adult/2023-08-03-20-51-08-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-08-03 20:51:09,706 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-08-03 20:51:09,707 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task
2023-08-03 20:51:09,765 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-08-03 20:51:09,801 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-08-03 20:51:09,825 - {pytorch_tabular.tabular_model:582

Unnamed: 0,Program,Model,Training F1_SCORE,Training PRECISION_SCORE,Training RECALL_SCORE,Training JACCARD_SCORE,Training ACCURACY_SCORE,Training BALANCED_ACCURACY_SCORE,Training COHEN_KAPPA_SCORE,Training HAMMING_LOSS,...,Validation ACCURACY_SCORE,Validation BALANCED_ACCURACY_SCORE,Validation COHEN_KAPPA_SCORE,Validation HAMMING_LOSS,Validation MATTHEWS_CORRCOEF,Validation ZERO_ONE_LOSS,Validation ROC_AUC_SCORE,Validation LOG_LOSS,Validation BRIER_SCORE_LOSS,Validation AVERAGE_PRECISION_SCORE
0,AutoGluon,Linear Regression,0.649959,0.719298,0.592813,0.481437,0.846284,0.759732,0.552647,0.153716,...,0.84137,0.755397,0.540607,0.15863,0.544061,0.15863,0.896542,0.335764,0.108401,0.848356
1,WideDeep,TabMlp,0.694051,0.730498,0.661067,0.531453,0.859695,0.79187,0.603321,0.140305,...,0.852426,0.78382,0.584325,0.147574,0.585265,0.147574,0.908941,0.317243,0.101578,0.868427
2,PytorchTabular,Category Embedding,0.708106,0.742191,0.677015,0.548115,0.865633,0.801226,0.621075,0.134367,...,0.851044,0.780948,0.579583,0.148956,0.580653,0.148956,0.909818,0.315222,0.101306,0.8693


## Multiclass classification

Iris is a famous multiclass classification task. It is also loaded from UCI datasets.

In [12]:
trainer = Trainer(device=device)
cfg = UserConfig.from_uci("Iris", datafile_name="iris")
trainer.load_config(cfg)
trainer.load_data()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    WideDeep(trainer, model_subset=["TabMlp"]),
    AutoGluon(trainer, model_subset=["Linear Regression"]),
]
trainer.add_modelbases(models)
trainer.train(stderr_to_stdout=True)
trainer.get_leaderboard()

Downloading https://archive.ics.uci.edu/static/public/53/iris.zip to /tmp/tmpqij93vth/data/Iris.zip
Project will be saved to /tmp/tmpqij93vth/output/iris/2023-08-03-20-52-30-0_UserInputConfig
Dataset size: 90 30 30
Data saved to /tmp/tmpqij93vth/output/iris/2023-08-03-20-52-30-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-08-03 20:52:30,779 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-08-03 20:52:30,780 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for classification task
2023-08-03 20:52:30,790 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-08-03 20:52:30,801 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU a

Unnamed: 0,Program,Model,Training ACCURACY_SCORE,Training BALANCED_ACCURACY_SCORE,Training COHEN_KAPPA_SCORE,Training HAMMING_LOSS,Training MATTHEWS_CORRCOEF,Training ZERO_ONE_LOSS,Training PRECISION_SCORE_MACRO,Training PRECISION_SCORE_MICRO,...,Validation F1_SCORE_MICRO,Validation F1_SCORE_WEIGHTED,Validation JACCARD_SCORE_MACRO,Validation JACCARD_SCORE_MICRO,Validation JACCARD_SCORE_WEIGHTED,Validation TOP_K_ACCURACY_SCORE,Validation LOG_LOSS,Validation ROC_AUC_SCORE_OVR_MACRO,Validation ROC_AUC_SCORE_OVR_WEIGHTED,Validation ROC_AUC_SCORE_OVO
0,PytorchTabular,Category Embedding,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.833333,0.837232,0.756944,0.714286,0.732639,1.0,0.366983,0.974891,0.971616,0.976042
1,WideDeep,TabMlp,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.833333,0.837232,0.756944,0.714286,0.732639,1.0,0.295129,0.979747,0.977576,0.980833
2,AutoGluon,Linear Regression,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,...,0.833333,0.837232,0.756944,0.714286,0.732639,1.0,0.306514,0.986498,0.985051,0.987222


## Using a configuration file

The examples above use UCI datasets, whose configurations are generated automatically. A configuration can also be loaded from a local `.py` or `.json` file. To run a minimal example, we provide a randomly generated sample dataset (`data/sample.csv`) and its configuration file (`configs/sample.py`) in the repository. See "Dataset and configuration" for a detailed introduction to configuration files.

As before, paths are resolved relative to the current directory, which can differ between IDEs (PyCharm, VSCode, etc.). Point the default paths to the `configs` and `data` directories of the repository.

In [13]:
path = "../../../../"
tabensemb.setting["default_config_path"] = path + "configs"
tabensemb.setting["default_data_path"] = path + "data"

Load the configuration file `sample.py` using `Trainer.load_config`, which automatically searches for the file in the current directory and in `tabensemb.setting["default_config_path"]`.

In [14]:
trainer.load_config("sample")
trainer.load_data()

Project will be saved to /tmp/tmpqij93vth/output/iris/2023-08-03-20-52-36-0_sample
Dataset size: 153 51 52
Data saved to /tmp/tmpqij93vth/output/iris/2023-08-03-20-52-36-0_sample (data.csv and tabular_data.csv).


Then initialize models:

In [15]:
trainer.clear_modelbase()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"])
]
trainer.add_modelbases(models)

*Optional*: For a quick development test, changing the following global setting significantly reduces training time.

In [16]:
tabensemb.setting["debug_mode"] = True

In [17]:
trainer.train(stderr_to_stdout=True)
trainer.get_leaderboard()


-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-08-03 20:52:37,112 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-08-03 20:52:37,112 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-08-03 20:52:37,126 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-08-03 20:52:37,142 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-08-03 20:52:37,155 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name             | Type                      | Params
---------------------------------------------------------------
0 | _backbone        | CategoryEmbeddingBackbone 

Unnamed: 0,Program,Model,Training RMSE,Training MSE,Training MAE,Training MAPE,Training R2,Training MEDIAN_ABSOLUTE_ERROR,Training EXPLAINED_VARIANCE_SCORE,Testing RMSE,...,Testing R2,Testing MEDIAN_ABSOLUTE_ERROR,Testing EXPLAINED_VARIANCE_SCORE,Validation RMSE,Validation MSE,Validation MAE,Validation MAPE,Validation R2,Validation MEDIAN_ABSOLUTE_ERROR,Validation EXPLAINED_VARIANCE_SCORE
0,PytorchTabular,Category Embedding,181.893055,33085.083317,145.409737,1.055588,-0.004359,121.139843,0.001236,172.657206,...,-0.005851,118.665751,-0.001657,148.93831,22182.6202,121.146176,1.00906,-0.001817,92.916794,0.001214


Clean the temporary directory of the notebook.

In [18]:
temp_path.cleanup()