# Running model bases on a sample dataset

Tabular Ensemble (`tabensemb`) is a benchmark platform for tabular prediction tasks. We support three well-established model bases as baselines:

* `autogluon`: [Link](https://github.com/autogluon/autogluon)

* `pytorch_widedeep`: [Link](https://github.com/jrzaurin/pytorch-widedeep)

* `pytorch_tabular`: [Link](https://github.com/manujosephv/pytorch_tabular)

Users can run benchmarks on custom datasets with custom preprocessing steps, and implement their own models in the framework to compare them against the baselines under a consistent procedure.

This part walks through a minimal example that shows the basic functionality of the package.

## Loading packages

To run a minimal example, the repository provides a randomly generated sample dataset (`data/sample.csv`) and its configuration file (`configs/sample.py`). First, import the necessary modules.

`tabensemb` resolves paths relative to the current working directory, which can differ between IDEs (PyCharm, VSCode, etc.). Set the default paths explicitly to the desired locations.

Then check whether CUDA is available and choose the training device accordingly.

In [1]:
import torch
from tabensemb.trainer import Trainer
from tabensemb.model import *
import tabensemb
import os

prefix = "../../../../"
tabensemb.setting["default_output_path"] = prefix + "output"
tabensemb.setting["default_config_path"] = prefix + "configs"
tabensemb.setting["default_data_path"] = prefix + "data"

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

Using cuda device
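
The hard-coded `prefix` above depends on the directory the notebook is launched from. As a minimal sketch of one alternative, the hypothetical helper `find_repo_root` below (it is not part of `tabensemb`) walks upward from a starting directory until it finds one that contains both `configs` and `data`, assuming the repository keeps these folders at its root.

```python
from pathlib import Path


def find_repo_root(start, markers=("configs", "data")):
    """Walk upward from `start` until a directory holding all `markers` is found.

    Hypothetical helper for illustration; not part of tabensemb.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if all((candidate / marker).is_dir() for marker in markers):
            return candidate
    raise FileNotFoundError(f"No directory containing {markers} above {start}")


# Possible use with the settings shown in the cell above:
# root = find_repo_root(Path.cwd())
# tabensemb.setting["default_output_path"] = str(root / "output")
# tabensemb.setting["default_config_path"] = str(root / "configs")
# tabensemb.setting["default_data_path"] = str(root / "data")
```

With this approach the same cell works regardless of how deeply the notebook is nested, as long as the two marker folders exist somewhere above it.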


## Configuring a `Trainer`

Create a `Trainer`, which acts as a bridge between data and models and provides some useful utilities.

Load the configuration file `sample.py` using `Trainer.load_config`, which automatically searches for the file in the current directory and in `tabensemb.setting["default_config_path"]`.

In [2]:
trainer = Trainer(device=device)
trainer.load_config("sample")

Project will be saved to ../../../../output/sample/2023-07-30-13-33-54-0_sample
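
The lookup described above can be pictured as trying a list of directories in order. The sketch below is an illustration only: `resolve_config` is a hypothetical function, not the `tabensemb` implementation, and the order (current directory first, then the default configuration path) is an assumption.

```python
import os


def resolve_config(name, search_dirs):
    """Return the first existing `<dir>/<name>.py`, trying `search_dirs` in order.

    Hypothetical helper for illustration; not part of tabensemb.
    """
    filename = name if name.endswith(".py") else name + ".py"
    for directory in search_dirs:
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    raise FileNotFoundError(f"{filename} not found in {list(search_dirs)}")


# Possible use, mirroring trainer.load_config("sample"):
# resolve_config("sample", [os.getcwd(), tabensemb.setting["default_config_path"]])
```

Raising `FileNotFoundError` with the searched directories in the message makes a misconfigured default path easy to spot.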


*Optional*: The `Logging` class records all output to a file in the project root shown above, so that the training process can be reviewed later. We strongly recommend using it.

`Trainer.project_root` is the output directory of the `trainer`; here all `stdout` and `stderr` output is logged to `log.txt` in this directory.

In [3]:
from tabensemb.utils import Logging
log = Logging()
log.enter(os.path.join(trainer.project_root, "log.txt"))
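
To make the idea of recording `stdout` and `stderr` concrete, here is a minimal sketch of a "tee" that duplicates both streams into a file. The names `Tee` and `log_to_file` are hypothetical and the code is not how `tabensemb.utils.Logging` is implemented; it only illustrates the general technique with the standard library.

```python
import sys
from contextlib import contextmanager


class Tee:
    """Forward every write to several text streams at once."""

    def __init__(self, *streams):
        self.streams = streams

    def write(self, text):
        for stream in self.streams:
            stream.write(text)
        return len(text)

    def flush(self):
        for stream in self.streams:
            stream.flush()


@contextmanager
def log_to_file(path):
    """Duplicate stdout and stderr into `path` until the block exits."""
    old_out, old_err = sys.stdout, sys.stderr
    with open(path, "a", encoding="utf-8") as log_file:
        sys.stdout = Tee(old_out, log_file)
        sys.stderr = Tee(old_err, log_file)
        try:
            yield
        finally:
            # Flush while the file is still open, then restore the originals.
            sys.stdout.flush()
            sys.stderr.flush()
            sys.stdout, sys.stderr = old_out, old_err


# Possible use:
# with log_to_file("log.txt"):
#     print("this line appears on screen and in log.txt")
```

Restoring the original streams in a `finally` clause keeps the interpreter usable even if the wrapped code raises.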

## Viewing configurations

We can view a summary of the current environment: device and Python information, the loaded configuration file `configs/sample.py`, and the global settings of `tabensemb`.

In [4]:
trainer.summarize_setting()

Device:
{
	'System': 'Linux',
	'Node name': 'xlluo-WS',
	'System release': '5.15.6-custom',
	'System version': '#1 SMP Mon Dec 13 20:27:58 CST 2021',
	'Machine architecture': 'x86_64',
	'Processor architecture': 'x86_64',
	'Processor model': '11th Gen Intel(R) Core(TM) i9-11900K @ 3.50GHz',
	'Physical cores': 8,
	'Total cores': 16,
	'Max core frequency': '5150.00Mhz',
	'Total memory': '31.20GB',
	'Python version': '3.8.17',
	'Python implementation': 'CPython',
	'Python compiler': 'GCC 11.2.0',
	'Cuda availability': True,
	'GPU devices': [
		'NVIDIA GeForce RTX 3090'
	]
}
Configurations:
{
	'database': 'sample',
	'bayes_opt': False,
	'bayes_calls': 50,
	'bayes_epoch': 30,
	'patience': 100,
	'epoch': 300,
	'lr': 0.001,
	'weight_decay': 1e-09,
	'batch_size': 1024,
	'layers': [
		64,
		128,
		256,
		128,
		64
	],
	'SPACEs': {
		'lr': {
			'type': 'Real',
			'low': 0.0001,
			'high': 0.05,
			'prior': 'log-uniform'
		},
		'weight_decay': {
			'type': 'Real',
			'low': 1e-09,
			'high': 0.05

## Loading data

In the configuration summary above, the dataset file is named by the `database` entry under `Configurations`. `Trainer.load_data` automatically searches for the file in the current directory and in `tabensemb.setting["default_data_path"]`. Now load the dataset `data/sample.csv` into the `Trainer`, which processes it in the following steps to prepare it for training:

1. Data splitting (training/validation/testing sets)
2. Data imputation
3. Data augmentation (for features)
4. Data processing
    * Data augmentation (for data points)
    * Data filtering
    * Feature selection
    * Categorical encoding
    * Data scaling
    * etc.
5. Data augmentation (for features, especially multi-modal features)


In [5]:
trainer.load_data()

Dataset size: 153 51 52
Data saved to ../../../../output/sample/2023-07-30-13-33-54-0_sample (data.csv and tabular_data.csv).
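
As a rough illustration of the first steps in the list above (splitting, imputation, scaling), the sketch below uses only the standard library. It is not the `tabensemb` pipeline: the function names are hypothetical, and the 60/20/20 split, mean imputation, and standardization are assumptions chosen for the example. The one point it is meant to show is that imputation and scaling statistics are learned from the training portion only and then applied to the other sets.

```python
import random
import statistics


def split_indices(n, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle range(n) and cut it into training/validation/testing index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = int(train_frac * n)
    n_val = int(val_frac * n)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],
    )


def fit_column(train_values):
    """Learn the fill value and the scale from training values only."""
    observed = [v for v in train_values if v is not None]
    fill = statistics.fmean(observed)        # mean imputation
    filled = [fill if v is None else v for v in train_values]
    std = statistics.pstdev(filled) or 1.0   # guard against a constant column
    return fill, std


def transform_column(values, fill, std):
    """Impute missing entries, then standardize with the training statistics.

    After mean imputation the column mean equals `fill`, so it is also
    used for centering.
    """
    return [((fill if v is None else v) - fill) / std for v in values]


# Toy column with 256 rows; every tenth value is missing (None).
column = [None if i % 10 == 0 else float(i) for i in range(256)]

train_idx, val_idx, test_idx = split_indices(len(column))
fill, std = fit_column([column[i] for i in train_idx])
train_scaled = transform_column([column[i] for i in train_idx], fill, std)
test_scaled = transform_column([column[i] for i in test_idx], fill, std)

print(len(train_idx), len(val_idx), len(test_idx))
```

For 256 rows these assumed fractions give 153, 51, and 52 rows, the same sizes as in the output above; whether `tabensemb` uses exactly these fractions is not stated in this section.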


## Initializing model bases

Initialize model bases and add them to the `Trainer`. For demonstration, we select only a subset of the models in each model base through the `model_subset` argument; without it, all available models are trained.

In [6]:
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
    WideDeep(trainer, model_subset=["TabMlp"]),
    AutoGluon(trainer, model_subset=["Linear Regression"]),
]
trainer.add_modelbases(models)

*Optional*: For a quick development test, changing the following global setting significantly reduces training time.

In [7]:
tabensemb.setting["debug_mode"] = True

## Start training

Now train the model bases. The `stderr_to_stdout` argument redirects warnings and log messages to `stdout`, which keeps the notebook output clean. After training finishes, check the leaderboard to compare the models' performance.

In [8]:
trainer.train(stderr_to_stdout=True)


-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-07-30 13:33:54,776 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-07-30 13:33:54,777 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-07-30 13:33:54,790 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-07-30 13:33:54,805 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
2023-07-30 13:33:55,698 - {pytorch_tabular.tabular_model:582} - INFO - Training Started
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name             | Type                      | Params
---------------------------------------------------------------
0 | custom_loss      | MSELoss                   
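
The effect of redirecting `stderr` to `stdout` can be reproduced with the standard library. The sketch below is an assumption about the general technique, not the code behind `stderr_to_stdout`; `noisy_step` and `run_with_stderr_to_stdout` are hypothetical names.

```python
import contextlib
import io
import sys


def noisy_step():
    """Stand-in for a training call that writes to both streams."""
    print("epoch 1 done")
    print("a log record on stderr", file=sys.stderr)


def run_with_stderr_to_stdout(func, *args, **kwargs):
    """Call `func` while everything written to sys.stderr goes to sys.stdout."""
    with contextlib.redirect_stderr(sys.stdout):
        return func(*args, **kwargs)


# Capture stdout to show that both messages end up in the same stream.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    run_with_stderr_to_stdout(noisy_step)
captured = buffer.getvalue()
```

One limitation of this approach: a logging handler that stored a reference to the original `sys.stderr` before the redirection keeps writing to that original stream.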

In [9]:
trainer.get_leaderboard()

PytorchTabular metrics
Category Embedding 1/1
WideDeep metrics
TabMlp 1/1
AutoGluon metrics
Linear Regression 1/1
Trainer saved. To load the trainer, run trainer = load_trainer(path='../../../../output/sample/2023-07-30-13-33-54-0_sample/trainer.pkl')


| | Program | Model | Training RMSE | Training MSE | Training MAE | Training MAPE | Training R2 | Training RMSE_CONSERV | Testing RMSE | Testing MSE | Testing MAE | Testing MAPE | Testing R2 | Testing RMSE_CONSERV | Validation RMSE | Validation MSE | Validation MAE | Validation MAPE | Validation R2 | Validation RMSE_CONSERV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AutoGluon | Linear Regression | 114.065981 | 13011.048027 | 91.398513 | 2.686924 | 0.605025 | 12364.215662 | 139.269733 | 19396.058633 | 119.072766 | 4.078846 | 0.345548 | 11994.905098 | 110.253538 | 12155.842624 | 88.607594 | 1.54647 | 0.451015 | 12189.700803 |
| 1 | WideDeep | TabMlp | 182.17416 | 33187.424698 | 145.805452 | 0.996663 | -0.007466 | 31909.355406 | 172.338087 | 29700.416239 | 132.76695 | 0.99415 | -0.002136 | 23339.518145 | 149.173941 | 22252.864676 | 120.977058 | 1.000998 | -0.004989 | 29799.600759 |
| 2 | PytorchTabular | Category Embedding | 181.893055 | 33085.083317 | 145.409737 | 1.055588 | -0.004359 | 31932.919579 | 172.657206 | 29810.51081 | 132.876842 | 0.970391 | -0.005851 | 23596.311604 | 148.93831 | 22182.6202 | 121.146176 | 1.00906 | -0.001817 | 29044.712738 |
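
To help read the leaderboard, the sketch below computes RMSE, MSE, MAE, and R2 under their standard textbook definitions; that `tabensemb` uses exactly these definitions is an assumption. `RMSE_CONSERV` and MAPE are left out because this section does not define them. The example shows that a model which always predicts the mean of the targets has an R2 of 0, which is a useful reference for the near-zero R2 values of TabMlp and Category Embedding above.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return RMSE, MSE, MAE, and R2 for two equally long sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "RMSE": math.sqrt(mse),
        "MSE": mse,
        "MAE": sum(abs(e) for e in errors) / n,
        "R2": 1.0 - ss_res / ss_tot,
    }


y_true = [1.0, 2.0, 3.0, 4.0]

# A constant prediction equal to the target mean is the R2 = 0 reference point.
mean_baseline = [sum(y_true) / len(y_true)] * len(y_true)
baseline_scores = regression_metrics(y_true, mean_baseline)

# A perfect prediction gives R2 = 1 and zero error.
perfect_scores = regression_metrics(y_true, y_true)
```

Under these definitions, an R2 slightly below zero means the model does marginally worse than the constant mean predictor on that split.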
