# Cross-validation

Cross-validation is used to assess how well models generalize and to reduce the influence of randomness on the reported metrics. Randomness can enter through dataset splitting, model initialization, forward propagation (especially convolution operations), and optimization.

In [1]:
import torch
from tabensemb.trainer import Trainer
from tabensemb.model import *
from tabensemb.config import UserConfig
import tabensemb
import os

device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

from tempfile import TemporaryDirectory

temp_path = TemporaryDirectory()
tabensemb.setting["default_output_path"] = os.path.join(temp_path.name, "output")
tabensemb.setting["default_config_path"] = os.path.join(temp_path.name, "configs")
tabensemb.setting["default_data_path"] = os.path.join(temp_path.name, "data")

trainer = Trainer(device=device)
mpg_columns = [
    "mpg",
    "cylinders",
    "displacement",
    "horsepower",
    "weight",
    "acceleration",
    "model_year",
    "origin",
    "car_name",
]
cfg = UserConfig.from_uci("Auto MPG", column_names=mpg_columns, sep=r"\s+")
trainer.load_config(cfg)
trainer.load_data()
models = [
    PytorchTabular(trainer, model_subset=["Category Embedding"]),
]
trainer.add_modelbases(models)

Using cuda device
Downloading https://archive.ics.uci.edu/static/public/9/auto+mpg.zip to /tmp/tmpsmt6hpoy/data/Auto MPG.zip
cylinders is Integer and will be treated as a continuous feature.
model_year is Integer and will be treated as a continuous feature.
origin is Integer and will be treated as a continuous feature.
Unknown values are detected in ['horsepower']. They will be treated as np.nan.
The project will be saved to /tmp/tmpsmt6hpoy/output/auto-mpg/2023-09-18-19-13-05-0_UserInputConfig
Dataset size: 238 80 80
Data saved to /tmp/tmpsmt6hpoy/output/auto-mpg/2023-09-18-19-13-05-0_UserInputConfig (data.csv and tabular_data.csv).


## K-fold cross-validation

Some of the data splitters (see "Using data functionalities") in `tabensemb` support k-fold cross-validation. To activate k-fold CV, pass the argument `split_type="cv"` to `Trainer.get_leaderboard`. In this case, the ratio of training/validation/testing sets is (k-2):1:1. Here we present an example of a 4-fold CV (`cross_validation=4`), which gives a 2:1:1 split.

In [2]:
trainer.get_leaderboard(cross_validation=4, split_type="cv", stderr_to_stdout=True)

----------------------------1/4 cv----------------------------
Using previously used data path /tmp/tmpsmt6hpoy/data/auto-mpg.csv
Dataset size: 199 99 100
Data saved to /tmp/tmpsmt6hpoy/output/auto-mpg/2023-09-18-19-13-05-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-18 19:13:06,351 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-18 19:13:06,352 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-09-18 19:13:06,360 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-18 19:13:06,369 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Unnamed: 0,Program,Model,Training RMSE,Training MSE,Training MAE,Training MAPE,Training R2,Training MEDIAN_ABSOLUTE_ERROR,Training EXPLAINED_VARIANCE_SCORE,Testing RMSE,...,Testing R2,Testing MEDIAN_ABSOLUTE_ERROR,Testing EXPLAINED_VARIANCE_SCORE,Validation RMSE,Validation MSE,Validation MAE,Validation MAPE,Validation R2,Validation MEDIAN_ABSOLUTE_ERROR,Validation EXPLAINED_VARIANCE_SCORE
0,PytorchTabular,Category Embedding,3.121511,9.743828,2.361333,0.100122,0.839443,1.824559,0.870805,3.610387,...,0.786089,2.024105,0.814192,3.82657,14.642641,2.699677,0.119046,0.761043,1.905207,0.785311
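
The (k-2):1:1 ratio described above can be illustrated with a minimal, library-independent sketch. It assumes that one fold is held out for testing, the next fold for validation, and the remaining k-2 folds for training. This rotation and the helper name `kfold_train_val_test` are assumptions made for illustration; the splitters in `tabensemb` may assign folds differently. The sample count 398 is the total of the logged dataset sizes.

```python
import random


def kfold_train_val_test(n_samples, k, seed=0):
    """Yield (train, val, test) index lists for each of the k folds.

    Hypothetical helper: fold i is the test set, fold (i + 1) % k is the
    validation set, and the other k - 2 folds form the training set.
    """
    if k < 3:
        raise ValueError("k must be at least 3 to leave folds for training")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold j takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[j::k] for j in range(k)]
    for i in range(k):
        val_fold = (i + 1) % k
        train = [
            idx
            for j, fold in enumerate(folds)
            if j not in (i, val_fold)
            for idx in fold
        ]
        yield train, folds[val_fold], folds[i]


for fold, (train, val, test) in enumerate(kfold_train_val_test(398, 4), start=1):
    print(f"fold {fold}: train={len(train)} val={len(val)} test={len(test)}")
```

Under this scheme every sample is used for testing exactly once across the k folds, which is what distinguishes k-fold CV from repeated random splitting.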


## Splitting the dataset randomly

We can simply split the dataset with different random seeds. This is achieved by passing the argument `split_type="random"`. In this case, the ratio of training/validation/testing sets is the one specified in the configuration (or 6:2:2 by default).

In [3]:
trainer.get_leaderboard(cross_validation=4, split_type="random", stderr_to_stdout=True)

----------------------------1/4 random----------------------------
Using previously used data path /tmp/tmpsmt6hpoy/data/auto-mpg.csv
Dataset size: 238 80 80
Data saved to /tmp/tmpsmt6hpoy/output/auto-mpg/2023-09-18-19-13-05-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-18 19:13:22,196 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-18 19:13:22,196 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-09-18 19:13:22,204 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-18 19:13:22,213 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Unnamed: 0,Program,Model,Training RMSE,Training MSE,Training MAE,Training MAPE,Training R2,Training MEDIAN_ABSOLUTE_ERROR,Training EXPLAINED_VARIANCE_SCORE,Testing RMSE,...,Testing R2,Testing MEDIAN_ABSOLUTE_ERROR,Testing EXPLAINED_VARIANCE_SCORE,Validation RMSE,Validation MSE,Validation MAE,Validation MAPE,Validation R2,Validation MEDIAN_ABSOLUTE_ERROR,Validation EXPLAINED_VARIANCE_SCORE
0,PytorchTabular,Category Embedding,3.195748,10.212808,2.387656,0.101595,0.838209,1.844076,0.865946,3.199201,...,0.822679,2.022343,0.856969,3.620434,13.107543,2.541972,0.11167,0.771539,1.739592,0.783368
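
As a rough picture of what repeated random splitting means, the sketch below shuffles the sample indices with a different seed for each repetition and cuts them at a 6:2:2 ratio. The function `random_split` and its rounding rule are hypothetical and written only for this illustration; the exact set sizes produced by `tabensemb` can differ by a sample or two.

```python
import random


def random_split(n_samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Return (train, val, test) index lists from one seeded shuffle.

    Hypothetical helper: sizes are truncated to integers and the test set
    receives whatever remains.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * ratios[0])
    n_val = int(n_samples * ratios[1])
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test


# One repetition per seed; the same seed always reproduces the same split.
for seed in range(4):
    train, val, test = random_split(398, seed=seed)
    print(f"seed {seed}: train={len(train)} val={len(val)} test={len(test)}")
```

Unlike k-fold CV, the testing sets of different repetitions may overlap here, and a given sample is not guaranteed to be tested at all.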


## Unexpected termination

Cross-validating several models on a large dataset can take a long time, especially when Bayesian hyperparameter optimization is enabled (the two can be used together). If the script terminates unexpectedly, you can load the stored cross-validation state and continue the previous execution instead of starting over.

First, we assume that the script terminates after the first run finishes.

In [4]:
_ = trainer.get_leaderboard(cross_validation=1, split_type="random", stderr_to_stdout=True)

----------------------------1/1 random----------------------------
Using previously used data path /tmp/tmpsmt6hpoy/data/auto-mpg.csv
Dataset size: 238 80 80
Data saved to /tmp/tmpsmt6hpoy/output/auto-mpg/2023-09-18-19-13-05-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-18 19:13:37,650 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-18 19:13:37,650 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-09-18 19:13:37,658 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-18 19:13:37,667 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

To continue the cross-validation, set the argument `load_from_previous` to `True`:

In [5]:
l1 = trainer.get_leaderboard(cross_validation=2, split_type="random", stderr_to_stdout=True, load_from_previous=True)
l1

Previous cross validation state is loaded.
----------------------------2/2 random----------------------------
Using previously used data path /tmp/tmpsmt6hpoy/data/auto-mpg.csv
Dataset size: 238 80 80
Data saved to /tmp/tmpsmt6hpoy/output/auto-mpg/2023-09-18-19-13-05-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-18 19:13:41,468 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-18 19:13:41,469 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-09-18 19:13:41,476 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-18 19:13:41,486 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0

Unnamed: 0,Program,Model,Training RMSE,Training MSE,Training MAE,Training MAPE,Training R2,Training MEDIAN_ABSOLUTE_ERROR,Training EXPLAINED_VARIANCE_SCORE,Testing RMSE,...,Testing R2,Testing MEDIAN_ABSOLUTE_ERROR,Testing EXPLAINED_VARIANCE_SCORE,Validation RMSE,Validation MSE,Validation MAE,Validation MAPE,Validation R2,Validation MEDIAN_ABSOLUTE_ERROR,Validation EXPLAINED_VARIANCE_SCORE
0,PytorchTabular,Category Embedding,3.273904,10.718448,2.439552,0.103957,0.827288,1.87891,0.857783,2.980939,...,0.844204,1.972121,0.890054,3.950363,15.605369,2.875634,0.123573,0.745909,2.082971,0.763982


Let's compare this with the result of a run that was not interrupted.

In [6]:
l2 = trainer.get_leaderboard(cross_validation=2, split_type="random", stderr_to_stdout=True)

import numpy as np

# The resumed run (l1) and the uninterrupted run (l2) should report the same metrics.
cols = ["Training RMSE", "Testing RMSE", "Validation RMSE"]
assert np.allclose(l1[cols].values.astype(float), l2[cols].values.astype(float))
l2

----------------------------1/2 random----------------------------
Using previously used data path /tmp/tmpsmt6hpoy/data/auto-mpg.csv
Dataset size: 238 80 80
Data saved to /tmp/tmpsmt6hpoy/output/auto-mpg/2023-09-18-19-13-05-0_UserInputConfig (data.csv and tabular_data.csv).

-------------Run PytorchTabular-------------

Training Category Embedding
Global seed set to 42
2023-09-18 19:13:45,532 - {pytorch_tabular.tabular_model:473} - INFO - Preparing the DataLoaders
2023-09-18 19:13:45,533 - {pytorch_tabular.tabular_datamodule:290} - INFO - Setting up the datamodule for regression task
2023-09-18 19:13:45,540 - {pytorch_tabular.tabular_model:521} - INFO - Preparing the Model: CategoryEmbeddingModel
2023-09-18 19:13:45,549 - {pytorch_tabular.tabular_model:268} - INFO - Preparing the Trainer
  rank_zero_deprecation(
Auto select gpus: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs

Unnamed: 0,Program,Model,Training RMSE,Training MSE,Training MAE,Training MAPE,Training R2,Training MEDIAN_ABSOLUTE_ERROR,Training EXPLAINED_VARIANCE_SCORE,Testing RMSE,...,Testing R2,Testing MEDIAN_ABSOLUTE_ERROR,Testing EXPLAINED_VARIANCE_SCORE,Validation RMSE,Validation MSE,Validation MAE,Validation MAPE,Validation R2,Validation MEDIAN_ABSOLUTE_ERROR,Validation EXPLAINED_VARIANCE_SCORE
0,PytorchTabular,Category Embedding,3.273904,10.718448,2.439552,0.103957,0.827288,1.87891,0.857783,2.980939,...,0.844204,1.972121,0.890054,3.950363,15.605369,2.875634,0.123573,0.745909,2.082971,0.763982
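
The resume behaviour shown in this section can be pictured with a small, generic sketch: results are written to disk after every split, and a later call skips the splits that are already stored. This is not how `tabensemb` implements `load_from_previous`. The names `run_split`, `cross_validate`, and the JSON state file are hypothetical, and the "training" step is a stand-in that returns a seeded random number.

```python
import json
import os
import random
import tempfile


def run_split(i):
    """Stand-in for training on split i; seeded by i so it is reproducible."""
    return random.Random(i).random()


def cross_validate(n_splits, state_path, load_from_previous=False):
    """Run n_splits splits, saving the results after each one."""
    results = []
    if load_from_previous and os.path.exists(state_path):
        with open(state_path) as f:
            results = json.load(f)
    # Continue from the first split that has no stored result.
    for i in range(len(results), n_splits):
        results.append(run_split(i))
        with open(state_path, "w") as f:
            json.dump(results, f)
    return results


with tempfile.TemporaryDirectory() as tmp:
    interrupted = os.path.join(tmp, "interrupted.json")
    cross_validate(1, interrupted)  # "terminates" after the first split
    resumed = cross_validate(2, interrupted, load_from_previous=True)
    fresh = cross_validate(2, os.path.join(tmp, "fresh.json"))
    assert resumed == fresh
```

The final assertion mirrors the comparison between `l1` and `l2` above: because each split is seeded by its index rather than by how many splits ran before it, a resumed run gives the same results as an uninterrupted one.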
