**What is TabBench?**
*TabBench* is a benchmark suite for tabular data focused on real-world business use cases like product categorization, deduplication, and pricing. Unlike academic benchmarks, it evaluates models on industrial datasets from sectors such as retail, banking, and insurance. Built on top of [Neuralk Foundry-CE](https://github.com/Neuralk-AI/NeuralkFoundry-CE), TabBench structures each task as a modular workflow, making it easy to test and compare different approaches. It’s designed to help identify the best models for practical, industry-driven challenges.

# Adding a Dataset to the Benchmark

Datasets are a core component of TabBench. Large online registries such as OpenML and UCI exist, but they are largely limited to academic datasets. Industrial datasets usually come with stricter licenses and terms of use, which make them unsuitable for those platforms and require the user's explicit consent before downloading. Unfortunately, this limits how much of the benchmark can be automated.

Neuralk Foundry supports three modes of dataset registration:

* **Local datasets**: Files already available on disk can be directly registered and used.
* **Remote datasets**: Files hosted online can be downloaded and cached locally. This is useful for sharing datasets across teams or environments.
* **OpenML datasets**: Public datasets can be fetched from OpenML, leveraging its large and well-maintained repository.

In this tutorial, we show how to add a dataset for each of those sources.

## Adding a Locally Generated Dataset

Foundry includes utilities to generate synthetic datasets for tasks such as deduplication.
In this example, we create a custom deduplication dataset where **70% of the records have at least one duplicate**, with an average of **4 duplicates per duplicated item**. Once the dataset is generated, it is saved locally. To make it available within TabBench, we define a corresponding `DataConfig`, which is automatically registered in the system.

For local datasets, the configuration must include the field `file_path`, which specifies the path to the dataset file on disk.


In [1]:
from neuralk_foundry_ce.utils.data import make_deduplication


df, target_col = make_deduplication(num_samples=300, embed_dim=16, dup_frac=0.7, avg_dups=4.0, decay=.4)
df.to_parquet('./my_dataset.parquet')

In [2]:
from neuralk_foundry_ce.datasets import get_data_config, LocalDataConfig
from dataclasses import dataclass


@dataclass
class DataConfig(LocalDataConfig):
    name: str = 'fake_deduplication'
    task: str = "linkage"
    target: str = target_col
    file_path: str = "./my_dataset.parquet"

# Check that the dataset is registered and can be retrieved by name
get_data_config('fake_deduplication').name


'fake_deduplication'
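
To make the two generation parameters concrete, the following is a minimal sketch of how the duplication statistics of a labelled dataset could be measured. It uses only the standard library and a hypothetical list of entity labels (`entity_ids`), not the actual output of `make_deduplication`, whose column layout is not described here. Whether "duplicates per duplicated item" counts the whole group or only the extra copies is an assumption, so both quantities are computed.

```python
from collections import Counter

# Hypothetical entity labels: records sharing a label are duplicates of each other.
entity_ids = ["a", "a", "a", "b", "c", "c", "d", "e", "e", "e"]

counts = Counter(entity_ids)

# Sizes of the groups that contain more than one record.
dup_groups = [c for c in counts.values() if c > 1]

# Fraction of records that have at least one duplicate.
frac_with_dup = sum(dup_groups) / len(entity_ids)

# Average group size among duplicated entities, and average number of
# extra copies per duplicated entity (group size minus the original).
avg_group_size = sum(dup_groups) / len(dup_groups)
avg_extra_copies = avg_group_size - 1

print(f"records with a duplicate: {frac_with_dup:.0%}")
print(f"average group size: {avg_group_size:.2f}")
print(f"average extra copies: {avg_extra_copies:.2f}")
```

Running the same computation on the target column of a generated dataset is one way to check that the requested `dup_frac` and `avg_dups` are roughly respected.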

## Adding an OpenML Dataset

Registering a dataset from OpenML is the most straightforward approach.
To do so, simply create a configuration class and include the **`openml_id`** field. The dataset ID can be found on the corresponding dataset page at [openml.org](https://www.openml.org).

Once specified, the dataset will be automatically downloaded, cached locally, and made available within TabBench.


In [3]:
from neuralk_foundry_ce.datasets import OpenMLDataConfig


@dataclass
class CreditGDataConfig(OpenMLDataConfig):  # distinct name, so the imported base class is not shadowed
    name: str = 'credit-g'
    task: str = "classification"
    target: str = 'class'
    openml_id: int = 31

get_data_config('credit-g').name


'credit-g'

## Adding a Downloadable Dataset

To define a downloadable dataset, extend the `DownloadDataConfig` class with the following elements:

* A `file_name` field specifying the name under which the dataset will be stored locally.
* A `download_data` method responsible for fetching the dataset and saving it to the designated location within the dataset cache managed by the package.

For example, the Best Buy product catalog can be registered this way. In this case, we download the data, retain only the relevant columns, and store the processed file for use in downstream tasks.


In [4]:
from dataclasses import dataclass
import pandas as pd

from neuralk_foundry_ce.datasets.base import DownloadDataConfig


@dataclass
class DataConfig(DownloadDataConfig):
    name: str = "best_buy_simple_categ_again"
    task: str = "classification"
    target: str = "type"
    file_name: str = 'data.parquet'

    def download_data(self, dataset_dir):
        ds_url = 'https://raw.githubusercontent.com/BestBuyAPIs/open-data-set/refs/heads/master/products.json'
        df = pd.read_json(ds_url)[['name', 'type', 'price', 'manufacturer']]
        df = df[df.type.isin(['HardGood', 'Game', 'Software'])]
        df = df.reset_index(drop=True)
        df.to_parquet(dataset_dir / self.file_name)
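
The `download_data` method above only needs to write the file. Deciding when to call it is left to the package. The following is a minimal sketch of a download-once caching pattern, assuming the cache simply checks whether `file_name` already exists in the dataset directory. The helper `ensure_cached` and the stand-in `fake_download` are hypothetical and are not part of Foundry.

```python
import tempfile
from pathlib import Path


def ensure_cached(dataset_dir, file_name, download):
    """Return the cached file path, calling `download` only if the file is missing."""
    dataset_dir = Path(dataset_dir)
    dataset_dir.mkdir(parents=True, exist_ok=True)
    target = dataset_dir / file_name
    if not target.exists():
        download(dataset_dir)
    return target


calls = []


def fake_download(dataset_dir):
    # Stand-in for a real download: record the call and write a small file.
    calls.append(dataset_dir)
    (dataset_dir / "data.txt").write_text("name,type\n")


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "best_buy"
    first = ensure_cached(cache_dir, "data.txt", fake_download)
    second = ensure_cached(cache_dir, "data.txt", fake_download)  # served from cache
    n_downloads = len(calls)
    same_path = first == second

print(f"downloads performed: {n_downloads}")
```

Under this assumption, any filtering done inside `download_data` (such as keeping only a few columns) is paid once, and later runs read the processed file directly.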

## Conclusion

Neuralk Foundry currently supports three types of dataset sources: **local files**, **downloadable resources**, and **OpenML datasets**. These options cover most common use cases in both research and industry.

If you wish to support an additional data source or contribute new tasks, feel free to open a pull request. Contributions are welcome!

Check the other tutorials:

* [1 - Getting Started with TabBench.ipynb](./1%20-%20Getting%20Started%20with%20TabBench.ipynb)

* [3 - Use a custom model.ipynb](./3%20-%20Use%20a%20custom%20model.ipynb)

In [None]:
# Remove the parquet file created earlier in this tutorial
import os
file_path = './my_dataset.parquet'
if os.path.exists(file_path):
    os.remove(file_path)