**What is TabBench?**
*TabBench* is a benchmark suite for tabular data focused on real-world business use cases like product categorization, deduplication, and pricing. Unlike academic benchmarks, it evaluates models on industrial datasets from sectors such as retail, banking, and insurance. Built on top of [Neuralk Foundry-CE](https://github.com/Neuralk-AI/NeuralkFoundry-CE), TabBench structures each task as a modular workflow, making it easy to test and compare different approaches. It’s designed to help identify the best models for practical, industry-driven challenges.

---

## Tackling the Categorization Challenge

One of the core tasks in TabBench is **product categorization**: a classification problem that arises frequently in e-commerce and digital inventory management.

Due to privacy restrictions, we are unable to share the 36 real-world product categorization datasets used internally. However, in this notebook, we provide a public dataset that mimics some of the challenges encountered in our industrial workloads. This allows you to experiment under realistic conditions and test the robustness of your models on comparable data.


## Understanding Product Categorization

At its core, product categorization is a **multi-class classification** problem. But unlike standard benchmarks, it comes with two key complications:

1. **Heterogeneous and partially structured inputs:**
   Product data is typically semi-structured. While all products may share fields like `title`, `description`, and `price`, other features — such as `battery power`, `screen size`, or `material` — may only apply to certain categories. For example, "voltage" is relevant for an electric drill but meaningless for a T-shirt. This inconsistency places an additional burden on preprocessing and feature engineering.

   In our internal workflows, this preprocessing is handled with care. However, for privacy reasons, the dataset we share here includes only basic fields (`title` and `description`), which we embed using a pre-trained language model followed by dimensionality reduction.

2. **Practical evaluation criteria:**
   Many academic benchmarks rely on ROC-AUC, which is threshold-independent and measures how well a model ranks examples rather than the quality of its final decisions. In production systems, such as those used in retail, each product must be assigned a concrete category, so the priority is often **accuracy**, **F1-score**, or **precision/recall**, depending on the downstream impact of misclassifications.

   Inspired by challenges like the [Rakuten Product Classification Challenge](https://challengedata.ens.fr/challenges/35), we use **F1-score** as the default metric in TabBench for this task. This choice better reflects industrial priorities, where both false positives and false negatives can be costly.
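To make the metric concrete, here is a minimal sketch that computes F1 on a toy three-class problem with scikit-learn. The labels are invented for illustration, and the macro averaging is an assumption: the text above does not state which averaging TabBench applies.

```python
from sklearn.metrics import f1_score

# Toy labels, invented for illustration only
y_true = ["phone", "phone", "laptop", "laptop", "camera", "camera"]
y_pred = ["phone", "laptop", "laptop", "laptop", "camera", "phone"]

# Per-class F1, returned in the order given by `labels`
labels = ["camera", "laptop", "phone"]
per_class = f1_score(y_true, y_pred, labels=labels, average=None)

# Macro averaging: unweighted mean of the per-class scores (an assumption here)
macro_f1 = f1_score(y_true, y_pred, average="macro")

for label, score in zip(labels, per_class):
    print(f"{label}: {score:.3f}")
print(f"macro F1: {macro_f1:.3f}")
```

Because F1 combines precision and recall for each class, a class that is often confused with another pulls the score down even when overall accuracy looks acceptable.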

## Preprocessing Pipeline

To transform text into usable features, we use the [`TextEncoder`](https://skrub-data.org/stable/reference/generated/skrub.TextEncoder.html) class from skrub.

This encoder uses the [`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base) model from Hugging Face to generate dense embeddings for the `title` and `description` fields. Each of these results in a 768-dimensional vector.

To reduce dimensionality and improve efficiency, we then apply **Principal Component Analysis (PCA)**, projecting each embedding down to 50 dimensions. The reduced embeddings for the title and description are concatenated into a single 100-dimensional feature vector per product.

This minimal preprocessing pipeline is designed to be easy to reproduce and compatible with a wide range of models.
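The reduce-then-concatenate step can be sketched without downloading any model. The example below uses random arrays as stand-ins for the 768-dimensional embeddings and scikit-learn's `PCA` directly; in the notebook itself, `TextEncoder(n_components=50)` is expected to take care of this reduction, so treat this as an illustration of the shapes involved rather than the actual pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_products = 200  # arbitrary toy size

# Random stand-ins for the 768-dimensional title and description embeddings
title_emb = rng.normal(size=(n_products, 768))
desc_emb = rng.normal(size=(n_products, 768))

# One PCA per field, each keeping 50 components
title_50 = PCA(n_components=50).fit_transform(title_emb)
desc_50 = PCA(n_components=50).fit_transform(desc_emb)

# Concatenate the two reduced embeddings into one feature vector per product
features = np.hstack([title_50, desc_50])
print(features.shape)  # (200, 100)
```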

## The Task at Hand

Since the real product datasets used in TabBench are confidential, we provide a substitute based on the public **BestBuy** dataset. This dataset, introduced in [Notebook 1](./1%20-%20Getting%20Started%20with%20TabBench.ipynb), contains:

* Product names (the `name` field, used as the title)
* Descriptions
* A hierarchical category structure (level 1, 2, 3)

This structure allows us to simulate real-world categorization scenarios, with multiple levels of granularity and class imbalance.

Let’s begin by downloading the dataset.


In [1]:
import requests
import pandas as pd

# Download the product catalog (a JSON list of dicts)
ds_url = 'https://raw.githubusercontent.com/BestBuyAPIs/open-data-set/refs/heads/master/products.json'
response = requests.get(ds_url)
response.raise_for_status()  # fail early on a bad HTTP response
data = response.json()

# Extract relevant fields
records = []
for item in data:
    record = {
        "id": item.get("sku"),
        "name": item.get("name"),
        "description": item.get("description"),
        'type': item.get("type"),
        'price': item.get("price"),
        'manufacturer': item.get("manufacturer"),
    }
    categories = item.get("category", [])
    for i in range(4):  # adjust to desired number of category levels
        record[f"category_{i+1}"] = categories[i]["name"] if i < len(categories) else None
    records.append(record)

# Create DataFrame
df = pd.DataFrame(records)

## Data Filtering

To simplify this example, we remove *rare classes*, that is, categories with very few samples. This serves two purposes:

1. **Avoiding class imbalance issues**:
   In classification tasks, classes with very few examples can introduce noise, make training unstable, and skew evaluation metrics. While handling imbalanced data is important in production settings, it is beyond the scope of this tutorial.

2. **Speeding up experimentation**:
   By filtering out underrepresented classes, we reduce the dataset size, which shortens both preprocessing and training time. This allows for quicker iteration during model development.

In practice, handling rare classes might involve techniques such as class reweighting, oversampling, or hierarchical classification — all of which can be explored once the baseline is established.
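As an illustration of the first of these techniques, here is a minimal sketch of class reweighting using scikit-learn's `compute_class_weight`. The labels and their proportions are invented, and this is not part of the TabBench workflow; it only shows how rare classes receive larger weights.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy imbalanced labels, invented for illustration
y = np.array(["cases"] * 80 + ["chargers"] * 15 + ["screen protectors"] * 5)

classes = np.unique(y)
# "balanced" weights follow n_samples / (n_classes * count_per_class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes, weights))

# Per-sample weights, usable with estimators whose fit() accepts sample_weight
sample_weight = np.array([class_weight[label] for label in y])

for label, weight in class_weight.items():
    print(f"{label}: {weight:.2f}")
```

With these weights, each class contributes the same total weight to the training loss regardless of how many samples it has.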

Let’s proceed with filtering out classes that appear too infrequently in the dataset.


In [2]:
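# Keep only categories with enough products at each level:
# at least 100 (level 3), 500 (level 2), and 1000 (level 1).

def filter_by_group(df, group, count):
    """Keep rows whose value in column `group` occurs at least `count` times."""
    counts = df.groupby(group)[group].transform('count')
    return df[counts >= count]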

df = filter_by_group(df, 'category_3', 100)
df = filter_by_group(df, 'category_2', 500)
df = filter_by_group(df, 'category_1', 1000)

print('Top categories', df['category_1'].unique())

Top categories ['Connected Home & Housewares' 'Car Electronics & GPS'
 'Computers & Tablets' 'Appliances' 'Audio' 'Cameras & Camcorders'
 'Cell Phones']


## Preprocessing

Since we will be running multiple workflows on the same dataset, we apply the preprocessing step **once on the entire dataset**. This ensures consistency across all experiments and avoids redundant computation.

In this example, we focus on transforming the text-based fields that are common in product catalogs: the product title (the `name` column) and `description`. These fields are first encoded using the [`intfloat/multilingual-e5-base`](https://huggingface.co/intfloat/multilingual-e5-base) model, a pre-trained transformer from Hugging Face optimized for multilingual sentence embeddings.

Each text field is transformed into a dense vector of 768 floating-point values. To make these embeddings more manageable and reduce computation downstream, we apply **Principal Component Analysis (PCA)** and retain the top 50 components per field. The compressed vectors for `name` and `description` are then concatenated, yielding a final feature vector of 100 dimensions per product.


In [3]:
from skrub import TextEncoder

text_encoder = TextEncoder(model_name='intfloat/multilingual-e5-base', n_components=50)

name_encoded = text_encoder.fit_transform(df['name'])
desc_encoded = text_encoder.fit_transform(df['description'])

# Concatenate all
X = pd.concat([name_encoded, desc_encoded], axis=1)
X.reset_index(drop=True, inplace=True)

## Classification

Now that the data is preprocessed, we move on to the classification step.

The BestBuy dataset includes a **hierarchical category structure**, of which we use three levels: `category_1`, `category_2`, and `category_3`. To simplify the experimental setup and focus on meaningful comparisons, we define one classification task per level. Level 1 uses the full filtered dataset. For levels 2 and 3:

* We **select the subset of data corresponding to the most populated parent category** from the level above.
  This ensures that the classification problem remains well-posed and avoids severe class imbalance from sparsely populated branches in the hierarchy.
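The selection rule above can be sketched on a toy frame. The rows below are invented; only the column naming (`category_1`, `category_2`) follows the notebook.

```python
import pandas as pd

# Toy hierarchy, invented for illustration
toy = pd.DataFrame({
    "category_1": ["Audio", "Audio", "Audio", "Appliances", "Appliances"],
    "category_2": ["Headphones", "Speakers", "Headphones", "Refrigerators", "Ranges"],
})

# Most frequent parent category at level 1
top_parent = toy["category_1"].value_counts().idxmax()

# Keep only rows under that parent; their level-2 labels become the target
subset = toy[toy["category_1"] == top_parent]
y_level_2 = subset["category_2"]

print(top_parent, sorted(y_level_2.unique()))
```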

The resulting classification tasks are:

| Level   | Samples | Classes |
| ------- | ------- | ------- |
| Level 1 | 23,777  | 7       |
| Level 2 | 6,327   | 4       |
| Level 3 | 5,891   | 7       |

For each task, we train and evaluate a classifier using **5-fold cross-validation** and report the **average score** across folds. This approach gives a more reliable estimate of generalization performance than a single train-test split.
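The evaluation protocol can be sketched independently of Foundry. The example below is a stand-in under stated assumptions: it uses synthetic data, scikit-learn's `StratifiedKFold`, a logistic regression instead of the workflow's classifier, and macro-averaged F1. How the `Categorisation` workflow builds its folds internally is not described here, so this only illustrates the "mean ± standard deviation over 5 folds" reporting.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

# Synthetic 3-class problem standing in for the embedded product data
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=10, n_classes=3, random_state=0
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in cv.split(X_toy, y_toy):
    # Fit on four folds, score on the held-out fold
    model = LogisticRegression(max_iter=1000)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    y_pred = model.predict(X_toy[test_idx])
    fold_scores.append(f1_score(y_toy[test_idx], y_pred, average="macro"))

print(f"F1 = {np.mean(fold_scores):.3f} ± {np.std(fold_scores):.3f}")
```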

> ☕ **Heads up:** This step may take a few minutes depending on your hardware and the model used. It might be a good time to grab a coffee while the models train!

Once complete, we’ll visualize the results to compare performance across category levels.

In [4]:
from neuralk_foundry_ce.workflow.use_cases import Categorisation
import numpy as np
from IPython.display import display

import warnings
warnings.simplefilter("error", category=UserWarning) 

# Simulate 3 levels of problems

for level in [3, 2, 1]:
    print(f'Level {level}')
    print('-------')
    # Keep samples from the most populated category at level - 1
    X_level = X
    y_level = df[f'category_{level}']
    if level > 1:
        top_n_1_value = df[f'category_{level - 1}'].value_counts().idxmax()
        mask = (df[f'category_{level - 1}'] == top_n_1_value).values
        X_level = X_level[mask]
        y_level = y_level[mask]
    X_level = X_level.reset_index(drop=True)
    y_level = y_level.values

    labels, counts = np.unique(y_level, return_counts=True)
    n_samples, n_features = X_level.shape
    print(f"X: {n_samples} rows × {n_features} features")
    
    # Tabular display of class distribution
    df_counts = pd.DataFrame({"Label": labels, "Count": counts
    }).sort_values("Count", ascending=False).reset_index(drop=True)
    
    display(df_counts)
        
    workflow = Categorisation()
    f1_scores = []
    for fold_index in range(5):
        data, metrics = workflow.run(X_level, y_level, fold_index=fold_index)
        f1_scores.append(metrics['xgboost-classifier']['test_f1_score'])

    mean_f1 = np.mean(f1_scores)
    std_f1 = np.std(f1_scores)
    print(f"Level {level}: F1 = {mean_f1:.3f} ± {std_f1:.3f}")
    

Level 3
-------
X: 5891 rows × 100 features


|     | Label                        | Count |
| --- | ---------------------------- | ----- |
| 0   | iPhone Accessories           | 2079  |
| 1   | Cell Phone Cases & Clips     | 1847  |
| 2   | Cell Phone Batteries & Power | 801   |
| 3   | Adapters, Cables & Chargers  | 569   |
| 4   | Smartwatches & Accessories   | 358   |
| 5   | Screen Protectors            | 123   |
| 6   | Photography Accessories      | 114   |


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

Level 3: F1 = 0.944 ± 0.015
Level 2
-------
X: 6327 rows × 100 features


|     | Label                          | Count |
| --- | ------------------------------ | ----- |
| 0   | Small Kitchen Appliances       | 3342  |
| 1   | Ranges, Cooktops & Ovens       | 1331  |
| 2   | Heating, Cooling & Air Quality | 965   |
| 3   | Refrigerators                  | 689   |



Level 2: F1 = 0.985 ± 0.003
Level 1
-------
X: 23777 rows × 100 features


|     | Label                       | Count |
| --- | --------------------------- | ----- |
| 0   | Appliances                  | 6327  |
| 1   | Cell Phones                 | 5891  |
| 2   | Computers & Tablets         | 3614  |
| 3   | Audio                       | 2134  |
| 4   | Connected Home & Housewares | 1987  |
| 5   | Car Electronics & GPS       | 1948  |
| 6   | Cameras & Camcorders        | 1876  |



Level 1: F1 = 0.973 ± 0.004
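
The `huggingface/tokenizers` messages in the output above do not affect the scores. As the message itself suggests, setting `TOKENIZERS_PARALLELISM` explicitly should silence them. A minimal sketch, assuming it is placed at the top of the notebook so that it runs before the text encoder is first used:

```python
import os

# Set before the tokenizers library is first used in this process;
# "false" disables tokenizer parallelism up front, avoiding the fork warning
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```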
