
TopBench

TOPBENCH: Benchmarking Predictive Reasoning over Tables


Overview | Quick Start | Inference | Evaluation | Citation


Official repository for TopBench, a benchmark for evaluating whether language models can understand a natural-language predictive intent and solve the corresponding tabular prediction task.

The full dataset is hosted on Hugging Face at LAMDA-Tabular/TopBench. This repository contains the code needed for inference, evaluation, sandboxed tool use, and the predict-only machine-learning baseline.

News

  • Initial Release: We release the TopBench codebase, legacy-compatible reproduction scripts, Docker sandbox setup, and predict-only ensemble baseline.
  • Dataset: The benchmark data is hosted on Hugging Face and should be placed under data/.

Overview

TopBench contains four task families:

| Task | Description |
| --- | --- |
| single_point_prediction | Predict a missing value or class for one described case. |
| decision_making | Select the best option among candidate predictive scenarios. |
| treatment_effect_analysis | Estimate the effect or trend after an intervention. |
| ranking_and_filtering | Produce a structured CSV ranking or filtered result. |

Legacy names are still supported by the compatibility scripts:

B1 -> single_point_prediction
B2 -> decision_making
B3 -> treatment_effect_analysis
B4 -> ranking_and_filtering
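
If you need this mapping in your own scripts, it is just a fixed lookup; a minimal Python sketch (the dict name is illustrative, not a project API):

# Fixed lookup mirroring the legacy-name mapping above;
# the compatibility scripts apply the same correspondence.
LEGACY_TASK_NAMES = {
    "B1": "single_point_prediction",
    "B2": "decision_making",
    "B3": "treatment_effect_analysis",
    "B4": "ranking_and_filtering",
}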

Quick Start

1. Install

conda create -n topbench python=3.10 -y
conda activate topbench
python -m pip install -U pip
python -m pip install -e .

For the full predict-only ensemble baseline:

python -m pip install -r requirements/full.txt
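
To verify the editable install (assuming the importable package is named topbench, as suggested by src/topbench/ in the repository layout below):

python -c "import topbench"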

2. Prepare Data

Download the dataset from Hugging Face:

python scripts/download_dataset.py --local-dir data

Alternatively, place or symlink it as:

data/
  single_point_prediction/
  decision_making/
  treatment_effect_analysis/
  ranking_and_filtering/

Check the dataset layout:

python scripts/validate_dataset.py --data-root data

3. Configure API Keys

export DEEPSEEK_API_KEY=your_key_here

Other OpenAI-compatible providers can be configured through .env.example.
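
As a rough illustration only, such a .env might look like the following; the actual variable names are defined in .env.example and should be copied from there:

# Illustrative placeholders -- copy the real variable names from .env.example.
DEEPSEEK_API_KEY=your_key_here
# Hypothetical entries for another OpenAI-compatible provider:
# OPENAI_API_KEY=your_key_here
# OPENAI_BASE_URL=https://your-provider.example/v1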

Inference

Run a small DeepSeek smoke test over all tasks and both modes:

python scripts/run_legacy_inference.py \
  --data-root data \
  --output-root outputs \
  --model deepseek \
  --tasks single_point_prediction decision_making treatment_effect_analysis ranking_and_filtering \
  --modes text_reasoning agentic_workflow \
  --max-files 1 \
  --workers 1

Outputs are written to:

outputs/<model>/<legacy_mode>/<legacy_task>/

where text_reasoning maps to no_tool and agentic_workflow maps to with_tool.
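
For the smoke test above, this produces directories such as the following (assuming the legacy task names B1-B4 from the mapping earlier):

outputs/deepseek/no_tool/B1/      # text_reasoning, single_point_prediction
outputs/deepseek/with_tool/B4/    # agentic_workflow, ranking_and_filtering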

Sandbox

The agentic workflow executes model-generated Python code in Docker.

Build and check the sandbox:

docker build -f docker/Dockerfile.sandbox -t topbench-sandbox:latest .
python scripts/healthcheck_sandbox.py --image topbench-sandbox:latest
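
Independently of the healthcheck script, you can sanity-check that the image runs Python at all (plain Docker usage, not a project command; assumes python is on the image's PATH):

docker run --rm topbench-sandbox:latest python -c "print('sandbox ok')"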

Use another image tag if needed:

export TOPBENCH_SANDBOX_IMAGE=topbench-sandbox:latest

Text-only inference and the predict-only baseline do not require Docker.

Evaluation

Evaluate frozen outputs with the bundled compatibility evaluators:

python scripts/reproduce_paper_scores.py \
  --data-root data \
  --inference-root outputs \
  --task decision_making \
  --model deepseek \
  --mode text_reasoning

For the structured ranking/filtering task:

python scripts/reproduce_paper_scores.py \
  --data-root data \
  --inference-root outputs \
  --task ranking_and_filtering \
  --model deepseek \
  --mode agentic_workflow
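
To sweep all four tasks with the same evaluator, a plain shell loop works (flags exactly as documented above):

for task in single_point_prediction decision_making treatment_effect_analysis ranking_and_filtering; do
  python scripts/reproduce_paper_scores.py \
    --data-root data \
    --inference-root outputs \
    --task "$task" \
    --model deepseek \
    --mode text_reasoning
done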

Summary files can be replayed without overwriting frozen outputs:

python scripts/reproduce_reasoning_summaries.py \
  --frozen-output-root outputs \
  --model deepseek \
  --mode text_reasoning \
  --compare
python scripts/reproduce_structured_summary.py \
  --data-root data \
  --frozen-output-root outputs \
  --model deepseek \
  --mode agentic_workflow \
  --compare

Predict-Only Baseline

The predict-only baseline operates on the gold structured data directly and does not call an LLM. It is an adaptive ensemble over strong tabular predictors such as HistGradientBoosting, ExtraTrees, XGBoost, LightGBM, CatBoost, and TabPFN, using each predictor when its package is installed.
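
The ensembling idea, in a minimal sketch (illustrative only, using two of the sklearn predictors named above on synthetic data; the project's baseline adds more models and its own adaptive weighting):

# Minimal illustration of ensembling tabular predictors by averaging
# predicted class probabilities. A sketch of the idea, not the
# project's baseline implementation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = [
    HistGradientBoostingClassifier(random_state=0),
    ExtraTreesClassifier(n_estimators=200, random_state=0),
]
for m in models:
    m.fit(X_tr, y_tr)

# Uniform soft vote: average the class-probability estimates.
proba = np.mean([m.predict_proba(X_te) for m in models], axis=0)
pred = proba.argmax(axis=1)
print("ensemble accuracy:", (pred == y_te).mean())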

Run a smoke test:

python scripts/run_predict_only_baseline.py \
  --task single_point_prediction \
  --data-root data \
  --output-root outputs \
  --mode predict_only \
  --fast-smoke

Repository Layout

TopBench/
  data/                  # dataset placeholder
  docker/                # sandbox Dockerfile
  scripts/               # inference, evaluation, and baseline entry points
  src/topbench/          # package source
  requirements/          # dependency files

Citation

Citation metadata will be added after publication.
