# TopBench: Benchmarking Predictive Reasoning over Tables

Overview | Quick Start | Inference | Evaluation | Citation
## Overview

Official repository for TopBench, a benchmark for evaluating whether language models can understand a natural-language predictive intent and solve the corresponding tabular prediction task.

The full dataset is hosted on Hugging Face at `LAMDA-Tabular/TopBench`. This repository contains the code for inference, evaluation, sandboxed tool use, and the predict-only machine-learning baseline.
- **Initial Release**: We release the TopBench codebase, legacy-compatible reproduction scripts, the Docker sandbox setup, and the predict-only ensemble baseline.
- **Dataset**: The benchmark data is hosted on Hugging Face and should be placed under `data/`.
TopBench contains four task families:

| Task | Description |
|---|---|
| `single_point_prediction` | Predict a missing value or class for one described case. |
| `decision_making` | Select the best option among candidate predictive scenarios. |
| `treatment_effect_analysis` | Estimate the effect or trend after an intervention. |
| `ranking_and_filtering` | Produce a structured CSV ranking or filtered result. |
Legacy names are still supported by the compatibility scripts:

```
B1 -> single_point_prediction
B2 -> decision_making
B3 -> treatment_effect_analysis
B4 -> ranking_and_filtering
```
## Quick Start

```bash
conda create -n topbench python=3.10 -y
conda activate topbench
python -m pip install -U pip
python -m pip install -e .
```

For the full predict-only ensemble baseline:

```bash
python -m pip install -r requirements/full.txt
```

Download the dataset from Hugging Face:

```bash
python scripts/download_dataset.py --local-dir data
```

Alternatively, place or symlink it as:
```
data/
  single_point_prediction/
  decision_making/
  treatment_effect_analysis/
  ranking_and_filtering/
```
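If you prefer to fetch the data programmatically instead of using the helper script above, a minimal sketch with `huggingface_hub` (assuming the dataset repo id `LAMDA-Tabular/TopBench` noted above) is:

```python
# Minimal sketch: fetch the benchmark data into data/ via huggingface_hub.
# Assumes the dataset repo id LAMDA-Tabular/TopBench; the supported path is
# scripts/download_dataset.py, which may perform additional checks.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="LAMDA-Tabular/TopBench",
    repo_type="dataset",
    local_dir="data",
)
```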
Check the dataset layout:

```bash
python scripts/validate_dataset.py --data-root data
```

## Inference

Set your DeepSeek API key:

```bash
export DEEPSEEK_API_KEY=your_key_here
```

Other OpenAI-compatible providers can be configured through `.env.example`.
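For example, a provider entry in your local `.env` might look like the following (variable names below are illustrative; `.env.example` lists the keys this repo actually reads):

```
# Illustrative only -- check .env.example for the real variable names.
DEEPSEEK_API_KEY=your_key_here
# Hypothetical entries for another OpenAI-compatible provider:
OPENAI_API_KEY=your_other_key
OPENAI_BASE_URL=https://api.example.com/v1
```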
Run a small DeepSeek smoke test over all tasks and both modes:

```bash
python scripts/run_legacy_inference.py \
  --data-root data \
  --output-root outputs \
  --model deepseek \
  --tasks single_point_prediction decision_making treatment_effect_analysis ranking_and_filtering \
  --modes text_reasoning agentic_workflow \
  --max-files 1 \
  --workers 1
```

Outputs are written to:

```
outputs/<model>/<legacy_mode>/<legacy_task>/
```

where `text_reasoning` maps to `no_tool` and `agentic_workflow` maps to `with_tool`.
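For instance, given the legacy mappings listed earlier, the `decision_making` results from the `text_reasoning` smoke test should land under a path like:

```
outputs/deepseek/no_tool/B2/
```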
The agentic workflow executes model-generated Python code in Docker. Build and check the sandbox:

```bash
docker build -f docker/Dockerfile.sandbox -t topbench-sandbox:latest .
python scripts/healthcheck_sandbox.py --image topbench-sandbox:latest
```

Use another image tag if needed:

```bash
export TOPBENCH_SANDBOX_IMAGE=topbench-sandbox:latest
```

Text-only inference and the predict-only baseline do not require Docker.
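For intuition, running a generated script inside such a sandbox boils down to something like the following `docker run` invocation (illustrative flags only; the harness's actual call may differ):

```bash
# Illustrative sketch of sandboxed execution, not the harness's exact call:
# no network access, a mounted working directory, and automatic cleanup.
docker run --rm --network none \
  -v "$PWD/workdir:/workspace" -w /workspace \
  topbench-sandbox:latest python solution.py
```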
## Evaluation

Evaluate frozen outputs with the bundled compatibility evaluators:

```bash
python scripts/reproduce_paper_scores.py \
  --data-root data \
  --inference-root outputs \
  --task decision_making \
  --model deepseek \
  --mode text_reasoning
```

For the structured ranking/filtering task:

```bash
python scripts/reproduce_paper_scores.py \
  --data-root data \
  --inference-root outputs \
  --task ranking_and_filtering \
  --model deepseek \
  --mode agentic_workflow
```
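To sweep all four tasks for one model and mode, a plain shell loop over the same flags works (choose the mode appropriate to each task, as in the examples above):

```bash
# Evaluate every task family for one model/mode combination.
for task in single_point_prediction decision_making treatment_effect_analysis ranking_and_filtering; do
  python scripts/reproduce_paper_scores.py \
    --data-root data \
    --inference-root outputs \
    --task "$task" \
    --model deepseek \
    --mode text_reasoning
done
```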
Summary files can be replayed without overwriting frozen outputs:

```bash
python scripts/reproduce_reasoning_summaries.py \
  --frozen-output-root outputs \
  --model deepseek \
  --mode text_reasoning \
  --compare
```

```bash
python scripts/reproduce_structured_summary.py \
  --data-root data \
  --frozen-output-root outputs \
  --model deepseek \
  --mode agentic_workflow \
  --compare
```

The predict-only baseline uses the gold structured data directly and does not call an LLM. It is an adaptive ensemble over strong tabular predictors such as HistGradientBoosting, ExtraTrees, XGBoost, LightGBM, CatBoost, and TabPFN (when installed).
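For intuition only, the sketch below shows the general shape of such an ensemble using scikit-learn's `VotingClassifier`; it is not the repository's implementation, which adapts its member models to the task.

```python
# Sketch of the predict-only idea, NOT the repository's implementation:
# fit several strong tabular models on the gold structured data and
# average their predicted class probabilities.
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    ExtraTreesClassifier,
    HistGradientBoostingClassifier,
    VotingClassifier,
)

# Toy stand-in for a gold structured table.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("hgb", HistGradientBoostingClassifier(random_state=0)),
        ("etc", ExtraTreesClassifier(n_estimators=300, random_state=0)),
    ],
    voting="soft",  # average class probabilities across members
)
ensemble.fit(X, y)
print(ensemble.predict(X[:5]))
```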
Run a smoke test:

```bash
python scripts/run_predict_only_baseline.py \
  --task single_point_prediction \
  --data-root data \
  --output-root outputs \
  --mode predict_only \
  --fast-smoke
```

Repository layout:

```
TopBench/
  data/            # dataset placeholder
  docker/          # sandbox Dockerfile
  scripts/         # inference, evaluation, and baseline entry points
  src/topbench/    # package source
  requirements/    # dependency files
```
## Citation

Citation metadata will be added after publication.