| 🌐 Project Page | https://livemathematicianbench.github.io/ |
| 📄 Paper | https://arxiv.org/abs/2604.01754 |
| 🗂️ Dataset | https://huggingface.co/datasets/LiveMathematicianBench/LiveMathematicianBench |
LiveMathematicianBench is a research benchmark and construction pipeline for live, theorem-grounded mathematical multiple-choice evaluation.
The benchmark is built from recent arXiv math papers. For each monthly slice, the pipeline:
- retrieves recent papers,
- downloads and normalizes LaTeX,
- extracts main theorems and proof sketches with a hybrid rule-first / agentic workflow,
- generates self-contained theorem-grounded MCQs,
- scores and filters them for quality,
- overgenerates harder variants and selects the hardest surviving candidates.
The goal is not just to collect math questions, but to maintain a continuously refreshable benchmark whose items stay aligned with current research mathematics and remain hard for frontier models.
Monthly benchmark releases are organized as data/YYYYMM/qa_YYYYMM_final.json, for example data/202511/qa_202511_final.json.
Each monthly file is a JSON array. Each item corresponds to one theorem-grounded multiple-choice question and includes:
- no: item index within the monthly release,
- paper_link: source arXiv paper link,
- theorem: theorem statement used as the grounding target,
- sketch: proof sketch or proof idea used to support question construction,
- theorem_type: coarse theorem-category tags,
- mcq: the question, answer choices, correct answer, and construction metadata.
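For orientation, a minimal loading sketch is shown below. The field names follow the list above, but the exact nested structure of mcq is not specified here, so the snippet only inspects it rather than assuming a schema.

```python
import json

# Load one monthly release; the path layout follows data/YYYYMM/qa_YYYYMM_final.json.
with open("data/202511/qa_202511_final.json", encoding="utf-8") as f:
    items = json.load(f)  # a JSON array of theorem-grounded MCQ items

print(f"{len(items)} items in this release")

first = items[0]
print("paper:", first["paper_link"])
print("theorem type:", first["theorem_type"])
print("theorem:", first["theorem"][:200])

# mcq bundles the question, answer choices, correct answer, and construction
# metadata; its exact key names are not documented above, so inspect before use.
mcq = first["mcq"]
print("mcq keys:", list(mcq) if isinstance(mcq, dict) else type(mcq))
```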
This repository is the reusable code release for the benchmark pipeline. It focuses on the benchmark-building path rather than large generated artifacts.
Included:
- reusable scripts for monthly retrieval, preprocessing, QA generation, and hard-set filtering,
- theorem-type-specific prompts,
- the hard benchmark-construction notes,
- a small release layer (requirements.txt, quickstart, config template, demo skeleton).
Excluded:
- large monthly datasets,
- generated QA / hard-set artifacts,
- temporary run outputs,
- one-off experimental notebooks and local scratch files.
- pipeline/scripts/ - end-to-end scripts from monthly arXiv ingestion to final hard-set filtering
- pipeline/docs/ - benchmark construction notes and pipeline summaries
- pipeline/QUICKSTART.md - minimal environment and execution guide
- pipeline/config.example.sh - example runtime configuration for TRAPI / Azure-backed runs
- pipeline/run_small_demo.sh - minimal demo skeleton for running the QA stage on a small prepared source JSONL
Retrieval:
- pipeline/scripts/arxiv_retriever.py

LaTeX download and normalization:
- pipeline/scripts/extract_latex_text.py
- pipeline/scripts/build_month_filtered_latex.py

Theorem, sketch, and context extraction:
- pipeline/scripts/preprocessing_run.py
- pipeline/scripts/extract_theorems_az.py
- pipeline/scripts/backfill_context_from_latex.py

These stages produce the theorem/sketch/context sources used by QA generation.
QA generation:
- pipeline/scripts/generate_qa.py
- pipeline/scripts/generate_qa_az.py
- pipeline/scripts/run_generate_qa_az_multi.py
- pipeline/scripts/prompts.py
- pipeline/scripts/prompts_qaGen.py

QA polishing and accuracy testing:
- pipeline/scripts/polish_qa_artifacts.py
- pipeline/scripts/test_qa_accuracy.py

Hard-set filtering and selection:
- pipeline/scripts/judge_triviality_filter.py
- pipeline/scripts/bucket_stem_triviality.py
- pipeline/scripts/overgenerate_hard_pool.py
- pipeline/scripts/select_hard_candidates.py
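Taken together, a monthly run strings these scripts together in roughly the order below. This loop is an illustrative sketch only: each script expects its own CLI arguments, data layout, and TRAPI/Azure configuration (see pipeline/QUICKSTART.md and pipeline/config.example.sh), none of which are filled in here.

```python
import subprocess

# Illustrative stage order only; per-script arguments and environment are
# documented in pipeline/QUICKSTART.md and are intentionally omitted here.
STAGES = [
    "pipeline/scripts/arxiv_retriever.py",          # retrieve recent arXiv math papers
    "pipeline/scripts/extract_latex_text.py",       # normalize downloaded LaTeX
    "pipeline/scripts/preprocessing_run.py",        # extract theorems, sketches, context
    "pipeline/scripts/generate_qa_az.py",           # generate theorem-grounded MCQs
    "pipeline/scripts/judge_triviality_filter.py",  # filter trivial stems
    "pipeline/scripts/overgenerate_hard_pool.py",   # overgenerate harder variants
    "pipeline/scripts/select_hard_candidates.py",   # keep the hardest surviving candidates
]

for script in STAGES:
    subprocess.run(["python", script], check=True)  # add per-stage arguments as needed
```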
The current release reflects the theorem-grounded hard-set construction line:
- question generation uses theorem-first drafting with context used only for minimal notation/setup repair,
- hard negatives are produced with proof-sketch-aware distractor generation,
- trivial stems are filtered before hard-pool overgeneration,
- final hard sets are selected by model-tested hardness rather than by quality score alone.
This keeps the benchmark closer to live mathematical reasoning than to plain theorem restatement.
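As an illustration of the last point, the sketch below shows one way hardness-based selection can work: solve each candidate with one or more test models, keep the items with the lowest solve rates, and only then tie-break on quality score. The function and field names (solve_rate, quality_score) are hypothetical and do not describe the repository's actual implementation (see pipeline/scripts/select_hard_candidates.py for that).

```python
from typing import Iterable

def select_hardest(candidates: Iterable[dict], keep: int = 100) -> list[dict]:
    """Hypothetical hardness-first selection: each candidate dict is assumed to
    carry `solve_rate` (fraction of test-model attempts answered correctly) and
    an optional `quality_score`; neither field name is prescribed by the repo."""
    ranked = sorted(
        candidates,
        key=lambda c: (c["solve_rate"], -c.get("quality_score", 0.0)),
    )
    return ranked[:keep]  # lowest solve rate first, quality score as tie-breaker

# Example: an item missed by every test model outranks a "higher quality" easy one.
pool = [
    {"no": 1, "solve_rate": 0.9, "quality_score": 0.95},
    {"no": 2, "solve_rate": 0.1, "quality_score": 0.70},
]
print([c["no"] for c in select_hardest(pool, keep=1)])  # -> [2]
```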
This is a research-code release, not a fully productized benchmark SDK. The code is organized for reproducibility and extension, but some scripts still assume a research workflow and local data layout.
Start with:
- pipeline/QUICKSTART.md
- pipeline/docs/hard_pipeline.md
If you use LiveMathematicianBench in your work, please cite:
@misc{he2026livemathematicianbench,
title={LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches},
author={Linyang He and Qiyao Yu and Hanze Dong and Baohao Liao and Xinxing Xu and Micah Goldblum and Jiang Bian and Nima Mesgarani},
year={2026},
eprint={2604.01754},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2604.01754},
}