
LiveMathematicianBench

🌐 Project Page https://livemathematicianbench.github.io/
📄 Paper https://arxiv.org/abs/2604.01754
🗂️ Dataset https://huggingface.co/datasets/LiveMathematicianBench/LiveMathematicianBench

LiveMathematicianBench is a research benchmark and construction pipeline for live, theorem-grounded mathematical multiple-choice evaluation.

The benchmark is built from recent arXiv math papers. For each monthly slice, the pipeline:

  1. retrieves recent papers,
  2. downloads and normalizes LaTeX,
  3. extracts main theorems and proof sketches with a hybrid rule-first / agentic workflow,
  4. generates self-contained theorem-grounded MCQs,
  5. scores and filters them for quality,
  6. overgenerates harder variants and selects the hardest surviving candidates.

The goal is not just to collect math questions, but to maintain a continuously refreshable benchmark whose items stay aligned with current research mathematics and remain hard for frontier models.

Data

Monthly benchmark releases are organized as data/YYYYMM/qa_YYYYMM_final.json, for example data/202511/qa_202511_final.json.

Each monthly file is a JSON array. Each item corresponds to one theorem-grounded multiple-choice question and includes the following fields (a minimal loading sketch follows the list):

  • no: item index within the monthly release,
  • paper_link: source arXiv paper link,
  • theorem: theorem statement used as the grounding target,
  • sketch: proof sketch or proof idea used to support question construction,
  • theorem_type: coarse theorem-category tags,
  • mcq: the question, answer choices, correct answer, and construction metadata.
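As a quick illustration, a monthly release can be loaded and inspected with the Python standard library. The snippet below is a minimal sketch: it relies only on the documented top-level field names, and nothing beyond mcq being a JSON object is assumed about its inner structure.

```python
import json
from pathlib import Path

# Load one monthly release (path convention: data/YYYYMM/qa_YYYYMM_final.json).
release = Path("data/202511/qa_202511_final.json")
items = json.loads(release.read_text(encoding="utf-8"))

print(f"{len(items)} theorem-grounded MCQs in {release.name}")

# Inspect the documented top-level fields of the first item.
first = items[0]
for key in ("no", "paper_link", "theorem_type"):
    print(key, "->", first.get(key))

# `mcq` bundles the question, answer choices, correct answer, and
# construction metadata; its exact inner keys may vary, so just list them.
mcq = first.get("mcq")
print(sorted(mcq.keys()) if isinstance(mcq, dict) else type(mcq))
```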

What this repository contains

This repository is the reusable code release for the benchmark pipeline. It focuses on the benchmark-building path rather than large generated artifacts.

Included:

  • reusable scripts for monthly retrieval, preprocessing, QA generation, and hard-set filtering,
  • theorem-type-specific prompts,
  • notes on hard-set benchmark construction,
  • a small release layer (requirements.txt, quickstart, config template, demo skeleton).

Excluded:

  • large monthly datasets,
  • generated QA / hard-set artifacts,
  • temporary run outputs,
  • one-off experimental notebooks and local scratch files.

Repository structure

  • pipeline/scripts/
    • end-to-end scripts from monthly arXiv ingestion to final hard-set filtering
  • pipeline/docs/
    • benchmark construction notes and pipeline summaries
  • pipeline/QUICKSTART.md
    • minimal environment and execution guide
  • pipeline/config.example.sh
    • example runtime configuration for TRAPI / Azure-backed runs
  • pipeline/run_small_demo.sh
    • minimal demo skeleton for running the QA stage on a small prepared source JSONL file

Main pipeline stages

1. Monthly retrieval

  • pipeline/scripts/arxiv_retriever.py

2. LaTeX acquisition and filtering

  • pipeline/scripts/extract_latex_text.py
  • pipeline/scripts/build_month_filtered_latex.py

3. Unified preprocessing

  • pipeline/scripts/preprocessing_run.py
  • pipeline/scripts/extract_theorems_az.py
  • pipeline/scripts/backfill_context_from_latex.py

This stage produces theorem/sketch/context sources used by QA generation.

4. QA generation

  • pipeline/scripts/generate_qa.py
  • pipeline/scripts/generate_qa_az.py
  • pipeline/scripts/run_generate_qa_az_multi.py
  • pipeline/scripts/prompts.py
  • pipeline/scripts/prompts_qaGen.py

5. Post-processing and evaluation

  • pipeline/scripts/polish_qa_artifacts.py
  • pipeline/scripts/test_qa_accuracy.py

6. Hard-set construction

  • pipeline/scripts/judge_triviality_filter.py
  • pipeline/scripts/bucket_stem_triviality.py
  • pipeline/scripts/overgenerate_hard_pool.py
  • pipeline/scripts/select_hard_candidates.py
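The stage scripts are intended to be run in order for each monthly slice. The sketch below shows one possible orchestration; the --month flag and the particular subset of scripts chosen here are illustrative assumptions, not the scripts' actual command-line interfaces (see pipeline/QUICKSTART.md for the real invocation details).

```python
import subprocess
import sys

MONTH = "202511"  # monthly slice identifier (YYYYMM)

# Illustrative stage order. The --month flag is an assumption made for this
# sketch; the actual CLIs are documented in pipeline/QUICKSTART.md.
STAGES = [
    "pipeline/scripts/arxiv_retriever.py",
    "pipeline/scripts/extract_latex_text.py",
    "pipeline/scripts/build_month_filtered_latex.py",
    "pipeline/scripts/preprocessing_run.py",
    "pipeline/scripts/generate_qa_az.py",
    "pipeline/scripts/polish_qa_artifacts.py",
    "pipeline/scripts/judge_triviality_filter.py",
    "pipeline/scripts/overgenerate_hard_pool.py",
    "pipeline/scripts/select_hard_candidates.py",
]

for script in STAGES:
    cmd = [sys.executable, script, "--month", MONTH]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop at the first failing stage
```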

Current benchmark design direction

The current release reflects the theorem-grounded hard-set construction line:

  • question generation uses theorem-first drafting, with surrounding paper context used only for minimal notation/setup repair,
  • hard negatives are produced with proof-sketch-aware distractor generation,
  • trivial stems are filtered before hard-pool overgeneration,
  • final hard sets are selected by model-tested hardness rather than by quality score alone.

This keeps the benchmark closer to live mathematical reasoning than to plain theorem restatement.
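To make the last point concrete, the sketch below shows one way hardness-based selection could look: keep only items that the tested models answer correctly at most some fraction of the time. The evaluate_item helper, the threshold, and the field handling are assumptions for illustration; the repository's actual criteria live in pipeline/scripts/select_hard_candidates.py.

```python
from typing import Callable

def select_hard_items(
    items: list[dict],
    evaluate_item: Callable[[dict], list[bool]],  # hypothetical: one correctness flag per tested model
    max_accuracy: float = 0.5,
) -> list[dict]:
    """Keep items whose average accuracy across tested models is at most `max_accuracy`.

    This mirrors selection by model-tested hardness rather than by quality
    score alone; the concrete logic is in select_hard_candidates.py.
    """
    hard = []
    for item in items:
        verdicts = evaluate_item(item)
        accuracy = sum(verdicts) / len(verdicts) if verdicts else 0.0
        if accuracy <= max_accuracy:
            hard.append(item)
    return hard
```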

Status

This is a research-code release, not a fully productized benchmark SDK. The code is organized for reproducibility and extension, but some scripts still assume a research workflow and local data layout.

Start with:

  • pipeline/QUICKSTART.md
  • pipeline/docs/hard_pipeline.md

Citation

If you use LiveMathematicianBench in your work, please cite:

@misc{he2026livemathematicianbench,
      title={LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches},
      author={Linyang He and Qiyao Yu and Hanze Dong and Baohao Liao and Xinxing Xu and Micah Goldblum and Jiang Bian and Nima Mesgarani},
      year={2026},
      eprint={2604.01754},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.01754},
}
