
LiveMathematicianBench

🌐 Project Page https://livemathematicianbench.github.io/
📄 Paper https://arxiv.org/abs/2604.01754
🗂️ Dataset https://huggingface.co/datasets/LiveMathematicianBench/LiveMathematicianBench

LiveMathematicianBench is a research benchmark and construction pipeline for live, theorem-grounded mathematical multiple-choice evaluation.

The benchmark is built from recent arXiv math papers. For each monthly slice, the pipeline:

  1. retrieves recent papers,
  2. downloads and normalizes LaTeX,
  3. extracts main theorems and proof sketches with a hybrid rule-first / agentic workflow,
  4. generates self-contained theorem-grounded MCQs,
  5. scores and filters them for quality,
  6. overgenerates harder variants and selects the hardest surviving candidates.

The goal is not just to collect math questions, but to maintain a continuously refreshable benchmark whose items stay aligned with current research mathematics and remain hard for frontier models.

Data

Monthly benchmark releases are organized as data/YYYYMM/qa_YYYYMM_final.json, for example data/202511/qa_202511_final.json.

Each monthly file is a JSON array. Each item corresponds to one theorem-grounded multiple-choice question and includes the following fields (a minimal loading sketch follows the list):

  • no: item index within the monthly release,
  • paper_link: source arXiv paper link,
  • theorem: theorem statement used as the grounding target,
  • sketch: proof sketch or proof idea used to support question construction,
  • theorem_type: coarse theorem-category tags,
  • mcq: the question, answer choices, correct answer, and construction metadata.
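As a quick illustration, a monthly release can be loaded and inspected with the Python standard library. The snippet below is a minimal sketch: it relies only on the documented top-level field names, and nothing beyond mcq being a JSON object is assumed about its inner structure.

```python
import json
from pathlib import Path

# Load one monthly release (path convention: data/YYYYMM/qa_YYYYMM_final.json).
release = Path("data/202511/qa_202511_final.json")
items = json.loads(release.read_text(encoding="utf-8"))

print(f"{len(items)} theorem-grounded MCQs in {release.name}")

# Inspect the documented top-level fields of the first item.
first = items[0]
for key in ("no", "paper_link", "theorem_type"):
    print(key, "->", first.get(key))

# `mcq` bundles the question, answer choices, correct answer, and
# construction metadata; its exact inner keys may vary, so just list them.
mcq = first.get("mcq")
print(sorted(mcq.keys()) if isinstance(mcq, dict) else type(mcq))
```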

What this repository contains

This repository is the reusable code release for the benchmark pipeline. It focuses on the benchmark-building path rather than large generated artifacts.

Included:

  • reusable scripts for monthly retrieval, preprocessing, QA generation, and hard-set filtering,
  • theorem-type-specific prompts,
  • notes on hard-set benchmark construction,
  • a small release layer (requirements.txt, quickstart, config template, demo skeleton).

Excluded:

  • large monthly datasets,
  • generated QA / hard-set artifacts,
  • temporary run outputs,
  • one-off experimental notebooks and local scratch files.

Repository structure

  • pipeline/scripts/
    • end-to-end scripts from monthly arXiv ingestion to final hard-set filtering
  • pipeline/docs/
    • benchmark construction notes and pipeline summaries
  • pipeline/QUICKSTART.md
    • minimal environment and execution guide
  • pipeline/config.example.sh
    • example runtime configuration for TRAPI / Azure-backed runs
  • pipeline/run_small_demo.sh
    • minimal demo skeleton for running the QA stage on a small prepared source JSONL file

Main pipeline stages

1. Monthly retrieval

  • pipeline/scripts/arxiv_retriever.py

2. LaTeX acquisition and filtering

  • pipeline/scripts/extract_latex_text.py
  • pipeline/scripts/build_month_filtered_latex.py

3. Unified preprocessing

  • pipeline/scripts/preprocessing_run.py
  • pipeline/scripts/extract_theorems_az.py
  • pipeline/scripts/backfill_context_from_latex.py

This stage produces theorem/sketch/context sources used by QA generation.

4. QA generation

  • pipeline/scripts/generate_qa.py
  • pipeline/scripts/generate_qa_az.py
  • pipeline/scripts/run_generate_qa_az_multi.py
  • pipeline/scripts/prompts.py
  • pipeline/scripts/prompts_qaGen.py

5. Post-processing and evaluation

  • pipeline/scripts/polish_qa_artifacts.py
  • pipeline/scripts/test_qa_accuracy.py

6. Hard-set construction

  • pipeline/scripts/judge_triviality_filter.py
  • pipeline/scripts/bucket_stem_triviality.py
  • pipeline/scripts/overgenerate_hard_pool.py
  • pipeline/scripts/select_hard_candidates.py
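The stage scripts are intended to be run in order for each monthly slice. The sketch below shows one possible orchestration; the --month flag and the particular subset of scripts chosen here are illustrative assumptions, not the scripts' actual command-line interfaces (see pipeline/QUICKSTART.md for the real invocation details).

```python
import subprocess
import sys

MONTH = "202511"  # monthly slice identifier (YYYYMM)

# Illustrative stage order. The --month flag is an assumption made for this
# sketch; the actual CLIs are documented in pipeline/QUICKSTART.md.
STAGES = [
    "pipeline/scripts/arxiv_retriever.py",
    "pipeline/scripts/extract_latex_text.py",
    "pipeline/scripts/build_month_filtered_latex.py",
    "pipeline/scripts/preprocessing_run.py",
    "pipeline/scripts/generate_qa_az.py",
    "pipeline/scripts/polish_qa_artifacts.py",
    "pipeline/scripts/judge_triviality_filter.py",
    "pipeline/scripts/overgenerate_hard_pool.py",
    "pipeline/scripts/select_hard_candidates.py",
]

for script in STAGES:
    cmd = [sys.executable, script, "--month", MONTH]
    print("running:", " ".join(cmd))
    subprocess.run(cmd, check=True)  # stop at the first failing stage
```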

Current benchmark design direction

The current release reflects the theorem-grounded hard-set construction line:

  • question generation uses theorem-first drafting, with surrounding paper context used only for minimal notation/setup repair,
  • hard negatives are produced with proof-sketch-aware distractor generation,
  • trivial stems are filtered before hard-pool overgeneration,
  • final hard sets are selected by model-tested hardness rather than by quality score alone.

This keeps the benchmark closer to live mathematical reasoning than to plain theorem restatement.
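To make the last point concrete, the sketch below shows one way hardness-based selection could look: keep only items that the tested models answer correctly at most some fraction of the time. The evaluate_item helper, the threshold, and the field handling are assumptions for illustration; the repository's actual criteria live in pipeline/scripts/select_hard_candidates.py.

```python
from typing import Callable

def select_hard_items(
    items: list[dict],
    evaluate_item: Callable[[dict], list[bool]],  # hypothetical: one correctness flag per tested model
    max_accuracy: float = 0.5,
) -> list[dict]:
    """Keep items whose average accuracy across tested models is at most `max_accuracy`.

    This mirrors selection by model-tested hardness rather than by quality
    score alone; the concrete logic is in select_hard_candidates.py.
    """
    hard = []
    for item in items:
        verdicts = evaluate_item(item)
        accuracy = sum(verdicts) / len(verdicts) if verdicts else 0.0
        if accuracy <= max_accuracy:
            hard.append(item)
    return hard
```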

Status

This is a research-code release, not a fully productized benchmark SDK. The code is organized for reproducibility and extension, but some scripts still assume a research workflow and local data layout.

Start with:

  • pipeline/QUICKSTART.md
  • pipeline/docs/hard_pipeline.md

Citation

If you use LiveMathematicianBench in your work, please cite:

@misc{he2026livemathematicianbench,
      title={LiveMathematicianBench: A Live Benchmark for Mathematician-Level Reasoning with Proof Sketches},
      author={Linyang He and Qiyao Yu and Hanze Dong and Baohao Liao and Xinxing Xu and Micah Goldblum and Jiang Bian and Nima Mesgarani},
      year={2026},
      eprint={2604.01754},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.01754},
}
