Have Large Multimodal Models Truly Conquered High School-level Examinations?
Official code release for the paper LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations? (HuggingFace dataset).
Large multimodal models (LMMs) routinely score near-perfect on benchmarks like MATH and AIME, yet they have never been put through a real student's day: a full exam paper, end-to-end, under timing and process-rigor constraints, with figures and questions laid out together on the page. LiveK12Bench asks whether they can.
The benchmark contains 2,000+ verified questions spanning Mathematics, Physics, Chemistry, and Biology, sourced from the latest (2026) authentic Chinese high-school exam papers, distributed in both Chinese and English, and designed to grow over time so it can resist data contamination. Its evaluation goes beyond final-answer accuracy and grades models the way a teacher grades a student:
- Outcome — final-answer accuracy (Pass@1).
- Reasoning process — three error categories (Condition Interpretation, Logical Assumption, Deductive Reasoning) penalised against the reference solution.
- Reasoning efficiency — accuracy reweighted by response length and accuracy under a hard token budget.
- Exam performance — a holistic, weighted Mock Exam score that mirrors human grading.
- evaluate/: solver and grader for the benchmark. Runs over Shawn-wxh/livek12bench (or any local JSON in the same schema) and routes every model call through LiteLLM, so you can plug in any provider behind one configuration knob.
- analyze/: OCR-markdown → structured-JSON parsing pipeline. Use this if you want to add your own exam papers to the benchmark.
- data/: exam-paper inputs. The default paper_dir is data/chinese_k12/bio/. A two-question smoke-test fixture (data/chinese_k12/bio/smoke_test_paper.json and data/ocr_input/smoke_test_paper.md) is committed so you can verify your environment without HuggingFace access; everything else is user-supplied and gitignored.
- predictions/: solver outputs. Each run lands in predictions/<model>/<run-id>.json (gitignored).
- metrics/: aggregated Excel workbooks produced by metric.py (gitignored).
bash setup.sh

…or manually:

pip install -U -r requirements.txt

Python 3.10+ recommended.
The benchmark lives on the HuggingFace Hub:
from datasets import load_dataset
ds = load_dataset("Shawn-wxh/livek12bench")
# DatasetDict with four splits:
#   zh_2603, zh_2605: original Chinese papers
#   en_2603, en_2605: English translations of the same papers

Each row has the following schema:
| field | type | meaning |
|---|---|---|
| id | str | stable id, e.g. "en_2603_math_0001" |
| set | str | release set (e.g. "2603", "2605") |
| subject | str | one of math, physics, chemistry, biology |
| question_type | str | language-dependent: 选择题 / Multiple Choice, etc. |
| point_value | int | exam-board point value |
| question | str | question text with LaTeX |
| answer | list[str] | reference answer(s); multi-select like ["ACD"] |
| solution | str | full reference solution |
| knowledge_points | str | knowledge points (semicolon-separated) |
| images | list[Image] | per-question reference images (PIL objects via HuggingFace datasets) |
The split-naming convention is {lang}_{set} where lang ∈ {zh, en}.
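As a small illustration of the {lang}_{set} convention, the sketch below splits a name into its two parts and derives the matching split in the other language. The helper names parse_split and sibling_split are hypothetical and are not part of the repository.

```python
from typing import NamedTuple


class SplitName(NamedTuple):
    lang: str     # "zh" or "en"
    release: str  # release set, e.g. "2603"


def parse_split(split: str) -> SplitName:
    """Split a name such as "en_2603" into its language and release set."""
    lang, sep, release = split.partition("_")
    if not sep or lang not in {"zh", "en"} or not release:
        raise ValueError(f"not a {{lang}}_{{set}} split name: {split!r}")
    return SplitName(lang, release)


def sibling_split(split: str) -> str:
    """Return the same release set in the other language (hypothetical helper)."""
    lang, release = parse_split(split)
    other = "en" if lang == "zh" else "zh"
    return f"{other}_{release}"
```

For example, sibling_split("zh_2605") gives the name of the English translation of the same papers.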
The evaluation tooling normalises every record to a canonical English schema regardless of source. In particular:
- question_type is mapped to a canonical enum (multiple_choice, fill_in_blank, open_ended, proving, unknown) so graders never branch on the natural-language form.
- images (PIL) are cached to ~/.cache/livek12bench/images/ and the in-memory record exposes their on-disk paths.
LiveK12Bench routes every model call through
LiteLLM, so the same code can
drive OpenAI, Anthropic, Gemini, and self-hosted (vLLM / SGLang / TGI)
models behind a single function evaluate.util.llm.call_llm.
Set the API key for whichever provider you use; LiteLLM auto-detects the provider from the model name.
# OpenAI / Azure-style endpoints
export OPENAI_API_KEY=sk-...
# Optional: route OpenAI-compatible traffic to a proxy
# export OPENAI_BASE_URL=https://your-proxy.example.com/v1
# Anthropic
export ANTHROPIC_API_KEY=sk-ant-...
# Google Gemini
export GEMINI_API_KEY=...

The model name you pass to call_llm(model="...") follows LiteLLM's
naming convention. The 12 models exercised in the paper map to:
| paper name | LiteLLM model id |
|---|---|
| gpt-5 | gpt-5 |
| gpt-5-mini | gpt-5-mini |
| gpt-4o-2024-11-20 | gpt-4o-2024-11-20 |
| claude-opus-4-6 | anthropic/claude-opus-4-6 |
| claude-sonnet-4-6 | anthropic/claude-sonnet-4-6 |
| gemini-3-pro | gemini/gemini-3-pro |
| gemini-3-flash | gemini/gemini-3-flash |
| kimi-k2.5 | moonshot/kimi-k2.5 |
| glm-5 | zhipuai/glm-5 |
| qwen3-vl-235b-a22b-thinking | self-hosted (see below) |
| qwen3-vl-8b | self-hosted (see below) |
| qwen3-vl-32b | self-hosted (see below) |
You can list any model you like — the LiteLLM Supported Models page is the canonical reference.
For models you serve yourself (vLLM, SGLang, TGI, llama.cpp, etc.),
add them to the vllm_models dict in evaluate/constants.py:
vllm_models = {
    "qwen3-vl-8b": "http://your-vllm-host:8000/v1",
    "qwen3-vl-32b": "http://your-vllm-host:8001/v1",
    # ...
}

call_llm looks up the model name in this dict; if it's there, the
endpoint is used and the call is forwarded as OpenAI-compatible.
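A minimal sketch of that lookup, assuming the self-hosted branch addresses the server through LiteLLM's OpenAI-compatible route (an openai/ model prefix plus api_base). The helper build_call_kwargs is hypothetical; the actual call_llm in evaluate/util/llm.py may build its request differently.

```python
from typing import Any

# Same shape as the dict in evaluate/constants.py; the hosts are placeholders.
vllm_models = {
    "qwen3-vl-8b": "http://your-vllm-host:8000/v1",
    "qwen3-vl-32b": "http://your-vllm-host:8001/v1",
}


def build_call_kwargs(model: str, endpoints: dict[str, str] = vllm_models) -> dict[str, Any]:
    """Return keyword arguments for a LiteLLM-style completion call (hypothetical helper)."""
    if model in endpoints:
        # Self-hosted: treat the server as an OpenAI-compatible endpoint.
        return {"model": f"openai/{model}", "api_base": endpoints[model]}
    # Hosted provider: LiteLLM infers the provider from the model name.
    return {"model": model}
```

The resulting dict could then be unpacked into a completion call together with the messages, for example litellm.completion(messages=..., **build_call_kwargs("qwen3-vl-8b")).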
Default prompts are in English (evaluate/prompts/en.py) and
explicitly handle Chinese-content exams (they instruct the grader to
work in the question's native language). A verbatim Chinese version of
the original prompts is archived in evaluate/prompts/zh.py. Switch
between them by editing PROMPT_LANG in evaluate/constants.py:
PROMPT_LANG = "en" # or "zh".
├── analyze/                      Exam paper parsing pipeline (optional)
│   ├── analyze.py                OCR markdown → structured per-question JSON
│   └── configs/
│       └── chinese_k12_exam.py   Extraction schema (legacy Chinese keys)
│
├── evaluate/                     Solver + grader framework
│   ├── constants.py              ⚠️ Project-wide config (paths, model lists)
│   ├── solve.py                  Run a solver across one dataset slice
│   ├── evaluate.py               Grade solutions (parallel verifier voting)
│   ├── evaluate_process.py       Per-step process error classification
│   ├── metric.py                 Aggregate per-model scores into Excel
│   ├── prompts/
│   │   ├── __init__.py           Lang selector (PROMPT_LANG)
│   │   ├── en.py                 Default English prompts
│   │   └── zh.py                 Archived Chinese prompts
│   └── util/
│       ├── llm.py                LiteLLM-backed call_llm entry point
│       ├── dataset_loader.py     Unified HF + local-JSON loader
│       ├── average_metrics.py    Cross-paper averaging into summary sheet
│       └── ...
│
├── data/                         Exam paper inputs
│   ├── chinese_k12/bio/          Default `paper_dir` (smoke fixture committed)
│   └── ocr_input/                Raw OCR markdown drop-zone (smoke fixture committed)
│
├── predictions/                  Solver outputs land here (gitignored)
├── metrics/                      Aggregated Excel reports (gitignored)
├── requirements.txt
└── setup.sh
All four entry points (solve.py, evaluate.py, evaluate_process.py,
metric.py) accept the same source-selection flags:
| flag | source |
|---|---|
| --split | a HuggingFace split (e.g. en_2603, zh_2605) |
| --subject | optional subject filter (math / physics / chemistry / biology) |
| --json | a local JSON file produced by analyze/analyze.py |
| --limit | take only the first N questions after filtering (handy for smoke tests) |
| --ids | comma-separated list of question ids to keep |
| --run-id | override the prediction filename stem (default: <split>__<subject>) |
metric.py additionally accepts a --paper shortcut (run-id stem under
predictions/) so you can aggregate results that were produced with a
custom --run-id.
Predictions land in predictions/<model>/<run-id>.json and the
aggregated metrics workbook lives at metrics/metrics.xlsx
(configurable via evaluate.constants.metrics_path).
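The path convention above can be sketched as a small helper. prediction_path is a hypothetical name, and the behaviour when no subject is given (using the split alone as the stem) is an assumption, since only the <split>__<subject> default is documented.

```python
from pathlib import Path
from typing import Optional


def prediction_path(
    model: str,
    split: str,
    subject: Optional[str] = None,
    run_id: Optional[str] = None,
    root: Path = Path("predictions"),
) -> Path:
    """Build predictions/<model>/<run-id>.json for one run (hypothetical helper)."""
    if run_id is None:
        # Documented default stem is <split>__<subject>; falling back to the
        # bare split when no subject filter is set is an assumption.
        run_id = f"{split}__{subject}" if subject else split
    return root / model / f"{run_id}.json"
```

An explicit run_id always wins, which mirrors how --run-id overrides the default filename stem.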
cd evaluate
# Smoke-test: 10 math questions from the English split
python solve.py \
--split en_2603 --subject math --limit 10 \
--model gpt-5-mini
# Full sweep over one subject, parallel
python solve.py \
--split en_2603 --subject math \
--model gpt-5 \
--max-workers 8
# Run a model on the committed smoke-test fixture (no HuggingFace needed)
python solve.py \
--json ../data/chinese_k12/bio/smoke_test_paper.json \
--model gpt-5-mini

Available solving modes (--mode):

| mode | input |
|---|---|
| e2e | question text + reference images go to the model (default) |
| photo | per-question screenshot only (legacy local-paper directory layout) |
| exam | full-paper screenshots + per-question instructions |
# Single model
python evaluate.py \
--split en_2603 --subject math --limit 10 \
--model gpt-5-mini
# All solvers configured in constants.solvers, in parallel
python evaluate.py --split en_2603 --subject math

Verifiers vote across multiple solver outputs; results are written back
into the prediction JSON under metrics.*.
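One plausible reading of the voting step is a strict majority over per-verifier verdicts, sketched below. The aggregation rule, the tie-breaking (ties count as incorrect), and the metrics key name correct are all assumptions; evaluate.py may aggregate differently.

```python
def majority_verdict(votes: list[bool]) -> bool:
    """Return True when strictly more than half of the verifier votes are True."""
    if not votes:
        raise ValueError("need at least one verifier vote")
    return sum(bool(v) for v in votes) * 2 > len(votes)


def record_verdict(prediction: dict, votes: list[bool]) -> dict:
    """Write the voted verdict back under the record's "metrics" key.

    The key name "correct" is hypothetical, chosen for this illustration.
    """
    prediction.setdefault("metrics", {})["correct"] = majority_verdict(votes)
    return prediction
```

Using an odd number of verifiers avoids ties altogether, which is why the tie rule rarely matters in practice.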
python evaluate_process.py \
--split en_2603 --subject math \
--models gpt-5 claude-opus-4-6

# Per (split, subject) sheet
python metric.py \
--split en_2603 --subject math \
--models gpt-5 claude-opus-4-6 gemini-3-pro
# Subset mode: aggregate metrics over questions belonging to one
# challenging subset. Each question carries a `subset` field of type
# list[str] (e.g. ["complex_layout", "long_reason"]); pass the subset
# name you want to slice on:
python metric.py --subset --subset-field complex_layout \
--papers paper_a paper_b paper_c
python metric.py --subset --subset-field rigorous_process --papers ...
python metric.py --subset --subset-field long_reason --papers ...

A question with subset = ["complex_layout", "rigorous_process"] will
be counted in both the complex_layout and rigorous_process slices.
For backward compatibility, a top-level boolean field named after the
subset is also accepted.
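The membership rule described above, including the backward-compatible boolean field, might look like the following sketch. The function name in_subset is hypothetical and not taken from metric.py.

```python
def in_subset(question: dict, name: str) -> bool:
    """Decide whether a question belongs to the named challenging subset."""
    subset = question.get("subset")
    if isinstance(subset, list) and name in subset:
        return True
    # Legacy layout: a top-level boolean field named after the subset.
    return question.get(name) is True


questions = [
    {"id": "q1", "subset": ["complex_layout", "rigorous_process"]},
    {"id": "q2", "long_reason": True},  # legacy boolean form
    {"id": "q3", "subset": []},
]
complex_layout_ids = [q["id"] for q in questions if in_subset(q, "complex_layout")]
```

Here q1 lands in both the complex_layout and rigorous_process slices, while q2 is picked up only through its legacy flag.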
To produce a cross-paper summary sheet on top of the workbook:
python util/average_metrics.py --xlsx ../metrics/your_run.xlsx

All metrics in the paper are computed by evaluate.py / evaluate_process.py
and aggregated into the workbook by metric.py.
| metric | meaning |
|---|---|
| ACC | Pass@1 final-answer accuracy. Proportion of questions whose extracted answer matches the ground truth. |
| ARL | Accuracy Reweighted by Length. Acc reweighted by a log-ratio of the average response length to the model's actual length — rewards concise correct solutions. |
| Acc≤r | Accuracy when the total generation budget (including thinking tokens) is hard-capped at ratio r of the context window — simulates a time/length constraint. The default knob in this repo is accuracy_within_16k_tokens. |
| OCS | Outcome Exam Score. Per-paper score derived purely from final-answer correctness, distributed across correctly answered (sub-)parts. |
| PES | Process Exam Score. Per-paper score that penalises three reasoning-process error types: Condition Interpretation Error (CIE), Logical Assumption Error (LAE), Deductive Reasoning Error (DRE). |
| OES | Overall Exam Score. Weighted combination of OCS and PES (weight w_p) normalised to a 100-point scale — the headline "Mock Exam" number. |
The per-question score ES decomposes into outcome (ES_O) and process
(ES_P) components combined by the same w_p. See the paper for the
full definitions.
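To make the two simplest metrics concrete, here is a rough sketch of ACC and of accuracy under a hard token budget. It assumes per-question correctness flags and token counts are already available, and it takes "16k" to mean 16,000 tokens; both the function names and that budget value are assumptions, and the paper remains the reference for exact definitions.

```python
def pass_at_1(correct: list[bool]) -> float:
    """ACC: fraction of questions whose extracted answer matches the ground truth."""
    return sum(correct) / len(correct) if correct else 0.0


def accuracy_within_budget(correct: list[bool], tokens: list[int], budget: int = 16_000) -> float:
    """Budgeted accuracy: an answer counts only if it is correct AND fits the budget.

    `tokens` is the total generation length per question (thinking included).
    The 16,000 default is an assumed reading of "16k".
    """
    if len(correct) != len(tokens):
        raise ValueError("correct and tokens must have the same length")
    hits = sum(c and t <= budget for c, t in zip(correct, tokens))
    return hits / len(correct) if correct else 0.0
```

A correct answer that overruns the budget is scored as a miss, so the budgeted figure can never exceed plain ACC.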
If you have OCR output (from MinerU or similar) and want to parse new
papers into the same schema, use analyze/analyze.py:
python analyze/analyze.py \
--ocr-dir path/to/ocr_results \
--save-dir analyze/analyzed_json/my_papers \
--model gpt-5

The script asks the LLM to extract per-question fields defined in
analyze/configs/chinese_k12_exam.py and writes one JSON per paper. The output
uses the legacy Chinese field names (题型, 分值, 题目, 答案,
解答, 图像); evaluate/util/dataset_loader.py accepts both that
schema and the new English schema transparently.
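A sketch of how the legacy keys could be renamed to the English schema. The pairing below follows the literal meaning of each Chinese key and the order of the schema table, but it is an assumption; dataset_loader.py may handle more fields or do extra normalisation.

```python
# Assumed pairing of legacy Chinese keys with English schema fields.
LEGACY_TO_ENGLISH = {
    "题型": "question_type",
    "分值": "point_value",
    "题目": "question",
    "答案": "answer",
    "解答": "solution",
    "图像": "images",
}


def to_english_schema(record: dict) -> dict:
    """Rename legacy keys; keys already in English pass through unchanged."""
    return {LEGACY_TO_ENGLISH.get(key, key): value for key, value in record.items()}


legacy = {"题型": "选择题", "分值": 5, "题目": "1 + 1 = ?", "答案": ["B"]}
english = to_english_schema(legacy)
```

Because unknown keys pass through untouched, the same function is safe to call on records that are already in the English schema.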
If LiveK12Bench is useful to you, please consider citing it:
@misc{livek12bench2026,
title = {LiveK12Bench},
author = {Wang, Xiaohan and Yin, Mingze and Zhao, Yilin and Sinbadliu and Li, Dian},
year = {2026},
url = {https://github.com/QQ-MM/LiveK12Bench}
}

(GitHub's "Cite this repository" sidebar button is also available; it
reads from CITATION.cff.)
The code in this repository is released under the Apache License 2.0.
The accompanying dataset on HuggingFace
(Shawn-wxh/livek12bench)
is released under CC BY-NC 4.0 — see the dataset card for details.