This is the official repository for the paper MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring. Our work introduces MMTutorBench, a novel benchmark designed to evaluate the tutoring capabilities of multimodal large language models (MLLMs) in the context of mathematical problem-solving.
Unlike existing benchmarks that focus narrowly on handwritten expression recognition or final-answer problem solving, MMTutorBench targets the handwritten math problem solving process, requiring models to produce structured tutoring responses across three pedagogical dimensions: Insight, Formulation, and Execution.
To set up the environment, please follow the steps below:
git clone https://github.com/TangciuYueng/MMTutorBench.git
cd MMTutorBench
conda env create -f environment.yml -n mmtutorbench
conda activate mmtutorbenchThe MMTutorBench dataset consists of 770 carefully curated multimodal math tutoring problems with 1,414 images. Each problem includes a question, visual context, a detailed reference solution decomposed into Insight / Operation Formulation / Operation Execution, and a unique rubric for evaluation.
The benchmark is built through a multi-stage pipeline that combines real-world problem collection, three-axis tutoring task design, and rubric-based evaluation metric construction.
- Problem Collection: Video Selection → Key-step Identification → Context Reconstruction.
- Tutoring Task Design: three sub-tasks — Insight Discovery, Operation Formulation, and Operation Execution.
-
Reference Answer Curation & Rubric Construction: per-task reference answers (
$R_{insight}$ ,$R_{form}$ ,$R_{exec}$ ) paired with rubrics that drive the evaluation metric.
The pipeline has two stages: generate.py queries an OpenAI-compatible
endpoint to produce tutoring responses for each instance, and evaluate.py
scores those responses against the inlined rubric using an LLM-as-judge.
python scripts/validate_dataset.py # asserts 770 instances, all images presentBoth generate.py and evaluate.py talk to any OpenAI-compatible endpoint via
the openai SDK; point them with OPENAI_API_KEY (and optionally
OPENAI_BASE_URL):
# OpenAI
export OPENAI_API_KEY=sk-...
# Other OpenAI-compatible endpoints
export OPENAI_BASE_URL=http://ip:port/v1
export OPENAI_API_KEY=sk-...
# Local vLLM serve
export OPENAI_BASE_URL=http://localhost:port/v1
export OPENAI_API_KEY=anythingcd src
bash generate.sh # writes ../response/<model>/<uploader>/<task>.json
bash evaluate.sh # writes ../eval/<eval_model>/<gen_model>/<uploader>/<task>.jsonEdit the MODEL= line and the UPLOADERS=() array at the top of
generate.sh to control which model and which subset to run. Pass -u,
-g, -t to evaluate.sh to filter uploaders, generation models, or
tasks.
For thinking-style models that benefit from offline batching, pass
--use-batch to evaluate.py; this routes through VLLMBatchModel (Ray +
vLLM, no HTTP). Requires vllm and ray[data] installed.


