MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring

This is the official repository for the paper MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring. Our work introduces MMTutorBench, a novel benchmark designed to evaluate the tutoring capabilities of multimodal large language models (MLLMs) in the context of mathematical problem-solving.

Unlike existing benchmarks that focus narrowly on handwritten expression recognition or final-answer problem solving, MMTutorBench targets the handwritten math problem-solving process itself, requiring models to produce structured tutoring responses across three pedagogical dimensions: Insight, Formulation, and Execution.

⚙️ Environment Setup

To set up the environment, please follow the steps below:

git clone https://github.com/TangciuYueng/MMTutorBench.git
cd MMTutorBench
conda env create -f environment.yml -n mmtutorbench
conda activate mmtutorbench

🗃️ Dataset

The MMTutorBench dataset consists of 770 carefully curated multimodal math tutoring problems with 1,414 images. Each problem includes a question, visual context, a detailed reference solution decomposed into Insight / Operation Formulation / Operation Execution, and a unique rubric for evaluation.
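To make the per-problem structure described above concrete, here is a minimal sketch of how one instance could be modeled in Python. The class and field names (`TutoringInstance`, `ReferenceSolution`, `image_paths`, and so on) are hypothetical and chosen for illustration; the actual on-disk schema in this repository may differ, and representing the rubric as a list of criterion strings is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceSolution:
    """Reference solution split along the three tutoring dimensions."""
    insight: str
    formulation: str
    execution: str


@dataclass
class TutoringInstance:
    """One benchmark problem (hypothetical field names, not the real schema)."""
    question: str
    image_paths: list[str]          # visual context; a problem may have several images
    reference: ReferenceSolution
    rubric: list[str] = field(default_factory=list)  # assumed: one string per criterion


example = TutoringInstance(
    question="What should the student do next?",
    image_paths=["step_03.png"],
    reference=ReferenceSolution(
        insight="Recognize the expression as a difference of squares.",
        formulation="Rewrite x^2 - 9 as (x - 3)(x + 3).",
        execution="Set each factor to zero and solve for x.",
    ),
    rubric=["Identifies the difference of squares"],
)
```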

🛠️ Construction Pipeline

The benchmark is built through a multi-stage pipeline that combines real-world problem collection, three-axis tutoring task design, and rubric-based evaluation metric construction.

Problem Collection: Video Selection → Key-step Identification → Context Reconstruction.
Tutoring Task Design: three sub-tasks — Insight Discovery, Operation Formulation, and Operation Execution.
Reference Answer Curation & Rubric Construction: per-task reference answers ($R_{insight}$, $R_{form}$, $R_{exec}$) paired with rubrics that drive the evaluation metric.

🧪 Evaluation

The pipeline has two stages: generate.py queries an OpenAI-compatible endpoint to produce tutoring responses for each instance, and evaluate.py scores those responses against the inlined rubric using an LLM-as-judge.

1. Verify the dataset

python scripts/validate_dataset.py  # asserts 770 instances, all images present
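For readers curious what a check of this kind involves, the following is a minimal sketch and not the repository's actual script. It assumes a hypothetical layout in which a single JSON index file holds a list of instances, each with an `images` list of paths relative to an image root; the function name `validate_dataset` and its parameters are likewise illustrative.

```python
import json
from pathlib import Path

EXPECTED_INSTANCES = 770  # instance count stated in the Dataset section


def validate_dataset(index_file, image_root, expected=EXPECTED_INSTANCES):
    """Return a list of problems found; an empty list means the check passed.

    Assumed (hypothetical) layout: index_file is a JSON list of dicts,
    each with an "images" key listing paths relative to image_root.
    """
    instances = json.loads(Path(index_file).read_text(encoding="utf-8"))
    problems = []
    if len(instances) != expected:
        problems.append(f"expected {expected} instances, found {len(instances)}")
    for i, inst in enumerate(instances):
        for rel in inst.get("images", []):
            if not (Path(image_root) / rel).is_file():
                problems.append(f"instance {i}: missing image {rel}")
    return problems
```

Returning a list of problems rather than raising on the first failure lets a single run report every missing image at once.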

2. Configure the endpoint

Both generate.py and evaluate.py talk to any OpenAI-compatible endpoint via the openai SDK. Configure them through the OPENAI_API_KEY environment variable and, optionally, OPENAI_BASE_URL:

# OpenAI
export OPENAI_API_KEY=sk-...
# Other OpenAI-compatible endpoints
export OPENAI_BASE_URL=http://ip:port/v1
export OPENAI_API_KEY=sk-...
# Local vLLM serve
export OPENAI_BASE_URL=http://localhost:port/v1
export OPENAI_API_KEY=anything
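Before launching a long run, it can help to confirm which endpoint these variables resolve to. The sketch below mirrors that lookup with the standard library only; `resolve_endpoint` is a hypothetical helper, not part of this repository or of the openai SDK, and the fallback URL is the public OpenAI API base.

```python
import os

DEFAULT_BASE_URL = "https://api.openai.com/v1"  # used when OPENAI_BASE_URL is unset


def resolve_endpoint(env=None):
    """Return the API key and base URL implied by the environment.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    base_url = env.get("OPENAI_BASE_URL") or DEFAULT_BASE_URL
    # Strip trailing slashes so ".../v1" and ".../v1/" compare equal.
    return {"api_key": api_key, "base_url": base_url.rstrip("/")}
```

For a local vLLM server, for example, calling `resolve_endpoint()` after the exports above should report the `http://localhost:port/v1` address rather than the default.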

3. Run

cd src
bash generate.sh    # writes ../response/<model>/<uploader>/<task>.json
bash evaluate.sh    # writes ../eval/<eval_model>/<gen_model>/<uploader>/<task>.json

Edit the MODEL= line and the UPLOADERS=() array at the top of generate.sh to control which model and which subset to run. Pass -u, -g, -t to evaluate.sh to filter uploaders, generation models, or tasks.
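The output locations noted in the comments above follow a fixed directory pattern. As a rough illustration, the sketch below builds those paths from their components; the helper names `response_path` and `eval_path` are hypothetical, and how the scripts themselves handle unusual model names (for instance, names containing a slash) is not covered here.

```python
from pathlib import Path


def response_path(model, uploader, task, root="../response"):
    """Path where a generation result is expected: <root>/<model>/<uploader>/<task>.json"""
    return Path(root) / model / uploader / f"{task}.json"


def eval_path(eval_model, gen_model, uploader, task, root="../eval"):
    """Path where a judged result is expected:
    <root>/<eval_model>/<gen_model>/<uploader>/<task>.json"""
    return Path(root) / eval_model / gen_model / uploader / f"{task}.json"


# Placeholder names for illustration only.
print(response_path("my-model", "uploader_a", "insight").as_posix())
print(eval_path("judge-model", "my-model", "uploader_a", "insight").as_posix())
```

A helper like this is handy for checking which model, uploader, and task combinations already have results before re-running a subset.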

Offline batch mode (vLLM + Ray)

For thinking-style models that benefit from offline batching, pass --use-batch to evaluate.py; this routes requests through VLLMBatchModel (Ray + vLLM, no HTTP). This mode requires vllm and ray[data] to be installed.

