Skip to content

TangciuYueng/MMTutorBench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring

arXiv HuggingFace Dataset

This is the official repository for the paper MMTutorBench: The First Multimodal Benchmark for AI Math Tutoring. Our work introduces MMTutorBench, a novel benchmark designed to evaluate the tutoring capabilities of multimodal large language models (MLLMs) in the context of mathematical problem-solving.

Overview

Unlike existing benchmarks that focus narrowly on handwritten expression recognition or final-answer problem solving, MMTutorBench targets the handwritten math problem solving process, requiring models to produce structured tutoring responses across three pedagogical dimensions: Insight, Formulation, and Execution.

⚙️ Environment Setup

To set up the environment, please follow the steps below:

git clone https://github.com/TangciuYueng/MMTutorBench.git
cd MMTutorBench
conda env create -f environment.yml -n mmtutorbench
conda activate mmtutorbench

🗃️ Dataset

The MMTutorBench dataset consists of 770 carefully curated multimodal math tutoring problems with 1,414 images. Each problem includes a question, visual context, a detailed reference solution decomposed into Insight / Operation Formulation / Operation Execution, and a unique rubric for evaluation.

stats

🛠️ Construction Pipeline

The benchmark is built through a multi-stage pipeline that combines real-world problem collection, three-axis tutoring task design, and rubric-based evaluation metric construction.

pipeline

  • Problem Collection: Video Selection → Key-step Identification → Context Reconstruction.
  • Tutoring Task Design: three sub-tasks — Insight Discovery, Operation Formulation, and Operation Execution.
  • Reference Answer Curation & Rubric Construction: per-task reference answers ($R_{insight}$, $R_{form}$, $R_{exec}$) paired with rubrics that drive the evaluation metric.

🧪 Evaluation

The pipeline has two stages: generate.py queries an OpenAI-compatible endpoint to produce tutoring responses for each instance, and evaluate.py scores those responses against the inlined rubric using an LLM-as-judge.

1. Verify the dataset

python scripts/validate_dataset.py  # asserts 770 instances, all images present

2. Configure the endpoint

Both generate.py and evaluate.py talk to any OpenAI-compatible endpoint via the openai SDK; point them with OPENAI_API_KEY (and optionally OPENAI_BASE_URL):

# OpenAI
export OPENAI_API_KEY=sk-...
# Other OpenAI-compatible endpoints
export OPENAI_BASE_URL=http://ip:port/v1
export OPENAI_API_KEY=sk-...
# Local vLLM serve
export OPENAI_BASE_URL=http://localhost:port/v1
export OPENAI_API_KEY=anything

3. Run

cd src
bash generate.sh    # writes ../response/<model>/<uploader>/<task>.json
bash evaluate.sh    # writes ../eval/<eval_model>/<gen_model>/<uploader>/<task>.json

Edit the MODEL= line and the UPLOADERS=() array at the top of generate.sh to control which model and which subset to run. Pass -u, -g, -t to evaluate.sh to filter uploaders, generation models, or tasks.

Offline batch mode (vLLM + Ray)

For thinking-style models that benefit from offline batching, pass --use-batch to evaluate.py; this routes through VLLMBatchModel (Ray + vLLM, no HTTP). Requires vllm and ray[data] installed.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors