Skip to content

modify mmmu#265

Merged
SJTUyh merged 2 commits into
AISBench:masterfrom
wenba0:mmmu_evalscope
May 11, 2026
Merged

modify mmmu#265
SJTUyh merged 2 commits into
AISBench:masterfrom
wenba0:mmmu_evalscope

Conversation

@wenba0

@wenba0 wenba0 commented Apr 27, 2026

Copy link
Copy Markdown
Collaborator

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
感谢您的贡献,我们非常重视。以下说明将使您的拉取请求更健康,更易于获得反馈。如果您不理解某些项目,请不要担心,只需提交拉取请求并从维护人员那里寻求帮助即可。

PR Type / PR类型

  • Feature(功能新增)
  • Bugfix(Bug 修复)
  • Docs(文档更新)
  • CI/CD(持续集成/持续部署)
  • Refactor(代码重构)
  • Perf(性能优化)
  • Dependency(依赖项更新)
  • Test-Cases(测试用例更新)
  • Other(其他)

Related Issue | 关联 Issue
Fixes #224 #30

🔍 Motivation / 变更动机

mmmu数据集的测评之前是对齐vlmevalkit,多个issue发现该测评方法得分过低,修改对齐evalscope

📝 Modification / 修改内容

mmmu数据集修改对齐evalscope

📐 Associated Test Results / 关联测试结果

Please provide links to the related test results, such as CI pipelines, test reports, etc.
请提供相关测试结果的链接,例如 CI 管道、测试报告等。
Qwen2.5-VL-7B-Instuct测试结果如下,相同配置下对齐EvalScope,比官方结果低 可能是因为参数配置、vllm-ascend等原因
官方
image

AISBench
image
evalscope命令
evalscope eval --model qwen25vl --api-url http://localhost:8077/v1 --datasets mmmu --eval-batch-size 16 --dataset-args '{"mmmu":{"local_path":"/mnt/share/ljj/data/mmmu"}}' --generation-config '{"top_p":1, "top_k":-1, "temperature":0,"repetition_penalty":1.0,"max_tokens":5000}'
结果
image

⚠️ BC-breaking (Optional) / 向后不兼容变更(可选)

Does the modification introduce changes that break the backward compatibility of the downstream repositories? If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
是否引入了会破坏下游存储库向后兼容性的更改?如果是,请描述它如何破坏兼容性,以及下游项目应该如何修改其代码以保持与此 PR 的兼容性。

⚠️ Performance degradation (Optional) / 性能下降(可选)

If the modification introduces performance degradation, please describe the impact of the performance degradation and the expected performance improvement.
如果引入了性能下降,请描述性能下降的影响和预期的性能改进。

🌟 Use cases (Optional) / 使用案例(可选)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.
如果此拉取请求引入了新功能,最好在此处列出一些用例并更新文档。

✅ Checklist / 检查列表

Before PR:

  • Pre-commit or other linting tools are used to fix the potential lint issues. / 使用预提交或其他 linting 工具来修复潜在的 lint 问题。
  • Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests. / 修复的 Bug 已完全由单元测试覆盖,导致 Bug 的情况应在单元测试中添加。
  • The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness. / 此拉取请求中的修改已完全由单元测试覆盖。如果不是,请添加更多单元测试以确保正确性。
  • All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes. / 所有相关文档(API 文档、文档字符串、示例教程)已更新以反映这些更改。

After PR:

  • If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects. / 如果此拉取请求对下游或其他相关项目有潜在影响,应在那些项目中测试此 PR。
  • CLA has been signed and all committers have signed the CLA in this PR. / CLA 已签署,且本 PR 中的所有提交者均已签署 CLA。

👥 Collaboration Info / 协作信息

  • Suggested Reviewers / 建议审核人: @xxx
  • Relevant Module Owners / 相关模块负责人: @xxx
  • Other Collaboration Notes / 其他协作说明:

🌟 Useful CI Command / 实用的CI命令

Command / 命令 Introduction / 介绍
/gemini review Performs a code review for the current pull request in its current state by Gemini. / 对当前拉取请求在当前状态下由 Gemini 执行代码审核。
/gemini summary Provides a summary of the current pull request in its current state by Gemini. / 对当前拉取请求在当前状态下由 Gemini 提供摘要。
/gemini help Displays a list of available commands of Gemini. / 显示 Gemini 可用命令的列表。
/readthedocs build Triggers a build of the documentation for the current pull request in its current state by Read the Docs. / 触发当前拉取请求在当前状态下由 Read the Docs 构建文档。

@wenba0 wenba0 had a problem deploying to smoke-test-approval April 27, 2026 09:22 — with GitHub Actions Error

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the MMMU dataset implementation to align with evalscope, transitioning from TSV files to local parquet files sourced from ModelScope. Key changes include updated documentation, consolidated task configurations, and enhanced logic for prompt templating, image processing, and evaluation of both multiple-choice and open-ended questions. Review feedback focuses on optimizing DataFrame processing for better performance, refining exception handling to be more specific, and fixing a validation bug in the multiple-choice prediction parser to ensure fallback characters are checked against allowed options.

parsed = ast.literal_eval(stripped)
if isinstance(parsed, (list, tuple)):
return list(parsed)
except Exception:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Catching a broad Exception can mask unrelated errors and makes debugging harder. Since ast.literal_eval is the primary operation that might fail here, it is better to catch the specific exceptions it raises.

Suggested change
except Exception:
except (ValueError, SyntaxError):

subject = _infer_subject_from_parquet_path(parquet_file)
if subject and 'subject' not in data.columns:
data['subject'] = subject
records.extend(row.to_dict() for _, row in data.iterrows())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using iterrows() to convert a DataFrame to a list of dictionaries is inefficient as it creates a Series object for every row. pd.DataFrame.to_dict('records') is much faster and more idiomatic.

Suggested change
records.extend(row.to_dict() for _, row in data.iterrows())
records.extend(data.to_dict('records'))

Comment on lines +434 to +453
def _parse_mmmu_choice_prediction(prediction, num_choices):
match = re.search(
r'(?i)^ANSWER\s*:\s*([A-Za-z\d ,]+)\s*(?:$|\n|\.)',
prediction,
flags=re.MULTILINE,
)
if match is None:
match = re.search(
r'(?i)ANSWER\s*:\s*([A-Za-z\d ,]+)(?:[^\w]|\n|$|\.)',
prediction,
)
if match is None:
for letter in reversed(prediction):
if letter.isupper():
return letter
return ''

matched = match.group(1).strip().rstrip('.')
allowed_options = {_answer_character(index) for index in range(num_choices)}
return matched if matched in allowed_options else ''

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The fallback heuristic in _parse_mmmu_choice_prediction (lines 446-448) returns the first uppercase letter found from the end of the string without verifying if it is a valid option. This can lead to incorrect parsing (e.g., returning 'I' from 'I don't know'). The logic should be updated to check against allowed_options.

Suggested change
def _parse_mmmu_choice_prediction(prediction, num_choices):
match = re.search(
r'(?i)^ANSWER\s*:\s*([A-Za-z\d ,]+)\s*(?:$|\n|\.)',
prediction,
flags=re.MULTILINE,
)
if match is None:
match = re.search(
r'(?i)ANSWER\s*:\s*([A-Za-z\d ,]+)(?:[^\w]|\n|$|\.)',
prediction,
)
if match is None:
for letter in reversed(prediction):
if letter.isupper():
return letter
return ''
matched = match.group(1).strip().rstrip('.')
allowed_options = {_answer_character(index) for index in range(num_choices)}
return matched if matched in allowed_options else ''
def _parse_mmmu_choice_prediction(prediction, num_choices):
allowed_options = {_answer_character(index) for index in range(num_choices)}
match = re.search(
r'(?i)^ANSWER\s*:\s*([A-Za-z\d ,]+)\s*(?:$|\n|\.)',
prediction,
flags=re.MULTILINE,
)
if match is None:
match = re.search(
r'(?i)ANSWER\s*:\s*([A-Za-z\d ,]+)(?:[^\w]|\n|$|\.)',
prediction,
)
if match is None:
for letter in reversed(prediction):
if letter.isupper() and letter in allowed_options:
return letter
return ''
matched = match.group(1).strip().rstrip('.')
return matched if matched in allowed_options else ''

@github-actions github-actions Bot added the bugfix label May 7, 2026
# Conflicts:
#	ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen_cot.py
@wenba0 wenba0 temporarily deployed to smoke-test-approval May 11, 2026 02:37 — with GitHub Actions Inactive
MMMU_MULTI_CHOICE_TYPE = 'multiple-choice'
MMMU_OPEN_TYPE = 'open'

def dump_image(line, image_root_path):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[review] the dump_image function is unused?

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates the MMMU dataset integration to align AISBench’s evaluation behavior with evalscope, addressing previously reported abnormally low MMMU scores. It rewrites MMMU data loading to use ModelScope’s parquet-based dataset layout and updates prompts/scoring to support both multiple-choice and open question types.

Changes:

  • Reworked MMMUDataset.load to discover/load local parquet shards, normalize options, extract/dump images, and build multimodal content prompts with <AIS_*> tags.
  • Updated MMMUEvaluator.score to score MCQ vs open questions differently and parse ANSWER:-style outputs.
  • Updated MMMU dataset docs/configs to use the ModelScope dataset and removed the obsolete mmmu_gen_cot config.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
ais_bench/benchmark/datasets/mmmu.py Parquet-based MMMU loading + new prompt construction and scoring logic for MCQ/open types.
ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen.py Switch MMMU config to parquet root path + new prompt templates and reader columns.
ais_bench/benchmark/configs/datasets/mmmu/README.md Update Chinese MMMU deployment instructions to ModelScope parquet dataset.
ais_bench/benchmark/configs/datasets/mmmu/README_en.md Update English MMMU deployment instructions to ModelScope parquet dataset.
ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen_cot.py Removed legacy CoT config (no longer documented/used).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +445 to +449
if match is None:
for letter in reversed(prediction):
if letter.isupper():
return letter
return ''
return letter
return ''

matched = match.group(1).strip().rstrip('.')
Comment on lines +457 to +459
match = re.search(r'ANSWER:\s*(.*)', prediction)
if match:
return match.group(1).strip()
subject = _infer_subject_from_parquet_path(parquet_file)
if subject and 'subject' not in data.columns:
data['subject'] = subject
records.extend(row.to_dict() for _, row in data.iterrows())
open_prompt=None,
start_text_prompt='',
end_text_prompt='',
option_prompt=''):
import string
import ast
from pathlib import Path
from os import environ
Comment on lines +679 to +686
if refer.get('type') == MMMU_MULTI_CHOICE_TYPE:
choices = json.loads(refer['choices']) if isinstance(refer.get('choices'), str) else refer.get('choices', {})
parsed_pred = _parse_mmmu_choice_prediction(pred, len(choices))
score = 1 if parsed_pred == str(refer.get('answer', '')).strip() else 0
else:
parsed_pred = _extract_mmmu_open_prediction(pred)
score = 1 if parsed_pred.strip().lower() == str(refer.get('answer', '')).strip().lower() else 0

Comment on lines +54 to 59
split='validation',
mult_choice_prompt=MULT_CHOICE_PROMPT,
open_prompt=OPEN_PROMPT,
reader_cfg=mmmu_reader_cfg,
infer_cfg=mmmu_infer_cfg,
eval_cfg=mmmu_eval_cfg
@SJTUyh SJTUyh merged commit c15f3f3 into AISBench:master May 11, 2026
17 checks passed
return records, resolved_path, True


def _resolve_mmmu_existing_image_path(image_path, data_root, image_root_path):

@Libotry Libotry May 28, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[review] issue: _resolve_mmmu_existing_image_path 会直接接受数据记录里给出的绝对路径,只要本机该路径存在就返回;如果数据源不完全可信,这会让数据集记录越过数据目录边界,引用任意本地文件并传给后续多模态推理流程,存在本地文件暴露风险。
suggestion: 对解析后的路径做规范化并限制在数据根目录或受控缓存目录下;对于绝对路径默认拒绝,除非显式开启白名单配置。

return output_path


def _build_mmmu_image_path(record, image_index, image_root_path, suffix='.png'):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[review]issue: _build_mmmu_image_path 生成文件名时只使用 id/index 和 image_index,没有把 split、subject、源 parquet 文件名或内容哈希纳入键;不同子集或重复样本只要 id 相同,就会发生图片覆盖,导致后续样本读到错误图片。
suggestion: 把 split、subject 或 parquet stem 纳入文件名,或使用稳定哈希作为输出名,确保跨子集和多次加载时不会互相覆盖。

prediction,
)
if match is None:
for letter in reversed(prediction):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[review]issue: _parse_mmmu_choice_prediction 在找不到 ANSWER 标记时,会退化为“返回预测文本里最后一个大写字母”;这个启发式过于宽松,像缩写、专有名词、单位符号里的大写字母都可能被误判成选项,造成偶发性误命中。
suggestion: 退化路径也应限制在合法选项集合内,并结合明确边界匹配;如果提取不出可靠答案,应返回空值而不是猜测最后一个大写字母。

base_dir = resolved_path if os.path.isdir(resolved_path) else os.path.dirname(resolved_path)
else:
base_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '../../..', 'datasets'))
image_root_path = os.path.join(base_dir, 'MMMU_images')

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[review]issue: 数据加载过程中把解码后的图片直接写到数据目录下的 MMMU_images,这让一个“读数据”的动作带上了持久化副作用;在只读挂载、共享数据集目录或并发加载场景下,容易导致加载失败、目录污染和缓存冲突。
suggestion: 把图片缓存写到 work_dir 或专门的 cache_dir,并允许通过配置指定缓存路径;数据加载函数应尽量保持只读语义。

logger.warning(f'Skipping MMMU record without question at index {index}')
continue

question_type = record.get('question_type') or (

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[review] issue: question_type的判定没有做规范化处理,后面 503 行又依赖与字符串multiple-choice的精确相等;如果上游数据写成Multiple-Choice、multiple_choice等变体,就会被误判到open分支,直接走错提示词和评测逻辑。
suggestion: 在分支判断前先统一做小写、去空格/下划线等标准化,再映射到内部枚举值。


if refer.get('type') == MMMU_MULTI_CHOICE_TYPE:
choices = json.loads(refer['choices']) if isinstance(refer.get('choices'), str) else refer.get('choices', {})
parsed_pred = _parse_mmmu_choice_prediction(pred, len(choices))

@Libotry Libotry May 28, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[review] issue: MMMUEvaluator的多选题判分只调用_parse_mmmu_choice_prediction,没有复用本文件里已经提供的can_infer / can_infer_option / can_infer_text;这会让MMMU与mmmu_pro、mmstar、mathvision这几个复用can_infer的数据集在同类输出上的判分标准不一致,出现“别的集能识别,这里判错”的行为漂移。
suggestion: 把多选答案提取统一收敛到一个共享函数,确保MMMU系列数据集使用同一套解析和容错规则。

score = 1 if parsed_pred == str(refer.get('answer', '')).strip() else 0
else:
parsed_pred = _extract_mmmu_open_prediction(pred)
score = 1 if parsed_pred.strip().lower() == str(refer.get('answer', '')).strip().lower() else 0

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[review]issue: 开放题判分采用parsed_pred.strip().lower()与标准答案做严格字符串相等,缺少数值归一化、标点归一化和常见同义格式兼容;对于开放式答案,这会带来较高的假阴性,特别是数字、单位、分数和简短短语的等价表达。
suggestion: 针对open类型增加规范化步骤,例如去除冗余标点、统一空白、处理数值等价格式;如果数据集官方允许,更应引入专门的答案归一化或judge逻辑。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug] mmmu精度测评问题

5 participants