modify mmmu by wenba0 · Pull Request #265 · AISBench/benchmark

wenba0 · 2026-04-27T09:22:03Z

Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.

PR Type

Related Issue
Fixes #224 #30

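🔍 Motivation

MMMU evaluation was previously aligned with VLMEvalKit. Multiple issues reported that this method produced scores that were too low, so the evaluation is now aligned with EvalScope.

📝 Modification

Align the MMMU dataset implementation with EvalScope.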

📐 Associated Test Results

Please provide links to the related test results, such as CI pipelines, test reports, etc.
Test results for Qwen2.5-VL-7B-Instruct are shown below. Under the same configuration, AISBench aligns with EvalScope; the scores are lower than the official results, possibly because of parameter settings, vllm-ascend, or other factors.
Official

AISBench

EvalScope command
evalscope eval --model qwen25vl --api-url http://localhost:8077/v1 --datasets mmmu --eval-batch-size 16 --dataset-args '{"mmmu":{"local_path":"/mnt/share/ljj/data/mmmu"}}' --generation-config '{"top_p":1, "top_k":-1, "temperature":0,"repetition_penalty":1.0,"max_tokens":5000}'
Results

⚠️ BC-breaking (Optional)

Does the modification introduce changes that break the backward compatibility of the downstream repositories? If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

⚠️ Performance degradation (Optional)

If the modification introduces performance degradation, please describe the impact of the performance degradation and the expected performance improvement.

🌟 Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here and update the documentation.

✅ Checklist

Before PR:

Pre-commit or other linting tools are used to fix the potential lint issues.
Bug fixes are fully covered by unit tests, the case that causes the bug should be added in the unit tests.
The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
All relevant documentation (API docs, docstrings, example tutorials) has been updated to reflect these changes.

After PR:

If the modification has potential influence on downstream or other related projects, this PR should be tested with those projects.
CLA has been signed and all committers have signed the CLA in this PR.

👥 Collaboration Info

Suggested Reviewers: @xxx
Relevant Module Owners: @xxx
Other Collaboration Notes:

🌟 Useful CI Command

Command	Introduction
`/gemini review`	Performs a code review for the current pull request in its current state by Gemini.
`/gemini summary`	Provides a summary of the current pull request in its current state by Gemini.
`/gemini help`	Displays a list of available commands of Gemini.
`/readthedocs build`	Triggers a build of the documentation for the current pull request in its current state by Read the Docs.

gemini-code-assist

Code Review

This pull request refactors the MMMU dataset implementation to align with evalscope, transitioning from TSV files to local parquet files sourced from ModelScope. Key changes include updated documentation, consolidated task configurations, and enhanced logic for prompt templating, image processing, and evaluation of both multiple-choice and open-ended questions. Review feedback focuses on optimizing DataFrame processing for better performance, refining exception handling to be more specific, and fixing a validation bug in the multiple-choice prediction parser to ensure fallback characters are checked against allowed options.

gemini-code-assist · 2026-04-27T09:23:44Z

+            parsed = ast.literal_eval(stripped)
+            if isinstance(parsed, (list, tuple)):
+                return list(parsed)
+        except Exception:


Catching a broad Exception can mask unrelated errors and makes debugging harder. Since ast.literal_eval is the primary operation that might fail here, it is better to catch the specific exceptions it raises.

Suggested change

except Exception:

except (ValueError, SyntaxError):
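
For illustration, the narrower handler can be sketched as a standalone helper. This is a minimal sketch, not the PR's code: `parse_list_literal` is a hypothetical name, and returning `None` on failure is an assumption about the desired fallback. `ast.literal_eval` can raise other exception types on unusual input, so the tuple may need widening in practice.

```python
import ast


def parse_list_literal(text):
    """Return a list parsed from a Python-literal string, or None on failure."""
    stripped = text.strip()
    try:
        parsed = ast.literal_eval(stripped)
    except (ValueError, SyntaxError):
        # Malformed or non-literal input: treat it as "not a list literal".
        return None
    if isinstance(parsed, (list, tuple)):
        return list(parsed)
    # A valid literal of another type (int, str, dict, ...) is not a list.
    return None
```

With this shape, an unrelated error such as a `KeyboardInterrupt` or a bug elsewhere in the call chain is no longer swallowed by the handler.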

gemini-code-assist · 2026-04-27T09:23:45Z

+        subject = _infer_subject_from_parquet_path(parquet_file)
+        if subject and 'subject' not in data.columns:
+            data['subject'] = subject
+        records.extend(row.to_dict() for _, row in data.iterrows())


Using iterrows() to convert a DataFrame to a list of dictionaries is inefficient as it creates a Series object for every row. pd.DataFrame.to_dict('records') is much faster and more idiomatic.

Suggested change

records.extend(row.to_dict() for _, row in data.iterrows())

records.extend(data.to_dict('records'))

gemini-code-assist · 2026-04-27T09:23:45Z

+def _parse_mmmu_choice_prediction(prediction, num_choices):
+    match = re.search(
+        r'(?i)^ANSWER\s*:\s*([A-Za-z\d ,]+)\s*(?:$|\n|\.)',
+        prediction,
+        flags=re.MULTILINE,
+    )
+    if match is None:
+        match = re.search(
+            r'(?i)ANSWER\s*:\s*([A-Za-z\d ,]+)(?:[^\w]|\n|$|\.)',
+            prediction,
+        )
+    if match is None:
+        for letter in reversed(prediction):
+            if letter.isupper():
+                return letter
+        return ''
+
+    matched = match.group(1).strip().rstrip('.')
+    allowed_options = {_answer_character(index) for index in range(num_choices)}
+    return matched if matched in allowed_options else ''


The fallback heuristic in _parse_mmmu_choice_prediction (lines 446-448) returns the first uppercase letter found from the end of the string without verifying if it is a valid option. This can lead to incorrect parsing (e.g., returning 'I' from 'I don't know'). The logic should be updated to check against allowed_options.

Suggested change

 def _parse_mmmu_choice_prediction(prediction, num_choices):
+    allowed_options = {_answer_character(index) for index in range(num_choices)}
     match = re.search(
         r'(?i)^ANSWER\s*:\s*([A-Za-z\d ,]+)\s*(?:$|\n|\.)',
         prediction,
         flags=re.MULTILINE,
     )
     if match is None:
         match = re.search(
             r'(?i)ANSWER\s*:\s*([A-Za-z\d ,]+)(?:[^\w]|\n|$|\.)',
             prediction,
         )
     if match is None:
         for letter in reversed(prediction):
-            if letter.isupper():
+            if letter.isupper() and letter in allowed_options:
                 return letter
         return ''

     matched = match.group(1).strip().rstrip('.')
-    allowed_options = {_answer_character(index) for index in range(num_choices)}
     return matched if matched in allowed_options else ''

zhongzhouTan-coder · 2026-05-11T02:58:35Z

+MMMU_MULTI_CHOICE_TYPE = 'multiple-choice'
+MMMU_OPEN_TYPE = 'open'
+
 def dump_image(line, image_root_path):


[review] Is the dump_image function unused?

Copilot

Pull request overview

This PR updates the MMMU dataset integration to align AISBench’s evaluation behavior with evalscope, addressing previously reported abnormally low MMMU scores. It rewrites MMMU data loading to use ModelScope’s parquet-based dataset layout and updates prompts/scoring to support both multiple-choice and open question types.

Changes:

Reworked MMMUDataset.load to discover/load local parquet shards, normalize options, extract/dump images, and build multimodal content prompts with <AIS_*> tags.
Updated MMMUEvaluator.score to score MCQ vs open questions differently and parse ANSWER:-style outputs.
Updated MMMU dataset docs/configs to use the ModelScope dataset and removed the obsolete mmmu_gen_cot config.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.

File	Description
`ais_bench/benchmark/datasets/mmmu.py`	Parquet-based MMMU loading + new prompt construction and scoring logic for MCQ/open types.
`ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen.py`	Switch MMMU config to parquet root path + new prompt templates and reader columns.
`ais_bench/benchmark/configs/datasets/mmmu/README.md`	Update Chinese MMMU deployment instructions to ModelScope parquet dataset.
`ais_bench/benchmark/configs/datasets/mmmu/README_en.md`	Update English MMMU deployment instructions to ModelScope parquet dataset.
`ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen_cot.py`	Removed legacy CoT config (no longer documented/used).

+    if match is None:
+        for letter in reversed(prediction):
+            if letter.isupper():
+                return letter
+        return ''


+                return letter
+        return ''
+
+    matched = match.group(1).strip().rstrip('.')


+    match = re.search(r'ANSWER:\s*(.*)', prediction)
+    if match:
+        return match.group(1).strip()


+        subject = _infer_subject_from_parquet_path(parquet_file)
+        if subject and 'subject' not in data.columns:
+            data['subject'] = subject
+        records.extend(row.to_dict() for _, row in data.iterrows())


+             open_prompt=None,
+             start_text_prompt='',
+             end_text_prompt='',
+             option_prompt=''):


 import string
+import ast
+from pathlib import Path
 from os import environ


+            if refer.get('type') == MMMU_MULTI_CHOICE_TYPE:
+                choices = json.loads(refer['choices']) if isinstance(refer.get('choices'), str) else refer.get('choices', {})
+                parsed_pred = _parse_mmmu_choice_prediction(pred, len(choices))
+                score = 1 if parsed_pred == str(refer.get('answer', '')).strip() else 0
+            else:
+                parsed_pred = _extract_mmmu_open_prediction(pred)
+                score = 1 if parsed_pred.strip().lower() == str(refer.get('answer', '')).strip().lower() else 0
+


+        split='validation',
+        mult_choice_prompt=MULT_CHOICE_PROMPT,
+        open_prompt=OPEN_PROMPT,
        reader_cfg=mmmu_reader_cfg,
        infer_cfg=mmmu_infer_cfg,
        eval_cfg=mmmu_eval_cfg


Libotry · 2026-05-28T13:35:34Z

+    return records, resolved_path, True
+
+
+def _resolve_mmmu_existing_image_path(image_path, data_root, image_root_path):


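[review] issue: _resolve_mmmu_existing_image_path accepts an absolute path given in a data record as-is and returns it as long as that path exists on the local machine. If the data source is not fully trusted, a dataset record can escape the data directory boundary, reference an arbitrary local file, and pass it into the downstream multimodal inference pipeline, which creates a local file exposure risk.
suggestion: Normalize the resolved path and confine it to the data root or a controlled cache directory; reject absolute paths by default unless an allowlist option is explicitly enabled.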
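
One way to express the suggested confinement is sketched below. It is an illustration under assumptions, not the PR's implementation: `resolve_within_root` is a hypothetical helper, it takes only a record path and a data root, and it omits the allowlist option mentioned above.

```python
from pathlib import Path


def resolve_within_root(image_path, data_root):
    """Return the resolved path if it stays inside data_root, else None."""
    root = Path(data_root).resolve()
    candidate = Path(image_path)
    if candidate.is_absolute():
        # Absolute paths coming from dataset records are rejected by default.
        return None
    resolved = (root / candidate).resolve()
    try:
        # relative_to raises ValueError when resolved is not under root.
        resolved.relative_to(root)
    except ValueError:
        # The path escapes the root, for example through "..".
        return None
    return resolved if resolved.exists() else None
```

Resolving before the containment check matters: it collapses `..` segments and symlinks, so the comparison is made on the real target rather than on the string the record supplied.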

Libotry · 2026-05-28T13:36:34Z

+    return output_path
+
+
+def _build_mmmu_image_path(record, image_index, image_root_path, suffix='.png'):


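[review] issue: _build_mmmu_image_path builds the file name from only id/index and image_index; it does not include the split, the subject, the source parquet file name, or a content hash in the key. Whenever different subsets or duplicated samples share the same id, images overwrite each other and later samples read the wrong image.
suggestion: Include the split, subject, or parquet stem in the file name, or use a stable hash as the output name, so that files do not overwrite each other across subsets or repeated loads.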
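
The stable-hash variant of this suggestion could look roughly like the following. The function name `build_image_path` and its parameter list are hypothetical; which fields actually identify a record uniquely depends on the dataset layout, so the key below is an assumption.

```python
import hashlib
import os


def build_image_path(image_root, split, subject, record_id, image_index, suffix=".png"):
    """Build an image path whose name depends on the record's full identity."""
    # Join the identifying fields with a separator that cannot appear in them,
    # so ("a", "bc") and ("ab", "c") do not produce the same key.
    key = "\x1f".join(str(part) for part in (split, subject, record_id, image_index))
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return os.path.join(image_root, f"{digest}{suffix}")
```

Because the name is derived only from the key, repeated loads map the same record to the same file, while records that share an id across subsets get distinct files.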

Libotry · 2026-05-28T13:38:35Z

+            prediction,
+        )
+    if match is None:
+        for letter in reversed(prediction):


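[review] issue: When no ANSWER marker is found, _parse_mmmu_choice_prediction falls back to returning the last uppercase letter in the prediction text. This heuristic is too permissive: uppercase letters in abbreviations, proper nouns, and unit symbols can all be mistaken for an option, causing occasional false hits.
suggestion: The fallback path should also be restricted to the set of valid options and combined with explicit boundary matching; if no reliable answer can be extracted, return an empty value instead of guessing the last uppercase letter.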
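
A stricter fallback along these lines might be sketched as follows. `parse_choice_fallback` is a hypothetical helper, and it assumes options are labelled with consecutive uppercase letters starting at A, which the PR's `_answer_character` helper may or may not do.

```python
import re
import string


def parse_choice_fallback(prediction, num_choices):
    """Return the last standalone valid option letter, or '' if none is found."""
    allowed = string.ascii_uppercase[:num_choices]
    if not allowed:
        return ""
    # A candidate must be a single allowed letter that is not part of a
    # longer word or number, e.g. "(B)" or "B." but not "DC" or "B2".
    pattern = rf"(?<![A-Za-z0-9])([{allowed}])(?![A-Za-z0-9])"
    matches = re.findall(pattern, prediction)
    return matches[-1] if matches else ""
```

This is still a heuristic: a sentence that starts with the article "A" would match option A, so an empty result on ambiguous text may be preferable in a stricter evaluator.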

Libotry · 2026-05-28T13:39:28Z

+            base_dir = resolved_path if os.path.isdir(resolved_path) else os.path.dirname(resolved_path)
+        else:
+            base_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '../../..', 'datasets'))
+        image_root_path = os.path.join(base_dir, 'MMMU_images')


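[review] issue: During data loading, decoded images are written directly to MMMU_images under the data directory, which gives a "read data" operation a persistent side effect. With read-only mounts, shared dataset directories, or concurrent loading, this can easily cause load failures, directory pollution, and cache conflicts.
suggestion: Write the image cache to work_dir or a dedicated cache_dir, and allow the cache path to be set through configuration; the data loading function should stay read-only as far as possible.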
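
A minimal sketch of that lookup order, assuming an explicit `cache_dir` wins over `work_dir` and the system temp directory is the last resort. The helper name `resolve_image_cache_dir` and the `mmmu_image_cache` directory name are hypothetical and do not come from AISBench.

```python
import os
import tempfile


def resolve_image_cache_dir(cache_dir=None, work_dir=None):
    """Pick a writable image cache directory outside the dataset directory."""
    if cache_dir:
        # An explicitly configured cache path takes priority.
        target = cache_dir
    elif work_dir:
        target = os.path.join(work_dir, "mmmu_image_cache")
    else:
        # Last resort: the system temp directory, never the dataset root.
        target = os.path.join(tempfile.gettempdir(), "mmmu_image_cache")
    os.makedirs(target, exist_ok=True)
    return target
```

Keeping the dataset directory out of every branch is the point of the design: the loader can then run against a read-only mount.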

Libotry · 2026-05-28T13:41:13Z

+                logger.warning(f'Skipping MMMU record without question at index {index}')
+                continue
+
+            question_type = record.get('question_type') or (


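[review] issue: The question_type check is not normalized, and line 503 later relies on exact equality with the string multiple-choice. If upstream data uses variants such as Multiple-Choice or multiple_choice, the record is misrouted to the open branch and gets the wrong prompt and evaluation logic.
suggestion: Before branching, normalize the value (lowercase, strip spaces/underscores, and so on), then map it to an internal enum value.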
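
The normalization step could be sketched like this. The two constants mirror the ones added in the PR diff; `normalize_question_type` and the alias sets are assumptions made for illustration.

```python
import re

MMMU_MULTI_CHOICE_TYPE = "multiple-choice"
MMMU_OPEN_TYPE = "open"


def normalize_question_type(raw):
    """Map spelling variants of a question type to an internal constant."""
    # Drop whitespace, underscores and hyphens, then compare case-insensitively.
    key = re.sub(r"[\s_\-]+", "", str(raw or "")).lower()
    if key in {"multiplechoice", "multichoice", "mcq"}:
        return MMMU_MULTI_CHOICE_TYPE
    if key in {"open", "openended"}:
        return MMMU_OPEN_TYPE
    # Unknown values are reported as None instead of silently becoming "open".
    return None
```

Returning `None` for unrecognized values lets the caller log or skip the record rather than routing it to the open branch by default.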

Libotry · 2026-05-28T13:42:37Z

+
+            if refer.get('type') == MMMU_MULTI_CHOICE_TYPE:
+                choices = json.loads(refer['choices']) if isinstance(refer.get('choices'), str) else refer.get('choices', {})
+                parsed_pred = _parse_mmmu_choice_prediction(pred, len(choices))


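[review] issue: MMMUEvaluator scores multiple-choice questions by calling only _parse_mmmu_choice_prediction and does not reuse can_infer / can_infer_option / can_infer_text, which this file already provides. As a result, MMMU and the datasets that reuse can_infer (mmmu_pro, mmstar, mathvision) apply different scoring standards to the same kind of output, producing behavior drift where "other datasets recognize the answer, but this one marks it wrong".
suggestion: Consolidate multiple-choice answer extraction into one shared function so that the MMMU family of datasets uses the same parsing and fault-tolerance rules.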

Libotry · 2026-05-28T13:44:09Z

+                score = 1 if parsed_pred == str(refer.get('answer', '')).strip() else 0
+            else:
+                parsed_pred = _extract_mmmu_open_prediction(pred)
+                score = 1 if parsed_pred.strip().lower() == str(refer.get('answer', '')).strip().lower() else 0


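[review] issue: Open-question scoring uses strict string equality between parsed_pred.strip().lower() and the reference answer, with no numeric normalization, punctuation normalization, or support for common equivalent formats. For open-ended answers this leads to a high false-negative rate, especially for equivalent expressions of numbers, units, fractions, and short phrases.
suggestion: Add a normalization step for the open type, for example removing redundant punctuation, unifying whitespace, and handling equivalent numeric formats; if the dataset's official protocol allows it, a dedicated answer normalizer or judge logic would be better still.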
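
As a rough sketch of such a normalization step, the helpers below treat plain numbers and simple fractions as exact rationals and clean everything else as text. Both function names are hypothetical, and the rules are assumptions rather than EvalScope's or MMMU's official protocol; units such as "12 cm" are not handled.

```python
import re
from fractions import Fraction


def normalize_open_answer(text):
    """Normalize an open answer: numbers become Fractions, the rest clean text."""
    # Lowercase, collapse whitespace, and drop trailing punctuation.
    cleaned = " ".join(str(text).strip().lower().split())
    cleaned = cleaned.rstrip(".!?;,").strip()
    # Thousands separators are removed only for the numeric check.
    numeric = cleaned.replace(",", "")
    if re.fullmatch(r"[+-]?(\d+/\d+|\d+(\.\d+)?)", numeric):
        try:
            return Fraction(numeric)
        except (ValueError, ZeroDivisionError):
            # e.g. "1/0": fall through and compare as text.
            pass
    return cleaned


def open_answers_match(prediction, reference):
    """Compare two open answers after normalization."""
    return normalize_open_answer(prediction) == normalize_open_answer(reference)
```

Using `Fraction` avoids floating-point comparison issues, so "0.50" and "1/2" compare equal exactly rather than within a tolerance.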

wenba0 had a problem deploying to smoke-test-approval April 27, 2026 09:22 — with GitHub Actions Error

gemini-code-assist Bot reviewed Apr 27, 2026

github-actions Bot added the bugfix label May 7, 2026

modify mmmu

cad062a

wenba0 force-pushed the mmmu_evalscope branch from d66c6d3 to cad062a Compare May 11, 2026 02:29

Merge branch 'new1' into mmmu_evalscope

381a816

# Conflicts: # ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen_cot.py

wenba0 temporarily deployed to smoke-test-approval May 11, 2026 02:37 — with GitHub Actions Inactive

zhongzhouTan-coder reviewed May 11, 2026

zhongzhouTan-coder requested a review from Copilot May 11, 2026 03:01

Copilot started reviewing on behalf of zhongzhouTan-coder May 11, 2026 03:02

Copilot AI reviewed May 11, 2026

SJTUyh approved these changes May 11, 2026

SJTUyh merged commit c15f3f3 into AISBench:master May 11, 2026
17 checks passed

SJTUyh mentioned this pull request May 19, 2026

[Bug] Running mmmu_gen on version 3.1.20260330 misjudges the results of a large number of questions #300

Closed

4 tasks

Libotry reviewed May 28, 2026
