modify mmmu#265
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the MMMU dataset implementation to align with evalscope, transitioning from TSV files to local parquet files sourced from ModelScope. Key changes include updated documentation, consolidated task configurations, and enhanced logic for prompt templating, image processing, and evaluation of both multiple-choice and open-ended questions. Review feedback focuses on optimizing DataFrame processing for better performance, refining exception handling to be more specific, and fixing a validation bug in the multiple-choice prediction parser to ensure fallback characters are checked against allowed options.
| parsed = ast.literal_eval(stripped) | ||
| if isinstance(parsed, (list, tuple)): | ||
| return list(parsed) | ||
| except Exception: |
There was a problem hiding this comment.
| subject = _infer_subject_from_parquet_path(parquet_file) | ||
| if subject and 'subject' not in data.columns: | ||
| data['subject'] = subject | ||
| records.extend(row.to_dict() for _, row in data.iterrows()) |
There was a problem hiding this comment.
Using iterrows() to convert a DataFrame to a list of dictionaries is inefficient as it creates a Series object for every row. pd.DataFrame.to_dict('records') is much faster and more idiomatic.
| records.extend(row.to_dict() for _, row in data.iterrows()) | |
| records.extend(data.to_dict('records')) |
| def _parse_mmmu_choice_prediction(prediction, num_choices): | ||
| match = re.search( | ||
| r'(?i)^ANSWER\s*:\s*([A-Za-z\d ,]+)\s*(?:$|\n|\.)', | ||
| prediction, | ||
| flags=re.MULTILINE, | ||
| ) | ||
| if match is None: | ||
| match = re.search( | ||
| r'(?i)ANSWER\s*:\s*([A-Za-z\d ,]+)(?:[^\w]|\n|$|\.)', | ||
| prediction, | ||
| ) | ||
| if match is None: | ||
| for letter in reversed(prediction): | ||
| if letter.isupper(): | ||
| return letter | ||
| return '' | ||
|
|
||
| matched = match.group(1).strip().rstrip('.') | ||
| allowed_options = {_answer_character(index) for index in range(num_choices)} | ||
| return matched if matched in allowed_options else '' |
There was a problem hiding this comment.
The fallback heuristic in _parse_mmmu_choice_prediction (lines 446-448) returns the first uppercase letter found from the end of the string without verifying if it is a valid option. This can lead to incorrect parsing (e.g., returning 'I' from 'I don't know'). The logic should be updated to check against allowed_options.
| def _parse_mmmu_choice_prediction(prediction, num_choices): | |
| match = re.search( | |
| r'(?i)^ANSWER\s*:\s*([A-Za-z\d ,]+)\s*(?:$|\n|\.)', | |
| prediction, | |
| flags=re.MULTILINE, | |
| ) | |
| if match is None: | |
| match = re.search( | |
| r'(?i)ANSWER\s*:\s*([A-Za-z\d ,]+)(?:[^\w]|\n|$|\.)', | |
| prediction, | |
| ) | |
| if match is None: | |
| for letter in reversed(prediction): | |
| if letter.isupper(): | |
| return letter | |
| return '' | |
| matched = match.group(1).strip().rstrip('.') | |
| allowed_options = {_answer_character(index) for index in range(num_choices)} | |
| return matched if matched in allowed_options else '' | |
| def _parse_mmmu_choice_prediction(prediction, num_choices): | |
| allowed_options = {_answer_character(index) for index in range(num_choices)} | |
| match = re.search( | |
| r'(?i)^ANSWER\s*:\s*([A-Za-z\d ,]+)\s*(?:$|\n|\.)', | |
| prediction, | |
| flags=re.MULTILINE, | |
| ) | |
| if match is None: | |
| match = re.search( | |
| r'(?i)ANSWER\s*:\s*([A-Za-z\d ,]+)(?:[^\w]|\n|$|\.)', | |
| prediction, | |
| ) | |
| if match is None: | |
| for letter in reversed(prediction): | |
| if letter.isupper() and letter in allowed_options: | |
| return letter | |
| return '' | |
| matched = match.group(1).strip().rstrip('.') | |
| return matched if matched in allowed_options else '' |
# Conflicts: # ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen_cot.py
| MMMU_MULTI_CHOICE_TYPE = 'multiple-choice' | ||
| MMMU_OPEN_TYPE = 'open' | ||
|
|
||
| def dump_image(line, image_root_path): |
There was a problem hiding this comment.
[review] the dump_image function is unused?
There was a problem hiding this comment.
Pull request overview
This PR updates the MMMU dataset integration to align AISBench’s evaluation behavior with evalscope, addressing previously reported abnormally low MMMU scores. It rewrites MMMU data loading to use ModelScope’s parquet-based dataset layout and updates prompts/scoring to support both multiple-choice and open question types.
Changes:
- Reworked
MMMUDataset.loadto discover/load local parquet shards, normalize options, extract/dump images, and build multimodalcontentprompts with<AIS_*>tags. - Updated
MMMUEvaluator.scoreto score MCQ vs open questions differently and parseANSWER:-style outputs. - Updated MMMU dataset docs/configs to use the ModelScope dataset and removed the obsolete
mmmu_gen_cotconfig.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
ais_bench/benchmark/datasets/mmmu.py |
Parquet-based MMMU loading + new prompt construction and scoring logic for MCQ/open types. |
ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen.py |
Switch MMMU config to parquet root path + new prompt templates and reader columns. |
ais_bench/benchmark/configs/datasets/mmmu/README.md |
Update Chinese MMMU deployment instructions to ModelScope parquet dataset. |
ais_bench/benchmark/configs/datasets/mmmu/README_en.md |
Update English MMMU deployment instructions to ModelScope parquet dataset. |
ais_bench/benchmark/configs/datasets/mmmu/mmmu_gen_cot.py |
Removed legacy CoT config (no longer documented/used). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if match is None: | ||
| for letter in reversed(prediction): | ||
| if letter.isupper(): | ||
| return letter | ||
| return '' |
| return letter | ||
| return '' | ||
|
|
||
| matched = match.group(1).strip().rstrip('.') |
| match = re.search(r'ANSWER:\s*(.*)', prediction) | ||
| if match: | ||
| return match.group(1).strip() |
| subject = _infer_subject_from_parquet_path(parquet_file) | ||
| if subject and 'subject' not in data.columns: | ||
| data['subject'] = subject | ||
| records.extend(row.to_dict() for _, row in data.iterrows()) |
| open_prompt=None, | ||
| start_text_prompt='', | ||
| end_text_prompt='', | ||
| option_prompt=''): |
| import string | ||
| import ast | ||
| from pathlib import Path | ||
| from os import environ |
| if refer.get('type') == MMMU_MULTI_CHOICE_TYPE: | ||
| choices = json.loads(refer['choices']) if isinstance(refer.get('choices'), str) else refer.get('choices', {}) | ||
| parsed_pred = _parse_mmmu_choice_prediction(pred, len(choices)) | ||
| score = 1 if parsed_pred == str(refer.get('answer', '')).strip() else 0 | ||
| else: | ||
| parsed_pred = _extract_mmmu_open_prediction(pred) | ||
| score = 1 if parsed_pred.strip().lower() == str(refer.get('answer', '')).strip().lower() else 0 | ||
|
|
| split='validation', | ||
| mult_choice_prompt=MULT_CHOICE_PROMPT, | ||
| open_prompt=OPEN_PROMPT, | ||
| reader_cfg=mmmu_reader_cfg, | ||
| infer_cfg=mmmu_infer_cfg, | ||
| eval_cfg=mmmu_eval_cfg |
| return records, resolved_path, True | ||
|
|
||
|
|
||
| def _resolve_mmmu_existing_image_path(image_path, data_root, image_root_path): |
There was a problem hiding this comment.
[review] issue: _resolve_mmmu_existing_image_path 会直接接受数据记录里给出的绝对路径,只要本机该路径存在就返回;如果数据源不完全可信,这会让数据集记录越过数据目录边界,引用任意本地文件并传给后续多模态推理流程,存在本地文件暴露风险。
suggestion: 对解析后的路径做规范化并限制在数据根目录或受控缓存目录下;对于绝对路径默认拒绝,除非显式开启白名单配置。
| return output_path | ||
|
|
||
|
|
||
| def _build_mmmu_image_path(record, image_index, image_root_path, suffix='.png'): |
There was a problem hiding this comment.
[review]issue: _build_mmmu_image_path 生成文件名时只使用 id/index 和 image_index,没有把 split、subject、源 parquet 文件名或内容哈希纳入键;不同子集或重复样本只要 id 相同,就会发生图片覆盖,导致后续样本读到错误图片。
suggestion: 把 split、subject 或 parquet stem 纳入文件名,或使用稳定哈希作为输出名,确保跨子集和多次加载时不会互相覆盖。
| prediction, | ||
| ) | ||
| if match is None: | ||
| for letter in reversed(prediction): |
There was a problem hiding this comment.
[review]issue: _parse_mmmu_choice_prediction 在找不到 ANSWER 标记时,会退化为“返回预测文本里最后一个大写字母”;这个启发式过于宽松,像缩写、专有名词、单位符号里的大写字母都可能被误判成选项,造成偶发性误命中。
suggestion: 退化路径也应限制在合法选项集合内,并结合明确边界匹配;如果提取不出可靠答案,应返回空值而不是猜测最后一个大写字母。
| base_dir = resolved_path if os.path.isdir(resolved_path) else os.path.dirname(resolved_path) | ||
| else: | ||
| base_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '../../..', 'datasets')) | ||
| image_root_path = os.path.join(base_dir, 'MMMU_images') |
There was a problem hiding this comment.
[review]issue: 数据加载过程中把解码后的图片直接写到数据目录下的 MMMU_images,这让一个“读数据”的动作带上了持久化副作用;在只读挂载、共享数据集目录或并发加载场景下,容易导致加载失败、目录污染和缓存冲突。
suggestion: 把图片缓存写到 work_dir 或专门的 cache_dir,并允许通过配置指定缓存路径;数据加载函数应尽量保持只读语义。
| logger.warning(f'Skipping MMMU record without question at index {index}') | ||
| continue | ||
|
|
||
| question_type = record.get('question_type') or ( |
There was a problem hiding this comment.
[review] issue: question_type的判定没有做规范化处理,后面 503 行又依赖与字符串multiple-choice的精确相等;如果上游数据写成Multiple-Choice、multiple_choice等变体,就会被误判到open分支,直接走错提示词和评测逻辑。
suggestion: 在分支判断前先统一做小写、去空格/下划线等标准化,再映射到内部枚举值。
|
|
||
| if refer.get('type') == MMMU_MULTI_CHOICE_TYPE: | ||
| choices = json.loads(refer['choices']) if isinstance(refer.get('choices'), str) else refer.get('choices', {}) | ||
| parsed_pred = _parse_mmmu_choice_prediction(pred, len(choices)) |
There was a problem hiding this comment.
[review] issue: MMMUEvaluator的多选题判分只调用_parse_mmmu_choice_prediction,没有复用本文件里已经提供的can_infer / can_infer_option / can_infer_text;这会让MMMU与mmmu_pro、mmstar、mathvision这几个复用can_infer的数据集在同类输出上的判分标准不一致,出现“别的集能识别,这里判错”的行为漂移。
suggestion: 把多选答案提取统一收敛到一个共享函数,确保MMMU系列数据集使用同一套解析和容错规则。
| score = 1 if parsed_pred == str(refer.get('answer', '')).strip() else 0 | ||
| else: | ||
| parsed_pred = _extract_mmmu_open_prediction(pred) | ||
| score = 1 if parsed_pred.strip().lower() == str(refer.get('answer', '')).strip().lower() else 0 |
There was a problem hiding this comment.
[review]issue: 开放题判分采用parsed_pred.strip().lower()与标准答案做严格字符串相等,缺少数值归一化、标点归一化和常见同义格式兼容;对于开放式答案,这会带来较高的假阴性,特别是数字、单位、分数和简短短语的等价表达。
suggestion: 针对open类型增加规范化步骤,例如去除冗余标点、统一空白、处理数值等价格式;如果数据集官方允许,更应引入专门的答案归一化或judge逻辑。
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
感谢您的贡献,我们非常重视。以下说明将使您的拉取请求更健康,更易于获得反馈。如果您不理解某些项目,请不要担心,只需提交拉取请求并从维护人员那里寻求帮助即可。
PR Type / PR类型
Related Issue | 关联 Issue
Fixes #224 #30
🔍 Motivation / 变更动机
mmmu数据集的测评之前是对齐vlmevalkit,多个issue发现该测评方法得分过低,修改对齐evalscope
📝 Modification / 修改内容
mmmu数据集修改对齐evalscope
📐 Associated Test Results / 关联测试结果
Please provide links to the related test results, such as CI pipelines, test reports, etc.

请提供相关测试结果的链接,例如 CI 管道、测试报告等。
Qwen2.5-VL-7B-Instuct测试结果如下,相同配置下对齐EvalScope,比官方结果低 可能是因为参数配置、vllm-ascend等原因
官方
AISBench


evalscope命令
evalscope eval --model qwen25vl --api-url http://localhost:8077/v1 --datasets mmmu --eval-batch-size 16 --dataset-args '{"mmmu":{"local_path":"/mnt/share/ljj/data/mmmu"}}' --generation-config '{"top_p":1, "top_k":-1, "temperature":0,"repetition_penalty":1.0,"max_tokens":5000}'结果
Does the modification introduce changes that break the backward compatibility of the downstream repositories? If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
是否引入了会破坏下游存储库向后兼容性的更改?如果是,请描述它如何破坏兼容性,以及下游项目应该如何修改其代码以保持与此 PR 的兼容性。
If the modification introduces performance degradation, please describe the impact of the performance degradation and the expected performance improvement.
如果引入了性能下降,请描述性能下降的影响和预期的性能改进。
🌟 Use cases (Optional) / 使用案例(可选)
If this PR introduces a new feature, it is better to list some use cases here and update the documentation.
如果此拉取请求引入了新功能,最好在此处列出一些用例并更新文档。
✅ Checklist / 检查列表
Before PR:
After PR:
👥 Collaboration Info / 协作信息
🌟 Useful CI Command / 实用的CI命令
/gemini review/gemini summary/gemini help/readthedocs build