support mathvision #264
Code Review
This pull request introduces the MathVision (MATH-V) dataset for multimodal mathematical reasoning, providing documentation, configuration files, and the core implementation for data loading and evaluation. The reviewer feedback suggests refining the multiple-choice answer normalization to prioritize 1-indexing, enhancing the robustness of image storage paths to support read-only environments, and standardizing the evaluation metric keys for consistency across the codebase.
    if upper_answer.isdigit():
        index = int(upper_answer)
        if 0 <= index < len(options):
            return allowed_letters[index]
        if 1 <= index <= len(options):
            return allowed_letters[index - 1]
The logic for converting numeric answers to letters is ambiguous. Currently, if a model outputs "1", it returns 'B' (index 1) because the 0-indexed check comes first. However, in multiple-choice questions, numeric answers are more commonly 1-indexed (i.e., "1" maps to 'A'). Prioritizing 1-indexing would align better with common model behavior for MCQ tasks.
Suggested change:

    - if upper_answer.isdigit():
    -     index = int(upper_answer)
    -     if 0 <= index < len(options):
    -         return allowed_letters[index]
    -     if 1 <= index <= len(options):
    -         return allowed_letters[index - 1]
    + if upper_answer.isdigit():
    +     index = int(upper_answer)
    +     if 1 <= index <= len(options):
    +         return allowed_letters[index - 1]
    +     if 0 <= index < len(options):
    +         return allowed_letters[index]
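To make the effect of the reordering concrete, here is a minimal, self-contained sketch of the 1-indexed-first normalization. The function name `normalize_choice`, its signature, and the way `allowed_letters` is derived from `options` are assumptions for illustration; only the numeric branch mirrors the code quoted in this review.

```python
import string
from typing import Optional, Sequence


def normalize_choice(answer: str, options: Sequence[str]) -> Optional[str]:
    """Map a raw answer to an option letter, or None if it cannot be mapped.

    Hypothetical helper: the real function in the PR may differ.
    """
    # Assumption: options are labelled A, B, C, ... in order.
    allowed_letters = string.ascii_uppercase[:len(options)]
    upper_answer = answer.strip().upper()

    # A single letter that names an existing option is returned as-is.
    if len(upper_answer) == 1 and upper_answer in allowed_letters:
        return upper_answer

    # Restrict to ASCII digits so int() cannot fail on characters such as
    # superscripts, which str.isdigit() also accepts.
    if upper_answer.isascii() and upper_answer.isdigit():
        index = int(upper_answer)
        # 1-indexed interpretation first: "1" -> 'A'.
        if 1 <= index <= len(options):
            return allowed_letters[index - 1]
        # 0-indexed fallback: with the check above, only "0" reaches here.
        if 0 <= index < len(options):
            return allowed_letters[index]
    return None


if __name__ == "__main__":
    opts = ["3", "5", "7", "9"]
    for raw in ["1", "4", "0", "5", "b"]:
        print(repr(raw), "->", normalize_choice(raw, opts))
```

With this ordering, the 0-indexed branch only ever fires for the answer "0", so the two interpretations no longer compete for the same inputs.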
    if is_local:
        base_dir = resolved_path if os.path.isdir(resolved_path) else os.path.dirname(resolved_path)
    else:
        base_dir = os.path.abspath(os.path.join(os.path.dirname(__file__), '../../..', 'datasets'))
The base_dir for storing images is calculated using a hardcoded relative path from the source file. This approach is fragile and will likely fail if the package is installed in a read-only location (e.g., as a system-wide site-package). It is recommended to use the get_cache_dir utility or respect the AIS_BENCH_DATASETS_CACHE environment variable to ensure a writable location is used for image extraction when the dataset is loaded from a remote source.
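One possible shape for this, sketched under stated assumptions: the function name `resolve_image_dir` and the fallback location under the user's home directory are hypothetical, and the exact semantics of `AIS_BENCH_DATASETS_CACHE` are assumed to be "root directory for cached datasets". The repository's `get_cache_dir` utility is not called here because its signature is not shown in this review.

```python
import os


def resolve_image_dir(resolved_path: str, is_local: bool) -> str:
    """Pick a directory for extracted images (hypothetical helper)."""
    if is_local:
        # Same behaviour as the quoted code: store next to the local data.
        if os.path.isdir(resolved_path):
            return resolved_path
        return os.path.dirname(resolved_path) or "."

    # Remote source: prefer the environment variable named in the review.
    # Assumption: it points at a writable cache root.
    cache_root = os.environ.get("AIS_BENCH_DATASETS_CACHE")
    if not cache_root:
        # Assumed fallback; the real project may use a different default.
        cache_root = os.path.join(os.path.expanduser("~"), ".cache", "ais_bench")

    base_dir = os.path.join(cache_root, "datasets")
    # Fails early with a clear OSError if the location is not writable.
    os.makedirs(base_dir, exist_ok=True)
    return base_dir
```

Creating the directory at resolution time surfaces a read-only location immediately, instead of failing later in the middle of image extraction.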
    if subject:
        subject_scores.setdefault(f'subject: {subject}', []).append(score)

    result = {'Accuracy': 100 * sum(scores) / len(scores) if scores else 0.0}
The evaluation result uses the key 'Accuracy' (capitalized), whereas other evaluators in this repository (e.g., MATHEvaluator in math.py) consistently use 'accuracy' (lowercase). Using the lowercase key ensures consistency across the benchmark suite and compatibility with reporting tools that expect the standard key name.
Suggested change:

    - result = {'Accuracy': 100 * sum(scores) / len(scores) if scores else 0.0}
    + result = {'accuracy': 100 * sum(scores) / len(scores) if scores else 0.0}
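For context, a minimal sketch of how the overall and per-subject scores could be aggregated with the lowercase key. The function name `summarize_scores`, the `(score, subject)` record shape, and the way per-subject entries are added to the result are assumptions; only the `setdefault` accumulation and the `accuracy` expression come from the quoted code.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def summarize_scores(
    records: Sequence[Tuple[float, Optional[str]]],
) -> Dict[str, float]:
    """Aggregate (score, subject) pairs into percentage accuracies."""
    scores: List[float] = []
    subject_scores: Dict[str, List[float]] = {}

    for score, subject in records:
        scores.append(score)
        if subject:
            subject_scores.setdefault(f'subject: {subject}', []).append(score)

    # Lowercase key, matching the convention the review attributes to
    # the other evaluators in the repository.
    result = {'accuracy': 100 * sum(scores) / len(scores) if scores else 0.0}

    # Assumed reporting format: one entry per subject, also in percent.
    for key, values in subject_scores.items():
        result[key] = 100 * sum(values) / len(values)
    return result
```

An empty input yields `{'accuracy': 0.0}` rather than a division error, which matches the guard in the quoted expression.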
Thanks for your contribution; we appreciate it a lot. The following instructions will make your pull request healthier and help you get feedback more easily. If you do not understand some items, don't worry, just make the pull request and seek help from maintainers.
PR Type
🔍 Motivation
Add the mathvision dataset.
📝 Modification
Add the mathvision dataset.
📐 Associated Test Results
Add the smoke test case accuracy_mathvision.

Does the modification introduce changes that break the backward compatibility of the downstream repositories? If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
If the modification introduces performance degradation, please describe the impact of the performance degradation and the expected performance improvement.
🌟 Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here and update the documentation.
✅ Checklist
Before PR:
After PR:
👥 Collaboration Info
🌟 Useful CI Command
/gemini review
/gemini summary
/gemini help
/readthedocs build