[feature] support aime26 dataset #274
Code Review
This pull request introduces the AIME2026 dataset to the benchmark suite, providing documentation, configuration files for 0-shot chat and string prompts, and a new smoke test for accuracy verification. The review feedback suggests aligning dataset abbreviations in the configuration files with the test execution script, refactoring the dataset loading method to correctly override the base class while adding robustness for file encoding and empty lines, and using variables instead of hardcoded strings in the smoke test shell script.
with read_base():
    from .aime2026_gen_0_shot_chat_prompt import aime2026_datasets

aime2026_datasets[0]['abbr'] = 'aime2026_0_shot_chat'
The abbr should match the dataset name passed to the ais_bench command in run.sh. Since run.sh uses ${CASE_NAME}_0_shot_chat (which evaluates to accuracy_aime2026_0_shot_chat), the abbreviation here should be updated to match. Otherwise, the benchmark might fail to find the dataset or produce output files with unexpected names.
- aime2026_datasets[0]['abbr'] = 'aime2026_0_shot_chat'
+ aime2026_datasets[0]['abbr'] = 'accuracy_aime2026_0_shot_chat'
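To make the naming relationship concrete, here is a minimal sketch that derives both abbreviations from a single case name, mirroring the `${CASE_NAME}_0_shot_chat` pattern the comment attributes to run.sh. The `CASE_NAME` constant, the `expected_abbrs` helper, and the stand-in `aime2026_datasets` list are illustrative assumptions and not part of the PR.

```python
# Hypothetical sketch: build dataset abbreviations from one case name so the
# config override and the smoke-test script cannot drift apart.
CASE_NAME = "accuracy_aime2026"  # assumed value, inferred from the review comment


def expected_abbrs(case_name: str) -> list[str]:
    """Return the abbreviations that run.sh is described as using."""
    return [f"{case_name}_0_shot_chat", f"{case_name}_0_shot_str"]


# Stand-in for the list imported under read_base(); the real entry has more keys.
aime2026_datasets = [{"abbr": "aime2026_0_shot_chat"}]

# Apply the override that the review suggests.
aime2026_datasets[0]["abbr"] = expected_abbrs(CASE_NAME)[0]
```

Deriving the names in one place means a later rename of the case only has to happen once.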
with read_base():
    from .aime2026_gen_0_shot_str import aime2026_datasets

aime2026_datasets[0]['abbr'] = 'aime2026_0_shot_str'
def load(path, **kwargs):
    path = get_data_path(path)
    dataset = []
    with open(path, 'r') as f:
        for line in f:
            line = json.loads(line.strip())
            dataset.append(line)
    return Dataset.from_list(dataset)
The load method should be an instance method to correctly override the abstract method in BaseDataset. Additionally, it is recommended to specify encoding='utf-8' when opening the file and to skip empty lines in the JSONL file to ensure robustness against malformed data.
- def load(path, **kwargs):
-     path = get_data_path(path)
-     dataset = []
-     with open(path, 'r') as f:
-         for line in f:
-             line = json.loads(line.strip())
-             dataset.append(line)
-     return Dataset.from_list(dataset)
+ def load(self, path, **kwargs):
+     path = get_data_path(path)
+     dataset = []
+     with open(path, 'r', encoding='utf-8') as f:
+         for line in f:
+             line = line.strip()
+             if line:
+                 dataset.append(json.loads(line))
+     return Dataset.from_list(dataset)
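The robustness points in this suggestion can be tried in isolation. The sketch below is a standalone JSONL reader that opens the file as UTF-8 and skips blank lines. It deliberately leaves out `get_data_path` and `Dataset.from_list`, so it runs with the standard library alone. The `read_jsonl` name and the sample `question`/`answer` fields are assumptions for illustration, not the dataset's actual schema.

```python
import json
import os
import tempfile


def read_jsonl(path):
    """Parse a JSONL file into a list of records, skipping blank lines."""
    records = []
    # Explicit encoding avoids depending on the platform's default codec.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a stray empty row
                records.append(json.loads(line))
    return records


# Demo with made-up records; note the empty line between them.
sample = '{"question": "1+1", "answer": "2"}\n\n{"question": "2+2", "answer": "4"}\n'
with tempfile.NamedTemporaryFile(
    "w", suffix=".jsonl", delete=False, encoding="utf-8"
) as tmp:
    tmp.write(sample)

records = read_jsonl(tmp.name)
os.remove(tmp.name)
```

Without the `if line:` guard, the empty line in the sample would reach `json.loads("")`, which raises `json.JSONDecodeError`.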
TIMESTAMP="${WORK_DIR_INFO##*/}"

# Dataset abbr list, used for file checks
dataset_abbr_list=(aime2026_0_shot_chat aime2026_0_shot_str)
Instead of hardcoding the dataset abbreviations, it is better to use the ${CASE_NAME} variable to maintain consistency with the ais_bench command and the config overrides.
- dataset_abbr_list=(aime2026_0_shot_chat aime2026_0_shot_str)
+ dataset_abbr_list=(${CASE_NAME}_0_shot_chat ${CASE_NAME}_0_shot_str)
PR Type
Related Issue
Fixes #(issue ID) / Relates to #(issue ID)
🔍 Motivation
Support the aime26 dataset.
📐 Associated Test Results
Please provide links to the related test results, such as CI pipelines, test reports, etc.
✅ Checklist
Before PR:
🌟 Useful CI Commands
/gemini review
/gemini summary
/gemini help
/readthedocs build