[UT]: add UT coverage for MMMU-Pro and MMStar datasets#326
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces comprehensive unit test suites for the MMMU-Pro and MMStar dataset loaders and evaluators, ensuring robust coverage of data loading, CoT post-processing, and evaluation metrics. The reviewer feedback highlights several opportunities to strengthen these tests by adding assertions to verify mock calls (such as verifying that image paths are correctly processed and passed to dump_image or get_content_str). Additionally, the reviewer notes a limitation in cot_postproc where the 'A' in 'Answer:' is incorrectly matched when no other uppercase letters are present, suggesting documenting this behavior in the tests.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| def test_answer_with_no_uppercase(self): | ||
| response = "Some reasoning\nAnswer: 42" | ||
| result = cot_postproc(response) | ||
| self.assertEqual(result, "A") |
There was a problem hiding this comment.
The test test_answer_with_no_uppercase asserts that cot_postproc returns 'A' for the input 'Some reasoning\nAnswer: 42'. This is actually due to a bug in cot_postproc where it counts the 'A' in the prefix 'Answer:' as the predicted option because it is the only uppercase letter in the line. This means any answer line without other uppercase letters (e.g., 'Answer: 42' or 'Answer: b') will incorrectly return 'A'. Please add a comment to document this known limitation/bug so it can be fixed in the future without breaking the test suite unexpectedly.
| def test_answer_with_no_uppercase(self): | |
| response = "Some reasoning\nAnswer: 42" | |
| result = cot_postproc(response) | |
| self.assertEqual(result, "A") | |
| def test_answer_with_no_uppercase(self): | |
| response = "Some reasoning\nAnswer: 42" | |
| result = cot_postproc(response) | |
| # NOTE: This currently returns "A" due to a bug in cot_postproc where the "A" in "Answer:" is counted. | |
| # Once cot_postproc is fixed, this assertion should be updated to expect "42" or the raw response. | |
| self.assertEqual(result, "A") |
| def test_lowercase_not_detected(self): | ||
| response = "reasoning\nAnswer: a" | ||
| result = cot_postproc(response) | ||
| self.assertEqual(result, "A") |
There was a problem hiding this comment.
The test test_lowercase_not_detected asserts that cot_postproc returns 'A' for the input 'reasoning\nAnswer: a'. Similar to the bug in test_answer_with_no_uppercase, this is because the 'A' in 'Answer:' is the only uppercase letter counted. If the response was 'Answer: b', it would still incorrectly return 'A'. Please add a comment to document this behavior.
| def test_lowercase_not_detected(self): | |
| response = "reasoning\nAnswer: a" | |
| result = cot_postproc(response) | |
| self.assertEqual(result, "A") | |
| def test_lowercase_not_detected(self): | |
| response = "reasoning\nAnswer: a" | |
| result = cot_postproc(response) | |
| # NOTE: This currently returns "A" due to a bug in cot_postproc where the "A" in "Answer:" is counted. | |
| # If the response was "Answer: b", it would still incorrectly return "A". | |
| self.assertEqual(result, "A") |
| self.assertIsInstance(ds, Dataset) | ||
| self.assertEqual(len(ds), 2) |
There was a problem hiding this comment.
The test test_load_with_image_path_column only asserts the length of the dataset, but does not verify that the image_path column was actually processed and passed to dump_image. Adding assertions to verify the mock calls ensures the logic is robustly tested.
| self.assertIsInstance(ds, Dataset) | |
| self.assertEqual(len(ds), 2) | |
| self.assertIsInstance(ds, Dataset) | |
| self.assertEqual(len(ds), 2) | |
| self.assertEqual(mock_dump.call_args_list[0][0][0]['image_path'], "/path/a.png") | |
| self.assertEqual(mock_dump.call_args_list[1][0][0]['image_path'], "/path/b.png") |
| self.assertIsInstance(ds, Dataset) | ||
| self.assertEqual(len(ds), 2) |
There was a problem hiding this comment.
The test test_list_image_path only asserts the length of the dataset, but does not verify that both image paths returned by dump_image are actually included in the messages passed to get_content_str. Adding assertions to verify the mock calls ensures the list-handling logic is robustly tested.
| self.assertIsInstance(ds, Dataset) | |
| self.assertEqual(len(ds), 2) | |
| self.assertIsInstance(ds, Dataset) | |
| self.assertEqual(len(ds), 2) | |
| first_call_msgs = mock_content.call_args_list[0][0][0] | |
| self.assertEqual(first_call_msgs[0]['image_url'], "/fake/img1.png") | |
| self.assertEqual(first_call_msgs[1]['image_url'], "/fake/img2.png") |
| self.assertIsInstance(ds, Dataset) | ||
| self.assertEqual(len(ds), 2) |
There was a problem hiding this comment.
The test test_load_with_image_path_column only asserts the length of the dataset, but does not verify that the image_path column was actually processed and passed to dump_image. Adding assertions to verify the mock calls ensures the logic is robustly tested.
| self.assertIsInstance(ds, Dataset) | |
| self.assertEqual(len(ds), 2) | |
| self.assertIsInstance(ds, Dataset) | |
| self.assertEqual(len(ds), 2) | |
| self.assertEqual(mock_dump.call_args_list[0][0][0]['image_path'], "/path/a.png") | |
| self.assertEqual(mock_dump.call_args_list[1][0][0]['image_path'], "/path/b.png") |
| self.assertIsInstance(ds, Dataset) | ||
| self.assertEqual(len(ds), 2) |
There was a problem hiding this comment.
The test test_list_image_path only asserts the length of the dataset, but does not verify that only the first image path from the list is used for the vision dataset (as per the implementation of MMMUProVisionDataset.load). Adding assertions to verify the mock calls ensures the list-handling logic is robustly tested.
| self.assertIsInstance(ds, Dataset) | |
| self.assertEqual(len(ds), 2) | |
| self.assertIsInstance(ds, Dataset) | |
| self.assertEqual(len(ds), 2) | |
| first_call_msgs = mock_content.call_args_list[0][0][0] | |
| self.assertEqual(first_call_msgs[0]['image_url'], "/fake/img1.png") | |
| self.assertEqual(len([m for m in first_call_msgs if m.get('type') == 'image_url']), 1) |
| self.assertEqual(len(ds), 1) | ||
| answer = ds[0]["answer"] | ||
| self.assertEqual(answer["category"], "cat") |
There was a problem hiding this comment.
The test test_load_with_image_path only asserts the length of the dataset and the category, but does not verify that the image_path column was actually processed and passed to dump_image. Adding assertions to verify the mock calls ensures the logic is robustly tested.
| self.assertEqual(len(ds), 1) | |
| answer = ds[0]["answer"] | |
| self.assertEqual(answer["category"], "cat") | |
| self.assertEqual(len(ds), 1) | |
| answer = ds[0]["answer"] | |
| self.assertEqual(answer["category"], "cat") | |
| self.assertEqual(mock_dump_image.call_args_list[0][0][0]['image_path'], "/some/path.png") |
| }) | ||
| with patch("pandas.read_csv", return_value=df): | ||
| ds = MMStarDataset.load("/fake/data.tsv") | ||
| self.assertEqual(len(ds), 2) |
There was a problem hiding this comment.
The test test_load_image_map_short_entry_redirect only asserts the length of the dataset, but does not verify that the short image reference was actually redirected to the long image data. Adding assertions to verify the mock calls ensures the redirection logic is robustly tested.
| self.assertEqual(len(ds), 2) | |
| self.assertEqual(len(ds), 2) | |
| self.assertEqual(mock_dump_image.call_args_list[0][0][0]['image'], long_img) | |
| self.assertEqual(mock_dump_image.call_args_list[1][0][0]['image'], long_img) |
Summary
Add unit test coverage for two multi-modal benchmark datasets: MMMU-Pro and MMStar.
Changes
tests/UT/datasets/test_mmmu_pro.py(+411 lines)tests/UT/datasets/test_mmstar.py(+374 lines)MMMU-Pro (
test_mmmu_pro.py— 6 test classes)TestConstantsTestCotPostprocTestMMMUProEvaluatorTestMMMUProCotEvaluatorTestMMMUProOptions10DatasetLoadTestMMMUProVisionDatasetLoadMMStar (
test_mmstar.py— 2 test classes)TestMMStarEvaluatorScoreTestMMStarDatasetLoadVerification
全量UT测试结果
