[UT]: add UT coverage for MMMU-Pro and MMStar datasets by wanlongze · Pull Request #326 · AISBench/benchmark

wanlongze · 2026-06-04T03:43:26Z

Summary

Add unit test coverage for two multi-modal benchmark datasets: MMMU-Pro and MMStar.

Changes

Added tests/UT/datasets/test_mmmu_pro.py (+411 lines)
Added tests/UT/datasets/test_mmstar.py (+374 lines)

MMMU-Pro (`test_mmmu_pro.py` — 6 test classes)

| Test Class | What's Covered |
|---|---|
| `TestConstants` | Module-level constants validation |
| `TestCotPostproc` | Chain-of-thought post-processing logic |
| `TestMMMUProEvaluator` | Standard evaluation scoring (choice/open types) |
| `TestMMMUProCotEvaluator` | CoT evaluation with reasoning extraction |
| `TestMMMUProOptions10DatasetLoad` | Dataset loading for options10 variant |
| `TestMMMUProVisionDatasetLoad` | Vision split data loading and format validation |

MMStar (`test_mmstar.py` — 2 test classes)

| Test Class | What's Covered |
|---|---|
| `TestMMStarEvaluatorScore` | Evaluation scoring pipeline (correct/incorrect parsing, edge cases including empty predictions, non-exact matches, JSON parsing, and incomplete answers) |
| `TestMMStarDatasetLoad` | Dataset loading, sample count validation, and data schema verification |

Verification

Full UT test results

gemini-code-assist

Code Review

This pull request introduces comprehensive unit test suites for the MMMU-Pro and MMStar dataset loaders and evaluators, ensuring robust coverage of data loading, CoT post-processing, and evaluation metrics. The reviewer feedback highlights several opportunities to strengthen these tests by adding assertions to verify mock calls (such as verifying that image paths are correctly processed and passed to dump_image or get_content_str). Additionally, the reviewer notes a limitation in cot_postproc where the 'A' in 'Answer:' is incorrectly matched when no other uppercase letters are present, suggesting documenting this behavior in the tests.


gemini-code-assist · 2026-06-04T03:45:42Z

+    def test_answer_with_no_uppercase(self):
+        response = "Some reasoning\nAnswer: 42"
+        result = cot_postproc(response)
+        self.assertEqual(result, "A")


The test test_answer_with_no_uppercase asserts that cot_postproc returns 'A' for the input 'Some reasoning\nAnswer: 42'. This is actually due to a bug in cot_postproc where it counts the 'A' in the prefix 'Answer:' as the predicted option because it is the only uppercase letter in the line. This means any answer line without other uppercase letters (e.g., 'Answer: 42' or 'Answer: b') will incorrectly return 'A'. Please add a comment to document this known limitation/bug so it can be fixed in the future without breaking the test suite unexpectedly.

Suggested change

    def test_answer_with_no_uppercase(self):
        response = "Some reasoning\nAnswer: 42"
        result = cot_postproc(response)
        # NOTE: This currently returns "A" due to a bug in cot_postproc where the "A" in "Answer:" is counted.
        # Once cot_postproc is fixed, this assertion should be updated to expect "42" or the raw response.
        self.assertEqual(result, "A")
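To make the reported behaviour concrete, here is a minimal sketch. The real `cot_postproc` is not shown in this review, so `cot_postproc_sketch` is a hypothetical stand-in that only reproduces the symptom (uppercase option letters are collected from the whole last line, prefix included). `cot_postproc_prefix_aware` is one possible fix under the same assumptions, not the project's implementation.

```python
import re


def _last_nonempty_line(response: str) -> str:
    """Return the last non-blank line of the response, or '' if there is none."""
    lines = [ln for ln in response.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""


def cot_postproc_sketch(response: str) -> str:
    """Hypothetical stand-in: scans the whole last line, so the 'A' of 'Answer:' counts."""
    last = _last_nonempty_line(response)
    letters = re.findall(r"[A-J]", last)
    return letters[-1] if letters else last


def cot_postproc_prefix_aware(response: str) -> str:
    """Possible fix (assumption): drop the 'Answer:' prefix before looking for an option."""
    last = _last_nonempty_line(response)
    body = re.sub(r"^\s*Answer\s*:\s*", "", last)
    match = re.search(r"\b([A-J])\b", body)
    return match.group(1) if match else body.strip()


buggy_numeric = cot_postproc_sketch("Some reasoning\nAnswer: 42")
buggy_lower = cot_postproc_sketch("reasoning\nAnswer: b")
fixed_numeric = cot_postproc_prefix_aware("Some reasoning\nAnswer: 42")
fixed_option = cot_postproc_prefix_aware("reasoning\nAnswer: (C)")
```

With the stand-in, both `Answer: 42` and `Answer: b` collapse to the letter taken from the prefix, which is the behaviour the existing assertions pin down. The prefix-aware variant falls back to the raw answer text when no option letter is present.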

gemini-code-assist · 2026-06-04T03:45:42Z

+    def test_lowercase_not_detected(self):
+        response = "reasoning\nAnswer: a"
+        result = cot_postproc(response)
+        self.assertEqual(result, "A")


The test test_lowercase_not_detected asserts that cot_postproc returns 'A' for the input 'reasoning\nAnswer: a'. Similar to the bug in test_answer_with_no_uppercase, this is because the 'A' in 'Answer:' is the only uppercase letter counted. If the response was 'Answer: b', it would still incorrectly return 'A'. Please add a comment to document this behavior.

Suggested change

    def test_lowercase_not_detected(self):
        response = "reasoning\nAnswer: a"
        result = cot_postproc(response)
        # NOTE: This currently returns "A" due to a bug in cot_postproc where the "A" in "Answer:" is counted.
        # If the response was "Answer: b", it would still incorrectly return "A".
        self.assertEqual(result, "A")

gemini-code-assist · 2026-06-04T03:45:42Z

+        self.assertIsInstance(ds, Dataset)
+        self.assertEqual(len(ds), 2)


The test test_load_with_image_path_column only asserts the length of the dataset, but does not verify that the image_path column was actually processed and passed to dump_image. Adding assertions to verify the mock calls ensures the logic is robustly tested.

Suggested change

        self.assertIsInstance(ds, Dataset)
        self.assertEqual(len(ds), 2)
        self.assertEqual(mock_dump.call_args_list[0][0][0]['image_path'], "/path/a.png")
        self.assertEqual(mock_dump.call_args_list[1][0][0]['image_path'], "/path/b.png")
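The triple index `call_args_list[i][0][0]` can be hard to read, so here is a small sketch of what it selects. It assumes, as the suggestion above implies, that the patched `dump_image` receives the row as its first positional argument. The `rows` data and the return value are made up for illustration.

```python
from unittest.mock import MagicMock

# Hypothetical stand-in for the patched dump_image; assumed to take the row
# as its first positional argument.
mock_dump = MagicMock(return_value="/fake/img.png")

# Illustrative rows only; the real test builds its own fixture data.
rows = [
    {"id": "q1", "image_path": "/path/a.png"},
    {"id": "q2", "image_path": "/path/b.png"},
]
for row in rows:
    mock_dump(row)

# Each entry in call_args_list is a call object that unpacks as (args, kwargs).
first_call = mock_dump.call_args_list[0]
positional_args, keyword_args = first_call
first_row = positional_args[0]  # same object as call_args_list[0][0][0]

# On Python 3.8+, the .args attribute is an equivalent and more readable spelling.
second_row = mock_dump.call_args_list[1].args[0]
```

Asserting on `first_row["image_path"]` and `second_row["image_path"]` then checks what the loader actually passed to the mock, rather than only how many rows came out.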

gemini-code-assist · 2026-06-04T03:45:43Z

+        self.assertIsInstance(ds, Dataset)
+        self.assertEqual(len(ds), 2)


The test test_list_image_path only asserts the length of the dataset, but does not verify that both image paths returned by dump_image are actually included in the messages passed to get_content_str. Adding assertions to verify the mock calls ensures the list-handling logic is robustly tested.

Suggested change

        self.assertIsInstance(ds, Dataset)
        self.assertEqual(len(ds), 2)
        first_call_msgs = mock_content.call_args_list[0][0][0]
        self.assertEqual(first_call_msgs[0]['image_url'], "/fake/img1.png")
        self.assertEqual(first_call_msgs[1]['image_url'], "/fake/img2.png")

gemini-code-assist · 2026-06-04T03:45:43Z

+        self.assertIsInstance(ds, Dataset)
+        self.assertEqual(len(ds), 2)


The test test_load_with_image_path_column only asserts the length of the dataset, but does not verify that the image_path column was actually processed and passed to dump_image. Adding assertions to verify the mock calls ensures the logic is robustly tested.

Suggested change

        self.assertIsInstance(ds, Dataset)
        self.assertEqual(len(ds), 2)
        self.assertEqual(mock_dump.call_args_list[0][0][0]['image_path'], "/path/a.png")
        self.assertEqual(mock_dump.call_args_list[1][0][0]['image_path'], "/path/b.png")

gemini-code-assist · 2026-06-04T03:45:43Z

+        self.assertIsInstance(ds, Dataset)
+        self.assertEqual(len(ds), 2)


The test test_list_image_path only asserts the length of the dataset, but does not verify that only the first image path from the list is used for the vision dataset (as per the implementation of MMMUProVisionDataset.load). Adding assertions to verify the mock calls ensures the list-handling logic is robustly tested.

Suggested change

        self.assertIsInstance(ds, Dataset)
        self.assertEqual(len(ds), 2)
        first_call_msgs = mock_content.call_args_list[0][0][0]
        self.assertEqual(first_call_msgs[0]['image_url'], "/fake/img1.png")
        self.assertEqual(len([m for m in first_call_msgs if m.get('type') == 'image_url']), 1)
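The difference between the two `test_list_image_path` suggestions (all images forwarded versus only the first) can be sketched as follows. `build_messages` and its `first_only` flag are hypothetical helpers written for this illustration. The message shape (`type` / `image_url` keys) is assumed from the assertions in the suggestions, not taken from the loader source.

```python
from unittest.mock import MagicMock


def build_messages(image_paths, question, first_only):
    """Hypothetical helper: one image entry per path, followed by the text entry."""
    paths = image_paths[:1] if first_only else image_paths
    messages = [{"type": "image_url", "image_url": p} for p in paths]
    messages.append({"type": "text", "text": question})
    return messages


# Stand-in for the patched get_content_str; assumed to take the message list
# as its first positional argument.
mock_content = MagicMock(return_value="<content>")

paths = ["/fake/img1.png", "/fake/img2.png"]
mock_content(build_messages(paths, "Which option is correct?", first_only=False))
mock_content(build_messages(paths, "Which option is correct?", first_only=True))

all_images_msgs = mock_content.call_args_list[0][0][0]
first_only_msgs = mock_content.call_args_list[1][0][0]

all_image_urls = [m["image_url"] for m in all_images_msgs if m.get("type") == "image_url"]
first_only_urls = [m["image_url"] for m in first_only_msgs if m.get("type") == "image_url"]
```

Counting the `image_url` entries, as the suggestion for the vision dataset does, distinguishes the two behaviours even when the first entry is identical in both cases.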

gemini-code-assist · 2026-06-04T03:45:43Z

+        self.assertEqual(len(ds), 1)
+        answer = ds[0]["answer"]
+        self.assertEqual(answer["category"], "cat")


The test test_load_with_image_path only asserts the length of the dataset and the category, but does not verify that the image_path column was actually processed and passed to dump_image. Adding assertions to verify the mock calls ensures the logic is robustly tested.

Suggested change

        self.assertEqual(len(ds), 1)
        answer = ds[0]["answer"]
        self.assertEqual(answer["category"], "cat")
        self.assertEqual(mock_dump_image.call_args_list[0][0][0]['image_path'], "/some/path.png")

gemini-code-assist · 2026-06-04T03:45:43Z

+        })
+        with patch("pandas.read_csv", return_value=df):
+            ds = MMStarDataset.load("/fake/data.tsv")
+        self.assertEqual(len(ds), 2)


The test test_load_image_map_short_entry_redirect only asserts the length of the dataset, but does not verify that the short image reference was actually redirected to the long image data. Adding assertions to verify the mock calls ensures the redirection logic is robustly tested.

Suggested change

        self.assertEqual(len(ds), 2)
        self.assertEqual(mock_dump_image.call_args_list[0][0][0]['image'], long_img)
        self.assertEqual(mock_dump_image.call_args_list[1][0][0]['image'], long_img)
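For readers unfamiliar with the redirect being tested, here is a rough sketch of the idea. It assumes a TSV layout in which a short `image` value is the `index` of another row that holds the full image payload. The function name, the 64-character threshold, and the sample rows are all assumptions for illustration. The actual `MMStarDataset.load` logic is not shown in this review and may differ.

```python
def resolve_image_refs(rows, short_len=64):
    """Hypothetical resolver: replace short index references with the referenced image data."""
    image_map = {str(r["index"]): r["image"] for r in rows}
    resolved = {}
    for key, value in image_map.items():
        # A short value that names another row is treated as a reference to that row.
        if len(value) <= short_len and value in image_map:
            resolved[key] = image_map[value]
        else:
            resolved[key] = value
    return resolved


# Illustrative data: row 0 carries the payload, row 1 only points at row 0.
long_img = "iVBORw0KGgo" * 10  # placeholder string longer than the threshold
rows = [
    {"index": 0, "image": long_img},
    {"index": 1, "image": "0"},
]
resolved = resolve_image_refs(rows)
```

Under this assumption both rows end up with the same payload, which is why the suggested assertions expect `long_img` in both `dump_image` calls.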

test: add UT coverage for MMMU-Pro and MMStar datasets

7104252

gemini-code-assist Bot reviewed Jun 4, 2026


wanlongze changed the title ~~test: add UT coverage for MMMU-Pro and MMStar datasets~~ [UT]: add UT coverage for MMMU-Pro and MMStar datasets Jun 4, 2026

Keithwwa approved these changes Jun 4, 2026


Keithwwa merged commit f8c887b into AISBench:master Jun 4, 2026
8 checks passed

Keithwwa merged 1 commit into AISBench:master from wanlongze:feat/add-ut-mmmu-pro-mmstar
