[UT]: add UT coverage for MMMU-Pro and MMStar datasets #326

Merged: Keithwwa merged 1 commit into AISBench:master from wanlongze:feat/add-ut-mmmu-pro-mmstar on Jun 4, 2026
@wanlongze commented on Jun 4, 2026

Summary

Add unit test coverage for two multi-modal benchmark datasets: MMMU-Pro and MMStar.

Changes

  • Added tests/UT/datasets/test_mmmu_pro.py (+411 lines)
  • Added tests/UT/datasets/test_mmstar.py (+374 lines)

MMMU-Pro (test_mmmu_pro.py — 6 test classes)

| Test Class | What's Covered |
| --- | --- |
| TestConstants | Module-level constants validation |
| TestCotPostproc | Chain-of-thought post-processing logic |
| TestMMMUProEvaluator | Standard evaluation scoring (choice/open types) |
| TestMMMUProCotEvaluator | CoT evaluation with reasoning extraction |
| TestMMMUProOptions10DatasetLoad | Dataset loading for options10 variant |
| TestMMMUProVisionDatasetLoad | Vision split data loading and format validation |

MMStar (test_mmstar.py — 2 test classes)

| Test Class | What's Covered |
| --- | --- |
| TestMMStarEvaluatorScore | Evaluation scoring pipeline (correct/incorrect parsing, edge cases including empty predictions, non-exact matches, JSON parsing, and incomplete answers) |
| TestMMStarDatasetLoad | Dataset loading, sample count validation, and data schema verification |

Verification

Full UT test results:
[image]

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces comprehensive unit test suites for the MMMU-Pro and MMStar dataset loaders and evaluators, ensuring robust coverage of data loading, CoT post-processing, and evaluation metrics. The reviewer feedback highlights several opportunities to strengthen these tests by adding assertions to verify mock calls (such as verifying that image paths are correctly processed and passed to dump_image or get_content_str). Additionally, the reviewer notes a limitation in cot_postproc where the 'A' in 'Answer:' is incorrectly matched when no other uppercase letters are present, suggesting documenting this behavior in the tests.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +72 to +75
    def test_answer_with_no_uppercase(self):
        response = "Some reasoning\nAnswer: 42"
        result = cot_postproc(response)
        self.assertEqual(result, "A")

medium

The test test_answer_with_no_uppercase asserts that cot_postproc returns 'A' for the input 'Some reasoning\nAnswer: 42'. This is actually due to a bug in cot_postproc where it counts the 'A' in the prefix 'Answer:' as the predicted option because it is the only uppercase letter in the line. This means any answer line without other uppercase letters (e.g., 'Answer: 42' or 'Answer: b') will incorrectly return 'A'. Please add a comment to document this known limitation/bug so it can be fixed in the future without breaking the test suite unexpectedly.

Suggested change
  def test_answer_with_no_uppercase(self):
      response = "Some reasoning\nAnswer: 42"
      result = cot_postproc(response)
+     # NOTE: This currently returns "A" due to a bug in cot_postproc where the "A" in "Answer:" is counted.
+     # Once cot_postproc is fixed, this assertion should be updated to expect "42" or the raw response.
      self.assertEqual(result, "A")
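
The behavior described in this comment can be reproduced with a small sketch. The real cot_postproc is not shown in this thread, so `buggy_extract` and `fixed_extract` below are hypothetical stand-ins, and both the `[A-J]` option range and the fall-back to the raw response are assumptions made for illustration only.

```python
import re

ANSWER_PREFIX = "Answer:"
OPTION_RE = re.compile(r"[A-J]")  # assumed option range, not taken from the PR


def _last_answer_line(response: str):
    """Return the last line that starts with the answer prefix, or None."""
    for line in reversed(response.strip().splitlines()):
        if line.strip().startswith(ANSWER_PREFIX):
            return line.strip()
    return None


def buggy_extract(response: str) -> str:
    """Hypothetical stand-in: scans the whole line, prefix included."""
    line = _last_answer_line(response)
    if line is None:
        return response
    letters = OPTION_RE.findall(line)  # the "A" of "Answer:" always matches
    return letters[-1] if letters else response


def fixed_extract(response: str) -> str:
    """Hypothetical fix: strip the prefix before looking for an option letter."""
    line = _last_answer_line(response)
    if line is None:
        return response
    letters = OPTION_RE.findall(line[len(ANSWER_PREFIX):])
    return letters[-1] if letters else response
```

Under these assumptions, `buggy_extract` returns "A" for both "Answer: 42" and "Answer: b", while `fixed_extract` falls back to the raw response because no option letter follows the prefix.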

Comment on lines +87 to +90
    def test_lowercase_not_detected(self):
        response = "reasoning\nAnswer: a"
        result = cot_postproc(response)
        self.assertEqual(result, "A")

medium

The test test_lowercase_not_detected asserts that cot_postproc returns 'A' for the input 'reasoning\nAnswer: a'. Similar to the bug in test_answer_with_no_uppercase, this is because the 'A' in 'Answer:' is the only uppercase letter counted. If the response was 'Answer: b', it would still incorrectly return 'A'. Please add a comment to document this behavior.

Suggested change
  def test_lowercase_not_detected(self):
      response = "reasoning\nAnswer: a"
      result = cot_postproc(response)
+     # NOTE: This currently returns "A" due to a bug in cot_postproc where the "A" in "Answer:" is counted.
+     # If the response was "Answer: b", it would still incorrectly return "A".
      self.assertEqual(result, "A")

Comment on lines +309 to +310
self.assertIsInstance(ds, Dataset)
self.assertEqual(len(ds), 2)

medium

The test test_load_with_image_path_column only asserts the length of the dataset, but does not verify that the image_path column was actually processed and passed to dump_image. Adding assertions to verify the mock calls ensures the logic is robustly tested.

Suggested change
  self.assertIsInstance(ds, Dataset)
  self.assertEqual(len(ds), 2)
+ self.assertEqual(mock_dump.call_args_list[0][0][0]['image_path'], "/path/a.png")
+ self.assertEqual(mock_dump.call_args_list[1][0][0]['image_path'], "/path/b.png")
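
For readers unfamiliar with the triple index in the suggestion, here is a minimal sketch of how `call_args_list` is laid out in `unittest.mock`. The `load_rows` helper is hypothetical and only stands in for the loader; the assumption that dump_image receives the whole row as its first positional argument comes from the suggested assertions, not from the loader source.

```python
from unittest.mock import MagicMock


def load_rows(rows, dump_image):
    """Hypothetical loader: hands each row to dump_image and keeps the result."""
    return [dump_image(row) for row in rows]


mock_dump = MagicMock(return_value="/fake/img.png")
rows = [{"image_path": "/path/a.png"}, {"image_path": "/path/b.png"}]
saved = load_rows(rows, mock_dump)

# Each entry of call_args_list unpacks into (positional args, keyword args).
first_args, first_kwargs = mock_dump.call_args_list[0]
assert first_args[0]["image_path"] == "/path/a.png"
assert first_kwargs == {}

# call_args_list[1][0][0] is the same lookup written with plain indexing:
# second call -> positional args tuple -> first positional argument.
assert mock_dump.call_args_list[1][0][0]["image_path"] == "/path/b.png"
assert mock_dump.call_count == 2
```

On Python 3.8 and later, `mock_dump.call_args_list[0].args[0]` reads the same value with a named attribute, which may be easier to follow in review.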

Comment on lines +332 to +333
self.assertIsInstance(ds, Dataset)
self.assertEqual(len(ds), 2)

medium

The test test_list_image_path only asserts the length of the dataset, but does not verify that both image paths returned by dump_image are actually included in the messages passed to get_content_str. Adding assertions to verify the mock calls ensures the list-handling logic is robustly tested.

Suggested change
  self.assertIsInstance(ds, Dataset)
  self.assertEqual(len(ds), 2)
+ first_call_msgs = mock_content.call_args_list[0][0][0]
+ self.assertEqual(first_call_msgs[0]['image_url'], "/fake/img1.png")
+ self.assertEqual(first_call_msgs[1]['image_url'], "/fake/img2.png")
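
A self-contained sketch of the pattern this suggestion relies on: `side_effect` with a list makes the dump_image mock return a different path per call, and the messages handed to get_content_str are then inspected. `build_content` is a hypothetical builder written for this example; the message shape (`type`, `image_url`, `text`) is an assumption inferred from the suggested assertions, not from the dataset code.

```python
from unittest.mock import MagicMock


def build_content(row, dump_image, get_content_str):
    """Hypothetical builder: one image message per path, then the question."""
    paths = row["image_path"]
    if not isinstance(paths, list):
        paths = [paths]
    msgs = [{"type": "image_url", "image_url": dump_image(p)} for p in paths]
    msgs.append({"type": "text", "text": row["question"]})
    return get_content_str(msgs)


# side_effect with a list yields one return value per successive call.
mock_dump = MagicMock(side_effect=["/fake/img1.png", "/fake/img2.png"])
mock_content = MagicMock(return_value="<content>")
row = {"image_path": ["/raw/1.png", "/raw/2.png"], "question": "Which option?"}
content = build_content(row, mock_dump, mock_content)

# First call to get_content_str -> positional args -> the message list.
first_call_msgs = mock_content.call_args_list[0][0][0]
image_urls = [m["image_url"] for m in first_call_msgs if m["type"] == "image_url"]
assert image_urls == ["/fake/img1.png", "/fake/img2.png"]
```

Comparing the full list of image URLs, rather than indexing positions 0 and 1, keeps the check valid if the builder later puts a text message first.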

Comment on lines +381 to +382
self.assertIsInstance(ds, Dataset)
self.assertEqual(len(ds), 2)

medium

The test test_load_with_image_path_column only asserts the length of the dataset, but does not verify that the image_path column was actually processed and passed to dump_image. Adding assertions to verify the mock calls ensures the logic is robustly tested.

Suggested change
  self.assertIsInstance(ds, Dataset)
  self.assertEqual(len(ds), 2)
+ self.assertEqual(mock_dump.call_args_list[0][0][0]['image_path'], "/path/a.png")
+ self.assertEqual(mock_dump.call_args_list[1][0][0]['image_path'], "/path/b.png")

Comment on lines +401 to +402
self.assertIsInstance(ds, Dataset)
self.assertEqual(len(ds), 2)

medium

The test test_list_image_path only asserts the length of the dataset, but does not verify that only the first image path from the list is used for the vision dataset (as per the implementation of MMMUProVisionDataset.load). Adding assertions to verify the mock calls ensures the list-handling logic is robustly tested.

Suggested change
  self.assertIsInstance(ds, Dataset)
  self.assertEqual(len(ds), 2)
+ first_call_msgs = mock_content.call_args_list[0][0][0]
+ self.assertEqual(first_call_msgs[0]['image_url'], "/fake/img1.png")
+ self.assertEqual(len([m for m in first_call_msgs if m.get('type') == 'image_url']), 1)
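
The "first image only" behavior can be pinned down from both sides: what reaches dump_image and what reaches get_content_str. The sketch below uses a hypothetical `build_vision_content` helper, since MMMUProVisionDataset.load is not shown here; passing the bare path to dump_image is an assumption of this example.

```python
from unittest.mock import MagicMock


def build_vision_content(row, dump_image, get_content_str):
    """Hypothetical vision-style builder: only the first path of a list is used."""
    paths = row["image_path"]
    first = paths[0] if isinstance(paths, list) else paths
    msgs = [{"type": "image_url", "image_url": dump_image(first)}]
    return get_content_str(msgs)


mock_dump = MagicMock(return_value="/fake/img1.png")
mock_content = MagicMock(return_value="<content>")
build_vision_content(
    {"image_path": ["/raw/1.png", "/raw/2.png"]}, mock_dump, mock_content
)

# Exactly one image message is built, and it carries the dumped path.
msgs = mock_content.call_args_list[0][0][0]
image_msgs = [m for m in msgs if m.get("type") == "image_url"]
assert len(image_msgs) == 1
assert image_msgs[0]["image_url"] == "/fake/img1.png"

# The second path never reaches dump_image.
mock_dump.assert_called_once_with("/raw/1.png")
```

`assert_called_once_with` fails if the second path is ever dumped, which a count on the message list alone would not catch.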

Comment on lines +225 to +227
self.assertEqual(len(ds), 1)
answer = ds[0]["answer"]
self.assertEqual(answer["category"], "cat")

medium

The test test_load_with_image_path only asserts the length of the dataset and the category, but does not verify that the image_path column was actually processed and passed to dump_image. Adding assertions to verify the mock calls ensures the logic is robustly tested.

Suggested change
  self.assertEqual(len(ds), 1)
  answer = ds[0]["answer"]
  self.assertEqual(answer["category"], "cat")
+ self.assertEqual(mock_dump_image.call_args_list[0][0][0]['image_path'], "/some/path.png")

})
with patch("pandas.read_csv", return_value=df):
ds = MMStarDataset.load("/fake/data.tsv")
self.assertEqual(len(ds), 2)

medium

The test test_load_image_map_short_entry_redirect only asserts the length of the dataset, but does not verify that the short image reference was actually redirected to the long image data. Adding assertions to verify the mock calls ensures the redirection logic is robustly tested.

Suggested change
  self.assertEqual(len(ds), 2)
+ self.assertEqual(mock_dump_image.call_args_list[0][0][0]['image'], long_img)
+ self.assertEqual(mock_dump_image.call_args_list[1][0][0]['image'], long_img)
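
To make the redirection being tested concrete, here is a sketch of one way a short image reference could be resolved to the full image data. MMStarDataset.load is not shown in this thread, so `resolve_image_map`, the 64-character threshold, and the "short cell holds another row's index" convention are all assumptions for illustration.

```python
SHORT_REF_MAX_LEN = 64  # assumed threshold separating references from image data


def resolve_image_map(indices, images):
    """Hypothetical resolver: replace short cells with the row they point at."""
    image_map = {str(i): img for i, img in zip(indices, images)}
    for key in image_map:
        value = image_map[key]
        if len(value) <= SHORT_REF_MAX_LEN:
            # A short cell is treated as the index of the row holding the data.
            target = image_map.get(value)
            if target is None or len(target) <= SHORT_REF_MAX_LEN:
                raise ValueError(f"row {key} points at unusable entry {value!r}")
            image_map[key] = target
    return image_map


long_img = "A" * 100  # stands in for a long base64 string
resolved = resolve_image_map([0, 1], [long_img, "0"])
assert resolved["0"] == long_img
assert resolved["1"] == long_img
```

The suggested assertions check the same property at the mock boundary: both rows reach dump_image carrying the long image data.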

@wanlongze changed the title from "test: add UT coverage for MMMU-Pro and MMStar datasets" to "[UT]: add UT coverage for MMMU-Pro and MMStar datasets" on Jun 4, 2026
@Keithwwa merged commit f8c887b into AISBench:master on Jun 4, 2026
8 checks passed