Skip to content

[UT]: add UT coverage for MMMU dataset#325

Merged
Keithwwa merged 1 commit into
AISBench:masterfrom
wanlongze:feat/add-ut-mmmu
Jun 4, 2026
Merged

[UT]: add UT coverage for MMMU dataset#325
Keithwwa merged 1 commit into
AISBench:masterfrom
wanlongze:feat/add-ut-mmmu

Conversation

@wanlongze

@wanlongze wanlongze commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary

Add comprehensive unit test coverage for the MMMU (Multi-Modal Understanding) dataset module.

Changes

  • Added tests/UT/datasets/test_mmmu.py (+983 lines)

Test Coverage (23 test classes)

Category Test Class What's Covered
Constants TestConstants IMAGE_MAP_LEN, MMMU_SUBSET_LIST, question types
Helpers TestSafeList Input normalization (None/NaN/list/string/JSON)
TestAnswerCharacter Index-to-letter mapping (A-Z and beyond)
TestBuildMmmuMcqPrompt MCQ prompt template rendering
TestBuildChoices Choice dict construction with NaN handling
Parquet TestParquetSortKey Deterministic file ordering by subject
TestInferSubjectFromParquetPath Subject extraction from path patterns
TestFindMmmuParquetFiles File discovery, filtering, dedup
Image I/O TestResolveMmmuExistingImagePath Absolute/relative path resolution
TestWriteMmmuImageBytes Binary write with parent dir creation
TestBuildMmmuImagePath Path construction with sanitization
TestDumpMmmuImage Multi-format image handling (bytes/dict/string/object)
TestCollectMmmuImages Image field aggregation from records
TestDumpImage Top-level image dump with path fallback
Multi-modal TestParseMmmuTextWithImages <image N> placeholder parsing
TestSplitMMMU Message splitting at image boundaries
Prediction TestParseMmmuChoicePrediction Choice extraction from model output
TestExtractMmmuOpenPrediction Open-ended answer extraction
Inference TestCanInferOption / TestCanInferText / TestCanInfer Answer inference logic
TestSortKey Result sorting key
Evaluator TestMMMUEvaluator Scoring (choice/open types, case-insensitive, edge cases)

Verification

全量UT测试效果 提升至 82.8
image

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive suite of unit tests for the MMMU dataset and evaluator components in tests/UT/datasets/test_mmmu.py. The review feedback highlights several opportunities to improve the robustness of these tests. Key recommendations include refining mock setups (such as for _dump_mmmu_image and os.path.exists) to prevent bypassing core logic, strengthening assertions to verify exact structures and values rather than just lengths or partial containment, fixing a potential runtime error in can_infer by passing a dictionary instead of a set, and renaming a test to accurately reflect that empty text parts are preserved rather than filtered.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +459 to +466
@patch('ais_bench.benchmark.datasets.mmmu._dump_mmmu_image')
def test_image_fields(self, mock_dump):
mock_dump.return_value = '/tmp/img.png'
record = {'image_1': 'data1', 'image_2': 'data2'}
result = _collect_mmmu_images(record, '/tmp', None)
self.assertIn(1, result)
self.assertIn(2, result)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The mock setup for _dump_mmmu_image is too broad. By setting mock_dump.return_value = '/tmp/img.png', it returns a path even when the candidate image is None (which happens for image_3 to image_7). This causes result to contain keys 1 to 7 instead of just 1 and 2. We should use side_effect to return a path only when the candidate is not None, and assert the exact dictionary structure.

Suggested change
@patch('ais_bench.benchmark.datasets.mmmu._dump_mmmu_image')
def test_image_fields(self, mock_dump):
mock_dump.return_value = '/tmp/img.png'
record = {'image_1': 'data1', 'image_2': 'data2'}
result = _collect_mmmu_images(record, '/tmp', None)
self.assertIn(1, result)
self.assertIn(2, result)
@patch('ais_bench.benchmark.datasets.mmmu._dump_mmmu_image')
def test_image_fields(self, mock_dump):
mock_dump.side_effect = lambda candidate, *args, **kwargs: '/tmp/img.png' if candidate is not None else None
record = {'image_1': 'data1', 'image_2': 'data2'}
result = _collect_mmmu_images(record, '/tmp', None)
self.assertEqual(result, {1: '/tmp/img.png', 2: '/tmp/img.png'})
References
  1. Ensure mock objects are configured precisely to return realistic values for all inputs, preventing false positives where tests pass despite incorrect behavior.

Comment on lines +761 to +766
@patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['rel.png'])
@patch('os.path.exists', return_value=True)
def test_no_image_field_relative_path(self, mock_exists, mock_toliststr):
line = {'image_path': 'rel.png'}
result = dump_image(line, '/root')
self.assertEqual(len(result), 1)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Mocking os.path.exists to always return True bypasses the relative path resolution logic under test. To properly test this fallback branch, os.path.exists should return False for the relative path and True for the resolved absolute path.

Suggested change
@patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['rel.png'])
@patch('os.path.exists', return_value=True)
def test_no_image_field_relative_path(self, mock_exists, mock_toliststr):
line = {'image_path': 'rel.png'}
result = dump_image(line, '/root')
self.assertEqual(len(result), 1)
@patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['rel.png'])
@patch('os.path.exists')
def test_no_image_field_relative_path(self, mock_exists, mock_toliststr):
mock_exists.side_effect = [False, True]
line = {'image_path': 'rel.png'}
result = dump_image(line, '/root')
self.assertEqual(result, ['/root/rel.png'])
References
  1. Avoid over-mocking that bypasses the core logic under test. Mock functions should return realistic sequences of values to exercise all branches of the function.

Comment on lines +778 to +783
@patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['a.png', 'b.png'])
@patch('os.path.exists', return_value=False)
def test_no_image_field_all_missing(self, mock_exists, mock_toliststr):
line = {'image_path': 'a.png,b.png'}
result = dump_image(line, '/root')
self.assertEqual(len(result), 2)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Asserting only the length of the result is weak because it doesn't verify that the paths were correctly resolved to their absolute forms. Asserting the exact list of resolved paths is much more robust.

Suggested change
@patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['a.png', 'b.png'])
@patch('os.path.exists', return_value=False)
def test_no_image_field_all_missing(self, mock_exists, mock_toliststr):
line = {'image_path': 'a.png,b.png'}
result = dump_image(line, '/root')
self.assertEqual(len(result), 2)
@patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['a.png', 'b.png'])
@patch('os.path.exists', return_value=False)
def test_no_image_field_all_missing(self, mock_exists, mock_toliststr):
line = {'image_path': 'a.png,b.png'}
result = dump_image(line, '/root')
self.assertEqual(result, ['/root/a.png', '/root/b.png'])
References
  1. Assert the exact expected output structure and values rather than just asserting the length of a collection, to ensure correctness and prevent silent regressions.

Comment on lines +868 to +869
def test_option_inferred(self):
self.assertEqual(can_infer('The answer is C', {'A', 'B', 'C'}), 'C')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Passing a set to can_infer is risky because can_infer's second argument choices should be a dictionary to be compatible with can_infer_text (which calls .values()). If the option inference fallback is triggered, it would raise an AttributeError. We should pass a dictionary to match the expected type.

Suggested change
def test_option_inferred(self):
self.assertEqual(can_infer('The answer is C', {'A', 'B', 'C'}), 'C')
def test_option_inferred(self):
self.assertEqual(can_infer('The answer is C', {'A': 'opt1', 'B': 'opt2', 'C': 'opt3'}), 'C')
References
  1. Pass arguments matching the expected types of the function under test to prevent unexpected runtime errors if internal fallback branches are executed.

Comment on lines +235 to +244
def test_subset_list_filter(self):
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
for subj in ['Math', 'Physics']:
d = root / subj
d.mkdir()
(d / 'validation-001.parquet').touch()
result = _find_mmmu_parquet_files(str(root), 'validation', subset_list=['Math'])
subjects_found = [_infer_subject_from_parquet_path(f) for f in result]
self.assertIn('Math', subjects_found)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Asserting self.assertIn('Math', subjects_found) is weak because it doesn't verify that other subjects (like Physics) are filtered out. We should assert the exact list of found subjects to ensure the filtering logic works correctly.

Suggested change
def test_subset_list_filter(self):
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
for subj in ['Math', 'Physics']:
d = root / subj
d.mkdir()
(d / 'validation-001.parquet').touch()
result = _find_mmmu_parquet_files(str(root), 'validation', subset_list=['Math'])
subjects_found = [_infer_subject_from_parquet_path(f) for f in result]
self.assertIn('Math', subjects_found)
def test_subset_list_filter(self):
with tempfile.TemporaryDirectory() as tmpdir:
root = Path(tmpdir)
for subj in ['Math', 'Physics']:
d = root / subj
d.mkdir()
(d / 'validation-001.parquet').touch()
result = _find_mmmu_parquet_files(str(root), 'validation', subset_list=['Math'])
subjects_found = [_infer_subject_from_parquet_path(f) for f in result]
self.assertEqual(subjects_found, ['Math'])
References
  1. Ensure filtering tests assert that unwanted elements are actually excluded, rather than just asserting that the wanted element is present.

Comment on lines +502 to +512
def test_text_with_placeholder(self):
image_map = {1: '/path/img1.png'}
text = 'Look at <image 1> carefully.'
result = _parse_mmmu_text_with_images(text, image_map)
self.assertEqual(len(result), 3)
self.assertEqual(result[0]['type'], 'text')
self.assertEqual(result[0]['text'], 'Look at ')
self.assertEqual(result[1]['type'], 'image_url')
self.assertEqual(result[1]['image_url'], '/path/img1.png')
self.assertEqual(result[2]['type'], 'text')

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Asserting individual elements of the list is verbose and less robust than asserting the entire list structure. Asserting the entire list ensures that no extra elements are present and that the order is correct.

Suggested change
def test_text_with_placeholder(self):
image_map = {1: '/path/img1.png'}
text = 'Look at <image 1> carefully.'
result = _parse_mmmu_text_with_images(text, image_map)
self.assertEqual(len(result), 3)
self.assertEqual(result[0]['type'], 'text')
self.assertEqual(result[0]['text'], 'Look at ')
self.assertEqual(result[1]['type'], 'image_url')
self.assertEqual(result[1]['image_url'], '/path/img1.png')
self.assertEqual(result[2]['type'], 'text')
def test_text_with_placeholder(self):
image_map = {1: '/path/img1.png'}
text = 'Look at <image 1> carefully.'
result = _parse_mmmu_text_with_images(text, image_map)
self.assertEqual(result, [
{'type': 'text', 'text': 'Look at '},
{'type': 'image_url', 'image_url': '/path/img1.png'},
{'type': 'text', 'text': ' carefully.'}
])
References
  1. Assert the entire expected structure and values of a collection in a single assertion, which is more robust and readable than multiple individual element assertions.

Comment on lines +693 to +702
def test_empty_text_parts_filtered(self):
msgs = [
{'type': 'text', 'text': '<image 1>'},
{'type': 'image_url', 'image_url': 'url1'},
]
result = split_MMMU(msgs)
self.assertEqual(len(result), 3)
self.assertEqual(result[0], {'type': 'text', 'text': ''})
self.assertEqual(result[1], {'type': 'image_url', 'image_url': 'url1'})
self.assertEqual(result[2], {'type': 'text', 'text': ''})

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The test name test_empty_text_parts_filtered is misleading because the assertions verify that empty text parts are actually preserved (not filtered). We should rename the test to accurately reflect its behavior.

Suggested change
def test_empty_text_parts_filtered(self):
msgs = [
{'type': 'text', 'text': '<image 1>'},
{'type': 'image_url', 'image_url': 'url1'},
]
result = split_MMMU(msgs)
self.assertEqual(len(result), 3)
self.assertEqual(result[0], {'type': 'text', 'text': ''})
self.assertEqual(result[1], {'type': 'image_url', 'image_url': 'url1'})
self.assertEqual(result[2], {'type': 'text', 'text': ''})
def test_empty_text_parts_preserved(self):
msgs = [
{'type': 'text', 'text': '<image 1>'},
{'type': 'image_url', 'image_url': 'url1'},
]
result = split_MMMU(msgs)
self.assertEqual(len(result), 3)
self.assertEqual(result[0], {'type': 'text', 'text': ''})
self.assertEqual(result[1], {'type': 'image_url', 'image_url': 'url1'})
self.assertEqual(result[2], {'type': 'text', 'text': ''})
References
  1. Test names should accurately reflect the behavior being asserted to prevent confusion and maintain readability.

@wanlongze wanlongze changed the title test: add UT coverage for MMMU dataset [UT]: add UT coverage for MMMU dataset Jun 4, 2026
@Keithwwa Keithwwa merged commit b5bc542 into AISBench:master Jun 4, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants