[UT]: add UT coverage for MMMU dataset by wanlongze · Pull Request #325 · AISBench/benchmark

wanlongze · 2026-06-04T03:41:40Z

Summary

Add comprehensive unit test coverage for the MMMU (Massive Multi-discipline Multimodal Understanding) dataset module.

Changes

Added tests/UT/datasets/test_mmmu.py (+983 lines)

Test Coverage (23 test classes)

| Category | Test Class | What's Covered |
| --- | --- | --- |
| Constants | `TestConstants` | `IMAGE_MAP_LEN`, `MMMU_SUBSET_LIST`, question types |
| Helpers | `TestSafeList` | Input normalization (None/NaN/list/string/JSON) |
| | `TestAnswerCharacter` | Index-to-letter mapping (A-Z and beyond) |
| | `TestBuildMmmuMcqPrompt` | MCQ prompt template rendering |
| | `TestBuildChoices` | Choice dict construction with NaN handling |
| Parquet | `TestParquetSortKey` | Deterministic file ordering by subject |
| | `TestInferSubjectFromParquetPath` | Subject extraction from path patterns |
| | `TestFindMmmuParquetFiles` | File discovery, filtering, dedup |
| Image I/O | `TestResolveMmmuExistingImagePath` | Absolute/relative path resolution |
| | `TestWriteMmmuImageBytes` | Binary write with parent dir creation |
| | `TestBuildMmmuImagePath` | Path construction with sanitization |
| | `TestDumpMmmuImage` | Multi-format image handling (bytes/dict/string/object) |
| | `TestCollectMmmuImages` | Image field aggregation from records |
| | `TestDumpImage` | Top-level image dump with path fallback |
| Multi-modal | `TestParseMmmuTextWithImages` | `<image N>` placeholder parsing |
| | `TestSplitMMMU` | Message splitting at image boundaries |
| Prediction | `TestParseMmmuChoicePrediction` | Choice extraction from model output |
| | `TestExtractMmmuOpenPrediction` | Open-ended answer extraction |
| Inference | `TestCanInferOption` / `TestCanInferText` / `TestCanInfer` | Answer inference logic |
| | `TestSortKey` | Result sorting key |
| Evaluator | `TestMMMUEvaluator` | Scoring (choice/open types, case-insensitive, edge cases) |

Verification

The result of the full UT run improved to 82.8.

gemini-code-assist

Code Review

This pull request introduces a comprehensive suite of unit tests for the MMMU dataset and evaluator components in tests/UT/datasets/test_mmmu.py. The review feedback highlights several opportunities to improve the robustness of these tests. Key recommendations include refining mock setups (such as for _dump_mmmu_image and os.path.exists) to prevent bypassing core logic, strengthening assertions to verify exact structures and values rather than just lengths or partial containment, fixing a potential runtime error in can_infer by passing a dictionary instead of a set, and renaming a test to accurately reflect that empty text parts are preserved rather than filtered.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-04T03:43:30Z

+    @patch('ais_bench.benchmark.datasets.mmmu._dump_mmmu_image')
+    def test_image_fields(self, mock_dump):
+        mock_dump.return_value = '/tmp/img.png'
+        record = {'image_1': 'data1', 'image_2': 'data2'}
+        result = _collect_mmmu_images(record, '/tmp', None)
+        self.assertIn(1, result)
+        self.assertIn(2, result)
+


The mock setup for _dump_mmmu_image is too broad. By setting mock_dump.return_value = '/tmp/img.png', it returns a path even when the candidate image is None (which happens for image_3 to image_7). This causes result to contain keys 1 to 7 instead of just 1 and 2. We should use side_effect to return a path only when the candidate is not None, and assert the exact dictionary structure.

Suggested change

     @patch('ais_bench.benchmark.datasets.mmmu._dump_mmmu_image')
     def test_image_fields(self, mock_dump):
-        mock_dump.return_value = '/tmp/img.png'
+        mock_dump.side_effect = lambda candidate, *args, **kwargs: '/tmp/img.png' if candidate is not None else None
         record = {'image_1': 'data1', 'image_2': 'data2'}
         result = _collect_mmmu_images(record, '/tmp', None)
-        self.assertIn(1, result)
-        self.assertIn(2, result)
+        self.assertEqual(result, {1: '/tmp/img.png', 2: '/tmp/img.png'})

References

Ensure mock objects are configured precisely to return realistic values for all inputs, preventing false positives where tests pass despite incorrect behavior.
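The difference between `return_value` and `side_effect` described in this comment can be reproduced in isolation. The sketch below does not use the real `_collect_mmmu_images`; `collect_images` is a hypothetical stand-in that assumes the collector probes `image_1` through `image_7` and keeps every index for which the dump helper returns a path.

```python
from unittest.mock import Mock


def collect_images(record, dump):
    """Hypothetical stand-in: probe image_1..image_7, keep indexes with a path."""
    images = {}
    for idx in range(1, 8):
        path = dump(record.get(f"image_{idx}"))
        if path:
            images[idx] = path
    return images


record = {"image_1": "data1", "image_2": "data2"}

# return_value ignores the argument, so absent fields (None) also yield a path.
broad = Mock(return_value="/tmp/img.png")
broad_result = collect_images(record, broad)

# side_effect inspects the argument, mirroring a helper that skips None.
precise = Mock(
    side_effect=lambda candidate: "/tmp/img.png" if candidate is not None else None
)
precise_result = collect_images(record, precise)

print(sorted(broad_result))  # [1, 2, 3, 4, 5, 6, 7]
print(precise_result)        # {1: '/tmp/img.png', 2: '/tmp/img.png'}
```

With the broad mock, `assertIn(1, result)` and `assertIn(2, result)` still pass even though five extra keys are present, which is the false positive the comment warns about.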

gemini-code-assist · 2026-06-04T03:43:30Z

+    @patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['rel.png'])
+    @patch('os.path.exists', return_value=True)
+    def test_no_image_field_relative_path(self, mock_exists, mock_toliststr):
+        line = {'image_path': 'rel.png'}
+        result = dump_image(line, '/root')
+        self.assertEqual(len(result), 1)


Mocking os.path.exists to always return True bypasses the relative path resolution logic under test. To properly test this fallback branch, os.path.exists should return False for the relative path and True for the resolved absolute path.

Suggested change

     @patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['rel.png'])
-    @patch('os.path.exists', return_value=True)
+    @patch('os.path.exists')
     def test_no_image_field_relative_path(self, mock_exists, mock_toliststr):
+        mock_exists.side_effect = [False, True]
         line = {'image_path': 'rel.png'}
         result = dump_image(line, '/root')
-        self.assertEqual(len(result), 1)
+        self.assertEqual(result, ['/root/rel.png'])

References

Avoid over-mocking that bypasses the core logic under test. Mock functions should return realistic sequences of values to exercise all branches of the function.
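To illustrate why an always-true `os.path.exists` skips the fallback branch, here is a minimal sketch. The `resolve` function is hypothetical and only assumes the behavior described above: use the path as given if it exists, otherwise join it onto the root. It is not the real `dump_image`.

```python
import os
from unittest.mock import patch


def resolve(path, root):
    """Hypothetical resolver: try the path as given, then fall back to root/path."""
    if os.path.exists(path):
        return path
    candidate = os.path.join(root, path)
    return candidate if os.path.exists(candidate) else None


def fake_exists(path):
    # Only the root-joined form "exists" in this scenario.
    return path == os.path.join("/root", "rel.png")


# Path-aware mock: the first check fails, so the fallback branch runs.
with patch("os.path.exists", side_effect=fake_exists):
    resolved = resolve("rel.png", "/root")

# Always-true mock: the first check succeeds and the fallback never runs.
with patch("os.path.exists", return_value=True):
    bypassed = resolve("rel.png", "/root")

print(resolved == os.path.join("/root", "rel.png"))  # True
print(bypassed)                                      # rel.png
```

A function keyed on the path is one alternative to the ordered list `[False, True]`. It does not depend on how many times or in what order the code under test calls `os.path.exists`, at the cost of being slightly longer.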

gemini-code-assist · 2026-06-04T03:43:30Z

+    @patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['a.png', 'b.png'])
+    @patch('os.path.exists', return_value=False)
+    def test_no_image_field_all_missing(self, mock_exists, mock_toliststr):
+        line = {'image_path': 'a.png,b.png'}
+        result = dump_image(line, '/root')
+        self.assertEqual(len(result), 2)


Asserting only the length of the result is weak because it doesn't verify that the paths were correctly resolved to their absolute forms. Asserting the exact list of resolved paths is much more robust.

Suggested change

     @patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['a.png', 'b.png'])
     @patch('os.path.exists', return_value=False)
     def test_no_image_field_all_missing(self, mock_exists, mock_toliststr):
         line = {'image_path': 'a.png,b.png'}
         result = dump_image(line, '/root')
-        self.assertEqual(len(result), 2)
+        self.assertEqual(result, ['/root/a.png', '/root/b.png'])

References

Assert the exact expected output structure and values rather than just asserting the length of a collection, to ensure correctness and prevent silent regressions.

gemini-code-assist · 2026-06-04T03:43:30Z

+    def test_option_inferred(self):
+        self.assertEqual(can_infer('The answer is C', {'A', 'B', 'C'}), 'C')


Passing a set to can_infer is risky because can_infer's second argument choices should be a dictionary to be compatible with can_infer_text (which calls .values()). If the option inference fallback is triggered, it would raise an AttributeError. We should pass a dictionary to match the expected type.

Suggested change

     def test_option_inferred(self):
-        self.assertEqual(can_infer('The answer is C', {'A', 'B', 'C'}), 'C')
+        self.assertEqual(can_infer('The answer is C', {'A': 'opt1', 'B': 'opt2', 'C': 'opt3'}), 'C')

References

Pass arguments matching the expected types of the function under test to prevent unexpected runtime errors if internal fallback branches are executed.
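The risk can be shown with a small two-stage matcher. The functions below are hypothetical and are not the real `can_infer`, `can_infer_option`, or `can_infer_text`; the only detail taken from the review is that the text stage needs a mapping, so a set fails as soon as the option stage finds nothing.

```python
def infer_option(answer, choices):
    """Stage 1 (hypothetical): look for exactly one standalone option letter."""
    tokens = answer.replace(".", " ").split()
    hits = [key for key in choices if key in tokens]
    return hits[0] if len(hits) == 1 else None


def infer_text(answer, choices):
    """Stage 2 (hypothetical): match on option text, which requires a mapping."""
    lowered = answer.lower()
    hits = [key for key, value in choices.items() if str(value).lower() in lowered]
    return hits[0] if len(hits) == 1 else None


def infer(answer, choices):
    return infer_option(answer, choices) or infer_text(answer, choices)


# Stage 1 succeeds, so the set is never asked for its values.
from_set = infer("The answer is C", {"A", "B", "C"})

# Stage 1 finds no letter; stage 2 then fails because a set has no .items().
try:
    infer("It is the second one", {"A", "B", "C"})
    set_error = None
except AttributeError as exc:
    set_error = type(exc).__name__

# With a dict, the same input falls through to stage 2 and resolves.
from_dict = infer(
    "It is the second one", {"A": "first", "B": "second", "C": "third"}
)

print(from_set, set_error, from_dict)  # C AttributeError B
```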

gemini-code-assist · 2026-06-04T03:43:30Z

+    def test_subset_list_filter(self):
+        with tempfile.TemporaryDirectory() as tmpdir:
+            root = Path(tmpdir)
+            for subj in ['Math', 'Physics']:
+                d = root / subj
+                d.mkdir()
+                (d / 'validation-001.parquet').touch()
+            result = _find_mmmu_parquet_files(str(root), 'validation', subset_list=['Math'])
+            subjects_found = [_infer_subject_from_parquet_path(f) for f in result]
+            self.assertIn('Math', subjects_found)


Asserting self.assertIn('Math', subjects_found) is weak because it doesn't verify that other subjects (like Physics) are filtered out. We should assert the exact list of found subjects to ensure the filtering logic works correctly.

Suggested change

     def test_subset_list_filter(self):
         with tempfile.TemporaryDirectory() as tmpdir:
             root = Path(tmpdir)
             for subj in ['Math', 'Physics']:
                 d = root / subj
                 d.mkdir()
                 (d / 'validation-001.parquet').touch()
             result = _find_mmmu_parquet_files(str(root), 'validation', subset_list=['Math'])
             subjects_found = [_infer_subject_from_parquet_path(f) for f in result]
-            self.assertIn('Math', subjects_found)
+            self.assertEqual(subjects_found, ['Math'])

References

Ensure filtering tests assert that unwanted elements are actually excluded, rather than just asserting that the wanted element is present.
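A self-contained sketch of the same point follows. `find_files` is a hypothetical finder, not the real `_find_mmmu_parquet_files`; it assumes one sub-directory per subject with files named `<split>-*.parquet`. It shows that a containment check cannot tell a working filter from a missing one.

```python
import tempfile
from pathlib import Path


def find_files(root, split, subset_list=None):
    """Hypothetical finder: <root>/<subject>/<split>-*.parquet, optionally filtered."""
    files = sorted(Path(root).glob(f"*/{split}-*.parquet"))
    if subset_list is not None:
        files = [f for f in files if f.parent.name in subset_list]
    return files


with tempfile.TemporaryDirectory() as tmpdir:
    root = Path(tmpdir)
    for subject in ["Math", "Physics"]:
        (root / subject).mkdir()
        (root / subject / "validation-001.parquet").touch()

    unfiltered = [f.parent.name for f in find_files(root, "validation")]
    filtered = [f.parent.name for f in find_files(root, "validation", ["Math"])]

# The containment check passes for both results...
assert "Math" in unfiltered and "Math" in filtered
# ...while only the equality check distinguishes them.
assert unfiltered == ["Math", "Physics"]
assert filtered == ["Math"]
```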

gemini-code-assist · 2026-06-04T03:43:30Z

+    def test_text_with_placeholder(self):
+        image_map = {1: '/path/img1.png'}
+        text = 'Look at <image 1> carefully.'
+        result = _parse_mmmu_text_with_images(text, image_map)
+        self.assertEqual(len(result), 3)
+        self.assertEqual(result[0]['type'], 'text')
+        self.assertEqual(result[0]['text'], 'Look at ')
+        self.assertEqual(result[1]['type'], 'image_url')
+        self.assertEqual(result[1]['image_url'], '/path/img1.png')
+        self.assertEqual(result[2]['type'], 'text')
+


Asserting individual elements of the list is verbose and less robust than asserting the entire list structure. Asserting the entire list ensures that no extra elements are present and that the order is correct.

Suggested change

     def test_text_with_placeholder(self):
         image_map = {1: '/path/img1.png'}
         text = 'Look at <image 1> carefully.'
         result = _parse_mmmu_text_with_images(text, image_map)
-        self.assertEqual(len(result), 3)
-        self.assertEqual(result[0]['type'], 'text')
-        self.assertEqual(result[0]['text'], 'Look at ')
-        self.assertEqual(result[1]['type'], 'image_url')
-        self.assertEqual(result[1]['image_url'], '/path/img1.png')
-        self.assertEqual(result[2]['type'], 'text')
+        self.assertEqual(result, [
+            {'type': 'text', 'text': 'Look at '},
+            {'type': 'image_url', 'image_url': '/path/img1.png'},
+            {'type': 'text', 'text': ' carefully.'}
+        ])

References

Assert the entire expected structure and values of a collection in a single assertion, which is more robust and readable than multiple individual element assertions.
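For readers unfamiliar with the placeholder format, the sketch below shows one way such a parser could work and how a whole-list comparison reads. `parse_text_with_images` is a hypothetical reimplementation, not the real `_parse_mmmu_text_with_images`; it assumes every referenced index is present in `image_map` and that text segments are emitted even when empty.

```python
import re

IMAGE_TAG = re.compile(r"<image (\d+)>")


def parse_text_with_images(text, image_map):
    """Hypothetical parser: split text at <image N> tags into typed parts."""
    parts = []
    cursor = 0
    for match in IMAGE_TAG.finditer(text):
        # Text before the tag (possibly empty), then the referenced image.
        parts.append({"type": "text", "text": text[cursor:match.start()]})
        parts.append(
            {"type": "image_url", "image_url": image_map[int(match.group(1))]}
        )
        cursor = match.end()
    # Trailing text after the last tag (possibly empty).
    parts.append({"type": "text", "text": text[cursor:]})
    return parts


image_map = {1: "/path/img1.png"}
parsed = parse_text_with_images("Look at <image 1> carefully.", image_map)

expected = [
    {"type": "text", "text": "Look at "},
    {"type": "image_url", "image_url": "/path/img1.png"},
    {"type": "text", "text": " carefully."},
]
assert parsed == expected
```

The per-element assertions in the original test never check the text of the third part, so a parser that dropped or altered ' carefully.' would still pass; the single equality check closes that gap.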

gemini-code-assist · 2026-06-04T03:43:30Z

+    def test_empty_text_parts_filtered(self):
+        msgs = [
+            {'type': 'text', 'text': '<image 1>'},
+            {'type': 'image_url', 'image_url': 'url1'},
+        ]
+        result = split_MMMU(msgs)
+        self.assertEqual(len(result), 3)
+        self.assertEqual(result[0], {'type': 'text', 'text': ''})
+        self.assertEqual(result[1], {'type': 'image_url', 'image_url': 'url1'})
+        self.assertEqual(result[2], {'type': 'text', 'text': ''})


The test name test_empty_text_parts_filtered is misleading because the assertions verify that empty text parts are actually preserved (not filtered). We should rename the test to accurately reflect its behavior.

Suggested change

-    def test_empty_text_parts_filtered(self):
+    def test_empty_text_parts_preserved(self):
         msgs = [
             {'type': 'text', 'text': '<image 1>'},
             {'type': 'image_url', 'image_url': 'url1'},
         ]
         result = split_MMMU(msgs)
         self.assertEqual(len(result), 3)
         self.assertEqual(result[0], {'type': 'text', 'text': ''})
         self.assertEqual(result[1], {'type': 'image_url', 'image_url': 'url1'})
         self.assertEqual(result[2], {'type': 'text', 'text': ''})

References

Test names should accurately reflect the behavior being asserted to prevent confusion and maintain readability.

test: add UT coverage for MMMU dataset

11134b6

gemini-code-assist Bot reviewed Jun 4, 2026


wanlongze changed the title ~~test: add UT coverage for MMMU dataset~~ [UT]: add UT coverage for MMMU dataset Jun 4, 2026

Keithwwa approved these changes Jun 4, 2026


Keithwwa merged commit b5bc542 into AISBench:master Jun 4, 2026
9 checks passed

Keithwwa merged 1 commit into AISBench:master from wanlongze:feat/add-ut-mmmu
