[UT]: add UT coverage for MMMU dataset#325
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive suite of unit tests for the MMMU dataset and evaluator components in tests/UT/datasets/test_mmmu.py. The review feedback highlights several opportunities to improve the robustness of these tests. Key recommendations include refining mock setups (such as for _dump_mmmu_image and os.path.exists) to prevent bypassing core logic, strengthening assertions to verify exact structures and values rather than just lengths or partial containment, fixing a potential runtime error in can_infer by passing a dictionary instead of a set, and renaming a test to accurately reflect that empty text parts are preserved rather than filtered.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| @patch('ais_bench.benchmark.datasets.mmmu._dump_mmmu_image') | ||
| def test_image_fields(self, mock_dump): | ||
| mock_dump.return_value = '/tmp/img.png' | ||
| record = {'image_1': 'data1', 'image_2': 'data2'} | ||
| result = _collect_mmmu_images(record, '/tmp', None) | ||
| self.assertIn(1, result) | ||
| self.assertIn(2, result) | ||
|
|
There was a problem hiding this comment.
The mock setup for _dump_mmmu_image is too broad. By setting mock_dump.return_value = '/tmp/img.png', it returns a path even when the candidate image is None (which happens for image_3 to image_7). This causes result to contain keys 1 to 7 instead of just 1 and 2. We should use side_effect to return a path only when the candidate is not None, and assert the exact dictionary structure.
| @patch('ais_bench.benchmark.datasets.mmmu._dump_mmmu_image') | |
| def test_image_fields(self, mock_dump): | |
| mock_dump.return_value = '/tmp/img.png' | |
| record = {'image_1': 'data1', 'image_2': 'data2'} | |
| result = _collect_mmmu_images(record, '/tmp', None) | |
| self.assertIn(1, result) | |
| self.assertIn(2, result) | |
| @patch('ais_bench.benchmark.datasets.mmmu._dump_mmmu_image') | |
| def test_image_fields(self, mock_dump): | |
| mock_dump.side_effect = lambda candidate, *args, **kwargs: '/tmp/img.png' if candidate is not None else None | |
| record = {'image_1': 'data1', 'image_2': 'data2'} | |
| result = _collect_mmmu_images(record, '/tmp', None) | |
| self.assertEqual(result, {1: '/tmp/img.png', 2: '/tmp/img.png'}) |
References
- Ensure mock objects are configured precisely to return realistic values for all inputs, preventing false positives where tests pass despite incorrect behavior.
| @patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['rel.png']) | ||
| @patch('os.path.exists', return_value=True) | ||
| def test_no_image_field_relative_path(self, mock_exists, mock_toliststr): | ||
| line = {'image_path': 'rel.png'} | ||
| result = dump_image(line, '/root') | ||
| self.assertEqual(len(result), 1) |
There was a problem hiding this comment.
Mocking os.path.exists to always return True bypasses the relative path resolution logic under test. To properly test this fallback branch, os.path.exists should return False for the relative path and True for the resolved absolute path.
| @patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['rel.png']) | |
| @patch('os.path.exists', return_value=True) | |
| def test_no_image_field_relative_path(self, mock_exists, mock_toliststr): | |
| line = {'image_path': 'rel.png'} | |
| result = dump_image(line, '/root') | |
| self.assertEqual(len(result), 1) | |
| @patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['rel.png']) | |
| @patch('os.path.exists') | |
| def test_no_image_field_relative_path(self, mock_exists, mock_toliststr): | |
| mock_exists.side_effect = [False, True] | |
| line = {'image_path': 'rel.png'} | |
| result = dump_image(line, '/root') | |
| self.assertEqual(result, ['/root/rel.png']) |
References
- Avoid over-mocking that bypasses the core logic under test. Mock functions should return realistic sequences of values to exercise all branches of the function.
| @patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['a.png', 'b.png']) | ||
| @patch('os.path.exists', return_value=False) | ||
| def test_no_image_field_all_missing(self, mock_exists, mock_toliststr): | ||
| line = {'image_path': 'a.png,b.png'} | ||
| result = dump_image(line, '/root') | ||
| self.assertEqual(len(result), 2) |
There was a problem hiding this comment.
Asserting only the length of the result is weak because it doesn't verify that the paths were correctly resolved to their absolute forms. Asserting the exact list of resolved paths is much more robust.
| @patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['a.png', 'b.png']) | |
| @patch('os.path.exists', return_value=False) | |
| def test_no_image_field_all_missing(self, mock_exists, mock_toliststr): | |
| line = {'image_path': 'a.png,b.png'} | |
| result = dump_image(line, '/root') | |
| self.assertEqual(len(result), 2) | |
| @patch('ais_bench.benchmark.datasets.mmmu.toliststr', return_value=['a.png', 'b.png']) | |
| @patch('os.path.exists', return_value=False) | |
| def test_no_image_field_all_missing(self, mock_exists, mock_toliststr): | |
| line = {'image_path': 'a.png,b.png'} | |
| result = dump_image(line, '/root') | |
| self.assertEqual(result, ['/root/a.png', '/root/b.png']) |
References
- Assert the exact expected output structure and values rather than just asserting the length of a collection, to ensure correctness and prevent silent regressions.
| def test_option_inferred(self): | ||
| self.assertEqual(can_infer('The answer is C', {'A', 'B', 'C'}), 'C') |
There was a problem hiding this comment.
Passing a set to can_infer is risky because can_infer's second argument choices should be a dictionary to be compatible with can_infer_text (which calls .values()). If the option inference fallback is triggered, it would raise an AttributeError. We should pass a dictionary to match the expected type.
| def test_option_inferred(self): | |
| self.assertEqual(can_infer('The answer is C', {'A', 'B', 'C'}), 'C') | |
| def test_option_inferred(self): | |
| self.assertEqual(can_infer('The answer is C', {'A': 'opt1', 'B': 'opt2', 'C': 'opt3'}), 'C') |
References
- Pass arguments matching the expected types of the function under test to prevent unexpected runtime errors if internal fallback branches are executed.
| def test_subset_list_filter(self): | ||
| with tempfile.TemporaryDirectory() as tmpdir: | ||
| root = Path(tmpdir) | ||
| for subj in ['Math', 'Physics']: | ||
| d = root / subj | ||
| d.mkdir() | ||
| (d / 'validation-001.parquet').touch() | ||
| result = _find_mmmu_parquet_files(str(root), 'validation', subset_list=['Math']) | ||
| subjects_found = [_infer_subject_from_parquet_path(f) for f in result] | ||
| self.assertIn('Math', subjects_found) |
There was a problem hiding this comment.
Asserting self.assertIn('Math', subjects_found) is weak because it doesn't verify that other subjects (like Physics) are filtered out. We should assert the exact list of found subjects to ensure the filtering logic works correctly.
| def test_subset_list_filter(self): | |
| with tempfile.TemporaryDirectory() as tmpdir: | |
| root = Path(tmpdir) | |
| for subj in ['Math', 'Physics']: | |
| d = root / subj | |
| d.mkdir() | |
| (d / 'validation-001.parquet').touch() | |
| result = _find_mmmu_parquet_files(str(root), 'validation', subset_list=['Math']) | |
| subjects_found = [_infer_subject_from_parquet_path(f) for f in result] | |
| self.assertIn('Math', subjects_found) | |
| def test_subset_list_filter(self): | |
| with tempfile.TemporaryDirectory() as tmpdir: | |
| root = Path(tmpdir) | |
| for subj in ['Math', 'Physics']: | |
| d = root / subj | |
| d.mkdir() | |
| (d / 'validation-001.parquet').touch() | |
| result = _find_mmmu_parquet_files(str(root), 'validation', subset_list=['Math']) | |
| subjects_found = [_infer_subject_from_parquet_path(f) for f in result] | |
| self.assertEqual(subjects_found, ['Math']) |
References
- Ensure filtering tests assert that unwanted elements are actually excluded, rather than just asserting that the wanted element is present.
| def test_text_with_placeholder(self): | ||
| image_map = {1: '/path/img1.png'} | ||
| text = 'Look at <image 1> carefully.' | ||
| result = _parse_mmmu_text_with_images(text, image_map) | ||
| self.assertEqual(len(result), 3) | ||
| self.assertEqual(result[0]['type'], 'text') | ||
| self.assertEqual(result[0]['text'], 'Look at ') | ||
| self.assertEqual(result[1]['type'], 'image_url') | ||
| self.assertEqual(result[1]['image_url'], '/path/img1.png') | ||
| self.assertEqual(result[2]['type'], 'text') | ||
|
|
There was a problem hiding this comment.
Asserting individual elements of the list is verbose and less robust than asserting the entire list structure. Asserting the entire list ensures that no extra elements are present and that the order is correct.
| def test_text_with_placeholder(self): | |
| image_map = {1: '/path/img1.png'} | |
| text = 'Look at <image 1> carefully.' | |
| result = _parse_mmmu_text_with_images(text, image_map) | |
| self.assertEqual(len(result), 3) | |
| self.assertEqual(result[0]['type'], 'text') | |
| self.assertEqual(result[0]['text'], 'Look at ') | |
| self.assertEqual(result[1]['type'], 'image_url') | |
| self.assertEqual(result[1]['image_url'], '/path/img1.png') | |
| self.assertEqual(result[2]['type'], 'text') | |
| def test_text_with_placeholder(self): | |
| image_map = {1: '/path/img1.png'} | |
| text = 'Look at <image 1> carefully.' | |
| result = _parse_mmmu_text_with_images(text, image_map) | |
| self.assertEqual(result, [ | |
| {'type': 'text', 'text': 'Look at '}, | |
| {'type': 'image_url', 'image_url': '/path/img1.png'}, | |
| {'type': 'text', 'text': ' carefully.'} | |
| ]) |
References
- Assert the entire expected structure and values of a collection in a single assertion, which is more robust and readable than multiple individual element assertions.
| def test_empty_text_parts_filtered(self): | ||
| msgs = [ | ||
| {'type': 'text', 'text': '<image 1>'}, | ||
| {'type': 'image_url', 'image_url': 'url1'}, | ||
| ] | ||
| result = split_MMMU(msgs) | ||
| self.assertEqual(len(result), 3) | ||
| self.assertEqual(result[0], {'type': 'text', 'text': ''}) | ||
| self.assertEqual(result[1], {'type': 'image_url', 'image_url': 'url1'}) | ||
| self.assertEqual(result[2], {'type': 'text', 'text': ''}) |
There was a problem hiding this comment.
The test name test_empty_text_parts_filtered is misleading because the assertions verify that empty text parts are actually preserved (not filtered). We should rename the test to accurately reflect its behavior.
| def test_empty_text_parts_filtered(self): | |
| msgs = [ | |
| {'type': 'text', 'text': '<image 1>'}, | |
| {'type': 'image_url', 'image_url': 'url1'}, | |
| ] | |
| result = split_MMMU(msgs) | |
| self.assertEqual(len(result), 3) | |
| self.assertEqual(result[0], {'type': 'text', 'text': ''}) | |
| self.assertEqual(result[1], {'type': 'image_url', 'image_url': 'url1'}) | |
| self.assertEqual(result[2], {'type': 'text', 'text': ''}) | |
| def test_empty_text_parts_preserved(self): | |
| msgs = [ | |
| {'type': 'text', 'text': '<image 1>'}, | |
| {'type': 'image_url', 'image_url': 'url1'}, | |
| ] | |
| result = split_MMMU(msgs) | |
| self.assertEqual(len(result), 3) | |
| self.assertEqual(result[0], {'type': 'text', 'text': ''}) | |
| self.assertEqual(result[1], {'type': 'image_url', 'image_url': 'url1'}) | |
| self.assertEqual(result[2], {'type': 'text', 'text': ''}) |
References
- Test names should accurately reflect the behavior being asserted to prevent confusion and maintain readability.
Summary
Add comprehensive unit test coverage for the MMMU (Multi-Modal Understanding) dataset module.
Changes
tests/UT/datasets/test_mmmu.py(+983 lines)Test Coverage (23 test classes)
TestConstantsIMAGE_MAP_LEN,MMMU_SUBSET_LIST, question typesTestSafeListTestAnswerCharacterTestBuildMmmuMcqPromptTestBuildChoicesTestParquetSortKeyTestInferSubjectFromParquetPathTestFindMmmuParquetFilesTestResolveMmmuExistingImagePathTestWriteMmmuImageBytesTestBuildMmmuImagePathTestDumpMmmuImageTestCollectMmmuImagesTestDumpImageTestParseMmmuTextWithImages<image N>placeholder parsingTestSplitMMMUTestParseMmmuChoicePredictionTestExtractMmmuOpenPredictionTestCanInferOption/TestCanInferText/TestCanInferTestSortKeyTestMMMUEvaluatorVerification
全量UT测试效果 提升至 82.8
