Fix #8599: Add track_meta and weights_only arguments to PersistentDataset for MetaTensor support. #8628
base: dev
Conversation
Signed-off-by: Mason Cleveland <mccleve@umich.edu>
Walkthrough: Added two public constructor options to `PersistentDataset`.

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Pre-merge checks: ✅ 5 checks passed.
Actionable comments posted: 0
🧹 Nitpick comments (2)
tests/data/test_persistentdataset.py (1)
179-211: Consider validating metadata preservation.

The test correctly validates the type of the returned object, but doesn't verify that metadata is actually preserved when `track_meta=True`. Consider adding an assertion to check that the `MetaTensor` contains expected metadata (e.g., affine, filename). Example enhancement:

```diff
 im = test_dataset[0]["image"]
 self.assertIsInstance(im, expected_type)
+if track_meta and isinstance(im, MetaTensor):
+    self.assertIsNotNone(im.meta.get("filename_or_obj"))
```

monai/data/dataset.py (1)

446-503: Consider adding support for `track_meta` and `weights_only` in `CacheNTransDataset`.

`CacheNTransDataset` inherits `_cachecheck` from `PersistentDataset`, which uses `torch.save`/`torch.load`. Users may want to cache MetaTensors with this dataset type as well. Add the parameters to the constructor:

```diff
 def __init__(
     self,
     data: Sequence,
     transform: Sequence[Callable] | Callable,
     cache_n_trans: int,
     cache_dir: Path | str | None,
     hash_func: Callable[..., bytes] = pickle_hashing,
     pickle_module: str = "pickle",
     pickle_protocol: int = DEFAULT_PROTOCOL,
     hash_transform: Callable[..., bytes] | None = None,
     reset_ops_id: bool = True,
+    track_meta: bool = False,
+    weights_only: bool = True,
 ) -> None:
```

Then pass them to super:

```diff
 super().__init__(
     data=data,
     transform=transform,
     cache_dir=cache_dir,
     hash_func=hash_func,
     pickle_module=pickle_module,
     pickle_protocol=pickle_protocol,
     hash_transform=hash_transform,
     reset_ops_id=reset_ops_id,
+    track_meta=track_meta,
+    weights_only=weights_only,
 )
```
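The `_cachecheck` mechanism both nitpicks refer to follows a common hash-keyed file-cache pattern: hash the input item to a key, load the cached result on a hit, otherwise run the transform and write the result atomically. A minimal stdlib sketch of that pattern, using `pickle` in place of `torch.save`/`torch.load` and a hypothetical `cached_transform` helper (this is an illustration of the idea, not MONAI's actual implementation):

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached_transform(item, transform, cache_dir: Path):
    """Apply `transform` to `item`, caching the result keyed by a hash of the input.

    Mirrors the PersistentDataset pattern: on a cache hit the stored result is
    loaded from disk; on a miss the transform runs and the result is written
    atomically (temporary file plus rename) so readers never see partial data.
    """
    key = hashlib.md5(pickle.dumps(item)).hexdigest()
    cache_file = cache_dir / f"{key}.pkl"
    if cache_file.exists():
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    result = transform(item)
    # Write to a temp file first, then rename into place.
    with tempfile.NamedTemporaryFile(dir=cache_dir, delete=False) as tmp:
        pickle.dump(result, tmp)
    Path(tmp.name).rename(cache_file)
    return result
```

A second call with the same item skips the transform entirely, which is why any class stored in the cache must also be loadable later with the same settings.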
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (2)
- monai/data/dataset.py (5 hunks)
- tests/data/test_persistentdataset.py (3 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
⚙️ CodeRabbit configuration file
Review the Python code for quality and correctness. Ensure variable names adhere to PEP8 style guides and are sensible and informative with regard to their function, though permitting simple names for loop and comprehension variables. Ensure routine names are meaningful with regard to their function and use verbs, adjectives, and nouns in a semantically appropriate way. Docstrings should be present for all definitions and describe each variable, return value, and raised exception in the appropriate section of the Google style of docstrings. Examine code for logical errors or inconsistencies, and suggest what may be changed to address these. Suggest any enhancements for code improving efficiency, maintainability, comprehensibility, and correctness. Ensure new or modified definitions will be covered by existing or new unit tests.
Files:
- tests/data/test_persistentdataset.py
- monai/data/dataset.py
🪛 Ruff (0.14.4)
monai/data/dataset.py
295-298: Avoid specifying long messages outside the exception class
(TRY003)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: packaging
- GitHub Check: quick-py3 (macOS-latest)
- GitHub Check: build-docs
- GitHub Check: flake8-py3 (codeformat)
- GitHub Check: quick-py3 (ubuntu-latest)
- GitHub Check: quick-py3 (windows-latest)
- GitHub Check: flake8-py3 (mypy)
- GitHub Check: flake8-py3 (pytype)
- GitHub Check: min-dep-os (windows-latest)
- GitHub Check: min-dep-os (macOS-latest)
- GitHub Check: min-dep-py3 (3.10)
- GitHub Check: min-dep-os (ubuntu-latest)
- GitHub Check: min-dep-py3 (3.9)
- GitHub Check: min-dep-pytorch (2.8.0)
- GitHub Check: min-dep-py3 (3.12)
- GitHub Check: min-dep-py3 (3.11)
- GitHub Check: min-dep-pytorch (2.5.1)
- GitHub Check: min-dep-pytorch (2.6.0)
- GitHub Check: min-dep-pytorch (2.7.1)
🔇 Additional comments (7)
tests/data/test_persistentdataset.py (2)
23-23: LGTM! The `MetaTensor` import is necessary for type assertions in the new test cases.
46-52: LGTM! Test cases comprehensively cover all combinations of the `track_meta` and `weights_only` flags, including the invalid combination that should raise `ValueError`.
233-234: LGTM! New parameters have appropriate defaults that preserve backward compatibility.

269-278: LGTM! Documentation clearly explains the new parameters and their interaction.
294-300: Validation logic is correct. The check prevents the invalid combination that would cause cache thrashing. Error message is clear.
Minor: Static analysis suggests defining exception messages as constants or within exception classes, but this is a style preference and can be deferred.
398-398: LGTM! Correctly propagates `weights_only` to `torch.load`.
419-419: LGTM! Correctly propagates `track_meta` to `convert_to_tensor` when writing cache.
ericspod
left a comment
Hi @mccle, thanks for this fix. We had changed things in the past in response to a security concern, so it's important to discuss the implications of loading more than just weights. Since `torch.load` still retains this functionality, it should be enough to do as they do and make `weights_only` True by default, as you have. We are investigating alternative storage representations that would be safe as well.
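The security concern behind `weights_only` is that unrestricted unpickling can execute arbitrary code at load time; `torch.load(weights_only=True)` defends against this by restricting deserialization to an allowlist of safe types. The same idea can be sketched with the stdlib `pickle` module (the class name, allowlist contents, and helper below are illustrative, not torch's actual implementation):

```python
import io
import pickle


class RestrictedUnpickler(pickle.Unpickler):
    """Only resolve globals from a small allowlist, rejecting everything else.

    This mirrors the idea behind ``torch.load(weights_only=True)``: a malicious
    pickle can reference callables like ``os.system`` and invoke them during
    loading, so global lookups are restricted to known-safe names.
    """

    ALLOWED = {("builtins", "dict"), ("builtins", "list"), ("builtins", "int")}

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"blocked unsafe global: {module}.{name}")


def restricted_loads(data: bytes):
    """Deserialize `data`, refusing any global reference outside the allowlist."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers of built-in values load fine, while a payload referencing any other callable or class raises `UnpicklingError` instead of importing it. Allowing `MetaTensor` through such a gate means trusting more than raw tensor data, which is why the default stays restrictive.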
Co-authored-by: Eric Kerfoot <17726042+ericspod@users.noreply.github.com> Signed-off-by: Mason C. Cleveland <104479423+mccle@users.noreply.github.com>
Signed-off-by: mccle <mccleve@umich.edu>
Actionable comments posted: 0
🧹 Nitpick comments (1)
tests/data/test_persistentdataset.py (1)
178-201: Test logic is correct; consider verifying cache reload.

The test correctly implements the suggested contextlib pattern and validates parameter combinations. Consider adding a second `PersistentDataset` instantiation (after line 198) with the same `cache_dir` to verify that cached items reload correctly with the same settings, especially for `track_meta=True`. Example enhancement:

```python
# Verify cache reload works with same settings
test_dataset_reload = PersistentDataset(
    data=test_data,
    transform=transform,
    cache_dir=cache_dir,
    track_meta=track_meta,
    weights_only=weights_only,
)
im_reload = test_dataset_reload[0]["image"]
self.assertIsInstance(im_reload, expected_type)
```
📜 Review details
📒 Files selected for processing (1)
- tests/data/test_persistentdataset.py (4 hunks)
🧰 Additional context used
Files:
tests/data/test_persistentdataset.py
🔇 Additional comments (2)
tests/data/test_persistentdataset.py (2)
14-14: LGTM on imports. Both additions are necessary: `contextlib` for the nullcontext pattern and `MetaTensor` for type assertions in the new test. Also applies to: 24-24.
47-53: Test cases cover all critical combinations. The four test cases correctly validate: `MetaTensor` with `track_meta=True`, `ValueError` when both flags are True, and `torch.Tensor` for default and non-tracking modes.
Hello @ericspod, thank you for your quick response! I completely understand your concern about the potential security implications of using
Fixes #8599.
Description
`PersistentDataset` currently casts all `MetaTensor` objects to `torch.Tensor` objects and forces the use of `torch.load` with `weights_only=True`. This makes it impossible to save or load metadata to cached files, which may be necessary for accurate post-transform operations.

To address this, this PR introduces the `track_meta` and `weights_only` arguments directly to `PersistentDataset`. They are internally passed to `convert_to_tensor` and `torch.load`, respectively. A `ValueError` is raised when `track_meta=True` and `weights_only=True`, since `MetaTensor` objects cannot be loaded with `weights_only=True` and the cached files would be continually deleted and rewritten.

These changes restore the ability to cache `MetaTensor` objects by allowing explicit control over data casting and `torch.load` behavior. The default values of `track_meta=False` and `weights_only=True` preserve the current behavior of `PersistentDataset`.

Types of changes
- In-house tests passed locally by running `./runtests.sh -f -u --net --coverage`.
- Quick tests passed locally by running `./runtests.sh --quick --unittests --disttests`.
- Documentation built with the `make html` command in the `docs/` folder.
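The `ValueError` guard described above can be sketched as a standalone check. The function name below is hypothetical; in the PR the check lives in the `PersistentDataset` constructor, but the logic is the same:

```python
def check_cache_options(track_meta: bool = False, weights_only: bool = True) -> None:
    """Reject the argument combination that would make cached items unloadable.

    With track_meta=True the cache stores MetaTensor objects, but
    weights_only=True restricts torch.load to plain tensor data, so every
    read would fail and the cache entry would be deleted and rewritten forever.

    Raises:
        ValueError: if both ``track_meta`` and ``weights_only`` are True.
    """
    if track_meta and weights_only:
        raise ValueError(
            "track_meta=True is incompatible with weights_only=True: "
            "MetaTensor objects cannot be loaded with weights_only=True. "
            "Set weights_only=False to cache metadata."
        )
```

The defaults (`track_meta=False`, `weights_only=True`) pass the check, matching the PR's goal of preserving existing behavior unless the caller opts in.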