Commit e5e58c8

Cleanup the pickle dump/load logic in webdatamodule (#1430)
### Description

Address a potential security issue around path resolution in webdatamodule.

#### Usage

Not changed.

### Type of changes

<!-- Mark the relevant option with an [x] -->

- [x] Bug fix (non-breaking change which fixes an issue)
- [ ] New feature (non-breaking change which adds functionality)
- [ ] Refactor
- [ ] Documentation update
- [ ] Other (please describe):

### CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

- [ciflow:skip](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:skip) - Skip all CI tests for this PR
- [ciflow:notebooks](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:notebooks) - Run Jupyter notebooks execution tests for bionemo2
- [ciflow:slow](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:slow) - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
- [ciflow:all](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all) - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
- [ciflow:all-recipes](https://github.com/NVIDIA/bionemo-framework/blob/main/docs/docs/main/contributing/contributing.md#ciflow:all-recipes) - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline.

For more details, see [CONTRIBUTING](CONTRIBUTING.md)

> [!NOTE]
> By default, only basic unit tests are run. Add appropriate labels to enable additional test coverage.

#### Authorizing CI Runs

We use [copy-pr-bot](https://docs.gha-runners.nvidia.com/apps/copy-pr-bot/#automation) to manage authorization of CI runs on NVIDIA's compute resources.

- If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
- If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an `/ok to test` comment on the pull request to trigger CI. This will need to be done for each new commit.

#### Triggering Code Rabbit AI Review

To trigger a code review from code rabbit, comment on a pull request with one of these commands:

- @coderabbitai review - Triggers a standard review
- @coderabbitai full review - Triggers a comprehensive review

See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.

### Pre-submit Checklist

<!--- Ensure all items are completed before submitting -->

- [ ] I have tested these changes locally
- [ ] I have updated the documentation accordingly
- [ ] I have added/updated tests as needed
- [ ] All existing tests pass successfully

Signed-off-by: John St. John <jstjohn@nvidia.com>
1 parent 6704a71 commit e5e58c8

1 file changed, +2 -8 lines changed
  • sub-packages/bionemo-webdatamodule/src/bionemo/webdatamodule

sub-packages/bionemo-webdatamodule/src/bionemo/webdatamodule/utils.py

Lines changed: 2 additions & 8 deletions
@@ -15,7 +15,6 @@
 
 
 import os
-import pickle
 from pathlib import Path
 from typing import Any, Callable, Dict, Iterable, List, Optional, Union, get_args
 
@@ -96,15 +95,10 @@ def pickles_to_tars(
         for name in input_prefix_subset:
             try:
                 if isinstance(input_suffix, str):
-                    suffix_to_data = {
-                        input_suffix: pickle.dumps(
-                            pickle.loads((Path(dir_input) / f"{name}.{input_suffix}").read_bytes())
-                        )
-                    }
+                    suffix_to_data = {input_suffix: (Path(dir_input) / f"{name}.{input_suffix}").read_bytes()}
                 else:
                     suffix_to_data = {
-                        suffix: pickle.dumps(pickle.loads((Path(dir_input) / f"{name}.{suffix}").read_bytes()))
-                        for suffix in input_suffix
+                        suffix: (Path(dir_input) / f"{name}.{suffix}").read_bytes() for suffix in input_suffix
                     }
                 # the prefix name shouldn't contain any "." per webdataset's
                 # specification
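
The effect of the change can be illustrated with a minimal, self-contained sketch. It assumes the input files already hold serialized bytes, so they can be copied verbatim instead of going through a `pickle.loads` / `pickle.dumps` round trip. The helper name `read_sample_bytes`, the sample name `sample0`, and the temporary directory are hypothetical and are not part of `utils.py`; only the dictionary comprehension mirrors the diff.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict, List


def read_sample_bytes(dir_input: Path, name: str, suffixes: List[str]) -> Dict[str, bytes]:
    """Hypothetical helper: map each suffix to the raw bytes of <name>.<suffix>.

    Mirrors the comprehension in the diff. The bytes are never unpickled here,
    so no object from the input file is reconstructed while packing.
    """
    return {suffix: (dir_input / f"{name}.{suffix}").read_bytes() for suffix in suffixes}


with tempfile.TemporaryDirectory() as tmp:
    dir_input = Path(tmp)
    payload = {"x": [1, 2, 3]}  # illustrative data, not from the repository
    original_bytes = pickle.dumps(payload)
    (dir_input / "sample0.pkl").write_bytes(original_bytes)

    data = read_sample_bytes(dir_input, "sample0", ["pkl"])

    # Deserialization is left to a consumer that trusts the data.
    roundtrip = pickle.loads(data["pkl"])
```

Two consequences follow from copying bytes directly: the stored payload is byte-for-byte identical to the input file, and a file that is not a valid pickle is no longer rejected at packing time, since nothing attempts to load it.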
