[feat] Support user-defined data parser for SimpleStorage backend#82
Conversation
Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
CLA Signature: Pass. 0oshowero0, thanks for your pull request. All authors of the commits have signed the CLA. 👍
Pull request overview
Adds an optional, user-provided data_parser hook to the KV put path so callers can store lightweight references and have them parsed/realized inside SimpleStorageUnit during put, reducing client-side decode/load work.
Changes:
- Extend public KV APIs (`kv_put`, `kv_batch_put`, and async variants) to accept an optional `data_parser`.
- Plumb `data_parser` through client → storage manager → ZMQ message → `SimpleStorageUnit._handle_put`, executing it before persisting.
- Add a unit test for `SimpleStorageUnit` parsing behavior and update the tutorial notebook with a `data_parser` demo.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tutorial/basic.ipynb | Documents and demonstrates data_parser usage in the basic tutorial. |
| transfer_queue/storage/simple_backend.py | Executes data_parser inside SimpleStorageUnit before storing field data. |
| transfer_queue/storage/managers/simple_backend_manager.py | Adds data_parser plumbing into PUT ZMQ requests to storage units. |
| transfer_queue/storage/managers/base.py | Extends storage manager interface; explicitly rejects data_parser for KV-based backends. |
| transfer_queue/interface.py | Exposes data_parser on high-level KV functions (sync + async). |
| transfer_queue/client.py | Propagates data_parser through put/async_put to the storage manager. |
| tests/test_simple_storage_unit.py | Adds a storage-unit-level test covering data_parser. |
| tests/test_client.py | Updates mocked put_data signatures to accept the new parameter. |
Comments suppressed due to low confidence (1)
transfer_queue/interface.py:470
`data_parser` is accepted even when `fields` is None, but in that branch the call is treated as a tag-only update and `data_parser` is silently ignored. This is likely a user error; consider raising `ValueError` when `data_parser` is provided without `fields` (same for the async variants) to avoid surprising no-ops.
if fields is None and tag is None:
raise ValueError("Please provide at least one parameter of `fields` or `tag`.")
tq_client = _maybe_create_transferqueue_client()
# 1. translate user-specified key to BatchMeta
batch_meta = tq_client.kv_retrieve_meta(keys=[key], partition_id=partition_id, create=True)
if batch_meta.size != 1:
raise RuntimeError(f"Retrieved BatchMeta size {batch_meta.size} does not match with input `key` size of 1!")
# 2. register the user-specified tag to BatchMeta
if tag is not None:
batch_meta.update_custom_meta([tag])
# 3. put data
if fields is not None:
if isinstance(fields, dict):
# TODO: consider whether to support this...
batch = {}
for field_name, value in fields.items():
if isinstance(value, torch.Tensor):
if value.is_nested:
raise ValueError("Please use (async)kv_batch_put for batch operation")
batch[field_name] = value.unsqueeze(0)
else:
batch[field_name] = NonTensorStack(value)
fields = TensorDict(batch, batch_size=[1])
elif not isinstance(fields, TensorDict):
raise ValueError("`fields` can only be dict or TensorDict")
# After put, batch_meta.field_names will include the new fields written by user
batch_meta = tq_client.put(fields, batch_meta, data_parser=data_parser)
else:
# Directly update custom_meta (tag) to controller
tq_client.set_custom_meta(batch_meta)
parser_fields = TensorDict(
    {
        # Normal data column that will not be modified by data parser
        "normal_data": torch.tensor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
        # Each sample carries a list of ints describing the desired tensor shape
        "data_to_be_parsed": NonTensorStack([2, 3], [1, 4], [3, 2]),
    },
    batch_size=3,
)

tq.kv_batch_put(
    keys=parser_keys,
    partition_id="train",
    fields=parser_fields,
    data_parser=create_data_by_shape_parser,
)
print("Stored 3 samples with data parser")
The data_parser demo uses shape descriptors [[2, 3], [1, 4], [3, 2]], which produces tensors varying in more than one dimension and triggers the warning about jagged nested tensors only supporting a single ragged dimension. To keep the tutorial output clean and demonstrate the intended path, consider using shapes that only vary along one ragged dimension (e.g., keep the trailing dims constant) or adjust the narrative to explain the warning/fallback.
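A dependency-free sketch of building shape descriptors that vary along only one ragged dimension, as the comment suggests (the helper name `single_ragged_shapes` is hypothetical):

```python
def single_ragged_shapes(lengths, trailing_dim):
    """Build per-sample [rows, cols] shape descriptors that vary only in the
    first dimension, keeping the trailing dimension constant so jagged nested
    tensors stay within the single-ragged-dimension fast path."""
    return [[n, trailing_dim] for n in lengths]

shapes = single_ragged_shapes([2, 1, 3], trailing_dim=3)
# Only dim 0 varies across samples: [[2, 3], [1, 3], [3, 3]]
```

Feeding shapes like these into the tutorial's `data_to_be_parsed` column should avoid the jagged-nested-tensor warning entirely.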
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.
Args:
    data: TensorDict containing the data to store.
    metadata: BatchMeta containing storage location information.
    data_parser: Optional callable to parse reference data (e.g., URLs) into real
        content. Executed in a distributed manner on each SimpleStorageUnit.
"""
Since data_parser runs inside the storage unit, it can change value types/shapes compared to the input TensorDict. However, SimpleStorageManager.put_data still notifies the controller using a schema derived from the original data (pre-parse), so BatchMeta.field_schema in the controller/clients can become inconsistent with what is actually stored. Consider constraining data_parser to preserve schema, or extending the PUT response so storage units can return the post-parse schema for accurate controller updates.
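One way to realize the second option is for the storage unit to re-derive the schema after parsing and ship it back in the PUT response. A minimal, dependency-free sketch; the helper name and the `(sample_count, element_type)` schema shape are assumptions, not the project's actual `field_schema` format:

```python
def derive_post_parse_schema(parsed_field_data):
    """Derive a simple per-field schema from already-parsed field data.

    Maps field_name -> (number of samples, element type name). A storage
    unit could return this alongside the PUT acknowledgement so the
    controller updates its schema from post-parse values rather than the
    pre-parse TensorDict seen by the client.
    """
    return {
        name: (len(values), type(values[0]).__name__)
        for name, values in parsed_field_data.items()
    }
```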
data_parser: Optional callable to parse reference data (e.g., URLs) into real
    content. Receives a dict of field_name -> batched values and should
    return a dict with the same structure. Only supported by SimpleStorage.
Allowing a user-provided data_parser to be serialized and executed inside storage units effectively enables arbitrary code execution in the storage process. If TransferQueue can be used in multi-tenant or untrusted-client scenarios, this is a significant security boundary change. Consider documenting this explicitly (trusted clients only) and/or gating it behind an opt-in config flag for SimpleStorage to avoid accidental exposure.
_, put_get_address = storage_setup
client = MockStorageClient(put_get_address)

def create_data_by_shape_parser(field_data):
We will need a clear input/output type definition of this function
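A possible typed contract, matching how the parser is described elsewhere in this review (dict of field_name -> per-sample values in, same-shaped dict out); the alias names `FieldData` and `DataParser` are assumptions:

```python
from typing import Any, Callable

# field_name -> list of per-sample values; output must keep the same keys
# and the same number of elements per column as the input.
FieldData = dict[str, list[Any]]
DataParser = Callable[[FieldData], FieldData]

def identity_parser(field_data: FieldData) -> FieldData:
    """Trivial parser satisfying the contract: returns the input unchanged."""
    return field_data
```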
return shapes, dtypes, custom_backend_meta_list

-async def put_data(self, data: TensorDict, metadata: BatchMeta) -> None:
+async def put_data(
How to handle cases where the data_parser of each data item in the Tensordict is different?
- When the data type requiring specific processing can be detected from the pre-parsed data (e.g., via the URL, file extension, or data type metadata), we can implement conditional processing for the same field:

if "multi_modal_data" in field_data:
    parsed_url_data = []
    for pre_parser_data in field_data["multi_modal_data"].values():
        if isinstance(pre_parser_data, str):
            if pre_parser_data.endswith(".mp4"):
                parsed_url_data.append(process_mp4_url(pre_parser_data))
            elif pre_parser_data.endswith((".jpg", ".png")):
                parsed_url_data.append(process_image_url(pre_parser_data))
        else:
            parsed_url_data.append(pre_parser_data)

- If the data type is difficult to determine dynamically, we can simply separate the data that requires different parser functions into distinct fields.
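The distinct-fields option can also be packaged as a small combinator that composes per-field parsers into one `data_parser`; a hypothetical sketch (helper name and structure are assumptions):

```python
def make_multi_field_parser(per_field_parsers):
    """Combine per-field element parsers into one data_parser.

    per_field_parsers maps field_name -> callable applied to each value in
    that column; fields without a registered parser pass through unchanged,
    so the output keeps the same keys and element counts as the input.
    """
    def data_parser(field_data):
        return {
            name: [per_field_parsers[name](v) for v in values]
            if name in per_field_parsers
            else values
            for name, values in field_data.items()
        }
    return data_parser
```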
a list. It must return a dict of the same format with the exact
same keys and the same number of elements per column; do not
change the inner order of values within each column. Only
supported by SimpleStorage.
Since data_parser is serialized with cloudpickle and executed inside SimpleStorageUnit, it effectively allows arbitrary code execution on storage processes. The docstring should explicitly warn that this is only safe in trusted environments / when clients are authorized to run code on the backend.
Suggested change:
supported by SimpleStorage. Security warning: this callable is
serialized and executed on storage/backend processes, so it
effectively permits backend code execution. Only use this in
trusted environments and only allow authorized clients to supply
`data_parser`.
Why is this a breaking change?
Background
Users often need to store lightweight references (e.g., URLs, file paths) in TransferQueue rather than the full data, avoiding the expensive loading and decoding that would otherwise happen in user code and saving a data copy.
Solution
This PR introduces support for a user-defined data parser in the `SimpleStorage` backend.

The `kv_put` and `kv_batch_put` methods now accept an optional `data_parser` callable. This parser is executed inside each `SimpleStorageUnit` at put time. It receives the raw `field_data` dictionary during the put request and should return a dictionary with the same structure, replacing reference values with the actual parsed data.

Limitations & Future Work
- The data parser is executed synchronously within the `put` operation. This means the put request is only completed when the data parser finishes execution.
- `data_parser` is currently only supported by the SimpleStorage backend.
- Changing data shapes or dtypes inside `SimpleStorageUnit` may lead to incorrect `shape` & `dtype` metadata, which is captured when the data is still in `TransferQueueClient`. This can lead to problems for RDMA transport, which leverages this metadata collected by TQ to restore tensors during `get`.

Demo Script
putoperation. This means the put request is only completed when the data parser finishes execution.data_parseris currently only supported by the SimpleStorage backend.SimpleStorageUnitmay lead to incorrectshape&dtypemetadata, which is captured when the data is still inTransferQueueClient. This can lead to problems for RDMA transport, which leverages these metadata collected by TQ to restore tensor duringget.Demo Script