Support storage unit in TransferQueue by FightingZhen · Pull Request #1 · TransferQueue/verl

FightingZhen · 2025-09-23T11:59:40Z

What does this PR do?

Support storage unit in TransferQueue

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: ...
Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
- {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
- If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
- {type} is in feat, fix, refactor, chore, test
- If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
- Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Not related.

API and Usage Example

Not related.

Design & Code Changes

Not related.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Read the Contribute Guide.
Apply pre-commit checks: pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: ...
Once your PR is ready for CI, send a message in the ci-request channel in the verl Slack workspace. (If not accessible, please try the Feishu group (飞书群).)

Copilot

Pull Request Overview

This PR adds a storage unit component to the TransferQueue experimental feature, implementing a distributed storage system with ZMQ-based communication between storage units and controllers.

Key Changes

Implements StorageUnitData class for managing field-based data storage with validation
Adds TransferQueueStorageSimpleUnit as a Ray remote actor for distributed storage operations
Establishes ZMQ-based communication protocol for PUT/GET/CLEAR operations with controllers

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File	Description
verl/experimental/transfer_queue/storage.py	New storage implementation with data management and ZMQ communication
verl/experimental/transfer_queue/__init__.py	Package initialization file with copyright header


Copilot · 2025-09-23T12:06:59Z

+from transfer_queue.utils.utils import TransferQueueRole
+from transfer_queue.utils.zmq_utils import (
+    ZMQMessage,
+    ZMQRequestType,
+    ZMQServerInfo,
+    create_zmq_socket,
+    get_free_port,
+)


The import statements reference a top-level 'transfer_queue' package, but this module lives under 'verl.experimental'. They should use the full package paths 'verl.experimental.transfer_queue.utils.utils' and 'verl.experimental.transfer_queue.utils.zmq_utils'.

Copilot · 2025-09-23T12:07:00Z

+                else:
+                    result[field] = gathered_item.unsqueeze(0)
+            else:
+                gathered_items = list(itemgetter(*local_indexes)(self.field_data[field]))


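Using itemgetter with unpacked local_indexes misbehaves when local_indexes has only one element: itemgetter called with a single index returns the item itself, not a tuple. Wrapping that result in list() then iterates over the item (or raises TypeError if the item is not iterable) instead of producing a one-element list, which is inconsistent with the multi-item case.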

Suggested change

-                gathered_items = list(itemgetter(*local_indexes)(self.field_data[field]))
+                # Ensure gathered_items is always a list, even if local_indexes has one element
+                if len(local_indexes) == 1:
+                    gathered_items = [self.field_data[field][local_indexes[0]]]
+                else:
+                    gathered_items = list(itemgetter(*local_indexes)(self.field_data[field]))
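
The single-index behaviour of itemgetter can be reproduced with the standard library alone. The sketch below is only an illustration: the gather helper and the string-valued field_data list are hypothetical stand-ins, not code from storage.py.

```python
from operator import itemgetter


def gather(items, indexes):
    """Return the selected items as a list, for one or many indexes."""
    if len(indexes) == 1:
        # itemgetter(i) would return the bare item, so index directly instead
        return [items[indexes[0]]]
    return list(itemgetter(*indexes)(items))


# Hypothetical stored values; strings make the accidental iteration visible
field_data = ["abc", "def", "ghi"]

# Several indexes: itemgetter returns a tuple, so list() gives one entry per index
multi = list(itemgetter(*[0, 2])(field_data))

# One index: itemgetter returns the item itself, so list() iterates over it
single_naive = list(itemgetter(*[1])(field_data))

# With the length check, a single index still yields a one-element list
single_fixed = gather(field_data, [1])
```

With tensors in place of strings, the unguarded single-index path would iterate over the tensor's first dimension instead of returning the tensor as one entry.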

Copilot · 2025-09-23T12:07:00Z

+                if gathered_items:
+                    all_tensors = all(isinstance(x, torch.Tensor) for x in gathered_items)
+                    if all_tensors:
+                        result[field] = torch.nested.as_nested_tensor(gathered_items)


Creating nested tensors can be expensive and may not be the most efficient representation for storage operations. Consider using torch.stack() when tensors have compatible shapes, falling back to nested tensors only when necessary.

Suggested change

-                        result[field] = torch.nested.as_nested_tensor(gathered_items)
+                        try:
+                            result[field] = torch.stack(gathered_items)
+                        except RuntimeError:
+                            result[field] = torch.nested.as_nested_tensor(gathered_items)
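
A minimal standalone sketch of the stack-then-fallback idea, assuming a PyTorch version that provides torch.nested. The pack_tensors helper name is hypothetical and the example tensors are made up for illustration.

```python
import torch


def pack_tensors(tensors):
    """Stack same-shaped tensors; fall back to a nested tensor otherwise."""
    try:
        # Works only when every tensor has the same shape
        return torch.stack(tensors)
    except RuntimeError:
        # torch.stack raises RuntimeError on mismatched shapes
        return torch.nested.as_nested_tensor(tensors)


# Same shapes: a regular dense tensor with a new leading dimension
same = pack_tensors([torch.zeros(3), torch.ones(3)])

# Different lengths: a nested tensor holding both rows
ragged = pack_tensors([torch.zeros(2), torch.ones(3)])
```

One consequence of this design is that callers receive either a dense or a nested tensor depending on the data, so downstream code has to handle both cases (for example by checking is_nested).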

Copilot · 2025-09-23T12:07:01Z

+            per_tensor_dtypes: dict[int, torch.dtype] = {}
+            per_tensor_shapes: dict[int, torch.Size] = {}


The type annotations are incorrect. Based on the code below, these dictionaries should be dict[int, dict[str, torch.dtype]] and dict[int, dict[str, torch.Size]] respectively, as they store nested dictionaries mapping field names to dtypes/shapes.

Suggested change

-            per_tensor_dtypes: dict[int, torch.dtype] = {}
-            per_tensor_shapes: dict[int, torch.Size] = {}
+            per_tensor_dtypes: dict[int, dict[str, torch.dtype]] = {}
+            per_tensor_shapes: dict[int, dict[str, torch.Size]] = {}

Copilot · 2025-09-23T12:07:01Z

+                    per_tensor_dtypes[global_idx][field] = data_item.dtype if hasattr(data_item, "dtype") else None
+                    per_tensor_shapes[global_idx][field] = data_item.shape if hasattr(data_item, "shape") else None


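Using hasattr() for dtype and shape checks is fragile: any object that happens to expose those attributes passes the check, whether or not it is a tensor. Consider using isinstance() checks for torch.Tensor or other expected types to make the code more explicit and maintainable.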

Suggested change

-                    per_tensor_dtypes[global_idx][field] = data_item.dtype if hasattr(data_item, "dtype") else None
-                    per_tensor_shapes[global_idx][field] = data_item.shape if hasattr(data_item, "shape") else None
+                    per_tensor_dtypes[global_idx][field] = data_item.dtype if isinstance(data_item, torch.Tensor) else None
+                    per_tensor_shapes[global_idx][field] = data_item.shape if isinstance(data_item, torch.Tensor) else None
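
As a rough sketch of how the isinstance() check behaves when metadata is collected per global index and per field: the collect_metadata function, the input layout, and the field names "input_ids" and "prompt" are assumptions made for this example, not the actual storage.py code. The annotations use Optional because non-tensor items are recorded as None.

```python
from typing import Optional

import torch


def collect_metadata(items: dict[int, dict[str, object]]):
    """Record dtype and shape per global index and field; None for non-tensors."""
    per_tensor_dtypes: dict[int, dict[str, Optional[torch.dtype]]] = {}
    per_tensor_shapes: dict[int, dict[str, Optional[torch.Size]]] = {}
    for global_idx, fields in items.items():
        per_tensor_dtypes[global_idx] = {}
        per_tensor_shapes[global_idx] = {}
        for field, data_item in fields.items():
            # Explicit type check instead of probing for attributes
            is_tensor = isinstance(data_item, torch.Tensor)
            per_tensor_dtypes[global_idx][field] = data_item.dtype if is_tensor else None
            per_tensor_shapes[global_idx][field] = data_item.shape if is_tensor else None
    return per_tensor_dtypes, per_tensor_shapes


# Hypothetical sample: one tensor field and one plain-string field
dtypes, shapes = collect_metadata(
    {0: {"input_ids": torch.zeros(4, dtype=torch.int64), "prompt": "hello"}}
)
```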

Support storage unit in TransferQueue

a92a942

0oshowero0 requested a review from Copilot September 23, 2025 12:05

Copilot AI reviewed Sep 23, 2025


Fix importance error

bae27bb

0oshowero0 merged commit 8006fa2 into TransferQueue:main Sep 23, 2025

FightingZhen deleted the transferqueue_verl branch November 13, 2025 15:16

0oshowero0 merged 2 commits into TransferQueue:main from FightingZhen:transferqueue_verl


3 participants
