Support storage unit in TransferQueue #1

Merged: 0oshowero0 merged 2 commits into TransferQueue:main from FightingZhen:transferqueue_verl on Sep 23, 2025

Conversation

@FightingZhen

What does this PR do?

Support storage unit in TransferQueue

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with commas, like [megatron, fsdp, doc]
    • {type} is one of feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

Not related.

API and Usage Example

Not related.

Design & Code Changes

Not related.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@0oshowero0 requested a review from Copilot on September 23, 2025, 12:05

Copilot AI left a comment

Pull Request Overview

This PR adds a storage unit component to the TransferQueue experimental feature, implementing a distributed storage system with ZMQ-based communication between storage units and controllers.

Key Changes

  • Implements StorageUnitData class for managing field-based data storage with validation
  • Adds TransferQueueStorageSimpleUnit as a Ray remote actor for distributed storage operations
  • Establishes ZMQ-based communication protocol for PUT/GET/CLEAR operations with controllers
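
The three changes above can be pictured with a small pure-Python sketch. Everything in it is hypothetical: FieldStore stands in for StorageUnitData, RequestType for ZMQRequestType, and handle for the actor's request loop. The real classes in storage.py, their method names, and the ZMQ transport are not reproduced here.

```python
from enum import Enum
from typing import Any


class RequestType(Enum):
    """Hypothetical stand-in for the PR's ZMQRequestType."""

    PUT = "put"
    GET = "get"
    CLEAR = "clear"


class FieldStore:
    """Hypothetical stand-in for StorageUnitData: one fixed-size column per field."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.field_data: dict[str, list[Any]] = {}

    def _validate(self, indexes: list[int]) -> None:
        for i in indexes:
            if not 0 <= i < self.size:
                raise IndexError(f"index {i} outside [0, {self.size})")

    def put(self, indexes: list[int], field_values: dict[str, list[Any]]) -> None:
        self._validate(indexes)
        # Validate every field before writing, so a bad request changes nothing.
        for field, values in field_values.items():
            if len(values) != len(indexes):
                raise ValueError(f"field {field!r}: {len(values)} values for {len(indexes)} indexes")
        for field, values in field_values.items():
            column = self.field_data.setdefault(field, [None] * self.size)
            for i, value in zip(indexes, values):
                column[i] = value

    def get(self, fields: list[str], indexes: list[int]) -> dict[str, list[Any]]:
        self._validate(indexes)
        return {f: [self.field_data[f][i] for i in indexes] for f in fields}

    def clear(self, indexes: list[int]) -> None:
        self._validate(indexes)
        for column in self.field_data.values():
            for i in indexes:
                column[i] = None


def handle(store: FieldStore, request_type: RequestType, payload: dict[str, Any]) -> Any:
    """Dispatch one decoded request; a real unit would receive it over a ZMQ socket."""
    if request_type is RequestType.PUT:
        store.put(payload["indexes"], payload["field_values"])
        return {"ok": True}
    if request_type is RequestType.GET:
        return store.get(payload["fields"], payload["indexes"])
    if request_type is RequestType.CLEAR:
        store.clear(payload["indexes"])
        return {"ok": True}
    raise ValueError(f"unsupported request type: {request_type}")
```

The sketch keeps transport and storage separate, so the storage logic can be exercised without sockets or Ray.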

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File | Description
verl/experimental/transfer_queue/storage.py | New storage implementation with data management and ZMQ communication
verl/experimental/transfer_queue/__init__.py | Package initialization file with copyright header

Comment on lines +28 to +35
from transfer_queue.utils.utils import TransferQueueRole
from transfer_queue.utils.zmq_utils import (
    ZMQMessage,
    ZMQRequestType,
    ZMQServerInfo,
    create_zmq_socket,
    get_free_port,
)

Copilot AI Sep 23, 2025

The import statements reference a top-level 'transfer_queue' package, but this file lives under verl/experimental/transfer_queue, so they should use the full package path: 'verl.experimental.transfer_queue.utils.utils' and 'verl.experimental.transfer_queue.utils.zmq_utils'.

Copilot uses AI. Check for mistakes.
    else:
        result[field] = gathered_item.unsqueeze(0)
else:
    gathered_items = list(itemgetter(*local_indexes)(self.field_data[field]))

Copilot AI Sep 23, 2025

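Using itemgetter with unpacked local_indexes misbehaves when local_indexes has only one element: itemgetter called with a single index returns the item itself, not a tuple, so wrapping the result in list() iterates over the item (or raises if it is not iterable) instead of producing a one-element list. This inconsistency with the multi-item case could cause issues.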

Suggested change
- gathered_items = list(itemgetter(*local_indexes)(self.field_data[field]))
+ # Ensure gathered_items is always a list, even if local_indexes has one element
+ if len(local_indexes) == 1:
+     gathered_items = [self.field_data[field][local_indexes[0]]]
+ else:
+     gathered_items = list(itemgetter(*local_indexes)(self.field_data[field]))
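
The single-index behaviour this comment describes can be reproduced with the standard library alone. The sketch below uses strings instead of tensors so the effect is easy to see; gather is a hypothetical helper, not a function from storage.py, and a plain list comprehension is just one possible way to avoid the special case.

```python
from operator import itemgetter

data = ["ab", "cd", "ef"]

# Two or more indexes: itemgetter returns a tuple, so list() gives one entry per index.
multi = list(itemgetter(*[0, 2])(data))

# One index: itemgetter returns the bare item, so list() iterates over the item itself.
single = list(itemgetter(*[1])(data))


def gather(seq, indexes):
    """Hypothetical helper: return the items at `indexes` as a list, for any number of indexes."""
    return [seq[i] for i in indexes]
```

Here multi is ['ab', 'ef'], while single is ['c', 'd'] rather than ['cd']; gather(data, [1]) returns ['cd'] without a separate branch for the one-element case.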

Copilot uses AI. Check for mistakes.
if gathered_items:
    all_tensors = all(isinstance(x, torch.Tensor) for x in gathered_items)
    if all_tensors:
        result[field] = torch.nested.as_nested_tensor(gathered_items)

Copilot AI Sep 23, 2025

Creating nested tensors can be expensive and may not be the most efficient representation for storage operations. Consider using torch.stack() when tensors have compatible shapes, falling back to nested tensors only when necessary.

Suggested change
- result[field] = torch.nested.as_nested_tensor(gathered_items)
+ try:
+     result[field] = torch.stack(gathered_items)
+ except RuntimeError:
+     result[field] = torch.nested.as_nested_tensor(gathered_items)
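
One possible variant of this suggestion, shown only as a sketch, checks shapes explicitly instead of catching RuntimeError, so that unrelated runtime errors are not silently routed to the nested-tensor path. The function name collate is hypothetical and does not appear in storage.py.

```python
import torch


def collate(tensors: list[torch.Tensor]) -> torch.Tensor:
    """Stack same-shaped tensors; otherwise fall back to a nested tensor."""
    # torch.Size is a tuple subclass, so shapes can be collected in a set.
    if len({t.shape for t in tensors}) == 1:
        return torch.stack(tensors)
    return torch.nested.as_nested_tensor(tensors)


same = collate([torch.zeros(3), torch.ones(3)])    # regular tensor of shape (2, 3)
ragged = collate([torch.zeros(2), torch.zeros(5)])  # nested tensor with ragged rows
```

Either way, callers have to be prepared for two kinds of result, which they can tell apart with the is_nested attribute.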

Copilot uses AI. Check for mistakes.
Comment on lines +338 to +339
per_tensor_dtypes: dict[int, torch.dtype] = {}
per_tensor_shapes: dict[int, torch.Size] = {}

Copilot AI Sep 23, 2025

The type annotations are incorrect. Based on the code below, these dictionaries should be dict[int, dict[str, torch.dtype]] and dict[int, dict[str, torch.Size]] respectively, as they store nested dictionaries mapping field names to dtypes/shapes.

Suggested change
- per_tensor_dtypes: dict[int, torch.dtype] = {}
- per_tensor_shapes: dict[int, torch.Size] = {}
+ per_tensor_dtypes: dict[int, dict[str, torch.dtype]] = {}
+ per_tensor_shapes: dict[int, dict[str, torch.Size]] = {}

Copilot uses AI. Check for mistakes.
Comment on lines +350 to +351
per_tensor_dtypes[global_idx][field] = data_item.dtype if hasattr(data_item, "dtype") else None
per_tensor_shapes[global_idx][field] = data_item.shape if hasattr(data_item, "shape") else None

Copilot AI Sep 23, 2025

Using hasattr() for dtype and shape checks is fragile. Consider using isinstance() checks for torch.Tensor or other expected types to make the code more explicit and maintainable.

Suggested change
- per_tensor_dtypes[global_idx][field] = data_item.dtype if hasattr(data_item, "dtype") else None
- per_tensor_shapes[global_idx][field] = data_item.shape if hasattr(data_item, "shape") else None
+ per_tensor_dtypes[global_idx][field] = data_item.dtype if isinstance(data_item, torch.Tensor) else None
+ per_tensor_shapes[global_idx][field] = data_item.shape if isinstance(data_item, torch.Tensor) else None
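
Taken together, the last two suggestions amount to something like the sketch below. The function collect_metadata and its arguments are hypothetical; only the names per_tensor_dtypes, per_tensor_shapes, global_idx, field and data_item come from the reviewed lines. Since non-tensor items are recorded as None, the sketch annotates the inner values as Optional, which goes slightly beyond the suggested annotation.

```python
from typing import Any, Optional

import torch


def collect_metadata(
    global_indexes: list[int],
    field_data: dict[str, list[Any]],
) -> tuple[dict[int, dict[str, Optional[torch.dtype]]], dict[int, dict[str, Optional[torch.Size]]]]:
    """Record dtype and shape per global index and field; non-tensors map to None."""
    per_tensor_dtypes: dict[int, dict[str, Optional[torch.dtype]]] = {}
    per_tensor_shapes: dict[int, dict[str, Optional[torch.Size]]] = {}
    for pos, global_idx in enumerate(global_indexes):
        per_tensor_dtypes[global_idx] = {}
        per_tensor_shapes[global_idx] = {}
        for field, items in field_data.items():
            data_item = items[pos]
            # isinstance is explicit about which objects carry tensor metadata.
            is_tensor = isinstance(data_item, torch.Tensor)
            per_tensor_dtypes[global_idx][field] = data_item.dtype if is_tensor else None
            per_tensor_shapes[global_idx][field] = data_item.shape if is_tensor else None
    return per_tensor_dtypes, per_tensor_shapes
```

Note that the isinstance check also changes behaviour for objects such as NumPy arrays, which have dtype and shape attributes but are not torch.Tensor instances.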

Copilot uses AI. Check for mistakes.
@0oshowero0 merged commit 8006fa2 into TransferQueue:main on Sep 23, 2025
@FightingZhen deleted the transferqueue_verl branch on November 13, 2025, 15:16

3 participants