Skip to content

[data] feat: TransferQueue - Support managing multiple data partitions for Train/Val/Test in controller#43

Merged
0oshowero0 merged 4 commits intoTransferQueue:main_tq_submodulefrom
LLLLxmmm:data_partitions
Nov 7, 2025
Merged

[data] feat: TransferQueue - Support managing multiple data partitions for Train/Val/Test in controller#43
0oshowero0 merged 4 commits intoTransferQueue:main_tq_submodulefrom
LLLLxmmm:data_partitions

Conversation

@LLLLxmmm
Copy link

@LLLLxmmm LLLLxmmm commented Nov 7, 2025

…s for Train/Val/Test in controller

What does this PR do?

Add concise overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

Test

For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Add code snippet or script demonstrating how to use this

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

Summary by CodeRabbit

  • Refactor

    • Consolidated data system handling and simplified transfer queue client initialization architecture.
    • Unified storage and controller management into a single data system flow.
    • Transitioned from global_step-based to partition_id-based data organization across training and validation workflows.
  • Chores

    • Updated TransferQueue dependency to latest version.

@LLLLxmmm LLLLxmmm changed the title [data] feat: TransferQueue - Support managing multiple data partition… [data] feat: TransferQueue - Support managing multiple data partitions for Train/Val/Test in controller Nov 7, 2025
@LLLLxmmm LLLLxmmm requested review from 0oshowero0, baymax591 and ji-huazhong and removed request for baymax591 and ji-huazhong November 7, 2025 07:47
self.config = OmegaConf.merge(tq_config, self.config)

# 4. create client
# each client should be allocated to exactly one controller
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# each client should be allocated to exactly one controller

@0oshowero0 0oshowero0 requested a review from Copilot November 7, 2025 08:15
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the TransferQueue data system to use a unified controller/client architecture instead of separate train and validation systems. The key changes include:

  • Consolidates dual data systems (train and validation) into a single unified TransferQueue system
  • Replaces global_step parameter with partition_id for better data partitioning
  • Updates TransferQueue dependency to a newer commit version
  • Removes the concept of separate validation clients

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
verl/utils/transferqueue_utils.py Removed separate validation client; simplified client creation to use single controller with storage manager initialization
verl/single_controller/base/worker.py Updated client creation signature to pass unified controller info and config instead of separate train/val parameters
requirements_transferqueue.txt Updated TransferQueue dependency to newer commit hash (862b74a)
recipe/transfer_queue/ray_trainer.py Consolidated train/val data systems into unified system; replaced global_step with partition_id prefixed by train/val context
recipe/transfer_queue/agent_loop.py Updated to use unified controller info and config parameters
Comments suppressed due to low confidence (1)

recipe/transfer_queue/ray_trainer.py:708

  • The 'validate' flag is still being set in extra_info (lines 708, 738) but the TransferQueue system no longer uses this flag to route data to separate validation clients after the refactor. This flag appears to only be used by rollout workers to determine sampling behavior. Consider documenting this usage or removing it if it's truly unused after the consolidation of train/val data systems.
                "validate": True,

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Member

@0oshowero0 0oshowero0 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@coderabbitai
Copy link

coderabbitai bot commented Nov 7, 2025

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Unified transfer queue data system architecture by consolidating per-role (train/val) initialization into a single controller-config-based approach. Refactored method signatures across agent loop, worker, and utility modules to use unified controller_info and config parameters. Updated RPC patterns from global_step-based to partition_id-based namespacing. Updated dependency commit.

Changes

Cohort / File(s) Change Summary
Transfer Queue Client API Consolidation
recipe/transfer_queue/agent_loop.py, verl/single_controller/base/worker.py, verl/utils/transferqueue_utils.py
Updated create_transferqueue_client() method signatures across all three files from (controller_infos, storage_infos, role) to (controller_info, config). Simplified client initialization to use single controller info and config payload. In transferqueue_utils.py, removed _VAL_TRANSFER_QUEUE_CLIENT global variable and eliminated get_val_transferqueue_client() public method, consolidating to single _TRANSFER_QUEUE_CLIENT for both train/val data flows.
Ray Trainer Data System Unification
recipe/transfer_queue/ray_trainer.py
Introduced unified _initialize_data_system() method replacing separate _initialize_train_data_system() and _initialize_val_data_system() flows. Changed storage creation from TransferQueueStorageSimpleUnit to SimpleStorageUnit. Updated data access patterns from train/val-specific clients to single self.data_system_client. Refactored RPC calls from global_step-based to partition_id-based namespacing (e.g., train_{step}, val_{step}). Updated worker initialization to use new (controller_info, config) signature. Restructured config propagation via OmegaConf.merge() with allow_objects=True for ZMQServerInfo preservation.
Dependency Update
requirements_transferqueue.txt
Updated TransferQueue git commit pinning from 68c04e7 to 862b74a.

Sequence Diagram(s)

sequenceDiagram
    participant RayTrainer as RayPPOTrainer
    participant DataSys as _initialize_data_system()
    participant Controller as TransferQueueController
    participant Storage as SimpleStorageUnit
    participant Client as TransferQueueClient

    Note over RayTrainer,Client: New Unified Data System Flow

    RayTrainer->>DataSys: Initialize single data system
    DataSys->>DataSys: Compute train/val data sizes
    DataSys->>Storage: Create SimpleStorageUnit
    DataSys->>Controller: Create TransferQueueController
    DataSys->>Controller: Register storage_unit_infos
    DataSys->>RayTrainer: Return controller_info, storage_infos

    rect rgb(220, 240, 220)
    Note over RayTrainer,Client: Worker Setup (Old: multiple clients per role)
    RayTrainer->>Client: create_transferqueue_client(controller_info, config)
    Client->>Client: initialize_storage_manager(AsyncSimpleStorageManager)
    end

    rect rgb(240, 220, 220)
    Note over RayTrainer,Client: RPC Pattern Change (Old: global_step-based)
    RayTrainer->>Storage: async_put(key="train_{step}", data)
    RayTrainer->>Storage: async_put(key="val_{step}", data)
    Storage-->>RayTrainer: Partition_id-based namespacing
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25–40 minutes

  • recipe/transfer_queue/ray_trainer.py: Requires careful attention to the unified data system initialization logic, RPC pattern changes from global_step to partition_id-based keys, and OmegaConf config merging strategy.
  • verl/utils/transferqueue_utils.py: Review consolidation of train/val client branching logic into single client flow; verify initialize_storage_manager() call replaces previous multi-storage setup correctly.
  • Cross-file signature consistency: Verify all callers of create_transferqueue_client() across agent_loop.py, ray_trainer.py, and worker.py correctly pass the new (controller_info, config) parameters.
  • Dependency bump: Confirm TransferQueue commit 862b74a is compatible with API changes and no breaking changes are introduced.

Poem

🐰 One queue, unified flow,
Train and val now dance together,
No more roles split in two,
Partition keys namespace the way,
Simpler paths make carrots grow! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Title check ⚠️ Warning The PR title mentions managing multiple data partitions for Train/Val/Test, but the actual changes consolidate separate train/val data systems into a unified single data system with partition-based namespacing, which is contrary to the title's implication of multiple independent partitions. Consider revising the title to reflect the actual refactoring, such as 'Refactor TransferQueue to unify train/val data systems with partition-based routing' or similar, which better captures the consolidation of multiple data systems into a single controller-based system.
Docstring Coverage ⚠️ Warning Docstring coverage is 29.41% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

Comment @coderabbitai help to get the list of available commands and usage tips.

@0oshowero0
Copy link
Member

@coderabbitai review

@coderabbitai
Copy link

coderabbitai bot commented Nov 7, 2025

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
recipe/transfer_queue/ray_trainer.py (1)

447-451: Document why allow_objects=True is necessary for config merge.

The code creates a temporary config with allow_objects=True to preserve ZMQServerInfo instances during the merge. While this is correct, consider adding a comment explaining why this flag is required, as it's not immediately obvious and could be accidentally removed during refactoring.

Apply this diff to add clarifying documentation:

         tq_config = OmegaConf.create({}, flags={"allow_objects": True})  # Note: Need to generate a new DictConfig
-        # with allow_objects=True to maintain ZMQServerInfo instance. Otherwise it will be flattened to dict
+        # with allow_objects=True to preserve ZMQServerInfo instances (which contain socket connection details).
+        # Without this flag, OmegaConf would flatten these objects to dicts, breaking the transfer queue client initialization.
         tq_config.controller_info = self.data_system_controller_info
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 20d0f98 and 0dc3ddd.

📒 Files selected for processing (5)
  • recipe/transfer_queue/agent_loop.py (1 hunks)
  • recipe/transfer_queue/ray_trainer.py (15 hunks)
  • requirements_transferqueue.txt (1 hunks)
  • verl/single_controller/base/worker.py (1 hunks)
  • verl/utils/transferqueue_utils.py (3 hunks)
🔇 Additional comments (8)
recipe/transfer_queue/agent_loop.py (1)

70-72: LGTM! Signature simplified to unified controller/config pattern.

The method signature has been updated to use a single controller_info and config instead of separate controller_infos, storage_infos, and role parameters. This aligns with the unified data system architecture described in the PR.

verl/single_controller/base/worker.py (1)

134-141: LGTM! Worker-side signature updated to match unified API.

The signature change removes the role parameter and consolidates controller_infos/storage_infos into controller_info/config. The client ID now uses a simpler worker_{rank} format instead of incorporating the role.

verl/utils/transferqueue_utils.py (2)

45-52: LGTM! Unified transfer queue client initialization.

The function now creates a single AsyncTransferQueueClient and initializes its storage manager in one place, removing the complexity of managing separate train/val clients. This aligns with the unified data system architecture.


104-104: All visible callsites have been successfully updated to use the unified client.

The verification found:

  • No references to old get_val_transferqueue_client or get_train_transferqueue_client functions
  • Only _TRANSFER_QUEUE_CLIENT used in transferqueue_utils.py (line 104)
  • Partition-based routing is explicit in ray_trainer.py via partition_id parameters to async_get_meta(), async_put(), and async_clear()
  • All calls in the decorator pattern (lines 172, 174) correctly convert BatchMeta to DataProto through the unified client

However, the partition routing mechanism at line 104 relies on AsyncTransferQueueClient (imported from external transfer_queue module), which cannot be inspected in this codebase. The routing assumes the client maintains partition context from async_get_meta() through to async_get_data(), but this contract cannot be verified without access to the external dependency.

recipe/transfer_queue/ray_trainer.py (3)

414-460: LGTM! Unified data system initialization consolidates train/val partitions.

The new _initialize_data_system method creates a single TransferQueueController and shared storage units sized for combined train+val data. The use of OmegaConf.create with allow_objects=True correctly preserves ZMQServerInfo instances when merging into config.


671-678: Partition_id naming is consistent across all RPC calls.

Verification confirms all async_put, async_get_meta, async_get_data, and async_clear operations for validation data consistently use f"val_{self.global_steps - 1}", and training data consistently use f"train_{self.global_steps - 1}". The self.global_steps - 1 offset is correctly applied uniformly throughout.


1268-1270: Concern verified and resolved.

The code correctly calls async_put at line 1269 before any subsequent async_get_meta or async_get_data operations on the same partition. Specifically:

  • Line 1269: async_put writes data to partition_id=f"train_{self.global_steps - 1}"
  • Lines 1273+: async_get_meta and async_get_data read from the same partition_id

All read operations occur after the write, preventing consumer failures.

requirements_transferqueue.txt (1)

2-2: Avoid using a same-day commit as a dependency; consider waiting for a stable release.

The commit 862b74a exists in the upstream repository, but it was committed today (2025-11-07 03:01 UTC). This is far too recent to be considered stable. The commit is a bug fix ("Close zmq socket for storage manager"), but lacks the testing time required for production use. Wait for an official release or a commit with a proven track record.

@0oshowero0 0oshowero0 merged commit 9f00d21 into TransferQueue:main_tq_submodule Nov 7, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants