[feat] Support lazy init when calling TQ API#33
Conversation
CLA Signature Guide@MissFishY , thanks for your pull request. The following commit(s) are not associated with a signed Contributor License Agreement (CLA).
To sign CLA, click here. To check if your email is configured correctly, refer to the FAQs. Once you've signed the CLA or updating your email, please comment |
|
/check-cla |
CLA Signature PassMissFishY, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
1 similar comment
CLA Signature PassMissFishY, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
There was a problem hiding this comment.
Pull request overview
This PR enables lazy initialization for TransferQueue clients to support the verl PPO Ray runtime environment use case. Previously, calling any KV API function without first calling tq.init() would raise a ValueError("Missing config for initializing TransferQueueClient!"). Now, these functions will automatically attempt to connect to an existing TransferQueueController.
Changes:
- Modified
_maybe_create_transferqueue_client()to call_init_from_existing()whenconfis None, instead of raising a ValueError - This allows KV API functions to work without explicit
tq.init()call if a controller already exists
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| _init_from_existing() | ||
| return _TRANSFER_QUEUE_CLIENT |
There was a problem hiding this comment.
When conf is None and the TransferQueueController actor does not exist, _init_from_existing() will raise a ValueError from ray.get_actor() (line 95). This error message will be unclear to users, as it's a raw Ray exception rather than an informative TransferQueue error. Consider wrapping the call in a try-except block to catch the ValueError and provide a clearer error message such as "TransferQueue system is not initialized. Please call tq.init() first or ensure a TransferQueueController exists."
…licts Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>
Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>
CLA Signature PassMissFishY, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>
CLA Signature PassMissFishY, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| assert result is True | ||
| assert _TRANSFER_QUEUE_CLIENT is not None |
There was a problem hiding this comment.
The assertions will produce unhelpful error messages when lazy initialization fails. If no TransferQueueController exists (e.g., no process has called tq.init() yet), _init_from_existing() will return False, causing an AssertionError with no context about what went wrong.
Replace the assertions with a clear error message like: "TransferQueueClient could not be initialized. Please ensure that tq.init() has been called in at least one process to create the TransferQueueController, or call tq.init() before using TQ APIs."
| assert result is True | |
| assert _TRANSFER_QUEUE_CLIENT is not None | |
| if not result or _TRANSFER_QUEUE_CLIENT is None: | |
| raise RuntimeError( | |
| "TransferQueueClient could not be initialized. Please ensure that tq.init() has been called " | |
| "in at least one process to create the TransferQueueController, or call tq.init() before " | |
| "using TQ APIs." | |
| ) |
| result = _init_from_existing() | ||
| assert result is True | ||
| assert _TRANSFER_QUEUE_CLIENT is not None |
There was a problem hiding this comment.
The lazy initialization feature introduced in this PR lacks test coverage. The existing tests in tests/e2e/test_kv_interface_e2e.py all call tq.init() before using TQ APIs, so they don't exercise the lazy initialization code path in _maybe_create_transferqueue_client.
Consider adding tests that:
- Call TQ APIs without explicitly calling tq.init() (after a controller has been initialized by another process/fixture)
- Verify that _init_from_existing() is called correctly
- Test error cases when no controller exists
- Test concurrent lazy initialization from multiple processes
| result = _init_from_existing() | |
| assert result is True | |
| assert _TRANSFER_QUEUE_CLIENT is not None | |
| # Attempt lazy initialization from an existing controller. | |
| result = _init_from_existing() | |
| if not result or _TRANSFER_QUEUE_CLIENT is None: | |
| raise RuntimeError( | |
| "Failed to lazily initialize TransferQueueClient: " | |
| "no existing TransferQueueController found or initialization did not complete. " | |
| "Please ensure a controller is running or call tq.init() explicitly." | |
| ) |
Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>
CLA Signature PassMissFishY, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>
CLA Signature PassMissFishY, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
CLA Signature PassMissFishY, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>
CLA Signature PassMissFishY, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
CLA Signature PassMissFishY, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
CLA Signature PassMissFishY, thanks for your pull request. All authors of the commits have signed the CLA. 👍 |
Previously, when integrating TransferQueue (TQ) with VERL,
tq.init()was eagerly called to create the global_TRANSFER_QUEUE_CLIENTvariable to enable subsequent operations liketq.async_kv_batch_get()to access the TQ client instances easily.However, this eager initialization caused Ray to be started and initialized before VERL launched the training cluster with its runtime environment. As a result, runtime environment configurations specified by the VERL PPO trainer were ignored during Ray cluster initialization.
To avoid this issue, we now support lazy initialization: the TQ client is instantiated on-demand upon the first invocation of a TQ API, ensuring Ray is initialized only when VERL is fully prepared to configure the cluster environment correctly.