[feat] Support lazy init when calling TQ API #33

Merged
0oshowero0 merged 8 commits into Ascend:main from MissFishY:main
Feb 13, 2026

MissFishY commented on Feb 12, 2026

Previously, when integrating TransferQueue (TQ) with VERL, tq.init() was called eagerly to create the global _TRANSFER_QUEUE_CLIENT variable, so that subsequent operations such as tq.async_kv_batch_get() could access the TQ client instance directly.

However, this eager initialization caused Ray to be started and initialized before VERL launched the training cluster with its runtime environment. As a result, runtime environment configurations specified by the VERL PPO trainer were ignored during Ray cluster initialization.

To avoid this issue, we now support lazy initialization: the TQ client is instantiated on-demand upon the first invocation of a TQ API, ensuring Ray is initialized only when VERL is fully prepared to configure the cluster environment correctly.
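
The lazy pattern described above can be sketched roughly as follows. This is a minimal illustration, not the TransferQueue implementation: LazyClient, connect_to_cluster, and factory_calls are hypothetical names, and the factory stands in for whatever expensive step (here, attaching through Ray) should be deferred until the first API call.

```python
from typing import Callable, Optional


class LazyClient:
    """Holds a client that is built on first use rather than up front."""

    def __init__(self, factory: Callable[[], object]) -> None:
        self._factory = factory              # expensive setup, deferred
        self._client: Optional[object] = None
        self.factory_calls = 0               # instrumentation for this example only

    def get(self) -> object:
        # Build the client only when an API call actually needs it.
        if self._client is None:
            self.factory_calls += 1
            self._client = self._factory()
        return self._client


def connect_to_cluster() -> dict:
    # Hypothetical stand-in for the expensive step (assumption: in the PR
    # this is where the client attaches to an existing controller).
    return {"connected": True}


lazy_client = LazyClient(connect_to_cluster)
```

Constructing lazy_client has no side effects; the factory runs once, on the first get(), and later calls reuse the same object.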

@ascend-robot

CLA Signature Guide

@MissFishY, thanks for your pull request.

The following commit(s) are not associated with a signed Contributor License Agreement (CLA).

Commit: 8028bd4e update _maybe_create_transferque...
Reason: The email used in the commit is not linked to a signed CLA. Please verify that it matches the email you used when signing the CLA.

To sign CLA, click here.

To check if your email is configured correctly, refer to the FAQs.

Once you've signed the CLA or updated your email, please comment /check-cla to revalidate the CLA status.

MissFishY changed the title from "update _maybe_create_transferqueue_client to avoid potential ray conflicts" to "[fix] update _maybe_create_transferqueue_client to avoid potential ray conflicts" on Feb 12, 2026
@MissFishY

/check-cla

@ascend-robot

CLA Signature Pass

MissFishY, thanks for your pull request. All authors of the commits have signed the CLA. 👍

0oshowero0 changed the title from "[fix] update _maybe_create_transferqueue_client to avoid potential ray conflicts" to "[feat] Support lazy init when calling tq API" on Feb 13, 2026
0oshowero0 changed the title from "[feat] Support lazy init when calling tq API" to "[feat] Support lazy init when calling TQ API" on Feb 13, 2026
0oshowero0 requested a review from Copilot on February 13, 2026, 03:32

Copilot AI left a comment

Pull request overview

This PR enables lazy initialization for TransferQueue clients to support the verl PPO Ray runtime environment use case. Previously, calling any KV API function without first calling tq.init() would raise a ValueError("Missing config for initializing TransferQueueClient!"). Now, these functions will automatically attempt to connect to an existing TransferQueueController.

Changes:

  • Modified _maybe_create_transferqueue_client() to call _init_from_existing() when conf is None, instead of raising a ValueError
  • This allows KV API functions to work without explicit tq.init() call if a controller already exists

Comment on lines +50 to +51
_init_from_existing()
return _TRANSFER_QUEUE_CLIENT

Copilot AI Feb 13, 2026

When conf is None and the TransferQueueController actor does not exist, _init_from_existing() will raise a ValueError from ray.get_actor() (line 95). This error message will be unclear to users, as it's a raw Ray exception rather than an informative TransferQueue error. Consider wrapping the call in a try-except block to catch the ValueError and provide a clearer error message such as "TransferQueue system is not initialized. Please call tq.init() first or ensure a TransferQueueController exists."

…licts

Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>
Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>
Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 5 comments.


Comment thread transfer_queue/interface.py Outdated
Comment on lines +51 to +52
assert result is True
assert _TRANSFER_QUEUE_CLIENT is not None

Copilot AI Feb 13, 2026

The assertions will produce unhelpful error messages when lazy initialization fails. If no TransferQueueController exists (e.g., no process has called tq.init() yet), _init_from_existing() will return False, causing an AssertionError with no context about what went wrong.

Replace the assertions with a clear error message like: "TransferQueueClient could not be initialized. Please ensure that tq.init() has been called in at least one process to create the TransferQueueController, or call tq.init() before using TQ APIs."

Suggested change
- assert result is True
- assert _TRANSFER_QUEUE_CLIENT is not None
+ if not result or _TRANSFER_QUEUE_CLIENT is None:
+     raise RuntimeError(
+         "TransferQueueClient could not be initialized. Please ensure that tq.init() has been called "
+         "in at least one process to create the TransferQueueController, or call tq.init() before "
+         "using TQ APIs."
+     )
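
Beyond the missing context, a further reason to prefer an explicit raise is that Python removes assert statements when run with -O. The sketch below illustrates both points under stated assumptions: ensure_initialized and assert_is_skipped_when_optimized are hypothetical helper names, and the error text is shortened from the suggestion above.

```python
def ensure_initialized(result: bool, client: object) -> object:
    # Explicit check: still runs under `python -O` and tells the user what to do.
    if not result or client is None:
        raise RuntimeError(
            "TransferQueueClient could not be initialized. "
            "Call tq.init() in at least one process first."
        )
    return client


def assert_is_skipped_when_optimized() -> bool:
    # Compile an always-failing assert the way `python -O` would (optimize=1).
    code = compile("assert False", "<example>", "exec", optimize=1)
    exec(code, {})  # no AssertionError is raised: the assert was compiled away
    return True
```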

Comment thread transfer_queue/interface.py
Comment thread transfer_queue/interface.py Outdated
Comment thread transfer_queue/interface.py Outdated
Comment on lines +50 to +52
result = _init_from_existing()
assert result is True
assert _TRANSFER_QUEUE_CLIENT is not None

Copilot AI Feb 13, 2026

The lazy initialization feature introduced in this PR lacks test coverage. The existing tests in tests/e2e/test_kv_interface_e2e.py all call tq.init() before using TQ APIs, so they don't exercise the lazy initialization code path in _maybe_create_transferqueue_client.

Consider adding tests that:

  1. Call TQ APIs without explicitly calling tq.init() (after a controller has been initialized by another process/fixture)
  2. Verify that _init_from_existing() is called correctly
  3. Test error cases when no controller exists
  4. Test concurrent lazy initialization from multiple processes

Suggested change
- result = _init_from_existing()
- assert result is True
- assert _TRANSFER_QUEUE_CLIENT is not None
+ # Attempt lazy initialization from an existing controller.
+ result = _init_from_existing()
+ if not result or _TRANSFER_QUEUE_CLIENT is None:
+     raise RuntimeError(
+         "Failed to lazily initialize TransferQueueClient: "
+         "no existing TransferQueueController found or initialization did not complete. "
+         "Please ensure a controller is running or call tq.init() explicitly."
+     )
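
Two of the test cases listed in this comment (the error path and concurrent lazy initialization) could be exercised along these lines. This is only a sketch: it uses threads in one process as a stand-in for multiple processes, and every name in it (_init_from_existing_stub, _controller_exists, get_client, the check_* helpers) is hypothetical rather than taken from transfer_queue/interface.py.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_client = None
_lock = threading.Lock()
_init_calls = 0
_controller_exists = False  # hypothetical flag standing in for a running controller


def _init_from_existing_stub() -> bool:
    """Hypothetical stand-in: returns False when no controller is available."""
    global _client, _init_calls
    if not _controller_exists:
        return False
    _init_calls += 1
    _client = object()
    return True


def get_client() -> object:
    if _client is None:
        with _lock:
            if _client is None:  # re-check after acquiring the lock
                if not _init_from_existing_stub():
                    raise RuntimeError("no existing controller found")
    return _client


def check_error_when_no_controller() -> bool:
    # Error case: lazy init must fail loudly when nothing is running.
    try:
        get_client()
    except RuntimeError:
        return True
    return False


def check_concurrent_lazy_init(workers: int = 8) -> int:
    # Concurrent case: many callers race, but only one initialization happens.
    global _controller_exists
    _controller_exists = True
    with ThreadPoolExecutor(max_workers=workers) as pool:
        clients = list(pool.map(lambda _: get_client(), range(workers)))
    assert all(c is clients[0] for c in clients)
    return _init_calls
```

The lock plus the second None check is what keeps racing callers from initializing twice; whether the real module needs it depends on how its APIs are invoked, which this PR page does not state.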

Comment thread transfer_queue/interface.py
Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>
Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>
Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>
Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>
Signed-off-by: MissLittleFish <yhuang@smail.nju.edu.cn>

0oshowero0 merged commit a98786e into Ascend:main on Feb 13, 2026
5 checks passed