Skip to content

[fix,refactor] Isolate notify_data_update ZMQ I/O into a dedicated background asyncio loop#117

Merged
0oshowero0 merged 9 commits into
Ascend:mainfrom
dodatboii:notify_data_status_loop
Jun 8, 2026
Merged

[fix,refactor] Isolate notify_data_update ZMQ I/O into a dedicated background asyncio loop#117
0oshowero0 merged 9 commits into
Ascend:mainfrom
dodatboii:notify_data_status_loop

Conversation

@dodatboii

@dodatboii dodatboii commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

Root cause: notify_data_update ran on the caller's asyncio loop. Under heavy compute workloads (PyTorch, Ray), the loop would stall, causing asyncio.wait_for timers to fire before the controller ACK was ever read.

Changes:

  • Spin up a dedicated daemon thread (_notify_thread) running an isolated _notify_loop per StorageManager instance, keeping notify I/O immune to caller-side event loop starvation
  • Bridge calls via run_coroutine_threadsafe + wrap_future so the caller-side await remains non-blocking
  • Add _notify_lock to serialize concurrent notify_data_update calls on the dynamic socket, preventing send/recv interleaving across coroutines
  • Guard close() with hasattr(_notify_loop) to avoid AttributeError in __del__ when __init__ fails before the loop is created

…background asyncio loop

Root cause: `notify_data_update` ran on the caller's asyncio loop. Under heavy compute workloads (PyTorch, Ray), the loop would stall, causing `asyncio.wait_for` timers to fire before the controller ACK was ever read.

 Changes:
 - Spin up a dedicated daemon thread (`_notify_thread`) running an isolated `_notify_loop` per StorageManager instance, keeping notify I/O immune to caller-side event loop starvation
 - Pre-connect a single reusable async DEALER socket (`_notify_sock`) in `_init_notify_zmq`; bridge calls via `run_coroutine_threadsafe` + `wrap_future` so the caller-side `await` remains non-blocking
 - Add `_notify_lock` to serialize concurrent `notify_data_update` calls on the shared socket, preventing send/recv interleaving across coroutines
 - Guard `close()` with `hasattr(_notify_loop)` to avoid `AttributeError` in `__del__` when `__init__` fails before the loop is created

Signed-off-by: yxstev <zhangyixiang9@huawei.com>
@ascend-robot

Copy link
Copy Markdown

CLA Signature Pass

dodatboii, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Signed-off-by: yxstev <zhangyixiang9@huawei.com>
@ascend-robot

Copy link
Copy Markdown

CLA Signature Pass

dodatboii, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors StorageManager.notify_data_update() to move ZMQ notify send/recv off the caller’s asyncio event loop by introducing a dedicated background thread with its own asyncio loop and a pre-connected async DEALER socket, aiming to avoid event-loop starvation under heavy compute workloads.

Changes:

  • Added a per-StorageManager daemon thread running a dedicated asyncio loop for notify I/O.
  • Pre-initialized and reused a single async DEALER socket for notify traffic, bridging calls via run_coroutine_threadsafe + wrap_future.
  • Added locking around notify send/recv to prevent concurrent interleaving on the shared socket.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread transfer_queue/storage/managers/base.py Outdated
Comment on lines +86 to +90
self._notify_zmq_ctx: zmq.asyncio.Context | None = None
self._notify_sock: zmq.asyncio.Socket | None = None
self._notify_lock = asyncio.Lock()
notify_sock_ready = threading.Event()
asyncio.run_coroutine_threadsafe(self._init_notify_zmq(notify_sock_ready), self._notify_loop)
Comment thread transfer_queue/storage/managers/base.py Outdated
Comment on lines +89 to +91
notify_sock_ready = threading.Event()
asyncio.run_coroutine_threadsafe(self._init_notify_zmq(notify_sock_ready), self._notify_loop)
notify_sock_ready.wait()
Comment thread transfer_queue/storage/managers/base.py Outdated
Comment on lines +279 to +284
async def _notify_and_wait(self, request_msg: list) -> None:
"""Send a data status notification to the controller and block until ACK is received."""
assert self._notify_sock is not None

async with self._notify_lock:
await self._notify_sock.send_multipart(request_msg)
Comment thread transfer_queue/storage/managers/base.py
Comment thread transfer_queue/storage/managers/base.py
Comment thread transfer_queue/storage/managers/base.py Outdated
Comment on lines +86 to +90
self._notify_zmq_ctx: zmq.asyncio.Context | None = None
self._notify_sock: zmq.asyncio.Socket | None = None
self._notify_lock = asyncio.Lock()
notify_sock_ready = threading.Event()
asyncio.run_coroutine_threadsafe(self._init_notify_zmq(notify_sock_ready), self._notify_loop)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

zmq.Context() is thread safe. We don't need to introduce a _init_notify_zmq function here.

Comment thread transfer_queue/storage/managers/base.py Outdated
try:
self._notify_zmq_ctx = zmq.asyncio.Context()
identity = f"{self.storage_manager_id}-notify-{uuid4().hex[:8]}".encode()
self._notify_sock = create_zmq_socket(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can preserve the previous dynamic socket design during the run

Comment thread transfer_queue/storage/managers/base.py Outdated
"""Send a data status notification to the controller and block until ACK is received."""
assert self._notify_sock is not None

async with self._notify_lock:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the dynamic socket we can delete this lock

Signed-off-by: yxstev <zhangyixiang9@huawei.com>
@ascend-robot

Copy link
Copy Markdown

CLA Signature Pass

dodatboii, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Signed-off-by: yxstev <zhangyixiang9@huawei.com>
@ascend-robot

Copy link
Copy Markdown

CLA Signature Pass

dodatboii, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Comment thread transfer_queue/storage/managers/base.py Outdated

async def _notify_and_wait(self, request_msg: list) -> None:
"""Send a data status notification to the controller and block until ACK is received."""
zmq_ctx = zmq.asyncio.Context()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this line during init and revert the _connect_to_controller

Comment thread transfer_queue/storage/managers/base.py
Comment thread transfer_queue/storage/managers/base.py Outdated
except Exception:
pass
sock.close()
zmq_ctx.term()

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not term this during each run

Comment thread transfer_queue/storage/managers/base.py Outdated
await sock.send_multipart(error_msg)
except Exception:
pass
raise TimeoutError(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we still use logger.error rather than raise error

Signed-off-by: yxstev <zhangyixiang9@huawei.com>
@ascend-robot

Copy link
Copy Markdown

CLA Signature Pass

dodatboii, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
@ascend-robot

Copy link
Copy Markdown

CLA Signature Pass

0oshowero0, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
@ascend-robot

Copy link
Copy Markdown

CLA Signature Pass

0oshowero0, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
@0oshowero0 0oshowero0 requested a review from Copilot June 8, 2026 01:46
@ascend-robot

Copy link
Copy Markdown

CLA Signature Pass

0oshowero0, thanks for your pull request. All authors of the commits have signed the CLA. 👍

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.

Comment thread transfer_queue/storage/managers/base.py
Comment thread transfer_queue/storage/managers/base.py Outdated
Comment on lines +251 to +255
thread_future = asyncio.run_coroutine_threadsafe(
self._notify_and_wait(request_msg),
self._notify_loop,
)
await asyncio.wrap_future(thread_future)
Signed-off-by: 0oshowero0 <o0shower0o@outlook.com>
@ascend-robot

Copy link
Copy Markdown

CLA Signature Pass

0oshowero0, thanks for your pull request. All authors of the commits have signed the CLA. 👍

@0oshowero0 0oshowero0 merged commit 46ee27b into Ascend:main Jun 8, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants