fix: eliminate ZMQ subscription-gap hang at "waiting for readout" #439

Merged: astronomerdave merged 1 commit into main from hotfix/zmq on May 20, 2026
@astronomerdave
Contributor

If sequencerd briefly disconnects from the ZMQ broker during a TCP reconnect, the XSUB socket's reference count for the "camerad" subscription drops to zero. A single-fire can_expose=true message that camerad publishes at the end of readout during that window is silently discarded, leaving the indefinite camerad_cv.wait() in sequence_start with nothing to wake it.

  • Set HWM=0 (no queue limit) and LINGER=0 (discard pending messages on close) on all PubSub sockets and on the broker's XSUB/XPUB sockets, preventing silent drops under backpressure and hangs that block on close.
  • Persist the zmqpp::poller as a class member rather than reconstructing it on every has_message() call, eliminating a stall of up to 100 ms between burst messages.
  • Add a burst-drain inner loop in the subscriber thread so that all queued messages are consumed before blocking on the next poll.
  • Add a 100 ms settle delay after connect_to_broker() so that the subscription can propagate to the broker before the first publish.
  • After can_expose.store(true) in dothread_monitor_exposure_pending, spawn a detached thread that republishes the ready state every 2 s for up to 10 s, stopping as soon as a new exposure starts. This covers any remaining reconnect window without structural changes to the receive path.
  • Replace both camerad_cv.wait() calls in sequence.cpp with loops around wait_for(15s) and wait_for(30s) that call request_snapshot() on each timeout, so sequencerd actively solicits a republish instead of waiting indefinitely if the initial publish and all periodic republishes are missed.
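
The last item, the timed wait with a snapshot request on timeout, can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code from sequence.cpp: `CameraState`, `wait_until_ready`, and the `request_snapshot` callback parameter are hypothetical names chosen for the example, and the real change presumably uses the existing camerad_cv and request_snapshot() directly.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>

// Hypothetical stand-in for the state sequencerd tracks about camerad.
struct CameraState {
    std::mutex mtx;
    std::condition_variable cv;
    bool can_expose = false;  // guarded by mtx
};

// Blocks until state.can_expose is true. Each time `timeout` elapses
// without the flag being set, the (hypothetical) request_snapshot callback
// is invoked to ask the publisher for a republish, and the wait resumes.
// Returns the number of snapshot requests that were issued.
inline int wait_until_ready(CameraState& state,
                            std::chrono::milliseconds timeout,
                            const std::function<void()>& request_snapshot) {
    int requests = 0;
    std::unique_lock<std::mutex> lock(state.mtx);
    // wait_for with a predicate returns false only if the timeout expired
    // and the predicate is still false.
    while (!state.cv.wait_for(lock, timeout,
                              [&state] { return state.can_expose; })) {
        lock.unlock();       // do not hold the lock while calling out
        request_snapshot();  // solicit a republish of the ready state
        ++requests;
        lock.lock();
    }
    return requests;
}
```

With the timeouts named in this PR, a call site would pass 15 s or 30 s as `timeout`. Using the predicate overload of wait_for also guards against spurious wakeups, and releasing the lock around the callback avoids a deadlock if the reply handler takes the same mutex.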

@astronomerdave astronomerdave merged commit d2e87cc into main May 20, 2026
2 checks passed