Skip to content

Fix message-send races on reconnect and handoff#60

Merged
kahrendt merged 5 commits into
mainfrom
fix-async-send-connection-reuse
May 22, 2026
Merged

Fix message-send races on reconnect and handoff#60
kahrendt merged 5 commits into
mainfrom
fix-async-send-connection-reuse

Conversation

@kahrendt
Copy link
Copy Markdown
Contributor

@kahrendt kahrendt commented May 22, 2026

This is a follow-up to #57. That PR stopped the use-after-free crash by making the httpd session the owner of the SendspinConnection, but a few related races remain in how queued messages get sent when a connection is reused or replaced. This fixes three of them.

Changes

Bind ESP async sends to the connection that queued them. The httpd async send workers identified their target by (server, sockfd) and re-resolved the connection when the worker ran. If the connection closed before the worker ran and the socket fd was recycled by a new connection, the worker could send a frame to the wrong (or torn-down) peer. The work items now hold a weak_ptr<SendspinServerConnection> and lock() it in the worker — it resolves to the exact connection that queued the work, or null if it's gone, in which case the worker no-ops cleanly. The arg structs hold non-trivial members now, so they're placement-new constructed and explicitly destroyed before platform_free.

Gate sends on the client/hello. Adds an allow_before_hello parameter to send_text_message. Async transports drop any frame queued before client_hello_sent_ is set, preserving the "hello is always first" invariant even if a stale frame gets queued during teardown. Exactly two messages opt out: the hello itself (otherwise it would gate its own send and deadlock the handshake) and goodbye (a rejected connection is told to leave before it ever sends a hello). Synchronous host/IXWebSocket transports ignore the flag.

Track hello retries per connection. The retry timer was a single slot, so when a handoff candidate arrived mid-handshake, arming its retry overwrote the current connection's pending hello and the first connection never got retried. It's now one entry per managed connection.

Also updates docs/internals.md — the "Server connection ownership" and "Graceful Disconnect" sections still described the old (server, sockfd) worker lookup — and tightens a few comments.

Thread-safety fixes from review

  • client_hello_sent_ / server_hello_received_ are written on network threads and read from the main loop via is_handshake_complete(), so they're now std::atomic<bool> (matching the existing message_dispatch_enabled_). This was a pre-existing race, but the new pre-hello gate reads the same flag, so it's worth resolving here.
  • The destructor cleared hello_retries_ outside conn_ptr_mutex_; moved it inside, matching every other access.

Testing

Unit tests aren't applicable because this touches the ESP code.

kahrendt added 3 commits May 21, 2026 20:26
The async httpd send workers looked up their target connection by
(server, sockfd). A worker queued for a connection that closed before it
ran could fire against a different connection that had since reused the
same socket fd, sending a frame to the wrong peer (or a torn-down one).

Capture a weak_ptr to the originating connection in the work item and
lock() it inside the worker instead; if the connection is gone, drop the
send. Also gate any frame queued before the client/hello on
client_hello_sent_ so stale pre-handshake traffic can't leak out;
goodbye and the hello itself opt in via allow_before_hello, since both
legitimately precede the handshake.
A single hello_retry_ slot meant that when a second connection arrived
mid-handshake (e.g. a handoff candidate), arming its retry overwrote the
first connection's pending hello, so the first never got one.

Keep one HelloRetryState per managed connection in a vector. loop()
walks it, drops entries whose connection is no longer current/pending,
and retries the rest independently; on_connection_lost removes an
entry by connection pointer.
- internals.md: the "Server connection ownership" and "Graceful Disconnect"
  sections still described the old (server, sockfd) worker lookup. The send
  workers now capture a weak_ptr; rewrite those sections and document the
  pre-hello send gate. Update the async_send_time_text header docstring too.
- complete_handoff: remove the displaced connection's hello-retry entry when
  it leaves the managed set, instead of leaving it for loop()'s lazy prune.
- Correct the async_send_text comment and the send_text_message callback
  doc: the completion callback is skipped when the conn is gone, so
  allow_before_hello bypasses the gate but not the conn-alive requirement.
- Clarify the initiate_hello dedup and remove_hello_retry no-op comments.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Follow-up to #57 to further harden reconnect/handoff behavior by eliminating remaining message-send races in the ESP async send path and improving handshake sequencing.

Changes:

  • Bind ESP async send work items to the originating connection via weak_ptr instead of (server, sockfd) re-lookup, with proper placement-new construction/destruction for non-trivial arg structs.
  • Add a pre-hello send gate (allow_before_hello) to preserve the “client/hello is always first” invariant (hello + goodbye bypass the gate).
  • Track hello retries per managed connection (vector of retry entries) so mid-handshake handoffs don’t clobber each other.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
src/connection.h Extends send_text_message API with allow_before_hello and clarifies callback semantics.
src/connection.cpp Sends goodbye with allow_before_hello=true to avoid gating control messages.
src/connection_manager.h Replaces single hello retry slot with per-connection retry vector; adds helper to remove retry entries.
src/connection_manager.cpp Implements per-connection hello retries and clears retry state on connection loss/handoff.
src/esp/server_connection.h Updates send signature and documents weak_ptr-based async worker args.
src/esp/server_connection.cpp Switches async workers to weak_ptr identity, adds pre-hello gate enforcement, and uses placement-new/destruct for worker args.
src/esp/client_connection.h Updates send signature to include allow_before_hello.
src/esp/client_connection.cpp Updates send signature (flag ignored for synchronous transport).
src/host/server_connection.h Updates send signature to include allow_before_hello.
src/host/server_connection.cpp Updates send signature (flag ignored for synchronous transport).
src/host/client_connection.h Updates send signature to include allow_before_hello.
src/host/client_connection.cpp Updates send signature (flag ignored for synchronous transport).
docs/internals.md Updates internals documentation to reflect weak_ptr worker binding and per-connection hello retries.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/connection_manager.cpp
Comment thread src/esp/server_connection.cpp
kahrendt added 2 commits May 21, 2026 20:59
Every other access to hello_retries_ holds conn_ptr_mutex_; the destructor
cleared it outside the lock. Move the clear into the same locked block as the
connection-pointer resets so the locking discipline is uniform.
client_hello_sent_ and server_hello_received_ are written on network threads
(the send-completion callback and the disconnect handlers) and read from the
main loop via is_handshake_complete(), so as plain bools the cross-thread
access was a data race. Make them std::atomic<bool>, matching the existing
treatment of message_dispatch_enabled_. No call sites change: atomic loads
and stores are implicit in the existing reads and assignments.
@kahrendt kahrendt merged commit ef1157f into main May 22, 2026
5 checks passed
@kahrendt kahrendt deleted the fix-async-send-connection-reuse branch May 22, 2026 01:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants