Fix message-send races on reconnect and handoff#60
Merged
Conversation
The async httpd send workers looked up their target connection by (server, sockfd). A worker queued for a connection that closed before it ran could fire against a different connection that had since reused the same socket fd, sending a frame to the wrong peer (or a torn-down one). Capture a weak_ptr to the originating connection in the work item and lock() it inside the worker instead; if the connection is gone, drop the send. Also gate any frame queued before the client/hello on client_hello_sent_ so stale pre-handshake traffic can't leak out; goodbye and the hello itself opt in via allow_before_hello, since both legitimately precede the handshake.
A single hello_retry_ slot meant that when a second connection arrived mid-handshake (e.g. a handoff candidate), arming its retry overwrote the first connection's pending hello, so the first never got one. Keep one HelloRetryState per managed connection in a vector. loop() walks it, drops entries whose connection is no longer current/pending, and retries the rest independently; on_connection_lost removes an entry by connection pointer.
- internals.md: the "Server connection ownership" and "Graceful Disconnect" sections still described the old (server, sockfd) worker lookup. The send workers now capture a weak_ptr; rewrite those sections and document the pre-hello send gate. Update the async_send_time_text header docstring too. - complete_handoff: remove the displaced connection's hello-retry entry when it leaves the managed set, instead of leaving it for loop()'s lazy prune. - Correct the async_send_text comment and the send_text_message callback doc: the completion callback is skipped when the conn is gone, so allow_before_hello bypasses the gate but not the conn-alive requirement. - Clarify the initiate_hello dedup and remove_hello_retry no-op comments.
There was a problem hiding this comment.
Pull request overview
Follow-up to #57 to further harden reconnect/handoff behavior by eliminating remaining message-send races in the ESP async send path and improving handshake sequencing.
Changes:
- Bind ESP async send work items to the originating connection via
weak_ptrinstead of(server, sockfd)re-lookup, with proper placement-new construction/destruction for non-trivial arg structs. - Add a pre-hello send gate (
allow_before_hello) to preserve the “client/hello is always first” invariant (hello + goodbye bypass the gate). - Track hello retries per managed connection (vector of retry entries) so mid-handshake handoffs don’t clobber each other.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/connection.h | Extends send_text_message API with allow_before_hello and clarifies callback semantics. |
| src/connection.cpp | Sends goodbye with allow_before_hello=true to avoid gating control messages. |
| src/connection_manager.h | Replaces single hello retry slot with per-connection retry vector; adds helper to remove retry entries. |
| src/connection_manager.cpp | Implements per-connection hello retries and clears retry state on connection loss/handoff. |
| src/esp/server_connection.h | Updates send signature and documents weak_ptr-based async worker args. |
| src/esp/server_connection.cpp | Switches async workers to weak_ptr identity, adds pre-hello gate enforcement, and uses placement-new/destruct for worker args. |
| src/esp/client_connection.h | Updates send signature to include allow_before_hello. |
| src/esp/client_connection.cpp | Updates send signature (flag ignored for synchronous transport). |
| src/host/server_connection.h | Updates send signature to include allow_before_hello. |
| src/host/server_connection.cpp | Updates send signature (flag ignored for synchronous transport). |
| src/host/client_connection.h | Updates send signature to include allow_before_hello. |
| src/host/client_connection.cpp | Updates send signature (flag ignored for synchronous transport). |
| docs/internals.md | Updates internals documentation to reflect weak_ptr worker binding and per-connection hello retries. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Every other access to hello_retries_ holds conn_ptr_mutex_; the destructor cleared it outside the lock. Move the clear into the same locked block as the connection-pointer resets so the locking discipline is uniform.
client_hello_sent_ and server_hello_received_ are written on network threads (the send-completion callback and the disconnect handlers) and read from the main loop via is_handshake_complete(), so as plain bools the cross-thread access was a data race. Make them std::atomic<bool>, matching the existing treatment of message_dispatch_enabled_. No call sites change: atomic loads and stores are implicit in the existing reads and assignments.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a follow-up to #57. That PR stopped the use-after-free crash by making the httpd session the owner of the
SendspinConnection, but a few related races remain in how queued messages get sent when a connection is reused or replaced. This fixes three of them.Changes
Bind ESP async sends to the connection that queued them. The httpd async send workers identified their target by
(server, sockfd)and re-resolved the connection when the worker ran. If the connection closed before the worker ran and the socket fd was recycled by a new connection, the worker could send a frame to the wrong (or torn-down) peer. The work items now hold aweak_ptr<SendspinServerConnection>andlock()it in the worker — it resolves to the exact connection that queued the work, or null if it's gone, in which case the worker no-ops cleanly. The arg structs hold non-trivial members now, so they're placement-new constructed and explicitly destroyed beforeplatform_free.Gate sends on the client/hello. Adds an
allow_before_helloparameter tosend_text_message. Async transports drop any frame queued beforeclient_hello_sent_is set, preserving the "hello is always first" invariant even if a stale frame gets queued during teardown. Exactly two messages opt out: the hello itself (otherwise it would gate its own send and deadlock the handshake) and goodbye (a rejected connection is told to leave before it ever sends a hello). Synchronous host/IXWebSocket transports ignore the flag.Track hello retries per connection. The retry timer was a single slot, so when a handoff candidate arrived mid-handshake, arming its retry overwrote the current connection's pending hello and the first connection never got retried. It's now one entry per managed connection.
Also updates
docs/internals.md— the "Server connection ownership" and "Graceful Disconnect" sections still described the old(server, sockfd)worker lookup — and tightens a few comments.Thread-safety fixes from review
client_hello_sent_/server_hello_received_are written on network threads and read from the main loop viais_handshake_complete(), so they're nowstd::atomic<bool>(matching the existingmessage_dispatch_enabled_). This was a pre-existing race, but the new pre-hello gate reads the same flag, so it's worth resolving here.hello_retries_outsideconn_ptr_mutex_; moved it inside, matching every other access.Testing
Unit tests aren't applicable because this touches the ESP code.