Fix message-send races on reconnect and handoff by kahrendt · Pull Request #60 · Sendspin/sendspin-cpp

kahrendt · 2026-05-22T00:49:08Z

This is a follow-up to #57. That PR stopped the use-after-free crash by making the httpd session the owner of the SendspinConnection, but a few related races remain in how queued messages get sent when a connection is reused or replaced. This fixes three of them.

Changes

Bind ESP async sends to the connection that queued them. The httpd async send workers identified their target by (server, sockfd) and re-resolved the connection when the worker ran. If the connection closed before the worker ran and the socket fd was recycled by a new connection, the worker could send a frame to the wrong (or torn-down) peer. The work items now hold a weak_ptr<SendspinServerConnection> and lock() it in the worker — it resolves to the exact connection that queued the work, or null if it's gone, in which case the worker no-ops cleanly. The arg structs hold non-trivial members now, so they're placement-new constructed and explicitly destroyed before platform_free.
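
A minimal sketch of the weak_ptr binding described above, not the actual code in src/esp/server_connection.cpp. `Connection` stands in for SendspinServerConnection, `SendWorkArg`, `queue_send` and `send_worker` are hypothetical names, and `platform_malloc` is an assumed counterpart to the `platform_free` mentioned in the description (both are backed by malloc/free here).

```cpp
#include <cassert>
#include <cstdlib>
#include <memory>
#include <new>
#include <string>
#include <utility>

// Hypothetical stand-in for SendspinServerConnection.
struct Connection {
    int frames_sent = 0;
    void send_frame(const std::string& payload) { (void)payload; ++frames_sent; }
};

// Work item handed to the async worker. It holds non-trivial members
// (weak_ptr, std::string), so raw allocation alone is not enough.
struct SendWorkArg {
    std::weak_ptr<Connection> conn;
    std::string payload;
};

// Assumed platform allocator hooks, backed by malloc/free for this sketch.
void* platform_malloc(std::size_t n) { return std::malloc(n); }
void platform_free(void* p) { std::free(p); }

// Queue side: allocate raw storage, then placement-new the work item into it.
SendWorkArg* queue_send(const std::shared_ptr<Connection>& conn, std::string payload) {
    void* raw = platform_malloc(sizeof(SendWorkArg));
    if (raw == nullptr) return nullptr;
    return new (raw) SendWorkArg{conn, std::move(payload)};
}

// Worker side: lock() yields the exact connection that queued the work, or
// null if it is gone. Returns true if a frame was sent.
bool send_worker(void* raw) {
    assert(raw != nullptr);
    auto* arg = static_cast<SendWorkArg*>(raw);
    bool sent = false;
    if (auto conn = arg->conn.lock()) {
        conn->send_frame(arg->payload);
        sent = true;
    }
    arg->~SendWorkArg();   // explicit destructor call before releasing storage
    platform_free(arg);
    return sent;
}
```

Because the work item carries no socket fd, a recycled fd cannot redirect the send: a worker whose connection has been destroyed simply frees its argument and returns.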

Gate sends on the client/hello. Adds an allow_before_hello parameter to send_text_message. Async transports drop any frame queued before client_hello_sent_ is set, preserving the "hello is always first" invariant even if a stale frame gets queued during teardown. Exactly two messages opt out: the hello itself (otherwise it would gate its own send and deadlock the handshake) and goodbye (a rejected connection is told to leave before it ever sends a hello). Synchronous host/IXWebSocket transports ignore the flag.
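
The gate can be pictured roughly as follows. This is a sketch under assumptions: `GatedSender`, `mark_hello_sent` and the `false` default for `allow_before_hello` are hypothetical, and the real async transports queue work rather than appending to a vector.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical async transport that records the frames it would send.
class GatedSender {
public:
    // Returns true if the frame was accepted, false if the pre-hello gate dropped it.
    bool send_text_message(const std::string& msg, bool allow_before_hello = false) {
        if (!allow_before_hello && !client_hello_sent_.load()) {
            return false;  // stale pre-handshake frame: drop it
        }
        sent_.push_back(msg);
        return true;
    }

    // In the PR the flag is set once the hello has gone out; here the caller sets it.
    void mark_hello_sent() { client_hello_sent_.store(true); }

    const std::vector<std::string>& sent() const { return sent_; }

private:
    std::atomic<bool> client_hello_sent_{false};
    std::vector<std::string> sent_;
};

// Walks one handshake and returns how many frames actually went out.
std::size_t run_gate_example() {
    GatedSender s;
    bool stale_accepted = s.send_text_message("stale frame");
    assert(!stale_accepted);                       // dropped: hello not sent yet
    bool hello_accepted = s.send_text_message("client/hello", true);
    assert(hello_accepted);                        // the hello bypasses its own gate
    s.mark_hello_sent();
    s.send_text_message("later frame");            // passes: gate is open
    return s.sent().size();
}
```

Without the opt-out, the hello would be rejected by the very flag it is supposed to set, which is the deadlock the description mentions.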

Track hello retries per connection. The retry timer was a single slot, so when a handoff candidate arrived mid-handshake, arming its retry overwrote the current connection's pending hello and the first connection never got retried. It's now one entry per managed connection.

Also updates docs/internals.md — the "Server connection ownership" and "Graceful Disconnect" sections still described the old (server, sockfd) worker lookup — and tightens a few comments.

Thread-safety fixes from review

- client_hello_sent_ / server_hello_received_ are written on network threads and read from the main loop via is_handshake_complete(), so they're now std::atomic<bool> (matching the existing message_dispatch_enabled_). This was a pre-existing race, but the new pre-hello gate reads the same flag, so it's worth resolving here.
- The destructor cleared hello_retries_ outside conn_ptr_mutex_; moved it inside, matching every other access.

Testing

Unit tests aren't applicable because this touches the ESP code.

The async httpd send workers looked up their target connection by (server, sockfd). A worker queued for a connection that closed before it ran could fire against a different connection that had since reused the same socket fd, sending a frame to the wrong peer (or a torn-down one). Capture a weak_ptr to the originating connection in the work item and lock() it inside the worker instead; if the connection is gone, drop the send. Also gate any frame queued before the client/hello on client_hello_sent_ so stale pre-handshake traffic can't leak out; goodbye and the hello itself opt in via allow_before_hello, since both legitimately precede the handshake.

A single hello_retry_ slot meant that when a second connection arrived mid-handshake (e.g. a handoff candidate), arming its retry overwrote the first connection's pending hello, so the first never got one. Keep one HelloRetryState per managed connection in a vector. loop() walks it, drops entries whose connection is no longer current/pending, and retries the rest independently; on_connection_lost removes an entry by connection pointer.
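
A rough sketch of the per-connection bookkeeping, assuming a simplified `Connection` type and a hypothetical `RetryTracker` wrapper. Only `HelloRetryState` and the prune/retry/remove behaviour come from the commit message; the `attempts` counter, method names and signatures are illustrative, and the conn_ptr_mutex_ locking and retry timing of the real code are omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Connection { int id; };  // hypothetical stand-in

// One entry per managed connection, keyed by connection pointer.
struct HelloRetryState {
    const Connection* conn;
    int attempts;
};

class RetryTracker {
public:
    // Arm a retry for conn; arming twice for the same connection is a no-op.
    void arm(const Connection* conn) {
        assert(conn != nullptr);
        for (const auto& e : retries_) {
            if (e.conn == conn) return;
        }
        retries_.push_back(HelloRetryState{conn, 0});
    }

    // Remove an entry by connection pointer (no-op if absent).
    void remove(const Connection* conn) {
        retries_.erase(std::remove_if(retries_.begin(), retries_.end(),
                           [conn](const HelloRetryState& e) { return e.conn == conn; }),
                       retries_.end());
    }

    // One loop pass: drop entries that are neither current nor pending,
    // then retry each remaining entry independently.
    void loop(const Connection* current, const Connection* pending) {
        retries_.erase(std::remove_if(retries_.begin(), retries_.end(),
                           [=](const HelloRetryState& e) {
                               return e.conn != current && e.conn != pending;
                           }),
                       retries_.end());
        for (auto& e : retries_) ++e.attempts;
    }

    std::size_t size() const { return retries_.size(); }

    // Returns the attempt count for conn, or -1 if it has no entry.
    int attempts_for(const Connection* conn) const {
        for (const auto& e : retries_) {
            if (e.conn == conn) return e.attempts;
        }
        return -1;
    }

private:
    std::vector<HelloRetryState> retries_;
};
```

With a single slot, the second `arm` call would have replaced the first entry; with a vector, both connections keep their own retry state until they leave the managed set.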

- internals.md: the "Server connection ownership" and "Graceful Disconnect" sections still described the old (server, sockfd) worker lookup. The send workers now capture a weak_ptr; rewrite those sections and document the pre-hello send gate. Update the async_send_time_text header docstring too.
- complete_handoff: remove the displaced connection's hello-retry entry when it leaves the managed set, instead of leaving it for loop()'s lazy prune.
- Correct the async_send_text comment and the send_text_message callback doc: the completion callback is skipped when the conn is gone, so allow_before_hello bypasses the gate but not the conn-alive requirement.
- Clarify the initiate_hello dedup and remove_hello_retry no-op comments.

Copilot

Pull request overview

Follow-up to #57 to further harden reconnect/handoff behavior by eliminating remaining message-send races in the ESP async send path and improving handshake sequencing.

Changes:

- Bind ESP async send work items to the originating connection via weak_ptr instead of (server, sockfd) re-lookup, with proper placement-new construction/destruction for non-trivial arg structs.
- Add a pre-hello send gate (allow_before_hello) to preserve the “client/hello is always first” invariant (hello + goodbye bypass the gate).
- Track hello retries per managed connection (vector of retry entries) so mid-handshake handoffs don’t clobber each other.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/connection.h | Extends `send_text_message` API with `allow_before_hello` and clarifies callback semantics. |
| src/connection.cpp | Sends goodbye with `allow_before_hello=true` to avoid gating control messages. |
| src/connection_manager.h | Replaces single hello retry slot with per-connection retry vector; adds helper to remove retry entries. |
| src/connection_manager.cpp | Implements per-connection hello retries and clears retry state on connection loss/handoff. |
| src/esp/server_connection.h | Updates send signature and documents `weak_ptr`-based async worker args. |
| src/esp/server_connection.cpp | Switches async workers to `weak_ptr` identity, adds pre-hello gate enforcement, and uses placement-new/destruct for worker args. |
| src/esp/client_connection.h | Updates send signature to include `allow_before_hello`. |
| src/esp/client_connection.cpp | Updates send signature (flag ignored for synchronous transport). |
| src/host/server_connection.h | Updates send signature to include `allow_before_hello`. |
| src/host/server_connection.cpp | Updates send signature (flag ignored for synchronous transport). |
| src/host/client_connection.h | Updates send signature to include `allow_before_hello`. |
| src/host/client_connection.cpp | Updates send signature (flag ignored for synchronous transport). |
| docs/internals.md | Updates internals documentation to reflect `weak_ptr` worker binding and per-connection hello retries. |

Every other access to hello_retries_ holds conn_ptr_mutex_; the destructor cleared it outside the lock. Move the clear into the same locked block as the connection-pointer resets so the locking discipline is uniform.

client_hello_sent_ and server_hello_received_ are written on network threads (the send-completion callback and the disconnect handlers) and read from the main loop via is_handshake_complete(), so as plain bools the cross-thread access was a data race. Make them std::atomic<bool>, matching the existing treatment of message_dispatch_enabled_. No call sites change: atomic loads and stores are implicit in the existing reads and assignments.
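
To illustrate why no call sites change, here is a small sketch, assuming a hypothetical `HandshakeFlags` holder and handler names. That is_handshake_complete() requires both flags is also an assumption; the commit message only says it reads them.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical holder for the two handshake flags.
class HandshakeFlags {
public:
    // Called on network threads in the real code. Plain assignment on a
    // std::atomic<bool> is an atomic store.
    void on_hello_send_complete() { client_hello_sent_ = true; }
    void on_server_hello() { server_hello_received_ = true; }
    void on_disconnect() {
        client_hello_sent_ = false;
        server_hello_received_ = false;
    }

    // Called from the main loop. Reading a std::atomic<bool> in a boolean
    // context goes through its conversion operator, which is an atomic load.
    bool is_handshake_complete() const {
        return client_hello_sent_ && server_hello_received_;
    }

private:
    std::atomic<bool> client_hello_sent_{false};
    std::atomic<bool> server_hello_received_{false};
};

// Single-threaded walk through the state changes; returns the final state.
bool run_flags_example() {
    HandshakeFlags flags;
    assert(!flags.is_handshake_complete());
    flags.on_hello_send_complete();
    flags.on_server_hello();
    return flags.is_handshake_complete();
}
```

The implicit loads and stores use sequentially consistent ordering, the default for `std::atomic`, so the assignments and reads keep the same syntax they had as plain bools.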

kahrendt added 3 commits May 21, 2026 20:26

kahrendt added the bugfix label May 22, 2026

kahrendt requested a review from Copilot May 22, 2026 00:50

Copilot started reviewing on behalf of kahrendt May 22, 2026 00:51

Copilot AI reviewed May 22, 2026

Comment thread src/connection_manager.cpp

Comment thread src/esp/server_connection.cpp

kahrendt added 2 commits May 21, 2026 20:59

Clear hello_retries_ under conn_ptr_mutex_ in destructor

91b19a4

kahrendt merged commit ef1157f into main May 22, 2026
5 checks passed

kahrendt deleted the fix-async-send-connection-reuse branch May 22, 2026 01:08
