
Proposal to change how the WebSocket clients are cleaned up to remove a race condition#424

Open

MitchBradley wants to merge 4 commits into ESP32Async:main from MitchBradley:fix/posix-websocket-broadcast-race
Conversation

@MitchBradley

This fixes several lifetime and synchronization hazards in AsyncWebSocket that can surface when a client disconnects while the server is broadcasting to connected websocket clients.

Summary:

  • defer client removal from _handleDisconnect() to cleanupClients() so broadcast iteration is not invalidated by disconnect callbacks
  • protect AsyncWebSocketClient::_client reads in shouldBeDeleted(), _onPoll(), _queueControl(), _queueMessage(), _onTimeout(), _onDisconnect(), remoteIP(), and remotePort()
  • resolve per-client close(), ping(), text(), and binary() operations under the websocket server lock instead of fetching a raw client pointer and using it later outside the lock
  • make cleanupClients() count connected clients directly and erase all deletable clients safely in one pass

Why:
A disconnect callback can race with websocket send/broadcast paths. In particular, erasing a client immediately during _handleDisconnect() can invalidate active iteration during textAll()/binaryAll(), and unlocked _client access can race with teardown.

This change keeps the fix narrowly scoped to AsyncWebSocket lifetime handling. It does not include unrelated local debugging or response-state-machine changes from downstream investigation.
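As a rough illustration of the first and last bullets, the deferred-removal pattern might be sketched as the standalone example below. The `Server`/`Client` names are hypothetical stand-ins, not the actual AsyncWebSocket API:

```cpp
#include <cstddef>
#include <list>
#include <mutex>

// Hypothetical stand-ins for AsyncWebSocket / AsyncWebSocketClient.
struct Client {
  bool disconnected = false;  // set by the disconnect callback
  int sent = 0;
};

struct Server {
  std::mutex lock;            // guards the client list
  std::list<Client> clients;

  // Disconnect handling only marks the client; it never erases it,
  // so a broadcast that is iterating the list is not invalidated.
  void handleDisconnect(Client &c) { c.disconnected = true; }

  // Broadcast: safe to iterate because nothing erases mid-iteration.
  void textAll() {
    std::lock_guard<std::mutex> g(lock);
    for (Client &c : clients) {
      if (!c.disconnected) {
        ++c.sent;
      }
    }
  }

  // Count connected clients directly and erase all deletable
  // clients in one pass, as the PR description proposes.
  std::size_t cleanupClients() {
    std::lock_guard<std::mutex> g(lock);
    clients.remove_if([](const Client &c) { return c.disconnected; });
    return clients.size();
  }
};
```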

Copilot AI review requested due to automatic review settings April 9, 2026 03:49
Copilot AI left a comment

Pull request overview

This PR tightens AsyncWebSocket client lifetime handling to avoid iterator invalidation and unsafe client pointer usage when disconnects race with broadcast sends.

Changes:

  • Defers client removal from disconnect handling to cleanupClients() to avoid invalidating _clients iteration during broadcasts.
  • Adds additional locking around AsyncWebSocketClient::_client reads and moves per-client operations (close/ping/text/binary) to execute under the websocket server lock.
  • Updates cleanupClients() to count connected clients directly and erase all deletable clients in a single pass.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File: src/AsyncWebSocket.h — Adds client lock usage in shouldBeDeleted() to make deletion checks thread-safe.
File: src/AsyncWebSocket.cpp — Introduces a locked client lookup helper, adds additional locking in client callbacks/queueing, defers disconnect removal, and revises cleanup + per-client operations to run under the server lock.


@MitchBradley (Author)

While stress-testing with FluidNC running on macOS, I ran into some problems with websocket textAll(), which tended to deadlock. GPT-5.4 did a deep analysis and came up with these fixes, which solved the problem.

@MitchBradley (Author)

status() is now synchronized under the per-client mutex, so the connected-client checks in find_connected_client_locked() and cleanupClients() no longer race with _statusupdates. I also changed cleanupClients() to splice deletable clients into a temporary list and destroy them after releasing AsyncWebSocket::_lock, so disconnect callbacks no longer run under the server lock.
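The splice-then-destroy idea described here can be sketched as follows (illustrative names, not the real implementation): deletable clients are moved into a temporary list while the server lock is held, and only destroyed after the lock is released.

```cpp
#include <cstddef>
#include <list>
#include <mutex>

struct Client {
  bool deletable = false;  // would be derived from status()/shouldBeDeleted()
};

struct Server {
  std::mutex lock;  // stand-in for AsyncWebSocket::_lock
  std::list<Client> clients;

  std::size_t cleanupClients() {
    std::list<Client> doomed;  // outlives the locked scope below
    std::size_t connected = 0;
    {
      std::lock_guard<std::mutex> g(lock);
      for (auto it = clients.begin(); it != clients.end();) {
        if (it->deletable) {
          // O(1) move of the list node; no copy, other iterators stay valid.
          doomed.splice(doomed.end(), clients, it++);
        } else {
          ++connected;
          ++it;
        }
      }
    }  // server lock released here
    // 'doomed' is destroyed when the function returns, outside the lock,
    // so destructor-side work does not run while the server lock is held.
    return connected;
  }
};
```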

@mathieucarbou (Member)

mathieucarbou commented Apr 9, 2026

@MitchBradley : I have cherry-picked your commits into #427 and, in an added commit, removed the changes regarding how the clients are cleaned up, in order to split the PR into two parts:

For context: we already knew about this issue; @willmmiles once suggested we make some changes in this area, and another user ran into the same kind of issue you did.

Now, the proposed changes are quite significant, because initially cleanupClients() was mostly used to close rogue clients or a surplus of clients (when the max client count is reached).

With the middleware stack that was added, websocket client control can now be done with a middleware that directly controls the upgrade requests, proactively closing the WS upgrade request before it is turned into a websocket client. So there is less need now to call cleanupClients() than before, even if it is still advised.

So moving the complete client cleanup into this badly named cleanupClients() function would require everyone to call it from the loop, which somewhat defeats the purpose of an asynchronous framework, and would also change the meaning of this badly named function, which would then become both a client-count control function and a client cleanup function.

So let's merge #427 first with all the good fixes, and we can rebase this PR to focus on the behavior change, which we will need to discuss within the team with @me-no-dev and @willmmiles.

Maybe there is also another, better way to handle that.

We are also advising users to run with -D CONFIG_ASYNC_TCP_RUNNING_CORE=1 (https://esp32async.github.io/ESPAsyncWebServer/configuration/#important-recommendations-for-build-options). Running the async_tcp task on the same core as the user code (loop) also helps reduce some concurrency issues.
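For reference, under PlatformIO that recommendation might look like the fragment below. Treat it as an illustrative sketch (the environment name is hypothetical) and check the linked configuration page for the authoritative options.

```ini
[env:esp32dev]               ; hypothetical environment name
platform = espressif32
framework = arduino
build_flags =
  -D CONFIG_ASYNC_TCP_RUNNING_CORE=1   ; run async_tcp on core 1, with loop()
```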

@mathieucarbou mathieucarbou marked this pull request as draft April 9, 2026 13:18
@mathieucarbou mathieucarbou changed the title Fix websocket client lifetime races during broadcast Proposal to change how the WebSocket clients are cleaned up to remove a race condition Apr 9, 2026
return _clientId;
}
AwsClientStatus status() const {
asyncsrv::lock_guard_type lock(_lock);
@mathieucarbou (Member) Apr 9, 2026

Although I understand the use case of locking a whole function, do we really need one just to read a register value? status could also be updated from many other places.

Let's say client A disconnects, has taken the lock, and is about to set its status to DISCONNECTED.
At the same time, the app calls status() and reads CONNECTED for client A.
Then what?
Even if the app triggered a call to send some data to client A, the app would hit a lock in one of the send methods.

I do not think there is a race through the cleanupClients() function either: a double call to close() should be supported, and it would be OK if we miss a client that is currently disconnecting; it will be cleaned up next time.

Not sure we need a lock here?

@MitchBradley (Author)

I must confess that I am out of my depth here. Concurrency has always made my brain hurt. Here is what GPT-5.4 has to say:

There are two different questions:

  1. Is a stale read semantically acceptable?
  2. Is the read synchronized according to the C++ memory model?

For status(), the second question is the important one. If _status is an ordinary field, then a read from status() without taking the client lock is concurrent with writes that happen under the client lock in AsyncWebSocket.cpp. That is a real data race in C++, even if the only practical consequence most of the time would be “you briefly saw CONNECTED before DISCONNECTED”. The language does not treat that as a harmless stale read; it treats it as unsynchronized access to shared state, which is undefined behavior.

The later lock in a send path does not repair that earlier unsynchronized read. Once status() has read _status without synchronization, the damage is already done from the memory-model perspective. The same applies to cleanupClients(): whether a missed disconnect is tolerable is a behavioral question, but it does not make an unlocked read of _status safe.
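A minimal sketch of the guarded-read pattern being argued for here (hypothetical class and names; the real AsyncWebSocketClient carries much more state): reads of the status field take the same mutex as the writes, so every access is synchronized under the C++ memory model.

```cpp
#include <mutex>

enum class AwsStatus { Connected, Disconnecting, Disconnected };

class ClientSketch {
  mutable std::mutex _lock;  // mutable so const readers can lock
  AwsStatus _status = AwsStatus::Connected;

public:
  // Reader takes the same lock as writers: no unsynchronized access.
  AwsStatus status() const {
    std::lock_guard<std::mutex> g(_lock);
    return _status;
  }

  // Writer (e.g. a disconnect callback) updates under the same lock.
  void markDisconnected() {
    std::lock_guard<std::mutex> g(_lock);
    _status = AwsStatus::Disconnected;
  }
};
```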

@mathieucarbou (Member)

If the async_tcp task is running on the same core as the loop / user code, there is no need to add a mutex. But if async_tcp runs on core 0 and user code on another core, then accessing the value could potentially return a locally cached value while the real value in memory was updated from the other core. The question is: is this temporary situation acceptable, considering that the locally cached value won't stay long in the cache? It would take some user code busy-looping and calling status() to force many reads. In that unlikely case, marking the field volatile could be enough to force a read from memory each time.

But like I said, I am not sure I ever want to go there, because if you look at the parsing logic there are many fields (most of them) that are set from the async_tcp task, which could be on core 0, and accessed from core 1. And that's OK. Status is no exception.

And I think it is OK to assume that no one should ever call status() in a loop at such a high frequency that the returned value could be a cached one.

Usually people calling status() do so because they want to execute something else based on the returned value.

So I would revert this one.

@MitchBradley (Author)

I will defer to your judgment on this, but is the mutex so expensive that removing it is worth the risk? Apparently, according to C++, if there is any unguarded access to such a variable, all bets are off with respect to memory-barrier guarantees.

@mathieucarbou (Member) Apr 10, 2026

@MitchBradley : I agree with you that the C++ memory model is not being respected, in the sense that this status flag can be read/written from many places, so it could be considered shared state.

My concern is to what extent it should be considered shared state... For me there is no difference from the _acked and _ack fields, for example, which can also be read and accessed from many tasks (cores).

That's why I want this discussion and concern completely out of scope of this PR.

I don't disagree about the fact that some code review / update is required in this area.

But I disagree on putting it here in this PR.

This PR is to propose a change regarding the location when the cleanup happens, and the more important part is that this is a design / breaking change.

A new PR should be opened with the title "Review locking and unlocking of shared state to adhere to the C++ memory model". In that PR, yes, the work can be done and discussed with the whole team, and it can also be scoped to more than just the status flag. The status flag is only one variable like that; there are many more that share the same concerns and are accessed more frequently.

Please make sure your PRs are as isolated as possible and only fix one concern at a time. It's fine if you open two or even three PRs where one has to be merged after (and is based on) another. What is important is to keep each PR focused on one issue at a time.

Otherwise it is particularly hard to review, or to read back the history, when one PR contains a big bag of several fixes that are not focused on solving the same problem.

@mathieucarbou (Member)

Also, our current locks are made in a way to guard specific variables.

If you look at #429 I have renamed them for clarity.

We have 2 locks:

  • one that is guarding the queues
  • one that is guarding the ws clients list

That is all.
And we cannot use the same locks to lock everything...

For example, your changes tend to reuse the existing locks to sometimes guard the _client pointer, but not everywhere, and also to guard the status flag, again not everywhere.

So that's not consistent, and it can cause unexpected slowdowns and unnecessary locking, because it could prevent access to the status flag while someone is sending a message.

That's why this review work is better done in a more global PR and discussed there, because maybe the solution will end up completely different from what you proposed.

C++ also offers several other mechanisms, such as atomic fields, weak pointers, etc.
Locks are not the only solution.
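As one concrete illustration of the atomic-field alternative mentioned above (a sketch only, not something the team has settled on; the enum values and class name are stand-ins, not the library's real identifiers):

```cpp
#include <atomic>

enum AwsClientStatusSketch { WS_DISCONNECTED_S, WS_CONNECTED_S };

class ClientSketch {
  // An atomic status gives race-free reads without taking any mutex,
  // at the cost of synchronizing only this one field.
  std::atomic<AwsClientStatusSketch> _status{WS_CONNECTED_S};

public:
  AwsClientStatusSketch status() const {
    return _status.load(std::memory_order_acquire);
  }
  void markDisconnected() {
    _status.store(WS_DISCONNECTED_S, std::memory_order_release);
  }
};
```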

@mathieucarbou mathieucarbou marked this pull request as ready for review April 9, 2026 15:19
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.



@MitchBradley MitchBradley force-pushed the fix/posix-websocket-broadcast-race branch from fcff5d0 to 8bf0526 Compare April 9, 2026 21:25

void AsyncWebSocketClient::_onPoll() {
asyncsrv::unique_lock_type lock(_lock);

@mathieucarbou (Member)

This line-removal change can be reverted.


bool AsyncWebSocketClient::_queueControl(uint8_t opcode, const uint8_t *data, size_t len, bool mask) {
asyncsrv::lock_guard_type lock(_lock);

@mathieucarbou (Member)

This line-removal change can be reverted.

_messageQueue.clear();
_controlQueue.clear();
}
_server->_handleEvent(this, WS_EVT_DISCONNECT, NULL, NULL, 0);
@mathieucarbou (Member)

Did you correctly test that moving this to _handleDisconnect works in all situations?

  1. javascript client (or websocat) closes WS connection
  2. server-side close
  3. cleanupClient call
  4. network connection breaks (i.e. wifi disconnects)

The call here makes sure that the user event handler is called whatever the use case and can cleanup resources.

With the move to _handleDisconnect, the event will only fire if _onDisconnect is called, which is subject to how the AsyncTCP / ESPAsyncTCP / RPI / etc. implementation was done, which we have no control over.


bool AsyncWebSocketClient::_queueMessage(AsyncWebSocketSharedBuffer buffer, uint8_t opcode, bool mask) {
asyncsrv::unique_lock_type lock(_lock);

@mathieucarbou (Member)

This line-removal change can be reverted.

@@ -502,14 +498,16 @@ bool AsyncWebSocketClient::_queueMessage(AsyncWebSocketSharedBuffer buffer, uint
}

void AsyncWebSocketClient::close(uint16_t code, const char *message) {
@mathieucarbou (Member)

I don't think the changes here are relevant to the goal of this PR (which is to propose a new way to clean up clients), and I do not think it is needed at all to guard the status flag.

_server->_handleEvent(this, WS_EVT_ERROR, (void *)&reasonCode, (uint8_t *)reasonString, strlen(reasonString));
}
}
asyncsrv::unique_lock_type lock(_lock);
@mathieucarbou (Member)

I don't think the changes here are relevant to the goal of this PR (which is to propose a new way to clean up clients), and I do not think it is needed at all to guard the status flag.

if (_client) {
_client->ackLater();
{
asyncsrv::lock_guard_type lock(_lock);
@mathieucarbou (Member)

There is no need to protect the client ptr: it cannot become null between the if and the ackLater call.

_client ptr is only set to null from _client.close() and this is only called from the async_tcp task, same as this method.

Also, see: #429

this _lock is aimed at guarding the queue, not the client ptr


IPAddress AsyncWebSocketClient::remoteIP() const {
asyncsrv::lock_guard_type lock(_lock);

@mathieucarbou (Member)

This line-removal change can be reverted.


uint16_t AsyncWebSocketClient::remotePort() const {
asyncsrv::lock_guard_type lock(_lock);

@mathieucarbou (Member)

This line-removal change can be reverted.

// active iterator in the caller. However, emit the disconnect event now so
// applications observe the disconnect at the time it happens even though the
// client object remains in _clients until cleanup.
_handleEvent(client, WS_EVT_DISCONNECT, NULL, NULL, 0);

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

src/AsyncWebSocket.h:267

  • AsyncWebSocketClient::client() / client() const return the raw _client pointer without taking _lock, but _client is mutated under _lock (e.g. set to nullptr in _onDisconnect()). With status() now locking and other _client accessors being locked, these unlocked accessors become a remaining data-race/UB entry point. Consider guarding these accessors with _lock (or returning a snapshot via a locked getter) and/or clearly documenting that callers must externally synchronize before calling client().
  AwsClientStatus status() const {
    asyncsrv::lock_guard_type lock(_lock);
    return _status;
  }
  AsyncClient *client() {
    return _client;
  }
  const AsyncClient *client() const {
    return _client;
  }


// iterating _clients for broadcast sends, and erasing here invalidates the
// active iterator in the caller. However, emit the disconnect event now so
// applications observe the disconnect at the time it happens even though the
// client object remains in _clients until cleanup.
@mathieucarbou (Member)

@MitchBradley : indeed, the lock added in onDisconnect is wrong and should be removed from this PR. See my other comments. First, this is not the goal of this PR; also, the lock that is used here is the one that guards the queues.

@mathieucarbou (Member)

@MitchBradley : I have merged #429.

Could you please run your tests again ?

  1. without calling cleanupClients()
  2. with calling cleanupClients()

I think just the fixes done in #429 and yesterday should be enough. If not, this should be covered by a new PR, but not in this one.

Also, as I mentioned, you are not supposed to call cleanupClients(): the list cleans itself up when a client disconnects. This is an asynchronous server, so no loop action should be required from the user.

The cleanupClients() API is an old API that was put in place to limit the number of WS clients.
The correct way to do that now is to set up a middleware, as is done in the WebSocket.ino example.
