81 changes: 48 additions & 33 deletions src/AsyncWebSocket.cpp
@@ -301,7 +301,6 @@ AsyncWebSocketClient::~AsyncWebSocketClient() {
_messageQueue.clear();
_controlQueue.clear();
}
_server->_handleEvent(this, WS_EVT_DISCONNECT, NULL, NULL, 0);
Member:
Did you verify that moving this to _handleDisconnect works in all situations?

  1. javascript client (or websocat) closes WS connection
  2. server-side close
  3. cleanupClient call
  4. network connection breaks (i.e. wifi disconnects)

The call here makes sure that the user event handler is invoked in every one of those cases, so applications can clean up resources.

The moved call in _handleDisconnect only runs if _onDisconnect fires, and that depends on how the AsyncTCP / ESPAsyncTCP / RPI / etc. implementation was done, which we do not control.
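One way to keep that guarantee regardless of the transport is to fire the event from a single idempotent helper that both the transport callback and the destructor call. A minimal sketch under stated assumptions; WsClient, onDisconnect, and fireOnce are illustrative names, not the library's API:

```cpp
#include <functional>
#include <utility>

// Hypothetical sketch: the disconnect event fires exactly once, whether the
// TCP layer reports the disconnect or the client object is destroyed first.
// This preserves the destructor-based guarantee the original code had while
// still allowing the earlier notification from the transport callback.
class WsClient {
public:
  explicit WsClient(std::function<void()> onDisconnectEvent)
    : _event(std::move(onDisconnectEvent)) {}

  // Called by the TCP layer; some transport implementations may never invoke this.
  void onDisconnect() { fireOnce(); }

  // Backstop: the event still fires if the transport never called back.
  ~WsClient() { fireOnce(); }

private:
  void fireOnce() {
    if (!_fired) {
      _fired = true;
      _event();
    }
  }
  std::function<void()> _event;
  bool _fired = false;
};
```

The flag makes a double notification impossible, so it is safe for both paths to race to report the disconnect.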

}

void AsyncWebSocketClient::_clearQueue() {
@@ -358,7 +357,6 @@ void AsyncWebSocketClient::_onAck(size_t len, uint32_t time) {

void AsyncWebSocketClient::_onPoll() {
asyncsrv::unique_lock_type lock(_lock);

Member:
This blank-line removal is unrelated and can be reverted.

if (!_client) {
return;
}
@@ -446,7 +444,6 @@ bool AsyncWebSocketClient::canSend() const {

bool AsyncWebSocketClient::_queueControl(uint8_t opcode, const uint8_t *data, size_t len, bool mask) {
asyncsrv::lock_guard_type lock(_lock);

Member:
This blank-line removal is unrelated and can be reverted.

if (!_client) {
return false;
}
@@ -463,7 +460,6 @@ bool AsyncWebSocketClient::_queueControl(uint8_t opcode, const uint8_t *data, si

bool AsyncWebSocketClient::_queueMessage(AsyncWebSocketSharedBuffer buffer, uint8_t opcode, bool mask) {
asyncsrv::unique_lock_type lock(_lock);

Member:
This blank-line removal is unrelated and can be reverted.

if (!_client || !buffer || buffer->empty() || _status != WS_CONNECTED) {
return false;
}
@@ -502,14 +498,16 @@ bool AsyncWebSocketClient::_queueMessage(AsyncWebSocketSharedBuffer buffer, uint
}

void AsyncWebSocketClient::close(uint16_t code, const char *message) {
Member:
I don't think the changes here are relevant to the goal of this PR (which is to propose a new way to clean up clients), and I don't think guarding the status flag is needed at all.

if (_status != WS_CONNECTED) {
return;
{
asyncsrv::lock_guard_type lock(_lock);
if (_status != WS_CONNECTED) {
return;
}
_status = WS_DISCONNECTING;
}

async_ws_log_w("[%s][%" PRIu32 "] CLOSE", _server->url(), _clientId);

_status = WS_DISCONNECTING;

if (code) {
uint8_t packetLen = 2;
if (message != NULL) {
@@ -538,6 +536,7 @@ void AsyncWebSocketClient::close(uint16_t code, const char *message) {
}

bool AsyncWebSocketClient::ping(const uint8_t *data, size_t len) {
Member:
I don't think the changes here are relevant to the goal of this PR (which is to propose a new way to clean up clients), and I don't think guarding the status flag is needed at all.

asyncsrv::lock_guard_type lock(_lock);
return _status == WS_CONNECTED && _queueControl(WS_PING, data, len);
}

@@ -567,9 +566,10 @@ void AsyncWebSocketClient::_onData(void *pbuf, size_t plen) {
uint8_t *data = (uint8_t *)pbuf;

while (plen > 0) {
const AwsClientStatus client_status = status();
Member:
No need: these are only log lines, and they are compiled out depending on compilation flags.

async_ws_log_v(
"[%s][%" PRIu32 "] DATA plen: %" PRIu32 ", _pstate: %" PRIu8 ", _status: %" PRIu8, _server->url(), _clientId, static_cast<uint32_t>(plen), _pstate,
static_cast<uint8_t>(_status)
static_cast<uint8_t>(client_status)
);

if (_pstate == STATE_FRAME_START) {
@@ -688,10 +688,13 @@ void AsyncWebSocketClient::_onData(void *pbuf, size_t plen) {
_server->_handleEvent(this, WS_EVT_ERROR, (void *)&reasonCode, (uint8_t *)reasonString, strlen(reasonString));
}
}
asyncsrv::unique_lock_type lock(_lock);
Member:
I don't think the changes here are relevant to the goal of this PR (which is to propose a new way to clean up clients), and I don't think guarding the status flag is needed at all.

if (_status == WS_DISCONNECTING) {
_status = WS_DISCONNECTED;
if (_client) {
_client->close();
auto *client = _client;
lock.unlock();
client->close();
}
} else {
_status = WS_DISCONNECTING;
@@ -735,9 +738,12 @@ void AsyncWebSocketClient::_onData(void *pbuf, size_t plen) {
"[%s][%" PRIu32 "] DATA frame error: len: %u, index: %" PRIu64 ", total: %" PRIu64 "\n", _server->url(), _clientId, datalen, _pinfo.index, _pinfo.len
);

_status = WS_DISCONNECTING;
if (_client) {
_client->ackLater();
{
asyncsrv::lock_guard_type lock(_lock);
Member:
There is no need to protect the client ptr: it cannot become null between the if and the ackLater call.

_client ptr is only set to null from _client.close() and this is only called from the async_tcp task, same as this method.

Also, see: #429

this _lock is aimed at guarding the queue, not the client ptr

_status = WS_DISCONNECTING;
if (_client) {
_client->ackLater();
}
}
_queueControl(WS_DISCONNECT, data, datalen);
break;
@@ -952,7 +958,6 @@ bool AsyncWebSocketClient::binary(const __FlashStringHelper *data, size_t len) {

IPAddress AsyncWebSocketClient::remoteIP() const {
asyncsrv::lock_guard_type lock(_lock);

Member:
This blank-line removal is unrelated and can be reverted.

if (!_client) {
return IPAddress((uint32_t)0U);
}
@@ -962,7 +967,6 @@ IPAddress AsyncWebSocketClient::remoteIP() const {

uint16_t AsyncWebSocketClient::remotePort() const {
asyncsrv::lock_guard_type lock(_lock);

Member:
This blank-line removal is unrelated and can be reverted.

if (!_client) {
return 0;
}
@@ -991,14 +995,12 @@ AsyncWebSocketClient *AsyncWebSocket::_newClient(AsyncWebServerRequest *request)
}

void AsyncWebSocket::_handleDisconnect(AsyncWebSocketClient *client) {
asyncsrv::lock_guard_type lock(_lock);
const auto client_id = client->id();
const auto iter = std::find_if(std::begin(_clients), std::end(_clients), [client_id](const AsyncWebSocketClient &c) {
return c.id() == client_id;
});
if (iter != std::end(_clients)) {
_clients.erase(iter);
}
// Defer removal to cleanupClients(). Disconnect callbacks can fire while
// iterating _clients for broadcast sends, and erasing here invalidates the
// active iterator in the caller. However, emit the disconnect event now so
// applications observe the disconnect at the time it happens even though the
// client object remains in _clients until cleanup.
Member:
@MitchBradley : indeed, the lock added in onDisconnect is wrong and should be removed for this PR. See my other comments. First, this is not the goal of this PR, and second, the lock being used is the one that guards the queues.

_handleEvent(client, WS_EVT_DISCONNECT, NULL, NULL, 0);
}

bool AsyncWebSocket::availableForWriteAll() {
@@ -1055,17 +1057,30 @@ void AsyncWebSocket::closeAll(uint16_t code, const char *message) {
}

void AsyncWebSocket::cleanupClients(uint16_t maxClients) {
asyncsrv::lock_guard_type lock(_lock);
const size_t c = count();
if (c > maxClients) {
async_ws_log_v("[%s] CLEANUP %" PRIu32 " (%u/%" PRIu16 ")", _url.c_str(), _clients.front().id(), c, maxClients);
_clients.front().close();
}
std::list<AsyncWebSocketClient> removed_clients;
{
asyncsrv::lock_guard_type lock(_lock);
const size_t connected = std::count_if(std::begin(_clients), std::end(_clients), [](const AsyncWebSocketClient &c) {
return c.status() == WS_CONNECTED;
});

if (connected > maxClients) {
const auto connected_iter = std::find_if(std::begin(_clients), std::end(_clients), [](const AsyncWebSocketClient &c) {
return c.status() == WS_CONNECTED;
});
if (connected_iter != std::end(_clients)) {
async_ws_log_v("[%s] CLEANUP %" PRIu32 " (%u/%" PRIu16 ")", _url.c_str(), connected_iter->id(), connected, maxClients);
connected_iter->close();
}
}

for (auto i = _clients.begin(); i != _clients.end(); ++i) {
if (i->shouldBeDeleted()) {
_clients.erase(i);
break;
for (auto iter = _clients.begin(); iter != _clients.end();) {
if (iter->shouldBeDeleted()) {
auto current = iter++;
removed_clients.splice(removed_clients.end(), _clients, current);
} else {
++iter;
}
}
}
}
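The splice-based sweep above avoids the classic erase-while-iterating pitfall: `std::list::splice` relocates a node without invalidating any other iterator, and the spliced-out clients are destroyed only after the lock is released. A standalone sketch of the same pattern, where `removeDeleted` and the `int` element type are stand-ins for the real client type:

```cpp
#include <list>

// Move every element that "should be deleted" into a separate list while
// iterating. std::list::splice is O(1) per node and leaves every other
// iterator valid, so the loop can keep walking safely.
std::list<int> removeDeleted(std::list<int> &clients) {
  std::list<int> removed;
  for (auto iter = clients.begin(); iter != clients.end();) {
    if (*iter % 2 == 0) {        // stand-in for iter->shouldBeDeleted()
      auto current = iter++;     // advance before the node changes lists
      removed.splice(removed.end(), clients, current);
    } else {
      ++iter;
    }
  }
  return removed;               // destroyed by the caller, outside any lock
}
```

Destroying the removed elements outside the critical section also keeps client destructors (which may call user event handlers) from running while the list lock is held.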
1 change: 1 addition & 0 deletions src/AsyncWebSocket.h
@@ -256,6 +256,7 @@ class AsyncWebSocketClient {
return _clientId;
}
AwsClientStatus status() const {
asyncsrv::lock_guard_type lock(_lock);
Member (@mathieucarbou, Apr 9, 2026):
Although I understand the use case of locking a whole function, do we really need one just to read a register-sized value? status could also be updated from many other places.

Let's say client A disconnects, has taken the lock, and is about to set its status to DISCONNECTED.
At the same time, the app calls status() and reads CONNECTED for client A.
Then what?
Even if the app then triggered a call to send some data to client A, it would hit the lock in one of the send methods.

I also do not think there is a race through the cleanupClients function: a double call to close() should be supported, and it is OK if we miss a client that is currently disconnecting; it will be cleaned up next time.

Not sure we need a lock here?

Author:
I must confess that I am out of my depth here. Concurrency has always made my brain hurt. Here is what GPT-5.4 has to say:

There are two different questions:

  1. Is a stale read semantically acceptable?
  2. Is the read synchronized according to the C++ memory model?

For status(), the second question is the important one. If _status is an ordinary field, then a read from status() without taking the client lock is concurrent with writes that happen under the client lock in AsyncWebSocket.cpp. That is a real data race in C++, even if the only practical consequence most of the time would be “you briefly saw CONNECTED before DISCONNECTED”. The language does not treat that as a harmless stale read; it treats it as unsynchronized access to shared state, which is undefined behavior.

The later lock in a send path does not repair that earlier unsynchronized read. Once status() has read _status without synchronization, the damage is already done from the memory-model perspective. The same applies to cleanupClients(): whether a missed disconnect is tolerable is a behavioral question, but it does not make an unlocked read of _status safe.
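One middle ground between a full mutex and an unsynchronized read, not something this PR itself proposes, is to make the flag a `std::atomic`: the read then becomes well-defined under the C++ memory model without taking a lock. A sketch with illustrative names; the enum values mirror the diff, but the `ClientStatus` wrapper is hypothetical:

```cpp
#include <atomic>
#include <cstdint>

// Illustrative sketch only: an atomic status flag gives a data-race-free
// read across cores without a mutex. Acquire/release ordering makes a
// status written on one core visible to readers on another.
enum AwsClientStatus : uint8_t { WS_DISCONNECTED, WS_CONNECTED, WS_DISCONNECTING };

class ClientStatus {
public:
  AwsClientStatus get() const { return _status.load(std::memory_order_acquire); }
  void set(AwsClientStatus s) { _status.store(s, std::memory_order_release); }

private:
  std::atomic<AwsClientStatus> _status{WS_CONNECTED};
};
```

This removes the formal undefined behavior of the unguarded read while staying far cheaper than a mutex; whether the library should adopt it is exactly the scoping question discussed below.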

Member:
If the async_tcp task runs on the same core as the loop or user code, there is no need to add a mutex. But if async_tcp runs on core 0 and user code on another core, then accessing the value could potentially return a locally cached value while the real value in memory was updated from the other core. The question is: is this temporary situation acceptable, considering that the locally cached value won't stay long in the cache? It would take some user code looping and calling status() at a silly frequency to force many stale reads, and in that silly case marking the field volatile could be enough to force a read from memory each time.

But like I said, I am not sure I want to go there, because if you look at the parsing logic, many fields (most of them) are set from the async_tcp task, which could be on core 0, and accessed from core 1. And that's OK. Status is no exception.

And I think it is OK to assume that no one should ever call status() in a loop at such a high frequency that the returned value could be a cached one.

Usually people calling status() do so because they want to execute something else based on the returned value.

So I would revert this one.

Author:

I will defer to your judgment on this - but is the mutex so expensive that removing it is worth the risk? Apparently, according to C++, if there is any unguarded access of such a variable, all bets are off with respect to memory barrier guarantees.

Member (@mathieucarbou, Apr 10, 2026):

@MitchBradley : I agree with you regarding the memory model of c++ not being respected in a sense that this status flag can be r/w from many places, so it could be considered as a shared state.

My concern is how much it can be considered a shared state... For me there is no difference with the _acked and _ack fields, for example, which can also be read and accessed from many tasks (cores).

That's why I want this discussion and concern completely out of scope of this PR.

I don't disagree about the fact that some code review / update is required in this area.

But I disagree on putting it here in this PR.

This PR is to propose a change regarding the location when the cleanup happens, and the more important part is that this is a design / breaking change.

A new PR should be opened with a title like "Review locking and unlocking of shared state to adhere to the C++ memory model". In that PR, yes, the work can be done and discussed with the whole team, and it can be scoped to more than just the status flag. The status flag is only one such variable; there are many more that share the same concerns and are accessed more frequently.

Please make sure your PR are as isolated as possible and only fix 1 concern at a time. That's fine if you open 2 or even 3 PRs and one has to be merged (and is based) on the other one. What is important is to keep PRs focused in 1 issue at a time.

Otherwise it is particularly hard to review, or to read back the history, when one PR contains a big bag of several fixes that are not focused on solving the same problem.

Member:

Also, our current locks are made in a way to guard specific variables.

If you look at #429 I have renamed them for clarity.

We have 2 locks:

  • one that is guarding the queues
  • one that is guarding the ws clients list

That is all.
And we cannot use the same locks to lock everything...

For example, your changes tend to reuse the existing locks to sometimes guard the _client pointer, but not everywhere, and also to guard the status flag, again not everywhere.

So that's not consistent and can cause unexpected slowdowns and unnecessary locking, because it could block access to the status flag while someone is sending a message.

That's why this review work has better be done in a more global PR and be discussed because maybe the solution will be completely different than what you proposed.

C++ also has several other mechanisms, like atomic fields, weak pointers, etc.
Locks are not the only solution.
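The two-lock layout described above can be sketched as follows. The struct and member names are illustrative (the real code uses its own types and the renames from #429); the point is only that each mutex guards exactly one piece of state and is never reused for anything else:

```cpp
#include <cstddef>
#include <list>
#include <mutex>

// Illustrative sketch of the two-lock design: one mutex for the queues,
// one for the client list. Element types are simplified to int.
struct WsServerState {
  std::mutex queueLock;          // guards messageQueue only
  std::list<int> messageQueue;

  std::mutex clientsLock;        // guards clients only
  std::list<int> clients;

  void enqueue(int msg) {
    std::lock_guard<std::mutex> guard(queueLock);
    messageQueue.push_back(msg);
  }

  std::size_t clientCount() {
    std::lock_guard<std::mutex> guard(clientsLock);
    return clients.size();
  }
};
```

Keeping the mapping one lock to one piece of state makes it easy to audit which accesses are guarded and avoids the inconsistent reuse the comment describes.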

return _status;
}
AsyncClient *client() {