Proposal to change how the WebSocket clients are cleaned up to remove a race condition #424
base: main
Changes from all commits: 7f96a06, a5d0afd, 50f1e28, 8bf0526
```diff
@@ -301,7 +301,6 @@ AsyncWebSocketClient::~AsyncWebSocketClient() {
     _messageQueue.clear();
     _controlQueue.clear();
   }
-  _server->_handleEvent(this, WS_EVT_DISCONNECT, NULL, NULL, 0);
 }

 void AsyncWebSocketClient::_clearQueue() {
```
```diff
@@ -358,7 +357,6 @@ void AsyncWebSocketClient::_onAck(size_t len, uint32_t time) {

 void AsyncWebSocketClient::_onPoll() {
   asyncsrv::unique_lock_type lock(_lock);
-
   if (!_client) {
     return;
   }
```

Member: this line removal change can be removed
```diff
@@ -446,7 +444,6 @@ bool AsyncWebSocketClient::canSend() const {

 bool AsyncWebSocketClient::_queueControl(uint8_t opcode, const uint8_t *data, size_t len, bool mask) {
   asyncsrv::lock_guard_type lock(_lock);
-
   if (!_client) {
     return false;
   }
```

Member: this line removal change can be removed
```diff
@@ -463,7 +460,6 @@ bool AsyncWebSocketClient::_queueControl(uint8_t opcode, const uint8_t *data, si

 bool AsyncWebSocketClient::_queueMessage(AsyncWebSocketSharedBuffer buffer, uint8_t opcode, bool mask) {
   asyncsrv::unique_lock_type lock(_lock);
-
   if (!_client || !buffer || buffer->empty() || _status != WS_CONNECTED) {
     return false;
   }
```

Member: this line removal change can be removed
```diff
@@ -502,14 +498,16 @@ bool AsyncWebSocketClient::_queueMessage(AsyncWebSocketSharedBuffer buffer, uint
 }

 void AsyncWebSocketClient::close(uint16_t code, const char *message) {
-  if (_status != WS_CONNECTED) {
-    return;
+  {
+    asyncsrv::lock_guard_type lock(_lock);
+    if (_status != WS_CONNECTED) {
+      return;
+    }
+    _status = WS_DISCONNECTING;
   }

   async_ws_log_w("[%s][%" PRIu32 "] CLOSE", _server->url(), _clientId);

-  _status = WS_DISCONNECTING;
-
   if (code) {
     uint8_t packetLen = 2;
     if (message != NULL) {
```

Member: I don't think the changes here are relevant to the goal of this PR (which is to propose a new way to clean up clients), and I do not think it is needed at all to guard the status flag.
```diff
@@ -538,6 +536,7 @@ void AsyncWebSocketClient::close(uint16_t code, const char *message) {
 }

 bool AsyncWebSocketClient::ping(const uint8_t *data, size_t len) {
+  asyncsrv::lock_guard_type lock(_lock);
   return _status == WS_CONNECTED && _queueControl(WS_PING, data, len);
 }
```

Member: I don't think the changes here are relevant to the goal of this PR (which is to propose a new way to clean up clients), and I do not think it is needed at all to guard the status flag.
```diff
@@ -567,9 +566,10 @@ void AsyncWebSocketClient::_onData(void *pbuf, size_t plen) {
   uint8_t *data = (uint8_t *)pbuf;

   while (plen > 0) {
+    const AwsClientStatus client_status = status();
     async_ws_log_v(
       "[%s][%" PRIu32 "] DATA plen: %" PRIu32 ", _pstate: %" PRIu8 ", _status: %" PRIu8, _server->url(), _clientId, static_cast<uint32_t>(plen), _pstate,
-      static_cast<uint8_t>(_status)
+      static_cast<uint8_t>(client_status)
     );

     if (_pstate == STATE_FRAME_START) {
```

Member: no need: these are only log lines and they get removed depending on compilation flags
```diff
@@ -688,10 +688,13 @@ void AsyncWebSocketClient::_onData(void *pbuf, size_t plen) {
         _server->_handleEvent(this, WS_EVT_ERROR, (void *)&reasonCode, (uint8_t *)reasonString, strlen(reasonString));
       }
     }
+    asyncsrv::unique_lock_type lock(_lock);
     if (_status == WS_DISCONNECTING) {
       _status = WS_DISCONNECTED;
       if (_client) {
-        _client->close();
+        auto *client = _client;
+        lock.unlock();
+        client->close();
       }
     } else {
       _status = WS_DISCONNECTING;
```

Member: I don't think the changes here are relevant to the goal of this PR (which is to propose a new way to clean up clients), and I do not think it is needed at all to guard the status flag.
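The `lock.unlock()` before `client->close()` in the hunk above follows a general pattern: never hold a non-recursive mutex while invoking code that may re-acquire it. A minimal sketch of that pattern, using illustrative names (not the library's types):

```cpp
#include <functional>
#include <mutex>

// Sketch: mutate shared state under the lock, but release the lock BEFORE
// invoking a callback that may re-enter this object and lock again.
// With a plain std::mutex, calling the callback while still holding the
// lock would deadlock.
struct Connection {
  std::mutex lock;
  bool open = true;
  std::function<void()> on_close;  // may call back into this Connection

  void close() {
    bool was_open;
    {
      std::lock_guard<std::mutex> guard(lock);
      was_open = open;
      open = false;
    }  // lock released here, before the callback runs
    if (was_open && on_close) {
      on_close();  // safe even if the callback takes `lock` again
    }
  }

  bool is_open() {
    std::lock_guard<std::mutex> guard(lock);
    return open;
  }
};
```

Copying `_client` into a local before unlocking (as the diff does) additionally keeps the pointer stable across the unlock.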
```diff
@@ -735,9 +738,12 @@ void AsyncWebSocketClient::_onData(void *pbuf, size_t plen) {
         "[%s][%" PRIu32 "] DATA frame error: len: %u, index: %" PRIu64 ", total: %" PRIu64 "\n", _server->url(), _clientId, datalen, _pinfo.index, _pinfo.len
       );

-      _status = WS_DISCONNECTING;
-      if (_client) {
-        _client->ackLater();
+      {
+        asyncsrv::lock_guard_type lock(_lock);
+        _status = WS_DISCONNECTING;
+        if (_client) {
+          _client->ackLater();
+        }
       }
       _queueControl(WS_DISCONNECT, data, datalen);
       break;
```

Member: There is no need to protect the client ptr: it cannot become null between the if and the ackLater call. The _client ptr is only set to null from _client.close(), and that is only called from the async_tcp task, same as this method. Also, see #429: this _lock is aimed at guarding the queue, not the client ptr.
```diff
@@ -952,7 +958,6 @@ bool AsyncWebSocketClient::binary(const __FlashStringHelper *data, size_t len) {

 IPAddress AsyncWebSocketClient::remoteIP() const {
   asyncsrv::lock_guard_type lock(_lock);
-
   if (!_client) {
     return IPAddress((uint32_t)0U);
   }
```

Member: this line removal change can be removed
```diff
@@ -962,7 +967,6 @@ IPAddress AsyncWebSocketClient::remoteIP() const {

 uint16_t AsyncWebSocketClient::remotePort() const {
   asyncsrv::lock_guard_type lock(_lock);
-
   if (!_client) {
     return 0;
   }
```

Member: this line removal change can be removed
```diff
@@ -991,14 +995,12 @@ AsyncWebSocketClient *AsyncWebSocket::_newClient(AsyncWebServerRequest *request)
 }

 void AsyncWebSocket::_handleDisconnect(AsyncWebSocketClient *client) {
   asyncsrv::lock_guard_type lock(_lock);
-  const auto client_id = client->id();
-  const auto iter = std::find_if(std::begin(_clients), std::end(_clients), [client_id](const AsyncWebSocketClient &c) {
-    return c.id() == client_id;
-  });
-  if (iter != std::end(_clients)) {
-    _clients.erase(iter);
-  }
+  // Defer removal to cleanupClients(). Disconnect callbacks can fire while
+  // iterating _clients for broadcast sends, and erasing here invalidates the
+  // active iterator in the caller. However, emit the disconnect event now so
+  // applications observe the disconnect at the time it happens even though the
+  // client object remains in _clients until cleanup.
+  _handleEvent(client, WS_EVT_DISCONNECT, NULL, NULL, 0);
 }

 bool AsyncWebSocket::availableForWriteAll() {
```

Member: @MitchBradley: indeed, the lock added in onDisconnect is wrong and should be removed for this PR. See my other comments. First, this is not the goal of this PR, and also the lock that is used is the one that guards the queues.
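The hazard that motivates deferring the erase can be reproduced in isolation: a broadcast loop iterates the client list, a send triggers a disconnect callback, and if that callback erased the current element the loop's iterator would dangle. The safe alternative is to mark the element and sweep afterwards. A sketch with illustrative names (not the library's types):

```cpp
#include <functional>
#include <list>

// Stand-in for a WebSocket client entry; {id} aggregate-initializable.
struct FakeClient {
  int id;
  bool should_delete = false;
};

// Broadcast to every client. `send` may flag the current client for
// deletion (simulating a disconnect callback firing mid-send), but we
// never erase inside the loop, so the iteration stays valid. A sweep
// after the loop removes the flagged entries.
inline int broadcast(std::list<FakeClient> &clients,
                     const std::function<void(FakeClient &)> &send) {
  int visited = 0;
  for (auto &c : clients) {  // safe: the list never shrinks mid-loop
    send(c);
    ++visited;
  }
  clients.remove_if([](const FakeClient &c) { return c.should_delete; });
  return visited;
}
```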
```diff
@@ -1055,17 +1057,30 @@ void AsyncWebSocket::closeAll(uint16_t code, const char *message) {
 }

 void AsyncWebSocket::cleanupClients(uint16_t maxClients) {
-  asyncsrv::lock_guard_type lock(_lock);
-  const size_t c = count();
-  if (c > maxClients) {
-    async_ws_log_v("[%s] CLEANUP %" PRIu32 " (%u/%" PRIu16 ")", _url.c_str(), _clients.front().id(), c, maxClients);
-    _clients.front().close();
-  }
+  std::list<AsyncWebSocketClient> removed_clients;
+  {
+    asyncsrv::lock_guard_type lock(_lock);
+    const size_t connected = std::count_if(std::begin(_clients), std::end(_clients), [](const AsyncWebSocketClient &c) {
+      return c.status() == WS_CONNECTED;
+    });

+    if (connected > maxClients) {
+      const auto connected_iter = std::find_if(std::begin(_clients), std::end(_clients), [](const AsyncWebSocketClient &c) {
+        return c.status() == WS_CONNECTED;
+      });
+      if (connected_iter != std::end(_clients)) {
+        async_ws_log_v("[%s] CLEANUP %" PRIu32 " (%u/%" PRIu16 ")", _url.c_str(), connected_iter->id(), connected, maxClients);
+        connected_iter->close();
+      }
+    }

-  for (auto i = _clients.begin(); i != _clients.end(); ++i) {
-    if (i->shouldBeDeleted()) {
-      _clients.erase(i);
-      break;
+    for (auto iter = _clients.begin(); iter != _clients.end();) {
+      if (iter->shouldBeDeleted()) {
+        auto current = iter++;
+        removed_clients.splice(removed_clients.end(), _clients, current);
+      } else {
+        ++iter;
+      }
     }
   }
 }
```
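The splice idiom in the new `cleanupClients()` above moves list nodes out of the shared list in O(1) while the lock is held, then lets them be destroyed after the lock is released, so destructors that fire user callbacks cannot deadlock on the same mutex. A generic sketch of the idiom, under the assumption of a plain `std::mutex` (the library's `asyncsrv` lock types are wrappers around something similar):

```cpp
#include <cstddef>
#include <list>
#include <mutex>

// Move every element matching `should_remove` out of `shared` while holding
// `m`, and destroy them only after the lock is released. splice() relinks
// list nodes without copying or invalidating other iterators.
template <typename T, typename Pred>
std::size_t purge(std::list<T> &shared, std::mutex &m, Pred should_remove) {
  std::list<T> removed;  // destructs outside the critical section below
  {
    std::lock_guard<std::mutex> lock(m);
    for (auto it = shared.begin(); it != shared.end();) {
      if (should_remove(*it)) {
        auto current = it++;  // advance first: splice unlinks `current`
        removed.splice(removed.end(), shared, current);  // O(1), no copy
      } else {
        ++it;
      }
    }
  }  // mutex released; element destructors may now run arbitrary code
  return removed.size();
}
```

Note also that the original loop only erased one client per call (`break` after the first match); the spliced version removes every dead client in a single pass.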
```diff
@@ -256,6 +256,7 @@ class AsyncWebSocketClient {
     return _clientId;
   }
   AwsClientStatus status() const {
+    asyncsrv::lock_guard_type lock(_lock);
     return _status;
   }

   AsyncClient *client() {
```

Member: Although I understand the use case of locking a whole function, do we need it just to read a register-sized value? status could also be updated from many other places. Say client A disconnects, has taken the lock, and is about to set its status to DISCONNECTED. I do not think there is a race through the cleanupClients function either, because a double call on close() should be supported, and it would be fine to miss a client that is currently disconnecting: it will be cleaned up next time. Not sure we need a lock here?

Author: I must confess that I am out of my depth here; concurrency has always made my brain hurt. Here is what GPT-5.4 has to say: There are two different questions. For status(), the second question is the important one. If _status is an ordinary field, then a read from status() without taking the client lock is concurrent with writes that happen under the client lock in AsyncWebSocket.cpp. That is a real data race in C++, even if the only practical consequence most of the time would be "you briefly saw CONNECTED before DISCONNECTED". The language does not treat that as a harmless stale read; it treats it as unsynchronized access to shared state, which is undefined behavior. The later lock in a send path does not repair that earlier unsynchronized read: once status() has read _status without synchronization, the damage is already done from the memory-model perspective. The same applies to cleanupClients(): whether a missed disconnect is tolerable is a behavioral question, but it does not make an unlocked read of _status safe.

Member: If the async_tcp task is running on the same core as the loop or user code, there is no need to add a mutex; but if async_tcp runs on core 0 and user code on another core, then reading the value could potentially return a locally cached value while the real value in memory was updated from the other core. The question is: is this temporary situation acceptable, considering that the locally cached value won't stay long in the cache? It would take some user code busy-looping on status() to force many stale reads. In that pathological case, marking the field volatile could be enough to force a read from memory each time. But like I said, I am not sure I ever want to go there, because if you look at the parsing logic there are many fields (most of them) set from the async_tcp task, which could be on core 0, and accessed from core 1. And that's OK. Status is no exception. I think it is fine to assume that no one should call status() in a loop at such high frequency that the returned value could be a cached one; usually people call status() because they want to execute something else based on the returned value. So I would revert this one.

Author: I will defer to your judgment on this, but is the mutex so expensive that removing it is worth the risk? Apparently, according to C++, if there is any unguarded access to such a variable, all bets are off with respect to memory-barrier guarantees.

Member: @MitchBradley: I agree with you regarding the memory model of C++ not being respected, in the sense that this status flag can be read and written from many places, so it could be considered shared state. My concern is how much it should be considered shared state. For me there is no difference from the _acked and _ack fields, for example, which can also be read and written from many tasks (cores). That's why I want this discussion and concern completely out of scope of this PR. I don't disagree that some code review / update is required in this area, but I disagree with putting it in this PR. This PR proposes a change regarding where the cleanup happens, and the more important part is that this is a design / breaking change. A new PR should be opened with the title "Review locking and unlocking of shared state to adhere to the C++ memory model". In that PR the work can be done, discussed with the whole team, and scoped to more than just the status flag: the status flag is only one such variable, and there are many more sharing the same concerns that are accessed more frequently. Please keep your PRs as isolated as possible and fix one concern at a time. It is fine to open 2 or even 3 PRs where one has to be merged into (and is based on) another; what is important is keeping each PR focused on one issue. Otherwise it is particularly hard to review, or to read back the history, when one PR contains a big bag of several fixes that are not focused on solving the same problem.

Member: Also, our current locks are made in a way to guard specific variables. If you look at #429 I have renamed them for clarity. We have 2 locks. That is all. For example, your changes tend to reuse the existing locks to sometimes guard the _client pointer, but not everywhere, and also to guard the status flag, also not everywhere. That's not consistent and can cause unexpected slowdowns and unnecessary locking, because it could prevent access to the status flag while someone is sending a message. That's why this review work had better be done in a more global PR and be discussed, because maybe the solution will be completely different from what you proposed. C++ also has several other mechanisms, like atomic fields, weak pointers, etc.
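The "atomic fields" alternative mentioned at the end of the thread would make the status read data-race-free without a mutex: a relaxed atomic load is cheap and well-defined from any core, at the cost of ordering nothing else. A sketch with hypothetical names (the library's actual enum and members differ):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical status values, mirroring the WS_* states in the discussion.
enum AwsStatus : std::uint8_t { CONNECTED, DISCONNECTING, DISCONNECTED };

struct StatusHolder {
  std::atomic<AwsStatus> status{CONNECTED};

  // Data-race-free read from any task/core, no lock required.
  // memory_order_relaxed: the value itself is safe, but this read orders
  // nothing else, so it must not guard access to other fields.
  AwsStatus get() const {
    return status.load(std::memory_order_relaxed);
  }

  // Compare-and-swap makes a double close() harmless: only the caller
  // that wins the CONNECTED -> DISCONNECTING transition returns true.
  bool begin_disconnect() {
    AwsStatus expected = CONNECTED;
    return status.compare_exchange_strong(expected, DISCONNECTING,
                                          std::memory_order_relaxed);
  }
};
```

Unlike `volatile` (floated earlier in the thread), `std::atomic` actually satisfies the C++ memory model; `volatile` forces a memory access but provides no atomicity or ordering guarantees.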
Member: Did you correctly test that moving this to _handleDisconnect works in all situations? The call here makes sure that the user event handler is called whatever the use case, so it can clean up resources. After the move, _handleDisconnect will only be called if _onDisconnect is called, which is subject to how the AsyncTCP / ESPAsyncTCP / RPI / etc. implementation was done, which we have no control over.