tinydtls: sock_dtls: fix deadlock in sock_dtls_recv() #12959
General comment for now. Handling of DTLS message in Another thing to take note of is that currently new incoming handshake is handled in |
thanks @pokgak for the feedback. Some remarks and questions:
I see. So, the initial intention of the while loop was to capture all messages of a possible handshake before returning? Is there a way to find out if a handshake is in progress and loop only then?
In the current implementation of `sock_dtls_recv()` a timeout message event is armed to fire after `timeout` us. If `sock_udp_recv()` takes a long time to return with a valid message (res > 0), then the timeout message might fire **before** the timer is removed with `xtimer_remove()`. This leads to a message in the `mbox` that is never removed. After a few iterations, this results in a full `mbox` and hence a deadlock of `sock_dtls_recv()` (particularly: in the `_read()` event of tinydtls).
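To make the failure mode above concrete, here is a toy model in plain C. It is only a sketch under stated assumptions: it does not use the RIOT `mbox` or `xtimer` APIs, and `toy_mbox_t`, `MBOX_SIZE` and every `toy_*` function are hypothetical names invented for illustration. It models a single effect, namely that each receive call in which the timer fires late leaves one unconsumed event in a bounded mailbox.

```c
#include <stdbool.h>
#include <stddef.h>

#define MBOX_SIZE 4 /* assumed capacity, chosen only for illustration */

enum { EV_READ = 1, EV_TIMEOUT = 2 };

/* Toy bounded mailbox (hypothetical, not the RIOT mbox_t). */
typedef struct {
    int type[MBOX_SIZE];
    size_t count;
} toy_mbox_t;

/* Non-blocking put: returns false when the mailbox is full. A blocking
 * put would wait forever at this point, which is the deadlock. */
bool toy_mbox_try_put(toy_mbox_t *m, int type)
{
    if (m->count == MBOX_SIZE) {
        return false;
    }
    m->type[m->count++] = type;
    return true;
}

/* One receive call in which the timer fired before it could be removed:
 * the timeout event is queued and nobody ever takes it out again. */
bool toy_recv_with_late_timer(toy_mbox_t *m)
{
    return toy_mbox_try_put(m, EV_TIMEOUT);
}

/* Counts how many such calls succeed before the mailbox is full. */
size_t toy_calls_until_full(void)
{
    toy_mbox_t m = { .count = 0 };
    size_t calls = 0;

    while (toy_recv_with_late_timer(&m)) {
        calls++;
    }
    return calls;
}
```

In this model the mailbox is exhausted after exactly `MBOX_SIZE` late-timer calls; any sender that then uses a blocking put would never return.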
ec5ad60
to
d216628
Compare
[edit: use of gcoap is based on #12104] I encountered a simple test case for this issue with confirmable message handling in the gcoap example. It uses tinydtls in PSK mode and the libcoap coap-server. Set up coap-server to drop message 7. The first six messages sent by libcoap are for DTLS session setup. The 7th is the response to the gcoap request below.
Start up the gcoap example, and request the time resource.
In Wireshark I see the request from gcoap followed a few seconds later by the retry and libcoap's response. However I never see the response in the gcoap CLI. I would expect to see something like below. I do see this response with the commit in this PR.
@kb2ma that indeed is a nice and minimal test case, thanks! Another problem I encountered with the gcoap dtls integration is a deadlock when request retransmissions trigger a dtls handshake (because the application closed a dtls session in the response handler of a previous request). Not sure if this test case is quite artificial or valid (any peer might close the session, though). The actual problem is that request retransmissions happen in the gcoap thread, as does the handling of dtls handshakes (_listen()). A thread cannot send and receive at the same time .. without having async socks / select / etc.
I encountered a similar problem but was able to resolve it. I have an automated test for the gcoap example to send requests to a server S while at the same time the gcoap example acts as a server to handle requests from a client C. In the test, S is the libcoap coap-server, and C is an aiocoap client. See In the case you describe, it sounds like there is a single peer though. At the same time, it seems like this scenario could work since dtls_sock is managing the messaging and should be able to intercept session setup messages. Is there a simple way to trigger the problem?
@kb2ma From this comment it seems this fixes your issue? @pokgak is the fix OK now that the while loop is not removed?
Yes, confirmable retries (con_retry_test) work with this PR.
@kb2ma is there an easy way to redo your test, but close the dtls session before a coap request retry? I suspect that this yields another deadlock (or locking for a long period of time, probably |
Simple send/receive test with dtls-sock example application still works as expected.
@cgundogan In our discussion, you also mentioned moving the new timeout calculation (code below) to after dtls_handle_message()
which would give more accurate remaining timeout e.g. when the dtls_handle_message()
takes a long time to return. Are you intentionally leaving that out?
RIOT/pkg/tinydtls/contrib/sock_dtls.c
Lines 407 to 410 in d216628
    if ((timeout != SOCK_NO_TIMEOUT) && (timeout != 0)) {
        uint32_t time_passed = (xtimer_now_usec() - start_recv);
        timeout = (time_passed > timeout) ? 0: timeout - time_passed;
    }
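The remaining-timeout calculation quoted above can be tried in isolation with the sketch below. The helper name `remaining_timeout_us` and the `NO_TIMEOUT` constant are hypothetical stand-ins (the latter for `SOCK_NO_TIMEOUT`, assumed here to be `UINT32_MAX`); the current time is passed in as a parameter instead of calling `xtimer_now_usec()`, so the snippet builds without RIOT.

```c
#include <stdint.h>

/* Stand-in for SOCK_NO_TIMEOUT; the value is an assumption of this sketch. */
#define NO_TIMEOUT UINT32_MAX

/* Hypothetical helper, not part of RIOT: returns how much of `timeout`
 * (in microseconds) is left once `now - start` microseconds have passed. */
uint32_t remaining_timeout_us(uint32_t timeout, uint32_t start, uint32_t now)
{
    /* "wait forever" and "do not wait" are passed through unchanged */
    if ((timeout == NO_TIMEOUT) || (timeout == 0)) {
        return timeout;
    }
    /* unsigned subtraction stays correct across one counter wrap-around */
    uint32_t time_passed = now - start;

    return (time_passed > timeout) ? 0 : timeout - time_passed;
}
```

Calling such a helper after `dtls_handle_message()` instead of before it would also subtract the time spent inside the handler, which is the point raised in the comment above.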
@@ -427,7 +418,6 @@ ssize_t sock_dtls_recv(sock_dtls_t *sock, sock_dtls_session_t *remote,
        if (mbox_try_get(&sock->mbox, &msg)) {
            switch(msg.type) {
                case DTLS_EVENT_READ:
                    xtimer_remove(&timeout_timer);
                    return msg.content.value;
                case DTLS_EVENT_TIMEOUT:
With the timer removed, we don't need to handle DTLS_EVENT_TIMEOUT
anymore. Maybe also lose the switch case.
Unfortunately, no. There is no mechanism in gcoap to close a session. I suppose this situation is manageable if gcoap only talks to at most DTLS_PEER_MAX (tinydtls) unique peers. On the other hand, there are scenarios that would like to return a 5.03 Service Unavailable if unable to open a new session.
How much work would be needed for this?
With #13495, no messages are put into mbox when a timeout occurs. This should fix the deadlock problem where the mbox fills up with timeout messages after some time.
@cgundogan can you confirm if this one is obsolete now?
yep
Contribution description
In the current implementation of `sock_dtls_recv()` a timeout message event is armed to fire after `timeout` us. If `sock_udp_recv()` takes a long time to return with a valid message (res > 0), then the timeout message might fire before the timer is removed with `xtimer_remove()`. This leads to a message in the `mbox` that is never removed. After a few iterations, this results in a full `mbox` and hence a deadlock of `sock_dtls_recv()` (particularly: in the `_read()` event of tinydtls).
This patch refactors the `sock_dtls_recv()` function to remove the timeout event (there is a timeout in `sock_udp_recv()` anyways) and also removes the while loop around the `sock_udp_recv()` call (this should be done by the application).
Testing procedure
I encountered this issue in a larger scale setup. I will try to find a minimal setup to reproduce this easily.
Issues/PRs references
none