Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tinydtls: sock_dtls: fix deadlock in sock_dtls_recv() #12959

Closed
wants to merge 1 commit into from

Conversation

cgundogan
Copy link
Member

Contribution description

In the current implementation of sock_dtls_recv() a timeout message
event is armed to fire after timeout us. If sock_udp_recv() takes
a long time to return with a valid message (res > 0), then the timeout
message might fire before the timer is removed with
xtimer_remove(). This leads to a message in the mbox that is never
removed. After a few iterations, this results in a full mbox and
hence a deadlock of sock_dtls_recv() (particularly: in the _read()
event of tinydtls).

This patch refactors the sock_dtls_recv() function to remove the timeout event (there is a timeout in sock_udp_recv() anyways) and also removes the while loop around the sock_udp_recv() call (this should be done by the application).

Testing procedure

I encountered this issue in a larger scale setup. I will try to find a minimal setup to reproduce this easily.

Issues/PRs references

none

@cgundogan cgundogan added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: network Area: Networking labels Dec 16, 2019
@leandrolanzieri leandrolanzieri added the Area: security Area: Security-related libraries and subsystems label Dec 16, 2019
@pokgak
Copy link
Contributor

pokgak commented Dec 16, 2019

General comment for now.

Handling of DTLS message in dtls_handle_message() may also costs significant time (e.g. when using ECC credentials). Removing the timer in sock_dtls_recv() completely and only relying on the timer in sock_udp_recv() means we now don't have a way to make sure that the whole sock_dtls_recv() operation does not takes more than timeout.

Another thing to take note of is that currently new incoming handshake is handled in sock_dtls_recv() by blocking until all handshake messages are processed. Removing the while loop here will change how the API behaves (as you said this should be done by the application) and the usage doc will also need to be updated.

@cgundogan
Copy link
Member Author

thanks @pokgak for the feedback. Some remarks and questions:

Handling of DTLS message in dtls_handle_message() may also costs significant time (e.g. when using ECC credentials). Removing the timer in sock_dtls_recv() completely and only relying on the timer in sock_udp_recv() means we now don't have a way to make sure that the whole sock_dtls_recv() operation does not takes more than timeout.

dtls_handle_message() is not interruptible if I am not mistaken. So even if it takes a long time for certain operations, there is no way to return this function on timeout time. The timeout message is checked after dtls_handle_message() returns (possibly at a time much much greater than timeout), so the actual timeout event is rather ineffective?

Another thing to take note of is that currently new incoming handshake is handled in sock_dtls_recv() by blocking until all handshake messages are processed. Removing the while loop here will change how the API behaves (as you said this should be done by the application) and the usage doc will also need to be updated.

I see. So, the initial intention of the while loop was to capture all messages on a possible handshake before returning? Is there a way to find out if a handshake is in progress and loop only then?

In the current implementation of `sock_dtls_recv()` a timeout message
event is armed to fire after `timeout` us. If `sock_udp_recv()` takes
a long time to return with a valid message (res > 0), then the timeout
message might fire **before** the timer is removed with
`xtimer_remove()`. This leads to a message in the `mbox` that is never
removed. After a few iterations, this results in a full `mbox` and
hence a deadlock of `sock_dtls_recv()` (particularly: in the `_read()`
event of tinydtls).
@kb2ma
Copy link
Member

kb2ma commented Dec 21, 2019

[edit: use of gcoap is based on #12104]

I encountered a simple test case for this issue with confirmable message handling in the gcoap example. It tinydtls in PSK mode and the libcoap coap-server example option to drop an outgoing message.

Setup coap-server to drop message 7. The first six messages sent by libcoap are for DTLS session setup. The 7th is the response to the gcoap request below.

   $ coap-server -k secretPSK -l 7-7

Startup the gcoap example, and request the time resource.

> coap get -c fd00:bbbb::1 5684 /time
coap get -c fd00:bbbb::1 5684 /time
gcoap_cli: sending msg ID 29844, 11 bytes

In Wireshark I see the request from gcoap followed a few seconds later by the retry and libcoap's response. However I never see the response in the gcoap CLI. I would expect to see something like below. I do see this response with the commit in this PR.

> gcoap: response Success, code 2.05, 15 bytes
Dec 21 10:36:35

@cgundogan
Copy link
Member Author

@kb2ma that indeed is a nice and minimal test case, thanks!

Another problem I encountered with the gcoap dtls integration is a deadlock when request retransmissions trigger a dtls handshake (because the application closed a dtls session in the response handler of a previous request). Not sure if this test case is quite artificial or valid (any peer might close the session, though). The actual problem is that request retransmissions happen in the gcoap thread, so does the handling of dtls handshakes (_listen()). A thread cannot send and receive at the same time .. without having async socks / select / etc.

@kb2ma
Copy link
Member

kb2ma commented Dec 21, 2019

Another problem I encountered with the gcoap dtls integration is a deadlock when request retransmissions trigger a dtls handshake

i encountered a similar problem but was able to resolve it. I have an automated test for the gcoap example to send requests to a server S while at the same time the gcoap example acts as a server to handle requests from a client C. In the test, S is the libcoap coap-server, and C is an aiocoap client. See test_client_server(). Both peers use the same PSK id/key due to limitations of credman as discussed in #12104. The solution was to increase the number of tinydtls peers by setting DTLS_PEER_MAX to 2.

In the case you describe, it sounds like there is a single peer though. At the same time, it seems like this scenario could work since dtls_sock is managing the messaging and should be able to intercept session setup messages. Is there a simple way to trigger the problem?

@fjmolinas
Copy link
Contributor

I have finished a first pass at functional testing and updating the riot-coap-pytest automated tests. You can see what works in the Setup section of the test README.md.

There are 3 issues at present:

  1. The deadlock addressed in tinydtls: sock_dtls: fix deadlock in sock_dtls_recv() #12959, which causes confirmable retries (con_retry_test) to fail
  2. credman limitation of a single credential per sock, described above
  3. cord_ep and cord_epsim examples use a synchronous messaging model which fails for DTLS handshaking.

On (3) I was able to manually register with a DTLS based RD server using the gcoap CLI example in PSK mode, but the cord_ep example froze early in handshaking.

Originally posted by @kb2ma in #12104 (comment)

@kb2ma From this comment it seems this fixes your issue?

@pokgak is the fix OK now that the while loop is not removed?

@kb2ma
Copy link
Member

kb2ma commented Jan 3, 2020

Yes, confirmable retries (con_retry_test) work with this PR.

@cgundogan
Copy link
Member Author

Yes, confirmable retries (con_retry_test) work with this PR.

@kb2ma is there an easy way to redo your test, but close the dtls session before a coap request retry? I suspect that this yields another deadlock (or locking for a long period of time, probably DTLS_HANDSHAKE_TIMEOUT). The gcoap thread is then trying to create a new session before sending the request retransmission. Another issue with the current integration is that all other packets (even from successful dtls peers) are dropped when a handshake is in process.

Copy link
Contributor

@pokgak pokgak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simple send/receive test with dtls-sock example application still works as expected.

@cgundogan In our discussion, you also mentioned moving the new timeout calculation (code below) to after dtls_handle_message() which would give more accurate remaining timeout e.g. when the dtls_handle_message() takes a long time to return. Are you intentionally leaving that out?

if ((timeout != SOCK_NO_TIMEOUT) && (timeout != 0)) {
uint32_t time_passed = (xtimer_now_usec() - start_recv);
timeout = (time_passed > timeout) ? 0: timeout - time_passed;
}

@@ -427,7 +418,6 @@ ssize_t sock_dtls_recv(sock_dtls_t *sock, sock_dtls_session_t *remote,
if (mbox_try_get(&sock->mbox, &msg)) {
switch(msg.type) {
case DTLS_EVENT_READ:
xtimer_remove(&timeout_timer);
return msg.content.value;
case DTLS_EVENT_TIMEOUT:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the timer removed, we don't need to handle DTLS_EVENT_TIMEOUT anymore. Maybe also lose the switch case.

@kb2ma
Copy link
Member

kb2ma commented Jan 7, 2020

is there an easy way to redo your test, but close the dtls session before a coap request retry?

Unfortunately, no. There is no mechanism in gcoap to close a session. I suppose this situation is manageable if gcoap only talks to at most DTLS_PEER_MAX (tinydtls) unique peers. On the other hand, there are scenarios that would like to return a 5.03 Service Unavailable if unable to open a new session.

@benpicco
Copy link
Contributor

benpicco commented Feb 6, 2020

There is no mechanism in gcoap to close a session. I suppose this situation is manageable if gcoap only talks to at most DTLS_PEER_MAX (tinydtls) unique peers.

How much work would be needed for this?
I see this is a problem for a node that talks to many peers - don't want to have to reboot it if DTLS_PEER_MAX is exhausted 😉

@pokgak
Copy link
Contributor

pokgak commented Mar 9, 2020

With #13495, no messages are put into mbox when a timeout occurs. This should fix the deadlock problem where the mbox fills up with timeout messages after some time.

@fjmolinas
Copy link
Contributor

#13495 (comment)_

@cgundogan can you confirm if this one is obsolete now?

@cgundogan
Copy link
Member Author

@cgundogan can you confirm if this one is obsolete now?

yep

@cgundogan cgundogan closed this Apr 14, 2020
@cgundogan cgundogan deleted the pr/dtls_sock_deadlock branch April 14, 2020 08:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: network Area: Networking Area: security Area: Security-related libraries and subsystems Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants