tinydtls: sock_dtls: fix deadlock in sock_dtls_recv() #12959

cgundogan · 2019-12-16T10:27:15Z

Contribution description

In the current implementation of sock_dtls_recv() a timeout message
event is armed to fire after timeout us. If sock_udp_recv() takes
a long time to return with a valid message (res > 0), then the timeout
message might fire before the timer is removed with
xtimer_remove(). This leads to a message in the mbox that is never
removed. After a few iterations, this results in a full mbox and
hence a deadlock of sock_dtls_recv() (particularly: in the _read()
event of tinydtls).

This patch refactors the sock_dtls_recv() function to remove the timeout event (there is a timeout in sock_udp_recv() anyways) and also removes the while loop around the sock_udp_recv() call (this should be done by the application).

Testing procedure

I encountered this issue in a larger scale setup. I will try to find a minimal setup to reproduce this easily.

Issues/PRs references

none

pokgak · 2019-12-16T16:31:02Z

General comment for now.

Handling of DTLS message in dtls_handle_message() may also costs significant time (e.g. when using ECC credentials). Removing the timer in sock_dtls_recv() completely and only relying on the timer in sock_udp_recv() means we now don't have a way to make sure that the whole sock_dtls_recv() operation does not takes more than timeout.

Another thing to take note of is that currently new incoming handshake is handled in sock_dtls_recv() by blocking until all handshake messages are processed. Removing the while loop here will change how the API behaves (as you said this should be done by the application) and the usage doc will also need to be updated.

cgundogan · 2019-12-16T18:16:01Z

thanks @pokgak for the feedback. Some remarks and questions:

Handling of DTLS message in dtls_handle_message() may also costs significant time (e.g. when using ECC credentials). Removing the timer in sock_dtls_recv() completely and only relying on the timer in sock_udp_recv() means we now don't have a way to make sure that the whole sock_dtls_recv() operation does not takes more than timeout.

dtls_handle_message() is not interruptible if I am not mistaken. So even if it takes a long time for certain operations, there is no way to return this function on timeout time. The timeout message is checked after dtls_handle_message() returns (possibly at a time much much greater than timeout), so the actual timeout event is rather ineffective?

Another thing to take note of is that currently new incoming handshake is handled in sock_dtls_recv() by blocking until all handshake messages are processed. Removing the while loop here will change how the API behaves (as you said this should be done by the application) and the usage doc will also need to be updated.

I see. So, the initial intention of the while loop was to capture all messages on a possible handshake before returning? Is there a way to find out if a handshake is in progress and loop only then?

In the current implementation of `sock_dtls_recv()` a timeout message event is armed to fire after `timeout` us. If `sock_udp_recv()` takes a long time to return with a valid message (res > 0), then the timeout message might fire **before** the timer is removed with `xtimer_remove()`. This leads to a message in the `mbox` that is never removed. After a few iterations, this results in a full `mbox` and hence a deadlock of `sock_dtls_recv()` (particularly: in the `_read()` event of tinydtls).

kb2ma · 2019-12-21T10:53:20Z

[edit: use of gcoap is based on #12104]

I encountered a simple test case for this issue with confirmable message handling in the gcoap example. It tinydtls in PSK mode and the libcoap coap-server example option to drop an outgoing message.

Setup coap-server to drop message 7. The first six messages sent by libcoap are for DTLS session setup. The 7th is the response to the gcoap request below.

   $ coap-server -k secretPSK -l 7-7

Startup the gcoap example, and request the time resource.

> coap get -c fd00:bbbb::1 5684 /time
coap get -c fd00:bbbb::1 5684 /time
gcoap_cli: sending msg ID 29844, 11 bytes

In Wireshark I see the request from gcoap followed a few seconds later by the retry and libcoap's response. However I never see the response in the gcoap CLI. I would expect to see something like below. I do see this response with the commit in this PR.

> gcoap: response Success, code 2.05, 15 bytes
Dec 21 10:36:35

cgundogan · 2019-12-21T11:08:03Z

@kb2ma that indeed is a nice and minimal test case, thanks!

Another problem I encountered with the gcoap dtls integration is a deadlock when request retransmissions trigger a dtls handshake (because the application closed a dtls session in the response handler of a previous request). Not sure if this test case is quite artificial or valid (any peer might close the session, though). The actual problem is that request retransmissions happen in the gcoap thread, so does the handling of dtls handshakes (_listen()). A thread cannot send and receive at the same time .. without having async socks / select / etc.

kb2ma · 2019-12-21T11:53:03Z

Another problem I encountered with the gcoap dtls integration is a deadlock when request retransmissions trigger a dtls handshake

i encountered a similar problem but was able to resolve it. I have an automated test for the gcoap example to send requests to a server S while at the same time the gcoap example acts as a server to handle requests from a client C. In the test, S is the libcoap coap-server, and C is an aiocoap client. See test_client_server(). Both peers use the same PSK id/key due to limitations of credman as discussed in #12104. The solution was to increase the number of tinydtls peers by setting DTLS_PEER_MAX to 2.

In the case you describe, it sounds like there is a single peer though. At the same time, it seems like this scenario could work since dtls_sock is managing the messaging and should be able to intercept session setup messages. Is there a simple way to trigger the problem?

fjmolinas · 2020-01-03T09:38:37Z

I have finished a first pass at functional testing and updating the riot-coap-pytest automated tests. You can see what works in the Setup section of the test README.md.

There are 3 issues at present:

The deadlock addressed in tinydtls: sock_dtls: fix deadlock in sock_dtls_recv() #12959, which causes confirmable retries (con_retry_test) to fail

credman limitation of a single credential per sock, described above

cord_ep and cord_epsim examples use a synchronous messaging model which fails for DTLS handshaking.

On (3) I was able to manually register with a DTLS based RD server using the gcoap CLI example in PSK mode, but the cord_ep example froze early in handshaking.

Originally posted by @kb2ma in #12104 (comment)

@kb2ma From this comment it seems this fixes your issue?

@pokgak is the fix OK now that the while loop is not removed?

kb2ma · 2020-01-03T10:47:58Z

Yes, confirmable retries (con_retry_test) work with this PR.

cgundogan · 2020-01-03T16:02:31Z

Yes, confirmable retries (con_retry_test) work with this PR.

@kb2ma is there an easy way to redo your test, but close the dtls session before a coap request retry? I suspect that this yields another deadlock (or locking for a long period of time, probably DTLS_HANDSHAKE_TIMEOUT). The gcoap thread is then trying to create a new session before sending the request retransmission. Another issue with the current integration is that all other packets (even from successful dtls peers) are dropped when a handshake is in process.

pokgak

Simple send/receive test with dtls-sock example application still works as expected.

@cgundogan In our discussion, you also mentioned moving the new timeout calculation (code below) to after dtls_handle_message() which would give more accurate remaining timeout e.g. when the dtls_handle_message() takes a long time to return. Are you intentionally leaving that out?

RIOT/pkg/tinydtls/contrib/sock_dtls.c

Lines 407 to 410 in d216628

    
           if ((timeout != SOCK_NO_TIMEOUT) && (timeout != 0)) { 
        
               uint32_t time_passed = (xtimer_now_usec() - start_recv); 
        
               timeout = (time_passed > timeout) ? 0: timeout - time_passed; 
        
           }

pokgak · 2020-01-03T17:48:55Z

pkg/tinydtls/contrib/sock_dtls.c

@@ -427,7 +418,6 @@ ssize_t sock_dtls_recv(sock_dtls_t *sock, sock_dtls_session_t *remote,
        if (mbox_try_get(&sock->mbox, &msg)) {
            switch(msg.type) {
                case DTLS_EVENT_READ:
-                    xtimer_remove(&timeout_timer);
                    return msg.content.value;
                case DTLS_EVENT_TIMEOUT:


With the timer removed, we don't need to handle DTLS_EVENT_TIMEOUT anymore. Maybe also lose the switch case.

kb2ma · 2020-01-07T13:17:35Z

is there an easy way to redo your test, but close the dtls session before a coap request retry?

Unfortunately, no. There is no mechanism in gcoap to close a session. I suppose this situation is manageable if gcoap only talks to at most DTLS_PEER_MAX (tinydtls) unique peers. On the other hand, there are scenarios that would like to return a 5.03 Service Unavailable if unable to open a new session.

benpicco · 2020-02-06T11:09:16Z

There is no mechanism in gcoap to close a session. I suppose this situation is manageable if gcoap only talks to at most DTLS_PEER_MAX (tinydtls) unique peers.

How much work would be needed for this?
I see this is a problem for a node that talks to many peers - don't want to have to reboot it if DTLS_PEER_MAX is exhausted 😉

pokgak · 2020-03-09T16:12:45Z

With #13495, no messages are put into mbox when a timeout occurs. This should fix the deadlock problem where the mbox fills up with timeout messages after some time.

fjmolinas · 2020-04-14T07:14:55Z

#13495 (comment)_

@cgundogan can you confirm if this one is obsolete now?

cgundogan · 2020-04-14T08:30:09Z

@cgundogan can you confirm if this one is obsolete now?

yep

cgundogan added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: network Area: Networking labels Dec 16, 2019

cgundogan requested review from leandrolanzieri and pokgak December 16, 2019 10:27

leandrolanzieri added the Area: security Area: Security-related libraries and subsystems label Dec 16, 2019

cgundogan force-pushed the pr/dtls_sock_deadlock branch from ec5ad60 to d216628 Compare December 20, 2019 09:53

kb2ma mentioned this pull request Dec 22, 2019

gcoap: add DTLS integration #12104

Closed

pokgak suggested changes Jan 3, 2020

View reviewed changes

cgundogan mentioned this pull request Feb 28, 2020

examples/dtls-sock: fix timeout msg stays in mbox #13495

Merged

cgundogan closed this Apr 14, 2020

cgundogan deleted the pr/dtls_sock_deadlock branch April 14, 2020 08:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

tinydtls: sock_dtls: fix deadlock in sock_dtls_recv() #12959

tinydtls: sock_dtls: fix deadlock in sock_dtls_recv() #12959

cgundogan commented Dec 16, 2019

pokgak commented Dec 16, 2019

cgundogan commented Dec 16, 2019

kb2ma commented Dec 21, 2019 •

edited

cgundogan commented Dec 21, 2019

kb2ma commented Dec 21, 2019

fjmolinas commented Jan 3, 2020

kb2ma commented Jan 3, 2020

cgundogan commented Jan 3, 2020

pokgak left a comment

pokgak Jan 3, 2020

kb2ma commented Jan 7, 2020

benpicco commented Feb 6, 2020

pokgak commented Mar 9, 2020

fjmolinas commented Apr 14, 2020

cgundogan commented Apr 14, 2020

	if ((timeout != SOCK_NO_TIMEOUT) && (timeout != 0)) {
	uint32_t time_passed = (xtimer_now_usec() - start_recv);
	timeout = (time_passed > timeout) ? 0: timeout - time_passed;
	}

tinydtls: sock_dtls: fix deadlock in sock_dtls_recv() #12959

tinydtls: sock_dtls: fix deadlock in sock_dtls_recv() #12959

Conversation

cgundogan commented Dec 16, 2019

Contribution description

Testing procedure

Issues/PRs references

pokgak commented Dec 16, 2019

cgundogan commented Dec 16, 2019

kb2ma commented Dec 21, 2019 • edited

cgundogan commented Dec 21, 2019

kb2ma commented Dec 21, 2019

fjmolinas commented Jan 3, 2020

kb2ma commented Jan 3, 2020

cgundogan commented Jan 3, 2020

pokgak left a comment

Choose a reason for hiding this comment

pokgak Jan 3, 2020

Choose a reason for hiding this comment

kb2ma commented Jan 7, 2020

benpicco commented Feb 6, 2020

pokgak commented Mar 9, 2020

fjmolinas commented Apr 14, 2020

cgundogan commented Apr 14, 2020

kb2ma commented Dec 21, 2019 •

edited