lwIP: hangs due to sema change after a while #7035

miri64 · 2017-05-10T18:42:39Z

Try tests/lwip_sock_ip tests with both LWIP_IPV4=1 set as environment variable. The test will predictably hang at test_sock_ip_send4__socketed_no_local_no_netif(). I can't really say why this is, but some bisecting revealed that this is due to the changes recently done to sema (which is used by lwIP): before #6155 was merged the tests ran through, after it an assertion wasn't met. This however was fixed in #6696 but afterwards the tests hangs.

The text was updated successfully, but these errors were encountered:

lebrush · 2017-05-10T19:48:56Z

Strange... we have it running stably since a while. The only difference is that we use #6441 which has not been merged in yet. Can you apply the patch and confirm if it is still happening?

miri64 · 2017-05-11T06:24:07Z

Yapp, still hanging :-/. Might also be, that due to my hacky handling in those tests I leave the stack's mutexes in an unclean state, but since I really don't touch them and just manipulate the neighbor cache, I doubt that.

miri64 · 2017-05-11T06:25:07Z

(also curious that only lwip_sock_ip is affected, but not lwip_sock_udp or lwip_sock_tcp)

lebrush · 2017-05-11T11:48:25Z

Thanks for trying it. It's actually a bit strange... from the debug traces I couldn't see anything. However, running it [slowly] with the debugger just works fine. I'm looking into it, but don't understand why is not happening with IPv6 or the other ones

lebrush · 2017-05-12T10:56:52Z

Analysis of the issue. There are 4 threads running:

PID	Name
1	idle
2	main
3	lwip_netdev_mux
4	tcpip_thread

The deadlock happens because we want to send "main" (pid 2) from "tcpip_thread" (pid 4) a message that we are done (stack.c method _netdev_send). In the case of IPv6 there's no message in the queue of "main" (capacity 1) so we just put it there and continue. For IPv4 there's a message in the queue and the sending thread blocks indefinitely as STATUS_SEND_BLOCKED and there's were it breaks.

I guess that it used to work becase semaphores were using msg (and dropping messages). So I don't think it's a problem with the sempahore itself but on how the information flow is used in the test.
I'm not sure when mbox was introduced, or if it's using msg tcb variables somehow, maybe there's an interference there? (just an idea).

Btw: I tried to run as well the problematic tests case without the previous test cases and IPv6 also fails of them suceeds. So there is a dependency with the previous cases, is that expected?

Logs for IPv4 (full log):

tcpip_thread: ip4_output_if: call netif->output()
tcpip_thread: etharp_send_ip: sending packet 0x80760d4
tcpip_thread: _netdev_send to 2 (status: 10) # Here should be no context switch
sched_set_status: removing thread 4 (9->4) from runqueue 1.
sched_run: active thread: 4, next thread: 2
sched_run: done, changed sched_active_thread.
main:  sys_arch_sem_wait (0) # Message is never read so "tcpip_thread" is never "unlocked".
sched_set_status: removing thread 2 (9->2) from runqueue 7.
sched_run: active thread: 2, next thread: 1
sched_run: done, changed sched_active_thread.

Logs for IPv6 (full log):

tcpip_thread: netif->output_ip6()
tcpip_thread: ethip6_send: sending packet 0x807912c
tcpip_thread: _netdev_send to 2 (status: 10) # Here there's no context switch since the message is queued
tcpip_thread:  sys_sem_signal 
tcpip_thread:  sys_mbox_fetch 1
sched_set_status: removing thread 4 (9->8) from runqueue 1.
sched_run: active thread: 4, next thread: 2
sched_run: done, changed sched_active_thread.
main:  sys_arch_sem_wait (0)
sched_set_status: removing thread 2 (9->2) from runqueue 7.
sched_run: active thread: 2, next thread: 1
sched_run: done, changed sched_active_thread.
idle:  _mbox_timeout

I don't fully understand the purpose and interaction of all tests and won't have time to look at it deeply within the comming days... maybe you see something that I didn't consider so far.

lebrush · 2017-08-31T09:21:50Z

It might be relevant for this: we ran into priority inversion issues with lwIP, this is something that will have to be tackled at some point in RIOT :-(

miri64 · 2017-09-04T11:57:27Z

Like #7461 e.g.?

miri64 · 2018-08-02T11:45:12Z

Seems like it was an issue in lwIP after all, since I can't reproduce it in master anymore and 41eb295 is the first one where it worked.

miri64 added the Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) label May 10, 2017

miri64 assigned lebrush May 10, 2017

miri64 mentioned this issue May 10, 2017

lwip: port to v2.0.2 #7036

Merged

kYc0o added the Area: network Area: Networking label Oct 31, 2017

miri64 mentioned this issue Jul 17, 2018

lwip: fix lwip thread_create() wrapper #9588

Merged

miri64 closed this as completed Aug 2, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lwIP: hangs due to sema change after a while #7035

lwIP: hangs due to sema change after a while #7035

miri64 commented May 10, 2017

lebrush commented May 10, 2017

miri64 commented May 11, 2017

miri64 commented May 11, 2017

lebrush commented May 11, 2017

lebrush commented May 12, 2017

lebrush commented Aug 31, 2017

miri64 commented Sep 4, 2017

miri64 commented Aug 2, 2018

lwIP: hangs due to sema change after a while #7035

lwIP: hangs due to sema change after a while #7035

Comments

miri64 commented May 10, 2017

lebrush commented May 10, 2017

miri64 commented May 11, 2017

miri64 commented May 11, 2017

lebrush commented May 11, 2017

lebrush commented May 12, 2017

lebrush commented Aug 31, 2017

miri64 commented Sep 4, 2017

miri64 commented Aug 2, 2018