Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lwIP: hangs due to sema change after a while #7035

Closed
miri64 opened this issue May 10, 2017 · 8 comments
Closed

lwIP: hangs due to sema change after a while #7035

miri64 opened this issue May 10, 2017 · 8 comments
Assignees
Labels
Area: network Area: Networking Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)

Comments

@miri64
Copy link
Member

miri64 commented May 10, 2017

Try tests/lwip_sock_ip tests with both LWIP_IPV4=1 set as environment variable. The test will predictably hang at test_sock_ip_send4__socketed_no_local_no_netif(). I can't really say why this is, but some bisecting revealed that this is due to the changes recently done to sema (which is used by lwIP): before #6155 was merged the tests ran through, after it an assertion wasn't met. This however was fixed in #6696 but afterwards the tests hangs.

@miri64 miri64 added the Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) label May 10, 2017
@lebrush
Copy link
Member

lebrush commented May 10, 2017

Strange... we have it running stably since a while. The only difference is that we use #6441 which has not been merged in yet. Can you apply the patch and confirm if it is still happening?

@miri64
Copy link
Member Author

miri64 commented May 11, 2017

Yapp, still hanging :-/. Might also be, that due to my hacky handling in those tests I leave the stack's mutexes in an unclean state, but since I really don't touch them and just manipulate the neighbor cache, I doubt that.

@miri64
Copy link
Member Author

miri64 commented May 11, 2017

(also curious that only lwip_sock_ip is affected, but not lwip_sock_udp or lwip_sock_tcp)

@lebrush
Copy link
Member

lebrush commented May 11, 2017

Thanks for trying it. It's actually a bit strange... from the debug traces I couldn't see anything. However, running it [slowly] with the debugger just works fine. I'm looking into it, but don't understand why is not happening with IPv6 or the other ones

@lebrush
Copy link
Member

lebrush commented May 12, 2017

Analysis of the issue. There are 4 threads running:

PID Name
1 idle
2 main
3 lwip_netdev_mux
4 tcpip_thread

The deadlock happens because we want to send "main" (pid 2) from "tcpip_thread" (pid 4) a message that we are done (stack.c method _netdev_send). In the case of IPv6 there's no message in the queue of "main" (capacity 1) so we just put it there and continue. For IPv4 there's a message in the queue and the sending thread blocks indefinitely as STATUS_SEND_BLOCKED and there's were it breaks.

I guess that it used to work becase semaphores were using msg (and dropping messages). So I don't think it's a problem with the sempahore itself but on how the information flow is used in the test.
I'm not sure when mbox was introduced, or if it's using msg tcb variables somehow, maybe there's an interference there? (just an idea).

Btw: I tried to run as well the problematic tests case without the previous test cases and IPv6 also fails of them suceeds. So there is a dependency with the previous cases, is that expected?

Logs for IPv4 (full log):

tcpip_thread: ip4_output_if: call netif->output()
tcpip_thread: etharp_send_ip: sending packet 0x80760d4
tcpip_thread: _netdev_send to 2 (status: 10) # Here should be no context switch
sched_set_status: removing thread 4 (9->4) from runqueue 1.
sched_run: active thread: 4, next thread: 2
sched_run: done, changed sched_active_thread.
main:  sys_arch_sem_wait (0) # Message is never read so "tcpip_thread" is never "unlocked".
sched_set_status: removing thread 2 (9->2) from runqueue 7.
sched_run: active thread: 2, next thread: 1
sched_run: done, changed sched_active_thread.

Logs for IPv6 (full log):

tcpip_thread: netif->output_ip6()
tcpip_thread: ethip6_send: sending packet 0x807912c
tcpip_thread: _netdev_send to 2 (status: 10) # Here there's no context switch since the message is queued
tcpip_thread:  sys_sem_signal 
tcpip_thread:  sys_mbox_fetch 1
sched_set_status: removing thread 4 (9->8) from runqueue 1.
sched_run: active thread: 4, next thread: 2
sched_run: done, changed sched_active_thread.
main:  sys_arch_sem_wait (0)
sched_set_status: removing thread 2 (9->2) from runqueue 7.
sched_run: active thread: 2, next thread: 1
sched_run: done, changed sched_active_thread.
idle:  _mbox_timeout

I don't fully understand the purpose and interaction of all tests and won't have time to look at it deeply within the comming days... maybe you see something that I didn't consider so far.

@lebrush
Copy link
Member

lebrush commented Aug 31, 2017

It might be relevant for this: we ran into priority inversion issues with lwIP, this is something that will have to be tackled at some point in RIOT :-(

@miri64
Copy link
Member Author

miri64 commented Sep 4, 2017

Like #7461 e.g.?

@miri64
Copy link
Member Author

miri64 commented Aug 2, 2018

Seems like it was an issue in lwIP after all, since I can't reproduce it in master anymore and 41eb295 is the first one where it worked.

@miri64 miri64 closed this as completed Aug 2, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: network Area: Networking Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

No branches or pull requests

3 participants