
native stops receiving packets after a while #298

Closed
benpicco opened this issue Nov 4, 2013 · 13 comments

Labels: Type: bug

Comments

@benpicco (Contributor) commented Nov 4, 2013

When I run several native processes in desvirt, they will all stop receiving any UDP packets after a while.

I've tried the suggested ping -I tap0 -f -i 200 10.0.0.1, but most processes still cease receiving anything after a minute or so, unless they hit #288 first. (Actually, they won't be affected by #288 once they have stopped receiving packets.)

The typical output of ps in such a situation looks like this:

        pid | name                 | state    Q | pri | stack ( used) location   | runtime | switches 
          0 | idle                 | pending  Q |  31 |  8192 ( 1164) 0x8081420 |   -nan% |       -1
          1 | main                 | running  Q |  15 | 16384 ( 3396) 0x807d420 |   -nan% |       -1
          2 | uart0                | bl rx    _ |  14 |  8448 (  976) 0x8083560 |   -nan% |       -1
          3 | udp_packet_handler   | bl reply _ |  15 |  8192 ( 1008) 0x80b1fa0 |   -nan% |       -1
          4 | tcp_packet_handler   | bl rx    _ |  15 |  8192 (  960) 0x80b3fc0 |   -nan% |       -1
          5 | tcp_general_timer    | sleeping _ |  16 |  8192 ( 1252) 0x80affa0 |   -nan% |       -1
          6 | radio                | bl rx    _ |  13 |  8448 ( 1104) 0x80a9000 |   -nan% |       -1
          7 | Transceiver          | bl rx    _ |  12 | 16384 ( 2828) 0x80b7600 |   -nan% |       -1
          8 | ip_process           | bl reply _ |  14 | 49152 ( 1104) 0x8098b40 |   -nan% |       -1
          9 | lowpan_context_rem   | sleeping _ |  16 |  8192 ( 1056) 0x80967a0 |   -nan% |       -1
         10 | lowpan_transfer      | bl reply _ |  14 |  8192 (  976) 0x80a4b80 |   -nan% |       -1
         11 | olsr_rec             | bl rx    _ |  14 | 16384 ( 3860) 0x8086040 |   -nan% |       -1
         12 | olsr_snd             | bl rx    _ |  14 | 16384 ( 3268) 0x808a040 |   -nan% |       -1
         13 | pong                 | bl rx    _ |  14 |  8192 ( 1040) 0x808e120 |   -nan% |       -1
            | SUM                  |            |     | 188928

On a 'healthy' node, udp_packet_handler, ip_process and lowpan_transfer will be bl rx instead of bl reply.

ghost assigned LudwigKnuepfer on Nov 4, 2013
@LudwigKnuepfer (Member)

Could you try this in one of the branches (native_syscalls / issue_161) again?

@LudwigKnuepfer (Member)

@benpicco bump

@benpicco (Contributor, Author) commented Nov 6, 2013

Thank you, with your latest branch this doesn't happen anymore; response times are also much better.
Unfortunately, #288 and #272 are still present and make my test topology disintegrate after a short while. (Both seem to originate from makecontext?)

@LudwigKnuepfer (Member)

Thanks a lot!
It appears signal-driven IO is effectively edge-triggered...

And yes, the segfaults are a different issue and I'm trying to fix them as well. They are a little difficult to pinpoint, though, as makecontext is probably not the cause but the victim of the underlying problem.

BTW:
If all you need is a running system (until this is fixed), you could try slowing things down a bit. You could do that either inside RIOT by sending less often (maybe that is in tune with your goal of saving energy anyway?), or externally with tc. I could provide you with a tc wrapper script to build upon if you want to try that route. Or, if you are using desvirt, you could add a latency parameter to add_link in desvirt/lossnet.py and change the tc invocation accordingly, as sketched below.
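
A minimal sketch of what that desvirt/lossnet.py change might look like; the add_link signature and the delay_ms parameter are assumed for illustration, only the self.tc call pattern is taken from the snippet quoted later in this thread:

    # Hypothetical sketch, not the actual desvirt code: add a configurable
    # fixed delay to the netem qdisc that desvirt sets up per link.
    def add_link(self, from_tap, to_tap, mark, packet_loss, delay_ms=100):
        self.tc('qdisc add dev %s parent 1:%d netem loss %d%% delay %dms'
                % (to_tap, mark + 10, packet_loss, delay_ms))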

@benpicco (Contributor, Author) commented Nov 6, 2013

Well, I'm sending HELLO messages every 2s (actually 1s + random(0…1000) ms, so they aren't all sending at the same time) and Topology Control messages every 5s (again, 4s + 1s jitter). Only TC messages are forwarded instantly by each node that hasn't already received them.
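
For illustration only (made-up helper names; the real timers run inside the RIOT application), the jittered intervals boil down to:

    import random

    def hello_interval_ms():
        # HELLO: 1 s base plus up to 1 s of random jitter, i.e. 1-2 s,
        # so neighbouring nodes don't all transmit at the same instant.
        return 1000 + random.randint(0, 1000)

    def tc_interval_ms():
        # Topology Control: 4 s base plus up to 1 s of jitter, i.e. 4-5 s.
        return 4000 + random.randint(0, 1000)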

I've tried

self.tc('qdisc add dev %s parent 1:%d netem loss %d%% delay %dms' % (to_tap, mark+10, packet_loss, 100))

but it's still crashing.

@LudwigKnuepfer (Member)

100ms delay should be enough... did you verify the rule actually worked? (For example, set the delay to something noticeable like 2 seconds and watch the tap interfaces with tcpdump...)

@benpicco (Contributor, Author) commented Nov 6, 2013

Yes, ping between two neighbors goes from ~388µs to ~200388µs.
But a fixed delay won't affect the amount or likelihood of concurrently sent packets, it just postpones them.
A random delay might lower that likelihood; is that possible to set up?

@LudwigKnuepfer (Member)

Ah right, my fault... maybe drastically reducing the bandwidth instead... ;-)
I guess a probabilistic delay is possible; I don't know tc too well...
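
For what it's worth, netem accepts a second time value as random jitter around the base delay, so the self.tc line from above could presumably be extended like this (untested sketch; the 50ms jitter value is arbitrary):

    # "delay 100ms 50ms" makes netem add a randomly varying delay of
    # roughly 100ms +/- 50ms instead of a fixed 100ms.
    self.tc('qdisc add dev %s parent 1:%d netem loss %d%% delay %dms %dms'
            % (to_tap, mark + 10, packet_loss, 100, 50))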

@benpicco (Contributor, Author) commented Nov 6, 2013

Ah, thank you.
But unfortunately it doesn't really have an impact on how likely the segfaults are to appear. Assuming the problem is somehow related to concurrent packet reception, or to transmitting while receiving, then for every random collision this probabilistic delay resolves, it would just as randomly introduce another one.

@LudwigKnuepfer (Member)

I'm not sure I follow, but spreading events (random delay) over a bigger time frame (reduced bandwidth) should reduce the probability of simultaneous events, right? (Unless the medium is saturated.)

@benpicco (Contributor, Author) commented Nov 6, 2013

True, but the events are already randomized over a 1s interval, which makes me wonder whether the cause really is a collision (that should be much more unlikely) or something else.
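
As a rough back-of-the-envelope check (all numbers are assumptions, not measurements): if each of n nodes sends once at a uniformly random offset within a window T and a packet occupies about tau on the medium, the expected number of overlapping transmissions per window is roughly C(n,2) * 2*tau/T, which stays well below one for a handful of nodes:

    # Expected number of overlapping (colliding) transmission pairs when each
    # of n nodes sends once at a uniformly random time within a window T and
    # each transmission lasts tau. Illustrative numbers only.
    def expected_overlaps(n, tau, T):
        return n * (n - 1) / 2 * (2 * tau / T)

    print(expected_overlaps(n=10, tau=0.001, T=1.0))  # -> 0.09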

@LudwigKnuepfer (Member)

You should be able to see it with tcpdump...
Anyway, don't waste too much time on this hack. And yeah, it could be caused by anything ;-)
