
native stops receiving packets after a while #298

Closed
benpicco opened this issue Nov 4, 2013 · 13 comments

Labels: Type: bug

Comments

@benpicco (Contributor) commented Nov 4, 2013

When I run several native processes in desvirt, they will all stop receiving any UDP packets after a while.

I've tried the suggested ping -I tap0 -f -i 200 10.0.0.1, but most processes still cease receiving anything after a minute or so, unless they hit #288 first. (Actually, they won't be affected by #288 once they have stopped receiving packets.)

The typical output of ps in such a situation looks like this:

        pid | name                 | state    Q | pri | stack ( used) location   | runtime | switches 
          0 | idle                 | pending  Q |  31 |  8192 ( 1164) 0x8081420 |   -nan% |       -1
          1 | main                 | running  Q |  15 | 16384 ( 3396) 0x807d420 |   -nan% |       -1
          2 | uart0                | bl rx    _ |  14 |  8448 (  976) 0x8083560 |   -nan% |       -1
          3 | udp_packet_handler   | bl reply _ |  15 |  8192 ( 1008) 0x80b1fa0 |   -nan% |       -1
          4 | tcp_packet_handler   | bl rx    _ |  15 |  8192 (  960) 0x80b3fc0 |   -nan% |       -1
          5 | tcp_general_timer    | sleeping _ |  16 |  8192 ( 1252) 0x80affa0 |   -nan% |       -1
          6 | radio                | bl rx    _ |  13 |  8448 ( 1104) 0x80a9000 |   -nan% |       -1
          7 | Transceiver          | bl rx    _ |  12 | 16384 ( 2828) 0x80b7600 |   -nan% |       -1
          8 | ip_process           | bl reply _ |  14 | 49152 ( 1104) 0x8098b40 |   -nan% |       -1
          9 | lowpan_context_rem   | sleeping _ |  16 |  8192 ( 1056) 0x80967a0 |   -nan% |       -1
         10 | lowpan_transfer      | bl reply _ |  14 |  8192 (  976) 0x80a4b80 |   -nan% |       -1
         11 | olsr_rec             | bl rx    _ |  14 | 16384 ( 3860) 0x8086040 |   -nan% |       -1
         12 | olsr_snd             | bl rx    _ |  14 | 16384 ( 3268) 0x808a040 |   -nan% |       -1
         13 | pong                 | bl rx    _ |  14 |  8192 ( 1040) 0x808e120 |   -nan% |       -1
            | SUM                  |            |     | 188928

On a 'healthy' node, udp_packet_handler, ip_process and lowpan_transfer will be bl rx instead of bl reply.

ghost assigned LudwigKnuepfer on Nov 4, 2013
@LudwigKnuepfer (Member)

Could you try this in one of the branches (native_syscalls / issue_161) again?

@LudwigKnuepfer (Member)

@benpicco bump

@benpicco (Contributor, Author) commented Nov 6, 2013

Thank you, with your latest branch this doesn't happen anymore; response times are also much better.
Unfortunately, #288 and #272 are still present and make my test topology disintegrate after a short while. (Both seem to originate from makecontext?)

@LudwigKnuepfer (Member)

Thanks a lot!
It appears signal-driven IO is effectively edge-triggered...

And yes, the segfaults are a different issue and I'm trying to fix them as well. They are a little difficult to pinpoint, though, as makecontext is probably not the cause but the victim of the underlying problem.

BTW:
If all you need is a running system (until this is fixed), you could try slowing things down a bit. You could do that either inside RIOT by sending less often (maybe that is in tune with your goal of saving energy anyway?), or externally with tc. I could provide you with a tc wrapper script to build upon if you want to try that route. Or, if you are using desvirt, you could add a latency parameter to add_link in desvirt/lossnet.py and change the tc invocation accordingly, as sketched below.
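
A minimal sketch of what that desvirt/lossnet.py change might look like; the add_link signature and the delay_ms parameter are assumed for illustration, only the self.tc call pattern is taken from the snippet quoted later in this thread:

    # Hypothetical sketch, not the actual desvirt code: add a configurable
    # fixed delay to the netem qdisc that desvirt sets up per link.
    def add_link(self, from_tap, to_tap, mark, packet_loss, delay_ms=100):
        self.tc('qdisc add dev %s parent 1:%d netem loss %d%% delay %dms'
                % (to_tap, mark + 10, packet_loss, delay_ms))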

@benpicco (Contributor, Author) commented Nov 6, 2013

Well, I'm sending HELLO messages every 2s (actually 1s + random(0…1000) ms, so they aren't all sending at the same time) and Topology Control messages every 5s (again, 4s + 1s jitter). Only TC messages are forwarded instantly by each node that hasn't already received them.
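
For illustration only (made-up helper names; the real timers run inside the RIOT application), the jittered intervals boil down to:

    import random

    def hello_interval_ms():
        # HELLO: 1 s base plus up to 1 s of random jitter, i.e. 1-2 s,
        # so neighbouring nodes don't all transmit at the same instant.
        return 1000 + random.randint(0, 1000)

    def tc_interval_ms():
        # Topology Control: 4 s base plus up to 1 s of jitter, i.e. 4-5 s.
        return 4000 + random.randint(0, 1000)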

I've tried

self.tc('qdisc add dev %s parent 1:%d netem loss %d%% delay %dms' % (to_tap, mark+10, packet_loss, 100))

but it's still crashing.

@LudwigKnuepfer (Member)

100ms delay should be enough... did you verify the rule actually worked? (For example, set the delay to something noticeable like 2 seconds and watch the tap interfaces with tcpdump...)

@benpicco (Contributor, Author) commented Nov 6, 2013

Yes, ping between two neighbors goes from ~388µs to ~200388µs.
But a fixed delay won't affect the amount or likelihood of concurrently sent packets, it just postpones them.
A random delay might lower that likelihood; is that possible to set up?

@LudwigKnuepfer (Member)

Ah right, my fault... maybe drastically reducing the bandwidth instead... ;-)
I guess a probabilistic delay is possible; I don't know tc too well...
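
For what it's worth, netem accepts a second time value as random jitter around the base delay, so the self.tc line from above could presumably be extended like this (untested sketch; the 50ms jitter value is arbitrary):

    # "delay 100ms 50ms" makes netem add a randomly varying delay of
    # roughly 100ms +/- 50ms instead of a fixed 100ms.
    self.tc('qdisc add dev %s parent 1:%d netem loss %d%% delay %dms %dms'
            % (to_tap, mark + 10, packet_loss, 100, 50))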

@benpicco (Contributor, Author) commented Nov 6, 2013

Ah, thank you.
But unfortunately it doesn't really have an impact on how likely the segfaults are to appear. Assuming the problem is somehow related to concurrent packet reception, or to transmitting while receiving, then for every random collision this probabilistic delay resolves, it would just as randomly introduce another one.

@LudwigKnuepfer (Member)

I'm not sure I follow, but spreading events (random delay) over a bigger time frame (reduced bandwidth) should reduce the probability of simultaneous events, right? (Unless the medium is saturated.)

@benpicco (Contributor, Author) commented Nov 6, 2013

True, but the events are already randomized over a 1s interval, which makes me wonder whether the cause really is a collision (that should be much more unlikely) or something else.
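
As a rough back-of-the-envelope check (all numbers are assumptions, not measurements): if each of n nodes sends once at a uniformly random offset within a window T and a packet occupies about tau on the medium, the expected number of overlapping transmissions per window is roughly C(n,2) * 2*tau/T, which stays well below one for a handful of nodes:

    # Expected number of overlapping (colliding) transmission pairs when each
    # of n nodes sends once at a uniformly random time within a window T and
    # each transmission lasts tau. Illustrative numbers only.
    def expected_overlaps(n, tau, T):
        return n * (n - 1) / 2 * (2 * tau / T)

    print(expected_overlaps(n=10, tau=0.001, T=1.0))  # -> 0.09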

@LudwigKnuepfer (Member)

You should be able to see it with tcpdump...
Anyway, don't waste too much time on this hack. And yeah, it could be caused by anything ;-)
