zperf ping timeout on beagleconnect_freedom over subg #68674
Comments
Hi @Ayush1325! We appreciate you submitting your first issue for our open-source project. 🌟 Even though I'm a bot, I can assure you that the whole community is genuinely grateful for your time and effort. 🤖💙 |
There are also network echo-server/client samples that could have been used for testing. Those samples support both TCP and UDP, and either one can be disabled if not needed. The configs look mostly ok, just wondering why you have The error message says that you ran out of buffers. So if you are certain that the buffer count is enough, then I suspect there is either a memory leak (not likely), or you are receiving too much data and the device is not able to process it. You can use the network shell to monitor buffer usage. The command is |
Well, I used separate applications since I wanted to test across Zephyr forks. But yes, maybe it would be better to use the samples.
While I cannot comment on memory leak, I can confidently say that the device is capable of processing this much data. The Zephyr fork that BeagleBoard has is able to run this example without any problem. In fact I can run this example with rx and tx packet and buffer count set to 16. Also, after some testing, it seems that I get the
Here is the output from uart:~$ net allocs
Network memory allocations
memory      Status  Pool  Function alloc -> freed
0x20009020  free    RX    drv_rx_done():356 -> reass_timeout():290
0x20009064  free    RX    drv_rx_done():356 -> reass_timeout():290
0x200090a8  free    RX    drv_rx_done():356 -> reass_timeout():290
0x20009284  free    RX    drv_rx_done():356 -> pkt_alloc_with_buffer():1487
0x20008e44  free    TX    ieee802154_acknowledge():83 -> ieee802154_acknowledge():94
0x200091b8  free    RX    drv_rx_done():356 -> net_icmpv6_input():371
0x20009174  free    RX    drv_rx_done():356 -> net_icmpv6_input():371
0x20009130  free    RX    drv_rx_done():356 -> net_icmpv6_input():371
0x200090ec  free    RX    drv_rx_done():356 -> reass_timeout():290
0x200090a8  free    RX    drv_rx_done():356 -> fragment_add_to_cache():562
0x20009064  free    RX    drv_rx_done():356 -> fragment_add_to_cache():562
0x20009020  free    RX    drv_rx_done():356 -> fragment_add_to_cache():562
uart:~$ net mem
Fragment length 128 bytes
Network buffer pools:
Address     Total  Avail  Name
0x20001278  16     16     RX
0x20001294  16     16     TX
0x2000143c  16     16     RX DATA (rx_bufs)
0x20001474  16     16     TX DATA (tx_bufs)
No external memory pools found. |
Hi @Ayush1325 - Would it be possible to add a log message that prints both the time and failure count, and then to create a graph of the failures over time? (this would require incrementing a counter when the error is encountered). Additionally, it would be best to either test upstream <-> upstream, or fork <-> fork, but mixing versions introduces many more moving parts so a straight comparison becomes more difficult. Perhaps that's the goal here - to find the difference between upstream and the fork that causes this issue. Definitely important to note, is that upstream has had a number of IEEE 802.15.4 and cc13xx subg radio improvements over time, and they might not have been captured downstream. If this issue was encountered in the process of performing an OTA update, it will likely not be fixable, because the physical layer (subg) itself is not constant between versions. If that is the case, there may need to be an alternative means of upgrading devices. |
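To make the suggestion above a bit more concrete, here is a minimal sketch of the host-side half: turning timestamped failure-count log lines into per-interval counts that can then be plotted. The log format (`<uptime in ms> alloc_failures=<running total>`) and the names `failures_per_interval` and `sample_log` are assumptions made for illustration, not something Zephyr prints by default.

```python
import re
from collections import Counter

# Hypothetical log format (an assumption, not Zephyr's actual output):
#   "<uptime in ms> alloc_failures=<running total>"
LINE_RE = re.compile(r"^(\d+)\s+alloc_failures=(\d+)$")


def failures_per_interval(lines, interval_ms=1000):
    """Return {interval_start_ms: new failures} from cumulative counter logs."""
    buckets = Counter()
    previous_total = 0
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip unrelated log lines
        uptime_ms, total = int(match.group(1)), int(match.group(2))
        delta = total - previous_total  # failures since the previous log line
        previous_total = total
        if delta > 0:
            buckets[(uptime_ms // interval_ms) * interval_ms] += delta
    return dict(sorted(buckets.items()))


# Made-up sample data, only to show the shape of the result.
sample_log = [
    "120 alloc_failures=1",
    "480 alloc_failures=3",
    "1500 alloc_failures=3",
    "2100 alloc_failures=7",
]
histogram = failures_per_interval(sample_log)
print(histogram)
```

The resulting dictionary can be fed to any plotting tool; a burst of failures in a single interval would point towards a traffic spike rather than a slow leak.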
@cfriedt Sorry for the late reply, I was busy with college exams. Since the packet drops have been somewhat fixed by #69098, I have updated udp-server and udp-client to use TCP sockets (since that is what I need to use in my actual firmware). Here are the results for the upstream <-> upstream test:
And here are the results for fork <-> fork test:
The time per request here is roughly the time it takes to send and recv an int over TCP. I am using 50 reqs to calculate the average. As you can see, it is pretty evident that there is a large performance gap here. I will inspect the diffs in more detail to see if I can find the cause. There is no concern about OTA or anything like that, since the firmware is completely experimental and new, so there are no devices in the wild with it. |
Hi @Ayush1325 - I hope you did well on your exams 👍 Total time / 50 is probably not the best way to measure radio driver performance because it's effectively measuring the entire chain of software calls. Unless you're trying to measure the performance of the 15.4 subsystem, but that would likely be independent of the driver in use. Aside from that, I have other questions:
If you're thinking about performance indicators for the radio driver, it would be best to look at statistics such as
|
The fork is a version of Zephyr 3.4 plus a lot of additional patches. The fork has deviated too much from upstream and is in no state for a rebase (all the previous rebase attempts have left the fork in a non-functional state). I have tried to cherry-pick and upstream all the patches that should be required, but it seems like something is missing.
Round trip time of TCP packet. In my firmware, I want to send some packets which will then cause the node to respond.
I don't think so. I am using the exact same udp-server and udp-client application.
Yes, I do not get any buffer space warnings after setting them to 64. Kind of overkill considering I am only sending 50 messages.
Thanks, will try the above things. Are there any helpers or shell commands that can help profile the different net layers? |
So I switched to using zperf sample instead of udp-server and udp-client apps. Testing UDP:
Testing TCP:
Anything higher than 2 sec in TCP triggers |
@Ayush1325 - a bisect will allow you to find the specific commit that breaks your performance criteria. I would suggest finding an upstream commit where your performance works as expected, and then find a recent upstream commit where your performance does not work as expected, and then proceed with a git bisect. If your bisection is successful (i.e. you are able to narrow down the bug to a specific commit), it would probably help to diagnose your issue. |
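For anyone following along, the search that `git bisect` automates can be sketched as a binary search over an ordered commit range. The commit names, throughput numbers, and the 60 kbps threshold below are made up for illustration; in practice each `is_good` check means building, flashing, and measuring on real hardware.

```python
def first_bad_commit(commits, is_good):
    """Find the first bad commit in a list ordered oldest to newest.

    Assumes commits[0] is known good and commits[-1] is known bad,
    which is the same precondition git bisect asks for.
    """
    low, high = 0, len(commits) - 1  # low: known good, high: known bad
    while high - low > 1:
        mid = (low + high) // 2
        if is_good(commits[mid]):
            low = mid
        else:
            high = mid
    return commits[high]


# Hypothetical history: measured throughput in kbps for each commit.
measured_kbps = {
    "c0": 67, "c1": 67, "c2": 66, "c3": 67,
    "c4": 67, "c5": 38, "c6": 38, "c7": 39,
}
history = list(measured_kbps)  # insertion order is oldest to newest

culprit = first_bad_commit(history, lambda c: measured_kbps[c] >= 60)
print(culprit)
```

With eight commits this takes three measurements instead of up to six, which matters when every step involves reflashing two boards.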
Well, I don't think that's going to be possible. I added the subg support to upstream (by cherry-picking commits from the fork) a few months ago, and performance was pretty much the same as it is right now. So upstream does not have any commit that works better. I can only say that the hardware is capable of much better performance. E.g., running zperf in a fork <-> fork setup gives a rate of 67 Kbps. So a difference of around 100% in performance. |
After a series of git bisects, it seems the bad commit in question is a94877b. It causes a drop in UDP performance from 67 kbps to 38 kbps. |
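As a quick sanity check on the size of the regression, the two figures quoted above (67 kbps before, 38 kbps after) can be expressed as percentages in both directions. This is plain arithmetic on the numbers from this comment, nothing more.

```python
before_kbps = 67.0  # UDP throughput before the commit, as reported above
after_kbps = 38.0   # UDP throughput after the commit, as reported above

# Relative to the old value: how much throughput was lost.
drop_pct = (before_kbps - after_kbps) / before_kbps * 100

# Relative to the new value: how much faster the old state was.
gain_pct = (before_kbps - after_kbps) / after_kbps * 100

print(f"drop: {drop_pct:.1f}%, old state faster by: {gain_pct:.1f}%")
```

So the regression is roughly a 43% drop, or equivalently the earlier state was about 76% faster.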
Hi @Ayush1325! Sorry that I'm chiming in only now, I was ill and couldn't read my messages before. The commit you point to fixes a bug in the CC13/26xx SubG driver, which announced a "hard MAC" CSMA/CA capability that it didn't implement. The change enables the L2 stack's "soft MAC" CSMA/CA implementation instead. What you're observing might just be standard-conforming random CSMA/CA backoffs that intentionally reduce throughput to enhance co-existence. This can be "fixed" by switching CSMA/CA off. The fact that some packets are delivered without additional latency may just be because CSMA/CA sometimes randomly picks no backoff at all. I did extensive regression testing on that change, see #58439 and above all #58439 (comment). There you'll also find an in-depth explanation and demo of the increased latencies due to CSMA/CA, as well as sample configurations with CSMA/CA turned off that show no regression in throughput. Not sure why you see packet drops, though. Maybe some timeout that does not take the potential maximum backoffs into account? Please let me know if I'm misinterpreting something here. I didn't have time to look deeply into all prior comments. Hope this helps? |
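To illustrate why random backoffs add latency to some packets but not others, here is a rough sketch of unslotted CSMA/CA backoff using the IEEE 802.15.4 default attribute values (macMinBE = 3, macMaxBE = 5, macMaxCSMABackoffs = 4). It is not Zephyr's implementation; the function name, the `channel_is_clear` callback, and the 1000 us unit backoff period are placeholders, since the real period depends on the PHY, and the time spent on the CCA itself is ignored.

```python
import random

# IEEE 802.15.4 default MAC attribute values.
MAC_MIN_BE = 3
MAC_MAX_BE = 5
MAC_MAX_CSMA_BACKOFFS = 4


def csma_ca_delay(channel_is_clear, unit_backoff_us, rng=random):
    """Return the total backoff delay in us, or None on channel access failure."""
    backoff_exponent = MAC_MIN_BE
    total_delay_us = 0
    # One initial attempt plus up to MAC_MAX_CSMA_BACKOFFS retries.
    for _ in range(MAC_MAX_CSMA_BACKOFFS + 1):
        # Wait a random number of unit backoff periods in [0, 2^BE - 1].
        periods = rng.randint(0, 2 ** backoff_exponent - 1)
        total_delay_us += periods * unit_backoff_us
        if channel_is_clear():
            return total_delay_us
        backoff_exponent = min(backoff_exponent + 1, MAC_MAX_BE)
    return None


# Mean of a uniform draw from 0..7 unit periods on the first attempt.
mean_first_backoff_units = (2 ** MAC_MIN_BE - 1) / 2

# Placeholder unit backoff period of 1000 us, channel always idle.
rng = random.Random(1)
delays = [csma_ca_delay(lambda: True, 1000, rng) for _ in range(1000)]
print(min(delays), max(delays), mean_first_backoff_units)
```

Even on an idle channel, each frame waits between 0 and 7 unit periods, so roughly one frame in eight goes out immediately while the rest see extra latency.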
@fgrandel Thanks for the explanation. Switching to ALOHA seems to get me around 56 kbps in the zperf test. Not completely sure how that will play with greybus though, so let's see.
It has been fixed by #69098 |
During the previous bisect, I also found another problem in zperf. It seems one of the commits in #62942 breaks the ping that the zperf upload test does. So I get
|
Hi @Ayush1325 , thank you so much for tracking down troubles with BeagleConnect Freedom. I personally know several people who'll be very grateful to you! It would be very helpful if you could identify the specific commit that breaks zperf. @jukkar and myself will probably be able to better support you if you hone in a bit more on the offending commit. I reviewed that PR and couldn't see any obvious problem there. The change looks very reasonable to me. |
This sounds reasonable if you have ACK switched on. If you switch ACK off, you should see ~120 Kbps throughput if you optimize packet fragmentation; again, see my test results. Faster than that is physically almost impossible due to the SubG SUN PHY's low bit rates. The rest is due to MAC/PHY header overhead and 802.15.4 inter-frame space (IFS) requirements.
I know little about the greybus protocol, but it probably depends on the throughput and packet size required by your remote driver. In any case, 802.15.4 over the SubG SUN PHY would not usually be your ideal protocol stack, as it is not designed for high throughput and reliability. The greybus demos I've seen usually had a fast and reliable Ethernet connection. Maybe @cfriedt or @vaishnavachath know more about this? In any case, upper layers will have to guarantee delivery or tolerate packet loss, as 802.15.4 (like any other radio protocol) will never give you 100% reliability no matter how much you tune your PHY/MAC protocol settings. |
Exactly - TCP or a similar protocol with retries and receipt acknowledgement needs to be used. TCP is the simplest, but if using UDP, then Thread or Matter might also work. |
So I did do a git bisect, but the thing is that a lot of commits there seem to prevent zperf from building. So I am left with this:
19d1dd7
Also, the commits surrounding this one seem to have other problems, due to which I am getting speeds of 1-2 kbps, but I was only looking for the commit that breaks the ping (causing the timeout) for now. |
@jukkar Could you help here? @Ayush1325 tracked down zperf troubles to the ping API refactoring, it seems. Just by inspecting your code I couldn't see anything wrong with it, but maybe you have an idea? |
- Ping before upload was broken
- Fixes zephyrproject-rtos#68674
Signed-off-by: Ayush Singh <ayushdevel1325@gmail.com>
Fixes remote address for ping before upload. This caused the ping in zperf upload to timeout as shown in the following output:
```
uart:~$ zperf udp upload 2001:db8::2 5001 10 50 1M
Remote port is 5001
Connecting to 2001:db8::2
Duration: 10.00 s
Packet size: 50 bytes
Rate: 1000 kbps
Starting...
ping 2001:db8::2 timeout
Rate: 1.00 Mbps
Packet duration 390 us
```
Fixes: #68674
Signed-off-by: Ayush Singh <ayushdevel1325@gmail.com>
Fixes remote address for ping before upload. This caused the ping in zperf upload to timeout as shown in the following output: ``` uart:~$ zperf udp upload 2001:db8::2 5001 10 50 1M Remote port is 5001 Connecting to 2001:db8::2 Packet size: 50 bytes Starting... ping 2001:db8::2 timeout Packet duration 390 us ``` (cherry picked from commit 56882e2) Original-Duration: 10.00 s Original-Rate: 1000 kbps Original-Rate: 1.00 Mbps Original-Fixes: zephyrproject-rtos/zephyr#68674 Original-Signed-off-by: Ayush Singh <ayushdevel1325@gmail.com> GitOrigin-RevId: 56882e2 Change-Id: I2065ca2abc94ae95f01f3ee148522002baf6f987 Reviewed-on: https://chromium-review.googlesource.com/c/chromiumos/third_party/zephyr/+/5544060 Tested-by: ChromeOS Prod (Robot) <chromeos-ci-prod@chromeos-bot.iam.gserviceaccount.com> Commit-Queue: Yuval Peress <peress@google.com> Tested-by: Yuval Peress <peress@google.com> Reviewed-by: Yuval Peress <peress@google.com>
Describe the bug
Sub-GHz (subg) performance on BeagleConnect Freedom is extremely unreliable and slow. Initially, I encountered this while implementing support for DNS-SD PTR queries. None of the messages were received if multiple nodes responded simultaneously. However, it worked fine if one of the nodes was delayed.
I created a simple server and client to test the reliability and performance, but I have no idea what to make of the results.
Most of the time, I get the following:
As might be visible in the server config, I have set the buffer counts to 64, which I think should be more than sufficient since packets are being sent at 50 ms intervals.
Once in a blue moon (1 out of 30 runs), I actually receive all 100 packets, which seems to suggest that the programs can work.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
All packets to be received by the server.
Environment (please complete the following information):
Additional context
I will check whether I missed something when I added network support in #65048.