Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cpu/esp: esp_now_netdev fixes and optimizations #10700

Merged
merged 12 commits into from Jan 11, 2019

Conversation

gschorcht
Copy link
Contributor

Contribution description

This PR provides a fix of problem 2 in issue #10682.

As described in #10682, a single ping without data was not possible anymore after heavy load with ping timeouts. The reason was that ISR events got lost and messed up the ringbuffer used in esp_now_netdev.

This PR is rather an evolution than only a fix. All different steps are realized as separate commits which contain the following changes:

  1. _recv function of esp_now_netdev simplified according to the possible parameter values.
  2. _recv function of esp_now_netdev updated to drop the frame as documented in new API if the parameter len is smaller than the received frame.
  3. Lost of NETDEV_EVENT_ISR events in esp_now_netdev fixed.

In further extensive stress tests the following could be figured out:

  • Function esp_now_recv_cb is executed in the context of the wifi thread (this was already known).
  • The ISRs which handle the hardware interrupts from the WiFi interface only pass events to a message queue of the wifi thread.
  • The wifi thread then processes the events sequentially and executes registered callback functions like esp_now_recv_cb asynchronously.
  • This corresponds to the NETDEV_EVENT_ISR approach in GNRC.

This results in the following facts:

  • Function esp_now_recv_cb will not be reentered.
  • Execution of function esp_now_recv_cb is executed in the context of the highest priority thread and is therefore not interrupted or superseded by other threads.

These facts allow the following optimizations which are provided step by step as separate commits.

  1. NETDEV_EVENT_ISR events are not needed. The execution of netif::_recv and thus esp_now_netdev::_recv can be triggered directly with NETDEV_EVENT_RX_COMPLETE on the receiption of a frame in esp_now_recv_cb.
  2. Mutual exclusion is not required anymore since esp_now_recv_cb and esp_now_netdev::_recv are executed sequentially in same thread context.
  3. Handling of critical section by disabling/enabling interrupts is not needed.
  4. If esp_now_netdev::_recv is executed immediately in esp_now_recv_cb and in same thread context, there is no need for a ring buffer anymore.
  5. There is even no buffer required. Received data can be copied directly from the buffer of the wifi thread.

Although function esp_now_recv_cb will not be reentered, it is secured by an additional boolean variable to avoid potential data inconsistencies. If the NDEBUG macro is undefined, an assertion is used instead for debugging purposes.

Testing procedure

Use at least two or more ESP32 nodes and flash them with examples/gnrc_networking, e.g.:

USEMODULE=esp_now make -C examples/gnrc_networking BOARD=esp32-wroom-32 flash

Use the ping command to ping from each node another node to produce heavy load, either with two nodes

node1> ping6 10000 <node2_addr> 1232 0
node2> ping6 10000 <node1_addr> 1232 0

or with three nodes

node1> ping6 10000 <node2_addr> 1232 0
node2> ping6 10000 <node3_addr> 1232 0
node3> ping6 10000 <node1_addr> 1232 0

A further test is bombarding one node from two other nodes:

node1> ping6 10000 <node3_addr> 1232 0
node2> ping6 10000 <node3_addr> 1232 0
node3> 

Issues/PRs references

This PR fixes #10682 problem 2, the stucked gnrc buffer problem remains.
This PR is a follow-on PR of #10581, #10679, #10680.

Simplifies the _recv receive function structure of esp_now_netdev according to the possible parameter values.
Newest version of API requires to drop the frame when if parameter len is smaller than the received packet.
Although it should not be possible to reenter the function esp_now_recv_cb, this is avoided by an additional boolean variable. It can not be realized by mutex because it would reenter from same thread context.  If macro NDEBUG is not defined, an assertion is used.
Since the esp_now_recv_cb function is not called directly as an ISR through interrupts, it is not executed in the interrupt context. _recv can therefore be called directly. The treatment of the recv_event is no longer necessary.
Since it is not possible to reenter the function `esp_now_recv_cb`, and the functions netif::_ recv and esp_now_netdev::_ recv are called directly in the same thread context, neither a mutual exclusion has to be realized nor have the interrupts to be deactivated.
Since it is not possible to reenter the function `esp_now_recv_cb`, and the functions netif::_ recv and esp_now_netdev::_ recv are called directly in the same thread context we only need a buffer which contains one frame and its source mac address.
Since it is not possible to reenter the function `esp_now_recv_cb`, and the functions netif::_ recv and esp_now_netdev::_ recv are called directly in the same thread context we can read directly from the `buf` and don't need a receive buffer anymmore.
@gschorcht gschorcht requested a review from miri64 January 3, 2019 17:51
@gschorcht gschorcht added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Area: network Area: Networking Type: cleanup The issue proposes a clean-up / The PR cleans-up parts of the codebase / documentation Platform: ESP Platform: This PR/issue effects ESP-based platforms labels Jan 3, 2019
@miri64
Copy link
Member

miri64 commented Jan 3, 2019

I don't have the hardware to test this :-/

@gschorcht
Copy link
Contributor Author

@TimoRoth Sorry, I used the wrong issue by mistake when I was asking you to take a look to the changes for esp_now.

@gschorcht
Copy link
Contributor Author

gschorcht commented Jan 3, 2019

@miri64

I don't have the hardware to test this :-/

I know, but it would be really great if you could take a look to the changes in esp_now_netdev::esp_now_recv_cn and esp_now_netdev::_recv whether the approach without using NETDEV_EVENT_ISR events and buffers is ok from your point of view. I have tried to document carefully by comments why I think that is possible to it in that way.

@TimoRoth will test it.

Timer restart was moved from esp_now_scan_peers_done to esp_now_scan_peers_start to avoid that the scan for peers isn't triggered anymore if the esp_now_scan_peers_done callback isn't called because of errors.
@gschorcht
Copy link
Contributor Author

@TimoRoth Could you test the changes? I would like to make some progress with these changes since they increase the stability a lot.

@MrKevinWeiss

This comment has been minimized.

@MrKevinWeiss
Copy link
Contributor

With master I get something a but different...

node1 seems to transmit very fast.

NODE0

2019-01-08 11:20:30,426 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=876 hop limit=64 time = 43.896 ms
2019-01-08 11:20:30,469 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=877 hop limit=64 time = 41.664 ms
2019-01-08 11:20:30,519 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=878 hop limit=64 time = 49.910 ms
2019-01-08 11:20:30,566 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=879 hop limit=64 time = 46.098 ms
2019-01-08 11:20:30,614 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=880 hop limit=64 time = 46.396 ms
2019-01-08 11:20:30,665 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=881 hop limit=64 time = 50.211 ms
2019-01-08 11:20:30,716 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=882 hop limit=64 time = 50.726 ms
2019-01-08 11:20:30,762 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=883 hop limit=64 time = 44.962 ms
2019-01-08 11:20:30,807 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=884 hop limit=64 time = 45.691 ms
2019-01-08 11:20:30,864 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=885 hop limit=64 time = 55.993 ms
2019-01-08 11:20:30,913 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=886 hop limit=64 time = 46.950 ms
2019-01-08 11:20:30,964 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=887 hop limit=64 time = 51.093 ms
2019-01-08 11:20:31,011 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=888 hop limit=64 time = 44.513 ms
2019-01-08 11:20:31,058 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=889 hop limit=64 time = 47.317 ms
2019-01-08 11:20:32,075 - INFO # ping timeout
2019-01-08 11:20:32,126 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=890 hop limit=64 time = 40.475 ms
2019-01-08 11:20:33,165 - INFO # ping timeout
2019-01-08 11:20:33,225 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=891 hop limit=64 time = 49.040 ms
2019-01-08 11:20:33,263 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=892 hop limit=64 time = 38.490 ms
2019-01-08 11:20:33,306 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=893 hop limit=64 time = 41.638 ms
2019-01-08 11:20:33,354 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=894 hop limit=64 time = 47.216 ms
2019-01-08 11:20:33,401 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=895 hop limit=64 time = 46.423 ms
2019-01-08 11:20:33,446 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=896 hop limit=64 time = 45.140 ms
2019-01-08 11:20:33,491 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=897 hop limit=64 time = 44.716 ms
2019-01-08 11:20:33,533 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=898 hop limit=64 time = 38.847 ms
2019-01-08 11:20:33,573 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=899 hop limit=64 time = 39.839 ms
2019-01-08 11:20:33,615 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=900 hop limit=64 time = 41.459 ms
2019-01-08 11:20:33,656 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=901 hop limit=64 time = 39.716 ms

NODE1

2019-01-08 11:20:36,434 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4872 hop limit=64 time = 0.769 ms
2019-01-08 11:20:36,437 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4873 hop limit=64 time = 0.708 ms
2019-01-08 11:20:36,449 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4874 hop limit=64 time = 0.773 ms
2019-01-08 11:20:36,451 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4875 hop limit=64 time = 0.712 ms
2019-01-08 11:20:36,465 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4876 hop limit=64 time = 1.228 ms
2019-01-08 11:20:36,468 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4877 hop limit=64 time = 0.774 ms
2019-01-08 11:20:36,481 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4878 hop limit=64 time = 0.713 ms
2019-01-08 11:20:36,483 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4879 hop limit=64 time = 0.716 ms
2019-01-08 11:20:36,497 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4880 hop limit=64 time = 0.708 ms
2019-01-08 11:20:36,500 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4881 hop limit=64 time = 0.708 ms
2019-01-08 11:20:36,513 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4882 hop limit=64 time = 0.707 ms
2019-01-08 11:20:36,516 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4883 hop limit=64 time = 0.708 ms
2019-01-08 11:20:36,519 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4884 hop limit=64 time = 0.708 ms
2019-01-08 11:20:36,531 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4885 hop limit=64 time = 0.708 ms
2019-01-08 11:20:36,533 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4886 hop limit=64 time = 0.708 ms
2019-01-08 11:20:36,547 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4887 hop limit=64 time = 0.707 ms
2019-01-08 11:20:36,551 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4888 hop limit=64 time = 0.707 ms
2019-01-08 11:20:36,563 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4889 hop limit=64 time = 0.708 ms
2019-01-08 11:20:36,566 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4890 hop limit=64 time = 0.804 ms
2019-01-08 11:20:36,578 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4891 hop limit=64 time = 0.712 ms
2019-01-08 11:20:36,582 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4892 hop limit=64 time = 0.704 ms
2019-01-08 11:20:36,594 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4893 hop limit=64 time = 0.703 ms
2019-01-08 11:20:36,597 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4894 hop limit=64 time = 0.703 ms
2019-01-08 11:20:36,610 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4895 hop limit=64 time = 0.703 ms
2019-01-08 11:20:36,613 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4896 hop limit=64 time = 0.704 ms
2019-01-08 11:20:36,626 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4897 hop limit=64 time = 0.704 ms
2019-01-08 11:20:36,629 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4898 hop limit=64 time = 0.703 ms
2019-01-08 11:20:36,641 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4899 hop limit=64 time = 0.966 ms
2019-01-08 11:20:36,644 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4900 hop limit=64 time = 0.717 ms
2019-01-08 11:20:36,657 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4901 hop limit=64 time = 0.704 ms
2019-01-08 11:20:36,660 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4902 hop limit=64 time = 0.704 ms
2019-01-08 11:20:36,673 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4903 hop limit=64 time = 0.770 ms
2019-01-08 11:20:36,676 - INFO # 1240 bytes from fe80::860d:8eff:fe18:7e31: id=83 seq=4904 hop limit=64 time = 1.575 ms
2019-01-08 11:20:37,665 - INFO # ping timeout
2019-01-08 11:20:38,673 - INFO # ping timeout
2019-01-08 11:20:39,666 - INFO # ping timeout
2019-01-08 11:20:39,668 - INFO # error: packet buffer full
2019-01-08 11:20:39,669 - INFO # error: packet buffer full
2019-01-08 11:20:39,681 - INFO # error: packet buffer full
2019-01-08 11:20:39,683 - INFO # error: packet buffer full

@gschorcht
Copy link
Contributor Author

@MrKevinWeiss Strange, I have stress tested over hours, but I have never seen that the UART interface is affected. UART is realized in Hardware with a buffer oft 128 characters.
Is it reproducable? Could you try it with another terminal program?

@MrKevinWeiss
Copy link
Contributor

I am not too worried about it, I am running things through a usb hub. Keep in mind it is a different board (node1=esp32-wrover-32). I will investigate!

@MrKevinWeiss
Copy link
Contributor

MrKevinWeiss commented Jan 8, 2019

Yup, just my usb ports, switched to the ones directly connected to the cpu and have 0% packet loss with 1000 pings. The testing shows a noticeable improvement! 👍

@MrKevinWeiss MrKevinWeiss added the Reviewed: 3-testing The PR was tested according to the maintainer guidelines label Jan 8, 2019
@gschorcht
Copy link
Contributor Author

@MrKevinWeiss Thanks for testing 😄

@gschorcht
Copy link
Contributor Author

How do we continue with this PR? PR #9917 depends these changes.

* If the NDEBUG macro is undefined, an assertion is used instead for
* debugging purposes.
*/
assert(!_in_recv_cb);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems redundant in light of the following if, or not?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was added intentionally rather to see when this happens in stress tests during the development process than to suppress it silently.
Should I remove it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mhm, I'see ... the assert will result in a crash, you could a DEBUG message inside the if below if you want to avoid the crash an just see when (and how often) this happens.

TL;DR: depends on you intention and use case: if you want to the app to crash first time this happens use assert (so leave as is) otherwise better use DEBUG.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem with a DEBUG message is that you would miss it in the set of other debug messages, especially during stress tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smlng I decided to remove it.

/*
* Since we are not in the interrupt context, we do not have to pass
* `NETDEV_EVENT_ISR` first. We can call the receive function directly.
* But we have to unlock the mutex and enable interrupts before.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

which mutex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comments is from the time when a mutex was used. I forgot to change it it when I removed 😉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove then, afterwards this seems good to go

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course. Thank you for the review.

@smlng smlng added Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Reviewed: 2-code-design The code design of the PR was reviewed according to the maintainer guidelines Reviewed: 4-code-style The adherence to coding conventions by the PR were reviewed according to the maintainer guidelines Reviewed: 5-documentation The documentation details of the PR were reviewed according to the maintainer guidelines labels Jan 10, 2019
@MrKevinWeiss MrKevinWeiss added the CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR label Jan 10, 2019
@TimoRoth
Copy link
Contributor

I did a really quick test of the changes, and they indeed do increase the stability to the point where I was unable to make anything bad or unexpected happen.
The code changes all look sensible to me as well, though I did not do an in depth review due to lack of time.

Looking forward to this and in turn the esp8266 PR being merged.

assert was added intentionally for debugging purposes. For released version it is enough that function returns.
@gschorcht
Copy link
Contributor Author

@TimoRoth Thanks for testing again 😄

@smlng
Copy link
Member

smlng commented Jan 11, 2019

@TimoRoth thx for reporting, with the tests by @MrKevinWeiss I'd say this is good to go then!

@MrKevinWeiss MrKevinWeiss merged commit 9e1ab1e into RIOT-OS:master Jan 11, 2019
@gschorcht
Copy link
Contributor Author

Thank you all for your support.

@gschorcht gschorcht deleted the esp32_esp_now_fix_opt branch January 11, 2019 12:33
@aabadie aabadie added this to the Release 2019.01 milestone Jan 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: network Area: Networking CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Platform: ESP Platform: This PR/issue effects ESP-based platforms Reviewed: 1-fundamentals The fundamentals of the PR were reviewed according to the maintainer guidelines Reviewed: 2-code-design The code design of the PR was reviewed according to the maintainer guidelines Reviewed: 3-testing The PR was tested according to the maintainer guidelines Reviewed: 4-code-style The adherence to coding conventions by the PR were reviewed according to the maintainer guidelines Reviewed: 5-documentation The documentation details of the PR were reviewed according to the maintainer guidelines Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Type: cleanup The issue proposes a clean-up / The PR cleans-up parts of the codebase / documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cpu/esp_common: esp_now netdev problems
6 participants