
modMessageService triggers watchdog on ESP32 #1257

Closed
beckerzito opened this issue Nov 21, 2023 · 14 comments

@beckerzito

beckerzito commented Nov 21, 2023

Build environment: macOS, Windows, or Linux
Moddable SDK version: 4.1
Target device: NodeMCU ESP32

Description
This commit changed the execution loop of modMessageService. Before the change, the loop iterated only over the number of messages present in the queue at the start of execution, retrieved by: unsigned portBASE_TYPE count = uxQueueMessagesWaiting(the->msgQueue);.

After the change, the loop runs on the while (xQueueReceive(the->msgQueue, &msg, maxDelayMS)) condition. As a result, even when maxDelayMS is set to 0, if a message callback is slow (performing a flash operation, for example) and more messages keep arriving in the queue (from a socket connection, for example), the loop never exits. It never yields to the main loop, so timers get no opportunity to execute and the watchdog is never reset.
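
To illustrate, a minimal sketch of the two loop shapes (this is not the actual Moddable source; deliverMessage is a hypothetical stand-in for the real dispatch code, and the fragment assumes the surrounding service function declares the, msg, and maxDelayMS):

```c
// Before: service only the messages already queued on entry (a throttle).
unsigned portBASE_TYPE count = uxQueueMessagesWaiting(the->msgQueue);
while (count--) {
	if (!xQueueReceive(the->msgQueue, &msg, 0))
		break;
	deliverMessage(the, &msg);	// may be slow (flash writes, for example)
}

// After: keep receiving for as long as messages arrive. If senders keep the
// queue non-empty, this loop never exits, timers never run, and the task
// watchdog eventually fires.
while (xQueueReceive(the->msgQueue, &msg, maxDelayMS))
	deliverMessage(the, &msg);
```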

Steps to Reproduce

  1. Create an app that downloads data from a web page and writes it to flash, for example an OTA operation.

Expected behavior
The operation should complete successfully, with all the downloaded data written to flash. Instead, the application resets due to a watchdog timeout.

@phoddie
Collaborator

phoddie commented Nov 21, 2023

Steps to Reproduce

Create an app that downloads data from a web page and writes it to flash, for example an OTA operation.

We have products that perform OTA on ESP32 without triggering the watchdog. Therefore, this one step appears to be insufficiently precise to allow your scenario to be reproduced. Would you please provide additional details? Even better, a working example that shows the problem would be most helpful. Thank you.

FWIW – the way maxDelayMS is used is imperfect, in that the loop can run longer than that. This behavior, however, does match what the ESP-IDF does with the same parameter in many situations. Perhaps this case could be changed to better respect the input value. But that exploration is best done with a testbed app that reproduces the behavior you see.

@phoddie
Collaborator

phoddie commented Nov 24, 2023

A little more on this. The callbacks should take priority over messages. Callbacks have a deadline, and should fire as close to that time as possible. Messages, by contrast, are analogous to the JavaScript job queue used for promises and so they run when nothing else is active.

The behavior you describe is the priorities being inverted -- messages are starving callbacks. That should be fixed and, I think, respecting maxDelayMS will help, though it may not be enough by itself. Correcting that means that callbacks have some potential to starve messages, of course, since they take priority.

To avoid stressing the overall system, many modules in the Moddable SDK minimize messages sent. I'm not certain what's going on in your situation but it sounds like there may be a lot of messages in flight. If that can be reduced somehow, that's probably for the best since the ESP32 is fast, but not infinitely so.

Before the change, the loop used to iterate only through the number of messages present on the queue at the beginning of the execution

Right. This was a throttle. It was imperfect, because it could still allow messages to run longer than maxDelayMS, which could still trigger the watchdog. But if there is a continuous flood of messages, the approach does ensure that the callbacks still get a chance to run during the flood.

@beckerzito
Author

Hi @phoddie, thanks for the additional comments.

I haven't been able to stop and implement something reproducible yet. In our context, there's a long-lived worker exchanging data with the application machine, which means a lot of messages are exchanged in both directions.

What happens is that when the worker takes too long to handle a message (one that triggers flash operations, for example), we end up in this condition.

One scenario that I think could reproduce this:

Application machine:
A timer posts a message to the worker every 500 ms.

Worker machine:
Creates a timer that prints "Timer callback" every 500 ms;
Delays 1000 ms in its onMessage handler.

We will probably never see "Timer callback" printed: each message takes 1000 ms to service while a new one arrives every 500 ms, so there is always another message waiting when the current one finishes.

My point is that these recent changes modified the software dynamics. Personally, even though it was already possible to overrun maxDelayMS before, I see that the same can continue to happen.

You can never guarantee that machine execution will take less than the time remaining before the watchdog fires; it seems to me it's the developer's responsibility to ensure their JS code doesn't take too long. But at least in the old implementation, when there were many messages, the watchdog had a chance to be reset in between.

@phoddie
Collaborator

phoddie commented Dec 18, 2023

@beckerzito - Thanks for the scenario. For that case, I see why the previous behavior would be preferred. Still, once the message queue is full, the worker machine is going to be blocked for a significant part of its execution time.

I'll try to get back to a behavior where the main machine can't spin forever in a scenario like this. Still, the system is unbalanced so all the SDK can do is make it less bad. Restoring balance is up to the developer (you).

@phoddie
Collaborator

phoddie commented Dec 19, 2023

I modified the base/worker example to implement the scenario you describe above. That spins forever handling messages, as expected. I had previously started changes to address this. Finishing that up gives a proof-of-concept that eliminates spinning forever. The implementation only processes the messages pending on entry and also handles the maximum delay correctly.

I need to construct some more test cases to make sure the behavior is reasonable in all scenarios, not just this one. The code differs between instrumented and debug builds (to allow debug builds to time out, as previously discussed) and has a dependency on the watchdog timer -- so there are a few variations to test.

@phoddie
Collaborator

phoddie commented Dec 20, 2023

Some more progress here.

  • Confirmed that the behavior is also correct when the roles are reversed (swapped code in worker and main)
  • Works for both debug & non-debug (instrumented, release) builds
  • Behaves correctly when posting a message times out (debug builds only, by default)
  • Debugger interactions continue to work correctly
  • Tested with much shorter timing values and random timing in main & worker - looking to stress the system
  • Refined handling of max delay to better respect the value passed in. (A callback can still overshoot this; if one does, timers are given a chance to run before message servicing resumes)
  • Moved the reset of the watchdog to immediately before invoking the callback, giving the callback a consistent amount of time to execute before the watchdog triggers. This isn't strictly necessary but seems reasonable.

Also, reconfirmed that the timer code only services those timers eligible when the function is entered. That prevents the timer service from spinning forever. That's the same idea as limiting the messages serviced to the pending message count on entry. So, neither should ever completely starve the other.
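
In outline, the new behavior looks something like this sketch (hypothetical names -- deliverMessage stands in for the real dispatch code, and the fragment assumes the, msg, and maxDelayMS from the enclosing service function; the committed implementation differs in detail):

```c
// Sketch only: service just the messages pending on entry, stop once
// maxDelayMS has elapsed, and reset the watchdog immediately before each
// callback so every callback gets a consistent time budget.
unsigned portBASE_TYPE count = uxQueueMessagesWaiting(the->msgQueue);
TickType_t start = xTaskGetTickCount();
while (count--) {
	if (!xQueueReceive(the->msgQueue, &msg, 0))
		break;				// queue drained early
	esp_task_wdt_reset();		// give the callback a full watchdog budget
	deliverMessage(the, &msg);
	if ((xTaskGetTickCount() - start) >= pdMS_TO_TICKS(maxDelayMS))
		break;				// respect the caller's time budget
}
// Returning here lets timers run; any remaining messages are serviced on
// the next pass, so neither timers nor messages can starve the other forever.
```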

The code is probably about ready to commit.

@phoddie
Collaborator

phoddie commented Dec 23, 2023

Code committed. Please give it a try.

@beckerzito
Author

Sorry for the long delay, @phoddie. I'm gonna test that and will let you know!
Thanks!

@phoddie
Collaborator

phoddie commented Jan 15, 2024

Great, thank you.

@tve
Contributor

tve commented Jan 16, 2024

FYI, I believe I'm running into this same issue. What I see is that the machine deadlocks on mxMessagePostToMachine, coming from fxPromiseThen > fxQueueJob > fxQueuePromiseJobs. Are those the symptoms?

@phoddie
Collaborator

phoddie commented Jan 16, 2024

@tve - it's not at all clear they are related. Both do involve messages, but the original report is about cross-worker messages and your example is about a message from/to the same worker. Probably best to open a separate issue with steps to reproduce.

@phoddie
Collaborator

phoddie commented Feb 7, 2024

I'm gonna test that and will let you know!

@beckerzito – any update?

@beckerzito
Author

Hi @phoddie, sorry for the long delay. Thanks for the support; it fixed the issue!

@phoddie
Collaborator

phoddie commented Mar 28, 2024

That's great to hear. Thanks for confirming.

@phoddie phoddie closed this as completed Mar 28, 2024