
modMessageService triggers watchdog on ESP32 #1257

Closed
beckerzito opened this issue Nov 21, 2023 · 14 comments

@beckerzito

beckerzito commented Nov 21, 2023

Build environment: macOS, Windows, or Linux
Moddable SDK version: 4.1
Target device: NodeMCU ESP32

Description
This commit changed the execution loop of modMessageService. Before the change, the loop iterated only over the number of messages present in the queue at the start of execution, retrieved by: unsigned portBASE_TYPE count = uxQueueMessagesWaiting(the->msgQueue);.

After the change, the loop runs on the while (xQueueReceive(the->msgQueue, &msg, maxDelayMS)) condition. As a result, even when maxDelayMS is set to 0, if a message callback is slow (performing a flash operation, for example) and more messages keep arriving in the queue (from a socket connection, for example), the loop never exits. It never yields to the main loop, so timers get no opportunity to execute and the watchdog is never reset.
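
To illustrate, a minimal sketch of the two loop shapes (this is not the actual Moddable source; deliverMessage is a hypothetical stand-in for the real dispatch code, and the fragment assumes the surrounding service function declares the, msg, and maxDelayMS):

```c
// Before: service only the messages already queued on entry (a throttle).
unsigned portBASE_TYPE count = uxQueueMessagesWaiting(the->msgQueue);
while (count--) {
	if (!xQueueReceive(the->msgQueue, &msg, 0))
		break;
	deliverMessage(the, &msg);	// may be slow (flash writes, for example)
}

// After: keep receiving for as long as messages arrive. If senders keep the
// queue non-empty, this loop never exits, timers never run, and the task
// watchdog eventually fires.
while (xQueueReceive(the->msgQueue, &msg, maxDelayMS))
	deliverMessage(the, &msg);
```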

Steps to Reproduce

  1. Create an app that downloads data from a web page and writes it to flash, for example an OTA operation.

Expected behavior
The operation should complete successfully, with all the downloaded data written to flash. Instead, the application resets due to a watchdog timeout.

@phoddie
Collaborator

phoddie commented Nov 21, 2023

Steps to Reproduce

Create an app that downloads data from a web page and writes it to flash, for example an OTA operation.

We have products that perform OTA on ESP32 without triggering the watchdog. Therefore, this one step appears to be insufficiently precise to allow your scenario to be reproduced. Would you please provide additional details? Even better, a working example that shows the problem would be most helpful. Thank you.

FWIW – the way maxDelayMS is used is imperfect, in that the loop can run longer than that. This behavior, however, does match what the ESP-IDF does with the same parameter in many situations. Perhaps this case could be changed to better respect the input value. But that exploration is best done with a testbed app that reproduces the behavior you see.

@phoddie
Collaborator

phoddie commented Nov 24, 2023

A little more on this. The callbacks should take priority over messages. Callbacks have a deadline, and should fire as close to that time as possible. Messages, by contrast, are analogous to the JavaScript job queue used for promises and so they run when nothing else is active.

The behavior you describe is the priorities being inverted -- messages are starving callbacks. That should be fixed and, I think, respecting maxDelayMS will help, though it may not be enough by itself. Correcting that means that callbacks have some potential to starve messages, of course, since they take priority.

To avoid stressing the overall system, many modules in the Moddable SDK minimize messages sent. I'm not certain what's going on in your situation but it sounds like there may be a lot of messages in flight. If that can be reduced somehow, that's probably for the best since the ESP32 is fast, but not infinitely so.

Before the change, the loop used to iterate only through the number of messages present on the queue at the beginning of the execution

Right. This was a throttle. It was imperfect, because it could still allow messages to run longer than maxDelayMS, which could still trigger the watchdog. But if there is a continuous flood of messages, the approach does ensure that the callbacks still get a chance to run during the flood.

@beckerzito
Author

Hi @phoddie, thanks for the additional comments.

I haven't been able to stop and implement something reproducible yet. In our context, there's a long-lived worker exchanging data with the application machine, which means a lot of messages are exchanged in both directions.

What happens is that when the worker takes too long to handle a message (one that triggers flash operations, for example), we end up in this condition.

One scenario that I think could reproduce this:

Application machine:
A timer posts a message to the worker every 500 ms.

Worker machine:
Creates a timer that prints "Timer callback" every 500 ms;
Delays 1000 ms in its onMessage handler.

We will probably never see "Timer callback" printed: each message takes 1000 ms to service while a new one arrives every 500 ms, so there is always another message waiting when the current one finishes.

My point is that these recent changes modified the software dynamics. Personally, even though it was already possible to overrun maxDelayMS before, I see that the same can continue to happen.

You can never guarantee that machine execution will take less than the time remaining before the watchdog fires; it seems to me it's the developer's responsibility to ensure their JS code doesn't take too long. But at least in the old implementation, when there were many messages, the watchdog had a chance to be reset in between.

@phoddie
Collaborator

phoddie commented Dec 18, 2023

@beckerzito - Thanks for the scenario. For that case, I see why the previous behavior would be preferred. Still, once the message queue is full, the worker machine is going to be blocked for a significant part of its execution time.

I'll try to get back to a behavior where the main machine can't spin forever in a scenario like this. Still, the system is unbalanced so all the SDK can do is make it less bad. Restoring balance is up to the developer (you).

@phoddie
Collaborator

phoddie commented Dec 19, 2023

I modified the base/worker example to implement the scenario you describe above. That spins forever handling messages, as expected. I had previously started changes to address this. Finishing that up gives a proof-of-concept that eliminates spinning forever. The implementation only processes the messages pending on entry and also handles the maximum delay correctly.

I need to construct some more test cases to make sure the behavior is reasonable in all scenarios, not just this one. The code differs between instrumented and debug builds (to allow debug builds to time out, as previously discussed) and has a dependency on the watchdog timer -- so there are a few variations to test.

@phoddie
Collaborator

phoddie commented Dec 20, 2023

Some more progress here.

  • Confirmed that the behavior is also correct when the roles are reversed (swapped code in worker and main)
  • Works for both debug & non-debug (instrumented, release) builds
  • Behaves correctly when posting a message times out (debug builds only, by default)
  • Debugger interactions continue to work correctly
  • Tested with much shorter timing values and random timing in main & worker - looking to stress the system
  • Refined handling of max delay to better respect the value passed in. (A callback can still overshoot this; if one does, timers are given a chance to run before message servicing resumes)
  • Moved the reset of the watchdog to immediately before invoking the callback, giving the callback a consistent amount of time to execute before the watchdog triggers. This isn't strictly necessary but seems reasonable.

Also, reconfirmed that the timer code only services those timers eligible when the function is entered. That prevents the timer service from spinning forever. That's the same idea as limiting the messages serviced to the pending message count on entry. So, neither should ever completely starve the other.
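
In outline, the new behavior looks something like this sketch (hypothetical names -- deliverMessage stands in for the real dispatch code, and the fragment assumes the, msg, and maxDelayMS from the enclosing service function; the committed implementation differs in detail):

```c
// Sketch only: service just the messages pending on entry, stop once
// maxDelayMS has elapsed, and reset the watchdog immediately before each
// callback so every callback gets a consistent time budget.
unsigned portBASE_TYPE count = uxQueueMessagesWaiting(the->msgQueue);
TickType_t start = xTaskGetTickCount();
while (count--) {
	if (!xQueueReceive(the->msgQueue, &msg, 0))
		break;				// queue drained early
	esp_task_wdt_reset();		// give the callback a full watchdog budget
	deliverMessage(the, &msg);
	if ((xTaskGetTickCount() - start) >= pdMS_TO_TICKS(maxDelayMS))
		break;				// respect the caller's time budget
}
// Returning here lets timers run; any remaining messages are serviced on
// the next pass, so neither timers nor messages can starve the other forever.
```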

The code is probably about ready to commit.

@phoddie
Collaborator

phoddie commented Dec 23, 2023

Code committed. Please give it a try.

@beckerzito
Author

Sorry for the long delay, @phoddie. I'm gonna test that and will let you know!
Thanks!

@phoddie
Collaborator

phoddie commented Jan 15, 2024

Great, thank you.

@tve
Contributor

tve commented Jan 16, 2024

FYI, I believe I'm running into this same issue. What I see is that the machine deadlocks on mxMessagePostToMachine, coming from fxPromiseThen > fxQueueJob > fxQueuePromiseJobs. Are those the symptoms?

@phoddie
Collaborator

phoddie commented Jan 16, 2024

@tve - it's not at all clear they are related. Both do involve messages, but the original report is about cross-worker messages and your example is about a message from/to the same worker. Probably best to open a separate issue with steps to reproduce.

@phoddie
Collaborator

phoddie commented Feb 7, 2024

I'm gonna test that and will let you know!

@beckerzito – any update?

@beckerzito
Author

Hi @phoddie, sorry for the long delay. Thanks for the support; it fixed the issue!

@phoddie
Collaborator

phoddie commented Mar 28, 2024

That's great to hear. Thanks for confirming.

@phoddie phoddie closed this as completed Mar 28, 2024