modMessageService triggers watchdog on ESP32 #1257
Comments
We have products that perform OTA on ESP32 without triggering the watchdog. Therefore, this one step appears to be insufficiently precise to allow your scenario to be reproduced. Would you please provide additional details? Even better, a working example that shows the problem would be most helpful. Thank you.
A little more on this. The callbacks should take priority over messages. Callbacks have a deadline and should fire as close to that time as possible. Messages, by contrast, are analogous to the JavaScript job queue used for promises, and so they run when nothing else is active. The behavior you describe is the priorities being inverted -- messages are starving callbacks. That should be fixed and, I think, the maximum delay respected.

To avoid stressing the overall system, many modules in the Moddable SDK minimize the messages they send. I'm not certain what's going on in your situation, but it sounds like there may be a lot of messages in flight. If that can be reduced somehow, that's probably for the best, since the ESP32 is fast, but not infinitely so.
Right. This was a throttle. It was imperfect, because it could still allow messages to run longer than maxDelayMS.
Hi @phoddie, thanks for the additional comments. I haven't been able to stop and implement something reproducible yet. In our context, there's a worker living for a long period exchanging data with the application machine, which causes a lot of messages to be exchanged between both machines. What happens is that when the worker takes too long to run a message -- one that triggers flash operations, for example -- we end up in this condition.

One scenario that I think could reproduce it is:

Application machine:

Worker machine:

We are probably never going to see "Timer callback" printed, since there will always be another message waiting to execute when the current one finishes.

My point is that these recent changes modified the software dynamics. Even though we were able to overflow the maxDelayMS, I see that the same can continue to happen: you can never guarantee that a machine's execution will take less than the time remaining before the watchdog resets. That seems to me to be the developer's responsibility, to ensure their JS code doesn't take too long. But at least in the old implementation, in the case of many messages, we used to have the possibility of resetting the watchdog in between.
@beckerzito - Thanks for the scenario. For that case, I see why the previous behavior would be preferred. Still, once the message queue is full, the worker machine is going to be blocked for a significant part of its execution time. I'll try to get back to a behavior where the main machine can't spin forever in a scenario like this. Still, the system is unbalanced, so all the SDK can do is make it less bad. Restoring balance is up to the developer (you).
I modified the base/worker example to implement the scenario you describe above. That spins forever handling messages, as expected. I had previously started changes to address this. Finishing that up gives a proof-of-concept that eliminates spinning forever. The implementation only processes the messages pending on entry and also handles the maximum delay correctly. I need to construct some more test cases to make sure the behavior is reasonable in all scenarios, not just this one. The code is different for instrumented and debug builds (to allow debug builds to time out, as previously discussed), and there is a dependency on the watchdog timer -- so there are a few variations to test.
Some more progress here.
Also, reconfirmed that the timer code only services those timers eligible when the function is entered. That prevents the timer service from spinning forever. That's the same idea as limiting the messages serviced to the pending message count on entry. So, neither should ever completely starve the other. The code is probably about ready to commit.
Code committed. Please give it a try.
Sorry for the long delay @phoddie I'm gonna test that and will let you know!
Great, thank you.
FYI, I believe I'm running into this same issue. What I see is that the machine deadlocks on mxMessagePostToMachine, coming from fxPromiseThen > fxQueueJob > fxQueuePromiseJobs. Are those the symptoms?
@tve - it's not at all clear they are related. Both do involve messages, but the original report is about cross-worker messages and your example is about a message from/to the same worker. Probably best to open a separate issue with steps to reproduce.
@beckerzito – any update?
Hi @phoddie, sorry for the long delay. Thanks for the support, it fixed the issue!
That's great to hear. Thanks for confirming.
Build environment: macOS, Windows, or Linux
Moddable SDK version: 4.1
Target device: NodeMCU ESP32
Description
This commit changed the execution loop of modMessageService. Before the change, the loop iterated only over the number of messages present in the queue at the start of execution, retrieved by:

```c
unsigned portBASE_TYPE count = uxQueueMessagesWaiting(the->msgQueue);
```

After the change, the loop runs on the condition:

```c
while (xQueueReceive(the->msgQueue, &msg, maxDelayMS))
```

That said, even though maxDelayMS is set to 0, if a message callback executes slowly (performing a flash operation, for example) and more messages keep being added to the queue (by a socket connection, for example), the loop never stops and never yields to the main loop, so timers get no opportunity to execute and the watchdog is never reset.

Steps to Reproduce
Expected behavior
The operation should end successfully, with all the downloaded data written to flash. Instead, the application resets due to a watchdog reset.