Module reconnecting pattern #245
Comments
@ddobric - Thanks for all the details. Here are my answers to them -
Issue 1 - Modules receiving messages do not reconnect if EdgeHub restarts.
Issue 2 - Modules run into timeout exceptions
Issue 3 - EdgeHub being restarted randomly - we have not seen this issue before. Can you provide the entire edgeHub and edgeAgent logs during this period? By any chance, was there a deployment on the Edge device during this period that would have triggered an update of the EdgeHub?
Hope I answered all your questions. If I missed something, or if you have any other questions/concerns, please let me know.
First of all, thank you for your answer.
Issue 1 - Modules receiving messages do not reconnect if EdgeHub restarts.
I was not aware of this issue. How was this fixed? Some kind of retry inside the AMQP client?
Issue 2 - Modules run into timeout exceptions
Totally agree on this. However, the timeout exception (in this case) was thrown inside the C# SDK while receiving events. I guess Issue 1 targets exactly this problem? If not, we have no chance to catch such an exception unless the registration of the input handler gives us some error callback.
Issue 3 - EdgeHub being restarted randomly
I attached the log, which contains two random restarts. The only error in the log is Logs: Hope this helps.
Recap
The logs show instability of the device, which is running one sensor module (sinusgen) and one receiving filter module (fncrecognizer). The filter module stops receiving events without any exception and continues running (doing nothing). Both modules are implemented in C#, .NET Core.
Anything new?
We have been running a long haul test with the fix to verify that it resolves the issue. All signs are positive but we would like it to run for a few more days to gain more confidence. You can track the PR here: Azure/azure-iot-sdk-csharp#611
Thanks :)
The connectivity issues have been resolved in version 1.0.7. Closing this issue, but feel free to reopen if you are still facing issues.
I have an IoT Edge device with a few modules. All modules are typically initialised (in C#) with the following code:
With this initialisation everything works fine until edgeHub crashes. Once that happens, edgeHub is started again and all modules continue running, but they are no longer able to consume events. Modules that produce events keep running and sending events, but those events never arrive at their destination.
The edgeHub log contains the following:
This obviously indicates that the hub knows about the disconnected module. "Hub knows that module has a problem and module knows that hub has a problem". This opens up many questions.
1
Modules do not know that a container they depend on has crashed. I guess all modules will always depend on edgeHub if they use IoT Edge runtime features.
In my opinion, there should be some event mechanism that notifies a module to reconnect or even to restart.
2
If edgeHub knows that a module is not connected, wouldn't it be possible/reasonable for the iotedge service to request a restart of that module?
3
Even if everything works well, some connectivity issue between a module and edgeHub will happen sooner or later. In this case, too, we need some reconnect strategy.
I think I already ran into this issue in the last few months. For example, this exception is thrown by the module(s):
When this happens, nearly all modules throw this exception. Interestingly, edgeHub continues running as if everything were fine. The edgeAgent log records these errors and edgeAgent schedules a restart of the modules.
Because the modules did not cause this issue (edgeHub or Docker did), they re-enter initialisation and run into the timeout over and over again.
I would expect an automatic restart of edgeHub here, but this does not happen because no error is reported in the hub. There must be one, though, because the hub is not reachable, or the error might be caused by Docker.
When you (or the service) then restart edgeHub, you will get
To work around this behaviour, all modules have to be restarted manually after edgeHub restarts.
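To make the "reconnect strategy" from point 3 more concrete, here is a minimal sketch of what I have in mind. It is not how the SDK works internally and it is independent of the SDK: `ReconnectSketch`, `BackoffDelay` and `ConnectWithRetryAsync` are hypothetical names, and the `connectAsync` delegate is an assumption standing in for whatever a module does to (re)create its client and register its input handlers.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of any SDK.
public static class ReconnectSketch
{
    // Exponential backoff: baseDelay * 2^attempt, capped at maxDelay.
    public static TimeSpan BackoffDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        // Clamp the exponent so the factor stays in a sane range.
        double factor = Math.Pow(2, Math.Min(attempt, 30));
        double ms = Math.Min(baseDelay.TotalMilliseconds * factor, maxDelay.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(ms);
    }

    // Calls connectAsync until it succeeds, waiting between failed attempts.
    // Returns the number of failed attempts that preceded the success.
    public static async Task<int> ConnectWithRetryAsync(
        Func<CancellationToken, Task> connectAsync,
        TimeSpan baseDelay,
        TimeSpan maxDelay,
        CancellationToken cancellationToken)
    {
        for (int attempt = 0; ; attempt++)
        {
            cancellationToken.ThrowIfCancellationRequested();
            try
            {
                // Assumption: this creates a fresh client and registers handlers.
                await connectAsync(cancellationToken);
                return attempt;
            }
            catch (Exception ex) when (!(ex is OperationCanceledException))
            {
                // Timeouts and other connect errors end up here; cancellation does not.
                TimeSpan delay = BackoffDelay(attempt, baseDelay, maxDelay);
                Console.WriteLine($"Connect attempt {attempt + 1} failed: {ex.Message}. Retrying in {delay.TotalSeconds:F1}s.");
                await Task.Delay(delay, cancellationToken);
            }
        }
    }
}
```

A module could call `ConnectWithRetryAsync` at startup instead of failing on the first timeout, and call it again whenever it learns that the connection to edgeHub is gone. How the module learns that is exactly the open question from point 1; a connection status callback such as the SDK's `SetConnectionStatusChangesHandler` might be one place to hook this in, but I have not verified that it fires when edgeHub restarts.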
4
Sometimes the hub is restarted without any error in any module. In this case, too, all modules might lose their connection to edgeHub. Take a look at this log.
The hub is restarted for whatever reason. The edgeHub log shows the following:
And, as you can guess, the module keeps running without any error in this case, and it does not receive any events.
Recap
You may have a different approach in mind for solving the listed issues. In any case, it would be a great help if you shared your thoughts on this. In my opinion, this can be a serious issue.
Sorry for the long description, but I do not see a better way to describe it.
Thanks in advance