
Module reconnecting pattern #245

Closed

ddobric opened this issue Sep 4, 2018 · 6 comments
@ddobric

ddobric commented Sep 4, 2018

I have an iotedge device with a few modules. All modules are typically initialised (in C#) with the following code:

static ModuleClient ioTHubModuleClient;   // presumably a static field of the Program class

static async Task Init()
{
    AmqpTransportSettings amqpSetting = new AmqpTransportSettings(TransportType.Amqp_Tcp_Only);
    ITransportSettings[] settings = { amqpSetting };

    ioTHubModuleClient = await ModuleClient.CreateFromEnvironmentAsync(settings);
    await ioTHubModuleClient.OpenAsync();

    await ioTHubModuleClient.SetInputMessageHandlerAsync("input1", PipeMessage, ioTHubModuleClient);
}

With this initialisation everything works fine until edgeHub crashes. Once that happens, edgeHub is started again and all modules keep running, but they are no longer able to consume events. Modules which produce events also keep running and sending, but the events never arrive at their destination.
The following can be found in the edgeHub log:

[ERR] Module deviceXY/modyleXY is not connected

This obviously indicates that the hub knows about the disconnected module: "the hub knows that the module has a problem, and the module knows that the hub has a problem". This opens up many questions.

1

Modules do not know that a container they depend on has crashed. I assume all modules will always have a dependency on edgeHub if they use IoT Edge runtime features.
In my opinion, there should be some event mechanism which notifies the module to reconnect or even to restart (a sketch of what the module side could look like follows below).
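
The C# SDK already exposes a connection status callback that could serve as such a notification. A minimal sketch of what a module could do with it (the exit-on-Retry_Expired policy is my own assumption, not an official recommendation):

ioTHubModuleClient.SetConnectionStatusChangesHandler((status, reason) =>
{
    // Log every transition so the module at least notices it has lost edgeHub.
    Console.WriteLine($"Connection status changed: {status}, reason: {reason}");

    // Assumption: once the SDK gives up retrying, exit so that edgeAgent/Docker
    // restarts the container and Init() establishes a fresh connection.
    if (status == ConnectionStatus.Disconnected && reason == ConnectionStatusChangeReason.Retry_Expired)
    {
        Environment.Exit(1);
    }
});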

2

If edgeHub knows that a module is not connected, wouldn't it be possible/reasonable for it to request a restart of the module via the iotedge service?

3

Even if everything works well, after a while some connectivity issue will happen between a module and edgeHub. In this case too, we need some reconnect strategy.
I think I have already had this issue in recent months. For example, this exception is thrown by the module(s):

Unhandled Exception: System.AggregateException: One or more errors occurred. (Operation timeout expired.) ---> System.TimeoutException: Operation timeout expired.
   at Microsoft.Azure.Devices.Client.InternalClient.<>c.<ApplyTimeout>b__62_2(Task t)
   at System.Threading.Tasks.ContinuationTaskFromTask.InnerInvoke()
   at System.Threading.Tasks.Task.<>c.<.cctor>b__278_1(Object obj)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)

When this happens, mostly all modules throw this exception. Interestingly, edgeHub continues running as if everything were fine. The edgeAgent log notices these errors and schedules a restart of the modules.
Because the modules did not cause this issue (edgeHub or Docker did), they will re-enter initialisation and run into the same timeout over and over again.
I would expect an auto-restart of edgeHub here, but this does not happen because there is no error in the hub. But there must be one, because the hub is not reachable, or the error might be caused by Docker.

When you (or the service) then restart edgeHub, you get:

[ERR] Module deviceXY/modyleXY is not connected.

To work around this behaviour, all modules have to be restarted manually after the restart of edgeHub.

4

Sometimes the hub is restarted without any error in any module. In this case too, all modules might lose their connection to edgeHub. Take a look at this log:

2018-09-04 20:45:08.328 +00:00 [INF] - Module 'edgeHub' scheduled to restart after 10s (08s left).
2018-09-04 20:45:08.510 +00:00 [INF] - Updated reported properties
2018-09-04 20:45:13.589 +00:00 [INF] - Module 'edgeHub' scheduled to restart after 10s (03s left).
2018-09-04 20:45:18.676 +00:00 [INF] - Plan execution started for deployment 36
2018-09-04 20:45:18.677 +00:00 [INF] - Executing command: "Command Group: (
[Stop module edgeHub]
[Start module edgeHub]
[Saving edgeHub to store]
)"
2018-09-04 20:45:18.677 +00:00 [INF] - Executing command: "Stop module edgeHub"
2018-09-04 20:45:18.691 +00:00 [INF] - Executing command: "Start module edgeHub"
2018-09-04 20:45:20.721 +00:00 [INF] - Executing command: "Saving edgeHub to store"
2018-09-04 20:45:20.722 +00:00 [INF] - Plan execution ended for deployment 36
2018-09-04 20:45:26.125 +00:00 [INF] - Updated reported properties
2018-09-04 20:55:11.229 +00:00 [INF] - HealthRestartPlanner is clearing restart stats for module 'edgeHub' as it has been running healthy for 00:10:00.
2018-09-04 20:55:11.229 +00:00 [INF] - Plan execution started for deployment 36
2018-09-04 20:55:11.230 +00:00 [INF] - Executing command: "Saving edgeHub to store"
2018-09-04 20:55:11.230 +00:00 [INF] - Plan execution ended for deployment 36
2018-09-04 20:55:16.537 +00:00 [INF] - Updated reported properties

The hub is restarted for whatever reason. The edgeHub log shows the following:

[ERR] Module deviceXY/modyleXY is not connected

And as you can guess, the module keeps running in this case without any error, but it does not receive any events.

Recap

It may be that you expect some different approach to solve the listed issues. In any case, it would be a great help if you could share your thoughts about this. In my opinion, this can be a serious issue.

  • All modules run into the timeout exception.
  • Because of the error, the modules are restarted.
  • After the restart, the modules always run into the same error, because the hub is not reachable.
  • The hub does not contain any error in its log and is not restarted.
  • A manual restart of the hub solves the hub's issue, but the modules need to be restarted manually or somehow reconnected again. Under 4) the auto-restart of the hub caused the same behaviour.
  • Modules do not know that they need to reconnect or restart.
  • Modules are not automatically restarted (or I didn't wait long enough, > 5 min).
  • Modules have to be restarted manually, which is not possible in production.

Sorry for the long description, but I do not see a better way to describe it.
Thanks in advance

@varunpuranik
Contributor

@ddobric -

Thanks for all the details. Here are my answers to them -

Issue 1 - Modules receiving messages do not reconnect if EdgeHub restarts.
This is a known issue with the C# SDK. Here is the issue: Azure/azure-iot-sdk-csharp#571. I have submitted a fix for this in the SDK - Azure/azure-iot-sdk-csharp#611. This should get merged soon and be available in the next release of the C# SDK.
Meanwhile, a workaround would be to move to using the MQTT transport (instead of AMQP); a sketch follows below.
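
For reference, a minimal sketch of that workaround, assuming the MqttTransportSettings class from the Microsoft.Azure.Devices.Client.Transport.Mqtt namespace; only the transport settings in Init() change:

// Same Init() as before, but using MQTT over TCP instead of AMQP.
MqttTransportSettings mqttSetting = new MqttTransportSettings(TransportType.Mqtt_Tcp_Only);
ITransportSettings[] settings = { mqttSetting };

ioTHubModuleClient = await ModuleClient.CreateFromEnvironmentAsync(settings);
await ioTHubModuleClient.OpenAsync();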

Issue 2 - Modules run into timeout exceptions
Modules are expected to have retry logic around any communication with the EdgeHub, just like they would have while talking to any server. That way they will not crash (see the retry sketch below). Restarting the server (EdgeHub) because the client got a timeout exception is not a pattern, and the behavior is by design. This should be handled on the client side (i.e. in the module). If it is not, the client will be restarted by the edgeAgent.
In any case, I would like to understand the scenario in which the modules are running into this exception. What is the frequency of messages, the size of messages, how many modules are sending, etc.?
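
As an illustration of such client-side retry logic, here is a minimal sketch; the helper name SendRetriedAsync, the output name "output1", and the delay/attempt values are placeholders, not part of the SDK:

// Hypothetical helper: wrap calls to edgeHub in a simple retry loop so that a
// transient TimeoutException does not crash the module.
// Assumes Microsoft.Azure.Devices.Client and Microsoft.Azure.Devices.Client.Exceptions.
static async Task SendRetriedAsync(ModuleClient client, Message message, int maxAttempts = 5)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            await client.SendEventAsync("output1", message);
            return;
        }
        catch (Exception ex) when (ex is TimeoutException || ex is IotHubCommunicationException)
        {
            if (attempt >= maxAttempts) throw;
            Console.WriteLine($"Send attempt {attempt} failed: {ex.Message}. Retrying...");
            await Task.Delay(TimeSpan.FromSeconds(5 * attempt));
        }
    }
}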

Issue 3 - EdgeHub being restarted randomly - we have not seen this issue before. Can you provide the entire edgeHub and edgeAgent logs from this period? By any chance, was there a deployment on the Edge device during this period that would have triggered an update of the EdgeHub?

Hope I answered all your questions. If I missed something, or if you have any other questions/concerns, please let me know.

@ddobric
Author

ddobric commented Sep 7, 2018

First of all, thank you for your answer.

Issue 1 - Modules receiving messages do not reconnect if EdgeHub restarts.

I was not aware of this issue. How was this fixed? Some kind of retry inside the AMQP client?

Issue 2 - Modules run into timeout exceptions

Totally agree on this. However, the timeout exception (in this case) was thrown inside the C# SDK while receiving events. I guess Issue 1 targets exactly this problem? If not, we have no chance to catch such an exception unless some error callback is provided when registering the input handler (a process-wide fallback is sketched below).
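
To make such a failure at least visible and recoverable, a module could fall back on the standard .NET process-wide hooks; a minimal sketch (my own suggestion, not an SDK feature):

// Not an SDK error callback - just process-wide hooks so the failure gets logged
// and the module exits instead of silently doing nothing, letting edgeAgent restart it.
TaskScheduler.UnobservedTaskException += (sender, e) =>
{
    Console.WriteLine($"Unobserved task exception: {e.Exception}");
    e.SetObserved();
    Environment.Exit(1);
};

AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
{
    Console.WriteLine($"Unhandled exception: {e.ExceptionObject}");
};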

Issue 3 - EdgeHub being restarted randomly

I attached the logs, which contain two random restarts. The only error in the log is
Module ai-02-arm32v7/fncrecognizer is not connected
The module fncrecognizer has been running for at least 15 hours without any error. However, the module has stopped receiving messages (probably Issue 1 above).
The edgeHub log entries from the moment the module stopped receiving messages have been overwritten. If there is some way to grab them from somewhere, please let me know. The last log entry inside the still-running module fncrecognizer was at 09/06/2018 15:36:48, on the previous day.
I do not have that log entry, but in the last few days I think I saw an "out of memory" error in the edgeHub log, approximately at the moment when the fncrecognizer module stopped receiving events.

Logs:
edgeHub.txt

edgeAgent.txt

service.txt

Hope this helps.

Recap

The logs show instability of the device, which is running one sensor module (sinusgen) and one receiving filter module (fncrecognizer). The filter module stops receiving events without any exception and continues running (doing nothing).
EdgeHub is running and restarts approximately every 30 min.

Both modules are implemented in C#, .NET Core.

@ddobric
Author

ddobric commented Sep 11, 2018

Anything new?

@myagley
Contributor

myagley commented Sep 11, 2018

We have been running a long haul test with the fix to verify that it resolves the issue. All signs are positive but we would like it to run for a few more days to gain more confidence. You can track the PR here: Azure/azure-iot-sdk-csharp#611

@ddobric
Author

ddobric commented Sep 13, 2018

Thanks :)

@myagley
Contributor

myagley commented May 29, 2019

The connectivity issues have been resolved in version 1.0.7. Closing this issue, but feel free to reopen if you are still facing issues.
