Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Module fails to restart because of transient network error #1982

Closed
fgheysels opened this issue Nov 12, 2019 · 6 comments
Closed

Module fails to restart because of transient network error #1982

fgheysels opened this issue Nov 12, 2019 · 6 comments

Comments

@fgheysels
Copy link

fgheysels commented Nov 12, 2019

Expected Behavior

When an IoT Edge module disconnects it should be able to reconnect.

Current Behavior

I have an IoT Edge solution that runs on a Raspberry Pi and consists of 4 IoT Edge modules.
Upon deployment of the solution to the Edge Device (Pi), everything was working fine. After a couple of hours, one of the modules was being restarted by the IoT Edge runtime,
and failed to start again. All the other modules just keep on working.

When I look in the logs of the failing module, I can see that an exception is being thrown with the following stacktrace:

Unhandled Exception: Microsoft.Azure.Devices.Client.Exceptions.IotHubCommunicationException: Transient network error occurred, please retry. ---> System.Net.Internals.SocketExceptionFactory+ExtendedSocketException: No such device or address
   at System.Net.Dns.InternalGetHostByName(String hostName)
   at System.Net.Dns.ResolveCallback(Object context)
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw(Exception source)
   at System.Net.Dns.HostResolutionEndHelper(IAsyncResult asyncResult)
   at System.Net.Dns.EndGetHostAddresses(IAsyncResult asyncResult)
   at System.Net.Dns.<>c.<GetHostAddressesAsync>b__25_1(IAsyncResult asyncResult)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncCoreLogic(IAsyncResult iar, Func`2 endFunction, Action`1 endAction, Task`1 promise, Boolean requiresSynchronization)
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.Mqtt.MqttTransportHandler.OpenAsyncInternal(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.Transport.Mqtt.MqttTransportHandler.OpenAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.Transport.ProtocolRoutingDelegatingHandler.OpenAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.<>c__DisplayClass22_0.<<ExecuteWithErrorHandlingAsync>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func`1 asyncOperation)
   --- End of inner exception stack trace ---
   at Microsoft.Azure.Devices.Client.Transport.ErrorDelegatingHandler.ExecuteWithErrorHandlingAsync[T](Func`1 asyncOperation)
   at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.<>c__DisplayClass32_0.<<OpenAsyncInternal>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.Azure.Devices.Client.Transport.RetryDelegatingHandler.EnsureOpenedAsync(CancellationToken cancellationToken)
   at Microsoft.Azure.Devices.Client.InternalClient.OpenAsync()
   at DiscoveryModule.Program.Init() in /app/Program.cs:line 210
   at DiscoveryModule.Program.Main(String[] args) in /app/Program.cs:line 25
   at DiscoveryModule.Program.<Main>(String[] args)

This exception occurs when the OpenAsync method on the ModuleClient is being called to open the connection to IoT Hub.

The module is using MQTT as a communication protocol; this is how the connection is being established:

var mqttSetting = new MqttTransportSettings(TransportType.Mqtt_Tcp_Only);
ITransportSettings[] settings = { mqttSetting };

// Open a connection to the Edge runtime
ModuleClient ioTHubModuleClient = await ModuleClient.CreateFromEnvironmentAsync(settings);
await ioTHubModuleClient.OpenAsync();

var currentTwin = await ioTHubModuleClient.GetTwinAsync();

My other modules that are running on the same device (and that keep working), are also using MQTT.

I'm using Microsoft.Azure.Devices.Client 1.20.3 in all my Edge Modules (both on the one that's failing and in the other ones that are still working).

I also tried to restart the module manually via iotedge restart <modulename>, but this didn't help. I assume that the problem will be solved (temporarely?) when I reboot the device, but this is not something that can be done when the device is 'in the field'.

The module that is failing has the following 'createOptions':

"createOptions": {
                "HostConfig": {
                  "NetworkMode": "host"
                },
                "NetworkingConfig": {
                  "EndpointsConfig": {
                    "host": {}
                  }
                }
              }

(However, this module is also running on another device and this issue doesn't exists on the other device).

Context (Environment)

Output of iotedge check

Click here
Configuration checks
--------------------
√ config.yaml is well-formed - OK
√ config.yaml has well-formed connection string - OK
√ container engine is installed and functional - OK
√ config.yaml has correct hostname - OK
√ config.yaml has correct URIs for daemon mgmt endpoint - OK
√ latest security daemon - OK
√ host time is close to real time - OK
√ container time is close to host time - OK
‼ DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
‼ production readiness: certificates - Warning
    Device is using self-signed, automatically generated certs.
    Please see https://aka.ms/iotedge-prod-checklist-certs for best practices.
√ production readiness: certificates expiry - OK
√ production readiness: container engine - OK
‼ production readiness: logs policy - Warning
    Container engine is not configured to rotate module logs which may cause it run out of disk space.
    Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
    You can ignore this warning if you are setting log policy per module in the Edge deployment.

Connectivity checks
-------------------
√ host can connect to and perform TLS handshake with IoT Hub AMQP port - OK
√ host can connect to and perform TLS handshake with IoT Hub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with IoT Hub MQTT port - OK
√ container on the default network can connect to IoT Hub AMQP port - OK
√ container on the default network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the default network can connect to IoT Hub MQTT port - OK
√ container on the IoT Edge module network can connect to IoT Hub AMQP port - OK
√ container on the IoT Edge module network can connect to IoT Hub HTTPS / WebSockets port - OK
√ container on the IoT Edge module network can connect to IoT Hub MQTT port - OK
√ Edge Hub can bind to ports on host - OK

20 check(s) succeeded.
3 check(s) raised warnings. Re-run with --verbose for more details.

Device Information

  • Host OS : Raspbian gnu/linux 9 (stretch)
  • Architecture: ARM32
  • Container OS: Linux containers

Runtime Versions

  • iotedged [run iotedge version]: iotedge 1.0.8 (208b220)
  • Edge Agent [image tag (e.g. 1.0.0)]: 1.0.8.3
  • Edge Hub [image tag (e.g. 1.0.0)]: 1.0.8.3
  • Docker/Moby [run docker version]: 3.0.7

Logs

edge-agent logs
<6> 2019-11-12 16:12:42.456 +00:00 [INF] - Executing command: "Command Group: (\n  [Stop module DiscoveryModule]\n  [Start module DiscoveryModule]\n  [Saving DiscoveryModule to store]\n)"
<6> 2019-11-12 16:12:42.456 +00:00 [INF] - Executing command: "Stop module DiscoveryModule"
<6> 2019-11-12 16:12:42.466 +00:00 [INF] - Executing command: "Start module DiscoveryModule"
<6> 2019-11-12 16:12:43.026 +00:00 [INF] - Executing command: "Saving DiscoveryModule to store"
<6> 2019-11-12 16:12:43.027 +00:00 [INF] - Plan execution ended for deployment 9
<6> 2019-11-12 16:12:48.412 +00:00 [INF] - Updated reported properties
<6> 2019-11-12 16:16:58.822 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:55s left).
<6> 2019-11-12 16:16:58.995 +00:00 [INF] - Updated reported properties
<6> 2019-11-12 16:17:04.077 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:49s left).
<6> 2019-11-12 16:17:09.186 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:44s left).
<6> 2019-11-12 16:17:14.296 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:39s left).
<6> 2019-11-12 16:17:19.408 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:34s left).
<6> 2019-11-12 16:17:24.515 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:29s left).
<6> 2019-11-12 16:17:29.633 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:24s left).
<6> 2019-11-12 16:17:34.741 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:19s left).
<6> 2019-11-12 16:17:39.848 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:13s left).
<6> 2019-11-12 16:17:44.959 +00:00 [INF] - Module 'DiscoveryModule' scheduled to restart after 05m:00s (04m:08s left).
edge-hub logs
<6> 2019-11-12 15:23:53.408 +00:00 [INF] - Credentials for client DG-0000000-191025-001/ActionExecutor are not valid.
<4> 2019-11-12 15:23:53.409 +00:00 [WRN] - Reauthenticating client DG-0000000-191025-001/ActionExecutor failed, removing client connection
<6> 2019-11-12 15:23:53.409 +00:00 [INF] - Unable to re-authenticate DG-0000000-191025-001/ActionExecutor, dropping client connection.
<4> 2019-11-12 15:23:53.409 +00:00 [WRN] - Closing connection for device: DG-0000000-191025-001/ActionExecutor, Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubConnectionException: Connection closed for device DG-0000000-191025-001/ActionExecutor.,
<6> 2019-11-12 15:23:53.409 +00:00 [INF] - Disposing MessagingServiceClient for device Id DG-0000000-191025-001/ActionExecutor because of exception - Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubConnectionException: Connection closed for device DG-0000000-191025-001/ActionExecutor.
<6> 2019-11-12 15:23:53.410 +00:00 [INF] - Closing device proxy for device Id DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.410 +00:00 [INF] - Device connection removed for device DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.559 +00:00 [INF] - Client DG-0000000-191025-001/ActionExecutor in device scope authenticated locally.
<6> 2019-11-12 15:23:53.560 +00:00 [INF] - Successfully generated identity for clientId DG-0000000-191025-001/ActionExecutor and username dg-0000000-191025-001/DG-0000000-191025-001/ActionExecutor/?api-version=2018-06-30&DeviceClientType=.NET%2F1.20.3%20%28.NET%20Core%204.6.28008.02%3B%20Linux%204.19.66-v7%2B%20%231253%20SMP%20Thu%20Aug%2015%2011%3A49%3A46%20BST%202019%3B%20Arm%29
<6> 2019-11-12 15:23:53.560 +00:00 [INF] - ClientAuthenticated, DG-0000000-191025-001/ActionExecutor, 3bee4180
<6> 2019-11-12 15:23:53.563 +00:00 [INF] - New device connection for device DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.568 +00:00 [INF] - Bind device proxy for device DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.568 +00:00 [INF] - Binding message channel for device Id DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.569 +00:00 [INF] - Processing subscriptions TwinResponse, ModuleMessages, Methods for client DG-0000000-191025-001/ActionExecutor.
<6> 2019-11-12 15:23:53.586 +00:00 [INF] - Set subscriptions from session state for DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.586 +00:00 [INF] - Updated reported properties for DG-0000000-191025-001/$edgeHub
<6> 2019-11-12 15:23:53.589 +00:00 [INF] - Processing subscriptions TwinResponse, ModuleMessages, Methods for client DG-0000000-191025-001/ActionExecutor.
<6> 2019-11-12 15:23:53.591 +00:00 [INF] - Set subscriptions from session state for DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:53.593 +00:00 [INF] - Processing subscriptions TwinResponse, ModuleMessages, Methods for client DG-0000000-191025-001/ActionExecutor.
<6> 2019-11-12 15:23:53.596 +00:00 [INF] - Set subscriptions from session state for DG-0000000-191025-001/ActionExecutor
<6> 2019-11-12 15:23:58.725 +00:00 [INF] - Updated reported properties for DG-0000000-191025-001/$edgeHub
<6> 2019-11-12 15:28:53.416 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 15:28:53.422 +00:00 [INF] - Error authenticating token for DG-0000000-191025-001/TelemetryRetriever because the token is expired or could not be parsed
<6> 2019-11-12 15:28:53.423 +00:00 [INF] - Credentials for client DG-0000000-191025-001/TelemetryRetriever are not valid.
<4> 2019-11-12 15:28:53.423 +00:00 [WRN] - Reauthenticating client DG-0000000-191025-001/TelemetryRetriever failed, removing client connection
<6> 2019-11-12 15:28:53.424 +00:00 [INF] - Unable to re-authenticate DG-0000000-191025-001/TelemetryRetriever, dropping client connection.
<4> 2019-11-12 15:28:53.424 +00:00 [WRN] - Closing connection for device: DG-0000000-191025-001/TelemetryRetriever, Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubConnectionException: Connection closed for device DG-0000000-191025-001/TelemetryRetriever.,
<6> 2019-11-12 15:28:53.424 +00:00 [INF] - Disposing MessagingServiceClient for device Id DG-0000000-191025-001/TelemetryRetriever because of exception - Microsoft.Azure.Devices.Edge.Hub.Core.EdgeHubConnectionException: Connection closed for device DG-0000000-191025-001/TelemetryRetriever.
<6> 2019-11-12 15:28:53.431 +00:00 [INF] - Closing device proxy for device Id DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.432 +00:00 [INF] - Device connection removed for device DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.587 +00:00 [INF] - Client DG-0000000-191025-001/TelemetryRetriever in device scope authenticated locally.
<6> 2019-11-12 15:28:53.588 +00:00 [INF] - Successfully generated identity for clientId DG-0000000-191025-001/TelemetryRetriever and username dg-0000000-191025-001/DG-0000000-191025-001/TelemetryRetriever/?api-version=2018-06-30&DeviceClientType=.NET%2F1.20.3%20%28.NET%20Core%204.6.28008.02%3B%20Linux%204.19.66-v7%2B%20%231253%20SMP%20Thu%20Aug%2015%2011%3A49%3A46%20BST%202019%3B%20Arm%29
<6> 2019-11-12 15:28:53.588 +00:00 [INF] - ClientAuthenticated, DG-0000000-191025-001/TelemetryRetriever, 190cc900
<6> 2019-11-12 15:28:53.591 +00:00 [INF] - New device connection for device DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.595 +00:00 [INF] - Bind device proxy for device DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.595 +00:00 [INF] - Binding message channel for device Id DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.597 +00:00 [INF] - Processing subscriptions TwinResponse, ModuleMessages, DesiredPropertyUpdates for client DG-0000000-191025-001/TelemetryRetriever.
<6> 2019-11-12 15:28:53.599 +00:00 [INF] - Set subscriptions from session state for DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.602 +00:00 [INF] - Processing subscriptions TwinResponse, ModuleMessages, DesiredPropertyUpdates for client DG-0000000-191025-001/TelemetryRetriever.
<6> 2019-11-12 15:28:53.604 +00:00 [INF] - Set subscriptions from session state for DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:53.606 +00:00 [INF] - Updated reported properties for DG-0000000-191025-001/$edgeHub
<6> 2019-11-12 15:28:53.606 +00:00 [INF] - Processing subscriptions TwinResponse, ModuleMessages, DesiredPropertyUpdates for client DG-0000000-191025-001/TelemetryRetriever.
<6> 2019-11-12 15:28:53.608 +00:00 [INF] - Set subscriptions from session state for DG-0000000-191025-001/TelemetryRetriever
<6> 2019-11-12 15:28:58.808 +00:00 [INF] - Updated reported properties for DG-0000000-191025-001/$edgeHub
<6> 2019-11-12 15:33:53.410 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 15:38:53.407 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 15:43:53.408 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 15:44:11.827 +00:00 [INF] - Starting periodic operation Get EdgeHub config...
<6> 2019-11-12 15:44:11.894 +00:00 [INF] - Obtained edge hub config from module twin
<6> 2019-11-12 15:44:11.895 +00:00 [INF] - Successfully completed periodic operation Get EdgeHub config
<6> 2019-11-12 15:44:12.038 +00:00 [INF] - Updated reported properties for DG-0000000-191025-001/$edgeHub
<6> 2019-11-12 15:48:53.407 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 15:49:54.537 +00:00 [INF] - Started task to cleanup processed and stale messages for endpoint iothub
<6> 2019-11-12 15:49:54.539 +00:00 [INF] - Cleaned up 0 messages from queue for endpoint iothub and 0 messages from message store.
<6> 2019-11-12 15:50:24.540 +00:00 [INF] - Started task to cleanup processed and stale messages for endpoint DG-0000000-191025-001/ActionExecutor/input1
<6> 2019-11-12 15:50:24.542 +00:00 [INF] - Cleaned up 0 messages from queue for endpoint DG-0000000-191025-001/ActionExecutor/input1 and 0 messages from message store.
<6> 2019-11-12 15:50:54.539 +00:00 [INF] - Started task to cleanup processed and stale messages for endpoint DG-0000000-191025-001/TelemetryRetriever/input1
<6> 2019-11-12 15:50:54.541 +00:00 [INF] - Cleaned up 0 messages from queue for endpoint DG-0000000-191025-001/TelemetryRetriever/input1 and 0 messages from message store.
<6> 2019-11-12 15:52:43.928 +00:00 [INF] - Starting refresh of device scope identities cache
<6> 2019-11-12 15:53:53.407 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 15:58:53.406 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 16:03:53.407 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 16:08:53.406 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 16:13:53.407 +00:00 [INF] - Reauthenticating connected clients
<6> 2019-11-12 16:18:53.407 +00:00 [INF] - Reauthenticating connected clients

@philipktlin
Copy link
Contributor

Did you check DNS works at the moment that your module failed to connect? did you test DNS inside failed module? does other modules use the same way to instantiate module client? Would you share the full log of iotedged, edgeAgent and edgeHub. @varunpuranik do you have any suggestion?

@fgheysels
Copy link
Author

I've been able to ssh into the Raspberry Pi that acts as an edge device. Since iotedge check reported that there was no DNS configured, I've done that by specifying this entry in daemon.json:

{
   "dns": ["1.1.1.1"]
}

I've also noticed that the /etc/hosts file contained an error: the IP address 127.0.1.1 didn't refer to the hostname of the device. I've changed that as well and rebooted the device

For now, the problem hasn't reoccurred yet, but now I also do not know which change might have fixed the problem.

@philipktlin
Copy link
Contributor

Definitely 127.0.0.1 mapping should always map to your device name. For using public DNS 1.1.1.1, I think it is required as well since iotedge check result of your device tells that no DNS is configured.

Please monitor and provide logs of IoT edge daemon, edgeAgent, edgeHub and your custom module , and iotedge check result when problem happens again.

This issue will go to stale if it is inactive for 30 days. We will close it when it goes to stale. thanks.

@philipktlin philipktlin self-assigned this Nov 18, 2019
@philipktlin philipktlin added area:connectivity Anything related to connectivity, networking, and related. area:host_configuration and removed area:connectivity Anything related to connectivity, networking, and related. labels Nov 18, 2019
@fgheysels
Copy link
Author

fgheysels commented Nov 19, 2019

Should 127.0.0.1 map to the device-name or 127.0.1.1 ? (127.0.0.1 refers to localhost; 127.0.1.1 refers to the hostname).

The logs for edgeAgent, edgeHub should be in the issue, but seems like there's something wrong with the formatting. I'll fix it (done).

I can't find anymore logs from my custom module since it seems that it is now running for a week and the issue has not reoccured yet, so it seems to be fixed for now ...
Might be an additional feature for iotedge check to verify if /etc/hosts is aligned with /etc/hostname ?

How can I get the logs for the edge daemon ?

@ancaantochi ancaantochi self-assigned this Nov 21, 2019
@ancaantochi
Copy link
Contributor

I think the change to the /etc/hosts file fixed your issue because the connection failed from the module to edgeHub. The change to daemon.json to add the DNS is to fix issues from edgeHub to the cloud. Because your module is running on the host network it needs to be able to resolve the GatewayHostname, you can check which value it has in the Env for IOTEDGE_GATEWAYHOSTNAME using:

docker inspect <moduleName>

To get daemon logs:

sudo journalctl -u iotedge -f

We will evaluate if iotedge check should validate /etc/hosts and /etc/hostname

@ancaantochi
Copy link
Contributor

I will close this issue, please reopen if needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants