Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After System.TimeoutException from DeviceClient.SendEventAsync no instance of DeviceClient can connect to Iot Hub, even when the cause for the OperationTimeout is not longer present. #613

Closed
dleinius opened this issue Sep 5, 2018 · 10 comments
Labels
investigation-required Requires further investigation to root cause this. IoTSDK Tracks all IoT SDK issues across the board

Comments

@dleinius
Copy link

dleinius commented Sep 5, 2018

  • OS, version, SKU and CPU architecture used: Microsoft Windows [Version 10.0.17134.137]
  • Application's .NET Target Framework : net452
  • Device: Laptop
  • SDK version used: Microsoft.Azure.Devices.Client 1.18.0,
    Microsoft.Azure.Devices.Shared 1.15.1,
    Microsoft.Azure.Amqp 2.3.3

Description of the issue:

After System.TimeoutException ("Operation timeout expired.") from DeviceClient.SendEventAsync no instance of DeviceClient within the AppDomain can ever connect or SendEvents to Iot Hub, even when the cause for the OperationTimeout is not longer present. Reset of AppDomain (restart of program) solves issue.

We are building a Protocol Gateway for native devices. The native devices are using not supported protocols and the Protocol Gateway builds "the bridge" towards the Iot Hub. The Protocol Gateway initiates an instance of DeviceClient foreach connected native device:

Native Devices -> (own implemented) Protocol Gateway -> DeviceClient -> IotHub

At some point (after minutes, hours or days) we experiences a TimeoutException ("Operation timeout expired.") when sending a message via DeviceClient.SendEventAsync. The TimeoutException itself is fine for us, because in this cloud scenario you have to expectect that at some time the network might be down for seconds or whatever. Our Issue is that after this point, even when the cause for the TimeoutException no longer exists, the DeviceClient will NEVER EVER sucessfully connects or sends messages to the IotHub. An reset of the AppDomain (restart of program) immediately solves the issue.

The following code sample demonstrates the issue in a console app:

The program will simulate 2 devices, one device (1234567900000) sending messages to IotHub every 5 seconds, the other device (1234567900001) every 45 seconds. To simulate a TimeoutException use a tool like Clumsy to drop AMQP messages on Port (5671) - or if you are sitting in front of your machine simply disconnect the network.

Repro Steps:

  1. Run program
  2. Wait for transmission of the messages
  3. After Device 1234567900001 has send successfully one message, use clumsy (or disconnect network) to drop 1(!) message of the other device (1234567900000)
  4. Device 1234567900000 will experience a TimeoutException after 5 seconds (OperationTimeoutInMilliseconds is only 5 seconds)
  5. After TimeoutException, disable clumys (or enabled network) to allow DeviceClient to send the next messages

Received Result:
All further messages of both(!!!) devices will received a TimeoutException. - forever

Expected Result:
After disabling of clumsy (and therefore getting a working network) all "new" messages should be sucessfully send to IotHub

Therefore, why are all instances of DeviceClient are getting a TimeoutException - even when the cause of the Timeout no longer exists? Is there a way to reset the InternalClient of DeviceClient?

We tested it with Amqp and Mqtt protocol, each withTcp_Only and Amqp_WebSocket_Only.

Please help, this behaviour seems to be a show stopper for us. :-(

Code sample exhibiting the issue:

Timeout.zip

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.Devices.Client;

namespace ConsoleApp4
{
class Program
{
static void Main(string[] args)
{
RunnerTask().Wait();
}

    static async Task RunnerTask()
    {
        //create two devices
        var devices = new Dictionary<string, DeviceClient>();
        devices.Add("1234567900000", await CreateDeviceClient("1234567900000"));
        devices.Add("1234567900001", await CreateDeviceClient("1234567900001"));

        var tasks = new List<Task>();
        foreach (var device in devices)
        {
            tasks.Add(
                Task.Run(async () =>
                {
                    //run forever
                    while (1 == 1)
                    {
                        try
                        {
                            if (device.Key == "1234567900000")
                                //one devices sends every 5 seconds
                                await Task.Delay(5 * 1000);
                            else
                                //the othe sends every 45 seconds
                                await Task.Delay(45 * 1000);
                            var time = DateTime.Now.ToString();
                            var message = new Message(Encoding.ASCII.GetBytes(time));

                            //send message
                            Console.WriteLine($"Trying for {device.Key}");
                            await device.Value.SendEventAsync(message);
                            Console.WriteLine($"Send {time}");

                        }
                        catch (TimeoutException te)
                        {
                            Console.WriteLine($"Timeout {device.Key}: {te.Message}");
                        }
                        catch (Exception e)
                        {
                            Console.WriteLine($"Exception {device.Key}: {e.Message}");
                        }
                    }
                })
            );
        }

        await Task.WhenAny(tasks);
    }

    public static async Task<DeviceClient> CreateDeviceClient(string deviceId)
    {
        var connectionPoolSize = 10;
        var tcpSettings = new AmqpTransportSettings(TransportType.Amqp_Tcp_Only);
        var amqpConnectionPoolSettings = new AmqpConnectionPoolSettings
        {
            MaxPoolSize = unchecked((uint)connectionPoolSize),
            Pooling = connectionPoolSize > 0
        };
        tcpSettings.AmqpConnectionPoolSettings = amqpConnectionPoolSettings;


        var deviceClient = DeviceClient.CreateFromConnectionString(YOURCONNECTION,  "I" + deviceId,
            new ITransportSettings[]
            {
                tcpSettings
            });
        deviceClient.OperationTimeoutInMilliseconds = 1 * 5 * 1000;
        await deviceClient.OpenAsync();

        return deviceClient;
    }
}

}

Console log of the issue:

Trying for 1234567900000
Send 05.09.2018 16:22:25 for 1234567900000
Trying for 1234567900000
Send 05.09.2018 16:22:30 for 1234567900000
Trying for 1234567900001
Send 05.09.2018 16:22:35 for 1234567900001
Trying for 1234567900000
Send 05.09.2018 16:22:35 for 1234567900000
[+++ PLING NETWORK OFF +++]
Trying for 1234567900000
Timeout 1234567900000: Operation timeout expired.
[+++ PLING NETWORK ON +++]
Trying for 1234567900000
Timeout 1234567900000: Operation timeout expired.
Trying for 1234567900000
Timeout 1234567900000: Operation timeout expired.
Trying for 1234567900000
Trying for 1234567900001
Timeout 1234567900001: Operation timeout expired.
Timeout 1234567900000: Operation timeout expired.
Trying for 1234567900000
Timeout 1234567900000: Operation timeout expired.
Trying for 1234567900000
Timeout 1234567900000: Operation timeout expired.

@dleinius
Copy link
Author

dleinius commented Sep 7, 2018

We did further testing, including going back to Microsoft.Azure.Devices.Client 1.7.2 (from June 2018). With this version (1.7.2) the expected behaviour occurs.

Does anybody else have connectivity issues with the current version (1.18.0)?

@CIPop
Copy link
Member

CIPop commented Sep 8, 2018

@dleinius thanks for reporting. Could you please collect and share logs: https://github.com/Azure/azure-iot-sdk-csharp/tree/master/tools/CaptureLogs ?
If the logs contain private information, please create a support request to upload them.

@CIPop CIPop added the investigation-required Requires further investigation to root cause this. label Sep 8, 2018
@CIPop
Copy link
Member

CIPop commented Sep 8, 2018

Does anybody else have connectivity issues with the current version (1.18.0)?

We are aware of certain Receive message recovery issues: #571. Part of the investigation, we've tried to inject faults to DeviceClient AMQP while sending and were not able to repro this issue.

@dleinius
Copy link
Author

dleinius commented Sep 8, 2018

Could you please collect and share logs: https://github.com/Azure/azure-iot-sdk-csharp/tree/master/tools/CaptureLogs ?
If the logs contain private information, please create a support request to upload them.

I captured the logs, but the ETL file seems to be "empty" (8KB). Opening it with the event log viewer of windows shows no entries.

Nevertheless, where can I upload the log? Via a support request within the azure portal? How can I make sure, that the logs are delivered to you?

And @CIPop thanks for your help!

@anhashia
Copy link
Contributor

Able to locally reproduce the issue on 1.18.0. Same symptoms as #620

1.) Issue specific to AMQP. Not able to reproduce using MQTT

2.) Use IoT Hub connection string for local reproduction
DeviceConnectionString = "HostName=myHub.azure-devices.net;SharedAccessKeyName=device;SharedAccessKey=*************";

DeviceClient deviceClient = DeviceClient.CreateFromConnectionString(DeviceConnectionString, "MyDevice", TransportType.Amqp_Tcp_Only)

3.) After timeout no attempt being made by SDK to try on a brand new connection. Hence all subsequent attemps fail event though underlying network connectivity is restored.

4.) No reproduction using older version 1.6.0

@abhipsaMisra
Copy link
Member

The current LTS version supports the behavior of creating device client and sending telemetry using the IoTHub connection string; so it can be used to unblock this issue. That being said, we highly encourage you to use the device connection string for creating the device client.

@dleinius
Copy link
Author

@abhipsaMisra

Thanks for your help. Do I get you right, that the issue only exists when the DeviceClient is created by the IotHub Connectionstring and it will not come up with DeviceConnectionString?

@abhipsaMisra
Copy link
Member

abhipsaMisra commented Sep 21, 2018

@dleinius Yes, on connecting with DeviceConnectionString, the client will successfully recover the connection and will be able to continue sending messages. I tried the steps using DeviceConnectionString for creating the Device Client, and was not able to reproduce the issue.
Please let us know if this help! Thanks!

@dleinius
Copy link
Author

@abhipsaMisra
We had the same issues with the DeviceConnectionString :-( The connection did not sucessfully recovered with 1.18.0. We are no back on 1.6.0. Is there a planning for a 1.19.0? I was not able to build the LTS do some trouble with the build on my machine :-(

@sharmasejal sharmasejal added the IoTSDK Tracks all IoT SDK issues across the board label Mar 19, 2020
@dleinius dleinius closed this as completed May 7, 2020
@az-iot-builder-01
Copy link
Contributor

@dleinius, @CIPop, @abhipsaMisra, thank you for your contribution to our open-sourced project! Please help us improve by filling out this 2-minute customer satisfaction survey

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
investigation-required Requires further investigation to root cause this. IoTSDK Tracks all IoT SDK issues across the board
Projects
None yet
Development

No branches or pull requests

8 participants