ModuleClient with MQTT TCP does not re-connect if the connection is dropped by the server #558

Closed
varunpuranik opened this issue Jul 20, 2018 · 23 comments
Labels: area-edge (Issues related to IoT Edge), bug (Something isn't working), investigation-required (Requires further investigation to root cause this), IoTSDK (Tracks all IoT SDK issues across the board)

Comments

@varunpuranik
Contributor

OS: Ubuntu 16.04, x64
.Net Core 2.1
Device: Azure VM
SDK Version: 1.17.0

Description of the issue:

When a module uses the ModuleClient with an MQTT TCP connection to connect to the EdgeHub, the server (EdgeHub) drops the connection when it needs a new token. The expectation is that the ModuleClient will re-open this connection to the server and continue operating.

However, every so often the ModuleClient never re-opens the connection to the EdgeHub, which leaves a stranded module that does nothing. This happens more frequently when the module is only receiving messages.
The ModuleClient does not even throw an exception; it simply loses the connection to the EdgeHub and stays idle.

Repro steps:

  • Create a module that uses the ModuleClient to connect to EdgeHub over MQTT and set the callback to receive messages.
  • Deploy the module to an Edge device, such that it receives messages from some other module/device.
  • For some time the module receives successfully. But after a few token refresh cycles (2-3 hours or so), it stops receiving messages.
@CIPop
Member

CIPop commented Jul 30, 2018

Thanks @varunpuranik for reporting this!
We'll need to set up a long-running test/repro to properly test token refresh over AMQP and MQTT.

@pavele

pavele commented Aug 3, 2018

This should be considered a blocker since this is already in GA. Not having long-running tests for an IoT "solution" is nothing short of ridiculous.

@CIPop
Member

CIPop commented Aug 3, 2018

I was able to repro this using the MQTT fault injection tests which appear to have been disabled for a while (meaning that this is probably not working for DeviceClient to IoT Hub either).
@varunpuranik Would restarting the ModuleClient from time to time be a viable workaround for MQTT until we fix the bug?
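
The "restart the client from time to time" workaround could be sketched roughly as follows. This is a language-agnostic illustration in Python, not SDK code: `RecyclingClient`, `factory`, `max_age_s`, and `clock` are hypothetical names, and the only assumption about the wrapped client is that it exposes a `close()` method.

```python
import time
from typing import Callable, Optional, Protocol


class Closable(Protocol):
    """Anything with a close() method; stands in for a real client (assumption)."""

    def close(self) -> None: ...


class RecyclingClient:
    """Hands out a client and replaces it once it is older than max_age_s."""

    def __init__(
        self,
        factory: Callable[[], Closable],
        max_age_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._factory = factory
        self._max_age_s = max_age_s
        self._clock = clock
        self._client: Optional[Closable] = None
        self._opened_at = 0.0

    def get(self) -> Closable:
        now = self._clock()
        expired = now - self._opened_at >= self._max_age_s
        if self._client is None or expired:
            if self._client is not None:
                # Drop the old connection before opening a fresh one.
                self._client.close()
            self._client = self._factory()
            self._opened_at = now
        return self._client
```

Callers would fetch the client through `get()` each time they need it, so a connection that went silently dead is replaced at the latest after `max_age_s` seconds. Any message callbacks would have to be registered again inside `factory`.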

@CIPop CIPop self-assigned this Aug 3, 2018
@varunpuranik
Contributor Author

@CIPop If the module died and was restarted automatically by EdgeAgent, this might have been acceptable. Unfortunately, this is not the case. The module simply hangs after disconnecting, doing nothing. A user needs to notice that this has happened and then restart the module. I doubt that will be acceptable to many customers.

@jason-e-gross

jason-e-gross commented Aug 5, 2018 via email

@CIPop
Member

CIPop commented Aug 6, 2018

I've started investigating and found that most of the tests for MQTT and AMQP recovery were disabled a long time ago. After re-enabling them, I do see them failing to reconnect after various types of faults.
The good part is that we have a consistent repro, without Edge being involved. The bad part is that we don't know when the behavior got broken (or if it ever worked well). I'll update this thread and the AMQP equivalent once I have some idea why this is failing.

@pavele

pavele commented Aug 7, 2018

We've already escalated a ticket because of this problem, since it is business critical and is blocking our release.
The situation that you describe sounds like somebody is developing IoT without understanding that lack of connectivity is the very first problem to solve.
On top of that, our team has been using this functionality since it was in preview, and for months we have tried to discuss exactly these issues with someone who knows Azure IoT, without luck. We always get the same excuse: a very busy product team. So now we're facing an amazing product with crappy communication with the cloud, buggy communication between the modules, and a support team which only resets the counter of the tickets. All this combined with problems with Azure Message Bus. Pure awesomeness.

@myagley
Contributor

myagley commented Aug 10, 2018

We believe this issue has been fixed in DotNetty here: Azure/DotNetty#413

We need to cut a new release of DotNetty and update the SDK.

@myagley
Contributor

myagley commented Aug 18, 2018

Version 1.18.0 of the SDK has been released to nuget.org. Please give it a try.

@varunpuranik
Contributor Author

This works now with SDK version 1.18.0. Closing this issue.

@CIPop
Member

CIPop commented Aug 23, 2018

Reopening as we still have TODO items in our E2E tests that need to be investigated.

@CIPop CIPop reopened this Aug 23, 2018
@CIPop
Member

CIPop commented Aug 27, 2018

So far, the tests that were disabled were failing because of issues on the service-side fault injection support.
The tests try to send the fault injection request message but never receive a PUBACK (the delayInSec parameter is ignored and the connection is reset immediately). This caused the tests to enter an infinite loop (probably the reason they were disabled in the first place).
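
One way to keep such a test from looping forever is to bound the wait for the acknowledgement. The sketch below is only an illustration in Python and is not the SDK's test code: `send_with_bounded_ack_wait`, `publish`, `ack_timeout_s`, `max_attempts`, and `AckTimeout` are hypothetical names, and `publish` is assumed to be a coroutine function that completes once the PUBACK arrives.

```python
import asyncio
from typing import Awaitable, Callable


class AckTimeout(Exception):
    """Raised when no acknowledgement arrived within the allowed attempts."""


async def send_with_bounded_ack_wait(
    publish: Callable[[], Awaitable[None]],
    ack_timeout_s: float = 5.0,
    max_attempts: int = 3,
) -> int:
    """Retry publish() a limited number of times; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            # wait_for cancels the pending publish if the ack does not arrive in time.
            await asyncio.wait_for(publish(), timeout=ack_timeout_s)
            return attempt
        except asyncio.TimeoutError:
            continue
    raise AckTimeout(f"no acknowledgement after {max_attempts} attempts")
```

A fault injection test could catch `AckTimeout` and move on to verifying recovery, since a missing PUBACK is the expected outcome when the service resets the connection immediately.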

I'm working on fixing all test flavors but so far all connections seem to recover without any SDK changes.

@CIPop
Member

CIPop commented Oct 15, 2018

I was able to re-enable all Telemetry send/receive fault recovery tests in our CI system, but not the Command and Twin tests.
Since I'm seeing some random failures when these are enabled, I expect the same failures to occur with Edge. We are still investigating these with the highest priority, and we've also decided to block further PRs until all our E2E fault injection tests are enabled, to ensure no further regressions.

@sebader
Member

sebader commented Dec 5, 2018

Is there any update on this? I have a customer stating they are running into this, and it sounds here like it is still expected to happen on Edge.

@jesbrd

jesbrd commented Dec 6, 2018

My team is also being blocked by this issue. Updates are appreciated.

varunpuranik pushed a commit to varunpuranik/azure-iot-sdk-csharp that referenced this issue Dec 13, 2018
…ure#558)

Add functionality along with tests in the Edge Hub to obtain the trust bundle either from the iotedged or an input file to facilitate development. The trust bundle is not really used here rather this is being staged for additional upcoming features.
@sebader
Member

sebader commented Jan 10, 2019

@CIPop should this issue now be fully fixed with the referenced PR - and Microsoft.Azure.Devices.Client 1.19.0?

@prmathur-microsoft
Member

Can you give it a try and let us know if you still see issues?

@sebader
Member

sebader commented Jan 10, 2019

I've already updated and am testing this. I just wanted to check if this is actually something that should have been fully solved - since the PR mentions that more fixes/changes are to come.

@prmathur-microsoft
Member

Our current release includes a large set of fixes. But yes, we still have a few more coming soon for AMQP. We are working on redesigning AMQP to address a lot of issues on that front as well.

In the meantime, can you let us know whether our 1.19.0 release resolved your issues or whether you still see the problem?

@sebader
Member

sebader commented Jan 31, 2019

I finally had time to test this, and it does not look to me as if this has been fixed for MQTT.

C# SDK: 1.19.0
EdgeHub: 1.0.6-rc1

My setup:

  • Two instances of the same custom module. One connected over AMQP, the other over MQTT to EdgeHub
  • The module has a ConnectionStatusHandler method that logs when the connection status changes (lost, reconnected, retry expired, etc.)

When all is running, I manually restart the edgeHub container (iotedge restart edgeHub). Both modules detect the connection change.
After the edgeHub is up again, the AMQP module quickly reconnects.

2019-01-31 12:07:37.011 +00:00 [INF] - Edge Hub module client initialized using Amqp_Tcp_Only
2019-01-31 12:09:02.185 +00:00 [INF] - Module connection changed. New status=Disconnected_Retrying Reason=Communication_Error
2019-01-31 12:09:47.930 +00:00 [INF] - Module connection changed. New status=Connected Reason=Connection_Ok

The MQTT module, however, does not reconnect. It does run into the Retry-Expired timeout:

2019-01-31 12:07:32.679 +00:00 [INF] - Edge Hub module client initialized using Mqtt_Tcp_Only
2019-01-31 12:09:12.015 +00:00 [INF] - Module connection changed. New status=Disconnected_Retrying Reason=Communication_Error
2019-01-31 12:13:12.032 +00:00 [INF] - Module connection changed. New status=Disconnected Reason=Retry_Expired
2019-01-31 12:13:12.032 +00:00 [ERR] - Connection can not be re-established. Exiting module

To work around this, I am exiting the module when the retry_expired happens. Then the edgeAgent restarts the module and it connects fine again.
So: Does not look resolved to me.
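
The workaround described above amounts to a status handler that exits the process once retries have expired, so that the edgeAgent restarts the module. The sketch below simulates that logic in Python and is not the C# module code from this comment: the enum members mirror the status and reason strings in the logs above, while `should_exit`, `on_connection_status_changed`, and the `exit_fn` parameter are hypothetical names.

```python
import enum
import sys
from typing import Callable


class ConnectionStatus(enum.Enum):
    CONNECTED = "Connected"
    DISCONNECTED_RETRYING = "Disconnected_Retrying"
    DISCONNECTED = "Disconnected"


class ConnectionStatusChangeReason(enum.Enum):
    CONNECTION_OK = "Connection_Ok"
    COMMUNICATION_ERROR = "Communication_Error"
    RETRY_EXPIRED = "Retry_Expired"


def should_exit(status: ConnectionStatus, reason: ConnectionStatusChangeReason) -> bool:
    """Exit only when the client has given up retrying."""
    return (
        status is ConnectionStatus.DISCONNECTED
        and reason is ConnectionStatusChangeReason.RETRY_EXPIRED
    )


def on_connection_status_changed(
    status: ConnectionStatus,
    reason: ConnectionStatusChangeReason,
    exit_fn: Callable[[int], None] = sys.exit,
) -> None:
    """Log every status change; terminate the process once retries have expired."""
    print(f"Module connection changed. New status={status.value} Reason={reason.value}")
    if should_exit(status, reason):
        print("Connection can not be re-established. Exiting module")
        # A non-zero exit code signals failure to whatever supervises the process.
        exit_fn(1)
```

This only helps if something restarts the process after it exits, which is the role the edgeAgent plays in the setup described here.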

@prmathur-microsoft prmathur-microsoft assigned ewertons and unassigned CIPop Jul 10, 2019
@sharmasejal sharmasejal added the IoTSDK Tracks all IoT SDK issues across the board label Mar 19, 2020
@sharmasejal sharmasejal added IoTSDK Tracks all IoT SDK issues across the board and removed IoTSDK Tracks all IoT SDK issues across the board labels Jun 3, 2020
@abhipsaMisra
Member

@sebader Would you be able to share the SDK logs for the scenario where reconnection does not succeed over MQTT?

@vinagesh vinagesh added the investigation-required Requires further investigation to root cause this. label Sep 4, 2020
@vinagesh vinagesh self-assigned this Dec 3, 2020
@vinagesh
Member

vinagesh commented Dec 7, 2020

I tried this and cannot repro it anymore. Closing due to inactivity. Please reopen if you see any issues.

@vinagesh vinagesh closed this as completed Dec 7, 2020