[Bug Report] Windows ModuleClient: Cannot connect to EdgeHub #2223
Comments
I thought we could mitigate this SDK issue in our tests, but it seems worse than previously thought: Azure/azure-iot-sdk-csharp#2223

We thought ModuleClient would recover on subsequent reconnects, but it never recovers. I have tried fiddling with the delays, removing the explicit `OpenAsync`, allowing more reconnect attempts, etc. Nothing seems to stabilize this flakiness. I didn't resort to changing the protocol as I believe our tests should use the default (and I believe most stable) Amqp_Tcp_Only.

I feel like the loose thread here for further investigation is the fact that our Longhaul Tests don't suffer from this issue. So it seems like something specific to the TempSensor module might be at play.

For this PR, I have disabled the TempSensor test on Windows with a comment to re-enable when the SDK issue is fixed. For the Windows 10 minimal suite, I switched to using the EdgeAgent ping test.

Waiting on this test run before merge: https://dev.azure.com/msazure/One/_build/results?buildId=48491714&view=results

5 E2E runs completed successfully

## Azure IoT Edge PR checklist:

This checklist is used to make sure that common guidelines for a pull request are followed.

### General Guidelines and Best Practices
- [x] I have read the [contribution guidelines](https://github.com/azure/iotedge#contributing).
- [x] Title of the pull request is clear and informative.
- [x] Description of the pull request includes a concise summary of the enhancement or bug fix.

### Testing Guidelines
- [x] Pull request includes test coverage for the included changes.
- Description of the pull request includes
  - [x] concise summary of tests added/modified
  - [x] local testing done.

### Draft PRs
- Open the PR in `Draft` mode if it is:
  - Work in progress or not intended to be merged.
  - Encountering multiple pipeline failures and working on fixes.

_Note: We use the kodiakhq bot to merge PRs once the necessary checks and approvals are in place. When it merges a PR, kodiakhq converts the PR title to the commit title, PR description to the commit description, and squashes all the commits in the PR to a single commit. The net effect is that entire PR becomes a single commit. Please follow the best practices mentioned [here](https://chris.beams.io/posts/git-commit/#:~:text=The%20seven%20rules%20of%20a%20great%20Git%20commit,what%20and%20why%20vs.%20how%20For%20example%3A%20) for the PR title and description_
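The mitigation attempts mentioned in the PR text above (adding delays and allowing more reconnect attempts around `OpenAsync`) might look roughly like the following. This is only a sketch under assumptions, not the code from the PR or the TempSensor module: `ConnectRetry` and `OpenWithRetryAsync` are hypothetical names, and the `open` delegate stands in for a call such as `() => moduleClient.OpenAsync()`.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper for illustration; not part of the SDK or the iotedge repo.
public static class ConnectRetry
{
    // Calls `open` up to maxAttempts times, waiting `delay` after each failure.
    // Returns the 1-based attempt number that succeeded; if every attempt
    // fails, the exception from the final attempt propagates to the caller.
    public static async Task<int> OpenWithRetryAsync(
        Func<Task> open, int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                await open();
                return attempt;
            }
            catch (Exception ex) when (attempt < maxAttempts)
            {
                // Log and back off; the filter lets the last failure escape.
                Console.WriteLine($"Connect attempt {attempt} failed: {ex.Message}");
                await Task.Delay(delay);
            }
        }
    }
}
```

As the PR text notes, retries of this kind did not stabilize the test, so a wrapper like this is mainly useful for logging how many attempts fail before giving up.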
It appears you are using an older LTS version of the device client. We have since published another LTS version that you may be interested in using. That said, it is difficult to investigate this issue without SDK logs. It would be great if you could share the SDK logs captured when the connection issue happens.
Based on the ETL logs you provided, it looks like the connection doesn't open because the remote server is actively refusing it. We can sync offline to reach out to the service team and try to figure out the issue from their side.
@and-rewsmith
Context
<PackageReference Include="Microsoft.Azure.Devices.Client" Version="1.28.2" />
Description of the issue
I am a member of the IoT Edge team. On Windows only, we have observed that when the device SDK tries to connect to EdgeHub, the ModuleClient sometimes times out. In many cases the ModuleClient continues timing out and never successfully connects.
Code sample exhibiting the issue
Here is the module we are using to connect to EdgeHub. It is meant to be the canonical, basic example.
https://github.com/Azure/iotedge/tree/master/edge-modules/SimulatedTemperatureSensor
Here is one of our E2E tests where we see this issue. Notice how the module times out after EdgeHub is up and running. Subsequent attempts to connect also fail. Timeline:
https://github.com/Azure/iotedge/blob/27bdf7d9afedf1f997e2fbc262b52efbbbbcefc0/edge-modules/SimulatedTemperatureSensor/src/Program.cs#L295
Test run exhibiting this behavior with logs:
https://dev.azure.com/msazure/One/_build/results?buildId=48134614&view=ms.vss-test-web.build-test-results-tab&runId=940061006&resultId=100000&paneView=attachments
~~This reproduces 1/7 times iirc. Our theory is that if the ModuleClient tries to connect before EdgeHub has opened its ports then this will happen. When we added a wait to our module that waited enough time for EdgeHub to be guaranteed up and running at the time of `ModuleClient.OpenAsync()`, these problems disappeared.~~

The above strikethrough actually isn't the case. Even if EdgeHub is up and running, the `ModuleClient` will sometimes fail repeatedly to connect. And it happens ~40% of the time. This test run shows this happening: https://msazure.visualstudio.com/One/_build/results?buildId=48370188&view=ms.vss-test-web.build-test-results-tab&runId=945793648&resultId=100007&paneView=debug
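To tell apart "EdgeHub has not opened its ports yet" from "the port is open but the SDK still fails to connect", a plain TCP probe run before `ModuleClient.OpenAsync()` could help. The following is a minimal sketch under assumptions: `EdgeHubProbe` and `WaitForPortAsync` are hypothetical names, and the host and port to probe (for example the module's gateway hostname and the AMQP-over-TLS port 5671) are assumptions that depend on the deployment.

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical diagnostic helper; not part of the SDK or the iotedge repo.
public static class EdgeHubProbe
{
    // Polls host:port until a TCP connection succeeds or `timeout` elapses.
    // Returns true if something accepted the connection, false otherwise.
    // The deadline is only checked between attempts, so a single slow
    // connect attempt can overrun it.
    public static async Task<bool> WaitForPortAsync(
        string host, int port, TimeSpan timeout, TimeSpan pollInterval)
    {
        DateTime deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            try
            {
                using (var client = new TcpClient())
                {
                    await client.ConnectAsync(host, port);
                    return true; // A listener accepted the TCP handshake.
                }
            }
            catch (SocketException)
            {
                // Refused or unreachable: wait, then probe again.
                await Task.Delay(pollInterval);
            }
        }

        return false;
    }
}
```

If the probe returns true and `OpenAsync` still times out, that would be consistent with the observation above that the failure is not simply a startup race. A successful probe only shows that the TCP port accepts connections; it says nothing about TLS or protocol-level readiness.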
This also seems to reproduce with all protocols, not just Amqp_Tcp_Only.
I am available for assistance if you want to chat about this. My Microsoft alias is andsmi.
Console log of the issue
Follow the instructions here to capture SDK logs.
Don't forget to remove any connection string information!