Close AMQP connection explicitly when no more links #4914

vipeller · 2021-04-28T20:50:21Z

This fixes the following problem:

In certain situations EdgeHub can decide to close the connection of a device (or module). Such situations are when a SAS token expires or the device gets disabled on the azure portal, for example.
The current solution to close the device when it uses Amqp is to close the links in the Amqp session. The problem is what the underlying transport channel remains open in this case and the device is able to reopen the channel.
When EdgeHub closes a client that way it also sets its representing device objects (the related instance of classes like DeviceProxy) as closed. After that moment for certain operations, like sending M2M messages, EdgeHub checks if a module has an active DeviceProxy, and if does not, it does not send out the message. Interestingly, other operations (like transmitting twin results) does not do this check and the operation goes through, as the underlying link can be reopened.
Also, when the module (or device) does an operation (e.g. sending a telemetry message) that can go through, even if the device proxy is closed.

Because of this bug, the following can happen:

The module works properly for a while
The token is near to expire, so the module sends a new one. Let's say that the processing of the token has problems (in real life scenario, talking to edged failed for some reason), so EdgeHub chooses to close the connection.
EdgeHub closes all the Amqp links, and sets the device proxy as closed/
The module wants to send a telemetry message, it still have the underlying transport, so it opens a new link for telemetry, and sends the message.
EdgeHub processes the message and everything seems working
An incoming message comes from somewhere which is routed to the module we are talking about.
EdgeHub finds the module and sees that the device proxy is closed, so it does not send the message.

The message does not get lost, but will not be sent out until the module does not get restarted. Also if there is a time-to-leave value (or some other limits), the message will be dropped after a while.

This fix changes the logic from closing the links to close the entire connection. As a result the client (device or module) cannot reopen links but needs to create a new connection, which results creating a new DeviceProxy with an appropriate state (=opened), so M2M messages (and other device bound operations checking the state) can go through.

…ept tcp level connection)

vadim-kovalyov · 2021-04-28T22:54:56Z

edge-hub/core/src/Microsoft.Azure.Devices.Edge.Hub.Amqp/ClientConnectionHandler.cs

@@ -21,18 +22,21 @@ namespace Microsoft.Azure.Devices.Edge.Hub.Amqp
    /// </summary>
    class ClientConnectionHandler : IConnectionHandler
    {
+        readonly TimeSpan closeTimeout = TimeSpan.FromSeconds(10);


I know still draft, sorry for jumping the gun... but how this number was picked? maybe consider using "CloudOperationTimeout" setting here,

I've checked the option to inject the timeout as a parameter, but it would go through several objects from the point of dependency registration and most of the time I had the feeling as "it does not belong there". Also, CloudOperationTimeout regards to cloud operations while this one is about talking to devices, so then the best would be to create its own (or a more general) timeout value.
Instead what I did, I checked in the amqp source code what it does and I saw that in the newer version there is a CloseAsync() with no timeout, which uses a hardcoded 60 sec timeout value. So I brought that value here and using that. When the time passes, the code throws a timeout exception

…r amqp api for CloseAsync() without timeout

vadim-kovalyov

Can you please add PR description?

…ept tcp level connection) (#4984) cherry pick of #4914

…ept tcp level connection) (Azure#4984) cherry pick of Azure#4914

This fixes the following problem: In certain situations EdgeHub can decide to close the connection of a device (or module). Such situations are when a SAS token expires or the device gets disabled on the azure portal, for example. The current solution to close the device when it uses Amqp is to close the links in the Amqp session. The problem is what the underlying transport channel remains open in this case and the device is able to reopen the channel. When EdgeHub closes a client that way it also sets its representing device objects (the related instance of classes like DeviceProxy) as closed. After that moment for certain operations, like sending M2M messages, EdgeHub checks if a module has an active DeviceProxy, and if does not, it does not send out the message. Interestingly, other operations (like transmitting twin results) does not do this check and the operation goes through, as the underlying link can be reopened. Also, when the module (or device) does an operation (e.g. sending a telemetry message) that can go through, even if the device proxy is closed. Because of this bug, the following can happen: 1) The module works properly for a while 2) The token is near to expire, so the module sends a new one. Let's say that the processing of the token has problems (in real life scenario, talking to edged failed for some reason), so EdgeHub chooses to close the connection. 3) EdgeHub closes all the Amqp links, and sets the device proxy as closed/ 4) The module wants to send a telemetry message, it still have the underlying transport, so it opens a new link for telemetry, and sends the message. 5) EdgeHub processes the message and everything seems working 6) An incoming message comes from somewhere which is routed to the module we are talking about. 7) EdgeHub finds the module and sees that the device proxy is closed, so it does not send the message. The message does not get lost, but will not be sent out until the module does not get restarted. Also if there is a time-to-leave value (or some other limits), the message will be dropped after a while. This fix changes the logic from closing the links to close the entire connection. As a result the client (device or module) cannot reopen links but needs to create a new connection, which results creating a new DeviceProxy with an appropriate state (=opened), so M2M messages (and other device bound operations checking the state) can go through.

vipeller added 2 commits April 28, 2021 13:48

Close AMQP connection explicitly when no more links (removing links k…

83d36bc

…ept tcp level connection)

ca rule fixes

0e533fb

vadim-kovalyov reviewed Apr 28, 2021

View reviewed changes

vipeller added 2 commits May 7, 2021 16:11

updated close timeout to 60sec as that is the default timeout in newe…

79131ad

…r amqp api for CloseAsync() without timeout

Merge branch 'master' into vipeller/amqp_close

728b538

vipeller marked this pull request as ready for review May 7, 2021 23:16

vipeller requested review from varunpuranik and vadim-kovalyov May 7, 2021 23:16

vadim-kovalyov reviewed May 8, 2021

View reviewed changes

vipeller requested a review from vadim-kovalyov May 10, 2021 16:21

vadim-kovalyov approved these changes May 10, 2021

View reviewed changes

vipeller added the ready-to-merge label May 10, 2021

Merge branch 'master' into vipeller/amqp_close

37ac2ab

kodiakhq bot merged commit 6c8134e into Azure:master May 10, 2021

vipeller added a commit to vipeller/iotedge that referenced this pull request May 14, 2021

cherry pick of Azure#4914

9752964

vipeller mentioned this pull request May 14, 2021

Close AMQP connection explicitly when no more links (removing links kept tcp level connection) #4984

Merged

kodiakhq bot pushed a commit that referenced this pull request May 14, 2021

Close AMQP connection explicitly when no more links (removing links k…

5be30fb

…ept tcp level connection) (#4984) cherry pick of #4914

darobs pushed a commit to darobs/iotedge that referenced this pull request May 26, 2021

Close AMQP connection explicitly when no more links (removing links k…

a9ffac0

…ept tcp level connection) (Azure#4984) cherry pick of Azure#4914

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Close AMQP connection explicitly when no more links #4914

Close AMQP connection explicitly when no more links #4914

vipeller commented Apr 28, 2021 •

edited

vadim-kovalyov Apr 28, 2021

vipeller May 7, 2021

vadim-kovalyov left a comment

Close AMQP connection explicitly when no more links #4914

Close AMQP connection explicitly when no more links #4914

Conversation

vipeller commented Apr 28, 2021 • edited

vadim-kovalyov Apr 28, 2021

Choose a reason for hiding this comment

vipeller May 7, 2021

Choose a reason for hiding this comment

vadim-kovalyov left a comment

Choose a reason for hiding this comment

vipeller commented Apr 28, 2021 •

edited