Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Close AMQP connection explicitly when no more links #4914

Merged
merged 5 commits into from
May 10, 2021

Conversation

vipeller
Copy link
Contributor

@vipeller vipeller commented Apr 28, 2021

This fixes the following problem:

In certain situations EdgeHub can decide to close the connection of a device (or module). Such situations are when a SAS token expires or the device gets disabled on the azure portal, for example.
The current solution to close the device when it uses Amqp is to close the links in the Amqp session. The problem is what the underlying transport channel remains open in this case and the device is able to reopen the channel.
When EdgeHub closes a client that way it also sets its representing device objects (the related instance of classes like DeviceProxy) as closed. After that moment for certain operations, like sending M2M messages, EdgeHub checks if a module has an active DeviceProxy, and if does not, it does not send out the message. Interestingly, other operations (like transmitting twin results) does not do this check and the operation goes through, as the underlying link can be reopened.
Also, when the module (or device) does an operation (e.g. sending a telemetry message) that can go through, even if the device proxy is closed.

Because of this bug, the following can happen:

  1. The module works properly for a while
  2. The token is near to expire, so the module sends a new one. Let's say that the processing of the token has problems (in real life scenario, talking to edged failed for some reason), so EdgeHub chooses to close the connection.
  3. EdgeHub closes all the Amqp links, and sets the device proxy as closed/
  4. The module wants to send a telemetry message, it still have the underlying transport, so it opens a new link for telemetry, and sends the message.
  5. EdgeHub processes the message and everything seems working
  6. An incoming message comes from somewhere which is routed to the module we are talking about.
  7. EdgeHub finds the module and sees that the device proxy is closed, so it does not send the message.

The message does not get lost, but will not be sent out until the module does not get restarted. Also if there is a time-to-leave value (or some other limits), the message will be dropped after a while.

This fix changes the logic from closing the links to close the entire connection. As a result the client (device or module) cannot reopen links but needs to create a new connection, which results creating a new DeviceProxy with an appropriate state (=opened), so M2M messages (and other device bound operations checking the state) can go through.

@@ -21,18 +22,21 @@ namespace Microsoft.Azure.Devices.Edge.Hub.Amqp
/// </summary>
class ClientConnectionHandler : IConnectionHandler
{
readonly TimeSpan closeTimeout = TimeSpan.FromSeconds(10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know still draft, sorry for jumping the gun... but how this number was picked? maybe consider using "CloudOperationTimeout" setting here,

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked the option to inject the timeout as a parameter, but it would go through several objects from the point of dependency registration and most of the time I had the feeling as "it does not belong there". Also, CloudOperationTimeout regards to cloud operations while this one is about talking to devices, so then the best would be to create its own (or a more general) timeout value.
Instead what I did, I checked in the amqp source code what it does and I saw that in the newer version there is a CloseAsync() with no timeout, which uses a hardcoded 60 sec timeout value. So I brought that value here and using that. When the time passes, the code throws a timeout exception

@vipeller vipeller marked this pull request as ready for review May 7, 2021 23:16
Copy link
Contributor

@vadim-kovalyov vadim-kovalyov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add PR description?

@kodiakhq kodiakhq bot merged commit 6c8134e into Azure:master May 10, 2021
vipeller added a commit to vipeller/iotedge that referenced this pull request May 14, 2021
kodiakhq bot pushed a commit that referenced this pull request May 14, 2021
darobs pushed a commit to darobs/iotedge that referenced this pull request May 26, 2021
ggjjj pushed a commit to ggjjj/iotedge that referenced this pull request Jul 22, 2021
This fixes the following problem:

In certain situations EdgeHub can decide to close the connection of a device (or module). Such situations are when a SAS token expires or the device gets disabled on the azure portal, for example.
The current solution to close the device when it uses Amqp is to close the links in the Amqp session. The problem is what the underlying transport channel remains open in this case and the device is able to reopen the channel.
When EdgeHub closes a client that way it also sets its representing device objects (the related instance of classes like DeviceProxy) as closed. After that moment for certain operations, like sending M2M messages, EdgeHub checks if a module has an active DeviceProxy, and if does not, it does not send out the message. Interestingly, other operations (like transmitting twin results) does not do this check and the operation goes through, as the underlying link can be reopened.
Also, when the module (or device) does an operation (e.g. sending a telemetry message) that can go through, even if the device proxy is closed.

Because of this bug, the following can happen:
1) The module works properly for a while
2) The token is near to expire, so the module sends a new one. Let's say that the processing of the token has problems (in real life scenario, talking to edged failed for some reason), so EdgeHub chooses to close the connection.
3) EdgeHub closes all the Amqp links, and sets the device proxy as closed/
4) The module wants to send a telemetry message, it still have the underlying transport, so it opens a new link for telemetry, and sends the message.
5) EdgeHub processes the message and everything seems working
6) An incoming message comes from somewhere which is routed to the module we are talking about. 
7) EdgeHub finds the module and sees that the device proxy is closed, so it does not send the message.

The message does not get lost, but will not be sent out until the module does not get restarted. Also if there is a time-to-leave value (or some other limits), the message will be dropped after a while.

This fix changes the logic from closing the links to close the entire connection. As a result the client (device or module) cannot reopen links but needs to create a new connection, which results creating a new DeviceProxy with an appropriate state (=opened), so M2M messages (and other device bound operations checking the state) can go through.
damonbarry pushed a commit to damonbarry/iotedge that referenced this pull request Apr 15, 2022
This fixes the following problem:

In certain situations EdgeHub can decide to close the connection of a device (or module). Such situations are when a SAS token expires or the device gets disabled on the azure portal, for example.
The current solution to close the device when it uses Amqp is to close the links in the Amqp session. The problem is what the underlying transport channel remains open in this case and the device is able to reopen the channel.
When EdgeHub closes a client that way it also sets its representing device objects (the related instance of classes like DeviceProxy) as closed. After that moment for certain operations, like sending M2M messages, EdgeHub checks if a module has an active DeviceProxy, and if does not, it does not send out the message. Interestingly, other operations (like transmitting twin results) does not do this check and the operation goes through, as the underlying link can be reopened.
Also, when the module (or device) does an operation (e.g. sending a telemetry message) that can go through, even if the device proxy is closed.

Because of this bug, the following can happen:
1) The module works properly for a while
2) The token is near to expire, so the module sends a new one. Let's say that the processing of the token has problems (in real life scenario, talking to edged failed for some reason), so EdgeHub chooses to close the connection.
3) EdgeHub closes all the Amqp links, and sets the device proxy as closed/
4) The module wants to send a telemetry message, it still have the underlying transport, so it opens a new link for telemetry, and sends the message.
5) EdgeHub processes the message and everything seems working
6) An incoming message comes from somewhere which is routed to the module we are talking about. 
7) EdgeHub finds the module and sees that the device proxy is closed, so it does not send the message.

The message does not get lost, but will not be sent out until the module does not get restarted. Also if there is a time-to-leave value (or some other limits), the message will be dropped after a while.

This fix changes the logic from closing the links to close the entire connection. As a result the client (device or module) cannot reopen links but needs to create a new connection, which results creating a new DeviceProxy with an appropriate state (=opened), so M2M messages (and other device bound operations checking the state) can go through.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants