Health check endpoint continues to report Healthy for Azure Service Bus when network connection drops #2804

mihailmacarie · 2021-08-27T05:50:56Z

Is this a bug report?

Yes

Can you also reproduce the problem with the latest version?

Yes (7.2.2)

Environment

1. .NET Core 3.1
2. Windows 10
3. Visual Studio

Steps to Reproduce

1. Configure MassTransit with Azure Service Bus (AutoStart=true and receiving endpoints)
2. Configure health check endpoint
3. Start your app and check that the health check endpoint returns status Healthy and description "ready" for the bus endpoints
4. Stop network connection (for example, if on a WiFi connection, turn on Airplane mode)
5. Check that the health check endpoint returns status Unhealthy for the bus endpoints with exception detail and for the bus itself.
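
For reference, a minimal sketch of the kind of setup the steps above describe, assuming MassTransit 7.x with the MassTransit.Azure.ServiceBus.Core and MassTransit.AspNetCore packages on ASP.NET Core 3.1. The `Ping` message, `PingConsumer`, the `/health/ready` route and the connection string placeholder are hypothetical names used only for illustration, not taken from the affected application. The response writer that produces the JSON output is not shown.

```csharp
using System.Threading.Tasks;
using MassTransit;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.DependencyInjection;

namespace HealthCheckRepro
{
    // Hypothetical message and consumer, present only so that a receive endpoint exists.
    public class Ping { }

    public class PingConsumer : IConsumer<Ping>
    {
        public Task Consume(ConsumeContext<Ping> context) => Task.CompletedTask;
    }

    public class Startup
    {
        // Placeholder: substitute a real Azure Service Bus connection string.
        const string ConnectionString = "Endpoint=sb://{omitted}";

        public void ConfigureServices(IServiceCollection services)
        {
            services.AddHealthChecks();

            services.AddMassTransit(x =>
            {
                x.AddConsumer<PingConsumer>();

                x.UsingAzureServiceBus((context, cfg) =>
                {
                    cfg.Host(ConnectionString);
                    cfg.AutoStart = true;            // step 1: AutoStart=true
                    cfg.ConfigureEndpoints(context); // step 1: receive endpoints
                });
            });

            // Starts the bus with the host and registers the MassTransit health checks.
            services.AddMassTransitHostedService();
        }

        public void Configure(IApplicationBuilder app)
        {
            app.UseRouting();
            app.UseEndpoints(endpoints =>
            {
                // Step 2: expose only the checks tagged "ready".
                endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
                {
                    Predicate = check => check.Tags.Contains("ready")
                });
            });
        }
    }
}
```

With a setup along these lines, steps 3 to 5 amount to requesting `/health/ready` before and after dropping the network connection and comparing the reported status.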

Expected Behavior

At step 4 the health check endpoint returns status Unhealthy for the bus endpoints with exception detail and for the bus itself.
Exception details are included in the response if available.

Actual Behavior

At step 4 the health check endpoint result is unchanged though it is obvious that it can't connect to the bus anymore (no network connection).
The health check endpoint continues to return status Healthy for the bus endpoints and for the bus itself.

Reproducible Demo

I'll try to provide one if needed.

phatboyg · 2021-10-06T17:26:08Z

This might be related to the way ASB reports transient exceptions, in that it doesn't actually fault the transport. I'm not entirely sure, though; some detailed debug logs would be helpful (despite most transient errors being suppressed to avoid noisy logs).

mihailmacarie · 2021-10-11T05:02:55Z

After the network connection drops, MassTransit logs only the following type of error, at Debug level:

Exception on Receiver sb://{omitted} during "Receive" 
ActiveDispatchCount(0) 
ErrorRequiresRecycle(False)

Microsoft.Azure.ServiceBus.ServiceBusCommunicationException: No such host is known. 
ErrorCode: HostNotFound   
---> 
System.Net.Sockets.SocketException (11001): No such host is known.     
at Microsoft.Azure.ServiceBus.ServiceBusConnection.CreateConnectionAsync(TimeSpan timeout)   
at Microsoft.Azure.Amqp.FaultTolerantAmqpObject`1.OnCreateAsync(TimeSpan timeout)     
at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout)   
at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout)  
at Microsoft.Azure.ServiceBus.Amqp.AmqpLinkCreator.CreateAndOpenAmqpLinkAsync()  
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.CreateLinkAsync(TimeSpan timeout)  
at Microsoft.Azure.Amqp.FaultTolerantAmqpObject`1.OnCreateAsync(TimeSpan timeout)   
at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout)    
at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout)  
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan serverWaitTime)     
--- End of inner exception stack trace ---
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan serverWaitTime)
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.<>c__DisplayClass65_0.<<ReceiveAsync>b__0>d.MoveNext()  
--- End of stack trace from previous location where exception was thrown ---     
at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)     
at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)     
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.ReceiveAsync(Int32 maxMessageCount, TimeSpan operationTimeout)     
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.ReceiveAsync(TimeSpan operationTimeout)     
at Microsoft.Azure.ServiceBus.MessageReceivePump.<MessagePumpTaskAsync>b__12_0()

The ready health endpoint continues to return Healthy as below for both the bus and endpoint:

"masstransit-bus": {
	"data": {
		"Endpoints": {
			"sb://{omitted}": {
				"status": "Healthy",
				"description": "ready"
			}
		}
	},
	"description": "Ready",
	"duration": "00:00:00.0017363",
	"status": "Healthy",
	"tags": [
		"ready",
		"masstransit"
	]
}

phatboyg · 2021-10-11T16:21:05Z

@cmeeren Do you have any thoughts on this one? By no longer recycling the receiver when a "transient" error occurs, the health checks are misleading.

cmeeren · 2021-10-11T18:07:58Z

I don't use health checks, so I have no opinions on that particular aspect. Of course, I'd prefer any rectification to not introduce regressions for fixed issues I have previously raised in this repo. "Recycling the receiver" etc. seems to me like an implementation detail. I don't know anything about that, just that I don't want to be spammed with warnings if MT correctly and successfully handles reconnection etc.

phatboyg · 2021-10-11T18:12:36Z

Fair enough, I think in the scenario that a communication exception occurs, I'll recycle but not log it as an error (debug only).

phatboyg · 2021-10-11T18:43:06Z

Let me know if the new develop package with this fix resolves the issue.

phatboyg · 2021-10-14T14:05:29Z

I'm closing this; feel free to comment if you find the problem still exists.

phatboyg · 2021-11-02T14:00:24Z

Apparently this is still an issue, and with the SDK change for v7.3.0, I should try to reproduce it and fix it if possible.

phatboyg · 2021-11-04T12:53:50Z

I put one more check into the receiver to try to recycle on failure, so that the unhealthy state is detected; surely the log should show something now. I still need to verify this eventually, using some approach.

oguzhankahyaoglu · 2022-01-24T13:20:13Z

The same issue applies to me (as I've reported before, too).
There are messages in the queue that are not being processed, there are no relevant logs from MT (log level reduced to debug), and the health checks report Healthy, which is misleading.
When we restart the application, everything works fine.
Is there a way to check them via a custom health check mechanism? We will probably remove MT from all our systems and revert to the native Azure Service Bus libraries in the near future.

phatboyg · 2022-01-24T13:23:44Z

@oguzhankahyaoglu are you running 7.3.1 and seeing the same result? The Azure SDK has a lot of transient error handling features, and it doesn't immediately surface the unavailability to applications. That said, there may be some way to pick up the loss of connection and report the unhealthy status, but I haven't had time to look at it yet.

oguzhankahyaoglu · 2022-01-24T14:13:08Z

We were on 7.3.0 for approximately the last 10 days and upgraded to 7.3.1 today; we will definitely watch whether everything works fine in production.
Is there a way to health-check whether all queue consumers are alive and healthy? We previously tried the extra health check mechanism below, and it's still not working as expected (it just tries to ensure that a healthy connection can be established using the GetSendEndpoint method, AFAIK):

public async Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context,
    CancellationToken cancellationToken = default)
{
    // Ask MassTransit for its own view of the bus state.
    var busHealth = _busControl.CheckHealth();
    Log.Debug("Bus health:" + busHealth.Status);

    // Resolve a send endpoint for each queue; an exception here propagates out of the check.
    foreach (var queueName in QueueNames)
    {
        Log.Debug($"Health checking {queueName}");
        await _busControl.GetSendEndpoint(new Uri("queue:" + queueName));
        Log.Debug("Health check ok");
    }

    // Same for each topic.
    foreach (var topicName in TopicNames)
    {
        Log.Debug($"Health checking {topicName}");
        await _busControl.GetSendEndpoint(new Uri("topic:" + topicName));
        Log.Debug("Health check ok");
    }

    // Report Unhealthy whenever the bus itself is Degraded or Unhealthy.
    switch (busHealth.Status)
    {
        case BusHealthStatus.Degraded:
        case BusHealthStatus.Unhealthy:
            return HealthCheckResult.Unhealthy();
    }

    return HealthCheckResult.Healthy();
}

phatboyg · 2022-01-24T14:27:29Z

Well, that's what the receive transport does: if the consumer is disconnected from Azure, it should go into an unhealthy state. But in the event of transient issues, it doesn't immediately detect it. I'm not sure why, and as I stated above, I haven't been able to set up a scenario where I disconnect from the network to see it break.

phatboyg · 2022-02-06T15:09:07Z

I researched this, and the Azure SDK just never seems to report a connection failure, as it retries under the hood regardless of the retry policy applied. So, at this point, I can't think of anything to do and will close this issue.

phatboyg added a commit that referenced this issue Oct 11, 2021

Issue #2804 - communication exceptions not recycling the receive transport, isTransient isn't really "correct" in this case.

b13ab48

phatboyg closed this as completed Oct 14, 2021

phatboyg reopened this Nov 2, 2021

phatboyg closed this as completed Feb 6, 2022
