Health check endpoint continues to report Healthy for Azure Service Bus when the network connection drops #2804

Closed
mihailmacarie opened this issue Aug 27, 2021 · 14 comments

Comments

@mihailmacarie

mihailmacarie commented Aug 27, 2021

Is this a bug report?

Yes

Can you also reproduce the problem with the latest version?

Yes (7.2.2)

Environment

  1. .NET Core 3.1
  2. Windows 10
  3. Visual Studio

Steps to Reproduce

  1. Configure MassTransit with Azure Service Bus (AutoStart=true and receiving endpoints)
  2. Configure health check endpoint
  3. Start your app and check that the health check endpoint returns status Healthy and description "ready" on bus endpoints
  4. Stop network connection (for example if on a WiFi connection turn on Airplane mode)
  5. Check that the health check endpoint returns status Unhealthy for the bus endpoints with exception detail and for the bus itself.
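
For readers trying to reproduce this, steps 1 and 2 might look roughly like the sketch below. It was not part of the original report. It assumes the MassTransit v7 packages (MassTransit.Azure.ServiceBus.Core and MassTransit.AspNetCore); the `Ping` message, `PingConsumer`, the "ServiceBus" connection string name, and the "/health/ready" path are hypothetical placeholders.

```csharp
using System.Threading.Tasks;
using MassTransit;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Diagnostics.HealthChecks;
using Microsoft.Extensions.Configuration;
using Microsoft.Extensions.DependencyInjection;

// Hypothetical message and consumer, only so that a receive endpoint exists.
public class Ping { }

public class PingConsumer : IConsumer<Ping>
{
    public Task Consume(ConsumeContext<Ping> context) => Task.CompletedTask;
}

public class Startup
{
    public Startup(IConfiguration configuration) => Configuration = configuration;

    public IConfiguration Configuration { get; }

    public void ConfigureServices(IServiceCollection services)
    {
        services.AddHealthChecks();

        services.AddMassTransit(x =>
        {
            x.AddConsumer<PingConsumer>();

            x.UsingAzureServiceBus((context, cfg) =>
            {
                // "ServiceBus" is an assumed connection string name.
                cfg.Host(Configuration.GetConnectionString("ServiceBus"));
                cfg.AutoStart = true;
                cfg.ConfigureEndpoints(context);
            });
        });

        // Starts the bus with the host and registers the MassTransit health checks.
        services.AddMassTransitHostedService();
    }

    public void Configure(IApplicationBuilder app)
    {
        app.UseRouting();
        app.UseEndpoints(endpoints =>
        {
            // Expose only the checks tagged "ready" on an assumed path.
            endpoints.MapHealthChecks("/health/ready", new HealthCheckOptions
            {
                Predicate = check => check.Tags.Contains("ready")
            });
        });
    }
}
```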

Expected Behavior

After step 4, the health check endpoint returns status Unhealthy for the bus endpoints with exception detail and for the bus itself.
Exception details are included in the response if available.

Actual Behavior

After step 4, the health check endpoint result is unchanged, even though the app obviously can't connect to the bus anymore (no network connection).
The health check endpoint continues to return status Healthy for the bus endpoints and for the bus itself.

Reproducible Demo

I'll try to provide one if needed.

@phatboyg
Member

phatboyg commented Oct 6, 2021

This might be related to the way ASB reports transient exceptions, in that it doesn't actually fault the transport. I'm not entirely sure, though; some detailed debug logs would be helpful (although most transient errors are suppressed to avoid noisy logs).

@mihailmacarie
Author

mihailmacarie commented Oct 11, 2021

Only errors of the following type are logged at Debug level by MassTransit after the network connection drops:

Exception on Receiver sb://{omitted} during "Receive" 
ActiveDispatchCount(0) 
ErrorRequiresRecycle(False)

Microsoft.Azure.ServiceBus.ServiceBusCommunicationException: No such host is known. 
ErrorCode: HostNotFound   
---> 
System.Net.Sockets.SocketException (11001): No such host is known.     
at Microsoft.Azure.ServiceBus.ServiceBusConnection.CreateConnectionAsync(TimeSpan timeout)   
at Microsoft.Azure.Amqp.FaultTolerantAmqpObject`1.OnCreateAsync(TimeSpan timeout)     
at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout)   
at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout)  
at Microsoft.Azure.ServiceBus.Amqp.AmqpLinkCreator.CreateAndOpenAmqpLinkAsync()  
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.CreateLinkAsync(TimeSpan timeout)  
at Microsoft.Azure.Amqp.FaultTolerantAmqpObject`1.OnCreateAsync(TimeSpan timeout)   
at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout)    
at Microsoft.Azure.Amqp.Singleton`1.GetOrCreateAsync(TimeSpan timeout)  
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan serverWaitTime)     
--- End of inner exception stack trace ---
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.OnReceiveAsync(Int32 maxMessageCount, TimeSpan serverWaitTime)
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.<>c__DisplayClass65_0.<<ReceiveAsync>b__0>d.MoveNext()  
--- End of stack trace from previous location where exception was thrown ---     
at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)     
at Microsoft.Azure.ServiceBus.RetryPolicy.RunOperation(Func`1 operation, TimeSpan operationTimeout)     
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.ReceiveAsync(Int32 maxMessageCount, TimeSpan operationTimeout)     
at Microsoft.Azure.ServiceBus.Core.MessageReceiver.ReceiveAsync(TimeSpan operationTimeout)     
at Microsoft.Azure.ServiceBus.MessageReceivePump.<MessagePumpTaskAsync>b__12_0()

The ready health endpoint continues to return Healthy as below for both the bus and endpoint:

"masstransit-bus": {
	"data": {
		"Endpoints": {
			"sb://{omitted}": {
				"status": "Healthy",
				"description": "ready"
			}
		}
	},
	"description": "Ready",
	"duration": "00:00:00.0017363",
	"status": "Healthy",
	"tags": [
		"ready",
		"masstransit"
	]
}

@phatboyg
Member

@cmeeren Do you have any thoughts on this one? By no longer recycling the receiver when a "transient" error occurs, the health checks are misleading.

@cmeeren
Contributor

cmeeren commented Oct 11, 2021

I don't use health checks, so I have no opinions on that particular aspect. Of course, I'd prefer any rectification to not introduce regressions for fixed issues I have previously raised in this repo. "Recycling the receiver" etc. seems to me like an implementation detail. I don't know anything about that, just that I don't want to be spammed with warnings if MT correctly and successfully handles reconnection etc.

@phatboyg
Member

Fair enough. I think that when a communication exception occurs, I'll recycle the receiver but not log it as an error (debug only).

phatboyg added a commit that referenced this issue Oct 11, 2021
…sport, isTransient isn't really "correct" in this case.
@phatboyg
Member

Let me know if the new develop package with this fix resolves the issue.

@phatboyg
Member

I'm closing this; feel free to comment if you find the problem still exists.

@phatboyg phatboyg reopened this Nov 2, 2021
@phatboyg
Member

phatboyg commented Nov 2, 2021

Apparently this is still an issue. With the SDK change for v7.3.0, I should try to reproduce it and fix it if possible.

@phatboyg
Member

phatboyg commented Nov 4, 2021

I put one more check into the receiver to try to recycle on failure so that the unhealthy state is detected; the log should surely show something now. I still need to find a way to verify this.

@oguzhankahyaoglu

The same issue applies to me (as I've reported before).
There are messages in the queue that are not being processed, there are no useful logs from MT (even with the log level reduced to Debug), and the health checks report Healthy, which is misleading.
When we restart the application, everything works fine.
Is there a way to check this via a custom health check mechanism? We will probably remove MT from all our systems and revert to the native Azure Service Bus libraries in the near future.

@phatboyg
Member

@oguzhankahyaoglu are you running 7.3.1 and seeing the same result? The Azure SDK has a lot of transient error handling features, and it doesn't immediately surface unavailability to applications. That said, there may be some way to pick up the loss of connection and report the unhealthy status, but I haven't had time to look at it yet.

@oguzhankahyaoglu

We were on 7.3.0 for approximately the last 10 days and upgraded to 7.3.1 today; we will definitely watch whether everything works fine in production.
Is there a way to health-check whether all queue consumers are alive and healthy? We previously tried the extra health check below, and it still does not work as expected (AFAIK, it just tries to ensure that a healthy connection can be established using the GetSendEndpoint method):

    // Custom health check: reads the bus health, then resolves a send endpoint
    // for every configured queue and topic.
    public async Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        var busHealth = _busControl.CheckHealth();
        Log.Debug("Bus health: " + busHealth.Status);

        foreach (var queueName in QueueNames)
        {
            Log.Debug($"Health checking {queueName}");
            // The returned endpoint is not used; only the lookup itself matters here.
            await _busControl.GetSendEndpoint(new Uri("queue:" + queueName));
            Log.Debug("Health check ok");
        }

        foreach (var topicName in TopicNames)
        {
            Log.Debug($"Health checking {topicName}");
            await _busControl.GetSendEndpoint(new Uri("topic:" + topicName));
            Log.Debug("Health check ok");
        }

        // Degraded and Unhealthy are both reported as Unhealthy.
        switch (busHealth.Status)
        {
            case BusHealthStatus.Degraded:
            case BusHealthStatus.Unhealthy:
                return HealthCheckResult.Unhealthy();
        }

        return HealthCheckResult.Healthy();
    }

@phatboyg
Member

Well, that's what the receive transport does: if the consumer is disconnected from Azure, it should go into an unhealthy state. But in the event of transient issues, it doesn't immediately detect it. I'm not sure why and, as I stated above, I haven't been able to set up a scenario where I disconnect from the network to see it break.

@phatboyg
Member

phatboyg commented Feb 6, 2022

I researched this, and the Azure SDK never seems to report a connection failure, as it retries under the hood regardless of the retry policy applied. So, at this point, I can't think of anything to do and will close this issue.

@phatboyg phatboyg closed this as completed Feb 6, 2022
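
Since the thread ends without a fix and a custom health check was asked about above, here is one possible workaround sketch, which nobody in the thread proposed or tested. It skips MassTransit and the Azure SDK entirely and simply attempts a TCP connection to the Service Bus namespace host, which would have failed in the "No such host is known" scenario from the log. The class name is hypothetical, and port 5671 (AMQP over TLS) is an assumption about the transport in use. The check only shows that the host is reachable, not that the consumers are actually receiving.

```csharp
using System;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Hypothetical helper: reports Unhealthy when the namespace host cannot be reached over TCP.
public sealed class ServiceBusConnectivityHealthCheck : IHealthCheck
{
    private readonly string _host;
    private readonly int _port;
    private readonly TimeSpan _timeout;

    public ServiceBusConnectivityHealthCheck(string host, int port = 5671, TimeSpan? timeout = null)
    {
        _host = host;
        _port = port;
        _timeout = timeout ?? TimeSpan.FromSeconds(5);
    }

    public async Task<HealthCheckResult> CheckHealthAsync(HealthCheckContext context,
        CancellationToken cancellationToken = default)
    {
        try
        {
            using var client = new TcpClient();

            // ConnectAsync(string, int) has no cancellation overload on .NET Core 3.1,
            // so race it against a delay to enforce the timeout.
            var connect = client.ConnectAsync(_host, _port);
            var completed = await Task.WhenAny(connect, Task.Delay(_timeout, cancellationToken));

            if (completed != connect)
                return HealthCheckResult.Unhealthy($"Timed out connecting to {_host}:{_port}");

            await connect; // rethrows DNS or socket failures
            return HealthCheckResult.Healthy($"Connected to {_host}:{_port}");
        }
        catch (SocketException ex)
        {
            return HealthCheckResult.Unhealthy($"Cannot reach {_host}:{_port}", ex);
        }
    }
}
```

It could be registered next to the MassTransit checks with something like `services.AddHealthChecks().AddCheck("servicebus-tcp", new ServiceBusConnectivityHealthCheck(host), tags: new[] { "ready" })`, where `host` is your namespace host name.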