---
title: Troubleshoot Azure Load Balancer resource health, frontend, and backend availability issues
description: Use the available metrics to diagnose your degraded or unavailable Azure Standard Load Balancer.
services: load-balancer
documentationcenter: na
author: erichrt
ms.service: load-balancer
ms.devlang: na
ms.topic: article
ms.tgt_pltfrm: na
ms.workload: infrastructure-services
ms.date: 08/14/2020
ms.author: errobin
---

# Troubleshoot resource health, frontend, and backend availability issues

This article is a guide to investigating issues that affect the availability of your load balancer's frontend IP and backend resources.

## About the metrics we'll use

The two metrics we'll use are data path availability and health probe status. It's important to understand their meaning to derive correct insights.

### Data path availability

The data path availability metric is generated by a TCP ping every 25 seconds on all frontend ports that have load-balancing and inbound NAT rules configured. This TCP ping is routed to any of the healthy (probed up) backend instances. If the service receives a response to the ping, the ping is considered a success and the sum of the metric is incremented by one; if it fails, it isn't. The count of this metric is 1/100 of the total TCP pings per sample period. We therefore want to consider the average, which shows sum/count for the time period. The data path availability metric aggregated by average thus gives a percentage success rate for TCP pings on your frontend IP:port for each of your load-balancing and inbound NAT rules.
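
For reference, here's a minimal sketch of pulling this metric with the Azure CLI. It assumes the underlying metric name for data path availability is `VipAvailability`, and the resource ID is a placeholder you'd replace with your own load balancer's ID:

```azurecli
# Average data path availability (percentage of successful TCP pings),
# sampled at one-minute granularity.
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Network/loadBalancers/<load-balancer-name>" \
  --metric VipAvailability \
  --aggregation Average \
  --interval PT1M
```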

### Health probe status

The health probe status metric is generated by a ping of the protocol defined in the health probe. This ping is sent to each instance in the backend pool on the port defined in the health probe. For HTTP and HTTPS probes, a successful ping requires an HTTP 200 OK response, whereas with TCP probes any response is considered successful. The consecutive successes or failures of each probe then determine whether a backend instance is healthy and able to receive traffic for the load-balancing rules to which the backend pool is assigned. Similar to data path availability, we use the average aggregation, which tells us the average of successful/total pings during the sampling interval. This health probe status value indicates the backend health in isolation from your load balancer, because it probes your backend instances without sending traffic through the frontend.
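
A similar sketch applies to this metric, assuming its underlying name is `DipAvailability` (placeholders as before):

```azurecli
# Average health probe status across the backend pool,
# sampled at one-minute granularity.
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Network/loadBalancers/<load-balancer-name>" \
  --metric DipAvailability \
  --aggregation Average \
  --interval PT1M
```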

> [!IMPORTANT]
> Health probe status is sampled on a one-minute basis. This can lead to minor fluctuations in an otherwise steady value. For example, if there are two backend instances, one probed up and one probed down, the health probe service may capture 7 samples for the healthy instance and 6 for the unhealthy instance. This causes a previously steady value of 50 to show as 46.15 for a one-minute interval.

## Diagnose degraded and unavailable load balancers

As outlined in the resource health article, a degraded load balancer is one that shows between 25% and 90% data path availability, and an unavailable load balancer is one with less than 25% data path availability, over a two-minute period. These same steps can be taken to investigate the failure you see in any health probe status or data path availability alerts you've configured. We'll explore the case where we've checked our resource health and found our load balancer to be unavailable with a data path availability of 0%: our service is down.

First, we go to the detailed metrics view of our load balancer insights blade. You can do this via your load balancer resource blade or the link in your resource health message. Next, we navigate to the Frontend and Backend availability tab and review a thirty-minute window of the time period in which the degraded or unavailable state occurred. If we see our data path availability has been 0%, we know there's an issue preventing traffic for all of our load-balancing and inbound NAT rules and can see how long this impact has lasted.
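
The same review can also be done from the command line. Here's a hedged sketch, again assuming the `VipAvailability` metric name and using a hypothetical thirty-minute window around the incident:

```azurecli
# Data path availability, minute by minute, for the thirty minutes
# in which the degraded or unavailable state was reported.
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Network/loadBalancers/<load-balancer-name>" \
  --metric VipAvailability \
  --aggregation Average \
  --interval PT1M \
  --start-time 2020-08-14T09:00:00Z \
  --end-time 2020-08-14T09:30:00Z \
  --output table
```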

The next place to look is our health probe status metric, to determine whether our data path is unavailable because we have no healthy backend instances to serve traffic. If we have at least one healthy backend instance for all of our load-balancing and inbound rules, we know our configuration isn't what's causing our data paths to be unavailable. This scenario indicates an Azure platform issue; while these are rare, don't fret if you find one, as an automated alert is sent to our team to rapidly resolve all platform issues.
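
To see whether at least one backend instance is healthy, rather than only the pool-wide average, the health probe status metric can be split per instance. A sketch, assuming `DipAvailability` exposes a `BackendIPAddress` dimension (verify the dimension name for your resource before relying on it):

```azurecli
# Health probe status split per backend instance; any series holding
# near 100 means at least one instance is probed up.
az monitor metrics list \
  --resource "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Network/loadBalancers/<load-balancer-name>" \
  --metric DipAvailability \
  --aggregation Average \
  --interval PT1M \
  --filter "BackendIPAddress eq '*'" \
  --output table
```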

## Diagnose health probe failures

Let's say we check our health probe status and find out that all instances are showing as unhealthy. This finding explains why our data path is unavailable as traffic has nowhere to go. We should then go through the following checklist to rule out common configuration errors:

- Check the CPU utilization for your resources to confirm that your instances are in fact healthy.
  - You can check this by reviewing the CPU metrics for your instances in Azure Monitor.
- If using an HTTP or HTTPS probe, check whether the application is healthy and responsive.
  - Validate that the application is functional by accessing it directly through the private IP address or instance-level public IP address associated with your backend instance.
- Review the Network Security Groups applied to your backend resources. Ensure that there are no rules with a higher priority than AllowAzureLoadBalancerInBound that block the health probe (see the CLI sketch after this list).
  - You can do this by visiting the Networking blade of your backend VMs or Virtual Machine Scale Sets.
  - If you find this NSG issue is the case, move the existing Allow rule or create a new high-priority rule to allow AzureLoadBalancer traffic.
- Check your OS. Ensure your VMs are listening on the probe port, and review their OS firewall rules to ensure they aren't blocking the probe traffic originating from IP address 168.63.129.16.
  - You can check listening ports by running `netstat -a` in a Windows command prompt or `netstat -l` in a Linux terminal.
- Don't place a firewall NVA VM in the backend pool of the load balancer; use user-defined routes to route traffic to backend instances through the firewall.
- Ensure you're using the right protocol; if you use HTTP to probe a port listening for a non-HTTP application, the probe will fail.
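
For the NSG item above, here's a sketch of reviewing and adding an allow rule with the Azure CLI. The resource group, NSG name, rule name, priority, and port are hypothetical placeholders; adjust them to your environment:

```azurecli
# List existing rules so you can spot anything with a higher priority
# (lower number) than the rule that allows AzureLoadBalancer traffic.
az network nsg rule list \
  --resource-group <resource-group> \
  --nsg-name <nsg-name> \
  --output table

# Create a high-priority inbound rule that allows health probe traffic
# from the AzureLoadBalancer service tag to the probe port.
az network nsg rule create \
  --resource-group <resource-group> \
  --nsg-name <nsg-name> \
  --name AllowAzureLoadBalancerProbe \
  --priority 100 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes AzureLoadBalancer \
  --destination-port-ranges 80
```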

If you've gone through this checklist and are still finding health probe failures, there may be rare platform issues impacting the probe service for your instances. In this case, Azure has your back and an automated alert is sent to our team to rapidly resolve all platform issues.

## Next steps