91 changes: 91 additions & 0 deletions scenarios/AKSDNSLookupFailError/aksdns-lookup-fail-error.md
@@ -0,0 +1,91 @@
---
title: Troubleshoot the K8SAPIServerDNSLookupFailVMExtensionError error code (52)
description: Learn how to troubleshoot the K8SAPIServerDNSLookupFailVMExtensionError error (52) when you try to start or create and deploy an Azure Kubernetes Service (AKS) cluster.
ms.topic: article
ms.date: 06/14/2024
author: MicrosoftDocsExec
ms.author: MicrosoftDocsExec
ms.custom: sap:Create, Upgrade, Scale and Delete operations (cluster or nodepool), innovation-engine
---

# Troubleshoot the K8SAPIServerDNSLookupFailVMExtensionError error code (52)

This article discusses how to identify and resolve the `K8SAPIServerDNSLookupFailVMExtensionError` error (also known as error code ERR_K8S_API_SERVER_DNS_LOOKUP_FAIL, error number 52) that occurs when you try to start or create and deploy a Microsoft Azure Kubernetes Service (AKS) cluster.

## Prerequisites

- The [nslookup](/windows-server/administration/windows-commands/nslookup) DNS lookup tool for Windows nodes or the [dig](https://linuxize.com/post/how-to-use-dig-command-to-query-dns-in-linux/) tool for Linux nodes.

- [Azure CLI](/cli/azure/install-azure-cli), version 2.0.59 or a later version. If Azure CLI is already installed, you can find the version number by running `az --version`.

## Symptoms

When you try to start or create an AKS cluster, you receive the following error message:

> Agents are unable to resolve Kubernetes API server name. It's likely custom DNS server is not correctly configured, please see <https://aka.ms/aks/private-cluster#hub-and-spoke-with-custom-dns> for more information.
>
> Details: Code="VMExtensionProvisioningError"
>
> Message="VM has reported a failure when processing extension 'vmssCSE'.
>
> Error message: "**Enable failed: failed to execute command: command terminated with exit status=52**\n[stdout]\n{
>
> "ExitCode": "52",
>
> "Output": "Fri Oct 15 10:06:00 UTC 2021,aks-nodepool1-36696444-vmss000000\\nConnection to mcr.microsoft.com 443 port [tcp/https]

## Cause

The cluster nodes can't resolve the cluster's fully qualified domain name (FQDN) in Azure DNS. To check whether the FQDN resolves to a valid address, run the following DNS lookup command on the failed cluster node.

| Node OS | Command |
| ------- | ------------------------- |
| Linux | `dig <cluster-fqdn>` |
| Windows | `nslookup <cluster-fqdn>` |
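
As a minimal sketch, the Linux check can be wrapped in a small shell function that reports whether a name returned any answer. The helper name `check_fqdn` is hypothetical and isn't part of any Azure tooling. The sketch assumes that `dig` is installed on the node.

```shell
# check_fqdn is a hypothetical helper (an assumption for illustration).
# It prints the last answer record that `dig +short` returns for a name,
# or reports the name as unresolved and returns a nonzero status.
check_fqdn() {
  local fqdn="$1" answer
  # `+short` prints only answer records. Lines that start with ';' are
  # dig diagnostics (for example, timeouts), so they are filtered out.
  answer="$(dig +short "$fqdn" 2>/dev/null | grep -v '^;' | tail -n 1)"
  if [ -n "$answer" ]; then
    echo "resolved: $answer"
  else
    echo "unresolved: $fqdn"
    return 1
  fi
}
```

For example, running `check_fqdn` with your cluster's FQDN as the argument on the failed node prints either the resolved address or an `unresolved` line that you can act on.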

## Solution

On your DNS servers and firewall, make sure that nothing blocks resolution of your cluster's FQDN. If resolution still fails after you run the `nslookup` or `dig` command and apply any necessary fixes, your custom DNS server might be configured incorrectly. For help with configuring your custom DNS server, review the following articles:

- [Create a private AKS cluster](/azure/aks/private-clusters)
- [Private Azure Kubernetes service with custom DNS server](https://github.com/Azure/terraform/tree/00d15e09c54f25fb6387330c36aa4366122c5aaa/quickstart/301-aks-private-cluster)
- [What is IP address 168.63.129.16?](/azure/virtual-network/what-is-ip-address-168-63-129-16)

When you use a private cluster that has a custom DNS server, a DNS zone is created, and this zone must be linked to the virtual network. Because the link is made only after the cluster is created, creating a private cluster that has a custom DNS server fails during the initial creation. However, you can restore the creation process to a "success" state by reconciling the cluster. To do this, run the [az resource update](/cli/azure/resource#az-resource-update) command in Azure CLI.

Before you run the command, set the `RESOURCE_GROUP_NAME` and `CLUSTER_NAME` environment variables to the names of your resource group and AKS cluster:

```azurecli-interactive
az resource update --resource-group $RESOURCE_GROUP_NAME \
--name $CLUSTER_NAME \
--namespace Microsoft.ContainerService \
--resource-type ManagedClusters
```

Results:

<!-- expected_similarity=0.3 -->

```output
{
  "id": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/myResourceGroupxxx/providers/Microsoft.ContainerService/ManagedClusters/myAksClusterxxx",
  "location": "eastus",
  "name": "myAksClusterxxx",
  "properties": {
    // ...other properties...
  },
  "resourceGroup": "myResourceGroupxxx",
  "type": "Microsoft.ContainerService/ManagedClusters"
}
```
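
After the reconcile command returns, you might want to confirm that the cluster reached a terminal state. The following sketch polls `provisioningState` through `az aks show`. The helper name `wait_for_cluster` and its retry defaults are assumptions for illustration, not part of Azure CLI.

```shell
# wait_for_cluster is a hypothetical helper (an assumption for illustration).
# Arguments: resource group, cluster name, max attempts (default 30),
# delay in seconds between attempts (default 20).
wait_for_cluster() {
  local rg="$1" name="$2" tries="${3:-30}" delay="${4:-20}" state i=0
  while [ "$i" -lt "$tries" ]; do
    # provisioningState is a top-level property of the cluster resource.
    state="$(az aks show --resource-group "$rg" --name "$name" \
      --query provisioningState --output tsv)"
    case "$state" in
      Succeeded) echo "Succeeded"; return 0 ;;
      Failed|Canceled) echo "$state"; return 1 ;;
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  echo "timed out"
  return 1
}
```

A call such as `wait_for_cluster "$RESOURCE_GROUP_NAME" "$CLUSTER_NAME"` prints the final state and returns a nonzero status if the cluster didn't reach `Succeeded`.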

Also verify that your DNS server is configured correctly for your private cluster, as described earlier.

> [!NOTE]
> Conditional Forwarding doesn't support subdomains.

## More information

- [General troubleshooting of AKS cluster creation issues](troubleshoot-aks-cluster-creation-issues.md)

[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]
@@ -0,0 +1,244 @@
---
title: Troubleshoot the health probe mode for AKS cluster service load balancer
description: Diagnoses and fixes common issues with the health probe mode feature.
ms.date: 06/03/2024
ms.reviewer: niqi, cssakscic, v-weizhu
ms.service: azure-kubernetes-service
ms.custom: sap:Node/node pool availability and performance, devx-track-azurecli, innovation-engine
---

# Troubleshoot issues when enabling the AKS cluster service health probe mode

The health probe mode feature allows you to configure how Azure Load Balancer probes the health of the nodes in your Azure Kubernetes Service (AKS) cluster. You can choose between two modes: Shared and ServiceNodePort. The Shared mode uses a single health probe for all services that use the same load balancer and have `externalTrafficPolicy` set to `Cluster`. In contrast, the ServiceNodePort mode uses a separate health probe for each service. The Shared mode can reduce the number of health probes and improve the performance of the load balancer, but it requires some additional components to work properly. To enable this feature, see [How to enable the health probe mode feature using the Azure CLI](#how-to-enable-the-health-probe-mode-feature-using-the-azure-cli).

This article describes some common issues about using the health probe mode feature in an AKS cluster and helps you troubleshoot and resolve these issues.

## Symptoms

When you create or update an AKS cluster by using the Azure CLI and enable the health probe mode feature by using the `--cluster-service-load-balancer-health-probe-mode Shared` flag, one or more of the following issues might occur:

- The load balancer doesn't distribute traffic to the nodes as expected.

- The load balancer reports unhealthy nodes even if they're healthy.

- The health-probe-proxy sidecar container crashes or doesn't start.

- The cloud-node-manager pod crashes or doesn't start.

The following operations also happen:

1. The AKS resource provider (RP) frontend checks whether the request is valid and updates the corresponding property in the LoadBalancerProfile.

2. The RP async component calls the cloud provider config secret reconciler to update the cloud provider config secret based on the LoadBalancerProfile.

3. Overlaymgr reconciles the cloud-node-manager chart to enable the health-probe-proxy sidecar.

## Initial troubleshooting

To troubleshoot these issues, follow these steps:

0. First, connect to your AKS cluster using the Azure CLI:

```azurecli
export RESOURCE_GROUP="aks-rg"
export AKS_CLUSTER_NAME="aks-cluster"
az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME --overwrite-existing
```

1. Next, check the RP frontend log to see if the health probe mode in the LoadBalancerProfile is properly configured. You can use the `az aks show` command to view the LoadBalancerProfile property of your cluster.

```azurecli
export RESOURCE_GROUP="aks-rg"
export AKS_CLUSTER_NAME="aks-cluster"
az aks show --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME --query "networkProfile.loadBalancerProfile"
```
Results:

<!-- expected_similarity=0.3 -->

```output
{
  "clusterServiceLoadBalancerHealthProbeMode": "Shared",
  "managedOutboundIPs": null,
  "outboundIPs": null,
  "outboundIPPrefixes": null,
  "allocatedOutboundPorts": null,
  "effectiveOutboundIPs": [
    {
      "id": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/resourceGroups/MC_aks-rg_aks-cluster_eastus2/providers/Microsoft.Network/publicIPAddresses/xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    }
  ],
  "idleTimeoutInMinutes": 30,
  "loadBalancerSku": "standard",
  "managedOutboundIPv6": null
}
```

2. Check the cloud provider configuration. In modern AKS clusters, the cloud provider configuration is managed internally, and the `ccp` namespace doesn't exist in the cluster. Instead, check for cloud provider-related resources, and verify that the cloud-node-manager pods are running properly:

```bash
# Check for cloud provider related ConfigMaps in kube-system
kubectl get configmap -n kube-system | grep -i azure

# Check if cloud-node-manager pods are running (indicates cloud provider integration is working)
kubectl get pods -n kube-system | grep cloud-node-manager

# Check the azure-ip-masq-agent-config-reconciled ConfigMap, if it exists
kubectl get configmap azure-ip-masq-agent-config-reconciled -n kube-system -o yaml 2>/dev/null || echo "ConfigMap not found"
```
Results:

<!-- expected_similarity=0.3 -->

```output
configmap/azure-ip-masq-agent-config-reconciled 1 11h

cloud-node-manager-rfb2w 2/2 Running 0 16m
```

3. Check the cloud-node-manager DaemonSet, which the chart or overlay deploys, to see whether the health-probe-proxy sidecar container is enabled. You can use the `kubectl get ds` command to view the DaemonSet.

```shell
kubectl get ds -n kube-system cloud-node-manager -o yaml
```
Results:

<!-- expected_similarity=0.3 -->

```output
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cloud-node-manager
  namespace: kube-system
...
spec:
  template:
    spec:
      containers:
      - name: cloud-node-manager
        image: mcr.microsoft.com/oss/kubernetes/azure-cloud-node-manager:xxxxxxxx
      - name: health-probe-proxy
        image: mcr.microsoft.com/oss/kubernetes/azure-health-probe-proxy:xxxxxxxx
...
```
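
The pod and DaemonSet checks in the preceding steps can be sketched as two small shell helpers. Both function names are hypothetical and exist only for illustration. The first reads `kubectl get pods --no-headers` lines from standard input, and the second assumes that the DaemonSet is named `cloud-node-manager` in `kube-system`, as in the sample output.

```shell
# not_ready_pods is a hypothetical helper (an assumption for illustration).
# Input on stdin: `kubectl get pods --no-headers` lines (NAME READY STATUS ...).
# Output: names of pods that aren't Running or have unready containers.
not_ready_pods() {
  awk '{ split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") print $1 }'
}

# has_health_probe_proxy is a hypothetical helper (an assumption).
# It lists the container names in the DaemonSet pod template and looks
# for the health-probe-proxy sidecar.
has_health_probe_proxy() {
  local names
  names="$(kubectl get ds cloud-node-manager -n kube-system \
    -o jsonpath='{.spec.template.spec.containers[*].name}')"
  case " $names " in
    *" health-probe-proxy "*) echo "enabled" ;;
    *) echo "missing"; return 1 ;;
  esac
}
```

For example, `kubectl get pods -n kube-system --no-headers | grep cloud-node-manager | not_ready_pods` prints nothing when every cloud-node-manager pod is fully ready.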

## Cause 1: The health probe mode isn't Shared or ServiceNodePort

The health probe mode feature supports only the Shared and ServiceNodePort modes. If you specify any other value, the feature doesn't work.

### Solution 1: Use the correct health probe mode

Make sure you use the Shared or ServiceNodePort mode when creating or updating your cluster. You can use the `--cluster-service-load-balancer-health-probe-mode` flag to specify the mode.
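
As a hedged sketch, a script can validate the mode value before it passes the value to the flag. The function name `validate_probe_mode` is hypothetical. The comparison here is case-sensitive, which might be stricter than what the CLI itself accepts.

```shell
# validate_probe_mode is a hypothetical helper (an assumption for illustration).
# It accepts only the two mode names that this article describes.
validate_probe_mode() {
  case "$1" in
    Shared|ServiceNodePort) return 0 ;;
    *) echo "unsupported health probe mode: $1" >&2; return 1 ;;
  esac
}
```

A script could call `validate_probe_mode "$MODE"` (where `MODE` is a variable that you define) and run `az aks update` only if the function returns success.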

## Cause 2: The toggle for the health probe mode feature is off

The health probe mode feature is controlled by a toggle that can be enabled or disabled by the AKS team. If the toggle is off, the feature won't work.

### Solution 2: Turn on the toggle

Contact the AKS team to check if the toggle for the health probe mode feature is on or off. If it's off, ask them to turn it on for your subscription.

## Cause 3: The load balancer SKU is Basic

The health probe mode feature only works with the Standard Load Balancer SKU. If you use the Basic Load Balancer SKU, the feature won't work.

### Solution 3: Use the Standard Load Balancer SKU

Make sure you use the Standard Load Balancer SKU when creating or updating your cluster. You can use the `--load-balancer-sku` flag to specify the SKU.
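
To check the SKU of an existing cluster, one option is to query it through `az aks show`. The following sketch assumes that the SKU is exposed at the `networkProfile.loadBalancerSku` path of the command output. The function name `lb_sku_is_standard` is hypothetical.

```shell
# lb_sku_is_standard is a hypothetical helper (an assumption for illustration).
# It returns success only if the cluster reports the Standard SKU.
lb_sku_is_standard() {
  local rg="$1" name="$2" sku
  sku="$(az aks show --resource-group "$rg" --name "$name" \
    --query networkProfile.loadBalancerSku --output tsv)"
  # Compare case-insensitively, because the value can appear as "standard".
  [ "$(printf '%s' "$sku" | tr '[:upper:]' '[:lower:]')" = "standard" ]
}
```

For example, `lb_sku_is_standard "$RESOURCE_GROUP" "$AKS_CLUSTER_NAME" || echo "Basic SKU detected"` flags clusters that can't use the feature.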

## Cause 4: The feature isn't registered

The health probe mode feature requires you to register the feature on your subscription. If the feature isn't registered, it won't work.

### Solution 4: Register the feature

Make sure you register the feature for your subscription before creating or updating your cluster. You can use the `az feature register` command to register the feature.

```azurecli
export FEATURE_NAME="EnableSLBSharedHealthProbePreview"
export PROVIDER_NAMESPACE="Microsoft.ContainerService"
az feature register --name $FEATURE_NAME --namespace $PROVIDER_NAMESPACE
```
Results:

<!-- expected_similarity=0.3 -->

```output
{
  "id": "/subscriptions/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx/providers/Microsoft.Features/providers/Microsoft.ContainerService/features/EnableSLBSharedHealthProbePreview",
  "name": "Microsoft.ContainerService/EnableSLBSharedHealthProbePreview",
  "properties": {
    "state": "Registering"
  },
  "type": "Microsoft.Features/providers/features"
}
```
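
Registration is asynchronous, so the state usually starts as `Registering`. The following sketch reads the current state through `az feature show`. The function name `feature_is_registered` is hypothetical, and the default namespace is taken from the command earlier in this section.

```shell
# feature_is_registered is a hypothetical helper (an assumption for illustration).
# It prints the feature state and returns success only for "Registered".
feature_is_registered() {
  local name="$1" namespace="${2:-Microsoft.ContainerService}" state
  state="$(az feature show --namespace "$namespace" --name "$name" \
    --query properties.state --output tsv)"
  echo "$state"
  [ "$state" = "Registered" ]
}
```

After `feature_is_registered "$FEATURE_NAME"` reports `Registered`, the usual follow-up is to refresh the resource provider registration by running `az provider register --namespace Microsoft.ContainerService`.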

## Cause 5: The Kubernetes version is earlier than v1.28.0

The health probe mode feature requires a minimum Kubernetes version of v1.28.0. If you use an older version, the feature won't work.

### Solution 5: Upgrade the Kubernetes version

Make sure you use Kubernetes v1.28.0 or a later version when creating or updating your cluster. You can use the `--kubernetes-version` flag to specify the version.
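
A version comparison such as this one can be sketched in shell by using version-aware sorting. The function name `version_at_least` is hypothetical, and the sketch assumes a `sort` implementation that supports the `-V` option (for example, GNU coreutils).

```shell
# version_at_least is a hypothetical helper (an assumption for illustration).
# It returns success when the first version is greater than or equal to
# the second. A leading "v" is stripped from both arguments.
version_at_least() {
  local have="${1#v}" want="${2#v}"
  # After a version sort, the smaller version is on the first line.
  [ "$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)" = "$want" ]
}
```

For example, you can read the cluster version by running `az aks show --query kubernetesVersion --output tsv` and pass the result to `version_at_least` together with `1.28.0`.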

## Known issues

On Windows nodes, the kube-proxy component doesn't start until you create the first non-HostProcess container (non-HPC) pod on a node. This issue affects the health probe mode feature and causes the load balancer to report unhealthy nodes. It will be fixed in a future update.

## How to enable the health probe mode feature using the Azure CLI

To enable the health probe mode feature, run one of the following commands:

Enable `ServiceNodePort` health probe mode (default) for a cluster:

```shell
export RESOURCE_GROUP="aks-rg"
export AKS_CLUSTER_NAME="aks-cluster"
az aks update --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME --cluster-service-load-balancer-health-probe-mode ServiceNodePort
```
Results:

```output
{
  "name": "aks-cluster",
  "location": "eastus2",
  "resourceGroup": "aks-rg",
  "kubernetesVersion": "1.28.x",
  "provisioningState": "Succeeded",
  "loadBalancerProfile": {
    "clusterServiceLoadBalancerHealthProbeMode": "ServiceNodePort",
    ...
  },
  ...
}
```

Enable `Shared` health probe mode for a cluster:

```shell
export RESOURCE_GROUP="MyAksResourceGroup"
export AKS_CLUSTER_NAME="MyAksCluster"
az aks update --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER_NAME --cluster-service-load-balancer-health-probe-mode Shared
```

Results:

```output
{
  "name": "MyAksCluster",
  "location": "eastus2",
  "resourceGroup": "MyAksResourceGroup",
  "kubernetesVersion": "1.28.x",
  "provisioningState": "Succeeded",
  "loadBalancerProfile": {
    "clusterServiceLoadBalancerHealthProbeMode": "Shared",
    ...
  },
  ...
}
```

[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]