### YamlMime:FAQ
metadata:
title: Troubleshoot networking issues and errors in AKS enabled by Arc
description: Get help to troubleshoot networking issues and errors in AKS Arc.
author: sethmanheim
ms.author: sethm
ms.service: azure-stack
ms.subservice: aks-hci
ms.topic: faq
ms.date: 07/11/2024
title: Fix known issues and errors when configuring a network in AKS Arc
summary: |
**Applies to: AKS on Azure Stack HCI, AKS on Windows Server**
Use this topic to help you troubleshoot and resolve networking-related issues with AKS Arc.
sections:
- name: Single section - ignored
questions:
- question: |
Error: 'Failed to start the cloud agent generic cluster service in failover cluster. The cluster resource group is in the 'failed' state.'
answer: |
The cloud agent may fail to start when path names contain spaces.
When you use [Set-AksHciConfig](./reference/ps/set-akshciconfig.md) to specify the `-imageDir`, `-workingDir`, `-cloudConfigLocation`, or `-nodeConfigLocation` parameters with a path name that contains a space character, such as `D:\Cloud Share\AKS HCI`, the cloud agent cluster service fails to start with the following (or similar) error message:
```
Failed to start the cloud agent generic cluster service in failover cluster. The cluster resource group is in the 'failed' state. Resources in 'failed' or 'pending' states: 'MOC Cloud Agent Service'
```
To work around this issue, use a path that does not include spaces, for example, `C:\CloudShare\AKS-HCI`.
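For example, a configuration along these lines avoids the issue; the drive and folder names here are illustrative:
```powershell
# All four location parameters use paths without spaces (paths are illustrative).
Set-AksHciConfig -imageDir "C:\CloudShare\AKS-HCI\Images" `
    -workingDir "C:\CloudShare\AKS-HCI\WorkingDir" `
    -cloudConfigLocation "C:\CloudShare\AKS-HCI\CloudConfig" `
    -nodeConfigLocation "C:\CloudShare\AKS-HCI\NodeConfig"
```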
- question: |
Arc-connected clusters have empty JSON "distribution" property
answer: |
Azure Arc for Kubernetes-connected clusters can have the value for the JSON property `distribution` set to an empty string. For AKS Arc-connected clusters, this value is set during initial installation and is never altered for the lifetime of the deployment.
To reproduce the issue, open an Azure PowerShell window and run the following commands:
```powershell
az login
az account set --subscription <sub_id>
az connectedk8s show -g <rg_name> -n <cc_name>
```
The following is example output from this command:
```json
{
"agentPublicKeyCertificate": <value>
"agentVersion": "1.8.14",
"azureHybridBenefit": "NotApplicable",
"connectivityStatus": "Connected",
"distribution": "",
"distributionVersion": null,
"id": "/subscriptions/<subscription>/resourceGroups/<resource group>/providers/Microsoft.Kubernetes/connectedClusters/<cluster name>",
"identity": {
"principalId": "<principal id>",
"tenantId": "<tenant id>",
"type": "SystemAssigned"
},
"infrastructure": "azure_stack_hci",
"kubernetesVersion": "1.23.12",
"lastConnectivityTime": "2022-11-04T14:59:59.050000+00:00",
"location": "eastus",
"managedIdentityCertificateExpirationTime": "2023-02-02T14:24:00+00:00",
"miscellaneousProperties": null,
"name": "mgmt-cluster",
"offering": "AzureStackHCI_AKS_Management",
"privateLinkScopeResourceId": null,
"privateLinkState": "Disabled",
"provisioningState": "Succeeded",
"resourceGroup": "<resource group>",
"systemData": {
"createdAt": "2022-11-04T14:29:17.680060+00:00",
"createdBy": "<>",
"createdByType": "Application",
"lastModifiedAt": "2022-11-04T15:00:01.830067+00:00",
"lastModifiedBy": "64b12d6e-6549-484c-8cc6-6281839ba394",
"lastModifiedByType": "Application"
},
"tags": {},
"totalCoreCount": 4,
"totalNodeCount": 1,
"type": "microsoft.kubernetes/connectedclusters"
}
```
### To resolve the issue
If the output for the JSON `distribution` property returns an empty string, follow these instructions to patch your cluster:
1. Copy the following configuration into a file called **patchBody.json**:
```json
{
"properties": {
"distribution": "aks_management"
}
}
```
> [!IMPORTANT]
> If your cluster is an AKS management cluster, the value should be set to `aks_management`. If it's a workload cluster, it should be set to `aks_workload`.
2. From the JSON output you obtained above, copy your connected cluster ID. It should appear as a long string in the following format:
`"/subscriptions/<subscription >/resourceGroups/<resource group>/providers/Microsoft.Kubernetes/connectedClusters/<cluster name>",`
3. Run the following command to patch your cluster. The `<cc_arm_id>` value should be the ID obtained in step 2:
```powershell
az rest -m patch -u "<cc_arm_id>?api-version=2022-10-01-preview" -b "@patchBody.json"
```
4. Check that your cluster has been successfully patched by running `az connectedk8s show -g <rg_name> -n <cc_name>` to make sure the JSON property `distribution` has been correctly set.
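Steps 2 through 4 can also be scripted; here's a sketch, assuming the Azure CLI is already signed in to the right subscription (`--query id -o tsv` extracts the cluster ID directly):
```powershell
# Fetch the connected cluster's ARM ID, patch the distribution property, then verify.
$ccArmId = az connectedk8s show -g <rg_name> -n <cc_name> --query id -o tsv
az rest -m patch -u "${ccArmId}?api-version=2022-10-01-preview" -b "@patchBody.json"
az connectedk8s show -g <rg_name> -n <cc_name> --query distribution -o tsv
```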
- question: |
The WSSDAgent service is stuck while starting and fails to connect to the cloud agent
answer: |
Symptoms:
- A proxy is enabled in AKS Arc, and the WSSDAgent service is stuck in the `starting` state.
The issue shows up as follows:
- `Test-NetConnection -ComputerName <computer IP/Name> -Port <port>` from the node where the node agent is failing, toward the cloud agent, works properly (even when the wssdagent fails to start).
- Running curl.exe from the node on which the agent is failing, toward the cloud agent, reproduces the problem and gets stuck:
`curl.exe https://<computerIP>:65000`
- Passing the `--noproxy` flag to curl.exe fixes the problem. Curl returns an error from wssdcloudagent, which is expected because curl is not a gRPC client. Curl doesn't get stuck waiting when you send the `--noproxy` flag, so returning an error is considered a success here:
```bash
curl.exe --noproxy '*' https://<computerIP>:65000
```
It is likely the proxy settings were changed to a faulty proxy on the host. The proxy settings for AKS Arc are environment variables that are inherited from the parent process on the host. These settings only get propagated when a new service starts or an old one updates or reboots. It is possible that faulty proxy settings were set on the host, and they were propagated to the WSSDAgent after an update or reboot, which has caused the WSSDAgent to fail.
To fix the problem, change the proxy environment variables on the machine with the following commands:
```powershell
[Environment]::SetEnvironmentVariable("https_proxy", <Value>, "Machine")
[Environment]::SetEnvironmentVariable("http_proxy", <Value>, "Machine")
[Environment]::SetEnvironmentVariable("no_proxy", <Value>, "Machine")
```
Reboot the machine so that the service manager and the WSSDAgent pick up the fixed proxy settings.
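To confirm the machine-scoped values before and after the reboot, you can read them back; this check is a suggestion, not part of the documented fix:
```powershell
# Read back the machine-scoped proxy variables to verify the new values.
foreach ($name in "https_proxy", "http_proxy", "no_proxy") {
    "{0} = {1}" -f $name, [Environment]::GetEnvironmentVariable($name, "Machine")
}
```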
- question: |
CAPH pod fails to renew certificate
answer: |
This error occurs because every time the CAPH pod starts, it attempts a login to cloudagent and stores the certificate in a temporary storage volume, which is cleaned out on pod restarts. Therefore, every time a pod is restarted, the certificate is destroyed and a new login attempt is made.
A login attempt starts a renewal routine, which renews the certificate when it nears expiry. The CAPH pod decides whether a login is needed based on whether the certificate is available: if the certificate is available, the login is not attempted, on the assumption that the renewal routine is already running.
However, on a container restart, the temporary directory is not cleaned, so the certificate file persists and the login attempt is not made, which prevents the renewal routine from starting. This leads to certificate expiration.
To mitigate this issue, restart the CAPH pod using the following command:
```bash
kubectl delete pod <pod-name>
```
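If you don't know the exact pod name, a lookup along these lines can help; the `caph` name fragment and the namespace are assumptions, so match them against what your cluster actually runs:
```powershell
# Find the CAPH pod and its namespace (the "caph" name fragment is an assumption).
kubectl get pods -A | Select-String caph
# Delete it; its controller recreates the pod, which triggers a fresh login.
kubectl delete pod <pod-name> -n <namespace>
```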
- question: |
Set-AksHciConfig fails with WinRM errors but shows WinRM is configured correctly
answer: |
When running [Set-AksHciConfig](./reference/ps/set-akshciconfig.md), you might encounter the following error:
```powershell
WinRM service is already running on this machine.
WinRM is already set up for remote management on this computer.
Powershell remoting to TK5-3WP08R0733 was not successful.
At C:\Program Files\WindowsPowerShell\Modules\Moc\0.2.23\Moc.psm1:2957 char:17
+ ... throw "Powershell remoting to "+$env:computername+" was n ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OperationStopped: (Powershell remo...not successful.:String) [], RuntimeException
+ FullyQualifiedErrorId : Powershell remoting to TK5-3WP08R0733 was not successful.
```
This error usually occurs as a result of a change in the user's security token (due to a change in group membership), a password change, or an expired password. In most cases, you can remediate the issue by logging off from the computer and logging back in. If the error persists, you can create a support ticket through the Azure portal.
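Before filing a ticket, a loopback remoting check can confirm whether your current token is the problem; this is a suggested diagnostic, not part of the documented remediation:
```powershell
# Verify the WinRM listener responds and loopback remoting works for the current user.
Test-WSMan -ComputerName $env:COMPUTERNAME
Invoke-Command -ComputerName $env:COMPUTERNAME -ScriptBlock { whoami }
```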
- question: |
AKS Arc cluster stuck in "ScalingControlPlane" provisioning state
answer: |
This issue causes an AKS Arc cluster to remain in the **ScalingControlPlane** state for an extended period of time.
### To reproduce
```powershell
Get-AksHciCluster -Name <cluster name> | select *
```
```output
Status : {ProvisioningState, Details}
ProvisioningState : ScalingControlPlane
KubernetesVersion : v1.22.6
PackageVersion : v1.22.6-kvapkg.0
NodePools : {nodepoolA, nodepoolB}
WindowsNodeCount : 0
LinuxNodeCount : 1
ControlPlaneNodeCount : 1
ControlPlaneVmSize : Standard_D4s_v3
AutoScalerEnabled : False
AutoScalerProfile :
Name : <cluster name>
```
This issue has been fixed in recent AKS Arc releases, so we recommend updating your AKS Arc clusters to the latest version.
If updating is not an option, contact support to manually patch the status of the control plane nodes, removing the **MachineOwnerRemediatedCondition** condition from the machine status via kubectl.
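For diagnosis only, you can inspect the machine objects and their conditions from the management cluster; the commands below are read-only sketches, and the actual status patch should still go through support:
```powershell
# List the workload cluster's machine objects from the management cluster.
kubectl --kubeconfig (Get-KvaConfig).kubeconfig get machines -l cluster.x-k8s.io/cluster-name=<cluster name>
# Dump one machine's full status to look for MachineOwnerRemediatedCondition.
kubectl --kubeconfig (Get-KvaConfig).kubeconfig get machine <machine name> -o yaml
```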
- question: |
The workload cluster is not found
answer: |
The workload cluster may not be found if the IP address pools of two AKS on Azure Stack HCI deployments are the same or overlap. If you deploy two AKS hosts and use the same `AksHciNetworkSetting` configuration for both, PowerShell and Windows Admin Center will potentially fail to find the workload cluster because the API server will be assigned the same IP address in both clusters, resulting in a conflict.
The error message you receive looks similar to the following example:
```powershell
A workload cluster with the name 'clustergroup-management' was not found.
At C:\Program Files\WindowsPowerShell\Modules\Kva\0.2.23\Common.psm1:3083 char:9
+ throw $("A workload cluster with the name '$Name' was not fou ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OperationStopped: (A workload clus... was not found.:String) [], RuntimeException
+ FullyQualifiedErrorId : A workload cluster with the name 'clustergroup-management' was not found.
```
> [!NOTE]
> Your cluster name will be different.
To resolve the issue, delete one of the clusters and create a new AKS cluster network setting that has a non-overlapping network with the other clusters.
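As a sketch, the second deployment's network settings might look like the following; every name and address is illustrative, so choose ranges that don't collide with your other deployment:
```powershell
# Define a vnet whose node and VIP pools don't overlap the first deployment's (all values illustrative).
$vnet = New-AksHciNetworkSetting -name "aksvnet2" -vSwitchName "ExternalSwitch" `
    -k8sNodeIpPoolStart "172.16.10.10" -k8sNodeIpPoolEnd "172.16.10.50" `
    -vipPoolStart "172.16.20.10" -vipPoolEnd "172.16.20.50" `
    -ipAddressPrefix "172.16.0.0/16" -gateway "172.16.0.1" -dnsServers "172.16.0.2"
Set-AksHciConfig -vnet $vnet
```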
- question: |
Get-AksHciClusterNetwork does not show the current allocation of IP addresses
answer: |
Running the [Get-AksHciClusterNetwork](./reference/ps/get-akshciclusternetwork.md) command provides a list of all virtual network configurations. However, it does not show the current allocation of IP addresses.
To find out which IP addresses are currently in use in a virtual network, follow these steps:
1. To get the group, run the following command:
```powershell
Get-MocGroup -location MocLocation
```
2. To get the list of IP addresses that are currently in use, along with the available or used virtual IP addresses, run the following command:
```powershell
Get-MocNetworkInterface -Group <groupName> | ConvertTo-Json -depth 10
```
3. Use the following command to view the list of virtual IP addresses that are currently in use:
```powershell
Get-MocLoadBalancer -Group <groupName> | ConvertTo-Json -depth 10
```
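If you want to compare the two views side by side, you can capture both outputs to files; the file names here are just illustrative:
```powershell
# Capture in-use NIC addresses and load balancer VIPs for offline comparison (file names illustrative).
Get-MocNetworkInterface -Group <groupName> | ConvertTo-Json -Depth 10 | Out-File .\nic-addresses.json
Get-MocLoadBalancer -Group <groupName> | ConvertTo-Json -Depth 10 | Out-File .\vip-addresses.json
```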
- question: |
"Cluster IP Address x.x.x.x" fails
answer: |
A cluster IP address shows as **Failed** during precheck. The precheck tests that all IPv4 addresses and DNS network names are online and reachable; an IPv4 address or network name that is offline causes this test to fail.
To resolve this, perform the following steps:
1. Run the following command:
```powershell
Get-ClusterGroup -name "Cluster Group" | Get-ClusterResource
```
1. Ensure that all network names and IP addresses are in an online state.
1. Run the following two commands:
```powershell
Stop-ClusterResource -name "Cluster IP Address x.x.x.x"
```
and then
```powershell
Start-ClusterResource -name "Cluster IP Address x.x.x.x"
```
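After restarting the resource, you can confirm it came back online; the resource name below is the same placeholder used above:
```powershell
# Confirm the cluster IP address resource is online again.
Get-ClusterResource -Name "Cluster IP Address x.x.x.x" | Select-Object Name, State
```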
- question: |
When you deploy AKS on Azure Stack HCI with a misconfigured network, deployment times out at various points
answer: |
When you deploy AKS on Azure Stack HCI, the deployment may time out at different points of the process depending on where the misconfiguration occurred. You should review the error message to determine the cause and where it occurred.
For example, in the following error, the point at which the misconfiguration occurred is in `Get-DownloadSdkRelease -Name "mocstack-stable"`:
```shell
$vnet = New-AksHciNetworkSetting
Set-AksHciConfig -vnet $vnet
Install-AksHci
VERBOSE: Initializing environment
VERBOSE: [AksHci] Importing Configuration
VERBOSE: [AksHci] Importing Configuration Completed
powershell : GetRelease - error returned by API call:
Post "https://msk8s.api.cdp.microsoft.com/api/v1.1/contents/default/namespaces/default/names/mocstack-stable/versions/0.9.7.0/files?action=generateDownloadInfo&ForegroundPriority=True":
dial tcp 52.184.220.11:443: connectex:
A connection attempt failed because the connected party did not properly
respond after a period of time, or established connection failed because
connected host has failed to respond.
At line:1 char:1
+ powershell -command { Get-DownloadSdkRelease -Name "mocstack-stable"}
```
This indicates that the physical Azure Stack HCI node can resolve the name of the download URL, `msk8s.api.cdp.microsoft.com`, but the node can't connect to the target server.
To resolve this issue, you need to determine where the breakdown occurred in the connection flow. Here are some steps to try to resolve the issue from the physical cluster node:
1. Ping the destination DNS name: `ping msk8s.api.cdp.microsoft.com`.
2. If you get a response back and no time-out, then the basic network path is working.
3. If the connection times out, there could be a break in the data path. For more information, see [check proxy settings](./set-proxy-settings.md). There could also be a break in the return path, so check the firewall rules.
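Because ping only exercises ICMP, a TCP-level check against port 443 can help separate DNS resolution from data-path failures; this is an additional suggestion, not part of the documented steps:
```powershell
# Resolve the endpoint and test TCP connectivity on port 443 in one step.
Test-NetConnection -ComputerName msk8s.api.cdp.microsoft.com -Port 443
```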
- question: |
Applications that are NTP time-dependent trigger hundreds of false alerts
answer: |
Applications that are NTP time-dependent can trigger hundreds of false alerts when they are out of time sync. If the cluster is running in a proxy environment, your nodepools may not be able to communicate with the default NTP server, **time.windows.com**, through your proxy or firewall.
### Workaround
To work around this issue, update the NTP server config on each nodepool node to synchronize the clocks. Doing so will set the clocks in each of your application pods as well.
#### Prerequisites
- An address of an NTP server that is accessible in each nodepool node.
- Access to the workload cluster **kubeconfig** file.
- Access to the management cluster **kubeconfig**.
#### Steps
1. Run the following `kubectl` command using the workload cluster **kubeconfig** to get a list of nodes in it:
```powershell
kubectl --kubeconfig <PATH TO KUBECONFIG> get nodes -o wide
```
1. Run the following `kubectl` command to correlate the node names from the previous command to the workload cluster's nodepool nodes. Take note of the relevant IP addresses from the previous command's output.
```powershell
kubectl --kubeconfig (Get-KvaConfig).kubeconfig get machine -l cluster.x-k8s.io/cluster-name=<WORKLOAD_CLUSTER_NAME>
```
1. Using the output from the previous steps, run the following steps for each nodepool node that needs its NTP config updated:
1. SSH into a nodepool node VM:
```powershell
ssh -o StrictHostKeyChecking=no -i (Get-MocConfig).sshPrivateKey clouduser@<NODEPOOL_IP>
```
1. Verify that the configured NTP server is unreachable:
```shell
sudo timedatectl timesync-status
```
If the output looks like this, then the NTP server is unreachable and it needs to be changed:
```shell
Server: n/a (time.windows.com)
Poll interval: 0 (min: 32s; max 34min 8s)
Packet count: 0
```
1. To update the NTP server, run the following commands to set the NTP server in the timesync config file to one that is accessible from the nodepool VM:
```shell
# make a backup of the config file
sudo cp /etc/systemd/timesyncd.conf ~/timesyncd.conf.bak
# update the ntp server
NTPSERVER="NEW_NTP_SERVER_ADDRESS"
sudo sed -i -E "s|^(NTP=)(.*)$|\1$NTPSERVER|g" /etc/systemd/timesyncd.conf
# verify that the NTP server is updated correctly - check the value for the field that starts with "NTP="
sudo cat /etc/systemd/timesyncd.conf
```
1. Restart the timesync service:
```shell
sudo systemctl restart systemd-timesyncd.service
```
1. Verify that the NTP server is accessible:
```shell
sudo timedatectl timesync-status
```
1. Verify that the clock shows the correct time:
```shell
date
```
1. Make sure that each nodepool node shows the same time by running the command from step 3.f.
1. If your application does not update its time automatically, restart its pods.
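If the pods are managed by a deployment, a rolling restart is one way to do this; the deployment and namespace names are placeholders:
```powershell
# Trigger a rolling restart so the application pods pick up the corrected clock.
kubectl --kubeconfig <PATH TO KUBECONFIG> rollout restart deployment <deployment name> -n <namespace>
```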
additionalContent: |
## Next steps
- [Troubleshooting overview](troubleshoot-overview.md)
- [Installation issues and errors](/azure-stack/aks-hci/known-issues-installation)
- [Storage known issues](/azure-stack/aks-hci/known-issues-storage)