### YamlMime:FAQ
metadata:
title: Troubleshoot networking issues and errors in AKS enabled by Arc
description: Get help to troubleshoot networking issues and errors in AKS Arc.
author: sethmanheim
ms.author: sethm
ms.service: azure-stack
ms.subservice: aks-hci
ms.topic: faq
ms.date: 07/11/2024
title: Fix known issues and errors when configuring a network in AKS Arc
summary: |
**Applies to: AKS on Azure Stack HCI, AKS on Windows Server**
Use this topic to help you troubleshoot and resolve networking-related issues with AKS Arc.
sections:
- name: Single section - ignored
questions:
- question: |
Error: 'Failed to start the cloud agent generic cluster service in failover cluster. The cluster resource group is in the 'failed' state.'
answer: |
The cloud agent may fail to start when path names contain spaces.
When you use [Set-AksHciConfig](./reference/ps/set-akshciconfig.md) to specify the `-imageDir`, `-workingDir`, `-cloudConfigLocation`, or `-nodeConfigLocation` parameters with a path name that contains a space character, such as `D:\Cloud Share\AKS HCI`, the cloud agent cluster service fails to start with the following (or similar) error message:
```
Failed to start the cloud agent generic cluster service in failover cluster. The cluster resource group is in the 'failed' state. Resources in 'failed' or 'pending' states: 'MOC Cloud Agent Service'
```
To work around this issue, use a path that does not include spaces, for example, `C:\CloudShare\AKS-HCI`.
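For example, a configuration along these lines avoids the issue; the drive and folder names here are illustrative:
```powershell
# All four location parameters use paths without spaces (paths are illustrative).
Set-AksHciConfig -imageDir "C:\CloudShare\AKS-HCI\Images" `
    -workingDir "C:\CloudShare\AKS-HCI\WorkingDir" `
    -cloudConfigLocation "C:\CloudShare\AKS-HCI\CloudConfig" `
    -nodeConfigLocation "C:\CloudShare\AKS-HCI\NodeConfig"
```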
- question: |
Arc-connected clusters have empty JSON "distribution" property
answer: |
Azure Arc for Kubernetes-connected clusters can have the value for the JSON property `distribution` set to an empty string. For AKS Arc-connected clusters, this value is set during initial installation and is never altered for the lifetime of the deployment.
To reproduce the issue, open an Azure PowerShell window and run the following commands:
```powershell
az login
az account set --subscription <sub_id>
az connectedk8s show -g <rg_name> -n <cc_name>
```
The following is example output from this command:
```json
{
"agentPublicKeyCertificate": <value>
"agentVersion": "1.8.14",
"azureHybridBenefit": "NotApplicable",
"connectivityStatus": "Connected",
"distribution": "",
"distributionVersion": null,
"id": "/subscriptions/<subscription>/resourceGroups/<resource group>/providers/Microsoft.Kubernetes/connectedClusters/<cluster name>",
"identity": {
"principalId": "<principal id>",
"tenantId": "<tenant id>",
"type": "SystemAssigned"
},
"infrastructure": "azure_stack_hci",
"kubernetesVersion": "1.23.12",
"lastConnectivityTime": "2022-11-04T14:59:59.050000+00:00",
"location": "eastus",
"managedIdentityCertificateExpirationTime": "2023-02-02T14:24:00+00:00",
"miscellaneousProperties": null,
"name": "mgmt-cluster",
"offering": "AzureStackHCI_AKS_Management",
"privateLinkScopeResourceId": null,
"privateLinkState": "Disabled",
"provisioningState": "Succeeded",
"resourceGroup": "<resource group>",
"systemData": {
"createdAt": "2022-11-04T14:29:17.680060+00:00",
"createdBy": "<>",
"createdByType": "Application",
"lastModifiedAt": "2022-11-04T15:00:01.830067+00:00",
"lastModifiedBy": "64b12d6e-6549-484c-8cc6-6281839ba394",
"lastModifiedByType": "Application"
},
"tags": {},
"totalCoreCount": 4,
"totalNodeCount": 1,
"type": "microsoft.kubernetes/connectedclusters"
}
```
### To resolve the issue
If the output for the JSON `distribution` property returns an empty string, follow these instructions to patch your cluster:
1. Copy the following configuration into a file called **patchBody.json**:
```json
{
"properties": {
"distribution": "aks_management"
}
}
```
> [!IMPORTANT]
> If your cluster is an AKS management cluster, the value should be set to `aks_management`. If it's a workload cluster, it should be set to `aks_workload`.
2. From the JSON output you obtained above, copy your connected cluster ID. It should appear as a long string in the following format:
`"/subscriptions/<subscription >/resourceGroups/<resource group>/providers/Microsoft.Kubernetes/connectedClusters/<cluster name>",`
3. Run the following command to patch your cluster. The `<cc_arm_id>` value should be the ID obtained in step 2:
```powershell
az rest -m patch -u "<cc_arm_id>?api-version=2022-10-01-preview" -b "@patchBody.json"
```
4. Check that your cluster has been successfully patched by running `az connectedk8s show -g <rg_name> -n <cc_name>` to make sure the JSON property `distribution` has been correctly set.
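Steps 2 through 4 can also be scripted; here's a sketch, assuming the Azure CLI is already signed in to the right subscription (`--query id -o tsv` extracts the cluster ID directly):
```powershell
# Fetch the connected cluster's ARM ID, patch the distribution property, then verify.
$ccArmId = az connectedk8s show -g <rg_name> -n <cc_name> --query id -o tsv
az rest -m patch -u "${ccArmId}?api-version=2022-10-01-preview" -b "@patchBody.json"
az connectedk8s show -g <rg_name> -n <cc_name> --query distribution -o tsv
```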
- question: |
The WSSDAgent service is stuck while starting and fails to connect to the cloud agent
answer: |
Symptoms:
- A proxy is enabled in AKS Arc, and the WSSDAgent service is stuck in the `starting` state.
The issue shows up as follows:
- `Test-NetConnection -ComputerName <computer IP/Name> -Port <port>` from the node where the node agent is failing, toward the cloud agent, works properly (even when the wssdagent fails to start).
- Running curl.exe from the node on which the agent is failing, toward the cloud agent, reproduces the problem and gets stuck:
`curl.exe https://<computerIP>:65000`
- Passing the `--noproxy` flag to curl.exe fixes the problem. Curl returns an error from wssdcloudagent, which is expected because curl is not a gRPC client. Curl doesn't get stuck waiting when you send the `--noproxy` flag, so returning an error is considered a success here:
```bash
curl.exe --noproxy '*' https://<computerIP>:65000
```
It is likely the proxy settings were changed to a faulty proxy on the host. The proxy settings for AKS Arc are environment variables that are inherited from the parent process on the host. These settings only get propagated when a new service starts or an old one updates or reboots. It is possible that faulty proxy settings were set on the host, and they were propagated to the WSSDAgent after an update or reboot, which has caused the WSSDAgent to fail.
To fix the problem, change the proxy environment variables on the machine with the following commands:
```powershell
[Environment]::SetEnvironmentVariable("https_proxy", <Value>, "Machine")
[Environment]::SetEnvironmentVariable("http_proxy", <Value>, "Machine")
[Environment]::SetEnvironmentVariable("no_proxy", <Value>, "Machine")
```
Reboot the machine so that the service manager and the WSSDAgent pick up the fixed proxy settings.
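To confirm the machine-scoped values before and after the reboot, you can read them back; this check is a suggestion, not part of the documented fix:
```powershell
# Read back the machine-scoped proxy variables to verify the new values.
foreach ($name in "https_proxy", "http_proxy", "no_proxy") {
    "{0} = {1}" -f $name, [Environment]::GetEnvironmentVariable($name, "Machine")
}
```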
- question: |
CAPH pod fails to renew certificate
answer: |
This error occurs because every time the CAPH pod starts, it attempts a login to cloudagent and stores the certificate in a temporary storage volume, which is cleaned out on pod restarts. Therefore, every time a pod is restarted, the certificate is destroyed and a new login attempt is made.
A login attempt starts a renewal routine, which renews the certificate when it nears expiry. The CAPH pod decides whether a login is needed based on whether the certificate is available: if the certificate is available, the login is not attempted, on the assumption that the renewal routine is already running.
However, on a container restart, the temporary directory is not cleaned, so the certificate file persists and the login attempt is not made, which prevents the renewal routine from starting. This leads to certificate expiration.
To mitigate this issue, restart the CAPH pod using the following command:
```bash
kubectl delete pod <pod-name>
```
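If you don't know the exact pod name, a lookup along these lines can help; the `caph` name fragment and the namespace are assumptions, so match them against what your cluster actually runs:
```powershell
# Find the CAPH pod and its namespace (the "caph" name fragment is an assumption).
kubectl get pods -A | Select-String caph
# Delete it; its controller recreates the pod, which triggers a fresh login.
kubectl delete pod <pod-name> -n <namespace>
```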
- question: |
Set-AksHciConfig fails with WinRM errors but shows WinRM is configured correctly
answer: |
When running [Set-AksHciConfig](./reference/ps/set-akshciconfig.md), you might encounter the following error:
```powershell
WinRM service is already running on this machine.
WinRM is already set up for remote management on this computer.
Powershell remoting to TK5-3WP08R0733 was not successful.
At C:\Program Files\WindowsPowerShell\Modules\Moc\0.2.23\Moc.psm1:2957 char:17
+ ... throw "Powershell remoting to "+$env:computername+" was n ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OperationStopped: (Powershell remo...not successful.:String) [], RuntimeException
+ FullyQualifiedErrorId : Powershell remoting to TK5-3WP08R0733 was not successful.
```
This error usually occurs as a result of a change in the user's security token (due to a change in group membership), a password change, or an expired password. In most cases, you can remediate the issue by logging off from the computer and logging back in. If the error persists, you can create a support ticket through the Azure portal.
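Before filing a ticket, a loopback remoting check can confirm whether your current token is the problem; this is a suggested diagnostic, not part of the documented remediation:
```powershell
# Verify the WinRM listener responds and loopback remoting works for the current user.
Test-WSMan -ComputerName $env:COMPUTERNAME
Invoke-Command -ComputerName $env:COMPUTERNAME -ScriptBlock { whoami }
```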
- question: |
AKS Arc cluster stuck in "ScalingControlPlane" provisioning state
answer: |
This issue causes an AKS Arc cluster to remain in the **ScalingControlPlane** state for an extended period of time.
### To reproduce
```powershell
Get-AksHciCluster -Name <cluster name> | select *
```
```output
Status : {ProvisioningState, Details}
ProvisioningState : ScalingControlPlane
KubernetesVersion : v1.22.6
PackageVersion : v1.22.6-kvapkg.0
NodePools : {nodepoolA, nodepoolB}
WindowsNodeCount : 0
LinuxNodeCount : 1
ControlPlaneNodeCount : 1
ControlPlaneVmSize : Standard_D4s_v3
AutoScalerEnabled : False
AutoScalerProfile :
Name : <cluster name>
```
This issue has been fixed in recent AKS Arc releases, so we recommend updating your AKS Arc clusters to the latest version.
If updating is not an option, contact support to manually patch the status of the control plane nodes, removing the **MachineOwnerRemediatedCondition** condition from the machine status via kubectl.
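For diagnosis only, you can inspect the machine objects and their conditions from the management cluster; the commands below are read-only sketches, and the actual status patch should still go through support:
```powershell
# List the workload cluster's machine objects from the management cluster.
kubectl --kubeconfig (Get-KvaConfig).kubeconfig get machines -l cluster.x-k8s.io/cluster-name=<cluster name>
# Dump one machine's full status to look for MachineOwnerRemediatedCondition.
kubectl --kubeconfig (Get-KvaConfig).kubeconfig get machine <machine name> -o yaml
```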
- question: |
The workload cluster is not found
answer: |
The workload cluster may not be found if the IP address pools of two AKS on Azure Stack HCI deployments are the same or overlap. If you deploy two AKS hosts and use the same `AksHciNetworkSetting` configuration for both, PowerShell and Windows Admin Center will potentially fail to find the workload cluster because the API server will be assigned the same IP address in both clusters, resulting in a conflict.
The error message you receive looks similar to the following example:
```powershell
A workload cluster with the name 'clustergroup-management' was not found.
At C:\Program Files\WindowsPowerShell\Modules\Kva\0.2.23\Common.psm1:3083 char:9
+ throw $("A workload cluster with the name '$Name' was not fou ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : OperationStopped: (A workload clus... was not found.:String) [], RuntimeException
+ FullyQualifiedErrorId : A workload cluster with the name 'clustergroup-management' was not found.
```
> [!NOTE]
> Your cluster name will be different.
To resolve the issue, delete one of the clusters and create a new AKS cluster network setting that has a non-overlapping network with the other clusters.
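As a sketch, the second deployment's network settings might look like the following; every name and address is illustrative, so choose ranges that don't collide with your other deployment:
```powershell
# Define a vnet whose node and VIP pools don't overlap the first deployment's (all values illustrative).
$vnet = New-AksHciNetworkSetting -name "aksvnet2" -vSwitchName "ExternalSwitch" `
    -k8sNodeIpPoolStart "172.16.10.10" -k8sNodeIpPoolEnd "172.16.10.50" `
    -vipPoolStart "172.16.20.10" -vipPoolEnd "172.16.20.50" `
    -ipAddressPrefix "172.16.0.0/16" -gateway "172.16.0.1" -dnsServers "172.16.0.2"
Set-AksHciConfig -vnet $vnet
```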
- question: |
Get-AksHciClusterNetwork does not show the current allocation of IP addresses
answer: |
Running the [Get-AksHciClusterNetwork](./reference/ps/get-akshciclusternetwork.md) command provides a list of all virtual network configurations. However, it does not show the current allocation of IP addresses.
To find out which IP addresses are currently in use in a virtual network, follow these steps:
1. To get the group, run the following command:
```powershell
Get-MocGroup -location MocLocation
```
2. To get the list of IP addresses that are currently in use, along with the available or used virtual IP addresses, run the following command:
```powershell
Get-MocNetworkInterface -Group <groupName> | ConvertTo-Json -depth 10
```
3. Use the following command to view the list of virtual IP addresses that are currently in use:
```powershell
Get-MocLoadBalancer -Group <groupName> | ConvertTo-Json -depth 10
```
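If you want to compare the two views side by side, you can capture both outputs to files; the file names here are just illustrative:
```powershell
# Capture in-use NIC addresses and load balancer VIPs for offline comparison (file names illustrative).
Get-MocNetworkInterface -Group <groupName> | ConvertTo-Json -Depth 10 | Out-File .\nic-addresses.json
Get-MocLoadBalancer -Group <groupName> | ConvertTo-Json -Depth 10 | Out-File .\vip-addresses.json
```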
- question: |
"Cluster IP Address x.x.x.x" fails
answer: |
A cluster IP address shows as **Failed** during precheck. The precheck tests that all IPv4 addresses and DNS network names are online and reachable; an IPv4 address or network name that is offline causes this test to fail.
To resolve this, perform the following steps:
1. Run the following command:
```powershell
Get-ClusterGroup -name "Cluster Group" | Get-ClusterResource
```
1. Ensure that all network names and IP addresses are in an online state.
1. Run the following two commands:
```powershell
Stop-ClusterResource -name "Cluster IP Address x.x.x.x"
```
and then
```powershell
Start-ClusterResource -name "Cluster IP Address x.x.x.x"
```
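After restarting the resource, you can confirm it came back online; the resource name below is the same placeholder used above:
```powershell
# Confirm the cluster IP address resource is online again.
Get-ClusterResource -Name "Cluster IP Address x.x.x.x" | Select-Object Name, State
```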
- question: |
When you deploy AKS on Azure Stack HCI with a misconfigured network, deployment times out at various points
answer: |
When you deploy AKS on Azure Stack HCI, the deployment may time out at different points of the process depending on where the misconfiguration occurred. You should review the error message to determine the cause and where it occurred.
For example, in the following error, the point at which the misconfiguration occurred is in `Get-DownloadSdkRelease -Name "mocstack-stable"`:
```shell
$vnet = New-AksHciNetworkSetting
Set-AksHciConfig -vnet $vnet
Install-AksHci
VERBOSE: Initializing environment
VERBOSE: [AksHci] Importing Configuration
VERBOSE: [AksHci] Importing Configuration Completed
powershell : GetRelease - error returned by API call:
Post "https://msk8s.api.cdp.microsoft.com/api/v1.1/contents/default/namespaces/default/names/mocstack-stable/versions/0.9.7.0/files?action=generateDownloadInfo&ForegroundPriority=True":
dial tcp 52.184.220.11:443: connectex:
A connection attempt failed because the connected party did not properly
respond after a period of time, or established connection failed because
connected host has failed to respond.
At line:1 char:1
+ powershell -command { Get-DownloadSdkRelease -Name "mocstack-stable"}
```
This indicates that the physical Azure Stack HCI node can resolve the name of the download URL, `msk8s.api.cdp.microsoft.com`, but the node can't connect to the target server.
To resolve this issue, you need to determine where the breakdown occurred in the connection flow. Here are some steps to try to resolve the issue from the physical cluster node:
1. Ping the destination DNS name: `ping msk8s.api.cdp.microsoft.com`.
2. If you get a response back and no time-out, then the basic network path is working.
3. If the connection times out, there could be a break in the data path. For more information, see [check proxy settings](./set-proxy-settings.md). There could also be a break in the return path, so check the firewall rules.
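Because ping only exercises ICMP, a TCP-level check against port 443 can help separate DNS resolution from data-path failures; this is an additional suggestion, not part of the documented steps:
```powershell
# Resolve the endpoint and test TCP connectivity on port 443 in one step.
Test-NetConnection -ComputerName msk8s.api.cdp.microsoft.com -Port 443
```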
- question: |
Applications that are NTP time-dependent trigger hundreds of false alerts
answer: |
Applications that are NTP time-dependent can trigger hundreds of false alerts when they are out of time sync. If the cluster is running in a proxy environment, your nodepools may not be able to communicate with the default NTP server, **time.windows.com**, through your proxy or firewall.
### Workaround
To work around this issue, update the NTP server config on each nodepool node to synchronize the clocks. Doing so will set the clocks in each of your application pods as well.
#### Prerequisites
- An address of an NTP server that is accessible in each nodepool node.
- Access to the workload cluster **kubeconfig** file.
- Access to the management cluster **kubeconfig**.
#### Steps
1. Run the following `kubectl` command using the workload cluster **kubeconfig** to get a list of nodes in it:
```powershell
kubectl --kubeconfig <PATH TO KUBECONFIG> get nodes -o wide
```
1. Run the following `kubectl` command to correlate the node names from the previous command to the workload cluster's nodepool nodes. Take note of the relevant IP addresses from the previous command's output.
```powershell
kubectl --kubeconfig (Get-KvaConfig).kubeconfig get machine -l cluster.x-k8s.io/cluster-name=<WORKLOAD_CLUSTER_NAME>
```
1. Using the output from the previous steps, run the following steps for each nodepool node that needs its NTP config updated:
1. SSH into a nodepool node VM:
```powershell
ssh -o StrictHostKeyChecking=no -i (Get-MocConfig).sshPrivateKey clouduser@<NODEPOOL_IP>
```
1. Verify that the configured NTP server is unreachable:
```shell
sudo timedatectl timesync-status
```
If the output looks like this, then the NTP server is unreachable and it needs to be changed:
```shell
Server: n/a (time.windows.com)
Poll interval: 0 (min: 32s; max 34min 8s)
Packet count: 0
```
1. To update the NTP server, run the following commands to set the NTP server in the timesync config file to one that is accessible from the nodepool VM:
```shell
# make a backup of the config file
sudo cp /etc/systemd/timesyncd.conf ~/timesyncd.conf.bak
# update the ntp server
NTPSERVER="NEW_NTP_SERVER_ADDRESS"
sudo sed -i -E "s|^(NTP=)(.*)$|\1$NTPSERVER|g" /etc/systemd/timesyncd.conf
# verify that the NTP server is updated correctly - check the value for the field that starts with "NTP="
sudo cat /etc/systemd/timesyncd.conf
```
1. Restart the timesync service:
```shell
sudo systemctl restart systemd-timesyncd.service
```
1. Verify that the NTP server is accessible:
```shell
sudo timedatectl timesync-status
```
1. Verify that the clock shows the correct time:
```shell
date
```
1. Make sure that each nodepool node shows the same time by running the command from step 3.f.
1. If your application does not update its time automatically, restart its pods.
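If the pods are managed by a deployment, a rolling restart is one way to do this; the deployment and namespace names are placeholders:
```powershell
# Trigger a rolling restart so the application pods pick up the corrected clock.
kubectl --kubeconfig <PATH TO KUBECONFIG> rollout restart deployment <deployment name> -n <namespace>
```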
additionalContent: |
## Next steps
- [Troubleshooting overview](troubleshoot-overview.md)
- [Installation issues and errors](/azure-stack/aks-hci/known-issues-installation)
- [Storage known issues](/azure-stack/aks-hci/known-issues-storage)