[BUG] Nodes become unreachable when using Cilium #3531

Closed
maybedino opened this issue Mar 13, 2023 · 16 comments

Comments

@maybedino

Describe the bug
I set up an AKS cluster with Azure CNI + Cilium and a single Standard_D4plds_v5 (ARM) node. After a few hours, or sometimes a few days, the node becomes unreachable: it shows "not ready" in the Portal and can't be reached over the network. I can't get any logs out of it either.

To Reproduce
Create the cluster:

az group create --name test-cilium-debug --location westeurope

az aks create -n aks-ciliumdebug-westeu -g test-cilium-debug -l westeurope \
  --network-plugin azure \
  --network-plugin-mode overlay \
  --pod-cidr 192.168.0.0/16 \
  --enable-cilium-dataplane \
  --node-vm-size Standard_D4plds_v5 \
  --node-count 1

Wait anywhere from a few minutes up to a few days; most of the time the failure shows up within a few hours.

Check the node status: it shows "not ready", and it is no longer possible to connect to the node (ping, SSH, etc.).
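
A quick way to watch for this from outside the node, assuming the az CLI and kubectl are available (the cluster names match the commands above; <node-name> is a placeholder):

# Pull credentials and watch node readiness flip from Ready to NotReady:
az aks get-credentials -g test-cilium-debug -n aks-ciliumdebug-westeu
kubectl get nodes -w
# Once the node drops off, its conditions report Unknown because kubelet stops posting status:
kubectl describe node <node-name>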

So far this has been 100% reproducible for me after trying ~5 times to rule out other factors. Sometimes it just takes longer, but the node always fails eventually.

Expected behavior
The node should work normally.

Environment (please complete the following information):

  • CLI Version - 2.45.0
  • Kubernetes version - Tested both 1.24.x (default) and 1.25.x
  • CLI Extension version - AKS preview 0.5.129
@maybedino maybedino added the bug label Mar 13, 2023
@rafaribe

rafaribe commented Mar 13, 2023

The same happened to me in multiple configurations, both with the preview Cilium dataplane and with BYOCNI.
Logs and kubectl describe point only to kubelet problems, nothing related to Cilium. One of the triggers that makes this happen is to create a cluster on 1.25 and then upgrade it to 1.26.0 (preview).

@sabbour
Contributor

sabbour commented Mar 14, 2023

@phealy @chasewilson can you please take a look?

@mark-angelo

mark-angelo commented Mar 14, 2023

This issue is not present in AKS 1.23, which is being deprecated at the end of this month, so we are currently unable to upgrade to the newer AKS version(s).

If this issue is not resolved promptly, we kindly ask that the deprecation timeline for AKS 1.23 be extended. Thank you!

@twendt

twendt commented Mar 14, 2023

The issue is caused by the latest systemd update. The latest node image ships an older version; after the node comes up, it is patched by the unattended upgrades enabled in Ubuntu, and during installation of that update the node goes into the NotReady state.
Unfortunately, the PR for the new node image also includes the 249.11-0ubuntu3.6 version of systemd.

It would be great if MS could provide a new node image that already includes the latest systemd package, version 249.11-0ubuntu3.7.

What helps is to restart (not re-image) the node. Of course, any nodes added later by the cluster autoscaler will run into the same issue.

The proper solution would be to use the new OS upgrade feature, but that is still in Public Preview.
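
For anyone checking whether a given node has already picked up the update, a quick look from an SSH session on the node (assuming the standard Ubuntu package and unattended-upgrades log locations) would be:

# Installed systemd version (the problematic transition is 249.11-0ubuntu3.6 -> 249.11-0ubuntu3.7):
dpkg -s systemd | grep '^Version'
# Whether unattended-upgrades has already installed it:
grep -i systemd /var/log/unattended-upgrades/unattended-upgrades.log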

@Davee02

Davee02 commented Mar 15, 2023

We're experiencing the exact same problem: AKS 1.25.4 with node pool image AKSUbuntu-2204gen2containerd-2023.02.15 and Cilium 1.12.3.

@atykhyy

atykhyy commented Mar 15, 2023

One can reproduce the issue deterministically, without waiting for an indefinite period. Create a one-node AKS cluster with a recent version (I used 1.25.4) and Cilium dataplane, SSH into the node, and run DEBIAN_FRONTEND=noninteractive sudo apt upgrade systemd -y. In my case this upgrades systemd from 249.11-0ubuntu3.6 to 249.11-0ubuntu3.7. At some point in the upgrade process the SSH connection freezes and the node transitions to Not Ready state. Manually rebooting a node instance stuck in Not Ready state cures it, so the problem appears to be with the upgrade process rather than with the new systemd itself.

A simple temporary fix for the issue is to run sudo apt-mark hold systemd on every fresh node. Another simple workaround is to create a daemonset which will pin the systemd package on every node.
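
Put together, the reproduction and the per-node pin from this comment look roughly like the following (run over SSH on the node; apt-mark unhold is the standard way to undo the pin later):

# Deterministic reproduction: the SSH session freezes partway through and the node goes NotReady
DEBIAN_FRONTEND=noninteractive sudo apt upgrade systemd -y

# Temporary workaround on a fresh node: pin systemd so unattended-upgrades skips it
sudo apt-mark hold systemd
# ...and release the pin once a fixed node image / Cilium build is rolled out
sudo apt-mark unhold systemd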

PS: the DNS resolution issue (tracking ID 2TWN-VT0) which affected a large number of Azure VMs and AKS clusters using Ubuntu last August was also caused by a faulty systemd upgrade.

@mark-angelo

@phealy @chasewilson we would be grateful if you could share an update as this is impacting our ability to upgrade from 1.23. Thank you!

@phealy
Contributor

phealy commented Mar 16, 2023

Tagging @wedaly on this and raising it internally - we'll look at this right away.

@ghost

ghost commented Mar 16, 2023

@aanandr, @phealy would you be able to assist?

@phealy
Contributor

phealy commented Mar 16, 2023

OK, this is coming from a known issue with Cilium and systemd: Cilium adds its routes with proto static, and systemd 249 (in Ubuntu 22.04) has a setting, on by default, under which networkd assumes it owns all routing on the system and removes any routes it did not place whenever the package restarts.

The easiest temporary fix would be to use the NodeOSUpgrade preview feature to disable unattended-upgrade by setting the nodes to "none", "securitypatch", or "nodeimage" (really, anything other than "unmanaged"). This prevents systemd-networkd from restarting and removing the routes when unattended-upgrade runs.
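
For reference, a sketch of that mitigation using the aks-preview CLI extension available at the time, assuming the --node-os-upgrade-channel flag (check az aks update --help for the current flag name and accepted values):

# Switch the node OS upgrade channel away from "Unmanaged" so unattended-upgrade no longer applies the systemd update
az aks update \
  --resource-group test-cilium-debug \
  --name aks-ciliumdebug-westeu \
  --node-os-upgrade-channel NodeImage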

@atykhyy

atykhyy commented Mar 28, 2023

Thank you @phealy! We were able to use that PR to solve our specific problem with Cilium on AKS. This issue, too, should go away once --enable-cilium-dataplane switches to the upcoming 1.13.2, which is planned to include that PR.

May 13: this is the correct PR to use for a private build of Cilium: cilium/cilium#25350

@ghost ghost added the action-required label Apr 22, 2023
@ghost

ghost commented Apr 27, 2023

Action required from @Azure/aks-pm

@ghost ghost added the Needs Attention 👋 Issues needs attention/assignee/owner label Apr 27, 2023
@ghost

ghost commented May 12, 2023

Issue needing attention of @Azure/aks-leads

1 similar comment
@ghost

ghost commented May 28, 2023

Issue needing attention of @Azure/aks-leads

@atykhyy

atykhyy commented May 31, 2023

The Cilium changes that fix this issue have been merged into their main branch and will almost certainly be included in the soon-to-come 1.14.0 release. Judging by the discussion in the Cilium PR thread linked above, the backport of these changes into 1.13.x is likely to take a while longer because upgrade scenarios require additional testing and possibly coding, and 1.12.x is unlikely to be attempted unless perhaps as a community PR. --enable-cilium-dataplane currently uses 1.12.8 (I've just checked this).

@wedaly

wedaly commented May 31, 2023

1.12.x is unlikely to be attempted unless perhaps as a community PR. --enable-cilium-dataplane currently uses 1.12.8 (I've just checked this).

For Azure CNI Powered by Cilium, we added an init container to the Cilium daemonset that configures systemd-networkd with the mitigation suggested in cilium/cilium#18706 (comment):

[Network]
ManageForeignRoutes=no
ManageForeignRoutingPolicyRules=no

This restores the behavior of systemd-networkd before the unattended update that caused this issue.
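
A rough sketch of the host-side effect of that init container (the drop-in file name here is illustrative; the actual container in the managed daemonset may apply it differently):

# Write a systemd-networkd drop-in telling networkd not to touch routes/rules it did not create
mkdir -p /etc/systemd/networkd.conf.d
cat <<'EOF' > /etc/systemd/networkd.conf.d/99-keep-foreign-routes.conf
[Network]
ManageForeignRoutes=no
ManageForeignRoutingPolicyRules=no
EOF
# Restart networkd so it picks up the new setting
systemctl restart systemd-networkd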

@ghost ghost removed action-required Needs Attention 👋 Issues needs attention/assignee/owner labels May 31, 2023
@wedaly wedaly closed this as completed May 31, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jun 30, 2023