[FYI] Cilium not working on bottlerocket-v1.20 #32610

Open
2 of 3 tasks
obirhuppertz opened this issue May 17, 2024 · 6 comments
Labels
kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. sig/agent Cilium agent related.

Comments

@obirhuppertz

Is there an existing issue for this?

  • I have searched the existing issues

What happened?

Just letting you know so you may join the discussion over at bottlerocket-os/bottlerocket#3968

Something changed in bottlerocket-v1.20 that currently prevents Cilium from working; it worked in v1.19.5. It appears to be a netfilter/module loading issue that causes DNS resolution failures on the node where CoreDNS is running.

Cilium Version

v1.14.9

Kernel Version

bottlerocket-v1.20

Kubernetes Version

v1.27.11-eks

Regression

No response

Sysdump

No response

Relevant log output

No response

Anything else?

No response

Cilium Users Document

  • Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • I agree to follow this project's Code of Conduct
@obirhuppertz obirhuppertz added kind/bug This is a bug in the Cilium logic. kind/community-report This was reported by a user in the Cilium community, eg via Slack. needs/triage This issue requires triaging to establish severity and next steps. labels May 17, 2024
@vigh-m

vigh-m commented May 17, 2024

To add more detail on this issue: Bottlerocket moved from xz-compressed kernel modules to gzip compression. From what I can tell, Cilium brings its own modprobe, which is essentially Ubuntu 22.04's kmod and does not have gzip support:

# cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy

# apt-cache madison kmod
      kmod | 29-1ubuntu1 | http://archive.ubuntu.com/ubuntu jammy/main amd64 Packages

# ldd /usr/sbin/modprobe
	linux-vdso.so.1 (0x00007ffcc07f7000)
	libzstd.so.1 => /lib/x86_64-linux-gnu/libzstd.so.1 (0x00007fb9dde40000)
	liblzma.so.5 => /lib/x86_64-linux-gnu/liblzma.so.5 (0x00007fb9dde15000)
	libcrypto.so.3 => /lib/x86_64-linux-gnu/libcrypto.so.3 (0x00007fb9dd9d1000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb9dd7a8000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fb9ddf3e000)

This breaks module loading for Cilium between Bottlerocket versions.

Is it possible for Cilium to use the host OS’s modprobe when it exists, instead of always using its own?
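
The mismatch described above can be checked by hand. The following is a minimal sketch, not part of Cilium or Bottlerocket: the helper names `module_compression` and `links_zlib` are hypothetical, and the zlib check assumes a kmod build that links `libz.so` dynamically when gzip support is enabled.

```shell
#!/bin/sh
# Classify a kernel module file by its compression suffix.
module_compression() {
    case "$1" in
        *.ko.gz)  echo gzip ;;
        *.ko.xz)  echo xz ;;
        *.ko.zst) echo zstd ;;
        *.ko)     echo none ;;
        *)        echo unknown ;;
    esac
}

# Read ldd-style output on stdin and print "yes" if it lists zlib
# (libz.so), "no" otherwise. libzstd.so does not match this pattern.
links_zlib() {
    if grep -q 'libz\.so'; then echo yes; else echo no; fi
}

# Example usage (not run here):
#   on the node:
#     find "/lib/modules/$(uname -r)" -name '*.ko*' | while read -r f; do
#         module_compression "$f"
#     done | sort | uniq -c
#   inside the agent container:
#     ldd /usr/sbin/modprobe | links_zlib
```

With the `ldd` output shown above (libzstd and liblzma, but no libz), `links_zlib` would print `no`, which matches the observation that this modprobe cannot read `.ko.gz` files.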

@arthur1542

We have the same issue with Cilium 1.15.3

@maodahua

Same issue as #32616.
I spent a lot of time investigating and found that DNS on the Envoy node has an issue. Then I saw this issue, and I think I have found the root cause.

@obirhuppertz
Author

@vigh-m just posted a temporary fix for the latest Bottlerocket release, v1.20.1, at bottlerocket-os/bottlerocket#3968 (comment). I can confirm that adding the following lines to values.yaml fixes the module loading issue for now.

extraVolumeMounts:
  - name: kmod-static
    mountPath: /usr/local/sbin/modprobe
    readOnly: true
  - name: kernel-modules
    mountPath: /lib/modules/
    readOnly: true

extraVolumes:
  - name: kmod-static
    hostPath:
      path: /usr/bin/kmod
      type: File
  - name: kernel-modules
    hostPath:
      path: /lib/modules/

Maybe someone from Cilium could track the issue over at Bottlerocket? It's currently only a temporary workaround, and a more permanent solution is said to be coming soon.
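
One way to confirm that the mounted host kmod is the binary the agent actually picks up is to look at the feature line that kmod prints with `--version` (for example `+ZSTD +XZ -ZLIB +LIBCRYPTO`). This is a rough sketch: `kmod_has_zlib` is a hypothetical helper, and the namespace, DaemonSet name, and container name in the usage comment are assumptions that depend on how Cilium was installed.

```shell
#!/bin/sh
# Read `modprobe --version` output on stdin and print "yes" if the
# feature line advertises zlib (gzip) support, "no" otherwise.
kmod_has_zlib() {
    # -F: match "+ZLIB" as a fixed string; "-ZLIB" means disabled.
    if grep -qF -- '+ZLIB'; then echo yes; else echo no; fi
}

# Example usage (not run here; names are assumptions):
#   kubectl -n kube-system exec ds/cilium -c cilium-agent -- \
#       modprobe --version | kmod_has_zlib
```

If this still prints `no` after the values.yaml change, the container is probably still resolving `modprobe` to its bundled binary rather than the mounted one.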

@project-administrator

project-administrator commented Jun 7, 2024

@obirhuppertz We applied the fix, thanks for the hint!
We had to roll all of our EKS k8s nodes to apply the change.
Unfortunately, when we run curl in a loop, about 5% of the requests still fail with the following error:

GET /details/1 HTTP/1.1
Host: k8s-gwapites-ciliumga-xxxxxxxxx-yyyyyyyyyyyyy.elb.eu-west-2.amazonaws.com
User-Agent: curl/8.7.1
Accept: application/json, */*

* Request completely sent off
HTTP/1.1 503 Service Unavailable
content-length: 91
content-type: text/plain
date: Fri, 07 Jun 2024 11:34:37 GMT
server: envoy

curl: (22) The requested URL returned error: 503
upstream connect error or disconnect/reset before headers. reset reason: connection timeout
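
A failure rate like the one above can be measured with a small loop. This is only a sketch of one way to do it: `count_failures` is a hypothetical helper, `$URL` is a placeholder for the gateway address, and the request count is arbitrary.

```shell
#!/bin/sh
# Read HTTP status codes (one per line) on stdin and print "failed/total".
# Anything outside 2xx/3xx counts as a failure; curl reports 000 when no
# response was received at all.
count_failures() {
    awk '{ total++; if ($1 < 200 || $1 >= 400) failed++ }
         END { printf "%d/%d\n", failed, total }'
}

# Example usage (not run here):
#   for i in $(seq 1 200); do
#       curl -s -o /dev/null -w '%{http_code}\n' "$URL/details/1"
#   done | count_failures
```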

Further investigation revealed that two nodes of type t3.medium fail to load the required modules during startup:

$ kubectl logs -n cilium cilium-sf6j6 | ag module
Defaulted container "cilium-agent" out of: cilium-agent, config (init), mount-cgroup (init), apply-sysctl-overwrites (init), mount-bpf-fs (init), clean-cilium-state (init), install-cni-binaries (init)
time="2024-06-07T11:24:29Z" level=warning msg="iptables modules could not be initialized. It probably means that iptables is not available on this system" error="could not load module iptable_raw: exit status 1" subsys=iptables

After checking the node, it turns out that the iptable_raw module is loaded, but the node still has far fewer modules loaded than the other nodes. For example, these modules are not loaded on the node: ip6table_raw, raw_diag.
Our initial guess is that cilium-agent fails to load some modules on smaller/slower nodes for some reason, so the proposed fix does not work for all node types.
We'll try to load all required modules through Bottlerocket user data settings to get a more reliable way of loading them.
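
To compare nodes as described above, the list of loaded modules can be checked against the modules one expects. The sketch below is illustrative: `missing_modules` is a hypothetical helper, the module list in the usage comment is taken only from the names mentioned in this comment, and modules built into the kernel never appear in `/proc/modules`, so a "missing" result is not always a problem.

```shell
#!/bin/sh
# Print each required module that is absent from a /proc/modules-style
# listing. $1 is the listing file; the remaining arguments are module
# names (with underscores, as the kernel reports them).
missing_modules() {
    listing=$1
    shift
    for mod in "$@"; do
        # Each line of /proc/modules starts with the module name and a space.
        grep -q "^${mod} " "$listing" || echo "$mod"
    done
}

# Example usage on a node (not run here):
#   missing_modules /proc/modules iptable_raw ip6table_raw raw_diag
```

Running this on a healthy node and on an affected node gives a quick diff of which modules failed to load.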

@ese
Contributor

ese commented Jun 7, 2024

I applied @obirhuppertz's fix in the Cilium Helm values, and all my tests for our Cilium ingresses worked fine using Bottlerocket 1.20.1 and Cilium 1.14.1.
