Changes to kernel module compression can break certain workflows #3968

Open
vigh-m opened this issue May 16, 2024 · 11 comments · Fixed by #4007
Labels
area/core (Issues core to the OS, variant independent) · status/needs-proposal (Needs a more detailed proposal for next steps) · type/bug (Something isn't working)

Comments

vigh-m (Contributor) commented May 16, 2024

Image I'm using:
All Bottlerocket OS 1.20.0 variants

What I expected to happen:
Updating to Bottlerocket v1.20.0 should not have broken any existing container workflows.

What actually happened:
Launching Bottlerocket v1.20.0 can result in unexpected errors like:

level=warning msg="iptables modules could not be initialized. It probably means that iptables is not available on this system" error="could not load module iptable_raw: exit status

or

Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process:

Root Cause:
Following changes in Bottlerocket v1.20.0, we dropped support for xz-compressed kernel modules in favour of gz compression. We're seeing failures to launch certain containers that require kernel modules and ship a modprobe built without libz (gzip) support.

Known Affected systems:

  • Cilium

Workaround:
A workaround is to use user data on new launches, or apiclient on running hosts, to set settings.kernel.modules.<name>.autoload so that the required modules are loaded, as described here.

In the case of Cilium, the commands would look like the following when using apiclient:

apiclient set \
  settings.kernel.modules.ip_tables.allowed=true \
  settings.kernel.modules.ip_tables.autoload=true \
  settings.kernel.modules.iptable_nat.allowed=true \
  settings.kernel.modules.iptable_nat.autoload=true \
  settings.kernel.modules.iptable_mangle.allowed=true \
  settings.kernel.modules.iptable_mangle.autoload=true \
  settings.kernel.modules.iptable_raw.allowed=true \
  settings.kernel.modules.iptable_raw.autoload=true \
  settings.kernel.modules.iptable_filter.allowed=true \
  settings.kernel.modules.iptable_filter.autoload=true \
  settings.kernel.modules.ip6table_mangle.allowed=true \
  settings.kernel.modules.ip6table_mangle.autoload=true \
  settings.kernel.modules.ip6table_raw.allowed=true \
  settings.kernel.modules.ip6table_raw.autoload=true \
  settings.kernel.modules.ip6table_filter.allowed=true \
  settings.kernel.modules.ip6table_filter.autoload=true

OR via user-data.toml:

[settings.kernel.modules.ip_tables]
allowed = true
autoload = true

[settings.kernel.modules.iptable_nat]
allowed = true
autoload = true

[settings.kernel.modules.iptable_mangle]
allowed = true
autoload = true

[settings.kernel.modules.iptable_raw]
allowed = true
autoload = true

[settings.kernel.modules.iptable_filter]
allowed = true
autoload = true

[settings.kernel.modules.ip6table_mangle]
allowed = true
autoload = true

[settings.kernel.modules.ip6table_raw]
allowed = true
autoload = true

[settings.kernel.modules.ip6table_filter]
allowed = true
autoload = true
@obirhuppertz

(I'm adding to this issue to keep all information in one place.)

We have a similar issue right now, since one of our clusters updated to Bottlerocket v1.20.

I can confirm that adding the missing module (in our case iptable_raw) stops Cilium from complaining about missing modules, but it does not fix the DNS issues when a pod is on the same node as CoreDNS.

Deploying a debug pod and resolving any DNS name on a node where a CoreDNS pod lives results in "i/o timeouts" that can be seen in the cilium-agent logs IF you have a CiliumClusterwideNetworkPolicy in your cluster. After reading this issue I debugged this a little bit more.

It turns out that an important factor is having a CiliumClusterwideNetworkPolicy. It does not have to deny anything; it just has to be there in order to trigger the "i/o timeout". If that policy is deleted, internal DNS resolution to services works, but external DNS still does not.

apiVersion: v1
items:
- apiVersion: cilium.io/v2
  kind: CiliumClusterwideNetworkPolicy
  metadata:
    name: default
  spec:
    egress:
    - toEntities:
      - cluster
    - toEndpoints:
      - matchLabels:
          io.kubernetes.pod.namespace: kube-system
          k8s-app: kube-dns
      toPorts:
      - ports:
        - port: "53"
          protocol: UDP
        rules:
          dns:
          - matchPattern: '*'
    - toCIDRSet:
      - cidr: 0.0.0.0/0

Can anyone else confirm having DNS issues with Bottlerocket v1.20 and Cilium (pods on the same host as CoreDNS can't resolve DNS)?

vigh-m (Contributor, Author) commented May 17, 2024

Thanks for opening the issue with Cilium. I've added more details there as well.

Was this configuration working for you on Bottlerocket v1.19.5? There was a change made some time ago that caused DNS issues when binding to 0.0.0.0:53 (documented here), and I suspect the two could be related.

Also, Cilium requires the modules listed here to be loaded. Can you validate that they are all loaded? I have updated the issue description with commands to load all the required kernel modules.
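
A quick way to check from the host, if that's easier (a rough sketch; I'm assuming access to the node via the admin container or a privileged debug pod, and the grep pattern only covers the iptables family from the list above):

# Show which netfilter-related modules are currently loaded on the node
lsmod | grep -E 'ip_tables|iptable_|ip6table_'

# Show which modules Bottlerocket has been told to allow/autoload
apiclient get settings.kernel.modules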

obirhuppertz commented May 21, 2024

@vigh-m our configuration works with BR 1.19.5.

Only the AMI is different. I just tested deploying with your changed user_data. CoreDNS won't start with the new configuration, complaining about i/o timeouts when trying to read via UDP from the cluster IP to the host IP. Will try with a newer Cilium version.

The required modules were loaded in BR 1.19.5; in 1.20 only iptable_raw was missing. I added it via autoload -> node started. Tried again today with your full module-load user_data, with the results reported above.

Next up:
Will roll back to BR 1.19.5 (no user_data modification) -> update Cilium -> update to BR 1.20, and report back.

(I saw over at the Cilium issue tracker that someone mentioned they have the same issue with 1.15.3, but I will try anyway as our configuration might be different.)

@obirhuppertz

@vigh-m today I investigated a bit further and followed your path first. I think you are on the right track here regarding modules that are missing or not loaded. Since I had no luck using your settings.kernel.modules settings, I spawned a Bottlerocket v1.19.5 and a v1.20.0 instance and compared their /proc/config.gz and loaded modules against each other, checking them against https://docs.cilium.io/en/stable/operations/system_requirements/.

The relevant requirement checks for Cilium report the following:

###
# checking base_requirements ...
###
WARN: CONFIG_NET_CLS_BPF != y. Found m
WARN: CONFIG_NET_SCH_INGRESS != y. Found m
WARN: CONFIG_CRYPTO_USER_API_HASH != y. Found m
###
# checking itpables-based masq requirements ...
###
>>> passed all checks
###
# checking L7 and FQDN Policies requirements ...
###
>>> passed all checks
###
# checking ipsec requirements = y ...
###
>>> passed all checks
###
# checking ipsec requirements = m ...
###
WARN: CONFIG_INET_XFRM_MODE_TUNNEL != m. Found undefined
WARN: CONFIG_CRYPTO_AEAD != m. Found y
WARN: CONFIG_CRYPTO_AEAD2 != m. Found y
WARN: CONFIG_CRYPTO_SEQIV != m. Found y
WARN: CONFIG_CRYPTO_HMAC != m. Found y
WARN: CONFIG_CRYPTO_SHA256 != m. Found y
WARN: CONFIG_CRYPTO_AES != m. Found y
###
# checking bandwith_manager requirements ...
###
>>> passed all checks

This result is identical for v1.19.5 and v1.20.0, so I would rule out missing CONFIG settings here. The lsmod comparison between the two instances gave more promising results.

To reproduce:

  1. Run lsmod via kubectl exec (or another method) and capture the output for Bottlerocket v1.19.5
  2. Repeat for v1.20.0
  3. Store each output as lsmod-<version>.txt
  4. cat lsmod-1.19.txt | awk '{ printf("%20s %s %s\n", $1, $3, $4) }' | sed 1,1d > lsmod-1.19_diff.txt
  5. cat lsmod-1.20.txt | awk '{ printf("%20s %s %s\n", $1, $3, $4) }' | sed 1,1d > lsmod-1.20_diff.txt
  6. diff -u lsmod-1.19_diff.txt lsmod-1.20_diff.txt > lsmod_br-diff.txt
--- lsmod-1.19_diff.txt	2024-05-21 16:04:32.422565616 +0200
+++ lsmod-1.20_diff.txt	2024-05-21 16:04:39.150647799 +0200
@@ -1,22 +1,3 @@
-     rpcsec_gss_krb5 0 
-         auth_rpcgss 1 rpcsec_gss_krb5
-               nfsv4 0 
-        dns_resolver 1 nfsv4
-                 nfs 1 nfsv4
-               lockd 1 nfs
-               grace 1 lockd
-              sunrpc 5 rpcsec_gss_krb5,auth_rpcgss,nfsv4,nfs,lockd
-             fscache 1 nfs
-nf_conntrack_netlink 0 
-       nft_chain_nat 0 
-         nft_counter 0 
-         xt_addrtype 0 
-          nft_compat 0 
-        br_netfilter 0 
-              bridge 1 br_netfilter
-                 stp 1 bridge
-                 llc 2 bridge,stp
-                 tls 0 
            xt_TPROXY 2 
       nf_tproxy_ipv6 1 xt_TPROXY
       nf_tproxy_ipv4 1 xt_TPROXY
@@ -24,29 +5,25 @@
               xt_nat 2 
        xt_MASQUERADE 1 
                xt_CT 5 
-             xt_mark 19 
-             cls_bpf 17 
-         sch_ingress 9 
+             xt_mark 17 
+         iptable_raw 1 
+             cls_bpf 19 
+         sch_ingress 10 
                vxlan 0 
       ip6_udp_tunnel 1 vxlan
           udp_tunnel 1 vxlan
            xfrm_user 1 
            xfrm_algo 1 xfrm_user
                 veth 0 
-           xt_socket 1 
-      nf_socket_ipv4 1 xt_socket
-      nf_socket_ipv6 1 xt_socket
-        ip6table_raw 0 
-         iptable_raw 1 
-           nf_tables 3 nft_chain_nat,nft_counter,nft_compat
-           nfnetlink 5 nf_conntrack_netlink,nft_compat,ip_set,nf_tables
+           nf_tables 0 
+           nfnetlink 3 ip_set,nf_tables
      ip6table_filter 1 
         ip6table_nat 1 
          iptable_nat 1 
-              nf_nat 5 nft_chain_nat,xt_nat,xt_MASQUERADE,ip6table_nat,iptable_nat
+              nf_nat 4 xt_nat,xt_MASQUERADE,ip6table_nat,iptable_nat
      ip6table_mangle 1 
         xt_conntrack 1 
-          xt_comment 33 
+          xt_comment 32 
       iptable_mangle 1 
       iptable_filter 1 
                  drm 0 
@@ -55,21 +32,21 @@
             squashfs 2 
       lz4_decompress 1 squashfs
                 loop 4 
-             overlay 34 
+             overlay 29 
         crc32_pclmul 0 
-        crc32c_intel 5 
+        crc32c_intel 4 
  ghash_clmulni_intel 0 
          aesni_intel 0 
-                 ena 0 
          crypto_simd 1 aesni_intel
+                 ena 0 
               cryptd 2 ghash_clmulni_intel,crypto_simd
                  ptp 1 ena
             pps_core 1 ptp
               button 0 
-        sch_fq_codel 5 
-        nf_conntrack 6 nf_conntrack_netlink,xt_nat,xt_MASQUERADE,xt_CT,nf_nat,xt_conntrack
-      nf_defrag_ipv6 3 xt_TPROXY,xt_socket,nf_conntrack
-      nf_defrag_ipv4 3 xt_TPROXY,xt_socket,nf_conntrack
+        sch_fq_codel 3 
+        nf_conntrack 5 xt_nat,xt_MASQUERADE,xt_CT,nf_nat,xt_conntrack
+      nf_defrag_ipv6 2 xt_TPROXY,nf_conntrack
+      nf_defrag_ipv4 2 xt_TPROXY,nf_conntrack
                 fuse 1 
             configfs 1 
            dmi_sysfs 0 

My guess would be that the following modules have to be loaded as well, or network policies will not work. They did get loaded in Bottlerocket v1.19.5:

  • nf_conntrack_netlink
  • br_netfilter
  • bridge
  • xt_addrtype
  • xt_socket
  • nf_socket_ipv4
  • nf_socket_ipv6

nf_tables had nft_chain_nat, nft_counter, and nft_compat loaded against it in v1.19.5, but those are not showing up in Bottlerocket v1.20.0.

As mentioned here https://docs.cilium.io/en/stable/operations/system_requirements/#requirements-for-l7-and-fqdn-policies xt_socket is a hard requirement for our setup.
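
In case it helps, declaring these the same way as the modules in the issue description would look something like the following (a sketch only; I haven't verified that every one of these can be autoloaded through settings.kernel.modules, and dependencies such as bridge, nf_socket_ipv4 and nf_socket_ipv6 should be pulled in automatically when the modules that need them load):

apiclient set \
  settings.kernel.modules.nf_conntrack_netlink.allowed=true \
  settings.kernel.modules.nf_conntrack_netlink.autoload=true \
  settings.kernel.modules.br_netfilter.allowed=true \
  settings.kernel.modules.br_netfilter.autoload=true \
  settings.kernel.modules.xt_addrtype.allowed=true \
  settings.kernel.modules.xt_addrtype.autoload=true \
  settings.kernel.modules.xt_socket.allowed=true \
  settings.kernel.modules.xt_socket.autoload=true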

@vigh-m though we are seeing different modules not loaded, I would concur with your initial finding regarding Cilium not being able to load its modules.

I believe I do not need to update to a later version of Cilium right now to further strengthen your findings. It seems pretty clear to me that the underlying issue is in fact modules not being loaded, which breaks essential features of Cilium.

Regards,

vigh-m (Contributor, Author) commented May 22, 2024

Hi @obirhuppertz,

Thank you so much for your deep dive on this. It seems pretty clear that the kernel modules not being loaded are the root cause of the issue. I don't think different versions of cilium will behave differently here.

The list of modules in the initial issue comment is non-exhaustive, which is what's causing the other failures you are seeing. As we discover more such missing modules, that list is bound to get longer.

We're investigating options to build a version of modprobe that can be mounted into customer containers, allowing for another workaround to this issue besides passing in the right user data. Hope to have more details on that soon.

@scbunn-avalara

Upgrading to 1.20 has broken Antrea with

[2024-05-30T18:50:59Z INFO install_cni_chaining]: updating CNI conf file 10-aws.conflist -> 05-antrea.conflist
[2024-05-30T18:50:59Z INFO install_cni_chaining]: CNI conf file is already up-to-date
uid=0(root) gid=0(root) groups=0(root)
modprobe: ERROR: could not insert 'openvswitch': Operation not permitted
Failed to load the OVS kernel module from the container, try running 'modprobe openvswitch' on your Nodes

Will investigate adding this module to user-data.
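
For reference, the apiclient equivalent (assuming the settings.kernel.modules mechanism from the issue description applies to openvswitch as well) would be something like:

apiclient set \
  settings.kernel.modules.openvswitch.allowed=true \
  settings.kernel.modules.openvswitch.autoload=true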

bcressey (Contributor) commented Jun 4, 2024

Reopening since most of these changes have been reverted because of an issue found during testing with Alpine Linux, where the injected mount overwrote the busybox binary and rendered the container useless.
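
For context on the Alpine failure: modprobe there is a busybox applet reached through a symlink, which is presumably how a mount at the old path ended up on the busybox binary itself (a quick way to see the layout in a stock Alpine image):

# Alpine ships modprobe as a busybox applet behind a symlink
docker run --rm alpine ls -l /sbin/modprobe
# typically shows: /sbin/modprobe -> /bin/busybox
# A bind mount at that path resolves through the symlink, so the mounted
# file replaces busybox and every other applet in the image breaks.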

bcressey (Contributor) commented Jun 4, 2024

Based on a survey of container images, it seems like /usr/local/sbin/modprobe might be a safer path.
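
For what it's worth, /usr/local/sbin sits ahead of /usr/sbin and /sbin in the default PATH of most images and is usually empty, so a file mounted there shadows the image's own modprobe without overwriting anything (a quick check against a stock Alpine image; the PATH shown is the typical Docker default):

docker run --rm alpine sh -c 'echo $PATH'
# typically: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin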

vigh-m (Contributor, Author) commented Jun 6, 2024

The Bottlerocket 1.20.1 release contains a new version of kmod, built statically so that customer containers can mount it. This ensures that the OS-provided kernel modules can be loaded from inside the container without compatibility issues.
Here is a sample Kubernetes pod spec that enables this:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: <my-image>
    securityContext:
      capabilities:
        add: ["SYS_MODULE"]
    volumeMounts:
    - name: kmod-static
      mountPath: /usr/local/sbin/modprobe
      readOnly: true
    - name: kernel-modules
      mountPath: /lib/modules/
      readOnly: true
  volumes:
    - name: kmod-static
      hostPath:
        path: /usr/bin/kmod
        type: File
    - name: kernel-modules
      hostPath:
        path: /lib/modules/

This should allow containers to be launched with the correct packages and modules to work with Cilium and any other such workloads.

This is a temporary workaround to enable existing workloads. A more permanent fix is in the works.
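
To sanity-check the mount from inside the pod, a dry-run modprobe should resolve modules against the host's /lib/modules (a quick check, reusing the pod name from the spec above and iptable_raw as an example module):

# -n/--dry-run prints what would be loaded without actually inserting it
kubectl exec my-pod -- /usr/local/sbin/modprobe -n -v iptable_raw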

@obirhuppertz

@vigh-m Thanks for the update! I can confirm that adding the following lines to our values.yaml results in a working Cilium again, meaning the workaround fixes the Cilium issue for us on bottlerocket-v1.20.1.

extraVolumeMounts:
  - name: kmod-static
    mountPath: /usr/local/sbin/modprobe
    readOnly: true
  - name: kernel-modules
    mountPath: /lib/modules/
    readOnly: true

extraVolumes:
  - name: kmod-static
    hostPath:
      path: /usr/bin/kmod
      type: File
  - name: kernel-modules
    hostPath:
      path: /lib/modules/
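
For anyone following along, we roll this out with a regular Helm upgrade (release name, chart repo alias, and namespace below reflect our setup and may differ in yours):

helm upgrade cilium cilium/cilium \
  --namespace kube-system \
  --reuse-values \
  -f values.yaml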

@project-administrator

Thank you @vigh-m. This fix resolves the issue only partially for us; we're still seeing about a 5% request failure rate with the same timeout error.
