Calico-node pod refuses to start on google coral dev board without the nf_conntrack_netlink kernel module #8726

Open
JOUNAIDSoufiane opened this issue Apr 17, 2024 · 7 comments


JOUNAIDSoufiane commented Apr 17, 2024

Let me preface this by saying that this is an unusual setup scenario and that I am not running Calico in its ideal environment. If you do not care about the context of why we are trying to start Calico without nf_conntrack_netlink, please skip ahead to the Expected Behavior and Current Behavior headings.

Context / Environment

We are working on enrolling the Google Coral Dev Board onto our existing balena fleet, which runs a collection of Raspberry Pis and NVIDIA Jetson Nanos in the following configuration:

  • Flashed with the corresponding balenaOS kernels
  • Runs our code that installs k3s and, for the Jetson Nano and Xavier, builds and loads the ipip and wireguard modules out-of-tree in this script (a verification sketch follows the next paragraph)

After the above steps, the devices are able to start k3s agent in a container and wireguard in another and join our k3s cluster that is running Calico as its CNI.
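
For illustration, loading and verifying the out-of-tree modules on an enrolled device looks roughly like this (a minimal sketch; the actual build happens in the linked script, and the .ko paths here are placeholders):

$ insmod /path/to/ipip.ko         # placeholder path; built out-of-tree by the script
$ insmod /path/to/wireguard.ko
$ lsmod | grep -E '^(ipip|wireguard)'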

Our process for enrolling the Google Coral Dev Board

  • We first flash the Google Coral with the recommended balenaOS version 2.108.26 kernel.
$ uname -r
4.14.98-imx

Here are the logs for what happens when starting the k3s agent with ALL the kernel modules loaded

Dmesg logs on the host kernel

[   40.838652] random: crng init done
[   40.851547] EXT4-fs (mmcblk0p2): re-mounted. Opts: (null)
[   40.874565] EXT4-fs (mmcblk0p6): mounted filesystem with ordered data mode. Opts: (null)
[   41.028904] systemd[1]: System time before build time, advancing clock.
[   41.097585] systemd[1]: File /lib/systemd/system/systemd-journald.service:12 configures an IP firewall (IPAddressDeny=any), but the local system does not support BPF/cgroup based firewalling.
[   41.097598] systemd[1]: Proceeding WITHOUT firewalling in effect! (This warning is only shown for the first loaded unit using IP firewalling.)
[   41.209113] systemd[1]: /lib/systemd/system/chronyd.service:25: Unknown key name 'ProcSubset' in section 'Service', ignoring.
[   41.209146] systemd[1]: /lib/systemd/system/chronyd.service:28: Unknown key name 'ProtectHostname' in section 'Service', ignoring.
[   41.209163] systemd[1]: /lib/systemd/system/chronyd.service:29: Unknown key name 'ProtectKernelLogs' in section 'Service', ignoring.
[   41.209194] systemd[1]: /lib/systemd/system/chronyd.service:32: Unknown key name 'ProtectProc' in section 'Service', ignoring.
[   42.247023] imx-sdma 30bd0000.sdma: no iram assigned, using external mem
[   42.256266] imx-sdma 30bd0000.sdma: loaded firmware 4.2
[   42.259899] imx-sdma 302c0000.sdma: no iram assigned, using external mem
[   42.268096] imx-sdma 302c0000.sdma: loaded firmware 4.2
[   42.348589] ina2xx 1-0040: error configuring the device: -6
[   42.361241] ina2xx 1-0041: error configuring the device: -6
[   42.750411] zram: Can't change algorithm for initialized device
[   43.627353] Adding 503584k swap on /dev/zram0.  Priority:-2 extents:1 across:503584k SS
[   43.910583] wlan: loading out-of-tree module taints kernel.
[   43.975040] wlan: loading driver v4.5.23.1
[   43.975387] hif_pci_probe:, con_mode= 0x0
[   43.975397] PCI device id is 003e :003e
[   43.975417] hif_pci 0000:01:00.0: BAR 0: assigned [mem 0x18000000-0x181fffff 64bit]
[   43.975548] hif_pci 0000:01:00.0: enabling device (0000 -> 0002)
[   43.976718]
                hif_pci_configure : num_desired MSI set to 1
[   44.054114] hif_pci_probe: ramdump base 0xffff800024e00000 size 2095136
[   44.126366] NUM_DEV=1 FWMODE=0x2 FWSUBMODE=0x0 FWBR_BUF 0
[   44.779370] +HWT
[   44.796852] -HWT
[   44.820250] HTT: full reorder offload enabled
[   44.860930] Pkt log is disabled
[   44.865835] Host SW:4.5.23.1, FW:2.0.1.1048, HW:QCA6174_REV3_2
[   44.866430] ol_pktlog_init: pktlogmod_init successfull
[   44.866722] wlan: driver loaded in 892000
[   44.870061] target uses HTT version 3.50; host uses 3.28
[   47.488191] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   47.495084] Generic PHY 30be0000.ethernet-1:00: attached PHY driver [Generic PHY] (mii_bus:phy_addr=30be0000.ethernet-1:00, irq=POLL)
[   47.495751] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   47.534226] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   47.534572] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   47.668851] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[   51.593483] fec 30be0000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[   51.593510] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   65.517525] Bridge firewalling registered
[   65.630300] Initializing XFRM netlink socket
[   65.641279] Netfilter messages via NETLINK v0.30.
[   65.900887] IPv6: ADDRCONF(NETDEV_UP): supervisor0: link is not ready
[   65.995140] IPv6: ADDRCONF(NETDEV_UP): balena0: link is not ready
[   68.796835] ipip: IPv4 and MPLS over IPv4 tunneling driver
[   73.869504] ip6_tables: (C) 2000-2006 Netfilter Core Team
[   79.811546] EXT4-fs (mmcblk0p3): mounted filesystem with ordered data mode. Opts: (null)
[  235.903492] ctnetlink v0.93: registering with nfnetlink.
[  255.787535] ip_set: protocol 6
[  256.065155] IPVS: [rr] scheduler registered.
[  256.688073] Unable to handle kernel NULL pointer dereference at virtual address 00000040
[  256.704251] Mem abort info:
[  256.709945]   Exception class = DABT (current EL), IL = 32 bits
[  256.721880]   SET = 0, FnV = 0
[  256.728142]   EA = 0, S1PTW = 0
[  256.734520] Data abort info:
[  256.740377]   ISV = 0, ISS = 0x00000006
[  256.748146]   CM = 0, WnR = 0
[  256.754179] user pgtable: 4k pages, 48-bit VAs, pgd = ffff80001f9f3000
[  256.767329] [0000000000000040] *pgd=000000005f9f8003, *pud=000000005fa76003, *pmd=0000000000000000
[  256.785345] Internal error: Oops: 96000006 [#1] PREEMPT SMP

K3S agent logs (1.23.17 but also crashes on the latest stable)

INFO[0001] Starting k3s agent v1.23.17+k3s1 (abb8d7d4)
INFO[0001] Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [129.114.34.140:6443 dev.edge.chameleoncloud.org:6443]
WARN[0001] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
INFO[0003] Module overlay was already loaded
INFO[0003] Module nf_conntrack was already loaded
INFO[0003] Module br_netfilter was already loaded
INFO[0003] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600
INFO[0003] Set sysctl 'net/netfilter/nf_conntrack_max' to 131072
INFO[0003] Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400
INFO[0003] Logging containerd to /var/lib/rancher/k3s/agent/containerd/containerd.log
INFO[0003] Running containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd
INFO[0004] Containerd is now running
INFO[0004] Getting list of apiserver endpoints from server
INFO[0005] Tunnel authorizer set Kubelet Port 10250
INFO[0005] Updating load balancer k3s-agent-load-balancer default server address -> 129.114.34.140:6443
INFO[0005] Connecting to proxy                           url="wss://129.114.34.140:6443/v1-k3s/connect"
WARN[0005] Disabling CPU quotas due to missing cpu controller or cpu.cfs_period_us
INFO[0005] Running kubelet --address=0.0.0.0 --allowed-unsafe-sysctls=net.ipv4.ip_forward,net.ipv6.conf.all.forwarding --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=cgroupfs --client-ca-file=/var/lib/rancher/k3s/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --container-runtime=remote --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock --cpu-cfs-quota=false --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --healthz-bind-address=127.0.0.1 --hostname-override=8fb60a5 --kubeconfig=/var/lib/rancher/k3s/agent/kubelet.kubeconfig --kubelet-cgroups=/k3s --node-labels= --pod-manifest-path=/var/lib/rancher/k3s/agent/pod-manifests --read-only-port=0 --resolv-conf=/etc/resolv.conf --serialize-image-pulls=false --tls-cert-file=/var/lib/rancher/k3s/agent/serving-kubelet.crt --tls-private-key-file=/var/lib/rancher/k3s/agent/serving-kubelet.key --volume-plugin-dir=/opt/libexec/kubernetes/kubelet-plugins/volume/exec
Flag --cloud-provider has been deprecated, will be removed in 1.24 or later, in favor of removing cloud provider code from Kubelet.
Flag --containerd has been deprecated, This is a cadvisor flag that was mistakenly registered with the Kubelet. Due to legacy concerns, it will follow the standard CLI deprecation timeline before being removed.
I0416 18:57:06.107611    1338 server.go:442] "Kubelet version" kubeletVersion="v1.23.17+k3s1"
I0416 18:57:06.111875    1338 dynamic_cafile_content.go:156] "Starting controller" name="client-ca-bundle::/var/lib/rancher/k3s/agent/client-ca.crt"
INFO[0005] Annotations and labels have already set on node: 8fb60a5
INFO[0006] Running kube-proxy --cluster-cidr=192.168.64.0/18 --conntrack-max-per-core=0 --conntrack-tcp-timeout-close-wait=0s --conntrack-tcp-timeout-established=0s --healthz-bind-address=127.0.0.1 --hostname-override=8fb60a5 --kubeconfig=/var/lib/rancher/k3s/agent/kubeproxy.kubeconfig --proxy-mode=iptables
I0416 18:57:06.604015    1338 server.go:224] "Warning, all flags other than --config, --write-config-to, and --cleanup are deprecated, please begin using a config file ASAP"
INFO[0006] Starting the netpol controller version v1.5.2-0.20221026101626-e01045262706, built on 2023-03-10T21:33:49Z, go1.19.6
I0416 18:57:06.623003    1338 network_policy_controller.go:163] Starting network policy controller
I0416 18:57:06.626245    1338 proxier.go:652] "Failed to load kernel module with modprobe, you can ignore this message when kube-proxy is running inside container without mounting /lib/modules" moduleName="ip_vs_wrr"
I0416 18:57:06.631820    1338 proxier.go:652] "Failed to load kernel module with modprobe, you can ignore this message when kube-proxy is running inside container without mounting /lib/modules" moduleName="ip_vs_sh"
I0416 18:57:06.679697    1338 network_policy_controller.go:175] Starting network policy controller full sync goroutine
I0416 18:57:06.811294    1338 node.go:163] Successfully retrieved node IP: 192.168.1.201
I0416 18:57:06.811465    1338 server_others.go:138] "Detected node IP" address="192.168.1.201"
I0416 18:57:06.895718    1338 server_others.go:206] "Using iptables Proxier"
I0416 18:57:06.896101    1338 server_others.go:213] "kube-proxy running in dual-stack mode" ipFamily=IPv4
I0416 18:57:06.896288    1338 server_others.go:214] "Creating dualStackProxier for iptables"
I0416 18:57:06.896494    1338 server_others.go:502] "Detect-local-mode set to ClusterCIDR, but no IPv6 cluster CIDR defined, , defaulting to no-op detect-local for IPv6"
I0416 18:57:06.898929    1338 server.go:656] "Version info" version="v1.23.17+k3s1"
I0416 18:57:06.911637    1338 config.go:444] "Starting node config controller"
I0416 18:57:06.912495    1338 shared_informer.go:240] Waiting for caches to sync for node config
I0416 18:57:06.911638    1338 config.go:226] "Starting endpoint slice config controller"
I0416 18:57:06.912773    1338 shared_informer.go:240] Waiting for caches to sync for endpoint slice config
I0416 18:57:06.911692    1338 config.go:317] "Starting service config controller"
I0416 18:57:06.912992    1338 shared_informer.go:240] Waiting for caches to sync for service config
I0416 18:57:07.013755    1338 shared_informer.go:247] Caches are synced for node config
I0416 18:57:07.113248    1338 shared_informer.go:247] Caches are synced for endpoint slice config
I0416 18:57:07.113317    1338 shared_informer.go:247] Caches are synced for service config

Debugging the crash

After manually loading the kernel modules one by one, we managed to identify the kernel module that causes the crash: nf_conntrack_netlink. The k3s agent starts fine with all the other kernel modules loaded, but crashes the kernel as soon as it is started with the offending kmod loaded. This is of course not an issue with Calico itself, though I would highly appreciate some help with figuring out how the crash could be avoided or worked around.
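
A sketch of the bisection, for reference (the agent start command here is a placeholder for however your deployment launches it):

$ modprobe nf_conntrack_netlink   # load one candidate module at a time
$ dmesg -w &                      # watch the kernel log in the background
$ k3s agent ...                   # start the agent as usual; the Oops appears in dmesg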

Expected Behavior

The calico-node pod should start and possibly throw other errors related to the missing kernel module.

Current Behavior

After I start the k3s agent without nf_conntrack_netlink, it manages to join the cluster. However, as expected, Calico refuses to start, and I am unsure why. Here is a summary of what I managed to gather: the calico-node pod fails to start its first init container, flexvol-driver. While K8s fails to gather the logs from containerd, we observe a cryptic "Destination directory /host/driver not present!?" when starting the container with the k3s ctr utility to access containerd directly. This is the roadblock in Calico's setup on the agent.

Kubectl describe calico-node output

Name:                 calico-node-k2gpr
Namespace:            calico-system
Priority:             2000001000
Priority Class Name:  system-node-critical
Node:                 8fb60a5/192.168.1.201
Start Time:           Wed, 17 Apr 2024 14:36:31 -0500
Labels:               app.kubernetes.io/name=calico-node
                      controller-revision-hash=7f676f8bcd
                      k8s-app=calico-node
                      pod-template-generation=1
Annotations:          hash.operator.tigera.io/cni-config: d9bacb836c6a353943c41e4ffe971f36fbc78402
                      hash.operator.tigera.io/tigera-ca-private: 26433242d479d9686ab279672b73104ca069001c
Status:               Pending
IP:                   192.168.1.201
IPs:
  IP:           192.168.1.201
  IP:           2600:4041:5be7:6400:adbb:943f:c903:492c
Controlled By:  DaemonSet/calico-node
Init Containers:
  flexvol-driver:
    Container ID:   
    Image:          docker.io/calico/pod2daemon-flexvol:v3.24.1
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      sandbox container "7efd1a9004167f93b06178010d01d3094544cdeb2f9e5495a804c1786563d82c" is not running
      Exit Code:    128
      Started:      Wed, 31 Dec 1969 18:00:00 -0600
      Finished:     Wed, 17 Apr 2024 16:47:32 -0500
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-79ghs (ro)
  install-cni:
    Container ID:  
    Image:         docker.io/calico/cni:v3.24.1
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      /opt/cni/bin/install
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Environment:
      CNI_CONF_NAME:            10-calico.conflist
      SLEEP:                    false
      CNI_NET_DIR:              /etc/cni/net.d
      CNI_NETWORK_CONFIG:       <set to the key 'config' of config map 'cni-config'>  Optional: false
      KUBERNETES_SERVICE_HOST:  10.43.0.1
      KUBERNETES_SERVICE_PORT:  443
    Mounts:
      /host/etc/cni/net.d from cni-net-dir (rw)
      /host/opt/cni/bin from cni-bin-dir (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-79ghs (ro)
Containers:
  calico-node:
    Container ID:   
    Image:          docker.io/calico/node:v3.24.1
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       PodInitializing
    Ready:          False
    Restart Count:  0
    Liveness:       http-get http://localhost:9099/liveness delay=0s timeout=10s period=10s #success=1 #failure=3
    Readiness:      exec [/bin/calico-node -bird-ready -felix-ready] delay=0s timeout=5s period=10s #success=1 #failure=3
    Environment:
      DATASTORE_TYPE:                      kubernetes
      WAIT_FOR_DATASTORE:                  true
      CLUSTER_TYPE:                        k8s,operator,bgp
      CALICO_DISABLE_FILE_LOGGING:         false
      FELIX_DEFAULTENDPOINTTOHOSTACTION:   ACCEPT
      FELIX_HEALTHENABLED:                 true
      FELIX_HEALTHPORT:                    9099
      NODENAME:                             (v1:spec.nodeName)
      NAMESPACE:                           calico-system (v1:metadata.namespace)
      FELIX_TYPHAK8SNAMESPACE:             calico-system
      FELIX_TYPHAK8SSERVICENAME:           calico-typha
      FELIX_TYPHACAFILE:                   /etc/pki/tls/certs/tigera-ca-bundle.crt
      FELIX_TYPHACERTFILE:                 /node-certs/tls.crt
      FELIX_TYPHAKEYFILE:                  /node-certs/tls.key
      FIPS_MODE_ENABLED:                   false
      FELIX_TYPHACN:                       typha-server
      CALICO_MANAGE_CNI:                   true
      CALICO_IPV4POOL_CIDR:                192.168.64.0/18
      CALICO_IPV4POOL_IPIP:                Always
      CALICO_IPV4POOL_BLOCK_SIZE:          28
      CALICO_IPV4POOL_NODE_SELECTOR:       all()
      CALICO_IPV4POOL_DISABLE_BGP_EXPORT:  false
      FELIX_VXLANMTU:                      1400
      FELIX_WIREGUARDMTU:                  1400
      CALICO_NETWORKING_BACKEND:           bird
      FELIX_IPINIPMTU:                     1400
      IP:                                  autodetect
      IP_AUTODETECTION_METHOD:             interface=wg-.*
      IP6:                                 none
      FELIX_IPV6SUPPORT:                   false
      KUBERNETES_SERVICE_HOST:             10.43.0.1
      KUBERNETES_SERVICE_PORT:             443
    Mounts:
      /etc/pki/tls/certs/ from tigera-ca-bundle (ro)
      /host/etc/cni/net.d from cni-net-dir (rw)
      /lib/modules from lib-modules (ro)
      /node-certs from node-certs (ro)
      /run/xtables.lock from xtables-lock (rw)
      /var/lib/calico from var-lib-calico (rw)
      /var/log/calico/cni from cni-log-dir (rw)
      /var/run/calico from var-run-calico (rw)
      /var/run/nodeagent from policysync (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-79ghs (ro)
Conditions:
  Type              Status
  Initialized       False 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  lib-modules:
    Type:          HostPath (bare host directory volume)
    Path:          /lib/modules
    HostPathType:  
  xtables-lock:
    Type:          HostPath (bare host directory volume)
    Path:          /run/xtables.lock
    HostPathType:  FileOrCreate
  policysync:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/nodeagent
    HostPathType:  DirectoryOrCreate
  tigera-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tigera-ca-bundle
    Optional:  false
  node-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  node-certs
    Optional:    false
  var-run-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/calico
    HostPathType:  
  var-lib-calico:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/calico
    HostPathType:  
  cni-bin-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /opt/cni/bin
    HostPathType:  
  cni-net-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/cni/net.d
    HostPathType:  
  cni-log-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log/calico/cni
    HostPathType:  
  flexvol-driver-host:
    Type:          HostPath (bare host directory volume)
    Path:          /usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds
    HostPathType:  DirectoryOrCreate
  kube-api-access-79ghs:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              kubernetes.io/os=linux
Tolerations:                 :NoSchedule op=Exists
                             :NoExecute op=Exists
                             CriticalAddonsOnly op=Exists
                             node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:
  Type     Reason          Age                     From     Message
  ----     ------          ----                    ----     -------
  Normal   Created         41m (x2576 over 131m)   kubelet  Created container flexvol-driver
  Normal   Pulled          11m (x3434 over 131m)   kubelet  Container image "docker.io/calico/pod2daemon-flexvol:v3.24.1" already present on machine
  Warning  BackOff         6m2s (x3564 over 131m)  kubelet  Back-off restarting failed container
  Normal   SandboxChanged  62s (x7426 over 131m)   kubelet  Pod sandbox changed, it will be killed and re-created.

Output of starting the flexvol container with containerd on the Google Coral Dev Board

$ k3s ctr --debug run docker.io/calico/pod2daemon-flexvol:v3.24.1 test-calico
Destination directory /host/driver not present!?
JOUNAIDSoufiane changed the title from "Calico-node pod refuses to start on google coral dev board missing the nf_conntrack_netlink kernel module" to "Calico-node pod refuses to start on google coral dev board without the nf_conntrack_netlink kernel module" on Apr 17, 2024
tomastigera (Contributor) commented:

@JOUNAIDSoufiane do you have a kernel stack trace? Does it fail because of the missing module?
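
A sketch of pulling the full trace, assuming the systemd journal persists across the crash; otherwise capturing the kernel log over the serial console while reproducing is the usual fallback:

$ journalctl -k -b -1 | grep -B2 -A40 'Internal error: Oops'
$ dmesg -w                        # live alternative while reproducing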


JOUNAIDSoufiane commented Apr 18, 2024

I am currently trying my best to get a call trace out of this; I'll post one as soon as I have it. In the meantime, is there a way to start Calico successfully without the nf_conntrack_netlink module? That specific module (which is cited as required for Calico) happens to cause the crash when I start the k3s agent. When I start it without that module, the k3s agent runs and joins the cluster, but Calico does not initialize; I have included logs above showing how the calico-node init containers refuse to start in this case.


tomastigera commented Apr 18, 2024

Destination directory /host/driver not present!?

Comes from here https://github.com/projectcalico/calico/blob/master/pod2daemon/flexvol/docker-image/flexvol.sh#L55

I think that is something created by k8s, and it is just missing if you run the container in the simplistic way you do with k3s ctr.
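
The kubelet normally bind-mounts the host FlexVolume plugin directory at /host/driver (the flexvol-driver-host volume in the describe output above), so reproducing it with ctr would need the mount supplied by hand, roughly:

$ k3s ctr run --rm \
    --mount type=bind,src=/usr/libexec/kubernetes/kubelet-plugins/volume/exec/nodeagent~uds,dst=/host/driver,options=rbind:rw \
    docker.io/calico/pod2daemon-flexvol:v3.24.1 test-calico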

tomastigera (Contributor) commented:

the calico-node pod should start and possibly throw other errors related to the missing kernel module.

calicoctl checksystem checks whether you have all the prerequisites. Here is a list of the modules that calicoctl currently checks for.
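
Judging by the warning text it prints, the per-module check can be replicated by hand roughly like this (a sketch, not calicoctl's actual code):

$ lsmod | grep -w nf_conntrack_netlink                                    # loaded?
$ grep -w nf_conntrack_netlink /lib/modules/$(uname -r)/modules.builtin   # built in?
$ modprobe -n -v nf_conntrack_netlink                                     # loadable?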


JOUNAIDSoufiane commented Apr 24, 2024

Comes from here https://github.com/projectcalico/calico/blob/master/pod2daemon/flexvol/docker-image/flexvol.sh#L55
I think that is something created by k8s, and it is just missing if you run the container in the simplistic way you do with k3s ctr.

I see, thank you for that clarification. Here is my output of calicoctl checksystem:

Checking kernel version...
		4.14.98-imx         					OK
Checking kernel modules...
		ip_tables           					OK
WARNING: Unable to detect the ipt_ipvs module as Loaded/Builtin module or lsmod
		ipt_ipvs            					FAIL
		xt_bpf              					OK
		ipt_rpfilter        					OK
WARNING: Unable to detect the ipt_set module as Loaded/Builtin module or lsmod
		ipt_set             					FAIL
		xt_set              					OK
		xt_u32              					OK
		ip6_tables          					OK
WARNING: Unable to detect the xt_rpfilter module as Loaded/Builtin module or lsmod
		xt_rpfilter         					FAIL
WARNING: Unable to detect the nf_conntrack_netlink module as Loaded/Builtin module or lsmod
		nf_conntrack_netlink					FAIL
		xt_icmp             					OK
		xt_multiport        					OK
WARNING: Unable to detect the vfio-pci module as Loaded/Builtin module or lsmod
		vfio-pci            					FAIL
		xt_addrtype         					OK
		xt_conntrack        					OK
		xt_mark             					OK
		ipt_REJECT          					OK
		xt_icmp6            					OK
		ip_set              					OK

I purposely unloaded nf_conntrack_netlink, as it causes a crash when starting the k3s agent with Calico; as for the other missing modules, this GitHub issue suggests that the command itself is outdated.
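
For completeness, unloading the module and confirming it is gone:

$ modprobe -r nf_conntrack_netlink
$ lsmod | grep -w nf_conntrack_netlink || echo "not loaded"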

Furthermore, regarding why calico-node is not starting: I doubt the issue is related to missing modules, since the flexvol init container itself refuses to even start; at that point, Calico has not really started on the node, so how could it be complaining about missing modules?

This is all I could gather from k8s; I tried to look up the message but had hardly any concrete luck as to why this is not starting:

Init Containers:
  flexvol-driver:
    Container ID:   
    Image:          docker.io/calico/pod2daemon-flexvol:v3.24.1
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       StartError
      Message:      sandbox container "7efd1a9004167f93b06178010d01d3094544cdeb2f9e5495a804c1786563d82c" is not running
      Exit Code:    128
      Started:      Wed, 31 Dec 1969 18:00:00 -0600
      Finished:     Wed, 17 Apr 2024 16:47:32 -0500
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /host/driver from flexvol-driver-host (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-79ghs (ro)

tomastigera (Contributor) commented:

purposely unloaded nf_conntrack_netlink as it causes a crash when starting k3s agent with calico

Sure, but what is the cause? A buggy old kernel, it seems. If you managed to start Calico and k3s without conntrack, would you be able to use policies meaningfully? I don't think so 🤷

Any chance you can install a newer fixed kernel?

JOUNAIDSoufiane (Author) commented:

Right, it does seem like a buggy old kernel. I'm using balenaOS, and I've put in a request for them to update the kernel version!

In the meantime, I'll try outside of balenaOS with a newer kernel provided by Google and let you know how that fares.
