Calico-node pod refuses to start on google coral dev board without the nf_conntrack_netlink kernel module #8726
Comments
@JOUNAIDSoufiane do you have a kernel stack trace? Does it fault because of the missing module?
I am currently trying my best to get a call trace out of this. I'll post one as soon as I have it. In the meantime, is there a way to start Calico successfully without the nf_conntrack_netlink module? That specific module (which is cited as required for Calico) happens to cause the crash when I start the k3s agent. When I start it without that module, the k3s agent runs and joins the cluster but Calico does not initialize. I have included logs above showing how the calico-node init containers refuse to start in this case.
It comes from here: https://github.com/projectcalico/calico/blob/master/pod2daemon/flexvol/docker-image/flexvol.sh#L55. I think that directory is something created by k8s, and it is simply missing if you run the container in the simplistic way you do with k3s ctr.
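For readers who have not opened the linked script, the kind of check being described can be sketched roughly as follows. This is a Python analogue written for illustration, not the actual flexvol.sh code; the function name `check_destination` is hypothetical, and the message text is copied from the error quoted in this issue.

```python
import os


def check_destination(dest_dir):
    """Return an error message if dest_dir is not an existing directory, else None.

    Hypothetical helper: it only mirrors the idea of refusing to install a
    driver when the expected host mount is absent.
    """
    if not os.path.isdir(dest_dir):
        return "Destination directory %s not present!?" % dest_dir
    return None


if __name__ == "__main__":
    # /host/driver is the path named in the error message quoted in this issue.
    print(check_destination("/host/driver"))
```

When the container is started by hand, nothing mounts the host path into the container, so a check of this shape would fail regardless of which kernel modules are loaded.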
I see, thank you for that clarification, here is my output of
I purposely unloaded nf_conntrack_netlink because it causes a crash when starting the k3s agent with Calico. As for the other missing modules, this GitHub issue suggests that the command itself is outdated. As for why calico-node is not starting, I doubt the issue is related to missing modules: the flexvol init container itself refuses to start, at which point Calico has not really started on the node yet and so cannot complain. This is all I could gather from k8s. I tried to look up the message but had hardly any luck finding out why this is not starting.
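Independently of the command mentioned above, one way to see which modules are currently loaded is to read /proc/modules directly. The following is a minimal sketch under stated assumptions: the helper names `loaded_modules` and `missing_modules` are hypothetical, and the required list contains only nf_conntrack_netlink because that is the only module named in this thread. Modules compiled into the kernel do not appear in /proc/modules, so a "missing" result here does not prove the feature is unavailable.

```python
from pathlib import Path


def loaded_modules(proc_modules_text):
    """Return the set of module names found in /proc/modules-formatted text."""
    names = set()
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if fields:
            # The first whitespace-separated field on each line is the module name.
            names.add(fields[0])
    return names


def missing_modules(required, proc_modules_text):
    """Return the entries of `required` that are not listed as loaded."""
    loaded = loaded_modules(proc_modules_text)
    return [name for name in required if name not in loaded]


if __name__ == "__main__":
    # Assumption: extend this list with whatever modules your setup needs.
    required = ["nf_conntrack_netlink"]
    path = Path("/proc/modules")
    text = path.read_text() if path.exists() else ""
    print(missing_modules(required, text))
```

Running this on the device before and after unloading a module gives a quick confirmation of the state the k3s agent is started in.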
Sure, but what is the cause? A buggy old kernel, it seems. If you managed to start Calico and k3s without conntrack, would you be able to use policies meaningfully? I don't think so 🤷 Any chance you can install a newer, fixed kernel?
Right, it does seem like a buggy old kernel. I'm using Balena OS, and I've put in a request for them to update the kernel version! In the meantime I'll try outside of Balena OS with a newer kernel provided by Google and let you know how that fares.
Let me preface this by saying that this is an unusual setup and that I am not running Calico in its ideal environment. If you do not care about the context for why we are trying to start Calico without nf_conntrack_netlink, please skip to the Expected Behavior and Current Behavior headings.
Context
Environment
We are working on enrolling the Google Coral Dev Board onto our existing balena fleet, which runs a collection of Raspberry Pis and NVIDIA Jetson Nanos in the following configuration:
After the above steps, the devices are able to start the k3s agent in one container and WireGuard in another, and join our k3s cluster, which runs Calico as its CNI.
Our process for enrolling the Google Coral Dev Board
Here are the logs for what happens when starting the k3s agent with ALL the kernel modules loaded
Dmesg logs on the host kernel
K3S agent logs (1.23.17 but also crashes on the latest stable)
Debugging the crash
After manually loading the kernel modules one by one, we identified the kernel module that causes the crash: nf_conntrack_netlink. The k3s agent starts fine with all the other kernel modules loaded, but crashes the kernel as soon as it is started with the offending module loaded. This is of course not an issue with Calico, though I would highly appreciate some help with figuring out how the crash could
Expected Behavior
The calico-node pod should start and possibly throw other errors related to the missing kernel module.
Current Behavior
After I start the k3s agent without nf_conntrack_netlink, it manages to join the cluster. However, as expected, Calico refuses to start, and I am unsure why. Here is a summary of what I managed to gather: the calico-node pod fails to start its first init container, flexvol-driver. While K8s fails to gather the logs from containerd, we observe a cryptic "Destination directory /host/driver not present!?" message when starting the container with the k3s ctr utility, which accesses containerd directly. This is the roadblock in Calico's setup on the agent.
Kubectl describe calico-node output
Output of starting the flexvol container with containerd on the Google Coral Dev Board