Pods unable to talk via Calico - made worse with higher pod load #8676
What Calico version do you use? Do you have enough IPs in your IPPools?
@fasaxc looks like a use case for a high number of pods per node.
My apologies, I was sure I had replied to this. Yes, we have plenty of IPs left in our IPPools.

lwr says:

> looks like a use case for a high number of pods per node
We saw a huge improvement after following a suggestion from someone on the Calico Slack to switch this so that Calico was primary and the bridge network was secondary. Unfortunately it meant a fair amount of shuffling of our testing and a few hacks (within our pods, not Kubernetes-wise) to make it work, but it helped. We're still seeing this issue, but less severely.
Calico only supports being the primary CNI with Multus. We use the podIP in the Pod resource to dictate the IP that Calico expects the pod to have (so we're not a "pure" enough CNI implementation to act as secondary). Using Calico as secondary, I'd expect it not to work at all, but there could be an interaction with the bridge CNI that gives something partially working. My guess would be that all the traffic leaves via the pod's "eth0", which will be the bridge, and it ends up partially working by being bridged to the host, where it gets routed on.

Note that Calico won't be able to secure any traffic that goes through the bridged interface. We won't even know about it (and even if the above partially works, we won't secure traffic that arrives over the bridge).

@Zorlin now you've made that switch, are you seeing any issues at all? We don't put a hard limit on the number of pods, but 3.27 includes fixes that make it work pretty well up to about 2k pods. Going beyond that, I think you'd need to disable (or increase
Oh, there's another limit that may affect you: there's an IPAM limit of 20 blocks per node. Each block has 64 IPs by default, so that's enough for 64*20 pods per node. You probably want to use an IPPool with larger blocks for best efficiency; you could assign /23s, for example. Note that the block size must not be changed on an active IP pool(!) or the IPAM database will get corrupted. You need to add a new non-overlapping pool and then set the old one's node selector to
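To make the arithmetic above concrete, here is a small sketch of the pods-per-node capacity implied by the block size and the 20-blocks-per-node limit (the helper names are illustrative, not Calico APIs):

```python
# Sketch: pods-per-node ceiling implied by Calico's IPAM block prefix length
# and the 20-blocks-per-node limit described in the comment above.

def ips_per_block(block_prefix_len: int) -> int:
    """Number of IPv4 addresses in one IPAM block of the given prefix length."""
    return 2 ** (32 - block_prefix_len)

def max_pods_per_node(block_prefix_len: int, blocks_per_node: int = 20) -> int:
    """Upper bound on pods per node before running out of IPAM blocks."""
    return ips_per_block(block_prefix_len) * blocks_per_node

print(max_pods_per_node(26))  # default /26 blocks: 64 * 20 = 1280 pods/node
print(max_pods_per_node(23))  # /23 blocks: 512 * 20 = 10240 pods/node
```

With the default /26 blocks the ceiling is 1280 pods per node; moving to /23 blocks raises it well past the scale discussed in this thread.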
Fascinating! Yeah, we see slowdowns at almost exactly 2000 pods; up until that point Calico is lightning fast. This was the case before we were using Multus as well, so it's good to see that it's consistent and a known "issue" (well, not really an issue :) my use case is unusual!). I'll tweak that setting, maybe make it 10x less frequent. Thank you!
In our case our software is too heavy to run more than a couple of thousand pods per physical node. The larger blocks would help if we were running directly on bare metal, but each physical node is split into 8 logical VMs anyway, so this limit won't be hit at this stage :)
It's an unusual use case, but not that unusual. It would be good to know where things break at 2k+ interfaces. Maybe look at a CPU profile; I think that's as simple as sending Felix a SIGUSR2:
Then wait 10s and there'll be a pprof file in /tmp inside the container. |
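The exact command was lost from the thread; a hedged sketch, assuming calico-node runs in the calico-system namespace (the pod placeholder, container name, and `pkill` availability are assumptions; adjust for your install):

```shell
# Send SIGUSR2 to the felix (calico-node) process inside the pod on the
# affected node. <calico-node-pod> is a placeholder, not a real name.
kubectl exec -n calico-system <calico-node-pod> -c calico-node -- \
  pkill -USR2 calico-node

# After ~10s, look for the pprof file in /tmp inside the container.
kubectl exec -n calico-system <calico-node-pod> -c calico-node -- ls -l /tmp
```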
I'll have a look at that soon, @fasaxc, thanks for the detailed instructions! Meanwhile, across a much larger deployment (6 physical nodes split into 8 VMs each, 4 of which are dual 64-core EPYC Milan or Milan-X), I'm getting really close to the goals I'm trying to hit, with more than 8000 pods spawned and working. This is the main issue we're now facing that slows down further progress:
"Failed to look up reserved IPs" means the CNI plugin's GET to the API server is timing out, which suggests the node or the API server is overloaded (or any CPU reservations on your hosts are starving out the CNI plugin/API server). Check the load average on your nodes: if it's much higher than the number of cores, things are getting starved. Check the CPU usage of your API server: is it running hot? If not, you may need to tune the max-requests-in-flight settings etc. to allow more connections.
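The "load average vs. core count" check above can be scripted; a minimal sketch (the function name and the 2x factor are my own illustrative choices, not from the thread):

```python
import os

def looks_cpu_starved(factor: float = 2.0) -> bool:
    """Heuristic from the comment above: flag a node whose 1-minute load
    average is much higher than its core count. The factor is illustrative."""
    load1, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    return load1 > factor * cores

print(looks_cpu_starved())
```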
@Zorlin any update? |
Hi, will be looking into this again soon, sorry. Had some other issues to deal with, lots of new hardware and a network topology change (we're now on redundant networking via MLAG 💪).

Thanks
Hi, so I'm still gathering evidence and data, but for some reason we're now seeing this "instability" (Calico networking becoming intermittently unavailable) at as few as 3000 pods. It shows up like this (in k9s): pods will semi-randomly turn orange when they fail to contact their upstream service, and the common thread is that they're all things that run over Calico. Authentik, Grafana, Harbor, Rancher and Prometheus all "turn orange" at various points; most of the time they recover on their own.

During this time, services such as Grafana get harder and harder to use until they become unusable. Under extreme load (5k to 10k pods), spawning pods gets "harder and harder" (it succeeds less and less frequently over time).

Here's Calico CPU usage during that time:
Speaking of which, though: having had to reinstall Calico, I think I accidentally reverted these settings, so I will fix them again.
Hi there!
I've got a Calico cluster that we're using for testing large-scale deployments of Kubernetes pods, specifically containers running a piece of software called Waku. We are using Multus to provide multiple networks: just Calico for all normal pods, and Calico plus the bridge CNI for the Waku pods, with the bridge network set as the primary network for the Waku pods and Calico used as a secondary.
What we're noticing is that when we spawn in our deployment, which goes like this:
A large percentage of the time (between 0.2% and 15% or so, depending on load), a pod deploying in the midstrap section gets "stuck". The symptoms are fairly odd: the pod gets a valid Calico IP address but cannot communicate at all with anything through Calico. Shelling into the pod and pinging 10.2.0.1 (the upstream router for the bridge CNI network we have attached as the primary network) works, indicating that Multus and the bridge CNI have connectivity. Pinging any Kubernetes 10.42.0.0/16 or 10.43.0.0/16 address does not work (meanwhile, the other 85% to 99.8% of the nodes have fully working connectivity to both).
The cause has been unclear, and I haven't been able to figure out much yet about why it sometimes happens. It's usually one or two nodes that get "stuck" and start failing to spawn pods, which would make me think it's CPU-load related; but again, these programs are lightweight, and there is a random chance it happens even when the CPU is almost idle.

Repeatedly killing the broken pods and letting them respawn helps up to a point, until it stops helping and eventually a huge chunk of the remaining nodes collapse back into a restarting state. Then the cycle more or less repeats.
This issue is affecting our productivity at the moment :( Any help is appreciated. Thanks for making Calico - other than this issue it's been working great for us.
Expected Behavior
Pods spawn with full networking capabilities
Current Behavior
Some % of pods spawn without Calico networking, despite having the adapter visible, having a valid IP address and being able to communicate on other CNIs
Possible Solution
None known yet.
Steps to Reproduce (for bugs)
1. Run `./deployment.sh` from https://github.com/vacp2p/10ksim/tree/zorlin/midstrap/accelerated_deployment against a THROWAWAY, TESTING Kubernetes cluster running Calico and Harbor (or just Calico if you replace any instances of "https://harbor2.riff.cc/localmirror/" with the Docker Hub URL).
2. Observe the results.
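The URL substitution mentioned above can be scripted; a hedged sketch (the `grep`/`sed` invocation is mine, and stripping the mirror prefix to an empty string, which leaves a plain Docker Hub reference, is an assumption about how the image references are written):

```shell
# Rewrite image references from the private mirror to Docker Hub by stripping
# the mirror prefix in-place across the checked-out deployment files.
# -r tells xargs to do nothing when grep finds no matches.
grep -rl 'https://harbor2.riff.cc/localmirror/' . \
  | xargs -r sed -i 's|https://harbor2.riff.cc/localmirror/||g'
```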
Here is a video of an already-hosed cluster where nothing will spawn any more: it shows entering the shell of a seemingly working pod and finding that Calico networking has broken down while the bridge is still working.
https://asciinema.org/a/3FbPab7GbsiilJjfI2QniPHby
Here is a video of the same project spawning 143 nodes, scaling to 500-1000 with just a few failures, and then the cluster hitting the same connectivity issues, with the bridge again still working.
https://asciinema.org/a/65fDn7G7LGrRcXhxFpjKg7MeO
Context
We are trying to spawn 10,000 containers of an open-source networking/messaging tool across a cluster of 7 physical machines (64 cores and 512 GiB of RAM each) with extremely fast networking between them. We're currently testing, at a smaller scale, a more efficient network setup with offloading that we hope will allow us to scale to approximately 2000 nodes per physical machine (each of which is split into 8 logical VMs).
This issue prevents us from being able to run as many nodes as the machines are capable of supporting.
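For reference, here is the back-of-the-envelope per-VM pod arithmetic these numbers imply (a sketch using only figures stated in this issue; the variable names are mine):

```python
# Capacity arithmetic from the figures in the Context section above.
PHYSICAL_MACHINES = 7
VMS_PER_MACHINE = 8          # each physical node is split into 8 logical VMs
TARGET_CONTAINERS = 10_000
TARGET_PER_MACHINE = 2_000   # hoped-for nodes per physical machine

total_vms = PHYSICAL_MACHINES * VMS_PER_MACHINE            # 56 Kubernetes nodes
pods_per_vm_now = TARGET_CONTAINERS / total_vms            # ~179 pods per VM
pods_per_vm_goal = TARGET_PER_MACHINE / VMS_PER_MACHINE    # 250 pods per VM

print(total_vms, round(pods_per_vm_now), pods_per_vm_goal)
```

Both per-VM figures are well below the default 64*20 IPAM ceiling discussed elsewhere in the thread, so the failures here are unlikely to be simple address exhaustion.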
Your Environment
Calico version
Orchestrator version (e.g. kubernetes, mesos, rkt):
Kubernetes (RKE2 v1.27.12+rke2r1)
Multus CNI
Calico installed normally
Multus installed via a small chart change that allows for it to run on NoSchedule tainted nodes
Operating System and version:
Hosts: Debian 12, amd64, all latest updates, running Proxmox kernel and Mellanox OFED drivers
Guests: Debian 12, amd64, all latest updates, running Debian kernel and Mellanox OFED drivers
Link to your project (optional):
10ksim