Pods unable to talk via Calico - made worse with higher pod load #8676

Open · Zorlin opened this issue Mar 31, 2024 · 15 comments

Zorlin commented Mar 31, 2024

Hi there!

I've got a Calico cluster that we're using to test large-scale deployments of Kubernetes pods - specifically containers running a piece of software called Waku. We are using Multus to provide multiple networks: plain Calico for all normal pods, and Calico plus the Bridge CNI for the Waku pods, with the Bridge network set as the primary network for the Waku pods and Calico used as a secondary.

What we're noticing is that when we spawn in our deployment, which goes like this:

  • 3 bootstrap pods
  • 40 midstrap pods
  • N node pods - we have tried 250, 300, 400, 500, 1000 and 2000, both all at once and by gently scaling between those counts

A large percentage of the time (between 0.2% and 15% or so, depending on load), a pod deploying in the midstrap section gets "stuck". The symptoms are fairly odd: the pod gets a valid Calico IP address, but cannot communicate with anything at all through Calico. Shelling into the pod and pinging 10.2.0.1 (the upstream router for the Bridge CNI network we have attached as the primary network) works, indicating that Multus and the Bridge CNI have connectivity. Pinging any Kubernetes 10.42.x.x/16 or 10.43.x.x/16 address does not work (meanwhile the other 85% to 99.8% of the nodes have fully working connectivity to both).
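
A rough sketch of the check we do from inside a stuck pod, for reference (the pod and namespace names are placeholders, and it assumes the container image has ip/ping available):

# shell into one of the stuck pods
kubectl exec -it -n <namespace> <stuck-pod> -- sh

# inside the pod:
ip addr                    # the Calico interface is present and has a valid 10.42.x.x address
ping -c 3 10.2.0.1         # bridge CNI gateway: replies fine
ping -c 3 <10.42.x.x or 10.43.x.x address>   # anything reached via Calico: no replies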

The cause has been unclear, and I haven't been able to figure out much yet about why it only sometimes happens. It's usually one or two nodes that get "stuck" and start failing to spawn pods, which would make me think it's CPU-load related - but again, these programs are lightweight, and there's a random chance it happens even when the CPU is almost idle.

Repeatedly killing the broken pods and letting them respawn helps up to a point, but eventually it stops helping and a huge chunk of the remaining nodes collapse back into a restarting state. Then the cycle more or less repeats.

This issue is affecting our productivity at the moment :( Any help is appreciated. Thanks for making Calico - other than this issue it's been working great for us.

Expected Behavior

Pods spawn with full networking capabilities

Current Behavior

Some percentage of pods spawn without working Calico networking, despite the Calico interface being visible, having a valid IP address, and being able to communicate over other CNIs

Possible Solution

None known yet.

Steps to Reproduce (for bugs)

Run ./deployment.sh from here against a THROWAWAY, TESTING Kubernetes cluster running Calico and Harbor (or just Calico if you replace any instances of "https://harbor2.riff.cc/localmirror/" with the Docker Hub URL)

https://github.com/vacp2p/10ksim/tree/zorlin/midstrap/accelerated_deployment

Observe the results

Here is a video of an already-hosed cluster where nothing will spawn any more: it shows entering the shell of a seemingly working pod and finding that Calico networking has broken down while Bridge still works.

https://asciinema.org/a/3FbPab7GbsiilJjfI2QniPHby

Here is a video of the same project spawning in 143 nodes, then scaling to 500-1000 and having just a few fail, and then the cluster hitting the same connectivity issues - again with Bridge still working.

https://asciinema.org/a/65fDn7G7LGrRcXhxFpjKg7MeO

Context

We are trying to spawn 10,000 containers of an open-source networking/messaging tool across a cluster of seven physical machines (64 cores and 512 GiB of RAM each) with extremely fast networking between them. We're currently testing, at a smaller scale, a more efficient network setup with offloading that we hope will allow us to scale to approximately 2,000 nodes per physical machine (each of which is split into 8 logical VMs).

This issue prevents us from being able to run as many nodes as the machines are capable of supporting.

Your Environment

  • Calico version:

  • Orchestrator version (e.g. kubernetes, mesos, rkt):
    Kubernetes (RKE2 v1.27.12+rke2r1)
    Multus CNI
    Calico installed normally
    Multus installed via a small chart change that allows it to run on NoSchedule-tainted nodes

  • Operating System and version:
    Hosts: Debian 12, amd64, all latest updates, running Proxmox kernel and Mellanox OFED drivers
    Guests: Debian 12, amd64, all latest updates, running Debian kernel and Mellanox OFED drivers

  • Link to your project (optional):
    10ksim

tomastigera (Contributor) commented Apr 2, 2024

What calico version do you use? Do you have enough IPs in ippools?
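
(For example, assuming calicoctl is installed and pointed at the cluster, the following shows pool and block utilisation:)

calicoctl get ippools -o wide        # list the configured pools, CIDRs and selectors
calicoctl ipam show                  # IPs in use vs. free per pool
calicoctl ipam show --show-blocks    # per-block breakdown, useful for spotting exhausted blocks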

lwr20 (Member) commented Apr 10, 2024

@fasaxc looks like a use case for a high number of pods per node.

Zorlin (Author) commented Apr 10, 2024

> What calico version do you use? Do you have enough IPs in ippools?

My apologies, I was sure I had replied to this. Currently we are using the rke2-calico:3.27.200 chart, based on Calico 3.27.

Yes, we have plenty of IPs left in IPPools.

lwr says: > looks like a use case for a high number of pods per node.
(Not tagging to avoid triggering an email.) Yes :)

Zorlin (Author) commented Apr 10, 2024

> then Calico+Bridge CNI for the Waku pods with the Bridge network set as the primary network for the Waku pods and Calico used as a secondary.

We saw a huge improvement after following a suggestion from someone on the Calico Slack to switch this around so that Calico is primary and the Bridge network is secondary. Unfortunately it meant a fair amount of shuffling of our testing and a few hacks to make it work (within our pods, not Kubernetes-wise), but it helped. We're still seeing this issue, just less severely.
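
For anyone finding this later, the new shape is roughly the sketch below: Calico stays the cluster-default (primary) CNI, and the bridge network is attached as a secondary interface via Multus. The attachment name, bridge device and subnet here are illustrative stand-ins, not our exact config:

# NetworkAttachmentDefinition for the secondary bridge network
cat <<'EOF' | kubectl apply -f -
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: waku-bridge
  namespace: zerotesting
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "bridge",
    "bridge": "br0",
    "ipam": {
      "type": "host-local",
      "subnet": "10.2.0.0/16",
      "gateway": "10.2.0.1"
    }
  }'
EOF

# the Waku pods then request it in their pod template via the Multus annotation:
#   metadata:
#     annotations:
#       k8s.v1.cni.cncf.io/networks: waku-bridge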

fasaxc (Member) commented Apr 11, 2024

Calico only supports being the primary CNI with Multus. We use the podIP in the Pod resource to dictate the IP that Calico expects the pod to have (so we're not a "pure" enough CNI implementation to act as secondary). With Calico as secondary, I'd expect it not to work at all, but there could be an interaction with the bridge CNI that gives you something partially working. My guess would be that all the traffic comes out of the pod's "eth0", which will be the bridge interface, and it ends up partially working by being bridged to the host, where it gets routed onwards.

Note that Calico won't be able to secure any traffic that goes through the bridged interface. We won't even know about it (and even if the above partially works, we won't secure traffic that arrives over the bridge).

@Zorlin now you've made that switch, are you seeing any issues at all? We don't put a hard limit on the number of pods, but 3.27 includes fixes that make it work pretty well up to about 2k pods. Going beyond that, I think you'd need to disable (or increase) routeRefreshInterval in the FelixConfiguration object. That's the main operation that takes longer as there are more pods.
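
For example, something along these lines (assuming the default FelixConfiguration object; pick an interval that suits you, or set it to "0s" to disable the refresh entirely):

calicoctl patch felixconfiguration default --patch '{"spec":{"routeRefreshInterval":"900s"}}'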

fasaxc (Member) commented Apr 11, 2024

Oh, there's another limit that may affect you: there's an IPAM block limit of 20 blocks per node. Each block has 64 IPs by default, so that's enough for 64*20 = 1280 pods/node. You probably want to use an IPPool with larger blocks for best efficiency. You could assign /23s, for example.

Note that the block size must not be changed on an active IP pool(!) or the IPAM database will get corrupted. You need to add a new non-overlapping pool and then set the old one's node selector to !all() so that it'll no longer be used.
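
A rough sketch of that migration (the CIDR, pool names and encapsulation settings below are placeholders; check the existing pool first with calicoctl get ippool -o yaml):

# 1. Add a new, non-overlapping pool with bigger blocks (/23 = 512 IPs per block):
cat <<'EOF' | calicoctl apply -f -
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: big-block-pool
spec:
  cidr: 10.44.0.0/16        # placeholder - must not overlap the existing pool
  blockSize: 23
  natOutgoing: true
  vxlanMode: CrossSubnet    # match whatever the existing pool uses
  ipipMode: Never           # ditto
EOF

# 2. Stop the old pool from handing out new addresses (existing pods keep their IPs):
calicoctl patch ippool default-ipv4-ippool --patch '{"spec":{"nodeSelector":"!all()"}}'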

Zorlin (Author) commented Apr 11, 2024

> [...]
> @Zorlin now you've made that switch, are you seeing any issues at all? We don't put a hard limit on the number of pods, but 3.27 includes fixes that make it work pretty well up to about 2k pods. Going beyond that, I think you'd need to disable (or increase) routeRefreshInterval in the FelixConfiguration object. That's the main operation that takes longer as there are more pods.

Fascinating! Yeah, we see slowdowns right around 2000 pods; up until that point Calico is lightning fast. This was the case before we were using Multus as well, so it's good to see that it's consistent and a known "issue" (well, not really an issue :) - my use case is unusual!)

I'll tweak that setting - maybe make it 10x less frequent. Thank you!

Zorlin (Author) commented Apr 11, 2024

> Oh, there's another limit that may affect you: there's an IPAM block limit of 20 blocks per node. Each block has 64 IPs by default, so that's enough for 64*20 = 1280 pods/node. You probably want to use an IPPool with larger blocks for best efficiency. You could assign /23s, for example.
>
> Note that the block size must not be changed on an active IP pool(!) or the IPAM database will get corrupted. You need to add a new non-overlapping pool and then set the old one's node selector to !all() so that it'll no longer be used.

In our case our software is too heavy to run more than a couple of thousand pods per physical node - this would matter if we were running directly on bare metal, but each physical node is split into 8 logical VMs anyway, so this limit won't be hit at this stage :)

fasaxc (Member) commented Apr 12, 2024

It's an unusual use case but not that unusual. It would be good to know where things break at 2k+ interfaces. Maybe look at top during a run to see if calico-node -felix is using a lot of CPU. If it is, you could try capturing a pprof CPU profile so we can see what's using the CPU.

I think that's as simple as sending felix a SIGUSR2:

kubectl exec -n <calico namespace> calico-node-xxxx -- sv 2 felix

Then wait 10s and there'll be a pprof file in /tmp inside the container.
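
Then something like this gets it out for analysis (the namespace, pod name and profile file name will vary; check what actually lands in /tmp):

kubectl exec -n <calico namespace> calico-node-xxxx -- ls /tmp
kubectl exec -n <calico namespace> calico-node-xxxx -- cat /tmp/<profile-file> > felix-cpu.pprof
go tool pprof -http=:8080 felix-cpu.pprof    # needs a local Go toolchain; or just attach the file to this issue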

Zorlin (Author) commented Apr 12, 2024

I'll have a look at that soon @fasaxc - thanks for the detailed instructions!

Meanwhile, across a much larger deployment (6 physical nodes, each split into 8 VMs, 4 of which are dual 64-core EPYC Milan or Milan-X machines) I'm getting really close to the goals I'm trying to hit... with more than 8000 pods spawned in and working.

This is the main issue we're now facing that slows down further progress -

Events:
  Type     Reason                  Age                 From               Message
  ----     ------                  ----                ----               -------
  Normal   Scheduled               33m                 default-scheduler  Successfully assigned zerotesting/nodes-8978 to opal-fragment-34
  Warning  FailedCreatePodSandBox  29m                 kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "4c49dd843fdb9c88a7feb8c6f9d3a1be06e2cc7b651bc9e9437724fbc5e4081b": plugin type="multus" name="multus-cni-network" failed (add): [zerotesting/nodes-8978/553b1c2e-3b1b-43bc-a778-8eea36b6ca63:k8s-pod-network]: error adding container to network "k8s-pod-network": plugin type="calico" failed (add): failed to look up reserved IPs: context deadline exceeded
  Warning  FailedCreatePodSandBox  24m                 kubelet            Failed to create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedCreatePodSandBox  100s (x4 over 22m)  kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to reserve sandbox name "nodes-8978_zerotesting_553b1c2e-3b1b-43bc-a778-8eea36b6ca63_1": name "nodes-8978_zerotesting_553b1c2e-3b1b-43bc-a778-8eea36b6ca63_1" is reserved for "c4438530ac673fbce09ba4a7a96f8b30b8ddb6456ac64fba82123b85d72d013d"
  Normal   SandboxChanged          95s (x11 over 29m)  kubelet            Pod sandbox changed, it will be killed and re-created.

fasaxc (Member) commented Apr 19, 2024

"failed to look up reserved IPs" means that the CNI plugin's GET to the API server is timing out, which suggests the node is overloaded or the API server is overloaded. (Or any CPU reservations on your hosts are starving out the CNI plugin/API server.)

Check the load average on your nodes: if it's much higher than the number of cores, things are getting starved. Check the CPU usage of your API server: is it running hot? If not, you may need to tune the max-requests-in-flight settings etc. to allow more concurrent requests.
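
As a rough sketch of both checks and the tuning knob (the RKE2 file path and flag values below are examples only, not recommendations):

# on each node: is the load average far above the core count?
uptime

# is the API server pinned at its CPU limit? (needs metrics-server)
kubectl top pods -n kube-system | grep kube-apiserver

# if the API server has headroom but CNI calls still time out, raise the in-flight
# request limits via RKE2's server config and restart rke2-server afterwards
# (merge into any existing kube-apiserver-arg block rather than duplicating the key):
cat <<'EOF' | sudo tee -a /etc/rancher/rke2/config.yaml
kube-apiserver-arg:
  - max-requests-inflight=800
  - max-mutating-requests-inflight=400
EOF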

tomastigera (Contributor)

@Zorlin any update?

Zorlin (Author) commented May 8, 2024 via email

Zorlin (Author) commented May 8, 2024

> @Zorlin any update?

Hi, so I'm still gathering evidence and data, but for some reason we're now seeing this "instability" (Calico networking becoming unavailable intermittently) at as few as 3000 pods.

It shows up like this (in k9s):
[screenshot: k9s pod view]

Pods will just semi-randomly appear orange when they fail to contact their upstream service, and the common thread is that they're all things that run over Calico. Authentik, Grafana, Harbor, Rancher and Prometheus all "turn orange" at various points. Most of the time they then recover on their own.

During this time, services such as Grafana get harder and harder to use until they become unusable. Under extreme load (5k to 10k pods), spawning pods gets "harder and harder" (it happens less and less frequently and with a lower success rate over time).

Here's Calico CPU usage during that time
[screenshot: Calico CPU usage]
Nothing appears particularly overloaded, and Calico is using a reasonable amount of CPU across the board.

Zorlin (Author) commented May 8, 2024

> [...]
> @Zorlin now you've made that switch, are you seeing any issues at all? We don't put a hard limit on the number of pods, but 3.27 includes fixes that make it work pretty well up to about 2k pods. Going beyond that, I think you'd need to disable (or increase) routeRefreshInterval in the FelixConfiguration object. That's the main operation that takes longer as there are more pods.
>
> Fascinating! Yeah, we see slowdowns right around 2000 pods; up until that point Calico is lightning fast. This was the case before we were using Multus as well, so it's good to see that it's consistent and a known "issue" (well, not really an issue :) - my use case is unusual!)
>
> I'll tweak that setting - maybe make it 10x less frequent. Thank you!

Speaking of which, though: having had to reinstall Calico, I think I accidentally reverted these settings, so I'll fix them again.
