
[BUG] Is Calico broken in k8s v1.25.2? #3315

Closed
spenceclark opened this issue Nov 8, 2022 · 10 comments

Comments

@spenceclark

spenceclark commented Nov 8, 2022

Describe the bug
We have upgraded k8s clusters that use the Calico networking stack to Kubernetes 1.25.x.
Since that upgrade the Calico pods are in a failing state.
This has now happened on 2 different clusters.

Network connectivity between pods, and into and out of the cluster, is failing with errors of the form dial tcp 10.0.0.1:443: i/o timeout

To Reproduce

Steps to reproduce the behavior:

  1. Given a cluster on v1.24.x that is running Calico, upgrade to 1.25.x via either the UI or the CLI
  2. See error (a sketch of commands for observing the failing components follows this list)
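
A minimal way to observe the failure after the upgrade, assuming the standard namespaces used by an operator-managed Calico install (tigera-operator and calico-system; adjust if your cluster differs):

  # list the operator and Calico pods; failing pods show up as CrashLoopBackOff / not Ready
  kubectl get pods -n tigera-operator
  kubectl get pods -n calico-system

  # the operator's TigeraStatus resources summarise component health
  kubectl get tigerastatus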

Expected behavior

  • No failing pods.
  • Network connectivity between pods and the outside world.

Screenshots

(Screenshots of the failing pods were attached to the original issue; not reproduced here.)

Environment (please complete the following information):

  • CLI Version v2.32.1
  • Kubernetes version v1.25.2
  • Tigera Operator version v1.23.8
  • Calico version v3.21.6

Additional context
My understanding is that AKS manages the "tigera-operator" on the cluster, and this operator in turn manages the "calico" installation and all its resources and config.

The tigera-operator pod has this in its logs:

{ "level": "error", "ts": 1667914355.001378, "logger": "controller_installation", "msg": "Error creating / updating resource", "Request.Namespace": "", "Request.Name": "volumesnapshots.snapshot.storage.k8s.io", "error": "no matches for kind \"PodDisruptionBudget\" in version \"policy/v1beta1\"", "stacktrace": "github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\ngithub.com/tigera/operator/pkg/controller/installation.(*ReconcileInstallation).SetDegraded\n\t/go/src/github.com/tigera/operator/pkg/controller/installation/core_controller.go:1324\ngithub.com/tigera/operator/pkg/controller/installation.(*ReconcileInstallation).Reconcile\n\t/go/src/github.com/tigera/operator/pkg/controller/installation/core_controller.go:1183\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99" }

The calico-typha pods have this in their logs:

2022-11-08 13:36:35.838 [INFO][7] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices"
2022-11-08 13:36:35.840 [INFO][7] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices" error=resource does not exist: KubernetesEndpointSlice with error: the server could not find the requested resource
2022-11-08 13:36:36.323 [WARNING][7] health.go 188: Reporter is not ready. name="felix"
2022-11-08 13:36:36.323 [WARNING][7] health.go 154: Health: not ready

And the calico-node daemon pods have this in their logs:

2022-11-08 13:37:26.907 [ERROR][123609] tunnel-ip-allocator/discovery.go 174: Didn't find any ready Typha instances.
2022-11-08 13:37:26.907 [FATAL][123609] tunnel-ip-allocator/startsyncerclient.go 48: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
2022-11-08 13:37:26.947 [ERROR][123148] felix/discovery.go 174: Didn't find any ready Typha instances.
2022-11-08 13:37:26.947 [ERROR][123148] felix/daemon.go 322: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port

I should note, the "calico-kube-controller" pod does start up, and doesn't have any errors in its log files.

Network connectivity between pods, and into and out of the cluster, is failing with dial tcp 10.0.0.1:443: i/o timeout errors.

If I scale tigera-operator and calico-typha down to 0 pods, network connectivity is restored, albeit only temporarily, until AKS scales tigera-operator back up to 1 pod again.
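
For reference, a sketch of that temporary workaround, assuming the usual deployment names and namespaces for an operator-managed install (verify them on your cluster first):

  # scale the operator and typha to zero; AKS eventually scales the operator back up
  kubectl scale deployment tigera-operator -n tigera-operator --replicas=0
  kubectl scale deployment calico-typha -n calico-system --replicas=0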

From my investigation, it seems the version of Calico that is installed does not support Kubernetes 1.25.x, because the PodDisruptionBudget API moved from "policy/v1beta1" to "policy/v1" (see projectcalico/calico#4570). It appears support for the new API version was introduced in Calico 3.23.x.

As Calico is managed by the tigera-operator, which in turn is managed by AKS, I believe this is a bug.

On one of the affected clusters I tried to manually upgrade the tigera-operator to the latest version, but this has just made the problem worse and I have a whole set of new errors. I don't believe we should be messing with the tigera-operator ourselves.
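
(For context on what that manual attempt typically involves: the operator ships as a single manifest in each Calico release, so a manual upgrade roughly means re-applying a newer copy of it. The URL and version below are illustrative assumptions, and as noted above this is not recommended on a cluster where AKS manages the operator.)

  # illustrative only - apply the operator manifest from a newer Calico release
  # (server-side apply avoids the client-side annotation size limit on the large CRDs)
  kubectl apply --server-side --force-conflicts \
    -f https://raw.githubusercontent.com/projectcalico/calico/v3.24.5/manifests/tigera-operator.yaml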

@spenceclark spenceclark added the bug label Nov 8, 2022
@siegenthalerroger

I am having the same issue. As the v1beta1 EndpointSlice API was removed in 1.25, it's not surprising that it's failing, but why is Calico trying to use that...

@spenceclark
Author

Ah ok, so the EndpointSlices API was removed as well - that explains that error too.

We raised a support ticket with Microsoft, but as k8s 1.25.x is marked as "Preview", there is no support or SLA, so we're on our own with that one.

@siegenthalerroger

I think I've narrowed this down to being either an old version of the tigera-operator or of Calico itself, as both projects seem to say they support 1.25. Though I'm not entirely sure about Calico's 1.25 support, as the webpage still only lists 1.24 as supported.

@spenceclark
Author

Yeah, that was my conclusion too. The tigera-operator appears to be managed directly by Azure (if I reduce the deployment to 0 instances, it gets set back to 1 by something a few minutes later), and the operator manages the Calico installation.

The version of the tigera-operator that is being managed by AKS installs an older version of Calico that does not support Kubernetes 1.25.x.

On one cluster I tried to force a new version of tigera, which then upgraded Calico to the new version, but that has caused a lot of new errors and is more broken than before I tried it - so don't try that :D

@siegenthalerroger

siegenthalerroger commented Nov 14, 2022

Have you tried only updating to a new Calico but keeping the same tigera, by changing the calicoVersion field on the ClusterInformation resource? It's the resource named "default" and I was going to give that a try.

Edit:

I tried this and, as expected, nothing happens: the tigera operator isn't running correctly, so it doesn't even pick up the change.
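
For reference, a sketch of that attempt, assuming the cluster-scoped ClusterInformation resource named "default" (the version string is purely illustrative):

  # inspect the Calico version the datastore currently reports
  kubectl get clusterinformation default -o yaml

  # illustrative patch; the operator normally owns this field, so a direct edit
  # is not expected to stick (and, as noted above, it did not help here)
  kubectl patch clusterinformation default --type merge -p '{"spec":{"calicoVersion":"v3.24.5"}}'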

@siegenthalerroger

According to MS Support, when using BYOCNI with Calico as the CNI provider, a deployment with the newest tigera/calico combination works flawlessly. Obviously BYOCNI results in no SLA from MS, but at least we know that Calico itself is compatible with 1.25 in the newer versions.
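
(Roughly what that BYOCNI check looks like, with placeholder resource group and cluster names and an assumed Calico version; the IP pool in the default custom-resources.yaml may need adjusting for AKS:)

  # create an AKS cluster with no managed CNI (BYOCNI)
  az aks create -g my-rg -n my-byocni-cluster --network-plugin none
  az aks get-credentials -g my-rg -n my-byocni-cluster

  # install the Tigera operator and a default Calico Installation from a recent release
  kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.24.5/manifests/tigera-operator.yaml
  kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.24.5/manifests/custom-resources.yaml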

@siegenthalerroger

The fix for this is rolling out with 2022-11-27 https://github.com/Azure/AKS/releases/tag/2022-11-27

Why this wasn't linked or commented here...

@ghost

ghost commented Dec 21, 2022

Thanks for reaching out. I'm closing this issue as it was marked with "Fix released" and it hasn't had activity for 7 days.

@ghost ghost closed this as completed Dec 21, 2022
@siegenthalerroger

So I just checked and the fix has rolled out to my region, so I attempted to force the update (az resource update...), to no avail. I then tried the more aggressive approach of removing the tigera operator and the "default" Installation to force a reconcile with a new version. Also to no avail.
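
A sketch of that more aggressive reset, assuming the standard resource names for an operator-managed install (the AKS addon manager is expected to recreate the operator afterwards):

  # remove the operator deployment and the operator-managed Installation resource
  kubectl delete deployment tigera-operator -n tigera-operator
  kubectl delete installation.operator.tigera.io default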

It appears it's still deploying an old version of Tigera; even if Calico was updated, that isn't useful, because Tigera is having the exact same issue and can't start up itself OR the Calico install which it manages.
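
One way to confirm which operator version AKS is actually deploying, assuming the usual deployment name and namespace:

  # print the image (and therefore version) of the running tigera-operator
  kubectl get deployment tigera-operator -n tigera-operator \
    -o jsonpath='{.spec.template.spec.containers[0].image}'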

Error logs from Tigera-Operator:

{"level":"error","ts":1671704884.0956094,"logger":"controller_installation","msg":"Error creating / updating resource","Request.Namespace":"","Request.Name":"volumesnapshots.snapshot.storage.k8s.io","error":"no matches for kind \"PodSecurityPolicy\" in version \"policy/v1beta1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:214"}
{"level":"error","ts":1671704884.095791,"logger":"controller.tigera-installation-controller","msg":"Reconciler error","name":"volumesnapshots.snapshot.storage.k8s.io","namespace":"","error":"no matches for kind \"PodSecurityPolicy\" in version \"policy/v1beta1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:214"}
{"level":"error","ts":1671704886.946103,"logger":"controller_installation","msg":"Error creating / updating resource","Request.Namespace":"","Request.Name":"default","error":"no matches for kind \"PodSecurityPolicy\" in version \"policy/v1beta1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:214"}
{"level":"error","ts":1671704886.9462087,"logger":"controller.tigera-installation-controller","msg":"Reconciler error","name":"default","namespace":"","error":"no matches for kind \"PodSecurityPolicy\" in version \"policy/v1beta1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:214"}
{"level":"info","ts":1671704887.8222423,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}

@palma21 if you could reopen please :)

@siegenthalerroger

Once again a new issue: #3403

@ghost ghost locked as resolved and limited conversation to collaborators Jan 21, 2023