
[BUG] Is Calico broken in k8s v1.25.2? #3315

Closed
spenceclark opened this issue Nov 8, 2022 · 10 comments

Comments

@spenceclark

spenceclark commented Nov 8, 2022

Describe the bug
We have upgraded k8s clusters that use the Calico networking stack to Kubernetes 1.25.x.
Since that upgrade the Calico pods are in a failing state.
This has now happened on 2 different clusters.

Network connectivity between pods, and into and out of the cluster, is failing with errors of the form dial tcp 10.0.0.1:443: i/o timeout

To Reproduce

Steps to reproduce the behavior:

  1. Given a cluster on v1.24.x that is running Calico, upgrade to 1.25.x via either the UI or the CLI
  2. See error (a sketch of commands for observing the failing components follows this list)
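
A minimal way to observe the failure after the upgrade, assuming the standard namespaces used by an operator-managed Calico install (tigera-operator and calico-system; adjust if your cluster differs):

  # list the operator and Calico pods; failing pods show up as CrashLoopBackOff / not Ready
  kubectl get pods -n tigera-operator
  kubectl get pods -n calico-system

  # the operator's TigeraStatus resources summarise component health
  kubectl get tigerastatus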

Expected behavior

  • No failing pods.
  • Network connectivity between pods and the outside world.

Screenshots

(Screenshots of the failing pods were attached to the original issue; not reproduced here.)

Environment (please complete the following information):

  • CLI Version v2.32.1
  • Kubernetes version v1.25.2
  • Tigera Operator version v1.23.8
  • Calico version v3.21.6

Additional context
My understanding is that AKS manages the "tigera-operator" on the cluster, and this operator in turn manages the "calico" installation and all its resources and config.

The tigera-operator pod has this in its logs:

{ "level": "error", "ts": 1667914355.001378, "logger": "controller_installation", "msg": "Error creating / updating resource", "Request.Namespace": "", "Request.Name": "volumesnapshots.snapshot.storage.k8s.io", "error": "no matches for kind \"PodDisruptionBudget\" in version \"policy/v1beta1\"", "stacktrace": "github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\ngithub.com/tigera/operator/pkg/controller/installation.(*ReconcileInstallation).SetDegraded\n\t/go/src/github.com/tigera/operator/pkg/controller/installation/core_controller.go:1324\ngithub.com/tigera/operator/pkg/controller/installation.(*ReconcileInstallation).Reconcile\n\t/go/src/github.com/tigera/operator/pkg/controller/installation/core_controller.go:1183\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99" }

The calico-typha pods have this in their logs:

2022-11-08 13:36:35.838 [INFO][7] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices"
2022-11-08 13:36:35.840 [INFO][7] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices" error=resource does not exist: KubernetesEndpointSlice with error: the server could not find the requested resource
2022-11-08 13:36:36.323 [WARNING][7] health.go 188: Reporter is not ready. name="felix"
2022-11-08 13:36:36.323 [WARNING][7] health.go 154: Health: not ready

And the calico-node daemon pods have this in their logs:

2022-11-08 13:37:26.907 [ERROR][123609] tunnel-ip-allocator/discovery.go 174: Didn't find any ready Typha instances.
2022-11-08 13:37:26.907 [FATAL][123609] tunnel-ip-allocator/startsyncerclient.go 48: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
2022-11-08 13:37:26.947 [ERROR][123148] felix/discovery.go 174: Didn't find any ready Typha instances.
2022-11-08 13:37:26.947 [ERROR][123148] felix/daemon.go 322: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port

I should note, the "calico-kube-controller" pod does start up, and doesn't have any errors in its log files.

Network connectivity between pods, and into and out of the cluster, is failing with dial tcp 10.0.0.1:443: i/o timeout errors.

If I scale tigera-operator and calico-typha down to 0 pods, network connectivity is restored, albeit only temporarily, until AKS scales tigera-operator back up to 1 pod again.
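
For reference, a sketch of that temporary workaround, assuming the usual deployment names and namespaces for an operator-managed install (verify them on your cluster first):

  # scale the operator and typha to zero; AKS eventually scales the operator back up
  kubectl scale deployment tigera-operator -n tigera-operator --replicas=0
  kubectl scale deployment calico-typha -n calico-system --replicas=0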

From my investigation, it seems the version of Calico that is installed does not support Kubernetes 1.25.x, because the PodDisruptionBudget API moved from "policy/v1beta1" to "policy/v1" (see projectcalico/calico#4570). It appears support for the new API version was introduced in Calico 3.23.x.

As Calico is managed by the tigera-operator, which in turn is managed by AKS, I believe this is a bug.

On one of the affected clusters I tried to manually upgrade the tigera-operator to the latest version, but this has just made the problem worse and I have a whole set of new errors. I don't believe we should be messing with the tigera-operator ourselves.
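
(For context on what that manual attempt typically involves: the operator ships as a single manifest in each Calico release, so a manual upgrade roughly means re-applying a newer copy of it. The URL and version below are illustrative assumptions, and as noted above this is not recommended on a cluster where AKS manages the operator.)

  # illustrative only - apply the operator manifest from a newer Calico release
  # (server-side apply avoids the client-side annotation size limit on the large CRDs)
  kubectl apply --server-side --force-conflicts \
    -f https://raw.githubusercontent.com/projectcalico/calico/v3.24.5/manifests/tigera-operator.yaml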

@spenceclark spenceclark added the bug label Nov 8, 2022
@siegenthalerroger

I am having the same issue. As the v1beta1 EndpointSlice API was removed in 1.25, it's not surprising that it's failing, but why is Calico trying to use that...

@spenceclark
Author

Ah ok, so the EndpointSlices API was removed as well - that explains that error too.

We raised a support ticket with Microsoft, but as k8s 1.25.x is marked as "Preview", there is no support or SLA, so we're on our own with that one.

@siegenthalerroger

I think I've narrowed this down to being either an old version of the tigera-operator or of Calico itself, as both projects seem to say they support 1.25. Though I'm not entirely sure about Calico's 1.25 support, as the webpage still only lists 1.24 as supported.

@spenceclark
Author

Yeah, that was my conclusion too. The tigera-operator appears to be managed directly by Azure (if I reduce the deployment to 0 instances, it gets set back to 1 by something a few minutes later), and the operator manages the Calico installation.

The version of the tigera-operator that is being managed by AKS installs an older version of Calico that does not support Kubernetes 1.25.x.

On one cluster I tried to force a new version of tigera, which then upgraded Calico to the new version, but that has caused a lot of new errors and is more broken than before I tried it - so don't try that :D

@siegenthalerroger

siegenthalerroger commented Nov 14, 2022

Have you tried only updating to a new Calico but keeping the same tigera, by changing the calicoVersion field on the ClusterInformation resource? It's the resource named "default" and I was going to give that a try.

Edit:

I tried this and, as expected, nothing happens: the tigera operator isn't running correctly, so it doesn't even pick up the change.
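
For reference, a sketch of that attempt, assuming the cluster-scoped ClusterInformation resource named "default" (the version string is purely illustrative):

  # inspect the Calico version the datastore currently reports
  kubectl get clusterinformation default -o yaml

  # illustrative patch; the operator normally owns this field, so a direct edit
  # is not expected to stick (and, as noted above, it did not help here)
  kubectl patch clusterinformation default --type merge -p '{"spec":{"calicoVersion":"v3.24.5"}}'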

@siegenthalerroger

According to MS Support, when using BYOCNI with Calico as the CNI provider, a deployment with the newest tigera/calico combination works flawlessly. Obviously BYOCNI results in no SLA from MS, but at least we know that Calico itself is compatible with 1.25 in the newer versions.
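
(Roughly what that BYOCNI check looks like, with placeholder resource group and cluster names and an assumed Calico version; the IP pool in the default custom-resources.yaml may need adjusting for AKS:)

  # create an AKS cluster with no managed CNI (BYOCNI)
  az aks create -g my-rg -n my-byocni-cluster --network-plugin none
  az aks get-credentials -g my-rg -n my-byocni-cluster

  # install the Tigera operator and a default Calico Installation from a recent release
  kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.24.5/manifests/tigera-operator.yaml
  kubectl create -f https://raw.githubusercontent.com/projectcalico/calico/v3.24.5/manifests/custom-resources.yaml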

@siegenthalerroger

The fix for this is rolling out with 2022-11-27 https://github.com/Azure/AKS/releases/tag/2022-11-27

Why this wasn't linked or commented here...

@ghost

ghost commented Dec 21, 2022

Thanks for reaching out. I'm closing this issue as it was marked with "Fix released" and it hasn't had activity for 7 days.

@ghost ghost closed this as completed Dec 21, 2022
@siegenthalerroger

So I just checked and the fix has rolled out to my region, so I attempted to force the update (az resource update...), to no avail. I then tried the more aggressive approach of removing the tigera operator and the "default" Installation to force a reconcile with a new version. Also to no avail.
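
A sketch of that more aggressive reset, assuming the standard resource names for an operator-managed install (the AKS addon manager is expected to recreate the operator afterwards):

  # remove the operator deployment and the operator-managed Installation resource
  kubectl delete deployment tigera-operator -n tigera-operator
  kubectl delete installation.operator.tigera.io default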

It appears it's still deploying an old version of Tigera; even if Calico was updated, that isn't useful, because Tigera is having the exact same issue and can't start up itself OR the Calico install which it manages.
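
One way to confirm which operator version AKS is actually deploying, assuming the usual deployment name and namespace:

  # print the image (and therefore version) of the running tigera-operator
  kubectl get deployment tigera-operator -n tigera-operator \
    -o jsonpath='{.spec.template.spec.containers[0].image}'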

Error logs from Tigera-Operator:

{"level":"error","ts":1671704884.0956094,"logger":"controller_installation","msg":"Error creating / updating resource","Request.Namespace":"","Request.Name":"volumesnapshots.snapshot.storage.k8s.io","error":"no matches for kind \"PodSecurityPolicy\" in version \"policy/v1beta1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:214"}
{"level":"error","ts":1671704884.095791,"logger":"controller.tigera-installation-controller","msg":"Reconciler error","name":"volumesnapshots.snapshot.storage.k8s.io","namespace":"","error":"no matches for kind \"PodSecurityPolicy\" in version \"policy/v1beta1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:214"}
{"level":"error","ts":1671704886.946103,"logger":"controller_installation","msg":"Error creating / updating resource","Request.Namespace":"","Request.Name":"default","error":"no matches for kind \"PodSecurityPolicy\" in version \"policy/v1beta1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:214"}
{"level":"error","ts":1671704886.9462087,"logger":"controller.tigera-installation-controller","msg":"Reconciler error","name":"default","namespace":"","error":"no matches for kind \"PodSecurityPolicy\" in version \"policy/v1beta1\"","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.9.7/pkg/internal/controller/controller.go:214"}
{"level":"info","ts":1671704887.8222423,"logger":"status_manager.calico","msg":"Status manager is not ready to report component statuses."}

@palma21 if you could reopen please :)

@siegenthalerroger

Once again a new issue: #3403

@ghost ghost locked as resolved and limited conversation to collaborators Jan 21, 2023