[BUG] Is Calico broken in k8s v1.25.2? #3315
Comments
I am having the same issue. Since the EndpointSlices API was removed in 1.25, it's not surprising that it's failing, but why is Calico still trying to use it...
Ah OK, so EndpointSlices was removed as well; that explains that error too. We raised a support ticket with Microsoft, but as k8s 1.25.x is marked as "Preview", there is no support or SLA, so we're on our own with that one.
I think I've narrowed this down to being either an old version of the tigera-operator, or the old version of Calico that it installs.
Yeah, that was my conclusion too. tigera appears to be managed directly by Azure (if I scale the deployment to 0 instances, it gets set back to 1 by something a few minutes later), and tigera manages the calico installation. The version of tigera that AKS manages installs an older version of Calico that does not contain support for Kubernetes 1.25.x. On one cluster I tried to force a new version of tigera, which then upgraded Calico to the new version, but that caused a lot of new errors and is more broken than before I tried it - so don't try that :D
Have you tried updating to a new Calico while keeping the same tigera, by changing the operator's Installation resource? Edit: I tried this and, as expected, nothing happens; the tigera operator isn't running correctly, so it doesn't even pick up the change.
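For anyone trying the same, a quick way to see which operator and Calico versions are in play; this assumes the default `tigera-operator` namespace and `Installation` resource name used by operator-based installs:

```sh
# The operator image tag determines which Calico version it installs.
kubectl get deployment tigera-operator -n tigera-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# The Installation resource is what the operator reconciles against.
kubectl get installation default -o yaml
```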
According to MS Support, when using BYOCNI with Calico as the CNI provider, a deployment with the newest tigera/calico combination works flawlessly. Obviously BYOCNI means no SLA from MS, but at least we know that calico itself is compatible with 1.25 in the newer versions.
The fix for this is rolling out with the 2022-11-27 release: https://github.com/Azure/AKS/releases/tag/2022-11-27. Why this wasn't linked or commented here...
Thanks for reaching out. I'm closing this issue as it was marked with "Fix released" and it hasn't had activity for 7 days.
So I just checked and the fix has rolled out to my region, so I attempted to force the update. It appears it's still deploying an old version of Tigera; even if Calico was updated, that isn't useful, because Tigera is hitting the exact same issue and can't start up itself OR the Calico installation it manages. Error logs from Tigera-Operator:
@palma21 if you could reopen please :)
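For anyone else trying to force the update: a sketch of the kind of operation involved. These are real az CLI commands, but whether any of them actually pulls in the fixed tigera-operator build for a given region is an assumption on my part:

```sh
# Re-running an upgrade operation prompts AKS to re-reconcile its managed
# components; <rg> and <cluster> are placeholders for your own names.
az aks upgrade --resource-group <rg> --name <cluster> --node-image-only
```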
Once again a new issue |
Describe the bug
We have upgraded k8s clusters that use Calico networking stack to Kubernetes 1.25.x
Since that upgrade the Calico pods are in a failing state.
This has happened now on 2 different clusters.
Network connectivity between pods, and into and out of the cluster, is failing with `dial tcp 10.0.0.1:443: i/o timeout` type errors.

To Reproduce
Steps to reproduce the behavior:

1. Take an AKS cluster using the Calico network policy add-on.
2. Upgrade the cluster to Kubernetes 1.25.x.
3. Observe the calico-typha and calico-node pods entering a failing state and in-cluster connectivity timing out (see the check below).
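A minimal way to observe the symptom from inside the cluster; the pod name and image are illustrative, and 10.0.0.1 is the `kubernetes` Service ClusterIP that was timing out:

```sh
# Any HTTP response means connectivity to the API server is fine;
# a timeout reproduces the reported symptom.
kubectl run conncheck --rm -it --restart=Never \
  --image=curlimages/curl:8.5.0 --command -- \
  curl -ksS --max-time 5 https://10.0.0.1/healthz
```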
Expected behavior
Screenshots
If applicable, add screenshots to help explain your problem.
Environment (please complete the following information):
Additional context
My understanding is that AKS manages the "tigera-operator" on the cluster, and this operator in turn manages the "calico" installation and all of its resources and config.
The tigera-operator pod has this in its logs:
{ "level": "error", "ts": 1667914355.001378, "logger": "controller_installation", "msg": "Error creating / updating resource", "Request.Namespace": "", "Request.Name": "volumesnapshots.snapshot.storage.k8s.io", "error": "no matches for kind \"PodDisruptionBudget\" in version \"policy/v1beta1\"", "stacktrace": "github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr@v0.2.0/zapr.go:132\ngithub.com/tigera/operator/pkg/controller/installation.(*ReconcileInstallation).SetDegraded\n\t/go/src/github.com/tigera/operator/pkg/controller/installation/core_controller.go:1324\ngithub.com/tigera/operator/pkg/controller/installation.(*ReconcileInstallation).Reconcile\n\t/go/src/github.com/tigera/operator/pkg/controller/installation/core_controller.go:1183\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:263\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:235\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1.1\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.7.0/pkg/internal/controller/controller.go:198\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:155\nk8s.io/apimachinery/pkg/util/wait.BackoffUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:156\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:185\nk8s.io/apimachinery/pkg/util/wait.UntilWithContext\n\t/go/pkg/mod/k8s.io/apimachinery@v0.19.2/pkg/util/wait/wait.go:99" }
The calico-typha pods have this in their logs:
```
2022-11-08 13:36:35.838 [INFO][7] watchercache.go 175: Full resync is required ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices"
2022-11-08 13:36:35.840 [INFO][7] watchercache.go 188: Failed to perform list of current data during resync ListRoot="/calico/resources/v3/projectcalico.org/kubernetesendpointslices" error=resource does not exist: KubernetesEndpointSlice with error: the server could not find the requested resource
2022-11-08 13:36:36.323 [WARNING][7] health.go 188: Reporter is not ready. name="felix"
2022-11-08 13:36:36.323 [WARNING][7] health.go 154: Health: not ready
```
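The underlying cause of that list failure is the same 1.25 API cleanup: the v1beta1 EndpointSlice API that older Calico builds list was removed. A quick way to confirm what the cluster still serves, using only standard kubectl:

```sh
# On 1.25.x only discovery.k8s.io/v1 should be listed; clients that still
# ask for the v1beta1 EndpointSlice API get "the server could not find
# the requested resource".
kubectl api-versions | grep discovery.k8s.io
```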
And the calico-node daemon pods have this in their logs:
```
2022-11-08 13:37:26.907 [ERROR][123609] tunnel-ip-allocator/discovery.go 174: Didn't find any ready Typha instances.
2022-11-08 13:37:26.907 [FATAL][123609] tunnel-ip-allocator/startsyncerclient.go 48: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
2022-11-08 13:37:26.947 [ERROR][123148] felix/discovery.go 174: Didn't find any ready Typha instances.
2022-11-08 13:37:26.947 [ERROR][123148] felix/daemon.go 322: Typha discovery enabled but discovery failed. error=Kubernetes service missing IP or port
```
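Those node errors look downstream of typha itself: calico-node discovers typha through its Service, which has no ready endpoints while the typha pods are failing. A quick way to confirm, assuming the operator's default `calico-system` namespace:

```sh
# If the Service exists but its endpoints list is empty, discovery fails
# with "Kubernetes service missing IP or port".
kubectl get service calico-typha -n calico-system
kubectl get endpoints calico-typha -n calico-system
kubectl get pods -n calico-system -l k8s-app=calico-typha
```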
I should note, the "calico-kube-controller" pod does start up, and doesn't have any errors in its log files.
Network connectivity between pods, and into and out of the cluster, is failing with `dial tcp 10.0.0.1:443: i/o timeout` errors. If I set tigera-operator and calico-typha to 0 pods, network connectivity is restored, albeit temporarily, until AKS scales tigera back up to 1 pod again (see the workaround commands below).
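For reference, the temporary workaround above amounts to the following; the namespaces assume an operator-based install, and AKS reconciles the operator back within minutes:

```sh
# Stops the failing managed components; connectivity returns until AKS
# scales tigera-operator back up to 1 replica.
kubectl scale deployment tigera-operator -n tigera-operator --replicas=0
kubectl scale deployment calico-typha -n calico-system --replicas=0
```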
From my investigation, it seems the version of Calico that is installed does not support Kubernetes 1.25.x, due to the PodDisruptionBudget API moving from "policy/v1beta1" to "policy/v1" (see projectcalico/calico#4570). It appears support for the new API version was introduced in Calico 3.23.x.
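To check the installed Calico version against that 3.23.x threshold, Calico's standard ClusterInformation resource reports it, assuming the default resource name:

```sh
# Calico records its running version in the default ClusterInformation.
kubectl get clusterinformation default -o jsonpath='{.spec.calicoVersion}'
```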
As Calico is managed by the tigera-operator, which in turn is managed by AKS, I believe this is a bug.
On one of the affected clusters, I tried to manually upgrade the tigera-operator to the latest version, but this has just made the problem worse: I now have a whole set of new errors, and I don't believe we should be messing with the tigera-operator ourselves.