Traefik (Gateway API) breaks if a route points to a missing service (as seen with cert-manager) #9158
… for it:
1) cert-manager: cert-manager/cert-manager#5260
2) traefik: traefik/traefik#9158

Summary: The cert-manager services are closing once they complete but not cleaning up the HTTPRoutes immediately, which causes Traefik to break / lose all routes, which means the remaining challenge-solvers become inaccessible, which means they never complete, and thus the cluster remains inaccessible indefinitely.
+1. The whole gateway should not become invalid when only one route is.
I believe I have run into the same issue, in a simpler case while testing weighted routing. The following manifest renders the entire Traefik gateway and all other HTTPRoute CRs inoperable if there are no active pods behind the "canary" service:

```yaml
apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: podinfo
  namespace: test
spec:
  hostnames:
    - podinfo.localhost
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: traefik-gateway
      namespace: kube-system
  rules:
    - backendRefs:
        - group: ""
          kind: Service
          name: podinfo-primary # valid service, 2 active pods
          port: 9898
          weight: 100
        - group: ""
          kind: Service
          name: podinfo-canary # valid service, 0 active pods
          port: 9898
          weight: 0
      matches:
        - path:
            type: PathPrefix
            value: /
```

If pods are scaled up in the "canary" service, routing works without issue.

Traefik version: 2.9.1
Hi!
Hello @TeoGoddet, if you have confirmed the issue, can you share the details with us?
Is it possible that the recently merged fix below resolves this issue (#9158)?
@israel-morales no, that fix was specifically for the CRD provider |
…bfiles.

* Switched from the ingress-based version of Traefik (under the @Attempt6 folder) to the gateway-based one (under the @attempt7 folder). The gateway approach is cleaner, and is apparently the focus/future. While I think it still has the certificate-provisioning bug (traefik/traefik#9158), it was flagged recently as "bug:confirmed", so it'll likely be fixed eventually -- and I don't need it fixed right away, since Cloudflare offers the "Flexible" HTTPS option, which solves the main problem (ie. for end-users) for now.
* Improved traefik's gateway config, by having it no longer require the "NET_BIND_SERVICE" security-context.
* Updated traefik to the latest version, by changing the helm-chart version.
* Updated traefik's gateway config, to the latest contents found at: https://github.com/kubernetes-sigs/gateway-api/blob/69e4d8b69b8ec936bc1ed3ca8af807cd45dca09d/config/webhook
* Removed some folders for kube-prometheus and such. (superseded by loki-stack)
* Fixed that the grafana.debatemap.app subdomain was not working. (I had only tested locally previously using localhost:XXXX, whereas in prod I needed to handle the fact that the service was in another namespace)
Okay, I have traced the code-path that causes this issue:
Now that the code-path for the issue is clear, what is the correct way to solve it? The issue being: the entire Traefik system begins to fail (ie. discards/ignores all routes) if a single http-route in one of the listeners of one of the gateways is unable to find the service behind the route.

While in this specific case the issue could be solved by having cert-manager destroy its service only after destroying the associated http-route (for which I raised an issue here, which ended up being closed due to inactivity), that seems like a fragile solution: Traefik should not be so brittle that it completely stops working when a single http-route is temporarily "not backed by a service" like this. With the code-path for the issue laid out above, I'm hoping a proper solution can be found by devs more familiar with how Traefik is intended to function when the cluster is in an imperfect/inconsistent/"intermediary" state like this. (ie. where one of its http-routes temporarily does not have a service backing it)
Hello @Venryx, Thanks for your analysis, we think it makes a lot of sense. If you or another community member would like to fix it, let us know. |
…cert-manager solver urls... (traefik just has no way to do so through the gateway-api atm)

* During cert-manager operation, encountered error: "Failed to fetch authorization: 401 urn:ietf:params:acme:error:unauthorized: The request specified an authorization that does not exist". First thought was that it was due to my prior values for the ".env" vars "EAB_HMAC_KEY" and "EAB_KID" holding values that were expired or something, so I generated a new set and provided those. The error went away. Not sure if the resolution was due to the new values or just due to a relaunch of Tilt, but in any case it seems solved now.

I am now finally hitting the original error encountered half a year ago (traefik/traefik#9158). To resolve it, there are some things I can try:

* Create a pod that kills the cert-manager http-routes that are no longer needed. (thus getting Traefik to work again)
* Create a pod (or change some config) to keep the service behind those cert-manager http-routes alive somehow, until the cert is fully created and active. (seems like the best thing to try first)
* Modify the Traefik source code to handle the situation better. (best solution, but probably the most complicated)
What did you do?
I've encountered a lot of flakiness in certificate-provisioning in my kubernetes cluster, with...
Given 12 "challenges", on average only 4 of the 12 resolve/become-valid, with the rest staying "pending" forever.
What did you see instead?
I looked into the logs (and observed the cluster state using the wonderful Lens app), and it appears there is a conflict between cert-manager (in Gateway API mode) and Traefik (also in Gateway API mode).
Summary: The cert-manager challenge-solver services are closing once they complete but not cleaning up their HTTPRoutes immediately, which causes Traefik to break / lose all routes, which means the remaining challenge-solvers become inaccessible, which means they never complete, and thus the cluster remains inaccessible indefinitely.
Longer version: at some point Traefik reports:

```
an error occurred while creating gateway status: [...] Cannot load HTTPRoute service default/cm-acme-http-solverXXXXX: service not found
```

And at that same point -- as seen in Traefik's live web monitor -- all of the app's HTTPRoutes break/disappear.

I'm not completely sure whether to file this bug under cert-manager or Traefik, so I am filing it under both. (here is the cert-manager issue)
Expected behaviour:
See above. (Traefik should be more resilient, so that if a route points to a service that disappears for one reason or another, it does not make all the other routes in that gateway break/disappear)
Steps to reproduce the bug:
The steps above describe the process as I've observed it through the Lens app (which shows all the Kubernetes objects in the cluster, with automatic refreshing -- quite convenient). I do not have a minimal repro at this point, but the steps laid out above should be something that those familiar with the code can understand (and confirm/refute) without seeing a live demonstration. That is, it seems like a "logic conflict" between cert-manager's apparent expectation -- that leaving up HTTPRoutes pointing to non-existent services is fine -- and Traefik's apparent expectation -- that HTTPRoutes must always have a valid target.
What version of Traefik are you using?
v2.8.0 (as seen in the Traefik web-ui; traefik is running in a remote kubernetes cluster, so I don't see how running it with the docker command above would work)
What is your environment & configuration?
Environment details:
Traefik config/yaml files: (the full set of files can be seen here, since it's open-source)
If applicable, please paste the log output in DEBUG level
Including all of the DEBUG-level logging would be too long, so here is the relevant part: ie. what I see on re-applying the traefik-values.yaml file in a cluster that has become stuck. It shows the errors that keep recurring in a cluster whose HTTPRoutes point to missing services, which is the state in which Traefik problematically breaks.