Traefik (Gateway API) breaks if a route points to a missing service (as seen with cert-manager) #9158

Closed
2 tasks done
Venryx opened this issue Jul 3, 2022 · 9 comments · Fixed by #10714
Labels
area/provider/k8s/gatewayapi · contributor/wanted (participation from an external contributor is highly requested) · kind/bug/confirmed (a confirmed bug, reproducible)

Venryx commented Jul 3, 2022

Welcome!

  • Yes, I've searched similar issues on GitHub and didn't find any.
  • Yes, I've searched similar issues on the Traefik community forum and didn't find any.

What did you do?

I've encountered a lot of flakiness in certificate provisioning in my Kubernetes cluster, which uses:

  • Load-balancer: Traefik (Gateway API mode)
  • Certificate automator: cert-manager (Gateway API mode)

Given 12 "challenges", on average only 4 of the 12 resolve/become-valid, with the rest staying "pending" forever.

What did you see instead?

I looked into the logs (and observed the cluster state using the wonderful Lens app), and it appears there is a conflict between cert-manager (in Gateway API mode) and Traefik (also in Gateway API mode).

Summary: The cert-manager challenge-solver services shut down once they complete, but their HTTPRoutes are not cleaned up immediately. The dangling routes cause Traefik to break and lose all routes, so the remaining challenge-solvers become unreachable and never complete, and the cluster stays inaccessible indefinitely.

Longer version:

  1. The Certificate object is created successfully in kubernetes; same for the CertificateRequest, Order, and Challenges.
  2. The Challenges start as "pending"; I then observe the corresponding HTTPRoutes and services being successfully created.
  3. I then observe that 3 or 4 of the services disappear. I then check the Challenges category, and see that exactly that many of the challenges change from "pending" to "valid". So far so good. (I assume the services disappeared because they successfully completed)
  4. Here's where the problem starts: as soon as those services disappear (after completing successfully), Traefik starts logging the error "an error occurred while creating gateway status: [...] Cannot load HTTPRoute service default/cm-acme-http-solverXXXXX: service not found". At that same moment, as seen in Traefik's live web monitor, all of the app's HTTPRoutes break/disappear.
  5. When all of the app's HTTPRoutes disappear, that includes all of the routes that transmit requests to the rest of the 12 challenge-solver services in my cluster. Thus, once that "first burst" of 3 or 4 challenge-solvers completes, it breaks the Traefik router's state and prevents all of the other challenges from being completed. (as well as breaking the app in general -- this was driving me crazy in earlier debugging, because I thought this indicated there must be a problem or ambiguity with my routes, when in fact the routes were just all disappearing once Traefik broke down because of the invalid routes)
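The cascade described in steps 3 to 5 can be illustrated with a toy model. This is only a sketch of the behaviour as observed from the outside, not Traefik's actual code: the `Route` type, the `build_all_or_nothing` function, and the solver names are hypothetical, and the "drop everything on any error" rule is an assumption inferred from the dashboard and logs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    name: str     # HTTPRoute name
    backend: str  # Service the route forwards to


def build_all_or_nothing(routes, services):
    """Toy model of the observed behaviour: one dangling route empties the config."""
    errors = [
        f"Cannot load HTTPRoute service default/{r.backend}: service not found"
        for r in routes
        if r.backend not in services
    ]
    if errors:
        return {}, errors  # every route is dropped, not just the broken one
    return {r.name: r.backend for r in routes}, []


routes = [
    Route("route-web-server", "dm-web-server"),
    Route("route-app-server", "dm-app-server-rs"),
    Route("solver-route-a", "cm-acme-http-solver-a"),  # hypothetical solver names
    Route("solver-route-b", "cm-acme-http-solver-b"),
]
services = {r.backend for r in routes}

before, _ = build_all_or_nothing(routes, services)

# Solver "a" completes: its Service is deleted while its HTTPRoute lingers.
services.discard("cm-acme-http-solver-a")
after, errors = build_all_or_nothing(routes, services)

print(len(before), "routes before;", len(after), "routes after")
print(errors)
```

In this model, solver "b" is still running but can no longer be reached, which matches the remaining challenges staying "pending".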

I'm not completely sure whether to file this bug under cert-manager or Traefik, so I am filing it under both: (here is the cert-manager issue)

  • cert-manager's behavior seems problematic, in that it needlessly leaves up the HTTPRoute for a challenge-solver, despite the service it points to having completed. (one could argue this "shouldn't be a problem", but since cert-manager is used with multiple router solutions, it seems that immediately cleaning up after oneself would be the safer option)
  • Traefik's behavior of totally breaking when a route points to a non-existent service also seems like a problem/fragility.

Expected behaviour:

See above. (Traefik should be more resilient, so that if a route points to a service that disappears for one reason or another, it does not make all the other routes in that gateway break/disappear)
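As a rough illustration of the resilience asked for here, the sketch below keeps every route whose backend exists and only reports the broken ones. It is a hypothetical model, not a proposed patch: `build_isolating_errors` and the solver names are made up for the example, and a real fix might handle a dangling route differently (for instance by keeping the route and answering with an error status).

```python
def build_isolating_errors(routes, services):
    """Keep every route whose backend exists; report the rest instead of failing."""
    config, errors = {}, []
    for name, backend in routes.items():
        if backend in services:
            config[name] = backend
        else:
            errors.append(f"{name}: service {backend} not found")
    return config, errors


routes = {
    "route-web-server": "dm-web-server",
    "route-monitor": "dm-monitor-backend",
    "solver-route-a": "cm-acme-http-solver-a",  # hypothetical; Service already deleted
    "solver-route-b": "cm-acme-http-solver-b",  # hypothetical; still running
}
services = {"dm-web-server", "dm-monitor-backend", "cm-acme-http-solver-b"}

config, errors = build_isolating_errors(routes, services)
print(sorted(config))
print(errors)
```

With this behaviour the app routes and the still-running solver stay reachable, so the remaining challenges could complete.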

Steps to reproduce the bug:

The steps above describe the process as I observed it through the Lens app (which shows all the Kubernetes objects in the cluster, with automatic refreshing, which is quite convenient). I do not have a minimal repro at this point, but the steps above should be understandable (and confirmable/refutable) by those familiar with the code without a live demonstration. That is, it seems to be a "logic conflict" between cert-manager's apparent expectation (that leaving up HTTPRoutes that point to non-existent services is fine) and Traefik's apparent expectation (that HTTPRoutes always have a valid target).

What version of Traefik are you using?

v2.8.0 (as seen in the Traefik web-ui; traefik is running in a remote kubernetes cluster, so I don't see how running it with the docker command above would work)

What is your environment & configuration?

Environment details:

  • Kubernetes version: 1.21.12-0
  • Cloud-provider/provisioner: OVH Public Cloud
  • cert-manager version: v1.8.2
  • Install method: Helm (repo_url='https://charts.jetstack.io', version='1.8.2')

Traefik config/yaml files: (the full set of files can be seen here, since it's open-source)

additionalArguments:
  - "--experimental.kubernetesgateway=true"
  - "--providers.kubernetesgateway=true"
  - "--providers.kubernetesingress=false"
  - "--api.insecure"
  - "--api.dashboard"
hostNetwork: true
ports:
  web:
    port: 80
  websecure:
    port: 443
securityContext:
  capabilities:
    drop: [ALL]
    add: [NET_BIND_SERVICE]
  readOnlyRootFilesystem: true
  runAsGroup: 0
  runAsNonRoot: false
  runAsUser: 0
---
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: GatewayClass
metadata:
  name: gateway-class-main
spec:
  controllerName: traefik.io/gateway-controller

# The HTTP (ie. non-HTTPS) gateway. It's used for:
# 1) Serving the proxy requests coming from Cloudflare/debates.app.
# 2) Serving the ACME tls-certificate provisioning process.
# 3) For various debugging/development purposes.
---
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: Gateway
metadata:
  name: gateway-http
spec:
  gatewayClassName: gateway-class-main
  listeners:
    - name: http
      protocol: HTTP
      port: 80

# The HTTPS gateway. It's used for:
# 1) Serving requests coming from debating.app. (which is the variant of debates.app that avoids the cloudflare-cdn, eg. to fully avoid the websocket-timeout concern)
# 2) As backup for the ".app" domains, since they are required to be served as HTTPS. (if Cloudflare has issues, I want the origin server to be able to provide HTTPS connections)
---
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: Gateway
metadata:
  name: gateway-https
  annotations:
    cert-manager.io/cluster-issuer: zerossl-issuer
spec:
  gatewayClassName: gateway-class-main
  # cert-manager is picky about the values of these fields; reference this before making changes: https://cert-manager.io/docs/usage/gateway
  listeners:
    # origin/ovh
    - {name: https-ovh, hostname: 9m2x1z.nodes.c1.or1.k8s.ovh.us, protocol: HTTPS, port: 443, tls: {mode: Terminate, certificateRefs: [{name: zerossl-key-prod8, kind: Secret, group: core}]}}
    - {name: https-ovh-as, hostname: app-server.9m2x1z.nodes.c1.or1.k8s.ovh.us, protocol: HTTPS, port: 443, tls: {mode: Terminate, certificateRefs: [{name: zerossl-key-prod8, kind: Secret, group: core}]}}
    - {name: https-ovh-asjs, hostname: app-server-js.9m2x1z.nodes.c1.or1.k8s.ovh.us, protocol: HTTPS, port: 443, tls: {mode: Terminate, certificateRefs: [{name: zerossl-key-prod8, kind: Secret, group: core}]}}
    - {name: https-ovh-monitor, hostname: monitor.9m2x1z.nodes.c1.or1.k8s.ovh.us, protocol: HTTPS, port: 443, tls: {mode: Terminate, certificateRefs: [{name: zerossl-key-prod8, kind: Secret, group: core}]}}
    # debating.app
    - {name: https-d1, hostname: debating.app, protocol: HTTPS, port: 443, tls: {mode: Terminate, certificateRefs: [{name: zerossl-key-prod8, kind: Secret, group: core}]}}
    - {name: https-d1-as, hostname: app-server.debating.app, protocol: HTTPS, port: 443, tls: {mode: Terminate, certificateRefs: [{name: zerossl-key-prod8, kind: Secret, group: core}]}}
    - {name: https-d1-asjs, hostname: app-server-js.debating.app, protocol: HTTPS, port: 443, tls: {mode: Terminate, certificateRefs: [{name: zerossl-key-prod8, kind: Secret, group: core}]}}
    - {name: https-d1-monitor, hostname: monitor.debating.app, protocol: HTTPS, port: 443, tls: {mode: Terminate, certificateRefs: [{name: zerossl-key-prod8, kind: Secret, group: core}]}}
    # debates.app
    - {name: https-d2, hostname: debates.app, protocol: HTTPS, port: 443, tls: {mode: Terminate, certificateRefs: [{name: zerossl-key-prod8, kind: Secret, group: core}]}}
    - {name: https-d2-as, hostname: app-server.debates.app, protocol: HTTPS, port: 443, tls: {mode: Terminate, certificateRefs: [{name: zerossl-key-prod8, kind: Secret, group: core}]}}
    - {name: https-d2-asjs, hostname: app-server-js.debates.app, protocol: HTTPS, port: 443, tls: {mode: Terminate, certificateRefs: [{name: zerossl-key-prod8, kind: Secret, group: core}]}}
    - {name: https-d2-monitor, hostname: monitor.debates.app, protocol: HTTPS, port: 443, tls: {mode: Terminate, certificateRefs: [{name: zerossl-key-prod8, kind: Secret, group: core}]}}
---
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: HTTPRoute
metadata:
  name: route-web-server
  namespace: default
spec:
  parentRefs:
    - name: gateway-http
    - name: gateway-https
  hostnames:
  - "9m2x1z.nodes.c1.or1.k8s.ovh.us"
  - "debating.app"
  - "debates.app"
  rules:
    - backendRefs:
        - name: dm-web-server
          port: 5100
---
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: HTTPRoute
metadata:
  name: route-app-server
  namespace: default
spec:
  parentRefs:
    - name: gateway-http
    - name: gateway-https
  hostnames:
  - "app-server.9m2x1z.nodes.c1.or1.k8s.ovh.us"
  - "app-server.debating.app"
  - "app-server.debates.app"
  rules:
    - backendRefs:
      - name: dm-app-server-rs
        port: 5110
---
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: HTTPRoute
metadata:
  name: route-app-server-js
  namespace: default
spec:
  parentRefs:
    - name: gateway-http
    - name: gateway-https
  hostnames:
  - "app-server-js.9m2x1z.nodes.c1.or1.k8s.ovh.us"
  - "app-server-js.debating.app"
  - "app-server-js.debates.app"
  rules:
    - backendRefs:
      - name: dm-app-server-js
        port: 5115
---
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: HTTPRoute
metadata:
  name: route-monitor
  namespace: default
spec:
  parentRefs:
    - name: gateway-http
    - name: gateway-https
  hostnames:
  - "monitor.9m2x1z.nodes.c1.or1.k8s.ovh.us"
  - "monitor.debating.app"
  - "monitor.debates.app"
  rules:
    - backendRefs:
      - name: dm-monitor-backend
        port: 5130

If applicable, please paste the log output in DEBUG level

It seems too long to include all logging at the DEBUG level, but this is its relevant part: (ie. what I see on re-applying the traefik-values.yaml file, in a cluster that has become stuck; ie. it shows the errors that keep occurring, in a cluster that has HTTPRoutes that point to missing services, which is the state in which it problematically breaks)

[K8s EVENT: Pod traefik-76c979b79d-j72n8 (ns: default)] Pulling image "traefik:2.8.0"
[K8s EVENT: Pod traefik-76c979b79d-j72n8 (ns: default)] Successfully pulled image "traefik:2.8.0" in 695.500006ms
time="2022-07-03T12:45:10Z" level=info msg="Configuration loaded from flags."
time="2022-07-03T12:45:10Z" level=debug msg="Experimental Kubernetes Gateway provider has been activated"
time="2022-07-03T12:45:10Z" level=info msg="Traefik version 2.8.0 built on 2022-06-29T15:43:58Z"
time="2022-07-03T12:45:10Z" level=debug msg="Static configuration loaded {\"global\":{\"checkNewVersion\":true,\"sendAnonymousUsage\":true},\"serversTransport\":{\"maxIdleConnsPerHost\":200},\"entryPoints\":{\"metrics\":{\"address\":\":9100/tcp\",\"transport\":{\"lifeCycle\":{\"graceTimeOut\":\"10s\"},\"respondingTimeouts\":{\"idleTimeout\":\"3m0s\"}},\"forwardedHeaders\":{},\"http\":{},\"http2\":{\"maxConcurrentStreams\":250},\"udp\":{\"timeout\":\"3s\"}},\"traefik\":{\"address\":\":9000/tcp\",\"transport\":{\"lifeCycle\":{\"graceTimeOut\":\"10s\"},\"respondingTimeouts\":{\"idleTimeout\":\"3m0s\"}},\"forwardedHeaders\":{},\"http\":{},\"http2\":{\"maxConcurrentStreams\":250},\"udp\":{\"timeout\":\"3s\"}},\"web\":{\"address\":\":80/tcp\",\"transport\":{\"lifeCycle\":{\"graceTimeOut\":\"10s\"},\"respondingTimeouts\":{\"idleTimeout\":\"3m0s\"}},\"forwardedHeaders\":{},\"http\":{},\"http2\":{\"maxConcurrentStreams\":250},\"udp\":{\"timeout\":\"3s\"}},\"websecure\":{\"address\":\":443/tcp\",\"transport\":{\"lifeCycle\":{\"graceTimeOut\":\"10s\"},\"respondingTimeouts\":{\"idleTimeout\":\"3m0s\"}},\"forwardedHeaders\":{},\"http\":{},\"http2\":{\"maxConcurrentStreams\":250},\"udp\":{\"timeout\":\"3s\"}}},\"providers\":{\"providersThrottleDuration\":\"2s\",\"kubernetesCRD\":{},\"kubernetesGateway\":{}},\"api\":{\"insecure\":true,\"dashboard\":true},\"metrics\":{\"prometheus\":{\"buckets\":[0.1,0.3,1.2,5],\"addEntryPointsLabels\":true,\"addServicesLabels\":true,\"entryPoint\":\"metrics\"}},\"ping\":{\"entryPoint\":\"traefik\",\"terminatingStatusCode\":503},\"log\":{\"level\":\"debug\",\"format\":\"common\"},\"pilot\":{\"dashboard\":true},\"experimental\":{\"kubernetesGateway\":true}}"
time="2022-07-03T12:45:10Z" level=info msg="Stats collection is enabled."
time="2022-07-03T12:45:10Z" level=info msg="Many thanks for contributing to Traefik's improvement by allowing us to receive anonymous information from your configuration."
time="2022-07-03T12:45:10Z" level=info msg="Help us improve Traefik by leaving this feature on :)"
time="2022-07-03T12:45:10Z" level=info msg="More details on: https://doc.traefik.io/traefik/contributing/data-collection/"
time="2022-07-03T12:45:10Z" level=warning msg="Traefik Pilot is deprecated and will be removed soon. Please check our Blog for migration instructions later this year."
time="2022-07-03T12:45:10Z" level=debug msg="Configured Prometheus metrics" metricsProviderName=prometheus
time="2022-07-03T12:45:10Z" level=info msg="Starting provider aggregator aggregator.ProviderAggregator"
time="2022-07-03T12:45:10Z" level=debug msg="Starting TCP Server" entryPointName=traefik
time="2022-07-03T12:45:10Z" level=debug msg="Starting TCP Server" entryPointName=metrics
time="2022-07-03T12:45:10Z" level=debug msg="Starting TCP Server" entryPointName=web
time="2022-07-03T12:45:10Z" level=debug msg="Starting TCP Server" entryPointName=websecure
time="2022-07-03T12:45:10Z" level=info msg="Starting provider *traefik.Provider"
time="2022-07-03T12:45:10Z" level=debug msg="*traefik.Provider provider configuration: {}"
time="2022-07-03T12:45:10Z" level=info msg="Starting provider *crd.Provider"
time="2022-07-03T12:45:10Z" level=debug msg="*crd.Provider provider configuration: {}"
time="2022-07-03T12:45:10Z" level=info msg="label selector is: \"\"" providerName=kubernetescrd
time="2022-07-03T12:45:10Z" level=info msg="Creating in-cluster Provider client" providerName=kubernetescrd
time="2022-07-03T12:45:10Z" level=debug msg="Configuration received: {\"http\":{\"routers\":{\"api\":{\"entryPoints\":[\"traefik\"],\"service\":\"api@internal\",\"rule\":\"PathPrefix(`/api`)\",\"priority\":2147483646},\"dashboard\":{\"entryPoints\":[\"traefik\"],\"middlewares\":[\"dashboard_redirect@internal\",\"dashboard_stripprefix@internal\"],\"service\":\"dashboard@internal\",\"rule\":\"PathPrefix(`/`)\",\"priority\":2147483645},\"ping\":{\"entryPoints\":[\"traefik\"],\"service\":\"ping@internal\",\"rule\":\"PathPrefix(`/ping`)\",\"priority\":2147483647},\"prometheus\":{\"entryPoints\":[\"metrics\"],\"service\":\"prometheus@internal\",\"rule\":\"PathPrefix(`/metrics`)\",\"priority\":2147483647}},\"services\":{\"api\":{},\"dashboard\":{},\"noop\":{},\"ping\":{},\"prometheus\":{}},\"middlewares\":{\"dashboard_redirect\":{\"redirectRegex\":{\"regex\":\"^(http:\\\\/\\\\/(\\\\[[\\\\w:.]+\\\\]|[\\\\w\\\\._-]+)(:\\\\d+)?)\\\\/$\",\"replacement\":\"${1}/dashboard/\",\"permanent\":true}},\"dashboard_stripprefix\":{\"stripPrefix\":{\"prefixes\":[\"/dashboard/\",\"/dashboard\"]}}},\"serversTransports\":{\"default\":{\"maxIdleConnsPerHost\":200}}},\"tcp\":{},\"udp\":{},\"tls\":{}}" providerName=internal
time="2022-07-03T12:45:10Z" level=debug msg="No default certificate, generating one" tlsStoreName=default
time="2022-07-03T12:45:10Z" level=info msg="Starting provider *gateway.Provider"
time="2022-07-03T12:45:10Z" level=debug msg="*gateway.Provider provider configuration: {}"
time="2022-07-03T12:45:10Z" level=info msg="label selector is: \"\"" providerName=kubernetesgateway
time="2022-07-03T12:45:10Z" level=info msg="Creating in-cluster Provider client" providerName=kubernetesgateway
time="2022-07-03T12:45:10Z" level=info msg="Starting provider *acme.ChallengeTLSALPN"
time="2022-07-03T12:45:10Z" level=debug msg="*acme.ChallengeTLSALPN provider configuration: {}"
time="2022-07-03T12:45:10Z" level=debug msg="Added outgoing tracing middleware prometheus@internal" entryPointName=metrics routerName=prometheus@internal middlewareName=tracing middlewareType=TracingForwarder
time="2022-07-03T12:45:10Z" level=debug msg="Creating middleware" entryPointName=metrics middlewareName=traefik-internal-recovery middlewareType=Recovery
time="2022-07-03T12:45:10Z" level=debug msg="Added outgoing tracing middleware api@internal" routerName=api@internal middlewareType=TracingForwarder middlewareName=tracing entryPointName=traefik
time="2022-07-03T12:45:10Z" level=debug msg="Added outgoing tracing middleware dashboard@internal" entryPointName=traefik routerName=dashboard@internal middlewareName=tracing middlewareType=TracingForwarder
time="2022-07-03T12:45:10Z" level=debug msg="Creating middleware" routerName=dashboard@internal middlewareName=dashboard_stripprefix@internal middlewareType=StripPrefix entryPointName=traefik
time="2022-07-03T12:45:10Z" level=debug msg="Adding tracing to middleware" entryPointName=traefik routerName=dashboard@internal middlewareName=dashboard_stripprefix@internal
time="2022-07-03T12:45:10Z" level=debug msg="Creating middleware" middlewareName=dashboard_redirect@internal middlewareType=RedirectRegex entryPointName=traefik routerName=dashboard@internal
time="2022-07-03T12:45:10Z" level=debug msg="Setting up redirection from ^(http:\\/\\/(\\[[\\w:.]+\\]|[\\w\\._-]+)(:\\d+)?)\\/$ to ${1}/dashboard/" middlewareType=RedirectRegex entryPointName=traefik routerName=dashboard@internal middlewareName=dashboard_redirect@internal
time="2022-07-03T12:45:10Z" level=debug msg="Adding tracing to middleware" entryPointName=traefik routerName=dashboard@internal middlewareName=dashboard_redirect@internal
time="2022-07-03T12:45:10Z" level=debug msg="Added outgoing tracing middleware ping@internal" middlewareType=TracingForwarder entryPointName=traefik routerName=ping@internal middlewareName=tracing
time="2022-07-03T12:45:10Z" level=debug msg="Creating middleware" middlewareType=Recovery entryPointName=traefik middlewareName=traefik-internal-recovery
time="2022-07-03T12:45:10Z" level=debug msg="Creating middleware" entryPointName=metrics middlewareName=metrics-entrypoint middlewareType=Metrics
time="2022-07-03T12:45:10Z" level=debug msg="Creating middleware" middlewareName=metrics-entrypoint entryPointName=traefik middlewareType=Metrics
time="2022-07-03T12:45:10Z" level=debug msg="Creating middleware" entryPointName=web middlewareName=metrics-entrypoint middlewareType=Metrics
time="2022-07-03T12:45:10Z" level=debug msg="Creating middleware" middlewareType=Metrics entryPointName=websecure middlewareName=metrics-entrypoint
time="2022-07-03T12:45:10Z" level=debug msg="Creating middleware" entryPointName=metrics middlewareName=metrics-entrypoint middlewareType=Metrics
time="2022-07-03T12:45:10Z" level=debug msg="Creating middleware" middlewareType=Metrics entryPointName=traefik middlewareName=metrics-entrypoint
time="2022-07-03T12:45:10Z" level=debug msg="Creating middleware" middlewareName=metrics-entrypoint middlewareType=Metrics entryPointName=web
time="2022-07-03T12:45:10Z" level=debug msg="Creating middleware" middlewareName=metrics-entrypoint entryPointName=websecure middlewareType=Metrics
time="2022-07-03T12:45:10Z" level=debug msg="Configuration received: {\"http\":{\"routers\":{\"default-traefik-dashboard-d012b7f875133eeab4e5\":{\"entryPoints\":[\"traefik\"],\"service\":\"api@internal\",\"rule\":\"PathPrefix(`/dashboard`) || PathPrefix(`/api`)\"}}},\"tcp\":{},\"udp\":{},\"tls\":{}}" providerName=kubernetescrd
time="2022-07-03T12:45:10Z" level=debug msg="No default certificate, generating one" tlsStoreName=default
time="2022-07-03T12:45:10Z" level=error msg="an error occurred while creating gateway status: 3 errors occurred:\n\t* Cannot load HTTPRoute service default/cm-acme-http-solverpm7jz: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solvertw978: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solver5dqs9: service not found\n\n" gateway=gateway-http namespace=default providerName=kubernetesgateway
time="2022-07-03T12:45:10Z" level=error msg="an error occurred while creating gateway status: 12 errors occurred:\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\n" gateway=gateway-https namespace=default providerName=kubernetesgateway
time="2022-07-03T12:45:10Z" level=debug msg="Configuration received: {\"http\":{},\"tcp\":{},\"udp\":{},\"tls\":{}}" providerName=kubernetesgateway
time="2022-07-03T12:45:11Z" level=debug msg="Added outgoing tracing middleware api@internal" routerName=api@internal middlewareName=tracing middlewareType=TracingForwarder entryPointName=traefik
time="2022-07-03T12:45:11Z" level=debug msg="Added outgoing tracing middleware dashboard@internal" routerName=dashboard@internal middlewareName=tracing middlewareType=TracingForwarder entryPointName=traefik
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=traefik routerName=dashboard@internal middlewareName=dashboard_stripprefix@internal middlewareType=StripPrefix
time="2022-07-03T12:45:11Z" level=debug msg="Adding tracing to middleware" middlewareName=dashboard_stripprefix@internal routerName=dashboard@internal entryPointName=traefik
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" routerName=dashboard@internal middlewareName=dashboard_redirect@internal middlewareType=RedirectRegex entryPointName=traefik
time="2022-07-03T12:45:11Z" level=debug msg="Setting up redirection from ^(http:\\/\\/(\\[[\\w:.]+\\]|[\\w\\._-]+)(:\\d+)?)\\/$ to ${1}/dashboard/" middlewareType=RedirectRegex entryPointName=traefik routerName=dashboard@internal middlewareName=dashboard_redirect@internal
time="2022-07-03T12:45:11Z" level=debug msg="Adding tracing to middleware" entryPointName=traefik routerName=dashboard@internal middlewareName=dashboard_redirect@internal
time="2022-07-03T12:45:11Z" level=debug msg="Added outgoing tracing middleware ping@internal" entryPointName=traefik middlewareName=tracing middlewareType=TracingForwarder routerName=ping@internal
time="2022-07-03T12:45:11Z" level=debug msg="Added outgoing tracing middleware api@internal" middlewareType=TracingForwarder entryPointName=traefik routerName=default-traefik-dashboard-d012b7f875133eeab4e5@kubernetescrd middlewareName=tracing
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=traefik middlewareName=traefik-internal-recovery middlewareType=Recovery
time="2022-07-03T12:45:11Z" level=debug msg="Added outgoing tracing middleware prometheus@internal" routerName=prometheus@internal entryPointName=metrics middlewareName=tracing middlewareType=TracingForwarder
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" middlewareType=Recovery entryPointName=metrics middlewareName=traefik-internal-recovery
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=metrics middlewareName=metrics-entrypoint middlewareType=Metrics
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" middlewareName=metrics-entrypoint entryPointName=traefik middlewareType=Metrics
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=web middlewareName=metrics-entrypoint middlewareType=Metrics
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=websecure middlewareName=metrics-entrypoint middlewareType=Metrics
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" middlewareName=metrics-entrypoint middlewareType=Metrics entryPointName=metrics
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" middlewareType=Metrics entryPointName=traefik middlewareName=metrics-entrypoint
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=web middlewareName=metrics-entrypoint middlewareType=Metrics
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=websecure middlewareName=metrics-entrypoint middlewareType=Metrics
time="2022-07-03T12:45:11Z" level=debug msg="No default certificate, generating one" tlsStoreName=default
time="2022-07-03T12:45:11Z" level=debug msg="Added outgoing tracing middleware dashboard@internal" routerName=dashboard@internal entryPointName=traefik middlewareName=tracing middlewareType=TracingForwarder
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=traefik routerName=dashboard@internal middlewareName=dashboard_stripprefix@internal middlewareType=StripPrefix
time="2022-07-03T12:45:11Z" level=debug msg="Adding tracing to middleware" middlewareName=dashboard_stripprefix@internal entryPointName=traefik routerName=dashboard@internal
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=traefik routerName=dashboard@internal middlewareName=dashboard_redirect@internal middlewareType=RedirectRegex
time="2022-07-03T12:45:11Z" level=debug msg="Setting up redirection from ^(http:\\/\\/(\\[[\\w:.]+\\]|[\\w\\._-]+)(:\\d+)?)\\/$ to ${1}/dashboard/" entryPointName=traefik routerName=dashboard@internal middlewareName=dashboard_redirect@internal middlewareType=RedirectRegex
time="2022-07-03T12:45:11Z" level=debug msg="Adding tracing to middleware" routerName=dashboard@internal middlewareName=dashboard_redirect@internal entryPointName=traefik
time="2022-07-03T12:45:11Z" level=debug msg="Added outgoing tracing middleware ping@internal" routerName=ping@internal middlewareName=tracing middlewareType=TracingForwarder entryPointName=traefik
time="2022-07-03T12:45:11Z" level=debug msg="Added outgoing tracing middleware api@internal" entryPointName=traefik routerName=api@internal middlewareName=tracing middlewareType=TracingForwarder
time="2022-07-03T12:45:11Z" level=debug msg="Added outgoing tracing middleware api@internal" middlewareType=TracingForwarder entryPointName=traefik routerName=default-traefik-dashboard-d012b7f875133eeab4e5@kubernetescrd middlewareName=tracing
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=traefik middlewareName=traefik-internal-recovery middlewareType=Recovery
time="2022-07-03T12:45:11Z" level=debug msg="Added outgoing tracing middleware prometheus@internal" entryPointName=metrics routerName=prometheus@internal middlewareName=tracing middlewareType=TracingForwarder
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=metrics middlewareName=traefik-internal-recovery middlewareType=Recovery
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=metrics middlewareName=metrics-entrypoint middlewareType=Metrics
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" middlewareType=Metrics middlewareName=metrics-entrypoint entryPointName=traefik
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=web middlewareName=metrics-entrypoint middlewareType=Metrics
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" middlewareType=Metrics entryPointName=websecure middlewareName=metrics-entrypoint
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" middlewareType=Metrics entryPointName=metrics middlewareName=metrics-entrypoint
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=traefik middlewareName=metrics-entrypoint middlewareType=Metrics
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" entryPointName=web middlewareName=metrics-entrypoint middlewareType=Metrics
time="2022-07-03T12:45:11Z" level=debug msg="Creating middleware" middlewareName=metrics-entrypoint middlewareType=Metrics entryPointName=websecure
time="2022-07-03T12:45:12Z" level=error msg="an error occurred while creating gateway status: 3 errors occurred:\n\t* Cannot load HTTPRoute service default/cm-acme-http-solver5dqs9: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solverpm7jz: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solvertw978: service not found\n\n" providerName=kubernetesgateway gateway=gateway-http namespace=default
time="2022-07-03T12:45:12Z" level=error msg="an error occurred while creating gateway status: 12 errors occurred:\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\n" gateway=gateway-https namespace=default providerName=kubernetesgateway
time="2022-07-03T12:45:12Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetesgateway
[K8s EVENT: Pod traefik-696fbddd6f-wl69s (ns: default)] Readiness probe failed: Get "http://15.204.30.179:9000/ping": dial tcp 15.204.30.179:9000: connect: connection refused
[K8s EVENT: Pod traefik-696fbddd6f-wl69s (ns: default)] Liveness probe failed: Get "http://15.204.30.179:9000/ping": dial tcp 15.204.30.179:9000: connect: connection refused
time="2022-07-03T12:45:12Z" level=error msg="an error occurred while creating gateway status: 3 errors occurred:\n\t* Cannot load HTTPRoute service default/cm-acme-http-solvertw978: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solverpm7jz: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solver5dqs9: service not found\n\n" providerName=kubernetesgateway gateway=gateway-http namespace=default
time="2022-07-03T12:45:12Z" level=error msg="an error occurred while creating gateway status: 12 errors occurred:\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\n" gateway=gateway-https providerName=kubernetesgateway namespace=default
time="2022-07-03T12:45:12Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetesgateway
time="2022-07-03T12:45:22Z" level=error msg="an error occurred while creating gateway status: 3 errors occurred:\n\t* Cannot load HTTPRoute service default/cm-acme-http-solver5dqs9: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solverpm7jz: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solvertw978: service not found\n\n" namespace=default providerName=kubernetesgateway gateway=gateway-http
time="2022-07-03T12:45:22Z" level=error msg="an error occurred while creating gateway status: 12 errors occurred:\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\n" providerName=kubernetesgateway gateway=gateway-https namespace=default
time="2022-07-03T12:45:22Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetesgateway
time="2022-07-03T12:45:28Z" level=error msg="an error occurred while creating gateway status: 3 errors occurred:\n\t* Cannot load HTTPRoute service default/cm-acme-http-solver5dqs9: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solverpm7jz: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solvertw978: service not found\n\n" providerName=kubernetesgateway gateway=gateway-http namespace=default
time="2022-07-03T12:45:28Z" level=error msg="an error occurred while creating gateway status: 12 errors occurred:\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\n" namespace=default providerName=kubernetesgateway gateway=gateway-https
time="2022-07-03T12:45:28Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetesgateway
time="2022-07-03T12:45:28Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetescrd
time="2022-07-03T12:45:32Z" level=error msg="an error occurred while creating gateway status: 3 errors occurred:\n\t* Cannot load HTTPRoute service default/cm-acme-http-solver5dqs9: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solverpm7jz: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solvertw978: service not found\n\n" providerName=kubernetesgateway gateway=gateway-http namespace=default
time="2022-07-03T12:45:32Z" level=error msg="an error occurred while creating gateway status: 12 errors occurred:\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\n" providerName=kubernetesgateway gateway=gateway-https namespace=default
time="2022-07-03T12:45:32Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetesgateway
time="2022-07-03T12:45:42Z" level=error msg="an error occurred while creating gateway status: 3 errors occurred:\n\t* Cannot load HTTPRoute service default/cm-acme-http-solver5dqs9: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solverpm7jz: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solvertw978: service not found\n\n" gateway=gateway-http namespace=default providerName=kubernetesgateway
time="2022-07-03T12:45:42Z" level=error msg="an error occurred while creating gateway status: 12 errors occurred:\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\n" providerName=kubernetesgateway gateway=gateway-https namespace=default
time="2022-07-03T12:45:42Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetesgateway
time="2022-07-03T12:45:52Z" level=error msg="an error occurred while creating gateway status: 12 errors occurred:\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\n" gateway=gateway-https namespace=default providerName=kubernetesgateway
time="2022-07-03T12:45:52Z" level=error msg="an error occurred while creating gateway status: 3 errors occurred:\n\t* Cannot load HTTPRoute service default/cm-acme-http-solvertw978: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solverpm7jz: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solver5dqs9: service not found\n\n" namespace=default gateway=gateway-http providerName=kubernetesgateway
time="2022-07-03T12:45:52Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetesgateway
time="2022-07-03T12:46:02Z" level=error msg="an error occurred while creating gateway status: 3 errors occurred:\n\t* Cannot load HTTPRoute service default/cm-acme-http-solverpm7jz: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solvertw978: service not found\n\t* Cannot load HTTPRoute service default/cm-acme-http-solver5dqs9: service not found\n\n" providerName=kubernetesgateway namespace=default gateway=gateway-http
time="2022-07-03T12:46:02Z" level=error msg="an error occurred while creating gateway status: 12 errors occurred:\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\t* Error while retrieving certificate: secret default/zerossl-key-prod8 does not exist\n\n" providerName=kubernetesgateway gateway=gateway-https namespace=default
time="2022-07-03T12:46:02Z" level=debug msg="Skipping Kubernetes event kind *v1.Endpoints" providerName=kubernetesgateway
Venryx added a commit to debate-map/app that referenced this issue Jul 3, 2022
… for it:

1) cert-manager: cert-manager/cert-manager#5260
2) traefik: traefik/traefik#9158
Summary: The cert-manager services are closing once they complete but not cleaning up the HTTPRoutes immediately, which causes Traefik to break / lose all routes, which means the remaining challenge-solvers become inaccessible, which means they never complete, and thus the cluster remains inaccessible indefinitely.
@kevinpollet kevinpollet added kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed. area/provider/k8s/gatewayapi and removed status/0-needs-triage labels Jul 4, 2022
@kevinpollet kevinpollet added this to issues in v2 via automation Jul 4, 2022
@TeoGoddet

TeoGoddet commented Jul 4, 2022

+1 The whole gateway should not be invalid when only 1 route is.

@israel-morales

I believe I have run into the same issue, in a simpler case, when testing weighted routing.

The following manifest renders the entire Traefik gateway and all other HTTPRoute CRs inoperable if there are no active pods in the "canary" service:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: podinfo
  namespace: test
spec:
  hostnames:
  - podinfo.localhost
  parentRefs:
  - group: gateway.networking.k8s.io
    kind: Gateway
    name: traefik-gateway
    namespace: kube-system
  rules:
  - backendRefs:
    - group: ""
      kind: Service
      name: podinfo-primary  # valid service, 2 active pods
      port: 9898
      weight: 100
    - group: ""
      kind: Service
      name: podinfo-canary   # valid service, 0 active pods
      port: 9898
      weight: 0
    matches:
    - path:
        type: PathPrefix
        value: /

If pods are scaled up in the "canary" service, routing works without issue.

I also saw the error "Cannot load HTTPRoute service test/podinfo" in the Traefik pod logs.

Traefik version: 2.9.1

@TeoGoddet

Hi!
This issue is a real problem!
Could you consider a new triage?

@ddtmachado
Contributor

Hello @TeoGoddet, if you have confirmed the issue, can you share the details with us?
We could use a reproducible environment that's easy to set up, so we can switch the label to bug confirmed.

@israel-morales

israel-morales commented Nov 30, 2022

Is it possible that the recently merged fix below will also resolve this issue (#9158)?

#9423

@ddtmachado
Contributor

@israel-morales no, that fix was specifically for the CRD provider

@kevinpollet kevinpollet added kind/bug/confirmed a confirmed bug (reproducible). and removed kind/bug/possible a possible bug that needs analysis before it is confirmed or fixed. labels Jan 5, 2023
Venryx added a commit to debate-map/app that referenced this issue Jan 17, 2023
…bfiles.

* Switched from the ingress-based version of Traefik (under @Attempt6 folder) to the gateway-based one (under @attempt7 folder). Gateway-approach is cleaner, and is apparently the focus/future. While I think it still has the certificate-provisioning bug (traefik/traefik#9158), it was flagged recently as "bug:confirmed", so it'll likely be fixed eventually -- and I don't need it fixed right away, since Cloudflare offers the "Flexible" HTTPS option, which solves the main problem (ie. for end-users) for now.
* Improved traefik's gateway config, by having it no longer require the "NET_BIND_SERVICE" security-context.
* Updated traefik to latest version, by changing the helm-chart version.
* Updated traefik's gateway config, to the latest contents found at: https://github.com/kubernetes-sigs/gateway-api/blob/69e4d8b69b8ec936bc1ed3ca8af807cd45dca09d/config/webhook
* Removed some folders for kube-prometheus and such. (superseded by loki-stack)
* Fixed that the grafana.debatemap.app subdomain was not working. (I had only tested locally previously using localhost:XXXX, whereas in prod I needed to handle the fact that the service was in another namespace)
@Venryx
Author

Venryx commented May 31, 2023

Okay, I have traced the code-path that causes this issue:

  • 1) Top-level logic starts in the createGatewayConf function.
  • 2) createGatewayConf calls the fillGatewayConf function, passing the gateway config.
    • 2.1) fillGatewayConf calls the gatewayHTTPRouteToHTTPConf function, passing each listener within the gateway.
      • 2.1.1) gatewayHTTPRouteToHTTPConf calls the loadServices function, for each listener's route's backendRef value (ie. specifier of port and service name).
        • 2.1.1.1) loadServices tries to retrieve information about the service backing the current listener's http-route on this line.
        • 2.1.1.2) loadServices then detects that no service exists with the specified name, for one or more of the listeners, and returns an error about it on this line. The error then bubbles up through each of the calling functions above, until it reaches the createGatewayConf function mentioned in step 1.
  • 3) createGatewayConf now calls the makeGatewayStatus function, with the listener errors contained within the listenerStatuses argument.
    • 3.1) makeGatewayStatus then collects the errors from its listeners on this line, and then returns those errors as an error for the gateway overall on this line.
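
To make the failure mode above concrete, here is a minimal Go sketch of the same all-or-nothing pattern. It is not Traefik's code: the `route` type and the `loadService` and `buildGateway` functions are hypothetical, simplified stand-ins for the functions named in the steps, and the service names are made up.

```go
package main

import "fmt"

// route is a hypothetical, simplified stand-in for an HTTPRoute with one backend.
type route struct {
	name    string
	service string // "namespace/name" of the backing Service
}

// loadService mimics the lookup in step 2.1.1: it fails when the Service is absent.
func loadService(existing map[string]bool, svc string) error {
	if !existing[svc] {
		return fmt.Errorf("cannot load HTTPRoute service %s: service not found", svc)
	}
	return nil
}

// buildGateway mimics steps 2 and 3: every route error is collected, and if
// at least one exists, the whole gateway fails and no routes are returned.
func buildGateway(existing map[string]bool, routes []route) ([]string, error) {
	var loaded []string
	var errs []error
	for _, r := range routes {
		if err := loadService(existing, r.service); err != nil {
			errs = append(errs, err)
			continue
		}
		loaded = append(loaded, r.name)
	}
	if len(errs) > 0 {
		// One broken route is enough to discard the healthy ones too.
		return nil, fmt.Errorf("%d errors occurred: %v", len(errs), errs)
	}
	return loaded, nil
}

func main() {
	existing := map[string]bool{"default/app": true}
	routes := []route{
		{name: "app", service: "default/app"},
		{name: "solver", service: "default/cm-acme-http-solver"},
	}
	loaded, err := buildGateway(existing, routes)
	fmt.Println("loaded routes:", loaded)
	fmt.Println("error:", err)
}
```

In this sketch the healthy "app" route is lost purely because the "solver" route has no service behind it, which is the behaviour reported in this issue.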

Now that the code-path for the issue is made clear, what is the correct route for solving the issue?

The issue being: The entire Traefik system begins to fail (ie. discards/ignores all routes) if a single http-route in one of the listeners of one of the gateways is unable to find the service behind the route.

  • This is seen in my specific case of cert-manager destroying the "challenge solver" service prior to destroying its http route. (when one of the challenges gets completed)
  • Because Traefik then "breaks" (ie. discards/ignores all routes), the rest of the "challenge solver" services become "unreachable" to the ZeroSSL server, meaning cert-manager can never complete its operation -- meaning Traefik gets stuck in this broken state forever.

In this specific case, the issue could be solved by having cert-manager destroy its service after destroying the associated http-route (for which I raised an issue here, which ended up being closed due to inactivity). However, this seems like a fragile solution to the problem: Traefik should not be so picky/fragile that it completely stops working if a single http-route is temporarily "not backed by a service" like this.

With the code-path for the issue laid out above, I'm hoping a proper solution to the problem can be found by devs more familiar with how Traefik is intended to function when the cluster is in an imperfect/inconsistent/"intermediary" state like this. (ie. where one of its http-routes temporarily does not have a service backing it)
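
As a rough illustration of the more tolerant behaviour I have in mind, here is a small Go sketch. It is only an assumption about one possible shape of a fix, not Traefik's actual code or API: the `route` type and `buildGatewayTolerant` function are hypothetical names, and the service names are made up.

```go
package main

import "fmt"

// route is a hypothetical, simplified stand-in for an HTTPRoute with one backend.
type route struct {
	name    string
	service string // "namespace/name" of the backing Service
}

// buildGatewayTolerant keeps every route whose Service exists and reports the
// others per route, instead of failing the gateway as a whole.
func buildGatewayTolerant(existing map[string]bool, routes []route) ([]string, map[string]error) {
	var loaded []string
	skipped := make(map[string]error)
	for _, r := range routes {
		if !existing[r.service] {
			// Record the problem against this route only and move on.
			skipped[r.name] = fmt.Errorf("service %s not found", r.service)
			continue
		}
		loaded = append(loaded, r.name)
	}
	return loaded, skipped
}

func main() {
	existing := map[string]bool{"default/app": true, "default/solver-b": true}
	routes := []route{
		{name: "app", service: "default/app"},
		{name: "solver-a", service: "default/solver-a"}, // service already deleted
		{name: "solver-b", service: "default/solver-b"},
	}
	loaded, skipped := buildGatewayTolerant(existing, routes)
	fmt.Println("loaded routes:", loaded)
	for name, err := range skipped {
		fmt.Printf("skipped route %s: %v\n", name, err)
	}
}
```

With this shape, a finished challenge-solver whose service is already gone would only affect its own route, so the remaining solvers would stay reachable.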

@nmengin
Contributor

nmengin commented Jun 1, 2023

Hello @Venryx,

Thanks for your analysis, we think it makes a lot of sense.
Unfortunately, this will not make it onto our roadmap for a while, as we are focused elsewhere.

If you or another community member would like to fix it, let us know.

Venryx added a commit to debate-map/app that referenced this issue Jun 1, 2023
…cert-manager solver urls... (traefik just has no way to do so through the gateway-api atm)

* During cert-manager operation, encountered error: "Failed to fetch authorization: 401 urn:ietf:params:acme:error:unauthorized: The request specified an authorization that does not exist". First thought was that it was due to my prior values for the ".env" vars "EAB_HMAC_KEY" and "EAB_KID" holding values that were expired or something, so I generated a new set and provided those. The error went away. Not sure if the resolution was due to the new values, or just due to a relaunch of Tilt, but in any case it seems solved now.

I am now finally hitting the original error encountered half a year ago (traefik/traefik#9158). To resolve it, there are some things I can try:
* Create a pod that kills the cert-manager http-routes that are no longer needed. (thus getting Traefik to work again)
* Create a pod (or change some config) to keep the service behind those cert-manager http-routes alive somehow, until the cert is fully created and active. (seems best thing to try first)
* Modify the Traefik source code to handle the situation better. (best solution, but most complicated probably)
@Venryx

This comment was marked as off-topic.

@nmengin nmengin added the contributor/wanted Participation from an external contributor is highly requested label Mar 14, 2024