EnvoyConfig InSync but pods having different TLS certificates #140

Closed
slopezz opened this issue Mar 31, 2022 · 4 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR or issue lacks a `priority/foo` label and requires one. needs-size Indicates a PR or issue lacks a `size/foo` label and requires one.

Comments

slopezz (Member) commented Mar 31, 2022

What happened

In the staging environment, we had intermittent alerts about the TLS certificate expiration date of monitored VIP HTTP endpoints, probed with blackbox_exporter:

        - alert: ProbeSSLCertExpire
          expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
          for: 5m
          labels:
            severity: critical
          annotations:
            message: "SSL certificate from probe target {{ $labels.target }} is going to expire in 14 days"

This means that the certificate exposed by the Envoy sidecar differed among pods belonging to the same Deployment.

The EnvoyConfig related to all those endpoints was InSync, with DESIRED VERSION equal to PUBLISHED VERSION:

$ oc get envoyconfig
NAME                            NODE ID                         ENVOY API   DESIRED VERSION   PUBLISHED VERSION   CACHE STATE
mt-ingress                      mt-ingress                      v3          b98554f7b         b98554f7b           InSync
stg-saas-apicast-production     stg-saas-apicast-production     v3          6669946454        6669946454          InSync
stg-saas-apicast-staging        stg-saas-apicast-staging        v3          7bcb8576c8        7bcb8576c8          InSync
stg-saas-backend-listener       stg-saas-backend-listener       v3          746b79f77b        746b79f77b          InSync
stg-saas-echo-api               stg-saas-echo-api               v3          68f954f6c9        68f954f6c9          InSync

echo-api, backend, mt-ingress and apicast-production each had 50% of their pods (1 pod out of 2) serving the old TLS certificate; only apicast-staging was fully correct.

AFAIK, when cert-manager updates a certificate in a Kubernetes Secret, the marin3r ServiceDiscovery should update every sidecar via the xDS API and receive a NACK from every pod. If I recall correctly, in the past a single pod NACK was enough to mark the EnvoyConfig as InSync, but lately it was changed to a percentage: 1b17645

However, in this case 50% of the pods were not updated, yet their EnvoyConfig was InSync.

How we checked the certificate from every pod

We checked which pods were serving the old certificate by port-forwarding the Envoy admin port:

$ oc port-forward backend-listener-67d4786b64-lx682 9901:9901
Forwarding from 127.0.0.1:9901 -> 9901
Forwarding from [::1]:9901 -> 9901

And accessing the certs page at http://127.0.0.1:9901/certs, where we could check the cert expiration date.
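
For reference, roughly the same check can be scripted against the admin endpoint instead of opening the page in a browser. A minimal sketch, reusing the port-forward above (the Secret name is hypothetical, the jq paths follow Envoy's /certs JSON output):

# dump serial number and expiration date of every cert the sidecar is serving
$ curl -s http://127.0.0.1:9901/certs \
    | jq -r '.certificates[].cert_chain[] | "\(.serial_number) expires \(.expiration_time)"'

# compare with the cert currently stored in the cert-manager managed Secret
$ oc get secret backend-listener-tls-cert -o jsonpath='{.data.tls\.crt}' \
    | base64 -d | openssl x509 -noout -serial -enddate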

When we detect an out-of-sync pod, we delete it; the new pod that gets created contains the correct cert (the same one as the other pod of the Deployment, since this happened to 1 out of 2 pods per deployment).

This issue is not happening in the production environment.

Workaround to be alerted

For the moment, we have added an alert that fires when the certificate expiration date drifts for the same HTTP target (changes over time are detected with a rate function, since consecutive probes hitting pods with different certificates make the rate non-zero; checked in the staging environment):

        - alert: ProbeSSLCertExpireDrift
          expr: rate(probe_ssl_earliest_cert_expiry[5m]) > 0
          for: 10m
          labels:
            severity: critical
          annotations:
            message: "SSL certificate from probe target {{ $labels.target }} is showing different certificate expiration dates, maybe some pods are not loading a renewed certificate"

How to reproduce

We haven't seen any errors in the ServiceDiscovery, so unfortunately we don't know how to reproduce it :(

@slopezz added the kind/bug label on Mar 31, 2022
@3scale-robot added the needs-priority and needs-size labels on Mar 31, 2022
roivaz (Member) commented May 30, 2022

I don't think this is a problem with the Service Discovery; let me try to explain why:

  • The DiscoveryService does not push the new certificate to the clients; it's the other way around: the clients are the ones that connect to the DiscoveryService pod and request updates.
  • There is a single DiscoveryService with a single cache where all resources (and therefore certificates) are stored; when this cache is updated with a new certificate, all the clients fetch the new certificate from the same cache (the xDS server).
  • When a client fetches a new resource and updates itself properly, it sends an ACK to the DiscoveryService. A NACK is sent only if the new resource is rejected by the client (see the sketch below for a way to check this from the Envoy side).
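
A rough way to check the ACK/NACK side on an affected pod, using the standard Envoy admin stats and the same kind of port-forward shown in the issue description (just a sketch):

# update_rejected counters > 0 would mean the pod NACKed a config update,
# while update_success counters should keep increasing as updates are ACKed
$ curl -s 'http://127.0.0.1:9901/stats?filter=update_rejected|update_success'

# connected_state is 1 while the pod holds a live connection to the xDS server
$ curl -s 'http://127.0.0.1:9901/stats?filter=control_plane.connected_state'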

Given all the previous points, I find it unlikely that the problem is the DiscoveryService, at least without inspecting the DiscoveryService logs.
It is more likely that a persistent connection was open against the Envoy pod at the time, which would cause the Envoy listener to get stuck in draining state forever and never be able to recreate the listener to load the new certificate.
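
If that is what happened, it should show up in the listener manager gauges of the affected pod; a quick sketch of the check (again through the admin port-forward):

# a total_listeners_draining gauge that stays above 0 would point to a listener
# stuck draining and never reloading the new certificate
$ curl -s 'http://127.0.0.1:9901/stats?filter=listener_manager.total_listeners'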

Anyway we don't have enough data to clearly point to one root cause or the other, so we'll have to wait and see if it happens again.

roivaz (Member) commented May 30, 2022

@raelga @slopezz see my previous comment 🙏 🙇

slopezz (Member, Author) commented May 30, 2022

Ok, fair enough!

Let's close it and re-open it if this issue re-appears in the future? @roivaz

roivaz (Member) commented Aug 30, 2022

So far we have not been able to reproduce the issue, nor have we seen it happen again. We will re-open if there is a new report of this in the future.

/close

@roivaz roivaz closed this as completed Aug 30, 2022