EnvoyConfig InSync but pods having different TLS certificates #140
Labels
kind/bug, needs-priority, needs-size
What happened
In the staging environment, we had intermittent alerts about the HTTP certificate expiration date of monitored VIP HTTP endpoints, reported by blackbox_exporter.
This means that the certificate exposed by the Envoy sidecar was different among pods belonging to the same Deployment.
The `EnvoyConfig` related to all those endpoints was `InSync`, and `DESIRED VERSION` was the same as `PUBLISHED VERSION`: echo-api, backend, mt-ingress and apicast-production had 50% of their pods (1 pod out of 2) with the oldest TLS certificate, so the only fully correct one was apicast-staging.
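For reference, this is roughly how we compared the desired and published versions per `EnvoyConfig` (a minimal sketch; the namespace and resource names are placeholders, and the exact status columns depend on the marin3r version):

```bash
# List the EnvoyConfigs and compare the DESIRED VERSION / PUBLISHED VERSION columns
kubectl get envoyconfigs -n <namespace>

# Inspect the full status of a single EnvoyConfig
kubectl get envoyconfig <name> -n <namespace> -o yaml
```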
AFAIK, when cert-manager updates a certificate in a Kubernetes `Secret`, the marin3r `ServiceDiscovery` should update every sidecar via the xDS API and receive a NACK from every pod. If I recall well, in the past a single pod NACK was enough to mark the `EnvoyConfig` as `InSync`, but lately it was changed to a percentage (1b17645). However, in this case 50% of the pods were not updated, and their `EnvoyConfig` was `InSync`.
How we checked the certificate from every pod
We checked the oldest certificate by port-forwarding the Envoy admin port of each pod and accessing the certs page at http://127.0.0.1:9901/certs, where we could check the certificate expiration date.
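A minimal sketch of that check (the pod name and namespace are placeholders; 9901 is the Envoy admin port in our deployment, and the grepped field names assume the JSON layout of the admin `/certs` endpoint):

```bash
# Forward the Envoy admin port of one pod to localhost
kubectl port-forward pod/<pod-name> -n <namespace> 9901:9901 &

# The /certs endpoint lists the loaded certificates with their expiration data
curl -s http://127.0.0.1:9901/certs | grep -E 'expiration_time|days_until_expiration'
```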
When we detect an out-of-sync pod, we delete it, and a new pod is created containing the correct certificate (like the one on the other pod of the Deployment, as this happened to 1 pod out of 2 per deployment).
In the production environment this issue is not happening.
Workaround to get alerted
For the moment, we have added an alert that fires when the certificate expiration date drifts between probes of the same HTTP target (changes over time detected with a rate function, checked in the staging environment):
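A minimal sketch of the kind of query behind that alert (this is not the real rule: the metric name is the blackbox_exporter default, the label selector and window are placeholders, and it uses `changes()` as one way to express the rate-style detection mentioned above):

```bash
# Ask Prometheus for blackbox targets whose reported certificate expiry value
# changed recently, i.e. successive probes hit pods exposing different certs
curl -sG 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=changes(probe_ssl_earliest_cert_expiry{job="blackbox"}[1h]) > 0'
```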
How to reproduce
We haven't seen any error on the `ServiceDiscovery`, so unfortunately we don't know how to reproduce it :(