Reimplement self-healing using internal statistics #102

Need to use the value of the var here, not the pointer, as the pointer changes in each iteration of the loop. Add a test to catch this.
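
One common way this kind of issue shows up in Go is sketched below. The `revision` type and both helper functions are hypothetical and are not taken from this PR. Before Go 1.22 the `range` loop variable was a single variable reused across iterations, so storing its address gave every entry the same pointer. Copying the value, or indexing the slice, behaves the same on every Go version.

```go
package main

import "fmt"

// revision is a hypothetical stand-in for the items being iterated.
type revision struct {
	Version string
}

// collectByValue stores a copy of each element, so the result never
// depends on the address of the loop variable.
func collectByValue(revs []revision) []revision {
	out := make([]revision, 0, len(revs))
	for _, r := range revs {
		out = append(out, r) // r is copied here
	}
	return out
}

// collectPointers takes the address of the slice element itself
// (not of the loop variable), so every pointer is distinct.
func collectPointers(revs []revision) []*revision {
	out := make([]*revision, 0, len(revs))
	for i := range revs {
		out = append(out, &revs[i])
	}
	return out
}

func main() {
	revs := []revision{{"a"}, {"b"}, {"c"}}
	for _, p := range collectPointers(revs) {
		fmt.Println(p.Version)
	}
	fmt.Println(len(collectByValue(revs)))
}
```

A test for this class of bug can assert that the collected entries differ from each other, as opposed to all matching the last element.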

range is already safe to use with a nil list.
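
A minimal illustration of why an explicit nil check before the loop is redundant. The function name is hypothetical. Ranging over a nil slice in Go runs zero iterations.

```go
package main

import "fmt"

// countItems ranges over the slice without a nil guard.
// For a nil slice the loop body never executes.
func countItems(items []string) int {
	n := 0
	for range items {
		n++
	}
	return n
}

func main() {
	var none []string // nil slice
	fmt.Println(countItems(none))
	fmt.Println(countItems([]string{"x", "y"}))
}
```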

An in-memory key-value store has been added so the xdss server can report statistics that can then be consumed by the controllers. The mechanism to detect failing resources has been reimplemented using this store, together with the new ability to detect which pod is actually reporting an ACK/NACK.
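
As a rough sketch of the idea, and not the store added in this PR, an in-memory counter store shared between a reporter and a consumer could look like the following. The type name, method names and key layout are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// statsStore is a hypothetical in-memory key-value store of counters.
// A mutex guards the map so that one goroutine can write (the server)
// while others read (the controllers).
type statsStore struct {
	mu       sync.RWMutex
	counters map[string]int64
}

func newStatsStore() *statsStore {
	return &statsStore{counters: make(map[string]int64)}
}

// Incr adds one to the counter stored at key and returns the new value.
func (s *statsStore) Incr(key string) int64 {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counters[key]++
	return s.counters[key]
}

// Get returns the counter stored at key and whether it exists.
func (s *statsStore) Get(key string) (int64, bool) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.counters[key]
	return v, ok
}

func main() {
	store := newStatsStore()
	// Hypothetical key layout: event / pod / resource type / version.
	key := "nack/pod-1/listeners/v2"
	store.Incr(key)
	v, ok := store.Get(key)
	fmt.Println(v, ok)
}
```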

The implementation consists of the EnvoyConfigRevision tainting itself when the internal stats show that the percentage of pods reporting failures is higher than a given threshold. Right now:

* A pod is considered "in failure" when it has reported at least one NACK for a given resource version, for at least one resource type.
* The percentage of pods required to trigger the taint of a revision is hardcoded to 100%.
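
The rule described above can be sketched as follows. This is only an illustration under stated assumptions. The `podStats` type and the function names are hypothetical. The comparison is treated as "at or above the threshold" because a percentage cannot exceed 100. Returning false when no pods have reported is also an assumption.

```go
package main

import "fmt"

// podStats is a hypothetical view of the stats for one resource version:
// pod name -> resource type -> number of NACKs reported.
type podStats map[string]map[string]int

// taintThreshold mirrors the value the description says is hardcoded.
const taintThreshold = 100.0

// podFailing reports whether a pod has at least one NACK
// for at least one resource type.
func podFailing(nacksByType map[string]int) bool {
	for _, n := range nacksByType {
		if n > 0 {
			return true
		}
	}
	return false
}

// failingPercentage returns the share of pods considered "in failure".
func failingPercentage(stats podStats) float64 {
	if len(stats) == 0 {
		return 0
	}
	failing := 0
	for _, byType := range stats {
		if podFailing(byType) {
			failing++
		}
	}
	return 100 * float64(failing) / float64(len(stats))
}

// shouldTaint applies the threshold; with no pods it never taints (assumption).
func shouldTaint(stats podStats) bool {
	return len(stats) > 0 && failingPercentage(stats) >= taintThreshold
}

func main() {
	stats := podStats{
		"pod-1": {"listeners": 1},
		"pod-2": {"listeners": 0, "clusters": 0},
	}
	fmt.Println(failingPercentage(stats), shouldTaint(stats))
}
```

With a 100% threshold, a single healthy pod is enough to keep the revision untainted, which makes the check conservative.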

When a NACK is received by the discovery service, it immediately retries the same response. This usually does not help: unless something changes in the resource, the client is likely to reject it again and again. This leads to a storm of retries that elevates CPU consumption in the clients, and can even trigger the HPA if enabled. This commit implements a backoff strategy for retries to avoid this situation.

Co-authored-by: Sergio López <41513123+slopezz@users.noreply.github.com>

Commits on May 27, 2021