Reimplement self-healing using internal statistics #102
Need to use the value of the var here, not the pointer, as the pointer changes in each iteration of the loop. Add a test to catch this.
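A minimal sketch of the kind of fix this commit describes, not the actual change in the PR. The `resource` type and both helper functions are hypothetical. Before Go 1.22 a `range` loop reuses a single loop variable, so storing its address makes every stored pointer refer to the same variable. Copying the value, or indexing into the slice, avoids that.

```go
package main

import "fmt"

// resource is a hypothetical stand-in for the items iterated in the PR.
type resource struct{ Name string }

// valuesOf stores a copy of each element instead of a pointer to the loop variable.
func valuesOf(items []resource) []resource {
	out := make([]resource, 0, len(items))
	for _, item := range items {
		out = append(out, item) // the value is copied on append
	}
	return out
}

// stablePointers indexes into the slice so each pointer refers to a distinct element.
func stablePointers(items []resource) []*resource {
	out := make([]*resource, 0, len(items))
	for i := range items {
		out = append(out, &items[i])
	}
	return out
}

func main() {
	items := []resource{{Name: "a"}, {Name: "b"}, {Name: "c"}}
	for _, p := range stablePointers(items) {
		fmt.Println(p.Name)
	}
	fmt.Println(len(valuesOf(items)))
}
```

A test for this class of bug can assert that the collected elements are pairwise distinct after the loop finishes.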
range is already safe to use with a nil list.
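For illustration, a small sketch showing that ranging over a nil slice runs zero iterations, so a nil guard before the loop is unnecessary. The function name `countNonEmpty` is hypothetical.

```go
package main

import "fmt"

// countNonEmpty counts the non-empty strings in names.
// No nil check is needed: ranging over a nil slice executes the body zero times.
func countNonEmpty(names []string) int {
	n := 0
	for _, s := range names {
		if s != "" {
			n++
		}
	}
	return n
}

func main() {
	var names []string // nil slice
	fmt.Println(countNonEmpty(names))
	fmt.Println(countNonEmpty([]string{"a", "", "b"}))
}
```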
An in-memory key-value store has been added so the xdss server can report statistics that can then be consumed by the controllers. The mechanism to detect failing resources has been reimplemented using this store, in conjunction with the new ability to detect which pod is actually reporting an ACK/NACK.
The implementation consists of the EnvoyConfigRevision tainting itself when the internal stats show that the percentage of pods reporting failures is higher than a given threshold. Right now:
* A pod is considered "in failure" when it has reported at least one NACK for a given resource version, for at least one resource type.
* The percentage of pods required to trigger the taint of a revision is hardcoded to 100%.
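The rules above can be sketched roughly as follows. This is an illustrative model only, not the store implemented in this PR: `statsStore`, `nackKey`, `ReportNACK`, `FailingPercentage` and `shouldTaint` are all hypothetical names, and the comparison is assumed to be "greater than or equal" so that a 100% threshold can actually be reached.

```go
package main

import (
	"fmt"
	"sync"
)

// nackKey identifies a NACK counter (hypothetical layout).
type nackKey struct{ version, resourceType, pod string }

// statsStore is a hypothetical in-memory store of NACK counters.
type statsStore struct {
	mu    sync.RWMutex
	nacks map[nackKey]int
}

func newStatsStore() *statsStore {
	return &statsStore{nacks: make(map[nackKey]int)}
}

// ReportNACK records one NACK from a pod for a resource version and type.
func (s *statsStore) ReportNACK(version, resourceType, pod string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.nacks[nackKey{version: version, resourceType: resourceType, pod: pod}]++
}

// podFailing reports whether the pod has at least one NACK for the version,
// for any resource type. The caller must hold the lock.
func (s *statsStore) podFailing(version, pod string) bool {
	for k, n := range s.nacks {
		if k.version == version && k.pod == pod && n > 0 {
			return true
		}
	}
	return false
}

// FailingPercentage returns the percentage of the given pods that are in failure.
func (s *statsStore) FailingPercentage(version string, pods []string) float64 {
	if len(pods) == 0 {
		return 0
	}
	s.mu.RLock()
	defer s.mu.RUnlock()
	failing := 0
	for _, p := range pods {
		if s.podFailing(version, p) {
			failing++
		}
	}
	return 100 * float64(failing) / float64(len(pods))
}

// shouldTaint assumes the threshold is inclusive (an assumption, see lead-in).
func shouldTaint(percentage, threshold float64) bool {
	return percentage >= threshold
}

func main() {
	s := newStatsStore()
	pods := []string{"pod-a", "pod-b"}
	s.ReportNACK("v1", "clusters", "pod-a")
	pct := s.FailingPercentage("v1", pods)
	fmt.Println(pct, shouldTaint(pct, 100))
}
```

With a 100% threshold, a single healthy pod is enough to keep the revision untainted, which makes the check conservative.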
When a NACK is received by the discovery service, it immediately retries the same response. This usually does not help: unless something changes in the resource, the client is likely to reject it again and again. The result is a storm of retries that raises CPU consumption in the clients, and can even trigger the HPA if it is enabled. This commit implements a backoff strategy for retries to avoid this situation.
/ok-to-test
Despite the 2 small typos in comments, awesome job, true self-healing implementation! /lgtm
LGTM label has been added. Git tree hash: c03e509eae7b2e62ec64f7a1c8e71d4ffd96b1c0
@slopezz suggestions committed! Need to lgtm again :P
Nice! /lgtm
LGTM label has been added. Git tree hash: 4ccecf50036272dfb2460307c1711ddcbdc85506
Outstanding job! /lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: roivaz
This PR includes the following:
/kind feature
/priority important-soon
/assign