Reimplement self-healing using internal statistics #102
Commits on May 27, 2021
-
Fix isRevisionPublishedConditionReconciled func
Use the value of the loop variable here rather than a pointer to it, because what the pointer refers to changes on each iteration of the loop. Add a test to catch this.
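A minimal sketch of the pattern this commit message describes, not the actual code of isRevisionPublishedConditionReconciled. The `revision` type and both helper functions are hypothetical. The sketch compares loop elements by value, and indexes into the slice when a pointer is really needed, so no pointer to the loop variable is ever kept.

```go
package main

import "fmt"

// revision is a hypothetical stand-in for an entry in a status list.
type revision struct {
	Name      string
	Published bool
}

// publishedNames reads each element by value. It never stores &rev, which
// before Go 1.22 would alias one variable reused by every iteration.
func publishedNames(revs []revision) []string {
	var names []string
	for _, rev := range revs {
		if rev.Published {
			names = append(names, rev.Name)
		}
	}
	return names
}

// pointersToElements keeps pointers safely by indexing into the slice, so
// each pointer refers to a distinct element.
func pointersToElements(revs []revision) []*revision {
	ptrs := make([]*revision, 0, len(revs))
	for i := range revs {
		ptrs = append(ptrs, &revs[i])
	}
	return ptrs
}

func main() {
	revs := []revision{{"v1", false}, {"v2", true}, {"v3", true}}
	fmt.Println(publishedNames(revs))
	for _, p := range pointersToElements(revs) {
		fmt.Println(p.Name)
	}
}
```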
Commit d7687ea
-
Do not check nil list around range loop
range is already safe to use with a nil list.
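A short illustration of why the nil check is redundant: ranging over a nil slice runs zero iterations. The `countItems` helper is hypothetical and only serves the example.

```go
package main

import "fmt"

// countItems ranges over the slice without a nil check. Ranging over a nil
// slice simply executes zero iterations.
func countItems(items []string) int {
	n := 0
	for range items {
		n++
	}
	return n
}

func main() {
	var none []string // nil slice
	fmt.Println(countItems(none))
	fmt.Println(countItems([]string{"a", "b"}))
}
```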
Commit 46b40fd
-
Implement a mechanism to report stats internally
An in-memory key-value store has been added so that the xdss server can report statistics that can then be consumed by the controllers. The mechanism to detect failing resources has been reimplemented using this store, in conjunction with the new ability to detect which pod is actually reporting an ACK/NACK.
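One way such a store could look, as a rough sketch only. The key layout (`statKey`) and the `statsStore` type with its `Incr` and `Get` methods are assumptions made for illustration and are not taken from the pull request. The idea is that the server side increments a counter per pod when an ACK or NACK arrives, and a controller reads the counters later.

```go
package main

import (
	"fmt"
	"sync"
)

// statKey is a hypothetical key layout; the real store may key entries differently.
type statKey struct {
	NodeID       string
	ResourceType string
	Version      string
	PodID        string
	Stat         string // for example "ack" or "nack"
}

// statsStore is a mutex-protected in-memory counter map.
type statsStore struct {
	mu       sync.RWMutex
	counters map[statKey]int64
}

func newStatsStore() *statsStore {
	return &statsStore{counters: make(map[statKey]int64)}
}

// Incr would be called by the server when a pod reports an ACK or a NACK.
func (s *statsStore) Incr(k statKey) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counters[k]++
}

// Get would be called by a controller; a missing key reads as zero.
func (s *statsStore) Get(k statKey) int64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.counters[k]
}

func main() {
	store := newStatsStore()
	k := statKey{NodeID: "node", ResourceType: "cluster", Version: "abc", PodID: "pod-1", Stat: "nack"}
	store.Incr(k)
	store.Incr(k)
	fmt.Println(store.Get(k))
}
```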
Commit 24e5b59
-
Reimplement self-healing using internal stats
The implementation consists of the EnvoyConfigRevision tainting itself when the internal stats show that the percentage of pods reporting failures is higher than a given threshold. Right now:
* A pod is considered "in failure" when it has reported at least one NACK for a given resource version, for at least one resource type.
* The percentage of pods required to trigger the taint of a revision is hardcoded to 100%.
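The rule above can be sketched roughly as follows. The `podStats` type and the `podInFailure` and `shouldTaint` functions are hypothetical names, not the pull request's code. The sketch also assumes the comparison is "greater than or equal", since a 100% threshold could never be strictly exceeded.

```go
package main

import "fmt"

// podStats maps pod ID -> resource type -> NACK count, for one resource
// version. This layout is an assumption made for the example.
type podStats map[string]map[string]int

// podInFailure reports whether the pod sent at least one NACK for at least
// one resource type.
func podInFailure(nacksByType map[string]int) bool {
	for _, n := range nacksByType {
		if n > 0 {
			return true
		}
	}
	return false
}

// shouldTaint returns true when the percentage of failing pods reaches
// thresholdPercent. Using >= is an assumption (see the note above).
func shouldTaint(stats podStats, thresholdPercent float64) bool {
	if len(stats) == 0 {
		return false // no pods have reported, so there is nothing to judge
	}
	failing := 0
	for _, byType := range stats {
		if podInFailure(byType) {
			failing++
		}
	}
	pct := 100 * float64(failing) / float64(len(stats))
	return pct >= thresholdPercent
}

func main() {
	stats := podStats{
		"pod-1": {"cluster": 1},
		"pod-2": {"listener": 0},
	}
	fmt.Println(shouldTaint(stats, 100)) // only one of two pods is failing
}
```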
Commit 1b17645
-
Implement a backoff strategy for failure retries
When a NACK is received by the discovery service, it immediately retries the same response. This usually does not help because, unless something changes in the resource, the client is likely to reject the resource again and again. This leads to a storm of retries that elevates CPU consumption in the clients, even triggering HPA if enabled. This commit implements a backoff strategy for retries to avoid this situation.
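The commit message does not say which backoff strategy is used. As an illustration only, a capped exponential backoff could be sketched like this; the `backoffDelay` function and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns the wait before retry number attempt (1-based):
// base doubled for every previous attempt, capped at maxDelay.
func backoffDelay(attempt int, base, maxDelay time.Duration) time.Duration {
	if attempt < 1 {
		attempt = 1
	}
	delay := base
	for i := 1; i < attempt; i++ {
		delay *= 2
		if delay >= maxDelay {
			return maxDelay // stop early so the value cannot overflow
		}
	}
	if delay > maxDelay {
		return maxDelay
	}
	return delay
}

func main() {
	for attempt := 1; attempt <= 6; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, time.Second, 30*time.Second))
	}
}
```

With a delay between retries, a client that keeps rejecting the same resource is asked again progressively less often instead of in a tight loop.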
Commit 86d1b6d
-
Commit 87eccc1
-
Commit 4ff2f13
-
Co-authored-by: Sergio López <41513123+slopezz@users.noreply.github.com>
Commit bdd15f7