Reimplement self-healing using internal statistics #102

Merged
merged 8 commits into from
Jun 1, 2021


roivaz
Member

@roivaz roivaz commented May 27, 2021

This PR includes the following:

  • Fixed a bug affecting reconcile of EnvoyConfigRevision status: d7687ea
  • Implemented a mechanism to internally store statistics related to the xDS protocol messages interchanged between clients and the discovery service: 24e5b59
  • Reimplemented the self-healing using the internal xDS stats: 1b17645
  • Implemented a backoff algorithm to avoid overloading the envoy clients with retries from the discovery service: 86d1b6d

/kind feature
/priority important-soon
/assign

Need to use the value of the var here, not the pointer, as the pointer
changes in each iteration of the loop. Add a test to catch this.

range is already safe to use with a nil list.
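
The two notes above can be illustrated with a minimal Go sketch. It is not the code from the commit: `Revision` and `collectVersions` are hypothetical names used only for illustration. It shows copying the element's value out of the loop rather than keeping a pointer to the range variable (before Go 1.22 that variable was reused across iterations), and ranging over a nil slice without a guard.

```go
package main

import "fmt"

// Revision is a hypothetical stand-in for an item in the list being reconciled.
type Revision struct {
	Version string
}

// collectVersions copies each element's value instead of storing a pointer
// to the range variable. Before Go 1.22 the range variable was a single
// variable reused on every iteration, so stored pointers to it all ended up
// referring to the last element.
func collectVersions(revs []Revision) []string {
	versions := make([]string, 0, len(revs))
	// Ranging over a nil slice runs zero iterations, so no nil check is needed.
	for _, r := range revs {
		versions = append(versions, r.Version) // value copy, not &r
	}
	return versions
}

func main() {
	fmt.Println(collectVersions([]Revision{{Version: "v1"}, {Version: "v2"}}))
	fmt.Println(len(collectVersions(nil)))
}
```

When a pointer to the element is really needed, indexing the slice (`&revs[i]`) is the usual alternative to `&r`.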

An in-memory key-value store has been added so the xdss server can report
statistics that can then be consumed from the controllers. The
mechanism to detect failing resources has been reimplemented using this
store in conjunction with the new ability to detect which pod is actually
reporting an ACK/NACK.
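
As a rough sketch of the idea, assuming nothing about the real implementation, an in-memory store of xDS counters could look like the following. The `Key` fields, the `Stats` type and its methods are all hypothetical names chosen for illustration; the only points taken from the description are that the store lives in memory, is written by the discovery service, is read by the controllers, and can attribute an ACK/NACK to a specific pod.

```go
package main

import (
	"fmt"
	"sync"
)

// Key identifies one counter. Every field name here is an assumption made
// for illustration, not the layout used by the project.
type Key struct {
	NodeID       string // envoy node the config belongs to
	ResourceType string // e.g. "cluster", "listener"
	Version      string // resource version being acknowledged or rejected
	PodID        string // pod that sent the ACK/NACK
	Stat         string // e.g. "ack" or "nack"
}

// Stats is a concurrency-safe in-memory counter store.
type Stats struct {
	mu       sync.RWMutex
	counters map[Key]int64
}

func NewStats() *Stats {
	return &Stats{counters: make(map[Key]int64)}
}

// Incr is what the discovery service side would call on each ACK/NACK.
func (s *Stats) Incr(k Key) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counters[k]++
}

// Get is what a controller would call; a missing key reads as zero.
func (s *Stats) Get(k Key) int64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.counters[k]
}

func main() {
	s := NewStats()
	k := Key{NodeID: "envoy1", ResourceType: "cluster", Version: "abc", PodID: "pod-a", Stat: "nack"}
	s.Incr(k)
	fmt.Println(s.Get(k))
}
```

Using a comparable struct as the map key avoids string concatenation and parsing; a real store would probably also need expiry of stale entries, which this sketch omits.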

The implementation consists of the EnvoyConfigRevision tainting itself
when the internal stats show that the percentage of pods reporting
failures is higher than a given threshold. Right now:

* A pod is considered "in failure" when it has reported at least one
  NACK for a given resource version, for at least one resource type.
* The percentage of pods required to trigger the taint of a revision is
  right now hardcoded to 100%.

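
The rule above can be sketched as a small Go function. This is an illustration under stated assumptions, not the controller's code: `failurePercentage`, `shouldTaint` and the `nacksByPod` map (pod name to the number of NACKs it reported for the revision's version, summed over resource types) are hypothetical. Since a 100% threshold cannot be strictly exceeded, the sketch assumes the comparison is "at or above".

```go
package main

import "fmt"

// taintThreshold mirrors the hardcoded 100% mentioned in the description.
const taintThreshold = 100.0

// failurePercentage returns the percentage of known pods that reported at
// least one NACK. nacksByPod is a hypothetical input: pod name mapped to
// its NACK count for one resource version, across all resource types.
func failurePercentage(nacksByPod map[string]int) float64 {
	if len(nacksByPod) == 0 {
		return 0
	}
	failing := 0
	for _, n := range nacksByPod {
		if n > 0 { // one NACK is enough to count the pod as "in failure"
			failing++
		}
	}
	return 100 * float64(failing) / float64(len(nacksByPod))
}

// shouldTaint reports whether the revision should taint itself.
// Assumption: "at or above" the threshold, and never with zero pods.
func shouldTaint(nacksByPod map[string]int) bool {
	return len(nacksByPod) > 0 && failurePercentage(nacksByPod) >= taintThreshold
}

func main() {
	fmt.Println(shouldTaint(map[string]int{"pod-a": 3, "pod-b": 1}))
	fmt.Println(shouldTaint(map[string]int{"pod-a": 3, "pod-b": 0}))
}
```

With the threshold at 100%, a single healthy pod is enough to keep the revision untainted, which avoids rolling back a config that only some clients reject.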
When a NACK is received by the discovery service, it immediately retries
the same response. This usually does not help because unless
something changes in the resource, the client is likely to reject the
resource again and again. This leads to a storm of retries that elevates
CPU consumption in the clients, even triggering HPA if enabled.
This commit implements a backoff strategy for retries to avoid this
situation.
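
The description does not say which backoff strategy the commit uses. As one possible shape, here is a minimal capped exponential backoff in Go; `retryDelay`, the base delay and the cap are assumptions made for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// retryDelay returns how long to wait before the given retry attempt
// (counted from 1). The delay starts at base and doubles on each attempt
// until it reaches max. The doubling policy is an assumption for this
// sketch, not necessarily what the commit implements.
func retryDelay(attempt int, base, max time.Duration) time.Duration {
	if attempt < 1 {
		return 0
	}
	d := base
	for i := 1; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max // stop early so the value cannot overflow
		}
	}
	if d > max {
		return max
	}
	return d
}

func main() {
	// Hypothetical values: start at 1s, never wait longer than 30s.
	for attempt := 1; attempt <= 7; attempt++ {
		fmt.Println(attempt, retryDelay(attempt, time.Second, 30*time.Second))
	}
}
```

The discovery service would wait `retryDelay(n, ...)` before resending a response that was NACKed `n` times, so a client that keeps rejecting the same resource is retried less and less often instead of in a tight loop.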
@3scale-robot 3scale-robot added kind/feature Categorizes issue or PR as related to a new feature. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next sprint. needs-size Indicates a PR or issue lacks a `size/foo` label and requires one. labels May 27, 2021
@3scale-robot 3scale-robot added size/XL Requires about a week to complete the PR or the issue. and removed needs-size Indicates a PR or issue lacks a `size/foo` label and requires one. labels May 27, 2021
@roivaz
Member Author

roivaz commented May 27, 2021

/ok-to-test

@3scale-robot 3scale-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label May 27, 2021
@slopezz
Member

slopezz commented May 27, 2021

Despite the 2 small typos on comments, awesome job, true self-healing implementation!

/lgtm

@3scale-robot 3scale-robot added the lgtm Indicates that a PR is ready to be merged. label May 27, 2021
@3scale-robot
Contributor

LGTM label has been added.

Git tree hash: c03e509eae7b2e62ec64f7a1c8e71d4ffd96b1c0

Co-authored-by: Sergio López <41513123+slopezz@users.noreply.github.com>
@3scale-robot 3scale-robot removed the lgtm Indicates that a PR is ready to be merged. label May 27, 2021
@roivaz
Member Author

roivaz commented May 27, 2021

@slopezz suggestions committed! Need to lgtm again :P

@slopezz
Member

slopezz commented May 27, 2021

> @slopezz suggestions committed! Need to lgtm again :P

Nice!

/lgtm

@3scale-robot 3scale-robot added the lgtm Indicates that a PR is ready to be merged. label May 27, 2021
@3scale-robot
Contributor

LGTM label has been added.

Git tree hash: 4ccecf50036272dfb2460307c1711ddcbdc85506

@raelga
Contributor

raelga commented Jun 1, 2021

Outstanding job!

/lgtm

@roivaz
Member Author

roivaz commented Jun 1, 2021

/approve

@3scale-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: roivaz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@3scale-robot 3scale-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 1, 2021
@3scale-robot 3scale-robot merged commit 48957ab into main Jun 1, 2021
@3scale-robot 3scale-robot deleted the feat/stats branch June 1, 2021 07:56
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/feature Categorizes issue or PR as related to a new feature. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next sprint. size/XL Requires about a week to complete the PR or the issue.

4 participants