Reimplement self-healing using internal statistics #102

Merged 8 commits on Jun 1, 2021

Commits on May 27, 2021

  1. Fix isRevisionPublishedConditionReconciled func

    Need to use the value of the variable here, not a pointer to it, as what
    the pointer refers to changes in each iteration of the loop. Add a test
    to catch this.
    roivaz committed May 27, 2021
    d7687ea
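
The pitfall described in commit 1 can be sketched as follows. This is only an illustration, not the code from the PR: the `Revision` type and the `publishedVersion` function are hypothetical stand-ins, since the body of `isRevisionPublishedConditionReconciled` is not shown here. Before Go 1.22 the `range` loop variable is reused across iterations, so a stored `&rev` ends up referring to whatever the last iteration assigned; copying the value avoids the problem in any Go version.

```go
package main

import "fmt"

// Revision is a hypothetical stand-in for an entry in a revision list.
type Revision struct {
	Version   string
	Published bool
}

// publishedVersion returns the version of the first published revision.
// It copies the value out of the loop variable instead of keeping &rev:
// before Go 1.22 the loop variable is shared by all iterations, so a
// stored &rev would later point at a different revision.
func publishedVersion(revs []Revision) (string, bool) {
	for _, rev := range revs {
		if rev.Published {
			return rev.Version, true // value copy, safe in every Go version
		}
	}
	return "", false
}

func main() {
	revs := []Revision{
		{Version: "v1", Published: false},
		{Version: "v2", Published: true},
		{Version: "v3", Published: false},
	}
	version, ok := publishedVersion(revs)
	fmt.Println(version, ok)
}
```

When a pointer is really needed, indexing the slice (`&revs[i]`) is another option that does not depend on loop variable semantics.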
  2. Do not check nil list around range loop

    range is already safe to use with a nil list.
    roivaz committed May 27, 2021
    46b40fd
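
A short sketch of the point made in commit 2: ranging over a nil slice simply runs zero iterations, so a surrounding nil check adds nothing. The `countNonEmpty` helper is hypothetical and only serves as an illustration.

```go
package main

import "fmt"

// countNonEmpty counts the non-empty strings in items.
// No nil check is needed: range over a nil slice runs zero iterations.
func countNonEmpty(items []string) int {
	n := 0
	for _, item := range items {
		if item != "" {
			n++
		}
	}
	return n
}

func main() {
	var none []string // nil slice
	fmt.Println(countNonEmpty(none))
	fmt.Println(countNonEmpty([]string{"a", "", "b"}))
}
```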
  3. Implement a mechanism to report stats internally

    An in-memory key-value store has been added so the xdss server can report
    statistics that can then be consumed from the controllers. The
    mechanism to detect failing resources has been reimplemented using this
    store, in conjunction with the new ability to detect which pod is actually
    reporting an ACK/NACK.
    roivaz committed May 27, 2021
    24e5b59
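
One way to picture the store described in commit 3 is a mutex-protected map of counters that the discovery server writes to and the controllers read from. The sketch below is an assumption-laden illustration, not the implementation from the PR: the `StatsStore` type, its methods, and the key layout (node, resource type, version, pod, stat name) are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// StatsStore is a hypothetical in-memory key-value store for counters.
type StatsStore struct {
	mu       sync.RWMutex
	counters map[string]int64
}

// NewStatsStore returns an empty store ready for concurrent use.
func NewStatsStore() *StatsStore {
	return &StatsStore{counters: make(map[string]int64)}
}

// statKey builds a flat key. The fields and their order are assumptions
// made for this sketch.
func statKey(nodeID, resourceType, version, podID, stat string) string {
	return fmt.Sprintf("%s/%s/%s/%s/%s", nodeID, resourceType, version, podID, stat)
}

// Inc increments the counter stored under key.
func (s *StatsStore) Inc(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counters[key]++
}

// Get returns the counter stored under key, or 0 if it was never set.
func (s *StatsStore) Get(key string) int64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.counters[key]
}

func main() {
	store := NewStatsStore()
	key := statKey("node-a", "clusters", "abc123", "pod-1", "nack")
	store.Inc(key)              // discovery server side: record a NACK from pod-1
	fmt.Println(store.Get(key)) // controller side: read the counter back
}
```

Including the pod identifier in the key is what would let a consumer tell how many distinct pods have reported a NACK, rather than only how many NACKs were received in total.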
  4. Reimplement self-healing using internal stats

    The implementation consists of the EnvoyConfigRevision tainting itself
    when the internal stats show that the percentage of pods reporting
    failures is higher than a given threshold. Right now:
    
    * A pod is considered "in failure" when it has reported at least one
      NACK for a given resource version, for at least one resource type.
    * The percentage of pods required to trigger the taint of a revision is
      hardcoded to 100%.
    roivaz committed May 27, 2021
    1b17645
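
The tainting rule from commit 4 can be sketched as a small calculation. The function names and the shape of the input (pod name, then resource type, then NACK count for one resource version) are hypothetical. The sketch also assumes the comparison is "greater than or equal", because a strict "higher than" could never be satisfied with a 100% threshold; the commit message does not say which comparison the code uses.

```go
package main

import "fmt"

// failingPercentage returns the share of pods (0-100) that reported at
// least one NACK for at least one resource type. nacks maps pod name to
// resource type to NACK count for a single resource version. Pods without
// failures must still be present (with zero counts or an empty inner map).
func failingPercentage(nacks map[string]map[string]int) float64 {
	if len(nacks) == 0 {
		return 0
	}
	failing := 0
	for _, byType := range nacks {
		for _, count := range byType {
			if count > 0 {
				failing++ // one failing resource type is enough for this pod
				break
			}
		}
	}
	return float64(failing) / float64(len(nacks)) * 100
}

// shouldTaint reports whether the revision should taint itself.
// Using >= is an assumption of this sketch (see the note above).
func shouldTaint(nacks map[string]map[string]int, threshold float64) bool {
	return len(nacks) > 0 && failingPercentage(nacks) >= threshold
}

func main() {
	nacks := map[string]map[string]int{
		"pod-1": {"clusters": 2},
		"pod-2": {"clusters": 0, "listeners": 0},
	}
	fmt.Println(shouldTaint(nacks, 100)) // only one of two pods is failing

	nacks["pod-2"]["listeners"] = 1
	fmt.Println(shouldTaint(nacks, 100)) // now every pod is failing
}
```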
  5. Implement a backoff strategy for failure retries

    When the discovery service receives a NACK, it immediately retries the
    same response. This usually does not help: unless something changes in
    the resource, the client is likely to reject it again and again. The
    result is a storm of retries that raises CPU consumption in the clients
    and can even trigger the HPA, if enabled. This commit implements a
    backoff strategy for retries to avoid this situation.
    roivaz committed May 27, 2021
    86d1b6d
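
Commit 5 does not say which backoff strategy is used. As an illustration only, the sketch below assumes a capped exponential backoff, where the delay doubles after each consecutive NACK up to a limit. The `backoffDelay` function and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDelay returns how long to wait before retry number `attempt`
// (0-based). The delay starts at base and doubles on every attempt until
// it reaches limit. Capped exponential backoff is an assumption of this
// sketch, not necessarily the strategy used in the PR.
func backoffDelay(attempt int, base, limit time.Duration) time.Duration {
	delay := base
	for i := 0; i < attempt; i++ {
		delay *= 2
		if delay >= limit {
			return limit // stop early, which also avoids overflow
		}
	}
	if delay > limit {
		return limit
	}
	return delay
}

func main() {
	for attempt := 0; attempt < 7; attempt++ {
		fmt.Println(attempt, backoffDelay(attempt, time.Second, 30*time.Second))
	}
}
```

Spacing the retries out like this keeps a permanently rejected resource from being resent in a tight loop, which is the retry storm the commit message describes.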
  6. 87eccc1
  7. 4ff2f13
  8. Fix typos

    Co-authored-by: Sergio López <41513123+slopezz@users.noreply.github.com>
    roivaz and slopezz committed May 27, 2021
    bdd15f7