
Prometheus snapshot not created when kubetest times out #1086

Open
oxddr opened this issue Feb 28, 2020 · 5 comments
Labels
area/clusterloader · lifecycle/frozen · priority/important-soon

Comments

@oxddr (Contributor) commented Feb 28, 2020

The Prometheus snapshot is not created if kubetest times out. The snapshotting logic lives inside clusterloader, so when kubetest times out the logic is simply never executed. This is unfortunate, especially since timeouts are exactly the situations we usually want to debug.

I was hit by this issue when trying to debug: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/88342/pull-kubernetes-e2e-gce-large-performance/1230477531426066432/

I see two options: 1) move the snapshotting outside of the test (e.g. similarly to log dumping), as sketched below, or 2) reconsider using Cortex (or any other solution that allows live recording of metrics).
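
A minimal sketch of what option 1 could look like, assuming the harness can still reach Prometheus after the timeout and that the TSDB admin API is enabled (`--web.enable-admin-api`); the URL and function name are hypothetical, not the current clusterloader2 code:

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net/http"
	"time"
)

// snapshotPrometheus asks Prometheus to snapshot its TSDB via the admin API
// (POST /api/v1/admin/tsdb/snapshot, requires --web.enable-admin-api).
// The idea is to call this from the harness after the test process exits,
// regardless of whether it finished or was killed on timeout.
func snapshotPrometheus(ctx context.Context, promURL string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		promURL+"/api/v1/admin/tsdb/snapshot", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("snapshot request failed: %s: %s", resp.Status, body)
	}
	// The response names the snapshot directory under Prometheus' data dir;
	// the harness would then copy it out the same way it dumps node logs.
	fmt.Printf("snapshot created: %s\n", body)
	return nil
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	// Hypothetical address; in practice this would be whatever endpoint the
	// log-dumping step already uses to reach Prometheus.
	if err := snapshotPrometheus(ctx, "http://localhost:9090"); err != nil {
		fmt.Println("error:", err)
	}
}
```

Since this runs in the harness rather than inside the test, it would get executed even when clusterloader itself is killed on timeout.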

@mm4tt - WDYT?

/priority important-soon
/area clusterloader

@k8s-ci-robot added the priority/important-soon and area/clusterloader labels on Feb 28, 2020
@mm4tt (Contributor) commented Mar 3, 2020

I also took some notes about that run:

  1. It looked like the api-server exploded during the test, but it wasn't obvious from the logs that this was the case (it takes some experience with CL2 to guess / know that).
  2. There is no point in continuing the test if such a thing happens. Actually, continuing the test creates a lot of issues, e.g.:
    1. Such a test is a waste of resources; it will run until the timeout (which is much longer than the average test duration).
    2. There is no time left for important operations, e.g. gathering measurements or snapshotting the Prometheus DB.
    3. The CL2 log will be full of strange errors; it's hard for someone inexperienced to understand it.

I had two things in mind as a remedy:

  1. Introducing an "api-availability" measurement (and SLO). Basically, we could check the apiserver's /healthz endpoint every X seconds from clusterloader2 and later compute the overall availability of the apiserver (e.g. % of OKs, or the longest unavailability period). I believe this info would be very useful for quick-glance debugging (a rough sketch follows at the end of this comment).
    In addition to reporting the availability, we could make this measurement fail if, e.g., the apiserver was not available at all in the last XX min. In such a case, IMHO it's safe to assume that the control plane died. If such an error is returned, CL2 would stop executing the test. As a side note, I think this mechanism for stopping the test is already implemented (there is a concept of a critical error), but we just don't use it in any measurement.

  2. Introduce a timeout on the CL2 test execution. It's a more generic solution to the problem you described, but I think it might be tricky to tune properly, and later there will be a maintenance cost (e.g. we change something and now the load test takes X% less / more time -> we need to change the timeouts).

Even though 2 seems to be much more generic, I don't remember any situation where CL2 timed out for a reason other than control-plane unavailability. Because of that, I think we should first implement 1 (which has other benefits, e.g. it makes debugging easier for people outside scalability and it's a nice SLO to provide to users) and then see whether 2 (or the solutions you proposed) is really needed.
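
A minimal sketch of the availability measurement from point 1, assuming a plain HTTP probe against /healthz; the poll interval, failure window, and all names are hypothetical and not the actual CL2 measurement interface:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"time"
)

// availabilityProbe polls /healthz on a fixed interval and records each result.
type availabilityProbe struct {
	results []bool // true = apiserver returned 200 OK
}

func (p *availabilityProbe) run(ctx context.Context, healthzURL string, interval time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := http.Get(healthzURL)
			ok := err == nil && resp.StatusCode == http.StatusOK
			if resp != nil {
				resp.Body.Close()
			}
			p.results = append(p.results, ok)
		}
	}
}

// summary computes the availability (% of OKs) and the longest streak of
// consecutive failures, expressed in probe intervals.
func (p *availabilityProbe) summary() (availability float64, longestOutage int) {
	okCount, streak := 0, 0
	for _, ok := range p.results {
		if ok {
			okCount++
			streak = 0
		} else {
			streak++
			if streak > longestOutage {
				longestOutage = streak
			}
		}
	}
	if len(p.results) > 0 {
		availability = 100 * float64(okCount) / float64(len(p.results))
	}
	return availability, longestOutage
}

// critical reports whether every probe in the last `window` probes failed,
// i.e. the condition under which the measurement would return a critical
// error and CL2 would stop executing the test.
func (p *availabilityProbe) critical(window int) bool {
	if len(p.results) < window {
		return false
	}
	for _, ok := range p.results[len(p.results)-window:] {
		if ok {
			return false
		}
	}
	return true
}

func main() {
	p := &availabilityProbe{}
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()
	// Hypothetical endpoint; the real measurement would reuse CL2's client config.
	p.run(ctx, "https://127.0.0.1:6443/healthz", 5*time.Second)
	avail, outage := p.summary()
	fmt.Printf("availability: %.1f%%, longest outage: %d probes, critical: %v\n",
		avail, outage, p.critical(12))
}
```

A measurement built this way reports the % of OKs and the longest outage for quick-glance debugging, and the `critical` check is where the existing critical-error mechanism could hook in to abort the test early.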

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Jun 1, 2020
@mm4tt (Contributor) commented Jun 2, 2020

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label on Jun 2, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label on Aug 31, 2020
@wojtek-t (Member)

/remove-lifecycle stale
/lifecycle frozen

@k8s-ci-robot added the lifecycle/frozen label and removed the lifecycle/stale label on Aug 31, 2020