Prometheus snapshot not created when kubetest times out #1086

oxddr · 2020-02-28T11:10:46Z

Prometheus snapshot is not created if kubetest times out. The snapshotting logic lives inside clusterloader, so when kubetest times out the logic is simply never executed. This is unfortunate, because timeouts are exactly the situations we usually want to debug.

I was hit by this issue when trying to debug: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/88342/pull-kubernetes-e2e-gce-large-performance/1230477531426066432/

I see two options: 1) move snapshotting outside of the test (e.g. similarly to log dumping) 2) reconsider using Cortex (or any other solution that allows live recording of metrics).
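A minimal sketch of what option 1 could look like, assuming the harness gives the test its own time budget and runs the snapshot as a separate step afterwards. The names `runThenSnapshot`, `test` and `snapshot` are hypothetical placeholders for illustration, not kubetest or clusterloader APIs:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runThenSnapshot runs test under its own deadline and then always calls
// snapshot, whether the test passed, failed, or timed out.
// Both callbacks are hypothetical placeholders, not real kubetest/CL2 hooks.
func runThenSnapshot(
	testTimeout time.Duration,
	test func(context.Context) error,
	snapshot func() error,
) (testErr, snapErr error) {
	ctx, cancel := context.WithTimeout(context.Background(), testTimeout)
	defer cancel()

	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() { done <- test(ctx) }()

	select {
	case testErr = <-done:
	case <-ctx.Done():
		testErr = ctx.Err() // the test overran its budget
	}

	// The snapshot step sits outside the test, so it is reached on every path.
	snapErr = snapshot()
	return testErr, snapErr
}

// demo simulates a test that hangs until its deadline expires.
func demo() (snapshotTaken bool, testErr, snapErr error) {
	hangingTest := func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}
	snapshot := func() error {
		snapshotTaken = true
		return nil
	}
	testErr, snapErr = runThenSnapshot(20*time.Millisecond, hangingTest, snapshot)
	return snapshotTaken, testErr, snapErr
}

func main() {
	taken, testErr, snapErr := demo()
	fmt.Println("timed out:", errors.Is(testErr, context.DeadlineExceeded))
	fmt.Println("snapshot taken:", taken, "snapshot error:", snapErr)
}
```

The point of the sketch is only the ordering: the test deadline has to be shorter than the outer kubetest timeout, so that there is still time left for the snapshot step.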

@mm4tt - WDYT?

/priority important-soon
/area clusterloader

mm4tt · 2020-03-03T12:08:47Z

I also took some notes about that run:

It looked like the api-server exploded during the test, but this wasn't obvious from the logs (it takes some experience with CL2 to guess or know that).
There is no point in continuing the test if such a thing happens. Actually, continuing the test creates a lot of issues, e.g.
1. Such a test is a waste of resources; it will last until the timeout (which is much longer than the average test duration).
2. There is no time left for some important operations, e.g. gathering measurements or snapshotting the Prometheus DB.
3. The CL2 log will be full of strange errors, which are hard for someone inexperienced to understand.

I had two things in mind as a remedy:

1. Introducing an "api-availability" measurement (and SLO). Basically, we could check the apiserver's /healthz endpoint every Xs from clusterloader2 and later compute the overall availability of the api-server (e.g. % of OKs, or the longest unavailability period). I believe this info would be very useful for quick-glance debugging.
   In addition to reporting the availability, we could make this measurement fail if e.g. the apiserver was not available at all in the last XXmin. In such a case, IMHO it's safe to assume that the control plane died. If such an error is returned, CL2 would stop executing the test. As a side note, I think this mechanism for stopping the test is already implemented: there is a concept of a critical error, but we just don't use it in any measurement.
2. Introducing a timeout on the CL2 test execution. It's a more generic solution to the problem you described, but I think it might be tricky to tune it properly, and later there will be a maintenance cost (e.g. we change something and now the load test starts taking X% shorter / longer -> we need to change the timeouts).
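To make the api-availability idea above more concrete, here is a rough sketch of the bookkeeping only, assuming the /healthz results have already been collected as timestamped OK / not-OK samples. All type and function names, the 10s probe spacing and the 30s threshold are made up for illustration; the actual HTTP probing and the wiring into CL2's critical-error mechanism are not shown:

```go
package main

import (
	"fmt"
	"time"
)

// probe is one health-check result (hypothetical type; in a real measurement
// it would come from a GET against the apiserver's /healthz endpoint).
type probe struct {
	at time.Time
	ok bool
}

// availability summarizes a time-ordered series of probes.
type availability struct {
	okPercent           float64       // share of successful probes
	longestUnavailable  time.Duration // longest outage seen
	trailingUnavailable time.Duration // outage still ongoing at the last probe
}

// deadAfter is an illustrative threshold standing in for "XXmin".
const deadAfter = 30 * time.Second

// summarize measures each outage from its first failed probe to the next
// successful probe (or to the last probe if the outage never ended).
func summarize(probes []probe) availability {
	var a availability
	if len(probes) == 0 {
		return a
	}
	okCount := 0
	down := false
	var downSince time.Time
	for _, p := range probes {
		if down {
			if d := p.at.Sub(downSince); d > a.longestUnavailable {
				a.longestUnavailable = d
			}
		}
		if p.ok {
			okCount++
			down = false
		} else if !down {
			down = true
			downSince = p.at
		}
	}
	if down {
		a.trailingUnavailable = probes[len(probes)-1].at.Sub(downSince)
	}
	a.okPercent = 100 * float64(okCount) / float64(len(probes))
	return a
}

// controlPlaneDead reports whether the apiserver has been unavailable for the
// whole trailing window, i.e. the case where the test should stop early.
func controlPlaneDead(a availability, threshold time.Duration) bool {
	return a.trailingUnavailable >= threshold
}

// demoProbes builds one sample every 10s: a short outage, then a final one.
func demoProbes() []probe {
	start := time.Unix(0, 0)
	results := []bool{true, false, false, true, false, false, false, false}
	probes := make([]probe, len(results))
	for i, ok := range results {
		probes[i] = probe{at: start.Add(time.Duration(i) * 10 * time.Second), ok: ok}
	}
	return probes
}

func main() {
	a := summarize(demoProbes())
	fmt.Println(a.okPercent, a.longestUnavailable, controlPlaneDead(a, deadAfter))
}
```

With a summary like this, the measurement could both report the numbers for quick-glance debugging and return an error once the trailing outage exceeds the threshold.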

Even though 2 seems to be much more generic, I don't remember any situation where CL2 would time out for a reason other than control-plane unavailability. Because of that, I think we should first implement 1 (which has other benefits, e.g. it makes debugging easier for people outside scalability, and it's a nice SLO to provide to users) and then see whether 2 (or the solutions you proposed) is really needed.

fejta-bot · 2020-06-01T12:49:46Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

mm4tt · 2020-06-02T10:48:14Z

/remove-lifecycle stale

fejta-bot · 2020-08-31T11:13:29Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

wojtek-t · 2020-08-31T11:54:56Z

/remove-lifecycle stale
/lifecycle frozen

k8s-ci-robot added priority/important-soon and area/clusterloader labels Feb 28, 2020

mm4tt mentioned this issue Mar 3, 2020

Create api-availability measurement #1096

Open

k8s-ci-robot added the lifecycle/stale label Jun 1, 2020

k8s-ci-robot removed the lifecycle/stale label Jun 2, 2020

k8s-ci-robot added the lifecycle/stale label Aug 31, 2020

k8s-ci-robot added lifecycle/frozen and removed lifecycle/stale labels Aug 31, 2020
