
Create api-availability measurement #1096

Open
mm4tt opened this issue Mar 3, 2020 · 38 comments
Assignees
Labels
good first issue · help wanted · lifecycle/stale

Comments

@mm4tt
Contributor

mm4tt commented Mar 3, 2020

Justification

See this comment - #1086 (comment)

Milestones

V0

  1. Create a new "ApiAvailability" measurement.
  2. This measurement should periodically probe the apiserver's /healthz endpoint and record whether the API was available.
  3. The measurement should output the following stats (see the sketch after the milestones):
    1. Availability percentage (i.e. % of OK responses over all responses)
    2. Longest consecutive unavailability period
    3. ...

V1

  1. Make the measurement fail if the apiserver has been continuously unavailable for the last XX min (e.g. 30 min).
  2. Make the error critical to ensure that test execution is stopped in such a case.

V2

  1. Come up with an exact SLI/SLO definition and make it available in perf-dash for further analysis.
  2. Make it a WIP Scalability SLO.
  3. Evaluate and promote it to an official SLO.
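
A minimal sketch of the V0/V1 bookkeeping, not tied to the clusterloader2 measurement API (the type, field names, probe interval, and threshold below are all illustrative): record every probe result, derive the availability percentage and the longest consecutive unavailability period, and report a failure once the apiserver has been continuously unavailable for longer than the configured threshold.

```go
package main

import (
	"fmt"
	"time"
)

// availabilityRecorder accumulates /healthz probe results. All names here are
// illustrative; this is not the actual clusterloader2 measurement API.
type availabilityRecorder struct {
	probeInterval        time.Duration
	failIfUnavailableFor time.Duration // V1 threshold; 0 disables the check

	total, failed int
	currentOutage time.Duration // ongoing consecutive-unavailability streak
	longestOutage time.Duration // longest consecutive unavailability period so far
}

// record registers one probe result. Outage length is approximated as the
// number of consecutive failures multiplied by the probe interval.
func (r *availabilityRecorder) record(healthy bool) {
	r.total++
	if healthy {
		r.currentOutage = 0
		return
	}
	r.failed++
	r.currentOutage += r.probeInterval
	if r.currentOutage > r.longestOutage {
		r.longestOutage = r.currentOutage
	}
}

// summary returns the V0 stats and a non-nil error when the V1 condition
// (continuous unavailability above the threshold) has been hit.
func (r *availabilityRecorder) summary() (availabilityPct float64, longestOutage time.Duration, err error) {
	if r.total > 0 {
		availabilityPct = 100 * float64(r.total-r.failed) / float64(r.total)
	}
	if r.failIfUnavailableFor > 0 && r.currentOutage >= r.failIfUnavailableFor {
		err = fmt.Errorf("apiserver continuously unavailable for %v (threshold %v)", r.currentOutage, r.failIfUnavailableFor)
	}
	return availabilityPct, r.longestOutage, err
}

func main() {
	r := &availabilityRecorder{probeInterval: time.Second, failIfUnavailableFor: 30 * time.Minute}
	for _, healthy := range []bool{true, true, false, false, true} { // simulated probe results
		r.record(healthy)
	}
	pct, longest, err := r.summary()
	fmt.Printf("availability: %.1f%%, longest outage: %v, err: %v\n", pct, longest, err)
}
```

The real measurement would feed record from its probing loop and call summary when it wraps up; the main function above only simulates a handful of probe results.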

/good-first-issue

@k8s-ci-robot
Contributor

@mm4tt:
This request has been marked as suitable for new contributors.

Please ensure the request meets the requirements listed here.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

(the issue description above)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot added the good first issue and help wanted labels on Mar 3, 2020
@vamossagar12
Contributor

vamossagar12 commented Mar 13, 2020

/assign
I would like to take this up. Could you provide more details?

@mm4tt
Contributor Author

mm4tt commented Mar 18, 2020

Great, thanks @vamossagar12

Let's start with V0. Could you let me know which part of the description is unclear or requires further details?

@vamossagar12
Contributor

Thanks, @mm4tt. A couple of things:

  1. How frequently should the probe happen? Should it be configurable, or can we define it as fixed?
  2. Should the availability percentage be computed over the entire perf test run, and do we have to keep track of both the percentage and the longest unavailable period?

As an aside, typically if the health endpoint is down for a configured time, the response would be to scale up or something similar. I just wanted to understand the rationale behind adding this particular measurement. Thanks!

@wojtek-t
Member

How frequently should the probe happen? Should it be configurable, or can we define it as fixed?

Seems like an implementation detail. We can make it configurable, but I doubt we'll be using different values once we agree on something.

Should the availability percentage be computed over the entire perf test run, and do we have to keep track of both the percentage and the longest unavailable period?

Correct.

As an aside, typically if the health endpoint is down for a configured time, the response would be to scale up or something similar. I just wanted to understand the rationale behind adding this particular measurement. Thanks!

There are a couple of points:

  • We want to ensure that we understand what availability looks like (I can imagine all other SLOs being fine while the cluster is periodically unavailable).
  • We can use it as an optimization - if the cluster is down for X minutes, it probably won't ever come back up, and we can shut down the test and save money.

@vamossagar12
Contributor

Thanks for the clarifications, @wojtek-t. I will start on this issue now. Pretty sure there will be a few more questions along the way though :)

@vamossagar12
Contributor

vamossagar12 commented Mar 26, 2020

Hi, since I last commented I haven't had a chance to look at this. I will start over the next couple of days.

@vamossagar12
Contributor

Hi, I started looking at this today. One observation is that there's already a measurement, metrics_for_e2e, which, among other things, fetches metrics from the apiserver.

So the new measurement could work along the same lines, except that instead of hitting the /metrics endpoint it would hit the apiserver's /healthz endpoint. That's from an implementation standpoint.

The other question I had: the measurements we define live within a Step, and a group of Steps forms a single test. When we say that this new measurement measures the health of the apiserver for the duration of a test (mentioned above), what exactly does "a test" mean in this case - the Step that houses the measurement(s), or the overall test?

@mm4tt
Contributor Author

mm4tt commented Mar 30, 2020

Hey, @vamossagar12

It makes sense to make the implementation similar to metrics_for_e2e. It's exactly as you said: instead of hitting /metrics, we'll be hitting /healthz.

Regarding your second question: usually a measurement has two actions, start and gather. This one should work the same way. The start action should start a goroutine that continuously pings the /healthz endpoint, and gather will stop this goroutine and wrap everything up. Does that make sense?
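
As a rough sketch of that shape, assuming a plain net/http client, a placeholder URL, and arbitrary intervals rather than the real measurement interface and cluster client config: start launches a ticker-driven goroutine that polls /healthz and records each result, and gather stops the goroutine and hands the collected results back for summarizing.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// healthzProber polls an apiserver /healthz endpoint in the background.
// The plain HTTP client and URL are placeholders; a real measurement would go
// through the cluster's client config (auth, TLS) instead.
type healthzProber struct {
	url     string
	cancel  context.CancelFunc
	wg      sync.WaitGroup
	mu      sync.Mutex
	results []bool // true = healthy probe
}

// start launches the probing goroutine; it runs until gather is called.
func (p *healthzProber) start(interval time.Duration) {
	ctx, cancel := context.WithCancel(context.Background())
	p.cancel = cancel
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				healthy := p.probeOnce(ctx)
				p.mu.Lock()
				p.results = append(p.results, healthy)
				p.mu.Unlock()
			}
		}
	}()
}

// probeOnce returns true if /healthz answered 200 within the timeout.
func (p *healthzProber) probeOnce(ctx context.Context) bool {
	ctx, cancel := context.WithTimeout(ctx, 5*time.Second)
	defer cancel()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, p.url, nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// gather stops the goroutine and returns the recorded probe results; the real
// measurement would turn these into the stats from the issue description.
func (p *healthzProber) gather() []bool {
	p.cancel()
	p.wg.Wait()
	p.mu.Lock()
	defer p.mu.Unlock()
	return append([]bool(nil), p.results...)
}

func main() {
	p := &healthzProber{url: "https://127.0.0.1:6443/healthz"} // placeholder address
	p.start(time.Second)                                       // "start" action
	time.Sleep(5 * time.Second)                                 // the rest of the test runs here
	fmt.Println(len(p.gather()), "probes recorded")             // "gather" action
}
```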

@wojtek-t
Member

It makes sense to make the implementation similar to metrics_for_e2e. It's exactly as you said: instead of hitting /metrics, we'll be hitting /healthz.

Not sure I understand - for metrics_for_e2e, IIRC we're fetching the metrics once at the end of the test. Assuming the above is true, it can't really work the same way...

@mm4tt
Contributor Author

mm4tt commented Mar 30, 2020

It makes sense to make the implementation similar to metrics_for_e2e. It's exactly as you said: instead of hitting /metrics, we'll be hitting /healthz.

Not sure I understand - for metrics_for_e2e, IIRC we're fetching the metrics once at the end of the test. Assuming the above is true, it can't really work the same way...

Good point, I should have checked how metrics_for_e2e works :) Still, you should be able to take some inspiration from that measurement. Let me know if you have more questions.

@vamossagar12
Contributor

Actually, I meant to use metrics_for_e2e only as a baseline for how to interact with the apiserver. I hadn't looked at its internals, so what @wojtek-t said is even more valuable :)

@vamossagar12
Contributor

Regarding your second question: usually a measurement has two actions, start and gather. The start action should start a goroutine that continuously pings the /healthz endpoint, and gather will stop this goroutine and wrap everything up.

Regarding point 2, I still have a question. As far as I understand, the hierarchy of a test is as follows:

Test -> Step -> Phase(s) or Measurement(s)

If we focus on measurements, they are two levels below a test. So when we hit the apiserver via start and gather from within a measurement, that would still be within the context of that particular step. Is the question clear now, or is my thinking totally off track here 😄

@mm4tt
Contributor Author

mm4tt commented Mar 30, 2020

Oh, I see where the confusion comes from. The hierarchy you listed is correct, but the "measurement" there should be treated as a measurement invocation. The measurement object's lifetime spans the whole test. So if your test is

Step 1: call method start on measurement A
Step 2: do something else
Step 3: do something else
Step 4: call method gather on measurement A

then the start and gather methods will be called on the same measurement instance (see the sketch below).
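
To make the "same instance" point concrete, here is a rough sketch of that pattern; the Execute signature and the action names are simplified stand-ins, not the real clusterloader2 interface. Because the framework keeps one measurement object per name for the whole test, state set in the start step is still there when a later step calls gather.

```go
package main

import (
	"fmt"
	"time"
)

// apiAvailabilityMeasurement is created once per measurement name and reused
// by every step that references it, so state set during "start" is still
// there when a later step runs "gather". Execute's signature is a simplified
// stand-in for whatever clusterloader2 actually passes.
type apiAvailabilityMeasurement struct {
	startTime time.Time
	stop      chan struct{}
}

func (m *apiAvailabilityMeasurement) Execute(action string) error {
	switch action {
	case "start": // Step 1 in the example above
		m.startTime = time.Now()
		m.stop = make(chan struct{})
		// ...spawn the /healthz-probing goroutine here...
		return nil
	case "gather": // Step 4 in the example above
		if m.stop == nil {
			return fmt.Errorf("gather called before start")
		}
		close(m.stop) // signal the probing goroutine to exit
		fmt.Printf("measured for %v\n", time.Since(m.startTime).Round(time.Millisecond))
		return nil
	default:
		return fmt.Errorf("unknown action %q", action)
	}
}

func main() {
	m := &apiAvailabilityMeasurement{} // one instance for the whole test
	_ = m.Execute("start")             // Step 1
	// Steps 2 and 3 do something else.
	_ = m.Execute("gather")            // Step 4
}
```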

@vamossagar12
Contributor

vamossagar12 commented Apr 7, 2020

@mm4tt I have taken an initial stab at this here: https://github.com/kubernetes/perf-tests/pull/1162/files

Please review when you have the bandwidth. I will also put more thought into improving it.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Jul 6, 2020
@mm4tt
Contributor Author

mm4tt commented Jul 6, 2020

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Jul 6, 2020
@vamossagar12
Contributor

@wojtek-t, the PR got merged. I guess the next task would be to write the config files? Are there any specific areas I should start looking at for that?

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Nov 15, 2020
@wojtek-t
Member

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Nov 16, 2020
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on May 17, 2021
@tosi3k
Member

tosi3k commented May 17, 2021

We're in the middle of V1. We have already added a configurable availability percentage threshold under which the test fails.

The problem is that the API availability measurement makes the API call latency measurement fail. This is because the former periodically runs kubectl exec underneath, which pushes the latency of POST calls to the pods' exec subresource above the 1s threshold (which is understandable and shouldn't be considered an API call latency SLO violation). @jupblb is currently working on fixing this.
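
For context, the in-cluster probe path boils down to something like the sketch below (the pod name and the in-pod command are made up for illustration). Each probe is itself a request to the pod's exec subresource, which is exactly what the API call latency measurement then observes.

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Hypothetical illustration: the pod name and the in-pod command are made
	// up. The point is that probing /healthz via `kubectl exec` turns every
	// probe into a POST against the pod's exec subresource, which the API call
	// latency measurement then records.
	out, err := exec.Command(
		"kubectl", "exec", "probe-pod", "--",
		"curl", "-k", "-s", "-o", "/dev/null", "-w", "%{http_code}",
		"https://kubernetes.default.svc/healthz",
	).Output()
	if err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("healthz returned HTTP", string(out))
}
```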

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Aug 15, 2021
@wojtek-t
Member

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Aug 23, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Nov 21, 2021
@wojtek-t
Member

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Nov 22, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Feb 20, 2022
@wojtek-t
Member

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on Feb 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on May 22, 2022
@wojtek-t
Member

/remove-lifecycle stale

k8s-ci-robot removed the lifecycle/stale label on May 23, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Aug 21, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Sep 20, 2022
@wojtek-t
Member

/remove-lifecycle rotten

k8s-ci-robot removed the lifecycle/rotten label on Sep 22, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Feb 8, 2023
@wojtek-t
Member

wojtek-t commented Feb 8, 2023

/remove-lifecycle rotten

vamossagar12 removed their assignment on Jun 2, 2023