etcdserver: add server range duration metrics #17983

thedtripp · 2024-05-11T01:01:11Z

This PR is intended to address this issue:
#16866

k8s-ci-robot · 2024-05-11T01:01:20Z

Hi @thedtripp. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

ivanvc · 2024-05-11T03:04:38Z

Hi @thedtripp, it looks like your commit message is wrongly formatted (signed-off-by appears twice). I also suggest co-authoring with the author of the original PR #16902.

/ok-to-test

ivanvc · 2024-05-11T03:23:43Z

Hi @jmhbnz, the original suggestion from #16902 (review) was to add an integration or e2e case for this functionality. But later in #16866 (comment), you refer to it as basic testing. The current implementation introduced unit tests. Is this what we're looking for?

thedtripp · 2024-05-11T03:34:52Z

Hi @thedtripp, it looks like your commit message is wrongly formatted (signed-off-by appears twice). I also suggest co-authoring with the author of the original PR #16902.

/ok-to-test

@ivanvc. Fixed. Thanks!

jmhbnz · 2024-05-11T08:00:45Z

Hi @jmhbnz, the original suggestion from #16902 (review) was to add an integration or e2e case for this functionality. But later in #16866 (comment), you refer to it as basic testing. The current implementation introduced unit tests. Is this what we're looking for?

The unit test is a good start, I think ideally we should expand tests/integration/metrics_test.go or tests/e2e/metrics_test.go to have coverage for this new metric. @thedtripp please take a look at implementing an integration or e2e test and let us know if you have any questions, thanks.

server/etcdserver/txn/metrics.go

thedtripp · 2024-05-11T19:13:38Z

Hi @jmhbnz, the original suggestion from #16902 (review) was to add an integration or e2e case for this functionality. But later in #16866 (comment), you refer to it as basic testing. The current implementation introduced unit tests. Is this what we're looking for?

The unit test is a good start, I think ideally we should expand tests/integration/metrics_test.go or tests/e2e/metrics_test.go to have coverage for this new metric. @thedtripp please take a look at implementing an integration or e2e test and let us know if you have any questions, thanks.

@jmhbnz Is there any test coverage for the existing transaction metrics (applySec and slowApplies)? I'm trying to figure out how to write the requested integration or e2e test. Thanks.

jmhbnz

Thanks for your work on this @thedtripp - It's looking good, please see the suggestions below.

server/etcdserver/txn/metrics.go

server/etcdserver/txn/txn.go

jmhbnz · 2024-05-12T21:40:09Z

tests/integration/metrics_test.go

+	if err != nil {
+		t.Fatal(err)
+	}
+	if rangeDurationSeconds == "" {


Could we verify if the number is positive and within an expected range?

Of course. Sometimes the number comes back as 0. So, should the range be between 0 and 900 (for a 15 minute timeout)? @jmhbnz

I would suggest running the test a number of times to establish an expected baseline duration then adding some buffer to prevent flakes.

I tried setting the upper limit to something reasonable (e.g. 10 seconds) and the integration tests failed repeatedly. I added some buffer and now it is 900 seconds. With this, the tests pass consistently.

Perhaps a way to find this value would be to use stress (go install golang.org/x/tools/cmd/stress@latest) as we do to find the flakiness ratio.

cd tests/integration # edit the duration go test -v -c -count 1 stress -p=8 ./integration.test -test.run "^TestMetricsRangeDurationSeconds$" # leave it running for some minutes and check the failure ratio, then edit and repeat the compilation and stress

@ivanvc Thanks for this. I'll try it out!

@jmhbnz @ivanvc

Range duration threshold stress test log

TL;DR:
I have abridged the logs for easier reading.
Running the tests in parallel seems to increase the failure rate significantly.
Running the single test results in easy passing while running all tests causes failures.
Based on my these findings, I propose max range duration 300 seconds.

10 seconds: 100% failure.

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m5s: 0 runs so far, 0 failures
8m10s: 1 runs so far, 1 failures (100.00%)
8m15s: 5 runs so far, 5 failures (100.00%)
8m20s: 7 runs so far, 7 failures (100.00%)
9m0s: 7 runs so far, 7 failures (100.00%)
10m0s: 7 runs so far, 7 failures (100.00%)
11m0s: 8 runs so far, 8 failures (100.00%)
12m0s: 8 runs so far, 8 failures (100.00%)
13m0s: 8 runs so far, 8 failures (100.00%)
14m0s: 8 runs so far, 8 failures (100.00%)
15m0s: 8 runs so far, 8 failures (100.00%)
16m0s: 8 runs so far, 8 failures (100.00%)
16m5s: 9 runs so far, 9 failures (100.00%)
16m10s: 9 runs so far, 9 failures (100.00%)
16m15s: 9 runs so far, 9 failures (100.00%)
16m20s: 10 runs so far, 10 failures (100.00%)
16m25s: 11 runs so far, 11 failures (100.00%)
16m30s: 15 runs so far, 15 failures (100.00%)
17m0s: 15 runs so far, 15 failures (100.00%)
17m5s: 15 runs so far, 15 failures (100.00%)
18m0s: 16 runs so far, 16 failures (100.00%)

——————————————————————————————————————————————————————————————————————————

30 seconds: 100% failure.

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m15s: 1 runs so far, 1 failures (100.00%)
8m20s: 1 runs so far, 1 failures (100.00%)
8m25s: 1 runs so far, 1 failures (100.00%)
8m30s: 5 runs so far, 5 failures (100.00%)
8m35s: 7 runs so far, 7 failures (100.00%)
9m0s: 8 runs so far, 8 failures (100.00%)

——————————————————————————————————————————————————————————————————————————

120 seconds : 100% failure.

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m15s: 1 runs so far, 1 failures (100.00%)
8m20s: 2 runs so far, 2 failures (100.00%)
8m25s: 4 runs so far, 4 failures (100.00%)
8m30s: 8 runs so far, 8 failures (100.00%)
9m0s: 8 runs so far, 8 failures (100.00%)
10m0s: 8 runs so far, 8 failures (100.00%)

——————————————————————————————————————————————————————————————————————————

300 seconds: 63.75% failure

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m35s: 5 runs so far, 4 failures (80.00%)
8m40s: 8 runs so far, 6 failures (75.00%)
9m0s: 8 runs so far, 6 failures (75.00%)
10m0s: 8 runs so far, 6 failures (75.00%)
11m0s: 8 runs so far, 6 failures (75.00%)
12m0s: 8 runs so far, 6 failures (75.00%)
13m0s: 8 runs so far, 6 failures (75.00%)
14m0s: 8 runs so far, 6 failures (75.00%)
15m0s: 8 runs so far, 6 failures (75.00%)
16m0s: 8 runs so far, 6 failures (75.00%)
16m50s: 12 runs so far, 9 failures (75.00%)
16m55s: 16 runs so far, 11 failures (68.75%)
17m0s: 16 runs so far, 11 failures (68.75%)
18m0s: 16 runs so far, 11 failures (68.75%)
19m0s: 16 runs so far, 11 failures (68.75%)

——————————————————————————————————————————————————————————————————————————

600 seconds: 62.5% failure

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m30s: 3 runs so far, 2 failures (66.67%)
8m35s: 8 runs so far, 5 failures (62.50%)
9m0s: 8 runs so far, 5 failures (62.50%)
10m0s: 8 runs so far, 5 failures (62.50%)
11m0s: 8 runs so far, 5 failures (62.50%)
12m0s: 8 runs so far, 5 failures (62.50%)

——————————————————————————————————————————————————————————————————————————

900 seconds: failure 56.25%

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m30s: 2 runs so far, 1 failures (50.00%)
9m0s: 8 runs so far, 4 failures (50.00%)
10m0s: 8 runs so far, 4 failures (50.00%)
11m0s: 8 runs so far, 4 failures (50.00%)
12m0s: 8 runs so far, 4 failures (50.00%)
13m0s: 8 runs so far, 4 failures (50.00%)
14m0s: 8 runs so far, 4 failures (50.00%)
15m0s: 8 runs so far, 4 failures (50.00%)
16m0s: 8 runs so far, 4 failures (50.00%)
17m0s: 16 runs so far, 9 failures (56.25%)

——————————————————————————————————————————————————————————————————————————

1,800 seconds:

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m25s: 2 runs so far, 2 failures (100.00%)
8m30s: 8 runs so far, 5 failures (62.50%)
9m0s: 8 runs so far, 5 failures (62.50%)
10m0s: 8 runs so far, 5 failures (62.50%)
11m0s: 8 runs so far, 5 failures (62.50%)
12m0s: 8 runs so far, 5 failures (62.50%)
13m0s: 8 runs so far, 5 failures (62.50%)
14m0s: 8 runs so far, 5 failures (62.50%)
15m0s: 8 runs so far, 5 failures (62.50%)
16m0s: 8 runs so far, 5 failures (62.50%)
17m0s: 16 runs so far, 10 failures (62.50%)

——————————————————————————————————————————————————————————————————————————

600 seconds -p=3

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=3 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 3 runs so far, 1 failures (33.33%)
9m0s: 3 runs so far, 1 failures (33.33%)
10m0s: 3 runs so far, 1 failures (33.33%)
11m0s: 3 runs so far, 1 failures (33.33%)
12m0s: 3 runs so far, 1 failures (33.33%)
13m0s: 3 runs so far, 1 failures (33.33%)
13m55s: 5 runs so far, 2 failures (40.00%)
14m0s: 5 runs so far, 2 failures (40.00%)
14m10s: 6 runs so far, 2 failures (33.33%)
14m15s: 6 runs so far, 2 failures (33.33%)
14m20s: 6 runs so far, 2 failures (33.33%)

——————————————————————————————————————————————————————————————————————————

120 seconds: -p=1

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=1 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
6m30s: 1 runs so far, 1 failures (100.00%)
6m55s: 1 runs so far, 1 failures (100.00%)

——————————————————————————————————————————————————————————————————————————

300 seconds -p=1 failure: 0%

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=1 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 1 runs so far, 0 failures
8m0s: 1 runs so far, 0 failures
9m0s: 1 runs so far, 0 failures
10m0s: 1 runs so far, 0 failures
11m0s: 1 runs so far, 0 failures
12m0s: 1 runs so far, 0 failures
13m0s: 1 runs so far, 0 failures
14m0s: 2 runs so far, 0 failures
15m0s: 2 runs so far, 0 failures
16m0s: 2 runs so far, 0 failures
17m0s: 2 runs so far, 0 failures
18m0s: 2 runs so far, 0 failures
19m0s: 2 runs so far, 0 failures
20m0s: 3 runs so far, 0 failures

tests/integration/metrics_test.go

thedtripp · 2024-05-16T06:11:13Z

Commit 4f029a9, which set threshold to 300 seconds, resulted in failing tests. I will increase the value and try again.

ivanvc · 2024-05-16T23:22:37Z

Thanks, @thedtripp, this is looking good. I suggest squashing your commits into a single one in preparation for merging this PR.

thedtripp · 2024-05-16T23:28:12Z

Thanks, @thedtripp, this is looking good. I suggest squashing your commits into a single one in preparation for merging this PR.

Is it ok to squash the commits with interactive rebase?

ivanvc · 2024-05-17T00:00:19Z

Is it ok to squash the commits with interactive rebase?

Yes! That's how I usually do it 🙄 😄

tests/integration/metrics_test.go

ahrtr · 2024-05-17T14:29:55Z

Thanks @thedtripp for your first PR.

Overall looks good to me. Please also add a changelog item in CHANGELOG-3.6.md#metrics-monitoring

tests/integration/metrics_test.go

Signed-off-by: Daniel Tripp <38776199+thedtripp@users.noreply.github.com> Co-authored-by: Ravi Hari <RaviHari@users.noreply.github.com>

jmhbnz

LGTM - Thanks for this first contribution to etcd @thedtripp 🎉

ivanvc

LGTM. Thanks, @thedtripp

k8s-ci-robot added the needs-ok-to-test label May 11, 2024

thedtripp mentioned this pull request May 11, 2024

Add etcd_server_range_duration_seconds metrics #16866

Closed

k8s-ci-robot added ok-to-test and removed needs-ok-to-test labels May 11, 2024

thedtripp changed the title ~~etcdserver: add server range duration metrics\n Signed-off-by: Daniel…~~ etcdserver: add server range duration metrics May 11, 2024

ahrtr reviewed May 11, 2024

View reviewed changes

server/etcdserver/txn/metrics.go Outdated Show resolved Hide resolved

jmhbnz reviewed May 12, 2024

View reviewed changes

k8s-ci-robot added the area/testing label May 12, 2024

jmhbnz mentioned this pull request May 13, 2024

Auto label PRs on Github #17970

Closed

thedtripp requested review from jmhbnz, ivanvc and ahrtr May 13, 2024 21:03

jmhbnz reviewed May 15, 2024

View reviewed changes

tests/integration/metrics_test.go Outdated Show resolved Hide resolved

ahrtr reviewed May 17, 2024

View reviewed changes

tests/integration/metrics_test.go Outdated Show resolved Hide resolved

tests/integration/metrics_test.go Outdated Show resolved Hide resolved

thedtripp requested review from ahrtr and jmhbnz May 17, 2024 14:18

ahrtr approved these changes May 17, 2024

View reviewed changes

ivanvc reviewed May 18, 2024

View reviewed changes

tests/integration/metrics_test.go Outdated Show resolved Hide resolved

etcdserver: add server range duration metrics

0232686

Signed-off-by: Daniel Tripp <38776199+thedtripp@users.noreply.github.com> Co-authored-by: Ravi Hari <RaviHari@users.noreply.github.com>

jmhbnz approved these changes May 18, 2024

View reviewed changes

ivanvc approved these changes May 18, 2024

View reviewed changes

jmhbnz merged commit 52fb28c into etcd-io:main May 18, 2024
44 checks passed

jmhbnz mentioned this pull request May 19, 2024

feat: add etcd_debugging_lease_total gauge metric #17991

Closed

jmhbnz mentioned this pull request Jun 29, 2024

feat: add etcd_debugging_lease_total gauge metric #18067

Open

ivanvc mentioned this pull request Sep 28, 2024

Document deflaking process #18650

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

etcdserver: add server range duration metrics #17983

etcdserver: add server range duration metrics #17983

thedtripp commented May 11, 2024 •

edited

Loading

k8s-ci-robot commented May 11, 2024

ivanvc commented May 11, 2024

ivanvc commented May 11, 2024

thedtripp commented May 11, 2024

jmhbnz commented May 11, 2024 •

edited

Loading

thedtripp commented May 11, 2024

jmhbnz left a comment

jmhbnz May 12, 2024

thedtripp May 12, 2024

jmhbnz May 12, 2024

thedtripp May 13, 2024

ivanvc May 15, 2024

thedtripp May 15, 2024

thedtripp May 16, 2024

thedtripp commented May 16, 2024

ivanvc commented May 16, 2024

thedtripp commented May 16, 2024

ivanvc commented May 17, 2024

ahrtr commented May 17, 2024

jmhbnz left a comment

ivanvc left a comment

etcdserver: add server range duration metrics #17983

etcdserver: add server range duration metrics #17983

Conversation

thedtripp commented May 11, 2024 • edited Loading

k8s-ci-robot commented May 11, 2024

ivanvc commented May 11, 2024

ivanvc commented May 11, 2024

thedtripp commented May 11, 2024

jmhbnz commented May 11, 2024 • edited Loading

thedtripp commented May 11, 2024

jmhbnz left a comment

Choose a reason for hiding this comment

jmhbnz May 12, 2024

Choose a reason for hiding this comment

thedtripp May 12, 2024

Choose a reason for hiding this comment

jmhbnz May 12, 2024

Choose a reason for hiding this comment

thedtripp May 13, 2024

Choose a reason for hiding this comment

ivanvc May 15, 2024

Choose a reason for hiding this comment

thedtripp May 15, 2024

Choose a reason for hiding this comment

thedtripp May 16, 2024

Choose a reason for hiding this comment

thedtripp commented May 16, 2024

ivanvc commented May 16, 2024

thedtripp commented May 16, 2024

ivanvc commented May 17, 2024

ahrtr commented May 17, 2024

jmhbnz left a comment

Choose a reason for hiding this comment

ivanvc left a comment

Choose a reason for hiding this comment

thedtripp commented May 11, 2024 •

edited

Loading

jmhbnz commented May 11, 2024 •

edited

Loading