Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

etcdserver: add server range duration metrics #17983

Merged
merged 1 commit into from
May 18, 2024
Merged

etcdserver: add server range duration metrics #17983

merged 1 commit into from
May 18, 2024

Conversation

thedtripp
Copy link
Member

@thedtripp thedtripp commented May 11, 2024

This PR is intended to address this issue:
#16866

@k8s-ci-robot
Copy link

Hi @thedtripp. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ivanvc
Copy link
Member

ivanvc commented May 11, 2024

Hi @thedtripp, it looks like your commit message is wrongly formatted (signed-off-by appears twice). I also suggest co-authoring with the author of the original PR #16902.

/ok-to-test

@thedtripp thedtripp changed the title etcdserver: add server range duration metrics\n Signed-off-by: Daniel… etcdserver: add server range duration metrics May 11, 2024
@ivanvc
Copy link
Member

ivanvc commented May 11, 2024

Hi @jmhbnz, the original suggestion from #16902 (review) was to add an integration or e2e case for this functionality. But later in #16866 (comment), you refer to it as basic testing. The current implementation introduced unit tests. Is this what we're looking for?

@thedtripp
Copy link
Member Author

Hi @thedtripp, it looks like your commit message is wrongly formatted (signed-off-by appears twice). I also suggest co-authoring with the author of the original PR #16902.

/ok-to-test

@ivanvc. Fixed. Thanks!

@jmhbnz
Copy link
Member

jmhbnz commented May 11, 2024

Hi @jmhbnz, the original suggestion from #16902 (review) was to add an integration or e2e case for this functionality. But later in #16866 (comment), you refer to it as basic testing. The current implementation introduced unit tests. Is this what we're looking for?

The unit test is a good start, I think ideally we should expand tests/integration/metrics_test.go or tests/e2e/metrics_test.go to have coverage for this new metric. @thedtripp please take a look at implementing an integration or e2e test and let us know if you have any questions, thanks.

@thedtripp
Copy link
Member Author

Hi @jmhbnz, the original suggestion from #16902 (review) was to add an integration or e2e case for this functionality. But later in #16866 (comment), you refer to it as basic testing. The current implementation introduced unit tests. Is this what we're looking for?

The unit test is a good start, I think ideally we should expand tests/integration/metrics_test.go or tests/e2e/metrics_test.go to have coverage for this new metric. @thedtripp please take a look at implementing an integration or e2e test and let us know if you have any questions, thanks.

@jmhbnz Is there any test coverage for the existing transaction metrics (applySec and slowApplies)? I'm trying to figure out how to write the requested integration or e2e test. Thanks.

Copy link
Member

@jmhbnz jmhbnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this @thedtripp - It's looking good, please see the suggestions below.

server/etcdserver/txn/metrics.go Outdated Show resolved Hide resolved
server/etcdserver/txn/txn.go Outdated Show resolved Hide resolved
server/etcdserver/txn/txn.go Outdated Show resolved Hide resolved
if err != nil {
t.Fatal(err)
}
if rangeDurationSeconds == "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we verify if the number is positive and within an expected range?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course. Sometimes the number comes back as 0. So, should the range be between 0 and 900 (for a 15 minute timeout)? @jmhbnz

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest running the test a number of times to establish an expected baseline duration then adding some buffer to prevent flakes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried setting the upper limit to something reasonable (e.g. 10 seconds) and the integration tests failed repeatedly. I added some buffer and now it is 900 seconds. With this, the tests pass consistently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a way to find this value would be to use stress (go install golang.org/x/tools/cmd/stress@latest) as we do to find the flakiness ratio.

cd tests/integration

# edit the duration
go test -v -c -count 1
stress -p=8 ./integration.test -test.run "^TestMetricsRangeDurationSeconds$"
# leave it running for some minutes and check the failure ratio, then edit and repeat the compilation and stress

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ivanvc Thanks for this. I'll try it out!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jmhbnz @ivanvc

Range duration threshold stress test log

TL;DR:
I have abridged the logs for easier reading.
Running the tests in parallel seems to increase the failure rate significantly.
Running the single test results in easy passing while running all tests causes failures.
Based on my these findings, I propose max range duration 300 seconds.

10 seconds: 100% failure.

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m5s: 0 runs so far, 0 failures
8m10s: 1 runs so far, 1 failures (100.00%)
8m15s: 5 runs so far, 5 failures (100.00%)
8m20s: 7 runs so far, 7 failures (100.00%)
9m0s: 7 runs so far, 7 failures (100.00%)
10m0s: 7 runs so far, 7 failures (100.00%)
11m0s: 8 runs so far, 8 failures (100.00%)
12m0s: 8 runs so far, 8 failures (100.00%)
13m0s: 8 runs so far, 8 failures (100.00%)
14m0s: 8 runs so far, 8 failures (100.00%)
15m0s: 8 runs so far, 8 failures (100.00%)
16m0s: 8 runs so far, 8 failures (100.00%)
16m5s: 9 runs so far, 9 failures (100.00%)
16m10s: 9 runs so far, 9 failures (100.00%)
16m15s: 9 runs so far, 9 failures (100.00%)
16m20s: 10 runs so far, 10 failures (100.00%)
16m25s: 11 runs so far, 11 failures (100.00%)
16m30s: 15 runs so far, 15 failures (100.00%)
17m0s: 15 runs so far, 15 failures (100.00%)
17m5s: 15 runs so far, 15 failures (100.00%)
18m0s: 16 runs so far, 16 failures (100.00%)

——————————————————————————————————————————————————————————————————————————

30 seconds: 100% failure.

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m15s: 1 runs so far, 1 failures (100.00%)
8m20s: 1 runs so far, 1 failures (100.00%)
8m25s: 1 runs so far, 1 failures (100.00%)
8m30s: 5 runs so far, 5 failures (100.00%)
8m35s: 7 runs so far, 7 failures (100.00%)
9m0s: 8 runs so far, 8 failures (100.00%)

——————————————————————————————————————————————————————————————————————————

120 seconds : 100% failure.

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m15s: 1 runs so far, 1 failures (100.00%)
8m20s: 2 runs so far, 2 failures (100.00%)
8m25s: 4 runs so far, 4 failures (100.00%)
8m30s: 8 runs so far, 8 failures (100.00%)
9m0s: 8 runs so far, 8 failures (100.00%)
10m0s: 8 runs so far, 8 failures (100.00%)

——————————————————————————————————————————————————————————————————————————

300 seconds: 63.75% failure

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m35s: 5 runs so far, 4 failures (80.00%)
8m40s: 8 runs so far, 6 failures (75.00%)
9m0s: 8 runs so far, 6 failures (75.00%)
10m0s: 8 runs so far, 6 failures (75.00%)
11m0s: 8 runs so far, 6 failures (75.00%)
12m0s: 8 runs so far, 6 failures (75.00%)
13m0s: 8 runs so far, 6 failures (75.00%)
14m0s: 8 runs so far, 6 failures (75.00%)
15m0s: 8 runs so far, 6 failures (75.00%)
16m0s: 8 runs so far, 6 failures (75.00%)
16m50s: 12 runs so far, 9 failures (75.00%)
16m55s: 16 runs so far, 11 failures (68.75%)
17m0s: 16 runs so far, 11 failures (68.75%)
18m0s: 16 runs so far, 11 failures (68.75%)
19m0s: 16 runs so far, 11 failures (68.75%)

——————————————————————————————————————————————————————————————————————————

600 seconds: 62.5% failure

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m30s: 3 runs so far, 2 failures (66.67%)
8m35s: 8 runs so far, 5 failures (62.50%)
9m0s: 8 runs so far, 5 failures (62.50%)
10m0s: 8 runs so far, 5 failures (62.50%)
11m0s: 8 runs so far, 5 failures (62.50%)
12m0s: 8 runs so far, 5 failures (62.50%)

——————————————————————————————————————————————————————————————————————————

900 seconds: failure 56.25%

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m30s: 2 runs so far, 1 failures (50.00%)
9m0s: 8 runs so far, 4 failures (50.00%)
10m0s: 8 runs so far, 4 failures (50.00%)
11m0s: 8 runs so far, 4 failures (50.00%)
12m0s: 8 runs so far, 4 failures (50.00%)
13m0s: 8 runs so far, 4 failures (50.00%)
14m0s: 8 runs so far, 4 failures (50.00%)
15m0s: 8 runs so far, 4 failures (50.00%)
16m0s: 8 runs so far, 4 failures (50.00%)
17m0s: 16 runs so far, 9 failures (56.25%)

——————————————————————————————————————————————————————————————————————————

1,800 seconds:

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m25s: 2 runs so far, 2 failures (100.00%)
8m30s: 8 runs so far, 5 failures (62.50%)
9m0s: 8 runs so far, 5 failures (62.50%)
10m0s: 8 runs so far, 5 failures (62.50%)
11m0s: 8 runs so far, 5 failures (62.50%)
12m0s: 8 runs so far, 5 failures (62.50%)
13m0s: 8 runs so far, 5 failures (62.50%)
14m0s: 8 runs so far, 5 failures (62.50%)
15m0s: 8 runs so far, 5 failures (62.50%)
16m0s: 8 runs so far, 5 failures (62.50%)
17m0s: 16 runs so far, 10 failures (62.50%)

——————————————————————————————————————————————————————————————————————————

600 seconds -p=3

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=3 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 3 runs so far, 1 failures (33.33%)
9m0s: 3 runs so far, 1 failures (33.33%)
10m0s: 3 runs so far, 1 failures (33.33%)
11m0s: 3 runs so far, 1 failures (33.33%)
12m0s: 3 runs so far, 1 failures (33.33%)
13m0s: 3 runs so far, 1 failures (33.33%)
13m55s: 5 runs so far, 2 failures (40.00%)
14m0s: 5 runs so far, 2 failures (40.00%)
14m10s: 6 runs so far, 2 failures (33.33%)
14m15s: 6 runs so far, 2 failures (33.33%)
14m20s: 6 runs so far, 2 failures (33.33%)

——————————————————————————————————————————————————————————————————————————

120 seconds: -p=1

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=1 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
6m30s: 1 runs so far, 1 failures (100.00%)
6m55s: 1 runs so far, 1 failures (100.00%)

——————————————————————————————————————————————————————————————————————————

300 seconds -p=1 failure: 0%

@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=1 ./integration.test

1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 1 runs so far, 0 failures
8m0s: 1 runs so far, 0 failures
9m0s: 1 runs so far, 0 failures
10m0s: 1 runs so far, 0 failures
11m0s: 1 runs so far, 0 failures
12m0s: 1 runs so far, 0 failures
13m0s: 1 runs so far, 0 failures
14m0s: 2 runs so far, 0 failures
15m0s: 2 runs so far, 0 failures
16m0s: 2 runs so far, 0 failures
17m0s: 2 runs so far, 0 failures
18m0s: 2 runs so far, 0 failures
19m0s: 2 runs so far, 0 failures
20m0s: 3 runs so far, 0 failures

@thedtripp
Copy link
Member Author

Commit 4f029a9, which set threshold to 300 seconds, resulted in failing tests. I will increase the value and try again.

@ivanvc
Copy link
Member

ivanvc commented May 16, 2024

Thanks, @thedtripp, this is looking good. I suggest squashing your commits into a single one in preparation for merging this PR.

@thedtripp
Copy link
Member Author

Thanks, @thedtripp, this is looking good. I suggest squashing your commits into a single one in preparation for merging this PR.

Is it ok to squash the commits with interactive rebase?

@ivanvc
Copy link
Member

ivanvc commented May 17, 2024

Is it ok to squash the commits with interactive rebase?

Yes! That's how I usually do it 🙄 😄

tests/integration/metrics_test.go Outdated Show resolved Hide resolved
tests/integration/metrics_test.go Outdated Show resolved Hide resolved
@thedtripp thedtripp requested review from ahrtr and jmhbnz May 17, 2024 14:18
@ahrtr
Copy link
Member

ahrtr commented May 17, 2024

Thanks @thedtripp for your first PR.

Overall looks good to me. Please also add a changelog item in CHANGELOG-3.6.md#metrics-monitoring

Signed-off-by: Daniel Tripp <38776199+thedtripp@users.noreply.github.com>
Co-authored-by: Ravi Hari <RaviHari@users.noreply.github.com>
Copy link
Member

@jmhbnz jmhbnz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Thanks for this first contribution to etcd @thedtripp 🎉

Copy link
Member

@ivanvc ivanvc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks, @thedtripp

@jmhbnz jmhbnz merged commit 52fb28c into etcd-io:main May 18, 2024
44 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants