-
Notifications
You must be signed in to change notification settings - Fork 9.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
etcdserver: add server range duration metrics #17983
etcdserver: add server range duration metrics #17983
Conversation
Hi @thedtripp. Thanks for your PR. I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
Hi @thedtripp, it looks like your commit message is wrongly formatted (signed-off-by appears twice). I also suggest co-authoring with the author of the original PR #16902. /ok-to-test |
Hi @jmhbnz, the original suggestion from #16902 (review) was to add an integration or e2e case for this functionality. But later in #16866 (comment), you refer to it as basic testing. The current implementation introduced unit tests. Is this what we're looking for? |
@ivanvc. Fixed. Thanks! |
The unit test is a good start, I think ideally we should expand |
@jmhbnz Is there any test coverage for the existing transaction metrics (applySec and slowApplies)? I'm trying to figure out how to write the requested integration or e2e test. Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work on this @thedtripp - It's looking good, please see the suggestions below.
tests/integration/metrics_test.go
Outdated
if err != nil { | ||
t.Fatal(err) | ||
} | ||
if rangeDurationSeconds == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we verify if the number is positive and within an expected range?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of course. Sometimes the number comes back as 0. So, should the range be between 0 and 900 (for a 15 minute timeout)? @jmhbnz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest running the test a number of times to establish an expected baseline duration then adding some buffer to prevent flakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried setting the upper limit to something reasonable (e.g. 10 seconds) and the integration tests failed repeatedly. I added some buffer and now it is 900 seconds. With this, the tests pass consistently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a way to find this value would be to use stress
(go install golang.org/x/tools/cmd/stress@latest
) as we do to find the flakiness ratio.
cd tests/integration
# edit the duration
go test -v -c -count 1
stress -p=8 ./integration.test -test.run "^TestMetricsRangeDurationSeconds$"
# leave it running for some minutes and check the failure ratio, then edit and repeat the compilation and stress
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ivanvc Thanks for this. I'll try it out!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Range duration threshold stress test log
TL;DR:
I have abridged the logs for easier reading.
Running the tests in parallel seems to increase the failure rate significantly.
Running the single test results in easy passing while running all tests causes failures.
Based on my these findings, I propose max range duration 300 seconds.
10 seconds: 100% failure.
@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test
1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m5s: 0 runs so far, 0 failures
8m10s: 1 runs so far, 1 failures (100.00%)
8m15s: 5 runs so far, 5 failures (100.00%)
8m20s: 7 runs so far, 7 failures (100.00%)
9m0s: 7 runs so far, 7 failures (100.00%)
10m0s: 7 runs so far, 7 failures (100.00%)
11m0s: 8 runs so far, 8 failures (100.00%)
12m0s: 8 runs so far, 8 failures (100.00%)
13m0s: 8 runs so far, 8 failures (100.00%)
14m0s: 8 runs so far, 8 failures (100.00%)
15m0s: 8 runs so far, 8 failures (100.00%)
16m0s: 8 runs so far, 8 failures (100.00%)
16m5s: 9 runs so far, 9 failures (100.00%)
16m10s: 9 runs so far, 9 failures (100.00%)
16m15s: 9 runs so far, 9 failures (100.00%)
16m20s: 10 runs so far, 10 failures (100.00%)
16m25s: 11 runs so far, 11 failures (100.00%)
16m30s: 15 runs so far, 15 failures (100.00%)
17m0s: 15 runs so far, 15 failures (100.00%)
17m5s: 15 runs so far, 15 failures (100.00%)
18m0s: 16 runs so far, 16 failures (100.00%)
——————————————————————————————————————————————————————————————————————————
30 seconds: 100% failure.
@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test
1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m15s: 1 runs so far, 1 failures (100.00%)
8m20s: 1 runs so far, 1 failures (100.00%)
8m25s: 1 runs so far, 1 failures (100.00%)
8m30s: 5 runs so far, 5 failures (100.00%)
8m35s: 7 runs so far, 7 failures (100.00%)
9m0s: 8 runs so far, 8 failures (100.00%)
——————————————————————————————————————————————————————————————————————————
120 seconds : 100% failure.
@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test
1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m15s: 1 runs so far, 1 failures (100.00%)
8m20s: 2 runs so far, 2 failures (100.00%)
8m25s: 4 runs so far, 4 failures (100.00%)
8m30s: 8 runs so far, 8 failures (100.00%)
9m0s: 8 runs so far, 8 failures (100.00%)
10m0s: 8 runs so far, 8 failures (100.00%)
——————————————————————————————————————————————————————————————————————————
300 seconds: 63.75% failure
@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test
1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m35s: 5 runs so far, 4 failures (80.00%)
8m40s: 8 runs so far, 6 failures (75.00%)
9m0s: 8 runs so far, 6 failures (75.00%)
10m0s: 8 runs so far, 6 failures (75.00%)
11m0s: 8 runs so far, 6 failures (75.00%)
12m0s: 8 runs so far, 6 failures (75.00%)
13m0s: 8 runs so far, 6 failures (75.00%)
14m0s: 8 runs so far, 6 failures (75.00%)
15m0s: 8 runs so far, 6 failures (75.00%)
16m0s: 8 runs so far, 6 failures (75.00%)
16m50s: 12 runs so far, 9 failures (75.00%)
16m55s: 16 runs so far, 11 failures (68.75%)
17m0s: 16 runs so far, 11 failures (68.75%)
18m0s: 16 runs so far, 11 failures (68.75%)
19m0s: 16 runs so far, 11 failures (68.75%)
——————————————————————————————————————————————————————————————————————————
600 seconds: 62.5% failure
@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test
1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m30s: 3 runs so far, 2 failures (66.67%)
8m35s: 8 runs so far, 5 failures (62.50%)
9m0s: 8 runs so far, 5 failures (62.50%)
10m0s: 8 runs so far, 5 failures (62.50%)
11m0s: 8 runs so far, 5 failures (62.50%)
12m0s: 8 runs so far, 5 failures (62.50%)
——————————————————————————————————————————————————————————————————————————
900 seconds: failure 56.25%
@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test
1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m30s: 2 runs so far, 1 failures (50.00%)
9m0s: 8 runs so far, 4 failures (50.00%)
10m0s: 8 runs so far, 4 failures (50.00%)
11m0s: 8 runs so far, 4 failures (50.00%)
12m0s: 8 runs so far, 4 failures (50.00%)
13m0s: 8 runs so far, 4 failures (50.00%)
14m0s: 8 runs so far, 4 failures (50.00%)
15m0s: 8 runs so far, 4 failures (50.00%)
16m0s: 8 runs so far, 4 failures (50.00%)
17m0s: 16 runs so far, 9 failures (56.25%)
——————————————————————————————————————————————————————————————————————————
1,800 seconds:
@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=8 ./integration.test
1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 0 runs so far, 0 failures
8m25s: 2 runs so far, 2 failures (100.00%)
8m30s: 8 runs so far, 5 failures (62.50%)
9m0s: 8 runs so far, 5 failures (62.50%)
10m0s: 8 runs so far, 5 failures (62.50%)
11m0s: 8 runs so far, 5 failures (62.50%)
12m0s: 8 runs so far, 5 failures (62.50%)
13m0s: 8 runs so far, 5 failures (62.50%)
14m0s: 8 runs so far, 5 failures (62.50%)
15m0s: 8 runs so far, 5 failures (62.50%)
16m0s: 8 runs so far, 5 failures (62.50%)
17m0s: 16 runs so far, 10 failures (62.50%)
——————————————————————————————————————————————————————————————————————————
600 seconds -p=3
@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=3 ./integration.test
1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 0 runs so far, 0 failures
8m0s: 3 runs so far, 1 failures (33.33%)
9m0s: 3 runs so far, 1 failures (33.33%)
10m0s: 3 runs so far, 1 failures (33.33%)
11m0s: 3 runs so far, 1 failures (33.33%)
12m0s: 3 runs so far, 1 failures (33.33%)
13m0s: 3 runs so far, 1 failures (33.33%)
13m55s: 5 runs so far, 2 failures (40.00%)
14m0s: 5 runs so far, 2 failures (40.00%)
14m10s: 6 runs so far, 2 failures (33.33%)
14m15s: 6 runs so far, 2 failures (33.33%)
14m20s: 6 runs so far, 2 failures (33.33%)
——————————————————————————————————————————————————————————————————————————
120 seconds: -p=1
@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=1 ./integration.test
1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
6m30s: 1 runs so far, 1 failures (100.00%)
6m55s: 1 runs so far, 1 failures (100.00%)
——————————————————————————————————————————————————————————————————————————
300 seconds -p=1 failure: 0%
@thedtripp ➜ /workspaces/etcd/tests/integration (feature/addServerRangeDurationMetrics) $ stress -p=1 ./integration.test
1m0s: 0 runs so far, 0 failures
2m0s: 0 runs so far, 0 failures
3m0s: 0 runs so far, 0 failures
4m0s: 0 runs so far, 0 failures
5m0s: 0 runs so far, 0 failures
6m0s: 0 runs so far, 0 failures
7m0s: 1 runs so far, 0 failures
8m0s: 1 runs so far, 0 failures
9m0s: 1 runs so far, 0 failures
10m0s: 1 runs so far, 0 failures
11m0s: 1 runs so far, 0 failures
12m0s: 1 runs so far, 0 failures
13m0s: 1 runs so far, 0 failures
14m0s: 2 runs so far, 0 failures
15m0s: 2 runs so far, 0 failures
16m0s: 2 runs so far, 0 failures
17m0s: 2 runs so far, 0 failures
18m0s: 2 runs so far, 0 failures
19m0s: 2 runs so far, 0 failures
20m0s: 3 runs so far, 0 failures
Commit 4f029a9, which set threshold to 300 seconds, resulted in failing tests. I will increase the value and try again. |
Thanks, @thedtripp, this is looking good. I suggest squashing your commits into a single one in preparation for merging this PR. |
Is it ok to squash the commits with interactive rebase? |
Yes! That's how I usually do it 🙄 😄 |
Thanks @thedtripp for your first PR. Overall looks good to me. Please also add a changelog item in CHANGELOG-3.6.md#metrics-monitoring |
Signed-off-by: Daniel Tripp <38776199+thedtripp@users.noreply.github.com> Co-authored-by: Ravi Hari <RaviHari@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - Thanks for this first contribution to etcd @thedtripp 🎉
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks, @thedtripp
This PR is intended to address this issue:
#16866