-
Hey @yitian108 - Thanks for your question. Your disk metrics graph shows WAL fsync reaching 500ms and DB fsync reaching over 1.5 seconds. Because etcd writes data to disk and persists proposals on disk, its performance depends on disk performance. Although etcd is not particularly I/O intensive, it requires a low-latency block device for optimal performance and stability. Because etcd's consensus protocol depends on persistently storing metadata to a log (WAL), etcd is sensitive to disk-write latency. Slow disks and disk activity from other processes can cause long fsync latencies. Those latencies can cause etcd to miss heartbeats, fail to commit new proposals to disk on time, and ultimately experience request timeouts and temporary leader loss. High write latencies also have flow-on impacts such as Kubernetes API slowness, which affects overall cluster performance.

I would suggest running etcd on better hardware so that you can keep your fsync latencies ideally under 10ms. To measure those numbers, you can use a disk benchmarking tool such as fio.
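For example, here is a minimal fio invocation that does sequential writes with an fdatasync after each one, which is close to etcd's WAL pattern (the directory, size, and block size below are placeholders; point `--directory` at the same disk that etcd uses):

```bash
# Placeholder benchmark directory on the etcd disk; adjust for your environment.
mkdir -p /var/lib/etcd-bench

# Sequential writes, fdatasync after every write, similar to etcd's WAL behaviour.
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-bench \
    --size=22m --bs=2300 --name=etcd-fsync-test
```

Recent fio releases print fdatasync latency percentiles in the output; the 99th percentile is the number to compare against the ~10ms target mentioned above.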
-
@yitian108 Has this issue been resolved? I have the same issue.
-
Here are the steps I used to simulate high I/O operations:
1. Generate a 1GB file.
2. Copy the file multiple times.
3. Perform I/O-intensive operations.
4. Run the shell script (see the sketch after this list).
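A minimal sketch of such a script; the working directory, file sizes, and copy count are placeholders, and it should be run on the same disk that holds the etcd data directory (stop it with Ctrl-C and clean up afterwards):

```bash
#!/usr/bin/env bash
# Hypothetical I/O stress script matching the steps above.
# WORKDIR, sizes, and the copy count are placeholders; point WORKDIR at the
# disk that etcd's data directory lives on.
set -euo pipefail

WORKDIR=/var/lib/iostress
mkdir -p "$WORKDIR"

# 1. Generate a 1GB file.
dd if=/dev/urandom of="$WORKDIR/big.file" bs=1M count=1024

# 2. Copy the file multiple times, syncing so the writes actually reach the disk.
for i in $(seq 1 20); do
  cp "$WORKDIR/big.file" "$WORKDIR/copy-$i"
  sync
done

# 3. Keep the disk busy with repeated read and fsync-heavy write passes.
while true; do
  cat "$WORKDIR"/copy-* > /dev/null
  dd if=/dev/urandom of="$WORKDIR/churn.file" bs=1M count=512 conv=fsync
done
```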
While the script runs, the etcd server logs entries like the following:
{"level":"warn","ts":"2024-01-23T11:40:45.194+0800","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947328307020133,"retry-timeout":"500ms"} {"level":"warn","ts":"2024-01-23T11:40:45.695+0800","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947328307020133,"retry-timeout":"500ms"} {"level":"warn","ts":"2024-01-23T11:40:46.196+0800","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947328307020133,"retry-timeout":"500ms"} {"level":"warn","ts":"2024-01-23T11:40:46.696+0800","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947328307020133,"retry-timeout":"500ms"} {"level":"warn","ts":"2024-01-23T11:40:47.197+0800","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947328307020133,"retry-timeout":"500ms"} {"level":"warn","ts":"2024-01-23T11:40:47.697+0800","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947328307020133,"retry-timeout":"500ms"} {"level":"warn","ts":"2024-01-23T11:40:48.194+0800","caller":"wal/wal.go:802","msg":"slow fdatasync","took":"3.584506237s","expected-duration":"1s"} {"level":"warn","ts":"2024-01-23T11:40:48.198+0800","caller":"etcdserver/v3_server.go:815","msg":"waiting for ReadIndex response took too long, retrying","sent-request-id":4822947328307020133,"retry-timeout":"500ms"} ... {"level":"warn","ts":"2024-01-23T11:42:20.521+0800","caller":"etcdserver/server.go:1130","msg":"failed to revoke lease","lease-id":"42ee8d0ac21d438a","error":"etcdserver: request timed out"} {"level":"warn","ts":"2024-01-23T11:42:20.521+0800","caller":"etcdserver/server.go:1130","msg":"failed to revoke lease","lease-id":"42ee8d0ac21d6178","error":"etcdserver: request timed out"} {"level":"warn","ts":"2024-01-23T11:42:20.521+0800","caller":"etcdserver/server.go:1130","msg":"failed to revoke lease","lease-id":"42ee8d0ac21d5b9e","error":"etcdserver: request timed out"} ... {"level":"warn","ts":"2024-01-23T11:45:18.150+0800","caller":"etcdserver/util.go:166","msg":"apply request took too long","took":"630.894281ms","expected-duration":"100ms","prefix":"","request":"header:<ID:4822947328307021375 > lease_grant:<ttl:15-second id:42ee8d0ac228c63e>","response":"size:41"} {"level":"info","ts":"2024-01-23T11:45:18.150+0800","caller":"etcdserver/server.go:2363","msg":"saved snapshot","snapshot-index":622784} {"level":"info","ts":"2024-01-23T11:45:18.151+0800","caller":"etcdserver/server.go:2393","msg":"compacted Raft logs","compact-index":617784} {"level":"info","ts":"2024-01-23T11:45:18.151+0800","caller":"etcdserver/server.go:2363","msg":"saved snapshot","snapshot-index":622565} {"level":"warn","ts":"2024-01-23T11:45:18.151+0800","caller":"etcdserver/util.go:123","msg":"failed to apply request","took":"9.719µs","request":"header:<ID:4822947328307021427 > lease_revoke:<id:42ee8d0ac21d602d>","response":"size:29","error":"lease not found"}
So I would like to know how to avoid the error shown in the title, even in a high I/O environment.
Below are some Grafana monitoring graphs I captured during the high I/O cases.
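For anyone comparing against their own cluster, a quick way to inspect the raw histograms behind such graphs without Grafana (a sketch, assuming a plain-HTTP metrics listener such as `--listen-metrics-urls=http://127.0.0.1:2381`; adjust the URL and TLS flags for your deployment):

```bash
# Dump etcd's disk latency histograms directly from the metrics endpoint.
# The URL is an assumption; many deployments expose metrics on port 2381.
curl -s http://127.0.0.1:2381/metrics \
  | grep -E 'etcd_disk_(wal_fsync|backend_commit)_duration_seconds'
```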