
lease: Persist remainingTTL to prevent indefinite auto-renewal of long lived leases #9924

Merged
merged 4 commits into master from persist-lease-deadline on Jul 24, 2018

Conversation

jpbetz
Contributor

@jpbetz jpbetz commented Jul 14, 2018

Fixes #9888 by introducing a "lease checkpointing" mechanism.

The basic idea is that for all leases with TTLs greater than 5 minutes, the remaining TTL is checkpointed every 5 minutes, so that if a new leader is elected, leases are not auto-renewed to their full TTL but only to the remaining TTL from the last checkpoint. A checkpoint is an entry persisted to the RAFT consensus log that records the remainingTTL as determined by the leader at the time the checkpoint occurred.

If keep-alive is called on a lease that has been checkpointed, the remaining TTL is cleared by writing a checkpoint entry to the RAFT consensus log with remainingTTL=0, indicating it is unset and that the original TTL should be used.

All checkpointing is scheduled and performed by the leader, and when a new leader is elected, it takes over checkpointing as part of lease.Promote.

An advantage of this approach is that leases on which keep-alive is called frequently still write at most two entries to the RAFT consensus log every 5 minutes: only the first keep-alive after a checkpoint must be recorded, and all other keep-alives can be ignored.

Additionally, to prevent this mechanism from degrading system performance, it is designed to be best-effort. There are limits on how many checkpoints can be persisted per second and on how many pending checkpoint operations can be scheduled. If these limits are reached, checkpoints may not be scheduled or written to the RAFT consensus log, which prevents checkpointing from overwhelming the system when large volumes of long-lived leases are granted.
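To make the flow concrete, here is a minimal sketch of the leader-side scheduling described above. This is not the PR's code; the names (checkpointLoop, Checkpoint, Checkpointer) and the use of a plain ticker are illustrative assumptions.

package lease

import "time"

// checkpointInterval mirrors the 5-minute cadence described above.
const checkpointInterval = 5 * time.Minute

type LeaseID int64

// Checkpoint records the remaining TTL of one lease at checkpoint time.
// RemainingTTL=0 means "unset, fall back to the original TTL".
type Checkpoint struct {
    ID           LeaseID
    RemainingTTL int64 // seconds
}

// Checkpointer proposes a batch of checkpoints to the consensus log.
type Checkpointer func(cps []Checkpoint)

// checkpointLoop runs only on the leader; a newly elected leader starts its
// own loop when it is promoted.
func checkpointLoop(stop <-chan struct{}, cp Checkpointer, due func() []Checkpoint) {
    ticker := time.NewTicker(checkpointInterval)
    defer ticker.Stop()
    for {
        select {
        case <-stop:
            return
        case <-ticker.C:
            // Best effort: if rate limits are hit, batches may be dropped
            // rather than allowed to overwhelm the system.
            if cps := due(); len(cps) > 0 {
                cp(cps)
            }
        }
    }
}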

cc @gyuho @wenjiaswe @jingyih

Contributor

@gyuho gyuho left a comment

I will have another look next week as well. Just a quick question from a first pass: if findDueScheduledCheckpoints returns multiple leases, have we thought about batching them all into one raft request?

lease/lessor.go Outdated
@@ -57,6 +70,10 @@ type TxnDelete interface {
// RangeDeleter is a TxnDelete constructor.
type RangeDeleter func() TxnDelete

// Checkpointer permits checkpointing of lease remaining TTLs to the concensus log. Defined here to
Contributor

s/concensus/consensus/?

Contributor Author

Fixed, thanks!

lease/lessor.go Outdated
}

// checkpointScheduledLeases finds all scheduled lease checkpoints that are due and
// submits them to the checkpointer to persist them to the concensus log.
Contributor

s/concensus/consensus/ :)

@jpbetz
Contributor Author

jpbetz commented Jul 16, 2018

@gyuho Batching only briefly crossed my mind, but it's something we should clearly do. I'll add it shortly.

lease/lessor.go Outdated
return l.remainingTTL
} else {
return l.ttl
}
Contributor

no need for else? just return l.ttl, following Go idioms? :)

Contributor Author

Sounds good! This is one of the hardest idioms to unlearn from other languages that do the opposite :)
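For reference, a minimal sketch of the suggested early-return form; the struct fields and method name follow the diff context above but are assumptions, not the merged code.

package lease

// Sketch only: fields and method name are taken from the diff context.
type Lease struct {
    ttl          int64 // original TTL in seconds
    remainingTTL int64 // checkpointed remaining TTL; 0 means unset
}

// getRemainingTTL returns the checkpointed remaining TTL if one is set,
// otherwise the original TTL, with no else branch.
func (l *Lease) getRemainingTTL() int64 {
    if l.remainingTTL > 0 {
        return l.remainingTTL
    }
    return l.ttl
}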

@jpbetz jpbetz force-pushed the persist-lease-deadline branch 3 times, most recently from 9392bab to e463c07 Compare July 17, 2018 05:25
@jpbetz jpbetz changed the title [WIP] lease: Persist remainingTTL to prevent indefinite auto-renewal of long lived leases lease: Persist remainingTTL to prevent indefinite auto-renewal of long lived leases Jul 17, 2018
@jpbetz jpbetz removed the WIP label Jul 17, 2018
@jpbetz
Contributor Author

jpbetz commented Jul 17, 2018

Due to the size of this PR, I'll split it into three commits:

  • .proto change and resulting codegen
  • Lessor config and logging change
  • checkpointing mechanism

@jpbetz
Contributor Author

jpbetz commented Jul 17, 2018

lease/lessor.go Outdated
return cps
}
heap.Pop(&le.leaseCheckpointHeap)
if l, ok := le.leaseMap[lt.id]; ok {
Contributor

we probably need to remove a few indentations here.

if l, ok := ...; !ok {
    continue
}
...

Contributor Author

Sounds good. I'll flatten this down.
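A self-contained sketch of what the flattened loop could look like; the types below (leaseItem, checkpointHeap, lessor, Checkpoint) and the surrounding function body are illustrative stand-ins inferred from the diff context, not the merged etcd code.

package lease

import (
    "container/heap"
    "time"
)

type LeaseID int64

type Lease struct{ remainingTTL int64 }

// leaseItem is a scheduled checkpoint: which lease, and when it is due.
type leaseItem struct {
    id   LeaseID
    time int64 // unix nanos
}

// checkpointHeap is a min-heap of scheduled checkpoints ordered by due time.
type checkpointHeap []leaseItem

func (h checkpointHeap) Len() int            { return len(h) }
func (h checkpointHeap) Less(i, j int) bool  { return h[i].time < h[j].time }
func (h checkpointHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *checkpointHeap) Push(x interface{}) { *h = append(*h, x.(leaseItem)) }
func (h *checkpointHeap) Pop() interface{} {
    old := *h
    n := len(old)
    x := old[n-1]
    *h = old[:n-1]
    return x
}

type Checkpoint struct {
    ID           LeaseID
    RemainingTTL int64
}

type lessor struct {
    leaseCheckpointHeap checkpointHeap
    leaseMap            map[LeaseID]*Lease
}

// findDueScheduledCheckpoints pops all due checkpoints (up to limit), using
// the suggested early continue instead of nesting the happy path.
func (le *lessor) findDueScheduledCheckpoints(limit int, now time.Time) []Checkpoint {
    var cps []Checkpoint
    for le.leaseCheckpointHeap.Len() > 0 && len(cps) < limit {
        lt := le.leaseCheckpointHeap[0]
        if lt.time > now.UnixNano() {
            break // nothing else is due yet
        }
        heap.Pop(&le.leaseCheckpointHeap)
        l, ok := le.leaseMap[lt.id]
        if !ok {
            continue // lease no longer exists; drop its scheduled checkpoint
        }
        cps = append(cps, Checkpoint{ID: lt.id, RemainingTTL: l.remainingTTL})
    }
    return cps
}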

lease/lessor.go Outdated

// Limit the total number of scheduled checkpoints, checkpoint should be best effort and it is
// better to throttle checkpointing than to degrade performance.
maxScheduledCheckpoints = 10000
Contributor

how do we come up with these default values? have you done any benchmark?

would it be helpful if we make the checkpoint api accept multiple leases as a batch?

Contributor Author

@jpbetz jpbetz Jul 17, 2018

how do we come up with these default values? have you done any benchmark?

Not yet, but I need to. I'm betting these numbers can be much higher. I'll do some benchmarking this week. For the scheduling, we just need to keep the heap to some reasonable size, so I might look at typical etcd memory footprints and use that to establish a limit based on the worst-case memory utilization we're able to accept.

would it be helpful if we make the checkpoint api accept multiple leases as a batch?

We just added batching of lease checkpoints yesterday (proto change) per @gyuho's suggestion. Since this is not clear from how the leaseCheckpointRate constant is defined, I'll clear that up with some code changes, maybe by defining a maxLeaseCheckpointBatchSize and using leaseCheckpointRate to define how many batched checkpoint operations can occur per second, which I might set quite low once we have batching.
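Something along these lines, as a sketch: the constant names follow the comment above and the values are the ones the benchmarks further down settle on, but neither is guaranteed to match the merged code exactly.

package lease

const (
    // leaseCheckpointRate caps how many checkpoint batches the leader
    // proposes to the consensus log per second.
    leaseCheckpointRate = 1000

    // maxLeaseCheckpointBatchSize caps how many lease checkpoints are packed
    // into a single consensus log entry.
    maxLeaseCheckpointBatchSize = 1000
)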

@xiang90
Contributor

xiang90 commented Jul 17, 2018

The approach looks good to me. We need to have some benchmarks to show the overhead is acceptable in normal cases.

@jpbetz
Contributor Author

jpbetz commented Jul 17, 2018

The approach looks good to me. We need to have some benchmarks to show the overhead is acceptable in normal cases.

Thanks @xiang90. I'll post a full benchmark shortly.

@codecov-io

codecov-io commented Jul 18, 2018

Codecov Report

Merging #9924 into master will increase coverage by 0.03%.
The diff coverage is 90%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9924      +/-   ##
==========================================
+ Coverage   68.99%   69.03%   +0.03%     
==========================================
  Files         386      386              
  Lines       35792    35891      +99     
==========================================
+ Hits        24695    24776      +81     
- Misses       9296     9300       +4     
- Partials     1801     1815      +14
Impacted Files Coverage Δ
etcdserver/config.go 79.51% <ø> (ø) ⬆️
lease/lease_queue.go 100% <100%> (ø) ⬆️
integration/cluster.go 82.17% <100%> (+0.05%) ⬆️
etcdserver/server.go 73.6% <100%> (+0.05%) ⬆️
clientv3/snapshot/v3_snapshot.go 64.75% <100%> (ø) ⬆️
etcdserver/apply.go 88.87% <75%> (-0.19%) ⬇️
lease/lessor.go 87.62% <90.35%> (+0.83%) ⬆️
client/keys.go 73.86% <0%> (-17.59%) ⬇️
pkg/tlsutil/tlsutil.go 86.2% <0%> (-6.9%) ⬇️
pkg/netutil/netutil.go 63.11% <0%> (-6.56%) ⬇️
... and 20 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3f725e1...d1de41e.

index int
id LeaseID
// Unix nanos timestamp.
time int64
Contributor

@gyuho gyuho Jul 19, 2018

Can we comment this time field? It can be either expiration timestamp or checkpoint timestamp. Took me a while to find how time is used :)

Contributor Author

Looks like the field rename from expiration to time only got me from misleading to unclear. I'll add a comment and see if there is anything else I should do to make this more obvious.

Contributor Author

Added a couple comments to both lease_queue.go and the two places where the time field is used in lessor.go.
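Roughly along these lines; the type name LeaseWithTime and the comment wording below are assumptions for illustration, not the exact text that was merged.

package lease

type LeaseID int64

// LeaseWithTime is an item in the lessor's priority queues.
type LeaseWithTime struct {
    id LeaseID
    // time is a Unix-nanos timestamp whose meaning depends on the queue the
    // item is in: the lease's expiration time in the expiry queue, or the
    // next scheduled checkpoint time in the checkpoint queue.
    time int64
    // index is maintained by container/heap so entries can be updated in place.
    index int
}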

@jpbetz
Contributor Author

jpbetz commented Jul 23, 2018

@xiang90 @gyuho

Ran two benchmarks:

Checkpoint heap size Benchmark

Checked etcd server heap size up to 10,000,000 live leases.

  • With checkpointing: 3.3GB
  • Without checkpointing: 3.3GB

This makes sense given that the heap is a slice of structs that contain only three int64s, so the total memory usage for all the entries is only about 40MB, or just a bit more than 1% of the total memory utilization. I've removed the limit on this heap as it does not seem to be needed.

Checkpoint rate limit Benchmark

Set leases to checkpoint every 1s, created 15k of them, and then checked server performance with benchmark put while the checkpointing is happening concurrently. This was with a 3-member etcd cluster on localhost.

  • Without checkpointing - write latency ~ 0.006ms
  • With checkpointing, maxLeaseCheckpointBatchSize=1 (no batching of checkpoints in RAFT log) - write latency ~0.015ms
  • With checkpointing, maxLeaseCheckpointBatchSize=1000 - write latency ~0.008ms
  • With checkpointing, maxLeaseCheckpointBatchSize=1000, leaseCheckpointRate=10000 - write latency ~0.008ms
  • With checkpointing, maxLeaseCheckpointBatchSize=1000, leaseCheckpointRate=1000 - write latency ~0.006ms

Since 1,000,000 checkpoints per sec seems sufficient, and the limits of maxLeaseCheckpointBatchSize=1000, leaseCheckpointRate=1000 appear to have negligible impact on performance, I've gone with those settings.

@gyuho
Contributor

gyuho commented Jul 23, 2018

With checkpointing, maxLeaseCheckpointBatchSize=1000, leaseCheckpointRate=1000 - write latency ~0.006ms

Results look good to me. Thanks for the benchmarks!

Contributor

@gyuho gyuho left a comment

lgtm /cc @xiang90

@gyuho
Contributor

gyuho commented Jul 23, 2018

@jpbetz Also, can you add this to the CHANGELOG? A separate commit or PR should be fine. Thanks.

@xiang90
Contributor

xiang90 commented Jul 24, 2018

LGTM

@keeplowkey

keeplowkey commented Dec 8, 2021

As a normal user, how do I use the "lease checkpointing" mechanism: by upgrading the etcd server to a certain version, or by setting some specific parameters in the config file? I don't know.
Thanks very much for your help~ @jpbetz @xiang90 @gyuho

@serathius
Member

We have recently found (#13491) that lease checkpointing doesn't work as intended. A fix is planned to be released in v3.5.2. With that release you should be able to enable lease checkpointing by providing the --experimental-enable-lease-checkpoint and --experimental-enable-lease-checkpoint-persist flags.

@keeplowkey

We have recently found (#13491) that lease checkpointing doesn't work as intended. A fix is planned to be released in v3.5.2. With that release you should be able to enable lease checkpointing by providing the --experimental-enable-lease-checkpoint and --experimental-enable-lease-checkpoint-persist flags.

@serathius If I want to enable lease checkpointing, will starting the etcd server with the flag "--experimental-enable-lease-checkpoint true" do the trick? Current etcd version: 3.4.7. Looking forward to your reply.
