Correctly handle TTLs on update #68699

enj · 2018-09-14T22:37:29Z

This change updates the ETCD3 objState to store the associated lease ID. If the objState has an associated lease, the value is used to look up the remaining TTL. This value is used to populate the ResponseMeta.TTL field which is passed to TTLFuncs as the existing TTL value.

This guarantees that an object with a TTLFunc can reliably use the existing TTL value on update. Furthermore, as an optimization, objects that use the existing TTL on update will reuse the same lease.

If an object with a TTL is updated, it will ignore the suggestion from the watch cache and instead directly pull from ETCD. This could have some performance implications for TTL'd resources that rely on the watch cache as an optimization. No such resource exists in core Kubernetes today.

To understand the underlying issue, one must look at how TTLFunc is defined today:

// TTLFunc returns the TTL (time to live) that objects should be persisted
// with. The existing parameter is the current TTL or the default for this
// operation. The update parameter indicates whether this is an operation
// against an existing object.
//
// Objects that are persisted with a TTL are evicted once the TTL expires.
TTLFunc func(obj runtime.Object, existing uint64, update bool) (uint64, error)

In the ETCD2 code, when update was true, you could simply return existing if you wanted to leave the TTL value unchanged. Getting this same behavior with ETCD3 is more difficult since you have to deal with an extra API (leases) and the watch cache (data from the cache does not have lease/TTL information).

The code as it stands today, with ETCD3, always sets existing to 0 because it does not fetch the lease to retrieve the current TTL value. This change corrects that deficit as best as it can while preserving the watch cache optimization for all non-TTL'd objects.

Signed-off-by: Monis Khan mkhan@redhat.com

/kind bug
@kubernetes/sig-api-machinery-pr-reviews

NONE

k8s-ci-robot · 2018-09-14T22:37:37Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: enj
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: smarterclayton

If they are not already assigned, you can assign the PR to them by writing /assign @smarterclayton in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

pkg/registry/OWNERS
staging/src/k8s.io/apiserver/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

smarterclayton · 2018-09-14T23:10:59Z

We don't use the watch cache for events, so shouldn't be a blocker there.

…

On Fri, Sep 14, 2018 at 6:38 PM k8s-ci-robot ***@***.***> wrote: [APPROVALNOTIFIER] This PR is *NOT APPROVED* This pull-request has been approved by: *enj <#68699#>* To fully approve this pull request, please assign additional approvers. We suggest the following additional approver: *smarterclayton* If they are not already assigned, you can assign the PR to them by writing /assign @smarterclayton in a comment when ready. The full list of commands accepted by this bot can be found here <https://go.k8s.io/bot-commands>. The pull request process is described here <https://git.k8s.io/community/contributors/guide/owners.md#the-code-review-process> Needs approval from an approver in each of these files: - *pkg/registry/OWNERS <https://github.com/kubernetes/kubernetes/blob/master/pkg/registry/OWNERS>* - *staging/src/k8s.io/apiserver/OWNERS <https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/apiserver/OWNERS>* Approvers can indicate their approval by writing /approve in a comment Approvers can cancel approval by writing /approve cancel in a comment — You are receiving this because you are on a team that was mentioned. Reply to this email directly, view it on GitHub <#68699 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABG_p4Qj6TQgC3-v41BjND_ZIFYrNJz8ks5ubC_MgaJpZM4WqCrE> .

smarterclayton · 2018-09-14T23:12:26Z

This is a lot of change.

What's the actual bug this is fixing?

enj · 2018-09-15T01:52:59Z

We don't use the watch cache for events, so shouldn't be a blocker there.

Ah yeah, forgot about that. Will update the description.

This is a lot of change.

Mostly just comments and tests...

What's the actual bug this is fixing?

// TTLFunc returns the TTL (time to live) that objects should be persisted
// with. The existing parameter is the current TTL or the default for this
// operation. The update parameter indicates whether this is an operation
// against an existing object.
//
// Objects that are persisted with a TTL are evicted once the TTL expires.
TTLFunc func(obj runtime.Object, existing uint64, update bool) (uint64, error)

In the ETCD2 code, when update was true, you could simply return existing if you wanted to leave the TTL value unchanged. Getting this same behavior with ETCD3 is more difficult since you have to deal with an extra API (leases) and the watch cache (data from the cache does not have lease/TTL information).

The code as it stands today, with ETCD3, always sets existing to 0 because it does not fetch the lease to retrieve the current TTL value. This change corrects that deficit as best as it can while preserving the watch cache optimization for all non-TTL'd objects.

wenjiaswe · 2018-09-17T20:04:29Z

/assign @jpbetz
/cc @janetkuo @wenjiaswe

k8s-ci-robot · 2018-09-17T20:04:31Z

@wenjiaswe: GitHub didn't allow me to request PR reviews from the following users: wenjiaswe.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/assign @jpbetz
/cc @janetkuo @wenjiaswe

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

This change updates the ETCD3 objState to store the associated lease ID. If the objState has an associated lease, the value is used to look up the remaining TTL. This value is used to populate the ResponseMeta.TTL field which is passed to TTLFuncs as the existing TTL value. This guarantees that an object with a TTLFunc can reliably use the existing TTL value on update. Furthermore, as an optimization, objects that use the existing TTL on update will reuse the same lease. If an object with a TTL is updated, it will ignore the suggestion from the watch cache and instead directly pull from ETCD. This could have some performance implications for TTL'd resources that rely on the watch cache as an optimization. No such resource exists in core Kubernetes today. Signed-off-by: Monis Khan <mkhan@redhat.com>

enj · 2018-09-17T20:23:49Z

Added a test to confirm the updated registry.Store behavior.

enj · 2018-09-18T00:48:23Z

/retest

jpbetz

In the ETCD2 code, when update was true, you could simply return existing if you wanted to leave the TTL value unchanged. Getting this same behavior with ETCD3 is more difficult since you have to deal with an extra API (leases) and the watch cache (data from the cache does not have lease/TTL information).

The code as it stands today, with ETCD3, always sets existing to 0 because it does not fetch the lease to retrieve the current TTL value. This change corrects that deficit as best as it can while preserving the watch cache optimization for all non-TTL'd objects.

@enj, Would you add this to the description? The explanation of how this should work (and does with etcd2) is helpful in understanding why this needs to be fixed.

jpbetz · 2018-09-26T19:31:39Z

pkg/registry/core/event/storage/storage.go

@@ -49,6 +49,9 @@ func NewREST(optsGetter generic.RESTOptionsGetter, ttl uint64) *REST {
 		NewListFunc:   func() runtime.Object { return &api.EventList{} },
 		PredicateFunc: event.MatchEvent,
 		TTLFunc: func(runtime.Object, uint64, bool) (uint64, error) {
+			// this means that on every update, bump the TTL
+			// so an event expires at last modified time + ttl
+			// the last modified timestamp is not stored


Had a bit of trouble groking the last line here. What are we trying to indicate?

For first two lines, Maybe just "Bump the TTL on every update so events expire at last modified time + TTL." ?

For first two lines, Maybe just "Bump the TTL on every update so events expire at last modified time + TTL." ?

👍

Had a bit of trouble groking the last line here. What are we trying to indicate?

I was thinking through how etcd TTLs are Kube API hostile (see bold):

They do not respect Kube API semantics such as finalizers

You cannot generally TTL a resource (requires code in the server). The GC comes close when you have an ownership hierarchy but that does not work when the owner is "time."

TTLs are fuzzy - they expire at some point close to the requested TTL and the leaseManager makes this even more fuzzy. I am also unsure of the semantics of TTLs when etcd goes through leader election changes.

There is no way from the Kube API to determine if something has a lease attached to it and the remaining time

Even when you know the amount a TTL will be bumped based on configuration, the Kube API does not track last modified time so you cannot calculate when the resource will expire

@janetkuo @caesarxuchao Thoughts on above?

@janetkuo has a KEP proposing adding generic TTL-based garbage collection, see kep. It doesn't use etcdl lease. It respects Kube API semantics like finalizers.

Seems like events have a field called lastTimestamp which is kept updated by the event sink code so I used that to make the comment clearer.

jpbetz · 2018-09-26T20:11:40Z

staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go

+
+		// always return a nil TTL when we have no TTL func so that the
+		// storage interface can distinguish between no TTL, 0 TTL, 1+ TTL
+		if e.TTLFunc == nil {


Can we eliminate the copy paste of the e.TTLFunc == nil checks? Maybe default TTL function that we use and does the right thing so that we don't have to special case e.TTLFunc == nil plus modify calculateTTL to return a pointer for TTL?

I will see what I can do, the semantics for TTL on create and update are different since create does not care about nil whereas update does.

jpbetz · 2018-09-26T20:19:57Z

staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go

@@ -582,11 +582,16 @@ func (e *Store) Update(ctx context.Context, name string, objInfo rest.UpdatedObj
 					return nil, nil, err
 				}
 			}
+
+			// always return a nil TTL when we have no TTL func so that the
+			// storage interface can distinguish between no TTL, 0 TTL, 1+ TTL


What's the semantic difference between "no TTL" and "0 TTL"? Might be worth clarifying here as it's not obvious.

Oh, I see storage.Interface GuarenteedUpdate has a listing of the semantics. Maybe reference that?

Yeah the distinction is really important on update, but does not matter for create.

jpbetz · 2018-09-26T21:16:27Z

staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go

+	if st != nil && ttl == st.meta.TTL && st.lease != clientv3.NoLease {
+		return []clientv3.OpOption{clientv3.WithLease(st.lease)}, nil
+	}
+
 	id, err := s.leaseManager.GetLease(ctx, ttl)


Is the difference between s.leaseManager.GetLease(ctx, ttl) and ttl == st.meta.TTL is that the lease manager call would create options with a remaining TTL == to the provided ttl, but ttl == st.meta.TTL will return the lease ID previously used and not extend the remaining TTL? I.e. for st.meta.TTL = 5 where the lease exists and has a remaining TTL of 3 seconds, because 2 seconds have already passed ttlOps(ctx, 5, st) would return a ID of existing TTL, with 3 seconds remaining, but for another objState.meta.TTL = 30 where lease exists, ttlOps(ctx, 5, st) returns a lease with remaining TTL of 5 seconds?

Is the difference between s.leaseManager.GetLease(ctx, ttl) and ttl == st.meta.TTL is that the lease manager call would create options with a remaining TTL == to the provided ttl, but ttl == st.meta.TTL will return the lease ID previously used and not extend the remaining TTL?

Yeah, without the new if statement I added, even when the same TTL is requested, a new lease would get created and attached to the key (and the old lease would be unattached from the key). I assume the point of passing update and existing to the TTLFunc is to allow flows that basically say "do not change the TTL once it is set." I doubt it is a super common flow, but it does not seem wise to keep creating new leases when the old one can be reused.

I.e. for st.meta.TTL = 5 where the lease exists and has a remaining TTL of 3 seconds, because 2 seconds have already passed ttlOps(ctx, 5, st) would return a ID of existing TTL, with 3 seconds remaining

st.meta.TTL is assigned to the remaining TTL value (not the originally requested TTL), meaning in this example it would be 3. The TTLFunc tells you how much time is remaining via the existing parameter.

If it was based on the originally requested TTL, the bump case used by events would not work: you request 30s the first time, and if you request 30s again it would not bump and reuse the lease.

but for another objState.meta.TTL = 30 where lease exists, ttlOps(ctx, 5, st) returns a lease with remaining TTL of 5 seconds?

I believe this is correct.

Is the difference between s.leaseManager.GetLease(ctx, ttl) and ttl == st.meta.TTL is that the lease manager call would create options with a remaining TTL == to the provided ttl, but ttl == st.meta.TTL will return the lease ID previously used and not extend the remaining TTL?

Yeah, without the new if statement I added, even when the same TTL is requested, a new lease would get created and attached to the key (and the old lease would be unattached from the key). I assume the point of passing update and existing to the TTLFunc is to allow flows that basically say "do not change the TTL once it is set." I doubt it is a super common flow, but it does not seem wise to keep creating new leases when the old one can be reused.

The lease manager already ensures that leases are reused when appropriate.

I.e. for st.meta.TTL = 5 where the lease exists and has a remaining TTL of 3 seconds, because 2 seconds have already passed ttlOps(ctx, 5, st) would return a ID of existing TTL, with 3 seconds remaining

st.meta.TTL is assigned to the remaining TTL value (not the originally requested TTL), meaning in this example it would be 3. The TTLFunc tells you how much time is remaining via the existing parameter.

If it was based on the originally requested TTL, the bump case used by events would not work: you request 30s the first time, and if you request 30s again it would not bump and reuse the lease.

but for another objState.meta.TTL = 30 where lease exists, ttlOps(ctx, 5, st) returns a lease with remaining TTL of 5 seconds?

I believe this is correct.

It feels like behavior violates the principal of least surprise. If the goal is to support a "continue with existing lease" option, I'm in favor or making it an more explicit than providing a TTL that matches the meta.TTL. One concrete example of how ttl == st.meta.TTL might be suprising is that if one intends to renew the lease to a desired TTL to X seconds, so they return that value, but meta.TTL just happens to be X so instead they get the remaining time on the existing lease, not a full X seconds.

The lease manager already ensures that leases are reused when appropriate.

It just does simple math based on last lease it created. It cannot correctly reuse leases over time for the same key. It shares leases across keys created at close intervals within some approximation of the requested TTL.

One concrete example of how ttl == st.meta.TTL might be suprising is that if one intends to renew the lease to a desired TTL to X seconds, so they return that value, but meta.TTL just happens to be X so instead they get the remaining time on the existing lease, not a full X seconds.

In general the TTL value is fuzzy with the lease manager in play. Realistically one should expected the TTL to be within +/- 30 seconds of the requested time. You asked for some TTL (from "now") and we honored it approximately.

It just does simple math based on last lease it created. It cannot correctly reuse leases over time for the same key. It shares leases across keys created at close intervals within some approximation of the requested TTL.

We shouldn't need to care what etcd lease is used, only that the "at least as long as X" guarantee is adhered to. The lease manager achieves this and does it in a way that offers some efficiency gains.

In general the TTL value is fuzzy with the lease manager in play. Realistically one should expected the TTL to be within +/- 30 seconds of the requested time. You asked for some TTL (from "now") and we honored it approximately.

Anyone expecting +/- 30 seconds is in trouble, because that's not the guarantee. The guarantee provided by etcd is that the lease will expire no sooner than the TTL, hence that's the guarantee we've got in k8s and should focus on. In etcd leader elections extend out TTL without any theoretical bounds and we recently merged etcd-io/etcd#9924 to try to reduce the probability of this being an issue in practice.

The problem as I see it with this change is that if one returns 5 s as expected, they expect the lease to last at least 5 seconds, but if the st.meta.TTL and the existing lease only has 3 seconds left, then they got 3 seconds instead of the desired 5, which does not honor their intent.

The problem as I see it with this change is that if one returns 5 s as expected, they expect the lease to last at least 5 seconds, but if the st.meta.TTL and the existing lease only has 3 seconds left, then they got 3 seconds instead of the desired 5, which does not honor their intent.

I do not see how this would ever happen. If you asked for 5 and 3 s was remaining, you would get 5. If you said 3, you would get 3 and the lease would be reused.

Anyone expecting +/- 30 seconds is in trouble

I meant that the TTL requested from etcd would be within some range, not that etcd would honor it explicitly within that range.

For now I have left this in (because it is easy to drop the if statement and update the tests to not require lease reuse).

jpbetz · 2018-09-26T21:27:53Z

staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go

@@ -738,28 +747,56 @@ func (s *store) getStateFromObject(obj runtime.Object) (*objState, error) {
 	return state, nil
 }

-func (s *store) updateState(st *objState, userUpdate storage.UpdateFunc) (runtime.Object, uint64, error) {
-	ret, ttlPtr, err := userUpdate(st.obj, *st.meta)
+var errStaleTTL = errors.New("updateState needs current objState for TTL calculation")


Move this to etcd3/errors.go and make it a const?

Is 'stale' the right word here? That imply something has exceeded its shelf life. But the comment suggests the objState is missing needed data for a TTL calculation.

errIncompleteTTL?

Works for me. With a clear comment and error string that feels clear enough to me.

jpbetz · 2018-09-26T21:30:45Z

staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go

 	var ttl uint64
 	if ttlPtr != nil {
+		// if userUpdate asserts a TTL and we are using a cached object, it means that we passed a zero TTL
+		// to userUpdate as the existing object's TTL (because the cache does not know the TTL value).
+		// send a specific error in this case to signal that we need to get the actual object from ETCD


nit: etcd should be lowercase per etcd community guidelines (it's not an acronym).

Oh, that is a fun bit of trivia.

jpbetz · 2018-09-26T21:38:09Z

staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go

 	var ttl uint64
 	if ttlPtr != nil {
+		// if userUpdate asserts a TTL and we are using a cached object, it means that we passed a zero TTL


I'm finding all the TTL states and their semantics difficult to keep track of. I'm concerned that this code is going to be difficult to maintain. Should introduce additional structs with more explicitly named states?

I will write it out on paper and see if I can come up with something cleaner.

jpbetz · 2018-09-26T21:46:23Z

staging/src/k8s.io/apiserver/pkg/registry/generic/registry/store.go

@@ -582,11 +582,16 @@ func (e *Store) Update(ctx context.Context, name string, objInfo rest.UpdatedObj
 					return nil, nil, err
 				}
 			}
+
+			// always return a nil TTL when we have no TTL func so that the
+			// storage interface can distinguish between no TTL, 0 TTL, 1+ TTL


Oh, I see storage.Interface GuarenteedUpdate has a listing of the semantics. Maybe reference that?

enj · 2018-09-27T16:01:01Z

Would you add this to the description? The explanation of how this should work (and does with etcd2) is helpful in understanding why this needs to be fixed.

Added to description, will add it to the git commit once we get everything sorted.

Signed-off-by: Monis Khan <mkhan@redhat.com>

jpbetz · 2018-10-02T05:59:32Z

@enj Previously, you mentioned that this might not actually need the "reuse lease" functionality. Would it be safe to strip that out? It seems to be the main remaining cause of complexity in the logic.

Signed-off-by: Monis Khan <mkhan@redhat.com>

k8s-ci-robot · 2018-10-02T19:17:57Z

@enj: The following tests failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
pull-kubernetes-typecheck	`d0d3b7c`	link	`/test pull-kubernetes-typecheck`
pull-kubernetes-bazel-test	`d0d3b7c`	link	`/test pull-kubernetes-bazel-test`
pull-kubernetes-verify	`d0d3b7c`	link	`/test pull-kubernetes-verify`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

k8s-ci-robot · 2018-10-06T21:26:10Z

@enj: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

fejta-bot · 2019-01-04T21:30:28Z

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.