Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Service Load Balancer finalizer support #78262

Merged
merged 4 commits into from Jun 1, 2019

Conversation

MrHohn
Copy link
Member

@MrHohn MrHohn commented May 23, 2019

What type of PR is this?
/kind feature

What this PR does / why we need it:
KEP link: https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/20190423-service-lb-finalizer.md

This PR adds finalizer protection for service LoadBalancers. It defines an alpha feature gate for the finalizer addition part while the removal part is always on.

Note that is is largely based off of #54569 and #65912. While one significant difference is that I haven't removed the cached service layer given the complexity and risks of hosting both pre-finalizer and with-finalizer control flow logic. Keeping the cache layer makes this easier.

End-to-end tests is being added in #78410.

Which issue(s) this PR fixes:

Fixes #53451. Also ref kubernetes/enhancements#980 and kubernetes/cloud-provider#16.

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Finalizer Protection for Service LoadBalancers is now added as Alpha (disabled by default). This feature ensures the Service resource is not fully deleted until the correlating load balancer resources are deleted.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/cloudprovider sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 23, 2019
@k8s-ci-robot k8s-ci-robot added area/test sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels May 23, 2019
@MrHohn MrHohn force-pushed the svc-finalizer-cleanup2 branch 2 times, most recently from 37b0b5a to f8c5e6a Compare May 23, 2019 21:43
@MrHohn
Copy link
Member Author

MrHohn commented May 23, 2019

/test all

@MrHohn MrHohn force-pushed the svc-finalizer-cleanup2 branch 5 times, most recently from b1deee2 to 8cc9c3b Compare May 24, 2019 20:09
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 24, 2019
@MrHohn
Copy link
Member Author

MrHohn commented May 24, 2019

So I'm having trouble with unit testing:

--- FAIL: TestSyncLoadBalancerIfNeeded (0.06s)
    --- FAIL: TestSyncLoadBalancerIfNeeded/service_with_finalizer_that_no_longer_wants_LB (0.00s)
        service_controller_test.go:319: Service hasFinalizer=true, want false
    --- FAIL: TestSyncLoadBalancerIfNeeded/service_that_needs_cleanup (0.00s)
        service_controller_test.go:319: Service hasFinalizer=true, want false
--- FAIL: TestPatchFinalizer (0.00s)
    --- FAIL: TestPatchFinalizer/remove_finalizer (0.00s)
        service_controller_test.go:1136: Service hasFinalizer = true, want false
--- FAIL: TestPatchStatus (0.00s)
    --- FAIL: TestPatchStatus/clear_status (0.00s)
        service_controller_test.go:1214: Got status {[{8.8.8.8 }]}, want &LoadBalancerStatus{Ingress:[],}
FAIL

It seems like the fake kube client doesn't properly support directive delete for patch operation, hence all the test cases that require removing fields failed. As I can confirm the generated patch bytes indeed contain directive:

{"metadata":{"$deleteFromPrimitiveList/finalizers":["service.kubernetes.io/load-balancer-cleanup"],"$setElementOrder/finalizers":["bar"],"finalizers":["bar"]}}

My current plan is fixing that fake kube client.

MrHohn and others added 4 commits May 30, 2019 21:09
- Always try to remove finalizer upon load balancer cleanup
- Add finalizer prior to load balancer creation (feature gated)
- Cache logic fix-ups
- Event type/message fix-ups
- Use runtime.HandleError() on eaten errors

Co-authored-by: Josh Horwitz <horwitzja@gmail.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2019
@MrHohn
Copy link
Member Author

MrHohn commented May 31, 2019

@andrewsykim Sorry, I rebased to fix the merge conflict in kube_features.go. Would be great to have a re-lgtm, thanks!

@andrewsykim
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 31, 2019
@MrHohn
Copy link
Member Author

MrHohn commented May 31, 2019

/retest

@timoreimann
Copy link
Contributor

@MrHohn apologies for commenting on this merged PR with a late thought:

Is there a need to consider the edge case where the type is changed from LoadBalancer, the LB resource is about to be cleaned up, and while that process is still ongoing (possibly delayed/retried due to API connectivity issues or other events), the type is being switched back to LoadBalancer again? Would we then cancel the deletion efforts (or delete and recreate in sequence)?

It didn't become immediately clear to me by looking at the change, but my familiarity with the code is limited.

@bowei
Copy link
Member

bowei commented Jun 12, 2019

While the object is in a finalization state, I would imagine changing the fields in the object should not have any effect. In other words, the user would have to wait for the finalizers to be done processing and the object has been fully removed before recreating the object.

@MrHohn
Copy link
Member Author

MrHohn commented Jun 12, 2019

@timoreimann No worries at all and thanks for giving thoughts on this.

For the case you mentioned (type changed from LoadBalancer then back), there will be two outcomes depends on the timeline.

The first outcome is "Delete and Recreate". It happens if service controller starts processing the type:LoadBalancer->other update before it received the type:other->LoadBalancer update.

The second outcome is "No-op or Reconciling". It happens if service controller hasn't got to process the type:LoadBalancer->other update before the type:other->LoadBalancer update comes. So service controller processes both two updates together. In this case service controller simply calls EnsureLoadBalancer() again. It can either be a no-op if nothing changed, or a reconciliation if there is other valid update.

@MrHohn
Copy link
Member Author

MrHohn commented Jun 12, 2019

While the object is in a finalization state, I would imagine changing the fields in the object should not have any effect.

Note that in this case, the behavior in fact stays the same as before this finalizer support. As the LB deletion action is triggered by a type change update, instead of an object deletion with finalizer.

But yes, for the cases where the object is in a finalization state, update won't have any effect.

@timoreimann
Copy link
Contributor

@MrHohn @bowei thanks for explaining -- happy to hear we're able to cope with the described scenario just fine. 👍

@sylr
Copy link
Contributor

sylr commented Jul 26, 2019

If proven effective could this be backported til 1.13 ?

@MrHohn
Copy link
Member Author

MrHohn commented Jul 26, 2019

@sylr Unfortunately it seems less likely we will backport a feature to 1.13.

@andrewsykim
Copy link
Member

Backporting to v1.13 will also break backwards compatibility for the finalizer in v1.12.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/cloudprovider area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/network Categorizes an issue or PR as relevant to SIG Network. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

service controller can leave orphaned load balancers after services are deleted
8 participants