
fix: Graceful coredns shutdown #4443

Conversation

technicianted
Contributor

Reason for Change:
This PR adds graceful shutdown configurations to coredns.

Currently, if a coredns pod is deleted (due to eviction, for example), the container shuts down immediately. This has negative side effects:

  1. Some in-flight requests may fail.
  2. A transient race where some requests may still be sent to the shut-down instance until this pod's IP address has been removed from the service.

If either happens, we end up with the dreadful 5s timeout/retry latency.

The PR adds the lameduck configuration to the health plugin so that it delays shutting down the service until 5 seconds after the readiness probes have failed. This guarantees both that in-flight queries complete and that there is enough time for this instance's IP address to be discarded from the service.

Note: Since the original readiness probe parameters were left at defaults, the PR also sets the defaults explicitly to make the timing clearer.
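For reference, a minimal sketch of the two pieces working together (the lameduck value comes from this PR; the probe path and port follow the stock coredns ready plugin convention and are assumptions, not copied from the actual manifest):

    # Corefile: keep serving for 35s after SIGTERM
    health {
        # must cover the readiness failure window (3 x 10s)
        # plus a propagation margin (5s)
        lameduck 35s
    }

    # Deployment readinessProbe: the kubernetes defaults, made explicit
    readinessProbe:
      httpGet:
        path: /ready        # assumed: coredns ready plugin endpoint
        port: 8181          # assumed: default ready plugin port
      periodSeconds: 10     # default, now explicit
      failureThreshold: 3   # default, now explicit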

Credit Where Due:
technicianted

Does this change contain code from or inspired by another project?

  • No
  • Yes

Requirements:

Notes:

The issue can be reproduced by running dig in a loop (100ms delay) and then deleting/evicting a coredns pod.
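Something along these lines reproduces it (a hypothetical repro loop; the server IP and query name are placeholders for the cluster DNS service address and any resolvable name):

    # fire a query every 100ms; evict a coredns pod mid-loop and
    # watch for timeouts / ~5s retry stalls
    while true; do
      dig @10.0.0.10 kubernetes.default.svc.cluster.local +time=1 +tries=1
      sleep 0.1
    done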

@welcome

welcome bot commented Jun 4, 2021

💖 Thanks for opening your first pull request! 💖 We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix. Examples of commit messages with semantic prefixes:

  • fix: change azure disk cachingMode to ReadOnly
  • feat: make maximumLoadBalancerRuleCount configurable
  • docs: add note on AKS Engine and AKS relationship
Make sure to check out the developer guide for guidance on testing your change.

Collaborator

@Michael-Sinz Michael-Sinz left a comment


Thank you so much for this fix. This will have significant positive impacts on clusters built with this.

@Michael-Sinz
Collaborator

@jackfrancis - It will be great to be able to get a point release with this change (0.63.1 at least).

Osama did a great job finding the root cause of this problem and identifying such a simple fix.

health {
    # this should be > readiness probe failure time
    lameduck 35s
}
Member


So the idea here is to fall back to the coredns runtime health check in case the readinessProbe fails to restart things after 3 failures separated by 10 seconds?

Contributor Author


We want to make sure that coredns will not terminate until it is safe for it to do so.

The idea here is that when coredns gets a SIGTERM, lameduck will delay the actual shutdown for 35s. Until then, coredns will continue to service requests. At the same time, the readiness plugin will start failing. We wait 10*3+5 seconds to make sure that this instance is not, and will no longer be, getting any queries.

Member


Ah, thanks for the clarification. Just to be super clear: 3 failures separated by 10 seconds could be as short as just over 20 seconds, and should never be as long as 30 seconds. So if it helps, we could reduce the lameduck config to 30s.

Collaborator


The idea is to delay the shutdown of the pod for 35 seconds but trigger readiness failures right away at the start of the eviction. After the probe failures (3 x 10s), the pod is removed from the service (and thus from the load balancing for the service), so that shutting it down will not cause DNS requests to route to the now-dead pod.

It was technically a race between requests no longer being sent to the pod and the pod no longer running. The old way always ended up with the pod stopping before the requests stopped coming; with this new mechanism the order is inverted - requests always stop being sent to the pod before the pod stops running.

Collaborator


The trick is that there is some time after the health check fails before the kubernetes infrastructure removes the pod from the service, so the extra 5 seconds allows the change to fully propagate through the whole cluster. Remember, some node may start a DNS lookup at the very moment the 30 seconds runs out, and we could end up routing to the bad pod before the route change has been pushed everywhere.
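Putting the numbers together, the intended shutdown timeline looks roughly like this (assuming the probe defaults discussed above):

    t=0s       eviction starts; coredns gets SIGTERM, lameduck begins,
               queries are still served, readiness starts failing
    t≈20-30s   third consecutive probe failure; the pod's IP is removed
               from the service endpoints
    t≈30-35s   ~5s margin for the endpoint change to propagate to every
               node's routing
    t=35s      lameduck expires and coredns exits, with no traffic still
               routed to it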

Member


Makes sense, lgtm, thanks for this improvement!

Member

@jackfrancis jackfrancis left a comment


/lgtm

@acs-bot

acs-bot commented Jun 4, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, Michael-Sinz, technicianted

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis
Member

/azp run pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov

codecov bot commented Jun 5, 2021

Codecov Report

Merging #4443 (e0a8cff) into master (60828c7) will increase coverage by 0.00%.
The diff coverage is 71.89%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master    #4443    +/-   ##
========================================
  Coverage   72.04%   72.05%            
========================================
  Files         141      141            
  Lines       21631    21764   +133     
========================================
+ Hits        15584    15681    +97     
- Misses       5096     5131    +35     
- Partials      951      952     +1     
Impacted Files Coverage Δ
cmd/rotate_certs.go 11.03% <0.00%> (ø)
cmd/upgrade.go 35.92% <0.00%> (ø)
pkg/api/common/versions.go 96.37% <ø> (ø)
pkg/api/types.go 92.85% <ø> (ø)
pkg/api/vlabs/types.go 73.04% <ø> (ø)
pkg/engine/templates_generated.go 43.31% <ø> (ø)
pkg/engine/template_generator.go 66.37% <11.53%> (-2.17%) ⬇️
cmd/get_logs.go 49.57% <30.43%> (-1.79%) ⬇️
pkg/api/addons.go 98.01% <100.00%> (ø)
pkg/api/converterfromapi.go 95.71% <100.00%> (+<0.01%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 75af8c1...e0a8cff. Read the comment docs.

@jackfrancis jackfrancis merged commit 6f730c8 into Azure:master Jun 5, 2021
@welcome

welcome bot commented Jun 5, 2021

Congrats on merging your first pull request! 🎉🎉🎉
