
fix: Graceful coredns shutdown #4443

Conversation

technicianted
Contributor

Reason for Change:
This PR adds graceful shutdown configurations to coredns.

Currently, if a coredns pod is deleted (due to eviction, for example), the container shuts down immediately. This has negative side effects:

  1. Some in-flight requests may fail.
  2. A transient race where some requests may still be sent to the shut-down instance until this pod's IP address has been removed from the service.

If either happens, we end up with the dreadful 5s timeout/retry latency.

The PR adds the lameduck configuration to the health plugin so that it delays shutting down the service until 5 seconds after the readiness probes have failed. This guarantees both that in-flight queries complete and that there is enough time for this instance's IP address to be discarded from the service.

Note: Since the original readiness probe parameters were left at defaults, the PR also sets the defaults explicitly to make the timing clearer.
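For reference, a minimal sketch of the two pieces working together (the lameduck value comes from this PR; the probe path and port follow the stock coredns ready plugin convention and are assumptions, not copied from the actual manifest):

    # Corefile: keep serving for 35s after SIGTERM
    health {
        # must cover the readiness failure window (3 x 10s)
        # plus a propagation margin (5s)
        lameduck 35s
    }

    # Deployment readinessProbe: the kubernetes defaults, made explicit
    readinessProbe:
      httpGet:
        path: /ready        # assumed: coredns ready plugin endpoint
        port: 8181          # assumed: default ready plugin port
      periodSeconds: 10     # default, now explicit
      failureThreshold: 3   # default, now explicit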

Credit Where Due:
technicianted

Does this change contain code from or inspired by another project?

  • No
  • Yes

Requirements:

Notes:

The issue can be reproduced by running dig in a loop (100ms delay) and then deleting/evicting a coredns pod.
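Something along these lines reproduces it (a hypothetical repro loop; the server IP and query name are placeholders for the cluster DNS service address and any resolvable name):

    # fire a query every 100ms; evict a coredns pod mid-loop and
    # watch for timeouts / ~5s retry stalls
    while true; do
      dig @10.0.0.10 kubernetes.default.svc.cluster.local +time=1 +tries=1
      sleep 0.1
    done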

@welcome

welcome bot commented Jun 4, 2021

💖 Thanks for opening your first pull request! 💖 We use semantic commit messages to streamline the release process. Before your pull request can be merged, you should make sure your first commit and PR title start with a semantic prefix. Examples of commit messages with semantic prefixes:

  • fix: change azure disk cachingMode to ReadOnly
  • feat: make maximumLoadBalancerRuleCount configurable
  • docs: add note on AKS Engine and AKS relationship
Make sure to check out the developer guide for guidance on testing your change.

Collaborator

@Michael-Sinz Michael-Sinz left a comment


Thank you so much for this fix. This will have significant positive impacts on clusters built with this.

@Michael-Sinz
Collaborator

@jackfrancis - It will be great to be able to get a point release with this change (0.63.1 at least).

Osama did a great job finding the root cause of this problem and identifying such a simple fix.

health {
    # this should be > readiness probe failure time
    lameduck 35s
}
Member


So the idea here is to fall back to the coredns runtime health check in case the readinessProbe fails to restart things after 3 failures separated by 10 seconds?

Contributor Author


We want to make sure that coredns will not terminate until it is safe for it to do so.

The idea here is that when coredns gets a SIGTERM, lameduck will delay the actual shutdown for 35s. Until then, coredns will continue to service requests. At the same time, the readiness plugin will start failing. We wait 10*3+5 seconds to make sure that this instance is not, and will no longer be, getting any queries.

Member


Ah, thanks for the clarification. Just to be super clear: 3 failures separated by 10 seconds could be as short as just over 20 seconds, and should never be as long as 30 seconds. So if it helps, we could reduce the lameduck config to 30s.

Collaborator


The idea is to delay the shutdown of the pod for 35 seconds but trigger readiness failures right away at the start of the eviction. After the probe failures (3 x 10s), the pod is removed from the service (and thus from the load balancing for the service), so that shutting it down will not cause DNS requests to route to the now-dead pod.

It was technically a race between requests no longer being sent to the pod and the pod no longer running. The old way always ended up with the pod stopping before the requests stopped coming; with this new mechanism the order is inverted - requests always stop being sent to the pod before the pod stops running.

Collaborator


The trick is that there is some time after the health check fails before the kubernetes infrastructure removes the pod from the service, so the extra 5 seconds allows the change to fully propagate through the whole cluster. Remember, some node may start a DNS lookup at the very moment the 30 seconds runs out, and we could end up routing to the bad pod before the route change has been pushed everywhere.
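Putting the numbers together, the intended shutdown timeline looks roughly like this (assuming the probe defaults discussed above):

    t=0s       eviction starts; coredns gets SIGTERM, lameduck begins,
               queries are still served, readiness starts failing
    t≈20-30s   third consecutive probe failure; the pod's IP is removed
               from the service endpoints
    t≈30-35s   ~5s margin for the endpoint change to propagate to every
               node's routing
    t=35s      lameduck expires and coredns exits, with no traffic still
               routed to it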

Member


Makes sense, lgtm, thanks for this improvement!

Member

@jackfrancis jackfrancis left a comment


/lgtm

@acs-bot

acs-bot commented Jun 4, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jackfrancis, Michael-Sinz, technicianted

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jackfrancis
Member

/azp run pr-e2e

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@codecov

codecov bot commented Jun 5, 2021

Codecov Report

Merging #4443 (e0a8cff) into master (60828c7) will increase coverage by 0.00%.
The diff coverage is 71.89%.

Impacted file tree graph

@@           Coverage Diff            @@
##           master    #4443    +/-   ##
========================================
  Coverage   72.04%   72.05%            
========================================
  Files         141      141            
  Lines       21631    21764   +133     
========================================
+ Hits        15584    15681    +97     
- Misses       5096     5131    +35     
- Partials      951      952     +1     
Impacted Files Coverage Δ
cmd/rotate_certs.go 11.03% <0.00%> (ø)
cmd/upgrade.go 35.92% <0.00%> (ø)
pkg/api/common/versions.go 96.37% <ø> (ø)
pkg/api/types.go 92.85% <ø> (ø)
pkg/api/vlabs/types.go 73.04% <ø> (ø)
pkg/engine/templates_generated.go 43.31% <ø> (ø)
pkg/engine/template_generator.go 66.37% <11.53%> (-2.17%) ⬇️
cmd/get_logs.go 49.57% <30.43%> (-1.79%) ⬇️
pkg/api/addons.go 98.01% <100.00%> (ø)
pkg/api/converterfromapi.go 95.71% <100.00%> (+<0.01%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 75af8c1...e0a8cff. Read the comment docs.

@jackfrancis jackfrancis merged commit 6f730c8 into Azure:master Jun 5, 2021
@welcome

welcome bot commented Jun 5, 2021

Congrats on merging your first pull request! 🎉🎉🎉
