Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add parameter to configure max restarts for canary pause #51

Merged
merged 11 commits into from
Nov 2, 2020

Conversation

celenechang
Copy link
Contributor

@celenechang celenechang commented Oct 20, 2020

What does this PR do?

Currently the Canary autopause feature pauses the canary deployment if any of the canary pod containers restart more than 0 times. We have found that this is too sensitive, so this PR makes this a configurable parameter with a default value of 2. The maxRestarts parameter means that this is the number of restarts of a single container that is considered tolerable before we autopause.

It also adds the ability to disable the feature entirely via the spec.

Motivation

Better user experience

Additional Notes

A subsequent PR with a new parameter to define a stability (no restarts) duration requirement is planned.

Describe your test plan

Use an image such as bash:latest in the EDS configuration that causes the container to enter CLB. Increase the canary duration to be sufficiently large to give time for restarts.

Check if canary is automatically paused when the container restarts 3 times (default) without adding the autoPause.maxRestarts setting to the EDS spec.

Then try setting autoPause.maxRestarts to a larger number and see if the canary pauses after the corresponding number of restarts (see examples/foo-eds_v2.yaml for an example configuration).

Lastly, try setting autoPause.enabled to false and autoPause.maxRestarts to 0 and make sure that a restart does not pause the canary.

@celenechang celenechang added this to the v0.4 milestone Oct 20, 2020
@celenechang celenechang requested a review from a team as a code owner October 20, 2020 13:46
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request does not contain a valid label. Please add one of the following labels: bug, enhancement, documentation

@celenechang celenechang added the enhancement New feature or request label Oct 20, 2020
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

@celenechang celenechang marked this pull request as draft October 20, 2020 14:17
Copy link
Collaborator

@clamoriniere clamoriniere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good.

Could you add also a documentation about how configure and use the pause feature:

  • configuration example
  • kubectl plugin command example

@@ -75,6 +75,7 @@ type ExtendedDaemonSetSpecStrategyRollingUpdate struct {
type ExtendedDaemonSetSpecStrategyCanary struct {
Replicas *intstr.IntOrString `json:"replicas,omitempty"`
Duration *metav1.Duration `json:"duration,omitempty"`
MaxRestarts *intstr.IntOrString `json:"maxRestarts,omitempty"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it is necessary to have a *intstr.IntOrString . I feel like *int should be enough.

pkg/controller/utils/pod/pod.go Show resolved Hide resolved
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

@celenechang celenechang marked this pull request as ready for review October 27, 2020 03:31
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

Comment on lines 41 to 55
if daemonset.Spec.Strategy.Canary == nil || daemonset.Spec.Strategy.Canary.AutoPause == nil {
autoPauseEnabled = defaultAutoPauseEnabled
maxRestarts = defaultMaxRestarts
} else {
if daemonset.Spec.Strategy.Canary.AutoPause.Enabled == nil {
autoPauseEnabled = defaultAutoPauseEnabled
} else {
autoPauseEnabled = *daemonset.Spec.Strategy.Canary.AutoPause.Enabled
}
if daemonset.Spec.Strategy.Canary.AutoPause.MaxRestarts == nil {
maxRestarts = defaultMaxRestarts
} else {
maxRestarts = int(*daemonset.Spec.Strategy.Canary.AutoPause.MaxRestarts)
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please correct me if I'm wrong, but I think at this point the defaulting should be already done via DefaultExtendedDaemonSetSpecStrategyCanaryAutoPause ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be clearer, the question is: why do we default the autopause config twice?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for bringing this up, as I didn't have a complete grasp on this.

It seems we default the ERS (in extendeddaemonset_default.go) if the minimum required settings are not defined. In this case we could set the default AutoPause settings with DefaultExtendedDaemonSetSpecStrategyCanaryAutoPause(), but it's not strictly necessary since we also have this defaulting code here in canary.go. If you think it's confusing to include the change in extendeddaemonset_default.go I can remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for clarifying this
tbh I would expect the defaulting to be done in one place (in extendeddaemonset_default.go), the fact that the same defaulting logic and the same constants (defaultMaxRestarts defaultAutoPauseEnabled defaultCanaryAutoPauseEnabled defaultCanaryAutoPauseMaxRestarts) are defined in different packages doesn't seem right 🤔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we do it as @ahmed-mez said in extendeddaemonset_default.go, the defaulting is not needed in ManageCanaryDeployment because we always call the default method.
To achieve it, the function IsDefaultedExtendedDaemonSet() link, should be updated, to return false if the canary configuration is missing default values.

Copy link
Contributor Author

@celenechang celenechang Oct 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested it like this, with only setting the defaults in extendeddaemonset_default.go, and it led to undesired behaviors. I may have done something wrong; I will try again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK got this fixed, and confirmed it works whether autoPause, autoPause.Enabled, or autoPause.MaxRestarts is missing. Sorry for the confusion

@@ -0,0 +1,6 @@
kind: Cluster
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: we should probably move this under hack ? as examples should probably contain CR examples only

Copy link
Collaborator

@clamoriniere clamoriniere Oct 28, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we propose to use this file in the documentation for testing the examples.
hack folder is more for contributor.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

Comment on lines 87 to 88
Enabled *bool `json:"enabled,omitempty"`
MaxRestarts *int32 `json:"maxRestarts,omitempty"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we please document these fields? especially MaxRestarts, we should make it clear that it corresponds to the max number of restarts per canary pod, not the sum of restarts of all canary pods

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

@codecov-io
Copy link

codecov-io commented Oct 29, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@d72a1bc). Click here to learn what that means.
The diff coverage is 46.66%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master      #51   +/-   ##
=========================================
  Coverage          ?   36.87%           
=========================================
  Files             ?       29           
  Lines             ?     1589           
  Branches          ?        0           
=========================================
  Hits              ?      586           
  Misses            ?      929           
  Partials          ?       74           
Flag Coverage Δ
unittests 36.87% <46.66%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ers/extendeddaemonsetreplicaset/strategy/canary.go 0.00% <0.00%> (ø)
controllers/extendeddaemonset/controller.go 68.75% <42.85%> (ø)
pkg/controller/utils/pod/pod.go 16.43% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d72a1bc...76d421b. Read the comment docs.

Copy link
Contributor

@ahmed-mez ahmed-mez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great! just few comments

- `replicas`: The number of replica pods to participate in the Canary deployment
- `duration`: The duration of the Canary deployment, after which the Canary deployment will end and the active ExtendedReplicaSet will update
- `autoPause.enabled`: Activation of the Canary deployment auto pausing feature (default is `true`)
- `autoPause.maxRestarts`: The maximum number of restarts tolerable before the Canary deployment is automatically paused (default is `2`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's clarify that it corresponds to the max number of restarts per pod

Comment on lines +274 to +302
#### Validate Canary deployment

As an alternative to waiting for the Canary duration to end, the deployment can be manually validated.

`kubectl-eds canary validate <ExtendedDaemonSet name>`

#### Pause Canary deployment

The Canary deployment can be paused to investigate an issue.

`kubectl-eds canary pause <ExtendedDaemonSet name>`

#### Unpause Canary deployment

The Canary deployment can be unpaused, and the Canary duration will continue.

`kubectl-eds canary unpause <ExtendedDaemonSet name>`

#### Fail Canary deployment

The Canary deployment can be manually failed. This command will restore the currently active ExtendedReplicaSet on the Canary pods.

`kubectl-eds canary fail <ExtendedDaemonSet name>`

#### Reset Canary deployment

Following failure of the Canary deployment, the `fail` annotation should be reset with this command.

`kubectl-eds canary reset <ExtendedDaemonSet name>`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks that's super useful

controllers/extendeddaemonsetreplicaset/strategy/canary.go Outdated Show resolved Hide resolved
pkg/controller/utils/pod/pod.go Show resolved Hide resolved
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

@celenechang celenechang merged commit 738e5f9 into master Nov 2, 2020
@celenechang celenechang deleted the celene/configure_pause_max_restarts branch November 2, 2020 15:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants