-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add parameter to configure max restarts for canary pause #51
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request does not contain a valid label. Please add one of the following labels: bug, enhancement, documentation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request contains a valid label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Look good.
Could you add also a documentation about how configure and use the pause
feature:
- configuration example
- kubectl plugin command example
@@ -75,6 +75,7 @@ type ExtendedDaemonSetSpecStrategyRollingUpdate struct { | |||
type ExtendedDaemonSetSpecStrategyCanary struct { | |||
Replicas *intstr.IntOrString `json:"replicas,omitempty"` | |||
Duration *metav1.Duration `json:"duration,omitempty"` | |||
MaxRestarts *intstr.IntOrString `json:"maxRestarts,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if it is necessary to have a *intstr.IntOrString
. I feel like *int
should be enough.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request contains a valid label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request contains a valid label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request contains a valid label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request contains a valid label.
if daemonset.Spec.Strategy.Canary == nil || daemonset.Spec.Strategy.Canary.AutoPause == nil { | ||
autoPauseEnabled = defaultAutoPauseEnabled | ||
maxRestarts = defaultMaxRestarts | ||
} else { | ||
if daemonset.Spec.Strategy.Canary.AutoPause.Enabled == nil { | ||
autoPauseEnabled = defaultAutoPauseEnabled | ||
} else { | ||
autoPauseEnabled = *daemonset.Spec.Strategy.Canary.AutoPause.Enabled | ||
} | ||
if daemonset.Spec.Strategy.Canary.AutoPause.MaxRestarts == nil { | ||
maxRestarts = defaultMaxRestarts | ||
} else { | ||
maxRestarts = int(*daemonset.Spec.Strategy.Canary.AutoPause.MaxRestarts) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please correct me if I'm wrong, but I think at this point the defaulting should be already done via DefaultExtendedDaemonSetSpecStrategyCanaryAutoPause
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be clearer, the question is: why do we default the autopause config twice?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for bringing this up, as I didn't have a complete grasp on this.
It seems we default
the ERS (in extendeddaemonset_default.go) if the minimum required settings are not defined. In this case we could set the default AutoPause settings with DefaultExtendedDaemonSetSpecStrategyCanaryAutoPause()
, but it's not strictly necessary since we also have this defaulting code here in canary.go. If you think it's confusing to include the change in extendeddaemonset_default.go I can remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for clarifying this
tbh I would expect the defaulting to be done in one place (in extendeddaemonset_default.go
), the fact that the same defaulting logic and the same constants (defaultMaxRestarts
defaultAutoPauseEnabled
defaultCanaryAutoPauseEnabled
defaultCanaryAutoPauseMaxRestarts
) are defined in different packages doesn't seem right 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if we do it as @ahmed-mez said in extendeddaemonset_default.go
, the defaulting is not needed in ManageCanaryDeployment
because we always call the default method.
To achieve it, the function IsDefaultedExtendedDaemonSet()
link, should be updated, to return false if the canary configuration is missing default values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested it like this, with only setting the defaults in extendeddaemonset_default.go, and it led to undesired behaviors. I may have done something wrong; I will try again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK got this fixed, and confirmed it works whether autoPause
, autoPause.Enabled
, or autoPause.MaxRestarts
is missing. Sorry for the confusion
@@ -0,0 +1,6 @@ | |||
kind: Cluster |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: we should probably move this under hack
? as examples
should probably contain CR examples only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we propose to use this file in the documentation for testing the examples.
hack
folder is more for contributor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay
Enabled *bool `json:"enabled,omitempty"` | ||
MaxRestarts *int32 `json:"maxRestarts,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could we please document these fields? especially MaxRestarts
, we should make it clear that it corresponds to the max number of restarts per canary pod, not the sum of restarts of all canary pods
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request contains a valid label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request contains a valid label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request contains a valid label.
Codecov Report
@@ Coverage Diff @@
## master #51 +/- ##
=========================================
Coverage ? 36.87%
=========================================
Files ? 29
Lines ? 1589
Branches ? 0
=========================================
Hits ? 586
Misses ? 929
Partials ? 74
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks great! just few comments
- `replicas`: The number of replica pods to participate in the Canary deployment | ||
- `duration`: The duration of the Canary deployment, after which the Canary deployment will end and the active ExtendedReplicaSet will update | ||
- `autoPause.enabled`: Activation of the Canary deployment auto pausing feature (default is `true`) | ||
- `autoPause.maxRestarts`: The maximum number of restarts tolerable before the Canary deployment is automatically paused (default is `2`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's clarify that it corresponds to the max number of restarts per pod
#### Validate Canary deployment | ||
|
||
As an alternative to waiting for the Canary duration to end, the deployment can be manually validated. | ||
|
||
`kubectl-eds canary validate <ExtendedDaemonSet name>` | ||
|
||
#### Pause Canary deployment | ||
|
||
The Canary deployment can be paused to investigate an issue. | ||
|
||
`kubectl-eds canary pause <ExtendedDaemonSet name>` | ||
|
||
#### Unpause Canary deployment | ||
|
||
The Canary deployment can be unpaused, and the Canary duration will continue. | ||
|
||
`kubectl-eds canary unpause <ExtendedDaemonSet name>` | ||
|
||
#### Fail Canary deployment | ||
|
||
The Canary deployment can be manually failed. This command will restore the currently active ExtendedReplicaSet on the Canary pods. | ||
|
||
`kubectl-eds canary fail <ExtendedDaemonSet name>` | ||
|
||
#### Reset Canary deployment | ||
|
||
Following failure of the Canary deployment, the `fail` annotation should be reset with this command. | ||
|
||
`kubectl-eds canary reset <ExtendedDaemonSet name>` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks that's super useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request contains a valid label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request contains a valid label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This pull request contains a valid label.
What does this PR do?
Currently the Canary autopause feature pauses the canary deployment if any of the canary pod containers restart more than 0 times. We have found that this is too sensitive, so this PR makes this a configurable parameter with a default value of 2. The
maxRestarts
parameter means that this is the number of restarts of a single container that is considered tolerable before we autopause.It also adds the ability to disable the feature entirely via the spec.
Motivation
Better user experience
Additional Notes
A subsequent PR with a new parameter to define a stability (no restarts) duration requirement is planned.
Describe your test plan
Use an image such as
bash:latest
in the EDS configuration that causes the container to enter CLB. Increase the canary duration to be sufficiently large to give time for restarts.Check if canary is automatically paused when the container restarts 3 times (default) without adding the
autoPause.maxRestarts
setting to the EDS spec.Then try setting
autoPause.maxRestarts
to a larger number and see if the canary pauses after the corresponding number of restarts (see examples/foo-eds_v2.yaml for an example configuration).Lastly, try setting
autoPause.enabled
tofalse
andautoPause.maxRestarts
to0
and make sure that a restart does not pause the canary.