Add ERS canary NoRestartDuration and AutoFail configuration options to complement AutoPause behavior #66
This pull request does not contain a valid label. Please add one of the following labels: bug, enhancement, documentation
Codecov Report
@@ Coverage Diff @@
## master #66 +/- ##
==========================================
- Coverage 36.75% 34.71% -2.04%
==========================================
Files 29 29
Lines 1570 1688 +118
==========================================
+ Hits 577 586 +9
- Misses 919 1028 +109
Partials 74 74
…is exceeded
TODO - CR validation:
- Check that autoFail.enabled = true if canary.noRestartsDuration != nil
- Check that autoFail.maxRestarts > autoPause.maxRestarts
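The two validation checks in this TODO could look roughly like the sketch below. It is only an illustration: the `CanarySpec`, `AutoPauseSpec`, and `AutoFailSpec` types and the `ValidateCanary` and `durationPtr` names are hypothetical simplifications, not the CRD types from this PR, and comparing the thresholds only when both features are enabled is an assumption.

```go
package main

import (
	"errors"
	"time"
)

// Hypothetical, simplified stand-ins for the canary spec types. Field names
// mirror the TODO, not necessarily the real CRD.
type AutoPauseSpec struct {
	Enabled     bool
	MaxRestarts int32
}

type AutoFailSpec struct {
	Enabled     bool
	MaxRestarts int32
}

type CanarySpec struct {
	NoRestartsDuration *time.Duration
	AutoPause          AutoPauseSpec
	AutoFail           AutoFailSpec
}

// durationPtr is a small illustrative helper for building specs.
func durationPtr(d time.Duration) *time.Duration { return &d }

// ValidateCanary applies the two checks listed in the TODO.
func ValidateCanary(c CanarySpec) error {
	// Check 1: noRestartsDuration only makes sense when autoFail is enabled.
	if c.NoRestartsDuration != nil && !c.AutoFail.Enabled {
		return errors.New("canary.noRestartsDuration requires autoFail.enabled = true")
	}
	// Check 2: the fail threshold must be above the pause threshold.
	// Assumption: only compared when both features are enabled.
	if c.AutoFail.Enabled && c.AutoPause.Enabled &&
		c.AutoFail.MaxRestarts <= c.AutoPause.MaxRestarts {
		return errors.New("autoFail.maxRestarts must be greater than autoPause.maxRestarts")
	}
	return nil
}
```

With this sketch, a spec that sets `NoRestartsDuration` while `AutoFail.Enabled` is false is rejected, as is one where both thresholds are equal.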
This pull request contains a valid label.
fb07462 to b5c4838
b5c4838 to a18451c
- Extract `manageCanaryPodRestarts` to be separate from the rest of the pod counting logic.
- Unit tests
- Clarify comment for `MaxRestarts` that it is taken per pod.
- Validate `AutoFail.MaxRestarts`
a18451c to f852e00
result.NewStatus.Desired = desiredPods
result.NewStatus.Ready = readyPods
result.NewStatus.Available = availablePods
result.NewStatus.Current = currentPods
I think we don't update `result.NewStatus.Reason` with the `Paused` or `Failed` info.
So while there is currently no `result.NewStatus.Reason`, there is `result.NewStatus.Status`, which is updated to:
- `canary` for active and paused canaries.
- `canary-failed` for failed canaries.
type ExtendedDaemonSetReplicaSetStatus struct {
	Status                   string `json:"status"`
	Desired                  int32  `json:"desired"`
	Current                  int32  `json:"current"`
	Ready                    int32  `json:"ready"`
	Available                int32  `json:"available"`
	IgnoredUnresponsiveNodes int32  `json:"ignoredUnresponsiveNodes"`
	// Conditions Represents the latest available observations of a DaemonSet's current state.
	// +listType=map
	// +listMapKey=type
	Conditions []ExtendedDaemonSetReplicaSetCondition `json:"conditions,omitempty"`
}
We interpret status as an enumerable field today (as opposed to a freeform description). Example where it is being taken into consideration as such: https://github.com/DataDog/extendeddaemonset/blob/master/controllers/extendeddaemonsetreplicaset/controller.go#L183
really nice PR 👍
I just added one comment
What does this PR do?
Adds new features to handle restarts in the canary phase:
- `NoRestartsDuration` for the canary, preventing the end of the canary phase until there have been no restarts within this time.
- `AutoFail`, behaving similarly to pause but causing the canary to be failed either on `MaxRestarts` or on exceeding `MaxRestartsDuration` for pod restarts.

Additional Notes
`Status.Conditions` are used to record the last restart of an ERS pod and progressively track the difference between initial transition and update times.

A future TODO: record more pod termination statuses as the reason for canary pause or fail statuses. Currently only `CrashLoopBackOff` and `OOMKilled` are captured; the rest are mapped to `Unknown`.

There's quite a bit of refactoring of existing canary management logic to make it more testable with unit tests. Hoping reviewers can critique the changes and advise on possible improvements.
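The reason mapping mentioned in these notes (only `CrashLoopBackOff` and `OOMKilled` captured, everything else recorded as `Unknown`) can be pictured with a minimal sketch. The `RestartReason` type and `MapRestartReason` function are hypothetical names chosen for illustration, not identifiers from this PR.

```go
package main

// RestartReason is a hypothetical enumeration of the restart reasons
// recorded for a canary pause or fail.
type RestartReason string

const (
	ReasonCrashLoopBackOff RestartReason = "CrashLoopBackOff"
	ReasonOOMKilled        RestartReason = "OOMKilled"
	ReasonUnknown          RestartReason = "Unknown"
)

// MapRestartReason maps a raw container state reason string to one of the
// recorded reasons. Anything not explicitly captured falls back to Unknown.
func MapRestartReason(raw string) RestartReason {
	switch raw {
	case "CrashLoopBackOff":
		return ReasonCrashLoopBackOff
	case "OOMKilled":
		return ReasonOOMKilled
	default:
		return ReasonUnknown
	}
}
```

Covering more termination statuses later, as the TODO suggests, would mean adding constants and `case` branches while keeping the `Unknown` fallback.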
Describe your test plan
- For `NoRestartsDuration`: configure a canary with a pod that is causing restarts within the no-restarts duration, and make sure that the canary phase is not ended until this duration elapses.
- For `AutoFail`: configure it with either `MaxRestarts` or `MaxRestartsDuration` and make sure the transition to the `canary-failed` state takes place.
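The behaviour this test plan exercises can be summarized as a small decision function. The sketch below is a simplified model, not the controller code from this PR: `CanaryConfig`, `RestartObservation`, `Decide`, and the action names are hypothetical, and the exact comparisons (strictly greater than `MaxRestarts`, restart span measured from first to last restart) are assumptions.

```go
package main

import "time"

// CanaryAction is a hypothetical outcome of one evaluation of the canary.
type CanaryAction string

const (
	ActionContinue CanaryAction = "continue"      // keep the canary phase running
	ActionEnd      CanaryAction = "end"           // canary phase may end
	ActionFail     CanaryAction = "canary-failed" // canary is failed
)

// CanaryConfig is a hypothetical, flattened view of the relevant settings.
type CanaryConfig struct {
	Duration            time.Duration // nominal canary duration
	NoRestartsDuration  time.Duration // 0 disables the check
	AutoFailEnabled     bool
	MaxRestarts         int32         // taken per pod
	MaxRestartsDuration time.Duration // 0 disables the check
}

// RestartObservation summarizes the restarts seen on canary pods.
type RestartObservation struct {
	Restarted        bool
	MaxPodRestarts   int32         // highest restart count on any single pod
	RestartSpan      time.Duration // time between first and last restart
	SinceLastRestart time.Duration // time elapsed since the last restart
}

// Decide returns the action for a canary that has been running for elapsed.
func Decide(cfg CanaryConfig, obs RestartObservation, elapsed time.Duration) CanaryAction {
	if cfg.AutoFailEnabled && obs.Restarted {
		if obs.MaxPodRestarts > cfg.MaxRestarts {
			return ActionFail
		}
		if cfg.MaxRestartsDuration > 0 && obs.RestartSpan > cfg.MaxRestartsDuration {
			return ActionFail
		}
	}
	if elapsed < cfg.Duration {
		return ActionContinue
	}
	// Hold the canary open until a full restart-free window has passed.
	if cfg.NoRestartsDuration > 0 && obs.Restarted && obs.SinceLastRestart < cfg.NoRestartsDuration {
		return ActionContinue
	}
	return ActionEnd
}
```

Under these assumptions, a canary whose nominal duration has elapsed but whose last restart is more recent than `NoRestartsDuration` keeps running, while exceeding either `MaxRestarts` or `MaxRestartsDuration` yields `canary-failed`.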