Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ERS canary NoRestartDuration and AutoFail configuration options to complement AutoPause behavior #66

Merged
merged 4 commits into from
Nov 30, 2020

Conversation

xornivore
Copy link
Contributor

@xornivore xornivore commented Nov 20, 2020

What does this PR do?

Adds new features to handle restarts in the canary phase:

  • NoRestartsDuration for canary preventing end of the canary phase until there are no restarts within this time.
  • AutoFail behaving similarly to pause but causing canary to be failed either on MaxRestarts or on exceeding MaxRestartsDuration for pod restarts.

Additional Notes

  • Status.Conditions are used to record the last restart of an ERS pod and progressively track the difference between initial transition and update times.

  • A future TODO - adding more pod termination statuses recorded as reason for canary pause or fail statuses, currently only CrashLoopBackOff and OOMKilled are captured, the rest is mapped to Unknown.

  • There's quite a bit of refactoring of existing canary management logic to make it more testable with unit tests. Hoping reviewers can critique the changes and advise on possible improvements.

Describe your test plan

For NoRestartsDuration - configure a canary with a pod that is causing restarts within no restart duration, make sure that canary phase is not ended until this duration elapses.

For AutoFail - configure it with either MaxRestarts or MaxRestartsDuration and make sure the transition to canary-failed state takes place.

@xornivore xornivore added this to the v0.4 milestone Nov 20, 2020
@xornivore xornivore requested a review from a team November 20, 2020 20:47
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request does not contain a valid label. Please add one of the following labels: bug, enhancement, documentation

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request does not contain a valid label. Please add one of the following labels: bug, enhancement, documentation

@codecov-io
Copy link

codecov-io commented Nov 20, 2020

Codecov Report

Merging #66 (a84ce22) into master (4755dcd) will decrease coverage by 2.03%.
The diff coverage is 11.93%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #66      +/-   ##
==========================================
- Coverage   36.75%   34.71%   -2.04%     
==========================================
  Files          29       29              
  Lines        1570     1688     +118     
==========================================
+ Hits          577      586       +9     
- Misses        919     1028     +109     
  Partials       74       74              
Flag Coverage Δ
unittests 34.71% <11.93%> (-2.04%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...trollers/extendeddaemonsetreplicaset/controller.go 53.81% <0.00%> (-0.99%) ⬇️
...ers/extendeddaemonsetreplicaset/strategy/canary.go 0.00% <0.00%> (ø)
...endeddaemonsetreplicaset/strategy/rollingupdate.go 9.00% <0.00%> (ø)
...llers/extendeddaemonsetreplicaset/strategy/type.go 0.00% <ø> (ø)
...lers/extendeddaemonsetreplicaset/strategy/utils.go 16.85% <20.00%> (-2.38%) ⬇️
pkg/controller/utils/pod/pod.go 15.78% <61.53%> (-0.65%) ⬇️
controllers/extendeddaemonset/utils.go 94.87% <100.00%> (+1.53%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4755dcd...a84ce22. Read the comment docs.

…is exceeded

TODO - CR validation:
- Check that autoFail.enabled = true if canary.noRestartsDuration != nil
- Check that autoFail.maxRestarts > autoPause.maxRestarts
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request does not contain a valid label. Please add one of the following labels: bug, enhancement, documentation

@xornivore xornivore added the enhancement New feature or request label Nov 24, 2020
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

@xornivore xornivore changed the title WIP Add ERS canary NoRestartDuration configuration option Add ERS canary NoRestartDuration and AutoFail configuration options to complement AutoPause behavior Nov 25, 2020
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

@xornivore xornivore marked this pull request as ready for review November 25, 2020 23:29
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

- Extract `manageCanaryPodRestarts` to be separate from the rest of the pod counting logic.
- Unit tests
- Clarify comment for `MaxRestarts` that it is taken per pod.
- Validate `AutoFail.MaxRestarts`
Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pull request contains a valid label.

@clamoriniere clamoriniere modified the milestones: v0.4, v0.5 Nov 27, 2020
result.NewStatus.Desired = desiredPods
result.NewStatus.Ready = readyPods
result.NewStatus.Available = availablePods
result.NewStatus.Current = currentPods

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't update the result.NewStatus.Reason with the info: Paused, Failed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So while there is currently no result.NewStatus.Reason there is result.NewStatus.Status which is updated to:

  • canary for active and paused canaries.
  • canary-failed for failed canaries.
type ExtendedDaemonSetReplicaSetStatus struct {
	Status                   string `json:"status"`
	Desired                  int32  `json:"desired"`
	Current                  int32  `json:"current"`
	Ready                    int32  `json:"ready"`
	Available                int32  `json:"available"`
	IgnoredUnresponsiveNodes int32  `json:"ignoredUnresponsiveNodes"`
	// Conditions Represents the latest available observations of a DaemonSet's current state.
	// +listType=map
	// +listMapKey=type
	Conditions []ExtendedDaemonSetReplicaSetCondition `json:"conditions,omitempty"`
}

We interpret status as an enumerable field today (as opposed to a freeform description). Example where it is being taken into consideration as such: https://github.com/DataDog/extendeddaemonset/blob/master/controllers/extendeddaemonsetreplicaset/controller.go#L183

Copy link
Collaborator

@clamoriniere clamoriniere left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really nice PR 👍
I just added one comment

@xornivore xornivore merged commit 2db920b into master Nov 30, 2020
@xornivore xornivore deleted the xornivore/no-restart-duration branch November 30, 2020 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants