Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix StatefulSet#deploy_succeeded? for OnDelete #950

Merged
merged 3 commits into from
Apr 3, 2024

Conversation

JamesOwenHall
Copy link
Contributor

What are you trying to accomplish with this PR?
In #926, a check was added to StatefulSet#deploy_succeeded? that requires all pods belonging to a StatefulSet to have the controller-revision-hash label set to the StatefulSet's updateRevision status. It ensures that all of the StatefulSet's pods have been updated to the StatefulSet's latest revision. This is fine for StatefulSets with updateStrategy: RollingUpdate, but doesn't work when it's OnDelete and we don't want to wait for some other mechanism to roll the pods.

We're wrong to run this check for all StatefulSets.

How is this accomplished?
I've removed the check entirely since the StatefulSet's status should be sufficient to monitor rollout status.

What could go wrong?
...

Comment on lines -25 to -27
@pods.each do |pod|
success &= pod.definition["metadata"]["labels"]["controller-revision-hash"] == status_data['updateRevision']
end
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the check I've removed because it's not needed as far as I can tell. This previously passed in some of our deploys due to kubernetes/kubernetes#106055 not reliably setting .status.currentRevision properly.

I don't believe it should be necessary to look up the pods. The StatefulSet's status should be enough.

@@ -80,7 +81,7 @@ def build_ss_template(status: {}, updateStrategy: "RollingUpdate", rollout: "ful
ss_fixture.dup.deep_merge(
"status" => status,
"spec" => {"updateStrategy" => {"type" => updateStrategy}},
"metadata" => {"annotations" => {"krane.shopify.io/rollout" => rollout}}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This annotation wasn't set correctly.

success &= desired_replicas == status_data['currentReplicas'].to_i

elsif update_strategy == ONDELETE
if update_strategy == ONDELETE && required_rollout != "full"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there known cases of update_strategy == DELETE AND required_rollout == "full"? If those conditions hold, krane will time out unless something external deletes the pods, right? I suppose that's not a problem if that's what's been encoded in the manifest since that would signal that as the desired behaviour 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do know of one such case in kafka-infrastructure, where this script in zonal-deploy-controller is used to delete the pods. I don't know that this change would break anything there, but tagging @Shopify/streaming-foundations for visibility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If those conditions hold, krane will time out unless something external deletes the pods, right?

Yes, that's the desired behaviour as I understand it. You opt in with the annotation set to full to wait until something else rolls the pods.

Copy link
Contributor

@ayana-s ayana-s left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't tophatted (unsure if it's useful to at this point -- if a tophat is necessary, maybe it'd just involve manually deploying a test runtime and deleting/recreating an existing StatefulSet to confirm that changes introduced here aren't breaking changes?), but the code & logic look sound - thanks for fixing this, James!

@JamesOwenHall JamesOwenHall merged commit f1eb262 into main Apr 3, 2024
65 checks passed
@JamesOwenHall JamesOwenHall deleted the fix-statefulset-on-delete branch April 3, 2024 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants