Return success before fully deployed #208
Conversation
cc: @dwradcliffe @ibawt since we were just chatting about this. It's still a WIP; I'm looking for feedback before finishing it up.
I think it probably needs to be an annotation because each deployment might have different requirements.
I agree that an annotation makes more sense. I'd go with a generic name like
Force-pushed from 51e3c26 to 508c913
@@ -74,7 +81,7 @@ def pretty_timeout_type

  def deploy_timed_out?
    # Do not use the hard timeout if progress deadline is set
    @progress_condition.present? ? deploy_failing_to_progress? : super
    super #@progress_condition.present? ? deploy_failing_to_progress? : super
Are we assuming that deployments always use progressDeadlineSeconds and never the timeout in this file?
@@ -74,7 +73,11 @@ def pretty_timeout_type

  def deploy_timed_out?
    # Do not use the hard timeout if progress deadline is set
    @progress_condition.present? ? deploy_failing_to_progress? : super
    if @definition['spec']['progressDeadlineSeconds'].present? && @progress_condition.present?
Looks like k8s merges in a default, so this will check whether you manually set one. This might be a behavior change.
Yes, it would be a behaviour change. Why do you think that would be better? IMO getting the smarter timeout logic shouldn't depend on whether you liked the default value the server gives you. (Interestingly, we originally intended to look at @definition but made a mistake... and I ended up thinking it was a happy accident 😄)
I don't have strong feelings either way. Though the behavior being different on first vs subsequent deploys seems less than ideal, and makes me wonder if there are cases the tests aren't capturing.
My bigger issue is that without changing this behavior, I have no idea how to write a reliable test here.
> the behavior being different on first vs subsequent deploys

What do you mean by this? Why would that be the case?

> My bigger issue is without changing this behavior, I have no idea how to write a reliable test here.

Maybe I'm missing the obvious, but why does changing this behaviour help? I agree this is tricky to test reliably... you'd almost want a way to prevent the whole RS from coming up... I'll think about it.
def partial_deployment_success
  minimum_needed = minimum_unavailable_replicas_to_succeeded

  running_rs.size <= 2 &&
I can understand the motivation for this check, but couldn't it make the new feature ineffective for deployments with long-running jobs (like core has) that take a long time to terminate? We actually have a test asserting that terminating pods from older revisions do not impede deploy success. I'm not certain what that state looks like from a RS standpoint, but I'd expect it to be desired = 0 and replicas > 0.
Force-pushed from 6b7bbd8 to b65aa07
ready for 👀 again.
@latest_rs = find_latest_rs(deployment_data)

@rollout_data = { "replicas" => 0 }.merge(deployment_data["status"]
@deployment_data = JSON.parse(raw_json)
Why do we need to make the entire blob available now? I may be missing something, but I don't actually see new information being used outside the sync cycle.
It's going to have to be an instance var if we want to use it in minimum_unavailable_replicas_to_succeeded (or whatever it gets called). Well, that or an instance var that just captures the value.
> Well that or an instance var that just captures the value

Yeah, if it's only one more value that you need, I'd be explicit about capturing that (and potentially resetting it on sync cycles that err) rather than the whole blob.
when 'none'
  0
else
  raise "#{min_unavailable_rollout} is not a valid value for required-rollout"
Might be helpful to include what the valid options are in this message. (see also my comment about this needing to happen earlier)
@@ -123,5 +126,35 @@ def find_latest_rs(deployment_data)
  rs.sync(latest_rs_data)
  rs
end

def min_unavailable_rollout
I'd call this something like required_rollout_type and have it always return a value, i.e.:
def required_rollout_type
strategy = @definition.dig('metadata', 'annotations', 'kubernetes-deploy.shopify.io/required-rollout')
strategy.presence || REQUIRED_ROLLOUT_TYPES.fetch(:full_availability)
end
The validation of the annotation value also needs to happen earlier, before the deploy might have run. I'd suggest overwriting validate_definition and calling super first, then validating required_rollout_type against a list in a constant if that passed.
I like Katrina's suggestion above. I think this function as is requires too much knowledge from users (i.e. calling .blank? on its result).
@@ -42,6 +40,7 @@ def fetch_logs

  def deploy_succeeded?
    return false unless @latest_rs.present?
    return partial_deployment_success unless min_unavailable_rollout.blank?
I think the new concept you're introducing should have first-class visibility here. I.e. I'd move the case statement on required_rollout_type (or whatever you name the method) right in here, and have minimum_unavailable_replicas assume it's being asked in the context of the relevant rollout strategy.
case min_unavailable_rollout
when 'maxUnavailable'
  max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
We need to use the value from sync rather than @definition, possibly with a fallback to @definition (like we do for @progress_deadline), since this value can be server-side defaulted.
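A minimal sketch of the fallback pattern being described, assuming plain hashes for the synced data and the local template (method name illustrative):

```ruby
# Prefer the value observed on sync (which includes server-side defaults),
# falling back to the local template, as done for @progress_deadline.
def max_unavailable(observed, definition)
  observed.dig("spec", "strategy", "rollingUpdate", "maxUnavailable") ||
    definition.dig("spec", "strategy", "rollingUpdate", "maxUnavailable")
end

observed   = { "spec" => { "strategy" => { "rollingUpdate" => { "maxUnavailable" => "25%" } } } }
definition = { "spec" => {} }
max_unavailable(observed, definition) # => "25%"
```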
when 'maxUnavailable'
  max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
  if max_unavailable =~ /%/
    (desired * (100 - max_unavailable.to_i) / 100.0).to_i
This won't match k8s behaviour, because k8s rounds up (well, it rounds the maxUnavailable num down) and Float#to_i simply truncates (i.e. rounds down). E.g. desired = 5, maxUnavailable = 25%: we should return 4 here and we'll return 3.
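To illustrate the rounding difference, here's a standalone sketch (the real method lives on the Deployment class; names are illustrative):

```ruby
# k8s floors the number of unavailable pods, so the minimum available count
# is effectively rounded up. Float#to_i truncates, which understates it.
def min_available(desired, max_unavailable)
  if max_unavailable.to_s.end_with?("%")
    desired - (desired * max_unavailable.to_i / 100.0).floor
  else
    desired - max_unavailable.to_i
  end
end

min_available(5, "25%") # => 4, whereas (5 * 75 / 100.0).to_i would give 3
```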
I know there's not much in the way of infrastructure for this, but we should probably get some unit tests on this logic.
  end
end

def partial_deployment_success
nit: returns a boolean so should end in a ?
@latest_rs.desired_replicas >= minimum_needed &&
  @latest_rs.rollout_data["readyReplicas"].to_i >= minimum_needed &&
  @latest_rs.rollout_data["availableReplicas"].to_i >= minimum_needed
nit: rollout_data seems like a bit of an implementation detail. I'd rather expose ready_replicas and available_replicas as integers, just like desired_replicas.
test/fixtures/slow-cloud/web.yml.erb
Outdated
@@ -0,0 +1,75 @@
apiVersion: extensions/v1beta1
kind: Ingress
I don't think this fixture set really needs an ingress or service (doesn't matter that much since those deploy quickly, but you aren't asserting about them).
Doesn't look like you really use the configmap either.
test/fixtures/slow-cloud/web.yml.erb
Outdated
command:
- sleep
- '4'
image: nginx:alpine
we're trying to use busybox for everything now (with imagePullPolicy: IfNotPresent and command: ["tail", "-f", "/dev/null"]) so that the suite only pulls one tiny image.
test/fixtures/slow-cloud/web.yml.erb
Outdated
spec:
  containers:
  - name: app
    readinessProbe:
Instead of having the probe sleep, would it work to set initialDelaySeconds to 4? I'd be surprised if this probe would ever pass as is, because timeoutSeconds is 1 by default. As a rule I've also been setting periodSeconds to 1 in fixtures, so that we don't get delayed 10s by the probe if something weird happens.
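For reference, the probe shape being suggested would look something like this (values illustrative, not taken from the PR):

```yaml
readinessProbe:
  exec:
    command: ["true"]    # succeed immediately instead of sleeping in the probe
  initialDelaySeconds: 4 # push readiness back 4s without fighting the timeout
  periodSeconds: 1       # poll every second so tests aren't delayed ~10s
  # timeoutSeconds defaults to 1, which is why a probe that sleeps for 4s
  # would normally be expected to fail
```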
> I'd be surprised if this probe would ever pass as is ...

I don't understand this comment. The tests pass, so are you suggesting I'm not testing what I think I am?
Yeah, I'm confused by how this works. It could be the test, or it could just be me. If the probe should time out after 1 second by default (maybe that's not the default you're getting, somehow?), and your probe sleeps for 4, how can that work?
Looks like a k8s bug kubernetes/kubernetes#26895
@@ -632,6 +632,18 @@ def test_can_deploy_deployment_with_zero_replicas
  ])
end

def test_can_deploy_deployment_with_partial_rollout_success
  result = deploy_fixtures("slow-cloud", subset: ["configmap-data.yml", "web.yml.erb"])
Since this test is the purpose of this fixture set, you shouldn't need to deploy a subset.
test/fixtures/slow-cloud/web.yml.erb
Outdated
configMapKeyRef:
  name: hello-cloud-configmap-data
  key: datapoint1
- name: sidecar
This isn't doing anything useful, so I'd remove it.
@@ -632,6 +632,18 @@ def test_can_deploy_deployment_with_zero_replicas
  ])
end

def test_can_deploy_deployment_with_partial_rollout_success
nit: the point is more that a deployment with the maxUnavailable required-rollout annotation does not wait for full availability (and the copy in restart_task_test should talk about restarts).
assert_deploy_success(result)

assert_logs_match_all(
  [%r{Deployment\/web\s+[34] replicas, 3 updatedReplicas, 2 availableReplicas, [12] unavailableReplica}]
could maxSurge: 0 remove the variability here?
I thought it might, but after trying it doesn't, just changes the combinations.
Approach LGTM - I think Katrina already captured a few areas needing attention.
Force-pushed from 7ac23f2 to 6ca7ebe
assert_deploy_success(result)

assert_logs_match_all(
  [%r{Deployment\/web\s+[34] replicas, 3 updatedReplicas, 2 availableReplicas, [12] unavailableReplica}]
This test still isn't that reliable, but I'm skeptical that increasing the sleep or using initialDelaySeconds is going to make it better. The rest of the feedback is addressed.
The reason I think tuning the sleep/delay could help is that the gem only polls for statuses every 3 seconds. So e.g. if we make a scaling operation take, say, 5-7 seconds, we should have a much better chance of succeeding at a consistent point in the scaling, because we'll see every step.
If I'm wrong about that, or it makes the test hella slow, let's use a fancier regex (e.g. Deployment\/web\s+(?<replicas>\d+) replicas?, (?<updatedReplicas>\d+) updatedReplicas?, (?<availableReplicas>\d+) availableReplicas?) to parse the numbers we want out of the logs and make looser assertions (e.g. availableReplicas < replicas, updatedReplicas < replicas). It's shitty, but having multiple flaky tests is worse.
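The looser-assertion idea could be sketched like this, using named captures to pull the counts out of a log line and asserting relationships instead of exact numbers (the log line is a sample, not taken from an actual run):

```ruby
# Parse the replica counts out of a status log line with named captures,
# then assert relationships rather than pinning exact values.
LINE_RE = %r{Deployment/web\s+(?<replicas>\d+) replicas?, (?<updatedReplicas>\d+) updatedReplicas?, (?<availableReplicas>\d+) availableReplicas?}

line = "Deployment/web   4 replicas, 3 updatedReplicas, 2 availableReplicas, 1 unavailableReplica"
counts = LINE_RE.match(line)
counts[:availableReplicas].to_i < counts[:replicas].to_i # => true
counts[:updatedReplicas].to_i <= counts[:replicas].to_i  # => true
```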
Are these tests reliable now?
The last two commits have passed all tests on the first pass, so I'd be comfortable calling them reliable.
I just ran this one 100 times locally, and it failed 3 times:
Deployment/web 3 replicas, 3 updatedReplicas, 3 availableReplicas
Deployment/web 3 replicas, 3 updatedReplicas, 3 availableReplicas
Deployment/web 4 replicas, 2 updatedReplicas, 3 availableReplicas, 1 unavailableReplica
That's not horrible, but considering that there are three of them, they could end up noticeably increasing the fail rate overall. I'm thinking we should make sure we have good unit coverage of the % option and remove that copy of the test. Maybe the restart copy too. @stefanmb wdyt? Do you have any ideas for improving the reliability here?
We could change the restart one to be %-based and take out the % one here. That would save us 1 test w/o any loss of coverage.
@KnVerey We may need to revisit the annotation not taking a number or
Why would you want to declare it successful when you don't have acceptable capacity in the target revision? I'm worried it's because you just want to be able to twist the tool's arm when your deploys aren't rolling as fast as you'd like, not because that result is actually ok. That doesn't sound like the right thing to do, and I don't want to make the wrong thing easy.
@@ -19,6 +20,8 @@ def sync
  conditions = deployment_data.fetch("status", {}).fetch("conditions", [])
  @progress_condition = conditions.find { |condition| condition['type'] == 'Progressing' }
  @progress_deadline = deployment_data['spec']['progressDeadlineSeconds']
  @required_rollout = deployment_data.dig('metadata', 'annotations',
    'kubernetes-deploy.shopify.io/required-rollout')
I'd like to keep all the logic for setting this here, and use constants for the annotation name and default value. E.g.:
if @found
@required_rollout = deployment_data.dig('metadata', 'annotations', REQUIRED_ROLLOUT_ANNOTATION).presence || DEFAULT_REQUIRED_ROLLOUT
else
@required_rollout = @definition.dig('metadata', 'annotations', REQUIRED_ROLLOUT_ANNOTATION).presence || DEFAULT_REQUIRED_ROLLOUT
end
super

unless REQUIRED_ROLLOUT_TYPES.include?(required_rollout_type)
  @validation_error_msg = "#{required_rollout_type} is not valid for required-rollout."\
This needs to accommodate the case where super already put something in this message.
unless REQUIRED_ROLLOUT_TYPES.include?(required_rollout_type)
  @validation_error_msg = "#{required_rollout_type} is not valid for required-rollout."\
    " Acceptable options: #{REQUIRED_ROLLOUT_TYPES}"
nit: join the list with commas
end

def minimum_replicas_to_succeeded
  desired = @desired_replicas
why not use @desired_replicas directly?
def minimum_replicas_to_succeeded
  desired = @desired_replicas

  case required_rollout_type
I still think this method should be something like min_available_replicas and should not contain this case statement. Maybe I'm missing something, but I don't think the else would ever get hit, and IMO the 'none' option should be reflected directly in the case statement in deploy_succeeded? -- that case statement should reflect every valid option, and maybe even raise in its else since it should never be hit. Incidentally, the success condition for the none option should probably be exists?, not >= 0.
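The shape being asked for might look like this. This is a standalone sketch with assumed names and arguments; the real method works against instance state, and 'none' would use the resource's exists? check rather than returning true:

```ruby
require "ostruct"

# Every valid annotation value gets an explicit branch; the else should be
# unreachable because validation rejects unknown values before the deploy runs.
def deploy_succeeded?(required_rollout, latest_rs, desired:, min_available:)
  case required_rollout
  when "full"
    latest_rs.ready_replicas == desired && latest_rs.available_replicas == desired
  when "maxUnavailable"
    latest_rs.ready_replicas >= min_available && latest_rs.available_replicas >= min_available
  when "none"
    true # in the real class this would arguably be exists?
  else
    raise "required-rollout #{required_rollout} should have been rejected by validation"
  end
end

rs = OpenStruct.new(ready_replicas: 4, available_replicas: 4)
deploy_succeeded?("maxUnavailable", rs, desired: 5, min_available: 4) # => true
```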
when 'maxUnavailable'
  max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
  if max_unavailable =~ /%/
    (desired * (100 - max_unavailable.to_i) / 100.0).ceil
Codecov is pointing out that this line is not covered by any test. Same with the none option.
end

def required_rollout
I put this back because validate_definition gets called before sync and I wasn't happy extracting the definition value there.
Wait a second, why don't we just use the value we see in the definition all the time? This annotation isn't something that'd be server-side defaulted.
Force-pushed from d770e9e to f8ba745
We also need to add something in the readme (like https://github.com/Shopify/kubernetes-deploy/pull/232/files#diff-04c6e90faac2675aa89e2176d2eec7d8).
@rollout_data["updatedReplicas"].to_i == @desired_replicas &&
  @rollout_data["updatedReplicas"].to_i == @rollout_data["availableReplicas"].to_i
when 'none'
  true
This case is untested. I think a test using a bad readiness probe + the annotation could work nicely.
I added a unit test for this
unless REQUIRED_ROLLOUT_TYPES.include?(required_rollout)
  @validation_error_msg ||= ''
  @validation_error_msg += "#{required_rollout} is not valid for required-rollout."\
    " Acceptable options: #{REQUIRED_ROLLOUT_TYPES.join(',')}"
Let's use the same message here as you did on L64, where you include the complete annotation name using the constant. We also still need a test for this.
Btw, it looks like this would tack the error messages (when super added one) together without a space. In https://github.com/Shopify/kubernetes-deploy/pull/232/files#diff-d2654750695bbb60cf39d64df9b868eeR72, I'm introducing an errors array, which gets around that problem.
def initialize(namespace:, context:, definition:, logger:, parent: nil, deploy_started_at: nil)
  @parent = parent
  @deploy_started_at = deploy_started_at
  @rollout_data = { "replicas" => 0 }
  @desired_replicas = -1
  @ready_replicas = -1
  @available_replicas = -1
The same thing should technically be done in the reset section at L35 too.
done
@@ -43,10 +46,24 @@ def fetch_logs
  def deploy_succeeded?
I'd really like to start unit testing this now that it's pretty complex. It could reduce the number of possibly-flakey integration tests we'd need (i.e. separate %/# tests would be unnecessary).
Added unit tests
@@ -123,5 +156,19 @@ def find_latest_rs(deployment_data)
  rs.sync(latest_rs_data)
  rs
end

def min_available_replicas
  max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
This one could very well be server-side defaulted. IIRC you originally cached the entire resource blob from the sync to get this, and I asked you to save the specific value instead. Why are we not using it at all now? Am I forgetting something from earlier review passes?
max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
if max_unavailable =~ /%/
  (@desired_replicas * (100 - max_unavailable.to_i) / 100.0).ceil
else
So if max_unavailable is nil, we'll effectively return @desired_replicas. If that's on purpose, I think it might be worth a comment. Because of the possibility of defaulting (see my other comment), we can't really validate this at template parsing time. However, it could be worth having a separate case that prints a warning (just once, not on every polling cycle).

I think the main time this would happen is for deployments with strategy: Recreate, which I hadn't been considering previously. The only time I'm aware of us using them is single-replica deployments with PVCs, but there's nothing preventing others from using them with larger deployments. That would be a compelling use case for having the %/# option on the annotation as you suggested. That said, I still think maxUnavailable is the correct option for rollingUpdate deployments, and %/# options could be added in future PRs instead of now.
I've added a validation that kubernetes-deploy.shopify.io/required-rollout = maxUnavailable isn't in use with strategy: Recreate, since that's a bit nonsensical.
Works for me. The strategy isn't necessarily present in the templates, but the default has been RollingUpdate for many versions, so a validation using the definition should catch all cases in practice.
Force-pushed from ccb8e31 to f65bd72
unless REQUIRED_ROLLOUT_TYPES.include?(required_rollout)
  errors << "#{REQUIRED_ROLLOUT_ANNOTATION}:#{required_rollout} is invalid "\
    "Acceptable options: #{REQUIRED_ROLLOUT_TYPES.join(',')}"
  return false
I don't think we want to return here anymore. Otherwise the errors never make it into the message.
README.md
Outdated
- `kubernetes-deploy.shopify.io/timeout-override`: Override the tool's hard timeout for one specific resource. Both full ISO8601 durations and the time portion of ISO8601 durations are valid. Value must be between 1 second and 24 hours.
  - _Example values_: 45s / 3m / 1h / PT0.25H
  - _Compatibility_: all resource types (Note: `Deployment` timeouts are based on `spec.progressDeadlineSeconds` if present, and that field has a default value as of the `apps/v1beta1` group version. Using this annotation will have no effect on `Deployment`s that time out with "Timeout reason: ProgressDeadlineExceeded".)
- `kubernetes-deploy.shopify.io/required-rollout` (deployment): Modifies the conditions a deployment is considered to be successful
For consistency with the other one (which I updated based on your suggestions on my PR 😄), how about:

`kubernetes-deploy.shopify.io/required-rollout`: Modifies how much of the rollout needs to finish before the deployment is considered successful.
- Compatibility: Deployment
- Valid values:
  - `maxUnavailable`: ...
README.md
Outdated
- `kubernetes-deploy.shopify.io/required-rollout` (deployment): Modifies the conditions a deployment is considered to be successful
  - `maxUnavailable`: Only for `RollingUpdate deploys`. The deploy is successful when there are less
This seems to be backwards ("less than X updated and ready"). Suggestion:
"The deploy is successful when minimum availability is reached in the new ReplicaSet. In other words, the number of new pods that must be ready is equal to spec.replicas
- strategy.RollingUpdate.maxUnavailable
(converted from percentages by rounding up, if applicable). This option is only valid for deployments that use the RollingUpdate strategy."
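The arithmetic in that suggestion can be illustrated with a small helper. This is a hypothetical sketch following the comment's "round up when converting percentages" wording; the gem's actual implementation may differ:

```ruby
# Minimum number of new pods that must be ready: spec.replicas minus
# maxUnavailable, converting a percentage maxUnavailable by rounding up.
def min_ready_replicas(replicas, max_unavailable)
  unavailable =
    if max_unavailable.to_s.end_with?("%")
      (replicas * max_unavailable.to_i / 100.0).ceil
    else
      max_unavailable.to_i
    end
  [replicas - unavailable, 0].max
end

min_ready_replicas(3, 1)      # 3 replicas, 1 may be unavailable
min_ready_replicas(10, "25%") # 10 replicas, ceil(2.5) = 3 may be unavailable
```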
README.md (outdated)

```markdown
- `full`: maintain current behavior
```

Suggestion: "The deploy is successful when all pods in the new ReplicaSet are ready."
README.md
Outdated
- `maxUnavailable`: Only for `RollingUpdate deploys`. The deploy is successful when there are less | ||
than `strategy.RollingUpdate.maxUnavailable` (supports both the % and fixed number) updated & ready replicas. | ||
- `full`: maintain current behavior | ||
- `none`: deploy is successful as soon as the new `replicaSet` is created for the deploy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deploy is successful as soon as the new ReplicaSet is created (no pods required).
```diff
@@ -81,6 +100,24 @@ def exists?
     @found
   end

+  def validate_definition
+    valid = super
```

I don't think we actually need this anymore. We should be able to use `@validation_errors.empty?` as the return value, since that has to contain the errors from `super` as well, right?
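A toy model of that suggested shape (class and method names invented for the example; `base_errors` stands in for the `super` call):

```ruby
# The parent populates the shared @validation_errors list, the subclass
# appends its own checks, and the return value is derived from the list
# instead of threading a `valid` boolean through the method.
class RolloutValidator
  ROLLOUT_TYPES = %w(full none maxUnavailable)

  def initialize(required_rollout)
    @required_rollout = required_rollout
    @validation_errors = []
  end

  def validate_definition
    base_errors # stands in for `super`, which adds the parent's errors
    unless ROLLOUT_TYPES.include?(@required_rollout)
      @validation_errors << "#{@required_rollout} is not a valid required-rollout value"
    end
    @validation_errors.empty?
  end

  private

  def base_errors
    # the parent class would push its own errors onto @validation_errors here
  end
end
```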
```diff
+      valid = false
+    end
+
+    if required_rollout == 'maxUnavailable' && @definition.dig('spec', 'strategy') == 'recreate'
```

Is the strategy case-sensitive? I don't think it is, in which case we should make sure these string comparisons are also case-insensitive.

Reply: Good question, I have no idea. I'll just downcase everything.
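A minimal sketch of the downcasing approach (helper name invented for illustration):

```ruby
# Downcase before comparing so "Recreate", "recreate", and "RECREATE"
# all match, making the strategy check case-insensitive.
def recreate_strategy?(strategy)
  strategy.to_s.downcase == "recreate"
end

recreate_strategy?("Recreate")      # matches despite the capital R
recreate_strategy?("RollingUpdate") # does not match
```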
```ruby
assert_deploy_success(result)

assert_logs_match_all(
  [%r{Deployment\/web\s+[34] replicas, 3 updatedReplicas, 2 availableReplicas, [12] unavailableReplica}]
```

I just ran this one 100 times locally, and it failed 3 times:

```
Deployment/web   3 replicas, 3 updatedReplicas, 3 availableReplicas
Deployment/web   3 replicas, 3 updatedReplicas, 3 availableReplicas
Deployment/web   4 replicas, 2 updatedReplicas, 3 availableReplicas, 1 unavailableReplica
```

That's not horrible, but considering that there are three of them, they could end up noticeably increasing the fail rate overall. I'm thinking we should make sure we have good unit coverage of the `%` option and remove that copy of the test. Maybe the restart copy too. @stefanmb wdyt? Do you have any ideas for improving the reliability here?
Commits:

- add a restart test and unit tests
- Docs are nice

Force-pushed from 71f214d to 8f46ab3
Feedback is in. Let me know if there is anything else you'd like tweaked.

Reply: The new tests already flaked on one of the runs on my branch. Test flakes are already a problem on this repo, so I think it isn't actually ok to merge two new ones that are known to be unreliable. I put some more thought into this, and here is a faster (runs in ~19s instead of ~36s) test that passed 100 times in a row for me locally:

Test:

```ruby
def test_deploy_successful_with_partial_availability
  result = deploy_fixtures("slow-cloud", sha: "deploy1")
  assert_deploy_success(result)

  result = deploy_fixtures("slow-cloud", sha: "deploy2") do |fixtures|
    dep = fixtures["web.yml.erb"]["Deployment"].first
    container = dep["spec"]["template"]["spec"]["containers"].first
    container["readinessProbe"] = {
      "exec" => { "command" => %w(sleep 5) },
      "timeoutSeconds" => 6
    }
  end
  assert_deploy_success(result)

  new_pods = kubeclient.get_pods(namespace: @namespace, label_selector: 'name=web,app=slow-cloud,sha=deploy2')
  assert new_pods.length >= 1, "Expected at least one new pod, saw #{new_pods.length}"

  new_ready_pods = new_pods.select do |pod|
    pod.status.phase == "Running" &&
      pod.status.conditions.any? { |condition| condition["type"] == "Ready" && condition["status"] == "True" }
  end
  assert_equal 1, new_ready_pods.length, "Expected exactly one new pod to be ready, saw #{new_ready_pods.length}"
end
```

Fixture:

```yaml
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: web
  annotations:
    shipit.shopify.io/restart: "true"
    kubernetes-deploy.shopify.io/required-rollout: maxUnavailable
spec:
  replicas: 2
  selector:
    matchLabels:
      name: web
      app: slow-cloud
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  template:
    metadata:
      labels:
        name: web
        app: slow-cloud
        sha: "<%= current_sha %>"
    spec:
      terminationGracePeriodSeconds: 0
      containers:
      - name: app
        image: busybox
        imagePullPolicy: IfNotPresent
        command: ["tail", "-f", "/dev/null"]
        ports:
        - containerPort: 80
          name: http
```

Unless you see a reason why the test isn't valid, let's use this instead (some adjustments will need to be made for the restart version).
```ruby
pods = kubeclient.get_pods(namespace: @namespace, label_selector: 'name=web,app=slow-cloud')
new_pods = pods.select do |pod|
  pod.spec.containers.select { |c| c["name"] == "app" && c.env&.find { |n| n.name == "RESTARTED_AT" } }
```

This will always be truthy, because a `select` with no results is `[]`. So I'd expect that `new_pods` would always include all pods. I'd use `any?` instead.
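The truthiness issue is easy to demonstrate in isolation:

```ruby
# Enumerable#select always returns an array, and an empty array is truthy
# in Ruby, so using its result as a predicate keeps every element.
env_names = []  # imagine a container with no RESTARTED_AT env var

select_result = env_names.select { |n| n == "RESTARTED_AT" }  # empty array
puts(select_result ? "kept" : "dropped")                      # still "kept"

# Enumerable#any? returns a real boolean, so it works as a filter predicate.
puts env_names.any? { |n| n == "RESTARTED_AT" }               # false
```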
```ruby
new_ready_pods = new_pods.select do |pod|
  pod.status.phase == "Running" &&
    pod.status.conditions.any? { |condition| condition["type"] == "Ready" && condition["status"] == "True" } &&
    pod.spec.containers.detect { |c| c["name"] == "app" && c.env&.find { |n| n.name == "RESTARTED_AT" } }
```

You shouldn't need this if L225 filters properly.
README.md

```markdown
- `kubernetes-deploy.shopify.io/timeout-override`: Override the tool's hard timeout for one specific resource. Both full ISO8601 durations and the time portion of ISO8601 durations are valid. Value must be between 1 second and 24 hours.
  - _Example values_: 45s / 3m / 1h / PT0.25H
  - _Compatibility_: all resource types (Note: `Deployment` timeouts are based on `spec.progressDeadlineSeconds` if present, and that field has a default value as of the `apps/v1beta1` group version. Using this annotation will have no effect on `Deployment`s that time out with "Timeout reason: ProgressDeadlineExceeded".)
- `kubernetes-deploy.shopify.io/required-rollout`: Modifies how much of the rollout needs to finish before the deployment is considered successful.
  - _Compatibility_: Deployment
```

We should be consistent about how we reference objects and whether they're in backticks. The new text uses both "ReplicaSet" and "`replicaSet`", and has "deployment" and "Deployment" and "deploy". The existing text above uses "`Deployment`".

Reply: Good eye. I'm going to standardize on "deployment" and "`replicaSet`".
Force-pushed from 6b1a03e to ee5e9cb
Support a new annotation on deployments: `kubernetes-deploy.shopify.io/required-rollout`. It changes the behavior of `deploy_successful?`:

- `full`: maintain current behavior.
- `none`: the deploy is successful as soon as the new replicaSet is created for the deploy.
- `maxUnavailable`: only for RollingUpdate deploys. The deploy is successful once the number of updated & ready replicas reaches `spec.replicas` minus `strategy.RollingUpdate.maxUnavailable` (both percentages and fixed numbers are supported).

This feature is useful when deployments may take an extended period of time to fully roll out. This can happen when the cluster is near capacity.