
Return success before fully deployed #208

Merged: 7 commits into master, Jan 11, 2018

Conversation

@dturn (Contributor) commented Nov 20, 2017

Support a new annotation on deployments: kubernetes-deploy.shopify.io/required-rollout. It changes the behavior of deploy_succeeded?

  • full: maintain current behavior.
  • none: the deploy is successful as soon as the new ReplicaSet is created for the deploy.
  • maxUnavailable: only for RollingUpdate deploys. The deploy is successful once the number of updated & ready replicas is at least spec.replicas minus strategy.rollingUpdate.maxUnavailable (supports both percentages and fixed numbers).

This feature is useful when deployments take an extended period of time to fully roll out, which can happen when the cluster is running near capacity.
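
As a rough sketch of that success criterion (illustrative only; minimum_available_replicas is a hypothetical helper, not the gem's API):

def minimum_available_replicas(desired_replicas, max_unavailable)
  if max_unavailable.to_s.end_with?("%")
    # k8s rounds the maxUnavailable number down when converting a percentage,
    # so the availability requirement effectively rounds up
    (desired_replicas * (100 - max_unavailable.to_i) / 100.0).ceil
  else
    desired_replicas - max_unavailable.to_i
  end
end

minimum_available_replicas(5, "25%") # => 4
minimum_available_replicas(5, 1)     # => 4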

@dturn requested a review from KnVerey on November 20, 2017 21:39
@dturn (Author) commented Nov 21, 2017

cc @dwradcliffe @ibawt, since we were just chatting about this. It's still a WIP; I'm looking for feedback before finishing it up.

@dwradcliffe (Member):

I think it probably needs to be an annotation because each deployment might have different requirements

@KnVerey (Contributor) commented Nov 21, 2017

I agree that an annotation makes more sense. I'd go with a generic name like kubernetes-deploy.shopify.io/required-rollout so we could theoretically expand it to multiple values in the future, e.g. maxUnavailable (this PR), none (could be a more flexible replacement for the --skip-wait feature)

@dturn force-pushed the partial-rollout-success branch 2 times, most recently from 51e3c26 to 508c913, on November 30, 2017 18:09
@@ -74,7 +81,7 @@ def pretty_timeout_type

def deploy_timed_out?
# Do not use the hard timeout if progress deadline is set
@progress_condition.present? ? deploy_failing_to_progress? : super
super #@progress_condition.present? ? deploy_failing_to_progress? : super
Author:

Are we assuming that deployments always use progressDeadlineSeconds and never the timeout in this file?

@@ -74,7 +73,11 @@ def pretty_timeout_type

def deploy_timed_out?
# Do not use the hard timeout if progress deadline is set
@progress_condition.present? ? deploy_failing_to_progress? : super
if @definition['spec']['progressDeadlineSeconds'].present? && @progress_condition.present?
Author:

Looks like k8s merges in a default; this checks whether you manually set one. This might be a behavior change.

Contributor:

Yes, it would be a behaviour change. Why do you think that would be better? IMO getting the smarter timeout logic shouldn't depend on whether you liked the default value the server gives you. (interestingly, we originally intended to look at @definition but made a mistake... and I ended up thinking it was a happy accident 😄 )

Author:

I don't have strong feelings either way, though the behavior being different on first vs. subsequent deploys seems less than ideal and makes me wonder if there are cases the tests aren't capturing.

My bigger issue is that without changing this behavior, I have no idea how to write a reliable test here.

Contributor:

the behavior being different on first vs subsequent deploys

What do you mean by this? Why would that be the case?

My bigger issue is that without changing this behavior, I have no idea how to write a reliable test here.

Maybe I'm missing the obvious, but why does changing this behaviour help? I agree this is tricky to test reliably... you'd almost want a way to prevent the whole RS from coming up... I'll think about it.

def partial_deployment_success
minimum_needed = minimum_unavailable_replicas_to_succeeded

running_rs.size <= 2 &&
Contributor:

I can understand the motivation for this check, but couldn't it make the new feature ineffective for deployments with long-running jobs (like core has) that take a long time to terminate? We actually have a test asserting that terminating pods from older revisions do not impede deploy success. I'm not certain what that state looks like from a RS standpoint, but I'd expect it to be desired = 0 and replicas > 0.

@dturn force-pushed the partial-rollout-success branch 2 times, most recently from 6b7bbd8 to b65aa07, on December 14, 2017 17:23
@dturn (Author) commented Dec 14, 2017

ready for 👀 again.

@latest_rs = find_latest_rs(deployment_data)

@rollout_data = { "replicas" => 0 }.merge(deployment_data["status"]
@deployment_data = JSON.parse(raw_json)
Contributor:

Why do we need to make the entire blob available now? I may be missing something, but I don't actually see new information being used outside the sync cycle.

@dturn (Author), Dec 19, 2017:

It's going to have to be an instance var if we want to use it in minimum_unavailable_replicas_to_succeeded (or whatever it gets called). Well, that or an instance var that just captures the value.

Contributor:

Well that or an instance var that just captures the value

Yeah, if it's only one more value that you need, I'd be explicit about capturing that (and potentially resetting it on sync cycles that err) rather than the whole blob.

when 'none'
0
else
raise "#{min_unavailable_rollout} is not a valid value for required-rollout"
Contributor:

Might be helpful to include what the valid options are in this message. (see also my comment about this needing to happen earlier)

@@ -123,5 +126,35 @@ def find_latest_rs(deployment_data)
rs.sync(latest_rs_data)
rs
end

def min_unavailable_rollout
Contributor:

I'd call this something like required_rollout_type, have it always return a value, i.e.:

def required_rollout_type
  strategy = @definition.dig('metadata', 'annotations', 'kubernetes-deploy.shopify.io/required-rollout')
  strategy.presence || REQUIRED_ROLLOUT_TYPES.fetch(:full_availability)
end

The validation of the annotation value also needs to happen earlier, before the deploy might have run. I'd suggest overwriting validate_definition and calling super first, then validating required_rollout_type against a list in a constant if that passed.
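
A sketch of that validate_definition override, under the suggestion's assumptions (constant contents and message wording here are illustrative, not the merged code):

REQUIRED_ROLLOUT_TYPES = %w(full none maxUnavailable)

def validate_definition
  return false unless super
  return true if REQUIRED_ROLLOUT_TYPES.include?(required_rollout_type)
  @validation_error_msg = "'#{required_rollout_type}' is not a valid value for required-rollout. "\
    "Acceptable options: #{REQUIRED_ROLLOUT_TYPES.join(', ')}"
  false
end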

Contributor:

I like Katrina's suggestion above. I think this function as is requires too much knowledge from users (i.e. calling .blank? on its result).

@@ -42,6 +40,7 @@ def fetch_logs

def deploy_succeeded?
return false unless @latest_rs.present?
return partial_deployment_success unless min_unavailable_rollout.blank?
Contributor:

I think the new concept you're introducing should have first-class visibility here. I.e. I'd move the case statement on required_rollout_type (or whatever you name the method) right in here, and have minimum_unavailable_replicas assume it's being asked in the context of the relevant rollout strategy.


case min_unavailable_rollout
when 'maxUnavailable'
max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
Contributor:

We need to use the value from sync rather than @definition, possibly with a fallback to @definition (like we do for @progress_deadline), since this value can be server-side defaulted.
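
Sketching that (deployment_data being the hash parsed during the sync cycle; the fallback shape mirrors this suggestion rather than the merged code):

synced_value = deployment_data.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
@max_unavailable = synced_value || @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')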

when 'maxUnavailable'
max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
if max_unavailable =~ /%/
(desired * (100 - max_unavailable.to_i) / 100.0).to_i
Contributor:

This won't match k8s behaviour, because k8s rounds up (well, it rounds the maxUnavailable num down) and Float.to_i simply truncates (i.e. rounds down).

e.g. desired = 5, maxUnavailable = 25%, we should return 4 here and we'll return 3
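
A quick check of that difference in Ruby:

desired = 5
max_unavailable = "25%"
(desired * (100 - max_unavailable.to_i) / 100.0).to_i   # => 3 (truncates; too low)
(desired * (100 - max_unavailable.to_i) / 100.0).ceil   # => 4 (matches k8s rounding)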

Contributor:

I know there's not much in the way of infrastructure for this, but we should probably get some unit tests on this logic.

end
end

def partial_deployment_success
Contributor:

nit: returns a boolean so should end in a ?


@latest_rs.desired_replicas >= minimum_needed &&
@latest_rs.rollout_data["readyReplicas"].to_i >= minimum_needed &&
@latest_rs.rollout_data["availableReplicas"].to_i >= minimum_needed
Contributor:

nit: rollout_data seems like a bit of an implementation detail. I'd rather expose ready_replicas and available_replicas as integers, just like desired_replicas.

@@ -0,0 +1,75 @@
apiVersion: extensions/v1beta1
kind: Ingress
Contributor:

I don't think this fixture set really needs an ingress or service (doesn't matter that much since those deploy quickly, but you aren't asserting about them).

Contributor:

Doesn't look like you really use the configmap either.

command:
- sleep
- '4'
image: nginx:alpine
Contributor:

we're trying to use busybox for everything now (with imagePullPolicy: IfNotPresent and command: ["tail", "-f", "/dev/null"]) so that the suite only pulls one tiny image.

spec:
containers:
- name: app
readinessProbe:
Contributor:

Instead of having the probe sleep, would it work to set initialDelaySeconds to 4? I'd be surprised if this probe would ever pass as is, because timeoutSeconds is 1 by default. As a rule I've also been setting periodSeconds to 1 in fixtures, so that we don't get delayed 10s by the probe if something weird happens.

@dturn (Author), Dec 19, 2017:

I'd be surprised if this probe would ever pass as is ...

I don't understand this comment. The tests pass, so are you suggesting I'm not testing what I think I am?

Contributor:

Yeah, I'm confused by how this works. It could be the test, or it could just be me. If the probe should time out after 1 second by default (maybe that's not the default you're getting, somehow?), and your probe sleeps for 4, how can that work?

Author:

Looks like a k8s bug kubernetes/kubernetes#26895

@@ -632,6 +632,18 @@ def test_can_deploy_deployment_with_zero_replicas
])
end

def test_can_deploy_deployment_with_partial_rollout_success
result = deploy_fixtures("slow-cloud", subset: ["configmap-data.yml", "web.yml.erb"])
Contributor:

Since this test is the purpose of this fixture set, you shouldn't need to deploy a subset.

configMapKeyRef:
name: hello-cloud-configmap-data
key: datapoint1
- name: sidecar
Contributor:

This isn't doing anything useful, so I'd remove it.

@@ -632,6 +632,18 @@ def test_can_deploy_deployment_with_zero_replicas
])
end

def test_can_deploy_deployment_with_partial_rollout_success
Contributor:

nit: the point is more that a deployment with the maxUnavailable required-rollout annotation does not wait for full availability (and the copy in restart_task_test should talk about restarts)

assert_deploy_success(result)

assert_logs_match_all(
[%r{Deployment\/web\s+[34] replicas, 3 updatedReplicas, 2 availableReplicas, [12] unavailableReplica}]
Contributor:

could maxSurge: 0 remove the variability here?

Author:

I thought it might, but after trying it, it doesn't; it just changes the combinations.

@stefanmb (Contributor) left a comment:

Approach LGTM - I think Katrina already captured a few areas needing attention.

assert_deploy_success(result)

assert_logs_match_all(
[%r{Deployment\/web\s+[34] replicas, 3 updatedReplicas, 2 availableReplicas, [12] unavailableReplica}]
Author:

This test still isn't that reliable. But I'm skeptical that increasing the sleep or using initialDelaySeconds is going to make it better. The rest of the feedback is addressed.

Contributor:

The reason I think tuning the sleep/delay could help is that the gem only polls for statuses every 3 seconds. So e.g. if we make a scaling operation take, say, 5-7 seconds, we should have a much better chance of succeeding at a consistent point in the scaling, because we'll see every step.

If I'm wrong about that, or it makes the test hella slow, let's use a fancier regex (e.g. Deployment\/web\s+(?<replicas>\d+) replicas?, (?<updatedReplicas>\d+) updatedReplicas?, (?<availableReplicas>\d+) availableReplicas?) to parse the numbers we want out of the logs and make looser assertions (e.g. availableReplicas < replicas, updatedReplicas < replicas). It's shitty, but having multiple flaky tests is worse.
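
For instance, the looser assertion could be sketched like this (logs standing in for the captured deploy output; names are illustrative):

match = logs.match(%r{Deployment/web\s+(?<replicas>\d+) replicas?, (?<updatedReplicas>\d+) updatedReplicas?, (?<availableReplicas>\d+) availableReplicas?})
refute_nil match, "Expected a status line for Deployment/web in the logs"
assert match[:availableReplicas].to_i < match[:replicas].to_i, "Expected partial availability at success time"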

Contributor:

Are these tests reliable now?

Author:

The last two commits have passed all tests on the first pass, so I'd be comfortable calling them reliable.

Contributor:

I just ran this one 100 times locally, and it failed 3 times:

  • Deployment/web 3 replicas, 3 updatedReplicas, 3 availableReplicas
  • Deployment/web 3 replicas, 3 updatedReplicas, 3 availableReplicas
  • Deployment/web 4 replicas, 2 updatedReplicas, 3 availableReplicas, 1 unavailableReplica

That's not horrible, but considering that there are three of them, they could end up noticeably increasing the fail rate overall. I'm thinking we should make sure we have good unit coverage of the % option and remove that copy of the test. Maybe the restart copy too. @stefanmb wdyt? Do you have any ideas for improving the reliability here?

@dturn (Author), Jan 9, 2018:

We could change the restart one to be % based and take out the % one here. That would save us 1 test w/o any loss of coverage.

@KnVerey mentioned this pull request Dec 19, 2017
@dturn (Author) commented Dec 20, 2017

@KnVerey We may need to revisit the annotation not taking a number or %. It seems reasonable that you might not want to lose availability during a deployment, so you'd set maxUnavailable to 0, but still be OK with the deployment 'succeeding' when 75% of the replicas are running the new SHA.

@KnVerey (Contributor) commented Dec 20, 2017

Why would you want to declare it successful when you don't have acceptable capacity in the target revision? I'm worried it's because you just want to be able to twist the tool's arm when your deploys aren't rolling as fast as you'd like, not because that result is actually ok. That doesn't sound like the right thing to do, and I don't want to make the wrong thing easy.

@@ -19,6 +20,8 @@ def sync
conditions = deployment_data.fetch("status", {}).fetch("conditions", [])
@progress_condition = conditions.find { |condition| condition['type'] == 'Progressing' }
@progress_deadline = deployment_data['spec']['progressDeadlineSeconds']
@required_rollout = deployment_data.dig('metadata', 'annotations',
'kubernetes-deploy.shopify.io/required-rollout')
Contributor:

I'd like to keep all the logic for setting this here, and use constants for the annotation name and default value. E.g.:

if @found
  @required_rollout = deployment_data.dig('metadata', 'annotations', REQUIRED_ROLLOUT_ANNOTATION).presence || DEFAULT_REQUIRED_ROLLOUT
else
  @required_rollout = @definition.dig('metadata', 'annotations', REQUIRED_ROLLOUT_ANNOTATION).presence || DEFAULT_REQUIRED_ROLLOUT
end

super

unless REQUIRED_ROLLOUT_TYPES.include?(required_rollout_type)
@validation_error_msg = "#{required_rollout_type} is not valid for required-rollout."\
Contributor:

This needs to accommodate the case where super already put something in this message.


unless REQUIRED_ROLLOUT_TYPES.include?(required_rollout_type)
@validation_error_msg = "#{required_rollout_type} is not valid for required-rollout."\
" Acceptable options: #{REQUIRED_ROLLOUT_TYPES}"
Contributor:

nit: join the list with commas

end

def minimum_replicas_to_succeeded
desired = @desired_replicas
Contributor:

why not use @desired_replicas directly?

def minimum_replicas_to_succeeded
desired = @desired_replicas

case required_rollout_type
Contributor:

I still think this method should be something like min_available_replicas and should not contain this case statement. Maybe I'm missing something, but I don't think the else would ever get hit, and IMO the 'none' option should be reflected directly in the case statement in deploy_succeeded?--that case statement should reflect every valid option, and maybe even raise in its else since it should never be hit. Incidentally the success condition for the none should probably be exists?, not >= 0.
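
A sketch of the shape being suggested for deploy_succeeded? (illustrative; the 'full' condition is borrowed from the diff shown later in this thread and min_available_replicas from later commits, so this is not the merged code verbatim):

def deploy_succeeded?
  return false unless @latest_rs.present?

  case required_rollout
  when 'full'
    @rollout_data["updatedReplicas"].to_i == @desired_replicas &&
      @rollout_data["updatedReplicas"].to_i == @rollout_data["availableReplicas"].to_i
  when 'none'
    # the new ReplicaSet merely has to exist
    @latest_rs.exists?
  when 'maxUnavailable'
    minimum_needed = min_available_replicas
    @latest_rs.desired_replicas >= minimum_needed &&
      @latest_rs.ready_replicas >= minimum_needed &&
      @latest_rs.available_replicas >= minimum_needed
  else
    raise "required-rollout value '#{required_rollout}' should have been rejected by validation"
  end
end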

when 'maxUnavailable'
max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
if max_unavailable =~ /%/
(desired * (100 - max_unavailable.to_i) / 100.0).ceil
Contributor:

Codecov is pointing out that this line is not covered by any test. Same with the none option.

end
end

def required_rollout
Author:

I put this back because validate_definition gets called before sync and I wasn't happy extracting the definition value there.

Contributor:

Wait a second, why don't we just use the value we see in the definition all the time? This annotation isn't something that'd be server-side defaulted.

@KnVerey (Contributor) left a comment:

@rollout_data["updatedReplicas"].to_i == @desired_replicas &&
@rollout_data["updatedReplicas"].to_i == @rollout_data["availableReplicas"].to_i
when 'none'
true
Contributor:

This case is untested. I think a test using a bad readiness probe + the annotation could work nicely.

Author:

I added a unit test for this

unless REQUIRED_ROLLOUT_TYPES.include?(required_rollout)
@validation_error_msg ||= ''
@validation_error_msg += "#{required_rollout} is not valid for required-rollout."\
" Acceptable options: #{REQUIRED_ROLLOUT_TYPES.join(',')}"
Contributor:

Let's use the same message here as you did on L64, where you include the complete annotation name using the constant. We also still need a test for this.

Contributor:

Btw it looks like this would tack the error messages (when super added one) together without a space. In https://github.com/Shopify/kubernetes-deploy/pull/232/files#diff-d2654750695bbb60cf39d64df9b868eeR72, I'm introducing an errors array, which gets around that problem.


def initialize(namespace:, context:, definition:, logger:, parent: nil, deploy_started_at: nil)
@parent = parent
@deploy_started_at = deploy_started_at
@rollout_data = { "replicas" => 0 }
@desired_replicas = -1
@ready_replicas = -1
@available_replicas = -1
Contributor:

The same thing should technically be done in the reset section at L35 too.

Author:

done

@@ -43,10 +46,24 @@ def fetch_logs
def deploy_succeeded?
Contributor:

I'd really like to start unit testing this now that it's pretty complex. It could reduce the number of possibly-flakey integration tests we'd need (i.e. separate %/# tests would be unnecessary).

Author:

Added unit tests

@@ -123,5 +156,19 @@ def find_latest_rs(deployment_data)
rs.sync(latest_rs_data)
rs
end

def min_available_replicas
max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
Contributor:

This one could very well be server-side defaulted. IIRC you originally cached the entire resource blob from the sync to get this, and I asked you to save the specific value instead. Why are we not using it at all now? Am I forgetting something from earlier review passes?

max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
if max_unavailable =~ /%/
(@desired_replicas * (100 - max_unavailable.to_i) / 100.0).ceil
else
Contributor:

So if max_unavailable is nil, we'll effectively return @desired_replicas. If that's on purpose, I think it might be worth a comment. Because of the possibility of defaulting (see my other comment), we can't really validate this at template parsing time. However, it could be worth having a separate case that prints a warning (just once, not on every polling cycle).

I think the main time this would happen is for deployments with strategy: Recreate, which I hadn't been considering previously. The only time I'm aware of us using them is single-replica deployments with PVCs, but there's nothing preventing others from using them with larger deployments. That would be a compelling use case for having the %/# option on the annotation as you suggested. That said, I still think maxUnavailable is the correct option for rollingUpdate deployments, and %/# options could be added in future PRs instead of now.
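
A sketch of the nil case with a one-time warning (the guard variable and message are illustrative, not the merged code; nil.to_i is 0, which is why the original else branch effectively returns @desired_replicas):

def min_available_replicas
  max_unavailable = @definition.dig('spec', 'strategy', 'rollingUpdate', 'maxUnavailable')
  if max_unavailable.nil?
    # No rollingUpdate params (e.g. strategy: Recreate): fall back to full
    # availability, and warn once instead of on every polling cycle.
    unless @warned_about_missing_max_unavailable
      @logger.warn("No maxUnavailable found; requiring full availability")
      @warned_about_missing_max_unavailable = true
    end
    @desired_replicas
  elsif max_unavailable.to_s.end_with?('%')
    (@desired_replicas * (100 - max_unavailable.to_i) / 100.0).ceil
  else
    @desired_replicas - max_unavailable.to_i
  end
end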

Author:

I've added a validation that kubernetes-deploy.shopify.io/required-rollout = maxUnavailable isn't in use with strategy: Recreate, since that's a bit nonsensical.

Contributor:

Works for me. The strategy isn't necessarily present in the templates, but the default has been RollingUpdate for many versions, so a validation using the definition should catch all cases in practice.

unless REQUIRED_ROLLOUT_TYPES.include?(required_rollout)
errors << "#{REQUIRED_ROLLOUT_ANNOTATION}:#{required_rollout} is invalid "\
"Acceptable options: #{REQUIRED_ROLLOUT_TYPES.join(',')}"
return false
Contributor:

I don't think we want to return here anymore. Otherwise the errors never make it into the message.

README.md Outdated
- `kubernetes-deploy.shopify.io/timeout-override`: Override the tool's hard timeout for one specific resource. Both full ISO8601 durations and the time portion of ISO8601 durations are valid. Value must be between 1 second and 24 hours.
- _Example values_: 45s / 3m / 1h / PT0.25H
- _Compatibility_: all resource types (Note: `Deployment` timeouts are based on `spec.progressDeadlineSeconds` if present, and that field has a default value as of the `apps/v1beta1` group version. Using this annotation will have no effect on `Deployment`s that time out with "Timeout reason: ProgressDeadlineExceeded".)
- `kubernetes-deploy.shopify.io/required-rollout` (deployment): Modifies the conditions a deployment is
considered to be successful
Contributor:

For consistency with the other one (which I updated based on your suggestions on my PR 😄 ), how about:

kubernetes-deploy.shopify.io/required-rollout: Modifies how much of the rollout needs to finish before the deployment is considered successful.

  • Compatibility: Deployment
  • Valid values:
    • maxUnavailable: ...

README.md Outdated
- `kubernetes-deploy.shopify.io/timeout-override`: Override the tool's hard timeout for one specific resource. Both full ISO8601 durations and the time portion of ISO8601 durations are valid. Value must be between 1 second and 24 hours.
- _Example values_: 45s / 3m / 1h / PT0.25H
- _Compatibility_: all resource types (Note: `Deployment` timeouts are based on `spec.progressDeadlineSeconds` if present, and that field has a default value as of the `apps/v1beta1` group version. Using this annotation will have no effect on `Deployment`s that time out with "Timeout reason: ProgressDeadlineExceeded".)
- `kubernetes-deploy.shopify.io/required-rollout` (deployment): Modifies the conditions a deployment is
considered to be successful
- `maxUnavailable`: Only for `RollingUpdate deploys`. The deploy is successful when there are less
Contributor:

This seems to be backwards ("less than X updated and ready"). Suggestion:

"The deploy is successful when minimum availability is reached in the new ReplicaSet. In other words, the number of new pods that must be ready is equal to spec.replicas - strategy.RollingUpdate.maxUnavailable (converted from percentages by rounding up, if applicable). This option is only valid for deployments that use the RollingUpdate strategy."

README.md Outdated
considered to be successful
- `maxUnavailable`: Only for `RollingUpdate deploys`. The deploy is successful when there are less
than `strategy.RollingUpdate.maxUnavailable` (supports both the % and fixed number) updated & ready replicas.
- `full`: maintain current behavior
Contributor:

The deploy is successful when all pods in the new ReplicaSet are ready.

README.md Outdated
- `maxUnavailable`: Only for `RollingUpdate deploys`. The deploy is successful when there are less
than `strategy.RollingUpdate.maxUnavailable` (supports both the % and fixed number) updated & ready replicas.
- `full`: maintain current behavior
- `none`: deploy is successful as soon as the new `replicaSet` is created for the deploy
Contributor:

The deploy is successful as soon as the new ReplicaSet is created (no pods required).

@@ -81,6 +100,24 @@ def exists?
@found
end

def validate_definition
valid = super
Contributor:

I don't think we actually need this anymore. We should be able to use @validation_errors.empty? as the return value, since that has to contain the errors from super as well, right?

valid = false
end

if required_rollout == 'maxUnavailable' && @definition.dig('spec', 'strategy') == 'recreate'
Contributor:

Is the strategy case-sensitive? I don't think it is, in which case we should make sure these string comparisons are also case-insensitive.

Author:

Good question, I have no idea. I'll just downcase everything.
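
For reference, a case-insensitive version of that comparison might look like this (note that in the Deployment spec, strategy is a map whose type field carries the name, so the dig path here differs from the diff above):

@definition.dig('spec', 'strategy', 'type').to_s.downcase == 'recreate'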

Commits: "add a restart test and unit tests", "Docs are nice"
@dturn (Author) commented Jan 9, 2018

Feedback is in. Let me know if there is anything else you'd like tweaked.

@KnVerey mentioned this pull request Jan 10, 2018
@KnVerey (Contributor) commented Jan 10, 2018

The new tests already flaked on one of the runs on my branch. Test flakes are already a problem on this repo, so I think it isn't actually ok to merge two new ones that are known to be unreliable. I put some more thought into this, and here is a faster (runs in ~19s instead of ~36s) test that passed 100 times in a row for me locally:

Test:
  def test_deploy_successful_with_partial_availability
      result = deploy_fixtures("slow-cloud", sha: "deploy1")
      assert_deploy_success(result)

      result = deploy_fixtures("slow-cloud", sha: "deploy2") do |fixtures|
        dep = fixtures["web.yml.erb"]["Deployment"].first
        container = dep["spec"]["template"]["spec"]["containers"].first
        container["readinessProbe"] = {
          "exec" => { "command" => %w(sleep 5) },
          "timeoutSeconds" => 6
        }
      end
      assert_deploy_success(result)

      new_pods = kubeclient.get_pods(namespace: @namespace, label_selector: 'name=web,app=slow-cloud,sha=deploy2')
      assert new_pods.length >= 1, "Expected at least one new pod, saw #{new_pods.length}"

      new_ready_pods = new_pods.select do |pod|
        pod.status.phase == "Running" &&
        pod.status.conditions.any? { |condition| condition["type"] == "Ready" && condition["status"] == "True" }
      end
      assert_equal 1, new_ready_pods.length, "Expected exactly one new pod to be ready, saw #{new_ready_pods.length}"
  end
Fixture:
apiVersion: apps/v1beta1
kind: Deployment
metadata:
  name: web
  annotations:
    shipit.shopify.io/restart: "true"
    kubernetes-deploy.shopify.io/required-rollout: maxUnavailable
spec:
  replicas: 2
  selector:
    matchLabels:
      name: web
      app: slow-cloud
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 0
      maxUnavailable: 1
  template:
    metadata:
      labels:
        name: web
        app: slow-cloud
        sha: "<%= current_sha %>"
    spec:
      terminationGracePeriodSeconds: 0
      containers:
      - name: app
        image: busybox
        imagePullPolicy: IfNotPresent
        command: ["tail", "-f", "/dev/null"]
        ports:
        - containerPort: 80
          name: http
Finished in 1859.656208s, 0.0538 runs/s, 0.4302 assertions/s.

100 runs, 800 assertions, 0 failures, 0 errors, 0 skips

Unless you see a reason why the test isn't valid, let's use this instead (some adjustments will need to be made for the restart version).


pods = kubeclient.get_pods(namespace: @namespace, label_selector: 'name=web,app=slow-cloud')
new_pods = pods.select do |pod|
pod.spec.containers.select { |c| c["name"] == "app" && c.env&.find { |n| n.name == "RESTARTED_AT" } }
Contributor:

This will always be truthy, because a select with no results is []. So I'd expect that new_pods would always include all pods. I'd use any? instead.
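
A sketch of the corrected filter using any? (pod, container, and env var names taken from the snippet above):

new_pods = pods.select do |pod|
  pod.spec.containers.any? { |c| c["name"] == "app" && c.env&.find { |n| n.name == "RESTARTED_AT" } }
end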

new_ready_pods = new_pods.select do |pod|
pod.status.phase == "Running" &&
pod.status.conditions.any? { |condition| condition["type"] == "Ready" && condition["status"] == "True" } &&
pod.spec.containers.detect { |c| c["name"] == "app" && c.env&.find { |n| n.name == "RESTARTED_AT" } }
Contributor:

You shouldn't need this if L225 filters properly

- `kubernetes-deploy.shopify.io/timeout-override`: Override the tool's hard timeout for one specific resource. Both full ISO8601 durations and the time portion of ISO8601 durations are valid. Value must be between 1 second and 24 hours.
- _Example values_: 45s / 3m / 1h / PT0.25H
- _Compatibility_: all resource types (Note: `Deployment` timeouts are based on `spec.progressDeadlineSeconds` if present, and that field has a default value as of the `apps/v1beta1` group version. Using this annotation will have no effect on `Deployment`s that time out with "Timeout reason: ProgressDeadlineExceeded".)
- `kubernetes-deploy.shopify.io/required-rollout`: Modifies how much of the rollout needs to finish
before the deployment is considered successful.
- _Compatibility_: Deployment
Contributor:

We should be consistent about how we reference objects and whether they're in backticks. The new uses both "ReplicaSet" and "replicaSet", and has "deployment" and "Deployment" and "deploy". The existing text above uses "Deployment".

Author:

Good eye. I'm going to standardize on deployment and replicaSet.

@dturn merged commit 967864e into master on Jan 11, 2018
@dturn deleted the partial-rollout-success branch on January 11, 2018 22:35