-
Notifications
You must be signed in to change notification settings - Fork 114
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use a new exit code for failures due to timeouts #244
Conversation
As far as I can tell every piece of software defines its own exit status numbers:
Given the lack of standardization, I'd prefer not to cargo cult something. Technically this is a breaking change, but not one that makes me thing we need to bump to 1.0 immediately. (a) we're pre 1.0 so anything goes and (b) 2 is still a failing exit code and I can't image people have Tests coming soon. |
lib/kubernetes-deploy/deploy_task.rb
Outdated
ensure | ||
@logger.print_summary(success) | ||
status = success ? "success" : "failed" | ||
::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:#{status}") | ||
success |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ensure blocks only have return values if you explicitly use the return
keyword
@@ -26,18 +26,24 @@ module FixtureDeployHelper | |||
# pod = fixtures["unmanaged-pod.yml.erb"]["Pod"].first | |||
# pod["spec"]["containers"].first["image"] = "hello-world:thisImageIsBad" | |||
# end | |||
def deploy_fixtures(set, subset: nil, **args) # extra args are passed through to deploy_dir_without_profiling | |||
def deploy_fixtures(set, subset: nil, **args, &block) # extra args are passed through to deploy_dir_without_profiling |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want to convert every test that uses deploy_fixtures to expect two return args....
@KnVerey This is ready for 👀 again. I like to test the executables, but I think they need to be re factored to make that easy and that should go into a different PR |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a couple of minor comments, feel free to ignore if you want as I don't have a lot experience with the code base, yet.
exe/kubernetes-deploy
Outdated
verify_result: !skip_wait, | ||
allow_protected_ns: allow_protected_ns, | ||
prune: prune | ||
) | ||
exit 1 unless success | ||
|
||
if error == :timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not:
exit 2 if error == :timeout
exit 1 unless success
lib/kubernetes-deploy/deploy_task.rb
Outdated
@@ -107,6 +107,7 @@ def initialize(namespace:, context:, current_sha:, template_dir:, logger:, kubec | |||
|
|||
def run(verify_result: true, allow_protected_ns: false, prune: true) | |||
start = Time.now.utc | |||
error = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should error be a symbol or do we want to create error classes?
lib/kubernetes-deploy/deploy_task.rb
Outdated
@@ -149,6 +150,7 @@ def run(verify_result: true, allow_protected_ns: false, prune: true) | |||
deploy_resources(resources, prune: prune, verify: true) | |||
::StatsD.measure('normal_resources.duration', StatsD.duration(start_normal_resource), tags: statsd_tags) | |||
success = resources.all?(&:deploy_succeeded?) | |||
error = :timeout if !success && resources.any?(&:deploy_timed_out?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And then here add which resource(s) timed out?
container["image"] = "some-invalid-image:badtag" | ||
end | ||
assert_deploy_failure(result) | ||
refute_equal error, :timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name of the test suggests that the error
is nil
when it didn't time out, but then it doesn't test for it.
In case we want to make this also work in the future (when we maybe support other errors than :timeout
) do we want to make the check assert_equal error, nil
? or change the name of the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you're right, assert_nil error
makes sense here. I also think the test name matches the current behavior so should stay put. If / when we add more conditions we'll need to change what we assert here and the test name.
@@ -100,6 +100,14 @@ def test_restart_not_existing_deployment | |||
in_order: true) | |||
end | |||
|
|||
def test_restart_error_nil_when_not_timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here as above
exe/kubernetes-restart
Outdated
@@ -17,5 +17,10 @@ context = ARGV[1] | |||
logger = KubernetesDeploy::FormattedLogger.build(namespace, context) | |||
|
|||
restart = KubernetesDeploy::RestartTask.new(namespace: namespace, context: context, logger: logger) | |||
success = restart.perform(raw_deployments) | |||
exit 1 unless success | |||
success, error = restart.perform(raw_deployments) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here as above
ensure | ||
@logger.print_summary(success) | ||
status = success ? "success" : "failed" | ||
tags = %W(namespace:#{@namespace} context:#{@context} status:#{status} deployments:#{deployments.to_a.length}}) | ||
::StatsD.measure('restart.duration', StatsD.duration(start), tags: tags) | ||
[success, error] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you mentioned in your comment above ensure
s don't return, so this can be deleted, right?
@@ -26,18 +26,24 @@ module FixtureDeployHelper | |||
# pod = fixtures["unmanaged-pod.yml.erb"]["Pod"].first | |||
# pod["spec"]["containers"].first["image"] = "hello-world:thisImageIsBad" | |||
# end | |||
def deploy_fixtures(set, subset: nil, **args) # extra args are passed through to deploy_dir_without_profiling | |||
def deploy_fixtures(set, subset: nil, **args, &block) # extra args are passed through to deploy_dir_without_profiling | |||
success, _ = deploy_fixtures_with_error(set, subset: subset, **args, &block) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't we make sure that _
is nil
, so:
success, error = deploy_fixtures_with_error...
assert_equal error, nil
success
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out several (20) tests break with this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should change that in the future, as you mentioned. And I am fine with not doing this it in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This helper doesn't make any assumptions about the desired result; most (all?) tests that use deploy_fixtures
should be consuming the return value with either assert_deploy_failure
or assert_deploy_success
(those helpers give you nice failure output when the result is incorrect). The 20 tests that fail are probably the ones where the deploy is expected to fail.
@klautcomputing Thanks for the review. Adding an error class was a good idea. I've made the changes you suggested with the exception of the one that broke 20 tests. I'd be open to refactor that in a separate PR after this, but think its too much churn for 1 pr. |
@@ -26,18 +26,24 @@ module FixtureDeployHelper | |||
# pod = fixtures["unmanaged-pod.yml.erb"]["Pod"].first | |||
# pod["spec"]["containers"].first["image"] = "hello-world:thisImageIsBad" | |||
# end | |||
def deploy_fixtures(set, subset: nil, **args) # extra args are passed through to deploy_dir_without_profiling | |||
def deploy_fixtures(set, subset: nil, **args, &block) # extra args are passed through to deploy_dir_without_profiling | |||
success, _ = deploy_fixtures_with_error(set, subset: subset, **args, &block) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should change that in the future, as you mentioned. And I am fine with not doing this it in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're going to need to find a way to test that we get the verdict we want (in terms of both logs and return value or exception raised) when the result is actually mixed. Relevant existing tests include:
- test_resource_quotas_are_deployed_first (success + timeout)
- test_deployment_with_progress_times_out_for_short_duration (timeout only)
- test_deploy_result_logging_for_mixed_result_deploy (success, failure and timeout)
exe/kubernetes-deploy
Outdated
verify_result: !skip_wait, | ||
allow_protected_ns: allow_protected_ns, | ||
prune: prune | ||
) | ||
|
||
exit 2 if error.is_a?(KubernetesDeploy::DeploymentTimeoutError) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re: which specific code to use, I don't see any indication that there's a standard either. However, 2
appears to be one of very few codes that actually has a reserved meaning, which is "misuse of shell built-ins": http://tldp.org/LDP/abs/html/exitcodes.html. A timeout doesn't mean that the command was called incorrectly, so we should probably choose something else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having a typed exception is a good idea, but it looks a little odd to me to pass an exception around manually in ruby rather than raising it. WDYT about giving DeployTask both run
and run!
(for backwards compatibility, and probably the integration test framework) and using the latter from the executable?
# DeployTask
def run(*args)
run!(*args)
true
rescue FatalDeploymentError
false
end
def run!(verify_result: true, allow_protected_ns: false, prune: true)
# things
success = true
rescue FatalDeploymentError => error
@logger.summary.add_action(error.message)
success = false
raise
ensure
@logger.print_summary(success)
status = success ? "success" : "failed"
::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:#{status}")
end
# exe/kubernetes-deploy
begin
runner.run!(
verify_result: !skip_wait,
allow_protected_ns: allow_protected_ns,
prune: prune
)
rescue DeploymentTimeoutError
exit :not1 # whatever we pick
rescue FatalDeploymentError
exit 1
end
lib/kubernetes-deploy/deploy_task.rb
Outdated
@@ -149,6 +150,9 @@ def run(verify_result: true, allow_protected_ns: false, prune: true) | |||
deploy_resources(resources, prune: prune, verify: true) | |||
::StatsD.measure('normal_resources.duration', StatsD.duration(start_normal_resource), tags: statsd_tags) | |||
success = resources.all?(&:deploy_succeeded?) | |||
if !success && (timedout_resources = resources.select(&:deploy_timed_out?).presence) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it's possible for some resources to time out and others to fail, right? In that case, I think the failure should trump regardless of the relative failure/timeout volumes because the failure means they need to do something.
lib/kubernetes-deploy/deploy_task.rb
Outdated
rescue FatalDeploymentError => error | ||
@logger.summary.add_action(error.message) | ||
success = false | ||
[success, error] | ||
ensure | ||
@logger.print_summary(success) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the summary outputted by this line should also reflect the difference between timeouts and failures. In line with my other suggestion, we could do something like this:
if verify_result
# things
::StatsD.measure('normal_resources.duration', StatsD.duration(start_normal_resource), tags: statsd_tags)
raise FatalDeploymentError if resources.any?(&:deploy_failed?)
raise DeploymentTimeoutError if resources.any?(&:deploy_timed_out?)
else
# things
end
@logger.print_summary(:success)
::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:success")
rescue DeploymentTimeoutError
@logger.print_summary(:timeout)
::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:timeout")
raise
rescue FatalDeploymentError => error
@logger.summary.add_action(error.message) if error.message.present?
@logger.print_summary(:failure)
::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:failed")
raise
end
It's kinda weird to raise something you're about to rescue yourself, but in a way it treats final failures more consistently with earlier failures. Might also be easier to follow? WDYT?
lib/kubernetes-deploy/errors.rb
Outdated
@@ -8,6 +8,15 @@ def initialize(name, context) | |||
super("Namespace `#{name}` not found in context `#{context}`") | |||
end | |||
end | |||
|
|||
class DeploymentTimeoutError < StandardError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should inherit from FatalDeploymentError
, which we've been using as a generic all-is-doomed-now base class for this gem.
@@ -26,18 +26,24 @@ module FixtureDeployHelper | |||
# pod = fixtures["unmanaged-pod.yml.erb"]["Pod"].first | |||
# pod["spec"]["containers"].first["image"] = "hello-world:thisImageIsBad" | |||
# end | |||
def deploy_fixtures(set, subset: nil, **args) # extra args are passed through to deploy_dir_without_profiling | |||
def deploy_fixtures(set, subset: nil, **args, &block) # extra args are passed through to deploy_dir_without_profiling | |||
success, _ = deploy_fixtures_with_error(set, subset: subset, **args, &block) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This helper doesn't make any assumptions about the desired result; most (all?) tests that use deploy_fixtures
should be consuming the return value with either assert_deploy_failure
or assert_deploy_success
(those helpers give you nice failure output when the result is incorrect). The 20 tests that fail are probably the ones where the deploy is expected to fail.
e47af04
to
48a4066
Compare
Changes are in, part of me is worried this has made things worse and not better, but maybe I've just spent too much time staring at this code today. |
exe/kubernetes-deploy
Outdated
prune: prune | ||
) | ||
rescue KubernetesDeploy::DeploymentTimeoutError | ||
exit 124 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we use the code Jean already configured in our Shipit? (70)
lib/kubernetes-deploy/deploy_task.rb
Outdated
success = false | ||
raise | ||
ensure | ||
@logger.print_summary(success) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you miss this comment? #244 (comment)
I'm not married to that specific implementation, but the problem described is still an issue in this version. I realized the other day that getting rid of the ensure
would have the nice side-effect of not printing the result banner in the case of unexpected exceptions that are dumping backtraces.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I misunderstood what you were suggesting since print_summary
takes a bool right now. I'll fix this up to use a symbol.
test/test_helper.rb
Outdated
if ENV["PRINT_LOGS"] | ||
assert_equal false, result, "Deploy succeeded when it was expected to fail" | ||
return | ||
end | ||
|
||
logging_assertion do |logs| | ||
assert_match(/failed due to timeouts./, logs) if cause == :timeout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be asserting on Result: TIMED OUT
(which isn't happening yet in this PR), like L114
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code LGTM now. Just a few comments/questions based on 🎩 .
level = :info | ||
else | ||
heading("Result: ", "FAILURE", :red) | ||
heading("Result: ", status_string, :red) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lib/kubernetes-deploy/errors.rb
Outdated
attr_reader :resources | ||
def initialize(resources) | ||
@resources = resources | ||
super("Resources #{@resources.map(&:name).join(',')} failed due to timeouts.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For consistency with the other messaging (e.g. right now we see Successfully deployed 1 resource, failed to deploy 1 resource, and resources web failed due to timeouts.
), and because this could be a long list, I think this should be a count, not a list.
Since we're trying to clarify that timeouts aren't exactly failures, might it be more helpful to not use "failed" in this message? e.g. "gave up watching X resources", "timed out waiting for X resources", "reached the watch timeout for X resources", I dunno.
@logger.summary.add_action(error.message) | ||
success = false | ||
ensure |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By getting rid of the ensure
, this PR is implementing solution 2 for #137.
27074e0
to
1d900af
Compare
level = :info | ||
elsif status == :timed_out | ||
heading("Result: ", status_string, :yellow) | ||
level = :warn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🎩 success/failure/timeout via exe/kuberentes-deploy and success/failure via exe/kuberentes-restart looks good (and has expected exit codes)
Pulled from #229
Different exit status for failures and timeouts
. This is technically a breaking change, but probably not one that will impact many people.