Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a new exit code for failures due to timeouts #244

Merged
merged 12 commits into from
Mar 28, 2018
Merged

Conversation

dturn
Copy link
Contributor

@dturn dturn commented Feb 22, 2018

Pulled from #229 Different exit status for failures and timeouts. This is technically a breaking change, but probably not one that will impact many people.

@dturn dturn requested a review from KnVerey February 22, 2018 20:43
@dturn
Copy link
Contributor Author

dturn commented Mar 5, 2018

As far as I can tell every piece of software defines its own exit status numbers:

  • The timeout command uses exit code 124 for a timeout
  • Curl uses 28
  • wget doesn't even bother having a specific timeout exit status

Given the lack of standardization, I'd prefer not to cargo cult something.

Technically this is a breaking change, but not one that makes me thing we need to bump to 1.0 immediately. (a) we're pre 1.0 so anything goes and (b) 2 is still a failing exit code and I can't image people have exit_status == 1 instead of > 0

Tests coming soon.

ensure
@logger.print_summary(success)
status = success ? "success" : "failed"
::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:#{status}")
success
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ensure blocks only have return values if you explicitly use the return keyword

@@ -26,18 +26,24 @@ module FixtureDeployHelper
# pod = fixtures["unmanaged-pod.yml.erb"]["Pod"].first
# pod["spec"]["containers"].first["image"] = "hello-world:thisImageIsBad"
# end
def deploy_fixtures(set, subset: nil, **args) # extra args are passed through to deploy_dir_without_profiling
def deploy_fixtures(set, subset: nil, **args, &block) # extra args are passed through to deploy_dir_without_profiling
Copy link
Contributor Author

@dturn dturn Mar 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't want to convert every test that uses deploy_fixtures to expect two return args....

@dturn
Copy link
Contributor Author

dturn commented Mar 6, 2018

@KnVerey This is ready for 👀 again. I like to test the executables, but I think they need to be re factored to make that easy and that should go into a different PR

Copy link
Contributor

@klautcomputing klautcomputing left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of minor comments, feel free to ignore if you want as I don't have a lot experience with the code base, yet.

verify_result: !skip_wait,
allow_protected_ns: allow_protected_ns,
prune: prune
)
exit 1 unless success

if error == :timeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not:

exit 2 if error == :timeout
exit 1 unless success

@@ -107,6 +107,7 @@ def initialize(namespace:, context:, current_sha:, template_dir:, logger:, kubec

def run(verify_result: true, allow_protected_ns: false, prune: true)
start = Time.now.utc
error = nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should error be a symbol or do we want to create error classes?

@@ -149,6 +150,7 @@ def run(verify_result: true, allow_protected_ns: false, prune: true)
deploy_resources(resources, prune: prune, verify: true)
::StatsD.measure('normal_resources.duration', StatsD.duration(start_normal_resource), tags: statsd_tags)
success = resources.all?(&:deploy_succeeded?)
error = :timeout if !success && resources.any?(&:deploy_timed_out?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then here add which resource(s) timed out?

container["image"] = "some-invalid-image:badtag"
end
assert_deploy_failure(result)
refute_equal error, :timeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of the test suggests that the error is nil when it didn't time out, but then it doesn't test for it.

In case we want to make this also work in the future (when we maybe support other errors than :timeout) do we want to make the check assert_equal error, nil? or change the name of the test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you're right, assert_nil error makes sense here. I also think the test name matches the current behavior so should stay put. If / when we add more conditions we'll need to change what we assert here and the test name.

@@ -100,6 +100,14 @@ def test_restart_not_existing_deployment
in_order: true)
end

def test_restart_error_nil_when_not_timeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here as above

@@ -17,5 +17,10 @@ context = ARGV[1]
logger = KubernetesDeploy::FormattedLogger.build(namespace, context)

restart = KubernetesDeploy::RestartTask.new(namespace: namespace, context: context, logger: logger)
success = restart.perform(raw_deployments)
exit 1 unless success
success, error = restart.perform(raw_deployments)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here as above

ensure
@logger.print_summary(success)
status = success ? "success" : "failed"
tags = %W(namespace:#{@namespace} context:#{@context} status:#{status} deployments:#{deployments.to_a.length}})
::StatsD.measure('restart.duration', StatsD.duration(start), tags: tags)
[success, error]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you mentioned in your comment above ensures don't return, so this can be deleted, right?

@@ -26,18 +26,24 @@ module FixtureDeployHelper
# pod = fixtures["unmanaged-pod.yml.erb"]["Pod"].first
# pod["spec"]["containers"].first["image"] = "hello-world:thisImageIsBad"
# end
def deploy_fixtures(set, subset: nil, **args) # extra args are passed through to deploy_dir_without_profiling
def deploy_fixtures(set, subset: nil, **args, &block) # extra args are passed through to deploy_dir_without_profiling
success, _ = deploy_fixtures_with_error(set, subset: subset, **args, &block)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we make sure that _ is nil, so:

success, error = deploy_fixtures_with_error...
assert_equal error, nil
success

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Turns out several (20) tests break with this change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should change that in the future, as you mentioned. And I am fine with not doing this it in this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helper doesn't make any assumptions about the desired result; most (all?) tests that use deploy_fixtures should be consuming the return value with either assert_deploy_failure or assert_deploy_success (those helpers give you nice failure output when the result is incorrect). The 20 tests that fail are probably the ones where the deploy is expected to fail.

@dturn
Copy link
Contributor Author

dturn commented Mar 14, 2018

@klautcomputing Thanks for the review. Adding an error class was a good idea. I've made the changes you suggested with the exception of the one that broke 20 tests. I'd be open to refactor that in a separate PR after this, but think its too much churn for 1 pr.

@@ -26,18 +26,24 @@ module FixtureDeployHelper
# pod = fixtures["unmanaged-pod.yml.erb"]["Pod"].first
# pod["spec"]["containers"].first["image"] = "hello-world:thisImageIsBad"
# end
def deploy_fixtures(set, subset: nil, **args) # extra args are passed through to deploy_dir_without_profiling
def deploy_fixtures(set, subset: nil, **args, &block) # extra args are passed through to deploy_dir_without_profiling
success, _ = deploy_fixtures_with_error(set, subset: subset, **args, &block)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should change that in the future, as you mentioned. And I am fine with not doing this it in this PR.

Copy link
Contributor

@KnVerey KnVerey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're going to need to find a way to test that we get the verdict we want (in terms of both logs and return value or exception raised) when the result is actually mixed. Relevant existing tests include:

  • test_resource_quotas_are_deployed_first (success + timeout)
  • test_deployment_with_progress_times_out_for_short_duration (timeout only)
  • test_deploy_result_logging_for_mixed_result_deploy (success, failure and timeout)

verify_result: !skip_wait,
allow_protected_ns: allow_protected_ns,
prune: prune
)

exit 2 if error.is_a?(KubernetesDeploy::DeploymentTimeoutError)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re: which specific code to use, I don't see any indication that there's a standard either. However, 2 appears to be one of very few codes that actually has a reserved meaning, which is "misuse of shell built-ins": http://tldp.org/LDP/abs/html/exitcodes.html. A timeout doesn't mean that the command was called incorrectly, so we should probably choose something else.

Copy link
Contributor

@KnVerey KnVerey Mar 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think having a typed exception is a good idea, but it looks a little odd to me to pass an exception around manually in ruby rather than raising it. WDYT about giving DeployTask both run and run! (for backwards compatibility, and probably the integration test framework) and using the latter from the executable?

# DeployTask
    def run(*args)
      run!(*args)
      true
    rescue FatalDeploymentError
      false
    end

    def run!(verify_result: true, allow_protected_ns: false, prune: true)
      # things
      success = true
    rescue FatalDeploymentError => error
      @logger.summary.add_action(error.message)
      success = false
      raise
    ensure
      @logger.print_summary(success)
      status = success ? "success" : "failed"
      ::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:#{status}")
    end
# exe/kubernetes-deploy
begin
  runner.run!(
    verify_result: !skip_wait,
    allow_protected_ns: allow_protected_ns,
    prune: prune
  )
rescue DeploymentTimeoutError
  exit :not1 # whatever we pick
rescue FatalDeploymentError
  exit 1
end

@@ -149,6 +150,9 @@ def run(verify_result: true, allow_protected_ns: false, prune: true)
deploy_resources(resources, prune: prune, verify: true)
::StatsD.measure('normal_resources.duration', StatsD.duration(start_normal_resource), tags: statsd_tags)
success = resources.all?(&:deploy_succeeded?)
if !success && (timedout_resources = resources.select(&:deploy_timed_out?).presence)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's possible for some resources to time out and others to fail, right? In that case, I think the failure should trump regardless of the relative failure/timeout volumes because the failure means they need to do something.

rescue FatalDeploymentError => error
@logger.summary.add_action(error.message)
success = false
[success, error]
ensure
@logger.print_summary(success)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO the summary outputted by this line should also reflect the difference between timeouts and failures. In line with my other suggestion, we could do something like this:

      if verify_result
        # things
        ::StatsD.measure('normal_resources.duration', StatsD.duration(start_normal_resource), tags: statsd_tags)
        raise FatalDeploymentError if resources.any?(&:deploy_failed?)
        raise DeploymentTimeoutError if resources.any?(&:deploy_timed_out?)
      else
        # things
      end
      @logger.print_summary(:success)
      ::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:success")
    rescue DeploymentTimeoutError
      @logger.print_summary(:timeout)
      ::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:timeout")
      raise
    rescue FatalDeploymentError => error
      @logger.summary.add_action(error.message) if error.message.present?
      @logger.print_summary(:failure)
      ::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:failed")
      raise
    end

It's kinda weird to raise something you're about to rescue yourself, but in a way it treats final failures more consistently with earlier failures. Might also be easier to follow? WDYT?

@@ -8,6 +8,15 @@ def initialize(name, context)
super("Namespace `#{name}` not found in context `#{context}`")
end
end

class DeploymentTimeoutError < StandardError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should inherit from FatalDeploymentError, which we've been using as a generic all-is-doomed-now base class for this gem.

@@ -26,18 +26,24 @@ module FixtureDeployHelper
# pod = fixtures["unmanaged-pod.yml.erb"]["Pod"].first
# pod["spec"]["containers"].first["image"] = "hello-world:thisImageIsBad"
# end
def deploy_fixtures(set, subset: nil, **args) # extra args are passed through to deploy_dir_without_profiling
def deploy_fixtures(set, subset: nil, **args, &block) # extra args are passed through to deploy_dir_without_profiling
success, _ = deploy_fixtures_with_error(set, subset: subset, **args, &block)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This helper doesn't make any assumptions about the desired result; most (all?) tests that use deploy_fixtures should be consuming the return value with either assert_deploy_failure or assert_deploy_success (those helpers give you nice failure output when the result is incorrect). The 20 tests that fail are probably the ones where the deploy is expected to fail.

@dturn
Copy link
Contributor Author

dturn commented Mar 15, 2018

Changes are in, part of me is worried this has made things worse and not better, but maybe I've just spent too much time staring at this code today.

@dturn dturn changed the title Use exit code 2 for failures due to timeouts Use a new exit code for failures due to timeouts Mar 16, 2018
prune: prune
)
rescue KubernetesDeploy::DeploymentTimeoutError
exit 124
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use the code Jean already configured in our Shipit? (70)

success = false
raise
ensure
@logger.print_summary(success)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you miss this comment? #244 (comment)
I'm not married to that specific implementation, but the problem described is still an issue in this version. I realized the other day that getting rid of the ensure would have the nice side-effect of not printing the result banner in the case of unexpected exceptions that are dumping backtraces.

Copy link
Contributor Author

@dturn dturn Mar 23, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I misunderstood what you were suggesting since print_summary takes a bool right now. I'll fix this up to use a symbol.

if ENV["PRINT_LOGS"]
assert_equal false, result, "Deploy succeeded when it was expected to fail"
return
end

logging_assertion do |logs|
assert_match(/failed due to timeouts./, logs) if cause == :timeout
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be asserting on Result: TIMED OUT (which isn't happening yet in this PR), like L114

Copy link
Contributor

@KnVerey KnVerey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code LGTM now. Just a few comments/questions based on 🎩 .

level = :info
else
heading("Result: ", "FAILURE", :red)
heading("Result: ", status_string, :red)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be more consistent to make this piece of heading yellow? (Just a thought; I don't have a strong opinion)

image

attr_reader :resources
def initialize(resources)
@resources = resources
super("Resources #{@resources.map(&:name).join(',')} failed due to timeouts.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For consistency with the other messaging (e.g. right now we see Successfully deployed 1 resource, failed to deploy 1 resource, and resources web failed due to timeouts.), and because this could be a long list, I think this should be a count, not a list.

Since we're trying to clarify that timeouts aren't exactly failures, might it be more helpful to not use "failed" in this message? e.g. "gave up watching X resources", "timed out waiting for X resources", "reached the watch timeout for X resources", I dunno.

@logger.summary.add_action(error.message)
success = false
ensure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By getting rid of the ensure, this PR is implementing solution 2 for #137.

level = :info
elsif status == :timed_out
heading("Result: ", status_string, :yellow)
level = :warn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we still want :fatal for this, since it did in fact kill the deploy... but also because error and warn with this particular logger make the whole lined coloured:

image

Copy link
Contributor

@KnVerey KnVerey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎩 success/failure/timeout via exe/kuberentes-deploy and success/failure via exe/kuberentes-restart looks good (and has expected exit codes)

@dturn dturn merged commit 3c0109d into master Mar 28, 2018
@dturn dturn deleted the timeout-exit-code branch March 28, 2018 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants