Use a new exit code for failures due to timeouts #244

dturn · 2018-02-22T20:43:11Z

Pulled from #229 Different exit status for failures and timeouts. This is technically a breaking change, but probably not one that will impact many people.

dturn · 2018-03-05T16:07:36Z

As far as I can tell every piece of software defines its own exit status numbers:

The timeout command uses exit code 124 for a timeout
Curl uses 28
wget doesn't even bother having a specific timeout exit status

Given the lack of standardization, I'd prefer not to cargo cult something.

Technically this is a breaking change, but not one that makes me think we need to bump to 1.0 immediately: (a) we're pre-1.0 so anything goes, and (b) 2 is still a failing exit code and I can't imagine people check exit_status == 1 instead of > 0.

Tests coming soon.
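
For anyone wondering what point (b) means in practice, here is a small standalone sketch. It does not call kubernetes-deploy at all: a child Ruby process that exits with 2 (the code proposed at this point in the thread) stands in for a timed-out deploy.

```ruby
require "rbconfig"

# Stand-in for a deploy that timed out: a child process exiting with 2.
# The value 2 is only the code proposed in this comment.
system(RbConfig.ruby, "-e", "exit 2")
status = $?

puts status.exitstatus        # 2
puts status.success?          # false: any non-zero code still counts as a failure
puts status.exitstatus == 1   # false: a strict "== 1" check would miss the timeout
```

Callers that test for non-zero keep working; only callers that compare against exactly 1 would see a behaviour change.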

dturn · 2018-03-06T16:43:11Z

lib/kubernetes-deploy/deploy_task.rb

    ensure
      @logger.print_summary(success)
      status = success ? "success" : "failed"
      ::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:#{status}")
-      success


ensure blocks only have return values if you explicitly use the return keyword
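
A minimal standalone sketch of that behaviour, with made-up method names that are not from the gem:

```ruby
# Hypothetical methods, for illustration only.
def implicit_value_in_ensure
  :from_body
ensure
  :from_ensure # evaluated, then discarded
end

def explicit_return_in_ensure
  :from_body
ensure
  return :from_ensure # overrides the body's value (and would swallow an exception)
end

puts implicit_value_in_ensure.inspect  # :from_body
puts explicit_return_in_ensure.inspect # :from_ensure
```

This is why a trailing expression in the ensure block has no effect on what the method returns.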

dturn · 2018-03-06T16:43:47Z

test/helpers/fixture_deploy_helper.rb

@@ -26,18 +26,24 @@ module FixtureDeployHelper
  #     pod = fixtures["unmanaged-pod.yml.erb"]["Pod"].first
  #     pod["spec"]["containers"].first["image"] = "hello-world:thisImageIsBad"
  #   end
-  def deploy_fixtures(set, subset: nil, **args) # extra args are passed through to deploy_dir_without_profiling
+  def deploy_fixtures(set, subset: nil, **args, &block) # extra args are passed through to deploy_dir_without_profiling


I didn't want to convert every test that uses deploy_fixtures to expect two return args....

dturn · 2018-03-06T16:44:59Z

@KnVerey This is ready for 👀 again. I'd like to test the executables, but I think they need to be refactored to make that easy, and that should go into a different PR.

klautcomputing

I have a couple of minor comments. Feel free to ignore them if you want, as I don't have a lot of experience with the code base yet.

klautcomputing · 2018-03-13T21:58:35Z

exe/kubernetes-deploy

  verify_result: !skip_wait,
  allow_protected_ns: allow_protected_ns,
  prune: prune
 )
-exit 1 unless success
+
+if error == :timeout


why not:

    exit 2 if error == :timeout
    exit 1 unless success

klautcomputing · 2018-03-13T22:00:41Z

lib/kubernetes-deploy/deploy_task.rb

@@ -107,6 +107,7 @@ def initialize(namespace:, context:, current_sha:, template_dir:, logger:, kubec

    def run(verify_result: true, allow_protected_ns: false, prune: true)
      start = Time.now.utc
+      error = nil


Should error be a symbol or do we want to create error classes?

klautcomputing · 2018-03-13T22:00:57Z

lib/kubernetes-deploy/deploy_task.rb

@@ -149,6 +150,7 @@ def run(verify_result: true, allow_protected_ns: false, prune: true)
        deploy_resources(resources, prune: prune, verify: true)
        ::StatsD.measure('normal_resources.duration', StatsD.duration(start_normal_resource), tags: statsd_tags)
        success = resources.all?(&:deploy_succeeded?)
+        error = :timeout if !success && resources.any?(&:deploy_timed_out?)


And then here add which resource(s) timed out?

klautcomputing · 2018-03-13T22:11:32Z

test/integration/kubernetes_deploy_test.rb

+      container["image"] = "some-invalid-image:badtag"
+    end
+    assert_deploy_failure(result)
+    refute_equal error, :timeout


The name of the test suggests that the error is nil when it didn't time out, but then it doesn't test for it.

In case we want to make this also work in the future (when we maybe support other errors than :timeout) do we want to make the check assert_equal error, nil? or change the name of the test?

I think you're right, assert_nil error makes sense here. I also think the test name matches the current behavior so should stay put. If / when we add more conditions we'll need to change what we assert here and the test name.

klautcomputing · 2018-03-13T22:11:57Z

test/integration/restart_task_test.rb

@@ -100,6 +100,14 @@ def test_restart_not_existing_deployment
      in_order: true)
  end

+  def test_restart_error_nil_when_not_timeout


same here as above

klautcomputing · 2018-03-13T22:12:07Z

exe/kubernetes-restart

@@ -17,5 +17,10 @@ context = ARGV[1]
 logger = KubernetesDeploy::FormattedLogger.build(namespace, context)

 restart = KubernetesDeploy::RestartTask.new(namespace: namespace, context: context, logger: logger)
-success = restart.perform(raw_deployments)
-exit 1 unless success
+success, error = restart.perform(raw_deployments)


same here as above

klautcomputing · 2018-03-13T22:22:06Z

lib/kubernetes-deploy/restart_task.rb

    ensure
      @logger.print_summary(success)
      status = success ? "success" : "failed"
      tags = %W(namespace:#{@namespace} context:#{@context} status:#{status} deployments:#{deployments.to_a.length}})
      ::StatsD.measure('restart.duration', StatsD.duration(start), tags: tags)
+      [success, error]


As you mentioned in your comment above ensures don't return, so this can be deleted, right?

klautcomputing · 2018-03-13T22:25:32Z

test/helpers/fixture_deploy_helper.rb

@@ -26,18 +26,24 @@ module FixtureDeployHelper
  #     pod = fixtures["unmanaged-pod.yml.erb"]["Pod"].first
  #     pod["spec"]["containers"].first["image"] = "hello-world:thisImageIsBad"
  #   end
-  def deploy_fixtures(set, subset: nil, **args) # extra args are passed through to deploy_dir_without_profiling
+  def deploy_fixtures(set, subset: nil, **args, &block) # extra args are passed through to deploy_dir_without_profiling
+    success, _ = deploy_fixtures_with_error(set, subset: subset, **args, &block)


shouldn't we make sure that _ is nil, so:

    success, error = deploy_fixtures_with_error...
    assert_equal error, nil
    success

Turns out several (20) tests break with this change.


dturn · 2018-03-14T15:26:34Z

@klautcomputing Thanks for the review. Adding an error class was a good idea. I've made the changes you suggested, with the exception of the one that broke 20 tests. I'd be open to refactoring that in a separate PR after this, but I think it's too much churn for one PR.

klautcomputing · 2018-03-14T15:32:16Z

test/helpers/fixture_deploy_helper.rb

@@ -26,18 +26,24 @@ module FixtureDeployHelper
  #     pod = fixtures["unmanaged-pod.yml.erb"]["Pod"].first
  #     pod["spec"]["containers"].first["image"] = "hello-world:thisImageIsBad"
  #   end
-  def deploy_fixtures(set, subset: nil, **args) # extra args are passed through to deploy_dir_without_profiling
+  def deploy_fixtures(set, subset: nil, **args, &block) # extra args are passed through to deploy_dir_without_profiling
+    success, _ = deploy_fixtures_with_error(set, subset: subset, **args, &block)


I think we should change that in the future, as you mentioned, and I am fine with not doing it in this PR.

KnVerey

We're going to need to find a way to test that we get the verdict we want (in terms of both logs and return value or exception raised) when the result is actually mixed. Relevant existing tests include:

test_resource_quotas_are_deployed_first (success + timeout)
test_deployment_with_progress_times_out_for_short_duration (timeout only)
test_deploy_result_logging_for_mixed_result_deploy (success, failure and timeout)

KnVerey · 2018-03-14T23:19:53Z

exe/kubernetes-deploy

  verify_result: !skip_wait,
  allow_protected_ns: allow_protected_ns,
  prune: prune
 )
+
+exit 2 if error.is_a?(KubernetesDeploy::DeploymentTimeoutError)


Re: which specific code to use, I don't see any indication that there's a standard either. However, 2 appears to be one of very few codes that actually has a reserved meaning, which is "misuse of shell built-ins": http://tldp.org/LDP/abs/html/exitcodes.html. A timeout doesn't mean that the command was called incorrectly, so we should probably choose something else.

I think having a typed exception is a good idea, but it looks a little odd to me to pass an exception around manually in ruby rather than raising it. WDYT about giving DeployTask both run and run! (for backwards compatibility, and probably the integration test framework) and using the latter from the executable?

    # DeployTask
    def run(*args)
      run!(*args)
      true
    rescue FatalDeploymentError
      false
    end

    def run!(verify_result: true, allow_protected_ns: false, prune: true)
      # things
      success = true
    rescue FatalDeploymentError => error
      @logger.summary.add_action(error.message)
      success = false
      raise
    ensure
      @logger.print_summary(success)
      status = success ? "success" : "failed"
      ::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:#{status}")
    end

    # exe/kubernetes-deploy
    begin
      runner.run!(
        verify_result: !skip_wait,
        allow_protected_ns: allow_protected_ns,
        prune: prune
      )
    rescue DeploymentTimeoutError
      exit :not1 # whatever we pick
    rescue FatalDeploymentError
      exit 1
    end

KnVerey · 2018-03-14T23:49:13Z

lib/kubernetes-deploy/deploy_task.rb

@@ -149,6 +150,9 @@ def run(verify_result: true, allow_protected_ns: false, prune: true)
        deploy_resources(resources, prune: prune, verify: true)
        ::StatsD.measure('normal_resources.duration', StatsD.duration(start_normal_resource), tags: statsd_tags)
        success = resources.all?(&:deploy_succeeded?)
+        if !success && (timedout_resources = resources.select(&:deploy_timed_out?).presence)


So it's possible for some resources to time out and others to fail, right? In that case, I think the failure should trump regardless of the relative failure/timeout volumes because the failure means they need to do something.
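
A rough sketch of that precedence rule, using a hypothetical `Resource` struct and `verdict` helper in place of the gem's real resource classes. Only the two predicates that appear in this thread are modelled, and "success" is assumed to mean "neither failed nor timed out":

```ruby
# Hypothetical stand-in for a deployed resource.
Resource = Struct.new(:name, :state) do
  def deploy_failed?
    state == :failed
  end

  def deploy_timed_out?
    state == :timed_out
  end
end

# Returns :failed if anything failed, :timed_out if the only problems
# were timeouts, and :success otherwise.
def verdict(resources)
  return :failed if resources.any?(&:deploy_failed?)
  return :timed_out if resources.any?(&:deploy_timed_out?)
  :success
end

mixed = [
  Resource.new("web", :timed_out),
  Resource.new("worker", :timed_out),
  Resource.new("migrate", :failed),
]
puts verdict(mixed) # one failure outweighs two timeouts
```

Checking for failures first means the relative counts never matter.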

KnVerey · 2018-03-14T23:54:52Z

lib/kubernetes-deploy/deploy_task.rb

    rescue FatalDeploymentError => error
      @logger.summary.add_action(error.message)
      success = false
+      [success, error]
    ensure
      @logger.print_summary(success)


IMO the summary outputted by this line should also reflect the difference between timeouts and failures. In line with my other suggestion, we could do something like this:

      if verify_result
        # things
        ::StatsD.measure('normal_resources.duration', StatsD.duration(start_normal_resource), tags: statsd_tags)
        raise FatalDeploymentError if resources.any?(&:deploy_failed?)
        raise DeploymentTimeoutError if resources.any?(&:deploy_timed_out?)
      else
        # things
      end
      @logger.print_summary(:success)
      ::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:success")
    rescue DeploymentTimeoutError
      @logger.print_summary(:timeout)
      ::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:timeout")
      raise
    rescue FatalDeploymentError => error
      @logger.summary.add_action(error.message) if error.message.present?
      @logger.print_summary(:failure)
      ::StatsD.measure('all_resources.duration', StatsD.duration(start), tags: statsd_tags << "status:failed")
      raise
    end

It's kinda weird to raise something you're about to rescue yourself, but in a way it treats final failures more consistently with earlier failures. Might also be easier to follow? WDYT?

KnVerey · 2018-03-14T23:56:01Z

lib/kubernetes-deploy/errors.rb

@@ -8,6 +8,15 @@ def initialize(name, context)
      super("Namespace `#{name}` not found in context `#{context}`")
    end
  end
+
+  class DeploymentTimeoutError < StandardError


I think this should inherit from FatalDeploymentError, which we've been using as a generic all-is-doomed-now base class for this gem.
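
One consequence worth spelling out: with this inheritance, existing code that rescues FatalDeploymentError keeps catching timeouts, and any code that wants to tell them apart has to rescue the subclass first. A minimal sketch with stand-in classes (the real ones live under KubernetesDeploy) and a placeholder timeout code, since no value had been settled on here:

```ruby
# Stand-in classes mirroring the names used in this thread.
class FatalDeploymentError < StandardError; end
class DeploymentTimeoutError < FatalDeploymentError; end

TIMEOUT_EXIT_CODE = 70 # placeholder value, an assumption for this sketch

# Hypothetical helper: runs the block and maps the outcome to an exit code.
def exit_code_for
  yield
  0
rescue DeploymentTimeoutError # must come before the base class
  TIMEOUT_EXIT_CODE
rescue FatalDeploymentError
  1
end

puts exit_code_for { raise DeploymentTimeoutError }
puts exit_code_for { raise FatalDeploymentError }
puts exit_code_for { :ok }
```

If the two rescue clauses were swapped, the base-class clause would match first and every timeout would be reported as a plain failure.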

KnVerey · 2018-03-15T00:00:18Z

test/helpers/fixture_deploy_helper.rb

@@ -26,18 +26,24 @@ module FixtureDeployHelper
  #     pod = fixtures["unmanaged-pod.yml.erb"]["Pod"].first
  #     pod["spec"]["containers"].first["image"] = "hello-world:thisImageIsBad"
  #   end
-  def deploy_fixtures(set, subset: nil, **args) # extra args are passed through to deploy_dir_without_profiling
+  def deploy_fixtures(set, subset: nil, **args, &block) # extra args are passed through to deploy_dir_without_profiling
+    success, _ = deploy_fixtures_with_error(set, subset: subset, **args, &block)


This helper doesn't make any assumptions about the desired result; most (all?) tests that use deploy_fixtures should be consuming the return value with either assert_deploy_failure or assert_deploy_success (those helpers give you nice failure output when the result is incorrect). The 20 tests that fail are probably the ones where the deploy is expected to fail.

dturn · 2018-03-15T19:50:38Z

Changes are in. Part of me is worried this has made things worse rather than better, but maybe I've just spent too much time staring at this code today.

KnVerey · 2018-03-23T00:29:46Z

exe/kubernetes-deploy

+    prune: prune
+  )
+rescue KubernetesDeploy::DeploymentTimeoutError
+  exit 124


Should we use the code Jean already configured in our Shipit? (70)

KnVerey · 2018-03-23T00:39:03Z

lib/kubernetes-deploy/deploy_task.rb

      success = false
+      raise
    ensure
      @logger.print_summary(success)


Did you miss this comment? #244 (comment)
I'm not married to that specific implementation, but the problem described is still an issue in this version. I realized the other day that getting rid of the ensure would have the nice side-effect of not printing the result banner in the case of unexpected exceptions that are dumping backtraces.

I misunderstood what you were suggesting since print_summary takes a bool right now. I'll fix this up to use a symbol.
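
To make the side effect concrete, here is a toy comparison of the two shapes. Everything in it is hypothetical: `log` is a plain array standing in for the logger, and the method names are invented for this sketch.

```ruby
class FatalDeploymentError < StandardError; end

# Shape 1: the banner is printed from an ensure block.
def run_with_ensure(log)
  yield
  success = true
rescue FatalDeploymentError
  success = false
  raise
ensure
  # On an unexpected exception, success is still nil here.
  log << "Result: #{success ? 'SUCCESS' : 'FAILURE'}"
end

# Shape 2: the banner is printed only on the paths we understand.
def run_without_ensure(log)
  yield
  log << "Result: SUCCESS"
rescue FatalDeploymentError
  log << "Result: FAILURE"
  raise
end

# Simulates an unexpected bug inside the deploy and returns what was logged.
def banner_after(runner)
  log = []
  begin
    send(runner, log) { raise ArgumentError, "bug" }
  rescue ArgumentError
    # swallowed for the demo; a real run would dump a backtrace
  end
  log
end

p banner_after(:run_with_ensure)    # a FAILURE banner despite the crash
p banner_after(:run_without_ensure) # no banner at all
```

Known outcomes behave the same in both shapes; they differ only when something outside FatalDeploymentError is raised.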

KnVerey · 2018-03-23T00:44:41Z

test/test_helper.rb

      if ENV["PRINT_LOGS"]
        assert_equal false, result, "Deploy succeeded when it was expected to fail"
        return
      end

      logging_assertion do |logs|
+        assert_match(/failed due to timeouts./, logs) if cause == :timeout


This should be asserting on Result: TIMED OUT (which isn't happening yet in this PR), like L114

KnVerey

Code LGTM now. Just a few comments/questions based on 🎩 .

KnVerey · 2018-03-27T16:12:41Z

lib/kubernetes-deploy/deferred_summary_logging.rb

        level = :info
      else
-        heading("Result: ", "FAILURE", :red)
+        heading("Result: ", status_string, :red)


Would it be more consistent to make this piece of heading yellow? (Just a thought; I don't have a strong opinion)

KnVerey · 2018-03-27T16:16:03Z

lib/kubernetes-deploy/errors.rb

+    attr_reader :resources
+    def initialize(resources)
+      @resources = resources
+      super("Resources #{@resources.map(&:name).join(',')} failed due to timeouts.")


For consistency with the other messaging (e.g. right now we see Successfully deployed 1 resource, failed to deploy 1 resource, and resources web failed due to timeouts.), and because this could be a long list, I think this should be a count, not a list.

Since we're trying to clarify that timeouts aren't exactly failures, might it be more helpful to not use "failed" in this message? e.g. "gave up watching X resources", "timed out waiting for X resources", "reached the watch timeout for X resources", I dunno.
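
A sketch of what a count-based message could look like. The wording is just one of the options floated above, and the classes and the `Resource` struct are stand-ins defined for this example rather than the gem's own:

```ruby
# Stand-in base class; the real one lives under KubernetesDeploy.
class FatalDeploymentError < StandardError; end

# Hypothetical variant of the error class with a count instead of a list.
class DeploymentTimeoutError < FatalDeploymentError
  attr_reader :resources

  def initialize(resources)
    @resources = resources
    noun = resources.length == 1 ? "resource" : "resources"
    super("Timed out waiting for #{resources.length} #{noun}")
  end
end

Resource = Struct.new(:name) # placeholder for a real resource object
error = DeploymentTimeoutError.new([Resource.new("web"), Resource.new("worker")])
puts error.message
```

The full list stays available through `resources` for anything that wants to print names elsewhere.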

KnVerey · 2018-03-27T16:20:45Z

lib/kubernetes-deploy/deploy_task.rb

      @logger.summary.add_action(error.message)
-      success = false
-    ensure


By getting rid of the ensure, this PR is implementing solution 2 for #137.

KnVerey · 2018-03-28T00:23:51Z

lib/kubernetes-deploy/deferred_summary_logging.rb

        level = :info
+      elsif status == :timed_out
+        heading("Result: ", status_string, :yellow)
+        level = :warn


I think we still want :fatal for this, since it did in fact kill the deploy... but also because error and warn with this particular logger make the whole line coloured.

KnVerey

🎩 success/failure/timeout via exe/kubernetes-deploy and success/failure via exe/kubernetes-restart looks good (and has expected exit codes)

dturn requested a review from KnVerey February 22, 2018 20:43

dturn force-pushed the timeout-exit-code branch from 9e69bb0 to 1bca798 Compare March 5, 2018 22:30

dturn commented Mar 6, 2018

dturn requested a review from klautcomputing March 9, 2018 18:40

klautcomputing suggested changes Mar 13, 2018

klautcomputing approved these changes Mar 14, 2018

KnVerey reviewed Mar 15, 2018

dturn mentioned this pull request Mar 15, 2018

Distinguish deploy timeouts from deploy failures Shopify/shipit-engine#764

Merged

dturn force-pushed the timeout-exit-code branch 3 times, most recently from e47af04 to 48a4066 Compare March 15, 2018 19:43

dturn changed the title from "Use exit code 2 for failures due to timeouts" to "Use a new exit code for failures due to timeouts" Mar 16, 2018

KnVerey reviewed Mar 23, 2018

KnVerey reviewed Mar 27, 2018

dturn added 11 commits March 27, 2018 15:08

Use exit code 2 for timeouts

3c35087

Add some tests

e6d6c3c

Fix tests

225c0be

Pass the block in tests

7f3daa6

PR feedback, major change is adding a KubernetesDeploy::DeploymentTimeoutError class.

c8ee270

Fix tests

cc56d31

PR refactor

f6fac14

update changelog

4bf428d

Use new exit code and better logging

c85c08f

Fix up test

473579d

PR feedback

1d900af

dturn force-pushed the timeout-exit-code branch from 27074e0 to 1d900af Compare March 27, 2018 20:08

KnVerey reviewed Mar 28, 2018

Timed out error is fatal

48fbfba

KnVerey approved these changes Mar 28, 2018

dturn merged commit 3c0109d into master Mar 28, 2018

dturn deleted the timeout-exit-code branch March 28, 2018 16:37

KnVerey mentioned this pull request Apr 2, 2018

Fix action summaries with failures/timeouts #258

Merged

dturn mentioned this pull request Feb 27, 2019

Summary section inaccurate when deploy aborted #137

Closed
