Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable per-resource timeout customization via an annotation #232

Merged
merged 4 commits into from
Jan 9, 2018

Conversation

KnVerey
Copy link
Contributor

@KnVerey KnVerey commented Dec 21, 2017

@Shopify/cloudplatform @Shopify/redis-memcached @psycotica0-shopify @kirs

What this does

This PR makes it possible to add a kubernetes-deploy.shopify.io/timeout-override annotation to any individual resource. If that annotation is present and valid, its value will be used in place of the global hard timeout for that resource type.

I think this is a reasonable feature even though it may let people shoot themselves/Shipit in the foot with reeeeeaaaaaly long timeouts because:

  • We've tinkered with the global timeout on Deployment in the past just for core.
  • Timeout on stateful sets #225 requested a timeout override for StatefulSets (and the same thing was raised internally for kafka)
  • increase timeout to 15mins #231 is asking to bump the timout for Redis because of a single instance.
  • This gem is open-source, and different organizations may be using it in very different contexts and for deploying very different things.

Alternatives

  • The various pod controllers are the main use case for this override (even Redis is actually watching a deployment), and for those using something like ProgressDeadlineSeconds (which only exists on Deployment) would be ideal. I've seen issues in k8s core about bringing all the workload controllers in line with each other before v1, so presumably this will be an option eventually. But last I checked StatefulSets (for example) still don't have a Progressing condition on k8s core master.
  • Make Redis use ProgressDeadlineSeconds (since as mentioned above it is watching a Deployment), and set that value really high on the offending deployment. This isn't crazy... in fact the Redis controller should probably use that strategy when we switch to controller-reported statuses for CRs.
  • Implement ProgressDeadlineSeconds-like behaviour for StatefulSets (and possibly other pod controller classes) client-side. I'm hesitant to do this since it would not be straightforward, and the feature is eventually coming upstream (where it belongs).

@KnVerey KnVerey mentioned this pull request Dec 21, 2017
22 tasks
@devonboyer
Copy link
Contributor

@kwestein Once this PR ships we will be able to fix the timeout problems you are having.

@@ -22,6 +22,8 @@ class KubernetesResource
If you have reason to believe it will succeed, retry the deploy to continue to monitor the rollout.
MSG

TIMEOUT_OVERRIDE_ANNOTATION = "kubernetes-deploy.shopify.io/timeout-override-seconds"
Copy link
Contributor

@devonboyer devonboyer Dec 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to support more ways to specify the duration like 10s or 5m? Or perhaps just make the annotation kubernetes-deploy.shopify.io/timeout-override and state that for the time being the only acceptable value is seconds?

I am bringing this up because a timeout of 15m (which may not be unreasonable for Redis) would be 900s which isn't as easy for a human to parse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. I'll update it to accept "s" or "m", and to assume that unqualified values are seconds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well if you're going to have m and s, you might as well have h...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend ISO8601 format (to be used with ActiveSupport::Duration::ISO8601Parse) for time durations. It's what I used in my open PR on resource discovery.

Copy link

@katdrobnjakovic katdrobnjakovic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im on board with this idea for redis 👍

in fact the Redis controller should probably use that strategy when we switch to controller-reported statuses for CRs.

Is this something that is happening soon? I'm okay with shipping this (as I understand it it will be useful for others as well) and adding the annotations to quickly fix the few apps we have seen have problems with their deployments timing out and then refactor to the ProgressDeadlineSeconds strategy.

On our side, we are not in a rush to ship this today as everyone is on holiday and we have a workaround in place if those apps do need to deploy over the break.

@KnVerey
Copy link
Contributor Author

KnVerey commented Dec 22, 2017

Is this something that is happening soon?

@stefanmb is working on it in #188, but it's a big change to both the gem and the controllers, so I wouldn't expect it super soon.

On our side, we are not in a rush to ship this today as everyone is on holiday and we have a workaround in place if those apps do need to deploy over the break.

Thanks for letting me know 👍

@psycotica0-shopify
Copy link
Contributor

It should be possible to write a test like

  • Construct resource, set deploy_started_at to 6 minutes ago, validate that dummy.deploy_timed_out? is true.
  • Construct a resource with a custom timeout of 10m, set deploy_started_at to 6 minutes ago, validate that dummy.deploy_timed_out? is false.

Unless that's too Integrationy for the unit tests.

@KnVerey
Copy link
Contributor Author

KnVerey commented Dec 23, 2017

Updated to use ISO8601 durations* and add more tests.

* These are neat and powerful and standard, and it's great to get the parser for free, but I felt like the intuitive "1s/m/h" should also work without the "PT" prefix, and that being case-sensitive would be annoying. So I wrapped the parser in a custom one with those features. LMK what you think.

class ParsingError < ArgumentError; end

def initialize(value)
@iso8601_str = value.to_s.strip.upcase
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe if you add:

@iso8601_str = "PT" + @iso8601_str unless @iso8601_str.start_with?("P")

You get all of the features you want from just normal ActiveSupport::Duration.parse(@iso8601_str) without needing to reach into the parser or write calculate_total_seconds.

The only issue is that the error message on a failed parse becomes Invalid ISO 8601 duration: "PTFOO", which is not really what the user provided.

You could maybe solve this with a sketchy Find&Replace on the returned message... maybe

Or you could do a more complicated thing where you do:

  begin
    ActiveSupport::Duration.parse(@iso8601_str)
  rescue ActiveSupport::Duration::ISO8601Parser::ParsingError => e
    begin
      ActiveSupport::Duration.parse("PT" + @iso8601_str)
    rescue ActiveSupport::Duration::ISO8601Parser::ParsingError
      raise ParsingError, e.message
    end
  end

Where you do the blind parse, and if it fails you always try jamming PT on the front.
If that succeeds, we return that one, and if it also fails, we return the first failure and discard the second (We raise the e.message, which is the first exception, in the second rescue block and don't even capture the second exception)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the retry-and-append solution! The only thing that bugs me about it is that we'll be retying in what I'm expecting to be the average case. I can't imagine that's meaningfully expensive, but for the purposes of reflecting intent, I'd slightly prefer to have the first try use the PT + version.

def test_timeout_override_upper_bound_validation
customized_resource = DummyResource.new(definition_extras: build_timeout_metadata("24H1S"))
customized_resource.validate_definition
assert customized_resource.validation_failed?, "Annotation with '0' was valid"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This case is not Annotation with '0'

@psycotica0-shopify
Copy link
Contributor

(These tests seem great, by the way!)

Copy link
Contributor

@psycotica0-shopify psycotica0-shopify left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me! I like the "try the PT version first" hack on my hack.

README.md Outdated
@@ -120,6 +121,9 @@ All templates must be YAML formatted. You can also use ERB. The following local
You can add additional variables using the `--bindings=BINDINGS` option. For example, `kubernetes-deploy my-app cluster1 --bindings=color=blue,size=large` will expose `color` and `size` in your templates.


### Customizing behaviour with annotations

- `kubernetes-deploy.shopify.io/timeout-override` (any resource): Override the tool's hard timeout for one specific resource. Both full ISO8601 durations and the time portion of ISO8601 durations are valid. Value must be between 1 second and 24 hours. Example values: 45s / 3m / 1h / PT0.25H
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding: "This feature does not work as expected for the deployment resources because progressDeadlineSeconds takes precedence over hard timeouts and have a server side default value for deployments."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated with more details on compatibility

Copy link
Contributor

@stefanmb stefanmb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Approach is sound and tests look pretty exhaustive. Thanks!

end

def parse!
ActiveSupport::Duration.parse("PT#{@iso8601_str}") # By default assume it is just a time component
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also prefer trying the more specific logic first.

basic_resource = DummyResource.new
assert_equal 300, basic_resource.timeout

customized_resource = DummyResource.new(definition_extras: build_timeout_metadata("60S"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems a bit overkill to the multiple formats here since we already have unit tests for the parser, but it doesn't hurt.

@KnVerey KnVerey merged commit ab6ccdc into master Jan 9, 2018
@KnVerey KnVerey deleted the timeout_override branch October 5, 2018 00:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants