Retry delay is set to a maximum of 5 seconds #3630

Open
dead10ck opened this issue Jul 28, 2017 · 13 comments
Comments

@dead10ck

dead10ck commented Jul 28, 2017

If you try to make a retry policy with a delay of longer than 5 seconds, it will not register. With this policy:

---
name: test-retry
description: Retry test if it fails
enabled: true
resource_ref: examples.test
policy_type: action.retry
parameters:
  retry_on: failure
  delay: 30

When you try to register it, it complains:

2017-07-28 22:30:59,712 INFO [-] =========================================================
2017-07-28 22:30:59,712 INFO [-] ############## Registering policies #####################
2017-07-28 22:30:59,712 INFO [-] =========================================================
2017-07-28 22:30:59,720 WARNING [-] Failed to register policies: Failed to register policy "/opt/stackstorm/packs.dev/examples/policies/test-retry-policy.yaml" from pack "examples": 30 is greater than the maximum of 5

Failed validating u'maximum' in schema['properties'][u'delay']:
    {u'description': u'Number of seconds to wait before retrying the execution.',
     u'maximum': 5,
     u'minimum': 0,
     u'required': False,
     u'type': [u'number', 'null']}

On instance[u'delay']:
    30
Traceback (most recent call last):
  File "/usr/bin/st2-register-content", line 22, in <module>
    sys.exit(content_loader.main(sys.argv[1:]))
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/content/bootstrap.py", line 387, in main
    register_content()
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/content/bootstrap.py", line 341, in register_content
    register_policies()
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/content/bootstrap.py", line 303, in register_policies
    raise e
ValueError: Failed to register policy "/opt/stackstorm/packs.dev/examples/policies/test-retry-policy.yaml" from pack "examples": 30 is greater than the maximum of 5

Failed validating u'maximum' in schema['properties'][u'delay']:
    {u'description': u'Number of seconds to wait before retrying the execution.',
     u'maximum': 5,
     u'minimum': 0,
     u'required': False,
     u'type': [u'number', 'null']}

On instance[u'delay']:
    30

5 seconds is incredibly and arbitrarily short. Honestly, I don't think there should be a maximum at all. If I want to delay my action's retries for 4 hours, or 4 days, or 4 weeks, I should be able to.

@Kami
Member

Kami commented Jul 30, 2017

We intentionally put this upper limit there because of the way retry is currently implemented.

Right now retry is implemented inside the notifier service as "wait and retry" and not as a separate execution status. This means that retry does not survive a restart of the notifier service: if you restarted the service while some actions were waiting to be retried, those retries would be lost. The longer the retry delay, the more likely this is to happen.

This first implementation was mostly meant for simple retries on networking errors (connection timeouts, etc.), which are usually intermittent, so retrying after a couple of seconds usually works just fine.

In the future we plan to implement this as a separate delayed action execution status so retry will be restart safe - after that we will bump this value to something more reasonable.
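
To illustrate the restart-safety point above, here is a minimal Python sketch. It is not StackStorm code: the class names, the JSON file used as storage, and `simulate_restart` are all hypothetical stand-ins. It only shows why a retry that is held in process memory disappears on restart, while one recorded in persistent storage (standing in for a stored "delayed" execution status) can still be picked up afterwards.

```python
import json
import os
import tempfile


class InMemoryRetryQueue:
    """Hypothetical 'wait and retry' queue: pending retries live only in memory."""

    def __init__(self):
        self.pending = []  # list of (execution_id, retry_at) tuples

    def schedule(self, execution_id, now, delay):
        self.pending.append((execution_id, now + delay))


class PersistedRetryQueue:
    """Hypothetical queue that writes pending retries to a JSON file on disk."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return json.load(f)

    def schedule(self, execution_id, now, delay):
        pending = self._load()
        pending.append([execution_id, now + delay])
        with open(self.path, "w") as f:
            json.dump(pending, f)

    def due(self, now):
        # Return the ids of all retries whose scheduled time has arrived.
        return [eid for eid, retry_at in self._load() if retry_at <= now]


def simulate_restart():
    path = os.path.join(tempfile.mkdtemp(), "retries.json")
    in_memory = InMemoryRetryQueue()
    persisted = PersistedRetryQueue(path)
    in_memory.schedule("exec-1", now=0, delay=30)
    persisted.schedule("exec-1", now=0, delay=30)

    # "Restart" the service: discard the old objects and build fresh ones.
    in_memory = InMemoryRetryQueue()
    persisted = PersistedRetryQueue(path)
    return in_memory.pending, persisted.due(now=30)
```

After the simulated restart, the in-memory queue has nothing left to retry, while the persisted queue still reports `exec-1` as due.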

@Kami
Member

Kami commented Aug 1, 2017

@dead10ck For now we decided to bump the max limit to 10 minutes by default and also make this upper limit user configurable in st2.conf.

As mentioned above, the current implementation has limitations you need to be aware of, and even when we redo the implementation it will be designed for retries of up to 10 minutes.

If you want to do longer retries you probably need to re-design your approach and utilize other primitives we offer (e.g. interval trigger).

@dead10ck
Author

dead10ck commented Aug 1, 2017

With all due respect, if your retry system can't handle waits longer than 10 minutes, perhaps you need to re-design your approach. It's not an unreasonable workflow to have an action that runs once a day, or once a week, where it would be preferable to wait an hour or a day to retry in the event of failure, rather than the full day or week, and it would not make sense to retry in 10 minutes.

@Kami
Member

Kami commented Aug 2, 2017

@dead10ck We already have other primitives for handling such long delays, designed specifically for these use cases: timers (https://docs.stackstorm.com/rules.html#timers), which allow you to run an action on a specific date or at time intervals / periods.

@dead10ck
Author

dead10ck commented Aug 2, 2017

@Kami ok, maybe a concrete example will help you understand my problem. Say I worked for a company whose accounting department makes a financial report every 30 days. Say one of my responsibilities was to run some numbers on these financial reports every month, and I wanted to use StackStorm to automate the analysis. I would set up the action to run on an IntervalTimer with the following rule:

---
name: monthly-financial-report-timer
pack: mycompany
description: "Run analysis on the monthly report"
enabled: true

trigger:
  type: core.st2.IntervalTimer
  parameters:
    unit: days
    delta: 30

action:
  ref: mycompany.monthly-financial-report

Now suppose this day of the month comes around, and for some reason, accounting gets delayed, and they won't be able to publish this month's report until the next day. So when this timer triggers mycompany.monthly-financial-report, say it fails because the report is not available yet (though really, the reason that it fails is not important). In these cases, I would like for the action to get retried on its own. However, if I wanted to use StackStorm's retry system, the longest I could wait would be 10 minutes. If I did this, it would mean that it would retry every 10 minutes for 24 hours until the report was published the next day, so my execution history would get filled up with 144 failed runs.

How do you propose I use StackStorm's timer primitives to help me in this situation?

@LindsayHill
Contributor

Rather than retry policies, maybe that would be better done using a Mistral workflow?

@dead10ck
Author

dead10ck commented Aug 2, 2017

@LindsayHill Adding Mistral to my architecture, learning it, and maintaining it is a lot of effort just to be able to delay retries longer than 10 mins.

@LindsayHill
Contributor

So you have a customised ST2 install that does not include Mistral? You don't need to do a separate install of OpenStack Mistral.

Sooner or later you'll run into other limitations of workflows if you're only using action chains.

@dead10ck
Author

dead10ck commented Aug 2, 2017

@LindsayHill Forgive me, I'm new to StackStorm; I wasn't aware that Mistral came packaged with it. In any case, though, is the answer to my problem just "don't use our retry system if you really need longer than 10 mins"?

@LindsayHill
Contributor

At this stage, yes, Mistral is probably a better answer. See https://docs.stackstorm.com/mistral.html
That also gives you a bunch more capabilities around complex workflows.

Using retry policies is not going to work for > 10 mins. The current implementation of it is not designed for what you're trying to achieve.

@dead10ck
Author

dead10ck commented Aug 2, 2017

Got it, thanks for your help.

@Kami
Member

Kami commented Aug 3, 2017

There are also a couple of other ways you could try to approach this:

  1. Event driven approach

The place where the report is generated could be modified to run a script or similar which sends a webhook (an event) to StackStorm when a report is generated; you then use that event to trigger your workflow / action.

If that is not possible, you could write a sensor which periodically checks whether the report is ready and, when it is, dispatches an event.

Both of those approaches might sound and look a little bit more complicated than the retry one, but they follow an "event driven" approach, which makes them more powerful and useful - e.g. you could also use those events to trigger other actions, etc.

  2. "Check and run" action approach

Another way to approach it would be to write a workflow / action which checks whether the report is ready and, when it is, also generates the report (e.g. `generate_report_if_available`). You would then use an interval timer to run this action every day or similar.

Those are both fairly common patterns in the StackStorm land :)
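
As a rough illustration of the "check and run" pattern described above, here is a minimal Python sketch. It is plain Python rather than a StackStorm action class, and everything except the `generate_report_if_available` name is an assumption made for the example: the report is taken to be a file on disk, `analyze` is a hypothetical callback, and a `.done` marker file is one possible way to avoid processing the same report twice.

```python
import os


def generate_report_if_available(report_path, analyze):
    """Run `analyze` on the report if it exists and has not been handled yet.

    Returns the analysis result, or None when there is nothing to do.
    """
    done_marker = report_path + ".done"  # hypothetical "already processed" flag

    # Nothing to do: report not published yet, or already processed earlier.
    if os.path.exists(done_marker) or not os.path.exists(report_path):
        return None

    with open(report_path) as f:
        result = analyze(f.read())

    # Record that this report was handled so later runs exit quietly.
    with open(done_marker, "w") as f:
        f.write("processed\n")
    return result
```

Run from an interval timer once a day, a function like this would simply return without doing anything until the report appears, so a late report results in a few quiet no-op runs rather than a string of failed executions.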

@arm4b
Member

arm4b commented Mar 10, 2018

There is another limitation: the retry policy does not allow more than 5 attempts,
as reported and discussed in this forum thread: https://forum.stackstorm.com/t/best-way-to-persist-an-action-on-failure/29

I agree that a maximum of 5 retries could be an impractical setting in some circumstances.
This is similar to what was discussed here, where the retry delay was limited to 5s (now raised to 120s).

Ideally there would be no limits, and state persistence would be guaranteed across service restarts, which would require reworking the implementation.

Otherwise we're not helping our users, but encouraging them to work around the limits with hacks, even though they reasonably expect the retry policy to work and to be something they can rely on in tough situations.
