Retry delay is set to a maximum of 5 seconds #3630

dead10ck · 2017-07-28T22:50:02Z

If you try to make a retry policy with a delay of longer than 5 seconds, it will not register. With this policy:

---
name: test-retry
description: Retry test if it fails
enabled: true
resource_ref: examples.test
policy_type: action.retry
parameters:
  retry_on: failure
  delay: 30

When you try to register it, it complains:

2017-07-28 22:30:59,712 INFO [-] =========================================================
2017-07-28 22:30:59,712 INFO [-] ############## Registering policies #####################
2017-07-28 22:30:59,712 INFO [-] =========================================================
2017-07-28 22:30:59,720 WARNING [-] Failed to register policies: Failed to register policy "/opt/stackstorm/packs.dev/examples/policies/test-retry-policy.yaml" from pack "examples": 30 is greater than the maximum of 5

Failed validating u'maximum' in schema['properties'][u'delay']:
    {u'description': u'Number of seconds to wait before retrying the execution.',
     u'maximum': 5,
     u'minimum': 0,
     u'required': False,
     u'type': [u'number', 'null']}

On instance[u'delay']:
    30
Traceback (most recent call last):
  File "/usr/bin/st2-register-content", line 22, in <module>
    sys.exit(content_loader.main(sys.argv[1:]))
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/content/bootstrap.py", line 387, in main
    register_content()
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/content/bootstrap.py", line 341, in register_content
    register_policies()
  File "/opt/stackstorm/st2/local/lib/python2.7/site-packages/st2common/content/bootstrap.py", line 303, in register_policies
    raise e
ValueError: Failed to register policy "/opt/stackstorm/packs.dev/examples/policies/test-retry-policy.yaml" from pack "examples": 30 is greater than the maximum of 5

Failed validating u'maximum' in schema['properties'][u'delay']:
    {u'description': u'Number of seconds to wait before retrying the execution.',
     u'maximum': 5,
     u'minimum': 0,
     u'required': False,
     u'type': [u'number', 'null']}

On instance[u'delay']:
    30

5 seconds is incredibly and arbitrarily short. Honestly, I don't think there should be a maximum at all. If I want to delay my action's retries for 4 hours, or 4 days, or 4 weeks, I should be able to.

Kami · 2017-07-30T09:40:22Z

We intentionally put this upper limit there because of the way retry is currently implemented.

Right now retry is implemented inside the notifier service as "wait and retry" and not as a separate execution status. This means that retry is not safe across notifier service restarts: if you restarted the service while some actions were waiting to be retried, those retries would be lost. The chance of this happening grows with longer retry delays.

This first implementation was mostly meant for simple retries on networking errors (connection timeout, etc.), which are usually intermittent, so retrying after a couple of seconds usually works just fine.

In the future we plan to implement this as a separate delayed action execution status so retry will be restart safe - after that we will bump this value to something more reasonable.
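
The restart-safety problem described above can be illustrated with a small, self-contained sketch. This is not StackStorm's code: both queue classes, the JSON state file, and `simulate_restart` are hypothetical names used only to contrast an in-memory "wait and retry" with a persisted delayed state.

```python
import json
import os
import tempfile


class InMemoryRetryQueue:
    """Pending retries live only in process memory ("wait and retry")."""

    def __init__(self):
        self.pending = []  # list of (run_at, execution_id)

    def schedule(self, execution_id, now, delay):
        self.pending.append((now + delay, execution_id))

    def due(self, now):
        # Return the executions whose delay has elapsed and drop them from the queue.
        ready = [eid for run_at, eid in self.pending if run_at <= now]
        self.pending = [(t, e) for t, e in self.pending if t > now]
        return ready


class PersistedRetryQueue(InMemoryRetryQueue):
    """Same queue, but every change is written to disk and reloaded on start."""

    def __init__(self, path):
        super().__init__()
        self.path = path
        if os.path.exists(path):
            with open(path) as f:
                self.pending = [tuple(item) for item in json.load(f)]

    def schedule(self, execution_id, now, delay):
        super().schedule(execution_id, now, delay)
        self._save()

    def due(self, now):
        ready = super().due(now)
        self._save()
        return ready

    def _save(self):
        with open(self.path, "w") as f:
            json.dump(self.pending, f)


def simulate_restart():
    """Schedule a retry, "restart" by building a new queue, and report what survives."""
    state_file = os.path.join(tempfile.mkdtemp(), "retries.json")

    in_memory = InMemoryRetryQueue()
    in_memory.schedule("exec-1", now=0, delay=30)
    in_memory = InMemoryRetryQueue()  # process restarted: state is gone

    persisted = PersistedRetryQueue(state_file)
    persisted.schedule("exec-1", now=0, delay=30)
    persisted = PersistedRetryQueue(state_file)  # process restarted: state reloaded

    return in_memory.due(now=30), persisted.due(now=30)
```

In this sketch the in-memory queue returns nothing after the simulated restart, while the persisted one still returns the scheduled retry. The longer the delay, the wider the window in which a restart can drop the in-memory entry.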

Kami · 2017-08-01T14:31:17Z

@dead10ck For now we decided to bump the max limit to 10 minutes by default and also make this upper limit user configurable in st2.conf.

As mentioned above, the current implementation has limitations you need to be aware of, and even when we redo the implementation it will be designed for retries of up to 10 minutes.

If you want to do longer retries you probably need to re-design your approach and utilize other primitives we offer (e.g. interval trigger).

dead10ck · 2017-08-01T17:43:09Z

With all due respect, if your retry system can't handle waits longer than 10 minutes, perhaps you need to re-design your approach. It's not an unreasonable workflow to have an action that runs once a day, or once a week, where it would be preferable to wait an hour or a day to retry in the event of failure, rather than the full day or week, and it would not make sense to retry in 10 minutes.

Kami · 2017-08-02T08:08:41Z

@dead10ck We already have other primitives for such long delays, designed specifically for these use cases: timers (https://docs.stackstorm.com/rules.html#timers), which allow you to run an action on a specific date or at time intervals / periods.

dead10ck · 2017-08-02T16:47:50Z

@Kami ok, maybe a concrete example will help you understand my problem. Say I worked for a company whose accounting department makes a financial report every 30 days. Say one of my responsibilities was to run some numbers on these financial reports every month, and I wanted to use StackStorm to automate the analysis. I would set up the action to run on an IntervalTimer with the following rule:

---
name: monthly-financial-report-timer
pack: mycompany
description: "Run analysis on the monthly report"
enabled: true

trigger:
  type: core.st2.IntervalTimer
  parameters:
    unit: days
    delta: 30

action:
  ref: mycompany.monthly-financial-report

Now suppose this day of the month comes around, and for some reason, accounting gets delayed, and they won't be able to publish this month's report until the next day. So when this timer triggers mycompany.monthly-financial-report, say it fails because the report is not available yet (though really, the reason that it fails is not important). In these cases, I would like for the action to get retried on its own. However, if I wanted to use StackStorm's retry system, the longest I could wait would be 10 minutes. If I did this, it would mean that it would retry every 10 minutes for 24 hours until the report was published the next day, so my execution history would get filled up with 144 failed runs.
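
As a quick check of that arithmetic, here is a minimal sketch. It assumes each attempt fails immediately and ignores any cap on the number of attempts; the function name is made up for illustration.

```python
def failed_runs_while_waiting(wait_seconds, retry_delay_seconds):
    """Number of retries fired during the wait, assuming each one fails instantly."""
    return wait_seconds // retry_delay_seconds


DAY = 24 * 60 * 60

print(failed_runs_while_waiting(DAY, 10 * 60))  # 144 failed runs with a 10 minute delay
print(failed_runs_while_waiting(DAY, 60 * 60))  # 24 failed runs with a 1 hour delay
print(failed_runs_while_waiting(DAY, DAY))      # 1 failed run with a 1 day delay
```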

How do you propose I use StackStorm's timer primitives to help me in this situation?

LindsayHill · 2017-08-02T20:54:44Z

Rather than retry policies, maybe that would be better done using a Mistral workflow?

dead10ck · 2017-08-02T22:11:39Z

@LindsayHill Adding Mistral to my architecture, learning it, and maintaining it is a lot of effort just to be able to delay retries longer than 10 mins.

LindsayHill · 2017-08-02T22:22:56Z

So you have a customised ST2 install that does not include Mistral? You don't need to do a separate install of OpenStack Mistral.

Sooner or later you'll run into other limitations of workflows if you're only using action chains.

dead10ck · 2017-08-02T22:34:16Z

@LindsayHill Forgive me, I'm new to StackStorm; I wasn't aware that Mistral came packaged with it. In any case, though, is the answer to my problem just "don't use our retry system if you really need longer than 10 mins"?

LindsayHill · 2017-08-02T22:45:22Z

At this stage, yes, Mistral is probably a better answer. See https://docs.stackstorm.com/mistral.html
That also gives you a bunch more capabilities around complex workflows.

Using retry policies is not going to work for > 10 mins. The current implementation of it is not designed for what you're trying to achieve.

dead10ck · 2017-08-02T23:08:19Z

Got it, thanks for your help.

Kami · 2017-08-03T08:33:09Z

There are also a couple of other ways you could try to approach this:

Event driven approach

The place where the report is generated could be modified to run a script or similar which sends a webhook (an event) to StackStorm when a report is generated, which you then use to trigger your workflow / action.

If that is not possible, you could write a sensor which periodically checks whether the report is ready and, when it is, dispatches an event.

Both of those approaches might sound and look a little bit more complicated than the retry one, but they follow the "event driven" approach, which makes them more powerful and useful - e.g. you could also use those events to trigger other actions, etc.
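
The sensor variant described above could look roughly like the sketch below. It is plain Python rather than a real StackStorm sensor: the class name, the `mycompany.report_ready` trigger name, and the injected `dispatch` callable are assumptions for illustration. In an actual pack, a class like this would typically build on StackStorm's polling sensor base class and dispatch through the sensor service instead.

```python
import os
import tempfile


class ReportReadySensor:
    """Polls for a report file and dispatches one event when it appears."""

    def __init__(self, report_path, dispatch):
        self.report_path = report_path
        self.dispatch = dispatch  # hypothetical callable(trigger, payload)
        self.already_dispatched = False

    def poll(self):
        """Called periodically; emits the event only once per report."""
        if self.already_dispatched or not os.path.exists(self.report_path):
            return
        self.dispatch("mycompany.report_ready", {"path": self.report_path})
        self.already_dispatched = True


# Usage: collect dispatched events in a list instead of sending them anywhere.
events = []
report = os.path.join(tempfile.mkdtemp(), "report.csv")
sensor = ReportReadySensor(report, lambda trigger, payload: events.append((trigger, payload)))

sensor.poll()              # report missing: nothing dispatched
open(report, "w").close()  # accounting publishes the report
sensor.poll()              # report present: event dispatched
sensor.poll()              # already dispatched: no duplicate event
```

A rule matching the dispatched trigger would then run the analysis action exactly once, with no failed executions piling up while the report is late.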

"Check and run" action approach

Another way to approach it would be to write a workflow / action which checks whether the report is ready and, when it is, also generates a report (e.g. `generate_report_if_available`). You would then use an interval timer to run this action every day or similar.
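
A minimal sketch of such a "check and run" action, assuming the report is a local file and the analysis is a plain function. The signature, the `analyze` parameter, and the returned dictionary shape are illustrative, not a StackStorm API.

```python
import os
import tempfile


def generate_report_if_available(report_path, analyze):
    """Run the analysis only if the report exists; otherwise succeed as a no-op."""
    if not os.path.exists(report_path):
        # Not an error: the timer will simply call us again on the next interval.
        return {"ran": False, "reason": "report not published yet"}
    with open(report_path) as f:
        return {"ran": True, "result": analyze(f.read())}


# Usage: the first call finds no report, the second runs the analysis.
report = os.path.join(tempfile.mkdtemp(), "report.txt")
first = generate_report_if_available(report, len)

with open(report, "w") as f:
    f.write("1,2,3")
second = generate_report_if_available(report, len)
```

Because a missing report is treated as a successful no-op, the execution history shows quiet daily checks rather than failures. A real action would also need some way to remember that the current report has already been processed.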

Those are both fairly common patterns in the StackStorm land :)

arm4b · 2018-03-10T12:10:05Z

There is another limitation: the retry policy does not allow more than 5 attempts,
as reported and discussed in this forum thread: https://forum.stackstorm.com/t/best-way-to-persist-an-action-on-failure/29

I agree that a maximum of 5 retries could be an impractical setting in some circumstances.
This is similar to what was discussed here, where the retry delay was capped at 5s (now raised to 120s).

Ideally there would be no limits, along with guarantees of state persistence between service restarts, which would require reworking the implementation.

Otherwise we're not helping our users but encouraging them to work around the limits with hacks, while they have established expectations that the retry policy will work and can be relied on in tough situations.

Kami added complexity:medium enhancement labels Jul 30, 2017

Kami self-assigned this Aug 1, 2017

Kami mentioned this issue Aug 1, 2017

Increase max retry delay for retry policy from 5 to 120 seconds #3637

Merged

Kami mentioned this issue Oct 23, 2018

Add delay for task scheduling #4397

Merged

arm4b added the policies label Sep 10, 2020
