[Research] Retries for certain HTTP codes #100

@alexellis

Description

This issue is to gather research and opinions on how to tackle retries for certain HTTP codes.

Expected Behaviour

If a function returns certain errors such as 429 (Too Many Requests, as returned when the max_inflight limit is set in the function's watchdog), then the queue-worker could retry the request a number of times.
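As a starting point, here is a minimal sketch of the retry decision the queue-worker might make. The shouldRetry helper and the attempt cap are hypothetical, not part of the current code:

```go
package main

import "net/http"

// shouldRetry is a hypothetical helper: it returns true when the
// function's response code indicates a transient condition worth
// retrying and the attempt budget is not yet exhausted.
func shouldRetry(statusCode, attempt, maxAttempts int) bool {
	if attempt >= maxAttempts {
		return false
	}
	switch statusCode {
	case http.StatusTooManyRequests, // 429 from the watchdog's max_inflight limit
		http.StatusServiceUnavailable: // 503, e.g. function not ready
		return true
	}
	return false
}
```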

Current Behaviour

The failure will be logged, but the request is not retried.

It does seem like retries will be made implicitly if the function invocation takes longer than the "ack window".

So if a function takes 2m to finish and the ack window is 30s, that invocation will be retried, possibly indefinitely in the current implementation.
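For reference, the "ack window" corresponds to the AckWait on the NATS Streaming subscription. A sketch of how a subscriber sets it with the nats-io/stan.go client (cluster ID, client ID and subject are illustrative):

```go
package main

import (
	"log"
	"time"

	stan "github.com/nats-io/stan.go"
)

func main() {
	// Connect to NATS Streaming (IDs are illustrative).
	sc, err := stan.Connect("faas-cluster", "faas-worker-example")
	if err != nil {
		log.Fatal(err)
	}
	defer sc.Close()

	// With manual acks and a 30s AckWait, any message that is not
	// acked within 30s is redelivered by the server - the implicit
	// "retry" described above.
	_, err = sc.Subscribe("faas-request", func(m *stan.Msg) {
		// ... invoke the function, then ack only on completion ...
		if err := m.Ack(); err != nil {
			log.Println("ack failed:", err)
		}
	},
		stan.DurableName("faas-worker"),
		stan.SetManualAckMode(),
		stan.AckWait(30*time.Second),
	)
	if err != nil {
		log.Fatal(err)
	}

	select {} // block forever (example only)
}
```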

Possible Solution

I'd like to gather some use-cases and requests from users on how they expect this to work.

Context

@matthiashanel also has some suggestions on how the new NATS JetStream project could help with this use-case.

@andeplane recently told me about a custom fork / patch that retries up to 5 times with an exponential back-off whenever a 429 error is received. My concern with an exponential back-off in the current implementation is that it effectively shortens the ack window and could cause undefined behaviour.
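To illustrate the interaction: with a base delay of 1s and 5 attempts, the back-off sleeps alone add up to 31s, which already exceeds a 30s ack window before any invocation time is counted. A small sketch with illustrative numbers:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		maxAttempts = 5
		baseDelay   = 1 * time.Second
		ackWait     = 30 * time.Second // the queue-worker's ack window
	)

	var totalBackoff time.Duration
	for attempt := 0; attempt < maxAttempts; attempt++ {
		// Exponential back-off: 1s, 2s, 4s, 8s, 16s.
		totalBackoff += baseDelay * time.Duration(1<<attempt)
	}

	fmt.Printf("total back-off: %v, ack window: %v\n", totalBackoff, ackWait)
	if totalBackoff > ackWait {
		// The broker would redeliver the message before the in-process
		// retries finish, so both retry paths run at once.
		fmt.Println("back-off alone exceeds the ack window")
	}
}
```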

The Linkerd team also caution about automatic retries in their documentation, noting the risk of cascading failure. From "How Retries Can Go Wrong": "Choosing a maximum number of retry attempts is a guessing game" and "Systems configured this way are vulnerable to retry storms" -> https://linkerd.io/2/features/retries-and-timeouts/

The team discusses a "retry budget"; should we look into this?
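For context, a retry budget limits retries to a fraction of recent original requests rather than a fixed per-request count. A minimal in-memory sketch of the idea (this is not Linkerd's implementation and omits time windowing):

```go
package main

import (
	"fmt"
	"sync"
)

// retryBudget allows retries only while they stay below a fixed ratio
// of the original requests seen so far - a simplification of Linkerd's
// windowed retry budget.
type retryBudget struct {
	mu       sync.Mutex
	requests int
	retries  int
	ratio    float64 // e.g. 0.2 => at most 20% extra load from retries
}

func (b *retryBudget) recordRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.requests++
}

func (b *retryBudget) allowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if float64(b.retries) >= b.ratio*float64(b.requests) {
		return false
	}
	b.retries++
	return true
}

func main() {
	budget := &retryBudget{ratio: 0.2}
	for i := 0; i < 10; i++ {
		budget.recordRequest()
	}
	fmt.Println(budget.allowRetry()) // true: 0 retries vs 10 requests
	fmt.Println(budget.allowRetry()) // true: 1 retry vs 10 requests
	fmt.Println(budget.allowRetry()) // false: 2 >= 0.2 * 10
}
```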

Should individual functions be able to express retry data via an annotation? For example, a back-off of 2, 4, then 8 seconds may be valid for processing an image, but retrying a Tweet after Twitter's API has rate-limited us for 4 hours clearly will not work.
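If per-function policy were expressed through annotations, the queue-worker could parse them when handling a message. A sketch with hypothetical annotation names (com.openfaas.retry.attempts and com.openfaas.retry.interval are not real annotations today) and illustrative defaults:

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// retryPolicy is a hypothetical per-function policy read from
// annotations on the function's deployment.
type retryPolicy struct {
	Attempts int
	Interval time.Duration
}

// policyFromAnnotations falls back to defaults when the (hypothetical)
// annotations are missing or malformed.
func policyFromAnnotations(annotations map[string]string) retryPolicy {
	policy := retryPolicy{Attempts: 3, Interval: 2 * time.Second}

	if v, ok := annotations["com.openfaas.retry.attempts"]; ok {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			policy.Attempts = n
		}
	}
	if v, ok := annotations["com.openfaas.retry.interval"]; ok {
		if d, err := time.ParseDuration(v); err == nil {
			policy.Interval = d
		}
	}
	return policy
}

func main() {
	p := policyFromAnnotations(map[string]string{
		"com.openfaas.retry.attempts": "5",
		"com.openfaas.retry.interval": "4s",
	})
	fmt.Printf("%+v\n", p) // {Attempts:5 Interval:4s}
}
```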

What happens if we cannot retry a function call, as in the Twitter example above? Where does the message go, and how is it persisted? See also the call for a dead-letter queue in #81.

Finally, if we do start retrying, that metadata seems key to operational tuning of the system and to auto-scaling. Should it be exposed via Prometheus metrics and an HTTP /metrics endpoint?
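If retries were exposed via Prometheus, a counter labelled by function and status code would be a natural starting point. A sketch using the standard client_golang library (metric name, labels and port are illustrative):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// retriesTotal counts retried invocations, labelled so operators can
// see which functions and status codes drive retries.
var retriesTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "queue_worker_retries_total", // illustrative metric name
		Help: "Number of retried function invocations.",
	},
	[]string{"function_name", "status_code"},
)

func main() {
	prometheus.MustRegister(retriesTotal)

	// Record a retry, e.g. after receiving a 429 from the watchdog.
	retriesTotal.WithLabelValues("resize-image", "429").Inc()

	// Expose the metrics over HTTP for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8081", nil))
}
```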
