[BUG] Retriability of timeouts appears inconsistent #5368

Open
2 tasks done
fg91 opened this issue May 15, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@fg91
Member

fg91 commented May 15, 2024

Describe the bug

Let's consider this task:

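from datetime import timedelta
from time import sleep
from flytekit import task

@task(timeout=timedelta(seconds=120), retries=10)
def foo():
    # Sleeps far longer than the 120 s timeout, so every attempt times out.
    sleep(3600)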

I'm seeing two attempts:

[Two screenshots, taken 2024-05-15 at 11:45:33 and 11:45:40]
  • The first timeout must have been counted as retriable, since we see a second attempt.
  • The second timeout must have been counted as non-retriable, since we don't see further attempts even though we configured retries=10.

Expected behavior

I would have expected either no retry or 10 retries.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
fg91 added the bug and untriaged labels on May 15, 2024

runllm bot commented May 15, 2024

Timeouts in Flyte can be inconsistent due to how retries are handled for different types of errors. According to this discussion, node timeouts can sometimes lead to an "unknown" state, making it difficult to debug. Additionally, another discussion highlights that retries might not always behave as expected, especially with map tasks.

Would you like more detail on how to configure retries and handle timeouts in Flyte?

This is an AI-generated response and your feedback is appreciated! Please leave a 👍 if this is helpful and 👎 if it is not.

Mention @RunLLM to ask follow-up questions. See runllm.com for more!

eapolinario removed the untriaged label on May 16, 2024
@hamersaw
Contributor

This seems to be caused by the distinction between active and execution deadlines.
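
To illustrate how two separate deadlines could produce exactly two attempts, here is a toy model in plain Python. It is not Flyte's implementation: the function name `simulate_attempts`, the assumption that a per-attempt "execution deadline" gives a retryable timeout while an overall "active deadline" gives a terminal one, and the 240 s value for the active deadline are all hypothetical and chosen only to match the observed behavior.

```python
def simulate_attempts(task_runtime, execution_deadline, active_deadline, retries):
    """Toy model (hypothetical, not Flyte code).

    execution_deadline: assumed per-attempt limit; exceeding it is retryable.
    active_deadline: assumed limit across all attempts; exceeding it is terminal.
    Returns the list of attempt outcomes, in order.
    """
    outcomes = []
    elapsed = 0.0
    for _ in range(retries + 1):
        # An attempt may run until whichever deadline is hit first.
        budget = min(execution_deadline, active_deadline - elapsed)
        if task_runtime <= budget:
            outcomes.append("succeeded")
            break
        elapsed += budget
        if elapsed >= active_deadline:
            # Overall budget is exhausted: no further retries.
            outcomes.append("timeout: terminal")
            break
        outcomes.append("timeout: retryable")
    return outcomes


# Hypothetical values: 120 s per attempt, 240 s overall, retries=10.
print(simulate_attempts(3600, 120, 240, 10))
```

Under these assumptions the model yields one retryable timeout followed by one terminal timeout. Setting the active deadline equal to the execution deadline gives a single attempt, and removing it (`float("inf")`) gives all 11 attempts, which are the two outcomes the issue describes as expected.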
