Addresses two race conditions in background task runs #14115

Merged
merged 3 commits into main from wait-for-task-event-subscription on Jun 18, 2024

Conversation

chrisguidry
Collaborator

During reliability testing of background tasks, I kept experiencing hangs that
couldn't be explained by task deadlocking due to the `limit` on the worker (see
#14092). Here I'm addressing two causes of them:

1. Returning the first singleton instance of `TaskRunWaiter` before it is
   confirmed to be listening to the websocket. Using an `asyncio.Event` to
   signal that the socket is actually connected before proceeding (see the
   first sketch below).
2. When `PrefectDistributedFuture` was waiting for task runs to complete, it
   would ask the API if the task run was complete first, then go on to start
   waiting. If the task changed state in between those calls, it would likely
   be missed entirely (see the second sketch below).
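
The following is a minimal sketch of the first fix, not the actual Prefect implementation: the class, attribute, and method names (`TaskRunWaiterSketch`, `_connected`, `instance`) are illustrative. The idea is that the singleton is only handed out once an `asyncio.Event` confirms the websocket subscriber is actually listening.

```python
import asyncio
from typing import Optional


class TaskRunWaiterSketch:
    """Illustrative stand-in for a singleton that listens for task-run events."""

    _instance: Optional["TaskRunWaiterSketch"] = None

    def __init__(self) -> None:
        self._connected = asyncio.Event()  # set once the subscriber is live
        self._consumer_task: Optional[asyncio.Task] = None

    async def _consume_events(self) -> None:
        # Stand-in for the real websocket subscription loop.
        self._connected.set()  # signal "actually listening" before consuming
        while True:
            await asyncio.sleep(1)  # placeholder for receiving task-run events

    @classmethod
    async def instance(cls) -> "TaskRunWaiterSketch":
        if cls._instance is None:
            cls._instance = cls()
            cls._instance._consumer_task = asyncio.create_task(
                cls._instance._consume_events()
            )
        # Don't return the singleton until the subscription is confirmed live,
        # so callers can't race ahead of the listener.
        await cls._instance._connected.wait()
        return cls._instance
```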
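
And here is a sketch of the second race and one way to close the window; the helpers (`api.is_complete`, `waiter.add_done_callback`, `waiter.wait_for`) are hypothetical names, not the real `PrefectDistributedFuture` API. The racy ordering polls the API and only then starts waiting, so a completion that lands between the two steps is never observed; registering the waiter first means the terminal state is caught by one path or the other.

```python
import asyncio


async def wait_racy(task_run_id, api, waiter) -> None:
    if await api.is_complete(task_run_id):  # 1. poll the API first...
        return
    # <-- a completion landing here was never subscribed to...
    await waiter.wait_for(task_run_id)      # 2. ...so this wait can hang forever


async def wait_fixed(task_run_id, api, waiter) -> None:
    done = asyncio.Event()
    waiter.add_done_callback(task_run_id, done.set)  # 1. register interest first
    if await api.is_complete(task_run_id):           # 2. then poll the API
        return                                       #    (covers already-finished runs)
    await done.wait()                                # 3. covers runs that finish later
```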

With these changes in place, I've run a large number of my test tasks (see
PrefectHQ/nebula#7962, which has a task with 5 layers of dependencies) at
`TASKS=100` and have experienced no hangs. I did get a hang with `TASKS=1000`,
but this is a marked improvement from where we were before.

Part of #14098

@chrisguidry chrisguidry requested a review from a team as a code owner June 18, 2024 17:10
@desertaxle desertaxle added the fix A fix for a bug in an existing feature label Jun 18, 2024
@@ -76,8 +77,9 @@ def __init__(
limit: Optional[int] = 10,
):
self.tasks: List[Task] = list(tasks)
self.task_keys = set(t.task_key for t in tasks if isinstance(t, Task))
Collaborator Author

The refactoring in this file is just to make the `/status` endpoint more useful and isn't part of the fix here.

Member

@desertaxle desertaxle left a comment


LGTM!

@chrisguidry chrisguidry merged commit c0f803e into main Jun 18, 2024
26 checks passed
@chrisguidry chrisguidry deleted the wait-for-task-event-subscription branch June 18, 2024 19:27