Skip to content

fix: honor request waiter deadlines before admission#681

Merged
eric-tramel merged 1 commit into
scheduling-yolofrom
andreatgretel/fix/request-waiter-deadlines
May 19, 2026
Merged

fix: honor request waiter deadlines before admission#681
eric-tramel merged 1 commit into
scheduling-yolofrom
andreatgretel/fix/request-waiter-deadlines

Conversation

@andreatgretel
Copy link
Copy Markdown
Contributor

Summary

  • Expire queued request waiters before selecting them for admission.
  • Check waiter deadlines before re-enqueuing a woken waiter.
  • Add a regression for an async waiter whose deadline passes before another thread releases request capacity.

Why

A queued async waiter could receive a request lease after its deadline_monotonic had already passed if the event loop was stalled and capacity was released from another thread. Queue wait timeouts should be terminal before admission assigns a lease.

Validation

  • .venv/bin/ruff check --fix .
  • .venv/bin/ruff format .
  • .venv/bin/pytest packages/data-designer-engine/tests/engine/models/request_admission/test_controller.py
  • .venv/bin/pytest packages/data-designer-engine/tests/engine/dataset_builders/scheduling/test_resolver.py packages/data-designer-engine/tests/engine/dataset_builders/test_async_scheduler.py packages/data-designer-engine/tests/engine/models/request_admission/test_controller.py packages/data-designer-engine/tests/engine/models/clients/test_model_request_executor.py

@eric-tramel eric-tramel marked this pull request as ready for review May 19, 2026 20:34
@eric-tramel eric-tramel requested a review from a team as a code owner May 19, 2026 20:34
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 19, 2026

Greptile Summary

This PR fixes a race where an async waiter could receive an admission lease after its deadline_monotonic had already passed, caused by the event loop being stalled while a separate thread released capacity. The fix adds two complementary guards: _expire_waiters_locked removes and wakes any deadline-exceeded waiters before the queue is consulted for selection or the "any waiter ahead" check, and a new early-exit block at the top of both acquire loops rejects a re-woken waiter whose deadline has since expired before it can be re-enqueued.

  • controller.py: introduces _expire_waiters_locked, calls it in _admit_waiters_locked and _queued_waiter_ahead_locked, and adds the leading deadline guard to both acquire_sync and acquire_async.
  • queue.py: adds waiters() returning a dict-values snapshot so _expire_waiters_locked can safely remove entries while iterating.
  • test_controller.py: adds a targeted regression that blocks the event loop with time.sleep after a cross-thread release; _wait_seconds_locked is patched to 10 s to ensure the wakeup arrives only via _expire_waiters_locked, exactly exercising the race path described in the PR.

Confidence Score: 5/5

The change is safe to merge; it narrows admission to waiters whose deadlines have not yet elapsed and adds no new mutable shared state.

All three touch points — admission selection, the queue-ahead check, and both acquire loops — receive consistent expiry handling. The waiters() snapshot prevents concurrent-modification issues, and double-removal via _remove_waiter_locked is explicitly guarded as a no-op. The regression test exercises the cross-thread/stalled-loop race precisely by patching the wait duration, making the only wakeup path the one under test.

No files require special attention; all three changed files are straightforward and internally consistent.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/models/request_admission/controller.py Adds _expire_waiters_locked to purge and wake deadline-exceeded waiters before admission or queue-ahead checks, and adds an early deadline guard at the top of both acquire loops to reject woken-but-expired waiters before they can be re-enqueued.
packages/data-designer-engine/src/data_designer/engine/models/request_admission/queue.py Adds waiters() method returning a snapshot tuple of all queued waiters; the snapshot is necessary to avoid mutating the backing dict while _expire_waiters_locked iterates it.
packages/data-designer-engine/tests/engine/models/request_admission/test_controller.py Adds a regression test that intentionally stalls the event loop with time.sleep after a cross-thread release to verify an expired async waiter raises RequestAdmissionError rather than receiving a lease; _wait_seconds_locked is patched to 10 s to force the wakeup to come only from _expire_waiters_locked, proving the specific race path.

Sequence Diagram

sequenceDiagram
    participant C as Caller (async)
    participant L as Event Loop
    participant AC as AdmissionController
    participant RT as Release Thread

    C->>AC: "acquire_async(item, timeout=10ms)"
    AC->>AC: "enqueue waiter (deadline = now+10ms)"
    AC->>AC: _admit_waiters_locked → no capacity
    AC->>L: "await asyncio.wait_for(wakeup, timeout=10s)"

    Note over C,RT: Event loop blocks on time.sleep(60ms)

    RT->>AC: release(lease)
    AC->>AC: _admit_waiters_locked()
    AC->>AC: _expire_waiters_locked(now≥deadline)
    AC->>AC: _remove_waiter_locked(waiter)
    AC->>L: call_soon_threadsafe(wakeup.set)

    Note over C,RT: Event loop resumes after sleep

    L->>AC: wakeup fires → acquire loop iterates
    AC->>AC: Check assigned_lease → None
    AC->>AC: Check deadline → now ≥ deadline
    AC->>AC: _remove_waiter_locked (no-op, already removed)
    AC-->>C: raise RequestAdmissionError(queue_timeout)
Loading

Reviews (1): Last reviewed commit: "fix request waiter deadline admission" | Re-trigger Greptile

@eric-tramel eric-tramel merged commit 387d5a5 into scheduling-yolo May 19, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants