Scheduler stalls can turn healthy request waiters into queue_timeout row drops #725

@eric-tramel

Priority Level

High (Major functionality broken)

Describe the bug

Scheduler stalls can turn otherwise healthy request waiters into queue_timeout failures and dropped rows. In large async runs with healthy mock endpoints, the scheduler event loop stalled long enough that request-admission waiters timed out locally. Those timeout outcomes were then treated as non-retryable drops, even though the endpoint was not failing.

This creates a false failure mode under high scheduler load: rows are dropped because the scheduler cannot revisit waiting work quickly enough, not because the provider is unavailable.
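The mechanism can be shown in isolation with plain asyncio. This is a minimal sketch, not the DataDesigner scheduler: `request`, `bookkeeping`, and `run` are hypothetical names, the endpoint cap is reduced to 1, and the stall is simulated with a blocking `time.sleep`. The mock endpoint always succeeds in 10 ms, yet the second waiter times out once the loop is blocked for longer than its wait timeout.

```python
import asyncio
import time
from collections import Counter


async def request(endpoint: asyncio.Semaphore, wait_timeout: float, outcomes: Counter) -> None:
    """One request: wait for an endpoint slot, then call a healthy mock endpoint."""
    try:
        await asyncio.wait_for(endpoint.acquire(), timeout=wait_timeout)
    except asyncio.TimeoutError:
        # The local admission wait expired; the endpoint itself never failed.
        outcomes["queue_timeout"] += 1
        return
    try:
        await asyncio.sleep(0.01)  # healthy mock endpoint, 10 ms latency
        outcomes["ok"] += 1
    finally:
        endpoint.release()


async def bookkeeping(stall_s: float) -> None:
    """Hypothetical stand-in for queue observation/selection that blocks the loop."""
    await asyncio.sleep(0)  # let the requests start waiting first
    time.sleep(stall_s)  # synchronous work: no other task can run meanwhile


async def run(stall_s: float) -> Counter:
    endpoint = asyncio.Semaphore(1)  # endpoint cap of 1 (assumption, for brevity)
    outcomes: Counter = Counter()
    await asyncio.gather(
        request(endpoint, 0.2, outcomes),
        request(endpoint, 0.2, outcomes),  # this one has to wait for the slot
        bookkeeping(stall_s),
    )
    return outcomes


if __name__ == "__main__":
    print("no stall:   ", dict(asyncio.run(run(0.0))))
    print("0.5 s stall:", dict(asyncio.run(run(0.5))))
```

When the loop resumes after the stall, the waiter's timeout is already overdue and fires before the slot holder gets a chance to release, so the waiter is cancelled even though the slot would have been free 10 ms into the run.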

Steps/Code to reproduce bug

Use a throwaway async scheduler harness with mock endpoints and no DataDesigner product-code changes. Configure a very large logical job with high request pressure and bounded timeboxes:

  • 500k to 1M logical records
  • wide and fanout shapes with 8 to 12 LLM-like columns
  • endpoint caps around 128 to 512
  • max_in_flight_tasks above endpoint cap so waiters can accumulate
  • queue wait timeout set low enough to expose the stall, for example 2-5s
  • no injected provider failures for the healthy-endpoint control
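For reference, the knobs above could be captured in a small config object. This is only a sketch: `HarnessCase` and its field names are hypothetical and are not taken from the harness. The record and column counts are placeholders picked from the stated ranges, not the exact values used in the runs. The waiter estimate assumes each in-flight task holds at most one outstanding request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessCase:
    """Hypothetical description of one reproduction case."""

    logical_records: int
    llm_columns: int
    endpoint_cap: int
    max_in_flight_tasks: int
    request_wait_timeout_s: float
    injected_failure_rate: float = 0.0  # 0.0 = healthy-endpoint control

    @property
    def max_request_waiters(self) -> int:
        # Upper bound on tasks parked behind the endpoint cap, assuming
        # one outstanding request per in-flight task.
        return max(0, self.max_in_flight_tasks - self.endpoint_cap)


# Placeholder values chosen from the ranges listed above.
CASES = [
    HarnessCase(500_000, 8, endpoint_cap=512, max_in_flight_tasks=1_024, request_wait_timeout_s=2.0),
    HarnessCase(500_000, 8, endpoint_cap=512, max_in_flight_tasks=2_048, request_wait_timeout_s=5.0),
]

if __name__ == "__main__":
    for case in CASES:
        print(case.max_in_flight_tasks, "in flight ->", case.max_request_waiters, "possible waiters")
```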

Representative observed cases:

| Case | Endpoint failures injected | Request wait timeout | Outcome |
| --- | --- | --- | --- |
| large wide job, cap 512, max in-flight 1,024 | none | 2s | hundreds of queue_timeout events and dropped rows |
| large wide job, cap 512, max in-flight 2,048 | none | 5s | more than 1,000 queue_timeout events and dropped rows |
| fixed-frontier healthy control | none | larger timeout | no queue_timeout drops, but severe queue overhead |

Expected behavior

Healthy request waiters should not become non-retryable row drops solely because scheduler bookkeeping stalls the event loop. The scheduler should either:

  • keep request admission aligned with provider capacity so excessive waiters are not created,
  • yield often enough that local request wait timeouts reflect real request pressure,
  • classify scheduler-induced wait timeouts separately from provider failures, or
  • retry/salvage these outcomes instead of dropping rows as non-retryable.
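One way the third option could work is to measure event-loop lag and attach it to the timeout decision. The sketch below is an assumption-laden illustration: `LoopLagMonitor`, `classify_wait_timeout`, the label strings, and the 0.5 threshold are all hypothetical and would need tuning against real runs.

```python
import asyncio
import contextlib
import time


class LoopLagMonitor:
    """Hypothetical helper: records how late periodic wakeups arrive."""

    def __init__(self, interval_s: float = 0.01) -> None:
        self.interval_s = interval_s
        self.max_lag_s = 0.0

    async def run(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            start = loop.time()
            await asyncio.sleep(self.interval_s)
            # Any delay beyond the requested interval is time the loop was busy.
            lag = loop.time() - start - self.interval_s
            self.max_lag_s = max(self.max_lag_s, lag)


def classify_wait_timeout(wait_timeout_s: float, max_loop_lag_s: float, fraction: float = 0.5) -> str:
    """Label a local wait timeout; the 0.5 fraction is an arbitrary assumption."""
    if max_loop_lag_s >= fraction * wait_timeout_s:
        return "scheduler_stall"  # retryable or diagnostic, not a provider failure
    return "request_pressure"


async def demo(stall_s: float, wait_timeout_s: float = 0.2) -> str:
    monitor = LoopLagMonitor()
    task = asyncio.create_task(monitor.run())
    await asyncio.sleep(0.02)  # let the monitor take a few samples
    time.sleep(stall_s)  # simulated scheduler bookkeeping stall
    await asyncio.sleep(0.05)  # let the monitor observe the late wakeup
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return classify_wait_timeout(wait_timeout_s, monitor.max_lag_s)


if __name__ == "__main__":
    print(asyncio.run(demo(0.0)))
    print(asyncio.run(demo(0.3)))
```

The same lag figure could also be carried on the error event so that a dropped row can be traced back to a stall rather than to the provider.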

Agent Diagnostic / Prior Investigation

The investigation found periods where the scheduler had pending request waiters and ready work, but queue observation/selection monopolized the event loop. In the runs that reproduced the problem (those with lower wait-timeout settings), this produced request_wait_timeout and non_retryable_dropped event patterns despite healthy mock endpoints.

This is related to, but distinct from, the queue CPU bottleneck: the queue bottleneck explains the stall, while this issue tracks the data-loss/failure behavior caused by interpreting scheduler-induced request wait timeouts as row drops.

Additional context

Suggested direction:

  • Bound scheduler task admission closer to request-admission capacity.
  • Treat local request wait timeout under scheduler stall as retryable or diagnostic, not as a provider/data failure.
  • Include queue-lag/request-waiter context in error events.
  • Add tests with healthy endpoints asserting that scheduler load alone cannot cause row drops.
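The first and last bullets can be combined into a small regression-style check. This is a toy model under stated assumptions, not the real scheduler: `run_rows` is a hypothetical name, each row issues exactly one request, the endpoint is a healthy 10 ms mock, and the stall is a single blocking `time.sleep` in the admission loop once admission saturates. With `max_in_flight` equal to the endpoint cap, no waiter exists when the stall hits, so nothing can time out.

```python
import asyncio
import time


async def run_rows(n_rows: int, endpoint_cap: int, max_in_flight: int,
                   wait_timeout_s: float, stall_s: float) -> int:
    """Return the number of rows dropped on a local wait timeout."""
    endpoint = asyncio.Semaphore(endpoint_cap)
    admission = asyncio.Semaphore(max_in_flight)
    dropped = 0

    async def row() -> None:
        nonlocal dropped
        try:
            try:
                await asyncio.wait_for(endpoint.acquire(), timeout=wait_timeout_s)
            except asyncio.TimeoutError:
                dropped += 1  # current behavior: treated as a non-retryable drop
                return
            try:
                await asyncio.sleep(0.01)  # healthy mock endpoint
            finally:
                endpoint.release()
        finally:
            admission.release()

    tasks = []
    for i in range(n_rows):
        await admission.acquire()  # backpressure: no timeout on admission itself
        tasks.append(asyncio.create_task(row()))
        await asyncio.sleep(0)  # yield so the new row can start waiting
        if i == max_in_flight - 1:
            time.sleep(stall_s)  # one simulated bookkeeping stall at saturation
    await asyncio.gather(*tasks)
    return dropped


if __name__ == "__main__":
    over = asyncio.run(run_rows(16, endpoint_cap=4, max_in_flight=8, wait_timeout_s=0.1, stall_s=0.3))
    bounded = asyncio.run(run_rows(16, endpoint_cap=4, max_in_flight=4, wait_timeout_s=0.1, stall_s=0.3))
    print("over-admitted drops:", over, "| bounded drops:", bounded)
```

Bounding admission hides the symptom in this model but does not remove the stall itself, so it is probably best paired with the retryable classification rather than used alone.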

Local artifact paths and machine identifiers from the investigation were intentionally omitted from this issue.

Checklist

  • I reproduced this issue or provided a minimal example
  • I searched the docs/issues myself, or had my agent do so
  • If I used an agent, I included its diagnostics above

Labels: bug