Priority Level
High (Major functionality broken)
Describe the bug
Scheduler stalls can turn otherwise healthy request waiters into queue_timeout failures and dropped rows. In large async runs with healthy mock endpoints, the scheduler event loop stalled long enough that request-admission waiters timed out locally. Those timeout outcomes were then treated as non-retryable drops, even though the endpoint was not failing.
This creates a false failure mode under high scheduler load: rows are dropped because the scheduler cannot revisit waiting work quickly enough, not because the provider is unavailable.
Steps/Code to reproduce bug
Use a throwaway async scheduler harness with mock endpoints and no DataDesigner product-code changes. Configure a very large logical job with high request pressure and bounded timeboxes:
- 500k to 1M logical records
- wide and fanout shapes with 8 to 12 LLM-like columns
- endpoint caps around 128 to 512
- max_in_flight_tasks above endpoint cap so waiters can accumulate
- queue wait timeout set low enough to expose the stall, for example 2-5s
- no injected provider failures for the healthy-endpoint control
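A much smaller version of this setup can be sketched with plain asyncio. This is an illustrative harness, not the one used in the investigation: `healthy_endpoint`, `request`, `bookkeeping`, and `run_harness` are hypothetical names, the semaphore stands in for the endpoint cap, and a synchronous `time.sleep` stands in for scheduler bookkeeping that blocks the event loop.

```python
import asyncio
import time


async def healthy_endpoint(latency_s: float = 0.005) -> str:
    """Mock endpoint (hypothetical) that always succeeds after a short delay."""
    await asyncio.sleep(latency_s)
    return "ok"


async def request(sem: asyncio.Semaphore, wait_timeout_s: float, outcomes: list) -> None:
    """Wait for admission under the endpoint cap, then call the healthy endpoint."""
    try:
        await asyncio.wait_for(sem.acquire(), timeout=wait_timeout_s)
    except asyncio.TimeoutError:
        # Local admission timeout: the endpoint was never contacted.
        outcomes.append("queue_timeout")
        return
    try:
        outcomes.append(await healthy_endpoint())
    finally:
        sem.release()


async def bookkeeping(stall_s: float) -> None:
    """Stand-in (assumption) for scheduler work that blocks the event loop."""
    await asyncio.sleep(0)  # let the request tasks start waiting first
    time.sleep(stall_s)     # synchronous: no other task can run meanwhile


async def run_harness(endpoint_cap: int, max_in_flight: int,
                      wait_timeout_s: float, stall_s: float) -> list:
    """Run all requests plus one bookkeeping task; return the outcome list."""
    sem = asyncio.Semaphore(endpoint_cap)
    outcomes: list = []
    tasks = [
        asyncio.create_task(request(sem, wait_timeout_s, outcomes))
        for _ in range(max_in_flight)
    ]
    tasks.append(asyncio.create_task(bookkeeping(stall_s)))
    await asyncio.gather(*tasks)
    return outcomes


if __name__ == "__main__":
    for stall in (0.0, 0.5):
        result = asyncio.run(run_harness(4, 16, wait_timeout_s=0.2, stall_s=stall))
        print(f"stall={stall}s queue_timeout={result.count('queue_timeout')}")
```

With these toy numbers, the run without a stall should finish with no `queue_timeout` outcomes, while the stalled run should report timeouts for waiters whose deadline passed during the stall, even though the mock endpoint never fails.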
Representative observed cases:
| Case | Endpoint failures injected | Request wait timeout | Outcome |
| --- | --- | --- | --- |
| large wide job, cap 512, max in-flight 1,024 | none | 2s | hundreds of queue_timeout events and dropped rows |
| large wide job, cap 512, max in-flight 2,048 | none | 5s | more than 1,000 queue-timeout events and dropped rows |
| fixed-frontier healthy control | none | larger timeout | no queue-timeout drops, but severe queue overhead |
Expected behavior
Healthy request waiters should not become non-retryable row drops solely because scheduler bookkeeping stalls the event loop. The scheduler should either:
- keep request admission aligned with provider capacity so excessive waiters are not created,
- yield often enough that local request wait timeouts reflect real request pressure,
- classify scheduler-induced wait timeouts separately from provider failures, or
- retry/salvage these outcomes instead of dropping rows as non-retryable.
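The last two options could look roughly like the sketch below. It is a heuristic under stated assumptions, not the scheduler's actual logic: `WaitTimeoutCause`, `classify_wait_timeout`, `acquire_with_salvage`, and the `lag_tolerance_s` threshold are all hypothetical. The idea is that a wait timeout which fires much later than its configured deadline suggests the event loop was stalled, so the outcome is retried rather than dropped.

```python
import asyncio
import time
from enum import Enum


class WaitTimeoutCause(Enum):
    SCHEDULER_STALL = "scheduler_stall"    # timeout fired late: loop was busy
    REQUEST_PRESSURE = "request_pressure"  # timeout fired on time: real contention


def classify_wait_timeout(elapsed_s: float, timeout_s: float,
                          lag_tolerance_s: float = 0.1) -> WaitTimeoutCause:
    """Classify a local wait timeout by how far it overshot its deadline."""
    overshoot_s = elapsed_s - timeout_s
    if overshoot_s > lag_tolerance_s:
        return WaitTimeoutCause.SCHEDULER_STALL
    return WaitTimeoutCause.REQUEST_PRESSURE


async def acquire_with_salvage(sem: asyncio.Semaphore, timeout_s: float,
                               max_attempts: int = 3,
                               lag_tolerance_s: float = 0.1):
    """Try to acquire `sem`; retry only when the timeout looks stall-induced.

    Returns (acquired, causes), where `causes` lists the classification of
    every timeout seen along the way.
    """
    causes = []
    for _ in range(max_attempts):
        started = time.monotonic()
        try:
            await asyncio.wait_for(sem.acquire(), timeout=timeout_s)
            return True, causes
        except asyncio.TimeoutError:
            elapsed_s = time.monotonic() - started
            cause = classify_wait_timeout(elapsed_s, timeout_s, lag_tolerance_s)
            causes.append(cause)
            if cause is WaitTimeoutCause.REQUEST_PRESSURE:
                break  # genuine pressure: let the caller's normal policy decide
    return False, causes
```

Overshoot is only a proxy for loop lag. A stall that ends before the deadline can still delay releases without producing any overshoot, so a real fix would likely combine this with a direct loop-lag measurement.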
Agent Diagnostic / Prior Investigation
The investigation found periods where the scheduler had pending request waiters and active ready work, but queue observation/selection consumed the event loop. In the runs that reproduced the problem, which used lower wait-timeout settings, this produced request_wait_timeout and non_retryable_dropped event patterns despite healthy mock endpoints.
This is related to, but distinct from, the queue CPU bottleneck: the queue bottleneck explains the stall, while this issue tracks the data-loss/failure behavior caused by interpreting scheduler-induced request wait timeouts as row drops.
Additional context
Suggested direction:
- Bound scheduler task admission closer to request-admission capacity.
- Treat local request wait timeout under scheduler stall as retryable or diagnostic, not as a provider/data failure.
- Include queue-lag/request-waiter context in error events.
- Add tests for healthy endpoints where scheduler load cannot cause row drops.
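One way to attach queue-lag and request-waiter context to error events is sketched below. It assumes a heartbeat-style probe is acceptable in the scheduler; `LoopLagProbe`, `WaitTimeoutEvent`, `measure_stall`, and every field name are hypothetical and not taken from the existing event schema.

```python
import asyncio
import contextlib
import time
from dataclasses import asdict, dataclass


class LoopLagProbe:
    """Heartbeat that records how late the event loop wakes it up."""

    def __init__(self, interval_s: float = 0.01) -> None:
        self.interval_s = interval_s
        self.max_lag_s = 0.0

    async def run(self) -> None:
        while True:
            started = time.monotonic()
            await asyncio.sleep(self.interval_s)
            # Any time beyond the requested interval is loop lag.
            lag_s = time.monotonic() - started - self.interval_s
            self.max_lag_s = max(self.max_lag_s, lag_s)


@dataclass(frozen=True)
class WaitTimeoutEvent:
    """Error event carrying scheduler context (hypothetical field names)."""
    kind: str
    wait_timeout_s: float
    loop_lag_s: float
    pending_waiters: int
    in_flight: int


async def measure_stall(stall_s: float, wait_timeout_s: float,
                        pending_waiters: int, in_flight: int) -> WaitTimeoutEvent:
    """Simulate a blocking stall and build an event with the observed lag."""
    probe = LoopLagProbe()
    probe_task = asyncio.create_task(probe.run())
    await asyncio.sleep(0.02)  # let the probe start ticking
    time.sleep(stall_s)        # simulated bookkeeping that blocks the loop
    await asyncio.sleep(0.03)  # let the probe observe the late wake-up
    probe_task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await probe_task
    return WaitTimeoutEvent(
        kind="request_wait_timeout",
        wait_timeout_s=wait_timeout_s,
        loop_lag_s=probe.max_lag_s,
        pending_waiters=pending_waiters,
        in_flight=in_flight,
    )


if __name__ == "__main__":
    event = asyncio.run(measure_stall(0.2, 2.0, pending_waiters=512, in_flight=1024))
    print(asdict(event))
```

An event that carries `loop_lag_s` alongside the waiter counts lets a reader tell a stall-induced timeout from a provider problem without re-running the job. A healthy-endpoint test could use the same fields to assert that no row is dropped while the recorded lag is high.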
Local artifact paths and machine identifiers from the investigation were intentionally omitted from this issue.
Checklist