Scheduler stalls can turn healthy request waiters into queue_timeout row drops

### Priority Level

High (Major functionality broken)

### Describe the bug

Scheduler stalls can turn otherwise healthy request waiters into `queue_timeout` failures and dropped rows. In large async runs with healthy mock endpoints, the scheduler event loop stalled long enough that request-admission waiters timed out locally. Those timeout outcomes were then treated as non-retryable drops, even though the endpoint was not failing.

This creates a false failure mode under high scheduler load: rows are dropped because the scheduler cannot revisit waiting work quickly enough, not because the provider is unavailable.

### Steps/Code to reproduce bug

Use a throwaway async scheduler harness with mock endpoints and no DataDesigner product-code changes. Configure a very large logical job with high request pressure and bounded timeboxes:

- 500k to 1M logical records
- wide and fanout shapes with 8 to 12 LLM-like columns
- endpoint caps around 128 to 512
- `max_in_flight_tasks` above endpoint cap so waiters can accumulate
- queue wait timeout set low enough to expose the stall, for example 2-5s
- no injected provider failures for the healthy-endpoint control
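
The harness shape above can be approximated at toy scale. The following is a minimal sketch, not the harness used in the investigation: `run_harness`, `request`, `stalled_bookkeeping`, and the `stats` keys are hypothetical names, an `asyncio.Semaphore` stands in for the endpoint cap, and a blocking `time.sleep` stands in for scheduler bookkeeping that holds the event loop.

```python
import asyncio
import time


async def mock_endpoint() -> str:
    """Healthy mock endpoint: always succeeds after a short delay."""
    await asyncio.sleep(0.01)
    return "ok"


async def request(sem: asyncio.Semaphore, wait_timeout: float, stats: dict) -> None:
    """One logical request: wait for admission, then call the healthy endpoint."""
    try:
        # Local request-admission wait. This timer is the one that expires
        # while the event loop is blocked, even though the endpoint is fine.
        await asyncio.wait_for(sem.acquire(), timeout=wait_timeout)
    except asyncio.TimeoutError:
        stats["queue_timeout"] += 1  # hypothetical counter name
        return
    try:
        await mock_endpoint()
        stats["ok"] += 1
    finally:
        sem.release()


async def stalled_bookkeeping(stall_s: float) -> None:
    """Stand-in for scheduler bookkeeping: blocks the event loop (assumption)."""
    time.sleep(stall_s)


async def run_harness(n_requests: int, endpoint_cap: int,
                      wait_timeout: float, stall_s: float) -> dict:
    stats = {"ok": 0, "queue_timeout": 0}
    sem = asyncio.Semaphore(endpoint_cap)
    tasks = [asyncio.create_task(request(sem, wait_timeout, stats))
             for _ in range(n_requests)]
    # Created last, so the waiters are already queued when the stall begins.
    tasks.append(asyncio.create_task(stalled_bookkeeping(stall_s)))
    await asyncio.gather(*tasks)
    return stats
```

Calling `asyncio.run(run_harness(32, 8, wait_timeout=0.5, stall_s=0.0))` serves as the healthy control, while raising `stall_s` above `wait_timeout` (for example `0.8`) should produce `queue_timeout` counts without any endpoint failure.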

Representative observed cases:

| Case | Endpoint failures injected | Request wait timeout | Outcome |
| --- | --- | ---: | --- |
| large wide job, cap 512, max in-flight 1,024 | none | 2s | hundreds of `queue_timeout` events and dropped rows |
| large wide job, cap 512, max in-flight 2,048 | none | 5s | more than 1,000 `queue_timeout` events and dropped rows |
| fixed-frontier healthy control | none | larger timeout | no `queue_timeout` drops, but severe queue overhead |

### Expected behavior

Healthy request waiters should not become non-retryable row drops solely because scheduler bookkeeping stalls the event loop. The scheduler should either:

- keep request admission aligned with provider capacity so excessive waiters are not created,
- yield often enough that local request wait timeouts reflect real request pressure,
- classify scheduler-induced wait timeouts separately from provider failures, or
- retry/salvage these outcomes instead of dropping rows as non-retryable.
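
One way the third and fourth options could look is sketched below. It is a heuristic under stated assumptions, not the scheduler's actual API: `WaitOutcome`, `acquire_classified`, `lag_tolerance`, and the outcome names are hypothetical. The idea is that a timeout observed much later than its deadline suggests the event loop was blocked, so the outcome is tagged as scheduler-induced and retryable instead of being dropped.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class WaitOutcome:
    admitted: bool
    kind: str           # "admitted", "capacity_timeout", or "scheduler_stall_timeout"
    retryable: bool
    overshoot_s: float  # how late the timeout was observed, in seconds


async def acquire_classified(sem: asyncio.Semaphore, wait_timeout: float,
                             lag_tolerance: float = 0.05) -> WaitOutcome:
    """Wait for admission and classify a timeout by how late it was observed."""
    start = time.monotonic()
    try:
        await asyncio.wait_for(sem.acquire(), timeout=wait_timeout)
        return WaitOutcome(True, "admitted", False, 0.0)
    except asyncio.TimeoutError:
        overshoot = time.monotonic() - start - wait_timeout
        if overshoot > lag_tolerance:
            # The timer was serviced late, so the loop was likely stalled.
            return WaitOutcome(False, "scheduler_stall_timeout", True, overshoot)
        # Assumption: genuine capacity timeouts keep the existing drop policy.
        return WaitOutcome(False, "capacity_timeout", False, overshoot)


async def demo() -> tuple:
    sem = asyncio.Semaphore(0)  # never admits, so every wait times out
    quiet = await acquire_classified(sem, 0.05, lag_tolerance=0.2)
    task = asyncio.create_task(acquire_classified(sem, 0.05, lag_tolerance=0.2))
    await asyncio.sleep(0)  # let the waiter start its timer
    time.sleep(0.5)         # simulated scheduler stall blocking the loop
    stalled = await task
    return quiet.kind, stalled.kind
```

The overshoot check only catches stalls that extend past the deadline; a dedicated loop-lag probe would give better coverage, at the cost of more moving parts.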

### Agent Diagnostic / Prior Investigation

The investigation found periods where the scheduler had pending request waiters and active ready work, but queue observation/selection monopolized the event loop. In runs that reproduced the problem with lower wait-timeout settings, this produced `request_wait_timeout` and `non_retryable_dropped` event patterns despite healthy mock endpoints.

This is related to, but distinct from, the queue CPU bottleneck: the queue bottleneck explains the stall, while this issue tracks the data-loss/failure behavior caused by interpreting scheduler-induced request wait timeouts as row drops.

### Additional context

Suggested direction:

- Bound scheduler task admission closer to request-admission capacity.
- Treat local request wait timeout under scheduler stall as retryable or diagnostic, not as a provider/data failure.
- Include queue-lag/request-waiter context in error events.
- Add tests asserting that, with healthy endpoints, scheduler load alone cannot cause row drops.
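
The first suggestion can be illustrated with a dispatcher that takes an endpoint slot before it creates a task, so no per-row wait timer exists to expire during a stall. This is a sketch only: `run_bounded`, `worker`, and the in-loop `time.sleep` stall are hypothetical and do not reflect DataDesigner internals.

```python
import asyncio
import time


async def mock_endpoint(row: int) -> int:
    """Healthy mock endpoint: echoes the row after a short delay."""
    await asyncio.sleep(0.01)
    return row


async def run_bounded(rows, endpoint_cap: int, stall_s: float = 0.0) -> list:
    """Admit a row only when an endpoint slot is free (backpressure)."""
    slots = asyncio.Semaphore(endpoint_cap)
    results = []

    async def worker(row: int) -> None:
        try:
            results.append(await mock_endpoint(row))
        finally:
            slots.release()  # free the slot for the next admission

    tasks = []
    for i, row in enumerate(rows):
        # The dispatcher waits here without a timeout, so a stall delays
        # admission but cannot turn a waiting row into a drop.
        await slots.acquire()
        if stall_s and i == endpoint_cap:
            time.sleep(stall_s)  # simulated scheduler stall (assumption)
        tasks.append(asyncio.create_task(worker(row)))
    await asyncio.gather(*tasks)
    return results
```

A test along the lines of the last suggestion could then assert that `sorted(asyncio.run(run_bounded(range(32), 8, stall_s=0.3)))` contains every row.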

Local artifact paths and machine identifiers from the investigation were intentionally omitted from this issue.

### Checklist

- [x] I reproduced this issue or provided a minimal example
- [x] I searched the docs/issues myself, or had my agent do so
- [x] If I used an agent, I included its diagnostics above

