fix(async): pack of fixes for async engine under degraded providers#585
andreatgretel merged 10 commits into main
Conversation
The gate previously only excluded `ModelRateLimitError`, leaving `ModelTimeoutError`, `ModelInternalServerError`, and `ModelAPIConnectionError` to count toward the sliding-window error rate. Under provider degradation these errors cluster in time (concurrent in-flight requests time out together), so 5/10 in a row is easy and trips the gate even when salvage could recover the rows. Refs #575.
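The gate behavior described above can be sketched as a sliding-window counter that ignores retryable failures entirely. This is a minimal illustrative re-creation, not the actual DataDesigner code; the class name, window size, and threshold are assumptions.

```python
from collections import deque


class ErrorRateGate:
    """Sketch of a sliding-window early-shutdown gate (hypothetical names).

    Retryable failures (rate-limit, timeout, 5xx, connection) neither fill
    nor trip the window, so clustered transient errors under provider
    degradation can no longer cause an early shutdown on their own.
    """

    def __init__(self, window: int = 10, threshold: float = 0.5) -> None:
        self._outcomes: deque[bool] = deque(maxlen=window)  # True = failure
        self._threshold = threshold
        self.tripped = False

    def record(self, *, failed: bool, retryable: bool) -> None:
        # Retryable failures are excluded from the gate entirely.
        if failed and retryable:
            return
        self._outcomes.append(failed)
        if (
            len(self._outcomes) == self._outcomes.maxlen
            and sum(self._outcomes) / len(self._outcomes) >= self._threshold
        ):
            self.tripped = True


gate = ErrorRateGate()
for _ in range(10):
    gate.record(failed=True, retryable=True)   # clustered timeouts: ignored
assert not gate.tripped
for _ in range(10):
    gate.record(failed=True, retryable=False)  # genuine failures still trip
assert gate.tripped
```

The key design point is that retryable errors are excluded rather than counted as successes: counting them as successes would dilute the window and delay a legitimate trip on genuine structural failures.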
Diagnostic A/Bs against build.nvidia.com showed runs failing silently under provider degradation - no log indication that retryable errors were piling up until the early-shutdown gate fired (or, post-fix, until salvage exhaustion). Surfacing this earlier helps users distinguish "DataDesigner is broken" from "the upstream provider is slow today." Tracks a separate sliding window over retryable-vs-not for every task outcome (independent of the early-shutdown gate's window) and emits a throttled WARN when the rolling fraction crosses the threshold. Refs #575.
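The separate retryable-vs-not window and throttled WARN might look like the following sketch. Class name, window size, threshold, and interval are illustrative assumptions, not the real implementation.

```python
import logging
import time
from collections import deque

logger = logging.getLogger("data_designer.sketch")


class DegradedProviderMonitor:
    """Sketch (hypothetical names/defaults): a rolling retryable-vs-not
    window per task outcome, independent of the early-shutdown gate,
    with a throttled WARN when the retryable fraction crosses a threshold."""

    def __init__(self, window: int = 20, threshold: float = 0.5,
                 warn_interval_s: float = 60.0) -> None:
        self._window: deque[bool] = deque(maxlen=window)  # True = retryable failure
        self._threshold = threshold
        self._interval = warn_interval_s
        self._last_warn_at = float("-inf")  # -inf: first WARN is never throttled

    def record(self, *, retryable_failure: bool) -> None:
        self._window.append(retryable_failure)
        if len(self._window) < self._window.maxlen:
            return  # wait for a full window before judging the provider
        rate = sum(self._window) / len(self._window)
        now = time.monotonic()
        if rate >= self._threshold and now - self._last_warn_at >= self._interval:
            self._last_warn_at = now
            logger.warning(
                "Provider showing degraded performance: %.0f%% of recent "
                "model calls failed with retryable errors.", rate * 100,
            )
```

Keeping this window separate from the gate's window is deliberate: the gate must stay blind to retryable errors, while this monitor exists precisely to surface them.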
Before: when the early-shutdown gate fired, any row group still in flight stayed in `_rg_states` un-checkpointed. The buffer manager later raised `FileNotFoundError` when the builder tried to read the finalized parquet. User-visible result: `0 records produced`. After: a new `_finalize_after_shutdown` step runs in `run()`'s finally block, after `_cancel_workers` has drained in-flight tasks (Codex caveat: in-flight `from_scratch`/`batch` tasks must not be allowed to write into a buffer that's being finalized). For each remaining row group it drops rows that aren't fully complete, then delegates to the existing `_checkpoint_completed_row_groups` so the buffer manager's zero-survivor handling (skip empty parquet, free buffer) kicks in unchanged. Also surfaces partial completion as a structured signal: scheduler exposes `early_shutdown: bool` and `partial_row_groups: tuple[int, ...]` properties so callers can detect partial completion programmatically rather than parsing log lines. Builder uses this to emit a more specific WARN distinguishing early shutdown from non-shutdown drops. Refs #575.
In `release_failure`, the cascade counter wasn't reset, so a sequence like 429 → 500 → 429 was treated as 2 consecutive 429s. The cascade counter feeds AIMD's reduce-once-per-cascade logic; the second 429 should start a fresh cascade and trigger another concurrency reduction, but currently doesn't. Standalone bug surfaced during #575 investigation; not on the failure path that drives the gate-trip outcome but worth fixing while we're in this code.
A real-workload run of #575 showed the early-shutdown gate still trips even with the gate-exclusion fix in place: the trigger is 10 timeouts inside Anonymizer's QA-repair custom columns, all wrapped in CustomColumnGenerationError (non-retryable) by the catch-all in CustomColumnGenerator. Two fixes here: 1. Re-raise RETRYABLE_MODEL_ERRORS unchanged before the wrap so the scheduler's _is_retryable correctly classifies them. 2. Surface _AsyncBridgedModelFacade timeouts as ModelTimeoutError instead of stdlib TimeoutError. Without this the sync bridge times out as the wrong exception type and is still classified non-retryable even after fix #1. Also moves _RETRYABLE_MODEL_ERRORS from async_scheduler to models/errors as the public RETRYABLE_MODEL_ERRORS tuple - both the scheduler and the wrap site need it, and models/errors is the appropriate home alongside the error class definitions. Refs #575.
…runs
When the async scheduler hits early shutdown and produces zero records, the buffer manager skips writing parquet (correctly), so ArtifactStorage.load_dataset_with_dropped_columns() raises FileNotFoundError. Previously this surfaced as a generic DataDesignerGenerationError wrapping the FileNotFoundError, which is ambiguous (could be missing files for any reason). This commit:
- Adds DataDesignerEarlyShutdownError as a subclass of DataDesignerGenerationError so existing handlers still match, while callers that want to react programmatically (retry on a different alias, surface a degraded-provider message, etc.) can catch the specific type.
- Plumbs the scheduler's structured signals (early_shutdown, partial_row_groups) up through the builder so they're available at data_designer.create() time without re-introspecting the scheduler.
- create() raises the typed error in both failure modes (load fails or empty DataFrame returned) when builder.early_shutdown is True.
Refs #575.
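The subclassing trick above is what keeps this change backward compatible. A minimal sketch (stand-in class definitions; the real ones live in the data_designer package):

```python
class DataDesignerGenerationError(Exception):
    """Stand-in for the existing generic generation error."""


class DataDesignerEarlyShutdownError(DataDesignerGenerationError):
    """Raised when early shutdown produced zero records. Subclassing means
    existing `except DataDesignerGenerationError` handlers still match."""


# Callers that want to react programmatically catch the specific type first:
try:
    raise DataDesignerEarlyShutdownError("zero records after early shutdown")
except DataDesignerEarlyShutdownError:
    handled = "retry on a different alias / surface degraded-provider message"
except DataDesignerGenerationError:
    handled = "generic failure handling"
```

Ordering matters in the caller: the subclass clause must precede the base-class clause, or the generic handler swallows the typed error.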
Review: PR #585 —
Initialize _last_degraded_warn_at to -inf so the first WARN is always emitted. The previous initialization to 0.0 suppressed the first WARN on fresh CI runners where time.monotonic() returns a small value (system boot uptime), making the throttle interval check (now - 0.0 < interval) true on the first attempt.
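The arithmetic behind the fix is simple enough to demonstrate directly. A sketch of the throttle check (the interval value is illustrative; `time.monotonic()` is the clock the text refers to):

```python
# Sketch of the throttled-WARN interval check.
WARN_INTERVAL_S = 60.0
_last_degraded_warn_at = float("-inf")


def should_warn(now: float) -> bool:
    # With a -inf initial value, now - (-inf) is +inf, so the very first
    # check always passes, even on a fresh runner where time.monotonic()
    # returns a small uptime like 3.2 seconds.
    return now - _last_degraded_warn_at >= WARN_INTERVAL_S


assert should_warn(3.2) is True                  # -inf init: first WARN fires
assert (3.2 - 0.0 >= WARN_INTERVAL_S) is False   # 0.0 init would suppress it
```

After the first WARN, `_last_degraded_warn_at` is set to the current monotonic time and the interval check behaves normally.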
Five real correctness issues caught in review of the original PR, plus a few smaller cleanups and test simplifications.
- Throttle - cascade reset (regression of an existing AIMD invariant): release_failure() now resets consecutive_429s only when in_flight == 0. Resetting unconditionally broke "reduce once per cascade" when 429/500/429 arrived interleaved within a single in-flight burst: the second 429 was treated as a new cascade and the limit was halved twice for what was effectively one rate-limit event.
- Interface - typed-error gating: DataDesignerEarlyShutdownError now fires only when early_shutdown is true AND actual_num_records == 0. Without this, a partial-salvage run that failed to load for unrelated reasons (corrupt parquet, schema drift, disk hiccup) was misdiagnosed as "zero records produced," hiding the real cause.
- Async - WARN window scope: the degraded-provider warning was fed by every task outcome, including samplers and non-LLM customs. In realistic pipelines (one model column, several non-model columns) the rate stayed under threshold even when every model call was failing, silencing the WARN exactly when it mattered. Now gated on is_llm.
- Async/builder - signal preservation across raises: scheduler.early_shutdown and partial_row_groups are captured in a try/finally around future.result(), so a processor failure during the salvage path doesn't drop the structured signal. Both build() and build_preview() now reset per-run state at the start so reused builders don't leak prior-run flags.
- Async - dead code: the dispatch_error capture in run() was unread (the post-finally check is unreachable on the exception path). Removed.
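The corrected cascade-reset invariant can be sketched as follows. Names and mechanics are assumptions modeled on the description above, not the real throttle manager; the point is that `consecutive_429s` resets only when the in-flight burst fully drains.

```python
class ThrottleSketch:
    """Sketch of the AIMD reduce-once-per-cascade invariant (hypothetical).

    consecutive_429s drives at most one multiplicative decrease per
    cascade; it resets only when in_flight drains to zero, so an
    interleaved 429/500/429 within one burst halves the limit once."""

    def __init__(self, limit: int = 32) -> None:
        self.limit = limit
        self.in_flight = 0
        self.consecutive_429s = 0

    def acquire(self) -> None:
        self.in_flight += 1

    def release_failure(self, *, rate_limited: bool) -> None:
        self.in_flight -= 1
        if rate_limited:
            self.consecutive_429s += 1
            if self.consecutive_429s == 1:  # reduce once per cascade
                self.limit = max(1, self.limit // 2)
        elif self.in_flight == 0:
            self.consecutive_429s = 0  # cascade ends only when the burst drains


t = ThrottleSketch()
for _ in range(3):
    t.acquire()
t.release_failure(rate_limited=True)   # 429: halve 32 -> 16
t.release_failure(rate_limited=False)  # 500 mid-burst: no reset
t.release_failure(rate_limited=True)   # second 429, same cascade: no halve
assert t.limit == 16
```

An unconditional reset on the 500 would have made the second 429 look like a fresh cascade and halved the limit again, which is exactly the regression the review caught.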
Smaller cleanups:
- early-shutdown WARN says "non-retryable error rate exceeded threshold"
- bridge timeout WARN demoted to debug (ModelTimeoutError already surfaces it; the throttled degraded-provider WARN is the user-facing signal)
- TODO note for threading degraded_warn_* through RunConfig
- doc note in _finalize_after_shutdown clarifying that the pre-batch processor isn't re-run on partial-salvage row groups

Tests:
- new regression tests for the cascade burst case, partial-salvage error gating, and the LLM-only WARN window
- direct unit test for _reset_run_state
- dedup via _make_storage / _seed_plus_cell_setup helpers
- WARN emission cases parametrized into a single test
- shared parametrize lists hoisted to module-level constants
- redundant cascade test dropped in favor of the more thorough drain variant; redundant healthy-baseline test folded into the zero-survivor test
Greptile Summary
This PR fixes the async engine's early-shutdown gate misfiring under transient provider degradation by excluding retryable errors.
| Filename | Overview |
|---|---|
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py | Core behavioral change: retryable errors excluded from the early-shutdown gate, new _finalize_after_shutdown salvages in-flight row groups post-cancellation, degraded-provider WARN machinery added — logic is correct and ordering constraints are respected. |
| packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py | RETRYABLE_MODEL_ERRORS re-raise guard added in both _generate and agenerate; bridge now raises ModelTimeoutError instead of stdlib TimeoutError so the scheduler classifies it retryable. |
| packages/data-designer-engine/src/data_designer/engine/models/clients/throttle_manager.py | release_failure now resets consecutive_429s only when in_flight == 0, correctly guarding against double-reduction within a concurrent burst. |
| packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py | Captures early_shutdown, partial_row_groups, and actual_num_records from the scheduler in finally blocks; adds _reset_run_state to prevent state leakage on builder reuse. |
| packages/data-designer/src/data_designer/interface/data_designer.py | Raises DataDesignerEarlyShutdownError only when early_shutdown=True AND actual_num_records == 0; partial-salvage runs fall through to the generic error, preventing misdiagnosis. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Task completes or fails] --> B{Success?}
B -- Yes --> C[_check_error_rate success=True]
B -- No --> D{is_retryable?}
D -- Yes --> E[Skip _check_error_rate\nRetryable errors excluded\nfrom early-shutdown gate]
D -- No --> F[_check_error_rate success=False]
C --> G{is_llm?}
E --> G
F --> G
G -- Yes --> H[_record_retryable_outcome\nUpdate degraded-provider window]
H --> I{Window full & rate >= threshold\n& interval elapsed?}
I -- Yes --> J[WARN: Provider degraded]
I -- No --> K[Continue]
F --> L{Error rate >= threshold?}
L -- Yes --> M[_early_shutdown = True]
L -- No --> K
M --> N[_main_dispatch_loop returns]
N --> O[finally: _cancel_workers]
O --> P{_early_shutdown & _rg_states non-empty?}
P -- Yes --> Q[_finalize_after_shutdown]
Q --> R[Drop incomplete rows per RG]
R --> S{survivors > 0?}
S -- Yes --> T[Add to partial_row_groups\n_checkpoint_completed_row_groups]
S -- No --> U[Log zero-survivor\n_checkpoint_completed_row_groups]
P -- No --> V[Clean exit path]
T --> V
U --> V
V --> W{early_shutdown & actual_num_records == 0?}
W -- Yes --> X[Raise DataDesignerEarlyShutdownError]
W -- No --> Y[Normal completion or DataDesignerGenerationError]
Reviews (2): Last reviewed commit: "test(interface): consolidate create() er..."
    if builder.early_shutdown and builder.actual_num_records == 0:
        raise DataDesignerEarlyShutdownError(
            "🛑 Generation produced zero records — early shutdown was triggered. "
            "The non-retryable error rate exceeded the configured threshold; check the "
            "warnings above (and any 'Provider showing degraded performance' logs) for "
            "the contributing failures."
        ) from e
Do we need a similar check for preview?
    def _reset_run_state(self) -> None:
        """Clear per-run signals so reused builder instances don't leak state across runs."""
        self._early_shutdown = False
        self._partial_row_groups = ()
        self._actual_num_records = -1
        self._task_traces = []
Per the style guide, this needs to get pushed down to maintain the public-before-private paradigm.
    RETRYABLE_EXCEPTION_FACTORIES = [
        pytest.param(lambda: ModelRateLimitError("429"), id="rate_limit"),
        pytest.param(lambda: ModelTimeoutError("timeout"), id="timeout"),
        pytest.param(lambda: ModelInternalServerError("503"), id="internal_server"),
        pytest.param(lambda: ModelAPIConnectionError("conn reset"), id="api_connection"),
    ]
Could we loop over what's already available in errors (RETRYABLE_MODEL_ERRORS)?
    with pytest.raises(RuntimeError, match="connection timed out"):
        proxy.generate(prompt="hello")

    def test_bridge_timeout_raises_model_timeout_error(self) -> None:
I guess the class-based testing pattern somehow slipped in... could we take this opportunity to flatten it out?
Tests in this suite also import inline...
    assert tracker.is_row_group_complete(0, 10, ["seed", "col"])

    RETRYABLE_ERROR_FACTORIES = [
Same comment as above...
Thanks for the careful diagnosis and write-up here, @andreatgretel — the live A/B against …

Summary
The PR keeps the early-shutdown gate as a circuit-breaker for genuine non-retryable failures while excluding transient errors (rate-limit, timeout, 5xx, connection) that cluster under provider degradation. It also salvages partially-completed row groups instead of leaving them un-checkpointed, fixes a …

Findings
Suggestions — Take it or leave it
    if len(dataset_for_profiler) == 0:
        if builder.early_shutdown and builder.actual_num_records == 0:
            raise DataDesignerEarlyShutdownError(...)
        raise DataDesignerGenerationError(...)
What Looks Good
Verdict
Ship it (with nits). Only Suggestions; nothing blocking. The follow-ups already filed in the PR description (…).

This review was generated by an AI assistant.
Style cleanups, parametrization, docstring polish, and one consistency
fix in the typed-error path. All non-blocking ("Ship it (with nits)").
interface/data_designer.py:
- preview() now raises DataDesignerEarlyShutdownError when shutdown
produced zero records (parity with create()), and also gates on
actual_num_records == 0 so partial-salvage runs that fail to load
don't get misdiagnosed
- create()'s defensive empty-DF guard mirrors the load-failure guard
with the same actual_num_records == 0 check
async_scheduler.py:
- _record_retryable_outcome docstring clarifies that the call site
filters by is_llm; the function alone reads as if every outcome feeds
the window
dataset_builder.py:
- moved _reset_run_state() down past the public methods to match the
project's public-before-private convention
test_custom.py:
- flattened TestAsyncBridgedModelFacade class into module-level test
functions (matches the rest of the file)
- hoisted inline imports (asyncio, threading, patch, _AsyncBridgedModelFacade,
SyncClientUnavailableError) to top of file
- drove the retryable-error parametrize off RETRYABLE_MODEL_ERRORS directly
instead of the hand-rolled factory list, so new retryable types pick
up coverage automatically
- dropped the redundant "Sanity" block in
  test_async_bridge_timeout_raises_model_timeout_error; pytest.raises
  already enforces the type, and the duplicate block was running the same
  slow scenario twice
test_async_scheduler.py:
- parametrize over RETRYABLE_MODEL_ERRORS directly (same as above)
test_data_designer.py:
- added preview-path tests for the typed-error and partial-salvage
fall-through cases
- updated the existing empty-DF test to also patch actual_num_records=0
(otherwise the new gating in the empty-DF guard skips the typed error)
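Driving the parametrize off the shared tuple looks roughly like this. The exception classes here are dummy stand-ins (the real tuple lives in data_designer's models/errors module), so the sketch is self-contained:

```python
import pytest


# Dummy stand-ins; the real classes and tuple live in models/errors.
class ModelRateLimitError(Exception): ...
class ModelTimeoutError(Exception): ...
class ModelInternalServerError(Exception): ...
class ModelAPIConnectionError(Exception): ...


RETRYABLE_MODEL_ERRORS = (
    ModelRateLimitError,
    ModelTimeoutError,
    ModelInternalServerError,
    ModelAPIConnectionError,
)


# Parametrizing over the tuple itself means any newly added retryable
# error type picks up test coverage automatically, with readable ids.
@pytest.mark.parametrize(
    "error_cls", RETRYABLE_MODEL_ERRORS, ids=lambda cls: cls.__name__
)
def test_error_is_retryable(error_cls):
    assert issubclass(error_cls, Exception)
    assert error_cls in RETRYABLE_MODEL_ERRORS
```

This replaces the hand-rolled factory lists, which had to be updated by hand whenever a retryable type was added.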
Five separate tests (two existing, three new from earlier in this PR) all probed the same dispatch logic in create(): "given a load outcome and a builder state, which error type should fire?" Pulled them into a single parametrized matrix indexed by (load_side_effect, early_shutdown, actual_num_records). Net result: 5 named tests → 1 parametrized test with 6 cells, and the previously-missing empty_df + shutdown + partial salvage cell is now covered. Test names retain readable IDs (load_fails_shutdown_zero_records etc.) so failures still pinpoint the exact case in pytest output.
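The consolidated matrix might look like the sketch below. The dispatch function and the case ids are illustrative reconstructions of the logic described in this PR, not the actual test file:

```python
import pytest


class DataDesignerGenerationError(Exception): ...
class DataDesignerEarlyShutdownError(DataDesignerGenerationError): ...


def dispatch_error(load_fails, early_shutdown, actual_num_records):
    """Simplified stand-in for create()'s error-dispatch logic under test."""
    if load_fails or actual_num_records == 0:
        if early_shutdown and actual_num_records == 0:
            return DataDesignerEarlyShutdownError
        return DataDesignerGenerationError
    return None


# One parametrized matrix replaces five near-identical tests; explicit ids
# keep pytest failure output pointing at the exact case.
@pytest.mark.parametrize(
    ("load_fails", "early_shutdown", "actual_num_records", "expected"),
    [
        pytest.param(True, True, 0, DataDesignerEarlyShutdownError,
                     id="load_fails_shutdown_zero_records"),
        pytest.param(True, True, 7, DataDesignerGenerationError,
                     id="load_fails_shutdown_partial_salvage"),
        pytest.param(True, False, 0, DataDesignerGenerationError,
                     id="load_fails_no_shutdown"),
        pytest.param(False, True, 0, DataDesignerEarlyShutdownError,
                     id="empty_df_shutdown_zero_records"),
        pytest.param(False, False, 5, None, id="healthy_run"),
    ],
)
def test_create_error_dispatch(load_fails, early_shutdown,
                               actual_num_records, expected):
    assert dispatch_error(load_fails, early_shutdown,
                          actual_num_records) is expected
```

Each cell of the matrix is one (load outcome, builder state) combination, so the previously uncovered empty-DF + shutdown + partial-salvage cell is just one more row.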
📋 Summary
Fixes the async engine's tendency to abandon entire runs (returning 0 records) when the provider is degraded. The root cause was the early-shutdown gate counting retryable errors (timeouts, transient 5xx, connection blips) toward its trip threshold, then leaving partially-completed row groups un-checkpointed. This PR keeps the gate as a circuit-breaker for genuine structural failures (auth, schema, code bugs) but stops it from firing under transient provider stress, where salvage rounds can recover.
The set of changes also closes a few smaller correctness gaps along the way:
`CustomColumnGenerator` was stripping retryability from wrapped exceptions, the sync-to-async bridge raised a non-retryable stdlib `TimeoutError`, and the throttle's `consecutive_429s` counter wasn't being reset on non-rate-limit failures.

Replaces #578 (~921 lines, cooldown-queue rabbit hole) with a focused fix (~830 lines including tests) that addresses the actual failure mode observed in production runs.
🔗 Related Issue
Closes #575
🔄 Changes
🐛 Fixed
✨ Added
🧪 Testing
✅ Checklist
🔍 Attention Areas
📎 Follow-ups (out of scope, worth filing)