fix(async): pack of fixes for async engine under degraded providers#585

Merged
andreatgretel merged 10 commits into main from andreatgretel/fix/575-async-early-shutdown
Apr 30, 2026
Conversation

@andreatgretel
Contributor

📋 Summary

Fixes the async engine's tendency to abandon entire runs (returning 0 records) when the provider is degraded. The root cause was the early-shutdown gate counting retryable errors (timeouts, transient 5xx, connection blips) toward its trip threshold, then leaving partially-completed row groups un-checkpointed. This PR keeps the gate as a circuit-breaker for genuine structural failures (auth, schema, code bugs) but stops it from firing under transient provider stress, where salvage rounds can recover.

The set of changes also closes a few smaller correctness gaps along the way: CustomColumnGenerator was stripping retryability from wrapped exceptions, the sync-to-async bridge raised a non-retryable stdlib TimeoutError, and the throttle's consecutive_429s counter wasn't being reset on non-rate-limit failures.

Replaces #578 (~921 lines, cooldown-queue rabbit hole) with a focused fix (~830 lines including tests) that addresses the actual failure mode observed in production runs.

🔗 Related Issue

Closes #575

🔄 Changes

🐛 Fixed

  • Exclude retryable errors from the early-shutdown gate (`b746f2e6`). Sliding-window error rate now ignores `ModelRateLimitError`, `ModelTimeoutError`, `ModelInternalServerError`, and `ModelAPIConnectionError`. These cluster in time under provider degradation and would otherwise trip the gate even when salvage could recover. Non-retryable failures (auth, schema, code bugs) still trip the gate as intended.
  • Salvage partial row groups on early shutdown (`763eeddb`). When the gate does fire, the scheduler now drops incomplete rows, then checkpoints any fully-complete survivors instead of leaving the row group un-written and raising `FileNotFoundError` downstream. Finalization runs after worker cancellation so in-flight `from_scratch` / `batch` tasks can't race a freed buffer.
  • Reset `consecutive_429s` on non-rate-limit failure (`25d53b69`). A `429 → 500 → 429` sequence was being treated as 2 consecutive 429s by the AIMD throttle. Independent bug surfaced during this investigation.
  • Preserve retryability through `CustomColumnGenerator` wrap (`49cc9bfd`). The wrapper's `except Exception` was wrapping retryable model errors in `CustomColumnGenerationError`, hiding them from the scheduler's salvage path. Also fixes the sync-to-async bridge in `_AsyncBridgedModelFacade` to raise `ModelTimeoutError` (was raising stdlib `TimeoutError`, which is non-retryable). Moves `RETRYABLE_MODEL_ERRORS` to `models.errors` as the canonical source.

✨ Added

  • Periodic WARN when retryable error rate is elevated (`6a85d1e9`). Today's runs failed silently — no visibility until shutdown. Logs "provider X showing degraded performance — run may take longer than expected" when the rolling rate exceeds a threshold. Throttled to one warning per cool-down interval.
  • `DataDesignerEarlyShutdownError` typed exception (`6e508b47`). New subclass of `DataDesignerGenerationError` raised when an early-shutdown produced no records, so callers can detect the case programmatically instead of pattern-matching error messages.
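The subclass relationship means existing handlers keep matching while new callers can branch on the specific type. A minimal sketch; only the class hierarchy is taken from the PR, and `handle` is a hypothetical caller:

```python
class DataDesignerError(Exception): ...


class DataDesignerGenerationError(DataDesignerError): ...


class DataDesignerEarlyShutdownError(DataDesignerGenerationError):
    """Raised when an early shutdown produced zero records."""


def handle(run) -> str:
    """Hypothetical caller showing how the typed subclass is detected."""
    try:
        run()
    except DataDesignerEarlyShutdownError:
        # provider was degraded; a retry (possibly on another alias) may succeed
        return "early-shutdown"
    except DataDesignerGenerationError:
        return "generation-failed"
    return "ok"
```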

🧪 Testing

  • `make test` passes locally
  • Unit tests added/updated — new tests cover: gate exclusion of each retryable error type, WARN throttling under sustained degradation, partial-RG checkpoint with mixed completion states (including zero-survivor case), `consecutive_429s` reset, retryability preservation through both sync and async `CustomColumnGenerator` paths, bridge `ModelTimeoutError` wrapping, and the new `DataDesignerEarlyShutdownError` raising path.
  • Live A/B against `build.nvidia.com` (Anonymizer notebook 04, 25 biographies, async engine):
    • Without fix: `DataDesignerGenerationError: 0 records produced` after ~17 min. Gate trips on clustered `ModelTimeoutError`s mid-rewrite, row group never checkpoints.
    • With fix: 25/25 records produced in ~39 min. 27 retryable errors during the run (rate-limits + bridge timeouts), all retried successfully via salvage. No early-shutdown event.
    • Bumped concurrency stress test (`gpt-oss-120b: max_parallel_requests=8`, `nemotron: 32`): 25/25 records in ~19.5 min. AIMD reduced concurrency on initial 429 storms; salvage absorbed all retryable errors. Demonstrates the fix lets users tune for higher throughput without paying the 0-records tax.

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated — N/A; behavior change is internal to the scheduler

🔍 Attention Areas

⚠️ Reviewers: Please pay special attention to the following:

  • The retryable re-raise guard in `CustomColumnGenerator` is duplicated across the sync `_generate` and async `agenerate` paths; verify both stay in sync.

📎 Follow-ups (out of scope, worth filing)

  • Bridge timeout configurability. `SYNC_BRIDGE_TIMEOUT = 300` is a hardcoded constant that pre-dates this PR. It should ideally derive from the model's configured timeout (and account for `max_correction_steps`), and/or be exposed via `RunConfig`. Out of scope here, but worth a follow-up issue.
  • Batch-level error rates for early shutdown #211 alignment for async path. The sliding-window gate still differs from the sync engine's batch-level error rate. Larger structural change, deferred.
  • Anonymizer logging visibility. `LoggingConfig.default()` swallows DataDesigner scheduler events. During this investigation we had to switch to `LoggingConfig.debug()` to see anything. Worth a doc note or a more verbose default for when things go wrong.

The gate previously only excluded `ModelRateLimitError`, leaving
`ModelTimeoutError`, `ModelInternalServerError`, and
`ModelAPIConnectionError` to count toward the sliding-window error
rate. Under provider degradation these errors cluster in time
(concurrent in-flight requests time out together), so 5/10 in a row
is easy and trips the gate even when salvage could recover the rows.

Refs #575.
Diagnostic A/Bs against build.nvidia.com showed runs failing silently
under provider degradation - no log indication that retryable errors
were piling up until the early-shutdown gate fired (or, post-fix,
until salvage exhaustion). Surfacing this earlier helps users
distinguish "DataDesigner is broken" from "the upstream provider is
slow today."

Tracks a separate sliding window over retryable-vs-not for every task
outcome (independent of the early-shutdown gate's window) and emits a
throttled WARN when the rolling fraction crosses the threshold.

Refs #575.
Before: when the early-shutdown gate fired, any row group still in
flight stayed in `_rg_states` un-checkpointed. The buffer manager
later raised `FileNotFoundError` when the builder tried to read the
finalized parquet. User-visible result: `0 records produced`.

After: a new `_finalize_after_shutdown` step runs in `run()`'s finally
block, after `_cancel_workers` has drained in-flight tasks (Codex
caveat: in-flight `from_scratch`/`batch` tasks must not be allowed to
write into a buffer that's being finalized). For each remaining row
group it drops rows that aren't fully complete, then delegates to the
existing `_checkpoint_completed_row_groups` so the buffer manager's
zero-survivor handling (skip empty parquet, free buffer) kicks in
unchanged.

Also surfaces partial completion as a structured signal: scheduler
exposes `early_shutdown: bool` and `partial_row_groups: tuple[int, ...]`
properties so callers can detect partial completion programmatically
rather than parsing log lines. Builder uses this to emit a more
specific WARN distinguishing early shutdown from non-shutdown drops.

Refs #575.
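The drop-incomplete-then-checkpoint step described above can be sketched as a pure filter over per-row completion state. `salvage_row_group` and its data shapes are hypothetical, not the scheduler's actual internals:

```python
def salvage_row_group(rows: dict[int, set[str]], columns: list[str]) -> dict[int, set[str]]:
    """Keep only fully complete rows; partially generated rows are dropped.

    `rows` maps row id -> set of columns that finished generating. Survivors
    are what the existing checkpoint path then writes; an empty result hits
    the buffer manager's zero-survivor handling (skip parquet, free buffer).
    """
    required = set(columns)
    return {rid: done for rid, done in rows.items() if required <= done}
```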
In `release_failure`, the cascade counter wasn't reset, so a sequence
like 429 → 500 → 429 was treated as 2 consecutive 429s. The cascade
counter feeds AIMD's reduce-once-per-cascade logic; the second 429
should start a fresh cascade and trigger another concurrency reduction,
but currently doesn't.

Standalone bug surfaced during #575 investigation; not on the failure
path that drives the gate-trip outcome but worth fixing while we're
in this code.
A real-workload run of #575 showed the early-shutdown gate still trips
even with the gate-exclusion fix in place: the trigger is 10 timeouts
inside Anonymizer's QA-repair custom columns, all wrapped in
CustomColumnGenerationError (non-retryable) by the catch-all in
CustomColumnGenerator.

Two fixes here:

1. Re-raise RETRYABLE_MODEL_ERRORS unchanged before the wrap so the
   scheduler's _is_retryable correctly classifies them.

2. Surface _AsyncBridgedModelFacade timeouts as ModelTimeoutError
   instead of stdlib TimeoutError. Without this the sync bridge times
   out as the wrong exception type and is still classified non-retryable
   even after fix #1.

Also moves _RETRYABLE_MODEL_ERRORS from async_scheduler to
models/errors as the public RETRYABLE_MODEL_ERRORS tuple - both the
scheduler and the wrap site need it, and models/errors is the
appropriate home alongside the error class definitions.

Refs #575.
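Both fixes reduce to small control-flow changes. A hedged sketch with stand-in exception types: `wrapped_generate` models the re-raise guard, and `bridged_result` is a hypothetical distillation of the bridge's timeout mapping, not `_AsyncBridgedModelFacade` itself.

```python
import concurrent.futures
import time


class ModelTimeoutError(Exception): ...


class CustomColumnGenerationError(Exception): ...


# Stand-in for the tuple hoisted to models.errors.
RETRYABLE_MODEL_ERRORS: tuple[type[Exception], ...] = (ModelTimeoutError, ConnectionError)


def wrapped_generate(fn):
    """Fix 1: re-raise retryable model errors unchanged before the catch-all wrap."""
    try:
        return fn()
    except RETRYABLE_MODEL_ERRORS:
        raise  # the scheduler's retryability check sees the real type
    except Exception as e:
        raise CustomColumnGenerationError(str(e)) from e


def bridged_result(future: concurrent.futures.Future, timeout_s: float):
    """Fix 2: surface sync-bridge timeouts as the retryable ModelTimeoutError."""
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError as e:
        raise ModelTimeoutError(f"sync bridge timed out after {timeout_s}s") from e
```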
…runs

When the async scheduler hits early shutdown and produces zero
records, the buffer manager skips writing parquet (correctly), so
ArtifactStorage.load_dataset_with_dropped_columns() raises
FileNotFoundError. Previously this surfaced as a generic
DataDesignerGenerationError wrapping the FileNotFoundError, which is
ambiguous (could be missing files for any reason).

This commit:

- Adds DataDesignerEarlyShutdownError as a subclass of
  DataDesignerGenerationError so existing handlers still match while
  callers that want to react programmatically (retry on different
  alias, surface a degraded-provider message, etc.) can catch the
  specific type.
- Plumbs the scheduler's structured signals (early_shutdown,
  partial_row_groups) up through the builder so they're available at
  data_designer.create() time without re-introspecting the scheduler.
- create() raises the typed error in both failure modes (load fails
  or empty DataFrame returned) when builder.early_shutdown is True.

Refs #575.
@andreatgretel andreatgretel requested a review from a team as a code owner April 29, 2026 22:54
@github-actions
Copy link
Copy Markdown
Contributor

Review: PR #585 fix(async): pack of fixes for async engine under degraded providers

Summary

This PR fixes the async engine's failure mode where clustered retryable errors (rate-limits, timeouts, 5xx, connection blips) were tripping the early-shutdown gate and leaving partially-completed row groups un-checkpointed, producing a 0 records error. The changes:

  1. Exclude retryable model errors from the early-shutdown gate; keep it as a circuit-breaker for genuine failures (auth, schema, code bugs).
  2. Add a _finalize_after_shutdown salvage path so partial row groups are drained and checkpointed instead of leaking into the final FileNotFoundError.
  3. Introduce a throttled "provider showing degraded performance" WARN so users have visibility before shutdown.
  4. Fix three smaller correctness gaps: CustomColumnGenerator wrapping retryable errors into a non-retryable wrapper, the sync-to-async bridge raising stdlib TimeoutError instead of ModelTimeoutError, and consecutive_429s not being reset on non-rate-limit failures in the throttle.
  5. Add DataDesignerEarlyShutdownError as a typed subclass of DataDesignerGenerationError so callers can react programmatically.

+778 / -23 lines across 12 files, roughly half tests. Scope is focused and well-matched to the stated failure.

Findings

Correctness

  • Worker-drain ordering is correct. _finalize_after_shutdown runs after await asyncio.shield(self._cancel_workers()) in the finally block of run() (async_scheduler.py:313-318). Workers are fully drained before survivors are counted, so the "in-flight task writes into a freed buffer" race the PR description calls out is actually prevented.
  • _checkpoint_completed_row_groups is now called twice on the early-shutdown path — once inside _main_dispatch_loop (async_scheduler.py:342) and again from _finalize_after_shutdown (async_scheduler.py:596). This is benign (the first call is a no-op if nothing is newly complete and _rg_states shrinks between calls), but worth a one-line comment at the first call site so future readers don't assume it's the terminal checkpoint pass.
  • Zero-survivor case is handled cleanly. _finalize_after_shutdown drops all incomplete rows, then _checkpoint_completed_row_groups detects all(is_dropped(...)) and calls buffer_manager.free_row_group(rg_id) instead of writing parquet. The test test_zero_survivor_shutdown_does_not_raise asserts storage.write_batch_to_parquet_file.assert_not_called(), which pins this down well.
  • _record_retryable_outcome semantics are sound. Every outcome (success or failure) feeds the window; only retryable failures count as True. A 50% threshold over 20 outcomes is a reasonable signal-vs-noise point. The deque(maxlen=...) with a len(...) < window guard avoids firing on a short warmup.
  • consecutive_429s = 0 in release_failure (throttle_manager.py:316). This is the right call for the 429 → 500 → 429 cascade bug, but note that release_failure is also invoked for non-transient failures unrelated to rate limiting. Resetting here is correct — a gap in the cascade is a gap regardless of why — but the docstring/comment could be slightly sharper: the reset isn't about "non-rate-limit failure breaks the cascade" so much as "any non-429 failure indicates the cascade isn't a pure 429 streak."
  • CustomColumnGenerator exception order is right. CustomColumnGenerationError → RETRYABLE_MODEL_ERRORS → Exception — the retryable tuple isn't a subclass of the custom wrapper, so ordering is non-aliasing. The pattern is duplicated in both _generate and agenerate; this is correctly called out in the PR "Attention Areas" and tests cover both paths (test_custom.py:382-417).

Concerns / Nits

  • Test test_partial_row_group_salvaged_after_early_shutdown is scheduler-order-sensitive. It asserts buffer_mgr.actual_num_records == 3 exactly — relying on seeds 0,1,2 completing before the gate trips from seeds 5,6,7 failing (window=4, rate=0.5 → trips after the 2nd failure). This works because the scheduler dispatches get_ready_tasks in order and the non-slow paths don't yield, but it's a fragile invariant. Consider >= 3 with an upper-bound assert, or patch asyncio.sleep to make ordering deterministic.
  • SYNC_BRIDGE_TIMEOUT = 300 remains hardcoded (correctly called out as a follow-up in the PR description). No blocker.
  • partial_row_groups naming. A row group is added only when had_incomplete and survivors > 0 — i.e., it was partially salvaged. A row group that was all-dropped-at-shutdown is not in this list (it's just gone). That matches the docstring; no change needed, but future readers reaching for "which row groups ended early" should use early_shutdown plus this list together.
  • DataDesignerEarlyShutdownError is raised in two places (data_designer.py:240 and data_designer.py:258). The second path is a defensive guard for a DF-returning code path the first path's comment says shouldn't happen. Keeping both is fine; they share identical detection logic (builder.early_shutdown), so if that detection ever changes, both must move in lockstep.
  • __init__.py lazy-import registration (interface/init.py:19, 28) correctly follows the established pattern — both the TYPE_CHECKING block and the _LAZY_IMPORTS dict get the new symbol.

Conventions

  • Absolute imports only ✓
  • from __future__ import annotations preserved in edited files ✓
  • Modern type syntax (list[int], tuple[int, ...]) ✓
  • RETRYABLE_MODEL_ERRORS as typed tuple[type[Exception], ...] in the canonical location (models/errors.py) — this is the right move; the previous private tuple in async_scheduler.py was a duplicate.
  • Error class hierarchy (DataDesignerEarlyShutdownError < DataDesignerGenerationError < DataDesignerError) is consistent with existing conventions.

Tests

  • Coverage is thorough. Parametrized tests across all four retryable error types for both the scheduler and custom-generator paths. The bridge-timeout test exercises the actual thread-and-event-loop plumbing rather than mocking it out. The degraded-provider WARN has separate tests for "fires", "throttled", and "silent under threshold." The interface-layer tests cover both the FileNotFoundError → DataDesignerEarlyShutdownError wrap and the defensive empty-DF path.
  • One gap worth considering. No test verifies the interaction between a partial row group and on_before_checkpoint raising (the callback could fail mid-salvage). Low priority — existing code paths have the same behavior — but a one-off test would pin down the contract.

Security / Performance

  • No credentials, secrets, or user input touched.
  • _record_retryable_outcome is O(1) per outcome (bounded deque). The degraded-WARN throttle uses time.monotonic() — correct for interval gating.
  • The finalize-after-shutdown loop is O(rg_size × columns) per unfinished row group, bounded by row groups left in flight at shutdown. Not a concern.

Verdict

Solid fix. Well-scoped, well-tested, correctly orders the worker drain vs. finalize, and the RETRYABLE_MODEL_ERRORS consolidation cleans up a duplicate. The only real suggestion is de-flaking the partial-RG salvage test's exact-count assertion; everything else is polish (comment wording, noted follow-ups). No blockers.

  Initialize _last_degraded_warn_at to -inf so the first WARN is always
  emitted. The previous initialization to 0.0 suppressed the first WARN on
  fresh CI runners where time.monotonic() returns a small value (system
  boot uptime), making the throttle interval check (now - 0.0 < interval)
  true on the first attempt.
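The -inf initialization can be seen in a minimal throttle sketch; `DegradedProviderWarner` and its defaults are illustrative names, not the repo's:

```python
import math
import time


class DegradedProviderWarner:
    """Interval-throttled WARN; -inf init guarantees the first warning fires."""

    def __init__(self, interval_s: float = 60.0, threshold: float = 0.5) -> None:
        self._interval = interval_s
        self._threshold = threshold
        # 0.0 would suppress the first WARN on hosts where monotonic() is
        # still small (system uptime); -inf makes the first check always pass.
        self._last_warn_at = -math.inf

    def maybe_warn(self, retryable_rate: float) -> bool:
        now = time.monotonic()
        if retryable_rate >= self._threshold and now - self._last_warn_at >= self._interval:
            self._last_warn_at = now
            return True  # caller emits the degraded-provider WARN
        return False
```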
Five real correctness issues caught in review of the original PR, plus a
few smaller cleanups and test simplifications.

Throttle - cascade reset (regression of existing AIMD invariant):
release_failure() now resets consecutive_429s only when in_flight == 0.
Resetting unconditionally broke "reduce once per cascade" when 429/500/429
arrived interleaved within a single in-flight burst - the second 429 was
treated as a new cascade and the limit got halved twice for what was
effectively one rate-limit event.
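A toy model of the invariant (names and the halving policy are illustrative, not the throttle's actual code):

```python
class AimdThrottleSketch:
    """Illustrative model of reduce-once-per-cascade with the in_flight gate."""

    def __init__(self, limit: int = 32) -> None:
        self.limit = limit
        self.consecutive_429s = 0
        self.in_flight = 0

    def acquire(self) -> None:
        self.in_flight += 1

    def release_rate_limit(self) -> None:  # a request finished with a 429
        self.in_flight -= 1
        self.consecutive_429s += 1
        if self.consecutive_429s == 1:
            self.limit = max(1, self.limit // 2)  # reduce once per cascade

    def release_failure(self) -> None:  # a request finished with a non-429 failure
        self.in_flight -= 1
        # Only a fully drained burst ends a cascade: a 500 interleaved inside
        # the same in-flight burst must not split one rate-limit event into
        # two, which would halve the limit twice.
        if self.in_flight == 0:
            self.consecutive_429s = 0
```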

Interface - typed-error gating: DataDesignerEarlyShutdownError now fires
only when early_shutdown is true AND actual_num_records == 0. Without
this, a partial-salvage run that fails to load for unrelated reasons
(corrupt parquet, schema drift, disk hiccup) was misdiagnosed as "zero
records produced," hiding the real cause.

Async - WARN window scope: the degraded-provider warning was fed by every
task outcome, including samplers and non-LLM customs. In realistic
pipelines (one model column, several non-model columns) the rate stayed
under threshold even when every model call was failing, silencing the
WARN exactly when it mattered. Now gated on is_llm.
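The gating reduces to a one-line filter at the point where outcomes feed the window; a hypothetical sketch:

```python
from collections import deque


def record_outcome(window: deque, *, is_llm: bool, retryable_failure: bool) -> None:
    """Only model-bound task outcomes feed the degraded-provider window."""
    if not is_llm:
        return  # samplers / non-LLM customs would dilute the rate
    window.append(retryable_failure)
```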

Async/builder - signal preservation across raises: scheduler.early_shutdown
and partial_row_groups are captured in a try/finally around future.result(),
so a processor failure during the salvage path doesn't drop the
structured signal. Both build() and build_preview() now reset per-run
state at the start so reused builders don't leak prior-run flags.

Async - dead code: dispatch_error capture in run() was unread (the post-
finally check is unreachable on the exception path). Removed.

Smaller cleanups:
- early-shutdown WARN says "non-retryable error rate exceeded threshold"
- bridge timeout WARN demoted to debug (ModelTimeoutError already surfaces
  it; the throttled degraded-provider WARN is the user-facing signal)
- TODO note for threading degraded_warn_* through RunConfig
- doc note in _finalize_after_shutdown clarifying that pre-batch processor
  isn't re-run on partial-salvage row groups

Tests:
- new regression tests for the cascade burst case, partial-salvage error
  gating, and LLM-only WARN window
- direct unit test for _reset_run_state
- dedup via _make_storage / _seed_plus_cell_setup helpers
- WARN emission cases parametrized into a single test
- shared parametrize lists hoisted to module-level constants
- redundant cascade test dropped in favor of the more thorough drain
  variant; redundant healthy-baseline test folded into the zero-survivor
  test
@greptile-apps
Contributor

greptile-apps Bot commented Apr 30, 2026

Greptile Summary

This PR fixes the async engine's early-shutdown gate misfiring under transient provider degradation by excluding retryable errors (ModelRateLimitError, ModelTimeoutError, ModelInternalServerError, ModelAPIConnectionError) from the sliding-window error-rate check. A new _finalize_after_shutdown method salvages partially-complete row groups instead of leaving them un-checkpointed, resolving the downstream FileNotFoundError. Three companion fixes address consecutive_429s reset timing, retryability preservation through CustomColumnGenerator, and the sync-to-async bridge raising a non-retryable TimeoutError. The changes are well-scoped with comprehensive unit and live A/B testing.

Confidence Score: 5/5

This PR is safe to merge — all changes are well-scoped, correctly ordered, and covered by comprehensive tests including a live A/B run.

No P0 or P1 bugs found. The core invariants are sound: retryable exclusion from the gate is consistent across both the error path and the degraded-WARN window; _finalize_after_shutdown correctly runs after _cancel_workers to prevent race conditions; consecutive_429s reset is gated on in_flight == 0 to avoid double-reduction within a concurrent burst; DataDesignerEarlyShutdownError is only raised when actual_num_records == 0 preventing misdiagnosis on partial-salvage runs.

No files require special attention.

Important Files Changed

Filename Overview
packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py Core behavioral change: retryable errors excluded from the early-shutdown gate, new _finalize_after_shutdown salvages in-flight row groups post-cancellation, degraded-provider WARN machinery added — logic is correct and ordering constraints are respected.
packages/data-designer-engine/src/data_designer/engine/column_generators/generators/custom.py RETRYABLE_MODEL_ERRORS re-raise guard added in both _generate and agenerate; bridge now raises ModelTimeoutError instead of stdlib TimeoutError so the scheduler classifies it retryable.
packages/data-designer-engine/src/data_designer/engine/models/clients/throttle_manager.py release_failure now resets consecutive_429s only when in_flight == 0, correctly guarding against double-reduction within a concurrent burst.
packages/data-designer-engine/src/data_designer/engine/dataset_builders/dataset_builder.py Captures early_shutdown, partial_row_groups, and actual_num_records from the scheduler in finally blocks; adds _reset_run_state to prevent state leakage on builder reuse.
packages/data-designer/src/data_designer/interface/data_designer.py Raises DataDesignerEarlyShutdownError only when early_shutdown=True AND actual_num_records == 0; partial-salvage runs fall through to the generic error, preventing misdiagnosis.

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Task completes or fails] --> B{Success?}
    B -- Yes --> C[_check_error_rate success=True]
    B -- No --> D{is_retryable?}
    D -- Yes --> E[Skip _check_error_rate\nRetryable errors excluded\nfrom early-shutdown gate]
    D -- No --> F[_check_error_rate success=False]
    C --> G{is_llm?}
    E --> G
    F --> G
    G -- Yes --> H[_record_retryable_outcome\nUpdate degraded-provider window]
    H --> I{Window full & rate >= threshold\n& interval elapsed?}
    I -- Yes --> J[WARN: Provider degraded]
    I -- No --> K[Continue]
    F --> L{Error rate >= threshold?}
    L -- Yes --> M[_early_shutdown = True]
    L -- No --> K
    M --> N[_main_dispatch_loop returns]
    N --> O[finally: _cancel_workers]
    O --> P{_early_shutdown & _rg_states non-empty?}
    P -- Yes --> Q[_finalize_after_shutdown]
    Q --> R[Drop incomplete rows per RG]
    R --> S{survivors > 0?}
    S -- Yes --> T[Add to partial_row_groups\n_checkpoint_completed_row_groups]
    S -- No --> U[Log zero-survivor\n_checkpoint_completed_row_groups]
    P -- No --> V[Clean exit path]
    T --> V
    U --> V
    V --> W{early_shutdown & actual_num_records == 0?}
    W -- Yes --> X[Raise DataDesignerEarlyShutdownError]
    W -- No --> Y[Normal completion or DataDesignerGenerationError]
```


Comment on lines +245 to +251

```python
if builder.early_shutdown and builder.actual_num_records == 0:
    raise DataDesignerEarlyShutdownError(
        "🛑 Generation produced zero records — early shutdown was triggered. "
        "The non-retryable error rate exceeded the configured threshold; check the "
        "warnings above (and any 'Provider showing degraded performance' logs) for "
        "the contributing failures."
    ) from e
```

Do we need a similar check for preview?

Comment on lines +163 to +168
def _reset_run_state(self) -> None:
"""Clear per-run signals so reused builder instances don't leak state across runs."""
self._early_shutdown = False
self._partial_row_groups = ()
self._actual_num_records = -1
self._task_traces = []

Per the style guide, this needs to get pushed down to maintain the public-before-private paradigm.

Comment on lines +359 to +364
RETRYABLE_EXCEPTION_FACTORIES = [
pytest.param(lambda: ModelRateLimitError("429"), id="rate_limit"),
pytest.param(lambda: ModelTimeoutError("timeout"), id="timeout"),
pytest.param(lambda: ModelInternalServerError("503"), id="internal_server"),
pytest.param(lambda: ModelAPIConnectionError("conn reset"), id="api_connection"),
]

Could we loop over what's already available in errors (RETRYABLE_MODEL_ERRORS)?

```python
with pytest.raises(RuntimeError, match="connection timed out"):
    proxy.generate(prompt="hello")


def test_bridge_timeout_raises_model_timeout_error(self) -> None:
```

I guess somehow the class-based testing pattern slipped in... could we take this opportunity to flatten it out?


Tests in this suite also import inline...

```python
assert tracker.is_row_group_complete(0, 10, ["seed", "col"])


RETRYABLE_ERROR_FACTORIES = [
```

Same comment as above...

@nabinchha
Contributor

Thanks for the careful diagnosis and write-up here, @andreatgretel — the live A/B against build.nvidia.com is exactly the kind of evidence that makes this easy to reason about, and the commit log is great for future spelunkers.

Summary

The PR keeps the early-shutdown gate as a circuit-breaker for genuine non-retryable failures while excluding transient errors (rate-limit, timeout, 5xx, connection) that cluster under provider degradation. It also salvages partially-completed row groups instead of leaving them un-checkpointed, fixes a CustomColumnGenerator wrap that was stripping retryability, surfaces bridge timeouts as ModelTimeoutError, repairs the AIMD cascade reset, and exposes a typed DataDesignerEarlyShutdownError. The implementation matches the stated intent.

Findings

Suggestions — Take it or leave it

packages/data-designer/src/data_designer/interface/data_designer.py:260-269 — Empty-DataFrame guard could mirror the load-failure guard

  • What: The load-failure path requires both builder.early_shutdown and builder.actual_num_records == 0 before raising the typed error, but the defensive empty-DataFrame guard further down only checks early_shutdown.
  • Why: The two paths are protecting against the same misdiagnosis (partial-salvage run hitting an unrelated load issue). If load_dataset_with_dropped_columns() ever does return an empty DF after a partial-salvage run (today it raises, but per the comment this guard exists for "future changes to that contract"), we'd surface a misleading "early shutdown produced zero records" instead of letting the real cause bubble up.
  • Suggestion: Apply the same and builder.actual_num_records == 0 check to the empty-DF branch:
```python
if len(dataset_for_profiler) == 0:
    if builder.early_shutdown and builder.actual_num_records == 0:
        raise DataDesignerEarlyShutdownError(...)
    raise DataDesignerGenerationError(...)
```

packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py:680-685 — `_record_retryable_outcome` docstring slightly overstates scope

  • What: The docstring says "every outcome (success or failure) feeds this window", but the call site only feeds it for is_llm tasks (lines 834 and 852).
  • Why: A reader looking at the function in isolation will think the rate is computed across all task outcomes; the LLM-only gating is the whole point of the window (called out in the commit message), so it'd be nice to make that visible at the function too.
  • Suggestion: Replace "every outcome" with "every LLM-bound task outcome" in the docstring, or add a one-line note that the call site filters by is_llm.

packages/data-designer-engine/tests/engine/column_generators/generators/test_custom.py:639-652 — Second "Sanity" block duplicates the pytest.raises

  • What: After the with pytest.raises(ModelTimeoutError, match="bridge timed out") block already enforces the type, the second try/except ModelTimeoutError/except concurrent.futures.TimeoutError → pytest.fail block re-runs the same scenario.
  • Why: pytest.raises already fails if the wrong exception type is raised, with a traceback that shows what was raised instead. The custom pytest.fail message is a nicer error string, but at the cost of doubling the test runtime (the bridge sleeps for SYNC_BRIDGE_TIMEOUT=0.05s plus event-loop overhead) and adding a maintenance surface.
  • Suggestion: Drop the second block. If the explicit failure message feels valuable, fold it into the pytest.raises call as a custom message, e.g. via pytest.raises(ModelTimeoutError, match="bridge timed out") — the failure already shows the actual exception type.

What Looks Good

  • The exclusion + salvage pairing is the right shape. Excluding retryable errors from the gate without also salvaging partial row groups would still produce 0-record runs in the timeout-cluster case; doing salvage without the exclusion would still trip the gate prematurely. Doing both is what unlocks the 25/25 result in the live A/B, and the PR description makes that dependency explicit.
  • The worker-drain ordering is well thought through. The "must run AFTER _cancel_workers" comment in run() plus the doc note in _finalize_after_shutdown about pre-batch processors not re-running on salvaged row groups will save someone a head-scratching debug session later.
  • RETRYABLE_MODEL_ERRORS was hoisted to the right home. Putting it next to the error class definitions in models/errors.py keeps the canonical list with the canonical types, and the CustomColumnGenerator wrap site can now share it without a back-import into the scheduler.
  • The cascade-reset fix is a tidy independent improvement. The 429 → 500 → 429 sequence with in_flight > 0 correctly stays as one cascade, and the parametrized "drain vs in-flight" test pair makes the invariant easy to read.
  • Typed error inherits from the existing one. DataDesignerEarlyShutdownError(DataDesignerGenerationError) lets existing handlers keep working while opening the door for callers that want to react programmatically — the test explicitly asserts both isinstance checks, which is a nice belt-and-suspenders.
  • Test scaffolding consolidation. _make_storage and _seed_plus_cell_setup cut a lot of duplication out of the new tests, and the module-level RETRYABLE_ERROR_FACTORIES / RETRYABLE_EXCEPTION_FACTORIES parametrize lists keep the cases readable.

Verdict

Ship it (with nits). Only Suggestions; nothing blocking. The follow-ups already filed in the PR description (SYNC_BRIDGE_TIMEOUT configurability, #211 alignment, logging defaults) cover the larger structural work this PR intentionally defers.


This review was generated by an AI assistant.

Style cleanups, parametrization, docstring polish, and one consistency
fix in the typed-error path. All non-blocking ("Ship it (with nits)").

interface/data_designer.py:
- preview() now raises DataDesignerEarlyShutdownError when shutdown
  produced zero records (parity with create()), and also gates on
  actual_num_records == 0 so partial-salvage runs that fail to load
  don't get misdiagnosed
- create()'s defensive empty-DF guard mirrors the load-failure guard
  with the same actual_num_records == 0 check

async_scheduler.py:
- _record_retryable_outcome docstring clarifies that the call site
  filters by is_llm; the function alone reads as if every outcome feeds
  the window

dataset_builder.py:
- moved _reset_run_state() down past the public methods to match the
  project's public-before-private convention

test_custom.py:
- flattened TestAsyncBridgedModelFacade class into module-level test
  functions (matches the rest of the file)
- hoisted inline imports (asyncio, threading, patch, _AsyncBridgedModelFacade,
  SyncClientUnavailableError) to top of file
- driven retryable-error parametrize off RETRYABLE_MODEL_ERRORS directly
  instead of the hand-rolled factory list, so new retryable types pick
  up coverage automatically
- dropped the redundant "Sanity" block in
  test_async_bridge_timeout_raises_model_timeout_error - pytest.raises
  already enforces the type; the duplicate block was running the same
  slow scenario twice

test_async_scheduler.py:
- parametrize over RETRYABLE_MODEL_ERRORS directly (same as above)

test_data_designer.py:
- added preview-path tests for the typed-error and partial-salvage
  fall-through cases
- updated the existing empty-DF test to also patch actual_num_records=0
  (otherwise the new gating in the empty-DF guard skips the typed error)
Five separate tests (two existing, three new from earlier in this PR)
all probed the same dispatch logic in create(): "given a load outcome
and a builder state, which error type should fire?" Pulled them into a
single parametrized matrix indexed by (load_side_effect, early_shutdown,
actual_num_records).

Net result: 5 named tests → 1 parametrized test with 6 cells, and the
previously-missing empty_df + shutdown + partial salvage cell is now
covered.

Test names retain readable IDs (load_fails_shutdown_zero_records etc.)
so failures still pinpoint the exact case in pytest output.
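The dispatch logic under test can be modeled as a pure function over the same three inputs; `classify_create_failure` is a hypothetical distillation of the matrix, not the interface code:

```python
def classify_create_failure(
    *,
    load_failed: bool,
    empty_df: bool,
    early_shutdown: bool,
    actual_num_records: int,
) -> str:
    """Which error type should create() raise for a given (load outcome, builder state)?"""
    failed = load_failed or empty_df
    if failed and early_shutdown and actual_num_records == 0:
        return "DataDesignerEarlyShutdownError"
    if failed:
        # partial-salvage run hitting an unrelated load issue: let the
        # generic error surface the real cause
        return "DataDesignerGenerationError"
    return "ok"
```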
@andreatgretel andreatgretel merged commit 47c72b3 into main Apr 30, 2026
49 checks passed


Development

Successfully merging this pull request may close these issues.

Async engine stalls under sustained rate limiting - AIMD has no circuit breaker

2 participants