
feat: add AsyncTaskScheduler and RowGroupBufferManager for async engine#404

Open
andreatgretel wants to merge 8 commits into main from
andreatgretel/feat/async-scheduler-buffer

Conversation


@andreatgretel andreatgretel commented Mar 12, 2026

Summary

PR 3 of 4 in the async engine migration plan. Adds the AsyncTaskScheduler and RowGroupBufferManager - the core orchestration layer that replaces sequential column-by-column processing with parallel, dependency-aware task dispatch.

Changes

Added

  • AsyncTaskScheduler - Dependency-aware async task scheduler with:
    • Row-group admission via semaphore-based concurrency control
    • Multi-column dedup for generators that produce multiple output columns
    • Stateful generator serialization via per-instance asyncio locks
    • Retryable failure deferral with configurable salvage rounds
    • Post-salvage error logging for unfinished row groups
    • Optional task tracing for debugging/profiling
  • RowGroupBufferManager - Per-row-group buffer with cell-level writes, batch updates, row dropping, checkpoint-to-parquet with full memory cleanup, and size-mismatch validation on update_batch
  • 13 async scheduler tests covering: seed dispatch ordering, buffer integration, multiple row groups, non-retryable failure row drops, stateful serialization, bounded submission, tracing, three-column pipelines, retryable salvage recovery, and eager row-drop propagation to downstream columns
  • 9 buffer manager tests covering: init, cell/batch writes, DataFrame exclusion of dropped rows, concurrent row groups, checkpoint with memory cleanup, and on_complete callbacks
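The semaphore-based row-group admission described above can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual API: `admit_row_groups` and the `on_admit` callback are hypothetical stand-ins for the scheduler's buffer init plus seed dispatch, and the real scheduler releases the slot at checkpoint time rather than immediately.

```python
import asyncio

async def admit_row_groups(row_groups, max_concurrent, on_admit):
    # Bound how many row groups are active at once. In the real scheduler
    # the slot is released when the row group is checkpointed; here we
    # release immediately so the sketch runs to completion on its own.
    sem = asyncio.Semaphore(max_concurrent)
    for rg_id, rg_size in row_groups:
        await sem.acquire()        # blocks once max_concurrent groups are live
        on_admit(rg_id, rg_size)   # stand-in for buffer init + seed dispatch
        sem.release()

admitted: list[int] = []
asyncio.run(admit_row_groups([(0, 4), (1, 4), (2, 4)], 2,
                             lambda rg_id, rg_size: admitted.append(rg_id)))
```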

Changed

  • CompletionTracker.get_ready_tasks - Added admitted_rgs parameter to filter tasks by admitted row groups
  • CompletionTracker._seed_frontier renamed to public seed_frontier() - no longer auto-called from with_graph; root dispatch moved to the scheduler's _dispatch_seeds (handles stateful locks and multi-column dedup). seed_frontier() remains available for static introspection (capacity planning, task enumeration)
  • Updated completion tracker tests to cover both empty-frontier default and explicit seed_frontier() behavior
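The `admitted_rgs` filter can be pictured with a rough sketch (illustrative only; `Task` here is a stand-in dataclass, and the real `CompletionTracker.get_ready_tasks` also resolves dependencies, not just admission and dedup):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Task:
    column: str
    row_group: int

def get_ready_tasks(frontier, dispatched, admitted_rgs):
    # Only hand back tasks whose row group has been admitted and that
    # have not already been dispatched.
    return [t for t in frontier
            if t.row_group in admitted_rgs and t not in dispatched]

frontier = [Task("name", 0), Task("name", 1), Task("name", 2)]
ready = get_ready_tasks(frontier, dispatched={Task("name", 0)}, admitted_rgs={0, 1})
```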

Fixed (from Greptile review)

  • Non-retryable from_scratch/batch failure now drops all rows in the row group (previously left the row group stuck)
  • Salvage rounds re-dispatch from_scratch tasks directly instead of relying on the frontier (which never contains them)
  • _row_group_sizes freed on checkpoint (minor memory leak)
  • Row-count mismatch warning in _run_batch writeback

Lines breakdown

| Category | Added | Removed | Net |
| --- | --- | --- | --- |
| Core libraries | 560 | 27 | +533 |
| Tests | 611 | 6 | +605 |
| Plan updates | 56 | 31 | +25 |
| Total | 1,212 | 46 | +1,166 |

Attention Areas

Reviewers: Please pay special attention to the following:

  • async_scheduler.py - Core scheduling logic: admission loop, dispatch loop, salvage rounds (including from_scratch re-dispatch), and task execution paths (from_scratch, cell, batch)
  • completion_tracker.py - seed_frontier() is now public and opt-in; get_ready_tasks has new admitted_rgs parameter
  • row_group_buffer.py - update_batch size validation, checkpoint memory cleanup


@andreatgretel andreatgretel requested a review from a team as a code owner March 12, 2026 21:25

greptile-apps bot commented Mar 12, 2026

Greptile Summary

This PR adds the AsyncTaskScheduler and RowGroupBufferManager — the core orchestration layer for parallel, dependency-aware column generation — along with comprehensive test suites for both components and an updated CompletionTracker (seed_frontier() made public/opt-in, get_ready_tasks gains admitted_rgs filter).

All issues flagged in the previous Greptile review round (non-retryable batch drops, salvage from_scratch re-dispatch, stateful lock acquisition in salvage, _drain_frontier downstream propagation, post-salvage checkpoint sweep, early-exit logger.error, update_batch size validation, _row_group_sizes memory leak, unconditional on_complete call) are confirmed resolved.

New findings:

  • Missing batch_alias re-registration in salvage from_scratch dispatch (async_scheduler.py:171): _dispatch_seeds adds both the from_scratch task and its batch alias to _dispatched to prevent the frontier from generating a duplicate. The salvage retry re-adds only the from_scratch task but omits the batch_alias. Seeds are not in the frontier during normal scheduler operation (since seed_frontier() is not auto-called), so this does not produce a bug today — but the omission is an inconsistency that could silently cause double-dispatch if the usage pattern changes.
  • _dispatched set grows unboundedly (async_scheduler.py:333): Completed tasks are never pruned from _dispatched. For long runs with many row groups, this accumulates all historical task objects. Since completed tasks are already removed from the frontier, their entries in _dispatched are redundant and could be discarded from _in_flight's finally block.
  • Missing buffer test: on_complete(None) for all-dropped row group (test_row_group_buffer.py:108): The fix to call on_complete unconditionally is implemented correctly, but there is no test covering the case where all rows are dropped and the callback receives None.

Confidence Score: 4/5

  • Safe to merge with the minor batch_alias inconsistency understood and accepted as a latent risk.
  • All previously-identified bugs are confirmed fixed. The remaining findings are a latent inconsistency in the salvage dispatch path (not triggerable under current usage), a minor memory growth concern, and a missing test case. The core scheduling logic — admission, dispatch, stateful serialization, salvage, and checkpoint — is structurally sound and well-tested across 22 new tests.
  • Pay closest attention to async_scheduler.py — specifically the salvage dispatch block (lines 150–174) for the missing batch_alias re-registration.

Important Files Changed

  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py — New core scheduler: admission loop, seed dispatch, main dispatch/drain loop, salvage rounds, and checkpoint logic are all well-structured. All previously-flagged bugs (non-retryable batch drop, salvage from_scratch re-dispatch, stateful lock in salvage, _drain_frontier loop, post-salvage checkpoint sweep, early-exit logger.error) confirmed fixed. Two new findings: missing batch_alias re-registration in salvage dispatch (logic inconsistency) and unbounded growth of _dispatched set (style/memory).
  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/completion_tracker.py — seed_frontier() is now correctly public and opt-in; get_ready_tasks admitted_rgs filter is clean; _enqueue_downstream and _reevaluate_batch_tasks logic look correct. No new issues found.
  • packages/data-designer-engine/src/data_designer/engine/dataset_builders/utils/row_group_buffer.py — update_batch size validation (ValueError on mismatch), checkpoint memory cleanup (_row_group_sizes.pop), and on_complete called unconditionally (with None for empty df) are all correctly implemented. No new issues found.
  • packages/data-designer-engine/tests/engine/dataset_builders/utils/test_row_group_buffer.py — 9 tests covering init, cell/batch writes, DataFrame exclusion, concurrent row groups, checkpoint memory cleanup, and on_complete callback. Missing a test for on_complete(None) when all rows are dropped.

Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller
    participant Scheduler as AsyncTaskScheduler
    participant Admission as _admit_row_groups
    participant Tracker as CompletionTracker
    participant Buffer as RowGroupBufferManager

    Caller->>Scheduler: run()
    Scheduler->>Admission: create_task(_admit_row_groups)

    loop For each row group
        Admission->>Admission: _rg_semaphore.acquire()
        Admission->>Buffer: init_row_group(rg_id, rg_size)
        Admission->>Scheduler: _dispatch_seeds(rg_id)
        Scheduler->>Scheduler: acquire stateful_lock + submission_semaphore
        Scheduler->>Scheduler: create_task(_execute_seed_task)
        Scheduler->>Admission: wake_event.set()
    end
    Admission->>Scheduler: _all_rgs_admitted = True

    loop Main dispatch loop
        Scheduler->>Tracker: get_ready_tasks(dispatched, admitted_rgs)
        Tracker-->>Scheduler: [ready tasks]
        Scheduler->>Scheduler: acquire submission_semaphore per task
        Scheduler->>Scheduler: create_task(_execute_task)
        Scheduler->>Scheduler: _checkpoint_completed_row_groups()
        Scheduler->>Buffer: checkpoint_row_group(rg_id)
        Scheduler->>Scheduler: _rg_semaphore.release()
        Scheduler->>Caller: on_row_group_complete(rg_id)
    end

    alt Retryable failures exist
        loop Salvage rounds (max salvage_max_rounds)
            Scheduler->>Scheduler: re-dispatch from_scratch tasks directly
            Scheduler->>Scheduler: _drain_frontier() [loop until quiescent]
            Scheduler->>Tracker: get_ready_tasks(dispatched, admitted_rgs)
            Tracker-->>Scheduler: [ready tasks]
            Scheduler->>Scheduler: _checkpoint_completed_row_groups()
        end
    end

    alt Unfinished row groups remain
        Scheduler->>Scheduler: logger.error(incomplete row groups)
    end
```
This is a comment left during a code review.
Path: packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
Line: 171-174

Comment:
**Missing `batch_alias` re-registration in salvage dispatch**

In `_dispatch_seeds`, both the primary `from_scratch` task and its `batch_alias` are added to `_dispatched` together to prevent `get_ready_tasks` from generating a duplicate for the seed column:

```python
self._dispatched.add(task)
self._dispatched.add(batch_alias)   # prevents frontier from re-dispatching this column
```

In the salvage loop, after clearing the aliases, only the `from_scratch` task is re-added — the `batch_alias` is not:

```python
self._dispatched.add(task)          # re-added
# batch_alias is NOT re-added here
self._in_flight.add(task)
asyncio.create_task(self._execute_seed_task(task, gid))
```

Because `seed_frontier()` is not auto-called, seed columns do not appear in the frontier under normal scheduler operation, so this omission does not cause an observable bug today. However, if `_drain_frontier` ever encounters a stale `batch` task for this column in the frontier (e.g., from a caller that also invokes `seed_frontier()` for capacity planning), the batch alias guard would be absent and the task could be double-dispatched — bypassing the stateful lock.

Mirroring `_dispatch_seeds` defensively:
```suggestion
                    await self._submission_semaphore.acquire()
                    self._dispatched.add(task)
                    self._dispatched.add(
                        Task(column=task.column, row_group=task.row_group, row_index=None, task_type="batch")
                    )
                    self._in_flight.add(task)
                    asyncio.create_task(self._execute_seed_task(task, gid))
```


---

Path: packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py
Line: 333-334

Comment:
**`_dispatched` set grows unboundedly**

Completed tasks are never pruned from `_dispatched`. The set accumulates every task dispatched across the entire run — seeds, cells, and batch tasks for every row group and column — and is only read (never cleared) after tasks finish. For a large job with many row groups and columns, this can become a non-trivial memory footprint.

Because completed tasks are already removed from `_frontier` by `mark_cell_complete` / `mark_row_range_complete`, the `t not in dispatched` guard in `get_ready_tasks` is redundant for them. The set could be pruned by discarding a task once it transitions from `_in_flight` to done, keeping it sized to the active working set rather than the full historical set:

```python
# In _execute_task_inner finally block, after discarding from _in_flight:
self._in_flight.discard(task)
self._dispatched.discard(task)   # reclaim memory for completed tasks
```

Note: do not discard inside the retryable error branch (the task must stay in `_dispatched` until the salvage loop explicitly clears it).


---

Path: packages/data-designer-engine/tests/engine/dataset_builders/utils/test_row_group_buffer.py
Line: 108-121

Comment:
**Missing test: `on_complete` called with `None` when all rows are dropped**

`checkpoint_row_group` now calls `on_complete(final_path)` unconditionally, where `final_path` is `None` when the row group has no records (all rows dropped). The existing `test_checkpoint_calls_on_complete` only covers the non-empty case, leaving the all-dropped code path untested.

A minimal addition would be:

```python
def test_checkpoint_calls_on_complete_when_all_rows_dropped() -> None:
    storage = _mock_artifact_storage()
    callback = Mock()

    mgr = RowGroupBufferManager(storage)
    mgr.init_row_group(0, 2)
    mgr.drop_row(0, 0)
    mgr.drop_row(0, 1)

    mgr.checkpoint_row_group(0, on_complete=callback)

    # on_complete must still fire, with None path since nothing was written
    callback.assert_called_once_with(None)
    storage.write_batch_to_parquet_file.assert_not_called()
```

This is the code path exercised when a non-retryable `from_scratch` failure drops every row in a row group, and callers using `on_complete` to gate downstream work need to know it fires correctly even for empty row groups.


Last reviewed commit: 2b65c45

andreatgretel added a commit that referenced this pull request Mar 12, 2026
requests 2.32.5 asserts chardet<6 at import time, but sqlfluff and
diff_cover pull in chardet without an upper bound, resolving to 7.1.0
on fresh installs. Add a workspace constraint to cap chardet until a
new requests release ships the fix from psf/requests#7220.

Closes #404
@andreatgretel andreatgretel reopened this Mar 12, 2026
Add two missing test cases for AsyncTaskScheduler:
- Transient 503 failure deferred to salvage round and recovered
- Downstream column never dispatched for rows dropped by upstream failure

Update plan checkboxes to reflect PR 3 completion status.
- Non-retryable batch/from_scratch failure now drops all rows in the
  row group so it reaches a terminal state and gets checkpointed
- Salvage rounds re-dispatch from_scratch tasks directly instead of
  relying on the frontier (which never contains them)
- Log error if scheduler exits with unfinished row groups
- update_batch raises ValueError on size mismatch
- Free _row_group_sizes entry on checkpoint
- Add row-count mismatch warning in _run_batch writeback
- Document intentional stateful lock ordering in _dispatch_seeds
andreatgretel and others added 3 commits March 12, 2026 19:13
…Tracker

Keep the tracker self-contained for static introspection (capacity
planning, task enumeration) without requiring a scheduler. The scheduler
does not call it - it manages root dispatch directly for stateful locks
and multi-column dedup.
…nt sweep

- Acquire stateful lock before re-dispatching from_scratch tasks in
  salvage (prevents RuntimeError on lock release)
- Replace single-pass salvage drain with _drain_frontier() that
  re-checks the frontier after each task completes (ensures downstream
  tasks enqueued during retry are dispatched)
- Add post-salvage checkpoint sweep via _checkpoint_completed_row_groups
  (row groups completed during salvage are now written to parquet)
- Extract _checkpoint_completed_row_groups helper, reused by both the
  main loop and post-salvage sweep
…mplete for empty groups

- Move _checkpoint_completed_row_groups inside the salvage loop so
  completed row groups are flushed promptly between rounds
- Simplify _deferred from list[tuple[Task, int]] to list[Task] since
  the attempt count was unused (salvage_max_rounds bounds retries)
- Fire on_complete(None) when all rows in a row group are dropped so
  callers always receive a completion signal
@nabinchha nabinchha left a comment


Findings

Critical — Must fix before merge

async_scheduler.py:425-432 — _is_retryable re-implements classification the model layer already provides

  • What: Retryability is determined by substring-matching str(exc) against "429", "500", "502", "503", "504". Each of these codes is vulnerable to false positives: "429" matches "4290 records", "502" matches "port 5023", "503" matches "5030 bytes", "504" matches "50400 tokens", etc.
  • Why: The model layer already converts raw litellm exceptions into typed ModelXxxError subclasses (via @acatch_llm_exceptions on ModelFacade). By the time exceptions reach the scheduler, they're already ModelRateLimitError, ModelTimeoutError, ModelInternalServerError, ModelAPIConnectionError, etc. The string parsing is re-doing classification that was already done upstream, and doing it less accurately.
  • Suggestion: Replace string matching with isinstance checks against the existing typed exceptions:

```python
from data_designer.engine.models.errors import (
    ModelAPIConnectionError,
    ModelInternalServerError,
    ModelRateLimitError,
    ModelTimeoutError,
)

_RETRYABLE_MODEL_ERRORS = (
    ModelRateLimitError,      # 429
    ModelTimeoutError,        # timeout
    ModelInternalServerError, # 500
    ModelAPIConnectionError,  # network failures
)

@staticmethod
def _is_retryable(exc: Exception) -> bool:
    return isinstance(exc, _RETRYABLE_MODEL_ERRORS)
```

This is precise, has no false positives, and is explicit about which errors are retried. Non-LLM generators (samplers, expressions) throw errors that aren't ModelXxxError — those correctly fall through as non-retryable since sampler/expression failures are almost always logic bugs.

Warnings — Strongly recommend fixing

async_scheduler.py:210-216 — Missing try/finally around checkpoint + semaphore release

  • What: In _checkpoint_completed_row_groups, if checkpoint_row_group() or on_row_group_complete() raises, the _rg_semaphore.release() on line 216 is never reached.
  • Why: An un-released semaphore slot means _admit_row_groups blocks forever on the next acquire(), deadlocking the scheduler with no visible error.
  • Suggestion: Wrap in try/finally:

```python
for rg_id, rg_size in completed:
    self._active_rgs.remove((rg_id, rg_size))
    try:
        if self._buffer_manager is not None:
            self._buffer_manager.checkpoint_row_group(rg_id)
        if self._on_row_group_complete:
            self._on_row_group_complete(rg_id)
    finally:
        self._rg_semaphore.release()
```

async_scheduler.py:359 — assert used for runtime validation

  • What: assert task.row_index is not None in _run_cell can be silently stripped by python -O.
  • Why: If this ever fires in production (e.g. a scheduler bug dispatches a batch task to _run_cell), it would proceed with row_index=None and crash with a confusing TypeError deeper in the stack.
  • Suggestion: Replace with an explicit check:

```python
if task.row_index is None:
    raise ValueError(f"Cell task requires a row_index, got None for column '{task.column}'")
```

async_scheduler.py:340 and row_group_buffer.py:86 — Imports inside method bodies

  • What: FromScratchColumnGenerator is imported inside _run_from_scratch, and BatchStage is imported inside checkpoint_row_group.
  • Why: AGENTS.md mandates module-level imports. If these cause circular imports at module level, they should go in a TYPE_CHECKING block (for type-only use) or the architecture should be adjusted.
  • Suggestion: For async_scheduler.py, FromScratchColumnGenerator is already in the TYPE_CHECKING block for type hints — but it's also used in an isinstance check at runtime, so it genuinely needs a runtime import. This is a valid exception; add a comment explaining why (e.g. # Runtime import: needed for isinstance check; module-level would cause circular import). Same pattern for BatchStage.

row_group_buffer.py:80 — Unparameterized Callable type

  • What: on_complete: Callable | None = None doesn't specify the callback's signature.
  • Why: Callers and static analysis can't verify they're passing the right callback shape. The callback receives str | None (line 100 passes final_path which is None when the df is empty).
  • Suggestion: on_complete: Callable[[str | None], None] | None = None

async_scheduler.py:131-132 — Early exit condition may skip deferred tasks

  • What: The second break condition (all_rgs_admitted and not ready and not self._in_flight) fires if no ready tasks exist and nothing is in-flight, but deferred tasks may still be waiting for salvage.
  • Why: This is intentional (salvage runs after the main loop), but the two break conditions on lines 127-132 are partially redundant and make the exit semantics unclear. The first condition (all_done) already covers not self._in_flight via not self._active_rgs.
  • Suggestion: Add a brief comment explaining the two exit paths, or simplify to a single condition. The second break seems to cover the case where all active RGs had their non-deferred tasks finish but the RGs themselves aren't "complete" yet (because deferred tasks remain). Clarify this with a comment.
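One way to make the two exit paths explicit is to pull the conditions into a small predicate with the comments the review asks for. This is a hypothetical restatement under assumed names, not the PR's code:

```python
def should_exit(all_rgs_admitted, active_rgs, ready, in_flight):
    # Exit path 1: every admitted row group has reached a terminal state.
    if all_rgs_admitted and not active_rgs:
        return True
    # Exit path 2: nothing dispatchable and nothing in flight; whatever
    # remains is deferred work, which the salvage loop handles next.
    if all_rgs_admitted and not ready and not in_flight:
        return True
    return False
```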

Suggestions — Consider improving

async_scheduler.py:211 — O(n) list removal in _checkpoint_completed_row_groups

  • What: self._active_rgs.remove((rg_id, rg_size)) is a linear scan per removal.
  • Why: With many row groups this becomes O(n^2). Unlikely to matter in practice (row groups are bounded by max_concurrent_row_groups), but could use a set or dict for O(1) removal.
  • Suggestion: Low priority given the small concurrent RG count, but worth noting for future scaling.
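A minimal sketch of the dict-based alternative (hypothetical names; the PR's actual field is a list of `(rg_id, rg_size)` tuples):

```python
active_rgs: dict[int, int] = {}   # rg_id -> rg_size

def admit(rg_id: int, rg_size: int) -> None:
    active_rgs[rg_id] = rg_size

def checkpoint(rg_id: int) -> int:
    # dict.pop is O(1), versus the O(n) scan of list.remove((rg_id, rg_size))
    return active_rgs.pop(rg_id)

admit(0, 100)
admit(1, 200)
freed = checkpoint(0)
```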

test_row_group_buffer.py:104 — Test asserts on private _buffers

  • What: assert 0 not in mgr._buffers reaches into the private buffer dict.
  • Why: Per AGENTS.md, tests should exercise public APIs. A KeyError from get_row(0, 0) after checkpoint would verify the same behavior through the public interface.
  • Suggestion: Replace with:

```python
with pytest.raises(KeyError):
    mgr.get_row(0, 0)
```

async_scheduler.py:61-63 — id(gen) for instance identity may collide after GC

  • What: Generator identity is tracked via id(gen), but Python can reuse id() values for objects that have been garbage collected.
  • Why: In this case all generators are held in self._generators for the scheduler's lifetime, so they can't be GC'd. Safe as-is, but fragile if the pattern is copied elsewhere.
  • Suggestion: No code change needed — just a note for future maintainers. A comment on line 63 like # id() is stable because generators are held in self._generators would help.

row_group_buffer.py:99-100 — on_complete called with None when df is empty

  • What: When all rows are dropped and len(df) == 0, final_path stays None and on_complete(None) is called.
  • Why: Callers may not expect a None argument. This case isn't covered by tests.
  • Suggestion: Either skip the callback when there's no file to report, or add a test for the empty-df case.

What Looks Good

  • Thorough test coverage: 13 async scheduler tests + 9 buffer tests + expanded completion tracker tests cover the key scenarios: multi-RG, salvage recovery, eager row drops, stateful serialization, bounded submission, and tracing. The test_scheduler_eager_row_drop_skips_downstream_of_failed_column test is particularly well designed.
  • Clean separation of concerns: The scheduler orchestrates, the buffer manages memory, and the completion tracker handles dependency resolution. Each has a clear single responsibility with well-defined interfaces.
  • Defensive salvage design: The fix for non-retryable from_scratch/batch failures (dropping all rows in the RG) and the direct re-dispatch of from_scratch tasks in salvage rounds are both solid. The Greptile-review fixes are well addressed.

Verdict

Needs changes — _is_retryable should use the typed exception hierarchy (ModelRateLimitError, etc.) that the model layer already provides, rather than string-matching on exception messages. The missing try/finally around semaphore release could cause deadlocks under failure conditions. The other warnings are lower risk but worth addressing in this PR since the code is new.

Next steps

  • Replace _is_retryable string matching with isinstance checks against ModelXxxError types
  • Add try/finally guard around checkpoint/callback to protect semaphore release
  • Replace assert with explicit ValueError
  • Add tests for _is_retryable covering both retryable model errors and non-retryable generator errors
