feat: generate an exact number of rows matching a declared criterion

## Problem

DataDesigner users can generate candidate rows and filter them afterward, but they cannot declaratively request an exact number of rows that satisfy a quality or validation criterion. Doing this today requires user-managed loops across multiple `create()` calls, including candidate budgeting, artifact extension, trimming, and resume behavior.

Examples include generating 5,000 answers above a judge threshold, 10,000 policy-compliant conversations, or a fixed number of successful tool-use traces.

## Proposed feature

Add engine-native record selection to a normal `DataDesigner.create()` run. A user declares a boolean predicate column and a hard candidate limit. When selection is configured, `num_records` means the desired number of accepted output rows:

```python
builder.with_record_selection(
    dd.RecordSelectionConfig(
        predicate_column="meets_criteria",
        max_candidate_records=20_000,
        on_exhausted="raise",
    )
)

results = data_designer.create(builder, num_records=5_000)
```

The engine generates immutable candidate batches, evaluates the predicate, checkpoints accepted rows, and continues until the target is met or the candidate budget is exhausted.
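The loop described above can be sketched in plain Python. This is only an illustration of the control flow under stated assumptions, not DataDesigner's implementation: `select_records`, `generate_batch`, and `batch_size` are hypothetical names, and checkpointing, processors, and concurrency are left out.

```python
from typing import Callable


def select_records(
    generate_batch: Callable[[int, int], list[dict]],  # hypothetical: (offset, size) -> rows
    predicate_column: str,
    num_records: int,
    max_candidate_records: int,
    batch_size: int,  # hypothetical knob; the real engine would derive this from its buffer
    on_exhausted: str = "raise",
) -> list[dict]:
    """Collect accepted rows until the target is met or the candidate budget runs out."""
    accepted: list[dict] = []
    candidates_attempted = 0  # tracked separately from the accepted count

    while len(accepted) < num_records and candidates_attempted < max_candidate_records:
        # Never request more candidates than the remaining budget allows.
        size = min(batch_size, max_candidate_records - candidates_attempted)
        batch = generate_batch(candidates_attempted, size)
        candidates_attempted += size
        # Keep only rows whose predicate column is exactly True.
        accepted.extend(row for row in batch if row[predicate_column] is True)

    if len(accepted) < num_records and on_exhausted == "raise":
        raise RuntimeError(
            f"accepted {len(accepted)} of {num_records} rows "
            f"after {candidates_attempted} candidates"
        )
    # Trim overshoot from the final batch; candidate order is preserved.
    return accepted[:num_records]


def fake_batch(offset: int, size: int) -> list[dict]:
    # Stand-in generator: every fourth candidate passes the predicate.
    return [{"id": i, "meets_criteria": i % 4 == 0} for i in range(offset, offset + size)]


rows = select_records(fake_batch, "meets_criteria", num_records=5,
                      max_candidate_records=100, batch_size=8)
print([row["id"] for row in rows])  # [0, 4, 8, 12, 16]
```

In this sketch the third batch yields a sixth accepted row, which the final slice drops, so the result holds exactly `num_records` rows in candidate order.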

## V1 scope

- Treat `num_records` as the accepted-row target when record selection is enabled.
- Require an explicit boolean predicate column and hard `max_candidate_records` bound.
- Support `raise` and `return_partial` as the `on_exhausted` behaviors when the candidate budget runs out before the target is met.
- Preserve deterministic candidate ordering and trim final overshoot exactly.
- Track candidate attempts separately from accepted output.
- Persist candidate-batch completion markers so zero-acceptance batches and resume are correct.
- Preserve normal DAG generation, processors, profiling, plugins, and model-usage accounting.
- Run one candidate batch at a time while retaining the scheduler's normal within-row-group concurrency.

## Out of scope

- Concurrent candidate batches.
- Early cancellation of downstream row work after predicate rejection.
- Exporting every rejected candidate.
- Unbounded generation based on an expected acceptance rate.
- Row-count-changing after-generation processors.

These are optional performance or expansion ideas, not requirements for a complete V1.

## Acceptance criteria

- A single `create()` call returns exactly `num_records` matching rows whenever the candidate budget is sufficient.
- Generation is bounded and exhaustion behavior is explicit.
- Candidate progress is independent of accepted parquet row counts.
- Runs resume after any committed candidate batch without repeating candidate offsets or accepted output.
- Zero-acceptance candidate batches remain durably complete.
- Output processing and profiling operate on accepted rows only, while usage metadata includes rejected work.
- Default-buffer behavior works when the target is reached only after three or more candidate batches.
- Documentation explains cost, bounds, partial exhaustion, and resume semantics.
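One way to satisfy the resume and zero-acceptance criteria is to derive progress from per-batch completion markers rather than from accepted row counts. The sketch below assumes a hypothetical layout of one JSON marker per committed candidate batch; the file names, fields, and the `commit_batch` and `load_progress` helpers are illustrative and are not taken from the linked plan.

```python
import json
import tempfile
from pathlib import Path


def commit_batch(run_dir: Path, batch_index: int, offset: int, size: int, accepted: int) -> None:
    """Durably mark a candidate batch as complete, even if it accepted zero rows."""
    final = run_dir / f"batch-{batch_index:05d}.json"
    tmp = run_dir / f"batch-{batch_index:05d}.json.tmp"
    tmp.write_text(json.dumps({"offset": offset, "size": size, "accepted": accepted}))
    # Rename only after the marker is fully written, so a crash never leaves a partial marker.
    tmp.replace(final)


def load_progress(run_dir: Path) -> tuple[int, int]:
    """Return (next candidate offset, accepted rows so far) from committed markers only."""
    next_offset = 0
    accepted = 0
    for marker in sorted(run_dir.glob("batch-*.json")):  # leftover .tmp files do not match
        record = json.loads(marker.read_text())
        next_offset = max(next_offset, record["offset"] + record["size"])
        accepted += record["accepted"]
    return next_offset, accepted


with tempfile.TemporaryDirectory() as tmp_dir:
    run_dir = Path(tmp_dir)
    commit_batch(run_dir, 0, offset=0, size=100, accepted=7)
    commit_batch(run_dir, 1, offset=100, size=100, accepted=0)  # zero-acceptance batch
    progress = load_progress(run_dir)
    print(progress)  # (200, 7)
```

Because the second batch leaves a marker despite accepting nothing, a resumed run in this sketch starts at candidate offset 200 instead of regenerating offsets 100 to 199.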

## Investigation and context

PR #773 explored workflow-level repetition. Real-run testing showed that extending a static row-group plan can stall when small increments remain inside the same buffer boundary, and that an interrupted append flow can request a target smaller than persisted output. Record selection therefore needs engine-owned candidate progress rather than repeated public `create()` orchestration.

## Detailed plan

The source-of-truth design, architecture diagrams, implementation phases, test plan, and definition of done are in [`plans/790/engine-native-record-selection.md`](https://github.com/NVIDIA-NeMo/DataDesigner/blob/main/plans/790/engine-native-record-selection.md).

