Problem
DataDesigner users can generate candidate rows and filter them afterward, but they cannot declaratively request an exact number of rows that satisfy a quality or validation criterion. Doing this today requires user-managed loops across multiple create() calls, including candidate budgeting, artifact extension, trimming, and resume behavior.
Examples include generating 5,000 answers above a judge threshold, 10,000 policy-compliant conversations, or a fixed number of successful tool-use traces.
Proposed feature
Add engine-native record selection to a normal DataDesigner.create() run. A user declares a boolean predicate column and a hard candidate limit. When selection is configured, num_records means the desired number of accepted output rows:
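builder.with_record_selection(
    dd.RecordSelectionConfig(
        predicate_column="meets_criteria",  # boolean column evaluated per candidate row
        max_candidate_records=20_000,       # hard bound on generated candidates
        on_exhausted="raise",               # or "return_partial"
    )
)
# With selection configured, num_records is the accepted-row target.
results = data_designer.create(builder, num_records=5_000)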
The engine generates immutable candidate batches, evaluates the predicate, checkpoints accepted rows, and continues until the target is met or the candidate budget is exhausted.
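The loop described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the engine's implementation: `select_records`, `SelectionResult`, `CandidateBudgetExhausted`, `toy_batch`, and the `batch_size` parameter are hypothetical names introduced here, and checkpointing is omitted.

```python
from dataclasses import dataclass


class CandidateBudgetExhausted(RuntimeError):
    """Hypothetical error for on_exhausted="raise"."""


@dataclass
class SelectionResult:
    accepted: list
    candidates_attempted: int
    exhausted: bool


def select_records(generate_batch, predicate_column, num_records,
                   max_candidate_records, batch_size, on_exhausted="raise"):
    """Generate candidate batches until the target or the budget is reached."""
    if on_exhausted not in ("raise", "return_partial"):
        raise ValueError(f"unknown on_exhausted value: {on_exhausted!r}")
    accepted = []
    attempted = 0
    while len(accepted) < num_records and attempted < max_candidate_records:
        # Never generate past the hard candidate budget.
        size = min(batch_size, max_candidate_records - attempted)
        batch = generate_batch(attempted, size)
        attempted += size
        # Keep rows in candidate order so the final trim is deterministic.
        accepted.extend(row for row in batch if row[predicate_column] is True)
    exhausted = len(accepted) < num_records
    if exhausted and on_exhausted == "raise":
        raise CandidateBudgetExhausted(
            f"accepted {len(accepted)} of {num_records} rows "
            f"after {attempted} candidates"
        )
    # Trim any overshoot from the last batch to exactly num_records.
    return SelectionResult(accepted[:num_records], attempted, exhausted)


def toy_batch(offset, size):
    """Stand-in generator: every third candidate passes the predicate."""
    return [{"id": i, "meets_criteria": i % 3 == 0}
            for i in range(offset, offset + size)]


result = select_records(toy_batch, "meets_criteria", num_records=5,
                        max_candidate_records=100, batch_size=4)
print([row["id"] for row in result.accepted], result.candidates_attempted)
```

In this toy run the fourth batch overshoots the target by one row, so the result is trimmed to five accepted rows after 16 candidates.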
V1 scope
- Treat num_records as the accepted-row target when record selection is enabled.
- Require an explicit boolean predicate column and hard max_candidate_records bound.
- Support raise and return_partial exhaustion behavior.
- Preserve deterministic candidate ordering and trim final overshoot exactly.
- Track candidate attempts separately from accepted output.
- Persist candidate-batch completion markers so zero-acceptance batches and resume are correct.
- Preserve normal DAG generation, processors, profiling, plugins, and model-usage accounting.
- Run one candidate batch at a time while retaining the scheduler's normal within-row-group concurrency.
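One way to picture the completion markers in the scope list is a small file written per committed candidate batch, independent of how many rows that batch accepted. The sketch below is an assumption for illustration only: the file naming, the JSON fields, and the helpers `commit_candidate_batch` and `load_progress` are hypothetical and do not describe the actual artifact layout.

```python
import json
import tempfile
from pathlib import Path


def commit_candidate_batch(run_dir, batch_index, candidate_offset,
                           candidate_count, accepted_count):
    """Write a completion marker, even when accepted_count is zero."""
    marker = {
        "batch_index": batch_index,
        "candidate_offset": candidate_offset,
        "candidate_count": candidate_count,
        "accepted_count": accepted_count,
    }
    final_path = run_dir / f"candidate_batch_{batch_index:06d}.done.json"
    tmp_path = final_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(marker))
    # Rename last so a crash mid-write does not leave a partial marker
    # under the final name.
    tmp_path.replace(final_path)


def load_progress(run_dir):
    """Derive resume state from markers, not from accepted row counts."""
    markers = [json.loads(path.read_text())
               for path in sorted(run_dir.glob("candidate_batch_*.done.json"))]
    return {
        "next_batch_index": len(markers),
        "next_candidate_offset": sum(m["candidate_count"] for m in markers),
        "accepted_so_far": sum(m["accepted_count"] for m in markers),
    }


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    # Batch 0 accepted nothing, but it is still durably complete.
    commit_candidate_batch(run_dir, 0, candidate_offset=0,
                           candidate_count=4, accepted_count=0)
    commit_candidate_batch(run_dir, 1, candidate_offset=4,
                           candidate_count=4, accepted_count=3)
    progress = load_progress(run_dir)

print(progress)
```

Because the resume offset comes from the markers, a batch with zero accepted rows still advances candidate progress and is not regenerated after a restart.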
Out of scope
- Concurrent candidate batches.
- Early cancellation of downstream row work after predicate rejection.
- Exporting every rejected candidate.
- Unbounded generation based on an expected acceptance rate.
- Row-count-changing after-generation processors.
These are optional performance or expansion ideas, not requirements for a complete V1.
Acceptance criteria
- A single create() call can reliably return exactly X matching rows.
- Generation is bounded and exhaustion behavior is explicit.
- Candidate progress is independent of accepted parquet row counts.
- Runs resume after any committed candidate batch without repeating candidate offsets or accepted output.
- Zero-acceptance candidate batches remain durably complete.
- Output processing and profiling operate on accepted rows only, while usage metadata includes rejected work.
- Default-buffer behavior works when the target is reached only after three or more candidate batches.
- Documentation explains cost, bounds, partial exhaustion, and resume semantics.
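The split between accepted-only outputs and usage that includes rejected work can be shown with a toy summary. The `tokens` field and the `summarize_run` helper are hypothetical, chosen only to make the accounting concrete.

```python
def summarize_run(candidates, predicate_column):
    """Profile accepted rows only; charge usage for every candidate."""
    accepted = [row for row in candidates if row[predicate_column] is True]
    return {
        # What output processing and profiling would see.
        "profiled_rows": len(accepted),
        "accepted_tokens": sum(row["tokens"] for row in accepted),
        # What usage metadata would report, rejected work included.
        "candidates_attempted": len(candidates),
        "total_tokens": sum(row["tokens"] for row in candidates),
    }


candidates = [
    {"id": 0, "meets_criteria": True, "tokens": 120},
    {"id": 1, "meets_criteria": False, "tokens": 95},
    {"id": 2, "meets_criteria": True, "tokens": 110},
    {"id": 3, "meets_criteria": False, "tokens": 130},
]
summary = summarize_run(candidates, "meets_criteria")
print(summary)
```

Here two rows are profiled, while usage covers all four candidates, which is the cost the documentation criterion asks to be made explicit.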
Investigation and context
PR #773 explored workflow-level repetition. Real-run testing showed that extending a static row-group plan can stall when small increments remain inside the same buffer boundary, and that an interrupted append flow can request a target smaller than persisted output. Record selection therefore needs engine-owned candidate progress rather than repeated public create() orchestration.
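A toy model may help picture the two failure modes. It assumes, purely for illustration, that a static plan allocates one row group per full or partial buffer. The buffer size of 1000, the row counts, and the helper names are arbitrary and are not taken from PR #773 or from the actual planner.

```python
import math

BUFFER_SIZE = 1000  # arbitrary illustrative value, not a documented default


def planned_row_groups(target_rows, buffer_size=BUFFER_SIZE):
    """Row groups a static plan would allocate in this toy model."""
    return math.ceil(target_rows / buffer_size)


def new_row_groups(previous_target, new_target, buffer_size=BUFFER_SIZE):
    """Extra row groups gained by extending the target."""
    return (planned_row_groups(new_target, buffer_size)
            - planned_row_groups(previous_target, buffer_size))


# A small increment stays inside the same buffer boundary: no new work.
stalled = new_row_groups(1001, 1005)
# A larger increment crosses a boundary and adds a row group.
crossing = new_row_groups(1001, 2001)

# An interrupted append flow may ask for fewer rows than already persisted.
persisted_rows, requested_target = 1200, 1100
remaining = requested_target - persisted_rows

print(stalled, crossing, remaining)
```

Under these assumptions the small extension schedules nothing new and the interrupted flow computes a negative remainder, which is why progress is better tracked per candidate batch inside the engine than inferred from output row counts.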
Detailed plan
The source-of-truth design, architecture diagrams, implementation phases, test plan, and definition of done are in plans/790/engine-native-record-selection.md.