feat: generate an exact number of rows matching a declared criterion #790

Description

@nabinchha

Problem

DataDesigner users can generate candidate rows and filter them afterward, but they cannot declaratively request an exact number of rows that satisfy a quality or validation criterion. Doing this today requires user-managed loops across multiple create() calls, including candidate budgeting, artifact extension, trimming, and resume behavior.

Examples include generating 5,000 answers above a judge threshold, 10,000 policy-compliant conversations, or a fixed number of successful tool-use traces.

Proposed feature

Add engine-native record selection to a normal DataDesigner.create() run. A user declares a boolean predicate column and a hard candidate limit. When selection is configured, num_records means the desired number of accepted output rows:

builder.with_record_selection(
    dd.RecordSelectionConfig(
        predicate_column="meets_criteria",
        max_candidate_records=20_000,
        on_exhausted="raise",
    )
)

results = data_designer.create(builder, num_records=5_000)

The engine generates immutable candidate batches, evaluates the predicate, checkpoints accepted rows, and continues until the target is met or the candidate budget is exhausted.
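The loop described above can be sketched in plain Python. This is only an illustration of the control flow, not the engine's implementation: `select_records`, `SelectionResult`, `CandidateBudgetExhausted`, the `generate_batch(offset, size)` callable, and the `batch_size` parameter are all hypothetical names introduced here, and checkpointing is omitted.

```python
from dataclasses import dataclass
from typing import Callable, Literal

Row = dict[str, object]


class CandidateBudgetExhausted(RuntimeError):
    """Hypothetical error for the on_exhausted="raise" case."""


@dataclass(frozen=True)
class SelectionResult:
    accepted: list[Row]
    candidates_attempted: int


def select_records(
    generate_batch: Callable[[int, int], list[Row]],
    *,
    num_records: int,
    predicate_column: str,
    max_candidate_records: int,
    batch_size: int,
    on_exhausted: Literal["raise", "return_partial"] = "raise",
) -> SelectionResult:
    """Generate candidate batches until the target or the budget is reached."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")

    accepted: list[Row] = []
    attempted = 0  # candidate progress, tracked separately from accepted rows

    while len(accepted) < num_records and attempted < max_candidate_records:
        # Never request more candidates than the hard budget allows.
        size = min(batch_size, max_candidate_records - attempted)
        batch = generate_batch(attempted, size)
        attempted += size
        # Keep only rows whose predicate is exactly True, in candidate order.
        accepted.extend(row for row in batch if row[predicate_column] is True)

    if len(accepted) < num_records and on_exhausted == "raise":
        raise CandidateBudgetExhausted(
            f"accepted {len(accepted)} of {num_records} rows "
            f"after {attempted} candidates"
        )

    # Trim any overshoot from the final batch so the output is exact.
    return SelectionResult(accepted[:num_records], attempted)


def fake_batch(offset: int, size: int) -> list[Row]:
    """Toy generator: every third candidate satisfies the predicate."""
    return [
        {"id": i, "meets_criteria": i % 3 == 0}
        for i in range(offset, offset + size)
    ]


result = select_records(
    fake_batch,
    num_records=5,
    predicate_column="meets_criteria",
    max_candidate_records=100,
    batch_size=4,
)
```

In this toy run the fourth batch pushes the accepted count from four to six, so the final slice drops the last accepted row. The attempt counter still reports all 16 candidates, which is why candidate progress and accepted output have to be tracked separately.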

V1 scope

  • Treat num_records as the accepted-row target when record selection is enabled.
  • Require an explicit boolean predicate column and hard max_candidate_records bound.
  • Support raise and return_partial exhaustion behavior.
  • Preserve deterministic candidate ordering and trim final overshoot exactly.
  • Track candidate attempts separately from accepted output.
  • Persist candidate-batch completion markers so zero-acceptance batches and resume are correct.
  • Preserve normal DAG generation, processors, profiling, plugins, and model-usage accounting.
  • Run one candidate batch at a time while retaining the scheduler's normal within-row-group concurrency.
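The "explicit boolean predicate column and hard bound" requirement could be enforced with up-front validation along these lines. This is a sketch under assumptions: `SelectionSettings` is a hypothetical stand-in rather than the proposed `RecordSelectionConfig`, and the rule that the target may not exceed the candidate budget is an inference, not something the issue states.

```python
from dataclasses import dataclass
from typing import Literal, Sequence


@dataclass(frozen=True)
class SelectionSettings:
    """Hypothetical stand-in for the proposed config; not the real class."""

    predicate_column: str
    max_candidate_records: int
    on_exhausted: Literal["raise", "return_partial"] = "raise"

    def __post_init__(self) -> None:
        if not self.predicate_column:
            raise ValueError("predicate_column must name a column")
        if self.max_candidate_records <= 0:
            raise ValueError("max_candidate_records must be a positive bound")
        if self.on_exhausted not in ("raise", "return_partial"):
            raise ValueError(f"unsupported on_exhausted: {self.on_exhausted!r}")


def check_target(settings: SelectionSettings, num_records: int) -> None:
    """Reject targets that can never be met within the candidate budget.

    Assumption: each candidate yields at most one accepted row.
    """
    if num_records <= 0:
        raise ValueError("num_records must be positive")
    if num_records > settings.max_candidate_records:
        raise ValueError("num_records exceeds max_candidate_records")


def require_boolean(values: Sequence[object], column: str) -> list[bool]:
    """Accept only real booleans; 1, "true", and None are all rejected."""
    bad = [value for value in values if not isinstance(value, bool)]
    if bad:
        raise TypeError(f"column {column!r} has non-boolean values: {bad[:3]!r}")
    return [value for value in values if isinstance(value, bool)]


settings = SelectionSettings(
    predicate_column="meets_criteria",
    max_candidate_records=20_000,
)
check_target(settings, num_records=5_000)
flags = require_boolean([True, False, True], settings.predicate_column)
```

Rejecting truthy non-booleans keeps a judge score or a string label from being silently treated as an acceptance decision.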

Out of scope

  • Concurrent candidate batches.
  • Early cancellation of downstream row work after predicate rejection.
  • Exporting every rejected candidate.
  • Unbounded generation based on an expected acceptance rate.
  • Row-count-changing after-generation processors.

These are optional performance or expansion ideas, not requirements for a complete V1.

Acceptance criteria

  • A single create() call can reliably return exactly the requested number of matching rows (num_records).
  • Generation is bounded and exhaustion behavior is explicit.
  • Candidate progress is independent of accepted parquet row counts.
  • Runs resume after any committed candidate batch without repeating candidate offsets or accepted output.
  • Zero-acceptance candidate batches remain durably complete.
  • Output processing and profiling operate on accepted rows only, while usage metadata includes rejected work.
  • Default-buffer behavior works when the target is reached only after three or more candidate batches.
  • Documentation explains cost, bounds, partial exhaustion, and resume semantics.
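One way to satisfy the resume and zero-acceptance criteria is a small completion marker per committed candidate batch, written atomically and independent of any accepted-row files. The following is a minimal sketch of that idea only. The marker file name, the JSON fields, and the helpers `commit_batch` and `resume_state` are hypothetical and do not come from the design document.

```python
import json
import os
import tempfile
from pathlib import Path


def commit_batch(
    run_dir: Path, index: int, offset: int, size: int, accepted: int
) -> None:
    """Durably record that one candidate batch finished, even with 0 accepted."""
    marker = run_dir / f"candidate-batch-{index:06d}.json"
    payload = {"offset": offset, "size": size, "accepted": accepted}
    # Write to a temp file first, then rename, so a crash never leaves
    # a half-written marker behind.
    fd, tmp_name = tempfile.mkstemp(dir=run_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(payload, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_name, marker)


def resume_state(run_dir: Path) -> tuple[int, int, int]:
    """Return (next_batch_index, next_candidate_offset, accepted_so_far)."""
    markers = sorted(run_dir.glob("candidate-batch-*.json"))
    next_offset = 0
    accepted = 0
    for marker in markers:
        data = json.loads(marker.read_text())
        next_offset = max(next_offset, data["offset"] + data["size"])
        accepted += data["accepted"]
    return len(markers), next_offset, accepted


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    fresh = resume_state(run_dir)
    commit_batch(run_dir, index=0, offset=0, size=100, accepted=40)
    # A batch that accepted nothing still advances candidate progress.
    commit_batch(run_dir, index=1, offset=100, size=100, accepted=0)
    resumed = resume_state(run_dir)
```

Because progress is read from the markers rather than from accepted row counts, the second batch is not regenerated on resume even though it contributed no output rows.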

Investigation and context

PR #773 explored workflow-level repetition. Real-run testing showed that extending a static row-group plan can stall when small increments remain inside the same buffer boundary, and that an interrupted append flow can request a target smaller than persisted output. Record selection therefore needs engine-owned candidate progress rather than repeated public create() orchestration.

Detailed plan

The source-of-truth design, architecture diagrams, implementation phases, test plan, and definition of done are in plans/790/engine-native-record-selection.md.

Labels: enhancement (New feature or request)