Use DataDesigner native resume for retrieval SDG by shan-nvidia · Pull Request #51 · NVIDIA-NeMo/DataDesignerPlugins

shan-nvidia · 2026-05-28T13:32:29Z

What

Updates data-designer-retrieval-sdg to use DataDesigner native resumable generation instead of the plugin's manual per-batch restart loop.

The generate command now:

Requires data-designer>=0.6.1.
Calls DataDesigner.create(...) once across the full seed range.
Adds --resume/-r {never,always,if_possible} and passes the selected ResumeMode.
Adds stable --dataset-name support, with validation before the name is handed to DataDesigner.
Replaces the old generation batch controls with --buffer-size.
Exports one JSONL file named from the resolved DataDesigner dataset name.
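As a rough illustration of the dataset-name validation mentioned above, the sketch below applies the rejection rules listed under the review follow-ups (`.`, `..`, separators, control characters, names that resolve outside the artifact path). The function name `validate_dataset_name` and its error messages are hypothetical, not the plugin's actual code.

```python
from pathlib import Path


def validate_dataset_name(name: str, artifact_path: str) -> str:
    """Return name unchanged if it is safe under artifact_path, else raise ValueError.

    Hypothetical helper: the rules mirror the PR description, not the plugin source.
    """
    if name in ("", ".", ".."):
        raise ValueError(f"unsafe dataset name: {name!r}")
    if "/" in name or "\\" in name:
        raise ValueError(f"dataset name must not contain path separators: {name!r}")
    if any(ord(ch) < 32 or ord(ch) == 127 for ch in name):
        raise ValueError(f"dataset name must not contain control characters: {name!r}")
    root = Path(artifact_path).resolve()
    target = (root / name).resolve()
    # After resolving symlinks, the dataset directory must still sit under the root.
    if root not in target.parents:
        raise ValueError(f"dataset name resolves outside {root}: {name!r}")
    return name
```

Checking the resolved path last catches cases the character checks miss, such as a symlink inside the artifact directory that points elsewhere.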

The converter keeps backward compatibility for legacy generated_batch*.json outputs while also accepting .jsonl, .json, and .parquet files.
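One way to picture the converter's input handling is a small dispatch on file name and extension. This is a minimal sketch under the assumption that legacy batch files are matched by name before the generic `.json` case; `classify_input` and its return labels are hypothetical and not taken from the plugin.

```python
import fnmatch
from pathlib import Path


def classify_input(path: str) -> str:
    """Return a loader label for a converter input file, or raise ValueError."""
    p = Path(path)
    # Legacy per-batch output takes precedence over the generic .json case.
    if fnmatch.fnmatch(p.name, "generated_batch*.json"):
        return "legacy_batch_json"
    suffix = p.suffix.lower()
    if suffix in {".jsonl", ".json", ".parquet"}:
        return suffix.lstrip(".")
    raise ValueError(f"unsupported converter input: {path}")
```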

Review follow-ups in the latest commit:

Reject unsafe dataset names: ., .., separators, control characters, and names that resolve outside --artifact-path.
Normalize parquet-loaded array-like nested values before converter filtering and corpus mapping.
Restore the workspace resolver constraints for python-multipart and urllib3.
Drive the native-resume CLI test through main() and assert export uses result.artifact_storage.resolved_dataset_name.
Document that DataDesigner 0.6.1 still profiles the completed dataset before create() returns, so --buffer-size is checkpoint granularity rather than a final peak-memory cap.
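The parquet normalization follow-up can be sketched as a recursive conversion of array-like values into plain lists. The version below is an assumption about the general approach, not the converter's code: it duck-types on a `tolist()` method (which NumPy arrays provide) so it runs without third-party imports, and the name `normalize_nested` is hypothetical.

```python
from typing import Any


def normalize_nested(value: Any) -> Any:
    """Recursively convert array-like containers into plain lists and dicts."""
    # NumPy arrays and similar types expose tolist(); built-in containers do not.
    if hasattr(value, "tolist") and not isinstance(value, (str, bytes)):
        value = value.tolist()
    if isinstance(value, dict):
        return {key: normalize_nested(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize_nested(item) for item in value]
    return value
```

Running this before filtering means downstream code can rely on ordinary `list` and `dict` semantics regardless of whether a record came from JSONL or parquet.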

Why

DataDesigner 0.6.x owns interrupted-run checkpointing now. Keeping plugin-level --batch-size, --start-batch-index, and --end-batch-index would leave two competing restart systems and make resume behavior harder to reason about.

This PR makes DataDesigner artifacts the source of truth for resume, while --output-dir remains the downstream exported JSONL location.

Usage

data-designer-retrieval-sdg generate \
  --input-dir ./my_documents \
  --output-dir ./generated_output \
  --artifact-path ./artifacts \
  --dataset-name my_retrieval_run \
  --buffer-size 200 \
  --resume if_possible \
  --num-pairs 7

Resume an interrupted run with the same artifact path, dataset name, config, and buffer size:

data-designer-retrieval-sdg generate \
  --input-dir ./my_documents \
  --output-dir ./generated_output \
  --artifact-path ./artifacts \
  --dataset-name my_retrieval_run \
  --buffer-size 200 \
  --resume always

Convert the exported JSONL:

data-designer-retrieval-sdg convert ./generated_output/my_retrieval_run.jsonl \
  --corpus-id my_corpus

How

The CLI builds the QA pipeline once using the full discovered seed range and calls create(..., num_records=total_records, dataset_name=..., resume=...).

dd.RunConfig(disable_early_shutdown=True, buffer_size=...) controls DataDesigner checkpoint granularity. The old manual batch JSON writer was removed from generate, but the converter still recognizes legacy batch JSON for existing outputs.

The embedding-dedup scheduler test was updated for DataDesigner 0.6.1's model scheduling metadata API, replacing the old is_llm_bound assertion with get_scheduling_metadata() coverage.

The workspace declares pytest explicitly in the root dev dependency group so make test does not depend on a transitive test dependency from another tool.

Validation

Local checks:

UV_CACHE_DIR=/private/tmp/uv-cache make sync
UV_CACHE_DIR=/private/tmp/uv-cache make lint
UV_CACHE_DIR=/private/tmp/uv-cache uv run pytest plugins/data-designer-retrieval-sdg/tests/test_cli.py plugins/data-designer-retrieval-sdg/tests/test_convert.py (30 passed)
UV_CACHE_DIR=/private/tmp/uv-cache make test-plugin PLUGIN=data-designer-retrieval-sdg (78 passed)
UV_CACHE_DIR=/private/tmp/uv-cache make validate
UV_CACHE_DIR=/private/tmp/uv-cache make check

Real resume smoke against DataDesigner 0.6.1:

Input: /Users/sthan/workspace/retriever-sdg-v3/examples/nv_pp_random
Command used --num-files 2 --buffer-size 1 --resume never
Interrupted after completed parquet-files/batch_00000.parquet and partial tmp-partial-parquet-files/batch_00001.parquet
Resumed with --resume always
DataDesigner logged: Resuming from batch 2 of 2 (1 records already generated).
Final artifacts contained batch_00000.parquet and batch_00001.parquet
Final JSONL contained 2 lines
Final seed order: build-nvidia/nvidia/corrdiff.txt, then dli/course-v1:DLI+C-FX-01+V3.txt
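The batch arithmetic in this smoke test can be reproduced with a few lines. This sketch only illustrates how checkpoint granularity maps to batch files, assuming each batch holds up to `buffer_size` records and uses the `batch_NNNNN.parquet` naming seen above; it is not DataDesigner's resume implementation, and both function names are hypothetical.

```python
import math


def batch_plan(num_records: int, buffer_size: int) -> list[str]:
    """List the batch file names expected for a run of num_records records."""
    total_batches = math.ceil(num_records / buffer_size)
    return [f"batch_{index:05d}.parquet" for index in range(total_batches)]


def remaining_batches(num_records: int, buffer_size: int, completed: set[str]) -> list[str]:
    """Return the planned batches that have no completed file yet."""
    return [name for name in batch_plan(num_records, buffer_size) if name not in completed]
```

With 2 records, a buffer size of 1, and `batch_00000.parquet` already complete, the only remaining batch is `batch_00001.parquet`, which matches the "batch 2 of 2" log line.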

eric-tramel

Inline review comments for the issues from the review-nuke pass.

shan-nvidia added 3 commits May 19, 2026 16:48

Use DataDesigner native resume for retrieval SDG

211474c

Update retrieval SDG to DataDesigner 0.6.1

f52a1cb

Declare pytest workspace test dependency

007bc97

shan-nvidia marked this pull request as ready for review June 2, 2026 20:58

shan-nvidia requested review from a team and oliverholworthy as code owners June 2, 2026 20:58

eric-tramel reviewed Jun 2, 2026

Address retrieval SDG resume review feedback

da9938d

shan-nvidia requested a review from eric-tramel June 3, 2026 16:55

eric-tramel approved these changes Jun 3, 2026

shan-nvidia merged commit a8c2197 into main Jun 3, 2026
6 checks passed

shan-nvidia merged 4 commits into main from codex/sthan/retrieval-sdg-native-resume
