Use DataDesigner native resume for retrieval SDG #51

Merged. shan-nvidia merged 4 commits into main from codex/sthan/retrieval-sdg-native-resume on Jun 3, 2026

shan-nvidia (Collaborator) commented on May 28, 2026

What

Updates data-designer-retrieval-sdg to use DataDesigner native resumable generation instead of the plugin's manual per-batch restart loop.

The generate command now:

  • Requires data-designer>=0.6.1.
  • Calls DataDesigner.create(...) once across the full seed range.
  • Adds --resume/-r {never,always,if_possible} and passes the selected ResumeMode.
  • Adds stable --dataset-name support, with validation before the name is handed to DataDesigner.
  • Replaces the old generation batch controls with --buffer-size.
  • Exports one JSONL file named from the resolved DataDesigner dataset name.
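As an illustration of the flag surface listed above, here is a minimal sketch of how the `--resume/-r` choices could map onto a resume mode. It assumes an `argparse`-based parser, and the `ResumeMode` enum below is a local stand-in for DataDesigner's own type; the default value is also an assumption, since the PR does not state one.

```python
import argparse
from enum import Enum


class ResumeMode(str, Enum):
    """Local stand-in for DataDesigner's ResumeMode; values mirror the CLI choices."""

    NEVER = "never"
    ALWAYS = "always"
    IF_POSSIBLE = "if_possible"


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser exposing the flags named in the PR description."""
    parser = argparse.ArgumentParser(prog="data-designer-retrieval-sdg generate")
    parser.add_argument(
        "--resume",
        "-r",
        choices=[mode.value for mode in ResumeMode],
        default=ResumeMode.NEVER.value,  # assumed default, not confirmed by the PR
    )
    parser.add_argument("--dataset-name", default=None)
    parser.add_argument("--buffer-size", type=int, default=None)
    return parser


def parse_resume(argv: list[str]) -> ResumeMode:
    """Parse argv and return the selected resume mode as an enum member."""
    args = build_parser().parse_args(argv)
    return ResumeMode(args.resume)
```

Restricting the flag to the enum's values means an unsupported mode is rejected at parse time, before any generation work starts.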

The converter keeps backward compatibility for legacy generated_batch*.json outputs while also accepting .jsonl, .json, and .parquet files.
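The input handling described here might look roughly like the sketch below. The function name `discover_inputs` and the rule of preferring legacy batch files when a directory contains them are assumptions made for illustration; the actual converter may resolve inputs differently.

```python
from pathlib import Path

# Formats the PR says the converter accepts.
SUPPORTED_SUFFIXES = {".jsonl", ".json", ".parquet"}


def discover_inputs(path: Path) -> list[Path]:
    """Return converter inputs for a single file or a directory, in sorted order."""
    if path.is_file():
        if path.suffix.lower() not in SUPPORTED_SUFFIXES:
            raise ValueError(f"Unsupported input format: {path.suffix!r}")
        return [path]
    # Legacy layout: a directory of generated_batch*.json files from older runs.
    legacy = sorted(path.glob("generated_batch*.json"))
    if legacy:
        return legacy
    # Otherwise accept any supported file found directly in the directory.
    return sorted(p for p in path.iterdir() if p.suffix.lower() in SUPPORTED_SUFFIXES)
```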

Review follow-ups in the latest commit:

  • Reject unsafe dataset names: ., .., separators, control characters, and names that resolve outside --artifact-path.
  • Normalize parquet-loaded array-like nested values before converter filtering and corpus mapping.
  • Restore the workspace resolver constraints for python-multipart and urllib3.
  • Drive the native-resume CLI test through main() and assert export uses result.artifact_storage.resolved_dataset_name.
  • Document that DataDesigner 0.6.1 still profiles the completed dataset before create() returns, so --buffer-size is checkpoint granularity rather than a final peak-memory cap.
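The dataset-name rules in the first follow-up can be sketched as a small validator. This is a hypothetical helper written from the rules listed above, not the plugin's actual code; the name `validate_dataset_name` and the error messages are assumptions.

```python
import unicodedata
from pathlib import Path


def validate_dataset_name(name: str, artifact_path: Path) -> str:
    """Return name unchanged if it is safe to use under artifact_path, else raise."""
    if name in {"", ".", ".."}:
        raise ValueError(f"Invalid dataset name: {name!r}")
    if "/" in name or "\\" in name:
        raise ValueError(f"Dataset name must not contain path separators: {name!r}")
    # Unicode category "Cc" covers ASCII control characters such as NUL and DEL.
    if any(unicodedata.category(ch) == "Cc" for ch in name):
        raise ValueError(f"Dataset name must not contain control characters: {name!r}")
    # Resolve both paths so a symlinked entry cannot point outside the artifact root.
    root = artifact_path.resolve()
    target = (root / name).resolve()
    if root not in target.parents:
        raise ValueError(f"Dataset name resolves outside {root}: {name!r}")
    return name
```

Checking the resolved path last catches the case the character checks cannot: an existing entry under the artifact directory that is a symlink to somewhere else.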

Why

DataDesigner 0.6.x owns interrupted-run checkpointing now. Keeping plugin-level --batch-size, --start-batch-index, and --end-batch-index would leave two competing restart systems and make resume behavior harder to reason about.

This PR makes DataDesigner artifacts the source of truth for resume, while --output-dir remains the downstream exported JSONL location.
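To make the split between artifacts and export concrete, here is a minimal sketch of the export step: one JSONL file in the output directory, named after the resolved dataset name. The helper `export_jsonl` and its signature are hypothetical; only the `<output-dir>/<dataset-name>.jsonl` naming comes from the PR.

```python
import json
from pathlib import Path
from typing import Any, Iterable, Mapping


def export_jsonl(
    records: Iterable[Mapping[str, Any]],
    output_dir: Path,
    resolved_dataset_name: str,
) -> Path:
    """Write records as one JSON object per line and return the output path."""
    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = output_dir / f"{resolved_dataset_name}.jsonl"
    with out_path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return out_path
```

With `--output-dir ./generated_output` and a resolved name of `my_retrieval_run`, this yields `./generated_output/my_retrieval_run.jsonl`, the path passed to `convert` in the usage section.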

Usage

data-designer-retrieval-sdg generate \
  --input-dir ./my_documents \
  --output-dir ./generated_output \
  --artifact-path ./artifacts \
  --dataset-name my_retrieval_run \
  --buffer-size 200 \
  --resume if_possible \
  --num-pairs 7

Resume an interrupted run with the same artifact path, dataset name, config, and buffer size:

data-designer-retrieval-sdg generate \
  --input-dir ./my_documents \
  --output-dir ./generated_output \
  --artifact-path ./artifacts \
  --dataset-name my_retrieval_run \
  --buffer-size 200 \
  --resume always \
  --num-pairs 7

Convert the exported JSONL:

data-designer-retrieval-sdg convert ./generated_output/my_retrieval_run.jsonl \
  --corpus-id my_corpus

How

The CLI builds the QA pipeline once using the full discovered seed range and calls create(..., num_records=total_records, dataset_name=..., resume=...).

dd.RunConfig(disable_early_shutdown=True, buffer_size=...) controls DataDesigner checkpoint granularity. The old manual batch JSON writer was removed from generate, but the converter still recognizes legacy batch JSON for existing outputs.

The embedding-dedup scheduler test was updated for DataDesigner 0.6.1's model scheduling metadata API, replacing the old is_llm_bound assertion with get_scheduling_metadata() coverage.

The workspace declares pytest explicitly in the root dev dependency group so make test does not depend on a transitive test dependency from another tool.

Validation

Local checks:

  • UV_CACHE_DIR=/private/tmp/uv-cache make sync
  • UV_CACHE_DIR=/private/tmp/uv-cache make lint
  • UV_CACHE_DIR=/private/tmp/uv-cache uv run pytest plugins/data-designer-retrieval-sdg/tests/test_cli.py plugins/data-designer-retrieval-sdg/tests/test_convert.py (30 passed)
  • UV_CACHE_DIR=/private/tmp/uv-cache make test-plugin PLUGIN=data-designer-retrieval-sdg (78 passed)
  • UV_CACHE_DIR=/private/tmp/uv-cache make validate
  • UV_CACHE_DIR=/private/tmp/uv-cache make check

Real resume smoke against DataDesigner 0.6.1:

  • Input: /Users/sthan/workspace/retriever-sdg-v3/examples/nv_pp_random
  • Command used --num-files 2 --buffer-size 1 --resume never
  • Interrupted after completed parquet-files/batch_00000.parquet and partial tmp-partial-parquet-files/batch_00001.parquet
  • Resumed with --resume always
  • DataDesigner logged: Resuming from batch 2 of 2 (1 records already generated).
  • Final artifacts contained batch_00000.parquet and batch_00001.parquet
  • Final JSONL contained 2 lines
  • Final seed order: build-nvidia/nvidia/corrdiff.txt, then dli/course-v1:DLI+C-FX-01+V3.txt
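The batch numbers in the smoke test follow from simple arithmetic on the record count and buffer size. The sketch below is an illustration that reproduces the logged values, not DataDesigner's implementation; it assumes batches are numbered from 1 in the log message and that a partially written batch is regenerated in full.

```python
import math


def resume_point(
    total_records: int, buffer_size: int, completed_batches: int
) -> tuple[int, int, int]:
    """Return (next_batch_number, total_batches, records_already_generated)."""
    if buffer_size <= 0:
        raise ValueError("buffer_size must be positive")
    total_batches = math.ceil(total_records / buffer_size)
    # Only fully completed batches count; the last batch may be smaller than the rest.
    records_done = min(completed_batches * buffer_size, total_records)
    return completed_batches + 1, total_batches, records_done


# Smoke test above: 2 records, --buffer-size 1, one completed batch before the interrupt.
print(resume_point(total_records=2, buffer_size=1, completed_batches=1))  # (2, 2, 1)
```

The result `(2, 2, 1)` lines up with the logged "Resuming from batch 2 of 2 (1 records already generated)".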

shan-nvidia marked this pull request as ready for review June 2, 2026 20:58
shan-nvidia requested review from a team and oliverholworthy as code owners June 2, 2026 20:58

eric-tramel (Contributor) left a comment

Inline review comments for the issues from the review-nuke pass.

Comment thread pyproject.toml
Comment thread plugins/data-designer-retrieval-sdg/tests/test_cli.py Outdated
Comment thread plugins/data-designer-retrieval-sdg/tests/test_cli.py Outdated
shan-nvidia requested a review from eric-tramel June 3, 2026 16:55
shan-nvidia merged commit a8c2197 into main Jun 3, 2026
6 checks passed