
feat(data-designer-retrieval-sdg): add retrieval SDG plugin #1

Open
shan-nvidia wants to merge 6 commits into main from steve/retriever-sdg

Conversation

@shan-nvidia
Collaborator

@shan-nvidia shan-nvidia commented Apr 29, 2026

Summary

Add data-designer-retrieval-sdg, a single PyPI package that registers two data_designer.plugins entry points and ships a complete pipeline + CLI for generating NeMo Retriever training and evaluation data:

  • embedding-dedup column generator — generic embedding cosine-similarity deduplication for any list-valued column. Implements native agenerate(), so the column engages DataDesigner's async cell-level scheduler when DATA_DESIGNER_ASYNC_ENGINE=1.
  • document-chunker seed reader — FileSystemSeedReader subclass that loads text files, sentence-chunks them, builds structured sections, and supports multi-document bundling.
  • build_qa_generation_pipeline() — four-column DataDesigner pipeline (artifact extraction → QA generation → embedding dedup → quality evaluation).
  • Conversion utilities — exports raw SDG output to NeMo Retriever training format (train.json / val.json), BEIR evaluation format, and a parquet corpus + merlin_metadata.json.
  • data-designer-retrieval-sdg CLI — generate and convert subcommands, with per-batch JSON checkpointing for resumability.
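The core of the embedding-dedup column can be sketched as a greedy cosine-similarity filter. This is an illustrative stand-alone version, not the plugin's actual implementation (which batches embedding calls through the model registry):

```python
import numpy as np

def dedup_by_cosine_similarity(items, embeddings, similarity_threshold=0.9):
    """Keep each item only if it is not too similar to any already-kept item.

    items: list of values (e.g. QA pairs); embeddings: (n, d) array-like,
    one row per item. Greedy first-come-first-kept pass, O(n^2) worst case.
    """
    vecs = np.asarray(embeddings, dtype=float)
    # L2-normalize so dot products are cosine similarities
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    vecs = vecs / np.clip(norms, 1e-12, None)

    kept_indices = []
    for i in range(len(items)):
        if kept_indices and (vecs[kept_indices] @ vecs[i]).max() >= similarity_threshold:
            continue  # near-duplicate of an already-kept item; drop it
        kept_indices.append(i)
    return [items[i] for i in kept_indices]
```

With a generic `source_column` / `text_field` config as described above, the same filter composes with any list-valued column, not just QA pairs.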

Review feedback applied

Initial round (@jgreco-nvidia, @nabinmulepati)

  • One package, two entry points instead of splitting into multiple PyPI packages. Users still install with a single pip install data-designer-retrieval-sdg.
  • No recipes/ directory — keeps the repo's existing layout (only plugins/ and devtools/).
  • Native async dedup — implemented agenerate() on the dedup column to drop the manual ThreadPoolExecutor and engage the new async engine when DATA_DESIGNER_ASYNC_ENGINE=1 (Python 3.11+).
  • Generic dedup config — column type renamed from retrieval-sdg-dedup to embedding-dedup; fields generalized (source_column, items_key, text_field, model_alias, similarity_threshold) so the column is composable beyond QA pairs.
  • FileSystemSeedReader for ingestion — replaces the hand-rolled load_text_files_from_directory + DataFrame construction with a DocumentChunkerSeedSource + DocumentChunkerSeedReader so the framework owns file discovery, manifest building, and DuckDB registration.
  • Dropped manual ETA/timing helpers — DataDesigner's own progress logger surfaces progress.
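The sentence-chunking step performed by the document-chunker seed reader can be sketched roughly as follows. This is a naive regex-based splitter for illustration only; the plugin's actual tokenizer, section building, and bundling are not shown:

```python
import re

def sentence_chunk(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when appending would exceed the budget
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Keeping sentence boundaries intact matters for retrieval SDG: a chunk that cuts mid-sentence tends to produce lower-quality QA pairs downstream.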

Async-engine throttling + fail-fast validation (commit dfa743b)

Follow-up review on dedup.py:

  • LLM-wait semaphore now gates embedding calls — EmbeddingDedupColumnGenerator now inherits from ColumnGeneratorWithModelRegistry, so it reports is_llm_bound = True to the async scheduler. Previously the column inherited the default is_llm_bound = False from ColumnGenerator, so build_llm_bound_lookup() in async_scheduler.py skipped _llm_wait_semaphore for it and could fan out up to a full row group's worth of concurrent embedding requests at the endpoint. Mirrors how the framework's own EmbeddingCellGenerator is wired through ColumnGeneratorWithModel.
  • Fail-fast model-alias validation — added _validate() override that raises DatasetGenerationError at task construction (ConfigurableTask.__init__) when the configured model_alias resolves to a non-embedding ModelConfig. Previously a misconfigured chat-model alias surfaced as either an AttributeError from the facade or a 400 from the embeddings API on the first row.
  • Cached embedder lookup — wrapped embedder in functools.cached_property so per-row dedup doesn't re-walk the model registry.
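The three fixes above combine into one pattern: declare LLM-boundedness, validate the alias at construction, and cache the resolved model. A self-contained sketch with hypothetical stand-ins for the framework types (the real base classes and registry API differ):

```python
from functools import cached_property

class DatasetGenerationError(Exception):
    """Stand-in for the framework's error type."""

class EmbeddingDedupColumnGenerator:
    """Sketch of the fail-fast + cached-lookup pattern (names illustrative)."""

    def __init__(self, model_registry, model_alias):
        self._registry = model_registry
        self.model_alias = model_alias
        self._validate()  # fail at task construction, not on the first row

    @property
    def is_llm_bound(self):
        # Reporting True lets the async scheduler gate this column behind
        # its LLM-wait semaphore instead of fanning out unbounded requests.
        return True

    def _validate(self):
        cfg = self._registry.get(self.model_alias)
        if cfg is None or cfg.get("kind") != "embedding":
            raise DatasetGenerationError(
                f"model_alias {self.model_alias!r} must resolve to an embedding model"
            )

    @cached_property
    def embedder(self):
        # Resolved once; per-row dedup reuses the cached client
        # instead of re-walking the model registry.
        return self._registry.get(self.model_alias)["client"]
```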

Logging-convention fix (commit c39b6c0)

Style follow-up: library code in data_designer_retrieval_sdg/ should emit warnings via logging.getLogger(__name__) like seed_reader.py and dedup.py already do, not print(). Four sites diverged from that convention and bypassed the LoggerConfig that configure_logging(...) sets up in the CLI, so --log-level ERROR users still saw Warning: ... lines on stdout and the preview-error path was invisible on the configured log stream.

  • chunking.py — added a module-level logger; the three load_multi_doc_manifest warnings (unreadable manifest, unparseable JSON/YAML, wrong shape) now emit via logger.warning(...) with %s-style args.
  • cli.py — added a module-level logger; the best-effort _run_preview except path now emits via logger.warning("Preview error: %s", e) so it honors --log-level and the configured stderr OutputConfig.

CLI status output (e.g. Discovered N text files, Processing batch 3/10, Saved generated_batch3.json) is intentionally left as print(...), matching the framework convention.
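The convention the fix adopts looks like this in practice (function and path names are illustrative, not the plugin's actual code):

```python
import logging

# Module-level logger, the pattern library code in the package follows.
logger = logging.getLogger(__name__)

def load_manifest(path):
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    except OSError as exc:
        # %s-style args defer formatting until the record is actually
        # emitted, and the message now honors the configured log level:
        # at --log-level ERROR this warning no longer reaches the user.
        logger.warning("Unable to read multi_doc_manifest at %s: %s", path, exc)
        return None
```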

Test plan

  • make sync — workspace install.
  • make lint — ruff check + format clean.
  • make test — isolated-venv tests pass (63 retrieval-sdg + 11 template after the dedup fix; was 70 before).
  • make validate — OK: text-transform, OK: document-chunker, OK: embedding-dedup.
  • make check — catalog, root CODEOWNERS, and SPDX headers all fresh.
  • make all — green end-to-end.
  • CLI smoke test (sync mode) — data-designer-retrieval-sdg generate ... && convert ... against examples/sample_texts; produces train.json, val.json, eval_beir/, corpus/.
  • CLI smoke test (async mode) — same invocation with DATA_DESIGNER_ASYNC_ENGINE=1; logs confirm ⚡️ Async generation: 4 column(s)… engaged the async path; same record count and same top-level columns as the sync run; deduplicated_qa_pairs completes 3/3 cells with 0 failures, exercising the new is_llm_bound = True path through _llm_wait_semaphore.

@shan-nvidia shan-nvidia requested a review from a team as a code owner April 29, 2026 16:14
@shan-nvidia shan-nvidia force-pushed the steve/retriever-sdg branch from 090d19e to 852174b on April 29, 2026 16:29
Adds a new Data Designer plugin for retriever synthetic data generation.
The plugin provides:

- A retrieval-sdg-dedup column generator that deduplicates QA pairs by
  embedding cosine similarity, registered via data_designer.plugins.
- A four-column SDG pipeline (artifact extraction, QA generation, dedup,
  quality evaluation) accessible as build_qa_generation_pipeline().
- Conversion utilities for exporting raw SDG output to NeMo Retriever
  training format (train.json, val.json), BEIR evaluation format, and a
  parquet corpus with merlin metadata.
- A data-designer-retrieval-sdg CLI with generate and convert subcommands.

Refreshes auto-derived metadata (docs/catalog.md, .github/CODEOWNERS) and
adds data_designer_retrieval_sdg to root pyproject.toml known-first-party.

Local CI (lint, isolated-venv test, validate, check) is green; 47 plugin
tests pass.

Signed-off-by: Steve Han <sthan@nvidia.com>
Made-with: Cursor
@shan-nvidia shan-nvidia force-pushed the steve/retriever-sdg branch from 852174b to e6dec85 on April 29, 2026 16:32
Restructure the plugin per PR #1 review feedback (Johnny Greco, Nabin
Mulepati). The single PyPI package now registers two entry points,
removes the manual ThreadPoolExecutor in favor of DataDesigner's new
async engine, and replaces the hand-rolled DataFrame seed loader with a
FileSystemSeedReader subclass.

- Register two data_designer.plugins entry points in one package:
  * embedding-dedup (column generator) - generic cosine-similarity dedup
    of any list-valued column. Implements native agenerate() so the
    column engages DATA_DESIGNER_ASYNC_ENGINE=1 cell-level concurrency.
  * document-chunker (seed reader) - FileSystemSeedReader subclass that
    sentence-chunks files and emits structured sections.
- Generalize the dedup column config (source_column, items_key,
  text_field, model_alias, similarity_threshold, column_type
  "embedding-dedup"); drop ThreadPoolExecutor; single batched embedding
  call per row in both sync and async paths.
- Move reusable chunking/section/bundling helpers from ingest.py into
  chunking.py and delete ingest.py - file discovery, manifest building,
  and DataFrame construction now belong to the framework.
- Update pipeline.py to take a DocumentChunkerSeedSource directly (no
  more DataFrameSeedSource wrapping) and the renamed
  EmbeddingDedupColumnConfig.
- Refactor cli.py to build the seed source from CLI flags, drop the
  manual ETA helper, and let DataDesigner's progress logger surface
  progress. Per-batch JSON output for resumability is preserved.
- Add @oliverholworthy to the plugin CODEOWNERS.
- Refresh auto-derived metadata (docs/catalog.md now lists both entry
  points; .github/CODEOWNERS regenerated).
- Tests: validate both plugin entries, add agenerate() async test for
  dedup, rename test_ingest -> test_chunking, add test_seed_reader.

Local CI is green: lint, isolated-venv test (70 tests pass: 59
retrieval-sdg + 11 template), validate (3 plugins OK), check. Verified
with sync and async (DATA_DESIGNER_ASYNC_ENGINE=1) end-to-end smoke runs
against examples/sample_texts.

Made-with: Cursor
Signed-off-by: Steve Han <sthan@nvidia.com>
Comment thread plugins/data-designer-retrieval-sdg/src/data_designer_retrieval_sdg/dedup.py Outdated
@nabinchha

Style nit, but worth a single review-level note since it spans a few sites and is more about convention than any specific line:

chunking.py uses print(...) for warnings while the rest of the package (and the framework) uses logging. The convention DataDesigner follows is:

  • Library code (anything inside data_designer_retrieval_sdg/ that isn't cli.py) — uses logger = logging.getLogger(__name__) and emits via logger.warning/info/debug/error/.... Examples already in this PR: seed_reader.py:34, dedup.py:21.
  • CLI status output — uses print(...) directly. Things like "Discovered N text files", "Processing batch 3/10", "Saved generated_batch3.json" — all appropriate as print.
  • CLI error/warning paths — go through sys.stderr (and ideally also logger) so they integrate with the configured LoggerConfig.

In this PR there are four sites that diverge from that convention:

  1. chunking.py:48 — print(f"Warning: Unable to read multi_doc_manifest at {manifest_path}: {exc}") should be logger.warning("Unable to read multi_doc_manifest at %s: %s", manifest_path, exc).

  2. chunking.py:58 — print(f"Warning: Failed to parse multi_doc_manifest: {exc}") should be logger.warning("Failed to parse multi_doc_manifest: %s", exc).

  3. chunking.py:77 — print("Warning: multi_doc_manifest must be a list or dict with 'bundles'") should be logger.warning("multi_doc_manifest must be a list or dict with 'bundles'").

  4. cli.py:246 — print(f"Preview error: {e}") is in an explicitly best-effort path (# noqa: BLE001), but it's an error message going to stdout. Either print(f"Preview error: {e}", file=sys.stderr) or logger.warning("Preview error: %s", e) would route it correctly. Smallest-change wins here; both are improvements.

(%s-with-args style instead of f-strings to match the framework convention — see seed_reader.py, the engine modules. Lets logging defer formatting if the message is filtered out at the configured level. Minor perf win, mostly consistency.)

chunking.py would need an import logging + logger = logging.getLogger(__name__) at the top alongside the other imports — same pattern as seed_reader.py:14-15 already in this PR.

Practical impact: today, a user running data-designer-retrieval-sdg generate --log-level ERROR would still see "Warning: Failed to parse multi_doc_manifest" on stdout if their manifest is malformed, because print doesn't go through the level filter that configure_logging(...) set up. Same with the preview error — invisible to anyone tailing the configured log stream.

… LLM-wait semaphore

Address PR review feedback that embedding-dedup column was bypassing the
async scheduler's LLM-wait semaphore in DATA_DESIGNER_ASYNC_ENGINE mode.
ColumnGeneratorCellByCell inherits is_llm_bound = False from the base
ColumnGenerator, so build_llm_bound_lookup() in async_scheduler.py would
skip _llm_wait_semaphore for this column and could fan out up to a full
row group's worth of concurrent embedding requests at the endpoint.

- Switch the base class to ColumnGeneratorWithModelRegistry so the
  generator reports is_llm_bound = True and gets the get_model() and
  get_model_config() helpers (mirrors how the framework's own
  EmbeddingCellGenerator is wired through ColumnGeneratorWithModel).
- Pin the cell-by-cell strategy explicitly via get_generation_strategy().
- Cache the resolved ModelFacade via functools.cached_property so per-row
  dedup doesn't re-walk the model registry.
- Override _validate() to fail fast at task construction with a
  DatasetGenerationError when the configured alias resolves to a non-
  embedding ModelConfig, instead of surfacing as an AttributeError from
  the facade or a 400 from the embeddings API on the first row.

Tests added (TDD; verified RED before implementing):
- is_llm_bound returns True
- _validate accepts an embedding ModelConfig
- _validate rejects a chat-completion ModelConfig with the offending
  alias name in the message
- embedder is cached across accesses

Local CI is green: 63/63 retrieval-sdg tests pass, ruff lint and format
clean, ddp validate reports OK for all three plugins. End-to-end smoke
run with DATA_DESIGNER_ASYNC_ENGINE=1 against examples/sample_texts
confirms deduplicated_qa_pairs completes 3/3 cells, 0 failures.

Made-with: Cursor
Signed-off-by: Steve Han <sthan@nvidia.com>
…ead of print

chunking.py and the cli preview path were calling print() for warning/
error messages, bypassing the LoggerConfig configure_logging() sets up
in the CLI. As a result, --log-level ERROR users would still see
"Warning: Failed to parse multi_doc_manifest" on stdout, and the
preview-error path was invisible on the configured log stream.

- chunking.py: add module-level logger; the three load_multi_doc_manifest
  warnings (unreadable manifest, unparseable manifest, wrong shape) now
  emit via logger.warning(...) with %s-style args.
- cli.py: add module-level logger; the best-effort preview except path
  now emits via logger.warning("Preview error: %s", e) so it honors
  --log-level and the configured stderr OutputConfig.

Signed-off-by: Steve Han <sthan@nvidia.com>
Made-with: Cursor
@shan-nvidia
Collaborator Author


Thanks @nabinchha. Updated accordingly.

@shan-nvidia shan-nvidia requested a review from nabinchha April 30, 2026 21:39

@nabinchha nabinchha left a comment

Added a few nits, lgtm in terms of plugin use/implementation

Comment thread plugins/data-designer-retrieval-sdg/src/data_designer_retrieval_sdg/chunking.py Outdated
Comment thread plugins/data-designer-retrieval-sdg/src/data_designer_retrieval_sdg/chunking.py Outdated
Comment thread plugins/data-designer-retrieval-sdg/src/data_designer_retrieval_sdg/dedup.py Outdated
@johnnygreco
Contributor

Thanks for this contribution @shan-nvidia!

Do you have any e2e examples that show how to use your plugin? I'm thinking we should add an examples/ directory at the top level for this.

So we'd have e2e notebook(s) and/or script(s) in examples/data-designer-retrieval-sdg/.

WDYT?

cc @nabinchha

@shan-nvidia
Collaborator Author


@johnnygreco yes, we can add some examples. Do you want to add a top-level examples/ folder in main that I can rebase onto, or do you want me to add examples/data-designer-retrieval-sdg/ in this PR or a follow-up PR?

@johnnygreco
Contributor


@shan-nvidia – please feel free to create it!

shan-nvidia and others added 2 commits May 4, 2026 12:13
…l_sdg/config.py

Co-authored-by: Nabin Mulepati <nabinchha@gmail.com>
Signed-off-by: Steve Han <sthan@nvidia.com>
