
feat(data-designer-retrieval-sdg): add retrieval SDG plugin #1

Open
shan-nvidia wants to merge 6 commits into main from steve/retriever-sdg

Conversation

@shan-nvidia
Collaborator

@shan-nvidia shan-nvidia commented Apr 29, 2026

Summary

Add data-designer-retrieval-sdg, a single PyPI package that registers two data_designer.plugins entry points and ships a complete pipeline + CLI for generating NeMo Retriever training and evaluation data:

  • embedding-dedup column generator — generic embedding cosine-similarity deduplication for any list-valued column. Implements native agenerate(), so the column engages DataDesigner's async cell-level scheduler when DATA_DESIGNER_ASYNC_ENGINE=1.
  • document-chunker seed reader — FileSystemSeedReader subclass that loads text files, sentence-chunks them, builds structured sections, and supports multi-document bundling.
  • build_qa_generation_pipeline() — four-column DataDesigner pipeline (artifact extraction → QA generation → embedding dedup → quality evaluation).
  • Conversion utilities — exports raw SDG output to NeMo Retriever training format (train.json / val.json), BEIR evaluation format, and a parquet corpus + merlin_metadata.json.
  • data-designer-retrieval-sdg CLI — generate and convert subcommands, with per-batch JSON checkpointing for resumability.
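The core of the embedding-dedup column can be sketched as a greedy cosine-similarity filter. This is an illustrative stand-alone version, not the plugin's actual implementation (which batches embedding calls through the model registry):

```python
import numpy as np

def dedup_by_cosine_similarity(items, embeddings, similarity_threshold=0.9):
    """Keep each item only if it is not too similar to any already-kept item.

    items: list of values (e.g. QA pairs); embeddings: (n, d) array-like,
    one row per item. Greedy first-come-first-kept pass, O(n^2) worst case.
    """
    vecs = np.asarray(embeddings, dtype=float)
    # L2-normalize so dot products are cosine similarities
    norms = np.linalg.norm(vecs, axis=1, keepdims=True)
    vecs = vecs / np.clip(norms, 1e-12, None)

    kept_indices = []
    for i in range(len(items)):
        if kept_indices and (vecs[kept_indices] @ vecs[i]).max() >= similarity_threshold:
            continue  # near-duplicate of an already-kept item; drop it
        kept_indices.append(i)
    return [items[i] for i in kept_indices]
```

With a generic `source_column` / `text_field` config as described above, the same filter composes with any list-valued column, not just QA pairs.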

Review feedback applied

Initial round (@jgreco-nvidia, @nabinmulepati)

  • One package, two entry points instead of splitting into multiple PyPI packages. Users still install with a single pip install data-designer-retrieval-sdg.
  • No recipes/ directory — keeps the repo's existing layout (only plugins/ and devtools/).
  • Native async dedup — implemented agenerate() on the dedup column to drop the manual ThreadPoolExecutor and engage the new async engine when DATA_DESIGNER_ASYNC_ENGINE=1 (Python 3.11+).
  • Generic dedup config — column type renamed from retrieval-sdg-dedup to embedding-dedup; fields generalized (source_column, items_key, text_field, model_alias, similarity_threshold) so the column is composable beyond QA pairs.
  • FileSystemSeedReader for ingestion — replaces the hand-rolled load_text_files_from_directory + DataFrame construction with a DocumentChunkerSeedSource + DocumentChunkerSeedReader so the framework owns file discovery, manifest building, and DuckDB registration.
  • Dropped manual ETA/timing helpers — DataDesigner's own progress logger surfaces progress.
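The sentence-chunking step performed by the document-chunker seed reader can be sketched roughly as follows. This is a naive regex-based splitter for illustration only; the plugin's actual tokenizer, section building, and bundling are not shown:

```python
import re

def sentence_chunk(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # Start a new chunk when appending would exceed the budget
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Keeping sentence boundaries intact matters for retrieval SDG: a chunk that cuts mid-sentence tends to produce lower-quality QA pairs downstream.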

Async-engine throttling + fail-fast validation (commit dfa743b)

Follow-up review on dedup.py:

  • LLM-wait semaphore now gates embedding calls — EmbeddingDedupColumnGenerator now inherits from ColumnGeneratorWithModelRegistry, so it reports is_llm_bound = True to the async scheduler. Previously the column inherited the default is_llm_bound = False from ColumnGenerator, so build_llm_bound_lookup() in async_scheduler.py skipped _llm_wait_semaphore for it and could fan out up to a full row group's worth of concurrent embedding requests at the endpoint. Mirrors how the framework's own EmbeddingCellGenerator is wired through ColumnGeneratorWithModel.
  • Fail-fast model-alias validation — added _validate() override that raises DatasetGenerationError at task construction (ConfigurableTask.__init__) when the configured model_alias resolves to a non-embedding ModelConfig. Previously a misconfigured chat-model alias surfaced as either an AttributeError from the facade or a 400 from the embeddings API on the first row.
  • Cached embedder lookup — wrapped embedder in functools.cached_property so per-row dedup doesn't re-walk the model registry.
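The three fixes above combine into one pattern: declare LLM-boundedness, validate the alias at construction, and cache the resolved model. A self-contained sketch with hypothetical stand-ins for the framework types (the real base classes and registry API differ):

```python
from functools import cached_property

class DatasetGenerationError(Exception):
    """Stand-in for the framework's error type."""

class EmbeddingDedupColumnGenerator:
    """Sketch of the fail-fast + cached-lookup pattern (names illustrative)."""

    def __init__(self, model_registry, model_alias):
        self._registry = model_registry
        self.model_alias = model_alias
        self._validate()  # fail at task construction, not on the first row

    @property
    def is_llm_bound(self):
        # Reporting True lets the async scheduler gate this column behind
        # its LLM-wait semaphore instead of fanning out unbounded requests.
        return True

    def _validate(self):
        cfg = self._registry.get(self.model_alias)
        if cfg is None or cfg.get("kind") != "embedding":
            raise DatasetGenerationError(
                f"model_alias {self.model_alias!r} must resolve to an embedding model"
            )

    @cached_property
    def embedder(self):
        # Resolved once; per-row dedup reuses the cached client
        # instead of re-walking the model registry.
        return self._registry.get(self.model_alias)["client"]
```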

Logging-convention fix (commit c39b6c0)

Style follow-up: library code in data_designer_retrieval_sdg/ should emit warnings via logging.getLogger(__name__) like seed_reader.py and dedup.py already do, not print(). Four sites diverged from that convention and bypassed the LoggerConfig that configure_logging(...) sets up in the CLI, so --log-level ERROR users still saw Warning: ... lines on stdout and the preview-error path was invisible on the configured log stream.

  • chunking.py — added a module-level logger; the three load_multi_doc_manifest warnings (unreadable manifest, unparseable JSON/YAML, wrong shape) now emit via logger.warning(...) with %s-style args.
  • cli.py — added a module-level logger; the best-effort _run_preview except path now emits via logger.warning("Preview error: %s", e) so it honors --log-level and the configured stderr OutputConfig.

CLI status output (e.g. Discovered N text files, Processing batch 3/10, Saved generated_batch3.json) is intentionally left as print(...), matching the framework convention.
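The convention the fix adopts looks like this in practice (function and path names are illustrative, not the plugin's actual code):

```python
import logging

# Module-level logger, the pattern library code in the package follows.
logger = logging.getLogger(__name__)

def load_manifest(path):
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    except OSError as exc:
        # %s-style args defer formatting until the record is actually
        # emitted, and the message now honors the configured log level:
        # at --log-level ERROR this warning no longer reaches the user.
        logger.warning("Unable to read multi_doc_manifest at %s: %s", path, exc)
        return None
```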

Test plan

  • make sync — workspace install.
  • make lint — ruff check + format clean.
  • make test — isolated-venv tests pass (63 retrieval-sdg + 11 template after the dedup fix; was 70 before).
  • make validate — OK: text-transform, OK: document-chunker, OK: embedding-dedup.
  • make check — catalog, root CODEOWNERS, and SPDX headers all fresh.
  • make all — green end-to-end.
  • CLI smoke test (sync mode) — data-designer-retrieval-sdg generate ... && convert ... against examples/sample_texts; produces train.json, val.json, eval_beir/, corpus/.
  • CLI smoke test (async mode) — same invocation with DATA_DESIGNER_ASYNC_ENGINE=1; logs confirm ⚡️ Async generation: 4 column(s)… engaged the async path; same record count and same top-level columns as the sync run; deduplicated_qa_pairs completes 3/3 cells with 0 failures, exercising the new is_llm_bound = True path through _llm_wait_semaphore.

@shan-nvidia shan-nvidia requested a review from a team as a code owner April 29, 2026 16:14
@shan-nvidia shan-nvidia force-pushed the steve/retriever-sdg branch from 090d19e to 852174b on April 29, 2026 16:29
Adds a new Data Designer plugin for retriever synthetic data generation.
The plugin provides:

- A retrieval-sdg-dedup column generator that deduplicates QA pairs by
  embedding cosine similarity, registered via data_designer.plugins.
- A four-column SDG pipeline (artifact extraction, QA generation, dedup,
  quality evaluation) accessible as build_qa_generation_pipeline().
- Conversion utilities for exporting raw SDG output to NeMo Retriever
  training format (train.json, val.json), BEIR evaluation format, and a
  parquet corpus with merlin metadata.
- A data-designer-retrieval-sdg CLI with generate and convert subcommands.

Refreshes auto-derived metadata (docs/catalog.md, .github/CODEOWNERS) and
adds data_designer_retrieval_sdg to root pyproject.toml known-first-party.

Local CI (lint, isolated-venv test, validate, check) is green; 47 plugin
tests pass.

Signed-off-by: Steve Han <sthan@nvidia.com>
Made-with: Cursor
@shan-nvidia shan-nvidia force-pushed the steve/retriever-sdg branch from 852174b to e6dec85 on April 29, 2026 16:32
Restructure the plugin per PR #1 review feedback (Johnny Greco, Nabin
Mulepati). The single PyPI package now registers two entry points,
removes the manual ThreadPoolExecutor in favor of DataDesigner's new
async engine, and replaces the hand-rolled DataFrame seed loader with a
FileSystemSeedReader subclass.

- Register two data_designer.plugins entry points in one package:
  * embedding-dedup (column generator) - generic cosine-similarity dedup
    of any list-valued column. Implements native agenerate() so the
    column engages DATA_DESIGNER_ASYNC_ENGINE=1 cell-level concurrency.
  * document-chunker (seed reader) - FileSystemSeedReader subclass that
    sentence-chunks files and emits structured sections.
- Generalize the dedup column config (source_column, items_key,
  text_field, model_alias, similarity_threshold, column_type
  "embedding-dedup"); drop ThreadPoolExecutor; single batched embedding
  call per row in both sync and async paths.
- Move reusable chunking/section/bundling helpers from ingest.py into
  chunking.py and delete ingest.py - file discovery, manifest building,
  and DataFrame construction now belong to the framework.
- Update pipeline.py to take a DocumentChunkerSeedSource directly (no
  more DataFrameSeedSource wrapping) and the renamed
  EmbeddingDedupColumnConfig.
- Refactor cli.py to build the seed source from CLI flags, drop the
  manual ETA helper, and let DataDesigner's progress logger surface
  progress. Per-batch JSON output for resumability is preserved.
- Add @oliverholworthy to the plugin CODEOWNERS.
- Refresh auto-derived metadata (docs/catalog.md now lists both entry
  points; .github/CODEOWNERS regenerated).
- Tests: validate both plugin entries, add agenerate() async test for
  dedup, rename test_ingest -> test_chunking, add test_seed_reader.

Local CI is green: lint, isolated-venv test (70 tests pass: 59
retrieval-sdg + 11 template), validate (3 plugins OK), check. Verified
with sync and async (DATA_DESIGNER_ASYNC_ENGINE=1) end-to-end smoke runs
against examples/sample_texts.

Made-with: Cursor
Signed-off-by: Steve Han <sthan@nvidia.com>
Comment thread plugins/data-designer-retrieval-sdg/src/data_designer_retrieval_sdg/dedup.py Outdated
@nabinchha

Style nit, but worth a single review-level note since it spans a few sites and is more about convention than any specific line:

chunking.py uses print(...) for warnings while the rest of the package (and the framework) uses logging. The convention DataDesigner follows is:

  • Library code (anything inside data_designer_retrieval_sdg/ that isn't cli.py) — uses logger = logging.getLogger(__name__) and emits via logger.warning/info/debug/error/.... Examples already in this PR: seed_reader.py:34, dedup.py:21.
  • CLI status output — uses print(...) directly. Things like "Discovered N text files", "Processing batch 3/10", "Saved generated_batch3.json" — all appropriate as print.
  • CLI error/warning paths — go through sys.stderr (and ideally also logger) so they integrate with the configured LoggerConfig.

In this PR there are four sites that diverge from that convention:

  1. chunking.py:48 — print(f"Warning: Unable to read multi_doc_manifest at {manifest_path}: {exc}") should be logger.warning("Unable to read multi_doc_manifest at %s: %s", manifest_path, exc).

  2. chunking.py:58 — print(f"Warning: Failed to parse multi_doc_manifest: {exc}") should be logger.warning("Failed to parse multi_doc_manifest: %s", exc).

  3. chunking.py:77 — print("Warning: multi_doc_manifest must be a list or dict with 'bundles'") should be logger.warning("multi_doc_manifest must be a list or dict with 'bundles'").

  4. cli.py:246 — print(f"Preview error: {e}") is in an explicitly best-effort path (# noqa: BLE001), but it's an error message going to stdout. Either print(f"Preview error: {e}", file=sys.stderr) or logger.warning("Preview error: %s", e) would route it correctly. Smallest-change wins here; both are improvements.

(%s-with-args style instead of f-strings to match the framework convention — see seed_reader.py, the engine modules. Lets logging defer formatting if the message is filtered out at the configured level. Minor perf win, mostly consistency.)

chunking.py would need an import logging + logger = logging.getLogger(__name__) at the top alongside the other imports — same pattern as seed_reader.py:14-15 already in this PR.

Practical impact: today, a user running data-designer-retrieval-sdg generate --log-level ERROR would still see "Warning: Failed to parse multi_doc_manifest" on stdout if their manifest is malformed, because print doesn't go through the level filter that configure_logging(...) set up. Same with the preview error — invisible to anyone tailing the configured log stream.

… LLM-wait semaphore

Address PR review feedback that embedding-dedup column was bypassing the
async scheduler's LLM-wait semaphore in DATA_DESIGNER_ASYNC_ENGINE mode.
ColumnGeneratorCellByCell inherits is_llm_bound = False from the base
ColumnGenerator, so build_llm_bound_lookup() in async_scheduler.py would
skip _llm_wait_semaphore for this column and could fan out up to a full
row group's worth of concurrent embedding requests at the endpoint.

- Switch the base class to ColumnGeneratorWithModelRegistry so the
  generator reports is_llm_bound = True and gets the get_model() and
  get_model_config() helpers (mirrors how the framework's own
  EmbeddingCellGenerator is wired through ColumnGeneratorWithModel).
- Pin the cell-by-cell strategy explicitly via get_generation_strategy().
- Cache the resolved ModelFacade via functools.cached_property so per-row
  dedup doesn't re-walk the model registry.
- Override _validate() to fail fast at task construction with a
  DatasetGenerationError when the configured alias resolves to a non-
  embedding ModelConfig, instead of surfacing as an AttributeError from
  the facade or a 400 from the embeddings API on the first row.

Tests added (TDD; verified RED before implementing):
- is_llm_bound returns True
- _validate accepts an embedding ModelConfig
- _validate rejects a chat-completion ModelConfig with the offending
  alias name in the message
- embedder is cached across accesses

Local CI is green: 63/63 retrieval-sdg tests pass, ruff lint and format
clean, ddp validate reports OK for all three plugins. End-to-end smoke
run with DATA_DESIGNER_ASYNC_ENGINE=1 against examples/sample_texts
confirms deduplicated_qa_pairs completes 3/3 cells, 0 failures.

Made-with: Cursor
Signed-off-by: Steve Han <sthan@nvidia.com>
…ead of print

chunking.py and the cli preview path were calling print() for warning/
error messages, bypassing the LoggerConfig configure_logging() sets up
in the CLI. As a result, --log-level ERROR users would still see
"Warning: Failed to parse multi_doc_manifest" on stdout, and the
preview-error path was invisible on the configured log stream.

- chunking.py: add module-level logger; the three load_multi_doc_manifest
  warnings (unreadable manifest, unparseable manifest, wrong shape) now
  emit via logger.warning(...) with %s-style args.
- cli.py: add module-level logger; the best-effort preview except path
  now emits via logger.warning("Preview error: %s", e) so it honors
  --log-level and the configured stderr OutputConfig.

Signed-off-by: Steve Han <sthan@nvidia.com>
Made-with: Cursor
@shan-nvidia
Collaborator Author


Thanks @nabinchha. Updated accordingly.

@shan-nvidia shan-nvidia requested a review from nabinchha April 30, 2026 21:39

@nabinchha nabinchha left a comment

Added a few nits, lgtm in terms of plugin use/implementation

Comment thread plugins/data-designer-retrieval-sdg/src/data_designer_retrieval_sdg/chunking.py Outdated
Comment thread plugins/data-designer-retrieval-sdg/src/data_designer_retrieval_sdg/chunking.py Outdated
Comment thread plugins/data-designer-retrieval-sdg/src/data_designer_retrieval_sdg/dedup.py Outdated
@johnnygreco
Contributor

Thanks for this contribution @shan-nvidia!

Do you have any e2e examples that show how to use your plugin? I'm thinking we should add an examples/ directory at the top level for this.

So we'd have e2e notebook(s) and/or script(s) in examples/data-designer-retrieval-sdg/.

WDYT?

cc @nabinchha

@shan-nvidia
Collaborator Author


@johnnygreco yes, we can add some examples. Do you want to add a top-level examples/ folder in main that I can rebase onto, or do you want me to add examples/data-designer-retrieval-sdg/ in this PR or a follow-up PR?

@johnnygreco
Contributor


@shan-nvidia – please feel free to create it!

shan-nvidia and others added 2 commits May 4, 2026 12:13
…l_sdg/config.py

Co-authored-by: Nabin Mulepati <nabinchha@gmail.com>
Signed-off-by: Steve Han <sthan@nvidia.com>
