feat(data-designer-retrieval-sdg): add retrieval SDG plugin #1
shan-nvidia wants to merge 6 commits into main from …
090d19e to 852174b
Adds a new Data Designer plugin for retriever synthetic data generation. The plugin provides:

- A retrieval-sdg-dedup column generator that deduplicates QA pairs by embedding cosine similarity, registered via data_designer.plugins.
- A four-column SDG pipeline (artifact extraction, QA generation, dedup, quality evaluation) accessible as build_qa_generation_pipeline().
- Conversion utilities for exporting raw SDG output to NeMo Retriever training format (train.json, val.json), BEIR evaluation format, and a parquet corpus with merlin metadata.
- A data-designer-retrieval-sdg CLI with generate and convert subcommands.

Refreshes auto-derived metadata (docs/catalog.md, .github/CODEOWNERS) and adds data_designer_retrieval_sdg to root pyproject.toml known-first-party.

Local CI (lint, isolated-venv test, validate, check) is green; 47 plugin tests pass.

Signed-off-by: Steve Han <sthan@nvidia.com>
Made-with: Cursor
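The commit message describes deduplicating QA pairs by embedding cosine similarity but does not show the algorithm. As a rough illustration only, here is a minimal greedy sketch in plain Python: an item is kept only if its embedding stays below a similarity threshold against every item already kept. The function names, the `similarity_threshold` default, and the greedy first-wins policy are assumptions made for this example, not the plugin's actual code.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def dedup_by_embedding(items, embeddings, similarity_threshold=0.9):
    """Hypothetical greedy dedup: keep an item only if it is less similar
    than the threshold to every item kept so far (first occurrence wins)."""
    kept_items = []
    kept_vectors = []
    for item, vector in zip(items, embeddings):
        if all(cosine(vector, kept) < similarity_threshold for kept in kept_vectors):
            kept_items.append(item)
            kept_vectors.append(vector)
    return kept_items
```

With a greedy policy the result depends on input order, so a real implementation might sort or score items first; that choice is not stated in the PR.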
852174b to e6dec85
Restructure the plugin per PR #1 review feedback (Johnny Greco, Nabin Mulepati). The single PyPI package now registers two entry points, removes the manual ThreadPoolExecutor in favor of DataDesigner's new async engine, and replaces the hand-rolled DataFrame seed loader with a FileSystemSeedReader subclass.

- Register two data_designer.plugins entry points in one package:
  * embedding-dedup (column generator) - generic cosine-similarity dedup of any list-valued column. Implements native agenerate() so the column engages DATA_DESIGNER_ASYNC_ENGINE=1 cell-level concurrency.
  * document-chunker (seed reader) - FileSystemSeedReader subclass that sentence-chunks files and emits structured sections.
- Generalize the dedup column config (source_column, items_key, text_field, model_alias, similarity_threshold, column_type "embedding-dedup"); drop ThreadPoolExecutor; single batched embedding call per row in both sync and async paths.
- Move reusable chunking/section/bundling helpers from ingest.py into chunking.py and delete ingest.py - file discovery, manifest building, and DataFrame construction now belong to the framework.
- Update pipeline.py to take a DocumentChunkerSeedSource directly (no more DataFrameSeedSource wrapping) and the renamed EmbeddingDedupColumnConfig.
- Refactor cli.py to build the seed source from CLI flags, drop the manual ETA helper, and let DataDesigner's progress logger surface progress. Per-batch JSON output for resumability is preserved.
- Add @oliverholworthy to the plugin CODEOWNERS.
- Refresh auto-derived metadata (docs/catalog.md now lists both entry points; .github/CODEOWNERS regenerated).
- Tests: validate both plugin entries, add agenerate() async test for dedup, rename test_ingest -> test_chunking, add test_seed_reader.

Local CI is green: lint, isolated-venv test (70 tests pass: 59 retrieval-sdg + 11 template), validate (3 plugins OK), check.

Verified with sync and async (DATA_DESIGNER_ASYNC_ENGINE=1) end-to-end smoke runs against examples/sample_texts.

Made-with: Cursor
Signed-off-by: Steve Han <sthan@nvidia.com>
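The "single batched embedding call per row in both sync and async paths" point can be pictured with a small stand-alone sketch. Everything here is hypothetical: `CountingEmbedder` stands in for a real embedding client, `DedupSketch` is not the plugin's generator class, and the row layout (`row["items"]` as a list of dicts with a `"text"` key) is assumed for illustration. The similarity filter itself is left out so the example stays focused on the call pattern.

```python
import asyncio


class CountingEmbedder:
    """Hypothetical embedding client that counts how many requests it receives."""

    def __init__(self):
        self.calls = 0

    @staticmethod
    def _vectors(texts):
        # Toy "embedding": one number per text, enough to show the data flow.
        return [[float(len(text))] for text in texts]

    def embed(self, texts):
        self.calls += 1
        return self._vectors(texts)

    async def aembed(self, texts):
        self.calls += 1
        await asyncio.sleep(0)  # yield to the event loop like a real network call
        return self._vectors(texts)


class DedupSketch:
    """Sync and async entry points share the same pre- and post-processing."""

    def __init__(self, embedder):
        self.embedder = embedder

    @staticmethod
    def _texts(row):
        return [item["text"] for item in row["items"]]

    @staticmethod
    def _finish(row, vectors):
        embedded = [
            {"text": item["text"], "vector": vector}
            for item, vector in zip(row["items"], vectors)
        ]
        return {**row, "embedded": embedded}

    def generate(self, row):
        # One request for the whole row, not one per item.
        return self._finish(row, self.embedder.embed(self._texts(row)))

    async def agenerate(self, row):
        # Same single request, awaited instead of run on a thread pool.
        return self._finish(row, await self.embedder.aembed(self._texts(row)))
```

Keeping the two paths as thin wrappers around shared helpers is one way to make sure the sync and async runs produce the same records.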
Style nit, but worth a single review-level note since it spans a few sites and is more about convention than any specific line:
In this PR there are four sites that diverge from that convention:
(
Practical impact: today, a user running …
… LLM-wait semaphore

Address PR review feedback that embedding-dedup column was bypassing the async scheduler's LLM-wait semaphore in DATA_DESIGNER_ASYNC_ENGINE mode. ColumnGeneratorCellByCell inherits is_llm_bound = False from the base ColumnGenerator, so build_llm_bound_lookup() in async_scheduler.py would skip _llm_wait_semaphore for this column and could fan out up to a full row group's worth of concurrent embedding requests at the endpoint.

- Switch the base class to ColumnGeneratorWithModelRegistry so the generator reports is_llm_bound = True and gets the get_model() and get_model_config() helpers (mirrors how the framework's own EmbeddingCellGenerator is wired through ColumnGeneratorWithModel).
- Pin the cell-by-cell strategy explicitly via get_generation_strategy().
- Cache the resolved ModelFacade via functools.cached_property so per-row dedup doesn't re-walk the model registry.
- Override _validate() to fail fast at task construction with a DatasetGenerationError when the configured alias resolves to a non-embedding ModelConfig, instead of surfacing as an AttributeError from the facade or a 400 from the embeddings API on the first row.

Tests added (TDD; verified RED before implementing):
- is_llm_bound returns True
- _validate accepts an embedding ModelConfig
- _validate rejects a chat-completion ModelConfig with the offending alias name in the message
- embedder is cached across accesses

Local CI is green: 63/63 retrieval-sdg tests pass, ruff lint and format clean, ddp validate reports OK for all three plugins. End-to-end smoke run with DATA_DESIGNER_ASYNC_ENGINE=1 against examples/sample_texts confirms deduplicated_qa_pairs completes 3/3 cells, 0 failures.

Made-with: Cursor
Signed-off-by: Steve Han <sthan@nvidia.com>
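To make the fan-out concern concrete, the sketch below models a scheduler that only routes cells through a semaphore when the column is flagged as LLM-bound. It is a simplified model using `asyncio.Semaphore`; the function name, the `limit` parameter, and the scheduling logic are assumptions for illustration and are not taken from `async_scheduler.py`.

```python
import asyncio


async def peak_concurrency(n_cells, is_llm_bound, limit):
    """Run n_cells fake embedding requests and report the highest number
    that were in flight at the same time."""
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def embedding_request():
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stand-in for the network round trip
        in_flight -= 1

    async def scheduled_cell():
        if is_llm_bound:
            # Flagged columns wait on the shared semaphore.
            async with semaphore:
                await embedding_request()
        else:
            # Unflagged columns skip the semaphore and all start at once.
            await embedding_request()

    await asyncio.gather(*(scheduled_cell() for _ in range(n_cells)))
    return peak
```

In this toy model, eight cells with `is_llm_bound=False` all hit the endpoint together, while the same cells with `is_llm_bound=True` and `limit=2` never exceed two concurrent requests.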
…ead of print
chunking.py and the cli preview path were calling print() for warning/
error messages, bypassing the LoggerConfig configure_logging() sets up
in the CLI. As a result, --log-level ERROR users would still see
"Warning: Failed to parse multi_doc_manifest" on stdout, and the
preview-error path was invisible on the configured log stream.
- chunking.py: add module-level logger; the three load_multi_doc_manifest
warnings (unreadable manifest, unparseable manifest, wrong shape) now
emit via logger.warning(...) with %s-style args.
- cli.py: add module-level logger; the best-effort preview except path
now emits via logger.warning("Preview error: %s", e) so it honors
--log-level and the configured stderr OutputConfig.
Signed-off-by: Steve Han <sthan@nvidia.com>
Made-with: Cursor
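A short sketch of the convention this commit adopts: a module-level logger and `%s`-style arguments, so that warnings respect whatever level the application configures. The `load_manifest` helper, its JSON-only parsing, and the "must be a list" shape check are hypothetical simplifications; only the message prefix and the `logger.warning(...)` pattern come from the commit message.

```python
import io
import json
import logging

# Module-level logger, named after the module as the convention requires.
logger = logging.getLogger(__name__)


def load_manifest(text):
    """Hypothetical manifest loader: returns the parsed list, or None after
    logging a warning (instead of printing to stdout)."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        # Lazy %s formatting: the message is only built if the record is emitted.
        logger.warning("Failed to parse multi_doc_manifest: %s", exc)
        return None
    if not isinstance(data, list):
        logger.warning("multi_doc_manifest has wrong shape: %s", type(data).__name__)
        return None
    return data
```

Because the warning goes through `logging`, a logger set to `ERROR` suppresses it, which a bare `print()` cannot do.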
Thanks @nabinchha. Updated accordingly.
nabinchha
left a comment
Added a few nits, lgtm in terms of plugin use/implementation
Thanks for this contribution @shan-nvidia! Do you have any e2e examples that show how to use your plugin? I'm thinking we should an … So we'd have e2e notebook(s) and/or script(s) in … WDYT? cc @nabinchha
@johnnygreco yes we can add some examples. Do you want to add a top level …
@shan-nvidia – please feel free to create it!
…l_sdg/config.py

Co-authored-by: Nabin Mulepati <nabinchha@gmail.com>
Signed-off-by: Steve Han <sthan@nvidia.com>
Summary

Add `data-designer-retrieval-sdg`, a single PyPI package that registers two `data_designer.plugins` entry points and ships a complete pipeline + CLI for generating NeMo Retriever training and evaluation data:

- `embedding-dedup` column generator: generic embedding cosine-similarity deduplication for any list-valued column. Implements native `agenerate()`, so the column engages DataDesigner's async cell-level scheduler when `DATA_DESIGNER_ASYNC_ENGINE=1`.
- `document-chunker` seed reader: `FileSystemSeedReader` subclass that loads text files, sentence-chunks them, builds structured sections, and supports multi-document bundling.
- `build_qa_generation_pipeline()`: four-column DataDesigner pipeline (artifact extraction → QA generation → embedding dedup → quality evaluation).
- Conversion utilities for NeMo Retriever training format (`train.json` / `val.json`), BEIR evaluation format, and a parquet corpus + `merlin_metadata.json`.
- `data-designer-retrieval-sdg` CLI: `generate` and `convert` subcommands, with per-batch JSON checkpointing for resumability.

Review feedback applied
Initial round (@jgreco-nvidia, @nabinmulepati)

- … `pip install data-designer-retrieval-sdg`.
- … `recipes/` directory: keeps the repo's existing layout (only `plugins/` and `devtools/`).
- … `agenerate()` on the dedup column to drop the manual `ThreadPoolExecutor` and engage the new async engine when `DATA_DESIGNER_ASYNC_ENGINE=1` (Python 3.11+).
- … `retrieval-sdg-dedup` to `embedding-dedup`; fields generalized (`source_column`, `items_key`, `text_field`, `model_alias`, `similarity_threshold`) so the column is composable beyond QA pairs.
- `FileSystemSeedReader` for ingestion: replaces the hand-rolled `load_text_files_from_directory` + DataFrame construction with a `DocumentChunkerSeedSource` + `DocumentChunkerSeedReader` so the framework owns file discovery, manifest building, and DuckDB registration.

Async-engine throttling + fail-fast validation (commit dfa743b)
Follow-up review on `dedup.py`:

- `EmbeddingDedupColumnGenerator` now inherits from `ColumnGeneratorWithModelRegistry`, so it reports `is_llm_bound = True` to the async scheduler. Previously the column inherited the default `is_llm_bound = False` from `ColumnGenerator`, so `build_llm_bound_lookup()` in `async_scheduler.py` skipped `_llm_wait_semaphore` for it and could fan out up to a full row group's worth of concurrent embedding requests at the endpoint. Mirrors how the framework's own `EmbeddingCellGenerator` is wired through `ColumnGeneratorWithModel`.
- … `_validate()` override that raises `DatasetGenerationError` at task construction (`ConfigurableTask.__init__`) when the configured `model_alias` resolves to a non-embedding `ModelConfig`. Previously a misconfigured chat-model alias surfaced as either an `AttributeError` from the facade or a 400 from the embeddings API on the first row.
- … `embedder` in `functools.cached_property` so per-row dedup doesn't re-walk the model registry.

Logging-convention fix (commit c39b6c0)
Style follow-up: library code in `data_designer_retrieval_sdg/` should emit warnings via `logging.getLogger(__name__)` like `seed_reader.py` and `dedup.py` already do, not `print()`. Four sites diverged from that convention and bypassed the `LoggerConfig` that `configure_logging(...)` sets up in the CLI, so `--log-level ERROR` users still saw `Warning: ...` lines on stdout and the preview-error path was invisible on the configured log stream.

- `chunking.py`: added a module-level `logger`; the three `load_multi_doc_manifest` warnings (unreadable manifest, unparseable JSON/YAML, wrong shape) now emit via `logger.warning(...)` with `%s`-style args.
- `cli.py`: added a module-level `logger`; the best-effort `_run_preview` `except` path now emits via `logger.warning("Preview error: %s", e)` so it honors `--log-level` and the configured stderr `OutputConfig`.

CLI status output (e.g. `Discovered N text files`, `Processing batch 3/10`, `Saved generated_batch3.json`) is intentionally left as `print(...)`, matching the framework convention.

Test plan
- `make sync`: workspace install.
- `make lint`: ruff check + format clean.
- `make test`: isolated-venv tests pass (63 retrieval-sdg + 11 template after the dedup fix; was 70 before).
- `make validate`: `OK: text-transform`, `OK: document-chunker`, `OK: embedding-dedup`.
- `make check`: catalog, root CODEOWNERS, and SPDX headers all fresh.
- `make all`: green end-to-end.
- … `data-designer-retrieval-sdg generate ... && convert ...` against `examples/sample_texts`; produces `train.json`, `val.json`, `eval_beir/`, `corpus/`.
- … `DATA_DESIGNER_ASYNC_ENGINE=1`; logs confirm `⚡️ Async generation: 4 column(s)…` engaged the async path; same record count and same top-level columns as the sync run; `deduplicated_qa_pairs` completes 3/3 cells with 0 failures, exercising the new `is_llm_bound = True` path through `_llm_wait_semaphore`.
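For readers unfamiliar with the fail-fast validation and cached embedder described under commit dfa743b, the pattern can be sketched without the framework. All names below (`ConfigSketchError`, `ModelEntry`, `DedupTaskSketch`, the `kind` field, and the dict-based registry) are hypothetical stand-ins; the real code raises `DatasetGenerationError` and works against DataDesigner's own model registry, whose API is not shown in this PR text.

```python
from dataclasses import dataclass
from functools import cached_property


class ConfigSketchError(ValueError):
    """Hypothetical stand-in for the framework's configuration error."""


@dataclass(frozen=True)
class ModelEntry:
    alias: str
    kind: str  # assumed to be "embedding" or "chat" for this sketch


class DedupTaskSketch:
    def __init__(self, registry, model_alias):
        self._registry = registry
        self._model_alias = model_alias
        self.lookups = 0
        # Validate at construction so a bad alias fails before any row runs.
        self._validate()

    def _validate(self):
        entry = self._registry.get(self._model_alias)
        if entry is None or entry.kind != "embedding":
            raise ConfigSketchError(
                f"model alias {self._model_alias!r} does not resolve to an embedding model"
            )

    @cached_property
    def embedder(self):
        # Resolved once on first access, then reused for every row.
        self.lookups += 1
        return self._registry[self._model_alias]
```

Accessing `task.embedder` repeatedly performs a single registry lookup, and constructing the task with a chat-model alias raises immediately with the alias named in the message.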