docs: add retriever SDG toolkit dev note #666
Signed-off-by: Steve Han <sthan@nvidia.com>
All contributors have signed the DCO ✍️ ✅
Review: PR #666, docs: add retriever SDG toolkit dev note
Summary
Docs-only PR adding a new "Retriever SDG Toolkit" dev note to both the MkDocs and Fern documentation sites. Changes:
PR is +940 / -0 across 10 files. No code is touched.
Findings
Consistency with existing dev note conventions: good
Slug / filename mismatch: worth a note
The Fern file is
Either is fine; the asymmetry is the smell. The MkDocs post uses a third, longer slug (
External links: plausible but unverifiable from CI
The post links to several external resources that I cannot reach from this runner:
The reranking-recipe link uses the
Code-snippet accuracy
The post documents APIs from the external
The author has noted
Asset duplication
Minor copy notes
Security / sensitive content
Nothing concerning. No secrets, no internal hostnames, no embedded directives that look like injection attempts in the diff.
Verdict
Approve with minor revisions suggested. The PR is a clean, well-structured docs addition that follows existing dev-note conventions for both MkDocs and Fern. The pipeline SVG is well-crafted and accessible (has
Optional: align em-dash style with neighboring posts and consider deduplicating
Greptile Summary
This PR adds a new dev note for the
| Filename | Overview |
|---|---|
| docs/devnotes/posts/retrieval-sdg-toolkit.md | New MkDocs dev note covering the four-stage retriever SDG pipeline; content, code samples, and frontmatter are all consistent and well-structured. |
| fern/versions/latest/pages/devnotes/posts/retriever-sdg-toolkit.mdx | Fern-formatted counterpart of the dev note; slug, author reference, and image path are correct and consistent with the Fern asset layout. |
| fern/versions/latest/pages/devnotes/index.mdx | Adds a BlogCard for the new post at the top of the dev-notes index; href, image src, date, and author ID all align with the new page and asset paths. |
| fern/components/devnotes/authors-data.ts | Adds the sthan author entry to the TypeScript registry; name, description, and avatar URL match the YAML author files exactly. |
| docs/devnotes/.authors.yml | Adds sthan author to the MkDocs authors registry; entry is well-formed and consistent with the Fern YAML and TypeScript counterparts. |
| fern/components/devnotes/.authors.yml | Adds sthan author to the Fern authors YAML; mirrors the MkDocs authors file correctly. |
| mkdocs.yml | Inserts the new dev note at the top of the Dev Notes nav section (most-recent-first order); file path reference matches the added markdown file. |
| fern/versions/latest.yml | Adds the Retriever SDG Toolkit page to the Fern navigation; path reference matches the new MDX file. |
| docs/devnotes/posts/assets/retrieval-sdg-toolkit/pipeline.svg | New SVG pipeline diagram for MkDocs; includes accessible title/desc elements and aria-labelledby attribute. |
| fern/assets/retrieval-sdg-toolkit/pipeline.svg | Identical SVG copied for the Fern docs asset path; content is identical to the MkDocs copy. |
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Source Documents\nDocs / policies / tickets / manuals] --> B[Stage 1: Bundle Docs\nsingle + multi-doc groups]
A --> C[Stage 1: Chunk Docs\nstable segment IDs]
B --> D[Stage 2: Extract Artifacts\nconcepts / entities / links]
C --> E[Stage 2: Generate QA\ngrounded multi-hop questions]
D --> F[Stage 3: Deduplicate\nnear-duplicate queries]
E --> G[Stage 3: Judge Quality\nrelevance / support / clarity]
F --> H[Stage 4: Convert\ntrain/val, BEIR qrels, AutoModel data]
G --> H
H --> I[train.json / val.json / corpus/]
H --> J[eval_beir/ corpus.jsonl / qrels/test.tsv]
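For readers unfamiliar with the `eval_beir/` output named in the flowchart, the following is a minimal sketch of the conventional BEIR folder layout: a `corpus.jsonl`, a `queries.jsonl`, and a `qrels/test.tsv` with a `query-id`, `corpus-id`, `score` header. The function name `write_beir` and its arguments are hypothetical and are not taken from the plugin; the plugin's own exporter may differ in detail.

```python
import csv
import json
from pathlib import Path


def write_beir(out_dir, corpus, queries, qrels):
    """Write a BEIR-style folder (hypothetical helper, not the plugin's API).

    corpus:  dict mapping segment ID -> passage text
    queries: dict mapping query ID -> question text
    qrels:   iterable of (query ID, segment ID) positive pairs
    """
    out = Path(out_dir)
    (out / "qrels").mkdir(parents=True, exist_ok=True)

    # One JSON object per line; BEIR expects "_id", "title", and "text".
    with open(out / "corpus.jsonl", "w", encoding="utf-8") as f:
        for doc_id, text in corpus.items():
            f.write(json.dumps({"_id": doc_id, "title": "", "text": text}) + "\n")

    with open(out / "queries.jsonl", "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}) + "\n")

    # Tab-separated relevance judgments with a header row.
    with open(out / "qrels" / "test.tsv", "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        for query_id, doc_id in qrels:
            writer.writerow([query_id, doc_id, 1])
```

A multi-hop question simply contributes several qrels rows, one per supporting segment, which is why stable segment IDs matter upstream.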
Reviews (5): Last reviewed commit: "docs: fix retriever SDG pipeline flow or..."
MkDocs preview: https://c430f2c2.dd-docs-preview.pages.dev
Fern preview: https://nvidia-preview-pr-666.docs.buildwithfern.com/nemo/datadesigner
I have read the DCO document and I hereby sign the DCO.
@@ -0,0 +1,339 @@
---
date: 2026-05-14
You'll want this to be the date this will be published!
The new [`data-designer-retrieval-sdg`](https://github.com/NVIDIA-NeMo/DataDesignerPlugins/tree/main/plugins/data-designer-retrieval-sdg) toolkit fills that gap: start with a directory of documents, generate synthetic query-positive examples with NeMo Data Designer, filter them, and export them for retriever fine-tuning and BEIR-style evaluation.

<!-- more -->
nit: can we try moving this up so that there is a shorter "abstract" on the index page? This is the text that will appear above "Continue reading".
This is not just a demo package. The same toolkit produced the [Retrieval-Synthetic-NVDocs-v1](https://huggingface.co/datasets/nvidia/Retrieval-Synthetic-NVDocs-v1) dataset from NVIDIA public documentation, and it powers the bootstrap SDG stage for both the NeMo [embedding fine-tune recipe](https://github.com/NVIDIA-NeMo/Nemotron/tree/main/src/nemotron/recipes/embed) and [reranking fine-tune recipe](https://github.com/NVIDIA-NeMo/Nemotron/tree/e6e8a3281a11b8e1b7b47af098bbf54416c68d47/src/nemotron/recipes/rerank). It is now available as a standalone tool for generating high-quality, complex, multi-document, multi-hop retrieval data compatible with [AutoModel](https://github.com/NVIDIA-NeMo/Automodel).

This post walks through what the toolkit does, why the generated labels matter, and how to make your first small run useful before you scale it up.

---

## **From Documents to Retriever Data**
I'm wondering whether this and the "If you are building a RAG system..." narrative at the top can be merged into an intro / context-setting section before going straight to the contents of the plugin and the diagram.
The hard part is not asking an LLM to write questions about a document. The hard part is keeping every generated question tied to the exact chunk, document, or multi-hop evidence set that a retriever should recover. Many RAG tutorials stop at chunk, embed, retrieve, and prompt. Fine-tuning recipes often begin once labeled query-passage pairs already exist. The gap in between is where developers lose the most time.

The new [`data-designer-retrieval-sdg`](https://github.com/NVIDIA-NeMo/DataDesignerPlugins/tree/main/plugins/data-designer-retrieval-sdg) toolkit fills that gap: start with a directory of documents, generate synthetic query-positive examples with NeMo Data Designer, filter them, and export them for retriever fine-tuning and BEIR-style evaluation.
Can we call this a "plugin" rather than a "toolkit" throughout? You can still say "toolkit" if you prefer that noun, but maybe something like "the plugin contains a retrieval SDG toolkit" would help make clear that this is a Data Designer plugin.
| `document-chunker` | seed reader | Turns text files into sentence chunks with stable segment IDs, so each query can point back to the passages that answer it. |
| `embedding-dedup` | column generator | Removes near-duplicate generated questions before judging and export, so the training data has more variety. |
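
To make the near-duplicate idea in the last row concrete, here is a minimal sketch of one common approach: a greedy filter that keeps a question only if its embedding's cosine similarity to every already-kept question stays below a threshold. This is an illustration only and is not the `embedding-dedup` implementation; the function names, the `0.95` threshold, and the toy two-dimensional embeddings are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedup(questions, embeddings, threshold=0.95):
    """Greedy near-duplicate filter (hypothetical helper, not the plugin's API).

    Walks the questions in order and keeps one only if it is less similar
    than `threshold` to every question kept so far.
    """
    kept = []  # indices of retained questions
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [questions[i] for i in kept]


# Toy example: the first two embeddings point in almost the same direction.
questions = ["How do I reset it?", "How can I reset it?", "What is the warranty?"]
embeddings = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]]
unique = dedup(questions, embeddings)
```

The greedy pass is quadratic in the number of questions, which is fine for a first small run; a larger corpus would typically use an approximate nearest-neighbor index instead.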

It also ships a normal Python API and a CLI:
We'll need to update this Dev Note when we figure out the long-term pattern for CLI-like functionality. Okay to keep this here, but we can't forget to update this later!
## **Why This Belongs in a Plugin**

A blog recipe can teach the workflow. A plugin makes the workflow reusable.
IMO this "Why This Belongs in a Plugin" framing feels more like our internal discussions than how we should speak about it here. What do you think about framing it more around how Data Designer plugins unlock custom use cases?
📋 Summary
Adds a new dev note for the `data-designer-retrieval-sdg` toolkit, explaining why retriever synthetic data generation matters and how the toolkit turns documents into retriever training and BEIR evaluation artifacts.
🔗 Related Issue
N/A
🔄 Changes
🧪 Testing
`.venv/bin/mkdocs build` passes
`make check-fern-docs-locally` passes
✅ Checklist