docs: add Enterprise Text-to-SQL and Search Agent recipes #395
dhruvnathawani merged 8 commits into main from
Conversation
Add two new recipes derived from dev notes:
- Enterprise Text-to-SQL: dialect-specific SQL generation with distractor table/column injection, dirty data handling, conditional sampling, and multi-judge scoring (from the text-to-sql dev note)
- Search Agent: multi-turn deep research trajectories using a BM25 retriever MCP server with search/open/find tools and LLM judge rejection sampling (from the deep-research-trajectories dev note)

Made-with: Cursor
Switch from LocalStdioMCPProvider with bm25s to Tavily's hosted MCP endpoint (streamable_http). Removes bm25s/PyStemmer/mcp dependencies from the recipe, simplifies the code, and matches the battle-tested pattern from the search agent dev note and GTC notebooks.
Greptile Summary

This PR adds two production-grade recipe scripts and their corresponding documentation pages: Enterprise Text-to-SQL (a five-stage pipeline with distractor injection, dirty data, dialect-specific SQL, and 5 LLM judges producing 15 score columns) and Search Agent (a Tavily-powered MCP pipeline generating multi-turn BrowseComp-style search trajectories from Wikidata seeds). Both recipes follow established repo conventions and integrate cleanly into the MkDocs navigation. Key observations:
| Filename | Overview |
|---|---|
| `docs/assets/recipes/code_generation/enterprise_text_to_sql.py` | New 929-line recipe for enterprise text-to-SQL; the five-stage pipeline (seed → prompt → schema → SQL → 5 LLM judges) is well-structured, the score extraction math is correct (15 columns across 5 judges), and the CLI follows existing conventions. No issues found. |
| `docs/assets/recipes/mcp_and_tooluse/search_agent.py` | New search-agent recipe using Tavily MCP; two issues: `TAVILY_API_KEY` is validated with `is None` instead of a truthiness check (misses the empty-string case), and `--artifact-path` is exposed in help output, contrary to the `argparse.SUPPRESS` convention used by all other MCP recipes. |
| `docs/recipes/code_generation/enterprise_text_to_sql.md` | Recipe doc page; references a dev note (`engineering-an-enterprise-grade-text-to-sql-dataset-with-nemo-data-designer`) that does not exist in the repo — will produce a 404 if deployed before that post is published. |
| `docs/recipes/mcp_and_tooluse/search_agent.md` | Recipe doc page; the previously missing H1 title is now fixed. References a dev note (`search-agent-sft-data-teaching-llms-to-browse-the-web`) that does not exist in the repo — same forward-link 404 risk as the enterprise_text_to_sql doc. |
Sequence Diagram
sequenceDiagram
participant S as Seed / Sampler
participant L as LLM
participant V as Validator
participant J as LLM Judges
participant T as Tavily MCP
rect rgb(30, 60, 90)
Note over S,J: Enterprise Text-to-SQL Pipeline
S->>S: Stage 1 — Category/Subcategory sampling<br/>(industry, topic, sql_complexity, sql_concept,<br/>sql_task_type, data_quality, knowledge, style)
S->>L: Stage 2 — Generate sql_prompt (NL request)
L->>L: Stage 3 — Generate sql_context (DDL + INSERT<br/>core tables + distractor tables + dirty data)
L->>L: Stage 4 — Generate sql (dialect-specific SQL)
L->>V: Stage 5a — SQL syntax validation (SQLite/MySQL/PG)
L->>J: Stage 5b — 5 LLM judges → 15 score columns
end
rect rgb(60, 30, 90)
Note over S,T: Search Agent Pipeline
S->>S: Stage 1 — Wikidata KG seed rows<br/>(seed_entity, final_answer_entity, readable_path)
S->>L: Stage 2a — Draft multi-hop search riddle
L->>L: Stage 2b — BrowseComp-style obfuscation
L->>T: Stage 3 — Agent loop: tavily_search calls<br/>(max 25 turns, 300 s timeout, ALL_MESSAGES trace)
T-->>L: Search results (observations)
L->>L: Stage 4 — Normalize raw output → AgentSolution JSON
end
Comments Outside Diff (3)
-
`docs/assets/recipes/mcp_and_tooluse/search_agent.py`, line 359 (link): `--artifact-path` should be suppressed from help output. The other MCP-based recipes — `pdf_qa.py` (line 528) and `basic_mcp.py` (line 199) — both pass `help=argparse.SUPPRESS` for `--artifact-path` because it's an internal argument used only for `make test-run-recipes` compatibility and not intended to be surfaced to end users. `search_agent.py` is inconsistent with this pattern by exposing the argument in the help output.
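The convention the reviewer references can be sketched as follows. The argument names come from the recipes themselves; the parser description and default values here are illustrative assumptions, not the recipes' actual ones:

```python
import argparse

# Minimal sketch of the pattern used by pdf_qa.py and basic_mcp.py:
# internal arguments keep working but are hidden from --help output.
parser = argparse.ArgumentParser(description="search_agent recipe (sketch)")
parser.add_argument("--model-alias", default="nemotron", help="Model alias to use")
parser.add_argument("--num-records", type=int, default=10, help="Records to generate")
# Hidden internal argument: still parsed and usable, but omitted from --help.
parser.add_argument("--artifact-path", default=None, help=argparse.SUPPRESS)

args = parser.parse_args(["--artifact-path", "/tmp/artifacts"])
print(args.artifact_path)  # the value is still captured normally
```

With `help=argparse.SUPPRESS`, argparse omits the entry from the generated help text entirely while leaving parsing behavior unchanged, which is what makes it suitable for test-harness-only flags.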
-
`docs/recipes/code_generation/enterprise_text_to_sql.md`, line 4 (link): referenced dev note does not exist yet. The link `../../devnotes/engineering-an-enterprise-grade-text-to-sql-dataset-with-nemo-data-designer/` points to a dev note that is not present in the repository. Searching `docs/devnotes/posts/` shows only four existing posts: `deep-research-trajectories.md`, `design-principles.md`, `rqa.md`, and `structured-outputs-from-nemotron.md`; there is no post with the slug `engineering-an-enterprise-grade-text-to-sql-dataset-with-nemo-data-designer`. If this recipe page is deployed before the corresponding dev note is merged, users who click the "Dev Note" link will hit a 404. The same issue applies to `search_agent.md` → `search-agent-sft-data-teaching-llms-to-browse-the-web`. Either create the dev note files in this PR, or replace the links with a forward-looking note until the dev notes are published.
-
`docs/assets/recipes/mcp_and_tooluse/search_agent.py`, lines 367-368 (link): `TAVILY_API_KEY` validation misses the empty-string case. `os.environ.get("TAVILY_API_KEY") is None` only catches a completely absent variable. If a user sets the variable to an empty string `""`, the `is None` check passes and `build_config()` constructs an MCP endpoint URL with an empty API key, which fails at the Tavily API level rather than with the clear `RuntimeError` here.
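A fix along the lines the reviewer suggests would treat the missing and empty-string cases the same. The environment variable name is from the recipe; the helper function and key value below are a hypothetical sketch:

```python
import os

def require_tavily_key() -> str:
    # Hypothetical helper: os.environ.get(...) returns None when the variable
    # is unset and "" when it is set but empty; a plain truthiness check
    # rejects both cases up front instead of failing later at the API.
    api_key = os.environ.get("TAVILY_API_KEY")
    if not api_key:  # catches None *and* ""
        raise RuntimeError("TAVILY_API_KEY must be set to a non-empty value")
    return api_key

os.environ["TAVILY_API_KEY"] = ""  # simulate the empty-string misconfiguration
try:
    require_tavily_key()
    failed_fast = False
except RuntimeError:
    failed_fast = True  # the check now fails here, not at the Tavily API level

os.environ["TAVILY_API_KEY"] = "tvly-example-key"  # simulate a valid key
valid_key = require_tavily_key()
```

The design point is simply that `if not api_key:` subsumes the `is None` check, so the clear `RuntimeError` fires for both misconfigurations.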
Last reviewed commit: 7676f8d
- Add ASCII pipeline diagram to docstring
- Add all 5 LLM judges (prompt, SQL, context, data quality, knowledge) with production rubrics (15 scoring dimensions)
- Expand samplers: 10 industries/50 topics, conditional task types, data quality concepts, knowledge dependency concepts
- Use dialect-specific prompts for schema and SQL generation
- Extract all 15 judge scores into flat columns
- Remove dev note references; recipe is fully standalone

- Rename recipes to "Nemotron Super Text to SQL" and "Nemotron Super Search Agent" across nav, cards, headings, and docstrings
- Add Nemotron Super training context to Python docstrings (BIRD benchmark results for text-to-sql, 7k trajectories for search agent)
- Add dev note links as admonition boxes in recipe markdown pages
- Add seed dataset guidance (required columns, generation process) to search agent recipe page
Tested both workflows after all the changes and they're working! Let me know if anything else is needed.
johnnygreco
left a comment
this is awesome, thanks @dhruvnathawani !!
Summary
Adds two new recipes that turn techniques from the dev notes into ready-to-run code:
Enterprise Text-to-SQL — A five-stage pipeline (seed → prompt → schema with distractors → dialect-specific SQL → validation + judges) based on the text-to-sql dev note. Demonstrates `SubcategorySamplerParams` for conditional sampling, distractor table/column injection, dirty data handling, per-dialect code validation (SQLite/MySQL/PostgreSQL), five LLM judges with score extraction, and prompt style diversification (instruction style × linguistic register × politeness level).
Search Agent — A Tavily-powered MCP pipeline for generating multi-turn search agent trajectories, based on the search agent SFT dev note. Seeds from Wikidata knowledge graph paths, generates BrowseComp-style obfuscated riddles through a two-stage LLM rewrite (draft → obfuscation), then runs a tool-using agent with live Tavily web search to produce full thought-action-observation trajectories captured via `with_trace=dd.TraceType.ALL_MESSAGES`. Uses `dd.MCPProvider` with Tavily's hosted `streamable_http` endpoint — no local server or extra dependencies needed.
Both recipes follow the existing conventions (PEP 723 script metadata, `build_config`/`create_dataset` or `serve`/`main` patterns, and `--model-alias`/`--num-records`/`--artifact-path` CLI args for `make test-run-recipes` compatibility).
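For reference, the PEP 723 inline script metadata mentioned above is a specially delimited comment block at the top of a script that tools like `uv run` read to resolve dependencies before execution. The header below is a generic illustration — the dependency list is not the recipes' actual one — with a small parser showing how the delimiters work:

```python
import re

# PEP 723 inline script metadata, as it would appear at the top of a recipe.
# The dependency list here is illustrative, not the recipes' actual list.
PEP723_HEADER = """\
# /// script
# requires-python = ">=3.10"
# dependencies = [
#     "requests",
# ]
# ///
"""

def parse_pep723(text: str) -> str:
    # Extract the TOML body between the `# /// script` and `# ///` delimiter
    # lines, stripping the leading "# " comment prefix from each line.
    match = re.search(r"^# /// script$(.+?)^# ///$", text, re.M | re.S)
    body = match.group(1)
    return "\n".join(line[2:] for line in body.strip().splitlines())

toml_body = parse_pep723(PEP723_HEADER)
print(toml_body)  # plain TOML: requires-python plus a dependencies array
```

Because the metadata lives in comments, the script stays a valid standalone Python file while still declaring everything a runner needs to set up its environment.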
Files changed