# 👨‍💻 NeMo Data Designer: Text-to-SQL

#### 📚 What you'll learn

- This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for SQL code examples.

- We'll build a system that generates SQL code based on natural language instructions, with varying complexity levels and industry focuses.

<br>

> 👋 **IMPORTANT** – Environment Setup
>
> - If you haven't already, follow the instructions in the [README](../../../README.md) to install the necessary dependencies.
>
> - You may need to restart your notebook's kernel after setting up the environment.
> - In this notebook, we assume you have a self-hosted instance of Data Designer up and running.
>
> - For deployment instructions, see the [Installation Options](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html#installation-options) section of the [NeMo Data Designer documentation](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/index.html).


### 📦 Import the essentials

- The `data_designer` module of `nemo_microservices` exposes Data Designer's high-level SDK.

- The `essentials` module provides quick access to the most commonly used objects.


In [None]:
from nemo_microservices.data_designer.essentials import (
    CategorySamplerParams,
    CodeLang,
    CodeValidatorParams,
    DataDesignerConfigBuilder,
    InferenceParameters,
    LLMCodeColumnConfig,
    LLMJudgeColumnConfig,
    LLMTextColumnConfig,
    ModelConfig,
    NeMoDataDesignerClient,
    SamplerColumnConfig,
    SamplerType,
    Score,
    SubcategorySamplerParams,
    ValidationColumnConfig,
    ValidatorType,
)

### ⚙️ Initialize the NeMo Data Designer Client

- `NeMoDataDesignerClient` is responsible for submitting generation requests to the microservice.


In [None]:
NEMO_MICROSERVICES_BASE_URL = "http://localhost:8080"

data_designer_client = NeMoDataDesignerClient(base_url=NEMO_MICROSERVICES_BASE_URL)

### 🎛️ Define model configurations

- Each `ModelConfig` defines a model that can be used during the generation process.

- The "model alias" is used to reference the model in the Data Designer config (as we will see below).

- The "model provider" is the external service that hosts the model (see [the model config docs](https://docs.nvidia.com/nemo/microservices/latest/design-synthetic-data-from-scratch-or-seeds/configure-models.html) for more details).

- By default, the microservice uses [build.nvidia.com](https://build.nvidia.com/models) as the model provider.


In [None]:
# This name is set in the microservice deployment configuration.
MODEL_PROVIDER = "nvidiabuild"

# The model ID is from build.nvidia.com.
MODEL_ID = "nvidia/llama-3.3-nemotron-super-49b-v1"

# We choose this alias to be descriptive for our use case.
MODEL_ALIAS = "nemotron-super-49b-v1"

model_configs = [
    ModelConfig(
        alias=MODEL_ALIAS,
        model=MODEL_ID,
        provider=MODEL_PROVIDER,
        inference_parameters=InferenceParameters(
            temperature=0.6,
            top_p=0.95,
            max_tokens=1024,
            timeout=300,
        ),
    )
]

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- The list of model configs is provided to the builder at initialization.


In [None]:
config_builder = DataDesignerConfigBuilder(model_configs=model_configs)

## 🎲 Adding Sampler Columns

- Sampler columns generate synthetic data without calling an LLM.

- They are particularly useful for **steering the diversity** of the generated data, as we demonstrate below.


In [None]:
# Add industry sector categories
config_builder.add_column(
    SamplerColumnConfig(
        name="industry_sector",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Healthcare", "Finance", "Technology"],
        ),
    )
)

# Add topic as a subcategory of industry_sector
config_builder.add_column(
    SamplerColumnConfig(
        name="topic",
        sampler_type=SamplerType.SUBCATEGORY,
        params=SubcategorySamplerParams(
            category="industry_sector",
            values={
                "Healthcare": [
                    "Electronic Health Records (EHR) Systems",
                    "Telemedicine Platforms",
                    "AI-Powered Diagnostic Tools",
                ],
                "Finance": [
                    "Fraud Detection Software",
                    "Automated Trading Systems",
                    "Personal Finance Apps",
                ],
                "Technology": [
                    "Cloud Computing Platforms",
                    "Artificial Intelligence and Machine Learning Platforms",
                    "DevOps and CI/CD Tools",
                ],
            },
        ),
    )
)

# Add SQL complexity levels (the parent category for sql_concept below)
config_builder.add_column(
    SamplerColumnConfig(
        name="sql_complexity",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=["Beginner", "Intermediate", "Advanced"],
        ),
    )
)

# Add SQL concept as a subcategory of sql_complexity
config_builder.add_column(
    SamplerColumnConfig(
        name="sql_concept",
        sampler_type=SamplerType.SUBCATEGORY,
        params=SubcategorySamplerParams(
            category="sql_complexity",
            values={
                "Beginner": [
                    "Basic SELECT Statements",
                    "WHERE Clauses",
                    "Basic JOINs",
                    "INSERT, UPDATE, DELETE",
                ],
                "Intermediate": [
                    "Aggregation Functions",
                    "Multiple JOINs",
                    "Subqueries",
                    "Views",
                ],
                "Advanced": [
                    "Window Functions",
                    "Common Table Expressions (CTEs)",
                    "Stored Procedures",
                    "Query Optimization",
                ],
            },
        ),
    )
)

# Add SQL task types
config_builder.add_column(
    SamplerColumnConfig(
        name="sql_task_type",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=[
                "Data Retrieval",
                "Data Manipulation",
                "Analytics and Reporting",
                "Data Transformation",
            ],
        ),
    )
)

# Add instruction phrases
config_builder.add_column(
    SamplerColumnConfig(
        name="instruction_phrase",
        sampler_type=SamplerType.CATEGORY,
        params=CategorySamplerParams(
            values=[
                "Write an SQL query that",
                "Create an SQL statement to",
                "Develop an SQL query to",
                "Can you write SQL that",
                "Formulate an SQL query that",
            ],
        ),
    )
)
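
The category/subcategory relationship configured above can be pictured with a small standalone sketch. This is not Data Designer's implementation: it uses only the standard library, shortened value lists, and uniform sampling, all of which are assumptions made for illustration.

```python
import random

# Shortened stand-ins for the category -> subcategory values configured above.
INDUSTRY_TOPICS = {
    "Healthcare": ["Electronic Health Records (EHR) Systems", "Telemedicine Platforms"],
    "Finance": ["Fraud Detection Software", "Automated Trading Systems"],
}


def sample_record(rng: random.Random) -> dict:
    """Draw a parent category first, then a subcategory conditioned on it."""
    industry = rng.choice(list(INDUSTRY_TOPICS))
    topic = rng.choice(INDUSTRY_TOPICS[industry])
    return {"industry_sector": industry, "topic": topic}


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
records = [sample_record(rng) for _ in range(5)]
for record in records:
    print(record)
```

Because the topic is always drawn from the list that belongs to the sampled industry, every record pairs a sector with a topic that makes sense for it.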


## 🦜 Define Generated Data Columns

Now we'll set up the columns that will be generated by the LLMs, including the instruction, database context, and SQL implementation.


In [None]:
# Generate instruction for the SQL query
SQL_PROMPT_TEXT = (
    "Generate an instruction to create SQL code that solves a specific problem.\n"
    "Each instruction should begin with one of the following phrases: {{instruction_phrase}}.\n\n"
    "Important Guidelines:\n"
    "* Industry Relevance: Ensure the instruction pertains to the {{industry_sector}} sector and {{topic}} topic.\n"
    "* SQL Complexity: Tailor the instruction to the {{sql_complexity}} level. Utilize relevant {{sql_concept}} "
    "where appropriate to match the complexity level.\n"
    "* Task Type: The instruction should involve a {{sql_task_type}} task.\n"
    "* Clarity and Specificity: Make the problem statement clear and unambiguous. Provide sufficient context to "
    "understand the requirements without being overly verbose.\n"
    "* Response Formatting: Do not include any markers such as ### Response ### in the instruction.\n"
)

config_builder.add_column(
    LLMTextColumnConfig(
        name="sql_prompt",
        model_alias=MODEL_ALIAS,
        system_prompt="You are an expert at generating clear and specific SQL tasks.",
        prompt=SQL_PROMPT_TEXT,
    )
)

# Generate database context
SQL_CONTEXT_TEXT = (
    "Generate the SQL for creating database tables that would be relevant for the following instruction:\n"
    "Instruction: {{sql_prompt}}\n\n"
    "Important Guidelines:\n"
    "* Relevance: Ensure all tables are directly related to the {{industry_sector}} sector and {{topic}} topic.\n"
    "* Completeness: Include all essential columns with appropriate data types, primary/foreign keys, and necessary constraints.\n"
    "* Realism: Use realistic table structures typical for the specified industry.\n"
    "* Executable SQL: Provide complete CREATE TABLE statements that can be run without modification.\n"
    "* Consistency: Use consistent naming conventions (e.g., snake_case for table and column names).\n"
    "* Sample Data: Include INSERT statements with sample data that makes sense for the tables (at least 5-10 rows per table)."
)

config_builder.add_column(
    LLMCodeColumnConfig(
        name="sql_context",
        model_alias=MODEL_ALIAS,
        code_lang=CodeLang.SQL_ANSI,
        system_prompt=(
            "You are an expert SQL database designer who creates clean, efficient, and "
            "well-structured database schemas."
        ),
        prompt=SQL_CONTEXT_TEXT,
    )
)

# Generate the SQL code
SQL_CODE_TEXT = (
    "Write SQL code for the following instruction based on the provided database context:\n"
    "Instruction: {{sql_prompt}}\n\n"
    "Database Context:\n"
    "{{sql_context}}\n\n"
    "Important Guidelines:\n"
    "* Code Quality: Your SQL should be clean, complete, self-contained and accurate.\n"
    "* Code Validity: Please ensure that your SQL code is executable and does not contain any errors.\n"
    "* Context: Base your query on the provided database context. Only reference tables and columns that "
    "exist in the context.\n"
    "* Complexity & Concepts: The SQL should be written at a {{sql_complexity}} level, making use of "
    "concepts such as {{sql_concept}}.\n"
    "* Task Type: Ensure your solution implements the appropriate {{sql_task_type}} operation.\n"
    "* Comments: Include brief comments explaining the key parts of your query.\n"
)

config_builder.add_column(
    LLMCodeColumnConfig(
        name="sql",
        model_alias=MODEL_ALIAS,
        code_lang=CodeLang.SQL_ANSI,
        system_prompt="You are an expert SQL programmer who writes clean, efficient, and well-structured queries.",
        prompt=SQL_CODE_TEXT,
    )
)
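
The prompts above reference other columns through `{{ column_name }}` placeholders, which is why `sql_prompt` must exist before `sql_context`, and both before `sql`. The sketch below imitates that substitution with a regular expression. It is a simplified stand-in rather than the templating engine Data Designer uses, and the helper names `render` and `referenced_columns` are hypothetical.

```python
import re

# Matches {{name}} and {{ name }} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, record: dict) -> str:
    """Replace each placeholder with the matching value from the record."""
    return PLACEHOLDER.sub(lambda match: str(record[match.group(1)]), template)


def referenced_columns(template: str) -> set:
    """List the columns a prompt depends on."""
    return set(PLACEHOLDER.findall(template))


record = {
    "instruction_phrase": "Write an SQL query that",
    "industry_sector": "Finance",
    "topic": "Fraud Detection Software",
}
template = "Begin with: {{instruction_phrase}}. Sector: {{ industry_sector }}, topic: {{topic}}."

rendered = render(template, record)
print(rendered)
print(sorted(referenced_columns(template)))
```

Listing the referenced columns is a quick way to check that every placeholder in a prompt matches a column defined earlier in the config.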


## 🔍 Quality Assessment: LLM-as-a-Judge

When generating a synthetic dataset, we need a way to measure the quality of the generated data.
We use the LLM-as-a-Judge strategy to do this.

To do so, we define a rubric that the LLM uses to assess generation quality, along with a prompt
that provides the relevant instructions.


In [None]:
TEXT_TO_SQL_JUDGE_TEMPLATE = """\
You are an expert in SQL with deep knowledge of relational modeling, query semantics,
and performance tuning across common dialects (e.g., PostgreSQL, MySQL, SQLite, SQL Server).
You think critically about correctness, readability, and efficiency.

Use the SQL Query Quality Rubric below to score the **Generated SQL Query** based on the INSTRUCTIONS.

#### INSTRUCTIONS
The Generated SQL Query should be a valid response to the Natural Language Prompt below.

Natural Language Prompt:
{{ sql_prompt }}

Database Context:
{{ sql_context }}

Generated SQL Query:
{{ sql }}
"""

sql_scoring = [
    Score(
        name="Relevance",
        description="Adherence to INSTRUCTIONS and CONTEXT",
        options={
            "4": "Perfectly meets all specified requirements.",
            "3": "Meets most requirements with minor deviations.",
            "2": "Moderate deviation from the instructions.",
            "1": "Significant deviations from the instructions.",
            "0": "Does not adhere to the instructions.",
        },
    ),
    Score(
        name="SQL Correctness",
        description="Syntax and semantic correctness; returns the intended result",
        options={
            "4": "Valid SQL with correct joins, filters, grouping/aggregation, and NULL handling; produces the intended result set under the stated/implicit dialect.",
            "3": "Generally correct with minor issues (e.g., edge-case NULLs, minor grouping detail) but still likely yields the intended result.",
            "2": "Partially correct; noticeable semantic mistakes (joins, grouping, filters) that may change results or fail in edge cases.",
            "1": "Largely incorrect; major semantic or syntactic errors likely causing failure or wrong results.",
            "0": "Invalid SQL or unrelated to the task; will not run or cannot produce a meaningful result.",
        },
    ),
    Score(
        name="Readability",
        description="Formatting, clarity, and maintainability",
        options={
            "4": "Cleanly formatted (keywords/clauses consistently styled), clear structure (CTEs/subqueries where helpful), meaningful table/column aliases, and concise.",
            "3": "Generally readable with consistent formatting and understandable aliases; could be organized slightly better.",
            "2": "Somewhat readable but inconsistent formatting or confusing aliasing; structure is harder to follow.",
            "1": "Poorly formatted and hard to read; unclear structure and aliasing.",
            "0": "Unreadable or chaotic; no meaningful structure or styling.",
        },
    ),
    Score(
        name="Efficiency",
        description="Query performance best practices",
        options={
            "4": "Uses sargable predicates, appropriate joins, selective filters early, avoids SELECT *, unnecessary DISTINCT, and wasteful subqueries; likely to use indexes effectively.",
            "3": "Mostly efficient; minor opportunities for improvement (e.g., simplifying expressions, reducing data early).",
            "2": "Moderate inefficiencies (e.g., non-sargable filters, unnecessary nested subqueries, broad SELECT *).",
            "1": "Notably inefficient patterns likely causing large scans or poor plans.",
            "0": "Highly inefficient; ignores basic best practices and likely to perform very poorly.",
        },
    ),
]

# Add an LLM judge to evaluate code quality
config_builder.add_column(
    LLMJudgeColumnConfig(
        name="code_judge_result",
        model_alias=MODEL_ALIAS,
        prompt=TEXT_TO_SQL_JUDGE_TEMPLATE,
        scores=sql_scoring,
    )
)
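
Once the judge has scored each record, the rubric scores can be used to filter the dataset. The sketch below shows one possible rule. The shape of `judge_result` (a mapping from score name to a `score` string and a `reasoning` string) is an assumption for illustration, so inspect a real `code_judge_result` value before relying on it. The helpers `mean_score` and `passes` and their thresholds are hypothetical.

```python
# Hypothetical judge output for a single record; the real structure may differ.
judge_result = {
    "Relevance": {"score": "4", "reasoning": "Matches the instruction."},
    "SQL Correctness": {"score": "3", "reasoning": "Minor NULL edge case."},
    "Readability": {"score": "4", "reasoning": "Clear aliases."},
    "Efficiency": {"score": "2", "reasoning": "Uses SELECT *."},
}


def mean_score(result: dict) -> float:
    """Average the rubric scores, which the rubric defines as strings like "4"."""
    return sum(int(entry["score"]) for entry in result.values()) / len(result)


def passes(result: dict, min_each: int = 2, min_mean: float = 3.0) -> bool:
    """Keep a record only if every score and the mean clear their thresholds."""
    scores = [int(entry["score"]) for entry in result.values()]
    return min(scores) >= min_each and mean_score(result) >= min_mean


print(mean_score(judge_result))
print(passes(judge_result))
print(passes(judge_result, min_each=3))
```

Combining a per-dimension floor with a mean threshold rejects records that do well overall but fail badly on a single dimension such as correctness.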

## ⚡️ Quality Assessment: Code Validation

- Next, we add a validation column that checks the generated SQL and records any issues it finds.

- NeMo Data Designer includes a built-in code validation feature that automatically checks the syntactic correctness and executable validity of generated code snippets.

- This helps ensure that outputs from language models are not only syntactically correct, but also able to run successfully in the intended programming language environment.

- Leveraging this validation step significantly increases dataset quality by promptly identifying invalid or non-functional code, streamlining the process of generating reliable and production-ready data samples.

- NeMo Data Designer supports validation for these languages:

  - Python (CodeLang.PYTHON)

  - SQL dialects:

    - ANSI SQL (CodeLang.SQL_ANSI)

    - MySQL (CodeLang.SQL_MYSQL)

    - PostgreSQL (CodeLang.SQL_POSTGRES)

    - SQLite (CodeLang.SQL_SQLITE)

    - T-SQL (CodeLang.SQL_TSQL)

    - BigQuery (CodeLang.SQL_BIGQUERY)


In [None]:
config_builder.add_column(
    ValidationColumnConfig(
        name="code_validity_result",
        validator_type=ValidatorType.CODE,
        target_columns=["sql"],
        validator_params=CodeValidatorParams(
            code_lang=CodeLang.SQL_ANSI,
        ),
        batch_size=100,
    )
)
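
As a complement to the built-in validator, generated SQL can also be spot-checked locally by running it against a throwaway database. The sketch below does this with Python's `sqlite3` module. It is a separate illustration, not how Data Designer validates code, and it only accepts SQL that SQLite itself understands. The helper `check_sql` and the sample schema are hypothetical.

```python
import sqlite3


def check_sql(context_sql: str, query_sql: str) -> tuple:
    """Run the schema and the query in an in-memory database.

    Returns (is_valid, error_message). Expects a single statement in query_sql.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(context_sql)  # CREATE TABLE / INSERT statements
        conn.execute(query_sql).fetchall()  # the generated query itself
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


# Hypothetical database context, similar in spirit to a generated sql_context.
context = """
CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
INSERT INTO patients VALUES (1, 'Ada', 34), (2, 'Grace', 58);
"""

valid_result = check_sql(context, "SELECT name FROM patients WHERE age > 40")
invalid_result = check_sql(context, "SELECT name FORM patients")  # typo: FORM
print(valid_result)
print(invalid_result)
```

Executing the query against the generated context catches problems that a syntax check alone misses, such as references to tables or columns that the context never defines.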

### 🔁 Iteration is key – preview the dataset!

1. Use the `preview` method to generate a sample of records quickly.

2. Inspect the results for quality and format issues.

3. Adjust column configurations, prompts, or parameters as needed.

4. Re-run the preview until satisfied.


In [None]:
# Preview a few records
preview = data_designer_client.preview(config_builder)

In [None]:
# Display a sample record from the preview
preview.display_sample_record()

### 📊 Analyze the generated data

- Data Designer automatically generates a basic statistical analysis of the generated data.

- This analysis is available via the `analysis` property of generation result objects.


In [None]:
# Print the analysis as a table.
preview.analysis.to_report()

### 🆙 Scale up!

- Happy with your preview data?

- Use the `create` method to submit larger Data Designer generation jobs.


In [None]:
job_results = data_designer_client.create(config_builder, num_records=20)

# This will block until the job is complete.
job_results.wait_until_done()

In [None]:
# Load the generated dataset as a pandas DataFrame.
dataset = job_results.load_dataset()

dataset.head()

In [None]:
# Load the analysis results into memory.
analysis = job_results.load_analysis()

analysis.to_report()

In [None]:
TUTORIAL_OUTPUT_PATH = "data-designer-tutorial-output"

# Download the job artifacts and save them to disk.
job_results.download_artifacts(
    output_path=TUTORIAL_OUTPUT_PATH,
    artifacts_folder_name="artifacts-community-contributions-text-to-code-text-to-sql",
);