# 🎨 NeMo Data Designer: Text-to-SQL

> ⚠️ **Warning**: NeMo Data Designer is currently in Early Release and is not recommended for production use.
>
> **Note**: To run this notebook, you must have the NeMo Data Designer microservice deployed locally via Docker Compose. See the [deployment guide](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html) for more details.
>
> Alternatively, you can use the [NeMo Data Designer managed service](https://build.nvidia.com/nemo/data-designer). Please refer to the [intro-tutorials](../../intro-tutorials/1-the-basics.ipynb) to learn how to connect to it.
>
> **Note**: If you are using the NeMo Data Designer managed service, you will only be able to launch preview jobs. You will not be able to launch jobs using the `create` method.

This notebook demonstrates how to use NeMo Data Designer to create a synthetic data generation pipeline for SQL code examples. We'll build a system that generates SQL code based on natural language instructions, with varying complexity levels and industry focuses.

#### 💾 Install dependencies

**IMPORTANT** 👉 If you haven't already, follow the instructions in the [README](../../README.md) to install the necessary dependencies. Note that you may need to restart your kernel after setting up the environment.

In [None]:
from nemo_microservices import NeMoMicroservices
from nemo_microservices.beta.data_designer import (
    DataDesignerConfigBuilder,
    DataDesignerClient,
)
from nemo_microservices.beta.data_designer.config import columns as C
from nemo_microservices.beta.data_designer.config import params as P

### ⚙️ Initialize the NeMo Data Designer Client

- The Data Designer client is responsible for submitting generation requests to the Data Designer microservice.
- In this notebook, we connect to a local deployment of Data Designer. You can deploy your own instance by following the [deployment instructions](https://docs.nvidia.com/nemo/microservices/latest/set-up/deploy-as-microservices/data-designer/docker-compose.html).


In [None]:
data_designer_client = DataDesignerClient(client=NeMoMicroservices(base_url="http://localhost:8080"))

### 🏗️ Initialize the Data Designer Config Builder

- The Data Designer config defines the dataset schema and generation process.

- The config builder provides an intuitive interface for building this configuration.

- You must provide a list of model configs to the builder at initialization.

- This list contains the models you can choose from (via the `model_alias` argument) during the generation process.


In [None]:
# We specify the endpoint of the model during deployment using the model_provider_registry.
model_id = "nvidia/nvidia-nemotron-nano-9b-v2"
model_alias = "nemotron-nano-9b-v2"

In [None]:
config_builder = DataDesignerConfigBuilder(
    model_configs=[
        P.ModelConfig(
            alias=model_alias,
            provider="nvidiabuild",
            model=model_id,
            inference_parameters=P.InferenceParameters(
                max_tokens=1024,
                temperature=0.6,
                top_p=0.95,
            ),
            is_reasoner=True
        ),
    ]
)
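
The `temperature` and `top_p` values above shape how each output token is sampled. As a rough illustration of the textbook definitions (temperature-scaled softmax followed by nucleus filtering), here is a minimal sketch in plain Python. It is not the serving endpoint's implementation; the helper names and the logits are made up for this example.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature (lower = sharper)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for four candidate tokens
probs = apply_temperature(logits, temperature=0.6)
nucleus = top_p_filter(probs, top_p=0.95)

print([round(p, 3) for p in probs])
print(sorted(nucleus))  # indices of tokens that remain eligible for sampling
```

With these made-up logits, the least likely token falls outside the nucleus and can never be sampled, while a temperature below 1.0 concentrates probability on the top candidate.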

## 🌱 Define Categorical Seed Columns

We'll set up our seed columns for industry sectors, code complexity, and instruction types. These will help generate diverse and relevant SQL examples.

In [None]:
# Add industry sector categories
config_builder.add_column(
    name="industry_sector",
    type="category",
    params={
        "values": ["Healthcare", "Finance", "Technology"],
        "description": "The industry sector for the SQL example"
    }
)

# Add topic as a subcategory of industry_sector
config_builder.add_column(
    name="topic",
    type="subcategory",
    params={
        "category": "industry_sector",
        "values": {
            "Healthcare": [
                "Electronic Health Records (EHR) Systems",
                "Telemedicine Platforms",
                "AI-Powered Diagnostic Tools"
            ],
            "Finance": [
                "Fraud Detection Software",
                "Automated Trading Systems",
                "Personal Finance Apps"
            ],
            "Technology": [
                "Cloud Computing Platforms",
                "Artificial Intelligence and Machine Learning Platforms",
                "DevOps and CI/CD Tools"
            ]
        }
    }
)

# Add SQL complexity levels
config_builder.add_column(
    name="sql_complexity",
    type="category",
    params={
        "values": ["Beginner", "Intermediate", "Advanced"],
        "description": "The complexity level of the SQL code"
    }
)

# Add SQL concept as a subcategory of sql_complexity
config_builder.add_column(
    name="sql_concept",
    type="subcategory",
    params={
        "category": "sql_complexity",
        "values": {
            "Beginner": [
                "Basic SELECT Statements",
                "WHERE Clauses",
                "Basic JOINs",
                "INSERT, UPDATE, DELETE"
            ],
            "Intermediate": [
                "Aggregation Functions",
                "Multiple JOINs",
                "Subqueries",
                "Views"
            ],
            "Advanced": [
                "Window Functions",
                "Common Table Expressions (CTEs)",
                "Stored Procedures",
                "Query Optimization"
            ]
        }
    }
)

# Add SQL task types
config_builder.add_column(
    name="sql_task_type",
    type="category",
    params={
        "values": [
            "Data Retrieval",
            "Data Manipulation",
            "Analytics and Reporting",
            "Data Transformation"
        ],
        "description": "The type of SQL task being performed"
    }
)

# Add instruction phrases
config_builder.add_column(
    name="instruction_phrase",
    type="category",
    params={
        "values": [
            "Write an SQL query that",
            "Create an SQL statement to",
            "Develop an SQL query to",
            "Can you write SQL that",
            "Formulate an SQL query that"
        ],
        "description": "Starting phrase for the SQL instruction"
    }
)
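
The relationship between a `category` column and its `subcategory` column can be pictured as two-stage sampling: pick a parent value, then pick a child from that parent's list. The following is a minimal sketch in plain Python, not Data Designer's implementation; the `sample_seed_row` helper and the uniform sampling are assumptions made for illustration.

```python
import random

# Values copied from the industry_sector / topic columns defined above.
TOPICS_BY_SECTOR = {
    "Healthcare": [
        "Electronic Health Records (EHR) Systems",
        "Telemedicine Platforms",
        "AI-Powered Diagnostic Tools",
    ],
    "Finance": [
        "Fraud Detection Software",
        "Automated Trading Systems",
        "Personal Finance Apps",
    ],
    "Technology": [
        "Cloud Computing Platforms",
        "Artificial Intelligence and Machine Learning Platforms",
        "DevOps and CI/CD Tools",
    ],
}


def sample_seed_row(rng):
    """Hypothetical helper: draw a sector, then a topic valid for that sector."""
    sector = rng.choice(list(TOPICS_BY_SECTOR))
    topic = rng.choice(TOPICS_BY_SECTOR[sector])
    return {"industry_sector": sector, "topic": topic}


rng = random.Random(0)  # fixed seed so the sketch is reproducible
rows = [sample_seed_row(rng) for _ in range(5)]
for row in rows:
    print(row)
```

Because the topic is always drawn from the chosen sector's list, a row can never pair, say, "Finance" with "Telemedicine Platforms".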

## ✨ Define Generated Data Columns

Now we'll set up the columns that will be generated by the LLMs, including the instruction, database context, and SQL implementation.

In [None]:
# Generate instruction for the SQL query
config_builder.add_column(
    name="sql_prompt",
    type="llm-text",
    model_alias=model_alias,
    system_prompt="You are an expert at generating clear and specific SQL tasks.",
    prompt="""\
Generate an instruction to create SQL code that solves a specific problem.
Each instruction should begin with one of the following phrases: {{instruction_phrase}}.

Important Guidelines:
* Industry Relevance: Ensure the instruction pertains to the {{industry_sector}} sector and {{topic}} topic.
* SQL Complexity: Tailor the instruction to the {{sql_complexity}} level. Utilize relevant {{sql_concept}} where appropriate to match the complexity level.
* Task Type: The instruction should involve a {{sql_task_type}} task.
* Clarity and Specificity: Make the problem statement clear and unambiguous. Provide sufficient context to understand the requirements without being overly verbose.
* Response Formatting: Do not include any markers such as ### Response ### in the instruction.
"""
)

# Generate database context
config_builder.add_column(
    name="sql_context",
    type="llm-code",
    model_alias=model_alias,
    output_format=P.CodeLang.SQL_ANSI, # Specify CodeLang.SQL_ANSI to ensure the code is structured as valid SQL
    system_prompt="You are an expert SQL database designer who creates clean, efficient, and well-structured database schemas.",
    prompt="""\
Generate the SQL for creating database tables that would be relevant for the following instruction:
Instruction: {{sql_prompt}}

Important Guidelines:
* Relevance: Ensure all tables are directly related to the {{industry_sector}} sector and {{topic}} topic.
* Completeness: Include all essential columns with appropriate data types, primary/foreign keys, and necessary constraints.
* Realism: Use realistic table structures typical for the specified industry.
* Executable SQL: Provide complete CREATE TABLE statements that can be run without modification.
* Consistency: Use consistent naming conventions (e.g., snake_case for table and column names).
* Sample Data: Include INSERT statements with sample data that makes sense for the tables (at least 5-10 rows per table).
"""
)

# Generate the SQL code
config_builder.add_column(
    name="sql",
    type="llm-code",
    model_alias=model_alias,
    output_format=P.CodeLang.SQL_ANSI, # Specify CodeLang.SQL_ANSI to ensure the code is structured as valid SQL
    system_prompt="You are an expert SQL programmer who writes clean, efficient, and well-structured queries.",
    prompt="""\
Write SQL code for the following instruction based on the provided database context:
Instruction: {{sql_prompt}}

Database Context:
{{sql_context}}

Important Guidelines:
* Code Quality: Your SQL should be clean, complete, self-contained and accurate.
* Code Validity: Please ensure that your SQL code is executable and does not contain any errors.
* Context: Base your query on the provided database context. Only reference tables and columns that exist in the context.
* Complexity & Concepts: The SQL should be written at a {{sql_complexity}} level, making use of concepts such as {{sql_concept}}.
* Task Type: Ensure your solution implements the appropriate {{sql_task_type}} operation.
* Comments: Include brief comments explaining the key parts of your query.
"""
)
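
Each prompt above references earlier columns through `{{column_name}}` placeholders, which is why `sql_prompt` has to be generated before `sql_context`, and both before `sql`. As a rough illustration of that substitution, the sketch below fills placeholders with a regular expression. It is only a stand-in: the `render` helper and the sample record are hypothetical, and the service's actual template engine may behave differently.

```python
import re

# Matches {{name}} with optional whitespace inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, record):
    """Hypothetical helper: replace each {{name}} with record[name]."""

    def substitute(match):
        name = match.group(1)
        if name not in record:
            # A missing key means the referenced column does not exist yet.
            raise KeyError(f"column {name!r} has not been generated yet")
        return str(record[name])

    return PLACEHOLDER.sub(substitute, template)


record = {
    "instruction_phrase": "Write an SQL query that",
    "industry_sector": "Finance",
    "topic": "Fraud Detection Software",
}
template = (
    "Begin with: {{instruction_phrase}}. "
    "Sector: {{industry_sector}}; topic: {{topic}}."
)
rendered = render(template, record)
print(rendered)
```

Rendering a template that mentions `{{sql_prompt}}` against this record would raise `KeyError`, which mirrors the ordering constraint between the generated columns.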

## 🔍 Add Validation and Evaluation

Let's add post-processing steps to validate the generated code and evaluate the text-to-SQL conversion.

In [None]:
from nemo_microservices.beta.data_designer.config.params.rubrics import TEXT_TO_SQL_LLM_JUDGE_PROMPT_TEMPLATE, SQL_RUBRICS

# Add validators and evaluators
config_builder.add_column(name="sql_validity_result",
                          model_alias=model_alias,
                          type="code-validation",
                          code_lang=P.CodeLang.SQL_ANSI,
                          target_column="sql")


config_builder.add_column(name="sql_judge_result",
                          type="llm-judge",
                          model_alias=model_alias,
                          prompt=TEXT_TO_SQL_LLM_JUDGE_PROMPT_TEMPLATE,
                          rubrics=SQL_RUBRICS)
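
To build intuition for what a validity check on the `sql` column looks for, it can help to run a generated query against its context locally. The sketch below uses Python's built-in `sqlite3` module. This is an independent sanity check, not how the `code-validation` column works, and SQLite's dialect differs from ANSI SQL in places. The `runs_in_sqlite` helper and the sample schema are hypothetical.

```python
import sqlite3


def runs_in_sqlite(context_sql, query_sql):
    """Hypothetical helper: return (ok, error_message) after running the
    query against a fresh in-memory database built from context_sql."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(context_sql)  # CREATE TABLE / INSERT statements
        conn.executescript(query_sql)    # results are discarded; errors raise
        return True, None
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


# Made-up stand-ins for one record's sql_context and sql values.
context_sql = """
CREATE TABLE transactions (
    transaction_id INTEGER PRIMARY KEY,
    account_id INTEGER NOT NULL,
    amount REAL NOT NULL
);
INSERT INTO transactions VALUES (1, 10, 250.0), (2, 10, 9800.0), (3, 11, 42.5);
"""

good = runs_in_sqlite(
    context_sql,
    "SELECT account_id, SUM(amount) FROM transactions GROUP BY account_id;",
)
bad = runs_in_sqlite(context_sql, "SELECT missing_column FROM transactions;")

print(good)
print(bad)
```

A check like this only shows that the query executes; whether it answers the instruction is a separate question, which is the role of the `llm-judge` column.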

## 👀 Generate Preview Dataset

Let's generate a preview to see some data.

In [None]:
# Generate a preview
preview = data_designer_client.preview(config_builder, verbose_logging=True)

In [None]:
preview.display_sample_record()

## 🚀 Generate Full Dataset

If you're satisfied with the preview, you can generate a larger dataset using a batch workflow.

In [None]:
# Submit the batch job without blocking, then wait for it to finish
job_results = data_designer_client.create(config_builder, num_records=20, wait_until_done=False)

job_results.wait_until_done()

In [None]:
dataset = job_results.load_dataset()
print("\nGenerated dataset shape:", dataset.shape)

dataset.head()
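
Once the job finishes, a quick check is whether the seed categories are reasonably balanced across the generated records. The sketch below counts category combinations with `collections.Counter` over a small hand-written list of dictionaries. The sample records and the `count_combinations` helper are made up for illustration; if `dataset` is a pandas DataFrame (as `shape` and `head()` suggest, though that is an assumption here), `dataset.to_dict("records")` would be expected to yield the same structure.

```python
from collections import Counter

# Hypothetical records; real ones would carry every column defined above.
records = [
    {"industry_sector": "Finance", "sql_complexity": "Beginner"},
    {"industry_sector": "Finance", "sql_complexity": "Advanced"},
    {"industry_sector": "Healthcare", "sql_complexity": "Beginner"},
    {"industry_sector": "Finance", "sql_complexity": "Beginner"},
]


def count_combinations(records, columns):
    """Hypothetical helper: count how often each combination of values
    in the given columns appears across the records."""
    return Counter(tuple(record[col] for col in columns) for record in records)


coverage = count_combinations(records, ["industry_sector", "sql_complexity"])
for combo, count in coverage.most_common():
    print(combo, count)
```

With only 20 records, some combinations will naturally be missing, so counts like these are more informative on larger runs.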