# Data Curation with NIMs: Synthetic Chat QA Generation

For chat data, prompts can come from both public, real-world user interactions and a synthetic data generation pipeline. Synthetic prompts cover various tasks such as open QA, closed QA, and creative writing. 

For each prompt task, the LLM can be seeded with a diverse set of topics or keywords so that the generated prompts cover a wide variety of subjects. For the responses, the LLM can be prompted for multiple generations per prompt so that a reward model can score them and only the best are kept (rejection sampling), which yields higher-quality synthetic data.
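As a rough illustration of the rejection sampling idea, the sketch below keeps the highest-scoring candidate from a set of generations. The names `rejection_sample` and `toy_reward` are hypothetical and not part of this notebook; `toy_reward` simply counts words and stands in for a real reward model.

```python
from typing import Callable, List, Optional


def rejection_sample(
    candidates: List[str],
    score_fn: Callable[[str], float],
    min_score: float = 0.0,
) -> Optional[str]:
    """Return the highest-scoring candidate, or None if none reaches min_score."""
    if not candidates:
        return None
    best = max(candidates, key=score_fn)
    return best if score_fn(best) >= min_score else None


def toy_reward(text: str) -> float:
    # Hypothetical stand-in for a reward model: score by word count.
    return float(len(text.split()))


candidates = [
    "Cardiff.",
    "Cardiff is the capital of Wales.",
    "The capital is Cardiff.",
]
best = rejection_sample(candidates, toy_reward, min_score=3.0)
print(best)
```

In a real pipeline, `score_fn` would call a trained reward model instead of a word count.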


### Requirements:
* Container: `nvcr.io/nvidia/nemo:25.07`
* GPUs: One NVIDIA GPU with at least 20 GB of GPU memory is required for Endpoint deployment.
* Storage: To persist datasets, mount shared storage in the Dev Pod.

---

**Table of Contents**

* Inference Endpoint Setup
* OpenAI Client
* Prompt Templates
    * Topics and Subtopics
    * Generating Questions
    * Generating Responses to Questions
* Math Problems
* Summary & Next Steps

---

![Figure 1. Chat data curation pipeline](images/chat-data-curation-pipeline.png)


This notebook demonstrates how to use [Nemotron-nano-9b-v2](https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2) to synthetically generate chat QA data.
The pipeline is designed to create a preference dataset suitable for training a reward model with the [SteerLM method](https://docs.nvidia.com/nemo-framework/user-guide/latest/modelalignment/steerlm.html). The resulting data can be used to train an RLHF reward model or for DPO with a framework such as [NeMo-RL](https://github.com/NVIDIA-NeMo/RL).

### Inference Endpoint Setup
In this notebook, we will launch an inference server using DGX Cloud Lepton's Endpoints to be able to send prompts to a model. The notebook walks through this process, but you can also use `build.nvidia.com` to send an API request to an existing NVIDIA-managed endpoint - more on that later on.

First, we need to set up an Endpoint to host the model we will use for generating datasets. The cell below configures several environment settings required for accessing DGX Cloud Lepton. You will need an API key for accessing your workspace. Follow [this link](https://docs.nvidia.com/dgx-cloud/lepton/features/workspace/token/) to generate an API key for authenticating with your cluster. Once generated, assign it to the `LEPTON_KEY` variable in the format `<workspace ID>:<Lepton API key>` (without brackets), where `<workspace ID>` is the unique identifier for your workspace.

Next, update the `RESOURCE_SHAPE` and `NODE_GROUP` variables to match your workspace. The default model only needs approximately 20 GB of GPU memory so select an appropriate GPU resource for deploying the model.

In [None]:
import os
import re
import subprocess
import time

BASE_MODEL = "nvidia/nvidia-nemotron-nano-9b-v2"  # Optionally use a different model for requests
ACCESS_TOKEN = "my-access-token"  # Set the password for authenticating endpoint requests
SAVE_DIRECTORY = ""  # Specify the absolute path to save the generated data. To save on shared storage, must be a mounted storage path

os.environ["LEPTON_KEY"] = "<workspace ID>:<Lepton API key>"  # Set the workspace ID and Lepton API key for authenticating with DGX Cloud Lepton
os.environ["RESOURCE_SHAPE"] = "gpu.1xh200"  # Select the appropriate GPU resource for deploying the model
os.environ["NODE_GROUP"] = ""  # Select the appropriate node group for deploying the model
os.environ["ACCESS_TOKEN"] = ACCESS_TOKEN  # Set the password for authenticating endpoint requests
os.environ["BASE_MODEL"] = BASE_MODEL  # Set the model to be used for data generation
os.environ["ENDPOINT_NAME"] = "nemotron-nano-9b-v2"  # Set the name for the endpoint
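Before deploying, it can help to confirm that the placeholder key was actually replaced. The helper below is a minimal sketch; `looks_like_lepton_key` is a hypothetical name, and the check only assumes the `<workspace ID>:<Lepton API key>` format described above, not any further rules about what a valid key looks like.

```python
def looks_like_lepton_key(value: str) -> bool:
    """Return True if value has a non-empty workspace ID and API key separated by ':'."""
    workspace_id, separator, api_key = value.partition(":")
    if not separator or not workspace_id or not api_key:
        return False
    # Angle brackets suggest the documentation placeholder was left unchanged.
    return "<" not in value and ">" not in value


print(looks_like_lepton_key("<workspace ID>:<Lepton API key>"))
print(looks_like_lepton_key("my-workspace:abc123"))
```

With the environment configured above, the check could be applied as `looks_like_lepton_key(os.environ["LEPTON_KEY"])`.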

Now we authenticate with DGX Cloud Lepton using the provided credentials and deploy the model as an Endpoint.

In [None]:
%%bash

lep login -c $LEPTON_KEY

lep endpoint create \
--resource-shape $RESOURCE_SHAPE \
--node-group $NODE_GROUP \
--container-image "vllm/vllm-openai" \
--container-port 8080 \
--container-command "vllm serve ${BASE_MODEL} --port 8080 --trust_remote_code" \
--name $ENDPOINT_NAME \
--tokens $ACCESS_TOKEN

Wait for the Endpoint to be available for requests:

In [None]:
def wait_for_endpoint(endpoint_name: str, interval: int = 10) -> str:
    """Poll `lep endpoint status` until the endpoint is ready, then return its URL."""
    command = ["lep", "endpoint", "status", "-n", endpoint_name, "--detail"]
    while True:
        result = subprocess.run(command, capture_output=True, text=True, check=True)
        ready = False
        url = None
        for line in result.stdout.split("\n"):
            if line.startswith("State"):
                # The state is the last space-separated token on the line
                _, state = line.strip().rsplit(" ", maxsplit=1)
                ready = "LeptonDeploymentState.Ready" in state
            url_match = re.search(r"https://[\w.\-]+", line)
            if url_match and url is None:
                url = url_match[0]
        # Only return the URL once the endpoint reports that it is ready
        if ready and url:
            print("Endpoint deployed!")
            print(f"URL: {url}")
            return url
        print(f"Waiting for endpoint {endpoint_name} to be ready...")
        time.sleep(interval)

endpoint_url = wait_for_endpoint(os.environ["ENDPOINT_NAME"])
# Build the URL with "/" explicitly; os.path.join uses the OS path separator
base_url = f"{endpoint_url.rstrip('/')}/v1"
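The polling loop above waits indefinitely. If a deployment never becomes ready, a deadline can be useful. The following is a minimal sketch of polling with a timeout; `poll_until` and `fake_check` are hypothetical names, and `fake_check` stands in for a real status query so that the example runs without DGX Cloud Lepton access.

```python
import time
from typing import Callable, Optional


def poll_until(
    check: Callable[[], Optional[str]],
    interval: float = 10.0,
    timeout: float = 1800.0,
) -> str:
    """Call check() until it returns a non-empty value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        value = check()
        if value:
            return value
        if time.monotonic() >= deadline:
            raise TimeoutError(f"No result after {timeout} seconds")
        time.sleep(interval)


# Toy check that succeeds on the third call, standing in for an endpoint status query.
calls = {"count": 0}


def fake_check() -> Optional[str]:
    calls["count"] += 1
    return "https://example.invalid" if calls["count"] >= 3 else None


url = poll_until(fake_check, interval=0.01, timeout=5.0)
print(url)
```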

### OpenAI Client

Now we're going to:
1. Initialize OpenAI's client.
2. Configure the LLM parameters.
3. Perform a request, and print the LLM response.

<div class="alert alert-block alert-information">
<b>NOTE:</b> As stated above, this notebook relies on a DGX Cloud Lepton Endpoint, but you can instead use an existing OpenAI-compatible API service, such as <code>build.nvidia.com</code>. If desired, generate a personal API key as follows:<br>
1. Log in (or sign up) at <a href="https://build.nvidia.com">build.nvidia.com</a>.<br>
2. On the <a href="https://build.nvidia.com/nvidia/nvidia-nemotron-nano-9b-v2">nvidia/nemotron-nano-9b-v2 model</a> page, click the View Code button and then Generate API Key.<br><br>

Next, set `base_url` to `https://integrate.api.nvidia.com/v1` and `ACCESS_TOKEN` to the personal API key you generated.
</div>


In [None]:
from openai import OpenAI

#ACCESS_TOKEN = "my-nv-api-key"  # Optional: Use NVIDIA's API service instead of the deployed endpoint

# Initialize the OpenAI client against the deployed endpoint
client = OpenAI(
    #base_url = "https://integrate.api.nvidia.com/v1",  # Optional: Use NVIDIA's API service instead of the deployed endpoint
    base_url = base_url,
    api_key = ACCESS_TOKEN
)

# Example: Generate a limerick about GPU computing
completion = client.chat.completions.create(
    model=BASE_MODEL,
    messages=[{"role":"user","content":"Write a limerick about the wonders of GPU computing."}],
    temperature=0.6,
    top_p=0.95,
    max_tokens=2048,
    frequency_penalty=0,
    presence_penalty=0,
    stream=True,
    extra_body={
        "min_thinking_tokens": 1024,
        "max_thinking_tokens": 2048
    }
)

# Print the streamed response (reasoning text, if any, and the answer text)
for chunk in completion:
    if not chunk.choices:  # Skip chunks that carry no choices (e.g. usage-only chunks)
        continue
    delta = chunk.choices[0].delta
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(reasoning, end="")
    if delta.content is not None:
        print(delta.content, end="")
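If the streamed text is needed later rather than only printed, the chunks can be accumulated into strings. The sketch below assumes chunks shaped like the ones used above (a `choices` list whose first item has a `delta` with `content` and an optional `reasoning_content`). `collect_stream` and `fake_chunk` are hypothetical names, and the fake chunks replace a live endpoint so the example runs offline.

```python
from types import SimpleNamespace
from typing import Iterable, Tuple


def collect_stream(chunks: Iterable) -> Tuple[str, str]:
    """Accumulate streamed deltas into (reasoning_text, answer_text)."""
    reasoning_parts, content_parts = [], []
    for chunk in chunks:
        if not chunk.choices:  # Skip chunks without choices
            continue
        delta = chunk.choices[0].delta
        reasoning = getattr(delta, "reasoning_content", None)
        if reasoning:
            reasoning_parts.append(reasoning)
        content = getattr(delta, "content", None)
        if content:
            content_parts.append(content)
    return "".join(reasoning_parts), "".join(content_parts)


def fake_chunk(content=None, reasoning=None):
    # Builds an object with the same attribute layout as a streamed chunk.
    delta = SimpleNamespace(content=content, reasoning_content=reasoning)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk(reasoning="Think about rhymes. "),
    fake_chunk(content="There once was "),
    SimpleNamespace(choices=[]),
    fake_chunk(content="a GPU."),
]
reasoning_text, answer_text = collect_stream(fake_stream)
print(answer_text)
```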

### Prompt Templates

To generate questions and responses, we need a prompt template so that the model can understand how to generate the data.

We will use:
* A prompt template to generate subtopics from a user provided topic
* A prompt template to generate questions for a given subtopic
* A prompt template to generate responses for a given question

#### Topics and Subtopics

In [None]:
TOPIC_GENERATION_PROMPT_TEMPLATE = """\
Given a topic, generate a list of {n_subtopics} subtopics that are related to the topic.

The topic is: {topic}

The list must be without numbers, and without any description of the subtopics. The subtopics should be separated by a comma. There must be no other text than the list.
"""

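We switch to `AsyncOpenAI` so that requests can be awaited and, in later sections, issued concurrently. Jupyter already runs an event loop, so the cells below can use top-level `await` directly.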

In [None]:
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url = base_url,
    api_key = ACCESS_TOKEN
)

In [None]:
async def generate_subtopics(client, topic, n_subtopics): 
    prompt = TOPIC_GENERATION_PROMPT_TEMPLATE.format(topic=topic, n_subtopics=n_subtopics)
    return await client.chat.completions.create(
        model=BASE_MODEL,
        messages=[
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
        top_p=0.7,
        max_tokens=1024,
    )

In [None]:
# Example topic and number of subtopics to generate
# (You can change these values as needed)
topic = "Wales"
n_subtopics = 5

In [None]:
responses = await generate_subtopics(client, topic=topic, n_subtopics=n_subtopics)
nonreasoning_answer = re.sub(r'.*</think>', "", responses.choices[0].message.content, flags=re.DOTALL).strip()

In [None]:
print(nonreasoning_answer)

Now we can store the generated subtopics as a list for the next stage.

In [None]:
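# Strip surrounding whitespace and drop empty entries left by stray commas
subtopic_list = [s.strip() for s in nonreasoning_answer.split(",") if s.strip()]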
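Model output does not always follow the requested format exactly, so a slightly more defensive parser may be worthwhile. The sketch below removes any reasoning block, accepts commas or newlines as separators, and drops duplicates. `parse_subtopics` is a hypothetical helper, and the sample string is made up for illustration.

```python
import re
from typing import List


def parse_subtopics(raw: str) -> List[str]:
    """Turn a raw model answer into a clean, de-duplicated list of subtopics."""
    # Drop everything up to and including a closing </think> tag, if present.
    answer = re.sub(r".*</think>", "", raw, flags=re.DOTALL)
    seen, subtopics = set(), []
    for item in re.split(r"[,\n]", answer):
        name = item.strip().strip(".")
        key = name.lower()
        if name and key not in seen:
            seen.add(key)
            subtopics.append(name)
    return subtopics


sample = "<think>List five.</think>\nHistory, Geography , Welsh language,\nhistory, Rugby."
print(parse_subtopics(sample))
```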

#### Generating Questions

With a list of subtopics, the next step is to generate a set of questions for each subtopic. The following code defines a function to batch-generate questions using the LLM.

In [None]:
QUESTION_PROMPT_TEMPLATE = """\
Given a topic, generate {n_questions} questions that could be asked about that topic. Your response should be in a list format.

The topic is: {sub_topic}

The list must be without numbers. The questions should be separated by a newline character. There must be no other text than the list.
"""

In [None]:
async def generate_questions(client, sub_topic, n_questions):
    prompt = QUESTION_PROMPT_TEMPLATE.format(sub_topic=sub_topic, n_questions=n_questions)
    response = await client.chat.completions.create(
        model=BASE_MODEL,
        messages=[
            {"role": "system", "content": "/no_think"},
            {"role": "user", "content": prompt},
        ],
        temperature=0.2,
        top_p=0.7,
        max_tokens=1024,
    )
    return response.choices[0].message.content

In [None]:
import asyncio

async def question_generator(client, subtopic_list, n_questions):
    tasks = [generate_questions(client, subtopic, n_questions) for subtopic in subtopic_list]
    return await asyncio.gather(*tasks)
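`asyncio.gather` returns results in the same order as the awaitables it was given, regardless of which one finishes first. That ordering is what lets results be matched back to their inputs. The sketch below shows this with a hypothetical `fake_generate` coroutine that sleeps for a random time instead of calling a model. It uses `asyncio.run` so that it works as a script; inside a notebook, `await main(subtopics)` would be used instead.

```python
import asyncio
import random


async def fake_generate(subtopic: str) -> str:
    # Random delay so that tasks finish in an unpredictable order.
    await asyncio.sleep(random.uniform(0.0, 0.05))
    return f"questions about {subtopic}"


async def main(subtopics):
    return await asyncio.gather(*(fake_generate(s) for s in subtopics))


subtopics = ["history", "geography", "language"]
results = asyncio.run(main(subtopics))
print(results)
```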

In [None]:

n_questions = 5
question_list = await question_generator(client, subtopic_list, n_questions)

In [None]:
question_list

Now we can convert the questions into a single long list for downstream response generation.

In [None]:
question_list_formatted = []

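for question_set in question_list:
    # Skip blank lines so that empty strings are not sent as questions
    question_list_formatted += [q.strip() for q in question_set.split("\n") if q.strip()]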

In [None]:
len(question_list_formatted)
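Even when asked for a plain list, a model may add bullets or numbering, or repeat a question across subtopics. The following sketch shows one possible cleanup pass; `clean_questions` is a hypothetical helper and the sample input is invented.

```python
import re
from typing import Iterable, List


def clean_questions(question_sets: Iterable[str]) -> List[str]:
    """Flatten question sets, removing list markers, blank lines, and duplicates."""
    cleaned: List[str] = []
    for question_set in question_sets:
        for line in question_set.splitlines():
            # Remove leading markers such as "- ", "* ", "1. " or "2) ".
            text = re.sub(r"^\s*(?:[-*]|\d+[.)])\s+", "", line).strip()
            if text and text not in cleaned:
                cleaned.append(text)
    return cleaned


sample_sets = [
    "- What is the capital of Wales?\n\n2. How many people speak Welsh?",
    "What is the capital of Wales?\n* Where is Snowdonia?",
]
questions = clean_questions(sample_sets)
print(len(questions))
```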

#### Generating Responses to Questions

Using the question list, we can prompt `Nemotron-nano-9b-v2` to generate multiple responses for each question. 

Note: This section includes concurrency control to avoid rate limit errors using the API.

In [None]:
RESPONSE_PROMPT_TEMPLATE = """\
Given a question, generate 2 responses that could be given to that question. Your response should be in a list format.

The question is: {question}

The list must be in the format:

RESPONSE A: Response A text here
RESPONSE B: Response B text here
"""

**Tip:** Limit concurrency with a semaphore to avoid API rate limit errors when generating many responses in parallel.

In [None]:
async def generate_response(client, question, sem=None):
    prompt = RESPONSE_PROMPT_TEMPLATE.format(question=question)
    async with sem or asyncio.Semaphore(1): 
        response = await client.chat.completions.create(
            model=BASE_MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
            top_p=0.7,
            max_tokens=1024,
        )
    return response.choices[0].message.content

In [None]:
async def response_generator(client, question_list, max_concurrent=5):
    sem = asyncio.Semaphore(max_concurrent) if max_concurrent else None
    tasks = [generate_response(client, question, sem) for question in question_list]
    return await asyncio.gather(*tasks)

In [None]:
question_response_list = await response_generator(client, question_list_formatted, max_concurrent=5)
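To see what the semaphore does, the sketch below runs 20 fake requests with a limit of 5 and records the largest number in flight at any moment. `run_limited` and `fake_request` are hypothetical names, and `asyncio.sleep` stands in for the API call. It uses `asyncio.run` so that it works as a script; in a notebook, `await run_limited(...)` would be used instead.

```python
import asyncio


async def run_limited(n_tasks: int, max_concurrent: int) -> int:
    """Run n_tasks fake requests and return the peak number running at once."""
    sem = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_request(i: int) -> int:
        nonlocal in_flight, peak
        async with sem:  # At most max_concurrent tasks pass this point together
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # Stand-in for the network call
            in_flight -= 1
        return i

    await asyncio.gather(*(fake_request(i) for i in range(n_tasks)))
    return peak


peak = asyncio.run(run_limited(n_tasks=20, max_concurrent=5))
print(peak)
```

A lower `max_concurrent` trades throughput for fewer rate limit errors; the right value depends on the limits of the service being called.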

In [None]:
question_response_pair_list = []

for question, response_set in zip(question_list_formatted, question_response_list, strict=False):
    question_response_pair_list.append(
        {
            "question": question,
            "responses": {
                "response_a": {"response": response_set.split("RESPONSE B:")[0].replace("RESPONSE A:", "").strip()},
                "response_b": {"response": response_set.split("RESPONSE B:")[-1].split("\n\n")[0].strip()},
            },
        },
    )
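The string splitting above assumes the model always emits both labels. A regex-based parser can make a missing label explicit instead of silently producing a bad pair. This is a sketch under the assumption that responses follow the `RESPONSE A:` / `RESPONSE B:` format requested in the prompt; `parse_response_pair` is a hypothetical helper. Unlike the cell above, it keeps everything after `RESPONSE B:` rather than cutting at the first blank line.

```python
import re
from typing import Optional, Tuple

RESPONSE_PATTERN = re.compile(
    r"RESPONSE A:\s*(?P<a>.*?)\s*RESPONSE B:\s*(?P<b>.*)",
    flags=re.DOTALL,
)


def parse_response_pair(raw: str) -> Optional[Tuple[str, str]]:
    """Return (response_a, response_b), or None if the format was not followed."""
    # Drop any reasoning block that precedes the final answer.
    answer = re.sub(r".*</think>", "", raw, flags=re.DOTALL)
    match = RESPONSE_PATTERN.search(answer)
    if match is None:
        return None
    response_a = match.group("a").strip()
    response_b = match.group("b").strip()
    if not response_a or not response_b:
        return None
    return response_a, response_b


sample = (
    "<think>plan</think>\n"
    "RESPONSE A: Cardiff is the capital.\n"
    "RESPONSE B: The capital of Wales is Cardiff, on the south coast.\n"
)
print(parse_response_pair(sample))
```

Questions whose output parses to `None` could be dropped or sent back for regeneration.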

In [None]:
# Save the generated question-response pairs to a JSONL file for downstream use.
import json

save_path = os.path.join(SAVE_DIRECTORY, "synthetic_data.jsonl")

with open(save_path, "w") as f:
    for item in question_response_pair_list:
        f.write(json.dumps(item))
        f.write("\n")

print(f"Saved {len(question_response_pair_list)} question-response pairs to {save_path}")
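Reading the file back is a cheap way to confirm that every line is valid JSON with the expected keys. The sketch below writes one made-up record to a temporary directory and loads it again. `load_pairs` is a hypothetical helper, and the record layout mirrors the dictionaries built above.

```python
import json
import os
import tempfile

REQUIRED_RESPONSES = ("response_a", "response_b")


def load_pairs(path: str) -> list:
    """Load a JSONL file and check that each record has the expected keys."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            if not line.strip():
                continue
            record = json.loads(line)
            responses = record.get("responses", {})
            missing = [key for key in REQUIRED_RESPONSES if key not in responses]
            if "question" not in record or missing:
                raise ValueError(f"Line {line_number} is malformed")
            records.append(record)
    return records


sample = [
    {
        "question": "What is the capital of Wales?",
        "responses": {
            "response_a": {"response": "Cardiff."},
            "response_b": {"response": "The capital is Cardiff."},
        },
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "synthetic_data.jsonl")
    with open(path, "w", encoding="utf-8") as f:
        for item in sample:
            f.write(json.dumps(item) + "\n")
    loaded = load_pairs(path)

print(len(loaded))
```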

### Math Problems

We can also generate math problems based on a specific subtopic, using a custom parser.

In [None]:
MATH_PROMPT_TEMPLATE = (
    "Create {n_problems} diverse mathematics problems related to the topic '{topic}' "
    "or solvable using concepts from '{topic}'. Provide your response as a numbered list, "
    "and include both the problem description and its solution. Format your response as follows: "
    "((###1###)) >>>Problem<<<: [Description of the first problem]. >>>Solution<<<: [Solution to the first problem].\n"
    "((###2###)) >>>Problem<<<: [Description of the second problem]. >>>Solution<<<: [Solution to the second problem].\n"
    "Only include the problems and their solutions—no additional text."
)

In [None]:
n_problems = 5
topic = "Algebra"

In this case, we will use a custom parser to clean our list of questions and answers.

In [None]:
from typing import List, Tuple

def parse_math_llm_response(
    llm_response: str, tag_replacements: dict
    ) -> List[Tuple[str, str]]:
    """
    Expects response from LLM to be wrapped between 2 distinct tags (>>>Problem<<<: and >>>Solution<<<:).
    Uses regex to extract each problem-solution pair, then cleans and replaces original tags with
    more user-friendly labels.
    """
    
    first_tag, second_tag = list(tag_replacements.keys())[:2]

    pattern = rf"{re.escape(first_tag)}(.*?){re.escape(second_tag)}(.*?)(?=\(\(###|\Z)"
    matches = re.findall(pattern, llm_response, re.DOTALL)

    problem_solution_pairs = []
    for problem, solution in matches:
        for tag, replacement in tag_replacements.items():
            problem = problem.replace(tag, replacement).strip()
            solution = solution.replace(tag, replacement).strip()
        problem_solution_pairs.append((problem.strip(), solution.strip()))

    return problem_solution_pairs


async def generate_math(
    client,
    topic: str,
    n_problems: int,
    math_prompt_template: str,
    tag_replacements: dict,
    n_retries: int = 3,
) -> List[Tuple[str, str]]:
    """
    Async function to generate math problems and parse them into (problem, solution) pairs.
    """
    math_problems = []
    prompt = math_prompt_template.format(topic=topic, n_problems=n_problems)

    for attempt in range(n_retries):
        try:
            response = await client.chat.completions.create(
                model=BASE_MODEL,
                messages=[{"role": "user", "content": prompt}],
                temperature=0.1,
                top_p=0.9,
                max_tokens=1024,
            )

            # Extract raw LLM text
            llm_response = response.choices[0].message.content

            # Parse the response into (problem, solution) pairs
            math_problems = parse_math_llm_response(llm_response, tag_replacements)

            if math_problems:
                break  # success, exit retries

            # An empty result means the model ignored the format, so try again
            print(f"Attempt {attempt+1}/{n_retries} returned no parsable problems")

        except Exception as e:
            print(f"Attempt {attempt+1}/{n_retries} failed: {e}")

    return math_problems


# Example usage
tag_replacements = {
    ">>>Problem<<<:": "Problem:",
    ">>>Solution<<<:": "Solution:",
}

In [None]:
math_problems = await generate_math(client, topic, n_problems, MATH_PROMPT_TEMPLATE, tag_replacements)

for idx, (problem, solution) in enumerate(math_problems):
    print(f"Problem {idx+1}: {problem}")
    print(f"Solution {idx+1}: {solution}\n")
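To check the parsing logic without calling the model, the same regular expression can be run against a hand-written response. The sketch below hard-codes the two tags from the prompt template; `extract_pairs` and the sample text are illustrative and not part of the notebook.

```python
import re
from typing import List, Tuple

# Same structure as the pattern in parse_math_llm_response, with the tags written out.
PAIR_PATTERN = re.compile(
    r">>>Problem<<<:(.*?)>>>Solution<<<:(.*?)(?=\(\(###|\Z)",
    flags=re.DOTALL,
)


def extract_pairs(text: str) -> List[Tuple[str, str]]:
    """Return (problem, solution) pairs found in a formatted response."""
    return [(p.strip(), s.strip()) for p, s in PAIR_PATTERN.findall(text)]


sample_response = (
    "((###1###)) >>>Problem<<<: Solve 2x + 3 = 11. >>>Solution<<<: x = 4.\n"
    "((###2###)) >>>Problem<<<: Factor x^2 - 9. >>>Solution<<<: (x - 3)(x + 3).\n"
)
pairs = extract_pairs(sample_response)
for problem, solution in pairs:
    print(problem, "->", solution)
```

The lookahead `(?=\(\(###|\Z)` ends each solution at the next numbered marker or at the end of the text, so a solution that itself contains parentheses is not cut short.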

### Summary & Next Steps

- You have now generated a synthetic chat QA dataset using NVIDIA's Nemotron LLMs.
- The data is saved in `synthetic_data.jsonl` and ready for use in reward modeling or DPO training.
- Next steps: Evaluate the data, filter for quality, and consider augmenting with real-world data for best results.
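
As one possible starting point for the quality filtering step, the sketch below drops records with an empty question, very short responses, or two identical responses. `filter_pairs` and the `min_words` threshold are hypothetical; a reward model or human review would be a stronger filter than these heuristics.

```python
from typing import Dict, List


def filter_pairs(records: List[Dict], min_words: int = 3) -> List[Dict]:
    """Keep records with a question and two distinct, non-trivial responses."""
    kept = []
    for record in records:
        response_a = record["responses"]["response_a"]["response"].strip()
        response_b = record["responses"]["response_b"]["response"].strip()
        if not record["question"].strip():
            continue
        if len(response_a.split()) < min_words or len(response_b.split()) < min_words:
            continue
        if response_a.lower() == response_b.lower():
            continue  # Identical responses carry no preference signal
        kept.append(record)
    return kept


def make_record(question: str, a: str, b: str) -> Dict:
    return {
        "question": question,
        "responses": {"response_a": {"response": a}, "response_b": {"response": b}},
    }


records = [
    make_record("What is the capital of Wales?", "The capital is Cardiff.", "Cardiff is the Welsh capital."),
    make_record("Where is Snowdonia?", "In north Wales.", "in north wales."),
    make_record("", "Some long enough answer.", "Another long enough answer."),
]
filtered = filter_pairs(records)
print(len(filtered))
```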

---