<img src="./images/DLI_Header.png" style="width: 400px;">

# 3. Generating Topics and Subtopics with NeMo Curator

In this notebook, we will demonstrate how to generate synthetic data using **[NVIDIA NeMo Curator](https://github.com/NVIDIA/NeMo-Curator/)**, a powerful tool designed to create high-quality datasets for training large language models (LLMs).

Synthetic data generation is particularly valuable when real-world data is limited, noisy, or difficult to obtain. NeMo Curator simplifies this process by providing prebuilt pipelines and customizable modules that allow developers to generate, filter, and refine synthetic datasets efficiently. 

By leveraging its synthetic data generation capabilities, we will produce a diverse list of **topics and subtopics in English and Spanish**.

**[3.1 NeMo Curator OpenAI Client](#3.1-NeMo-Curator-OpenAI-Client)<br>**
**[3.2 Topics-Subtopics Generation in English and Spanish](#3.2-Topics-Subtopics-Generation-in-English-and-Spanish)<br>**

---

## Connecting to the NVIDIA API Catalog

NeMo Curator supports connecting to [OpenAI API](https://github.com/openai/openai-python?tab=readme-ov-file#openai-python-api-library) compatible services and [NeMo Deploy](https://docs.nvidia.com/nemo-framework/user-guide/latest/deployingthenemoframeworkmodel.html#use-nemo-export-and-deploy-module-apis-to-run-inference) services.

In this notebook, we rely on the `build.nvidia.com` API endpoints. The same flow works with a model deployed as an NVIDIA NIM for LLMs; see the NeMo Curator guide on [connecting to an LLM service](https://github.com/NVIDIA/NeMo-Curator/blob/main/docs/user-guide/syntheticdata.rst#connecting-to-an-llm-service).

Your environment already has an NVIDIA API key installed for you. For work outside of this workshop environment, please see the instructions below for how to obtain your own free NVIDIA API key.

### Obtaining Your Own NVIDIA API Key

If you would like an NVIDIA API key for your own work outside this workshop environment, you can generate one for free using the following steps:

1. Login (or sign up) through [build.nvidia.com](https://build.nvidia.com/explore/discover).
2. Click the `Get API Key` button available on the `nvidia/nemotron-4-340b-instruct` page, found [here](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct).

---

## 3.1 NeMo Curator OpenAI Client

### Loading NVIDIA API Credentials

Before connecting to NVIDIA's API, we need to load the required credentials. This cell automatically checks multiple locations:

1. **Project directory** (priority 1): `./secrets.env` (in the same folder as this notebook)
2. **Home directory** (priority 2): `~/.nvidia/secrets.env`
3. **Environment variables** (priority 3): Pre-set in some workshop environments

**Required credentials:**
- `NVIDIA_API_KEY`: Your NVIDIA API key from build.nvidia.com
- `NVIDIA_BASE_URL`: The NVIDIA API endpoint (https://integrate.api.nvidia.com/v1)

**Get your free API key:**
Visit https://build.nvidia.com/nvidia/nemotron-4-340b-instruct and click "Get API Key"


In [None]:
# Load NVIDIA API credentials from secrets file
import os
from pathlib import Path

# Path to secrets file - check multiple locations
# Priority: 1) Local project directory, 2) Home directory, 3) Environment variables
try:
    project_secrets = Path("secrets.env")
    home_secrets = Path.home() / ".nvidia" / "secrets.env"
except Exception as e:
    print(f"Warning: Path setup issue: {e}")
    project_secrets = None
    home_secrets = None

def load_secrets_from_file(filepath):
    """Load environment variables from a secrets file"""
    try:
        if not filepath or not filepath.exists():
            return False
        
        print(f"Loading secrets from {filepath}")
        with open(filepath, 'r') as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith('#') and '=' in line:
                    key, value = line.split('=', 1)
                    os.environ[key.strip()] = value.strip().strip('"').strip("'")
        print("✓ NVIDIA API credentials loaded")
        return True
    except Exception as e:
        print(f"Error loading secrets: {e}")
        return False

# Try loading from different locations
loaded = False
if project_secrets and project_secrets.exists():
    loaded = load_secrets_from_file(project_secrets)
elif home_secrets and home_secrets.exists():
    loaded = load_secrets_from_file(home_secrets)
elif "NVIDIA_API_KEY" in os.environ:
    print("✓ Using NVIDIA_API_KEY from environment variables")
    loaded = True
else:
    print("⚠️  NVIDIA_API_KEY not found!")
    print("\nSearched locations:")
    print("  1. ./secrets.env")
    print("  2. ~/.nvidia/secrets.env")
    print("  3. Environment variables")
    print("\nPlease create a secrets.env file with:")
    print("   NVIDIA_API_KEY=nvapi-xxxxxxxxxxxxx")
    print("   NVIDIA_BASE_URL=https://integrate.api.nvidia.com/v1")

# Verify credentials are available
if "NVIDIA_API_KEY" in os.environ and "NVIDIA_BASE_URL" in os.environ:
    print(f"\n✓ API Key: {os.environ['NVIDIA_API_KEY'][:10]}...")
    print(f"✓ Base URL: {os.environ['NVIDIA_BASE_URL']}")
else:
    print("\n❌ Missing required environment variables!")
    print("   Required: NVIDIA_API_KEY, NVIDIA_BASE_URL")


✓ NVIDIA API key loaded successfully


We are going to:

1. Initialize OpenAI's client.
2. Initialize NeMo Curator's `OpenAIClient`.
3. Send a request and print the LLM response.

Notice that the request has nearly the same structure as a call made with the original OpenAI client.

In [2]:
from openai import OpenAI

# Initialize OpenAI's client with the NVIDIA API endpoint and the API key loaded from secrets.env
openai_client = OpenAI(
    base_url=os.getenv("NVIDIA_BASE_URL"),
    api_key=os.getenv("NVIDIA_API_KEY"),
)

print("✓ OpenAI client initialized successfully")

✓ OpenAI client initialized successfully


In [3]:
from nemo_curator import OpenAIClient

# Initialize NeMo Curator's OpenAIClient by passing the OpenAI client instance.
# This wraps the OpenAI client to provide additional functionality specific to NeMo Curator.
curator_openai_client = OpenAIClient(openai_client)

In [4]:
# Perform a request to a hosted model
llm_response = curator_openai_client.query_model(
    model="nvidia/nemotron-mini-4b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Write a limerick about the wonders of GPU computing.",
        }
    ],
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
)

# Print response
print(llm_response[0])

There once was a scientist, so clever and bright,
Who used GPU computing, day and night.
With CUDA and OpenCL,
He solved complex problems,
In a flash, like a lightning bolt.
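
To make the similarity with the plain OpenAI client concrete, here is a minimal sketch of the same kind of request issued through an OpenAI-style client directly. It assumes, without confirming against the NeMo Curator source, that `query_model` forwards its arguments to `chat.completions.create` and returns one string per choice. The helper `query_with_raw_openai` and the `fake_client` stand-in are hypothetical names introduced here so the cell runs offline; in the notebook you would pass `openai_client` instead.

```python
from types import SimpleNamespace
from typing import Any, List


def query_with_raw_openai(
    client: Any, model: str, messages: list, **sampling_kwargs
) -> List[str]:
    """Send a chat request through an OpenAI-style client and return the reply texts."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        **sampling_kwargs,
    )
    # One string per returned choice, mirroring the `llm_response[0]` indexing used above.
    return [choice.message.content for choice in response.choices]


# Hypothetical offline stand-in for `openai_client`; it simply echoes the last message.
def _fake_create(**request):
    reply = f"echo: {request['messages'][-1]['content']}"
    choice = SimpleNamespace(message=SimpleNamespace(content=reply))
    return SimpleNamespace(choices=[choice])


fake_client = SimpleNamespace(
    chat=SimpleNamespace(completions=SimpleNamespace(create=_fake_create))
)

replies = query_with_raw_openai(
    fake_client,
    model="nvidia/nemotron-mini-4b-instruct",
    messages=[{"role": "user", "content": "Hello"}],
    temperature=0.2,
    top_p=0.7,
    max_tokens=1024,
)
print(replies[0])
```

The keyword arguments (`model`, `messages`, `temperature`, `top_p`, `max_tokens`) are the same ones passed to `query_model` in the cell above, which is why switching between the two clients requires little change.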


## 3.2 Topics-Subtopics Generation in English and Spanish

The NeMo Curator Synthetic Data Generation (SDG) features are primarily accessed through the `NemotronGenerator` class.

This wrapper exposes both:

1. Pre-built SDG pipelines.
2. A number of specific generation utilities.

Before heading into the pre-built pipelines, we're going to break apart an existing pipeline, the Topics-Subtopics Generation Pipeline, and see the granular customization that NeMo Curator provides for each step.

We're going to work through the following process (extracted from [Nemotron-4 340B Technical Report](https://arxiv.org/pdf/2406.11704)):

1. Generate `n` Macro Topics - Have our LLM generate `n` broad topics relating to daily life, the world, etc.
2. Generate `n` Sub Topics - Have our LLM take each Macro Topic and generate `n` subtopics related to it.

### 3.2.1 Generating `n` Macro Topics in English and Spanish

Our first step is to generate our Macro Topics. 

Let's look at the prompt that drives this process as well, to get a better understanding of what's happening "under the hood":

```python
"Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of our daily life, the world, and science? Your answer should be a list of topics. Make the topics as diverse as possible. For example, 1. Food and drinks. \n2. Technology.\n"
```

To do this, we'll use the [`generate_macro_topics`](https://github.com/NVIDIA/NeMo-Curator/blob/cd4c4907bd4d87cd11d0f37be4ae0fe167a79696/nemo_curator/synthetic/nemotron.py#L115) method of our `NemotronGenerator`.

> NOTE: All prompt templates are fully customizable, and we'll take a look at how we can do that in the upcoming cells!
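
As a small illustration of what a prompt template is, the sketch below fills the `{n_macro_topics}` placeholder with Python's `str.format`. The template text is copied from the prompt shown above; the variable name `macro_topics_template` is hypothetical, and the assumption that the library performs the substitution in this way has not been checked against its source.

```python
# Template text copied from the prompt above; the library constant may differ by version.
macro_topics_template = (
    "Can you generate {n_macro_topics} comprehensive topics that encompass various aspects of "
    "our daily life, the world, and science? Your answer should be a list of topics. "
    "Make the topics as diverse as possible. For example, 1. Food and drinks. \n2. Technology.\n"
)

# Fill the placeholder the same way any Python format string is filled.
prompt = macro_topics_template.format(n_macro_topics=5)
print(prompt)
```

Any custom template passed as `prompt_template` should keep the same placeholder names so that the substitution still succeeds.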

In [5]:
from nemo_curator.synthetic import NemotronGenerator

# NemotronGenerator exposes NeMo Curator's synthetic data generation methods.
generator = NemotronGenerator(curator_openai_client)

# Model used to generate synthetic data.
model = "mistralai/mistral-7b-instruct-v0.3"
model_kwargs = {
    "temperature": 0.1,
    "top_p": 0.9,
    "max_tokens": 1024,
}

# Number of macro topics to generate
n_macro_topics = 5

# Generate macro topics
llm_response = generator.generate_macro_topics(
    model=model, model_kwargs=model_kwargs, n_macro_topics=n_macro_topics
)

print(llm_response[0])

1. Sustainable Agriculture and Food Systems: This topic covers the production, distribution, and consumption of food in a way that ensures environmental, economic, and social sustainability. It includes organic farming, permaculture, food waste reduction, and fair trade.

2. Climate Change and Renewable Energy: This topic discusses the impact of human activities on climate change, the science behind it, and the solutions to mitigate its effects. It includes renewable energy sources, carbon capture technologies, and climate policies.

3. Mental Health and Wellness: This topic focuses on understanding mental health, its importance, and strategies for maintaining it. It includes stress management, mindfulness, therapy, and the role of technology in mental health.

4. Global Politics and Diplomacy: This topic explores the political landscape of the world, international relations, diplomacy, and global governance. It includes peacekeeping, human rights, international law, and the role of mu

While this is a great start, we'd love to have this response in a Python list.

We'll use the NeMo Curator `convert_response_to_yaml_list` method to accomplish this goal.

Note that this method is currently quite strict. Custom parsing might be required depending on the model and the use case.
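
Because the conversion step can be strict, one possible form of custom parsing is a small regular-expression fallback. The sketch below is not part of NeMo Curator: `parse_numbered_list` is a hypothetical helper that assumes the model answers with a numbered or bulleted list and that any description follows the topic title after a colon, as in the response printed above.

```python
import re
from typing import List

# Matches lines such as "1. Topic", "2) Topic", "- Topic" or "* Topic".
_ITEM_PATTERN = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+(.*\S)\s*$")


def parse_numbered_list(llm_text: str) -> List[str]:
    """Extract topic titles from a numbered or bulleted LLM response."""
    topics = []
    for line in llm_text.splitlines():
        match = _ITEM_PATTERN.match(line)
        if not match:
            continue  # Skip blank lines and any introductory or closing sentences.
        # Keep only the title: drop a trailing description and a final period.
        title = match.group(1).split(": ", 1)[0].rstrip(".").strip()
        if title:
            topics.append(title)
    return topics


sample_response = (
    "1. Sustainable Agriculture and Food Systems: This topic covers food production.\n"
    "\n"
    "2. Climate Change and Renewable Energy: This topic discusses climate policy.\n"
    "3. Technology.\n"
)
print(parse_numbered_list(sample_response))
```

A parser like this avoids a second LLM call, at the cost of silently dropping any line that does not follow the expected list format.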

In [6]:
from typing import List

from nemo_curator.synthetic.error import YamlConversionError
from nemo_curator.synthetic.prompts import DEFAULT_MACRO_TOPICS_PROMPT_TEMPLATE


def generate_macro_topics(
    generator: NemotronGenerator,
    model: str,
    model_kwargs: dict,
    n_macro_topics: int,
    prompt_template: str = DEFAULT_MACRO_TOPICS_PROMPT_TEMPLATE,
    n_retries: int = 5,
) -> List[str]:
    """
    Generates a list of macro topics using a language model and retries on YAML conversion errors.

    This method leverages a `NemotronGenerator` instance to generate macro topics by invoking a
    language model (LLM). It processes the LLM's response, converts it into a YAML-formatted list,
    and retries the process up to `n_retries` times if a `YamlConversionError` occurs.

    Args:
        generator (NemotronGenerator): An instance of the `NemotronGenerator` class responsible for
            generating and processing LLM responses.
        model (str): The name or identifier of the language model to be used for generating macro topics.
        model_kwargs (dict): A dictionary of additional keyword arguments to configure the language model.
        n_macro_topics (int): The number of macro topics to generate.
        prompt_template (str, optional): A string template used to construct the prompt for the LLM.
            Defaults to `DEFAULT_MACRO_TOPICS_PROMPT_TEMPLATE`.
        n_retries (int, optional): The maximum number of retry attempts in case of a `YamlConversionError`.
            Defaults to 5.

    Returns:
        List[str]: A list of generated macro topics as strings, or an empty list if every
            retry attempt fails with a `YamlConversionError`.
    """
    # Initialize an empty list to store the generated macro topics
    macro_topics = []

    # Attempt to generate and convert macro topics up to `n_retries` times
    for _ in range(n_retries):
        try:
            # Generate a response from the language model using the specified parameters
            llm_response = generator.generate_macro_topics(
                n_macro_topics=n_macro_topics,
                model=model,
                model_kwargs=model_kwargs,
                prompt_template=prompt_template,
            )

            # Convert the first response from the LLM into a YAML-formatted list of topics
            macro_topics = generator.convert_response_to_yaml_list(
                llm_response=llm_response[0], model=model, model_kwargs=model_kwargs
            )

            # Break out of the retry loop if conversion is successful
            break
        except YamlConversionError as e:
            # Print an error message and retry if YAML conversion fails
            print(f"Hit: {e}, Retrying...")

    # Return the generated list of macro topics (empty if all retries failed)
    return macro_topics


# Generate a list of macro topics in English
macro_topics_english = generate_macro_topics(
    generator=generator,
    model=model,
    model_kwargs=model_kwargs,
    n_macro_topics=n_macro_topics,
    prompt_template=DEFAULT_MACRO_TOPICS_PROMPT_TEMPLATE,
)

# Print the generated list of macro topics in English
print(macro_topics_english)

['Sustainable Agriculture and Food Systems', 'Climate Change and Renewable Energy', 'Mental Health and Wellness', 'Global Politics and Diplomacy', 'Artificial Intelligence and Ethics']


Let’s replicate the process in Spanish. To do this, we simply need to modify the prompt accordingly:

In [7]:
# Define a prompt template in Spanish for generating macro topics
macro_topics_prompt_template_spanish = (
    "Genera {n_macro_topics} temas amplios que abarquen diversos aspectos de nuestra vida diaria, el mundo y la ciencia. "
    "Tu respuesta debe ser únicamente una lista de temas, sin incluir explicaciones ni texto adicional. Por ejemplo: "
    "1. Comida y bebidas. \n2. Tecnología.\n"
)

# Generate macro topics list in Spanish
macro_topics_spanish = generate_macro_topics(
    generator=generator,
    model=model,
    model_kwargs=model_kwargs,
    n_macro_topics=n_macro_topics,
    prompt_template=macro_topics_prompt_template_spanish,
)

# Print the generated list of macro topics in Spanish
print(macro_topics_spanish)

['Salud y bienestar', 'Medio ambiente y sustentabilidad', 'Educación y aprendizaje', 'Economía y negocios', 'Ciencia y tecnología avanzada']


**Exercise:** Generate `n_macro_topics` in a language other than Spanish or English.

If you are not fluent in another language, use an LLM or an online translation tool (e.g., [Google Translate](https://translate.google.com), [DeepL](https://www.deepl.com/)) to translate one of the previous prompts into your chosen language.

Once translated, run the pipeline to generate macro topics in that language.

Afterward, reflect on the results:

- Did the topics align with the intended meaning of the original prompt?
- Were there any noticeable differences in diversity or quality compared to the topics generated in Spanish or English?

In [None]:
# Your code here

### 3.2.2 Generating `n` Subtopics in English and Spanish

We will follow the same procedure as before; however, this time, the prompt will be tailored to specify the generation of subtopics.

In [8]:
# Define a prompt template for generating subtopics in English.
subtopics_prompt_template_english = (
    "Generate {n_subtopics} topics that cover various aspects of {macro_topic}. "
    "Your response should only be a list of topics, as diverse as possible. Do not include explanations or additional text. For example: "
    "1. Food and drinks. \n2. Technology.\n"
)

# Specify the number of subtopics to generate for each macro topic.
n_subtopics = 3

# Use the generator to produce `n_subtopics` in English for the first macro topic in the list.
llm_response = generator.generate_subtopics(
    model=model,
    model_kwargs=model_kwargs,
    macro_topic=macro_topics_english[0],
    n_subtopics=n_subtopics,
    prompt_template=subtopics_prompt_template_english,
)

# Print the generated subtopics.
print(llm_response[0])

1. Agroecology and Biodiversity Conservation
2. Climate-Smart Agriculture and Carbon Sequestration
3. Food Waste Management and Zero Hunger Initiatives


As we did earlier, we will utilize the `convert_response_to_yaml_list` method to transform the LLM response into a Python list.

In [9]:
def generate_subtopics(
    model: str,
    model_kwargs: dict,
    macro_topic: str,
    n_subtopics: int,
    prompt_template: str,
    n_retries: int = 5,
) -> List[str]:
    """
    Generates a list of subtopics for a given macro topic using a language model (LLM).

    This function interacts with the `generator` object to produce subtopics based on the provided
    macro topic and prompt template. If the YAML conversion fails, it retries the process up to
    `n_retries` times.

    Args:
        model (str): The name or identifier of the language model to use.
        model_kwargs (dict): A dictionary of additional configuration parameters for the language model.
        macro_topic (str): The macro topic for which subtopics will be generated.
        n_subtopics (int): The number of subtopics to generate.
        prompt_template (str): The prompt template used to guide the LLM in generating subtopics.
        n_retries (int, optional): The maximum number of retry attempts if YAML conversion fails. Defaults to 5.

    Returns:
        List[str]: A list of generated subtopics as strings, or an empty list if every
            retry attempt fails with a `YamlConversionError`.
    """
    # Initialize an empty list to store generated subtopics
    subtopics = []

    # Retry loop to handle potential YAML conversion errors
    for _ in range(n_retries):
        try:
            # Generate a response from the language model using the specified parameters
            llm_response = generator.generate_subtopics(
                model=model,
                model_kwargs=model_kwargs,
                macro_topic=macro_topic,
                n_subtopics=n_subtopics,
                prompt_template=prompt_template,
            )

            # Convert the first response from the LLM into a YAML-formatted list of subtopics
            subtopics = generator.convert_response_to_yaml_list(
                llm_response=llm_response[0], model=model, model_kwargs=model_kwargs
            )

            # Exit the retry loop if conversion is successful
            break
        except YamlConversionError as e:
            # Print an error message and retry if YAML conversion fails
            print(f"Hit: {e}, Retrying...")

    # Return the generated list of subtopics (or an empty list if all retries failed)
    return subtopics


# Generate a list of subtopics in English for the first macro topic in `macro_topics_english`
subtopics_english = generate_subtopics(
    model=model,
    model_kwargs=model_kwargs,
    macro_topic=macro_topics_english[0],
    n_subtopics=n_subtopics,
    prompt_template=subtopics_prompt_template_english,
)

# Print the generated list of subtopics in English
print(subtopics_english)

['Agroecology and Biodiversity Conservation', 'Climate-Smart Agriculture and Carbon Sequestration', 'Food Waste Management and Zero Hunger Initiatives']


Let’s replicate the process in Spanish. To do this, we simply need to modify the prompt accordingly:

In [10]:
# Define a prompt template for generating subtopics in Spanish.
subtopics_prompt_template_spanish = (
    "Genera {n_subtopics} temas amplios que abarquen diversos aspectos de {macro_topic}. "
    "Tu respuesta debe ser únicamente una lista de temas. "
    "Cada tema es un elemento de la lista. No incluyas subtemas ni explicaciones ni texto adicional. "
    "Por ejemplo: 1. Comida y bebidas. \n2. Tecnología.\n"
)

# Use the generator to produce `n_subtopics` in Spanish for the first macro topic in the list.
subtopics_spanish = generate_subtopics(
    model=model,
    model_kwargs=model_kwargs,
    macro_topic=macro_topics_spanish[0],
    n_subtopics=n_subtopics,
    prompt_template=subtopics_prompt_template_spanish,
)

# Print the generated subtopics.
print(subtopics_spanish)

['Salud Mental y Emocional', 'Nutrición y Alimentación', 'Ejercicio y Actividad Física']
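
The cells above generate subtopics only for the first macro topic. A minimal sketch of extending this to every macro topic is shown below. `build_topic_tree` and `fake_subtopic_fn` are hypothetical names introduced for illustration; the stand-in function lets the cell run without an API call.

```python
from typing import Callable, Dict, List


def build_topic_tree(
    macro_topics: List[str],
    subtopic_fn: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Map every macro topic to the subtopics returned by `subtopic_fn`."""
    tree = {}
    for macro_topic in macro_topics:
        subtopics = subtopic_fn(macro_topic)
        if not subtopics:
            # The retry helper above returns [] when every attempt fails.
            print(f"No subtopics for {macro_topic!r}; skipping")
            continue
        tree[macro_topic] = subtopics
    return tree


# Hypothetical offline stand-in. In the notebook, `subtopic_fn` could instead be:
#   lambda topic: generate_subtopics(
#       model=model, model_kwargs=model_kwargs, macro_topic=topic,
#       n_subtopics=n_subtopics, prompt_template=subtopics_prompt_template_spanish,
#   )
def fake_subtopic_fn(macro_topic: str) -> List[str]:
    return [f"{macro_topic} - aspect {i}" for i in range(1, 4)]


topic_tree = build_topic_tree(
    ["Salud y bienestar", "Economía y negocios"], fake_subtopic_fn
)
for macro_topic, subtopics in topic_tree.items():
    print(macro_topic, "->", subtopics)
```

Passing the generation step in as a callable keeps the loop independent of the model and prompt language, so the same function can serve both the English and the Spanish lists.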


**Exercise:** Generate `n_subtopics` for a macro topic in a language other than Spanish or English.

Afterward, reflect on the results:
- Did the subtopics align with the intended meaning of the original prompt?
- Were there any noticeable differences in diversity, specificity, or quality compared to subtopics generated in Spanish or English?

In [None]:
# Your code here

---
<h2 style="color:green;">Congratulations!</h2>

In this notebook, you used NeMo Curator to generate lists of topics and subtopics in both English and Spanish.

In the next notebook, we will select a subtopic to create a **Question-and-Answer Dataset for Supervised Fine-Tuning (SFT)**, available in both languages.