<img src="./images/DLI_Header.png" style="width: 400px;">

# 5. Generating Math Problems and Solutions with NeMo Curator

In this notebook, we will demonstrate how to use **NVIDIA NeMo Curator** to generate a set of math problems and their solutions on a specific topic.

The generated outputs will then be evaluated using the **Llama-3.1-nemotron-70b-reward model** to ensure quality and relevance.

**[5.1 NeMo Curator OpenAI Client](#5.1-NeMo-Curator-OpenAI-Client)<br>**
**[5.2 Math problems generation in English and Spanish](#5.2-Math-problems-generation-in-English-and-Spanish)<br>**
**[5.3 Combining subtopics and math problems generation to create a diverse set of math problems in English and Spanish](#5.3-Combining-subtopics-and-math-problems-generation-to-create-a-diverse-set-of-math-problems-in-English-and-Spanish)<br>**


---

## Connecting to the NVIDIA API Catalog

NeMo Curator supports connecting to [OpenAI API](https://github.com/openai/openai-python?tab=readme-ov-file#openai-python-api-library) compatible services and [NeMo Deploy](https://docs.nvidia.com/nemo-framework/user-guide/latest/deployingthenemoframeworkmodel.html#use-nemo-export-and-deploy-module-apis-to-run-inference) services.

In this notebook, we rely on the `build.nvidia.com` API endpoints. The same flow also works with a model deployed as an NVIDIA NIM for LLMs; see [Connecting to an LLM Service](https://github.com/NVIDIA/NeMo-Curator/blob/main/docs/user-guide/syntheticdata.rst#connecting-to-an-llm-service) in the NeMo Curator user guide.

Your environment already has an NVIDIA API key installed for you. For work outside of this workshop environment, please see the instructions below for how to obtain your own free NVIDIA API key.

### Obtaining Your Own NVIDIA API Key

If you would like an NVIDIA API key for your own work outside this workshop environment, you can generate one for free using the following steps:

1. Login (or sign up) through [build.nvidia.com](https://build.nvidia.com/explore/discover).
2. Click the `Get API Key` button available on the `Llama 3.1 Nemotron 70B Reward` page, found [here](https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-reward).

---

## 5.1 NeMo Curator OpenAI Client

We will begin by initializing NeMo Curator's `OpenAI Client`.

Please note that this step is identical to the process outlined in the previous notebook.

### Loading NVIDIA API Credentials

Before connecting to NVIDIA's API, we need to load the required credentials. This cell automatically checks multiple locations:

1. **Project directory** (priority 1): `./secrets.env` (in the same folder as this notebook)
2. **Home directory** (priority 2): `~/.nvidia/secrets.env`
3. **Environment variables** (priority 3): Pre-set in some workshop environments

**Required credentials:**
- `NVIDIA_API_KEY`: Your NVIDIA API key from build.nvidia.com
- `NVIDIA_BASE_URL`: The NVIDIA API endpoint (https://integrate.api.nvidia.com/v1)

**Get your free API key:**
Visit https://build.nvidia.com/nvidia/llama-3_1-nemotron-70b-reward and click "Get API Key"


In [1]:
# Load NVIDIA API credentials from secrets file
import os
from pathlib import Path

# Path to secrets file - check multiple locations
# Priority: 1) Local project directory, 2) Home directory, 3) Environment variables
try:
    project_secrets = Path("secrets.env")
    home_secrets = Path.home() / ".nvidia" / "secrets.env"
except Exception as e:
    print(f"Warning: Path setup issue: {e}")
    project_secrets = None
    home_secrets = None

def load_secrets_from_file(filepath):
    """Load environment variables from a secrets file"""
    try:
        if not filepath or not filepath.exists():
            return False
        
        print(f"Loading secrets from {filepath}")
        with open(filepath, 'r') as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith('#') and '=' in line:
                    key, value = line.split('=', 1)
                    os.environ[key.strip()] = value.strip().strip('"').strip("'")
        print("✓ NVIDIA API credentials loaded")
        return True
    except Exception as e:
        print(f"Error loading secrets: {e}")
        return False

# Try loading from different locations
loaded = False
if project_secrets and project_secrets.exists():
    loaded = load_secrets_from_file(project_secrets)
elif home_secrets and home_secrets.exists():
    loaded = load_secrets_from_file(home_secrets)
elif "NVIDIA_API_KEY" in os.environ:
    print("✓ Using NVIDIA_API_KEY from environment variables")
    loaded = True
else:
    print("⚠️  NVIDIA_API_KEY not found!")
    print("\nSearched locations:")
    print(f"  1. ./secrets.env")
    print(f"  2. ~/.nvidia/secrets.env")
    print(f"  3. Environment variables")
    print("\nPlease create a secrets.env file with:")
    print("   NVIDIA_API_KEY=nvapi-xxxxxxxxxxxxx")
    print("   NVIDIA_BASE_URL=https://integrate.api.nvidia.com/v1")

# Verify credentials are available
if "NVIDIA_API_KEY" in os.environ and "NVIDIA_BASE_URL" in os.environ:
    print(f"\n✓ API Key: {os.environ['NVIDIA_API_KEY'][:10]}...")
    print(f"✓ Base URL: {os.environ['NVIDIA_BASE_URL']}")
else:
    print("\n❌ Missing required environment variables!")
    print("   Required: NVIDIA_API_KEY, NVIDIA_BASE_URL")


Loading secrets from secrets.env
✓ NVIDIA API credentials loaded

✓ API Key: nvapi-vMsg...
✓ Base URL: https://integrate.api.nvidia.com/v1


In [2]:
import os

from nemo_curator import OpenAIClient
from nemo_curator.synthetic import NemotronGenerator
from openai import OpenAI

# Initialize OpenAI's client with the NVIDIA API endpoint and the API key.
openai_client = OpenAI(
    # Outside this workshop environment you would set NVIDIA_BASE_URL to "https://integrate.api.nvidia.com/v1".
    base_url=os.environ["NVIDIA_BASE_URL"],
    api_key=os.environ["NVIDIA_API_KEY"],
)

# Initialize NeMo Curator's OpenAIClient by passing the OpenAI client instance.
# This wraps the OpenAI client to provide additional functionality specific to NeMo Curator.
curator_openai_client = OpenAIClient(openai_client)

# Create an instance of NemotronGenerator, which facilitates synthetic data generation.
generator = NemotronGenerator(curator_openai_client)

# Model used to generate synthetic data.
model = "mistralai/mistral-7b-instruct-v0.3"
model_kwargs = {
    "temperature": 0.1,
    "top_p": 0.9,
    "max_tokens": 1024,
}

## 5.2 Math problems generation in English and Spanish
We will generate math problems based on a specific subtopic. Instead of using the `convert_response_to_yaml_list` method, we will create and use **a custom parser**.

To do this, we slightly modify the default `MATH_PROBLEM_GENERAL_PROMPT_TEMPLATE` and use the following instead:

```python
    "Create {n_openlines} diverse mathematics problems related to the topic '{topic}' " \
    "or solvable using concepts from '{topic}'. Provide your response as a numbered list, " \
    "and include both the problem description and its solution. Format your response as follows: " \
    "((###1###)) >>>Problem<<<: [Description of the first problem]. >>>Solution<<<: [Solution to the first problem].\n" \
    "((###2###)) >>>Problem<<<: [Description of the second problem]. >>>Solution<<<: [Solution to the second problem].\n" \
    "Only include the problems and their solutions—no additional text."
```
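
The template has two placeholders, `{n_openlines}` and `{topic}`, which are filled in before the prompt is sent to the model. The minimal sketch below shows that substitution on a shortened stand-in template, assuming the placeholders are filled with Python's `str.format`; the shortened text is for illustration only and is not the template used in this notebook.

```python
# Shortened stand-in for the template above (illustration only).
# It keeps the same two placeholders: {n_openlines} and {topic}.
template = (
    "Create {n_openlines} diverse mathematics problems related to the topic '{topic}'. "
    "((###1###)) >>>Problem<<<: [Description]. >>>Solution<<<: [Solution].\n"
)

# Assumption: placeholders are substituted with str.format.
prompt = template.format(n_openlines=3, topic="Algebra")
print(prompt)
```

Because the `((###1###))` markers and the `>>>Problem<<<` and `>>>Solution<<<` tags contain no curly braces, they pass through the substitution unchanged.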


In [3]:
math_prompt_template_english = (
    "Create {n_openlines} diverse mathematics problems related to the topic '{topic}' "
    "or solvable using concepts from '{topic}'. Provide your response as a numbered list, "
    "and include both the problem description and its solution. Format your response as follows: "
    "((###1###)) >>>Problem<<<: [Description of the first problem]. >>>Solution<<<: [Solution to the first problem].\n"
    "((###2###)) >>>Problem<<<: [Description of the second problem]. >>>Solution<<<: [Solution to the second problem].\n"
    "Only include the problems and their solutions—no additional text."
)

n_problems = 3
topic = "Algebra"

llm_response = generator.generate_math_problem(
    model=model,
    topic=topic,
    n_openlines=n_problems,
    prompt_template=math_prompt_template_english,
    model_kwargs=model_kwargs,  # Apply temperature, top_p, and max_tokens
)

print(llm_response[0])

((###1###))
>>>Problem<<<: Find the solutions for the quadratic equation 3x² - 4x - 5 = 0.
>>>Solution<<<: To find the solutions, we need to use the quadratic formula: x = (-b ± √(b² - 4ac)) / (2a). Here, a = 3, b = -4, and c = -5.
Calculate b² - 4ac = (-4)² - 4 * 3 * (-5) = 16 + 30 = 46.
Now, find the two solutions:
x₁ = (-(-4) + √46) / (2 * 3)
x₁ = (4 + √46) / 6
x₂ = (-(-4) - √46) / (2 * 3)
x₂ = (4 - √46) / 6

((###2###))
>>>Problem<<<: Solve for x in the linear equation 2(3x + 5) - 7 = 0.
>>>Solution<<<: First, notice that parentheses are needed to distribute the -7 to each term within the parentheses.
2(3x + 5) - 7 = 0 becomes 6x + 10 - 7 = 0, which simplifies to 6x = 3.
Divide both sides by 6, x = 3/6 = 1/2.

((###3###))
>>>Problem<<<: If a, b, and c are roots of the equation x³ - 6x² + 11x - 6 = 0, find the values of Sa² + Tb² + Cc². Here, S = a + b + c, T = ab + bc + ca, and C = abc.
>>>Solution<<<: First, we need to find the roots a, b, and c of the given cubic equation. We wil

Next, we will use a custom parser to extract a clean list of problem-solution pairs from the response.

In [4]:
import re
from typing import List, Tuple


def parse_math_llm_response(
    llm_response: str, tag_replacements: dict
) -> List[Tuple[str, str]]:
    # Extract the first and second tags from the tag_replacements dictionary
    first_tag, second_tag = list(tag_replacements.keys())[:2]

    # Regular expression to match each problem-solution block using the first and second tags
    pattern = rf"{re.escape(first_tag)}(.*?){re.escape(second_tag)}(.*?)(?=\(\(###|\Z)"

    # Find all matches using re.DOTALL to handle multiline content
    matches = re.findall(pattern, llm_response, re.DOTALL)

    # Clean up and replace tags in extracted problems and solutions
    problem_solution_pairs = []
    for problem, solution in matches:
        # Replace tags with their corresponding replacements
        for tag, replacement in tag_replacements.items():
            problem = problem.replace(tag, replacement).strip()
            solution = solution.replace(tag, replacement).strip()

        problem_solution_pairs.append((problem.strip(), solution.strip()))

    return problem_solution_pairs


def generate_math_problems(
    model: str,
    model_kwargs: dict,
    topic: str,
    n_problems: int,
    math_prompt_template: str,
    tag_replacements: dict,
    n_retries: int = 5,
) -> List[Tuple[str, str]]:
    math_problems = []
    for _ in range(n_retries):
        try:
            llm_response = generator.generate_math_problem(
                model=model,
                topic=topic,
                n_openlines=n_problems,
                prompt_template=math_prompt_template,
                model_kwargs=model_kwargs,  # Previously accepted but never used
            )
            math_problems = parse_math_llm_response(
                llm_response=llm_response[0], tag_replacements=tag_replacements
            )
            break
        except Exception as e:
            print(f"Hit: {e}, Retrying...")

    return math_problems


# We will use these tags to find the problem and solution
tag_replacements_english = {
    ">>>Problem<<<:": "Problem:",
    ">>>Solution<<<:": "Solution:",
}

# Example usage
math_problems_english = generate_math_problems(
    model=model,
    model_kwargs=model_kwargs,
    topic=topic,
    n_problems=n_problems,
    math_prompt_template=math_prompt_template_english,
    tag_replacements=tag_replacements_english,
)

# Print each problem-solution pair
for idx, (problem, solution) in enumerate(math_problems_english):
    print(f"Problem: {problem}")
    print(f"Solution: {solution}\n")

Problem: Find the solutions for the quadratic equation 3x² - 4x - 5 = 0.
Solution: To find the solutions, we need to use the quadratic formula: x = (-b ± √(b² - 4ac)) / (2a). Here, a = 3, b = -4, and c = -5.
Calculate b² - 4ac = (-4)² - 4 * 3 * (-5) = 16 + 30 = 46.
Now, find the two solutions:
x₁ = (-(-4) + √46) / (2 * 3)
x₁ = (4 + √46) / 6
x₂ = (-(-4) - √46) / (2 * 3)
x₂ = (4 - √46) / 6

Problem: Solve for x in the linear equation 2(3x + 5) - 7 = 0.
Solution: First, notice that parentheses are needed to distribute the -7 to each term within the parentheses.
2(3x + 5) - 7 = 0 becomes 6x + 10 - 7 = 0, which simplifies to 6x = 3.
Divide both sides by 6, x = 3/6 = 1/2.

Problem: If a, b, and c are roots of the equation x³ - 6x² + 11x - 6 = 0, find the values of Sa² + Tb² + Cc². Here, S = a + b + c, T = ab + bc + ca, and C = abc.
Solution: First, we need to find the roots a, b, and c of the given cubic equation. We will use subsitution method by assuming a = t - 3 and solve for t.
Replacin
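
The parsing pattern can be tried offline on a hard-coded string, with no API call. The sketch below uses an invented sample response (not model output) with the same regular expression as `parse_math_llm_response`. The third block is deliberately cut off before its solution tag, to show what happens when a response is truncated.

```python
import re

# Invented sample response for illustration; the last block is truncated.
sample_response = (
    "((###1###)) >>>Problem<<<: Solve 2x = 4. >>>Solution<<<: x = 2.\n"
    "((###2###)) >>>Problem<<<: Compute 3 + 5. >>>Solution<<<: 8.\n"
    "((###3###)) >>>Problem<<<: Trunc"
)

problem_tag = ">>>Problem<<<:"
solution_tag = ">>>Solution<<<:"

# Same pattern as in parse_math_llm_response: capture the text after each tag,
# stopping at the next "((###" marker or at the end of the string.
pattern = rf"{re.escape(problem_tag)}(.*?){re.escape(solution_tag)}(.*?)(?=\(\(###|\Z)"

pairs = [
    (problem.strip(), solution.strip())
    for problem, solution in re.findall(pattern, sample_response, re.DOTALL)
]

for problem, solution in pairs:
    print(f"Problem: {problem}")
    print(f"Solution: {solution}\n")
```

Only two pairs are returned: the truncated third block has no solution tag, so the pattern does not match it and it is silently dropped.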

Now it’s time to do the same in Spanish. To accomplish this, we simply need to adjust the prompt as follows:

In [5]:
math_prompt_template_spanish = (
    "Crea {n_openlines} problemas de matemáticas diversos relacionados con el tema '{topic}' "
    "o que se puedan resolver utilizando conceptos de '{topic}'. Proporciona tu respuesta como una lista numerada, "
    "e incluye tanto la descripción del problema como su solución. Formatea tu respuesta de la siguiente manera: "
    "((###1###)) >>>Problema<<<: [Descripción del primer problema]. >>>Solución<<<: [Solución del primer problema].\n"
    "((###2###)) >>>Problema<<<: [Descripción del segundo problema]. >>>Solución<<<: [Solución del segundo problema].\n"
    "Incluye únicamente los problemas y sus soluciones, sin texto adicional. Responde usando sólo el idioma Español."
)

# We will use these tags to find the problem and solution
tag_replacements_spanish = {
    ">>>Problema<<<:": "Problema:",
    ">>>Solución<<<:": "Solución:",
}

math_problems_spanish = generate_math_problems(
    model=model,
    model_kwargs=model_kwargs,
    topic=topic,
    n_problems=n_problems,
    math_prompt_template=math_prompt_template_spanish,
    tag_replacements=tag_replacements_spanish,
)

# Print each problem-solution pair
for idx, (problem, solution) in enumerate(math_problems_spanish):
    print(f"Problema: {problem}")
    print(f"Solución: {solution}\n")

Problema: Encontrar el valor de las variables en una ecuación cuadrática para que la suma de las raíces sea 10 y sea menor a cero.
Solución: Si la suma de las raíces es 10, es decir, `x1 + x2 = 10`, y ambas raíces son menores que cero, entonces la ecuación cuadrática tiene que ser del tipo `x² + 10x + c = 0`, donde `c` es un número positivo menor que 100 para que sus raíces sean menores que cero. Por ejemplo, la ecuación `x² + 10x + 25 = 0` tiene soluciones `x1 = -5` y `x2 = -3`, que cumplen con las condiciones.

Problema: Determinar la fórmula de la parábola que tiene un punto cíclico en `(2, 1)` y una distances entropia de `3` desde el origen.
Solución: La fórmula de la parábola en el sistema cartesiano es de la forma `z = a(x - h)² + k`, donde `(h, k)` es el punto cíclico del parábola y `a` es una constante positiva que determina la distancia entre el origen y el punto cíclico. En nuestro caso, el punto cíclico está en `(2, 1)`, así que si tomamos la distancia sharan hacia el origen

## 5.3 Combining subtopics and math problems generation to create a diverse set of math problems in English and Spanish

Now it's time to bring together everything we've covered so far. We will first generate a list of `n` math subtopics and then create `m` problem-solution pairs for each subtopic.

Let's begin with the English language!

In [7]:
from utils import generate_subtopics


def math_pipeline(
    generator: NemotronGenerator,
    model: str,
    model_kwargs: dict,
    topic: str,
    n_subtopics: int,
    n_problems: int,
    subtopics_prompt_template: str,
    math_prompt_template: str,
    tag_replacements: dict,
    n_retries: int = 5,
) -> List[Tuple[str, str]]:

    # Generate n_subtopics math subtopics
    print(f"Generating {n_subtopics} subtopics for {topic}...")
    subtopics = generate_subtopics(
        generator=generator,
        model=model,
        model_kwargs=model_kwargs,
        macro_topic=topic,
        n_subtopics=n_subtopics,
        prompt_template=subtopics_prompt_template,
    )

    # Print the subtopics
    for subtopic in subtopics:
        print(f"\t{subtopic}")

    # Convert tag_replacements values to list, to use in the next step
    tag_replacements_values = list(tag_replacements.values())

    # Generate n problems per subtopic
    math_problems = []
    for subtopic in subtopics:
        print(f"\nGenerating {n_problems} {subtopic} problem-solution pair(s)...")
        try:
            llm_response = generate_math_problems(
                model=model,
                model_kwargs=model_kwargs,
                topic=subtopic,
                n_problems=n_problems,
                math_prompt_template=math_prompt_template,
                tag_replacements=tag_replacements,
            )

            # Print each problem-solution pair
            for idx, (problem, solution) in enumerate(llm_response):
                print(f"{tag_replacements_values[0]} {problem}")
                print(f"{tag_replacements_values[1]} {solution}\n")

            math_problems.extend(llm_response)
        except Exception as e:
            print(f"Hit: {e}")

    return math_problems


# Define a prompt template for generating subtopics in English.
subtopics_prompt_template_english = (
    "Generate {n_subtopics} topics that cover various aspects of {macro_topic}. "
    "Your response should only be a list of topics, as diverse as possible. Do not include explanations or additional text. For example: "
    "1. Food and drinks. \n2. Technology.\n"
)

topic_english = "Math"
n_subtopics = 3
n_problems = 3

# Run the pipeline for English
math_problems_english = math_pipeline(
    generator=generator,
    model=model,
    model_kwargs=model_kwargs,
    topic=topic_english,
    n_subtopics=n_subtopics,
    n_problems=n_problems,
    subtopics_prompt_template=subtopics_prompt_template_english,
    math_prompt_template=math_prompt_template_english,
    tag_replacements=tag_replacements_english,
)

Generating 3 subtopics for Math...
	Calculus Applications in Physics
	Geometry in Architecture
	Probability and Statistics in Finance

Generating 3 Calculus Applications in Physics problem-solution pair(s)...
Problem: Find the velocity of an object moving with acceleration a(t) = t^2 + 3t, if the initial velocity is 0 at t=0.
Solution: Since the initial velocity is 0, we need to find the definite integral of acceleration from 0 to t, which gives us the velocity at time t.

    v(t) = ∫a(t) dt = ∫(t^2 + 3t) dt = (t^3/3 + 3t^2/2) + C

    Since v(0) = 0, we find the constant of integration, C, to be 0. So, the velocity function becomes v(t) = (t^3/3 + 3t^2/2).

Problem: A spring is stretched with a force F(x) = 5x^3 - 3x^2, where F is in Newtons and x is measured in meters. Determine how far the spring will be stretched when a force of 10 Newton is applied.
Solution: To find the displacement (x) where F(x) = 10 N, we need to solve the equation 5x^3 - 3x^2 - 10 = 0. This is a cubic equati

In [8]:
# Define a prompt template for generating subtopics in Spanish.
subtopics_prompt_template_spanish = (
    "Genera {n_subtopics} temas amplios que abarquen diversos aspectos de {macro_topic}. "
    "Tu respuesta debe ser únicamente una lista de temas. "
    "Cada tema es un elemento de la lista. No incluyas subtemas ni explicaciones ni texto adicional. "
    "Por ejemplo: 1. Comida y bebidas. \n2. Tecnología.\n"
)

# Run the pipeline for Spanish
topic_spanish = "Matemáticas"
math_problems_spanish = math_pipeline(
    generator=generator,
    model=model,
    model_kwargs=model_kwargs,
    topic=topic_spanish,
    n_subtopics=n_subtopics,
    n_problems=n_problems,
    subtopics_prompt_template=subtopics_prompt_template_spanish,
    math_prompt_template=math_prompt_template_spanish,
    tag_replacements=tag_replacements_spanish,
)

Generating 3 subtopics for Matemáticas...
	Álgebra Lineal
	Análisis Matemático
	Geometría y Topología

Generating 3 Álgebra Lineal problem-solution pair(s)...
Problema: Encontrar la ecuación de la recta que pasa por los puntos (1,3) y (4,5).
Solución: Primero encontramos el gradiente de la recta: m = (y2 - y1) / (x2 - x1) = (5 - 3) / (4 - 1) = 2/3. Luego, encontramos el punto en coordenadas (x1, y1) y lo usamos para encontrar el coeficiente a de la ecuación general de una recta: y - y1 = m(x - x1), es decir, a = -m*x1 + y1. Sustituyendo los valores en (1,3), obtendremos a = -(2/3)*1 + 3 = 5/3. Así, la ecuación general de la recta es y = (5/3)x + (3 - (5/3)*1).

Problema: Determinar el valor de la tiene las cuatro matrices A, B, C y D, dado que existen tal que AB - AC = BD.
Solución: La relación dada tiene la siguiente forma: AB - AC = BD. Se podría reescribir esta relación como A(B - C) = BD. Dado que la multiplicación de matrices no es definida sin un orden de multiplicación predeterm

**Exercise:** Using `nvidia/llama-3.1-nemotron-70b-reward`, filter the generated examples to retain only those that meet a certain quality threshold.

In [None]:
# Your code here
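
As a starting point, the filtering step can be written independently of how the scores are obtained. The sketch below is one possible shape, not a reference solution: `filter_by_score`, `dummy_score`, and the threshold value are hypothetical names and values introduced here for illustration. In practice, the scoring function would wrap a call to the reward model.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (problem, solution)


def filter_by_score(
    pairs: List[Pair],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[Pair]:
    """Keep only the pairs whose score is at or above the threshold."""
    return [(p, s) for p, s in pairs if score_fn(p, s) >= threshold]


# Hypothetical stand-in scorer: replace with a function that queries
# the reward model and returns its score for the pair.
def dummy_score(problem: str, solution: str) -> float:
    return float(len(solution))


examples = [("Solve 2x = 4.", "x = 2."), ("Compute 3 + 5.", "")]
kept = filter_by_score(examples, dummy_score, threshold=1.0)
print(kept)
```

Passing the scorer in as a function keeps the threshold logic easy to check with a stand-in before any reward-model calls are made.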

---
<h2 style="color:green;">Congratulations!</h2>

In this notebook, you used NeMo Curator to generate math problems along with their solutions on a specific topic in both English and Spanish.

<img src="./images/DLI_Header.png" style="width: 400px; float: right;">