# Synthetic Preference Data Generation Using Meta's Llama 3.1 405B Instruct

This notebook demonstrates how to use [Meta's Llama 3.1 405B Instruct](https://build.nvidia.com/meta/llama3.1-405b-instruct) and [Nemotron-4 340B Reward](https://build.nvidia.com/nvidia/nemotron-4-340b-reward) through [build.nvidia.com](https://build.nvidia.com/explore/discover) to generate synthetic preference data.

The notebook walks through the following pipeline.

![image](./SDG%20Pipeline.png)

The flow will be split into 2 general parts: 

1. **Synthetic Response Generation**: The developer provides a domain-specific input query, and Llama 3.1 405B Instruct is used to generate ~100 questions (with the default parameters). Llama 3.1 405B Instruct is then used to generate 2 responses for each question.
2. **Reward Model as a Judge**: Nemotron-4 340B Reward is used to score the 2 responses for each question. The scored data can then be used for further alignment training via [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner).

## build.nvidia.com API Key Set-up!

In order to access the endpoints through [build.nvidia.com](https://build.nvidia.com/explore/discover), an API key is required. 

A trial API key comes with 1,000 tokens (or 5,000 tokens for corporate email addresses). The example below uses ~4,500 tokens of data, and it can be extended beyond that limit by using local instances of the models.

There are two steps to get a trial API key:

1. Log in (or sign up) through [build.nvidia.com](https://build.nvidia.com/)
2. Click the `Get API Key` button available on the `meta/llama3.1-405b-instruct` page, found [here](https://build.nvidia.com/meta/llama3.1-405b-instruct).



## Part 1: Generate Subtopics, Questions, and Responses with Meta's Llama 3.1 405B Instruct

The first part of the notebook will cover the creation of raw synthetic data from Meta's Llama 3.1 405B Instruct model.

The data generated with this model can be used in accordance with [Meta's Llama 3.1 License]()

### Prompt Templates for Synthetic Data Generation

To generate questions and responses, there are a few prompt templates required:

1. A prompt template to generate subtopics from a user provided topic
2. A prompt template to generate questions for a given subtopic
3. A prompt template to generate responses for a given question

In [22]:
TOPIC_GENERATION_PROMPT_TEMPLATE = """\
Given a topic, generate a list of {n_subtopics} subtopics that are related to the topic.

The topic is: {topic}

The list must be without numbers, and without any description of the subtopics. The subtopics should be separated by a comma. There must be no other text than the list.
"""

In [23]:
QUESTION_PROMPT_TEMPLATE = """\
Given a topic, generate {n_questions} questions that could be asked about that topic. Your response should be in a list format.

The topic is: {sub_topic}

The list must be without numbers. The questions should be separated by a newline character. There must be no other text than the list.
"""

In [24]:
RESPONSE_PROMPT_TEMPLATE = """\
Given a question, generate 2 responses that could be given to that question. Your response should be in a list format.

The question is: {question}

The list must be in the format:

RESPONSE A: Response A text here
RESPONSE B: Response B text here
"""

Defined below are the parameters that control how many data points are generated throughout the notebook.

1. `n_subtopics`: the number of subtopics (default `10`) that Meta's Llama 3.1 405B Instruct generates for the given topic
2. `n_questions`: the number of questions (default `10`) that Llama 3.1 405B Instruct generates for each subtopic

> NOTE: With the default parameters above, there will be 10 subtopics, each with 10 questions, each with 2 (hardcoded) responses, for an estimated total of ~200 rows of data.

In [25]:
n_subtopics = 10
n_questions = 10
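
As a quick sanity check on the estimate above, the expected dataset size follows directly from the two parameters. The sketch below is illustrative only; `n_responses_per_question` is a name introduced here to stand for the 2 responses hardcoded in the response prompt.

```python
# Illustrative size estimate; mirrors the defaults used in this notebook.
n_subtopics = 10
n_questions = 10
n_responses_per_question = 2  # hypothetical name; the value is fixed by RESPONSE_PROMPT_TEMPLATE

# Every subtopic yields n_questions questions, and every question yields 2 responses.
expected_questions = n_subtopics * n_questions
expected_rows = expected_questions * n_responses_per_question

print(f"questions: {expected_questions}, response rows: {expected_rows}")
```

The actual counts may differ slightly if the model returns more or fewer items than requested.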

### Setting OpenAI Client for Synthetic Data Generation

Because the [build.nvidia.com](https://build.nvidia.com/) endpoints are compatible with the OpenAI API, the OpenAI Python library can be used to interact with Meta's Llama 3.1 405B Instruct and Nemotron-4 340B Reward.

To begin, install the [OpenAI Python library](https://github.com/openai/openai-python).

In [26]:
!pip install -qU openai

Provide the NVIDIA API key obtained above in order to ensure access to both models.

In [27]:
import os
import getpass

os.environ["NVIDIA_API_KEY"] = getpass.getpass("Please enter your NVIDIA API key: ")

Using the OpenAI Async client will enable quick and efficient data generation.

It's as easy as pointing the `base_url` parameter to `https://integrate.api.nvidia.com/v1` - and providing the API key.

In [28]:
from openai import AsyncOpenAI

client = AsyncOpenAI(
  base_url = "https://integrate.api.nvidia.com/v1",
  api_key = os.environ["NVIDIA_API_KEY"]
)

### Generating Subtopics

To start things off, subtopics will be generated for the provided topic. 

> NOTE: The parameters of `temperature`, `top_p`, and `max_tokens` can be customized to individual preference.

In [32]:
async def generate_subtopics(client, topic, n_subtopics):
    prompt = TOPIC_GENERATION_PROMPT_TEMPLATE.format(topic=topic, n_subtopics=n_subtopics)
    response = await client.chat.completions.create(
        model="meta/llama-3.1-405b-instruct",
        messages=[
            {"role" : "user",
             "content" : prompt}
        ],
        temperature=0.2,
        top_p=0.7,
        max_tokens=1024,
    )
    return response

The main topic can be defined below - for the example in the notebook, "Machine Learning" will be used.

In [33]:
topic = "Machine Learning"

The cell below calls the Meta's Llama 3.1 405B Instruct endpoint and returns a list of subtopics separated by commas.

In [34]:
responses = await generate_subtopics(client, topic=topic, n_subtopics=n_subtopics)

The output conforms to the expected format below.

> NOTE: It is possible that additional data cleaning, or formatting may be necessary depending on the prompt templates used. Be sure to confirm the format of the generated data at each step.

In [35]:
print(responses.choices[0].message.content)

Supervised Learning, Unsupervised Learning, Reinforcement Learning, Deep Learning, Natural Language Processing, Computer Vision, Predictive Modeling, Clustering, Dimensionality Reduction, Neural Networks


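Because the data is generated as a comma-separated list, Python's `.split(",")` converts the string into a usable list for the following steps. Each item is also stripped of surrounding whitespace, since the model separates items with a comma followed by a space.

In [39]:
subtopic_list = [subtopic.strip() for subtopic in responses.choices[0].message.content.split(",")]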

### Generating Questions from Subtopic List

With a list of subtopics, the next step is to generate `n_questions` questions for each subtopic.

First, there needs to be a function to generate "batches" of questions.

> NOTE: It would also be possible to generate a single question per request, but more care would be needed to confirm there were no duplicate questions in the dataset.

In [40]:
async def generate_questions(client, sub_topic, n_questions):
    prompt = QUESTION_PROMPT_TEMPLATE.format(sub_topic=sub_topic, n_questions=n_questions)
    response = await client.chat.completions.create(
        model="meta/llama-3.1-405b-instruct",
        messages=[
            {"role" : "user",
             "content" : prompt}
        ],
        temperature=0.2,
        top_p=0.7,
        max_tokens=1024,
    )
    return response.choices[0].message.content

This step leverages [`asyncio`](https://docs.python.org/3/library/asyncio.html) from Python's standard library for efficient API calls to [build.nvidia.com](https://build.nvidia.com/).

In [41]:
import asyncio

async def question_generator(client, subtopic_list, n_question):
    tasks = [generate_questions(client, subtopic, n_question) for subtopic in subtopic_list]
    question_list = await asyncio.gather(*tasks)
    return question_list
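
`asyncio.gather` launches every request at once. If the endpoint enforces rate limits (an assumption; check the limits that apply to your key), a semaphore can cap the number of in-flight requests. A minimal sketch, where `bounded_gather`, `max_concurrency`, and the stand-in coroutine `fake_request` are hypothetical names introduced for illustration:

```python
import asyncio

async def bounded_gather(coroutines, max_concurrency=5):
    """Run coroutines concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_with_limit(coro):
        # Wait for a free slot before awaiting the wrapped coroutine
        async with semaphore:
            return await coro

    # gather returns results in the same order as its inputs
    return await asyncio.gather(*(run_with_limit(c) for c in coroutines))

async def fake_request(i):
    # Stand-in for an API call such as generate_questions(...)
    await asyncio.sleep(0.01)
    return f"result {i}"

async def demo():
    return await bounded_gather([fake_request(i) for i in range(10)], max_concurrency=3)

demo_results = asyncio.run(demo())
print(demo_results[:3])
```

Inside `question_generator`, the same idea would amount to replacing `asyncio.gather(*tasks)` with `bounded_gather(tasks)`.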

Because the notebook already runs inside Jupyter's event loop, `nest_asyncio` is needed to let `asyncio.run` start a nested event loop.

In [42]:
import nest_asyncio

nest_asyncio.apply()

question_list = asyncio.run(question_generator(client, subtopic_list, n_questions))

It's time to examine the output of the above process!

In [43]:
question_list

['What is supervised learning and how does it differ from unsupervised learning?\n\nHow does supervised learning work in machine learning algorithms?\n\nWhat are the advantages and disadvantages of using supervised learning in real-world applications?\n\nCan supervised learning be used for regression tasks, or is it limited to classification tasks?\n\nWhat is the role of labeled data in supervised learning, and how is it used to train models?\n\nHow do supervised learning algorithms handle missing or noisy data in the training set?\n\nWhat are some common supervised learning algorithms, and how do they differ from one another?\n\nHow can supervised learning be used for image classification tasks, such as object detection and facial recognition?\n\nWhat are some techniques for evaluating the performance of supervised learning models, and what metrics are commonly used?\n\nCan supervised learning be used in conjunction with other machine learning techniques, such as reinforcement learnin

Each entry in `question_list` is a single string holding all the questions for one subtopic. These strings are now split and collected into a single flat list.

In [44]:
question_list_formatted = []

for question_set in question_list:
    question_list_formatted += question_set.split("\n\n")

In [45]:
len(question_list_formatted)

100
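
The split above assumes every question is separated by a blank line, which depends on how the model formats its answer. A slightly more defensive sketch, assuming one question per line, splits on any line break, drops blank entries, and removes exact duplicates while keeping order. `flatten_questions` is a hypothetical helper, not part of the original notebook:

```python
def flatten_questions(question_sets):
    """Flatten newline-separated question strings into one de-duplicated list."""
    seen = set()
    flattened = []
    for question_set in question_sets:
        for line in question_set.splitlines():
            question = line.strip()
            if not question or question in seen:
                continue  # skip blank lines and exact repeats
            seen.add(question)
            flattened.append(question)
    return flattened

# Made-up sample input: two subtopics, one repeated question
sample_sets = [
    "What is clustering?\n\nHow does k-means work?",
    "How does k-means work?\nWhat is PCA?\n",
]
sample_questions = flatten_questions(sample_sets)
print(sample_questions)
```

This only catches exact duplicates; near-duplicate questions with different wording would need a separate check.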

### Generating Responses from Question List

Using the question list, Meta's Llama 3.1 405B Instruct can be used to generate responses to the questions. 

The first thing needed is a function that will be used to generate the response from [build.nvidia.com](https://build.nvidia.com/)!

In [46]:
async def generate_responses(client, question):
    prompt = RESPONSE_PROMPT_TEMPLATE.format(question=question)
    response = await client.chat.completions.create(
        model="meta/llama-3.1-405b-instruct",
        messages=[
            {"role" : "user",
             "content" : prompt}
        ],
        temperature=0.2,
        top_p=0.7,
        max_tokens=1024,
    )
    return response.choices[0].message.content

Again, the `asyncio` library allows efficient use of the API.

In [47]:
async def response_generator(client, question_list):
    tasks = [generate_responses(client, question) for question in question_list]
    response_list = await asyncio.gather(*tasks)
    return response_list

In [48]:
question_response_list = asyncio.run(response_generator(client, question_list_formatted))

In [49]:
question_response_list[:5]

['Here are two possible responses to the question:\n\nRESPONSE A: Supervised learning is a type of machine learning where the algorithm is trained on labeled data, meaning the data is already tagged with the correct output. The goal of supervised learning is to learn a mapping between input data and the corresponding output labels, so the algorithm can make predictions on new, unseen data. In contrast, unsupervised learning involves training an algorithm on unlabeled data, and the goal is to identify patterns or structure in the data without any prior knowledge of the expected output.\n\nRESPONSE B: Supervised learning is a machine learning approach where the model is trained on a dataset that includes both input data and corresponding target outputs. The model learns to map inputs to outputs based on the labeled examples, and its performance is evaluated on a separate test dataset. Unsupervised learning, on the other hand, involves training a model on a dataset without any labeled out

In order to move to the next stage, a dataset will be created in `.jsonl` format and will store questions with the responses generated.

In [50]:
question_response_pair_list = []
for question, response_set in zip(question_list_formatted, question_response_list):
    question_response_pair_list.append(
        {
            "question" : question, 
            "responses" : {
                "response_a" : {"response" : response_set.split("RESPONSE B:")[0].replace("RESPONSE A:", "").strip().split("\n\n")[-1].strip()},
                "response_b" : {"response" : response_set.split("RESPONSE B:")[-1].split("\n\n")[0].strip()}
            },
        }
    )
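
The chained `split` calls above rely on the model following the `RESPONSE A:` / `RESPONSE B:` format exactly. As an alternative sketch, a regular expression can extract both responses and signal when the format is not matched, so malformed generations can be skipped or regenerated. `parse_response_pair` is a hypothetical helper; unlike the cell above, it keeps every paragraph of each response rather than a single one.

```python
import re

# Capture the text after each marker; DOTALL lets "." span line breaks
RESPONSE_PATTERN = re.compile(
    r"RESPONSE A:\s*(?P<a>.*?)\s*RESPONSE B:\s*(?P<b>.*)",
    re.DOTALL,
)

def parse_response_pair(raw_text):
    """Return (response_a, response_b), or None if the expected markers are missing."""
    match = RESPONSE_PATTERN.search(raw_text)
    if match is None:
        return None
    return match.group("a").strip(), match.group("b").strip()

# Made-up sample in the format requested by the response prompt
sample_output = (
    "Here are two possible responses to the question:\n\n"
    "RESPONSE A: Supervised learning uses labeled data.\n\n"
    "RESPONSE B: Supervised learning maps inputs to known outputs."
)
parsed_pair = parse_response_pair(sample_output)
malformed = parse_response_pair("Sorry, I cannot answer that.")
print(parsed_pair)
print(malformed)
```

Any text the model appends after the second response would end up in response B, so a follow-up check on the parsed text may still be worthwhile.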

The dataset will be written out to a file called `synthetic_data.jsonl` below!

In [51]:
import json

with open('synthetic_data.jsonl', 'w') as f:
    for item in question_response_pair_list:
        f.write(json.dumps(item))
        f.write('\n')
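
To confirm the file was written as expected, it can be read back line by line. A minimal sketch of the round trip; it writes made-up sample rows to a temporary file so it runs on its own, and `write_jsonl` / `read_jsonl` are hypothetical helper names:

```python
import json
import os
import tempfile

def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")

def read_jsonl(path):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Made-up row with the same shape as the entries built above
sample_rows = [
    {
        "question": "What is clustering?",
        "responses": {
            "response_a": {"response": "Grouping similar points."},
            "response_b": {"response": "An unsupervised learning task."},
        },
    }
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    write_jsonl(sample_path, sample_rows)
    reloaded_rows = read_jsonl(sample_path)

print(reloaded_rows == sample_rows)
```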

## Part 2: Using Nemotron-4 340B Reward to Generate a Preference Dataset

With a dataset of questions and response pairs in hand, [Nemotron-4 340B Reward](https://build.nvidia.com/nvidia/nemotron-4-340b-reward), available through [build.nvidia.com](https://build.nvidia.com/), makes it straightforward to generate a preference dataset that is compatible with DPO training, SteerLM reward model training, and RLHF reward model training!

First, an example of how to use the endpoint.

1. You must provide both a user message and an assistant message!
2. It will return a chat-style message with the scores, as well as the scores in the `logprobs` field.

The response package will include scores related to five attributes:

1. Helpfulness: Overall helpfulness of the response to the prompt.
2. Correctness: Inclusion of all pertinent facts without errors.
3. Coherence: Consistency and clarity of expression.
4. Complexity: Intellectual depth required to write the response (i.e. whether the response can be written by anyone with basic language competency or requires deep domain expertise).
5. Verbosity: Amount of detail included in the response, relative to what is asked for in the prompt.

In [52]:
messages = [
    {
        "role" : "user",
        "content" : "Hello!"
    },
    {
        "role": "assistant",
        "content": "Hello! How can I help you today?"
    },
]

In [53]:
response = await client.chat.completions.create(
    model="nvidia/nemotron-4-340b-reward",
    messages=messages,
)

In [54]:
response

ChatCompletion(id='50010ecc-e198-4a14-8a7f-6b2fee9e2c45', choices=[Choice(finish_reason='length', index=0, logprobs=ChoiceLogprobs(content=[ChatCompletionTokenLogprob(token='helpfulness', bytes=None, logprob=4.09375, top_logprobs=[]), ChatCompletionTokenLogprob(token='correctness', bytes=None, logprob=4.03125, top_logprobs=[]), ChatCompletionTokenLogprob(token='coherence', bytes=None, logprob=4.25, top_logprobs=[]), ChatCompletionTokenLogprob(token='complexity', bytes=None, logprob=0.5703125, top_logprobs=[]), ChatCompletionTokenLogprob(token='verbosity', bytes=None, logprob=1.109375, top_logprobs=[])], refusal=None), message=[ChatCompletionMessage(content='helpfulness:4.09375,correctness:4.03125,coherence:4.25,complexity:0.5703125,verbosity:1.109375', refusal=None, role='assistant', function_call=None, tool_calls=None)])], created=None, model=None, object=None, service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=1, prompt_tokens=54, total_tokens=55, com

The `logprobs` can be handled in a similar fashion to message content, as demonstrated below!

In [55]:
response.choices[0].logprobs.content

[ChatCompletionTokenLogprob(token='helpfulness', bytes=None, logprob=4.09375, top_logprobs=[]),
 ChatCompletionTokenLogprob(token='correctness', bytes=None, logprob=4.03125, top_logprobs=[]),
 ChatCompletionTokenLogprob(token='coherence', bytes=None, logprob=4.25, top_logprobs=[]),
 ChatCompletionTokenLogprob(token='complexity', bytes=None, logprob=0.5703125, top_logprobs=[]),
 ChatCompletionTokenLogprob(token='verbosity', bytes=None, logprob=1.109375, top_logprobs=[])]

It's useful to define a simple helper function that can extract the scores to be used in the construction of a dataset.

In [56]:
def get_scores_from_response(openai_response_template):
    logprobs = openai_response_template.choices[0].logprobs.content
    score_dict = {}
    for score in logprobs:
        score_dict[score.token] = score.logprob
    return score_dict

In [57]:
get_scores_from_response(response)

{'helpfulness': 4.09375,
 'correctness': 4.03125,
 'coherence': 4.25,
 'complexity': 0.5703125,
 'verbosity': 1.109375}
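
The same scores also appear in the message content as a comma-separated string of `name:value` pairs, as seen in the output above. As a sketch of an alternative to reading `logprobs`, that string can be parsed directly. `parse_score_string` is a hypothetical helper, and the format is assumed to match the example output shown earlier:

```python
def parse_score_string(content):
    """Parse 'helpfulness:4.09,correctness:4.03,...' into a dict of floats."""
    scores = {}
    for pair in content.split(","):
        name, _, value = pair.partition(":")
        if not value:
            continue  # skip fragments without a 'name:value' shape
        scores[name.strip()] = float(value)
    return scores

# Content string copied from the example response above
sample_content = (
    "helpfulness:4.09375,correctness:4.03125,coherence:4.25,"
    "complexity:0.5703125,verbosity:1.109375"
)
parsed_scores = parse_score_string(sample_content)
print(parsed_scores)
```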

Similar to the synthetic data generation above, using `asyncio` will help provide scores in a time-efficient manner.

In [58]:
async def get_response_and_scores(client, model, question, response_content):
    messages = [
        {
            "role": "user",
            "content": question
        },
        {
            "role": "assistant",
            "content": response_content
        },
    ]
    
    response = await client.chat.completions.create(
        model=model,
        messages=messages,
    )
    
    scores = get_scores_from_response(response)
    return scores

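Copying the data is important to avoid overwriting or modifying the original, though it can be reloaded from the `.jsonl` file. A deep copy is used because the scoring step below updates the nested response dictionaries in place, and a shallow `.copy()` would share those dictionaries with the original list.

In [59]:
import copy

question_response_score_list = copy.deepcopy(question_response_pair_list)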

Scores are calculated efficiently using `asyncio`.

In [60]:
async def process_question_response_pairs(client, model, question_response_score_list):
    tasks = []
    for question_response_pair in question_response_score_list:
        question = question_response_pair["question"]
        
        task_a = get_response_and_scores(client, model, question, question_response_pair["responses"]["response_a"]["response"])
        task_b = get_response_and_scores(client, model, question, question_response_pair["responses"]["response_b"]["response"])
        
        tasks.append((task_a, question_response_pair, "response_a"))
        tasks.append((task_b, question_response_pair, "response_b"))
    
    results = await asyncio.gather(*[task[0] for task in tasks])
    
    # Attach each score dict to the response it was computed for
    for result, (_, question_response_pair, response_key) in zip(results, tasks):
        question_response_pair["responses"][response_key].update(result)

Nothing left to do but fire it off!

In [61]:
await process_question_response_pairs(client, "nvidia/nemotron-4-340b-reward", question_response_score_list)

Quality can be reasonably preserved by dropping any question for which both responses score below `3.0` on the overall metric, in this case helpfulness. A pair is kept as long as at least one of its responses meets the threshold.

In [62]:
threshold = 3.0

Finally, the dataset can be exported in `.jsonl` format for use in [NeMo Aligner](https://github.com/NVIDIA/NeMo-Aligner).

In [63]:
with open(f'synthetic_data_with_scores_filtered-{threshold}.jsonl', 'w') as f:
    for item in question_response_score_list:
        question = item["question"]
        response_a = item["responses"]["response_a"]
        response_b = item["responses"]["response_b"]
        response_a["question"] = question
        response_b["question"] = question
        if response_a["helpfulness"] < threshold and response_b["helpfulness"] < threshold:
            continue
        f.write(json.dumps(response_a))
        f.write('\n')
        f.write(json.dumps(response_b))
        f.write('\n')
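
For DPO-style training, each question's two scored responses are typically turned into one chosen/rejected pair. The sketch below picks the response with the higher helpfulness as the chosen one. The output keys `prompt`, `chosen_response`, and `rejected_response` are assumptions made for illustration, so check the NeMo Aligner documentation for the exact schema it expects. `to_preference_pairs` and the sample scores are hypothetical.

```python
def to_preference_pairs(scored_items, threshold=3.0):
    """Build chosen/rejected records from items with two scored responses."""
    pairs = []
    for item in scored_items:
        response_a = item["responses"]["response_a"]
        response_b = item["responses"]["response_b"]
        # Rank the two responses by helpfulness, best first
        chosen, rejected = sorted(
            (response_a, response_b),
            key=lambda r: r["helpfulness"],
            reverse=True,
        )
        if chosen["helpfulness"] < threshold:
            continue  # neither response is good enough to serve as "chosen"
        if chosen["helpfulness"] == rejected["helpfulness"]:
            continue  # a tie carries no preference signal
        pairs.append({
            "prompt": item["question"],             # assumed key name
            "chosen_response": chosen["response"],  # assumed key name
            "rejected_response": rejected["response"],  # assumed key name
        })
    return pairs

# Made-up scored items with the same shape as question_response_score_list
sample_items = [
    {"question": "What is PCA?",
     "responses": {
         "response_a": {"response": "A dimensionality reduction method.", "helpfulness": 3.8},
         "response_b": {"response": "A thing.", "helpfulness": 1.2}}},
    {"question": "What is k-means?",
     "responses": {
         "response_a": {"response": "Unsure.", "helpfulness": 0.9},
         "response_b": {"response": "No idea.", "helpfulness": 1.1}}},
]
preference_pairs = to_preference_pairs(sample_items)
print(preference_pairs)
```

Ranking on helpfulness alone is a simplification; the other four attributes could be combined into the ranking if they matter for the target use case.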