# Synthetic Data Generation with Large Language Models
## Notebook details
This notebook generates synthetic data with an LLM on a sample NLI dataset.


## Step 1: Install the dependencies in your environment

Install the libraries required to run the Python code in this notebook.

In [None]:
%pip install azure-ai-ml
%pip install azure-identity
%pip install datasets

# Task: NLI Synthetic Data Generation

### Natural Language Inference (NLI)

Synthetic data generation targets cases where the user does not have labeled data, so a teacher LLM is used to create high-quality synthetic labels for the data.

This notebook assumes each record has two required fields, 'premise' and 'hypothesis', and an optional third field, 'label'. The 'label' can be used to compute metrics against the original ground truth. However, the purpose of synthetic data generation is to replace the labels with high-quality labels generated by a large, capable LLM.

Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), is the task of classifying a pair of premise and hypothesis sentences into one of three classes: **contradiction, neutral, and entailment**. For example:

| premise                                           | hypothesis                                             | label         |
|---------------------------------------------------|--------------------------------------------------------|---------------|
| A man inspects the uniform of a figure in some East Asian country. | The man is sleeping.                                   | contradiction |
| An older and younger man smiling.                 | Two men are smiling and laughing at the cats playing on the floor. | neutral       |
| A soccer game with multiple males playing.        | Some men are playing a sport.                          | entailment    |



## Step 2: Consume input dataset

The classes in this cell handle ingesting the input dataset. The dataset can come from any source, such as Hugging Face, a locally hosted file, JSON, or a string. For our NLI example, we have written an `NLIHuggingFaceInputDataset` class that ingests input from Hugging Face datasets.

An example NLI record looks like the following:
```json
{
    "premise": "Aside from the Indigenous population, nearly all Argentines or their ancestors immigrated within the past five centuries.",
    "hypothesis": "Aside from the Indigenous population, some Argentines or their ancestors immigrated within the past five centuries.",
    "label": 0
}
```

Labels 0, 1, 2 correspond to entailment, neutral and contradiction respectively.
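
The label mapping above can be written down as a small lookup table. This is a minimal sketch; the names `ID_TO_LABEL`, `LABEL_TO_ID`, and `label_name` are illustrative and are not part of the notebook's `utils` module.

```python
# Mapping stated above: 0, 1, 2 -> entailment, neutral, contradiction.
ID_TO_LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}
LABEL_TO_ID = {name: idx for idx, name in ID_TO_LABEL.items()}


def label_name(label_id: int) -> str:
    """Translate an integer label into its class name (hypothetical helper)."""
    if label_id not in ID_TO_LABEL:
        raise ValueError(f"Unknown NLI label id: {label_id!r}")
    return ID_TO_LABEL[label_id]


example = {
    "premise": "A soccer game with multiple males playing.",
    "hypothesis": "Some men are playing a sport.",
    "label": 0,
}
print(label_name(example["label"]))  # entailment
```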

In [None]:
from utils import NLIHuggingFaceInputDataset

# Define the train, validation, and test sample sizes here.
train_sample_size = 2
val_sample_size = 2
test_sample_size = 2

# Sample notebook using the dataset: https://huggingface.co/datasets/cestwc/conjnli
dataset_name = "cestwc/conjnli"
input_dataset = NLIHuggingFaceInputDataset()

# Note: train_split_name and test_split_name can vary by dataset. They are passed as arguments to load_hf_dataset.
# If val_split_name is None, the function below splits the train set to create a validation set of the specified size.
train, val, test = input_dataset.load_hf_dataset(
    dataset_name=dataset_name,
    train_sample_size=train_sample_size,
    val_sample_size=val_sample_size,
    test_sample_size=test_sample_size,
    train_split_name="adversarial",
    val_split_name=None,
    test_split_name="dev",
)

print("Len of train data sample is " + str(len(train)))
print("Len of validation data sample is " + str(len(val)))
print("Len of test data sample is " + str(len(test)))

#### Check format of data

In [None]:
train[0]
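
Beyond inspecting a single record by eye, the format check can be automated. The sketch below assumes each record behaves like a dictionary with the 'premise' and 'hypothesis' fields described earlier; `check_record` is a hypothetical helper, not something provided by `utils`.

```python
REQUIRED_FIELDS = ("premise", "hypothesis")


def check_record(record: dict) -> list:
    """Return a list of problems found in a record; an empty list means it looks fine."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        # Each required field must be a non-empty string.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    return problems
```

For example, `check_record(train[0])` should return an empty list if the sample has the expected shape.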

## Step 3: Generate prompt for inference

Next, we convert each record into a chat-style prompt so that the LLM returns its answer in the desired format.

For example, the record from the previous cell
```json
{
    "premise": "Aside from the Indigenous population, nearly all Argentines or their ancestors immigrated within the past five centuries.",
    "hypothesis": "Aside from the Indigenous population, some Argentines or their ancestors immigrated within the past five centuries.",
    "label": 0
}
```
**transforms to**

```json

{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant. Write out in a step by step manner your reasoning about the answer using no more than 80 words. Based on the reasoning, produce the final answer. Your response should be in JSON format without using any backticks. The JSON is a dictionary whose keys are 'reason' and 'answer_choice'."
        },
        {
            "role": "user",
            "content": "Given the following two texts, your task is to determine the logical relationship between them. The first text is the 'premise' and the second text is the 'hypothesis'. The relationship should be labeled as one of the following: 'entailment' if the premise entails the hypothesis, 'contradiction' if the premise contradicts the hypothesis, or 'neutral' if the premise neither entails nor contradicts the hypothesis.\n\nPremise: Aside from the Indigenous population, nearly all Argentines or their ancestors immigrated within the past five centuries.\nHypothesis:Aside from the Indigenous population, some Argentines or their ancestors immigrated within the past five centuries.\n"
        }
    ]
}
```

#### We have abstracted this functionality into a separate class, which you can use as follows.

In [None]:
# An example of what a final NLI prompt looks like
from utils import NLIPromptGenerator

# Set enable_chain_of_thought=True to enable chain-of-thought (CoT) prompting.

nli_prompt_generator = NLIPromptGenerator(enable_chain_of_thought=True)
nli_prompt_generator.generate_prompt(train[0])
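
To make the record-to-prompt transformation in this step concrete, here is a rough sketch of how such a mapping could be written. It is not the implementation of `NLIPromptGenerator`; the function `build_nli_messages` and the two constants are hypothetical, and the prompt text is copied from the example shown in this step.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Write out in a step by step manner your "
    "reasoning about the answer using no more than 80 words. Based on the "
    "reasoning, produce the final answer. Your response should be in JSON "
    "format without using any backticks. The JSON is a dictionary whose keys "
    "are 'reason' and 'answer_choice'."
)

USER_TEMPLATE = (
    "Given the following two texts, your task is to determine the logical "
    "relationship between them. The first text is the 'premise' and the "
    "second text is the 'hypothesis'. The relationship should be labeled as "
    "one of the following: 'entailment' if the premise entails the "
    "hypothesis, 'contradiction' if the premise contradicts the hypothesis, "
    "or 'neutral' if the premise neither entails nor contradicts the "
    "hypothesis.\n\nPremise: {premise}\nHypothesis: {hypothesis}\n"
)


def build_nli_messages(record: dict) -> dict:
    """Turn one NLI record into a chat prompt (illustrative only)."""
    user_content = USER_TEMPLATE.format(
        premise=record["premise"], hypothesis=record["hypothesis"]
    )
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_content},
        ]
    }
```

The 'label' field is deliberately left out of the prompt, since the teacher model is expected to produce the label itself.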

## Step 4: Setup inference with Azure ML endpoints

### First, deploy the teacher model in Azure AI Studio
* Go to Azure AI Studio (ai.azure.com).
* Select the Meta-Llama-3.1-405B-Instruct model from the Model catalog.
* Deploy with the "Pay-as-you-go" option.
* Once the deployment succeeds, you should be assigned an API endpoint and a security key for inference.

The following cell sets up inference against the Llama endpoint deployed in Azure. You can directly use the `AzureInference` class, which handles this.

In [None]:
from utils import AzureInference

# The `url` can be copied from the `Deployments` > `Consume` tab in the `URL Endpoint` field.
url = "https://<CHAT_TEACHER_MODEL_DEPLOYMENT_NAME>.<REGION>.models.ai.azure.com/v1/chat/completions"

# The `key` can be copied from the `Deployments` > `Details` tab in the `Endpoint` > `Key` field.
key = "<API_KEY>"

az_llama_405b_model_inf = AzureInference(url=url, key=key)
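
For readers curious about what a call to such an endpoint might involve, the sketch below builds (but does not send) an HTTP request using only the standard library. It is not the implementation of `AzureInference`. It assumes the endpoint accepts a chat-completions style JSON body and a bearer-token `Authorization` header; the helper name `build_chat_request` and the `max_tokens` and `temperature` defaults are illustrative choices.

```python
import json
import urllib.request


def build_chat_request(url, key, messages, max_tokens=256, temperature=0.0):
    """Build a POST request for a chat-completions style endpoint (assumed format)."""
    payload = {
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the deployment key is passed as a bearer token.
            "Authorization": f"Bearer {key}",
        },
        method="POST",
    )


# Sending the request is left commented out because it needs a live deployment.
# The response shape below is also an assumption.
# with urllib.request.urlopen(build_chat_request(url, key, messages)) as resp:
#     body = json.load(resp)
#     text = body["choices"][0]["message"]["content"]
```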

## Step 5: Build the final dataset with synthetic labels

In the following cell, we use the previously built classes to load the input dataset, build the prompts, call the LLM through the Azure ML endpoint, and write the generated output to a file.

Sample final output:

```json

{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant. Write out in a step by step manner your reasoning about the answer using no more than 80 words. Based on the reasoning, produce the final answer. Your response should be in JSON format without using any backticks. The JSON is a dictionary whose keys are 'reason' and 'answer_choice'."
        },
        {
            "content": "Given the following two texts, your task is to determine the logical relationship between them. The first text is the 'premise' and the second text is the 'hypothesis'. The relationship should be labeled as one of the following: 'entailment' if the premise entails the hypothesis, 'contradiction' if the premise contradicts the hypothesis, or 'neutral' if the premise neither entails nor contradicts the hypothesis.\n\nPremise: None but Jake managed to win their game.\nHypothesis: Jake managed to win their game.",
            "role": "user"
        },
        {
            "role": "assistant",
            "content": "entailment"
        }
    ]
}
```

The answer "entailment" in the sample JSON above is the response generated by the LLM. We wrap it as a message with the "assistant" role.
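
The wrapping step can be sketched as follows. This is an assumption about how it could work, not the code inside `NLISyntheticDatasetBuilder`. It assumes the LLM replies with the JSON dictionary requested by the system prompt (keys 'reason' and 'answer_choice'); `to_training_record` and `VALID_LABELS` are hypothetical names.

```python
import json

VALID_LABELS = {"entailment", "neutral", "contradiction"}


def to_training_record(prompt: dict, raw_response: str) -> dict:
    """Append the model's answer to the prompt as an assistant message (illustrative)."""
    parsed = json.loads(raw_response)
    answer = str(parsed["answer_choice"]).strip().lower()
    if answer not in VALID_LABELS:
        raise ValueError(f"Unexpected answer_choice: {answer!r}")
    # Copy the prompt messages so the original prompt is left unchanged.
    messages = list(prompt["messages"]) + [{"role": "assistant", "content": answer}]
    return {"messages": messages}
```

Rejecting answers outside the three known classes keeps malformed responses out of the final dataset; a real pipeline might retry or skip such records instead.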

##### We have abstracted the above functionality into `NLISyntheticDatasetBuilder`, which builds prompts, calls the Llama endpoint, and then writes the final dataset to your local directory.

In [None]:
from utils import NLISyntheticDatasetBuilder

nli_dataset_builder = NLISyntheticDatasetBuilder(
    nli_prompt_generator, inference_pointer=az_llama_405b_model_inf
)

# Write synthetic training and validation data to local directory.
nli_dataset_builder.build_dataset(train, file_name="train_nli")
nli_dataset_builder.build_dataset(val, file_name="valid_nli")
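
As noted earlier, the original 'label' field can optionally be used to compute metrics against the ground truth. A minimal sketch of one such metric, the fraction of records where the synthetic label matches the original one, is shown below. The function `label_agreement` is hypothetical, and it assumes you have collected the original integer labels and the synthetic label strings into two parallel lists.

```python
ID_TO_LABEL = {0: "entailment", 1: "neutral", 2: "contradiction"}


def label_agreement(gold_ids, synthetic_labels):
    """Fraction of records whose synthetic label matches the original label."""
    if len(gold_ids) != len(synthetic_labels):
        raise ValueError("gold_ids and synthetic_labels must have the same length")
    if not gold_ids:
        return 0.0
    matches = sum(
        ID_TO_LABEL[gold] == synthetic.strip().lower()
        for gold, synthetic in zip(gold_ids, synthetic_labels)
    )
    return matches / len(gold_ids)
```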