# Notebook 1: Synthetic Data Generation for RAG Evaluation
This notebook demonstrates how to use LLMs to generate question-answer pairs from a knowledge dataset.
We will use a dataset of PDF files containing NVIDIA blog posts.


![synthetic_data](imgs/synthetic_data_pipeline.png)

## Step 1: Load the PDF Data

The [LangChain](https://python.langchain.com/docs/get_started/introduction) library provides document loaders that handle several data formats (HTML, PDF, code) from different sources and locations (private S3 buckets, public websites, etc.).

LangChain document loaders provide a `load` method that returns a list of documents, each holding a piece of text (`page_content`) and associated metadata. Learn more about LangChain document loaders [here](https://python.langchain.com/docs/integrations/document_loaders).

In this notebook, we will use the LangChain [`UnstructuredFileLoader`](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) to load a PDF of an NVIDIA blog post.

In [None]:
%%capture
!unzip dataset.zip

In [None]:
# take a PDF sample
pdf_example = 'dataset/RGVsbCBUZWNoIDUvMjMvMjMucGRm.pdf'

# visualize the pdf sample
from IPython.display import IFrame
IFrame(pdf_example, width=900, height=500)

In [None]:
# import the relevant libraries
from langchain.document_loaders import UnstructuredFileLoader

In [None]:
# load the pdf sample
loader = UnstructuredFileLoader(pdf_example)
data = loader.load()

## Step 2: Transform the Data 

The goal of this step is to break large documents into smaller **chunks**.

The LangChain library provides a [variety of document transformers](https://python.langchain.com/docs/integrations/document_transformers/), such as text splitters. In this example, we will use the generic [`RecursiveCharacterTextSplitter`](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter), with the chunk size set to 3,000 characters and the overlap set to 100 characters.
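
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a deliberate simplification: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, which this sketch does not attempt. The function name `chunk_with_overlap` is hypothetical and not part of LangChain.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    # Each new window starts (chunk_size - chunk_overlap) characters later.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, so text that straddles a chunk boundary still appears intact in at least one chunk.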


In [None]:
# import the relevant libraries
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
# instantiate the RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3000, chunk_overlap=100)

# split the loaded pdf sample
all_splits = text_splitter.split_documents(data)

Let's check the number of chunks of the document.

In [None]:
# check the number of chunks
len(all_splits)

Let's check the first chunk of the document.

In [None]:
# check the first chunk
all_splits[0].page_content

## Step 3: Generate Question-Answer Pairs


**Instruction prompt:**
```
Given the previous paragraph, create two very good question answer pairs.
Your output should be in a json format of individual question answer pairs.
Restrict the question to the context information provided.

```


In [None]:
# set the instruction_prompt
instruction_prompt = "Given the previous paragraph, create two very good question answer pairs. Your output should be in a json format of individual question answer pairs. Restrict the question to the context information provided."

# set the context prompt
context = '\n'.join([all_splits[0].page_content, instruction_prompt])

In [None]:
# check the prompt
print(context)
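
The same prompt construction can be applied to every chunk, not just the first one. The sketch below assumes plain strings stand in for the `page_content` of each chunk. `build_prompt` and `chunk_texts` are hypothetical names introduced only for illustration.

```python
def build_prompt(chunk_text: str, instruction: str) -> str:
    """Put the chunk first and the instruction last, separated by a newline."""
    return "\n".join([chunk_text, instruction])


instruction = (
    "Given the previous paragraph, create two very good question answer pairs. "
    "Your output should be in a json format of individual question answer pairs. "
    "Restrict the question to the context information provided."
)

# Stand-in chunk texts; in the notebook these would come from all_splits.
chunk_texts = ["First stand-in chunk of text.", "Second stand-in chunk of text."]

prompts = [build_prompt(text, instruction) for text in chunk_texts]
print(prompts[0])
```

Placing the instruction after the chunk matches its wording ("Given the previous paragraph").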

#### a) AI Playground LLM generator

**NVIDIA AI Playground** on NGC allows developers to experience state-of-the-art LLMs accelerated on NVIDIA DGX Cloud with NVIDIA TensorRT and Triton Inference Server. Developers get **free credits for 10K requests** to any of the available models. To sign up, follow the steps [here](https://github.com/NVIDIA/GenerativeAIExamples/blob/main/docs/rag/aiplayground.md).

We are going to use the `llama3-70b` LLM from the [NVIDIA API catalog](https://build.nvidia.com/meta/llama3-70b) to generate the question-answer pairs.

In [None]:
# import the relevant libraries from langchain
from langchain_nvidia_ai_endpoints import ChatNVIDIA

Let's now use the AI Playground's LangChain connector to generate the question-answer pairs from the previous context prompt (document chunk + instruction prompt). Populate your API key in the cell below.

In [None]:
import os
os.environ['NVIDIA_API_KEY'] = "nvapi-*"

llm = ChatNVIDIA(
    model="meta/llama3-70b-instruct",
    temperature=0.2,
    max_tokens=300
)

In [None]:
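# generate the question-answer pairs from the context prompt
answer = llm.invoke(context)

In [None]:
# check the output (the generated text is in the message's `content` attribute)
print(answer.content)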
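
The model returns free text, so the JSON has to be extracted before it can be stored. The sketch below is a best-effort parser and rests on two assumptions: the model uses the keys `question` and `answer` (the instruction prompt does not fix the key names), and the pairs arrive as a JSON list or as standalone JSON objects. `extract_qa_pairs` is a hypothetical helper, not part of LangChain.

```python
import json
import re

_JSON_START = re.compile(r"[\[{]")


def extract_qa_pairs(raw_text: str) -> list[dict]:
    """Best-effort extraction of question-answer dicts from LLM text output."""
    decoder = json.JSONDecoder()
    pairs = []
    pos = 0
    while True:
        # Jump to the next character that could open a JSON list or object.
        match = _JSON_START.search(raw_text, pos)
        if match is None:
            break
        try:
            value, end = decoder.raw_decode(raw_text, match.start())
        except json.JSONDecodeError:
            # Not valid JSON at this position; keep scanning.
            pos = match.start() + 1
            continue
        pos = end
        items = value if isinstance(value, list) else [value]
        for item in items:
            # Assumed key names: "question" and "answer".
            if isinstance(item, dict) and "question" in item and "answer" in item:
                pairs.append({"question": item["question"], "answer": item["answer"]})
    return pairs


raw = (
    "Here are two pairs:\n"
    '[{"question": "Q1?", "answer": "A1"},\n'
    ' {"question": "Q2?", "answer": "A2"}]\n'
    "Hope this helps!"
)
pairs = extract_qa_pairs(raw)
print(pairs)
```

If the model nests the pairs under another key or uses different key names, the function returns an empty list, which is a useful signal to inspect the raw output.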

# End-to-End Synthetic Data Generation

We ran the steps above on a dataset of 600 PDFs of NVIDIA blog posts and saved the results in the JSON format below, where `gt_context` is the ground truth context and `gt_answer` is the ground truth answer.

```
{
'gt_context': chunk,
'document': filename,
'question': "xxxx",
'gt_answer': "xxxx"
}
```
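
As an illustration of how such records could be assembled, the sketch below builds one record per question-answer pair for a given chunk. It is not the script used to produce `qa_generation.json`; `make_records`, the `question`/`answer` key names of the input pairs, and the file name `example_blog.pdf` are all hypothetical.

```python
import json


def make_records(chunk: str, filename: str, qa_pairs: list[dict]) -> list[dict]:
    """Build one record per question-answer pair generated from a chunk."""
    return [
        {
            "gt_context": chunk,
            "document": filename,
            "question": pair["question"],
            "gt_answer": pair["answer"],
        }
        for pair in qa_pairs
    ]


records = make_records(
    chunk="Stand-in chunk text.",
    filename="example_blog.pdf",  # hypothetical file name
    qa_pairs=[
        {"question": "Q1?", "answer": "A1"},
        {"question": "Q2?", "answer": "A2"},
    ],
)

# json.dumps emits valid JSON with double-quoted keys and strings.
serialized = json.dumps(records, indent=2)
print(serialized)
```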

In [None]:
import json
with open("qa_generation.json") as f:
    dataset = json.load(f)

In [None]:
print(dataset[0])

# Synthetic Data Post-processing 

So far, each record in the generated JSON file contains `gt_context`, `document`, and the `question` and `gt_answer` pair.

In order to evaluate Retrieval Augmented Generation (RAG) systems, we need to add fields for the RAG results (to be populated in the next notebook):
   - `contexts`: Documents retrieved by the retriever
   - `answer`: Generated answer

The new dataset JSON format should be: 

```
{
'gt_context': chunk,
'document': filename,
'question': "xxxxx",
'gt_answer': "xxx xxx xxxx",
'contexts':
'answer':
}
```
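
A minimal sketch of this post-processing step follows. It assumes an empty list and an empty string are acceptable placeholders until the next notebook fills them in. That choice, and the helper name `add_rag_fields`, are assumptions rather than something the dataset prescribes.

```python
def add_rag_fields(records: list[dict]) -> list[dict]:
    """Return copies of the records with placeholder RAG result fields."""
    extended = []
    for record in records:
        new_record = dict(record)  # shallow copy; the input stays untouched
        new_record.setdefault("contexts", [])  # assumed placeholder: no retrieved documents yet
        new_record.setdefault("answer", "")    # assumed placeholder: no generated answer yet
        extended.append(new_record)
    return extended


# Stand-in record in the format shown above.
records = [
    {
        "gt_context": "Stand-in chunk text.",
        "document": "example_blog.pdf",  # hypothetical file name
        "question": "Q1?",
        "gt_answer": "A1",
    }
]

extended = add_rag_fields(records)
print(extended[0])
```

Using `setdefault` keeps any `contexts` or `answer` values that are already present, so the step is safe to run more than once.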