# Notebook 1: Synthetic Data Generation for RAG Evaluation
This notebook demonstrates how to use LLMs to generate question-answer pairs from a knowledge dataset.
We will use a dataset of PDF files containing NVIDIA blog posts.


![synthetic_data](imgs/synthetic_data_pipeline.png)

## Step 1: Load the PDF Data

The [LangChain](https://python.langchain.com/docs/get_started/introduction) library provides document loaders that handle several data formats (HTML, PDF, code) from different sources and locations (private S3 buckets, public websites, etc.).

LangChain document loaders provide a `load` method and output a piece of text (`page_content`) with associated metadata. Learn more about LangChain document loaders [here](https://python.langchain.com/docs/integrations/document_loaders).

In this notebook, we will use the LangChain [`UnstructuredFileLoader`](https://python.langchain.com/docs/integrations/document_loaders/unstructured_file) to load a sample PDF.

In [None]:
# %%capture
# !unzip dataset.zip
# !pip install -r requirements.txt
# !pip install --upgrade langchain
!pip install docx2txt

In [None]:
# take a pdf sample
pdf_example='../ORAN_kb/O-RAN.SFG.Non-RT-RIC-Security-TR-v01.00.pdf'
DOCS_DIR = "../ORAN_kb/"
# # visualize the pdf sample
# from IPython.display import IFrame
# IFrame(pdf_example, width=900, height=500)

In [None]:
# import the relevant libraries
import os
import pickle
import re

import pandas as pd
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import (
    DirectoryLoader,
    Docx2txtLoader,
    PyPDFLoader,
    TextLoader,
    UnstructuredFileLoader,
    UnstructuredHTMLLoader,
    UnstructuredPDFLoader,
)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

In [None]:
# load the pdf sample
loader = UnstructuredFileLoader(pdf_example)
data = loader.load()

# Load raw documents from the directory, one loader per file type
text_loader_kwargs = {'autodetect_encoding': True}  # let TextLoader detect the encoding of each .txt file
raw_txts = DirectoryLoader(DOCS_DIR, glob="**/*.txt", show_progress=True, loader_cls=TextLoader, loader_kwargs=text_loader_kwargs).load()
raw_htmls = DirectoryLoader(DOCS_DIR, glob="**/*.html", show_progress=True, loader_cls=UnstructuredHTMLLoader).load()
raw_pdfs = DirectoryLoader(DOCS_DIR, glob="**/*.pdf", show_progress=True, loader_cls=UnstructuredPDFLoader).load()
# note: raw_docs holds the .docx files
raw_docs = DirectoryLoader(DOCS_DIR, glob="**/*.docx", show_progress=True, loader_cls=Docx2txtLoader).load()

In [None]:
raw_pdfs[0]

## Step 2: Transform the Data 

The goal of this step is to break large documents into smaller **chunks**.

The LangChain library provides a [variety of document transformers](https://python.langchain.com/docs/integrations/document_transformers/), such as text splitters. In this example, we will use the generic [`RecursiveCharacterTextSplitter`](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter) with a chunk size of 500 and an overlap of 100. Length is measured with the custom `langchain_length_function` defined below, which counts only lowercase ASCII letters, so a chunk can hold more than 500 raw characters.
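
To make the roles of chunk size and overlap concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is a simplified stand-in and not how `RecursiveCharacterTextSplitter` works internally (the real splitter also tries to break on separators such as paragraphs and sentences). The function name `chunk_with_overlap` and the sample string are hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.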


In [None]:
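def remove_line_break(text):
    """Collapse line breaks, runs of dots (e.g. table-of-contents leaders) and repeated spaces."""
    text = text.replace("\n", " ").strip()
    text = re.sub(r"\.\.+", "", text)
    text = re.sub(" +", " ", text)
    return text


def remove_two_points(text):
    """Remove every pair of consecutive dots."""
    return text.replace("..", "")


def remove_two_slashes(text):
    """Remove double underscores (despite the name, this does not touch slashes)."""
    return text.replace("__", "")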

def just_letters(text):
    return re.sub(r"[^a-z]+", "", text).strip()

def remove_non_english_letters(text):
    return re.sub(r"[^\x00-\x7F]+", "", text)


def langchain_length_function(text):
    return len(just_letters(remove_line_break(text)))


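from nltk.tokenize import RegexpTokenizer  # needed by word_count and truncate


def word_count(text):
    """Count word-like tokens (a letter followed by at least one word character)."""
    text = remove_line_break(text)
    # Tokenize directly: just_letters() would delete the spaces between words
    # and collapse the text into a single token.
    tokenizer = RegexpTokenizer(r"[A-Za-z]\w+")
    return len(tokenizer.tokenize(text))


def truncate(text, max_word_count=1530):
    """Keep at most max_word_count word-like tokens, joined by single spaces."""
    tokenizer = RegexpTokenizer(r"[A-Za-z]\w+")
    tokenized_text = tokenizer.tokenize(text)
    return " ".join(tokenized_text[:max_word_count])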


def strip_non_ascii(string):
    ''' Returns the string without non ASCII characters'''
    stripped = (c for c in string if 0 < ord(c) < 127)
    return ''.join(stripped)


def generate_questions(chunk, triton_client):
    """Stream the reply for a chunk, yielding the accumulated text after each response."""
    # Imported here so the cell still runs when tritonclient is not installed.
    from tritonclient.utils import InferenceServerException

    triton_client.predict_streaming(chunk)
    chat_history = ""
    response_streaming_list = []
    while True:
        try:
            response_streaming = triton_client.user_data._completed_requests.get(block=True)
        except Exception:
            triton_client.close_streaming()
            break

        # Stop on a server-side error or when the stream returns None.
        if isinstance(response_streaming, InferenceServerException):
            print("err")
            triton_client.close_streaming()
            break

        if response_streaming is None:
            triton_client.close_streaming()
            break

        response_streaming_list.append(response_streaming)
        chat_history = triton_client.prepare_outputs(response_streaming_list)
        yield chat_history

In [None]:
# import the relevant libraries
from langchain.text_splitter import RecursiveCharacterTextSplitter

In [None]:
# instantiate the RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    length_function=langchain_length_function,
)

# combine the PDF and DOCX documents
all_documents = raw_pdfs + raw_docs

# split all the docs
documents = text_splitter.split_documents(all_documents)

Let's check the number of chunks of the document.

In [None]:
# check the number of chunks
len(documents)

In [None]:
# remove short chunks (fewer than 200 characters)
documents = [item for item in documents if len(item.page_content) >= 200]

# list the source files that still contribute chunks
print(pd.DataFrame([doc.metadata for doc in documents])['source'].unique())

# clean every chunk, including the last one
for doc in documents:
    text = remove_line_break(doc.page_content)  # line breaks, dot runs, repeated spaces
    text = remove_two_points(text)              # leftover ".."
    text = remove_two_slashes(text)             # "__"
    text = remove_two_points(text)              # ".." that the previous step may have exposed
    doc.page_content = text

# show each chunk with its length
[(len(item.page_content), item.page_content) for item in documents]

Let's check the first chunk of the document.

In [None]:
# check the first chunk
documents[0]

## Step 3: Generate Question-Answer Pairs


**Instruction prompt:**
```
Given the previous paragraph, create one very good question answer pair.
Your output should be in a json format of individual question answer pairs.
Restrict the question to the context information provided.

```
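
Models do not always return clean JSON: the reply may wrap the object in extra prose or return a list of pairs. The sketch below shows one defensive way to pull a single question-answer pair out of a raw reply string, using only the standard library. The helper name `parse_qa_pair` and the sample reply are hypothetical, and the `question`/`answer` keys are an assumption based on how the reply is read later in this notebook.

```python
import json
import re


def parse_qa_pair(reply_text):
    """Return a dict with 'question' and 'answer', or None if the reply cannot be parsed."""
    # Grab the first span that looks like a JSON object or array, ignoring text around it.
    match = re.search(r"(\{.*\}|\[.*\])", reply_text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        parsed = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    # Some replies contain a list of pairs; keep the first one.
    if isinstance(parsed, list):
        if not parsed:
            return None
        parsed = parsed[0]
    if not isinstance(parsed, dict) or not {"question", "answer"}.issubset(parsed):
        return None
    return {"question": parsed["question"], "answer": parsed["answer"]}


sample_reply = 'Sure! {"question": "What is chunked?", "answer": "The documents."}'
print(parse_qa_pair(sample_reply))
print(parse_qa_pair("no json here"))
```

Returning `None` instead of raising makes it easy to skip unusable replies when looping over many chunks.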


In [None]:
all_splits = documents
# set the instruction_prompt
instruction_prompt = "Given the previous paragraph, create one very good question answer pair. Your output should be in a json format of individual question answer pairs. Restrict the question to the context information provided."

# set the context prompt
context = '\n'.join([all_splits[0].page_content, instruction_prompt])

In [None]:
# check the prompt
print(context)

#### a) AI Playground LLM generator

**NVIDIA AI Playground** on NGC allows developers to experience state-of-the-art LLMs accelerated on NVIDIA DGX Cloud with NVIDIA TensorRT and Triton Inference Server. Developers get **free credits for 10K requests** to any of the available models. The sign-up process is easy: follow the steps [here](https://github.com/NVIDIA/GenerativeAIExamples/blob/main/docs/rag/aiplayground.md).

We are going to use the AI Playground's `llama2-70B` LLM to generate the question-answer pairs.

Let's now use the AI Playground's LangChain connector to generate the question-answer pair from the previous context prompt (document chunk + instruction prompt). Populate your API key in the cell below.

In [None]:
import json
import os

from langchain_nvidia_ai_endpoints import ChatNVIDIA, NVIDIAEmbeddings

# paste your NVIDIA AI Playground key between the quotes
os.environ['NVIDIA_API_KEY'] = ""
os.environ['NVAPI_KEY'] = ""

llm = ChatNVIDIA(model="playground_llama2_70b")
print(list(llm.available_models))

# generate one question-answer pair for the first chunk
an = llm.invoke(context)
json.loads(an.content)['answer']

In [None]:
import multiprocessing


def llm_ans(args):
    """Generate one question-answer pair for a (count, split) tuple."""
    count, split = args
    context = '\n'.join([split.page_content, instruction_prompt])
    filename = split.metadata['source']
    answer = llm.invoke(context)
    print(count)
    return [context, filename, answer]


print("Number of cores: ", multiprocessing.cpu_count())

# sample every 300th chunk to keep the number of requests small
args = list(enumerate(all_splits[0::300]))
print(len(args))

# the calls are network-bound, so the pool size (12) need not match the core count
with multiprocessing.Pool(12) as pool:
    results = pool.map(llm_ans, args)

In [None]:
import json
context, filename, answer = results[0]
# parse the reply stored in results, not the earlier single-call reply `an`
answer = json.loads(answer.content)
answer['answer']

In [None]:
import json

data = []
for context, filename, answer in results:
    print(answer)
    try:
        answer = json.loads(answer.content)
    except (json.JSONDecodeError, TypeError):
        # skip replies that are not valid JSON
        continue
    # some replies are a list of pairs; keep the first one
    if isinstance(answer, list):
        answer = answer[0]
    data.append({'gt_context': context, 'document': filename, 'question': answer['question'], 'gt_answer': answer['answer']})

with open('syn_data_oran.json', 'w') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

In [None]:
len(data)

# End-to-End Synthetic Data Generation

We ran the above steps on 600 PDFs from the NVIDIA blogs dataset and saved the results in the JSON format below, where `gt_context` is the ground-truth context and `gt_answer` is the ground-truth answer.

```
{
'gt_context': chunk,
'document': filename,
'question': "xxxx",
'gt_answer': "xxxx"
}
```
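
Before relying on the saved file, it can help to confirm that every record has the four fields shown above. The following is a minimal sketch; `REQUIRED_FIELDS` and `find_incomplete_records` are hypothetical names, and the two sample records are made up for illustration.

```python
REQUIRED_FIELDS = ("gt_context", "document", "question", "gt_answer")


def find_incomplete_records(records):
    """Return (index, missing_fields) for every record lacking a required, non-empty field."""
    problems = []
    for index, record in enumerate(records):
        missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
        if missing:
            problems.append((index, missing))
    return problems


sample_records = [
    {"gt_context": "chunk text", "document": "a.pdf", "question": "Q?", "gt_answer": "A."},
    {"gt_context": "chunk text", "document": "b.pdf", "question": "Q?"},
]
print(find_incomplete_records(sample_records))
# [(1, ['gt_answer'])]
```

An empty result means every record is complete; the same function could be applied to the list loaded from the JSON file.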

In [None]:
import json
with open("syn_data_oran.json") as f:
    dataset = json.load(f)

In [None]:
print(dataset[50])

# Synthetic Data Post-processing 

So far, each record in the generated JSON file holds `gt_context`, `document`, and the `question`/`gt_answer` pair.

To evaluate Retrieval Augmented Generation (RAG) systems, we need to add the fields for the RAG results (to be populated in the next notebook):
   - `contexts`: Retrieved documents by the retriever 
   - `answer`: Generated answer

The new dataset JSON format should be: 

```
{
'gt_context': chunk,
'document': filename,
'question': "xxxxx",
'gt_answer': "xxx xxx xxxx",
'contexts':
'answer':
}
```
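
One way to prepare the file for the next notebook is to add the two result fields with empty placeholders. This is a sketch under assumptions: the placeholder values (an empty list for `contexts`, an empty string for `answer`) and the helper name `add_rag_placeholders` are illustrative choices, not something this notebook prescribes.

```python
import copy


def add_rag_placeholders(records):
    """Return copies of the records with empty 'contexts' and 'answer' fields added."""
    extended = []
    for record in records:
        new_record = copy.deepcopy(record)  # leave the original record untouched
        new_record.setdefault("contexts", [])  # retrieved documents, filled in later
        new_record.setdefault("answer", "")    # generated answer, filled in later
        extended.append(new_record)
    return extended


sample = [{"gt_context": "chunk", "document": "a.pdf", "question": "Q?", "gt_answer": "A."}]
extended = add_rag_placeholders(sample)
print(extended[0])
```

Using `setdefault` keeps any values that are already present, so the helper can be re-run safely on a partially populated dataset.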