# Retriever Customization - Synthetic Data Generation

Authors - Aditya Malte, Vinay Raman

## Introduction
Text retriever (embedding) models retrieve the documents relevant to a given query. However, they may perform poorly when the target documents or queries are out-of-domain for the retriever, for example when retrieving medical documents with a retriever trained only on generic text.

In such cases, domain-specific training data must often be created to 'adapt' the retriever, which is usually a time-consuming and costly process.

This notebook demonstrates how Large Language Models (LLMs) can be used to synthetically generate training data, which can then be used to adapt retriever models.

## Overview

This is the first notebook of a two-notebook tutorial. The tutorial covers:
1. `synthetic_data_generation_nemo.ipynb` (this notebook): generating synthetic training data, i.e. training examples containing generated queries and positive chunks. The results are saved to a `.csv` file whose path is printed at the end.
2. `retriever_customization.ipynb`: using the generated training data in the `.csv` file to fine-tune a retriever model.

NOTE: This tutorial is only a demo, so only a small subset of the corpus is used for training data generation. This lets the notebook complete in a reasonable time.

## Setup Instructions

1. This notebook runs in a Docker environment built from the NeMo FW repo. Refer to https://github.com/NVIDIA/NeMo/tree/main for instructions on how to build and run the Docker containers. Ensure that the container you run this notebook in is built from the `main` branch of the `NeMo` repository.
2. Set your HuggingFace token so that the Llama model can be downloaded: `export HUGGINGFACE_TOKEN=<yourtoken>`
3. Before running this notebook, run `setup.sh`, which performs the following steps:
    1. Download the Llama-2-13B Chat model from HuggingFace
    2. Convert it into a .nemo checkpoint
    3. Serve the converted model in a Megatron Server
    
This notebook was tested on a machine with 2x A6000 GPUs and CUDA set up. Please ensure you have adequate GPU memory to avoid CUDA out-of-memory errors.

## Import Libraries

In [None]:
import torch
import numpy as np
import os
import json

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import pandas as pd
from tqdm import tqdm
import re
import datetime
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
from sklearn.model_selection import train_test_split
import requests

In [None]:
from datasets import load_dataset

In [None]:
!pip install rich -q

In [None]:
!pip install openai -q
import openai

In [None]:
!pip install datasets -q

As an example, the original version of this notebook used `nfcorpus` to generate training data; the cells below load the `kmfoda/booksum` dataset instead. You may choose a corpus from your own domain, or supply your own domain data for this process.

In [581]:
from openai import OpenAI

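# Read the API key from the environment instead of hard-coding it in the notebook.
# NVIDIA_API_KEY is an assumed variable name; export it before running this cell.
client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)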
def generate_multiple(texts,do_print=False ):
    port_num = 5555
    headers = {"Content-Type": "application/json"}
    prompts = texts
    data = {
        "sentences": prompts,
        "tokens_to_generate": TOKENS_TO_GENERATE,
        "temperature": TEMPERATURE,
        "add_BOS": False,
        "top_k": TOP_K,
        "top_p": TOP_P,
        "greedy": False,
        "all_probs": False,
        "repetition_penalty": 1.2,
        "min_tokens_to_generate": 50,
        }
    resp = nstruct(json.dumps(data), do_print=do_print)
    return resp
from rich import print as rprint
def pcontent(completion, do_print = True):
    text= (completion.choices[0].message.content)
    if do_print:
        rprint(text)
    return text
def nstruct(prompt, do_print = True):
  completion = client.chat.completions.create(
    model="nvidia/nemotron-4-340b-instruct",
    messages=[{"role":"user","content":prompt}],
    temperature=0.1,
    top_p=0.7,
    max_tokens=500,
    stream=False,
  )
  _ =pcontent(completion, do_print=do_print)
  completion = completion.choices[0].message.content
  return completion
# prompt = "How do I adjust the display so that my picture does not go out of the screen?"
# resp = nstruct(prompt)


In [582]:
import pandas as pd

splits = {'train': 'train.csv', 'validation': 'dev.csv', 'test': 'test.csv'}
df = pd.read_csv("hf://datasets/kmfoda/booksum/" + splits["train"])

In [None]:
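# DOMAIN is still referenced when the save filename is built in "Save the results",
# so it must be defined even though the nfcorpus loading line below stays commented out.
DOMAIN = "BeIR/nfcorpus"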
# corpus = load_dataset(DOMAIN, "corpus")["corpus"]

In [None]:
GENERATIONS_SAVE_DIR = "/workspace/files/data"
GENERATION_MODEL_NAME_OR_PATH = "meta-llama/Llama-2-13b-chat-hf"

In [None]:
DO_SAMPLE=True
TOKENS_TO_GENERATE = 500
TEMPERATURE = 0.1
TOP_P= 0.95
TOP_K= 50

## Synthetic Data Generation from Knowledge Base

In [572]:
def get_prompt(message: str, chat_history: list[tuple[str, str]],
               system_prompt: str) -> str:
    texts = [f'[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n']
    # The first user input is _not_ stripped
    do_strip = False
    for user_input, response in chat_history:
        user_input = user_input.strip() if do_strip else user_input
        do_strip = True
        texts.append(f'{user_input} [/INST] {response.strip()} [INST] ')
    if do_strip:
      print(message)

    message = message.strip() if do_strip else message
    message = "\nDocument 2: \n" + message
    texts.append(f'{message} [/INST]')
    return ''.join(texts)
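
To make the template concrete, the sketch below rebuilds the same Llama-2 chat layout for the case used in this notebook (an empty chat history) and prints the result. `build_prompt` is a hypothetical, simplified stand-in for `get_prompt`, and the passage and system prompt are made-up placeholders.

```python
def build_prompt(message: str, system_prompt: str) -> str:
    """Hypothetical simplified version of get_prompt for an empty chat history."""
    # The system prompt sits inside <<SYS>> tags within the first [INST] block.
    header = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
    # The passage is labelled "Document 2" so the instructions can refer to it.
    body = "\nDocument 2: \n" + message
    return header + body + " [/INST]"


demo_prompt = build_prompt("Chapter 1\nIt was a dark night.", "Generate 2 Scenarios.")
print(demo_prompt)
```

The printed text shows where the system prompt ends and where the passage begins, which helps when checking that a modified prompt still has the intended layout.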

### Chunk Knowledge Base

In [None]:
dff = df[["book_id", "chapter"]]


Chunking breaks large documents into smaller pieces that an LLM can take as input. In this case we chunk the texts into samples of roughly 200 words (the default `chunk_size`), ensuring that sentences are not broken.

In [567]:
def chunk_text(samples, chunk_size=200):

    final_chunks = []
    for idx, (text, title, paragraph_id) in enumerate(samples):
        sentence_list = sent_tokenize(text)
        chunk = []
        chunk_id = 0
        word_count = 0
        for sentence in sentence_list:
            word_tokens = sentence.split()
            word_count += len(word_tokens)
            chunk.append(sentence)

            if(word_count >= chunk_size):
                chunk_text = " ".join(chunk)
                final_chunks.append((chunk_text, title, paragraph_id, chunk_id))
                chunk_id += 1

                chunk = []
                word_count = 0

        if len(chunk) > 0:
            chunk_text = " ".join(chunk)
            # Only include the last chunk if it has a significant number of words,
            # or if the sample itself was a single chunk (chunk_id == 0).
            if chunk_id == 0 or len(chunk_text.split()) > int(0.4 * chunk_size):
                final_chunks.append((chunk_text, title, paragraph_id, chunk_id))

    return final_chunks
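
To see how the chunking rule behaves on a small input, here is a dependency-free sketch of the same logic. It assumes a naive regex sentence splitter in place of `sent_tokenize`, and `chunk_sentences` is a hypothetical helper that handles a single text rather than the `(text, title, paragraph_id)` tuples used above.

```python
import re


def chunk_sentences(text: str, chunk_size: int = 200, min_tail_ratio: float = 0.4) -> list[str]:
    """Group whole sentences into chunks of at least `chunk_size` words."""
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, word_count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        word_count += len(sentence.split())
        if word_count >= chunk_size:
            # Close the chunk only at a sentence boundary.
            chunks.append(" ".join(current))
            current, word_count = [], 0
    if current:
        tail = " ".join(current)
        # Keep the tail only if it is the sole chunk or long enough.
        if not chunks or len(tail.split()) > int(min_tail_ratio * chunk_size):
            chunks.append(tail)
    return chunks


sample = "One two three. Four five six. Seven eight nine. Ten."
demo_chunks = chunk_sentences(sample, chunk_size=5)
print(demo_chunks)
```

With `chunk_size=5`, the first two sentences form one chunk and the remaining four words are kept as a tail because they exceed 40% of the chunk size. A one-word tail would be dropped instead.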

In [None]:
kb = dff

In [None]:
# kb = pd.DataFrame(corpus)
kb["paragraph_id"] = range(len(kb))
kb

As this is just an example notebook, only 200 chunks are sampled from the chunked corpus (see the `kb.sample(n=200)` cell further down), so that the notebook completes in a reasonable time.

In [None]:
kb_chunked = pd.DataFrame(chunk_text(kb[["chapter", "book_id", "paragraph_id"]].itertuples(index=False)), columns=["text", "title", "paragraph_id", "chunk_id"])
kb_chunked.columns = ['chunk_text', 'book_id', 'paragraph_id', 'chunk_id']
# kb_chunked.drop(["book_id"], axis=1, inplace=True)
kb_chunked

In [568]:
# kb = kb.merge(kb_chunked, how="left", on="paragraph_id")
kb = kb_chunked
kb

Unnamed: 0,chunk_text,book_id,paragraph_id,chunk_id,chunk_title_text_concat
0,"\n ""Mine ear is open, and my heart prepared:\...",The Last of the Mohicans.chapters 1-2,0,0,"The Last of the Mohicans.chapters 1-2\n\n ""Mi..."
1,The facilities which nature had there offered ...,The Last of the Mohicans.chapters 1-2,0,1,The Last of the Mohicans.chapters 1-2\nThe fac...
2,"While, in the pursuit of their daring plans of...",The Last of the Mohicans.chapters 1-2,0,2,"The Last of the Mohicans.chapters 1-2\nWhile, ..."
3,It was in this scene of strife and bloodshed t...,The Last of the Mohicans.chapters 1-2,0,3,The Last of the Mohicans.chapters 1-2\nIt was ...
4,[2] A wide frontier had been laid naked by thi...,The Last of the Mohicans.chapters 1-2,0,4,The Last of the Mohicans.chapters 1-2\n[2] A w...
...,...,...,...,...,...
133459,"I must say, I can't help feeling\na sort of co...",The Rise of Silas Lapham.chapter xxvii,6606,28,The Rise of Silas Lapham.chapter xxvii\nI must...
133460,Sewell was intensely interested in the moral s...,The Rise of Silas Lapham.chapter xxvii,6606,29,The Rise of Silas Lapham.chapter xxvii\nSewell...
133461,"He spoke freely of his failure, and with a con...",The Rise of Silas Lapham.chapter xxvii,6606,30,The Rise of Silas Lapham.chapter xxvii\nHe spo...
133462,"And in your own\ncase, as I understand, you do...",The Rise of Silas Lapham.chapter xxvii,6606,31,The Rise of Silas Lapham.chapter xxvii\nAnd in...


In [569]:
# kb["title_text_concat"] = kb["book_id"] + "\n" + kb["chapter"]
kb["chunk_title_text_concat"] = kb["book_id"] + "\n" + kb["chunk_text"]


## Prompt Generation

A prompt provides the LLM with context for generation. You can modify this prompt to suit your domain, for example by replacing the example passage and its scenarios with ones drawn from your own data.
The current prompt is used as an example.
The examples you provide define the kind of outputs the model will generate. In this prompt we ask the LLM to generate two Scenarios, each consisting of an Action and a State.


In [570]:
kb = kb.sample(n=200)
kb

Unnamed: 0,chunk_text,book_id,paragraph_id,chunk_id,chunk_title_text_concat
109343,"Jem will be hung, and will go to his father an...",Mary Barton.chapters 26-30,5256,43,"Mary Barton.chapters 26-30\nJem will be hung, ..."
83416,"I could hear, as well as see, that brandy-face...",Treasure Island.part 4.chapter 17,3789,3,Treasure Island.part 4.chapter 17\nI could hea...
46141,Mysterious backs\nand ends of houses peeped at...,A Tale of Two Cities.book 2.chapter 6,1872,14,A Tale of Two Cities.book 2.chapter 6\nMysteri...
118495,"""Slept, Monsieur! When? where?"" ""You may well ...",Villette.chapters 29-32,5769,50,"Villette.chapters 29-32\n""Slept, Monsieur! Whe..."
88763,"But, to be sure, he\nwould be idle here--or an...",David Copperfield.chapter x,3923,5,"David Copperfield.chapter x\nBut, to be sure, ..."
...,...,...,...,...,...
81711,But although the lists rang with the applauses...,Ivanhoe.chapter 12,3709,13,Ivanhoe.chapter 12\nBut although the lists ran...
111917,But to-night it seemed to Linda there was\nsom...,The Garden Party.chapter 1,5523,54,The Garden Party.chapter 1\nBut to-night it se...
128355,"The result was not becoming, to state the case...",Anne of Green Gables.chapter 27,6270,10,Anne of Green Gables.chapter 27\nThe result wa...
120659,"One could see that, as far as it had gone, her...",The Return of the Native.book 3.chapter 3,5941,16,The Return of the Native.book 3.chapter 3\nOne...


In [571]:
system_prompt = """You are a data designer trying to generate 2 possible Scenarios that could take place after the passage in Document 2.
Each Scenario must contain an Action and the State of the scene post action.
The Action must be an active decision, choice or movement taken by the focal character in the passage, and it should be concise and limited to a phrase.
The State should be a technical description of the scene, setting, or character immediately after the action is completed, focusing on the changes caused by the action and the resulting condition of the relevant elements in the scene.
The State should be 2 to 4 sentences. DO NOT continue the storyline, use a literary tone, or add additional new actions in the State description. Use a neutral, descriptive tone without literary or emotional language.
Each of the 2 Actions must be enclosed within the <a> and </a> tags, and each 2 generated States must be enclosed in <s> and </s> tags, as shown in the example below:"""
example = """Example:
Time passed like molasses through the hourglass—but it did pass. Thirty minutes left before midnight. Fifteen minutes. Beads of sweat accumulated on my brow. Ten. Five. Three.
I got up briefly to stretch my sleeping legs, and right at that moment something erupted from the cabinet next to me, which I could have sworn I had checked.
Olaf jumped out. Olaf, the valiant defender of the stars, had somehow found a way in and he held a butcher knife in his hands. He fell heavily on the bundle I was ostensibly protecting, preternaturally quickly, so that I had no time to react. He stabbed the bundle over and over and over again. I screamed.
Olaf stopped as suddenly as he had started. There was no blood on the knife. The bundle was empty. He turned to me, but I was already gone, frantically pulling out the nails on the board I had used to condemn the door leading to the stairs.
Generated Scenarios:
Scenario 1:
<a>Olaf jumps out of the cabinet.</a>
<s>Olaf is now standing outside the cabinet, holding a butcher knife in his hands. The cabinet door is open, revealing an empty interior space. Olaf's feet are firmly planted on the floor, and his arms are extended, with the knife gripped tightly in his hands. The narrator is positioned nearby, their body frozen in a state of surprise and fear. The bundle remains on the floor, untouched and intact.</s>
Scenario 2:
<a>The narrator pulls out the last nail from the board on the door.</a>
<s>The board is now completely detached from the door frame, with all the nails removed and scattered on the floor around the narrator's feet. The narrator's hand is gripping the tool used to remove the nails, their knuckles white from the tight grip. The door is no longer obstructed by the board, and a small gap is visible between the door and the frame. The stairs leading down are partially visible through the gap, illuminated by a faint light source from below. </s>
"""
system_prompt += example
system_prompt += "Do not use text from Example to generate queries for Document 2. Only Produce 2 Scenario's"
print(system_prompt)

You are a data designer trying to generate 2 possible Scenarios that could take place after the passage in Document 2. 
Each Scenario must contain an Action and the State of the scene post action. 
The Action must be an active decision, choice or movement taken by the focal character in the passage, and it should be concise and limited to a phrase. 
The State should be a technical description of the scene, setting, or character immediately after the action is completed, focusing on the changes caused by the action and the resulting condition of the relevant elements in the scene. 
The State should be 2 to 4 sentences. DO NOT continue the storyline, use a literary tone, or add additional new actions in the State description. Use a neutral, descriptive tone without literary or emotional language.
Each of the 2 Actions must be enclosed within the <a> and </a> tags, and each 2 generated States must be enclosed in <s> and </s> tags, as shown in the example below:Example:
Time passed like mo

In [None]:
# Drop the raw chunk_text column; the prompt is built from chunk_title_text_concat
if "chunk_text" in kb.columns:
  kb = kb.drop(["chunk_text"], axis=1)
kb["prompt"] = kb["chunk_title_text_concat"].apply(get_prompt, system_prompt=system_prompt, chat_history="")


## Synthetic Data Generation

The loop below sends each prompt to the model through `generate_multiple` and collects the responses in `generations`.

In [583]:
import time
from tqdm import tqdm
texts = kb["prompt"].tolist()
len_texts = len(texts)
generations = []

for i in tqdm(range(len_texts)):
    #st_time = time.time()
    resp = generate_multiple([texts[i]], do_print = False)
    # print(resp
    generations.extend([resp])
    # print(f'Completed batch --> {i} out of {len_texts+1}')
    # print('Time for batch = {:.2f} s'.format(time.time() - st_time))


100%|██████████| 200/200 [1:18:38<00:00, 23.59s/it]
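
The run above makes 200 sequential API calls, so a single transient failure would abort the whole loop. One possible mitigation is a small retry wrapper with exponential backoff. The sketch below is an assumption rather than part of the original notebook: `call_with_retries` is a hypothetical helper, and `flaky_generate` is a stand-in used only to demonstrate it.

```python
import time


def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, **kwargs):
    """Call fn, retrying on any exception with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts:
                raise  # Give up after the final attempt.
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            time.sleep(base_delay * 2 ** (attempt - 1))


# Demo with a stand-in function that fails twice before succeeding.
calls = {"count": 0}


def flaky_generate(prompt):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return f"generated for: {prompt}"


result = call_with_retries(flaky_generate, "demo prompt", base_delay=0.01)
print(result, calls["count"])
```

In the generation loop, the call could then be written as `call_with_retries(generate_multiple, [texts[i]], do_print=False)`.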


In [587]:
print(f"Len of Generation: {len(generations)} Len of kb: {len(kb)}")
kb["generated_text"] = generations

Len of Generation: 200 Len of kb: 200


## Parsing generations

Parse the generations to extract the generated Actions and States.

In [588]:
import pandas as pd
import re

def extract_questions_from_generations(kb):
    paragraph_id_action_state = []

    for row in kb.to_dict(orient='records'):
        paragraph_id = row["paragraph_id"]
        chunk_id = row["chunk_id"]
        print(row["generated_text"])

        actions = re.findall(r'<a>(.+?)</a>', row["generated_text"])
        states = re.findall(r'<s>(.+?)</s>', row["generated_text"])
        print(actions)
        print(states)
        print("-"*200)

        # Ensure the lists are of the same length by filling the shorter one with None
        max_length = max(len(actions), len(states))
        actions += [None] * (max_length - len(actions))
        states += [None] * (max_length - len(states))

        # Combine actions and states in the same row
        paragraph_id_action_state += [(paragraph_id, chunk_id, action, state) for action, state in zip(actions, states)]

    df_combined = pd.DataFrame(paragraph_id_action_state, columns=["paragraph_id", "chunk_id", "chunk_action", "chunk_state"])

    return df_combined
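
Note that `.` does not match newlines by default, so the patterns above would miss an Action or State that spans several lines. The sketch below shows a variant that uses `re.DOTALL`. `extract_pairs` is a hypothetical helper, and unlike the function above it drops unpaired items instead of padding them with `None`.

```python
import re

# re.DOTALL lets "." match newlines, so multi-line tag contents are captured.
ACTION_RE = re.compile(r"<a>(.+?)</a>", re.DOTALL)
STATE_RE = re.compile(r"<s>(.+?)</s>", re.DOTALL)


def extract_pairs(generated_text: str) -> list[tuple[str, str]]:
    """Return (action, state) pairs; zip drops any unpaired extras."""
    actions = [a.strip() for a in ACTION_RE.findall(generated_text)]
    states = [s.strip() for s in STATE_RE.findall(generated_text)]
    return list(zip(actions, states))


demo_text = "Scenario 1:\n<a>Job steps outside.</a>\n<s>Job is outside.\nThe door is ajar.</s>"
demo_pairs = extract_pairs(demo_text)
print(demo_pairs)
```

Whether to pad or drop unpaired items depends on how the downstream training code treats missing values.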


In [589]:
extracted_questions = extract_questions_from_generations(kb)
kb = kb.merge(extracted_questions, how="left", left_on=["paragraph_id", "chunk_id"], right_on=["paragraph_id", "chunk_id"])
kb.head()

Scenario 1:
<a>Job steps outside into the pitch darkness.</a>
<s>Job is now standing outside the house, with the door slightly ajar behind him. The night is quiet and still, with no moonlight to illuminate the surroundings. Job's figure is barely visible in the darkness, and he seems to be looking around, as if trying to get his bearings or searching for something in the gloom.</s>

Scenario 2:
<a>The widow goes to bed, leaving Jem's fate in the hands of others.</a>
<s>The widow is now lying in bed, her face pale and drawn with worry and exhaustion. The room is dimly lit by a single candle, casting long shadows on the walls. The bedclothes are rumpled and twisted, indicating her restlessness and inability to sleep. The house is quiet, except for the sound of her labored breathing, as she tries to gather her strength for the challenges of the next day.</s>
['Job steps outside into the pitch darkness.', "The widow goes to bed, leaving Jem's fate in the hands of others."]
["Job is now sta

Unnamed: 0,book_id,paragraph_id,chunk_id,chunk_title_text_concat,prompt,generated_text,chunk_action,chunk_state
0,Mary Barton.chapters 26-30,5256,43,"Mary Barton.chapters 26-30\nJem will be hung, ...",[INST] <<SYS>>\nYou are a data designer trying...,Scenario 1:\n<a>Job steps outside into the pit...,Job steps outside into the pitch darkness.,"Job is now standing outside the house, with th..."
1,Mary Barton.chapters 26-30,5256,43,"Mary Barton.chapters 26-30\nJem will be hung, ...",[INST] <<SYS>>\nYou are a data designer trying...,Scenario 1:\n<a>Job steps outside into the pit...,"The widow goes to bed, leaving Jem's fate in t...","The widow is now lying in bed, her face pale a..."
2,Treasure Island.part 4.chapter 17,3789,3,Treasure Island.part 4.chapter 17\nI could hea...,[INST] <<SYS>>\nYou are a data designer trying...,Scenario 1:\n<a>Trelawney fires the gun at Han...,Trelawney fires the gun at Hands.,"The gun has been fired, and the ball has whist..."
3,Treasure Island.part 4.chapter 17,3789,3,Treasure Island.part 4.chapter 17\nI could hea...,[INST] <<SYS>>\nYou are a data designer trying...,Scenario 1:\n<a>Trelawney fires the gun at Han...,The pirates on the shore begin to row towards ...,The pirates on the shore have tumbled into the...
4,A Tale of Two Cities.book 2.chapter 6,1872,14,A Tale of Two Cities.book 2.chapter 6\nMysteri...,[INST] <<SYS>>\nYou are a data designer trying...,Scenario 1:\n<a>Mr. Darnay asks Doctor Manette...,Mr. Darnay asks Doctor Manette about his visit...,Doctor Manette's expression turns thoughtful a...


In [590]:
qa_pairs = kb[["chunk_action", "chunk_state", "chunk_title_text_concat", "chunk_id", "paragraph_id"]]
qa_pairs.columns = ["action", "state", "positive_chunk", "positive_chunk_id", "paragraph_id"]
qa_pairs

Unnamed: 0,action,state,positive_chunk,positive_chunk_id,paragraph_id
0,Job steps outside into the pitch darkness.,"Job is now standing outside the house, with th...","Mary Barton.chapters 26-30\nJem will be hung, ...",43,5256
1,"The widow goes to bed, leaving Jem's fate in t...","The widow is now lying in bed, her face pale a...","Mary Barton.chapters 26-30\nJem will be hung, ...",43,5256
2,Trelawney fires the gun at Hands.,"The gun has been fired, and the ball has whist...",Treasure Island.part 4.chapter 17\nI could hea...,3,3789
3,The pirates on the shore begin to row towards ...,The pirates on the shore have tumbled into the...,Treasure Island.part 4.chapter 17\nI could hea...,3,3789
4,Mr. Darnay asks Doctor Manette about his visit...,Doctor Manette's expression turns thoughtful a...,A Tale of Two Cities.book 2.chapter 6\nMysteri...,14,1872
...,...,...,...,...,...
395,Anne remains silent when Josie Pye insults her.,"Josie Pye is standing in front of Anne, a smir...",Anne of Green Gables.chapter 27\nThe result wa...,10,6270
396,Clym brings the pot of bones home.,"The pot of bones now sits in Clym's study, ato...",The Return of the Native.book 3.chapter 3\nOne...,16,5941
397,Clym decides to leave the bones at the barrow.,"The barrow remains open, with the hole dug by ...",The Return of the Native.book 3.chapter 3\nOne...,16,5941
398,Cedric offers a ransom for their freedom to th...,"The sewer stands before Cedric, holding the wh...",Ivanhoe.chapters 20-21\nYet how should this be...,25,3713


## Save the results

In [591]:
GENERATIONS_SAVE_FILENAME =  f"qa_pairs_{GENERATION_MODEL_NAME_OR_PATH}_num_questions_{len(qa_pairs)}_{DOMAIN}"
GENERATIONS_SAVE_FILENAME = re.sub(r'\W+', '_', GENERATIONS_SAVE_FILENAME)
GENERATIONS_SAVE_PATH = os.path.join(GENERATIONS_SAVE_DIR, f"{GENERATIONS_SAVE_FILENAME}.csv")
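
The `re.sub(r'\W+', '_', ...)` step replaces every run of non-word characters (such as `/` and `-`) with a single underscore, which makes the name safe to use as a filename. A minimal sketch with the values that appear in this notebook (the count of 400 is taken from the printed output at the end):

```python
import os
import re

# Values copied from this notebook's configuration and printed output.
model_name = "meta-llama/Llama-2-13b-chat-hf"
domain = "BeIR/nfcorpus"
num_questions = 400

raw_name = f"qa_pairs_{model_name}_num_questions_{num_questions}_{domain}"
# Collapse each run of non-word characters into one underscore.
safe_name = re.sub(r"\W+", "_", raw_name)
save_path = os.path.join("/workspace/files/data", f"{safe_name}.csv")
print(save_path)
```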

In [592]:
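# mode='a' appends to any existing file, so re-running this cell adds the rows again.
kb.to_csv("/content/drive/MyDrive/syndata.csv", mode='a', index=None, header=None)
kb.to_csv("./syndata.csv", mode='a', index=None, header=None)
# Note: the data is written to the two syndata.csv paths above; GENERATIONS_SAVE_PATH is only printed.
print(f"Generated QA Pairs saved to {GENERATIONS_SAVE_PATH}")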

Generated QA Pairs saved to /workspace/files/data/qa_pairs_meta_llama_Llama_2_13b_chat_hf_num_questions_400_BeIR_nfcorpus.csv
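
Because the save cell appends to the CSV and appears to omit the header row, the resulting file carries no column names. One way to keep append semantics while still writing the header exactly once is sketched below with the standard `csv` module. `append_rows` is a hypothetical helper, and the demo path and rows are placeholders.

```python
import csv
import os
import tempfile


def append_rows(path, fieldnames, rows):
    """Append dict rows to a CSV, writing the header only for a new or empty file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if write_header:
            writer.writeheader()
        writer.writerows(rows)


# Demo: two separate appends to a temporary file yield one header and two rows.
demo_path = os.path.join(tempfile.mkdtemp(), "syndata_demo.csv")
fields = ["action", "state"]
append_rows(demo_path, fields, [{"action": "a1", "state": "s1"}])
append_rows(demo_path, fields, [{"action": "a2", "state": "s2"}])

with open(demo_path, newline="", encoding="utf-8") as f:
    demo_rows = list(csv.reader(f))
print(demo_rows)
```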
