# Retriever Customization - Synthetic Data Generation (Part 1/2)

Authors - Aditya Malte, Dora Li, Vinay Raman

## Introduction
Text retrievers and embedding models play a crucial role in modern information retrieval systems by converting both queries and documents into dense numerical vectors (embeddings) that capture their semantic meaning. This allows the system to find relevant documents by measuring the similarity between a query's embedding and document embeddings in the database.
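
The similarity step described above can be sketched with a few lines of plain Python. This is only an illustration: the three-dimensional vectors and document names are made up, and real embedding models emit vectors with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" invented for this example
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "doc_medical": [0.8, 0.2, 0.1],
    "doc_sports": [0.0, 0.3, 0.9],
}

# Rank documents from most to least similar to the query
ranked = sorted(
    doc_vecs,
    key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
    reverse=True,
)
print(ranked)
```

A retriever does the same ranking at scale, typically with an approximate nearest-neighbor index rather than a full sort.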

The accuracy of these models directly impacts their usefulness. When a retriever has been trained primarily on one type of content (like general web text or news articles) but is asked to retrieve documents from a specialized domain (such as medical literature), its performance can degrade significantly.

This is why many organizations fine-tune domain-specific retrievers for their particular use cases, ensuring more accurate and relevant document retrieval. As with all fine-tuning, high-quality domain-specific data is required and can be generated with LLMs such as [NVIDIA's Nemotron-4-340B-Instruct](https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/) that are specially trained and licensed for synthetic data generation. Other models like Llama-3.1-405B or Mixtral-8x22B-Instruct can also produce good results. 

## Overview 

This two-part tutorial demonstrates how to improve retrieval performance by fine-tuning embedding models using synthetic training data. The process is split across two notebooks:
 
1. `synthetic_data_generation_nemo.ipynb` **(this notebook)**:
    - Use an LLM from build.nvidia.com (or deploy your own using NIM!) to create training examples containing generated queries and positive chunks. By default the notebook will use nfcorpus, but you can easily swap in your own data.
    - Save results to a `.csv` file 


2. `retriever_customization.ipynb`:
    - Implement hard negative mining to find challenging negative examples
    - Use the generated training data in the `.csv` file to fine-tune a retriever model using NeMo Framework
    - Evaluate your fine-tuned embedding model against the original using the BEIR benchmark

NOTE: This tutorial is meant only as a demo, so only a small subset of the corpus is used for training data generation. This keeps the notebook's run time reasonable. A GPU is required to run notebook 2, but not notebook 1 if you use a hosted LLM endpoint.

## Setup Instructions

#### NeMo Framework Docker Container
This notebook runs in a Docker environment built from the NeMo Framework repository. Refer to https://github.com/NVIDIA/NeMo/tree/main for instructions on how to build and run the Docker containers. Ensure that the container you run this notebook in is built from the main branch of the NeMo repository. The current notebooks were tested on NeMo Framework 24.07 on a single-GPU machine (L40S).

From inside the `synthetic-data-retriever-customization` directory, start the container with this command:

`docker run -it --rm --gpus all --ipc=host --network host -v $(pwd):/workspace nvcr.io/nvidia/nemo:24.07`

<br> 

#### NVIDIA AI Endpoints
You'll need access to an **LLM** for generating queries. By default, this notebook uses the [Nemotron-4-340b-Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) API endpoint from [www.build.nvidia.com](https://www.build.nvidia.com). 

**An API key is required.** Get your API key by following the link above to the model page and clicking "Build with this NIM". All new users receive a number of tokens upon registering. Set the `NVIDIA_API_KEY` environment variable to your API key value.
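
Since a missing key only surfaces later as a failed request, it can help to check for it up front. The helper below is a hypothetical sketch and not part of the notebook; its name is invented, and the only thing it takes from the text is the variable name `NVIDIA_API_KEY`.

```python
import os

def get_nvidia_api_key(env=os.environ):
    """Return NVIDIA_API_KEY from the given mapping, or raise a readable error."""
    key = env.get("NVIDIA_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "NVIDIA_API_KEY is not set; export it before starting the notebook."
        )
    return key
```

Calling `get_nvidia_api_key()` in an early cell would fail fast with a clear message instead of a `KeyError` or an authentication error further down.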

Optionally, you can self-host the model using **[NIM (NVIDIA Inference Microservice)](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html)** by following the instructions in the link, then pass in the local URL when creating your LLM client later on. Note that system GPU requirements depend on the model you choose to deploy.

## Import Libraries

In [None]:
import os
import json
import pandas as pd
from collections import OrderedDict
import torch
import math
import numpy as np

import re
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

In [None]:
!pip install ipywidgets
!pip install beir

#### Specify the directory where the final `.csv` with generated QA pairs will be saved

In [None]:
GENERATIONS_MODEL_NAME = "nvidia-nemotron-4-340b-instruct"
GENERATIONS_SAVE_DIR = "/workspace/files/data"

### Download Nfcorpus Dataset

In [None]:
from datasets import load_dataset

As an example, we use the [`nfcorpus`](https://www.cl.uni-heidelberg.de/statnlpgroup/nfcorpus/) public text dataset as the source for synthetic data generation. You can substitute any other existing dataset or, ideally, your own proprietary documents.

In [None]:
DOMAIN = "BeIR/nfcorpus"
corpus = load_dataset(DOMAIN, "corpus")["corpus"]

## Synthetic Data Generation from Knowledge Base

In this section we will: 
1. Break each text sample in the nfcorpus dataset that we downloaded into smaller chunks.
2. Compose an LLM prompt that provides detailed instructions on how to generate queries based on each chunk.
3. Send the prompts to our LLM as an asynchronous batch job.
4. Parse the queries and populate our synthetic dataset with query + positive chunks.

### 1. Chunk Knowledge Base

Chunking breaks large documents into smaller pieces that an LLM can take as input. Here we split each text into chunks of roughly 300 words, closing a chunk only at a sentence boundary so that no sentence is broken.

In [None]:
def chunk_text(samples, chunk_size=300):
    """Split each (text, title, paragraph_id) sample into chunks of roughly chunk_size words."""
    final_chunks = []
    for text, title, paragraph_id in samples:
        sentence_list = sent_tokenize(text)
        chunk = []
        chunk_id = 0
        word_count = 0
        for sentence in sentence_list:
            word_count += len(sentence.split())
            chunk.append(sentence)

            # Close the chunk at the end of the sentence that reaches the word budget
            if word_count >= chunk_size:
                final_chunks.append((" ".join(chunk), title, paragraph_id, chunk_id))
                chunk_id += 1
                chunk = []
                word_count = 0

        if len(chunk) > 0:
            # Keep the trailing chunk only if it is the sample's sole chunk (chunk_id == 0)
            # or it has a significant number of words (more than 40% of chunk_size)
            if chunk_id == 0 or word_count > int(0.4 * chunk_size):
                final_chunks.append((" ".join(chunk), title, paragraph_id, chunk_id))

    return final_chunks
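
To see how the trailing-chunk rule behaves, here is a simplified standalone sketch of the same idea. It is an assumption-laden stand-in, not the notebook's function: `chunk_by_words` is a hypothetical name, it splits sentences with a naive regex instead of NLTK, and it returns only the chunk strings.

```python
import re

def chunk_by_words(text, chunk_size=300, tail_ratio=0.4):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace
    # (an assumption for illustration, not how NLTK tokenizes)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, word_count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        word_count += len(sentence.split())
        if word_count >= chunk_size:
            chunks.append(" ".join(current))
            current, word_count = [], 0
    # Keep the leftover only if it is the sole chunk or it holds
    # more than tail_ratio * chunk_size words
    if current and (not chunks or word_count > int(tail_ratio * chunk_size)):
        chunks.append(" ".join(current))
    return chunks

text = "One two three four five. Six seven eight nine ten. Eleven twelve."
print(chunk_by_words(text, chunk_size=5))
```

With `chunk_size=5`, the two-word tail "Eleven twelve." is dropped because it does not exceed 40% of the budget, while a short document that never reaches the budget is still kept as a single chunk.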

In [None]:
kb = pd.DataFrame(corpus)
kb["paragraph_id"] = range(len(kb)) # assign a paragraph id to keep track of the original source document of each chunk
kb

**Notes:**
- We sample only 100 of the roughly 3,600 documents in the corpus so that the notebook completes in a reasonable time for this tutorial. Feel free to increase the sample size, especially if you are running this with your own data.

- Most of the nfcorpus documents are already very short passages, so each yields only a single chunk.

In [None]:
kb = kb.sample(n=100)

kb_chunked = pd.DataFrame(
    chunk_text(kb[["text", "title", "paragraph_id"]].itertuples(index=False)),
    columns=["chunk_text", "title", "paragraph_id", "chunk_id"],
)
kb_chunked = kb_chunked.drop(columns=["title"])  # title is already present in kb
kb_chunked

In [None]:
kb = kb.merge(kb_chunked, how="left", on="paragraph_id")
kb

In [None]:
kb["chunk_title_text_concat"] = kb["title"] + "\n" + kb["chunk_text"]
kb

### 2. Prompt Generation

A prompt gives the LLM the context it needs for generation. Modify this prompt as appropriate for your specific domain: prompt engineering can greatly affect the quality of the generated queries.

The default prompt in this example is from the NVIDIA documentation/help page. It gives detailed instructions along with examples of the types of queries the model should generate. In this prompt we ask the LLM to generate three unique queries for each chunk.

With NVIDIA AI Endpoints, you can request the queries to be returned as JSON by specifying a JSON schema as follows.

In [None]:
json_schema = {
  "type": "object",
  "properties": {
    "queries": {
      "type": "array",
      "items": {
        "type": "string"
      },
      "minItems": 3,
      "maxItems": 3
    }
  },
  "required": ["queries"]
}
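
In this notebook the schema is passed to the model as text inside the prompt, so a reply may still deviate from it. A small hand-written check like the one below can catch that before parsing. It is a sketch only: `parse_queries` is a hypothetical helper, and it hard-codes the shape described by the schema above rather than using a JSON Schema validator.

```python
import json

def parse_queries(raw_text, expected_count=3):
    """Return the list of queries if raw_text has the expected shape, else None."""
    try:
        data = json.loads(raw_text)
    except (json.JSONDecodeError, TypeError):
        return None
    if not isinstance(data, dict):
        return None
    queries = data.get("queries")
    # The schema asks for exactly expected_count items, all of them strings
    if not isinstance(queries, list) or len(queries) != expected_count:
        return None
    if not all(isinstance(q, str) and q.strip() for q in queries):
        return None
    return queries
```

Rows for which this returns `None` could be regenerated rather than written to the dataset.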

In [None]:
system_prompt = f"""You are a data annotator trying to generate three search queries for Document 2. The generated queries must be answerable from Document 2.
Return the generated responses according to this JSON schema: {str(json_schema)}

Here's an example:
Sample Document: AV Sync 
Use of an AV Receiver with HDMI for video may result in audio lagging behind video.  First try 
using the receiver AV sync settings to calibrate.  If this does not work, use the AV sync slider 
utility in Settings  > Display & sound > Advanced settings > Audio video sync to calibrate for 
any audio delay.  The AV sync slider allows you to advance audio by 1 second (in small 
increments of 10ms) to synchronize the audio and video. 
Note that this tool is effective only when SHIELD is connected to your AV Receiver over HDMI 
(i.e. audio/video over HDMI); it is not meant to be used when a headset is plugged into SHIELD 
Controller/SHIELD Remote or USB audio device or Bluetooth audio device. 
If video lags behind audio (i.e. audio is ahead of video) then use your AV receiver’s settings to 
delay audio.
ADJUST FOR OVERSCAN
For TVs that don't provide their own overscan settings, use this setting to adjust the picture size to fit the screen.
Go to Settings > Device Preferences > Display & Sound > Advanced Settings > Adjust for overscan to resize the picture on your TV or display.  Use the UP and DOWN d-pad buttons on your remote to maximize the picture on your TV.  Make sure the green triangles are completely visible to avoid overscan.

Generated queries in JSON format:
{{
  "queries": [
    "How do I adjust the display so that my picture does not go out of the screen",
    "Why is AV Sync not working when I'm plugging my SHIELD into my bluetooth earphone?",
    "How many seconds can I delay audio by in AV Sync?"
  ]
}}
"""

system_prompt += "Do not use text from Example to generate queries for Document 2."
print(system_prompt)

In [None]:
def get_prompt(message:str, system_prompt: str) -> str:
    build_prompt = system_prompt + "\nDocument 2: " + message
    return build_prompt

In [None]:
kb["prompt"] = kb["chunk_title_text_concat"].apply(get_prompt, system_prompt=system_prompt)
kb

### 3. Synthetic Data Generation

Now we'll use [Nemotron-4-340B-Instruct](https://build.nvidia.com/nvidia/nemotron-4-340b-instruct) from NVIDIA AI Endpoints (www.build.nvidia.com) to generate synthetic data. Make sure a valid API key is stored in the `NVIDIA_API_KEY` environment variable. If you don't have one yet, generate it by following the model link in the setup section above.

The NVIDIA AI endpoint follows the OpenAI API schema, so we'll use the `AsyncOpenAI` client to send many requests to the server asynchronously.

In [None]:
texts = kb["prompt"].tolist()
len(texts)

In [None]:
from openai import AsyncOpenAI
import asyncio
import nest_asyncio

nest_asyncio.apply()

# If you are using a self-hosted NIM or any other API endpoint, modify base_url and other relevant parameters here.
llm_client = AsyncOpenAI(
    base_url = "https://integrate.api.nvidia.com/v1",
    api_key = os.environ["NVIDIA_API_KEY"]
)

In [None]:
async def generate_response(client, prompt):
    """Send one prompt to the LLM; return the reply text, or None if the request fails."""
    try:
        response = await client.chat.completions.create(
            model="nvidia/nemotron-4-340b-instruct", # specify which model to use
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
            top_p=0.7,
            max_tokens=1024
        )

        if hasattr(response, 'choices') and len(response.choices) > 0:
            return response.choices[0].message.content

    except Exception as e:
        # Report the failure, then return None so the retry cell further down can detect it
        print(f"Error occurred: {str(e)}")

    return None
    

async def generate_batch_response(client, all_prompts):
    tasks = [generate_response(client, prompt) for prompt in all_prompts]
    results_list = await asyncio.gather(*tasks)
    return results_list
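
`asyncio.gather` as used above starts every request at once. If the endpoint rate-limits or drops requests under that load, one option is to cap the number of requests in flight with a semaphore. The sketch below illustrates the pattern under that assumption: `gather_with_limit` and `fake_llm_call` are hypothetical names, and the fake call only sleeps instead of contacting an endpoint.

```python
import asyncio

async def gather_with_limit(coroutine_fn, items, max_concurrency=8):
    """Run coroutine_fn(item) for every item with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(item):
        async with semaphore:
            return await coroutine_fn(item)

    # gather preserves input order, so results line up with items
    return await asyncio.gather(*(bounded(item) for item in items))


async def fake_llm_call(prompt):
    # Stand-in for a real request: sleeps briefly, then echoes the prompt in upper case
    await asyncio.sleep(0.01)
    return prompt.upper()
```

In the notebook this could be used as `await gather_with_limit(lambda p: generate_response(llm_client, p), texts)`. The right value for `max_concurrency` depends on the endpoint's limits and is not specified here.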

In [None]:
# test to see that the API endpoint is responding
result = await generate_response(llm_client, texts[0])
print(result)

The output should look like this:
```
{
  "queries": [
    "What is the total number of reported cases of cardiac tamponade resulting from acupuncture, as identified in the systematic review?",
    "How many of the reported cases of cardiac tamponade caused by acupuncture had fatal outcomes, according to the literature review?",
    "What measure does the systematic review suggest to reduce the risk of cardiac tamponade in acupuncture practice?"
  ]
}
```

In [None]:
# this could take a while depending on the number of LLM calls
generations = await generate_batch_response(llm_client, texts)

In [None]:
kb["generated_text"] = generations

In [None]:
# It's possible that some requests get dropped for various reasons. Retry them here.
retry_idx = kb[kb['generated_text'].isna()].index.tolist()
print("Requests to retry: " + str(len(retry_idx)))
for idx in retry_idx:
    kb.loc[idx, 'generated_text'] = await generate_response(llm_client, kb.loc[idx, 'prompt'])
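
The cell above retries each missing generation once. If transient failures are common, a retry helper with exponential backoff is one possible extension. This is a minimal sketch, assuming that a failed request surfaces as `None`; the name `retry_async` and its parameters are hypothetical.

```python
import asyncio

async def retry_async(make_call, max_attempts=3, base_delay=1.0):
    """Await make_call() until it returns a non-None value or attempts run out."""
    for attempt in range(max_attempts):
        result = await make_call()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            await asyncio.sleep(base_delay * (2 ** attempt))
    return None
```

Inside the loop above, the call could become `await retry_async(lambda: generate_response(llm_client, kb.loc[idx, 'prompt']))`.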

### 4. Parsing Generations

We'll do some simple text parsing to extract the generated queries, then store them as individual entries in the dataset.

In [None]:
def extract_queries_from_generations(kb):
    paragraph_id_query = []
    for row in kb.to_dict(orient="records"):
        paragraph_id = row["paragraph_id"]
        chunk_id = row["chunk_id"]
        try:
            queries = json.loads(row["generated_text"])["queries"]
        except (TypeError, KeyError, json.JSONDecodeError):
            # Skip rows whose generation is missing or is not the expected JSON
            print(f"Skipping paragraph {paragraph_id}, chunk {chunk_id}: could not parse generation")
            continue
        print(queries)
        print("-"*200)
        paragraph_id_query += [(paragraph_id, chunk_id, query) for query in queries]
    return pd.DataFrame(paragraph_id_query, columns=["paragraph_id", "chunk_id", "chunk_query"])
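
LLM replies sometimes wrap the JSON in extra prose, in which case `json.loads` on the raw reply fails. One simple workaround is to pull out the span between the first `{` and the last `}` before parsing. The helper below is a hypothetical sketch of that idea, and the sample reply is invented for illustration.

```python
import json
import re

def extract_json_object(text):
    """Parse the span from the first '{' to the last '}' in text, or return None."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

# Invented reply with prose around the JSON payload
reply = 'Sure! Here you go:\n{"queries": ["a", "b", "c"]}\nLet me know if you need more.'
print(extract_json_object(reply))
```

This heuristic assumes the reply contains a single JSON object. A reply with several separate objects would not parse and would return `None`.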

In [None]:
extracted_queries = extract_queries_from_generations(kb)

Example output: 
```
['What is the range of BMAA concentrations found in cyanobacterial blooms in South Florida?', 'Which neurodegenerative diseases have been linked to BMAA exposure?', 'What is the highest BMAA concentration found in resident animals used as human food in South Florida?']
```

In [None]:
kb = kb.merge(extracted_queries, how="left", on=["paragraph_id", "chunk_id"])
kb

In [None]:
qa_pairs = kb[["chunk_query", "chunk_title_text_concat", "chunk_id", "paragraph_id"]]
qa_pairs.columns = ["query", "positive_chunk", "positive_chunk_id", "paragraph_id"]
qa_pairs

#### Save QA Pair Data

In [None]:
GENERATIONS_SAVE_FILENAME =  f"qa_pairs_{GENERATIONS_MODEL_NAME}_num_queries_{len(qa_pairs)}_{DOMAIN}"
GENERATIONS_SAVE_FILENAME = re.sub(r'\W+', '_', GENERATIONS_SAVE_FILENAME)
GENERATIONS_SAVE_PATH = os.path.join(GENERATIONS_SAVE_DIR, f"{GENERATIONS_SAVE_FILENAME}.csv")
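
The `re.sub(r'\W+', '_', ...)` step replaces every run of non-word characters (here the hyphens and the slash) with a single underscore, which keeps the name safe to use as a file name. The snippet below shows the effect on a hypothetical name; the query count of 300 is made up for illustration.

```python
import re

# Hypothetical raw name, built the same way as in the cell above
raw_name = "qa_pairs_nvidia-nemotron-4-340b-instruct_num_queries_300_BeIR/nfcorpus"
safe_name = re.sub(r"\W+", "_", raw_name)
print(safe_name)
# qa_pairs_nvidia_nemotron_4_340b_instruct_num_queries_300_BeIR_nfcorpus
```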

In [None]:
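os.makedirs(GENERATIONS_SAVE_DIR, exist_ok=True)  # create the output directory if it does not exist yet
qa_pairs.to_csv(GENERATIONS_SAVE_PATH, index=False)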
print(f"Generated QA Pairs saved to {GENERATIONS_SAVE_PATH}")

Congratulations, you've now successfully generated a synthetic dataset for fine-tuning a text embedding model! In the next notebook you'll use the `.csv` file you've just generated to fine-tune NV-EmbedQA-V4 using NeMo Framework.