# Retriever Customization - Synthetic Data Generation

Authors - Aditya Malte, Vinay Raman

## Introduction
Text retriever (embedding) models retrieve the documents most relevant to a given query. However, they may perform poorly when the target documents or queries are out-of-domain for the retriever, for example when retrieving medical documents with a retriever trained only on generic text.

In such cases, domain-specific training data must often be created to adapt the retriever, which is typically a time-consuming and costly process.

This notebook demonstrates how Large Language Models (LLMs) can be used to synthetically generate training data, which can then be used to adapt retriever models.

## Overview

This is the first notebook in a two-notebook tutorial. The tutorial covers:
1. `synthetic_data_generation_nemo.ipynb` (this notebook): generating synthetic training data, i.e. training examples consisting of generated queries and their positive chunks. The results are saved to a `.csv` file whose path is printed at the end.
2. `retriever_customization.ipynb`: using the generated training data in the `.csv` file to fine-tune a retriever model.

NOTE: This tutorial is only meant as a demo, so only a small subset of the corpus is used for training data generation, which lets the notebook run complete in a reasonable time.

## Setup Instructions

1. This notebook runs in a Docker environment built from the NeMo FW repo. Refer to https://github.com/NVIDIA/NeMo/tree/main for instructions on how to build and run the Docker containers. Ensure that the Docker container you run this notebook in is built from the `main` branch of the `NeMo` repository.
2. Set your HuggingFace token, which is needed to download the Llama model, using `export HUGGINGFACE_TOKEN=<yourtoken>`.
3. Before running this notebook, run `setup.sh`, which performs the following steps:
    1. Download the Llama-2-13B Chat model from HuggingFace
    2. Convert it into a .nemo checkpoint
    3. Serve the converted model in a Megatron Server
    
This notebook was tested on a setup comprising 2xA6000 GPUs with CUDA setup. Please ensure you have adequate GPU memory to avoid CUDA out-of-memory issues.
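
Before running the generation cells, it can help to confirm that the token is set and that the server started by `setup.sh` is accepting connections. The following is a minimal sketch, assuming the server listens on `localhost:5555` (the port used by the generation cell later in this notebook). The helper names `is_port_open` and `check_environment` are hypothetical and not part of NeMo.

```python
import os
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_environment(host: str = "localhost", port: int = 5555) -> dict:
    """Collect simple readiness flags (port 5555 is an assumption taken from the generation cell)."""
    return {
        "hf_token_set": bool(os.environ.get("HUGGINGFACE_TOKEN")),
        "server_reachable": is_port_open(host, port),
    }


print(check_environment())
```

If `server_reachable` is `False`, the generation requests later in the notebook will fail with a connection error, so it is worth re-checking the `setup.sh` logs first.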

## Import Libraries

In [1]:
import torch
import numpy as np
import os
import json

In [2]:
import pandas as pd
from tqdm import tqdm
import re
import datetime
from nltk.tokenize import sent_tokenize
import nltk
nltk.download('punkt')
from sklearn.model_selection import train_test_split
import requests

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
from datasets import load_dataset

As an example, we use the `nfcorpus` dataset to generate training data. You may choose a corpus from your own domain, or supply your own domain data for this process.

In [4]:
DOMAIN = "BeIR/nfcorpus"
corpus = load_dataset(DOMAIN, "corpus")["corpus"]

In [5]:
GENERATIONS_SAVE_DIR = "/workspace/files/data"
GENERATION_MODEL_NAME_OR_PATH = "meta-llama/Llama-2-13b-chat-hf"

In [6]:
DO_SAMPLE=True
TOKENS_TO_GENERATE = 384
TEMPERATURE = 0.2
TOP_P= 0.95
TOP_K= 50

## Synthetic Data Generation from Knowledge Base

### Chunk Knowledge Base

Chunking breaks large documents into smaller chunks that an LLM can take as input. In this case, we chunk the texts into samples of roughly 300 words, ensuring that sentences are not broken.

In [7]:
def chunk_text(samples, chunk_size=300):
    """Split each (text, title, paragraph_id) sample into chunks of roughly `chunk_size` words without breaking sentences."""
    final_chunks = []
    for text, title, paragraph_id in samples:
        sentence_list = sent_tokenize(text)
        chunk = []
        chunk_id = 0
        word_count = 0
        for sentence in sentence_list:
            word_count += len(sentence.split())
            chunk.append(sentence)

            # Close the current chunk once it reaches the target word count
            if word_count >= chunk_size:
                final_chunks.append((" ".join(chunk), title, paragraph_id, chunk_id))
                chunk_id += 1
                chunk = []
                word_count = 0

        if len(chunk) > 0:
            last_chunk = " ".join(chunk)
            # Only include the last chunk if it has a significant number of words,
            # or if the sample itself was a single chunk (chunk_id == 0)
            if chunk_id == 0 or len(last_chunk.split()) > int(0.4 * chunk_size):
                final_chunks.append((last_chunk, title, paragraph_id, chunk_id))

    return final_chunks
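
To make the trailing-chunk rule concrete, here is a small self-contained sketch of the same idea on a toy text. It is an illustration only: `toy_chunks` is a hypothetical name, and it uses a naive regex sentence splitter as a stand-in for `sent_tokenize` so that it runs without NLTK data.

```python
import re


def toy_chunks(text: str, chunk_size: int, min_tail_ratio: float = 0.4) -> list[str]:
    """Word-count chunking on sentence boundaries, with the same tail rule as chunk_text."""
    # Naive splitter used only for this illustration (the notebook uses nltk's sent_tokenize).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, word_count = [], [], 0
    for sentence in sentences:
        current.append(sentence)
        word_count += len(sentence.split())
        if word_count >= chunk_size:
            chunks.append(" ".join(current))
            current, word_count = [], 0
    if current:
        tail = " ".join(current)
        # Keep the tail if it is the only chunk or is longer than 40% of chunk_size.
        if not chunks or len(tail.split()) > int(min_tail_ratio * chunk_size):
            chunks.append(tail)
    return chunks


text = "One two three four five. Six seven eight nine ten. Eleven twelve."
print(toy_chunks(text, chunk_size=8))  # the two-word tail is dropped
print(toy_chunks(text, chunk_size=4))  # the two-word tail is kept
```

With `chunk_size=8` the first two sentences form one chunk and the two-word tail is discarded, because it does not exceed 40% of the chunk size. With `chunk_size=4` each sentence becomes its own chunk and the tail is kept. Note that a discarded tail means a small part of the document is never used for query generation.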

In [8]:
kb = pd.DataFrame(corpus)
kb["paragraph_id"] = range(len(kb))
kb

Unnamed: 0,_id,title,text,paragraph_id
0,MED-10,Statin Use and Breast Cancer Survival: A Natio...,"Recent studies have suggested that statins, an...",0
1,MED-14,Statin use after diagnosis of breast cancer an...,BACKGROUND: Preclinical studies have shown tha...,1
2,MED-118,Alkylphenols in human milk and their relations...,The aims of this study were to determine the c...,2
3,MED-301,Methylmercury: A Potential Environmental Risk ...,Epilepsy or seizure disorder is one of the mos...,3
4,MED-306,Sensitivity of Continuous Performance Test (CP...,Hit Reaction Time latencies (HRT) in the Conti...,4
...,...,...,...,...
3628,MED-917,Effect of freezing and storage on the phenolic...,Scottish-grown red raspberries are a rich sour...,3628
3629,MED-941,Topical vitamin A treatment of recalcitrant co...,BACKGROUND: Common warts (verruca vulgaris) ar...,3629
3630,MED-942,Esophageal injury by apple cider vinegar table...,Apple cider vinegar products are advertised in...,3630
3631,MED-952,Cannabis and the lung.,The use of cannabis is embedded within many so...,3631


As this is just an example notebook, we sample only 100 of the roughly 3,600 documents in the corpus so that the notebook completes in a reasonable time.

In [9]:
kb = kb.sample(n=100)
kb

Unnamed: 0,_id,title,text,paragraph_id
2386,MED-3931,Pilot dietary study with normoproteic protein-...,Although a plant-based diet can provide some b...,2386
2546,MED-4147,Occurrence of steroid hormones and antibiotics...,Wastewater impoundments at concentrated animal...,2546
1296,MED-2335,Xenohormesis: health benefits from an eon of p...,Xenohormesis is a biological principle that ex...,1296
3369,MED-5120,L-theanine intervention enhances human gammade...,Human gammadeltaT lymphocytes are a subset of ...,3369
202,MED-1002,"Polybrominated diphenyl ethers (PBDEs), hydrox...",Prenatal exposure to polybrominated diphenyl e...,202
...,...,...,...,...
1829,MED-3179,Cognitive impairment and dementia in neurocyst...,OBJECTIVES: Neurocysticercosis (NCYST) is the ...,1829
2787,MED-4462,Sulforaphane protects human chondrocytes again...,Chondrocyte cell death can contribute to carti...,2787
2450,MED-4007,Study of the ketogenic agent AC-1202 in mild t...,Background Alzheimer's disease (AD) is charact...,2450
2862,MED-4550,An impact of the diet on serum fatty acid and ...,BACKGROUND/OBJECTIVES: Vegetarian diet has bec...,2862


In [10]:
kb_chunked = pd.DataFrame(
    chunk_text(kb[["text", "title", "paragraph_id"]].itertuples(index=False)),
    columns=["chunk_text", "title", "paragraph_id", "chunk_id"],
)
kb_chunked.drop(["title"], axis=1, inplace=True)
kb_chunked

Unnamed: 0,chunk_text,paragraph_id,chunk_id
0,Although a plant-based diet can provide some b...,2386,0
1,Wastewater impoundments at concentrated animal...,2546,0
2,Xenohormesis is a biological principle that ex...,1296,0
3,Human gammadeltaT lymphocytes are a subset of ...,3369,0
4,Prenatal exposure to polybrominated diphenyl e...,202,0
...,...,...,...
96,Chondrocyte cell death can contribute to carti...,2787,0
97,Background Alzheimer's disease (AD) is charact...,2450,0
98,"In the per protocol population, E4(-) particip...",2450,1
99,BACKGROUND/OBJECTIVES: Vegetarian diet has bec...,2862,0


In [11]:
kb = kb.merge(kb_chunked, how="left", on="paragraph_id")
kb

Unnamed: 0,_id,title,text,paragraph_id,chunk_text,chunk_id
0,MED-3931,Pilot dietary study with normoproteic protein-...,Although a plant-based diet can provide some b...,2386,Although a plant-based diet can provide some b...,0
1,MED-4147,Occurrence of steroid hormones and antibiotics...,Wastewater impoundments at concentrated animal...,2546,Wastewater impoundments at concentrated animal...,0
2,MED-2335,Xenohormesis: health benefits from an eon of p...,Xenohormesis is a biological principle that ex...,1296,Xenohormesis is a biological principle that ex...,0
3,MED-5120,L-theanine intervention enhances human gammade...,Human gammadeltaT lymphocytes are a subset of ...,3369,Human gammadeltaT lymphocytes are a subset of ...,0
4,MED-1002,"Polybrominated diphenyl ethers (PBDEs), hydrox...",Prenatal exposure to polybrominated diphenyl e...,202,Prenatal exposure to polybrominated diphenyl e...,0
...,...,...,...,...,...,...
96,MED-4462,Sulforaphane protects human chondrocytes again...,Chondrocyte cell death can contribute to carti...,2787,Chondrocyte cell death can contribute to carti...,0
97,MED-4007,Study of the ketogenic agent AC-1202 in mild t...,Background Alzheimer's disease (AD) is charact...,2450,Background Alzheimer's disease (AD) is charact...,0
98,MED-4007,Study of the ketogenic agent AC-1202 in mild t...,Background Alzheimer's disease (AD) is charact...,2450,"In the per protocol population, E4(-) particip...",1
99,MED-4550,An impact of the diet on serum fatty acid and ...,BACKGROUND/OBJECTIVES: Vegetarian diet has bec...,2862,BACKGROUND/OBJECTIVES: Vegetarian diet has bec...,0


In [12]:
kb["title_text_concat"] = kb["title"] + "\n" + kb["text"]
kb["chunk_title_text_concat"] = kb["title"] + "\n" + kb["chunk_text"]
kb

Unnamed: 0,_id,title,text,paragraph_id,chunk_text,chunk_id,title_text_concat,chunk_title_text_concat
0,MED-3931,Pilot dietary study with normoproteic protein-...,Although a plant-based diet can provide some b...,2386,Although a plant-based diet can provide some b...,0,Pilot dietary study with normoproteic protein-...,Pilot dietary study with normoproteic protein-...
1,MED-4147,Occurrence of steroid hormones and antibiotics...,Wastewater impoundments at concentrated animal...,2546,Wastewater impoundments at concentrated animal...,0,Occurrence of steroid hormones and antibiotics...,Occurrence of steroid hormones and antibiotics...
2,MED-2335,Xenohormesis: health benefits from an eon of p...,Xenohormesis is a biological principle that ex...,1296,Xenohormesis is a biological principle that ex...,0,Xenohormesis: health benefits from an eon of p...,Xenohormesis: health benefits from an eon of p...
3,MED-5120,L-theanine intervention enhances human gammade...,Human gammadeltaT lymphocytes are a subset of ...,3369,Human gammadeltaT lymphocytes are a subset of ...,0,L-theanine intervention enhances human gammade...,L-theanine intervention enhances human gammade...
4,MED-1002,"Polybrominated diphenyl ethers (PBDEs), hydrox...",Prenatal exposure to polybrominated diphenyl e...,202,Prenatal exposure to polybrominated diphenyl e...,0,"Polybrominated diphenyl ethers (PBDEs), hydrox...","Polybrominated diphenyl ethers (PBDEs), hydrox..."
...,...,...,...,...,...,...,...,...
96,MED-4462,Sulforaphane protects human chondrocytes again...,Chondrocyte cell death can contribute to carti...,2787,Chondrocyte cell death can contribute to carti...,0,Sulforaphane protects human chondrocytes again...,Sulforaphane protects human chondrocytes again...
97,MED-4007,Study of the ketogenic agent AC-1202 in mild t...,Background Alzheimer's disease (AD) is charact...,2450,Background Alzheimer's disease (AD) is charact...,0,Study of the ketogenic agent AC-1202 in mild t...,Study of the ketogenic agent AC-1202 in mild t...
98,MED-4007,Study of the ketogenic agent AC-1202 in mild t...,Background Alzheimer's disease (AD) is charact...,2450,"In the per protocol population, E4(-) particip...",1,Study of the ketogenic agent AC-1202 in mild t...,Study of the ketogenic agent AC-1202 in mild t...
99,MED-4550,An impact of the diet on serum fatty acid and ...,BACKGROUND/OBJECTIVES: Vegetarian diet has bec...,2862,BACKGROUND/OBJECTIVES: Vegetarian diet has bec...,0,An impact of the diet on serum fatty acid and ...,An impact of the diet on serum fatty acid and ...


## Prompt Generation

A prompt provides the LLM with the context it needs for generation. The current prompt serves as an example and is taken from an NVIDIA documentation/help page. You can modify the prompt, in particular the example document and its queries, to suit your desired domain: the example queries you provide define the kind of queries the model will generate. In this prompt, we ask the LLM to generate three unique questions.
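
One way to adapt the prompt to another domain is to assemble it from your own example document and example queries. The sketch below is hypothetical: `build_system_prompt` is not part of this notebook, and the example document and queries are invented purely for illustration. It reuses the instruction wording and `<q>` tag layout of the prompt defined in the next cell.

```python
def build_system_prompt(example_document: str, example_queries: list[str]) -> str:
    """Assemble a few-shot system prompt in the same layout as this notebook's prompt."""
    n = len(example_queries)
    instructions = (
        f"You are a data annotator trying to generate {n} search queries for the Document 2. "
        "The generated queries must be answerable from Document 2. "
        "Each generated query must be enclosed within the <q> and </q> tags as shown in Example. "
        "Only generate the query, do not generate the answer. An example is:\n"
    )
    numbered = "\n".join(
        f"{i}. <q>{query}</q>" for i, query in enumerate(example_queries, start=1)
    )
    example = f"Example:{example_document}\nGenerated Queries:\n{numbered}\n"
    return instructions + example + "Do not use text from Example to generate queries for Document 2."


# Invented example content, for illustration only.
example_document = (
    "Vitamin D and bone health\n"
    "Vitamin D helps the body absorb calcium. Low levels are associated with weaker bones."
)
example_queries = [
    "How does vitamin D affect calcium absorption?",
    "What happens to bones when vitamin D levels are low?",
    "Which vitamin helps the body absorb calcium?",
]
prompt = build_system_prompt(example_document, example_queries)
print(prompt)
```

Keeping the example queries close in style to the queries your users actually issue is the main lever here, since the model tends to imitate them.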


In [13]:
system_prompt = """You are a data annotator trying to generate three search queries for the Document 2. The generated queries must be answerable from Document 2. Each generated query must be enclosed within the <q> and </q> tags as shown in Example. Only generate the query, do not generate the answer. An example is:\n"""

example = """Example:AV Sync 
Use of an AV Receiver with HDMI for video may result in audio lagging behind video.  First try 
using the receiver AV sync settings to calibrate.  If this does not work, use the AV sync slider 
utility in Settings  > Display & sound > Advanced settings > Audio video sync to calibrate for 
any audio delay.  The AV sync slider allows you to advance audio by 1 second (in small 
increments of 10ms) to synchronize the audio and video. 
Note that this tool is effective only when SHIELD is connected to your AV Receiver over HDMI 
(i.e. audio/video over HDMI); it is not meant to be used when a headset is plugged into SHIELD 
Controller/SHIELD Remote or USB audio device or Bluetooth audio device. 
If video lags behind audio (i.e. audio is ahead of video) then use your AV receiver’s settings to 
delay audio.
ADJUST FOR OVERSCAN

For TVs that don't provide their own overscan settings, use this setting to adjust the picture size to fit the screen.

Go to Settings > Device Preferences > Display & Sound > Advanced Settings > Adjust for overscan to resize the picture on your TV or display.  Use the UP and DOWN d-pad buttons on your remote to maximize the picture on your TV.  Make sure the green triangles are completely visible to avoid overscan.
Generated Queries:
1. <q>How do I adjust the display so that my picture does not go out of the screen?</q>
2. <q>Why is AV Sync not working when I'm plugging my SHIELD into my bluetooth earphone?</q>
3. <q>How many seconds can I delay audio by in AV Sync?</q>
"""
system_prompt += example
system_prompt += "Do not use text from Example to generate queries for Document 2."
print(system_prompt)

You are a data annotator trying to generate three search queries for the Document 2. The generated queries must be answerable from Document 2. Each generated query must be enclosed within the <q> and </q> tags as shown in Example. Only generate the query, do not generate the answer. An example is:
Example:AV Sync 
Use of an AV Receiver with HDMI for video may result in audio lagging behind video.  First try 
using the receiver AV sync settings to calibrate.  If this does not work, use the AV sync slider 
utility in Settings  > Display & sound > Advanced settings > Audio video sync to calibrate for 
any audio delay.  The AV sync slider allows you to advance audio by 1 second (in small 
increments of 10ms) to synchronize the audio and video. 
Note that this tool is effective only when SHIELD is connected to your AV Receiver over HDMI 
(i.e. audio/video over HDMI); it is not meant to be used when a headset is plugged into SHIELD 
Controller/SHIELD Remote or USB audio device or Bluetooth a

In [14]:
def get_prompt(message: str, chat_history: list[tuple[str, str]],
               system_prompt: str) -> str:
    texts = [f'<s>[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n']
    # The first user input is _not_ stripped
    do_strip = False
    for user_input, response in chat_history:
        user_input = user_input.strip() if do_strip else user_input
        do_strip = True
        texts.append(f'{user_input} [/INST] {response.strip()} </s><s>[INST] ') 
    message = message.strip() if do_strip else message
    message = "\nDocument 2:" + message
    texts.append(f'{message} [/INST]')
    return ''.join(texts)

In [15]:
kb["prompt"] = kb["chunk_title_text_concat"].apply(get_prompt, system_prompt=system_prompt, chat_history=[])
kb

Unnamed: 0,_id,title,text,paragraph_id,chunk_text,chunk_id,title_text_concat,chunk_title_text_concat,prompt
0,MED-3931,Pilot dietary study with normoproteic protein-...,Although a plant-based diet can provide some b...,2386,Although a plant-based diet can provide some b...,0,Pilot dietary study with normoproteic protein-...,Pilot dietary study with normoproteic protein-...,<s>[INST] <<SYS>>\nYou are a data annotator tr...
1,MED-4147,Occurrence of steroid hormones and antibiotics...,Wastewater impoundments at concentrated animal...,2546,Wastewater impoundments at concentrated animal...,0,Occurrence of steroid hormones and antibiotics...,Occurrence of steroid hormones and antibiotics...,<s>[INST] <<SYS>>\nYou are a data annotator tr...
2,MED-2335,Xenohormesis: health benefits from an eon of p...,Xenohormesis is a biological principle that ex...,1296,Xenohormesis is a biological principle that ex...,0,Xenohormesis: health benefits from an eon of p...,Xenohormesis: health benefits from an eon of p...,<s>[INST] <<SYS>>\nYou are a data annotator tr...
3,MED-5120,L-theanine intervention enhances human gammade...,Human gammadeltaT lymphocytes are a subset of ...,3369,Human gammadeltaT lymphocytes are a subset of ...,0,L-theanine intervention enhances human gammade...,L-theanine intervention enhances human gammade...,<s>[INST] <<SYS>>\nYou are a data annotator tr...
4,MED-1002,"Polybrominated diphenyl ethers (PBDEs), hydrox...",Prenatal exposure to polybrominated diphenyl e...,202,Prenatal exposure to polybrominated diphenyl e...,0,"Polybrominated diphenyl ethers (PBDEs), hydrox...","Polybrominated diphenyl ethers (PBDEs), hydrox...",<s>[INST] <<SYS>>\nYou are a data annotator tr...
...,...,...,...,...,...,...,...,...,...
96,MED-4462,Sulforaphane protects human chondrocytes again...,Chondrocyte cell death can contribute to carti...,2787,Chondrocyte cell death can contribute to carti...,0,Sulforaphane protects human chondrocytes again...,Sulforaphane protects human chondrocytes again...,<s>[INST] <<SYS>>\nYou are a data annotator tr...
97,MED-4007,Study of the ketogenic agent AC-1202 in mild t...,Background Alzheimer's disease (AD) is charact...,2450,Background Alzheimer's disease (AD) is charact...,0,Study of the ketogenic agent AC-1202 in mild t...,Study of the ketogenic agent AC-1202 in mild t...,<s>[INST] <<SYS>>\nYou are a data annotator tr...
98,MED-4007,Study of the ketogenic agent AC-1202 in mild t...,Background Alzheimer's disease (AD) is charact...,2450,"In the per protocol population, E4(-) particip...",1,Study of the ketogenic agent AC-1202 in mild t...,Study of the ketogenic agent AC-1202 in mild t...,<s>[INST] <<SYS>>\nYou are a data annotator tr...
99,MED-4550,An impact of the diet on serum fatty acid and ...,BACKGROUND/OBJECTIVES: Vegetarian diet has bec...,2862,BACKGROUND/OBJECTIVES: Vegetarian diet has bec...,0,An impact of the diet on serum fatty acid and ...,An impact of the diet on serum fatty acid and ...,<s>[INST] <<SYS>>\nYou are a data annotator tr...


## Synthetic Data Generation

The function below sends one or more prompts to the model server started by `setup.sh` and returns the HTTP response containing the generations.

In [74]:
def generate_multiple(texts):
    """Send one prompt (str) or a list of prompts to the text generation server."""
    port_num = 5555
    headers = {"Content-Type": "application/json"}
    if isinstance(texts, str):
        prompts = [texts]
    elif isinstance(texts, list):
        prompts = texts
    else:
        # Previously an unsupported type left `prompts` undefined
        raise TypeError("texts must be a str or a list of str")
    data = {
        "sentences": prompts,
        "tokens_to_generate": TOKENS_TO_GENERATE,
        "temperature": TEMPERATURE,
        "add_BOS": True,
        "top_k": TOP_K,
        "top_p": TOP_P,
        "greedy": False,
        "all_probs": False,
        "repetition_penalty": 1.2,
        "min_tokens_to_generate": 2,
        }
    resp = requests.put('http://localhost:{}/generate'.format(port_num),
                        data=json.dumps(data),
                        headers=headers)
    return resp

In [None]:
import time
batch_size = 4
texts = kb["prompt"].tolist()
len_texts = len(texts)
generations = []
for i in range(0, len_texts, batch_size):
    st_time = time.time()
    # Clamp the end index so the final batch may be smaller than batch_size
    end = min(i + batch_size, len_texts)
    resp = generate_multiple(texts[i:end])
    print(f'Completed batch --> {i}:{end}')

    generations.extend(resp.json()['sentences'])
    print('Time taken per batch = {:.2f} s'.format(time.time() - st_time))
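
Long generation runs can be interrupted by a transient server error. As a hedged sketch of one way to make the loop more robust, the helpers below split the prompts into batches and retry a failed batch a few times. Both `batched` and `generate_with_retry` are hypothetical names, and `send` is an assumed callable that wraps the request, for example `lambda batch: generate_multiple(batch).json()['sentences']`.

```python
import time
from typing import Callable, Iterator


def batched(items: list, batch_size: int) -> Iterator[list]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def generate_with_retry(send: Callable[[list], list], batch: list,
                        retries: int = 3, wait_s: float = 1.0) -> list:
    """Call send(batch), retrying on any exception up to `retries` attempts in total."""
    for attempt in range(1, retries + 1):
        try:
            return send(batch)
        except Exception as err:
            if attempt == retries:
                raise
            print(f"Attempt {attempt} failed ({err}); retrying")
            time.sleep(wait_s)
    return []  # only reached when retries < 1


# Stand-in for the real server call, used here so the sketch runs on its own.
def fake_send(batch: list) -> list:
    return [prompt.upper() for prompt in batch]


results = []
for batch in batched(["a", "b", "c", "d", "e"], batch_size=2):
    results.extend(generate_with_retry(fake_send, batch, wait_s=0.0))
print(results)
```

Retrying whole batches keeps the order of `results` aligned with the order of the prompts, which matters because the generations are later assigned back to the DataFrame by position.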


In [None]:
kb["generated_text"] = generations

## Parsing generations

Parse the generations to extract the generated queries.

In [89]:
def extract_questions_from_generations(kb):
    """Extract the text between <q> and </q> tags from each generation, one row per question."""
    paragraph_id_question = []
    for row in kb.to_dict(orient='records'):
        paragraph_id = row["paragraph_id"]
        chunk_id = row["chunk_id"]
        print(row["generated_text"])
        # Non-greedy match so each <q>...</q> pair yields a separate question
        questions = re.findall(r'<q>(.+?)</q>', row["generated_text"])
        print(questions)
        print("-"*200)
        paragraph_id_question += [(paragraph_id, chunk_id, question) for question in questions]
    return pd.DataFrame(paragraph_id_question, columns=["paragraph_id", "chunk_id", "chunk_question"])
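
The tag-based parsing can be tried in isolation on a sample generation. The sketch below is a slightly stricter variant rather than the notebook's own function: `extract_queries` is a hypothetical helper that also tolerates a query wrapped across lines (via `re.DOTALL`), collapses whitespace, and drops duplicates.

```python
import re

# Non-greedy match between the tags; DOTALL lets a query span line breaks.
Q_PATTERN = re.compile(r"<q>(.+?)</q>", flags=re.DOTALL)


def extract_queries(generated_text: str) -> list[str]:
    """Return the unique, whitespace-normalized queries found between <q> and </q>."""
    seen, queries = set(), []
    for match in Q_PATTERN.findall(generated_text):
        query = " ".join(match.split())  # collapse newlines and repeated spaces
        if query and query not in seen:
            seen.add(query)
            queries.append(query)
    return queries


sample = (
    "Sure, here are three search queries:\n"
    "1. <q>What is X?</q>\n"
    "2. <q>How does\nY work?</q>\n"
    "3. <q>What is X?</q>"
)
print(extract_queries(sample))
```

Text outside the tags, such as the model's introductory sentence, is ignored, which is why the prompt asks for every query to be enclosed in `<q>` and `</q>`.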

In [24]:
extracted_questions = extract_questions_from_generations(kb)

  Sure, here are three search queries that can be generated based on Document 2:

1. <q>What are the health risks associated with consuming beverages sweetened with high-fructose corn syrup?</q>
2. <q>How does fructose intake from sugar-sweetened beverages affect liver function and metabolism?</q>
3. <q>What are the differences in metabolic effects between glucose and fructose in sugar-sweetened beverages?</q>

These queries are based on the information in Document 2 regarding the health risks associated with high-fructose corn syrup and the metabolic effects of fructose and glucose in sugar-sweetened beverages.
['What are the health risks associated with consuming beverages sweetened with high-fructose corn syrup?', 'How does fructose intake from sugar-sweetened beverages affect liver function and metabolism?', 'What are the differences in metabolic effects between glucose and fructose in sugar-sweetened beverages?']
--------------------------------------------------------------------

In [25]:
kb = kb.merge(extracted_questions, how="left", on=["paragraph_id", "chunk_id"])
kb

Unnamed: 0,_id,title,text,paragraph_id,chunk_text,chunk_id,title_text_concat,chunk_title_text_concat,prompt,generated_text,chunk_question
0,MED-1710,Energy and Fructose From Beverages Sweetened W...,Sugar intake in the United States has increase...,0,Sugar intake in the United States has increase...,0,Energy and Fructose From Beverages Sweetened W...,Energy and Fructose From Beverages Sweetened W...,<s>[INST] <<SYS>>\nYou are a data annotator tr...,"Sure, here are three search queries that can...",What are the health risks associated with cons...
1,MED-1710,Energy and Fructose From Beverages Sweetened W...,Sugar intake in the United States has increase...,0,Sugar intake in the United States has increase...,0,Energy and Fructose From Beverages Sweetened W...,Energy and Fructose From Beverages Sweetened W...,<s>[INST] <<SYS>>\nYou are a data annotator tr...,"Sure, here are three search queries that can...",How does fructose intake from sugar-sweetened ...
2,MED-1710,Energy and Fructose From Beverages Sweetened W...,Sugar intake in the United States has increase...,0,Sugar intake in the United States has increase...,0,Energy and Fructose From Beverages Sweetened W...,Energy and Fructose From Beverages Sweetened W...,<s>[INST] <<SYS>>\nYou are a data annotator tr...,"Sure, here are three search queries that can...",What are the differences in metabolic effects ...
3,MED-5032,Processed meats and risk of childhood leukemia...,The relation between the intake of certain foo...,1,The relation between the intake of certain foo...,0,Processed meats and risk of childhood leukemia...,Processed meats and risk of childhood leukemia...,<s>[INST] <<SYS>>\nYou are a data annotator tr...,"Sure, here are three search queries that can...",What is the association between hot dog consum...
4,MED-5032,Processed meats and risk of childhood leukemia...,The relation between the intake of certain foo...,1,The relation between the intake of certain foo...,0,Processed meats and risk of childhood leukemia...,Processed meats and risk of childhood leukemia...,<s>[INST] <<SYS>>\nYou are a data annotator tr...,"Sure, here are three search queries that can...",Do processed meats increase the risk of leukem...
...,...,...,...,...,...,...,...,...,...,...,...
295,MED-1918,"Intensive meditation training, immune cell tel...",BACKGROUND: Telomerase activity is a predictor...,98,BACKGROUND: Telomerase activity is a predictor...,0,"Intensive meditation training, immune cell tel...","Intensive meditation training, immune cell tel...",<s>[INST] <<SYS>>\nYou are a data annotator tr...,"Based on the content of Document 2, here are...",How does perceived control affect the relation...
296,MED-1918,"Intensive meditation training, immune cell tel...",BACKGROUND: Telomerase activity is a predictor...,98,BACKGROUND: Telomerase activity is a predictor...,0,"Intensive meditation training, immune cell tel...","Intensive meditation training, immune cell tel...",<s>[INST] <<SYS>>\nYou are a data annotator tr...,"Based on the content of Document 2, here are...",What is the role of mindfulness and purpose in...
297,MED-1259,Consumption of blueberries with a high-carbohy...,We sought to determine whether consumption of ...,99,We sought to determine whether consumption of ...,0,Consumption of blueberries with a high-carbohy...,Consumption of blueberries with a high-carbohy...,<s>[INST] <<SYS>>\nYou are a data annotator tr...,"Sure, here are three search queries that can...",How does consuming blueberries with a high-car...
298,MED-1259,Consumption of blueberries with a high-carbohy...,We sought to determine whether consumption of ...,99,We sought to determine whether consumption of ...,0,Consumption of blueberries with a high-carbohy...,Consumption of blueberries with a high-carbohy...,<s>[INST] <<SYS>>\nYou are a data annotator tr...,"Sure, here are three search queries that can...",What is the optimal amount of blueberries to c...


In [26]:
qa_pairs = kb[["chunk_question", "chunk_title_text_concat", "chunk_id", "paragraph_id"]]
qa_pairs.columns = ["question", "positive_chunk", "positive_chunk_id", "paragraph_id"]
qa_pairs

Unnamed: 0,question,positive_chunk,positive_chunk_id,paragraph_id
0,What are the health risks associated with cons...,Energy and Fructose From Beverages Sweetened W...,0,0
1,How does fructose intake from sugar-sweetened ...,Energy and Fructose From Beverages Sweetened W...,0,0
2,What are the differences in metabolic effects ...,Energy and Fructose From Beverages Sweetened W...,0,0
3,What is the association between hot dog consum...,Processed meats and risk of childhood leukemia...,0,1
4,Do processed meats increase the risk of leukem...,Processed meats and risk of childhood leukemia...,0,1
...,...,...,...,...
295,How does perceived control affect the relation...,"Intensive meditation training, immune cell tel...",0,98
296,What is the role of mindfulness and purpose in...,"Intensive meditation training, immune cell tel...",0,98
297,How does consuming blueberries with a high-car...,Consumption of blueberries with a high-carbohy...,0,99
298,What is the optimal amount of blueberries to c...,Consumption of blueberries with a high-carbohy...,0,99


## Save the results

In [45]:
GENERATIONS_SAVE_FILENAME =  f"qa_pairs_{GENERATION_MODEL_NAME_OR_PATH}_num_questions_{len(qa_pairs)}_{DOMAIN}"
GENERATIONS_SAVE_FILENAME = re.sub(r'\W+', '_', GENERATIONS_SAVE_FILENAME)
GENERATIONS_SAVE_PATH = os.path.join(GENERATIONS_SAVE_DIR, f"{GENERATIONS_SAVE_FILENAME}.csv")

In [47]:
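os.makedirs(GENERATIONS_SAVE_DIR, exist_ok=True)  # Create the output directory if it does not exist
qa_pairs.to_csv(GENERATIONS_SAVE_PATH, index=False)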
print(f"Generated QA Pairs saved to {GENERATIONS_SAVE_PATH}")

Generated QA Pairs saved to /workspace/files/data/qa_pairs_meta_llama_Llama_2_13b_chat_hf_num_questions_300_BeIR_nfcorpus.csv
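
Because the questions are attached with a left merge, a chunk for which no `<q>` tags were parsed ends up with an empty `question` field in the saved file. The following is a minimal sketch of a sanity check on the `.csv`, using only the standard library and the column names written above. `summarize_qa_csv` is a hypothetical helper, and the demo file is a made-up stand-in for the real output.

```python
import csv
import os
import tempfile


def summarize_qa_csv(path: str) -> dict:
    """Count rows, rows with an empty question, and distinct (paragraph_id, chunk id) pairs."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    missing = sum(1 for row in rows if not (row.get("question") or "").strip())
    unique_chunks = {(row["paragraph_id"], row["positive_chunk_id"]) for row in rows}
    return {"rows": len(rows), "missing_question": missing, "unique_chunks": len(unique_chunks)}


# Write a tiny stand-in file with the same columns as the saved QA pairs.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "qa_pairs_demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["question", "positive_chunk", "positive_chunk_id", "paragraph_id"])
        writer.writerow(["What is X?", "Title\nBody text", 0, 7])
        writer.writerow(["", "Other title\nOther body", 0, 8])
    summary = summarize_qa_csv(demo_path)

print(summary)
```

On the real output, pass `GENERATIONS_SAVE_PATH` instead of the demo path. A non-zero `missing_question` count suggests filtering those rows out before fine-tuning.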
