# Generate a Synthetic Dataset with an LLM

Reference: [Fine-Tuning Embeddings for RAG with Synthetic Data](https://medium.com/llamaindex-blog/fine-tuning-embeddings-for-rag-with-synthetic-data-e534409a3971)

Generate a synthetic dataset of (query, relevant documents) pairs from a corpus of documents, **without human labelers**, by using an LLM.

## Generate Corpus

In [1]:
import json
import re
import uuid

from llama_index import SimpleDirectoryReader
from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import MetadataMode

from tqdm import tqdm

In [28]:
files_list = ['data_finetuning/one_file/merkblatt-fuer-arbeitslose_ba036520.pdf']
corpus_fpath = 'data_finetuning/one_file/corpus.json'

In [10]:
# reader = SimpleDirectoryReader(input_files=files_list)

# docs = reader.load_data()  

# parser = SimpleNodeParser.from_defaults()

# nodes = parser.get_nodes_from_documents(docs, show_progress=False)

# list(nodes[0])

# nodes[4].get_content(metadata_mode=MetadataMode.NONE)

In [24]:
def load_corpus(files, verbose=False):
    """
    Load the files with "SimpleDirectoryReader", split the documents into nodes with
    "SimpleNodeParser", and extract the text of each node.

    Args:
        files (list): A list of file paths to load (passed as "input_files").
        verbose (bool): Whether to print progress information.

    Returns:
        dict: A mapping {node_id: text} with one entry per parsed node.
    """

    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    
    if verbose:
        print(f'Loaded {len(docs)} docs')
    
    parser = SimpleNodeParser.from_defaults()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f'Parsed {len(nodes)} nodes')

    corpus = {node.node_id: node.get_content(metadata_mode=MetadataMode.NONE) for node in nodes}
    return corpus

In [25]:
corpus = load_corpus(files_list, verbose=True)

Loading files ['data_finetuning/one_file/merkblatt-fuer-arbeitslose_ba036520.pdf']


Loaded 103 docs


Parsing documents into nodes: 100%|██████████| 103/103 [00:00<00:00, 2994.37it/s]

Parsed 103 nodes





In [27]:
print(f"Type: {type(corpus)}")
print(f"Length: {len(corpus)}")
for key in list(corpus.keys())[0:2]:
    print(corpus[key])
    print("-"*80)

Type: <class 'dict'>
Length: 103
49466_BA_MB_1.indd   1 10.02.2015   13:20:58Agentur für Arbeit  
Musterstadthausen  Merkblatt
1Merkblatt für
Arbeitslose 
Ihre Rechte –
Ihre Pflichten
--------------------------------------------------------------------------------
3 
Ihre Agentur für Arbeit hält eine Fülle von 
 Informationen für Sie bereit. 
Neben den Informationen in diesem Merkblatt finden 
Sie unter » www.arbeitsagentur.de  unser umfassen ­
des Online-Angebot der „eServices “ sowie ein 
 interessantes Informationsangebot aus allen Aufgaben ­
bereichen der Bundesagentur für Arbeit. Sie erhalten 
wertvolle Tipps zu den Themen Ausbil ­
dung, Berufs- und Studienwahl, Weiter ­
bildung, wichtige Informationen über 
Geldleistungen sowie ein umfangreiches 
Serviceangebot.
Über das Job- und Serviceportal  
» www.arbeitsagentur.de  können Sie beispielsweise:
•  sich arbeitsuchend und arbeitslos melden,
•  Geldleistungen, wie Arbeitslosengeld, beantragen
•  Fragen zum Arbeitslosengeld unserem

In [29]:
with open(corpus_fpath, 'w+') as f:
    json.dump(corpus, f)
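
The corpus here is German text, and `json.dump` escapes non-ASCII characters as `\uXXXX` sequences by default. The following is a minimal sketch, assuming you would like the saved file to stay human-readable; the sample entry and the temporary file path are placeholders, not part of the notebook run.

```python
import json
import os
import tempfile

# Placeholder corpus entry containing an umlaut.
sample_corpus = {"node-1": "Merkblatt für Arbeitslose"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "corpus.json")

    # ensure_ascii=False writes "ü" as-is; the explicit encoding keeps this portable.
    with open(path, "w", encoding="utf-8") as f:
        json.dump(sample_corpus, f, ensure_ascii=False, indent=2)

    # Read the raw text back to inspect it, then parse it again.
    with open(path, "r", encoding="utf-8") as f:
        raw_text = f.read()

reloaded = json.loads(raw_text)
```

Either form loads back to the same dictionary; the only difference is how the file looks when opened in an editor.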

## Generate synthetic queries

Use an LLM (e.g., gpt-3.5-turbo) to generate questions using each text chunk in the corpus as context.

For both training and validation, this creates (`generated question`, `text chunk as context`) pairs. These pairs are the data points of the fine-tuning dataset.
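
As a rough illustration of that layout, the sketch below uses hand-written placeholder IDs and strings (not output from this notebook) and a hypothetical helper, `iter_pairs`, that follows question ID to node ID to chunk text:

```python
# Hypothetical, hand-written example of the data layout; IDs and text are placeholders.
corpus_example = {
    "node-1": "Text of the first chunk.",
    "node-2": "Text of the second chunk.",
}
queries_example = {
    "q-1": "A question generated from the first chunk?",
    "q-2": "A question generated from the second chunk?",
}
relevant_docs_example = {
    "q-1": ["node-1"],
    "q-2": ["node-2"],
}


def iter_pairs(queries, relevant_docs, corpus):
    """Yield (question, chunk text) pairs: question ID -> node ID -> text."""
    for question_id, question in queries.items():
        for node_id in relevant_docs[question_id]:
            yield question, corpus[node_id]


pairs = list(iter_pairs(queries_example, relevant_docs_example, corpus_example))
```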

In [31]:
train_queries_fpath = 'data_finetuning/one_file/train_val_data/train_queries.json'
train_relevant_docs_fpath = 'data_finetuning/one_file/train_val_data/train_relevant_docs.json'

val_queries_fpath = 'data_finetuning/one_file/train_val_data/val_queries.json'
val_relevant_docs_fpath = 'data_finetuning/one_file/train_val_data/val_relevant_docs.json'

In [30]:
# Reload the corpus that was saved above.
with open(corpus_fpath, 'r') as f:
    corpus = json.load(f)

In [48]:
def generate_queries(corpus, num_questions_per_chunk=2, num_val_questions=1, prompt_template=None, verbose=False):
    """
    Generate hypothetical questions that could be answered with documents in the corpus.

    Args:
        corpus (dict): A dictionary in {node_id: text} format.
        num_questions_per_chunk (int): Number of questions to generate per text chunk.
        num_val_questions (int): Number of questions per chunk to put in the validation set
            (must be less than "num_questions_per_chunk").
        prompt_template (str): A custom prompt with "{context_str}" and
            "{num_questions_per_chunk}" placeholders, filled in with str.format.
        verbose (bool): Whether to print info (currently unused, TODO).

    Returns:
        tuple: (queries_train, relevant_docs_train, queries_val, relevant_docs_val). Each
        "queries_*" dict maps question_id to the question text; each "relevant_docs_*" dict
        maps question_id to a list holding the node_id of the source chunk.
    """

    if not (num_val_questions < num_questions_per_chunk):
        raise ValueError("num_val_questions must be less than num_questions_per_chunk")
    
    llm = OpenAI(model='gpt-3.5-turbo')

    prompt_template = prompt_template or """\
    Context information is below.
    
    ---------------------
    {context_str}
    ---------------------
    
    Given the context information and not prior knowledge,
    generate only questions based on the below query.
    
    You are a Teacher/Professor. Your task is to set up \
    {num_questions_per_chunk} questions for an upcoming \
    quiz/examination. The questions should be diverse in nature \
    across the document. Restrict the questions to the \
    context information provided.
    """

    queries_train = {}
    relevant_docs_train = {}

    queries_val = {}
    relevant_docs_val = {}


    # for node_id, text in corpus.items():
    for node_id, text in tqdm(corpus.items()):
        query = prompt_template.format(context_str=text, num_questions_per_chunk=num_questions_per_chunk)
        response = llm.complete(query)
 
        # One question per line; strip list numbering such as "1." or "2)".
        result = str(response).strip().split("\n")
        questions = [
            re.sub(r"^\d+[\).\s]", "", question).strip() for question in result
        ]

        # Drop empty lines, then split this chunk's questions into train and validation.
        questions = [question for question in questions if len(question) > 0]
        split_index = num_questions_per_chunk - num_val_questions
        questions_train = questions[:split_index]
        questions_val = questions[split_index:]

        for question in questions_train:
            question_id = str(uuid.uuid4())
            queries_train[question_id] = question
            relevant_docs_train[question_id] = [node_id]
        
        for question in questions_val:
            question_id = str(uuid.uuid4())
            queries_val[question_id] = question
            relevant_docs_val[question_id] = [node_id]

    return queries_train, relevant_docs_train, queries_val, relevant_docs_val
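
To make the parsing and splitting step easier to follow without calling the API, here is a small sketch that applies the same regular expression and slicing to a made-up response string. The `raw_response` text is an assumption for illustration; real model output will vary.

```python
import re

# Hypothetical raw completion text; real LLM output will differ.
raw_response = "1. What is X?\n2) How does Y work?\n\n3 Why Z?"

lines = raw_response.strip().split("\n")
# Remove a leading "1.", "2)" or "3 " style prefix, then surrounding whitespace.
questions = [re.sub(r"^\d+[\).\s]", "", line).strip() for line in lines]
# Drop the empty strings left over from blank lines in the response.
questions = [q for q in questions if len(q) > 0]

num_questions_per_chunk, num_val_questions = 3, 1
split_index = num_questions_per_chunk - num_val_questions
questions_train = questions[:split_index]
questions_val = questions[split_index:]
```

Because the validation slice is `questions[split_index:]`, any extra questions the model returns beyond the requested number end up in the validation set, and a response with fewer questions than `split_index` leaves the validation list empty for that chunk.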

In [None]:
from itertools import islice

# Keep only the first 6 chunks for a quick, cheap test run.
corpus_small = dict(islice(corpus.items(), 6))

# page = list(corpus_small.keys())[6] # a page in the document
# print(corpus_small[page])

In [44]:
train_queries, train_relevant_docs, val_queries, val_relevant_docs = generate_queries(corpus_small)

100%|██████████| 6/6 [00:15<00:00,  2.51s/it]


In [45]:
train_queries

{'11e2af26-4180-49ed-a665-1551884587c8': 'What are some of the rights and responsibilities of unemployed individuals according to the Merkblatt from the Agentur für Arbeit?',
 '3b034aff-62dc-4eab-b1cc-1b6f6c506b49': 'What services can individuals access through the Job and Service Portal on the Arbeitsagentur website?',
 '5d7dcf30-1f1d-4be3-beb7-e646467188fd': 'How can individuals access selected features of their online profile through the new customer app "BA-mobil"?',
 'be6aa851-ece0-4e79-adfa-ca3f6d498ae6': 'What are some of the important rights and obligations that individuals need to be aware of when applying for or receiving unemployment benefits under the Third Book of the Social Code (SGB III)?',
 '564775fb-a3a8-488b-8304-b1bf7a2ac76d': 'What is the purpose of the Merkblatt Bürgergeld – Grundsicherung für Arbeit suchende – SGB II and where can it be obtained?',
 '8f593f4f-1191-4d44-9bf6-73f64ddb5ee4': 'What are the consequences of not reporting your unemployment status in a ti

In [46]:
train_relevant_docs

{'11e2af26-4180-49ed-a665-1551884587c8': ['34714d27-0488-49fa-b8e5-a7c287da38c2'],
 '3b034aff-62dc-4eab-b1cc-1b6f6c506b49': ['6d263834-c687-4214-aad5-df728e814b96'],
 '5d7dcf30-1f1d-4be3-beb7-e646467188fd': ['389a30ad-c887-4fa0-bd05-4e17a8430b06'],
 'be6aa851-ece0-4e79-adfa-ca3f6d498ae6': ['2b3d3a74-dc55-415c-91ee-49bf16186145'],
 '564775fb-a3a8-488b-8304-b1bf7a2ac76d': ['0240a5b7-0800-4059-ab7a-67aece2a20e9'],
 '8f593f4f-1191-4d44-9bf6-73f64ddb5ee4': ['8535c3a6-7d9f-4553-86ef-36a2a560bcd3']}

In [47]:
val_relevant_docs

{'ea5a9cd7-7316-44ed-9211-0267909c4014': ['34714d27-0488-49fa-b8e5-a7c287da38c2'],
 'd5cd98f6-0bb8-413f-b1dc-7db879c2b874': ['6d263834-c687-4214-aad5-df728e814b96'],
 '90fe076f-fd8a-433a-b28a-259b9179a9d2': ['389a30ad-c887-4fa0-bd05-4e17a8430b06'],
 '62d4675a-4a20-46ad-9d28-83c44dc618e4': ['2b3d3a74-dc55-415c-91ee-49bf16186145'],
 '76af64e4-996c-4a10-b163-24ec7580dd27': ['0240a5b7-0800-4059-ab7a-67aece2a20e9'],
 '22b466b5-59b9-4b01-9105-9d1f71a5f339': ['8535c3a6-7d9f-4553-86ef-36a2a560bcd3']}

## Create full dataset

In [49]:
train_queries, train_relevant_docs, val_queries, val_relevant_docs = generate_queries(
    corpus=corpus,
    num_questions_per_chunk=4,
    num_val_questions=1)

100%|██████████| 103/103 [06:16<00:00,  3.66s/it]


In [50]:
with open(train_queries_fpath, 'w+') as f:
    json.dump(train_queries, f)

with open(train_relevant_docs_fpath, 'w+') as f:
    json.dump(train_relevant_docs, f)

with open(val_queries_fpath, 'w+') as f:
    json.dump(val_queries, f)

with open(val_relevant_docs_fpath, 'w+') as f:
    json.dump(val_relevant_docs, f)
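
Before relying on the saved files, it may be worth checking that the two sets are consistent. The sketch below uses a hypothetical helper, `find_split_problems`, and placeholder data; it only checks that no question ID is shared between the sets and that every referenced node ID exists in the corpus.

```python
def find_split_problems(train_relevant_docs, val_relevant_docs, corpus):
    """Return a list of messages describing inconsistencies in the generated split."""
    problems = []

    # A question ID should belong to exactly one of the two sets.
    shared_ids = set(train_relevant_docs) & set(val_relevant_docs)
    if shared_ids:
        problems.append(f"{len(shared_ids)} question IDs appear in both sets")

    # Every referenced node ID should exist in the corpus.
    for name, relevant_docs in (("train", train_relevant_docs), ("val", val_relevant_docs)):
        for question_id, node_ids in relevant_docs.items():
            for node_id in node_ids:
                if node_id not in corpus:
                    problems.append(f"{name}: {question_id} points to unknown node {node_id}")

    return problems


# Placeholder data, not taken from the notebook run.
example_corpus = {"node-1": "First chunk.", "node-2": "Second chunk."}
ok = find_split_problems({"q-1": ["node-1"]}, {"q-2": ["node-2"]}, example_corpus)
bad = find_split_problems({"q-1": ["node-1"]}, {"q-1": ["node-9"]}, example_corpus)
```

With the notebook's variables, the call would be `find_split_problems(train_relevant_docs, val_relevant_docs, corpus)`; an empty list means neither problem was found.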

## Merge data

Reorganize the data so that the training and evaluation datasets are easier to access.

In [51]:
train_dataset_fpath = 'data_finetuning/one_file/train_val_data/train_dataset.json'
val_dataset_fpath = 'data_finetuning/one_file/train_val_data/val_dataset.json'

In [52]:
train_dataset = {
    'queries': train_queries,
    'corpus': corpus,
    'relevant_docs': train_relevant_docs,
}

val_dataset = {
    'queries': val_queries,
    'corpus': corpus,
    'relevant_docs': val_relevant_docs,
}

In [53]:
with open(train_dataset_fpath, 'w+') as f:
    json.dump(train_dataset, f)

with open(val_dataset_fpath, 'w+') as f:
    json.dump(val_dataset, f)
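
As a possible follow-up, the merged layout makes it easy to see how many questions point at each chunk. This is a sketch on a hypothetical miniature dataset with the same `queries`, `corpus`, and `relevant_docs` keys; the IDs and text are placeholders.

```python
from collections import Counter

# Hypothetical miniature dataset in the same layout as train_dataset / val_dataset.
dataset_example = {
    "queries": {"q-1": "Question A?", "q-2": "Question B?", "q-3": "Question C?"},
    "corpus": {"node-1": "First chunk.", "node-2": "Second chunk.", "node-3": "Third chunk."},
    "relevant_docs": {"q-1": ["node-1"], "q-2": ["node-1"], "q-3": ["node-2"]},
}

# Count how many questions point at each chunk.
questions_per_node = Counter(
    node_id
    for node_ids in dataset_example["relevant_docs"].values()
    for node_id in node_ids
)

# Chunks that no question refers to, for example if the LLM returned an empty response.
uncovered_nodes = [n for n in dataset_example["corpus"] if n not in questions_per_node]
```

Chunks listed in `uncovered_nodes` contribute no data points to that dataset, which can be useful to know before fine-tuning.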