# **Generate Synthetic Dataset with LLM**
In this notebook, we generate a synthetic dataset of (query, relevant document) pairs from a corpus of documents, without human labelers, by leveraging an LLM.

## **Generate Corpus**
First, we create the corpus of text chunks by using LangChain to load the source documents (PDFs already parsed to plain text) and split them into plain-text chunks.

In [81]:
# !pip install langchain
# !pip install transformers
# !pip install chromadb
# !pip install sentence_transformers
# !pip install accelerate
# !pip install bitsandbytes
# !pip install rank_bm25 > /dev/null
import os, glob, textwrap, time
import json
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders import TextLoader
from langchain.chains import RetrievalQA
from langchain.embeddings import HuggingFaceBgeEmbeddings
from transformers import pipeline
from langchain.llms import HuggingFacePipeline
from langchain.vectorstores.chroma import Chroma
from langchain.chains import ConversationalRetrievalChain
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain import PromptTemplate
from langchain.prompts import SystemMessagePromptTemplate, HumanMessagePromptTemplate, ChatPromptTemplate
from tqdm.notebook import tqdm
import re
import uuid


In [82]:
import warnings
warnings.filterwarnings("ignore")

## ***Load and chunk Documents***

In [83]:
def loadSplitDocuments(file_path, chunk_size, chunk_overlap):
  # Load one plain-text file and split it into overlapping chunks.
  loader = TextLoader(file_path)
  documents = loader.load()
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  return text_splitter.split_documents(documents)


train_documents = loadSplitDocuments("/content/Basic Structure of the Local High Voltage Product _parsed.txt", chunk_size = 600, chunk_overlap=60)
validation_documents = loadSplitDocuments("/content/10. Corrective Action Policy_parsed.txt", chunk_size = 600, chunk_overlap=60)
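
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-size character windows with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences), and `sliding_chunks` is a hypothetical helper name used only for illustration.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper for illustration only; it ignores the separator
    logic that RecursiveCharacterTextSplitter applies.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

With `chunk_size=10` and `chunk_overlap=3`, each window starts 7 characters after the previous one, so the last 3 characters of one chunk are repeated at the start of the next. The values 600 and 60 used above follow the same idea at a larger scale.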

In [54]:
def create_corpus(documents):
    corpus = {}
    for node in documents:
        page_content = node.page_content
        metadata = node.metadata

        # Extracting the filename from the source path
        source_path = metadata.get("source", "")
        filename = os.path.basename(source_path)

        # If filename already exists in the corpus, append a number to make it unique
        if filename in corpus:
            filename = f"{filename}_{len(corpus)}"

        # Modifying the "source" metadata to contain only the filename
        metadata["source"] = filename

        corpus[filename] = page_content

    return corpus
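
As a quick illustration of the key scheme in `create_corpus`: the first chunk of a file keeps the bare filename, and later chunks get a numeric suffix taken from the current corpus size. The sketch below mirrors only that naming rule on a made-up source path; `chunk_ids` is an illustrative name, not part of the notebook.

```python
import os


def chunk_ids(source_paths):
    """Return the corpus keys the naming rule would assign, in order."""
    ids = []
    for source_path in source_paths:
        name = os.path.basename(source_path)
        if name in ids:
            # Later chunks from the same file: suffix with the number of
            # keys assigned so far, so the keys stay distinct here.
            name = f"{name}_{len(ids)}"
        ids.append(name)
    return ids


# Three chunks that all come from the same (hypothetical) file.
ids = chunk_ids(["/content/policy.txt"] * 3)
print(ids)
```

These keys are the node ids that the generated queries later point back to, so they need to stay stable between saving and reloading the corpus.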

In [84]:
train_corpus = create_corpus(train_documents)
validation_corpus = create_corpus(validation_documents)

In [85]:
TRAIN_CORPUS_FPATH = '/content/train_corpus.json'
VAL_CORPUS_FPATH = '/content/val_corpus.json'

In [86]:
with open(TRAIN_CORPUS_FPATH, 'w+') as f:
    json.dump(train_corpus, f)

with open(VAL_CORPUS_FPATH, 'w+') as f:
    json.dump(validation_corpus, f)

In [87]:
with open(TRAIN_CORPUS_FPATH, 'r+') as f:
    train_corpus = json.load(f)

with open(VAL_CORPUS_FPATH, 'r+') as f:
    val_corpus = json.load(f)

In [92]:
from itertools import islice

def slice_top_n(dictionary, n):
    return dict(islice(dictionary.items(), n))

val_corpus = slice_top_n(val_corpus, 5)

## **Generate synthetic queries**
Now, we use a WizardLM model to generate questions, using each text chunk in the corpus as context.

Each pair of (generated question, text chunk used as context) becomes a datapoint in the finetuning dataset (either for training or evaluation).

In [66]:
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM, GenerationConfig

Tokenizer = LlamaTokenizer.from_pretrained("TheBloke/wizardLM-7B-HF")
model = LlamaForCausalLM.from_pretrained("TheBloke/wizardLM-7B-HF",
                                         load_in_4bit=True,
                                         torch_dtype=torch.float16)

You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.
`low_cpu_mem_usage` was None, now set to True since model is quantized.


In [69]:
from langchain.llms import HuggingFacePipeline
from transformers import pipeline

pipe = pipeline(
                'text-generation',
                model=model,
                tokenizer=Tokenizer,
                max_length=2048,
)

local_llm = HuggingFacePipeline(pipeline=pipe)

In [73]:
def generate_queries(
    corpus,
    num_questions_per_chunk=2,
    prompt_template=None,
):
    """
    Automatically generate hypothetical questions that could be answered with
    doc in the corpus.
    """

    prompt_template = prompt_template or """\
    Context information is below.

    ---------------------
    {context_str}
    ---------------------

    Given the context information and not prior knowledge,
    generate only questions based on the below query.

    You are a Teacher/Professor. Your task is to set up \
    {num_questions_per_chunk} questions for an upcoming \
    quiz/examination. The questions should be diverse in nature \
    across the document. Restrict the questions to the \
    context information provided.
    """

    queries = {}
    relevant_docs = {}
    for node_id, text in tqdm(corpus.items()):
        query = prompt_template.format(context_str=text, num_questions_per_chunk=num_questions_per_chunk)
        response = local_llm(query)

        result = str(response).strip().split("\n")
        questions = [
            re.sub(r"^\d+[\).\s]", "", question).strip() for question in result
        ]
        questions = [question for question in questions if len(question) > 0]

        for question in questions:
            question_id = str(uuid.uuid4())
            queries[question_id] = question
            relevant_docs[question_id] = [node_id]
    return queries, relevant_docs
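
The parsing step inside `generate_queries` can be tried in isolation, without loading the model. The sketch below applies the same split-and-strip logic to a made-up response string; `parse_questions` and the sample text are hypothetical and exist only for this example.

```python
import re


def parse_questions(response):
    """Split an LLM response into lines and strip list numbering."""
    lines = str(response).strip().split("\n")
    # Remove a leading "1.", "2)" or "3 " style prefix from each line.
    questions = [re.sub(r"^\d+[\).\s]", "", line).strip() for line in lines]
    # Drop lines that are empty after stripping.
    return [q for q in questions if q]


# Hypothetical model output, made up for this example.
sample_response = """
1. What is the purpose of the policy?

2) Who approves a corrective action?
"""
questions = parse_questions(sample_response)
print(questions)
```

Note that this logic keeps every non-empty line, so any preamble the model emits before the numbered list would also be stored as a query, and a line that legitimately begins with a number followed by a space would lose that number. Inspecting a few generated queries by hand is a reasonable check.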

In [74]:
train_queries, train_relevant_docs = generate_queries(train_corpus)

In [93]:
val_queries, val_relevant_docs = generate_queries(val_corpus)

In [94]:
TRAIN_QUERIES_FPATH = '/content/train_queries.json'
TRAIN_RELEVANT_DOCS_FPATH = '/content/train_relevant_docs.json'

VAL_QUERIES_FPATH = '/content/val_queries.json'
VAL_RELEVANT_DOCS_FPATH = '/content/val_relevant_docs.json'

In [95]:
with open(TRAIN_QUERIES_FPATH, 'w+') as f:
    json.dump(train_queries, f)

with open(TRAIN_RELEVANT_DOCS_FPATH, 'w+') as f:
    json.dump(train_relevant_docs, f)

with open(VAL_QUERIES_FPATH, 'w+') as f:
    json.dump(val_queries, f)

with open(VAL_RELEVANT_DOCS_FPATH, 'w+') as f:
    json.dump(val_relevant_docs, f)

## **Merge data**
Finally, we do some minor re-organization to make it easier to access the dataset for training and evaluation.

In [96]:
TRAIN_DATASET_FPATH = '/content/train_dataset.json'
VAL_DATASET_FPATH = '/content/val_dataset.json'

In [97]:
train_dataset = {
    'queries': train_queries,
    'corpus': train_corpus,
    'relevant_docs': train_relevant_docs,
}

val_dataset = {
    'queries': val_queries,
    'corpus': val_corpus,
    'relevant_docs': val_relevant_docs,
}

In [98]:
with open(TRAIN_DATASET_FPATH, 'w+') as f:
    json.dump(train_dataset, f)

with open(VAL_DATASET_FPATH, 'w+') as f:
    json.dump(val_dataset, f)
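
To show how the merged file might be consumed, the sketch below writes a tiny stand-in dataset with the same three keys (`queries`, `corpus`, `relevant_docs`), reloads it, and joins each query with the text of its relevant chunk. The ids, texts, and file name are made up for this example; a real run would read `TRAIN_DATASET_FPATH` or `VAL_DATASET_FPATH` instead.

```python
import json
import os
import tempfile

# Tiny stand-in dataset with the same three keys as the merged dataset;
# the ids and texts are made up for this example.
dataset = {
    "queries": {"q1": "What does the policy cover?"},
    "corpus": {"policy.txt": "The policy covers corrective actions."},
    "relevant_docs": {"q1": ["policy.txt"]},
}

path = os.path.join(tempfile.mkdtemp(), "toy_dataset.json")
with open(path, "w") as f:
    json.dump(dataset, f)

with open(path) as f:
    loaded = json.load(f)

# Join each query with the text of every chunk marked relevant to it.
pairs = [
    (loaded["queries"][query_id], loaded["corpus"][doc_id])
    for query_id, doc_ids in loaded["relevant_docs"].items()
    for doc_id in doc_ids
]
print(pairs)
```

Each resulting tuple is one (query, relevant document) datapoint. A lookup that raises `KeyError` here would indicate a query pointing at a chunk id that is missing from the corpus.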