## Retrieval Augmented Generation
I developed a RAG solution to answer questions about a repository of research papers. I parsed the PDF files, chunked and indexed the text, and then designed and ran an evaluation of the results.

In Naïve RAG, the query is compared against the documents in a vector database to retrieve the top N matches. The language model then summarizes the retrieved documents into an answer to the user query. Research papers are highly structured documents with technically deep content, in contrast to blogs, which tend to be more general and introductory. As a result, a query may fail to match the relevant chunks of a paper without additional processing, such as information extraction or summarization.

One approach to address this problem is to use the language model to generate answerable questions from chunks of each paper. The generated questions can then be indexed as "documents" in a vector database, and the user query can be matched against the most similar questions. By maintaining a mapping between the indexed, generated question and the paper chunk, the retrieval process can then produce the most relevant chunks for use in summarizing an answer to the user query.
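
The question-to-chunk mapping described above can be illustrated with a small, dependency-free sketch. Everything in it is an assumption made for illustration: the bag-of-words `cosine` similarity stands in for a real sentence embedding, and the questions and chunk strings in `question_to_chunk` are made up.

```python
import math
from collections import Counter

def bow_vector(text):
    """Toy bag-of-words vector; a stand-in for a real sentence embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical generated questions, each mapped to the chunk it came from.
question_to_chunk = {
    "what is retrieval augmented generation": "chunk about the definition of RAG",
    "how are translation memories used in nmt": "chunk about translation memory",
}

def retrieve_chunk(query):
    """Match the query against the indexed questions, then return the mapped chunk."""
    q_vec = bow_vector(query)
    best_question = max(question_to_chunk, key=lambda q: cosine(q_vec, bow_vector(q)))
    return question_to_chunk[best_question]

print(retrieve_chunk("what is retrieval augmented generation used for"))
```

The query is never compared with the chunk text itself; only the generated question is matched, and the mapping supplies the chunk.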

## Set up the functions for prompting

In [50]:
import json
from openai import OpenAI
client = OpenAI()

def prompt_model(prompt):
    completion = client.chat.completions.create(
        model="gpt-4o",
        store=True,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content

## Parse data from source

In [3]:
import os, bibtexparser, pypdf, logging

# silence non-critical errors while parsing PDF files
logging.getLogger("pypdf").setLevel(logging.CRITICAL)

data_path = 'data/'
data = {}

files = os.listdir(data_path)
print('Reading %i files:' % len(files))
for f in files:
    path = os.path.join(data_path, f)

    # each datum will have at least these attributes
    d = {'filepath': None, 'title': None, 'text': None}

    # parse bibtex file, if exists
    if path.endswith('.bib'):
        if path[:-4] in data:
            d = data[path[:-4]]

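        # use a context manager so the file handle is closed after parsing
        with open(path, 'r') as bib_file:
            bib = bibtexparser.load(bib_file)
        # guard against .bib files that contain no entries
        if bib.entries and 'title' in bib.entries[0]: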
            d['title'] = bib.entries[0]['title']
            data[path[:-4]] = d

    # parse pdf text, if exists
    if path.endswith('.pdf'):
        if path[:-4] in data:
            d = data[path[:-4]]

        print('  File: %s' % f)
        text = ''
        reader = pypdf.PdfReader(path)
        for page in reader.pages:
            text += page.extract_text()
        d['filepath'] = path
        d['text'] = text
        data[path[:-4]] = d

data = [d for d in data.values()]

Reading 53 files:
  File: 2020.coling-main.207.pdf
  File: 2021.findings-emnlp.320.pdf
  File: 2022.naacl-main.191.pdf
  File: 2023.acl-long.557.pdf
  File: 2023.emnlp-main.495.pdf
  File: 2023.findings-emnlp.620.pdf
  File: 2024.acl-long.642.pdf
  File: 2024.eacl-demo.16.pdf
  File: 2024.emnlp-industry.66.pdf
  File: 2202.01110v2.pdf
  File: 2212.14024v2.pdf
  File: 23-0037.pdf
  File: 2312.10997v5.pdf
  File: 2402.19473v6.pdf
  File: 29728-Article Text-33782-1-2-20240324-3.pdf
  File: 3626772.3657834.pdf
  File: 3626772.3657957.pdf
  File: 3637870.pdf
  File: 8917_Retrieval_meets_Long_Cont.pdf
  File: 947_Augmented_Language_Models_.pdf
  File: 967_generate_rather_than_retrieve_.pdf
  File: NeurIPS-2020-retrieval-augmented-generation-for-knowledge-intensive-nlp-tasks.pdf
  File: NeurIPS-2023-leandojo-theorem-proving-with-retrieval-augmented-language-models.pdf
  File: NeurIPS-2023-lift-yourself-up-retrieval-augmented-text-generation-with-self-memory.pdf
  File: tacl_a_00605.pdf


## Chunk data and generate indices

User queries are matched against index entries that approximate the content of the text chunks used to compose an answer. I chunked the text and then prompted the model to generate a question that is answerable from each chunk. The generated questions are the "documents" stored in the vector database.

In [None]:
def chunk_text(text, chunk_size=500, overlap=50):
    """
    Splits text into chunks of `chunk_size` words with `overlap` words overlapping between chunks.
    
    Args:
        text (str): The input text to chunk.
        chunk_size (int): The number of words per chunk.
        overlap (int): The number of words to overlap between chunks.
    
    Returns:
        List[str]: A list of text chunks.
    """
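    if overlap >= chunk_size:
        # a non-positive stride would make the loop below run forever
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    total_words = len(words)
    while start < total_words:
        end = start + chunk_size
        # join the current window of words back into a single string
        chunks.append(" ".join(words[start:end]))
        # advance by the stride so consecutive chunks share `overlap` words
        start += chunk_size - overlap
    return chunks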

# Apply chunking to each document in the list
for d in data:
    if d.get('text'):
        d['chunks'] = chunk_text(d['text'], chunk_size=500, overlap=50)

# Print the first few chunks from the first document for verification:
if data and data[0].get('chunks'):
    print("First few chunks of the first document:")
    for i, chunk in enumerate(data[0]['chunks'][:3]):
        print(f"Chunk {i+1}:\n{chunk}\n")


First few chunks of the first document:
Chunk 1:
Proceedings of the 28th International Conference on Computational Linguistics, pages 2284–2295 Barcelona, Spain (Online), December 8-13, 2020 2284 Retrieval-Augmented Controllable Review Generation Jihyeok Kim Yonsei University zizi1532@yonsei.ac.kr Seungtaek Choi Yonsei University hist0613@yonsei.ac.kr Reinald Kim Amplayo University of Edinburgh reinald.kim@ed.ac.uk Seung-won Hwang∗ Yonsei University seungwonh@yonsei.ac.kr Abstract In this paper, we study review generation given a set of attribute identiﬁers which are user ID, product ID and rating. This is a difﬁcult subtask of natural language generation since models are limited to the given identiﬁers, without any speciﬁc descriptive information regarding the inputs, when generating the text. The capacity of these models is thus conﬁned and dependent to how well the models can capture vector representations of attributes. We thus propose to additionally leverage references, which are
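
To make the overlap easier to see than in 500-word chunks, the following sketch applies the same stride rule (`chunk_size - overlap`) to a toy list of twelve words. The helper name `sliding_chunks` and the toy input are assumptions for illustration; this is not the notebook's `chunk_text` function.

```python
def sliding_chunks(words, chunk_size, overlap):
    """Split a list of words into overlapping chunks using a fixed stride."""
    stride = chunk_size - overlap
    return [" ".join(words[start:start + chunk_size])
            for start in range(0, len(words), stride)]

words = [f"w{i}" for i in range(12)]
chunks = sliding_chunks(words, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk starts three words after the previous one, so the last two words of a chunk reappear as the first two words of the next. The final chunk is shorter because the text runs out.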

In [5]:
def count_words(text):
    return len(text.split())


total_words = sum(count_words(d['text']) for d in data if d.get('text'))
print("Total words in all documents:", total_words)


Total words in all documents: 313190


In [None]:
# Function to generate a question for a given text chunk
def generate_question(chunk):
    # Construct the prompt for the language model
    prompt = f"Generate a question that can be answered using the following text:\n\n{chunk}\n\nQuestion:"
    # Use LLM prompt_model function to get the question
    question = prompt_model(prompt)
    return question.strip()

# Loop through each document in the data and generate questions for each chunk
for d in data:
    if d.get('chunks'):
        # Initialize a new list to store questions corresponding to the chunks
        d['chunk_questions'] = []
        for chunk in d['chunks']:
            question = generate_question(chunk)
            d['chunk_questions'].append(question)

# Verify by printing the first question from the first document 
if data and data[0].get('chunk_questions'):
    print("First generated question for the first document:\n", data[0]['chunk_questions'][0])

# Save to a JSON file
with open("chunk_questions.json", "w", encoding="utf-8") as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

First generated question for the first document:
 What is the main contribution of the paper "Retrieval-Augmented Controllable Review Generation" presented at the 28th International Conference on Computational Linguistics?


## Build the vector database

When building the vector database, I stored each generated question together with its source chunk, so that the chunks behind the questions most similar to a user query can be retrieved later.


In [3]:
import chromadb
from sentence_transformers import SentenceTransformer

# Load from the JSON file
with open("chunk_questions.json", "r", encoding="utf-8") as f:
    data = json.load(f)

# Initialize ChromaDB client
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection(name="rag_questions")

# Load the embedding model
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Store questions and their corresponding chunks
for d in data:
    if 'chunk_questions' in d and 'chunks' in d:
        for i, question in enumerate(d['chunk_questions']):
            # named `chunk` so that the chunk_text() function defined earlier is not overwritten
            chunk = d['chunks'][i]
            embedding = embedding_model.encode(question).tolist()  # Convert to list for ChromaDB storage

            # Use a unique ID for each entry
            doc_id = f"{d['filepath']}_chunk_{i}"

            # Store in ChromaDB
            collection.add(
                ids=[doc_id],
                embeddings=[embedding],
                metadatas=[{"question": question, "chunk": chunk}]
            )

print("Vector database has been populated with questions and their corresponding chunks.")


Add of existing embedding ID: data/23-0037.pdf_chunk_25
Add of existing embedding ID: data/23-0037.pdf_chunk_26
Add of existing embedding ID: data/23-0037.pdf_chunk_27
Add of existing embedding ID: data/23-0037.pdf_chunk_28
Add of existing embedding ID: data/23-0037.pdf_chunk_29
Add of existing embedding ID: data/23-0037.pdf_chunk_30
Add of existing embedding ID: data/23-0037.pdf_chunk_31
Add of existing embedding ID: data/23-0037.pdf_chunk_32
Add of existing embedding ID: data/23-0037.pdf_chunk_33
Add of existing embedding ID: data/23-0037.pdf_chunk_34
Add of existing embedding ID: data/23-0037.pdf_chunk_35
Add of existing embedding ID: data/23-0037.pdf_chunk_36
Add of existing embedding ID: data/23-0037.pdf_chunk_37
Add of existing embedding ID: data/23-0037.pdf_chunk_38
Add of existing embedding ID: data/23-0037.pdf_chunk_39
Add of existing embedding ID: data/23-0037.pdf_chunk_40
Add of existing embedding ID: data/23-0037.pdf_chunk_4

Vector database has been populated with questions and their corresponding chunks.
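
The "Add of existing embedding ID" messages suggest that the cell was run against a persistent collection that already held some of these IDs. One way to make the ID scheme easier to check is to collect all IDs and metadata first and validate them before touching the database. The sketch below is a hedged illustration: `build_records` and `toy_data` are hypothetical names, and only the ID format and metadata layout are taken from the cell above.

```python
def build_records(data):
    """Collect IDs and metadata for every (question, chunk) pair in one pass."""
    ids, metadatas = [], []
    for d in data:
        questions = d.get("chunk_questions") or []
        chunks = d.get("chunks") or []
        for i, (question, chunk) in enumerate(zip(questions, chunks)):
            ids.append(f"{d['filepath']}_chunk_{i}")
            metadatas.append({"question": question, "chunk": chunk})
    # IDs must be unique within the batch that is about to be stored
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate IDs within this batch")
    return ids, metadatas

# Made-up documents that follow the structure produced earlier in the notebook.
toy_data = [
    {"filepath": "data/a.pdf", "chunks": ["c0", "c1"], "chunk_questions": ["q0", "q1"]},
    {"filepath": "data/b.pdf", "chunks": ["c0"], "chunk_questions": ["q0"]},
]
ids, metadatas = build_records(toy_data)
print(ids)
```

The resulting lists could be passed to the collection in a single batched call together with the embeddings. If re-running against an existing collection is intended, an upsert-style call may be a better fit than `add`, assuming the installed ChromaDB version provides one.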


In [None]:
# Function to query vector database for similar questions and obtain chunks
def query_vector_database(query, top_k=5):
    query_embedding = embedding_model.encode(query).tolist()
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k
    )
    
    retrieved_chunks = []
    for result in results["metadatas"][0]:
        retrieved_chunks.append(result["chunk"])

    return retrieved_chunks

# Example Query
user_query = "What is retrieval-augmented generation?"
retrieved_chunks = query_vector_database(user_query)

print("Retrieved Chunks:")
for i, chunk in enumerate(retrieved_chunks):
    print(f"Chunk {i+1}:\n{chunk}\n")


Retrieved Chunks:
Chunk 1:
(SMT) (Koehn et al., 2003) models (Simard and Isabelle, 2009; Koehn and Senellart, 2010) and in particular, we inten- sively highlight several popular methods to inte- grating translation memory to NMT models (Gu et al., 2018; Zhang et al., 2018; Xu et al., 2020; He et al., 2021). We also review the applications of retrieval-augmented generation in other genera- tion tasks such as abstractive summarization (Peng et al., 2019), code generation (Hashimoto et al., 2018), paraphrase (Kazemnejad et al., 2020; Su et al., 2021b), and knowledge-intensive generation (Lewis et al., 2020b). Finally, we also point out some promising directions on retrieval-augmented generation to push forward the future research. 2 Retrieval-Augmented Paradigm In this section, we ﬁrst give a general formulation of retrieval-augmented text generation. Then, we discuss three major components of the retrieval- augmented generation paradigm, including the re- arXiv:2202.01110v2 [cs.CL] 13 Fe
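
The retrieved chunk above is only loosely related to the query, which is hard to notice when just the chunk text is returned. A possible refinement is to return each chunk with its distance so that weak matches can be inspected or filtered. The sketch below assumes that the query result also carries a `"distances"` entry laid out like `"metadatas"` (one inner list per query); `chunks_with_distances`, `filter_by_distance`, and `fake_results` are hypothetical and use hand-written values, not real output.

```python
def chunks_with_distances(results):
    """Pair each retrieved chunk with its distance for the first query in `results`."""
    metadatas = results["metadatas"][0]
    distances = results["distances"][0]
    return [(meta["chunk"], dist) for meta, dist in zip(metadatas, distances)]

def filter_by_distance(pairs, max_distance):
    """Keep only the chunks whose distance does not exceed `max_distance`."""
    return [chunk for chunk, dist in pairs if dist <= max_distance]

# Hand-written stand-in for a query result (one query, two hits).
fake_results = {
    "metadatas": [[{"question": "q1", "chunk": "chunk A"},
                   {"question": "q2", "chunk": "chunk B"}]],
    "distances": [[0.21, 0.58]],
}

pairs = chunks_with_distances(fake_results)
print(filter_by_distance(pairs, max_distance=0.5))
```

A suitable threshold depends on the embedding model and distance metric, so it would need to be chosen empirically.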

In [52]:
def generate_answer(query_text):
    # Retrieve relevant chunks from ChromaDB
    retrieved_chunks = query_vector_database(query_text)

    # Print the retrieved chunks
    print("\nRetrieved Chunks:")
    for i, chunk in enumerate(retrieved_chunks):
        print(f"Chunk {i+1}:\n{chunk}\n")

    # Construct a prompt for the LLM
    context = "\n\n".join(retrieved_chunks)
    prompt = f"""Use the following context to answer the query concisely:
    
    {context}
    
    If the answer is not found in the context, respond only with "IDK".
    
    Query: {query_text}
    
    Answer:"""
    
    # Use the LLM to generate an answer
    answer = prompt_model(prompt)
    
    return answer.strip()

# Example user query
user_query = "What are the challenges of attribute-to-text generation?"
answer = generate_answer(user_query)

print("\nGenerated Answer:\n", answer)



Retrieved Chunks:
Chunk 1:
of data-to-text generation. To bridge the gap be- tween the structured data and natural language text, Su et al. (2021a) propose a novel retrieval- augmented framework. Speciﬁcally, given the source data, a set of candidate texts are ﬁrst re- trieved from a large unlabelled corpus. Then, a neural selector is applied to measure the similari- ties between the source data and candidate texts, and extract a set of more ﬁne-grained prototypes from the candidates. Lastly, a generation model takes the prototypes as input to produce the text that describes the given structured data. While retrieval-augmented generation has been widely explored in the NLP community, we sug- gest that future research could extend this approach to tasks that involve data from multiple modali- ties. For instance, with recent advancements in image-text retrieval (Jia et al., 2021; Radford et al., 2021), the structural gap between images and texts is largely bridged. Some early studies (Z
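
The prompt assembly inside `generate_answer` can also be written as a pure function, which makes the exact prompt easy to inspect without calling the model. The sketch below is an illustration rather than the notebook's code: `build_prompt` is a hypothetical helper, and it joins the pieces with blank lines so that the indentation of the source code does not end up in the prompt.

```python
def build_prompt(query_text, retrieved_chunks):
    """Assemble the answer prompt from the query and the retrieved chunks."""
    context = "\n\n".join(retrieved_chunks)
    parts = [
        "Use the following context to answer the query concisely:",
        context,
        'If the answer is not found in the context, respond only with "IDK".',
        f"Query: {query_text}",
        "Answer:",
    ]
    return "\n\n".join(parts)

prompt = build_prompt("What is RAG?", ["chunk one", "chunk two"])
print(prompt)
```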

## Conduct experiments to evaluate user queries


In [6]:
import json

# Define the test queries
test_queries = [
    {
        "query": "For ALMs, what is a tool?",
        "answer": "A tool is an external module that is typically called using a rule or special token and whose output is included in the ALM’s context."
    },
    {
        "query": "What is a major feature of retrieve-then-read?",
        "answer": "The ability to swap in new documents when new information is learned, such as temporally more recent documents, or adding in documents from a new domain to quickly adapt to a new downstream task."
    },
    {
        "query": "The proposed retriever method, IRCoT, can be instantiated from what three ingredients?",
        "answer": "(i) a base retriever that can take a query and return a given number of paragraphs from a corpus or knowledge source; (ii) a language model with zero/few-shot Chain-of-Thought (CoT) generation capabilities; and (iii) a small number of annotated questions with reasoning steps explaining how to arrive at the answer in natural language (chain of thoughts) and a set of paragraphs from the knowledge source that collectively support the reasoning chain and the answer."
    },
    {
        "query": "What problem does augmenting LMs try to address and when is this method limiting?",
        "answer": "Augmenting LMs tries to address the problem of LMs hallucinating and producing factually incorrect output. This method is limited in more general scenarios that involve the generation of long texts."
    },
    {
        "query": "What are symmetric tasks and what are two applicable scenarios?",
        "answer": "Symmetric tasks have to do with queries and documents that have similar semantic meanings but different surface forms. Two applicable scenarios are monolingual semantic textual similarity and bitext retrieval."
    },
    {
        "query": "What is the advantage of using LC instead of RAG, and what is the advantage of using RAG instead of LC?",
        "answer": "The advantage of using LC instead of RAG is that it consistently outperforms RAG in terms of average performance. The advantage of using RAG instead of LC is that it has a much lower cost."
    },
    {
        "query": "For ALMs, what is a hammer?",
        "answer": "IDK"
    },
    {
        "query": "What is a major feature of fetch-then-receive?",
        "answer": "IDK"
    },
    {
        "query": "Does Self-Route help the computer to have a longer life?",
        "answer": "IDK"
    }
]

# Save to a JSON file
with open("test-queries.json", "w") as f:
    json.dump(test_queries, f, indent=4)

print("test-queries.json has been created successfully.")


test-queries.json has been created successfully.


In [53]:
import json

# Load provided queries
with open("provided-queries.json", "r") as f:
    provided_queries = json.load(f)

# Generate answers for provided queries
for q in provided_queries:
    q["generated_answer"] = generate_answer(q["query"])

# Save provided results
with open("generated-provided-answers.json", "w") as f:
    json.dump(provided_queries, f, indent=4)

print("Generated answers for provided queries have been saved to generated-provided-answers.json.")



Retrieved Chunks:
Chunk 1:
and metrics, as well as the current evaluation benchmarks and tools. Additionally, we anticipate future directions for RAG, emphasizing potential enhancements to tackle current challenges. The paper unfolds as follows: Section II introduces the main concept and current paradigms of RAG. The following three sections explore core components—“Retrieval”, “Gen- eration” and “Augmentation”, respectively. Section III focuses on optimization methods in retrieval,including indexing, query and embedding optimization. Section IV concentrates on post- retrieval process and LLM fine-tuning in generation. Section V analyzes the three augmentation processes. Section VI focuses on RAG’s downstream tasks and evaluation system. Sec- tion VII mainly discusses the challenges that RAG currently faces and its future development directions. At last, the paper concludes in Section VIII. II. O VERVIEW OF RAG A typical application of RAG is illustrated in Figure 2. Here, a user pose

In [54]:
# Load test queries
with open("test-queries.json", "r") as f:
    test_queries = json.load(f)

# Generate answers for test queries
for q in test_queries:
    q["generated_answer"] = generate_answer(q["query"])

# Save test results
with open("generated-test-answers.json", "w") as f:
    json.dump(test_queries, f, indent=4)

print("Generated answers for test queries have been saved to generated-test-answers.json.")



Retrieved Chunks:
Chunk 1:
can lead to the correct predictions even though the intermediate reasoning doesn’t make any sense, indicating clear challenges for researchers exploring this direction. • Enhanced capabilities: ALMs with improved reasoning abilities and tools can be more helpful assistants and solve a wider range of tasks than standard LMs. For example, an ALM connected to a python interpreter can run code and experiments on a user’s behalf, which a vanilla LM cannot do. In addition, a feedback loop can emerge between reasoning and acting, where each ability further improves the other (Yao et al., 2022b). Interacting with external tools, entities, and environments can improve reasoning since it allows the ALM to collect additional information and ground itself in the real-world. Similarly, reasoning can improve the ALM’s decision making abilities such as when and how to use a certain tool. Cost of using tools. To the best of our knowledge, the cost of using tools has not yet

In [55]:
from bert_score import score

# Extract ground truth and generated answers for provided queries
provided_ground_truths = [q["answer"] for q in provided_queries]
provided_generated_answers = [q["generated_answer"] for q in provided_queries]

# Compute BERTScore (candidates first, references second)
P_provided, R_provided, F1_provided = score(provided_generated_answers, provided_ground_truths, lang="en")

# Compute average scores
avg_precision_provided = P_provided.mean().item()
avg_recall_provided = R_provided.mean().item()
avg_f1_provided = F1_provided.mean().item()

# Print results
print(f"Provided Queries - Average Precision: {avg_precision_provided:.4f}")
print(f"Provided Queries - Average Recall: {avg_recall_provided:.4f}")
print(f"Provided Queries - Average F1 Score: {avg_f1_provided:.4f}")


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Provided Queries - Average Precision: 0.8460
Provided Queries - Average Recall: 0.8412
Provided Queries - Average F1 Score: 0.8432


In [56]:
print("P_provided:", P_provided)
print("R_provided:", R_provided)
print("F1_provided:", F1_provided)


P_provided: tensor([0.7919, 0.7656, 0.9180, 0.8554, 0.9791, 0.8427, 0.8028, 0.7598, 0.8983])
R_provided: tensor([0.7648, 0.8025, 0.8675, 0.8738, 0.9687, 0.8378, 0.7876, 0.8135, 0.8549])
F1_provided: tensor([0.7781, 0.7836, 0.8921, 0.8645, 0.9739, 0.8402, 0.7951, 0.7858, 0.8760])
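
For readers unfamiliar with the metric, the precision, recall, and F1 values above come from greedy matching between the tokens of the generated answer and the tokens of the reference. The sketch below shows only that matching step on a made-up similarity matrix. It is a simplification under stated assumptions: the real package derives the similarities from contextual embeddings (roberta-large, per the log above), and optional features such as idf weighting and baseline rescaling are left out.

```python
def greedy_match_scores(sim):
    """Compute precision, recall, and F1 from a token similarity matrix.

    sim[i][j] is the similarity between candidate token i and reference token j.
    """
    # precision: each candidate token is matched to its most similar reference token
    precision = sum(max(row) for row in sim) / len(sim)
    # recall: each reference token is matched to its most similar candidate token
    n_ref = len(sim[0])
    recall = sum(max(row[j] for row in sim) for j in range(n_ref)) / n_ref
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Made-up similarities: 2 candidate tokens (rows) by 3 reference tokens (columns).
sim = [
    [0.9, 0.1, 0.2],
    [0.2, 0.8, 0.3],
]
p, r, f1 = greedy_match_scores(sim)
print(f"P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

In this toy case the third reference token has no good match among the candidate tokens, so recall is lower than precision.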


In [57]:
# Extract ground truth and generated answers for test queries
test_ground_truths = [q["answer"] for q in test_queries]
test_generated_answers = [q["generated_answer"] for q in test_queries]

# Compute BERTScore (candidates first, references second)
P_test, R_test, F1_test = score(test_generated_answers, test_ground_truths, lang="en")

# Compute average scores
avg_precision_test = P_test.mean().item()
avg_recall_test = R_test.mean().item()
avg_f1_test = F1_test.mean().item()

# Print results
print(f"Test Queries - Average Precision: {avg_precision_test:.4f}")
print(f"Test Queries - Average Recall: {avg_recall_test:.4f}")
print(f"Test Queries - Average F1 Score: {avg_f1_test:.4f}")


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Test Queries - Average Precision: 0.9207
Test Queries - Average Recall: 0.9315
Test Queries - Average F1 Score: 0.9259


In [58]:
print("P_test:", P_test)
print("R_test:", R_test)
print("F1_test:", F1_test)


P_test: tensor([0.9355, 0.8604, 0.9953, 0.8723, 0.7944, 0.8286, 1.0000, 1.0000, 1.0000])
R_test: tensor([0.9760, 0.8580, 0.9889, 0.9005, 0.7768, 0.8836, 1.0000, 1.0000, 1.0000])
F1_test: tensor([0.9553, 0.8592, 0.9921, 0.8861, 0.7855, 0.8552, 1.0000, 1.0000, 1.0000])
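
The last three test queries expect "IDK", and their scores of 1.0000 are consistent with the generated answer being the same string as the reference. Those perfect scores raise the overall average, so it can help to report the answerable queries separately. The sketch below recomputes the averages from the F1 values printed above; the split into six answerable and three "IDK" queries follows the order of `test_queries`, and the variable names are illustrative.

```python
# F1 values copied from the printed tensor, in the order of test_queries.
f1_test = [0.9553, 0.8592, 0.9921, 0.8861, 0.7855, 0.8552, 1.0000, 1.0000, 1.0000]
# The first six queries have reference answers; the last three expect "IDK".
expected = ["answer"] * 6 + ["IDK"] * 3

answerable = [f for f, e in zip(f1_test, expected) if e != "IDK"]
unanswerable = [f for f, e in zip(f1_test, expected) if e == "IDK"]

mean_all = sum(f1_test) / len(f1_test)
mean_answerable = sum(answerable) / len(answerable)
mean_unanswerable = sum(unanswerable) / len(unanswerable)

print(f"All queries:        {mean_all:.4f}")
print(f"Answerable only:    {mean_answerable:.4f}")
print(f"Unanswerable only:  {mean_unanswerable:.4f}")
```

On the answerable queries alone, the average F1 is about 0.889, compared with 0.926 over all nine test queries.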
