## Assignment 3 - Retrieval-Augmented Generation

In this assignment, you will develop a RAG solution to answer questions about a repository of research papers. The assignment requires you to parse the paper PDF files, chunk and index the data, and then design and execute an evaluation of the retriever results.

In Naïve RAG, the query is compared to the documents in the vector database, and the top N documents that best match the query are retrieved. The language model then summarizes the retrieved documents into an answer to the user query. Research papers are highly structured documents with technically deep content, in contrast to blogs, which contain more general and introductory content. As a result, queries are often unlikely to match the relevant chunks of a paper without additional processing, such as information extraction or summarization.
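The top-N retrieval step of Naïve RAG can be illustrated with a minimal sketch. It assumes a toy bag-of-words vector in place of a learned embedding, and the names `bow`, `cosine`, `top_n`, and `toy_docs` are hypothetical; a real system would use an embedding model and a vector database instead.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector: lowercase word counts (toy stand-in for an embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_n(query, documents, n=2):
    """Return the indices of the n documents most similar to the query."""
    q = bow(query)
    scored = sorted(
        ((cosine(q, bow(doc)), i) for i, doc in enumerate(documents)),
        key=lambda pair: (-pair[0], pair[1]),  # highest score first, ties by index
    )
    return [i for _, i in scored[:n]]


# Hypothetical toy documents
toy_docs = [
    "retrieval augmented generation combines a retriever with a generator",
    "convolutional networks classify images",
    "the retriever returns the top documents for a query",
]
print(top_n("how does the retriever pick documents for a query", toy_docs))  # prints [2, 0]
```

The retrieved documents would then be placed in the prompt as context for the language model.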

One approach to address this problem is to use the language model to generate answerable questions from chunks of each paper. The generated questions can then be indexed as "documents" in a vector database, and the user query can be matched against the most similar questions. By maintaining a mapping between the indexed, generated question and the paper chunk, the retrieval process can then produce the most relevant chunks for use in summarizing an answer to the user query.
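The question-to-chunk mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the assignment's required implementation: word-set (Jaccard) overlap stands in for embedding similarity, and `chunks`, `question_index`, and `retrieve_chunks_via_questions` are hypothetical names with made-up toy data.

```python
def jaccard(a, b):
    """Word-set overlap between two strings (toy stand-in for embedding similarity)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


# Hypothetical toy data: chunk_id -> chunk text
chunks = {
    0: "ITER-RETGEN alternates retrieval and generation over several iterations.",
    1: "BERTScore compares contextual embeddings of candidate and reference text.",
}

# Each generated question records the id of the chunk it was generated from.
question_index = [
    {"question": "how does iter-retgen combine retrieval and generation", "chunk_id": 0},
    {"question": "what does bertscore compare", "chunk_id": 1},
    {"question": "how many iterations does iter-retgen use", "chunk_id": 0},
]


def retrieve_chunks_via_questions(query, top_k=2):
    """Rank indexed questions by similarity, then map them back to unique chunks."""
    ranked = sorted(question_index, key=lambda q: jaccard(query, q["question"]), reverse=True)
    chunk_ids = []
    for entry in ranked[:top_k]:
        if entry["chunk_id"] not in chunk_ids:  # several questions may share one chunk
            chunk_ids.append(entry["chunk_id"])
    return [(cid, chunks[cid]) for cid in chunk_ids]


print(retrieve_chunks_via_questions("how does iter-retgen use retrieval"))
```

Because several questions can point to the same chunk, the number of distinct chunks returned may be smaller than `top_k`.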

## Set up the functions for prompting

In [1]:
from openai import OpenAI
client = OpenAI()

def prompt_model(prompt):
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        store=True,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return completion.choices[0].message.content

## Parse data from source

In [2]:
import os, bibtexparser, pypdf, logging

# silence non-critical errors while parsing PDF files
logging.getLogger("pypdf").setLevel(logging.CRITICAL)

data_path = 'data/'
data = {}

files = os.listdir(data_path)
print('Reading %i files:' % len(files))
for f in files:
    path = os.path.join(data_path, f)

    # each datum will have at least these attributes
    d = {'filepath': None, 'title': None, 'text': None}

    # parse bibtex file, if exists
    if path.endswith('.bib'):
        if path[:-4] in data:
            d = data[path[:-4]]

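        # use a context manager so the file handle is closed after parsing
        with open(path, 'r') as bib_file:
            bib = bibtexparser.load(bib_file)
        # guard against .bib files that contain no entries
        if bib.entries and 'title' in bib.entries[0]:
            d['title'] = bib.entries[0]['title']
            data[path[:-4]] = d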

    # parse pdf text, if exists
    if path.endswith('.pdf'):
        if path[:-4] in data:
            d = data[path[:-4]]

        print('  File: %s' % f)
        text = ''
        reader = pypdf.PdfReader(path)
        for page in reader.pages:
            text += page.extract_text()
        d['filepath'] = path
        d['text'] = text
        data[path[:-4]] = d

data = [d for d in data.values()]

Reading 53 files:
  File: 2023.findings-emnlp.620.pdf
  File: 29728-Article Text-33782-1-2-20240324-3.pdf
  File: 2024.acl-long.642.pdf
  File: 2021.findings-emnlp.320.pdf
  File: 2020.coling-main.207.pdf
  File: 2202.01110v2.pdf
  File: 2212.14024v2.pdf
  File: 2024.emnlp-industry.66.pdf
  File: 8917_Retrieval_meets_Long_Cont.pdf
  File: NeurIPS-2023-lift-yourself-up-retrieval-augmented-text-generation-with-self-memory.pdf
  File: NeurIPS-2023-leandojo-theorem-proving-with-retrieval-augmented-language-models.pdf
  File: NeurIPS-2020-retrieval-augmented-generation-for-knowledge-intensive-nlp-tasks.pdf
  File: 2023.acl-long.557.pdf
  File: tacl_a_00605.pdf
  File: 3637870.pdf
  File: 2023.emnlp-main.495.pdf
  File: 3626772.3657834.pdf
  File: 2402.19473v6.pdf
  File: 3626772.3657957.pdf
  File: 2024.eacl-demo.16.pdf
  File: 967_generate_rather_than_retrieve_.pdf
  File: 23-0037.pdf
  File: 2022.naacl-main.191.pdf
  File: 2312.10997v5.pdf
  File: 947_Augmented_Language_Models_.pdf


In [3]:
# Save to a TXT file
with open("cleaneddata_input.txt", "w", encoding="utf-8") as f:
    for item in data:
        f.write(item['text'] + "\n\n")  # Add spacing between entries

print("TXT file saved successfully!")

TXT file saved successfully!


In [4]:
import pprint

# pprint.pprint(data)

## Pre-process the data

Before chunking and indexing, the data may need to be pre-processed, for example to remove portions of each paper that are irrelevant to queries (such as reference lists). This reduces mismatches between the user query and the index. Pre-processing is not required for this assignment.
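As one illustration of such pre-processing, the sketch below drops a trailing reference list by looking for a line that consists only of a "References" or "Bibliography" heading. The function name `strip_references` and the heading pattern are assumptions; PDF text extraction varies, so the heading may not always appear on its own line.

```python
import re

# A line holding only an optional section number plus "References" or
# "Bibliography" (case-insensitive). Mentions inside a sentence do not match.
REFERENCES_HEADING = re.compile(
    r"(?im)^[ \t]*(?:\d+\.?[ \t]+)?(?:references|bibliography)[ \t]*$"
)


def strip_references(text):
    """Drop everything from the last stand-alone References heading onward."""
    matches = list(REFERENCES_HEADING.finditer(text))
    if not matches:
        return text  # no heading found: leave the text unchanged
    return text[: matches[-1].start()].rstrip()


sample = "Intro\nWe cite references inline.\nReferences\n[1] A. Author. 2020."
print(strip_references(sample))
```

Cutting at the last heading also removes any appendix that follows the reference list, which may or may not be desirable for a given corpus.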

In [5]:
def remove_keys_from_list_of_dicts(list_of_dicts, keys_to_remove):
    for dictionary in list_of_dicts:
        for key in keys_to_remove:
            if key in dictionary:
                del dictionary[key]
    return list_of_dicts

keys_to_remove = ['filepath', 'title']

# Note: the dicts are modified in place, so `data` and `cleaned_data` refer to the same list
cleaned_data = remove_keys_from_list_of_dicts(data, keys_to_remove)

In [6]:
len(cleaned_data)

25

In [7]:
import re

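def clean_text(text):
    """Heuristically strip the leading title block and the text that follows
    the words 'title', 'acknowledgements', and 'references'."""
    # Section names to remove (matched case-insensitively)
    section_names = ['title', 'acknowledgements', 'references']

    # Title: everything from the start of the text up to the first blank line
    remove_patterns = [
        r'(?si)^.*?(?=\n\n)',
    ]

    # Caveat: each pattern below matches the section name anywhere in the text,
    # not only as a heading, and deletes from that word up to the next blank
    # line (or to the end of the text if no blank line follows).
    for name in section_names:
        pattern = rf'(?si){re.escape(name)}.*?(?=\n\n|\Z)'
        remove_patterns.append(pattern)

    # Apply removals
    for pattern in remove_patterns:
        text = re.sub(pattern, '', text)

    return text.strip()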

cleaned_data_result = []
for item in cleaned_data:
    if 'text' in item:
        item['text'] = clean_text(item['text'])
    cleaned_data_result.append(item)

In [8]:
# pprint.pprint(cleaned_data_result)

In [9]:
import csv

output_file = 'cleaned_data.txt'

with open(output_file, 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = cleaned_data_result[0].keys()
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)

    writer.writeheader()
    for row in cleaned_data_result:
        writer.writerow(row)

print(f"Cleaned data has been saved to {output_file}")

Cleaned data has been saved to cleaned_data.txt


## Chunk data and generate indices

User queries are matched against index entries that stand in for the text chunks used to summarize an answer. For this assignment, you may chunk the text and then prompt the model to generate questions that are answerable from each chunk. The generated questions can then be used as the "documents" stored in the vector database.

In [10]:
import nltk
#nltk.download('punkt_tab')

In [11]:
def chunk_text_with_overlap(text, max_tokens=300, overlap=64):
    """
    Splits text into chunks along sentence boundaries using NLTK sentence tokenization.

    Parameters:
    - text (str): The full text to be chunked.
    - max_tokens (int): Approximate maximum chunk size, counted in whitespace-separated words.
    - overlap (int): Number of trailing sentences (not tokens) carried over into the next chunk.

    Returns:
    - List[str]: A list of text chunks.
    """
    sentences = nltk.tokenize.sent_tokenize(text)  # Tokenize by sentence
    chunks = []
    current_chunk = []  # Sentences in the chunk being built
    current_length = 0  # Word count of current_chunk

    for sentence in sentences:
        sentence_length = len(sentence.split())  # Approximate token count (words)
        if current_length + sentence_length > max_tokens:
            chunks.append(" ".join(current_chunk))  # Save chunk
            # Keep the last `overlap` sentences. When a chunk holds fewer than
            # `overlap` sentences, all of it is carried over, so consecutive
            # chunks overlap heavily.
            current_chunk = current_chunk[-overlap:]
            current_length = sum(len(s.split()) for s in current_chunk)

        current_chunk.append(sentence)
        current_length += sentence_length

    if current_chunk:
        chunks.append(" ".join(current_chunk))  # Add the last chunk

    return chunks
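
For comparison, here is a sketch of a variant in which the overlap is budgeted in words instead of sentences. It is not the function used to produce the results in this notebook. The name `chunk_sentences_with_word_overlap` and its parameters are hypothetical, and it takes an already-split list of sentences so that it runs without NLTK.

```python
def chunk_sentences_with_word_overlap(sentences, max_words=300, overlap_words=64):
    """Group sentences into chunks of roughly max_words words.

    Each new chunk starts with the trailing sentences of the previous chunk,
    as many whole sentences as fit within overlap_words words.
    """
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and current_len + n > max_words:
            chunks.append(" ".join(current))
            # Walk backwards, keeping whole sentences while they fit the overlap budget.
            tail, tail_len = [], 0
            for prev in reversed(current):
                prev_len = len(prev.split())
                if tail_len + prev_len > overlap_words:
                    break
                tail.insert(0, prev)
                tail_len += prev_len
            current, current_len = tail, tail_len
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_sentences_with_word_overlap(
    ["a b c", "d e f", "g h i", "j k l"], max_words=6, overlap_words=3
))
```

A single sentence longer than `max_words` is kept whole, so the limit is approximate.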

In [12]:
chunked_data = []
chunk_id_counter = 0  # Initialize a counter for chunk IDs
for doc in cleaned_data:
    text = doc.get("text", "")
    if text:
        chunks = chunk_text_with_overlap(text)
        for chunk in chunks:
            chunked_data.append({"chunk_id": chunk_id_counter, "chunk": chunk})  # Assign unique ID
            chunk_id_counter += 1

In [13]:
print(chunked_data[:3])

[{'chunk_id': 0, 'chunk': 'Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9248–9274\nDecember 6-10, 2023 ©2023 Association for Computational Linguistics\nEnhancing Retrieval-Augmented Large Language Models with Iterative\nRetrieval-Generation Synergy\nZhihong Shao1, Yeyun Gong2, yelong shen3, Minlie Huang1∗, Nan Duan2, Weizhu Chen3\n1 The CoAI Group, DCST, Institute for Artificial Intelligence,\n1 State Key Lab of Intelligent Technology and Systems,\n1 Beijing National Research Center for Information Science and Technology,\n1 Tsinghua University, Beijing 100084, China\n2 Microsoft Research Asia 3 Microsoft Azure AI\nszh19@mails.tsinghua.edu.cn aihuang@tsinghua.edu.cn\nAbstract\nRetrieval-augmented generation has raise exten-\nsive attention as it is promising to address the\nlimitations of large language models including\noutdated knowledge and hallucinations. How-\never, retrievers struggle to capture relevance,\nespecially for queries with complex infor

In [14]:
len(chunked_data)

4434

In [15]:
import openai

# Try question generation on a small sample (first 5 chunks)
sample_chunks = chunked_data[0:5]

generated_questions = []

for chunk_dict in sample_chunks:
    chunk_id = chunk_dict["chunk_id"]  # Use the stored ID rather than the loop index
    chunk_text = chunk_dict["chunk"]

    prompt = f"Generate 3 questions that can be answered from the following text:\n\n{chunk_text}"

    try:
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=150
        )

        # Split the response into lines (blank separator lines are kept)
        questions = response.choices[0].message.content.strip().split("\n")
        generated_questions.append({"chunk_id": chunk_id, "questions": questions})

        print(f"Chunk {chunk_id}: {questions}")  # Debugging

    except Exception as e:
        print(f"Error processing chunk {chunk_id}: {e}")

Chunk 0: ['1. What is the primary benefit of using retrieval-augmented generation as discussed in the findings of EMNLP 2023?', '', '2. How does the ITER-RETGEN method improve the relevance of retrieved knowledge for complex queries?', '', '3. In what areas was the ITER-RETGEN method evaluated according to the text?']
Chunk 1: ['1. What is the primary goal of the retrieval-augmented generation method discussed in the paper "Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval-Generation Synergy"?', '', '2. How does the ITER-RETGEN method differ from previous approaches to interleaving retrieval with generation when completing a single output?', '', '3. In which areas was ITER-RETGEN evaluated, and what were the outcomes of these evaluations compared to state-of-the-art retrieval-augmented baselines?']
Chunk 2: ['1. What is the primary focus of the research presented in the paper titled "Enhancing Retrieval-Augmented Large Language Models with Iterative Retrieval

In [16]:
from openai import AsyncOpenAI
client = AsyncOpenAI()

In [17]:
import asyncio
import re

async def generate_questions_for_batch(batch, batch_index):
    """
    Asynchronously generates questions for a batch of chunks.
    """
    batched_prompt = (
        "You are an AI assistant that generates questions for given text chunks, including titles, metadata, and partial content.\n"
        "For each text chunk below, generate 1-2 diverse questions ONLY.\n"
        "For title or metadata chunks, focus on questions about the topic, authors, or publication details.\n"
        "For partial content, generate questions about the potential full context or significance of the fragment.\n"
        "Start each question with 'Question:' on a new line.\n"
        "Separate questions for different chunks with '---'.\n\n"
        "Here are the chunks:\n\n"
    )

    for chunk_dict in batch:
        batched_prompt += f"{chunk_dict['chunk']}\n---\n\n"
    
    try:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": batched_prompt}],
            temperature=0.7,
            max_tokens=250
        )
        
        response_text = response.choices[0].message.content.strip()
        chunk_responses = response_text.split('---')
        
        chunk_questions = []
        for i, chunk_response in enumerate(chunk_responses):
            # Keep only lines that start with "Question:" and strip that prefix
            questions = [
                q.strip()[len("Question:"):].strip()
                for q in chunk_response.strip().split('\n')
                if q.strip().lower().startswith("question:")
            ]

            if not questions and i < len(batch):
                chunk_text = batch[i]['chunk']
                if "Published in" in chunk_text and "Abstract" in chunk_text:
                    title = chunk_text.split('Abstract')[0].strip().split('\n')[-1]
                    questions = [
                        f"What is the main focus of the survey titled '{title}'?",
                        f"Who are the primary authors of this survey paper?",
                        f"What are the key points discussed in the abstract of this survey?"
                    ]
                elif len(chunk_text.split()) <= 15:  # Short chunk, likely metadata
                    if "Published in" in chunk_text:
                        questions = [f"What is the publication venue and date for this research?"]
                    elif any(word in chunk_text.lower() for word in ['survey', 'benchmarking']):
                        questions = [f"What is the main focus of the study titled '{chunk_text}'?"]
                    else:
                        questions = [f"What is the significance of '{chunk_text}' in the context of this research?"]
                else:  # Longer chunk, likely partial content
                    questions = [f"What are the main ideas presented in the fragment '{chunk_text[:50]}...'?"]

            if i < len(batch):
                chunk_questions.append({
                    "chunk_id": batch[i]['chunk_id'],
                    "questions": questions
                })
        
        return chunk_questions
        
    except Exception as e:
        print(f"Error processing batch {batch_index}: {e}")
        return []

async def generate_questions_parallel(chunked_data, batch_size=5):
    """
    Generates questions asynchronously using parallel API calls.
    """
    tasks = []
    for i in range(0, len(chunked_data), batch_size):
        batch = chunked_data[i:i + batch_size]
        tasks.append(generate_questions_for_batch(batch, i // batch_size))
    
    results = await asyncio.gather(*tasks)  # Run all API calls in parallel
    flattened_results = [item for sublist in results for item in sublist]  # Flatten list
    
    # Final check: add a placeholder question for any chunk the model skipped
    processed_chunk_ids = set(item['chunk_id'] for item in flattened_results)
    for chunk in chunked_data:
        if chunk['chunk_id'] not in processed_chunk_ids:
            flattened_results.append({
                "chunk_id": chunk['chunk_id'],
                "questions": [f"What is the main point of chunk {chunk['chunk_id']} in the document?"]
            })
    
    print(f"Total chunks processed: {len(flattened_results)}")
    return flattened_results
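
Launching every batch at once with `asyncio.gather` can run into API rate limits on a large corpus. One common way to cap the number of in-flight requests is an `asyncio.Semaphore`; the sketch below shows the pattern with a fake request. `run_bounded`, `fake_request`, `demo`, and the limit values are hypothetical and are not part of the code above.

```python
import asyncio


async def run_bounded(coroutine_factories, max_concurrent=8):
    """Run zero-argument coroutine factories with at most max_concurrent in flight.

    Results come back in the same order as the input factories.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(factory):
        async with semaphore:  # wait here while max_concurrent calls are active
            return await factory()

    return await asyncio.gather(*(guarded(f) for f in coroutine_factories))


async def fake_request(i):
    """Hypothetical stand-in for one API call."""
    await asyncio.sleep(0.01)
    return i * 2


async def demo():
    # Bind i as a default argument so each factory keeps its own value.
    factories = [lambda i=i: fake_request(i) for i in range(20)]
    return await run_bounded(factories, max_concurrent=4)
```

In a script this would be run with `asyncio.run(demo())`; in a notebook cell, `await demo()`.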

In [18]:
import nest_asyncio
nest_asyncio.apply()  # Allows running asyncio inside Jupyter

# # Run the function asynchronously
# generated_questions = asyncio.run(generate_questions_parallel(chunked_data, batch_size=5))
generated_questions = await generate_questions_parallel(chunked_data, batch_size=5)

Total chunks processed: 4434


In [19]:
pprint.pprint(generated_questions[:5])  # Print only the first 5 items

[{'chunk_id': 0,
  'questions': ['What are the main contributions of the paper "Enhancing '
                'Retrieval-Augmented Large Language Models with Iterative '
                'Retrieval-Generation Synergy" presented at EMNLP 2023?',
                'Who are the authors of the study and what institutions are '
                'they affiliated with?']},
 {'chunk_id': 1,
  'questions': ['How does the proposed ITER-RETGEN method improve the '
                'relevance of retrieval in comparison to previous approaches?',
                'What specific applications were evaluated to demonstrate the '
                'effectiveness of the ITER-RETGEN method?']},
 {'chunk_id': 2,
  'questions': ['What limitations of large language models are addressed by '
                'the retrieval-augmented generation approach mentioned in the '
                'paper?',
                'What is the significance of the iterative process in the '
                'ITER-RETGEN method for enhancing

In [20]:
len(generated_questions)

4434

In [21]:
len(chunked_data)

4434

## Build the vector database

When building the vector database, be sure to maintain a mapping between the generated questions and their source chunks. This mapping is used later to retrieve the chunks whose indexed questions are most similar to the user query.

You may also add the function to query the vector database that you will use later.
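
One way to keep that mapping explicit is to flatten the per-chunk question lists into parallel `documents`, `metadatas`, and `ids` lists before indexing, with the `chunk_id` stored in each metadata entry. The following is a sketch only: `build_question_records` is a hypothetical helper, and the `"source"` metadata key is an assumption added to tell questions apart from raw chunks.

```python
def build_question_records(generated_questions):
    """Flatten per-chunk question lists into parallel lists for a vector store.

    Returns (documents, metadatas, ids). The chunk_id in each metadata entry is
    the link back from an indexed question to its source chunk.
    """
    documents, metadatas, ids = [], [], []
    for entry in generated_questions:
        chunk_id = entry["chunk_id"]
        for j, question in enumerate(entry["questions"]):
            if not question.strip():
                continue  # skip blank lines left over from splitting model output
            documents.append(question)
            metadatas.append({"chunk_id": chunk_id, "source": "question"})
            ids.append(f"question_{chunk_id}_{j}")  # stable id derived from the chunk
    return documents, metadatas, ids


example = [{"chunk_id": 7, "questions": ["What is RAG?", "", "Who wrote it?"]}]
docs, metas, record_ids = build_question_records(example)
print(record_ids)
```

The three lists match the keyword arguments that the `collection.add` calls in this notebook take; a very large list may need to be added in slices.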

In [22]:
import chromadb
from chromadb.utils import embedding_functions

# Initialize Chroma client (named chroma_client so it does not overwrite the OpenAI client)
chroma_client = chromadb.Client()

# Delete the collection if it already exists
try:
    collection = chroma_client.get_collection("research_papers")
    chroma_client.delete_collection("research_papers")
    print("Collection 'research_papers' has been deleted.")
except chromadb.errors.InvalidCollectionException:
    print("Collection 'research_papers' does not exist.")


# Create a collection
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
collection = chroma_client.create_collection(name="research_papers", embedding_function=embedding_function)

Collection 'research_papers' does not exist.


In [23]:
# First, index the actual text chunks
for chunk in chunked_data:
    collection.add(
        documents=[chunk["chunk"]],  # Store the actual text chunk
        metadatas=[{"chunk_id": chunk["chunk_id"]}],  
        ids=[f"chunk_{chunk['chunk_id']}"]
    )

# Then, index the generated questions
for i, question_set in enumerate(generated_questions):
    chunk_id = question_set['chunk_id']
    for j, question in enumerate(question_set['questions']):
        collection.add(
            documents=[question],
            metadatas=[{"chunk_id": chunk_id}],
            ids=[f"question_{i}_{j}"]
        )

In [24]:
print(f"Total items in collection: {collection.count()}")

Total items in collection: 12492


In [25]:
def retrieve_chunks(query, top_k=3):
    results = collection.query(
        query_texts=[query],
        n_results=top_k,
        include=["metadatas"]  # Ensure we're retrieving metadata
    )

    retrieved_chunk_ids = []

    # Check if metadata is properly structured
    if "metadatas" in results and results["metadatas"]:
        for metadata in results["metadatas"][0]:  # Access metadata from first query result
            if 'chunk_id' in metadata:
                retrieved_chunk_ids.append(metadata['chunk_id'])  # Extract correct chunk_id
            else:
                print(f"Warning: No chunk_id found in metadata for result: {metadata}")
    else:
        print("Error: No metadata retrieved from vector DB.")

    print(f"Final retrieved chunk IDs: {retrieved_chunk_ids}")  # Debugging output
    return retrieved_chunk_ids

In [26]:
query = "What is the main contribution of the RAG survey paper?"
chunk_ids = retrieve_chunks(query)
print(f"Retrieved Chunk IDs: {chunk_ids}")

Final retrieved chunk IDs: [363, 358, 328]
Retrieved Chunk IDs: [363, 358, 328]


In [27]:
# Convert list to dictionary for quick lookups
chunked_data_dict = {entry['chunk_id']: entry['chunk'] for entry in chunked_data}

In [28]:
query = "What is the main contribution of the RAG survey paper?"
chunk_ids = retrieve_chunks(query)
print(f"Retrieved Chunk IDs: {chunk_ids}")

print("\nRetrieved Chunks:")
for chunk_id in chunk_ids:
    if chunk_id in chunked_data_dict:
        print(f"Chunk {chunk_id}: {chunked_data_dict[chunk_id][:300]}...")  # Print first 300 chars
    else:
        print(f"Chunk {chunk_id} not found in chunked_data_dict")

Final retrieved chunk IDs: [363, 358, 328]
Retrieved Chunk IDs: [363, 358, 328]

Retrieved Chunks:
Chunk 363: Overall, this
paper sets out to meticulously compile and categorize the
foundational technical concepts, historical progression, and
the spectrum of RAG methodologies and applications that
have emerged post-LLMs. It is designed to equip readers and
professionals with a detailed and structured unders...
Chunk 358: As research
progressed, the enhancement of RAG was no longer limited
to the inference stage but began to incorporate more with LLM
fine-tuning techniques. The burgeoning field of RAG has experienced swift growth,
yet it has not been accompanied by a systematic synthesis that
could clarify its broade...
Chunk 328: 1
Retrieval-Augmented Generation for Large
Language Models: A Survey
Yunfan Gaoa, Yun Xiongb, Xinyu Gao b, Kangxiang Jia b, Jinliu Pan b, Yuxi Bic, Yi Dai a, Jiawei Sun a, Meng
Wangc, and Haofen Wang a,c
aShanghai Research Institute for Intelligent Autonomous 

In [30]:
def generate_answer(query, chunk_ids):
    chunks = []
    for chunk_id in chunk_ids:
        try:
            chunk_id = int(chunk_id)  # Convert to integer
            if chunk_id in chunked_data_dict:
                chunks.append(chunked_data_dict[chunk_id])  # Fetch chunk text
            else:
                print(f"Warning: Chunk ID {chunk_id} not found in chunked_data_dict")
        except ValueError:
            print(f"Invalid chunk_id: {chunk_id} - Skipping.")
            continue

    if not chunks:
        return "IDK"

    context = "\n".join(chunks)

    # Debugging prints (without full context)
    # print("----- Final Prompt Structure -----")
    # print(f"Context: [Hidden, length: {len(context)} characters]")
    # print(f"Question: {query}")
    # print("Answer: [Generated by model]")
    # print("----------------------------------")

    prompt = f"""Given the following context, answer the question. If the answer cannot be found in the context, respond with 'IDK'.

Context:
{context}

Question: {query}

Answer:"""

    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=250
    )

    return response.choices[0].message.content.strip()

In [31]:
def rag_query(query):
    chunk_ids = retrieve_chunks(query)
    answer = generate_answer(query, chunk_ids)
    return answer

In [32]:
import json

with open('dev-questions.json', 'r') as f:
    dev_queries = json.load(f)

In [33]:
def test_rag_system(queries):
    generated_answers = []
    ground_truth_answers = []
    for query in queries:
        answer = rag_query(query['query'])
        print(f"Query: {query['query']}")
        print(f"Generated Answer: {answer}")
        print(f"Ground Truth: {query['answer']}")
        print("---")
        generated_answers.append(answer)
        ground_truth_answers.append(query['answer'])  # Append the ground truth
    return generated_answers, ground_truth_answers # Return both lists

In [34]:
import json
test_queries = [
  {
    "query": "What does RAG do?",
    "answer": "RAG retrieves relevant information based on the query and then prompts an LLM to generate a response in the context of the retrieved information."
  },
  {
    "query": "What are the names of the two retrievers used in the study?",
    "answer": "Contriever and Dragon"
  },
  {
    "query": "What is SELF-ROUTE?",
    "answer": "SELF-ROUTE utilizes LLM itself to route queries based on self-reflection, under the assumption that LLMs are well-calibrated in predicting whether a query is answerable given provided context."
  },
  {
    "query": "What is the main goal of the research described in the paper?",
    "answer": "The research aims to compare Retrieval Augmented Generation (RAG) and long-context Large Language Models (LLMs) to understand their strengths and weaknesses, and to combine them effectively."
  },
  {
    "query": "What advantage does RAG still have over long-context LLMs?",
    "answer": "Despite potentially lower performance, RAG offers a significantly lower computational cost compared to long-context LLMs."
  },
  {
    "query": "How does SELF-ROUTE decide whether to use RAG or a long-context LLM?",
    "answer": "SELF-ROUTE uses the LLM's own assessment of whether a query is answerable given the retrieved context to decide whether to use RAG or a long-context LLM."
  },
  {
    "query": "How does SELF-ROUTE perform on visual question answering tasks?",
    "answer": "IDK"
  },
  {
    "query": "What is the impact of SELF-ROUTE on the carbon footprint of LLM applications?",
    "answer": "IDK"
  },
  {
    "query": "Can SELF-ROUTE be applied to other modalities beyond text, such as audio or video?",
    "answer": "IDK"
  }
]

with open("test_queries.json", "w") as file:
    json.dump(test_queries, file, indent=4)

In [35]:
# Test with dev queries
dev_generated_answers, dev_ground_truth_answers = test_rag_system(dev_queries)
print("----------------------------- test_queries -----------------------------")
# Test with your own queries
generated_answers, ground_truth_answers = test_rag_system(test_queries)

Final retrieved chunk IDs: [340, 347, 378]
Query: What is RAG?
Generated Answer: RAG stands for Retrieval-Augmented Generation, which is a methodology that enhances Large Language Models (LLMs) by incorporating knowledge from external databases. This approach aims to improve the accuracy and credibility of generated responses, particularly for knowledge-intensive tasks, by retrieving relevant document chunks through semantic similarity calculations. RAG effectively reduces issues like hallucinations and outdated knowledge by referencing external information, allowing for continuous knowledge updates and integration of domain-specific content. The RAG framework includes three main components: retrieval, generation, and augmentation.
Ground Truth: RAG stands for Retrieval Augmented Generation. It is a method used in large language models (LLMs) that retrieves relevant information based on a query and then prompts an LLM to generate a response in the context of the retrieved information. 

## Conduct experiments to evaluate user queries

Report your average precision, recall and F1 score. You are welcome to sample the model multiple times for each query when computing your average, or you may sample once per query.
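
The averaging itself can be sketched independently of the scoring model. The example below substitutes a simple token-overlap precision, recall, and F1 for BERTScore so that it runs without downloading a model. It averages over the samples for each query first and then over queries. `token_prf` and `average_scores` are hypothetical names, and each query is assumed to have at least one sample.

```python
from collections import Counter


def token_prf(candidate, reference):
    """Token-overlap precision, recall, and F1 between two strings."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


def average_scores(samples_per_query, references):
    """Average P/R/F1 over the samples of each query, then over queries."""
    per_query = []
    for samples, reference in zip(samples_per_query, references):
        scores = [token_prf(s, reference) for s in samples]
        per_query.append(tuple(sum(col) / len(scores) for col in zip(*scores)))
    return tuple(sum(col) / len(per_query) for col in zip(*per_query))


# Two queries: the first sampled twice, the second once
samples = [["contriever and dragon", "contriever"], ["idk"]]
refs = ["contriever and dragon", "idk"]
p, r, f1 = average_scores(samples, refs)
print(f"Precision: {p:.4f}, Recall: {r:.4f}, F1: {f1:.4f}")
```

Averaging within each query first keeps a query that was sampled many times from outweighing one that was sampled once.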

In [36]:
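from bert_score import BERTScorer

# Note: this scorer is not used by the cells below. evaluate_bertscores calls
# bert_score.score(lang="en"), which loads roberta-large (see the log output).
scorer = BERTScorer(model_type='bert-base-uncased')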

In [37]:
from bert_score import score

def evaluate_bertscores(generated_answers, ground_truth_answers):
    P, R, F1 = score(generated_answers, ground_truth_answers, lang="en", verbose=True)
    return P.mean().item(), R.mean().item(), F1.mean().item()

In [41]:
filtered_refs = []
filtered_preds = []
for ref, pred in zip(dev_ground_truth_answers, dev_generated_answers):
    if ref.strip():  # Keep only non-empty references
        filtered_refs.append(ref)
        filtered_preds.append(pred)

In [42]:
# Evaluate on Dev Queries
print("----------------------------- DEV QUERIES -----------------------------")
dev_precision, dev_recall, dev_f1 = evaluate_bertscores(filtered_preds, filtered_refs)
print(f"BertScore (Dev Queries) - Precision: {dev_precision:.4f}, Recall: {dev_recall:.4f}, F1: {dev_f1:.4f}")

----------------------------- DEV QUERIES -----------------------------


Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


calculating scores...
computing bert embedding.



computing greedy matching.



done in 2.10 seconds, 3.81 sentences/sec
BertScore (Dev Queries) - Precision: 0.8711, Recall: 0.8797, F1: 0.8750


In [40]:
for i, ref in enumerate(dev_ground_truth_answers):
    if not ref.strip():  # Check for empty or whitespace-only strings
        print(f"Empty reference at index {i}")

Empty reference at index 7


In [39]:
# Evaluate on Test Queries
print("----------------------------- TEST QUERIES -----------------------------")
test_precision, test_recall, test_f1 = evaluate_bertscores(generated_answers, ground_truth_answers)
print(f"BertScore (Test Queries) - Precision: {test_precision:.4f}, Recall: {test_recall:.4f}, F1: {test_f1:.4f}")

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


----------------------------- TEST QUERIES -----------------------------
calculating scores...
computing bert embedding.



computing greedy matching.



done in 1.12 seconds, 8.03 sentences/sec
BertScore (Test Queries) - Precision: 0.9159, Recall: 0.9382, F1: 0.9266
