## Retrieval-Augmented Generation (RAG) (40 points)

The goal of this assignment is to gain hands-on experience with aspects of **Retrieval-Augmented Generation (RAG)**, with a focus on retrieval. You will use **LangChain**, a framework that simplifies integrating external knowledge into generation tasks by:

- Implementing various vector databases for efficient neural retrieval. You will use a vector database to store the embedded chunks of the meeting transcripts.
- Allowing seamless integration of pretrained text encoders, which you will access via HuggingFace models. You will use a text encoder to get text embeddings for storing in the vector database.

**Data**  
You will build a retrieval system using the [QMSum Dataset](https://github.com/Yale-LILY/QMSum), a human-annotated benchmark designed for question answering on long meeting transcripts. The dataset includes over 230 meetings across multiple domains.


# RAG Workflow

Retrieval-Augmented Generation (RAG) systems involve several interconnected components. Below is a RAG workflow diagram from Hugging Face. Areas highlighted in blue indicate opportunities for system improvement.

In this assignment, we will focus on the **Retriever**, so the PA does not cover any processes starting from "2. Reader" and below.









# First, install the required dependencies.

In [1]:
# pip install -q torch transformers langchain_chroma bitsandbytes langchain langchain_huggingface langchain-community sentence-transformers  pacmap tqdm matplotlib

In [2]:
from tqdm.notebook import tqdm
import pandas as pd
import os
import csv
import sys
import numpy as np
import time
import random
from typing import Optional, List, Tuple
import matplotlib.pyplot as plt
import textwrap
import torch

seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Disable Hugging Face tokenizers parallelism
os.environ["TOKENIZERS_PARALLELISM"] = "false"


# Load the meetings dataset

In [3]:
from langchain.docstore.document import Document
import csv
import sys

def set_csv_field_limit():
    maxInt = sys.maxsize
    while True:
        try:
            csv.field_size_limit(maxInt)
            break
        except OverflowError:
            maxInt = int(maxInt/10)
    return maxInt

def load_documents(doc_file):
    """
    Loads the document contents from the first file.

    :param doc_file: Path to the document file (document ID <TAB> document contents).
    :return: A dictionary {document_id: document_contents}.
    """
    # Set the field size limit first
    set_csv_field_limit()

    documents = {}
    with open(doc_file, 'r', encoding='utf-8') as f:
        reader = csv.reader(f, delimiter='\t')
        for row in reader:
            if len(row)==0: continue
            doc_id, content = row
            documents[doc_id] = content
    return documents

# Load and process the documents
docs = []
doc_file = 'meetings.tsv'
documents = load_documents(doc_file)

for doc_id in documents:
    doc = Document(page_content=documents[doc_id])
    metadata = {'source': doc_id}
    doc.metadata = metadata
    docs.append(doc)

print(f"Total meetings (docs): {len(documents)}")

Total meetings (docs): 230


# Retriever - Building the retriever 🗂️

The **retriever functions like a search engine**: given a user query, it returns relevant documents from the knowledge base.

These documents are then used by the Reader model to generate an answer. In this assignment, however, we are only focusing on the retriever, not the Reader model.

**Our goal:** Given a user question, find the most relevant documents from the knowledge base.

Key parameters:
- `top_k`: The number of documents to retrieve. Increasing `top_k` can improve the chances of retrieving relevant content.
- `chunk_size`: The length of each chunk. While this can vary, avoid overly long chunks, as too many tokens can overwhelm most reader models.


LangChain __offers a huge variety of options for vector databases and allows us to keep document metadata throughout the processing__.

### 1. Specify an Embedding Model and Visualize Document Lengths


In [4]:
EMBEDDING_MODEL_NAME = "thenlper/gte-small"

from sentence_transformers import SentenceTransformer

print(
    f"Model's maximum sequence length: {SentenceTransformer(EMBEDDING_MODEL_NAME).max_seq_length}"
)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(EMBEDDING_MODEL_NAME)
lengths = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(docs)]

# # Plot the distribution of document lengths, counted as the number of tokens
# fig = pd.Series(lengths).hist()
# plt.title("Distribution of document lengths in the knowledge base (in count of tokens)")
# plt.show()

Model's maximum sequence length: 512



### 2. Split the Documents into Chunks

The documents (meeting transcripts) are very long—some up to 30,000 tokens! To make retrieval effective, we’ll **split each document into smaller, semantically meaningful chunks**. These chunks will serve as the snippets the retriever compares to the query, returning the `top_k` most relevant ones.

**Objective**: Create Semantically Relevant Snippets

Chunks should be long enough to capture complete ideas but not so lengthy that they lose focus.

We will use LangChain's implementation of recursive chunking, `RecursiveCharacterTextSplitter`.
- Parameter `chunk_size` controls the length of individual chunks. By default, this length is counted as the number of characters in the chunk.
- Parameter `chunk_overlap` makes adjacent chunks share some text. This reduces the probability that an idea is cut in half by the split between two adjacent chunks.
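
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-width character chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph and line breaks, so its chunks will generally differ from what this produces. The helper name `chunk_text` is hypothetical and only used for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into character windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    Simplified illustration only, not LangChain's algorithm.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


example_chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(example_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk.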

If you uncomment the plotting code in the cell below, the resulting histogram shows that the chunk length distribution now looks much better!

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 768,
    chunk_overlap = 128,
)

doc_snippets = text_splitter.split_documents(docs)
print(f"Total {len(doc_snippets)} snippets to be stored in our vector store.")

lengths = [len(tokenizer.encode(doc.page_content)) for doc in tqdm(doc_snippets)]

# Plot the distribution of document snippet lengths, counted as the number of tokens
# fig = pd.Series(lengths).hist()
# plt.title("Distribution of document lengths in the knowledge base (in count of tokens)")
# plt.show()

Total 18070 snippets to be stored in our vector store.



### 3. Build the Vector Database

To enable retrieval, we need to compute embeddings for all chunks in our knowledge base. These embeddings will then be stored in a vector database.

#### How Retrieval Works

A query is embedded using an embedding model and a similarity search finds the closest matching chunks in the vector database.
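
As a rough illustration of this step, the sketch below ranks a few toy vectors by cosine similarity to a query vector. The vectors and the helper names (`normalize`, `top_k_by_cosine`) are made up for the example. A real vector database compares the query embedding against the stored chunk embeddings in the same spirit, but far more efficiently.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so that a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k_by_cosine(query_vec, chunk_vecs, k):
    """Return the indices of the k chunk vectors most similar to the query."""
    q = normalize(query_vec)
    scored = []
    for idx, vec in enumerate(chunk_vecs):
        c = normalize(vec)
        score = sum(a * b for a, b in zip(q, c))  # cosine similarity
        scored.append((score, idx))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional "embeddings" (assumption: real embeddings have hundreds of dimensions)
toy_chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
toy_query_vec = [2.0, 1.0]

top_indices = top_k_by_cosine(toy_query_vec, toy_chunk_vecs, k=2)
print(top_indices)  # [2, 0]
```

Normalizing the vectors first is what lets a plain dot product act as cosine similarity, which is the reason for `normalize_embeddings` being set to `True` in the embedding model configuration.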

The following cell builds the vector database consisting of all chunks in our knowledge base.


In [6]:
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain_community.vectorstores.utils import DistanceStrategy

# Automatically set the device to 'cuda' if available, otherwise use 'cpu'
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Found device: {device}")


embedding_model = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL_NAME,
    multi_process=True,
    model_kwargs={"device": device},
    encode_kwargs={"normalize_embeddings": True},  # Set `True` for cosine similarity
)

start_time = time.time()

KNOWLEDGE_VECTOR_DATABASE = FAISS.from_documents(
    doc_snippets, embedding_model, distance_strategy=DistanceStrategy.COSINE
)

end_time = time.time()

elapsed_time = (end_time - start_time)/60
print(f"Time taken: {elapsed_time} minutes")


Found device: cuda
Time taken: 0.3240078091621399 minutes


### 4. Querying the Vector Database


Using LangChain’s vector database, the function `vector_database.similarity_search(query)` implements a Bi-Encoder (covered in class): it encodes the query and each document independently into single-vector representations, which allows document embeddings to be precomputed.

Let's define the Bi-Encoder ranking function and then use it on a sample query from the QMSum dataset.
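
The precomputation idea can be sketched with a toy stand-in for the text encoder. Everything here is an assumption made for illustration: `toy_encode` is a bag-of-words counter over a tiny made-up vocabulary, not a neural encoder, and `ToyBiEncoderIndex` is a hypothetical class. The point is only the structure: chunk vectors are computed once when the index is built, and only the query is encoded at search time.

```python
import math
from collections import Counter

# Tiny made-up vocabulary for the toy encoder (assumption, illustration only)
VOCAB = ["budget", "remote", "control", "quality", "plan"]


def toy_encode(text):
    """Stand-in encoder: L2-normalized word counts over VOCAB."""
    counts = Counter(text.lower().split())
    vec = [float(counts[word]) for word in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid division by zero
    return [x / norm for x in vec]


class ToyBiEncoderIndex:
    def __init__(self, chunks):
        self.chunks = chunks
        # Chunk vectors are computed once, independently of any query
        self.chunk_vecs = [toy_encode(chunk) for chunk in chunks]

    def search(self, query, k):
        # Only the query needs to be encoded at search time
        q = toy_encode(query)
        scores = [sum(a * b for a, b in zip(q, v)) for v in self.chunk_vecs]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        return [self.chunks[i] for i in order[:k]]


index = ToyBiEncoderIndex([
    "the remote control budget",
    "a plan for quality assurance",
    "remote control buttons",
])
results = index.search("quality plan", k=1)
print(results)  # ['a plan for quality assurance']
```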



In [7]:
## The function for ranking documents given a query:
def rank_documents_biencoder(user_query, top_k = 5):
    """
    Function for document ranking based on the query.

    :param user_query: The query to retrieve documents for.
    :param top_k: The number of chunks to retrieve.
    :return: A list of document IDs, one per retrieved chunk, ranked by similarity to the query.
    """
    retrieved_docs = KNOWLEDGE_VECTOR_DATABASE.similarity_search(query=user_query, k=top_k)
    # Each retrieved chunk carries the ID of the meeting it came from
    ranked_list = [doc.metadata['source'] for doc in retrieved_docs]

    return ranked_list  # ranked document IDs.


user_query = "what did kirsty williams am say about her plan for quality assurance ?"
retrieved_docs = rank_documents_biencoder(user_query)

print("\n==================================Top-5 documents==================================")
print("\n\nRetrieved documents:", retrieved_docs)
print("\n====================================================================\n")




Retrieved documents: ['doc_211', 'doc_2', 'doc_43', 'doc_160', 'doc_43']




### <font color="red">5. TODO: Implementation of ColBERT as a Reranker for a Bi-Encoder (35 points)</font>

The Bi-Encoder’s ranking for the sample query is not optimal: the ground truth document is not ranked at position 1; instead, **doc_211** is ranked at position 1. To determine the correct document ID for this query, refer to the `questions_answers.tsv` file.

In this task, you will implement the [ColBERT](https://arxiv.org/pdf/2004.12832) approach by Khattab and Zaharia. We’ll use a simplified version of ColBERT, focusing on the following key steps:

1. Retrieve the top \( K = 15 \) documents for query \( q \) using the Bi-Encoder.
2. Re-rank these top \( K = 15 \) documents using ColBERT's fine-grained interaction scoring. This will involve:
   - Using frozen BERT embeddings from a HuggingFace BERT model (no training is required, thus our version is not expected to work as well as full-fledged ColBERT).
   - Calculating scores based on fine-grained token-level interactions between the query and each document.
3. Implement the method `rank_documents_finegrained_interactions()` to perform this re-ranking.
   - Test your method on the same query as in the cell from #4 above.
   - Print out the entire re-ranked list of 5 document IDs, as done in #4 above (the code below does it for you).
4. Ensure that your ColBERT implementation ranks the correct document at position 1 for the sample query.


***Note***: Since the same document is divided into multiple chunks that retain the original document ID, you may see the same document ID appear multiple times in your top_k results. However, each instance refers to a different chunk of the document's content.

***Note2***:  For this PA we are not focused on query latency, just the late interactions part in the ColBERT approach. Thus, we don't have to pre-compute document matrix representations for ColBERT.
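
The late-interaction score from the ColBERT paper sums, over query tokens, the best match among the document tokens: \( S(q, d) = \sum_{i} \max_{j} \, q_i \cdot d_j \). The sketch below works through that formula on hand-made 2-dimensional token vectors. The helper name `maxsim_score` and the vectors are hypothetical; in the assignment the token vectors would come from BERT's `last_hidden_state`.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Sum over query tokens of the maximum dot product with any document token."""
    total = 0.0
    for q in query_vecs:
        # Best-matching document token for this query token
        best = max(sum(a * b for a, b in zip(q, d)) for d in doc_vecs)
        total += best
    return total


# Two query tokens, each pointing along a different axis (toy values)
toy_query = [[1.0, 0.0], [0.0, 1.0]]

# doc_a has a good match for both query tokens; doc_b only matches the first
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[1.0, 0.0], [1.0, 0.0]]

score_a = maxsim_score(toy_query, doc_a)
score_b = maxsim_score(toy_query, doc_b)
print(score_a, score_b)  # 2.0 1.0
```

Because every query token is scored separately, a document that covers only part of the query is penalized, which a single-vector comparison cannot express as directly.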


In [8]:
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel


# Load tokenizer and model BERT from HuggingFace
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")


def rank_documents_finegrained_interactions(user_query, shortlist = 15, top_k=5):

    """
    Rerank the top-K=15 retrieved documents from Bi-encoder using fine-grained token-level interactions
    and return the top_k=5 most similar documents.

    Args:
    - user_query (str): The user query string.
    - shortlist (int): Number of documents the Bi-Encoder retrieves before re-ranking.
    - top_k (int): Number of top reranked documents to return.

    Returns:
    - ranked_list of document IDs.
    """

    retrieved_docs = KNOWLEDGE_VECTOR_DATABASE.similarity_search(query=user_query, k=shortlist)


    # Tokenize the user query
    query_inputs = tokenizer(user_query, return_tensors='pt', truncation=True, padding=True)

    # Get query token embeddings from BERT
    with torch.no_grad():
        query_embeddings = model(**query_inputs).last_hidden_state  # Shape: (1, seq_len_query, hidden_dim)

    ranked_list = []

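    ### YOUR CODE HERE
    ranked_scores = []
    for doc in retrieved_docs:
        # Tokenize the document chunk (truncated to BERT's maximum input length)
        doc_inputs = tokenizer(doc.page_content, return_tensors='pt', truncation=True, padding=True)

        # Get document token embeddings; no gradients are needed for inference
        with torch.no_grad():
            doc_embeddings = model(**doc_inputs).last_hidden_state  # Shape: (1, seq_len_doc, hidden_dim)

        # Dot products between every query token and every document token
        similarity_matrix = torch.matmul(query_embeddings, doc_embeddings.transpose(-2, -1))  # Shape: (1, seq_len_query, seq_len_doc)

        # MaxSim: best-matching document token per query token, summed over query tokens
        max_similarities = torch.max(similarity_matrix, dim=-1).values
        doc_score = max_similarities.sum().item()
        # Alternative that was tried: mean instead of max
        # mean_similarities = torch.mean(similarity_matrix, dim=-1)
        # doc_score = mean_similarities.sum().item()

        ranked_scores.append((doc_score, doc.metadata['source']))

    # Sort by score only, highest first
    ranked_scores.sort(key=lambda pair: pair[0], reverse=True)

    ranked_list = [doc_id for _, doc_id in ranked_scores[:top_k]]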


    return ranked_list  # ranked document IDs


user_query = "how did project manager and user interface introduce the prototype of the remote control ?"
retrieved_docs = rank_documents_finegrained_interactions(user_query)

print("\n==================================Top-5 documents==================================")
print("\n\nRetrieved documents:", retrieved_docs)
print("\n====================================================================\n")




Retrieved documents: ['doc_166', 'doc_9', 'doc_218', 'doc_56', 'doc_15']




### <font color="green">  7.  (Optional) Full evaluation pipeline for your own exploration. </font>


For this assignment, we only ask you to explore one sample query.
Running on many queries is very slow without sufficient compute.
If you have the compute and/or the time to wait, below is a more complete evaluation setup that works with all the queries in the QMSum dataset and reports the `precision@k=5` metric.
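
The metric can be illustrated on a few made-up rankings. In this setup each query has a single ground truth document, so the per-query score is binary: 1 if that document appears in the top `k`, 0 otherwise. The helper name `hit_at_k` and the toy document IDs are hypothetical.

```python
def hit_at_k(ground_truth_id, ranked_ids, k):
    """Return 1 if the ground truth document appears in the top k results, else 0."""
    return 1 if ground_truth_id in ranked_ids[:k] else 0


# (ground truth ID, ranked list returned by a retriever) for four toy queries
toy_runs = [
    ("doc_1", ["doc_3", "doc_1", "doc_7"]),
    ("doc_2", ["doc_5", "doc_6", "doc_2"]),
    ("doc_4", ["doc_4", "doc_9", "doc_8"]),
    ("doc_5", ["doc_1", "doc_2", "doc_3"]),
]

toy_scores = [hit_at_k(gt, ranked, k=2) for gt, ranked in toy_runs]
toy_mean = sum(toy_scores) / len(toy_scores)
print(toy_scores, toy_mean)  # [1, 0, 1, 0] 0.5
```

The second query scores 0 because its ground truth document sits at position 3, just outside the cutoff of 2.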


**Note**: to run the evaluation, uncomment the commented-out lines at the end of the cell below.

In [9]:

def load_questions_answers(qa_file):
    """
    Loads the questions and corresponding ground truth document IDs.

    :param qa_file: Path to the question-answer file (document ID <TAB> question <TAB> answer).
    :return: A list of tuples [(document_id, question, answer)].
    """
    qa_pairs = []
    with open(qa_file, 'r', encoding='utf-8') as f:
        reader = csv.reader(f, delimiter='\t')
        for row in reader:
            doc_id, question, answer = row
            qa_pairs.append((doc_id, question, answer))

    random.shuffle(qa_pairs)

    return qa_pairs

def precision_at_k(ground_truth, retrieved_docs, k):
    """
    Computes Precision at k for a single query: 1 if the ground truth document appears in the top k, else 0.

    :param ground_truth: The name of the ground truth document.
    :param retrieved_docs: The list of document names returned by the model in ranked order.
    :param k: The cutoff for computing Precision.
    :return: Precision at k.
    """
    return 1 if ground_truth in retrieved_docs[:k] else 0

def evaluate(doc_file, qa_pairs, ranking_function=None, k=5):
    """
    Evaluate the retrieval system on the question-answer pairs.

    :param doc_file: Path to the document file (unused here; kept so existing calls still work).
    :param qa_pairs: A list of tuples [(document_id, question, answer)].
    :param ranking_function: Function mapping a question to a ranked list of document IDs.
    :param k: The cutoff for Precision@k.
    """
    precision_scores = []

    for doc_id, question, _ in qa_pairs:
        retrieved_docs = ranking_function(question)
        precision_scores.append(precision_at_k(doc_id, retrieved_docs, k))

        # Report the running average every 10 queries
        if len(precision_scores) % 10 == 0:
            avg_precision_at_k = sum(precision_scores) / len(precision_scores)
            print(f"After {len(precision_scores)} queries, Precision@{k}: {avg_precision_at_k}")

    # Compute average Precision@k over all queries
    avg_precision_at_k = sum(precision_scores) / len(precision_scores)

    print(f"Precision@{k}: {avg_precision_at_k}")


qa_file = 'questions_answers.tsv'  # document ID <TAB> question <TAB> answer
qa_pairs = load_questions_answers(qa_file)
print(len(qa_pairs))
qa_pairs[:3]

# start_time = time.time()
# evaluate(doc_file, qa_pairs,rank_documents_biencoder)
# end_time = time.time()
# elapsed_time = (end_time - start_time)/60
# print(f"Time taken: {elapsed_time} minutes")

1152


[('doc_32',
  "why did n't the team believe that the remote control could fully depend on speech recognition and have no buttons ?",
  'age group data for remote control use was not available ; many people may not want to learn to use the new remote control ; some buttons are still needed , such as channel control , volume settings and on/off .'),
 ('doc_162',
  'what was agreed upon on sample transcripts ?',
  'to save time , speaker mn005 will only mark the sample of transcribed data for regions of overlapping speech , as opposed to marking all acoustic events . the digits extraction task will be delegated to whomever is working on acoustics for the meeting recorder project .'),
 ('doc_116',
  'what did user interface propose in the discussion about buttons when discussing the functions ?',
  'user interface proposed that there should be six or seven buttons for the same number of categories . users could use these buttons to choose hundreds of channels . these buttons could be navig

In [10]:
import torch
import numpy as np
from tqdm import tqdm
import time

def batch_encode_queries_v2(queries, embedding_model, batch_size=256):
    """
    Optimized batch encoding with larger batches and better GPU utilization
    """
    # Pre-allocate memory for all embeddings
    num_queries = len(queries)
    embedding_dim = 384  # We know this from the output
    all_embeddings = np.zeros((num_queries, embedding_dim), dtype=np.float32)
    
    # Process in larger batches
    for i in tqdm(range(0, num_queries, batch_size), desc="Encoding queries"):
        end_idx = min(i + batch_size, num_queries)
        batch = queries[i:end_idx]
        
        # Get embeddings for batch
        batch_embeddings = embedding_model.embed_documents(batch)
        all_embeddings[i:end_idx] = batch_embeddings
    
    return all_embeddings

def evaluate_gpu_optimized_v2(qa_pairs, k=5, batch_size=256, search_batch_size=512):
    """
    Further optimized GPU evaluation with separate batch sizes for encoding and search
    """
    questions = [q for _, q, _ in qa_pairs]
    ground_truths = [doc_id for doc_id, _, _ in qa_pairs]
    
    print(f"Starting evaluation with batch_size={batch_size}, search_batch_size={search_batch_size}")
    start_time = time.time()
    
    # 1. Batch encode all queries
    print("Encoding queries...")
    query_embeddings = batch_encode_queries_v2(questions, embedding_model, batch_size)
    encoding_time = time.time() - start_time
    print(f"Encoding completed in {encoding_time:.1f} seconds")
    
    # 2. Perform similarity search in batches
    print("Performing batch similarity search...")
    search_start = time.time()
    
    all_D = []
    all_I = []
    num_queries = len(questions)
    
    for i in tqdm(range(0, num_queries, search_batch_size), desc="Searching"):
        end_idx = min(i + search_batch_size, num_queries)
        batch_embeddings = query_embeddings[i:end_idx]
        
        # Perform search for this batch
        D, I = KNOWLEDGE_VECTOR_DATABASE.index.search(batch_embeddings, k)
        all_D.extend(D)
        all_I.extend(I)
    
    search_time = time.time() - search_start
    print(f"Search completed in {search_time:.1f} seconds")
    
    # 3. Process results
    doc_dict = {i: doc.metadata['source'] for i, doc in enumerate(doc_snippets)}
    retrieved_docs = [[doc_dict[idx] for idx in query_indices] for query_indices in all_I]
    
    # Calculate precision scores
    precision_scores = [
        1 if gt in retrieved[:k] else 0 
        for gt, retrieved in zip(ground_truths, retrieved_docs)
    ]
    
    # Final results
    final_precision = np.mean(precision_scores)
    total_time = time.time() - start_time
    qps = num_queries / total_time
    
    print("\nPerformance Breakdown:")
    print(f"- Encoding time: {encoding_time:.1f}s ({num_queries/encoding_time:.1f} queries/s)")
    print(f"- Search time: {search_time:.1f}s ({num_queries/search_time:.1f} queries/s)")
    print(f"\nFinal Results:")
    print(f"Precision@{k}: {final_precision:.3f}")
    print(f"Total time: {total_time:.1f} seconds")
    print(f"Average speed: {qps:.1f} queries/second")
    
    return final_precision


qa_file = 'questions_answers.tsv'
qa_pairs = load_questions_answers(qa_file)

start_time = time.time()
# Use larger batch sizes
score = evaluate_gpu_optimized_v2(
    qa_pairs, 
    k=5, 
    batch_size=256,  # Encoding batch size
    search_batch_size=512  # Search batch size
)
print(f"Total time: {(time.time() - start_time)/60:.2f} minutes")

Starting evaluation with batch_size=256, search_batch_size=512
Encoding queries...


Encoding queries: 100%|██████████| 5/5 [00:18<00:00,  3.76s/it]


Encoding completed in 18.8 seconds
Performing batch similarity search...


Searching: 100%|██████████| 3/3 [00:03<00:00,  1.29s/it]

Search completed in 3.9 seconds

Performance Breakdown:
- Encoding time: 18.8s (61.3 queries/s)
- Search time: 3.9s (297.8 queries/s)

Final Results:
Precision@5: 0.467
Total time: 22.7 seconds
Average speed: 50.8 queries/second
Total time: 0.38 minutes





In [11]:
import faiss

def batch_initial_retrieval_v2(questions, k=15, batch_size=512):
    """
    More optimized batch retrieval using FAISS GPU
    """
    # Convert FAISS index to GPU if not already
    res = faiss.StandardGpuResources()
    gpu_index = faiss.index_cpu_to_gpu(res, 0, KNOWLEDGE_VECTOR_DATABASE.index)
    
    # Pre-allocate embeddings array
    num_queries = len(questions)
    embedding_dim = KNOWLEDGE_VECTOR_DATABASE.index.d
    all_embeddings = np.zeros((num_queries, embedding_dim), dtype=np.float32)
    
    print("Computing embeddings...")
    for i in tqdm(range(0, num_queries, batch_size)):
        end_idx = min(i + batch_size, num_queries)
        batch = questions[i:end_idx]
        
        # Get embeddings for batch
        embeddings = embedding_model.embed_documents(batch)
        all_embeddings[i:end_idx] = embeddings
    
    print("Performing batch search...")
    # Single batch search for all queries
    D, I = gpu_index.search(all_embeddings, k)
    
    # Convert indices to documents (in batches)
    retrieved_docs = []
    for indices in I:
        docs = [doc_snippets[idx] for idx in indices]
        retrieved_docs.append(docs)
    
    return retrieved_docs

def optimized_colbert_rerank_v3(questions, docs_list, batch_size=64):
    """
    Optimized ColBERT reranking that maintains high precision
    """
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model.to(device)
    all_reranked = []
    
    # Process queries in batches
    for i in tqdm(range(0, len(questions), batch_size), desc="Reranking"):
        batch_questions = questions[i:i + batch_size]
        batch_docs = docs_list[i:i + batch_size]
        
        for q_idx, (question, docs) in enumerate(zip(batch_questions, batch_docs)):
            # Get query embeddings
            q_inputs = tokenizer(
                question,
                padding=True,
                truncation=True,
                return_tensors="pt",
                max_length=512
            ).to(device)
            
            with torch.no_grad():
                q_embeds = model(**q_inputs).last_hidden_state  # [1, Q, D]
            
            # Process docs in mini-batches
            doc_scores = []
            doc_batch_size = 8  # Process multiple docs at once
            
            for j in range(0, len(docs), doc_batch_size):
                doc_batch = docs[j:j + doc_batch_size]
                doc_texts = [d.page_content for d in doc_batch]
                
                # Get document embeddings
                d_inputs = tokenizer(
                    doc_texts,
                    padding=True,
                    truncation=True,
                    return_tensors="pt",
                    max_length=512
                ).to(device)
                
                with torch.no_grad():
                    d_embeds = model(**d_inputs).last_hidden_state  # [B, P, D]
                
                # Compute similarity scores for each document
                for doc_idx in range(len(doc_batch)):
                    # Extract single document embeddings
                    d_embed = d_embeds[doc_idx:doc_idx+1]  # [1, P, D]
                    
                    # Compute token-level similarities
                    sim_matrix = torch.matmul(q_embeds, d_embed.transpose(-2, -1))  # [1, Q, P]
                    
                    # MaxSim operation
                    max_sim = torch.max(sim_matrix, dim=-1).values  # [1, Q]
                    score = max_sim.sum().item()
                    
                    doc_scores.append((score, doc_batch[doc_idx].metadata['source']))
            
            # Sort and get top k docs
            doc_scores.sort(reverse=True)
            all_reranked.append([score[1] for score in doc_scores[:5]])
    
    return all_reranked

def evaluate_optimized_v3(qa_pairs, initial_k=15, final_k=5, batch_size=512, rerank_batch_size=64):
    """
    Evaluation with optimized but accurate reranking
    """
    questions = [q for _, q, _ in qa_pairs]
    ground_truths = [doc_id for doc_id, _, _ in qa_pairs]
    
    print("1. Initial retrieval (batched)...")
    init_start = time.time()
    initial_retrieved = batch_initial_retrieval_v2(
        questions, 
        k=initial_k, 
        batch_size=batch_size
    )
    init_time = time.time() - init_start
    
    print("\n2. ColBERT reranking (accurate mode)...")
    rerank_start = time.time()
    reranked_docs = optimized_colbert_rerank_v3(
        questions, 
        initial_retrieved, 
        batch_size=rerank_batch_size
    )
    rerank_time = time.time() - rerank_start
    
    # Calculate metrics
    precision_scores = [
        1 if gt in reranked[:final_k] else 0 
        for gt, reranked in zip(ground_truths, reranked_docs)
    ]
    
    # Results
    final_precision = np.mean(precision_scores)
    total_time = time.time() - init_start
    
    print("\nPerformance Breakdown:")
    print(f"Initial Retrieval: {init_time:.1f}s ({len(questions)/init_time:.1f} q/s)")
    print(f"Reranking: {rerank_time:.1f}s ({len(questions)/rerank_time:.1f} q/s)")
    print(f"\nFinal Results:")
    print(f"Precision@{final_k}: {final_precision:.3f}")
    print(f"Total time: {total_time:.1f}s ({len(questions)/total_time:.1f} q/s)")
    
    return final_precision

# Test with same set for comparison
score = evaluate_optimized_v3(
    qa_pairs,
    batch_size=512,           # Initial retrieval batch size
    rerank_batch_size=64      # Reranking batch size
)

1. Initial retrieval (batched)...
Computing embeddings...


100%|██████████| 3/3 [00:11<00:00,  3.82s/it]


Performing batch search...

2. ColBERT reranking (accurate mode)...


Reranking: 100%|██████████| 18/18 [00:36<00:00,  2.02s/it]


Performance Breakdown:
Initial Retrieval: 11.6s (99.0 q/s)
Reranking: 36.4s (31.7 q/s)

Final Results:
Precision@5: 0.475
Total time: 48.0s (24.0 q/s)





# Reader

In [12]:
from dataclasses import dataclass
from typing import List, Optional, Dict

@dataclass
class PromptTemplate:
    template: str
    input_variables: List[str]
    
    def format(self, **kwargs) -> str:
        return self.template.format(**kwargs)

class PromptManager:
    def __init__(self):
        self.templates = {
            # Zero-shot prompting
            "basic": PromptTemplate(
                template="Answer the question based on the given context.\n\nContext: {context}\n\nQuestion: {question}\n\nAnswer:",
                input_variables=["context", "question"]
            ),
            
            # Chain-of-thought prompting
            "cot": PromptTemplate(
                template="Let's approach this step-by-step:\n\n1) First, understand the question: {question}\n\n2) Here's the relevant context: {context}\n\n3) Let's analyze the context and break down the key points\n\n4) Based on this analysis, provide a detailed answer.\n\nReasoning and answer:",
                input_variables=["context", "question"]
            ),
            
            # Role-based prompting
            "expert": PromptTemplate(
                template="As an expert in meeting analysis, review the following context and answer the question.\n\nContext: {context}\n\nQuestion: {question}\n\nExpert analysis and answer:",
                input_variables=["context", "question"]
            ),
            
            # Self-reflection prompting
            "reflective": PromptTemplate(
                template="Question: {question}\n\nContext: {context}\n\nLet me think about this carefully:\n1. What are the key points in the context?\n2. How do they relate to the question?\n3. What might I be missing?\n\nConsidering these points, here's my answer:",
                input_variables=["context", "question"]
            ),
            
            # Structured output prompting
            "structured": PromptTemplate(
                template="Based on the context below, provide a structured answer to the question.\n\nContext: {context}\n\nQuestion: {question}\n\nAnswer in the following format:\n- Main point:\n- Supporting details:\n- Additional context:\n- Confidence level:\n",
                input_variables=["context", "question"]
            )
        }

    def get_prompt(self, style: str, **kwargs) -> str:
        if style not in self.templates:
            raise ValueError(f"Unknown prompt style: {style}")
        return self.templates[style].format(**kwargs)

In [13]:
from typing import List, Dict, Optional
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM  
from dataclasses import dataclass

@dataclass
class ReaderConfig:
    """Configuration for the Reader component"""
    model_name: str = "google/flan-t5-base"  # Can also use larger variants
    max_source_length: int = 1024
    max_target_length: int = 256
    num_beams: int = 4
    device: str = "cuda" if torch.cuda.is_available() else "cpu"
    temperature: float = 0.7
    do_sample: bool = True
    top_p: float = 0.95
    prompt_template: str = "Answer the question based on the given context.\n\nContext: {context}\n\nQuestion: {question}\n\nAnswer:"

class EnhancedReader(Reader):
    def __init__(self, config: ReaderConfig):
        super().__init__(config)
        self.prompt_manager = PromptManager()
        
    def generate_answer_with_prompt_style(
        self, 
        question: str, 
        retrieved_docs: List[str],
        prompt_style: str = "basic",
        return_context: bool = False
    ) -> Dict[str, str]:
        """
        Generate an answer using a specific prompting strategy
        """
        context = self._prepare_context(retrieved_docs, self.config.max_source_length)
        
        # Get the appropriate prompt
        prompt = self.prompt_manager.get_prompt(
            style=prompt_style,
            context=context,
            question=question
        )
        
        # Generate answer with the selected prompt
        inputs = self.tokenizer(
            prompt,
            max_length=self.config.max_source_length,
            padding=True,
            truncation=True,
            return_tensors="pt"
        ).to(self.config.device)
        
        with torch.no_grad():
            outputs = self.model.generate(
                input_ids=inputs.input_ids,
                # Pass the mask explicitly so any padded positions are ignored
                attention_mask=inputs.attention_mask,
                max_length=self.config.max_target_length,
                num_beams=self.config.num_beams,
                temperature=self.config.temperature,
                do_sample=self.config.do_sample,
                top_p=self.config.top_p
            )
        
        answer = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        
        result = {
            "answer": answer,
            "prompt_style": prompt_style
        }
        
        if return_context:
            result["context"] = context
            result["full_prompt"] = prompt
            
        return result


NameError: name 'Reader' is not defined

In [None]:
reader_config = ReaderConfig()
reader = EnhancedReader(config=reader_config)

rag_output = reader.generate_answer(user_query, retrieved_docs, return_context=True)
rag_output


In [None]:
from typing import List, Tuple, Dict
import numpy as np
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu
import spacy
from collections import defaultdict

class RAGQAEvaluator:
    def __init__(self):
        """Initialize the RAG QA evaluation system."""
        self.rouge_scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
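        try:
            self.nlp = spacy.load('en_core_web_sm')
        except OSError as err:
            # spacy.load raises OSError when the model package is missing.
            # Fail fast: every similarity metric below depends on self.nlp.
            raise RuntimeError(
                "Please install spaCy model: python -m spacy download en_core_web_sm"
            ) from err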

    def evaluate_answer(self, 
                       rag_output: Dict[str, str],
                       qa_pairs: List[Tuple[str, str, str]]) -> Dict:
        """
        Evaluate RAG output against ground truth QA pairs.
        
        Args:
            rag_output: Dictionary containing 'answer' and 'context' from RAG
            qa_pairs: List of tuples (doc_id, question, answer) from ground truth
            
        Returns:
            Dictionary containing evaluation metrics
        """
        metrics = {}
        
        # Treat every whitespace-separated token in the context as a candidate
        # document ID; a doc_id only matches if it appears verbatim as a token.
        mentioned_docs = set(rag_output['context'].split())
        
        # Find matching QA pairs based on document overlap
        matching_pairs = [
            (q, a) for doc_id, q, a in qa_pairs 
            if doc_id in mentioned_docs
        ]
        
        if not matching_pairs:
            return {
                'error': 'No matching ground truth QA pairs found for the retrieved documents',
                'retrieved_docs': list(mentioned_docs)
            }
        
        # Calculate metrics for the generated answer against all matching pairs
        answer_metrics = []
        for question, answer in matching_pairs:
            pair_metrics = self._calculate_metrics(
                generated_answer=rag_output['answer'],
                reference_question=question,
                reference_answer=answer
            )
            answer_metrics.append(pair_metrics)
        
        # Take the best scores across all matching pairs
        metrics['best_match'] = {
            metric: max(m[metric] for m in answer_metrics)
            for metric in answer_metrics[0].keys()
        }
        
        # Calculate relevance to retrieved context
        if 'context' in rag_output:
            context_doc = self.nlp(' '.join(rag_output['context'].split()))
            answer_doc = self.nlp(rag_output['answer'])
            metrics['context_relevance'] = context_doc.similarity(answer_doc)
        
        # Add document coverage metrics
        metrics['document_coverage'] = {
            'num_retrieved': len(mentioned_docs),
            'retrieved_docs': list(mentioned_docs)
        }
        
        return metrics

    def _calculate_metrics(self, 
                         generated_answer: str,
                         reference_question: str, 
                         reference_answer: str) -> Dict[str, float]:
        """Calculate various similarity metrics between generated and reference text."""
        metrics = {}
        
        # ROUGE scores
        rouge_scores = self.rouge_scorer.score(reference_answer, generated_answer)
        metrics['rouge1_f1'] = rouge_scores['rouge1'].fmeasure
        metrics['rouge2_f1'] = rouge_scores['rouge2'].fmeasure
        metrics['rougeL_f1'] = rouge_scores['rougeL'].fmeasure
        
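        # BLEU score (unsmoothed: short answers with no higher-order n-gram
        # overlap score near zero and trigger an NLTK warning)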
        metrics['bleu'] = sentence_bleu(
            [reference_answer.split()],
            generated_answer.split()
        )
        
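        # Semantic similarity
        # Note: en_core_web_sm ships without static word vectors, so spaCy's
        # similarity() falls back to context tensors and emits a warning.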
        ref_answer_doc = self.nlp(reference_answer)
        gen_answer_doc = self.nlp(generated_answer)
        metrics['semantic_similarity'] = ref_answer_doc.similarity(gen_answer_doc)
        
        # Question relevance
        ref_question_doc = self.nlp(reference_question)
        metrics['question_relevance'] = gen_answer_doc.similarity(ref_question_doc)
        
        return metrics

    def get_evaluation_summary(self, metrics: Dict) -> str:
        """Generate a human-readable summary of the evaluation metrics."""
        if 'error' in metrics:
            return f"Error: {metrics['error']}\nRetrieved documents: {', '.join(metrics['retrieved_docs'])}"
        
        summary = []
        
        if 'best_match' in metrics:
            summary.append("Best Matching Scores:")
            summary.append("Content Overlap:")
            summary.append(f"- ROUGE-1 F1: {metrics['best_match']['rouge1_f1']:.3f}")
            summary.append(f"- ROUGE-2 F1: {metrics['best_match']['rouge2_f1']:.3f}")
            summary.append(f"- ROUGE-L F1: {metrics['best_match']['rougeL_f1']:.3f}")
            summary.append(f"- BLEU Score: {metrics['best_match']['bleu']:.3f}")
            summary.append("\nSemantic Evaluation:")
            summary.append(f"- Semantic Similarity: {metrics['best_match']['semantic_similarity']:.3f}")
            summary.append(f"- Question Relevance: {metrics['best_match']['question_relevance']:.3f}")
        
        if 'context_relevance' in metrics:
            summary.append(f"\nContext Relevance: {metrics['context_relevance']:.3f}")
        
        if 'document_coverage' in metrics:
            summary.append("\nDocument Coverage:")
            summary.append(f"- Number of Retrieved Documents: {metrics['document_coverage']['num_retrieved']}")
            summary.append(f"- Retrieved Documents: {', '.join(metrics['document_coverage']['retrieved_docs'])}")
        
        return '\n'.join(summary)

In [None]:
# Initialize the evaluator
evaluator = RAGQAEvaluator()

# Get evaluation metrics
metrics = evaluator.evaluate_answer(rag_output, qa_pairs)

# Print human-readable summary
print(evaluator.get_evaluation_summary(metrics))
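
The pairing step in `evaluate_answer` depends on document IDs appearing verbatim as whitespace-separated tokens in the retrieved context. The sketch below isolates that rule and the best-score aggregation, which may help explain when the "No matching ground truth QA pairs" branch is hit. The document IDs, context string, and per-pair scores are all made up for illustration, and `best_scores` is a hypothetical helper name.

```python
from typing import Dict, List, Tuple

# Hypothetical ground truth and context, for illustration only.
qa_pairs: List[Tuple[str, str, str]] = [
    ("doc_1", "What was decided?", "Ship on Friday."),
    ("doc_2", "Who led the meeting?", "The project manager."),
]
context = "doc_1 The team agreed to ship on Friday. (doc_2) was also discussed."

# Same matching rule as evaluate_answer: split on whitespace, test membership.
tokens = set(context.split())
matching = [(q, a) for doc_id, q, a in qa_pairs if doc_id in tokens]
# "doc_1" is a standalone token; "(doc_2)" carries punctuation, so doc_2 is missed.


def best_scores(per_pair: List[Dict[str, float]]) -> Dict[str, float]:
    """Take the maximum of each metric across all matching pairs."""
    return {name: max(m[name] for m in per_pair) for name in per_pair[0]}


# Made-up scores standing in for the output of _calculate_metrics.
example = [{"rouge1_f1": 0.4, "bleu": 0.1}, {"rouge1_f1": 0.2, "bleu": 0.3}]
print(matching)
print(best_scores(example))
```

Note that the maxima are taken per metric, so the values reported under `best_match` can come from different reference pairs.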