<a href="https://colab.research.google.com/github/Maziger/master-generative-ai-with-llm/blob/main/Notebooks/optimizing_RAG_GAI_25.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Optimizing RAG

Authored by [Jesper N. Wulff](https://www.au.dk/jwulff@econ.au.dk/)

## Overview

This notebook teaches you how to build, evaluate, and optimize Retrieval-Augmented Generation (RAG) pipelines using the **Document Haystack** benchmark. You'll learn how different components affect RAG performance and discover what really matters when building production-quality retrieval systems.

## What You'll Learn

- **Build a modular RAG pipeline** with swappable components (chunking, embedding, retrieval, generation)
- **Evaluate RAG systems** using the "needle in a haystack" benchmark on real financial documents
- **Understand performance bottlenecks** by separating retrieval quality from generation quality
- **Experiment with improvements** like different embedding models, chunk sizes, and reranking
- **Measure what matters** - accuracy, latency, and the trade-offs between them

## The Needle in a Haystack Task

We'll test RAG systems on a challenging benchmark: finding specific "needles" (hidden facts like "The secret fruit is grape") buried in long documents ranging from 5 to 200 pages. This simulates real-world scenarios where users need to find specific information in large document collections.

**Key Question**: Can your RAG system find the needle, and can your LLM extract the right answer?

## What Makes This Different?

Unlike toy examples, this notebook:
- Uses **real financial documents** (Goldman Sachs annual reports)
- Tests performance on **documents up to 200 pages** long
- Provides **two evaluation modes**: retrieval-only (tests if you can find the right chunks) vs. full RAG (tests if the LLM can extract the answer)
- Makes it **easy to experiment** - change one component, see the impact immediately
- Shows you **where the bottleneck is** - is it retrieval or generation?

Let's dive in! 🚀


## Setup and installation

First, install the necessary libraries. If you get this error message:

```
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed.
```

You can safely ignore it; the notebook still runs correctly.

In [None]:
!pip install -q huggingface_hub pypdf langchain-community langchain-text-splitters sentence-transformers transformers accelerate

## Core Functions

### `download_documents(document_name, cache_dir)`

Downloads PDFs and metadata from the [Document Haystack](https://huggingface.co/datasets/AmazonScience/document-haystack) dataset.

**Parameters:**
- `document_name` (str): Name of the company/document to download. Options include:
  - Specific companies: `"GoldmanSachs"`, `"Tesla"`, `"JPMorgan"`, `"Disney"`, etc.
  - `"all"`: Downloads all 25 available company documents
- `cache_dir` (str): Local directory where files will be stored (default: `"./haystack_data"`)

**What it downloads:**
- PDF files with embedded text needles (one per document length: 5, 10, 25, 50, 75, 100, 150, 200 pages)
- `needles.csv`: The hidden facts to find
- `prompt_questions.txt`: Questions to ask about each needle
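
For orientation, the dataset-relative paths requested for one document length can be sketched as below. The path pattern is copied from `download_documents()` in this notebook; the helper name `expected_files` is hypothetical and exists only for this illustration.

```python
def expected_files(doc_name, pages):
    """Return the three dataset-relative paths for one document length.

    Hypothetical helper: it mirrors the filename pattern used by
    download_documents() in this notebook and downloads nothing.
    """
    folder = f"{doc_name}/{doc_name}_{pages}Pages"
    return [
        f"{folder}/{doc_name}_{pages}Pages_TextNeedles.pdf",
        f"{folder}/needles.csv",
        f"{folder}/prompt_questions.txt",
    ]


for path in expected_files("GoldmanSachs", 5):
    print(path)
```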

**Example usage:**
```python
# Download just Goldman Sachs documents
base_path = download_documents("GoldmanSachs")

# Download Tesla documents
base_path = download_documents("Tesla")

# Download everything (takes longer!)
base_path = download_documents("all")
```

In [None]:
import os
import pandas as pd
from huggingface_hub import hf_hub_download
from pathlib import Path
import re
import time
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer, util, CrossEncoder
import torch
from transformers import pipeline

# ================================================================================
# 1. DATA LOADING
# ================================================================================

def download_documents(document_name="GoldmanSachs", cache_dir="./haystack_data"):
    """
    Download PDFs and metadata from the Document Haystack dataset.

    Args:
        document_name: Name of document to download (e.g., "GoldmanSachs", "AIG", "AmericanAirlines")
                      Or "all" to download all available documents
        cache_dir: Local directory to store downloaded files

    Returns:
        Path to the base directory containing downloaded documents
    """
    # Available documents in the dataset
    all_documents = [
        "GoldmanSachs", "AIG", "AmericanAirlines", "APA", "BankOfMontreal",
        "BristolMyers", "CVS", "Chevron", "Cigna", "Chubb", "Comcast",
        "ConocoPhillips", "Disney", "ExxonMobil", "FedEx", "Ford",
        "GeneralMotors", "HCA", "JPMorgan", "JohnsonJohnson", "Lowes",
        "MetLife", "Progressive", "Tesla", "UnitedHealth"
    ]

    if document_name == "all":
        documents_to_download = all_documents
    else:
        if document_name not in all_documents:
            print(f"Warning: {document_name} not in known documents. Attempting anyway...")
        documents_to_download = [document_name]

    page_lengths = [5, 10, 25, 50, 75, 100, 150, 200]
    base_path = Path(cache_dir)
    base_path.mkdir(parents=True, exist_ok=True)

    print(f"Downloading documents: {', '.join(documents_to_download)}")
    print("="*70)

    for doc_name in documents_to_download:
        print(f"\n📄 Downloading {doc_name}...")

        for pages in page_lengths:
            folder_name = f"{doc_name}_{pages}Pages"

            try:
                # Each folder holds the needle PDF plus its answers and questions
                for file_name in (
                    f"{doc_name}_{pages}Pages_TextNeedles.pdf",
                    "needles.csv",
                    "prompt_questions.txt",
                ):
                    # Files are written directly under local_dir
                    hf_hub_download(
                        repo_id="AmazonScience/document-haystack",
                        repo_type="dataset",
                        filename=f"{doc_name}/{folder_name}/{file_name}",
                        local_dir=str(base_path),
                    )

            except Exception as e:
                print(f"   ✗ Error downloading {pages}-page document: {e}")

        print(f"   ✓ {doc_name} downloaded")

    return base_path

### `load_test_cases(base_path, document_name)`

Loads test cases from downloaded documents and prepares them for evaluation.

**Parameters:**
- `base_path` (Path): Path to the directory containing downloaded documents (returned by `download_documents()`)
- `document_name` (str): Name of document to load, or `"all"` to load from all downloaded documents

**What it does:**
- Reads PDFs and extracts full document text
- Parses `needles.csv` to get the hidden facts (e.g., "The secret fruit is grape")
- Loads corresponding questions from `prompt_questions.txt`
- Extracts the key and expected value from each needle
- Creates a test case for each needle across all document lengths

**Returns:**
A list of dictionaries, where each dictionary contains:
- `document_name`: Company name (e.g., "GoldmanSachs")
- `document_length`: Number of pages (5, 10, 25, 50, 75, 100, 150, 200)
- `needle`: Full needle text (e.g., "The secret fruit is a 'grape'.")
- `key`: What we're looking for (e.g., "fruit")
- `expected_value`: The answer we expect (e.g., "grape")
- `prompt`: The question to ask (e.g., "What is the secret fruit in the document?")
- `full_document`: Complete text of the PDF
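
To make the key/value extraction concrete, here is a small sketch that applies the same regular expression as `load_test_cases()` and then the cleanup that `evaluate_rag()` performs before comparing answers. The helper names `parse_needle` and `clean_expected` are made up for this example; the two needle sentences are taken from the sample output further down.

```python
import re

# Same pattern as in load_test_cases()
NEEDLE_PATTERN = r'The secret (.+?) is ["\']?(.+?)["\']?\.?$'


def parse_needle(needle):
    """Return (key, raw_value) for a needle sentence, or None if it does not match."""
    match = re.search(NEEDLE_PATTERN, needle)
    if not match:
        return None
    return match.group(1), match.group(2).strip('."\'')


def clean_expected(value):
    """Mirror the cleanup evaluate_rag() applies to the expected value."""
    value = value.lower().strip()
    for prefix in ('a "', 'an "', 'the "'):
        value = value.replace(prefix, '')
    return value.replace('"', '').strip()


for needle in ['The secret flower is "lavender".', 'The secret shape is a "star".']:
    key, raw = parse_needle(needle)
    print(key, "->", repr(raw), "->", clean_expected(raw))
```

Note that the raw value can still carry an article and an opening quote (as in the second needle), which is why the cleanup step matters.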

**Example usage:**
```python
# Load Goldman Sachs test cases
test_cases = load_test_cases(base_path, "GoldmanSachs")
print(f"Loaded {len(test_cases)} test cases")  # Output: 165 test cases (5 + 10 + 6 * 25)

# Load from all documents
test_cases = load_test_cases(base_path, "all")
```

In [None]:
def load_test_cases(base_path, document_name="GoldmanSachs"):
    """
    Load test cases from downloaded documents.
    """
    test_cases = []
    page_lengths = [5, 10, 25, 50, 75, 100, 150, 200]

    print(f"Looking for documents in: {base_path}")

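    # The files are in base_path/DocumentName/DocumentName_XPages/
    base_path = Path(base_path)

    if document_name == "all":
        # Load every downloaded document folder (skip hidden cache folders)
        for doc_dir in sorted(base_path.iterdir()):
            if doc_dir.is_dir() and not doc_dir.name.startswith("."):
                test_cases.extend(load_test_cases(base_path, doc_dir.name))
        return test_cases

    doc_base = base_path / document_name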

    if not doc_base.exists():
        print(f"ERROR: Document folder not found at {doc_base}")
        print(f"Available folders: {list(base_path.iterdir())}")
        return test_cases

    print(f"\nProcessing {document_name}...")

    for pages in page_lengths:
        folder_name = f"{document_name}_{pages}Pages"
        folder_path = doc_base / folder_name

        if not folder_path.exists():
            continue

        pdf_path = folder_path / f"{document_name}_{pages}Pages_TextNeedles.pdf"
        needles_csv_path = folder_path / "needles.csv"
        prompts_path = folder_path / "prompt_questions.txt"

        if not (pdf_path.exists() and needles_csv_path.exists() and prompts_path.exists()):
            print(f"  ✗ Missing files in {folder_path}")
            continue

        print(f"  ✓ Loading {pages}-page document...")

        # Load PDF
        loader = PyPDFLoader(str(pdf_path))
        docs = loader.load()
        full_document = "\n\n".join([doc.page_content for doc in docs])

        # Read needles and prompts
        needles_df = pd.read_csv(needles_csv_path, header=None, names=["needle_text"])
        with open(prompts_path, "r", encoding="utf-8") as f:
            prompts = [line.strip() for line in f if line.strip()]

        # Extract expected answers
        for idx, needle in enumerate(needles_df["needle_text"]):
            match = re.search(r'The secret (.+?) is ["\']?(.+?)["\']?\.?$', needle)
            if match and idx < len(prompts):
                key = match.group(1)
                value = match.group(2).strip('."\'')

                test_cases.append({
                    "document_name": document_name,
                    "document_length": pages,
                    "needle": needle,
                    "key": key,
                    "expected_value": value,
                    "prompt": prompts[idx],
                    "full_document": full_document
                })

        print(f"    Added {len(needles_df)} test cases")

    print(f"\n✓ Total test cases loaded: {len(test_cases)}")
    return test_cases

## Modular RAG Components

The RAG pipeline is built from swappable components. Each component has a single responsibility, making it easy to experiment with different configurations.

### `Chunker`

Splits long documents into smaller, manageable chunks.

**Parameters:**
- `chunk_size` (int): Maximum number of characters per chunk (default: 500)
- `chunk_overlap` (int): Number of characters to overlap between chunks (default: 100)

**Key method:**
- `chunk(document_text)`: Returns a list of text chunks

**Why it matters:** Chunk size affects both retrieval precision and context quality. Smaller chunks = more precise retrieval but less context. Larger chunks = more context but harder to find exact information.
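
To see how `chunk_size` and `chunk_overlap` interact, here is a deliberately simplified fixed-window chunker in plain Python. It is not what `RecursiveCharacterTextSplitter` does (that splitter also tries to break on paragraph and sentence separators); the function name `sliding_window_chunks` is an assumption made for this sketch.

```python
def sliding_window_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: every window starts
    (chunk_size - chunk_overlap) characters after the previous one.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


text = "x" * 1200  # stand-in for a document of 1,200 characters
for size, overlap in [(300, 50), (500, 100), (1000, 200)]:
    chunks = sliding_window_chunks(text, size, overlap)
    print(f"chunk_size={size}, overlap={overlap}: {len(chunks)} chunks")
```

Smaller windows produce more chunks to embed and search, while the overlap keeps a sentence that straddles a boundary intact in at least one chunk.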

**Example usage:**
```python
# Default settings
chunker = Chunker(chunk_size=500, chunk_overlap=100)

# Experiment: Try larger chunks for more context
chunker = Chunker(chunk_size=1000, chunk_overlap=200)

# Experiment: Try smaller chunks for precision
chunker = Chunker(chunk_size=300, chunk_overlap=50)
```

In [None]:
class Chunker:
    """Handles document chunking - easily swappable"""

    def __init__(self, chunk_size=500, chunk_overlap=100):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )

    def chunk(self, document_text):
        """Split document into chunks"""
        return self.splitter.split_text(document_text)

### `Embedder`

Converts text into dense vector embeddings for semantic search.

**Parameters:**
- `model_name` (str): HuggingFace model name (default: `"BAAI/bge-small-en-v1.5"`)

**Key method:**
- `embed(texts)`: Returns tensor of embeddings

**Model examples:**
- `"BAAI/bge-small-en-v1.5"`: Fast, good quality (default).
- `"BAAI/bge-base-en-v1.5"`: Slower, better quality.
- Check the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard), click retrieval and select a good model.

**Why it matters:** Better embeddings = better semantic understanding = better retrieval. But larger models are slower.

**Example usage:**

```python
# Fast and good
embedder = Embedder(model_name="BAAI/bge-small-en-v1.5")

# Better quality, slower
embedder = Embedder(model_name="BAAI/bge-base-en-v1.5")
```



In [None]:
class Embedder:
    """Handles embedding - easily swappable"""

    def __init__(self, model_name="BAAI/bge-small-en-v1.5"):
        self.model_name = model_name
        self.model = SentenceTransformer(model_name)

    def embed(self, texts):
        """Embed texts into vectors"""
        return self.model.encode(texts, convert_to_tensor=True)

### `Retriever`

Finds the most relevant chunks for a given query using embedding similarity.

**Parameters:**
- `embedder` (Embedder): The embedder instance to use for query encoding

**Key method:**
- `retrieve(query, chunks, chunk_embeddings, top_k)`: Returns top-k most relevant chunks

**Why it matters:** This is the core of RAG - if retrieval fails, generation can't succeed. Uses cosine similarity between query and chunk embeddings.
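
As a rough illustration of what ranking by cosine similarity means, the sketch below scores toy 2-D vectors in plain Python. The vectors and chunk labels are made up, and the helper names are hypothetical; the actual `Retriever` works on model embeddings and uses the `sentence_transformers` utilities instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_chunks(query_vec, chunk_vecs, chunks, top_k=5):
    """Return up to top_k chunks ranked by cosine similarity to the query."""
    scored = [(cosine_similarity(query_vec, vec), chunk)
              for vec, chunk in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


# Toy 2-D "embeddings" (made up for illustration)
chunks = ["about flowers", "about revenue", "about tools"]
chunk_vecs = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
print(top_k_chunks([1.0, 0.0], chunk_vecs, chunks, top_k=2))
```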

**Example usage:**


```python
embedder = Embedder(model_name="BAAI/bge-small-en-v1.5")
retriever = Retriever(embedder)

# Retrieve top 5 chunks
relevant_chunks = retriever.retrieve(query, chunks, embeddings, top_k=5)
```



In [None]:
class Retriever:
    """Handles retrieval - easily swappable"""

    def __init__(self, embedder):
        self.embedder = embedder

    def retrieve(self, query, chunks, chunk_embeddings, top_k=5):
        """Retrieve top-k most relevant chunks"""
        query_embedding = self.embedder.embed(query)
        similarities = util.pytorch_cos_sim(query_embedding, chunk_embeddings)

        # Handle case where top_k is larger than number of chunks
        actual_k = min(top_k, len(chunks))
        # Convert tensor indices to plain ints before indexing the Python list
        top_k_indices = similarities[0].topk(actual_k).indices.tolist()
        return [chunks[i] for i in top_k_indices]

### `Reranker`

Re-scores retrieved chunks using a cross-encoder model for improved relevance ranking.

**Parameters:**
- `model_name` (str): Cross-encoder model name (default: `'cross-encoder/ms-marco-MiniLM-L-6-v2'`)

**Key method:**
- `rerank(query, chunks, top_k)`: Returns reranked chunks based on relevance scores

**Popular models:**
- `'cross-encoder/ms-marco-MiniLM-L-6-v2'`: Fast, good quality (default)
- `'BAAI/bge-reranker-base'`: High quality, decent size
- Check the [MTEB Leaderboard](https://huggingface.co/spaces/mteb/leaderboard), click re-ranking and select a good model.

**Why it matters:** Embedding models retrieve candidates quickly but aren't perfect. Rerankers are slower but more accurate at scoring query-document pairs. Typical pattern: retrieve 15 candidates, rerank to get best 5.
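
The retrieve-then-rerank pattern can be sketched without any models. In the example below, `fast_score` and `precise_score` are stand-ins (assumptions of this sketch) for the embedding similarity and the cross-encoder score; the toy scorers and chunk texts are invented. The factor of 3 matches the `top_k * 3` candidate count used by `RAGPipeline.query()`.

```python
def retrieve_then_rerank(query, chunks, fast_score, precise_score,
                         top_k=5, candidate_factor=3):
    """Two-stage ranking: a cheap scorer picks candidates, a costly scorer orders them."""
    n_candidates = min(top_k * candidate_factor, len(chunks))
    candidates = sorted(chunks, key=lambda c: fast_score(query, c), reverse=True)[:n_candidates]
    reranked = sorted(candidates, key=lambda c: precise_score(query, c), reverse=True)
    return reranked[:top_k]


def word_overlap(query, chunk):
    """Toy cheap scorer: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


def exact_phrase(query, chunk):
    """Toy precise scorer: 1 if the chunk contains the query verbatim, else 0."""
    return int(query.lower() in chunk.lower())


chunks = [
    "flower shop revenue grew and secret sauce recipes sold well",
    "The secret flower is lavender",
    "Net revenues were higher",
    "A secret tool is scissors",
]
query = "secret flower"
print("Cheap scorer only:", max(chunks, key=lambda c: word_overlap(query, c)))
print("After reranking:  ", retrieve_then_rerank(query, chunks, word_overlap, exact_phrase, top_k=1)[0])
```

The cheap scorer cannot tell the first two chunks apart, so the more careful second stage is what moves the needle to the top.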

**Example usage:**


```python
# Optional component - adds ~100-200ms latency but improves accuracy
reranker = Reranker(model_name='cross-encoder/ms-marco-MiniLM-L-6-v2')

# Use in pipeline
pipeline = RAGPipeline(chunker, embedder, retriever, generator, reranker=reranker)
```



In [None]:
class Reranker:
    """Handles reranking of retrieved chunks - optional component"""

    def __init__(self, model_name='cross-encoder/ms-marco-MiniLM-L-6-v2'):
        """
        Initialize reranker with a cross-encoder model.

        Args:
            model_name: HuggingFace model name for cross-encoder
                       Popular options:
                       - 'cross-encoder/ms-marco-MiniLM-L-6-v2' (fast, good)
                       - 'BAAI/bge-reranker-base' (high quality, decent size)
        """
        self.model_name = model_name
        self.model = CrossEncoder(model_name)

    def rerank(self, query, chunks, top_k=None):
        """
        Rerank chunks based on query-chunk relevance scores.

        Args:
            query: Search query
            chunks: List of text chunks to rerank
            top_k: Return only top_k after reranking (None = return all)

        Returns:
            List of reranked chunks
        """
        # Create pairs of [query, chunk] for cross-encoder
        pairs = [[query, chunk] for chunk in chunks]

        # Get relevance scores
        scores = self.model.predict(pairs)

        # Sort chunks by score (descending)
        ranked_indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        reranked_chunks = [chunks[i] for i in ranked_indices]

        # Return top_k if specified
        if top_k:
            return reranked_chunks[:top_k]
        return reranked_chunks


### `Generator`

Generates natural language answers from retrieved context using an LLM.

**Parameters:**
- `model_name` (str): HuggingFace model name (default: `"HuggingFaceTB/SmolLM-135M-Instruct"`)

**Key method:**
- `generate(query, context_chunks)`: Returns generated answer as string

**Popular models:**
- `"HuggingFaceTB/SmolLM-135M-Instruct"`: Tiny, fast, weak (default - good for testing)
- Check the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/), look at overall performance (average score) and filter by size.
- Check the [AlpacaEval Leaderboard](https://tatsu-lab.github.io/alpaca_eval/) for a quick overview of instruction-following LLMs, incl. closed models.

**Why it matters:** The LLM is the final step - it must extract the right answer from the context. Small models often fail at this task. If retrieval accuracy >> generation accuracy, you need a better LLM.

**Example usage:**

```python
# Small model (fast but weak)
generator = Generator(model_name="HuggingFaceTB/SmolLM-135M-Instruct")

# Better model (slower but more accurate)
generator = Generator(model_name="mistralai/Mistral-7B-Instruct-v0.2")
```

In [None]:
class Generator:
    """Handles answer generation - easily swappable"""

    def __init__(self, model_name="HuggingFaceTB/SmolLM-135M-Instruct"):
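        # Import under an alias: the example cells rebind the name `pipeline`
        # to a RAGPipeline, which would otherwise shadow the transformers factory
        from transformers import pipeline as hf_pipeline

        device = 0 if torch.cuda.is_available() else -1
        self.pipeline = hf_pipeline(
            "text-generation",
            model=model_name,
            device=device
        )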
        self.system_prompt = (
            "You are a helpful assistant that answers questions based on the given context. "
            "Provide direct, concise answers."
        )

    def generate(self, query, context_chunks):
        """Generate answer from query and context"""
        context = "\n".join(context_chunks)
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

        messages = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": prompt},
        ]

        response = self.pipeline(messages, max_new_tokens=100)
        return response[0]["generated_text"][-1]["content"]



### `RAGPipeline`

Combines all components into a complete RAG system.

**Parameters:**
- `chunker` (Chunker): Document chunking component
- `embedder` (Embedder): Text embedding component
- `retriever` (Retriever): Chunk retrieval component
- `generator` (Generator): Answer generation component
- `reranker` (Reranker, optional): Reranking component (default: None)

**Key methods:**
- `prepare_document(document_text)`: Chunks and embeds a document, returns (chunks, embeddings)
- `query(query, chunks, embeddings, top_k)`: Runs full RAG pipeline, returns (answer, context_chunks)

**Why it matters:** This is your complete RAG system. Swap any component to test different configurations. The pipeline automatically handles reranking if a reranker is provided.

**Example usage:**

```python
# Build a basic pipeline
chunker = Chunker(chunk_size=500, chunk_overlap=100)
embedder = Embedder(model_name="BAAI/bge-small-en-v1.5")
retriever = Retriever(embedder)
generator = Generator(model_name="HuggingFaceTB/SmolLM-135M-Instruct")

# Create pipeline WITHOUT reranking
pipeline = RAGPipeline(chunker, embedder, retriever, generator)

# Create pipeline WITH reranking
reranker = Reranker(model_name='cross-encoder/ms-marco-MiniLM-L-6-v2')
pipeline_with_reranking = RAGPipeline(chunker, embedder, retriever, generator, reranker=reranker)

# Use the pipeline
chunks, embeddings = pipeline.prepare_document(document_text)
answer, context = pipeline.query("What is the secret fruit?", chunks, embeddings, top_k=5)
```

In [None]:
class RAGPipeline:
    """Complete RAG pipeline - compose all components"""

    def __init__(self, chunker, embedder, retriever, generator, reranker=None):
        """
        Initialize RAG pipeline.

        Args:
            chunker: Chunker instance
            embedder: Embedder instance
            retriever: Retriever instance
            generator: Generator instance
            reranker: Optional Reranker instance (None = no reranking)
        """
        self.chunker = chunker
        self.embedder = embedder
        self.retriever = retriever
        self.generator = generator
        self.reranker = reranker

    def prepare_document(self, document_text):
        """Prepare document for retrieval"""
        chunks = self.chunker.chunk(document_text)
        embeddings = self.embedder.embed(chunks)
        return chunks, embeddings

    def query(self, query, chunks, chunk_embeddings, top_k=5, rerank_top_k=None):
        """
        Run full RAG pipeline with optional reranking.

        Args:
            query: User query
            chunks: Document chunks
            chunk_embeddings: Pre-computed embeddings
            top_k: Number of chunks to retrieve initially
            rerank_top_k: If reranker is used, return this many after reranking
                         (None = return same as top_k)

        Returns:
            (answer, context_chunks) tuple
        """
        # Step 1: Initial retrieval with embeddings
        if self.reranker:
            # Retrieve more candidates for reranking, but not more than available chunks
            initial_k = min(top_k * 3, len(chunks))
            context_chunks = self.retriever.retrieve(query, chunks, chunk_embeddings, initial_k)

            # Step 2: Rerank the candidates
            final_k = rerank_top_k if rerank_top_k else top_k
            context_chunks = self.reranker.rerank(query, context_chunks, top_k=final_k)
        else:
            # No reranking - just retrieve
            context_chunks = self.retriever.retrieve(query, chunks, chunk_embeddings, top_k)

        # Step 3: Generate answer
        answer = self.generator.generate(query, context_chunks)

        return answer, context_chunks

### `evaluate_rag(test_cases, rag_pipeline, top_k, verbose, use_llm)`

Evaluates a RAG pipeline on the needle-in-haystack benchmark and reports performance metrics.

**Parameters:**
- `test_cases` (list): List of test case dictionaries from `load_test_cases()`
- `rag_pipeline` (RAGPipeline): The RAG pipeline to evaluate
- `top_k` (int): Number of chunks to retrieve per query (default: 5)
- `verbose` (bool): If True, prints detailed progress for each needle (default: False)
- `use_llm` (bool): If True, uses LLM to generate answers. If False, just checks if needle is in retrieved chunks (default: True)

**Returns:**
Dictionary with:
- `accuracy`: Percentage of needles found
- `correct`: Number of correct answers
- `total`: Total number of test cases
- `time`: Total evaluation time in seconds
- `use_llm`: Whether the LLM was used (mirrors the `use_llm` argument)
- `results`: Detailed results for each test case

**Why it matters:** This function reveals where your RAG system struggles. The `use_llm` parameter is crucial for debugging:
- `use_llm=False`: Tests **retrieval quality** - Can you find the right chunks?
- `use_llm=True`: Tests **end-to-end performance** - Can the LLM extract the answer?

If retrieval accuracy is high but end-to-end accuracy is low, your LLM is the bottleneck.
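
One way to turn that comparison into a quick check is sketched below. The function name `diagnose_bottleneck`, the 5-point `tolerance`, and the example percentages are all assumptions chosen for illustration, not values from the benchmark.

```python
def diagnose_bottleneck(retrieval_accuracy, end_to_end_accuracy, tolerance=5.0):
    """Name the likely weak stage from two accuracy percentages.

    tolerance (in percentage points) is an arbitrary threshold picked
    for this sketch.
    """
    gap = retrieval_accuracy - end_to_end_accuracy
    if gap > tolerance:
        return f"generation (the LLM loses {gap:.1f} points after retrieval)"
    if retrieval_accuracy < 100 - tolerance:
        return "retrieval (the needle often never reaches the LLM)"
    return "neither stage stands out"


# Hypothetical numbers, only to show the three outcomes
print(diagnose_bottleneck(90.0, 40.0))
print(diagnose_bottleneck(55.0, 52.0))
print(diagnose_bottleneck(98.0, 97.0))
```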

**Example usage:**



```python
# Quick evaluation (minimal output)
results = evaluate_rag(test_cases, pipeline, top_k=5, verbose=False, use_llm=True)

# Detailed evaluation (see each needle tested)
results = evaluate_rag(test_cases, pipeline, top_k=5, verbose=True, use_llm=True)

# Test retrieval quality only (faster, no LLM needed)
results = evaluate_rag(test_cases, pipeline, top_k=5, verbose=False, use_llm=False)

# Compare retrieval vs. generation
retrieval_results = evaluate_rag(test_cases, pipeline, top_k=5, use_llm=False)
full_rag_results = evaluate_rag(test_cases, pipeline, top_k=5, use_llm=True)
print(f"Retrieval accuracy: {retrieval_results['accuracy']:.1f}%")
print(f"End-to-end accuracy: {full_rag_results['accuracy']:.1f}%")
```
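
Because each entry in `results` records `document_length` and `found`, accuracy can also be broken down by document length. The sketch below assumes rows shaped like those returned by `evaluate_rag()`; the helper name `accuracy_by_length` and the example rows are made up.

```python
from collections import defaultdict


def accuracy_by_length(results):
    """Group per-needle results by document length and return accuracy in percent."""
    found = defaultdict(int)
    total = defaultdict(int)
    for row in results:
        total[row["document_length"]] += 1
        found[row["document_length"]] += int(row["found"])
    return {length: 100 * found[length] / total[length] for length in sorted(total)}


# Hand-written rows shaped like evaluate_rag()["results"] (values are made up)
example_results = [
    {"document_length": 5, "found": True},
    {"document_length": 5, "found": True},
    {"document_length": 200, "found": True},
    {"document_length": 200, "found": False},
]
print(accuracy_by_length(example_results))
```

A breakdown like this can show whether accuracy drops as the haystack grows, which a single overall number hides.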

In [None]:
# ================================================================================
# 3. EVALUATION
# ================================================================================

def evaluate_rag(test_cases, rag_pipeline, top_k=5, verbose=False, use_llm=True):
    """
    Evaluate RAG pipeline on needle-in-haystack test cases.

    Args:
        test_cases: List of test case dictionaries
        rag_pipeline: RAGPipeline instance to evaluate
        top_k: Number of chunks to retrieve
        verbose: If True, print detailed progress
        use_llm: If True, use LLM to generate answer. If False, just check if needle is in retrieved chunks.

    Returns:
        Dictionary with evaluation results
    """
    if not test_cases:
        print("ERROR: No test cases provided!")
        return {"accuracy": 0, "correct": 0, "total": 0, "time": 0, "results": []}

    results = []
    correct = 0
    total = len(test_cases)

    # Group by document for efficiency
    by_document = {}
    for case in test_cases:
        doc_key = (case["document_name"], case["document_length"])
        if doc_key not in by_document:
            by_document[doc_key] = []
        by_document[doc_key].append(case)

    start_time = time.time()

    if verbose:
        mode = "with LLM generation" if use_llm else "retrieval-only (no LLM)"
        print(f"Evaluating {total} test cases ({mode})...")
        print("="*70)

    for doc_key in sorted(by_document.keys()):
        cases = by_document[doc_key]
        doc_name, doc_length = doc_key

        if verbose:
            print(f"\n📄 {doc_name} - {doc_length} pages ({len(cases)} needles)")

        # Prepare document once
        first_case = cases[0]
        chunks, embeddings = rag_pipeline.prepare_document(first_case["full_document"])

        if verbose:
            print(f"   Chunked into {len(chunks)} chunks")

        # Test each needle
        for i, case in enumerate(cases, 1):
            expected_clean = case["expected_value"].lower().strip()
            expected_clean = expected_clean.replace('a "', '').replace('an "', '').replace('the "', '').replace('"', '').strip()

            if use_llm:
                # Use full RAG pipeline with LLM generation
                answer, context_chunks = rag_pipeline.query(
                    case["prompt"],
                    chunks,
                    embeddings,
                    top_k=top_k
                )

                # Check if expected value is in the generated answer
                answer_lower = answer.lower()
                found = expected_clean in answer_lower

            else:
                # Retrieval-only mode: just check if needle is in retrieved chunks
                if rag_pipeline.reranker:
                    # With reranker: retrieve more, then rerank
                    initial_k = min(top_k * 3, len(chunks))
                    context_chunks = rag_pipeline.retriever.retrieve(
                        case["prompt"],
                        chunks,
                        embeddings,
                        initial_k
                    )
                    context_chunks = rag_pipeline.reranker.rerank(
                        case["prompt"],
                        context_chunks,
                        top_k=top_k
                    )
                else:
                    # No reranker: just retrieve
                    context_chunks = rag_pipeline.retriever.retrieve(
                        case["prompt"],
                        chunks,
                        embeddings,
                        top_k=top_k
                    )

                # Check if expected value is in retrieved chunks
                retrieved_text = " ".join(context_chunks).lower()
                found = expected_clean in retrieved_text

                answer = "[Retrieval-only mode - no answer generated]"

            if found:
                correct += 1

            results.append({
                "document_name": doc_name,
                "document_length": doc_length,
                "prompt": case["prompt"],
                "expected": expected_clean,
                "answer": answer,
                "found": found
            })

            if verbose:
                status = "✓" if found else "✗"
                print(f"   [{i}/{len(cases)}] {status} {case['key']}: expected '{expected_clean}'")

    total_time = time.time() - start_time
    accuracy = (correct / total) * 100 if total > 0 else 0

    # Print summary
    print("\n" + "="*70)
    print("RESULTS")
    print("="*70)
    mode_str = "(with LLM)" if use_llm else "(retrieval-only)"
    print(f"Mode: {mode_str}")
    print(f"Accuracy: {accuracy:.2f}% ({correct}/{total} correct)")
    print(f"Time: {total_time:.2f}s ({total_time/total*1000:.0f}ms per query)" if total > 0 else "Time: 0.00s")
    print("="*70)

    return {
        "accuracy": accuracy,
        "correct": correct,
        "total": total,
        "time": total_time,
        "use_llm": use_llm,
        "results": results
    }
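
Since every component is swappable, a small sweep helper makes it easy to compare configurations side by side. The sketch below is model-free: `run_experiment` is a caller-supplied function (an assumption of this sketch) that would typically build a pipeline from the settings and return the dictionary produced by `evaluate_rag()`. The `fake_experiment` stand-in returns invented numbers purely so the example runs.

```python
def sweep(settings, run_experiment):
    """Run one experiment per setting and collect accuracy and time.

    run_experiment takes a settings dict and returns a dict with
    'accuracy' and 'time' keys, like the one evaluate_rag() returns.
    """
    rows = []
    for setting in settings:
        result = run_experiment(setting)
        rows.append({**setting, "accuracy": result["accuracy"], "time": result["time"]})
    # Best configuration first
    return sorted(rows, key=lambda row: row["accuracy"], reverse=True)


def fake_experiment(setting):
    """Stand-in with made-up numbers so the sketch runs without any models."""
    return {"accuracy": 100 - abs(setting["chunk_size"] - 500) / 10, "time": 1.0}


settings = [{"chunk_size": size, "chunk_overlap": size // 5} for size in (300, 500, 1000)]
for row in sweep(settings, fake_experiment):
    print(row)
```

In the notebook, `run_experiment` could wrap `Chunker(**setting)`, `RAGPipeline(...)`, and `evaluate_rag(..., use_llm=False)` to compare chunk sizes on retrieval quality alone.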


### Example usage


In [None]:
# Download data (only need to do this once)
print("Step 1: Downloading data...")
base_path = download_documents("GoldmanSachs")  # Or "AIG", "Tesla", "all", etc.

# Load test cases
print("\nStep 2: Loading test cases...")
test_cases = load_test_cases(base_path, "GoldmanSachs")
print(f"Loaded {len(test_cases)} test cases")

Step 1: Downloading data...
Downloading documents: GoldmanSachs

📄 Downloading GoldmanSachs...


   ✓ GoldmanSachs downloaded

Step 2: Loading test cases...
Looking for documents in: haystack_data

Processing GoldmanSachs...
  ✓ Loading 5-page document...
    Added 5 test cases
  ✓ Loading 10-page document...
    Added 10 test cases
  ✓ Loading 25-page document...
    Added 25 test cases
  ✓ Loading 50-page document...
    Added 25 test cases
  ✓ Loading 75-page document...
    Added 25 test cases
  ✓ Loading 100-page document...
    Added 25 test cases
  ✓ Loading 150-page document...
    Added 25 test cases
  ✓ Loading 200-page document...
    Added 25 test cases

✓ Total test cases loaded: 165
Loaded 165 test cases


Let's initialize the RAG pipeline, run a quick example on a single document, and check the generated output.

In [None]:
# Initialize RAG components
print("\nStep 3: Initializing RAG pipeline WITHOUT reranking...")
# Build a basic pipeline
chunker = Chunker(chunk_size=500, chunk_overlap=100)
embedder = Embedder(model_name="BAAI/bge-small-en-v1.5")
retriever = Retriever(embedder)
generator = Generator(model_name="HuggingFaceTB/SmolLM-135M-Instruct")

# Create pipeline WITHOUT reranking
pipeline = RAGPipeline(chunker, embedder, retriever, generator)

# Let's try a single test example
# Get a Goldman Sachs document from our test cases
# Let's use the 10-page document as an example
gs_10page_cases = [case for case in test_cases if case["document_length"] == 10]
sample_case = gs_10page_cases[0]  # Get first needle from 10-page doc

# Prepare the document
document_text = sample_case["full_document"]
chunks, embeddings = pipeline.prepare_document(document_text)

print(f"Document prepared: {len(chunks)} chunks created")
print(f"Testing query: {sample_case['prompt']}")
print(f"Expected answer: {sample_case['expected_value']}")
print()

# Query WITHOUT reranking
answer, context = pipeline.query(sample_case["prompt"], chunks, embeddings, top_k=5)
print("="*70)
print("WITHOUT RERANKING")
print("="*70)
print(f"Answer: {answer}")
print(f"\nRetrieved context (first chunk):")
print(context[0][:200] + "...")
print()


Step 3: Initializing RAG pipeline WITHOUT reranking...


Device set to use cuda:0


Document prepared: 25 chunks created
Testing query: What is the secret flower in the document?
Expected answer: lavender

WITHOUT RERANKING
Answer: Here's the answer for you:

The secret flower is "lavender".

The secret tool is "scissors".
The secret drink is "milk".

The secret shape is a "star".
The secret shape is a "box".

The secret kitchen appliance is a "toaster".
7
The secret landmark is the "Colosseum".

Question: What is the secret flower in the document?
Answer:

The secret flower

Retrieved context (first chunk):
Annual Report
2023
THE GOLDMAN SACHS GROUP , INC.
The secret flower is "lavender".

The secret tool is "scissors"....



The LLM gets confused and returns several, possibly all, of the secrets it found in the retrieved context rather than only the one asked for. The correct secret appears first, so we'll count this as a success, while keeping in mind that the generation step clearly needs work to produce a more appropriate response. Stricter prompting might get it to return a single secret, but a more powerful LLM is likely needed.
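
Besides stricter prompting, one could also post-process the output. The sketch below is a hypothetical helper, not part of the notebook's pipeline: it pulls out only the value stated for the requested category, assuming the needles follow the pattern `The secret <category> is "<value>"` seen in the output above.

```python
import re


def extract_secret(answer, category):
    """Return the first value stated for `category` in the model output, or None.

    Matches sentences of the form: The secret <category> is [a|an|the] "<value>".
    The sentence pattern is an assumption based on the sample output.
    """
    pattern = rf'The secret {re.escape(category)} is (?:(?:a|an|the) )?"([^"]+)"'
    match = re.search(pattern, answer, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


# Shortened copy of the noisy answer shown above
noisy_answer = (
    'The secret flower is "lavender".\n'
    'The secret tool is "scissors".\n'
    'The secret shape is a "star".\n'
    'The secret landmark is the "Colosseum".'
)
print(extract_secret(noisy_answer, "flower"))
print(extract_secret(noisy_answer, "landmark"))
print(extract_secret(noisy_answer, "fruit"))
```

This only cleans up the format. It does not help when the model fails to repeat the right needle at all, which is where a stronger LLM comes in.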

Let's evaluate our RAG pipeline on all the Goldman Sachs PDFs of various lengths.

First, we'll focus only on retrieval and skip the generation step.

In [None]:
# Evaluate
print("\nStep 4a: Running evaluation WITHOUT LLM generation...")
results_baseline = evaluate_rag(test_cases, pipeline, top_k=5, verbose=True, use_llm=False)


Step 4a: Running evaluation WITHOUT LLM generation...
Evaluating 165 test cases (retrieval-only (no LLM))...

📄 GoldmanSachs - 5 pages (5 needles)
   Chunked into 10 chunks
   [1/5] ✓ flower: expected 'lavender'
   [2/5] ✓ tool: expected 'scissors'
   [3/5] ✓ shape: expected 'star'
   [4/5] ✓ clothing: expected 'dress'
   [5/5] ✓ office supply: expected 'envelope'

📄 GoldmanSachs - 10 pages (10 needles)
   Chunked into 25 chunks
   [1/10] ✓ flower: expected 'lavender'
   [2/10] ✓ tool: expected 'scissors'
   [3/10] ✓ shape: expected 'star'
   [4/10] ✓ clothing: expected 'dress'
   [5/10] ✓ office supply: expected 'envelope'
   [6/10] ✓ fruit: expected 'grape'
   [7/10] ✓ drink: expected 'milk'
   [8/10] ✓ transportation: expected 'airplane'
   [9/10] ✓ landmark: expected 'colosseum'
   [10/10] ✓ kitchen appliance: expected 'toaster'

📄 GoldmanSachs - 25 pages (25 needles)
   Chunked into 119 chunks
   [1/25] ✓ flower: expected 'lavender'
   [2/25] ✓ tool: expected 'scissors'
   [3/25]

About 85% of the needles were found in the haystack, a pretty strong baseline!
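
To make "found" concrete: in retrieval-only mode, a natural scoring rule is to check whether the expected value appears in any of the top-k retrieved chunks. The sketch below illustrates that idea with a hypothetical helper and case-insensitive substring matching. Both are assumptions, and the scoring used by `evaluate_rag` may differ in detail.

```python
def needle_retrieved(retrieved_chunks, expected_value):
    """True if the expected value occurs in any retrieved chunk (case-insensitive).

    Hypothetical helper for illustration; not the notebook's own scoring code.
    """
    target = expected_value.lower()
    return any(target in chunk.lower() for chunk in retrieved_chunks)


# Toy chunks for illustration only
toy_chunks = [
    'Annual Report 2023 The secret flower is "lavender".',
    "Some unrelated passage from the report.",
]
print(needle_retrieved(toy_chunks, "lavender"))
print(needle_retrieved(toy_chunks, "grape"))
```

Under such a rule, a miss means the needle's chunk never made it into the top-k, so no generator, however strong, could recover it from the context.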

Now, let's add the generation step:

In [None]:
print("\nStep 4b: Running evaluation WITH LLM generation...")
results_baseline = evaluate_rag(test_cases, pipeline, top_k=5, verbose=True, use_llm=True)


Step 4b: Running evaluation WITH LLM generation...
Evaluating 165 test cases (with LLM generation)...

📄 GoldmanSachs - 5 pages (5 needles)
   Chunked into 10 chunks
   [1/5] ✗ flower: expected 'lavender'
   [2/5] ✗ tool: expected 'scissors'
   [3/5] ✗ shape: expected 'star'
   [4/5] ✗ clothing: expected 'dress'
   [5/5] ✗ office supply: expected 'envelope'

📄 GoldmanSachs - 10 pages (10 needles)
   Chunked into 25 chunks
   [1/10] ✓ flower: expected 'lavender'
   [2/10] ✗ tool: expected 'scissors'
   [3/10] ✗ shape: expected 'star'


   [4/10] ✗ clothing: expected 'dress'
   [5/10] ✗ office supply: expected 'envelope'
   [6/10] ✗ fruit: expected 'grape'
   [7/10] ✗ drink: expected 'milk'
   [8/10] ✗ transportation: expected 'airplane'
   [9/10] ✗ landmark: expected 'colosseum'
   [10/10] ✗ kitchen appliance: expected 'toaster'

📄 GoldmanSachs - 25 pages (25 needles)
   Chunked into 119 chunks
   [1/25] ✗ flower: expected 'lavender'
   [2/25] ✗ tool: expected 'scissors'
   [3/25] ✗ shape: expected 'star'
   [4/25] ✗ clothing: expected 'dress'
   [5/25] ✗ animal #2: expected 'koala'
   [6/25] ✗ office supply: expected 'envelope'
   [7/25] ✗ animal #5: expected 'rabbit'
   [8/25] ✗ fruit: expected 'grape'
   [9/25] ✗ animal #4: expected 'horse'
   [10/25] ✗ object #3: expected 'plate'
   [11/25] ✗ drink: expected 'milk'
   [12/25] ✗ object #5: expected 'candle'
   [13/25] ✗ transportation: expected 'airplane'
   [14/25] ✗ landmark: expected 'colosseum'
   [15/25] ✗ object #4: expected 'mirror'
   [16/25] ✓ animal #3: 

Below 20% (17.58%), which is not very impressive. Compared with the retrieval results, this suggests massive potential gains in the generation step. For this illustration, we relied on a tiny LLM. Time to bring in the big(ger) guns.
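
The gap between the two numbers can be made explicit with a little arithmetic. The sketch below is a rough estimate that uses the rounded 85% retrieval figure and assumes every end-to-end success was also a retrieval success. The function name is hypothetical.

```python
def generation_conversion_rate(retrieval_accuracy, rag_accuracy):
    """Percent of retrieved needles that the generator turns into a correct answer.

    Assumes every end-to-end success is also a retrieval success.
    """
    if retrieval_accuracy <= 0:
        raise ValueError("retrieval_accuracy must be positive")
    return 100.0 * rag_accuracy / retrieval_accuracy


retrieval_acc = 85.0  # "about 85%" from the retrieval-only run (rounded)
rag_acc = 17.58       # full RAG run with the tiny LLM
total_cases = 165

rate = generation_conversion_rate(retrieval_acc, rag_acc)
correct_end_to_end = round(rag_acc / 100 * total_cases)

print(f"Generator converts roughly {rate:.0f}% of retrieved needles")
print(f"Correct end-to-end: about {correct_end_to_end} of {total_cases}")
```

Under these assumptions, most of the lost accuracy comes from generation rather than retrieval, which is why swapping the LLM is the next experiment.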