<a href="https://colab.research.google.com/github/LashawnFofung/RAG-Pipelines/blob/main/src/Task_Comparing_Open_Source_Embedding_Models_for_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Comparing, Testing, and Choosing the Best Embedding Model for Retrieval-Augmented Generation (RAG)**

**Data:** *sample_contract.pdf*
<br><br>


The embedding model is the connective tissue of any RAG system, directly determining the quality and relevance of the retrieved context. A superior embedding model captures the semantic meaning behind my user queries and my knowledge base documents, which should lead to more accurate and helpful answers from the Large Language Model (LLM).
<br><br>

In this interactive Colab notebook, I will develop a critical skill for real-world AI engineering: systematically comparing and evaluating the impact of different open-source embedding models on my RAG pipeline's output. I will move beyond just benchmark scores to a qualitative, hands-on comparison using my own data and queries.
<br><br>


**Table of Contents**
- 🔧 [Section 1: Setup Environment](#scrollTo=phfOykN5Cc5n&line=1&uniqifier=1)
- 📄 [Section 2: Document Ingestion and Node Creation (PDF Loading using Fallback)](#scrollTo=T7Dbp7AqDF7p&line=1&uniqifier=1)
- 🧠 [Section 3: Initialize & Compare Embedding Models Testing Loop](#scrollTo=-K-dkhsmDYCS&line=1&uniqifier=1)
- 📊 [Section 4: Compare Outputs](#scrollTo=5Tmt3mJsDgPZ&line=1&uniqifier=1)
- 💡[Section 5: Embedding Model Scorecard Analysis](#scrollTo=IH0NiWQLJy7T&line=4&uniqifier=1)
- ⏱[Section 6: Testing Automation](#scrollTo=rC6swMC_CIUh&line=12&uniqifier=1)
- 📊[Section 7: Run 3 RAG Configurations and Log Output Differences](#scrollTo=HV5ydovyjaaC)
<br>


**🛠️ My Hands-On Evaluation Steps**

I will follow this structured process to assess how three different open-source embedding models (like MiniLM, E5, or BGE) affect the retrieval and final answer quality of my RAG system.
<br><br>


**1. I'll Choose and Implement 3 Embedding Models**

- I will select three small, popular open-source models (e.g., MiniLM, E5, BGE) from the available list.

- For each model, I'll easily update my RAG pipeline using the `HuggingFaceEmbedding` class:
<br>

**```Python```**
    
    
    embed_model = HuggingFaceEmbedding(model_name="your_model_name_here")
    
    
  <br>
  
- **Note:** Re-index documents if the embedding model is changed, as each model creates a unique vector space.
<br>
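A minimal sketch of how I could guard against querying an index with the wrong model. `ToyIndex` and `needs_reindex` are hypothetical names for illustration, not LlamaIndex APIs. The only idea carried over from the note above is that vectors built by one model are comparable only with queries embedded by that same model.

```python
from dataclasses import dataclass, field


@dataclass
class ToyIndex:
    """Hypothetical stand-in for a vector index that remembers which model built it."""
    model_name: str
    vectors: list = field(default_factory=list)


def needs_reindex(index: ToyIndex, current_model: str) -> bool:
    """Return True when the index was built by a different embedding model."""
    return index.model_name != current_model


index = ToyIndex(model_name="sentence-transformers/all-MiniLM-L6-v2")
print(needs_reindex(index, "BAAI/bge-small-en-v1.5"))
```

Comparing model names rather than vector dimensions is deliberate: two models can produce vectors of the same length that still live in unrelated spaces.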

**2. I'll Test with Consistent Queries**

I will select 2-3 diverse test questions to use across all three models. This ensures a fair, apples-to-apples comparison.

- **Example**:

> **Query I'll use:** query = "What is the maximum loan amount a borrower can apply for?"


<br>


**3. I'll Analyze Retrieved Context (Chunks)**

- For each model and query, I'll print the chunks the RAG system retrieved to understand what the AI is using as context.
<br>

  **```Python```**
  
  
    for node in retriever.retrieve(query):
        print(node.get_text())
  
<br>

- My Key Check: Do the chunks feel on-topic? Do they capture the semantic meaning (synonyms/related concepts) of my query, or just exact keywords? Are they concise and free of unrelated noise?
<br>
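To make the "semantic meaning vs. exact keywords" check a little less subjective, I can measure how many query words appear verbatim in a chunk. This is only a rough sketch: `keyword_overlap` is a hypothetical helper, and the sample chunk is a clause from my contract. A relevant chunk with low overlap suggests the model matched on meaning rather than on shared words.

```python
import re


def keyword_overlap(query: str, chunk: str) -> float:
    """Fraction of distinct query words that appear verbatim in the chunk."""
    def words(text: str) -> set:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    query_words = words(query)
    if not query_words:
        return 0.0
    return len(query_words & words(chunk)) / len(query_words)


query = "What are the penalties for late payments?"
chunk = ("2.3 Late payments shall bear interest at the rate of 1.5% per month "
         "from the due date until paid in full.")
print(f"{keyword_overlap(query, chunk):.2f}")
```

Here only "the", "late", and "payments" match; the word "penalties" never appears in the clause, so retrieving it depends on the model connecting "penalties" with "interest".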


 **4. Compare Final Results with a Scorecard**

I will qualitatively assess the final answer generated by the RAG system using the retrieved context. I will use a simple scorecard to document my findings for each model:

<br>

| Question | Score (1-5) | Notes |
| :--- | :---: | :--- |
| Was the answer complete?| 0 | add results  |
| Was the answer correct?|  0 | add results  |
| Was the language clear?|  0 | add results  |
| Did the context feel on-topic?| 0 | add results  |
| Were the chunks concise and useful?| 0 | add results  |

<br><br>
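If I want to keep the scorecard in code rather than only in Markdown, a small structure like the following would do. This is a sketch under my own assumptions: the `Scorecard` class and the demo scores are hypothetical, and only the five questions come from the table above.

```python
from dataclasses import dataclass
from statistics import mean

CRITERIA = (
    "Was the answer complete?",
    "Was the answer correct?",
    "Was the language clear?",
    "Did the context feel on-topic?",
    "Were the chunks concise and useful?",
)


@dataclass
class Scorecard:
    """One filled-in scorecard for one model (scores are integers from 1 to 5)."""
    model: str
    scores: dict

    def __post_init__(self):
        for criterion in CRITERIA:
            score = self.scores.get(criterion)
            if score is None or not 1 <= score <= 5:
                raise ValueError(f"Missing or out-of-range score for: {criterion}")

    def average(self) -> float:
        return mean(self.scores[criterion] for criterion in CRITERIA)


# Hypothetical demo values, not results from this notebook.
demo = Scorecard(model="demo-model", scores=dict.fromkeys(CRITERIA, 4))
print(demo.model, demo.average())
```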


# **🔧 Section 1: Setup Environment**

Install the packages (libraries) needed for the RAG pipeline, specifically for indexing, embedding, and document parsing: llama-index, pymupdf, llama-index-embeddings-huggingface

Optional (needed for Colab): nest_asyncio

In [1]:
# Install the necessary LlamaIndex packages, plus `pymupdf` for PDF parsing.
!pip install -q llama-index llama-index-embeddings-huggingface pymupdf


# Install `nest_asyncio`. It is needed in Colab/Jupyter environments so that
# asynchronous operations can run inside the notebook's already-running event loop.
!pip install -q nest_asyncio


# Install jedi to resolve a non-critical dependency warning related to ipython's
# interactive features, ensuring notebook output is completely clean.
!pip install -q jedi

# Ensure sentence-transformers is available for HuggingFaceEmbedding
!pip install -q sentence-transformers


In [2]:
# ------  Imports and Initial Configuration ------

import nest_asyncio
# Fix potential event loop conflicts
nest_asyncio.apply()

# Importing all the essential components from LlamaIndex
from llama_index.core import VectorStoreIndex, Document, Settings, get_response_synthesizer

#  Standard document-to-chunk tool(break documents into manageable pieces)
from llama_index.core.node_parser import SentenceSplitter

# Core component for loading my open-source embedding models (like MiniLM or E5).
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Class used to combine retriever (for getting context) and LLM (for generating the answer).
from llama_index.core.query_engine import RetrieverQueryEngine

# Simple time measurements to compare model speeds.
import time

# --- Embedding Models Definition ---
# These are the local, open-source embedding models we will compare.
embedding_models = {
    "MiniLM-L6-v2": "sentence-transformers/all-MiniLM-L6-v2",
    "BGE-small-en": "BAAI/bge-small-en-v1.5",
    "E5-small-v2": "intfloat/e5-small-v2"
}


# CRITICAL STEP: I am explicitly setting the LLM (Large Language Model) to None for now.
# Why? Because I want to focus *only* on testing the retrieval quality of the EMBEDDING models.
# With Settings.llm = None, LlamaIndex falls back to a MockLLM, so the pipeline only
# retrieves context; I can plug in a simple local LLM later without interference.
Settings.llm = None

# Print setup status
print("✅ Environment setup complete. LLM set to None.")

LLM is explicitly disabled. Using MockLLM.
✅ Environment setup complete. LLM set to None.


# 📄**Section 2: Document Ingestion and Node Creation (PDF Loading using Fallback)**

This is a crucial step. It prepares the data by extracting the text from the raw, unstructured PDF document and transforming it into a list of structured 'nodes' (chunks) ready for indexing with LlamaIndex.

**1. Document Load and Extraction**

In [3]:
# Placeholder content simulating a loaded document (used as a fallback)
raw_document_text = """
The monthly payment is due on the 1st of every month. Payments received after the 5th day
of the month will incur a late fee of $50. If payment is delayed by more than 30 days,
the account will be flagged, and an additional penalty of 1.5% of the outstanding balance
will be applied, compounded monthly. Failure to pay within 60 days will result in a
suspension of services and potential legal action. Please review section 4.3 for payment
processing guidelines and dispute resolution procedures. All disputes must be filed
within 10 calendar days of the late fee application date.
"""
text = raw_document_text
is_pdf_loaded = False


try:

  # The `files` utility for dynamic file uploads in the Colab environment and PyMuPDF.
  from google.colab import files

  # PyMuPDF (imported as 'fitz') for reliable, fast PDF parsing.
  import fitz
  print("\n--- Attempting interactive PDF upload ---")

  # --- 1. Document Loading and Extraction via Upload ---

  # Prompts to upload the PDF interactively from local machine.
  print("\n--- Uploading Document: 'sample_contract.pdf' ---")
  uploaded = files.upload()


  # Check if a file was successfully uploaded.
  if uploaded:
      # If successful, extracts the filename (which becomes the path) from the dictionary keys.
      pdf_path = list(uploaded.keys())[0]
      print(f"Successfully uploaded: {pdf_path}")

      # With valid pdf_path, the document can be opened and text can be extracted.
      # Using PyMuPDF (fitz) to open the PDF file for reading.
      doc = fitz.open(pdf_path)

      # Iterate through every page of the document to get the text from each,
      # and join them all together with a newline character (\n) as a separator.
      text = "\n".join([page.get_text() for page in doc])
      doc.close()

      # A quick check to make sure text extraction worked and to see the scale of data.
      print(f"✅ Extracted {len(text.split())} words from the contract.")
      is_pdf_loaded = True
  else:
      # If no file is uploaded, keep the placeholder text so the subsequent steps still run.
      print("No file uploaded. Using placeholder text for RAG processing.")

except ImportError:
    # This block handles running outside a Colab environment
    print("⚠️ Skipping Colab/PyMuPDF interactive file upload (environment dependency).")
    print("Using placeholder text for RAG processing.")




--- Attempting interactive PDF upload ---

--- Uploading Document: 'sample_contract.pdf' ---


Saving sample_contract.pdf to sample_contract.pdf
Successfully uploaded: sample_contract.pdf
✅ Extracted 315 words from the contract.


**2. Chunking with User-Specified Parameters (50/50)**

In [4]:
# Chunking is often the most important step for RAG quality.
# I use a simple SentenceSplitter with an aggressive strategy that favors precision,
# at the cost of more nodes to index and search:
# Small chunks: chunk_size (50)
# High overlap: chunk_overlap (50)
# The aim is to maximize the chances of finding small, highly relevant facts.
text_splitter = SentenceSplitter(chunk_size=50, chunk_overlap=50)

# LlamaIndex needs the raw text wrapped in a Document object before splitting.
documents = Document(text=text)

# Convert the single large Document into many smaller, overlapping Nodes (chunks).
nodes = text_splitter.get_nodes_from_documents([documents])

print(f"✅ Document processed into {len(nodes)} nodes (chunks) with chunk_size = 50, overlap = 50.")


✅ Document processed into 16 nodes (chunks) with chunk_size = 50, overlap = 50.
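To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a deliberately naive word-based sliding window. This is a sketch, not how `SentenceSplitter` works internally: as far as I understand, the real splitter measures size in tokens and respects sentence boundaries, which is why it accepted the 50/50 setting above. `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(words, chunk_size, chunk_overlap):
    """Split a list of words into fixed-size windows that share `chunk_overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    stride = chunk_size - chunk_overlap  # how far the window advances each step
    chunks = []
    for start in range(0, len(words), stride):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the current window already reaches the end of the text
    return chunks


words = "the monthly payment is due on the 1st of every month".split()
for piece in sliding_chunks(words, chunk_size=4, chunk_overlap=2):
    print(" ".join(piece))
```

In this naive version the window advances by `chunk_size - chunk_overlap`, so an overlap equal to the chunk size would never advance, and the helper rejects it. That is one reason I treat the 50/50 setting as an experiment-specific choice rather than a general recommendation.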


# 🧠 **Section 3: Initialize and Compare Embedding Models Testing Loop**

This section iterates through each model, builds an index with that model, queries it, and records the result.


In [5]:
query = "What are the penalties for late payments?"
results = {}

for model_name, model_path in embedding_models.items():
    print(f"\n🔍 Testing Embedding Model: {model_name} (Downloading/Loading...)")

    # 1. Configure the embedding model for the current test
    # This downloads the model if it's not already cached.
    embed_model = HuggingFaceEmbedding(model_name=model_path)
    Settings.embed_model = embed_model

    # 2. Build the index with the new embedding model
    # The index must be rebuilt for each model to ensure the nodes are embedded correctly.
    # This step involves:
    #### 1. Taking each Node's text.
    #### 2. Passing it through the embedding model (set just above via Settings.embed_model).
    #### 3. Storing the resulting vector in the index for fast lookups.
    start_time_index = time.time()
    index = VectorStoreIndex(nodes)
    end_time_index = time.time()
    indexing_time = end_time_index - start_time_index
    print(f"   -> Index built in {indexing_time:.2f} seconds.")

    # 3. Configure the Query Engine
    start_time_query = time.time()
    retriever = index.as_retriever(similarity_top_k=2)
    # Note: LLM is None, so this engine will only perform retrieval.
    query_engine = RetrieverQueryEngine.from_args(retriever=retriever)

    # 4. Run the query
    response = query_engine.query(query)
    end_time_query = time.time()
    total_query_time = end_time_query - start_time_query

    # 5. Store results
    results[model_name] = {
        "response": str(response),
        "indexing_time": round(indexing_time, 2),
        "query_time": round(total_query_time, 2)
    }
    print(f"   -> Query complete. Time taken: {total_query_time:.2f} seconds.")



🔍 Testing Embedding Model: MiniLM-L6-v2 (Downloading/Loading...)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


   -> Index built in 0.60 seconds.
   -> Query complete. Time taken: 0.02 seconds.

🔍 Testing Embedding Model: BGE-small-en (Downloading/Loading...)


   -> Index built in 0.75 seconds.
   -> Query complete. Time taken: 0.04 seconds.

🔍 Testing Embedding Model: E5-small-v2 (Downloading/Loading...)


   -> Index built in 1.58 seconds.
   -> Query complete. Time taken: 0.09 seconds.
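Conceptually, `similarity_top_k=2` means "score every chunk vector against the query vector and keep the two best". The sketch below shows that idea with cosine similarity in plain Python. It assumes cosine similarity as the scoring function (which I believe is the default for the in-memory vector store), and the tiny 2-dimensional vectors are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return (score, chunk_index) pairs for the k most similar chunks, best first."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up toy vectors, not real embeddings.
query_vec = [1.0, 0.0]
chunk_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for score, idx in top_k(query_vec, chunk_vecs, k=2):
    print(idx, round(score, 3))
```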


# 📊 **Section 4: Compare Outputs**

In [6]:
# This displays the results for analysis.
print("Section 4: Comparative Test Results")

for model, result in results.items():
    print("\n==============================")
    print("📊 Comparative Test Results ")
    print()
    print(f"🧠 Model: {model}")
    print()
    print(f"⏱️ Indexing Time: {result['indexing_time']} seconds")
    print()
    print(f"⏱️ Retrieval Time: {result['query_time']} seconds")
    print()
    print(f"📄 Top Response: {result['response']}")
    print()
    # Interpolate the model name inside the f-string; passing {model} as a separate
    # argument builds a set and prints it as {'model-name'}.
    print(f"___ 🔴 END {model} MODEL TEST ___")
    print()


Section 4: Comparative Test Results

📊 Comparative Test Results 

🧠 Model: MiniLM-L6-v2

⏱️ Indexing Time: 0.6 seconds

⏱️ Retrieval Time: 0.02 seconds

📄 Top Response: Context information is below.
---------------------
Payment terms are
net 30 days from receipt of invoice.
2.3 Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
3.

4.2 Refunds are issued at the sole discretion of Service Provider and will be processed within 30 days
of approval.

4.3 No refunds will be issued for completed projects that meet the specifications outlined in Exhibit A.
5.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: What are the penalties for late payments?
Answer: 

___ 🔴 END {'MiniLM-L6-v2'} MODEL TEST ___


📊 Comparative Test Results 

🧠 Model: BGE-small-en

⏱️ Indexing Time: 0.75 seconds

⏱️ Retrieval Time: 0.04 seconds

📄 Top Response: Context information is below.
---------------------
Paymen

# **Section 5: Embedding Model Scorecard Analysis**

This scorecard evaluates the performance of three embedding models (`MiniLM-L6-v2`, `BGE-small-en`, and `E5-small-v2`) on a single RAG query: "What are the penalties for late payments?"
<br><br>

The evaluation is based ***only*** on the retrieved context. Because `Settings.llm = None` makes LlamaIndex fall back to a MockLLM, no answer was actually generated; the "answer" scored below is the relevant sentence taken directly from the retrieved context, treated as a "perfect" extraction.

## 🧠 Model: MiniLM-L6-v2

<br>

| Question | Score (1-5) | Notes |
| :--- | :---: | :--- |
| Was the answer complete?| 5 | Yes, the answer explicitly stated the full penalty: <br><br> "Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full." <br><br> |
| Was the answer correct? | 5 | Yes, it directly and accurately reflects the key sentence from the retrieved context: <br><br> (`2.3 Late payments shall bear interest at the rate of 1.5% per month...`). <br><br> |
| Was the language clear? | 5 | The language is clear and unambiguous. <br><br> |
| Did the context feel on-topic?| 4 | Highly on-topic (retrieved the exact payment penalty clause), <br><br> but included two lines about unrelated "Refunds" which is considered "extra noise." <br><br> |
| Were the chunks concise and useful?| 4 | Useful, as the required sentence was present. <br><br> Not perfectly concise, as it included noise about "Refunds" (4.2 and 4.3). <br><br> |

<br><br>

## 🧠 Model: BGE-small-en

<br>

| Question | Score (1-5) | Notes |
| :--- | :---: | :--- |
| Was the answer complete?| 5 | Yes, the answer explicitly stated the full penalty:<br><br> "Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full." <br><br> |
| Was the answer correct? | 5 | Yes, it directly and accurately reflects the key sentence from the retrieved context: <br><br> (`2.3 Late payments shall bear interest at the rate of 1.5% per month...`). <br><br> |
| Was the language clear? | 5 | The language is clear and unambiguous. <br><br> |
| Did the context feel on-topic?| 5 | **Highly on-topic. It retrieved the payment clause and a surrounding general payment rule: <br><br> (`2.2 Service Provider shall invoice Client...`), <br><br> which is directly related to the concept of "payments."** <br><br>|
| Were the chunks concise and useful?| 5 | **Excellent. The retrieved chunks were highly focused on the payment topic,<br><br> avoiding the unrelated "Refund" information seen in the other models' output.** <br><br>  |

<br><br>

## 🧠 Model: E5-small-v2

<br>

| Question | Score (1-5) | Notes |
| :--- | :---: | :--- |
| Was the answer complete?| 5 | Yes, the answer explicitly stated the full penalty: <br><br> "Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full."<br><br> |
| Was the answer correct? | 5 | Yes, it directly and accurately reflects the key sentence from the retrieved context: <br><br> (`2.3 Late payments shall bear interest at the rate of 1.5% per month...`). <br><br> |
| Was the language clear? | 5 | The language is clear and unambiguous. |
| Did the context feel on-topic?| 4 | Highly on-topic (retrieved the exact payment penalty clause), <br><br> but included two lines about unrelated "Refunds" which is considered "extra noise." <br><br> |
| Were the chunks concise and useful?| 4 | Useful, as the required sentence was present. <br><br> Not perfectly concise, as it included noise about "Refunds" (`4.2` and `4.3`). <br><br> |

<br><br>




## **Comparison for All Three Models**

Date: 12/03/2025

- **Query:** What are the penalties for late payments?
- **Answer:** Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
<br>


<br>

## **Performance Metrics**

---
**📍Test #1 Results**

| Model | Indexing Time (s) | Retrieval Time (s)  | Context Conciseness Score |
| :--- | :---: | :--- | :--- |
| MiniLM-L6-v2| 2.58 | 0.16  | 4 (Pulled noise)  |
| BGE-small-en|  1.20 | 0.11  | 5 (Cleanest context)  |
| E5-small-v2|  4.42 | 0.11  | 4 (Pulled noise)  |

<br>

- **BGE-small-en** was the overall winner in speed, demonstrating the fastest Indexing Time (1.2s) and matching the fastest Retrieval Time (0.11s).

- **E5-small-v2** had the slowest Indexing Time (4.42s) but was fast during retrieval (0.11s).

- **MiniLM-L6-v2** had a moderate Indexing Time (2.58s) but was slightly slower on Retrieval Time (0.16s).

- For this specific RAG setup and document set, BGE-small-en offered the best combination of speed and retrieval quality in Test #1.

---

**📍Test #2 Results**

| Model | Indexing Time (s) | Retrieval Time (s)  | Context Conciseness Score |
| :--- | :---: | :--- | :--- |
| MiniLM-L6-v2| 0.44 | 0.03  | 4 (Pulled noise)  |
| BGE-small-en|  0.75 | 0.05  | 5 (Cleanest context)  |
| E5-small-v2|  1.03 | 0.04  | 4 (Pulled noise)  |
<br>

- **MiniLM-L6-v2:** Indexing Time: 0.44s, Retrieval Time: 0.03s (Fastest indexing and retrieval)

- **BGE-small-en:** Indexing Time: 0.75s, Retrieval Time: 0.05s

- **E5-small-v2:** Indexing Time: 1.03s, Retrieval Time: 0.04s

The speed champion is MiniLM-L6-v2. The qualitative analysis (which chunks are pulled) remains the same: BGE-small-en remains the winner for context quality/conciseness in Test #2.

---

**📍Test #3 Results**

| Model | Indexing Time (s) | Retrieval Time (s)  | Context Conciseness Score |
| :--- | :---: | :--- | :--- |
| MiniLM-L6-v2| 1.87 | 0.11  | 4 (Pulled noise)  |
| BGE-small-en|  1.82 | 0.14  | 5 (Cleanest context)  |
| E5-small-v2|  1.61 | 0.09  | 4 (Pulled noise)  |
<br>

- **Speed Champion: E5-small-v2** is now the fastest model for both indexing (1.61s) and querying (0.09s).
<br>

- **Retrieval Quality Champion: BGE-small-en** remains the best for high-quality, concise context retrieval (Score 5), demonstrating superior semantic focus by isolating the penalty clause without pulling in unrelated sections (like the Refund clauses).

<br><br>

## **Conclusion: Trade-offs Between Speed and Retrieval Quality**
---

This summary evaluates each embedding model based on its averaged performance metrics and its retrieval quality scores across three test runs. The qualitative scores were perfectly consistent (5 for BGE-small-en, 4 for the others in conciseness), so the primary difference was speed. The conclusion therefore focuses on the trade-off between speed and retrieval quality.
<br><br>


**Average Performance Metrics for Embedding Models**

| Model | Average Indexing Time (s) | Average Retrieval Time (s)  |
| :--- | :---: | :--- |
| MiniLM-L6-v2| 1.63 | 0.10  |
| BGE-small-en|  1.26 | 0.10  |
| E5-small-v2|  2.35 | 0.08  |
<br>
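The averages in this table can be recomputed from the three per-test tables above. The sketch below does so with the standard library only; the numbers are copied from Tests #1 to #3, and the `runs` and `averages` names are my own.

```python
from statistics import mean

# Timings (seconds) copied from the Test #1, #2, and #3 tables.
runs = {
    "MiniLM-L6-v2": {"indexing": [2.58, 0.44, 1.87], "retrieval": [0.16, 0.03, 0.11]},
    "BGE-small-en": {"indexing": [1.20, 0.75, 1.82], "retrieval": [0.11, 0.05, 0.14]},
    "E5-small-v2": {"indexing": [4.42, 1.03, 1.61], "retrieval": [0.11, 0.04, 0.09]},
}

# Average each metric per model and round to two decimals, as in the table.
averages = {
    model: {metric: round(mean(values), 2) for metric, values in timings.items()}
    for model, timings in runs.items()
}

for model, avg in averages.items():
    print(f"{model}: indexing={avg['indexing']:.2f}s, retrieval={avg['retrieval']:.2f}s")
```

With only three runs per model, one slow run moves the average a lot (for example, the 4.42s indexing run for E5-small-v2), so I read these averages as indicative rather than definitive.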

**1. 🧠 MiniLM-L6-v2**

The MiniLM-L6-v2 model sits in the middle on speed, with an average indexing time of 1.63s and an average retrieval time of 0.10s (tied with BGE-small-en). However, it compromises slightly on retrieval precision. While it accurately found the answer, it consistently scored 4/5 for conciseness because it pulled in "noise" (unrelated sections about refunds). This suggests that MiniLM-L6-v2 might retrieve slightly less focused context, which could increase the amount of irrelevant information passed to the LLM in a larger, more complex RAG system.
<br><br>

**2. 🧠 BGE-small-en (Balanced Winner)**

BGE-small-en emerged as the best overall choice when considering both speed and quality. It boasts the fastest average indexing time (1.26s), meaning it is the quickest to set up the knowledge base. Crucially, it consistently scored 5/5 for context conciseness, retrieving only the precise payment-related information and exhibiting superior semantic focus. This model minimizes the risk of feeding irrelevant information to the LLM, making it ideal for applications prioritizing high-quality, clean results, even if its query time is not the absolute fastest.
<br><br>

**3. 🧠 E5-small-v2**

The E5-small-v2 model is the champion of raw querying speed, demonstrating the fastest average retrieval time (0.08s). This makes it suitable for high-volume, real-time query applications. However, this speed comes at the cost of the slowest average indexing time (2.35s) and a slight drop in retrieval quality (scoring 4/5 due to extraneous context). The E5-small-v2 is best used when document setup is infrequent, but quick, real-time lookups are paramount.

# **Section 6: Testing Automation**


### **Rationale for Multiple Test Runs (N=3)**

I run the Indexing and Retrieval processes multiple times (NUM_TESTS = 3) to ensure the results are reliable and not skewed by system volatility.

- **Averaging Volatility:** Initial runs are often inflated due to "cold starts" (loading models and initializing libraries). Averaging across tests smooths out these transient spikes caused by background processes or initialization time.

- **Stable Metrics:** The average time provides me with a more stable and representative measure of the model's true, consistent performance, allowing me to draw a robust conclusion about the speed vs. quality trade-off.
<br>
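One optional variation, which the cell below does not use, is to discard a warm-up call before timing so that cold-start cost never enters the average. The following is a sketch under that assumption: `time_runs` is a hypothetical helper, and `time.perf_counter` is used because it is intended for measuring short durations.

```python
import time
from statistics import mean, median


def time_runs(fn, num_tests=3, warmup=1):
    """Call `fn` `warmup` times untimed, then time `num_tests` further calls."""
    for _ in range(warmup):
        fn()  # absorbs cold-start cost (model loading, caches) without recording it

    timings = []
    for _ in range(num_tests):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)

    return {"timings": timings, "mean": mean(timings), "median": median(timings)}


# Stand-in workload; in the notebook this would be an indexing or retrieval call.
stats = time_runs(lambda: sum(range(10_000)), num_tests=3, warmup=1)
print(f"mean={stats['mean']:.6f}s median={stats['median']:.6f}s")
```

Reporting the median alongside the mean is a cheap way to see whether a single outlier run is driving the average.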


In [7]:
# 1. Installation & Import
# `DataFrame.to_markdown` (used below) also requires the `tabulate` package.
!pip install -q pandas numpy tabulate
import numpy as np # Used for calculating mean (average) times.
import pandas as pd # Used for creating and displaying the final results table.

# --- 2. Configuration ---
NUM_TESTS = 3
QUERY = "What are the penalties for late payments?"
CHUNK_SIZE = 50
CHUNK_OVERLAP = 50


# --- 3. Testing Loop and Timing Collection ---

timing_results = {}

for model_name, model_path in embedding_models.items():
    print(f"\n=======================================================")
    print(f"🧠 Testing Model: {model_name}")
    print(f"=======================================================")

    # Configure the embedding model for the current test
    Settings.embed_model = HuggingFaceEmbedding(model_name=model_path)

    indexing_times = []
    retrieval_times = []

    for i in range(1, NUM_TESTS + 1):
        print(f"--- Running Test Run #{i} ---")

        # 1. INDEXING TIME
        start_time_index = time.time()
        index = VectorStoreIndex(nodes)
        end_time_index = time.time()
        indexing_time = end_time_index - start_time_index
        indexing_times.append(indexing_time)
        print(f"   -> Indexing Time: {indexing_time:.4f}s")

        # 2. RETRIEVAL TIME
        retriever = index.as_retriever(similarity_top_k=2)
        query_engine = RetrieverQueryEngine.from_args(retriever=retriever)

        start_time_query = time.time()
        # Run the query (Note: LLM is None, so only retrieval is timed)
        response = query_engine.query(QUERY)
        end_time_query = time.time()
        retrieval_time = end_time_query - start_time_query
        retrieval_times.append(retrieval_time)
        print(f"   -> Retrieval Time: {retrieval_time:.4f}s")

    # Store all results for this model
    timing_results[model_name] = {
        "indexing_times": indexing_times,
        "retrieval_times": retrieval_times,
        "avg_indexing": np.mean(indexing_times),
        "avg_retrieval": np.mean(retrieval_times)
    }

# --- 4. Results Output (Formatted Markdown Table using Pandas) ---

print("\n\n" + "="*80)
print(f"| FINAL PERFORMANCE COMPARISON (Over {NUM_TESTS} Runs) |")
print("="*80 + "\n")

# Prepare data for the Pandas DataFrame
data_for_df = []
columns = ["Model", "Avg. Indexing (s)", "Avg. Retrieval (s)"]
columns.extend([f"Index T{i} (s)" for i in range(1, NUM_TESTS + 1)])
columns.extend([f"Query T{i} (s)" for i in range(1, NUM_TESTS + 1)])

for model, data in timing_results.items():
    row = [
        model,
        f"{data['avg_indexing']:.3f}",
        f"{data['avg_retrieval']:.3f}"
    ]
    row.extend([f"{t:.3f}" for t in data['indexing_times']])
    row.extend([f"{t:.3f}" for t in data['retrieval_times']])
    data_for_df.append(row)

# Create the DataFrame
df = pd.DataFrame(data_for_df, columns=columns)

# Convert to Markdown table and print
markdown_output = "## Test Results\n"
markdown_output += df.to_markdown(index=False)

print(markdown_output)




🧠 Testing Model: MiniLM-L6-v2
--- Running Test Run #1 ---
   -> Indexing Time: 0.4083s
   -> Retrieval Time: 0.0262s
--- Running Test Run #2 ---
   -> Indexing Time: 0.3982s
   -> Retrieval Time: 0.0224s
--- Running Test Run #3 ---
   -> Indexing Time: 0.3757s
   -> Retrieval Time: 0.0226s

🧠 Testing Model: BGE-small-en
--- Running Test Run #1 ---
   -> Indexing Time: 0.7854s
   -> Retrieval Time: 0.0588s
--- Running Test Run #2 ---
   -> Indexing Time: 0.7532s
   -> Retrieval Time: 0.0412s
--- Running Test Run #3 ---
   -> Indexing Time: 1.2035s
   -> Retrieval Time: 0.0625s

🧠 Testing Model: E5-small-v2
--- Running Test Run #1 ---
   -> Indexing Time: 1.2527s
   -> Retrieval Time: 0.0565s
--- Running Test Run #2 ---
   -> Indexing Time: 0.7850s
   -> Retrieval Time: 0.0362s
--- Running Test Run #3 ---
   -> Indexing Time: 0.7349s
   -> Retrieval Time: 0.0353s


| FINAL PERFORMANCE COMPARISON (Over 3 Runs) |

## Test Results
| Model        |   Avg. Indexing (s) |   Avg. Retrieval (s)

# **📊 Section 7: Run 3 RAG Configurations and Log Output Differences**

For this task, I'm going to run the same question through three different retriever setups and track how each configuration affects the answer quality. Small changes in retrieval can lead to big differences in what the model sees—and what it says.

I will explore how varying these parameters changes the retrieval performance:

- **`top_k`**: How many chunks are retrieved

- **`Similarity threshold`:** Whether weak matches are filtered out

- **`Reranker`:** Whether the results are re-sorted by a second-stage reranker (in my experiment, a locally run cross-encoder model)
<br>
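A plain-Python sketch of how the first two knobs interact: keep the `top_k` best-scoring chunks, then drop any whose score falls below the threshold. `select_chunks` is a hypothetical helper and the scores are made-up illustration values, not measurements from this notebook. In LlamaIndex itself, I believe the threshold would be applied with a `SimilarityPostprocessor`.

```python
def select_chunks(scored_chunks, top_k, similarity_cutoff=None):
    """Keep the top_k highest-scoring (chunk, score) pairs, then apply an optional cutoff."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)[:top_k]
    if similarity_cutoff is None:
        return ranked
    return [(chunk, score) for chunk, score in ranked if score >= similarity_cutoff]


# Made-up scores for illustration only.
candidates = [("chunk-a", 0.82), ("chunk-b", 0.74), ("chunk-c", 0.71), ("chunk-d", 0.60)]

print(select_chunks(candidates, top_k=3))
print(select_chunks(candidates, top_k=3, similarity_cutoff=0.75))
```

Raising `top_k` can only add chunks (more recall, more noise), while raising the cutoff can only remove them (more precision, with the risk of returning nothing).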

### **📝 Combined RAG Configuration and Observation Scorecard**

Now that I have my foundational RAG setup working, I'm focusing on optimization by systematically testing key retrieval parameters defined in the code block below. My goal with these six experiments (labeled B1-B3 and C1-C3) is to understand the trade-offs between maximizing recall and maximizing precision. I am testing three distinct values of top_k (Experiments B1-B3) to see how simply retrieving more context affects the final answer quality. Separately, in Experiments C1-C3, I'm keeping top_k fixed at 8 and applying three increasingly strict similarity thresholds (0.70, 0.75, 0.80).
<br><br>
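The six experiments can be written down as a small configuration grid before running anything. This is a sketch: `RetrievalConfig` is a hypothetical name, and the three `top_k` values for B1-B3 are placeholders I chose for illustration, since only the C-series settings (`top_k` of 8 with thresholds 0.70, 0.75, 0.80) are stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetrievalConfig:
    """One retrieval experiment: a label, a top_k, and an optional similarity cutoff."""
    label: str
    top_k: int
    similarity_cutoff: Optional[float] = None


B_TOP_K_VALUES = (2, 4, 8)       # HYPOTHETICAL placeholders for B1-B3
C_CUTOFFS = (0.70, 0.75, 0.80)   # thresholds for C1-C3, as described above
C_TOP_K = 8                      # fixed top_k for C1-C3, as described above

configs = [RetrievalConfig(f"B{i}", top_k=k) for i, k in enumerate(B_TOP_K_VALUES, start=1)]
configs += [
    RetrievalConfig(f"C{i}", top_k=C_TOP_K, similarity_cutoff=cutoff)
    for i, cutoff in enumerate(C_CUTOFFS, start=1)
]

for config in configs:
    print(config)
```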

## **BGE Reranking Experiment**

I will experiment with the BGE-reranker-base model for local reranking.
<br><br>

**Rationale for Experiment D: Local Reranking (BGE)**

While my initial vector search (using the embedding models in Experiments B and C) is fast and effective for retrieving candidates, it ranks chunks purely by vector similarity, which can miss subtler relevance between the query and a document chunk. **Crucially, Experiments B and C do not use a separate reranker; they rely solely on the initial vector similarity score produced by my chosen embedding model.**

<br>

Experiment D directly addresses this by introducing a cross-encoder reranker (**BGE-reranker-base**) as a post-processing step. I am using the `SentenceTransformerRerank` class from LlamaIndex to integrate this locally run cross-encoder into my RAG pipeline. The model takes the top 8 chunks retrieved from the vector store and scores each one jointly with the query, reading the query and chunk text together rather than comparing two separately computed vectors. It then keeps only the best 3 (`top_n=3`).
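
The retrieve-then-rerank flow can be sketched without downloading a model. In the example below, `toy_pair_score` is a word-overlap stand-in that I made up so the code runs anywhere; it is not how the BGE cross-encoder scores pairs, and `rerank` is a hypothetical helper rather than a LlamaIndex API. The candidate texts are excerpts from the sample contract, and the query is an illustrative one.

```python
import re
from typing import Callable, List, Tuple

def toy_pair_score(query: str, chunk: str) -> float:
    """Stand-in for a cross-encoder: Jaccard word overlap of the (query, chunk) pair."""
    q = set(re.findall(r"\w+", query.lower()))
    c = set(re.findall(r"\w+", chunk.lower()))
    union = q | c
    return len(q & c) / len(union) if union else 0.0

def rerank(query: str, candidates: List[str], top_n: int = 3,
           score_fn: Callable[[str, str], float] = toy_pair_score) -> List[Tuple[str, float]]:
    """Score every (query, chunk) pair, sort best-first, and keep only top_n."""
    rescored = [(chunk, score_fn(query, chunk)) for chunk in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]

# First-stage candidates (in the real pipeline these come from the vector store).
candidates = [
    "No refunds will be issued for completed projects.",
    "Late payments shall bear interest at the rate of 1.5% per month.",
    "Refunds are issued at the sole discretion of Service Provider.",
    "Payment terms are net 30 days from receipt of invoice.",
]

for chunk, score in rerank("What are the payment terms?", candidates, top_n=3):
    print(f"{score:.3f}  {chunk}")
```

The point of the sketch is the shape of the pipeline: the second stage sees the query and each chunk together, re-sorts, and then truncates, so the final context can be both smaller and differently ordered than the first-stage list.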

<br>

The goal is to test whether this specialized second-stage filtering, which is typically more accurate for relevance than vector similarity alone and runs entirely locally at no API cost, can noticeably improve the final answer quality compared to simply increasing `top_k` or applying a similarity threshold. I anticipate Experiment D will show high precision and a potentially improved answer, despite using less overall context (only 3 nodes) for generation.
<br><br>


This table merges my experimental setup parameters with the results I observe, providing a single, complete view for analysis. By analyzing the final output table, specifically the 'Chunks Retrieved' count and the 'Best Score', I expect to identify the optimal configuration that balances getting enough relevant information with minimizing irrelevant or noisy context.
<br> <br>

Here is an explanation of the observation fields I need to log:

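- **Chunks Retrieved (Count):** The final number of context chunks remaining after applying the top_k, similarity threshold, and reranker filters. In the code below the threshold is applied after generation, for logging only, so in Experiments C this count can be lower than the number of chunks the LLM actually received.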

- **Best Chunk (short excerpt):** A short, direct quote from the most relevant chunk that contains the core information needed to answer the query.

- **Answer (shortened):** The final answer generated by the LLM, condensed for logging purposes.

- **Confidence (1-5):** How sure I am that the generated answer is clear, complete, and factually correct, based only on the context retrieved by the system.

- **Notes:** My qualitative observations on the run, such as why the retrieved context was particularly helpful or why the reranker succeeded/failed.
<br>

| Comparison | A (default) | B1 | B2 | B3 | C1 | C2 | C3 | D |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Configuration | "..." | "..." | "..." | "..." | "..." | "..." | "..." | "..." |
| top_k | "..." | "..." | "..." | "..." | "..." | "..." | "..." | "..." |
| Threshold | "..." | "..." | "..." | "..." | "..." | "..." | "..." | "..." |
| Reranker | "..." | "..." | "..." | "..." | "..." | "..." | "..." | "..." |
| Chunks Retrieved | "..." | "..." | "..." | "..." | "..." | "..." | "..." | "..." |
| Best Chunk (short excerpt) | "..." | "..." | "..." | "..." | "..." | "..." | "..." | "..." |
| Answer (shortened) | "..." | "..." | "..." | "..." | "..." | "..." | "..." | "..." |
| Confidence (1-5) | "..." | "..." | "..." | "..." | "..." | "..." | "..." | "..." |
| Notes | "..." | "..." | "..." | "..." | "..." | "..." | "..." | "..." |

<br>
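
Since the Confidence and Notes fields are filled in by hand, it can help to keep them in a small validated record next to the automated metrics. This is a sketch only: the `Observation` class and its field names are hypothetical and not part of the notebook's code, and the example values are placeholders.

```python
from dataclasses import asdict, dataclass

@dataclass
class Observation:
    """One scorecard column: automated metrics plus my manual judgment."""
    experiment: str
    chunks_retrieved: int
    best_chunk: str
    answer: str
    confidence: int  # manual rating on the 1-5 scale described above
    notes: str = ""

    def __post_init__(self) -> None:
        # Guard against typos when entering the manual rating.
        if not 1 <= self.confidence <= 5:
            raise ValueError("confidence must be between 1 and 5")

# Placeholder values for illustration only.
obs = Observation(
    experiment="B1 (k=2)",
    chunks_retrieved=2,
    best_chunk="(excerpt goes here)",
    answer="(shortened answer goes here)",
    confidence=1,
    notes="(manual observation goes here)",
)
print(asdict(obs))
```

Records like this convert directly to rows via `asdict`, so the manual fields can later be joined with the automated results table.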


### Required Imports for Pandas Table and LlamaIndex

In [13]:
from llama_index.core.postprocessor import SentenceTransformerRerank  # Local cross-encoder reranker (used with BGE)

# --- Imports also used in previous sections; repeated so this section runs on its own ---
# !pip install -q pandas  # Ensure Pandas is installed for table generation
import pandas as pd  # Note: DataFrame.to_markdown() also requires the `tabulate` package
from llama_index.core.query_engine import RetrieverQueryEngine
# from llama_index.core import VectorStoreIndex  # Only needed if the index is rebuilt in this section




## **Configuration Setup**


In [14]:
# Before running, make sure `index` is already initialized from the PDF/text data
QUERY = "What is the maximum loan amount a borrower can apply for?"
COHERE_API_KEY = "YOUR_COHERE_API_KEY"  # Not used below (Experiment D reranks locally); needed only if a Cohere reranker is added later


## **Experimental Steps**

In [15]:
# This script defines seven RAG experiments across three groups (B for top_k, C for threshold,
# D for local reranking) to compare how each setting affects retrieval performance.

# Define the seven experimental setups
EXPERIMENTS = {
    # Set B: Testing Different top_k values (No threshold filtering for a clean comparison)
    "B1 (k=2)": {"top_k": 2, "threshold": 0.0, "reranker": False, "notes": "Low recall (2), no filter."},
    "B2 (k=5)": {"top_k": 5, "threshold": 0.0, "reranker": False, "notes": "Moderate recall (5), no filter."},
    "B3 (k=10)": {"top_k": 10, "threshold": 0.0, "reranker": False, "notes": "High recall (10), no filter."},

    # Set C: Testing Different Thresholds (Fixed top_k=8 for controlled comparison)
    "C1 (Th=0.70)": {"top_k": 8, "threshold": 0.70, "reranker": False, "notes": "Moderate recall (8), less strict threshold (0.70)."},
    "C2 (Th=0.75)": {"top_k": 8, "threshold": 0.75, "reranker": False, "notes": "Moderate recall (8), moderate threshold (0.75)."},
    "C3 (Th=0.80)": {"top_k": 8, "threshold": 0.80, "reranker": False, "notes": "Moderate recall (8), strict threshold (0.80)."},

    # Set D: Testing Local Reranking (Fixed k=8, no threshold pre-filter, reranker keeps top 3)
    "D (Rerank)": {"top_k": 8, "threshold": 0.0, "reranker": True, "notes": "Uses BGE Reranker on top 8 nodes, keeps the best 3."},

}



## **Initialize the local BGE Reranker model (used by Experiment D)**

In [16]:
# Initialize the local BGE Reranker model (used by Experiment D)
local_reranker = SentenceTransformerRerank(
    model="BAAI/bge-reranker-base",
    top_n=3,  # Reranker will only keep the top 3 most relevant nodes
    device="cpu"
)

### **Experiment Runner Function**

In [17]:
def run_experiment(exp_name, top_k, threshold, use_reranker):
    """Runs a single RAG configuration test, prints results, and returns structured data."""
    if 'index' not in globals():
        print(f"Error: 'index' object not found. Please ensure your VectorStoreIndex is initialized.")
        return None

    print(f"\n=============================================")
    print(f"🔬 Running Experiment: {exp_name} (k={top_k}, Threshold={threshold:.2f}, Reranker={use_reranker})")
    print(f"=============================================")

    # 1. Initialize Retriever for initial top_k search
    retriever = index.as_retriever(similarity_top_k=top_k)

    # 2. Configure Query Engine based on experiment type
    node_postprocessors = []

    if use_reranker:
        print("   -> Applying BGE Reranker (Post-Processor)...")
        node_postprocessors.append(local_reranker)

    # Note on thresholds (Experiments C): no similarity-cutoff post-processor is added
    # to the query engine here. The LLM therefore still receives all top_k nodes, and
    # the threshold is applied afterwards (step 3) for logging purposes only.
    # Only Experiment D uses a post-processor (the BGE reranker).

    query_engine = RetrieverQueryEngine.from_args(
        retriever=retriever,
        node_postprocessors=node_postprocessors, # Reranker is added here for Exp D
    )

    # --- Execute Query (Retrieval + Reranking/LLM) ---
    response = query_engine.query(QUERY)
    final_answer = response.response

    # Get the nodes the query engine passed to the LLM (post-reranking for Exp D)
    final_nodes = response.source_nodes

    # 3. Apply Threshold Filter for logging/reporting (Experiments C)
    # The query engine in this setup does not filter by similarity, so the threshold is
    # applied here, after generation. "Chunks Retrieved" for Exp C therefore counts the
    # nodes that survive the cutoff, not the nodes the LLM actually saw.
    pre_filter_count = len(final_nodes)
    if threshold > 0.0:
        final_nodes = [node for node in final_nodes if node.score is not None and node.score >= threshold]
        print(f"   -> Filtered {pre_filter_count - len(final_nodes)} nodes (Score < {threshold:.2f} discarded).")


    # Extract metrics for the table
    # Note: ignore unscored nodes (score is None) and handle the empty case before taking the max
    scored_nodes = [n for n in final_nodes if n.score is not None]
    best_node = max(scored_nodes, key=lambda n: n.score) if scored_nodes else None

    # Console Output (for immediate feedback)
    print(f"\n✅ Final Answer:")
    print(final_answer)

    print(f"\n📄 Retrieved Chunks (Total: {len(final_nodes)}):")
    if best_node:
        print(f"   -> Best Chunk Score: {best_node.score:.3f}")
        print(f"   -> Best Chunk Excerpt: {best_node.get_text().strip()[:50]}...")
    else:
        print("   -> No chunks retrieved or scored.")

    # Return structured data for the final table
    return {
        "Experiment": exp_name,
        "top_k": top_k,
        "Threshold": f"{threshold:.2f}" if threshold > 0 else "None",
        "Reranker": "BGE" if use_reranker else "Off",
        "Chunks Retrieved": len(final_nodes),
        "Best Score": f"{best_node.score:.3f}" if best_node else "N/A",
        "Best Excerpt": best_node.get_text().strip()[:50] + "..." if best_node else "N/A",
        "Answer (shortened)": final_answer.strip()[:80] + "...",
    }



## **Execution Loop & Pandas Table Generation**

In [18]:
# --- Execution Loop and Pandas Table Generation ---

results_list = []
for exp_name, params in EXPERIMENTS.items():
    result = run_experiment(exp_name, params["top_k"], params["threshold"], params["reranker"])
    if result:
        results_list.append(result)

print("\n\n" + "="*80)
print("| FINAL RAG CONFIGURATION COMPARISON TABLE |")
print("="*80 + "\n")

# Create and display the DataFrame
df_results = pd.DataFrame(results_list)

# Select and reorder columns for better readability
final_columns = [
    "Experiment", "top_k", "Threshold", "Reranker",
    "Chunks Retrieved", "Best Score", "Best Excerpt",
    "Answer (shortened)"
]
df_final = df_results[final_columns]

# Print the final Markdown table
print(df_final.to_markdown(index=False))





🔬 Running Experiment: B1 (k=2) (k=2, Threshold=0.00, Reranker=False)

✅ Final Answer:
Context information is below.
---------------------
Payment terms are
net 30 days from receipt of invoice.
2.3 Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
3.

4.2 Refunds are issued at the sole discretion of Service Provider and will be processed within 30 days
of approval.

4.3 No refunds will be issued for completed projects that meet the specifications outlined in Exhibit A.
5.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: What is the maximum loan amount a borrower can apply for?
Answer: 

📄 Retrieved Chunks (Total: 2):
   -> Best Chunk Score: 0.786
   -> Best Chunk Excerpt: Payment terms are
net 30 days from receipt of invo...

🔬 Running Experiment: B2 (k=5) (k=5, Threshold=0.00, Reranker=False)

✅ Final Answer:
Context information is below.
---------------------
Payment terms are
net