<a href="https://colab.research.google.com/github/LashawnFofung/RAG-Pipelines/blob/main/src/Task_Comparing_Open_Source_Embedding_Models_for_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Comparing, Testing, and Choosing the Best Embedding Model for Retrieval-Augmented Generation (RAG)**

**Data:** *sample_contract.pdf*
<br><br>


The embedding model is the connective tissue of any RAG system, directly determining the quality and relevance of the retrieved context. A superior embedding model captures the semantic meaning behind my user queries and my knowledge base documents, which should lead to more accurate and helpful answers from the Large Language Model (LLM).
<br><br>

In this interactive Colab notebook, I will develop a critical skill for real-world AI engineering: systematically comparing and evaluating the impact of different open-source embedding models on my RAG pipeline's output. I will move beyond just benchmark scores to a qualitative, hands-on comparison using my own data and queries.
<br><br>


**Table of Contents**
- 🔧 [Section 1: Setup Environment](#scrollTo=phfOykN5Cc5n&line=1&uniqifier=1)
- 📄 [Section 2: Document Ingestion and Node Creation (PDF Loading using Fallback)](#scrollTo=T7Dbp7AqDF7p&line=1&uniqifier=1)
- 🧠 [Section 3: Initialize & Compare Embedding Models Testing Loop](#scrollTo=-K-dkhsmDYCS&line=1&uniqifier=1)
- 📊 [Section 4: Compare Outputs](#scrollTo=5Tmt3mJsDgPZ&line=1&uniqifier=1)
- 💡 [Final Results Comparison with a Scorecard](#scrollTo=IH0NiWQLJy7T&line=4&uniqifier=1)
- ⏱ [Testing Automation](#scrollTo=rC6swMC_CIUh&line=12&uniqifier=1)
<br>


**🛠️ My Hands-On Evaluation Steps**

I will follow this structured process to assess how three different open-source embedding models (like MiniLM, E5, or BGE) affect the retrieval and final answer quality of my RAG system.
<br><br>


**1. I'll Choose and Implement 3 Embedding Models**

- I will select three small, popular open-source models (e.g., MiniLM, E5, BGE) from the available list.

- For each model, I'll swap the embedding model in my RAG pipeline using the `HuggingFaceEmbedding` class:
<br>

**```Python```**
    
    
    embed_model = HuggingFaceEmbedding(model_name="your_model_name_here")
    
    
  <br>
  
- **Note:** Re-index documents if the embedding model is changed, as each model creates a unique vector space.
<br>
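To see why re-indexing matters, here is a toy sketch. The two functions `embed_a` and `embed_b` are hypothetical stand-ins for two embedding models (they are not real models); they encode the same information but on different axes. A query embedded with one "model" only matches a document embedded with the same one.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Hypothetical "model A": axis 0 = payment-related, axis 1 = refund-related.
def embed_a(features):
    return [features["payment"], features["refund"]]


# Hypothetical "model B": same information, but the axes are swapped.
def embed_b(features):
    return [features["refund"], features["payment"]]


doc = {"payment": 1.0, "refund": 0.0}
query = {"payment": 1.0, "refund": 0.0}

# Query and document embedded by the same model: a perfect match.
same_space = cosine(embed_a(query), embed_a(doc))
# Query embedded by model B, document indexed with model A: the match is lost.
mixed_space = cosine(embed_b(query), embed_a(doc))

print(same_space, mixed_space)  # 1.0 0.0
```

Real embedding spaces differ in far less obvious ways than a swapped axis, but the consequence is the same: vectors stored by one model cannot be searched with another.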

**2. I'll Test with Consistent Queries**

I will select 2-3 diverse test questions to use across all three models. This ensures a fair, apples-to-apples comparison.

- **Example**:

> **Query I'll use:** query = "What is the maximum loan amount a borrower can apply for?"


<br>
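One way to keep the comparison apples-to-apples is to fix the query list once and pair every query with every model. The sketch below assumes the list names `MODELS` and `TEST_QUERIES`, which are illustrative and not part of the notebook's code.

```python
from itertools import product

# Illustrative names; the notebook itself stores models in a dict.
MODELS = ["MiniLM-L6-v2", "BGE-small-en", "E5-small-v2"]
TEST_QUERIES = [
    "What are the penalties for late payments?",
    "What is the maximum loan amount a borrower can apply for?",
]

# Every model sees exactly the same queries, in the same order.
runs = list(product(MODELS, TEST_QUERIES))

for model_name, test_query in runs:
    print(f"{model_name:<14} | {test_query}")
```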


**3. I'll Analyze Retrieved Context (Chunks)**

- For each model and query, I'll print the chunks the RAG system retrieved to understand what the AI is using as context.
<br>

  **```Python```**
  
  
    for node in retriever.retrieve(query):
        print(node.get_text())
  
<br>

- My Key Check: Do the chunks feel on-topic? Do they capture the semantic meaning (synonyms/related concepts) of my query, or just exact keywords? Are they concise and free of unrelated noise?
<br>
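A rough way to tell "exact keywords" from "semantic meaning" is to measure how many query words appear verbatim in a retrieved chunk. The helper `keyword_overlap` below is a hypothetical aid for this check, not part of LlamaIndex. A chunk that answers the query while sharing few of its words suggests the model matched on meaning.

```python
import re


def keyword_overlap(query: str, chunk: str) -> float:
    """Fraction of distinct query words that also appear verbatim in the chunk."""
    def words(text):
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    query_words = words(query)
    if not query_words:
        return 0.0
    return len(query_words & words(chunk)) / len(query_words)


query = "What are the penalties for late payments?"
chunk = (
    "2.3 Late payments shall bear interest at the rate of 1.5% per month "
    "from the due date until paid in full."
)

score = keyword_overlap(query, chunk)
# Only "the", "late" and "payments" are shared; "penalties" never appears,
# yet the chunk still answers the question.
print(f"{score:.2f}")  # 0.43
```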


 **4. Compare Final Results with a Scorecard**

I will qualitatively assess the final answer generated by the RAG system using the retrieved context. I will use a simple scorecard to document my findings for each model:

<br>

| Question | Score (1-5) | Notes |
| :--- | :---: | :--- |
| Was the answer complete?| 0 | add results  |
| Was the answer correct?|  0 | add results  |
| Was the language clear?|  0 | add results  |
| Did the context feel on-topic?| 0 | add results  |
| Were the chunks concise and useful?| 0 | add results  |

<br><br>
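If I want to fill in the scorecard programmatically, a small structure like the one below could hold it. This is a minimal sketch: the `Scorecard` class and its methods are my own illustrative names, and the model name and notes are placeholders.

```python
from dataclasses import dataclass, field

CRITERIA = [
    "Was the answer complete?",
    "Was the answer correct?",
    "Was the language clear?",
    "Did the context feel on-topic?",
    "Were the chunks concise and useful?",
]


@dataclass
class Scorecard:
    model: str
    # Maps each criterion to a (score, note) pair.
    scores: dict = field(default_factory=dict)

    def rate(self, criterion: str, score: int, note: str = "") -> None:
        """Record a 1-5 score for one of the five criteria."""
        if criterion not in CRITERIA:
            raise KeyError(f"Unknown criterion: {criterion}")
        if not 1 <= score <= 5:
            raise ValueError("Score must be between 1 and 5")
        self.scores[criterion] = (score, note)

    def total(self) -> int:
        """Sum of all scores recorded so far."""
        return sum(score for score, _ in self.scores.values())


card = Scorecard(model="example-model")
card.rate(CRITERIA[0], 5, "placeholder note")
card.rate(CRITERIA[3], 4, "placeholder note")
print(card.total(), "out of", 5 * len(CRITERIA))  # 9 out of 25
```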


# **🔧 Section 1: Setup Environment**

Install the packages (libraries) needed for the RAG pipeline, specifically for indexing, embedding, and document parsing: `llama-index`, `pymupdf`, `llama-index-embeddings-huggingface`

Optional (needed for Colab): nest_asyncio

In [1]:
# Install the necessary LlamaIndex packages, plus `pymupdf` for PDF parsing.
!pip install -q llama-index llama-index-embeddings-huggingface pymupdf


# Install `nest_asyncio`. It is needed in Colab/Jupyter
# environments to allow asynchronous operations to run smoothly within a single thread.
!pip install -q nest_asyncio


# Install jedi to resolve a non-critical dependency warning related to ipython's
# interactive features, ensuring notebook output is completely clean.
!pip install -q jedi

# Ensure sentence-transformers is available for HuggingFaceEmbedding
!pip install -q sentence-transformers


In [2]:
# ------  Imports and Initial Configuration ------

import nest_asyncio
# Fix potential event loop conflicts
nest_asyncio.apply()

# Importing all the essential components from LlamaIndex
from llama_index.core import VectorStoreIndex, Document, Settings, get_response_synthesizer

#  Standard document-to-chunk tool(break documents into manageable pieces)
from llama_index.core.node_parser import SentenceSplitter

# Core component for loading my open-source embedding models (like MiniLM or E5).
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Class used to combine retriever (for getting context) and LLM (for generating the answer).
from llama_index.core.query_engine import RetrieverQueryEngine

# Simple time measurements to compare model speeds.
import time

# --- Embedding Models Definition ---
# These are the local, open-source embedding models we will compare.
embedding_models = {
    "MiniLM-L6-v2": "sentence-transformers/all-MiniLM-L6-v2",
    "BGE-small-en": "BAAI/bge-small-en-v1.5",
    "E5-small-v2": "intfloat/e5-small-v2"
}


# CRITICAL STEP: I am explicitly setting the LLM (Large Language Model) to None for now.
# Why? Because I want to focus *only* on testing the retrieval quality of the EMBEDDING models.
# With Settings.llm = None, LlamaIndex substitutes a MockLLM (see the message below), so the
# pipeline only retrieves context. I can plug in a simple local LLM later without interference.
Settings.llm = None

# Print setup status
print("✅ Environment setup complete. LLM set to None.")

LLM is explicitly disabled. Using MockLLM.
✅ Environment setup complete. LLM set to None.


# 📄 **Section 2: Document Ingestion and Node Creation (PDF Loading using Fallback)**

This crucial step prepares the data: it extracts the text from the raw, unstructured PDF document and transforms it into a list of structured 'nodes' (chunks) ready for indexing with LlamaIndex.

**1. Document Load and Extraction**

In [3]:
# Placeholder content simulating a loaded document (used as a fallback)
raw_document_text = """
The monthly payment is due on the 1st of every month. Payments received after the 5th day
of the month will incur a late fee of $50. If payment is delayed by more than 30 days,
the account will be flagged, and an additional penalty of 1.5% of the outstanding balance
will be applied, compounded monthly. Failure to pay within 60 days will result in a
suspension of services and potential legal action. Please review section 4.3 for payment
processing guidelines and dispute resolution procedures. All disputes must be filed
within 10 calendar days of the late fee application date.
"""
text = raw_document_text
is_pdf_loaded = False


try:

  # The `files` utility handles interactive file uploads in the Colab environment.
  from google.colab import files

  # PyMuPDF (imported as 'fitz') for reliable, fast PDF parsing.
  import fitz
  print("\n--- Attempting interactive PDF upload ---")

  # --- 1. Document Loading and Extraction via Upload ---

  # Prompts to upload the PDF interactively from local machine.
  print("\n--- Uploading Document: 'sample_contract.pdf' ---")
  uploaded = files.upload()


  # Check if a file was successfully uploaded.
  if uploaded:
      # If successful, extracts the filename (which becomes the path) from the dictionary keys.
      pdf_path = list(uploaded.keys())[0]
      print(f"Successfully uploaded: {pdf_path}")

      # With valid pdf_path, the document can be opened and text can be extracted.
      # Using PyMuPDF (fitz) to open the PDF file for reading.
      doc = fitz.open(pdf_path)

      # Iterate through every page of the document to get the text from each,
      # and join them all together with a newline character (\n) as a separator.
      text = "\n".join([page.get_text() for page in doc])
      doc.close()

      # A quick check to make sure text extraction worked and to see the scale of data.
      print(f"✅ Extracted {len(text.split())} words from the contract.")
      is_pdf_loaded = True
  else:
      # If no file is uploaded, fall back to the placeholder text so later cells still run.
      print("No file uploaded. Using placeholder text for RAG processing.")

except ImportError:
    # This block handles running outside a Colab environment (or without PyMuPDF installed).
    print("⚠️ Skipping Colab/PyMuPDF interactive file upload (environment dependency).")
    print("Using placeholder text for RAG processing.")




--- Attempting interactive PDF upload ---

--- Uploading Document: 'sample_contract.pdf' ---


Saving sample_contract.pdf to sample_contract (3).pdf
Successfully uploaded: sample_contract (3).pdf
✅ Extracted 315 words from the contract.


**2. Chunking with User-Specified Parameters (50/50)**

In [4]:
# Chunking is often the most important step for RAG quality.
# A simple SentenceSplitter is used with an aggressive, precision-oriented strategy:
#   - Small chunks: chunk_size=50 (measured in tokens)
#   - High overlap: chunk_overlap=50 (equal to the chunk size, which is unusually high)
# The goal is to maximize the chance of finding small, highly relevant facts,
# at the cost of more chunks to embed and search.
text_splitter = SentenceSplitter(chunk_size=50, chunk_overlap=50)

# LlamaIndex needs the raw text wrapped in a Document object before splitting.
document = Document(text=text)

# Convert the single large Document into many smaller, overlapping Nodes (chunks).
nodes = text_splitter.get_nodes_from_documents([document])

print(f"✅ Document processed into {len(nodes)} nodes (chunks) with chunk_size = 50, overlap = 50.")


✅ Document processed into 16 nodes (chunks) with chunk_size = 50, overlap = 50.
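
To build intuition for how chunk size and overlap interact, here is a simplified word-level sliding window. It is only a sketch and not how `SentenceSplitter` works internally (that splitter respects sentence boundaries and counts tokens). The function name `window_chunks` is my own. Note that the notebook's setting of overlap equal to chunk size is accepted by `SentenceSplitter` (the cell above ran), but it would stall a plain sliding window, so this sketch requires the overlap to be smaller than the chunk size.

```python
def window_chunks(words, chunk_size, overlap):
    """Split a list of words into fixed-size chunks that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size for a plain sliding window")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the window has reached the end of the text
    return chunks


words = "the quick brown fox jumps over the lazy dog again".split()  # 10 words
chunks = window_chunks(words, chunk_size=4, overlap=2)

for chunk in chunks:
    print(" ".join(chunk))
```

With a chunk size of 4 and an overlap of 2, each chunk repeats the last two words of the previous one, so a fact that straddles a boundary still appears whole in at least one chunk.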


# 🧠 **Section 3: Initialize and Compare Embedding Models Testing Loop**

This section iterates through each model, builds an index with that model, queries it, and records the result.
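
Conceptually, querying a vector index means embedding the query, scoring it against every stored node vector, and keeping the best `k` matches. The toy sketch below illustrates that idea with hand-made 2-D vectors. It is not LlamaIndex's implementation, and the node names and vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_k(query_vec, node_vecs, k=2):
    """Return the ids and scores of the k nodes most similar to the query."""
    scored = sorted(
        ((cosine(query_vec, vec), node_id) for node_id, vec in node_vecs.items()),
        reverse=True,
    )
    return [(node_id, round(score, 3)) for score, node_id in scored[:k]]


# Invented vectors: axis 0 = "payment-related", axis 1 = "refund-related".
node_vecs = {
    "late_fee": [0.9, 0.1],
    "refunds": [0.2, 0.8],
    "invoice": [0.7, 0.3],
}

hits = top_k([1.0, 0.0], node_vecs, k=2)
print(hits)
```

The `refunds` node scores lowest for this payment-oriented query and is dropped when `k=2`, which mirrors the role of `similarity_top_k=2` in the loop that follows.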


In [5]:
query = "What are the penalties for late payments?"
results = {}

for model_name, model_path in embedding_models.items():
    print(f"\n🔍 Testing Embedding Model: {model_name} (Downloading/Loading...)")

    # 1. Configure the embedding model for the current test.
    # This downloads the model if it is not already cached.
    embed_model = HuggingFaceEmbedding(model_name=model_path)
    Settings.embed_model = embed_model

    # 2. Build the index with the new embedding model.
    # The index must be rebuilt for each model so the nodes are embedded in that model's vector space.
    # This step involves:
    #   1. Taking each Node's text.
    #   2. Passing it through the embedding model (set just above via Settings.embed_model).
    #   3. Storing the resulting vector in the index for fast lookups.
    start_time_index = time.time()
    index = VectorStoreIndex(nodes)
    end_time_index = time.time()
    indexing_time = end_time_index - start_time_index
    print(f"   -> Index built in {indexing_time:.2f} seconds.")

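    # 3. Configure the Query Engine
    # The timer starts here, so the measured time includes building the retriever and query engine.
    start_time_query = time.time()
    retriever = index.as_retriever(similarity_top_k=2)
    # Note: with Settings.llm = None, LlamaIndex falls back to MockLLM, so no answer is generated;
    # the response is the prompt filled with the retrieved context.
    query_engine = RetrieverQueryEngine.from_args(retriever=retriever)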

    # 4. Run the query
    response = query_engine.query(query)
    end_time_query = time.time()
    total_query_time = end_time_query - start_time_query

    # 5. Store results
    results[model_name] = {
        "response": str(response),
        "indexing_time": round(indexing_time, 2),
        "query_time": round(total_query_time, 2)
    }
    print(f"   -> Query complete. Time taken: {total_query_time:.2f} seconds.")



🔍 Testing Embedding Model: MiniLM-L6-v2 (Downloading/Loading...)




   -> Index built in 1.87 seconds.
   -> Query complete. Time taken: 0.11 seconds.

🔍 Testing Embedding Model: BGE-small-en (Downloading/Loading...)
   -> Index built in 1.82 seconds.
   -> Query complete. Time taken: 0.14 seconds.

🔍 Testing Embedding Model: E5-small-v2 (Downloading/Loading...)
   -> Index built in 1.61 seconds.
   -> Query complete. Time taken: 0.09 seconds.


# 📊 **Section 4: Compare Outputs**

In [6]:
# This displays the results for analysis.
print("Section 4: Comparative Test Results")

for model, result in results.items():
    print("\n==============================")
    print("📊 Comparative Test Results ")
    print()
    print(f"🧠 Model: {model}")
    print()
    print(f"⏱️ Indexing Time: {result['indexing_time']} seconds")
    print()
    print(f"⏱️ Retrieval Time: {result['query_time']} seconds")
    print()
    print(f"📄 Top Response: {result['response']}")
    print()
    # Closing banner for this model's results.
    print(f"___ 🔴 END {model} MODEL TEST ___")
    print()


Section 4: Comparative Test Results

📊 Comparative Test Results 

🧠 Model: MiniLM-L6-v2

⏱️ Indexing Time: 1.87 seconds

⏱️ Retrieval Time: 0.11 seconds

📄 Top Response: Context information is below.
---------------------
Payment terms are
net 30 days from receipt of invoice.
2.3 Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
3.

4.2 Refunds are issued at the sole discretion of Service Provider and will be processed within 30 days
of approval.

4.3 No refunds will be issued for completed projects that meet the specifications outlined in Exhibit A.
5.
---------------------
Given the context information and not prior knowledge, answer the query.
Query: What are the penalties for late payments?
Answer: 

___ 🔴 END MiniLM-L6-v2 MODEL TEST ___


📊 Comparative Test Results 

🧠 Model: BGE-small-en

⏱️ Indexing Time: 1.82 seconds

⏱️ Retrieval Time: 0.14 seconds

📄 Top Response: Context information 

# **Embedding Model Scorecard Analysis**

This scorecard evaluates the performance of three embedding models (`MiniLM-L6-v2`, `BGE-small-en`, and `E5-small-v2`) on a single RAG query: "What are the penalties for late payments?"
<br><br>

The evaluation is based ***only*** on the retrieved context. Because the LLM was disabled (`Settings.llm = None`), no answer was actually generated; the "answer" scored below is the relevant sentence read directly from the retrieved context, that is, what a "perfect" extraction would return.

## 🧠 Model: MiniLM-L6-v2

<br>

| Question | Score (1-5) | Notes |
| :--- | :---: | :--- |
| Was the answer complete?| 5 | Yes, the answer explicitly stated the full penalty: <br><br> "Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full." <br><br> |
| Was the answer correct? | 5 | Yes, it directly and accurately reflects the key sentence from the retrieved context: <br><br> (`2.3 Late payments shall bear interest at the rate of 1.5% per month...`). <br><br> |
| Was the language clear? | 5 | The language is clear and unambiguous. <br><br> |
| Did the context feel on-topic?| 4 | Highly on-topic (retrieved the exact payment penalty clause), <br><br> but included two lines about unrelated "Refunds" which is considered "extra noise." <br><br> |
| Were the chunks concise and useful?| 4 | Useful, as the required sentence was present. <br><br> Not perfectly concise, as it included noise about "Refunds" (4.2 and 4.3). <br><br> |
<br><br>









## 🧠 Model: BGE-small-en

<br>

| Question | Score (1-5) | Notes |
| :--- | :---: | :--- |
| Was the answer complete?| 5 | Yes, the answer explicitly stated the full penalty:<br><br> "Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full." <br><br> |
| Was the answer correct? | 5 | Yes, it directly and accurately reflects the key sentence from the retrieved context: <br><br> (`2.3 Late payments shall bear interest at the rate of 1.5% per month...`). <br><br> |
| Was the language clear? | 5 | The language is clear and unambiguous. <br><br> |
| Did the context feel on-topic?| 5 | **Highly on-topic. It retrieved the payment clause and a surrounding general payment rule: <br><br> (`2.2 Service Provider shall invoice Client...`), <br><br> which is directly related to the concept of "payments."** <br><br>|
| Were the chunks concise and useful?| 5 | **Excellent. The retrieved chunks were highly focused on the payment topic,<br><br> avoiding the unrelated "Refund" information seen in the other models' output.** <br><br>  |

<br><br>

## 🧠 Model: E5-small-v2

<br>

| Question | Score (1-5) | Notes |
| :--- | :---: | :--- |
| Was the answer complete?| 5 | Yes, the answer explicitly stated the full penalty: <br><br> "Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full."<br><br> |
| Was the answer correct? | 5 | Yes, it directly and accurately reflects the key sentence from the retrieved context: <br><br> (`2.3 Late payments shall bear interest at the rate of 1.5% per month...`). <br><br> |
| Was the language clear? | 5 | The language is clear and unambiguous. |
| Did the context feel on-topic?| 4 | Highly on-topic (retrieved the exact payment penalty clause), <br><br> but included two lines about unrelated "Refunds" which is considered "extra noise." <br><br> |
| Were the chunks concise and useful?| 4 | Useful, as the required sentence was present. <br><br> Not perfectly concise, as it included noise about "Refunds" (`4.2` and `4.3`). <br><br> |

<br><br>
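
For a quick side-by-side view, the five scores from each scorecard above can be added up. Summing is my own convenience aggregation (it weights all five questions equally), not a method defined elsewhere in this notebook.

```python
# Scores copied from the three scorecards above, in table order:
# complete, correct, clear, on-topic, concise.
scorecards = {
    "MiniLM-L6-v2": [5, 5, 5, 4, 4],
    "BGE-small-en": [5, 5, 5, 5, 5],
    "E5-small-v2": [5, 5, 5, 4, 4],
}

totals = {model: sum(scores) for model, scores in scorecards.items()}
best = max(totals, key=totals.get)

for model, total in totals.items():
    print(f"{model:<14} {total}/25")
print("Highest total:", best)
```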




## **Comparison For all three models**

Date: 12/03/2025

- **Query:** What are the penalties for late payments?
- **Answer:** Late payments shall bear interest at the rate of 1.5% per month from the due date until paid in full.
<br>


<br>

## **Performance Metrics**

---
**📍Test #1 Results**

| Model | Indexing Time (s) | Retrieval Time (s)  | Context Conciseness Score |
| :--- | :---: | :--- | :--- |
| MiniLM-L6-v2| 2.58 | 0.16  | 4 (Pulled noise)  |
| BGE-small-en|  1.20 | 0.11  | 5 (Cleanest context)  |
| E5-small-v2|  4.42 | 0.11  | 4 (Pulled noise)  |

<br>

- **BGE-small-en** was the fastest overall, with the fastest Indexing Time (1.20s) and a Retrieval Time tied for fastest (0.11s).

- **E5-small-v2** had the slowest Indexing Time (4.42s) but was fast during retrieval (0.11s).

- **MiniLM-L6-v2** had a moderate Indexing Time (2.58s) and the slowest Retrieval Time (0.16s).

- For this specific RAG setup and document, BGE-small-en offered the best combination of speed and retrieval quality in Test #1.

---

**📍Test #2 Results**

| Model | Indexing Time (s) | Retrieval Time (s)  | Context Conciseness Score |
| :--- | :---: | :--- | :--- |
| MiniLM-L6-v2| 0.44 | 0.03  | 4 (Pulled noise)  |
| BGE-small-en|  0.75 | 0.05  | 5 (Cleanest context)  |
| E5-small-v2|  1.03 | 0.04  | 4 (Pulled noise)  |
<br>

- **MiniLM-L6-v2:** Indexing Time: 0.44s, Retrieval Time: 0.03s (Fastest indexing and retrieval)

- **BGE-small-en:** Indexing Time: 0.75s, Retrieval Time: 0.05s

- **E5-small-v2:** Indexing Time: 1.03s, Retrieval Time: 0.04s

The speed champion is MiniLM-L6-v2. The qualitative analysis (which chunks are pulled) remains the same: BGE-small-en remains the winner for context quality/conciseness in Test #2.

---

**📍Test #3 Results**

| Model | Indexing Time (s) | Retrieval Time (s)  | Context Conciseness Score |
| :--- | :---: | :--- | :--- |
| MiniLM-L6-v2| 1.87 | 0.11  | 4 (Pulled noise)  |
| BGE-small-en|  1.82 | 0.14  | 5 (Cleanest context)  |
| E5-small-v2|  1.61 | 0.09  | 4 (Pulled noise)  |
<br>

- **Speed Champion: E5-small-v2** is now the fastest model for both indexing (1.61s) and querying (0.09s).
<br>

- **Retrieval Quality Champion: BGE-small-en** remains the best for high-quality, concise context retrieval (Score 5), demonstrating superior semantic focus by isolating the penalty clause without pulling in unrelated sections (like the Refund clauses).

<br><br>

## **Conclusion: Trade-offs Between Speed and Retrieval Quality**
---

This summary evaluates each embedding model on its timing, averaged across the three test runs, and on its retrieval quality scores. The qualitative scores were identical in every run (5 for BGE-small-en and 4 for the other two models on conciseness), so the main differences were in speed. The conclusion therefore focuses on the trade-off between speed and retrieval quality.
<br><br>


**Average Performance Metrics for Embedding Models**

| Model | Average Indexing Time (s) | Average Retrieval Time (s) |
| :--- | :---: | :--- |
| MiniLM-L6-v2 | 1.63 | 0.10 |
| BGE-small-en | 1.26 | 0.10 |
| E5-small-v2 | 2.35 | 0.08 |
<br>
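
The averages in this table can be reproduced from the three per-test tables above. The sketch below simply takes the mean of the three recorded values per model and rounds to two decimals; the dictionary layout is my own.

```python
from statistics import mean

# Timings (seconds) copied from the Test #1, #2 and #3 tables, in that order.
runs = {
    "MiniLM-L6-v2": {"indexing": [2.58, 0.44, 1.87], "retrieval": [0.16, 0.03, 0.11]},
    "BGE-small-en": {"indexing": [1.20, 0.75, 1.82], "retrieval": [0.11, 0.05, 0.14]},
    "E5-small-v2": {"indexing": [4.42, 1.03, 1.61], "retrieval": [0.11, 0.04, 0.09]},
}

averages = {
    model: {metric: round(mean(values), 2) for metric, values in timings.items()}
    for model, timings in runs.items()
}

for model, avg in averages.items():
    print(f"{model:<14} indexing={avg['indexing']:.2f}s  retrieval={avg['retrieval']:.2f}s")
```

For example, the MiniLM-L6-v2 indexing average is (2.58 + 0.44 + 1.87) / 3 = 1.63s.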

**1. 🧠 MiniLM-L6-v2**

The MiniLM-L6-v2 model offers a competitive balance of indexing and retrieval speed (1.63s and 0.10s on average). However, it compromises slightly on retrieval precision. While it accurately found the answer, it consistently scored 4/5 for conciseness because it pulled in "noise" (unrelated sections about refunds). This suggests that MiniLM-L6-v2 might retrieve slightly less focused context, which could pass more irrelevant information to the LLM in a larger, more complex RAG system.
<br><br>

**2. 🧠 BGE-small-en (Balanced Winner)**

BGE-small-en emerged as the best overall choice when considering both speed and quality. It boasts the fastest average indexing time (1.26s), meaning it is the quickest to set up the knowledge base. Crucially, it consistently scored 5/5 for context conciseness, retrieving only the precise payment-related information and exhibiting superior semantic focus. This model minimizes the risk of feeding irrelevant information to the LLM, making it ideal for applications prioritizing high-quality, clean results, even if its query time is not the absolute fastest.
<br><br>

**3. 🧠 E5-small-v2**

The E5-small-v2 model has the fastest average retrieval time (0.08s), although its margin over the other two models (0.10s each) is small. This could make it suitable for high-volume, real-time query applications. However, it also has the slowest average indexing time (2.35s) and a slight drop in retrieval quality (scoring 4/5 due to extraneous context). E5-small-v2 is best used when documents are indexed infrequently but quick, real-time lookups are paramount.

# **Testing Automation**


### **Rationale for Multiple Test Runs (N=3)**

I run the Indexing and Retrieval processes multiple times (NUM_TESTS = 3) to ensure the results are reliable and not skewed by system volatility.

- **Averaging Volatility:** Initial runs are often inflated due to "cold starts" (loading models and initializing libraries). Averaging across tests smooths out these transient spikes caused by background processes or initialization time.

- **Stable Metrics:** The average time provides me with a more stable and representative measure of the model's true, consistent performance, allowing me to draw a robust conclusion about the speed vs. quality trade-off.
<br>
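
The same idea can be packaged as a small reusable helper. This is a sketch under a few assumptions of mine: the names `time_runs` and `dummy_workload` are hypothetical, it uses `time.perf_counter()` (a higher-resolution clock than the `time.time()` used in this notebook), and it adds an untimed warm-up call so that cold-start costs stay out of the measurements.

```python
import time
from statistics import mean, stdev


def time_runs(fn, num_tests=3, warmup=1):
    """Call `fn` repeatedly and return per-run timings plus their mean and spread."""
    for _ in range(warmup):
        fn()  # untimed warm-up to absorb cold-start costs

    timings = []
    for _ in range(num_tests):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)

    return {
        "timings": timings,
        "mean": mean(timings),
        # stdev needs at least two samples
        "stdev": stdev(timings) if len(timings) > 1 else 0.0,
    }


def dummy_workload():
    # Stand-in for an indexing or retrieval call.
    sum(i * i for i in range(10_000))


stats = time_runs(dummy_workload, num_tests=3)
print(f"mean={stats['mean']:.6f}s  stdev={stats['stdev']:.6f}s")
```

Reporting the standard deviation next to the mean shows whether a gap between two models is larger than the run-to-run noise.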


In [10]:
# 1. Installation & Import
!pip install -q pandas tabulate  # pandas for the results table; tabulate is required by DataFrame.to_markdown()
import numpy as np   # Used to calculate mean (average) times.
import pandas as pd  # Used for creating and displaying the final results table.

# --- 2. Configuration ---
NUM_TESTS = 3
QUERY = "What are the penalties for late payments?"
# Recorded for reference only: the nodes built in Section 2 already use these values.
CHUNK_SIZE = 50
CHUNK_OVERLAP = 50


# --- 3. Testing Loop and Timing Collection ---

timing_results = {}

for model_name, model_path in embedding_models.items():
    print("\n=======================================================")
    print(f"🧠 Testing Model: {model_name}")
    print("=======================================================")

    # Configure the embedding model for the current test
    Settings.embed_model = HuggingFaceEmbedding(model_name=model_path)

    indexing_times = []
    retrieval_times = []

    for i in range(1, NUM_TESTS + 1):
        print(f"--- Running Test Run #{i} ---")

        # 1. INDEXING TIME
        start_time_index = time.time()
        index = VectorStoreIndex(nodes)
        end_time_index = time.time()
        indexing_time = end_time_index - start_time_index
        indexing_times.append(indexing_time)
        print(f"   -> Indexing Time: {indexing_time:.4f}s")

        # 2. RETRIEVAL TIME
        retriever = index.as_retriever(similarity_top_k=2)
        query_engine = RetrieverQueryEngine.from_args(retriever=retriever)

        start_time_query = time.time()
        # Run the query (Note: LLM is None, so only retrieval is timed)
        response = query_engine.query(QUERY)
        end_time_query = time.time()
        retrieval_time = end_time_query - start_time_query
        retrieval_times.append(retrieval_time)
        print(f"   -> Retrieval Time: {retrieval_time:.4f}s")

    # Store all results for this model
    timing_results[model_name] = {
        "indexing_times": indexing_times,
        "retrieval_times": retrieval_times,
        "avg_indexing": np.mean(indexing_times),
        "avg_retrieval": np.mean(retrieval_times)
    }

# --- 4. Results Output (Formatted Markdown Table using Pandas) ---

print("\n\n" + "="*80)
print(f"| FINAL PERFORMANCE COMPARISON (Over {NUM_TESTS} Runs) |")
print("="*80 + "\n")

# Prepare data for the Pandas DataFrame
data_for_df = []
columns = ["Model", "Avg. Indexing (s)", "Avg. Retrieval (s)"]
columns.extend([f"Index T{i} (s)" for i in range(1, NUM_TESTS + 1)])
columns.extend([f"Query T{i} (s)" for i in range(1, NUM_TESTS + 1)])

for model, data in timing_results.items():
    row = [
        model,
        f"{data['avg_indexing']:.3f}",
        f"{data['avg_retrieval']:.3f}"
    ]
    row.extend([f"{t:.3f}" for t in data['indexing_times']])
    row.extend([f"{t:.3f}" for t in data['retrieval_times']])
    data_for_df.append(row)

# Create the DataFrame
df = pd.DataFrame(data_for_df, columns=columns)

# Convert to Markdown table and print
markdown_output = "## Test Results\n"
markdown_output += df.to_markdown(index=False)

print(markdown_output)




🧠 Testing Model: MiniLM-L6-v2
--- Running Test Run #1 ---
   -> Indexing Time: 0.3845s
   -> Retrieval Time: 0.0328s
--- Running Test Run #2 ---
   -> Indexing Time: 0.3843s
   -> Retrieval Time: 0.0221s
--- Running Test Run #3 ---
   -> Indexing Time: 0.3785s
   -> Retrieval Time: 0.0224s

🧠 Testing Model: BGE-small-en




--- Running Test Run #1 ---
   -> Indexing Time: 1.1255s
   -> Retrieval Time: 0.0448s
--- Running Test Run #2 ---
   -> Indexing Time: 0.7525s
   -> Retrieval Time: 0.0421s
--- Running Test Run #3 ---
   -> Indexing Time: 0.7597s
   -> Retrieval Time: 0.0460s

🧠 Testing Model: E5-small-v2
--- Running Test Run #1 ---
   -> Indexing Time: 0.7683s
   -> Retrieval Time: 0.0371s
--- Running Test Run #2 ---
   -> Indexing Time: 0.7845s
   -> Retrieval Time: 0.0362s
--- Running Test Run #3 ---
   -> Indexing Time: 0.7509s
   -> Retrieval Time: 0.0379s


| FINAL PERFORMANCE COMPARISON (Over 3 Runs) |

## Test Results
| Model        |   Avg. Indexing (s) |   Avg. Retrieval (s) |   Index T1 (s) |   Index T2 (s) |   Index T3 (s) |   Query T1 (s) |   Query T2 (s) |   Query T3 (s) |
|:-------------|--------------------:|---------------------:|---------------:|---------------:|---------------:|---------------:|---------------:|---------------:|
| MiniLM-L6-v2 |               0.382 |             