# Embedding & Vector Store Engineering

This notebook focuses on engineering the embedding and vector storage layer of the Hybrid RAG system.

We are not simply generating embeddings.  
We are benchmarking models, measuring system impact, and making production-grade architectural decisions.

---

## Scope

- Compare embedding models  
- Measure embedding latency  
- Analyze vector dimensionality  
- Estimate memory footprint  
- Design vector store abstraction  
- Build and persist a FAISS index  
- Prepare the system for hybrid retrieval  

---

## Objectives

- Load processed chunks from Notebook 1  
- Evaluate two embedding models under measurable criteria:
  - Embedding speed  
  - Vector dimensions  
  - Memory consumption  
- Build a FAISS index for efficient similarity search  
- Save the index for reuse  
- Design an abstraction layer to allow seamless migration to Pinecone or other managed vector databases  

This notebook establishes the semantic retrieval backbone required for scalable and optimized search in the Hybrid RAG architecture.


## 1 – Load Processed Chunks

The processed chunk dataset generated in Notebook 1 is deserialized from disk with `pickle`. This avoids re-running ingestion and chunking, so embedding experiments iterate faster.

A sanity check is performed to:

- Verify the total number of loaded chunks  
- Inspect a sample of the chunk content  
- Validate metadata integrity  

This confirms that the dataset is clean, structured, and ready for embedding benchmarking and vector indexing.
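
The three checks listed above can also be automated. The following is a minimal sketch, not the notebook's own code: `Chunk` is a hypothetical stand-in for the LangChain `Document` type, `validate_chunks` is an illustrative helper name, and the required keys are assumed from the sample metadata printed in this notebook.

```python
from dataclasses import dataclass, field


# Hypothetical stand-in for the LangChain Document type, so the sketch runs on its own.
@dataclass
class Chunk:
    page_content: str
    metadata: dict = field(default_factory=dict)


# Assumption: these keys mirror the sample metadata shown in this notebook.
REQUIRED_KEYS = ("source", "page", "doc_id")


def validate_chunks(chunks, required_keys=REQUIRED_KEYS):
    """Return a list of (index, problem) tuples; an empty list means every check passed."""
    problems = []
    for i, chunk in enumerate(chunks):
        if not chunk.page_content.strip():
            problems.append((i, "empty page_content"))
        missing = [k for k in required_keys if k not in chunk.metadata]
        if missing:
            problems.append((i, "missing metadata: " + ", ".join(missing)))
    return problems


sample = [
    Chunk("Artificial Intelligence Index Report 2025",
          {"source": "report.pdf", "page": 0, "doc_id": "a1"}),
    Chunk("   ", {"source": "report.pdf", "page": 1}),
]
for index, problem in validate_chunks(sample):
    print(index, problem)
```

Because the helper only reads `page_content` and `metadata`, it should work unchanged on the loaded `documents` list.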


In [1]:
import pickle

with open("processed_chunks.pkl", "rb") as f:
    documents = pickle.load(f)

print(f"Loaded {len(documents)} chunks.")

# Sanity check
print(documents[0].page_content[:300])
print(documents[0].metadata)

Loaded 1333 chunks.
Artificial Intelligence
Index Report 2025
{'source': 'C:\\my_projects\\enterprise-rag\\data\\Stanford AI Index Report.pdf', 'page': 0, 'doc_id': 'bb44519d-f091-4d79-9f95-7e4bed72a9e6'}


## 2 – Embedding Model Selection

Two embedding models are selected for benchmarking to evaluate performance versus semantic quality trade-offs.



In [2]:
from sentence_transformers import SentenceTransformer
import warnings
warnings.filterwarnings("ignore")  # suppress all warnings

model_a = SentenceTransformer("all-MiniLM-L6-v2")
print("model_a dimension vectors:", model_a.get_sentence_embedding_dimension())

model_b = SentenceTransformer("all-mpnet-base-v2")
print("model_b dimension vectors", model_b.get_sentence_embedding_dimension())


model_a dimension vectors: 384
model_b dimension vectors 768


### Model A – all-MiniLM-L6-v2  
- 384-dimensional vectors  
- Fast inference  
- Low memory footprint  
- Widely used production baseline  


### Model B – all-mpnet-base-v2  
- 768-dimensional vectors  
- Stronger semantic representation  
- Slower inference  
- Higher memory consumption  

This comparison enables a measurable decision between efficiency and retrieval quality before committing to a production embedding standard.


## 3 – Embedding Benchmark Framework

A modular benchmarking function is implemented to evaluate embedding models under measurable performance criteria.

For each model, the function:

- Loads the embedding model dynamically  
- Extracts chunk text for encoding  
- Measures total embedding latency  
- Reports embedding matrix shape  
- Extracts vector dimensionality  

This allows direct comparison of:

- Inference speed  
- Vector dimensions  
- Computational cost  

The design is intentionally model-agnostic, enabling easy extension to additional embedding models in future experiments. This benchmarking layer ensures embedding selection is driven by empirical performance rather than assumption.
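
A single timed pass can be skewed by one-off costs such as cache warm-up. As a hedged sketch of a more robust measurement, the hypothetical helper `time_call` below runs a warm-up pass and reports the median of several timed repeats. `fake_encode` is a toy workload standing in for `model.encode(texts)`, so the example runs without loading a model.

```python
import statistics
import time


def time_call(fn, *args, repeats=3, warmup=1, **kwargs):
    """Call fn repeatedly and return (last_result, median_seconds)."""
    for _ in range(warmup):
        fn(*args, **kwargs)  # warm-up pass, not timed
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        result = fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return result, statistics.median(durations)


# Toy workload standing in for model.encode(texts); returns one 1-d vector per text.
def fake_encode(texts):
    return [[float(len(t))] for t in texts]


vectors, seconds = time_call(fake_encode, ["alpha", "beta"], repeats=5)
print(len(vectors), f"{seconds:.6f}s")
```

For slow models, the extra repeats multiply the total runtime, so a lower `repeats` value may be the pragmatic choice.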


In [3]:
import time
from sentence_transformers import SentenceTransformer
import numpy as np

def benchmark_embedding(model_name, documents):
    model = SentenceTransformer(model_name)
    texts = [doc.page_content for doc in documents]

    # perf_counter is a monotonic, high-resolution clock suited to interval timing
    start_time = time.perf_counter()
    embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)
    end_time = time.perf_counter()

    latency = end_time - start_time

    print("\nModel:", model_name)
    print("Embedding shape:", embeddings.shape)
    print("Time taken (s):", round(latency, 2))
    print("Embedding dimension:", embeddings.shape[1])

    return embeddings, latency

## 4 – Execute Embedding Benchmark

The benchmarking function is executed for both selected models to collect empirical performance metrics.

This step generates:

- Embedding matrices for each model  
- Total inference latency  
- Vector dimensionality comparison  

The results provide a direct trade-off analysis between speed (MiniLM) and semantic strength (MPNet), forming the basis for selecting the production embedding model before vector indexing.


In [4]:
# run benchmark on "all-MiniLM-L6-v2"
embedding_mini, latency_mini = benchmark_embedding(
    "all-MiniLM-L6-v2",  
    documents
)

Batches: 100%|██████████| 42/42 [00:33<00:00,  1.26it/s]


Model: all-MiniLM-L6-v2
Embedding shape: (1333, 384)
Time taken (s): 33.39
Embedding dimension: 384





In [5]:
# run benchmark on "all-mpnet-base-v2"
embedding_mpnet, latency_mpnet = benchmark_embedding(
    "all-mpnet-base-v2",
    documents
)

Batches: 100%|██████████| 42/42 [04:20<00:00,  6.20s/it]


Model: all-mpnet-base-v2
Embedding shape: (1333, 768)
Time taken (s): 260.58
Embedding dimension: 768
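
To make the two runs directly comparable, the measured latencies can be converted into throughput and a latency ratio. This is a small sketch using the numbers copied from the benchmark output above; `throughput` is an illustrative helper name.

```python
def throughput(num_chunks, seconds):
    """Chunks embedded per second."""
    return num_chunks / seconds


# Values copied from the benchmark output in this notebook.
NUM_CHUNKS = 1333
LATENCY_MINI = 33.39
LATENCY_MPNET = 260.58

mini_rate = throughput(NUM_CHUNKS, LATENCY_MINI)
mpnet_rate = throughput(NUM_CHUNKS, LATENCY_MPNET)
slowdown = LATENCY_MPNET / LATENCY_MINI

print(f"MiniLM: {mini_rate:.1f} chunks/s")
print(f"MPNet:  {mpnet_rate:.1f} chunks/s")
print(f"MPNet / MiniLM latency ratio: {slowdown:.1f}x")
```

On this hardware, MiniLM embeds roughly 40 chunks per second against roughly 5 for MPNet, a ratio of about 7.8x.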





## 5 – Memory Footprint Estimation

To evaluate scalability, the memory footprint of each embedding matrix is calculated in megabytes.

This measurement is critical when projecting system behavior at scale. As the number of chunks grows into the millions, embedding dimensionality directly impacts:

- RAM consumption  
- Vector index size  
- Infrastructure cost  
- Retrieval latency  

Comparing memory usage between MiniLM and MPNet provides a realistic view of production deployment constraints, especially in high-volume enterprise environments.


In [6]:
def estimate_memory(embeddings):
    return embeddings.nbytes / (1024 ** 2)

print("MiniLM Memory (MB):", round(estimate_memory(embedding_mini), 2))
print("MPNet Memory (MB):", round(estimate_memory(embedding_mpnet), 2))

MiniLM Memory (MB): 1.95
MPNet Memory (MB): 3.91
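
The measured sizes follow from `num_vectors * dimension * 4` bytes for float32 values, so the same arithmetic can project larger corpora. The sketch below assumes float32 storage and counts raw vectors only; index overhead and metadata are not included, and `embedding_memory_mb` is an illustrative helper name.

```python
BYTES_PER_FLOAT32 = 4


def embedding_memory_mb(num_vectors, dimension, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of a dense embedding matrix in MB (1 MB = 1024**2 bytes)."""
    return num_vectors * dimension * bytes_per_value / (1024 ** 2)


# The first row reproduces the measured corpus; the others are projections.
for n in (1_333, 1_000_000, 10_000_000):
    mini = embedding_memory_mb(n, 384)
    mpnet = embedding_memory_mb(n, 768)
    print(f"{n:>10,} chunks | MiniLM: {mini:>10.2f} MB | MPNet: {mpnet:>10.2f} MB")
```

Memory grows linearly with both chunk count and dimensionality, so MPNet needs twice the space of MiniLM at every scale.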


## 6 – Build FAISS Index

The FAISS index is initialized using MiniLM embeddings, selected as the likely production candidate based on efficiency and memory considerations.

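An `IndexFlatL2` index is created with the appropriate vector dimensionality and populated with the L2-normalized embedding matrix. Because the vectors are unit length, ranking by L2 distance is equivalent to ranking by cosine similarity.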

This establishes:

- In-memory similarity search capability  
- Deterministic L2 distance-based retrieval  
- An exact, brute-force baseline for semantic search  

The total vector count is verified to ensure all embeddings are successfully indexed before implementing the search interface.


In [7]:
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

# Load embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Assume documents is a list of LangChain Document objects
# documents[i].page_content contains text

texts = [doc.page_content for doc in documents]

# Generate embeddings
embeddings = model.encode(texts, show_progress_bar=True)

# Convert to float32 (required by FAISS)
embeddings = np.array(embeddings).astype("float32")

# Normalize embeddings in place so that L2 ranking matches cosine ranking
faiss.normalize_L2(embeddings)

# Create a flat (exact search) FAISS index
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)

# Add embeddings to index
index.add(embeddings)

print("Total vectors in index:", index.ntotal)


Batches: 100%|██████████| 42/42 [00:32<00:00,  1.28it/s]

Total vectors in index: 1333
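
The index is built on normalized vectors, and for unit-length vectors the squared L2 distance and cosine similarity are tied by `d = 2 - 2 * cos`. FAISS `IndexFlatL2` reports squared L2 distances, so a score can be converted to a cosine similarity with `1 - d / 2`. The pure-Python sketch below checks that identity on two toy vectors; all helper names are illustrative.

```python
import math


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # Inputs are assumed to be unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


def l2sq_to_cosine(distance):
    """Convert a squared L2 distance between unit vectors to cosine similarity."""
    return 1.0 - distance / 2.0


a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])
d = squared_l2(a, b)
print(round(d, 4), round(cosine(a, b), 4), round(l2sq_to_cosine(d), 4))
```

Under this relation, a reported distance of 0.4747 corresponds to a cosine similarity of about 0.763.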





## 7 – Semantic Search Interface

A reusable search function is implemented to query the FAISS index using semantic similarity.

The function:

- Encodes the input query using the selected embedding model  
- Performs nearest-neighbor search on the FAISS index  
- Retrieves the top-k most similar document chunks  
- Returns a ranked list of dictionaries with the score, content, and metadata of each matching chunk  

This abstraction separates retrieval logic from the indexing layer, enabling future extensions such as hybrid search, reranking, or database-backed vector stores without modifying the core search interface.
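
One way to make that separation explicit is a small interface that any backend can satisfy. The sketch below is an assumption about how such a layer could look, not an existing API: `VectorStore`, `InMemoryStore`, and `retrieve` are hypothetical names, and the brute-force store stands in for FAISS or a managed service so the example runs with the standard library alone.

```python
from typing import Protocol, Sequence


class VectorStore(Protocol):
    """Hypothetical minimal interface a FAISS or managed backend could implement."""

    def add(self, vectors: Sequence[Sequence[float]]) -> None: ...

    def search(self, query: Sequence[float], top_k: int) -> list[tuple[int, float]]: ...


class InMemoryStore:
    """Brute-force store returning (position, squared L2 distance), closest first."""

    def __init__(self) -> None:
        self._vectors: list[list[float]] = []

    def add(self, vectors):
        self._vectors.extend([float(x) for x in v] for v in vectors)

    def search(self, query, top_k):
        scored = [
            (i, sum((q - x) ** 2 for q, x in zip(query, v)))
            for i, v in enumerate(self._vectors)
        ]
        scored.sort(key=lambda pair: pair[1])
        return scored[:top_k]


def retrieve(store: VectorStore, query_vector, chunks, top_k=2):
    """Map store positions back to chunks; depends only on the VectorStore interface."""
    return [(chunks[i], dist) for i, dist in store.search(query_vector, top_k)]


store = InMemoryStore()
store.add([[1.0, 0.0], [0.0, 1.0], [0.8, 0.6]])
chunks = ["chunk A", "chunk B", "chunk C"]
hits = retrieve(store, [1.0, 0.0], chunks)
print(hits)
```

Swapping backends then means writing another class with the same two methods, while `retrieve` and its callers stay unchanged.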


In [8]:
def search(query, model, index, documents, top_k=5):
    if not query.strip():
        raise ValueError("Query cannot be empty.")

    # Encode query
    query_embedding = model.encode([query])
    query_embedding = np.array(query_embedding).astype("float32")

    # Normalize query embedding
    faiss.normalize_L2(query_embedding)

    # Perform search
    distances, indices = index.search(query_embedding, top_k)

    results = []
    for rank, (idx, distance) in enumerate(zip(indices[0], distances[0])):
        if idx == -1:
            continue

        results.append({
            "rank": rank + 1,
            "score": float(distance),  # Lower = better (FAISS returns squared L2 distance)
            "content": documents[idx].page_content,
            "metadata": documents[idx].metadata
        })

    return results


## 8 – Validate Retrieval Pipeline

The semantic search function is tested using a domain-specific query to evaluate retrieval relevance.

This validation step confirms:

- Correct query embedding generation  
- Proper FAISS nearest-neighbor search execution  
- Accurate mapping from index results to original document chunks  

Inspecting the top retrieved passages helps assess whether the embedding model captures conceptual similarity effectively.  

This qualitative check complements earlier quantitative benchmarks and verifies that the retrieval layer is functioning as expected before introducing hybrid search or reranking.


In [9]:
query = "How many AI publications were there in 2023?"

results = search(query, model, index, documents, top_k=5)

for r in results:
    print(f"\nRank {r['rank']} | L2 Distance: {r['score']:.4f}")
    print("-" * 80)
    print(r["content"])
    print("=" * 80)



Rank 1 | L2 Distance: 0.4747
--------------------------------------------------------------------------------
Table of Contents 36
Artificial Intelligence
Index Report 2025Chapter 1 Preview
By Sector
Academic institutions remain the primary source of AI 
publications worldwide (Figure 1.1.8). In 2013, they accounted 
for 85.9% of all AI publications, a figure that remained high, 
at 84.9%, in 2023. Industry contributed 7.1% of AI publications 
in 2023, followed by government institutions at 4.9% and 
nonprofit organizations at 1.7%.
AI publications in CS (% of total)
1.35%, Other
1.70%, Nonprot
4.90%, Government
7.14%, Industry
84.91%, Academia
AI publications in CS (% of total) by sector, 2013–23
Source: AI Index, 2025 | Chart: 2025 AI Index report
Figure 1.1.87
1.1 Publications
Chapter 1: Research and Development

Rank 2 | L2 Distance: 0.5127
--------------------------------------------------------------------------------
database. As a result, the numbers in this year’s report diff

In [10]:
def contains_limitation_terms(text):
    # Case-insensitive substring match; "trade" also matches "trade-off"
    keywords = ["limitation", "drawback", "constraint", "trade", "reduce"]
    lowered = text.lower()
    return any(k in lowered for k in keywords)

for r in results:
    print(r["rank"], contains_limitation_terms(r["content"]))


1 False
2 False
3 False
4 False
5 False


## 9 – Persist FAISS Index

The FAISS index is serialized and saved to disk for reuse in future sessions.

Persisting the index eliminates the need to recompute embeddings and rebuild the vector store during deployment or subsequent experiments.  

This step prepares the system for production workflows, where the embedding pipeline and index construction are executed offline, while the search interface operates in real time.


In [11]:
faiss.write_index(index, "faiss_index.bin")
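
The saved file holds only vectors. Search results are positions that must line up with the `documents` list, and queries must be encoded with the same model. A later session can reload the index with `faiss.read_index("faiss_index.bin")`. As a hedged sketch, a small sidecar manifest can guard against mismatches on reload; the manifest file name and both helpers below are hypothetical, not part of FAISS.

```python
import json
import os
import tempfile


def write_manifest(path, model_name, dimension, num_vectors):
    """Record what the index was built from, next to the index file."""
    manifest = {
        "model_name": model_name,
        "dimension": dimension,
        "num_vectors": num_vectors,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(manifest, f, indent=2)
    return manifest


def check_manifest(path, model_name, dimension, num_vectors):
    """Raise ValueError if the saved manifest disagrees with the current setup."""
    with open(path, encoding="utf-8") as f:
        saved = json.load(f)
    expected = {
        "model_name": model_name,
        "dimension": dimension,
        "num_vectors": num_vectors,
    }
    mismatches = {
        k: (saved.get(k), v) for k, v in expected.items() if saved.get(k) != v
    }
    if mismatches:
        raise ValueError(f"Index manifest mismatch: {mismatches}")
    return saved


# Demonstration in a temporary directory; values match the index built in this notebook.
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, "faiss_index.manifest.json")
    write_manifest(manifest_path, "all-MiniLM-L6-v2", 384, 1333)
    loaded = check_manifest(manifest_path, "all-MiniLM-L6-v2", 384, 1333)
    print(loaded)
```

In the notebook itself, the dimension and count would come from `index.d` and `index.ntotal` rather than literals.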

## Summary

This notebook established the semantic retrieval backbone of the Hybrid RAG system.

Completed stages include:

- Loading processed chunks from the ingestion pipeline  
- Benchmarking two embedding models (MiniLM vs MPNet)  
- Measuring embedding latency, dimensionality, and memory footprint  
- Selecting MiniLM based on performance-efficiency trade-offs  
- Building and validating a FAISS vector index  
- Designing a modular search interface  
- Persisting the FAISS index for production reuse  

The system now has a measurable, scalable embedding layer with a working vector store.

