## <b><center>Embedding & Vector Store Integration</center></b>
Welcome to the second phase of the InfoFusion Technologies Multi-Agent RAG project!
In this notebook, we will transform our curated text chunks into powerful numeric embeddings, store them in a high-performance vector database, and set the stage for fast, accurate semantic retrieval.

### **Loading Cleaned Data Chunks**

Before diving into embeddings, let's load the chunked dataset saved in the previous notebook.

_This step ensures that our workflow is modular and the data is reproducible, so there is no need to rerun heavy ingestion or chunking!_


In [1]:
import pickle
import os

chunk_path = "processed/chunks.pkl"

if os.path.exists(chunk_path):
    with open(chunk_path, "rb") as f:
        all_chunks = pickle.load(f)
    print(f"Loaded {len(all_chunks)} chunks from {chunk_path}")
else:
    print(f"Chunk file not found: {chunk_path}. Please run EDA notebook first.")

Loaded 8929 chunks from processed/chunks.pkl


### **Choosing our Embedding Engine**

Embeddings are the backbone of any semantic search system.  

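We'll select a suitable model for generating vector representations of our text chunks. Candidates include models from Sentence Transformers, OpenAI, or the Hugging Face Hub; this notebook uses the Sentence Transformers model `all-MiniLM-L6-v2`.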

The right embedding choice can drastically impact retrieval quality and downstream agent performance.


In [2]:
from sentence_transformers import SentenceTransformer

# Select an embedding model ('all-MiniLM-L6-v2' is fast and effective for semantic search)
embedder = SentenceTransformer("all-MiniLM-L6-v2")
print("Embedding model loaded.")


Embedding model loaded.


### **Vectorizing Knowledge: Embedding Generation**

Now, let's transform each chunk into its corresponding numerical vector.  

We'll process the text, ensure efficient batching, and monitor performance.

We use a batch size of 32 in `embedder.encode(texts, batch_size=32, show_progress_bar=True)` because it offers a reasonable balance between throughput and memory usage on most hardware for small models such as the widely used all-MiniLM-L6-v2 Sentence Transformer. It is a practical default rather than a tuned optimum.

Batching enables processing multiple text chunks simultaneously, leveraging parallelism on GPUs or CPUs for efficient embedding generation. While higher batch sizes (e.g., 64, 128) can speed up embedding, they also consume more memory and risk instability or out-of-memory errors on limited hardware. For common configurations, batch sizes of 16 or 32 are often recommended as the "sweet spot" for good speed without exceeding memory limits.
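
As a quick check on the batching arithmetic, the number of batches is the chunk count divided by the batch size, rounded up. The sketch below uses a helper named `num_batches`, which is illustrative only and not part of Sentence Transformers.

```python
import math


def num_batches(n_items: int, batch_size: int) -> int:
    """Number of batches needed to cover n_items (the last batch may be partial)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return math.ceil(n_items / batch_size)


# 8929 is the number of chunks loaded earlier in this notebook.
for size in (16, 32, 64, 128):
    print(size, num_batches(8929, size))
```

With a batch size of 32 this gives 280 batches, which is the total that the progress bar of the encoding cell reports.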


This step empowers the system to capture semantic meaning, enabling intelligent and context-aware retrieval later.


In [3]:
texts = [chunk.page_content for chunk in all_chunks]
print(f"Computing embeddings for {len(texts)} chunks.")

# Efficient batch embedding
embeddings = embedder.encode(texts, batch_size = 32, show_progress_bar = True)

print(f"Generated {len(embeddings)} embeddings.")

Computing embeddings for 8929 chunks.


Batches: 100%|██████████| 280/280 [00:10<00:00, 27.48it/s]


Generated 8929 embeddings.


### **Building the Vector Store with ChromaDB**

With embeddings ready, we'll store them in ChromaDB, a fast, scalable vector database built for RAG workloads.

Storing embeddings in ChromaDB allows instant, similarity-based retrieval, powering multi-agent workflows and conversational AI.


In [4]:
import chromadb
from chromadb.config import Settings

# Initialize PersistentClient for local persistent storage
client = chromadb.PersistentClient(
    path="db/chromadb_data",
    settings=Settings()
)

# Create or get existing collection
collection = client.get_or_create_collection(name="infofusion_chunks")
print(f"Collection '{collection.name}' created.")

# Prepare documents with IDs and metadata
docs_to_add = []
for i, chunk in enumerate(all_chunks):
    doc = {
        "id": str(i),
        "embedding": embeddings[i],
        "document": chunk.page_content,
        "metadata": chunk.metadata
    }
    docs_to_add.append(doc)


# Add documents in batch (recommended) instead of one-by-one
collection.add(
    ids=[doc["id"] for doc in docs_to_add],
    embeddings=embeddings,
    documents=[doc["document"] for doc in docs_to_add],
    metadatas=[doc["metadata"] for doc in docs_to_add]
)


print(f"Added {len(docs_to_add)} chunks to ChromaDB collection.")

Collection 'infofusion_chunks' created.
Added 8929 chunks to ChromaDB collection.
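
Some ChromaDB versions limit how many records a single call may carry, and re-running the cell above calls `add` again with the same IDs. One possible way to handle both is to write the records in slices and use `collection.upsert`, which takes the same arguments as `add`. The sketch below is a minimal illustration under stated assumptions: the helper `batch_ranges` and the value of `BATCH_SIZE` are hypothetical, and the ChromaDB calls are shown only as commented usage.

```python
from typing import Iterator, Tuple


def batch_ranges(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs that cover range(total) in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        yield start, min(start + batch_size, total)


# Hypothetical batch size; tune it for your ChromaDB version and hardware.
BATCH_SIZE = 1000

ranges = list(batch_ranges(8929, BATCH_SIZE))
print(len(ranges), ranges[0], ranges[-1])

# Usage with the notebook's objects (not executed here):
# for start, end in batch_ranges(len(all_chunks), BATCH_SIZE):
#     collection.upsert(
#         ids=[str(i) for i in range(start, end)],
#         embeddings=embeddings[start:end],
#         documents=[c.page_content for c in all_chunks[start:end]],
#         metadatas=[c.metadata for c in all_chunks[start:end]],
#     )
```

Because `upsert` overwrites records whose IDs already exist, repeating the step should leave the collection count unchanged rather than attempting to insert duplicates.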


### **Smart Searches in Action**

Let's run a few sample queries to validate that our semantic search pipeline is functioning as expected.

We'll retrieve top-matching chunks for sample questions or keywords, demonstrating the power of our new vector store.
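
Conceptually, similarity-based retrieval ranks the stored vectors by how close they are to the query vector. The sketch below shows that idea as a brute-force scan with cosine similarity over toy vectors. It is not how ChromaDB works internally: ChromaDB uses an index rather than a full scan, and its distance metric depends on how the collection is configured. The names `cosine_similarity` and `top_k` are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between a and b (0.0 if either vector is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return the k (index, score) pairs with the highest cosine similarity."""
    scored = [(i, cosine_similarity(query, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real embeddings.
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
]
toy_query = [1.0, 0.05, 0.0]

for index, score in top_k(toy_query, toy_vectors, k=2):
    print(index, round(score, 3))
```

The two vectors pointing in nearly the same direction as the query rank first, while the orthogonal one is left out.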


In [8]:
# Example query (any natural-language question works here)
query = "Explain ann."
query_embedding = embedder.encode([query])[0]


# Retrieve the single closest match from ChromaDB (raise n_results for more)
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=1
)

print("Top retrieved chunk for query:")
for doc, meta in zip(results['documents'][0], results['metadatas'][0]):
    print("---")
    print(meta["source"])
    print(doc, '...')  # Prints the full chunk text followed by an ellipsis

Top retrieved chunk for query:
---
data/Artificial Intelligence, Machine Learning, and Deep Learning.pdf
106 • Artifici Al intelligence , MAchine  leArning , Deep leArning
ful for understanding ANNs. A better way to understand ANNs is to think  
of their structure as a combination of the hyper parameters in the  
following list:
• The number of hidden layers
• The number of neurons in each hidden layer
• The initial weights of edges connecting pairs of neurons
• The activation function
• A cost (a.k.a. loss) function
• An optimizer (used with the cost function)
• The learning rate (a small number)
• The dropout rate (optional)
Figure 4.2 displays the contents of an ANN (there are many variations: 
this is simply one example).
FIGURE 4.2 An Example of an ANN.
Image adapted from [Cburnett, Source: https://commons.wikimedia.
org/wiki/File:Artificial_neural_network.svg] ...


### **Pipeline Stats & Housekeeping**

Finally, let's log key pipeline metadata: the chunk count, the embedding dimensionality, and the number of records indexed in ChromaDB.

Tracking these metrics ensures transparency, reproducibility, and provides insights for future optimizations.


In [6]:
print(f"Total number of chunks: {len(all_chunks)}")
print(f"Embedding dimensions: {embeddings[0].shape[0] if len(embeddings) > 0 else 'N/A'}")
print(f"ChromaDB collection count : {collection.count()}")

Total number of chunks: 8929
Embedding dimensions: 384
ChromaDB collection count : 8929


#### 📊 Collection Consistency Check

From the summary statistics above, we can confirm that our pipeline has successfully created and indexed the data as intended:

- **Total number of text chunks:** 8929  
- **Embedding vector dimensions:** 384  
- **Number of records in ChromaDB collection:** 8929

The number of generated text chunks matches the number of indexed entries in ChromaDB, which indicates that no chunk was dropped or duplicated during indexing.  
Every processed document chunk is now available for semantic retrieval in our RAG system.
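
The same comparison can be made explicit so that a mismatch fails loudly instead of relying on a visual read of the printed numbers. This is a minimal sketch; the function name `check_pipeline_counts` is hypothetical, and the literal values are the ones reported above.

```python
def check_pipeline_counts(n_chunks: int, n_embeddings: int, n_indexed: int) -> None:
    """Raise ValueError if the three counts disagree."""
    if not (n_chunks == n_embeddings == n_indexed):
        raise ValueError(
            f"Count mismatch: chunks={n_chunks}, "
            f"embeddings={n_embeddings}, indexed={n_indexed}"
        )


# Values reported by the summary statistics in this notebook.
check_pipeline_counts(8929, 8929, 8929)
print("Counts are consistent.")

# With the notebook's objects, the call would be:
# check_pipeline_counts(len(all_chunks), len(embeddings), collection.count())
```

Matching counts only show that nothing was lost; they do not verify that each stored vector belongs to the right text, so spot-checking a few queries remains useful.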
