# This notebook covers only vector storage. No retrieval is done here.

# Vector Store

## What is a Vector Store?
A vector store (or vector database) is a specialized database designed to store and efficiently search through high-dimensional vector embeddings. In RAG systems, vector stores enable:

- Fast similarity search using vector embeddings
- Persistent storage of document embeddings and metadata
- Efficient retrieval of semantically similar content
- Scalable document management
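
To make "similarity search" concrete, here is a minimal brute-force sketch in NumPy. It is not how ChromaDB searches internally (vector databases typically use approximate indexes); the function name `top_k_cosine` and the toy vectors are assumptions made for illustration only.

```python
from typing import List, Tuple

import numpy as np


def top_k_cosine(query: np.ndarray, vectors: np.ndarray, k: int = 2) -> List[Tuple[int, float]]:
    """Return (row index, cosine similarity) for the k rows most similar to the query."""
    # Normalize both sides so that a dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    scores = v @ q
    # argsort is ascending, so reverse it and keep the first k indices.
    best = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in best]


# Toy 3-dimensional "embeddings"; real models produce hundreds of dimensions.
stored = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]])
results = top_k_cosine(np.array([1.0, 0.0, 0.0]), stored, k=2)
print(results)
```

The first two rows point in nearly the same direction as the query, so they rank ahead of the third row, which is orthogonal to it.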

In [1]:
# Libraries used to build the vector store.
import chromadb # Vector database that stores the data.
import os # Used to create the directory that holds the database.
import numpy as np # Embeddings are passed in as NumPy arrays.
from langchain_core.documents import Document # Document type stored in the collection.
from typing import List # Type hint for lists of Documents.
import hashlib # Used to derive unique IDs from document content.

## Why ChromaDB?

- Open-source vector database optimized for embeddings
- Simple API for storing and querying vectors
- Built-in persistence for data durability
- Efficient similarity search using various distance metrics
- Lightweight and easy to integrate
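
As a rough illustration of what "distance metrics" means here, the sketch below implements three metrics commonly offered by vector databases: squared L2, cosine distance, and an inner-product distance. These are textbook definitions written for this example; the helper names are hypothetical, and the exact metric names and formulas ChromaDB uses should be checked against its documentation.

```python
import numpy as np


def squared_l2(a: np.ndarray, b: np.ndarray) -> float:
    """Squared Euclidean distance between two vectors."""
    return float(np.sum((a - b) ** 2))


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """One minus the cosine similarity of two vectors."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def inner_product_distance(a: np.ndarray, b: np.ndarray) -> float:
    """One minus the raw dot product (assumed definition for this sketch)."""
    return float(1.0 - np.dot(a, b))


# Two unit-length vectors.
a = np.array([3.0, 4.0]) / 5.0
b = np.array([4.0, 3.0]) / 5.0

d_l2 = squared_l2(a, b)
d_cos = cosine_distance(a, b)
d_ip = inner_product_distance(a, b)
print(d_l2, d_cos, d_ip)
```

For unit-length vectors the three metrics produce the same ranking: the inner-product distance equals the cosine distance, and the squared L2 distance is exactly twice the cosine distance. This is one reason embeddings are often normalized before storage.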

In [2]:
# The following class stores documents with their embeddings and also handles other tasks, such as deleting collections and reporting collection stats.
class VectorStore:
    def __init__(self,collection_name:str="test_documents",persistent_directory:str="../data/vector_store")->None: # Sets up the vector store.
        self.persistent_directory=persistent_directory # Directory where the database is stored.
        self.collection_name=collection_name # Name of the collection.
        self.collection=None # Collection that holds the data.
        self.client=None # Client (connection) to the database.
        self.initialize_store() # Connects to the database and loads or creates the collection.

    # The following function is to initialize the vector store .
    def initialize_store(self)->None:
        try:
            os.makedirs(self.persistent_directory,exist_ok=True) # Create the storage directory if it does not exist yet.
            self.client=chromadb.PersistentClient(path=self.persistent_directory) # Establishing connection to database .
            if self.collection_exists(self.collection_name): #Checking if the given collection name exists .
                print(f"Loading {self.collection_name} collection from database .")
                self.collection=self.client.get_collection(name=self.collection_name) # If the collection exists it loads existing collection .
            else:
                print(f"Creating new collection named : {self.collection_name} .")
                self.collection=self.client.create_collection(name=self.collection_name,metadata={"Description":"PDF embeddings ."}) # If no existing collection a new collection is created .
            print(f"Vector store initialized. Collection : {self.collection_name}") # Success message for vector initialization .
            print(f"Existing documents in collection {self.collection.count()}")  # Counting existing documents in the collection .
        except Exception as e:
            raise RuntimeError("Could not initialize vector store.") from e # Re-raise with context, keeping the original error as the cause.

    # The following function is used to add documents .
    def add_documents(self,documents:List[Document],embeddings:np.ndarray):
        if self.collection is None: # Raise an error if no collection is initialized.
            raise RuntimeError("Collection is not initialized.")
        if len(documents)!=len(embeddings): # Every document needs exactly one embedding.
            raise ValueError("Number of documents does not match number of embeddings.") # Raised when the counts differ.
        existing_ids=set(self.collection.get()["ids"]) # IDs already stored in the collection, used to detect duplicates.
        skipped=0 # Used to store number of skipped or duplicate documents .
        ids=[] # Used to store ids of all documents .
        metadatas=[] # Used to store metadata of all the documents  .
        document_list=[] # Used to store all documents .
        embeddings_list=[] # Used to store embeddings of the documents .

        for doc,embedding in zip(documents,embeddings):
            content_hash=hashlib.sha256(doc.page_content.encode("utf-8")).hexdigest() # Content-based hash: encode the text as UTF-8, compute its 32-byte SHA-256 digest, and render it as 64 hexadecimal characters.
            doc_id=f"doc_{content_hash}" # Unique id created using content_hash .
            if doc_id in existing_ids: # If same document passed then it is skipped using existing id .
                skipped+=1 # Count of number of skipped or duplicate documents .
                continue

            # Adding metadata for new documents .
            metadata=dict(doc.metadata)
            metadata['doc_id']=doc_id
            metadata['char_length']=len(doc.page_content)

            # Appending new documents.
            ids.append(doc_id)
            document_list.append(doc.page_content)
            embeddings_list.append(embedding.tolist())
            metadatas.append(metadata)
            existing_ids.add(doc_id) # Remember the ID so a repeat within the same batch is skipped too.

        # If no new documents were appended .
        if not ids:
            print("No new documents to add .")
            return

        # Trying to add the new documents to database .
        try:
            self.collection.add(
                ids=ids,
                metadatas=metadatas,
                documents=document_list,
                embeddings=embeddings_list
            )
            print(f"Successfully added {len(ids)} documents,  skipped {skipped} duplicate documents .") # Success message for adding new documents .
            print(f"Total number of documents in vector store : {self.collection.count()}") # Total count of collection after adding new documents .

        except Exception as e:
            raise RuntimeError("Failed to add documents") from e # Exception handling .

    # The following function is used to delete existing collection .
    def delete_collection(self,del_collection_name:str)->None:
        if self.collection_exists(del_collection_name): # Checking if the collection exists .
            self.client.delete_collection(del_collection_name) # Deleting the collection .
            print(f'Deleted collection {del_collection_name} .')
        else:
            print(f"No collection found with name {del_collection_name} .") # Message if no collection is available with name .
        if del_collection_name==self.collection_name:
            self.collection=None
            print("No collection pointing to add documents .") # If the current collection was deleted .

    # The following function lists all collections in the database.
    def show_collections(self)->None:
        total_collections=self.client.list_collections() # All the collections in the database .
        if not total_collections:
            print("No collection in database .") # Message if no collection available in the database .
            return
        print(f"Total number of collections in database is {len(total_collections)}") # Total number of collections in the database.
        print("The collections in this database are :") # Displaying collection names .
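        for i,col in enumerate(total_collections):
            name=getattr(col,"name",col) # Some ChromaDB releases return plain name strings here instead of Collection objects.
            print(f"{i+1}) --> {name}")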

    # The following function shows the stats of an individual collection.
    def collection_stats(self,stat_collection_name:str)->None:
        if self.collection_exists(stat_collection_name): # Checks if the collection exists .
            doc_count=self.client.get_collection(stat_collection_name).count()
            print(f"Collection name :{stat_collection_name} .") # Collection name .
            print(f"Number of documents in collection : {doc_count} .") # Number of documents in the collection .
        else:
            print(f"No collection exists with name : {stat_collection_name} .") # Message if no collection is available with the name in the database .

    # The following function checks whether a collection exists in the database.
    def collection_exists(self,existing_col:str)->bool:
        collections_in_db=self.client.list_collections()
        # getattr falls back to the item itself for ChromaDB releases that return plain name strings instead of Collection objects.
        return any(getattr(col,"name",col)==existing_col for col in collections_in_db) # Returns True if the collection exists.

## Important Notes
### Document ID Generation

- Uses SHA-256 hashing of document content
- Format: doc_{content_hash}
- Ensures uniqueness based on content
- Prevents duplicate documents automatically
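
The ID scheme above can be sketched on its own with the standard library. The helper name `make_doc_id` is hypothetical; it mirrors the hashing step used inside `add_documents`.

```python
import hashlib


def make_doc_id(content: str) -> str:
    """Build a deterministic ID of the form doc_{sha256 hex digest} from the text."""
    content_hash = hashlib.sha256(content.encode("utf-8")).hexdigest()
    return f"doc_{content_hash}"


id_a = make_doc_id("Vector databases store embeddings.")
id_b = make_doc_id("Vector databases store embeddings.")
id_c = make_doc_id("Vector databases store embeddings!")

print(id_a == id_b)  # True: identical text always yields the same ID
print(id_a == id_c)  # False: changing one character changes the ID
print(len(id_a))     # 68: the 4-character "doc_" prefix plus 64 hex characters
```

One consequence of this design: two documents with identical text but different metadata receive the same ID, so the second one is treated as a duplicate and skipped.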

### Persistence

- Database is stored on disk at the specified directory
- Collections persist across sessions
- Can be loaded on subsequent runs

### Metadata Management

- Original metadata is preserved
- Additional metadata added automatically:
    - `doc_id`: Unique identifier
    - `char_length`: Document length
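
A minimal sketch of this enrichment step, using a plain dictionary in place of a `Document`. The helper name `enrich_metadata` and the sample values are assumptions for illustration.

```python
def enrich_metadata(original: dict, doc_id: str, content: str) -> dict:
    """Return a copy of the original metadata with doc_id and char_length added."""
    metadata = dict(original)  # Shallow copy, so the caller's dict is not modified.
    metadata["doc_id"] = doc_id
    metadata["char_length"] = len(content)
    return metadata


source_meta = {"source": "test_doc_1"}
enriched = enrich_metadata(source_meta, "doc_abc123", "Hello RAG")

print(enriched)     # {'source': 'test_doc_1', 'doc_id': 'doc_abc123', 'char_length': 9}
print(source_meta)  # {'source': 'test_doc_1'}
```

Note that if the original metadata already contains a `doc_id` or `char_length` key, the automatically added value overwrites it.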

## VectorStore Class
The VectorStore class wraps ChromaDB with an interface for storing documents with their embeddings and for managing collections. It does not implement querying.

Key Features:

- Initialization: Sets up persistent storage and creates/loads collections
- Document Management: Adds documents with embeddings, prevents duplicates
- Collection Operations: Create, delete, and inspect collections
- Statistics: View collection counts and database stats
- Duplicate Detection: Uses SHA-256 hashing to prevent duplicate documents

In [3]:
# Initializing vector store .
vector_store=VectorStore(collection_name="Test_Collection")

Creating new collection named : Test_Collection .
Vector store initialized. Collection : Test_Collection
Existing documents in collection 0


In [4]:
# Used to create a Document for testing .
from langchain_core.documents import Document

In [5]:
# Test documents .
test_documents = [
    Document(
        page_content="Retrieval Augmented Generation improves factual accuracy.",
        metadata={"source": "test_doc_1"}
    ),
    Document(
        page_content="Vector databases store embeddings for semantic search.",
        metadata={"source": "test_doc_2"}
    ),
    Document(
        page_content="Chroma is commonly used as a vector store in RAG systems.",
        metadata={"source": "test_doc_3"}
    ),
]

In [6]:
# Library that provides the embedding model.
from sentence_transformers import SentenceTransformer

In [7]:
# Model used for generating embeddings of the test documents .
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

In [8]:
# Generating embeddings for the test documents.
texts = [doc.page_content for doc in test_documents]

test_embeddings = embedding_model.encode(
    texts,
    convert_to_numpy=True,
    normalize_embeddings=True
)

In [9]:
# Dimensions of the generated embeddings .
print(test_embeddings.shape)

(3, 384)


In [10]:
# Adding documents to collection .
vector_store.add_documents(
    documents=test_documents,
    embeddings=test_embeddings
)

Successfully added 3 documents,  skipped 0 duplicate documents .
Total number of documents in vector store : 3


In [11]:
# Adding the same documents again: duplicates are detected and skipped.
vector_store.add_documents(
    documents=test_documents,
    embeddings=test_embeddings
)

No new documents to add .


In [12]:
# Showing all the collections in the database.
vector_store.show_collections()

Total number of collections in database is 1
The collections in this database are :
1) --> Test_Collection


In [13]:
# Stats of the collection.
vector_store.collection_stats("Test_Collection")

Collection name :Test_Collection .
Number of documents in collection : 3 .


In [14]:
# Checking whether the collection exists.
vector_store.collection_exists("Test_Collection")

True

In [15]:
# Deleting a collection
vector_store.delete_collection("Test_Collection")

Deleted collection Test_Collection .
No collection pointing to add documents .


In [16]:
# Checking the database after deleting all the collections .
vector_store.show_collections()

No collection in database .


## What Comes Next

- Query Embeddings: Convert user queries to embeddings
- Similarity Search: Find relevant documents using vector similarity
- Retrieval: Fetch top-k most similar documents
- RAG Pipeline: Integrate with language models for generation

## Next Step: Implementing Retrieval for RAG