# **Vector Store**

- https://python.langchain.com/docs/integrations/vectorstores/

![alt text](https://python.langchain.com/assets/images/vectorstores-2540b4bc355b966c99b0f02cfdddb273.png "Vector Store")

##  📌 Why Do We Need Vector Stores Like ChromaDB Instead of Just Saving Vectors on Disk?

A **vector store** like **ChromaDB** is designed for **efficient similarity search and retrieval** of large-scale embeddings. While it's possible to save vectors on disk, a **specialized vector database** provides significant advantages.

---

## **🔹 Why Use ChromaDB Instead?**

### **1️⃣ Fast & Efficient Search (Approximate Nearest Neighbors)**
✅ **ChromaDB uses ANN (Approximate Nearest Neighbor) indexing** to retrieve vectors efficiently.  
✅ **Searching millions of embeddings can take milliseconds**, whereas a brute-force scan over vectors loaded from disk has to compare the query against every stored vector.  

🔹 **Example Use Case:**  
- Searching **news articles** based on meaning instead of exact keywords.
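
For contrast, here is a minimal sketch of what searching without an index looks like: every stored vector is scored against the query. The tiny 3-dimensional vectors and the helper names (`cosine_similarity`, `brute_force_search`) are made up for illustration and are not part of ChromaDB.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query, vectors, k=2):
    """Score every stored vector against the query: O(N) work per query."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
top = brute_force_search([1.0, 0.05, 0.0], vectors, k=2)
print(top)  # indices of the two most similar vectors
```

This is fine for a handful of vectors, but the work grows linearly with the collection size, which is the cost an ANN index is meant to avoid.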

---

### **2️⃣ Persistent & Scalable Storage**
✅ Unlike raw NumPy arrays, ChromaDB **persists embeddings** so they don’t disappear after a restart.  
✅ Handles **millions of embeddings** without performance issues.

🔹 **Example Use Case:**  
- **A chatbot retrieves past conversations** using stored embeddings, even after a server reboot.

---

### **3️⃣ Metadata & Filtering Support**
✅ **Stores metadata alongside vectors** (e.g., document titles, timestamps, categories).  
✅ Enables **filtered searches** (e.g., **"Find articles similar to X but only from 2023"**).

🔹 **Example Use Case:**  
- **A legal AI tool filters court cases** by year, jurisdiction, or relevance.
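
As a rough illustration of filtered search, the sketch below first keeps only the records whose metadata matches, then ranks the survivors by distance. The record layout and the `filtered_search` helper are assumptions made for this example; in the LangChain wrapper a comparable restriction is usually passed through the `filter` argument of `similarity_search`.

```python
def filtered_search(query, records, where, k=2):
    """Keep records whose metadata matches `where`, then rank by squared L2 distance."""
    def matches(meta):
        return all(meta.get(key) == value for key, value in where.items())

    def sq_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    candidates = [r for r in records if matches(r["metadata"])]
    candidates.sort(key=lambda r: sq_l2(query, r["vector"]))
    return [r["id"] for r in candidates[:k]]

# Hypothetical records: an id, a toy 2-dimensional vector, and metadata.
records = [
    {"id": "a", "vector": [0.0, 0.0], "metadata": {"year": 2022}},
    {"id": "b", "vector": [0.1, 0.1], "metadata": {"year": 2023}},
    {"id": "c", "vector": [0.9, 0.9], "metadata": {"year": 2023}},
    {"id": "d", "vector": [0.2, 0.0], "metadata": {"year": 2023}},
]
hits = filtered_search([0.0, 0.0], records, where={"year": 2023}, k=2)
print(hits)
```

Record `a` is the closest vector overall, but it is excluded because its year does not match the filter.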


---

## **🚀 Why Use a Vector Database like Chroma?**
| **Feature**           | **No Vector database** | **Using ChromaDB** |
|----------------------|--------------------------|------------------|
| **Search Speed**     | ❌ Slow (Brute Force) | ✅ Fast (ANN Indexing) |
| **Persistence**      | ❌ Requires manual saving | ✅ Stored automatically when `persist_directory` is set |
| **Metadata Support** | ❌ No metadata | ✅ Supports structured queries |
| **Scalability**      | ❌ Limited by RAM | ✅ Handles millions of embeddings |
| **Real-Time Updates**| ❌ Static data only | ✅ Supports dynamic indexing |

📌 **ChromaDB enables fast, scalable, and efficient vector search, making it ideal for AI-powered retrieval!** 🚀😊


`!pip install chromadb`

`!pip install "langchain-chroma>=0.1.2"`

## **Chroma**
- https://python.langchain.com/docs/integrations/vectorstores/chroma/

In [None]:
import chromadb
print(chromadb.__version__)

import langchain
print(langchain.__version__)

import pydantic
print(pydantic.__version__)

0.4.24
0.2.17
2.9.2


In [2]:
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader

### **Load Document and Split**

In [3]:
# load the document (it is split into chunks in the next cell)
loader = TextLoader("some_data/FDR_State_of_Union_1944.txt")
documents = loader.load()

### `RecursiveCharacterTextSplitter`

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# loader = TextLoader("some_data/FDR_State_of_Union_1944.txt") # or any other loader
# documents = loader.load()

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    model_name="gpt-4", # takes precedence over encoding_name; gpt-4 maps to cl100k_base
    encoding_name="cl100k_base",
    chunk_size=500, # The maximum number of Tokens in a chunk
    chunk_overlap=50, # Overlap between consecutive chunks
    add_start_index=True, # Flag to add start index to each chunk
)

splitted_docs = splitter.split_documents(documents)

In [5]:
print(len(splitted_docs))

10
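
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch over a list of tokens. It only illustrates the window arithmetic; `RecursiveCharacterTextSplitter` additionally splits on separators and merges pieces, so its real chunk boundaries will differ. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks

tokens = list(range(12))  # stand-in for 12 token ids
chunks = chunk_with_overlap(tokens, chunk_size=5, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk more often than it would without overlap.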


## Connect to OpenAI for Embeddings

In [None]:
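import os

# Read the key from the environment instead of hard-coding it in the notebook.
api_key = os.getenv("OPENAI_API_KEY")
if api_key is None:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")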




In [7]:
from langchain_openai import OpenAIEmbeddings

embedding_function = OpenAIEmbeddings(model="text-embedding-3-small")

### Create a `Chroma` object and pass Embeddings and Docs into it using `from_documents` function

In [8]:
# load it into Chroma
# persist_directory is where you want to save the embeddings
db = Chroma.from_documents(splitted_docs, 
                           embedding_function, 
                           persist_directory='./FDR_State_of_Union_Embeddings')

In [9]:
db._collection.count()

10

In [10]:
db.get()

{'ids': ['0ce2ef67-7de5-41c2-befc-1ec265310776',
  '3e9918f7-640d-4dd2-9d9e-c0e2f8aca2f9',
  '40299244-bcb4-450c-8736-ffa16718d369',
  '67bc7f19-a932-4157-a504-edbd69c19b5c',
  '6f70731e-2c61-4129-a6c8-41e2b1629fe1',
  'b51e644c-1141-4f2d-ae97-a9e0ac4afd93',
  'c039d218-d77d-4c3a-9182-d64953bc6370',
  'c721c7c9-11eb-40c5-a6ae-5870a8ec5d82',
  'dcb5bc8e-f9ba-440e-ae57-13eb2abb34e7',
  'e8324ebd-da6e-45da-8d37-821d1a90a752'],
 'embeddings': None,
 'metadatas': [{'source': 'some_data/FDR_State_of_Union_1944.txt',
   'start_index': -1},
  {'source': 'some_data/FDR_State_of_Union_1944.txt', 'start_index': -1},
  {'source': 'some_data/FDR_State_of_Union_1944.txt', 'start_index': 0},
  {'source': 'some_data/FDR_State_of_Union_1944.txt', 'start_index': -1},
  {'source': 'some_data/FDR_State_of_Union_1944.txt', 'start_index': -1},
  {'source': 'some_data/FDR_State_of_Union_1944.txt', 'start_index': 17837},
  {'source': 'some_data/FDR_State_of_Union_1944.txt', 'start_index': 4616},
  {'source': 

In [11]:
db.get(ids=['0d1794ce-25cc-470e-a433-4b44d5da375b'])  # ID from an earlier run; use one returned by db.get()['ids']
# or
ids = db.get()['ids']
db.get(ids=[ids[0]])

{'ids': ['0ce2ef67-7de5-41c2-befc-1ec265310776'],
 'embeddings': None,
 'metadatas': [{'source': 'some_data/FDR_State_of_Union_1944.txt',
   'start_index': -1}],
 'documents': ['But there were no secret treaties or political or financial commitments.\n\nThe one supreme objective for the future, which we discussed for each Nation individually, and for all the United Nations, can be summed up in one word: Security.\n\nAnd that means not only physical security which provides safety from attacks by aggressors. It means also economic security, social security, moral security—in a family of Nations.\n\nIn the plain down-to-earth talks that I had with the Generalissimo and Marshal Stalin and Prime Minister Churchill, it was abundantly clear that they are all most deeply interested in the resumption of peaceful progress by their own peoples—progress toward a better life. All our allies want freedom to develop their lands and resources, to build up industry, to increase education and individual

In [12]:
print(db.get()['documents'][0])

But there were no secret treaties or political or financial commitments.

The one supreme objective for the future, which we discussed for each Nation individually, and for all the United Nations, can be summed up in one word: Security.

And that means not only physical security which provides safety from attacks by aggressors. It means also economic security, social security, moral security—in a family of Nations.

In the plain down-to-earth talks that I had with the Generalissimo and Marshal Stalin and Prime Minister Churchill, it was abundantly clear that they are all most deeply interested in the resumption of peaceful progress by their own peoples—progress toward a better life. All our allies want freedom to develop their lands and resources, to build up industry, to increase education and individual opportunity, and to raise standards of living.

All our allies have learned by bitter experience that real development will not be possible if they are to be diverted from their purpo

In [13]:
# to delete all documents

# ids = db.get()['ids']

# db.delete(ids=ids)



<img src="https://pixionweb.blob.core.windows.net/web/lib/HNSW_888a78981d.svg" width="1000" height="1000">

## Load Embeddings from Disk and do a **Similarity Search**

### 🔹 How Does HNSW Work in ChromaDB?
**HNSW (Hierarchical Navigable Small World)** is a **graph-based ANN algorithm**. It constructs a **multi-layered graph**, where:

- 🟢 **Nodes** represent vectors (embeddings).
- 🔗 **Edges** connect similar vectors (nearest neighbors).
- 🚀 **Search** navigates through these connections efficiently instead of scanning all vectors.
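
The navigation idea in the list above can be sketched on a single hand-built layer: start at an entry node and keep moving to whichever neighbor is closest to the query. This is a deliberate simplification (real HNSW uses several layers and a candidate list), and the points, edges, and `greedy_search` helper are invented for the example.

```python
def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_search(query, points, neighbors, entry):
    """Walk the graph, always moving to the neighbor closest to the query."""
    current = entry
    path = [current]
    while True:
        best = min(neighbors[current], key=lambda n: sq_l2(query, points[n]))
        if sq_l2(query, points[best]) >= sq_l2(query, points[current]):
            return current, path  # no neighbor is closer: stop here
        current = best
        path.append(current)

# Six toy points on a line; the edge 0 -> 3 acts as a long-range shortcut.
points = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (2.0, 0.0),
          3: (3.0, 0.0), 4: (4.0, 0.0), 5: (5.0, 0.0)}
neighbors = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2, 4], 4: [3, 5], 5: [4]}

nearest, path = greedy_search((4.2, 0.0), points, neighbors, entry=0)
print(nearest, path)
```

The search reaches the nearest point after visiting three of the six nodes, because the shortcut edge lets it skip the nodes in between.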

---

### ⚡ Why ChromaDB Uses HNSW
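✅ **Fast Retrieval:** Search cost grows roughly as **O(log N)** instead of **O(N)** (brute force).  
✅ **High Accuracy:** Usually finds the true nearest neighbors; the **~99%** figure is a typical recall level and depends on the index parameters and the data.  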
✅ **Scalability:** Handles **millions of embeddings efficiently**.  


### 1. Create a new `Chroma` object 
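
A minimal sketch of reopening the persisted collection, assuming the directory written earlier in this notebook still exists and that the same embedding model is used as when the chunks were indexed. The `load_store` wrapper is a hypothetical helper; it only calls the `Chroma` constructor with the same arguments used elsewhere in this notebook.

```python
def load_store(persist_directory, embedding_function):
    """Reopen a persisted Chroma collection without re-embedding the documents."""
    # Imported lazily so this sketch can be loaded even where the package is missing.
    from langchain_chroma import Chroma

    return Chroma(
        persist_directory=persist_directory,
        embedding_function=embedding_function,
    )

# Example usage (not executed here):
# db_from_disk = load_store("./FDR_State_of_Union_Embeddings", embedding_function)
# db_from_disk.similarity_search("What did FDR say about the cost of food law?", k=3)
```

The cells below keep using the in-memory `db` object created earlier, which points at the same data.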

In [15]:
new_doc = "What did FDR say about the cost of food law?"

In [16]:
simillar_docs = db.similarity_search(new_doc,
                                     k=3) # k is the number of retrieved documents

simillar_docs

[Document(metadata={'source': 'some_data/FDR_State_of_Union_1944.txt', 'start_index': 9094}, page_content='That attitude on the part of anyone—Government or management or labor—can lengthen this war. It can kill American boys.\n\nLet us remember the lessons of 1918. In the summer of that year the tide turned in favor of the allies. But this Government did not relax. In fact, our national effort was stepped up. In August, 1918, the draft age limits were broadened from 21-31 to 18-45. The President called for "force to the utmost," and his call was heeded. And in November, only three months later, Germany surrendered.\n\nThat is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.\n\nTherefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:\n\n(1) A 

In [17]:
print(simillar_docs[0].page_content)

That attitude on the part of anyone—Government or management or labor—can lengthen this war. It can kill American boys.

Let us remember the lessons of 1918. In the summer of that year the tide turned in favor of the allies. But this Government did not relax. In fact, our national effort was stepped up. In August, 1918, the draft age limits were broadened from 21-31 to 18-45. The President called for "force to the utmost," and his call was heeded. And in November, only three months later, Germany surrendered.

That is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.

Therefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:

(1) A realistic tax law—which will tax all unreasonable profits, both individual and corporate, and reduce the ultimate c

In [18]:
simillar_docs = db.similarity_search_with_score(new_doc,
                                                k=3) # k is the number of retrieved documents

simillar_docs

[(Document(metadata={'source': 'some_data/FDR_State_of_Union_1944.txt', 'start_index': 9094}, page_content='That attitude on the part of anyone—Government or management or labor—can lengthen this war. It can kill American boys.\n\nLet us remember the lessons of 1918. In the summer of that year the tide turned in favor of the allies. But this Government did not relax. In fact, our national effort was stepped up. In August, 1918, the draft age limits were broadened from 21-31 to 18-45. The President called for "force to the utmost," and his call was heeded. And in November, only three months later, Germany surrendered.\n\nThat is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.\n\nTherefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:\n\n(1) A

## **🔹 Distance Metrics in Vector Search**
ChromaDB supports different **distance metrics** to measure similarity between vectors. All three are distances, so a smaller value means a closer match.

| **Distance**          | **Parameter** | **Equation** |
|----------------------|-------------|-------------|
| **Squared L2**      | `l2`        | $$ d = \sum (A_i - B_i)^2 $$ |
| **Inner Product**   | `ip`        | $$ d = 1.0 - \sum (A_i \times B_i) $$ |
| **Cosine Distance** | `cosine`   | $$ d = 1.0 - \frac{\sum (A_i \times B_i)}{\sqrt{\sum (A_i^2)} \times \sqrt{\sum (B_i^2)}} $$ |

🔹 **Choosing the Right Metric:**
- **L2 Distance (`l2`)** → Best for **Euclidean distance-based similarity**.
- **Inner Product (`ip`)** → Used when **vectors are normalized**, good for **dot-product-based similarity**.
- **Cosine Similarity (`cosine`)** → Measures **directional similarity**, independent of magnitude.
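
The three formulas in the table can be written out directly. This is a plain-Python sketch for building intuition, not ChromaDB's implementation; the function names are chosen for this example.

```python
import math

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product_distance(a, b):
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = [3.0, 4.0]
b = [6.0, 8.0]  # same direction as a, twice the length

l2 = squared_l2(a, b)              # 9 + 16 = 25.0
ip = inner_product_distance(a, b)  # 1 - 50 = -49.0
cos = cosine_distance(a, b)        # 1 - 50 / (5 * 10) = 0.0
print(l2, ip, cos)
```

The two vectors point in the same direction, so the cosine distance is zero even though the squared L2 distance is not. The inner-product value is dominated by vector length, which is why that metric is usually paired with normalized vectors.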

---

## **🔹 HNSW Parameters in ChromaDB**
ChromaDB uses **HNSW (Hierarchical Navigable Small World)** for efficient Approximate Nearest Neighbor (ANN) search. Two key parameters control its performance:

### **1️⃣ `hnsw:construction_ef` (Indexing Parameter)**
✅ Determines **the size of the candidate list** used to **select neighbors during index creation**.  
✅ **Higher values → More accurate index but slower & memory-intensive.**  
✅ **Lower values → Faster construction but reduced accuracy.**  

🔹 **Default Value:** `100`  

---

### **2️⃣ `hnsw:search_ef` (Search Parameter)**
✅ Determines **the size of the dynamic candidate list** when searching for **nearest neighbors**.  
✅ **Higher values → Better recall & accuracy but slower query time.**  
✅ **Lower values → Faster search but reduced accuracy.**  

🔹 **Default Value:** `100`  
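
To show why a larger candidate list improves recall, here is a single-layer sketch of the textbook best-first search that keeps at most `ef` results. It is a simplification under stated assumptions: the graph is hand-built, there is only one layer, and nothing here claims to mirror ChromaDB's internals. The `search_layer` name and the toy data are made up.

```python
import heapq

def sq_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def search_layer(query, points, neighbors, entry, ef):
    """Best-first search that keeps at most `ef` results (single layer only)."""
    d0 = sq_l2(query, points[entry])
    visited = {entry}
    candidates = [(d0, entry)]  # min-heap: closest unexplored node first
    results = [(-d0, entry)]    # max-heap via negation: worst kept result on top
    while candidates:
        dist, node = heapq.heappop(candidates)
        if dist > -results[0][0]:
            break  # the closest candidate is worse than the worst kept result
        for n in neighbors[node]:
            if n in visited:
                continue
            visited.add(n)
            d = sq_l2(query, points[n])
            if len(results) < ef or d < -results[0][0]:
                heapq.heappush(candidates, (d, n))
                heapq.heappush(results, (-d, n))
                if len(results) > ef:
                    heapq.heappop(results)  # drop the current worst result
    return sorted((-d, n) for d, n in results), len(visited)

# Toy graph in which the true nearest point "E" sits behind a worse-looking node "D".
points = {"A": (10.0,), "B": (3.0,), "C": (6.0,), "D": (7.0,), "E": (1.0,)}
neighbors = {"A": ["B", "C"], "B": ["A"], "C": ["A", "D"], "D": ["C", "E"], "E": ["D"]}
query = (0.0,)

res1, seen1 = search_layer(query, points, neighbors, "A", ef=1)
res3, seen3 = search_layer(query, points, neighbors, "A", ef=3)
print(res1[0], seen1)
print(res3[0], seen3)
```

With `ef=1` the search settles on `B` and never explores past `C`; with `ef=3` it keeps enough candidates to reach `E`, at the price of visiting more nodes.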

---

## **🚀 Distance Metrics & HNSW Parameters**
| **Feature** | **Purpose** | **Effect** |
|------------|------------|------------|
| **L2 Distance (`l2`)** | Measures squared Euclidean distance | Good for general vector similarity |
| **Inner Product (`ip`)** | Computes dot product similarity | Best for normalized vectors |
| **Cosine Similarity (`cosine`)** | Measures directional similarity | Best for NLP and text embeddings |
| **`hnsw:construction_ef`** | Controls indexing accuracy | Higher = More memory, better quality |
| **`hnsw:search_ef`** | Controls search recall & accuracy | Higher = More accurate but slower |

📌 **Tuning HNSW parameters helps balance speed, accuracy, and memory efficiency for optimal vector search in ChromaDB!** 🚀😊

- https://docs.trychroma.com/docs/collections/configure


In [19]:
from langchain_chroma import Chroma

vector_store = Chroma(
    embedding_function=embedding_function,
    persist_directory="./FDR_State_of_Union_Embeddings_cosine",  # Where to save data locally, remove if not necessary
    collection_metadata={
        "hnsw:space": "cosine", # cosine, ip or l2
        "hnsw:search_ef": 100
    }
)


print(vector_store._collection.metadata)

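# from_documents is a classmethod, so this call builds and returns a second Chroma wrapper.
# It targets the same persist_directory and default collection name, so the chunks end up
# in the collection configured above; vector_store.add_documents(splitted_docs) is more direct.
vector_store.from_documents(splitted_docs, 
                           embedding_function, 
                           persist_directory='./FDR_State_of_Union_Embeddings_cosine')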

{'hnsw:space': 'cosine', 'hnsw:search_ef': 100}


<langchain_chroma.vectorstores.Chroma at 0x24ba43dba90>

In [21]:
len(vector_store.get()['ids'])

10

In [22]:
new_doc = "What did FDR say about the cost of food law?"

simillar_docs = vector_store.similarity_search_with_score(new_doc,
                                                           k=3) # k is the number of retrieved documents

simillar_docs

[(Document(metadata={'source': 'some_data/FDR_State_of_Union_1944.txt', 'start_index': 9094}, page_content='That attitude on the part of anyone—Government or management or labor—can lengthen this war. It can kill American boys.\n\nLet us remember the lessons of 1918. In the summer of that year the tide turned in favor of the allies. But this Government did not relax. In fact, our national effort was stepped up. In August, 1918, the draft age limits were broadened from 21-31 to 18-45. The President called for "force to the utmost," and his call was heeded. And in November, only three months later, Germany surrendered.\n\nThat is the way to fight and win a war—all out—and not with half-an-eye on the battlefronts abroad and the other eye-and-a-half on personal, selfish, or political interests here at home.\n\nTherefore, in order to concentrate all our energies and resources on winning the war, and to maintain a fair and stable economy at home, I recommend that the Congress adopt:\n\n(1) A

In [None]:
# # to delete all documents

# ids = vector_store.get()['ids']

# vector_store.delete(ids=ids)

In [23]:
new_doc = "AI super hero in Marevl universe"

simillar_docs = vector_store.similarity_search_with_score(new_doc,
                                                           k=3) # k is the number of retrieved documents

simillar_docs

[(Document(metadata={'source': 'some_data/FDR_State_of_Union_1944.txt', 'start_index': 0}, page_content='This Nation in the past two years has become an active partner in the world\'s greatest war against human slavery.\n\nWe have joined with like-minded people in order to defend ourselves in a world that has been gravely threatened with gangster rule.\n\nBut I do not think that any of us Americans can be content with mere survival. Sacrifices that we and our allies are making impose upon us all a sacred obligation to see to it that out of this war we and our children will gain something better than mere survival.\n\nWe are united in determination that this war shall not be followed by another interim which leads to new disaster- that we shall not repeat the tragic errors of ostrich isolationism—that we shall not repeat the excesses of the wild twenties when this Nation went for a joy ride on a roller coaster which ended in a tragic crash.\n\nWhen Mr. Hull went to Moscow in October, an

### Adding more documents

In [24]:
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter


loader = PyPDFLoader('some_data/marvel_superheroes.pdf')


# Use a text splitter to break large text into smaller chunks
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(encoding_name="cl100k_base", 
                                                                     chunk_size=200,
                                                                     chunk_overlap=50)


marvel_pages = loader.load_and_split(text_splitter)
marvel_pages

[Document(metadata={'source': 'some_data/marvel_superheroes.pdf', 'page': 0}, page_content='Spider-Man\nSpider-Man, also known as Peter Parker, is one of Marvel\'s most iconic superheroes. Created by\nwriter Stan Lee and artist Steve Ditko, he first appeared in Amazing Fantasy #15 in 1962. As a\nteenager, Peter Parker was bitten by a radioactive spider, which granted him extraordinary powers,\nincluding superhuman strength, agility, the ability to cling to walls, and a "spider-sense" that warns\nhim of impending danger.\nDespite his incredible abilities, Peter\'s life is fraught with hardship. After the tragic murder of his\nUncle Ben, Peter learns a valuable lesson: "With great power comes great responsibility." This\nphilosophy shapes his journey as he fights crime and protects the citizens of New York City from\nnotorious villains like the Green Goblin, Doctor Octopus, and Venom. \nAs Spider-Man, Peter has been a key member of teams like the Avengers and the Fantastic Four.'),
 Docu

In [25]:
len(vector_store.get()['ids'])

10

In [26]:
from uuid import uuid4


uuids = [str(uuid4()) for _ in range(len(marvel_pages))]
vector_store.add_documents(documents=marvel_pages, ids=uuids)

['c21344fb-c212-42cc-8d3b-4eb358e21274',
 '71bb96ce-7ee1-4dc1-941b-64110eb5021e',
 'fc6206f8-d93c-471e-b3ec-49cd49f91b12',
 '7d7bad24-7207-4979-9d3d-780abf686649']

In [28]:
len(vector_store.get()['ids'])

14
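
Random `uuid4` IDs mean that re-running the cell above adds the same pages again under new IDs. One possible alternative, sketched here as an assumption rather than a recommendation from the library, is to derive each ID from the chunk's source and content with `uuid5`, so the same chunk always gets the same ID. The `stable_id` helper and the sample text are hypothetical; how the store treats a repeated ID (overwrite or reject) depends on the library version and should be checked.

```python
from uuid import NAMESPACE_URL, uuid5

def stable_id(source, page_content):
    """Derive a repeatable ID from where a chunk came from and what it says."""
    return str(uuid5(NAMESPACE_URL, f"{source}::{page_content}"))

# Hypothetical chunk text, standing in for doc.page_content.
id_first_run = stable_id("some_data/marvel_superheroes.pdf", "Spider-Man sample text")
id_second_run = stable_id("some_data/marvel_superheroes.pdf", "Spider-Man sample text")
id_other_chunk = stable_id("some_data/marvel_superheroes.pdf", "Iron Man sample text")

print(id_first_run == id_second_run)   # same chunk, same ID
print(id_first_run == id_other_chunk)  # different chunk, different ID
```

With IDs built this way, the list passed as `ids` could be computed from `doc.metadata["source"]` and `doc.page_content` for each loaded page.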

In [29]:
new_doc = "AI super hero in Marevl universe"

simillar_docs = vector_store.similarity_search_with_score(new_doc,
                                                           k=3) # k is the number of retrieved documents

simillar_docs

[(Document(metadata={'page': 1, 'source': 'some_data/marvel_superheroes.pdf'}, page_content="creating versions with increased firepower, flight capabilities, and artificial intelligence integration\nthrough his AI assistant J.A.R.V.I.S.\nIron Man plays a pivotal role in forming the Avengers and becomes one of the most influential\nheroes in the Marvel Universe. His charisma, intelligence, and moral dilemmas make him an\nintriguing character. Despite his arrogance, Tony Stark is driven by a desire to protect the world,\nculminating in his ultimate sacrifice in Avengers: Endgame, where he uses the Infinity Gauntlet to\ndefeat Thanos.\nBeyond the comics, Iron Man's cinematic portrayal by Robert Downey Jr. in the Marvel Cinematic\nUniverse (MCU) redefined the character, making him a global phenomenon. His journey from an\negotistical billionaire to a selfless hero remains one of Marvel?s most compelling arcs."),
  0.5719774081635858),
 (Document(metadata={'page': 1, 'source': 'some_data/ma

In [30]:
print(simillar_docs[0][0].page_content)

creating versions with increased firepower, flight capabilities, and artificial intelligence integration
through his AI assistant J.A.R.V.I.S.
Iron Man plays a pivotal role in forming the Avengers and becomes one of the most influential
heroes in the Marvel Universe. His charisma, intelligence, and moral dilemmas make him an
intriguing character. Despite his arrogance, Tony Stark is driven by a desire to protect the world,
culminating in his ultimate sacrifice in Avengers: Endgame, where he uses the Infinity Gauntlet to
defeat Thanos.
Beyond the comics, Iron Man's cinematic portrayal by Robert Downey Jr. in the Marvel Cinematic
Universe (MCU) redefined the character, making him a global phenomenon. His journey from an
egotistical billionaire to a selfless hero remains one of Marvel?s most compelling arcs.
