# Understanding Vector Stores with ChromaDB in LangChain

This notebook demonstrates how to use ChromaDB as a vector store in LangChain for semantic search and retrieval. Vector stores are a core component of RAG (Retrieval-Augmented Generation) systems.

## What is a Vector Store?
A vector store is a specialized database that stores and retrieves text embeddings (high-dimensional vector representations of text). It enables:
1. Semantic search (finding similar content)
2. Efficient retrieval of relevant information
3. Scalable document storage and querying
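
To make "finding similar content" concrete, here is a minimal sketch of ranking stored vectors against a query vector by cosine similarity. The three-dimensional vectors and their labels are made up for illustration; real embedding models produce far more dimensions, and a real vector store uses an index rather than a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings" keyed by a text label.
toy_index = {
    "fast bowler": [0.9, 0.1, 0.0],
    "opening batsman": [0.1, 0.9, 0.1],
    "team captain": [0.2, 0.3, 0.9],
}

def top_k(query_vector, index, k=1):
    """Return the k labels whose vectors are most similar to the query."""
    ranked = sorted(
        index,
        key=lambda label: cosine_similarity(query_vector, index[label]),
        reverse=True,
    )
    return ranked[:k]

print(top_k([0.8, 0.2, 0.1], toy_index, k=2))
```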

## Why ChromaDB?
ChromaDB is a popular choice because it offers:
- Easy setup and usage
- Fast approximate nearest-neighbor search (HNSW index)
- Local storage option
- Integration with many embedding models
- Support for metadata filtering

## Notebook Structure
1. Setup and Configuration
2. Vector Store Components
3. Document Creation and Storage
4. Vector Store Initialization
5. Similarity Search and Metadata Filtering
6. Updating and Deleting Documents

In [1]:
# Silence warnings
import warnings
import logging
warnings.filterwarnings('ignore')
logging.getLogger().setLevel(logging.ERROR)

# Configure ChromaDB logging
logging.getLogger('chromadb').setLevel(logging.ERROR)

# 1. Setup and Configuration

First, we'll configure our environment by:
1. Silencing unnecessary warnings
2. Setting up logging
3. Loading environment variables

This helps keep our notebook output clean and focused on the important parts.

In [2]:
from dotenv import load_dotenv
load_dotenv()

True

In [3]:
!which python

/media/rgukt/data/RAG/venv/bin/python


In [4]:
import os
google_api_key = os.getenv('GOOGLE_API_KEY')
if google_api_key:
    print("Google API key is set ✅")
else:
    print("⚠️ Google API key is not set. Please set GOOGLE_API_KEY in your .env file")

Google API key is set ✅


In [5]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
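# Chroma integration from the langchain-chroma package
# (replaces the deprecated langchain.vectorstores import path)
from langchain_chroma import Chroma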




A module that was compiled using NumPy 1.x cannot be run in
NumPy 2.2.6 as it may crash. To support both 1.x and 2.x
versions of NumPy, modules must be compiled with NumPy 2.0.
Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.

If you are a user of the module, the easiest solution will be to
downgrade to 'numpy<2' or try to upgrade the affected module.
We expect that some modules will need time to support NumPy 2.


# 2. Vector Store Components

For our vector store setup, we need:

1. **Embeddings Model**: `GoogleGenerativeAIEmbeddings`
   - Converts text into numerical vectors
   - Uses Google's `gemini-embedding-001` model, which returns 3072-dimensional vectors in this notebook

2. **Vector Store**: `Chroma`
   - Stores and manages the vectors
   - Provides similarity search capabilities
   - Persists data locally when `persist_directory` is set
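
The division of labor between the two components can be sketched in plain Python. Everything here is a hypothetical stand-in: `toy_embed` is a bag-of-words counter rather than a real embeddings model, and `ToyVectorStore` is not how Chroma is implemented. The point is only that the store calls the embedding function both when adding texts and when searching.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embeddings model: word-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """Illustrative only: stores (text, vector) pairs and scans them on search."""

    def __init__(self, embedding_function):
        self.embedding_function = embedding_function
        self._rows = []

    def add_texts(self, texts):
        for text in texts:
            self._rows.append((text, self.embedding_function(text)))

    def similarity_search(self, query, k=1):
        query_vector = self.embedding_function(query)
        ranked = sorted(
            self._rows, key=lambda row: cosine(query_vector, row[1]), reverse=True
        )
        return [text for text, _ in ranked[:k]]

store = ToyVectorStore(embedding_function=toy_embed)
store.add_texts(["yorkers from a fast bowler", "a calm captain under pressure"])
print(store.similarity_search("who is the bowler", k=1))
```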

In [6]:
# Verify chromadb installation
try:
    import chromadb
    print(f"ChromaDB version: {chromadb.__version__} ✅")
except ImportError as e:
    print(f"ChromaDB is not installed properly ❌ ({e})")

ChromaDB version: 1.1.1 ✅


In [7]:
from langchain_core.documents import Document


# Create LangChain documents for IPL players

doc1 = Document(
        page_content="Virat Kohli is one of the most successful and consistent batsmen in IPL history. Known for his aggressive batting style and fitness, he has led the Royal Challengers Bangalore in multiple seasons.",
        metadata={"team": "Royal Challengers Bangalore"}
    )
doc2 = Document(
        page_content="Rohit Sharma is the most successful captain in IPL history, leading Mumbai Indians to five titles. He's known for his calm demeanor and ability to play big innings under pressure.",
        metadata={"team": "Mumbai Indians"}
    )
doc3 = Document(
        page_content="MS Dhoni, famously known as Captain Cool, has led Chennai Super Kings to multiple IPL titles. His finishing skills, wicketkeeping, and leadership are legendary.",
        metadata={"team": "Chennai Super Kings"}
    )
doc4 = Document(
        page_content="Jasprit Bumrah is considered one of the best fast bowlers in T20 cricket. Playing for Mumbai Indians, he is known for his yorkers and death-over expertise.",
        metadata={"team": "Mumbai Indians"}
    )
doc5 = Document(
        page_content="Ravindra Jadeja is a dynamic all-rounder who contributes with both bat and ball. Representing Chennai Super Kings, his quick fielding and match-winning performances make him a key player.",
        metadata={"team": "Chennai Super Kings"}
    )

# 3. Document Creation and Storage

Here we:
1. Create `Document` objects that contain:
   - `page_content`: The actual text content
   - `metadata`: Additional information about the document

2. Use sports data as an example:
   - Short descriptions of IPL (Indian Premier League) players as `page_content`
   - Team affiliations as `metadata`

This structure allows us to:
- Store and retrieve content semantically
- Filter results based on metadata
- Maintain context with the source information
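
As a rough illustration of why metadata is kept separate from content, the sketch below filters documents by exact metadata match. `ToyDocument` and `filter_by_metadata` are hypothetical names that only mirror the two fields described above; they are not LangChain or Chroma APIs.

```python
from dataclasses import dataclass, field

@dataclass
class ToyDocument:
    """Hypothetical stand-in with the same two fields as a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def filter_by_metadata(documents, criteria):
    """Keep documents whose metadata matches every key/value pair in criteria."""
    return [
        doc
        for doc in documents
        if all(doc.metadata.get(key) == value for key, value in criteria.items())
    ]

players = [
    ToyDocument("Known for yorkers and death-over bowling.", {"team": "Mumbai Indians"}),
    ToyDocument("Finisher and wicketkeeper.", {"team": "Chennai Super Kings"}),
    ToyDocument("Dynamic all-rounder.", {"team": "Chennai Super Kings"}),
]

csk = filter_by_metadata(players, {"team": "Chennai Super Kings"})
print([doc.page_content for doc in csk])
```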

In [8]:
print(type(doc1))

<class 'langchain_core.documents.base.Document'>


In [9]:
docs = [doc1, doc2, doc3, doc4, doc5]

In [21]:
# Initialize Chroma with a persistent local collection
try:
    # task_type is left unset so the library can pick a suitable type per call
    # (document vs. query); a fixed "retrieval_query" would also apply to stored documents.
    embedding_function = GoogleGenerativeAIEmbeddings(model="gemini-embedding-001")

    vector_store = Chroma(
        embedding_function=embedding_function,
        persist_directory='my_chroma_db',
        collection_name='sample',
        collection_metadata={"hnsw:space": "cosine"}  # Use cosine distance for the HNSW index
    )
    print("Vector store initialized successfully ✅")
except Exception as e:
    print(f"Error initializing vector store: {e} ❌")

Vector store initialized successfully ✅




In [22]:
# Add documents with error handling
try:
    vector_store.add_documents(docs)
    print(f"Successfully added {len(docs)} documents to the vector store ✅")
except Exception as e:
    print(f"Error adding documents: {str(e)} ❌")

Error adding documents: Error embedding content: 429 You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits.
* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 0
* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 0
* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 0
* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 0 [violations {
  quota_metric: "generativelanguage.googleapis.com/embed_content_free_tier_requests"
  quota_id: "EmbedContentRequestsPerDayPerProjectPerModel-FreeTier"
}
violations {
  quota_metric: "generativelanguage.googleapis.com/embed_content_free_tier_requests"
  quota_id: "EmbedContentRequestsPerMinutePerProjectPerModel-FreeTier"
}
violations
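
The 429 response above is a quota error from the embedding API, not a problem with Chroma. For transient per-minute limits, one common mitigation is to retry with exponential backoff. The helper below is a minimal sketch: `call_with_backoff` and its parameters are hypothetical, and it catches every exception for brevity where real code would catch only the rate-limit error type. Note that the message reports `limit: 0`, which retrying alone may not resolve.

```python
import time

def call_with_backoff(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt seconds, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to the rate-limit error in practice
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demonstration with a function that fails twice before succeeding.
attempts = {"count": 0}
recorded_delays = []

def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("429 quota exceeded")
    return "ok"

# recorded_delays.append stands in for time.sleep so the demo does not wait.
print(call_with_backoff(flaky_call, sleep=recorded_delays.append), recorded_delays)
```

In the notebook, the wrapped call could be something like `lambda: vector_store.add_documents(docs)`.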

In [12]:
# view documents
vector_store.get(include=['embeddings','documents', 'metadatas'])

{'ids': ['5ac839a7-4c52-4843-b5c0-f2b8fc8e14eb',
  '38cd1f72-fe6d-4f7b-a2c6-32fba8d3d939',
  'de8a58ec-d637-42c3-9a1b-2fd399b37e9b',
  '40914258-62a9-41f3-92f9-2bf5cac84313',
  '07799d52-f30e-43d6-af2f-3639243932cf'],
 'embeddings': array([[ 0.00187277,  0.03488091,  0.02627761, ...,  0.01857099,
         -0.0156446 , -0.00622499],
        [-0.01239185,  0.00825603,  0.01180784, ...,  0.01308419,
         -0.02513637, -0.01149639],
        [-0.00300455, -0.00262631,  0.02177223, ...,  0.01372966,
         -0.03213968,  0.00140538],
        [-0.00210677, -0.00996671,  0.00578952, ..., -0.00493928,
          0.00881039,  0.00110918],
        [-0.00770416, -0.03298219,  0.00641239, ...,  0.0049683 ,
         -0.01183177,  0.01347619]], shape=(5, 3072)),
 'documents': ['Virat Kohli is one of the most successful and consistent batsmen in IPL history. Known for his aggressive batting style and fitness, he has led the Royal Challengers Bangalore in multiple seasons.',
  "Rohit Sharma is the m

In [23]:
# search documents
vector_store.similarity_search(
    query='Who among these are a bowler?',
    k=1
)

GoogleGenerativeAIError: Error embedding content: 429 You exceeded your current quota, please check your plan and billing details. For more information on this error, head to: https://ai.google.dev/gemini-api/docs/rate-limits.
* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 0
* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 0
* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 0
* Quota exceeded for metric: generativelanguage.googleapis.com/embed_content_free_tier_requests, limit: 0 [violations {
  quota_metric: "generativelanguage.googleapis.com/embed_content_free_tier_requests"
  quota_id: "EmbedContentRequestsPerDayPerUserPerProjectPerModel-FreeTier"
}
violations {
  quota_metric: "generativelanguage.googleapis.com/embed_content_free_tier_requests"
  quota_id: "EmbedContentRequestsPerMinutePerUserPerProjectPerModel-FreeTier"
}
violations {
  quota_metric: "generativelanguage.googleapis.com/embed_content_free_tier_requests"
  quota_id: "EmbedContentRequestsPerMinutePerProjectPerModel-FreeTier"
}
violations {
  quota_metric: "generativelanguage.googleapis.com/embed_content_free_tier_requests"
  quota_id: "EmbedContentRequestsPerDayPerProjectPerModel-FreeTier"
}
, links {
  description: "Learn more about Gemini API quotas"
  url: "https://ai.google.dev/gemini-api/docs/rate-limits"
}
]

In [14]:
# search with similarity score
vector_store.similarity_search_with_score(
    query='Who among these are a bowler?',
    k=2
)

[(Document(metadata={'team': 'Mumbai Indians'}, page_content='Jasprit Bumrah is considered one of the best fast bowlers in T20 cricket. Playing for Mumbai Indians, he is known for his yorkers and death-over expertise.'),
  0.41753751039505005),
 (Document(metadata={'team': 'Chennai Super Kings'}, page_content='Ravindra Jadeja is a dynamic all-rounder who contributes with both bat and ball. Representing Chennai Super Kings, his quick fielding and match-winning performances make him a key player.'),
  0.43137550354003906)]
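
The scores above appear to be distances rather than similarities: the collection was created with `hnsw:space` set to `cosine`, and the closer match (Bumrah) has the lower value. Chroma's cosine distance is 1 minus the cosine similarity. The sketch below uses plain tuples with rounded scores from the output as stand-ins for the `(Document, score)` pairs, plus a hypothetical cutoff of 0.42, to show how results might be filtered or converted.

```python
def within_distance(results, max_distance):
    """Keep (item, distance) pairs at or below the cutoff; lower means closer."""
    return [(item, dist) for item, dist in results if dist <= max_distance]

def to_similarity(distance):
    """Convert a cosine distance to a cosine similarity."""
    return 1.0 - distance

# Stand-ins for the (Document, score) pairs returned above, scores rounded.
results = [("Jasprit Bumrah", 0.4175), ("Ravindra Jadeja", 0.4314)]

close = within_distance(results, max_distance=0.42)
print(close)
print([round(to_similarity(dist), 4) for _, dist in results])
```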

In [15]:
# metadata filtering
vector_store.similarity_search_with_score(
    query=" ",
    filter={"team": "Chennai Super Kings"}
)

[(Document(metadata={'team': 'Chennai Super Kings'}, page_content='Ravindra Jadeja is a dynamic all-rounder who contributes with both bat and ball. Representing Chennai Super Kings, his quick fielding and match-winning performances make him a key player.'),
  0.4798821210861206),
 (Document(metadata={'team': 'Chennai Super Kings'}, page_content='MS Dhoni, famously known as Captain Cool, has led Chennai Super Kings to multiple IPL titles. His finishing skills, wicketkeeping, and leadership are legendary.'),
  0.5008953213691711)]

In [16]:
# update documents
updated_doc1 = Document(
    page_content="Virat Kohli, the former captain of Royal Challengers Bangalore (RCB), is renowned for his aggressive leadership and consistent batting performances. He holds the record for the most runs in IPL history, including multiple centuries in a single season. Despite RCB not winning an IPL title under his captaincy, Kohli's passion and fitness set a benchmark for the league. His ability to chase targets and anchor innings has made him one of the most dependable players in T20 cricket.",
    metadata={"team": "Royal Challengers Bangalore"}
)

In [17]:
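# The previously hard-coded ID is not among the stored IDs; look up the current one by metadata
kohli_id = vector_store.get(where={"team": "Royal Challengers Bangalore"})["ids"][0]
vector_store.update_document(document_id=kohli_id, document=updated_doc1)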

In [18]:
# view documents
vector_store.get(include=['embeddings','documents', 'metadatas'])

{'ids': ['5ac839a7-4c52-4843-b5c0-f2b8fc8e14eb',
  '38cd1f72-fe6d-4f7b-a2c6-32fba8d3d939',
  'de8a58ec-d637-42c3-9a1b-2fd399b37e9b',
  '40914258-62a9-41f3-92f9-2bf5cac84313',
  '07799d52-f30e-43d6-af2f-3639243932cf'],
 'embeddings': array([[ 0.00187277,  0.03488091,  0.02627761, ...,  0.01857099,
         -0.0156446 , -0.00622499],
        [-0.01239185,  0.00825603,  0.01180784, ...,  0.01308419,
         -0.02513637, -0.01149639],
        [-0.00300455, -0.00262631,  0.02177223, ...,  0.01372966,
         -0.03213968,  0.00140538],
        [-0.00210677, -0.00996671,  0.00578952, ..., -0.00493928,
          0.00881039,  0.00110918],
        [-0.00770416, -0.03298219,  0.00641239, ...,  0.0049683 ,
         -0.01183177,  0.01347619]], shape=(5, 3072)),
 'documents': ['Virat Kohli is one of the most successful and consistent batsmen in IPL history. Known for his aggressive batting style and fitness, he has led the Royal Challengers Bangalore in multiple seasons.',
  "Rohit Sharma is the m

In [19]:
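# These IDs came from an earlier run and do not appear in the collection shown above;
# substitute IDs returned by vector_store.get()['ids'] before running.
vector_store.delete(ids=['f7bbffc0-e2a7-4876-9509-facf24a6c91e',
  '903a37e2-fe77-47cb-b6a1-6604e92d8cdc'])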
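
The IDs passed to `update_document` and `delete` in this notebook do not appear among the IDs that `get` returns, which would explain why the stored documents look unchanged afterwards. One way to avoid stale IDs is to derive them deterministically from a key you control instead of relying on randomly generated UUIDs. The sketch below uses a plain dictionary as a hypothetical stand-in for a collection; `stable_id`, `upsert`, and `delete` are illustrative names, not Chroma APIs.

```python
import uuid

def stable_id(key):
    """Derive the same UUID string from the same key on every run."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, key))

collection = {}  # hypothetical stand-in for a collection: id -> page content

def upsert(key, content):
    """Insert or overwrite the record whose ID is derived from key."""
    collection[stable_id(key)] = content

def delete(key):
    """Return True if a record was removed, False if the ID was not present."""
    return collection.pop(stable_id(key), None) is not None

upsert("virat-kohli", "original profile")
upsert("virat-kohli", "updated profile")  # same key, same ID: overwrites
print(len(collection), collection[stable_id("virat-kohli")])
print(delete("virat-kohli"), delete("virat-kohli"))
```

LangChain's `add_documents` accepts an `ids` argument, so IDs generated this way could be supplied when the documents are first added.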

In [20]:
# view documents
vector_store.get(include=['embeddings','documents', 'metadatas'])

{'ids': ['5ac839a7-4c52-4843-b5c0-f2b8fc8e14eb',
  '38cd1f72-fe6d-4f7b-a2c6-32fba8d3d939',
  'de8a58ec-d637-42c3-9a1b-2fd399b37e9b',
  '40914258-62a9-41f3-92f9-2bf5cac84313',
  '07799d52-f30e-43d6-af2f-3639243932cf'],
 'embeddings': array([[ 0.00187277,  0.03488091,  0.02627761, ...,  0.01857099,
         -0.0156446 , -0.00622499],
        [-0.01239185,  0.00825603,  0.01180784, ...,  0.01308419,
         -0.02513637, -0.01149639],
        [-0.00300455, -0.00262631,  0.02177223, ...,  0.01372966,
         -0.03213968,  0.00140538],
        [-0.00210677, -0.00996671,  0.00578952, ..., -0.00493928,
          0.00881039,  0.00110918],
        [-0.00770416, -0.03298219,  0.00641239, ...,  0.0049683 ,
         -0.01183177,  0.01347619]], shape=(5, 3072)),
 'documents': ['Virat Kohli is one of the most successful and consistent batsmen in IPL history. Known for his aggressive batting style and fitness, he has led the Royal Challengers Bangalore in multiple seasons.',
  "Rohit Sharma is the m