# ChromaDB: A Simple Course

ChromaDB is an open-source embedding database designed for storing and retrieving vector embeddings efficiently. This notebook provides a hands-on introduction to using ChromaDB in Python.

## What is ChromaDB?

ChromaDB is a database built specifically for AI applications that need to store and query vector embeddings. It's particularly useful for:

- Semantic search
- Recommendation systems
- Retrieval-augmented generation (RAG)
- Similar document detection
- Image search (when using image embeddings)

## Key Features

- **Simple API**: Easy to use from Python
- **Persistent storage**: Save your embeddings to disk
- **Fast similarity search**: Approximate nearest-neighbor search backed by an HNSW index
- **Multi-modal**: Store embeddings of text, images, or other data
- **Metadata filtering**: Query based on both vector similarity and metadata
- **Multiple collections**: Organize different types of embeddings

## Installation

In [None]:
# Install ChromaDB
!pip install chromadb

## Basic Usage

Let's start with the fundamentals of ChromaDB:

1. Creating a client
2. Creating a collection
3. Adding documents with embeddings
4. Querying similar documents

In [None]:
import chromadb

# Create an in-memory client
client = chromadb.Client()

# Create a collection
collection = client.create_collection(name="documents")

# Add documents with embeddings
collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "ChromaDB is a database for storing embeddings",
        "Embeddings are vector representations of data",
        "Vector databases make semantic search possible"
    ],
    # We're providing our own embeddings here as an example
    # (normally you might use an embedding function)
    embeddings=[
        [0.1, 0.2, 0.3, 0.4, 0.5],  # Embedding for doc1
        [0.5, 0.6, 0.7, 0.8, 0.9],  # Embedding for doc2
        [0.1, 0.3, 0.5, 0.7, 0.9]   # Embedding for doc3
    ]
)

# Query similar documents with a query embedding
results = collection.query(
    query_embeddings=[[0.1, 0.2, 0.3, 0.4, 0.5]],  # Identical to doc1's embedding, so doc1 ranks first
    n_results=2
)

print("Query results:")
for i, doc in enumerate(results['documents'][0]):
    print(f"{i+1}. {doc}")

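To see why the query above ranks the documents the way it does, the distances can be reproduced by hand. This is a minimal sketch that assumes the collection uses squared Euclidean (L2) distance, which is understood to be Chroma's default when no other metric is configured. The helper name `squared_l2` is illustrative and not part of the ChromaDB API.

```python
# Reproduce the ranking of the toy embeddings by hand.
# Assumption: squared L2 distance, where a smaller value means more similar.

embeddings = {
    "doc1": [0.1, 0.2, 0.3, 0.4, 0.5],
    "doc2": [0.5, 0.6, 0.7, 0.8, 0.9],
    "doc3": [0.1, 0.3, 0.5, 0.7, 0.9],
}
query = [0.1, 0.2, 0.3, 0.4, 0.5]


def squared_l2(a, b):
    """Sum of squared component-wise differences between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


distances = {doc_id: squared_l2(query, vec) for doc_id, vec in embeddings.items()}
ranking = sorted(distances, key=distances.get)

for doc_id in ranking:
    print(f"{doc_id}: {distances[doc_id]:.2f}")
```

Under this assumption, doc1 (distance 0) and doc3 come out ahead of doc2, so a query with `n_results=2` would return doc1 and doc3.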
## Using Embedding Functions

Rather than manually creating embeddings, ChromaDB can integrate with popular embedding models via embedding functions. Let's see how to use a Sentence Transformer embedding function.

In [None]:
# Install required dependencies if not already installed
!pip install -q sentence-transformers

from chromadb.utils import embedding_functions

# Create an embedding function using Sentence Transformers
embedding_function = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"  # A small, general-purpose sentence embedding model (384-dimensional vectors)
)

# Create a new collection with the embedding function
collection_with_embeddings = client.create_collection(
    name="documents_with_embedding_function", 
    embedding_function=embedding_function
)

# Now we can add documents without providing embeddings
collection_with_embeddings.add(
    ids=["doc1", "doc2", "doc3", "doc4"],
    documents=[
        "ChromaDB stores vector embeddings efficiently",
        "Neural networks can create embedding vectors from text",
        "Semantic search finds results based on meaning, not just keywords",
        "Vector similarity is often computed using cosine distance"
    ]
    # No need to provide embeddings - they're generated automatically
)

# Query with text instead of embeddings
results = collection_with_embeddings.query(
    query_texts=["How do embeddings work with neural networks?"],
    n_results=2
)

print("Query results:")
for i, doc in enumerate(results['documents'][0]):
    print(f"{i+1}. {doc}")

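Conceptually, an embedding function takes a list of texts and returns one fixed-length vector per text. The sketch below illustrates only that contract with a hypothetical `toy_embed` helper built on a hash. It is not a ChromaDB class, its vectors carry no semantic meaning, and Chroma's own embedding-function interface may impose requirements this sketch does not cover.

```python
import hashlib


def toy_embed(texts, dim=8):
    """Map each text to a fixed-length list of floats in [0, 1].

    Hypothetical stand-in for a real model: the vectors come from a hash,
    so they are deterministic but say nothing about meaning.
    """
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        vectors.append([byte / 255 for byte in digest[:dim]])
    return vectors


vectors = toy_embed(["ChromaDB stores vectors", "Semantic search"])
print(len(vectors), len(vectors[0]))
```

A real model such as the Sentence Transformer above follows the same shape (texts in, equal-length vectors out) but places texts with similar meanings close together, which is what makes semantic search work.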
## Persistent Storage

In real applications, you'll want to save your embeddings to disk rather than keeping them in memory. Here's how to use persistent storage with ChromaDB.

In [None]:
import tempfile

# Create a temporary directory for our persistent database
persist_directory = tempfile.mkdtemp()
print(f"Using persistent directory: {persist_directory}")

# Create a persistent client
persistent_client = chromadb.PersistentClient(path=persist_directory)

# Create a collection
persistent_collection = persistent_client.create_collection(
    name="persistent_docs",
    embedding_function=embedding_function  # Reusing the function from before
)

# Add documents
persistent_collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "This document will be saved to disk",
        "The embeddings are stored persistently",
        "You can restart your application and still access the collection"
    ]
)

# Query
results = persistent_collection.query(
    query_texts=["persistent storage"],
    n_results=2
)

print("Query results:")
for i, doc in enumerate(results['documents'][0]):
    print(f"{i+1}. {doc}")

# If you restart your application, you can load the collection again:
# reloaded_client = chromadb.PersistentClient(path=persist_directory)
# reloaded_collection = reloaded_client.get_collection("persistent_docs")

## Working with Metadata

ChromaDB allows you to store metadata alongside your documents and embeddings. This metadata can be used for filtering during queries.

In [None]:
# Create a collection for documents with metadata
collection_with_metadata = client.create_collection(
    name="documents_with_metadata", 
    embedding_function=embedding_function
)

# Add documents with metadata
collection_with_metadata.add(
    ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
    documents=[
        "Python is a popular programming language",
        "JavaScript is commonly used for web development",
        "TensorFlow is a machine learning framework",
        "PyTorch is another machine learning framework",
        "SQL is used for database queries"
    ],
    metadatas=[
        {"type": "language", "level": "beginner", "year": 2023},
        {"type": "language", "level": "intermediate", "year": 2023},
        {"type": "framework", "level": "advanced", "year": 2022},
        {"type": "framework", "level": "advanced", "year": 2022},
        {"type": "language", "level": "intermediate", "year": 2021}
    ]
)

# Query with metadata filtering
results = collection_with_metadata.query(
    query_texts=["machine learning libraries"],
    where={"type": "framework"},  # Only return frameworks
    n_results=5
)

print("Query results (frameworks only):")
for i, doc in enumerate(results['documents'][0]):
    print(f"{i+1}. {doc}")

# Query with another metadata filter
results = collection_with_metadata.query(
    query_texts=["programming languages"],
    where={"level": "intermediate"},  # Only intermediate level
    n_results=5
)

print("\nQuery results (intermediate level only):")
for i, doc in enumerate(results['documents'][0]):
    print(f"{i+1}. {doc}")

# Query with a comparison operator ($gte) instead of an exact match
results = collection_with_metadata.query(
    query_texts=["programming technologies"],
    where={"year": {"$gte": 2022}},  # Only from 2022 or later
    n_results=5
)

print("\nQuery results (from 2022 or later):")
for i, doc in enumerate(results['documents'][0]):
    print(f"{i+1}. {doc}")

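To make the filter semantics concrete, here is a pure-Python sketch of how a `where` clause selects records, applied to the same metadata as above. It covers only equality, a few comparison operators, and `$and`/`$or`. The `matches` and `select` helpers are hypothetical and are not how Chroma implements filtering internally.

```python
import operator

# Comparison operators covered by this sketch (a subset of what Chroma accepts).
COMPARATORS = {
    "$eq": operator.eq,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, where):
    """Return True if a metadata dict satisfies a where-style filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {"year": {"$gte": 2022}}
            for op, expected in condition.items():
                if key not in metadata or not COMPARATORS[op](metadata[key], expected):
                    return False
        elif metadata.get(key) != condition:
            # Shorthand equality, e.g. {"type": "framework"}
            return False
    return True


metadatas = {
    "doc1": {"type": "language", "level": "beginner", "year": 2023},
    "doc2": {"type": "language", "level": "intermediate", "year": 2023},
    "doc3": {"type": "framework", "level": "advanced", "year": 2022},
    "doc4": {"type": "framework", "level": "advanced", "year": 2022},
    "doc5": {"type": "language", "level": "intermediate", "year": 2021},
}


def select(where):
    """Return the ids whose metadata passes the filter."""
    return [doc_id for doc_id, meta in metadatas.items() if matches(meta, where)]


print(select({"type": "framework"}))
print(select({"year": {"$gte": 2022}}))
print(select({"$and": [{"type": "language"}, {"year": {"$gte": 2023}}]}))
```

In a real query, the filter narrows the candidate set and similarity ranking is applied to the records that pass it, which is why the framework-only query above can return at most two documents even with `n_results=5`.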
## Advanced Usage: Updating and Deleting

ChromaDB supports updating and deleting documents from collections.

In [None]:
# Create a collection
update_collection = client.create_collection(
    name="documents_to_update", 
    embedding_function=embedding_function
)

# Add initial documents
update_collection.add(
    ids=["doc1", "doc2", "doc3"],
    documents=[
        "Initial version of document 1",
        "Initial version of document 2",
        "This document will be deleted"
    ],
    metadatas=[
        {"version": 1},
        {"version": 1},
        {"version": 1}
    ]
)

# Update documents
update_collection.update(
    ids=["doc1", "doc2"],
    documents=[
        "Updated version of document 1",
        "Updated version of document 2"
    ],
    metadatas=[
        {"version": 2},
        {"version": 2}
    ]
)

# Delete a document
update_collection.delete(
    ids=["doc3"]
)

# Get all documents
results = update_collection.get()

print("Documents after updates and deletion:")
for i, (doc_id, doc, metadata) in enumerate(zip(results['ids'], results['documents'], results['metadatas'])):
    print(f"{i+1}. ID: {doc_id}, Content: '{doc}', Metadata: {metadata}")

## Integration with RAG Systems

ChromaDB is particularly well-suited for Retrieval-Augmented Generation (RAG) systems. Let's see how it fits into a simple RAG pipeline.

In [None]:
# A simple RAG pipeline with ChromaDB

# 1. Set up collection
rag_collection = client.create_collection(
    name="rag_documents", 
    embedding_function=embedding_function
)

# 2. Add knowledge base documents
rag_collection.add(
    ids=["kb1", "kb2", "kb3", "kb4"],
    documents=[
        "The capital of France is Paris. It is known for the Eiffel Tower.",
        "Rome is the capital of Italy. The Colosseum is located in Rome.",
        "Japan is an island country in East Asia. Tokyo is its capital.",
        "The Great Wall of China is over 13,000 miles long."
    ],
    metadatas=[
        {"source": "geography", "topic": "France"},
        {"source": "geography", "topic": "Italy"},
        {"source": "geography", "topic": "Japan"},
        {"source": "geography", "topic": "China"}
    ]
)

# 3. Function to simulate LLM response (in a real system, you would call an LLM API)
def simulate_llm_response(question, context):
    print("Question:", question)
    print("Context provided to LLM:")
    for i, ctx in enumerate(context):
        print(f"  {i+1}. {ctx}")
    
    # In a real system, you would call an API like OpenAI here.
    # For this example, we return a canned answer keyed on the top-ranked passage only,
    # so a lower-ranked passage that mentions another country cannot trigger the wrong answer.
    top_passage = context[0] if context else ""
    if "France" in top_passage:
        return "The capital of France is Paris, which is famous for the Eiffel Tower."
    elif "Italy" in top_passage:
        return "Rome is the capital city of Italy. It's known for the ancient Colosseum."
    elif "Japan" in top_passage:
        return "Tokyo is the capital city of Japan, which is an island nation in East Asia."
    elif "China" in top_passage:
        return "The Great Wall of China is one of the most impressive structures ever built, stretching over 13,000 miles."
    else:
        return "I don't have enough context to answer that question."

# 4. RAG function
def answer_with_rag(question):
    # Retrieve relevant context
    results = rag_collection.query(
        query_texts=[question],
        n_results=2
    )
    
    context = results['documents'][0]
    
    # Generate answer using the context
    answer = simulate_llm_response(question, context)
    
    return answer

# Test the RAG system
questions = [
    "What is the capital of France?",
    "Tell me about Rome",
    "What's interesting about the Great Wall of China?"
]

for question in questions:
    print("\n" + "="*50)
    answer = answer_with_rag(question)
    print("\nGenerated answer:")
    print(answer)

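When the simulated LLM is replaced with a real one, the retrieved passages are usually folded into a prompt. Here is a minimal sketch of that step. The `build_rag_prompt` helper and its template wording are assumptions for illustration, not a format required by ChromaDB or by any particular LLM provider. The passages are written out by hand here, whereas in the pipeline above they would come from `results['documents'][0]`.

```python
def build_rag_prompt(question, passages):
    """Combine retrieved passages and a question into one prompt string.

    The template wording is an illustrative assumption.
    """
    numbered = "\n".join(
        f"[{i}] {text}" for i, text in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{numbered}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_rag_prompt(
    "What is the capital of France?",
    [
        "The capital of France is Paris. It is known for the Eiffel Tower.",
        "Rome is the capital of Italy. The Colosseum is located in Rome.",
    ],
)
print(prompt)
```

Numbering the passages keeps the retrieval order visible in the prompt and makes it easy to ask the model to cite which passage it used.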
## Best Practices for ChromaDB

Here are some best practices for working with ChromaDB in production:

1. **Choose the right embedding model**: The quality of your embeddings significantly impacts retrieval performance.

2. **Index size consideration**: Large collections increase memory use and query latency, so they may call for narrowing searches with metadata filters or splitting data across collections.

3. **Chunk size matters**: For text documents, chunks that are too large dilute the embedding with unrelated content, while chunks that are too small lose the context needed to match a query.

4. **Metadata for filtering**: Store attributes such as source, type, or date as metadata so queries can restrict results with `where`.

5. **Regular backups**: With persistent storage, back up the database directory on a regular schedule.

6. **Collection organization**: Create separate collections for different types of data.

7. **Monitor query performance**: As your collections grow, keep an eye on query latency.

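As an illustration of the chunking point above, here is a minimal sketch of fixed-size word chunking with overlap. The `chunk_words` helper and its default sizes are assumptions for demonstration. Real pipelines often split on sentences or tokens instead.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words, so text near a chunk
    boundary appears in both neighboring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_words(sample, chunk_size=5, overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk can then be added to a collection as its own document, for example with an ID that combines the source document's ID and the chunk's position.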
## Conclusion

ChromaDB offers an elegant solution for storing and querying vector embeddings. Its integration with popular embedding models and its efficient similarity search make it well suited to RAG applications, semantic search, and other AI use cases that rely on vector similarity.

For more information, check out the [official documentation](https://docs.trychroma.com/).