# Module 05 - Notebook 03: Vector Databases

## Learning Objectives
- Understand why vector databases are needed
- Set up and use ChromaDB
- Store and retrieve embeddings efficiently
- Work with collections and metadata
- Implement persistence and backup

---

## 1. Why Vector Databases?

Traditional databases (SQL, NoSQL) are optimized for exact matches and range comparisons, not semantic similarity:
- `WHERE name = 'John'` ✓
- `WHERE age > 25` ✓
- `WHERE description is semantically similar to 'AI'` ✗

**Vector databases** are designed for:
- **Similarity search** on high-dimensional vectors
- **Fast k-NN** (k-nearest neighbors) queries
- **Scalability** to millions of vectors
- **Filtering** by metadata alongside similarity
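
To make the list above concrete, here is a minimal brute-force sketch in plain Python of what a vector database does conceptually: rank stored vectors by cosine similarity to a query, optionally after filtering on metadata. The 3-dimensional vectors, the `records` list, and the `knn` helper are made up for illustration; real systems typically use approximate indexes rather than a linear scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def knn(query, records, k=2, where=None):
    """Return the k records most similar to `query`.

    where: optional dict of metadata key/value pairs that must match exactly.
    """
    candidates = [
        r for r in records
        if where is None
        or all(r["metadata"].get(key) == val for key, val in where.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine_similarity(query, r["vector"]),
        reverse=True,  # highest similarity first
    )
    return ranked[:k]

# Toy 3-dimensional "embeddings" (hypothetical values, not from a real model)
records = [
    {"id": "a", "vector": [0.9, 0.1, 0.0], "metadata": {"category": "tech"}},
    {"id": "b", "vector": [0.8, 0.2, 0.1], "metadata": {"category": "tech"}},
    {"id": "c", "vector": [0.0, 0.1, 0.9], "metadata": {"category": "food"}},
]

top = knn([1.0, 0.0, 0.0], records, k=2)
print([r["id"] for r in top])

food = knn([1.0, 0.0, 0.0], records, k=2, where={"category": "food"})
print([r["id"] for r in food])
```

A linear scan like this compares the query against every stored vector, which is the cost that dedicated indexes are designed to avoid.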

### Popular Vector Databases:
- **Chroma**: Open source, easy to use, local-first
- **Pinecone**: Managed service, highly scalable
- **Weaviate**: Open source, GraphQL API
- **FAISS**: Meta's (Facebook AI Research) similarity-search library rather than a full database, very fast
- **Milvus**: Distributed, production-ready

## 2. Set Up ChromaDB

In [None]:
!pip install -q chromadb openai python-dotenv sentence-transformers

In [None]:
import chromadb
from chromadb.config import Settings
import os
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

# Initialize Chroma (in-memory mode for demo)
chroma_client = chromadb.Client()

print("✓ ChromaDB initialized")
print(f"  Version: {chromadb.__version__}")

## 3. Collections

Collections are like tables in traditional databases. Each one stores embeddings together with their IDs, source documents, and metadata.

In [None]:
# Create a collection (get_or_create avoids an error if this cell is re-run)
collection = chroma_client.get_or_create_collection(
    name="my_first_collection",
    metadata={"description": "A demo collection"}
)

print(f"Created collection: {collection.name}")
print(f"Count: {collection.count()} items")

## 4. Adding Documents

In [None]:
# Sample documents
documents = [
    "Python is a versatile programming language.",
    "Machine learning enables computers to learn from data.",
    "Neural networks are inspired by the human brain.",
    "JavaScript is essential for web development.",
    "Data science combines statistics and programming."
]

# Generate embeddings using OpenAI
response = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=documents
)
embeddings = [item.embedding for item in response.data]

# Add to collection
collection.add(
    documents=documents,
    embeddings=embeddings,
    ids=[f"doc_{i}" for i in range(len(documents))],
    metadatas=[
        {"category": "programming", "length": len(doc)}
        for doc in documents
    ]
)

print(f"Added {len(documents)} documents")
print(f"Collection now has {collection.count()} items")
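
For larger document sets, embedding requests and `collection.add` calls are usually made in batches rather than in one call. The sketch below shows only the batching logic in plain Python; the `chunked` helper and the batch size of 3 are hypothetical choices for illustration, not part of chromadb or the OpenAI client.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])

docs = [f"document {i}" for i in range(7)]

for batch_index, batch in enumerate(chunked(docs, batch_size=3)):
    # In the notebook, this is where you would embed `batch`
    # and pass it to collection.add(...) with matching ids.
    print(batch_index, len(batch))
```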

## 5. Querying

In [None]:
# Query the collection
query = "I want to learn about AI"

# Get query embedding
query_response = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=query
)
query_embedding = query_response.data[0].embedding

# Search
results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3
)

print(f"Query: '{query}'\n")
print("Top Results:\n")
for i, (doc, distance, metadata) in enumerate(zip(
    results['documents'][0],
    results['distances'][0],
    results['metadatas'][0]
), 1):
    print(f"{i}. [distance: {distance:.3f}] {doc}")
    print(f"   Metadata: {metadata}\n")
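
The `distance` values printed above are easier to read once the metric is known. As far as I know, Chroma collections use squared L2 distance unless configured otherwise, so smaller means more similar; treat that default as an assumption and confirm it for your version. For unit-length vectors, squared L2 distance and cosine similarity are related by d^2 = 2 - 2*cos, which this plain-Python sketch checks on made-up vectors.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors, normalized to unit length
u = normalize([3.0, 4.0, 0.0])
v = normalize([1.0, 2.0, 2.0])

d2 = squared_l2(u, v)
cos = cosine_similarity(u, v)
print(f"squared L2: {d2:.4f}")
print(f"cosine similarity: {cos:.4f}")
print(f"2 - 2*cos: {2 - 2 * cos:.4f}")
```

Under this relation, a distance near 0 corresponds to nearly identical directions and a distance near 2 to unrelated (orthogonal) ones.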

## 6. Metadata Filtering

In [None]:
# Add more documents with varied metadata
new_docs = [
    "Deep learning revolutionizes computer vision.",
    "Cooking pasta requires boiling water.",
    "Basketball is a popular sport worldwide."
]

new_response = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=new_docs
)
new_embeddings = [item.embedding for item in new_response.data]

collection.add(
    documents=new_docs,
    embeddings=new_embeddings,
    ids=[f"doc_{i + len(documents)}" for i in range(len(new_docs))],  # continue numbering after the first batch
    metadatas=[
        {"category": "programming", "length": len(new_docs[0])},
        {"category": "cooking", "length": len(new_docs[1])},
        {"category": "sports", "length": len(new_docs[2])}
    ]
)

# Query with metadata filter
filtered_results = collection.query(
    query_embeddings=[query_embedding],
    n_results=3,
    where={"category": "programming"}  # Only programming docs
)

print("Results filtered by category='programming':\n")
for doc in filtered_results['documents'][0]:
    print(f"  • {doc}")
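
Filters can also combine several conditions. The sketch below builds a `where` dictionary using Chroma's operator syntax (`$eq`, `$gte`, `$and`). The `build_where` helper and its parameters are hypothetical conveniences, not part of the chromadb API; the `category` and `length` keys are the metadata fields used in this notebook.

```python
from typing import Any, Dict, List, Optional

def build_where(category: Optional[str] = None,
                min_length: Optional[int] = None) -> Optional[Dict[str, Any]]:
    """Build a Chroma-style `where` filter from optional conditions.

    Returns None when no condition is given, a single condition as-is,
    and wraps two or more conditions in "$and".
    """
    conditions: List[Dict[str, Any]] = []
    if category is not None:
        conditions.append({"category": {"$eq": category}})
    if min_length is not None:
        conditions.append({"length": {"$gte": min_length}})

    if not conditions:
        return None
    if len(conditions) == 1:
        return conditions[0]
    return {"$and": conditions}

print(build_where())
print(build_where(category="programming"))
print(build_where(category="programming", min_length=40))
```

The result can be passed as `where=` to `collection.query(...)`; for example, `build_where("programming", 40)` would restrict matches to programming documents of at least 40 characters.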

## 7. Persistence

In [None]:
# Create persistent client
import tempfile
import shutil

# Create temp directory for demo
db_path = tempfile.mkdtemp()
print(f"Database path: {db_path}")

# Initialize persistent client
persistent_client = chromadb.PersistentClient(path=db_path)

# Create collection
persistent_collection = persistent_client.create_collection(
    name="persistent_demo"
)

# Add data
sample_docs = ["Data persists across sessions.", "This is saved to disk."]
sample_response = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=sample_docs
)
sample_embeddings = [item.embedding for item in sample_response.data]

persistent_collection.add(
    documents=sample_docs,
    embeddings=sample_embeddings,
    ids=["persist_1", "persist_2"]
)

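print(f"\nAdded {persistent_collection.count()} documents")
print("Data stored under db_path survives a restart (until the directory is deleted).")

# Cleanup: remove the demo directory. Skip this step to keep the data.
# ignore_errors=True because the client may still hold files open (e.g. on Windows).
shutil.rmtree(db_path, ignore_errors=True)
print(f"\n(Demo: cleaned up {db_path})")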

## 8. Collection Management

In [None]:
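# List all collections
collections = chroma_client.list_collections()
print("Available collections:")
for coll in collections:
    # Some chromadb releases return plain names here instead of Collection objects
    name = coll if isinstance(coll, str) else coll.name
    print(f"  • {name}: {chroma_client.get_collection(name=name).count()} items")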

# Get specific collection
existing = chroma_client.get_collection(name="my_first_collection")
print(f"\nRetrieved collection: {existing.name}")

# Delete collection
# chroma_client.delete_collection(name="my_first_collection")
# print("Collection deleted")

## 9. Update and Delete Documents

In [None]:
# Get a document by ID
result = collection.get(ids=["doc_0"])
print("Original document:")
print(f"  ID: {result['ids'][0]}")
print(f"  Text: {result['documents'][0]}")

# Update a document
new_text = "Python is an amazing programming language!"
new_emb = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=new_text
).data[0].embedding

collection.update(
    ids=["doc_0"],
    documents=[new_text],
    embeddings=[new_emb]
)

# Verify update
updated = collection.get(ids=["doc_0"])
print("\nUpdated document:")
print(f"  Text: {updated['documents'][0]}")

# Delete a document
# collection.delete(ids=["doc_0"])
# print(f"\nCollection count after delete: {collection.count()}")

## 10. Working with Custom Embeddings

In [None]:
# You can use any embedding model
from sentence_transformers import SentenceTransformer

# Load local model
local_model = SentenceTransformer('all-MiniLM-L6-v2')

# Create new collection
local_collection = chroma_client.create_collection(
    name="local_embeddings"
)

# Generate embeddings locally
texts = ["Free local embeddings", "No API costs"]
local_embeddings = local_model.encode(texts).tolist()

local_collection.add(
    documents=texts,
    embeddings=local_embeddings,
    ids=["local_1", "local_2"]
)

print(f"Added {local_collection.count()} documents with local embeddings")
print(f"Embedding dimension: {len(local_embeddings[0])}")
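
Embeddings from different models are not interchangeable: `all-MiniLM-L6-v2` produces 384-dimensional vectors, while `text-embedding-3-small` produces 1536 by default, and a collection expects a single dimensionality throughout. A minimal guard in plain Python might look like the following; `check_dimensions` is a hypothetical helper and the zero vectors are stand-ins for real model output.

```python
def check_dimensions(embeddings, expected_dim):
    """Raise ValueError if any embedding does not have `expected_dim` entries."""
    for i, emb in enumerate(embeddings):
        if len(emb) != expected_dim:
            raise ValueError(
                f"embedding {i} has dimension {len(emb)}, expected {expected_dim}"
            )
    return True

# Stand-in vectors with the two models' dimensionalities
fake_local = [[0.0] * 384, [0.0] * 384]
fake_openai = [[0.0] * 1536]

check_dimensions(fake_local, 384)  # passes silently

try:
    check_dimensions(fake_local + fake_openai, 384)
except ValueError as err:
    print(f"Rejected: {err}")
```

Running a check like this before `collection.add(...)` surfaces a model mix-up early, and it is one reason to keep a separate collection per embedding model, as this section does.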

## Exercise: Build a Knowledge Base

Create a searchable knowledge base using ChromaDB.

In [None]:
# TODO: Complete this exercise
class KnowledgeBase:
    """
    A simple knowledge base powered by ChromaDB.
    """
    
    def __init__(self, collection_name: str = "knowledge_base"):
        # TODO: Initialize Chroma client and collection
        pass
    
    def add_facts(self, facts: list, categories: list = None):
        """
        Add facts to the knowledge base.
        
        Args:
            facts: List of text facts
            categories: Optional list of categories
        """
        # TODO: Implement
        # 1. Generate embeddings
        # 2. Add to collection with metadata
        pass
    
    def search(self, query: str, n_results: int = 3, category: str = None):
        """
        Search the knowledge base.
        
        Args:
            query: Search query
            n_results: Number of results
            category: Optional category filter
        
        Returns:
            List of matching facts
        """
        # TODO: Implement
        # 1. Generate query embedding
        # 2. Query collection (with filter if category provided)
        # 3. Return results
        pass

# Test your implementation
# kb = KnowledgeBase()
# kb.add_facts([...])
# results = kb.search("your query")
# print(results)

## Summary

You learned:
- ✅ Why vector databases are essential
- ✅ Setting up and using ChromaDB
- ✅ Adding, querying, and managing documents
- ✅ Metadata filtering
- ✅ Persistence and collection management

## Key Takeaways

1. **Vector DBs enable semantic search** at scale
2. **Collections** organize embeddings
3. **Metadata filtering** combines semantic + structured search
4. **Persistence** saves data across sessions
5. **ChromaDB is easy** to get started with

## Next Steps
- 📘 Notebook 04: Similarity Search Techniques