# Session 2.2: ChromaDB Basics - Storage and Retrieval

## Overview

In this notebook, we'll explore:
- What ChromaDB is and why vector databases are useful
- Creating collections and understanding their structure
- Adding documents with embeddings and metadata
- Querying and retrieving data
- Understanding how ChromaDB organizes data
- Basic CRUD operations (Create, Read, Update, Delete)

**Key Concepts:**
- Vector databases optimize storage and retrieval of embeddings
- Collections are containers for related documents
- Each document has: ID, text, embedding, and optional metadata
- Queries return similar documents based on vector similarity

## Why Vector Databases?

### The Problem with Simple In-Memory Search

In the previous notebook, we built a simple semantic search using Python lists:
- ❌ Slows down as the list grows (as a rough rule of thumb, beyond ~10,000 documents)
- ❌ Lost when program ends (no persistence)
- ❌ No efficient similarity search (compares ALL documents)
- ❌ No metadata filtering
- ❌ No concurrent access
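
To make the "compares ALL documents" point concrete, here is a minimal sketch of a brute-force search in plain Python. The vectors are made-up 3-dimensional values rather than real embeddings, and `brute_force_search` is a hypothetical helper name, not the previous notebook's code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def brute_force_search(query_vec, doc_vecs, top_k=3):
    """Score every stored vector against the query: O(N) work per query."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:top_k]]

# Toy "embeddings" (made-up values, not real model output)
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
hits = brute_force_search([1.0, 0.1, 0.0], doc_vecs, top_k=2)
print(hits)
```

Every query touches every vector, so the cost grows linearly with the collection size. An ANN index avoids that by trading a little accuracy for speed.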

### What Vector Databases Provide

- ✅ **Efficient similarity search** using ANN (Approximate Nearest Neighbors)
- ✅ **Scalability** to millions of documents
- ✅ **Persistence** (save and load data)
- ✅ **Metadata filtering** (search within subsets)
- ✅ **CRUD operations** (update, delete documents)
- ✅ **Production features** (backups, clustering)

### Why ChromaDB?

- **Embedded** - runs in your Python process, no separate server
- **Simple API** - easy to learn and use
- **Free and open-source**
- **Automatic embeddings** - can generate embeddings for you
- **Flexible** - in-memory or persistent storage
- **Great for learning and prototyping**

## Setup and Installation

In [1]:
# Install ChromaDB
!pip install --upgrade pip
!pip install chromadb openai -q

print("✓ Packages installed")

✓ Packages installed


In [2]:
import chromadb
from chromadb.config import Settings
import json
from typing import List, Dict, Optional
from openai import OpenAI
import os

print(f"ChromaDB version: {chromadb.__version__}")

ChromaDB version: 1.3.5


## Part 1: Understanding ChromaDB Structure

### Hierarchy

```
ChromaDB Client
    ├── Collection 1 (e.g., "loomen_docs")
    │   ├── Document 1 (id, text, embedding, metadata)
    │   ├── Document 2
    │   └── Document 3
    │
    ├── Collection 2 (e.g., "user_questions")
    │   ├── Document 1
    │   └── Document 2
    │
    └── Collection 3...
```

### Key Components

1. **Client**: The database connection
2. **Collection**: A container for related documents (like a table in SQL)
3. **Document**: A unit of data with:
   - `id`: Unique identifier (string)
   - `document`: The text content
   - `embedding`: Vector representation (optional if the collection has an embedding function)
   - `metadata`: Key-value pairs for filtering (dict)
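
As a mental model only, the hierarchy can be pictured with plain Python types. This is an illustrative sketch: `Record` and `toy_client` are hypothetical names and say nothing about how ChromaDB stores data internally.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Union

MetadataValue = Union[str, int, float, bool]

@dataclass
class Record:
    """One stored item: the four fields listed above."""
    id: str
    document: str
    embedding: Optional[List[float]] = None  # filled in by an embedding function
    metadata: Dict[str, MetadataValue] = field(default_factory=dict)

# client -> collection name -> record id -> record
toy_client: Dict[str, Dict[str, Record]] = {"loomen_docs": {}, "user_questions": {}}

rec = Record(
    id="doc_0",
    document="Loomen je sustav za e-učenje.",
    metadata={"category": "general"},
)
toy_client["loomen_docs"][rec.id] = rec
print(toy_client["loomen_docs"]["doc_0"].metadata)
```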

## Part 2: Creating a ChromaDB Client and Collection

In [3]:
# Create an in-memory ChromaDB client
# This data will be lost when the notebook kernel restarts
client = chromadb.Client()

print("✓ ChromaDB client created (in-memory mode)")
print("  Data will be lost when kernel restarts")

✓ ChromaDB client created (in-memory mode)
  Data will be lost when kernel restarts


### Creating a Collection

Collections need:
- A unique **name**
- An **embedding function** (how to convert text to vectors)
- Optional **metadata** (description, configuration)
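
At its core, an embedding function is a callable that takes a list of texts and returns one vector per text. The sketch below is a toy stand-in for offline experiments without an API key. `HashingEmbeddingFunction` is a hypothetical class that hashes tokens into buckets, so its vectors carry no semantic meaning. To plug it into ChromaDB it would likely also need whatever extra methods your installed version expects.

```python
import hashlib
import math
from typing import List

class HashingEmbeddingFunction:
    """Toy stand-in: maps text to a fixed-size unit vector. NOT semantic."""

    def __init__(self, dim: int = 64):
        self.dim = dim

    def __call__(self, input: List[str]) -> List[List[float]]:
        # One vector per input text, in the same order
        return [self._embed(text) for text in input]

    def _embed(self, text: str) -> List[float]:
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.md5(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            vec[bucket] += 1.0
        # Normalize to unit length (leave all-zero vectors untouched)
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

embed = HashingEmbeddingFunction(dim=16)
vectors = embed(["Prijava na Loomen", "prijava na loomen"])
print(len(vectors), len(vectors[0]))
```

Because the hashing is deterministic, the same text always maps to the same vector, which is handy for unit tests.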

In [4]:
# We'll use a custom embedding function that calls OpenRouter.
# This cell assumes Google Colab: the API key is read from Colab secrets.
from google.colab import userdata
os.environ['OPENROUTER_API_KEY'] = userdata.get('OPENROUTER_API_KEY')

OPENROUTER_API_KEY = os.getenv('OPENROUTER_API_KEY', None)

openrouter_client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=OPENROUTER_API_KEY
)

EMBEDDING_MODEL = "openai/text-embedding-3-small"


class OpenRouterEmbeddingFunction:
    """
    Custom embedding function for ChromaDB using OpenRouter.
    """
    def __init__(self, api_key: str, model: str = EMBEDDING_MODEL):
        self.client = OpenAI(
            base_url="https://openrouter.ai/api/v1",
            api_key=api_key
        )
        self.model = model

    def __call__(self, input: List[str]) -> List[List[float]]:
        """
        Generate embeddings for a list of texts (for documents).
        ChromaDB expects this signature when adding documents.
        """
        response = self.client.embeddings.create(
            input=input,
            model=self.model
        )
        return [item.embedding for item in response.data]

    def embed_query(self, input: List[str]) -> List[List[float]]:
        """
        Generate embeddings for a list of query texts.
        ChromaDB expects this signature when querying.
        """
        return self.__call__(input)

    def name(self) -> str:
        """
        Returns the name of the embedding function, often the model name.
        """
        return self.model


# Create the embedding function instance
embedding_function = OpenRouterEmbeddingFunction(OPENROUTER_API_KEY)

print("✓ Custom embedding function created")

✓ Custom embedding function created


In [5]:
# Create a collection, or get it if it already exists
collection = client.get_or_create_collection(
    name="loomen_faq",
    embedding_function=embedding_function,
    metadata={"description": "Croatian Loomen FAQ documents"}
)

print(f"✓ Collection retrieved or created: '{collection.name}'")
print(f"  ID: {collection.id}")
print(f"  Metadata: {collection.metadata}")

✓ Collection retrieved or created: 'loomen_faq'
  ID: d46bdeaf-dc91-42af-90f2-5ee613f4352c
  Metadata: {'description': 'Croatian Loomen FAQ documents'}


## Part 3: Adding Documents

Let's add some Croatian language documents about Loomen with metadata.

In [6]:
# Prepare sample documents
documents = [
    "Loomen je sustav za e-učenje koji koriste hrvatska sveučilišta i škole.",
    "Prijava na Loomen sustav vrši se putem AAI@EduHr identiteta vaše institucije.",
    "Predavač može dodati kvizove, zadatke, forume i druge aktivnosti u tečaj.",
    "Polaznici mogu pristupiti materijalima tečaja kada god žele ako imaju pristup.",
    "Za tehničku podršku kontaktirajte helpdesk svoje obrazovne ustanove."
]

# Prepare unique IDs for each document
ids = [f"doc_{i}" for i in range(len(documents))]

# Prepare metadata for each document
metadatas = [
    {"category": "general", "topic": "platform", "language": "hr"},
    {"category": "authentication", "topic": "login", "language": "hr"},
    {"category": "features", "topic": "activities", "language": "hr"},
    {"category": "student", "topic": "access", "language": "hr"},
    {"category": "support", "topic": "help", "language": "hr"},
]

print("Documents prepared:")
for i, (doc_id, doc, meta) in enumerate(zip(ids, documents, metadatas), 1):
    print(f"\n[{i}] ID: {doc_id}")
    print(f"    Text: {doc}")
    print(f"    Metadata: {meta}")

Documents prepared:

[1] ID: doc_0
    Text: Loomen je sustav za e-učenje koji koriste hrvatska sveučilišta i škole.
    Metadata: {'category': 'general', 'topic': 'platform', 'language': 'hr'}

[2] ID: doc_1
    Text: Prijava na Loomen sustav vrši se putem AAI@EduHr identiteta vaše institucije.
    Metadata: {'category': 'authentication', 'topic': 'login', 'language': 'hr'}

[3] ID: doc_2
    Text: Predavač može dodati kvizove, zadatke, forume i druge aktivnosti u tečaj.
    Metadata: {'category': 'features', 'topic': 'activities', 'language': 'hr'}

[4] ID: doc_3
    Text: Polaznici mogu pristupiti materijalima tečaja kada god žele ako imaju pristup.
    Metadata: {'category': 'student', 'topic': 'access', 'language': 'hr'}

[5] ID: doc_4
    Text: Za tehničku podršku kontaktirajte helpdesk svoje obrazovne ustanove.
    Metadata: {'category': 'support', 'topic': 'help', 'language': 'hr'}


In [7]:
# Add documents to the collection
# ChromaDB will automatically:
# 1. Call our embedding function to generate embeddings
# 2. Store the embeddings in the vector index
# 3. Store the text and metadata

print("Adding documents to collection...")

collection.add(
    documents=documents,
    ids=ids,
    metadatas=metadatas
)

print(f"\n✓ Added {len(documents)} documents to collection")
print(f"  Collection now contains: {collection.count()} documents")

Adding documents to collection...

✓ Added 5 documents to collection
  Collection now contains: 5 documents
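
`add()` takes parallel lists, so mismatched lengths or repeated IDs are easy mistakes to make. A small pre-check like the sketch below can catch them before any embedding API call is made. `validate_batch` is a hypothetical helper, not part of ChromaDB.

```python
from collections import Counter
from typing import Dict, List, Optional

def validate_batch(ids: List[str], documents: List[str],
                   metadatas: Optional[List[Dict]] = None) -> None:
    """Raise ValueError if the parallel lists are inconsistent."""
    if len(ids) != len(documents):
        raise ValueError(f"{len(ids)} ids but {len(documents)} documents")
    if metadatas is not None and len(metadatas) != len(ids):
        raise ValueError(f"{len(ids)} ids but {len(metadatas)} metadatas")
    duplicates = sorted(i for i, n in Counter(ids).items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate ids: {duplicates}")

# A consistent batch passes silently
validate_batch(["doc_0", "doc_1"], ["first text", "second text"], [{}, {}])

# A batch with a repeated ID is rejected
try:
    validate_batch(["doc_0", "doc_0"], ["first text", "second text"])
except ValueError as err:
    print(err)
```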


## Part 4: Inspecting the Collection

Let's explore what's actually stored in ChromaDB.

In [8]:
# Get basic collection info
print(f"Collection name: {collection.name}")
print(f"Document count: {collection.count()}")
print(f"Metadata: {collection.metadata}")

Collection name: loomen_faq
Document count: 5
Metadata: {'description': 'Croatian Loomen FAQ documents'}


In [9]:
# Peek at the first few records (limit=5 happens to cover the whole collection)
# This shows the actual data structure
all_data = collection.peek(limit=5)

print("Collection data structure:")
print("=" * 80)

# ChromaDB returns data in this format:
# {
#   'ids': [...],
#   'embeddings': [...],  # May be None if not requested
#   'documents': [...],
#   'metadatas': [...]
# }

print(f"\nKeys in response: {all_data.keys()}")
print(f"\nNumber of documents: {len(all_data['ids'])}")
print(f"\nFirst document:")
print(f"  ID: {all_data['ids'][0]}")
print(f"  Document: {all_data['documents'][0]}")
print(f"  Metadata: {all_data['metadatas'][0]}")

# peek() includes embeddings; they may come back as a NumPy array rather than
# nested lists, so test with `is not None` and len() instead of plain truthiness.
if all_data['embeddings'] is not None and len(all_data['embeddings']) > 0:
    first_embedding = all_data['embeddings'][0]
    if first_embedding is not None and len(first_embedding) > 0:
        print(f"  Embedding dimensions: {len(first_embedding)}")
        print(f"  First 5 embedding values: {first_embedding[:5]}")
    else:
        print("  Embeddings are present but the first embedding is empty or None.")
else:
    print("  Embeddings are not included or are empty in the retrieved data.")

Collection data structure:

Keys in response: dict_keys(['ids', 'embeddings', 'documents', 'uris', 'included', 'data', 'metadatas'])

Number of documents: 5

First document:
  ID: doc_0
  Document: Loomen je sustav za e-učenje koji koriste hrvatska sveučilišta i škole.
  Metadata: {'language': 'hr', 'topic': 'platform', 'category': 'general'}
  Embedding dimensions: 1536
  First 5 embedding values: [-0.03513004  0.00947953 -0.00695102 -0.00995063  0.03111133]
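
The response is columnar: one list per field, aligned by position. When a record-per-item view is more convenient, the columns can be zipped into rows. The sketch below uses a hand-written sample dict shaped like the output above, and `to_rows` is a hypothetical helper name.

```python
def to_rows(result):
    """Turn a columnar get()/peek()-style dict into one dict per record."""
    ids = result["ids"]
    # A column may be missing or None if it was not requested
    documents = result.get("documents") or [None] * len(ids)
    metadatas = result.get("metadatas") or [None] * len(ids)
    return [
        {"id": i, "document": d, "metadata": m}
        for i, d, m in zip(ids, documents, metadatas)
    ]

sample = {  # hand-written sample, not real ChromaDB output
    "ids": ["doc_0", "doc_1"],
    "documents": ["first text", "second text"],
    "metadatas": [{"category": "general"}, {"category": "authentication"}],
}

for row in to_rows(sample):
    print(row)
```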


## Part 5: Querying Documents

The most important operation: semantic search!

In [10]:
# Query 1: Simple semantic search
query_text = "Kako se prijaviti u sustav?"

results = collection.query(
    query_texts=[query_text],  # Can query multiple texts at once
    n_results=3  # Return top 3 results
)

print(f"Query: '{query_text}'")
print("=" * 80)

# Results structure (every value is a nested list: one inner list per query):
# {
#   'ids': [[...]],
#   'distances': [[...]],  # Lower is closer (0 = identical vectors)
#   'documents': [[...]],
#   'metadatas': [[...]]
# }

for i, (doc_id, document, metadata, distance) in enumerate(
    zip(
        results['ids'][0],
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    ), 1
):
    # Rough similarity score (higher is better). Not bounded to 0-1: with
    # Chroma's default squared-L2 distance it goes negative once distance > 1.
    similarity = 1 - distance

    print(f"\n[{i}] ID: {doc_id}")
    print(f"    Distance: {distance:.4f} | Similarity: {similarity:.4f}")
    print(f"    Category: {metadata['category']} | Topic: {metadata['topic']}")
    print(f"    Document: {document}")

Query: 'Kako se prijaviti u sustav?'

[1] ID: doc_1
    Distance: 0.7823 | Similarity: 0.2177
    Category: authentication | Topic: login
    Document: Prijava na Loomen sustav vrši se putem AAI@EduHr identiteta vaše institucije.

[2] ID: doc_4
    Distance: 1.0108 | Similarity: -0.0108
    Category: support | Topic: help
    Document: Za tehničku podršku kontaktirajte helpdesk svoje obrazovne ustanove.

[3] ID: doc_0
    Distance: 1.1135 | Similarity: -0.1135
    Category: general | Topic: platform
    Document: Loomen je sustav za e-učenje koji koriste hrvatska sveučilišta i škole.
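
The negative "similarity" values above come from treating the distance as if it were a cosine distance. Assuming the collection uses Chroma's default squared-L2 metric and the embeddings are unit-length (both are assumptions worth checking for your setup), the distance relates to cosine similarity as `d = 2 - 2*cos`, so `cos = 1 - d/2`. A minimal sketch:

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_from_squared_l2(distance):
    """Valid only for unit-length vectors: d = 2 - 2*cos  =>  cos = 1 - d/2."""
    return 1.0 - distance / 2.0

# Check the identity on two hand-made unit vectors 60 degrees apart (cos = 0.5)
a = [1.0, 0.0]
b = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
d = squared_l2(a, b)
print(round(d, 4), round(cosine_from_squared_l2(d), 4))

# Applied to the distances printed above
for dist in (0.7823, 1.0108, 1.1135):
    print(f"distance={dist:.4f} -> cosine ~ {cosine_from_squared_l2(dist):.4f}")
```

Under these assumptions the conversion stays within -1 to 1, and the ranking of results is unchanged.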


In [11]:
# Query 2: Different question
query_text = "Što predavač može dodati u tečaj?"

results = collection.query(
    query_texts=[query_text],
    n_results=2
)

print(f"Query: '{query_text}'")
print("=" * 80)

for i, (doc_id, document, distance) in enumerate(
    zip(
        results['ids'][0],
        results['documents'][0],
        results['distances'][0]
    ), 1
):
    similarity = 1 - distance
    print(f"\n[{i}] Similarity: {similarity:.4f}")
    print(f"    {document}")

Query: 'Što predavač može dodati u tečaj?'

[1] Similarity: 0.4964
    Predavač može dodati kvizove, zadatke, forume i druge aktivnosti u tečaj.

[2] Similarity: 0.0592
    Polaznici mogu pristupiti materijalima tečaja kada god žele ako imaju pristup.
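
Because `query()` accepts several query texts at once, each field in the result holds one inner list per query. The sketch below flattens that nesting into one record per (query, hit) pair. `iter_hits` is a hypothetical helper, and the sample dict is hand-written with IDs and distances taken from the two outputs above (the second query's distances are derived from the printed similarities).

```python
def iter_hits(results, query_texts):
    """Yield one dict per (query, hit) pair from a query()-style nested result."""
    for q_idx, query in enumerate(query_texts):
        pairs = zip(results["ids"][q_idx], results["distances"][q_idx])
        for rank, (doc_id, distance) in enumerate(pairs, 1):
            yield {"query": query, "rank": rank, "id": doc_id, "distance": distance}

queries = ["Kako se prijaviti u sustav?", "Što predavač može dodati u tečaj?"]
sample_results = {  # hand-written sample shaped like a two-query result
    "ids": [["doc_1", "doc_4"], ["doc_2", "doc_3"]],
    "distances": [[0.7823, 1.0108], [0.5036, 0.9408]],
}

for hit in iter_hits(sample_results, queries):
    print(hit)
```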


## Part 6: Metadata Filtering

One of ChromaDB's most useful features: restricting a similarity search to documents whose metadata matches a filter.

In [12]:
# Query with metadata filter
# Only search within documents where category='features'

query_text = "Kako mogu raditi sa studentima?"

results = collection.query(
    query_texts=[query_text],
    n_results=3,
    where={"category": "features"}  # Filter by metadata
)

print(f"Query: '{query_text}'")
print(f"Filter: category='features'")
print("=" * 80)

for i, (document, metadata, distance) in enumerate(
    zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    ), 1
):
    print(f"\n[{i}] Distance: {distance:.4f}")
    print(f"    Category: {metadata['category']} ← (filtered)")
    print(f"    {document}")

Query: 'Kako mogu raditi sa studentima?'
Filter: category='features'

[1] Distance: 1.0245
    Category: features ← (filtered)
    Predavač može dodati kvizove, zadatke, forume i druge aktivnosti u tečaj.


In [13]:
# More complex filter: multiple conditions
# ChromaDB supports: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin, $and, $or

query_text = "pomoć"

results = collection.query(
    query_texts=[query_text],
    n_results=5,
    where={
        "$or": [
            {"category": "support"},
            {"category": "authentication"}
        ]
    }
)

print(f"Query: '{query_text}'")
print(f"Filter: category='support' OR category='authentication'")
print("=" * 80)

for i, (document, metadata, distance) in enumerate(
    zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0]
    ), 1
):
    print(f"\n[{i}] Category: {metadata['category']} | Distance: {distance:.4f}")
    print(f"    {document}")

Query: 'pomoć'
Filter: category='support' OR category='authentication'

[1] Category: support | Distance: 1.3168
    Za tehničku podršku kontaktirajte helpdesk svoje obrazovne ustanove.

[2] Category: authentication | Distance: 1.6047
    Prijava na Loomen sustav vrši se putem AAI@EduHr identiteta vaše institucije.
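
To build intuition for how `where` filters combine, here is a toy evaluator in plain Python. It is only an illustration of the operator semantics, not ChromaDB's implementation. `matches` is a hypothetical function, and treating a missing key as a non-match is a simplifying assumption.

```python
import operator

_COMPARATORS = {
    "$eq": operator.eq, "$ne": operator.ne,
    "$gt": operator.gt, "$gte": operator.ge,
    "$lt": operator.lt, "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}

def matches(metadata, where):
    """Return True if `metadata` satisfies a Chroma-style `where` dict (toy semantics)."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            if key not in metadata:  # assumption: missing key never matches
                return False
            # A bare value is shorthand for {"$eq": value}
            ops = condition if isinstance(condition, dict) else {"$eq": condition}
            for op, expected in ops.items():
                if not _COMPARATORS[op](metadata[key], expected):
                    return False
    return True

metadatas = [
    {"category": "general", "language": "hr"},
    {"category": "authentication", "language": "hr"},
    {"category": "features", "language": "hr"},
    {"category": "student", "language": "hr"},
    {"category": "support", "language": "hr"},
]

either = {"$or": [{"category": "support"}, {"category": "authentication"}]}
both = {"$and": [{"language": "hr"}, {"category": {"$in": ["features", "student"]}}]}

print([i for i, m in enumerate(metadatas) if matches(m, either)])
print([i for i, m in enumerate(metadatas) if matches(m, both)])
```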


## Part 7: CRUD Operations

ChromaDB supports full Create, Read, Update, Delete operations.

### Get Specific Documents by ID

In [14]:
# Get specific documents by their IDs
result = collection.get(
    ids=["doc_0", "doc_2"]
)

print("Retrieved documents by ID:")
print("=" * 80)

for doc_id, document, metadata in zip(
    result['ids'],
    result['documents'],
    result['metadatas']
):
    print(f"\nID: {doc_id}")
    print(f"Document: {document}")
    print(f"Metadata: {metadata}")

Retrieved documents by ID:

ID: doc_0
Document: Loomen je sustav za e-učenje koji koriste hrvatska sveučilišta i škole.
Metadata: {'language': 'hr', 'topic': 'platform', 'category': 'general'}

ID: doc_2
Document: Predavač može dodati kvizove, zadatke, forume i druge aktivnosti u tečaj.
Metadata: {'category': 'features', 'topic': 'activities', 'language': 'hr'}


### Update Documents

In [15]:
# Update a document (text and/or metadata)
collection.update(
    ids=["doc_0"],
    documents=["Loomen je napredni sustav za e-učenje koji koriste sve hrvatske obrazovne institucije."],
    metadatas=[{"category": "general", "topic": "platform", "language": "hr", "updated": True}]
)

# Retrieve updated document
updated = collection.get(ids=["doc_0"])

print("Updated document:")
print("=" * 80)
print(f"Document: {updated['documents'][0]}")
print(f"Metadata: {updated['metadatas'][0]}")

Updated document:
Document: Loomen je napredni sustav za e-učenje koji koriste sve hrvatske obrazovne institucije.
Metadata: {'updated': True, 'topic': 'platform', 'language': 'hr', 'category': 'general'}


### Add More Documents

In [16]:
# Add new documents
new_docs = [
    "Kviz u Loomen sustavu podržava različite vrste pitanja.",
    "Forum omogućuje komunikaciju između svih sudionika tečaja."
]

new_ids = ["doc_5", "doc_6"]

new_metadata = [
    {"category": "features", "topic": "quiz", "language": "hr"},
    {"category": "features", "topic": "forum", "language": "hr"}
]

collection.add(
    documents=new_docs,
    ids=new_ids,
    metadatas=new_metadata
)

print(f"✓ Added {len(new_docs)} new documents")
print(f"  Total documents in collection: {collection.count()}")

✓ Added 2 new documents
  Total documents in collection: 7


### Delete Documents

In [17]:
print(f"Documents before delete: {collection.count()}")

# Delete by ID
collection.delete(ids=["doc_6"])

print(f"Documents after delete: {collection.count()}")
print(f"✓ Deleted doc_6")

Documents before delete: 7
Documents after delete: 6
✓ Deleted doc_6


In [18]:
# Delete by metadata filter
# (Be careful with this - it deletes ALL matching documents!)

print(f"Documents before delete: {collection.count()}")

collection.delete(
    where={"category": "student"}
)

print(f"Documents after delete: {collection.count()}")
print(f"✓ Deleted all documents with category='student'")

Documents before delete: 6
Documents after delete: 5
✓ Deleted all documents with category='student'
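
Since a metadata delete removes every match, one cautious pattern is to look up the matching IDs with `get(where=...)` first and delete only after reviewing them. The sketch below assumes only that the collection object exposes `get()` and `delete()` as used in this notebook. `delete_where` and `FakeCollection` are hypothetical names, and the fake exists only so the example runs without ChromaDB.

```python
def delete_where(collection, where, dry_run=True):
    """Look up matching IDs first; delete them only when dry_run is False."""
    matching_ids = collection.get(where=where)["ids"]
    if not dry_run and matching_ids:
        collection.delete(ids=matching_ids)
    return matching_ids


class FakeCollection:
    """Minimal stand-in for a collection (supports equality filters only)."""

    def __init__(self, records):
        self.records = dict(records)  # id -> metadata

    def get(self, where=None):
        ids = [i for i, meta in self.records.items()
               if all(meta.get(k) == v for k, v in (where or {}).items())]
        return {"ids": ids}

    def delete(self, ids):
        for i in ids:
            self.records.pop(i, None)

    def count(self):
        return len(self.records)


fake = FakeCollection({
    "doc_0": {"category": "general"},
    "doc_3": {"category": "student"},
})

preview = delete_where(fake, {"category": "student"})            # nothing deleted yet
print(preview, fake.count())
deleted = delete_where(fake, {"category": "student"}, dry_run=False)
print(deleted, fake.count())
```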


## Part 8: Working with Multiple Collections

In [19]:
# Create a second collection
collection2 = client.create_collection(
    name="user_questions",
    embedding_function=embedding_function,
    metadata={"description": "User submitted questions"}
)

# Add some data
collection2.add(
    documents=[
        "Kako mogu promijeniti lozinku?",
        "Zašto ne mogu pristupiti tečaju?"
    ],
    ids=["q1", "q2"],
    metadatas=[
        {"type": "technical", "status": "open"},
        {"type": "access", "status": "open"}
    ]
)

print(f"✓ Created second collection: '{collection2.name}'")
print(f"  Documents: {collection2.count()}")

✓ Created second collection: 'user_questions'
  Documents: 2


In [20]:
# List all collections
all_collections = client.list_collections()

print("All collections in database:")
print("=" * 80)

for col in all_collections:
    print(f"\nName: {col.name}")
    print(f"ID: {col.id}")
    print(f"Count: {col.count()}")
    print(f"Metadata: {col.metadata}")

All collections in database:

Name: user_questions
ID: 05cad8a5-df7d-4455-a946-728053f018ae
Count: 2
Metadata: {'description': 'User submitted questions'}

Name: loomen_faq
ID: d46bdeaf-dc91-42af-90f2-5ee613f4352c
Count: 5
Metadata: {'description': 'Croatian Loomen FAQ documents'}


In [21]:
# Get a collection by name
retrieved_collection = client.get_collection(name="loomen_faq")

print(f"Retrieved collection: '{retrieved_collection.name}'")
print(f"Documents: {retrieved_collection.count()}")

Retrieved collection: 'loomen_faq'
Documents: 5


In [22]:
# Delete a collection
client.delete_collection(name="user_questions")

print("✓ Deleted 'user_questions' collection")
print(f"\nRemaining collections: {[c.name for c in client.list_collections()]}")

✓ Deleted 'user_questions' collection

Remaining collections: ['loomen_faq']


## Part 9: Understanding the Data Flow

Let's visualize what happens when you add and query documents.

In [25]:
print("""
ADDING DOCUMENTS TO CHROMADB:
═══════════════════════════════════════════════════════════════════════

1. Your document text: "Loomen je sustav za e-učenje..."
   ↓
2. ChromaDB calls embedding_function(["Loomen je sustav..."])
   ↓
3. OpenRouter API generates embedding: [0.123, 0.456, ...] (1536 dims)
   ↓
4. ChromaDB stores:
   {
     id: "doc_0",
     document: "Loomen je sustav za e-učenje...",
     embedding: [0.123, 0.456, ...],
     metadata: {category: "general", ...}
   }
   ↓
5. Embedding indexed in vector space for fast similarity search


QUERYING DOCUMENTS:
═══════════════════════════════════════════════════════════════════════

1. Your query: "Kako se prijaviti?"
   ↓
2. ChromaDB generates query embedding: [0.789, 0.234, ...]
   ↓
3. Vector similarity search finds closest embeddings
   (using approximate nearest neighbors algorithm)
   ↓
4. Optional: Apply metadata filters
   ↓
5. Return top-k results with:
   - Original documents
   - Metadata
   - Distance/similarity scores
""")


ADDING DOCUMENTS TO CHROMADB:
═══════════════════════════════════════════════════════════════════════

1. Your document text: "Loomen je sustav za e-učenje..."
   ↓
2. ChromaDB calls embedding_function(["Loomen je sustav..."])
   ↓
3. OpenRouter API generates embedding: [0.123, 0.456, ...] (1536 dims)
   ↓
4. ChromaDB stores:
   {
     id: "doc_0",
     document: "Loomen je sustav za e-učenje...",
     embedding: [0.123, 0.456, ...],
     metadata: {category: "general", ...}
   }
   ↓
5. Embedding indexed in vector space for fast similarity search


QUERYING DOCUMENTS:
═══════════════════════════════════════════════════════════════════════

1. Your query: "Kako se prijaviti?"
   ↓
2. ChromaDB generates query embedding: [0.789, 0.234, ...]
   ↓
3. Vector similarity search finds closest embeddings
   (using approximate nearest neighbors algorithm)
   ↓
4. Optional: Apply metadata filters
   ↓
5. Return top-k results with:
   - Original documents
   - Metadata
   - Distance/similarity scores

## Summary and Key Takeaways

### What We Learned

1. **ChromaDB Structure**
   - Client → Collections → Documents
   - Each document has: ID, text, embedding, metadata

2. **Core Operations**
   - `create_collection()` / `get_or_create_collection()` - Create a collection (or reuse an existing one)
   - `add()` - Add documents
   - `query()` - Semantic search
   - `get()` - Retrieve by ID
   - `update()` - Modify documents
   - `delete()` - Remove documents

3. **Powerful Features**
   - Automatic embedding generation
   - Metadata filtering
   - Efficient similarity search
   - Multiple collections

4. **Storage Modes**
   - In-memory: `chromadb.Client()` (used in this notebook)
   - Persistent: `chromadb.PersistentClient(path="./db")` (data is written to disk and survives restarts)

### Limitations of In-Memory Mode

- ❌ Data lost when kernel restarts
- ❌ Cannot share data between sessions
- ✅ Perfect for learning and experimentation
- ✅ Fast and simple

### Next Steps

In the next notebook, we'll:
- Build a complete RAG system
- Combine ChromaDB with LLM generation
- Use persistent storage
- Process longer documents with chunking
- Create an intelligent Q&A system

## Exercises (Optional)

Try these to practice:

1. Add 10 more documents to the collection with diverse metadata
2. Create complex metadata filters using `$and`, `$or`, `$in`
3. Build a function that returns documents only if similarity > threshold
4. Create multiple collections for different document types
5. Implement a simple "duplicate detection" using similarity scores

In [24]:
# Your experiments here!
