# ChromaDB Docker Setup — Connectivity & Smoke Tests

**Prerequisites:**
1. Run `docker compose up -d chromadb` (or `docker compose up -d` for all services)
2. Wait for the health check to pass; `docker compose ps` should report the container as healthy
3. The ChromaDB server should then be reachable at **localhost:8001**
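
Step 2 can also be scripted rather than checked by eye. The sketch below polls a readiness probe until it succeeds or a deadline passes. The helper names `http_probe` and `wait_until_ready` are hypothetical, and the probe assumes the same `/api/v2/heartbeat` path that section 1 calls.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx response.
        return False


def wait_until_ready(
    probe: Callable[[], bool],
    timeout: float = 30.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Example call (needs the container to be running, so it is left commented out):
# ready = wait_until_ready(lambda: http_probe("http://localhost:8001/api/v2/heartbeat"))
```

Passing the probe and the sleep function in as arguments keeps the polling loop testable without a live server.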

In [16]:
import chromadb
import requests

CHROMA_HOST = "localhost"
CHROMA_PORT = 8001

print(f"ChromaDB Python client version: {chromadb.__version__}")

ChromaDB Python client version: 1.0.15


## 1. Raw HTTP Health Check
Verify the ChromaDB server is reachable before using the Python client.

In [17]:
base_url = f"http://{CHROMA_HOST}:{CHROMA_PORT}"

try:
    resp = requests.get(f"{base_url}/api/v2/heartbeat", timeout=5)
    print(f"Heartbeat endpoint: {resp.status_code} — {resp.json()}")
except requests.exceptions.ConnectionError:
    print("ERROR: Cannot reach ChromaDB server. Is the container running?")
    print("  Run: docker compose up -d chromadb")
    raise

try:
    resp = requests.get(f"{base_url}/api/v2/version", timeout=5)
    print(f"Server version:     {resp.status_code} — {resp.json()}")
except Exception as e:
    print(f"Version endpoint not available (non-critical): {e}")

Heartbeat endpoint: 200 — {'nanosecond heartbeat': 1771479316705924690}
Server version:     200 — 1.0.0
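
The heartbeat value is easier to sanity-check as a timestamp. The sketch below assumes the number is nanoseconds since the Unix epoch, which its magnitude suggests; `heartbeat_to_datetime` is a hypothetical helper, not part of the ChromaDB client.

```python
from datetime import datetime, timezone


def heartbeat_to_datetime(ns: int) -> datetime:
    """Convert a nanosecond epoch value to a timezone-aware UTC datetime."""
    seconds, nanos = divmod(ns, 1_000_000_000)
    # datetime resolves microseconds only, so the last three digits are dropped.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=nanos // 1000
    )


ts = heartbeat_to_datetime(1771479316705924690)  # value from the output above
print(ts.isoformat())  # 2026-02-19T05:35:16.705924+00:00
```

A timestamp far from the current time would point to clock drift between the host and the container.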


## 2. Python Client Connection

In [18]:
client = chromadb.HttpClient(host=CHROMA_HOST, port=CHROMA_PORT)

heartbeat = client.heartbeat()
print(f"Client heartbeat: {heartbeat}")

version = client.get_version()
print(f"Server version:   {version}")

print("\nPython client connected to ChromaDB server successfully!")

Client heartbeat: 1771479317481408565
Server version:   1.0.0

Python client connected to ChromaDB server successfully!


## 3. Collection CRUD & Document Operations

In [19]:
TEST_COLLECTION = "docker_smoke_test"

# Clean slate
existing = [c.name for c in client.list_collections()]
if TEST_COLLECTION in existing:
    client.delete_collection(TEST_COLLECTION)
    print(f"Deleted pre-existing '{TEST_COLLECTION}' collection")

collection = client.get_or_create_collection(name=TEST_COLLECTION)
print(f"Created collection: {collection.name}")
print(f"Initial count:      {collection.count()}")

Deleted pre-existing 'docker_smoke_test' collection
Created collection: docker_smoke_test
Initial count:      0


In [20]:
# Provide dummy embeddings to avoid ChromaDB's default ONNX embedding function,
# which has DLL issues on some Windows setups. Real embeddings come from
# Google/OpenAI in the LangChain integration test (section 5).
import random
random.seed(42)
EMBED_DIM = 384

def dummy_embedding():
    return [random.random() for _ in range(EMBED_DIM)]

collection.add(
    ids=["lineage-1", "lineage-2", "lineage-3"],
    documents=[
        "source_table: orders, source_column: order_id, target_table: fact_orders, target_column: order_key, dependency_score: 0.9",
        "source_table: customers, source_column: customer_id, target_table: dim_customers, target_column: customer_key, dependency_score: 0.85",
        "source_table: products, source_column: product_id, target_table: dim_products, target_column: product_key, dependency_score: 0.75",
    ],
    embeddings=[dummy_embedding(), dummy_embedding(), dummy_embedding()],
    metadatas=[
        {"source": "orders", "target": "fact_orders", "score": 0.9},
        {"source": "customers", "target": "dim_customers", "score": 0.85},
        {"source": "products", "target": "dim_products", "score": 0.75},
    ],
)
print(f"Documents added. Collection count: {collection.count()}")

Documents added. Collection count: 3
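
`collection.add` takes four parallel lists, so a mismatch in length or embedding dimension is easy to introduce when the batch is built by hand. The sketch below checks a batch before it is sent. `check_batch` is a hypothetical helper, and rejecting duplicate ids is a choice made for this example.

```python
from typing import Any, Mapping, Sequence


def check_batch(
    ids: Sequence[str],
    documents: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    metadatas: Sequence[Mapping[str, Any]],
) -> int:
    """Validate the parallel lists and return the shared embedding dimension."""
    if not ids:
        raise ValueError("batch is empty")
    lengths = {len(ids), len(documents), len(embeddings), len(metadatas)}
    if len(lengths) != 1:
        raise ValueError("ids, documents, embeddings and metadatas must have equal length")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique within a batch")
    dims = {len(vec) for vec in embeddings}
    if len(dims) != 1:
        raise ValueError(f"embeddings have inconsistent dimensions: {sorted(dims)}")
    return dims.pop()


dim = check_batch(
    ids=["lineage-1", "lineage-2"],
    documents=["doc one", "doc two"],
    embeddings=[[0.0] * 384, [1.0] * 384],
    metadatas=[{"score": 0.9}, {"score": 0.85}],
)
print(f"batch OK, embedding dimension: {dim}")
```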


In [21]:
results = collection.query(
    query_embeddings=[dummy_embedding()],
    n_results=2,
)

print("Query (using dummy embedding vector):")
print("-" * 50)
for i, doc in enumerate(results["documents"][0]):
    print(f"  Result {i+1}: {doc}")
    print(f"    Distance: {results['distances'][0][i]:.4f}")
    print(f"    Metadata: {results['metadatas'][0][i]}")
    print()

Query (using dummy embedding vector):
--------------------------------------------------
  Result 1: source_table: products, source_column: product_id, target_table: dim_products, target_column: product_key, dependency_score: 0.75
    Distance: 57.4429
    Metadata: {'target': 'dim_products', 'score': 0.75, 'source': 'products'}

  Result 2: source_table: customers, source_column: customer_id, target_table: dim_customers, target_column: customer_key, dependency_score: 0.85
    Distance: 60.1241
    Metadata: {'target': 'dim_customers', 'source': 'customers', 'score': 0.85}
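
Distances in the high 50s can look alarming, but they are what random 384-dimensional vectors produce if the collection uses squared L2 distance. That is my understanding of ChromaDB's default metric; treat it as an assumption and check the collection configuration if it matters. The sketch below derives the expected value and samples it. None of it touches the server.

```python
import random


def squared_l2(a, b):
    """Sum of squared coordinate differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def random_vector(rng, dim):
    """A vector of `dim` independent uniform [0, 1) coordinates."""
    return [rng.random() for _ in range(dim)]


DIM = 384
rng = random.Random(0)  # local generator; does not touch the notebook's global seed

# For independent uniform [0, 1) coordinates, E[(x - y)^2] = 1/6,
# so the expected squared distance between two random vectors is DIM / 6.
expected = DIM / 6

trials = 200
mean_distance = sum(
    squared_l2(random_vector(rng, DIM), random_vector(rng, DIM))
    for _ in range(trials)
) / trials

print(f"expected: {expected:.1f}, sampled mean over {trials} pairs: {mean_distance:.1f}")
```

The two results above are the nearest of three candidates, so values somewhat below the expected 64 are plausible.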



In [22]:
!uv pip install onnxruntime

Audited 1 package in 11ms


## 4. Persistence Check
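Verify that the documents are still present when the collection is re-fetched from the server by name. This confirms the data lives server-side rather than in the local `collection` object; it does not exercise a container restart.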

In [23]:
# Re-fetch the same collection from the server
refetched = client.get_collection(name=TEST_COLLECTION)
assert refetched.count() == 3, f"Expected 3 docs, got {refetched.count()}"
print(f"Persistence check passed — {refetched.count()} documents retained.")

Persistence check passed — 3 documents retained.
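
A stronger check repeats the count after the container has been restarted. The sketch below wraps the count assertion so it can run against a fresh client. `verify_count` is a hypothetical helper, and the restart steps in the comments assume the compose service is named `chromadb` and mounts a volume for its data.

```python
def verify_count(client, name: str, expected: int) -> int:
    """Fetch `name` through `client` and check its document count."""
    actual = client.get_collection(name=name).count()
    if actual != expected:
        raise AssertionError(f"{name!r}: expected {expected} docs, got {actual}")
    return actual


# Intended use, left as comments because it needs a running server:
#   1. In a shell: docker compose restart chromadb
#   2. fresh = chromadb.HttpClient(host=CHROMA_HOST, port=CHROMA_PORT)
#   3. verify_count(fresh, TEST_COLLECTION, 3)
```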


## 5. LangChain + ChromaDB Server Integration
This section tests the `langchain_community.vectorstores.Chroma` wrapper against an `HttpClient`,
the pattern that `vector_db.py` needs in Docker mode.

In [24]:
import os
from langchain_community.vectorstores import Chroma
# Document is defined in langchain_core; langchain.schema is a legacy import path.
from langchain_core.documents import Document

GOOGLE_API_KEY = os.getenv("GOOGLE_API_KEY")

if not GOOGLE_API_KEY:
    print("GOOGLE_API_KEY not set — skipping LangChain embedding test.")
    print("Set the env var and re-run this cell to test the full pipeline.")
else:
    from langchain_google_genai import GoogleGenerativeAIEmbeddings

    embedding_fn = GoogleGenerativeAIEmbeddings(
        model="models/gemini-embedding-001",
        google_api_key=GOOGLE_API_KEY,
    )

    LANGCHAIN_COLLECTION = "langchain_docker_test"

    chroma_client = chromadb.HttpClient(host=CHROMA_HOST, port=CHROMA_PORT)

    db = Chroma(
        collection_name=LANGCHAIN_COLLECTION,
        client=chroma_client,
        embedding_function=embedding_fn,
    )

    docs = [
        Document(
            page_content="source_database: prod_tz\nsource_schema: edw_staging\nsource_table: allocationrule\nsource_column: payrollbasis\ntarget_table: dimsharedservicesallocationrule",
            metadata={"type": "lineage", "score": 1.0},
        ),
        Document(
            page_content="source_database: db1\nsource_table: orders\nsource_column: order_id\ntarget_table: fact_orders\ntarget_column: order_key",
            metadata={"type": "lineage", "score": 0.85},
        ),
    ]
    db.add_documents(docs)

    results = db.similarity_search("downstream impact of orders", k=2)
    print("LangChain similarity_search results:")
    for r in results:
        print(f"  - {r.page_content[:80]}...")
        print(f"    metadata: {r.metadata}")

    # Cleanup
    chroma_client.delete_collection(LANGCHAIN_COLLECTION)
    print("\nLangChain + ChromaDB server integration test PASSED!")

LangChain similarity_search results:
  - source_database: db1
source_table: orders
source_column: order_id
target_table: ...
    metadata: {'type': 'lineage', 'score': 0.85}
  - source_database: prod_tz
source_schema: edw_staging
source_table: allocationrule...
    metadata: {'score': 1.0, 'type': 'lineage'}

LangChain + ChromaDB server integration test PASSED!
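
The result previews above break across lines because `page_content` contains newlines. A small helper can flatten the text before truncating it. `preview` is a hypothetical name for this sketch.

```python
def preview(text: str, width: int = 80) -> str:
    """Collapse whitespace runs (including newlines) and truncate to `width` characters."""
    flat = " ".join(text.split())
    return flat if len(flat) <= width else flat[: width - 3] + "..."


sample = "source_database: db1\nsource_table: orders\nsource_column: order_id"
print(preview(sample, width=40))
```

In the loop above, `preview(r.page_content)` would replace `r.page_content[:80]` and keep each result on one line.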


## 6. Cleanup

In [25]:
client.delete_collection(TEST_COLLECTION)
print(f"Cleaned up '{TEST_COLLECTION}' collection.")

remaining = [c.name for c in client.list_collections()]
print(f"Remaining collections: {remaining}")

Cleaned up 'docker_smoke_test' collection.
Remaining collections: []
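
If a cell fails midway, the cleanup cell never runs and the test collection is left behind. One way to make cleanup unconditional is a context manager. This is a sketch: `temporary_collection` is a hypothetical helper that relies only on the `get_or_create_collection` and `delete_collection` calls already used in this notebook.

```python
from contextlib import contextmanager


@contextmanager
def temporary_collection(client, name: str):
    """Yield a collection and delete it on exit, even if the body raises."""
    collection = client.get_or_create_collection(name=name)
    try:
        yield collection
    finally:
        client.delete_collection(name)


# Example use against the notebook's client (needs a running server):
# with temporary_collection(client, "docker_smoke_test") as collection:
#     print(collection.count())
```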


## Summary

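| Test | How to verify |
|------|---------------|
| HTTP health check | Section 1: the heartbeat endpoint returns 200 |
| Python client connection | Section 2: `client.heartbeat()` and `client.get_version()` return without error |
| Collection CRUD | Section 3: the collection is created with count 0; section 6 deletes it |
| Document add & query | Section 3: the count reaches 3 and the query returns 2 results |
| Persistence | Section 4: the re-fetched collection still holds 3 documents |
| LangChain integration | Section 5: `similarity_search` returns the inserted documents (requires `GOOGLE_API_KEY`) |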