
---

## 🧠 What is ChromaDB?

**ChromaDB** is an open-source **vector database** built for **AI-native applications**. It stores **embeddings** (vector representations of data like text, images, etc.) along with **metadata**, and allows **fast similarity search**.

It’s one of the most **developer-friendly vector stores**, with **local persistence**, **simple APIs**, and **tight LangChain integration**.

---

## 🎯 Purpose of ChromaDB in GenAI Pipelines

* Converts text into **embeddings** using a model (like OpenAI).
* Stores those vectors with **associated metadata and documents**.
* Allows **semantic search** (similarity-based retrieval).
* Serves as the **retriever backend** for **RAG** (Retrieval-Augmented Generation).

✅ It's used when you want a **fast, local, minimal-setup** solution for storing and querying vectorized documents.
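The embed → store → search flow above can be sketched without any library. This is a toy illustration only: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `ToyVectorStore` is not how ChromaDB is implemented.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    def __init__(self):
        self.records = []  # each record: (vector, document text, metadata)

    def add(self, text, metadata):
        self.records.append((toy_embed(text), text, metadata))

    def query(self, text, k=2):
        q = toy_embed(text)
        ranked = sorted(self.records, key=lambda r: cosine(q, r[0]), reverse=True)
        return [(doc, meta) for _, doc, meta in ranked[:k]]

store = ToyVectorStore()
store.add("quantum computing uses qubits", {"source": "physics.txt"})
store.add("sourdough bread needs a starter", {"source": "baking.txt"})
store.add("qubits can be entangled", {"source": "physics.txt"})

top = store.query("what are qubits", k=2)
for doc, meta in top:
    print(meta["source"], "|", doc)
```

A real pipeline swaps `toy_embed` for a learned embedding model, which is what makes the search semantic rather than keyword-based.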

---

## ⚖️ ChromaDB vs. FAISS – Key Differences

| Feature          | **ChromaDB**                      | **FAISS**                             |
| ---------------- | --------------------------------- | ------------------------------------- |
| Type             | Vector **database**               | Vector **index/search library**       |
| Persistence      | Built-in                          | Manual (must save to disk explicitly) |
| Metadata Support | Native support                    | Manual (requires parallel storage)    |
| Filters          | Yes (built-in metadata filtering) | No (requires custom code)             |
| Setup            | Easy (single pip install)         | More control, but manual              |
| Use Case         | Ideal for prototyping, local RAG  | High performance, large-scale apps    |

✅ If you want **fast dev loop + metadata + persistence** → use ChromaDB
✅ If you want **fine-grained control over indexing and maximum search speed** → use FAISS
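To make the "manual metadata" rows of the table concrete, here is a rough sketch of what an index-only library leaves to you. `index_search` is a hypothetical pure-Python stand-in, not the FAISS API; the point is that the index returns only row positions, so metadata lives in a parallel list and filtering happens in your own code.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def index_search(vectors, query, k):
    """Stand-in for an index-only library: returns row positions, nothing else."""
    order = sorted(range(len(vectors)), key=lambda i: sq_dist(vectors[i], query))
    return order[:k]

# The index knows only the vectors...
vectors = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
# ...so metadata must be kept in a parallel list, aligned by position.
metadata = [{"source": "a.txt"}, {"source": "b.txt"}, {"source": "a.txt"}]

def search_with_filter(query, source, k):
    # Over-fetch all positions, then filter by the parallel metadata list.
    positions = index_search(vectors, query, k=len(vectors))
    hits = [i for i in positions if metadata[i]["source"] == source]
    return hits[:k]

print(search_with_filter([0.0, 1.0], "a.txt", k=2))
```

A vector database folds both the parallel storage and the filter step into the query itself.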

---

## 📌 Important Parameters of ChromaDB Vector Store

When initializing or using ChromaDB via LangChain:

```python
Chroma(
    collection_name="my_docs",
    embedding_function=OpenAIEmbeddings(),
    persist_directory="db"
)
```

### Key Parameters:

| Parameter            | Description                                     |
| -------------------- | ----------------------------------------------- |
| `collection_name`    | Name of the ChromaDB collection                 |
| `embedding_function` | Embedding model used to encode texts            |
| `persist_directory`  | Local directory to save DB for reuse            |
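| `client_settings`    | Optional `chromadb.config.Settings` object for configuring the client |
| `collection_metadata` | Collection-level key-value pairs; per-document metadata like `{"source": "file1.txt"}` travels with each `Document` |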

---

## 🧮 Similarity and Score Feature in ChromaDB

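* ChromaDB ranks results with a **distance metric**: **squared L2** by default, with **cosine** and **inner product** available as per-collection settings.
* LangChain's `similarity_search_with_score` returns that **distance**, so **lower means more similar** (0 is an identical vector); it is not a 0-to-1 similarity.
* During retrieval, you get tuples of `(Document, score)`.

```python
# Scored search lives on the vector store, not on the retriever object
docs = vectordb.similarity_search_with_score("quantum computing", k=3)
for doc, score in docs:
    print(score)  # a distance: smaller values are closer matches
```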

✅ Scores help with **ranking**, **thresholding**, or **reranking with LLMs**.
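Thresholding can be sketched as a small helper over `(text, score)` pairs. Whether a smaller or larger score is better depends on the metric in use, so the `lower_is_better` flag, the helper name, and the sample scores below are all assumptions for illustration.

```python
def keep_close_matches(scored, threshold, lower_is_better=True):
    """Filter (text, score) pairs by a threshold and return them best-first."""
    if lower_is_better:
        kept = [pair for pair in scored if pair[1] <= threshold]
    else:
        kept = [pair for pair in scored if pair[1] >= threshold]
    # Best-first: ascending for distances, descending for similarities.
    return sorted(kept, key=lambda pair: pair[1], reverse=not lower_is_better)

# Hypothetical results, shaped like the tuples from a scored search.
scored = [
    ("Sourdough needs a starter.", 1.42),
    ("Qubits store quantum state.", 0.21),
    ("Quantum gates act on qubits.", 0.35),
]

close = keep_close_matches(scored, threshold=0.5)
for text, score in close:
    print(f"{score:.2f}  {text}")
```

The cut-off itself has to be tuned per embedding model and metric; there is no universal value.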

---

## 🔄 How to Use ChromaDB as a Retriever — Full Example

Here’s a full LangChain pipeline:

### 📦 Step 1: Install dependencies

```bash
pip install chromadb langchain openai tiktoken
```

---

### 📘 Step 2: Load and Split Documents

```python
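# Import paths for older LangChain releases; newer ones expose these as
# langchain_community.document_loaders and langchain_text_splitters.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter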

loader = TextLoader("your_text.txt")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
```

---

### 🧠 Step 3: Create Embeddings and Store in ChromaDB

```python
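# Import paths for older LangChain releases; newer ones use
# langchain_openai (OpenAIEmbeddings) and langchain_community.vectorstores (Chroma).
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma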

embedding_fn = OpenAIEmbeddings()

vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_fn,
    persist_directory="chroma_db",
    collection_name="my_collection"
)

vectordb.persist()  # Needed on chromadb < 0.4; newer versions write to disk automatically
```

---

### 🔍 Step 4: Query ChromaDB

```python
retriever = vectordb.as_retriever(search_type="similarity", search_kwargs={"k": 3})

results = retriever.get_relevant_documents("What is quantum physics?")

for doc in results:
    print(doc.page_content)
```

---

## 💾 Saving and Loading ChromaDB

### ✅ Saving (already done with `.persist()`):

```python
vectordb.persist()  # Explicit flush for chromadb < 0.4; deprecated or removed on newer versions
```

### 🔄 Loading from disk later:

```python
vectordb = Chroma(
    persist_directory="chroma_db",
    embedding_function=OpenAIEmbeddings(),
    collection_name="my_collection"
)
```

You can now continue retrieving documents or adding new ones.

---

## 🧠 Important Parameters in Local Load of ChromaDB

```python
Chroma(
    persist_directory="chroma_db",
    collection_name="my_collection",
    embedding_function=OpenAIEmbeddings()
)
```

| Parameter            | Purpose                                  |
| -------------------- | ---------------------------------------- |
| `persist_directory`  | Path to local folder storing the DB      |
| `collection_name`    | Ensure same name as when saved           |
| `embedding_function` | Required to match embeddings and queries |

⚠️ Make sure the same `embedding_function` is used to **maintain embedding space consistency**!

---

## 💼 Important Questions

1. **What are the trade-offs between ChromaDB and FAISS?**
2. **How does ChromaDB handle similarity search?**
3. **Can you filter documents in ChromaDB by metadata?**
4. **How would you reload an existing ChromaDB vector store?**
5. **How does ChromaDB ensure persistent local storage?**
6. **What happens if you use different embedding models for query and documents?**
7. **How would you update or delete documents from ChromaDB?**
8. **How is chunking important before inserting into ChromaDB?**

---



### 🔹 **1. What are the trade-offs between ChromaDB and FAISS?**

**Answer:**

| Feature            | ChromaDB                                     | FAISS                                        |
| ------------------ | -------------------------------------------- | -------------------------------------------- |
| Type               | Vector database with persistence             | Vector search library                        |
| Metadata           | Native support for metadata                  | No native metadata support                   |
| Filtering          | Supports filtering via metadata              | Not natively supported                       |
| Storage            | Built-in local persistence (`persist_directory`) | Must manually serialize and deserialize      |
| API Simplicity     | Very developer-friendly (higher-level APIs)  | Requires more boilerplate and setup          |
| Scale              | Best for local, dev-scale workloads          | Optimized for large-scale, production search |
| Distributed Search | Not yet supported (single-node)              | Requires extensions to support distributed   |

**Trade-off Summary:**
ChromaDB is **great for local dev and RAG prototyping**, but FAISS offers **better control and raw performance** at scale.

---

### 🔹 **2. How does ChromaDB handle similarity search?**

**Answer:**

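ChromaDB performs **vector similarity search** by comparing the query embedding with the stored embeddings under a distance metric. The default is **squared L2 distance**; **cosine** and **inner product** can be selected per collection. Cosine compares the angle between two high-dimensional vectors (query and document), which makes it insensitive to vector length.

* Scores are **distances**, so smaller is closer. Cosine distance is `1 - cosine similarity` and therefore lies in [0, 2].
* You query with an embedding vector and Chroma returns documents ranked from closest to farthest.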

```python
retriever = vectordb.as_retriever()
results = retriever.get_relevant_documents("What is AI?")
```

It uses **approximate nearest neighbor (ANN)** methods under the hood for performance.
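The relationship between cosine similarity and distance-style scores can be checked with a few lines of plain Python. This is a worked illustration with made-up vectors, not ChromaDB's internal code: cosine ignores vector length, and for unit-length vectors squared L2 distance equals `2 - 2 * cosine similarity`, so both give the same ranking.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

query = [1.0, 2.0, 0.0]
doc_same_direction = [2.0, 4.0, 0.0]  # same angle, twice the length
doc_orthogonal = [0.0, 0.0, 3.0]      # shares no direction with the query
other = [2.0, 1.0, 0.0]

# Scale invariance: only the angle matters for cosine similarity.
print(cosine_similarity(query, doc_same_direction))
print(cosine_similarity(query, doc_orthogonal))

# For unit vectors, squared L2 distance and cosine similarity are tied together.
u, v = normalize(query), normalize(other)
lhs = squared_l2(u, v)
rhs = 2 - 2 * cosine_similarity(u, v)
print(lhs, rhs)
```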

---

### 🔹 **3. Can you filter documents in ChromaDB by metadata?**

**Answer: Yes.**

ChromaDB supports **metadata filtering** on vector search:

```python
retriever = vectordb.as_retriever(
    search_kwargs={
        "k": 3,
        "filter": {"source": "notes.txt"}
    }
)
```

This combines **semantic search with metadata filtering** in a single query. It's very useful when you have many documents with different contexts (e.g., source, author, topic).
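Conceptually, a filtered search narrows the candidates by metadata and then ranks what is left by distance. The sketch below shows that idea in plain Python with exact-match filters only; the function names and sample records are hypothetical, and ChromaDB's own filter syntax and execution strategy are not reproduced here.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def matches(metadata, where):
    """True when every key/value pair in the filter appears in the metadata."""
    return all(metadata.get(key) == value for key, value in where.items())

def filtered_search(records, query_vector, where, k=3):
    # Step 1: keep only records whose metadata satisfies the filter.
    candidates = [r for r in records if matches(r["metadata"], where)]
    # Step 2: rank the survivors by distance to the query (closest first).
    candidates.sort(key=lambda r: squared_distance(r["vector"], query_vector))
    return candidates[:k]

records = [
    {"text": "Qubit basics", "vector": [0.9, 0.1], "metadata": {"source": "notes.txt"}},
    {"text": "Quantum gates", "vector": [0.8, 0.2], "metadata": {"source": "slides.pdf"}},
    {"text": "Bread recipe", "vector": [0.1, 0.9], "metadata": {"source": "notes.txt"}},
]

hits = filtered_search(records, query_vector=[1.0, 0.0], where={"source": "notes.txt"}, k=3)
print([hit["text"] for hit in hits])
```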

---

### 🔹 **4. How would you reload an existing ChromaDB vector store?**

**Answer:**

To reload an existing ChromaDB vector store (e.g., across app runs):

```python
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

vectordb = Chroma(
    persist_directory="chroma_db",
    collection_name="my_collection",
    embedding_function=OpenAIEmbeddings()
)
```

❗ `collection_name` and `embedding_function` **must match** the ones used during persistence.

---

### 🔹 **5. How does ChromaDB ensure persistent local storage?**

**Answer:**

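ChromaDB persists data locally under `persist_directory`: a **SQLite** file holds collection configuration, documents, and metadata, and the vector index is stored in binary files alongside it.

* With chromadb 0.4 and later, writes are saved automatically; on older versions, calling `.persist()` flushes them. Either way, what ends up on disk is:

  * Collection config
  * Document metadata
  * Vector embeddings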

You can specify the location using `persist_directory="my_folder"`.

This enables you to reuse the vector store across sessions or deployments.
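The general idea of SQLite-backed persistence can be sketched with the standard library. The table layout, file name, and helper functions here are assumptions for illustration and do not mirror ChromaDB's actual on-disk schema.

```python
import json
import os
import sqlite3
import tempfile

def save_records(db_path, records):
    """Write ids, vectors, and metadata to a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits the transaction on success
            conn.execute(
                "CREATE TABLE IF NOT EXISTS records "
                "(id TEXT PRIMARY KEY, vector TEXT, metadata TEXT)"
            )
            conn.executemany(
                "INSERT OR REPLACE INTO records VALUES (?, ?, ?)",
                [(r["id"], json.dumps(r["vector"]), json.dumps(r["metadata"])) for r in records],
            )
    finally:
        conn.close()

def load_records(db_path):
    """Read everything back, as a later session would."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute("SELECT id, vector, metadata FROM records ORDER BY id").fetchall()
    finally:
        conn.close()
    return [{"id": i, "vector": json.loads(v), "metadata": json.loads(m)} for i, v, m in rows]

persist_dir = tempfile.mkdtemp()  # plays the role of persist_directory
db_path = os.path.join(persist_dir, "toy_store.sqlite3")

records = [
    {"id": "doc-1", "vector": [0.1, 0.9], "metadata": {"source": "notes.txt"}},
    {"id": "doc-2", "vector": [0.8, 0.2], "metadata": {"source": "slides.pdf"}},
]

save_records(db_path, records)
reloaded = load_records(db_path)
print(reloaded == records)
```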

---

### 🔹 **6. What happens if you use different embedding models for query and documents?**

**Answer:**

❌ It will **break the semantic consistency**.

* Each embedding model has its own **vector space**.
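* If your documents were embedded using `OpenAI` and you query using `HuggingFace`, the similarity scores won't be meaningful, and if the two models produce vectors of different dimensions the query will typically fail outright.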

**Best Practice:** Always use the **same embedding function** (same model, version, and tokenizer) for indexing and querying.
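One way to enforce this is to record which model built the index and reject queries embedded with anything else. The `GuardedIndex` class and the model names below are hypothetical; this is a sketch of the guard, not a ChromaDB or LangChain feature.

```python
class GuardedIndex:
    """Remembers which embedding model and dimension the stored vectors use."""

    def __init__(self, model_name, dimension):
        self.model_name = model_name
        self.dimension = dimension
        self.vectors = []

    def add(self, vector):
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")
        self.vectors.append(vector)

    def check_query(self, model_name, vector):
        # Same dimension is not enough: two models can share a size
        # and still place texts in unrelated vector spaces.
        if model_name != self.model_name:
            raise ValueError(f"index built with {self.model_name!r}, query uses {model_name!r}")
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")
        return True

index = GuardedIndex(model_name="model-a", dimension=3)
index.add([0.1, 0.2, 0.3])

print(index.check_query("model-a", [0.3, 0.2, 0.1]))

try:
    index.check_query("model-b", [0.3, 0.2, 0.1])
except ValueError as err:
    print("rejected:", err)
```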

---

### 🔹 **7. How would you update or delete documents from ChromaDB?**

**Answer:**

LangChain's Chroma wrapper supports adding documents and also exposes ID-based `delete(ids=[...])` and `update_document(document_id, document)`.

To delete by metadata instead of by ID, you can drop down to the raw ChromaDB API:

```python
# Use Chroma's lower-level client
import chromadb
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_collection("my_collection")

collection.delete(
    where={"source": "notes.txt"}  # metadata filter
)
```

✅ For updates: delete + re-insert the document, or call the collection's `update()` / `upsert()` with the same ID.
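The delete-by-metadata and replace-by-ID semantics can be illustrated with a plain dictionary. Everything here (`toy_embed`, `upsert`, `delete_where`, the sample IDs) is a hypothetical stand-in; the detail worth noticing is that an update has to recompute the embedding, not just swap the text.

```python
def toy_embed(text):
    """Hypothetical embedding: [character count, word count]."""
    return [len(text), len(text.split())]

store = {}  # doc_id -> record

def upsert(doc_id, text, metadata):
    # Insert or overwrite; the vector is recomputed from the new text.
    store[doc_id] = {"text": text, "vector": toy_embed(text), "metadata": metadata}

def delete_where(where):
    """Delete every record whose metadata matches all filter pairs."""
    doomed = [
        doc_id for doc_id, rec in store.items()
        if all(rec["metadata"].get(k) == v for k, v in where.items())
    ]
    for doc_id in doomed:
        del store[doc_id]
    return doomed

upsert("doc-1", "Old notes on qubits", {"source": "notes.txt"})
upsert("doc-2", "Slides on gates", {"source": "slides.pdf"})

# Update: same ID, new text, fresh embedding.
upsert("doc-1", "Revised notes on qubits and gates", {"source": "notes.txt"})

removed = delete_where({"source": "slides.pdf"})
print(removed, list(store))
```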

---

### 🔹 **8. How is chunking important before inserting into ChromaDB?**

**Answer:**

**Chunking (using text splitters)** helps in:

* Keeping each chunk **small enough to embed meaningfully** (e.g., 500 characters, which is what `chunk_size=500` measures by default).
* Improving **semantic retrieval** by narrowing the scope of each document.
* Ensuring the LLM receives **focused, context-rich content** during RAG.

Tools like `RecursiveCharacterTextSplitter` try to split at **logical boundaries** first (paragraphs → lines → words), falling back to raw characters only when needed, which improves chunk quality.

Example:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

chunks = splitter.split_documents(docs)
```
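To see what `chunk_size` and `chunk_overlap` do, here is a minimal fixed-width sliding-window chunker in plain Python. It is a simplification: unlike `RecursiveCharacterTextSplitter`, it cuts at exact character offsets and makes no attempt to respect paragraph or word boundaries. The function name is hypothetical.

```python
def chunk_text(text, chunk_size=500, chunk_overlap=50):
    """Split text into character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

Each chunk begins with the last `chunk_overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when it is shorter than the overlap.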

---
