# 📚 ChromaDB Basics

## Overview
This notebook introduces **ChromaDB**, an open-source vector database designed for AI applications. You'll learn the fundamentals of storing and querying documents using embeddings.

## What You'll Learn
- How to install and initialize ChromaDB
- Creating in-memory vs. persistent clients
- Creating and managing collections
- Adding documents (with automatic embedding generation)
- Performing similarity queries
- Using custom embedding functions

## Prerequisites
```bash
pip install chromadb sentence-transformers
```

---

In [1]:
import chromadb

## 1. In-Memory Client

An **in-memory client** stores all data in RAM. Data is **lost when the session ends**.  
Best for: Quick experiments, testing, and learning.

In [2]:
client = chromadb.Client()

In [3]:
client.create_collection(name="news")


Collection(name=news)

### Creating a Collection

A **collection** is like a table in a traditional database. It stores documents and their embeddings.

- Each collection has a unique name
- Unless you pass an `embedding_function`, ChromaDB generates embeddings with its default model (`all-MiniLM-L6-v2`)

In [4]:
collection = client.get_collection(name="news")

In [5]:
collection.add(
    documents=[
        "Crazy AI tools are taking over the world!",
        "AI agents are the next big thing in tech." 
    ],
    ids=["id1", "id2"]
)

### Adding Documents

When you add documents to a collection:
1. ChromaDB automatically converts text → embeddings (vectors)
2. Each document needs a unique ID
3. Embeddings enable semantic similarity search
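Since every document needs a unique ID, one option is to derive the ID from the document text, so that re-adding the same text produces the same ID. The sketch below is only an illustration: `make_doc_id` and the `doc-` prefix are hypothetical names, not part of ChromaDB, and hand-written IDs like `"id1"` work just as well for small experiments.

```python
import hashlib


def make_doc_id(text, prefix="doc"):
    """Return a stable ID built from a SHA-256 hash of the text (hypothetical helper)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    # Keep only the first 12 hex characters to keep IDs short and readable.
    return f"{prefix}-{digest[:12]}"


documents = [
    "Crazy AI tools are taking over the world!",
    "AI agents are the next big thing in tech.",
]
ids = [make_doc_id(doc) for doc in documents]

# The same text always maps to the same ID; different texts get different IDs.
print(ids)
```

The resulting `ids` list could then be passed as the `ids` argument of `collection.add(...)` alongside `documents`.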

In [6]:
objects = collection.peek()
len(objects['embeddings'][0])

384

### Inspecting Embeddings

Use `peek()` to view the first few stored items, including their embeddings.  
The vector length reflects the embedding model: the default model produces 384-dimensional vectors.
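To make "comparing embeddings" concrete, here is a minimal sketch using toy 3-dimensional vectors in place of the real 384-dimensional ones. It assumes the collection uses squared L2 distance (believed to be Chroma's default) and that embeddings are normalized to unit length; both are assumptions for illustration, and the helper names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    """Squared Euclidean distance: smaller means more similar."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy stand-ins for embeddings (not produced by any real model).
query = normalize([1.0, 0.2, 0.0])
doc_close = normalize([0.9, 0.3, 0.1])
doc_far = normalize([0.0, 0.1, 1.0])

for name, doc in [("close", doc_close), ("far", doc_far)]:
    d = squared_l2(query, doc)
    c = cosine_similarity(query, doc)
    # For unit vectors, squared L2 distance equals 2 - 2 * cosine similarity.
    print(f"{name}: distance={d:.3f} cosine={c:.3f} check={2 - 2 * c:.3f}")
```

Under these assumptions, a distance near 0 means nearly identical direction, and a distance near 2 means the vectors are unrelated (orthogonal).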

In [7]:
results = collection.query(
    query_texts=["Lovable is a good website to create softwares"],
    n_results=2
)
results

{'ids': [['id2', 'id1']],
 'embeddings': None,
 'documents': [['AI agents are the next big thing in tech.',
   'Crazy AI tools are taking over the world!']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[None, None]],
 'distances': [[1.555487871170044, 1.5718331336975098]]}

### Querying for Similar Documents

The `query()` method finds documents most similar to your query text:
- `query_texts`: Your search query (converted to embedding automatically)
- `n_results`: Number of results to return
- Results are ranked by distance, with the smallest distance (most similar document) first
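The result is a dictionary of nested lists: the outer list has one entry per query text, and the inner list has one entry per hit. A small sketch of flattening it into rows follows; the dictionary is copied from the output shown above (trimmed to three keys), and `flatten_hits` is a hypothetical helper, not a ChromaDB function.

```python
# Subset of the query output shown above.
results = {
    "ids": [["id2", "id1"]],
    "documents": [[
        "AI agents are the next big thing in tech.",
        "Crazy AI tools are taking over the world!",
    ]],
    "distances": [[1.555487871170044, 1.5718331336975098]],
}


def flatten_hits(results, query_index=0):
    """Zip the parallel lists for one query into a list of per-hit dicts."""
    return [
        {"id": doc_id, "document": document, "distance": distance}
        for doc_id, document, distance in zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["distances"][query_index],
        )
    ]


hits = flatten_hits(results)
for hit in hits:
    print(f"{hit['distance']:.3f}  {hit['id']}  {hit['document']}")
```

When several strings are passed in `query_texts`, `query_index` selects which query's hits to read.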

In [8]:
results = collection.query(
    query_texts=["Hydra is the best RP game"],
    n_results=2
)
results

{'ids': [['id2', 'id1']],
 'embeddings': None,
 'documents': [['AI agents are the next big thing in tech.',
   'Crazy AI tools are taking over the world!']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[None, None]],
 'distances': [[1.6278023719787598, 1.6739156246185303]]}

---

## 2. Persistent Client

A **persistent client** saves data to disk, so it survives after the session ends.  
Best for: Production applications, larger datasets, data that needs to persist.

In [9]:
clientp = chromadb.PersistentClient(path="./chromadb_data")

In [14]:
from chromadb.utils import embedding_functions

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L12-v2"
)
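# get_or_create_collection avoids an "already exists" error when this cell
# is re-run, since the persistent client keeps "news_v2" on disk.
collection = clientp.get_or_create_collection(
    name="news_v2",
    embedding_function=ef
)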

### Custom Embedding Functions

You can use different embedding models for better results:
- **SentenceTransformers** provides many pre-trained models
- `all-MiniLM-L12-v2` is a good balance of speed and quality
- Different models can produce different embedding dimensions (`all-MiniLM-L12-v2` happens to output 384, the same as the default model)

**Why use custom embeddings?**
- Better semantic understanding for specific domains
- Trade-off between speed and accuracy
- Consistent embeddings across your application
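Conceptually, an embedding function is a callable that takes a list of strings and returns one vector per string. The toy class below sketches that shape with a hashing bag-of-words embedder. It is an assumption-laden illustration: `HashingEmbedder` is a hypothetical name, it captures only keyword overlap rather than meaning, and a real integration would typically implement ChromaDB's own `EmbeddingFunction` interface, whose exact requirements vary by version.

```python
import hashlib
import math


class HashingEmbedder:
    """Toy embedder: hashes each lowercase token into one of `dim` buckets."""

    def __init__(self, dim=16):
        self.dim = dim

    def __call__(self, input):
        # `input` is a list of strings; return one vector per string.
        return [self._embed(text) for text in input]

    def _embed(self, text):
        vec = [0.0] * self.dim
        for token in text.lower().split():
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            bucket = int.from_bytes(digest[:4], "big") % self.dim
            vec[bucket] += 1.0
        # Normalize to unit length so distances are comparable across texts.
        norm = math.sqrt(sum(x * x for x in vec))
        return [x / norm for x in vec] if norm else vec


embedder = HashingEmbedder(dim=16)
vectors = embedder([
    "Apple reported its quarterly earnings today.",
    "Apple has a lot of vitamin A",
])
print(len(vectors), len(vectors[0]))
```

Because the same text always maps to the same vector, this also illustrates the "consistent embeddings" point: documents and queries must be embedded by the same function for distances to be meaningful.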

In [15]:
collection.add(
    documents=[
        "Apple reported its quarterly earnings today.",
        "Apple has a lot of vitamin A"
    ],
    ids=["id1", "id2"]
)

### Testing Semantic Search

The two documents added above share the word "Apple" but use it in different contexts:
- One about Apple Inc. (company)
- One about apple (fruit)

A query about an iPhone release should rank the company sentence first, which shows how semantic search relies on **meaning, not just keywords**.

In [17]:
results = collection.query(
    query_texts=["Apple releases new iPhone"],
    n_results=2
)
results

{'ids': [['id1', 'id2']],
 'embeddings': None,
 'documents': [['Apple reported its quarterly earnings today.',
   'Apple has a lot of vitamin A']],
 'uris': None,
 'included': ['metadatas', 'documents', 'distances'],
 'data': None,
 'metadatas': [[None, None]],
 'distances': [[0.7136677503585815, 0.785379946231842]]}

In [18]:
clientp.heartbeat()

1767536877873768980

### Health Check

`heartbeat()` returns the current time as a nanosecond timestamp; a successful call confirms the client is responsive.
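The value printed above can be turned into a readable date. This sketch assumes the number is nanoseconds since the Unix epoch, which its magnitude suggests; the variable names are illustrative.

```python
from datetime import datetime, timezone

# Value copied from the heartbeat() output above.
heartbeat_ns = 1767536877873768980

# Integer-divide by 1e9 to get whole seconds, then build a UTC datetime.
heartbeat_time = datetime.fromtimestamp(heartbeat_ns // 1_000_000_000, tz=timezone.utc)
print(heartbeat_time.isoformat())
```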