**ChromaDB** is an open-source vector database that stores embeddings (numerical representations of data) for semantic search and AI applications.
It’s lightweight, easy to use in Python/Colab, and often used in RAG pipelines with LLMs.

In [None]:
!pip install chromadb


**Create a DB**

In [None]:
import chromadb
chroma_client = chromadb.Client()

**Create a Collection**

In [None]:
collection = chroma_client.create_collection(name="my_collection")

**Add some text documents to the collection**

In [None]:
collection.add(documents=["This is a document about pineapple. This is a documents about oranges", "Welcome to NLP it is one of the most exciting research areas as today we will see how to work."],
               ids=["id1", "id2"])

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:00<00:00, 104MiB/s]


**Query the collection**

In [None]:
results = collection.query(
    query_texts=["This is a query document about hawaii"],  # Chroma will embed this
    n_results=2,  # How many results to return
)

print(results)

{'ids': [['id1', 'id2']], 'embeddings': None, 'documents': [['This is a document about pineapple. This is a documents about oranges', 'Welcome to NLP it is one of the most exciting research areas as today we will see how to work.']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[None, None]], 'distances': [[1.0523101091384888, 1.6716597080230713]]}


In [None]:
results = collection.query(
    query_texts=["This is a query document about oranges"],  # Chroma will embed this
    n_results=2,  # How many results to return
)

print(results)

{'ids': [['id1', 'id2']], 'embeddings': None, 'documents': [['This is a document about pineapple. This is a documents about oranges', 'Welcome to NLP it is one of the most exciting research areas as today we will see how to work.']], 'uris': None, 'included': ['metadatas', 'documents', 'distances'], 'data': None, 'metadatas': [[None, None]], 'distances': [[0.5729079842567444, 1.596559762954712]]}
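
The nested lists in the result can be awkward to read. Below is a minimal sketch that pairs each id with its document and distance, assuming the result layout printed above (one inner list per query text). `flatten_results` is a hypothetical helper, not part of Chroma, and the document strings are abbreviated copies of the output above.

```python
def flatten_results(results, query_index=0):
    """Pair each returned id with its document and distance for one query."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    dists = results["distances"][query_index]
    return list(zip(ids, docs, dists))


# Abbreviated copy of the result printed above for the "oranges" query.
results = {
    "ids": [["id1", "id2"]],
    "documents": [["This is a document about pineapple. ...", "Welcome to NLP ..."]],
    "distances": [[0.5729079842567444, 1.596559762954712]],
}

for doc_id, doc, dist in flatten_results(results):
    print(f"{doc_id}  distance={dist:.3f}  {doc[:40]}")
```

A smaller distance means a closer match: the distance for `id1` drops from about 1.05 for the "hawaii" query to about 0.57 for the "oranges" query, because the document mentions oranges directly.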


**Persist the data**

In [None]:
client = chromadb.PersistentClient(path="./db")

In [None]:
client.heartbeat()  # Returns a nanosecond heartbeat; useful for checking that the client is still connected.

1757493940972152450

In [None]:
# client.reset()  # Admin operation: empties the database; requires allow_reset=True in Settings

In [None]:
from chromadb.config import DEFAULT_TENANT, DEFAULT_DATABASE, Settings

In [None]:
DEFAULT_TENANT, DEFAULT_DATABASE

('default_tenant', 'default_database')

In [None]:
client = chromadb.PersistentClient(
    path="./db1",
    settings=Settings(
        allow_reset=True,
        anonymized_telemetry=False
    )
)

In [None]:
client.reset()

True

# Creating, inspecting, and deleting collections
Chroma uses collection names in the URL, so there are a few restrictions on naming them:

- The length of the name must be between 3 and 63 characters.
- The name must start and end with a lowercase letter or a digit, and it can contain dots, dashes, and underscores in between.
- The name must not contain two consecutive dots.
- The name must not be a valid IP address.
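
The rules above can be checked before calling `create_collection`. The sketch below encodes them exactly as listed, treating the "in between" characters as lowercase letters, digits, dots, dashes, and underscores. `is_valid_collection_name` is a hypothetical helper; Chroma performs its own validation, which may differ between versions.

```python
import ipaddress
import re

# First and last character: lowercase letter or digit.
# In between: lowercase letters, digits, dots, dashes, underscores.
# Total length: 3 to 63 characters.
_NAME_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def is_valid_collection_name(name: str) -> bool:
    """Return True if `name` satisfies the naming rules listed above."""
    if not _NAME_RE.fullmatch(name):
        return False
    if ".." in name:  # no two consecutive dots
        return False
    try:
        ipaddress.ip_address(name)
    except ValueError:
        return True  # not an IP address, so the name is acceptable
    return False


for candidate in ["my_collection", "ab", "My_collection", "a..b", "192.168.0.1"]:
    print(candidate, is_valid_collection_name(candidate))
```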

Chroma collections are created with a name and an optional embedding function. If we supply an embedding function, we must supply it every time we get the collection.

In [None]:
from chromadb.utils import embedding_functions

# Create an embedding function
ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")
len(ef(["foo"])[0])

384

**Create Collection**

In [None]:
collection = client.create_collection(name="my_collection1", embedding_function=ef)

**Get Collection**

In [None]:
collection = client.get_collection(name="my_collection1", embedding_function=ef)

**Delete Collection**

In [None]:
client.delete_collection(name="my_collection1")

**If you are not sure whether a collection exists or needs to be created:**

In [None]:
collection = client.get_or_create_collection(name="some_collection")

In [None]:
collection

Collection(name=some_collection)

**Rename Collection**

In [None]:
collection.modify(name="new_name")

In [None]:
collection

Collection(name=new_name)

**Add Documents:**

In [None]:
collection.add(documents=["some2","doc1","doc2"],
                metadatas=[{"chapter":"3", "verse":"16"},{"chapter":"3", "verse":"5"},{"chapter":"29", "verse":"11"}],
                ids=["id1", "id2", "id3"])

**How Chroma Handles Documents**

1. **Automatic Embedding**

- If you pass a list of documents to a collection, Chroma will tokenize and embed them using the collection’s embedding function.

- If no custom embedding function is supplied, the default embedding function will be used.

- The original documents themselves are also stored.

2. **Large Documents**

- If a document is too large to embed with the chosen embedding function, Chroma will raise an exception.

3. **Unique IDs**

- Each document must have a unique ID.

- If you try to .add() the same ID twice, only the first version will be stored (the later one is ignored).

4. **Metadata Support**

- You can optionally supply a list of metadata dictionaries (one per document).

- This metadata can store extra information and is useful for filtering queries later.

5. **Supplying Precomputed Embeddings**

- Instead of letting Chroma embed the documents, you can directly supply a list of precomputed embeddings.

- In that case, Chroma will store the documents with those embeddings and skip embedding itself.
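
The points above suggest a few sanity checks worth running before calling `.add()`. The sketch below is one way to write them; `check_add_batch` is a hypothetical helper rather than a Chroma function, and Chroma applies its own validation on top of this.

```python
from typing import Optional, Sequence


def check_add_batch(
    ids: Sequence[str],
    documents: Sequence[str],
    metadatas: Optional[Sequence[dict]] = None,
    embeddings: Optional[Sequence[Sequence[float]]] = None,
) -> None:
    """Raise ValueError if the parallel lists meant for .add() are inconsistent."""
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    if len(documents) != len(ids):
        raise ValueError("need exactly one document per id")
    if metadatas is not None and len(metadatas) != len(ids):
        raise ValueError("need exactly one metadata dict per id")
    if embeddings is not None:
        if len(embeddings) != len(ids):
            raise ValueError("need exactly one embedding per id")
        if len({len(e) for e in embeddings}) > 1:
            raise ValueError("all embeddings must have the same dimension")


# Passes silently: three ids, three documents, three metadata dicts.
check_add_batch(
    ids=["id21", "id31", "id41"],
    documents=["doc1", "doc2", "doc3"],
    metadatas=[{"chapter": "3"}, {"chapter": "29"}, {"chapter": "10"}],
)
```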

In [None]:
collection.add(
    ids=["id21", "id31", "id41"],
    documents=["doc1", "doc2", "doc3"],
    # embeddings=[
    #     [1.1, 2.3, 3.2],
    #     [4.5, 6.9, 4.4],
    #     [1.1, 2.3, 3.2]
    # ],
    metadatas=[
        {"chapter": "3", "verse": "16"},
        {"chapter": "29", "verse": "5"},
        {"chapter": "10", "verse": "11"}
    ]
)


In [None]:
!pip install scikit-learn



In [None]:
from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

In [None]:
# Initialize lists
doc_list = []
meta_list = []

# Example: create 100 documents from the word 'data'
for i in range(100):
    doc = f"data {i}"               # document text
    doc_list.append(doc)            # add to documents list
    meta_list.append({"len_of_doc": len(doc)})  # add metadata

In [None]:
collection.add(
    documents=doc_list,
    metadatas=meta_list,
    ids=[f"id_{i}" for i in range(100)]
)

In [None]:
collection.peek(2)

{'ids': ['id1', 'id2'],
 'embeddings': array([[-4.56306823e-02, -5.02210744e-02,  3.52436560e-03,
         -1.69078596e-02, -4.80551040e-03, -7.47122169e-02,
          1.35098726e-01,  5.47372922e-03, -4.60059149e-03,
          3.47462147e-02, -6.79754559e-03, -3.07896789e-02,
          2.13009752e-02,  2.53816508e-02,  6.54068440e-02,
          8.01216215e-02,  1.65005792e-02, -5.41628664e-03,
         -1.11458004e-01, -4.90530320e-02, -1.27690598e-01,
         -1.02593470e-02,  1.70116685e-03,  4.96229436e-03,
         -8.18195194e-02,  1.40507771e-02, -3.51107270e-02,
          8.01423118e-02,  4.01367955e-02, -4.33623344e-02,
         -4.30454835e-02, -6.16974151e-03, -9.65422019e-02,
          3.27749556e-04, -1.92344189e-02, -5.31340614e-02,
          3.29108462e-02,  2.04190351e-02, -4.69161160e-02,
          3.28324549e-02, -5.15381210e-02, -1.55513538e-02,
          5.57391765e-03,  3.70124285e-03,  7.19886925e-03,
         -4.41214330e-02,  1.71457548e-02, -3.80477570e-02,
  

**Get number of documents in a collection**

In [None]:
collection.count()

106

In [None]:
collection

Collection(name=new_name)

**Define an alternate distance function**

In [None]:
collection = client.create_collection(name="my_collection1", metadata={"hnsw:space": "cosine"})
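
To make the choice of space concrete, here is a plain-Python sketch of the two distances involved. It assumes Chroma's documented definitions: the default `l2` space is understood to use squared Euclidean distance, and `cosine` uses one minus cosine similarity. The function names are illustrative only.

```python
import math


def squared_l2(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, different length
c = [0.0, 1.0]  # orthogonal to a

print("a vs b:", squared_l2(a, b), cosine_distance(a, b))
print("a vs c:", squared_l2(a, c), cosine_distance(a, c))
```

Cosine distance ignores vector length, so `a` and `b` are at distance zero under `cosine` but not under `l2`. For unit-length embeddings the two spaces rank results in the same order.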

**Query the collection**

In [None]:
results = collection.query(
    query_embeddings=[
        [11.1, 12.1, 13.1],   # first query embedding
        [1.1, 2.3, 3.2]       # second query embedding
    ],
    n_results=3,  # top 3 most similar documents
    where={"chapter": "3"},  # filter by metadata field
    where_document={"$contains": "search_string"}  # filter by text content
)


In [None]:
collection.get(include=["documents"])

{'ids': [],
 'embeddings': None,
 'documents': [],
 'uris': None,
 'included': ['documents'],
 'data': None,
 'metadatas': None}

In [None]:
collection.query(query_embeddings=[
        [11.1, 12.1, 13.1],   # first query embedding
        [1.1, 2.3, 3.2]       # second query embedding
    ],
                 include=["documents"])

{'ids': [[], []],
 'embeddings': None,
 'documents': [[], []],
 'uris': None,
 'included': ['documents'],
 'data': None,
 'metadatas': None,
 'distances': None}

**Using Where filters**

Chroma supports filtering queries by metadata and document contents. The `where` filter is used to filter by metadata, and the `where_document` filter is used to filter by document contents.

**Filtering by metadata**

In order to filter on metadata, you must supply a where filter dictionary to the query. The dictionary must have the following structure:

{ "metadata_field": { <operator>: <value> } }

Filtering metadata supports the following operators:

- $eq – equal to (string, int, float)

- $ne – not equal to (string, int, float)

- $gt – greater than (int, float)

- $gte – greater than or equal to (int, float)

- $lt – less than (int, float)

- $lte – less than or equal to (int, float)
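
To illustrate what these operators mean, the sketch below evaluates a single-field filter against one metadata dictionary in plain Python. It is not Chroma's implementation: `field_matches` is a hypothetical helper, and treating a missing field as "no match" is an assumption of this sketch.

```python
import operator

# Map each operator name from the list above to a Python comparison.
_OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def field_matches(metadata: dict, where: dict) -> bool:
    """Evaluate a single-field filter such as {"len_of_doc": {"$gte": 7}}."""
    (field, condition), = where.items()
    if not isinstance(condition, dict):
        condition = {"$eq": condition}  # shorthand form: {"chapter": "3"}
    (op, expected), = condition.items()
    if field not in metadata:
        return False  # assumption: a missing field never matches
    return _OPS[op](metadata[field], expected)


print(field_matches({"len_of_doc": 7}, {"len_of_doc": {"$gte": 7}}))
print(field_matches({"len_of_doc": 6}, {"len_of_doc": {"$gt": 6}}))
print(field_matches({"chapter": "3", "verse": "16"}, {"chapter": "3"}))
```

Note that the earlier cells store `chapter` and `verse` as strings, so range operators such as `$gt` would only be usable on them if they were stored as numbers instead.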

**Filtering for a string search**

{ "$contains": "search_string" }

**Using Logical Operators in Chroma**

- $and → returns results that match all of the filters in the list.

where = {
    "$and": [
        {"metadata_field1": {"$eq": "value1"}},
        {"metadata_field2": {"$gte": 10}}
    ]
}

- $or → returns results that match any of the filters in the list.

where = {
    "$or": [
        {"metadata_field1": {"$eq": "value1"}},
        {"metadata_field2": {"$lt": 5}}
    ]
}
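
Writing nested filter dictionaries by hand is error-prone, so a few small builder functions can help. This is only a sketch: `eq`, `gte`, `all_of`, and `any_of` are hypothetical helpers, not Chroma functions, and the two-filter minimum for logical operators is an assumption about Chroma's validation.

```python
def eq(field, value):
    """Build {"field": {"$eq": value}}."""
    return {field: {"$eq": value}}


def gte(field, value):
    """Build {"field": {"$gte": value}}."""
    return {field: {"$gte": value}}


def _combine(op, filters):
    # Assumption: logical operators need at least two sub-filters.
    if len(filters) < 2:
        raise ValueError(f"{op} needs at least two filters")
    return {op: list(filters)}


def all_of(*filters):
    """Match records that satisfy every filter."""
    return _combine("$and", filters)


def any_of(*filters):
    """Match records that satisfy at least one filter."""
    return _combine("$or", filters)


where = all_of(eq("chapter", "3"), gte("len_of_doc", 7))
print(where)
```

The resulting dictionary can be passed as the `where` argument of `collection.query()` or `collection.get()`.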

# Update

In [None]:
collection.update(
    ids=["id1", "id2", "id3"],                  # IDs of documents to update
    documents=["doc1", "doc2", "doc3"],         # New document texts
    embeddings=[
        [1.1, 2.3, 3.2],
        [4.5, 6.9, 4.4],
        [1.1, 2.3, 3.2]
    ],                                           # New embeddings
    metadatas=[
        {"chapter": "3", "verse": "16"},
        {"chapter": "29", "verse": "5"},
        {"chapter": "3", "verse": "9"}
    ]                                            # Updated metadata
)


**Upsert: update if found, add if not found**

In [None]:
collection.upsert(
    ids=["id1", "id2", "id3"],                  # IDs to update, or to add if missing
    documents=["doc1", "doc2", "doc3"],         # New document texts
    embeddings=[
        [1.1, 2.3, 3.2],
        [4.5, 6.9, 4.4],
        [1.1, 2.3, 3.2]
    ],                                           # Updated embeddings
    metadatas=[
        {"chapter": "3", "verse": "16"},
        {"chapter": "29", "verse": "5"},
        {"chapter": "3", "verse": "9"}
    ]                                            # Updated metadata
)

**Deleting data from a collection**

Chroma supports deleting items from a collection by ID using .delete(). The embeddings, documents, and metadata associated with each item will also be deleted.

⚠️ Naturally, this is a destructive operation and cannot be undone.

In [None]:
collection.delete(
    ids=["id1"],                  # IDs of documents to delete
    where={"chapter": "20"}       # Optional: delete based on metadata filter
)

In [None]:
# Get all items in the collection
all_items = collection.get()
print("All IDs in collection:", all_items['ids'])

# Check if the deleted ID is missing
if "id1" not in all_items['ids']:
    print("id1 was successfully deleted.")
else:
    print("id1 still exists.")


All IDs in collection: []
id1 was successfully deleted.


In [None]:
# Install ChromaDB (local vector DB) and Sentence-Transformers (for embeddings)
!pip install chromadb sentence-transformers


In [None]:
# Import ChromaDB client (manages the vector database)
import chromadb

# Import SentenceTransformer for creating text embeddings
from sentence_transformers import SentenceTransformer

In [None]:
# Initialize the ChromaDB client
# This creates a local in-memory database (like starting a mini DB in Colab)
client = chromadb.Client()

In [None]:
# Create a new collection named "test"
# A collection is like a "table" in databases where vectors + documents are stored
collection = client.create_collection("test")

In [None]:
# Load a pre-trained sentence embedding model
# This model converts text into numerical vectors
model = SentenceTransformer("all-MiniLM-L6-v2")

In [None]:
# Create some example documents
docs = [
    "AI is transforming the world",
    "Sachin is a cricket legend",
    "Python is great for data science"
]

# Convert each document into a numerical vector (embedding)
# These embeddings capture the meaning/semantics of the text
embeddings = model.encode(docs).tolist()

# Add documents + their embeddings into the Chroma collection
# 'ids' are unique identifiers (like primary keys in a database)
collection.add(
    documents=docs,
    embeddings=embeddings,
    ids=[str(i) for i in range(len(docs))]
)

In [None]:
# Define a search query
query = "Tell me about cricket"

# Convert the query into an embedding (same model as docs)
q_emb = model.encode([query]).tolist()

# Search in Chroma: find top 2 most similar documents to the query
results = collection.query(query_embeddings=q_emb, n_results=2)

# Print the query and its best-matching documents
print("Query:", query)
print("Results:", results["documents"])

Query: Tell me about cricket
Results: [['Sachin is a cricket legend', 'AI is transforming the world']]
