In [1]:
import chromadb
from pprint import pprint

# Create a document collection. A collection stores embeddings, documents,
# and any additional metadata.

# chromadb.Client() creates an in-memory client; data is lost when the process exits.
client = chromadb.Client()

# Minimal form, which uses the default distance metric:
# collection = client.create_collection("all-my-documents")

collection = client.create_collection(
    name="all-my-documents",
    metadata={"hnsw:space": "cosine"},  # optional; selects cosine distance for the index
)

### 1. Embedding

Embedding converts documents or text into vectors (numerical representations) that can be compared in a similarity search. In this notebook, Chroma handles embedding automatically when we add documents to the collection.

In [2]:
# Chroma stores the text and handles tokenization, embedding, and indexing automatically.
# It also stores the documents themselves. If a document is too large to embed
# with the chosen embedding function, an exception is raised.
#
# By default, Chroma uses the Sentence Transformers all-MiniLM-L6-v2 model to create embeddings.
collection.add(
    documents=[
        "This is a document about food",
        "This is a document about animal's food",
        "This is a document about cats and dogs",
    ],
    metadatas=[{"topic": "food"}, {"topic": "animal"}, {"topic": "animal"}],
    ids=["doc1", "docs2", "doc3"],
)

When collection.add is called, Chroma passes each document through its default embedding function (the Sentence Transformers all-MiniLM-L6-v2 model) and stores the resulting vectors alongside the text, metadata, and IDs. These vectors are what get indexed and queried.
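
To make "text becomes a vector" concrete, here is a deliberately simplified sketch. It is not how all-MiniLM-L6-v2 works (that model is a neural network producing dense vectors); the function `toy_embed` and the word-count scheme are assumptions made purely for illustration.

```python
from collections import Counter


def toy_embed(text, vocabulary):
    """Map text to a vector of word counts over a fixed vocabulary.

    Toy stand-in for a real embedding model, for illustration only.
    """
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


documents = [
    "This is a document about food",
    "This is a document about cats and dogs",
]

# Hypothetical vocabulary: every distinct word in the documents, sorted.
vocabulary = sorted({word for doc in documents for word in doc.lower().split()})

# One fixed-length vector per document.
vectors = [toy_embed(doc, vocabulary) for doc in documents]

for doc, vec in zip(documents, vectors):
    print(vec, doc)
```

Every document maps to a vector of the same length, which is what lets a vector store compare any two documents numerically. A real model captures meaning rather than exact word overlap.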



### 2. Indexing

Indexing refers to organizing the embedded vectors in a way that allows for efficient similarity search. This is implicitly handled by Chroma when the documents are added to the collection.

During the collection.add call, Chroma not only embeds the documents but also adds the embeddings to the collection's index. Chroma builds an HNSW (Hierarchical Navigable Small World) index, a graph structure for approximate nearest-neighbor search. The metadata {"hnsw:space": "cosine"} supplied at collection creation does not turn HNSW on; it selects cosine distance as the metric the index uses, in place of the default squared L2 distance.
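
As a rough illustration of the metric, the sketch below computes cosine distance as 1 minus cosine similarity in plain Python. The helper name `cosine_distance` is an assumption for this example, and the code says nothing about how the HNSW index is implemented internally.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Same direction, different length: distance is approximately 0.
same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])

# Perpendicular vectors: distance is 1.
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])

# Opposite directions: distance is 2, the maximum.
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])

print(same_direction, orthogonal, opposite)
```

Cosine distance ignores vector length and depends only on direction, so smaller values mean more similar vectors.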

### 3. Querying

In the collection.query method, the query text is first embedded with the same model used for the documents. Chroma then searches the index for the n_results vectors closest to the query embedding. Because this collection uses cosine distance, smaller values in the returned distances mean more similar documents.

In [3]:
results = collection.query(
    query_texts=["This is a query asking about pizza"],
    n_results=2,
)

In [6]:
pprint(results)

{'distances': [[0.5715243220329285, 0.7453498840332031]],
 'documents': [['This is a document about food',
                "This is a document about animal's food"]],
 'embeddings': None,
 'ids': [['doc1', 'docs2']],
 'metadatas': [[{'topic': 'food'}, {'topic': 'animal'}]]}
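
Each field in the result is a list of lists: the outer list has one entry per query text, and the inner list holds that query's hits in order of increasing distance. The sketch below copies the printed output into a variable named `example_results` (a name chosen here so the notebook's own `results` is not overwritten) and pairs up the fields for the first query.

```python
# Result dictionary copied from the printed output of the query.
example_results = {
    "distances": [[0.5715243220329285, 0.7453498840332031]],
    "documents": [["This is a document about food",
                   "This is a document about animal's food"]],
    "embeddings": None,
    "ids": [["doc1", "docs2"]],
    "metadatas": [[{"topic": "food"}, {"topic": "animal"}]],
}

# Index 0 selects the hits for the first (and only) query text.
hits = list(zip(
    example_results["ids"][0],
    example_results["documents"][0],
    example_results["distances"][0],
))

for doc_id, document, distance in hits:
    print(f"{doc_id}: {distance:.3f}  {document}")
```

With several query texts, the same pattern applies with index 1, 2, and so on for each additional query.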


In [7]:
results = collection.query(
    query_texts=["This is a query asking about pizza"],
    n_results=2,
    where={"topic": "food"},
)

pprint(results)

{'distances': [[0.5715243220329285]],
 'documents': [['This is a document about food']],
 'embeddings': None,
 'ids': [['doc1']],
 'metadatas': [[{'topic': 'food'}]]}
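
Only one result comes back even though n_results=2, because only one document has the metadata topic "food". A conceptual sketch of that observable behavior follows. It is not Chroma's internal implementation: the function `filtered_top_n` and the record layout are assumptions, and the distance for doc3 is made up for illustration.

```python
def filtered_top_n(records, where, n_results):
    """Keep records whose metadata matches every key in `where`,
    then return up to n_results of them, closest first."""
    matching = [
        record for record in records
        if all(record["metadata"].get(key) == value for key, value in where.items())
    ]
    return sorted(matching, key=lambda record: record["distance"])[:n_results]


records = [
    {"id": "doc1", "metadata": {"topic": "food"}, "distance": 0.5715},
    {"id": "docs2", "metadata": {"topic": "animal"}, "distance": 0.7453},
    {"id": "doc3", "metadata": {"topic": "animal"}, "distance": 0.9},  # hypothetical value
]

# With the filter, only the single "food" document can be returned.
filtered = filtered_top_n(records, {"topic": "food"}, n_results=2)

# With an empty filter, the two closest documents are returned.
unfiltered = filtered_top_n(records, {}, n_results=2)

print([record["id"] for record in filtered])
print([record["id"] for record in unfiltered])
```

The point to take away is that n_results is an upper bound: a metadata filter can leave fewer candidates than requested.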


### Summary

Embedding: Done in the collection.add method, where documents are converted into vectors.

Indexing: Also done in the collection.add method, where the embedded vectors are organized for efficient similarity search.

Querying: Done in the collection.query method, where the query text is embedded and the indexed vectors are searched for the most similar results.