[Chroma DB](https://realpython.com/chromadb-vector-database/)

Vector databases extend the capabilities of traditional relational databases to embeddings. What distinguishes a vector database is that query results need not match the query exactly. Instead, using a specified similarity metric, the vector database returns the embeddings that are most similar to the query.
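
To make the idea of similarity-based retrieval concrete, here is a minimal brute-force sketch in plain Python. It is not how Chroma works internally (Chroma uses an approximate HNSW index); the function names `cosine_distance` and `nearest` and the 3-dimensional toy vectors are assumptions made up for illustration.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(query, vectors, n_results=1):
    """Rank stored vectors by distance to the query, closest first."""
    scored = [(cosine_distance(query, vec), key) for key, vec in vectors.items()]
    scored.sort()
    return [(key, dist) for dist, key in scored[:n_results]]


# Hypothetical toy embeddings; real models produce hundreds of dimensions.
toy_vectors = {
    "pizza": [0.9, 0.1, 0.0],
    "iphone": [0.0, 0.2, 0.9],
    "bali": [0.1, 0.9, 0.2],
}

top = nearest([0.8, 0.3, 0.1], toy_vectors, n_results=2)
print(top)
```

A smaller distance means a closer match, so the query vector above ranks "pizza" first even though no stored vector equals it exactly.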

In [1]:
import chromadb
from chromadb.utils import embedding_functions

This example uses the default model, but other models can be integrated as described [here](https://docs.trychroma.com/integrations).

In [2]:

CHROMA_DATA_PATH = "chroma_data/"
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "demo_docs"

client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)

A collection is the object that stores your embedded documents along with any associated metadata. If you’re familiar with relational databases, then you can think of a collection as a table. In this example, your collection is named demo_docs. It uses the "all-MiniLM-L6-v2" embedding function instantiated in the next cell, and it uses the cosine similarity distance function, as specified by metadata={"hnsw:space": "cosine"}.

In [3]:
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=EMBED_MODEL)

try:
    client.delete_collection(name=COLLECTION_NAME)
except ValueError:
    print("Collection does not exist. Creating Now")

collection = client.create_collection(name=COLLECTION_NAME,
                                      embedding_function=embedding_func,
                                      metadata={"hnsw:space": "cosine"},
                                     )




We create some dummy documents and, for each document, a genre to store as metadata.

In [4]:
documents = [
            "The latest iPhone model comes with impressive features and a powerful camera.",
            "Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.",
            "Einstein's theory of relativity revolutionized our understanding of space and time.",
            "Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.",
            "The American Revolution had a profound impact on the birth of the United States as a nation.",
            "Regular exercise and a balanced diet are essential for maintaining good physical health.",
            "Leonardo da Vinci's Mona Lisa is considered one of the most iconic paintings in art history.",
            "Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
            "Startup companies often face challenges in securing funding and scaling their operations.",
            "Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'",
            ]

genres = [
        "technology",
        "travel",
        "science",
        "food",
        "history",
        "fitness",
        "art",
        "climate change",
        "business",
        "music"
        ]

When we add these documents to the collection, we also add the genre as metadata.

In [5]:
collection.add(documents=documents,
               ids=[f"id{i}" for i in range(len(documents))],
               metadatas=[{"genre": g} for g in genres]
               )
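
Since `documents`, `ids`, and `metadatas` are passed as parallel lists, a quick sanity check before adding can catch misaligned input early. The helper below, `check_records`, is a hypothetical name and not part of Chroma; it is only a sketch of the kind of check one might run.

```python
def check_records(documents, ids, metadatas):
    """Raise ValueError if the three lists are misaligned or ids repeat."""
    if not (len(documents) == len(ids) == len(metadatas)):
        raise ValueError("documents, ids, and metadatas must have the same length")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    # Return one (id, document, metadata) tuple per record for inspection.
    return list(zip(ids, documents, metadatas))


# Small stand-in lists built the same way as in the cell above.
sample_docs = ["first document", "second document"]
sample_ids = [f"id{i}" for i in range(len(sample_docs))]
sample_meta = [{"genre": "technology"}, {"genre": "travel"}]

records = check_records(sample_docs, sample_ids, sample_meta)
print(records[0])
```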

When we execute the following query, the result contains:

- ids of the documents that are most similar
- distances of the fetched documents from the query
- metadata of the fetched documents
- embeddings of the fetched documents (`None` unless requested through `include`)
- the fetched documents themselves

In [6]:
query_results = collection.query(query_texts=["Find me some delicious food!"],
                                 n_results=1,
                                 )

query_results.keys()

dict_keys(['ids', 'distances', 'metadatas', 'embeddings', 'documents', 'uris', 'data'])

In [7]:
print(query_results["ids"])
print(query_results["documents"])
print(query_results["metadatas"])
print(query_results["distances"])

[['id3']]
[['Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.']]
[[{'genre': 'food'}]]
[[0.763826391677926]]


We can pass multiple queries at once and use `include` to limit which keys are returned.

In [8]:
query_results = collection.query(query_texts=["Teach me about history",
                                              "What's going on in the world?"],
                                 include=["documents", "distances"],
                                 n_results=2,
                                 )
query_results["documents"]

[["Einstein's theory of relativity revolutionized our understanding of space and time.",
  'The American Revolution had a profound impact on the birth of the United States as a nation.'],
 ["Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
  "Einstein's theory of relativity revolutionized our understanding of space and time."]]

In [9]:
print(query_results["distances"])

[[0.626588243350541, 0.6904192074815718], [0.8002943724796112, 0.8882107299851638]]
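
The results are nested lists: one inner list per query, in the same order as `query_texts`. As a small sketch, the hypothetical helper `pair_results` below flattens that structure into (query, document, distance) rows. The sample dictionary copies the outputs shown above, so the snippet runs without a Chroma client.

```python
def pair_results(query_texts, results):
    """Return (query, document, distance) rows from a nested query result."""
    rows = []
    for query, docs, dists in zip(
        query_texts, results["documents"], results["distances"]
    ):
        for doc, dist in zip(docs, dists):
            rows.append((query, doc, dist))
    return rows


queries = ["Teach me about history", "What's going on in the world?"]

# Values copied from the outputs of the two cells above.
sample_results = {
    "documents": [
        ["Einstein's theory of relativity revolutionized our understanding of space and time.",
         "The American Revolution had a profound impact on the birth of the United States as a nation."],
        ["Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
         "Einstein's theory of relativity revolutionized our understanding of space and time."],
    ],
    "distances": [
        [0.626588243350541, 0.6904192074815718],
        [0.8002943724796112, 0.8882107299851638],
    ],
}

for query, doc, dist in pair_results(queries, sample_results):
    print(f"{dist:.3f} | {query} | {doc[:40]}")
```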


We can narrow down queries based on metadata, so that only documents matching the metadata filter are considered. This example can be read as: filter the collection to documents whose "genre" metadata field equals "music".

In [10]:
collection.query(query_texts=["Teach me about music history"],
                 where={"genre": {"$eq": "music"}},
                 n_results=1,
                 )

{'ids': [['id9']],
 'distances': [[0.8186328681933428]],
 'metadatas': [[{'genre': 'music'}]],
 'embeddings': None,
 'documents': [["Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'"]],
 'uris': None,
 'data': None}
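
To illustrate what the `where` filter expresses, here is a plain-Python sketch of equality matching on metadata. It is not Chroma's implementation; `matches_where` is a hypothetical helper that handles only `$eq` and the shorthand form `{"genre": "music"}`, which Chroma treats as equivalent to `{"genre": {"$eq": "music"}}`.

```python
def matches_where(metadata, where):
    """Return True if metadata satisfies every field condition in `where`.

    Supports only {"field": value} and {"field": {"$eq": value}}.
    """
    for field, condition in where.items():
        if isinstance(condition, dict):
            expected = condition["$eq"]
        else:
            expected = condition
        if metadata.get(field) != expected:
            return False
    return True


# A few of the metadata dictionaries built from the genres list above.
sample_metadatas = [{"genre": g} for g in ["technology", "history", "music"]]

kept = [m for m in sample_metadatas if matches_where(m, {"genre": {"$eq": "music"}})]
print(kept)
```

Only the documents that pass the filter are candidates for the similarity ranking, which is why the query above returns the Beethoven document even though the query text also mentions history.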