
### What Is a Vector Database?

A vector database is a database that allows you to efficiently store and query embedding data. Vector databases extend the capabilities of traditional relational databases to embeddings. However, the key distinguishing feature of a vector database is that query results aren’t an exact match to the query. Instead, using a specified similarity metric, the vector database returns embeddings that are _similar_ to a query.

To make this possible, vector databases are equipped with features that balance the speed and accuracy of query results. Here are the core components of a vector database that you should know about:

- **Embedding function**: When using a vector database, oftentimes you’ll store and query data in its raw form, rather than uploading embeddings themselves. Internally, the vector database needs to know how to convert your data to embeddings, and you have to specify an embedding function for this. For text, you can use the embedding functions available in the SentenceTransformers library or any other function that maps raw text to vectors.

    See the Chroma documentation on embedding functions: https://docs.trychroma.com/docs/embeddings/embedding-functions

    
- **Similarity metric**: To assess embedding similarity, you need a similarity metric like cosine similarity, the dot product, or Euclidean distance. 
    
- **Indexing**: When you’re dealing with a large number of embeddings, comparing a query embedding to every embedding stored in the database is often too slow. To overcome this, vector databases employ indexing algorithms that group similar embeddings together.
    
    At query time, the query embedding is compared to a smaller subset of embeddings based on the index. Because the embeddings recommended by the index aren’t guaranteed to have the highest similarity to the query, this is called approximate nearest neighbor search.
    
- **Metadata**: You can store metadata with each embedding to help give context and make query results more precise. You can filter your embedding searches on metadata much like you would in a relational database. For example, you could store the year that a document was published as metadata and only look for similar documents that were published in a given year.
    
- **Storage location**: With any kind of database, you need a place to store the data. Vector databases can store embeddings and metadata both in memory and on disk. Keeping data in memory allows for faster reads and writes, while writing to disk is important for persistent storage.
    
- **CRUD operations**: Most vector databases support create, read, update, and delete (CRUD) operations. This means you can maintain and interact with data like you would in a relational database.
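
To make the similarity metrics and the cost of exhaustive comparison concrete, here is a minimal sketch in plain Python. The three-dimensional vectors and their labels are made up for illustration, since real embeddings have hundreds of dimensions. The `exact_nearest()` helper is a hypothetical name and not part of any library. It compares the query to every stored vector, which is the brute-force step that an index is designed to avoid.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Dot product of the vectors after scaling each to unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def exact_nearest(query, stored, k=1):
    """Rank every stored vector by cosine similarity to the query, highest first."""
    ranked = sorted(
        stored.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical toy "embeddings", invented for this sketch
toy_embeddings = {
    "pizza": [0.9, 0.1, 0.0],
    "beach": [0.1, 0.9, 0.2],
    "physics": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]

print(exact_nearest(query, toy_embeddings, k=2))  # ['pizza', 'beach']
```

Because this search touches every stored vector, its cost grows linearly with the size of the collection. An approximate index trades a little accuracy for not having to do that.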


In [1]:
import chromadb
from chromadb.utils import embedding_functions

In [None]:
CHROMA_DATA_PATH = "chroma_data/"
EMBED_MODEL = "all-MiniLM-L6-v2"
COLLECTION_NAME = "demo_docs"

client = chromadb.PersistentClient(path=CHROMA_DATA_PATH)





This cell instantiates a `PersistentClient` object that writes your embedding data to `CHROMA_DATA_PATH`. Doing so ensures that the data is stored at `CHROMA_DATA_PATH` and remains available to clients that you create later. Alternatively, you can use `chromadb.Client()` to instantiate a ChromaDB instance that keeps data only in memory and doesn’t persist it to disk.

In [4]:
embedding_func = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name=EMBED_MODEL
)


In [15]:
collection = client.get_or_create_collection(
    name=COLLECTION_NAME,
    embedding_function=embedding_func,
    metadata={"hnsw:space": "cosine"},
)


ChromaDB will use `embedding_func` to embed all your documents and queries. Here, it wraps the `"all-MiniLM-L6-v2"` model.

A collection is the object that stores your embedded documents along with any associated metadata. If you’re familiar with relational databases, then you can think of a collection as a table. In this example, your collection is named `demo_docs`, it uses the `"all-MiniLM-L6-v2"` embedding function that you instantiated, and it uses cosine distance as its distance function, as specified by `metadata={"hnsw:space": "cosine"}`.

In [9]:
documents = [
    "The latest iPhone model comes with impressive features and a powerful camera.",
    "Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.",
    "Einstein's theory of relativity revolutionized our understanding of space and time.",
    "Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.",
    "The American Revolution had a profound impact on the birth of the United States as a nation.",
    "Regular exercise and a balanced diet are essential for maintaining good physical health.",
    "Leonardo da Vinci's Mona Lisa is considered one of the most iconic paintings in art history.",
    "Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
    "Startup companies often face challenges in securing funding and scaling their operations.",
    "Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'",
]

genres = [
    "technology",
    "travel",
    "science",
    "food",
    "history",
    "fitness",
    "art",
    "climate change",
    "business",
    "music",
]

authors = [
    "Brian Merchant",  # Author of "The One Device: The Secret History of the iPhone"
    "Elizabeth Pisani",  # Author of "Indonesia, Etc." covering Balinese culture and travel
    "Albert Einstein",  # Author of "Relativity: The Special and General Theory"
    "Marc Vetri",  # Author of "Mastering Pizza: The Art and Practice of Handmade Pizza"
    "David McCullough",  # Author of "1776" on the American Revolution
    "Michael Greger",  # Author of "How Not to Die" emphasizing exercise and diet for health
    "Walter Isaacson",  # Author of "Leonardo da Vinci" biography focusing on the Mona Lisa
    "Elizabeth Kolbert",  # Author of "The Sixth Extinction" on climate change and ecosystems
    "Alejandro Cremades",  # Author of "The Art of Startup Fundraising" on funding challenges
    "David B. Levy",  # Author of "Beethoven: The Ninth Symphony"
]


In [17]:
# Create: add the documents, their IDs, and their metadata to the collection
collection.add(
    documents=documents,
    ids=[f"id{i + 1}" for i in range(len(documents))],
    metadatas=[{"genre": g, "authors": a} for g, a in zip(genres, authors)],
)


In [18]:
# for g, a in zip(genres, authors):
#     print(g,a)


Each document in the `documents` argument is embedded and stored in the collection. You also have to define the `ids` argument to uniquely identify each document and embedding in the collection.

The `metadatas` argument is optional, but most of the time, it’s useful to store metadata with your embeddings. In this case, you define two metadata fields, `"genre"` and `"authors"`, that record the genre of each document and an author who has written on its topic. When you query a document, metadata provides you with additional information that can be helpful to better understand the document’s contents. You can also filter on metadata fields, just like you would in a relational database query.


In [26]:
query_results = collection.query(
    query_texts=["Find me some delicious food!"],
    n_results=3,
)

query_results

{'ids': [['id4', 'id2', 'id3']],
 'embeddings': None,
 'documents': [['Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.',
   'Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.',
   "Einstein's theory of relativity revolutionized our understanding of space and time."]],
 'uris': None,
 'data': None,
 'metadatas': [[{'authors': 'Marc Vetri', 'genre': 'food'},
   {'authors': 'Elizabeth Pisani', 'genre': 'travel'},
   {'authors': 'Albert Einstein', 'genre': 'science'}]],
 'distances': [[0.7638263154594184, 0.8292162662367584, 0.8853589345932061]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [27]:
query_results["ids"]

[['id4', 'id2', 'id3']]

In [28]:
query_results["distances"]

[[0.7638263154594184, 0.8292162662367584, 0.8853589345932061]]

In [29]:
query_results["metadatas"]

[[{'authors': 'Marc Vetri', 'genre': 'food'},
  {'authors': 'Elizabeth Pisani', 'genre': 'travel'},
  {'authors': 'Albert Einstein', 'genre': 'science'}]]

The results returned by `collection.query()` are stored in a dictionary whose keys include `ids`, `distances`, `metadatas`, `embeddings`, and `documents`. Each value is a list with one entry per query text, which is why the results are nested one level deep. In the output above, `embeddings` is `None` because it isn’t listed under `included`. This is the same information that you added to your collection at the beginning, but it’s filtered down to match your query. In other words, `collection.query()` returns the stored information about the documents that are most similar to your query.

As you can see, the embedding for _Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens_ was most similar to the query _Find me some delicious food_. You probably agree that this document is the closest match. You can also see the ID, metadata, and distance associated with the matching document embedding. Here, you’re using **cosine distance**, which is one minus the cosine similarity between two embeddings.
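
The nested result layout and the relationship between cosine distance and cosine similarity can be sketched with a small helper. The function name `flatten_query_results()` is hypothetical, and `sample_results` is a hand-copied subset of the output shown above so that the snippet runs without a ChromaDB collection.

```python
def flatten_query_results(query_texts, results):
    """Turn the nested per-query lists into one flat record per result."""
    rows = []
    for q_idx, query in enumerate(query_texts):
        ids = results["ids"][q_idx]
        documents = results["documents"][q_idx]
        distances = results["distances"][q_idx]
        for doc_id, doc, dist in zip(ids, documents, distances):
            rows.append(
                {
                    "query": query,
                    "id": doc_id,
                    "document": doc,
                    "cosine_distance": dist,
                    # Cosine distance is one minus cosine similarity
                    "cosine_similarity": 1 - dist,
                }
            )
    return rows


# Subset of the query output above, copied by hand for this sketch
sample_results = {
    "ids": [["id4", "id2", "id3"]],
    "documents": [
        [
            "Traditional Italian pizza is famous for its thin crust, fresh ingredients, and wood-fired ovens.",
            "Exploring the beautiful beaches and vibrant culture of Bali is a dream for many travelers.",
            "Einstein's theory of relativity revolutionized our understanding of space and time.",
        ]
    ],
    "distances": [[0.7638263154594184, 0.8292162662367584, 0.8853589345932061]],
}

rows = flatten_query_results(["Find me some delicious food!"], sample_results)
for row in rows:
    print(f"{row['id']}: similarity {row['cosine_similarity']:.4f}")
```

A smaller distance corresponds to a larger similarity, so the first record in each inner list is the closest match.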

In [31]:
query_results = collection.query(
    query_texts=["Teach me about history",
                 "What's going on in the world?"],
    include=["documents", "distances"],
    n_results=2
)

query_results


{'ids': [['id3', 'id5'], ['id8', 'id3']],
 'embeddings': None,
 'documents': [["Einstein's theory of relativity revolutionized our understanding of space and time.",
   'The American Revolution had a profound impact on the birth of the United States as a nation.'],
  ["Climate change poses a significant threat to the planet's ecosystems and biodiversity.",
   "Einstein's theory of relativity revolutionized our understanding of space and time."]],
 'uris': None,
 'data': None,
 'metadatas': None,
 'distances': [[0.6265883553621786, 0.690419353119635],
  [0.8002943811335208, 0.8882107242437847]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>]}

In [32]:
query_results["documents"][0]

["Einstein's theory of relativity revolutionized our understanding of space and time.",
 'The American Revolution had a profound impact on the birth of the United States as a nation.']

In [33]:
query_results["distances"][0]

[0.6265883553621786, 0.690419353119635]

In [34]:
def display_query_results(query_texts, query_results):
    for i, query in enumerate(query_texts):
        print(f"\n🔍 Query {i + 1}: \"{query}\"\n{'-' * 60}")
        documents = query_results["documents"][i]
        distances = query_results["distances"][i]

        for j, (doc, dist) in enumerate(zip(documents, distances), start=1):
            print(f"Result {j}:")
            print(f"📄 Document: {doc}")
            print(f"📏 Distance: {dist:.4f}")
            print()


query_texts = [
    "Teach me about history",
    "What's going on in the world?"
]

display_query_results(query_texts, query_results)


🔍 Query 1: "Teach me about history"
------------------------------------------------------------
Result 1:
📄 Document: Einstein's theory of relativity revolutionized our understanding of space and time.
📏 Distance: 0.6266

Result 2:
📄 Document: The American Revolution had a profound impact on the birth of the United States as a nation.
📏 Distance: 0.6904


🔍 Query 2: "What's going on in the world?"
------------------------------------------------------------
Result 1:
📄 Document: Climate change poses a significant threat to the planet's ecosystems and biodiversity.
📏 Distance: 0.8003

Result 2:
📄 Document: Einstein's theory of relativity revolutionized our understanding of space and time.
📏 Distance: 0.8882



> **Note:** Keep in mind that so-called similar documents returned from a semantic search over embeddings may not actually be relevant to the task that you’re trying to solve. The success of a semantic search is somewhat subjective, and you or your stakeholders might not agree on the quality of the results.

> If there are no relevant documents in your collection for a given query, or your embedding algorithm wasn’t trained on the right or enough data, then your results might be poor. It’s up to you to understand your application, your stakeholders’ expectations, and the limitations of your embedding algorithm and document collection.

Another useful feature of ChromaDB is the ability to filter queries on metadata. To see why this matters, start with an unfiltered query:

In [35]:
collection.query(
    query_texts=["Teach me about music history"],
    n_results=1
)

{'ids': [['id3']],
 'embeddings': None,
 'documents': [["Einstein's theory of relativity revolutionized our understanding of space and time."]],
 'uris': None,
 'data': None,
 'metadatas': [[{'authors': 'Albert Einstein', 'genre': 'science'}]],
 'distances': [[0.7625820506219841]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}


Your query is _Teach me about music history_, and the most similar document is _Einstein’s theory of relativity revolutionized our understanding of space and time_. While Einstein is a historical figure who was a [musician](https://en.wikipedia.org/wiki/Albert_Einstein#Love_of_music) and teacher, this isn’t quite the result that you’re looking for. Because you’re particularly interested in music history, you can filter on the `"genre"` metadata field to search over more relevant documents:

In [36]:
collection.query(
    query_texts=["Teach me about music history"],
    where={"genre": {"$eq": "music"}},
    n_results=1,
)

{'ids': [['id10']],
 'embeddings': None,
 'documents': [["Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'"]],
 'uris': None,
 'data': None,
 'metadatas': [[{'authors': 'David B. Levy', 'genre': 'music'}]],
 'distances': [[0.818632941929768]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In this query, you specify in the `where` argument that you’re only looking for documents with the `"music"` genre. To apply filters, ChromaDB expects a dictionary where the keys are metadata names and the values are dictionaries specifying how to filter. In plain English, you can interpret `{"genre": {"$eq": "music"}}` as _filter the collection where the `"genre"` metadata field equals `"music"`_.
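
To illustrate how such a filter dictionary reads, here is a toy matcher in plain Python. It is only a sketch of the semantics for two operators, `$eq` and `$in` (which matches any value in a list), and it is not how ChromaDB implements filtering. The `matches()` helper and the `records` list are hypothetical and reuse a few metadata entries from this example.

```python
def matches(metadata, where):
    """Return True if `metadata` satisfies every condition in `where`."""
    for field, condition in where.items():
        for operator, operand in condition.items():
            if operator == "$eq":
                ok = metadata.get(field) == operand
            elif operator == "$in":
                ok = metadata.get(field) in operand
            else:
                # Other operators are deliberately left out of this sketch
                raise ValueError(f"operator not covered by this sketch: {operator}")
            if not ok:
                return False
    return True


# A few (id, metadata) pairs taken from the collection in this example
records = [
    ("id3", {"genre": "science", "authors": "Albert Einstein"}),
    ("id5", {"genre": "history", "authors": "David McCullough"}),
    ("id10", {"genre": "music", "authors": "David B. Levy"}),
]

music_only = [rid for rid, meta in records if matches(meta, {"genre": {"$eq": "music"}})]
music_or_history = [
    rid for rid, meta in records
    if matches(meta, {"genre": {"$in": ["music", "history"]}})
]

print(music_only)        # ['id10']
print(music_or_history)  # ['id5', 'id10']
```

In a real query, the filter narrows the set of candidate documents, and the similarity ranking is then reported for the documents that pass it.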

In [37]:
query_results = collection.query(
    query_texts=["Teach me about music history"],
    where={"genre": {"$in": ["music", "history"]}},
    n_results=2,
)

query_results["documents"]

[["Beethoven's Symphony No. 9 is celebrated for its powerful choral finale, 'Ode to Joy.'",
  'The American Revolution had a profound impact on the birth of the United States as a nation.']]

In [39]:

## read operation

collection.get("id3")

{'ids': ['id3'],
 'embeddings': None,
 'documents': ["Einstein's theory of relativity revolutionized our understanding of space and time."],
 'uris': None,
 'data': None,
 'metadatas': [{'authors': 'Albert Einstein', 'genre': 'science'}],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [41]:
collection.update(
    ids=["id1", "id2"],
    documents=["The new iPhone is awesome!",
               "Bali has beautiful beaches"],
    metadatas=[{"genre": "tech"}, {"genre": "beaches"}]
)

If you want to update existing documents, embeddings, or metadata, then you can use `collection.update()`. This requires you to know the IDs of the data that you want to update. The call above updates both the documents and the `"genre"` metadata for `"id1"` and `"id2"`. As the output below shows, the `"authors"` field, which the update doesn’t mention, is left unchanged:

In [44]:
collection.get("id2")

{'ids': ['id2'],
 'embeddings': None,
 'documents': ['Bali has beautiful beaches'],
 'uris': None,
 'data': None,
 'metadatas': [{'authors': 'Elizabeth Pisani', 'genre': 'beaches'}],
 'included': [<IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [45]:
query_results = collection.get(ids=["id1", "id2"])

query_results['documents']

['The new iPhone is awesome!', 'Bali has beautiful beaches']

In [47]:
## upsert

collection.upsert(
    ids=["id12", "id13"],
    documents=["The new iPhone is awesome!",
               "Bali has beautiful beaches"],
    metadatas=[{"genre": "tech"}, {"genre": "beaches"}]
)

In [48]:
query_results = collection.get(ids=["id12", "id13"])

query_results['documents']

['The new iPhone is awesome!', 'Bali has beautiful beaches']
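
The difference between updating and upserting can be sketched with a toy in-memory store. `ToyCollection` is a hypothetical class written for this illustration. It tracks only IDs and documents, and raising `KeyError` for an unknown ID in `update()` is a choice made for this sketch rather than a statement about how ChromaDB behaves.

```python
class ToyCollection:
    """A dictionary-backed stand-in that maps IDs to documents."""

    def __init__(self, docs=None):
        self._docs = dict(docs or {})

    def update(self, ids, documents):
        """Replace documents for IDs that already exist."""
        for doc_id, doc in zip(ids, documents):
            if doc_id not in self._docs:
                # Sketch-only behavior: refuse to update an unknown ID
                raise KeyError(doc_id)
            self._docs[doc_id] = doc

    def upsert(self, ids, documents):
        """Replace the document if the ID exists, otherwise insert it."""
        for doc_id, doc in zip(ids, documents):
            self._docs[doc_id] = doc

    def delete(self, ids):
        """Remove the given IDs from the store."""
        for doc_id in ids:
            del self._docs[doc_id]

    def count(self):
        return len(self._docs)


toy = ToyCollection({"id1": "first document", "id2": "second document"})

toy.update(ids=["id1"], documents=["first document, revised"])
count_after_update = toy.count()  # updating never changes the count

toy.upsert(ids=["id2", "id3"], documents=["second, revised", "third document"])
count_after_upsert = toy.count()  # "id3" was new, so the count grows by one

toy.delete(ids=["id3"])
count_after_delete = toy.count()

print(count_after_update, count_after_upsert, count_after_delete)  # 2 3 2
```

In the notebook, `"id12"` and `"id13"` don't exist yet when `collection.upsert()` runs, so the call inserts two new entries.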

In [49]:
collection.count()

12

In [50]:
collection.delete(ids=["id12", "id13"])

In [51]:
collection.count()

10