<a href="https://colab.research.google.com/github/StrategicalIT/PipedPiperAI/blob/main/Lab05.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB5: Exploring ChromaDB
In this lab we are going to use Python to get familiar with vector databases, and in particular with ChromaDB.

The most basic operations in vector databases include adding embeddings to the database and querying the database to find the stored embeddings most similar to a given embedding. Additionally, it is important to configure an index that can be used to speed up queries. We don't need to choose one with ChromaDB because it provides a single index type called HNSW (Hierarchical Navigable Small World). On the other hand, the Milvus database provides 12 different index types that can be advantageous for different use cases.
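To make "querying for similarity" concrete, here is a minimal sketch of what a vector database does conceptually, without any index: it scores every stored vector against the query and keeps the best matches. The three-dimensional vectors, the `store` dictionary, and the `brute_force_query` helper are all made up for illustration; an index such as HNSW exists to avoid this exhaustive scan on large collections.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings"; real models use hundreds of dimensions.
store = {
    "fruit_doc": [0.9, 0.1, 0.0],
    "vehicle_doc": [0.1, 0.9, 0.2],
}

def brute_force_query(query_embedding, n_results=1):
    """Score every stored vector against the query and return the best ids."""
    ranked = sorted(
        store,
        key=lambda doc_id: cosine_similarity(store[doc_id], query_embedding),
        reverse=True,  # highest similarity first
    )
    return ranked[:n_results]

print(brute_force_query([0.8, 0.2, 0.1]))
```

The scan above costs one comparison per stored vector, which is fine for four documents but not for millions. That is the gap an approximate index is meant to close.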

## Install dependencies

The first step is to install the necessary libraries. In this case we will install the chromadb Python library.

In [None]:
!pip install chromadb

Now we can import the components we need for this lab. The output of queries to ChromaDB is a nested, JSON-like structure (a Python dictionary) with several keys and values, so we are going to import "pprint" to make the output more readable.

In [None]:
import chromadb
from pprint import pprint

## Connect to Chroma

There are two ways to connect to a local Chroma database. If we use "```Client()```" it will create an in-memory-only database. If we want to persist data to disk we can use "```PersistentClient()```", passing in the parentheses the directory where the database files will be stored. In this case we are specifying the "current" directory, i.e. the directory from which you launched Jupyter. After running the command you can check that directory. You should find a file called "chroma.sqlite3". Together with the other files Chroma writes alongside it, this is where the HNSW index, the embeddings, and the chunks of documents that the embeddings represent are kept.

In [None]:
client = chromadb.PersistentClient(path=".")

## Create a collection and load documents

First, you have to create a collection, which is similar to a table in a relational database or to the concept of a namespace in other products.

Notice the syntax below to "get or create" a collection. If the collection already exists it will get it, otherwise it will create it. In ChromaDB the embedding model is a property of the collection, so different collections can use different embedding models. This is changed with the "```embedding_function=```" parameter. In this exercise we won't specify it, so it will use the default "```all-MiniLM-L6-v2```". This is very convenient for quickstarting a project, which is where ChromaDB excels.

In [None]:
collection = client.get_or_create_collection(name="my_collection")

We can use "```list_collections()```" to verify that the collection has been created.

In [None]:
print(client.list_collections())

Now we are in a position to start loading embeddings into the database, but we would have to create the embeddings first. In our final project we will use a NIM embedding model to create the vector embeddings but ChromaDB supports adding documents directly as well. In this exercise, we will add documents and will let it convert them into the embeddings with the default model.

When the next cell runs for the first time, the program will attempt to download the embedding model. It is about 80MB, so it should complete quickly.

In [None]:
collection.upsert(
    documents=[
        "This is a document about pineapples",
        "This is a document about oranges",
        "This is a document about planes",
        "This is a document about cars"
    ],
    ids=["id1", "id2", "id3", "id4"]
)

Notice in the previous code how we used "upsert", which is short for "update or insert". In other words, if a document with the given id already exists it will be updated, otherwise it will be created. "add" behaves differently when an id is already present: it does not overwrite the stored document, and depending on the Chroma version the duplicate is either skipped or reported as an error. This makes "upsert" the safer choice for cells you may run more than once.

When you insert documents you need to make sure the id you provide is unique.
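One way to keep ids unique and stable across reruns is to derive them from the document text instead of typing them by hand. The sketch below is only a suggestion, not something this lab requires: `make_id` is a hypothetical helper, and truncating the hash to 16 characters is an arbitrary choice.

```python
import hashlib

def make_id(document: str) -> str:
    """Derive a stable id from the document text (hypothetical helper)."""
    return hashlib.sha256(document.encode("utf-8")).hexdigest()[:16]

documents = [
    "This is a document about pineapples",
    "This is a document about oranges",
]
ids = [make_id(doc) for doc in documents]

# Guard against accidental duplicates before sending the batch to the database.
if len(ids) != len(set(ids)):
    raise ValueError("ids must be unique within a batch")

print(ids)
```

Because the same text always produces the same id, passing these ids to "upsert" would update the existing record rather than create a second copy. The trade-off is that editing a document's text changes its id.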

In [None]:
# Report how many items are stored in the collection
print(f"ChromaDB currently contains {collection.count()} items\n")

The above code uses the "```count()```" method to tell you how many documents there are in the collection.

Another diagnostic tool is "```.peek()```" which allows us to show the first 10 items in the collection. We will use the "pprint" function instead of the standard "print" so that the output is more readable. Pprint stands for "Pretty Print".

In [None]:
pprint(collection.peek())

## Query the database

Now we use the "```query()```" method to perform a query. Notice how we are requesting the 2 best matches.

In [None]:
results = collection.query(
    query_texts=["I need information about fruits"], # Chroma will embed this for you
    n_results=2 # how many results to return
)

It should have retrieved 2 documents from the database that are related to our query text. Let's see if the results make sense.

In [None]:
pprint(results)

The output should include the documents that are relevant to fruits. Try changing the query to other topics, like "transportation", and check what output you get.

Also, you can add more documents and repeat the queries.
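The result printed above is a dictionary of nested lists: the outer list has one entry per query text, and the inner list holds the matches for that query. The sketch below shows one way to walk that structure. `flatten_results` is a hypothetical helper, and `sample_results` is a hand-written stand-in with the same nesting; its distance values are invented for illustration, so expect different numbers from a real query.

```python
def flatten_results(results, query_index=0):
    """Pair up ids, documents and distances for one query text."""
    return list(
        zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["distances"][query_index],
        )
    )

# Hand-written stand-in with the same nesting as a query result.
# The distance values are made up for illustration.
sample_results = {
    "ids": [["id1", "id2"]],
    "documents": [[
        "This is a document about pineapples",
        "This is a document about oranges",
    ]],
    "distances": [[1.0, 1.2]],
}

for doc_id, document, distance in flatten_results(sample_results):
    print(f"{doc_id}  distance={distance:.2f}  {document}")
```

In the lab you would pass the real `results` variable instead of `sample_results`. A smaller distance means a closer match.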

## Working with metadata

Now we are going to explore how to leverage metadata to filter results. With the "upsert" function we can update existing documents and insert new ones all in one go. We use the "metadatas" key to specify the metadata for each record. Both "documents" and "metadatas" are Python lists, which is a structure where order matters. So the first metadata dictionary in the list corresponds to the first document and so on.

In [None]:
collection.upsert(
    documents=[
        "This is a document about pineapples",
        "This is a document about oranges",
        "This is a document about coconuts",
        "This is a document about pears"
    ],
    ids=["id1", "id2", "id5", "id6"],
    metadatas=[{"climate":"tropical"},
               {"climate":"mediterranean"},
               {"climate":"tropical"},
               {"climate":"mediterranean"}]
)
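Because the pairing in the call above is purely positional, a list that is one item short would silently shift or break the association. The following sketch, using a hypothetical `pair_records` helper, checks the lengths and shows which metadata ends up attached to which id. It runs on plain Python lists and does not touch the database.

```python
ids = ["id1", "id2", "id5", "id6"]
documents = [
    "This is a document about pineapples",
    "This is a document about oranges",
    "This is a document about coconuts",
    "This is a document about pears",
]
metadatas = [
    {"climate": "tropical"},
    {"climate": "mediterranean"},
    {"climate": "tropical"},
    {"climate": "mediterranean"},
]

def pair_records(ids, documents, metadatas):
    """Zip the three lists by position after checking they line up."""
    if not (len(ids) == len(documents) == len(metadatas)):
        raise ValueError("ids, documents and metadatas must have the same length")
    return list(zip(ids, documents, metadatas))

for doc_id, document, metadata in pair_records(ids, documents, metadatas):
    print(doc_id, metadata["climate"], "-", document)
```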

If we repeat the same query as before we might get fruits from all climates.

In [None]:
results = collection.query(
    query_texts=["I need information about fruits"], # Chroma will embed this for you
    n_results=2 # how many results to return
)

pprint(results)

However, we can use the metadata field to retrieve only "tropical" fruits, for example.

In [None]:
results = collection.query(
    query_texts=["I need information about fruits"], # Chroma will embed this for you
    n_results=2, # how many results to return
    where={"climate": "tropical"}
)

pprint(results)

You can experiment further by adding multiple metadata fields to each document, simply by adding more keys to the dictionaries, e.g.:

```[{"climate": "tropical", "colour": "yellow"} ...]```
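Once documents carry more than one metadata key, a filter can combine conditions. Chroma's filter syntax includes a "```$and```" operator that takes a list of conditions. The sketch below imitates that behaviour locally so you can see which records a combined filter would keep. `matches` is a simplified stand-in written for this example, not Chroma's implementation, and the colour values are invented.

```python
def matches(metadata, where):
    """Simplified local imitation of a metadata filter (not Chroma's code)."""
    if "$and" in where:
        # Every clause in the list must hold.
        return all(matches(metadata, clause) for clause in where["$and"])
    # Plain {"key": value} means the record's value must equal it.
    return all(metadata.get(key) == value for key, value in where.items())

# Hypothetical metadata; the colours are made up for illustration.
records = {
    "id1": {"climate": "tropical", "colour": "yellow"},
    "id2": {"climate": "mediterranean", "colour": "orange"},
    "id5": {"climate": "tropical", "colour": "brown"},
}

where = {"$and": [{"climate": "tropical"}, {"colour": "yellow"}]}
selected = [doc_id for doc_id, meta in records.items() if matches(meta, where)]
print(selected)
```

The same `where` dictionary could then be tried in "```collection.query()```" in place of the single-key filter used earlier, assuming your documents have both keys set.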

Can you think of how you would use metadata for a real-world use case at your business?

## End of Lab5