# Try out ChromaDB

### Objective:
* Hands-on experience with a vector database
* Familiarize yourself with common operations and capabilities

The exercise uses Chroma in *embedded* mode. You may also use the client-server mode using the instructions below.

#### Chroma samples
* https://www.trychroma.com/
* https://docs.trychroma.com/

#### API Reference

https://docs.trychroma.com/api-reference

#### Client-Server mode
To use the client-server setup, launch the Chroma server with the command below:

$ chroma run --path ./temp.db1

Then use the HTTP client (`chromadb.HttpClient`) instead of `PersistentClient`.
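A minimal sketch of switching between the two modes, assuming the server started by `chroma run` listens on `localhost:8000`. The `make_client` helper and its parameters are hypothetical names used for illustration; `chromadb.PersistentClient` and `chromadb.HttpClient` are the client constructors.

```python
def make_client(mode="embedded", path="./temp.db1", host="localhost", port=8000):
    """Return a Chroma client for the requested mode (hypothetical helper)."""
    if mode not in ("embedded", "client-server"):
        raise ValueError(f"Unknown mode: {mode!r}")

    # Imported lazily so the mode check above runs even before chromadb is installed
    import chromadb

    if mode == "embedded":
        # Data is persisted on the local file system under `path`
        return chromadb.PersistentClient(path=path)

    # Talks to a server started with: chroma run --path ./temp.db1
    # host and port are assumptions; adjust them to match your server
    return chromadb.HttpClient(host=host, port=port)
```

With a helper like this, `client = make_client("client-server")` could replace the `PersistentClient` line, and the remaining collection calls should stay the same.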

Math word problems

https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k?row=5

## Install ChromaDB

* To use the embedded mode, install the package: *chromadb*
* For client-server, you just need the package: *chromadb-client*

#### Potential issues on Windows 10/11
* `pip install chromadb` may fail.
* Pay attention to the failure message: you may need to install the Microsoft C++ build tools. Download "Build Tools", install "Desktop development with C++", and then retry `pip install chromadb`.
* Refer: https://visualstudio.microsoft.com/visual-cpp-build-tools/
* Stack Overflow: https://stackoverflow.com/questions/73969269/error-could-not-build-wheels-for-hnswlib-which-is-required-to-install-pyprojec/76245995#76245995

#### Runtime error?
RuntimeError: Chroma is running in http-only client mode, and can only be run with 'chromadb.api.fastapi.FastAPI' or 'chromadb.api.async_fastapi.AsyncFastAPI' as the chroma_api_impl.

If you get the above error, it means that you have installed both chromadb and chromadb-client, which causes a conflict. To fix it:
* pip uninstall chromadb chromadb-client
* pip install chromadb

In [6]:
# DO NOT INSTALL BOTH chromadb and chromadb-client as they conflict : You will get a Runtime error !!
# For the in-memory and persistent clients
# Not needed if client-server is in use
# !pip install chromadb

# Needed for client-server
# !pip install chromadb-client

# Part-1 : Use default embeddings

## 1. Create a collection

Instructions below are for using the embedded mode.

1. Create a database with the *PersistentClient* - pass the local file system location for persistence
2. Create the collection. Metadata is used to select the:
* Algorithm
* Distance metric

#### API Documentation

**Client** 
https://docs.trychroma.com/reference/Client

**Collection**
https://docs.trychroma.com/reference/Collection

collection = client.create_collection(
        name="collection_name",
        metadata={"hnsw:space": "cosine"} # l2 is the default
    )
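
To make the `hnsw:space` choice concrete, here is a plain-Python sketch of the two metrics on toy vectors. It assumes Chroma's documented definitions (`l2` as squared Euclidean distance, `cosine` as one minus cosine similarity); the function names and vectors are illustrative and not part of the Chroma API.

```python
import math

def squared_l2(a, b):
    # Sum of squared coordinate differences
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity; depends only on direction, not on vector length
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a = [1.0, 0.0]
b = [2.0, 0.0]   # same direction as a, twice the length
c = [0.0, 1.0]   # orthogonal to a

print(squared_l2(a, b), cosine_distance(a, b))   # 1.0 0.0
print(squared_l2(a, c), cosine_distance(a, c))   # 2.0 1.0
```

The pair `a`, `b` shows the practical difference: cosine treats them as identical because they point the same way, while `l2` still reports a non-zero distance.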


In [1]:
import chromadb

In [2]:
# Create the client
client = chromadb.PersistentClient(path="c:/temp/chromadb-test")

# Set up the collection name - use a naming convention, add metadata
collection_name = "testset_MiniLM-L6-V2"

# Add relevant fields to the metadata e.g., source of information, PDF page#, Internal-ID, expiry date, owner of information, ....
collection_metadata= {"hnsw:space": "cosine"} 

# Comment out the code below to prevent re-creation of the collection
try:
    client.delete_collection(name=collection_name)
except Exception:
    # Raised when the collection does not exist yet; safe to ignore
    print("Ignore if the collection is not there")

# Create the collection
collection = client.get_or_create_collection(name=collection_name, metadata=collection_metadata)

# collection

⚠️ It looks like you upgraded from a version below 0.5.6 and could benefit from vacuuming your database. Run chromadb utils vacuum --help for more information.


## 2. Add embeddings to collection

Chroma uses the "all-MiniLM-L6-v2" model by default to generate the embeddings.

In [3]:
# Test corpus
corpus = [
  "A man is eating food.", "A man is eating a piece of bread.",
  "The chef is preparing a delicious meal in the kitchen.", "A chef is tossing vegetables in a sizzling pan.",
  "A man is riding a horse.", "A man is riding a white horse on an enclosed ground.",
  "A woman is playing violin.", "A musician is tuning his guitar before the concert.",
  "The girl is carrying a baby.", "The baby is giggling while playing with her toys.",
  "The family is having a picnic under the shady oak tree.", "A group of friends is hiking up the mountain trail.",
  "The mechanic is repairing a broken-down car in the garage.", "The old man is feeding breadcrumbs to the ducks at the pond.",
  "The artist is sketching a beautiful landscape at sunset.", "A man is painting a colorful mural on the city wall.",
  "A team of scientists is conducting experiments in the laboratory.", "A group of students is studying together in the library.",
  "The birds are chirping happily in the morning sun.", "The dog is chasing its tail around the backyard.",
  "A group of children are playing soccer in the park.", "A monkey is playing drums.",
  "A boy is flying a kite in the open field.", "Two men pushed carts through the woods.",
  "A woman is walking her dog along the beach.", "A young girl is reading a book under a shady tree.",
  "The dancer is gracefully performing on stage.", "The farmer is harvesting ripe tomatoes from the vine."
]

# Fields stored alongside embeddings
ids  = []
metadatas = []

# Loop through corpus to generate the ids and metadata = {words_count, uri}
# Note: len() counts characters, so "words_count" actually holds the character count
for i in range(len(corpus)):
    # print(corpus[i])
    ids.append("id-"+str(i))
    uri = "https://link-"+str(i)
    metadatas.append({"words_count": len(corpus[i]), "uri": uri})

collection.add(documents = corpus, metadatas=metadatas, ids=ids)
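
The `words_count` field above is filled with `len(corpus[i])`, which is the number of characters in the sentence (21 for the first one). If an actual word count is wanted, a whitespace split is one simple option. This is only a sketch, and `word_count` is a hypothetical helper name.

```python
def word_count(text):
    # Splits on any run of whitespace; punctuation stays attached to its word
    return len(text.split())

sentence = "A man is eating food."
print(len(sentence))         # 21 (characters)
print(word_count(sentence))  # 5 (words)
```

Switching the metadata to real word counts would also change the values shown in the outputs below and the thresholds that make sense in `where` filters.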

## 3. Check count & peek()

In [6]:
# Get the count and check a few items
count = collection.count()

# peek() returns the first 10 items by default
result = collection.peek()
# print(result)

# Print information on the collection
print("Index count : ", count)
print(result['ids'])
print(result['metadatas'])
print(result['documents'])


Index count :  28
['id-0', 'id-1', 'id-10', 'id-11', 'id-12', 'id-13', 'id-14', 'id-15', 'id-16', 'id-17']
[{'uri': 'https://link-0', 'words_count': 21}, {'uri': 'https://link-1', 'words_count': 33}, {'uri': 'https://link-10', 'words_count': 55}, {'uri': 'https://link-11', 'words_count': 51}, {'uri': 'https://link-12', 'words_count': 58}, {'uri': 'https://link-13', 'words_count': 60}, {'uri': 'https://link-14', 'words_count': 56}, {'uri': 'https://link-15', 'words_count': 52}, {'uri': 'https://link-16', 'words_count': 65}, {'uri': 'https://link-17', 'words_count': 56}]
['A man is eating food.', 'A man is eating a piece of bread.', 'The family is having a picnic under the shady oak tree.', 'A group of friends is hiking up the mountain trail.', 'The mechanic is repairing a broken-down car in the garage.', 'The old man is feeding breadcrumbs to the ducks at the pond.', 'The artist is sketching a beautiful landscape at sunset.', 'A man is painting a colorful mural on the city wall.', 'A te
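
The ids in the output above are listed as id-0, id-1, id-10, id-11, ..., which looks like string ordering rather than numeric ordering. If a numeric-looking order matters, zero-padding the ids is one option. This is a sketch; the padding width of 3 is an arbitrary assumption.

```python
ids_plain = ["id-" + str(i) for i in range(12)]
ids_padded = [f"id-{i:03d}" for i in range(12)]

# String sorting puts "id-10" before "id-2" unless the numbers are padded
print(sorted(ids_plain)[:4])    # ['id-0', 'id-1', 'id-10', 'id-11']
print(sorted(ids_padded)[:4])   # ['id-000', 'id-001', 'id-002', 'id-003']
```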

## 4. Query

https://docs.trychroma.com/usage-guide#querying-a-collection

Carry out an approximate nearest neighbor (ANN) search over the embeddings.

In [9]:
result = collection.query(
    query_texts = ["I like to cook"], # "small child is having fun"],
    n_results = 3,
    # include=["documents"],             # projections
    # where={"words_count": {"$lt": 50}},
    # where_document={"$contains":"chef"},
    
)

print(result['documents'])
print(result['ids'])
print(result['metadatas'])

[['The chef is preparing a delicious meal in the kitchen.', 'A chef is tossing vegetables in a sizzling pan.']]
[['id-2', 'id-3']]
[[{'uri': 'https://link-2', 'words_count': 54}, {'uri': 'https://link-3', 'words_count': 47}]]
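
`query()` returns one inner list per entry in `query_texts`, which is why the output above is nested. Below is a small sketch for turning that into readable rows. The `result` dict is copied from the output above so the snippet runs without a database, and `rows_for_query` is a hypothetical helper.

```python
def rows_for_query(result, query_index=0):
    """Pair up ids, documents and metadatas for one query text."""
    return list(zip(
        result["ids"][query_index],
        result["documents"][query_index],
        result["metadatas"][query_index],
    ))

# Shape of a query result, using the values printed above
result = {
    "ids": [["id-2", "id-3"]],
    "documents": [[
        "The chef is preparing a delicious meal in the kitchen.",
        "A chef is tossing vegetables in a sizzling pan.",
    ]],
    "metadatas": [[
        {"uri": "https://link-2", "words_count": 54},
        {"uri": "https://link-3", "words_count": 47},
    ]],
}

for doc_id, document, metadata in rows_for_query(result):
    print(doc_id, metadata["uri"], document)
```

With several query texts, `query_index` selects which query's matches to read.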


## 5. Get

Retrieve items by their ids.

In [25]:
# Get documents from the collection by id
result = collection.get(
    ids = ["id-17", "id-21"],
    where = {"words_count": {"$gt": 0}},
    # where_document={"$contains":"chef"},
    include = ["documents", "metadatas"]    # projections
)

result

{'ids': [],
 'embeddings': None,
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None}

# Part-2 : Cohere custom embedding
You can specify your own embedding function.

Chroma provides lightweight wrappers for multiple popular models.

https://docs.trychroma.com/embeddings

In [12]:
from chromadb.utils import embedding_functions

## 1. Read Cohere API Key

In [13]:
from dotenv import load_dotenv
import os

import warnings

warnings.filterwarnings("ignore")

# Load the file that contains the API keys
load_dotenv('C:\\Users\\raj\\.jupyter\\.env')

COHERE_API_KEY = os.getenv('COHERE_API_KEY')
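
`os.getenv` returns `None` when the variable is missing, so a wrong `.env` path may only surface later, when the embedding function is used. A minimal sketch of failing early instead; `require_env` is a hypothetical helper, not part of `dotenv` or Chroma.

```python
import os

def require_env(name):
    """Return the environment variable's value or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; check the .env file path")
    return value
```

It could be used in place of the plain lookup, for example `COHERE_API_KEY = require_env('COHERE_API_KEY')`.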

## 2. Create the Cohere Embedding Function

In [14]:
# https://docs.cohere.com/docs/models#embed
# Check documentation for embedding size
model_name = 'embed-english-light-v2.0'
embedding_dimension = 1024

cohere_ef  = embedding_functions.CohereEmbeddingFunction(
        api_key=COHERE_API_KEY, 
        model_name=model_name)

## 3. Create collection with Cohere embeddings

In [15]:
# Set up the collection name - use a naming convention, add metadata
cohere_collection_name = "testset_cohere-embed-english-light-v2.0"

collection_cohere = client.get_or_create_collection(name=cohere_collection_name, embedding_function=cohere_ef)

collection_cohere

Collection(name=testset_cohere-embed-english-light-v2.0)

## 4. Add documents

In [16]:
collection_cohere.add(documents = corpus, metadatas=metadatas, ids=ids)

In [17]:
collection_cohere.count()

28

## 5. Query

In [19]:
result = collection_cohere.query(
    query_texts = ["I like to cook"], # "small child is having fun"],
    n_results = 3,
    # include=["documents"],             # projections
    # where={"words_count": {"$lt": 50}},
    # where_document={"$contains":"chef"},
    
)

print(result['documents'])
print(result['ids'])
print(result['metadatas'])

[['The chef is preparing a delicious meal in the kitchen.', 'A chef is tossing vegetables in a sizzling pan.', 'A man is eating food.']]
[['id-2', 'id-3', 'id-0']]
[[{'uri': 'https://link-2', 'words_count': 54}, {'uri': 'https://link-3', 'words_count': 47}, {'uri': 'https://link-0', 'words_count': 21}]]
