<a href="https://colab.research.google.com/github/Muntasir2179/vector-database-learning/blob/main/VD_ChromaDB.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chroma DB

[Chroma](https://docs.trychroma.com/) is an open-source vector database with built-in support for generating embeddings.

**Working principle:** Like any database, a vector database stores data; here the data is embeddings. An embedding is a learned numerical (vector) representation of a piece of data. When data (documents, images, audio, video, etc.) is passed into the vector database, it is converted into an embedding by an embedding model and then stored.

> We can choose our own embedding model if we like. The [HuggingFace](https://huggingface.co/models) hub is a good place to find an embedding model that better suits our problem.
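To make the "data in, vector out" idea concrete, here is a minimal sketch of what an embedding step does. `toy_embed` is a hypothetical stand-in, not a real model: it hashes words into a fixed number of buckets, whereas a real embedding model (such as the ones on the HuggingFace hub) learns its representation from data. The dimension of 8 is an arbitrary choice for illustration.

```python
import hashlib
import math
from typing import Dict, List


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Hypothetical stand-in for an embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # A stable hash decides which dimension each word contributes to.
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    # Normalise to unit length; an empty text stays the zero vector.
    return [x / norm for x in vec] if norm else vec


# A "vector database" in miniature: ids mapped to fixed-length vectors.
store: Dict[str, List[float]] = {}
for doc_id, doc in [("id1", "This is a document"), ("id2", "This is another document")]:
    store[doc_id] = toy_embed(doc)

print(len(store["id1"]))  # every document becomes a vector of the same length (8 here)
```

Whatever the input text, the output always has the same length, which is what lets the database compare any two items numerically.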

<br>

<img src="https://docs.trychroma.com/img/hrm4.svg" height="300">

In [1]:
# `!pip` runs in a subshell and does not raise a Python exception on failure,
# so a try/except around it cannot detect a failed install; check the pip output instead.
!pip install python-multipart
!pip install kaleido
!pip install typing-extensions==4.5.0

Collecting python-multipart
  Downloading python_multipart-0.0.6-py3-none-any.whl (45 kB)
Installing collected packages: python-multipart
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
lida 0.0.10 requires fastapi, which is not installed.
lida 0.0.10 requires kaleido, which is not installed.
lida 0.0.10 requires uvicorn, which is not installed.
Successfully installed python-multipart-0.0.6
Collecting kaleido
  Downloading kaleido-0.2.1-py2.py3-none-manylinux1_x86_64.whl (79.9 MB)
Installing collected packages: kaleido
ERROR: pip's dependency resolver does not currently take into account a

In [2]:
!pip install chromadb -q


In [3]:
!pip install sentence-transformers -q

  Preparing metadata (setup.py) ... done
  Building wheel for sentence-transformers (setup.py) ... done


## What is a Collection?

> A collection is a named container inside the database that stores embeddings together with their documents and metadata. It plays roughly the role that a table plays in a relational database.

In [4]:
import chromadb
client = chromadb.Client()

In [5]:
collection1 = client.create_collection(name="my_collection1", get_or_create=True)
print(collection1)

name='my_collection1' id=UUID('f8173d5c-093f-4aa1-b730-0f3230590e97') metadata=None tenant='default_tenant' database='default_database'


## Adding some text documents to the collection

In [6]:
collection1.add(
    # the documents we want to add to the collection
    documents=["This is a document", "This is another document"],
    # metadata: extra information about each document, one dict per document
    metadatas=[{"source": "source1", "language": "en"}, {"source": "source2", "language": "bangla"}],
    # a distinct id for each document
    ids=["id1", "id2"]
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:02<00:00, 29.5MiB/s]


In [7]:
# collection-management methods are called on the client
client.list_collections()  # checking which collections have been created so far

[Collection(name=my_collection1)]

In [8]:
collection2 = client.create_collection(name="my_collection2",
                                       get_or_create=True) # returns the collection if it is already created

In [9]:
client.list_collections()

[Collection(name=my_collection2), Collection(name=my_collection1)]

In [10]:
# get an existing collection
collect = client.get_collection("my_collection1")
collect

Collection(name=my_collection1)

In [11]:
# get a collection or create if it doesn't exist
collection3 = client.get_or_create_collection("my_collection3")
client.list_collections()

[Collection(name=my_collection3),
 Collection(name=my_collection2),
 Collection(name=my_collection1)]

In [12]:
# removing collection
client.delete_collection("my_collection3")
client.list_collections()

[Collection(name=my_collection2), Collection(name=my_collection1)]

In [13]:
# heartbeat - check whether the Chroma server is still responding
# it returns the current server time as a nanosecond timestamp,
# so the number changes each time we run the heartbeat() function
client.heartbeat()

1704516373507141322
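The raw heartbeat value is hard to read. A small sketch for converting it into a readable time, assuming the value is a Unix timestamp in nanoseconds (which matches the magnitude of the number above); `heartbeat_to_datetime` is a helper name introduced here for illustration, not part of the Chroma API.

```python
from datetime import datetime, timezone


def heartbeat_to_datetime(heartbeat_ns: int) -> datetime:
    """Convert a nanosecond Unix timestamp into a timezone-aware UTC datetime."""
    # Split into whole seconds and leftover nanoseconds to avoid float rounding.
    seconds, nanos = divmod(heartbeat_ns, 1_000_000_000)
    # datetime only has microsecond resolution, so the last three digits are dropped.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=nanos // 1000)


ts = heartbeat_to_datetime(1704516373507141322)  # the value printed above
print(ts.isoformat())  # 2024-01-06T04:46:13.507141+00:00
```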

In [None]:
# reset() is disabled by default, so first change the settings to allow it
client.get_settings().allow_reset = True

# reset() empties the whole database: it deletes every collection
# we have created so far, and this cannot be undone
client.reset() # ⚠️ not recommended
client.list_collections()

## Methods on Collections

Vector DB -> Collections -> Documents/Items

In [14]:
# count the items in the collection
collection1.count()

2

In [15]:
# get all the items from the collection
# (embeddings are not returned unless we pass include=["embeddings"])
collection1.get()

{'ids': ['id1', 'id2'],
 'embeddings': None,
 'metadatas': [{'language': 'en', 'source': 'source1'},
  {'language': 'bangla', 'source': 'source2'}],
 'documents': ['This is a document', 'This is another document'],
 'uris': None,
 'data': None}

In [21]:
import numpy as np

# create some dummy embeddings
# 384 is the output dimension of Chroma's default embedding model (all-MiniLM-L6-v2)
embedding_doc3 = np.random.rand(384).tolist()
embedding_doc4 = np.random.rand(384).tolist()

# add new items to the collection with precomputed embeddings
collection1.add(
    embeddings=[embedding_doc3, embedding_doc4],
    documents=["This is document 3", "This is document 4"],
    metadatas=[{"source": "source3", "language": "en"}, {"source": "source4"}],
    ids=["id3", "id4"]
)

In [22]:
collection1.get()

{'ids': ['id1', 'id2', 'id3', 'id4'],
 'embeddings': None,
 'metadatas': [{'language': 'en', 'source': 'source1'},
  {'language': 'bangla', 'source': 'source2'},
  {'language': 'en', 'source': 'source3'},
  {'source': 'source4'}],
 'documents': ['This is a document',
  'This is another document',
  'This is document 3',
  'This is document 4'],
 'uris': None,
 'data': None}

In [25]:
# upsert() updates the items whose ids already exist and inserts the ones that do not
# no embeddings are passed here, so they are recomputed from the documents
collection1.upsert(
    documents=["Doc3", "This is document 4"],
    metadatas=[{"source": "source3", "language": "en"}, {"source": "source4"}],
    ids=["id3", "id4"]
)

In [26]:
collection1.get()

{'ids': ['id1', 'id2', 'id3', 'id4'],
 'embeddings': None,
 'metadatas': [{'language': 'en', 'source': 'source1'},
  {'language': 'bangla', 'source': 'source2'},
  {'language': 'en', 'source': 'source3'},
  {'source': 'source4'}],
 'documents': ['This is a document',
  'This is another document',
  'Doc3',
  'This is document 4'],
 'uris': None,
 'data': None}
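The update-or-insert behaviour seen above can be modelled with a plain dictionary. This is only a sketch of the semantics, not how ChromaDB implements `upsert()`; the `upsert` function, the `store` dictionary, and the id `"id5"` are hypothetical names used for illustration.

```python
from typing import Dict, List


def upsert(store: Dict[str, str], ids: List[str], documents: List[str]) -> None:
    """Update the document for ids that already exist, insert the others."""
    if len(ids) != len(documents):
        raise ValueError("ids and documents must have the same length")
    for item_id, doc in zip(ids, documents):
        # Dictionary assignment covers both cases: overwrite or create.
        store[item_id] = doc


store = {"id3": "This is document 3", "id4": "This is document 4"}
upsert(store, ids=["id3", "id5"], documents=["Doc3", "A brand new document"])

print(store["id3"])  # Doc3 (existing id, so the document was replaced)
print(len(store))    # 3 (the new id was inserted)
```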

In [34]:
# peek() returns the first few items in the collection (10 by default), including their embeddings
# here we check the embedding length of the third item
len(collection1.peek()["embeddings"][2])

384

## Now we will wrap up the whole thing with a broader demonstration

In [52]:
movie_collection = client.create_collection(name="matrix_movie_collection", get_or_create=True)
print(movie_collection)

name='matrix_movie_collection' id=UUID('6c0f6d42-1b0a-488c-b784-c5bcf5a43d07') metadata=None tenant='default_tenant' database='default_database'


In [53]:
# if we want to rename the collection
movie_collection.modify(name="movie_collection")
print(movie_collection)

name='movie_collection' id=UUID('6c0f6d42-1b0a-488c-b784-c5bcf5a43d07') metadata=None tenant='default_tenant' database='default_database'


In [54]:
# checking the number of items in the collection
movie_collection.count()

0

### Distance function in ChromaDB

In ChromaDB, the distance function determines how the "distance" or "difference" between two items in a collection is calculated. This is crucial for operations such as querying for similar items. The default is `"l2"`, the squared Euclidean (L2) distance; `"cosine"` and `"ip"` (inner product) are the other supported options.

https://docs.trychroma.com/usage-guide#changing-the-distance-function

There is currently no way to plug in a custom distance function: https://github.com/langchain-ai/langchain/issues/2595
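To see how the three options differ, here is a plain-Python sketch of the formulas as I understand them from the Chroma usage guide linked above. It is not Chroma's implementation (the real computation happens inside its HNSW index), and the function names are chosen here for illustration.

```python
import math
from typing import Sequence


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """"l2": sum of squared differences (no square root is taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """"ip": one minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """"cosine": one minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


a, b = [3.0, 4.0], [6.0, 8.0]  # same direction, different length
print(squared_l2(a, b))              # 25.0
print(inner_product_distance(a, b))  # -49.0
print(cosine_distance(a, b))         # approximately 0: direction is identical
```

The example vectors point in the same direction, so cosine distance treats them as identical while `"l2"` does not. That is why cosine is a common choice for text embeddings, where vector length carries little meaning.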

In [55]:
movie_collection = client.get_or_create_collection(name="movie_collection",
                                                   metadata={"hnsw:space": "cosine"})
print(movie_collection)

name='movie_collection' id=UUID('6c0f6d42-1b0a-488c-b784-c5bcf5a43d07') metadata={'hnsw:space': 'cosine'} tenant='default_tenant' database='default_database'


In [56]:
client.list_collections()

[Collection(name=movie_collection),
 Collection(name=my_collection2),
 Collection(name=my_collection1)]

In [57]:
# let's add some data to our collection
movie_collection.add(
    ids=["quote1", "quote2"],
    documents=[
        "There is no spoon.",
        "I know kung fu."
    ]
)

In [58]:
movie_collection.count()

2

In [60]:
movie_collection.peek(limit=1)  # we can specify the number of items we want

{'ids': ['quote1'],
 'embeddings': [[0.004506412893533707,
   -0.07763516902923584,
   -0.038877587765455246,
   -0.01235272828489542,
   -0.08395075052976608,
   0.04847825691103935,
   0.027022896334528923,
   -0.08260542154312134,
   0.07585711777210236,
   0.016495604068040848,
   0.034576863050460815,
   -0.0697631761431694,
   -0.012515238486230373,
   -0.05832795426249504,
   -0.0736757442355156,
   -0.12272055447101593,
   -0.0331801138818264,
   -0.10826481133699417,
   -0.010775878094136715,
   0.0024138721637427807,
   0.03132103383541107,
   0.000363168801413849,
   0.057470910251140594,
   -0.01934056729078293,
   0.06213092431426048,
   0.05513307452201843,
   0.019474970176815987,
   -0.06181753799319267,
   -0.025465352460741997,
   0.06344398111104965,
   -0.019318535923957825,
   -0.005409401375800371,
   -0.08224662393331528,
   -0.04527893662452698,
   0.037652596831321716,
   0.007059104740619659,
   0.050451889634132385,
   0.040108174085617065,
   0.0501829385757