# **Introduction to ChromaDB**

<img src="Images/trychroma.png">

**credits:** https://www.trychroma.com/

---

<img src="Images/chromadb_img.png">

**credits:** [GitHub repository](https://github.com/bansalkanav/Generative-AI-Scratch-2-Advance-By-ThatAIGuy/blob/main/6.%20Building%20Apps%20Powered%20by%20GenAI%20using%20LangChain/1.%20Introduction%20to%20LangChain/7.%20VectorDB%20-%20ChromaDB/intro_to_chromadb.ipynb)

### **What is ChromaDB?**

ChromaDB (Chroma) is an open-source vector database designed for AI applications.

### **Features**
1. **Has everything we need for retrieval**
    - Store document embeddings and their metadata
    - Search embeddings
    - Full-text search
    - Metadata filtering
    - Multi-modal retrieval
2. **Free and Open source**
3. **Integrations** 
    - Works with HuggingFace, OpenAI, Google, LangChain and more.
4. **Simple to Get Started**
    - ```pip install chromadb```

In [1]:
!pip show chromadb

Name: chromadb
Version: 0.4.24
Summary: Chroma.
Home-page: https://github.com/chroma-core/chroma
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: c:\users\91889\anaconda3\envs\geminienv\lib\site-packages
Requires: bcrypt, build, chroma-hnswlib, fastapi, grpcio, importlib-resources, kubernetes, mmh3, numpy, onnxruntime, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-instrumentation-fastapi, opentelemetry-sdk, orjson, overrides, posthog, pulsar-client, pydantic, pypika, PyYAML, requests, tenacity, tokenizers, tqdm, typer, typing-extensions, uvicorn
Required-by: crewai-tools, embedchain


In [2]:
import chromadb

client = chromadb.PersistentClient(path="vector_store")

In [3]:
client.heartbeat()

1732970215210755100

In [4]:
# Warning: Empties and completely resets the database.

# client.reset() 

## **Create a Collection**

**Parameters**
- **name:** Select a `name` for your collection.
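- **embedding_function:** By default, Chroma uses the Sentence Transformers `all-MiniLM-L6-v2` model to create embeddings.
- **metadata={"hnsw:space": "cosine"}:** Sets the distance function used for similarity search. The default is `l2` (squared L2 distance); `cosine` and `ip` (inner product) are the other options.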

<img src="Images/metadatas.png">
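
To make the `hnsw:space` options concrete, here is a small pure-Python sketch of the three distance functions as Chroma's documentation describes them: `l2` is the squared L2 distance, `ip` is one minus the inner product, and `cosine` is one minus the cosine similarity. The helper names and the toy vectors are illustrative only and are not part of the Chroma API.

```python
import math


def squared_l2(a, b):
    """'l2' space: sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_product_distance(a, b):
    """'ip' space: one minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """'cosine' space: one minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Two orthogonal toy vectors of different lengths.
a = [1.0, 0.0]
b = [0.0, 2.0]

distances = {
    "l2": squared_l2(a, b),
    "ip": inner_product_distance(a, b),
    "cosine": cosine_distance(a, b),
}
print(distances)
```

In all three spaces a smaller value means a closer match. Only `l2` is affected by the length of `b` in this example, which is one reason `cosine` is a common choice for text embeddings.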

In [5]:
collections = client.create_collection(name="my_first_collection")

**Alternative:** You can also use `client.get_or_create_collection()`, which returns the existing collection when the name is already taken instead of raising an error.

In [6]:
# returns the number of items in the collection

collections.count()

0

In [7]:
# returns the first 10 items in the collection (as a dict of lists)

collections.peek()

{'ids': [],
 'embeddings': [],
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None}

In [8]:
collections

Collection(name=my_first_collection)

In [9]:
# Rename the collection

collections.modify(name="first_collection")

In [10]:
collections

Collection(name=first_collection)

## **Get An Already Existing Collection**

In [4]:
collections = client.get_collection(name="first_collection")

collections

Collection(name=first_collection)

## **Adding data to a Collection**

In [12]:
collections.add(
    documents=[
        "This is my first document",
        "This is my second document"
    ],
    metadatas=[{'key_1': 'abc_1', 'key_2': 'abc_2'}, {'key_1': 'xyz_1', 'key_2': 'xyz_2'}],
    ids=['id1', 'id2']
)

C:\Users\91889\.cache\chroma\onnx_models\all-MiniLM-L6-v2\onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:24<00:00, 3.33MiB/s]


In [20]:
# collections.peek()

## **Search Embeddings**

In [18]:
result = collections.query(
    query_texts = ["first"],
    n_results = 1
)

In [19]:
result

{'ids': [['id1']],
 'distances': [[1.1781862767622826]],
 'metadatas': [[{'key_1': 'abc_1', 'key_2': 'abc_2'}]],
 'embeddings': None,
 'documents': [['This is my first document']],
 'uris': None,
 'data': None}
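
The value under `distances` is a squared L2 distance here, because the collection was created with the default space, so a smaller number means a closer match. If the embeddings are unit-length (an assumption you should verify for your embedding function), squared L2 distance `d` and cosine similarity are related by `d = 2 - 2 * cos`. The helper below is illustrative and not part of Chroma.

```python
import math


def cosine_from_squared_l2(distance):
    """Cosine similarity implied by a squared L2 distance between unit vectors."""
    return 1.0 - distance / 2.0


# Sanity check on two hand-made unit vectors that are 60 degrees apart.
u = [1.0, 0.0]
v = [math.cos(math.pi / 3), math.sin(math.pi / 3)]
squared_l2 = sum((x - y) ** 2 for x, y in zip(u, v))
dot = sum(x * y for x, y in zip(u, v))
recovered = cosine_from_squared_l2(squared_l2)

# Applied to the distance reported for the query "first" above.
implied = cosine_from_squared_l2(1.1781862767622826)
print(round(recovered, 6), round(dot, 6), round(implied, 6))
```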

In [23]:
result = collections.query(
    query_texts = ["second"],
    n_results = 1
)

result

{'ids': [['id2']],
 'distances': [[1.1943339473992967]],
 'metadatas': [[{'key_1': 'xyz_1', 'key_2': 'xyz_2'}]],
 'embeddings': None,
 'documents': [['This is my second document']],
 'uris': None,
 'data': None}
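
Every field in the result is a list of lists: the outer list has one entry per query text, and the inner list has one entry per returned item. A small helper can flatten this into one record per hit. The function name `flatten_query_result` is made up for this sketch, and the input is the result printed above.

```python
def flatten_query_result(result):
    """Turn the nested per-query lists into a flat list of hit dicts."""
    hits = []
    for query_index, ids in enumerate(result["ids"]):
        for rank, item_id in enumerate(ids):
            hits.append(
                {
                    "query_index": query_index,
                    "rank": rank,
                    "id": item_id,
                    "document": result["documents"][query_index][rank],
                    "metadata": result["metadatas"][query_index][rank],
                    "distance": result["distances"][query_index][rank],
                }
            )
    return hits


# The result of the "second" query, copied from the output above.
result = {
    "ids": [["id2"]],
    "distances": [[1.1943339473992967]],
    "metadatas": [[{"key_1": "xyz_1", "key_2": "xyz_2"}]],
    "embeddings": None,
    "documents": [["This is my second document"]],
    "uris": None,
    "data": None,
}

hits = flatten_query_result(result)
for hit in hits:
    print(hit["id"], hit["distance"], hit["document"])
```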

In [None]:
# adding a new item to the collection
collections.add(
    documents=["This is a document about pineapple"],
    metadatas=[{'key_1': 'pqrs_1', 'key_2': 'pqrs_2'}],
    ids=['id3']
)

In [22]:
# search for the item closest to the query text "apple"
result = collections.query(
    query_texts=["apple"],
    n_results=1
)

result

{'ids': [['id3']],
 'distances': [[1.602426156795167]],
 'metadatas': [[{'key_1': 'pqrs_1', 'key_2': 'pqrs_2'}]],
 'embeddings': None,
 'documents': [['This is a document about pineapple']],
 'uris': None,
 'data': None}

In [5]:
collections.peek()

{'ids': ['id1', 'id2', 'id3'],
 'embeddings': [[-0.03435761108994484,
   0.08950142562389374,
   0.05766802281141281,
   0.05894389748573303,
   0.006176931783556938,
   0.009239946492016315,
   0.01215403713285923,
   0.0368492491543293,
   -0.006383647210896015,
   0.04612163454294205,
   0.001853765221312642,
   0.09664405882358551,
   0.022192206233739853,
   -0.024687903001904488,
   -0.09719019383192062,
   0.024644607678055763,
   0.007520821411162615,
   -0.03820456564426422,
   -0.006342257838696241,
   0.04306107014417648,
   -0.02518555521965027,
   0.06134136766195297,
   0.016563862562179565,
   -0.043258123099803925,
   -0.021922936663031578,
   0.07426302134990692,
   -0.055368319153785706,
   0.04033143073320389,
   0.008668136782944202,
   -0.11814282834529877,
   -0.006578274071216583,
   -0.0022475020959973335,
   0.11545643955469131,
   0.0235588438808918,
   0.08048868179321289,
   0.018548930063843727,
   0.09033994376659393,
   -0.06405339390039444,
   0.04453029

## **Updating data in a collection**

In [12]:
collections.update(
    ids=['id1', 'id2'],
    metadatas=[{'doc1_key_1': 'abc_1', 'doc1_key_2': 'abc_2'}, {'doc2_key_1': 'xyz_1', 'doc2_key_2': 'xyz_2'}],
    documents=[
        "This is a document about apple",
        "This is a document about banana"
    ]
)

In [13]:
collections.peek()

{'ids': ['id1', 'id2', 'id3'],
 'embeddings': [[-0.024620406329631805,
   0.07059556245803833,
   0.05013927444815636,
   -0.018136925995349884,
   0.04086008295416832,
   -0.008825192227959633,
   0.015652043744921684,
   0.022105390205979347,
   0.05109895020723343,
   0.03826845437288284,
   0.05871409550309181,
   0.10319343209266663,
   0.001710479729808867,
   -0.04735669493675232,
   -0.042706962674856186,
   0.004835282452404499,
   -0.04602249711751938,
   -0.060991182923316956,
   -0.024713413789868355,
   0.05953214317560196,
   0.06955622881650925,
   0.050125978887081146,
   0.02397453412413597,
   0.0052350289188325405,
   0.004579909145832062,
   0.05848690867424011,
   -0.02647259831428528,
   -0.04277157038450241,
   0.006546068470925093,
   -0.005419732071459293,
   -0.012026013806462288,
   0.038383662700653076,
   0.12025418877601624,
   0.04461690038442612,
   -0.0019964484963566065,
   -0.06669319421052933,
   0.0991801768541336,
   -0.00410090247169137,
   -0.004

## **Deleting data from a collection**

In [14]:
collections.delete(
    ids=['id1', 'id2']
)

In [15]:
collections.peek()

{'ids': ['id3'],
 'embeddings': [[-0.0070936777628958225,
   0.0655461773276329,
   -0.011660597287118435,
   0.08644583821296692,
   -0.039543576538562775,
   0.058005254715681076,
   -0.024075627326965332,
   0.014950579963624477,
   0.009917323477566242,
   -0.021074064075946808,
   0.0810110867023468,
   0.034703198820352554,
   -0.02125372923910618,
   -0.013366407714784145,
   -0.03322690725326538,
   0.020637447014451027,
   0.0035065559204667807,
   0.038441531360149384,
   -0.015314367599785328,
   0.07312286645174026,
   0.05147815868258476,
   0.10698502510786057,
   -0.04180420562624931,
   0.03101285733282566,
   -0.01113535463809967,
   -0.013799767941236496,
   -0.022165443748235703,
   -0.06409504264593124,
   -0.02690820023417473,
   -0.012651787139475346,
   -0.004961030557751656,
   0.05542483925819397,
   0.1174350380897522,
   0.05173472687602043,
   -0.051716163754463196,
   -0.03095361590385437,
   0.10146088153123856,
   -0.0695117935538292,
   0.160278081893920