# **Introduction to Vector Databases - ChromaDB**

Chroma is the open-source AI application database.

[Click here](https://www.trychroma.com/) to visit the official website.

<img src="images/chromadb_1.png">

### **Features**
1. **Has everything we need for retrieval**
    - Store document embeddings and their metadata
    - Search embeddings
    - Full-text search
    - Metadata filtering
    - Multi-modal retrieval
2. **Free and Open source**
3. **Integrations**
    - Works with HuggingFace, OpenAI, Google, LangChain and more.
4. **Simple to Get Started**
    - ```pip install chromadb```
  
### **Syntax**
```python
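import chromadb
from chromadb.utils import embedding_functions

# Default embedding function (all-MiniLM-L6-v2); swap in any other embedding function here
emb_fn = embedding_functions.DefaultEmbeddingFunction()

# Initialize a persistent Chroma client; data is saved under the given path
client = chromadb.PersistentClient(path="/path/to/save/to")

# Create a new collection, or get it if it already exists
collection = client.get_or_create_collection(
    name="my_collection",
    embedding_function=emb_fn,
    metadata={"hnsw:space": "cosine"},  # use cosine distance for the HNSW index
)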

# add embeddings and documents
collection.add(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ],
    metadatas=[{"key_1": "value_1", "key_2": "value_2"}, {"key_1": "value_1", "key_2": "value_2"}],
    ids=["id1", "id2"]
)

# get back similar embeddings
# Note that Chroma will embed query_texts for you and return n_results
results = collection.query(
    query_texts=["This is a query document about hawaii"],
    n_results=2
)

# switch `add` to `upsert` to avoid adding the same documents every time
collection.upsert(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ],
    ids=["id1", "id2"]
)

```
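
The `hnsw:space` key in the snippet above selects the distance function used by the index. As a rough plain-Python illustration (not Chroma's implementation), the three spaces Chroma documents, `l2`, `cosine`, and `ip`, can be sketched as below. The helper names and the toy vectors are assumptions made up for this example.

```python
import math

def squared_l2(a, b):
    """'l2' space: squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """'cosine' space: 1 minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def inner_product_distance(a, b):
    """'ip' space: 1 minus the dot product."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

# Toy 2-D vectors (hypothetical, for illustration only)
query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # points the same way, but is twice as long
orthogonal = [0.0, 1.0]       # at a right angle to the query

for name, vec in [("same_direction", same_direction), ("orthogonal", orthogonal)]:
    print(name,
          squared_l2(query, vec),
          cosine_distance(query, vec),
          inner_product_distance(query, vec))
```

Cosine distance ignores vector length, so `same_direction` is at distance 0 from the query under `cosine` even though its squared L2 distance is 1. In every space, a smaller value means a closer match.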

[Click here](https://docs.trychroma.com/guides) to read the complete chromadb guide.

In [1]:
!pip install chromadb

Collecting chromadb
  Downloading chromadb-0.5.17-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting fastapi>=0.95.2 (from chromadb)
  Downloading fastapi-0.115.4-py3-none-any.whl.metadata (27 kB)
Collecting uvicorn>=0.18.3 (from uvicorn[standard]>=0.18.3->chromadb)
  Downloading uvicorn-0.32.0-py3-none-any.whl.metadata (6.6 kB)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.7.0-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting onnxruntime>=1.14.1 (from chromadb)
  Downloading onnxruntime-1.19.2-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.5 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.27.0-py3

In [2]:
import chromadb

In [3]:
# Initializing a persistent chroma client

client = chromadb.PersistentClient(path = 'vector_store')

In [4]:
client.heartbeat()
# client.heartbeat() returns the current time in nanoseconds. It is a quick way to check that the client can still reach the Chroma backend, which is most useful in client-server setups.

1730465804285220369
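
The number above is easier to read as a date. A minimal sketch, assuming the heartbeat is a Unix timestamp in nanoseconds (the variable names are illustrative):

```python
from datetime import datetime, timezone

heartbeat_ns = 1730465804285220369  # value printed by client.heartbeat() above

# Integer-divide by 1e9 to get whole seconds since the Unix epoch, then convert to UTC
heartbeat_time = datetime.fromtimestamp(heartbeat_ns // 1_000_000_000, tz=timezone.utc)
print(heartbeat_time.isoformat())  # 2024-11-01T12:56:44+00:00
```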

In [5]:
# Create a new collection (raises an error if the name already exists; use get_or_create_collection to avoid that)
collection = client.create_collection(name = 'my_first_collection')

In [6]:
collection

Collection(id=0b5f079c-1d17-4256-bbc1-460abf4f58bd, name=my_first_collection)

In [7]:
collection.count()

0

In [8]:
collection.peek()

{'ids': [],
 'embeddings': array([], dtype=float64),
 'documents': [],
 'uris': None,
 'data': None,
 'metadatas': [],
 'included': [<IncludeEnum.embeddings: 'embeddings'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [9]:
collection.modify(name ='new_name')

In [10]:
collection = client.get_collection(name = 'new_name')

In [11]:
collection

Collection(id=0b5f079c-1d17-4256-bbc1-460abf4f58bd, name=new_name)

In [12]:
# Add Embeddings and documents

collection.add(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ],
    metadatas=[{"key_1": "value_1", "key_2": "value_2"}, {"key_1": "value_1", "key_2": "value_2"}],
    ids=["id1", "id2"]
)

/root/.cache/chroma/onnx_models/all-MiniLM-L6-v2/onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:02<00:00, 39.7MiB/s]
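
The `metadatas` stored with each document are what metadata filtering operates on. In Chroma this is exposed through the `where` argument of `collection.get` and `collection.query`. The sketch below is a plain-Python illustration of an equality filter, not Chroma's implementation; the `filter_records` helper and the `value_3` entry are assumptions added so that the filter has something to exclude.

```python
def filter_records(records, where):
    """Keep records whose metadata contains every key/value pair in `where`."""
    return [
        rec for rec in records
        if all(rec["metadata"].get(key) == value for key, value in where.items())
    ]

# Hypothetical records; the second one uses a different value for key_2
records = [
    {"id": "id1", "document": "This is a document about pineapple",
     "metadata": {"key_1": "value_1", "key_2": "value_2"}},
    {"id": "id2", "document": "This is a document about oranges",
     "metadata": {"key_1": "value_1", "key_2": "value_3"}},
]

matching = filter_records(records, {"key_2": "value_2"})
print([rec["id"] for rec in matching])  # ['id1']
```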


In [13]:
# Query the collection: Chroma embeds the query text and returns the single nearest document

results = collection.query(query_texts=['apple'], n_results=1)
results

{'ids': [['id1']],
 'embeddings': None,
 'documents': [['This is a document about pineapple']],
 'uris': None,
 'data': None,
 'metadatas': [[{'key_1': 'value_1', 'key_2': 'value_2'}]],
 'distances': [[1.6024261567951665]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}

In [15]:
# Query with a misspelled word and return the 2 nearest documents, closest first
results1 = collection.query(query_texts=['orangess'], n_results=2)
results1

{'ids': [['id2', 'id1']],
 'embeddings': None,
 'documents': [['This is a document about oranges',
   'This is a document about pineapple']],
 'uris': None,
 'data': None,
 'metadatas': [[{'key_1': 'value_1', 'key_2': 'value_2'},
   {'key_1': 'value_1', 'key_2': 'value_2'}]],
 'distances': [[0.8239155035921554, 1.6875501252905905]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: 'metadatas'>]}