# **Vector Database (VectorDB)**

A Vector Database (VectorDB) is a specialized database designed to store, index, and query high-dimensional vectors (embeddings), which are numerical representations of data such as text, images, or audio. It is commonly used in applications such as semantic search, recommendation systems, and natural language processing, where results are ranked by similarity between data points rather than by exact matches. VectorDBs answer nearest-neighbor queries efficiently by using approximate nearest-neighbor (ANN) indexes such as HNSW, which makes them well suited to large-scale machine learning and AI workloads.
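To make the idea of nearest-neighbor search concrete, here is a minimal brute-force sketch in plain Python. It is only an illustration: the function names and the tiny 3-dimensional vectors are hypothetical, and a real VectorDB replaces the linear scan with an index so that it does not have to compare the query against every stored vector.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query, vectors, k=1):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        vectors,
        key=lambda vid: cosine_similarity(query, vectors[vid]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy "embeddings"; real models produce hundreds of dimensions.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.3, 0.1],
    "car": [0.0, 0.2, 0.9],
}

print(nearest_neighbors([0.85, 0.2, 0.0], toy_vectors, k=2))  # ['cat', 'dog']
```

The scan costs one similarity computation per stored vector, which is fine for a handful of items but is the part that an index is meant to avoid at scale.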

**Different VectorDBs:**
- **ChromaDB**: Lightweight and good for local projects.
- **Pinecone**: Great for managed, scalable solutions.
- **FAISS (Facebook AI Similarity Search)**: A library for efficient similarity search and clustering of dense vectors.
- **Milvus**: Ideal for open-source, large-scale AI applications.

# **Introduction to ChromaDB**

<img src="Images/trychroma.png">

**credits:** https://www.trychroma.com/

---

<img src="Images/chromadb_img.png">

**credits:** [GitHub repository](https://github.com/bansalkanav/Generative-AI-Scratch-2-Advance-By-ThatAIGuy/blob/main/6.%20Building%20Apps%20Powered%20by%20GenAI%20using%20LangChain/1.%20Introduction%20to%20LangChain/7.%20VectorDB%20-%20ChromaDB/intro_to_chromadb.ipynb)

### **What is ChromaDB?**

ChromaDB is an open-source vector database designed for AI applications.

### **Features**
1. **Has everything we need for retrieval**
    - Store document embeddings and their metadata
    - Search embeddings
    - Full-text search
    - Metadata filtering
    - Multi-modal retrieval
2. **Free and Open source**
3. **Integrations** 
    - Works with HuggingFace, OpenAI, Google, LangChain and more.
4. **Simple to Get Started**
    - ```pip install chromadb```

In [1]:
!pip show chromadb

Name: chromadb
Version: 0.4.24
Summary: Chroma.
Home-page: https://github.com/chroma-core/chroma
Author: 
Author-email: Jeff Huber <jeff@trychroma.com>, Anton Troynikov <anton@trychroma.com>
License: 
Location: c:\users\91889\anaconda3\envs\geminienv\lib\site-packages
Requires: bcrypt, build, chroma-hnswlib, fastapi, grpcio, importlib-resources, kubernetes, mmh3, numpy, onnxruntime, opentelemetry-api, opentelemetry-exporter-otlp-proto-grpc, opentelemetry-instrumentation-fastapi, opentelemetry-sdk, orjson, overrides, posthog, pulsar-client, pydantic, pypika, PyYAML, requests, tenacity, tokenizers, tqdm, typer, typing-extensions, uvicorn
Required-by: crewai-tools, embedchain


In [2]:
import chromadb

client = chromadb.PersistentClient(path="vector_store")

In [3]:
client.heartbeat()

1732970215210755100

In [4]:
# Warning: Empties and completely resets the database.
# Only works if the client was created with allow_reset=True in its settings.

# client.reset() 

## **Create a Collection**

**Parameters**
- **name:** Select a `name` for your collection.
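- **embedding_function**: The function used to embed documents and queries. By default, Chroma uses the Sentence Transformers `all-MiniLM-L6-v2` model.
- **metadata={"hnsw:space": "cosine"}**: Sets the distance metric of the HNSW index. Valid values are `"l2"`, `"ip"`, and `"cosine"`; the default is `"l2"` (squared L2 distance).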

<img src="Images/metadatas.png">
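As a rough illustration of what the distance setting changes, the sketch below computes squared L2, inner-product, and cosine distance in plain Python. The formulas follow the conventions of hnswlib, the index library Chroma depends on; treat that as an assumption about the installed version. The helper names and sample vectors are hypothetical.

```python
import math

def squared_l2(a, b):
    """Sum of squared component differences (no square root)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product_distance(a, b):
    """One minus the dot product; smaller means more similar."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine similarity; ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Two unit-length vectors chosen for easy arithmetic.
u = [0.6, 0.8]
v = [1.0, 0.0]

print(round(squared_l2(u, v), 6))              # 0.8
print(round(inner_product_distance(u, v), 6))  # 0.4
print(round(cosine_distance(u, v), 6))         # 0.4
```

For unit-length vectors the squared L2 distance is exactly twice the cosine distance, so both metrics rank neighbors in the same order; they only diverge when vector lengths differ.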

In [5]:
collections = client.create_collection(name="my_first_collection")

**Alternative:** You can also use `client.get_or_create_collection()`

In [6]:
# returns the number of items in the collection

collections.count()

0

In [7]:
# returns a list of the first 10 items in the collection

collections.peek()

{'ids': [],
 'embeddings': [],
 'metadatas': [],
 'documents': [],
 'uris': None,
 'data': None}

In [8]:
collections

Collection(name=my_first_collection)

In [9]:
# Rename the collection

collections.modify(name="first_collection")

In [10]:
collections

Collection(name=first_collection)

## **Get An Already Existing Collection**

In [4]:
collections = client.get_collection(name="first_collection")

collections

Collection(name=first_collection)

## **Adding data to a Collection**

In [12]:
collections.add(
    documents=[
        "This is my first document",
        "This is my second document"
    ],
    metadatas=[{'key_1': 'abc_1', 'key_2': 'abc_2'}, {'key_1': 'xyz_1', 'key_2': 'xyz_2'}],
    ids=['id1', 'id2']
)

C:\Users\91889\.cache\chroma\onnx_models\all-MiniLM-L6-v2\onnx.tar.gz: 100%|██████████| 79.3M/79.3M [00:24<00:00, 3.33MiB/s]


In [20]:
# collections.peek()

## **Search Embeddings**

In [18]:
result = collections.query(
    query_texts = ["first"],
    n_results = 1
)

In [19]:
result

{'ids': [['id1']],
 'distances': [[1.1781862767622826]],
 'metadatas': [[{'key_1': 'abc_1', 'key_2': 'abc_2'}]],
 'embeddings': None,
 'documents': [['This is my first document']],
 'uris': None,
 'data': None}

In [23]:
result = collections.query(
    query_texts = ["second"],
    n_results = 1
)

result

{'ids': [['id2']],
 'distances': [[1.1943339473992967]],
 'metadatas': [[{'key_1': 'xyz_1', 'key_2': 'xyz_2'}]],
 'embeddings': None,
 'documents': [['This is my second document']],
 'uris': None,
 'data': None}
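The result is nested: each of `ids`, `documents`, `metadatas`, and `distances` holds one inner list per query text, so `result['ids'][0]` belongs to the first query. The helper below is a hypothetical convenience (not part of Chroma) that flattens this shape into one list of rows per query. It is shown on a copy of the output above, so it runs without a database.

```python
def flatten_query_result(result):
    """Turn a nested query result into one list of row dicts per query."""
    rows_per_query = []
    for ids, docs, metas, dists in zip(
        result["ids"],
        result["documents"],
        result["metadatas"],
        result["distances"],
    ):
        rows = [
            {"id": i, "document": d, "metadata": m, "distance": dist}
            for i, d, m, dist in zip(ids, docs, metas, dists)
        ]
        rows_per_query.append(rows)
    return rows_per_query

# Copy of the result printed above for the query "second".
sample_result = {
    "ids": [["id2"]],
    "distances": [[1.1943339473992967]],
    "metadatas": [[{"key_1": "xyz_1", "key_2": "xyz_2"}]],
    "embeddings": None,
    "documents": [["This is my second document"]],
    "uris": None,
    "data": None,
}

for row in flatten_query_result(sample_result)[0]:
    print(row["id"], row["distance"], row["document"])
```

Passing several strings in `query_texts` yields several inner lists, and the helper then returns one list of rows for each.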

In [None]:
# adding a new item to the collection
collections.add(
    documents=["This is a document about pineapple"],
    metadatas=[{'key_1': 'pqrs_1', 'key_2': 'pqrs_2'}],
    ids=['id3']
)

In [22]:
# query with a related term: no document contains "apple" as a separate word,
# but the pineapple document is the closest match by embedding distance
result = collections.query(
    query_texts=["apple"],
    n_results=1
)

result

{'ids': [['id3']],
 'distances': [[1.602426156795167]],
 'metadatas': [[{'key_1': 'pqrs_1', 'key_2': 'pqrs_2'}]],
 'embeddings': None,
 'documents': [['This is a document about pineapple']],
 'uris': None,
 'data': None}

In [1]:
# collections.peek()

## **Updating data in a collection**

In [12]:
collections.update(
    ids=['id1', 'id2'],
    metadatas=[{'doc1_key_1': 'abc_1', 'doc1_key_2': 'abc_2'}, {'doc2_key_1': 'xyz_1', 'doc2_key_2': 'xyz_2'}],
    documents=[
        "This is a document about apple",
        "This is a document about banana"
    ]
)

In [2]:
# collections.peek()

## **Deleting data from a collection**

In [14]:
collections.delete(
    ids=['id1', 'id2']
)

In [3]:
# collections.peek()