# Overview of Chroma DB

**Chroma DB** is an open-source vector database for storing high-dimensional embeddings and querying them by similarity. It is built for similarity search over embeddings together with their documents and metadata.

| **Feature**                    | **Description**                                                                                          |
|--------------------------------|----------------------------------------------------------------------------------------------------------|
| **Optimized for Embeddings**   | Designed to work with high-dimensional vectors such as word embeddings, image features, etc.             |
| **Scalable Vector Search**     | Efficient mechanisms for searching large-scale vector datasets.                                         |
| **High Performance**           | Delivers fast retrieval times for similarity searches.                                                   |
| **Advanced Indexing Techniques** | Uses an HNSW graph index by default for approximate nearest neighbor search.                            |
| **Support for Various Search Types** | Combines vector similarity search with metadata filtering, so queries can be narrowed by application needs. |
| **Integration with ML Frameworks** | Integrates with popular machine learning tools such as LangChain.                                      |
| **Cross-Language Support**     | Provides official clients for Python and JavaScript/TypeScript; community clients cover other languages. |
| **User-Friendly Interface**    | Offers an intuitive interface for managing and querying vector data.                                     |
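
To make "similarity search" concrete, here is a minimal pure-Python sketch of exact (brute-force) nearest neighbor search using cosine similarity. It is an illustration of the idea only, not Chroma's implementation; the names `cosine_similarity`, `exact_search`, and `toy_vectors` are hypothetical, and the three-dimensional vectors stand in for real embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_search(query, vectors, k=2):
    """Score every stored vector against the query and return the top-k ids."""
    scored = [(cosine_similarity(query, vec), key) for key, vec in vectors.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [key for _, key in scored[:k]]

# Hypothetical 3-dimensional "embeddings"; real ones have hundreds of dimensions.
toy_vectors = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
nearest = exact_search([1.0, 0.0, 0.0], toy_vectors, k=2)
print(nearest)
```

Exact search compares the query with every stored vector, which is why larger collections usually rely on an approximate index instead.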

## Advantages of Chroma DB

| **Advantage**                  | **Description**                                                                                          |
|--------------------------------|----------------------------------------------------------------------------------------------------------|
| **Efficient Vector Search**    | Optimized for fast search operations on high-dimensional vectors.                                        |
| **Scalability**                | Scales to large collections, though it is less tuned than FAISS for billion-vector workloads.            |
| **Flexible Indexing**          | Offers various indexing options to balance search speed and memory usage.                                |
| **Real-Time Performance**      | Provides real-time or near-real-time search capabilities.                                                 |
| **Integration with ML Tools**  | Seamlessly integrates with machine learning tools and frameworks.                                         |
| **Customizable**               | Allows customization of search parameters and indexing strategies.                                        |
| **Low Latency**                | Minimizes latency during search operations.                                                               |
| **Support for Various Query Types** | Supports different query types and search modes for diverse use cases.                                  |

## Use Cases of Chroma DB

| **Use Case**                   | **Description**                                                                                          |
|--------------------------------|----------------------------------------------------------------------------------------------------------|
| **Recommendation Systems**     | Finds similar items or content for recommendations based on vector embeddings.                           |
| **Image and Video Retrieval**  | Searches for similar images or videos by comparing their feature vectors.                                |
| **Natural Language Processing (NLP)** | Retrieves similar text or document embeddings for tasks like information retrieval or semantic search. |
| **Clustering and Classification** | Facilitates clustering of data points based on vector similarities.                                       |
| **Anomaly Detection**          | Identifies outliers or unusual data points by comparing vector similarities.                             |
| **Real-Time Search**           | Powers search engines and real-time information retrieval systems requiring quick responses.            |




# Detailed Comparison of Chroma DB and FAISS

| **Aspect**                     | **Chroma DB**                                                                             | **FAISS**                                                                                       |
|--------------------------------|--------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------|
| **Origin and Development**     | Open-source project (Apache 2.0) developed by the company Chroma together with community contributors. | Developed by Facebook AI Research (FAIR), backed by extensive research and development.        |
| **Main Focus**                 | Designed for embedding-based search and retrieval.                                         | Specializes in large-scale vector similarity search and clustering.                             |
| **Indexing Techniques**        | Uses an HNSW graph index by default for approximate nearest neighbor search.               | Offers a wide range of indexing techniques including Flat (exact search), IVF (inverted file), PQ (product quantization), and HNSW (Hierarchical Navigable Small World graphs). |
| **Scalability**                | Scales to large datasets but might not be as optimized for billions of vectors.              | Extremely scalable, optimized for billions of vectors using advanced indexing and quantization.|
| **Performance with Large Datasets** | Generally high performance, but specifics can depend on implementation.                   | Known for very high performance, especially with GPU acceleration for extremely large datasets. |
| **Memory Efficiency**          | Memory efficiency may vary based on the implementation and configuration.                   | Highly optimized for low memory usage with techniques like quantization and compression.        |
| **Real-Time Search Capabilities** | Capable of real-time searches, but performance depends on the implementation.               | Highly optimized for real-time or near-real-time search with low latency.                       |
| **GPU Support**                | Vector search runs on the CPU; GPU-accelerated search is not a built-in feature.            | Provides robust GPU support, leveraging GPU acceleration for fast vector searches.             |
| **Customizability**            | Customizable indexing and search parameters; specifics may vary.                           | Highly customizable, allowing fine-tuning of precision-recall trade-offs and various indexing methods. |
| **Integration with ML Frameworks** | Integrates well with ML frameworks but specifics can depend on the implementation.         | Commonly used in ML pipelines; well-supported with integrations for popular ML frameworks.     |
| **Language Support**           | Official clients for Python and JavaScript/TypeScript; community clients for other languages. | Written in C++ with Python bindings for easy integration into Python-based workflows. |
| **Real-World Use Cases**       | Used in recommendation systems, document retrieval, and search engines.                    | Widely used in recommendation systems, image and video search, NLP, clustering, and anomaly detection. |

## Specific Differences

1. **Indexing Techniques**:
   - **Chroma DB**: Relies primarily on an HNSW index, with fewer index types to choose from.
   - **FAISS**: Provides a range of advanced indexing methods like IVF, PQ, and HNSW for optimizing search efficiency.

2. **Performance Optimization**:
   - **Chroma DB**: Optimized for high-dimensional embeddings but may not offer the same level of performance for extremely large datasets.
   - **FAISS**: Specifically engineered for high performance with large-scale datasets, especially when using GPU acceleration.

3. **Memory and Efficiency**:
   - **Chroma DB**: May not have the same level of optimization for memory usage and compression as FAISS.
   - **FAISS**: Known for advanced memory efficiency using techniques such as product quantization and compressed storage.

4. **GPU Capabilities**:
   - **Chroma DB**: Search runs on the CPU; GPU acceleration is not a built-in feature.
   - **FAISS**: Robust GPU support, designed to leverage GPUs for fast search operations.

5. **Customization and Tuning**:
   - **Chroma DB**: Customization options may be less extensive compared to FAISS.
   - **FAISS**: Highly customizable with various settings to balance between accuracy and performance.

In short, Chroma DB is the simpler choice for application-level embedding storage and retrieval, while FAISS suits workloads that need fine-grained index tuning, GPU acceleration, or search over billions of vectors.
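
The inverted-file (IVF) idea mentioned for FAISS can be sketched in a few lines of pure Python: vectors are bucketed by their nearest centroid, and a query scans only the `nprobe` closest buckets instead of the whole dataset. This is a toy illustration under simplifying assumptions, not FAISS code. The class `ToyIVFIndex` and its methods are hypothetical, and the centroids are fixed by hand, whereas a real IVF index learns them (typically with k-means).

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class ToyIVFIndex:
    """Inverted-file sketch: bucket vectors by nearest centroid, search few buckets."""

    def __init__(self, centroids):
        self.centroids = centroids  # assumed given; real indexes train these
        self.buckets = {i: [] for i in range(len(centroids))}

    def _nearest_centroids(self, vec, n):
        order = sorted(range(len(self.centroids)),
                       key=lambda i: l2(vec, self.centroids[i]))
        return order[:n]

    def add(self, key, vec):
        bucket = self._nearest_centroids(vec, 1)[0]
        self.buckets[bucket].append((key, vec))

    def search(self, query, k=1, nprobe=1):
        # Only vectors in the nprobe closest buckets are candidates, so a
        # true neighbor in an unprobed bucket can be missed (approximate search).
        candidates = []
        for bucket in self._nearest_centroids(query, nprobe):
            candidates.extend(self.buckets[bucket])
        candidates.sort(key=lambda item: l2(query, item[1]))
        return [key for key, _ in candidates[:k]]

index = ToyIVFIndex(centroids=[[0.0, 0.0], [10.0, 10.0]])
index.add("a", [1.0, 1.0])
index.add("b", [2.0, 0.5])
index.add("c", [9.0, 9.5])
result = index.search([0.5, 0.5], k=2, nprobe=1)
print(result)
```

Raising `nprobe` trades speed for recall, which is the kind of precision-recall tuning the comparison attributes to FAISS.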


In [1]:
# Build a sample vector DB
from langchain_chroma import Chroma
from langchain_community.document_loaders import TextLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [2]:
loader = TextLoader("speech.txt", encoding="utf-8")  # explicit encoding avoids garbled characters
data = loader.load()
data

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.\n\nJust because we fight without rancor and without selfish object, seeking nothing for ourselves but what we shall wish to share with all free peoples, we shall, I feel confident, conduct our operations as belligerents without passion and ourselves observe with proud punctilio the principles of right and of fair play we profess to be fighting for.\n\n…\n\nIt will be all the easier for us to conduct ourselves as belligerents in a high spirit of right and fairness 

In [3]:
# Split the document into chunks of at most 500 characters, with no overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)
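
To show what `chunk_size` and `chunk_overlap` control, here is a simplified stand-in that cuts text into fixed-size character windows. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to break on separators such as paragraph and line boundaries; the function `split_fixed` is hypothetical and exists only for illustration.

```python
def split_fixed(text, chunk_size=500, chunk_overlap=0):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks

no_overlap = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=0)
with_overlap = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=1)
print(no_overlap)
print(with_overlap)
```

With an overlap of 1, each chunk repeats the last character of the previous one, which helps keep context that would otherwise be cut at a chunk boundary.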

In [6]:
# Embed each chunk with a local Ollama model and index the vectors in Chroma
embedding = OllamaEmbeddings(model="gemma:2b")
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

In [7]:
vectordb

<langchain_chroma.vectorstores.Chroma at 0x22c7a567d50>

In [8]:
# Query the vector store
query = "What is more precious than peace, what shall we fight for? "
docs = vectordb.similarity_search(query)
docs

[Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.'),
 Document(metadata={'source': 'speech.txt'}, page_content='To such a task we can dedicate our lives and our fortunes, everything that we are and everything that we have, with the pride of those who know that the day has come when America is privileged to spend her blood and her might for the principles that gave her birth and happiness and the peace which she has treasured. God helping her, she can do no other.'),
 Document(metadata={'source': 'speech.txt'}, page

In [9]:
# Save to disk: Chroma persists the collection under ./chroma_db (a SQLite file plus index data)
vectordb = Chroma.from_documents(documents=splits, embedding=embedding, persist_directory="./chroma_db")

In [11]:
# Reload the persisted collection; the same embedding function is needed to embed queries
db2 = Chroma(persist_directory="./chroma_db", embedding_function=embedding)
docs=db2.similarity_search(query)
print(docs[0].page_content)

The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.
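
The save-and-reload round trip above can be pictured with a small standard-library sketch that writes vectors to a SQLite file and reads them back. This only illustrates the idea of on-disk persistence; it makes no claim about Chroma's actual schema or file layout. The table name `embeddings` and the helpers `save_vectors` and `load_vectors` are hypothetical.

```python
import json
import os
import sqlite3
import tempfile

def save_vectors(db_path, vectors):
    """Write id -> vector pairs to a SQLite file, JSON-encoding each vector."""
    conn = sqlite3.connect(db_path)
    with conn:  # commits the transaction on success
        conn.execute(
            "CREATE TABLE IF NOT EXISTS embeddings (id TEXT PRIMARY KEY, vector TEXT)"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO embeddings (id, vector) VALUES (?, ?)",
            [(key, json.dumps(vec)) for key, vec in vectors.items()],
        )
    conn.close()

def load_vectors(db_path):
    """Read every stored vector back into a dict."""
    conn = sqlite3.connect(db_path)
    rows = conn.execute("SELECT id, vector FROM embeddings").fetchall()
    conn.close()
    return {key: json.loads(blob) for key, blob in rows}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_store.sqlite3")
    save_vectors(path, {"chunk_0": [0.1, 0.2], "chunk_1": [0.3, 0.4]})
    restored = load_vectors(path)  # a fresh connection sees the saved data

print(restored)
```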


In [12]:
# Similarity search with score; the score is a distance, so lower means more similar
docs = vectordb.similarity_search_with_score(query)
docs

[(Document(metadata={'source': 'speech.txt'}, page_content='The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.'),
  4709.066333773646),
 (Document(metadata={'source': 'speech.txt'}, page_content='To such a task we can dedicate our lives and our fortunes, everything that we are and everything that we have, with the pride of those who know that the day has come when America is privileged to spend her blood and her might for the principles that gave her birth and happiness and the peace which she has treasured. God helping her, she can do no other.'),
  4873.9416458496635),
 (
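
The scores returned here are distances, so a smaller number means a closer match. Assuming the collection uses Chroma's default squared-L2 metric, the ranking can be reproduced by hand as in the sketch below. The vectors and the names `squared_l2`, `query_vec`, and `stored` are hypothetical; large magnitudes like those above are plausible when embeddings are not normalized to unit length.

```python
def squared_l2(a, b):
    """Sum of squared coordinate differences (no square root taken)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical 2-dimensional vectors standing in for real embeddings.
query_vec = [1.0, 2.0]
stored = {"chunk_a": [1.0, 3.0], "chunk_b": [4.0, 6.0]}

# Sort ascending: the smallest distance is the best match.
ranked = sorted((squared_l2(query_vec, vec), key) for key, vec in stored.items())
print(ranked)
```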

In [13]:
# Expose the vector store through LangChain's retriever interface
retriever = vectordb.as_retriever()
retriever.invoke(query)[0].page_content

'The world must be made safe for democracy. Its peace must be planted upon the tested foundations of political liberty. We have no selfish ends to serve. We desire no conquest, no dominion. We seek no indemnities for ourselves, no material compensation for the sacrifices we shall freely make. We are but one of the champions of the rights of mankind. We shall be satisfied when those rights have been made as secure as the faith and the freedom of nations can make them.'

Retriever: Supports more than plain similarity search. `as_retriever` accepts a `search_type` (for example `"similarity"` or `"mmr"`) and `search_kwargs` (such as `k` or a metadata `filter`), which can improve the quality and relevance of the search results.

Retriever: Exposes a standard `invoke` interface that takes a query string and returns documents, so it can be dropped into larger data processing pipelines alongside preprocessing, post-processing, or additional data sources.
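
The retriever pattern can be sketched without any library: an object that hides the search backend behind a single `invoke` method and applies a result limit and a metadata filter. This is a toy model, not LangChain's implementation. `ToyRetriever`, `keyword_search`, and `DOCS` are hypothetical, the keyword matcher merely stands in for vector search, and the filter is applied after the search for simplicity.

```python
class ToyRetriever:
    """Minimal retriever: query string in, list of matching documents out."""

    def __init__(self, search_fn, k=4, metadata_filter=None):
        self.search_fn = search_fn  # must return documents ranked best-first
        self.k = k
        self.metadata_filter = metadata_filter or {}

    def invoke(self, query):
        hits = self.search_fn(query)
        kept = [
            doc for doc in hits
            if all(doc["metadata"].get(name) == value
                   for name, value in self.metadata_filter.items())
        ]
        return kept[: self.k]

DOCS = [
    {"text": "peace and liberty", "metadata": {"source": "speech.txt"}},
    {"text": "peace treaty notes", "metadata": {"source": "notes.txt"}},
    {"text": "engine manual", "metadata": {"source": "manual.txt"}},
]

def keyword_search(query):
    """Stand-in for vector search: rank documents by words shared with the query."""
    words = set(query.lower().split())
    scored = [(len(words & set(doc["text"].split())), doc) for doc in DOCS]
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [doc for score, doc in ranked if score > 0]

toy_retriever = ToyRetriever(keyword_search, k=1, metadata_filter={"source": "speech.txt"})
results = toy_retriever.invoke("peace")
print(results[0]["text"])
```

Because callers only depend on `invoke`, the search backend can be swapped without changing the rest of the pipeline.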
