# Various Vector Databases

### Chroma DB

- Open-source vector store used for storing and retrieving vector embeddings.
- Mainly used to save embeddings along with metadata for later use by large language models.
- Can also be used to build semantic search engines over text data.
- Offers a self-hosted server option.
- Provides several options for storing vector embeddings:
    - Storing the vector database in memory
    - Storing the vector database in the local file system
    - Hosting the database on a server machine
 
#### Working:

- Install Chroma.
- Initialize the client, choosing in-memory or persistent storage.
- Create a collection with the client, similar to creating a table in a traditional database.
- Add text data with metadata and unique IDs.
- By default, Chroma downloads the all-MiniLM-L6-v2 model to convert the text into embeddings and stores them in the collection.
- Ask a query in natural language; Chroma converts the query into an embedding and uses similarity search to return the most similar results.
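
The add-then-query workflow above can be sketched with the standard library alone. This is a toy illustration, not Chroma's API or implementation: `ToyCollection`, `embed`, and `cosine` are hypothetical names, and the bag-of-words `embed` function is only a stand-in for a real embedding model such as all-MiniLM-L6-v2.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyCollection:
    """Hypothetical in-memory collection: id -> (text, metadata, embedding)."""

    def __init__(self, name):
        self.name = name
        self.records = {}

    def add(self, documents, metadatas, ids):
        # Embed each document once, at insert time, and store it under its ID.
        for doc, meta, doc_id in zip(documents, metadatas, ids):
            self.records[doc_id] = (doc, meta, embed(doc))

    def query(self, query_text, n_results=1):
        # Embed the query, then rank every stored record by similarity.
        q = embed(query_text)
        ranked = sorted(
            self.records.items(),
            key=lambda item: cosine(q, item[1][2]),
            reverse=True,
        )
        return [(doc_id, doc, meta) for doc_id, (doc, meta, _) in ranked[:n_results]]


collection = ToyCollection("notes")
collection.add(
    documents=[
        "Chroma stores embeddings with metadata",
        "Cassandra is a distributed NoSQL database",
        "Bananas are rich in potassium",
    ],
    metadatas=[{"topic": "vector-db"}, {"topic": "nosql"}, {"topic": "food"}],
    ids=["doc1", "doc2", "doc3"],
)
results = collection.query("which database stores embeddings", n_results=1)
print(results[0][0])
```

A real embedding model would also match paraphrases that share no words with the stored text, which this word-overlap stand-in cannot do.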
 
#### Features:

- Client Initialization: Chroma DB can be initialized and accessed through client libraries.
- Client/Server Mode: Chroma DB can run in client/server mode.
- Collection Management: We can create and manage collections of vectors.
- Update and delete records: We can update or remove records in a collection.

### Pinecone:

- Cloud-native vector database.
- Core approach: Approximate Nearest Neighbor (ANN) search, which trades a small amount of accuracy for speed when locating and ranking close matches within a large dataset.
- Offers high-performance search and similarity matching.
- Handles high-dimensional vector data at scale, with easy integration and fast query results.
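
To make the idea of approximate search concrete, here is a minimal sketch of one generic ANN technique, random-hyperplane locality-sensitive hashing, compared with a brute-force scan. It is not Pinecone's index or API; the constants and helper names (`bucket_key`, `ann_search`, `exact_search`) are assumptions made for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is reproducible

DIM, NUM_PLANES, NUM_VECTORS = 8, 4, 200


def random_vector(dim):
    return [random.gauss(0.0, 1.0) for _ in range(dim)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Each random hyperplane contributes one bit: which side the vector falls on.
planes = [random_vector(DIM) for _ in range(NUM_PLANES)]


def bucket_key(vec):
    return tuple(sum(p * v for p, v in zip(plane, vec)) >= 0 for plane in planes)


# Index: group vector IDs by bucket key.
vectors = {i: random_vector(DIM) for i in range(NUM_VECTORS)}
buckets = {}
for vec_id, vec in vectors.items():
    buckets.setdefault(bucket_key(vec), []).append(vec_id)


def ann_search(query, k=3):
    """Rank only the candidates that share the query's bucket."""
    candidates = buckets.get(bucket_key(query), [])
    ranked = sorted(candidates, key=lambda i: cosine(query, vectors[i]), reverse=True)
    return ranked[:k], len(candidates)


def exact_search(query, k=3):
    """Brute-force baseline: compare against every stored vector."""
    return sorted(vectors, key=lambda i: cosine(query, vectors[i]), reverse=True)[:k]


query = vectors[42]
approx_ids, scanned = ann_search(query)
print("approximate:", approx_ids, "scanned", scanned, "of", NUM_VECTORS)
print("exact:      ", exact_search(query))
```

The approximate search scans only one bucket, so it does less work but can miss true neighbors that fell on the other side of a hyperplane. Production systems reduce such misses with multiple hash tables or with graph-based indexes.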

#### Features:

- Dynamic data updates: Supports dynamic updates to vector data, allowing users to add, update, or delete vectors from the database in real-time.
- Multi-Model Indexing
    - Allows users to create custom indexing structures tailored to their specific data and query requirements.
    - Enables efficient indexing and retrieval of vector data in diverse use cases, including text search, image similarity, and recommendation systems.
- Managed Service
    - Eliminates the need for users to manage infrastructure by offering a cloud-based deployment model with automated provisioning, monitoring, and maintenance.
- Low Latency Search
    - Similarity search operations and nearest neighbor queries are performed with low latency, providing fast and accurate results.

### OpenSearch
 
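- Open-source fork of Elasticsearch started by AWS in 2021; the managed Amazon OpenSearch Service was formerly known as Amazon Elasticsearch Service (Amazon ES).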
- Open-source distributed search and analytics engine that allows users to index, search and analyze large volumes of structured and unstructured data in real time.
- OpenSearch’s vector database bridges the gap between traditional search and AI-powered vector search, making it easier to build flexible, scalable, and future-proof AI applications.
 
#### Features:

- Hybrid Search
    - Combines traditional lexical search with vector search capabilities.
    - Can perform both text-based and vector-based searches within the same system.
    - Enhances search accuracy and relevance.
- Generative AI enhancement
    - Acts as a knowledge base for generative AI models (such as chatbots or language models).
    - Used as a long-term memory for AI systems.
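
One common way to combine lexical and vector results is to normalize each score list and take a weighted mean. The sketch below shows that idea in plain Python. It is not OpenSearch's query DSL, and the scores, document IDs, and function names (`min_max`, `hybrid_rank`) are hypothetical.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(lexical, vector, lexical_weight=0.5):
    """Weighted mean of normalized lexical and vector scores, best first."""
    lex_n, vec_n = min_max(lexical), min_max(vector)
    combined = {
        doc_id: lexical_weight * lex_n.get(doc_id, 0.0)
        + (1.0 - lexical_weight) * vec_n.get(doc_id, 0.0)
        for doc_id in set(lex_n) | set(vec_n)
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores: BM25-style lexical scores and cosine similarities.
lexical_scores = {"doc_a": 12.0, "doc_b": 4.0, "doc_c": 10.0, "doc_d": 2.0}
vector_scores = {"doc_a": 0.30, "doc_b": 0.90, "doc_c": 0.80, "doc_d": 0.10}

ranking = hybrid_rank(lexical_scores, vector_scores)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Normalization matters because lexical scores and cosine similarities live on different scales; without it, the larger-valued lexical scores would dominate the sum. In this sample, `doc_c` ranks first although it is second in both individual lists.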

### Cassandra:
- Open-source distributed NoSQL database that can also serve as a vector store (vector search was added in Cassandra 5.0).
- Can store vector data and perform similarity search operations.
- Highly scalable, designed for handling large volumes of structured and semi-structured data across multiple nodes in a cluster.
- Primarily used for key-value and tabular data storage, but can also store and query vector data for certain use cases.
 
#### Features:

- Data Modeling
    - Allows users to define flexible data models suitable for storing vector data.
    - Users can design table schemas to accommodate the dimensions, characteristics, and metadata associated with vector data.
- Security and Access Control
    - Offers robust security features, including authentication, authorization, encryption, and role-based access control (RBAC).
    - Users can restrict access to vector data and resources based on user roles, permissions, and data sensitivity.   
- Secondary Indexing
    - Provides support for secondary indexing, allowing users to create indexes on specific columns or attributes within a table.
- Custom Querying
    - Users can execute custom queries to retrieve vector data based on specific criteria, such as similarity scores, distance metrics, or metadata attributes.
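
As a rough illustration of querying by distance metric and metadata attribute, the sketch below filters rows on a metadata column and then ranks them with a selectable similarity function. It is plain Python rather than CQL, and the `rows` data, the `METRICS` table, and `top_k` are hypothetical names used only for this example.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Every metric is expressed so that a higher score means "more similar".
METRICS = {
    "cosine": cosine_similarity,
    "dot_product": dot,
    "euclidean": lambda a, b: -euclidean_distance(a, b),  # negate the distance
}

# Hypothetical table rows: an ID, one metadata column, and a vector column.
rows = [
    {"id": 1, "category": "book", "embedding": [0.9, 0.1, 0.0]},
    {"id": 2, "category": "book", "embedding": [0.1, 0.9, 0.2]},
    {"id": 3, "category": "film", "embedding": [1.0, 0.0, 0.0]},
]


def top_k(query, k, metric="cosine", category=None):
    """Filter on metadata first, then rank the remaining rows by similarity."""
    score = METRICS[metric]
    candidates = [r for r in rows if category is None or r["category"] == category]
    ranked = sorted(candidates, key=lambda r: score(query, r["embedding"]), reverse=True)
    return [r["id"] for r in ranked[:k]]


print(top_k([1.0, 0.0, 0.0], k=2))
print(top_k([1.0, 0.0, 0.0], k=1, category="book"))
print(top_k([1.0, 0.0, 0.0], k=1, metric="euclidean"))
```

The choice of metric can change the ranking: cosine similarity ignores vector length, while dot product and Euclidean distance do not, so they agree with cosine only when the vectors are normalized.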

References:

https://cassandra.apache.org/doc/latest/cassandra/vector-search/vector-search-working-with.html

https://ubuntu.com/blog/what-is-opensearch

https://docs.pinecone.io/reference/architecture/serverless-architecture

https://www.datacamp.com/tutorial/chromadb-tutorial-step-by-step-guide