Vector Database Benchmarking Project

This project benchmarks the ingestion performance of three popular vector databases (Qdrant, Weaviate, and ChromaDB) using synthetic business and product data.

Overview

The benchmark measures:

Insertion throughput (records per second)
Total ingestion time
Index build time
Final storage size on disk
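
As a rough illustration of how the first two metrics relate, the sketch below times a batched insert loop and derives throughput as records divided by elapsed seconds. The function name measure_ingestion and the insert_batch callback are hypothetical and are not taken from benchmark_vector_db_ingestion.py. Index build time and storage size are not covered here.

```python
import time
from typing import Callable, Sequence


def measure_ingestion(insert_batch: Callable[[Sequence[dict]], None],
                      records: Sequence[dict],
                      batch_size: int) -> dict:
    """Time a batched ingestion and derive throughput in records per second."""
    start = time.perf_counter()
    for i in range(0, len(records), batch_size):
        insert_batch(records[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return {
        "records": len(records),
        "ingestion_time_s": elapsed,
        # Guard against a zero-length interval on very small inputs.
        "throughput_rec_s": len(records) / elapsed if elapsed > 0 else float("inf"),
    }


# Stand-in sink for illustration; a real run would call a database client here.
stored: list = []
result = measure_ingestion(stored.extend, [{"id": n} for n in range(1000)], batch_size=100)
print(f"{result['records']} records in {result['ingestion_time_s']:.4f}s "
      f"({result['throughput_rec_s']:.0f} rec/s)")
```

Using time.perf_counter rather than time.time avoids jumps from system clock adjustments during a run.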

All three databases ingest the same vector embeddings, generated with the sentence-transformers/all-MiniLM-L6-v2 model.

Project Structure

vector-db-benchmark/
├── benchmark_vector_db_ingestion.py  # Main benchmarking script
├── docker-compose.yml                # Docker configuration for vector DBs
├── requirements.txt                  # Python dependencies
├── businesses.csv                    # Synthetic business data
├── businesses.json                   # Synthetic business data (JSON)
├── products.csv                      # Synthetic product data
├── products.json                     # Synthetic product data (JSON)
└── benchmark_results.txt             # Benchmark results output

Prerequisites

Docker Desktop installed and running
Python 3.8+ (a virtual environment is recommended)
At least 4 GB of free RAM

Setup Instructions

1. Start Vector Databases in Docker

docker-compose up -d

This will start:

Qdrant on ports 6333 (HTTP) and 6334 (gRPC)
Weaviate on ports 8081 (HTTP) and 50051 (gRPC)
ChromaDB on port 8000
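
Before running the benchmark, it can help to confirm that the containers are accepting connections. The following is a minimal sketch that uses only the Python standard library and the host ports listed above. The helper names port_open and report are illustrative. A successful TCP connection shows that something is listening on the port, not that the database is fully initialized.

```python
import socket

# Host ports as listed above; adjust if docker-compose.yml maps them differently.
SERVICES = {
    "Qdrant (HTTP)": 6333,
    "Qdrant (gRPC)": 6334,
    "Weaviate (HTTP)": 8081,
    "Weaviate (gRPC)": 50051,
    "ChromaDB": 8000,
}


def port_open(port: int, host: str = "localhost", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(services: dict = SERVICES) -> None:
    """Print one reachability line per service."""
    for name, port in services.items():
        status = "reachable" if port_open(port) else "not reachable"
        print(f"{name:<16} port {port:<5} {status}")
```

Calling report() after docker-compose up -d prints one line per service.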

2. Install Python Dependencies

# Activate virtual environment (if using one)
.\vector_db_env\Scripts\Activate.ps1

# Install required packages
pip install -r requirements.txt

3. Run the Benchmark

python benchmark_vector_db_ingestion.py

Benchmark Results

Latest Test Results (2025-10-07)

Database   Records   Ingestion Time (s)   Index Time (s)   Throughput (rec/s)
Qdrant     12,173    12.39                0.25             982.43
Weaviate   12,173    7.14                 0.22             1,704.00
ChromaDB   12,173    15.78                0.01             771.44

Key Findings

Weaviate showed the highest throughput at 1,704 records/second
ChromaDB had the fastest index build time (0.01s)
Qdrant ranked second in throughput at 982.43 records/second and had the longest index build time (0.25s)
All databases successfully ingested 12,173 records (100 businesses + 12,073 products)
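
Throughput in the results table is the record count divided by the ingestion time. The sketch below recomputes it from the rounded figures in the table and reports how far each value is from the reported one. The small gaps are presumably due to the ingestion times being rounded to two decimals, which is an assumption rather than something the results state.

```python
# Figures copied from the results table; nothing here is newly measured.
RESULTS = {
    "Qdrant": {"records": 12_173, "ingestion_s": 12.39, "reported_rec_s": 982.43},
    "Weaviate": {"records": 12_173, "ingestion_s": 7.14, "reported_rec_s": 1704.00},
    "ChromaDB": {"records": 12_173, "ingestion_s": 15.78, "reported_rec_s": 771.44},
}


def recomputed_throughput(row: dict) -> float:
    """Records per second derived from the rounded table values."""
    return row["records"] / row["ingestion_s"]


def relative_gap(row: dict) -> float:
    """Relative difference between recomputed and reported throughput."""
    return abs(recomputed_throughput(row) - row["reported_rec_s"]) / row["reported_rec_s"]


for name, row in RESULTS.items():
    print(f"{name:<9} recomputed {recomputed_throughput(row):8.2f} rec/s, "
          f"reported {row['reported_rec_s']:8.2f} rec/s, "
          f"gap {relative_gap(row):.4%}")
```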

Data Schema

Businesses

business_id: Unique identifier
business_name: Company name
email: Contact email
business_type: Category (transport, online retail, hotel)
branches: Pipe-separated list of branch addresses

Products

product_id: Unique identifier
product_name: Product name
quantity: Stock quantity
price: Product price
business_id: Foreign key to business
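
A minimal sketch of reading rows in this schema with the standard csv module follows. It assumes the CSV files have a header row whose column names match the field names above, that quantity is an integer, and that price is a decimal number. The sample rows and ID formats are made up for illustration and are not taken from businesses.csv or products.csv.

```python
import csv
import io

# Made-up rows in the documented column layout; not taken from the real files.
SAMPLE_BUSINESSES = """business_id,business_name,email,business_type,branches
B001,Example Transit,info@example.com,transport,12 Main St|48 Harbour Rd
"""
SAMPLE_PRODUCTS = """product_id,product_name,quantity,price,business_id
P001,Day Pass,250,4.50,B001
"""


def load_businesses(handle) -> list:
    """Read business rows and split the pipe-separated branches field into a list."""
    rows = []
    for row in csv.DictReader(handle):
        row["branches"] = [b.strip() for b in row["branches"].split("|") if b.strip()]
        rows.append(row)
    return rows


def load_products(handle) -> list:
    """Read product rows and convert the numeric fields (types are assumed)."""
    rows = []
    for row in csv.DictReader(handle):
        row["quantity"] = int(row["quantity"])
        row["price"] = float(row["price"])
        rows.append(row)
    return rows


businesses = load_businesses(io.StringIO(SAMPLE_BUSINESSES))
products = load_products(io.StringIO(SAMPLE_PRODUCTS))
print(businesses[0]["branches"], products[0]["business_id"])
```

For the real files, pass open("businesses.csv", newline="", encoding="utf-8") in place of the io.StringIO objects.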

Vector Embeddings

Model: sentence-transformers/all-MiniLM-L6-v2
Dimension: 384
Text Representation:
- Businesses: {business_name} {email} {business_type}
- Products: {product_name} quantity: {quantity} price: {price}
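
The text templates above can be sketched as two small formatting functions. The function names are illustrative, and the embed helper is an assumption about how the model might be called through the sentence-transformers package, not a copy of the benchmark script.

```python
def business_text(row: dict) -> str:
    """Build the text that is embedded for a business record."""
    return f"{row['business_name']} {row['email']} {row['business_type']}"


def product_text(row: dict) -> str:
    """Build the text that is embedded for a product record."""
    return f"{row['product_name']} quantity: {row['quantity']} price: {row['price']}"


def embed(texts: list):
    """Encode texts with all-MiniLM-L6-v2; requires the sentence-transformers package."""
    # Imported lazily because the package is heavy and downloads the model on first use.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    return model.encode(texts)  # array of shape (len(texts), 384)


# Made-up record for illustration.
print(product_text({"product_name": "Day Pass", "quantity": 250, "price": 4.5}))
```

Note that the business template omits the branches field, so two businesses that differ only in their branches produce identical embeddings.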

Stopping the Services

To stop and remove all Docker containers:

docker-compose down

To stop and remove containers along with volumes (deletes all data):

docker-compose down -v

Troubleshooting

Port Conflicts

If you encounter port conflicts:

Edit docker-compose.yml to use different ports
Update the port numbers in benchmark_vector_db_ingestion.py accordingly
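
One way to avoid editing the script each time a port changes is to keep the ports in a single mapping that can be overridden from the environment. This is a sketch of that idea only: the variable names QDRANT_PORT, WEAVIATE_PORT, and CHROMADB_PORT are hypothetical, and the benchmark script does not necessarily read them.

```python
import os
from typing import Mapping

# Default HTTP host ports, matching the setup section.
DEFAULT_PORTS = {"qdrant": 6333, "weaviate": 8081, "chromadb": 8000}


def resolve_ports(environ: Mapping[str, str] = os.environ) -> dict:
    """Return the HTTP port per service, allowing overrides such as QDRANT_PORT=7333."""
    ports = {}
    for service, default in DEFAULT_PORTS.items():
        raw = environ.get(f"{service.upper()}_PORT")  # hypothetical variable name
        ports[service] = int(raw) if raw else default
    return ports


print(resolve_ports({"QDRANT_PORT": "7333"}))
```

The overridden value must still match the host port mapped in docker-compose.yml.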

Container Not Starting

Check container logs:

docker-compose logs [service-name]
# Example: docker-compose logs weaviate

Resource Warnings

The ChromaDB client may emit minor resource warnings. These are expected and do not affect the benchmark results.

Technical Details

Batch Sizes

Qdrant: 100 records per batch
Weaviate: Dynamic batching
ChromaDB: 5,000 records per batch
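
To make the fixed batch sizes concrete, the sketch below splits a record list into batches and counts how many insert calls the 12,173-record dataset would need for Qdrant and ChromaDB. The batches helper is illustrative and not part of the benchmark script. Weaviate is left out because its dynamic batching does not use a fixed size.

```python
import math
from typing import Iterator, List, Sequence


def batches(records: Sequence, size: int) -> Iterator[List]:
    """Yield consecutive slices of at most `size` records."""
    if size <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(records), size):
        yield list(records[start:start + size])


TOTAL_RECORDS = 12_173
for name, size in {"Qdrant": 100, "ChromaDB": 5_000}.items():
    count = math.ceil(TOTAL_RECORDS / size)
    print(f"{name}: {count} batches of up to {size} records")
```

With these sizes, Qdrant needs 122 insert calls and ChromaDB needs 3, so per-call overhead weighs differently on each database.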

Distance Metrics

All databases use Cosine similarity for vector search.
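
For reference, cosine similarity compares the direction of two vectors and ignores their magnitude. The following is a minimal pure-Python sketch of the formula. The databases compute this internally with their own optimized implementations.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the dot product of a and b divided by the product of their norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Vectors pointing the same way score close to 1 regardless of length.
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```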

License

This is a benchmarking project for educational purposes.

Contributing

Feel free to extend this benchmark with:

Additional vector databases
Different embedding models
Query performance tests
Larger datasets
Memory usage metrics
