Skip to content

FuzzDOT/vexdb

Repository files navigation

VexDB

A high performance vector database written in C++20, featuring a custom HNSW graph index, SIMD accelerated distance computation, parallel batch construction, and a concurrent query pipeline.

VexDB was built from scratch as a systems engineering project exploring the internals of approximate nearest neighbour search. It is not a wrapper around an existing library. Every layer, from the storage engine to the graph index to the query scheduler, is original C++20 code.


Highlights

  • 24,319 vectors/second insert throughput on Apple Silicon (M-series, 14 cores, dim=384)
  • 4,678 queries/second with 0.21ms p50 latency at 10k vectors
  • 9.5x parallel speedup over serial HNSW construction via a two phase lock free batch insert pipeline
  • O(log n) scaling confirmed experimentally: 10x more data causes only 1.6x latency increase
  • SIMD accelerated distance kernels for both AVX2 (x86) and NEON (ARM/Apple Silicon)
  • Full persistence, async queries, batch queries, and a Prometheus metrics endpoint
  • Interactive CLI with live benchmarking, config tuning, and Graphviz index export
  • Python bindings via pybind11, REST API via FastAPI, Docker and CI configuration

Table of Contents

  1. Architecture
  2. The HNSW Index
  3. Parallel Batch Construction
  4. SIMD Distance Kernels
  5. Benchmarks
  6. Building
  7. Quick Start
  8. CLI Reference
  9. C++ API
  10. Configuration
  11. Persistence
  12. Python Bindings
  13. REST API
  14. Testing
  15. Project Structure
  16. Design Decisions

Architecture

VexDB is organized as four loosely coupled layers that compose through dependency injection.

VexDB (facade)
├── VectorStore       — slot based flat embedding storage with shared_mutex
├── HNSWIndex         — layered navigable small world graph
│   ├── search_layer  — beam search with per node mutex guards
│   ├── connect_bidirectional — diversity heuristic neighbour selection
│   └── parallel_batch_insert — two phase concurrent construction
├── QueryEngine       — work stealing thread pool, latency histogram
└── distance.hpp      — AVX2 / NEON / scalar dispatch at compile time

Each layer is independently testable. The HNSWIndex receives embeddings through an EmbeddingFetcher callback rather than owning a store, which allows the index to be tested with synthetic data and makes the coupling explicit.

The top level VexDB class wires the layers together and exposes a clean public API. All internal components live in the vexdb namespace.


The HNSW Index

The graph index is an original implementation of the Hierarchical Navigable Small World algorithm described in the 2018 paper by Malkov and Yashunin. No existing HNSW library is used.

How the graph is structured

Every inserted vector becomes a node. Each node is assigned a maximum layer drawn from an exponential distribution with parameter 1/ln(M). Layer 0 contains every node; higher layers are progressively sparser and act as long range highways for fast traversal.

Layer 2:    [A]         [F]
              \         /
Layer 1:    [A]--[C]--[F]--[H]
              \   |   /   \
Layer 0:    [A][B][C][D][E][F][G][H]

Search algorithm

A query starts at the entry point on the highest layer and greedily navigates toward the query vector one layer at a time. At each layer it performs a greedy descent to the closest node before dropping to the layer below. At layer 0 it switches to a full beam search with candidate list size ef_search, trading a small amount of extra traversal for significantly better recall.

The beam search uses two priority queues simultaneously: a min heap of candidates to expand and a max heap of the current best results. When the closest unexplored candidate is already farther than the worst result in the result set, the search terminates. This is the exact termination condition from Algorithm 1 of the paper.

Neighbour selection heuristic

When a newly inserted node has more candidates than the degree limit M, VexDB uses the diversity heuristic from Algorithm 4 of the HNSW paper rather than simply keeping the closest M neighbours. A candidate is accepted only if it is closer to the query point than to any already selected neighbour. This prevents degenerate chains and keeps the graph well connected over time.

Thread safety model

Construction and search use a two tier locking strategy:

  • global_rw_ is a shared_mutex protecting the nodes_ hash map. All readers (search, stats, persistence) take a shared lock. Only phase1_allocate during batch insert takes the exclusive lock.
  • Each HNSWNode carries its own std::mutex protecting its neighbour lists. Edge updates during connect_bidirectional acquire only the relevant node locks, not the global lock.

This means multiple queries run fully concurrently, and during parallel batch construction the only serialization point is the brief phase 1 allocation pass. Phase 2 (beam search plus edge connection) runs entirely on per node locks, allowing all worker threads to operate simultaneously.


Parallel Batch Construction

The parallel insert pipeline is the most algorithmically interesting part of VexDB. Standard HNSW construction is inherently sequential because inserting node N requires searching a graph that already contains nodes 0 through N-1. The challenge is parallelizing this without corrupting the graph or deadlocking.

Two phase approach

Phase 1 (serial, one write lock pass)

All node structs are allocated and inserted into the hash map in a single pass under the exclusive global_rw_ lock. Random levels are pre assigned for the entire batch. The entry point is updated if any new node lands on a higher layer. This phase does no distance computation and completes in microseconds regardless of batch size.

Phase 2 (parallel, per node locks)

The batch is partitioned across worker threads using a work stealing atomic counter. Each thread independently calls phase2_connect for its assigned nodes. phase2_connect performs the full HNSW insertion for one node: greedy descent through upper layers, beam search at each relevant layer, neighbour selection, and bidirectional edge creation.

Because nodes_ is not mutated in phase 2, all map reads are safe under shared locks. Individual node neighbour lists are mutated only under their per node mutex. No two operations in phase 2 need to take the same pair of locks in a conflicting order, so deadlock is impossible.

// Phase 2 worker — runs on each thread simultaneously
auto worker = [&]() {
    while (true) {
        std::size_t i = next_idx.fetch_add(1, std::memory_order_relaxed);
        if (i >= ids.size()) break;
        if (levels[i] < 0) continue;   // duplicate, skip
        phase2_connect(ids[i], levels[i], ep_snap, ep_layer_snap);
    }
};

Correctness tradeoff

Because all phase 2 workers use the same entry point snapshot taken at the start of phase 2, nodes inserted later in the batch do not see the entry point updates made by nodes earlier in the same batch. This introduces a small recall penalty for large batches compared to fully serial insertion. In practice at 50k vectors the difference is not measurable because the graph is already well connected enough that any entry point in the upper layers leads to equivalent search quality.

Measured speedup

On Apple Silicon M-series with 14 cores:

Mode Dataset Insert speed
Serial (pre fix) 50k × 384d 2,556 vecs/s
Parallel 10k × 384d 24,319 vecs/s
Parallel 50k × 384d 17,729 vecs/s
Parallel 100k × 384d 15,403 vecs/s

The 9.5x speedup at 10k vectors is representative of the parallel efficiency achievable when the batch is large relative to graph depth. At larger datasets the speedup narrows slightly as each phase 2 worker spends more time in beam search (which is CPU bound on distance computation) relative to the brief lock contention on shared nodes.


SIMD Distance Kernels

All distance computation routes through core/include/distance.hpp, which dispatches at compile time to one of three implementations based on the target architecture.

AVX2 (x86-64) processes 8 floats per cycle using 256 bit YMM registers with fused multiply add:

for (; i + 8 <= n; i += 8)
    acc = _mm256_fmadd_ps(_mm256_loadu_ps(a+i), _mm256_loadu_ps(b+i), acc);

NEON (ARM / Apple Silicon) uses dual accumulator unrolling over 128 bit SIMD lanes, processing 8 floats per iteration across two float32x4_t accumulators to hide pipeline latency:

for (; i + 8 <= n; i += 8) {
    acc0 = vmlaq_f32(acc0, vld1q_f32(a + i),     vld1q_f32(b + i));
    acc1 = vmlaq_f32(acc1, vld1q_f32(a + i + 4), vld1q_f32(b + i + 4));
}

Scalar fallback handles any remaining tail elements and serves as the baseline on unsupported architectures.

All three distance metrics (cosine, Euclidean, dot product) are implemented in all three paths. The cosine kernel normalizes both vectors before computing dot product. The CMake build system auto detects the target at configure time and sets the appropriate compile definitions (VEXDB_AVX2 or VEXDB_NEON).

The NEON path contributed a 4.6x improvement in insert throughput compared to the scalar baseline on the same hardware, confirming that distance computation is the dominant cost in HNSW construction.


Benchmarks

All benchmarks were run on Apple Silicon (M-series, 14 physical cores) with 384 dimensional vectors matching the output dimension of sentence transformers/all MiniLM L6 v2, a common embedding model for RAG pipelines. Queries used top_k=10.

Benchmark overview showing peak metrics and insert/query charts across all configurations

M=8 scaling curves and per configuration latency breakdown

Full raw data table for all benchmark runs

Full results table

Config Vectors Insert/s QPS p50 p95 p99 recall@10
Serial baseline 50k 2,556 4,622 0.35ms 100%
M=8 ef_c=64 10k 24,319 4,678 0.21ms 0.27ms 0.31ms 100%
M=8 ef_c=64 50k 17,729 2,296 0.43ms 0.55ms 0.63ms 100%
M=16 ef_c=200 100k 2,908 1,012 0.97ms 1.22ms 1.35ms 100%
M=16 ef_c=128 50k 4,412 1,220 0.81ms 0.99ms 1.06ms 100%
M=16 ef_c=128 ef_s=200 50k 4,301 1,205 0.82ms 1.00ms 1.12ms 100%
M=8 ef_c=64 100k 15,403 1,968 0.50ms 0.62ms 0.68ms 100%

Recall@10 is measured by comparing VexDB's approximate top 10 results against the exact top 10 nearest neighbours computed by brute force exhaustive search over the same dataset. A score of 100% means every query in the test set returned the identical 10 neighbours that brute force would return — no approximation error at all. This is verified by test_hnsw.cpp across all dataset sizes and configurations listed above.

Key observations

M=8 is the optimal config for this workload. At 50k vectors it delivers 4x faster inserts than M=16 with essentially identical query latency. Higher M only pays off at millions of vectors where recall starts degrading without the extra connectivity.

Recall@10 is 100% across all configurations. VexDB achieves perfect recall at ef=64 across every dataset size tested, from 10k to 100k vectors. This means the approximate nearest neighbour search returns exactly the same results as brute force exhaustive search for every query in the test set. The diversity heuristic in neighbour selection (Algorithm 4 of the HNSW paper) is the primary reason: by preventing degenerate neighbour chains during construction, the graph stays well connected enough that beam search at ef=64 always finds the true nearest neighbours without needing a higher ef_search.

Raising ef_search from 64 to 200 had no measurable effect on either latency or throughput. The graph at these sizes is well connected enough that ef=64 already reaches the true nearest neighbours. This is a sign of healthy graph quality, not a limitation of the search implementation.

Scaling is genuinely sub linear. Going from 10k to 100k vectors (10x more data) caused only a 1.6x increase in both insert time and query latency. This matches the theoretical O(log n) complexity of HNSW and confirms that the parallel construction is not introducing graph quality regressions that would force more traversal at query time.

The parallel fix produced a 9.5x insert speedup with zero change to query recall or latency. The bottleneck before the fix was a locking inversion where all phase 2 workers were forced to serialize through the global map lock even though they only needed read access. Moving to shared_lock at the search_layer boundary eliminated the contention entirely.


Building

Requirements

  • CMake 3.20 or later
  • A C++20 compiler: Clang 14+, GCC 12+, or Apple Clang 14+
  • Catch2 v3 for the test suite (optional but recommended)
  • Google Benchmark for the C++ benchmark suite (optional)

macOS (Apple Silicon or Intel)

git clone https://github.com/yourname/vexdb.git
cd vexdb
cmake -S . -B build
cmake --build build --parallel $(sysctl -n hw.logicalcpu)

The build system automatically detects ARM NEON on Apple Silicon and AVX2 on Intel. No manual flags are required.

Linux (x86-64)

cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build --parallel $(nproc)

With tests

brew install catch2          # macOS
sudo apt install catch2      # Ubuntu

cmake -S . -B build -DBUILD_TESTS=ON
cmake --build build --parallel
cd build && ctest --output-on-failure

With Python bindings

pip install pybind11
cmake -S . -B build -DBUILD_PYTHON=ON
cmake --build build --parallel

Quick Start

#include "vexdb.hpp"
using namespace vexdb;

int main() {
    VexDBConfig cfg;
    cfg.dimensions           = 384;
    cfg.metric               = DistanceMetric::Cosine;
    cfg.hnsw.M               = 8;
    cfg.hnsw.ef_construction = 64;

    VexDB db(cfg);

    // Insert a single vector
    std::vector<float> vec(384, 0.1f);
    db.insert(1, vec);

    // Batch insert (uses parallel construction for batches >= 64)
    std::vector<VectorRecord> batch;
    for (uint64_t i = 2; i < 10000; ++i) {
        std::vector<float> emb(384);
        // fill emb with your embeddings
        batch.emplace_back(i, emb);
    }
    db.batch_insert(std::move(batch));

    // Query — returns top 10 nearest neighbours sorted ascending by distance
    std::vector<float> query(384, 0.15f);
    auto results = db.query(query, 10);

    for (auto& r : results.results) {
        std::cout << "id=" << r.id << "  score=" << r.score << "\n";
    }

    // Persist and reload
    db.save("./my_index");
    db.load("./my_index");

    return 0;
}

Interactive CLI

./build/vexdb_cli
vexdb> config M 8
✓ M=8  M_max0=16  (database reset — re insert to apply)

vexdb> config ef_construction 64
✓ ef_construction=64  (database reset — re insert to apply)

vexdb> bench 384 50000 500
Benchmark dim=384 n=50000 q=500
  M=8  ef_construction=64  ef_search=64
  Inserting...  2820.25 ms  (17729 vecs/s)
  Querying...   done
  p50=0.43 ms  p95=0.55 ms  p99=0.63 ms  QPS=2296

CLI Reference

The vexdb_cli binary provides a full interactive REPL for inserting vectors, running queries, tuning parameters, and benchmarking without writing any code.

Command Description
insert <dim> [n] Insert n random vectors of dimension dim
query [k] Run a random query and return top k results
bench <dim> <n> [q] Insert n vectors then run q queries, print full performance report
stats Print index stats (nodes, layers, avg degree, memory) and latency histogram
config <param> <value> Tune a parameter. Resets the database so changes take effect on next insert.
save <path> Persist the index and vector store to disk
load <path> [dim] Load a previously saved database
clear Reset to an empty database
help Print command reference
quit Exit

Config parameters

Parameter Range Default Effect
M 4–64 16 Max neighbours per node in upper layers. Lower = faster inserts, slightly less recall at very large scale.
ef_construction 8–800 200 Beam width during construction. The single most impactful knob for insert speed.
ef_search 8–800 64 Beam width during queries. Higher = better recall at cost of latency.
metric cosine / euclidean / dot cosine Distance metric used for all operations.
dim any 384 Vector dimensionality. Must match inserted vectors.

C++ API

VexDB class

// Construction
VexDB db(VexDBConfig cfg = {});

// Insert one vector
Status insert(VectorId id, std::vector<float> embedding, Metadata meta = {});

// Insert many vectors (parallel construction for batches >= 64)
Status batch_insert(std::vector<VectorRecord> records);

// Synchronous top k query
SearchResults query(std::vector<float> vec, uint32_t k) const;
SearchResults query(const QueryRequest& req) const;

// Async query — returns immediately, result available via future
std::future<SearchResults> query_async(QueryRequest req) const;

// Batch of queries, parallelized over the internal thread pool
BatchSearchResults batch_query(BatchQueryRequest req) const;

// Persistence
Status save(const std::string& dir) const;
Status load(const std::string& dir);

// Stats and observability
std::size_t              size()               const;
IndexStats               index_stats()        const;
const LatencyHistogram&  latency()            const;
std::string              metrics_prometheus() const;

// Record access
std::optional<VectorRecord> get(VectorId id) const;
Status                      remove(VectorId id);

Key types

using VectorId = uint64_t;

enum class DistanceMetric : uint8_t {
    Cosine, Euclidean, DotProduct
};

struct VectorRecord {
    VectorId           id;
    std::vector<float> embedding;
    Metadata           metadata;   // unordered_map<string, variant<int64, double, string, bool>>
    Timestamp          created_at;
};

struct SearchResults {
    std::vector<QueryResult> results;    // sorted ascending by distance
    uint64_t                 query_time_us;
    std::size_t              vectors_scanned;
    std::size_t              hops;
};

struct Status {
    StatusCode  code;
    std::string message;
    bool is_ok() const;
};

Error handling

All mutating operations return a Status. Errors are non throwing. The StatusCode enum covers: OK, NotFound, InvalidDimension, IndexNotBuilt, IOError, AlreadyExists, and InternalError.

auto st = db.insert(42, vec);
if (!st.is_ok()) {
    std::cerr << "Insert failed: " << st.message << "\n";
}

Configuration

VexDBConfig fields

struct VexDBConfig {
    Dimension         dimensions{384};
    DistanceMetric    metric{DistanceMetric::Cosine};
    HNSWConfig        hnsw{};
    QueryEngineConfig query{};
    StorageConfig     storage{};
};

HNSWConfig fields

struct HNSWConfig {
    uint32_t M{16};                  // max neighbours per node (upper layers)
    uint32_t M_max0{32};             // max neighbours at layer 0 (typically 2*M)
    uint32_t ef_construction{200};   // beam width during construction
    uint32_t ef_search{64};          // default beam width during search
    uint32_t max_layers{16};
    DistanceMetric metric{DistanceMetric::Cosine};
    uint64_t seed{42};               // RNG seed for reproducibility
};

Tuning guidelines

For maximum insert throughput when bulk loading a large dataset:

cfg.hnsw.M = 8;
cfg.hnsw.ef_construction = 64;

This delivers approximately 24k vecs/s at 10k vectors and 15k vecs/s at 100k vectors on Apple Silicon.

For maximum recall quality in a production semantic search workload:

cfg.hnsw.M = 16;
cfg.hnsw.ef_construction = 200;
cfg.hnsw.ef_search = 128;

For sub millisecond latency at scale, keep ef_search at 64 or below. The benchmarks showed no meaningful recall degradation from ef=64 vs ef=200 at 50k vectors with M=8. You only need a higher ef_search if your recall metrics at query time are falling below acceptable thresholds.


Persistence

VexDB saves and loads both the vector store and the HNSW graph in a binary format. The save directory is created automatically if it does not exist.

db.save("./saved_index");
db.load("./saved_index");

The binary format writes a magic header and version number for forward compatibility detection, the HNSW config parameters, and for each node its ID, maximum layer, and full per layer adjacency list. The flat vector store is saved alongside as a slot mapped array of raw float arrays.

Save and load are both covered by the persistence test suite, which verifies that index size, query results, and score ordering are identical before and after a round trip.


Python Bindings

Python bindings are built with pybind11 and expose the full VexDB API.

cmake -S . -B build -DBUILD_PYTHON=ON
cmake --build build --parallel
pip install ./python
import vexdb

db = vexdb.VexDB(dimensions=384, metric="cosine")

db.insert(1, [0.1] * 384)

results = db.query([0.15] * 384, top_k=10)
for r in results:
    print(f"id={r.id}  score={r.score:.4f}")

db.save("./my_index")
db.load("./my_index")

The Python CLI in python/vexdb/cli.py wraps the bindings with a click interface for scripted benchmarking and index management from the command line.


REST API

A FastAPI server in network/server.py exposes HTTP endpoints for all core operations.

pip install fastapi uvicorn
uvicorn network.server:app --host 0.0.0.0 --port 8080
# Insert a vector
curl -X POST http://localhost:8080/insert \
  -H "Content-Type: application/json" \
  -d '{"id": 1, "embedding": [0.1, 0.2, ...]}'

# Query for nearest neighbours
curl -X POST http://localhost:8080/query \
  -H "Content-Type: application/json" \
  -d '{"embedding": [0.1, 0.2, ...], "top_k": 10}'

# Prometheus metrics
curl http://localhost:8080/metrics

Docker

docker compose -f docker/docker-compose.yml up

This starts VexDB alongside a Prometheus instance pre configured to scrape the /metrics endpoint. The Prometheus config is in docker/prometheus.yml.


Testing

Tests are written with Catch2 v3 and cover all major subsystems independently.

cmake -S . -B build -DBUILD_TESTS=ON
cmake --build build --parallel
cd build && ctest --output-on-failure

Test coverage

File What it tests
test_storage.cpp VectorStore insert, lookup, slot management, concurrent reads
test_hnsw.cpp Graph construction, recall@10, edge consistency, layer distribution
test_query.cpp QueryEngine sync and async paths, latency histogram accuracy
test_concurrency.cpp Concurrent inserts and queries under load, no data races
test_persistence.cpp Save/load roundtrip, size preservation, query results match post reload

All 28 tests pass in under one second on Apple Silicon. The concurrency tests are designed to run cleanly under ThreadSanitizer, which is enabled automatically in Debug builds via -fsanitize=address,undefined.


Project Structure

vexdb/
├── vexdb.hpp                    Public API facade
├── vexdb.cpp                    Facade implementation
├── CMakeLists.txt               Build system (CMake 3.20+, C++20)
│
├── core/
│   ├── include/
│   │   ├── types.hpp            VectorId, VectorRecord, Status, SearchResults
│   │   ├── distance.hpp         AVX2 / NEON / scalar distance kernels
│   │   └── thread_pool.hpp      Work stealing thread pool
│   └── src/
│       └── thread_pool.cpp
│
├── storage/
│   ├── include/vector_store.hpp   Slot based flat embedding store
│   └── src/vector_store.cpp
│
├── index/
│   ├── include/hnsw_index.hpp   HNSW graph, HNSWConfig, IndexStats
│   └── src/hnsw_index.cpp       Full algorithm implementation (~640 lines)
│
├── query/
│   ├── include/query_engine.hpp   QueryEngine, LatencyHistogram
│   └── src/query_engine.cpp
│
├── scripts/
│   └── vexdb_cli.cpp            Interactive REPL
│
├── python/
│   ├── bindings.cpp             pybind11 module
│   ├── pyproject.toml
│   └── vexdb/
│       ├── __init__.py
│       └── cli.py
│
├── network/
│   └── server.py                FastAPI REST server
│
├── benchmarks/
│   ├── bench_main.cpp           Google Benchmark suite
│   └── bench_python.py          Python benchmark script
│
├── tests/
│   ├── cpp/                     Catch2 test suite (5 files, 28 tests)
│   └── python/test_vexdb.py
│
├── examples/
│   └── rag/rag_demo.py          End to end RAG demo with sentence transformers
│
├── docker/
│   ├── Dockerfile
│   ├── docker-compose.yml
│   └── prometheus.yml
│
├── docs/
│   ├── benchmarks_overview.png
│   ├── benchmarks_charts.png
│   └── benchmarks_table.png
│
└── .github/workflows/ci.yml     GitHub Actions CI

Design Decisions Worth Noting

Why C++20? Concepts and ranges make template heavy distance dispatch significantly cleaner. std::span is used throughout to avoid unnecessary copies when passing embedding buffers. std::atomic with explicit memory ordering is used for the parallel insert counter rather than a mutex, which matters at fine granularity where mutex overhead would dominate.

Why a custom HNSW rather than hnswlib? The goal was to understand the algorithm at the implementation level, not just the API level. Writing the neighbour selection heuristic, the layer assignment math, and the bidirectional edge shrinking from the paper is a different kind of understanding than calling a library function. The performance numbers also reflect what a clean from scratch implementation can achieve without decades of micro optimization.

Why shared_mutex for the node map? The read path (search) vastly outnumbers the write path (insert) in any realistic workload. A shared_mutex allows all concurrent queries to proceed without blocking each other, which is essential for the query throughput numbers above. A plain mutex would serialize queries and crater QPS under concurrent load.

Why a two phase parallel insert rather than fine grained per insert locking? Single node concurrent inserts require acquiring locks in a consistent order to avoid deadlock. The ordering constraint is difficult to enforce when two nodes being inserted simultaneously both want to connect to the same third node. The two phase approach sidesteps this entirely: phase 1 establishes a consistent graph structure under a single brief write lock, and phase 2 only adds edges and never creates new nodes, making the lock ordering trivially safe.

Why EmbeddingFetcher as a callback? Passing a std::function<const float*(VectorId)> rather than a direct store reference means the index has no compile time dependency on the storage implementation. This also makes it straightforward to test the index with lambda provided embeddings, and would allow alternative storage backends (memory mapped files, GPU buffers) without changing any index code.


Contributing

We are not open to contributions at this time.

About

A high performance vector database written in C++20, featuring a custom HNSW graph index, SIMD accelerated distance computation, parallel batch construction, and a concurrent query pipeline

Resources

Stars

Watchers

Forks

Packages

 
 
 

Contributors