Pure-Zig vector similarity search. Zig 0.16, AGPL-3.0, single-binary, libc-only.
v0.7 — four index families: Flat + HNSW + IVFFlat + IVFPQ.
| Index family | Module | Default config | Headline @ N=10K D=128 |
|---|---|---|---|
| Flat | FlatIndex |
L2 / IP / cosine, SIMD | 1,792 µs/q, 557 q/s, ground truth |
| HNSW | IndexHNSW |
M=16, efC=200, efS=100 | 70%+ top-10 recall vs Flat |
| IVFFlat | IndexIVFFlat |
nlist=100, nprobe=10 | 208 µs/q, 4,801 q/s, 8.61x vs Flat |
| IVFPQ | IndexIVFPQ |
nlist=100, nprobe=10, M=8 | 522 µs/q, 1,915 q/s, 3.43x vs Flat, 16.94x mem-compression |
End-to-end head-to-head against faiss-cpu (Python wrapper of
facebookresearch/faiss) on N=10000, D=128, K=10, Q=100, L2:
| Implementation | ns/query | queries/sec | vs faiss-cpu |
|---|---|---|---|
faiss-cpu 1.13.2 (C++ + AVX-2 + multi-thread) |
~510,000 | ~1,960 | 1.00x (ref) |
faiss-zig v0.2 (Zig + @Vector + 4 threads) |
~772,000 | ~1,295 | ~1.5x slower |
faiss-zig v0.1 (Zig + @Vector + 1 thread) |
~2,290,000 | ~440 | ~4.5x slower |
| faiss-zig v0.0.1 (Zig scalar + 1 thread) | ~22,900,000 | ~44 | ~45x slower |
Versions are the FlatIndex baselines; HNSW/IVFFlat/IVFPQ since v0.2 add approximate-index speedups on top of the v0.2 Flat number.
Inner-loop distance throughput (no Index overhead, median of 3 trials, N=100K vectors per dim, single thread):
| Dim | SIMD @Vector(8, f32) |
Scalar reduction | Ratio |
|---|---|---|---|
| 16 | 7 ns/dist | 16 ns/dist | 2.3x |
| 64 | 23 ns/dist | 103 ns/dist | 4.5x |
| 128 | 49 ns/dist | 239 ns/dist | 4.9x |
| 256 | 89 ns/dist | 489 ns/dist | 5.5x |
| 512 | 169 ns/dist | 1,024 ns/dist | 6.1x |
| 1024 | 362 ns/dist | 2,148 ns/dist | 5.9x |
The scalar loop is also ReleaseFast Zig — the win comes from
@Vector(8, f32) letting LLVM emit 8 parallel f32 accumulators
that the strict single-accumulator scalar reduction can't.
-
FlatIndex — brute-force kNN with L2-squared / inner-product / cosine. SIMD
@Vector(8, f32)distance loops, multi-threadedsearchBatch. -
IndexHNSW — hierarchical navigable small-world (Malkov 2016). L2-squared only at v0.3+. Default M=16, ef_construction=200, ef_search=100. Greedy multi-level descent + ef_search-bounded layer search.
-
IndexIVFFlat — inverted-file approximate index (v0.6). Lloyd k-means clusters the training set into
nlistcentroids; at query time we rank centroids and scan thenprobenearest lists. Default nlist=100, nprobe=10, n_iter=10, L2-squared. -
IndexIVFPQ — inverted-file + Product Quantization (v0.7, the production FAISS workhorse). Coarse k-means → nlist centroids; each vector stored as an M-byte PQ code of its residual against the assigned coarse centroid. Search uses Asymmetric Distance Computation (ADC) via a per-query M × 256 LUT — pure byte-fetch + sum, no per-vector multiply. Default nlist=100, nprobe=10, pq_m=8. 16.94x memory compression vs raw f32 storage at N=10K D=128 (295 KB total vs 5,000 KB), 42.67x per-vector compression (12 bytes/vec vs 512).
Recall trades against memory: 0.10 on uniform [0,1]^128 random data at this config (real clustered embeddings recover toward the FAISS-published 0.6–0.8 R@10).
-
Public
faiss.scalar.*/faiss.distance.*for callers that want to bypass the Index wrapper or compare paths.
- Cosine + inner-product on HNSW / IVFFlat / IVFPQ — L2-squared only at v0.7; cosine + IP land in v0.7.1.
- Optimal Product Quantization (OPQ rotation matrix) — v0.8.
- k-means++ init — v0.8 quality bump. Currently uniform-random sampling.
- Parallel k-means training — v0.8. Currently single-thread.
- GPU — not a near-term target.
- Disk-backed indexes — future.
- Vector serialisation format — future.
const std = @import("std");
const faiss = @import("faiss");
pub fn main() !void {
var gpa: std.heap.DebugAllocator(.{}) = .init;
defer _ = gpa.deinit();
const alloc = gpa.allocator();
var idx = faiss.FlatIndex.init(alloc, 128, .l2_squared);
defer idx.deinit();
var v: [128]f32 = undefined;
// ... fill v ...
_ = try idx.add(&v);
var query: [128]f32 = undefined;
const hits = try idx.search(alloc, &query, 10);
defer alloc.free(hits);
for (hits) |h| std.debug.print("id={d} score={d:.4}\n", .{ h.id, h.score });
}zig build test # unit tests (incl. SIMD vs scalar agreement)
zig build search # synthetic kNN demo CLI
zig build bench # flat-index throughput
zig build bench-simd # SIMD vs scalar speedup at varied D
# faiss-cpu head-to-head (requires venv with faiss-cpu + numpy)
python3 -m venv .venv && source .venv/bin/activate
pip install faiss-cpu numpy
LD_LIBRARY_PATH=/usr/lib python3 bench/bench_faiss_cpu.pyfacebookresearch/faiss— C++ canonical. faiss-zig at v0.7 covers the same four index families that matter most in production: Flat, HNSW, IVFFlat, IVFPQ. The C++ side has hand-tuned AVX-2 + cache-aware tiling + GPU; faiss-zig is multi-threaded SIMD on CPU only. v0.2 closed the Flat gap to ~1.5x. HNSW / IVFFlat / IVFPQ approximate-index numbers compete in their own regimes.
AGPL-3.0-or-later. See LICENSE.
- No GPU. Not a near-term target.
- No cache-aware tiling on FlatIndex. faiss-cpu blocks; we don't. Closing remaining gap below 1.5x needs this.
- L2-squared only on the three approximate indexes at v0.7. Cosine + IP for HNSW / IVF / IVFPQ are v0.7.1.
- Single-platform tested (x86_64-linux, Zig 0.16 ReleaseFast, Ryzen-class consumer Linux).
- The SIMD speedup table reflects this specific CPU's
auto-vectorisation behavior at ReleaseFast. Different
compilers /
-Olevels / CPUs may produce different ratios. - Recall on uniform random data is misleading. PQ shines on clustered real embeddings; the 0.10 R@10 number at default config on uniform [0,1]^128 is an artifact of the worst-case distribution, not the expected real-world recall.