Google Benchmark harness for FAISS low-level SIMD distance functions. Part of the ES SimdVec comparison suite — measures FAISS natively in C++ to compare against ES SimdVec (benchmarked from Java via JMH).
| Level | Benchmark | FAISS function | Comparable ES SimdVec API |
|---|---|---|---|
| 1 | BM_dot_f32 | fvec_inner_product | Similarities.dotProductF32 |
| 1 | BM_l2sqr_f32 | fvec_L2sqr | Similarities.squareDistanceF32 |
| 2a | BM_dot_f32_batch4 | fvec_inner_product_batch_4 | ESVectorUtil.squareDistanceBulk (4-way) |
| 2a | BM_l2sqr_f32_batch4 | fvec_L2sqr_batch_4 | ESVectorUtil.squareDistanceBulk (4-way) |
| 2b | BM_dot_f32_ny | fvec_inner_products_ny | Similarities.dotProductF32Bulk |
| 2b | BM_l2sqr_f32_ny | fvec_L2sqr_ny | Similarities.squareDistanceF32Bulk |
| 2b | BM_dot_f32_loop_seq | fvec_inner_product loop, sequential | (loop-of-singles baseline) |
| 2b | BM_dot_f32_loop_random | fvec_inner_product loop, random | (HNSW-like scattered access) |
Level 1 — Single-vector kernel throughput (one pair, ns/op). Dimensions: 128, 256, 384, 512, 768, 1024, 1536, 3072.
Level 2a — Batch-of-4 (one query vs 4 doc vectors in a single call). Comparable to HNSW 4-way neighbor scoring.
Level 2b — 1-query-N-docs bulk (one query vs N contiguous doc vectors). Two ranges are available:
- BulkRange: N = 4, 16, 32, 64, 128 at dims = 384, 768, 1024.
- BulkRange_1024: N = 32, 625, 32500 at dims = 1024 (sized to exceed L1, L2, and L3 cache respectively for 1024d float32 vectors).
Two loop-of-singles baselines compare against the native fvec_inner_products_ny bulk API:
- BM_dot_f32_loop_seq: iterates vectors 0..N-1 in order (best case for HW prefetch).
- BM_dot_f32_loop_random: same N vectors visited via shuffled ordinals, simulating HNSW-like scattered graph neighbor access that defeats HW prefetching.
```sh
sudo apt update
sudo apt install -y \
  build-essential \
  cmake \
  git \
  libopenblas-dev \
  liblapack-dev \
  swig
```
- build-essential — GCC/G++, make
- cmake — build system (>= 3.14 required)
- libopenblas-dev — BLAS implementation (FAISS requires BLAS to link; the low-level distance functions we benchmark are pure SIMD and do not use BLAS at runtime)
- liblapack-dev — LAPACK (FAISS cmake expects it)
- swig — may be needed by FAISS cmake even with Python disabled
Google Benchmark is fetched automatically via CMake FetchContent.
```sh
git clone --depth 1 https://github.com/facebookresearch/faiss.git /path/to/faiss
cd /path/to/faiss-benchmarks
mkdir build && cd build
cmake .. \
  -DFAISS_DIR=/path/to/faiss \
  -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
```
The CMakeLists.txt auto-detects the architecture: on ARM (Graviton) it sets FAISS_OPT_LEVEL=generic (FAISS uses NEON intrinsics automatically); on x86 it defaults to FAISS_OPT_LEVEL=avx512 and links against libfaiss_avx512.
To build an AVX2-only binary instead (e.g. for A/B comparison):
```sh
mkdir build-avx2 && cd build-avx2
cmake .. \
  -DFAISS_DIR=/path/to/faiss \
  -DCMAKE_BUILD_TYPE=Release \
  -DFAISS_OPT_LEVEL=avx2 \
  -DFAISS_LINK_TARGET=faiss_avx2
make -j$(nproc)
```
This lets you produce separate faiss-avx512.json and faiss-avx2.json result files from the two builds for ISA-level comparison.
Quick correctness check and timing for 1024 dimensions:
```sh
./faiss_probe
```
Expected output:
```
=== FAISS Distance Kernel Probe ===
Compiled with: AVX-512
Compiler flags: -O3 -march=native (expected)
fvec_inner_product(1024d): -8.123456
fvec_L2sqr(1024d): 682.345678
...
fvec_inner_product(1024d): 18.3 ns/op
fvec_inner_product_batch_4(1024d): 52.1 ns/op (13.0 ns/score)
fvec_inner_products_ny(1024d, N=64): 1150.0 ns/op (18.0 ns/score)
```
```sh
./faiss_bench --benchmark_repetitions=5
```
Run a subset (e.g. only single-vector dot product):
```sh
./faiss_bench --benchmark_filter="BM_dot_f32/"
```
Run only 1024 dimensions:
```sh
./faiss_bench --benchmark_filter=".*1024.*"
```
Run only Level 2b bulk benchmarks (native bulk API):
```sh
./faiss_bench --benchmark_filter="BM_.*_ny"
```
Run only the large-N bulk benchmarks at 1024d (cache-pressure sweep):
```sh
./faiss_bench --benchmark_filter="BM_dot_f32_(ny|loop_seq|loop_random)/1024" --benchmark_min_time=2s
```
Output as JSON (for merging with JMH results):
```sh
./faiss_bench --benchmark_repetitions=5 \
  --benchmark_out=faiss_results.json \
  --benchmark_out_format=json
```
For stable results, pin to a single CPU core and disable frequency scaling:
```sh
# x86: disable turbo boost
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null
# or for AMD:
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost 2>/dev/null
# Pin to core 0
taskset -c 0 ./faiss_bench --benchmark_repetitions=5
```
On ARM (Graviton), frequency scaling is typically fixed, but you can still pin:
```sh
taskset -c 0 ./faiss_bench --benchmark_repetitions=5
```
```
faiss-benchmarks/
├── CMakeLists.txt        # Build config, fetches Google Benchmark, links FAISS
├── README.md             # This file
├── src/
│   ├── faiss_bench.cpp   # Google Benchmark harness (Level 1 + 2)
│   └── faiss_probe.cpp   # Diagnostic: correctness + quick timing
└── .gitignore
```