ChrisHegarty/faiss-kernel-benchmarks
# FAISS Distance Kernel Benchmarks

A Google Benchmark harness for FAISS's low-level SIMD distance functions. Part of the ES SimdVec comparison suite: it measures FAISS natively in C++ so results can be compared against ES SimdVec, which is benchmarked from Java via JMH.

## What is benchmarked

| Level | Benchmark | FAISS function | Comparable ES SimdVec API |
|-------|-----------|----------------|---------------------------|
| 1 | `BM_dot_f32` | `fvec_inner_product` | `Similarities.dotProductF32` |
| 1 | `BM_l2sqr_f32` | `fvec_L2sqr` | `Similarities.squareDistanceF32` |
| 2a | `BM_dot_f32_batch4` | `fvec_inner_product_batch_4` | `ESVectorUtil.squareDistanceBulk` (4-way) |
| 2a | `BM_l2sqr_f32_batch4` | `fvec_L2sqr_batch_4` | `ESVectorUtil.squareDistanceBulk` (4-way) |
| 2b | `BM_dot_f32_ny` | `fvec_inner_products_ny` | `Similarities.dotProductF32Bulk` |
| 2b | `BM_l2sqr_f32_ny` | `fvec_L2sqr_ny` | `Similarities.squareDistanceF32Bulk` |
| 2b | `BM_dot_f32_loop_seq` | `fvec_inner_product` loop, sequential | (loop-of-singles baseline) |
| 2b | `BM_dot_f32_loop_random` | `fvec_inner_product` loop, random | (HNSW-like scattered access) |

Level 1 — Single-vector kernel throughput (one pair, ns/op). Dimensions: 128, 256, 384, 512, 768, 1024, 1536, 3072.
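As a point of reference for what the Level 1 kernels compute, here is a scalar sketch of the two distances. The `_ref` names are ours for illustration; FAISS's `fvec_inner_product` and `fvec_L2sqr` compute the same values using NEON/AVX SIMD intrinsics.

```cpp
#include <cstddef>

// Scalar reference for the Level 1 kernels. The FAISS functions being
// benchmarked produce the same results, but vectorized.
float inner_product_ref(const float* x, const float* y, size_t d) {
    float acc = 0.0f;
    for (size_t i = 0; i < d; ++i) acc += x[i] * y[i];
    return acc;
}

float l2sqr_ref(const float* x, const float* y, size_t d) {
    float acc = 0.0f;
    for (size_t i = 0; i < d; ++i) {
        float diff = x[i] - y[i];
        acc += diff * diff;  // squared L2, no sqrt
    }
    return acc;
}
```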

Level 2a — Batch-of-4 (one query vs 4 doc vectors in a single call). Comparable to HNSW 4-way neighbor scoring.
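The batch-of-4 shape can be sketched in scalar form: each query element is loaded once and multiplied against four doc vectors, keeping four independent accumulators. This is the access pattern `fvec_inner_product_batch_4` vectorizes; the sketch below is ours, not FAISS's implementation.

```cpp
#include <cstddef>

// Scalar sketch of the batch-of-4 pattern: one query, four docs,
// four independent accumulators so the query load is amortized.
void inner_product_batch4_ref(const float* x,
                              const float* y0, const float* y1,
                              const float* y2, const float* y3,
                              size_t d, float dis[4]) {
    float d0 = 0, d1 = 0, d2 = 0, d3 = 0;
    for (size_t i = 0; i < d; ++i) {
        float q = x[i];  // query element loaded once per iteration
        d0 += q * y0[i];
        d1 += q * y1[i];
        d2 += q * y2[i];
        d3 += q * y3[i];
    }
    dis[0] = d0; dis[1] = d1; dis[2] = d2; dis[3] = d3;
}
```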

Level 2b — 1-query-N-docs bulk (one query vs N contiguous doc vectors). Two ranges are available:

- `BulkRange`: N = 4, 16, 32, 64, 128 at dims = 384, 768, 1024.
- `BulkRange_1024`: N = 32, 625, 32500 at dims = 1024 (sized to exceed L1, L2, and L3 cache respectively for 1024d float32 vectors).
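The cache sizing follows from simple arithmetic: a 1024d float32 vector is 4 KiB, so the working set is N × 4 KiB. A sketch (the typical cache sizes in the comments are assumptions; they vary by CPU):

```cpp
#include <cstddef>

// Working-set size for the BulkRange_1024 sweep:
// N doc vectors x 1024 dims x 4 bytes per float.
constexpr size_t working_set_bytes(size_t n, size_t dims = 1024) {
    return n * dims * sizeof(float);
}
// N=32    ->   128 KiB (exceeds a typical L1 data cache)
// N=625   ->  ~2.4 MiB (exceeds a typical per-core L2)
// N=32500 ->  ~127 MiB (exceeds a typical shared L3)
```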

Two loop-of-singles baselines compare against the native fvec_inner_products_ny bulk API:

- `BM_dot_f32_loop_seq`: iterates vectors 0..N-1 in order (best case for hardware prefetching).
- `BM_dot_f32_loop_random`: visits the same N vectors via shuffled ordinals, simulating the HNSW-like scattered graph-neighbor access that defeats hardware prefetching.
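The two baselines differ only in visit order; the arithmetic is identical. A hedged scalar sketch of the idea (helper names are ours, not the harness's):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Build the visit order: 0..N-1 for the sequential baseline,
// a fixed-seed shuffle of the same ordinals for the random one.
std::vector<size_t> make_order(size_t n, bool shuffled) {
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    if (shuffled) {
        std::mt19937 rng(42);  // fixed seed: same set, scattered order
        std::shuffle(order.begin(), order.end(), rng);
    }
    return order;
}

// Loop-of-singles: one kernel call per doc vector, in the given order.
float loop_of_singles(const float* x, const float* ys, size_t d,
                      const std::vector<size_t>& order) {
    float sink = 0.0f;  // accumulate so calls are not optimized away
    for (size_t ord : order) {
        const float* y = ys + ord * d;  // scattered loads defeat prefetch
        float acc = 0.0f;
        for (size_t i = 0; i < d; ++i) acc += x[i] * y[i];
        sink += acc;  // in the harness this is fvec_inner_product(x, y, d)
    }
    return sink;
}
```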

## Dependencies (Ubuntu)

```sh
sudo apt update
sudo apt install -y \
    build-essential \
    cmake \
    git \
    libopenblas-dev \
    liblapack-dev \
    swig
```
- `build-essential`: GCC/G++, make
- `cmake`: build system (>= 3.14 required)
- `libopenblas-dev`: BLAS implementation (FAISS requires BLAS to link; the low-level distance functions benchmarked here are pure SIMD and do not use BLAS at runtime)
- `liblapack-dev`: LAPACK (the FAISS CMake build expects it)
- `swig`: may be needed by the FAISS CMake build even with Python disabled

Google Benchmark is fetched automatically via CMake FetchContent.

## Clone FAISS

```sh
git clone --depth 1 https://github.com/facebookresearch/faiss.git /path/to/faiss
```

## Build

```sh
cd /path/to/faiss-benchmarks
mkdir build && cd build
cmake .. \
    -DFAISS_DIR=/path/to/faiss \
    -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
```

The `CMakeLists.txt` auto-detects the architecture: on ARM (Graviton) it sets `FAISS_OPT_LEVEL=generic` (FAISS uses NEON intrinsics automatically); on x86 it defaults to `FAISS_OPT_LEVEL=avx512` and links against `libfaiss_avx512`.

To build an AVX2-only binary instead (e.g. for A/B comparison):

```sh
mkdir build-avx2 && cd build-avx2
cmake .. \
    -DFAISS_DIR=/path/to/faiss \
    -DCMAKE_BUILD_TYPE=Release \
    -DFAISS_OPT_LEVEL=avx2 \
    -DFAISS_LINK_TARGET=faiss_avx2
make -j$(nproc)
```

This lets you produce separate `faiss-avx512.json` and `faiss-avx2.json` result files from the two builds for an ISA-level comparison.

## Run

### Diagnostic probe

Quick correctness check and timing for 1024 dimensions:

```sh
./faiss_probe
```

Expected output:

```text
=== FAISS Distance Kernel Probe ===

Compiled with: AVX-512
Compiler flags: -O3 -march=native (expected)

fvec_inner_product(1024d): -8.123456
fvec_L2sqr(1024d):         682.345678
...
fvec_inner_product(1024d): 18.3 ns/op
fvec_inner_product_batch_4(1024d): 52.1 ns/op (13.0 ns/score)
fvec_inner_products_ny(1024d, N=64): 1150.0 ns/op (18.0 ns/score)
```

### Full benchmarks

```sh
./faiss_bench --benchmark_repetitions=5
```

Run a subset (e.g. only single-vector dot product):

```sh
./faiss_bench --benchmark_filter="BM_dot_f32/"
```

Run only 1024 dimensions:

```sh
./faiss_bench --benchmark_filter=".*1024.*"
```

Run only Level 2b bulk benchmarks (native bulk API):

```sh
./faiss_bench --benchmark_filter="BM_.*_ny"
```

Run only the large-N bulk benchmarks at 1024d (cache-pressure sweep):

```sh
./faiss_bench --benchmark_filter="BM_dot_f32_(ny|loop_seq|loop_random)/1024" --benchmark_min_time=2s
```

Output as JSON (for merging with JMH results):

```sh
./faiss_bench --benchmark_repetitions=5 \
    --benchmark_out=faiss_results.json \
    --benchmark_out_format=json
```

## Reducing variance

For stable results, pin to a single CPU core and disable frequency scaling:

```sh
# x86: disable turbo boost
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null
# or for AMD:
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost 2>/dev/null

# Pin to core 0
taskset -c 0 ./faiss_bench --benchmark_repetitions=5
```

On ARM (Graviton), frequency scaling is typically fixed, but you can still pin:

```sh
taskset -c 0 ./faiss_bench --benchmark_repetitions=5
```

## Project structure

```text
faiss-benchmarks/
├── CMakeLists.txt          # Build config, fetches Google Benchmark, links FAISS
├── README.md               # This file
├── src/
│   ├── faiss_bench.cpp     # Google Benchmark harness (Level 1 + 2)
│   └── faiss_probe.cpp     # Diagnostic: correctness + quick timing
└── .gitignore
```
