ChrisHegarty/faiss-kernel-benchmarks
# FAISS Distance Kernel Benchmarks

A Google Benchmark harness for FAISS's low-level SIMD distance functions. Part of the ES SimdVec comparison suite: it measures FAISS natively in C++ so results can be compared against ES SimdVec, which is benchmarked from Java via JMH.

## What is benchmarked

| Level | Benchmark | FAISS function | Comparable ES SimdVec API |
|-------|-----------|----------------|---------------------------|
| 1 | `BM_dot_f32` | `fvec_inner_product` | `Similarities.dotProductF32` |
| 1 | `BM_l2sqr_f32` | `fvec_L2sqr` | `Similarities.squareDistanceF32` |
| 2a | `BM_dot_f32_batch4` | `fvec_inner_product_batch_4` | `ESVectorUtil.squareDistanceBulk` (4-way) |
| 2a | `BM_l2sqr_f32_batch4` | `fvec_L2sqr_batch_4` | `ESVectorUtil.squareDistanceBulk` (4-way) |
| 2b | `BM_dot_f32_ny` | `fvec_inner_products_ny` | `Similarities.dotProductF32Bulk` |
| 2b | `BM_l2sqr_f32_ny` | `fvec_L2sqr_ny` | `Similarities.squareDistanceF32Bulk` |
| 2b | `BM_dot_f32_loop_seq` | `fvec_inner_product` loop, sequential | (loop-of-singles baseline) |
| 2b | `BM_dot_f32_loop_random` | `fvec_inner_product` loop, random | (HNSW-like scattered access) |

Level 1 — Single-vector kernel throughput (one pair, ns/op). Dimensions: 128, 256, 384, 512, 768, 1024, 1536, 3072.
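As a point of reference for what the Level 1 kernels compute, here is a scalar sketch of the two distances. The `_ref` names are ours for illustration; FAISS's `fvec_inner_product` and `fvec_L2sqr` compute the same values using NEON/AVX SIMD intrinsics.

```cpp
#include <cstddef>

// Scalar reference for the Level 1 kernels. The FAISS functions being
// benchmarked produce the same results, but vectorized.
float inner_product_ref(const float* x, const float* y, size_t d) {
    float acc = 0.0f;
    for (size_t i = 0; i < d; ++i) acc += x[i] * y[i];
    return acc;
}

float l2sqr_ref(const float* x, const float* y, size_t d) {
    float acc = 0.0f;
    for (size_t i = 0; i < d; ++i) {
        float diff = x[i] - y[i];
        acc += diff * diff;  // squared L2, no sqrt
    }
    return acc;
}
```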

Level 2a — Batch-of-4 (one query vs 4 doc vectors in a single call). Comparable to HNSW 4-way neighbor scoring.
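The batch-of-4 shape can be sketched in scalar form: each query element is loaded once and multiplied against four doc vectors, keeping four independent accumulators. This is the access pattern `fvec_inner_product_batch_4` vectorizes; the sketch below is ours, not FAISS's implementation.

```cpp
#include <cstddef>

// Scalar sketch of the batch-of-4 pattern: one query, four docs,
// four independent accumulators so the query load is amortized.
void inner_product_batch4_ref(const float* x,
                              const float* y0, const float* y1,
                              const float* y2, const float* y3,
                              size_t d, float dis[4]) {
    float d0 = 0, d1 = 0, d2 = 0, d3 = 0;
    for (size_t i = 0; i < d; ++i) {
        float q = x[i];  // query element loaded once per iteration
        d0 += q * y0[i];
        d1 += q * y1[i];
        d2 += q * y2[i];
        d3 += q * y3[i];
    }
    dis[0] = d0; dis[1] = d1; dis[2] = d2; dis[3] = d3;
}
```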

Level 2b — 1-query-N-docs bulk (one query vs N contiguous doc vectors). Two ranges are available:

- `BulkRange`: N = 4, 16, 32, 64, 128 at dims = 384, 768, 1024.
- `BulkRange_1024`: N = 32, 625, 32500 at dims = 1024 (sized to exceed L1, L2, and L3 cache respectively for 1024d float32 vectors).
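The cache sizing follows from simple arithmetic: a 1024d float32 vector is 4 KiB, so the working set is N × 4 KiB. A sketch (the typical cache sizes in the comments are assumptions; they vary by CPU):

```cpp
#include <cstddef>

// Working-set size for the BulkRange_1024 sweep:
// N doc vectors x 1024 dims x 4 bytes per float.
constexpr size_t working_set_bytes(size_t n, size_t dims = 1024) {
    return n * dims * sizeof(float);
}
// N=32    ->   128 KiB (exceeds a typical L1 data cache)
// N=625   ->  ~2.4 MiB (exceeds a typical per-core L2)
// N=32500 ->  ~127 MiB (exceeds a typical shared L3)
```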

Two loop-of-singles baselines compare against the native fvec_inner_products_ny bulk API:

- `BM_dot_f32_loop_seq`: iterates vectors 0..N-1 in order (best case for hardware prefetching).
- `BM_dot_f32_loop_random`: visits the same N vectors via shuffled ordinals, simulating the HNSW-like scattered graph-neighbor access that defeats hardware prefetching.
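The two baselines differ only in visit order; the arithmetic is identical. A hedged scalar sketch of the idea (helper names are ours, not the harness's):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Build the visit order: 0..N-1 for the sequential baseline,
// a fixed-seed shuffle of the same ordinals for the random one.
std::vector<size_t> make_order(size_t n, bool shuffled) {
    std::vector<size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    if (shuffled) {
        std::mt19937 rng(42);  // fixed seed: same set, scattered order
        std::shuffle(order.begin(), order.end(), rng);
    }
    return order;
}

// Loop-of-singles: one kernel call per doc vector, in the given order.
float loop_of_singles(const float* x, const float* ys, size_t d,
                      const std::vector<size_t>& order) {
    float sink = 0.0f;  // accumulate so calls are not optimized away
    for (size_t ord : order) {
        const float* y = ys + ord * d;  // scattered loads defeat prefetch
        float acc = 0.0f;
        for (size_t i = 0; i < d; ++i) acc += x[i] * y[i];
        sink += acc;  // in the harness this is fvec_inner_product(x, y, d)
    }
    return sink;
}
```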

## Dependencies (Ubuntu)

```sh
sudo apt update
sudo apt install -y \
    build-essential \
    cmake \
    git \
    libopenblas-dev \
    liblapack-dev \
    swig
```
- `build-essential`: GCC/G++, make
- `cmake`: build system (>= 3.14 required)
- `libopenblas-dev`: BLAS implementation (FAISS requires BLAS to link; the low-level distance functions benchmarked here are pure SIMD and do not use BLAS at runtime)
- `liblapack-dev`: LAPACK (the FAISS CMake build expects it)
- `swig`: may be needed by the FAISS CMake build even with Python disabled

Google Benchmark is fetched automatically via CMake FetchContent.

## Clone FAISS

```sh
git clone --depth 1 https://github.com/facebookresearch/faiss.git /path/to/faiss
```

## Build

```sh
cd /path/to/faiss-benchmarks
mkdir build && cd build
cmake .. \
    -DFAISS_DIR=/path/to/faiss \
    -DCMAKE_BUILD_TYPE=Release
make -j$(nproc)
```

The `CMakeLists.txt` auto-detects the architecture: on ARM (Graviton) it sets `FAISS_OPT_LEVEL=generic` (FAISS uses NEON intrinsics automatically); on x86 it defaults to `FAISS_OPT_LEVEL=avx512` and links against `libfaiss_avx512`.

To build an AVX2-only binary instead (e.g. for A/B comparison):

```sh
mkdir build-avx2 && cd build-avx2
cmake .. \
    -DFAISS_DIR=/path/to/faiss \
    -DCMAKE_BUILD_TYPE=Release \
    -DFAISS_OPT_LEVEL=avx2 \
    -DFAISS_LINK_TARGET=faiss_avx2
make -j$(nproc)
```

This lets you produce separate `faiss-avx512.json` and `faiss-avx2.json` result files from the two builds for an ISA-level comparison.

## Run

### Diagnostic probe

Quick correctness check and timing for 1024 dimensions:

```sh
./faiss_probe
```

Expected output:

```text
=== FAISS Distance Kernel Probe ===

Compiled with: AVX-512
Compiler flags: -O3 -march=native (expected)

fvec_inner_product(1024d): -8.123456
fvec_L2sqr(1024d):         682.345678
...
fvec_inner_product(1024d): 18.3 ns/op
fvec_inner_product_batch_4(1024d): 52.1 ns/op (13.0 ns/score)
fvec_inner_products_ny(1024d, N=64): 1150.0 ns/op (18.0 ns/score)
```

### Full benchmarks

```sh
./faiss_bench --benchmark_repetitions=5
```

Run a subset (e.g. only single-vector dot product):

```sh
./faiss_bench --benchmark_filter="BM_dot_f32/"
```

Run only 1024 dimensions:

```sh
./faiss_bench --benchmark_filter=".*1024.*"
```

Run only Level 2b bulk benchmarks (native bulk API):

```sh
./faiss_bench --benchmark_filter="BM_.*_ny"
```

Run only the large-N bulk benchmarks at 1024d (cache-pressure sweep):

```sh
./faiss_bench --benchmark_filter="BM_dot_f32_(ny|loop_seq|loop_random)/1024" --benchmark_min_time=2s
```

Output as JSON (for merging with JMH results):

```sh
./faiss_bench --benchmark_repetitions=5 \
    --benchmark_out=faiss_results.json \
    --benchmark_out_format=json
```

## Reducing variance

For stable results, pin to a single CPU core and disable frequency scaling:

```sh
# x86: disable turbo boost
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo 2>/dev/null
# or for AMD:
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost 2>/dev/null

# Pin to core 0
taskset -c 0 ./faiss_bench --benchmark_repetitions=5
```

On ARM (Graviton), frequency scaling is typically fixed, but you can still pin:

```sh
taskset -c 0 ./faiss_bench --benchmark_repetitions=5
```

## Project structure

```text
faiss-benchmarks/
├── CMakeLists.txt          # Build config, fetches Google Benchmark, links FAISS
├── README.md               # This file
├── src/
│   ├── faiss_bench.cpp     # Google Benchmark harness (Level 1 + 2)
│   └── faiss_probe.cpp     # Diagnostic: correctness + quick timing
└── .gitignore
```
