ann-search-cpp

Four approximate-nearest-neighbour algorithms — LSH, Hypercube, IVF-Flat, IVF-PQ — implemented from scratch in C++ and benchmarked on MNIST (60k × 784 uint8) and SIFT (1M × 128 float) against a cache-backed brute-force baseline.

First homework of the Software Development for Algorithmic Problems course at the University of Athens (Department of Informatics & Telecommunications). The course's defining feature is that it's the only one in the curriculum where you ship a real, multi-month software project end-to-end — and this is the first installment of a three-part arc. Graded ~10/10.

Built with @soc9999 (Sokratis Papargyris) — partner across all three homeworks.

Highlights from the parameter sweeps

All numbers are recall vs. brute-force, with 5 nearest neighbours per query. Full sweeps in REPORT.md; per-algorithm raw dumps in docs/.

  • IVF-Flat is the practical winner. SIFT: 500 clusters, 10 probes → 0.95 recall at 5.1× speedup. MNIST: 50 clusters, 10 probes → 0.99 recall at 3.1× speedup. Clean, predictable trade-off curve.
  • LSH works well on normalized data, badly without. SIFT normalized (k=8, L=10, w=3) → 0.71 recall at 6× speedup, or w=5 → 0.94 at 1.1×. On unnormalized SIFT you need w=600 just to scrape past 0.6 recall — variance in raw SIFT amplitudes wrecks the bucket distribution.
  • IVF-PQ wins on memory. SIFT, M=128, nbits=8 → 0.92 recall at 5.9× speedup while encoding each 128-D float vector into 128 bytes (vs. 512 raw). MNIST never crosses 0.39 even at maximum M; we diagnose this in REPORT.md.
  • Hypercube is the weakest of the four. Tops out around 0.62 recall at 2.5× speedup on normalized SIFT — never finds a better operating point than LSH on this dataset class. Worth implementing for the memory trick (see below) more than the algorithm itself.
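Throughout these sweeps, "recall" means the overlap between an algorithm's returned neighbours and the brute-force top-5, by vector id. A minimal sketch of that measurement (hypothetical helper, not the repo's actual code):

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Recall@k: fraction of the exact k nearest neighbours (by id) that the
// approximate method also returned.
double recall_at_k(std::vector<int> approx, std::vector<int> exact) {
    // set_intersection requires sorted ranges.
    std::sort(approx.begin(), approx.end());
    std::sort(exact.begin(), exact.end());
    std::vector<int> common;
    std::set_intersection(approx.begin(), approx.end(),
                          exact.begin(), exact.end(),
                          std::back_inserter(common));
    return static_cast<double>(common.size()) / exact.size();
}
```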

What's interesting about the implementation

  • Composition over duplication. Hypercube is implemented on top of LSHashTable; IVFPQ is implemented on top of IVFFlat. Same primitives, half the code.
  • Templates for dataset duality. A single codebase serves both MNIST (uint8 vectors) and SIFT (float vectors) via C++ templates — no runtime branching, no parallel implementations.
  • Brute-force result cache. Brute is the ground truth for every recall measurement, so we run it once per (input, queries, maxq, maxd) tuple, hash the parameters, and dump the result to a binary cache. Subsequent runs read the cache directly — turned weeks of experiments into hours.
  • Memory-aware Hypercube g→bit map. The naive table mapping is O(kproj · N/4). When that exceeds 50 MB, we fall back to a splitmix64 hash with a per-g seed — same uniform distribution, zero memory.
  • OpenMP-parallel IVF-PQ training. All M subspaces train k-means in parallel, and the encoding step is parallel too.
  • K-Means++ alongside random initialization, configurable per run, with optional silhouette computation on a subsample for cluster-quality reporting.
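The template point in practice: one routine compiled once per element type. A minimal sketch (hypothetical function name, assuming squared L2 as the metric):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One implementation serves both MNIST (uint8_t) and SIFT (float) vectors;
// the compiler emits a separate specialization per type, no runtime branch.
template <typename T>
double l2_sq(const std::vector<T>& a, const std::vector<T>& b) {
    double acc = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        acc += d * d;
    }
    return acc;
}
```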
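The splitmix64 fallback, sketched. The mixer itself uses the standard published constants; the per-g seeding shown is an assumption about how the repo wires it, not its exact code:

```cpp
#include <cstdint>

// Standard splitmix64 finalizer (Steele/Lea/Vigna constants).
std::uint64_t splitmix64(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Hypothetical per-g bit: hash the g-value under a per-projection seed and
// keep one bit; deterministic and uniform, with zero bytes of table stored.
int hypercube_bit(std::uint32_t g, std::uint64_t seed) {
    return static_cast<int>(splitmix64(seed ^ g) & 1ULL);
}
```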
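The cache-key idea from the brute-force bullet, as a sketch; the filename scheme and boost-style hash combine are illustrative assumptions, not the repo's code:

```cpp
#include <cstddef>
#include <functional>
#include <string>

// Fold one hash value into an accumulator (boost-style hash_combine).
void hash_combine(std::size_t& h, std::size_t v) {
    h ^= v + 0x9e3779b9u + (h << 6) + (h >> 2);
}

// Map one (input, queries, maxq, maxd) tuple to a stable cache filename, so a
// repeated run with identical parameters loads the precomputed brute result.
// (std::hash stability across runs is implementation-defined; a fixed hash
// such as FNV-1a would be the safer choice for an on-disk cache.)
std::string cache_key(const std::string& input, const std::string& queries,
                      int maxq, int maxd) {
    std::size_t h = std::hash<std::string>{}(input);
    hash_combine(h, std::hash<std::string>{}(queries));
    hash_combine(h, std::hash<int>{}(maxq));
    hash_combine(h, std::hash<int>{}(maxd));
    return "cache/brute_" + std::to_string(h) + ".bin";
}
```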

Build & run

make                                            # produces bin/search
bash scripts/install_data.sh                    # optional: fetches MNIST and SIFT
./bin/search -i sift/sift_base.fvecs \
             -q sift/sift_query.fvecs \
             -algo lsh -N 5 -k 8 -L 10 -w 3 \
             -normalize 1

Useful flags beyond what the assignment required:

Flag                                    What it does
-maxd N                                 Cap on dataset size loaded (0 = all)
-maxq N                                 Cap on query count loaded (default 100; brute-force is very slow with more)
-normalize 1                            Scale all vectors so the largest coordinate is 1; range queries auto-adjust R. Reported metrics are always in the original scale.
-compute_silhouette true                Print mean silhouette across IVF-Flat clusters (subsample-based for speed)
-ivfflatt_init_method random|plusplus   Cluster initialization (default plusplus)

The full argument table lives in include/Config.hpp.
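What -normalize 1 amounts to, sketched with a hypothetical helper; the returned scale factor is what lets reported metrics stay in the original units:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Divide every coordinate by the dataset-wide max absolute value so the
// largest coordinate becomes 1; return the scale factor so distances (and
// range-query radii R) can be converted back to the original scale.
float normalize_dataset(std::vector<std::vector<float>>& data) {
    float scale = 0.0f;
    for (const auto& v : data)
        for (float x : v) scale = std::max(scale, std::abs(x));
    if (scale > 0.0f)
        for (auto& v : data)
            for (float& x : v) x /= scale;
    return scale;
}
```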

Project layout

src/         # search.cpp + per-algorithm: brute, lsh, hypercube, ivfflat, ivfpq, utils
include/     # data structures: LSHashTable, Hypercube, IVFFlat, IVFPQ, Config, search, utils
docs/        # auto-generated experiment dumps per algorithm × dataset
scripts/     # Python harnesses for sweeping parameters + regenerating docs/
cache/       # binary brute-force result cache (created at runtime, gitignored)

search.cpp is intentionally thin: parse args via Config, optionally normalize, dispatch to one of the algorithm .cpp files, format output.

Sequence — the trilogy

This is part one of a three-homework project arc:

  1. ann-search-cpp (you are here) — 4 ANN algorithms in C++ on MNIST + SIFT
  2. neural-lsh — Neural LSH (Dong et al. 2019): k-NN graph + KaHIP partitioning + MLP-trained bucket assignment, with Bayesian HPO via Optuna
  3. protein-homolog-search — capstone: protein remote-homolog detection on SwissProt using ESM-2 embeddings + the ANN methods from parts 1 and 2

The HW2 repo reuses this code directly: these C++ headers appear there verbatim.

License

MIT — applies to my and Sokratis's joint work in this repo. Assignment-distributed materials (e.g. docs/ergasia2025_1.pdf) retain their original course copyright.
