Four approximate-nearest-neighbour algorithms — LSH, Hypercube, IVF-Flat, IVF-PQ — implemented from scratch in C++ and benchmarked on MNIST (60k × 784 uint8) and SIFT (1M × 128 float) against a cache-backed brute-force baseline.
First homework of the Software Development for Algorithmic Problems course at the University of Athens (Department of Informatics & Telecommunications). The course's defining feature is that it's the only one in the curriculum where you ship a real, multi-month software project end-to-end — and this is the first installment of a three-part arc. Graded ~10/10.
Built with @soc9999 (Sokratis Papargyris) — partner across all three homeworks.
All numbers are recall vs. brute-force, with 5 nearest neighbours per query. Full sweeps in `REPORT.md`; per-algorithm raw dumps in `docs/`.
- IVF-Flat is the practical winner. SIFT: 500 clusters, 10 probes → 0.95 recall at 5.1× speedup. MNIST: 50 clusters, 10 probes → 0.99 recall at 3.1× speedup. Clean, predictable trade-off curve (probe-selection sketch after this list).
- LSH works well on normalized data, badly without. Normalized SIFT (`k=8`, `L=10`, `w=3`) → 0.71 recall at 6× speedup, or `w=5` → 0.94 at 1.1×. On unnormalized SIFT you need `w=600` just to scrape past 0.6 recall — variance in raw SIFT amplitudes wrecks the bucket distribution (hash sketch after this list).
- IVF-PQ wins on memory. SIFT, `M=128`, `nbits=8` → 0.92 recall at 5.9× speedup while encoding each 128-D float vector into 128 bytes (vs. 512 raw). MNIST never crosses 0.39 even at maximum `M`; we diagnose this in `REPORT.md`.
- Hypercube is the weakest of the four. Tops out around 0.62 recall at 2.5× speedup on normalized SIFT — never finds a better operating point than LSH on this dataset class. Worth implementing for the memory trick (see below) more than the algorithm itself.
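To make the clusters/probes knob concrete, here is a minimal sketch of IVF-Flat probe selection (names are illustrative, not the repo's API):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// IVF-Flat query side: rank the k-means centroids by distance to the
// query, then exhaustively scan only the `nprobe` closest inverted
// lists. More probes -> higher recall, smaller speedup.
std::vector<int> lists_to_probe(const std::vector<std::vector<float>>& centroids,
                                const std::vector<float>& query, int nprobe) {
    std::vector<float> dist(centroids.size());
    for (std::size_t i = 0; i < centroids.size(); ++i) {
        float s = 0.0f;
        for (std::size_t d = 0; d < query.size(); ++d) {
            float t = centroids[i][d] - query[d];
            s += t * t;
        }
        dist[i] = s;
    }
    std::vector<int> order(centroids.size());
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + nprobe, order.end(),
                      [&](int a, int b) { return dist[a] < dist[b]; });
    order.resize(nprobe);  // scan these lists with exact distances
    return order;
}
```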
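And a minimal sketch of the standard Euclidean LSH hash, which shows why `w` is so sensitive to vector amplitude (again, illustrative names rather than the repo's actual API):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Standard L2 LSH hash: h(p) = floor((a.p + b) / w), with a ~ N(0, I)
// and b ~ U[0, w).
struct L2Hash {
    std::vector<float> a;  // random Gaussian direction
    float b;               // uniform offset in [0, w)
    float w;               // bin width: the knob discussed above

    L2Hash(std::size_t dim, float w_, std::mt19937& rng) : a(dim), w(w_) {
        std::normal_distribution<float> gauss(0.0f, 1.0f);
        for (float& ai : a) ai = gauss(rng);
        b = std::uniform_real_distribution<float>(0.0f, w)(rng);
    }

    // Unnormalized vectors with very different magnitudes stretch a.p
    // across many bins; only a huge w (the w=600 above) keeps true
    // neighbours in the same bucket.
    std::int64_t operator()(const std::vector<float>& p) const {
        float proj = b;
        for (std::size_t i = 0; i < p.size(); ++i) proj += a[i] * p[i];
        return static_cast<std::int64_t>(std::floor(proj / w));
    }
};
```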
- Composition over duplication. `Hypercube` is implemented on top of `LSHashTable`; `IVFPQ` is implemented on top of `IVFFlat`. Same primitives, half the code.
- Templates for dataset duality. A single codebase serves both MNIST (uint8 vectors) and SIFT (float vectors) via C++ templates — no runtime branching, no parallel implementations (sketch after this list).
- Brute-force result cache. Brute force is the ground truth for every recall measurement, so we run it once per `(input, queries, maxq, maxd)` tuple, hash the parameters, and dump the result to a binary cache. Subsequent runs read the cache directly — turned weeks of experiments into hours (key scheme sketched after this list).
- Memory-aware Hypercube g→bit map. The naive table mapping is O(k_proj · N/4). When that exceeds 50 MB, we fall back to a `splitmix64` hash with a per-`g` seed — same uniform distribution, zero memory (sketch after this list).
- OpenMP-parallel IVF-PQ training. All M subspaces train k-means in parallel, and the encoding step is parallel too (training sketch after this list, which also shows the K-Means++ seeding).
- K-Means++ alongside random initialization, configurable per run, with optional silhouette computation on a subsample for cluster-quality reporting.
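A minimal sketch of the dataset duality, assuming an illustrative free function rather than the repo's actual headers:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One distance routine parameterized on the stored scalar: uint8_t for
// MNIST, float for SIFT. Accumulation happens in float either way, so
// both datasets share the same code path with no runtime branching.
template <typename Scalar>
float l2_sq(const std::vector<Scalar>& a, const std::vector<Scalar>& b) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        float d = static_cast<float>(a[i]) - static_cast<float>(b[i]);
        acc += d * d;
    }
    return acc;
}

// The two instantiations the binary actually needs:
template float l2_sq<std::uint8_t>(const std::vector<std::uint8_t>&,
                                   const std::vector<std::uint8_t>&);
template float l2_sq<float>(const std::vector<float>&,
                            const std::vector<float>&);
```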
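A sketch of the cache-key scheme; FNV-1a is an assumption for illustration, the repo's actual hash may differ:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Map a run's parameters to a cache filename. Hashing the serialized
// (input, queries, maxq, maxd) tuple means any change to it yields a
// different file, so stale hits are impossible.
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 0xcbf29ce484222325ULL;
    for (unsigned char c : s) { h ^= c; h *= 0x100000001b3ULL; }
    return h;
}

std::string cache_path(const std::string& input, const std::string& queries,
                       int maxq, int maxd) {
    std::ostringstream key;
    key << input << '|' << queries << '|' << maxq << '|' << maxd;
    return "cache/brute_" + std::to_string(fnv1a(key.str())) + ".bin";
}
```

On a hit, the stored neighbour ids and distances are read back instead of re-running brute force.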
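The fallback hash, sketched below; the `splitmix64` constants are the standard ones, while the wrapper around it is illustrative:

```cpp
#include <cstdint>

// splitmix64: a cheap, statistically solid 64-bit mixer. Seeding it
// per-g and feeding it the LSH bucket value reproduces a uniform
// g->bit assignment on the fly, so no table needs to be stored.
std::uint64_t splitmix64(std::uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// One hypercube vertex bit for projection g (wrapper is illustrative):
bool g_bit(std::uint64_t g_seed, std::int64_t bucket) {
    return splitmix64(g_seed ^ static_cast<std::uint64_t>(bucket)) & 1ULL;
}
```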
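A sketch of the parallel training loop, including the K-Means++ (D² sampling) seeding from the last bullet; structure and names are illustrative, and the Lloyd iterations after seeding are elided:

```cpp
#include <omp.h>

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

using Vec = std::vector<float>;

static float l2sq(const Vec& a, const Vec& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) { float d = a[i] - b[i]; s += d * d; }
    return s;
}

// K-Means++ seeding: each new centre is drawn with probability
// proportional to its squared distance from the nearest centre so far.
std::vector<Vec> kmeanspp_seed(const std::vector<Vec>& pts, int k, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> uni(0, pts.size() - 1);
    std::vector<Vec> centres{pts[uni(rng)]};
    std::vector<float> d2(pts.size());
    while (static_cast<int>(centres.size()) < k) {
        for (std::size_t i = 0; i < pts.size(); ++i) {
            d2[i] = l2sq(pts[i], centres[0]);
            for (std::size_t c = 1; c < centres.size(); ++c)
                d2[i] = std::min(d2[i], l2sq(pts[i], centres[c]));
        }
        std::discrete_distribution<std::size_t> pick(d2.begin(), d2.end());
        centres.push_back(pts[pick(rng)]);
    }
    return centres;
}

// The M subquantizers are independent, so each subspace trains its own
// k-means in parallel.
std::vector<std::vector<Vec>> train_pq(const std::vector<std::vector<Vec>>& subspaces, int k) {
    std::vector<std::vector<Vec>> codebooks(subspaces.size());
    #pragma omp parallel for schedule(dynamic)
    for (int m = 0; m < static_cast<int>(subspaces.size()); ++m) {
        std::mt19937 rng(42 + m);  // deterministic per-subspace seed
        codebooks[m] = kmeanspp_seed(subspaces[m], k, rng);
    }
    return codebooks;
}
```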
```bash
make                              # produces bin/search
bash scripts/install_data.sh      # optional: fetches MNIST and SIFT
./bin/search -i sift/sift_base.fvecs \
             -q sift/sift_query.fvecs \
             -algo lsh -N 5 -k 8 -L 10 -w 3 \
             -normalize 1
```

Useful flags beyond what the assignment required:
| Flag | What it does |
|---|---|
| `-maxd N` | Cap on dataset size loaded (0 = all) |
| `-maxq N` | Cap on query count loaded (default 100; brute-force is very slow with more) |
| `-normalize 1` | Scale all vectors so the largest coordinate is 1; range queries auto-adjust R. Reported metrics are always in the original scale. |
| `-compute_silhouette true` | Print mean silhouette across IVF-Flat clusters (subsample-based for speed) |
| `-ivfflatt_init_method random\|plusplus` | Cluster initialization (default `plusplus`) |
The full argument table lives in `include/Config.hpp`.
```
src/      # search.cpp + per-algorithm: brute, lsh, hypercube, ivfflat, ivfpq, utils
include/  # data structures: LSHashTable, Hypercube, IVFFlat, IVFPQ, Config, search, utils
docs/     # auto-generated experiment dumps per algorithm × dataset
scripts/  # Python harnesses for sweeping parameters + regenerating docs/
cache/    # binary brute-force result cache (created at runtime, gitignored)
```
`search.cpp` is intentionally thin: parse args via `Config`, optionally normalize, dispatch to one of the algorithm `.cpp` files, format output.
This is part one of a three-homework project arc:
- ann-search-cpp (you are here) — 4 ANN algorithms in C++ on MNIST + SIFT
- neural-lsh — Neural LSH (Dong et al. 2019): k-NN graph + KaHIP partitioning + MLP-trained bucket assignment, with Bayesian HPO via Optuna
- protein-homolog-search — capstone: protein remote-homolog detection on SwissProt using ESM-2 embeddings + the ANN methods from parts 1 and 2
The repo for HW2 reuses this code directly — the C++ headers from this repo appear there verbatim.
MIT — applies to my and Sokratis's joint work in this repo. Assignment-distributed materials (e.g. docs/ergasia2025_1.pdf) retain their original course copyright.