Four approximate-nearest-neighbour algorithms — LSH, Hypercube, IVF-Flat, IVF-PQ — implemented from scratch in C++ and benchmarked on MNIST (60k × 784 uint8) and SIFT (1M × 128 float) against a cache-backed brute-force baseline.
First homework of the Software Development for Algorithmic Problems course at the University of Athens (Department of Informatics & Telecommunications). The course's defining feature is that it's the only one in the curriculum where you ship a real, multi-month software project end-to-end — and this is the first installment of a three-part arc. Graded ~10/10.
Built with @soc9999 (Sokratis Papargyris) — partner across all three homeworks.
All numbers are recall vs. brute-force, with 5 nearest neighbours per query. Full sweeps in `REPORT.md`; per-algorithm raw dumps in `docs/`.
- IVF-Flat is the practical winner. SIFT: 500 clusters, 10 probes → 0.95 recall at 5.1× speedup. MNIST: 50 clusters, 10 probes → 0.99 recall at 3.1× speedup. Clean, predictable trade-off curve (probe-selection sketch after this list).
- LSH works well on normalized data, badly without. Normalized SIFT (`k=8`, `L=10`, `w=3`) → 0.71 recall at 6× speedup, or `w=5` → 0.94 at 1.1×. On unnormalized SIFT you need `w=600` just to scrape past 0.6 recall — variance in raw SIFT amplitudes wrecks the bucket distribution (hash sketch after this list).
- IVF-PQ wins on memory. SIFT, `M=128`, `nbits=8` → 0.92 recall at 5.9× speedup while encoding each 128-D float vector into 128 bytes (vs. 512 raw). MNIST never crosses 0.39 even at maximum `M`; we diagnose this in `REPORT.md`.
- Hypercube is the weakest of the four. Tops out around 0.62 recall at 2.5× speedup on normalized SIFT — never finds a better operating point than LSH on this dataset class. Worth implementing for the memory trick (see below) more than the algorithm itself.
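To make the clusters/probes knob concrete, here is a minimal sketch of IVF-Flat probe selection (names are illustrative, not the repo's API):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// IVF-Flat query side: rank the k-means centroids by distance to the
// query, then exhaustively scan only the `nprobe` closest inverted
// lists. More probes -> higher recall, smaller speedup.
std::vector<int> lists_to_probe(const std::vector<std::vector<float>>& centroids,
                                const std::vector<float>& query, int nprobe) {
    std::vector<float> dist(centroids.size());
    for (std::size_t i = 0; i < centroids.size(); ++i) {
        float s = 0.0f;
        for (std::size_t d = 0; d < query.size(); ++d) {
            float t = centroids[i][d] - query[d];
            s += t * t;
        }
        dist[i] = s;
    }
    std::vector<int> order(centroids.size());
    std::iota(order.begin(), order.end(), 0);
    std::partial_sort(order.begin(), order.begin() + nprobe, order.end(),
                      [&](int a, int b) { return dist[a] < dist[b]; });
    order.resize(nprobe);  // scan these lists with exact distances
    return order;
}
```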
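And a minimal sketch of the standard Euclidean LSH hash, which shows why `w` is so sensitive to vector amplitude (again, illustrative names rather than the repo's actual API):

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Standard L2 LSH hash: h(p) = floor((a.p + b) / w), with a ~ N(0, I)
// and b ~ U[0, w).
struct L2Hash {
    std::vector<float> a;  // random Gaussian direction
    float b;               // uniform offset in [0, w)
    float w;               // bin width: the knob discussed above

    L2Hash(std::size_t dim, float w_, std::mt19937& rng) : a(dim), w(w_) {
        std::normal_distribution<float> gauss(0.0f, 1.0f);
        for (float& ai : a) ai = gauss(rng);
        b = std::uniform_real_distribution<float>(0.0f, w)(rng);
    }

    // Unnormalized vectors with very different magnitudes stretch a.p
    // across many bins; only a huge w (the w=600 above) keeps true
    // neighbours in the same bucket.
    std::int64_t operator()(const std::vector<float>& p) const {
        float proj = b;
        for (std::size_t i = 0; i < p.size(); ++i) proj += a[i] * p[i];
        return static_cast<std::int64_t>(std::floor(proj / w));
    }
};
```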
- Composition over duplication. `Hypercube` is implemented on top of `LSHashTable`; `IVFPQ` is implemented on top of `IVFFlat`. Same primitives, half the code.
- Templates for dataset duality. A single codebase serves both MNIST (uint8 vectors) and SIFT (float vectors) via C++ templates — no runtime branching, no parallel implementations (sketch after this list).
- Brute-force result cache. Brute force is the ground truth for every recall measurement, so we run it once per `(input, queries, maxq, maxd)` tuple, hash the parameters, and dump the result to a binary cache. Subsequent runs read the cache directly — turned weeks of experiments into hours (key scheme sketched after this list).
- Memory-aware Hypercube g→bit map. The naive table mapping is O(k_proj · N/4). When that exceeds 50 MB, we fall back to a `splitmix64` hash with a per-`g` seed — same uniform distribution, zero memory (sketch after this list).
- OpenMP-parallel IVF-PQ training. All M subspaces train k-means in parallel, and the encoding step is parallel too (training sketch after this list, which also shows the K-Means++ seeding).
- K-Means++ alongside random initialization, configurable per run, with optional silhouette computation on a subsample for cluster-quality reporting.
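A minimal sketch of the dataset duality, assuming an illustrative free function rather than the repo's actual headers:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One distance routine parameterized on the stored scalar: uint8_t for
// MNIST, float for SIFT. Accumulation happens in float either way, so
// both datasets share the same code path with no runtime branching.
template <typename Scalar>
float l2_sq(const std::vector<Scalar>& a, const std::vector<Scalar>& b) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        float d = static_cast<float>(a[i]) - static_cast<float>(b[i]);
        acc += d * d;
    }
    return acc;
}

// The two instantiations the binary actually needs:
template float l2_sq<std::uint8_t>(const std::vector<std::uint8_t>&,
                                   const std::vector<std::uint8_t>&);
template float l2_sq<float>(const std::vector<float>&,
                            const std::vector<float>&);
```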
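A sketch of the cache-key scheme; FNV-1a is an assumption for illustration, the repo's actual hash may differ:

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Map a run's parameters to a cache filename. Hashing the serialized
// (input, queries, maxq, maxd) tuple means any change to it yields a
// different file, so stale hits are impossible.
std::uint64_t fnv1a(const std::string& s) {
    std::uint64_t h = 0xcbf29ce484222325ULL;
    for (unsigned char c : s) { h ^= c; h *= 0x100000001b3ULL; }
    return h;
}

std::string cache_path(const std::string& input, const std::string& queries,
                       int maxq, int maxd) {
    std::ostringstream key;
    key << input << '|' << queries << '|' << maxq << '|' << maxd;
    return "cache/brute_" + std::to_string(fnv1a(key.str())) + ".bin";
}
```

On a hit, the stored neighbour ids and distances are read back instead of re-running brute force.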
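The fallback hash, sketched below; the `splitmix64` constants are the standard ones, while the wrapper around it is illustrative:

```cpp
#include <cstdint>

// splitmix64: a cheap, statistically solid 64-bit mixer. Seeding it
// per-g and feeding it the LSH bucket value reproduces a uniform
// g->bit assignment on the fly, so no table needs to be stored.
std::uint64_t splitmix64(std::uint64_t x) {
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

// One hypercube vertex bit for projection g (wrapper is illustrative):
bool g_bit(std::uint64_t g_seed, std::int64_t bucket) {
    return splitmix64(g_seed ^ static_cast<std::uint64_t>(bucket)) & 1ULL;
}
```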
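A sketch of the parallel training loop, including the K-Means++ (D² sampling) seeding from the last bullet; structure and names are illustrative, and the Lloyd iterations after seeding are elided:

```cpp
#include <omp.h>

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

using Vec = std::vector<float>;

static float l2sq(const Vec& a, const Vec& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) { float d = a[i] - b[i]; s += d * d; }
    return s;
}

// K-Means++ seeding: each new centre is drawn with probability
// proportional to its squared distance from the nearest centre so far.
std::vector<Vec> kmeanspp_seed(const std::vector<Vec>& pts, int k, std::mt19937& rng) {
    std::uniform_int_distribution<std::size_t> uni(0, pts.size() - 1);
    std::vector<Vec> centres{pts[uni(rng)]};
    std::vector<float> d2(pts.size());
    while (static_cast<int>(centres.size()) < k) {
        for (std::size_t i = 0; i < pts.size(); ++i) {
            d2[i] = l2sq(pts[i], centres[0]);
            for (std::size_t c = 1; c < centres.size(); ++c)
                d2[i] = std::min(d2[i], l2sq(pts[i], centres[c]));
        }
        std::discrete_distribution<std::size_t> pick(d2.begin(), d2.end());
        centres.push_back(pts[pick(rng)]);
    }
    return centres;
}

// The M subquantizers are independent, so each subspace trains its own
// k-means in parallel.
std::vector<std::vector<Vec>> train_pq(const std::vector<std::vector<Vec>>& subspaces, int k) {
    std::vector<std::vector<Vec>> codebooks(subspaces.size());
    #pragma omp parallel for schedule(dynamic)
    for (int m = 0; m < static_cast<int>(subspaces.size()); ++m) {
        std::mt19937 rng(42 + m);  // deterministic per-subspace seed
        codebooks[m] = kmeanspp_seed(subspaces[m], k, rng);
    }
    return codebooks;
}
```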
```bash
make                              # produces bin/search
bash scripts/install_data.sh      # optional: fetches MNIST and SIFT
./bin/search -i sift/sift_base.fvecs \
             -q sift/sift_query.fvecs \
             -algo lsh -N 5 -k 8 -L 10 -w 3 \
             -normalize 1
```

Useful flags beyond what the assignment required:
| Flag | What it does |
|---|---|
| `-maxd N` | Cap on dataset size loaded (0 = all) |
| `-maxq N` | Cap on query count loaded (default 100; brute-force is very slow with more) |
| `-normalize 1` | Scale all vectors so the largest coordinate is 1; range queries auto-adjust R. Reported metrics are always in the original scale. |
| `-compute_silhouette true` | Print mean silhouette across IVF-Flat clusters (subsample-based for speed) |
| `-ivfflatt_init_method random\|plusplus` | Cluster initialization (default `plusplus`) |
The full argument table lives in `include/Config.hpp`.
```
src/      # search.cpp + per-algorithm: brute, lsh, hypercube, ivfflat, ivfpq, utils
include/  # data structures: LSHashTable, Hypercube, IVFFlat, IVFPQ, Config, search, utils
docs/     # auto-generated experiment dumps per algorithm × dataset
scripts/  # Python harnesses for sweeping parameters + regenerating docs/
cache/    # binary brute-force result cache (created at runtime, gitignored)
```
`search.cpp` is intentionally thin: parse args via `Config`, optionally normalize, dispatch to one of the algorithm `.cpp` files, format output.
This is part one of a three-homework project arc:
- ann-search-cpp (you are here) — 4 ANN algorithms in C++ on MNIST + SIFT
- neural-lsh — Neural LSH (Dong et al. 2019): k-NN graph + KaHIP partitioning + MLP-trained bucket assignment, with Bayesian HPO via Optuna
- protein-homolog-search — capstone: protein remote-homolog detection on SwissProt using ESM-2 embeddings + the ANN methods from parts 1 and 2
The repo for HW2 reuses this code directly — the C++ headers from this repo appear there verbatim.
MIT — applies to my and Sokratis's joint work in this repo. Assignment-distributed materials (e.g. docs/ergasia2025_1.pdf) retain their original course copyright.