Summary
RankQuant::search_asymmetric_subset gathers the M candidate docs into a contiguous sub_packed buffer (m * bytes_per_vec) before the SIMD rerank. Negligible for the paper's operating regime (M ≈ 100–500 from a bitmap probe), but for large M (5000+, reachable via the planned Python FFI) the gather adds measurable overhead — candidate bytes move through memory ~3× (scattered gather-read + sub_packed write + contiguous scan-read).
Why it's low-fuss
The RankQuant asymmetric kernels (scan_b{2,4}_asym_avx512, scan_b{2,4}_asym_avx2, scan_via_lut_scalar) are per-doc / coordinate-parallel: each iteration slices one doc (&packed[di*bpv..]) and vectorizes across its coordinates. They don't need candidate docs to be adjacent — only renumbered into 0..m, which is the sole reason sub_packed exists.
Proposed change
Thread an optional candidate-index map through the kernels:
// inside `for di in 0..n`:
let base = match candidates { Some(c) => c[di] as usize, None => di } * bytes_per_vec;
let doc = &packed[base .. base + bytes_per_vec]; // SIMD body unchanged
search_asymmetric_subset then passes &self.packed + candidates instead of materializing sub_packed. Each doc read once, zero allocation. TopK still yields the local di, remapped via candidates[di] as today. The full-scan path passes None and is unchanged.
Caveat / scope
- Win at large
M (~⅓ the candidate-data traffic); roughly neutral at small M (the in-place pass is scattered vs the current contiguous scan). So this is a scalability fix, not a paper-path speedup.
- Optional: sort
candidates by id (or have the bitmap probe return them sorted) so the scattered scan becomes sequential-with-gaps (prefetch-friendly).
Acceptance
- New exactness test: index-mapped path matches the current gather path bit-for-bit (template:
rankquant_asymmetric_matches_reference_b{1,2,4}).
- Best landed alongside the Python FFI (roadmap step 3), where large
M becomes externally reachable.
Summary
RankQuant::search_asymmetric_subsetgathers theMcandidate docs into a contiguoussub_packedbuffer (m * bytes_per_vec) before the SIMD rerank. Negligible for the paper's operating regime (M≈ 100–500 from a bitmap probe), but for largeM(5000+, reachable via the planned Python FFI) the gather adds measurable overhead — candidate bytes move through memory ~3× (scattered gather-read +sub_packedwrite + contiguous scan-read).Why it's low-fuss
The RankQuant asymmetric kernels (
scan_b{2,4}_asym_avx512,scan_b{2,4}_asym_avx2,scan_via_lut_scalar) are per-doc / coordinate-parallel: each iteration slices one doc (&packed[di*bpv..]) and vectorizes across its coordinates. They don't need candidate docs to be adjacent — only renumbered into0..m, which is the sole reasonsub_packedexists.Proposed change
Thread an optional candidate-index map through the kernels:
search_asymmetric_subsetthen passes&self.packed+candidatesinstead of materializingsub_packed. Each doc read once, zero allocation.TopKstill yields the localdi, remapped viacandidates[di]as today. The full-scan path passesNoneand is unchanged.Caveat / scope
M(~⅓ the candidate-data traffic); roughly neutral at smallM(the in-place pass is scattered vs the current contiguous scan). So this is a scalability fix, not a paper-path speedup.candidatesby id (or have the bitmap probe return them sorted) so the scattered scan becomes sequential-with-gaps (prefetch-friendly).Acceptance
rankquant_asymmetric_matches_reference_b{1,2,4}).Mbecomes externally reachable.