perf: gather-free in-place subset scan for large M (FFI scalability)

## Summary
`RankQuant::search_asymmetric_subset` gathers the `M` candidate docs into a contiguous `sub_packed` buffer (`m * bytes_per_vec`) before the SIMD rerank. Negligible for the paper's operating regime (`M` ≈ 100–500 from a bitmap probe), but for large `M` (5000+, reachable via the planned Python FFI) the gather adds measurable overhead — candidate bytes move through memory ~3× (scattered gather-read + `sub_packed` write + contiguous scan-read).

## Why it's low-fuss
The RankQuant asymmetric kernels (`scan_b{2,4}_asym_avx512`, `scan_b{2,4}_asym_avx2`, `scan_via_lut_scalar`) are **per-doc / coordinate-parallel**: each iteration slices one doc (`&packed[di*bpv..]`) and vectorizes across *its* coordinates. They don't need candidate docs to be adjacent — only renumbered into `0..m`, which is the sole reason `sub_packed` exists.

## Proposed change
Thread an optional candidate-index map through the kernels:
```rust
// inside `for di in 0..n`:
let base = match candidates { Some(c) => c[di] as usize, None => di } * bytes_per_vec;
let doc = &packed[base .. base + bytes_per_vec];   // SIMD body unchanged
```
`search_asymmetric_subset` then passes `&self.packed` + `candidates` instead of materializing `sub_packed`. Each doc read once, zero allocation. `TopK` still yields the local `di`, remapped via `candidates[di]` as today. The full-scan path passes `None` and is unchanged.

## Caveat / scope
- Win at large `M` (~⅓ the candidate-data traffic); roughly neutral at small `M` (the in-place pass is scattered vs the current contiguous scan). So this is a scalability fix, not a paper-path speedup.
- Optional: sort `candidates` by id (or have the bitmap probe return them sorted) so the scattered scan becomes sequential-with-gaps (prefetch-friendly).

## Acceptance
- New exactness test: index-mapped path matches the current gather path bit-for-bit (template: `rankquant_asymmetric_matches_reference_b{1,2,4}`).
- Best landed alongside the Python FFI (roadmap step 3), where large `M` becomes externally reachable.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf: gather-free in-place subset scan for large M (FFI scalability) #11

Summary

Why it's low-fuss

Proposed change

Caveat / scope

Acceptance

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

perf: gather-free in-place subset scan for large M (FFI scalability) #11

Description

Summary

Why it's low-fuss

Proposed change

Caveat / scope

Acceptance

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions