
Vector subsystem — correctness, memory-safety, quantization (7 sub-items) #50

@hollanf


Seven independent bugs in nodedb-vector beyond what's already filed (#37 mmap overflow, #40 SQ8 brute-force scan, #41 ef_search unbounded). Three are correctness bugs that silently return wrong results. One is memory-safety UB on the search hot path. One is a feature that's silently a no-op. Two are performance pathologies that turn the index from log-scale into linear.


1. SIMD distance kernels read past buffer end on length mismatch (memory-safety UB)

File: nodedb-vector/src/distance/simd.rs:147-210 (AVX2 L2 + cosine, NEON twins)

#[target_feature(enable = "avx2,fma")]
unsafe fn l2_squared_impl(a: &[f32], b: &[f32]) -> f32 {
    unsafe {
        let n = a.len();                              // ← uses a.len() only
        let chunks = n / 8;
        for i in 0..chunks {
            let off = i * 8;
            let va = _mm256_loadu_ps(a.as_ptr().add(off));
            let vb = _mm256_loadu_ps(b.as_ptr().add(off));   // UB if b.len() < a.len()
            ...
        }
        for i in (chunks * 8)..n {
            let d = a[i] - b[i];                      // panic if b shorter
            ...
        }
    }
}

The kernel iterates using a.len() and dereferences b.as_ptr().add(off) without verifying b.len() >= a.len(). The assert_eq!(query.len(), self.dim) guard in hnsw/search.rs:19 validates the query against the index dim, but not against each individual stored vector's length. A node whose Vec<f32> is shorter than the index dim (possible after a truncated checkpoint restore — from_checkpoint has no per-node length check) drops the search hot path into undefined behavior via an _mm256_loadu_ps read past the end of the buffer.
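One way to close this hole is a parity check at the dispatcher level, before any unsafe kernel is selected. The sketch below is illustrative only — l2_squared_checked is a hypothetical name, not the actual nodedb-vector API — and uses the safe scalar path where the real dispatcher would pick AVX2/NEON:

```rust
// Hypothetical dispatcher-level guard: reject mismatched slices before any
// unsafe SIMD kernel runs. A length mismatch here means corrupted or
// truncated storage, so refusing is safer than reading past the buffer.
fn l2_squared_checked(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None; // never reaches _mm256_loadu_ps with a short slice
    }
    // Safe scalar fallback; the real dispatcher would select AVX2/NEON here.
    Some(a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum())
}

fn main() {
    assert_eq!(l2_squared_checked(&[1.0, 2.0], &[1.0, 0.0]), Some(4.0));
    assert_eq!(l2_squared_checked(&[1.0, 2.0], &[1.0]), None);
}
```

Alternatively the check can live in from_checkpoint (reject short vectors at restore time), which keeps the hot path branch-free.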


2. Tenant Roaring bitmap pre-filter feeds GLOBAL ids to per-segment LOCAL-id lookup → near-zero recall after first segment seal

File: nodedb-vector/src/collection/search.rs:115-135 + hnsw/search.rs:140 (f.contains(id))

for seg in &self.sealed {
    let results = seg.index.search_with_bitmap_bytes(query, top_k, ef, bitmap);
    for mut r in results {
        r.id += seg.base_id;        // ← offset added AFTER bitmap filtering
        all.push(r);
    }
}

The Roaring bitmap is built by the query planner from global vector IDs. Each sealed segment stores vectors at local IDs 0..seg.index.len(). HnswIndex::search_with_bitmap_bytes tests f.contains(id) with the local ID. For the first segment (base_id == 0) local == global so filtering works by accident; for every segment with base_id > 0 (i.e. once the collection has > ~65k vectors and seals once), the bitmap lookup is against the wrong ID space and almost every candidate is silently dropped → recall collapses. The growing segment has the same shape via growing_base_id.
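The cheaper of the two fixes in the checklist is to localize the filter once per segment before calling search_with_bitmap_bytes. A minimal sketch, using a plain Vec<u32> as a stand-in for the Roaring bitmap (localize_filter and the base_id/len fields mirror the report, not the real API):

```rust
// Shift a global-ID filter into one segment's local ID space: keep only
// the IDs this segment owns, then subtract base_id so f.contains(id)
// tests the same ID space the segment's HNSW graph uses.
fn localize_filter(global_ids: &[u32], base_id: u32, seg_len: u32) -> Vec<u32> {
    global_ids
        .iter()
        .filter(|&&g| g >= base_id && g < base_id + seg_len)
        .map(|&g| g - base_id) // global -> local
        .collect()
}

fn main() {
    // Segment covering global IDs 100..200: global 105 -> local 5,
    // IDs outside the segment's range are dropped from its filter.
    assert_eq!(localize_filter(&[5, 105, 250], 100, 100), vec![5]);
}
```

With roaring this is a range extraction plus an offset shift; the key property is that it also makes the first segment's accidental correctness intentional.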


3. index_type='hnsw_pq' is silently a no-op — PQ is never instantiated

File: nodedb-vector/src/collection/lifecycle.rs:209-261

pub fn complete_build(&mut self, segment_id: u32, index: HnswIndex) {
    ...
    let sq8 = Self::build_sq8_for_index(&index);    // ← always SQ8, never PQ
    ...
}

grep -rn PqCodec nodedb-vector/src/collection/ returns zero matches. The DSL accepts index_type='hnsw_pq', pq_m=8 and stores it, but complete_build unconditionally calls build_sq8_for_index. Users who configured PQ for the advertised 8–16× memory reduction silently get 4× SQ8 instead. At 1024-dim × 100k+ rows per collection that's the difference between fitting in RAM and spilling to mmap on the hot path. Stats API (lifecycle.rs:385) reports quantization as Sq8 or None, never Pq — operators have no signal that the configuration was ignored.

Repro:

CREATE INDEX ON t USING hnsw (v) WITH (index_type='hnsw_pq', pq_m=8, m=16);
INSERT INTO t SELECT array_fill(random()::real, ARRAY[1024])::vector(1024) FROM generate_series(1, 100000);
SELECT * FROM vector_index_stats WHERE relation = 't'; -- quantization = 'Sq8'
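The fix shape is a dispatch on the stored index_type at complete_build time instead of the unconditional SQ8 call. The enums below are illustrative stand-ins for whatever configuration complete_build actually sees, not the real types:

```rust
// Sketch: choose the quantizer from the configured index_type rather than
// always building SQ8. IndexType/Quantizer are hypothetical stand-ins.
#[derive(Debug, PartialEq)]
enum Quantizer {
    Sq8,
    Pq { m: usize },
}

enum IndexType {
    Hnsw,
    HnswSq8,
    HnswPq { pq_m: usize },
}

fn pick_quantizer(cfg: &IndexType) -> Option<Quantizer> {
    match cfg {
        IndexType::Hnsw => None,
        IndexType::HnswSq8 => Some(Quantizer::Sq8),
        // The arm the current complete_build never takes.
        IndexType::HnswPq { pq_m } => Some(Quantizer::Pq { m: *pq_m }),
    }
}

fn main() {
    let q = pick_quantizer(&IndexType::HnswPq { pq_m: 8 });
    assert_eq!(q, Some(Quantizer::Pq { m: 8 }));
}
```

Whatever Quantizer variant is picked should also flow through to the stats API, so the repro above reports 'Pq' instead of 'Sq8'.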

4. Soft-deletes in growing & building segments lost on checkpoint restore

File: nodedb-vector/src/collection/checkpoint.rs:55-76, 120-154

growing_vectors: (0..self.growing.len() as u32)
    .filter_map(|i| self.growing.get_vector(i).map(|v| v.to_vec()))
    .collect(),
growing_deleted: (0..self.growing.len() as u32)
    .map(|i| self.growing.get_vector(i).is_none())
    .collect(),
...
let mut growing = FlatIndex::new(snap.dim, metric);
for v in &snap.growing_vectors {
    growing.insert(v.clone());
}

FlatIndex::get_vector (flat.rs:161-169) returns Some(...) even for tombstoned vectors — it gates only on bounds, not on deleted[idx]. So growing_deleted is always all-false and every live vector is serialized. Then from_checkpoint (line 120-123) never reads growing_deleted anyway — every restored vector comes back as live. Building-segment snapshots (line 74-76) have the same bug. Soft-delete-via-valid_until workflows depend on deletes being durable across restarts; after a crash, every deleted row in growing/building segments resurrects.
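The first half of the fix is to make get_vector tombstone-aware so the snapshot loop sees deletes. A minimal sketch with illustrative field names (the real FlatIndex layout may differ):

```rust
// Sketch: gate get_vector on the tombstone bitmap, not just bounds, so
// checkpoint serialization naturally skips deleted vectors and
// growing_deleted carries real information.
struct FlatIndexSketch {
    vectors: Vec<Vec<f32>>,
    deleted: Vec<bool>, // tombstones, parallel to `vectors`
}

impl FlatIndexSketch {
    fn get_vector(&self, idx: usize) -> Option<&[f32]> {
        if idx >= self.vectors.len() || self.deleted[idx] {
            return None; // tombstoned entries are invisible
        }
        Some(&self.vectors[idx])
    }
}

fn main() {
    let idx = FlatIndexSketch {
        vectors: vec![vec![1.0], vec![2.0]],
        deleted: vec![false, true],
    };
    assert!(idx.get_vector(0).is_some());
    assert!(idx.get_vector(1).is_none()); // deleted stays deleted
}
```

The second half is symmetric: from_checkpoint must consume growing_deleted (and the building-segment bitmap) and re-apply the tombstones after inserting, otherwise the restore side resurrects them anyway.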


5. HNSW compaction renumbers local IDs but doc_id_map / multi_doc_map are not remapped

Files: nodedb-vector/src/hnsw/graph.rs:282-352 (HnswIndex::compact) + nodedb-vector/src/collection/lifecycle.rs:38-40, 119-122, 279-285

// lifecycle.rs
pub fn compact(&mut self) -> usize {
    let mut total_removed = 0;
    for seg in &mut self.sealed {
        total_removed += seg.index.compact();   // renumbers LOCAL node ids
    }
    total_removed
}

HnswIndex::compact builds an id_map: Vec<u32> and rewrites every node's neighbor edges so the in-graph references stay consistent. But the collection stores doc_id_map: HashMap<u32 /* GLOBAL */, String> and multi_doc_map: HashMap<String, Vec<u32 /* GLOBAL */>>, and these are not remapped. After compact, collection.get_doc_id(vid) either returns a stale string (the global id seg.base_id + old_local no longer maps to the right live node) or the multi-vector delete-by-doc dispatches to IDs that point at a different live chunk. With bulk-insert workloads (many vectors per logical doc) this corrupts the doc → vector mapping every compaction cycle.
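A sketch of the remap pass, under the assumption that compact's id_map can be exposed as old-local → Option<new-local> (None = node removed); the function name and shapes are illustrative, not the real API:

```rust
use std::collections::HashMap;

// Rewrite the collection's global-ID doc map after one segment compacts:
// entries in this segment's range are moved to base_id + new_local, entries
// whose vector was removed are dropped, and entries belonging to other
// segments pass through untouched.
fn remap_doc_ids(
    doc_id_map: &mut HashMap<u32, String>,
    base_id: u32,
    id_map: &[Option<u32>], // id_map[old_local] = Some(new_local) | None
) {
    let old = std::mem::take(doc_id_map);
    for (gid, doc) in old {
        let local = gid.wrapping_sub(base_id) as usize;
        if gid < base_id || local >= id_map.len() {
            doc_id_map.insert(gid, doc); // not this segment; keep as-is
        } else if let Some(new_local) = id_map[local] {
            doc_id_map.insert(base_id + new_local, doc);
        }
        // else: removed by compaction -> drop the stale entry
    }
}

fn main() {
    let mut m = HashMap::from([(100, "a".to_string()), (102, "b".to_string())]);
    // Old local 0 was removed; old local 2 became new local 0.
    remap_doc_ids(&mut m, 100, &[None, None, Some(0)]);
    assert_eq!(m.get(&100), Some(&"b".to_string()));
    assert_eq!(m.len(), 1);
}
```

multi_doc_map needs the inverse walk (rewrite each Vec<u32>, dropping removed IDs), and both should happen inside the same critical section as compact so readers never see the torn state.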


6. random_layer has no max_layer cap — one outlier insert promotes index to ~60+ layers, every subsequent search pays per-layer cost

File: nodedb-vector/src/hnsw/graph.rs:257-261

pub(crate) fn random_layer(&mut self) -> usize {
    let ml = 1.0 / (self.params.m as f64).ln();
    let r = self.rng.next_f64().max(f64::MIN_POSITIVE);
    (-r.ln() * ml).floor() as usize
}

next_f64 is xorshift.0 as f64 / u64::MAX as f64. The .max(f64::MIN_POSITIVE) clamp only floors r at ~2.2e-308, so -ln(r) can reach ~708; with m=2 the sampled layer can be ~1022, and with m=16 it can be ~255. Realistic xorshift states a few thousand inserts in produce r ≈ 1e-17..1e-19, i.e. 50-65 layers. Every subsequent search's Phase 1 greedy descent (hnsw/search.rs:31) iterates (1..=self.max_layer).rev(), so a single outlier insert turns the constant-time descent into per-search O(max_layer) work. Standard HNSW implementations cap max_layer at ~16.
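The fix is a one-line clamp on the sampled layer. A minimal sketch, with MAX_LAYER = 16 following common HNSW implementations and the RNG draw passed in explicitly for testability (both are assumptions, not the crate's signatures):

```rust
// Clamp the geometrically-sampled layer so one pathological RNG draw
// cannot blow up max_layer and tax every subsequent search.
const MAX_LAYER: usize = 16;

fn random_layer(m: usize, r: f64) -> usize {
    let ml = 1.0 / (m as f64).ln();
    let layer = (-r.max(f64::MIN_POSITIVE).ln() * ml).floor() as usize;
    layer.min(MAX_LAYER) // the cap the current code is missing
}

fn main() {
    // Even an adversarially tiny draw cannot exceed the cap...
    assert_eq!(random_layer(16, 1e-300), MAX_LAYER);
    // ...while typical draws stay in the expected low range.
    assert!(random_layer(16, 0.5) <= 1);
}
```

hnswlib-style implementations make the cap configurable but default it well below the ~60 layers a bad draw produces here.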


7. PQ k-means in pq.rs and ivf.rs is deterministic farthest-point with broken min-distance update — codebooks collapse on duplicated data

Files: nodedb-vector/src/quantize/pq.rs:181-201, nodedb-vector/src/ivf.rs:223-238 (identical pattern)

// Update min_dists against centroids[c - 1] only
for (i, dists) in min_dists.iter_mut().enumerate() {
    let d = squared_distance(data[i], &centroids[c - 1]);
    if d < *dists { *dists = d; }
}
// Then deterministic argmax (NOT k-means++ sampling proportional to d²)
let best_idx = min_dists.iter().enumerate()
    .max_by(|a, b| a.1.partial_cmp(b.1).unwrap_or(Ordering::Equal))
    .map(|(i, _)| i).unwrap_or(0);
centroids.push(data[best_idx].to_vec());

Two compounding bugs: (a) min_dists is updated only against the last centroid, not against the set, so once two centroids coincide every subsequent pick lands on the same best_idx. (b) Comment claims "k-means++" but logic is deterministic farthest-point — outliers dominate. With pq_m=8 on 1024-dim vectors (128 dims per subspace), a chunk-ingest batch where embeddings share a common prefix or suffix (typical for templated chat / repeated headers / footers) trains a codebook where most of 256 centroids alias to one or two points. Encode then maps essentially every vector to the same code → PQ recall destroyed.
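For reference, the k-means++ pick step the checklist asks for looks like the sketch below: keep min_dists as a running min against every centroid picked so far, then sample the next seed proportional to d² via a weighted inverse-CDF walk. rand01 stands in for whatever RNG the crate uses; this is a sketch of the technique, not the crate's code:

```rust
// k-means++ seeding step: fold the newest centroid into the running
// min-distances, then sample the next centroid with probability
// proportional to d^2 (not a deterministic argmax).
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn kmeanspp_pick(data: &[Vec<f32>], min_dists: &mut [f32], last: &[f32], rand01: f64) -> usize {
    // Running min against the full centroid set, maintained incrementally.
    for (i, d) in min_dists.iter_mut().enumerate() {
        *d = d.min(squared_distance(&data[i], last));
    }
    // Weighted sample: walk the CDF of the d^2 weights.
    let total: f32 = min_dists.iter().sum();
    let mut target = rand01 as f32 * total;
    for (i, &d) in min_dists.iter().enumerate() {
        target -= d;
        if target <= 0.0 {
            return i;
        }
    }
    min_dists.len() - 1 // float round-off fallback
}

fn main() {
    let data = vec![vec![0.0], vec![0.0], vec![10.0]];
    let mut md = vec![f32::MAX; 3];
    // With a centroid at 0.0, duplicates of it have d^2 = 0 and zero
    // sampling weight, so they can never be re-picked -- the collapse
    // mode described above disappears.
    let pick = kmeanspp_pick(&data, &mut md, &[0.0], 0.5);
    assert_eq!(pick, 2);
    assert_eq!(md, vec![0.0, 0.0, 100.0]);
}
```

Note the incremental update against only the newest centroid is correct *when* min_dists persists across picks as a running min; the reported bug is compatible with min_dists being reset (or never carrying the full set) between iterations.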


Checklist

  • 1. Validate length parity in SIMD distance kernels (or in dispatcher) — no loadu past min(a.len(), b.len()); reject on mismatch
  • 2. Either pass per-segment-localized bitmaps to search_with_bitmap_bytes, or apply the global-id offset before the f.contains(id) check
  • 3. Wire index_type='hnsw_pq' end-to-end: have complete_build call PqCodec::train instead of falling through to SQ8; surface Pq in stats
  • 4. FlatIndex::get_vector should return None for tombstoned indices (or expose a live_iter); from_checkpoint must read & apply growing_deleted/building-segment delete bitmaps
  • 5. After HnswIndex::compact, walk doc_id_map/multi_doc_map and rewrite global IDs through (seg.base_id + new_local)
  • 6. Cap random_layer at a configured max (typical: 16); also consider switching xorshift seed handling so first-call output isn't pathological
  • 7. Implement real k-means++ (sample proportional to d²) and keep min_dists a running min over the full centroid set on every pick; emit a warning when the trained codebook contains duplicate centroids

All items independently reproducible with the SQL/Rust shown above.
