
Vector subsystem — correctness, memory-safety, quantization (7 sub-items) #50

@hollanf


Seven independent bugs in nodedb-vector beyond what's already filed (#37 mmap overflow, #40 SQ8 brute-force scan, #41 ef_search unbounded). Three are correctness bugs that silently return wrong results. One is memory-safety UB on the search hot path. One is a feature that's silently a no-op. Two are performance pathologies that turn the index from log-scale into linear.


1. SIMD distance kernels read past buffer end on length mismatch (memory-safety UB)

File: nodedb-vector/src/distance/simd.rs:147-210 (AVX2 L2 + cosine, NEON twins)

#[target_feature(enable = "avx2,fma")]
unsafe fn l2_squared_impl(a: &[f32], b: &[f32]) -> f32 {
    unsafe {
        let n = a.len();                              // ← uses a.len() only
        let chunks = n / 8;
        for i in 0..chunks {
            let off = i * 8;
            let va = _mm256_loadu_ps(a.as_ptr().add(off));
            let vb = _mm256_loadu_ps(b.as_ptr().add(off));   // UB if b.len() < a.len()
            ...
        }
        for i in (chunks * 8)..n {
            let d = a[i] - b[i];                      // panic if b shorter
            ...
        }
    }
}

The kernel iterates using a.len() and dereferences b.as_ptr().add(off) without verifying b.len() >= a.len(). The assert_eq!(query.len(), self.dim) guard in hnsw/search.rs:19 validates the query against the index dim, but not against each individual stored vector's length. A node whose Vec<f32> is shorter than the index dim (possible after a truncated checkpoint restore — from_checkpoint has no per-node length check) drops the search hot path into undefined behavior via an _mm256_loadu_ps read past the end of the buffer.
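One way to close this hole is a parity check at the dispatcher level, before any unsafe kernel is selected. The sketch below is illustrative only — l2_squared_checked is a hypothetical name, not the actual nodedb-vector API — and uses the safe scalar path where the real dispatcher would pick AVX2/NEON:

```rust
// Hypothetical dispatcher-level guard: reject mismatched slices before any
// unsafe SIMD kernel runs. A length mismatch here means corrupted or
// truncated storage, so refusing is safer than reading past the buffer.
fn l2_squared_checked(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() {
        return None; // never reaches _mm256_loadu_ps with a short slice
    }
    // Safe scalar fallback; the real dispatcher would select AVX2/NEON here.
    Some(a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum())
}

fn main() {
    assert_eq!(l2_squared_checked(&[1.0, 2.0], &[1.0, 0.0]), Some(4.0));
    assert_eq!(l2_squared_checked(&[1.0, 2.0], &[1.0]), None);
}
```

Alternatively the check can live in from_checkpoint (reject short vectors at restore time), which keeps the hot path branch-free.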


2. Tenant Roaring bitmap pre-filter feeds GLOBAL ids to per-segment LOCAL-id lookup → near-zero recall after first segment seal

File: nodedb-vector/src/collection/search.rs:115-135 + hnsw/search.rs:140 (f.contains(id))

for seg in &self.sealed {
    let results = seg.index.search_with_bitmap_bytes(query, top_k, ef, bitmap);
    for mut r in results {
        r.id += seg.base_id;        // ← offset added AFTER bitmap filtering
        all.push(r);
    }
}

The Roaring bitmap is built by the query planner from global vector IDs. Each sealed segment stores vectors at local IDs 0..seg.index.len(). HnswIndex::search_with_bitmap_bytes tests f.contains(id) with the local ID. For the first segment (base_id == 0) local == global so filtering works by accident; for every segment with base_id > 0 (i.e. once the collection has > ~65k vectors and seals once), the bitmap lookup is against the wrong ID space and almost every candidate is silently dropped → recall collapses. The growing segment has the same shape via growing_base_id.
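The cheaper of the two fixes in the checklist is to localize the filter once per segment before calling search_with_bitmap_bytes. A minimal sketch, using a plain Vec<u32> as a stand-in for the Roaring bitmap (localize_filter and the base_id/len fields mirror the report, not the real API):

```rust
// Shift a global-ID filter into one segment's local ID space: keep only
// the IDs this segment owns, then subtract base_id so f.contains(id)
// tests the same ID space the segment's HNSW graph uses.
fn localize_filter(global_ids: &[u32], base_id: u32, seg_len: u32) -> Vec<u32> {
    global_ids
        .iter()
        .filter(|&&g| g >= base_id && g < base_id + seg_len)
        .map(|&g| g - base_id) // global -> local
        .collect()
}

fn main() {
    // Segment covering global IDs 100..200: global 105 -> local 5,
    // IDs outside the segment's range are dropped from its filter.
    assert_eq!(localize_filter(&[5, 105, 250], 100, 100), vec![5]);
}
```

With roaring this is a range extraction plus an offset shift; the key property is that it also makes the first segment's accidental correctness intentional.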


3. index_type='hnsw_pq' is silently a no-op — PQ is never instantiated

File: nodedb-vector/src/collection/lifecycle.rs:209-261

pub fn complete_build(&mut self, segment_id: u32, index: HnswIndex) {
    ...
    let sq8 = Self::build_sq8_for_index(&index);    // ← always SQ8, never PQ
    ...
}

grep -rn PqCodec nodedb-vector/src/collection/ returns zero matches. The DSL accepts index_type='hnsw_pq', pq_m=8 and stores it, but complete_build unconditionally calls build_sq8_for_index. Users who configured PQ for the advertised 8–16× memory reduction silently get 4× SQ8 instead. At 1024-dim × 100k+ rows per collection that's the difference between fitting in RAM and spilling to mmap on the hot path. Stats API (lifecycle.rs:385) reports quantization as Sq8 or None, never Pq — operators have no signal that the configuration was ignored.

Repro:

CREATE INDEX ON t USING hnsw (v) WITH (index_type='hnsw_pq', pq_m=8, m=16);
INSERT INTO t SELECT array_fill(random()::real, ARRAY[1024])::vector(1024) FROM generate_series(1, 100000);
SELECT * FROM vector_index_stats WHERE relation = 't'; -- quantization = 'Sq8'
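The fix shape is a dispatch on the stored index_type at complete_build time instead of the unconditional SQ8 call. The enums below are illustrative stand-ins for whatever configuration complete_build actually sees, not the real types:

```rust
// Sketch: choose the quantizer from the configured index_type rather than
// always building SQ8. IndexType/Quantizer are hypothetical stand-ins.
#[derive(Debug, PartialEq)]
enum Quantizer {
    Sq8,
    Pq { m: usize },
}

enum IndexType {
    Hnsw,
    HnswSq8,
    HnswPq { pq_m: usize },
}

fn pick_quantizer(cfg: &IndexType) -> Option<Quantizer> {
    match cfg {
        IndexType::Hnsw => None,
        IndexType::HnswSq8 => Some(Quantizer::Sq8),
        // The arm the current complete_build never takes.
        IndexType::HnswPq { pq_m } => Some(Quantizer::Pq { m: *pq_m }),
    }
}

fn main() {
    let q = pick_quantizer(&IndexType::HnswPq { pq_m: 8 });
    assert_eq!(q, Some(Quantizer::Pq { m: 8 }));
}
```

Whatever Quantizer variant is picked should also flow through to the stats API, so the repro above reports 'Pq' instead of 'Sq8'.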

4. Soft-deletes in growing & building segments lost on checkpoint restore

File: nodedb-vector/src/collection/checkpoint.rs:55-76, 120-154

growing_vectors: (0..self.growing.len() as u32)
    .filter_map(|i| self.growing.get_vector(i).map(|v| v.to_vec()))
    .collect(),
growing_deleted: (0..self.growing.len() as u32)
    .map(|i| self.growing.get_vector(i).is_none())
    .collect(),
...
let mut growing = FlatIndex::new(snap.dim, metric);
for v in &snap.growing_vectors {
    growing.insert(v.clone());
}

FlatIndex::get_vector (flat.rs:161-169) returns Some(...) even for tombstoned vectors — it gates only on bounds, not on deleted[idx]. So growing_deleted is always all-false and every live vector is serialized. Then from_checkpoint (line 120-123) never reads growing_deleted anyway — every restored vector comes back as live. Building-segment snapshots (line 74-76) have the same bug. Soft-delete-via-valid_until workflows depend on deletes being durable across restarts; after a crash, every deleted row in growing/building segments resurrects.
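The first half of the fix is to make get_vector tombstone-aware so the snapshot loop sees deletes. A minimal sketch with illustrative field names (the real FlatIndex layout may differ):

```rust
// Sketch: gate get_vector on the tombstone bitmap, not just bounds, so
// checkpoint serialization naturally skips deleted vectors and
// growing_deleted carries real information.
struct FlatIndexSketch {
    vectors: Vec<Vec<f32>>,
    deleted: Vec<bool>, // tombstones, parallel to `vectors`
}

impl FlatIndexSketch {
    fn get_vector(&self, idx: usize) -> Option<&[f32]> {
        if idx >= self.vectors.len() || self.deleted[idx] {
            return None; // tombstoned entries are invisible
        }
        Some(&self.vectors[idx])
    }
}

fn main() {
    let idx = FlatIndexSketch {
        vectors: vec![vec![1.0], vec![2.0]],
        deleted: vec![false, true],
    };
    assert!(idx.get_vector(0).is_some());
    assert!(idx.get_vector(1).is_none()); // deleted stays deleted
}
```

The second half is symmetric: from_checkpoint must consume growing_deleted (and the building-segment bitmap) and re-apply the tombstones after inserting, otherwise the restore side resurrects them anyway.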


5. HNSW compaction renumbers local IDs but doc_id_map / multi_doc_map are not remapped

Files: nodedb-vector/src/hnsw/graph.rs:282-352 (HnswIndex::compact) + nodedb-vector/src/collection/lifecycle.rs:38-40, 119-122, 279-285

// lifecycle.rs
pub fn compact(&mut self) -> usize {
    let mut total_removed = 0;
    for seg in &mut self.sealed {
        total_removed += seg.index.compact();   // renumbers LOCAL node ids
    }
    total_removed
}

HnswIndex::compact builds an id_map: Vec<u32> and rewrites every node's neighbor edges so the in-graph references stay consistent. But the collection stores doc_id_map: HashMap<u32 /* GLOBAL */, String> and multi_doc_map: HashMap<String, Vec<u32 /* GLOBAL */>>, and these are not remapped. After compact, collection.get_doc_id(vid) either returns a stale string (the global id seg.base_id + old_local no longer maps to the right live node) or the multi-vector delete-by-doc dispatches to IDs that point at a different live chunk. With bulk-insert workloads (many vectors per logical doc) this corrupts the doc → vector mapping every compaction cycle.
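A sketch of the remap pass, under the assumption that compact's id_map can be exposed as old-local → Option<new-local> (None = node removed); the function name and shapes are illustrative, not the real API:

```rust
use std::collections::HashMap;

// Rewrite the collection's global-ID doc map after one segment compacts:
// entries in this segment's range are moved to base_id + new_local, entries
// whose vector was removed are dropped, and entries belonging to other
// segments pass through untouched.
fn remap_doc_ids(
    doc_id_map: &mut HashMap<u32, String>,
    base_id: u32,
    id_map: &[Option<u32>], // id_map[old_local] = Some(new_local) | None
) {
    let old = std::mem::take(doc_id_map);
    for (gid, doc) in old {
        let local = gid.wrapping_sub(base_id) as usize;
        if gid < base_id || local >= id_map.len() {
            doc_id_map.insert(gid, doc); // not this segment; keep as-is
        } else if let Some(new_local) = id_map[local] {
            doc_id_map.insert(base_id + new_local, doc);
        }
        // else: removed by compaction -> drop the stale entry
    }
}

fn main() {
    let mut m = HashMap::from([(100, "a".to_string()), (102, "b".to_string())]);
    // Old local 0 was removed; old local 2 became new local 0.
    remap_doc_ids(&mut m, 100, &[None, None, Some(0)]);
    assert_eq!(m.get(&100), Some(&"b".to_string()));
    assert_eq!(m.len(), 1);
}
```

multi_doc_map needs the inverse walk (rewrite each Vec<u32>, dropping removed IDs), and both should happen inside the same critical section as compact so readers never see the torn state.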


6. random_layer has no max_layer cap — one outlier insert promotes index to ~60+ layers, every subsequent search pays per-layer cost

File: nodedb-vector/src/hnsw/graph.rs:257-261

pub(crate) fn random_layer(&mut self) -> usize {
    let ml = 1.0 / (self.params.m as f64).ln();
    let r = self.rng.next_f64().max(f64::MIN_POSITIVE);
    (-r.ln() * ml).floor() as usize
}

next_f64 is xorshift.0 as f64 / u64::MAX as f64. The .max(f64::MIN_POSITIVE) clamp only floors r at ~2.2e-308, so -ln(r) can reach ~708; with m=2 the sampled layer can be ~1022, and with m=16 it can be ~255. Realistic xorshift states a few thousand inserts in produce r ≈ 1e-17..1e-19, i.e. 50-65 layers. Every subsequent search's Phase 1 greedy descent (hnsw/search.rs:31) iterates (1..=self.max_layer).rev(), so a single outlier insert turns the constant-time descent into per-search O(max_layer) work. Standard HNSW implementations cap max_layer at ~16.
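The fix is a one-line clamp on the sampled layer. A minimal sketch, with MAX_LAYER = 16 following common HNSW implementations and the RNG draw passed in explicitly for testability (both are assumptions, not the crate's signatures):

```rust
// Clamp the geometrically-sampled layer so one pathological RNG draw
// cannot blow up max_layer and tax every subsequent search.
const MAX_LAYER: usize = 16;

fn random_layer(m: usize, r: f64) -> usize {
    let ml = 1.0 / (m as f64).ln();
    let layer = (-r.max(f64::MIN_POSITIVE).ln() * ml).floor() as usize;
    layer.min(MAX_LAYER) // the cap the current code is missing
}

fn main() {
    // Even an adversarially tiny draw cannot exceed the cap...
    assert_eq!(random_layer(16, 1e-300), MAX_LAYER);
    // ...while typical draws stay in the expected low range.
    assert!(random_layer(16, 0.5) <= 1);
}
```

hnswlib-style implementations make the cap configurable but default it well below the ~60 layers a bad draw produces here.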


7. PQ k-means in pq.rs and ivf.rs is deterministic farthest-point with broken min-distance update — codebooks collapse on duplicated data

Files: nodedb-vector/src/quantize/pq.rs:181-201, nodedb-vector/src/ivf.rs:223-238 (identical pattern)

// Update min_dists against centroids[c - 1] only
for (i, dists) in min_dists.iter_mut().enumerate() {
    let d = squared_distance(data[i], &centroids[c - 1]);
    if d < *dists { *dists = d; }
}
// Then deterministic argmax (NOT k-means++ sampling proportional to d²)
let best_idx = min_dists.iter().enumerate()
    .max_by(|a, b| a.1.partial_cmp(b.1).unwrap_or(Ordering::Equal))
    .map(|(i, _)| i).unwrap_or(0);
centroids.push(data[best_idx].to_vec());

Two compounding bugs: (a) min_dists is updated only against the last centroid, not against the set, so once two centroids coincide every subsequent pick lands on the same best_idx. (b) Comment claims "k-means++" but logic is deterministic farthest-point — outliers dominate. With pq_m=8 on 1024-dim vectors (128 dims per subspace), a chunk-ingest batch where embeddings share a common prefix or suffix (typical for templated chat / repeated headers / footers) trains a codebook where most of 256 centroids alias to one or two points. Encode then maps essentially every vector to the same code → PQ recall destroyed.
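For reference, the k-means++ pick step the checklist asks for looks like the sketch below: keep min_dists as a running min against every centroid picked so far, then sample the next seed proportional to d² via a weighted inverse-CDF walk. rand01 stands in for whatever RNG the crate uses; this is a sketch of the technique, not the crate's code:

```rust
// k-means++ seeding step: fold the newest centroid into the running
// min-distances, then sample the next centroid with probability
// proportional to d^2 (not a deterministic argmax).
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn kmeanspp_pick(data: &[Vec<f32>], min_dists: &mut [f32], last: &[f32], rand01: f64) -> usize {
    // Running min against the full centroid set, maintained incrementally.
    for (i, d) in min_dists.iter_mut().enumerate() {
        *d = d.min(squared_distance(&data[i], last));
    }
    // Weighted sample: walk the CDF of the d^2 weights.
    let total: f32 = min_dists.iter().sum();
    let mut target = rand01 as f32 * total;
    for (i, &d) in min_dists.iter().enumerate() {
        target -= d;
        if target <= 0.0 {
            return i;
        }
    }
    min_dists.len() - 1 // float round-off fallback
}

fn main() {
    let data = vec![vec![0.0], vec![0.0], vec![10.0]];
    let mut md = vec![f32::MAX; 3];
    // With a centroid at 0.0, duplicates of it have d^2 = 0 and zero
    // sampling weight, so they can never be re-picked -- the collapse
    // mode described above disappears.
    let pick = kmeanspp_pick(&data, &mut md, &[0.0], 0.5);
    assert_eq!(pick, 2);
    assert_eq!(md, vec![0.0, 0.0, 100.0]);
}
```

Note the incremental update against only the newest centroid is correct *when* min_dists persists across picks as a running min; the reported bug is compatible with min_dists being reset (or never carrying the full set) between iterations.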


Checklist

  • 1. Validate length parity in SIMD distance kernels (or in dispatcher) — no loadu past min(a.len(), b.len()); reject on mismatch
  • 2. Either pass per-segment-localized bitmaps to search_with_bitmap_bytes, or apply the global-id offset before the f.contains(id) check
  • 3. Wire index_type='hnsw_pq' end-to-end: have complete_build call PqCodec::train instead of falling through to SQ8; surface Pq in stats
  • 4. FlatIndex::get_vector should return None for tombstoned indices (or expose a live_iter); from_checkpoint must read & apply growing_deleted/building-segment delete bitmaps
  • 5. After HnswIndex::compact, walk doc_id_map/multi_doc_map and rewrite global IDs through (seg.base_id + new_local)
  • 6. Cap random_layer at a configured max (typical: 16); also consider switching xorshift seed handling so first-call output isn't pathological
  • 7. Implement real k-means++ (sample proportional to d²) and keep min_dists a running min over the full centroid set on every pick; emit a warning when the trained codebook contains duplicate centroids

All items independently reproducible with the SQL/Rust shown above.
