Seven independent bugs in nodedb-vector beyond what's already filed (#37 mmap overflow, #40 SQ8 brute-force scan, #41 ef_search unbounded). Three are correctness bugs that silently return wrong results. One is memory-safety UB on the search hot path. One is a feature that's silently a no-op. Two are performance pathologies that turn the index from log-scale into linear.
1. SIMD distance kernels read past buffer end on length mismatch (memory-safety UB)
File: nodedb-vector/src/distance/simd.rs:147-210 (AVX2 L2 + cosine, NEON twins)
#[target_feature(enable = "avx2,fma")]
unsafe fn l2_squared_impl(a: &[f32], b: &[f32]) -> f32 {
    unsafe {
        let n = a.len(); // ← uses a.len() only
        let chunks = n / 8;
        for i in 0..chunks {
            let off = i * 8;
            let va = _mm256_loadu_ps(a.as_ptr().add(off));
            let vb = _mm256_loadu_ps(b.as_ptr().add(off)); // UB if b.len() < a.len()
            ...
        }
        for i in (chunks * 8)..n {
            let d = a[i] - b[i]; // panic if b shorter
            ...
        }
    }
}
The kernel iterates with a.len() and dereferences b.as_ptr().add(off) without verifying b.len() >= a.len(). The assert_eq!(query.len(), self.dim) guard in hnsw/search.rs:19 validates the query against the index dim, but not against each individual stored vector's length. A node whose Vec<f32> is shorter than the index dim (possible after a truncated checkpoint restore — from_checkpoint has no per-node length check) puts the search hot path into undefined behavior via a _mm256_loadu_ps read past the end of the buffer.
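A minimal sketch of the guard this needs, checking lengths before any pointer arithmetic. The names `l2_squared_scalar` and `l2_squared_checked` are illustrative, not the crate's actual symbols:

```rust
// Illustrative scalar fallback standing in for the SIMD kernel.
fn l2_squared_scalar(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x - y) * (x - y)).sum()
}

// Hypothetical length-checked entry point: reject mismatches up front so
// the unsafe SIMD path can never read past the shorter slice.
fn l2_squared_checked(a: &[f32], b: &[f32]) -> Result<f32, String> {
    if a.len() != b.len() {
        return Err(format!("vector length mismatch: {} vs {}", a.len(), b.len()));
    }
    Ok(l2_squared_scalar(a, b))
}
```

Equivalently, the kernels themselves could loop to `min(a.len(), b.len())`, but a hard reject surfaces the truncated-checkpoint corruption instead of silently computing a partial distance.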
2. Tenant Roaring bitmap pre-filter feeds GLOBAL ids to per-segment LOCAL-id lookup → near-zero recall after first segment seal
File: nodedb-vector/src/collection/search.rs:115-135 + hnsw/search.rs:140 (f.contains(id))
for seg in &self.sealed {
    let results = seg.index.search_with_bitmap_bytes(query, top_k, ef, bitmap);
    for mut r in results {
        r.id += seg.base_id; // ← offset added AFTER bitmap filtering
        all.push(r);
    }
}
The Roaring bitmap is built by the query planner from global vector IDs. Each sealed segment stores vectors at local IDs 0..seg.index.len(). HnswIndex::search_with_bitmap_bytes tests f.contains(id) with the local ID. For the first segment (base_id == 0) local == global so filtering works by accident; for every segment with base_id > 0 (i.e. once the collection has > ~65k vectors and seals once), the bitmap lookup is against the wrong ID space and almost every candidate is silently dropped → recall collapses. The growing segment has the same shape via growing_base_id.
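A sketch of the ID-space fix, with a sorted `Vec<u32>` standing in for the Roaring bitmap; `base_id` follows the report, the helper name is illustrative:

```rust
// Membership test done in the planner's GLOBAL ID space: translate the
// segment-local candidate ID through base_id *before* the contains check,
// instead of testing the raw local ID against a global-ID bitmap.
fn bitmap_allows(global_ids: &[u32], base_id: u32, local_id: u32) -> bool {
    let global = base_id + local_id;
    global_ids.binary_search(&global).is_ok()
}
```

The symmetric alternative — shifting the bitmap into local space once per segment before the search — avoids the per-candidate addition but costs a bitmap rebuild per segment.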
3. index_type='hnsw_pq' is silently a no-op — PQ is never instantiated
File: nodedb-vector/src/collection/lifecycle.rs:209-261
pub fn complete_build(&mut self, segment_id: u32, index: HnswIndex) {
    ...
    let sq8 = Self::build_sq8_for_index(&index); // ← always SQ8, never PQ
    ...
}
grep -rn PqCodec nodedb-vector/src/collection/ returns zero matches. The DSL accepts index_type='hnsw_pq', pq_m=8 and stores it, but complete_build unconditionally calls build_sq8_for_index. Users who configured PQ for the advertised 8–16× memory reduction silently get 4× SQ8 instead. At 1024-dim × 100k+ rows per collection that's the difference between fitting in RAM and spilling to mmap on the hot path. Stats API (lifecycle.rs:385) reports quantization as Sq8 or None, never Pq — operators have no signal that the configuration was ignored.
Repro:
CREATE INDEX ON t USING hnsw (v) WITH (index_type='hnsw_pq', pq_m=8, m=16);
INSERT INTO t SELECT array_fill(random()::real, ARRAY[1024])::vector(1024) FROM generate_series(1, 100000);
SELECT * FROM vector_index_stats WHERE relation = 't'; -- quantization = 'Sq8'
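The shape of the missing dispatch can be sketched as follows; the enum and function are illustrative, not the crate's actual API — only the `index_type` string keys come from the DSL in the report:

```rust
// Hypothetical codec selection for complete_build: honor the configured
// index_type instead of unconditionally building SQ8.
#[derive(Debug)]
enum Quantization {
    None,
    Sq8,
    Pq { m: usize },
}

fn pick_quantization(index_type: &str, pq_m: usize) -> Quantization {
    match index_type {
        "hnsw_pq" => Quantization::Pq { m: pq_m },
        "hnsw_sq8" => Quantization::Sq8,
        _ => Quantization::None,
    }
}
```

Whatever the real plumbing looks like, the selected variant also needs to flow into the stats API so operators can see that PQ actually took effect.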
4. Soft-deletes in growing & building segments lost on checkpoint restore
File: nodedb-vector/src/collection/checkpoint.rs:55-76, 120-154
growing_vectors: (0..self.growing.len() as u32)
    .filter_map(|i| self.growing.get_vector(i).map(|v| v.to_vec()))
    .collect(),
growing_deleted: (0..self.growing.len() as u32)
    .map(|i| self.growing.get_vector(i).is_none())
    .collect(),
...
let mut growing = FlatIndex::new(snap.dim, metric);
for v in &snap.growing_vectors {
    growing.insert(v.clone());
}
FlatIndex::get_vector (flat.rs:161-169) returns Some(...) even for tombstoned vectors — it gates only on bounds, not on deleted[idx]. So growing_deleted is always all-false and every live vector is serialized. Then from_checkpoint (line 120-123) never reads growing_deleted anyway — every restored vector comes back as live. Building-segment snapshots (line 74-76) have the same bug. Soft-delete-via-valid_until workflows depend on deletes being durable across restarts; after a crash, every deleted row in growing/building segments resurrects.
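A minimal stand-in showing the tombstone-aware lookup the report calls for; the field names mirror the report, but the struct itself is illustrative:

```rust
// Illustrative FlatIndex: vectors plus a parallel tombstone bitmap.
struct FlatIndex {
    vectors: Vec<Vec<f32>>,
    deleted: Vec<bool>,
}

impl FlatIndex {
    // Gate on the tombstone bit as well as bounds, so a caller that derives
    // `growing_deleted` from `is_none()` actually observes deletes.
    fn get_vector(&self, i: u32) -> Option<&[f32]> {
        let i = i as usize;
        if i < self.vectors.len() && !self.deleted[i] {
            Some(&self.vectors[i])
        } else {
            None
        }
    }
}
```

The restore side still needs the complementary fix: from_checkpoint must read the delete bitmaps back and re-tombstone, or the snapshot format must skip dead slots entirely.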
5. HNSW compaction renumbers local IDs but doc_id_map / multi_doc_map are not remapped
Files: nodedb-vector/src/hnsw/graph.rs:282-352 (HnswIndex::compact) + nodedb-vector/src/collection/lifecycle.rs:38-40, 119-122, 279-285
// lifecycle.rs
pub fn compact(&mut self) -> usize {
    let mut total_removed = 0;
    for seg in &mut self.sealed {
        total_removed += seg.index.compact(); // renumbers LOCAL node ids
    }
    total_removed
}
HnswIndex::compact builds an id_map: Vec<u32> and rewrites every node's neighbor edges so the in-graph references stay consistent. But the collection stores doc_id_map: HashMap<u32 /* GLOBAL */, String> and multi_doc_map: HashMap<String, Vec<u32 /* GLOBAL */>>, and these are not remapped. After compact, collection.get_doc_id(vid) either returns a stale string (the global id seg.base_id + old_local no longer maps to the right live node) or the multi-vector delete-by-doc dispatches to IDs that point at a different live chunk. With bulk-insert workloads (many vectors per logical doc) this corrupts the doc → vector mapping every compaction cycle.
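The missing remap step for one segment can be sketched like this, assuming compact exposes its old-local → new-local mapping (here as a HashMap; names are illustrative):

```rust
use std::collections::HashMap;

// Rewrite this segment's doc_id_map keys through (base_id + new_local).
// `id_map` holds surviving old local IDs -> new local IDs; entries whose
// node was removed by compaction are dropped instead of left stale.
fn remap_doc_ids(
    doc_id_map: HashMap<u32, String>, // global id -> doc, this segment only
    id_map: &HashMap<u32, u32>,       // old_local -> new_local
    base_id: u32,
) -> HashMap<u32, String> {
    doc_id_map
        .into_iter()
        .filter_map(|(global, doc)| {
            let old_local = global - base_id;
            id_map.get(&old_local).map(|&nl| (base_id + nl, doc))
        })
        .collect()
}
```

multi_doc_map needs the same walk in the other direction (rewriting each Vec<u32> entry and dropping IDs of removed chunks).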
6. random_layer has no max_layer cap — one outlier insert promotes index to ~60+ layers, every subsequent search pays per-layer cost
File: nodedb-vector/src/hnsw/graph.rs:257-261
pub(crate) fn random_layer(&mut self) -> usize {
    let ml = 1.0 / (self.params.m as f64).ln();
    let r = self.rng.next_f64().max(f64::MIN_POSITIVE);
    (-r.ln() * ml).floor() as usize
}
next_f64 is xorshift.0 as f64 / u64::MAX as f64. The .max(f64::MIN_POSITIVE) clamp floors r at 2.2e-308, so -ln(r) reaches ~708; with m=2 the layer can be ~1021, with m=16 it can be ~255. Realistic xorshift states a few thousand inserts in produce r ≈ 1e-17..1e-19, which at m=2 yields 50-65 layers. Every subsequent search's Phase 1 greedy descent (hnsw/search.rs:31) iterates (1..=self.max_layer).rev(), so one outlier insert turns the constant-time descent into per-search O(max_layer). Standard HNSW caps max_layer at ~16.
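A capped draw is a one-line change; MAX_LAYER below is the paper-typical cap and `r` stands in for the crate's xorshift output in (0, 1], so this is a sketch rather than the crate's actual function:

```rust
// Geometric layer assignment with a hard cap: one pathological draw can
// no longer promote the whole index to 60+ layers.
const MAX_LAYER: usize = 16;

fn capped_random_layer(m: usize, r: f64) -> usize {
    let ml = 1.0 / (m as f64).ln();
    let layer = (-r.max(f64::MIN_POSITIVE).ln() * ml).floor() as usize;
    layer.min(MAX_LAYER)
}
```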
7. PQ k-means in pq.rs and ivf.rs is deterministic farthest-point with broken min-distance update — codebooks collapse on duplicated data
Files: nodedb-vector/src/quantize/pq.rs:181-201, nodedb-vector/src/ivf.rs:223-238 (identical pattern)
// Update min_dists against centroids[c - 1] only
for (i, dists) in min_dists.iter_mut().enumerate() {
    let d = squared_distance(data[i], &centroids[c - 1]);
    if d < *dists { *dists = d; }
}
// Then deterministic argmax (NOT k-means++ sampling proportional to d²)
let best_idx = min_dists.iter().enumerate()
    .max_by(|a, b| a.1.partial_cmp(b.1).unwrap_or(Ordering::Equal))
    .map(|(i, _)| i).unwrap_or(0);
centroids.push(data[best_idx].to_vec());
Two compounding bugs: (a) min_dists is updated only against the last centroid, not against the set, so once two centroids coincide every subsequent pick lands on the same best_idx. (b) Comment claims "k-means++" but logic is deterministic farthest-point — outliers dominate. With pq_m=8 on 1024-dim vectors (128 dims per subspace), a chunk-ingest batch where embeddings share a common prefix or suffix (typical for templated chat / repeated headers / footers) trains a codebook where most of 256 centroids alias to one or two points. Encode then maps essentially every vector to the same code → PQ recall destroyed.
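For reference, proper k-means++ seeding samples proportional to d² and keeps min_dists equal to the distance to the nearest centroid in the whole set. A self-contained sketch (`rand01` is an illustrative stand-in for the crate's RNG):

```rust
fn squared_distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

// k-means++ seeding: d²-proportional sampling instead of deterministic
// argmax, so outliers no longer dominate and exact duplicates get zero weight.
fn kmeans_pp_seed(data: &[Vec<f32>], k: usize, mut rand01: impl FnMut() -> f32) -> Vec<Vec<f32>> {
    let mut centroids = vec![data[0].clone()];
    let mut min_dists: Vec<f32> =
        data.iter().map(|v| squared_distance(v, &data[0])).collect();
    while centroids.len() < k {
        // Sample the next seed with probability proportional to min_dists.
        let total: f32 = min_dists.iter().sum();
        let mut target = rand01() * total;
        let mut pick = 0;
        for (i, &d) in min_dists.iter().enumerate() {
            if target <= d {
                pick = i;
                break;
            }
            target -= d;
        }
        centroids.push(data[pick].clone());
        // Running-min refresh against the centroid just added keeps
        // min_dists equal to the distance to the NEAREST centroid.
        let last = centroids.last().unwrap().clone();
        for (i, v) in data.iter().enumerate() {
            let d = squared_distance(v, &last);
            if d < min_dists[i] {
                min_dists[i] = d;
            }
        }
    }
    centroids
}
```

Note the refresh here is also incremental against the last-added centroid — that is correct precisely because min_dists already holds the running minimum over all earlier centroids.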
Checklist
All items independently reproducible with the SQL/Rust shown above.
- Never loadu past min(a.len(), b.len()); reject on length mismatch
- Remap the bitmap into local-ID space before search_with_bitmap_bytes, or apply the global-id offset before the f.contains(id) check
- Wire index_type='hnsw_pq' end-to-end through complete_build → PqCodec::train instead of falling through to SQ8; surface Pq in stats
- FlatIndex::get_vector should return None for tombstoned indices (or expose a live_iter); from_checkpoint must read & apply growing_deleted / building-segment delete bitmaps
- After HnswIndex::compact, walk doc_id_map / multi_doc_map and rewrite global IDs through (seg.base_id + new_local)
- Cap random_layer at a configured max (typical: 16); also consider switching xorshift seed handling so first-call output isn't pathological
- Update min_dists against the full centroid set on every pick; warn when centroids duplicate after training