Skip to content

docs: changelog perf entries + 0.6.0 un-patch checklist#283

Open
Nelson Spence (Fieldnote-Echo) wants to merge 60 commits into
mainfrom
docs/release-hygiene
Open

docs: changelog perf entries + 0.6.0 un-patch checklist#283
Nelson Spence (Fieldnote-Echo) wants to merge 60 commits into
mainfrom
docs/release-hygiene

Conversation

@Fieldnote-Echo

Copy link
Copy Markdown
Member

CHANGELOG ### Performance entries for the four unlogged perf changes (numbers verified against source commit messages) and the RELEASING.md one-time un-patch step for ordinaldb's [patch.crates-io] block. Fleet-authored, reviewed APPROVE. Merge LAST — cut from the integration branch, so its commit list collapses to docs-only once #277/#278/#279/#280/#281 land. Known follow-up: a CHANGELOG line for the CLI-parity PR once it merges.

Verification now bounds every artifact read by its manifest-declared
file_size_bytes (manifest hard-capped at 1 MiB; SHA-256 pins content);
creation bounds reads by the observed file size. Flat ResourceLimits
byte caps become opt-in ceilings (default unbounded). The primary
artifact read, previously unbounded, gains a declared-size bound and
the artifact_file_too_large reason code. sha256_file_bounded now
streams with constant memory instead of materialising files.

Fixes the undocumented 64 MiB auxiliary cap that made sign-sidecar
bundles impossible to write past 524,288 rows at dim=1024 (measured
on a 1,258,135-row corpus).
Independent oracle (score_all + full lexicographic sort by hamming asc,
doc_id asc) pins top_m_candidates and top_m_candidates_batched_serial_csr
exactly: random corpora across block boundaries, massive-tie and
duplicate-run corpora exercising boundary tie-breaks, edge geometries
(m >= n, single doc, empty batch), and the dim=1024 shape. Must pass
bit-identically before and after the tiling swap.
Security-audit remediation (CIPHER-02): the primary artifact read was
bounded only by the attacker-declared size with no configurable
ceiling, unlike the auxiliary/profile classes — the documented
THREAT-QUERY-003 mitigation was silently incomplete for the largest
artifact in a bundle. Adds ResourceLimits::max_index_artifact_bytes
(default unbounded) applied as a min() with the declared size, and
bounds the create-path primary read by its observed size so the code
matches the CHANGELOG claim. CLI flag parity follows separately.
Code-review remediation: the sqlite-feature cache-key path duplicated
the hashing logic and missed the derived-bound change — the primary
artifact hash was fully unbounded, and the calibration/encoder
profile hashes were bounded only by the flat caps, which the default
flip turned into effectively unbounded reads. All three now use the
same declared-size .min(opt-in ceiling) derivation as the verify
path; a bound violation is a cache miss. Adds default-limits
grown-profile coverage for the calibration and encoder-distortion
call sites, closing the per-site test gap that let this slip.
top_m_candidates_batched_serial_csr previously looped the single-query
path, re-streaming the full sign bitmap per query (documented-naive
Track-1). The internals now scan the corpus once per call in L2-sized
doc blocks, score every query of the call against each hot block in
query tiles via the existing batched kernel, and select per-query
top-m with bounded (hamming, doc_id) min-collectors — bit-identical
to a full sort by construction, independent of processing order (the
key IS the contract's sort key). top_m_candidates routes through the
same core, dropping its per-call n-row Hamming materialisation.

Per-query corpus traffic drops by the call's query count: at 1.26M
rows x 1024 dims, a 2048-query call reads the 161MB sidecar once
instead of 2048 times. Serial contract preserved (no rayon); the
oracle suite (tests/tiled_candgen.rs) pins bit-identical outputs
across random, tie-heavy, duplicate-run, and edge geometries.
Audit remediation: routing top_m_candidates through the streamed core
measured +50-90% at small/medium n with m in the hundreds (bounded
heap O(n log m) vs select_nth_unstable_by O(n)); with one query there
is no scan to share, so nq=1 stays on the dense path (parity-or-better
at every measured size). Also per audit: the block-boundary oracle
test now genuinely spans three blocks (the dim=128 shape fit one
block), and adds the dim=768 AVX-512 tail-residue x multi-block case
to the permanent suite.
assert_all_finite paid a full serial pass per add/search batch —
measured ~0.1s per GiB, twice per ingest batch counting the caller
layer. Scans of 1M+ floats now split across the rayon pool (4.4x
measured). RankQuant::add's per-row closure allocated a fresh ranks
Vec per vector inside the parallel loop; for_each_init now reuses a
per-worker scratch via rank_transform_into. Measured on the 1.26M x
1024 corpus slice: encode-path attribution 0.097s serial scan ->
0.022s parallel; alloc churn removed from the hot loop.
Doc ids visit each per-query heap strictly ascending, so a candidate
tying the worst kept hamming always loses the (hamming, doc_id)
tie-break — once the collector is full, the accept test is exactly
'hamming < worst kept hamming'. Cache that bound in a register-friendly
u32 (u32::MAX while filling) and skip the heap peek + tuple compare on
the ~99.8% reject path. Bit-identical by construction; pinned by the
tie-heavy and duplicate-run oracle suites.
load_rankquant's forged-buffer defense histogrammed every packed code
serially — 1.29 billion shift/mask ops at 1.26M x 1024, ~1s of the
1.27s verified open. A 4KB per-byte bucket-count LUT replaces the
per-code inner loop and rows validate in parallel; find_first keeps
the lowest-offending-row error contract, with a scalar recheck
producing the identical message. The security property is unchanged:
every row still proves uniform composition before the index is
usable.
CHANGELOG Unreleased gains the measured perf work merged to
integration/full-stack: tiled streaming sign candidate generation +
cached collector worst-bound (bit-identical internals swap; downstream
batched search 220 -> 10.2k q/s at 1.26M x 1024), parallel finite
validation + scratch rank encode (0.097s -> 0.022s attribution), and
the LUT + parallel constant-composition load check (verified open
1.27s -> 0.38s). RELEASING gains a one-time pre-publish item: remove
OrdinalDB's [patch.crates-io] block pointing at integration/full-stack
when 0.6.0 publishes.
Expose --max-index-artifact-bytes on the ordvec-manifest CLI LimitArgs,
wiring it to ResourceLimits::max_index_artifact_bytes so the opt-in
primary-artifact read ceiling reaches feature parity with the existing
--max-auxiliary-artifact-bytes flag.

Close the deferred CIPHER-04 reason-code symmetry: validate_manifest_shape
now rejects a zero manifest-declared artifact.file_size_bytes
(artifact_file_size_zero) and validate_auxiliary_artifact_shape rejects
zero-size declarations on required auxiliary artifacts
(auxiliary_artifact_file_size_zero), mirroring the calibration and
encoder-distortion *_file_size_zero checks. Optional artifacts keep the
established zero-size absent-placeholder convention.
Security-review remediation (fleet CIPHER-001): with derived read
bounds, a FIFO inside the bundle directory would block verification
forever — File::open on a reader-less FIFO blocks, and a device node
would stream without EOF under a large declared size. Stat the path
before opening and refuse anything that is not a regular file, at
every hashing call site (create, verify, sqlite cache). Regular files
terminate at EOF and remain post-checked against the declaration.
Unix regression test uses a real FIFO.
Security-review note (fleet CIPHER-002): parallel finite validation
introduced in the encode train transitively touches the global rayon
pool from inside the 'serial' CSR primitive. The serial guarantee is
about candidate scan/selection ownership, not input validation; say
so explicitly.
Bot-review remediation (Qodo, PR #277): the streaming loop surfaced
ErrorKind::Interrupted as a hash failure, a reliability regression vs
the previous read_to_end which retried EINTR internally.
Bot-review remediation (Qodo, PR #278): the shared core derived nq by
integer division; a ragged buffer from a future caller would silently
truncate. All current callers validate upstream — this is the cheap
in-core invariant.
@chatgpt-codex-connector

Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@qodo-code-review

Copy link
Copy Markdown

PR Summary by Qodo

Add performance changelog entries and implement derived artifact size bounds

✨ Enhancement 🐞 Bug fix 🧪 Tests 📝 Documentation 🕐 40+ Minutes

Grey Divider

AI Description

• Document previously-unlogged performance work and add a one-time 0.6.0 release hygiene step.
• Make ordvec-manifest artifact read limits derived from declared sizes, with opt-in ceilings.
• Speed up hot paths: streamed batched candidate generation, parallel finite checks, LUT validation.
• Add contract-pinning and regression tests for streaming candgen and derived-size enforcement.
Diagram

graph TD
A["Public APIs"] --> B["SignBitmap candgen"] --> C["Streamed top-m core"]
A --> D["RankQuant"] --> E["Load/encode perf"]
A --> F["ordvec-manifest"] --> G["Bounded hashing/limits"]
H["Docs"] --> A
Loading
High-Level Assessment

The following are alternative approaches to this PR:

1. Keep flat default caps; only raise the 64 MiB defaults
  • ➕ Minimal behavioral change for deployments relying on implicit ceilings
  • ➕ Less surface area for new limit interactions (declared-size vs cap)
  • ➖ Still couples correctness to an arbitrary default; legitimate artifacts can regress again
  • ➖ Doesn't address unbounded primary-artifact hashing unless separately handled
2. Stream candgen but use per-query partial sort on chunked score buffers
  • ➕ Could retain O(n) average selection per query via select_nth-like approaches
  • ➕ May reduce heap overhead for large m
  • ➖ More complex to implement correctly with global (hamming, doc_id) ordering across blocks
  • ➖ Harder to make processing-order-independent without careful merging logic

Recommendation: The PR’s approach is the better long-term contract fit: (1) derived-size read bounds make manifests the trust anchor while still allowing opt-in ceilings for hostile-input deployments, and (2) streamed candidate generation removes pathological re-scans while preserving a strict, test-pinned ordering contract. The included oracle/edge-case suites materially de-risk the algorithmic changes, making this strategy preferable to simply bumping flat defaults or introducing more complex block-merge selection logic.

Files changed (13) +824 / -49

Enhancement (4) +156 / -17
quant.rsReuse per-worker rank scratch during RankQuant::add +10/-7

Reuse per-worker rank scratch during RankQuant::add

• Switches the parallel encode loop to 'for_each_init' and reuses a per-worker 'Vec<u16>' ranks buffer. Uses 'rank_transform_into' to avoid per-vector allocation churn inside the hot loop.

src/quant.rs

rank_io.rsSpeed up RankQuant load validation with LUT + parallel scan +34/-2

Speed up RankQuant load validation with LUT + parallel scan

• Adds a per-byte bucket-count LUT to validate constant-composition rows using table lookups instead of per-code shift/mask loops. Parallelizes the row validation with 'find_first' to preserve the 'lowest offending row' contract and re-runs the scalar histogram only for error reporting.

src/rank_io.rs

sign_bitmap.rsImplement streamed batched top-m candidate generation +101/-7

Implement streamed batched top-m candidate generation

• Introduces a new serial streamed core that scans the corpus once per call in doc blocks, scores query tiles, and keeps exact per-query top-m using bounded heaps keyed by (hamming, doc_id). Optimizes collector acceptance via cached worst-hamming bounds and updates batched CSR generation to route through the streamed core (with explicit empty-batch handling) while keeping single-query on the dense partition path for performance.

src/sign_bitmap.rs

util.rsParallelize assert_all_finite for large inputs +11/-1

Parallelize assert_all_finite for large inputs

• Adds a size threshold to scan large float slices in parallel using rayon chunks, reducing validation overhead for large ingest/search batches. Keeps the existing fast serial path for smaller inputs.

src/util.rs

Bug fix (2) +93 / -19
lib.rsDerive artifact read bounds and stream bounded hashing +72/-12

Derive artifact read bounds and stream bounded hashing

• Changes default ResourceLimits byte caps to 'u64::MAX' (unbounded) and introduces 'max_index_artifact_bytes' as an opt-in ceiling. Updates verification and SQLite-related hashing call sites to bound reads by 'min(declared_size, configured_ceiling)' and to fail fast with '*_file_too_large' codes. Reimplements 'sha256_file_bounded' as a streaming reader to keep memory constant and bounds create-time hashing by observed file length.

ordvec-manifest/src/lib.rs

sqlite.rsBound SQLite cache-key hashing like verify path +21/-7

Bound SQLite cache-key hashing like verify path

• Removes unbounded 'sha256_file' usage for artifact cache keys and switches to 'sha256_file_bounded' using declared size with the opt-in ceiling. Updates profile hashing helpers to use declared-size-derived bounds as well, ensuring cache behavior matches verification semantics.

ordvec-manifest/src/sqlite.rs

Tests (3) +473 / -9
derived_limits.rsAdd end-to-end tests for derived-size limits and opt-in ceilings +223/-0

Add end-to-end tests for derived-size limits and opt-in ceilings

• Introduces a new test module asserting that default options accept artifacts above the legacy 64 MiB cap, while verify fails fast when artifacts grow beyond declared size. Adds regression coverage for truncated/inflated declarations and confirms explicit flat caps (including the new index ceiling) remain enforceable.

ordvec-manifest/tests/derived_limits.rs

manifest.rsAdjust artifact corruption tests and add grown-profile fast-fail coverage +75/-9

Adjust artifact corruption tests and add grown-profile fast-fail coverage

• Updates a corruption test to flip bytes in-place (same size) so the declared-size bound passes and the digest mismatch is exercised. Updates the 'not a byte pin' test to expect 'artifact_file_too_large' when the artifact grows past its declaration. Adds a new regression test ensuring grown calibration/distortion profile files fail fast with the *_too_large codes under default limits.

ordvec-manifest/tests/manifest.rs

tiled_candgen.rsAdd oracle-based contract tests for candidate generation ordering +175/-0

Add oracle-based contract tests for candidate generation ordering

• Adds an independent oracle (score_all + full lexicographic sort by (hamming asc, doc_id asc)) to pin 'top_m_candidates' and 'top_m_candidates_batched_serial_csr' outputs exactly. Covers random corpora across block boundaries, tie-heavy and duplicate-run cases, empty batches, and high-dimension shapes (1024 and 768 tail residue).

tests/tiled_candgen.rs

Documentation (4) +102 / -4
CHANGELOG.mdAdd missing Unreleased performance and manifest-limit entries +75/-1

Add missing Unreleased performance and manifest-limit entries

• Replaces the empty Unreleased section with detailed Performance bullets describing streamed batched candidate generation, collector accept-test optimization, parallel finite validation + RankQuant scratch reuse, and LUT-based RankQuant load validation. Adds Changed entries documenting derived artifact size bounds, new opt-in index artifact ceiling, fail-fast reason-code behavior, and constant-memory bounded hashing.

CHANGELOG.md

RELEASING.mdAdd one-time 0.6.0 downstream un-patch checklist step +5/-0

Add one-time 0.6.0 downstream un-patch checklist step

• Documents that OrdinalDB must remove its '[patch.crates-io]' overrides for 'ordvec' and 'ordvec-manifest' upon the 0.6.0 publish. Clarifies this is a one-time release hygiene action so downstream consumes crates.io releases rather than a git branch.

RELEASING.md

THREAT_MODEL.mdDocument derived artifact read bounds threat +13/-0

Document derived artifact read bounds threat

• Adds THREAT-QUERY-003 describing that verification work scales with manifest-declared bytes even with constant-memory hashing. Explains the role of opt-in ResourceLimits ceilings and reiterates that VerifiedLoadPlan is not a byte pin under local write access.

THREAT_MODEL.md

README.mdUpdate limit contract docs for derived-size bounds +9/-3

Update limit contract docs for derived-size bounds

• Rewrites artifact byte-limit bullets to state that reads are bounded by manifest-declared 'file_size_bytes' on verify and observed size on create. Adds the primary index artifact bound and clarifies flat caps are opt-in ceilings with unbounded defaults.

ordvec-manifest/README.md

@codecov

codecov Bot commented Jul 4, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 97.88732% with 3 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/sign_bitmap.rs 98.09% 2 Missing ⚠️
src/rank_io.rs 95.83% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@qodo-code-review

qodo-code-review Bot commented Jul 4, 2026

Copy link
Copy Markdown

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0) 🎨 UX issues (0) 🔗 Cross-repo conflicts (0) 📜 Skill insights (0)

Grey Divider


Remediation recommended

1. Bounded hash reads overshoot ✓ Resolved 🐞 Bug ≡ Correctness
Description
sha256_file_bounded enforces max_bytes only after reading a full 64KiB chunk, so it may read up
to ~64KiB past the declared/configured bound (and reads 64KiB even when max_bytes=0). This breaks
the intended strict read-bound contract and causes unnecessary extra I/O under tight limits.
Code

ordvec-manifest/src/lib.rs[R3486-3506]

+    let mut file = File::open(path)?;
    let mut hasher = Sha256::new();
-    hasher.update(&bytes);
+    let mut size_bytes = 0u64;
+    let mut buf = [0u8; 64 * 1024];
+    loop {
+        let n = file.read(&mut buf)?;
+        if n == 0 {
+            break;
+        }
+        size_bytes += n as u64;
+        if size_bytes > max_bytes {
+            return Err(ManifestError::limit_exceeded(
+                code,
+                format!(
+                    "{context} exceeds {max_bytes} bytes while reading {}",
+                    path.display()
+                ),
+            ));
+        }
+        hasher.update(&buf[..n]);
+    }
Relevance

⭐⭐⭐ High

Team previously enforced strict bounded reads via take(max_bytes+1) in read_bounded_file; likely
expects same for hashing.

PR-#157

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
sha256_file_bounded reads fixed 64KiB chunks and only checks size_bytes > max_bytes after
incrementing, so the function can consume bytes past the limit before failing. The repo’s
read_bounded_file demonstrates the strict-bounding approach using take(max_bytes + 1),
highlighting the inconsistency with the new hashing implementation.

ordvec-manifest/src/lib.rs[3479-3511]
ordvec-manifest/src/lib.rs[138-176]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`sha256_file_bounded` can read beyond `max_bytes` by up to one buffer (64KiB) because it reads into a fixed buffer and only checks `size_bytes > max_bytes` after the read. This violates a strict "bounded read" expectation (especially noticeable when `max_bytes` is very small).

### Issue Context
The codebase already uses a strict-bounding pattern for `read_bounded_file` via `take(max_bytes + 1)`.

### Fix Focus Areas
- ordvec-manifest/src/lib.rs[3479-3511]

### Implementation notes
Adjust the loop so the reader never consumes more than `max_bytes + 1` total bytes, e.g.:
- wrap the file with `file.by_ref().take(max_bytes.saturating_add(1))` and stream from that, or
- compute `remaining = max_bytes.saturating_sub(size_bytes)` and only attempt to read `min(buf.len(), remaining as usize + 1)` each iteration.
Ensure the error is still returned with the same `LimitExceeded` code/message when the bound is exceeded.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. TOCTOU size/hash mismatch ✓ Resolved 🐞 Bug ≡ Correctness
Description
create_manifest_for_index_with_options probes index metadata (including file_size_bytes) and
then hashes the index using a separately-fetched fs::metadata(index_path)?.len() bound; if the
file size changes between those calls, the manifest can embed a file_size_bytes that does not
match the bytes that were actually hashed. This can produce a manifest that fails verification
immediately or behaves inconsistently under concurrent writers.
Code

ordvec-manifest/src/lib.rs[R3566-3569]

+    let index_hash = sha256_file_bounded(
+        index_path,
+        fs::metadata(index_path)?.len(),
+        "artifact_file_too_large",
Relevance

⭐⭐⭐ High

Similar TOCTOU-hardening was definitely accepted: reuse verification-time resolved paths/metadata to
avoid same-call races.

PR-#163

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The create path uses probe_index_metadata() (whose file_size_bytes is derived from
file.metadata().len()) but hashes with a fresh fs::metadata(...).len() bound and then writes
metadata.file_size_bytes into the manifest, enabling divergence if the file size changes between
the two observations.

ordvec-manifest/src/lib.rs[3565-3586]
src/rank_io.rs[368-384]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`create_manifest_for_index_with_options` observes the index file size twice: once indirectly via `probe_index_metadata()` (stored into `metadata.file_size_bytes` and later written into the manifest), and again via `fs::metadata(index_path)?.len()` to set the hash read bound. If the file changes between these observations, the created manifest can contain a `file_size_bytes` that doesn't correspond to the hashed content.

### Issue Context
This is a classic TOCTOU inconsistency during manifest creation. The create path should either:
- use one size snapshot consistently for both hashing and the manifest field, or
- explicitly fail if the size changes between probe and hash.

### Fix Focus Areas
- ordvec-manifest/src/lib.rs[3565-3586]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Informational

3. Per-query bitmap allocations ✓ Resolved 🐞 Bug ➹ Performance
Description
SignBitmap::top_m_candidates_streamed builds q_bitmaps by calling build_query_bitmap per
query, which allocates a fresh Vec<u64> each time and re-validates finiteness, even though the
batch entry point already validates the whole query buffer. This adds avoidable allocation/CPU
overhead in the new hot-path implementation.
Code

src/sign_bitmap.rs[R241-245]

+        let mut q_bitmaps = vec![0u64; nq * qpv];
+        for qi in 0..nq {
+            let qb = self.build_query_bitmap(&queries[qi * dim..(qi + 1) * dim]);
+            q_bitmaps[qi * qpv..(qi + 1) * qpv].copy_from_slice(&qb);
+        }
Relevance

⭐⭐⭐ High

Repo has repeatedly accepted perf fixes removing avoidable allocations/copies in hot paths; similar
guidance accepted for SignBitmap/scan code.

PR-#96
PR-#208

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The streamed path allocates per query by calling build_query_bitmap inside a loop and then copies
into q_bitmaps. build_query_bitmap itself allocates a new Vec<u64> and runs
assert_all_finite, while top_m_candidates_batched_serial_csr already validates the full
queries slice before calling the streamed implementation.

src/sign_bitmap.rs[202-213]
src/sign_bitmap.rs[235-245]
src/sign_bitmap.rs[424-447]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The streamed batched candidate generator constructs query bitmaps via `build_query_bitmap` for each query. `build_query_bitmap` allocates a `Vec<u64>` and also calls `assert_all_finite`, which means the streamed path incurs per-query allocations and additional passes over query memory.

### Issue Context
`top_m_candidates_batched_serial_csr` already asserts `assert_all_finite(queries)` and has a preallocated `q_bitmaps` buffer sized `nq * qpv`.

### Fix Focus Areas
- src/sign_bitmap.rs[202-245]
- src/sign_bitmap.rs[424-447]

### Implementation notes
Introduce a private helper that fills an existing `&mut [u64]` (or `&mut Vec<u64>`) for a single query without allocating, e.g. `build_query_bitmap_into(q: &[f32], out: &mut [u64])`.
- In `top_m_candidates_streamed`, fill `q_bitmaps[qi*qpv..(qi+1)*qpv]` directly.
- Keep `build_query_bitmap` as the public allocating API, but avoid calling it from the streamed inner loop.
- Since the batch entry point already validated `queries`, avoid re-running `assert_all_finite` per query in the streamed path.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. Artifact zero-size unvalidated ✓ Resolved 🐞 Bug ⚙ Maintainability
Description
Verification now bounds primary-artifact hashing by manifest.artifact.file_size_bytes, but
validate_manifest_shape does not reject artifact.file_size_bytes == 0 (unlike
calibration/distortion profile artifacts). For malformed manifests that declare file_size_bytes=0
for a non-empty index, this shifts the failure into artifact_file_too_large instead of a clearer
schema/shape error, making reports less consistent and harder to interpret.
Code

ordvec-manifest/src/lib.rs[R264-273]

+        match sha256_file_bounded(
+            &resolved.canonical_path,
+            document
+                .manifest
+                .artifact
+                .file_size_bytes
+                .min(options.limits.max_index_artifact_bytes),
+            "artifact_file_too_large",
+            "index artifact",
+        ) {
Relevance

⭐⭐ Medium

Team accepts tightening verifier contracts, but no prior review explicitly required rejecting
artifact file_size_bytes==0.

PR-#157
PR-#152

ⓘ Recommendations generated based on similar findings in past PRs

Evidence
The verification path now uses artifact.file_size_bytes as the hash bound, but the manifest shape
validator checks many artifact fields and does not validate artifact.file_size_bytes. In contrast,
both profile artifact validators explicitly error when file_size_bytes == 0, showing an
inconsistency in schema enforcement.

ordvec-manifest/src/lib.rs[258-301]
ordvec-manifest/src/lib.rs[328-387]
ordvec-manifest/src/lib.rs[1200-1217]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Primary artifact `file_size_bytes` is now used as the effective hash read bound, but the manifest shape validator does not assert it is > 0. This creates an inconsistency with profile artifacts (which explicitly reject zero) and can lead to confusing error codes for malformed manifests.

### Issue Context
A zero-sized primary index artifact is not a meaningful/valid index file in practice (it has at least a header), so treating `file_size_bytes == 0` as a schema/shape error keeps reporting consistent and clearer.

### Fix Focus Areas
- ordvec-manifest/src/lib.rs[328-387]
- ordvec-manifest/src/lib.rs[258-301]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread ordvec-manifest/src/lib.rs
Comment thread ordvec-manifest/src/lib.rs
The release-publish SBOM invariant requires member package versions in
lockstep with the root; ordvec-ffi and both python bindings follow the
0.6.0 bump.
Closes the remaining release-publish invariant layers, verified by
running tests/release_publish_invariants.py locally to a clean exit:
pyproject + __init__ versions in lockstep, the changelog cut as a
dated 0.6.0 section (invariant convention: the current version always
has a dated section; [Unreleased] stays empty), THREAT_MODEL status
line at v0.6.0, and the README quickstart installing 0.6.
Bot-review remediation (Qodo, PR #278): nq * m_eff can overflow usize
on 32-bit/wasm32 targets, and the CSR wrapper's saturating_mul would
attempt a usize::MAX allocation. Both sites now use checked
multiplication with a clear tile-the-batch message, matching the
crate's checked-allocation discipline. The exact m_eff + 1 heap
reservation is kept deliberately: gradual growth double-allocates to
the next power of two (~2x peak per query) — the reservation is the
memory-optimal choice, now documented.
…d 'origin/feat/index-ceiling-cli-parity' into docs/release-hygiene
…imary shape check

Bot-review remediation (Qodo, #283 inline):
- create_manifest_for_index_with_options observed the index size twice
  (probe, then a separate stat for the hash bound) — a concurrent
  writer could produce a manifest whose size and digest describe
  different bytes. The hash is now bounded by the probe's size, the
  manifest records the byte count actually hashed, and any
  disagreement fails loudly.
- sha256_file_bounded could read (not hash) up to one 64KiB chunk past
  the bound; reads now clamp to max_bytes + 1, mirroring
  read_bounded_file's take() pattern.
- validate_manifest_shape gains artifact_file_size_zero for the
  primary artifact, matching the profile artifacts' explicit zero
  rejection instead of surfacing a confusing artifact_file_too_large.
Bot-review remediation (Qodo, #283 inline): build_query_bitmap
allocated a fresh Vec and re-validated finiteness per query; the
entry points already validate the whole buffer and the destination
is preallocated. Oracle suites pin bit-identical output.
Bot-review remediation (Qodo, #282): --max-index-artifact-bytes wired
into ResourceLimits but the create path bounded the primary hash by
the probed size alone — the opt-in ceiling was ineffective for
create, unlike auxiliary artifacts. Create now mirrors verify:
declared/observed size min explicit ceiling.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants