0.1.0 — 2026-05-19
Release Notes
First release out of the 0.0.x line. Mostly a performance + parallelism
follow-up to 0.0.4: no new evaluation paradigms, and every shipped kernel
keeps strict bit-equal parity with its oracle. The cross-paradigm
benchmark page is refreshed against the post-0.0.4 SHA on the same
machine fingerprint as the 0.0.4 snapshot (37652a58e939).
Added
- num_threads parallelism (ADR-0047)
(#251, #253, #254, #256) — opt-in num_threads: int | None = None
on every public evaluate surface across all four paradigms: instance
(bbox / segm / boundary / keypoints), semantic, panoptic, and LVIS,
on batch + streaming + background entry points (Evaluator.evaluate,
Evaluator.background, submit / submit_png). The sequential
path (num_threads=None or 1) is byte-for-byte unchanged from
0.0.4; no rayon symbol is entered. parity_threads parity tests
assert bit-equal results across num_threads ∈ {None, 1, 2, 4, 8}
on every paradigm. CLI gains vernier eval --threads N.
- bench-timings Cargo feature (#256) — atomic (par_iter, serial_post) split + build_*_anns call counter on
evaluate_with_parallel, attributed via the new BenchCounterSet
shared helper (#258). Off by default and stripped from the shipped
wheel; powers the bbox-scaling attribution at
docs/engineering/benchmarking/2026-05-bbox-cdf.md.
- mimalloc-global Cargo feature on vernier-ffi (#256) —
allocator A/B knob, off by default; lets users opt into mimalloc
for hot-allocation workloads without it being a default cost.
- Semantic divan microbench (#261) —
crates/vernier-semantic/benches/accumulate_confusion.rs
exercises three input distributions (realistic_perfect,
realistic_jittered, uniform_random) at the val2017
panoptic-semantic geometry; prereq for the chunked-u8 kernel work.
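The bit-equal guarantee across thread counts is plausible whenever every partial result is an exact integer count, because integer addition gives the same total regardless of how the input is chunked or merged. The sketch below illustrates that property in plain Python and is not vernier's implementation: `confusion_counts` and `evaluate_sketch` are hypothetical names, and a standard-library thread pool stands in for the rayon-backed path.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def confusion_counts(pairs):
    """Count (ground_truth, prediction) label pairs for one chunk."""
    return Counter(pairs)


def evaluate_sketch(pairs, num_threads=None):
    """Hypothetical stand-in for an evaluate call with opt-in threading."""
    if num_threads is None or num_threads == 1:
        # Sequential path: no thread pool is created at all.
        return confusion_counts(pairs)
    # Split the input into contiguous chunks, roughly one per worker.
    size = max(1, -(-len(pairs) // num_threads))  # ceiling division
    chunks = [pairs[i:i + size] for i in range(0, len(pairs), size)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # Integer counts merge exactly, so chunking cannot change the result.
        for partial in pool.map(confusion_counts, chunks):
            total.update(partial)
    return total


# Parity check in the spirit of the thread-count sweep described above.
pairs = [(i % 5, (i * 7) % 5) for i in range(10_000)]
reference = evaluate_sketch(pairs, None)
for n in (None, 1, 2, 4, 8):
    assert evaluate_sketch(pairs, n) == reference
```

A floating-point accumulator would not give this guarantee for free, since float addition is not associative; that is one reason an integer fold is a natural fit for parity testing.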
Changed
- bbox AP perf (#256, #258, #259) — KernelScratch per-worker
annotation pool + direct-write parallel runner (replaces the
per-image Vec<CellOutput> intermediate with par_chunks_mut);
in-place image-major → canonical transpose via cycle-following
(eliminates a 26 MB intermediate buffer pair on val2017); the
eval_imgs + eval_imgs_meta transposes fuse into a single
cycle walk (halves index arithmetic, drops one of two 1.6 MB
visited-bitset allocations). Net val2017 nt=4: par_iter region
42 → 32 ms, serial_post 45 → 19 ms, peak working-set
−24 MB. The remaining Amdahl floor on --num-threads for bbox is
the ~200 ms single-threaded dataset_build (HashMap validation in
CocoDataset::from_parts), attributed via bench-timings.
- Panoptic PQ perf (#260) — sparse-remap adjacent-pixel cache on
build_dense_intersections and build_dense_boundary_intersections.
COCO panoptic always hits the sparse branch (RGB-packed ids exceed
the 1 M dense cap) and panoptic segments are spatially contiguous,
so consecutive (g, d) pairs are usually identical; a 4-state
(last_g, last_d, last_gi, last_di) cache skips the FxHashMap
lookup on adjacent-pixel matches. Dense branch is deliberately
uncached (Vec::get is cheap enough that the miss overhead
regresses synthetic by ~70%). SSSE3 RGB→u32 pack on the panoptic
PNG decode path. New coco_like_rgb microbench arm exercises the
sparse-RGB path that the existing coco_like arms missed
(their ids 1..=50 took the dense path).
- Semantic mIoU perf (#261) — decode buffer pool + chunked u8
kernel on accumulate_confusion for the T = u8 PNG fused-decode
path that drives Semantic — mIoU (val2017). The pool reuses the
per-image decode Vec<u8> across submissions; the chunked kernel
keeps the strict-mode u64-additive fold but processes pixels in
cache-line-sized batches.
- Background-evaluator threading wired (#253, #254) —
BackgroundConfig.num_threads is no longer hardcoded None on the
panoptic and semantic FFI ctors; BackgroundCapable gains a
default-method apply_update_parallel that the panoptic and
semantic streaming impls override. Panoptic submit_png defers
PNG decode into the worker pool (PyBackedBytes zero-copy) so
libpng decode parallelises across submissions; the single-threaded
path keeps inline decode and is byte-for-byte unchanged.
- vernier-pixel-pack folded into vernier-panoptic — the
SSSE3 RGB→u32 pack primitive added in #260 lived briefly as a
standalone workspace crate. With a single consumer
(vernier-panoptic::decode) and 172 LOC, it sat below the
leaf-crate threshold and the audited-unsafe carveout fits cleanly
inside the host crate (#![deny(unsafe_code)] at root, module-local
#[allow(unsafe_code)] on the SSSE3 pshufb fn). Folding it
back keeps the published crate set at the six 0.0.4 crates and
avoids the registry-reservations + Trusted-Publisher loop in the
release runbook for a non-reusable internal SIMD primitive.
- Bench harness --num-threads (#251, #252) — bench run --num-threads "1,2,4,8" overrides the workload's pinned
num_threads tuple; panoptic + semantic spawn helpers now forward
the flag (previously dropped, so every panoptic / semantic cell
ran with args.num_threads = None regardless of what the CLI
swept).
- Bench page refreshed against
3a509df6c525 on the same
37652a58e939 fingerprint as the 0.0.4 snapshot, so the speedup
deltas are not confounded by host change. Per-cell movements
(vernier median, 0.0.4 → HEAD):
  - panoptic PQ: 12.59 s → 10.53 s (−16.4%; speedup
    2.73× → 3.30× vs panopticapi). IQR also narrows from 21.22%
    to 9.78% (still over the 5% gate — PNG decode is chronically
    noisy on this host).
  - semantic mIoU val2017: 5.00 s → 2.82 s (−43.6%;
    speedup 4.12× → 7.40× vs mmsegmentation).
  - instance bbox / segm / boundary / keypoints / synth-semantic /
    LVIS move within VPS noise of their 0.0.4 numbers; speedups
    widen by 0.1×–0.5× as baselines drift slightly slower on this
    run.
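The in-place transpose mentioned in the bbox entry can be pictured with a small sketch. The version below is an illustration in plain Python, not the vernier kernel: `transpose_in_place` is a hypothetical name, a `bytearray` stands in for the visited bitset, and passing two buffers mimics the fused eval_imgs + eval_imgs_meta walk, where a single visited set and a single index computation serve both arrays.

```python
def transpose_in_place(buffers, rows, cols):
    """Transpose flat row-major rows x cols buffers in place, in one walk.

    Every buffer in `buffers` is permuted identically, so the index
    arithmetic and the visited flags are shared between them.
    """
    n = rows * cols
    assert all(len(buf) == n for buf in buffers)
    visited = bytearray(n)  # one flag per slot; a bitset would be more compact
    for start in range(n):
        if visited[start]:
            continue
        # Carry the values that sit at `start` around their permutation cycle.
        carried = [buf[start] for buf in buffers]
        src = start
        while True:
            visited[src] = 1
            r, c = divmod(src, cols)
            dst = c * rows + r  # slot of element (r, c) in the cols x rows layout
            for k, buf in enumerate(buffers):
                # Drop the carried value and pick up the one it displaces.
                buf[dst], carried[k] = carried[k], buf[dst]
            src = dst
            if src == start:
                break


scores = [1, 2, 3, 4, 5, 6]            # 2 x 3, row-major
meta = ["a", "b", "c", "d", "e", "f"]  # parallel array, same layout
transpose_in_place([scores, meta], rows=2, cols=3)
assert scores == [1, 4, 2, 5, 3, 6]    # now 3 x 2
assert meta == ["a", "d", "b", "e", "c", "f"]
```

Each slot is written once per buffer, so the only extra storage is the visited flags rather than a second full-size copy of the data.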
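The adjacent-pixel cache in the panoptic entry relies on neighbouring pixels usually carrying the same (g, d) id pair. A minimal sketch of the idea, assuming plain dicts in place of FxHashMap and a hypothetical function name `build_intersections_sketch`; the ids in the example are arbitrary values chosen to look like RGB-packed segment ids:

```python
def build_intersections_sketch(gt_ids, dt_ids, gt_index, dt_index):
    """Count pixels per (gt segment, dt segment) pair with a one-entry cache.

    gt_index / dt_index map sparse segment ids to dense row / column indices.
    Returns the count matrix and the number of cache misses (hash lookups).
    """
    counts = [[0] * len(dt_index) for _ in gt_index]
    last_g = last_d = None   # ids seen at the previous pixel
    last_gi = last_di = -1   # their remapped dense indices
    misses = 0
    for g, d in zip(gt_ids, dt_ids):
        if g != last_g or d != last_d:
            # Cache miss: remap the sparse ids through the hash maps.
            last_gi, last_di = gt_index[g], dt_index[d]
            last_g, last_d = g, d
            misses += 1
        # Cache hit or freshly filled cache: no further lookup is needed.
        counts[last_gi][last_di] += 1
    return counts, misses


gt_index = {0x0A0B0C: 0, 0x112233: 1}  # sparse id -> dense row
dt_index = {0x0A0B0D: 0, 0x445566: 1}  # sparse id -> dense column
gt = [0x0A0B0C] * 4 + [0x112233] * 4
dt = [0x0A0B0D] * 6 + [0x445566] * 2
counts, misses = build_intersections_sketch(gt, dt, gt_index, dt_index)
assert counts == [[4, 0], [2, 2]]
assert misses == 3  # eight pixels, but only three id-pair changes
```

With contiguous segments the miss count tracks the number of segment boundaries crossed rather than the number of pixels, which is where the saving comes from.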
Fixed
- bench run --impl all on non-instance paradigms — impls_for_iou
raised KeyError for the paradigm-specific impls
(vernier_panoptic, panopticapi, mmsegmentation,
vernier_lvis, lvis-api) that #252 widened ALL_IMPLS to
include. Falls back to an empty IoU set for impls that aren't
registered for the instance paradigm.
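The shape of this fix can be sketched as a lookup with an empty default. The snippet below is an assumption about the pattern, not the harness code: the registry contents, the `vernier` impl key, and the function name `impls_for_iou_sketch` with its signature are all hypothetical.

```python
# Hypothetical registry: only instance-paradigm impls declare IoU types.
INSTANCE_IOU_TYPES = {
    "vernier": frozenset({"bbox", "segm", "boundary", "keypoints"}),
}


def impls_for_iou_sketch(iou_type, all_impls):
    """Return the impls registered for iou_type, skipping unregistered ones."""
    return [
        impl
        for impl in all_impls
        # .get with an empty default avoids the KeyError that plain
        # indexing would raise for paradigm-specific impls.
        if iou_type in INSTANCE_IOU_TYPES.get(impl, frozenset())
    ]


all_impls = ["vernier", "vernier_panoptic", "panopticapi", "mmsegmentation"]
assert impls_for_iou_sketch("bbox", all_impls) == ["vernier"]
```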
Install vernier-cli 0.1.0
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/NoeFontana/vernier/releases/download/v0.1.0/vernier-cli-installer.sh | sh
Install prebuilt binaries via powershell script
powershell -ExecutionPolicy Bypass -c "irm https://github.com/NoeFontana/vernier/releases/download/v0.1.0/vernier-cli-installer.ps1 | iex"
Download vernier-cli 0.1.0
| File | Platform | Checksum |
|---|---|---|
| vernier-cli-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
| vernier-cli-x86_64-pc-windows-msvc.zip | x64 Windows | checksum |
| vernier-cli-aarch64-unknown-linux-gnu.tar.xz | ARM64 Linux | checksum |
| vernier-cli-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |