vernier


Fast, parity-preserving evaluation for object detection, instance / panoptic / semantic segmentation, boundary IoU, OKS keypoints, and LVIS federated evaluation. Rust core, Python frontend, optional CLI.

pycocotools==2.0.11 is the de-facto reference for COCO evaluation — slow, unmaintained, and full of edge-case quirks. Faster reimplementations exist, but each silently fixes some quirks and not others, so you discover the divergences empirically. vernier takes a third path:

  • Auditable parity. Every divergence from pycocotools is filed in the quirks survey under ADR-0002 as either strict (bit-equal output, even when vernier's implementation is structurally different) or corrected (opt-in opinionated fix). Strict is the default; corrected fixes are itemized so you always know when your numbers diverge from a reference run. A drop-in shim (vernier.patch_pycocotools()) keeps existing pycocotools-based scripts working with one line.
  • One toolkit instead of five. bbox / segm / boundary / keypoints AP, panoptic PQ, semantic mIoU, and LVIS federated all live behind one Python API and one CLI. Per-paradigm migration guides under docs/migrate/ show how to replace pycocotools, faster-coco-eval, panopticapi, lvis-api, and mmsegmentation one at a time.
  • Rust core, Python frontend. The matching kernel is pure Rust with runtime SIMD dispatch; the FFI layer is data conversion only. The CLI ships as a static binary, so CI pipelines call vernier without provisioning a Python interpreter.
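The drop-in shim is ordinary Python monkey-patching: a patch function rebinds the hot method on the legacy class so existing call sites transparently hit the fast path. The sketch below illustrates only the pattern; the classes and names in it are hypothetical stand-ins, not vernier's or pycocotools' real internals.

```python
# Illustrative sketch of the monkey-patching pattern behind a
# patch_pycocotools()-style shim. SlowEvaluator is a hypothetical
# stand-in for a legacy evaluator class (e.g. COCOeval).

class SlowEvaluator:
    """Legacy evaluator whose hot method we want to replace."""
    def evaluate(self):
        return "slow-path"

def fast_evaluate(self):
    # A faster backend with the same signature and return shape.
    return "fast-path"

def patch_slow_evaluator():
    """Rebind the hot method; existing scripts pick up the fast path unchanged."""
    SlowEvaluator.evaluate = fast_evaluate

patch_slow_evaluator()
print(SlowEvaluator().evaluate())  # → fast-path
```

The real shim works on the same principle: one call at import time, and every downstream `COCOeval`-shaped invocation routes into the Rust kernel.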

Status & validation

Pre-1.0; public API is unstable. See docs/adr/ for the design decisions shaping it. Per-paradigm parity status:

| Paradigm / metric | Oracle | Parity tier | Open caveat |
|---|---|---|---|
| Instance bbox / segm / keypoints AP | pycocotools==2.0.11 | strict bit-equal | none |
| Instance boundary IoU | boundary-iou-api | strict bit-equal | none |
| Segm + boundary TIDE thresholds (t_b) | none yet | corrected-only | ADR-0022 still proposed; defaults extrapolated, not measured |
| Panoptic PQ | panopticapi (single-core path) | strict bit-equal | `boundary=True` raises NotImplementedError (ADR-0025 §Q3) |
| Semantic mIoU / FWIoU / pAcc / mAcc | mmseg.IoUMetric vendored at v1.2.2 (ADR-0036, still proposed); cityscapesScripts + ADE20K cross-impl bench externally blocked | strict bit-equal on the four per-class u64 marginals at val2017 scale | ADR-0028; ADE20K-scale bench gated on license-cleared cache |
| LVIS federated AP | lvis-api (vendored at 031ac21f, ORACLE_LVIS_COMMIT_SHA) | strict bit-equal on the (T, R, K, A) precision tensor at full LVIS v1 val | bench paradigm wired; segm cell waits on evaluate_segm_grid_with_dataset |

Three-tier parity model: ADR-0002; per-library comparison: docs/comparison.md.
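For context on what "bit-equal" has to reproduce: COCO-style AP averages an interpolated precision over 101 fixed recall thresholds (0.00 to 1.00 in steps of 0.01), after taking the monotone precision envelope. A minimal standalone sketch of that interpolation step, on toy numbers (this is not vernier code):

```python
def coco_style_ap(recalls, precisions):
    """Average interpolated precision over 101 recall thresholds (0.00..1.00).

    `recalls` must be sorted ascending; the interpolated precision at
    recall r is the max precision at any recall >= r (the envelope).
    """
    # Build the monotone precision envelope from the right.
    envelope = precisions[:]
    for i in range(len(envelope) - 2, -1, -1):
        envelope[i] = max(envelope[i], envelope[i + 1])

    thresholds = [t / 100 for t in range(101)]
    total = 0.0
    for t in thresholds:
        # First operating point whose recall reaches the threshold; 0 past the curve.
        total += next((envelope[i] for i, r in enumerate(recalls) if r >= t), 0.0)
    return total / len(thresholds)

# Toy PR curve: perfect precision up to recall 0.5, then nothing.
print(coco_style_ap([0.25, 0.5], [1.0, 1.0]))  # 51 of 101 thresholds covered
```

Small floating-point and tie-breaking choices in exactly this kind of loop are where reimplementations silently diverge, which is why the strict tier pins bit-equality rather than "close enough".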

Benchmarks

| Workload | vernier median | Speedup vs alternatives |
|---|---|---|
| Instance — bbox AP (val2017) | 360 ms | 5.9× faster-coco-eval · 16.2× pycocotools |
| Instance — segm AP (val2017) | 968 ms | 3.7× faster-coco-eval · 7.1× pycocotools |
| Instance — boundary AP (val2017) | 3.1 s | 5.7× faster-coco-eval · 19.9× boundary-iou-api |
| Instance — keypoints AP (val2017, OKS) | 136 ms | 12.5× faster-coco-eval · 17.1× pycocotools |
| Panoptic — PQ (val2017) | 11.6 s | 3.04× panopticapi |
| Semantic — mIoU (val2017) | 5.1 s | 4.2× mmsegmentation |
| Instance — LVIS bbox AP (v1 val, perfect-DT) | 3.7 s | 56.9× lvis-api · 10× lower peak RSS (1.49 GiB vs 15.01 GiB) |

Median total-stage wall time on a KVM VPS (AMD EPYC-Milan, 4 cores × 2 threads = 8 logical CPUs, x86_64 — not a bare-metal Milan box), harness mode release (N=10 measurement reps + 2 warmup, randomised impl order, 5% relative-IQR gate per impl), build profile = cargo release defaults (opt-level=3, lto=thin, codegen-units=1, no target-cpu) — same as the PyPI wheel. Full per-cell breakdown (including IQRs), RSS, and methodology in docs/benchmarks.md; per-library comparison of when to pick which in docs/comparison.md.
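The 5% relative-IQR gate is a simple stability check: the interquartile range of the measured reps, divided by their median, must stay under 0.05 before a cell is accepted. A stdlib sketch of that gate (the harness's actual code is not shown here, and the sample numbers are invented):

```python
import statistics

def relative_iqr_gate(samples_ms, threshold=0.05):
    """Return (median, rel_iqr, passed): IQR / median must stay under threshold."""
    q1, q2, q3 = statistics.quantiles(samples_ms, n=4)  # quartile cut points
    rel_iqr = (q3 - q1) / q2
    return q2, rel_iqr, rel_iqr < threshold

# Ten hypothetical wall-time reps (ms) for one implementation:
reps = [361, 358, 360, 362, 359, 360, 361, 357, 363, 360]
median, rel_iqr, ok = relative_iqr_gate(reps)
print(median, round(rel_iqr, 4), ok)
```

A gate like this rejects cells whose reps are too noisy to summarize with a single median, which matters on shared KVM hosts where neighbor load can skew individual runs.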

Baselines pinned for these numbers: pycocotools==2.0.11, faster-coco-eval==1.7.2, panopticapi @ 7bb4655, boundary-iou-api @ 37d2558, mmsegmentation @ c685fe6 (vendored), lvis-api @ 031ac21 (PyPI lvis==0.5.3). COCO and panoptic / semantic numbers were measured at HEAD 1fd5720bf56c; the LVIS row was added at HEAD e9d9c4d71303 after the bench paradigm landed. Each baseline is locked in its own uv-managed venv per ADR-0017.

Install

```sh
pip install vernier                  # Python wheel
cargo add vernier-core               # Rust library
cargo install vernier-cli            # `vernier` CLI binary
```

Wheels ship for Linux x86_64 / aarch64 (glibc + musl), macOS x86_64 / arm64, and Windows x64. The umbrella vernier crate name on crates.io is held as a 0.0.0 placeholder; vernier-core is the real Rust entry point — see docs/engineering/registry-reservations.md.

60-second example

One-shot — predictions already serialized to JSON (end-of-epoch checkpoint, CI gate, post-training inspection):

```python
from pathlib import Path
from vernier.instance import Bbox, CocoDataset, Evaluator

gt_bytes = Path("instances_val2017.json").read_bytes()
dt_bytes = Path("detections.json").read_bytes()

dataset = CocoDataset.from_json(gt_bytes)
summary = Evaluator(iou=Bbox()).evaluate(dataset, dt_bytes)
for line in summary.pretty_lines():
    print(line)
```

In a training loop — overlap eval with the next training step. The matching kernel runs on a worker thread, so submit(...) returns immediately and the training thread keeps moving. Passing a CocoDataset reuses the parsed-once GT and its per-kernel derivation cache across every epoch (ADR-0020):

```python
import json
from pathlib import Path
from vernier.instance import Bbox, CocoDataset, Evaluator

gt = CocoDataset.from_json(Path("instances_val2017.json").read_bytes())
evaluator = Evaluator(iou=Bbox())
with evaluator.background(gt) as bg:
    for images, _ in val_loader:
        detections = model(images)  # list[{image_id, category_id, bbox, score}]
        bg.submit(json.dumps(detections).encode())
    summary = bg.finalize()
print("AP =", summary.stats[0])
```

Both paths end in the same 12-line, pycocotools-shaped Summary; docs/tutorials/first-evaluation.md walks through each end-to-end.
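Stripped of vernier specifics, the training-loop overlap is a plain producer/consumer hand-off: submit() queues work to a worker thread and returns immediately, and finalize() drains the queue and joins. A toy stdlib sketch of that shape (the class and its "work" below are invented for illustration, not vernier's API):

```python
from concurrent.futures import ThreadPoolExecutor

class BackgroundAccumulator:
    """Toy stand-in for a background evaluation session: submit() returns
    immediately, one worker thread does the work, finalize() joins."""

    def __init__(self):
        self._pool = ThreadPoolExecutor(max_workers=1)  # one worker keeps order
        self._futures = []

    def submit(self, payload):
        # Hypothetical "work": just count the items in the payload.
        self._futures.append(self._pool.submit(len, payload))

    def finalize(self):
        total = sum(f.result() for f in self._futures)  # blocks until all done
        self._pool.shutdown()
        return total

bg = BackgroundAccumulator()
for batch in ([1, 2, 3], [4, 5], [6]):
    bg.submit(batch)   # returns immediately; the worker counts in the background
print(bg.finalize())   # → 6
```

The payoff is the same in both the toy and the real evaluator: the submitting thread (here the training loop) never waits on the matching work, only on the final join.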

Three evaluation paradigms

Pick the submodule whose input shape matches your model's output — they have different data models, different matching rules, and different parity oracles:

  • vernier.instance — detections with scores → bbox / segm / boundary / keypoints AP.
  • vernier.panoptic — RGB-encoded panoptic PNGs + segments_info JSON → PQ.
  • vernier.semantic — single-channel class-id label maps → mIoU / FWIoU / pAcc / mAcc.

See Three paradigms for when to use which.
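The four semantic metrics share one input: a per-class confusion matrix over pixels. Under the standard definitions (which is all this sketch assumes — it is not vernier's kernel), they fall out of the matrix's diagonal, row sums, and column sums:

```python
def semantic_metrics(confusion):
    """mIoU / FWIoU / pAcc / mAcc from a KxK confusion matrix,
    where confusion[gt][pred] counts pixels (standard definitions;
    assumes every class appears at least once in the ground truth)."""
    k = len(confusion)
    gt_sums = [sum(confusion[c]) for c in range(k)]                       # GT pixels per class
    pred_sums = [sum(confusion[r][c] for r in range(k)) for c in range(k)]  # predicted pixels per class
    diag = [confusion[c][c] for c in range(k)]                            # correctly labeled pixels
    total = sum(gt_sums)

    ious = [diag[c] / (gt_sums[c] + pred_sums[c] - diag[c]) for c in range(k)]
    miou = sum(ious) / k                                    # mean IoU over classes
    fwiou = sum(gt_sums[c] / total * ious[c] for c in range(k))  # frequency-weighted IoU
    pacc = sum(diag) / total                                # overall pixel accuracy
    macc = sum(diag[c] / gt_sums[c] for c in range(k)) / k  # mean per-class accuracy
    return miou, fwiou, pacc, macc

# Two classes, 100 pixels total:
cm = [[40, 10],   # class 0: 40 correct, 10 mislabeled as class 1
      [5, 45]]    # class 1: 45 correct, 5 mislabeled as class 0
print(semantic_metrics(cm))
```

The "four per-class u64 marginals" in the parity table above are exactly these ingredients, which is why bit-equality on them pins every downstream metric at once.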

Documentation

Contributing

Local checks: just lint && just test && just audit. The full contributor workflow (ADR lifecycle, vendoring policy, code style) is in CONTRIBUTING.md. Repository layout and common just recipes are in CLAUDE.md.

License

Dual-licensed under Apache-2.0 or MIT at your option.

Third-party code

vernier vendors a small number of test-only reference implementations to support parity testing. None of this code is included in published wheels or linked into the Rust binary. See THIRD_PARTY_NOTICES.md for the full inventory and license attributions.
