BenchCert

BenchCert is a Python package, CLI, and web app for deployment-complete benchmarking.

Benchmarks often report scores, but deployment decisions may depend on responses not captured by the benchmark. BenchCert checks whether a deployment action is determined by the available benchmark evidence.

Given:

benchmark evidence E
a deployment action D
optional candidate probes U

BenchCert reports:

certifiable fraction
ambiguous fraction
mixed evidence fibers
benchmark-only decision risk
completion curve
recommended next probe

Use BenchCert to check whether a benchmark score supports a deployment claim, identify ambiguous cases, and choose the next measurement that would make the claim certifiable.
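To make the reported quantities concrete, here is a minimal sketch of one plausible reading of them, written in plain Python so it runs without BenchCert installed. It is not BenchCert's implementation. It assumes a fiber is the set of rows sharing the same evidence values, that a row is certifiable when every row in its fiber has the same action, and that a fiber is mixed otherwise. The function name fiber_audit and the toy rows are hypothetical.

```python
from collections import defaultdict

# Toy rows; column names mirror the quickstart, the values are made up.
rows = [
    {"score_bin": "high", "predicted_label": "ok", "deployment_action": "ship"},
    {"score_bin": "high", "predicted_label": "ok", "deployment_action": "ship"},
    {"score_bin": "high", "predicted_label": "ok", "deployment_action": "hold"},
    {"score_bin": "low", "predicted_label": "ok", "deployment_action": "hold"},
    {"score_bin": "low", "predicted_label": "ok", "deployment_action": "hold"},
    {"score_bin": "mid", "predicted_label": "ok", "deployment_action": "ship"},
]


def fiber_audit(rows, evidence_columns, action_column):
    """Group rows into fibers by evidence values and measure action agreement.

    Assumed definitions (not taken from BenchCert's source): a fiber is mixed
    when it contains more than one distinct action, and every row in a mixed
    fiber counts as ambiguous.
    """
    fibers = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in evidence_columns)
        fibers[key].append(row[action_column])

    mixed = {k: v for k, v in fibers.items() if len(set(v)) > 1}
    ambiguous_rows = sum(len(v) for v in mixed.values())
    total = len(rows)
    return {
        "certifiable_fraction": (total - ambiguous_rows) / total,
        "ambiguous_fraction": ambiguous_rows / total,
        "mixed_fibers": sorted(mixed),
    }


result = fiber_audit(rows, ["score_bin", "predicted_label"], "deployment_action")
print(result)
```

In this toy data the ("high", "ok") fiber contains both ship and hold, so the evidence alone does not determine the action for those three rows.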

Install

pip install -e .

If your environment blocks isolated build downloads but already has setuptools installed:

pip install -e . --no-build-isolation

For the Streamlit app:

pip install -e ".[web]"

CLI quickstart

benchcert audit examples/toy_claim.csv \
  --evidence score_bin,predicted_label \
  --action deployment_action \
  --out reports/toy-audit

This writes:

summary.json
fiber_table.csv
ambiguous_cases.csv
report.html

Run a completion audit:

benchcert complete examples/toy_claim.csv \
  --evidence score_bin,predicted_label \
  --action deployment_action \
  --probes robustness_score,calibration_family \
  --costs 2,1 \
  --out reports/toy-completion

Python API

import pandas as pd
from benchcert import audit_dataframe, completion_curve

df = pd.read_csv("examples/toy_claim.csv")

audit = audit_dataframe(
    df,
    evidence_columns=["score_bin", "predicted_label"],
    action_column="deployment_action",
)

print(audit.summary)

completion = completion_curve(
    df,
    evidence_columns=["score_bin", "predicted_label"],
    action_column="deployment_action",
    probe_columns=["robustness_score", "calibration_family"],
    costs=[2, 1],
)

print(completion.recommended_probe)
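For intuition about how a probe recommendation could be derived, the sketch below ranks candidate probes by how much each one raises the certifiable fraction per unit cost. This is an assumption for illustration, not BenchCert's actual scoring rule. The helper names certifiable_fraction and rank_probes and the toy rows are hypothetical.

```python
from collections import defaultdict


def certifiable_fraction(rows, columns, action_column):
    """Fraction of rows whose fiber (rows with equal values in `columns`) has one action."""
    actions = defaultdict(set)
    sizes = defaultdict(int)
    for row in rows:
        key = tuple(row[c] for c in columns)
        actions[key].add(row[action_column])
        sizes[key] += 1
    certified = sum(sizes[k] for k, acts in actions.items() if len(acts) == 1)
    return certified / len(rows)


def rank_probes(rows, evidence_columns, action_column, probe_costs):
    """Rank probes by gain in certifiable fraction per unit cost (assumed criterion)."""
    base = certifiable_fraction(rows, evidence_columns, action_column)
    ranking = []
    for probe, cost in probe_costs.items():
        after = certifiable_fraction(rows, evidence_columns + [probe], action_column)
        gain = after - base
        ranking.append((probe, gain, gain / cost))
    # Highest gain per unit cost first.
    return sorted(ranking, key=lambda item: item[2], reverse=True)


# Made-up rows: the "high" score bin mixes ship and hold until a probe splits it.
rows = [
    {"score_bin": "high", "deployment_action": "ship",
     "robustness_score": "hi", "calibration_family": "a"},
    {"score_bin": "high", "deployment_action": "ship",
     "robustness_score": "hi", "calibration_family": "b"},
    {"score_bin": "high", "deployment_action": "hold",
     "robustness_score": "lo", "calibration_family": "a"},
    {"score_bin": "low", "deployment_action": "hold",
     "robustness_score": "lo", "calibration_family": "a"},
]

ranking = rank_probes(
    rows,
    evidence_columns=["score_bin"],
    action_column="deployment_action",
    probe_costs={"robustness_score": 2, "calibration_family": 1},
)
for probe, gain, gain_per_cost in ranking:
    print(probe, gain, gain_per_cost)
```

Here robustness_score costs twice as much as calibration_family but resolves every mixed fiber, so it still ranks first under this criterion.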

Web demo

The web demo has two views: an audit view, and a completion curve with probe ranking.

From PowerShell, run:

cd benchcert
$env:PYTHONPATH = "src"
python -m streamlit run app\streamlit_app.py

Then open:

http://localhost:8501

If BenchCert is already installed with pip install -e ".[web]", you can also run:

streamlit run app/streamlit_app.py

Fiber rules

BenchCert v0.1 supports four fiber rules:

exact: candidates share a fiber when all evidence values match.
quantile: numeric evidence columns are binned before exact matching.
knn: each candidate is audited against its k nearest evidence neighbors.
error-window: each candidate is audited against evidence neighbors inside a numeric tolerance window.
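As an illustration of the quantile rule, the sketch below bins a numeric evidence column into quartiles so that the binned values can then be matched exactly. It uses only the Python standard library and is not BenchCert's implementation. The helper name quantile_bins, the choice of four bins, and the scores are assumptions.

```python
import bisect
import statistics


def quantile_bins(values, n_bins=4):
    """Assign each value a bin index in 0..n_bins-1 using quantile cut points."""
    # statistics.quantiles returns the n_bins - 1 interior cut points.
    cuts = statistics.quantiles(values, n=n_bins)
    return [bisect.bisect_right(cuts, v) for v in values]


# Made-up benchmark scores.
scores = [0.10, 0.12, 0.35, 0.40, 0.62, 0.65, 0.90, 0.95]
bins = quantile_bins(scores, n_bins=4)
print(bins)
```

Candidates whose scores fall in the same bin would then share a fiber under exact matching, which trades resolution for larger, more stable fibers.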

Repository map

benchcert/
├── README.md
├── LICENSE
├── pyproject.toml
├── src/benchcert/
├── app/
├── examples/
├── tests/
├── docs/
└── paper/

The paper reproduction code lives outside this product repo. This repo is for the reusable deployment-claim auditor.
