BenchCert is a Python package, CLI, and web app for deployment-complete benchmarking.
Benchmarks often report scores, but deployment decisions may depend on responses not captured by the benchmark. BenchCert checks whether a deployment action is determined by the available benchmark evidence.
Given:
- benchmark evidence E
- a deployment action D
- optional candidate probes U
BenchCert reports:
- certifiable fraction
- ambiguous fraction
- mixed evidence fibers
- benchmark-only decision risk
- completion curve
- recommended next probe
Use BenchCert to check whether a benchmark score supports a deployment claim, identify ambiguous cases, and choose the next measurement that would make the claim certifiable.
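To make the reported quantities concrete, here is a minimal sketch of how a certifiable fraction could be computed under an exact-match rule. It assumes a fiber is the set of rows sharing the same evidence values, and that a fiber is "mixed" when its rows carry more than one action. These definitions, the helper name `fiber_summary`, and the toy data are illustrative assumptions, not BenchCert's API.

```python
import pandas as pd


def fiber_summary(df, evidence_columns, action_column):
    """Group rows into fibers by evidence values and flag mixed fibers.

    Assumption: a fiber is "mixed" when its rows carry more than one
    distinct action. Illustrative only; not BenchCert's implementation.
    """
    # Count distinct actions among the rows that share each evidence tuple.
    n_actions = df.groupby(evidence_columns)[action_column].transform("nunique")
    mixed = n_actions > 1
    return {
        "certifiable_fraction": float((~mixed).mean()),
        "ambiguous_fraction": float(mixed.mean()),
        "mixed_fibers": int(df[mixed].drop_duplicates(evidence_columns).shape[0]),
    }


# Hypothetical toy data: the (high, a) fiber maps to two different actions.
toy = pd.DataFrame(
    {
        "score_bin": ["high", "high", "high", "low", "low"],
        "predicted_label": ["a", "a", "b", "a", "a"],
        "deployment_action": ["deploy", "hold", "deploy", "hold", "hold"],
    }
)
summary = fiber_summary(toy, ["score_bin", "predicted_label"], "deployment_action")
print(summary)
```

In this toy table, two of the five rows sit in a fiber where the benchmark evidence alone does not determine the action, so a benchmark-only decision would be ambiguous for those rows.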
Install in editable mode:
pip install -e .
If your environment blocks isolated build downloads but already has setuptools installed:
pip install -e . --no-build-isolation
For the Streamlit app:
pip install -e ".[web]"
Run an audit:
benchcert audit examples/toy_claim.csv \
--evidence score_bin,predicted_label \
--action deployment_action \
--out reports/toy-audit
This writes:
summary.json
fiber_table.csv
ambiguous_cases.csv
report.html
Run a completion audit:
benchcert complete examples/toy_claim.csv \
--evidence score_bin,predicted_label \
--action deployment_action \
--probes robustness_score,calibration_family \
--costs 2,1 \
--out reports/toy-completion
The same audits are available from Python:
import pandas as pd
from benchcert import audit_dataframe, completion_curve
df = pd.read_csv("examples/toy_claim.csv")
# Audit: is the deployment action determined by the evidence columns?
audit = audit_dataframe(
    df,
    evidence_columns=["score_bin", "predicted_label"],
    action_column="deployment_action",
)
print(audit.summary)
# Completion: which candidate probe to measure next, given per-probe costs.
completion = completion_curve(
    df,
    evidence_columns=["score_bin", "predicted_label"],
    action_column="deployment_action",
    probe_columns=["robustness_score", "calibration_family"],
    costs=[2, 1],
)
print(completion.recommended_probe)
[Figure: Audit view]
[Figure: Completion curve and probe ranking]
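The probe ranking in a completion audit can be pictured with a small sketch. It assumes the recommended probe is the one that removes the most ambiguity per unit cost when added to the evidence. That criterion, the function names, and the toy data are hypothetical; BenchCert's actual ranking may use a different rule.

```python
import pandas as pd


def ambiguous_fraction(df, evidence_columns, action_column):
    """Fraction of rows whose evidence fiber contains more than one action."""
    n_actions = df.groupby(evidence_columns)[action_column].transform("nunique")
    return float((n_actions > 1).mean())


def rank_probes(df, evidence_columns, action_column, probe_columns, costs):
    """Rank probes by ambiguity removed per unit cost (assumed criterion)."""
    base = ambiguous_fraction(df, evidence_columns, action_column)
    rows = []
    for probe, cost in zip(probe_columns, costs):
        # Treat the probe as one extra evidence column and re-audit.
        after = ambiguous_fraction(df, evidence_columns + [probe], action_column)
        rows.append(
            {
                "probe": probe,
                "cost": cost,
                "ambiguous_after": after,
                "gain_per_cost": (base - after) / cost,
            }
        )
    table = pd.DataFrame(rows)
    return table.sort_values("gain_per_cost", ascending=False, ignore_index=True)


# Hypothetical toy data: robustness_score separates the actions,
# calibration_family does not.
toy = pd.DataFrame(
    {
        "score_bin": ["high", "high", "high", "high"],
        "deployment_action": ["deploy", "hold", "deploy", "hold"],
        "robustness_score": ["ok", "weak", "ok", "weak"],
        "calibration_family": ["x", "x", "y", "y"],
    }
)
ranking = rank_probes(
    toy,
    ["score_bin"],
    "deployment_action",
    ["robustness_score", "calibration_family"],
    costs=[2, 1],
)
print(ranking)
```

Here the cheaper probe ranks last because it leaves every fiber mixed, while the costlier one resolves all of them.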
From PowerShell, run:
cd benchcert
$env:PYTHONPATH = "src"
python -m streamlit run app\streamlit_app.py
Then open:
http://localhost:8501
If BenchCert is already installed with pip install -e ".[web]", you can also run:
streamlit run app/streamlit_app.py
BenchCert v0.1 supports four fiber rules:
- exact: candidates share a fiber when all evidence values match.
- quantile: numeric evidence columns are binned before exact matching.
- knn: each candidate is audited against its k nearest evidence neighbors.
- error-window: each candidate is audited against evidence neighbors inside a numeric tolerance window.
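As an illustration of the quantile rule, the sketch below bins a numeric evidence column with `pandas.qcut` and then applies exact matching on the bin index. The helper `quantile_fibers`, the bin count, and the toy data are assumptions for illustration; BenchCert's own binning may differ.

```python
import pandas as pd


def quantile_fibers(df, numeric_columns, q=4):
    """Return a copy of df with numeric evidence replaced by quantile bin indices."""
    binned = df.copy()
    for col in numeric_columns:
        # labels=False yields integer bin indices; duplicates="drop" tolerates ties.
        binned[col] = pd.qcut(binned[col], q=q, labels=False, duplicates="drop")
    return binned


# Hypothetical toy data with one numeric evidence column.
toy = pd.DataFrame(
    {
        "score": [0.10, 0.12, 0.55, 0.58, 0.90, 0.93],
        "deployment_action": ["hold", "hold", "hold", "deploy", "deploy", "deploy"],
    }
)
binned = quantile_fibers(toy, ["score"], q=3)

# After binning, rows in the same bin share a fiber under exact matching.
actions_per_fiber = binned.groupby("score")["deployment_action"].nunique()
print(actions_per_fiber)
```

A fiber whose count exceeds one is mixed: the binned evidence does not determine the action for the rows in it.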
benchcert/
├── README.md
├── LICENSE
├── pyproject.toml
├── src/benchcert/
├── app/
├── examples/
├── tests/
├── docs/
└── paper/
The paper reproduction code lives outside this product repo. This repo is for the reusable deployment-claim auditor.

