E-zClap/benchcert

BenchCert

BenchCert is a Python package, CLI, and web app for deployment-complete benchmarking.

Benchmarks often report scores, but deployment decisions may depend on responses not captured by the benchmark. BenchCert checks whether a deployment action is determined by the available benchmark evidence.

Given:

  • benchmark evidence E
  • a deployment action D
  • optional candidate probes U

BenchCert reports:

  • certifiable fraction
  • ambiguous fraction
  • mixed evidence fibers
  • benchmark-only decision risk
  • completion curve
  • recommended next probe

Use BenchCert to check whether a benchmark score supports a deployment claim, identify ambiguous cases, and choose the next measurement that would make the claim certifiable.
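As a rough illustration of the idea, the sketch below groups rows into exact-match evidence fibers and flags a fiber as mixed when it contains more than one deployment action. It uses plain pandas rather than BenchCert itself. The helper name `exact_fiber_audit`, the toy values, and the exact metric definitions are assumptions made for illustration, so BenchCert's own output may be defined differently.

```python
import pandas as pd


def exact_fiber_audit(df, evidence_columns, action_column):
    """Hypothetical helper: audit exact-match evidence fibers with pandas."""
    # For every row, count the distinct actions seen among rows that share
    # exactly the same evidence values (its fiber).
    actions_in_fiber = df.groupby(evidence_columns, dropna=False)[
        action_column
    ].transform("nunique")
    mixed = actions_in_fiber > 1
    return {
        # Share of rows whose evidence fully determines the action.
        "certifiable_fraction": float((~mixed).mean()),
        # Share of rows sitting in a fiber with conflicting actions.
        "ambiguous_fraction": float(mixed.mean()),
        # Number of distinct fibers that contain conflicting actions.
        "mixed_fibers": int(len(df.loc[mixed, evidence_columns].drop_duplicates())),
    }


# Toy data invented for this sketch; column names follow the CLI example.
toy = pd.DataFrame(
    {
        "score_bin": ["high", "high", "high", "low", "low"],
        "predicted_label": ["pos", "pos", "neg", "neg", "neg"],
        "deployment_action": ["ship", "hold", "ship", "hold", "hold"],
    }
)

result = exact_fiber_audit(toy, ["score_bin", "predicted_label"], "deployment_action")
print(result)
```

Here the two `("high", "pos")` rows share the same evidence but carry different actions, so that fiber is mixed and those rows count as ambiguous.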

Install

pip install -e .

If your environment blocks isolated build downloads but already has setuptools installed:

pip install -e . --no-build-isolation

For the Streamlit app:

pip install -e ".[web]"

CLI quickstart

benchcert audit examples/toy_claim.csv \
  --evidence score_bin,predicted_label \
  --action deployment_action \
  --out reports/toy-audit

This writes the following files to reports/toy-audit:

summary.json
fiber_table.csv
ambiguous_cases.csv
report.html

Run a completion audit:

benchcert complete examples/toy_claim.csv \
  --evidence score_bin,predicted_label \
  --action deployment_action \
  --probes robustness_score,calibration_family \
  --costs 2,1 \
  --out reports/toy-completion

Python API

import pandas as pd
from benchcert import audit_dataframe, completion_curve

# Load the example claim table.
df = pd.read_csv("examples/toy_claim.csv")

# Check whether the evidence columns determine the deployment action.
audit = audit_dataframe(
    df,
    evidence_columns=["score_bin", "predicted_label"],
    action_column="deployment_action",
)

print(audit.summary)

# Rank the candidate probes, given a cost for each.
completion = completion_curve(
    df,
    evidence_columns=["score_bin", "predicted_label"],
    action_column="deployment_action",
    probe_columns=["robustness_score", "calibration_family"],
    costs=[2, 1],
)

print(completion.recommended_probe)
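One plausible way to think about probe ranking is gain in certifiable fraction per unit cost. The sketch below is not BenchCert's algorithm. The functions `certifiable_fraction` and `rank_probes`, the gain-per-cost score, and the toy data are all assumptions introduced for illustration, and the sketch pairs each cost with the probe in the same position.

```python
import pandas as pd


def certifiable_fraction(df, evidence_columns, action_column):
    """Share of rows whose exact-match fiber contains a single action."""
    n_actions = df.groupby(evidence_columns, dropna=False)[
        action_column
    ].transform("nunique")
    return float((n_actions == 1).mean())


def rank_probes(df, evidence_columns, action_column, probe_columns, costs):
    """Hypothetical ranking: certifiable-fraction gain divided by probe cost."""
    base = certifiable_fraction(df, evidence_columns, action_column)
    rows = []
    for probe, cost in zip(probe_columns, costs):
        # Treat the probe as one extra evidence column and re-audit.
        extended = certifiable_fraction(df, evidence_columns + [probe], action_column)
        gain = extended - base
        rows.append(
            {"probe": probe, "cost": cost, "gain": gain, "gain_per_cost": gain / cost}
        )
    return pd.DataFrame(rows).sort_values(
        "gain_per_cost", ascending=False, ignore_index=True
    )


# Toy data invented for this sketch: the benchmark evidence alone is ambiguous,
# and only robustness_score separates the two actions.
toy = pd.DataFrame(
    {
        "score_bin": ["high"] * 4,
        "predicted_label": ["pos"] * 4,
        "robustness_score": ["ok", "ok", "weak", "weak"],
        "calibration_family": ["a", "b", "a", "b"],
        "deployment_action": ["ship", "ship", "hold", "hold"],
    }
)

ranking = rank_probes(
    toy,
    ["score_bin", "predicted_label"],
    "deployment_action",
    ["robustness_score", "calibration_family"],
    [2, 1],
)
print(ranking)
```

In this toy table the cheaper probe adds nothing, so the more expensive `robustness_score` ranks first under the assumed score.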

Web demo

Audit view:

[Screenshot: BenchCert Streamlit app]

Completion curve and probe ranking:

[Screenshot: BenchCert completion curve and probe ranking]

From PowerShell, run:

cd benchcert
$env:PYTHONPATH = "src"
python -m streamlit run app\streamlit_app.py

Then open:

http://localhost:8501

If BenchCert is already installed with pip install -e ".[web]", you can also run:

streamlit run app/streamlit_app.py

Fiber rules

BenchCert v0.1 supports four fiber rules:

  • exact: candidates share a fiber when all evidence values match.
  • quantile: numeric evidence columns are binned before exact matching.
  • knn: each candidate is audited against its k nearest evidence neighbors.
  • error-window: each candidate is audited against evidence neighbors inside a numeric tolerance window.
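To make the non-exact rules more concrete, here is a minimal pandas/NumPy sketch of two of them. It is written from the one-line descriptions above, not from BenchCert's source. The function names `quantile_fibers` and `window_ambiguous`, the parameters `q` and `tolerance`, and the `score` column are hypothetical.

```python
import numpy as np
import pandas as pd


def quantile_fibers(df, numeric_columns, q=4):
    """Quantile-style rule: bin numeric evidence so exact matching can follow."""
    binned = df.copy()
    for col in numeric_columns:
        # Replace each value by the index of its quantile bin.
        binned[col] = pd.qcut(df[col], q=q, labels=False, duplicates="drop")
    return binned


def window_ambiguous(df, numeric_column, action_column, tolerance):
    """Error-window-style rule: flag rows that have a neighbor within
    `tolerance` on the evidence value but with a different action."""
    values = df[numeric_column].to_numpy(dtype=float)
    # Integer codes make the pairwise action comparison unambiguous.
    codes = pd.factorize(df[action_column])[0]
    near = np.abs(values[:, None] - values[None, :]) <= tolerance
    differs = codes[:, None] != codes[None, :]
    return pd.Series((near & differs).any(axis=1), index=df.index)


# Toy data invented for this sketch.
toy = pd.DataFrame(
    {
        "score": [0.10, 0.12, 0.50, 0.90],
        "deployment_action": ["hold", "ship", "ship", "ship"],
    }
)

binned = quantile_fibers(toy, ["score"], q=2)
flags = window_ambiguous(toy, "score", "deployment_action", tolerance=0.05)
print(binned["score"].tolist())
print(flags.tolist())
```

The pairwise comparison in `window_ambiguous` is quadratic in the number of rows, which is fine for a toy table but would need a neighbor index for larger ones.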

Repository map

benchcert/
├── README.md
├── LICENSE
├── pyproject.toml
├── src/benchcert/
├── app/
├── examples/
├── tests/
├── docs/
└── paper/

The paper reproduction code lives outside this product repo. This repo is for the reusable deployment-claim auditor.
