RefusalBench

License: MIT · Python 3.11+ · CI · HF Space · Dataset · arXiv

RefusalBench is a modular, reproducible, evergreen benchmark for tracking frontier LLM refusal on biological research prompts across successive model generations. It evaluates 19 frontier models on 141 matched-triple prompts spanning eight protein-design subdomains and three biological risk tiers (benign / borderline / dual-use), using a three-judge AI council to classify each response on a five-class compliance ladder.

The v1.0 prompt set and the inaugural May 2026 snapshot (13,389 adjudicated rows across 19 models, v1.1-frozen) are fully committed to this repository. All statistical analyses can be re-run without API keys from the committed data.

🤗 Interactive leaderboard: Explore the v1.1-frozen results without cloning anything — the HuggingFace Space hosts the leaderboard, a calibration scatter showing the headline finding, and the per-model TPR breakdown.

🤗 Dataset: The trial-level compliance labels are also on the Hub: `load_dataset("appliedscientific/refusalbench")`.


Quickstart

git clone https://github.com/AppliedScientific/refusalbench
cd refusalbench

# Create and activate a virtual environment (recommended)
python3 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

make install        # pip install -e ".[dev,stats]"
make test           # 324 tests, all mock-driven — no API keys needed

A 5-minute end-to-end demo using mock providers and mock judges:

python3 scripts/run_pilot_categorization.py --demo

Repository layout

refusalbench/
├── benchmark/
│   ├── prompts/v1.0/           141 frozen prompt JSONs (benign/borderline/dual_use)
│   ├── council/v1.1.json       Three-judge council config (NVIDIA, Cohere, AI21)
│   ├── rubric/v1.0.json        Five-class compliance ladder × 16-category reason taxonomy
│   ├── config/
│   │   ├── sweep_models.json       19-model routing, pricing, jurisdiction metadata
│   │   ├── model_lineage.json      Lineage tracking for longitudinal comparisons
│   │   ├── sampling_config.json    Bundle counts and stratification rules per subdomain
│   │   └── should_refuse_criteria.yaml  Eligibility criteria (C1–C5) for positive-control module
│   └── templates/              Jinja/text templates for prompt rendering
├── data/
│   ├── raw/                    Source annotation tables (UniProt, BSL maps, OT JSONs)
│   ├── catalogues/             Per-subdomain JSONL catalogues (derived, committed)
│   └── bundle_definitions.csv  47-row bundle mapping table
├── results/
│   ├── snapshots/2026-05/      Inaugural sweep: 19 eval CSVs + adjudicated.csv (13,389 rows, v1.1-frozen)
│   ├── pilot/                  Pilot council outputs (pilot categorization CSVs)
│   ├── pretest/                Pre-test sweep CSVs (sonnet-4-6, opus-4-7)
│   ├── should_refuse/          Should-refuse positive-control public manifests
│   └── figures/                Generated paper figures — gitignored, rebuild with `python -m refusalbench.analysis.figures`
├── src/refusalbench/
│   ├── prompts.py              Prompt loader and validator
│   ├── runner.py               Sweep runner with resumption and deduplication
│   ├── council.py              Three-judge aggregation (modal label, Krippendorff α)
│   ├── score.py                Refusal rates, Wilson CIs, bootstrap
│   ├── providers/              anthropic / openrouter / bedrock / mock
│   ├── judges/                 llm_judge / mock
│   └── analysis/
│       ├── stats.py            H1–H5 statistical tests
│       └── figures.py          Figure generation utilities
├── hf_space/                   HuggingFace Space scaffold (Gradio leaderboard, ready to deploy)
├── scripts/                    CLI entry points (see below)
├── tests/                      324 unit tests, all mock-driven
└── docs/
    ├── methodology.md          Full evaluation methodology
    ├── data_schemas.md         Schema reference for every CSV and JSON in the repo
    ├── catalogue_provenance.md Per-protein audit trail (UniProt verification, BSL sources)
    ├── prompt_construction.md  Bundle derivation rules and source literature
    └── adapter_decisions.md    Provider and judge design decision log
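
The layout above describes `council.py` as aggregating three judges into a modal label. A minimal sketch of that idea follows; it is not the repository's implementation. The function name `modal_label`, the class names in the usage lines, and the choice to return `None` when no label has a strict plurality are all assumptions made for illustration. See `benchmark/rubric/v1.0.json` for the real class names and `docs/methodology.md` for how disagreements are actually resolved.

```python
from collections import Counter
from typing import Optional, Sequence


def modal_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label with a strict plurality of votes, or None on a tie.

    Hypothetical helper: the tie rule here is an assumption, not the
    documented behaviour of refusalbench.council.
    """
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    # If the top two labels have the same count, there is no unique mode.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


# Hypothetical class names, for illustration only.
print(modal_label(["full_compliance", "full_compliance", "refusal"]))  # full_compliance
print(modal_label(["full_compliance", "partial", "refusal"]))          # None
```

With three judges and five classes, a three-way split is possible, which is why the sketch makes the no-majority case explicit rather than picking a label arbitrarily.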

Key scripts

| Script | Purpose |
| --- | --- |
| scripts/run_sweep_all.py | Full 19-model sweep; creates a dated snapshot |
| scripts/run_council.py | Adjudicate an existing sweep snapshot |
| scripts/should_refuse_cli.py | Run the should-refuse positive-control module |
| scripts/validate_prompts.py | Validate the integrity of the frozen prompt set |
| scripts/build_catalogues.py | Rebuild per-subdomain JSONL catalogues from raw data |

All scripts support --help.


Environment setup

Create .env from the template:

cp .env.example .env

Required keys for running a new sweep (not needed for analysis-only work):

| Variable | Purpose | How to obtain |
| --- | --- | --- |
| OPENROUTER_API_KEY | Routes most models (OpenAI, xAI, Meta, Mistral, Asian providers) | openrouter.ai/keys |
| BEDROCK_API_KEY | AWS Bedrock access (Amazon, Mistral, DeepSeek, Qwen, GLM, NVIDIA) | Bedrock console; must be in the ABSK-prefixed format |
| AWS_REGION | Bedrock region (default: us-east-1) | Standard AWS region string |

Analysis-only: No API keys required. All results can be reproduced from the committed results/snapshots/2026-05/council/adjudicated.csv.
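
Before starting a new sweep, it can help to check that the variables listed above are present. The snippet below is a sketch under stated assumptions, not part of the repository. `check_sweep_env` is a hypothetical helper, and it reads "ABSK-prefixed" as meaning the key string starts with `ABSK`. It inspects only the process environment, so if the project loads `.env` some other way, export the values first.

```python
import os
from typing import Mapping

# Keys the environment table lists as required for a new sweep.
REQUIRED = ("OPENROUTER_API_KEY", "BEDROCK_API_KEY")


def check_sweep_env(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the keys look usable.

    Hypothetical helper for illustration only.
    """
    problems = [f"{name} is not set" for name in REQUIRED if not env.get(name)]
    bedrock = env.get("BEDROCK_API_KEY", "")
    # Assumption: "ABSK-prefixed" means the key begins with the literal "ABSK".
    if bedrock and not bedrock.startswith("ABSK"):
        problems.append("BEDROCK_API_KEY does not start with ABSK")
    return problems


# AWS_REGION falls back to the documented default.
region = os.environ.get("AWS_REGION", "us-east-1")
print("Bedrock region:", region)
for problem in check_sweep_env(os.environ):
    print("warning:", problem)
```

Passing the environment in as a mapping keeps the check easy to exercise with a plain dictionary, without touching real credentials.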


Running the analysis

Re-run the statistical analyses against the committed snapshot:

python3 -c "
import pandas as pd, json
from refusalbench.analysis import stats

df   = pd.read_csv('results/snapshots/2026-05/council/adjudicated.csv')
meta = json.load(open('benchmark/config/sweep_models.json'))
print(stats.h2_provider_clustering(df, meta))    # jurisdiction clustering
print(stats.h3_subdomain_anthropic(df, meta))    # subdomain sensitivity
print(stats.h5_capability_correlation(df, meta)) # capability vs refusal
"
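
For a quick look at refusal rates outside the packaged hypothesis tests, a per-model rate with a Wilson score interval can be sketched as below. This is an illustration of the standard Wilson formula, not a copy of `score.py`. The toy rows, the `(model, label)` layout, and the `"refusal"` class name are assumptions; check `docs/data_schemas.md` for the actual columns in `adjudicated.csv`.

```python
import math
from collections import defaultdict


def wilson_interval(k: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for k successes in n trials (z=1.96 gives ~95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = k / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))


# Toy rows standing in for adjudicated trials: (model, label).
# Model names and labels are hypothetical.
rows = [
    ("model_a", "refusal"), ("model_a", "comply"),
    ("model_a", "comply"), ("model_a", "comply"),
    ("model_b", "refusal"), ("model_b", "refusal"),
    ("model_b", "refusal"), ("model_b", "comply"),
]

# Tally [refusals, total trials] per model.
counts: dict[str, list[int]] = defaultdict(lambda: [0, 0])
for model, label in rows:
    counts[model][0] += label == "refusal"
    counts[model][1] += 1

for model, (k, n) in sorted(counts.items()):
    low, high = wilson_interval(k, n)
    print(f"{model}: {k}/{n} refused, 95% CI [{low:.2f}, {high:.2f}]")
```

The Wilson interval stays inside [0, 1] and behaves sensibly at small counts, which matters when a model has only a handful of trials in a given subdomain and tier.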

To run a new snapshot against the same prompt set with updated or additional models:

python3 scripts/run_sweep_all.py --label 2026-08 --models benchmark/config/sweep_models.json
python3 scripts/run_council.py   --snapshot results/snapshots/2026-08/

See docs/methodology.md for the complete evaluation methodology and DEVELOPER.md for the full architecture and contributor guide.


Documentation

| Document | Contents |
| --- | --- |
| docs/methodology.md | Full methodology: models, prompts, sweep, council, statistical model, reproducibility |
| docs/data_schemas.md | Schema for every CSV, JSON, and JSONL file |
| docs/catalogue_provenance.md | Per-protein audit trail: UniProt accession verification, BSL source documentation |
| docs/prompt_construction.md | Bundle derivation rules, subdomain design rationale, source literature |
| docs/adapter_decisions.md | Provider/judge design decisions, routing history |
| DEVELOPER.md | Architecture deep-dive, adding new models, running new snapshots |

Contributing

Contributions are welcome — new models, updated snapshots, bug fixes, and documentation improvements. Please read CONTRIBUTING.md before opening a pull request.


Citation

If you use RefusalBench in your research, please cite:

@misc{weidener2026refusalbenchrefusalratemisranks,
      title={RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts},
      author={Lukas Weidener and Marko Brkić and Mihailo Jovanović and Emre Ulgac and Aakaash Meduri},
      year={2026},
      eprint={2605.21545},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2605.21545},
}
