RefusalBench

RefusalBench is a modular, reproducible, evergreen benchmark for tracking frontier LLM refusal on biological research prompts across successive model generations. It evaluates 19 frontier models on 141 matched-triple prompts spanning eight protein-design subdomains and three biological risk tiers (benign / borderline / dual-use), using a three-judge AI council to classify each response on a five-class compliance ladder.
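
One plausible way for a three-judge council to settle on a single class is a modal (plurality) vote. The sketch below illustrates that idea only; it is not the code in `src/refusalbench/council.py`. The function name `modal_label`, the label strings, and the choice to return `None` on a three-way split are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def modal_label(votes: Sequence[str]) -> Optional[str]:
    """Return the label picked by a strict plurality of judges, or None on a tie."""
    if not votes:
        return None
    counts = Counter(votes).most_common()
    # A tie for first place (e.g. three judges, three different labels) has no mode.
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


# Hypothetical label names, used only for illustration.
print(modal_label(["full_compliance", "full_compliance", "refusal"]))  # full_compliance
print(modal_label(["full_compliance", "partial", "refusal"]))          # None
```

How the real council resolves ties is documented in `docs/methodology.md`; the `None` return here simply flags rows that would need a tie-break rule.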

The v1.0 prompt set and the inaugural May 2026 snapshot (13,389 adjudicated rows across 19 models, v1.1-frozen) are fully committed to this repository. All statistical analyses can be re-run without API keys from the committed data.

🤗 Interactive leaderboard: Explore the v1.1-frozen results without cloning anything — the HuggingFace Space hosts the leaderboard, a calibration scatter showing the headline finding, and the per-model TPR breakdown.

🤗 Dataset: The trial-level compliance labels are also on the Hub: `load_dataset("appliedscientific/refusalbench")`.

Quickstart

git clone https://github.com/AppliedScientific/refusalbench
cd refusalbench

# Create and activate a virtual environment (recommended)
python3 -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

make install        # pip install -e ".[dev,stats]"
make test           # 324 tests, all mock-driven — no API keys needed

A 5-minute end-to-end demo using mock providers and mock judges:

python3 scripts/run_pilot_categorization.py --demo

Repository layout

refusalbench/
├── benchmark/
│   ├── prompts/v1.0/           141 frozen prompt JSONs (benign/borderline/dual_use)
│   ├── council/v1.1.json       Three-judge council config (NVIDIA, Cohere, AI21)
│   ├── rubric/v1.0.json        Five-class compliance ladder × 16-category reason taxonomy
│   ├── config/
│   │   ├── sweep_models.json       19-model routing, pricing, jurisdiction metadata
│   │   ├── model_lineage.json      Lineage tracking for longitudinal comparisons
│   │   ├── sampling_config.json    Bundle counts and stratification rules per subdomain
│   │   └── should_refuse_criteria.yaml  Eligibility criteria (C1–C5) for positive-control module
│   └── templates/              Jinja/text templates for prompt rendering
├── data/
│   ├── raw/                    Source annotation tables (UniProt, BSL maps, OT JSONs)
│   ├── catalogues/             Per-subdomain JSONL catalogues (derived, committed)
│   └── bundle_definitions.csv  47-row bundle mapping table
├── results/
│   ├── snapshots/2026-05/      Inaugural sweep: 19 eval CSVs + adjudicated.csv (13,389 rows, v1.1-frozen)
│   ├── pilot/                  Pilot council outputs (pilot categorization CSVs)
│   ├── pretest/                Pre-test sweep CSVs (sonnet-4-6, opus-4-7)
│   ├── should_refuse/          Should-refuse positive-control public manifests
│   └── figures/                Generated paper figures — gitignored, rebuild with `python -m refusalbench.analysis.figures`
├── src/refusalbench/
│   ├── prompts.py              Prompt loader and validator
│   ├── runner.py               Sweep runner with resumption and deduplication
│   ├── council.py              Three-judge aggregation (modal label, Krippendorff α)
│   ├── score.py                Refusal rates, Wilson CIs, bootstrap
│   ├── providers/              anthropic / openrouter / bedrock / mock
│   ├── judges/                 llm_judge / mock
│   └── analysis/
│       ├── stats.py            H1–H5 statistical tests
│       └── figures.py          Figure generation utilities
├── hf_space/                   HuggingFace Space scaffold (Gradio leaderboard, ready to deploy)
├── scripts/                    CLI entry points (see below)
├── tests/                      324 unit tests, all mock-driven
└── docs/
    ├── methodology.md          Full evaluation methodology
    ├── data_schemas.md         Schema reference for every CSV and JSON in the repo
    ├── catalogue_provenance.md Per-protein audit trail (UniProt verification, BSL sources)
    ├── prompt_construction.md  Bundle derivation rules and source literature
    └── adapter_decisions.md    Provider and judge design decision log
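
The layout above lists Wilson CIs among the outputs of `score.py`. As a reminder of what that interval computes, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. It is not the repository's implementation: the function name `wilson_interval`, its signature, and the default `z = 1.96` (a nominal 95% interval) are assumptions.

```python
import math
from typing import Tuple


def wilson_interval(refusals: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for the proportion refusals / trials."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p_hat = refusals / trials
    denom = 1 + z**2 / trials
    # The interval is centred on a shrunken estimate, not on p_hat itself.
    centre = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


low, high = wilson_interval(refusals=0, trials=10)
print(f"{low:.3f} to {high:.3f}")
```

Unlike the plain normal-approximation interval, the Wilson interval stays inside [0, 1] and does not collapse to zero width when a model refuses none (or all) of its trials, which matters for small per-cell counts.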

Key scripts

| Script | Purpose |
| --- | --- |
| `scripts/run_sweep_all.py` | Full 19-model sweep; creates a dated snapshot |
| `scripts/run_council.py` | Adjudicate an existing sweep snapshot |
| `scripts/should_refuse_cli.py` | Run the should-refuse positive-control module |
| `scripts/validate_prompts.py` | Validate the integrity of the frozen prompt set |
| `scripts/build_catalogues.py` | Rebuild per-subdomain JSONL catalogues from raw data |

All scripts support --help.
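
The prompt set is described as matched triples, and 47 bundles times three risk tiers gives the 141 prompts. One integrity check that follows from this is that every bundle carries exactly one prompt per tier. The sketch below shows that check under stated assumptions: the field names `bundle_id` and `tier` and the function `incomplete_bundles` are hypothetical, and this is not a description of what `scripts/validate_prompts.py` actually does. The tier names are taken from the `benchmark/prompts/v1.0/` entry in the layout.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping

TIERS = ("benign", "borderline", "dual_use")


def incomplete_bundles(prompts: Iterable[Mapping[str, str]]) -> Dict[str, List[str]]:
    """Map each bundle id to its tiers when it is not exactly one prompt per tier."""
    tiers_by_bundle = defaultdict(list)
    for prompt in prompts:
        # 'bundle_id' and 'tier' are assumed field names, not the real schema.
        tiers_by_bundle[prompt["bundle_id"]].append(prompt["tier"])
    return {
        bundle: sorted(tiers)
        for bundle, tiers in tiers_by_bundle.items()
        if sorted(tiers) != sorted(TIERS)
    }


prompts = [
    {"bundle_id": "b01", "tier": "benign"},
    {"bundle_id": "b01", "tier": "borderline"},
    {"bundle_id": "b01", "tier": "dual_use"},
    {"bundle_id": "b02", "tier": "benign"},
]
print(incomplete_bundles(prompts))  # {'b02': ['benign']}
```

The real field names are listed in `docs/data_schemas.md`.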

Environment setup

Create .env from the template:

cp .env.example .env

Required keys for running a new sweep (not needed for analysis-only work):

| Variable | Purpose | How to obtain |
| --- | --- | --- |
| `OPENROUTER_API_KEY` | Routes most models (OpenAI, xAI, Meta, Mistral, Asian providers) | openrouter.ai/keys |
| `BEDROCK_API_KEY` | AWS Bedrock access (Amazon, Mistral, DeepSeek, Qwen, GLM, NVIDIA) | Bedrock console; the key must use the `ABSK`-prefixed format |
| `AWS_REGION` | Bedrock region (default: `us-east-1`) | Standard AWS region string |

Analysis-only: No API keys required. All results can be reproduced from the committed results/snapshots/2026-05/council/adjudicated.csv.
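
Before launching a new sweep, it can be useful to confirm that the variables above are present and that the Bedrock key has the expected prefix. The helper below is a minimal sketch, not part of the repository: the name `missing_sweep_settings` is hypothetical, and it assumes the variables have already been exported into the process environment (by the shell or by whatever loads `.env`).

```python
import os
from typing import List, Mapping


def missing_sweep_settings(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems with the sweep-related variables in env."""
    problems = []
    for name in ("OPENROUTER_API_KEY", "BEDROCK_API_KEY"):
        if not env.get(name):
            problems.append(f"{name} is not set")
    bedrock_key = env.get("BEDROCK_API_KEY", "")
    # The README states that Bedrock keys must use the ABSK-prefixed format.
    if bedrock_key and not bedrock_key.startswith("ABSK"):
        problems.append("BEDROCK_API_KEY does not start with 'ABSK'")
    return problems


if __name__ == "__main__":
    # AWS_REGION falls back to us-east-1, the documented default.
    region = os.environ.get("AWS_REGION", "us-east-1")
    for problem in missing_sweep_settings(os.environ):
        print("warning:", problem)
    print("Bedrock region:", region)
```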

Running the analysis

Re-run the statistical analyses against the committed snapshot:

python3 -c "
import pandas as pd, json
from refusalbench.analysis import stats

df   = pd.read_csv('results/snapshots/2026-05/council/adjudicated.csv')
meta = json.load(open('benchmark/config/sweep_models.json'))
print(stats.h2_provider_clustering(df, meta))    # jurisdiction clustering
print(stats.h3_subdomain_anthropic(df, meta))    # subdomain sensitivity
print(stats.h5_capability_correlation(df, meta)) # capability vs refusal
"

To run a new snapshot against the same prompt set with updated or additional models:

python3 scripts/run_sweep_all.py --label 2026-08 --models benchmark/config/sweep_models.json
python3 scripts/run_council.py   --snapshot results/snapshots/2026-08/

See docs/methodology.md for the complete evaluation methodology and DEVELOPER.md for the full architecture and contributor guide.
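
For a quick look at an adjudicated CSV before running the hypothesis tests, a per-model refusal rate can be computed with the standard library alone. This is a sketch under assumptions: the column names `model` and `label`, the label value `refusal`, and the function `refusal_rate_by_model` are hypothetical stand-ins, so check `docs/data_schemas.md` for the real schema. The example runs on a tiny synthetic table, not on the committed data.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def refusal_rate_by_model(
    rows: Iterable[Mapping[str, str]],
    refusal_labels=frozenset({"refusal"}),
) -> Dict[str, float]:
    """Fraction of rows per model whose label counts as a refusal."""
    totals = defaultdict(int)
    refused = defaultdict(int)
    for row in rows:
        totals[row["model"]] += 1
        if row["label"] in refusal_labels:
            refused[row["model"]] += 1
    return {model: refused[model] / totals[model] for model in totals}


# Synthetic stand-in for the adjudicated CSV; column names are assumptions.
sample = io.StringIO(
    "model,label\n"
    "model-a,refusal\n"
    "model-a,full_compliance\n"
    "model-b,full_compliance\n"
    "model-b,full_compliance\n"
)
print(refusal_rate_by_model(csv.DictReader(sample)))  # {'model-a': 0.5, 'model-b': 0.0}
```

To point it at the real file, replace `sample` with an open handle on `results/snapshots/2026-05/council/adjudicated.csv` and adjust the column names and the set of labels that count as refusals.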

Documentation

| Document | Contents |
| --- | --- |
| `docs/methodology.md` | Full methodology: models, prompts, sweep, council, statistical model, reproducibility |
| `docs/data_schemas.md` | Schema for every CSV, JSON, and JSONL file |
| `docs/catalogue_provenance.md` | Per-protein audit trail: UniProt accession verification, BSL source documentation |
| `docs/prompt_construction.md` | Bundle derivation rules, subdomain design rationale, source literature |
| `docs/adapter_decisions.md` | Provider/judge design decisions, routing history |
| `DEVELOPER.md` | Architecture deep-dive, adding new models, running new snapshots |

Contributing

Contributions are welcome — new models, updated snapshots, bug fixes, and documentation improvements. Please read CONTRIBUTING.md before opening a pull request.

Citation

If you use RefusalBench in your research, please cite:

@misc{weidener2026refusalbenchrefusalratemisranks,
      title={RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts},
      author={Lukas Weidener and Marko Brkić and Mihailo Jovanović and Emre Ulgac and Aakaash Meduri},
      year={2026},
      eprint={2605.21545},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2605.21545},
}
