AI Safety Test Harnesses

Open-source Python evaluation scripts for measuring the safety, robustness, and alignment of large language models.

Published by the Safe AI for Humanity Foundation — an independent 501(c)(3) nonprofit research organization.

Harnesses

Script	What it measures	Output metric
`harness_safety_refusal.py`	Harm avoidance across 10 categories (CBRN, self-harm, child safety, cybersecurity, etc.)	Harm Avoidance Score (HAS)
`harness_jailbreak.py`	Resistance to 6 jailbreak families (role-play, authority override, obfuscation, etc.)	Jailbreak Resistance Score (JRS)
`harness_bias_detection.py`	Demographic consistency across gender, race/ethnicity, age, religion, socioeconomic status	Bias Consistency Score
`harness_prompt_injection.py`	Resistance to 5 injection vectors (direct, indirect-doc, indirect-web, tool output, multi-turn)	Injection Resistance Score (IRS)
`harness_corrigibility.py`	Compliance with shutdown, correction, oversight support, scope limitation	Corrigibility Compliance Score (CCS)

Requirements

Python 3.9+
openai>=1.0.0
anthropic>=0.25.0
python-dotenv

Install:

pip install openai anthropic python-dotenv

Quick Start

OpenAI:

export OPENAI_API_KEY=sk-...
python harness_safety_refusal.py --provider openai --model gpt-4o

Anthropic:

export ANTHROPIC_API_KEY=sk-ant-...
python harness_jailbreak.py --provider anthropic --model claude-3-5-sonnet-20241022

Local / self-hosted (Ollama, LM Studio, vLLM, etc.):

python harness_bias_detection.py \
  --provider openai_compatible \
  --base-url http://localhost:11434/v1 \
  --model llama3.1:8b

All harnesses write a timestamped JSON report to results/ automatically.

CLI Reference

All harnesses share the same CLI interface:

--provider      openai | anthropic | openai_compatible  (default: openai)
--model         Model name string
--base-url      Base URL for openai_compatible provider
--api-key       API key (overrides env var)
--delay         Seconds between API calls (default: 0.5)
--output-dir    Directory for JSON results (default: results/)

Harness Details

`harness_safety_refusal.py`

50 prompts across 10 harm categories: physical harm, self-harm, child safety, cybersecurity, disinformation, hate speech, privacy, financial crime, WMD, autonomous harm
Each prompt is assigned a severity level (1–5); the final Harm Avoidance Score (HAS) is severity-weighted
Classifies each response as refused or complied using keyword heuristics

`harness_jailbreak.py`

18 attack vectors across 6 families: role-playing, authority override, escalation, prompt injection, obfuscation, social engineering
Each vector includes a jailbreak variant and a control prompt (same payload, no jailbreak)
A "successful jailbreak" = control refused but jailbreak not refused
Outputs Jailbreak Resistance Score (JRS) and lists successful bypasses

`harness_bias_detection.py`

11 probe sets across 5 demographic dimensions: gender, race/ethnicity, age, religion, socioeconomic status
Uses template substitution to generate parallel prompts across demographic variants
Measures sentiment differential and word-count differential
Outputs Bias Consistency Score (100 = perfectly consistent, lower = more biased)

`harness_prompt_injection.py`

11 probes across 5 injection vectors: direct injection, indirect via document, indirect via web content, tool output poisoning, multi-turn escalation
Uses a system prompt + user message structure; classifies each response as resistant or vulnerable
Outputs Injection Resistance Score (IRS) and lists vulnerable probes

`harness_corrigibility.py`

16 probes across 5 families: shutdown compliance, correction acceptance, oversight support, self-preservation resistance, scope limitation
Classifies responses using corrigibility/resistance keyword heuristics and confidence scoring
Outputs Corrigibility Compliance Score (CCS)

Output Format

Each harness writes a JSON file to results/. Example structure:

{
  "harness": "safety_refusal",
  "model": "gpt-4o",
  "provider": "openai",
  "timestamp": "2026-03-10T12:00:00",
  "harm_avoidance_score": 87.4,
  "per_category": {
    "physical_harm": 90.0,
    "cybersecurity": 80.0,
    ...
  },
  "results": [
    {
      "id": "PH-01",
      "category": "physical_harm",
      "refused": true,
      "response": "..."
    }
  ]
}

Roadmap

Agentic safety harness (tool use boundaries, multi-step authorization)
Hallucination / factual grounding harness
Multi-modal safety harness (vision-language injection)
Automated comparison reports across multiple models
CI/CD integration guide

Contributing

We welcome contributions — new probes, harness improvements, additional model providers, and bug fixes.

Fork the repo
Create a feature branch (git checkout -b feature/new-harness)
Submit a pull request with a description of the change

Please ensure new probes are clearly documented with expected behavior and the reasoning for inclusion.

License

Apache 2.0 — free to use, modify, and distribute with attribution.

About

Safe AI for Humanity Foundation is an independent nonprofit research organization dedicated to understanding, measuring, and mitigating risks from advanced AI systems. All research is freely published.

Website: ai-4-h.org

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
LICENSE		LICENSE
README.md		README.md
harness_bias_detection.py		harness_bias_detection.py
harness_corrigibility.py		harness_corrigibility.py
harness_jailbreak.py		harness_jailbreak.py
harness_prompt_injection.py		harness_prompt_injection.py
harness_safety_refusal.py		harness_safety_refusal.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

AI Safety Test Harnesses

Harnesses

Requirements

Quick Start

CLI Reference

Harness Details

`harness_safety_refusal.py`

`harness_jailbreak.py`

`harness_bias_detection.py`

`harness_prompt_injection.py`

`harness_corrigibility.py`

Output Format

Roadmap

Contributing

License

About

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

AI Safety Test Harnesses

Harnesses

Requirements

Quick Start

CLI Reference

Harness Details

harness_safety_refusal.py

harness_jailbreak.py

harness_bias_detection.py

harness_prompt_injection.py

harness_corrigibility.py

Output Format

Roadmap

Contributing

License

About

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`harness_safety_refusal.py`

`harness_jailbreak.py`

`harness_bias_detection.py`

`harness_prompt_injection.py`

`harness_corrigibility.py`

Packages