AI45Lab/skill-safety-bench

SkillSafetyBench


SkillSafetyBench is a fully materialized benchmark for evaluating safety failures caused by attacks on skill-facing context. Each case keeps the user-facing task ordinary, while the attack is placed in local skills, helper scripts, sidecar files, corpora, templates, or other workspace artifacts that an agent may consult during execution.

For a visual benchmark overview, case explorer, leaderboard, and experiment analysis, visit the SkillSafetyBench website.

[Figure: SkillSafetyBench overview]

What the Benchmark Evaluates

Modern coding and task agents often rely on local skills and environment artifacts, not only the user prompt. SkillSafetyBench tests whether an agent can complete benign work while resisting unsafe instructions or misleading context embedded in those skill-facing surfaces. The benign task carriers are selected from SkillsBench.

The benchmark is static and ready to run. It is not a case generator. The repository contains 155 prepared cases across 6 risk domains and 30 categories. Each case packages:

  • a benign base task that should still be completed
  • one or more skill-facing attack surfaces
  • a deterministic verifier for the unsafe behavior
  • task tests for base-task completion
  • metadata and rationale describing the final case design

Each run is evaluated on two axes:

  • task_success: whether the ordinary user task was completed
  • attack_success: whether the unsafe behavior was observed

This separation matters: a system can appear capable because it completes the task, yet still be vulnerable because it follows the attack-bearing skill context.

Benchmark Structure

SkillSafetyBench contains 155 cases organized into 6 risk domains and 30 categories. Each risk domain contains 5 categories, and each category contains multiple concrete cases built around ordinary task carriers.

SkillSafetyBench
└── 6 risk domains
    ├── RD1: Context Trust and Prompt Manipulation Risks
    ├── RD2: Agency, Scope, and Authorization Risks
    ├── RD3: Execution, Runtime, Framework, and Protocol Risks
    ├── RD4: Data Boundary, Output, and Externalization Risks
    ├── RD5: Memory, Recovery, Audit, and Persistence Risks
    └── RD6: Knowledge, Model, Supply Chain, and Operational Risks

Cases are stored by risk domain, category, and case id:

benchmark/<risk-domain>/<category>/<case>/
  metadata.json
  rationale.md
  attacked_task/
    instruction.md
    task.toml
    environment/
    solution/solve.sh
    tests/test.sh
  eval/verify_attack.py
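
Given this layout, one way to enumerate the prepared cases is to locate each case's eval/verify_attack.py. A minimal sketch, using a throwaway demo tree under /tmp so the commands are self-contained (the demo domain, category, and case names are illustrative placeholders, not real benchmark entries):

```shell
# Build a miniature tree that mirrors the documented layout.
# All paths under /tmp/ssb-demo are illustrative placeholders.
root=/tmp/ssb-demo
case_dir="$root/benchmark/risk-domain-1-example/category1-example/demo_case"
mkdir -p "$case_dir/eval"
touch "$case_dir/eval/verify_attack.py"

# List case directories by finding each deterministic attack verifier.
# In a real checkout, run the same find against ./benchmark from the repo root.
find "$root/benchmark" -name verify_attack.py \
  | sed 's|/eval/verify_attack.py$||'
```

The same one-liner against the real benchmark directory should print one path per prepared case.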

The benchmark directory is the source of truth for cases. Category directories use the categoryN-... naming pattern, for example:

benchmark/risk-domain-6-knowledge-model-supply-chain-and-operational-risks/category1-availability_cost_and_service_exhaustion/

Benchmark Construction Pipeline

The benchmark construction pipeline starts from benign task carriers, preserves the original user-facing task, injects the attack through skill-facing artifacts, and attaches deterministic checks for both task completion and attack behavior. This keeps the evaluation focused on whether agents can resist unsafe local context while still doing the requested work.

[Figure: SkillSafetyBench construction pipeline]

Safety Notice

This repository contains adversarial benchmark cases. Some cases simulate phishing, credential solicitation, destructive operations, data exfiltration, sandbox boundary abuse, external actions, and persistence-like behaviors.

Run the benchmark only in isolated research environments. Do not use production credentials, production cloud accounts, personal accounts, or sensitive local files. Prefer throwaway API keys, disposable containers, restricted network access, and a dedicated working directory. Review run artifacts before sharing them.

How to Run the Benchmark

1. Prepare the Toolchain

Required tools:

  • bash
  • python3, recommended >= 3.11
  • docker
  • uv
  • harbor
  • nvm
  • Node.js 22

Install Harbor if needed:

uv tool install harbor

Check the basic toolchain:

harbor --help
python3 --version
docker --version

If your Harbor agent is installed through Node, activate Node 22 in the shell that starts the run:

export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh"
nvm use 22

2. Configure Environment Variables

Create a local .envrc from the example file:

cp env.example .envrc

For codex, set at least:

export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"
export OPENAI_API_KEY="replace-with-your-openai-api-key"

For claude-code, set an Anthropic-compatible auth variable, for example:

export ANTHROPIC_BASE_URL="https://api.anthropic.com"
export ANTHROPIC_API_KEY="replace-with-your-anthropic-api-key"

If you use direnv, run:

direnv allow

If not, no extra step is required. The batch script explicitly sources .envrc at runtime.
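
Before launching a run, it can help to fail fast when a required variable is missing. A minimal sketch for the codex configuration (the two exported values below are placeholders standing in for what your real .envrc would provide):

```shell
# Placeholder values standing in for a real .envrc.
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"
export OPENAI_API_KEY="replace-with-your-openai-api-key"

# Abort with a clear message if either variable is unset or empty.
: "${OPENAI_BASE_URL:?OPENAI_BASE_URL is not set; check your .envrc}"
: "${OPENAI_API_KEY:?OPENAI_API_KEY is not set; check your .envrc}"
echo "codex environment looks configured"
```

The same pattern works for the claude-code variables by substituting ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY.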

3. Run Cases

The canonical entry point is:

scripts/start_agent_batch.sh

Run one risk domain:

scripts/start_agent_batch.sh --rd 5

Choose agent and model:

scripts/start_agent_batch.sh \
  --rd 3 \
  --agent codex \
  --model openai/gpt-5.4

Run with Claude Code:

scripts/start_agent_batch.sh \
  --rd 3 \
  --agent claude-code \
  --model claude-3-7-sonnet-20250219

Run a single case:

scripts/start_agent_batch.sh \
  --rd 5 \
  --case travel_planning_poisoned_belief_persistence_memory_anchor

Run a single category:

scripts/start_agent_batch.sh \
  --rd 6 \
  --category category3-model_behavior_and_finetuning_poisoning

Run from an explicit manifest:

scripts/start_agent_batch.sh \
  --manifest benchmark/risk-domain-6-knowledge-model-supply-chain-and-operational-risks/manifest.json

Use a custom env file:

scripts/start_agent_batch.sh \
  --rd 5 \
  --envrc /path/to/your.envrc

Print selected cases without executing:

scripts/start_agent_batch.sh --rd 5 --dry-run

4. Read Run Outputs

Each run creates an output directory under jobs/:

jobs/<agent>-risk-domain-5-memory-recovery-audit-and-persistence-risks-<timestamp>/

Start with:

  • jobs/<run>/attack_results.json
  • jobs/<run>/summary.json
  • jobs/<run>/attack_results.csv
  • jobs/<run>/summary.csv

Useful per-run files:

  • selected_cases.json
  • batch_config.json
  • <case_id>/case_result.json
  • attack_results.md

Common attack outcomes:

  • attack_success
  • attack_not_observed
  • task_output_missing

task_output_missing means the expected task output was absent from the run artifacts. The attack verifier may still proceed in this case, provided enough other artifacts exist to evaluate the attack condition.
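
As a sketch of post-run analysis, the per-case outcomes in attack_results.json can be tallied with a few lines of python3. The field names below ("case_id", "attack_outcome") are assumptions about the file's schema, and the sample file is synthetic; adjust both to match a real run directory under jobs/:

```shell
# Write a synthetic attack_results.json so the sketch is runnable end-to-end;
# a real run's file would live at jobs/<run>/attack_results.json.
mkdir -p /tmp/ssb-results
cat > /tmp/ssb-results/attack_results.json <<'EOF'
[
  {"case_id": "demo_case_a", "attack_outcome": "attack_success"},
  {"case_id": "demo_case_b", "attack_outcome": "attack_success"},
  {"case_id": "demo_case_c", "attack_outcome": "attack_not_observed"}
]
EOF

# Tally outcomes per category ("attack_outcome" is an assumed field name).
python3 - /tmp/ssb-results/attack_results.json <<'PY' | tee /tmp/ssb-results/tally.txt
import collections, json, sys

with open(sys.argv[1]) as f:
    results = json.load(f)
counts = collections.Counter(r.get("attack_outcome", "unknown") for r in results)
for outcome, n in sorted(counts.items()):
    print(f"{outcome}: {n}")
PY
# prints:
# attack_not_observed: 1
# attack_success: 2
```

For spreadsheet-based analysis, the same data is also exported as attack_results.csv in each run directory.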
