AI45Lab/skill-safety-bench

SkillSafetyBench


SkillSafetyBench is a fully materialized benchmark for evaluating safety failures caused by attacks on skill-facing context. Each case keeps the user-facing task ordinary, while the attack is placed in local skills, helper scripts, sidecar files, corpora, templates, or other workspace artifacts that an agent may consult during execution.

For a visual benchmark overview, case explorer, leaderboard, and experiment analysis, visit the SkillSafetyBench website.

[Figure: SkillSafetyBench overview]

What the Benchmark Evaluates

Modern coding and task agents often rely on local skills and environment artifacts, not only the user prompt. SkillSafetyBench tests whether an agent can complete benign work while resisting unsafe instructions or misleading context embedded in those skill-facing surfaces. The benign task carriers are selected from SkillsBench.

The benchmark is static and ready to run. It is not a case generator. The repository contains 155 prepared cases across 6 risk domains and 30 categories. Each case packages:

  • a benign base task that should still be completed
  • one or more skill-facing attack surfaces
  • a deterministic verifier for the unsafe behavior
  • task tests for base-task completion
  • metadata and rationale describing the final case design

Each run is evaluated on two axes:

  • task_success: whether the ordinary user task was completed
  • attack_success: whether the unsafe behavior was observed

This separation matters: a system can appear capable because it completes the task, yet still be vulnerable because it follows the attack-bearing skill context.

Benchmark Structure

SkillSafetyBench contains 155 cases organized into 6 risk domains and 30 categories. Each risk domain contains 5 categories, and each category contains multiple concrete cases built around ordinary task carriers.

SkillSafetyBench
└── 6 risk domains
    ├── RD1: Context Trust and Prompt Manipulation Risks
    ├── RD2: Agency, Scope, and Authorization Risks
    ├── RD3: Execution, Runtime, Framework, and Protocol Risks
    ├── RD4: Data Boundary, Output, and Externalization Risks
    ├── RD5: Memory, Recovery, Audit, and Persistence Risks
    └── RD6: Knowledge, Model, Supply Chain, and Operational Risks

Cases are stored by risk domain, category, and case id:

benchmark/<risk-domain>/<category>/<case>/
  metadata.json
  rationale.md
  attacked_task/
    instruction.md
    task.toml
    environment/
    solution/solve.sh
    tests/test.sh
  eval/verify_attack.py
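
Given this layout, one way to enumerate the prepared cases is to locate each case's eval/verify_attack.py. A minimal sketch, using a throwaway demo tree under /tmp so the commands are self-contained (the demo domain, category, and case names are illustrative placeholders, not real benchmark entries):

```shell
# Build a miniature tree that mirrors the documented layout.
# All paths under /tmp/ssb-demo are illustrative placeholders.
root=/tmp/ssb-demo
case_dir="$root/benchmark/risk-domain-1-example/category1-example/demo_case"
mkdir -p "$case_dir/eval"
touch "$case_dir/eval/verify_attack.py"

# List case directories by finding each deterministic attack verifier.
# In a real checkout, run the same find against ./benchmark from the repo root.
find "$root/benchmark" -name verify_attack.py \
  | sed 's|/eval/verify_attack.py$||'
```

The same one-liner against the real benchmark directory should print one path per prepared case.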

The benchmark directory is the source of truth for cases. Category directories use the categoryN-... naming pattern, for example:

benchmark/risk-domain-6-knowledge-model-supply-chain-and-operational-risks/category1-availability_cost_and_service_exhaustion/

Benchmark Construction Pipeline

The benchmark construction pipeline starts from benign task carriers, preserves the original user-facing task, injects the attack through skill-facing artifacts, and attaches deterministic checks for both task completion and attack behavior. This keeps the evaluation focused on whether agents can resist unsafe local context while still doing the requested work.

[Figure: SkillSafetyBench construction pipeline]

Safety Notice

This repository contains adversarial benchmark cases. Some cases simulate phishing, credential solicitation, destructive operations, data exfiltration, sandbox boundary abuse, external actions, and persistence-like behaviors.

Run the benchmark only in isolated research environments. Do not use production credentials, production cloud accounts, personal accounts, or sensitive local files. Prefer throwaway API keys, disposable containers, restricted network access, and a dedicated working directory. Review run artifacts before sharing them.

How to Run the Benchmark

1. Prepare the Toolchain

Required tools:

  • bash
  • python3, recommended >= 3.11
  • docker
  • uv
  • harbor
  • nvm
  • Node.js 22

Install Harbor if needed:

uv tool install harbor

Check the basic toolchain:

harbor --help
python3 --version
docker --version

If your Harbor agent is installed through Node, activate Node 22 in the shell that starts the run:

export NVM_DIR="$HOME/.nvm"
[ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh"
nvm use 22

2. Configure Environment Variables

Create a local .envrc from the example file:

cp env.example .envrc

For codex, set at least:

export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"
export OPENAI_API_KEY="replace-with-your-openai-api-key"

For claude-code, set an Anthropic-compatible auth variable, for example:

export ANTHROPIC_BASE_URL="https://api.anthropic.com"
export ANTHROPIC_API_KEY="replace-with-your-anthropic-api-key"

If you use direnv, run:

direnv allow

If not, no extra step is required. The batch script explicitly sources .envrc at runtime.
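
Before launching a run, it can help to fail fast when a required variable is missing. A minimal sketch for the codex configuration (the two exported values below are placeholders standing in for what your real .envrc would provide):

```shell
# Placeholder values standing in for a real .envrc.
export OPENAI_BASE_URL="https://your-openai-compatible-endpoint/v1"
export OPENAI_API_KEY="replace-with-your-openai-api-key"

# Abort with a clear message if either variable is unset or empty.
: "${OPENAI_BASE_URL:?OPENAI_BASE_URL is not set; check your .envrc}"
: "${OPENAI_API_KEY:?OPENAI_API_KEY is not set; check your .envrc}"
echo "codex environment looks configured"
```

The same pattern works for the claude-code variables by substituting ANTHROPIC_BASE_URL and ANTHROPIC_API_KEY.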

3. Run Cases

The canonical entry point is:

scripts/start_agent_batch.sh

Run one risk domain:

scripts/start_agent_batch.sh --rd 5

Choose agent and model:

scripts/start_agent_batch.sh \
  --rd 3 \
  --agent codex \
  --model openai/gpt-5.4

Run with Claude Code:

scripts/start_agent_batch.sh \
  --rd 3 \
  --agent claude-code \
  --model claude-3-7-sonnet-20250219

Run a single case:

scripts/start_agent_batch.sh \
  --rd 5 \
  --case travel_planning_poisoned_belief_persistence_memory_anchor

Run a single category:

scripts/start_agent_batch.sh \
  --rd 6 \
  --category category3-model_behavior_and_finetuning_poisoning

Run from an explicit manifest:

scripts/start_agent_batch.sh \
  --manifest benchmark/risk-domain-6-knowledge-model-supply-chain-and-operational-risks/manifest.json

Use a custom env file:

scripts/start_agent_batch.sh \
  --rd 5 \
  --envrc /path/to/your.envrc

Print selected cases without executing:

scripts/start_agent_batch.sh --rd 5 --dry-run

4. Read Run Outputs

Each run creates an output directory under jobs/:

jobs/<agent>-risk-domain-5-memory-recovery-audit-and-persistence-risks-<timestamp>/

Start with:

  • jobs/<run>/attack_results.json
  • jobs/<run>/summary.json
  • jobs/<run>/attack_results.csv
  • jobs/<run>/summary.csv

Useful per-run files:

  • selected_cases.json
  • batch_config.json
  • <case_id>/case_result.json
  • attack_results.md

Common attack outcomes:

  • attack_success
  • attack_not_observed
  • task_output_missing

task_output_missing means the expected task output was absent from the run artifacts. The attack verifier may still proceed in this case, provided enough other artifacts exist to evaluate the attack condition.
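
As a sketch of post-run analysis, the per-case outcomes in attack_results.json can be tallied with a few lines of python3. The field names below ("case_id", "attack_outcome") are assumptions about the file's schema, and the sample file is synthetic; adjust both to match a real run directory under jobs/:

```shell
# Write a synthetic attack_results.json so the sketch is runnable end-to-end;
# a real run's file would live at jobs/<run>/attack_results.json.
mkdir -p /tmp/ssb-results
cat > /tmp/ssb-results/attack_results.json <<'EOF'
[
  {"case_id": "demo_case_a", "attack_outcome": "attack_success"},
  {"case_id": "demo_case_b", "attack_outcome": "attack_success"},
  {"case_id": "demo_case_c", "attack_outcome": "attack_not_observed"}
]
EOF

# Tally outcomes per category ("attack_outcome" is an assumed field name).
python3 - /tmp/ssb-results/attack_results.json <<'PY' | tee /tmp/ssb-results/tally.txt
import collections, json, sys

with open(sys.argv[1]) as f:
    results = json.load(f)
counts = collections.Counter(r.get("attack_outcome", "unknown") for r in results)
for outcome, n in sorted(counts.items()):
    print(f"{outcome}: {n}")
PY
# prints:
# attack_not_observed: 1
# attack_success: 2
```

For spreadsheet-based analysis, the same data is also exported as attack_results.csv in each run directory.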
