MT-JailBench is a benchmark framework for studying multi-turn jailbreak attacks. It supports component-level attack analysis, reusable attack/defense modules, resource-aware evaluation, reproducible experimentation, and more.
- Attack decomposition: Break multi-turn attacks into reusable components for analysis and recombination.
- Modular architecture: Mix and match attacks, defenses, datasets, and models.
- Resource-aware evaluation: Set resource budgets that reflect real-world constraints.
- Strong built-in baselines: Includes five multi-turn attacks from prior work.
- Unified LLM client: Access multiple LLM providers through a single client.
- Cross-validation judging: Re-evaluate successful attacks using alternative success criteria.
- Experiment checkpointing: Pause and resume experiments with automatic result caching.
- Well-engineered codebase: Built for reproducibility and extensibility—not obscure, one-off scripts.
Configure API keys for the LLM providers you plan to use:
```shell
# OpenAI / Azure OpenAI
export OPENAI_API_KEY=...
# Gemini
export GEMINI_API_KEY=...
# Anthropic
export ANTHROPIC_API_KEY=...
# OpenRouter
export OPENROUTER_API_KEY=...
# AWS Bedrock
export AWS_ACCESS_KEY_ID=...
export AWS_SECRET_ACCESS_KEY=...
export AWS_DEFAULT_REGION=...
# Azure OpenAI (find endpoint in Foundry portal)
export AZURE_OPENAI_ENDPOINT="https://xxxxxxxxx.cognitiveservices.azure.com/openai/v1/"
# Hugging Face (for downloading open-source models)
export HF_TOKEN=...
export HF_HOME=...  # Use $SCRATCH/LLMs if you're on NERSC
```

To set up the environment, simply run:

```shell
uv sync
```

If you also need a similarity model server (for using similarity-based attacks like COA):

```shell
uv run --extra sim-server sim-server
```

We do not provide general instructions for deploying LLMs locally. However, if you are using NERSC, the following steps may work.
```shell
# Install the conda environment for running vLLM
conda env create -f scripts/nersc/vllm.yaml

# Deploy the attacker model (defaults to Qwen2.5-32B-Instruct);
# uses a context window 4× larger than the target model below
./scripts/nersc/deploy_attacker.sh

# Deploy the target model (specify the model name)
./scripts/nersc/deploy_target.sh
```
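Before launching experiments, it can help to confirm that the relevant API keys are actually visible in your shell. The following is a small convenience sketch (not part of MT-JailBench) that reports which of the provider variables above are set:

```shell
# Convenience sketch (not part of MT-JailBench): report which of the
# given environment variables are set in the current shell.
check_keys() {
  for var in "$@"; do
    if [ -n "$(printenv "$var")" ]; then
      echo "$var: set"
    else
      echo "$var: missing"
    fi
  done
}

check_keys OPENAI_API_KEY GEMINI_API_KEY ANTHROPIC_API_KEY OPENROUTER_API_KEY
```

Any line reporting "missing" points to a provider whose requests will fail at runtime.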
Demo mode runs the multi-turn jailbreak workflow on a single harmful behavior. This mode is useful for quick experimentation and debugging.

```shell
uv run demo <attack_type>
```

Currently supported attack types: `crescendo`, `actor`, `coa`, `fitd`, `xteaming`, `mix`, `interactive` (see here for details).
Benchmark mode evaluates 159 harmful behaviors from the HarmBench dataset. It requires a YAML configuration file; see the sample config file for an example.
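To convey the kind of settings such a configuration file typically holds — attack choice, target model, and resource budget — here is a purely illustrative fragment. Every field name below is a guess for illustration, not the real schema; consult `./config/sample.yaml` for the actual format.

```yaml
# Illustrative only — all field names are hypothetical.
# See ./config/sample.yaml for the actual schema.
attack: crescendo        # one of the built-in attack types
target_model: gpt-4o     # model under evaluation
max_turns: 10            # resource budget per behavior
```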
```shell
# Step 1: Run the benchmark.
uv run benchmark --config ./config/sample.yaml --name sample_run

# Step 2: Summarize results.
uv run summarize --name sample_run

# Optional: Retroactively run an independent judge.
uv run retro-judge --name sample_run
```

- Step 1 runs the benchmark using the specified configuration file. Results are stored in an output directory named after `--name`. If the output directory already contains partial results, completed behaviors are automatically skipped.
- Step 2 summarizes the benchmark outputs and generates a `summary.json` file in the same output directory.
- The optional step retrofits an independent judge if it was not enabled during the original experiment. Warning: this command directly modifies output files.
Running MT-JailBench in benchmark mode creates the following output structure:
```
outputs/
└── experiment_name/
    ├── config.json
    ├── summary.json
    └── logs/
        ├── behavior_id_1.json
        ├── behavior_id_2.json
        └── ...
```
- `config.json` stores a copy of the run configuration for reproducibility. Rerunning a partially completed experiment with a different configuration will raise an error. If the configuration change is safe and intentional, update the saved config file or delete it to unblock the run.
- `summary.json` contains aggregated benchmark results and key metrics, such as attack success rate.
- `logs/` contains detailed logs for each behavior. These logs are useful for analysis, debugging, and auditing.
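Because each behavior gets its own log file, the layout above also makes it easy to check a run's progress from the shell. The helper below is a convenience sketch (not part of MT-JailBench) that counts per-behavior logs:

```shell
# Convenience sketch (not part of MT-JailBench): count how many behaviors
# in a run's output directory have a per-behavior log file.
count_logs() {
  # $1: path to an experiment output directory, e.g. outputs/experiment_name
  find "$1/logs" -name '*.json' | wc -l
}
```

For example, `count_logs outputs/experiment_name` compared against the 159 HarmBench behaviors shows how far a run has progressed.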
This README covers the basics of using MT-JailBench. More detailed documentation is available in `./docs`.
- Developer Guide: explains key terminology and the system architecture
- Unified LLM Client Guide: explains how to use the unified LLM client across different providers
- Attacks: explains how to use built-in attacks and how to implement new attacks