AgentBench

Automated benchmarking framework for evaluating AI agent security.

AgentBench runs standardized test suites against AI agents, produces comparable security scores, and generates reproducible security reports. It complements mcpwn: mcpwn performs active attacks, while AgentBench focuses on standardized, repeatable benchmarking.

Features

8 Benchmark Suites: Prompt injection, tool poisoning, SSRF, data exfiltration, jailbreak, memory attacks, multi-agent attacks, MCP attacks
Automated Scoring: Normalized 0-100 scores per category and overall
Comparative Analysis: Compare agents, versions, and configurations
Trend Analysis: Track security posture over time
Leaderboards: Generate ranked leaderboards from benchmark results
Multiple Report Formats: Console (Rich), JSON, HTML, Markdown
CI/CD Ready: GitHub Actions integration for regression testing
Reproducible: Fixed seeds and controlled test environments

Installation

git clone https://github.com/Carlos-Projects/agentbench.git
cd agentbench
pip install -e .

For development (includes test/lint dependencies):

pip install -e ".[dev]"

Quick Start

Run all benchmark suites against a target agent:

agentbench run http://localhost:8080 --agent-id my-agent

Run specific suites:

agentbench run http://localhost:8080 --suites prompt_injection,ssrf

Save reports:

agentbench run http://localhost:8080 --output results.json --html report.html

Example Output

$ agentbench info
AgentBench v0.2.0-dev
  Python: 3.13.9 | Platform: macOS-arm64

Suites: 8 built-in (64 test cases)
  prompt_injection   → 8 tests
  tool_poisoning     → 8 tests
  ssrf               → 8 tests
  data_exfiltration  → 8 tests
  jailbreak          → 8 tests
  memory_attacks     → 8 tests
  multi_agent        → 8 tests
  mcp_attack         → 8 tests

Output formats: console, JSON, HTML, Markdown, SARIF, MCPGuard, MCPscop
API formats: generic, openai, anthropic, mcp

$ agentbench suite --list
Available Benchmark Suites:
  prompt_injection    Tests agent resistance to prompt injection attacks
  tool_poisoning      Tests agent resistance to tool poisoning
  ssrf                Tests agent resistance to SSRF attacks
  data_exfiltration   Tests agent resistance to data exfiltration
  jailbreak           Tests agent resistance to jailbreak techniques
  memory_attacks      Tests agent resistance to memory poisoning
  multi_agent         Tests agent security in multi-agent scenarios
  mcp_attack          Tests MCP protocol-level security

Commands

`agentbench run`

Execute benchmark suites against a target agent.

agentbench run <TARGET_URL> [options]

Options:

--suites, -s: Comma-separated suite names, or all to run every suite (default: all)
--agent-id, -a: Agent identifier
--agent-version, -v: Agent version string
--output, -o: JSON output file
--html: HTML report output
--markdown, -m: Markdown report output
--timeout, -t: Request timeout in seconds (default: 30)
--header, -H: Additional HTTP headers
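When the benchmark is launched from a script or CI job, it can help to assemble the command line programmatically. The following is a minimal sketch that uses only the flags listed above; the helper name `build_run_command` and its parameters are hypothetical and not part of AgentBench.

```python
import shlex


def build_run_command(target_url, suites=None, agent_id=None,
                      output=None, html=None, timeout=None):
    """Build an argument list for `agentbench run` (hypothetical helper).

    Only flags documented in the options list are emitted.
    """
    cmd = ["agentbench", "run", target_url]
    if suites:
        # --suites takes a single comma-separated value
        cmd += ["--suites", ",".join(suites)]
    if agent_id:
        cmd += ["--agent-id", agent_id]
    if output:
        cmd += ["--output", output]
    if html:
        cmd += ["--html", html]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    return cmd


cmd = build_run_command(
    "http://localhost:8080",
    suites=["prompt_injection", "ssrf"],
    agent_id="my-agent",
    output="results.json",
    timeout=60,
)
# Print the command as it would be typed in a shell
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` to execute the benchmark. Passing a list rather than a shell string avoids quoting problems in the target URL.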

`agentbench suite`

List and inspect benchmark suites.

agentbench suite --list
agentbench suite --show prompt_injection

`agentbench compare`

Compare two benchmark reports.

agentbench compare baseline.json target.json
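To illustrate the idea behind a comparison, the sketch below computes per-category score differences between two runs. It assumes the per-category scores have already been extracted into plain dictionaries; the report file schema is not described here, so no file parsing is shown. The function name and the sample values are illustrative only.

```python
def compare_scores(baseline, target):
    """Return {category: (baseline_score, target_score, delta)}.

    A category missing from either side gets None for that score
    and None for the delta. Hypothetical helper, not AgentBench API.
    """
    rows = {}
    for category in sorted(set(baseline) | set(target)):
        b = baseline.get(category)
        t = target.get(category)
        delta = None if b is None or t is None else t - b
        rows[category] = (b, t, delta)
    return rows


# Made-up scores for illustration
baseline = {"prompt_injection": 75.0, "ssrf": 90.0}
target = {"prompt_injection": 87.5, "ssrf": 80.0, "jailbreak": 60.0}

for category, (b, t, delta) in compare_scores(baseline, target).items():
    change = "n/a" if delta is None else f"{delta:+.1f}"
    print(f"{category}: {b} -> {t} ({change})")
```

A positive delta means the target scored higher than the baseline in that category.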

`agentbench leaderboard`

Generate a leaderboard from multiple reports.

agentbench leaderboard report1.json report2.json report3.json

`agentbench report`

Convert reports between formats.

agentbench report results.json --html report.html --markdown report.md

`agentbench trend`

Analyze security trends over time.

agentbench trend v1.json v2.json v3.json
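As a rough sketch of what a trend over several reports can reveal, the code below takes overall scores in chronological order and flags the versions whose score dropped. The helper names and sample scores are hypothetical; how AgentBench computes trends internally is not documented here.

```python
def score_deltas(scores):
    """Differences between consecutive scores, oldest first."""
    return [later - earlier for earlier, later in zip(scores, scores[1:])]


def regressions(labels, scores):
    """Labels of runs whose score is lower than the run before them."""
    return [
        labels[i + 1]
        for i, delta in enumerate(score_deltas(scores))
        if delta < 0
    ]


# Made-up overall scores for three successive versions
labels = ["v1", "v2", "v3"]
scores = [70, 82, 78]

print(score_deltas(scores))
print(regressions(labels, scores))
```

In a regression-testing setup, a non-empty result from `regressions` could be used to fail the build.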

Benchmark Suites

| Suite | Category | Description | Tests |
| --- | --- | --- | --- |
| `prompt_injection` | Prompt Injection | Direct, indirect, role-playing, multi-language, code-context, nested, delimiter, JSON injection | 8 |
| `tool_poisoning` | Tool Poisoning | Argument injection, name confusion, permissions, output manipulation, chaining, overflow, discovery, type confusion | 8 |
| `ssrf` | SSRF | Internal IP, localhost, metadata, DNS rebinding, URL scheme, redirect, IPv6, port scanning | 8 |
| `data_exfiltration` | Data Exfiltration | Secret extraction, encoding, files, gradual, steganography, session, training data, error leaks | 8 |
| `jailbreak` | Jailbreak | Hypothetical, research, Base64, translation, decomposition, emotional, smuggling, awareness | 8 |
| `memory_attacks` | Memory Attacks | Poisoning, overflow, false memory, leak, context, history, tokens, persona | 8 |
| `multi_agent` | Multi-Agent | Inter-agent injection, cooperation, impersonation, consensus, delegation, cascade, coordinated, resources | 8 |
| `mcp_attack` | MCP Attack | Tool enumeration, argument injection, protocol downgrade, malformed JSON-RPC, resource access, prompt injection via tool response, notification flooding, ping/pong manipulation | 8 |

Scoring

Scores are normalized to a 0-100 scale:

90-100: Excellent security posture
70-89: Good security with minor gaps
50-69: Moderate security, improvements needed
0-49: Poor security, significant vulnerabilities
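The bands above can be expressed as a small lookup, for example to gate a CI job on a minimum rating. This is a sketch only: the function name is hypothetical, and treating a fractional score such as 89.5 as belonging to the lower band is an assumption, since the ranges are given as integers.

```python
def rating(score):
    """Map a 0-100 score to the rating bands listed above.

    Assumption: thresholds are inclusive lower bounds, so a score
    between two integer bands (e.g. 89.5) falls in the lower band.
    """
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    if score >= 90:
        return "Excellent security posture"
    if score >= 70:
        return "Good security with minor gaps"
    if score >= 50:
        return "Moderate security, improvements needed"
    return "Poor security, significant vulnerabilities"


for s in (95, 70, 49):
    print(s, rating(s))
```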

Integration

AgentBench integrates with the MCP security ecosystem:

mcpwn: Active attack framework (complementary: mcpwn attacks, AgentBench benchmarks)
MCPGuard: Runtime security proxy (AgentBench generates compatible policies)
MCPscop: Security dashboard (consumes AgentBench reports)
mcp-taxonomy: Shared security classification taxonomy

Development

# Install dev dependencies
pip install -e ".[dev]"

# Run tests
python -m pytest tests/ -v

# Lint
ruff check src/ tests/

# Type check
mypy src/agentbench/

# Coverage
coverage run -m pytest tests/ -v
coverage report

Contributing

See CONTRIBUTING.md for development guidelines.

Security

See SECURITY.md for security policies and disclosure.

License

MIT License - see LICENSE.
