A reproducible social deduction benchmark for evaluating agents using the Werewolf game. This repo provides a green evaluator (A2A server) and a purple agent (A2A player), plus local and Docker workflows for AgentBeats submissions.
This benchmark focuses on social reasoning in multi-agent settings:
- Strategic reasoning under uncertainty
- Deception and persuasion (wolves vs villagers)
- Resistance to manipulation
- Multi-step planning across rounds
Your purple agent plays Werewolf against NPCs. The green agent runs games, records logs, and computes aggregate metrics. Results are emitted as JSON and can be submitted to the leaderboard repo.
Data flow: AgentBeats -> Green (A2A) -> Purple (A2A) + NPCs -> Results -> Leaderboard
- Green Agent: https://agentbeats.dev/SulmanK/werewolfarena
- Purple Agent: https://agentbeats.dev/SulmanK/werewolfarena-purple
- Leaderboard Repo: https://github.com/SulmanK/Werewolf-Arena-Benchmark-Leaderboard
This repo uses deterministic heuristics and aggregate scoring (no external judge):
- Win rate, survival rate
- Vote accuracy, misvote rate, flip rates
- Seer discovery rate, doctor protection rate
- IRP (identity recognition proxy), VSS (vote skill score), KRE (key role effectiveness)
- Safety flags (toxic, pii)
See details in docs/deployment_design.md and docs/design.md.
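To make the aggregate-scoring approach concrete, here is a toy sketch of folding per-game records into a few of the metrics above. The field names (`won`, `survived`, `votes`, `target_was_wolf`) are illustrative assumptions, not the actual schema used by `scorer/`.

```python
# Toy aggregation sketch -- NOT the repo's scorer. Field names are assumptions.
from statistics import mean

def aggregate(games: list[dict]) -> dict:
    """Fold per-game records into benchmark-style aggregate metrics."""
    vote_hits = vote_total = 0
    for game in games:
        for vote in game.get("votes", []):        # one record per day-phase vote
            vote_total += 1
            vote_hits += int(vote["target_was_wolf"])
    return {
        "win_rate": mean(g["won"] for g in games),
        "survival_rate": mean(g["survived"] for g in games),
        "vote_accuracy": vote_hits / vote_total if vote_total else 0.0,
        "misvote_rate": 1 - vote_hits / vote_total if vote_total else 0.0,
    }
```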
Game mechanics are seeded (role assignment, order, tie-breaking). If you use an external LLM (Gemini proxy), outputs can still vary slightly across runs. For best stability:
- Keep `shuffle_seed` fixed
- Pin images/tags
- Use deterministic proxy settings (enabled by default in `purple/proxies/a2a_gemini_proxy.py`)
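For intuition, here is a minimal sketch of seed-driven determinism (role assignment and tie-breaking derived from one fixed seed). It is illustrative only, not the engine code in `benchmark/`.

```python
# Illustration of seeded game mechanics -- not the actual engine code.
import random

def assign_roles(players: list[str], seed: int) -> dict[str, str]:
    rng = random.Random(seed)                      # one RNG, seeded once per game
    roles = ["werewolf", "werewolf", "seer", "doctor"]
    roles += ["villager"] * (len(players) - len(roles))
    rng.shuffle(roles)
    return dict(zip(players, roles))

def break_tie(tied: list[str], seed: int, round_no: int) -> str:
    # Per-round derived seed so reruns resolve ties identically.
    return random.Random(f"{seed}:{round_no}").choice(sorted(tied))
```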
Single game:

```bash
python -m benchmark.runner --seed 123 --max-turns 4 --max-rounds 4 --output fixtures/golden_score.json --log-jsonl fixtures/sample_log.jsonl
```
Multi-seed aggregate:

```bash
python -m benchmark.multi --seeds-file configs/seeds.txt --max-turns 4 --max-rounds 4 --output fixtures/aggregate.json
```
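Conceptually, the multi-seed driver just repeats single-game runs and averages the scores. A rough sketch of that loop, assuming `configs/seeds.txt` holds one integer seed per line (the real logic lives in `benchmark.multi`):

```python
# Conceptual sketch of a multi-seed run: one game per seed, then averaged metrics.
# Assumes configs/seeds.txt holds one integer seed per line; not the repo's code.
import json
import subprocess
from statistics import mean

seeds = [int(line) for line in open("configs/seeds.txt") if line.strip()]
scores = []
for seed in seeds:
    out = f"fixtures/score_{seed}.json"
    subprocess.run(
        ["python", "-m", "benchmark.runner", "--seed", str(seed),
         "--max-turns", "4", "--max-rounds", "4", "--output", out],
        check=True,
    )
    with open(out) as f:
        scores.append(json.load(f))

# Average shared numeric top-level fields; the actual score schema lives in scorer/.
numeric = [k for k, v in scores[0].items() if isinstance(v, (int, float))]
aggregate = {k: mean(s[k] for s in scores) for k in numeric}
print(json.dumps(aggregate, indent=2))
```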
Agent vs NPC (LLM purple via A2A, role-balanced):

```bash
python -m benchmark.agent_vs_npc --a2a-endpoint http://localhost:8080 --num-games 12 --shuffle-seed 20206 --output fixtures/agent_vs_npc_12.json --log-dir fixtures/agent_vs_npc_logs
```

To use the Gemini proxy as the purple agent, start the proxy first, then run the benchmark (here with a sanity check):

```bash
python -m dotenv run -- python purple/proxies/a2a_gemini_proxy.py --model gemini-2.5-flash-lite --host 0.0.0.0 --port 8080 --log-dir logs
python -m benchmark.agent_vs_npc --a2a-endpoint http://localhost:8080 --num-games 12 --sanity-check 12 --shuffle-seed 20206 --output fixtures/agent_vs_npc_12.json --log-dir fixtures/agent_vs_npc_logs
```
Requires a local `.env` file with your Gemini key:

```
GEMINI_API_KEY=your_key_here
```
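Before kicking off a long run, you can check that the proxy (or any purple endpoint) is reachable by fetching its agent card. A standard-library-only sketch, assuming the default port 8080 used in the commands above:

```python
# Reachability check for the purple A2A endpoint (default port 8080 assumed).
import json
import urllib.request

url = "http://localhost:8080/.well-known/agent-card.json"
with urllib.request.urlopen(url, timeout=5) as resp:
    card = json.load(resp)
print("Agent card fields:", sorted(card))   # card contents depend on the agent
```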
Run the full evaluation with Docker Compose:

```bash
docker compose -f infra/docker-compose.agentbeats.yml up --build --abort-on-container-exit
```
To evaluate your own purple agent image:

- Build your purple agent image:

  ```bash
  docker build -t my-purple-agent -f infra/Dockerfile.purple .
  ```

- Update the compose file to point to your purple image (in `infra/docker-compose.agentbeats.yml`).
- Run the evaluation:

  ```bash
  docker compose -f infra/docker-compose.agentbeats.yml up --abort-on-container-exit
  ```

- Check the results:

  ```bash
  cat results/agentbeats_docker_results.json | jq .performance_metrics
  ```
Expected output format (shape):

```json
{
  "status": "complete",
  "num_games": 12,
  "games_completed": 12,
  "performance_metrics": {
    "irs": 0.0,
    "vrs": 0.0,
    "sr": 0.0,
    "win_rate": 0.0,
    "games_survived": 0,
    "games_won": 0,
    "total_games": 12
  },
  "roles_played": {
    "werewolf": 0,
    "villager": 0,
    "seer": 0,
    "doctor": 0
  },
  "advanced_metrics": {
    "avg_kre": 0.0,
    "avg_irp": 0.0,
    "avg_vss": 0.0,
    "safety_counts": {
      "toxic": 0,
      "pii": 0
    }
  }
}
```
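The same check as the `jq` one-liner above, in Python, if you want to post-process results programmatically:

```python
# Load a finished Docker run and print the headline metrics (same data jq shows above).
import json

with open("results/agentbeats_docker_results.json") as f:
    results = json.load(f)

assert results["status"] == "complete", "run did not finish cleanly"
print(json.dumps(results["performance_metrics"], indent=2))
print(json.dumps(results["advanced_metrics"], indent=2))
```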
Build, tag, and push the green and purple images to GHCR:

```bash
docker build -t agentbeats_green_agent -f infra/Dockerfile.green .
docker build -t agentbeats_purple_agent -f infra/Dockerfile.purple .
docker tag agentbeats_green_agent ghcr.io/sulmank/agentbeats-green:latest
docker tag agentbeats_purple_agent ghcr.io/sulmank/agentbeats-purple:latest
docker push ghcr.io/sulmank/agentbeats-green:latest
docker push ghcr.io/sulmank/agentbeats-purple:latest
```
Step 1: Build and push images (see GHCR section above).
Step 2: Register on AgentBeats
- Go to agentbeats.dev
- Register green and purple agents
- Copy both `agentbeats_id` values
Step 3: Submit to leaderboard
- Fork the leaderboard repo
- Add a GitHub Actions secret: `GEMINI_API_KEY`
- Update `scenario.toml`:

```toml
[green_agent]
agentbeats_id = "YOUR_GREEN_ID"
env = { GEMINI_API_KEY = "${GEMINI_API_KEY}" }

[[participants]]
agentbeats_id = "YOUR_PURPLE_ID"
name = "agent"
env = { GEMINI_API_KEY = "${GEMINI_API_KEY}" }

[config]
num_tasks = 40
```
- Commit/push, open PR, merge
In short: register both agents on AgentBeats, fork the leaderboard repo and edit `scenario.toml`, then push and open a PR; merging it updates the leaderboard.
Endpoints:

- `GET /.well-known/agent-card.json`
- `POST /` with a JSON body

Expected response schema:

```
{"type": "speak|vote|night_power", "content"?: "...", "target"?: "Name"}
```
Purple agents are expected to run as LLM-backed A2A services.
- HTTP A2A (recommended): run an external agent server that implements the A2A schema and point the benchmark at it via `--a2a-endpoint`.
- Python class (advanced): implement `AgentBase` and register it in `agents/registry.py` to plug in directly without an HTTP server.
To run your agent against the NPCs:

- Build or run your purple agent as an A2A HTTP service (it must return `speak`, `vote`, and `night_power` actions).
- Start the benchmark and point it at your agent:

  ```bash
  python -m benchmark.agent_vs_npc --a2a-endpoint http://localhost:8080 --num-games 12 --shuffle-seed 20206 --output fixtures/agent_vs_npc_12.json --log-dir fixtures/agent_vs_npc_logs
  ```

- Inspect `fixtures/agent_vs_npc_12.json` and the logs to compare results.
See purple/README.md for quickstart notes and the Gemini proxy path.
- `benchmark/`: game engine, runner, A2A protocol, logs
- `green_agent/`: A2A green evaluator server
- `agents/`: scripted baselines and adapters
- `purple/`: purple agent helpers and proxies
- `scorer/`: metrics and aggregation
- `infra/`: Dockerfiles and compose
- `scripts/`: Gemini proxy and helpers
- `docs/`: design and deployment docs
- `docs/deployment_design.md`
- `docs/design.md`
- `docs/design_agent_vs_npc.md`
```bash
python -m pytest scorer/tests/test_score.py benchmark/tests/test_protocol.py
```