AgentBeats Green - Werewolf Social Reasoning Benchmark

A reproducible social deduction benchmark for evaluating agents using the Werewolf game. This repo provides a green evaluator (A2A server) and a purple agent (A2A player), plus local and Docker workflows for AgentBeats submissions.

What this benchmark evaluates

This benchmark focuses on social reasoning in multi-agent settings:

  • Strategic reasoning under uncertainty
  • Deception and persuasion (wolves vs villagers)
  • Resistance to manipulation
  • Multi-step planning across rounds

How it works

Your purple agent plays Werewolf against NPCs. The green agent runs games, records logs, and computes aggregate metrics. Results are emitted as JSON and can be submitted to the leaderboard repo.

Data flow: AgentBeats -> Green (A2A) -> Purple (A2A) + NPCs -> Results -> Leaderboard
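
Concretely, each turn the green evaluator POSTs the current game context to the purple agent's A2A endpoint and expects a JSON action back. A hedged client-side sketch (the request fields "phase" and "history" are illustrative assumptions; the response shape matches the A2A contract documented below):

import json
import urllib.request

# Hedged sketch of one green -> purple A2A exchange. The request fields
# are illustrative assumptions, not the benchmark's actual payload.
def ask_purple(endpoint: str, phase: str, history: list[str]) -> dict:
    payload = json.dumps({"phase": phase, "history": history}).encode()
    req = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # e.g. {"type": "vote", "target": "Bob"}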

Metrics (current implementation)

This repo uses deterministic heuristics and aggregate scoring (no external judge):

  • Win rate, survival rate
  • Vote accuracy, misvote rate, flip rates
  • Seer discovery rate, doctor protection rate
  • IRP (identity recognition proxy), VSS (vote skill score), KRE (key role effectiveness)
  • Safety flags (toxic, pii)

See details in docs/deployment_design.md and docs/design.md.
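
For intuition, the aggregate rates are simple means over per-game records. A minimal sketch (the record fields "won", "survived", "votes_correct", and "votes_cast" are illustrative assumptions, not the actual schema in scorer/):

# Hedged sketch: aggregate win/survival/vote-accuracy over per-game records.
def aggregate(games: list[dict]) -> dict:
    n = max(len(games), 1)
    total_votes = sum(g["votes_cast"] for g in games)
    return {
        "win_rate": sum(g["won"] for g in games) / n,
        "survival_rate": sum(g["survived"] for g in games) / n,
        "vote_accuracy": sum(g["votes_correct"] for g in games) / max(total_votes, 1),
    }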

Reproducibility note

Game mechanics are seeded (role assignment, turn order, tie-breaking). If you use an external LLM (Gemini proxy), outputs can still vary slightly across runs. For best stability:

  • Keep shuffle_seed fixed
  • Pin images/tags
  • Use deterministic proxy settings (enabled by default in purple/proxies/a2a_gemini_proxy.py)
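
The seeding boils down to driving every shuffle and tie-break from one fixed random.Random instance. A minimal illustration (not the engine's actual code):

import random

# Illustrative only: the same seed yields the same role assignment.
def assign_roles(players: list[str], roles: list[str], seed: int) -> dict[str, str]:
    rng = random.Random(seed)  # one seeded RNG for all game mechanics
    shuffled = roles[:]
    rng.shuffle(shuffled)
    return dict(zip(players, shuffled))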

Quickstart (local)

Single game:

python -m benchmark.runner --seed 123 --max-turns 4 --max-rounds 4 --output fixtures/golden_score.json --log-jsonl fixtures/sample_log.jsonl

Multi-seed aggregate:

python -m benchmark.multi --seeds-file configs/seeds.txt --max-turns 4 --max-rounds 4 --output fixtures/aggregate.json

Agent vs NPC (LLM purple via A2A, role-balanced):

python -m benchmark.agent_vs_npc --a2a-endpoint http://localhost:8080 --num-games 12 --shuffle-seed 20206 --output fixtures/agent_vs_npc_12.json --log-dir fixtures/agent_vs_npc_logs

Tested commands (local)

python -m dotenv run -- python purple/proxies/a2a_gemini_proxy.py --model gemini-2.5-flash-lite --host 0.0.0.0 --port 8080 --log-dir logs
python -m benchmark.agent_vs_npc --a2a-endpoint http://localhost:8080 --num-games 12 --sanity-check 12 --shuffle-seed 20206 --output fixtures/agent_vs_npc_12.json --log-dir fixtures/agent_vs_npc_logs

Docker (local integration)

Requires a local .env file with your Gemini key:

GEMINI_API_KEY=your_key_here

Then run:

docker compose -f infra/docker-compose.agentbeats.yml up --build --abort-on-container-exit

Testing your agent (local)

  1. Build your purple agent image.

docker build -t my-purple-agent -f infra/Dockerfile.purple .

  2. Update the compose file to point to your purple image (in infra/docker-compose.agentbeats.yml).

  3. Run the evaluation.

docker compose -f infra/docker-compose.agentbeats.yml up --abort-on-container-exit

  4. Check the results.

jq .performance_metrics results/agentbeats_docker_results.json

Expected output format (shape):

{
  "status": "complete",
  "num_games": 12,
  "games_completed": 12,
  "performance_metrics": {
    "irs": 0.0,
    "vrs": 0.0,
    "sr": 0.0,
    "win_rate": 0.0,
    "games_survived": 0,
    "games_won": 0,
    "total_games": 12
  },
  "roles_played": {
    "werewolf": 0,
    "villager": 0,
    "seer": 0,
    "doctor": 0
  },
  "advanced_metrics": {
    "avg_kre": 0.0,
    "avg_irp": 0.0,
    "avg_vss": 0.0,
    "safety_counts": {
      "toxic": 0,
      "pii": 0
    }
  }
}
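
For a quick sanity check that a run produced this shape, a minimal sketch (the path matches the compose example above; the required keys mirror the JSON shown):

import json

# Hedged sketch: verify the results file has the expected top-level shape.
with open("results/agentbeats_docker_results.json") as f:
    results = json.load(f)

for key in ("status", "num_games", "performance_metrics", "advanced_metrics"):
    assert key in results, f"missing key: {key}"
assert results["games_completed"] <= results["num_games"]
print(results["performance_metrics"])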

Build and push images (GHCR)

docker build -t agentbeats_green_agent -f infra/Dockerfile.green .
docker build -t agentbeats_purple_agent -f infra/Dockerfile.purple .
docker tag agentbeats_green_agent ghcr.io/sulmank/agentbeats-green:latest
docker tag agentbeats_purple_agent ghcr.io/sulmank/agentbeats-purple:latest
docker push ghcr.io/sulmank/agentbeats-green:latest
docker push ghcr.io/sulmank/agentbeats-purple:latest

Deployment to AgentBeats

Step 1: Build and push images (see GHCR section above).

Step 2: Register on AgentBeats

  • Go to agentbeats.dev
  • Register green and purple agents
  • Copy both agentbeats_id values

Step 3: Submit to leaderboard

  • Fork the leaderboard repo
  • Add a GitHub Actions secret: GEMINI_API_KEY
  • Update scenario.toml:

[green_agent]
agentbeats_id = "YOUR_GREEN_ID"
env = { GEMINI_API_KEY = "${GEMINI_API_KEY}" }

[[participants]]
agentbeats_id = "YOUR_PURPLE_ID"
name = "agent"
env = { GEMINI_API_KEY = "${GEMINI_API_KEY}" }

[config]
num_tasks = 40

  • Commit, push, open a PR, and merge to update the leaderboard

A2A contract (purple agent)

Endpoints:

  • GET /.well-known/agent-card.json
  • POST / with JSON body

Expected response schema:

{"type": "speak|vote|night_power", "content"?: "...", "target"?: "Name"}

Add your own purple agent

Purple agents are expected to run as LLM-backed A2A services.

  1. HTTP A2A (recommended): run an external agent server that implements the A2A schema and point the benchmark at it via --a2a-endpoint.
  2. Python class (advanced): implement AgentBase and register it in agents/registry.py to plug in directly without an HTTP server (a minimal sketch follows).
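
A sketch of the class route, assuming AgentBase exposes an act-style hook (the import path, method name, and registration mechanism are assumptions; check agents/registry.py for the real signatures):

from agents.base import AgentBase  # import path assumed

class MyAgent(AgentBase):
    def act(self, observation: dict) -> dict:
        # Mirror the A2A response shape so the engine can consume it directly.
        return {"type": "vote", "target": observation.get("suspect", "")}

# Then register it, e.g. in agents/registry.py (mechanism assumed):
# REGISTRY["my_agent"] = MyAgent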

Contributing: add a purple agent in 3 steps

  1. Build or run your purple agent as an A2A HTTP service (must return speak, vote, night_power).
  2. Start the benchmark and point it at your agent:

python -m benchmark.agent_vs_npc --a2a-endpoint http://localhost:8080 --num-games 12 --shuffle-seed 20206 --output fixtures/agent_vs_npc_12.json --log-dir fixtures/agent_vs_npc_logs

  3. Inspect fixtures/agent_vs_npc_12.json and logs to compare results.

See purple/README.md for quickstart notes and the Gemini proxy path.

Repo map

  • benchmark/: game engine, runner, A2A protocol, logs
  • green_agent/: A2A green evaluator server
  • agents/: scripted baselines and adapters
  • purple/: purple agent helpers and proxies
  • scorer/: metrics and aggregation
  • infra/: Dockerfiles and compose
  • scripts/: Gemini proxy and helpers
  • docs/: design and deployment docs

Docs

  • docs/deployment_design.md
  • docs/design.md
  • docs/design_agent_vs_npc.md

Tests

python -m pytest scorer/tests/test_score.py benchmark/tests/test_protocol.py
