Heterogeneous Multi-Model AI Agent Swarm · Autonomous Offensive & Defensive Security Automation
This is a truly open-source, multi-model CTF-solving AI agent swarm. The goal is to live up to its very name — 無敵 · Project Muteki ("Invincible").
At its core, the project implements a scheduling scheme for AI agents that automatically and intelligently coordinates and controls each agent's context — like a swarm, each with its own division of labor, but all working toward the final goal. It currently supports commanding and dispatching only cursor, codex, and Claude Code. More kinds of CLI agents will be supported through continuous iteration.
Muteki exists to solve a specific problem: a single AI agent working toward a goal easily gets stuck in a loop at one spot, unable to pull itself out or reach the final goal, and a single agent is also extremely inefficient. I designed an architecture to solve this. It may not be perfect, but I'll keep iterating on it.
CTF is only the most basic capability. The core architecture is built for goal-driven multi-agent collaboration across all kinds of scenarios; in real-world testing it can autonomously complete penetration testing, code auditing, CTF solving, cybersecurity work, and more.
Muteki is an offensive security automation tool. It drives CLI agents to execute commands, invoke security tools, and reach target services; it does not promise to isolate malicious challenges.
It is recommended to run it only in a dedicated, disposable environment — a dedicated VPS, a throwaway VM, or a standalone machine with no sensitive data. Do not run it on your main workstation, a shared host, or a production environment. See SECURITY.md for details.
That said, I personally run it directly on my own computer all the time, because setting up the environment that way is more convenient.
At RIFFHACK 2026, fully automated for 3 hours with zero human takeover, it speed-ran and AK'd (solved every) challenge — taking 8th place.
On the Qianxin Yunjing penetration-testing range "blackmaze" — zero solves for three months — Muteki speed-ran the first blood in 2 hours. (Why does the platform show 39 hours? Because during that time I was dealing with all sorts of debugging, testing, and multi-flag mode support, which wasted a lot of time; the actual solving took only 2 hours.)
AK'd all badge scenarios on Qianxin Yunjing.
AK'd all categories of HackTheBox Insane and Hard difficulty.
For the full NYU CTF benchmark evaluation results, see the end of this document.
Many more first-bloods and high scores — across competitions you know and ones you don't — all bear Muteki's mark; I won't enumerate them one by one here.
In short, after a month of engineering optimization, architecture/capability tuning, and bug fixing, this project is now officially open-sourced — no star-baiting, no boastful marketing copy, no undermining your confidence, no sub-groups, no community gatekeeping, no money scams, no paywalls, no marketing — just open-sourced and shared directly.
You're welcome to use it and help build and upgrade it together. If you run into any problems, feel free to open an issue, and you're welcome to join the discussion group. Let's build the world's strongest CTF agent together.
Muteki points a group of heterogeneous coding agents (Claude Code / Codex / cursor-agent) at the same challenge, collaborating on a single shared blackboard: facts one of them discovers are usable by all, dead ends one of them walks are never retried by the others, and a flag is accepted only when it appears verbatim in real execution output. The core isn't "swap in a smarter brain" — it's heterogeneity + shared evidence + a provenance gate.
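The provenance gate can be illustrated with a minimal sketch. The `flag{...}` pattern and the `gate_flag` helper are hypothetical names chosen for this example, and the list of strings stands in for captured command output; Muteki's actual gate may be implemented differently.

```python
import re

# Hypothetical flag format; real challenges declare their own.
FLAG_RE = re.compile(r"flag\{[^}\n]+\}")


def gate_flag(candidate: str, execution_outputs: list[str]) -> bool:
    """Accept a candidate flag only if it appears verbatim in captured output.

    `execution_outputs` stands in for the raw stdout/stderr of commands a
    worker actually ran (an assumption about how evidence is stored).
    """
    if not FLAG_RE.fullmatch(candidate):
        return False  # wrong shape: reject before looking at evidence
    return any(candidate in output for output in execution_outputs)


outputs = ["$ cat /flag.txt\nflag{real_output_42}\n"]
accepted = gate_flag("flag{real_output_42}", outputs)  # seen in real output
rejected = gate_flag("flag{model_guess}", outputs)     # never printed by a tool
```

A flag that a model merely asserts, without it ever showing up in tool output, fails the check even when it has the right shape.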
So how does a worker hand its data to the platform, and how does it see its teammates' progress? It all relies on the muteki-blackboard skill built into every worker — this is the only data channel between a worker and the blackboard.
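As a rough illustration of a shared, append-only blackboard, the sketch below uses an in-memory SQLite table. The schema, the `post` and `read` helpers, and the worker name `cli-claude-1` are assumptions made for this example and are not the interface of the muteki-blackboard skill.

```python
import sqlite3

# Hypothetical schema: one append-only table, current state derived by query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "kind TEXT NOT NULL, worker TEXT NOT NULL, body TEXT NOT NULL)"
)


def post(kind: str, worker: str, body: str) -> None:
    """Append one event; nothing is ever updated in place."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO events (kind, worker, body) VALUES (?, ?, ?)",
            (kind, worker, body),
        )


def read(kind: str) -> list[str]:
    """Return every body of the given kind, oldest first, from all workers."""
    rows = conn.execute(
        "SELECT body FROM events WHERE kind = ? ORDER BY id", (kind,)
    )
    return [body for (body,) in rows]


post("fact", "cli-codex-2", "port 8080 serves a Flask app")
post("dead_end", "cli-claude-1", "SQL injection on /login")

# A different worker checks the board before choosing what to try.
already_ruled_out = "SQL injection on /login" in read("dead_end")
```

Because every worker reads the same table, a dead end recorded by one worker is visible to all the others before they spend time on it.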
For a detailed architecture explanation, see: docs/工作原理.md
The project follows a "less is more" principle: it injects no security tools and no security knowledge, keeps the network open, and lets workers improvise freely — writing and installing their own dependencies and scripts.
Web command deck: the run list on the left, the coordinator conversation stream in the middle, and a live run control panel with per-worker status on the right.
The outer ①②③④ are the four phases of a single run; the inner (1)~(5) are the per-tick collaboration loop of phase ③. All the hard work on tough challenges happens inside the ③ loop, and every read/write between a worker ↔ the blackboard inside that loop goes through the muteki-blackboard skill.
One lap of (1)→(5) is the heart of Muteki: the coordinator reads the blackboard → Reason plans the next step → an intent goes onto the blackboard → workers each claim one and run real commands → the results are written back to the blackboard via the skill (flags still have to pass the gate), then it reads again… one lap every 2 seconds — that's how hard challenges pile evidence thicker, lap after lap. The outer ①②③④ is the full timeline of a single run.
| Phase | When it starts | What it does | Output |
|---|---|---|---|
| ① Prepare | At the start of a run | Build the blackboard, stage attachments, health-check engines, install the skill, and (in container mode) start containers + reverse connection | Empty blackboard + available engines + channels wired up |
| ② Recon Race | Cold start only (skipped when re-examining an already-solved challenge) | Multiple engines single-shot the whole challenge in parallel for breadth-first recon | A flag (→ fast path) or a batch of facts |
| ③ Coordination main loop | When recon didn't solve it directly | (1)~(5) keeps looping, expanding the swarm as evidence grows | The blackboard keeps growing until there's enough for a flag |
| ④ Wind-down | Enough for a flag / operator stops / budget exhausted | Persist the winner, release claims, emit terminal events, clean up | RUN_FINISHED + a replayable blackboard |
To keep Muteki from falling into a dead-loop while working a single task, we set up a review mechanism: while Muteki executes the task, it periodically runs a review that checks and verifies the facts already recorded, then corrects course promptly whenever needed.
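A periodic review of this kind might look like the sketch below. The `review` and `should_review` helpers, the `still_holds` callback, and the interval of 10 ticks are all assumptions for illustration; the source does not say how often the review runs or how facts are verified.

```python
from typing import Callable


def review(
    facts: list[str],
    still_holds: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Re-check recorded facts and split them into (kept, retracted)."""
    kept: list[str] = []
    retracted: list[str] = []
    for fact in facts:
        (kept if still_holds(fact) else retracted).append(fact)
    return kept, retracted


def should_review(tick: int, every: int = 10) -> bool:
    """Run a review on every `every`-th tick (the interval is an assumption)."""
    return tick > 0 and tick % every == 0


facts = ["port 80 open", "admin password is hunter2"]
# Hypothetical verifier: only facts that can be re-confirmed survive.
kept, retracted = review(facts, lambda fact: fact.startswith("port"))
```

Retracting an unverified fact lets the planner stop building on it, which is one way a swarm can correct course instead of looping on a false premise.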
# 1. Bootstrap: install deps + run the quick test suite
./init.sh
# 2a. Web command deck — FastAPI backend (:8000) + Next UI (:3001)
./run.sh web
# Backend only: ./run.sh web --backend-only

The .env at the repo root is loaded automatically (copy it from .env.example); variables exported in your shell always take precedence. Configuration is done through MUTEKI_* environment variables.
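The precedence rule (shell exports win over .env) can be sketched as below. `load_dotenv_defaults` is a hypothetical helper written for this example, not Muteki's loader, and the sample values are placeholders.

```python
import tempfile
from collections.abc import MutableMapping
from pathlib import Path


def load_dotenv_defaults(path: Path, environ: MutableMapping[str, str]) -> None:
    """Fill in KEY=VALUE pairs from a .env file without overriding existing keys."""
    if not path.exists():
        return
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, _, value = line.partition("=")
        # setdefault keeps any value the shell already exported
        environ.setdefault(key.strip(), value.strip().strip('"').strip("'"))


with tempfile.TemporaryDirectory() as tmp:
    env_file = Path(tmp) / ".env"
    env_file.write_text(
        "MUTEKI_DEEPSEEK_API_KEY=sk-from-file\n"
        "MUTEKI_WORKER_IMAGE=muteki-worker:latest\n",
        encoding="utf-8",
    )
    env = {"MUTEKI_DEEPSEEK_API_KEY": "sk-from-shell"}  # pretend shell export
    load_dotenv_defaults(env_file, env)
```

Passing `os.environ` instead of the demo dict would apply the same rule to the real process environment.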
Recommended setting:
MUTEKI_DEEPSEEK_API_KEY=sk-xxxx
This is mainly the credential used to set up the Reason planner that plans the whole agent pipeline. You can also swap it for any other endpoint, and configure the model in the frontend settings. The default is DeepSeek, because it's relatively cost-effective.
If you don't set it, the main impact is that the Reason planner won't autonomously plan challenges or summarize progress.
- uv: Python toolchain and runner
- Python ≥ 3.13 (declared in pyproject.toml; managed by uv)
- Node.js: only needed for the web UI (apps/web/ui, Next.js)
- Go ≥ 1.26: only needed when building the in-container supervisor inside the worker image
- Docker: only needed for the container worker backend / building the worker image
- The engine CLIs you intend to use, available on your PATH (see below)
- This project has so far only been tested on macOS, not on Windows; handle accordingly.
Muteki shells out to the three closed-source agent CLIs below; install and authenticate whichever ones you want to use. Each has its own license and sends data back to its respective vendor:
| Engine | CLI | Vendor | Credential |
|---|---|---|---|
| claude | @anthropic-ai/claude-code | Anthropic | OAuth token (claude setup-token) |
| codex | @openai/codex | OpenAI | ~/.codex/auth.json (codex login) |
| cursor | cursor-agent (cursor.com/install) | Cursor | API key |
You need at least one of them to run. Beyond these three, you can also configure a custom OpenAI-compatible endpoint (base_url + key) in a worker profile — suitable for self-hosted or third-party models. Credentials are read from the macOS Keychain / environment and injected into the worker environment; see Credentials and SECURITY.md.
The three agents' credentials are configured along with the web settings. In local mode you can skip configuring them — you just need your subscription to be usable when you run the CLI yourself.
The remaining cases are generally for configuring remote or container environments, where container credential information is involved.
In container mode, or in other cases where you need to use a key, you can configure it as follows:
| Engine | File in the account directory | How to get it |
|---|---|---|
| claude | CLAUDE_CODE_OAUTH_TOKEN | claude setup-token |
| codex | codex-home/auth.json | codex login (copy ~/.codex/auth.json) |
| cursor | CURSOR_API_KEY | cursor.com → API key |
| Custom endpoint | API_KEY + BASE_URL | Any OpenAI-compatible vendor |
After saving, you can click "Save & test" at any time.
local vs container mode:
- In container mode an account is mandatory: the host login is not mounted into the container, so credentials are passed in at runtime through command injection and file mounts.
- In local mode, if no account is registered, the worker inherits the host CLI's existing login, though you can also configure an account manually.
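The two rules above amount to a small decision function. The sketch below uses a hypothetical `credential_source` helper and hypothetical return labels; it only models which source is used, not how Muteki actually injects or mounts anything.

```python
def credential_source(mode: str, account_registered: bool) -> str:
    """Decide where a worker's credentials come from.

    Returns "account" or "host-login" (labels invented for this sketch).
    """
    if mode not in ("local", "container"):
        raise ValueError(f"unknown mode: {mode!r}")
    if account_registered:
        return "account"  # a registered account is used in either mode
    if mode == "container":
        # the host login is not mounted into the container
        raise ValueError("container mode requires a registered account")
    return "host-login"  # local mode inherits the host CLI's existing login


local_default = credential_source("local", account_registered=False)
container_with_account = credential_source("container", account_registered=True)
```

Calling `credential_source("container", account_registered=False)` raises `ValueError`, matching the rule that an account is mandatory in container mode.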
The DeepSeek reasoning model (used by the coordinator, not a worker engine) is configured separately via MUTEKI_DEEPSEEK_API_KEY in .env.
For the credential trust model, see SECURITY.md.
To meet environment-isolation and containerization needs, I also provide a container mode. That said, this container mode hasn't been tested enough and isn't guaranteed to always work.
The container backend runs workers inside a single general-purpose Kali image (no more per-template/recipe variants), containing the full CTF toolchain + an offline knowledge base + the engine CLIs + the supervisor. No credentials are baked into the image — they are injected at runtime.
Pull the prebuilt image (recommended):
docker pull snowywar/muteki-worker:latest # or pin a version: :0.2.3

The code defaults to snowywar/muteki-worker:latest (the published image); use the MUTEKI_WORKER_IMAGE environment variable to override it with a different name/tag (e.g. MUTEKI_WORKER_IMAGE=snowywar/muteki-worker:0.2.3).
Or build from source:
./docker/worker/build.sh # → muteki-worker:0.2.3 + muteki-worker:latest
./docker/worker/build.sh snowywar/muteki-worker 0.2.0 # custom repo + version (for push)

The image is large (~19.7 GB: Kali headless + Ghidra + SageMath installed via conda + the offline knowledge base).
Since authentication logic isn't implemented yet, deploying on a public VPS server is not recommended for now. Working on it, working on it.
The recommended best practice is to launch it locally — log in and install the relevant workers, and start it whenever you like.
./run.sh web
# visit localhost:3001

You can also start it in container mode, but this part hasn't been thoroughly validated and may have hidden pitfalls; help with testing is welcome.
- After opening the project, you'll land on a page like this

- First, open the settings page in the bottom-left, check the engines you want to field, and configure your worker models.
For model selection: if you already hold the Cyber / CVP certification, I recommend Opus 4.8 and GPT-5.5; if not, I personally recommend GPT-5.4 and Opus 4.6. For Cursor I personally recommend Composer 2.5, which works wonders on easy challenges.
Of course, you can also configure custom domestic models via a custom base_url (DeepSeek, Kimi, GLM).

- For the runtime environment, local is recommended; if you have special needs you can choose container, which will remind you to configure the relevant credentials — please configure those yourself. You can click "Test model" to check whether it works correctly; the test invokes the agent and asks the model to repeat "ok".

- Next, you can configure your workers in detail; configuring them as shown in the picture is recommended.
The starting worker count is the number for the race phase; it follows your engine count and runs all three agent engines simultaneously until the flag is solved or the challenge times out. It's used for quickly grabbing first blood and quickly solving easy challenges.
The maximum worker count is recommended to stay around 5–6, because for web challenges too many workers could cause a DDoS-like situation.

- It's recommended to configure and test connectivity for the reasoning model here, for better planning and pacing of the challenge.

- Once everything is configured, you can click "Run self-check"; if there are no issues, save and close the settings page.
- The recommended prompting approach for solving a challenge is as follows:
- State the challenge description, category, name, website/URL, and flag format.
- The frontend also supports copy-paste and file upload, so you can directly upload attachment-based challenges.
- The "network" toggle in the picture controls whether the agent's own web-search capability is enabled; it's on by default, and turning it off is for benchmark evaluation.
- Ignore the local/container button — it's tied to the settings feature and may be removed later. Under "Advanced" you can manually specify the flag format and a few simple settings, which can be ignored.

- After starting, it initializes for about half a minute — initialization involves file setup and config-file setup, which is a bit slow — and then you'll enter the main page.


- After a challenge is solved, you can use the "x" in the top-right to report a specific flag as a false positive, which will spin workers back up to keep re-solving; you can click "Generate writeup" to generate it directly.
- The other pages are for viewing or exploring on your own — feel free to try and use them.
Muteki was fully evaluated on the NYU CTF Bench test set (CSAW 2017–2023, 200 challenges in total). The results are as follows:
In this evaluation, no security or reverse-engineering tools were preinstalled; only a single x86 Ubuntu 24 VPS was prepared as the evaluation environment.
Covering all six major categories and spanning the full CSAW difficulty range across 200 challenges, with a 30-minute budget per challenge:
| Metric | Value |
|---|---|
| Solved | 200 / 200 = 100% |
| Hard/Expert tier (difficulty leaderboard) | 36 / 36 all solved |
| Cumulative tokens | ~370 M |
| Cumulative cost | ~$214 |
| Solve time | median ~2–4 min (fastest 22 s) |
| Winners per engine | cursor 80 · claude 75 · codex 45 |
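As a quick sanity check on the table, the per-challenge averages and engine shares follow directly from the reported totals. The token and cost figures are the approximate values from the table, so the derived averages are approximate too.

```python
solved, total = 200, 200
tokens_m, cost_usd = 370, 214  # cumulative, approximate (from the table)
winners = {"cursor": 80, "claude": 75, "codex": 45}

solve_rate = solved / total                 # fraction of challenges solved
cost_per_challenge = cost_usd / total       # about 1.07 USD per challenge
tokens_per_challenge_m = tokens_m / total   # about 1.85 M tokens per challenge
share = {engine: n / total for engine, n in winners.items()}

# The per-engine winner counts account for every solved challenge.
winners_cover_all = sum(winners.values()) == solved
```

No single engine wins a majority here: cursor takes 40%, claude 37.5%, and codex 22.5% of the solves.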
The three engines' blind spots don't overlap — together they sweep all six categories, including CSAW top-tier challenge types such as V8-engine pwn, Windows remote privilege escalation, and 16 GB disk-image forensics. Full report: eval_nyu/_reports/FINAL_eval_report.md, with per-challenge details in eval_nyu/_reports/RESULTS.md.
Engine/model versions change as the CLIs update (workers shell out and run each CLI's own default model: Claude Opus 4.7 / GPT-5.5 / Cursor). Treat these numbers as a capability snapshot, not a leaderboard verdict.
| Path | Contents |
|---|---|
| muteki/ | Core: swarm/ (coordinator), solver/ (CLI driver, gate, control plane), models/, platform/, sandbox/ |
| apps/web/ | FastAPI backend (server.py) + Next.js operator UI (ui/) |
| apps/tui/ | Textual TUI command deck (unfinished) |
| cmd/runtime-agent/ | In-container Go supervisor (reverse-connects to the control plane) |
| docker/worker/ | Worker image (Dockerfile, build scripts, tool-awareness map) |
| muteki_kit/ | Small SDK helpers (e.g. flag submission) |
| scripts/ | eval / backtest harness |
| docs/ | eval reports + open-source-readiness review; design docs in docs/internal-design/ |
Each challenge you launch is a run. Its working path and structure under sessions/ is as follows — workers on both the host and container backends see the same layout:
sessions/
├── run-XXXX.jsonl # The "event stream" for this challenge: the source of truth for SSE replay / resume (one line = one event)
├── run-XXXX/ # The working root for this challenge
│ ├── uploads/ # Raw challenge files uploaded via the web (unprocessed; processed ones go to workspace/inputs)
│ └── workspace/ # The workspace for this challenge
│ ├── inputs/ # Immutable challenge inputs (content-addressed, CAS)
│ │ ├── objects/ # CAS object store (bucketed by sha256)
│ │ └── by-name/ # Symlinks from original filename → object
│ ├── shared/ # Artifacts shared between workers (CAS)
│ │ ├── objects/ # CAS object store
│ │ ├── links/ # Symlinks by name → object
│ │ └── index.jsonl # Shared-artifact index (a rebuildable materialized view)
│ ├── graph/
│ │ └── shared_graph.db # ★ Shared blackboard: event-sourced SQLite, the single source of truth (facts/intents/dead-ends/...)
│ ├── arts/ # Artifact store: tool output / transcript snapshots (<hex>.txt, addressed by artifact_id, peekable)
│ ├── workers/ # Each worker's own cwd (scratch)
│ │ └── cli-codex-2/ # One worker's working directory (agent temp files + relative symlinks into inputs/shared)
│ ├── homes/ # Each worker's isolated HOME (especially needed in container mode)
│ ├── final/ # Final artifacts
│ ├── tmp/ # Temp directory
│ ├── logs/ # Logs
│ ├── manifest.json # Workspace manifest: topology + inputs list + runtime metadata
│ ├── winner.json # The winning worker's continuation handle (for follow-ups / writeups / review after solving)
│ ├── writeup.md # The (post-solve generated) writeup, optional
│ └── .muteki_board.md # Blackboard snapshot: a Markdown version for workers to read directly
│
├── _secrets/accounts/<id>/ # Credential account store (dirs 0700 / files 0600, never enters the image or prompts)
├── _worker_config.json # Global worker config (engine roster / profiles)
└── _rail_meta.json # Rail metadata (names / order of the run list)
A few key points:
- run-XXXX.jsonl (the event history) and run-XXXX/ (the files that do the work) are linked by the same run id: the former can be replayed to the frontend, the latter is the workspace actually written to disk.
- inputs/ and shared/ are both content-addressed (CAS): the same file is stored only once, and worker directories are full of relative symlinks, so workers/ can be created and deleted at will without losing data.
- graph/shared_graph.db is the core: all of the blackboard's state lives here; workers read and write it through the muteki-blackboard skill.
- Wind-down only clears the non-winner scratch under workers/; shared/, graph/, arts/, final/, and winner.json are all kept, so a challenge can still be fully reviewed after it finishes.
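The content-addressed layout of inputs/ can be sketched as follows. `cas_put` is a hypothetical helper, and the two-character bucket prefix is an assumption; the tree only says that objects are bucketed by sha256 and that by-name/ holds symlinks to them.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def cas_put(root: Path, name: str, data: bytes) -> Path:
    """Store data under objects/<2 hex chars>/<sha256> and link by-name/<name> to it."""
    digest = hashlib.sha256(data).hexdigest()
    obj = root / "objects" / digest[:2] / digest
    obj.parent.mkdir(parents=True, exist_ok=True)
    if not obj.exists():  # identical content is stored only once
        obj.write_bytes(data)
    link = root / "by-name" / name
    link.parent.mkdir(parents=True, exist_ok=True)
    if link.is_symlink() or link.exists():
        link.unlink()
    # relative target, so the tree still resolves if it is moved or mounted elsewhere
    link.symlink_to(os.path.relpath(obj, link.parent))
    return obj


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "inputs"
    first = cas_put(root, "chall.bin", b"\x7fELF demo bytes")
    second = cas_put(root, "copy-of-chall.bin", b"\x7fELF demo bytes")
    same_object = first == second  # two names, one stored object
    linked_bytes = (root / "by-name" / "chall.bin").read_bytes()
```

Because names are only symlinks into the object store, deleting a worker's directory of links removes no data, which is the property the key points above rely on.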
uv run pytest # Python suite (live tests auto-skip when no key is set)
go test -C cmd/runtime-agent ./... # Go supervisor (the module lives under cmd/runtime-agent/)
( cd apps/web/ui && npx tsc --noEmit ) # UI type-check

- Add authentication logic
- Fully optimize and test the container mode
- Keep iterating and improving the web UI experience
- Support more agent worker types, e.g. pi, zai, opencode, etc.
- TUI mode
- Fully automatic crawling of CTF-platform challenges, with auto-solving, auto-submission, and auto report generation
Thanks to c3 for the cyber-range account — I burned through a lot of "grit" and farmed like crazy.
Thanks to master l4n for the inspiration — the newly added reviewer brought a qualitative leap in overall solving efficiency.
Thanks to master 陈橘墨 for the range resources and writeups, used for extensive testing and fine-tuning.
Thanks to Sam Altman for not banning my account. Now banned, I will always remember his name.
Thanks to Dario Amodei for not banning my account.
This project's design and evaluation drew on the following academic work:
- NYU CTF Bench: A Scalable Open-Source Benchmark Dataset for Evaluating LLMs in Offensive Security. Minghao Shao, Sofija Jancheska, Meet Udeshi, Brendan Dolan-Gavitt, et al. NeurIPS 2024 Datasets & Benchmarks Track. arXiv:2406.05590
- Teams of LLM Agents can Exploit Zero-Day Vulnerabilities. Richard Fang, Rohan Bindu, Akul Gupta, Daniel Kang. EACL 2026.
- D-CIPHER: Dynamic Collaborative Intelligent Multi-Agent System with Planner and Heterogeneous Executors for Offensive Security. Chenhui Zhang, et al. 2025. arXiv:2502.10931
- HackSynth: LLM Agent and Evaluation Framework for Autonomous Penetration Testing. Lajos Muzsai, David Imolai, András Lukács. 2024. arXiv:2412.01778
- CTFAgent: An LLM-powered Agent for CTF Challenge Solving. Jiaze Sun, et al. Computers & Security, 2025.
- Co-RedTeam: Orchestrated Security Discovery and Exploitation with LLM Agents. Jiahao Zhu, et al. 2025. arXiv:2602.02164








