Multi-Agent Code Review System

A code review system where a host agent dispatches a pull request to specialist agents written in three different agent frameworks, communicating over the A2A (Agent-to-Agent) protocol, with tool access via MCP (Model Context Protocol), optionally deployed to Vertex AI Agent Engine.

The point of three frameworks is to learn their tradeoffs firsthand — not to recommend this heterogeneity for production. The interview value is in being able to say "I picked LangGraph for X because Y, and here's where it bit me" with a code reference attached.

Start here for the narrative: docs/STUDY_GUIDE.md walks through what was built, the architecture top-down, every major decision in build order, the pitfalls that bit us, and model interview answers in matt's voice. ~30 minutes top-to-bottom; the doc you actually rehearse from.

Architecture

                    ┌─── Security Reviewer (LangGraph + RAG) ──┐
User → Mesop UI → Host Agent (Google ADK) ─A2A─├─── Style/Quality Crew (CrewAI) ──────────┤→ Aggregated verdict
                                                └─── Repo-Context Agent (ADK + MCP) ──────┘
                                                              │              │
                                          GitHub MCP server (tool access)    Local CWE corpus (Chroma)

Full rendered diagram (Mermaid) + sequence trace: docs/ARCHITECTURE.md.

Why each framework where it is

Component	Framework	Reason	Code
Security review	LangGraph + RAG	Reflection loop on confidence; state-machine semantics let the agent backtrack. RAG step grounds findings in MITRE CWE top-25 via local Chroma	`agents/security/graph.py`, `nodes.py`, `retrieval.py`
Style/quality review	CrewAI	Role-based crew (linter, naming, test-coverage agents + synthesizer) with sequential context handoff	`agents/style/crew.py`
Repo-context	Google ADK	`SequentialAgent(repo_explorer, synthesizer)` — explorer uses MCPToolset for GitHub, synthesizer turns prose into typed `ReviewReport`	`agents/repo/agent.py`
Host / orchestrator	Google ADK	`LlmAgent` exposes the three A2A peers as `FunctionTool`s; Gemini picks which to call based on input shape	`host/agent.py`

A2A vs MCP — the distinction this project demonstrates

A2A: peer agent ↔ peer agent. Standardized JSON-RPC + agent cards. Lets the host swap a specialist without code changes. See common/a2a.py (server side) and common/a2a_client.py (client).
MCP: agent ↔ tool. Used for the GitHub integration in the repo-context agent. See agents/repo/agent.py::_github_mcp_toolset.

These are often confused. The project uses both deliberately — side-by-side in docs/ARCHITECTURE.md.

What was built (3-week status: complete)

Week	Days	Deliverable	Key commits
1	1–7	LangGraph security agent end-to-end (parse → enumerate → investigate → critique → finalize loop)	`bff67c8`, `e17d95e`
2	8–14	CrewAI style crew, ADK repo+MCP, A2A wrap all three, ADK host with peers-as-tools	`e17d95e`, `23dfc08`, `c933b89`, `d9cff51`
3	15–21	Mesop UI, OTel distributed tracing, eval harness (5 golden + LLM judge), Vertex deploy scaffold, docs	`c88306b`, `1504000`, `9aa284b`, `6a8c052`
Beyond	—	pytest suite (35 tests), GitHub Actions CI, 8 ADRs, 61 flashcards, PORTFOLIO + STUDY_GUIDE, MIT LICENSE	`5250a42`, `81f6d6e`
Post-plan	—	RAG: local CWE corpus (25 entries) + Chroma index + retrieve_context LangGraph node + `cwe_id` grounding on findings	`agents/security/retrieval.py`, ADR-0009, rag.md

git log --oneline is the source of truth; the table above is the narrated version.

Quick start

# Once
cp .env.example .env  # fill GEMINI_API_KEY (and optionally GITHUB_TOKEN, GCP_PROJECT_ID)
uv sync

# Each in its own terminal
uv run python -m agents.security.server
uv run python -m agents.style.server
uv run python -m agents.repo.server   # optional, needs GITHUB_TOKEN + podman

# Then any one of
uv run python -m host.demo              # parallel asciinema demo
uv run python -m host.run               # ADK host orchestrator
uv run mesop frontend/main.py           # Mesop UI on :32123
uv run python -m evals.run --agent security    # eval harness

Demo recording script with timings: docs/demo.md.

Folder structure

multi-agent-code-review/
├── host/                       # ADK host agent + asyncio demo + Mesop entry shared by run.py
├── agents/
│   ├── security/               # LangGraph security reviewer
│   ├── style/                  # CrewAI style/quality crew
│   └── repo/                   # ADK repo-context agent + GitHub MCP
├── frontend/                   # Mesop UI
├── evals/                      # Golden diffs + rubric + LLM judge harness
├── deploy/                     # Vertex AI Agent Engine deploy CLI + slim ADK agent
├── common/                     # Shared schemas, A2A scaffolding, OTel init, config
└── docs/
    ├── ARCHITECTURE.md         # System + sequence diagrams, A2A vs MCP table
    ├── demo.md                 # Recording script with timings
    └── study/                  # Per-topic interview-prep Q&A

Observability

The host, all three specialist servers, and the Mesop frontend are instrumented with OpenTelemetry. A single review request produces one trace that spans every process: the host's host.review span has the A2A client spans as children, those propagate the W3C traceparent header to each peer's Starlette server span, and the peer's review.runner span lands as a grandchild — heterogeneous frameworks (LangGraph, CrewAI, ADK) stitched together because the shared seam is plain HTTP. See common/telemetry.py.

Switch exporters via OTEL_EXPORTER in .env:

Mode	Effect
`console` (default)	Print spans as JSON to stderr. No setup. Good for local dev.
`gcp`	Export to Google Cloud Trace. Needs `GCP_PROJECT_ID`.
`none`	Disable tracing entirely.

Vertex AI Agent Engine deploy

The full host + 3 peers system runs locally. For the "I've deployed an ADK agent to a managed runtime" interview talking point, deploy/ ships a slim standalone reviewer to Vertex AI Agent Engine:

# Once
gcloud storage buckets create gs://<name> --location=us-central1
export VERTEX_STAGING_BUCKET=gs://<name>

uv run python -m deploy.vertex deploy
uv run python -m deploy.vertex test --engine <resource-name>
uv run python -m deploy.vertex teardown --engine <resource-name>

deploy/agent.py is intentionally smaller than host/agent.py — no localhost peers, no MCP subprocess, just an ADK LlmAgent with a structured-output ReviewReport. The deploy story is the point; the agent's shape is incidental. See deploy/agent.py for the full rationale.

Cost plan

Stays free or pennies if:

Use the AI Studio API key (free tier, no card) for LLM calls.
Use Vertex AI Express Mode for the Day-19 deploy (~10 engines, 90 days, no billing required).
Tear down engines after the screenshot (deploy/vertex.py teardown).
Set a $1 budget alert in GCP Billing as a safety net.

Free-tier headroom (verified 2026-05): gemini-2.5-flash is 20 RPD on the free tier — burns out after ~5 full host reviews (host LLM + 3 peers + judge ≈ 5 calls each). gemini-2.5-flash-lite and gemini-2.0-flash have higher daily limits if you swap common/llm.py::DEFAULT_MODEL. Vertex Agent Engine: 50 vCPU-h + 100 GB-h/month free. New-account credits: $300 / 90 days as a buffer.

Interview talking points

Long-form Q&A by topic lives in docs/study/. The 2-minute version:

"When would you pick LangGraph over CrewAI?" — State-machine reasoning with reflection/backtracking vs. role-based collaboration. Concrete: the security agent's critique → investigate loop on low confidence (agents/security/graph.py) is awkward to express in CrewAI; CrewAI's "linter, naming, coverage, synthesizer" specialist pattern (agents/style/crew.py) is awkward to express as a graph.
"What's A2A vs MCP?" — A2A is peer-to-peer between agents; MCP is agent-to-tool. Both are JSON-RPC. The host uses A2A to reach the three specialists; the repo specialist uses MCP to reach GitHub. Code-level: common/a2a.py vs. agents/repo/agent.py::_github_mcp_toolset.
"How did you evaluate it?" — evals/: 5 golden diffs, each paired with YAML expectations (which categories must appear, which must NOT — the false-positive guard). Two tiers: a deterministic rubric (free, in CI) and an opt-in Gemini-as-judge tier with structured output.
"How does the host decide which peer to call?" — It doesn't, the LLM does. The host is an ADK LlmAgent with three FunctionTools and an instruction that disambiguates diff vs. PR-ref input. The tools' outputs are kept tiny (finding count, confidence, one-line summary) so Gemini's context stays lean; the full ReviewReport lands in session state via tool_context.state and the driver assembles AggregatedReport in Python after the run.
"How does tracing work across three frameworks?" — A2A is HTTP, and OTel's httpx + Starlette auto-instrumentation already speaks W3C traceparent. One trace per request, four service.names, none of the agent code needs to know tracing exists. See the trace shape in docs/ARCHITECTURE.md.
"What would you change for production?" Standardize on one framework; add A2A auth + retries + timeouts + partial-result aggregation; content-hash cache between peer and Gemini; explicit agent-disagreement resolution; per-peer circuit breakers. None of that was in scope here — the point was the comparison.
"Why three frameworks?" — Explicitly to learn the tradeoffs. In production I'd consolidate (probably on ADK or LangGraph, given each has the orchestration primitives I need). The exercise was understanding when each shines, and now I have a 5-line answer per framework instead of a vague preference.

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github/workflows		.github/workflows
agents		agents
common		common
deploy		deploy
docs		docs
evals		evals
frontend		frontend
host		host
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
langgraph-visual.html		langgraph-visual.html
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Multi-Agent Code Review System

Architecture

Why each framework where it is

A2A vs MCP — the distinction this project demonstrates

What was built (3-week status: complete)

Quick start

Folder structure

Observability

Vertex AI Agent Engine deploy

Cost plan

Interview talking points

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Multi-Agent Code Review System

Architecture

Why each framework where it is

A2A vs MCP — the distinction this project demonstrates

What was built (3-week status: complete)

Quick start

Folder structure

Observability

Vertex AI Agent Engine deploy

Cost plan

Interview talking points

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages