A code review system where a host agent dispatches a pull request to specialist agents written in three different agent frameworks, communicating over the A2A (Agent-to-Agent) protocol, with tool access via MCP (Model Context Protocol), optionally deployed to Vertex AI Agent Engine.
The point of three frameworks is to learn their tradeoffs firsthand — not to recommend this heterogeneity for production. The interview value is in being able to say "I picked LangGraph for X because Y, and here's where it bit me" with a code reference attached.
Start here for the narrative: docs/STUDY_GUIDE.md walks through what was built, the architecture top-down, every major decision in build order, the pitfalls that bit us, and model interview answers in matt's voice. ~30 minutes top-to-bottom; the doc you actually rehearse from.
┌─── Security Reviewer (LangGraph + RAG) ──┐
User → Mesop UI → Host Agent (Google ADK) ─A2A─├─── Style/Quality Crew (CrewAI) ──────────┤→ Aggregated verdict
└─── Repo-Context Agent (ADK + MCP) ──────┘
│ │
GitHub MCP server (tool access) Local CWE corpus (Chroma)
Full rendered diagram (Mermaid) + sequence trace: docs/ARCHITECTURE.md.
| Component | Framework | Reason | Code |
|---|---|---|---|
| Security review | LangGraph + RAG | Reflection loop on confidence; state-machine semantics let the agent backtrack. RAG step grounds findings in MITRE CWE top-25 via local Chroma | agents/security/graph.py, nodes.py, retrieval.py |
| Style/quality review | CrewAI | Role-based crew (linter, naming, test-coverage agents + synthesizer) with sequential context handoff | agents/style/crew.py |
| Repo-context | Google ADK | SequentialAgent(repo_explorer, synthesizer) — explorer uses MCPToolset for GitHub, synthesizer turns prose into typed ReviewReport |
agents/repo/agent.py |
| Host / orchestrator | Google ADK | LlmAgent exposes the three A2A peers as FunctionTools; Gemini picks which to call based on input shape |
host/agent.py |
- A2A: peer agent ↔ peer agent. Standardized JSON-RPC + agent
cards. Lets the host swap a specialist without code changes. See
common/a2a.py(server side) andcommon/a2a_client.py(client). - MCP: agent ↔ tool. Used for the GitHub integration in the
repo-context agent. See
agents/repo/agent.py::_github_mcp_toolset.
These are often confused. The project uses both deliberately — side-by-side in docs/ARCHITECTURE.md.
| Week | Days | Deliverable | Key commits |
|---|---|---|---|
| 1 | 1–7 | LangGraph security agent end-to-end (parse → enumerate → investigate → critique → finalize loop) | bff67c8, e17d95e |
| 2 | 8–14 | CrewAI style crew, ADK repo+MCP, A2A wrap all three, ADK host with peers-as-tools | e17d95e, 23dfc08, c933b89, d9cff51 |
| 3 | 15–21 | Mesop UI, OTel distributed tracing, eval harness (5 golden + LLM judge), Vertex deploy scaffold, docs | c88306b, 1504000, 9aa284b, 6a8c052 |
| Beyond | — | pytest suite (35 tests), GitHub Actions CI, 8 ADRs, 61 flashcards, PORTFOLIO + STUDY_GUIDE, MIT LICENSE | 5250a42, 81f6d6e |
| Post-plan | — | RAG: local CWE corpus (25 entries) + Chroma index + retrieve_context LangGraph node + cwe_id grounding on findings |
agents/security/retrieval.py, ADR-0009, rag.md |
git log --oneline is the source of truth; the table above is the
narrated version.
# Once
cp .env.example .env # fill GEMINI_API_KEY (and optionally GITHUB_TOKEN, GCP_PROJECT_ID)
uv sync
# Each in its own terminal
uv run python -m agents.security.server
uv run python -m agents.style.server
uv run python -m agents.repo.server # optional, needs GITHUB_TOKEN + podman
# Then any one of
uv run python -m host.demo # parallel asciinema demo
uv run python -m host.run # ADK host orchestrator
uv run mesop frontend/main.py # Mesop UI on :32123
uv run python -m evals.run --agent security # eval harnessDemo recording script with timings: docs/demo.md.
multi-agent-code-review/
├── host/ # ADK host agent + asyncio demo + Mesop entry shared by run.py
├── agents/
│ ├── security/ # LangGraph security reviewer
│ ├── style/ # CrewAI style/quality crew
│ └── repo/ # ADK repo-context agent + GitHub MCP
├── frontend/ # Mesop UI
├── evals/ # Golden diffs + rubric + LLM judge harness
├── deploy/ # Vertex AI Agent Engine deploy CLI + slim ADK agent
├── common/ # Shared schemas, A2A scaffolding, OTel init, config
└── docs/
├── ARCHITECTURE.md # System + sequence diagrams, A2A vs MCP table
├── demo.md # Recording script with timings
└── study/ # Per-topic interview-prep Q&A
The host, all three specialist servers, and the Mesop frontend are
instrumented with OpenTelemetry. A single review request produces one
trace that spans every process: the host's host.review span has the
A2A client spans as children, those propagate the W3C traceparent
header to each peer's Starlette server span, and the peer's
review.runner span lands as a grandchild — heterogeneous frameworks
(LangGraph, CrewAI, ADK) stitched together because the shared seam is
plain HTTP. See common/telemetry.py.
Switch exporters via OTEL_EXPORTER in .env:
| Mode | Effect |
|---|---|
console (default) |
Print spans as JSON to stderr. No setup. Good for local dev. |
gcp |
Export to Google Cloud Trace. Needs GCP_PROJECT_ID. |
none |
Disable tracing entirely. |
The full host + 3 peers system runs locally. For the "I've deployed an
ADK agent to a managed runtime" interview talking point, deploy/
ships a slim standalone reviewer to Vertex AI Agent Engine:
# Once
gcloud storage buckets create gs://<name> --location=us-central1
export VERTEX_STAGING_BUCKET=gs://<name>
uv run python -m deploy.vertex deploy
uv run python -m deploy.vertex test --engine <resource-name>
uv run python -m deploy.vertex teardown --engine <resource-name>deploy/agent.py is intentionally smaller than host/agent.py — no
localhost peers, no MCP subprocess, just an ADK LlmAgent with a
structured-output ReviewReport. The deploy story is the point; the
agent's shape is incidental. See deploy/agent.py for the full
rationale.
Stays free or pennies if:
- Use the AI Studio API key (free tier, no card) for LLM calls.
- Use Vertex AI Express Mode for the Day-19 deploy (~10 engines, 90 days, no billing required).
- Tear down engines after the screenshot (
deploy/vertex.py teardown). - Set a
$1budget alert in GCP Billing as a safety net.
Free-tier headroom (verified 2026-05): gemini-2.5-flash is 20
RPD on the free tier — burns out after ~5 full host reviews
(host LLM + 3 peers + judge ≈ 5 calls each). gemini-2.5-flash-lite
and gemini-2.0-flash have higher daily limits if you swap
common/llm.py::DEFAULT_MODEL. Vertex Agent Engine: 50 vCPU-h +
100 GB-h/month free. New-account credits: $300 / 90 days as a
buffer.
Long-form Q&A by topic lives in docs/study/. The
2-minute version:
-
"When would you pick LangGraph over CrewAI?" — State-machine reasoning with reflection/backtracking vs. role-based collaboration. Concrete: the security agent's
critique → investigateloop on low confidence (agents/security/graph.py) is awkward to express in CrewAI; CrewAI's "linter, naming, coverage, synthesizer" specialist pattern (agents/style/crew.py) is awkward to express as a graph. -
"What's A2A vs MCP?" — A2A is peer-to-peer between agents; MCP is agent-to-tool. Both are JSON-RPC. The host uses A2A to reach the three specialists; the repo specialist uses MCP to reach GitHub. Code-level:
common/a2a.pyvs.agents/repo/agent.py::_github_mcp_toolset. -
"How did you evaluate it?" —
evals/: 5 golden diffs, each paired with YAML expectations (which categories must appear, which must NOT — the false-positive guard). Two tiers: a deterministic rubric (free, in CI) and an opt-in Gemini-as-judge tier with structured output. -
"How does the host decide which peer to call?" — It doesn't, the LLM does. The host is an ADK
LlmAgentwith threeFunctionTools and an instruction that disambiguates diff vs. PR-ref input. The tools' outputs are kept tiny (finding count, confidence, one-line summary) so Gemini's context stays lean; the fullReviewReportlands in session state viatool_context.stateand the driver assemblesAggregatedReportin Python after the run. -
"How does tracing work across three frameworks?" — A2A is HTTP, and OTel's httpx + Starlette auto-instrumentation already speaks W3C
traceparent. One trace per request, fourservice.names, none of the agent code needs to know tracing exists. See the trace shape in docs/ARCHITECTURE.md. -
"What would you change for production?" Standardize on one framework; add A2A auth + retries + timeouts + partial-result aggregation; content-hash cache between peer and Gemini; explicit agent-disagreement resolution; per-peer circuit breakers. None of that was in scope here — the point was the comparison.
-
"Why three frameworks?" — Explicitly to learn the tradeoffs. In production I'd consolidate (probably on ADK or LangGraph, given each has the orchestration primitives I need). The exercise was understanding when each shines, and now I have a 5-line answer per framework instead of a vague preference.