Skip to content

MattDevOps/multi-agent-code-review

Repository files navigation

Multi-Agent Code Review System

CI

A code review system where a host agent dispatches a pull request to specialist agents written in three different agent frameworks, communicating over the A2A (Agent-to-Agent) protocol, with tool access via MCP (Model Context Protocol), optionally deployed to Vertex AI Agent Engine.

The point of three frameworks is to learn their tradeoffs firsthand — not to recommend this heterogeneity for production. The interview value is in being able to say "I picked LangGraph for X because Y, and here's where it bit me" with a code reference attached.

Start here for the narrative: docs/STUDY_GUIDE.md walks through what was built, the architecture top-down, every major decision in build order, the pitfalls that bit us, and model interview answers in matt's voice. ~30 minutes top-to-bottom; the doc you actually rehearse from.

Architecture

                    ┌─── Security Reviewer (LangGraph + RAG) ──┐
User → Mesop UI → Host Agent (Google ADK) ─A2A─├─── Style/Quality Crew (CrewAI) ──────────┤→ Aggregated verdict
                                                └─── Repo-Context Agent (ADK + MCP) ──────┘
                                                              │              │
                                          GitHub MCP server (tool access)    Local CWE corpus (Chroma)

Full rendered diagram (Mermaid) + sequence trace: docs/ARCHITECTURE.md.

Why each framework where it is

Component Framework Reason Code
Security review LangGraph + RAG Reflection loop on confidence; state-machine semantics let the agent backtrack. RAG step grounds findings in MITRE CWE top-25 via local Chroma agents/security/graph.py, nodes.py, retrieval.py
Style/quality review CrewAI Role-based crew (linter, naming, test-coverage agents + synthesizer) with sequential context handoff agents/style/crew.py
Repo-context Google ADK SequentialAgent(repo_explorer, synthesizer) — explorer uses MCPToolset for GitHub, synthesizer turns prose into typed ReviewReport agents/repo/agent.py
Host / orchestrator Google ADK LlmAgent exposes the three A2A peers as FunctionTools; Gemini picks which to call based on input shape host/agent.py

A2A vs MCP — the distinction this project demonstrates

  • A2A: peer agent ↔ peer agent. Standardized JSON-RPC + agent cards. Lets the host swap a specialist without code changes. See common/a2a.py (server side) and common/a2a_client.py (client).
  • MCP: agent ↔ tool. Used for the GitHub integration in the repo-context agent. See agents/repo/agent.py::_github_mcp_toolset.

These are often confused. The project uses both deliberately — side-by-side in docs/ARCHITECTURE.md.

What was built (3-week status: complete)

Week Days Deliverable Key commits
1 1–7 LangGraph security agent end-to-end (parse → enumerate → investigate → critique → finalize loop) bff67c8, e17d95e
2 8–14 CrewAI style crew, ADK repo+MCP, A2A wrap all three, ADK host with peers-as-tools e17d95e, 23dfc08, c933b89, d9cff51
3 15–21 Mesop UI, OTel distributed tracing, eval harness (5 golden + LLM judge), Vertex deploy scaffold, docs c88306b, 1504000, 9aa284b, 6a8c052
Beyond pytest suite (35 tests), GitHub Actions CI, 8 ADRs, 61 flashcards, PORTFOLIO + STUDY_GUIDE, MIT LICENSE 5250a42, 81f6d6e
Post-plan RAG: local CWE corpus (25 entries) + Chroma index + retrieve_context LangGraph node + cwe_id grounding on findings agents/security/retrieval.py, ADR-0009, rag.md

git log --oneline is the source of truth; the table above is the narrated version.

Quick start

# Once
cp .env.example .env  # fill GEMINI_API_KEY (and optionally GITHUB_TOKEN, GCP_PROJECT_ID)
uv sync

# Each in its own terminal
uv run python -m agents.security.server
uv run python -m agents.style.server
uv run python -m agents.repo.server   # optional, needs GITHUB_TOKEN + podman

# Then any one of
uv run python -m host.demo              # parallel asciinema demo
uv run python -m host.run               # ADK host orchestrator
uv run mesop frontend/main.py           # Mesop UI on :32123
uv run python -m evals.run --agent security    # eval harness

Demo recording script with timings: docs/demo.md.

Folder structure

multi-agent-code-review/
├── host/                       # ADK host agent + asyncio demo + Mesop entry shared by run.py
├── agents/
│   ├── security/               # LangGraph security reviewer
│   ├── style/                  # CrewAI style/quality crew
│   └── repo/                   # ADK repo-context agent + GitHub MCP
├── frontend/                   # Mesop UI
├── evals/                      # Golden diffs + rubric + LLM judge harness
├── deploy/                     # Vertex AI Agent Engine deploy CLI + slim ADK agent
├── common/                     # Shared schemas, A2A scaffolding, OTel init, config
└── docs/
    ├── ARCHITECTURE.md         # System + sequence diagrams, A2A vs MCP table
    ├── demo.md                 # Recording script with timings
    └── study/                  # Per-topic interview-prep Q&A

Observability

The host, all three specialist servers, and the Mesop frontend are instrumented with OpenTelemetry. A single review request produces one trace that spans every process: the host's host.review span has the A2A client spans as children, those propagate the W3C traceparent header to each peer's Starlette server span, and the peer's review.runner span lands as a grandchild — heterogeneous frameworks (LangGraph, CrewAI, ADK) stitched together because the shared seam is plain HTTP. See common/telemetry.py.

Switch exporters via OTEL_EXPORTER in .env:

Mode Effect
console (default) Print spans as JSON to stderr. No setup. Good for local dev.
gcp Export to Google Cloud Trace. Needs GCP_PROJECT_ID.
none Disable tracing entirely.

Vertex AI Agent Engine deploy

The full host + 3 peers system runs locally. For the "I've deployed an ADK agent to a managed runtime" interview talking point, deploy/ ships a slim standalone reviewer to Vertex AI Agent Engine:

# Once
gcloud storage buckets create gs://<name> --location=us-central1
export VERTEX_STAGING_BUCKET=gs://<name>

uv run python -m deploy.vertex deploy
uv run python -m deploy.vertex test --engine <resource-name>
uv run python -m deploy.vertex teardown --engine <resource-name>

deploy/agent.py is intentionally smaller than host/agent.py — no localhost peers, no MCP subprocess, just an ADK LlmAgent with a structured-output ReviewReport. The deploy story is the point; the agent's shape is incidental. See deploy/agent.py for the full rationale.

Cost plan

Stays free or pennies if:

  1. Use the AI Studio API key (free tier, no card) for LLM calls.
  2. Use Vertex AI Express Mode for the Day-19 deploy (~10 engines, 90 days, no billing required).
  3. Tear down engines after the screenshot (deploy/vertex.py teardown).
  4. Set a $1 budget alert in GCP Billing as a safety net.

Free-tier headroom (verified 2026-05): gemini-2.5-flash is 20 RPD on the free tier — burns out after ~5 full host reviews (host LLM + 3 peers + judge ≈ 5 calls each). gemini-2.5-flash-lite and gemini-2.0-flash have higher daily limits if you swap common/llm.py::DEFAULT_MODEL. Vertex Agent Engine: 50 vCPU-h + 100 GB-h/month free. New-account credits: $300 / 90 days as a buffer.

Interview talking points

Long-form Q&A by topic lives in docs/study/. The 2-minute version:

  1. "When would you pick LangGraph over CrewAI?" — State-machine reasoning with reflection/backtracking vs. role-based collaboration. Concrete: the security agent's critique → investigate loop on low confidence (agents/security/graph.py) is awkward to express in CrewAI; CrewAI's "linter, naming, coverage, synthesizer" specialist pattern (agents/style/crew.py) is awkward to express as a graph.

  2. "What's A2A vs MCP?" — A2A is peer-to-peer between agents; MCP is agent-to-tool. Both are JSON-RPC. The host uses A2A to reach the three specialists; the repo specialist uses MCP to reach GitHub. Code-level: common/a2a.py vs. agents/repo/agent.py::_github_mcp_toolset.

  3. "How did you evaluate it?"evals/: 5 golden diffs, each paired with YAML expectations (which categories must appear, which must NOT — the false-positive guard). Two tiers: a deterministic rubric (free, in CI) and an opt-in Gemini-as-judge tier with structured output.

  4. "How does the host decide which peer to call?" — It doesn't, the LLM does. The host is an ADK LlmAgent with three FunctionTools and an instruction that disambiguates diff vs. PR-ref input. The tools' outputs are kept tiny (finding count, confidence, one-line summary) so Gemini's context stays lean; the full ReviewReport lands in session state via tool_context.state and the driver assembles AggregatedReport in Python after the run.

  5. "How does tracing work across three frameworks?" — A2A is HTTP, and OTel's httpx + Starlette auto-instrumentation already speaks W3C traceparent. One trace per request, four service.names, none of the agent code needs to know tracing exists. See the trace shape in docs/ARCHITECTURE.md.

  6. "What would you change for production?" Standardize on one framework; add A2A auth + retries + timeouts + partial-result aggregation; content-hash cache between peer and Gemini; explicit agent-disagreement resolution; per-peer circuit breakers. None of that was in scope here — the point was the comparison.

  7. "Why three frameworks?" — Explicitly to learn the tradeoffs. In production I'd consolidate (probably on ADK or LangGraph, given each has the orchestration primitives I need). The exercise was understanding when each shines, and now I have a 5-line answer per framework instead of a vague preference.

About

Multi-agent code review system: LangGraph + CrewAI + Google ADK specialists, A2A protocol fan-out, MCP-powered GitHub context, OpenTelemetry distributed tracing, Vertex AI Agent Engine deploy. Python, Gemini, Mesop UI.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors