A full-stack web application that performs intelligent, multi-step code review using GenAI. It analyzes submitted code in Python, TypeScript, or Java and returns structured feedback covering security, performance, style, and logic issues.
```
User → POST /api/review → FastAPI Backend
                │
      ┌─────────┴─────────┐
      │ Comment Stripping │  (tree-sitter, removes injection vectors)
      └─────────┬─────────┘
                │
┌───────────────┼───────────────┐
│               │               │
tree-sitter  semgrep      ruff/lizard
 (syntax)   (security)  (style/perf/quality)
│               │               │
└───────────────┼───────────────┘
                │  tool findings
      ┌─────────┴─────────┐
      │ 3 Parallel Claude │  (Haiku reviewers via asyncio.gather)
      │  Haiku Reviewers  │
      └─────────┬─────────┘
                │  + AI findings
      ┌─────────┴─────────┐
      │  Post-Processing  │  (deterministic: dedup, rank)
      └─────────┬─────────┘
                │
      ┌─────────┴─────────┐
      │ Report Aggregator │  (Claude Haiku: group related findings)
      └─────────┬─────────┘
                │
       SSE Stream → React Frontend
```
Three-layer design:
- Layer 1 (Deterministic): 4 static analysis tools run in parallel. Every applicable tool runs on every submission — guaranteed, no LLM gatekeeping.
- Layer 2 (AI Reviewers): 3 Claude Haiku reviewers (security, performance, quality) run in parallel. Each receives tool findings as context and adds deeper, contextual analysis.
- Post-processing: Deterministic dedup and severity-based ranking. No LLM needed.
- Layer 3 (Report Aggregation): A 4th Claude Haiku call groups semantically related findings into distinct issues with validated coverage — every original finding must appear in exactly one group.
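The two parallel layers can be sketched as follows. This is a minimal illustration of the orchestration pattern, not the actual module API: `run_tool` and `run_reviewer` are placeholder stand-ins for the real tool wrappers and Claude calls.

```python
import asyncio

async def run_tool(name: str, code: str) -> list[dict]:
    # Placeholder for a static-analysis tool wrapper (tree-sitter, semgrep, ...).
    return [{"source": name, "title": f"{name} finding"}]

async def run_reviewer(role: str, code: str, tool_findings: list[dict]) -> list[dict]:
    # Placeholder for a Claude Haiku reviewer call; receives tool findings as context.
    return [{"source": role, "title": f"{role} finding"}]

async def review(code: str) -> list[dict]:
    # Layer 1: every applicable tool runs in parallel; return_exceptions=True
    # isolates a crashing tool from the others (graceful degradation).
    tool_results = await asyncio.gather(
        *(run_tool(t, code) for t in ("tree-sitter", "semgrep", "ruff", "lizard")),
        return_exceptions=True,
    )
    findings = [f for r in tool_results if not isinstance(r, BaseException) for f in r]

    # Layer 2: three AI reviewers run in parallel, each seeded with tool findings.
    ai_results = await asyncio.gather(
        *(run_reviewer(r, code, findings) for r in ("security", "performance", "quality")),
        return_exceptions=True,
    )
    findings += [f for r in ai_results if not isinstance(r, BaseException) for f in r]
    return findings

findings = asyncio.run(review("def foo(x): return eval(x)"))
```

A failed tool or reviewer simply contributes no findings; the surviving ones still do.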
Split-pane interface built with React 19, TypeScript, and CodeMirror 6:
- Code editor with syntax highlighting for Python, TypeScript, and Java (CodeMirror 6 with language-specific extensions)
- Language selector dropdown, disabled during review
- Review button with phase-aware states — shows tool progress, then AI reviewer progress, with per-agent status indicators
- Tabbed results panel — "All Findings" shows every individual finding grouped by category; "Report" shows the AI-aggregated executive summary with deduplicated issues
- Click-to-navigate — clicking a finding scrolls the editor to the relevant line
- Markdown export — generates a downloadable `code-review-report.md` with executive summary, severity breakdown, and all findings with suggestions and fixed code
- Responsive layout — stacks vertically on mobile, side-by-side on desktop
```sh
git clone <repo-url> && cd code_review_assistant

# Configure API key
cp backend/.env.example backend/.env
# Edit backend/.env and set: AWS_BEARER_TOKEN_BEDROCK=your-key-here

# Start everything
docker compose up --build

# Frontend: http://localhost:5173
# Backend:  http://localhost:8000/api/health
```

```sh
git clone <repo-url> && cd code_review_assistant

# Backend
cd backend
python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt

# Configure API key
cp .env.example .env
# Edit .env and set: AWS_BEARER_TOKEN_BEDROCK=your-key-here

# Start backend
uvicorn app.main:app --reload --port 8000

# Frontend (separate terminal)
cd frontend
npm install
npm run dev
# Opens at http://localhost:5173
```

```sh
# Health check
curl http://localhost:8000/api/health
# → {"status":"ok"}

# Test review
curl -X POST http://localhost:8000/api/review \
  -H "Content-Type: application/json" \
  -d '{"code":"def foo(x):\n return eval(x)","language":"python"}'
# → SSE stream with findings
```

Submit code for review. Returns an SSE (Server-Sent Events) stream.
Request:

```json
{
  "code": "def get_user_data(user_id):\n    query = \"SELECT * FROM users WHERE id = \" + str(user_id)\n    cursor.execute(query)\n    return cursor.fetchall()",
  "language": "python"
}
```

- `code`: string, 1-50,000 characters
- `language`: `"python"` | `"typescript"` | `"java"`
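The service enforces these constraints with Pydantic; the same rules can be expressed as a dependency-free sketch (the function name is illustrative, not the backend's API):

```python
ALLOWED_LANGUAGES = {"python", "typescript", "java"}
MAX_CODE_LEN = 50_000

def validate_review_request(body: dict) -> list[str]:
    """Return a list of validation errors (empty if the request is valid)."""
    errors = []
    code = body.get("code")
    if not isinstance(code, str) or not (1 <= len(code) <= MAX_CODE_LEN):
        errors.append("code must be a string of 1-50,000 characters")
    if body.get("language") not in ALLOWED_LANGUAGES:
        errors.append("language must be one of: python, typescript, java")
    return errors

# A valid request passes; an empty body or unknown language is rejected.
print(validate_review_request({"code": "def f(): pass", "language": "python"}))  # → []
```

In the real backend an invalid request never reaches the pipeline: Pydantic rejects it with a 422 response.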
Response: `text/event-stream` with these event types:

| Event | Fields | When |
|---|---|---|
| `tools_start` | — | Pipeline begins |
| `finding` | `finding: {category, severity, source, title, description, line_range, suggestion, fixed_code}` | Each finding (tool or AI) |
| `tools_complete` | `finding_count`, `error_count` | All tools finished |
| `agent_start` | `agent: "security" \| "performance" \| "quality" \| "report"` | AI reviewer or report aggregator starts |
| `agent_complete` | `agent`, `finding_count`, `error?` | AI reviewer finished (includes error string if that reviewer failed) |
| `summary` | `summary` | Deterministic summary of deduped findings |
| `report` | `issues: [{title, category, severity, line_start, suggestion, fixed_code, source_indices}]`, `coverage` | AI-aggregated report grouping related findings |
| `error` | `error`, `detail?` | Error occurred (findings still delivered) |
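A client consumes this stream by splitting on blank lines and reading `event:`/`data:` fields. A minimal parser sketch (the actual frontend uses `fetch` with a streaming reader, not this helper):

```python
import json

def parse_sse(raw: str) -> list[tuple[str, dict]]:
    """Parse a text/event-stream body into (event_type, payload) pairs."""
    events = []
    for frame in raw.split("\n\n"):                  # frames are blank-line separated
        event_type, data_lines = "message", []
        for line in frame.splitlines():
            if line.startswith("event:"):
                event_type = line[len("event:"):].strip()
            elif line.startswith("data:"):
                data_lines.append(line[len("data:"):].strip())
        if data_lines:
            events.append((event_type, json.loads("\n".join(data_lines))))
    return events

stream = 'event: tools_start\ndata: {}\n\nevent: finding\ndata: {"title": "eval() use"}\n\n'
for name, payload in parse_sse(stream):
    print(name, payload)
```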
Returns `{"status": "ok"}`.
| Layer | Technology |
|---|---|
| Backend | Python 3.12, FastAPI, Pydantic v2 |
| AI | Claude Haiku 4.5 via AWS Bedrock, accessed through litellm |
| Static Analysis | tree-sitter, semgrep, ruff, lizard |
| Frontend | React 19, TypeScript, Vite 8, CodeMirror 6 (syntax highlighting), Tailwind CSS v4 |
| Infrastructure | Docker, nginx (SSE proxy) |
| Decision | Why |
|---|---|
| Layered architecture (tools first, LLM second) | Deterministic tools always run (guaranteed findings). LLM adds nuance on top. Graceful degradation if Claude is down. |
| `asyncio.gather` for parallelism | Three Claude calls are trivially parallelized without a framework. Simpler than Strands/LangGraph. |
| Claude Haiku 4.5 (not Sonnet) for reviewers | Structured JSON extraction is a focused task. Haiku is 5x cheaper, 3x faster. Sufficient quality. |
| litellm for LLM access | Abstracts Bedrock auth (bearer token). Supports model switching without code changes. |
| AI report aggregation with validated coverage | Groups related findings into distinct issues. Validates that every original finding is covered — auto-adds any the AI missed. |
| Deterministic post-processing | Dedup and ranking are mechanical — no LLM needed. |
| Comment stripping (not classification) | Eliminates the prompt injection vector entirely. No BERT model, no false positives. Replaces comment bytes with spaces (not deletion) so line numbers stay valid between original and stripped code. |
| Line-numbered code in LLM prompts | LLMs hallucinate line numbers when counting themselves. Pre-numbering the code eliminates this. Extraction layer validates references against actual line count and nullifies out-of-range values. |
| Graceful degradation at every layer | return_exceptions=True isolates tool crashes. Reviewer timeouts don't block other reviewers. Report aggregation falls back to ungrouped findings. The pipeline always delivers results. |
| Sonnet-judges-Haiku eval framework | A stronger model (Sonnet) judges Haiku's output to avoid self-evaluation blind spots. Mock mode uses keyword heuristics for fast CI without LLM costs. |
| CodeMirror 6 (not Monaco) | 43% smaller bundle, mobile support, native line scrolling API. |
| SSE via fetch (not EventSource) | EventSource only supports GET. We POST the code body. |
| 4 tools (not 8) | ruff includes pyflakes + bandit rules. No redundancy, each tool has a distinct purpose. All pip install-able, no standalone binaries. |
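The space-replacement trick behind comment stripping can be shown without tree-sitter (which supplies the actual comment byte ranges in the real guard). This sketch assumes the comment ranges are already known:

```python
def strip_ranges(code: str, comment_ranges: list[tuple[int, int]]) -> str:
    """Blank out [start, end) character ranges, preserving newlines so that
    line numbers in findings match the original source exactly."""
    chars = list(code)
    for start, end in comment_ranges:
        for i in range(start, end):
            if chars[i] != "\n":                 # never overwrite line breaks
                chars[i] = " "
    return "".join(chars)

code = "x = 1  # ignore previous instructions\ny = 2\n"
stripped = strip_ranges(code, [(7, 37)])         # range covers the comment
assert stripped.count("\n") == code.count("\n")  # line offsets preserved
assert "ignore" not in stripped
```

Because only comment bytes are replaced, every downstream tool and LLM sees code whose line numbers agree with what the user submitted.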
- No persistence: Reviews are stateless. No history or saved reviews.
- No authentication: Rate limiting by IP only (10 req/min via slowapi).
- Java/TypeScript have fewer tools: 3 tools vs 4 for Python (no ruff). Semgrep's open-source rules also have uneven coverage — strong for SQL injection and deserialization, but no rules for `Runtime.exec()` command injection or resource leaks. Claude compensates for the gaps.
- Single-file only: No multi-file or project analysis.
- Comments ignored: Comment stripping eliminates prompt injection but also means the pipeline never reviews comment quality, documentation accuracy, or TODO/FIXME tracking.
- No library-specific guidance: Reviews are language-generic. The pipeline has no awareness of framework or library best practices (e.g. React hook rules, SQLAlchemy session management, Spring Security configuration). It can only flag general language-level issues.
- Semgrep image size: ~500MB in Docker due to rule database.
| Threat | Mitigation |
|---|---|
| Prompt injection via comments | Tree-sitter comment stripping removes all comment nodes before LLM receives code. Replaces bytes with spaces to preserve line offsets. Falls back to original code if stripping fails. |
| Prompt injection via code | System prompts include safety instructions to ignore embedded directives. An inherent LLM limitation — string literals and variable names can contain instructions that survive comment stripping. |
| Tool process crashes | ToolRunner uses asyncio.gather(return_exceptions=True) — each tool is isolated. A crash in semgrep doesn't affect tree-sitter or lizard. |
| LLM reviewer failures | Each reviewer runs independently. Timeouts and exceptions are caught per-reviewer; surviving reviewers still produce findings. |
| Report aggregation failure | Falls back to ungrouped findings (1:1 mapping). The pipeline always delivers results. |
| LLM output validation | extract_findings() validates line numbers against actual code length. Hallucinated references are nullified. Malformed JSON is caught and dropped per-finding. |
| Input validation | Pydantic enforces code length (1–50,000 chars), language enum, and field constraints. Invalid requests return 422. |
| Rate limiting | slowapi enforces 10 requests/minute per IP on /api/review. |
| CORS | Restricted to configured origins (default: http://localhost:5173). Configurable via CORS_ORIGINS env var. Only Content-Type header allowed. |
| XSS via LLM output | Frontend renders all finding text as plain text (React escapes by default). No dangerouslySetInnerHTML. |
| Security headers | nginx sets Content-Security-Policy, X-Content-Type-Options: nosniff, X-Frame-Options: DENY, and Referrer-Policy. |
| Error information disclosure | Error events sent to clients use generic messages. Full details are logged server-side only. |
| Subprocess injection | All tools use create_subprocess_exec (not shell=True). Temp files use tempfile.NamedTemporaryFile with random names. |
| Secret management | API keys loaded from .env (gitignored). Only .env.example with placeholders is tracked. |
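The line-reference validation from the table can be sketched as follows (a hypothetical helper; the real check lives inside `extract_findings()`):

```python
def validate_line_range(finding: dict, total_lines: int) -> dict:
    """Nullify hallucinated line references that fall outside the actual code."""
    rng = finding.get("line_range")
    if rng is not None:
        start, end = rng
        if not (1 <= start <= end <= total_lines):
            # Keep the finding, drop only the bogus location.
            finding = {**finding, "line_range": None}
    return finding

code = "def foo(x):\n    return eval(x)"
total = len(code.splitlines())  # 2 lines
ok = validate_line_range({"title": "eval() use", "line_range": (2, 2)}, total)
bad = validate_line_range({"title": "ghost", "line_range": (5, 9)}, total)
assert ok["line_range"] == (2, 2)
assert bad["line_range"] is None
```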
These items are not needed for local development but should be addressed before production deployment:
- HTTPS termination — Add TLS via a reverse proxy (Cloudflare, ALB, or nginx with certs). The current setup uses HTTP only.
- Trusted proxy configuration — Rate limiting uses client IP from request headers. Behind a load balancer, configure `slowapi` to trust only your proxy's IP range to prevent `X-Forwarded-For` spoofing.
- Concurrent request limiting — Rate limiting caps requests/minute but not concurrent requests. Add a semaphore or API gateway throttle (e.g., max 5 concurrent reviews) to prevent LLM slot exhaustion.
- CORS origins — Set `CORS_ORIGINS` to your production domain. The default `http://localhost:5173` is for development only.
- HSTS header — Add `Strict-Transport-Security` to nginx once HTTPS is enabled.
- Container user — Dockerfiles run as root. Add a non-root `USER` directive for production.
- Secret rotation — Rotate the Bedrock API key periodically. Consider using IAM roles or AWS Secrets Manager instead of bearer tokens.
- Request body size limit — Add `client_max_body_size 100k;` to nginx to reject oversized payloads at the proxy layer.
- Monitoring / audit logging — Add structured logging of review requests (timestamp, IP, language, code hash) for abuse detection. No code content should be logged.
```sh
# All backend tests (activate venv first)
cd backend && source .venv/bin/activate
python -m pytest -v --timeout=60

# Coverage
python -m pytest --cov=app --cov-report=term-missing

# Frontend lint + build check
cd frontend && npm run build
```

An LLM-as-a-judge eval framework validates that the pipeline reliably detects bugs. 58 YAML test cases (172 runnable snippets across Python, TypeScript, and Java) cover security, performance, logic, style, and syntax categories. A stronger model (Sonnet) judges whether Haiku's review output correctly identifies each planted bug, scoring on detection, actionability, category accuracy, severity, and conciseness.
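Mock mode's keyword heuristic can be approximated as follows (illustrative only; the actual heuristics live under `backend/eval/`):

```python
def mock_judge(review_text: str, expected_keywords: list[str]) -> bool:
    """Cheap CI stand-in for the Sonnet judge: the planted bug counts as
    detected if any expected keyword appears in the review output."""
    text = review_text.lower()
    return any(kw.lower() in text for kw in expected_keywords)

review = "Critical: user input flows into eval(), enabling arbitrary code execution."
assert mock_judge(review, ["eval", "code injection"])
assert not mock_judge("Style: variable name too short.", ["sql injection"])
```

This keeps CI fast and free of LLM costs, at the price of cruder scoring than the live Sonnet judge.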
| Metric | Score |
|---|---|
| Detection rate | 98% (169/172) |
| Category accuracy | 91% |
| Actionability | 96% |
| Severity accuracy | 70% |
By category: Security 100%, Logic 98%, Style 97%, Performance 97%, Syntax 100%
By language: Java 100%, TypeScript 98%, Python 96%
3 remaining misses: a boundary error where the reviewer identified the right fix but wrong root cause, a memory leak via unremoved event listeners (subtle lifetime issue), and a long function where the reviewer flagged low-level style issues but missed the SRP violation.
Full report: `backend/eval/results/`
```sh
# Framework self-tests (no LLM calls)
cd backend && pytest eval/tests/ -v

# Mock mode — keyword heuristics, fast CI validation
cd backend && python -m eval --mode mock

# Full live run — Sonnet judges Haiku's output
cd backend && python -m eval --mode live

# Against a running Docker container
docker compose up -d backend
cd backend && python -m eval --mode live --target http://localhost:8000
```

See `backend/eval/README.md` for test case format, configuration, and report details.
Three sample inputs with expected outputs are in backend/test_cases/:
| File | Language | Key Expected Findings |
|---|---|---|
| `python_sql_injection.py` | Python | Security: SQL injection; Style/Perf findings from ruff or AI |
| `typescript_xss.ts` | TypeScript | Security: innerHTML XSS; possible unused variable |
| `java_resource_leak.java` | Java | Performance/Logic: unclosed stream, broad exception |
Each `.expected.json` file specifies minimum expected finding categories.
- Expand eval dataset: Current 58 bug types / 172 cases provide good coverage but miss edge cases. Target 150+ bug types with difficulty tiers, multi-bug snippets, and false-positive cases (clean code that shouldn't trigger findings).
- Stronger reviewer models: Upgrade from Haiku to Sonnet for reviewers, or use a mix (Sonnet for security, Haiku for style). The 3 remaining detection failures are reasoning gaps, not extraction failures — a more capable model would likely catch them.
- Extended thinking budget: Add a thinking/reasoning budget to reviewer prompts so the model can reason through subtle bugs (e.g. lifetime issues, SRP violations) before committing to findings. Particularly impactful for the quality reviewer.
- More language-specific static analyzers: Add ESLint for TypeScript, PMD/SpotBugs for Java, and mypy for Python. Semgrep's open-source Java rules have notable gaps (no `Runtime.exec()` command injection, no resource leak detection) — PMD/SpotBugs would fill these. ESLint would add TS-specific linting that ruff provides for Python.
- Skill-based reviewer agent: A reviewer agent that dynamically loads library-specific guidance and patterns based on the imports/dependencies detected in the submitted code. For example, if the code imports `react`, load React hook rules and component patterns; if it uses `sqlalchemy`, load session management best practices. This turns the generic reviewer into a context-aware one without hardcoding library knowledge into prompts.
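Import-driven skill loading could be sketched like this for Python submissions (the `SKILLS` mapping and its guidance strings are hypothetical placeholders):

```python
import ast

# Hypothetical mapping from imported package to a guidance snippet that
# would be appended to the reviewer prompt.
SKILLS = {
    "react": "React: check hook rules, dependency arrays, key props.",
    "sqlalchemy": "SQLAlchemy: check session lifecycle and connection handling.",
}

def detect_skills(python_code: str) -> list[str]:
    """Return guidance snippets for top-level packages imported by the code."""
    tree = ast.parse(python_code)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported.add(node.module.split(".")[0])
    return [SKILLS[m] for m in sorted(imported) if m in SKILLS]

code = "import sqlalchemy\nfrom sqlalchemy import orm\nsession = None\n"
print(detect_skills(code))  # one SQLAlchemy guidance entry
```

TypeScript and Java would need their own import detection (tree-sitter already parses all three languages in this pipeline), but the lookup step is identical.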
```
code_review_assistant/
├── backend/
│   ├── app/
│   │   ├── main.py             # FastAPI app, CORS, rate limiting, endpoints
│   │   ├── config.py           # AppConfig (pydantic-settings, Bedrock credentials)
│   │   ├── orchestrator.py     # Pipeline: tools → LLM → post-processing → report → SSE
│   │   ├── post_processing.py  # Deterministic dedup, rank, summary
│   │   ├── format.py           # number_lines, format_findings, build_user_message
│   │   ├── models/             # Pydantic models (Finding, SSE events, request/response)
│   │   ├── tools/              # 4 tool wrappers (tree-sitter, semgrep, ruff, lizard) + ToolRunner
│   │   ├── agents/             # AI service (litellm), extraction, prompts, report aggregator
│   │   └── guards/             # Comment stripping (tree-sitter based)
│   ├── tests/                  # pytest suite (unit + integration)
│   ├── eval/                   # LLM-as-a-judge evaluation framework
│   │   ├── cases/              # 58 YAML test cases (172 snippets)
│   │   ├── tests/              # Framework self-tests
│   │   └── README.md           # Full eval documentation
│   ├── test_cases/             # 3 sample inputs with expected outputs
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── App.tsx             # Main layout (split pane, tabs)
│   │   ├── components/         # 11 React components (editor, findings, report, export)
│   │   ├── hooks/              # useStreamingReview SSE hook
│   │   └── types/              # TypeScript interfaces
│   ├── nginx.conf              # SSE proxy config
│   └── Dockerfile
├── docker-compose.yml
├── DESIGN.md                   # Full architecture design document
└── README.md
```